We are building the UK’s next-generation AI platform, powered by renewable energy, rooted in sovereign capability, and designed to give enterprises and innovators the compute they need.
AI Platform Operations Manager
We are seeking a Support Engineer / Cluster Administrator to provide Level 1 (L1) and Level 2 (L2) support for the AI platform. This role is customer facing and involves technical troubleshooting and close collaboration with vendor engineering teams to ensure seamless AI platform operations.
Key Responsibilities
- Provide L1 support for customer-reported issues and requests
- Provide L2 support by diagnosing, replicating, and troubleshooting issues across the platform and infrastructure
- Escalate complex (L3) issues to vendor product/engineering teams, coordinate their resolution, and manage vendor responses
- Monitor system health, alerts, and customer usage patterns
- Document solutions and workarounds, create and maintain knowledge base articles, and document support procedures
- Automate common tasks and fixes
- Configure and integrate tooling to support optimal operation of the platform, and contribute to tool selection
- Assist customers with platform configuration, onboarding, and usage best practices
- Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues
- Ensure SLAs and customer satisfaction targets are met
- Work with customers and other stakeholders to understand requirements and challenges, and provide reporting on usage, workflows, and billing
Technical Responsibilities
- Cluster infrastructure management: Manage the Nvidia GPU cluster
- High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime
- Resource allocation and optimisation: GPU resource partitioning, workload scheduling, and capacity planning
- Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with the available Nvidia and HPE tools
- Incident response: Manage node failures, network issues, and driver issues; troubleshoot common problems and work with vendor support to resolve critical issues
- Security and access control: Manage user permissions, RBAC, security hardening, data protection
Required Skills & Experience
- 10 years of experience (or equivalent) in technical support, systems engineering, or platform operations
- Strong understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting)
- Familiarity with cloud-based platforms, APIs, and distributed systems
- Understanding of AI/ML concepts and tooling (model training, inference, data pipeline basics)
- Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk)
- Excellent communication skills to interface with both customers and internal/vendor teams
- Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimise their experience
Core Technical Skills
- System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning
- Proficiency with Ansible, Nvidia/CUDA toolkits, and Kubernetes container orchestration
- Understanding of automation, monitoring, and security for GPU-as-a-service offerings
Preferred Experience
- Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms
- Experience with GPU resource allocation (across instances, GPU count, and time)
- Advanced networking skills: high-performance networking, troubleshooting, and fine-tuning
- Background in DevOps or SRE practices
- ITIL familiarity
Success Metrics
- Customers receive timely, effective support with minimal escalations
- Issues are resolved or routed correctly with high-quality documentation
- The platform maintains strong uptime and customer satisfaction