Required Qualifications
- 5+ years of hands-on experience in machine learning systems, performance engineering, or related software engineering roles focused on model optimization (or 3+ years with a relevant advanced degree, such as an MS or PhD in Computer Science, Electrical Engineering, or a related field).
- Significant hands-on experience optimizing deep learning models for latency, throughput, and cost.
- Proven ability to profile and debug performance bottlenecks across the stack (model, framework, runtime, and system-level).
- Experience with distributed or large-scale training and inference, including data/model parallelism, pipeline parallelism, sharding, and gradient accumulation.
- Practical CUDA development experience and familiarity with GPU programming concepts, tensor cores, memory management, and asynchronous execution.
- Deep understanding of at least one major deep learning framework (ideally PyTorch) and experience with model export/runtime formats (TorchScript, ONNX, SavedModel).
- Familiarity with optimization techniques such as mixed precision (FP16/BF16/AMP), quantization (PTQ/QAT, INT8, 4-bit/8-bit weight-only), distillation, structured and unstructured pruning, activation checkpointing, operator fusion, and caching/batching strategies (brief sketches of mixed precision and post-training quantization follow this list).
- Experience productionizing large models (e.g., transformers), including an understanding of attention mechanisms, attention-kernel optimization, and memory/compute trade-offs.
- Experience building and operating ML systems on cloud platforms (AWS, Azure, or GCP) and containerized deployments.
- Comfort working with experiment tracking, monitoring, and evaluation pipelines.
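As a purely illustrative example of the mixed-precision experience described above, here is a minimal PyTorch training step using automatic mixed precision. The model, data, and hyperparameters are hypothetical placeholders, and a CUDA-capable GPU is assumed.

```python
# Minimal mixed-precision training step in PyTorch (illustrative only).
# The model, optimizer, and data below are toy placeholders.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 gradient underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # run eligible ops in reduced precision
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()                # adapts the loss scale over time
```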
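Similarly, a minimal sketch of post-training dynamic quantization, assuming PyTorch's torch.ao.quantization API. The model is a toy stand-in; a real PTQ workflow would also measure the accuracy impact.

```python
# Post-training dynamic quantization in PyTorch (illustrative sketch).
# Linear layers get INT8 weights with dynamically quantized activations.
import torch
from torch.ao.quantization import quantize_dynamic

fp32_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).eval()

int8_model = quantize_dynamic(fp32_model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 1024)
with torch.no_grad():
    out = int8_model(x)  # same interface, smaller weights, faster CPU matmuls
print(out.shape)
```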
Preferred Qualifications
- Experience developing custom CUDA kernels, integrating low-level GPU optimizations, or contributing to performance-focused libraries and runtimes (e.g., cuBLAS/cuDNN, cuFFT, NCCL, XLA, TVM, ONNX Runtime).
- Prior experience optimizing inference serving systems and cost/latency trade-offs at scale, including use of Triton Inference Server, TensorRT, FasterTransformer, or DeepSpeed inference optimizations.
- Familiarity with container orchestration (Kubernetes), serving frameworks, deployment tooling, and continuous delivery for ML models.
- Experience with performance benchmarking and load testing (including MLPerf or custom benchmarks), and building internal tooling/automation for continuous performance validation.
- Background in compiler optimizations, kernel fusion, MLIR/XLA, or other systems-level optimizations.
- Strong communication skills and a demonstrated ability to translate product requirements into measurable performance goals and SLIs: p50/p95/p99 latency, throughput in tokens/sec, GPU utilization, memory footprint, and cost per token (see the percentile-measurement sketch after this list).
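To illustrate the latency SLIs mentioned above, here is a small, self-contained sketch of p50/p95/p99 measurement. run_inference is a hypothetical stand-in for a real model call; production numbers would come from proper load-testing tools rather than a single-process loop.

```python
# Hypothetical microbenchmark: measure p50/p95/p99 latency of a callable.
import time

def run_inference():
    time.sleep(0.002)  # placeholder for a real forward pass

def percentile(samples, q):
    # Nearest-rank percentile over a sorted copy of the samples.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, round(q / 100 * (len(ordered) - 1)))
    return ordered[idx]

latencies_ms = []
for _ in range(200):
    t0 = time.perf_counter()
    run_inference()
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"p50={percentile(latencies_ms, 50):.2f} ms  "
      f"p95={percentile(latencies_ms, 95):.2f} ms  "
      f"p99={percentile(latencies_ms, 99):.2f} ms")
```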
Education Guidance
- Bachelor's degree in Computer Science, Electrical Engineering, or a related technical field is typically expected for this role.
- A master's degree or PhD is preferred for candidates with fewer years of industry experience or for roles with heavy research/architecture responsibilities.
Job Description
The role focuses on leading performance optimization across our AI/ML foundation model stack, designing GPU-accelerated components, and delivering measurable reductions in latency and cost while maintaining throughput and reliability.
Key Responsibilities
- Own performance, scalability, and reliability for the foundation model during both training and inference, defining success metrics and tracking improvements.
- Profile and optimize the end-to-end ML stack, including data pipelines, training loops, inference serving, and deployment workflows (see the profiling sketch after this list).
- Design, implement, and integrate GPU-accelerated components; develop custom CUDA kernels when existing libraries are insufficient.
- Reduce latency and cost per inference token while maximizing throughput and hardware utilization through software and system-level optimizations.
- Translate product requirements into clear, actionable optimization goals and technical roadmaps in close collaboration with the founders and cross-functional teams.
- Build and maintain internal tooling, benchmarks, and evaluation harnesses to enable reliable experimentation, debugging, and safe rollouts.
- Contribute to model architecture and system design decisions where they impact performance, robustness, and operational efficiency.
- Advocate best practices for performance-aware development, monitoring, and continuous improvement across the engineering team.
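As a sketch of the operator-level profiling this role involves, the following uses torch.profiler on a toy model; in practice you would wrap a representative training or inference step of the real stack instead.

```python
# Operator-level profiling with torch.profiler (illustrative sketch).
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():  # also capture GPU kernels when a GPU is present
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(20):
            model(x)

# Surface the most expensive operators to guide optimization work.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```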
Technical Keywords (for Discoverability)
- Frameworks & Runtimes: PyTorch, TensorFlow, TorchScript, ONNX, ONNX Runtime, Triton Inference Server, TensorRT, TVM, XLA, MLIR
- Libraries & Optimizations: cuBLAS, cuDNN, NCCL, FasterTransformer, DeepSpeed, Hugging Face Transformers, PEFT, OpenVINO
- Languages & Platforms: Python, C++, CUDA, Rust; Linux, Docker, Kubernetes
- Distributed & Parallelism: Data parallelism, model parallelism, pipeline parallelism, tensor parallelism, ZeRO, sharding, Horovod
- Quantization & Precision: FP32/FP16/BF16, mixed precision, INT8, PTQ/QAT, weight-only quantization, pruning, sparsity
- Profiling & Debugging: NVIDIA Nsight Systems, Nsight Compute, CUPTI, nvprof, perf, flamegraphs, eBPF, gdb
- Benchmarking & Testing: MLPerf, custom microbenchmarks, load testing, latency p95/p99 measurements, throughput (tokens/sec)
- Monitoring & Observability: Prometheus, Grafana, OpenTelemetry, Jaeger, tracing, logging, SLIs/SLOs
- Cloud & MLOps: AWS (SageMaker, EC2/GPU instances), GCP (Vertex AI), Azure ML, Terraform, CI/CD (GitHub Actions, Jenkins)
- Serving & Deployment: Kubernetes, Docker, Helm, CI/CD for models, inference autoscaling, canary/blue-green rollouts
- Data & Pipelines: Kafka, Spark, Dask, data preprocessing, streaming/batching strategies
- Model Formats & Tooling: TorchScript, ONNX, SavedModel, Hugging Face Transformers, model sharding tools
By applying, you consent, under the GDPR, to your CV and personal details being processed for recruitment purposes.
Our Commitment to Inclusive Tone and Diversity
We strive to create a respectful, collaborative, and growth-oriented environment where people from all backgrounds can do their best work. We value diverse perspectives and believe they make our products and teams stronger.
We are an equal opportunity employer and welcome applicants regardless of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, disability, veteran status, or any other characteristic protected by law. We encourage candidates from underrepresented groups in technology to apply.
If you’re excited about this role but your experience doesn’t match every listed qualification, we still encourage you to apply; you may be a great fit regardless. Reasonable accommodations are available during the recruiting process; please let us know if you need any to participate.
Skills: ML, deep learning, PyTorch, Python