Cohere AI

Senior ML Systems Engineer, Frameworks & Tooling

Reposted 21 Hours Ago
In-Office
6 Locations
Senior level
The Senior ML Systems Engineer will build and maintain the training framework for large-scale language models, focusing on distributed training and performance optimization.
The summary above was generated by AI

Who are we?

Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future!

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You’ll Work On
  • Build and own the training framework responsible for large-scale LLM training.

  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).

  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).

  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.

  • Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.

  • Investigate and resolve performance bottlenecks across the ML systems stack.

  • Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have
  • Strong engineering experience in large-scale distributed training or HPC systems.

  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.

  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).

  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.

  • Experience working with containerized environments (Docker, Singularity/Apptainer).

  • A track record of building tools that increase developer velocity for ML teams.

  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.

  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.

Nice to Have
  • Experience with training LLMs or other large transformer architectures.

  • Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).

  • Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).

  • Experience with data pipeline optimization, sharded datasets, or caching strategies.

  • Background in performance engineering, profiling, or low-level systems.

Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us
  • You’ll work on some of the most challenging and consequential ML systems problems today.

  • You’ll collaborate with a world-class team working fast and at scale.

  • You’ll have end-to-end ownership over critical components of the training stack.

  • You’ll shape the next generation of infrastructure for frontier-scale models.

  • You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects:

  • Build a high-performance data loading and caching pipeline.

  • Implement performance profiling across the ML systems stack.

  • Develop internal metrics and monitoring for training runs.

  • Build reproducibility and regression testing infrastructure.

  • Develop a performant, fault-tolerant distributed checkpointing system.

If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply!

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these Perks:

🤝 An open and inclusive culture and work environment 

🧑‍💻 Work closely with a team on the cutting edge of AI research 

🍽 Weekly lunch stipend, in-office lunches & snacks

🦷 Full health and dental benefits, including a separate budget to take care of your mental health 

🐣 100% Parental Leave top-up for up to 6 months

🎨 Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement

🏙 Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend

✈️ 6 weeks of vacation (30 working days!)

Top Skills

CUDA
DeepSpeed
Docker
JAX
Kubernetes
Megatron
NCCL
PyTorch
Ray
Singularity
Slurm
TensorRT-LLM
vLLM
xFormers

Cohere AI New York, New York, USA Office

New York, New York, United States
