Prior Labs

Senior ML Infrastructure Engineer

Reposted 19 Days Ago
In-Office
New York, NY, USA
Senior level
The Senior ML Infrastructure Engineer will own and optimize GPU infrastructure for training tabular foundation models, ensuring high efficiency and collaboration with research teams.
The summary above was generated by AI
Who we are

Foundation models have transformed text and images, but structured data - the largest and most consequential data modality in the world - has remained untouched. Tables power every clinical trial, every financial model, every scientific experiment, every business decision. No one has built a foundation model that truly understands them.

Until now. What LLMs did for language, we're doing for tables. The next modality shift in AI is happening - and we're hiring the team that makes it.

Momentum: We pioneered tabular foundation models and are now the world-leading organization in structured data ML. Our TabPFN v2 model was published in Nature and set a new state of the art for tabular machine learning. Since its release, we've scaled model capabilities more than 20x, reached 3M+ downloads and 6,000+ GitHub stars, and are seeing accelerating adoption across research and industry - from detecting lung disease with Oxford Cancer Analytics to preventing train failures with Hitachi to improving clinical trial decisions with BostonGene.

The hardest work is in front of us. We're scaling tabular foundation models to handle millions of rows, thousands of features, real-time inference, and entirely new data modalities - while building the infrastructure to deploy them in production across some of the most demanding industries on earth. These are open problems no one else is working on at this level.

Our team: We’re a small, highly selective team of 20+ engineers, researchers and GTM specialists, selected from over 5,000 applicants, with backgrounds spanning Google, Apple, Amazon, Microsoft, G-Research, Jane Street, Goldman Sachs, and CERN, led by Frank Hutter, Noah Hollmann and Sauraj Gambhir and advised by world-leading AI researchers such as Bernhard Schölkopf and Turing Award winner Yann LeCun. We ship fast, create top-tier research, and hold each other to an extremely high bar.

What’s Next: In 2025, we raised a €9m pre-seed round led by Balderton Capital, backed by leaders from Hugging Face, DeepMind, and Black Forest Labs. The next phase of growth is here, which makes this an ideal time to join.

About the Role

We spend tens of millions per year on GPU compute to train tabular foundation models. That's not a target, it's what we're running today, and it's growing. The person who owns this infrastructure makes decisions worth millions of dollars: cluster architecture, scheduling efficiency, provider strategy, hardware selection. A wrong call costs six figures.

Today we run Slurm on GCP across multiple clusters. We're scaling to multi-cluster, multi-provider infrastructure and evaluating new hardware generations as they come online. You own the full stack, from cluster operations and cost optimization to distributed training performance and the tooling layer that keeps researchers moving fast. You work directly with the research team and understand what they're doing well enough to make infrastructure decisions that actually help them. And this isn't a pure support role. We operate an open environment. If you've got the next SOTA tabular architecture up your sleeve, go ahead and train it.

What you'll work on:

  • Own and evolve multi-cluster GPU infrastructure. Slurm on GCP today, multi-provider and new hardware tomorrow. Architecture, scheduling, reliability, cost optimization

  • Drive GPU utilization and training throughput: profiling, memory optimization, communication bottlenecks, systems-level debugging of distributed training across large runs

  • Architect the next generation of our infrastructure: multi-cluster orchestration, new GPU generations, provider diversification, capacity planning against growing compute demands

  • Build the developer productivity layer: CI pipelines, experiment tracking, model registry, data processing, and internal tooling that keeps research iteration speed high

  • Own the compute budget. You understand cost per FLOP across providers and hardware, and you hate wasted compute

Tech stack: Slurm, GCP, Docker, wandb, GitHub Actions, uv, PyTorch, Triton

You may be a good fit if you have:

  • 5+ years building and operating production GPU infrastructure or distributed training systems at scale, whether at a major AI lab, a well-funded ML startup, or an HPC environment

  • Deep hands-on experience with Slurm and cluster management. You've debugged scheduling failures, optimized utilization across multi-tenant GPU workloads, and operated infrastructure where downtime has real cost

  • Expert-level systems thinking: memory bandwidth, GPU profiling. You reason about hardware, not configs

  • Strong Python and genuine fluency with PyTorch internals: enough to profile a training run and tell whether the bottleneck is data loading, communication, or compute

  • Track record of making infrastructure decisions that measurably improved training throughput or cost efficiency

  • Strong AI tooling skills. You use Claude Code, Cursor, or similar fluently to move fast without sacrificing quality

Bonus:

  • Experience operating at tens-of-millions-scale GPU spend

  • Multi-cloud or hybrid HPC/cloud infrastructure experience

  • Triton, CUDA, or custom kernel experience

  • Experience scaling from single cluster to multi-cluster orchestration

  • Background building experiment tracking, model registry, or ML pipeline tooling

Life at Prior Labs

We're a small, ambitious team solving one of the hardest problems in AI, and we're just getting started. You'll work closely with world-class researchers and builders who care deeply about the quality of their craft, the impact of their work, and the people they work with.

We move fast, we think rigorously, and we take the time to do things right. If you're excited by hard problems, motivated by real-world impact, and want to be part of building something that matters, we'd love to hear from you.

Our Commitments

We believe the best products and teams come from a wide range of perspectives, experiences, and backgrounds. That's why we welcome applications from people of all identities and walks of life, especially anyone who's ever felt discouraged by "not checking every box."

We're committed to creating a safe, inclusive environment and providing equal opportunities regardless of gender, sexual orientation, origin, disability, or any other trait that makes you who you are.

We care about how your data is handled. Read our Recruiting Privacy Notice to see exactly what we collect, why, and how long we keep it.
