
Featherless AI

Machine Learning Engineer — Inference Optimization

Reposted 22 Hours Ago
In-Office or Remote
Hiring Remotely in World Golf Village, FL
Mid level
Optimize inference latency and throughput for large-scale ML models, collaborate on performance tuning, and build inference-serving systems.
About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do
  • Optimize inference latency, throughput, and cost for large-scale ML models in production

  • Profile GPU/CPU inference pipelines and eliminate bottlenecks (memory, kernels, batching, I/O)

  • Implement and tune techniques such as:

    • Quantization (fp16, bf16, int8, fp8)

    • KV-cache optimization & reuse

    • Speculative decoding, batching, and streaming

    • Model pruning or architectural simplifications for inference

  • Collaborate with research engineers to productionize new model architectures

  • Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)

  • Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups

  • Improve system reliability, observability, and cost efficiency under real workloads
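Several of the techniques above reduce to small, testable transformations. As an illustrative sketch (a toy model, not any Featherless-specific architecture), here is low-precision inference in PyTorch: weights and activations are cast to bf16 and the result is checked against the fp32 baseline.

```python
import torch

# Toy stand-in for a production model (hypothetical architecture).
model = torch.nn.Sequential(
    torch.nn.Linear(256, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 256),
).eval()

x = torch.randn(8, 256)  # one batch of 8 requests

# fp32 baseline.
with torch.inference_mode():
    ref = model(x)

# Cast weights to bf16 (fp16 is the usual choice on GPUs with fast
# half-precision tensor cores; bf16 keeps fp32's exponent range and
# also runs on CPU). This halves weight memory at low accuracy cost.
model_bf16 = model.to(torch.bfloat16)
with torch.inference_mode():
    out = model_bf16(x.to(torch.bfloat16))

max_err = (ref - out.float()).abs().max().item()
print(f"max abs error, fp32 vs bf16: {max_err:.4f}")
```

Unlike this simple cast, int8/fp8 quantization additionally requires per-tensor or per-channel scales, usually derived from calibration data.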

What We’re Looking For
  • Strong experience in ML inference optimization or high-performance ML systems

  • Solid understanding of deep learning internals (attention, memory layout, compute graphs)

  • Hands-on experience with PyTorch (or similar) and model deployment

  • Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)

  • Experience scaling inference for real users (not just research benchmarks)

  • Comfortable working in fast-moving startup environments with ownership and ambiguity
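For context on the KV-cache item above: during autoregressive decoding, keys and values for past tokens are cached and reused, so each step computes attention inputs only for the new token. A minimal single-head sketch in PyTorch, with toy dimensions chosen for illustration:

```python
import torch
import torch.nn.functional as F

def attend(q, k_cache, v_cache):
    """Single-head scaled dot-product attention over the cached keys/values."""
    scores = q @ k_cache.transpose(-1, -2) / k_cache.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_cache

d = 64
torch.manual_seed(0)

# Incremental decoding: at each step we compute k/v for the *new* token
# only and append to the cache, instead of recomputing keys and values
# for the whole prefix every step.
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
outputs = []
for step in range(5):
    q = torch.randn(1, d)        # query for the current token
    k_new = torch.randn(1, d)    # key/value for the current token only
    v_new = torch.randn(1, d)
    k_cache = torch.cat([k_cache, k_new], dim=0)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    outputs.append(attend(q, k_cache, v_cache))

out = torch.cat(outputs, dim=0)
print(out.shape)  # torch.Size([5, 64])
```

Production caches add batch and head dimensions, preallocate to a maximum sequence length, and (as in paged-attention designs) share prefix blocks across requests.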

Nice to Have
  • Experience with LLM or long-context model inference

  • Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)

  • Experience optimizing across different hardware vendors

  • Open-source contributions in ML systems or inference tooling

  • Background in distributed systems or low-latency services
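The speculative decoding mentioned under responsibilities can be sketched in a few lines. This is the simplified greedy variant (the full method of Leviathan et al. accepts or rejects proposals probabilistically); the "models" here are hypothetical stand-ins that return deterministic pseudo-random logits.

```python
import torch

V = 16  # toy vocabulary size

def _logits(prefix, salt):
    # Deterministic pseudo-random logits, so a "model" always gives the
    # same answer for the same prefix (stand-in for a real network).
    g = torch.Generator().manual_seed((hash(tuple(prefix)) ^ salt) % (2**31))
    return torch.randn(V, generator=g)

def draft_logits(prefix):
    return _logits(prefix, salt=1)   # cheap draft model (hypothetical)

def target_logits(prefix):
    return _logits(prefix, salt=2)   # expensive target model (hypothetical)

def greedy_speculative_step(prefix, k=4):
    """One round of greedy speculative decoding.

    The draft proposes k tokens autoregressively; the target verifies them
    (in a real system, all k positions in one batched forward pass) and we
    keep the longest matching prefix plus one corrected token.
    """
    proposed, p = [], list(prefix)
    for _ in range(k):
        t = int(draft_logits(p).argmax())
        proposed.append(t)
        p.append(t)

    accepted, p = [], list(prefix)
    for t in proposed:
        best = int(target_logits(p).argmax())
        if best != t:
            accepted.append(best)  # target disagrees: take its token, stop
            break
        accepted.append(t)         # match: this token cost no extra target step
        p.append(t)
    return accepted

tokens = greedy_speculative_step(prefix=[3, 1, 4], k=4)
print(len(tokens))  # between 1 and 4 tokens emitted this round
```

The win comes from the verify step: one target forward pass can confirm up to k draft tokens, so latency per emitted token drops whenever the draft's acceptance rate is high.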

Why Join Us
  • Real ownership over performance-critical systems

  • Direct impact on product reliability and unit economics

  • Close collaboration with research, infra, and product

  • Competitive compensation + meaningful equity at Series A

  • A team that cares about engineering quality, not hype

Top Skills

CUDA
ML Inference Optimization
ONNX Runtime
PyTorch
TensorRT
Triton

