Baseten Jobs

SRE

Reposted 10 Days Ago

Remote or Hybrid

Hiring Remotely in New York, NY, USA

165K-330K Annually

Mid level

As a Site Reliability Engineer, you'll own the reliability of Baseten's multi-cloud Kubernetes infrastructure, build observability tooling as code, and turn recurring failure patterns into automated mitigations.

ABOUT BASETEN

Baseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma and Writer. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform that engineers rely on to ship AI products.

THE ROLE

As a Site Reliability Engineer at Baseten, you'll define and codify the gold standards of day 2 operations for our ML infrastructure platform. You'll envision and build robust systems, processes, automations, and observability tooling that keep our platform reliable at scale — and that empower the broader organization to operate confidently.

You'll work closely with engineering, forward-deployed and product teams: learning from recurring failure patterns, turning tribal knowledge into automated mitigations, and raising the operational floor for the entire company.

EXAMPLE INITIATIVES

You'll work on projects like these as part of the SRE team:

Improving Baseten's SRE practices by instrumenting SLOs and SLIs and by improving alerting and observability for all services.
Building AI-assisted tooling for incident triage and response.

RESPONSIBILITIES

Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations.
Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Define and instrument SLOs and SLIs across customer workloads and internal services.
Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define.

REQUIREMENTS

Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
Experience in building and maintaining scalable infrastructure.
Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
Comfort working at the intersection of engineering and operations — you write code, but you also think deeply about process, escalation paths, and operational leverage.
Familiarity with incident management platforms (incident.io or similar) is a plus.
No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well.

BENEFITS

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Fertility and family-building stipend through Carrot
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.

We are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (for example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).
