SmarterDx

Staff Site Reliability Engineer

Posted 8 Days Ago
Remote
Hiring Remotely in United States
230K-250K Annually
Expert/Leader
The Staff Site Reliability Engineer will lead the reliability of production systems by defining SRE practices, improving observability, and ensuring fault-tolerance in cloud environments.
The summary above was generated by AI

SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, our platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial. Become a Smartian and help optimize the way the healthcare system works for everyone. Learn more at smarterdx.com/careers.

Role

We are seeking a Staff Site Reliability Engineer (SRE) to lead the reliability, scalability, and operational excellence of our production systems. This role is responsible for defining and driving SRE practices across the organization, including SLIs/SLOs, incident management, capacity planning, and resilience engineering. You will design and implement automation that reduces toil, improve observability and performance across our Kubernetes and AWS environments, and ensure our systems are highly available and fault-tolerant.

The ideal candidate is a deeply technical engineer with strong distributed systems expertise, a passion for operational rigor, and a track record of improving reliability through thoughtful engineering, automation, and data-driven decision-making.

**This role is fully remote within the US**

What You’ll Do

  • Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact.
  • Implement a “reliability” platform using Terraform and infrastructure-as-code best practices.
  • Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence.
  • Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
  • Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability.
  • Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across the engineering organization.

What You Bring

  • 10+ years of software and software reliability engineering experience, with significant time spent operating and scaling distributed systems in production environments.
  • 3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems.
  • Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements.
  • Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly.
  • Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes.
  • Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience operating large-scale systems.

Nice To Haves

  • Reliability engineering experience with production database systems (e.g. Postgres)

Our Tech Stack

  • AWS
  • Terraform
  • Kubernetes
  • Go, Python, TypeScript
  • Postgres

Compensation

$230K to $250K base salary

Benefits
  • Medical, Dental & Vision – Comprehensive plans with leading insurance providers, covering 75% of your premiums, depending on the plan.
  • Paid Parental Leave – Generous paid leave to support families through birth or adoption: up to 12 weeks for parents.
  • Remote-First Team – Work from anywhere in the U.S.
  • Unlimited PTO & 10 Holidays – So you can relax and recharge.
  • 401(k) with Traditional & Roth Options – Tax-advantaged retirement savings through Fidelity with a 4% match.
  • Minimal Bureaucracy – A fast-moving, high-impact environment where you can focus on what matters.
  • Incredible Teammates! – Work alongside smart, supportive, and mission-driven colleagues.

Top Skills

AWS
Go
Kubernetes
Postgres
Python
Terraform
TypeScript

SmarterDx New York, New York, USA Office

New York, New York, United States, 10003
