Fabric Health

Site Reliability Engineer

Posted 5 Days Ago
In-Office or Remote
Hiring Remotely in New York City, NY, USA
135K-160K Annually
Senior level
Own and evolve Fabric's AWS/EKS infrastructure, build Terraform-managed infrastructure, improve observability with Datadog, lead incident response and SLOs, automate operations with AI/agentic workflows, optimize AWS resources, and ensure HIPAA-compliant, high-availability platform architecture while mentoring engineers.
The summary above was generated by AI
About Fabric Health
At Fabric Health, we are powering boundless care by solving healthcare’s biggest challenge: clinical capacity. We aren’t here to disrupt healthcare; we’re here to fix it. We unify the care journey from intake to treatment, using intelligent automation to remove administrative burdens and make care delivery 2-10x more efficient. Our technology empowers clinicians to move faster and focus on what matters most: the patient.

We are a mission-driven team of brilliant minds trusted by leading organizations including Intermountain Health, OSF HealthCare, SSM Health, and MUSC Health. Our vision is backed by premier investors such as Thrive Capital, GV (Google Ventures), General Catalyst, and Salesforce Ventures. We move quickly for good reason, listen deeply to solve big challenges, and build products with the same care and quality we’d want for our own loved ones.

About the Role
As a Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do
As a Site Reliability Engineer, you will be a steward of Fabric’s production integrity, leading the strategy for infrastructure automation, observability, and system resilience. Your primary responsibilities include:

  • Infrastructure & Kubernetes Orchestration
    • Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
    • Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
    • Optimizing the AWS footprint—specifically EC2, RDS, and S3—to balance high performance with cost-efficiency and reliability.
  • AI-Assisted Operations & Automation
    • Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
    • Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
    • Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
  • Observability & Incident Management
    • Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
    • Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce mean time to recovery (MTTR).
    • Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.
  • Compliance & Collaboration
    • Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
    • Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.

Why You Might Be a Good Fit
  • You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
  • You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
  • You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
  • You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.

This Might Not Be The Right Fit If...
  • You prefer working on static infrastructure rather than evolving systems through code and automation.
  • You are uncomfortable with the "agile" pace of tech-driven platform development or with integrating AI tools into your daily workflow.
  • You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.

Your Qualifications
  • 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
  • Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
  • Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
  • Strong coding and scripting skills in Python, Bash, Ruby, or Go.
  • Preferred experience building agentic workflows or AI-assisted tooling to drive operational efficiency.
  • A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

The national pay range for this role is $135,000.00 – $160,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation, such as stock options and bonuses, along with a comprehensive benefits package including medical, dental, vision, unlimited PTO, and a 401(k) plan. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.

At Fabric, we believe that a diverse workforce is essential to our success. We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, color, religion, sex, national origin, age, disability, veteran status, or any other legally protected characteristic. We actively encourage individuals from all backgrounds to apply.

Recruitment Fraud Alert: Protect Yourself
Fabric Health is aware of scammers attempting to impersonate employers. To ensure that any recruiting contact you receive is legitimate, please adhere to the following:

  • Verify the Domain: Official recruitment emails will only come from addresses ending in @fabrichealth.com or @gem.com. No other domain names are legitimate.
  • Official Interview Tools: We use Gem for our recruitment process and Google Meet for all video interviews. Google Meet is always the platform used for your first interview; you will never be sent a Zoom link to set up or conduct an initial interview. All interviews are conducted via video unless specifically stated by our team as an audio call. We never conduct interviews via chat, social media, Skype, or WhatsApp.
  • Zoom Usage: Zoom is utilized only for specific meetings set directly by our team for purposes outside of the standard interview process (e.g., coordination or onboarding discussions). It is never the first link you will receive from us.
  • Authorized Contact & Texting: Fabric will only contact you if you have submitted an application or if you are connected to a current employee who shared your information with us. We will only send text messages if you have provided explicit authorization and consent, either through your application or while communicating directly with our team. If you have not explicitly authorized us to reach out, treat any SMS or unsolicited outreach as fraudulent and do not respond.
  • Sensitive Data: We will never ask you for sensitive personal or financial documents (ID, banking info, SSN) during the application, interview, or candidacy stages. All sensitive data is handled through secure internal systems post-offer.
  • Verify the Team: You can reference LinkedIn to verify members of our recruiting team; however, please remain vigilant as scammers may create fraudulent profiles. Always cross-reference the sender's email domain with our official @fabrichealth.com address.

If you question the validity of a contact or receive a suspicious message, do not click any links. Report the issue immediately to [email protected].

Please note: The security inbox is for reporting fraudulent activity only. Do not email this address for application status updates or to share application materials, as these will not be reviewed. Applications are only accepted and reviewed if submitted through our official application portal, and no application status information will be provided via the security email. 

HQ

Fabric Health New York Office
New York, NY 10011, United States
