CVS Health

Principal Software Engineer – AI Platform (Production Engineering / Reliability)

Sorry, this job was removed at 06:11 a.m. (EST) on Thursday, Jun 04, 2026
In-Office
City of New Home, TX

We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.

Overview
We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments.
As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.
Key Responsibilities
Production Reliability & Operations Leadership
  • Own and evolve production operations strategy for AI/ML platforms and services
  • Define SLOs, SLIs, and error budgets for AI systems (both online and batch inference pipelines)
  • Lead root cause analysis (RCA) and drive systemic improvements post-incident
  • Establish operational readiness standards for launching new AI capabilities
  • Build frameworks for on-call excellence, incident response, and escalation

Observability, Monitoring & Alerting
  • Design and implement end-to-end observability systems across AI workloads:
    • Model performance monitoring
    • Data pipeline health
    • Infrastructure metrics
  • Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
  • Define actionable, low-noise alerts tied to business and system impact
  • Develop dashboards and telemetry standards for real-time visibility across services
  • Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems

AI/ML Production Systems Excellence
  • Ensure reliable deployment and operation of:
    • Real-time inference services
    • Model pipelines (training, validation, deployment)
    • Data ingestion and feature pipelines
  • Implement model observability (drift detection, data skew, performance degradation)
  • Partner with ML engineers to improve production readiness of models
  • Establish lifecycle standards for models in production environments

Automation & Platform Development
  • Build internal platforms and tooling for:
    • Automated incident detection and response
    • Self-healing systems
    • Deployment validation and canarying
  • Drive Infrastructure as Code (IaC) and policy automation
  • Improve system resilience through chaos testing and fault injection

Technical Leadership & Strategy
  • Act as a trusted technical advisor across platform, ML, and product teams
  • Set direction for operational excellence in AI systems at org scale
  • Mentor senior engineers and influence cross-team architectural decisions
  • Lead adoption of industry best practices in reliability engineering and observability

Required Qualifications
  • 10+ years in software engineering, production engineering, or SRE roles
  • Deep experience operating large-scale distributed systems in production
  • Proven track record building monitoring, observability, and alerting systems
  • Strong expertise in incident management and production support models
  • Experience working with cloud platforms (Azure, AWS, GCP)

Preferred Qualifications
  • Experience supporting AI/ML platforms or data-intensive systems
  • Familiarity with model lifecycle management and MLOps practices
  • Knowledge of:
    • OpenTelemetry, Prometheus, Grafana, Datadog
    • Kubernetes and containerized workloads
    • Streaming systems (e.g., Kafka, Azure Event Hubs)
  • Experience defining and implementing SLO-driven engineering
  • Background in high-availability, low-latency systems

Key Competencies
  • Systems thinking and ability to reason about complex, interdependent systems
  • Strong bias for automation, scalability, and long-term solutions
  • Exceptional debugging and incident management skills
  • Ability to influence without authority across multiple teams
  • Passion for operational excellence and reliability

Pay Range

The typical pay range for this role is:

$144,200.00 - $288,400.00


This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company’s equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve, and we are committed to fostering a workplace where every colleague feels valued and knows they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.

This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.


Additional details about available benefits are provided during the application process and on Benefits Moments.

We anticipate the application window for this opening will close on: 06/04/2026

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.
