Zingtree

Senior DevOps / Platform Reliability Engineer

Reposted 3 Days Ago
Remote
Hiring Remotely in East Coast, USA
Senior level
As a Senior DevOps / Platform Reliability Engineer, you will manage CI/CD pipelines, automate infrastructure, operate Kubernetes, and enhance observability while ensuring security and compliance for enterprise systems.
The summary above was generated by AI
About Zingtree

Zingtree is the next-generation intelligent process automation platform reimagining customer experience operations for the world’s top support leaders. With 500+ customers, including Optum, Corpay, Sony, SharkNinja, and Allianz, we transform self-service, surface automation opportunities, and turn every agent into an expert.

The Role

We’re hiring a Senior DevOps / Platform Reliability Engineer to own the platform that powers our agentic CX product. You’ll build the CI/CD, infrastructure, and observability backbone that enables us to ship multi-agent systems safely to enterprise customers.

If you want to operate a production AI platform and use AI to help operate it, this role is for you.

In this role, you will collaborate with development, operations, and infrastructure teams to automate and streamline processes, build and maintain tools for deployment, monitoring, and operations, and troubleshoot issues across development and production environments.

What You'll Do

    • Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.

    • Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.

    • Operate and scale our Kubernetes platform (EKS + Argo CD), including autoscaling, ingress, external-dns, cert-manager, External Secrets Operator, backups, runtime guardrails, and multi-tenant isolation for enterprise customers.

    • Manage the edge and network perimeter, including Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security controls.

    • Operate the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with responsibility for backups, point-in-time recovery (PITR), and multi-AZ disaster recovery aligned to defined RTO/RPO objectives.

    • Build and maintain Lambda workloads where event-driven or serverless architectures are the right fit.

    • Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token cost, tool-call latency, evaluation signals, and prompt/version tracking.

    • Strengthen our security and compliance posture for SOC 2 and HIPAA, including least-privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.

    • Drive FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, per-tenant and per-workload cost attribution, and LLM cost controls.

    • Build and evolve our AI-native DevOps capabilities (see section below).

    • Partner with engineering teams to define platform standards, service templates, deployment best practices, and operational SLOs.

    • Monitor system performance and ensure reliability, scalability, and security across infrastructure and services.

    • Collaborate with software engineering teams to support continuous integration and continuous delivery best practices.

    • Document infrastructure, deployment processes, and operational standards to support knowledge sharing across the team.

Agentic AI in DevOps

    You’ll help define how Zingtree applies agentic AI and modern AI operational practices to operate and improve our platform.

    Responsibilities include:
    • Design and operate auto-remediation agents for common production toil such as certificate rotation, noisy pods, infrastructure drift, and flaky CI pipelines, with human-in-the-loop (HITL) controls for any destructive or customer-impacting actions.

    • Use LLMs for incident triage and root cause analysis, including log and trace summarization, signal correlation, and first-draft postmortems that are always reviewed by humans.

    • Connect AI agents to internal systems through the Model Context Protocol (MCP), including GitHub, Jira, PagerDuty, AWS, Kubernetes, Terraform, and related platforms, using scoped credentials, audit logging, and allow-listed access.

    • Apply AI-driven observability techniques, including anomaly detection on metrics, LLM-based log clustering, and alert deduplication and summarization on top of Prometheus and OpenTelemetry.

    • Establish operational guardrails such as prompt/version pinning, evaluation frameworks for agent behavior, cost and rate-limit controls, policy-as-code (OPA/Conftest) for AI-generated infrastructure changes, and clearly defined blast-radius controls.

    • Define best practices for AI coding assistants such as GitHub Copilot, Claude, and Amazon Q in infrastructure repositories, including review workflows, prompt design, and restrictions on auto-merged changes.

    • Treat AI components as production systems with SLOs, observability, on-call readiness, runbooks, and rollback strategies for agents and prompts.

About You

    Required Qualifications
    • 5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.

    • Strong experience with CI/CD pipelines and tools such as GitHub Actions, GitLab CI, Jenkins, or CircleCI.

    • Hands-on experience operating production EKS environments, including autoscaling, ingress, secrets management, and cluster upgrades.

    • Strong AWS networking experience, including multi-account VPC design, subnets, routing, security groups, NACLs, Route 53, ACM, and load balancers.

    • Deep experience with Terraform and GitHub Actions, ideally using OIDC-based cloud authentication.

    • Experience with Aurora/RDS MySQL, Redis (ElastiCache), and S3, including backups, PITR, migrations, and lifecycle management.

    • Strong observability experience using Prometheus, Grafana, and OpenTelemetry.

    • Experience operating Argo CD at scale.

    • Experience with Infrastructure as Code tools such as Terraform, CloudFormation, or Ansible.

    • Experience managing Cloudflare services including WAF, Bot Management, Rate Limiting, and Zero Trust / Access, along with CloudFront.

    • Experience operating Kafka/MSK at scale, including topics, consumer groups, and schema registries.

    • Experience with Lambda and event-driven architectures.

    • Comfortable working with Python, Bash, and Linux systems.

    • Strong understanding of security best practices across IAM, KMS, secrets management, networking, and software supply chain security.

    • Familiarity with vulnerability scanning and compliance tooling.

    Nice to Have
    • Experience operating LLM or ML workloads in production, including LiteLLM, Bedrock, pgvector, prompt caching, or evaluation systems.

    • Experience building or integrating MCP servers or deploying agent frameworks such as LangGraph or CrewAI in production environments.

How We Work

    • We bias toward automation over toil. If you do it twice, script it. If it pages twice, fix it.

    • We’re a small team with high ownership. You’ll help define standards, not just follow them.

    • Humans stay in the loop for anything risky. AI accelerates decision-making but does not replace judgment.

    • We value blameless incident reviews, documented decisions, and fast feedback loops.

What We Offer

    • Competitive compensation packages

    • Comprehensive health benefits:

      • 100% of employee premiums covered

      • 75%–80% of dependent premiums covered for most health, dental, and vision plans

    • 401(k) plans to support retirement planning (no employer matching currently)

    • Paid parental leave

    • Unlimited PTO

    • Flexible remote work from anywhere

    • Up to $200/month co-working reimbursement

    • Home office stipend:

      • Up to $500 for home office setup

      • $100/month for internet, phone, and related expenses

Zingtree Values

    Lead with Action

    We are doers. We move quickly with purpose, take smart risks, learn fast, and focus on outcomes that benefit our customers and the business.

    People Really Matter

    We win as a team. We care deeply about our customers and employees, helping each other achieve professional growth and meaningful impact.

    Ownership Leads to Results

    When we commit, we deliver. We operate with integrity, accountability, and high standards.

    Expertise Creates Value

    We are learners. We continuously grow our knowledge, share expertise, and apply it to create meaningful results.

    Transparency Builds Trust

    We communicate openly, honestly, and respectfully. We share information that matters and build trusted relationships through clarity and empathy.
