TrueML

Senior Manager, DevOps

Posted 5 Days Ago
In-Office or Remote
Hiring Remotely in San Francisco, CA
150K-220K Annually
Senior level
Lead infrastructure and platform engineering for cloud architecture, CI/CD standards, and scalability of machine learning products, while managing a team of DevOps engineers.
TrueML Products is seeking a highly experienced and strategic Sr. Manager, DevOps to lead our infrastructure and platform engineering efforts. This role is critical in driving our cloud architecture strategy, establishing elite CI/CD standards, and ensuring the scalability and reliability of our machine learning-driven products.
 
Reporting to the Sr. Director, Program & Operations, you will lead the evolution of our internal developer platform and infrastructure-as-code (IaC) architecture. The ideal candidate is a hands-on leader with a "systems-thinking" mindset. We are looking for a visionary who thrives on solving complex distributed systems challenges and for whom using GenAI and AIOps tooling to optimize system performance and automation is second nature.

What You'll Do (Technical Leadership & Strategy):

  • Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
  • Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.
  • Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps toolstack and third-party vendors.
  • Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.
  • Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
  • Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.
  • Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.
  • Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.

What You'll Do (Hands-On Engineering & Technical Execution):

  • Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.
  • Develop Terraform Infrastructure as Code hands-on for resource provisioning.
  • Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability.
  • Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling.
  • Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR.
  • Conduct hands-on audits of cloud configurations and IAM policies, implementing "least privilege" access controls and automated remediation scripts.
  • Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (People Leadership & Engineering Collaboration):

  • Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning.
  • Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck.
  • Collaborate with the Quality Engineering and Security leadership to define and enforce "Definition of Done" standards that include automated testing and security gates.
  • Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence.
  • Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities.

Who You Are (Qualifications):

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers.
  • Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.
  • Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.
  • Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus. 
  • Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
  • Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
  • Experience acting as an Incident Commander for high-severity outages and fostering a "blameless" post-mortem culture.
  • Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.
  • Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.

Ways to "Stand Out":

  • Experience leading organizational platform migrations, including the development of rollback strategies, stakeholder communication plans, and post-migration validation.
  • Prior experience working with high-velocity, product-driven early-to-mid stage technology companies where reliability, extensibility, and availability were mission-critical to success.
  • AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with the same in production environments.
  • Notable contributions to open source projects or communities.

Top Skills

ArgoCD
Atlantis
AWS
Bash
Datadog
GitHub Actions
Go
Kubernetes
Observe
Python
Terraform


