Milestone Systems

Lead Site Reliability Engineer - Infrastructure

Posted Yesterday

Remote or Hybrid

2 Locations

160K-180K Annually

Expert/Leader

The Lead Site Reliability Engineer will oversee the reliability and scalability of the infrastructure, lead a team in operational execution, ensure best practices in SRE, and mentor senior engineers.

The summary above was generated by AI

We are seeking a Lead Site Reliability Engineer (Infrastructure) to join our fast-moving VSaaS engineering organization. This role carries responsibility for technical leadership and operational execution of the Infrastructure SRE team. You will own the reliability, scalability, and operability of our shared platform and production systems, while shaping how reliability engineering and SRE practices are applied across the organization and mentoring senior and staff engineers.

You will work closely with product engineering and platform teams to ensure a seamless developer experience, while setting standards, driving priorities, and leading by example during incidents and high-impact operational work. This role requires a strong technical background in cloud infrastructure, distributed systems, CI/CD, and GitOps, along with hands-on development experience in Golang and/or Python, to improve developer workflows, automation, and long-term system reliability.

This is a remote role in the United States.

Role Overview

Site Reliability Engineer - Infrastructure

The Infrastructure team provides leadership, direction, and accountability for platform architecture, system design, and end-to-end implementation to meet and exceed product non-functional requirements, including quality, security, reliability, availability, and performance. Site Reliability Engineers enable Product Development teams to ship features with reliable velocity by owning the stability, scalability, and operability of the underlying infrastructure and shared services.

What You Will Do:

As a Lead Site Reliability Engineer, you will:

Operate and evolve large-scale distributed systems, anticipating failure modes and proactively mitigating risks across production environments, while owning day-to-day production operations, including monitoring, alert triage, incident response, post-incident analysis, and critical incident coordination and documentation.

Lead the design, build, and implementation of automation, orchestration, and operational tooling to improve efficiency, reliability, and signal-to-noise ratio, and to reduce recurring issues and service-impacting events.

Set technical direction and influence platform strategy by defining platform architecture, system design, and documentation to guide development, testing, deployment, and long-term maintenance of complex distributed systems.

Establish and enforce standards, operational rigor, and best practices for deploying, monitoring, managing, and operating cloud-native and distributed infrastructure environments.

Lead the adoption and execution of modern CI/CD, GitOps, and cloud-native infrastructure practices, ensuring reliable, scalable, and traceable software and infrastructure releases.

Mentor and develop senior and staff engineers, reinforcing SRE principles, DevOps practices, accountability, and operational excellence across the Infrastructure SRE team.

Collaborate closely with product and engineering stakeholders, advocating for an SRE mindset and system-level thinking to maximize reliability, performance, availability, security, and scalability across shared platforms and services.

Other duties as assigned are absorbed into the above ownership and operational responsibilities.

What You Have:

10+ years of experience in site reliability engineering, infrastructure, or systems engineering, with deep ownership of large-scale production systems and demonstrated leadership of SRE or infrastructure teams, including setting technical direction and mentoring senior engineers.

Strong hands-on experience designing and building automation and operational tooling using Golang and/or Python, with expert-level proficiency in Linux/Unix systems, shell scripting, and production troubleshooting.

Advanced expertise in cloud-native and IaaS architectures, distributed systems, and container orchestration in production environments, including compliance, security, and network considerations.

Expertise in architecting modular Terraform frameworks and Infrastructure-as-code (IaC) design patterns.

Deep understanding of SRE and DevOps principles, including incident management, SLA/SLO ownership, automation, and reliability engineering practices, as well as leading incident response with post-incident analysis and preventive improvements.

Strong experience with CI/CD pipelines, GitOps workflows, release tooling, and modern cloud-native infrastructure practices, ensuring reliable and traceable software and infrastructure changes.

Hands-on experience operating Docker and Kubernetes environments, observability platforms (logging, monitoring, alerting), and SQL/NoSQL databases (e.g., Postgres, MongoDB, Graph DB), including performance tuning and operational troubleshooting.

Skills / Training Desired

Subject matter expertise in Google Cloud preferred; experience with other public cloud providers is also valuable.

Demonstrated expertise in microservices lifecycle management, including integration, testing, deployment, and operational best practices, supported by advanced knowledge of software release tooling and CI/CD platforms such as GitLab, Jenkins, Cloud Build, ArgoCD, and Spinnaker.

Deep understanding of the Docker and Kubernetes ecosystem, including orchestration, cluster management, and image lifecycle optimization.

Strong experience with observability, logging, and monitoring tools such as ELK Stack, Prometheus, Stackdriver, Datadog, New Relic, or Dynatrace.

Hands-on experience with algorithms, data structures, complexity analysis, and software/system design for large-scale distributed environments.

Experience driving automation for operational efficiency, alert noise reduction, recurring issue mitigation, performance testing, capacity planning, and system optimization in production environments.

Experience implementing security best practices and compliance considerations in infrastructure and platform design, along with the ability to influence cross-functional teams, evangelize SRE and DevOps practices, and foster a culture of reliability and operational excellence.

Why Milestone?

Milestone offers not only great benefits but also a great culture. Employees here have flexible work environments, opportunities for further education, and the ability to effect change in our organization directly.

The annual salary for this position ranges from $160,000 to $180,000. Pay is based on the level, location, complexity, responsibility, and job duties of the specific position and is just one component of Milestone’s total compensation package. Additionally, we offer an attractive benefits package that includes medical/dental benefits, FSA or HSA, 401(k) with 6% Safe Harbor employer match, paid parental leave, generous PTO (20 days of vacation, 10 days of paid sick time, and 12 company holidays), fully paid short-term disability policy, fully paid long-term disability policy, and life insurance. If you are selected for an interview, please feel welcome to speak to our Talent Partner about our compensation philosophy.

All employees must complete a background check. Employees in fiscal roles are also required to undergo a credit check. All information obtained during these checks is handled confidentially and shared only with authorized personnel.

Milestone is committed to creating a diverse and inclusive workplace and is proud to be an equal opportunity employer.

Contact and application

Please apply at our website: www.milestonesys.com

We look forward to receiving your application.

Top Skills

CI/CD

Docker

GitOps

Kubernetes

Linux

Python

Terraform

