We are seeking a Lead Site Reliability Engineer (Infrastructure) to join our fast-moving VSaaS engineering organization. This role carries responsibility for technical leadership and operational execution of the Infrastructure SRE team. You will own the reliability, scalability, and operability of our shared platform and production systems, while shaping how reliability engineering and SRE practices are applied across the organization and mentoring senior and staff engineers.
You will work closely with product engineering and platform teams to ensure a seamless developer experience, while setting standards, driving priorities, and leading by example during incidents and high-impact operational work. This role requires a strong technical background in cloud infrastructure, distributed systems, CI/CD, and GitOps, along with hands-on development experience in Golang and/or Python, to improve developer workflows, automation, and long-term system reliability.
This is a remote role in the United States.
Role Overview
Site Reliability Engineer - Infrastructure
The Infrastructure team provides leadership, direction, and accountability for platform architecture, system design, and end-to-end implementation to meet and exceed product non-functional requirements, including quality, security, reliability, availability, and performance. Site Reliability Engineers enable Product Development teams to ship features with reliable velocity by owning the stability, scalability, and operability of the underlying infrastructure and shared services.
What You Will Do:
As a Lead Site Reliability Engineer, you will:
- Operate and evolve large-scale distributed systems, anticipating failure modes and proactively mitigating risks across production environments, while owning day-to-day production operations, including monitoring, alert triage, incident response, post-incident analysis, and critical incident coordination and documentation.
- Lead the design, build, and implementation of automation, orchestration, and operational tooling to improve efficiency, reliability, and signal-to-noise ratio and to reduce recurring issues, minimizing service-impacting events.
- Set technical direction and influence platform strategy by defining platform architecture, system design, and documentation to guide development, testing, deployment, and long-term maintenance of complex distributed systems.
- Establish and enforce standards, operational rigor, and best practices for deploying, monitoring, managing, and operating cloud-native and distributed infrastructure environments.
- Lead the adoption and execution of modern CI/CD, GitOps, and cloud-native infrastructure practices, ensuring reliable, scalable, and traceable software and infrastructure releases.
- Mentor and develop senior and staff engineers, reinforcing SRE principles, DevOps practices, accountability, and operational excellence across the Infrastructure SRE team.
- Collaborate closely with product and engineering stakeholders, advocating for an SRE mindset and system-level thinking to maximize reliability, performance, availability, security, and scalability across shared platforms and services.
Other duties as assigned are absorbed into the above ownership and operational responsibilities.
What You Have:
- 10+ years of experience in site reliability engineering, infrastructure, or systems engineering, with deep ownership of large-scale production systems and demonstrated leadership of SRE or infrastructure teams, including setting technical direction and mentoring senior engineers.
- Strong hands-on experience designing and building automation and operational tooling using Golang and/or Python, with expert-level proficiency in Linux/Unix systems, shell scripting, and production troubleshooting.
- Advanced expertise in cloud-native and IaaS architectures, distributed systems, and container orchestration in production environments, including compliance, security, and network considerations.
- Expertise in architecting modular Terraform frameworks and Infrastructure-as-Code (IaC) design patterns.
- Deep understanding of SRE and DevOps principles, including incident management, SLA/SLO ownership, automation, and reliability engineering practices, as well as leading incident response with post-incident analysis and preventive improvements.
- Strong experience with CI/CD pipelines, GitOps workflows, release tooling, and modern cloud-native infrastructure practices, ensuring reliable and traceable software and infrastructure changes.
- Hands-on experience operating Docker and Kubernetes environments, observability platforms (logging, monitoring, alerting), and SQL/NoSQL databases (e.g., Postgres, MongoDB, Graph DB), including performance tuning and operational troubleshooting.
Skills / Training Desired
- Subject matter expertise in Google Cloud preferred; experience with other public cloud providers is also valuable.
- Demonstrated expertise in microservices lifecycle management, including integration, testing, deployment, and operational best practices, supported by advanced knowledge of software release tooling and CI/CD platforms such as GitLab, Jenkins, Cloud Build, ArgoCD, and Spinnaker.
- Deep understanding of the Docker and Kubernetes ecosystem, including orchestration, cluster management, and image lifecycle optimization.
- Strong experience with observability, logging, and monitoring tools such as ELK Stack, Prometheus, Stackdriver, Datadog, New Relic, or Dynatrace.
- Hands-on experience with algorithms, data structures, complexity analysis, and software/system design for large-scale distributed environments.
- Experience driving automation for operational efficiency, signal-to-noise improvement, recurring issue mitigation, performance testing, capacity planning, and system optimization in production environments.
- Experience implementing security best practices and compliance considerations in infrastructure and platform design, along with the ability to influence cross-functional teams, evangelize SRE and DevOps practices, and foster a culture of reliability and operational excellence.
Why Milestone?
Milestone offers not only great benefits but also a great culture. Employees here have flexible work environments, opportunities for further education, and the ability to effect change in our organization directly.
The annual salary for this position ranges from $160,000 to $180,000. Pay is based on the level, location, complexity, responsibility, and job duties of the specific position and is just one component of Milestone’s total compensation package. Additionally, we offer an attractive benefits package that includes medical/dental benefits, FSA or HSA, 401k with 6% Safe Harbor employer match, paid parental leave, generous PTO (20 days of vacation, 10 days of paid sick time, and 12 company holidays), a fully paid short-term disability policy, a fully paid long-term disability policy, and life insurance. If you are selected for an interview, please feel welcome to speak to our Talent Partner about our compensation philosophy.
All employees must complete a background check. Employees in fiscal roles are also required to undergo a credit check. All information obtained during these checks is handled confidentially and shared only with authorized personnel.
Milestone is committed to creating a diverse and inclusive workplace and is proud to be an equal opportunity employer.
Contact and application
Please apply at our website: www.milestonesys.com
We are looking forward to receiving your application.