- As a principal member of the team, you will be the authority on SRE subject matter. This includes but is not limited to SLI/SLO development, error budgets, eliminating toil, and automation.
- You will lead cross-company projects and help set the roadmap for the team in collaboration with the Director of Site Reliability Engineering.
- You will mentor other engineers within the team and across Hinge in best practices for Kubernetes, microservices, production readiness, and related areas.
- You will share ownership of the Production environment with the rest of the SRE team. You will work closely with the product developers and other engineering departments to help deploy their work in a safe and reliable manner.
- You will build tools to make our infrastructure more consistent, more reliable, more observable, and less dependent on manual intervention. We are currently rebuilding our infrastructure in Terraform and Ansible. This work will continue in 2021 to include global expansion.
- You will join the 24/7 on-call rotation for the production infrastructure. The rotation consists of one-week tours shared among 4-5 SREs, including yourself (we plan to grow and spread this out further in 2020).
What We’re Looking For:
- 7+ years of experience with Linux systems administration.
- 7+ years of coding experience (can overlap with the systems administration experience above).
- Extensive experience with AWS.
- Deep database knowledge is a plus; we are in need of a guru in this area.
- Understanding of SRE processes, SLI vs SLO vs SLA, how to define and live by an error budget, etc.
- Coding experience in Go, Bash, or Python (the SRE team uses mostly Go and Bash).
- Experience as a member of a 24/7 on-call rotation.
- 2+ years of Docker/Kubernetes experience.
- The ability to automate your tasks in Bash or Go.
- Terraform experience is optional, but a strong plus.