Site Reliability Engineer
About our Site Reliability Engineering team:
At Foursquare, our production systems run on an innovative hybrid cloud-and-coloc installation. We embrace open source and home-grown tools in the belief that what works best is best. We're looking for a seasoned site reliability engineer to help us grow, automate, and monitor our footprint, both in the datacenter and in the cloud.
You should have a proven track record of writing automation tools, a solid understanding of operating system fundamentals, and familiarity with common production environment services. You should be comfortable running with your own ideas and eager to learn new skills on a bleeding-edge platform. We use a variety of tools, technologies, and languages to build software (e.g., Scala, Hadoop, Python, Thrift, MongoDB, Memcached, Redis, Kafka, Chef, Aurora, Mesos, RocksDB, Luigi, Pants, Nginx, HAProxy, Logstash, Grafana), but experience with equivalent ones will do just fine.
A background and interest in distributed systems security is a major plus.
Here are some high-level areas you could get involved in:
- Rebuilding our proxy tier to support more advanced load-balancing algorithms
- Improving the speed and reliability of continuous deployment for our backend services
- Building tools to analyze and optimize CPU, core, memory, and disk utilization of services that run on our Aurora and Hadoop clusters
- Improving our logging pipeline and adding to the growing set of data sources we parse for actionable information
You can join our Production Engineering team at our San Francisco or New York City office.
Requirements:
- 3+ years of production environment experience
- Demonstrated tool-building capability
- Grace under fire and a willingness to troubleshoot, as part of a 24x7 on-call rotation, to keep our services up and running
- Positive attitude and a self-directed work ethic
Responsibilities:
- Help evolve our microservices deployment
- Improve metrics and visualization tools
- Implement stronger controls for authentication and authorization across our fleet
- Develop automation to take greater advantage of cloud elasticity to save us money and maintain high availability