Site Reliability Engineer (Big Data / Hadoop)
About the Role:
At Foursquare, our production systems run on an innovative hybrid cloud-and-colocation installation. We embrace open source and home-grown tools in the belief that what works best is best. We're looking for a seasoned site reliability engineer to help us grow, automate, and monitor our footprint, both in the datacenter and in the cloud.
The Big Data SRE will focus on the operation and optimization of our large (7,000+ cores, 4 petabytes of storage, and growing!) Hadoop cluster. You will work closely with the rest of the engineering org to ensure a stable and scalable platform is available to support our extensive data analytics and machine learning efforts. You will cross-train with the rest of the SRE team to share your Hadoop expertise and to acquire the skills needed to maintain and scale the rest of our infrastructure.
You should have a proven track record of writing automation tools, a solid understanding of operating system fundamentals, and familiarity with common production environment services. You should be comfortable running with your own ideas and eager to learn new skills on a bleeding-edge platform. We use a variety of tools, technologies, and languages to build software (e.g., Scala, Hadoop, Python, Thrift, MongoDB, Memcached, Redis, Kafka, Chef, Aurora, Mesos, RocksDB, Luigi, Pants, Nginx, HAProxy, Logstash, Grafana), but experience with equivalents will do just fine.
Qualifications:
- 5+ years of proven industry experience.
- Strong written and verbal communication skills.
- Solid background with Linux and other *nix operating systems.
- Experience with deployment automation tools like Ambari, Chef, Puppet, or similar systems.
- Familiarity with a breadth of projects in the Hadoop ecosystem, and expertise in at least a few of them. We primarily use HDFS, YARN, Hive, MapReduce, Cascading, Scalding, Presto, Spark, PySpark, Jupyter, and Zeppelin.
- Familiarity with using and supporting analytics systems like Hive, Redshift, Presto, Athena, Tableau, and similar tools.
- Familiarity with performance debugging and tuning at the OS, JVM, and cluster levels (MapReduce, Hive, and Spark jobs).
- Bonus points for deploying/operating large-ish Hadoop clusters in AWS/GCP and for experience with EMR, Terraform, DC/OS, or Dataproc.
- Bachelor's degree or higher in Computer Science, Electrical Engineering, or a related field.