Site Reliability Engineer
The TechOps organization strives to accelerate Flatiron’s mission to improve cancer care and learn from patient experiences by ensuring that our technical infrastructure and staff maintain the highest levels of reliability, performance, and agility. Our Site Reliability Engineering (SRE) teams are a key part of this mission: they provide best-practice guidance on reliability and scalability to our engineering teams. As a member of one of our SRE teams, you will play a key role in scaling our technology platforms and empowering our development teams to use them with minimal friction.
As an SRE you will:
- Design and build infrastructure & systems that provide high levels of scalability, reliability, and performance, while balancing security, maintainability, and operational excellence.
- Interface across teams to codify and reliably test infrastructure changes using Flatiron’s software development lifecycle.
- Partner with product and application teams to provide guidance and best practices around scalability, reliability, and performance of our production systems, infrastructure, and software.
- Actively participate in code and configuration reviews.
- Craft solid and clearly explained designs, playbooks, and documentation for teammates and the larger engineering organization.
- Improve operational efficiency through automation and deployment or development of new tools.
- Be proactive in performance & availability monitoring; provide remediations for systemic issues.
- Gather requirements, scope work, produce estimates, and help define deliverables with project timelines.
- Actively participate in on-call duties.
- Work as a team on escalations, resolving critical issues that impact our high-SLA production systems.
You should have:
- 2+ years working in a DevOps or software engineering role.
- Experience writing simple, readable, useful code, especially for operational tooling.
- Experience with cloud environments such as AWS, Azure, or GCP.
- Experience working with a production environment with high uptime requirements and measurable SLAs.
- Familiarity with container technologies such as Docker, Kubernetes, or Mesos.
- Proficient with configuration management, orchestration, and infrastructure-as-code tools such as Ansible and Terraform.
- Demonstrated ability to deliver high-quality, on-time solutions that are reliable, scalable, and maintainable.
- Strong communication skills and ability to work effectively across multiple business and engineering teams.
- Preference for working in a dynamic environment, comfortable challenging the status quo.
- Ability to adjust quickly to changing priorities and make quick decisions with limited information.
- Belief that a team working well together is truly smarter than the single smartest person on that team.