Senior Site Reliability Engineer
About Us
ActionIQ unifies customer data and empowers marketers to deliver relevant customer experiences. Our product features self-service audience discovery and true cross-channel orchestration powered by AI-driven insights and decisions. The product and the platform it runs on form a complex, feature-rich distributed data system, which we offer to our clients as a multi-cloud SaaS solution.
The SRE team’s goal is to give our engineers deep insight into the reliability and performance of our product in production. We bring operational context and real-time information to our engineering teams so they can make informed, data-driven decisions. The SRE team is responsible for our observability strategy, incident tracking and analysis, tooling, and anything else that affects the operability, reliability, and efficiency of our platform. Our systems process petabytes of data and serve high-profile enterprise customers, and we’re just getting started!
About You
You’re excited about working for a growing startup and influencing the success and direction of our engineering organization. You’ve implemented Site Reliability Engineering practices in a startup at least once before, and you’ve got stories and experiences to share. You understand the theory and practice of SRE well enough to help us design and implement a practice that is uniquely our own.
Requirements
- 5+ years of experience in roles such as DevOps Engineer, Site Reliability Engineer, Systems Engineer, or a similar position where you supported large-scale, customer-facing products.
- Designing and reviewing systems and architectures for reliability.
- Creating and tracking Service Level Objectives and Service Level Indicators for complex products and services.
- Communication skills to coach other engineers and persuade them to adopt SRE practices and principles.
- Software development skills in at least one object-oriented language.
- Designing and executing processes for incident analysis, postmortems, disaster recovery drills, and similar exercises.
- Working with observability tools such as Datadog, Honeycomb, New Relic, Zipkin, Prometheus, or the ELK stack.
Preferred Experience
- Working in an environment where Infrastructure as Code is the preferred approach.
- Supporting Java, Python, and JavaScript applications throughout their lifecycle.
- Designing alert configuration and routing.
- Standardizing and automating manual processes.
- Being on call for, and supporting, an SLA-driven product.
Benefits
- Strong leadership buy-in on the value of SRE for our Engineering teams.
- Work with a fun, inclusive, and smart team of people as we build a NYC-based enterprise software company!
- Backed by top-tier VCs (Sequoia, Andreessen Horowitz, FirstMark Capital).
- Convenient working location with great subway access.