Site Reliability Engineer
AlphaSense provides a revolutionary search engine for knowledge professionals.
Our mission is to curate and summarize the world's business knowledge, including the vast high-value content sets that traditional web search engines cannot reach. Our users can rapidly discover key data points & relationships while tracking impactful real-time events with intelligent, automated alerts.
Our 1000+ clients include many of the world's largest investment & advisory firms, global banks & Fortune 500 corporations. We help our users become dramatically more productive & gain an information edge by discovering critical data points & trends that others miss.
We are seeking a passionate Site Reliability Engineer to help create the next big thing in data analysis and search solutions.
You will join our Cloud infrastructure team, supporting our team of development engineers who take care of the AlphaSense platform. We will pair you up with world-class talent in cloud and software engineering and provide a position and environment for continuous learning.
The ideal candidate has strong cloud system configuration, monitoring, support and scripting skills. They are passionate about system engineering, scalability and stability, and never want to stop learning. Experience with AWS is essential.
Your responsibilities will include:
- Establish tools and instrumentation to measure and monitor availability, latency and overall system health
- Continuously support the transition to Cloud-Native
- Provide sustainable incident response and blameless postmortems
- Learn the system far and wide and know all its weak points
- Troubleshoot production and development issues
- Provide help with deployments and tooling support
- Help drive the team towards continuous deployment
- Improve system stability through close communication with developers about the weak points in the system
- Solve engineers' issues in a cloud-native way
- Keep the system green and stable
- With the help of our strong development team, find ways to prevent incidents instead of just fixing them
- Create and maintain operational runbooks
Requirements:
- BS / MS Degree in Computer Science or related discipline preferred
- Experience with AWS
- At least basic experience with K8s (Helm, operators)
- Strong skills in scripting languages (shell scripts, Perl, Python)
- Experience with Prometheus, Grafana and other open source monitoring/logging solutions
- Interest in designing and troubleshooting large-scale distributed systems
- Strong communication skills as well as a problem-solving mindset
- Ability to automate routine tasks
- Good working knowledge of relational and NoSQL databases
Nice to have:
- Understanding of continuous deployment and how to get there
- Infrastructure as code experience (Ansible, Terraform, CloudFormation)
- Experience with logging setup, configuration and maintenance (EKS, Fluentd, Logstash)