Site Reliability Engineer - Applications at Lifion by ADP
Our industry is starting to go through a transformational shift and we intend to lead it. As talent becomes the main differentiator between failure and success, organizations must attract, engage and develop their people more than ever. To do so, they need powerful and sophisticated tools, which take the pain out of HR management and empower employees & people leaders. That's where we come in.
Lifion by ADP is expanding our startup-style operation in NYC to accelerate technical innovation across UI, Search, Platform Technology, IaaS, Big Data, Social, etc. The concept and vision behind the strategy is "Innovate like a Startup," with the goal of delivering highly automated, intelligent and predictive solutions to the market. Our goal is to have specialized teams of superstars focused on these areas to keep pace with market trends and quickly incubate and deliver capabilities that dramatically increase the value of our solutions for clients.
As an Application Site Reliability Engineer you have come up through the ranks as a full-stack engineer and are passionate about automating all the things. You are very opinionated about patterns and practices but pragmatic in your discourse and implementation. You have experience building fully automated, highly elastic, cloud-orchestrated platforms over various IaaS providers like AWS, GCE, and/or Azure, or using on-premises solutions like OpenStack. You see containers as the future of software deployments and are familiar with how to orchestrate them with frameworks like Docker Engine, Mesos, and/or Kubernetes.
This is a combined technical leadership and hands-on development role that contributes to Lifion’s success through expertise in large-scale distributed systems. You will leverage mature existing systems to help design and create the next-generation service architecture. Qualified individuals will have a solid background in the fundamentals of computer science, distributed computing, high availability, software development process and best practices.
REQUIREMENTS
- Solve problems relating to mission-critical services and create solutions to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Deep understanding of distributed systems and the ability to lead/teach engineers to design and deliver software to improve the reliability, scalability, latency, and efficiency of our services.
- Influence and create new designs, architectures, standards, and methods for large-scale distributed systems.
- Design and implement stability and reliability best practices and proactive solutions to potential issues by collaborating with global technology partners.
- Define, track, review and report on Service Level Objectives (SLOs), Service Level Indicators (SLIs), System Availability, and the progress and outcomes related to reliability initiatives.
- Capable of making decisions and leading without oversight, and of influencing others without hierarchical authority (both upward and across parallel teams).
- Understand the operational complexity of a microservice architecture
- Perform periodic on-call duties (on an as-needed basis), ideally reducing the number of on-call incidents over time.
- Work with Incident Commanders, SREs, and Platform engineers during and after the incident recovery life-cycle.
- Identify key priority initiatives to significantly improve reliability, both proactively and reactively.
- Follow up on and publish After Action Reviews that are timely and clearly understood by technical and business personnel, and that include accurate root causes and concrete follow-up items with clear owners.
- Increase efficiency by identifying and addressing performance bottlenecks.
- Uphold a high standard of code quality through code reviews and extensive testing.
- At least 5 years of experience in software engineering or development operations
- Strong understanding of containerization, container networking, and Kubernetes
- Strong knowledge of continuous integration and continuous deployment
- Strong production experience with cloud native services (AWS, Azure, GCP)
- Strong skills with Git SCM and one or more repository managers such as GitHub, GitLab, Stash, Bitbucket, or Gerrit.
- Experience with higher-level network protocols including HTTP and REST
- Experience with lightweight development methodologies such as Agile (Scrum and/or Kanban)