Lead Site Reliability Engineer - Incident Commander

ADP | Hybrid

Sorry, this job was removed at 11:33 a.m. (EST) on Saturday, March 28, 2020

SUMMARY

Our industry is starting to go through a transformational shift and we intend to lead it. As talent becomes the main differentiator between failure and success, organizations must attract, engage and develop their people more than ever. To do so, they need powerful and sophisticated tools, which take the pain out of HR management and empower employees & people leaders. That's where we come in.

Lifion by ADP is expanding our startup-style operation in NYC to accelerate technical innovation across UI, Search, Platform Technology, IaaS, Big Data, Social, etc. The concept and vision behind the strategy is "Innovate like a Startup", with the goal of delivering highly automated, intelligent and predictive solutions to the market. Our goal is to have specialized teams of superstars focused on these areas to keep pace with market trends and quickly incubate and deliver capabilities that dramatically increase the value of our solutions for clients.

The incident commander (IC) is responsible for managing an incident to its resolution as quickly as possible, coordinating with teams, communicating outward and planning next steps. During a major outage or incident, the IC must make decisions, delegate to appropriate teams, and create multiple backup plans in order to minimize the time to resolution.

The IC will have superb listening and delegation skills, delegating tasks to the appropriate teams and using their expertise as input for next steps. This person must be able to weigh alternatives and keep multiple paths open to avoid delays in moving the restoration effort forward.

The IC is also responsible for keeping a clear line of communication to senior stakeholders and those not immediately involved in the triage effort. Additionally, the IC will work with the teams to document and analyze the issue in a post-mortem to prevent future incidents.

REQUIREMENTS

Excellent communication skills, both verbal and written
A high-level knowledge of incident management best practices and systems
Problem-solving skills
The ability to make quick, confident decisions
Listening and synthesis skills
Previous experience with major incidents (either as a participant or an observer)
Leadership skills: the ability to take command in a high-stress situation
Solve problems relating to mission-critical services and create solutions to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions
Understand the operational complexity of a microservice architecture
Increase efficiency by identifying and addressing performance bottlenecks
Define, track, review and report on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes of reliability initiatives
Capable of decision making and leadership without oversight, and of influencing others without hierarchy (both upwards and laterally)
Ability to manage incidents and keep everyone calm and focused on solving issues, including removing anyone who distracts from the immediate service restoration, regardless of that person's level
Plan backups, rollbacks, and next steps before and during an incident

PREFERRED QUALIFICATIONS

At least 5 years of combined experience in software engineering and automated test engineering
Fluency in one or more languages, such as Go, JavaScript, or Python
Strong production experience with cloud native services (AWS, Azure, GCP)
Familiarity with Git SCM and one or more repository managers such as GitHub, GitLab, Stash, Bitbucket, or Gerrit
Experience with lightweight development methodologies such as Agile (Scrum and/or Kanban)
