The Home Depot

Senior Software Engineer - Site Reliability Engineering (Remote)

Posted 3 Days Ago

Remote

Hiring Remotely in Georgia, USA

90K-180K Annually

Senior level

With a career at The Home Depot, you can be yourself and also be part of something bigger.

Position Purpose:

The Senior Software Engineer for Site Reliability Engineering (Store Systems Enablement) builds and operates the internal platforms that keep Home Depot's store systems observable, reliable, and automated. This is a platform engineering role: you will design, develop, and maintain the tools that hundreds of development and reliability teams depend on, not just use them.

The team owns and operates a portfolio of reliability platforms, including a custom-built synthetic testing system that runs inside physical Home Depot stores, operational automation infrastructure serving dozens of teams, and the full observability stack (logging, tracing, and profiling) for Store Systems. You will write code, deploy infrastructure, tune distributed systems, and reduce operational toil through automation, including AI-assisted workflows.

Key focus areas include:

Platform Development: Build and extend internal reliability tools using Kubernetes, Terraform, and modern infrastructure-as-code patterns on Google Cloud Platform.
Observability Operations: Deploy, configure, and maintain production logging, tracing, and profiling systems. Own the SLO/CUJ platform that enables multi-window, multi-burn-rate alerting and automated tracking dashboards for RE teams across Store Systems.
Toil Reduction & Automation: Identify repetitive operational work and engineer it away. Build self-service capabilities, Copilot skills, and automation pipelines so teams can operate independently.
SLO & CUJ Enablement: Maintain and extend the platform that powers SLO and Critical User Journey definition across the organization. Educate RE teams on what good SLOs and CUJs look like, assist with onboarding, and build automation and documentation so teams can self-serve. You will have strong opinions on the right way to measure reliability and the tooling to back them up.
Synthetic Monitoring: Extend our in-store synthetic testing platform: onboard teams, enable them to write and deploy their own tests, and evolve the platform's orchestration, alerting, and self-service capabilities.
Incident Response & Resilience: Participate in on-call rotation for observability infrastructure. Lead and contribute to blameless post-mortems. Design and execute destructive tests to validate platform resilience.
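For readers unfamiliar with the multi-window, multi-burn-rate alerting mentioned under Observability Operations, the idea can be sketched roughly as follows. This is only an illustration, not a description of the team's actual platform: the window pairs and thresholds (14.4, 6, 1) are the commonly cited defaults from Google's SRE Workbook, and the names `BurnRateRule`, `burn_rate`, and `firing_alerts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that shows the burn is significant
    short_window: str  # window that shows the burn is still happening
    threshold: float   # burn rate that must be exceeded in both windows
    severity: str


# Assumed defaults (SRE Workbook style); a real platform may use other values.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_alerts(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateRule]:
    """Return rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # Hypothetical observed error ratios per window for a 99.9% SLO.
    observed = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
    for rule in firing_alerts(observed, slo_target=0.999):
        print(rule.severity, rule.long_window, rule.short_window)
```

Requiring both windows to exceed the threshold is what keeps this approach from paging on a spike that has already ended: the long window confirms that a meaningful share of the error budget was consumed, and the short window confirms the problem is ongoing.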

You will work on a small, high-impact team where the work is varied: some weeks you're writing Terraform and Helm charts, others you're debugging Loki query performance or building a Copilot skill to automate a support workflow. You will be expected to own problems end-to-end, from investigation through implementation to production deployment.

Key Responsibilities:

50% Delivery and Execution - Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide; Takes on new opportunities and tough challenges with a sense of urgency, high energy and enthusiasm; Consistently achieves results, even under tough circumstances; Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production; Takes a broad view when approaching issues, using a global lens
20% Learns and Grows - Learns through successful and failed experiments when tackling new problems; Actively seeks ways to grow and be challenged using both formal and informal development channels
20% Plans and Aligns - Collaborates with other team members in agile processes; Creates new and better ways for the organization to be successful; Works with the Product Team to ensure user stories are valuable, developer ready, easy to understand and testable; Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences; Adapts approach and demeanor in real time to match the shifting demands of different situations; Relates openly and comfortably with diverse groups of people
10% Supports and Enables - Helps grow junior engineers by providing guidance on modern software development frameworks, and leading technical discussions

Direct Manager/Direct Reports:

This position typically reports to Software Engineer Manager or Sr. Manager
This position has 0 Direct Reports

Travel Requirements:

No travel required.

Physical Requirements:

Most of the time is spent sitting in a comfortable position and there is frequent opportunity to move about. On rare occasions there may be a need to move or lift light articles.

Working Conditions:

Located in a comfortable indoor area. Any unpleasant conditions would be infrequent and not objectionable.

Minimum Qualifications:

Must be eighteen years of age or older.
Must be legally permitted to work in the United States.

Preferred Qualifications:

3-5 years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Infrastructure Engineering
Hands-on experience with Google Cloud Platform (GCP), including GKE, GCS, BigQuery, Cloud Pub/Sub, Cloud Logging, IAM, and Workload Identity. Experience with other major cloud providers (AWS, Azure) is also valuable.
Strong Kubernetes experience: deploying and managing workloads on GKE or similar managed Kubernetes services, writing and debugging Helm charts, managing namespaces, RBAC, service accounts, and troubleshooting issues
Experience with infrastructure-as-code tools, particularly Terraform for cloud resource management. Familiarity with cdk8s (CDK for Kubernetes) or similar programmatic IaC tools is a plus.
Proficiency in one or more of: Go, Python, JavaScript/TypeScript, YAML. You don't need all of them, but you should be comfortable reading Go, writing YAML and HCL, and scripting in Python or JavaScript.
Experience with observability platforms: deploying, configuring, or operating log aggregation, distributed tracing, metrics, dashboarding, or continuous profiling
Practical understanding of SLOs, SLIs, and error budgets. Experience defining Service Level Objectives, instrumenting services for SLI measurement, and configuring burn-rate alerting is highly preferred.
Experience with synthetic monitoring or performance testing frameworks (k6, Playwright, Selenium, Locust, or similar). Bonus if you've built or operated a synthetic testing platform rather than just consumed one.
Familiarity with incident management and on-call practices: blameless post-mortems, runbook development, and incident communication
Experience with CI/CD pipelines using GitHub Actions, Spinnaker, ArgoCD, or similar. Understanding of deployment strategies (blue/green, canary, rolling).
Experience with automation to reduce operational toil: building self-service tooling, writing scripts or bots to handle repetitive tasks, or developing internal developer platforms
Familiarity with AI-assisted development tools (GitHub Copilot, LLM-based automation, MCP servers) is a plus. We are actively building AI skills and automation into our workflows.
Experience writing clear technical documentation, runbooks, and onboarding guides
Comfort working on a small team with broad ownership: you will context-switch between writing code, debugging production systems, and onboarding partner teams

Minimum Education:

The knowledge, skills and abilities typically acquired through the completion of a bachelor's degree program or equivalent degree in a field of study related to the job.

Preferred Education:

No additional education

Minimum Years of Work Experience:

Preferred Years of Work Experience:

No additional years of experience

Minimum Leadership Experience:

None

Preferred Leadership Experience:

None

Certifications:

None

Competencies:

Global Perspective
Manages Ambiguity
Nimble Learning
Self-Development
Collaborates
Cultivates Innovation
Situational Adaptability
Communicates Effectively
Drives Results
Interpersonal Savvy

For California, Colorado, Connecticut, Rhode Island, Nevada, New York City, Ithaca (NY), Westchester County (NY), and Washington residents:

The pay range for this position is between $90,000.00 and $180,000.00.
