Dive Into the World of Site Reliability Engineering

Written by Stephen Gossett
Published on Mar. 31, 2020
Image: Shutterstock

John Turner wanted to be a rock star. Not a rock-star developer, not a rock-star software engineer, not anything having to do with the weird corporate-speak contortion of “rock star.” A legit rock star.

A metal lover, he fine-tuned his chops studying jazz guitar in Atlanta, then went to work as a professional workaday musician after graduation. But prestige industries sometimes feel less prestigious from the inside, and he soon found himself feeling “very dissatisfied with the day-to-day life of a working musician.”

“It’s not as glamorous as people might think,” he said.

That disillusionment led him to New York University’s music technology graduate program, where he could bridge music with lifelong flirtations with programming — including childhood dabblings in Hello, World! and Delphi. But it was an assignment that had nothing to do with music — building a rudimentary spellchecker with C — that may have been most formative.

“It was that taste of pure creation that I eventually fell in love with,” Turner said. “And it’s the thing that keeps me writing software day to day.”

The software he writes now, as a site reliability engineer at website-hosting service Squarespace, is more complicated by orders of magnitude. But before we get into the day-to-day details, it’s worth clarifying:

 

Just What Is a Site Reliability Engineer?

In a manner of speaking, SREs are part medics, part streamliners. They take shifts in the company’s on-call rotation, during which they act as designated responder, available to manually intervene if the infrastructure system starts to show symptoms. Off rotation, they’re spending a lot of time writing code, including a significant amount of automation tooling. Automation helps cut out toil — or the monotonous, routine tasks that would have otherwise become time-sucks for developers — which in turn helps stabilize the system.

“Computers are really good at doing automated tasks over and over — much better than most humans,” Turner said. The critical and creative freedom such automation affords becomes a boon for both reliability and incident response. “Having humans not be stressed out because they’ve just had to engage in monotonous toil turns out to be a really helpful thing — that, to me, is the big industry-wide takeaway from SRE,” he said.

But even as the title becomes more common across development teams, other SRE takeaways are still germinating. Site reliability engineering, as both concept and practice, comes from a long tradition that also includes code-review tool Gerrit, machine learning platform TensorFlow and the very Kubernetes system that’s often so closely tied with SRE work: It originated internally at Google before finding a foothold in the broader development world.

But most companies are smaller than Google. (All except 16, to be precise, according to Forbes.) Indeed, when Google in 2016 published the field’s ur-text, a collection of essays by then-current and former SREs, editors caveated their findings somewhat by underscoring the company’s unique status:

“It is no surprise that [SRE] arose in the fast-moving world of web services, and perhaps in origin owes something to the peculiarities of our infrastructure,” they wrote.

Translation, as it were, is still in progress. “SRE outside of Google is still very much in its infancy, which is very interesting,” Turner said. “Some of what works for Google doesn’t or can’t work for other companies.” A designated SRE may not make sense, for instance, at a 20-person startup, he said.

Squarespace has experienced both, having grown from scrappy startup to a 500-plus-employee, if not quite Alphabet-sized, enterprise. It first hit the gas on expansion roughly between 2010 and 2014, a stretch that included the company’s legendary, diesel-lugging efforts to keep thousands of customers’ sites running after Hurricane Sandy.

Squarespace’s interpretation of SRE sees Turner spending the majority of his time coding, including writing tools that make it easier for engineers to interact with the infrastructure, and the all-important, aforementioned automation tools.

How else does SRE look outside of Google? “It turns out it’s pretty wide and varied,” Turner said. Some operations teams have simply rebranded with no real functional change. (SRE is, in some part, a marketing term, Turner noted.) Others approach Google’s example like a cafeteria, picking what works and adapting what doesn’t.

That definitional question helped spawn another key book, Seeking SRE. In it, one contributor, Coburn Watson, who’s now the head of infrastructure and SRE at Pinterest, argues for a context-, rather than control-, driven approach to improving availability of microservices.

In effect, skip micro-managerial direction and guide engineering teams by providing them with plenty of information, from trended availability data to high-level business-review documents. “That’s an interesting way to think about it — providing context to the rest of the engineers to help them make good decisions around what makes the service more reliable,” Turner said.

 

Squarespace headquarters, in New York City. | Photo: Squarespace

How Do You Measure Site Reliability?

Reliability is a quality, but it’s also a metric. Turner describes a scenario not far removed from a toddler repeatedly asking “why” — except, you know, helpfully. Say your service runs on a single server. Does reliability simply mean the service is up? “What if it’s up, but it’s only serving errors?” Turner said. “Or what if it’s up and not serving errors, but not serving what the user expected? You can keep asking this series of questions until you arrive at a synthetic measurement that approximates what our user wants our service to do.”

That measurement — a service level indicator (SLI), in SRE parlance — lets you then establish a goal — or service level objective (SLO). “It might be that we want 99.9 percent of every request to return known good data,” Turner said. “Once you have a percentage, you essentially have an objective.”
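To make the SLI-to-SLO step concrete, here is a minimal sketch in Python. It assumes the simplest possible indicator, the share of requests that returned known-good data, and the function names and traffic numbers are hypothetical, not anything Squarespace has described.

```python
def availability_sli(good_requests, total_requests):
    """Return the fraction of requests that returned known-good data (the SLI)."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute an SLI")
    return good_requests / total_requests


def meets_slo(sli, objective=0.999):
    """Compare the measured indicator with the objective (the SLO)."""
    return sli >= objective


# Hypothetical traffic for illustration: 10,000 requests, 7 of them bad.
sli = availability_sli(good_requests=9_993, total_requests=10_000)
print(f"SLI: {sli:.2%}; meets the 99.9 percent objective: {meets_slo(sli)}")
```

The hard part in practice is the step Turner describes, deciding what counts as a "good" request. Once that is settled, the comparison itself is a single line.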

(Along with SLIs and SLOs, a third foundational pillar of site reliability engineering is SLAs — or service level agreements. That’s where you often get into legalese, so many development teams, including the one at Squarespace, are happy to leave that to the lawyers.)

That’s SRE’s central concept: reliability as a measurable data point, “not just a gut feeling or a personal perspective,” and the concrete steps taken to improve that point.

 

A Day in the Life of a Site Reliability Engineer

So how, exactly, does that manifest day to day? For laypeople, “site reliability” likely conjures up images of putting out digital fires, fixing outages — or, preferably, precursors to outages. But Turner spends some 70 percent of his time writing code — much of it automation code.

“To take a trivial example, if a computer needs to be restarted, one way is to hit the virtual restart button,” Turner said. “Another way is to write something that does it from your computer. And then another way is to do something that automatically monitors the machine, to see if it needs to be restarted, and just restart it without any humans. So it’s that level of writing automation tooling.”
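The third option Turner describes, monitoring a machine and restarting it with no human involved, might look something like the sketch below. It is an illustration under stated assumptions: `is_healthy` and `restart` are hypothetical callables that stand in for a real health probe and a real restart API, and the threshold of three failed checks is arbitrary.

```python
import time


def watch_and_restart(is_healthy, restart, checks,
                      failures_before_restart=3, interval_s=0.0):
    """Poll a health check and restart after several consecutive failures.

    `is_healthy` and `restart` are hypothetical callables supplied by the
    caller. Returns the number of restarts that were triggered.
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_restart:
                restart()
                restarts += 1
                consecutive_failures = 0  # give the machine a fresh start
        time.sleep(interval_s)
    return restarts


# Simulated probe results: one healthy check, three failures, then recovery.
probe_results = iter([True, False, False, False, True, True])
restart_log = []
restarts = watch_and_restart(
    is_healthy=lambda: next(probe_results),
    restart=lambda: restart_log.append("restarted"),
    checks=6,
)
print(f"restarts triggered: {restarts}")
```

Requiring several failures in a row, rather than restarting on the first one, is a design choice made here so that a single flaky check does not bounce a healthy machine.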

Each day also typically brings a one-on-one, either with a direct report, a colleague or an earlier-career developer.

One-on-ones with other staff engineers can be particularly fruitful for clearing technical roadblocks. It’s a time to ask: “Have you seen a similar [issue]? How did you fix it? Do you know other people who are doing this across the organization that I can talk to?”

It’s also an opportunity to gather some high-level career advice. “It might be: ‘Hey, I’m getting really interested in distributed systems. What should I be doing to work on these more?’” Turner said.

Likewise, check-ins with more entry-level contributors range from double-checking work through pair code reviews to lending advice on technical interests.

So what are the kids into? You could say that the container revolution has had a long tail. “A lot of people are still very interested in Kubernetes,” said Turner, whose team works closely with container technology. “It’s certainly not the newest tech on the market anymore, but it’s still very exciting, complex and will likely be a building block for important systems down the road.”

The better young programmers grasp the technical details of containers, the more valuable they’ll be, he added.

 

Squarespace headquarters. | Photo: Squarespace

Cross-Departmental Collaboration

That points to the importance of leadership. SREs tend to work across departments more than traditional software engineers do. As Turner put it, “it turns out that reliability is often a fairly complex problem,” and being able to drive the conversation can be key. “You have to be able to zoom in on any particular developer’s portion of the product, and then see how that might affect the reliability of other parts of the products,” he said.

That leadership aspect manifests away from the codebase, too. At the beginning or end of quarters, Turner and other senior engineers and contributors are either running postmortems or sussing out the technical details of new projects. “What do the solutions look like? What are the dependencies between different types of projects?” he said.

Turner also spends time contributing to architecture reviews, or “making sure [product] designs are technically sound and that all contingencies have been thought through.” Members from across the engineering team contribute; Turner represents the reliability/infrastructure side, “especially if it has to do with containers or Kubernetes, which is sort of my expertise,” he said.

That said, the day-to-day changes complexion when Turner is on PagerDuty call. You don’t want to be occupied with anything too demanding when the prospect of interruption is forever looming.

“When I’m not being paged, I’ll do smaller tasks like writing documentation, helping people with technical guidance, or finding ways to make the next person’s on-call shift a little bit easier: writing automation, changing alerting windows, making visualizations of system properties look a little bit nicer, whatever I can do to make the next person’s life a little easier,” he said.

Shifts are 24/7 and run for one week, always with a secondary on-call, in case the designated PagerDuty holder has an emergency. Squarespace operates what’s known as a tiered escalation policy, a failsafe that notifies a broader range of team members if the designated responders are incapacitated.
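The idea behind a tiered escalation policy can be sketched in a few lines of Python. This is a simplified illustration, not PagerDuty's API or Squarespace's actual configuration: the tier names are made up, and `acknowledges` is a hypothetical stand-in for a paging system's check on whether someone responded in time.

```python
def escalate(tiers, acknowledges):
    """Walk the tiers in order and return the first responder who acknowledges.

    `tiers` is a list of lists of responder names, with the primary tier first.
    `acknowledges` is a hypothetical callable that returns True if the named
    responder picked up the page.
    Returns (tier_index, responder), or None if nobody acknowledged.
    """
    for tier_index, responders in enumerate(tiers):
        for responder in responders:
            if acknowledges(responder):
                return tier_index, responder
    return None


# Hypothetical rotation: primary, secondary, then a broader group.
tiers = [["primary"], ["secondary"], ["team-lead", "engineering-manager"]]
unavailable = {"primary"}

result = escalate(tiers, acknowledges=lambda name: name not in unavailable)
print(result)
```

With the primary responder unavailable, the page falls through to the secondary, and each later tier widens the set of people who can be reached.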

Rules are also in place to avoid burnout. If you have to respond to alerts during the night, you can sleep in in the morning; if you’re awakened by alerts on three consecutive nights, you’re off rotation. “Health and life concerns are really important to acknowledge,” Turner said. “People are not an expendable resource.”
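A rule like the three-night limit is simple enough to check automatically. The sketch below assumes a hypothetical record of which recent nights included an overnight page, oldest first; how such a record would be gathered is not something the article describes.

```python
def should_come_off_rotation(nights_paged, limit=3):
    """Return True if the most recent `limit` nights all had overnight pages.

    `nights_paged` is a list of booleans, oldest night first (a hypothetical
    record format chosen for this illustration).
    """
    if len(nights_paged) < limit:
        return False
    return all(nights_paged[-limit:])


print(should_come_off_rotation([False, True, True, True]))
print(should_come_off_rotation([True, True, False, True]))
```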

 

Common area at Squarespace headquarters. | Photo: Squarespace

Wait, Isn’t This Just Like DevOps?

From a bird’s-eye view, much of SRE is about automation and streamlining interactions between infrastructure and operations. That probably sounds similar to another dev-world practice that graduated from buzz phrase to broad adoption: DevOps.

To be sure, in Seeking SRE, a Microsoft cloud advocate plainly states that “reliability engineering and DevOps aim to solve the same problem set” (keeping digital services up, while adding improvements), while a Deswik Mining release engineer opined that “SRE and DevOps have a wide scope of overlap, but they are distinct ideas.”

One of the most prominent thought leaders in the field, former Google SRE Liz Fong-Jones, once likened SRE to a concrete class that implements an interface in a programming language: “a prescriptive way of accomplishing that [DevOps] philosophy.” In other words, where you draw the line may differ.
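For readers who don't program, the class-and-interface analogy can be spelled out in a short Python sketch. The method names are illustrative assumptions drawn from themes in this article (cutting toil, measuring reliability), not an official list of DevOps principles.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': goals stated without saying how to reach them."""

    @abstractmethod
    def cut_toil(self) -> str: ...

    @abstractmethod
    def measure_reliability(self) -> str: ...


class SRE(DevOps):
    """One concrete, prescriptive way of meeting those goals."""

    def cut_toil(self) -> str:
        return "write automation tooling for routine tasks"

    def measure_reliability(self) -> str:
        return "define SLIs and SLOs and track them"


practice = SRE()
print(practice.cut_toil())
print(practice.measure_reliability())
```

The abstract `DevOps` class cannot be used on its own; it only says what must be done. `SRE` is one of potentially many classes that fill in the how.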

For Turner, it boils down to an aversion to silos. “It all starts from the notion that there’s a foundation of trust between teams and the best way to work is cross-functional,” he said. “There aren’t necessarily formalized handoffs between product and operation and security. Instead, you have all these teams working together, sort of all at once.”

“There’s a lot of cross-pollination there,” between SRE and DevOps, he said.

That includes the container orchestration systems in which Turner specializes at Squarespace. From containers, our conversation veers a bit into programming languages and ecosystem dependency. Go offers an across-the-board ease-of-use, but it’s not very expressive, he mused. Haskell is more interesting, though maybe not a great fit to use professionally, he said. Rust, on the other hand, seems just a killer app away from seeing greater adoption for “both systems-level programming and higher-level application programming.”

“I’ve always been a bit of a programming-language nerd,” he said — a long way from primitive spell-checkers and MAX/MSP patches but still as opinionated and passionate.
