# Site Reliability Engineer: Google's Answer to Keeping the Internet Running
Published July 17, 2026 · 12 min read · Tags: SRE, site reliability engineering, infrastructure, career, DevOps, engineering
Site Reliability Engineering was invented at Google in 2003, stayed largely internal for over a decade, and reached the wider industry through a book published in 2016. The SLO and error budget framework it introduced changed how reliability is managed at companies from Netflix to JPMorgan.
Site Reliability Engineering was invented at Google in 2003 and did not have a name that most of the industry recognised until Google published a book about it in 2016. That book — simply titled Site Reliability Engineering — was made freely available online and was read by hundreds of thousands of engineers. Within a few years, SRE teams had been established at Netflix, Spotify, Dropbox, LinkedIn, Amazon, Microsoft, and most major technology companies. The discipline went from Google-internal practice to industry standard in less than a decade.
## The Origin of SRE
In 2003, Ben Treynor Sloss was a software engineer at Google tasked with managing the production environment — keeping Google's services running reliably at a scale that no organisation had previously operated. He made a decision that defined an entire discipline: he hired software engineers, not traditional systems administrators, to run operations.
Treynor Sloss later described the philosophy in a single sentence: "SRE is what happens when you ask a software engineer to design an operations team."
The reasoning was pragmatic. Traditional operations teams maintained systems through manual intervention — when something broke, they fixed it by hand. At Google's scale, manual intervention could not keep pace with the volume and complexity of failures. Software engineers could write code to handle failures automatically. The solution to an operations problem at scale was not more operations staff — it was software that eliminated the need for manual operations.
This insight is the intellectual foundation of SRE. Everything else — the metrics framework, the on-call structure, the toil management practices — flows from it.
## The Core Framework: SLIs, SLOs, and Error Budgets
The most important conceptual contribution SRE made to the industry is a rigorous framework for defining and managing reliability. It has three components:
**Service Level Indicators (SLIs)** are quantitative measurements of service behaviour. Availability (what percentage of requests succeed), latency (how quickly requests are served), error rate, and throughput are common SLIs. The key is that SLIs are specific, measurable, and directly tied to user experience.
**Service Level Objectives (SLOs)** are the targets for those SLIs. An SLO might state that 99.9% of requests should succeed (availability) and that 95% of requests should complete in under 200 milliseconds (latency). SLOs are internally agreed targets — they define what "reliable" actually means for a specific service.
**Error Budgets** are the most operationally powerful concept. If a service has an SLO of 99.9% availability, it has a budget of 0.1% unavailability: roughly 8.76 hours per year, or about 43 minutes in a 30-day month. The error budget is the acceptable quantity of unreliability. When the budget is being consumed faster than planned, the SRE and development teams slow down feature releases and focus on reliability work. When plenty of budget remains, the team can move faster and take more risk.
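The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration, not anything taken from Google's tooling: the function names `error_budget` and `budget_remaining` are hypothetical, and the SLO is assumed to be expressed as a fraction (0.999 for 99.9%).

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for 99.9% (an assumed convention).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative number once the budget has been overspent.
    """
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        raise ValueError("no budget: slo is 100% or there were no requests")
    return 1.0 - failed_requests / allowed_failures


yearly = error_budget(0.999, timedelta(days=365))
monthly = error_budget(0.999, timedelta(days=30))
print(f"per year: {yearly.total_seconds() / 3600:.2f} hours")
print(f"per 30-day month: {monthly.total_seconds() / 60:.1f} minutes")
```

For a 99.9% SLO this works out to about 8.76 hours per year and 43.2 minutes per 30-day month. The request-based variant is often more practical than wall-clock downtime, because a partial outage that fails one request in ten spends the budget proportionally.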
This framework converts reliability from a vague aspiration ("be more reliable") into a quantitative balance between reliability and velocity. Development teams understand that burning through the error budget by shipping buggy code costs them release velocity. SRE teams understand that over-investing in reliability beyond the SLO wastes engineering resources. The error budget creates a shared economic incentive.
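One common way to make "consumed faster than planned" concrete is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that definition; the names `burn_rate` and `release_posture` and the threshold of 1.0 are illustrative choices, and real policies usually combine several windows and thresholds.

```python
def burn_rate(slo: float, total_requests: int, failed_requests: int) -> float:
    """Observed error rate divided by the error rate the SLO permits.

    A sustained value above 1.0 means the budget will run out before the
    SLO window ends; below 1.0 means there is room to spare.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1.0 - slo)


def release_posture(rate: float, slow_down_at: float = 1.0) -> str:
    """Map a burn rate to a coarse decision (threshold is an assumption)."""
    if rate > slow_down_at:
        return "prioritise reliability work"
    return "ship features"


calm = burn_rate(0.999, total_requests=100_000, failed_requests=50)
rough = burn_rate(0.999, total_requests=100_000, failed_requests=200)
print(release_posture(calm), "|", release_posture(rough))
```

The value of encoding the rule, even this crudely, is that the decision to slow releases stops being a negotiation and becomes a reading both teams agreed to in advance.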
## Toil: What SREs Are Supposed to Eliminate
The Google SRE book introduced the concept of "toil" — manual, repetitive, automatable work that scales with service growth rather than with engineering investment. Restarting a service that crashes, manually provisioning capacity before a traffic spike, updating configuration files by hand — these are toil.
Google's SRE practice establishes a rule: SREs should spend no more than 50% of their time on toil. If toil exceeds 50%, the team either needs more headcount or — the preferred solution — needs to automate the toil away. The other 50% of time should be spent on engineering work that permanently reduces future toil.
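A team can only enforce the 50% cap if it measures toil. The sketch below assumes a team logs hours against named work categories and tags some of them as toil; the category names, the `toil_fraction` helper, and the sample week are all hypothetical.

```python
from typing import Mapping, Set

TOIL_CEILING = 0.5  # the 50% cap described above


def toil_fraction(hours: Mapping[str, float], toil_categories: Set[str]) -> float:
    """Share of logged hours spent in categories tagged as toil."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for category, h in hours.items() if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {
    "manual restarts": 9.0,
    "ticket-driven provisioning": 14.0,
    "automation project": 12.0,
    "design reviews": 5.0,
}
fraction = toil_fraction(week, {"manual restarts", "ticket-driven provisioning"})
over_ceiling = fraction > TOIL_CEILING
print(f"toil: {fraction:.1%}, over ceiling: {over_ceiling}")
```

In this made-up week, 23 of 40 hours are toil, so the engineer is over the ceiling and the restarts or the provisioning tickets are candidates for automation.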
This principle is what distinguishes SRE from traditional operations. A traditional sysadmin team that manages scaling by manually provisioning servers will need to grow proportionally with traffic. An SRE team that automates scaling will not — the automation absorbs the growth without additional headcount. This is the compounding return on engineering investment in reliability.
## What an SRE Does Day to Day
**Incident response and post-mortems.** When production systems fail, SREs are on call to diagnose and restore service. After incidents, SRE teams conduct blameless post-mortems — structured analysis of what happened, what the contributing factors were, and what systemic changes will prevent recurrence. The "blameless" aspect is important: SRE post-mortems focus on process and system failures, not individual fault. This is not idealism — it is the mechanism by which organisations learn from failures rather than suppressing them.
**Capacity planning.** SREs model traffic patterns and provision infrastructure to handle expected and unexpected load. They define and test the limits of systems so that spikes in demand do not become outages.
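As a rough illustration of the provisioning side, the sketch below sizes a fleet from a forecast peak. It assumes a stateless service whose per-instance throughput has been measured by load testing; the 30% headroom, the single spare instance, and the name `instances_needed` are illustrative assumptions, not recommendations.

```python
import math


def instances_needed(
    peak_rps: float,
    rps_per_instance: float,
    headroom: float = 0.3,
    spare_instances: int = 1,
) -> int:
    """Instances required to serve the forecast peak plus a safety margin.

    `headroom` covers unexpected load above the forecast; `spare_instances`
    covers losing machines while at peak.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    serving = math.ceil(peak_rps * (1.0 + headroom) / rps_per_instance)
    return serving + spare_instances


fleet = instances_needed(peak_rps=12_000, rps_per_instance=500)
print(f"provision {fleet} instances")
```

With these numbers the forecast peak plus headroom needs 32 serving instances, and the spare brings the fleet to 33.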
**SLO definition and review.** SREs work with product and engineering teams to establish SLIs and SLOs for services, then track whether those objectives are being met. A service that consistently beats its SLO by a wide margin is a signal to review the target: either the SLO is looser than users actually need, or the team is investing more in reliability than the SLO calls for. Services that regularly miss SLOs need reliability work.
**Reliability engineering.** SREs design and implement technical improvements that increase the inherent reliability of systems: redundancy (multiple instances so failures do not cause outages), graceful degradation (returning partial functionality rather than complete failures), circuit breakers (stopping requests to a failing downstream service before it cascades), and chaos engineering (deliberately injecting failures to validate recovery mechanisms work as designed).
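Of the techniques listed, the circuit breaker is the easiest to show in code. The sketch below is a minimal single-threaded version that counts consecutive failures; the class and parameter names are illustrative, and production code would more likely use an existing library and add thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the failing service.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and treats `CircuitOpenError` as the cue to serve a degraded response rather than wait on a service that is already failing.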
**Observability infrastructure.** SREs build and maintain the monitoring, alerting, and tracing systems that make production behaviour visible. The standard open-source stack is Prometheus for metrics collection and storage, Grafana for dashboards and alerting, and OpenTelemetry for standardised instrumentation across services. Commercial alternatives include Datadog, Dynatrace, and New Relic.
## How SRE Differs from DevOps
The two disciplines are complementary but distinct. DevOps is a cultural and organisational philosophy — a set of practices and values aimed at breaking down the wall between development and operations. SRE is a specific job function with defined responsibilities, metrics, and practices.
Google describes the relationship this way: "SRE is a specific implementation of DevOps with some idiosyncratic extensions." DevOps answers "should development and operations collaborate?" (yes). SRE answers "how, specifically, do you run reliability at scale?" The SLO/error budget framework, the toil metric, and the post-mortem process are SRE's specific answers to questions that DevOps raises but does not answer prescriptively.
Organisations that adopt DevOps principles may or may not have dedicated SRE teams. Large organisations with complex production environments tend to have both: DevOps practices embedded in engineering teams and SRE functions centralised in platform or reliability organisations.
## The Job Market for SREs
Glassdoor's 2024 data places the median SRE salary in the United States at approximately $140,000 — significantly higher than most DevOps Engineer roles, reflecting the specialised nature of the work. SRE roles at major technology companies (Google, Meta, Amazon, Microsoft, Netflix) and at financial services firms that have adopted SRE practices (JPMorgan, Goldman Sachs) reach $180,000-$220,000 in total compensation.
LinkedIn reported over 30,000 SRE job postings in 2023, with the role appearing in the top 15 most-posted engineering jobs. Demand has grown consistently since 2018 as the SRE model has spread beyond the large technology companies that pioneered it into financial services, healthcare, retail, and government technology.
The backgrounds most common among SREs are software engineering and systems administration — the role demands comfort with both writing code and understanding infrastructure. Certifications carry less weight in SRE hiring than demonstrable experience with production systems, though the Certified Kubernetes Administrator (CKA) and cloud provider reliability-focused certifications are valued.