David K. Rensin
Dave Rensin is a Distinguished Engineer at Google, where he serves on a small team of technical advisers to Alphabet's CFO, giving her guidance on the appropriate allocation of Google's capital to its various businesses and long-term technical investments.
He is an accomplished entrepreneur, author, and public speaker.
Authored Publications
Preview abstract
In "Elephants, Goldfish and the New Golden Age of Software Engineering," the author discusses how AI is changing knowledge work, especially software development. Written from the perspective of April 2026, the article points out that while AI speeds up coding, it can also quickly generate a lot of mistakes and messy code if it isn't carefully managed by human oversight and clear processes.
The paper outlines a practical approach to working with AI, broken down into three main sections:
* **Using AI as a Tool, Not a Toy:**
The author notes that people often get poor results by asking AI to do everything in a single prompt. Instead, users should have back-and-forth conversations with AI to question assumptions, set clear grading rules, and guide the research. The main point is that humans must still provide the final judgment; AI is simply a way to speed up and record that thinking.
* **The Elephant-Goldfish Model:**
As AI creates more code than humans can easily read, written design documents become more important than the code itself. To keep AI on track, the author suggests a two-part method:
* **The Elephant:** A long chat session where the human and AI discuss ideas and write a detailed design document *before* any code is written. This session holds all of the project's background information and decisions.
* **The Goldfish:** A brand-new AI chat session with no memory. The human asks this "goldfish" to read the design document. If the goldfish cannot understand the plan based only on that document, the document needs more details.
* Only after the design document is clear enough for the goldfish to understand does the human ask the AI to write the code based on those strict instructions.
* **Managing AI and the Future of Work:**
The author expects that regular employees will soon act like managers, overseeing multiple AI helpers. Because of this, workers need to learn basic management skills, like how to delegate tasks and set clear boundaries. Also, since AI will handle routine chores, humans will need to practice focusing for longer periods to do deeper, harder thinking. Ultimately, a worker's value will come from their planning and decision-making skills, rather than their ability to type code.
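The Elephant-Goldfish loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: `refine_design_doc`, `toy_goldfish`, `toy_elephant`, and `REQUIRED_SECTIONS` are hypothetical names, and the two toy functions stand in for real AI chat sessions, which the abstract does not specify.

```python
from typing import Callable, List

# Hypothetical checklist used only by the toy goldfish below.
REQUIRED_SECTIONS = ["Goals", "Interfaces", "Error handling"]


def refine_design_doc(
    design_doc: str,
    goldfish_review: Callable[[str], List[str]],
    elephant_revise: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> str:
    """Revise the doc until a memoryless reader reports no gaps.

    goldfish_review: sees ONLY the document text and returns open questions.
    elephant_revise: the long-context session that answers those questions
    by editing the document.
    """
    for _ in range(max_rounds):
        gaps = goldfish_review(design_doc)
        if not gaps:
            # The goldfish understood the plan from the document alone.
            return design_doc
        design_doc = elephant_revise(design_doc, gaps)
    raise RuntimeError("design doc still unclear after max_rounds")


def toy_goldfish(doc: str) -> List[str]:
    """Stand-in for a fresh session: flags sections missing from the doc."""
    return [section for section in REQUIRED_SECTIONS if section not in doc]


def toy_elephant(doc: str, gaps: List[str]) -> str:
    """Stand-in for the long session: records a decision for each gap."""
    additions = "".join(f"\n{gap}: decision recorded here" for gap in gaps)
    return doc + additions


if __name__ == "__main__":
    final_doc = refine_design_doc("Goals: ship it", toy_goldfish, toy_elephant)
    print(final_doc)
```

In this sketch, code generation would start only after `refine_design_doc` returns, which mirrors the rule that the design document must be clear to the goldfish first. The `max_rounds` cap is an added assumption so that a document that never converges surfaces as an error for the human to resolve.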
Chaos Engineering for People Systems
Chaos Conf 2019 (2019)
Preview abstract
What if we applied the principles of Chaos Engineering to our orgs? What might we learn?
Less Risk Through Greater Humanity
Cerner DevCon (2019)
Preview abstract
What are the moral underpinnings of SRE? Can we use those to reason about how to apply SRE to non-technical domains?
Preview abstract
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
* How to run reliable services in environments you don’t completely control—like cloud
* Practical applications of how to create, monitor, and run your services via Service Level Objectives
* How to convert existing ops teams to SRE—including how to dig out of operational overload
* Methods for starting SRE from either greenfield or brownfield
Building Successful SRE in Large Enterprises—One Year Later
SRECon Americas 2018 (2018)
Preview abstract
At SRECon2017 I talked about the formation of a special group of Google SREs who go into the world and teach enterprise customers—via actual production systems—how to "do SRE" in their orgs. It was new when I presented it. It's one year later and we have a lot of interesting data about how it's going. Some things that we thought would be hard, weren't. Others were nigh on impossible. We've written many postmortems and learned a bunch of lessons you can only learn the hard way.
Things you can expect to learn:
* Why it's easier to bootstrap SRE in a large traditional enterprise than in a cloud-native one!
* Things enterprises assume are true, but aren't.
* All the things we should have known better, but still learned the hard way, and how you can avoid them when bootstrapping SRE in your culture (or your customers' cultures).
Preview abstract
Do SRE and DevOps have to compete for mindshare in the operations space? No. Here's why.
How SRE relates to DevOps
Niall Richard Murphy
Liz Fong-Jones
Todd Underwood
Laura Nolan
O'Reilly and Associates (2018)
Preview abstract
DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think.
Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
Preview abstract
The general trend in software over the last several years is to give every system an API and turn every product into a platform. When these systems only served end users, their reliability depended solely on how well we did our jobs as SREs. Increasingly, however, our customers' perceptions of our reliability are being driven by the quality of the software they bring to our platforms. The normal boundaries between our platforms and our customers are being blurred and it's getting harder to deliver a consistent end user reliability experience.
In this talk we'll discuss a provocative idea—that as SREs we should take joint operational responsibility and go on-call for the systems our customers build on our platforms. We'll discuss the specific technical and operational challenges in this approach and the results of an experiment we're running at Google to address this need.
Finally, we'll try to take a glimpse into the future and see what these changes mean for the future of SRE as a discipline.
Kubernetes - Scheduling the Future at Cloud Scale
O'Reilly and Associates, 1005 Gravenstein Highway North Sebastopol, CA 95472, All
Preview abstract
Containers are taking over the world, but they aren’t full VMs and present special challenges to people who build web-scale services.
They need a lot of orchestration to run efficiently and resiliently. Their execution needs to be scheduled and managed. When they die (and they do), they need to be seamlessly replaced and re-balanced.
This introductory mini-book is designed to explain Kubernetes to IT managers, CIOs, and the otherwise cloud-curious.