David K. Rensin
Dave Rensin is a Sr. Director of Engineering at Google, where he serves on a small team of technical advisers to Alphabet's CFO, giving her guidance on the appropriate allocation of Google's capital to its various businesses and long-term technical investments.
He is an accomplished entrepreneur and author.
Authored Publications
Less Risk Through Greater Humanity
Cerner DevCon (2019)
What are the moral underpinnings of SRE? Can we use those to reason about how to apply SRE to non-technical domains?
Chaos Engineering for People Systems
Chaos Conf 2019 (2019)
What if we applied the principles of Chaos Engineering to our orgs? What might we learn?
Do SRE and DevOps have to compete for mindshare in the operations space? No. Here's why.
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
* How to run reliable services in environments you don’t completely control—like cloud
* Practical applications of how to create, monitor, and run your services via Service Level Objectives
* How to convert existing ops teams to SRE—including how to dig out of operational overload
* Methods for starting SRE from either greenfield or brownfield
How SRE relates to DevOps
Niall Richard Murphy
Liz Fong-Jones
Todd Underwood
Laura Nolan
O'Reilly and Associates (2018)
DevOps and Site Reliability Engineering (SRE) have emerged in recent years as solutions for managing operations in IT and software development. Is one method better than the other? Will one of them eventually win out? This article explains why these two disciplines—in both practice and philosophy—are much more alike than you may think.
Humans have been thinking about better ways to operate things for millennia, but despite all of this effort and thought, running enterprise software operations well remains elusive for many organizations. In this article, IT operations experts provide the key tenets of DevOps and SRE, compare and contrast the two, and explain the incentives necessary to successfully adopt either approach.
Building Successful SRE in Large Enterprises—One Year Later
SRECon Americas 2018 (2018)
At SRECon2017 I talked about the formation of a special group of Google SREs who go into the world and teach enterprise customers—via actual production systems—how to "do SRE" in their orgs. It was new when I presented it. It's one year later and we have a lot of interesting data about how it's going. Some things that we thought would be hard, weren't. Others were nigh on impossible. We've written many postmortems and learned a bunch of lessons you can only learn the hard way.
Things you can expect to learn:
* Why it's easier to bootstrap SRE in a large traditional enterprise than in a cloud-native one
* Things enterprises assume are true, but aren't
* All the things we should have known better but still learned the hard way, and how you can avoid them when bootstrapping SRE in your culture (or your customers' cultures)
The general trend in software over the last several years is to give every system an API and turn every product into a platform. When these systems only served end users, their reliability depended solely on how well we did our jobs as SREs. Increasingly, however, our customers' perceptions of our reliability are being driven by the quality of the software they bring to our platforms. The normal boundaries between our platforms and our customers are being blurred and it's getting harder to deliver a consistent end user reliability experience.
In this talk we'll discuss a provocative idea—that as SREs we should take joint operational responsibility and go on-call for the systems our customers build on our platforms. We'll discuss the specific technical and operational challenges in this approach and the results of an experiment we're running at Google to address this need.
Finally, we'll try to take a glimpse into the future and see what these changes mean for the future of SRE as a discipline.
Kubernetes - Scheduling the Future at Cloud Scale
O'Reilly and Associates, 1005 Gravenstein Highway North Sebastopol, CA 95472, All
Containers are taking over the world, but they aren't full VMs, and they present special challenges to people building web-scale services.
They need a lot of orchestration to run efficiently and resiliently. Their execution needs to be scheduled and managed. When they die (and they do), they need to be seamlessly replaced and re-balanced.
An introductory mini-book designed to explain Kubernetes to IT managers, CIOs, and the otherwise cloud-curious.