Operating services reliably and securely at scale with SRE

5-2-2024

By Felix Speulman

Site Reliability Engineering (SRE) has been a go-to method for improving highly scaled platforms’ reliability, efficiency, and security for almost two decades. Yet the methodology is not as well known as DevOps, to which it is closely related. Vladyslav Ukis, Head of R&D at Siemens Healthineers, has authored a book on the topic. We asked him about SRE, digital transformation, DevOps, cybersecurity, and how to communicate SRE’s benefits to upper management effectively.

How does SRE fit in your professional journey?
“I’ve been driving many transformations within Siemens Healthineers related to introducing continuous delivery, SRE, developer relations, portfolio management, and data-driven engineering management; all the disciplines necessary when you tackle large digital transformations in a big enterprise. So I gained lots of experience there, both in the platform space and the application space.”

“About a decade ago, we started building the Siemens Healthineers digital health platform – teamplay – our first software-as-a-service product. Before that, all the software was sold as a product, not as a service. It was also our first cloud-based product. As a company, we started learning what it means to provide software as a service instead of providing software as a product. And with that came the realization that we needed to operate the service to provide a quality of service that can be sold.”

“We started having our fair share of trouble with the operation of the services, especially as the demand increased. We realized that how we were operating the services would not bring us into the future. And at some point, I started learning about SRE. Many companies claimed success in operations by using it. It was new territory for us. We were eager to try something new.”

“After gaining lots of experience with our SRE implementation, I started publishing articles about the various aspects of SRE implementation on infoq.com. And at some point, a publisher came along and asked, ‘Do you want to expand that article into a book?’ I said, ‘Well, I’m not a writer. And I don’t have time.’ Then Corona came. And then I suddenly realized I’ve got time because I don’t have to commute, which is an hour one way. This is how the book came to be.”

Site Reliability Engineering can be a confusing term. What does it cover, what isn’t it for, and who can employ it?
“SRE is a discipline within computer science that tells you how to operate services reliably at scale. So, if you are operating services, SRE is likely a very good methodology for you to follow to operate them reliably. Google wrote up and brought together the operations practices they had under the umbrella of SRE. They are the fathers of the discipline.”

“If you’ve got lots of legacy, you are surely on some path toward digital transformation. You are likely to transform what you’ve got into something more manageable. And with that, you usually deploy services in the cloud, which you then offer as a service to your users. Once you’re offering services in the cloud, sooner or later, you’ll start thinking about how to operate them. This is where SRE comes in handy because it’s an opinionated methodology for operating services reliably at scale.”

SRE is not as well-known as DevOps. Could you explain the differences between the two?
“They are related. DevOps is an overarching philosophy of making developers and operations engineers work together and start delivering products much more frequently than before. But DevOps doesn’t tell you the specifics about how to do this. It doesn’t tell you the How – this is where SRE comes in. SRE is an opinionated implementation of the DevOps philosophy in the operations arena. It gives specific guidance on what each party should do to achieve that collaboration. DevOps is the philosophical foundation: Developers and operations engineers should not work in a siloed manner but must work together. SRE comes on top of this foundation.”

Can you clarify what SRE transformation is and how it relates to digital transformation?
“As part of a digital transformation, you would cover many things, starting with the business model and ending somewhere in operations. And where you are taking operations, this is where SRE comes in. Typically, what companies don’t have when they haven’t started transforming operations as part of their digital transformation is development teams with operations capabilities. Their development teams are just for developing software. And their operations teams typically just operate the services. It’s a typical divide between development and operations.”

“As part of the digital transformation, you start moving towards DevOps. And that means you start bringing operations aspects into the development teams, and you start bringing development aspects into the operations teams. Why? Under the DevOps and SRE philosophy, you want developers and operations engineers to work together. Operations engineers are not necessarily fully responsible for operating the services anymore. They are responsible for providing a framework or SRE infrastructure for the development teams to enable them to operate the services in production.”

Vladyslav Ukis

“On the development side, you then transform the developers; they don’t just develop and then hand over the services to the operations team, but rather, they develop and operate the services as well. To make their lives easier, they use the SRE infrastructure provided by the operations teams to do operations efficiently. That said, they won’t necessarily have to wake up in the middle of the night to attend to their services. This can be handled differently based on agreements with the operations teams.”

“So it’s a big shift because, in the operations teams, you suddenly need development skills so they can build the SRE infrastructure that enables development teams to do operations. The operations teams usually come from a world where they operate the services themselves. And now they’re asked not to operate the services but to enable others to do the operations. On the development team side, you then ask them to do development, testing, and operations. They are fully responsible for operating the services in production.”

“But, of course, the world is not black and white. There can be things in between. For instance, the development teams go on call for their services, but only during business hours. Outside of those, there could be an arrangement where the operations team supports the services or any other arrangement between the development and operations teams.”

“Additionally, as part of the SRE transformation, you also involve product management in operations, which has traditionally never been the case. Under the SRE model, product management is provided with appropriate visibility into how the services are running in production so they can make data-driven, informed decisions about when to invest in reliability versus when to invest in new features.”

Could you explain the relationship between SRE and cybersecurity? Is there one?
“Absolutely. So, SRE handles operations in general. And, of course, a big part of that is to operate the services securely. When there are incidents in production, they can be cybersecurity incidents. When these happen, under the SRE framework, there will be an incident response process governing how the organization mobilizes the people for a given incident. How does the organization troubleshoot the incident? How does it learn from post-mortems? And in that incident response process, there could be additional provisions. Governmental or industry bodies might have to be notified if there is a serious breach. This will be governed by the overall incident response process required as part of introducing SRE in the organization.”

How do you “sell” SRE to the organizational leadership?
“One way to do this is to frame your investment in SRE as ‘revenue protection insurance.’ You will only get revenue from your services if they are reliable. If your services are down often, nobody will pay for them. You need insurance to protect your revenue. The insurance premium – an investment in SRE – will be as high as your reliability requirements. So, if you need to be as reliable as Google, then it will cost more than if you need to be only as reliable as Expedia. So you need to buy appropriate insurance to protect your revenue.”

Equal Experts

Equal Experts have worked with Siemens Healthineers since 2018, with teams in Germany, Bangalore, and Lisbon working on improving test automation and building “compliance into the (CI/CD) pipeline.” After an initial Discovery period focused on assisting internal teams with a Continuous Delivery health check, Equal Experts went on to provide coaching, support, and delivery capabilities for the development of the Siemens Healthineers teamplay platform – a suite of tools and applications for medical imaging equipment fleet performance management in healthcare.

Teams of Siemens Healthineers and Equal Experts developers delivered CI/CD pipelines using Infrastructure as Code, introduced high levels of automated testing (including automated BDD tests using SpecFlow), and decomposed systems into easy-to-deploy microservices, all with the goal of building “compliance into the pipeline” and improving the Developer Experience so that teams could spend less time on “busywork” and more on the high levels of innovation that make Siemens Healthineers the leader in their field.

The work showed that Continuous Delivery can bring huge benefits to this field and can give teams high degrees of agility and confidence when working to strict medical regulatory requirements. Teams saw improvements not only in speed but also in the quality of audits and operations.

Siemens Healthineers team members have gone on to write books and give talks about their successes to great acclaim.