Esparrachiari: My name is Silvia. I’m going to share with you some pitfalls and patterns in microservice dependency management that I bumped into while working at Google for over 10 years. The content and examples in this presentation are based on my own experiences as a software engineer at Google. I’m not focused on any particular product or team.
We’re going to start by taking a quick look at the transition from service monoliths into microservices, and then take one last jump to include services running in the cloud. We will continue our journey through patterns in traffic growth, failure isolation, and how we can plan reasonable SLOs in a world where every backend has different promises.
Monoliths, Microservices, and into the Cloud
In the beginning, we all wrote a single binary, often called Hello World, which evolved to include more complex functionality like a database, user authentication, flow control, operational monitoring, and an HTTP API so our customers can find us online. This binary runs on a single machine, but could also have many replicas to allow for traffic growth in different geo-locations. Several reasons pushed for our monoliths to be decoupled into separate binaries. A common reason is the complexity of the binary, which made the code base almost impossible to maintain or extend with new features. Another common reason is the requirement for independent logical components to grow their hardware resources without impacting the performance of the remaining components. These reasons motivated the birth of microservices, where different binaries communicate over a network, but they all serve and represent a single product. The network is an important part of the product and must always be kept in mind. Each component can grow its hardware resources independently, and it’s much easier for engineering teams to control the lifecycle of each binary. Product owners may choose between running their binaries on their own machines or in the cloud. A product owner may even choose to run all their binaries in the cloud, which is often associated with higher availability and a lower cost.
Benefits of Microservices
Running a product in a microservice architecture provides a series of benefits. It allows for independent vertical or horizontal scaling: growing the hardware resources for each component, or replicating the components in different regions independently. It offers better logical decoupling and lower internal complexity, which makes it easier for developers to reason about changes in the services and guarantee that new features have a predictable outcome. Each component can be developed independently, allowing for localized changes without disturbing components that are unrelated to a new feature. Releases can be pushed forward or rolled back independently, promoting a faster reaction to outages and more focused production changes.
Challenges of Microservices
However, an architecture based on microservices may also make some processes harder to deal with. We will see some useful tips that will hopefully save you time and some customer outages. Some memorable pains from my own experience in managing microservices include aligning traffic and resource growth between frontends and backends, designing failure domains, and computing product SLOs based on the combined SLOs of all microservices.
Let’s start by understanding our example product. PetPic is a fictional product that we will use to exemplify these challenges. PetPic serves pictures of dogs for dog lovers in two regions: Happytails and Furland. It currently has 100 customers in each region, for a total of 200 customers. The frontend API runs on independent machines in Happytails and Furland. The service has several components, but for the purpose of this first example, let’s consider only the database backend. The database runs in the cloud in a global region and serves both regions, Happytails and Furland.
Aligning Traffic Growth
The database currently uses 50% of all its resources at peak. The PetPic owner decided to launch a new feature to also serve pictures of cats to their customers. PetPic engineers decided to launch the new feature in Happytails first, so they could look for unexpected changes in user traffic or resource usage before making the new feature available to everybody. This looks like a very reasonable strategy. In preparation for the launch, engineers doubled the processing resources for the API service in Happytails and increased the database resources by 10%. The launch was a success. The engineers observed a 10% growth in customers, which might indicate that some cat lovers had joined PetPic. The database resource utilization is at 50% at peak again, showing that the extra resources were indeed necessary.
All signals indicate that a 10% growth in users requires a 10% growth in the database. In preparation for the launch in Furland, engineers added 10% more resources to the database again. They also doubled the API resources in Furland to cope with the requests from new customers. They launched it on a Wednesday, and waited. In the middle of lunch time, pagers started firing alerts about users seeing 500s. Yes, hundreds of 500s. What’s happening? The database team reaches out and mentions that the resource utilization reached 80% two hours ago, and that they are trying to allocate more CPU to handle the extra traffic, but that’s unlikely to happen today. The API team checks the user growth graphs and there’s no change, still 220 customers. What’s happening? They decide to abort the launch and roll back the feature in Furland. Several customer support tickets are opened by unhappy customers who were eager for some cat love during their lunch break. Engineers scratch their heads and look at the monitoring logs to understand the outage.
In the logs, they can see that the feature launch in Happytails had a 10% customer growth aligned with a 10% traffic growth to the database. Once the feature was launched in Furland, the traffic to the database rose 60% even without a single new user registered in Furland. They learned that customers in Furland were actually cat lovers, and had never had much interest in interacting with PetPic before. The cat picture feature was a huge success in regaining these customers, but the rollout strategy could never have predicted that.
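The capacity reasoning in this story can be sketched as simple arithmetic. The snippet below is only an illustration: it assumes that database load scales linearly with traffic, which is my simplification and not something stated in the talk, and the function names are hypothetical. Under that assumption the model will not reproduce the reported 80% figure exactly, but it does show why a 10% capacity increase cannot absorb a 60% traffic increase.

```python
def projected_utilization(current_utilization, capacity_growth, traffic_growth):
    """Project peak utilization, ASSUMING load scales linearly with traffic.

    All arguments are fractions: 0.50 means 50%, 0.10 means +10%.
    """
    if capacity_growth <= -1.0:
        raise ValueError("capacity_growth must be greater than -1.0")
    return current_utilization * (1.0 + traffic_growth) / (1.0 + capacity_growth)


def capacity_growth_needed(current_utilization, traffic_growth, target_utilization):
    """Capacity growth required to stay at target_utilization (same linear assumption)."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return current_utilization * (1.0 + traffic_growth) / target_utilization - 1.0


# Happytails: +10% capacity met +10% traffic, so peak utilization stays where it was.
happytails = projected_utilization(0.50, capacity_growth=0.10, traffic_growth=0.10)

# Furland: the same +10% capacity met +60% traffic.
furland = projected_utilization(0.50, capacity_growth=0.10, traffic_growth=0.60)

# Capacity growth that would have kept the database at 50% peak utilization.
needed = capacity_growth_needed(0.50, traffic_growth=0.60, target_utilization=0.50)

print(f"Happytails projected peak: {happytails:.0%}")
print(f"Furland projected peak:    {furland:.0%}")
print(f"Capacity growth needed:    {needed:.0%}")
```

The point of the sketch is that the input that matters is traffic growth, not customer growth; the two happened to coincide in Happytails and diverged in Furland.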
What can we do better next time? First, keep in mind that every product experiences different types of growth. Growth in the number of customers is not always associated with more engagement from customers. The amount of hardware resources needed to process user requests may vary according to user behavior. When preparing for a launch, run experiments across all the different regions, so you can have a better view of how the new feature will impact user behavior and resource utilization. When requesting extra hardware resources, allow backend owners extra time to actually allocate them. Allocating a new machine requires purchase orders, transportation, and physical installation of the hardware.
We just observed a scenario where a global service operated as a single point of failure and caused an outage in two distinct regions. In the world of monoliths, isolating failure across components is quite difficult, if not impossible. The main reason is that all logical components coexist in the same binary, and thus, in the same execution environment. A large benefit of working with microservices is that we can allow for independent logical components to fail in isolation, preventing failures from widely spreading and compromising performance of other system components. This design process is often called failure isolation, or the analysis of how services fail together.
In our example, PetPic is deployed independently in two different regions: Happytails and Furland. Unfortunately, the performance of these regions is strongly tied to the performance of the global database serving both regions. As we have observed so far, customers in Happytails and Furland have quite distinct interests, making it hard to tune the database to efficiently serve both regions. Changes in the way Furland customers access the database can reflect poorly on the user experience of Happytails users. There are ways to avoid that. A simple strategy is to use a bounded local cache. The local cache can guarantee an improved user experience since it also reduces…
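As an illustration of the bounded local cache idea, here is a minimal sketch of a size-limited cache sitting in front of a backend lookup. The talk does not specify an eviction policy or an API, so the least-recently-used eviction, the class name `BoundedLocalCache`, and the `load_from_backend` callback are all assumptions made for this example.

```python
from collections import OrderedDict


class BoundedLocalCache:
    """A size-bounded cache with least-recently-used eviction (hypothetical sketch)."""

    def __init__(self, max_entries, load_from_backend):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._load = load_from_backend  # called only on a cache miss
        self._entries = OrderedDict()   # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Mark the entry as recently used and serve it locally.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Miss: go to the backend, then store the result locally.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry to keep memory bounded.
            self._entries.popitem(last=False)
        return value

    def __len__(self):
        return len(self._entries)


# Stand-in for the global database; a real lookup would go over the network.
backend_calls = []

def fetch_picture(picture_id):
    backend_calls.append(picture_id)
    return f"bytes-of-{picture_id}"

cache = BoundedLocalCache(max_entries=2, load_from_backend=fetch_picture)
for picture_id in ["cat-1", "cat-2", "cat-1", "cat-3"]:
    cache.get(picture_id)

print(f"hits={cache.hits} misses={cache.misses} backend_calls={len(backend_calls)}")
```

The bound matters as much as the cache itself: repeated requests for popular pictures are served locally in each region, while the fixed size keeps memory use predictable on the frontend machines.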