Embracing Observability in Distributed Systems



Hausenblas: Welcome to my session, embracing observability in distributed systems. My name is Michael Hausenblas. I’m an Open Source Product Developer Advocate in the AWS container service team. The focus here is really on observability in the context of distributed systems, for example, containerized microservices.


Let’s have a look at the more traditional setup, and that would be a monolith. You would have a number of different modules in that monolith. For example, in this case, that might be an eCommerce application. You have interfaces with external systems like payment, or single sign-on, or some risk profiles, or an ERP stock application. Obviously, you’re talking to your clients; you want to sell something. If you take that monolith and break it apart into a number of microservices, then the better a job you did of modularizing your original monolith, the easier it is now for these simple, smaller microservices to exist and to interact. What are the characteristics of the overall setup of such a system, for example, a containerized microservices system?


What speaks for it is, on the one hand, increased developer velocity, because different teams can now be responsible for different microservices and can iterate independently from each other, with different release cycles and testing. That makes the whole thing faster. You end up with a polyglot system in the sense that you potentially have different programming languages and different datastores there that you can optimize for the task at hand. For example, you might write the renderer in Node.js, and the payment options microservice might be in Java. You also have what I would call partial high availability, meaning that parts of the system might go down, but to the end user it still looks like there is some functionality available. Think of, for example, an eCommerce setup: you might not be able to search for something, but you can still check out something in your shopping basket.


What are the cons? It is a distributed system now, and very likely, those different microservices end up on different nodes. Think of, for example, Kubernetes, where each of the microservices might be a deployment, and the pods owned by that deployment end up on different nodes. The different microservices now use networking to talk to each other. It is much more complex than a monolith. It’s already hard to figure out how different parts work together in the case of a monolith, but in the case of a distributed system, a microservices system, you have a lot of additional complexity. One of the biggest challenges in the context of a microservices setup is the observability of the overall system. That is equally true for developers and for operations folks.

Observability Challenges

Let’s have a look at the challenges. Since we’re talking about a distributed system, one of the things you have to wonder is how to keep track of the time and location of different signals. You might wonder, what is the right retention period of a signal? How long should you keep the logs around? Maybe you’re required to keep the logs around for a certain period of time for regulatory purposes. You have to consider the return on investment. By that I mean that it takes a certain effort, for example, for a developer to instrument their application, their microservice. It costs money to keep the signals around, to store them, and to run applications for looking at these different signals. You want to make sure that for whatever effort and whatever money you put in there, you have a clear outcome and a clearly defined scope of what you get for it. The different signals may be relevant to different roles in different circumstances. For example, a developer troubleshooting or profiling their microservice along a request path might need a different set of tools compared to someone from the infrastructure team looking at a Kubernetes cluster.

Observability, End-to-End

Before we get into the landscape and what is going on currently, especially in CNCF, let’s have a look at the observability basics. When I talk about observability, I mean the entirety of all the things that you see there. First, all the sources: that might be an app or microservices, in our case here. It might be infrastructure sources, like, for example, VPC Flow Logs, or a database or datastore. You typically treat them as opaque: you don’t know what’s going on inside, you just get some signals out. Then there is the compute unit, for example, containers or functions or whatever, and the compute engine that actually runs and executes your code. You have the telemetry bits, which usually include agents, SDKs, and protocols that take, route, and ingest the signals from the sources into some destinations. There are a couple of different types of destinations. There are things like dashboards, where you can look at how metrics are doing over time, for example. You might have alerts. You have long-term storage; for example, you put some logs in an S3 bucket. Ultimately, the destinations are what you really want. The sources and the telemetry bits are what you have to invest in: you have to instrument your code, and you have to deploy agents to collect signals, forward them, and ingest them into some destination. You ultimately want to consume them. You want to do something with those signals, generate insights, and make decisions based on those signals. The last thing you want to do in the context of a distributed system is obviously to [inaudible 00:07:45].
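To make the sources, telemetry, and destinations picture concrete, here is a minimal sketch in Python. It is not any real agent or SDK: `Signal`, `Pipeline`, `add_route`, and `ingest` are hypothetical names made up for illustration, and the destinations are plain in-memory lists standing in for a dashboard and long-term storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Signal:
    source: str    # where it came from, e.g. "app" or "vpc-flow-log" (illustrative)
    kind: str      # "log", "metric", or "trace"
    payload: dict  # the signal content itself


@dataclass
class Pipeline:
    """Hypothetical telemetry layer: routes signals by kind to destinations."""
    routes: Dict[str, List[Callable[[Signal], None]]] = field(default_factory=dict)

    def add_route(self, kind: str, destination: Callable[[Signal], None]) -> None:
        # Register a destination (any callable) for one kind of signal.
        self.routes.setdefault(kind, []).append(destination)

    def ingest(self, signal: Signal) -> int:
        # Deliver the signal to every destination registered for its kind
        # and report how many destinations received it.
        destinations = self.routes.get(signal.kind, [])
        for deliver in destinations:
            deliver(signal)
        return len(destinations)


# In-memory stand-ins for real destinations (assumptions, not real services).
dashboard: List[Signal] = []
long_term_storage: List[Signal] = []

pipeline = Pipeline()
pipeline.add_route("metric", dashboard.append)
pipeline.add_route("log", long_term_storage.append)

pipeline.ingest(Signal("app", "log", {"message": "checkout started"}))
pipeline.ingest(Signal("app", "metric", {"name": "requests", "value": 1}))
```

The point of the sketch is the split of responsibilities: the sources only emit, the routing in the middle is where the investment goes, and the destinations are where the signals are finally consumed.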


A different way to view this, not from this pipeline point of view but from a more conceptually decomposed point of view, is performing what is called a morphological analysis. This is a problem-solving method developed by a Swiss astrophysicist called Fritz Zwicky. The basic idea is that you decompose your solutions into small units that are more or less independent. In this case, you would have six dimensions. Of course, you can have more or fewer depending on how you view it. I came up with these six dimensions. There is analytics, which, as I said, is what you actually want to have: you want to consume the signals, you want to store them. The telemetry bit, again, is agents and protocols, like OpenMetrics, for example. You have the programming languages, which as a developer you’re most interested in. Does a certain set of telemetry technologies support your programming language? Are they available there? Can you use them in your programming language? Then the infrastructure piece, where you have, on the one hand, compute-related sources, for example, Docker logging drivers, VPC Flow Logs, S3 bucket logs, but also datastores. Very important. You almost always have some state involved, and very often these are opaque boxes, so you get some signals out but you can’t really look inside that box. Then the compute unit, as I said, in this case highlighted for what we have in AWS. Think of EKS, for example, which is Kubernetes, or Lambda functions. Compute unit refers to how a certain service, a microservice, is scheduled and exposed. The compute engine is the actual runtime environment, for example, EC2, or Fargate, or Lightsail.

This allows a relatively straightforward way to answer the question: for a specific workload, for a specific example, what options are available? Let’s, for example, say you’re running EKS on Fargate. You’re interested in logs, so you also have the logging driver there from Docker. You’re writing your microservice in Java. You might be using Fluent Bit to ship and route the logs, and you’re consuming the logs in the context of Elasticsearch. There you have one particular path through these six dimensions, and you can imagine that there are many combinations possible.
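The six dimensions and a path through them can be sketched in a few lines of Python. The options listed per dimension are only the ones named in this talk, not a complete catalogue, and `DIMENSIONS`, `all_paths`, and `matching_paths` are illustrative names. The sketch also assumes every combination is valid, which is not true in practice: some compute units simply do not run on some compute engines.

```python
from itertools import product

# One list of options per dimension (illustrative, taken from the talk).
DIMENSIONS = {
    "analytics": ["Elasticsearch", "Grafana"],
    "telemetry": ["Fluent Bit", "OpenMetrics"],
    "language": ["Java", "Node.js"],
    "infrastructure": ["Docker logging driver", "VPC Flow Logs", "S3 bucket logs"],
    "compute_unit": ["EKS", "Lambda"],
    "compute_engine": ["EC2", "Fargate", "Lightsail"],
}


def all_paths(dimensions):
    """Yield every path: one option picked from each dimension."""
    names = list(dimensions)
    for combo in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, combo))


def matching_paths(dimensions, **fixed):
    """Return the paths that agree with the choices you have already made."""
    return [
        path
        for path in all_paths(dimensions)
        if all(path[name] == value for name, value in fixed.items())
    ]


# The path from the example: Java on EKS on Fargate, logs via the Docker
# logging driver, shipped with Fluent Bit, consumed in Elasticsearch.
talk_path = {
    "analytics": "Elasticsearch",
    "telemetry": "Fluent Bit",
    "language": "Java",
    "infrastructure": "Docker logging driver",
    "compute_unit": "EKS",
    "compute_engine": "Fargate",
}

# Fixing two dimensions narrows the remaining options.
eks_on_fargate = matching_paths(
    DIMENSIONS, compute_unit="EKS", compute_engine="Fargate"
)
```

Even with this small option set the full cross product has 144 paths, which is the "many combinations" remark in numbers; fixing the compute unit and engine leaves 24 to consider.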


Let’s move on to signals. We have essentially the three pillars. There are the logs, essentially discrete events that are usually timestamped and can be structured, for example, as JSON here. There are metrics, which are regularly sampled numerical data values, usually with dimensions and labels that capture their semantics. For example, a destination to view them is Grafana, a very popular one. And there are traces, which are the signals that happen along the request path through a number of microservices. Think…
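As a rough illustration of how the three signal types differ in shape, here is a small Python sketch. It does not follow any particular library’s data model: `log_event`, `MetricSample`, and `Span` are made-up names, and the field names are assumptions chosen for readability.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


def log_event(message: str, **fields) -> str:
    """A log: one discrete, timestamped event, structured as JSON."""
    return json.dumps({"ts": time.time(), "message": message, **fields})


@dataclass
class MetricSample:
    """A metric: a numerical value sampled at a point in time, with labels."""
    name: str
    value: float
    labels: Dict[str, str]
    ts: float = field(default_factory=time.time)


@dataclass
class Span:
    """One step of a trace; spans of one request share the same trace_id."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, name: str) -> "Span":
        # A downstream call keeps the trace_id and points back at its caller.
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)


line = log_event("checkout failed", service="payment", level="error")
sample = MetricSample("http_requests_total", 1.0, {"service": "search"})

# A request entering the renderer and then calling the payment service.
root = Span(trace_id=uuid.uuid4().hex, name="renderer")
payment = root.child("payment")
```

The shared `trace_id` and the `parent_id` link are what allow a request path across several microservices to be stitched back together afterwards.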

