Monitoring & Observability – A DevOps Perspective

What’s Monitoring?

When we talk about distributed systems, monitoring is a key part of the stack. Monitoring is the practice of collecting, processing, aggregating, and displaying key information about the system’s state in real time.

Historically, monitoring has been defined in several different ways. Its critical role is to provide information that reveals a clear picture of a distributed system’s state in real time. Many techniques have emerged around monitoring. The Site Reliability Engineering book describes two broad classes: white-box monitoring (based on the system’s internal elements) and black-box monitoring (based on the system’s externally visible behavior). Both techniques aim to provide visibility to engineering, support, and stakeholder teams.
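To make the white-box and black-box distinction concrete, here is a minimal Python sketch. The `EchoService` class and the `black_box_probe` function are hypothetical names invented for this illustration; a real system would expose internals through its own instrumentation and be probed over the network.

```python
import time
from dataclasses import dataclass


@dataclass
class EchoService:
    """Hypothetical in-process service standing in for a real system."""
    requests_handled: int = 0  # internal state, only visible from inside
    errors: int = 0

    def handle(self, payload: str) -> str:
        self.requests_handled += 1
        if not payload:
            self.errors += 1
            raise ValueError("empty payload")
        return payload

    def internal_metrics(self) -> dict:
        # White-box: the service reports on its own internals.
        return {"requests_handled": self.requests_handled, "errors": self.errors}


def black_box_probe(service: EchoService) -> dict:
    # Black-box: exercise the public interface and judge only what a user would see.
    start = time.perf_counter()
    try:
        ok = service.handle("ping") == "ping"
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.perf_counter() - start}
```

The probe knows nothing about counters or internal state, while `internal_metrics` can explain why a probe failed. In practice the two views complement each other.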

Why Monitoring?

As mentioned above, monitoring ideally provides an overview of the current state of a distributed system. It can also provide historical records, which help us analyze trends and compare different experiments along the timeline.

Monitoring data needs to be in a centralized place where people can see it. Usually, we’ll have dashboards that will consolidate and display our key information. If we don’t show information, it’s like it doesn’t exist.

One last but equally important concept is that our monitoring data can highlight possible issues or bugs that we need to address. Monitoring provides the information, and alerts trigger actions that make those events visible to the interested parties. Common alert mechanisms are emails, tickets, and pages.
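As a rough illustration of how monitoring data can drive alerts, the sketch below evaluates a few threshold rules against the latest metric samples. The `AlertRule` type, the `evaluate` function, and the metric names are assumptions made for this example, and the "notification" is just a string rather than a real email, ticket, or page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    channel: str  # "email", "ticket", or "page"


def evaluate(rules: list[AlertRule], samples: dict[str, float]) -> list[str]:
    """Return one notification message per rule whose threshold is exceeded."""
    notifications = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            notifications.append(
                f"[{rule.channel}] {rule.metric}={value} exceeds {rule.threshold}"
            )
    return notifications


# Hypothetical rules: page on high error rate, open a ticket on a filling disk.
rules = [
    AlertRule("error_rate", 0.05, "page"),
    AlertRule("disk_used_pct", 80.0, "ticket"),
]
print(evaluate(rules, {"error_rate": 0.10, "disk_used_pct": 42.0}))
```

Routing by channel reflects the idea that not every event deserves to wake someone up: urgent conditions page, slower-moving ones become tickets or emails.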

Monitoring in the Software Development Life Cycle

Monitoring is something to keep in mind from the very beginning of any distributed system’s design. In fact, monitoring should be considered at every stage of the software development life cycle. Why? Remember that monitoring is an excellent mechanism for knowing the current and historical state of a distributed system. Monitoring information helps with finding issues, root-causing failures, decision-making, debugging, pattern analysis, and many other tasks that are useful for everyone involved in the different stages.

Observability: the Joint Venture of Testing and Monitoring

Observability is not monitoring, and monitoring is not observability; rather, monitoring is a subset of observability. Observability shows the system’s current health and surfaces unpredictable failures that could not be anticipated by monitoring or testing. As its name suggests, observability offers a way to observe things that are not necessarily identified during the verification of the system’s correctness (testing) or the monitoring of predictable failures.

The 3 Pillars of Observability

Logging

Logging is a strategy where we record the history of critical events over time in our systems. A distributed system requires proper centralized logging machinery to gather, process, and store log records.

An event log is an immutable and timestamped record of discrete events that happened over time. Logs are usually stored in the following three formats:

  • Plaintext (free-form text)
  • Structured (usually in JSON format)
  • Binary (commonly used in databases)
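For the structured format, a minimal sketch using Python’s standard `logging` and `json` modules could look like the following. The `JsonFormatter` class, the `build_logger` helper, and the chosen field names are assumptions for this example, not a standard schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example's output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Write to an in-memory buffer here; a real service would ship to a central store.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order %s accepted", "A-1001")
print(buffer.getvalue())
```

Because every record is a timestamped JSON object, a centralized pipeline can filter and aggregate on fields instead of parsing free-form text.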

Metrics

Everyone loves metrics, especially managers. A distributed system’s metrics will provide a general overview of the system state in terms of representative data measured over time intervals. Examples of metrics are CPU/memory/storage usage, network traffic, database reads/writes, number of API requests, etc.
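To show what “measured over time intervals” can mean in practice, here is a small sketch that averages raw samples into fixed-width time buckets. The `aggregate` function and the sample data are hypothetical; real metrics systems typically offer several aggregations (sum, max, percentiles) rather than just the mean.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples: list[tuple[float, float]], interval_s: int) -> dict[int, float]:
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Each sample belongs to the bucket that starts at or before its timestamp.
        bucket_start = int(timestamp // interval_s) * interval_s
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical CPU usage samples: (seconds since start, percent used).
cpu = [(0, 10.0), (30, 30.0), (60, 50.0), (90, 70.0), (120, 90.0)]
print(aggregate(cpu, 60))
```

Storing one aggregated value per interval keeps metrics cheap to retain for long periods, which is what makes historical comparisons feasible.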

There are plenty of solutions for storing and visualizing metrics data, so you need to consider which one best fits your case.

Tracing

According to Cindy Sridharan in her book Distributed Systems Observability, “Traces are a representation of logs,” or, in other words, “traces make sense of logs.” Tracing is a technique that aims to relate the events happening in the different components of a distributed system. Tracing helps to represent the end-to-end interactions between the system’s entities.

Sridharan continues, “A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.”

Tracing will provide visibility of the different flows in the systems in terms of participants’ tasks, timeframe for each sub-task, and relationships between participants.
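The relationships described above can be sketched with a tiny span model: every span shares a trace ID and points at its parent. The `Span` class, the `render` function, and the service names below are hypothetical and only loosely mirror what real tracing systems record.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                      # shared by every span in one request
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    duration_ms: float = 0.0

    def child(self, name: str, duration_ms: float = 0.0) -> "Span":
        return Span(name=name, trace_id=self.trace_id,
                    parent_id=self.span_id, duration_ms=duration_ms)


def render(spans: list[Span]) -> list[str]:
    """Return indented lines showing the parent/child structure of one trace."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex, duration_ms=120.0)
auth = root.child("auth-service", 15.0)
orders = root.child("orders-db", 80.0)
print("\n".join(render([root, auth, orders])))
```

The shared `trace_id` ties events from different components to one request, and `parent_id` encodes the causal order, which is what lets a trace show participants, the timeframe of each sub-task, and how they relate.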

Coding and Testing for Failure

Coding for failure means accepting that the system will fail at some point and being ready for it. The mechanisms that determine this “readiness for failure” need to be comprehensive, and they are part of a good observability strategy. According to Sridharan, there are three main things to consider when coding for failure:

  • Understand the operational semantics of the application
  • Understand the operational characteristics of the dependencies
  • Write debuggable code
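As one possible reading of the points above, the sketch below wraps a call to a dependency in bounded retries and, when it gives up, raises an error that records what happened on each attempt. The `call_with_retries` helper, the `DependencyError` type, and the `inventory-api` name are assumptions for this example; backoff and timeouts are omitted for brevity.

```python
class DependencyError(Exception):
    """Raised when a dependency keeps failing; carries context for debugging."""


def call_with_retries(operation, *, name: str, attempts: int = 3):
    """Call operation() up to `attempts` times, keeping a record of each failure."""
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Record enough detail to reconstruct the failure later.
            failures.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
    raise DependencyError(
        f"{name} failed after {attempts} attempts; " + "; ".join(failures)
    )


def always_down():
    # Hypothetical dependency that never responds.
    raise ConnectionError("refused")


try:
    call_with_retries(always_down, name="inventory-api", attempts=2)
except DependencyError as err:
    print(err)
```

The final error names the dependency and lists every attempt, so whoever reads the log does not have to guess which call failed or how often it was retried.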

Testing for failure implies that the “verification of correctness” of a system is not “the only way” to find issues. According to Sridharan’s book, testing for failure “will involve an acknowledgment that certain types of failures can only be surfaced in the production environment,” and we need to create the proper observability mechanisms to catch those failures in production without impacting the key functionality of the distributed system.

Observability is Not the Complete Solution

Sridharan concludes that “Observability isn’t a Panacea.” She also cites Brian Kernighan’s note in the book Unix for Beginners: “The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.”

We always need to be diligent and curious to find better ways to improve the visibility of our system’s state. We have the tools, so we need to use them, improve them, or create new ones if required. One thing is clear: there’s no complete solution to this matter.

Setting a good observability strategy is a continuously evolving challenge. Each project will dictate its own particular needs and determine the type of observability you want to implement. This needs to be considered in the different phases of the system’s design, coding, and testing.

By Obed N Muñoz, Former Wizeline Site Reliability Engineer

Posted by Aisha Owolabi on March 17, 2021