The Three Pillars of Observability

By Colin Mo
August 18, 2023

Observability is the ability to understand the internal workings of a system or process through the collection, analysis, and visualization of relevant data. In software development and operations, that means comprehensive visibility into components such as applications, infrastructure, and services, so teams can effectively monitor, troubleshoot, and optimize their performance and behavior. It typically relies on monitoring tools, logs, metrics, and distributed tracing to provide a holistic understanding of system behavior and enable efficient problem-solving.

That’s a bit technical, so here’s what that means in practice:

  • With observability: when the system breaks, IT can look at the data and understand what happened and why.
  • Without observability: when the system breaks, IT isn’t sure what broke and has to do further analysis just to find out.

Of course, reality is a bit more complicated than that. Observability is like having senses: a person with two working senses understands their environment better than one with none, but a person with all five senses has far more data about that environment and is more likely to detect that, say, the room is burning.

So, it comes down to data. That’s why observability logs contain the who, what, when, where, and how; this gives SysAdmin, DevOps, and SecOps teams a starting point for analyzing what broke or behaved incorrectly.

The stack should be built with observability in mind and include methods for tracking the data that matter. The goal is to enable teams to discover, gain insight, and ultimately resolve issues that cause the system to fail to work as intended.

Observability differs from monitoring in that it aggregates data from across the entire system; context matters because individual actions may only make sense when you understand the ecosystem they happened in.

The end result is a reliable and efficient system made possible by teams having insight into why it breaks, or peace of mind that it’s working correctly.

We’ll discuss the three pillars of observability as they relate to access control and cybersecurity.

The Three Pillars of Observability

There are three main categories of data for observability:

  • Metrics
  • Traces
  • Logs

This golden triangle of observability provides DevSecOps teams with insightful overviews of distributed infrastructure. Together, the three pillars contribute complementary insights and help expedite triage and resolution.

Metrics

Metrics are great at giving teams an overview of what they should pay attention to.

Observability metrics are quantitative measurements used to assess the performance and health of a system or application. They provide insights into various aspects of the system’s behavior, allowing engineers to monitor, analyze, and optimize its performance. Observability metrics help in understanding how a system is functioning, diagnosing issues, and making informed decisions for improvement.

Examples:

Common metrics used by software teams include the following (a brief code sketch follows the list):

  • Latency: Latency measures the time it takes for data to travel through a system. High latency can indicate performance bottlenecks or network issues.
  • Error Rate: This metric quantifies the frequency of errors or failed requests within a given time frame. It helps identify issues that affect the reliability and correctness of the system.
  • Throughput: Throughput measures the rate at which a system or component successfully processes requests. It indicates the system’s capacity and performance, and analyzing throughput patterns can reveal traffic spikes, peak usage periods, or changes in user behavior.
  • Saturation: Saturation metrics monitor resource utilization levels, such as CPU, memory, disk I/O, or network bandwidth. High saturation levels suggest resource constraints or the need for scaling.
  • Availability: Availability measures the proportion of time a system or service is accessible and functioning correctly. It helps evaluate the system’s reliability and uptime.
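
To make the metrics above concrete, here is a minimal sketch of how a service might expose a latency histogram and an error counter using the Prometheus Go client. The metric names, the handler, and the doWork function are illustrative assumptions, not a prescribed setup:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metrics: a latency histogram and an error counter.
var (
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Time taken to serve an HTTP request.",
		Buckets: prometheus.DefBuckets,
	})
	requestErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "http_request_errors_total",
		Help: "Total number of failed HTTP requests.",
	})
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// Record latency for every request, success or failure.
	defer func() { requestDuration.Observe(time.Since(start).Seconds()) }()

	if err := doWork(r); err != nil { // doWork stands in for real logic
		requestErrors.Inc()
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func doWork(r *http.Request) error { return nil }

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```

Once a Prometheus server scrapes the /metrics endpoint, error rates and latency percentiles can be computed and alerted on without touching the service again.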

Limitations:

Metrics can only tell you “hey, pay attention to this service or process!”, and only if you’re already tracking the right things and have robust monitoring set up to alert you. There’s a fine line between alerting so aggressively that you drown in false positives and so conservatively that problems simmer unnoticed. And even then, metrics only tell you where to look.

If there’s smoke, you want to discover the fire, and that’s where Traces and Logs come in.

Traces

Traces are about the flow of data and requests.

If metrics are about performance indicators (immediately seeing that something is wrong), then tracing is about the flow of data as it moves through the different components or services within the system.

Distributed traces can even give insight into the path of a specific request through multiple services.

Trace example showing a request going through the Pomerium authorization flow

Good tracing allows teams to understand the request’s journey, enabling the following:

  • Performance Optimization: Tracing allows teams to understand the interplay between services to identify bottlenecks, latency issues, and other areas for optimization.
  • Troubleshooting and Debugging: Because tracing gives a detailed view of request flow, teams have an easier time identifying the source of errors and/or failures.
  • Dependency Analysis: The detailed view of request flow also helps highlight potential points of failure, enabling teams to assess the impact of changes or failures in one service on the overall system.

As for how it’s done: each request is assigned a trace ID that follows it through the request flow. As the request moves from one service to another, each service appends its own span, which represents a specific operation or activity performed by that service. These spans contain information such as timestamps, duration, metadata, and contextual data.
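
To illustrate, here is a minimal sketch of that flow using the OpenTelemetry Go API. The service and span names are hypothetical:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Hypothetical handler: each service in the request path starts its own
// span under the same trace, so the full journey can be reconstructed.
func handleCheckout(ctx context.Context) {
	tracer := otel.Tracer("checkout-service") // illustrative scope name

	// Starting a span records a timestamp; ending it records the duration.
	ctx, span := tracer.Start(ctx, "checkout")
	defer span.End()

	chargeCard(ctx) // the trace context travels via ctx to downstream calls
}

func chargeCard(ctx context.Context) {
	// A child span: same trace ID, new span ID, so the payment step
	// shows up nested under "checkout" in the trace view.
	_, span := otel.Tracer("payment-service").Start(ctx, "charge-card")
	defer span.End()
	// ... call the payment provider ...
}

func main() {
	handleCheckout(context.Background())
}
```

Without a registered tracer provider these spans are no-ops; the instrumentation step later in this piece shows one way to wire an exporter.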

Logs

Logs are great at giving teams precise data on the who, what, when, where, and how. Once teams see that a service is misbehaving or dead, they use the logs to understand why.

Logs are granular records of events and messages that provide a chronological account of all activity. This gives teams incredible detail, but it also leaves them facing an overwhelming mountain of data. A lot of thought and effort needs to go into making sure the system captures, tags, and indexes logs so that the data is presented in a way that surfaces what matters to the teams responsible.
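
As an example of what the who, what, when, where, and how look like as structured data, here is a minimal sketch using Go’s standard log/slog package. The field names and values are illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output so the log aggregator can index each field.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Who, what, where, and how as structured, queryable fields.
	// (The "when" is the timestamp, added automatically by slog.)
	logger.Error("upstream request failed",
		slog.String("user", "alice@example.com"), // who (illustrative)
		slog.String("action", "GET /reports"),    // what
		slog.String("service", "reporting-api"),  // where
		slog.Int("status", 502),                  // how it failed
		slog.String("request_id", "abc123"),      // ties the log to a trace
	)
}
```

Because each field is machine-readable, the aggregator can index user, service, and status, which is what makes the filtering and alerting steps below practical.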

How does observability logging work?

It is broken down into the following steps:

  1. Log Generation: Logs are generated by services and resources. These logs should contain information such as status updates, error messages, warnings, request details, performance metrics, and other relevant data. Logging is typically performed using dedicated libraries or frameworks that provide structured logging capabilities.
  2. Log Collection: Once they’re generated, logs are collected and centralized into a log management system or log aggregator. This could involve collecting logs from various servers, containers, or distributed instances into a centralized location or using distributed logging solutions that aggregate logs in real-time.
  3. Log Storage and Retention: After collection, the log management system stores the collected log data in local or cloud-based storage. Because logs are raw data that pile up quickly, retention periods are often set to retire old logs. The retention period can vary based on the system’s requirements, compliance regulations, and storage capacity.
  4. Log Analysis and Visualization: Now that it’s all collected and stored, log analysis tools and techniques are used to search, filter, and extract insights from log data. Visualization tools enable the representation of log data in meaningful ways, such as graphs, charts, or dashboards. These tools help in identifying patterns, anomalies, errors, and performance issues.
  5. Log Alerting: No one has time to look through every single log, so specific conditions and events can be set as triggers for alerts. Alerting mechanisms notify system operators or engineers when predefined thresholds or patterns are detected in the log data (often in the form of metrics). This enables proactive issue detection and resolution; a simplified sketch of the idea follows this list.
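
In practice alerting is usually delegated to the log platform’s rule engine, but the underlying idea is often just a threshold over a rolling window of matching events. A simplified, hypothetical sketch (threshold, window, and notify hook are all made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// alertOnErrors fires when more than threshold error events arrive
// within window. Real systems delegate this to the log platform;
// this just shows the underlying idea.
func alertOnErrors(events <-chan time.Time, threshold int, window time.Duration) {
	var recent []time.Time
	for ts := range events {
		recent = append(recent, ts)
		// Drop events that have aged out of the window.
		cutoff := ts.Add(-window)
		for len(recent) > 0 && recent[0].Before(cutoff) {
			recent = recent[1:]
		}
		if len(recent) > threshold {
			notify(fmt.Sprintf("%d errors in the last %s", len(recent), window))
			recent = nil // reset after alerting
		}
	}
}

func notify(msg string) { fmt.Println("ALERT:", msg) } // hypothetical pager hook

func main() {
	events := make(chan time.Time)
	go func() {
		for i := 0; i < 10; i++ {
			events <- time.Now() // simulate a burst of error logs
		}
		close(events)
	}()
	alertOnErrors(events, 5, time.Minute)
}
```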

A well-implemented logging solution enables teams to expedite triaging and resolution processes. Unfortunately, many tools are not able to log the why. As for why that’s important, read the follow-up piece here: Logs Are Incomplete Without The “Why.”

Implementing Observability

Implementing observability in an organization requires a systematic approach to ensure its successful adoption and integration into existing processes. Here is a five-step guide to help you implement observability effectively:

Step 1: Define Goals and Scope

Start by clearly defining your observability goals and the scope of implementation. Identify the specific areas or systems where observability is crucial and prioritize them based on their impact on your organization’s performance and reliability. Determine the key metrics, logs, and traces you need to collect and analyze to gain actionable insights.

Step 2: Select and Configure Tools

Research and select appropriate observability tools based on your organization’s needs and requirements. Consider tools for metrics collection, log aggregation, distributed tracing, error tracking, and visualization. Popular options include Prometheus, Grafana, Jaeger, the ELK Stack, Sentry, and more. Configure these tools to integrate with your existing infrastructure, applications, and services.

Step 3: Instrumentation and Monitoring

Implement instrumentation within your systems to collect the necessary data for observability. This involves embedding monitoring agents or libraries into your applications and infrastructure components. Ensure that the instrumentation covers critical aspects such as capturing metrics, generating logs, and tracing requests across distributed systems. Leverage frameworks or libraries compatible with your chosen observability tools.
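
As one concrete instance of instrumentation wiring, registering an OpenTelemetry tracer provider at startup is what turns the no-op spans from the earlier tracing sketch into exported traces. The stdout exporter here is a stand-in for Jaeger or whichever backend you selected in step 2:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export finished spans to stdout; a real deployment would point
	// this at a collector or tracing backend instead.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}

	// Batch spans before export and register the provider globally so
	// every otel.Tracer(...) call in the codebase picks it up.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// From here on, instrumented code emits real spans.
	_, span := otel.Tracer("bootstrap").Start(ctx, "startup-check")
	span.End()
}
```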

Step 4: Visualization and Analysis

Set up a centralized observability platform where you can aggregate and analyze the collected data. Configure dashboards and visualizations to provide real-time insights into the performance, errors, and behavior of your systems. Use the selected tools to create meaningful visualizations, correlations, and anomaly detection mechanisms. This enables teams to identify and respond to issues quickly, improving system reliability and performance.

Step 5: Establish Processes and Collaboration

Implement processes and establish a collaborative culture to fully leverage observability in your organization. Encourage cross-functional collaboration between development, operations, and QA teams to ensure observability practices are integrated into the software development lifecycle. Foster a culture of learning from incidents and sharing knowledge gained through observability data. Regularly review and refine your observability practices based on feedback and lessons learned.

Remember that implementing observability is an iterative process. Continuously monitor and refine your observability strategy based on evolving business needs, technological advancements, and feedback from teams. By following this five-step guide, you can establish a solid foundation for observability in your organization, leading to improved system performance, reliability, and proactive issue resolution.

Gain Unprecedented Insight into Your Organization’s Infrastructure Access Decisions

The pursuit of efficiency always comes back to actionable data, and that can only be found with observability. Observability isn’t just a nice-to-have for access control infrastructure; efficient security processes have become a fundamental need for a competitive edge and success in the market.

As an open-source, context-aware reverse proxy, Pomerium processes all the data and provides actionable information to your team. Whether you’re spinning up a new application or adding access control to a legacy service, Pomerium can easily provide metrics, logs, and traces for all authentication and authorization activity on your protected resources. The result is:

  • Easier because you don’t have to maintain a client or software.
  • Faster because it’s deployed directly where your apps and services are. No more expensive data backhauling.
  • Safer because every single action is verified for trusted identity, device, and context.

Our customers depend on us to secure zero trust, clientless access to their web applications every day.

Check out our open-source GitHub repository or give Pomerium a try today!
