OpenTelemetry democratises access to observability data & will enable massive innovation
I have been building software platforms and products for more than 18 years. Having worked in both big enterprises and startups, I have found that three things stay top of mind for engineering leaders from an operational perspective:
- Automate and speed up development & deployment cycles for new features and innovation.
- Provide five 9s (99.999%) of availability and great app performance to customers.
- Seamless access to prod telemetry data for debugging critical and time-sensitive issues.
At the heart of all of these lie DevOps and the associated observability stack. Over the last decade, DevOps and Site Reliability organisations have become dramatically more critical to the automation, availability and scalability of production systems. In the early 2010s, APM companies like NewRelic, Dynatrace and Datadog, and log management companies like Splunk, SumoLogic and Datadog, started penetrating the engineering stack. Each had its own proprietary way of collecting the telemetry data behind the functionality it offered. Let’s dig deeper into this.
Observability World before OpenTelemetry
Before OpenTelemetry came to the fore, telemetry data collection was proprietary and fragmented. To collect and expose telemetry data, four components usually worked in tandem:
- Telemetry SDKs: A set of proprietary, language-specific SDKs provided by vendors to developers. Developers added these SDKs to their codebase and implemented vendor-specific interfaces for emitting telemetry data, which was then piped to the collector.
- Instrumentation: Many vendors provided easy-to-use, language-specific instrumentation libraries for initialising and collecting telemetry data from different applications. Some even offered auto-instrumentation for certain areas, such as metrics. The data was typically traces, metrics or logs.
- Collector: Vendors ran agents in customer environments, which acted as brokers to transmit telemetry data efficiently to the vendor cloud. The wire format and protocol for collecting and transmitting telemetry data were defined by the vendor.
- Vendor cloud storage and user interface: Once the data was available in the cloud, developers could access and analyse it through the vendor-provided user interface.
Although telemetry instrumentation and data collection followed a fairly common methodology, vendors like NewRelic, Datadog, Splunk and SumoLogic each implemented these components in their own way. This resulted in two problems:
- Vendor Lock-in: Organisations got locked in with vendors, since implementing telemetry instrumentation and data collection was vendor-specific and arduous. Replacing a vendor meant going through the whole process again.
- Vendor Differentiation: Since telemetry instrumentation and data collection were almost identical across vendors, they were not the core value proposition. Vendors were spending resources on an area that was table stakes, rather than putting that effort into differentiating the functionality offered on top of the collected telemetry data.
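To make the lock-in problem concrete, here is a minimal Python sketch. Both vendor SDKs (`AcmeAgent` and `ZenithClient`) and all of their method names are hypothetical, invented purely for illustration. The point is only that the same measurement needs a different call shape per vendor, so switching vendors means rewriting every call site.

```python
# Hypothetical vendor SDKs: class names and signatures are invented for illustration.
class AcmeAgent:
    def __init__(self):
        self.sent = []

    def record_metric(self, name, value, tags):
        # This imaginary vendor expects tags as a dict.
        self.sent.append(("acme", name, value, dict(tags)))


class ZenithClient:
    def __init__(self):
        self.sent = []

    def gauge(self, metric, val, labels):
        # This imaginary vendor expects labels as "key:value" strings.
        self.sent.append(("zenith", metric, val, list(labels)))


# The same business measurement, written twice because each SDK differs.
def checkout_with_acme(agent, latency_ms):
    agent.record_metric("checkout.latency", latency_ms, {"region": "eu"})


def checkout_with_zenith(client, latency_ms):
    client.gauge("checkout.latency", latency_ms, ["region:eu"])


acme = AcmeAgent()
zenith = ZenithClient()
checkout_with_acme(acme, 120)
checkout_with_zenith(zenith, 120)
```

Multiply those two functions by every instrumented call in a large codebase and the cost of replacing a vendor becomes clear.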
This led to several open source telemetry data collection initiatives, such as OpenTracing and OpenCensus. Eventually, these two merged into the widely accepted OpenTelemetry project, which is incubated by the Cloud Native Computing Foundation (CNCF).
OpenTelemetry and its Components
Let’s first talk about what OpenTelemetry isn’t. OpenTelemetry doesn’t deal with the visualisation or analysis of the telemetry data collected using its standards, APIs and collector.
The goal of the OpenTelemetry project is to define standards and components for the following four pieces of telemetry data collection:
- Define an open wire format for telemetry data.
- Provide language-specific SDKs for collecting data and transforming it into the open wire format.
- Provide a specification of semantic conventions for the different events that produce telemetry data.
- Provide the OpenTelemetry Collector, a software component that collects and ingests telemetry data from any source and then exports it to any destination, based on configuration. It doesn’t store any data itself; it acts as a broker that publishes data to different destinations.
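The broker role of the collector can be sketched in a few lines of Python. This is a toy model, not the real OpenTelemetry Collector or its configuration format: `ToyCollector`, `receive`, and the backend names are all hypothetical. It only illustrates the idea that records come in, are routed to configured destinations, and are never stored by the broker itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = dict  # stand-in for a record in the open wire format


@dataclass
class ToyCollector:
    # exporter name -> callable that ships a record to a destination
    exporters: Dict[str, Callable[[Record], None]] = field(default_factory=dict)
    # signal type ("traces", "metrics", "logs") -> exporter names to use
    pipelines: Dict[str, List[str]] = field(default_factory=dict)

    def receive(self, signal: str, record: Record) -> int:
        """Forward a record to every exporter configured for its signal.

        Nothing is kept on the collector; it returns how many
        destinations received the record.
        """
        targets = self.pipelines.get(signal, [])
        for name in targets:
            self.exporters[name](record)
        return len(targets)


# Two in-memory lists stand in for two different vendor backends.
backend_a: List[Record] = []
backend_b: List[Record] = []

collector = ToyCollector(
    exporters={"backend_a": backend_a.append, "backend_b": backend_b.append},
    pipelines={"traces": ["backend_a", "backend_b"], "metrics": ["backend_b"]},
)

collector.receive("traces", {"name": "GET /cart", "duration_ms": 12})
collector.receive("metrics", {"name": "http.requests", "value": 1})
```

In this model, adding or replacing a destination is a change to the `pipelines` mapping, not to the application that emits the data.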
Data Pillars of OpenTelemetry
There are three key telemetry data pillars in the observability stack: Tracing, Metrics and Logs. In the past, they have often been collected and analysed in silos.
The OpenTelemetry standard aims to break these silos, correlate the data and bring it all together, but the project decided to tackle the wire format, protocol, APIs and so on for each pillar one at a time. Since the OpenTelemetry project originated in part from OpenTracing, it tackled trace-related data first, followed by metrics and logs.
Here is the current status of each pillar in OpenTelemetry ecosystem:
Tracing: The API specification is stable and feature-frozen. Both the SDKs and the protocol are stable. The OpenTelemetry Tracing specification is now completely stable and covered by long-term support. Many vendors already support it.
Metrics: The API specification is stable and feature-frozen. Both the SDKs and the protocol are stable, but a few other aspects of OpenTelemetry Metrics are still under active development. Many vendors are actively working to support it.
Logging: Both the API specification and the SDKs are in draft. The protocol is stable. The data model has been released experimentally as part of the OTLP protocol. Log processing for many data formats has been added to the Collector. Log appenders, which allow OpenTelemetry tracing data such as trace and span IDs to be appended to existing logging systems, are under development in many languages. OpenTelemetry Logging is essentially under active development and, by my estimate, will stabilise by the end of the year.
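To illustrate what a log appender does, here is a minimal sketch using only Python's standard `logging` and `contextvars` modules. It is not the OpenTelemetry API: `current_span`, `TraceContextFilter` and `build_logger` are hypothetical names, and the IDs are example values. It shows the core idea of stamping each log line with the active trace and span IDs, so that logs can later be joined with traces.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active span; a real SDK manages this for you.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record passing through."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else "-"
        record.span_id = span["span_id"] if span else "-"
        return True  # never drop the record, only enrich it


def build_logger(stream):
    logger = logging.getLogger("otel_sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

# Pretend a span is active while this request is being handled.
current_span.set({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
log.info("payment authorised")
```

Because the existing log call (`log.info(...)`) does not change, this kind of enrichment can be layered onto a logging setup that is already in place.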
You can see the current status of all these components on the OpenTelemetry status page.
I am also excited about two other areas that the OpenTelemetry team is currently exploring: Client (or RUM) Telemetry and Network Telemetry. It’s still early for these, but there are newly formed Special Interest Groups (SIGs) that meet on these two areas. You can find the SIGs, their meeting schedules and their Slack channels for the different areas of OpenTelemetry in the OpenTelemetry community repository. These meetings are open to everyone.
Once all five telemetry areas (Tracing, Metrics, Logs, Client or RUM, and Network) stabilise and mature, the impact of OpenTelemetry on the tech world is going to be huge. I expect the collection standards for Tracing, Metrics and Logs to stabilise by the end of 2022, with all vendors supporting them. Client and Network telemetry are relatively new and will take longer to stabilise, probably until early or mid 2024.
Impact of OpenTelemetry on Observability World
By democratising telemetry data collection, OpenTelemetry will increase the pace of innovation in the observability world. The days of vendor lock-in and the arduous process of vendor-specific data collection will be a thing of the past.
As the figure above shows, OpenTelemetry completely removes any vendor-specific instrumentation, implementation and collection of telemetry data. Since the OpenTelemetry Collector has a standard interface, any company can now hook in logic to consume telemetry data. Vendors will now compete on the value they provide on top of the data rather than on the collection process itself. This presents an incredible opportunity for new startups to provide unique value on telemetry data without having to worry about barriers to entry. The correlation between different kinds of telemetry data, along with an open and structured format, will also enable new innovation using advanced AI/ML techniques. We are in the golden age of software, and observability tech will play a big role in providing a great experience to users. I can’t wait to see the incredible innovation coming in the next few years.