<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 02 Apr 2026 07:00:58 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[3 models for logging with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/3-models-logging-opentelemetry</link>
            <guid isPermaLink="false">3-models-logging-opentelemetry</guid>
            <pubDate>Tue, 27 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[While OpenTelemetry increases the use of tracing and metrics among developers, logging continues to provide flexible, application-specific, event-driven data. Explore the current state of OpenTelemetry logging and guidance on the available approaches.]]></description>
            <content:encoded><![CDATA[<p>Arguably, <a href="https://www.elastic.co/blog/opentelemetry-observability">OpenTelemetry</a> exists to (greatly) increase usage of tracing and metrics among developers. That said, logging will continue to play a critical role in providing flexible, application-specific, event-driven data. Further, OpenTelemetry has the potential to bring added value to existing application logging flows:</p>
<ol>
<li>
<p>Common metadata across tracing, metrics, and logging to facilitate contextual correlation, including metadata passed between services as part of REST or RPC APIs; this is a critical element of service observability in the age of distributed, horizontally scaled systems</p>
</li>
<li>
<p>An optional unified data path for tracing, metrics, and logging to facilitate common tooling and signal routing to your observability backend</p>
</li>
</ol>
<p>Adoption of metrics and tracing among developers to date has been relatively small. Further, the number of proprietary vendors and APIs (compared to adoption rate) is relatively large. As such, OpenTelemetry took a greenfield approach to developing new, vendor-agnostic APIs for tracing and metrics. In contrast, most developers have nearly 100% log coverage across their services. Moreover, logging is largely supported by a small number of vendor-agnostic, open-source logging libraries and associated APIs (e.g., <a href="https://logback.qos.ch">Logback</a> and <a href="https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.logging.ilogger">ILogger</a>). Accordingly, <a href="https://opentelemetry.io/docs/specs/otel/logs/#introduction">OpenTelemetry’s approach to logging</a> meets developers where they already are, using hooks into existing, popular logging frameworks. In this way, developers can add OpenTelemetry as a log signal output without otherwise altering their code and investment in logging as an observability signal.</p>
<p>Notably, logging is the least mature of the OTel-supported observability signals. Depending on your service’s <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">language</a>, and your appetite for adventure, there exist several options for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>The intent of this article is to explore the current state of the art of <a href="https://www.elastic.co/blog/introduction-apm-tracing-logging-customer-experience">OpenTelemetry logging</a> and to provide guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via <a href="https://opentelemetry.io/docs/concepts/signals/baggage/">OTel baggage</a></li>
<li>Use of an Elastic<sup>®</sup> Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<h2>OpenTelemetry logging models</h2>
<p>Three models currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ol>
<li>
<p>Output logs from your service (alongside traces and metrics) using an embedded <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation library</a> to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, which then forwards to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> (or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>), which then forwards to Elastic via an Elastic-defined protocol</p>
</li>
</ol>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
<h2>Logging vs. span events</h2>
<p>It is worth noting that most APM systems, including OpenTelemetry, include provisions for <a href="https://opentelemetry.io/docs/instrumentation/ruby/manual/#add-span-events">span events</a>. Like log statements, span events contain arbitrary, textual data. Additionally, span events automatically carry any custom attributes (e.g., a “user ID”) applied to the parent span, which can help with correlation and context. In this regard, it may be advantageous to translate some existing log statements (inside spans) to span events. As the name implies, of course, span events can only be emitted from within a span and thus are not intended to be a general purpose replacement for logging.</p>
<p>Unlike logging, span events do not pass through existing logging frameworks and therefore cannot (practically) be written to a log file. Further, span events are technically emitted as part of trace data and follow the same data path and signal routing as other trace data.</p>
<h2>Polyfill appender</h2>
<p>Some of the demos make use of a custom Logback <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">“Polyfill appender”</a> (inspired by OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which provides support for attaching <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> to log messages for models (2) and (3).</p>
<h2>Elastic Common Schema</h2>
<p>For log messages to exhibit full fidelity within Elastic, they eventually need to be formatted in accordance with the <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">Elastic Common Schema</a> (ECS). In models (1) and (2), log messages remain formatted in OTel log semantics until ingested by the Elastic APM Server. The Elastic APM Server then translates OTel log semantics to ECS. In model (3), ECS is applied at the source.</p>
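<p>As a rough illustration (the field names shown are indicative, not exhaustive), the translation performed by the Elastic APM Server maps OTel log record fields to their ECS counterparts along these lines:</p>
<pre><code>OTel log record field        ECS field
---------------------        ---------
Timestamp                    @timestamp
SeverityText                 log.level
Body                         message
TraceId                      trace.id
SpanId                       span.id
Resource["service.name"]     service.name
</code></pre>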
<p>Notably, OpenTelemetry recently <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">adopted the Elastic Common Schema</a> as its standard for semantic conventions going forward! As such, it is anticipated that current OTel log semantics will be updated to align with ECS.</p>
<h2>Getting started</h2>
<p>The included demos center around a “POJO” (no assumed framework) Java project. Java is arguably the most mature of OTel-supported languages, particularly with respect to logging options. Notably, this singular Java project was designed to support the three models of logging discussed here. In practice, you would only implement one of these models (and corresponding project dependencies).</p>
<p>The demos assume you have a working <a href="https://www.docker.com/">Docker</a> environment and an <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> instance.</p>
<ol>
<li>
<p>git clone <a href="https://github.com/ty-elastic/otel-logging">https://github.com/ty-elastic/otel-logging</a></p>
</li>
<li>
<p>Create an .env file at the root of otel-logging with the following (appropriately filled-in) environment variables:</p>
</li>
</ol>
<pre><code class="language-bash"># the service name
OTEL_SERVICE_NAME=app4

# Filebeat vars
ELASTIC_CLOUD_ID=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)
ELASTIC_CLOUD_AUTH=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)

# apm vars
ELASTIC_APM_SERVER_ENDPOINT=(address of your Elastic Cloud APM server... i.e., https://xyz123.apm.us-central1.gcp.cloud.es.io:443)
ELASTIC_APM_SERVER_SECRET=(see https://www.elastic.co/guide/en/apm/guide/current/secret-token.html)
</code></pre>
<ol start="3">
<li>Start up the demo with the desired model:</li>
</ol>
<ul>
<li>If you want to demo logging via the OTel APM Agent, run <code>MODE=apm docker-compose up</code></li>
<li>If you want to demo logging via the OTel filelogreceiver, run <code>MODE=filelogreceiver docker-compose up</code></li>
<li>If you want to demo logging via Elastic Filebeat, run <code>MODE=filebeat docker-compose up</code></li>
</ul>
<ol start="4">
<li>Validate incoming span and correlated log data in your Elastic Cloud instance</li>
</ol>
<h2>Model 1: Logging via OpenTelemetry instrumentation</h2>
<p>This model aligns with the long-term goals of OpenTelemetry: <a href="https://opentelemetry.io/docs/specs/otel/logs/#opentelemetry-solution">integrated tracing, metrics, and logging (with common attributes) from your services</a> via the <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation libraries</a>, without dependency on log files and scrapers.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides a “Southbound hook” to Logback via the OTel <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-appender-1.0/library">Logback Appender</a>, which injects ServiceName, SpanID, TraceID, slf4j key-value pairs, and OTel baggage into log records and passes the composed records to the co-resident OpenTelemetry Instrumentation library. We further employ a <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/AddBaggageLogProcessor.java">custom LogRecordProcessor</a> to add baggage to the log record as attributes.</p>
<p>The OTel instrumentation library then formats the log statements per the <a href="https://opentelemetry.io/docs/specs/otel/logs/data-model/">OTel logging spec</a> and ships them via OTLP to either an OTel Collector for further routing and enrichment or directly to Elastic.</p>
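<p>As a minimal sketch, the Logback wiring for this model might look like the following (appender name and options are illustrative; the appender must also be installed against your OpenTelemetry instance at startup):</p>
<pre><code class="language-xml">&lt;configuration&gt;
  &lt;!-- forwards Logback events to the co-resident OpenTelemetry SDK --&gt;
  &lt;appender name="OpenTelemetry"
            class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender"&gt;
    &lt;!-- capture slf4j key-value pairs as log record attributes --&gt;
    &lt;captureKeyValuePairAttributes&gt;true&lt;/captureKeyValuePairAttributes&gt;
  &lt;/appender&gt;
  &lt;root level="INFO"&gt;
    &lt;appender-ref ref="OpenTelemetry"/&gt;
  &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>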
<p>Notably, as language support improves, this model can and will be supported by runtime agent binding with auto-instrumentation where available (e.g., no code changes required for runtime languages).</p>
<p>One distinguishing advantage of this model, beyond the simplicity it affords, is the ability to more easily tie together attributes and tracing metadata directly with log statements. This inherently makes logging more useful in the context of other OTel-supported observability signals.</p>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-1-architecture.png" alt="model 1 architecture" /></p>
<p>Although not explicitly pictured, an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> can be inserted in between the service and Elastic to facilitate additional enrichment and/or signal routing or duplication across observability backends.</p>
<h3>Pros</h3>
<ul>
<li>Simplified signal architecture and fewer “moving parts” (no files, disk utilization, or file rotation concerns)</li>
<li>Aligns with long-term OTel vision</li>
<li>Log statements can be (easily) decorated with OTel metadata</li>
<li>No polyfill adapter required to support structured logging with slf4j</li>
<li>No additional collectors/agents required</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Not available (yet) in many OTel-supported languages</li>
<li>No intermediate log file for ad-hoc, on-node debugging</li>
<li>Immature (alpha/experimental)</li>
<li>Unknown “glare” conditions, which could result in loss of log data if the service exits prematurely or if the backend is unable to accept log data for an extended period of time</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=apm docker-compose up</code></p>
<h2>Model 2: Logging via the OpenTelemetry Collector</h2>
<p>Given the cons of Model 1, it may be advantageous to consider a model that continues to leverage an actual log file intermediary between your services and your observability backend. Such a model is possible using an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> collocated with your services (e.g., on the same host), running the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md">filelogreceiver</a> to scrape service log files.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides a MDC Appender for Logback (<a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which adds SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>Notably, no log record structure is assumed by the OTel filelogreceiver. In the example provided, we employ the <a href="https://github.com/logfellow/logstash-logback-encoder">logstash-logback-encoder</a> to JSON-encode log messages. The logstash-logback-encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. Notably, logstash-logback-encoder doesn’t explicitly support <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a>. It does, however, support <a href="https://github.com/logfellow/logstash-logback-encoder#event-specific-custom-fields">Logback structured arguments</a>, and thus I use the <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">Polyfill Appender</a> to convert slf4j key-value pairs to Logback structured arguments.</p>
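<p>For reference, a minimal Logback configuration using the logstash-logback-encoder might look like this sketch (file path and appender name are illustrative; the encoder includes MDC content, and thus the SpanID, TraceID, and Baggage, by default):</p>
<pre><code class="language-xml">&lt;configuration&gt;
  &lt;appender name="JSON_FILE" class="ch.qos.logback.core.FileAppender"&gt;
    &lt;file&gt;/var/log/app/app.log&lt;/file&gt;
    &lt;!-- JSON-encodes each event, including MDC (SpanID, TraceID, Baggage) --&gt;
    &lt;encoder class="net.logstash.logback.encoder.LogstashEncoder"/&gt;
  &lt;/appender&gt;
  &lt;root level="INFO"&gt;
    &lt;appender-ref ref="JSON_FILE"/&gt;
  &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>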
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then <a href="https://github.com/ty-elastic/otel-logging/blob/main/collector/filelogreceiver.yml">configure</a> the OTel Collector to scrape this log file (using the filelogreceiver). Because no assumptions are made about the format of the log lines, you need to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/types/parsers.md#parsers">explicitly map fields</a> from your log schema to the OTel log schema.</p>
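<p>As a sketch of that mapping (file paths and JSON field names are illustrative and depend on your encoder’s output; the endpoint variables follow the demo’s <code>.env</code> file), a filelogreceiver configuration for JSON-encoded log lines might look like:</p>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [ /var/log/app/*.log ]
    operators:
      # parse each line as JSON into log record attributes
      - type: json_parser
      # map the encoded IDs onto the OTel log record's trace context
      - type: trace_parser
        trace_id:
          parse_from: attributes.trace_id
        span_id:
          parse_from: attributes.span_id

exporters:
  otlp:
    endpoint: ${ELASTIC_APM_SERVER_ENDPOINT}
    headers:
      Authorization: "Bearer ${ELASTIC_APM_SERVER_SECRET}"

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
</code></pre>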
<p>From there, the OTel Collector batches and ships the formatted log lines via OTLP to Elastic.</p>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-2-architecture.png" alt="model 2 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>Beta quality and not yet proven in the field</li>
<li>No support for slf4j key-value pairs</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=filelogreceiver docker-compose up</code></p>
<h2>Model 3: Logging via Elastic Agent (or Filebeat)</h2>
<p>Although the second model described affords some resilience as a function of the backing file, the OTel Collector filelogreceiver module is still decidedly <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">“beta”</a> in quality. Because of the importance of logs as a debugging tool, today I generally recommend that customers continue to import logs into Elastic using the field-proven <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> scrapers. Elastic Agent and Filebeat have many years of field maturity under their collective belt. Further, it is often advantageous to deploy Elastic Agent anyway to capture the multitude of signals outside the purview of OpenTelemetry (e.g., deep Kubernetes and host metrics, security, etc.).</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). As with model 2, we employ OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a> to add SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>From there, we employ the <a href="https://www.elastic.co/guide/en/ecs-logging/java/current/setup.html">Elastic ECS Encoder</a> to encode log statements compliant with the Elastic Common Schema. The Elastic ECS Encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. As with model 2, the Elastic ECS Encoder doesn’t support slf4j key-value pairs. Curiously, the Elastic ECS Encoder also doesn’t appear to support Logback structured arguments. Thus, within the Polyfill Appender, I add slf4j key-value pairs as MDC context. This is less than ideal, however, since MDC forces all values to be strings.</p>
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then configure Elastic Agent or Filebeat to scrape the log file. Notably, the Elastic ECS Encoder does not currently translate incoming OTel SpanID and TraceID variables on the MDC. Thus, we need to perform manual translation of these variables in the <a href="https://github.com/ty-elastic/otel-logging/blob/main/filebeat.yml">Filebeat (or Elastic Agent) configuration</a> to map them to their ECS equivalent.</p>
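<p>As an illustrative sketch (file paths and the incoming field names depend on your encoder’s output), the Filebeat side of that translation might look like:</p>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app/*.json
    parsers:
      # each line is a self-contained JSON (ECS) document
      - ndjson:
          target: ""

processors:
  # map the OTel MDC fields to their ECS equivalents
  - rename:
      fields:
        - from: "trace_id"
          to: "trace.id"
        - from: "span_id"
          to: "span.id"
      ignore_missing: true
      fail_on_error: false
</code></pre>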
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-3-architecture.png" alt="model 3 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Robust and field-proven</li>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Native ECS format for easy manipulation in Elastic</li>
<li>Fleet-managed via Elastic Agent</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>No support for slf4j key-value pairs or Logback structured arguments</li>
<li>Requires translation of OTel SpanID and TraceID in Filebeat config</li>
<li>Disparate data paths for logs versus tracing and metrics</li>
<li>Vendor-specific logging format</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=filebeat docker-compose up</code></p>
<h2>Recommendations</h2>
<p>For most customers, I currently recommend Model 3 — namely, write to logs in ECS format (with OTel SpanID, TraceID, and Baggage metadata) and collect them with an Elastic Agent installed on the node hosting the application or service. Elastic Agent (or Filebeat) today provides the most field-proven and robust means of capturing log files from applications and services with OpenTelemetry context.</p>
<p>Further, you can leverage this same Elastic Agent instance (ideally running in your <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html">Kubernetes daemonset</a>) to collect rich and robust metrics and logs from <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes</a> and many other supported services via <a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a>. Finally, Elastic Agent facilitates remote management via <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet</a>, avoiding bespoke configuration files.</p>
<p>Alternatively, for customers who either wish to keep their nodes vendor-neutral or use a consolidated signal routing system, I recommend Model 2, wherein an OpenTelemetry collector is used to scrape service log files. While workable and practiced by some early adopters in the field today, this model inherently carries some risk given the current beta nature of the OpenTelemetry filelogreceiver.</p>
<p>I generally do not recommend Model 1 given its limited language support, experimental/alpha status (the API could change), and current potential for data loss. That said, in time, with more language support and more thought to resilient designs, it has clear advantages both with regard to simplicity and richness of metadata.</p>
<h2>Extracting more value from your logs</h2>
<p>In contrast to tracing and metrics, most organizations have nearly 100% log coverage over their applications and services. This is an ideal beachhead upon which to build an application observability system. On the other hand, logs are notoriously noisy and unstructured; this is only amplified with the scale enabled by the hyperscalers and Kubernetes. Collecting log lines reliably is the easy part; making them useful at today’s scale is hard.</p>
<p>Given that logs are arguably the most challenging observability signal from which to extract value at scale, one should ideally give thoughtful consideration to a vendor’s support for logging in the context of other observability signals. Can they handle surges in log rates because of unexpected scale or an error or test scenario? Do they have the machine learning tool set to automatically recognize patterns in log lines, sort them into categories, and identify true anomalies? Can they provide cost-effective online searchability of logs over months or years without manual rehydration? Do they provide the tools to extract and analyze business KPIs buried in logs?</p>
<p>As an ardent and early supporter of OpenTelemetry, Elastic, of course, <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">natively ingests OTel traces, metrics, and logs</a>. And just like all logs coming into our system, logs coming from OTel-equipped sources avail themselves of our <a href="https://www.elastic.co/observability/log-monitoring">mature tooling and next-gen AI Ops technologies</a> to enable you to extract their full value. Interested? <a href="https://www.elastic.co/contact?storm=global-header-en">Reach out to our pre-sales team</a> to get started building with Elastic!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/log_infrastructure_apm_synthetics-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Accelerate OTel Adoption with Elastic Agent Hybrid Ingestion]]></title>
            <link>https://www.elastic.co/observability-labs/blog/hybrid-elastic-agent-opentelemetry-integration</link>
            <guid isPermaLink="false">hybrid-elastic-agent-opentelemetry-integration</guid>
            <pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Agent 9.2 brings hybrid ingestion to Elastic Observability, unifying native integrations and OpenTelemetry receivers to simplify large-scale OTel adoption without disruption.]]></description>
            <content:encoded><![CDATA[<h2>Hybrid Elastic Agent: The Most Practical Path to OpenTelemetry Adoption</h2>
<p>OpenTelemetry is quickly becoming the standard foundation for modern observability. Organizations want its open ecosystem, unified model, and vendor-neutral instrumentation—but moving a mature production environment to OTel is rarely straightforward.</p>
<p>Most teams already rely on battle-tested pipelines for logs, metrics, and security signals. They have dashboards tuned over years, operational practices built around existing data flows, and mission-critical systems where disruption simply isn’t an option.</p>
<p>This means the question isn’t &quot;Why OpenTelemetry?&quot;
It’s &quot;How do we get there without breaking what already works?&quot;</p>
<p>With hybrid ingestion, Elastic Observability introduces a way to ingest telemetry without disrupting existing data and dashboards. Released in Elastic 9.2, it’s a low-friction way to adopt OTel receivers alongside existing native Elastic integrations, managed centrally through Fleet.</p>
<p>This hybrid approach offers one of the most pragmatic and operationally safe routes to OTel adoption available today.</p>
<h3>The Challenge: Adopting OTel Without Disrupting the Present</h3>
<p>For many organizations, the path to OTel adoption is complicated by realities such as:</p>
<ul>
<li>Established log pipelines powering critical alerting</li>
<li>Legacy infrastructure that isn’t easily re-instrumented</li>
<li>Existing dashboards and visualizations built on Elastic-native datasets</li>
<li>Teams with different levels of OTel experience</li>
<li>Risk constraints that make large changes difficult to roll out</li>
</ul>
<p>Standardizing on OTel is the right long-term direction, but replacing everything at once is neither realistic nor desirable.</p>
<p>Teams need a way to bring OTel into their environment incrementally, while preserving continuity, reliability, and central governance.</p>
<h3>Elastic Agent 9.2+: Hybrid Ingestion as a Bridge to the Future</h3>
<p>Elastic Agent now supports two fully supported ingestion paths, both running inside the same unified agent:</p>
<ol>
<li>Elastic-native integrations</li>
</ol>
<p>Perfect for logs and host-level telemetry, with mature dashboards, alerts, and ECS mappings.</p>
<ol start="2">
<li>OpenTelemetry input integrations (OTel receivers)</li>
</ol>
<p>Powered by upstream OTel Collector components, managed directly from Fleet.</p>
<p>And crucially:</p>
<p>You can use both, simultaneously, on the same agent.</p>
<p>This hybrid ingestion model allows teams to:</p>
<ul>
<li>Continue collecting logs using native Elastic integrations</li>
<li>Begin collecting metrics or traces via OTel receivers</li>
<li>Maintain full control through Fleet</li>
<li>Introduce OTel exactly where and when it makes sense</li>
<li>Avoid running parallel agents or duplicate pipelines</li>
</ul>
<p>It’s a way to evolve—not replace—your observability strategy.</p>
<h3>A Practical Example: Adding OTel Inputs While Keeping Your Existing Pipelines</h3>
<p>Imagine a system where NGINX logs are already handled via Elastic-native integrations. These pipelines drive dashboards, audits, and critical alerts. Interrupting them isn’t an option.</p>
<p>At the same time, your platform team wants to standardize metrics and service telemetry using OpenTelemetry.</p>
<p>With Elastic Agent hybrid ingestion, both goals align:</p>
<ol>
<li>Keep your existing log integration in Fleet</li>
<li>Add an OTel input integration (e.g., OTel <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/nginxreceiver">nginxreceiver</a>)</li>
<li>Fleet deploys both inside the same Elastic Agent</li>
<li>Deployment is done at scale across your infrastructure from a single management console</li>
<li>Logs and OTel metrics flow into Elasticsearch side-by-side</li>
</ol>
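<p>For step 2, the receiver settings you supply through the integration mirror the upstream nginxreceiver configuration. For example (endpoint and interval are illustrative, and assume NGINX’s stub_status page is enabled):</p>
<pre><code class="language-yaml">receivers:
  nginx:
    # stub_status endpoint exposed by NGINX
    endpoint: "http://localhost:80/status"
    collection_interval: 10s
</code></pre>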
<p>No re-instrumentation.
No duplicate agents.
No loss of historical visibility.
No new tooling for operations.
No external deployment tool.</p>
<p>Whether the component is a web server, reverse proxy, database, JVM runtime, or custom service already instrumented in OTel, the workflow is the same.</p>
<h3>Why This Hybrid Approach Matters Strategically</h3>
<p>Hybrid ingestion is not simply a technical capability—it’s an organizational enabler for OpenTelemetry transformation.</p>
<p><strong>Incremental migration without downtime</strong></p>
<p>Teams can begin adopting OTel at the exact pace they’re comfortable with.
Existing collection signals remain stable. OTel metrics or logs are added progressively.</p>
<p><strong>Fleet remains your single control plane</strong></p>
<p>Fleet continues to manage:</p>
<ul>
<li>agent lifecycle</li>
<li>policy management</li>
<li>version upgrades</li>
<li>diagnostics and monitoring</li>
</ul>
<p>Even as OTel becomes part of your ingestion strategy.</p>
<p><strong>Consistent semantics across teams</strong></p>
<p>Adopting OTel receivers through <a href="https://www.elastic.co/docs/reference/edot-collector">EDOT</a> helps harmonize telemetry models across microservices, infrastructure, and applications.</p>
<p>OTel becomes the shared language—Elastic becomes the scalable backend.</p>
<p><strong>Future-proof flexibility</strong></p>
<p>When the day comes that a team needs advanced OTel features, custom pipelines, custom processors, or additional exporters, they can build their own <a href="https://www.elastic.co/docs/reference/edot-collector/custom-collector">EDOT custom collector</a> flavor and use it in their elastic-agent in hybrid mode.</p>
<p>This allows deep customization without abandoning the Elastic Agent runtime.</p>
<p><strong>No vendor lock-in—full ecosystem alignment</strong></p>
<p>Hybrid ingestion leverages upstream OpenTelemetry components directly.
This reinforces the open, vendor-neutral ecosystem organizations prefer when standardizing observability across teams while being supported by Elastic.</p>
<h3>What About Standalone Mode? (Advanced Use Cases)</h3>
<p>While Fleet-managed hybrid ingestion will meet the needs of most users, Elastic Agent in hybrid mode also supports standalone deployment, with the same functionality as the managed version:</p>
<ul>
<li>native integrations support</li>
<li>full control over OTel receivers, processors, and exporters</li>
<li>Elasticsearch output as the backend</li>
</ul>
<p>This is particularly useful for platform teams testing advanced OTel deployments or building custom telemetry strategies.</p>
<p>But it remains optional—the managed experience is still the default path.</p>
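<p>For illustration, a standalone collector configuration for an NGINX scenario might look like the following. This is a sketch under stated assumptions: the receiver and exporter options mirror the upstream <code>nginxreceiver</code> and Elasticsearch exporter settings, but the endpoint, deployment URL, and credential variable are placeholders you would adapt to your own environment.</p>

```yaml
receivers:
  nginx:
    endpoint: "http://localhost/status"   # NGINX stub_status page
    collection_interval: 10s

exporters:
  elasticsearch:
    endpoints: ["https://my-deployment.es.example.com:443"]  # placeholder deployment URL
    api_key: "${env:ELASTIC_API_KEY}"                        # placeholder credential

service:
  pipelines:
    metrics:
      receivers: [nginx]
      exporters: [elasticsearch]
```

<p>In recent versions, a file like this can be run through the collector runtime embedded in the agent (for example, via the <code>elastic-agent otel</code> subcommand), keeping the same receiver and exporter configuration you would use with a managed policy.</p>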
<h3>Conclusion: A Modern, Flexible Path Toward OpenTelemetry</h3>
<p>Migrating to OpenTelemetry is a journey, not a switch. With hybrid ingestion, Elastic provides a realistic, scalable, and low-risk pathway for organizations that want to adopt OTel gradually while maintaining operational continuity.</p>
<p>Elastic Agent 9.2+ enables teams to:</p>
<ul>
<li>retain reliable log integrations</li>
<li>introduce OTel inputs seamlessly</li>
<li>manage everything from Fleet</li>
<li>reduce complexity and operational overhead</li>
<li>expand into OTel at the right pace</li>
<li>stay aligned with open standards and best practices</li>
</ul>
<p>It brings the best of both worlds—Elastic-native richness and OTel-standard flexibility—into a single agent and a unified operational model.</p>
<p>Hybrid isn’t a workaround.
It’s the strategic bridge between where your observability platform is today and where it needs to go next.</p>
<h2>Technical Walkthrough: Deploying Hybrid Elastic Agent + EDOT in Fleet</h2>
<p>Before we close, let’s look at what this actually looks like in practice.
Conceptual advantages are important, but many teams want to see how hybrid ingestion works when deployed through Fleet.</p>
<p>The example below walks through a simple, production-ready setup using Elastic Agent 9.2, combining a native integration and an OTel input integration inside a single agent. The same approach can be applied to any service across your environment.</p>
<p>Here is a step-by-step guide showing how to deploy Elastic Agent 9.2 in <strong>Fleet-managed hybrid mode</strong>, using the OTel <code>nginxreceiver</code> as one concrete example.
This applies to any service with an OTel receiver (Redis, HAProxy, Kafka, JVM, etc.).</p>
<h3>Requirements</h3>
<ul>
<li>Elastic Stack <strong>9.2+</strong></li>
<li>Elastic Agent <strong>9.2+</strong></li>
<li>Fleet configured in Kibana</li>
<li>A host running your workload (NGINX in this example)</li>
<li>NGINX <code>stub_status</code> endpoint or any equivalent OTel metrics endpoint</li>
<li>API key with ingest privileges</li>
</ul>
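<p>For context, the <code>stub_status</code> page that the <code>nginxreceiver</code> scrapes is a small plain-text report. The sketch below parses that well-known format into a metrics dict, purely to illustrate what the receiver extracts; the sample numbers are made up, and this is not the receiver's actual implementation.</p>

```python
import re

# Illustrative sample of the plain-text format served by NGINX's stub_status module.
SAMPLE = """\
Active connections: 291
server accepts handled requests
 16630948 16630947 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text: str) -> dict:
    """Parse NGINX stub_status output into a flat metrics dict."""
    metrics = {}
    metrics["active_connections"] = int(
        re.search(r"Active connections:\s+(\d+)", text).group(1)
    )
    accepts, handled, requests = re.search(
        r"server accepts handled requests\s+(\d+)\s+(\d+)\s+(\d+)", text
    ).groups()
    metrics.update(accepts=int(accepts), handled=int(handled), requests=int(requests))
    reading, writing, waiting = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    ).groups()
    metrics.update(reading=int(reading), writing=int(writing), waiting=int(waiting))
    return metrics

print(parse_stub_status(SAMPLE))
```

<p>These are exactly the counters (active, accepted, handled, requests, reading, writing, waiting) that surface later as OTel metrics in the dashboard.</p>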
<h3>1. Create or Select an Agent Policy</h3>
<ol>
<li>In Kibana → <strong>Management → Fleet → Agent policies</strong></li>
<li>Create a new policy: <code>nginx-o11y</code></li>
<li>Enable system monitoring (recommended)</li>
<li>Save</li>
</ol>
<h3>2. Enroll Elastic Agent into the Policy</h3>
<p>From the policy page:</p>
<ol>
<li>Click <strong>Add agent</strong></li>
<li>Choose your OS</li>
<li>Copy the installation command</li>
<li>Run:</li>
</ol>
<pre><code class="language-bash">sudo elastic-agent install \
  --url=&lt;FLEET_URL&gt; \
  --enrollment-token=&lt;ENROLLMENT_TOKEN&gt;
</code></pre>
<p>You should soon see the agent appear as Healthy in Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image1.png" alt="" /></p>
<h3>3. Add the Native Integration (Logs)</h3>
<ol>
<li>In Fleet, go to Integrations.</li>
<li>Search for NGINX.</li>
<li>Click Add NGINX.</li>
<li>Select your <code>nginx-o11y</code> policy.</li>
<li>Only enable log collection (access + error logs).</li>
<li>Save and deploy.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image2.png" alt="" /></p>
<h3>4. Validate Log Collection</h3>
<ol>
<li>In Kibana, go to Analytics → Discover and search for:</li>
</ol>
<pre><code class="language-bash">data_stream.dataset : (&quot;nginx.access&quot; or &quot;nginx.error&quot;)
</code></pre>
<ol start="2">
<li>Or open the built-in dashboard:</li>
</ol>
<pre><code class="language-bash">Analytics → Dashboards → [Logs Nginx] Access and error logs
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image3.png" alt="" /></p>
<h3>5. Collecting NGINX Metrics via the OTel NGINX Receiver</h3>
<p>Elastic Agent 9.2+ allows Fleet to deploy OTel input integrations.<br />
This scenario uses the OpenTelemetry <code>nginxreceiver</code> through a Fleet-managed integration.</p>
<h4>5.1. Install the NGINX OpenTelemetry Integration Content</h4>
<ol>
<li>In Kibana, go to Management → Fleet → Integrations.</li>
<li>Search for NGINX OpenTelemetry Assets.</li>
<li>Click Add Integration.</li>
</ol>
<h4>5.2. Install the NGINX OpenTelemetry Input Integration</h4>
<ol>
<li>In Kibana, go to Management → Fleet → Integrations.</li>
<li>Search for NGINX OpenTelemetry Input Package.</li>
<li>Click Add Integration.</li>
<li>Assign it to your agent <code>nginx-o11y</code> policy.</li>
</ol>
<p>Provide the endpoint for the NGINX status page:</p>
<ul>
<li><strong>Endpoint</strong>: <code>http://localhost/status</code></li>
<li><strong>Collection interval</strong>: <code>10s</code></li>
</ul>
<p>Click <strong>Add integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image4.png" alt="" /></p>
<h3>6. Validate OTel Metrics</h3>
<ol>
<li>Go to <strong>Analytics → Dashboards</strong>.</li>
<li>Open: <strong>[Metrics Nginx OTEL Overview]</strong> Dashboard</li>
</ol>
<p>You should see metrics such as active connections, writes, reads, waiting, and request counts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image5.png" alt="" /></p>
<h3>7. Closing thoughts</h3>
<p>This example highlights how straightforward hybrid ingestion becomes with Elastic Agent 9.2. By combining native integrations and OTel receivers within a single, centrally managed policy, you gain the flexibility to adopt OpenTelemetry where it adds the most value without disrupting existing pipelines or introducing operational overhead.</p>
<p>Whether you extend this pattern to additional services, experiment with other OTel receivers, or scale it across your fleet, the deployment model remains consistent, repeatable, and production-ready.</p>
<p>For more information on this and other Elastic Observability innovations, check out:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Discover how Elastic is evolving data ingestion with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">Learn how OpAMP enables centralized configuration of OpenTelemetry SDKs</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Explore how Streams reshape AI-driven log investigation workflows</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/feature-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Adding free and open Elastic APM as part of your Elastic Observability deployment]]></title>
            <link>https://www.elastic.co/observability-labs/blog/free-open-elastic-apm-observability-deployment</link>
            <guid isPermaLink="false">free-open-elastic-apm-observability-deployment</guid>
            <pubDate>Wed, 28 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to gather application trace data and store it alongside the logs and metrics from your applications and infrastructure with Elastic Observability and Elastic APM.]]></description>
            <content:encoded><![CDATA[<p>In a recent post, we showed you <a href="https://www.elastic.co/blog/getting-started-with-free-and-open-elastic-observability">how to get started with the free and open tier of Elastic Observability</a>. Below, we'll walk through what you need to do to expand your deployment so you can start gathering metrics from application performance monitoring (APM) or &quot;tracing&quot; data in your observability cluster, for free.</p>
<h2>What is APM?</h2>
<p>Application performance monitoring lets you see where your applications spend their time, what they are doing, what other applications or services they are calling, and what errors or exceptions they are encountering.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/screenshot-serverless-distributed-trace.png" alt="" /></p>
<p>APM also lets you see history and trends for key performance indicators, such as latency and throughput, as well as transaction and dependency information:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/ruby-overview.png" alt="" /></p>
<p>Whether you're setting up alerts for SLA breaches, trying to gauge the impact of your latest release, or deciding where to make the next improvement, APM can help with your root-cause analysis to help improve your users' experience and drive your mean time to resolution (MTTR) toward zero.</p>
<h2>Logical architecture</h2>
<p>Elastic APM relies on the APM Integration inside Elastic Agent, which forwards application trace and metric data from applications instrumented with APM agents to an Elastic Observability cluster. Elastic APM supports multiple agent flavors:</p>
<ul>
<li>Native Elastic APM Agents, available for <a href="https://www.elastic.co/guide/en/apm/agent/index.html">multiple languages</a>, including Java, .NET, Go, Ruby, Python, Node.js, PHP, and client-side JavaScript</li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/get-started/current/open-telemetry-elastic.html">OpenTelemetry</a></li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/get-started/current/opentracing.html">OpenTracing</a></li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/server/current/jaeger.html">Jaeger</a></li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-instrumented-services.png" alt="" /></p>
<p>In this blog, we'll provide a quick example of how to instrument code with the native Elastic APM Python agent, but the overall steps are similar for other languages.</p>
<p>Please note that there is a strong distinction between the <strong>Elastic APM Agent</strong> and the <strong>Elastic Agent</strong>. These are very different components, as you can see in the diagram above, so it's important not to confuse them.</p>
<h2>Install the Elastic Agent</h2>
<p>The first step is to install the Elastic Agent. You either need Fleet <a href="https://www.elastic.co/guide/en/fleet/current/add-a-fleet-server.html">installed first</a>, or you can install the Elastic Agent standalone. Install the Elastic Agent somewhere by <a href="https://www.elastic.co/guide/en/fleet/master/elastic-agent-installation.html">following this guide</a>. This will give you an APM Integration endpoint you can hit. Note that this step is not necessary in Elastic Cloud, as we host the APM Integration for you. Check that the Elastic Agent is up by running:</p>
<pre><code class="language-bash">curl &lt;ELASTIC_AGENT_HOSTNAME&gt;:8200
</code></pre>
<h2>Instrumenting sample code with an Elastic APM agent</h2>
<p>The instructions for the various language agents differ based on the programming language, but at a high level they have a similar flow. First, you add the dependency for the agent in the language's native spec, then you configure the agent to let it know how to find the APM Integration.</p>
<p>You can try out any flavor you'd like, but I am going to walk through the Python instructions using this Python example that <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample">I created</a>.</p>
<h3>Get the sample code (or use your own)</h3>
<p>To get started, I clone the GitHub repository then change to the directory:</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/PythonElasticAPMExample
cd PythonElasticAPMExample
</code></pre>
<h3>How to add the dependency</h3>
<p>Adding the Elastic APM dependency is simple: check the <code>app.py</code> file from <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample/blob/main/app.py">the GitHub repo</a> and you will notice the following lines of code.</p>
<pre><code class="language-python">import os

import elasticapm
from elasticapm import Client
from flask import Flask

app = Flask(__name__)
app.config[&quot;ELASTIC_APM&quot;] = {
    &quot;SERVICE_NAME&quot;: os.environ.get(&quot;APM_SERVICE_NAME&quot;, &quot;flask-app&quot;),
    &quot;SECRET_TOKEN&quot;: os.environ.get(&quot;APM_SECRET_TOKEN&quot;, &quot;&quot;),
    &quot;SERVER_URL&quot;: os.environ.get(&quot;APM_SERVER_URL&quot;, &quot;http://localhost:8200&quot;),
}
elasticapm.instrumentation.control.instrument()
client = Client(app.config[&quot;ELASTIC_APM&quot;])
</code></pre>
<p>The Python library for Flask is capable of auto-detecting transactions, but you can also start transactions in code, as we have done in this example:</p>
<pre><code class="language-python">@app.route(&quot;/&quot;)
def hello():
    client.begin_transaction('demo-transaction')
    client.end_transaction('demo-transaction', 'success')
    return 'Hello!'  # a Flask view must return a response
</code></pre>
<h3>Configure the agent</h3>
<p>The agents need to send application trace data to the APM Integration, and for that the integration has to be reachable. I configured the Elastic Agent to listen on my local host's IP, so anything in my subnet can send data to it. As you can see from the code below, we use docker-compose.yml to pass in the config via environment variables. Please edit these variables to match your own Elastic installation.</p>
<pre><code class="language-yaml"># docker-compose.yml
version: &quot;3.9&quot;
services:
  flask_app:
    build: .
    ports:
      - &quot;5001:5001&quot;
    environment:
      - PORT=5001
      - APM_SERVICE_NAME=flask-app
      - APM_SECRET_TOKEN=your_secret_token
      - APM_SERVER_URL=http://host.docker.internal:8200
</code></pre>
<p>Some commentary on the above:</p>
<ul>
<li><strong>service_name:</strong> If you leave this out it will just default to the application's name, but you can override that here.</li>
<li><strong>secret_token:</strong> <a href="https://www.elastic.co/guide/en/apm/server/current/secret-token.html">Secret tokens</a> allow you to authorize requests to the APM Server, but they require that the APM Server is set up with SSL/TLS and that a secret token has been set up. We're not using HTTPS between the agents and the APM Server, so we'll comment this one out.</li>
<li><strong>server_url:</strong> This is how the agent can reach the APM Integration inside Elastic Agent. Replace this with the name or IP of your host running Elastic Agent.</li>
</ul>
<p>Now that the Elastic APM side of the configuration is done, we simply follow the steps from the <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample/blob/main/README.md">README</a> to start up.</p>
<pre><code class="language-bash">docker-compose up --build -d
</code></pre>
<p>The build step will take several minutes.</p>
<p>You can navigate to the running sample application by visiting <a href="http://localhost:5001">http://localhost:5001</a>. There's not a lot to the sample, but it does generate some APM data. To generate a bit of load, you can reload the page a few times or run a quick little script:</p>
<pre><code class="language-bash">#!/bin/bash
# load_test.sh
url=&quot;http://localhost:5001&quot;
for i in {1..1000}
do
  curl -s -o /dev/null $url
  sleep 1
done
</code></pre>
<p>This will just reload the page every second.</p>
<p>Back in Kibana, navigate back to the APM app (hamburger icon, then select <strong>APM</strong> ) and you should see our new flask-app service (I let mine run so it shows a bit more history):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-services.png" alt="" /></p>
<p>The Service Overview page provides an at-a-glance summary of the health of a service in one place. If you're a developer or an SRE, this is the page that will help you answer questions like:</p>
<ul>
<li>How did a new deployment impact performance?</li>
<li>What are the top impacted transactions?</li>
<li>How does performance correlate with underlying infrastructure?</li>
</ul>
<p>This view provides a list of all of the applications that have sent application trace data to Elastic APM in the specified period of time (in this case, the last 15 minutes). There are also sparklines showing mini graphs of latency, throughput, and error rate. Clicking on <strong>flask-app</strong> takes us to the <strong>service overview</strong> page, which shows the various transactions within the service (recall that my script is hitting the / endpoint, as seen in the <strong>Transactions</strong> section). We get bigger graphs for <strong>Latency</strong> , <strong>Throughput</strong> , <strong>Errors</strong> , and <strong>Error Rates</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-flask-app.png" alt="" /></p>
<p>When you're instrumenting real applications under real load, you'll see a lot more connectivity (and errors!).</p>
<p>Clicking on a transaction in the transaction view, in this case, our sample app's demo-transaction transaction, we can see exactly what operations were called:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-flask-app-demo-transaction.png" alt="" /></p>
<p>This includes detailed information about calls to external services, such as database queries:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-span-details.png" alt="" /></p>
<h2>What's next?</h2>
<p>Now that you've got your Elastic Observability cluster up and running and collecting out-of-the-box application trace data, explore the public APIs for the languages that your applications are using, which allow you to take your APM data to the next level. The APIs allow you to add custom metadata, define business transactions, create custom spans, and more. You can find the public API specs for the various APM agents (such as <a href="https://www.elastic.co/guide/en/apm/agent/java/current/public-api.html">Java</a>, <a href="https://www.elastic.co/guide/en/apm/agent/ruby/current/api.html">Ruby</a>, <a href="https://www.elastic.co/guide/en/apm/agent/python/current/index.html">Python</a>, and more) on the APM agent <a href="https://www.elastic.co/guide/en/apm/agent/index.html">documentation pages</a>.</p>
<p>If you'd like to learn more about Elastic APM, check out <a href="https://www.elastic.co/webinars/introduction-to-elastic-apm-in-the-shift-to-cloud-native">our webinar on Elastic APM in the shift to cloud native</a> to see other ways that Elastic APM can help you in your ecosystem.</p>
<p>If you decide that you'd rather have us host your observability cluster, you can sign up for a free trial of the <a href="https://www.elastic.co/cloud/">Elasticsearch Service on Elastic Cloud</a> and change your agents to point to your new cluster.</p>
<p><em>Originally published May 5, 2021; updated April 6, 2023.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-thumb-release-apm.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic Agent Builder & OpenTelemetry to Observe Devices]]></title>
            <link>https://www.elastic.co/observability-labs/blog/agent-builder-opentelemetry</link>
            <guid isPermaLink="false">agent-builder-opentelemetry</guid>
            <pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to use Elastic Agent Builder and OpenTelemetry to build IoT observability and gain insights into your appliance usage patterns and efficiency.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><em>“Anything that emits data can be instrumented and observed.”</em></p>
</blockquote>
<p>That’s the mindset that started this little experiment.</p>
<h2><strong>The Curiosity Behind It</strong></h2>
<p>Over the years, I’ve worked closely with customers to design IT solutions that are scalable, secure, and cost-effective. I’ve partnered with them on their cloud migration and digital transformation journeys, enabling full-stack Observability across their cloud and on-premise systems.</p>
<p>One day, I found myself looking at the appliances in my own home — the dishwasher, washer, dryer, and refrigerator — and realized that they, too, were generating valuable data. What if I could observe them? What if the same principles that power enterprise telemetry could help me understand my home appliances — their patterns, behavior, and efficiency?</p>
<p>That curiosity became the seed for this experiment: IoT Observability at home, powered by OpenTelemetry (EDOT) and Agent Builder.</p>
<h2><strong>Building the IoT Observability Foundation</strong></h2>
<p>The idea was simple:</p>
<ol>
<li>Treat every device as a data source.</li>
<li>Use OpenTelemetry to capture signals.</li>
<li>Use EDOT (Elastic Distribution of OpenTelemetry) as a unified collector and exporter.</li>
<li>Send all data to an Elastic Serverless Observability cluster.</li>
<li>Layer Agent Builder on top, to <em>talk</em> to the data using natural language.</li>
</ol>
<p>So now my dishwasher, washer, dryer, and refrigerator are all part of an Elastic-powered, home-scale telemetry pipeline.</p>
<h2><strong>Turning Signals into Stories</strong></h2>
<h3><strong>Technical overview: What does this system do?</strong></h3>
<p>I set up a system that connects my LG ThinQ smart appliances — washer, dryer, dishwasher, and refrigerator — to Home Assistant, turning everyday household devices into observable systems by sending metrics, logs, and traces to Elastic Cloud Serverless.</p>
<p>Key Capabilities:</p>
<p>✅ Natural language queries (Agent Builder)<br />
✅ Real-time appliance state monitoring<br />
✅ Anomaly detection<br />
✅ Full-stack observability</p>
<h3><strong>Architecture overview</strong></h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/architecture-overview.png" alt="Architecture overview Agent Builder OpenTelemetry" /></p>
<h3><strong>The Aha Moment</strong></h3>
<p>What is Agent Builder?</p>
<p><a href="https://www.elastic.co/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">Agent Builder</a> provides an out-of-the-box conversational agent to allow you to immediately start chatting with any data in Elasticsearch (or from external sources through integrations) with a full experience built in Kibana and accessible via API. Developers can also customize their tools to search specific indexes or use <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL</a> for business logic, relevance tuning, or personalization.</p>
<p>It has the ability to transform natural language into intuitive, piped, multi-step ES|QL, giving the agent the power to do analytical and hybrid <a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search">semantic search</a>. Finally, developers can compose custom Agents based on a set of user-defined instructions and a configurable set of available tools, and these Agents can be interacted with via chat in Kibana or via APIs, MCP, and A2A.</p>
<p>Agent Builder creates a transformative experience, turning raw telemetry into an interactive dialogue. So instead of building complex queries manually, I can simply ask:</p>
<p>&quot;Can you show me a report for all my appliances?&quot;
…and voilà, I get the insights right in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/1.png" alt="Appliance activity summary with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/appliance-comparison.png" alt="Appliance comparison with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/graph.png" alt="Appliance activity graph with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/analysis.png" alt="Appliance usage analysis with Agent Builder &amp; OpenTelemetry" /></p>
<h2>Conclusion</h2>
<p>This experiment reminded me that Observability isn’t limited to enterprise systems. Anything that emits data, whether it’s a Kubernetes pod or a coffee maker, has insights that can be uncovered. It could be any IoT devices for that matter, your data center thermostat, your office building badge scanners — all emit telemetry that can be valuable to ensuring safe and efficient operations. The same principles that help organizations gain visibility into production workloads can also bring insights, efficiency, and a sense of connection to the systems around us every day.</p>
<p>By combining OpenTelemetry (EDOT), Elastic Cloud Serverless, and Agent Builder, I realized how simple it can be to go from raw telemetry to conversation — turning metrics into meaning and data into dialogue.</p>
<p>This experiment showed me something simple yet profound: Observability is no longer just about dashboards and alerts; it’s about conversations. When data becomes conversational, insights become accessible to everyone — not just developers or SREs, but anyone curious enough to ask “why?”</p>
<blockquote>
<p>Anything that emits data can be observed.</p>
</blockquote>
<p>Now, with Agent Builder:</p>
<blockquote>
<p>Anything that emits data can also answer back.</p>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/capture-custom-metrics.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Agentic CI/CD: Kubernetes Deployment Gates with Elastic MCP Server]]></title>
            <link>https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server</link>
            <guid isPermaLink="false">agentic-cicd-kubernetes-mcp-server</guid>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Deploy agentic CI/CD gates with Elastic MCP Server. Integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability (O11y)]]></description>
            <content:encoded><![CDATA[<p>The &quot;Build-Push-Deploy&quot; cycle is never simple. High-availability environments require automated guardrails, proactive checks that prevent a deployment from even starting if the target cluster is under stress. Today these are generally performed with APIs and scripts during the CI/CD process. Different gates are initiated during the process to ensure the application tests have passed, the artifact is clean, the infrastructure is stable, and many more. </p>
<p>With AI and agents, these gates are becoming more sophisticated. Increasingly, they rely on a <strong>Model Context Protocol (MCP)</strong> server for the check. This is a newer, more cutting-edge &quot;agentic&quot; approach: it allows your CI/CD pipeline to act as an intelligent agent that &quot;asks&quot; your cluster for its health status before making a change.</p>
<p>A standard Kubernetes deployment workflow generally follows these high-level steps:</p>
<ol>
<li>
<p>Verification Gate: Ensuring all automated testing has passed.</p>
</li>
<li>
<p>Artifact Creation: Building the Docker container.</p>
</li>
<li>
<p>Environment Gate: Verifying that the production Kubernetes environment, supporting infrastructure, and existing applications are healthy.</p>
</li>
<li>
<p>Kubernetes Deployment: Triggering the final release. Modern workflows often use GitOps tools like ArgoCD or Flux, where a simple image tag update in Docker Hub automatically synchronizes the cluster.</p>
</li>
</ol>
<p>Kubernetes health checks can range from simple to complex depending on your Service Level Objectives (SLOs) and operational maturity. Typically, the primary goal is to ensure the cluster is healthy and not nearing a resource bottleneck. Common &quot;red flag&quot; metrics used in these gates include:</p>
<table>
<thead>
<tr>
<th>Red Flag</th>
<th>Scenario</th>
<th>SRE Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Pod Count &gt; 90%</strong></td>
<td>High pod density</td>
<td>Approaching node-level scheduling limits.</td>
</tr>
<tr>
<td><strong>CPU Usage &gt; 70%</strong></td>
<td>High real-time load</td>
<td>Risk of CPU throttling during deployment.</td>
</tr>
<tr>
<td><strong>Memory Usage &gt; 80%</strong></td>
<td>Memory pressure</td>
<td>High risk of Out-of-Memory (OOM) kills.</td>
</tr>
<tr>
<td><strong>OOM Terminating Processes</strong></td>
<td>Resource limits reached</td>
<td>Inadequate pod configuration or sizing.</td>
</tr>
<tr>
<td><strong>Available vs. Requested</strong></td>
<td>Capacity imbalance</td>
<td>Risk of deployment failure due to insufficient reserved space.</td>
</tr>
</tbody>
</table>
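<p>The thresholds above translate naturally into a gate function. Here is a minimal sketch: the limit values come from the table, but the metric names in the dict are hypothetical, not the actual field names used by the agent's queries.</p>

```python
# Red-flag thresholds from the table above (percentages).
THRESHOLDS = {
    "pod_count_pct": 90.0,  # pod density vs. node scheduling limits
    "cpu_pct": 70.0,        # risk of CPU throttling during deployment
    "memory_pct": 80.0,     # risk of Out-of-Memory (OOM) kills
}

def evaluate_cluster(metrics: dict) -> tuple[bool, list[str]]:
    """Return (healthy, red_flags) for a snapshot of cluster metrics."""
    flags = [
        f"{name} at {metrics[name]:.1f}% (limit {limit:.0f}%)"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]
    return (not flags, flags)

# High pod density trips the gate even though CPU and memory are fine.
healthy, flags = evaluate_cluster({"pod_count_pct": 95.0, "cpu_pct": 40.0, "memory_pct": 60.0})
print(healthy, flags)
```

<p>In the pipeline described below, this decision is made by the AI agent from live metrics rather than hard-coded locally, but the pass/block contract is the same.</p>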
<p>I will show you how to build a CI/CD pipeline that integrates Observability AI Agents with GitHub Actions via a Model Context Protocol (MCP) server, creating automated pre-deployment health checks for Kubernetes clusters.</p>
<p>By introducing an observability checkpoint before deployment, we transform the pipeline into an intelligent system that:</p>
<ul>
<li>
<p><strong>Queries real-time metrics</strong> from Kubernetes clusters</p>
</li>
<li>
<p><strong>Analyzes capacity</strong> using custom ES|QL queries</p>
</li>
<li>
<p><strong>Makes autonomous decisions</strong> about deployment readiness</p>
</li>
<li>
<p><strong>Prevents failures proactively</strong> rather than reacting to them</p>
</li>
<li>
<p><strong>Provides actionable feedback</strong> to engineering teams</p>
</li>
</ul>
<p>Here is the “architecture” of what is being deployed and how it works in this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/diagram_mcp_flow.png" alt="Github Actions and Elastic MCP Server Architecture" /></p>
<p>As you can see, the flow uses Elastic Observability, which stores and analyzes Kubernetes OpenTelemetry metrics from the <code>opentelemetry-kube-stack-cluster-stats-collector</code> (deployed via the OpenTelemetry Operator).</p>
<p><a href="https://github.com/features/actions">GitHub Actions</a> calls the Observability Kubernetes Agent via the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic MCP server</a>, which has tools that check for some of the “red flag” issues identified in the table above.</p>
<p>Based on the results, GitHub Actions will either stop the process or continue to deploy the artifact via a trigger for <a href="https://argo-cd.readthedocs.io/en/stable/">ArgoCD</a>.</p>
<p>The Observability Kubernetes Agent, along with several of the tools it uses, was built with Elastic’s Agent Builder capability; both the agent and its tools are then exposed via the MCP server.</p>
<p>Hence the overall set of components used here include:</p>
<ol>
<li>
<p><strong>GitHub Actions</strong>: Orchestrates the build and deployment workflow</p>
</li>
<li>
<p><strong>Elastic MCP Server</strong>: Serverless endpoint that exposes AI agents</p>
</li>
<li>
<p><strong>Observability Kubernetes Agent</strong>: Custom agent with specialized ESQL tools</p>
</li>
<li>
<p><strong>Kubernetes Cluster</strong>: Target deployment environment with metrics collection</p>
</li>
<li>
<p><strong>ES|QL Query Tools</strong>: Precision queries for node and pod resource analysis</p>
</li>
</ol>
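<p>The gating behavior these components implement can be sketched in a few lines. The following TypeScript helper is a hypothetical illustration (the phrase matching is an assumption for this demo, not part of the Elastic MCP API): it inspects the agent’s natural-language reply and decides whether the workflow should fail.</p>
<pre><code class="language-ts">// Hypothetical sketch: decide whether to block the deploy based on the
// agent's natural-language health summary. The keyword check below is an
// illustrative assumption, not part of the Elastic MCP API.
function shouldBlockDeployment(agentResponse: string): boolean {
  const text = agentResponse.toLowerCase();
  // The agent flags capacity problems with phrases like these in its summary
  return text.includes('more than 25%') || text.includes('exceeding 25%');
}
</code></pre>
<p>In the workflow, a result of <code>true</code> corresponds to exiting with code 1, which is what blocks the downstream deploy job.</p>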
<h2>What Happens When a Kubernetes Health Check Fails in GitHub Actions?</h2>
<h3>How the Pipeline Blocks a Deployment Automatically</h3>
<p>When the cluster exceeds capacity thresholds, the workflow automatically blocks the deployment. In this scenario I didn’t actually load the cluster; instead, I used a simple check for whether more than 25% of resources were in use, a deliberately low threshold, to force the deployment to stop.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/github_actions_failure.png" alt="GitHub Actions workflow run with a failed health check blocking deployment" /></p>
<p>The workflow shows:</p>
<ul>
<li>
<p>Build Docker Image (28s)</p>
</li>
<li>
<p>Push to Docker Hub (5s)</p>
</li>
<li>
<p>K8s Health via Elastic O11y K8s Agent (16s) - <strong>FAILED</strong></p>
</li>
<li>
<p>Deploy to otel-test Cluster - <strong>BLOCKED</strong></p>
</li>
</ul>
<p><strong>Annotation</strong>: &quot;Cluster has resource issues - blocking deployment&quot;</p>
<h3>What Does the AI Agent's Health Check Response Look Like?</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/github_actions_logs.png" alt="GitHub Actions logs showing the agent's health check response" /></p>
<p>The agent provides detailed analysis:</p>
<pre><code class="language-bash">Step 1: Finding Kubernetes analysis agent...
Found agent: Observability Kubernetes Agent (kubernetes_analysis_agent)

Step 2: Querying cluster health...
Prompt: tell me if my cluster otel-test is using more than 25% memory or CPU on any of its nodes

Agent Response:
================================================================
Yes, your cluster &quot;otel-test&quot; has nodes and pods using more than 25% of resources.

++Node exceeding 25%:++
- ip-192-168-165-175.us-west-2.compute.internal
  - Memory: 36.44%
  - CPU: 7.99% (below threshold)

++All other nodes are below the 25% threshold++ for both CPU and memory.

While the query for pods doesn't show percentage values directly, the data indicates
normal resource usage patterns for the pods in your cluster, with none appearing to
consume excessive resources relative to their allocations.
================================================================

Cluster has resource issues - blocking deployment
Error: Process completed with exit code 1.

</code></pre>
<p>As you can see, a natural-language prompt was sent to the Observability Kubernetes Agent via MCP, rather than having to build custom logic or call another script.</p>
<p>This single check prevented:</p>
<ul>
<li>
<p>A deployment that would have failed</p>
</li>
<li>
<p>Wasted CI/CD minutes</p>
</li>
<li>
<p>Potential service degradation</p>
</li>
<li>
<p>Manual SRE intervention</p>
</li>
</ul>
<p>What it provided:</p>
<ul>
<li>Actionable intelligence for capacity planning</li>
</ul>
<h2>How to Build a Kubernetes Health Check Agent in Elastic</h2>
<p>Building the agent isn’t hard: <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic’s Agent Builder</a> UI makes it easy to create the agent and have it running in minutes.</p>
<h3>How to Configure the Observability Kubernetes Agent</h3>
<p>Other than naming the agent, you need to provide it with some instructions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/elastic_agent_config.png" alt="Observability Kubernetes Agent configuration in Elastic Agent Builder" /></p>
<p><strong>Custom Instructions</strong>:</p>
<pre><code class="language-bash"># Agent Instructions

## Primary Role
You are a Kubernetes monitoring assistant that helps users analyze cluster performance
and resource utilization. Your primary goal is to provide clear, accurate information
about Kubernetes clusters using available data sources.

## Tool Selection Guidelines
1. When users ask about Kubernetes metrics, node performance, or cluster health:
   - Use ESQL tools for detailed analysis
   - Query metrics from kubeletstatsreceiver.otel-default

2. For alert-related queries:
   - Use the alerts tool to check active alerts

3. Always provide context about:
   - Time ranges queried
   - Cluster names
   - Resource thresholds

</code></pre>
<h3>How to Write ES|QL Queries for Kubernetes Node and Pod Metrics</h3>
<p>I created several tools that check node CPU and memory, pod CPU and memory, and pod OOM events. Additionally, the Observability Kubernetes Agent makes use of many of the out-of-the-box (OOTB) tools, such as observability_alerts.</p>
<p>Here is an example of the node CPU and memory tool, which uses a simple ES|QL query against OpenTelemetry metrics to check the CPU and memory utilization in the cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/elastic_tool_node_metrics.png" alt="Node metrics ES|QL tool in Elastic Agent Builder" /></p>
<p><strong>ES|QL Query</strong>:</p>
<pre><code class="language-bash">FROM metrics-kubeletstatsreceiver.otel-default
| WHERE resource.attributes.k8s.cluster.name == ?cluster_name
  AND @timestamp &gt; NOW() - 3 hours
| STATS
    avg_cpu_usage = AVG(metrics.k8s.node.cpu.usage),
    avg_memory_usage = AVG(metrics.k8s.node.memory.usage),
    avg_memory_available = AVG(metrics.k8s.node.memory.available),
    avg_memory_working_set = AVG(metrics.k8s.node.memory.working_set)
  BY resource.attributes.k8s.node.name
| EVAL
    cpu_usage_pct = avg_cpu_usage * 100,
    memory_usage_pct = (avg_memory_working_set / (avg_memory_working_set + avg_memory_available)) * 100
| SORT cpu_usage_pct DESC, memory_usage_pct DESC
| KEEP resource.attributes.k8s.node.name, cpu_usage_pct, memory_usage_pct
| LIMIT 100

</code></pre>
<p><strong>Parameters</strong>:</p>
<ul>
<li><code>cluster_name</code> (string): Name of the K8s cluster to analyze</li>
</ul>
<h3>How to Expose the Agent via the Elastic MCP Server</h3>
<p>Once configured, the agent is automatically available via Elastic's MCP server running in your Observability project. The MCP server provides a standardized interface that any MCP-compatible client can query.</p>
<p><strong>MCP Endpoint</strong>: <code>https://your-elastic-project.elastic.cloud/mcp</code></p>
<p><strong>Authentication</strong>: Uses Elastic API keys for secure access</p>
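<p>As a rough sketch, a CI step could assemble its request to the MCP endpoint like this. The payload shape and endpoint URL below are illustrative assumptions; only the <code>ApiKey</code> authorization scheme is standard Elastic API key usage — consult the MCP and Agent Builder documentation for the exact contract.</p>
<pre><code class="language-ts">// Hypothetical sketch of a CI step's request to the MCP endpoint. The URL
// and payload shape are assumptions for illustration.
function buildMcpRequest(endpoint: string, apiKey: string, prompt: string) {
  return {
    url: endpoint,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Elastic API keys use the ApiKey authorization scheme
        'Authorization': 'ApiKey ' + apiKey,
      },
      body: JSON.stringify({ prompt: prompt }),
    },
  };
}
</code></pre>
<p>The returned <code>url</code> and <code>init</code> can then be passed to <code>fetch</code> from the workflow script.</p>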
<h2>Why Agentic CI/CD Matters for Kubernetes Operations</h2>
<p>Agentic CI/CD represents an evolution in proactive deployment strategies. By integrating Elastic Observability AI agents with GitHub Actions via MCP, we've created a system that:</p>
<ul>
<li><strong>Prevents failures before they happen</strong></li>
<li><strong>Provides real-time cluster health insights</strong></li>
<li><strong>Makes data-driven deployment decisions</strong></li>
<li><strong>Reduces operational burden on SRE teams</strong></li>
<li><strong>Improves overall deployment reliability</strong></li>
</ul>
<p>This approach is at the cutting edge of modern CI/CD practices. While traditional pipelines focus solely on the &quot;Build-Push-Deploy&quot; cycle, agentic pipelines introduce automated pre-deployment guardrails using observability data, transforming your CI/CD infrastructure into an intelligent agent that actively protects production environments.</p>
<h2>Resources and Next Steps</h2>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your pipeline.</p>
<h3>Documentation</h3>
<ul>
<li>
<p><a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ESQL Query Language</a></p>
</li>
<li>
<p><a href="https://modelcontextprotocol.io/">Model Context Protocol</a></p>
</li>
<li>
<p><a href="https://docs.github.com/en/actions/using-workflows">GitHub Actions Workflows</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/diagram_mcp_flow.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Bringing observability insights from Elastic AI Assistant to the world of GitHub Copilot]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-assistant-to-github-copilot</link>
            <guid isPermaLink="false">ai-assistant-to-github-copilot</guid>
            <pubDate>Thu, 23 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[GitHub announced GitHub Copilot Extensions this week at Microsoft Build. We are working with the GitHub team to bring observability insights from Elastic AI Assistant to GitHub Copilot users.]]></description>
            <content:encoded><![CDATA[<p>GitHub <a href="https://github.blog/2024-05-21-introducing-github-copilot-extensions/">announced</a> GitHub Copilot Extensions this week at Microsoft Build. We are working with the GitHub team in the Limited Beta Program to explore bringing observability insights from Elastic AI Assistant to GitHub Copilot users.</p>
<p>Elastic’s GitHub Copilot Extension aims to combine the capabilities of GitHub Copilot and Elastic AI Assistant for Observability. This could enable developers to access critical insights from Elastic AI Assistant in GitHub Copilot Chat on GitHub.com, Visual Studio, and VS Code - places where they write their code.</p>
<p>Developers will be able to ask questions such as:</p>
<ul>
<li>What errors are active?</li>
<li>What’s the latest stacktrace for my application?</li>
<li>What caused a slowdown in the application after the last push to the dev environment?</li>
<li>How do I write an ES|QL query that my app will send to Elasticsearch?</li>
<li>What runbook from GitHub has been loaded into Elasticsearch and is related to the issue I’m investigating?</li>
</ul>
<p>And many more!</p>
<p><a href="https://build.microsoft.com/en-US/sessions/acc48a7a-b412-4b4f-88a6-53ef4b2cb2bc?source=/schedule">Watch Jeff's PoC Demo@Microsoft Build 2024</a></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-assistant-to-github-copilot/elastic-copilot-vscode.png" alt="Elastic's Copilot Extension in VSCode" /></p>
<p><em>Elastic AI Assistant surfaced in GitHub Copilot Chat from our Extension (Proof of Concept)</em></p>
<h2>What is the Elastic AI Assistant for Observability</h2>
<p>The Elastic AI Assistant for Observability, a user-centric tool, is a game-changer in providing contextual insights and streamlining troubleshooting within the Elastic Observability environment. By harnessing generative AI capabilities, the assistant offers open prompts that decipher error messages and propose remediation actions. It adopts a Retrieval-Augmented Generation (RAG) approach to fetch the most pertinent internal information, such as APM traces, log messages, SLOs, GitHub issues, runbooks, and more. This contextual assistance is a huge leap forward for Site Reliability Engineers (SREs) and operations teams, offering immediate, relevant solutions to issues based on existing documentation and resources, boosting developer productivity.</p>
<p>For more information on setting up and using the AI Assistant for Observability check out the blog <a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI</a>. Additionally, learn how <a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Elastic Observability AI Assistant uses RAG to help analyze application issues with GitHub issues</a>.</p>
<p>One unique feature of the AI Assistant is its API support. This allows you to take advantage of all the capabilities provided by the Elastic AI Assistant, and integrate them right into your workflow.</p>
<h2>What is a GitHub Copilot Extension</h2>
<p>GitHub Copilot Extensions, a new addition to GitHub Copilot, revolutionizes the developer experience by integrating a diverse array of tools and services directly into the developer's workflow. These unique extensions, crafted by partners, enable developers to interact with various services and tools using natural language within their Integrated Development Environment (IDE) or GitHub.com. This integration eliminates the need for context-switching, allowing developers to maintain their flow state, troubleshoot issues, and deploy solutions with unparalleled efficiency. These extensions will be accessible through GitHub Copilot Chat in the GitHub Marketplace, with options for organizations to create private extensions tailored to their internal tooling.</p>
<h2>What’s next</h2>
<p>We are participating in the GitHub Limited Beta Program as a partner and exploring the possibility of bringing the Elastic GitHub Copilot Extension to the GitHub Marketplace. We are excited to unlock insights from Elastic Observability for GitHub Copilot users, side by side with the code behind those services. Stay tuned!</p>
<p>Resources:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting Started with Elastic AI Assistant for Observability with Azure OpenAI</a></li>
<li><a href="https://ela.st/assistant-escapes">The Elastic AI Assistant for Observability escapes Kibana!</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Elastic Observability AI Assistant uses RAG to help analyze application issues with GitHub issues</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks">Troubleshooting with Elastic AI Assistant using your organization's runbooks</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">The AI Assistant Observability documentation</a></li>
<li><a href="https://github.blog/2024-05-21-introducing-github-copilot-extensions/">GitHub Copilot Extensions Blog Announcement</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL documentation</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-assistant-to-github-copilot/githubcopilot-aiassistant-C-2x.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI-driven incident response with logs: A technical deep dive in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-driven-incident-response-with-logs</link>
            <guid isPermaLink="false">ai-driven-incident-response-with-logs</guid>
            <pubDate>Mon, 20 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How Elastic combines ML anomaly detection, ES|QL, and the AI Assistant to accelerate incident response using logs.]]></description>
            <content:encoded><![CDATA[<h1>AI-driven incident response with logs: A technical deep dive in Elastic Observability</h1>
<p>Modern customer‑facing applications, whether e‑commerce sites, streaming platforms, or API gateways, run on fleets of microservices and cloud resources. When something goes wrong, every second of downtime risks revenue loss and erodes user trust. Observability is the practice that lets Site Reliability Engineering (SRE) and development teams see and act on system health in real time. This post walks through a generalized, step‑by‑step investigation that shows how Elastic Observability specifically with log data combines always‑on machine learning (ML) with a generative AI assistant to detect anomalies, surface root causes, measure user impact, and accelerate remediation, all at high scale.</p>
<h2>Anomaly Detection</h2>
<p>A production environment is ingesting millions of log lines per minute. Elastic’s AIOps jobs continuously profile normal log throughput and content without any manual rules. When log volume or message structure deviates beyond learned baselines, the platform automatically fires a high‑fidelity anomaly alert. Because the models are unsupervised, they adapt to changing traffic patterns and flag both sudden spikes (e.g., 10× error surge) and rare new log categories.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image3.png" alt="" /></p>
<p>In addition to looking directly for log spikes, Elastic trains seasonal, univariate models to predict expected event counts per bucket and applies statistical tests to classify outliers. Simultaneously, log categorization clusters similar messages with cosine similarity on token embeddings, making it trivial to identify a previously unseen error string.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image10.png" alt="" /></p>
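<p>To make the clustering idea concrete, here is a simplified TypeScript sketch of cosine similarity over crude term-frequency vectors. Elastic’s categorization operates on token embeddings at scale; this toy version only illustrates why two variants of the same error message cluster together while an unrelated message does not.</p>
<pre><code class="language-ts">// Toy version of the similarity measure behind log categorization.
// Vectors are assumed to have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i !== a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Crude term-frequency vector over a shared vocabulary (a stand-in for
// the token embeddings Elastic actually uses).
function tfVector(message: string, vocab: string[]): number[] {
  const tokens = message.toLowerCase().split(/\W+/);
  return vocab.map(function (term) {
    let count = 0;
    for (const token of tokens) {
      if (token === term) {
        count += 1;
      }
    }
    return count;
  });
}
</code></pre>
<p>With vectors built this way, two “table is full” variants score far higher against each other than either does against an unrelated connection-timeout line.</p>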
<h2>Investigating Alerts: Automated Pattern Analysis</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image9.png" alt="" /></p>
<p>Clicking the alert reveals more than a timestamp. Elastic’s ML job already correlates the spike with the dominant new log pattern <code>ERROR 1114 (HY000): table &quot;orders&quot; is full</code> and surfaces example lines. Instead of grep‑driven hunting, engineers get an immediate hypothesis about which subsystem is failing and why.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image4.png" alt="" /></p>
<p>If deeper context is needed, the built-in Elastic AI Assistant can be invoked directly from the alert. Thanks to Retrieval‑Augmented Generation (RAG) over your telemetry, the assistant explains the anomaly in plain language, references the exact log events, and proposes next steps without hallucinating.</p>
<h2>AI‑Assisted Root Cause Verification</h2>
<p>From within the same chat, you might ask, “Using Lens, create a single graph of all HTTP response status codes &gt;=400 from logs-nginx.access-default over the last 3 hours.” The assistant translates that intent into an ES|QL aggregation, retrieves the data, and renders a bar chart with no DSL knowledge required. If there are a number of errors with status codes at or above 400, you’ve validated that end‑users are impacted.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image7.png" alt="" /></p>
<h2>Global Impact Analysis with Enriched Logs</h2>
<p>Structured log enrichment (e.g., GeoIP, user ID, service tags) lets the assistant answer business questions on the fly. A query like “What are the top 10 source.geo.country_name with http.response.status.code&gt;=400 over the last 3 hours. Use logs-nginx.access-default. Provide counts for each country name.” surfaces whether the incident is regional or global.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image2.png" alt="" /></p>
<h2>Quantifying Business Impact</h2>
<p>Technical metrics alone rarely sway executives. Suppose historical data shows the application normally processes $1,000 in transactions per minute. The assistant can combine that baseline with real‑time failure counts to estimate revenue loss. Presenting financial impact alongside error graphs sharpens prioritization and justifies extraordinary remediation steps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image5.png" alt="" /></p>
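<p>The arithmetic behind such an estimate is simple. A hypothetical helper, using the $1,000-per-minute baseline mentioned above:</p>
<pre><code class="language-ts">// Illustrative only: real failure ratios and durations come from telemetry.
function estimateRevenueLoss(
  baselinePerMinute: number, // normal transaction volume, e.g. 1000 ($/min)
  failedRatio: number,       // fraction of transactions failing (0 to 1)
  durationMinutes: number    // incident duration so far
): number {
  return baselinePerMinute * failedRatio * durationMinutes;
}

// 40% of transactions failing for 15 minutes at a $1,000/min baseline:
// estimateRevenueLoss(1000, 0.4, 15) returns 6000
</code></pre>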
<h2>Pinpointing Infrastructure &amp; Ownership</h2>
<p>Every log is automatically enriched with Kubernetes, cloud, and custom metadata. A single question “Which pod and cluster emit the ‘table full’ error, and who owns it?” returns the full information about the pod, namespace and owner as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image1.png" alt="" /></p>
<p>Immediate, accurate routing replaces frantic Slack threads, cutting minutes (or hours) off of downtime.</p>
<p>Some of the magic happening here is because we can put instructions in the Elastic AI Assistant’s knowledge base to guide it. For example, this simple entry in the knowledge base is what allows the assistant to populate the response in the previous screenshot.</p>
<pre><code class="language-markdown">If asked about Kubernetes pod, namespace, cluster, location, or owner run the &quot;query&quot; tool.
1. Use the index `logs-mysql.error-default` unless another log location is specified.
2. Include the following fields in the query:
   - Pod: `agent.name`
   - Namespace: `data_stream.namespace`
   - Cluster Name: `orchestrator.cluster.name`
   - Cloud Provider: `cloud.provider`
   - Region: `cloud.region`
   - Availability Zone: `cloud.availability_zone`
   - Owner: `cloud.account.id`
3. Use the ES|QL query format:
   esql
   FROM logs-mysql.error-default
   | KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
   
4. Ensure the query is executed within the appropriate time range and context. 
</code></pre>
<h2>Leveraging Institutional Knowledge with RAG</h2>
<p>Elastic can index runbooks, GitHub issues, and wikis alongside telemetry. Asking “Find documentation on fixing a full orders table” retrieves and summarizes a prior runbook that details archiving old rows and adding a partition. Grounding remediation in proven procedures avoids guesswork and accelerates fixes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image6.png" alt="" /></p>
<h2>Automated Communication &amp; Documentation</h2>
<p>Good incident response includes timely stakeholder updates. A prompt such as “Draft an incident update email with root cause, impact, and next steps” lets the assistant assemble a structured message and send it via the alerting framework’s email or Slack connector complete with dashboard links and next‑update timelines. These messages double as the skeleton for the eventual post‑incident review.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image8.png" alt="" /></p>
<p>As before, some of the magic here comes from instructions in the Elastic AI Assistant’s knowledge base. For example, we can instruct the AI Assistant how to call the execute_connector API. This API can execute all kinds of connectors (not only email), so you could use it to tell the assistant to post to Slack, raise a ServiceNow ticket, or even execute webhooks.</p>
<pre><code class="language-markdown">Here are specific instructions to send an email. Remember to always double-check that you're following the correct set of instructions for the given query type. Provide clear, concise, and accurate information in your response.

## Email Instructions

If the user's query requires sending an email:
1. Use the `Elastic-Cloud-SMTP` connector with ID `elastic-cloud-email`.
2. Prepare the email parameters:
   - Recipient email address(es) in the `to` field (array of strings)
   - Subject in the `subject` field (string)
   - Email body in the `message` field (string)
3. Include
   - Details for the alert along with a link to the alert
   - Root cause analysis
   - Revenue impact
   - Remediation recommendations
   - Link to GitHub issue
   - All relevant information from this conversation
   - Link to the Business Health Dashboard
4. Send the email immediately. Do not ask the user for confirmation.
5. Execute the connector using this format:
   
   execute_connector(
     id=&quot;elastic-cloud-email&quot;,
     params={
       &quot;to&quot;: [&quot;recipient@example.com&quot;],
       &quot;subject&quot;: &quot;Your Email Subject&quot;,
       &quot;message&quot;: &quot;Your email content here.&quot;
     }
   )
   
6. Check the response and confirm if the email was sent successfully.
</code></pre>
<h2>Conclusion &amp; Key Takeaways</h2>
<p>Elastic Observability's combination of unsupervised ML, schema-aware data ingestion, and a context-rich RAG powered AI assistant enables teams to transform incident response from reactive firefighting into proactive, data-driven operations. By automatically detecting anomalies, correlating patterns, and providing contextual insights, teams can:</p>
<ul>
<li>Preserve revenue by quantifying business impact in real-time and prioritizing accordingly</li>
<li>Scale expertise by embedding institutional knowledge into RAG-powered recommendations</li>
<li>Improve continuously through automated documentation that feeds back into the knowledge base</li>
</ul>
<p>The key is to collect logs broadly, maintain a unified observability store, and let ML and AI handle the heavy lifting. The payoff isn't just reduced downtime, it's the transformation of incident response from a source of organizational stress into a competitive advantage.</p>
<p>Try out this exact scenario and get hands-on with this Elastic Logging Workshop: <a href="https://play.instruqt.com/elastic/invite/rx4yvknhpfci">https://play.instruqt.com/elastic/invite/rx4yvknhpfci</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/ai-driven-incident-response-with-logs.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI agent observability and monitoring with OTel, OpenLit & Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-observability-web-agents-openlit</link>
            <guid isPermaLink="false">ai-observability-web-agents-openlit</guid>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to monitor AI web agents to identify performance bottlenecks, token waste, and hallucinations using OpenTelemetry, OpenLit, and Elastic]]></description>
            <content:encoded><![CDATA[<p>AI agents don't fail like traditional apps. They hallucinate, loop, burn tokens, and make unpredictable tool calls that standard monitoring was never designed to capture. Traditional APM tools show HTTP status codes and latency, but they miss the AI-specific failures that matter: prompt injection attempts, evaluation score degradation, and tool-calling loops.</p>
<p>This guide explains the key considerations for full-stack monitoring of AI web agents, exploring both best practices and practical examples using OpenLit, OpenTelemetry, and Elastic. Specifically, we'll cover monitoring an example web travel planner <a href="https://github.com/carlyrichmond/observing-ai-agents">located in this example repo</a>.</p>
<h2>Why is AI agent observability different?</h2>
<p>The aim of traditional monitoring is to detect and alert on failures, performance issues, inefficiencies and resource bottlenecks. Monitoring AI agents still adheres to this common goal, but there are several differences that must be considered:</p>
<ul>
<li>AI models are probabilistic, meaning that the same input can lead to different outputs. This makes it hard to define and monitor success based on a single correct answer.</li>
<li>AI systems can appear to function correctly on the surface, but their outputs may be suspect, incorrect, or biased without a way to immediately detect it. Telemetry must therefore be able to capture hidden capabilities such as tool call executions for SREs to scrutinize.</li>
<li>The dynamic and evolving nature of LLMs means their behavior can change dramatically between updates and versions due to changes in data, embeddings, or prompts. This makes monitoring and pre-production evaluation when upgrading vitally important for performance continuity.</li>
<li>Models are black boxes. For this reason it's often difficult to understand why an AI made a particular decision. This makes troubleshooting harder compared to systems with clear, explicit logic.</li>
<li>Beyond traditional metrics, AI output must be monitored for issues like hallucinations (generating false information), toxicity, and bias, which can damage user trust and lead to reputational harm.</li>
<li>The performance of an AI system can vary greatly depending on context, including user interaction. Capturing user prompts alongside telemetry helps establish a complete picture of system performance.</li>
<li>From a security perspective, AI agents can be vulnerable to adversarial attacks including data poisoning and obfuscation. Monitoring for unusual behavioral patterns and prompts is crucial to detect and mitigate these threats.</li>
</ul>
<p>For these reasons the metrics and tracing that SREs capture and investigate will differ.</p>
<h2>AI agent monitoring in practice</h2>
<p>Let's apply these concepts by instrumenting an actual AI agent and capturing telemetry. Here we shall be using the <a href="https://github.com/openlit/openlit">TypeScript SDK of OpenLit</a>, an open-source library that generates OpenTelemetry signals from LLM interactions in JavaScript applications. Specifically, we shall instrument a simple web travel planner agent, available <a href="https://github.com/carlyrichmond/observing-ai-agents">here</a>, that uses LLMs to generate travel recommendations based on user prompts and information from various tools. OpenLit works well for this type of project due to its TypeScript SDK and built-in capabilities for capturing LLM interactions, tool calls, and generating evaluation and guardrail metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/observable-travel-planner.gif" alt="Travel Planner Example Interaction" /></p>
<p>The architecture diagram shows the key components:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/observable-travel-planner-architecture.png" alt="Travel Planner Agent Architecture Diagram" /></p>
<p>The concepts and best practices discussed in this article can be applied to any AI agent regardless of the specific monitoring tools used. Many vendors have AI monitoring capabilities. Alternative open source technologies are also available for agentic monitoring, including <a href="https://www.langchain.com/langsmith/observability">LangSmith</a>, <a href="https://github.com/traceloop/openllmetry">OpenLLMetry</a>, or indeed manual instrumentation using <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry SDKs and the AI semantic conventions</a>.</p>
<h2>Prerequisites</h2>
<p>This project requires that the following prerequisites are met:</p>
<ul>
<li>Active Elastic cluster (Cloud, Serverless or self-managed)</li>
<li>OpenAI, Azure OpenAI, or compatible LLM provider API key</li>
<li>Node.js 18+ with npm or yarn</li>
<li>OTLP-compatible endpoint (Elastic Managed OTLP endpoint or OTel collector)</li>
</ul>
<p>The following environment variables should be set:</p>
<ul>
<li><code>OTEL_ENDPOINT</code>: Your Elastic OTLP endpoint or OTel collector URL</li>
<li><code>OPENAI_API_KEY</code>: API key for the evaluation/guardrail LLM</li>
<li><code>OPENAI_ENDPOINT</code>: Optional custom base URL for OpenAI-compatible providers</li>
</ul>
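<p>A missing endpoint or key typically surfaces only as silent telemetry loss or failed LLM calls, so it can help to validate these variables at startup. Below is a minimal sketch; the <code>requireEnv</code> helper and <code>loadConfig</code> function are illustrative, not part of OpenLit or the example project:</p>

```typescript
// Illustrative helper (not part of OpenLit): fail fast when a required
// environment variable is unset, rather than losing telemetry silently.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// OTEL_ENDPOINT and OPENAI_API_KEY are required; OPENAI_ENDPOINT is optional.
function loadConfig() {
  return {
    otelEndpoint: requireEnv("OTEL_ENDPOINT"),
    openAiApiKey: requireEnv("OPENAI_API_KEY"),
    openAiEndpoint: process.env.OPENAI_ENDPOINT, // may be undefined
  };
}
```

<p>Calling a check like this before <code>openlit.init</code> turns a misconfigured deployment into an immediate, visible error.</p>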
<h2>Basic instrumentation</h2>
<p>DevOps engineers and SREs often start with automatic instrumentation to obtain basic telemetry. This is possible with the <a href="https://docs.openlit.io/latest/openlit/quickstart-ai-observability#python-2">OpenLit Python SDK</a>; with TypeScript, however, we have to add our configuration manually to the AI entrypoint (here <code>api/chat/route.ts</code>).</p>
<p>First we install the dependency using our favourite package manager:</p>
<pre><code class="language-shell">npm install openlit
</code></pre>
<p>Then we add the OpenLit configuration to our entrypoint:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity 

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;, // akin to OTEL resource name
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT, // OTLP compatible endpoint (Elastic ingest or OTel collector)
  disableBatch: true, // batching disabled for demo purposes - not recommended for production use
});

// Post request handler
export async function POST(req: Request) {
   // AI logic omitted for brevity - see full code in repo
}
</code></pre>
<p>This instrumentation will automatically generate OpenTelemetry traces for all LLM interactions, including tool calls, and send them to the specified OTLP endpoint. Note that for production rather than demo usage, <code>environment</code> should be set to <code>production</code> and batching should not be disabled to ensure optimal network usage and protect the OTel backend.</p>
<p>Let's discuss the key telemetry signals that are generated in subsequent sections.</p>
<h3>Inputs</h3>
<p>The first rule of debugging AI agents is simple: if you don't capture the prompt, you can't reproduce the problem. Unlike traditional applications where inputs are predictable request parameters, AI agents consume free-form user prompts that can trigger wildly different behaviors based on subtle phrasing changes. OpenLit automatically captures system prompts and all user messages as structured attributes on your traces, giving you the exact input that caused your agent to hallucinate, loop, or fail.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-prompt-capture.png" alt="Elastic prompt capture example" /></p>
<p>The full conversation needs to be available to SREs so they can understand the context of failures and performance issues and identify patterns in the inputs that may be causing problems. However, these inputs are also useful for improving agent behavior: once sanitized of identifiable attributes such as PII, they can be reused as test messages to evaluate model performance and validate enhancements.</p>
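<p>As an illustration of that sanitization step, a simple redaction pass might mask obvious identifiers before prompts are stored for test reuse. This is a minimal sketch with illustrative patterns; production PII scrubbing needs a dedicated tool rather than two regular expressions:</p>

```typescript
// Illustrative redaction: mask email addresses and long digit runs
// (e.g. phone or booking numbers) before storing prompts for test reuse.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const LONG_DIGITS = /\b\d{7,}\b/g;

function sanitizePrompt(prompt: string): string {
  return prompt
    .replace(EMAIL, "[REDACTED_EMAIL]")
    .replace(LONG_DIGITS, "[REDACTED_NUMBER]");
}
```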
<p>Beyond prompts, we still need comprehensive logging, specifically capturing the full stack traces emitted by our applications. This is crucial for diagnosing issues that arise from the underlying infrastructure or codebase rather than the AI model itself. For example, the error below, sent to Elastic, shows a simple fetch error. We must not forget that traditional errors can still occur in AI applications, and capturing them is essential.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-error-log.png" alt="Elastic error log example" /></p>
<h3>Tracing</h3>
<p>Traces are essential for understanding the flow of requests through your AI agent, especially when it comes to tool calls. Generally, a trace is a hierarchy of spans, each of which is a single, timed unit representing a specific operation, such as a database query or an HTTP handler. In AI systems, spans also represent tool calls made by the LLM, along with the API calls and data retrieval steps performed within the tool execution.</p>
<p>Visualizing tool calling patterns is important in validating pre-production systems as well as monitoring production systems for several reasons:</p>
<ol>
<li>It helps us evaluate the tool calling capabilities of different models. LLMs make the choice of which tools to use based on the user prompt, system instructions and the tool metadata (such as name and description). By visualizing the tool calling patterns we can understand whether the model is correctly interpreting the tool metadata and making appropriate calls based on the prompt.</li>
<li>It allows us to identify inefficient or erroneous tool calling patterns. For example, if we see a pattern of repeated calls to the same tool with similar inputs, it may indicate that the model is stuck in a loop or not effectively utilizing the tools. Conversely, if a single tool is called where we would expect multiple tools to be called, it may indicate that the model is not recognizing that the prompt or system instructions require said tool(s).</li>
<li>Commonly occurring tool-calling patterns can also be identified to optimize the available tools. For example, if the location and weather tools are frequently called together, it may make sense to combine them into a single tool that provides both pieces of information in one call.</li>
</ol>
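<p>The repeated-call pattern in point 2 can even be flagged programmatically from trace data. Below is a hedged sketch that assumes spans have already been reduced to simple <code>{tool, input}</code> records; this shape is illustrative, not an OpenLit or OpenTelemetry API:</p>

```typescript
// Illustrative span summary: a tool name plus its serialized input.
interface ToolCall {
  tool: string;
  input: string;
}

// Flag a likely loop: the same tool invoked with identical input more
// than `threshold` times within a single trace.
function detectRepeatedCalls(calls: ToolCall[], threshold = 2): string[] {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = `${call.tool}:${call.input}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > threshold)
    .map(([key]) => key);
}
```

<p>Wiring a check like this into a scheduled query over trace data can surface looping agents before users report them.</p>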
<p>With the above configuration, we can see the traces for each tool call, as illustrated in the below example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-otel-trace-tool-call.png" alt="Elastic OTel trace tool calling example" /></p>
<h3>Metrics</h3>
<p>While tracing is essential for understanding the flow of requests and tool calls, metrics are crucial for monitoring the overall health and performance of your system, agentic or not. When considering metrics, many think solely of cost and total token usage. While both are important, they are not the only metrics that matter.</p>
<p>Through the example above, OpenLit automatically generates key metrics that can be used to evaluate agent performance, such as request latency, error rates, cost and token usage, which can be visualized in Elastic to identify trends and anomalies. Token usage specifically can be split by input, output and reasoning token counts, helping us identify optimization opportunities at key stages in the generation cycle. For example, an increase in input token counts may indicate a significant increase in context length that can be optimized via prompt and context engineering techniques.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-metrics-dashboard.png" alt="Elastic sample AI metrics dashboard" /></p>
<p>It's also important to capture traditional performance measures such as CPU, memory, and request counts for caches, traditional databases and vector databases. This helps us identify whether performance issues are caused by the AI model itself or by underlying infrastructure problems. Alerting on spikes in key measures, such as large increases in token usage or request volumes, is also considered best practice.</p>
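<p>As a sketch of such an alert condition, a token-usage spike can be approximated by comparing the latest count against a rolling baseline. The window contents and multiplier here are arbitrary illustrative choices, not Elastic defaults:</p>

```typescript
// Flag a token-usage spike: the latest count exceeds the mean of a
// recent window of counts by `factor`. Values here are illustrative.
function isTokenSpike(history: number[], latest: number, factor = 2): boolean {
  if (history.length === 0) return false; // no baseline yet
  const mean = history.reduce((sum, value) => sum + value, 0) / history.length;
  return latest > mean * factor;
}
```

<p>In practice the same condition would be expressed as an alert rule over the ingested metrics rather than application code, but the logic is the same.</p>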
<h2>Evaluation</h2>
<p>AI evaluation refers to the process of assessing the performance of AI models and the quality of the responses they generate. This involves monitoring various metrics and signals to ensure that the AI system is functioning as intended, providing accurate outputs, and not exhibiting undesirable behaviors such as hallucinations, toxicity or bias. While evaluation is typically a pre-production activity used to test and validate an agentic system, it's also important to continue monitoring these signals in production to identify issues over time.</p>
<p>There are several different evaluation methodologies that we can use. OpenLit makes use of <em>AI as a Judge</em>: using an LLM to evaluate the quality of the output generated by another LLM against a set of criteria. An example of this approach is depicted below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/llm-as-a-judge-example.png" alt="LLM as a Judge example (credit Zheng et al. 2023)" /></p>
<p>When considering evaluation from a monitoring viewpoint, it's important to identify hallucinations, bias, toxicity and potential injection issues in production. Hallucinations, bias and toxic responses expose us to reputational risk and loss of user trust. Out of the box, OpenLit identifies the following issues, calculating a score and providing an explanation for each:</p>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hallucinations</td>
<td>The LLM generates false or misleading information based on the provided context and its own knowledge</td>
</tr>
<tr>
<td>Bias</td>
<td>A generated response contains bias or statements negatively impacting protected groups and characteristics including but not limited to gender, ethnicity, socioeconomic status or religion</td>
</tr>
<tr>
<td>Toxicity</td>
<td>The LLM returns harmful or offensive content that is threatening, harassing or dismissive</td>
</tr>
</tbody>
</table>
<p>These issues can be identified using the below code:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;,
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available evaluations
const evalsAll = openlit.evals.All({
  provider: &quot;openai&quot;,
  collectMetrics: true, // Ensures evaluations are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Option 2: enable specific evaluations with custom configuration
const evalsHallucination = openlit.evals.Hallucination({
  provider: &quot;openai&quot;,
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure(&quot;gpt-4o&quot;),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) =&gt; {
        // Concatenate tool results and content as full evaluation context
        const toolResults = steps.flatMap((step) =&gt; {
          return step.content
            .filter((content) =&gt; content.type == &quot;tool-result&quot;)
            .map((c) =&gt; {
              return JSON.stringify(c.output);
            });
        });

        // Measure evaluation
        const evalResults = await evalsAll.measure({
          prompt: prompt,
          contexts: convertedMessages
            .map((m) =&gt; {
              return m.content.toString();
            })
            .concat(toolResults),
          text: text,
        });
        console.log(`Evals results: ${evalResults}`);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      &quot;Unable to generate a plan. Please try again later!&quot;
    );
  }
}
</code></pre>
<p>By using the <code>collectMetrics</code> option, the evaluation results are automatically exported as metrics to Elastic, allowing us to monitor the quality of our AI agent's outputs over time and identify trends or issues that may arise in production. The evaluation results can also be used to trigger alerts or automated responses if certain thresholds or <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">SLOs</a> are breached, such as a high evaluation score sustained for several minutes, an increased number of hallucinations detected over time, or the triggering of a toxic result.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-evals-example.png" alt="Evals example" /></p>
<p>The advantage of using LLMs to evaluate results is that they identify issues and inaccuracies far more quickly than manual quality checks. However, this methodology does have limitations, specifically:</p>
<ol>
<li>Increased cost and latency due to the additional requests to an LLM to evaluate results. This can be mitigated by using a smaller, cheaper model for evaluation, by only evaluating a sample of responses, or by using cached responses to reduce the number of LLM calls for similar questions.</li>
<li>LLM evaluations are prone to biases. Specifically, <a href="https://arxiv.org/pdf/2306.05685">Zheng et al. report in their 2023 paper</a> that LLM evaluations are subject to:</li>
</ol>
<ul>
<li>Positional bias, where an LLM prefers responses where the answer is located in a specific position in the response, and may miss correct answers located elsewhere in the reply.</li>
<li>Self-enhancement bias, where LLMs show preference for responses they have generated compared to other models. This can be a consideration if you wish to use cheaper, or self-hosted models for evaluation.</li>
<li>Verbosity bias, where they prefer more expansive responses over succinct replies.</li>
</ul>
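<p>The sampling mitigation mentioned above can be sketched as a deterministic, hash-based sampler, so a given conversation is consistently either evaluated or skipped rather than chosen at random on each request. The hash function and default rate here are illustrative assumptions:</p>

```typescript
// Simple deterministic 32-bit rolling hash of a conversation id.
function hashString(value: string): number {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
  }
  return hash;
}

// Evaluate only a fixed fraction of conversations, but always make the
// same decision for the same id so results stay comparable over time.
function shouldEvaluate(conversationId: string, sampleRate = 0.1): boolean {
  return hashString(conversationId) / 0xffffffff < sampleRate;
}
```

<p>Raising the sample rate for newly deployed prompts and lowering it once behavior stabilizes is one way to balance evaluation cost against coverage.</p>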
<h2>Guardrail monitoring</h2>
<p>In addition to assessing the quality of responses, we must also monitor for dangerous or irrelevant responses that could be harmful to users. The quality of built-in protections within models is patchy and model-dependent. Several research papers, including <a href="https://www.anthropic.com/research/agentic-misalignment">Anthropic's 2025 agentic misalignment paper</a>, show that in some cases models can resort to malicious behaviors, bypassing company policies and moral expectations.</p>
<p>Guardrail detection in monitoring tools allows us to identify risky responses generated by AI agents, such as generating harmful content, engaging in inappropriate interactions, or performing injection attacks in an attempt to compromise systems or elicit confidential information. Using OpenLit as our example, we are able to monitor for breaches of the following guardrail types:</p>
<table>
<thead>
<tr>
<th>Guardrail Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt Injection</td>
<td>Detection of malicious injection attempts, impersonation and other jailbreaking techniques</td>
</tr>
<tr>
<td>Sensitive Topics</td>
<td>Detection of content on controversial, sensitive or illegal topics such as politics, religion, adult content, substance abuse or violence</td>
</tr>
<tr>
<td>Restricted Topics</td>
<td>Detection of content that violates company policies, ethical guidelines or covers topics that the tool should avoid such as giving financial or legal advice</td>
</tr>
</tbody>
</table>
<p>These shields can be set up using OpenLit as per the below code:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;,
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available guardrails
const guardsAll = openlit.guard.All({
  provider: &quot;openai&quot;,
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT,
  validTopics: [&quot;travel&quot;, &quot;culture&quot;],
  invalidTopics: [&quot;finance&quot;, &quot;software engineering&quot;],
});

// Option 2: enable specific guardrail types (for example, prompt injection detection)
const guardsPromptInjection = openlit.guard.PromptInjection({
  provider: &quot;openai&quot;,
  collectMetrics: true, // Ensures guardrail breaches are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure(&quot;gpt-4o&quot;),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) =&gt; {
        const guardrailResult = await guardsAll.detect(text);
        console.log(`Guardrail results: ${guardrailResult}`);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      &quot;Unable to generate a plan. Please try again later!&quot;
    );
  }
}
</code></pre>
<p>In the event of a guardrail breach, metrics containing detail of the breach are sent to Elastic, similar to the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-guardrail-breach-example.png" alt="Elastic guardrail breach example" /></p>
<p>Of course we can leverage dashboards to visualize trends of guardrail breaches, including metrics such as volumes by category, as shown in the below example (with the corresponding NDJSON available <a href="https://github.com/carlyrichmond/observing-ai-agents/blob/main/dashboard/llm-issues-dashboard.ndjson">here</a>):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-guardrail-dashboard.png" alt="Elastic LLM issues dashboard" /></p>
<p>We can also act on these breaches, notifying relevant teams through the <a href="https://www.elastic.co/docs/explore-analyze/alerting">available alerting tools</a>. Alerts should be triggered based on the severity and classification of the detected issue: for example, mentions of violence, illegal themes, or injection attacks may warrant immediate alerts, whereas minor inaccuracies may not. Guardrail breaches can also trigger automated responses, such as blocking the response from being sent to the user, or warning the user that their request has been flagged for review and triggering a human-in-the-loop response from the relevant teams.</p>
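<p>The escalation logic described above can be sketched as a small policy function mapping a guardrail verdict to an action. The verdict shape, category names and thresholds below are illustrative assumptions, not OpenLit's exact output:</p>

```typescript
// Illustrative verdict shape; real detectors return richer data.
interface GuardrailVerdict {
  category: string; // e.g. "prompt_injection", "violence", "minor_inaccuracy"
  score: number;    // 0 (benign) to 1 (severe)
}

type Action = "block" | "flag_for_review" | "allow";

// Severe categories block as soon as the detector is reasonably confident;
// anything else only escalates for human review at a high score.
const SEVERE_CATEGORIES = new Set(["prompt_injection", "violence", "illegal"]);

function resolveAction(verdict: GuardrailVerdict): Action {
  if (SEVERE_CATEGORIES.has(verdict.category) && verdict.score >= 0.5) {
    return "block";
  }
  if (verdict.score >= 0.8) {
    return "flag_for_review";
  }
  return "allow";
}
```

<p>Keeping the policy in one place like this makes the blocking thresholds reviewable and testable, rather than scattering them across handlers.</p>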
<h2>Conclusion</h2>
<p>AI agents are becoming more autonomous, more powerful, and more unpredictable. For this reason, it's important to introduce monitoring telemetry as early as possible in the development process and in organizational cultures. This article helps you understand how monitoring AI agents is different and how to do it using OpenLit to generate OpenTelemetry signals to send to Elastic. Check out the code <a href="https://github.com/carlyrichmond/observing-ai-agents">here</a> and start monitoring your AI agents in production.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://github.com/carlyrichmond/observing-ai-agents">Observing AI Agents Example</a></li>
<li><a href="https://docs.openlit.io/latest/sdk/overview">OpenLit SDK Documentation</a></li>
<li><a href="https://arxiv.org/pdf/2306.05685">Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | Zheng et al. 2023</a></li>
<li><a href="https://www.anthropic.com/research/agentic-misalignment">Agentic Misalignment: How LLMs could be insider threats | Anthropic</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/travel-planner-blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Observability for Amazon MQ with Elastic: Demystifying Messaging Flows with Real-Time Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/amazonmq-observability-rabbitmq-integration</link>
            <guid isPermaLink="false">amazonmq-observability-rabbitmq-integration</guid>
            <pubDate>Fri, 02 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[RabbitMQ, managed by Amazon MQ, enables asynchronous communication in distributed architectures but introduces operational risks such as retries, processing delays, and queue backlogs. Elastic’s Amazon MQ integration for RabbitMQ delivers deep observability into broker health, queue performance, message flow, and resource usage through Amazon CloudWatch metrics and logs. This blog outlines key operational risks associated with RabbitMQ and explains how Elastic observability helps maintain system reliability and optimize message delivery at scale.]]></description>
            <content:encoded><![CDATA[<h1>Observability for Amazon MQ with Elastic: Demystifying Messaging Flows with Real-Time Insights</h1>
<h2>Managing the Hidden Complexity of Message-Driven Architectures</h2>
<p>Amazon MQ is a managed message broker service for <a href="http://activemq.apache.org/">Apache ActiveMQ</a> Classic and <a href="https://www.rabbitmq.com/">RabbitMQ</a> that handles the setup, operation, and maintenance of message brokers. Messaging systems like RabbitMQ, managed by <a href="https://aws.amazon.com/amazon-mq/">Amazon MQ</a>, are pivotal in modern decoupled, event-driven applications. By serving as an intermediary between services, RabbitMQ facilitates asynchronous communication through message queuing, routing, and reliable delivery, making it an ideal fit for microservices, real-time pipelines, and event-driven architectures. However, this flexibility introduces operational challenges, such as retries, processing delays, consumer failures, and queue backlogs, which can gradually impact downstream performance and system reliability.</p>
<p>With Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a>, users gain deep visibility into message flow patterns, queue performance, and consumer health. This integration allows for the proactive detection of bottlenecks, helps optimize system behaviour, and ensures reliable message delivery at scale.</p>
<p>In this blog, we'll dive into the operational challenges of RabbitMQ in modern architectures, while also examining the common gaps and strategies for overcoming them.</p>
<h2>Why Observability for RabbitMQ on Amazon MQ Matters</h2>
<p>RabbitMQ brokers are integral to distributed systems, handling tasks ranging from order processing to payment workflows and notification delivery. Any disruption can cascade into significant downstream issues. Observability into RabbitMQ helps answer critical operational questions like:</p>
<ul>
<li>Is CPU and memory utilization increasing over time?</li>
<li>What are the trends in the message publish and confirmation rates?</li>
<li>Are consumers failing to acknowledge messages?</li>
<li>Which queues are experiencing abnormal growth?</li>
<li>Is the number of messages being dead-lettered increasing over time?</li>
</ul>
<h2>Enhanced Observability with Amazon MQ Integration</h2>
<p>Elastic provides a dedicated <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> for RabbitMQ that utilizes Amazon CloudWatch metrics and logs to deliver comprehensive observability data. This integration enables the ingestion of metrics related to connections, nodes, queues, exchanges, and system logs.</p>
<p>By deploying <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> with this integration, users can monitor:</p>
<ul>
<li><strong>Queue performance and Dead-letter queue (DLQ) metrics</strong> include total message count (<code>MessageCount.max</code>), messages ready for delivery (<code>MessageReadyCount.max</code>), and unacknowledged messages (<code>MessageUnacknowledgedCount.max</code>). The <code>MessageCount.max</code> metric tracks the total number of messages in a queue, including those that have been dead-lettered; monitoring this over time can help identify trends in message accumulation, which may suggest issues leading to dead-lettering.</li>
<li><strong>Consumer behaviour</strong> through metrics like consumer count (<code>ConsumerCount.max</code>) and acknowledgement rate (<code>AckRate.max</code>), which help identify underperforming consumers or potential backlogs.</li>
<li><strong>Messaging throughput</strong> by tracking publish (<code>PublishRate.max</code>), confirm (<code>ConfirmRate.max</code>), and acknowledgement rates in real time. These are crucial for understanding application messaging patterns and flow.</li>
<li><strong>Broker and node-level health,</strong> including memory usage (<code>RabbitMQMemUsed.max</code>), CPU utilization (<code>SystemCpuUtilization.max</code>), disk availability (<code>RabbitMQDiskFree.min</code>), and file descriptor usage (<code>RabbitMQFdUsed.max</code>). These indicators are essential for diagnosing resource saturation and avoiding service disruption.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-dashboard-overview.png" alt="" /></p>
<h2>Integrating Amazon MQ Metrics into Elastic Observability</h2>
<p>Elastic's Amazon MQ integration facilitates the ingestion of CloudWatch metrics and logs into Elastic Observability, delivering near real-time insights into RabbitMQ. The prebuilt Amazon MQ dashboard visualizes this data, providing a centralized view of broker health, messaging activity, and resource usage, helping users quickly detect and resolve issues. Elastic's <a href="https://www.elastic.co/docs/solutions/observability/incident-management/alerting">alerting</a> for Observability enables proactive notifications based on custom conditions, while its <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">SLO</a> capabilities allow users to define and track key performance targets, strengthening system reliability and service commitments. </p>
<p>Elastic brings together logs and metrics from Amazon MQ alongside data from a wide range of other services and applications, whether running in AWS, on-premises, or across multi-cloud environments, offering unified observability from a single platform.</p>
<h3>Prerequisites</h3>
<p>To follow along, ensure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.16.5 or higher. Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
</ul>
<h2>Architecture</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/rabbitmq_lambda_messageflow.png" alt="" /></p>
<h2>Tracing Audit Flows from RabbitMQ to AWS Lambda</h2>
<p>Consider a financial audit trail use case, where every user action, such as a funds transfer, is published to RabbitMQ. A Python-based AWS Lambda function consumes these messages, deduplicates them using the <strong>id</strong> field, and logs structured audit events for downstream analysis.</p>
<p>Sample payload sent through RabbitMQ:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;txn-849302&quot;,
  &quot;type&quot;: &quot;audit&quot;,
  &quot;payload&quot;: {
    &quot;user_id&quot;: &quot;u-10245&quot;,
    &quot;event&quot;: &quot;funds.transfer&quot;,
    &quot;amount&quot;: 1200.75,
    &quot;currency&quot;: &quot;USD&quot;,
    &quot;timestamp&quot;: &quot;T14:20:15Z&quot;,
    &quot;ip&quot;: &quot;192.168.0.8&quot;,
    &quot;location&quot;: &quot;New York, USA&quot;
  }
}
</code></pre>
<p>You can now correlate message publishing activity from RabbitMQ with AWS Lambda invocation logs, track processing latency, and configure alerts for conditions like drops in consumer throughput or an unexpected surge in RabbitMQ queue depth.</p>
<h3>AWS Lambda Function: Processing RabbitMQ Messages</h3>
<p>This Python-based AWS Lambda function processes audit events received from RabbitMQ. It deduplicates messages based on the <strong>id</strong> field and logs structured event data for downstream analysis or compliance. Save the code below in a file named <strong>app.py</strong>.</p>
<pre><code class="language-python">import json
import logging
import base64
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# In-memory set to track processed message IDs for deduplication
processed_ids = set()
def lambda_handler(event, context):
    logger.info(&quot;Lambda triggered by RabbitMQ event&quot;)
    if 'rmqMessagesByQueue' not in event:
        logger.warning(&quot;Invalid event: missing 'rmqMessagesByQueue'&quot;)
        return {'statusCode': 400, 'body': 'Invalid RabbitMQ event'}
    for queue_name, messages in event['rmqMessagesByQueue'].items():
        logger.info(f&quot;Processing queue: {queue_name}, Messages count: {len(messages)}&quot;)
        for msg in messages:
            try:
                raw_data = msg['data']
                decoded_json = base64.b64decode(raw_data).decode('utf-8')
                message = json.loads(decoded_json)
                logger.info(f&quot;Decoded message: {json.dumps(message)}&quot;)
                message_id = message.get('id')
                if not message_id:
                    logger.warning(&quot;Message missing 'id', skipping.&quot;)
                    continue
                if message_id in processed_ids:
                    logger.warning(f&quot;Duplicate message detected: {message_id}&quot;)
                    continue
                payload = message.get('payload', {})
                logger.info(f&quot;Processing message ID: {message_id}&quot;)
                logger.info(f&quot;Event Type: {message.get('type')}&quot;)
                logger.info(f&quot;User ID: {payload.get('user_id')}&quot;)
                logger.info(f&quot;Event: {payload.get('event')}&quot;)
                logger.info(f&quot;Amount: {payload.get('amount')} {payload.get('currency')}&quot;)
                logger.info(f&quot;Timestamp: {payload.get('timestamp')}&quot;)
                logger.info(f&quot;IP Address: {payload.get('ip')}&quot;)
                logger.info(f&quot;Location: {payload.get('location')}&quot;)
                processed_ids.add(message_id)
            except Exception as e:
                logger.error(f&quot;Error processing message: {str(e)}&quot;)
    return {'statusCode': 200, 'body': 'Messages processed successfully'}

</code></pre>
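<p>Before wiring the function to a real broker, you can sanity-check the event handling locally. The sketch below builds a synthetic Amazon MQ event (the queue key format <code>myQueue::/</code> and the single-message shape are assumptions modeled on the handler above, not output captured from AWS) and verifies that a base64-encoded payload round-trips through the same decode steps the handler performs:</p>

```python
import base64
import json

# Build a synthetic event in the shape lambda_handler expects:
# queue names map to lists of messages whose "data" field is
# base64-encoded JSON.
sample = {"id": "txn-1", "type": "audit", "payload": {"user_id": "u-1"}}
event = {
    "rmqMessagesByQueue": {
        "myQueue::/": [
            {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
        ]
    }
}

# Replicate the handler's decode step and confirm the payload round-trips.
msg = event["rmqMessagesByQueue"]["myQueue::/"][0]
decoded = json.loads(base64.b64decode(msg["data"]).decode("utf-8"))
assert decoded == sample
```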
<h3>Setting up AWS Secrets Manager</h3>
<p>To securely store and manage your RabbitMQ credentials, use AWS Secrets Manager.</p>
<ol>
<li>
<p><strong>Create a New Secret:</strong></p>
<ul>
<li>Navigate to the <a href="https://console.aws.amazon.com/secretsmanager/">AWS Secrets Manager console</a>.</li>
<li>Choose <strong>Store a new secret</strong>.</li>
<li>Select <strong>Other type of secret</strong>.</li>
<li>Enter the following key-value pairs:
<ul>
<li><code>username</code>: Your RabbitMQ username</li>
<li><code>password</code>: Your RabbitMQ password</li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Configure the Secret:</strong></p>
<ul>
<li>Provide a meaningful name, such as <code>RabbitMQAccess</code>.</li>
<li>Optionally, add tags and set rotation if needed.</li>
</ul>
</li>
<li>
<p><strong>Store the Secret:</strong></p>
<ul>
<li>Review the settings and store the secret. Note the ARN of the secret you have created.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/aws-secret-manager-configuration.png" alt="" /></li>
</ul>
</li>
</ol>
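<p>From application code, the secret can be read back with the AWS SDK. The helper below is a minimal sketch: it accepts any Secrets Manager client (e.g. <code>boto3.client("secretsmanager")</code>) so it can also be exercised against a stub in tests; the secret name <code>RabbitMQAccess</code> matches the one created above, and the caller's IAM role is assumed to allow <code>secretsmanager:GetSecretValue</code>:</p>

```python
import json

def get_rabbitmq_credentials(sm_client, secret_id="RabbitMQAccess"):
    """Return (username, password) from a Secrets Manager secret.

    sm_client is a Secrets Manager client, e.g.
    boto3.client("secretsmanager", region_name="us-east-1").
    The secret is expected to hold the username/password keys stored above.
    """
    resp = sm_client.get_secret_value(SecretId=secret_id)
    creds = json.loads(resp["SecretString"])
    return creds["username"], creds["password"]
```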
<h3>Setting up Amazon MQ for RabbitMQ</h3>
<p>To get started with RabbitMQ on Amazon MQ, follow these steps to set up your broker.</p>
<ul>
<li>Open the <a href="https://console.aws.amazon.com/amazonmq/">Amazon MQ console</a>.</li>
<li>Create a new broker with the <strong>RabbitMQ</strong> engine.</li>
<li>Choose your preferred deployment option: <strong>single-instance</strong> or <strong>clustered</strong>.</li>
<li>Use the same <strong>username</strong> and <strong>password</strong> that you previously stored in <strong>AWS Secrets Manager</strong>.</li>
<li>Under <strong>Additional settings</strong>, enable <strong>CloudWatch Logs</strong> for observability.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-cloudwatch-enable.png" alt="" /></li>
<li>Configure access and security settings, ensuring that the broker is accessible to your AWS Lambda function.</li>
</ul>
<ul>
<li>
<p>After the broker is created, note the following important details:</p>
<ul>
<li>ARN of the RabbitMQ broker.</li>
<li>RabbitMQ web console URL.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-configuration-summary.png" alt="" /></li>
</ul>
</li>
<li>
<p>You’ll need the RabbitMQ log group ARN to set up Elastic’s Amazon MQ integration for RabbitMQ. Follow these steps to locate it:</p>
<ul>
<li>Go to the <strong>General – Enabled Logs</strong> section of the broker. </li>
<li>Copy the <strong>CloudWatch log group ARN</strong>.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-loggroup-arn.png" alt="" /></li>
</ul>
</li>
</ul>
<h3>Create a RabbitMQ Queue</h3>
<p>Now that the RabbitMQ broker is configured, use the management console to create a queue where messages will be published.</p>
<ul>
<li>Access the RabbitMQ management console using the web console URL.</li>
<li>Create a new queue (example: <strong>myQueue</strong>) to receive messages.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/rabbitmq-create-queue.png" alt="" /></li>
</ul>
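<p>If you prefer to script this step instead of using the console, the RabbitMQ management HTTP API exposes queue creation as a <code>PUT</code> against <code>/api/queues/&lt;vhost&gt;/&lt;name&gt;</code>. The sketch below uses only the Python standard library; the console URL and credentials are the ones noted earlier, and the default vhost <code>/</code> is percent-encoded as <code>%2F</code> in the API path:</p>

```python
import base64
import json
import urllib.parse
import urllib.request

def queue_api_url(console_url, vhost="/", queue="myQueue"):
    """Management-API URL for a queue, e.g. .../api/queues/%2F/myQueue."""
    encoded_vhost = urllib.parse.quote(vhost, safe="")
    return f"{console_url.rstrip('/')}/api/queues/{encoded_vhost}/{queue}"

def create_queue(console_url, username, password, queue="myQueue"):
    """Declare a durable queue via the management API."""
    req = urllib.request.Request(
        queue_api_url(console_url, queue=queue),
        data=json.dumps({"durable": True}).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    urllib.request.urlopen(req)
```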
<h3>Build and deploy the AWS Lambda function</h3>
<p>In this section, we'll set up the Lambda function using AWS SAM, add the message processing logic, and deploy it to AWS. This Lambda function will be responsible for consuming messages from RabbitMQ and logging audit events.</p>
<p>Before continuing, make sure you have completed the following prerequisites.</p>
<ul>
<li>
<p><a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/prerequisites.html">AWS SAM prerequisites</a></p>
</li>
<li>
<p><a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html">Install the AWS SAM CLI</a></p>
</li>
</ul>
<p>Next, follow the steps outlined below to continue with the setup.</p>
<ol>
<li>In your command line, run the command <code>sam init</code> from a directory of your choice.</li>
<li>The AWS SAM CLI will walk you through the setup.
<ul>
<li>Select <strong>AWS Quick Start Templates</strong>.</li>
<li>Choose the <strong>Hello World Example</strong>.</li>
<li>Use the <strong>Python</strong> runtime and <strong>zip</strong> package type.</li>
<li>Proceed with the default options.</li>
<li>Name your application as <strong>sample-rabbitmq-app</strong>.</li>
<li>The AWS SAM CLI downloads your starting template and creates the application project directory structure.</li>
</ul>
</li>
<li>From your command line, move to the newly created sample-rabbitmq-app directory.
<ul>
<li>Replace the content of the <strong>hello_world/app.py</strong> file with the Lambda function code for RabbitMQ message processing shown earlier.</li>
<li>In the <strong>template.yaml</strong> file, update the file content with the values shown below.
<pre><code class="language-yaml">Resources:
  SampleRabbitMQApp:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: hello_world/
      Description: A starter AWS Lambda function.
      MemorySize: 128
      Timeout: 3
      Handler: app.lambda_handler
      Runtime: python3.10
      PackageType: Zip
      Policies:
        - Statement:
            - Effect: Allow
              Resource: '*'
              Action:
                - mq:DescribeBroker
                - secretsmanager:GetSecretValue
                - ec2:CreateNetworkInterface
                - ec2:DescribeNetworkInterfaces
                - ec2:DescribeVpcs
                - ec2:DeleteNetworkInterface
                - ec2:DescribeSubnets
                - ec2:DescribeSecurityGroups
    Events:
      MQEvent:
        Type: MQ
        Properties:
          Broker: &lt;ARN of the Broker&gt;
          Queues:
            - myQueue
          SourceAccessConfigurations:
            - Type: BASIC_AUTH
              URI: &lt;ARN of the secret&gt;
</code></pre></li>
</ul>
</li>
<li>Run the command <code>sam deploy --guided</code> and wait for the confirmation message. This deploys all of the resources.</li>
</ol>
<h3>Sending Audit Events to RabbitMQ and Triggering Lambda</h3>
<p>To test the end-to-end setup, simulate the flow by publishing audit event data into RabbitMQ using its web UI. Once the message is sent, it triggers the Lambda function. </p>
<ol>
<li>
<p>Navigate to the <a href="https://console.aws.amazon.com/amazon-mq/home">Amazon MQ console</a> and select your newly created broker.</p>
</li>
<li>
<p>Locate and open the RabbitMQ web console URL.<br />
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-webconsole-details.png" alt="" /></p>
</li>
<li>
<p>Under the <strong>Queues and Streams</strong> tab, select the target queue (example: <strong>myQueue</strong>).</p>
</li>
<li>
<p>Enter the message payload, and click <strong>Publish message</strong> to send it to the queue.<br />
Here’s a sample payload published via RabbitMQ:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;txn-849302&quot;,
  &quot;type&quot;: &quot;audit&quot;,
  &quot;payload&quot;: {
    &quot;user_id&quot;: &quot;u-10245&quot;,
    &quot;event&quot;: &quot;funds.transfer&quot;,
    &quot;amount&quot;: 1200.75,
    &quot;currency&quot;: &quot;USD&quot;,
    &quot;timestamp&quot;: &quot;T14:20:15Z&quot;,
    &quot;ip&quot;: &quot;192.168.0.8&quot;,
    &quot;location&quot;: &quot;New York, USA&quot;
  }
}
</code></pre>
</li>
<li>
<p>Navigate to the AWS Lambda function created earlier.</p>
</li>
<li>
<p>Under the <strong>Monitor</strong> tab, click <strong>View CloudWatch logs</strong>.</p>
</li>
<li>
<p>Check the latest log stream to confirm that the Lambda was triggered by Amazon MQ and that the message was processed successfully.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-lambda-logstream.png" alt="" /></p>
</li>
</ol>
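<p>Publishing through the web UI works for one-off tests, but repeated runs are easier to script. The RabbitMQ management API's publish endpoint routes a message through the default exchange (<code>amq.default</code>) straight to the queue named by <code>routing_key</code>; the sketch below (standard library only, reusing the console URL and credentials from earlier, with no error handling) builds and sends that request:</p>

```python
import base64
import json
import urllib.request

def build_publish_body(message, routing_key="myQueue"):
    """JSON body for POST /api/exchanges/%2F/amq.default/publish."""
    return json.dumps({
        "properties": {},
        "routing_key": routing_key,
        "payload": json.dumps(message),
        "payload_encoding": "string",
    }).encode("utf-8")

def publish_audit_event(console_url, username, password, message, queue="myQueue"):
    """Publish one audit-event message to the given queue."""
    url = f"{console_url.rstrip('/')}/api/exchanges/%2F/amq.default/publish"
    req = urllib.request.Request(
        url,
        data=build_publish_body(message, routing_key=queue),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    urllib.request.urlopen(req)
```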
<h1>Configuring Amazon MQ integration for Metrics and Logs collection</h1>
<p>Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> simplifies the collection of logs and metrics from RabbitMQ brokers managed by Amazon MQ. Logs are ingested via <strong>Amazon CloudWatch Logs</strong>, while metrics are fetched from the specified AWS region at a defined interval.</p>
<p>Elastic provides a default configuration for metrics collection. You can accept these defaults or adjust settings such as the <strong>Collection Period</strong> to better fit your needs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-metrics-configuration.png" alt="" /></p>
<p>To enable the collection of logs:</p>
<ol>
<li>Navigate to the <a href="https://console.aws.amazon.com/amazon-mq/home">Amazon MQ console</a> and select the newly created broker.</li>
<li>Click the <strong>Logs</strong> hyperlink under the <strong>General – Enabled Logs</strong> section to open the detailed log settings page.</li>
<li>From this page, copy the <strong>CloudWatch log group ARN</strong>.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-loggroup-arn.png" alt="" /></li>
<li>In <strong>Elastic</strong>, set up the <strong>Amazon MQ integration</strong> and paste the CloudWatch log group ARN.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-logs-configuration.png" alt="" /></li>
<li><strong>Accept Defaults or Customize Settings</strong> – Elastic provides a <strong>default configuration</strong> for logs collection. You can accept these defaults or adjust settings such as <strong>collection intervals</strong> to better fit your needs.</li>
</ol>
<h3>Visualizing RabbitMQ Workloads with the Pre-Built Amazon MQ Dashboard</h3>
<p>You can access the RabbitMQ dashboard in either of two ways:</p>
<ol>
<li>
<p><strong>Dashboard menu</strong>: Select the Dashboard menu option in Elastic and search for <strong>[Amazon MQ] RabbitMQ Overview</strong> to open the dashboard.</p>
</li>
<li>
<p><strong>Integrations menu</strong>: Open the <strong>Integrations</strong> menu in Elastic, select <strong>Amazon MQ</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Amazon MQ] RabbitMQ Overview</strong> from the dashboard assets.</p>
</li>
</ol>
<p>The Amazon MQ RabbitMQ dashboard in the Elastic integration delivers a comprehensive overview of broker health and messaging activity. It provides real-time insights into broker resource utilization, queue and topic performance, connection trends, and messaging throughput. The dashboard helps users track system behavior, detect performance bottlenecks, and ensure reliable message delivery across distributed applications.</p>
<h4>Broker Metrics</h4>
<p>This section provides a centralized view of the overall health and performance of the RabbitMQ broker on Amazon MQ. The visualizations highlight the number of configured exchanges and queues, active broker connections, producers, consumers, and total messages in flight. System-level metrics such as CPU utilization, memory consumption, and free disk space help assess whether the broker has sufficient resources to handle current workloads.</p>
<p>Message flow metrics such as publish rate, confirmation rate, and acknowledgement rate are displayed to provide visibility into how messages are processed through the broker. Monitoring trends in these values helps detect message delivery issues, throughput degradation, or potential saturation of the broker under load.</p>
<h4>Node Metrics</h4>
<p>Node-level visibility helps identify resource imbalances across nodes in clustered RabbitMQ setups. This section includes per-node CPU usage, memory consumption, and available disk space, offering insight into the underlying infrastructure's ability to support broker operations.</p>
<h4>Queue Metrics</h4>
<p>Queue-specific insights are critical for understanding message delivery patterns and backlog conditions. This section details total messages, ready messages, and unacknowledged messages, segmented by broker, virtual host, and queue.</p>
<p>By observing how these counts change over time, users can identify slow consumers, message build-ups, or delivery issues that may affect application performance or lead to dropped messages under pressure.</p>
<h4>Logs</h4>
<p>This section displays log level, process ID, and raw message content. These logs provide immediate visibility into events such as connection failures, resource thresholds being hit, or unexpected queue behaviors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-dashboard.png" alt="" /></p>
<h3>Detecting Queue Backlogs with Alerting Rules</h3>
<p>Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/incident-management/alerting">alerting</a> framework allows you to define rules that monitor critical RabbitMQ metrics and automatically trigger actions when specific thresholds are breached.</p>
<h4>Alert: Queue Backlog (Message Ready or Unacknowledged Messages)</h4>
<p>This alert helps detect queue backlog in Amazon MQ by evaluating two metrics:</p>
<ul>
<li><code>MessageUnacknowledgedCount.max</code></li>
<li><code>MessageReadyCount.max</code></li>
</ul>
<p>The alert is triggered if either condition persists for more than <strong>10 minutes</strong>:</p>
<ul>
<li><code>MessageUnacknowledgedCount.max</code> exceeds <strong>5,000</strong></li>
<li><code>MessageReadyCount.max</code> exceeds <strong>7,000</strong></li>
</ul>
<p>These thresholds should be adjusted based on typical message volume and consumer throughput. Sustained high values can indicate that consumers are not keeping up or that message delivery pipelines are congested, which, if not addressed, can result in processing delays or dropped messages.</p>
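<p>Conceptually, the rule's trigger condition reduces to a simple check over the two metric maxima (the 10-minute persistence requirement is handled by the rule's evaluation window, not shown here). The function below is an illustrative sketch using the example thresholds from above, which you should tune to your workload:</p>

```python
def queue_backlog_breached(unacked_max, ready_max,
                           unacked_threshold=5000, ready_threshold=7000):
    """Fires when either metric's max breaches its threshold:
    MessageUnacknowledgedCount.max > 5,000 or MessageReadyCount.max > 7,000."""
    return unacked_max > unacked_threshold or ready_max > ready_threshold
```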
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-alert-configuration.png" alt="" /></p>
<h3>Tracking Resource Utilization to Maintain RabbitMQ Performance</h3>
<p>Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">service-level objective (SLO)</a> capabilities allow you to define and monitor performance targets using key indicators like latency, availability, and error rates. Once configured, Elastic continuously evaluates these SLOs in real time, offering intuitive dashboards, alerts for threshold violations, and insights into error budget consumption. This enables teams to stay ahead of issues, ensuring service reliability and consistent performance.</p>
<h4>SLO: Node Resource Health (CPU, Memory, Disk)</h4>
<p>This SLO focuses on ensuring RabbitMQ brokers and nodes have sufficient resources to process messages without performance degradation. It tracks CPU, memory, and disk usage across RabbitMQ brokers and nodes to prevent resource exhaustion that could lead to service interruptions.</p>
<p><strong>Target thresholds:</strong></p>
<ul>
<li><code>SystemCpuUtilization.max</code> remains below <strong>85%</strong> for <strong>99%</strong> of the time.</li>
<li><code>RabbitMQMemUsed.max</code> remains below <strong>80%</strong> of <code>RabbitMQMemLimit.max</code> for <strong>99%</strong> of the time.</li>
<li><code>RabbitMQDiskFree.min</code> remains above <strong>25%</strong> of <code>RabbitMQDiskFreeLimit.max</code> for <strong>99%</strong> of the time.</li>
</ul>
<p>Sustained high values in CPU or memory usage can signal resource contention, which may result in slower message processing or downtime. Low disk availability may cause the broker to stop accepting messages, risking message loss. These thresholds are designed to catch early signs of resource saturation and ensure smooth, uninterrupted message flow across RabbitMQ deployments.</p>
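<p>As a rough sketch of how such an SLI is computed, the function below takes a series of metric samples (the field names mirror the CloudWatch metrics listed above; the exact evaluation Elastic performs is not shown here) and returns the fraction of samples in which all three resources are healthy. The SLO targets that fraction staying at or above 0.99:</p>

```python
def node_resource_sli(samples):
    """Fraction of samples where CPU, memory, and disk are all within target.

    Each sample is a dict of CloudWatch metric values for one interval.
    """
    def healthy(s):
        return (s["SystemCpuUtilization"] < 85
                and s["RabbitMQMemUsed"] < 0.80 * s["RabbitMQMemLimit"]
                and s["RabbitMQDiskFree"] > 0.25 * s["RabbitMQDiskFreeLimit"])
    return sum(healthy(s) for s in samples) / len(samples)
```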
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-slo-configuration.png" alt="" /></p>
<h2>Conclusion</h2>
<p>As RabbitMQ-based messaging architectures scale and become more complex, the need for in-depth visibility into system performance and potential issues deepens. Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> brings that visibility front and center—helping you go beyond basic health checks to understand real-time messaging throughput, queue backlog trends, and resource saturation across your brokers and consumers.</p>
<p>By leveraging the prebuilt dashboards, configuring alerts and SLOs, you can proactively detect anomalies, fine-tune consumer performance, and ensure reliable delivery across your event-driven applications.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/AmazonMQ-observability-RabbitMQ.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Analyzing OpenTelemetry apps with Elastic AI Assistant and APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm</link>
            <guid isPermaLink="false">analyzing-opentelemetry-apps-elastic-ai-assistant-apm</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides native OpenTelemetry support, but analyzing applications logs, metrics, and traces can be daunting. Elastic Observability not only provides AIOps features but also an AI Assistant (co-pilot) to help get to MTTR faster.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry is rapidly becoming the most expansive project within the Cloud Native Computing Foundation (CNCF), boasting as many commits as Kubernetes and garnering widespread support from customers. Numerous companies are adopting OpenTelemetry and integrating it into their applications. Elastic® offers detailed <a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">guides</a> on implementing OpenTelemetry for applications. However, like many applications, pinpointing and resolving issues can be time-consuming.</p>
<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Elastic AI Assistant</a> significantly enhances the process, not only in identifying but also in resolving issues. This is further enhanced by Elastic’s new Service Level Objective (SLO) capability, allowing you to streamline your entire site reliability engineering (SRE) process from detecting potential issues to enhancing the overall customer experience.</p>
<p>In this blog, we will demonstrate how you, as an SRE, can detect issues in a service equipped with OpenTelemetry. We will explore problem identification using Elastic APM, Elastic’s AIOps capabilities, and the Elastic AI Assistant.</p>
<p>We will illustrate this using the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>, with a <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flag (cartService)</a> that is activated.</p>
<p>Our walkthrough will encompass two scenarios:</p>
<ol>
<li>
<p>When the SLO for cart service becomes noncompliant, we will analyze the error through Elastic APM. The Elastic AI Assistant will assist by providing a runbook and a GitHub issue to facilitate issue analysis.</p>
</li>
<li>
<p>Should the SLO for the cart service be noncompliant, we will examine the trace that indicates a high failure rate. We will employ AIOps for failure correlation and the AI Assistant to analyze logs and Kubernetes metrics directly from the Assistant.</p>
</li>
</ol>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</p>
</li>
<li>
<p>Additionally you will need to connect your AI Assistant to your favorite LLM. We used Azure OpenAI GPT-4.</p>
</li>
<li>
<p>We also ran the OpenTelemetry Demo on Kubernetes, specifically on GKE.</p>
</li>
</ul>
<h2>SLO noncompliance</h2>
<p>Elastic APM recently released the SLO (Service Level Objective) feature in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation</a>, or ones you define yourself. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>We set up two SLOs for cart service:</p>
<ul>
<li>
<p><strong>Availability SLO</strong> , which monitors its availability by ensuring that transactions succeed. We set up the feature flag in the OpenTelemetry application, which generates an error for EmptyCart transactions 10% of the time.</p>
</li>
<li>
<p><strong>Latency SLO</strong> to ensure transactions are not going below a specific latency, which will reduce customer experiences.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/image1.png" alt="1 - SLOs" /></p>
<p>Because of the OTel cartservice feature flag, the availability SLO is triggered, and within the SLO details, we see that over a seven-day period the availability is well below our 99.9% target, at 95.5%. Additionally, all of the available error budget has been exhausted.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/image2.png" alt="2 - cart service otel" /></p>
<p>With SLOs, you can easily identify when customer experience degrades, or catch potential service issues before they get worse.</p>
<h2>Scenario 1: Analyzing APM trace and logs with AI Assistant</h2>
<p>Once the SLO is found as non-compliant, we can dive into cart service to investigate in Elastic APM. The following walks through the set of steps you can take in Elastic APM and how to use the AI Assistant to analyze the issue:</p>
&lt;Video vidyardUuid=&quot;FSpw53JN9Xu32V1kLQCE8z&quot; /&gt;
<p>From the video, we can see that once in APM, we took the following steps.</p>
<ol>
<li>
<p>Investigated the trace EmptyCart, which was experiencing larger than normal failure rates.</p>
</li>
<li>
<p>The trace showed a significant number of failures, which also resulted in slightly larger latency.</p>
</li>
<li>
<p>We used AIOps failure correlation to identify the potential component causing the failure, which correlated to a field value of FailedPrecondition.</p>
</li>
<li>
<p>While filtering on that value and reviewing the logs, we still couldn’t understand what this meant.</p>
</li>
<li>
<p>This is where you can use Elastic’s AI Assistant to further your understanding of the issue.</p>
</li>
</ol>
<p>AI Assistant helped us analyze the following:</p>
<ol>
<li>
<p>It helped us understand what the log message meant and that it was related to the Redis connection failure issue.</p>
</li>
<li>
<p>Because we couldn’t connect to Redis, we asked the AI Assistant to give us the metrics for the Redis Kubernetes pods.</p>
</li>
<li>
<p>We learned there were two pods for Redis from the logs over the last two hours.</p>
</li>
<li>
<p>However, we also learned that the memory of one seems to be increasing.</p>
</li>
<li>
<p>It seems that Redis restarted (hence the second pod), and with this information we could dive deeper into what happened to Redis.</p>
</li>
</ol>
<p>You can see how quickly we could correlate a significant amount of information, logs, metrics, and traces through the AI Assistant and Elastic’s APM capabilities. We didn’t have to go through multiple screens to hunt down information.</p>
<h2>Scenario 2: Analyzing APM error with AI Assistant</h2>
<p>Once the SLO is found as noncompliant, we can dive into cart service to investigate in Elastic APM. The following walks through the set of steps you can take in Elastic APM and use the AI Assistant to analyze the issue:</p>
&lt;Video vidyardUuid=&quot;dVScqDxPJWCPCeGu8WMoCw&quot; /&gt;
<p>From the video, we can see that once in APM, we took the following steps:</p>
<ol>
<li>
<p>We noticed a specific error for the APM service.</p>
</li>
<li>
<p>We investigated this in the error tab, and while we see it’s an issue with connection to Redis, we still need more information.</p>
</li>
<li>
<p>The AI Assistant helps us understand the stacktrace and provides some potential causes for the error and ways to diagnose and resolve it.</p>
</li>
<li>
<p>We also asked it for a runbook, created by our SRE team, which gives us steps to work through this particular issue.</p>
</li>
</ol>
<p>But as you can see, AI Assistant provides us not only with information about the error message but also how to diagnose it and potentially resolve it with an internal runbook.</p>
<h2>Achieving operational excellence, optimal performance, and reliability</h2>
<p>We’ve shown how an OpenTelemetry instrumented application (OTel demo) can be analyzed using Elastic’s features, especially the AI Assistant coupled with Elastic APM, AIOps, and the latest SLO features. Elastic significantly streamlines the process of identifying and resolving issues within your applications.</p>
<p>Through our detailed walkthrough of two distinct scenarios, we have seen how Elastic APM and the AI Assistant can efficiently analyze and address noncompliance with SLOs in a cart service. The ability to quickly correlate information, logs, metrics, and traces through these tools not only saves time but also enhances the overall effectiveness of the troubleshooting process.</p>
<p>The use of Elastic's AI Assistant in these scenarios underscores the value of integrating advanced AI capabilities into operational workflows. It goes beyond simple error analysis, offering insights into potential causes and providing actionable solutions, sometimes even with customized runbooks. This integration of technology fundamentally changes how SREs approach problem-solving, making the process more efficient and less reliant on manual investigation.</p>
<p>Overall, the advancements in Elastic’s APM, AIOps capabilities, and the AI Assistant, particularly in handling OpenTelemetry data, represent a significant step forward in operational excellence. These tools enable SREs to not only react swiftly to emerging issues but also proactively manage and optimize the performance and reliability of their services, thereby ensuring an enhanced customer experience.</p>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/service-level-objectives-slos-logs-metrics">Build better Service Level Objectives (SLOs) from logs and metrics</a></li>
<li><a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-12-0">Elastic Observability 8.12: GA for AI Assistant, SLO, and Mobile APM support</a></li>
<li><a href="https://www.elastic.co/blog/native-opentelemetry-support-in-elastic-observability">Native OpenTelemetry support in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Context-aware insights using the Elastic AI Assistant for Observability</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/ecs-otel-announcement-3.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using Anomaly Detection in Elastic Cloud to Identify Fraud]]></title>
            <link>https://www.elastic.co/observability-labs/blog/anomaly-detection-to-identify-fraud</link>
            <guid isPermaLink="false">anomaly-detection-to-identify-fraud</guid>
            <pubDate>Thu, 30 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of using Elastic Cloud’s anomaly detection to analyze example credit card transactions to detect potential fraud.]]></description>
            <content:encoded><![CDATA[<p><strong>Fraud detection is one of the most pressing challenges facing the financial services industry today.</strong> With the rise of digital payments, app-based banking, and online financial services, the volume and sophistication of fraudulent activity have grown significantly. In recent years, high-profile incidents like the <a href="https://www.justice.gov/usao-nj/pr/eighteen-people-charged-international-200-million-credit-card-fraud-scam">$200 million credit card fraud scheme</a> uncovered by the U.S. Department of Justice, which involved the creation of thousands of fake identities, have highlighted just how advanced fraud operations have become. These threats pose serious risks to financial institutions and their customers, making real-time fraud prevention an absolute necessity.</p>
<p>Elastic Cloud provides a powerful solution to meet these challenges. Its scalable, high-performance platform enables organizations to ingest and analyze all data types efficiently (from transactional data to customers’ personal information to claims data), delivering actionable insights that empower fraud prevention teams to detect anomalies and stop fraud before it occurs. From identifying unusual spending patterns to uncovering hidden threats, Elastic Cloud offers the speed and flexibility needed to safeguard assets in an increasingly digital economy.</p>
<p>In this blog, we’ll walk you through how Elastic Cloud can be used to identify fraud within credit card transactions—a key area of focus due to the high volume of data and the significant potential for fraudulent activity.</p>
<p>We’ll use a <code>Node.js</code> code example to generate an example set of credit card transactions. The generated data includes an anomaly similar to one that might result from a type of fraud known as “Card Testing,” in which a malicious actor runs transactions to check whether stolen credit card data can still be used. We’ll then import the transactions into an Elastic Cloud index and use Elastic Observability’s Anomaly Detection feature to analyze them for potential signs of “Card Testing.”</p>
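<p>To make the idea concrete before running the real example, here is a minimal, hypothetical sketch of how such data could be generated. The field names (<code>timestamp</code>, <code>IPAddress</code>, <code>cardNumber</code>, <code>amount</code>) and the counts are assumptions for illustration, not necessarily those used by the actual repository:</p>

```javascript
// Hypothetical sketch of generating card-testing-style transaction data.
// Field names and counts are illustrative assumptions, not the actual
// schema used by the observability-examples repository.
function generateTransactions(total = 1000, anomalousCount = 100) {
  const transactions = [];
  const randomIp = () =>
    Array.from({ length: 4 }, () => Math.floor(Math.random() * 254) + 1).join('.');

  // Normal traffic: each transaction comes from its own random IP.
  for (let i = 0; i < total - anomalousCount; i++) {
    transactions.push({
      timestamp: new Date(Date.now() - Math.random() * 86400000).toISOString(),
      IPAddress: randomIp(),
      cardNumber: String(4000000000000000 + Math.floor(Math.random() * 1e9)),
      amount: Number((Math.random() * 200).toFixed(2)),
    });
  }

  // "Card Testing" anomaly: one IP fires many small transactions
  // with many different card numbers in a short window.
  const fraudIp = randomIp();
  for (let i = 0; i < anomalousCount; i++) {
    transactions.push({
      timestamp: new Date(Date.now() - i * 1000).toISOString(),
      IPAddress: fraudIp,
      cardNumber: String(4000000000000000 + Math.floor(Math.random() * 1e9)),
      amount: 1.0, // card testers typically use tiny amounts
    });
  }
  return transactions;
}

// Each line of an NDJSON file is one standalone JSON document.
const ndjson = generateTransactions().map((t) => JSON.stringify(t)).join('\n');
```

<p>The resulting <code>ndjson</code> string could be written to a file with <code>fs.writeFileSync</code>. The key property is that one IP address accounts for a burst of small transactions across many card numbers, which is exactly the pattern a population analysis can surface.</p>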
<h2>Performing fraud detection with Elastic Cloud</h2>
<h3>Generate example credit card transactions</h3>
<p>Begin the process by using a terminal on your local computer to run a <a href="https://github.com/elastic/observability-examples/tree/main/anomaly-detection">Node.js code example</a> that will generate some example credit card transaction data.</p>
<p>Within your terminal window, run the following <strong>git clone</strong> command to clone the Github repository containing the Node.js code example:</p>
<pre><code>git clone https://github.com/elastic/observability-examples
</code></pre>
<p>Run the following <strong>cd</strong> command to change directory to the code example folder:</p>
<pre><code>cd observability-examples/anomaly-detection
</code></pre>
<p>Run the following <strong>npm install</strong> command to install the code example’s dependencies:</p>
<pre><code>npm install
</code></pre>
<p>Enter the following <strong>node</strong> command to run the code example, which generates a newline-delimited JSON (NDJSON) file named transactions.ndjson containing 1,000 example credit card transactions:</p>
<pre><code>node generate-transactions.js 
</code></pre>
<p>Now that we've got some credit card transaction data, we can import the transactions into Elastic Cloud to analyze the data.</p>
<h3>Import transactions data into an Elastic Cloud index</h3>
<p>We’ll start the import process in <a href="https://cloud.elastic.co/">Elastic Cloud</a> by creating an Elastic Serverless project in which we can import and analyze the transaction data. Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-project.png" alt="Create Elastic serverless project" /></p>
<p>Click <strong>Next</strong> in the <strong>Elastic for Observability</strong> project type tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project.png" alt="Create Elastic Observability serverless project" /></p>
<p>Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-confirm.png" alt="Create Elastic Observability serverless project confirm" /></p>
<p>Click <strong>Continue</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-continue.png" alt="Create Elastic Observability serverless project continue" /></p>
<p>Select the <strong>Application</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-application-data-import.png" alt="Select application data import" /></p>
<p>Enter the text “Upload” into the search box.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-search-for-upload-option.png" alt="Data import search for upload option" /></p>
<p>Select the <strong>Upload a file</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-tile.png" alt="Data import select upload tile" /></p>
<p>Click <strong>Select or drag and drop a file.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-file-selector.png" alt="Data import select upload file selector" /></p>
<p>Select the <strong>transactions.ndjson</strong> file on your local computer that was created from running the Node.js code example in a previous step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file.png" alt="Data import select local file" /></p>
<p>Click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import.png" alt="Data import select local file import" /></p>
<p>Enter an <strong>Index name</strong> and click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import-enter-index.png" alt="Data import select local file import enter index" /></p>
<p>You’ll see a confirmation when the import process completes and the new index is successfully created.</p>
<h3>Use Anomaly Detection to analyze credit card transactions</h3>
<p>Anomaly Detection is a powerful tool that can analyze your data to find unusual patterns that would otherwise be difficult, if not impossible, to manually uncover. Now that we've got transaction data loaded into an index, let's use anomaly detection to analyze it. Click <strong>Machine learning</strong> in the navigation menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning.png" alt="Select-machine-learning" /></p>
<p>Select <strong>Anomaly Detection Jobs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-anomaly-detection-jobs.png" alt="Select machine learning anomaly detection jobs" /></p>
<p>Click <strong>Create anomaly detection job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-create-anomaly-detection-job.png" alt="Select machine learning create anomaly detection job" /></p>
<p>Select the Index containing the imported transactions as the data source of the anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-index-for-anomaly-detection-job.png" alt="Select machine learning index for anomaly-detection-job" /></p>
<p>As mentioned above, one form of credit card fraud is “Card Testing,” in which a malicious actor tests a batch of credit cards to determine whether they are still valid.</p>
<p>We can analyze the transaction data in our index to detect fraudulent “Card Testing” by using the anomaly detection <strong>Population</strong> wizard. Select the <strong>Population</strong> wizard tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-for-anomaly-detection-job.png" alt="Select population wizard for anomaly detection job" /></p>
<p>Click <strong>Use full data</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job.png" alt="Select population wizard use full data anomaly detection job" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job-next.png" alt="Select population wizard use full data anomaly detection job next" /></p>
<p>Click the <strong>Population field</strong> selector and select <strong>IPAddress</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population.png" alt="Configure anomaly detection job population" /></p>
<p>Click the <strong>Add metric</strong> option.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-count.png" alt="Configure anomaly detection job population select count" /></p>
<p>Select <strong>Count(Event rate)</strong> as the metric to be added.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-add-metric.png" alt="Configure anomaly detection job population select add metric" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-create-next.png" alt="Configure anomaly detection job population create next" /></p>
<p>Enter a <strong>Job ID</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-enter-job-id-next.png" alt="Configure anomaly detection job population enter job id next" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-confirm-create-next.png" alt="Configure anomaly detection job population confirm create next" /></p>
<p>Click <strong>Create job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-create-job.png" alt="Anomaly detection job create job" /></p>
<p>Once the job completes, click <strong>View results</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-view-results.png" alt="Anomaly detection job view results" /></p>
<p>You should see that an anomaly has been detected: a specific IP address has been identified performing an exceedingly high number of transactions with multiple credit cards on a single day.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected.png" alt="Anomaly detection job anomaly detected" /></p>
<p>You can click the red highlighted segments in the timeline to see more details that can help you evaluate possible remediation actions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected-details.png" alt="Anomaly detection job anomaly detected details" /></p>
<p>In just a few steps, we were able to create a machine learning job that grouped all the transactions by the IP address that sent them and identified slices of time where one IP sent an unusually large number of requests compared to other IPs. Our fraudster!</p>
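<p>Under the hood, the wizard creates a machine learning job. For readers who prefer to script this step, an equivalent population job could be created directly through the Elasticsearch machine learning API, roughly as sketched below. The job name, <code>bucket_span</code>, and the <code>IPAddress</code> and <code>timestamp</code> field names are illustrative assumptions based on this example’s data:</p>

```json
PUT _ml/anomaly_detectors/transactions-card-testing
{
  "description": "Count transactions per IP, compared against the population",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "over_field_name": "IPAddress",
        "detector_description": "count over IPAddress"
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
```

<p>A datafeed pointing at the transactions index would still be needed to feed the job; the wizard configures both for you.</p>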
<h2>Take the next step in fraud prevention</h2>
<p>Fraud detection is an ongoing battle for organizations across industries, and the stakes are higher than ever. As digital payments, insurance claims, and online banking continue to dominate, the need for robust, real-time solutions to detect and prevent fraud is critical. In this blog, we demonstrated how Elastic Cloud empowers organizations to address this challenge effectively.</p>
<p>By using Elastic Cloud’s powerful capabilities, we ingested and analyzed a dataset of credit card transactions to detect potential fraudulent activity, such as “Card Testing.” From ingesting data into an Elastic index to leveraging machine learning-powered anomaly detection, this step-by-step process highlighted how Elastic Cloud can uncover hidden patterns and provide actionable insights to fraud prevention teams.</p>
<p>This example is just the beginning of what Elastic Cloud can do. Its scalable architecture, flexible tools, and powerful analytics make it an invaluable asset for any organization looking to protect their customers and assets from fraud. Whether it's detecting unusual spending patterns, identifying compromised accounts, or monitoring large-scale operations, Elastic Cloud provides the speed, precision, and efficiency financial services organizations need to stay one step ahead of fraudsters.</p>
<p>As fraud continues to evolve, so must the tools we use to combat it. Elastic Cloud gives you the power to meet these challenges head-on, enabling your institution to provide a safer, more secure experience for your customers.</p>
<p>Ready to explore more? View a <a href="https://elastic.navattic.com/fraud-detection">guided tour</a> of all the steps in this blog post or create an <a href="https://cloud.elastic.co/projects">Elastic Serverless Observability project</a> and start analyzing your data for anomalies today.</p>
<p><strong>Related resources:</strong></p>
<ul>
<li><strong>Overview:</strong> <a href="https://www.elastic.co/accelerate-fraud-detection-and-prevention-with-elastic">Accelerate fraud detection and prevention with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/elastic-ai-fraud-detection-financial-services">AI-powered fraud detection: Protecting financial services with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/financial-services-fraud-generative-ai-attack-surface">Fraud in financial services: Leaning on generative AI to protect a rapidly expanding attack surface</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-to-identify-fraud.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The antidote for index mapping exceptions: ignore_malformed]]></title>
            <link>https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed</link>
            <guid isPermaLink="false">antidote-index-mapping-exceptions-ignore-malformed</guid>
            <pubDate>Thu, 03 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[How an almost unknown setting called ignore_malformed can make the difference between dropping a document entirely if a single field is malformed or just ignoring that field and ingesting the document anyway.]]></description>
            <content:encoded><![CDATA[<p>In this article, I'll explain how the setting <em>ignore_malformed</em> can make the difference between a 100% dropping rate and a 100% success rate, even with ignoring some malformed fields.</p>
<p>As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.</p>
<p>During my experience, I have seen all sorts of issues, and users have very different requirements. But at some point, most of them encounter a very common problem with Elasticsearch: <em>index mapping exceptions</em>.</p>
<h2>How mappings work</h2>
<p>Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">index mapping</a> or simply <em>mapping</em>) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.</p>
<p>In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html">Dynamic Mapping</a>.</p>
<h2>What happens when data is malformed?</h2>
<p>Whether you specified a mapping upfront or Elasticsearch inferred one automatically, Elasticsearch will drop an entire document if even a single field doesn't match the index mapping, and return an error instead. This is not much different from what happens with SQL databases or other NoSQL data stores with inferred schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.</p>
<p>A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.</p>
<p>The above situation sounds catastrophic, but it's entirely real: I have seen it many times while on call for support or on <a href="https://discuss.elastic.co/latest">discuss.elastic.co</a>. It is even more likely to happen if you have user-generated documents, since you don't have full control over the quality of your data.</p>
<p>Luckily, there is a setting in Elasticsearch that not many people know about that solves exactly these problems. It has been available since <a href="https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html">Elasticsearch 2.0</a>. That is ancient history, given that the latest version of the stack at the time of writing is <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
<p>Let's now dive into how to use this Elasticsearch feature.</p>
<h2>A toy use case</h2>
<p>To make it easier to interact with Elasticsearch, I am going to use <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana® Dev Tools</a> in this tutorial.</p>
<p>The following examples are taken from the official documentation on <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/ignore-malformed.html#ignore-malformed">ignore_malformed</a>. I am here to expand on those examples by providing a few more details about what happens behind the scenes and on how to search for ignored fields. We are going to use the index name <em>my-index</em>, but feel free to change that to whatever you like.</p>
<p>First, we want to create an index mapping with two fields called <em>number_one</em> and <em>number_two</em>. Both fields have type <em>integer</em>, but only one of them has <strong>ignore_malformed</strong> set to true; the other inherits the default value <em>ignore_malformed: false</em> instead.</p>
<pre><code class="language-json">PUT my-index
{
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;number_one&quot;: {
        &quot;type&quot;: &quot;integer&quot;,
        &quot;ignore_malformed&quot;: true
      },
      &quot;number_two&quot;: {
        &quot;type&quot;: &quot;integer&quot;
      }
    }
  }
}
</code></pre>
<p>If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:</p>
<pre><code class="language-json">{
  &quot;acknowledged&quot;: true,
  &quot;shards_acknowledged&quot;: true,
  &quot;index&quot;: &quot;my-index&quot;
}
</code></pre>
<p>To double-check that the above mapping has been created correctly, we can query the newly created index with the command:</p>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get the following result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        }
      }
    }
  }
}
</code></pre>
<p>Now we can ingest two sample documents — both invalid:</p>
<pre><code class="language-bash">PUT my-index/_doc/1
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_one&quot;: &quot;foo&quot;
}

PUT my-index/_doc/2
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: &quot;foo&quot;
}
</code></pre>
<p>The document with <em>id=1</em> is correctly ingested, while the document with <em>id=2</em> fails with the following error. The only difference between the two documents is which field receives the sample string “foo” instead of an integer.</p>
<pre><code class="language-json">{
  &quot;error&quot;: {
    &quot;root_cause&quot;: [
      {
        &quot;type&quot;: &quot;document_parsing_exception&quot;,
        &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;
      }
    ],
    &quot;type&quot;: &quot;document_parsing_exception&quot;,
    &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;,
    &quot;caused_by&quot;: {
      &quot;type&quot;: &quot;number_format_exception&quot;,
      &quot;reason&quot;: &quot;For input string: \&quot;foo\&quot;&quot;
    }
  },
  &quot;status&quot;: 400
}
</code></pre>
<p>Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same: the entire document is not ingested because part of it doesn’t conform to the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem, and one we need a better way to handle.</p>
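<p>As an aside, the same partial-failure behavior shows up when ingesting with the Bulk API: the response reports <em>errors: true</em>, and each item carries its own status, so only the malformed documents are rejected. A minimal sketch with illustrative document IDs (shown for illustration only; running it adds extra documents to the index):</p>

```bash
POST my-index/_bulk
{ "index": { "_id": "10" } }
{ "text": "Some text value", "number_two": 5 }
{ "index": { "_id": "11" } }
{ "text": "Some text value", "number_two": "foo" }
```

<p>The first item is indexed; the second fails with a parsing error, just like the single-document request above.</p>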
<p>Now that at least one document has been ingested, you can try searching with the following query:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;fields&quot;: [
    &quot;*&quot;
  ]
}
</code></pre>
<p>Here, the parameter <em>fields</em> is required to show the values of those fields that have been ignored. More on this later.</p>
<p>From the result, you can see that only the first document (with <em>id=1</em>) has been ingested correctly while the second document (with <em>id=2</em>) has been completely dropped.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 14,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: null,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: null,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        },
        &quot;fields&quot;: {
          &quot;text&quot;: [&quot;Some text value&quot;],
          &quot;text.keyword&quot;: [&quot;Some text value&quot;]
        },
        &quot;ignored_field_values&quot;: {
          &quot;number_one&quot;: [&quot;foo&quot;]
        },
        &quot;sort&quot;: [&quot;1&quot;]
      }
    ]
  }
}
</code></pre>
<p>From the above JSON response, you will notice some things, such as:</p>
<ul>
<li>A new field called <strong>_ignored</strong> of type array with the list of all fields that were ignored while ingesting the document</li>
<li>A new field called <strong>ignored_field_values</strong> with a dictionary of the ignored fields and their values</li>
<li>The field called <strong>_source</strong> contains the original document unmodified. This is especially useful if you want to fix the problems with the mapping later.</li>
<li>The field called <strong>text</strong> was not present in the original mapping, but it is now included since Elasticsearch automatically inferred its type. In fact, if you query the mapping of the index <strong>my-index</strong> again via the command:</li>
</ul>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get this result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        },
        &quot;text&quot;: {
          &quot;type&quot;: &quot;text&quot;,
          &quot;fields&quot;: {
            &quot;keyword&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 256
            }
          }
        }
      }
    }
  }
}
</code></pre>
<p>Finally, if you ingest some valid documents like the following command:</p>
<pre><code class="language-bash">PUT my-index/_doc/3
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: 10
}
</code></pre>
<p>You can check how many documents have at least one ignored field with the following <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html">Exists query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>
<p>You can also see that, out of the two documents ingested (with <em>id=1</em> and <em>id=3</em>), only the document with <em>id=1</em> contains an ignored field.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 193,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: 1,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: 1,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        }
      }
    ]
  }
}
</code></pre>
<p>Alternatively, you can search for all documents that have a specific field being ignored with this <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html">Terms query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;terms&quot;: {
      &quot;_ignored&quot;: [ &quot;number_one&quot;]
    }
  }
}
</code></pre>
<p>The result in this case will be the same as the previous one, since only one ingested document has that exact field ignored.</p>
<h2>Conclusion</h2>
<p>Because we are big fans of this setting, we've enabled <strong>ignore_malformed</strong> by default for all Elastic integrations and in the <a href="https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/template-resources/src/main/resources/logs-settings.json#L13">default index template for logs data streams</a> as of 8.9.0. More information can be found in the official documentation for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.9/ignore-malformed.html">ignore_malformed</a>.</p>
<p>And since I personally work on this feature, I can reassure you that it is a game changer.</p>
<p>You can set <strong>ignore_malformed</strong> manually on any cluster before Elastic Stack 8.9.0, or you can use the defaults that we set for you starting with <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
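<p>For example, rather than setting the flag field by field, you can enable it for every field in an index (where the field type supports it) through the <em>index.mapping.ignore_malformed</em> index setting. A minimal sketch, with an illustrative index name:</p>

```json
PUT my-lenient-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "number_one": { "type": "integer" },
      "number_two": { "type": "integer" }
    }
  }
}
```

<p>With this setting in place, both fields tolerate malformed values without <em>ignore_malformed: true</em> being repeated on each field definition.</p>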
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/antidote-index-mapping-exceptions-ignore-malformed/illustration-stack-modernize-solutions-1689x980_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Achieving seamless API management: Introducing AWS API Gateway integration with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/api-management-aws-api-gateway-integration</link>
            <guid isPermaLink="false">api-management-aws-api-gateway-integration</guid>
            <pubDate>Thu, 14 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[With Elastic's AWS API Gateway integration, application owners and developers unlock the capability to proactively identify and resolve problems, fine-tune resource utilization, and provide extraordinary digital experiences to their users.]]></description>
            <content:encoded><![CDATA[<p><a href="https://aws.amazon.com/api-gateway/">AWS API Gateway</a> is a powerful service that redefines API management. It serves as a gateway for creating, deploying, and managing APIs, enabling businesses to establish seamless connections between different applications and services. With features like authentication, authorization, and traffic control, API Gateway ensures the security and reliability of API interactions.</p>
<p>In an era where APIs serve as the backbone of modern applications, having the means to maintain visibility and control over these vital components is absolutely essential. In this blog post, we dive deep into the comprehensive observability solution offered by Elastic&lt;sup&gt;®&lt;/sup&gt;, ensuring real-time visibility, advanced analytics, and actionable insights, empowering you to fine-tune your API Gateway for optimal performance.</p>
<p>For application owners and developers, this integration stands as a beacon of empowerment. Elastic's meticulous orchestration of the seamless merging of metrics, logs, and traces, built upon the robust <a href="https://www.elastic.co/elastic-stack">ELK Stack</a> foundation, equips them with potent real-time monitoring and analysis tools. These tools facilitate precise performance optimization and swift issue resolution, all within a secure and dependable environment.</p>
<p>With Elastic's AWS API Gateway integration, application owners and developers unlock the capability to proactively identify and resolve problems, fine-tune resource utilization, and provide extraordinary digital experiences to their users.</p>
<h2>Architecture</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-1-architecture.png" alt="architecture" /></p>
<h2>Why the AWS API Gateway integration matters</h2>
<p>API Gateway now serves as the foundation of contemporary application development, simplifying the process of creating and overseeing APIs on a large scale. Yet, monitoring and troubleshooting these API endpoints can be challenging. With the new AWS API Gateway integration introduced by Elastic, you can gain the following:</p>
<ul>
<li><strong>Unprecedented visibility:</strong> Monitor your API Gateway endpoints' performance, error rates, and usage metrics in real time. Get a comprehensive view of your APIs' health and performance.</li>
<li><strong>Log analysis:</strong> Dive deep into API Gateway logs with ease. Our integration enables you to collect and analyze logs for HTTP, REST, and WebSocket API types, helping you troubleshoot issues and gain valuable insights.</li>
<li><strong>Rapid issue resolution:</strong> Identify and resolve issues in your API Gateway workflows faster than ever. <a href="https://www.elastic.co/observability">Elastic Observability's</a> powerful search and analytics tools help you pinpoint problems with ease.</li>
<li><strong>Alerting and notifications:</strong> Set up custom alerts based on API Gateway metrics and logs. Receive notifications when performance thresholds are breached, ensuring that you can take action promptly.</li>
<li><strong>Optimized costs:</strong> Visualize resource usage and performance metrics for your API Gateway deployments. Use these insights to optimize resource allocation and reduce operational costs.</li>
<li><strong>Custom dashboards:</strong> Create customized dashboards and visualizations tailored to your API Gateway monitoring needs. Stay in control with real-time data and actionable insights.</li>
<li><strong>Effortless integration:</strong> Seamlessly connect your AWS API Gateway to our observability solution. Our intuitive setup process ensures a smooth integration experience.</li>
<li><strong>Scalability:</strong> Whether you have a handful of APIs or a complex API Gateway landscape, our observability solution scales to meet your needs. Grow confidently as your API infrastructure expands.</li>
</ul>
<h2>How to get started</h2>
<p>Getting started with the AWS API Gateway integration in Elastic Observability is straightforward. Here's a quick overview of the steps:</p>
<h3>Prerequisites and configurations</h3>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>
<p>You will need an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack and agent. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS API Gateway logging and analysis.</p>
</li>
<li>
<p>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</p>
</li>
<li>
<p>You can monitor API execution by using CloudWatch, which collects and processes raw data from API Gateway into readable, near-real-time metrics and logs. Details on the required steps to enable logging can be found <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html">here</a>.</p>
</li>
</ol>
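<p>For orientation, REST API execution logs land in a CloudWatch log group whose name follows a fixed convention, <code>API-Gateway-Execution-Logs_{rest-api-id}/{stage-name}</code>. The sketch below (the API ID and stage name are invented placeholders) simply derives that name:</p>
<pre><code class="language-python"># Build the CloudWatch log group name that API Gateway uses for
# REST API execution logs: API-Gateway-Execution-Logs_{api-id}/{stage}
def execution_log_group(api_id, stage):
    return f'API-Gateway-Execution-Logs_{api_id}/{stage}'

# Hypothetical API ID and stage, purely for illustration
print(execution_log_group('a1b2c3d4e5', 'prod'))
</code></pre>
<p>Knowing this convention helps when pointing the integration's CloudWatch log collection at the right log group.</p>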
<h3>Step 1. Create an account with Elastic</h3>
<p><a href="https://cloud.elastic.co/registration?fromURI=/home">Create an account on Elastic Cloud</a> by following the steps provided.</p>
<h3>Step 2. Add integration</h3>
<ul>
<li>Log in to your Elastic Cloud deployment.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-2-signup.png" alt="signup" /></p>
<ul>
<li>Click on <strong>Add integrations</strong>. You will be navigated to a catalog of supported integrations.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-3-welcome-home.png" alt="welcome home dashboard" /></p>
<ul>
<li>Search and select <strong>AWS API Gateway</strong>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-4-integrations.png" alt="Integration " /></p>
<h3>Step 3. Configure integration</h3>
<ul>
<li>Click on the <strong>Add AWS API Gateway</strong> button and provide the required details.</li>
<li>If this is your first time adding an AWS integration, you’ll need to <a href="https://www.elastic.co/guide/en/fleet/current/elastic-agent-installation.html">configure and enroll the Elastic Agent</a> on an AWS instance.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-5-aws-api-gateway.png" alt="aws-api-gateway" /></p>
<ul>
<li>Then complete the “Configure integration” form, providing all the necessary information required for agents to collect the AWS API Gateway metrics and associated CloudWatch logs. Multiple AWS credential methods are supported, including access keys, temporary security credentials, and IAM role ARN. Please see the <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/security-iam.html">IAM security and access documentation</a> for more details. You can choose to collect API Gateway metrics, API Gateway logs via S3, or API Gateway logs via CloudWatch.</li>
<li>Click on the <strong>Save and continue</strong> button at the bottom of the page.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-6-add-aws-integration.png" alt="add-aws-integration" /></p>
<h3>Step 4. Analyze and monitor</h3>
<p>Explore the data using the out-of-the-box dashboards available for the integration. Select <strong>Discover</strong> from the Elastic Cloud top-level menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-7-discover-dashboard.png" alt="discover-dashboard" /></p>
<p>Or, create custom dashboards, set up alerts, and gain actionable insights into your API Gateway service performance.</p>
<p>Here are key monitoring metrics collected through this integration across REST APIs, HTTP APIs, and WebSocket APIs:</p>
<ul>
<li><strong>4XXError</strong> – The number of client-side errors captured in a given period</li>
<li><strong>5XXError</strong> – The number of server-side errors captured in a given period</li>
<li><strong>CacheHitCount</strong> – The number of requests served from the API cache in a given period</li>
<li><strong>CacheMissCount</strong> – The number of requests served from the backend in a given period, when API caching is enabled</li>
<li><strong>Count</strong> – The total number of API requests in a given period</li>
<li><strong>IntegrationLatency</strong> – The time between when API Gateway relays a request to the backend and when it receives a response from the backend</li>
<li><strong>Latency</strong> – The time between when API Gateway receives a request from a client and when it returns a response to the client — the latency includes the integration latency and other API Gateway overhead</li>
<li><strong>DataProcessed</strong> – The amount of data processed in bytes</li>
<li><strong>ConnectCount</strong> – The number of messages sent to the $connect route integration</li>
<li><strong>MessageCount</strong> – The number of messages sent to the WebSocket API, either from or to the client</li>
</ul>
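<p>These metrics combine into useful derived indicators: subtracting IntegrationLatency from Latency approximates the overhead API Gateway itself adds, and comparing CacheHitCount to total cache lookups gives the hit ratio. A minimal sketch with invented sample values:</p>
<pre><code class="language-python"># Illustrative sample values; in practice these come from CloudWatch
latency_ms = 240.0              # Latency
integration_latency_ms = 210.0  # IntegrationLatency
count = 10_000                  # Count
errors_4xx = 120                # 4XXError
errors_5xx = 30                 # 5XXError
cache_hits = 6_500              # CacheHitCount
cache_misses = 3_500            # CacheMissCount

# Overhead added by API Gateway itself, beyond the backend call
gateway_overhead_ms = latency_ms - integration_latency_ms
# Fraction of requests that failed on either side
error_rate = (errors_4xx + errors_5xx) / count
# Fraction of cacheable requests served from the API cache
cache_hit_ratio = cache_hits / (cache_hits + cache_misses)

print('overhead_ms:', gateway_overhead_ms)
print('error_rate:', error_rate)
print('cache_hit_ratio:', cache_hit_ratio)
</code></pre>
<p>The same calculations can be expressed as Lens formulas when building custom dashboards in Kibana.</p>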
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-8-graphs.png" alt="graphs" /></p>
<h2>Conclusion</h2>
<p>The native integration of AWS API Gateway into Elastic Observability marks a significant advancement in streamlining the monitoring and management of your APIs. With this integration, you gain access to a wealth of insights, real-time visibility, and powerful analytics tools, empowering you to optimize your API performance, enhance security, and troubleshoot with ease. Don't miss out on this opportunity to take your API management to the next level, ensuring your digital assets operate at their best, all while providing a seamless experience for your users. Embrace this integration, and stay at the forefront of API observability in the ever-evolving world of digital technology.</p>
<p>Visit our <a href="https://docs.elastic.co/integrations/aws/apigateway">documentation</a> to learn more about Elastic Observability and the AWS API Gateway integration, or <a href="https://www.elastic.co/contact">contact our sales team</a> to get started!</p>
<h2>Start a free trial today</h2>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da&amp;sc_channel=el&amp;ultron=gobig&amp;hulk=regpage&amp;blade=elasticweb&amp;gambit=mp-b">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/illustration-midnight-bg-aws-elastic-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Assembling an OpenTelemetry NGINX Ingress Controller Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/assembling-an-opentelemetry-nginx-ingress-controller-integration</link>
            <guid isPermaLink="false">assembling-an-opentelemetry-nginx-ingress-controller-integration</guid>
            <pubDate>Wed, 15 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post explores how to set up an OpenTelemetry integration for the NGINX Ingress Controller, detailing the configuration process, key transformations, and upcoming enhancements for modular configuration support.]]></description>
            <content:encoded><![CDATA[<p>Our vision is clear: to support OpenTelemetry within Elastic. A key aspect of
this transition is integrations — how can we seamlessly adapt all existing
integrations to fit the OpenTelemetry model?</p>
<p>Elastic integrations are designed to simplify observability by providing tools
to ingest application data, process it through Ingest pipelines, and deliver
prebuilt dashboards for visualization. With OpenTelemetry support, data
collection and processing will transition to the OpenTelemetry Collector, while
dashboards will need to adopt the OpenTelemetry data structure.</p>
<h2>From a Log to an Integration</h2>
<p>Although the concept of an OpenTelemetry Integration has not yet been officially
defined, we envision it as a structured collection of artifacts that enables users
to start monitoring an application from scratch. Each artifact has a specific role;
for example, an OpenTelemetry Collector configuration file, which must be
integrated into the main Collector setup. This bundled configuration instructs
the Collector on how to gather and process data from the relevant application.</p>
<p>In the OpenTelemetry Collector, data collection is handled by the <a href="https://opentelemetry.io/docs/collector/configuration/#receivers">receivers</a>
component. Some receivers are tailored for specific applications, such as Kafka
or MySQL, while others are designed to support general data collection methods.
The specialized receivers combine data gathering and transformation within a
single component. For the more generic receivers, however, additional components
are needed to refine and transform the incoming data into a more
application-specific format. Let’s take a look at how we can build an
integration for monitoring an NGINX Ingress Controller.</p>
<p>Ingress-NGINX is an Ingress controller for Kubernetes, using NGINX as a
reverse proxy and load balancer. Widely adopted, it plays a crucial role in
directing external traffic into Kubernetes services, making its usage,
performance, and health essential to observe. How can we start observing the external
requests made to our Ingress controller? Fortunately, the NGINX Ingress Controller
generates a structured log entry for each processed request. This structured
format ensures that each log entry follows a consistent structure, making it
straightforward to parse and generate consistent output.</p>
<pre><code>log_format upstreaminfo '$remote_addr - $remote_user [$time_local]
	&quot;$request&quot; ' '$status $body_bytes_sent &quot;$http_referer&quot; &quot;$http_user_agent&quot; '
	'$request_length $request_time [$proxy_upstream_name]
	[$proxy_alternative_upstream_name] $upstream_addr ' '$upstream_response_length
	$upstream_response_time $upstream_status $req_id';
</code></pre>
<p>Definitions of all the fields can be found
<a href="https://github.com/kubernetes/ingress-nginx/blob/controller-v1.11.3/docs/user-guide/nginx-configuration/log-format.md">here</a>.</p>
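<p>To make the format concrete, the sketch below parses a hypothetical log line in the upstreaminfo format with an equivalent regular expression (the sample values are invented; the field names follow the NGINX variables above):</p>
<pre><code class="language-python">import re

# A hypothetical access-log line in the upstreaminfo format (values invented)
line = ('203.0.113.7 - alice [22/Jan/2025:10:00:00 +0000] '
        '"GET /products HTTP/1.1" 200 512 "-" "Mozilla/5.0" '
        '320 0.005 [upstream-default-backend] [] 10.0.0.5:80 '
        '512 0.004 200 7f1d2a')

# Rough regex mirroring the variables in the log_format directive
pattern = re.compile(
    r'(\S+) - (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\d+) "([^"]*)" "([^"]*)" '
    r'(\d+) (\S+) \[([^\]]*)\] \[([^\]]*)\] (\S+) (\d+) (\S+) (\d+) (\S+)')

names = ['remote_addr', 'remote_user', 'time_local', 'request', 'status',
         'body_bytes_sent', 'http_referer', 'http_user_agent',
         'request_length', 'request_time', 'proxy_upstream_name',
         'proxy_alternative_upstream_name', 'upstream_addr',
         'upstream_response_length', 'upstream_response_time',
         'upstream_status', 'req_id']

fields = dict(zip(names, pattern.match(line).groups()))
print(fields['remote_addr'], fields['status'], fields['request'])
</code></pre>
<p>This is exactly the kind of consistent structure that makes the log straightforward to parse, as we will do next with the OpenTelemetry Collector.</p>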
<p>The OpenTelemetry Contrib Collector does not include a receiver capable of
reading and parsing all fields in an NGINX Ingress log. There are two primary
reasons for this:</p>
<ul>
<li><strong>Application Diversity</strong>: The landscape of applications is vast, with each
generating logs in unique formats. Developing and maintaining a dedicated
receiver for every application would be resource-intensive and difficult to
scale.</li>
<li><strong>Data Source Flexibility</strong>: Receivers are typically designed to collect data
from a specific source, like an HTTP endpoint. However, in some cases, we
may want to parse logs from an alternate source, such as an NGINX Ingress
log file stored in an AWS S3 bucket.</li>
</ul>
<p>These challenges can be addressed by combining receivers and processors.
Receivers handle the collection of raw data, while processors can extract
specific values when a known data structure is detected. Do we need a dedicated
processor to parse NGINX logs? Not necessarily. The transform processor can
handle this by modifying telemetry data according to a specified configuration.
This configuration is written in the OpenTelemetry Transformation Language
(OTTL), a language for transforming OpenTelemetry data based on the
<a href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/processing.md">OpenTelemetry Collector Processing
Exploration</a>.</p>
<p>The concept of processors in OpenTelemetry is quite similar to the Ingest
pipeline strategy currently used in Elastic integrations. The main challenge,
therefore, lies in migrating Ingest pipeline configurations to OpenTelemetry
Collector configurations. For a deeper dive into the challenges of such
migrations, check out this
<a href="https://www.elastic.co/observability-labs/blog/logstash-to-otel">article</a>.</p>
<p>For reference, you can view the current Elastic NGINX Ingress
Controller Ingest pipeline configuration in the following link: <a href="https://github.com/elastic/integrations/blob/main/packages/nginx_ingress_controller/data_stream/access/elasticsearch/ingest_pipeline/default.yml">Elastic NGINX
Ingress Controller Ingest
Pipeline</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-pipeline-to-otel-pipeline.png" alt="logstash-pipeline-to-otel-pipeline" /></p>
<p>Let’s start with the data collection. By default, the NGINX Ingress Controller
logs to stdout, and Kubernetes captures and stores these logs in a file.
Assuming that the
OpenTelemetry Collector running the following configuration has access to the
Kubernetes Pod logs, we can use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">filelog
receiver</a>
to read the controller logs:</p>
<pre><code class="language-yaml">receivers:
  filelog/nginx:
    include_file_path: true
    include: [/var/log/pods/*nginx-ingress-nginx-controller*/controller/*.log]
    operators:
      - id: container-parser
        type: container
</code></pre>
<p>This configuration is designed to exclusively read the controller's pod logs,
focusing on their default file path within a Kubernetes node. Furthermore, since
the controller's log entries do not inherently include their associated
Kubernetes metadata, the <code>container-parser</code> operator has been added to
bridge this gap. This operator appends Kubernetes-specific attributes, such as
<code>k8s.pod.name</code> and <code>k8s.namespace.name</code>, based solely on information available
from the filename. For a detailed overview of the <code>container-parser</code> operator, see
the following <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">OpenTelemetry blog
post</a>.</p>
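<p>Everything the container parser infers comes from the log file path alone: Kubernetes stores Pod logs under <code>/var/log/pods/&lt;namespace&gt;_&lt;pod&gt;_&lt;uid&gt;/&lt;container&gt;/</code>. A rough sketch of that extraction (not the operator's actual implementation; the path below is invented):</p>
<pre><code class="language-python"># Sketch of deriving k8s metadata from a Pod log path alone.
# Kubernetes writes container logs under:
#   /var/log/pods/NAMESPACE_POD_UID/CONTAINER/N.log
def k8s_attributes_from_path(path):
    parts = path.split('/')
    # Pod and namespace names cannot contain underscores, so this split is safe
    namespace, pod, uid = parts[4].split('_')
    return {
        'k8s.namespace.name': namespace,
        'k8s.pod.name': pod,
        'k8s.pod.uid': uid,
        'k8s.container.name': parts[5],
    }

attrs = k8s_attributes_from_path(
    '/var/log/pods/ingress_nginx-ingress-nginx-controller-abc12_0e1f/controller/0.log')
print(attrs['k8s.pod.name'], attrs['k8s.namespace.name'])
</code></pre>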
<h3>Avoiding duplicated logs</h3>
<p>The configuration outlined in this blog is designed for Kubernetes environments,
where the collector runs as a Kubernetes Pod. In such setups, handling Pod
restarts properly is crucial. By default, the <code>filelog</code> receiver reads the entire
content of log files on startup. This behavior can lead to duplicate log entries
being reprocessed and sent through the pipeline if the collector Pod is
restarted.</p>
<p>To make the configuration resilient to restarts, you can use a storage extension
to track file offsets. These offsets allow the <code>filelog</code> receiver to resume
reading from the last processed position in the log file after a restart. Below
is an example of how to add a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage">file storage extension</a> and update the <code>filelog</code>
receiver configuration to store the offsets in a file:</p>
<pre><code class="language-yaml">extensions:
  file_storage:
    directory: /var/lib/otelcol

receivers:
  filelog/nginx:
    storage: file_storage
    ...
</code></pre>
<p><strong>Important</strong>: The <code>/var/lib/otelcol</code> directory must be mounted as part of a
Kubernetes persistent volume to ensure the stored offsets persist across Pod
restarts.</p>
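<p>On a DaemonSet deployment, one way to satisfy this is a hostPath volume on each node (a minimal sketch; the container and volume names are illustrative):</p>
<pre><code class="language-yaml"># Excerpt from a collector DaemonSet spec: persist offsets on the node
containers:
  - name: otel-collector
    volumeMounts:
      - name: otelcol-state
        mountPath: /var/lib/otelcol
volumes:
  - name: otelcol-state
    hostPath:
      path: /var/lib/otelcol
      type: DirectoryOrCreate
</code></pre>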
<h3>Data transformation with OpenTelemetry processors</h3>
<p>Now it’s time to parse the structured log fields and transform them into
queryable OpenTelemetry fields. Initially, we considered using regular
expressions with the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#extract_patterns">extract_patterns
function</a>
available in the OpenTelemetry Transformation Language (OTTL). However, Elastic
recently contributed a new OTTL function,
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#extractgrokpatterns">ExtractGrokPatterns</a>,
based on Grok—a regular expression dialect that supports reusable, aliased
expressions. The function’s underlying library <a href="https://github.com/elastic/go-grok">Elastic
Go-Grok</a> ships with numerous predefined grok
patterns that simplify working with pattern matching, like <code>%{NUMBER}</code>,
which will match any number type: &quot;123&quot;, &quot;456.789&quot;, &quot;-0.123&quot;.</p>
<p>Each Ingress Controller log entry begins with the client's source IP address
(which may be a single IP or a list of IPs) and the username provided via Basic
authentication, represented as “$remote_addr - $remote_user”. The Grok <code>%{IP}</code> pattern
can be used to parse either an IPv4 or IPv6 address from the remote_addr field,
while the <code>%{GREEDYDATA}</code> pattern can capture the remote_user value.</p>
<p>For example, the following OTTL configuration will transform an unstructured
body message to a structured one with two fields:</p>
<ul>
<li>Parses a single IP address and assigns it to the <code>source.address</code> key.</li>
<li>Delimited by a “-”, captures the optional value of the authenticated username
in the <code>user.name</code> key.</li>
</ul>
<pre><code class="language-yaml">transform/parse_nginx_ingress_access/log:
  log_statements:
    - context: log
      statements:
        - set(body, ExtractGrokPatterns(body, &quot;%{IP:source.address} - (-|%{GREEDYDATA:user.name})&quot;, true))
</code></pre>
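<p>Since Grok patterns compile down to regular expressions, the effect of that statement can be approximated in plain Python (a simplified sketch: the regex below matches IPv4 only, whereas the Grok IP pattern also handles IPv6, and the real OTTL function returns a key-value map that replaces the log body):</p>
<pre><code class="language-python">import re

# Regex roughly equivalent to the Grok pattern
# '%{IP:source.address} - (-|%{GREEDYDATA:user.name})'
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+) - (?:-|(.*))')

address, name = pattern.match('203.0.113.7 - alice').groups()
print(address, name)

address, name = pattern.match('203.0.113.7 - -').groups()
print(name)  # None: no authenticated user
</code></pre>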
<p>The screenshot below illustrates the transformation process, showing the
original input data alongside the resulting structured format (diff):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/data-diff.png" alt="data-transform-diff" /></p>
<p>In real-world scenarios, NGINX Ingress Controller logs may begin with a list of
IP addresses or, at times, a domain name. These variations can be handled with
an extended Grok pattern. Similarly, we can use Grok to parse an HTTP UserAgent
and URL strings, but additional OTTL functions, such as
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#url">URL</a>
or
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#useragent">UserAgent</a>,
are required to extract meaningful data from these fields.</p>
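<p>To illustrate the kind of decomposition such functions perform, Python's standard library splits a URL into comparable components (a sketch only; the OTTL URL function emits attributes following OpenTelemetry semantic conventions, not this exact shape):</p>
<pre><code class="language-python">from urllib.parse import urlsplit

# Hypothetical request URL, broken into scheme, host, port, path, and query
parts = urlsplit('https://shop.example.com:8443/products?category=books')
print(parts.scheme, parts.hostname, parts.port, parts.path, parts.query)
</code></pre>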
<p>The complete configuration is available in the documentation for Elastic’s
OpenTelemetry NGINX Ingress Controller integration: <a href="https://github.com/elastic/integrations/blob/main/packages/nginx_ingress_controller_otel/docs/README.md">Integration
Documentation</a>.</p>
<h2>Usage</h2>
<p>The Elastic OpenTelemetry NGINX Ingress Controller integration is currently in <strong>Technical
preview</strong>. To access it, you must enable the &quot;Display beta integrations&quot; toggle
in the <strong>Integrations</strong> menu within Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/kibana-integration.png" alt="kibana-nginx-integration" /></p>
<p>Installing the Elastic OpenTelemetry NGINX Ingress Controller integration makes
a couple of dashboards available in Kibana. One of these
dashboards provides insights into access events for the controller, displaying
information such as HTTP response status codes over time, request volume per
URL, distribution of incoming requests by browser, top requested pages, and
more. The screenshot below shows the NGINX Ingress Controller Access Logs
dashboard, displaying data from a controller routing requests to an
<a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry Demo</a> deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/main-dashboard.png" alt="nginx-ingress-controller-otel-access-dashboard" /></p>
<p>The second dashboard focuses on errors within the NGINX Ingress Controller, highlighting the
volume of error events generated over time:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/error-access-dashboard.png" alt="nginx-ingress-controller-otel-error-access-dashboard" /></p>
<p>To start gathering and processing controller logs, we recommend incorporating
the OpenTelemetry Collector pipeline outlined in the integration’s documentation
into your collector configuration: <a href="https://www.elastic.co/guide/en/integrations/current/nginx_ingress_controller_otel.html">Integration
Documentation</a>.
Keep in mind that this configuration requires access to the Kubernetes node's
Pod logs, typically stored in <code>/var/log/pods/*</code>. To ensure proper access, we recommend
deploying the OpenTelemetry Collector as a DaemonSet in Kubernetes, as this
deployment type allows the collector to access the necessary log directory on
each node.</p>
<p>The OpenTelemetry Collector configuration service pipeline should include a
similar configuration:</p>
<pre><code class="language-yaml">service:
  extensions: [file_storage]
  pipelines:
    logs/nginx_ingress_controller:
      receivers:
        - filelog/nginx
      processors:
        - transform/parse_nginx_ingress_access/log
        - transform/parse_nginx_ingress_error/log
        - resourcedetection/system
      exporters:
        - elasticsearch
</code></pre>
<h3>Adding GeoIP Metadata</h3>
<p>As an optional enhancement, the OpenTelemetry Collector <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/geoipprocessor">GeoIP processor</a> can be configured and added to the pipeline to enrich each NGINX Ingress Controller log with geographical attributes, such as the request’s originating country, region, and city, enabling geo maps in Kibana to visualize traffic distribution and geographic patterns.</p>
<p>While the OpenTelemetry GeoIP processor is similar to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/geoip-processor.html">Elastic's GeoIP
processor</a>,
it requires users to provide their own local GeoLite2 database. The following
configuration extends the Integration’s configuration to include the GeoIP
processor with a <a href="https://dev.maxmind.com/geoip/geolite2-free-geolocation-data/">MaxMind GeoLite2 database</a>.</p>
<pre><code class="language-yaml">processors:
  geoip:
    context: record
    providers:
      maxmind:
        database_path: /tmp/GeoLite2-City.mmdb

service:
  extensions: [file_storage]
  pipelines:
    logs/nginx_ingress_controller:
      receivers:
        - filelog/nginx
      processors:
        - transform/parse_nginx_ingress_access/log
        - transform/parse_nginx_ingress_error/log
        - resourcedetection/system
        - geoip
      exporters:
        - elasticsearch
</code></pre>
<p>Sample Kibana Map with the OpenTelemetry NGINX Ingress Controller integration:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/geoip-map.png" alt="geoip-map-dashboard" /></p>
<h2>Next steps</h2>
<h3>OpenTelemetry Log Event</h3>
<p>A closer look at the OTTL integration’s statements reveals that the raw log
message is replaced by the parsed fields. In other words, the configuration
transforms the body log field* from a string into a structured map of key-value
pairs, as seen in “set(body, ExtractGrokPatterns(body, ...))”. This approach is
based on treating each NGINX Ingress Controller log entry as an <a href="https://opentelemetry.io/docs/specs/otel/logs/event-api/#event-data-model">OpenTelemetry
Event</a>—a
specialized type of LogRecord. Events are OpenTelemetry’s standardized semantic
formatting for LogRecords, containing an
“<a href="https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md#event-definition">event.name</a>”
attribute which defines the structure of the body field. An NGINX Ingress
Controller log record aligns well with the OpenTelemetry Event data model. It
follows a structured format and clearly distinguishes between two event types:
access logs and error logs. There is an ongoing PR to incorporate the NGINX
Ingress controller log into the OpenTelemetry semantic convention:
<a href="https://github.com/open-telemetry/semantic-conventions/pull/982">https://github.com/open-telemetry/semantic-conventions/pull/982</a></p>
<h3>Operating system breakdown</h3>
<p>Each controller log contains the source UserAgent, from which the integration
extracts the browser that originated the request. This information is valuable
for understanding user access patterns, as it provides insights into the types
of browsers commonly interacting with your services. Additionally, an <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35458">ongoing
pull
request</a>
into OTTL aims to extend this functionality by extracting operating system (OS)
details as well, providing even deeper insights into the environments
interacting with the NGINX Ingress Controller.</p>
<h3>Configuration encapsulation</h3>
<p>Setting up the configuration for the NGINX Ingress Controller integration can be
somewhat tedious, as it involves adding several complex processor configurations
to the existing collector pipelines. This process can quickly become cumbersome,
especially for non-expert users or in cases where the collector configuration is
already quite complex. In an ideal scenario, users would simply reference a
pre-defined integration configuration, and the collector would automatically
&quot;unwrap&quot; all the necessary components into the corresponding pipelines. This
would significantly simplify the setup process, making it more accessible and
reducing the risk of misconfigurations. To address this, there is a
<a href="https://github.com/open-telemetry/opentelemetry-collector/pull/11631">RFC</a>
(Request for Comments) proposing support for shareable, modular configurations
within the OpenTelemetry Collector. This feature would allow users to easily
collect signals from specific services or applications by referencing modular
configurations, streamlining the setup and enhancing usability for complex
scenarios.</p>
<p>*The OpenTelemetry community is currently discussing whether structured
body-extracted information should be stored in the attributes or body field.
For details, see this <a href="https://github.com/open-telemetry/semantic-conventions/issues/1651">ongoing issue</a>.</p>
<blockquote>
<p>This product includes GeoLite2 data created by MaxMind, available from <a href="https://www.maxmind.com">https://www.maxmind.com</a></p>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/ingress-controller.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of Go applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-go-applications-opentelemetry</guid>
            <pubDate>Wed, 02 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting Go applications with OpenTelemetry provides insights into application performance, dependencies, and errors. We'll show you how to automatically instrument a Go application using Docker, with no changes to your application code.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the
cloud-native realm, DevOps and SRE teams are increasingly emerging as essential
partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act
as the stewards of application reliability, scalability, and top-tier
performance. The challenge? These teams require a cutting-edge observability
solution, one that encompasses full-stack insights, empowering them to rapidly
manage, monitor, and rectify potential disruptions before they culminate into
operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere
monitoring — it demands limitless data collection, precision in processing, and
the correlation of this data into actionable insights. However, the road to
achieving this holistic view is paved with obstacles, from navigating version
incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following
benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and
ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide
a complete system view.</li>
<li>Improve your application oversight through richer and enhanced
instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior
instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying
onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every
investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/">automatic instrumentation in
your Go</a>
application using Docker, without the need to refactor any part of your
application code. We will use an <a href="https://github.com/elastic/observability-examples">application called
Elastiflix</a>, which helps
highlight auto-instrumentation in a simple way.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called
<a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a
movie-streaming application. It consists of several micro-services written in
.NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand
how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for
OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data.
Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will
also be able to use Elastic’s powerful machine learning capabilities to speed up
analysis, and alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>.</li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Go application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Go</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found
on
<a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">GitHub</a>.</p>
<p>The following steps will show you how to instrument this application and run it
on the command line or in Docker. If you are interested in a more complete OTel
example, take a look at the docker-compose file
<a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>,
which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the
<a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic
Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-2-trial.png" alt="free trial" /></p>
<h3>Step 1. Run the Docker Image with auto-instrumentation</h3>
<p>We are going to use automatic instrumentation with the Go service from the
<a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">Elastiflix demo
application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/go-favorite
</code></pre>
<p>Per the <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/main/docs/getting-started.md">OpenTelemetry Automatic Instrumentation for Go
documentation</a>,
you will configure the application to be auto-instrumented using
docker-compose.</p>
<p>As specified in the <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/main/docs/getting-started.md">OTEL Go
documentation</a>,
we will use environment variables and pass in the configuration values to
enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM
server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and
authentication token that the OTel exporter will use to send the data, as well
as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/apm/onboarding?agent=openTelemetry</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
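<p>If you want to sanity-check these two values in a shell before pasting them into the <code>go-auto</code> service of the compose file, you can export them locally first. The endpoint and token below are hypothetical placeholders, not real credentials — substitute the values you copied from Kibana:</p>
<pre><code class="language-bash"># Hypothetical placeholder values -- substitute the ones copied from Kibana.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer my-secret-token"

# The endpoint should be an https URL, and the headers value is a
# comma-separated list of key=value pairs (here, a single Authorization key).
echo "Endpoint:   ${OTEL_EXPORTER_OTLP_ENDPOINT}"
echo "Header key: ${OTEL_EXPORTER_OTLP_HEADERS%%=*}"
</code></pre>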
<p>Update the <code>docker-compose.yml</code> file at the top of the <code>Elastiflix</code> repository,
adding a <code>go-auto</code> service and updating the <code>favorite-go</code> one:</p>
<pre><code class="language-yaml">  favorite-go:
    build: go-favorite/.
    image: docker.elastic.co/demos/workshop/observability/elastiflix-go-favorite:${ELASTIC_VERSION}-${BUILD_NUMBER}
    depends_on:
      - redis
    networks:
      - app-network
    ports:
      - &quot;5001:5000&quot;
    environment:
      - REDIS_HOST=redis
      - TOGGLE_SERVICE_DELAY=${TOGGLE_SERVICE_DELAY:-0}
      - TOGGLE_CANARY_DELAY=${TOGGLE_CANARY_DELAY:-0}
      - TOGGLE_CANARY_FAILURE=${TOGGLE_CANARY_FAILURE:-0}
    volumes:
      - favorite-go:/app
  go-auto:
    image: otel/autoinstrumentation-go
    privileged: true
    pid: &quot;host&quot;
    networks:
      - app-network
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: &quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot;
      OTEL_EXPORTER_OTLP_HEADERS: &quot;REPLACE WITH OTEL_EXPORTER_OTLP_HEADERS&quot;
      OTEL_GO_AUTO_TARGET_EXE: &quot;/app/main&quot;
      OTEL_SERVICE_NAME: &quot;go-favorite&quot;
      OTEL_PROPAGATORS: &quot;tracecontext,baggage&quot;
    volumes:
      - favorite-go:/app
      - /proc:/host/proc
</code></pre>
<p>And, at the bottom of the file:</p>
<pre><code class="language-yaml">volumes:
  favorite-go:
networks:
  app-network:
    driver: bridge
</code></pre>
<p>Finally, in the configuration for the main node app, you will want to tell Elastiflix to call the Go favorites app by replacing the line:</p>
<pre><code class="language-yaml">environment:
  - API_ENDPOINT_FAVORITES=favorite-java:5000
</code></pre>
<p>with:</p>
<pre><code class="language-yaml">environment:
  - API_ENDPOINT_FAVORITES=favorite-go:5000
</code></pre>
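<p>Once the compose file is updated and the stack is running (for example, via <code>docker compose up</code>), a small loop like the following can generate traffic so traces start flowing. This is a sketch, assuming the <code>5001:5000</code> port mapping configured above and the service's <code>/favorites</code> endpoint:</p>
<pre><code class="language-bash"># Generate a little traffic against the instrumented Go service.
# Assumes the stack is up and favorite-go is published on host port 5001.
BASE_URL="${BASE_URL:-http://localhost:5001}"
for i in 1 2 3; do
  curl -s "${BASE_URL}/favorites" || echo "request ${i} failed (is the stack up?)"
  sleep 1
done
</code></pre>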
<h3>Step 2: Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your
instrumented service (in our case, this is /favorites), and you should see the
app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Elastic APM will begin by tracking throughput and latency, critical metrics for
SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-5-services2.png" alt="services-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-6-graph-colored.png" alt="graph colored lines" /></p>
<p>This gives you complete visibility across metrics and traces!</p>
<h2>Summary</h2>
<p>With this docker-compose setup, you've transformed your simple Go application
into one that's automatically instrumented with OpenTelemetry. This will aid
greatly in understanding application performance, tracing errors, and gaining
insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development,
especially in distributed systems. With tools like OpenTelemetry, understanding
complex systems becomes a bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Go with OpenTelemetry.</li>
<li>Using standard commands in a Docker file, auto-instrumentation was done
efficiently, without adding code in multiple places, keeping the setup
manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE
teams can auto-instrument their applications with ease, gaining immediate
insights into the health of the entire application stack and reducing mean time
to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be
using auto-instrumentation of open-source OpenTelemetry or manual
instrumentation with its native APM agents, you can plan your migration to OTel
by focusing on a few applications first and then using OpenTelemetry across your
applications later on in a manner that best fits your business needs.</p>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-java-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/observability-labs/blog/auto-instrument-nodejs-apps-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-nodejs-apps-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-net-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-instrumentation</a> <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-apps-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/observability-launch-series-3-go-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automatic instrumentation with OpenTelemetry for Node.js applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrument-nodejs-apps-opentelemetry</link>
            <guid isPermaLink="false">auto-instrument-nodejs-apps-opentelemetry</guid>
            <pubDate>Wed, 30 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to auto-instrument Node.js applications using OpenTelemetry. With standard commands in a Docker file, applications can be instrumented quickly without writing code in multiple places, enabling rapid change, scale, and easier management.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">automatic instrumentation for OpenTelemetry</a> with the Node.js service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-1-otel-config-options.png" alt="options" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis, and alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own <strong>Node.js</strong> application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Node.js</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-2-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Node.js Service</h3>
<p>We are going to use automatic instrumentation with Node.js service from the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/node-server-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">OpenTelemetry JavaScript documentation</a> and <a href="https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node">@opentelemetry/auto-instrumentations-node</a> documentation, you will simply install the appropriate node packages using npm.</p>
<pre><code class="language-bash">npm install --save @opentelemetry/api
npm install --save @opentelemetry/auto-instrumentations-node
</code></pre>
<p>If you are running the Node.js service on the command line, then here is how you can run auto-instrument with Node.js.</p>
<pre><code class="language-bash">node --require '@opentelemetry/auto-instrumentations-node/register' app.js
</code></pre>
<p>For our application, we do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile">FROM node:14

WORKDIR /app

COPY [&quot;package.json&quot;, &quot;./&quot;]
RUN ls
RUN npm install --production
COPY . .

RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node


EXPOSE 3001

CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">OTel documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication token that the OTel exporter will use to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path /app/home#/tutorial/apm.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t node-otel-auto-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;node-server-otel-auto&quot; \
       -p 3001:3001 \
       node-otel-auto-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on some downstream services that you may not have running on your machine.</p>
<pre><code class="language-bash">curl localhost:3001/api/login
curl localhost:3001/api/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:3001/api/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces, metrics, and logs in Elastic APM</h3>
<p>Exploring the Services section in Elastic APM, you’ll see the Node service displayed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Clicking on the node-server-otel-auto service, you can see that it is ingesting telemetry data using OpenTelemetry.</p>
<h2>Summary</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Node.js with OpenTelemetry</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently, without adding code in multiple places, keeping the setup manageable</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be using auto-instrumentation of open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then using OpenTelemetry across your applications later on in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-apps-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/observability-launch-series-1-node-js-auto_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of Java applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-java-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-java-applications-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting Java applications with OpenTelemetry provides insights into application performance, dependencies, and errors. We'll show you how to automatically instrument a Java application using Docker, with no changes to your application code.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">automatic instrumentation in your Java</a> application using Docker, without the need to refactor any part of your application code. We will use an <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie-streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis time, along with alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>.</li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Java application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Java</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-2-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Java service</h3>
<p>We are going to use automatic instrumentation with the Java service from the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite-otel-auto">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/java-favorite-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry Automatic Instrumentation for Java documentation</a>, you simply need to download the Java agent and attach it to your application.</p>
<p>Create a local otel directory and download the OpenTelemetry Java agent, <code>opentelemetry-javaagent.jar</code>, into it:</p>
<pre><code class="language-bash">&gt;mkdir /otel

&gt;curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar –output /otel/opentelemetry-javaagent.jar
</code></pre>
<p>If you are going to run the service on the command line, then you can use the following command:</p>
<pre><code class="language-bash">java -javaagent:/otel/opentelemetry-javaagent.jar \
-jar /usr/src/app/target/favorite-0.0.1-SNAPSHOT.jar --server.port=5000
</code></pre>
<p>For our application, we will do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile"># Start with a base image containing the Java runtime
FROM maven:3.8.2-openjdk-17-slim as build

# Make port 5000 available to the world outside this container
EXPOSE 5000

# Change to the app directory
WORKDIR /usr/src/app

# Copy the local code to the container
COPY . .

# Build the application
RUN mvn clean install

USER root
RUN apt-get update &amp;&amp; apt-get install -y zip curl
RUN mkdir /otel
RUN curl -L -o /otel/opentelemetry-javaagent.jar https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v1.28.0/opentelemetry-javaagent.jar

COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT [&quot;/start.sh&quot;]
</code></pre>
<h3>Step 2. Running the Docker Image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OTEL Java documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication details the OTel exporter needs to send data, along with some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
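<p>Once copied from Kibana, the values look something like the following. The endpoint and token here are placeholders for illustration only; use the real values from your own deployment:</p>
<pre><code class="language-bash"># Placeholder values -- copy the real ones from Kibana
export OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-east-1.aws.cloud.es.io:443
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer abc123tokenvalue'
</code></pre>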
<p><strong>Build the Docker image</strong></p>
<pre><code class="language-bash">docker build -t java-otel-auto-image .
</code></pre>
<p><strong>Run the Docker image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot; \
       -e ELASTIC_APM_SECRET_TOKEN=&quot;REPLACE WITH THE BIT AFTER Authorization=Bearer &quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;java-favorite-otel-auto&quot; \
       -p 5000:5000 \
       java-otel-auto-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /favorites), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>It will begin tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-5-services2.png" alt="services-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-6-graph-colored.png" alt="graph colored lines" /></p>
<p>Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel Agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-7-graph-no-colors.png" alt="graph-no-colors" /></p>
<p>This gives you complete visibility across logs, metrics, and traces!</p>
<h2>Basic concepts: How APM works with Java</h2>
<p>Before we continue, let's first understand a few basic concepts and terms.</p>
<ul>
<li><strong>Java Agent:</strong> This is a tool that can be used to instrument (or modify) the bytecode of class files in the Java Virtual Machine (JVM). Java agents are used for many purposes like performance monitoring, logging, security, and more.</li>
<li><strong>Bytecode:</strong> This is the intermediary code generated by the Java compiler from your Java source code. This code is interpreted or compiled on the fly by the JVM to produce machine code that can be executed.</li>
<li><strong>Byte Buddy:</strong> Byte Buddy is a code generation and manipulation library for Java. It is used to create, modify, or adapt Java classes at runtime. In the context of a Java Agent, Byte Buddy provides a powerful and flexible way to modify bytecode. <strong>Both the Elastic APM Agent and the OpenTelemetry Agent use Byte Buddy under the covers.</strong></li>
</ul>
<p><strong>Now, let's talk about how automatic instrumentation works with Byte Buddy:</strong></p>
<p>Automatic instrumentation is the process by which an agent modifies the bytecode of your application's classes, often to insert monitoring code. The agent doesn't modify the source code directly, but rather the bytecode that is loaded into the JVM. This is done while the JVM is loading the classes, so the modifications are in effect during runtime.</p>
<p>Here's a simplified explanation of the process:</p>
<ol>
<li>
<p><strong>Start the JVM with the agent:</strong> When starting your Java application, you specify the Java agent with the -javaagent command line option. This instructs the JVM to load your agent before the main method of your application is invoked. At this point, the agent has the opportunity to set up class transformers.</p>
</li>
<li>
<p><strong>Register a class file transformer with Byte Buddy:</strong> Your agent will register a class file transformer with Byte Buddy. A transformer is a piece of code that is invoked every time a class is loaded into the JVM. This transformer receives the bytecode of the class, and it can modify this bytecode before the class is actually used.</p>
</li>
<li>
<p><strong>Transform the bytecode:</strong> When your transformer is invoked, it will use Byte Buddy's API to modify the bytecode. Byte Buddy allows you to specify your transformations in a high-level, expressive way rather than manually writing complex bytecode. For example, you could specify a certain class and method within that class that you want to instrument and provide an &quot;interceptor&quot; that will add new behavior to that method.</p>
</li>
<li>
<p><strong>Use the transformed classes:</strong> Once the agent has set up its transformers, the JVM continues to load classes as usual. Each time a class is loaded, your transformers are invoked, allowing them to modify the bytecode. Your application then uses these transformed classes as if they were the original ones, but they now have the extra behavior that you've injected through your interceptor.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-8-flowchart.png" alt="flowchart" /></p>
<p>In essence, automatic instrumentation with Byte Buddy is about modifying the behavior of your Java classes at runtime, without needing to alter the source code directly. This is especially useful for cross-cutting concerns like logging, monitoring, or security, as it allows you to centralize this code in your Java Agent, rather than scattering it throughout your application.</p>
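<p>To make this concrete, here is a minimal, hypothetical sketch of a Java agent using only the JDK's <code>java.lang.instrument</code> API. The class and package names are illustrative, not taken from the OpenTelemetry agent; the real agent layers Byte Buddy on top of this same mechanism:</p>
<pre><code class="language-java">import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class TimingAgent {

    // Invoked by the JVM before main() when the app is started with
    // -javaagent:timing-agent.jar
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new TimingTransformer());
    }

    static class TimingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class&lt;?&gt; classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // Returning null tells the JVM to keep the original bytecode.
            if (className == null || !className.startsWith(&quot;com/example/favorite/&quot;)) {
                return null;
            }
            // A real agent would hand classfileBuffer to Byte Buddy (or ASM)
            // here and return rewritten bytecode, e.g. with timer calls
            // wrapped around each method. This sketch passes it through.
            return classfileBuffer;
        }
    }
}
</code></pre>
<p>The OpenTelemetry agent does exactly this at scale: its transformers match hundreds of known framework classes and inject span start/stop logic around their methods.</p>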
<h2>Summary</h2>
<p>With this Dockerfile, you've transformed your simple Java application into one that's automatically instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Java with OpenTelemetry.</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently, without adding code in multiple places, which keeps the setup manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can auto-instrument their applications with ease, gaining immediate insight into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications later on in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/observability-launch-series-3-java-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of .NET applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-net-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-net-applications-opentelemetry</guid>
            <pubDate>Fri, 01 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry provides an observability framework for cloud-native software, allowing us to trace, monitor, and debug applications seamlessly. In this post, we'll explore how to automatically instrument a .NET application using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
<li>Explore manual instrumentation, enabling customized data collection to fit your unique needs.</li>
<li>Ensure monitoring consistency across layers with a standardized observability data framework.</li>
<li>Decouple development from operations, driving peak efficiency for both.</li>
</ul>
<p>Given this context, OpenTelemetry emerges as an unmatched observability solution for cloud-native software, seamlessly enabling tracing, monitoring, and debugging. One of its strengths is the ability to auto-instrument applications, allowing developers the luxury of collecting invaluable telemetry without delving into code modifications.</p>
<p>In this post, we will dive into the methodology to instrument a .NET application using Docker, blending the best of both worlds: powerful observability without the code hassles.</p>
<h2>What's covered?</h2>
<ul>
<li>How APM works with .NET using CLR Profiler functionality</li>
<li>Creating a Docker image for a .NET application with the OpenTelemetry instrumentation baked in</li>
<li>Installing and running the OpenTelemetry .NET Profiler for automatic instrumentation</li>
</ul>
<h2>How APM works with .NET using CLR Profiler functionality</h2>
<p>Before we delve into the details, let's clear up some confusion around .NET Profilers and CPU Profilers like Elastic<sup>®</sup>’s Universal Profiling tool — we don’t want to get these two things mixed up, as they have very different purposes.</p>
<p>When discussing profiling tools, especially in the context of .NET, it's not uncommon to encounter confusion between a &quot;.NET profiler&quot; and a &quot;CPU profiler.&quot; Though both are used to diagnose and optimize applications, they serve different primary purposes and operate at different levels. Let's clarify the distinction:</p>
<h3>.NET Profiler</h3>
<ol>
<li>
<p><strong>Scope:</strong> Specifically targets .NET applications. It is designed to work with the .NET runtime (i.e., the Common Language Runtime (CLR)).</p>
</li>
<li>
<p><strong>Functionality:</strong> Hooks into the CLR through the Profiler API to observe runtime events (JIT compilation, method enter/leave, garbage collection) and to rewrite IL in order to inject instrumentation.</p>
</li>
<li>
<p><strong>Use cases:</strong> APM auto-instrumentation, fine-grained code-level analysis, and diagnosing CLR-specific behavior.</p>
</li>
</ol>
<h3>CPU Profiler</h3>
<ol>
<li>
<p><strong>Scope:</strong> More general than a .NET profiler. It can profile any application, irrespective of the language or runtime, as long as it runs on the CPU being profiled.</p>
</li>
<li>
<p><strong>Functionality:</strong> Periodically samples call stacks to measure where CPU time is being spent, independent of language or runtime.</p>
</li>
<li>
<p><strong>Use cases:</strong> Finding CPU hotspots and understanding CPU resource utilization across any application on the host.</p>
</li>
</ol>
<p>While both .NET profilers and CPU profilers aid in optimizing and diagnosing application performance, their approach and depth differ. A .NET profiler offers deep insights specifically into the .NET ecosystem, allowing for fine-grained analysis and instrumentation. In contrast, a CPU profiler provides a broader view, focusing on CPU usage patterns across any application, regardless of its development platform.</p>
<p>It's worth noting that for comprehensive profiling of a .NET application, you might use both: the .NET profiler to understand code-level behaviors specific to .NET and the CPU profiler to get an overview of CPU resource utilization.</p>
<p>Now that we've cleared that up, let's focus on the .NET Profiler, which we are discussing in this blog for automatic instrumentation of .NET applications. First, let's familiarize ourselves with some foundational concepts and terminologies relevant to a .NET Profiler:</p>
<ul>
<li><strong>CLR (Common Language Runtime):</strong> CLR is a core component of the .NET framework, acting as the execution engine for .NET apps. It provides key services like memory management, exception handling, and type safety.</li>
<li><strong>Profiler API:</strong> .NET provides a set of APIs for profiling applications. These APIs let tools and developers monitor or manipulate .NET applications during runtime.</li>
<li><strong>IL (Intermediate Language):</strong> After compiling, .NET source code turns into IL, a low-level, platform-agnostic representation. This IL code is then compiled just-in-time (JIT) into machine code by the CLR during application execution.</li>
<li><strong>JIT compilation:</strong> JIT stands for just-in-time. In .NET, the CLR compiles IL to native code just before its execution.</li>
</ul>
<p>Now, let's explore how automatic instrumentation works using CLR Profiler.</p>
<p>Automatic instrumentation in .NET, much like Java's bytecode instrumentation, revolves around modifying the behavior of your application's methods during runtime, without changing the actual source code.</p>
<p>Here’s a step-by-step breakdown:</p>
<ol>
<li>
<p><strong>Attach the profiler:</strong> When launching your .NET application, you'll have to specify to load the profiler. The CLR checks for the presence of a profiler by reading environment variables. If it finds one, the CLR initializes the profiler before any user code is executed.</p>
</li>
<li>
<p><strong>Use Profiler API to monitor events:</strong> The Profiler API allows a profiler to monitor various events. For instance, method JIT compilation events can be tracked. When a method is about to be JIT compiled, the profiler gets notified.</p>
</li>
<li>
<p><strong>Manipulate IL code:</strong> Upon getting notified of a JIT compilation, the profiler can manipulate the IL code of the method. Using the Profiler API, the profiler can insert, delete, or replace IL instructions. This is analogous to how Java agents modify bytecode. For example, if you want to measure a method's execution time, you'd modify the IL to insert calls to start and stop a timer at the beginning and end of the method, respectively.</p>
</li>
<li>
<p><strong>Execution of transformed code:</strong> Once the IL has been modified, the JIT compiler will translate it into machine code. The application will then execute this machine code, which includes the additions made by the profiler.</p>
</li>
<li>
<p><strong>Gather and report data:</strong> The added instrumentation can collect various data, such as method execution times or call counts. This data can then be relayed to an application performance management (APM) tool, which can provide insights, visualizations, and alerts based on the data.</p>
</li>
</ol>
<p>In essence, automatic instrumentation with CLR Profiler is about modifying the behavior of your .NET methods at runtime. This is invaluable for monitoring, diagnosing, and fine-tuning the performance of .NET applications without intruding on the application's actual source code.</p>
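<p>Under the hood, the CLR decides whether to load a profiler by reading a handful of environment variables at startup. The <code>instrument.sh</code> script we use later sets these for you; the variable names below are the standard CLR ones, but the profiler CLSID and library path shown are illustrative of what the OpenTelemetry .NET auto-instrumentation uses (the exact path varies by version and platform), so check the script for the authoritative values:</p>
<pre><code class="language-bash"># Tell the CLR to enable the profiling API
export CORECLR_ENABLE_PROFILING=1
# CLSID of the profiler to load (here: the OTel .NET auto-instrumentation)
export CORECLR_PROFILER='{918728DD-259F-4A6A-AC2B-B85E1B658318}'
# Native library implementing the Profiler API callbacks
export CORECLR_PROFILER_PATH=$OTEL_DOTNET_AUTO_HOME/linux-x64/OpenTelemetry.AutoInstrumentation.Native.so
</code></pre>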
<h2>Prerequisites</h2>
<ul>
<li>A basic understanding of Docker and .NET</li>
<li>Elastic Cloud</li>
<li>Docker installed on your machine (we recommend <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a>)</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/elastic-blog-2-free-trial.png" alt="" /></p>
<h2>Step 1. Base image setup</h2>
<p>Start with the .NET runtime image for the base layer of our Dockerfile:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app
EXPOSE 8000
</code></pre>
<p>Here, we're setting up the application's runtime environment.</p>
<h2>Step 2. Building the .NET application</h2>
<p>This feature of Docker is just the best. Here, we compile our .NET application using the SDK image. In the bad old days, we used to build on a different platform and then copy the compiled code into the Docker container. By using Docker all the way through, we can be much more confident that our build will replicate from a developer’s desktop to production.</p>
<pre><code class="language-dockerfile">FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;login.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./login.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;login.csproj&quot; -c Release -o /app/build
</code></pre>
<p>This section ensures that our .NET code is properly restored and compiled.</p>
<h2>Step 3. Publishing the application</h2>
<p>Once built, we'll publish the app:</p>
<pre><code class="language-dockerfile">FROM build AS publish
RUN dotnet publish &quot;login.csproj&quot; -c Release -o /app/publish
</code></pre>
<h2>Step 4. Preparing the final image</h2>
<p>Now, let's set up the final runtime image:</p>
<pre><code class="language-dockerfile">FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
</code></pre>
<h2>Step 5. Installing OpenTelemetry</h2>
<p>We'll install dependencies and download the OpenTelemetry auto-instrumentation script:</p>
<pre><code class="language-dockerfile">RUN apt-get update &amp;&amp; apt-get install -y zip curl
RUN mkdir /otel
RUN curl -L -o /otel/otel-dotnet-install.sh https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/releases/download/v0.7.0/otel-dotnet-auto-install.sh
RUN chmod +x /otel/otel-dotnet-install.sh
</code></pre>
<h2>Step 6. Configure OpenTelemetry</h2>
<p>Designate where OpenTelemetry should reside and execute the installation script. Note that the ENV OTEL_DOTNET_AUTO_HOME is required as the script looks for it:</p>
<pre><code class="language-dockerfile">ENV OTEL_DOTNET_AUTO_HOME=/otel
RUN /bin/bash /otel/otel-dotnet-install.sh
</code></pre>
<h2>Step 7. Additional configuration</h2>
<p>Make sure the auto-instrumentation and platform detection scripts are executable and run the platform detection script.</p>
<pre><code class="language-dockerfile">COPY platform-detection.sh /otel/
RUN chmod +x /otel/instrument.sh
RUN chmod +x /otel/platform-detection.sh &amp;&amp; /otel/platform-detection.sh
</code></pre>
<p>This platform detection script checks whether the Docker build is for ARM64 and applies a workaround to get the OpenTelemetry instrumentation working on macOS. If you happen to be running locally on a Mac with an M1 or M2 (ARM64) processor, you will be grateful for this script.</p>
<h2>Step 8. Entry point setup</h2>
<p>Lastly, set the Docker image's entry point to both source the OpenTelemetry instrumentation, which sets up the environment variables required to bootstrap the .NET Profiler, and then we start our .NET application:</p>
<pre><code class="language-dockerfile">ENTRYPOINT [&quot;/bin/bash&quot;, &quot;-c&quot;, &quot;source /otel/instrument.sh &amp;&amp; dotnet login.dll&quot;]
</code></pre>
<h2>Step 9. Running the Docker image with environment variables</h2>
<p>To build and run the Docker image, you'd typically follow these steps:</p>
<h3>Build the Docker image</h3>
<p>First, you'd want to build the Docker image from your Dockerfile. Let's assume the Dockerfile is in the current directory, and you'd like to name/tag your image dotnet-login-otel-image.</p>
<pre><code class="language-bash">docker build -t dotnet-login-otel-image .
</code></pre>
<h3>Run the Docker image</h3>
<p>After building the image, you'd run it with the specified environment variables. For this, the docker <strong>run</strong> command is used with the -e flag for each environment variable.</p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot; \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;${ELASTIC_APM_SERVER_URL}&quot; \
       -e OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;dotnet-login-otel-auto&quot; \
       -e OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
       dotnet-login-otel-image
</code></pre>
<p>Make sure that <code>${ELASTIC_APM_SECRET_TOKEN}</code> and <code>${ELASTIC_APM_SERVER_URL}</code> are set in your shell environment with the actual values from your Elastic Cloud deployment, as shown below.</p>
<p><strong>Getting Elastic Cloud variables</strong></p>
<p>You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You can also use an environment file with docker run --env-file to make the command less verbose if you have multiple environment variables.</p>
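<p>For example, a hypothetical <code>otel.env</code> file carrying the same variables as the <code>-e</code> flags above might look like this (all values here are placeholders):</p>
<pre><code class="language-bash"># Hypothetical env file; one KEY=VALUE per line, no quoting needed
cat &gt; otel.env &lt;&lt;'EOF'
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer my-secret-token
OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-east-1.aws.cloud.es.io:443
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
OTEL_SERVICE_NAME=dotnet-login-otel-auto
EOF

# The run command then shortens to:
# docker run --env-file otel.env dotnet-login-otel-image
</code></pre>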
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /login), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/services-3.png" alt="services" /></p>
<p>Elastic APM starts by tracking throughput and latency, critical metrics for SREs to keep an eye on.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/dotnet-login-otel-auto-1.png" alt="dotnet-login-otel-auto-1" /></p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/dotnet-login-otel-auto-2.png" alt="dotnet-login-otel-auto-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/specific_transactions.png" alt="specific transactions" /></p>
<p>There is clearly an outlier here, where one transaction took over 200ms. This is likely due to the .NET CLR warming up. Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/otel_agent.png" alt="otel agent" /></p>
<h2>Wrapping up</h2>
<p>With this Dockerfile, you've transformed your simple .NET application into one that's automatically instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument .NET with OpenTelemetry.</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently and without adding code in multiple places, which keeps the setup manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can auto-instrument their applications with ease, gaining immediate insight into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/observability-launch-series-4-net-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automatic instrumentation with OpenTelemetry for Python applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-python-applications-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to auto-instrument Python applications using OpenTelemetry. With standard commands in a Docker file, applications can be instrumented quickly without writing code in multiple places, enabling rapid change, scale, and easier management.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualizations</a> and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/python/">automatic instrumentation for OpenTelemetry</a> with the Python service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie-streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-1-otel-config-options.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis time, along with alerting, to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Python application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Python</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-2-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Python Service</h3>
<p>We are going to use automatic instrumentation with Python service from the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/python-favorite-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/">OpenTelemetry Automatic Instrumentation for Python documentation</a>, you simply install the appropriate Python packages using pip install.</p>
<pre><code class="language-bash">&gt;pip install opentelemetry-distro \
	opentelemetry-exporter-otlp

&gt;opentelemetry-bootstrap -a install
</code></pre>
<p>If you are running the Python service on the command line, then you can use the following command:</p>
<pre><code class="language-bash">opentelemetry-instrument python main.py
</code></pre>
<p>For our application, we do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile">FROM python:3.9-slim as base

# get packages
COPY requirements.txt .
RUN pip install -r requirements.txt
WORKDIR /favoriteservice

#install opentelemetry packages
RUN pip install opentelemetry-distro \
	opentelemetry-exporter-otlp

RUN opentelemetry-bootstrap -a install

# Add the application
COPY . .

EXPOSE 5000
ENTRYPOINT [ &quot;opentelemetry-instrument&quot;, &quot;python&quot;, &quot;main.py&quot;]
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/#configuring-the-agent">OTEL Python documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t python-otel-auto-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20&lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;python-favorite-otel-auto&quot; \
       -p 5000:5000 \
       python-otel-auto-image
</code></pre>
<p><strong>Important:</strong> Note that the “OTEL_EXPORTER_OTLP_HEADERS” variable has the whitespace after Bearer escaped as “%20” — this is a requirement for Python.</p>
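<p>As a minimal sketch, you can build the escaped header value in the shell before passing it to <code>docker run</code>; <code>TOKEN</code> here is a placeholder for your real secret token:</p>
<pre><code class="language-bash"># TOKEN is a placeholder -- substitute your real secret token
TOKEN=&quot;my-secret-token&quot;
# The space after &quot;Bearer&quot; must be percent-encoded as %20
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20${TOKEN}&quot;
echo &quot;$OTEL_EXPORTER_OTLP_HEADERS&quot;
# Authorization=Bearer%20my-secret-token
</code></pre>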
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces, metrics, and logs in Elastic APM</h3>
<p>Exploring the Services section in Elastic APM, you’ll see the Python service displayed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Clicking on the python-favorite-otel-auto service, you can see that it is ingesting telemetry data using OpenTelemetry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-5-graph-view.png" alt="graph view" /></p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Python with OpenTelemetry</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently and without adding code in multiple places</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/observability-launch-series-2-python-auto_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automated Error Triage: From Reactive to Autonomous]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-error-triaging</link>
            <guid isPermaLink="false">automated-error-triaging</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to automate error triage by using Elasticsearch log clustering and AI agents, turning production logs into actionable root cause reports.]]></description>
            <content:encoded><![CDATA[<p>The engineering feedback loop is often pictured as a clean cycle: shipping a feature, monitoring its health, triaging issues, identifying bugs, and deploying fixes. However, in large-scale cloud environments, the path from monitoring to identification frequently becomes a bottleneck. When thousands of Kibana instances running on Elastic Cloud emit millions of logs across a vast codebase, the lag between an error occurring and an engineer understanding its root cause—the Maintenance Gap—can stretch from hours to months.</p>
<p>To close this gap, we built an automated pipeline that moves beyond simple monitoring. By automating the discovery and investigation phases, we have shifted the focus of the engineer from &quot;what happened?&quot; to &quot;is this fix correct?&quot;</p>
<h2><strong>The Bottleneck in the Feedback Loop</strong></h2>
<p>In a high-velocity engineering environment, the path from deployment to resolution involves several distinct stages: <strong>Ship</strong>, <strong>Monitor</strong>, <strong>Triage</strong>, <strong>Identify</strong>, <strong>Fix</strong>, and <strong>Review/Deploy</strong>.</p>
<p>Velocity typically stalls during triage and identification. While catastrophic failures are reported immediately, smaller errors—intermittent UI glitches or failed background tasks—often go unreported. This dependency on manual reporting creates an inflated time to resolution; by the time a report is filed and routed, the issue may have already impacted the fleet for days.</p>
<p>By automating discovery and investigation, even these &quot;paper cut&quot; bugs are quantified before they accumulate into significant technical debt. The goal is to ensure that by the time a developer enters the cycle to write a fix, the detective work is already complete.</p>
<h2><strong>Discovery: Automated Log Clustering</strong></h2>
<p>The first challenge in this process is signal-to-noise. In a massive production environment, creating a ticket for every error event is unmanageable.</p>
<p>Instead of analyzing individual log lines, we automate the triage process using ES|QL's <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions/categorize">CATEGORIZE grouping function</a>. <code>CATEGORIZE</code> clusters text messages into groups of similarly formatted values, turning unstructured telemetry into a prioritized backlog of distinct error patterns.</p>
<p>For example, a query like the following runs on a rolling window across all Kibana error logs:</p>
<pre><code class="language-esql">FROM kibana-server-logs
| WHERE log.level == &quot;ERROR&quot;
    AND @timestamp &gt;= NOW() - 7 days
| STATS count = COUNT(*) BY category = CATEGORIZE(message)
| SORT count DESC
</code></pre>
<p>The result is a table of regex-like categories and their occurrence counts:</p>
<table>
<thead>
<tr>
<th>count</th>
<th>category</th>
</tr>
</thead>
<tbody>
<tr>
<td>1,247</td>
<td><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></td>
</tr>
<tr>
<td>812</td>
<td><code>.*?Connection.+?error.*?</code></td>
</tr>
<tr>
<td>3</td>
<td><code>.*?Disconnected.*?</code></td>
</tr>
</tbody>
</table>
<p>A category like <code>TypeError Cannot read properties of undefined reading document</code> with 1,200+ hits over the past week tells us there is a real, recurring defect worth investigating. A category like <code>Connection error</code> spread uniformly across the fleet is more likely infrastructure noise.</p>
<p>The output is used to automatically file prioritized issues in a backlog, each enriched with the category, its regex, the occurrence count, and deep links into the raw telemetry. This automation ensures the feedback loop no longer waits for a user report to trigger an investigation; the discovery is proactive and immediate. These prioritized clusters then serve as the direct input for our autonomous investigation agent.</p>
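<p>A scheduled job could feed that backlog by running the same query programmatically against Elasticsearch's ES|QL REST API (<code>POST /_query</code>). The sketch below is illustrative; <code>ES_URL</code> and <code>ES_API_KEY</code> are placeholders for your own cluster endpoint and credentials:</p>
<pre><code class="language-bash"># The CATEGORIZE query as a single-line string
ESQL_QUERY='FROM kibana-server-logs | WHERE log.level == &quot;ERROR&quot; AND @timestamp &gt;= NOW() - 7 days | STATS count = COUNT(*) BY category = CATEGORIZE(message) | SORT count DESC'

# Commented out -- requires a live cluster and credentials:
# curl -s -X POST &quot;${ES_URL}/_query?format=csv&quot; \
#      -H &quot;Authorization: ApiKey ${ES_API_KEY}&quot; \
#      -H &quot;Content-Type: application/json&quot; \
#      -d &quot;{\&quot;query\&quot;: \&quot;${ESQL_QUERY//\&quot;/\\\&quot;}\&quot;}&quot;
</code></pre>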
<h2><strong>Investigation: The Automated Detective</strong></h2>
<p>Once an error pattern is identified, the pipeline moves to the identification phase. We deployed an AI agent to run a complete investigation of the issue. Navigating a codebase of Kibana's complexity is a significant time sink; the agent accelerates this by correlating information across the stack using <strong>ES|QL (Elasticsearch Query Language)</strong>.</p>
<h3><strong>Protocol-Driven Investigation</strong></h3>
<p>It is important to distinguish this agent from a traditional automation script. The agent does not follow a hardcoded state machine; instead, it is provided with a protocol that outlines investigation goals and available tools.</p>
<p>The protocol prescribes a phased approach: understand the error, analyze its distribution, correlate with other data sources, find the source, and report. Each phase is described in terms of goals, not commands. The following excerpt shows how the protocol defines the first investigation step:</p>
<pre><code class="language-markdown">### Phase 1: Understand the Error
- Review the pre-extracted error details from the backlog issue
- Check for similar/overlapping error backlog issues (include closed!)
  - the categorization is often imperfect; closed issues may have
    valuable context about fixes
- Query for error overview statistics
- Get sample error messages to understand the actual content
</code></pre>
<p>The agent is also provided with an ES|QL reference guide and a library of query templates. Here is one of the templates for analyzing version distribution (a common first step to determine whether an error is a regression):</p>
<pre><code class="language-esql">FROM logging-*:cluster-kibana-*
| WHERE @timestamp &gt;= NOW() - 4 hours
    AND log.level == &quot;ERROR&quot;
    AND message : &quot;TypeError Cannot read properties&quot;
| STATS
    error_count = COUNT(*),
    deployments = COUNT_DISTINCT(ece.deployment)
  BY `docker.container.labels.org.label-schema.version`
| SORT error_count DESC
</code></pre>
<p>Because the agent has the autonomy to choose which tools to call—and in what order—based on the results of previous queries, it can adapt its strategy to the specific error. It might decide to skip proxy analysis if the telemetry suggests a background task failure, or it might dive deep into git history if ES|QL reveals the bug only exists on a specific version. This flexibility allows it to navigate the nuance of a massive codebase without requiring a pre-defined path for every possible failure mode.</p>
<h3><strong>Lessons Learned: Query Discipline</strong></h3>
<p>Direct LLM access to production clusters requires tactical constraints to manage costs and performance. We codified several requirements into the investigation workflow to ensure efficiency:</p>
<ul>
<li>
<p><strong>Query Budgets</strong>: The agent is restricted to <code>~15-20</code> queries per investigation, forcing it to form a hypothesis before retrieving data.</p>
</li>
<li>
<p><strong>The 4-Hour Rule</strong>: The agent starts with a small time window (the most recent <code>1-4</code> hours) to leverage caches and reduce compute costs.</p>
</li>
<li>
<p><strong>Optimal Operators</strong>: The agent prefers equality filters and the MATCH (:) operator over LIKE or regex, which can make queries <code>50-1000×</code> faster.</p>
</li>
<li>
<p><strong>Fail-Fast Timeouts</strong>: Every query has a strict timeout, requiring the agent to refine its filters rather than retrying expensive operations.</p>
</li>
</ul>
<h2><strong>Source Code Contextualization</strong></h2>
<p>To complete the identification phase, the agent correlates telemetry with the git history and source files. It uses the stack trace and log patterns to narrow its search, parsing through potential code matches faster than a manual search. By identifying the specific line of code producing the error and checking recent PRs, the agent links a production symptom directly to its technical root cause.</p>
<h2><strong>Real-World Case Study: The Streams UI Crash</strong></h2>
<p>The value of this autonomous investigation is best illustrated by the rare edge cases it uncovers. In one instance, the clustering system surfaced a sporadic pattern:</p>
<p><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></p>
<p>A human might have dismissed this as generic telemetry noise, but the agent's investigation revealed a reproducible race condition in the Streams UI:</p>
<ol>
<li>
<p><strong>Quantification</strong>: Using ES|QL, the agent analyzed the error distribution and identified the specific application context (Streams) and the relevant loggers.</p>
</li>
<li>
<p><strong>Code Analysis</strong>: It identified a logic error in <code>processor_outcome_preview.tsx</code>. The code was indexing into an array (<code>originalSamples[currentDoc.index].document</code>) without verifying the element existed.</p>
</li>
<li>
<p><strong>Root Cause</strong>: The agent realized that when a user changed filters while a row was expanded, <code>currentDoc.index</code> became stale before the next render cleared it.</p>
</li>
<li>
<p><strong>Outcome</strong>: The agent provided a suggested fix (guarding the access) and recommended a regression test around filter changes during row expansion.</p>
</li>
</ol>
<p>This case highlights the <strong>economic scale</strong> of autonomous triage. Sifting through thousands of &quot;noisy&quot; logs to find the few that represent real, fixable UI crashes is a non-starter for senior engineers. Agents process this volume at a fraction of the cost, acting as a high-fidelity filter that ensures human time is only spent on verified, actionable issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/error-report.png" alt="Automated error investigation report showing the agent's analysis of a Streams UI crash, including root cause identification and suggested fix" /></p>
<h2><strong>The Future of Engineering Velocity</strong></h2>
<p>Automating triaging and identification is the first step. We are currently layering in the ability to pass these findings to a coding agent for draft Pull Requests. Beyond production errors, we are also investigating <strong>agentic exploratory testing</strong> to stress-test features during the pre-release phase and catch bugs before they ever reach a user.</p>
<p>This autonomous layer is <strong>complementary to, not a replacement for, classic quality gates</strong>. Unit tests, API-level checks, and UI integration tests remain the primary defense. Our approach provides a safety net for the failures that inevitably bypass these gates in a complex environment, ensuring they are addressed with the same rigor as pre-release bugs.</p>
<p>As we move toward a more agent-driven development process, the ability to rapidly validate that changes are safe and to control overall quality is the primary bottleneck for engineering velocity. While code generation itself is becoming a commodity, the &quot;reasoning&quot; required to verify that a change is both correct and safe remains the most critical hurdle. By focusing our automation on the discovery and root-cause analysis of failures, we ensure that our engineering teams can scale their impact without being buried by the operational weight of maintaining quality. The goal is to build a system that can understand, diagnose, and eventually fix itself.</p>
<p>For more information on Elastic and its observability capabilities, check out <a href="https://www.elastic.co/observability">Elastic Observability</a>. You can also <a href="https://cloud.elastic.co">sign up for a free trial</a> to try it out yourself.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/automated-error-triaging.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Automated log parsing in Streams with ML]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-log-parsing-ml-streams</link>
            <guid isPermaLink="false">automated-log-parsing-ml-streams</guid>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automated-log-parsing.png" alt="" /></p>
<p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models <a href="https://www.elastic.co/what-is/large-language-models">(LLMs)</a>, with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p>
<p>We are happy to announce that this feature is already available in <a href="https://www.elastic.co/elasticsearch/streams">Streams</a>!</p>
<h2>Dataset Description</h2>
<p>We chose the <a href="https://github.com/logpai/loghub">Loghub</a> log collection for our PoC. For our investigation, we selected representative samples from the following key areas:</p>
<ul>
<li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li>
<li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li>
<li>Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li>
<li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li>
<li>Supercomputers: To test performance on high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li>
</ul>
<p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with a microservice architecture.</p>
<p>Log examples:</p>
<pre><code class="language-text">[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed
</code></pre>
<p>In addition, we created a Kubernetes cluster with a typical web application + database setup to mine extra logs in the most common domain.</p>
<p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p>
<h2>Few-Shot Log Parsing with an LLM</h2>
<p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p>
<p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p>
<h3>High Confidence &amp; Context Awareness</h3>
<p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Beyond simple pattern matching, the model showed a capacity for log understanding: it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p>
<h3>The &quot;Goldilocks&quot; Dilemma of Input Samples</h3>
<p>Our experiments quickly surfaced a significant lack of robustness caused by extreme <strong>sensitivity to the input sample</strong>. The model's performance fluctuates wildly based on the specific log examples included in the prompt. We observed a log similarity problem: the input sample needs to contain logs that are just diverse enough:</p>
<ul>
<li>Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to overspecify. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny fraction of the logs and extract unusable fields.</li>
<li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li>
</ul>
<h3>The Context Window Constraint</h3>
<p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model's output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p>
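<p>A minimal sketch of the chunking idea, restricted to a character budget (the delimiter-aware splitting we actually used is more involved, and the function name and budget here are illustrative):</p>
<pre><code class="language-python">def chunk_logs(lines, max_chars=4000):
    # Greedily pack whole log lines into chunks that fit a prompt budget,
    # so each LLM call sees a bounded, line-aligned slice of the input.
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # plus one for the newline
    if current:
        chunks.append(current)
    return chunks
</code></pre>
<p>Entity-based delimiters (e.g., splitting at timestamps) can be layered on top by pre-splitting the input before it reaches the character budget.</p>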
<h3>The consistency &amp; standardization gap</h3>
<p>Even when the model successfully generated rules, we noted slight inconsistencies:</p>
<ul>
<li>Service naming variations: The model proposes different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li>
<li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>). We normalized names using a standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a>.</li>
<li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li>
</ul>
<h2>Log Format Fingerprint</h2>
<p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p>
<p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p>
<p>The mapping logic is simple to ensure speed and consistency:</p>
<ol>
<li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li>
<li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li>
<li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li>
<li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li>
</ol>
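<p>A compact Python rendering of these four rules (a simplified illustration; the production implementation in Streams may differ in rule ordering and edge-case handling):</p>
<pre><code class="language-python">import re

def log_format_fingerprint(line: str) -> str:
    # Whitespace normalization: collapse runs of spaces, tabs, newlines.
    masked = re.sub(r'\s+', ' ', line)
    # Text abstraction: any run of letters becomes a single 'a'.
    masked = re.sub(r'[A-Za-z]+', 'a', masked)
    # Digit abstraction: any run of digits becomes a single '0'.
    masked = re.sub(r'[0-9]+', '0', masked)
    # Merge space-separated text runs ('a a a') into one 'a'; punctuation
    # and other symbols pass through untouched as structural markers.
    return re.sub(r'a( a)+', 'a', masked)
</code></pre>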
<p>Let's look at an example of how this mapping allows us to transform the logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/transform-logs.png" alt="" /></p>
<p>As a result, we obtain the following log masks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/log-masks.png" alt="" /></p>
<p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p>
<p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group before we ever invoke an LLM.</p>
<h2>Bonus Part: Instant Implementation with ES|QL</h2>
<p>It’s as easy as passing this query in Discover.</p>
<pre><code class="language-esql">FROM loghub
| EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, &quot;[ \t\n]+&quot;, &quot; &quot;), &quot;[A-Za-z]+&quot;, &quot;a&quot;), &quot;[0-9]+&quot;, &quot;0&quot;), &quot;a( a)+&quot;, &quot;a&quot;)
| STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources = VALUES(filename), example = TOP(raw_message, 3, &quot;desc&quot;) BY SUBSTRING(pattern, 0, 15)
| SORT total_count DESC
| LIMIT 100
</code></pre>
<p><strong>Query breakdown:</strong></p>
<p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p>
<p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to '0', text to 'a', etc.) and save the result in a “pattern” field.</p>
<p><strong>STATS</strong> [column1 =] expression1, … <strong>BY</strong> SUBSTRING(pattern, 0, 15):</p>
<p>This is the clustering step. We group logs that share the first 15 characters of their pattern and compute aggregated fields per group: the total log count, the ratio of the sample, the list of log data sources, and three example logs.</p>
<p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100: Surfaces the top 100 most frequent log patterns.</p>
<p>The query results on LogHub are displayed below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/query-results.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/results.png" alt="" /></p>
<p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering (&gt;60%) in 13 out of 16 sources, all without requiring additional cleaning, preprocessing, or fine-tuning.</p>
<p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. It provides immediate insights into log relationships and effectively manages large log clusters.</p>
<ul>
<li>
<p><strong>Versatility as a primitive</strong>: Thanks to the <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics and visualisations, and as a building block in log analysis pipelines for high-volume use cases.</p>
</li>
<li>
<p><strong>Flexibility</strong>: LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p>
</li>
<li>
<p><strong>Deterministic stability</strong>: Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p>
</li>
<li>
<p><strong>Performance and memory</strong>: LFF requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p>
</li>
</ul>
<h2>Combining Log Format Fingerprint with an LLM</h2>
<p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p>
<p>The objective was to demonstrate that LFF acts as an effective compression layer. We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p>
<h2>Execution Pipeline</h2>
<p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-ai-excecusion-pipeline.png" alt="" /></p>
<ol>
<li>Two-stage hierarchical clustering</li>
</ol>
<ul>
<li>Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.</li>
<li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li>
<li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses by matching the first N characters of the format fingerprint. This grouping strategy effectively collects lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li>
</ul>
<ol start="2">
<li>Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</li>
</ol>
<ul>
<li>We select representative logs from each valid subclass within the broader metaclass.</li>
<li>To manage an edge case of too numerous subclasses, we apply random down-sampling to fit the target window size.</li>
</ul>
<ol start="3">
<li>Rule generation. Finally, for each metaclass, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample. For our PoC, we used the GPT-4o mini model.</li>
</ol>
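<p>The three pipeline stages above can be sketched as follows. This is an illustrative reconstruction, not the Streams implementation; the function name, the input shape (fingerprints precomputed and mapped to their logs), and the per-subclass cap of three representatives are our own assumptions:</p>
<pre><code class="language-python">import random
from collections import defaultdict

def build_llm_samples(logs_by_fingerprint, min_share=0.05, prefix_len=5, window=30):
    total = sum(len(v) for v in logs_by_fingerprint.values())
    # 1a. Subclasses: logs sharing an identical fingerprint (the dict keys).
    # 1b. Outlier cleaning: drop subclasses below 5% of the total volume.
    kept = {fp: logs for fp, logs in logs_by_fingerprint.items()
            if len(logs) / total >= min_share}
    # 1c. Metaclasses: group surviving subclasses by fingerprint prefix.
    metaclasses = defaultdict(list)
    for fp, logs in kept.items():
        metaclasses[fp[:prefix_len]].append(logs)
    # 2. Stratified sampling: take representatives from every subclass,
    #    then randomly down-sample so each metaclass fits the window.
    samples = {}
    for prefix, subclasses in metaclasses.items():
        sample = [log for logs in subclasses for log in logs[:3]]
        if len(sample) > window:
            sample = random.sample(sample, window)
        samples[prefix] = sample
    return samples
</code></pre>
<p>Each resulting per-metaclass sample is then handed to the LLM for rule generation (stage 3).</p>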
<h2>Experimental Results &amp; Observations</h2>
<p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-results.png" alt="" /></p>
<p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model's high fidelity in source attribution, with minimal scattering.</p>
<h2>Insights from Our Performance Benchmarks</h2>
<ul>
<li>Optimal baseline: a context window of 30–40 log samples per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both regex and Grok patterns.</li>
<li>Input minimisation: we pushed the input size down to 10 logs per category for regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[One-Step Ingest for CloudWatch Logs and Metrics into Elastic Observability with Amazon Data Firehose]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-data-firehose-onboarding</link>
            <guid isPermaLink="false">aws-data-firehose-onboarding</guid>
            <pubDate>Tue, 26 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[AWS users can now leverage the new guided onboarding workflow to ingest CloudWatch logs and metrics in Elastic Cloud and explore the usage and performance of over twenty AWS services within minutes, using the provided CloudFormation template.]]></description>
            <content:encoded><![CDATA[<h2>Overview of the new Quickstart guided workflow</h2>
<p>Elastic Observability has been supporting AWS logs ingest with Amazon Data Firehose over the last few releases. To makes configuration easier, we introduced, in 8.16, a one step guided workflow to onboard all CloudWatch logs and metrics from a single region. The configuration uses a pre-populated CloudFormation template, to automatically create a Amazon Data Firehose and connect to Elastic Observability. Additionally, all the relevant Elastic AWS Integrations are auto-installed. The configuration ensures ingestion for metrics from all namespaces and a policy to ingest logs from all existing log groups. Any new metric namespaces and log groups post setup will also be ingested automatically. Additionally, the CloudFormation template can also be customized and deployed in a production environment using infra-as-code.</p>
<p>This allows SREs to to start monitoring the usage and health of their popular AWS services using pre-built dashboards within minutes. This blog reviews how to setup this quickstart workflow, and the out-of-the box dashboards that will be populated from it.</p>
<h2>Onboarding data using Amazon Data Firehose</h2>
<p>In order to utilize this guided workflow, a user needs the built-in superuser role in Kibana. A deployment of the hosted Elasticsearch service, version 8.16, on <a href="https://cloud.elastic.co/login?redirectTo=%2Fhome">Elastic Cloud</a> is required. Further, an active AWS account and the necessary permissions to create delivery streams, run CloudFormation, and create CloudWatch log groups and metric streams are needed.</p>
<p>Let’s walk through the steps required to onboard data using this workflow. There should be some CloudWatch logs and metrics already available in the customer account. The screenshot below shows an example where a number of CloudWatch metrics namespaces already exist.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudWatch-Metrics.png" alt="CloudWatch metrics already present" /></p>
<p>Similarly, a number of CloudWatch log groups are already present in this customer account as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudWatch-Log-Groups.png" alt="CloudWatch logs already present" /></p>
<p>This guided workflow is accessible from the ‘Add data’ left navigation option in the Elastic Observability app. The user needs to select the ‘Cloud’ option and click on the ‘AWS’ tile. The Amazon Firehose quickstart onboarding workflow is available at the top left and is labeled as a Quickstart option, as shown below.  </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-Onboarding-Firehose-Card.png" alt="Firehose onboarding tile" /></p>
<p>The Data Firehose delivery stream can be created either using the AWS CLI or the AWS console, as shown in step 2 of the guided workflow below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-Firehose-Flow-Start.png" alt="Firehose onboarding step 1" /></p>
<p>By clicking on the ‘Create Firehose Stream in AWS’ button under the ‘Via AWS Console’ tab, the user will be taken to the AWS console and the menu for creating the CloudFormation stack, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudFormation-Template-Form-1.png" alt="Firehose onboarding aws console" /></p>
<p>The CloudFormation (CF) template provided by Elastic has prepopulated default settings including the Elasticsearch endpoint and the API key, as shown in the screenshot above. The user can review these defaults in the AWS console and proceed by clicking on the ‘Create stack’ button, as shown below. Note that this stack creates IAM resources and so the checkbox acknowledging that must be checked to move forward. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudFormation-Template-Form-2.png" alt="CF template 2" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/CloudFormation-Template-Complete.png" alt="CF template complete" /></p>
<p>Once the CloudFormation stack has been created in AWS, the user can switch back to Kibana. By default, the CF stack will consist of separate delivery streams for CloudWatch logs and metrics, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Firehose-Streams.png" alt="Firehose streams" /></p>
<p>In Kibana, under step 3 ‘Visualize your data’ of the workflow, the incoming data starts to appear, categorized by AWS service type as shown below. The page refreshes automatically every 5 seconds, and new services appear at the bottom of the list.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-AWS-Services-Detected-1.png" alt="Services detected 01" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-AWS-Services-Detected-2.png" alt="Services detected 02" /></p>
<p>For each detected AWS service, the workflow recommends one or two pre-built dashboards for exploring the health and usage of that service. For example, the pre-built dashboard shown below provides a quick overview of NAT Gateway usage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/NAT-Gateway-Dashboard.png" alt="Nat Gateway dashboard" /></p>
<p>In addition to pre-built dashboards, Discover can also be used to explore the ingested CloudWatch logs, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/ECS-Logs.png" alt="Discover for logs" /></p>
<p>AWS Usage overview can be explored using the pre-built dashboard shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-Usage-Dashboard.png" alt="AWS usage" /></p>
<h2>Customisation options</h2>
<p>The region needs to be selected/modified in the AWS console as shown below, before starting with the CF stack creation. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-Console-Region-Selector.png" alt="AWS region selector" /></p>
<p>The <code>EnableCloudWatchLogs</code> and <code>EnableCloudWatchMetrics</code> parameters can be changed, in the AWS console or in the CF template, to disable the collection of logs or metrics respectively.</p>
<p>The <code>MetricNameFilters</code> parameter in the CF template or console can be used to exclude specific namespace-metric names pairs from collection.</p>
<p>To facilitate as-code deployment in production environments, the CF template provided by Elastic can be used together with the Terraform resource <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudformation_stack">aws_cloudformation_stack</a>, as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Terraform-Template.png" alt="Terraform template" /></p>
<h2>Start your own exploration </h2>
<p>The new guided onboarding workflow for AWS utilizes the Amazon Firehose delivery stream to collect all available CloudWatch logs &amp; metrics, from a single customer account and a single region. The workflow also installs AWS Integration packages in the Elastic stack, enabling users to start monitoring the usage and performance of their common AWS services using pre-built dashboards, within minutes. Some of the AWS services that can be monitored using this workflow are listed below. A complete list of over twenty services that are supported by this workflow along with additional details are available <a href="https://www.elastic.co/guide/en/observability/current/collect-data-with-aws-firehose.html">here</a>.</p>
<table>
<thead>
<tr>
<th>AWS service</th>
<th>Data types</th>
</tr>
</thead>
<tbody>
<tr>
<td>VPC Flow Logs</td>
<td>Logs</td>
</tr>
<tr>
<td>API Gateway</td>
<td>Logs, Metrics</td>
</tr>
<tr>
<td>CloudTrail</td>
<td>Logs</td>
</tr>
<tr>
<td>Network Firewall</td>
<td>Logs, Metrics</td>
</tr>
<tr>
<td>WAF</td>
<td>Logs</td>
</tr>
<tr>
<td>EC2</td>
<td>Metrics</td>
</tr>
<tr>
<td>RDS</td>
<td>Metrics</td>
</tr>
</tbody>
</table>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/154567_Image 21.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Unleash the power of Elastic and Amazon Kinesis Data Firehose to enhance observability and data analytics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics</link>
            <guid isPermaLink="false">aws-kinesis-data-firehose-observability-analytics</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to directly ingest logs into Elastic Cloud in real time for centralized alerting, troubleshooting, and analytics across your cloud and on-premises infrastructure.]]></description>
            <content:encoded><![CDATA[<p>As more organizations leverage the Amazon Web Services (AWS) cloud platform and services to drive operational efficiency and bring products to market, managing logs becomes a critical component of maintaining visibility and safeguarding multi-account AWS environments. Traditionally, logs are stored in Amazon Simple Storage Service (Amazon S3) and then shipped to an external monitoring and analysis solution for further processing.</p>
<p>To simplify this process and reduce management overhead, AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to ingest logs into Elastic Cloud in AWS in real time and view them in the Elastic Stack alongside other logs for centralized analytics. This eliminates the necessity for time-consuming and expensive procedures such as VM provisioning or data shipper operations.</p>
<p>Elastic Observability unifies logs, metrics, and application performance monitoring (APM) traces for a full contextual view across your hybrid <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">AWS environments alongside their on-premises data sets</a>. Elastic Observability enables you to track and monitor performance <a href="https://www.elastic.co/observability/aws-monitoring">across a broad range of AWS services</a>, including AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Amazon Simple Storage Service (S3), AWS CloudTrail, AWS Network Firewall, and more.</p>
<p>In this blog, we will walk you through how to use the Amazon Kinesis Data Firehose integration — <a href="https://aws.amazon.com/blogs/big-data/accelerate-data-insights-with-elastic-and-amazon-kinesis-data-firehose/">Elastic is listed in the Amazon Kinesis Firehose</a> drop-down list — to simplify your architecture and send logs to Elastic, so you can monitor and safeguard your multi-account AWS environments.</p>
<h2>Announcing the Kinesis Firehose method</h2>
<p>Elastic currently provides both agent-based and serverless mechanisms, and we are pleased to announce the addition of the Kinesis Firehose method. This new method enables customers to directly ingest logs from AWS into Elastic, supplementing our existing options.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=pnGXjljuEnY"><strong>Elastic Agent</strong></a> pulls metrics and logs from CloudWatch and S3 where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53) and ingests them into Elastic Cloud.</li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3"><strong>Elastic’s Serverless Forwarder</strong></a> (runs on Lambda and is available in the AWS Serverless Application Repository) sends logs from Kinesis Data Streams, Amazon S3, and AWS CloudWatch log groups into Elastic. To learn more about this topic, please see this <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">blog post</a>.</li>
<li><a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html"><strong>Amazon Kinesis Firehose</strong></a> directly ingests logs from AWS into Elastic (specifically, if you are running Elastic Cloud on AWS).</li>
</ul>
<p>In this blog, we will cover the last option since we have recently released the Amazon Kinesis Data Firehose integration. Specifically, we'll review:</p>
<ul>
<li>A general overview of the Amazon Kinesis Data Firehose integration and how it works with AWS</li>
<li>Step-by-step instructions to set up the Amazon Kinesis Data Firehose integration on AWS and on <a href="http://cloud.elastic.co">Elastic Cloud</a></li>
</ul>
<p>By the end of this blog, you'll be equipped with the knowledge and tools to simplify your AWS log management with Elastic Observability and Amazon Kinesis Data Firehose.</p>
<h2>Prerequisites and configurations</h2>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>You will need an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack on AWS. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS Firehose Log ingestion.</li>
<li>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</li>
<li>Finally, be sure to turn on VPC Flow Logs for the VPC where your application is deployed and send them to AWS Firehose.</li>
</ol>
<h2>Elastic’s Amazon Kinesis Data Firehose integration</h2>
<p>Elastic has collaborated with AWS to offer a seamless integration of Amazon Kinesis Data Firehose with Elastic, enabling direct ingestion of data from Amazon Kinesis Data Firehose into Elastic without the need for Agents or Beats. All you need to do is configure the Amazon Kinesis Data Firehose delivery stream to send its data to Elastic's endpoint. In this configuration, we will demonstrate how to ingest VPC Flow logs and Firewall logs into Elastic. You can follow a similar process to ingest other logs from your AWS environment into Elastic.</p>
<p>There are three distinct configurations for ingesting VPC Flow and Network Firewall logs into Elastic: one sends logs through CloudWatch, another routes them through S3, and each has its own setup. CloudWatch and S3 let you store logs and forward them later, whereas Kinesis Data Firehose ingests them immediately. In this blog post, we will focus on the third configuration, which sends VPC Flow logs and Network Firewall logs directly to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" alt="AWS elastic configuration" /></p>
<p>We will guide you through the configuration of the easiest setup, which involves directly sending VPC Flow logs and Firewalls logs to Amazon Kinesis Data Firehose and then into Elastic Cloud.</p>
<p><strong>Note:</strong> This setup is only compatible with Elastic Cloud on AWS; it cannot be used with self-managed deployments, on-premises deployments, or Elastic deployments on other cloud providers.</p>
<h2>Setting it all up</h2>
<p>To begin setting up the integration between Amazon Kinesis Data Firehose and Elastic, let's go through the necessary steps.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Create an account on Elastic Cloud by following the instructions provided to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/Screenshot_2023-05-18_at_6.00.28_PM.png" alt="elastic free trial" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>You can deploy Elastic on AWS via two different approaches: through the UI or through Terraform. We’ll start first with the UI option.</p>
<p>After logging into Elastic Cloud, create a deployment on Elastic. It's crucial to make sure that the deployment is on Elastic Cloud on AWS since the Amazon Kinesis Data Firehose connects to a specific endpoint that must be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-a-deployment.png" alt="create a deployment" /></p>
<p>After your deployment is created, it's essential to copy the Elasticsearch endpoint to ensure a seamless configuration process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-O11y-log.png" alt="O11y log" /></p>
<p>Copy the Elasticsearch HTTP endpoint; you will need it when configuring the Amazon Kinesis Data Firehose destination. Here's an example of what the endpoint should look like:</p>
<pre><code class="language-bash">https://elastic-O11y-log.es.us-east-1.aws.found.io
</code></pre>
<h3><em>Alternative approach using Terraform</em></h3>
<p>An alternative approach to deploying Elastic Cloud on AWS is by using Terraform. It's also an effective way to automate and streamline the deployment process.</p>
<p>To begin, simply create a Terraform configuration file that outlines the necessary infrastructure. This file should include resources for your Elastic Cloud deployment and any required IAM roles and policies. By using this approach, you can simplify the deployment process and ensure consistency across environments.</p>
<p>One easy way to create your Elastic Cloud deployment with Terraform is to use this Github <a href="https://github.com/aws-ia/terraform-elastic-cloud">repo</a>. This resource lets you specify the region, version, and deployment template for your Elastic Cloud deployment, as well as any additional settings you require.</p>
<h3>Step 2: To turn on Elastic's AWS integrations, navigate to the Elastic Integration section in your deployment</h3>
<p>To install AWS assets in your deployment's Elastic Integration section, follow these steps:</p>
<ol>
<li>Log in to your Elastic Cloud deployment and open <strong>Kibana</strong>.</li>
<li>To get started, go to the <strong>management</strong> section of Kibana and click on &quot;<strong>Integrations</strong>.&quot;</li>
<li>Navigate to the <strong>AWS</strong> integration and click on the &quot;Install AWS Assets&quot; button in the <strong>settings</strong>. This step is important, as it installs the necessary assets, such as <strong>dashboards</strong> and <strong>ingest pipelines</strong>, to enable data ingestion from AWS services into Elastic.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-aws-settings.png" alt="aws settings" /></p>
<h3>Step 3: Set up the Amazon Kinesis Data Firehose delivery stream on the AWS Console</h3>
<p>You can set up the Kinesis Data Firehose delivery stream via two different approaches: through the AWS Management Console or through Terraform. We’ll start first with the console option.</p>
<p>To set up the Kinesis Data Firehose delivery stream on AWS, follow these <a href="https://docs.aws.amazon.com/firehose/latest/dev/create-destination.html#create-destination-elastic">steps</a>:</p>
<ol>
<li>
<p>Go to the AWS Management Console and select Amazon Kinesis Data Firehose.</p>
</li>
<li>
<p>Click on Create delivery stream.</p>
</li>
<li>
<p>Choose a delivery stream name and select Direct PUT or other sources as the source.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-delivery-stream.png" alt="create delivery stream" /></p>
<ol start="4">
<li>
<p>Choose Elastic as the destination.</p>
</li>
<li>
<p>In the Elastic destination section, enter the Elastic endpoint URL that you copied from your Elastic Cloud deployment.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-destination-settings.png" alt="destination settings" /></p>
<ol start="6">
<li>
<p>Choose the content encoding and retry duration as shown above.</p>
</li>
<li>
<p>Enter the appropriate parameter values for your AWS log type. For example, for VPC Flow logs, you would set the <strong>es_datastream_name</strong> parameter to <strong>logs-aws.vpcflow-default</strong>.</p>
</li>
<li>
<p>Configure the Amazon S3 bucket as the source backup for the Amazon Kinesis Data Firehose delivery stream failed data or all data, and configure any required tags for the delivery stream.</p>
</li>
<li>
<p>Review the settings and click on Create delivery stream.</p>
</li>
</ol>
<p>In the example above, we are using the <strong>es_datastream_name</strong> parameter to pull in VPC Flow logs through the <strong>logs-aws.vpcflow-default</strong> datastream. Depending on your use case, this parameter can be configured with one of the following types of logs:</p>
<ul>
<li>logs-aws.cloudfront_logs-default (AWS CloudFront logs)</li>
<li>logs-aws.ec2_logs-default (EC2 logs in AWS CloudWatch)</li>
<li>logs-aws.elb_logs-default (Amazon Elastic Load Balancing logs)</li>
<li>logs-aws.firewall_logs-default (AWS Network Firewall logs)</li>
<li>logs-aws.route53_public_logs-default (Amazon Route 53 public DNS queries logs)</li>
<li>logs-aws.route53_resolver_logs-default (Amazon Route 53 DNS queries &amp; responses logs)</li>
<li>logs-aws.s3access-default (Amazon S3 server access log)</li>
<li>logs-aws.vpcflow-default (AWS VPC flow logs)</li>
<li>logs-aws.waf-default (AWS WAF Logs)</li>
</ul>
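<p>If you prefer scripting over the console, the same delivery stream can be created with the AWS CLI. The sketch below is illustrative only: the stream name, endpoint URL, API key, role, and bucket ARNs are placeholders to replace with your own values, and the <strong>es_datastream_name</strong> common attribute selects one of the data streams listed above.</p>
<pre><code class="language-bash"># Placeholder values: replace the endpoint, API key, and ARNs with your own.
echo '{
  "EndpointConfiguration": {
    "Url": "https://elastic-O11y-log.es.us-east-1.aws.found.io",
    "Name": "ElasticCloudEndpoint",
    "AccessKey": "REPLACE_WITH_ELASTIC_API_KEY"
  },
  "RequestConfiguration": {
    "ContentEncoding": "GZIP",
    "CommonAttributes": [
      { "AttributeName": "es_datastream_name",
        "AttributeValue": "logs-aws.vpcflow-default" }
    ]
  },
  "S3BackupMode": "FailedDataOnly",
  "S3Configuration": {
    "RoleARN": "REPLACE_WITH_FIREHOSE_ROLE_ARN",
    "BucketARN": "REPLACE_WITH_BACKUP_BUCKET_ARN"
  },
  "RoleARN": "REPLACE_WITH_FIREHOSE_ROLE_ARN"
}' > firehose-elastic.json

# Validate the JSON locally before using it.
python3 -m json.tool firehose-elastic.json > /dev/null
echo "config OK"

# Create the stream (requires AWS credentials):
# aws firehose create-delivery-stream \
#   --delivery-stream-name vpcflow-to-elastic \
#   --delivery-stream-type DirectPut \
#   --http-endpoint-destination-configuration file://firehose-elastic.json
</code></pre>
<p>The <strong>CommonAttributes</strong> entry plays the same role as the parameter field in the console: Firehose forwards it with each request so the Elastic endpoint can route documents to the right data stream.</p>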
<h3><em>Alternative approach using Terraform</em></h3>
<p>Using the <strong>aws_kinesis_firehose_delivery_stream</strong> resource in <strong>Terraform</strong> is another way to create a Kinesis Firehose delivery stream, letting you specify the delivery stream name, data source, and destination (in this case, an Elasticsearch HTTP endpoint). To authenticate, you'll need to provide the endpoint URL and an API key. Leveraging this Terraform resource automates and streamlines your deployment process, resulting in greater consistency and efficiency.</p>
<p>Here's an example code that shows you how to create a Kinesis Firehose delivery stream with Terraform that sends data to an Elasticsearch HTTP endpoint:</p>
<pre><code class="language-hcl">resource &quot;aws_kinesis_firehose_delivery_stream&quot; &quot;Elasticcloud_stream&quot; {
  name        = &quot;terraform-kinesis-firehose-ElasticCloud-stream&quot;
  destination = &quot;http_endpoint&quot;

  s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.bucket.arn
    buffer_size        = 5
    buffer_interval    = 300
    compression_format = &quot;GZIP&quot;
  }

  http_endpoint_configuration {
    # Use your deployment's Elasticsearch HTTP endpoint, not cloud.elastic.co
    url                = &quot;https://elastic-O11y-log.es.us-east-1.aws.found.io&quot;
    name               = &quot;ElasticCloudEndpoint&quot;
    access_key         = &quot;ElasticApi-key&quot;
    buffering_size     = 5
    buffering_interval = 300
    role_arn           = aws_iam_role.firehose.arn
    s3_backup_mode     = &quot;FailedDataOnly&quot;
  }
}
</code></pre>
<h3>Step 4: Configure VPC Flow Logs to send to Amazon Kinesis Data Firehose</h3>
<p>To complete the setup, you'll need to configure VPC Flow logs in the VPC where your application is deployed and send them to the Amazon Kinesis Data Firehose delivery stream you set up in Step 3.</p>
<p>Enabling VPC flow logs in AWS is a straightforward process. Here are the steps to enable VPC flow logs in your AWS account:</p>
<ol>
<li>
<p>Select the VPC for which you want to enable flow logs.</p>
</li>
<li>
<p>In the VPC dashboard, click on &quot;Flow Logs&quot; under the &quot;Logs&quot; section.</p>
</li>
<li>
<p>Click on the &quot;Create Flow Log&quot; button to create a new flow log.</p>
</li>
<li>
<p>In the &quot;Create Flow Log&quot; wizard, provide the following information:</p>
</li>
</ol>
<ul>
<li>Provide a name for your flow log.</li>
<li>Choose the VPC and the network interface(s) for which you want to enable flow logs.</li>
<li>Choose the target for your flow logs: in this case, Amazon Kinesis Data Firehose in the same AWS account.</li>
<li>Choose the flow log format: either the AWS default or a custom format.</li>
</ul>
<ol start="5">
<li>
<p>Configure the IAM role for the flow logs. If you have an existing IAM role, select it. Otherwise, create a new IAM role that grants the necessary permissions for the flow logs.</p>
</li>
<li>
<p>Review the flow log configuration and click &quot;Create.&quot;</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-flow-log-settings.png" alt="flow log settings" /></p>
<p>Create the VPC Flow log.</p>
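<p>Equivalently, the flow log can be created from the AWS CLI. The sketch below composes the command and prints it for review rather than executing it; the VPC ID and delivery stream ARN are placeholders.</p>
<pre><code class="language-bash"># Placeholders: replace the VPC ID and delivery stream ARN with your own.
VPC_ID="vpc-0123456789abcdef0"
FIREHOSE_ARN="arn:aws:firehose:us-east-1:123456789012:deliverystream/my-delivery-stream"

# Compose the command; review it, then run it with valid AWS credentials.
CMD="aws ec2 create-flow-logs --resource-type VPC --resource-ids $VPC_ID --traffic-type ALL --log-destination-type kinesis-data-firehose --log-destination $FIREHOSE_ARN"
echo "$CMD"
# eval "$CMD"
</code></pre>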
<h3>Step 5: After a few minutes, check if flows are coming into Elastic</h3>
<p>To confirm that the VPC Flow logs are ingesting into Elastic, you can check the logs in Kibana. You can do this by searching for the index in the Kibana Discover tab and filtering the results by the appropriate index and time range. If VPC Flow logs are flowing in, you should see a list of documents representing the VPC Flow logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<h3>Step 6: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] VPC Flow Log Overview dashboard</h3>
<p>Finally, there is an Elastic out-of-the-box (OOTB) VPC Flow logs dashboard that displays the top IP addresses that are hitting your VPC, their geographic location, time series of the flows, and a summary of VPC flow log rejects within the selected time frame. This dashboard can provide valuable insights into your network traffic and potential security threats.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-VPC-flow-log-map.png" alt="vpc flow log map" /></p>
<p><em>Note: For additional VPC flow log analysis capabilities, please refer to</em> <a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability"><em>this blog</em></a><em>.</em></p>
<h3>Step 7: Configure AWS Network Firewall Logs to send to Kinesis Firehose</h3>
<p>To create a Kinesis Data Firehose delivery stream for AWS Network Firewall logs, first log in to the AWS Management Console, navigate to the Kinesis service, select &quot;Data Firehose&quot;, and follow the step-by-step instructions shown in Step 3. Specify the Elasticsearch endpoint and API key, add the parameter <strong>es_datastream_name=logs-aws.firewall_logs-default</strong>, and create the delivery stream.</p>
<p>Second, to set up a Network Firewall rule group to send logs to the Kinesis Firehose, go to the Network Firewall section of the console, create a rule group, add a rule to allow traffic to the Kinesis endpoint, and attach the rule group to your Network Firewall configuration. Finally, test the configuration by sending traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to Elastic, with any failed data backed up to your S3 bucket.</p>
<p>Follow the instructions below to set up the firewall rule and logging.</p>
<ol>
<li>Set up a Network Firewall rule group to send logs to Amazon Kinesis Data Firehose:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Rule groups&quot; in the left menu and then click &quot;Create rule group.&quot;</li>
<li>Choose &quot;Stateless&quot; or &quot;Stateful&quot; depending on your requirements, and give your rule group a name. Click &quot;Create rule group.&quot;</li>
<li>Add a rule to the rule group to allow traffic to the Kinesis Firehose endpoint. For example, if you are using the us-east-1 region, you would add a rule like this:</li>
</ul>
<pre><code class="language-json">{
  &quot;RuleDefinition&quot;: {
    &quot;Actions&quot;: [
      {
        &quot;Type&quot;: &quot;AWS::KinesisFirehose::DeliveryStream&quot;,
        &quot;Options&quot;: {
          &quot;DeliveryStreamArn&quot;: &quot;arn:aws:firehose:us-east-1:12387389012:deliverystream/my-delivery-stream&quot;
        }
      }
    ],
    &quot;MatchAttributes&quot;: {
      &quot;Destination&quot;: {
        &quot;Addresses&quot;: [&quot;api.firehose.us-east-1.amazonaws.com&quot;]
      },
      &quot;Protocol&quot;: {
        &quot;Numeric&quot;: 6,
        &quot;Type&quot;: &quot;TCP&quot;
      },
      &quot;PortRanges&quot;: [
        {
          &quot;From&quot;: 443,
          &quot;To&quot;: 443
        }
      ]
    }
  },
  &quot;RuleOptions&quot;: {
    &quot;CustomTCPStarter&quot;: {
      &quot;Enabled&quot;: true,
      &quot;PortNumber&quot;: 443
    }
  }
}
</code></pre>
<ul>
<li>Save the rule group.</li>
</ul>
<ol start="2">
<li>Attach the rule group to your Network Firewall configuration:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Firewall configurations&quot; in the left menu and select the configuration you want to attach the rule group to.</li>
<li>Scroll down to &quot;Associations&quot; and click &quot;Edit.&quot;</li>
<li>Select the rule group you created in step 1 and click &quot;Save.&quot;</li>
</ul>
<ol start="3">
<li>Test the configuration:</li>
</ol>
<ul>
<li>Send traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.</li>
</ul>
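<p>As an alternative to the rule-group approach above, AWS Network Firewall can also be pointed at the delivery stream through its logging configuration. The sketch below assumes placeholder firewall and stream names; it validates the JSON locally and leaves the actual AWS call commented out.</p>
<pre><code class="language-bash"># Placeholders: the firewall name and delivery stream name are examples.
echo '{
  "LogDestinationConfigs": [
    {
      "LogType": "FLOW",
      "LogDestinationType": "KinesisDataFirehose",
      "LogDestination": { "deliveryStream": "my-delivery-stream" }
    }
  ]
}' > nfw-logging.json

# Validate the JSON locally before applying it.
python3 -m json.tool nfw-logging.json > /dev/null
echo "config OK"

# Apply it (requires AWS credentials):
# aws network-firewall update-logging-configuration \
#   --firewall-name my-firewall \
#   --logging-configuration file://nfw-logging.json
</code></pre>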
<h3>Step 8: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] Firewall Log dashboard</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-firewall-log-dashboard.png" alt="firewall log dashboard" /></p>
<h2>Wrapping up</h2>
<p>We’re excited to bring you this latest integration for AWS Cloud and Kinesis Data Firehose into production. The ability to consolidate logs and metrics to gain visibility across your cloud and on-premises environment is crucial for today’s distributed environments and applications.</p>
<p>From EC2, Cloudwatch, Lambda, ECS and SAR, <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=aws">Elastic Integrations</a> allow you to quickly and easily get started with ingesting your telemetry data for monitoring, analytics, and observability. Elastic is constantly delivering frictionless customer experiences, allowing anytime, anywhere access to all of your telemetry data — this streamlined, native integration with AWS is the latest example of our commitment.</p>
<h2>Start a free trial today</h2>
<p>You can begin with a <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">7-day free trial</a> of Elastic Cloud within the AWS Marketplace to start monitoring and improving your users' experience today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Wait… Elastic Observability monitors metrics for AWS services in just minutes?]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy</link>
            <guid isPermaLink="false">aws-service-metrics-monitor-observability-easy</guid>
            <pubDate>Mon, 21 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Get metrics and logs from your AWS deployment and Elastic Observability in just minutes! We’ll show you how to use Elastic integrations to quickly monitor and manage the performance of your applications and AWS services to streamline troubleshooting.]]></description>
            <content:encoded><![CDATA[<p>The transition to distributed applications is in full swing, driven mainly by our need to be “always-on” as consumers and fast-paced businesses. That need is driving deployments to have more complex requirements along with the ability to be globally diverse and rapidly innovate.</p>
<p>Cloud is becoming the de facto deployment option for today’s applications. Many cloud deployments choose to host their applications on AWS for the globally diverse set of regions it covers and the myriad of services (for faster development and innovation) available, as well as to drive operational and capital costs down. On AWS, development teams are finding additional value in migrating to Kubernetes on Amazon EKS, testing out the latest serverless options, and improving traditional, tiered applications with better services.</p>
<p>Elastic Observability offers 30 out-of-the-box integrations for AWS services with more to come.</p>
<p>A quick review highlighting some of the integrations and capabilities can be found in a previous post:</p>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-seamlessly-ingest-logs-and-metrics-into-a-unified-platform-with-ready-to-use-integrations">Elastic and AWS: Seamlessly ingest logs and metrics into a unified platform with ready-to-use integrations</a>.</li>
</ul>
<p>Some additional posts on key AWS service integrations on Elastic are:</p>
<ul>
<li><a href="https://www.elastic.co/blog/observability-apm-aws-lambda-serverless-functions">APM (metrics, traces and logs) for serverless functions on AWS Lambda with Elastic</a></li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Log ingestion from AWS Services into Elastic via serverless forwarder on Lambda</a></li>
<li><a href="https://www.elastic.co/blog/new-elastic-and-amazon-s3-storage-lens-integration-simplify-management-control-costs-and-reduce-risk">Elastic’s Amazon S3 Storage Lens Integration: Simplify management, control costs, and reduce risk</a></li>
<li><a href="https://www.elastic.co/blog/elastic-cloud-with-aws-firelens-accelerate-time-to-insight-with-agentless-data-ingestion">Ingest your container logs into Elastic Cloud with AWS FireLens</a></li>
</ul>
<p>A full list of AWS integrations can be found in Elastic’s online documentation:</p>
<ul>
<li><a href="https://docs.elastic.co/en/integrations/aws">Full list of AWS integrations</a></li>
</ul>
<p>In addition to our native AWS integrations, Elastic Observability aggregates not only logs but also metrics for AWS services and the applications running on AWS compute services (EC2, Lambda, EKS/ECS/Fargate). All this data can be analyzed visually and more intuitively using Elastic’s advanced machine learning capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML based metrics correlations:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a></li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-get-the-most-value-from-your-data-sets">Elastic and AWS: Get the most value from your data sets</a></li>
</ul>
<p>That’s right, Elastic offers metrics ingest, aggregation, and analysis for AWS services and applications on AWS compute services (EC2, Lambda, EKS/ECS/Fargate). Elastic is more than logs — it offers a unified observability solution for AWS environments.</p>
<p>In this blog, I’ll review how Elastic Observability can monitor metrics for a simple AWS application running on AWS services which include:</p>
<ul>
<li>AWS EC2</li>
<li>AWS ELB</li>
<li>AWS RDS (AuroraDB)</li>
<li>AWS NAT Gateways</li>
</ul>
<p>As you will see, once the integration is installed, metrics will arrive instantly and you can immediately start reviewing metrics.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>Ensure you have an AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
<li>We used <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three tier app</a> and installed it as instructed in git.</li>
<li>We’ll walk through installing the general <a href="https://docs.elastic.co/en/integrations/aws">Elastic AWS Integration</a>, which covers the four services we want to collect metrics for.<br />
(<a href="https://docs.elastic.co/en/integrations/aws#reference">Full list of services supported by the Elastic AWS Integration</a>)</li>
<li>We will <em>not</em> cover application monitoring, since other blogs cover <a href="https://www.elastic.co/observability/aws-monitoring">AWS application monitoring</a> (metrics, logs, and tracing). Instead, we will focus on how easily AWS services can be monitored.</li>
<li>In order to see metrics, you will need to load the application. We’ve also created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three tier application overview</h2>
<p>Before we dive into the Elastic configuration, let's review what we are monitoring. If you follow the instructions for <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">aws-three-tier-web-architecture-workshop</a>, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-three-tier.png" alt="" /></p>
<p>What’s deployed:</p>
<ul>
<li>1 VPC with 6 subnets</li>
<li>2 AZs</li>
<li>2 web servers per AZ</li>
<li>2 application servers per AZ</li>
<li>1 External facing application load balancer</li>
<li>1 Internal facing application load balancer</li>
<li>2 NAT gateways to manage traffic to the application layer</li>
<li>1 Internet gateway</li>
<li>1 RDS Aurora DB with a read replica</li>
</ul>
<p>At the end of the blog, we also provide a Playwright script to load this app, which will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get the application, AWS integration on Elastic, and what gets ingested.</p>
<h3>Step 0: Load up the AWS Three Tier application and get your credentials</h3>
<p>Follow the instructions in <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s Three Tier app</a> repository and the accompanying workshop, listed <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/85cd2bb2-7f79-4e96-bdee-8078e469752a/en-US">here</a>.</p>
<p>Once you’ve installed the app, get credentials from AWS. This will be needed for Elastic’s AWS integration.</p>
<p>There are several options for credentials:</p>
<ul>
<li>Use access keys directly</li>
<li>Use temporary security credentials</li>
<li>Use a shared credentials file</li>
<li>Use an IAM role Amazon Resource Name (ARN)</li>
</ul>
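<p>For example, the shared credentials file option is a small INI-style file under <strong>~/.aws/</strong>. The sketch below writes an example file with placeholder values (to <strong>credentials.example</strong>, so an existing real credentials file is not overwritten):</p>
<pre><code class="language-bash"># Example shared credentials file; the key values are placeholders.
mkdir -p "$HOME/.aws"
echo '[default]
aws_access_key_id = REPLACE_WITH_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_WITH_SECRET_ACCESS_KEY' > "$HOME/.aws/credentials.example"
echo "wrote $HOME/.aws/credentials.example"
</code></pre>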
<p>For more details, see our documentation on the necessary <a href="https://docs.elastic.co/en/integrations/aws#aws-credentials">credentials</a> and <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">permissions</a>.</p>
<h3>Step 1: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-get-an-account.png" alt="" /></p>
<h3>Step 2: Install the Elastic AWS integration</h3>
<p>Navigate to the AWS integration on Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-install-aws-integration.png" alt="" /></p>
<p>Select Add AWS integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-add-aws-integration.png" alt="" /></p>
<p>This is where you will add your credentials and it will be stored as a policy in Elastic. This policy will be used as part of the install for the agent in the next step.</p>
<p>As you can see, the general Elastic AWS Integration will collect a significant amount of data from 30 AWS services. If you don’t want to install this general Elastic AWS Integration, you can select individual integrations to install.</p>
<h3>Step 3: Install the Elastic Agent with AWS integration</h3>
<p>Now that you have created an integration policy, navigate to the Fleet section under Management in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-install-elastic-agent.png" alt="" /></p>
<p>Select the name of the policy you created in the last step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-name-policy.png" alt="" /></p>
<p>Follow step 3 of the instructions in the <strong>Add agent</strong> window. This will require you to:</p>
<p>1: Bring up an EC2 instance</p>
<ul>
<li>t2.medium is minimum</li>
<li>Linux - your choice of which</li>
<li>Ensure you allow for Open reservation on the EC2 instance when you Launch it</li>
</ul>
<p>2: Log in to the instance and run the commands under the Linux Tar tab (below is an example)</p>
<pre><code class="language-bash">curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.5.0-linux-x86_64.tar.gz
tar xzvf elastic-agent-8.5.0-linux-x86_64.tar.gz
cd elastic-agent-8.5.0-linux-x86_64
sudo ./elastic-agent install --url=https://37845638732625692c8ee914d88951dd96.fleet.us-central1.gcp.cloud.es.io:443 --enrollment-token=jkhfglkuwyvrquevuytqoeiyri
</code></pre>
<h3>Step 4: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put load on the application.</p>
<p>Here is a simple <a href="https://playwright.dev/">Playwright</a> script you can run to drive traffic to the website of the AWS three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for AWS Threetierapp&quot;, async ({ page }) =&gt; {
  await page.goto(
    &quot;http://web-tier-external-lb-1897463036.us-west-1.elb.amazonaws.com/#/db&quot;
  );

  await page.fill(
    &quot;#transactions &gt; tbody &gt; tr &gt; td:nth-child(2) &gt; input&quot;,
    (Math.random() * 100).toString()
  );
  await page.fill(
    &quot;#transactions &gt; tbody &gt; tr &gt; td:nth-child(3) &gt; input&quot;,
    (Math.random() * 100).toString()
  );
  await page.waitForTimeout(1000);
  await page.click(
    &quot;#transactions &gt; tbody &gt; tr:nth-child(2) &gt; td:nth-child(1) &gt; input[type=button]&quot;
  );
  await page.waitForTimeout(4000);
});
</code></pre>
<p>This script will launch three browsers, but you can limit the load to a single browser in the playwright.config.ts file.</p>
<p>For this exercise, we ran this traffic for approximately five hours with an interval of five minutes while testing the website.</p>
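<p>If you want to keep the generated load light, a minimal config sketch restricting the run to a single Chromium worker might look like the following (shown in plain JavaScript as <code>playwright.config.js</code>; the TypeScript version is analogous, and the exact project names are illustrative):</p>

```javascript
// playwright.config.js — limit the load test to one browser at a time.
// By default Playwright's starter config defines chromium, firefox, and
// webkit projects, which is why the script above launches three browsers.
const config = {
  workers: 1, // run a single worker instead of the default parallel workers
  projects: [
    { name: "chromium", use: { browserName: "chromium" } }, // only Chromium
  ],
};

module.exports = config;
```

<p>With this in place, <code>npx playwright test</code> drives only one browser per run.</p>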
<h3>Step 5: Go to AWS dashboards</h3>
<p>Now that your Elastic Agent is running, you can go to the related AWS dashboards to view what’s being ingested.</p>
<p>To search for the AWS Integration dashboards, simply search for them in the Elastic search bar. The relevant ones for this blog are:</p>
<ul>
<li>[Metrics AWS] EC2 Overview</li>
<li>[Metrics AWS] ELB Overview</li>
<li>[Metrics AWS] RDS Overview</li>
<li>[Metrics AWS] NAT Gateway</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-search-aws-integration-dashboards.png" alt="" /></p>
<p>Let's see what comes up!</p>
<p>All of these dashboards are available out of the box. For all of the following images, we’ve narrowed the views to only the items relevant to our app.</p>
<p>Across all dashboards, we’ve limited the timeframe to when we ran the traffic generator.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-dashboard-traffic-generator.png" alt="Elastic Observability EC2 Overview Dashboard" /></p>
<p>After filtering for our 4 EC2 instances (2 web servers and 2 application servers), we can see the following:</p>
<p>1: All 4 instances are up and running with no failures in status checks.</p>
<p>2: We see the average CPU utilization across the timeframe and nothing looks abnormal.</p>
<p>3: We see the network bytes flow in and out, aggregating over time as the database is loaded with rows.</p>
<p>While this exercise shows a small portion of the metrics that can be viewed, more are available from AWS EC2. The metrics listed on <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html">AWS documentation</a> are all available, including the dimensions to help narrow the search for specific instances, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-overview-dashboard.png" alt="Elastic Observability ELB Overview Dashboard" /></p>
<p>For the ELB dashboard, we filter for our 2 load balancers (external web load balancer and internal application load balancer).</p>
<p>With the out-of-the-box dashboard, you can see application ELB-specific metrics. Graphs can be added for a good portion of the application ELB-specific metrics listed in the <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html">AWS Docs</a>.</p>
<p>For our two load balancers, we can see:</p>
<p>1: Both the hosts (EC2 instances connected to the ELBs) are healthy.</p>
<p>2: Load Balancer Capacity Units (how much you are using) and request counts both went up as expected during the traffic generation time frame.</p>
<p>3: We chose to show 4XX and 2XX counts. 4XX counts help identify issues with the application or connectivity to the application servers.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-transaction-blocked.png" alt="Elastic Observability RDS Overview Dashboard" /></p>
<p>For AuroraDB, which is deployed in RDS, we’ve filtered for just the primary and secondary instances of Aurora on the dashboard.</p>
<p>Just as with EC2 and ELB, most RDS metrics from CloudWatch are also available for creating new charts and graphs. In this dashboard, we’ve narrowed it down to showing:</p>
<p>1: Insert throughput &amp; Select throughput</p>
<p>2: Write latency</p>
<p>3: CPU usage</p>
<p>4: General number of connections during the timeframe</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-aws-nat-dashboard.png" alt=" Elastic Observability AWS NAT Dashboard" /></p>
<p>We filtered to look only at our 2 NAT gateways, which front the application servers. As with the other dashboards, additional metrics are available to build graphs and charts as needed.</p>
<p>For the NAT dashboard we can see the following:</p>
<p>1: The NAT gateways are healthy, with no packet drops</p>
<p>2: An expected number of active connections from the web server</p>
<p>3: A fairly normal set of metrics for bytes in and out</p>
<p><strong>Congratulations, you have now started monitoring metrics from key AWS services for your application!</strong></p>
<h2>What to monitor on AWS next?</h2>
<h3>Add logs from AWS Services</h3>
<p>Now that metrics are being monitored, you can add logging as well. There are several options for ingesting logs.</p>
<ol>
<li>The AWS integration in the Elastic Agent has a logs setting; just ensure you turn on what you wish to receive. To ingest the Aurora logs from RDS, we simply turn on Collect logs from CloudWatch in the Elastic Agent policy (see below), then update the agent through the Fleet management UI.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-collect-logs.png" alt="" /></p>
<ol start="2">
<li>You can install the <a href="https://github.com/elastic/elastic-serverless-forwarder/blob/main/docs/README-AWS.md#deploying-elastic-serverless-forwarder">Lambda logs forwarder</a>. This option will pull logs from multiple locations. See the architecture diagram below.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-lambda-logs-forwarder.png" alt="" /></p>
<p>A review of this option is also found in the following <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">blog</a>.</p>
<h3>Analyze your data with Elastic Machine Learning</h3>
<p>Once metrics and logs (or either one) are in Elastic, start analyzing your data through Elastic’s ML capabilities. A great review of these features can be found here:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM Telemetry to determine root causes in transactions</a></li>
<li><a href="https://www.elastic.co/elasticon/archive/2020/global/machine-learning-and-the-elastic-stack-everywhere-you-need-it">Introduction to Elastic Machine Learning</a></li>
</ul>
<p>And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>.</p>
<h2>Conclusion: Monitoring AWS service metrics with Elastic Observability is easy!</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you monitor AWS service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of AWS service metrics</li>
<li>It’s easy to set up ingest from AWS Services via the Elastic Agent</li>
<li>Elastic Observability has multiple out-of-the-box (OOTB) AWS service dashboards you can use to preliminarily review information, then modify for your needs</li>
<li>30+ AWS services are supported as part of AWS Integration on Elastic Observability, with more services being added regularly</li>
<li>As noted in related blogs, you can analyze your AWS service metrics with Elastic’s machine learning capabilities</li>
</ul>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-charts-packages.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AWS VPC Flow log analysis with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic</link>
            <guid isPermaLink="false">aws-vpc-flow-log-analysis-with-genai-elastic</guid>
            <pubDate>Fri, 07 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from AWS VPC Flows easier.]]></description>
<content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. In managing AWS deployments, VPC flow logs are critical for performance, network visibility, security, compliance, and overall management of your AWS environment. Several examples of what they can reveal:</p>
<ol>
<li>
<p>Where traffic is coming from and going to, both into and out of the deployment and within it. This helps identify unusual or unauthorized communications</p>
</li>
<li>
<p>Traffic volumes, where spikes or drops could indicate service issues in production or an increase in customer traffic</p>
</li>
<li>
<p>Latency and performance bottlenecks - with VPC flow logs, you can look at latency for a flow (inflows and outflows) and understand patterns</p>
</li>
<li>
<p>Accepted and rejected traffic, which helps determine where potential security threats and misconfigurations lie. </p>
</li>
</ol>
<p>AWS VPC flow logs are a great example of the value of logging. Logging is an important part of observability, alongside metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting with VPC flow logs, but those logs also provide a significant amount of insight.</p>
<p>Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<p>In today’s blog, we’ll cover how Elastic’s other features make analyzing VPC flow logs and performing root cause analysis (RCA) even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there is no unauthorized or unusual activity:</p>
<ol>
<li>
<p>Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading</p>
</li>
<li>
<p>Create an ML job to analyze different fields of the VPC Flow log</p>
</li>
<li>
<p>Use our newly released RAG-based AI Assistant to analyze the logs without needing to know Elastic’s query language or even how to build graphs in Elastic</p>
</li>
<li>
<p>Use ES|QL to help understand and analyze latency patterns.</p>
</li>
</ol>
<p>In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights from VPC flow logs beyond just REJECT/ACCEPT.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Follow the instructions in the <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS three-tier app</a> repository to get the app installed, and bring in the <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS VPC Flow logs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack</p>
</li>
<li>
<p>To use the AI Assistant, you will need a trial license or an upgrade to Platinum.</p>
</li>
</ul>
<h2>SLO with VPC Flow Logs</h2>
<p>Elastic’s SLO capability follows the definitions and semantics described in the Google SRE Handbook. Users can perform the following with SLOs in Elastic:</p>
<ul>
<li>Define an SLO on logs, not just metrics - users can use a KQL (log-based) query, service availability, service latency, a custom metric, a histogram metric, or a timeslice metric.</li>
<li>Define SLO, SLI, Error budget and burn rates. Users can also use occurrence versus time slice-based budgeting. </li>
<li>Manage, with dashboards, all the SLOs in a singular location.</li>
<li>Trigger alerts from the defined SLO - for example, when the SLI is off target, the error budget burn rate is too high, or the error rate crosses a threshold.</li>
</ul>
<p>Setting up an SLO for VPC is easy. You simply create a query defining the events you want to track. In our case, we count all the good events where <em>aws.vpcflow.action=ACCEPT</em>, and we set the target at 85%. </p>
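<p>To make the budget arithmetic concrete, here is a small JavaScript sketch of occurrence-based SLO math; the function name and example numbers are illustrative, not Elastic’s implementation:</p>

```javascript
// Occurrence-based SLO math (illustrative only).
// Good events: flows with aws.vpcflow.action === "ACCEPT"; target: 85%.
function sloStatus(goodEvents, totalEvents, target = 0.85) {
  const sli = goodEvents / totalEvents;          // observed ratio of good events
  const errorBudget = 1 - target;                // allowed ratio of bad events (15%)
  const badRatio = 1 - sli;                      // observed ratio of bad events
  const budgetConsumed = badRatio / errorBudget; // > 1 means the budget is exhausted
  return { sli, budgetConsumed, compliant: sli >= target };
}

// e.g. 8,000 ACCEPTs out of 10,000 flows: SLI 0.80, budget consumed ~1.33x
console.log(sloStatus(8000, 10000));
```

<p>In other words, once the ratio of REJECTs exceeds the 15% error budget, the SLO is out of compliance, which is exactly what the dashboard below reports.</p>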
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOsetup.png" alt="Setting up SLO for VPC FLow log" /></p>
<p>As the following example shows, over the last 7 days we have exceeded our error budget by 43% and have been out of compliance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOMiss.png" alt="VPC Flow Reject SLO" /></p>
<h2>Analyzing the SLO with AI Assistant</h2>
<p>Now that we see there is an issue with the VPC flows, we can work with the AI Assistant to start analyzing the SLO. Because it's a chat interface, we simply open the AI Assistant and work through some simple analysis (see the animated GIF below for a demo):</p>
<h3>AI Assistant analysis:</h3>
<ul>
<li>
<p><strong>what were the top 3 source.address that had <em>aws.vpcflow.action=REJECT</em> over the last 7 days, which is causing this SLO issue?</strong> - We wanted to see what could be causing the loss in error budget. Were there any particular source.addresses causing a heavy reject rate?</p>
<ul>
<li>
<p>The answer: A table with the highest count = 42670 and <em>source.address = 79.110.62.185</em></p>
</li>
<li>
<p>A single <em>source.address</em> is causing the loss in SLO. </p>
</li>
</ul>
</li>
<li>
<p><strong>What is the largest number of  <em>aws.vpcflow.action=REJECT</em> in a 30 min time frame for the last 3 days where the <em>source.address=79.110.62.185</em>?</strong> - After understanding that a specific source.address is causing the loss in SLO, we want to understand the averages. </p>
<ul>
<li>
<p><strong>The answer:</strong> &quot;The largest number of <em>aws.vpcflow.action=REJECT</em> in a 30-minute time frame for the last 3 days where the <em>source.address</em> is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z.&quot;</p>
</li>
<li>
<p>This means the REJECT rate must be low but fairly consistent, rather than spiky, over the last 7 days.</p>
</li>
</ul>
</li>
<li>
<p><strong>for the logs with <em>source.address</em>=&quot;79.110.62.185&quot; was there any country code present in the <em>source.geo.country_iso_code</em> field? If yes, what is the value?</strong> - Since the REJECT rate from this address was low but consistent, we next checked where the traffic originates.</p>
<ul>
<li><strong>The answer:</strong> Yes, there is a country code present in the <em>source.geo.country_iso_code</em> field for logs with <em>source.address</em>=&quot;79.110.62.185&quot;. The value is BG (Bulgaria).</li>
</ul>
</li>
<li>
<p><strong>Is there a specific destination.address where <em>source.address=79.110.62.185</em> is getting a <em>aws.vpcflow.action=REJECT</em>? Give me both the destination.address and the number of REJECTs for that destination.address.</strong></p>
<ul>
<li><strong>The answer:</strong> A destination.address of 10.0.0.27 received 53433 REJECTs in this time frame.</li>
</ul>
</li>
<li>
<p><strong>Graph the number of REJECT vs ACCEPT for <em>source.address</em>=&quot;79.110.62.185&quot; over the last 7 days, on a daily basis, in a single graph</strong> - We asked this question to compare ACCEPT and REJECT over time. </p>
<ul>
<li><strong>The answer:</strong> See the animated GIF below; the generated graph is fairly stable.</li>
</ul>
</li>
<li>
<p><strong>Were there any source.address values that had a spike in reject rate within a 30-minute period over the last 30 days?</strong> - We wanted to see if there were any other spikes. </p>
<ul>
<li><strong>The answer</strong> - Yes, there was a source.address that had a spike in high reject rates in a 30-minute period over the last 30 days. <em>source.address</em>: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z</li>
</ul>
</li>
</ul>
<hr />
<h3>Watch the flow</h3>
&lt;Video vidyardUuid=&quot;1jvEpzfkci9j6AoL42XWA3&quot; /&gt;
<h3>Potential issue:</h3>
<p>The server handling requests from source <strong><em>79.110.62.185</em></strong> is potentially having an issue.</p>
<p>Again using logs, we asked the AI Assistant for the <em>eni</em> IDs where the internal IP address was 10.0.0.27.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlow-findingwebserver.png" alt="Finding the issue - webserver" /></p>
<p>From our AWS console, we know that this is the webserver. After further analysis in Elastic and with the developers, we realized a recently installed new version was causing a problem with connections.</p>
<h2>Locating anomalies with ML</h2>
<p>While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.</p>
<p>VPC Flow logs come with a large amount of information. The full set of fields is listed in <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-logs-basics">AWS docs</a>. We will use a specific subset to help detect anomalies.</p>
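<p>To illustrate what a raw record holds, here is a minimal JavaScript parser for the default (version 2) flow log record format; the sample line and the <code>parseFlowRecord</code> helper are hypothetical, assembled from the field order in the AWS docs:</p>

```javascript
// Minimal parser for the default VPC Flow Logs record format (version 2).
// Field order per AWS docs: version account-id interface-id srcaddr dstaddr
// srcport dstport protocol packets bytes start end action log-status
const FIELDS = [
  "version", "account_id", "interface_id", "srcaddr", "dstaddr",
  "srcport", "dstport", "protocol", "packets", "bytes",
  "start", "end", "action", "log_status",
];

function parseFlowRecord(line) {
  const values = line.trim().split(/\s+/);
  return Object.fromEntries(FIELDS.map((f, i) => [f, values[i]]));
}

// Hypothetical record mirroring the rejected traffic discussed above
const record = parseFlowRecord(
  "2 123456789010 eni-abc123de 79.110.62.185 10.0.0.27 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK"
);
console.log(record.srcaddr, record.action); // 79.110.62.185 REJECT
```

<p>Fields like <code>srcaddr</code>, <code>dstaddr</code>, and <code>action</code> map to the <em>source.address</em>, <em>destination.address</em>, and <em>aws.vpcflow.action</em> fields used by the detectors and influencers below.</p>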
<p>We set up anomaly detection for aws.vpcflow.action=REJECT, which requires multi-metric anomaly detection in Elastic.</p>
<p>The config we used utilizes:</p>
<p>Detectors:</p>
<ul>
<li>
<p>destination.address</p>
</li>
<li>
<p>destination.port</p>
</li>
</ul>
<p>Influencers:</p>
<ul>
<li>
<p>source.address</p>
</li>
<li>
<p>aws.vpcflow.action</p>
</li>
<li>
<p>destination.geo.region_iso_code</p>
</li>
</ul>
<p>The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against <em>destination.address</em> values from a specific <em>source.address</em> and/or <em>destination.geo.region_iso_code</em> location.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowanomalysetup.png" alt="Anomaly detection job config" /></p>
<p>Once run, the job reveals something interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowAnomalyDetection.png" alt="Anomaly detected" /></p>
<p>Notice that <em>source.address</em> 185.244.212.67 has had a high REJECT rate in the last 30 days. </p>
<p>Notice where we found this before? In the AI Assistant!</p>
<p>While we can find this sort of anomaly with the AI Assistant, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are any issues with the webserver, as we found above, or even potential security attacks.</p>
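<p>The kind of spike the job flags can be sketched in a few lines of JavaScript. This is purely illustrative (a median-plus-MAD threshold over per-interval REJECT counts), not the model Elastic’s ML actually uses; the counts are made up, with the 8975-count interval mirroring the spike found above:</p>

```javascript
// Illustrative spike detection over per-interval REJECT counts using a
// median + MAD (median absolute deviation) threshold.
function findSpikes(counts, threshold = 5) {
  const sorted = [...counts].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  const deviations = counts.map((c) => Math.abs(c - median)).sort((a, b) => a - b);
  const mad = deviations[Math.floor(deviations.length / 2)] || 1;
  return counts
    .map((count, interval) => ({ interval, count, score: (count - median) / mad }))
    .filter((p) => p.score > threshold); // keep only extreme outliers
}

// Hypothetical 30-minute REJECT counts for one source.address
const rejectsPerInterval = [210, 198, 225, 215, 8975, 220, 205];
console.log(findSpikes(rejectsPerInterval)); // flags the 8975-count interval
```

<p>A robust statistic like the median keeps a single huge spike from inflating the baseline, which is why the steady ~200-REJECT intervals are not flagged.</p>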
<h2>Conclusion</h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC flows without needing to know the query syntax, where the data lives, or even the field names. You’ve also seen how Elastic can alert you to a potential issue or degradation in service via an SLO. Check out our other blogs on AWS VPC Flow analysis in Elastic:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/21-cubes.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using Azure SRE Agent and Elasticsearch to boost SRE productivity]]></title>
            <link>https://www.elastic.co/observability-labs/blog/azure-sre-agent-elasticsearch</link>
            <guid isPermaLink="false">azure-sre-agent-elasticsearch</guid>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to integrate the Azure SRE Agent with Elasticsearch to benefit from AI-driven autonomous operations, smarter detection, and proactive prevention.]]></description>
<content:encoded><![CDATA[<p>If you’re a Site Reliability Engineer (SRE), you know the feeling: the cloud landscape is growing, and the architectural complexity is crushing. You’re constantly jumping between fragmented toolsets, spending too much time on manual, repetitive tasks just to manage compute, storage, and networking services. That constant toil leads to high Mean Time to Recovery (MTTR) and, let's be honest, serious operational burnout.</p>
<p>This is why adopting an AI-driven approach isn't just critical—it’s necessary to solve modern system challenges. Autonomous agents can automate complete operational workflows with minimal human intervention, empowering SRE teams to move beyond constant reactive issue resolution toward proactive system engineering. But here’s the key: the effectiveness of any autonomous agent depends entirely on the quality of its underlying data. By seamlessly integrating the Azure SRE Agent with Elastic Observability, we’re not just offering simple automation; we’re giving organizations a strategy to enter a new phase of governed, AI-driven autonomous operations. </p>
<p>In this blog, we’ll go over how Elastic Observability and the Azure SRE Agent work together, how this integration empowers SREs with AI-driven operations, and how to get started.</p>
<h2>The Power of Choice: Why Elastic Observability is the Foundation for AI-Driven Ops</h2>
<p>For the modern SRE, Elastic Observability serves as the indispensable high-fidelity data foundation. Elastic transforms environmental complexity into a strategic asset by providing a unified, search-powered view of Logs, Metrics, and Traces.</p>
<p>The Azure SRE Agent requires more than just raw data; it requires governed, real-time production insights. Elastic delivers this through <strong>ES|QL</strong>, our piped-query language that allows for high-speed telemetry correlation and transformation. Specifically optimized for <strong>Elastic 9.2.0+</strong> and <strong>Elasticsearch Serverless</strong> projects, this integration utilizes the Model Context Protocol (MCP) to provide the agent with deep system context.</p>
<p><strong>Pro-Tip:</strong> To leverage this integration, ensure that the <strong>Agent Builder</strong> feature is enabled within your Elastic deployment, as this serves as the gateway for the agent to access your production environment securely.</p>
<h2>Better Together: The Value of the Elastic and Azure SRE Agent Integration</h2>
<p>Combining Elastic’s search-powered observability with Azure’s agentic automation creates a &quot;Better Together&quot; ecosystem that provides several strategic advantages:</p>
<ul>
<li>
<p><strong>Smarter Detection &amp; Remediation:</strong> Infuse Elastic’s real-time governed data and causal analysis into Azure SRE Agent workflows. This allows the agent to not only identify a symptom but also understand the underlying root cause.</p>
</li>
<li>
<p><strong>Context-Rich Investigation:</strong> SREs can accelerate triage by providing the agent with full production context—including the blast radius of an incident—directly where the SRE works. This eliminates the &quot;swivel-chair&quot; effect of switching between monitoring dashboards.</p>
</li>
<li>
<p><strong>Proactive Prevention:</strong> By utilizing historical trends and real-time signals from Elastic, the Azure SRE Agent can stop regressions and performance degradations before they impact the end-user experience.</p>
</li>
<li>
<p><strong>Natural Language Interaction:</strong> Through the Elasticsearch MCP server, SREs can query complex clusters using natural language, making deep data exploration accessible without needing to master complex query syntax.</p>
</li>
</ul>
<h2>Practical Scenarios: Elastic-Powered SRE in Action</h2>
<p>This integration empowers SREs to solve real-world problems through conversational automation:</p>
<ol>
<li>
<p><strong>Incident Triage:</strong> An SRE prompts the agent: <em>&quot;Search for errors in the last hour across all logs indices.&quot;</em> The agent invokes the MCP tools in Agent Builder to return a prioritized list of error logs, identifying a service spike in seconds.</p>
</li>
<li>
<p><strong>Performance Analysis:</strong> To identify a recurring pattern, an SRE commands: <em>&quot;Run an ES|QL query to find the top 10 error types.&quot;</em> The agent uses ES|QL to aggregate telemetry, allowing the team to prioritize development fixes based on frequency.</p>
</li>
<li>
<p><strong>Infrastructure Health:</strong> During a suspected Azure resource failure, an SRE can check the data layer by asking: <em>&quot;Show me metric information for my cluster.&quot;</em> By invoking MCP tools, the agent determines if a node failure is impacting data availability.</p>
</li>
</ol>
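<p>As an illustration of the second scenario, a query like the following could surface the top 10 error types. The index pattern and ECS field names here are assumptions, and the ES|QL is shown embedded in a JavaScript string, much as an agent might construct it before invoking the MCP tools:</p>

```javascript
// Hypothetical ES|QL query an agent might run for "find the top 10 error types".
// Assumes logs-* indices with ECS fields log.level and error.type.
const topErrorsQuery = `
FROM logs-*
| WHERE log.level == "error"
| STATS error_count = COUNT(*) BY error.type
| SORT error_count DESC
| LIMIT 10
`;
console.log(topErrorsQuery.trim());
```

<p>The agent aggregates telemetry with a query of this shape, then returns the ranked error types so the team can prioritize fixes by frequency.</p>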
<h2>Practical How-to Guide: Integrating Elastic with the Azure SRE Agent</h2>
<ol>
<li>In Elastic, via your Kibana interface, create an API key and save it:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-1-2.png" alt="" /></p>
<ol start="2">
<li>Find and copy your MCP Endpoint in Agent Builder:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-2.png" alt="" /></p>
<ol start="3">
<li>In the Azure portal, find the SRE Agent service:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-3.png" alt="" /></p>
<ol start="4">
<li>Create an Agent:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-4.png" alt="" /></p>
<ol start="5">
<li>Add the Elastic Connector:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-5.png" alt="" /></p>
<ol start="6">
<li>Talk to your agent. Use “/agent” to select your agent in the chat interface:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-6.png" alt="" /></p>
<h2>Conclusions</h2>
<p>The integration of Elastic Observability and the Azure SRE Agent represents a strategic leap forward for cloud operations. By combining Elastic's superior data depth and ES|QL engine with Azure’s autonomous automation, organizations can drastically reduce MTTR, eliminate toil, and maximize the ROI of their Azure investments.</p>
<h2>Next Steps</h2>
<p>Explore the <a href="https://marketplace.microsoft.com/en-us/product/elastic.ec-azure-observability?tab=Overview">Elasticsearch Observability</a> solution on the Microsoft Marketplace and visit the <a href="https://sre.azure.com">Azure SRE Agent resource</a> to begin your trial of Elastic-centric autonomous operations today.</p>
<p>Learn more by checking out the following links:</p>
<ul>
<li><a href="https://techcommunity.microsoft.com/blog/appsonazureblog/get-started-with-elasticsearch-mcp-server-in-azure-sre-agent/4492896">Microsoft: Get started with Elasticsearch MCP server in Azure SRE Agent</a></li>
<li><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP Server</a></li>
<li><a href="https://learn.microsoft.com/en-us/azure/sre-agent/custom-mcp-server">Azure SRE Agent MCP Overview</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/cover.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Best practices for instrumenting OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/best-practices-instrumenting-opentelemetry</link>
            <guid isPermaLink="false">best-practices-instrumenting-opentelemetry</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting OpenTelemetry is complex. Even using auto-instrumentation requires understanding details about your application and OpenTelemetry configuration options. We’ll cover the best practices for instrumenting applications for OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is steadily gaining broad industry adoption. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining momentum, with major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data, providing a de facto standard for observability. With that, teams can rely on vendor-agnostic, future-proof instrumentation of their applications that allows them to switch observability backends without additional overhead in adapting instrumentation.</p>
<p>Teams that have chosen OpenTelemetry for instrumentation face a choice between different instrumentation techniques and data collection approaches. Determining how to instrument and what mechanism to use can be challenging. In this blog, we will go over Elastic’s recommendations around some best practices for OpenTelemetry instrumentation:</p>
<ul>
<li><strong>Automatic or manual?</strong> We’ll cover the need for one versus the other and provide recommendations based on your situation.</li>
<li><strong>Collector or direct from the application?</strong> While the traditional option is to use a collector, observability tools like Elastic Observability can take telemetry from OpenTelemetry applications directly.</li>
<li><strong>What to instrument from OTel SDKs.</strong> Traces and metrics are well supported (<a href="https://opentelemetry.io/docs/instrumentation/">per the table in OTel docs</a>), but logs are still in progress. Elastic<sup>®</sup> is accelerating that progress with its <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">contribution of ECS to OTel</a>. Regardless of the status from OTel, you need to test and ensure these instrumentations work for you.</li>
<li><strong>Advantages and disadvantages of OpenTelemetry</strong></li>
</ul>
<h2>OTel automatic or manual instrumentation: Which one should I use?</h2>
<p>While there are two ways to instrument your applications with OpenTelemetry — automatic and manual — there isn’t a perfect answer, as it depends on your needs. There are pros and cons of using one versus another, such as:</p>
<ul>
<li>Auto-magic experience vs. control over instrumentation</li>
<li>Customization vs. out-of-the-box data</li>
<li>Instrumentation overhead</li>
<li>Simplicity vs. flexibility</li>
</ul>
<p>Additionally, you might even land on a combination depending on availability and need.</p>
<p>Let’s review both automatic and manual instrumentation and explore specific recommendations.</p>
<h3>Auto-instrumentation</h3>
<p>For most programming languages and runtimes, OpenTelemetry provides an auto-instrumentation approach for gathering telemetry data. Auto-instrumentation provides a set of pre-defined, out-of-the-box instrumentation modules for well-known frameworks and libraries. With that, users can gather telemetry data (such as traces, metrics, and logs) from well-known frameworks and libraries used by their application with minimal or even no code changes.</p>
<p>Here are some of the apparent benefits of using auto-instrumentation:</p>
<ul>
<li>Quicker development and path to production. Auto-instrumentation saves time by accelerating the process of integrating telemetry into an application, allowing more focus on other critical tasks.</li>
<li>Simpler maintenance by only having to update one line, which is usually the container start command where auto-instrumentation is configured, versus having to update multiple lines of code across multiple classes, methods, and services.</li>
<li>Easier to keep up with the latest features and improvements in the OpenTelemetry project without manually updating the instrumentation of used libraries and/or code.</li>
</ul>
<p>There are also some disadvantages and limitations of the auto-instrumentation approach:</p>
<ul>
<li>Auto-instrumentation collects telemetry data only for the frameworks and libraries in use for which an explicit auto-instrumentation module exists. In particular, it’s unlikely that auto-instrumentation would collect telemetry data for “exotic” or custom libraries.</li>
<li>Auto-instrumentation does not capture telemetry for pure custom code (that does not use well-known libraries underneath).</li>
<li>Auto-instrumentation modules come with a pre-defined, opinionated instrumentation logic that provides sufficient and meaningful information in the vast majority of cases. However, in some custom edge cases, the information value, structure, or level of detail of the data provided by auto-instrumentation modules might be not sufficient.</li>
<li>Depending on the runtime, technology, and size of the target application, auto-instrumentation may come with a (slightly) higher start-up or runtime overhead compared to manual instrumentation. In the majority of cases, this overhead is negligible but may become a problem in some edge cases.</li>
</ul>
<p><a href="https://github.com/elastic/workshops-instruqt/blob/main/Elastiflix/python-favorite-otel-auto/Dockerfile">Here</a> is an example of a Python application that was auto-instrumented with OpenTelemetry. If you had a Python application locally, you would add the code below to auto-instrument:</p>
<pre><code class="language-bash">opentelemetry-instrument \
    --traces_exporter otlp \
    --metrics_exporter otlp \
    --service_name your-service-name \
    --exporter_otlp_endpoint http://localhost:4317 \
    python main.py
</code></pre>
<p><a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Learn more about auto-instrumentation with OpenTelemetry for Python applications</a>.</p>
<p>Finally, auto-instrumentation lets developers get value from OpenTelemetry without first mastering its APIs, avoiding the complexities that come with manual instrumentation. However, manual instrumentation might still be preferred for specific use cases or when custom requirements cannot be fully addressed by auto-instrumentation.</p>
<h3>Combination: Automatic and Manual</h3>
<p>Before we proceed with manual instrumentation, you can also use a combination of automatic and manual instrumentation. As we noted above, if you start to understand the application’s behavior, then you can determine if you need some additional instrumentation for code that is not being traced by auto-instrumentation.</p>
<p>Additionally, because not all the auto-instrumentation is equal across the OTel language set, you will probably need to manually instrument in some cases — for example, if auto-instrumentation of a Flask-based Python application doesn’t automatically show middleware calls like calls to the requests library. In this situation, you will have to go with manual instrumentation for the Python application if you want to also see middleware tracing. However, as these libraries mature, more support options will become available.</p>
<p>A combination is where most developers will ultimately land when the application gets to near production quality.</p>
<h3>Manual instrumentation</h3>
<p>If the auto-instrumentation does not cover your needs, you want more control over the instrumentation, or you’d like to treat instrumentation as code, manual instrumentation is likely the right choice for you. As described above, you can use it as an enhancement to auto-instrumentation or switch entirely to manual instrumentation. If you eventually go down the path of manual instrumentation, it definitely provides more flexibility, but it also means you will not only have to code in the traces and metrics but also maintain that code regularly.</p>
<p>As new features are added and changes to the libraries are made, the maintenance for the code may or may not be cumbersome. It’s a decision that requires some forethought.</p>
<p>Here are some reasons why you would potentially use manual instrumentation:</p>
<ul>
<li>You may already have some OTel instrumented applications using auto-instrumentation and need to add more telemetry for specific functions or libraries (like DBs or middleware), thus you will have to add manual instrumentation.</li>
<li>You need more flexibility and control in terms of the application language and what you’d like to instrument.</li>
<li>In case there's no auto-instrumentation available for your programming language and the technologies in use, manual instrumentation would be the way to go for your applications built using these languages.</li>
<li>You might have to instrument for logging with an alternative approach, as logging is not yet stable for all the programming languages.</li>
<li>You need to customize and enrich your telemetry data for your specific use cases — for example, you have a multi-tenant application and you need to get each tenant’s information and then use manual instrumentation via the OpenTelemetry SDK.</li>
</ul>
<p><strong>Recommendations for manual instrumentation</strong><br />
Manual instrumentation will require specific configuration to ensure you have the best experience with OTel. Below are Elastic’s recommendations (as outlined by the <a href="https://www.cncf.io/blog/2020/06/26/opentelemetry-best-practices-overview-part-2-2/">CNCF</a>) for gaining the most benefit when instrumenting manually:</p>
<ol>
<li>
<p>Ensure that your provider configuration and tracer initialization is done properly.</p>
</li>
<li>
<p>Ensure you set up spans in all the functions you want traced.</p>
</li>
<li>
<p>Set up resource attributes correctly.</p>
</li>
<li>
<p>Use batch rather than simple processing.</p>
</li>
</ol>
<p>Let’s review these individually:</p>
<p><strong>1. Ensure that your provider configuration and tracer initialization is done properly.</strong><br />
The general rule of thumb is to ensure you configure all your variables and tracer initialization in the front of the application. Using the <a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix application’s Python favorite service</a> as an example, we can see:</p>
<p><em>Tracer being set up globally</em></p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

...


resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(otel_service_name)
</code></pre>
<p>In the above, we’ve added the OpenTelemetry trace module and imported TracerProvider, which is the entry point of the API. It provides access to the Tracer, the class responsible for creating spans.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that provides hooks for span start and end method invocations.</p>
<p>In OpenTelemetry, different span processors are offered. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be configured to be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See OpenTelemetry Documentation</a>.</p>
<p>The variable otel_service_name is set via environment variables; related settings (i.e., the OTLP endpoint and others) are also set up globally. See below:</p>
<pre><code class="language-python">otel_service_name = os.environ.get('OTEL_SERVICE_NAME') or 'favorite_otel_manual'
environment = os.environ.get('ENVIRONMENT') or 'dev'
otel_service_version = os.environ.get('OTEL_SERVICE_VERSION') or '1.0.0'

otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
</code></pre>
<p>In the above code, we initialize several variables, which fall into two groups:</p>
<p><strong>Resource variables (we will cover this later in this article):</strong></p>
<ul>
<li>otel_service_name – This helps set the name of the service (service.name) in OTel Resource attributes.</li>
<li>otel_service_version – This helps set the version of the service (service.version) in OTel Resource attributes.</li>
<li>environment – This helps set the deployment.environment variable in OTel Resource attributes.</li>
</ul>
<p><strong>Exporter variables:</strong></p>
<ul>
<li>otel_exporter_otlp_endpoint – This helps set the OTLP endpoint where traces, logs, and metrics are sent. Elastic would be an OTLP endpoint. You can also use OTEL_TRACES_EXPORTER or OTEL_METRICS_EXPORTER if you want to only send traces and/or metrics to specific endpoints.</li>
<li>otel_exporter_otlp_headers – This sets the authorization header needed for the endpoint.</li>
</ul>
<p>The separation of your provider and tracer configuration allows you to use any OpenTelemetry provider and tracing framework that you choose.</p>
<p><strong>2. Set up your spans inside the application functions themselves.</strong><br />
Make sure your spans end and are in the right context so you can track the relationships between spans. In our Python favorite application, the function that retrieves a user’s favorite movies shows:</p>
<pre><code class="language-python">@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    # add artificial delay if enabled
    if delay_time &gt; 0:
        time.sleep(max(0, random.gauss(delay_time/1000, delay_time/1000/10)))

    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
        return { &quot;favorites&quot;: favorites}
</code></pre>
<p>While you can instrument every function, it’s strongly recommended that you instrument only what you need to avoid a flood of data. What to instrument depends not only on development needs but also on what SREs, and potentially the business, need to observe in the application. Instrument for your target use cases.</p>
<p>Also, avoid instrumenting trivial utility methods and functions, or ones intended to be called extensively (e.g., getters/setters). Instrumenting those produces a huge amount of telemetry data with very low additional value.</p>
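<p>To make the &quot;instrument what you need&quot; advice concrete, here is a stdlib-only sketch. The ToyTracer is a stand-in for a real OTel tracer, used only so the example runs without the SDK: the business-relevant function gets a span, while the trivial utility deliberately does not:</p>

```python
from contextlib import contextmanager

# Stand-in for an OTel tracer so the example is self-contained; real code
# would use trace.get_tracer(...) as shown earlier in this post.
class ToyTracer:
    def __init__(self):
        self.finished = []  # names of spans that have ended

    @contextmanager
    def start_as_current_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)

tracer = ToyTracer()

def get_favorites(user_id):
    # Business-relevant operation: worth a span.
    with tracer.start_as_current_span("get_favorites"):
        return sorted(["movie-1", "movie-2"])

def normalize(title):
    # Trivial utility called extensively: deliberately left uninstrumented
    # to avoid flooding the backend with low-value spans.
    return title.strip().lower()
```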
<p><strong>3. Set resource attributes and use semantic conventions</strong></p>
<p><em><strong>Resource attributes</strong></em><br />
Attributes such as service.name, service.version, deployment.environment, and the cloud.* attributes are important for tracking the version, environment, cloud provider, etc. of a specific service. Resource attributes describe resources such as hosts, systems, processes, and services, and do not change during the lifetime of the resource. They are a great help in correlating data, providing additional context to telemetry data and, thus, helping narrow down root causes of problems during troubleshooting. While auto-instrumentation sets these up for you, with manual instrumentation you need to ensure your application also sends them.</p>
<p>Check out OpenTelemetry’s list of attributes that can be set in the <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OTel documentation</a>.</p>
<p>In our auto-instrumented Python application from above, here is how we set up resource attributes:</p>
<pre><code class="language-bash">opentelemetry-instrument \
    --traces_exporter console,otlp \
    --metrics_exporter console \
    --service_name your-service-name \
    --exporter_otlp_endpoint 0.0.0.0:4317 \
    python myapp.py
</code></pre>
<p>However, when instrumenting manually, you need to add your resource attributes and ensure you have consistent values across your application’s code. Resource attributes have been defined by OpenTelemetry’s Resource Semantic Convention and can be found <a href="https://opentelemetry.io/docs/specs/semconv/resource/">here</a>. In fact, your organization should have a resource attribute convention that is applied across all applications.</p>
<p>These attributes are added to your metrics, traces, and logs, helping you filter out data, correlate, and make more sense out of them.</p>
<p>Here is an example of setting resource attributes in our Python service:</p>
<pre><code class="language-python">resource_attributes = {
    &quot;service.name&quot;: otel_service_name,
    &quot;service.version&quot;: otel_service_version,
    &quot;deployment.environment&quot;: environment
}

resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)
</code></pre>
<p>We’ve set up service.name, service.version, and deployment.environment. You can set up as many resource attributes as you need, but you must ensure you pass them into the tracer provider with provider = TracerProvider(resource=resource).</p>
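<p>One way to keep resource attributes consistent across services is to combine in-code defaults with the standard OTEL_RESOURCE_ATTRIBUTES environment variable (comma-separated key=value pairs). The SDK performs a similar merge inside Resource.create(); the stdlib-only sketch below assumes, for illustration, that environment values override in-code defaults — check your SDK’s actual precedence rules:</p>

```python
import os

# Parse the standard OTEL_RESOURCE_ATTRIBUTES format ("k=v,k2=v2") and merge
# it over in-code defaults. Precedence here (env wins) is an assumption for
# illustration; verify against your SDK.
def resource_from_env(defaults):
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    parsed = dict(pair.split("=", 1) for pair in raw.split(",") if "=" in pair)
    return {**defaults, **parsed}

os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "deployment.environment=prod"
attrs = resource_from_env({
    "service.name": "favorite",
    "deployment.environment": "dev",
})
```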
<p><em><strong>Semantic conventions</strong></em><br />
In addition to setting the appropriate resource attributes in code, it is important to follow OpenTelemetry’s semantic conventions for the specific technologies and infrastructure your application is built on. For example, if you need to instrument databases, there is no automatic instrumentation; you will have to manually instrument tracing for database calls. In doing so, you should use the <a href="https://opentelemetry.io/docs/specs/semconv/database/database-spans/">semantic conventions for database calls in OpenTelemetry</a>.</p>
<p>Similarly, if you are trying to trace Kafka or RabbitMQ, you can follow the <a href="https://opentelemetry.io/docs/specs/semconv/messaging/">OpenTelemetry semantic conventions for messaging systems</a>.</p>
<p>There are multiple semantic conventions across several areas and signal types that can be followed using OpenTelemetry — <a href="https://opentelemetry.io/docs/specs/semconv/">check out the details</a>.</p>
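<p>As a small example of following a semantic convention, the attribute keys below come from the OpenTelemetry database conventions (db.system, db.name, db.operation, db.statement); the values are made up:</p>

```python
# Attribute keys per the OTel database semantic conventions; the values are
# illustrative only.
db_span_attributes = {
    "db.system": "postgresql",
    "db.name": "movies",
    "db.operation": "SELECT",
    "db.statement": "SELECT title FROM favorites WHERE user_id = %s",
}

# In real code these would be set on the active span, e.g.:
# with tracer.start_as_current_span("SELECT movies.favorites") as span:
#     for key, value in db_span_attributes.items():
#         span.set_attribute(key, value)
```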
<p><strong>4. Use batch or simple processing?</strong><br />
Using simple or batch processing depends on your specific observability requirements. The advantages of batch processing include improved efficiency and reduced network overhead. Batch processing allows you to process telemetry data in batches, enabling more efficient data handling and resource utilization. On the other hand, batch processing increases the lag time for telemetry data to appear in the backend, as the span processor needs to wait for a sufficient amount of data before sending it over to the backend.</p>
Using simple or batch processing depends on your specific observability requirements. The advantages of batch processing include improved efficiency and reduced network overhead. Batch processing allows you to process telemetry data in batches, enabling more efficient data handling and resource utilization. On the other hand, batch processing increases the lag time for telemetry data to appear in the backend, as the span processor needs to wait for a sufficient amount of data to send over to the backend.</p>
<p>With simple processing, you send your telemetry data as soon as the data is generated, resulting in real-time observability. However, you will need to prepare for higher network overhead and more resources required to process all the separate data transmissions.</p>
<p>Here is what we used to set this up in Python:</p>
<pre><code class="language-python">from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)
</code></pre>
<p>Your observability goals and budgetary constraints are the deciding factors when choosing batch or simple processing. A hybrid approach can also be implemented. If real-time insights are critical for an ecommerce application, for example, then simple processing would be the better approach. For other applications where real-time insights are not crucial, consider batch processing. Often, experimenting with both approaches and seeing how your observability backend handles the data is a fruitful exercise to hone in on what approach works best for the business.</p>
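<p>The trade-off above can be seen in a stdlib-only simulation. These classes only mimic the buffering behavior of the SDK’s processors — they are not the OTel API: with 10 spans and a batch size of 5, the simple path makes one export call per span while the batch path makes two:</p>

```python
class RecordingExporter:
    """Records each "network" export call and the spans it carried."""
    def __init__(self):
        self.calls = []

    def export(self, spans):
        self.calls.append(list(spans))

class SimpleProcessor:
    """Exports every span immediately: real-time, but one call per span."""
    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])

class BatchProcessor:
    """Buffers spans and exports in bulk: fewer calls, higher latency."""
    def __init__(self, exporter, max_batch=512):
        self.exporter, self.max_batch, self.buffer = exporter, max_batch, []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []

simple_exp, batch_exp = RecordingExporter(), RecordingExporter()
simple = SimpleProcessor(simple_exp)
batch = BatchProcessor(batch_exp, max_batch=5)
for i in range(10):
    simple.on_end(f"span-{i}")
    batch.on_end(f"span-{i}")
batch.flush()  # drain any remainder, as SDK shutdown would
```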
<h2>Use the OpenTelemetry Collector or go direct?</h2>
<p>When starting out with OpenTelemetry, ingesting and transmitting telemetry data directly to a backend such as Elastic is a good way to get started. Often, you would be using the OTel direct method in the development phase and in a local environment.</p>
<p>However, as you deploy your applications to production, the applications become fully responsible for ingesting and sending telemetry data. The amount of data sent in a local environment or during development is minuscule compared to a production environment. With millions or even billions of users interacting with your applications, the work of ingesting and sending telemetry data on top of the core application functions can become resource-intensive. Thus, offloading the collection, processing, and exporting of telemetry data to a backend such as Elastic using the vendor-agnostic OTel Collector enables your applications to perform more efficiently, leading to a better customer experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-1-microservices-flowchart.png" alt="1 microservices flowchart" /></p>
<h3>Advantages of using the OpenTelemetry Collector</h3>
<p>For cloud-native and microservices-based applications, the OpenTelemetry Collector provides the flexibility to handle multiple data formats and, more importantly, offloads the resources required from the application to manage telemetry data. The result: reduced application overhead and ease of management as the telemetry configuration can now be managed in one place.</p>
<p>The OTel Collector is the most common configuration because the OTel Collector is used:</p>
<ul>
<li>To enrich the telemetry data with additional context information — for example, on Kubernetes, the OTel Collector would take the responsibility to enrich all the telemetry with the corresponding K8s pod and node information (labels, pod-name, etc.)</li>
<li>To provide uniform and consistent processing or transform telemetry data in a central place (i.e., OTel Collector) rather than take on the burden of syncing configuration across hundreds of services to ensure consistent processing</li>
<li>To aggregate metrics across multiple instances of a service, which is only doable on the OTel Collector (not within individual SDKs/agents)</li>
</ul>
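<p>The last point — cross-instance aggregation — is easy to picture: the Collector sees metric points from every instance of a service and can sum them before export, which no single SDK instance can do. A stdlib sketch with made-up data:</p>

```python
from collections import defaultdict

# Made-up counter points as (service, instance, value); in practice these
# would arrive at the Collector from separate pods.
points = [
    ("checkout", "pod-a", 120),
    ("checkout", "pod-b", 80),
    ("cart", "pod-c", 40),
]

# Sum the counter per service across all of its instances.
totals = defaultdict(int)
for service, _instance, value in points:
    totals[service] += value
```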
<p>Key features of the OpenTelemetry Collector include:</p>
<ul>
<li><strong>Simple setup:</strong> The <a href="https://opentelemetry.io/docs/collector/getting-started/">setup documentation</a> is clear and comprehensive. We also have an example setup using Elastic and the OTel Collector documented from <a href="https://www.elastic.co/blog/opentelemetry-observability">this blog</a>.</li>
<li><strong>Flexibility:</strong> The OTel Collector offers many configuration options and allows you to easily integrate into your existing <a href="https://www.elastic.co/observability">observability solution</a>. However, <a href="https://opentelemetry.io/docs/collector/distributions/">OpenTelemetry’s pre-built distributions</a> allow you to start quickly and build the features that you need. <a href="https://github.com/bshetti/opentelemetry-microservices-demo/blob/main/deploy-with-collector-k8s/otelcollector.yaml">Here</a> as well as below is an example of the code that we used to build our collector for an application running on Kubernetes.</li>
</ul>
<pre><code class="language-yaml">---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otelcollector
spec:
  selector:
    matchLabels:
      app: otelcollector
  template:
    metadata:
      labels:
        app: otelcollector
    spec:
      serviceAccountName: default
      terminationGracePeriodSeconds: 5
      containers:
        - command:
            - &quot;/otelcol&quot;
            - &quot;--config=/conf/otel-collector-config.yaml&quot;
          image: otel/opentelemetry-collector:0.61.0
          name: otelcollector
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
</code></pre>
<ul>
<li><strong>Collect host metrics:</strong> Using the OTel Collector allows you to capture infrastructure metrics, including CPU, RAM, storage capacity, and more. This means you won’t need to install a separate infrastructure agent to collect host metrics. An example OTel configuration for ingesting host metrics is below.</li>
</ul>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    scrapers:
      cpu:
      disk:
</code></pre>
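<p>A receiver on its own does nothing until it is wired into a pipeline with an exporter. The following sketch is illustrative and untested — the endpoint and API key are placeholders, so check the hostmetrics receiver and OTLP exporter documentation for the exact fields your Collector version supports:</p>

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  otlp:
    endpoint: "https://your-elastic-otlp-endpoint:443" # placeholder
    headers:
      # Placeholder credentials; use your backend's auth scheme.
      Authorization: "ApiKey YOUR_API_KEY"

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
```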
<ul>
<li><strong>Security:</strong> The OTel Collector operates in a secure manner by default. It can filter out sensitive information based on your configuration. OpenTelemetry provides <a href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md">these security guidelines</a> to ensure your security needs are met.</li>
<li><strong>Tail-based sampling for distributed tracing:</strong> With OpenTelemetry, you can specify the sampling strategy you would like to use for capturing traces. Tail-based sampling is available by default with the OTel Collector. With tail-based sampling, you control and thereby reduce the amount of trace data collected. More importantly, you capture the most relevant traces, enabling you to spot issues within your microservices applications much faster.</li>
</ul>
<h2>What about logs?</h2>
<p>OpenTelemetry’s approach to ingesting metrics and traces is a “clean-sheet design.” OTel developed a new API for metrics and traces and implementations for multiple languages. For logs, on the other hand, due to the broad adoption and existence of legacy log solutions and libraries, support from OTel is the least mature.</p>
<p>Today, OpenTelemetry’s solution for logs is to provide integration hooks to existing solutions. Longer term though, OpenTelemetry aims to incorporate context aggregation with logs thus easing logging correlation with metrics and traces. <a href="https://opentelemetry.io/docs/specs/otel/logs/#opentelemetry-solution">Learn more about OpenTelemetry’s vision</a>.</p>
<p>Elastic has written up its recommendations in the following article: <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a>. Here is a brief summary of what Elastic recommends:</p>
<ol>
<li>
<p>Output logs from your service (alongside traces and metrics) using an embedded <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation library</a> to Elastic via the OTLP protocol.</p>
</li>
<li>
<p>Write logs from your service to a file scraped by the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, which then forwards them to Elastic via the OTLP protocol.</p>
</li>
<li>
<p>Write logs from your service to a file scraped by <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> (or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>), which then forwards them to Elastic via an Elastic-defined protocol.</p>
</li>
</ol>
<p>The third approach, where logs are scraped by Elastic Agent, is the recommended one, as it relies on Elastic's widely adopted and proven method for capturing logs from applications and services. The first two approaches, although both use OTel instrumentation, are not yet mature and aren't ready for production-level applications.</p>
<p>Get more details about the three approaches in this <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">Elastic blog</a> which includes a deep-dive discussion with hands-on implementation, architecture, advantages, and disadvantages.</p>
<h2>It’s not all sunshine and roses</h2>
<p>OpenTelemetry is definitely beneficial to obtaining observability for modern cloud-native distributed applications. Having a standardized framework for ingesting telemetry reduces operational expenses and allows the organization to focus more on application innovation. Even with all the advantages of using OTel, there are some limitations that you should be aware of as well.</p>
<p>But first, here are the advantages of using OpenTelemetry:</p>
<ul>
<li><strong>Standardized instrumentation:</strong> Having a consistent method for instrumenting systems up and down the stack gives organizations more operational efficiency and cost-effective observability.</li>
<li><strong>Auto-instrumentation:</strong> OTel gives organizations the ability to auto-instrument popular libraries and frameworks, enabling them to get up and running quickly with minimal changes to the codebase.</li>
<li><strong>Vendor neutrality:</strong> Organizations don’t have to be tied to one vendor for their observability needs. In fact, they can use several at once, whether to evaluate a new backend or to pursue a best-of-breed approach.</li>
<li><strong>Future-proof instrumentation:</strong> Since OpenTelemetry is open source with a vast ecosystem of support, your organization will be using technology that is continually improved and can scale and grow with the business.</li>
</ul>
<p>There are some limitations as well:</p>
<ul>
<li>Instrumenting with OTel is a fork-lift upgrade. Organizations must be aware that time and effort need to be invested to migrate proprietary instrumentation to OpenTelemetry.</li>
<li>The <a href="https://opentelemetry.io/docs/instrumentation/">language SDKs</a> are at different maturity levels, so applications relying on alpha, beta, or experimental functionality may not see the full benefits in the short term.</li>
</ul>
<p>Over time, the disadvantages will be reduced, especially as the maturity level of the functional components improves. Check the <a href="https://opentelemetry.io/status/">OpenTelemetry status page</a> for updates on the status of the language SDKs, the collector, and overall specifications.</p>
<h2>Using Elastic and migrating to OpenTelemetry at your speed</h2>
<p>Transitioning to OpenTelemetry is a challenge for most organizations, as it requires retooling existing proprietary APM agents on almost all applications. This can be daunting, but OpenTelemetry agents provide a mechanism, known as auto-instrumentation, that avoids modifying the source code. With auto-instrumentation, the only code change is removing the proprietary APM agent code. You should also ensure you have an <a href="https://www.elastic.co/blog/opentelemetry-observability">observability tool that natively supports OTel</a> without the need for additional agents, such as <a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic Observability</a>.</p>
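<p>For a Python service, auto-instrumentation can look like the following sketch. The package names and the <code>opentelemetry-instrument</code> wrapper come from the OpenTelemetry Python distribution; the endpoint and token are placeholders you would replace with your own APM Server values.</p>

```shell
# Install the OTel Python distro and detect/install instrumentations
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the unmodified application; no source changes required.
# Endpoint and token below are placeholders for your deployment.
OTEL_EXPORTER_OTLP_ENDPOINT="https://my-apm-server.example.com:8200" \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <secret-token>" \
opentelemetry-instrument --service_name my-service python app.py
```

<p>Everything telemetry-related is configured via flags and environment variables, which is what makes ripping out a proprietary agent the only code change.</p>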
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic recently donated Elastic Common Schema (ECS) in its entirety to OTel</a>. The goal was to help OTel reach a standardized logging format. ECS, developed by the Elastic community over the past few years, gives OTel a path to a more mature logging solution.</p>
<p>Elastic provides native OTel support. You can send OTel telemetry directly into Elastic Observability without a collector or the processing normally performed in one.</p>
<p>Here are the configuration options in Elastic for OpenTelemetry:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-2-otel-config-options.png" alt="" /></p>
<p>Most of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you can also use Elastic’s powerful machine learning capabilities to speed up analysis, and its alerting to help reduce MTTR.</p>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">status of its major functional components</a> (metrics, traces, and logs) still varies by language. Applications written in Java, Python, and JavaScript are therefore good places to start, as their metrics and traces (and, for Java, logs) are stable.</p>
<p>For languages that are not yet supported, you can instrument services with Elastic Agents instead, running your observability platform in mixed mode (Elastic Agents alongside OpenTelemetry agents).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-3-services.png" alt="services" /></p>
<p>We ran a variation of our standard Elastic Agent application with one service, newsletter-otel, flipped to OTel. Each of the remaining services can be converted to OTel as development resources allow.</p>
<p>As a result, you can take advantage of the benefits of OpenTelemetry, which include:</p>
<ul>
<li><strong>Standardization:</strong> OpenTelemetry provides a standard approach to telemetry collection, enabling consistency of processes and easier integration of different components.</li>
<li><strong>Vendor-agnostic:</strong> Since OpenTelemetry is open source, it is designed to be vendor-agnostic, allowing DevOps and SRE teams to work with other monitoring and observability backends, reducing vendor lock-in.</li>
<li><strong>Flexibility and extensibility:</strong> With its flexible architecture and inherent design for extensibility, OpenTelemetry enables teams to create custom instrumentation and enrich their own telemetry data.</li>
<li><strong>Community and support:</strong> OpenTelemetry has a growing community of contributors and adopters. In fact, Elastic contributed to developing a common schema for metrics, logs, traces, and security events. Learn more <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">here</a>.</li>
</ul>
<p>Once the other languages reach a stable state, you can then continue your migration to OpenTelemetry agents.</p>
<h2>Summary</h2>
<p>OpenTelemetry has become the de facto standard for ingesting metrics, traces, and logs from cloud-native applications. It provides a vendor-agnostic framework for collecting telemetry data, enabling you to use the observability backend of your choice.</p>
<p>Auto-instrumentation using OpenTelemetry is the fastest way for you to ingest your telemetry data and is an optimal way to get started with OTel. However, using manual instrumentation provides more flexibility, so it is often the next step in gaining deeper insights from your telemetry data.</p>
<p><a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> in Elastic works whether you ingest data directly or through the OTel Collector. For local development, sending data directly to your observability backend is a great way to get started; for production workloads, however, the OTel Collector is recommended. The collector handles data ingestion and processing, letting your applications focus on functionality rather than telemetry tasks.</p>
<p>Logging functionality is still at a nascent stage with OpenTelemetry, while ingesting metrics and traces is well established. For logs, if you’ve started down the OTel path, you can send your logs to Elastic using the OTLP protocol. Since Elastic has a very mature logging solution, a better approach would be to use an Elastic Agent to ingest logs.</p>
<p>Although the long-term benefits are clear, organizations need to be aware that adopting OpenTelemetry means owning their own instrumentation. Thus, appropriate resources and effort need to be incorporated into the development lifecycle. Over time, however, OpenTelemetry brings standardization to telemetry data ingestion, offering organizations vendor choice, scalability, flexibility, and future-proofing of investments.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/ecs-otel-announcement-3.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Best Practices for Log Management: Leveraging Logs for Faster Problem Resolution]]></title>
            <link>https://www.elastic.co/observability-labs/blog/best-practices-logging</link>
            <guid isPermaLink="false">best-practices-logging</guid>
            <pubDate>Wed, 11 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore effective log management strategies to improve system reliability and performance. Learn about data collection, processing, analysis, and cost-effective management of logs in complex software environments.]]></description>
            <content:encoded><![CDATA[<p>In today's rapid software development landscape, efficient log management is crucial for maintaining system reliability and performance. With expanding and complex infrastructure and application components, the responsibilities of operations and development teams are ever-growing and multifaceted. This blog post outlines best practices for effective log management, addressing the challenges of growing data volumes, complex infrastructures, and the need for quick problem resolution.</p>
<h2>Understanding Logs and Their Importance</h2>
<p>Logs are records of events occurring within your infrastructure, typically including a timestamp, a message detailing the event, and metadata identifying the source. They are invaluable for diagnosing issues, providing early warnings, and speeding up problem resolution. Logs are often the primary signal that developers enable, offering significant detail for debugging, performance analysis, security, and compliance management.</p>
<h2>The Logging Journey</h2>
<p>The logging journey involves three basic steps: collection and ingestion, processing and enrichment, and analysis and rationalization. Let's explore each step in detail, covering some of the best practices for each section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/blog-elastic-collection-and-ingest.png" alt="Logging Journey" /></p>
<h3>1. Log Collection and Ingestion</h3>
<h4>Collect Everything Relevant and Actionable</h4>
<p>The first step is to collect all logs into a central location. This involves identifying all your applications and systems and collecting their logs. Comprehensive data collection ensures no critical information is missed, providing a complete picture of your system's behavior. In the event of an incident, having all logs in one place can significantly reduce the time to resolution. It's generally better to collect more data than you need, as you can always filter out irrelevant information later and delete logs as soon as they are no longer needed.</p>
<h4>Leverage Integrations</h4>
<p>Elastic provides over 300 integrations that simplify data onboarding. These integrations not only collect data but also come with dashboards, saved searches, and pipelines to parse the data. Utilizing these integrations can significantly reduce manual effort and ensure data consistency.</p>
<h4>Consider Ingestion Capacity and Costs</h4>
<p>An important aspect of log collection is ensuring you have sufficient ingestion capacity at a manageable cost. When assessing solutions, be cautious about those that charge significantly more for high cardinality data, as this can lead to unexpectedly high costs in observability solutions. We'll talk more about cost effective log management later in this post.</p>
<h4>Use Kafka for Large Projects</h4>
<p>For larger organizations, implementing Kafka can improve log data management. Kafka acts as a buffer, making the system more reliable and easier to manage. It allows different teams to send data to a centralized location, which can then be ingested into Elastic.</p>
<h3>2. Processing and Enrichment</h3>
<h4>Adopt Elastic Common Schema (ECS)</h4>
<p>One key aspect of log collection is normalizing as much as possible across all of your applications and infrastructure; having a common semantic schema is crucial. Elastic contributed Elastic Common Schema (ECS) to OpenTelemetry (OTel), helping accelerate the adoption of OTel-based observability and security. This move towards a more normalized way to define and ingest logs (as well as metrics and traces) is beneficial for the industry.</p>
<p>Using ECS helps standardize field names and data structures, making data analysis and correlation easier. This common schema ensures your data is organized predictably, facilitating more efficient querying and reporting. Learn more about ECS <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">here</a>.</p>
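<p>As a minimal illustration of that normalization, the sketch below renames hypothetical application-specific field names to their ECS equivalents. The input field names are invented; the ECS target names are real.</p>

```python
# Map custom application log fields onto ECS field names so that
# data from different services lines up in the same indices.
ECS_FIELD_MAP = {
    "ts": "@timestamp",
    "severity": "log.level",
    "msg": "message",
    "hostname": "host.name",
    "client_ip": "source.ip",
}

def to_ecs(raw: dict) -> dict:
    """Rename known fields to ECS; keep unknown fields under 'labels'."""
    out = {}
    for key, value in raw.items():
        if key in ECS_FIELD_MAP:
            out[ECS_FIELD_MAP[key]] = value
        else:
            out.setdefault("labels", {})[key] = value
    return out

record = {"ts": "2024-09-11T12:00:00Z", "severity": "error",
          "msg": "payment failed", "client_ip": "10.0.0.7",
          "region": "us-east-1"}
```

<p>Keeping unmapped fields under <code>labels</code> mirrors the common ECS convention of preserving custom metadata without polluting the top-level namespace.</p>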
<h4>Optimize Mappings for High Volume Data</h4>
<p>For high cardinality fields or those rarely used, consider optimizing or removing them from the index. This can improve performance by reducing the amount of data that needs to be indexed and searched. Our documentation has sections to tune your setup for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html">disk usage</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">search speed</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">indexing speed</a>.</p>
<h4>Managing Structured vs. Unstructured Logs</h4>
<p>Structured logs are generally preferable as they offer more value and are easier to work with. They have a predefined format and fields, simplifying information extraction and analysis. For custom logs without pre-built integrations, you may need to define your own parsing rules.</p>
<p>For unstructured logs, full-text search capabilities can help mitigate limitations. By indexing logs, full-text search allows users to search for specific keywords or phrases efficiently, even within large volumes of unstructured data. This is one of the main differentiators of Elastic's observability solution. You can simply search for any keyword or phrase and get results in real-time, without needing to write complex regular expressions or parsing rules at query time.</p>
<h4>Schema-on-Read vs. Schema-on-Write</h4>
<p>There are two main approaches to processing log data:</p>
<ol>
<li>
<p>Schema-on-read: Some observability dashboarding capabilities can perform runtime transformations to extract fields from non-parsed sources on the fly. This is helpful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large volumes of data.</p>
</li>
<li>
<p>Schema-on-write: This approach offers better performance and more control over the data. The schema is defined upfront, and the data is structured and validated at the time of writing. This allows for faster processing and analysis of the data, which is beneficial for enrichment.</p>
</li>
</ol>
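<p>A minimal schema-on-write sketch: parse each raw line once at ingest time, so queries later run against ready-made fields. The log line format and regular expression below are hypothetical.</p>

```python
import re

# A hypothetical single-line log format: timestamp, level, service,
# quoted message, and a status code.
LINE = '2024-09-11T12:00:01Z ERROR checkout "null pointer" status=500'

PATTERN = re.compile(
    r'(?P<timestamp>\S+)\s+(?P<level>\w+)\s+(?P<service>\S+)\s+'
    r'"(?P<message>[^"]*)"\s+status=(?P<status>\d+)'
)

def parse(line: str) -> dict:
    """Structure a line at write time; keep unparsed lines as raw text."""
    match = PATTERN.match(line)
    if match is None:
        return {"message": line}   # still searchable via full-text search
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    return doc
```

<p>Paying the parsing cost once at write time is what makes later queries and aggregations on fields like <code>status</code> fast, which is the core trade-off against schema-on-read.</p>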
<h3>3. Analysis and Rationalization</h3>
<h4>Full-Text Search</h4>
<p>Elastic's full-text search capabilities, powered by Elasticsearch, allow you to quickly find relevant logs. The Kibana Query Language (KQL) enhances search efficiency, enabling you to filter and drill down into the data to identify issues rapidly.</p>
<p>Here are a few examples of KQL queries:</p>
<pre><code>// Filter documents where a field exists
http.request.method: *

// Filter documents that match a specific value
http.request.method: GET

// Search all fields for a specific value
Hello

// Filter documents where a text field contains specific terms
http.request.body.content: &quot;null pointer&quot;

// Filter documents within a range
http.response.bytes &lt; 10000

// Combine range queries
http.response.bytes &gt; 10000 and http.response.bytes &lt;= 20000

// Use wildcards to match patterns
http.response.status_code: 4*

// Negate a query
not http.request.method: GET

// Combine multiple queries with AND/OR
http.request.method: GET and http.response.status_code: 400
</code></pre>
<h4>Machine Learning Integration</h4>
<p>Machine learning can automate the detection of anomalies and patterns within your log data. Elastic offers features like log rate analysis that automatically identify deviations from normal behavior. By leveraging machine learning, you can proactively address potential issues before they escalate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/screenshot-machine-learning-smv-anomaly.png" alt="Machine Learning" /></p>
<p>Organizations should employ a diverse set of machine learning algorithms and techniques to effectively uncover unknown-unknowns in log files. Unsupervised machine learning algorithms should be used for anomaly detection on real-time data, with rate-controlled alerting based on severity.</p>
<p>By automatically identifying influencers, users can gain valuable context for automated root cause analysis (RCA). Log pattern analysis brings categorization to unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data.</p>
<p>Take a look at the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-overview.html">documentation</a> to get started with machine learning in Elastic.</p>
<h4>Dashboarding and Alerting</h4>
<p>Building dashboards and setting up alerting helps you monitor your logs in real-time. Dashboards provide a visual representation of your logs, making it easier to identify patterns and anomalies. Alerting can notify you when specific events occur, allowing you to take action quickly.</p>
<h2>Cost-Effective Log Management</h2>
<h3>Use Data Tiers</h3>
<p>Implementing index lifecycle management to move data across hot, warm, cold, and frozen tiers can significantly reduce storage costs. This approach ensures that only the most frequently accessed data is stored on expensive, high-performance storage, while older data is moved to more cost-effective storage solutions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/ilm.png" alt="ILM" /></p>
<p>Our documentation explains how to set up <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html">Index Lifecycle Management</a>.</p>
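<p>For illustration, a lifecycle policy of the kind you would send to the <code>_ilm/policy</code> API might look like the sketch below. The rollover sizes and phase timings are assumptions to be tuned per workload, not recommendations.</p>

```python
import json

# Illustrative ILM policy body (e.g. for PUT _ilm/policy/logs-policy).
# Timings and sizes below are placeholders; adjust to your retention needs.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb",
                                 "max_age": "7d"}
                }
            },
            "warm": {"min_age": "7d",
                     "actions": {"shrink": {"number_of_shards": 1}}},
            "cold": {"min_age": "30d", "actions": {}},
            "frozen": {"min_age": "60d",
                       "actions": {"searchable_snapshot":
                                   {"snapshot_repository": "my-repo"}}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```

<p>Each phase moves data to progressively cheaper storage until the delete phase removes it, which is exactly the cost curve described above.</p>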
<h3>Compression and Index Sorting</h3>
<p>Applying best compression settings and using index sorting can further reduce the data footprint. Optimizing the way data is stored on disk can lead to substantial savings in storage costs and improve retrieval performance. As of 8.15, Elasticsearch provides an indexing mode called &quot;logsdb&quot;. This is a highly optimized way of storing log data. This new way of indexing data uses 2.5 times less disk space than the default mode. You can read more about it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/logs-data-stream.html">here</a>. This mode automatically applies the best combination of settings for compression, index sorting, and other optimizations that weren't accessible to users before.</p>
<h3>Snapshot Lifecycle Management (SLM)</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/slm.png" alt="SLM" /></p>
<p>SLM allows you to back up your data and delete it from the main cluster, freeing up resources. If needed, data can be restored quickly for analysis, ensuring that you maintain the ability to investigate historical events without incurring high storage costs.</p>
<p>Learn more about SLM in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management.html">documentation</a>.</p>
<h3>Dealing with Large Amounts of Log Data</h3>
<p>Managing large volumes of log data can be challenging. Here are some strategies to optimize log management:</p>
<ol>
<li>Develop a logs deletion policy. Evaluate what data to collect and when to delete it.</li>
<li>Consider discarding DEBUG logs or even INFO logs earlier, and delete dev and staging environment logs sooner.</li>
<li>Aggregate short windows of identical log lines, which is especially useful for TCP security event logging.</li>
<li>For applications and code you control, consider moving some logs into traces to reduce log volume while maintaining detailed information.</li>
</ol>
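<p>The aggregation idea in step 3 can be sketched as follows; the batch-based windowing and the repeated-count message format are simplifications for illustration.</p>

```python
from collections import Counter

def aggregate(lines):
    """Collapse identical log lines within a window into one line plus a
    repeat count, cutting volume for chatty sources such as TCP security
    event logs. Here the 'window' is simply a pre-collected batch."""
    out = []
    for line, n in Counter(lines).items():
        out.append(line if n == 1 else f"{line} (repeated {n}x)")
    return out

batch = [
    "TCP connection denied from 10.0.0.5",
    "TCP connection denied from 10.0.0.5",
    "TCP connection denied from 10.0.0.5",
    "disk usage at 81%",
]
```

<p>A production shipper would key on a normalized message (timestamps stripped) and flush on a timer, but the volume reduction principle is the same.</p>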
<h3>Centralized vs. Decentralized Log Storage</h3>
<p>Data locality is an important consideration when managing log data. The costs of ingressing and egressing large amounts of log data can be prohibitively high, especially when dealing with cloud providers.</p>
<p>In the absence of regional redundancy requirements, your organization may not need to send all log data to a central location. Consider keeping log data local to the datacenter where it was generated to reduce ingress and egress costs.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html">Cross-cluster search</a> functionality enables users to search across multiple logging clusters simultaneously, reducing the amount of data that needs to be transferred over the network.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-ccr.html">Cross-cluster replication</a> is useful for maintaining business continuity in the event of a disaster, ensuring data availability even during an outage in one datacenter.</p>
<h2>Monitoring and Performance</h2>
<h3>Monitor Your Log Management System</h3>
<p>Using a dedicated monitoring cluster can help you track the performance of your Elastic deployment. <a href="https://www.elastic.co/guide/en/kibana/current/xpack-monitoring.html">Stack monitoring</a> provides metrics on search and indexing activity, helping you identify and resolve performance bottlenecks.</p>
<h3>Adjust Bulk Size and Refresh Interval</h3>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">Optimizing these settings</a> can balance performance and resource usage. Increasing bulk size and refresh interval can improve indexing efficiency, especially for high-throughput environments.</p>
<h2>Logging Best Practices</h2>
<h3>Adjust Log Levels</h3>
<p>Ensure that log levels are appropriately set for all applications. Customize log formats to facilitate easier ingestion and analysis. Properly configured log levels can reduce noise and make it easier to identify critical issues.</p>
<h3>Use Modern Logging Frameworks</h3>
<p>Implement logging frameworks that support structured logging. Adding metadata to logs enhances their usefulness for analysis. Structured logging formats, such as JSON, allow logs to be easily parsed and queried, improving the efficiency of log analysis.
If you fully control the application and are already using structured logging, consider using <a href="https://github.com/elastic/ecs-logging">Elastic's version of these libraries</a>, which can automatically parse logs into ECS fields.</p>
<h3>Leverage APM and Metrics</h3>
<p>For custom-built applications, Application Performance Monitoring (APM) provides deeper insights into application performance, complementing traditional logging. APM tracks transactions across services, helping you understand dependencies and identify performance bottlenecks.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/apm.png" alt="APM" /></p>
<p>Consider collecting metrics alongside logs. Metrics can provide insights into your system's performance, such as CPU usage, memory usage, and network traffic. If you're already collecting logs from your systems, adding metrics collection is usually a quick process.</p>
<p>Traces can provide even deeper insights into specific transactions or request paths, especially in cloud-native environments. They offer more contextual information and excel at tracking dependencies across services. However, implementing tracing is only possible for applications you own, and not all developers have fully embraced it yet.</p>
<p>A combined logging and tracing strategy is recommended, where traces provide coverage for newer instrumented apps, and logging supports legacy applications and systems you don't own the source code for.</p>
<h2>Conclusion</h2>
<p>Effective log management is essential for maintaining system reliability and performance in today's complex software environments. By following these best practices, you can optimize your log management process, reduce costs, and improve problem resolution times.</p>
<p>Key takeaways include:</p>
<ul>
<li>Ensure comprehensive log collection with a focus on normalization and common schemas.</li>
<li>Use appropriate processing and enrichment techniques, balancing between structured and unstructured logs.</li>
<li>Leverage full-text search and machine learning for efficient log analysis.</li>
<li>Implement cost-effective storage strategies and smart data retention policies.</li>
<li>Enhance your logging strategy with APM, metrics, and traces for a complete observability solution.</li>
</ul>
<p>Continuously evaluate and adjust your strategies to keep pace with the growing volume and complexity of log data, and you'll be well-equipped to ensure the reliability, performance, and security of your applications and infrastructure.</p>
<p>Check out our other blogs:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">Build better Service Level Objectives (SLOs) from logs and metrics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic">AWS VPC Flow log analysis with GenAI in Elastic</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch">Migrating 1 billion log lines from OpenSearch to Elasticsearch</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes">Pruning incoming log volumes with Elastic</a></li>
</ul>
<p>Ready to get started? Use Elastic Observability on Elastic Cloud — the hosted Elasticsearch service that includes all of the latest features.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/best-practices-log-management.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Revolutionizing big data management: Unveiling the power of Amazon EMR and Elastic integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/big-data-management-amazon-emr-elastic-integration</link>
            <guid isPermaLink="false">big-data-management-amazon-emr-elastic-integration</guid>
            <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Amazon EMR allows you to easily run and scale big data workloads. With Elastic’s native integration, you'll find the confidence to monitor, analyze, and optimize your EMR clusters, opening up exciting opportunities for your data-driven initiatives.]]></description>
<content:encoded><![CDATA[<p>In the dynamic realm of data processing, Amazon EMR takes center stage as an AWS-provided big data service, offering a cost-effective conduit for running Apache Spark and a plethora of other open-source applications. While the capabilities of EMR are impressive, vigilant monitoring holds the key to unlocking its full potential. This blog post explains the pivotal role of monitoring Amazon EMR clusters, highlighting the integration with Elastic<sup>®</sup>.</p>
<p>Elastic can make it easier for organizations to transform data into actionable insights and stop threats quickly with unified visibility across your environment — so mission-critical applications can keep running smoothly no matter what. From a free trial and fast deployment to sending logs to Elastic securely and frictionlessly, all you need to do is point and click to capture, store, and search data from your AWS services.</p>
<h2>Monitoring EMR via Elastic Observability</h2>
<p>In this article, we will delve into the following key aspects:</p>
<ul>
<li><strong>Enabling EMR cluster metrics for Elastic integration:</strong> Learn the intricacies of configuring an EMR cluster to emit metrics that Elastic can effectively extract, paving the way for insightful analysis.</li>
<li><strong>Harnessing Kibana<sup>®</sup> dashboards for EMR workload analysis:</strong> Discover the potential of utilizing Kibana dashboards to dissect metrics related to an EMR workload. By gaining a deeper understanding, we open the doors to optimization opportunities.</li>
</ul>
<h3>Key benefits of AWS EMR integration</h3>
<ul>
<li><strong>Comprehensive monitoring:</strong> Monitor the health and performance of your EMR clusters in real time. Track metrics related to cluster status and utilization, node status, IO, and many others, allowing you to identify bottlenecks and optimize your data processing.</li>
<li><strong>Log analysis:</strong> Dive deep into EMR logs with ease. Our integration enables you to collect and analyze logs from your clusters, helping you troubleshoot issues and gain valuable insights.</li>
<li><strong>Cost optimization:</strong> Understand the cost implications of your EMR clusters. By monitoring resource utilization, you can identify opportunities to optimize your cluster configurations and reduce costs.</li>
<li><strong>Alerting and notifications:</strong> Set up custom alerts based on EMR metrics and logs. Receive notifications when performance thresholds are breached, ensuring that you can take action promptly.</li>
<li><strong>Seamless integration:</strong> Our integration is designed for ease of use. Getting started is simple, and you can start monitoring your EMR clusters quickly.</li>
</ul>
<p>Accompanying these discussions is an illustrative solution architecture diagram, providing a visual representation of the intricacies and interactions within the proposed solution.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-1-flowchart-aws-emr.png" alt="1" /></p>
<h2>How to get started</h2>
<p>Getting started with AWS EMR integration in Observability is easy. Here's a quick overview of the steps:</p>
<h3>Prerequisites and configurations</h3>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>
<p>You will need an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack and agent. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS EMR logging and analysis.</p>
</li>
<li>
<p>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</p>
</li>
<li>
<p>Finally, be sure to turn on EMR monitoring for the EMR cluster when you deploy the cluster.</p>
</li>
</ol>
<h3>Step 1: Create an account with Elastic</h3>
<p><a href="https://cloud.elastic.co/registration?fromURI=/home">Create an account on Elastic Cloud</a> by following the steps provided.</p>
<h3>Step 2: Add integration</h3>
<ol>
<li>Log in to your <a href="https://cloud.elastic.co/registration">Elastic Cloud on AWS</a> deployment.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-2-free-trial.png" alt="2 free trial" /></p>
<ol start="2">
<li>Click on <strong>Add Integration</strong>. You will be navigated to a catalog of supported integrations.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-3-welcome-home.png" alt="3 welcome home" /></p>
<ol start="3">
<li>Search and select <strong>Amazon EMR</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-4-integrations.png" alt="4 integrations" /></p>
<h3>Step 3: Configure integration</h3>
<ol>
<li>
<p>Click on the <strong>Add Amazon EMR</strong> button and provide the required details.</p>
</li>
<li>
<p>Provide the required access credentials to connect to your EMR instance.</p>
</li>
<li>
<p>You can choose to collect EMR metrics, EMR logs via S3, or EMR logs via CloudWatch.</p>
</li>
<li>
<p>Click on the <strong>Save and continue</strong> button at the bottom of the page.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-5-amazon-emr.png" alt="5 amazon emr" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-6-add-amazon-emr.png" alt="6 add amazon emr integration" /></p>
<h3>Step 4: Analyze and monitor</h3>
<p>Explore the data using the out-of-the-box dashboards available for the integration. Select <strong>Discover</strong> from the Elastic Cloud top-level menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-7-manage-deployment.png" alt="7 manage deployment" /></p>
<p>Or, create custom dashboards, set up alerts, and gain actionable insights into your EMR clusters' performance.</p>
<p>This integration streamlines the collection of vital metrics and logs, including Cluster Status, Node Status, IO, and Cluster Capacity. Some metrics gathered include:</p>
<ul>
<li><strong>IsIdle:</strong> Indicates that a cluster is no longer performing work, but is still alive and accruing charges</li>
<li><strong>ContainerAllocated:</strong> The number of resource containers allocated by the ResourceManager</li>
<li><strong>ContainerReserved:</strong> The number of containers reserved</li>
<li><strong>CoreNodesRunning:</strong> The number of core nodes working</li>
<li><strong>CoreNodesPending:</strong> The number of core nodes waiting to be assigned</li>
<li><strong>MRActiveNodes:</strong> The number of nodes presently running MapReduce tasks or jobs</li>
<li><strong>MRLostNodes:</strong> The number of nodes allocated to MapReduce that have been marked in a LOST state</li>
<li><strong>HDFSUtilization:</strong> The percentage of HDFS storage currently used</li>
<li><strong>HDFSBytesRead/Written:</strong> The number of bytes read/written from HDFS (This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.)</li>
<li><strong>TotalUnitsRequested/TotalNodesRequested/TotalVCPURequested:</strong> The target total number of units/nodes/vCPUs in a cluster as determined by managed scaling</li>
</ul>
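<p>As an illustration of the analysis these metrics enable, the sketch below queries the ingested data from the Dev Tools console to find clusters that are sitting idle. The index pattern and field names (<code>metrics-aws.emr-*</code>, <code>aws.emr.metrics.IsIdle.avg</code>, <code>aws.dimensions.JobFlowId</code>) follow the usual layout of the Elastic AWS integrations but are assumptions here; check your own documents in Discover for the exact names before using this.</p>
<pre><code>GET metrics-aws.emr-*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: {
    &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-1h&quot; } }
  },
  &quot;aggs&quot;: {
    &quot;clusters&quot;: {
      &quot;terms&quot;: { &quot;field&quot;: &quot;aws.dimensions.JobFlowId&quot; },
      &quot;aggs&quot;: {
        &quot;idle&quot;: { &quot;avg&quot;: { &quot;field&quot;: &quot;aws.emr.metrics.IsIdle.avg&quot; } }
      }
    }
  }
}
</code></pre>
<p>A cluster whose <code>idle</code> average stays at 1 over the window is alive but doing no work, and is still accruing charges: a natural candidate for an alert.</p>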
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-8-pie-graphs.png" alt="8 pie graph" /></p>
<h2>Conclusion</h2>
<p>Elastic is committed to fulfilling all your observability requirements, offering an effortless experience. Our integrations are designed to simplify the process of ingesting telemetry data, granting you convenient access to critical information for monitoring, analytics, and observability. The native AWS EMR integration underscores our dedication to delivering seamless solutions for your data needs. With this integration, you'll find the confidence to monitor, analyze, and optimize your EMR clusters, opening up exciting opportunities for your data-driven initiatives.</p>
<h2>Start a free trial today</h2>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da&amp;sc_channel=el&amp;ultron=gobig&amp;hulk=regpage&amp;blade=elasticweb&amp;gambit=mp-b">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/21-cubes.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from Cloudwatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs against giving up resources budgeted otherwise for workloads to be hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, after being turned on, these logs are orchestrated to write to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud providers. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default. Hence, it is important to choose selectively how much logging is done so that the organization’s audit requirements are met without excess. This is configured in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is submitted to the <code>kube-apiserver</code>. Not all flavors of cloud-provider-hosted Kubernetes clusters allow you to work with the <code>kube-apiserver</code> directly, however. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be enabled only through the control plane.</p>
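<p>For illustration, a minimal audit policy file looks like the sketch below. It records secret-related operations at <code>Metadata</code> level (so secret values are never captured) and everything else with full request and response bodies. On EKS you do not submit this file yourself; the control plane applies its own policy when you enable the audit log type.</p>
<pre><code>apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Never record secret payloads, only who touched which secret and when
  - level: Metadata
    resources:
      - group: &quot;&quot;
        resources: [&quot;secrets&quot;]
  # Record full request and response bodies for everything else
  - level: RequestResponse
</code></pre>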
<p><strong>In this blog we will be using Amazon Elastic Kubernetes Service (Amazon EKS), with the Kubernetes audit logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret named “empty-secret”, created by an admin user on EKS, is logged on AWS CloudWatch in the following format:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
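<p>Unwrapped from its CloudWatch envelope, the <code>message</code> field of such an entry is a standard <code>audit.k8s.io/v1</code> Event. A trimmed, illustrative example (names, IDs, and IPs are made up) looks like this:</p>
<pre><code>{
  &quot;kind&quot;: &quot;Event&quot;,
  &quot;apiVersion&quot;: &quot;audit.k8s.io/v1&quot;,
  &quot;level&quot;: &quot;Metadata&quot;,
  &quot;auditID&quot;: &quot;a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d&quot;,
  &quot;stage&quot;: &quot;ResponseComplete&quot;,
  &quot;verb&quot;: &quot;create&quot;,
  &quot;user&quot;: { &quot;username&quot;: &quot;kubernetes-admin&quot; },
  &quot;sourceIPs&quot;: [&quot;203.0.113.10&quot;],
  &quot;objectRef&quot;: {
    &quot;resource&quot;: &quot;secrets&quot;,
    &quot;namespace&quot;: &quot;default&quot;,
    &quot;name&quot;: &quot;empty-secret&quot;
  },
  &quot;responseStatus&quot;: { &quot;code&quot;: 201 }
}
</code></pre>
<p>It is this nested JSON that we need to lift into the <code>kubernetes.audit</code> field for the Kubernetes integration to parse.</p>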
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being written to CloudWatch, let’s discuss how to get them ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch, but using this integration by default ingests the CloudWatch JSON as-is; that is, the real audit log JSON stays nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
<p>Elasticsearch has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS mappings designed for parsing Kubernetes audit logs</a>, already implemented in the Kubernetes integration, for the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>What we’re going to do is:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module (in our case, AWS CloudWatch, since this is where the logs reside). We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read the logs from CloudWatch. <strong>Note:</strong> there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this to follow best practices of isolation and composability)</p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>kubernetes.audit_logs-default</code>, which in turn applies all the proper mappings and ingest pipelines for the incoming message</p>
</li>
<li>
<p>The overall flow will be</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create an AWS Custom Logs (CloudWatch) integration:</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log group ARN, add Tags, and enable Preserve original event if you want to, then Save this integration and exit the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to extend what the generic managed pipeline does. We find the name for the custom pipeline by looking at the managed pipeline that is created as an asset when the AWS Custom Logs integration is installed. In this case, we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the two requests below. Here, we extract the <code>message</code> field from the CloudWatch JSON and put the value in a field called <code>kubernetes.audit</code>. Then, we reroute the message to the default Kubernetes audit dataset that comes with the Kubernetes integration</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    &quot;processors&quot;: [
      {
        &quot;pipeline&quot;: {
          &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
          &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
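<p>Before sending live traffic through these pipelines, you can dry-run them with the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/simulate-pipeline-api.html">simulate pipeline API</a>. The sketch below feeds a minimal, made-up document through the routing pipeline; inspect the result to confirm that <code>message</code> is replaced by a parsed <code>kubernetes.audit</code> object (the final <code>reroute</code> only takes full effect during real ingestion).</p>
<pre><code>POST _ingest/pipeline/logs-aws_logs.generic@custom/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;{\&quot;kind\&quot;:\&quot;Event\&quot;,\&quot;apiVersion\&quot;:\&quot;audit.k8s.io/v1\&quot;,\&quot;verb\&quot;:\&quot;create\&quot;,\&quot;user\&quot;:{\&quot;username\&quot;:\&quot;kubernetes-admin\&quot;}}&quot;
      }
    }
  ]
}
</code></pre>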
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> that writes to the pipeline called <code>logs-kubernetes.audit_logs-1.62.2</code> by default</p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic works against the field <code>kubernetes.audit</code>. This is why the json processor in the above code snippet creates a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which has the format <code>logs-&lt;datasetname&gt;-version</code>)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
<h3>3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will use Elastic Agent and enroll using Fleet and the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>, and for this exercise we will deploy using Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which provides the ability to see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the AWS Custom Logs (CloudWatch) integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream and apply all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Centrally Managing OTel Collectors with Elastic Agent and Fleet]]></title>
            <link>https://www.elastic.co/observability-labs/blog/centrally-managed-otel-collectors-with-elastic-fleet</link>
            <guid isPermaLink="false">centrally-managed-otel-collectors-with-elastic-fleet</guid>
            <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[How Elastic Agent 9.3 unifies Beats and OpenTelemetry (OTel) data collection and delivers central management with Elastic Fleet.]]></description>
            <content:encoded><![CDATA[<p>&quot;The dream of OpenTelemetry is vendor-neutral, standardised observability.
The challenge nobody mentions is how you operate hundreds, or thousands, of those collectors in production.&quot;</p>
<p>OpenTelemetry has won the hearts of the industry.
Adoption is accelerating: the CNCF's 2024 Observability survey found OTel to be the fastest-growing project in the foundation's history, with the OTel Collector registering hundreds of millions of downloads.
The proposition is compelling: write instrumentation once, ship it anywhere, avoid lock-in.</p>
<p>But here is what every platform team discovers once they cross into production: the collector sprawl problem.
Hundreds of collector instances deployed across regions, Kubernetes namespaces, and bare-metal hosts. Configuration drift creeping in.
An upgrade that has to be co-ordinated across a fleet of independent processes. A security patch that someone has to manually roll out to each one.
And zero visibility into which collectors are running, healthy, or stuck.</p>
<p>This is the gap between &quot;deploying OpenTelemetry&quot; and &quot;operating OpenTelemetry at scale.&quot;
With Elastic 9.3, Elastic Agent closes that gap entirely.
The Elastic Agent is now built on Elastic's Distribution of the OpenTelemetry Collector (EDOT) and, when managed by Fleet, gives platform teams a single control plane for configuring, updating, and monitoring every OTel collector in their estate — all while remaining compatible with the Beats-based integrations they already rely on.</p>
<h2>The Collector Sprawl Problem and Why It Matters</h2>
<p>OpenTelemetry's success has created a quiet operational debt for many organisations.
Individual teams adopt the collector for their services: logs here, metrics there, a custom pipeline for the new microservice.
Without a centralised management layer, each of these collectors becomes an independent snowflake: its own config file, its own upgrade cycle, its own failure domain.</p>
<p>The consequences are predictable.
Configuration drift means collectors running different versions of the same pipeline, producing subtly incompatible data.
Compliance teams ask &quot;show me all the places data is collected and where it goes&quot;, and the honest answer is a spreadsheet that's already out of date.</p>
<p>This isn't a niche problem.
A Gartner analysis of enterprise observability programmes consistently identifies operational overhead as the top barrier to expanding OTel adoption beyond initial pilots.
The technology works. The tooling to manage it at scale is what's been missing.</p>
<h2>How Elastic Agent Became an OTel Collector</h2>
<p>To understand the significance of this, it helps to understand what Elastic Agent used to be, and what it is now.</p>
<p>Before version 9.3, Elastic Agent acted as a supervisor process: it managed a collection of separate Beats sub-processes (Filebeat, Metricbeat, Winlogbeat and so on), each running its own input/output lifecycle and each consuming its own memory footprint.
The agent coordinated them, but the fundamental model was a collection of discrete daemons running under a parent.</p>
<p>With 9.3, that model has been replaced.
Elastic Agent is now itself an instance of the EDOT Collector: Elastic's hardened, production-supported distribution of the upstream OTel Collector.
The architectural shift has three important consequences.</p>
<p><strong>First</strong>, the process model simplifies dramatically.
Instead of a supervisor managing multiple sub-process lifecycles, there is a single EDOT Collector process.
This means a smaller memory footprint, fewer things that can fail independently, and fewer processes to observe for health and performance.</p>
<p><strong>Second</strong>, Beats functionality is preserved, not discarded.
Rather than forcing a breaking migration, Elastic has introduced <em>Beats Receivers</em>: beat inputs and processors re-packaged as native OTel receiver components.
A Filestream input is enabled by a <code>filebeatreceiver</code>.
The same Filebeat configuration YAML you write today is automatically translated into the corresponding EDOT receiver configuration at runtime.
Existing integrations, dashboards, and ingest pipelines continue to work without modification.</p>
<p><strong>Third</strong>, the agent is now a first-class participant in the OTel ecosystem.
It speaks OTLP natively, it runs standard OTel receivers, and it can be configured to sit alongside any other OTel-compatible tool in a modern observability pipeline.</p>
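<p>To make the Beats Receivers point concrete, here is a heavily simplified sketch of what such a collector configuration looks like: a Beats-derived <code>filebeatreceiver</code> and a native <code>otlp</code> receiver living side by side in one OTel pipeline definition. The exact keys, paths, and endpoint are illustrative assumptions; in practice, Fleet generates the real configuration for you from the agent policy.</p>
<pre><code>receivers:
  filebeatreceiver:
    filebeat:
      inputs:
        - type: filestream        # same input config you would write for Filebeat
          id: app-logs
          paths:
            - /var/log/app/*.log
    output:
      otelconsumer: {}            # hand events to the OTel pipeline
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  elasticsearch:
    endpoint: https://my-deployment.es.example.com:443   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [filebeatreceiver, otlp]
      exporters: [elasticsearch]
</code></pre>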
<h2>Central Management with Fleet: Configuration, Lifecycle, and Visibility</h2>
<p>The architectural shift above would be valuable on its own. But it becomes transformative when combined with Elastic Fleet, the centralised management plane for Elastic Agents.</p>
<p>Fleet gives platform and SRE teams a single console from which to manage every Elastic Agent (and by extension, every EDOT Collector instance) in their estate.
The capabilities break into three categories: configuration management, lifecycle management, and fleet-wide observability.</p>
<h3>Configuration management at scale</h3>
<p>With Fleet, you define an <em>Agent Policy</em> — a declarative description of what a collector should do.
What data should it collect?
Via which receivers?
Where should it export?
The policy is authored once in Fleet's UI (or via its API), and pushed automatically to every agent enrolled in that policy.
Change the policy, and every affected collector receives the update.
No SSH.
No Ansible playbook to maintain.
No configuration drift.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/policy-health.jpg" alt="Fleet Policy Health" /></p>
<p>Fleet pushes policies to enrolled agents across any environment. Agents send heartbeat and health data back, giving a live inventory of every collector in the estate.</p>
<h3>Lifecycle management: upgrades, enrolment, and remediation</h3>
<p>Perhaps the most operationally significant benefit of Fleet management is lifecycle control.
With Fleet, upgrading a collector is a policy action: select the target version, select the scope (all agents, a specific policy group, a canary subset), and click.
Fleet orchestrates the rolling upgrade, tracking status per agent and surfacing failures immediately.</p>
<p>This changes the security calculus fundamentally.
When a vulnerability is disclosed in the OTel Collector binary, patching is a Fleet operation measured in minutes, not a change-management ceremony measured in days across SSH sessions to individual hosts.</p>
<p>Fleet also handles enrolment and de-enrolment.
New hosts added to your infrastructure can be auto-enrolled into the appropriate policy based on tags or deployment tooling.
Agents on decommissioned hosts can be removed from Fleet's inventory, ensuring your observability map reflects your actual infrastructure.</p>
<h3>Fleet-wide observability of your collectors</h3>
<p>Every Fleet-managed Elastic Agent ships monitoring telemetry about itself: CPU and memory consumption, event throughput, error rates, pipeline latency.
This data flows into Elastic and is surfaced in the Fleet UI, giving you a live dashboard of every collector in your estate, not just the ones you happen to be watching.</p>
<p>For the first time, &quot;how healthy is my observability pipeline?&quot; becomes a question with a real-time, fleet-wide answer.
You can identify agents that have stopped sending data, agents consuming unexpectedly high resources, and agents that have fallen behind on queue processing — before those problems surface as gaps in your monitoring data.</p>
<p>In the near future, this capability will be extended to standalone (non-Fleet-managed) agents and to third-party OTel collectors provided by other vendors.
These collectors can be configured via some other means, yet still be monitored in Fleet, covering both resource consumption and component pipeline health.</p>
<h2>The Hybrid Agent: Beats Data and OTel Data, Simultaneously</h2>
<p>One of the most practically significant capabilities introduced in 9.3 is what Elastic calls the <em>Hybrid Agent</em>: an Elastic Agent that can run both Beats-based receivers and native OTel receivers in the same pipeline, at the same time.
This does not change anything for existing installations.</p>
<p>This matters enormously for real-world adoption. Most organisations arriving at OTel in 2025 and 2026 are not starting from a blank slate.
They have years of investment in Beats-based integrations: Filebeat-powered log collection, Metricbeat-powered host metrics, bespoke ingest pipelines in Elasticsearch that normalise and enrich that data into ECS (Elastic Common Schema) format.
The business value locked in those integrations (the dashboards, the alerts, the correlation logic) is not something they can afford to throw away in order to &quot;go OTel.&quot;</p>
<p>The Hybrid Agent solves this by making the two worlds coexist.
For example, in a single agent policy you can simultaneously configure:</p>
<ul>
<li>A <code>filebeatreceiver</code> collecting application logs in ECS format, routed through your existing ingest pipeline to its existing data stream</li>
<li>A native OTel <code>filelog</code> receiver collecting OTel-native telemetry from your new services instrumented with the OTel SDK, stored in OTel-native data streams without touching ingest pipelines</li>
<li>An OTel <code>hostmetrics</code> receiver collecting system metrics in semantic convention format alongside your existing Metricbeat-derived system metrics</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/hybrid-agent.jpg" alt="Hybrid Agent" /></p>
<p>The two lanes are independent.
Beats-receiver data travels through ingest pipelines and lands in ECS-formatted data streams, exactly as it always has.
Native OTel data follows OTel semantic conventions and is stored directly in OTel-native data streams, bypassing ingest pipelines.
Your existing dashboards and alerts continue to work. Your new OTel-native workloads get the full OTel experience.
The same agent, the same Fleet policy, the same management console.</p>
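<p>Under the hood, the two lanes map naturally onto separate named pipelines in the collector's service graph. A sketch, with illustrative receiver and exporter names:</p>
<pre><code>service:
  pipelines:
    logs/ecs:                   # Beats lane: ECS documents, via ingest pipelines
      receivers: [filebeatreceiver]
      exporters: [elasticsearch/ecs]
    logs/otel:                  # OTel-native lane: semantic conventions, no ingest pipelines
      receivers: [filelog, otlp]
      exporters: [elasticsearch/otel]
</code></pre>
<p>Because the lanes never share a pipeline, neither can interfere with the other's format or routing.</p>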
<p>This co-existence is the practical answer to the question every platform team eventually faces: &quot;We want to adopt OTel properly but we can't break what we already have.&quot;
The Hybrid Agent lets you migrate incrementally, service by service, on your timeline.</p>
<h2>The Integration Catalogue: Turning Configuration into a One-Click Operation</h2>
<p>Configuration management at scale is only as good as the configurations themselves.
Elastic's integration catalogue — over 500 packages covering everything from NGINX and PostgreSQL to AWS CloudTrail and Kubernetes — extends naturally to the Hybrid Agent model.</p>
<p>From 9.3 onwards, the catalogue includes <em>OTel integration packages</em> alongside the existing Beats-based ones. Each OTel package contains two components:</p>
<ul>
<li>An <em>Input package</em>: the configuration for the corresponding OTel receiver (receivers, processors, pipeline wiring), ready to be applied to a Hybrid Agent policy</li>
<li>A <em>Content package</em>: the assets associated with the application: pre-built dashboards, alerts, index templates, and saved queries, all calibrated for OTel semantic convention data</li>
</ul>
<p>When an operator adds an OTel integration to an Agent Policy in Fleet, the receiver configuration is pushed to all enrolled agents.
When those agents start ingesting data and it arrives in Elasticsearch, the content package assets are automatically installed based on metadata in the data received.
The dashboard is ready before you've had time to wonder where it is.</p>
<p>The same policy can hold both OTel integrations and legacy Beats integrations.
A real-world agent policy might simultaneously collect system metrics via the OTel <code>hostmetrics</code> receiver, application logs via the <code>filebeatreceiver</code>, and APM data via OTLP — all from one policy, all managed from Fleet, all visible in a unified Kibana experience.</p>
<p>A technical walkthrough of how this is done for NGINX data collection can be found <a href="https://www.elastic.co/observability-labs/blog/hybrid-elastic-agent-opentelemetry-integration">here</a> for reference.
Management of Elastic Agents is currently done via the existing Fleet protocols; in the near future this will move to OpAMP, so that Fleet can also manage third-party OTel Collectors.</p>
<p>For organisations on platforms not yet in Elastic's OS support matrix, third-party OTel Collectors (such as Red Hat's OpenShift-native collector) can send data to Elastic using the OTLP exporter and be observed alongside all other collectors in their fleet.</p>
<h2>What This Means in Practice: A Migration Story</h2>
<p>Consider a mid-sized platform team operating 200 Linux hosts across three regions, currently running Elastic Agent 8.x with a mix of Filebeat and Metricbeat integrations.
Their new services are being instrumented with the OTel SDK and they want to standardise on OTel going forward without disrupting the monitoring coverage they already have.</p>
<p>With a Fleet-managed upgrade to 9.3, their existing agents become Hybrid Agents automatically.
Their Filebeat and Metricbeat configurations are internally translated to Beats receiver configurations and continue to run unmodified.
Their existing dashboards still populate. Their ingest pipelines still fire. Nothing breaks.</p>
<p>They then add OTel integration packages to their Fleet policies for each new service. The OTel-instrumented microservices start sending OTLP data, received by native OTel receivers in the same agents.
OTel-native dashboards appear automatically in Kibana. They now have both data universes in one place, managed from one console, visible in one interface.</p>
<p>Over the following quarters, as Beats-based integrations for their remaining services are superseded by OTel equivalents in the catalogue, they migrate them one by one, updating the Agent Policy in Fleet and watching the transition happen across all 200 hosts simultaneously, without touching a single one directly.</p>
<h2>Looking Forward</h2>
<p>Elastic has made a clear architectural bet: OpenTelemetry is the future of observability data collection, and the right response to that future is not to build a parallel OTel tool alongside the existing stack — it is to evolve the existing stack into OTel.
The Hybrid Agent and EDOT Collector are the result of that bet.</p>
<p>Fleet central management is the operational layer that makes that bet practical at scale.
OpenTelemetry gives you standardised, vendor-neutral instrumentation.
Fleet gives you the operational control plane to manage those collectors like the production infrastructure they are, not like artisanal YAML files scattered across your estate.</p>
<p>The collector sprawl problem is solvable.
The answer is a managed, policy-driven, centrally observable fleet of EDOT Collectors, and in Elastic 9.3, that answer is production-ready today.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Collecting JMX metrics with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/collecting-jmx-metrics-opentelemetry</link>
            <guid isPermaLink="false">collecting-jmx-metrics-opentelemetry</guid>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to collect Tomcat JMX metrics with OpenTelemetry using the Java agent or jmx-scraper, then extend coverage with custom YAML rules and validate output.]]></description>
            <content:encoded><![CDATA[<p>Java Management Extensions (JMX) is the JVM's built-in management interface, exposing runtime and component metrics such as memory, threads, and request pools. It is useful for collecting operational telemetry from Java services without changing application code.</p>
<p>Collecting JMX metrics with OpenTelemetry can be done in two main ways depending on your environment, requirements and constraints:</p>
<ul>
<li>from inside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Instrumentation Java</a> agent (or <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>)</li>
<li>from outside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper">jmx-scraper</a>.</li>
</ul>
<p>Throughout this article, we will use the term &quot;Java agent&quot; to refer to the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java instrumentation</a> agent. Everything here also applies to Elastic's own distribution (<a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>), which is based on it and provides the same features.</p>
<p>This walkthrough uses a <a href="https://tomcat.apache.org/">Tomcat</a> server as the target and shows how to validate which metrics are emitted with the logging exporter.</p>
<p>The configuration examples in this article use Java system properties passed as <code>-D</code> flags in the JVM startup command; the equivalent environment variables can be used instead.</p>
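<p>The mapping between the two forms follows the standard OpenTelemetry configuration convention: the environment variable name is the system property name uppercased, with dots and dashes replaced by underscores. A tiny sketch of that rule (the helper class and method names are ours, not part of any OTel API):</p>

```java
import java.util.Locale;

// Sketch of the OpenTelemetry naming convention: the environment variable
// equivalent of a system property is the property name uppercased, with
// '.' and '-' replaced by '_'. Helper names here are ours, not an OTel API.
public class OtelPropertyNames {
    static String toEnvVar(String systemProperty) {
        return systemProperty.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("otel.metrics.exporter"));  // OTEL_METRICS_EXPORTER
        System.out.println(toEnvVar("otel.jmx.target.system")); // OTEL_JMX_TARGET_SYSTEM
    }
}
```

<p>For example, <code>-Dotel.jmx.target.system=tomcat</code> and <code>OTEL_JMX_TARGET_SYSTEM=tomcat</code> are equivalent.</p>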
<h2>Prerequisites</h2>
<ul>
<li>A local <a href="https://tomcat.apache.org/">Tomcat</a> install (or any JVM app you can start with custom JVM flags)</li>
<li>Java 8+ on the host (recent Tomcat versions may require a newer Java version)</li>
<li>An OpenTelemetry Collector endpoint if you want to ship metrics beyond local logging</li>
</ul>
<h2>Choosing between the Java agent and jmx-scraper</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/collection_options.png" alt="Java agent vs jmx-scraper" /></p>
<p>Use the Java agent (or EDOT Java) when you can modify JVM startup flags and want in-process collection with full context from the running application: this lets you capture traces, logs, and metrics with a single tool deployment.</p>
<p>Use jmx-scraper when you cannot install an agent on the JVM or prefer out-of-process collection from a separate host. This requires configuring the JVM and the network for remote JMX access, including authentication and credential management.</p>
<p>Both approaches rely on the same JMX metric mappings: use the logging exporter for validation, then switch to OTLP to send metrics to a Collector or any other OTLP endpoint.</p>
<h2>Option 1: Collect JMX metrics inside the JVM with the Java agent</h2>
<p>OpenTelemetry Java instrumentation ships with a curated set of JMX metric mappings. For Tomcat, you just need to enable the Java agent and set <code>otel.jmx.target.system=tomcat</code>.</p>
<h3>Step 1 - Download the OpenTelemetry Java agent</h3>
<p>The agent is downloaded into <code>/opt/otel</code>, but you can choose any location on the host.
Make sure the path is consistent with the <code>-javaagent</code> flag in the next step.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
</code></pre>
<h3>Step 2 - Configure Tomcat with <code>bin/setenv.sh</code></h3>
<p>Create or update <code>bin/setenv.sh</code> so Tomcat launches with the agent and JMX target system enabled.</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.metrics.exporter=otlp,logging \
  -Dotel.jmx.target.system=tomcat&quot;
</code></pre>
<p>This will configure the agent to log metrics (using the <code>logging</code> exporter) in addition to sending them to the Collector.</p>
<h3>Step 3 - Validate the emitted metrics</h3>
<p>Start Tomcat and watch stdout.</p>
<pre><code class="language-bash">./bin/catalina.sh run
</code></pre>
<p>By default, metrics are sampled and exported every minute, so you might have to wait a bit for the first metrics to be logged.
If needed, you can use the <code>otel.metric.export.interval</code> configuration option to adjust the export frequency.</p>
<p>You should see logging exporter output with JVM and Tomcat metrics. Look for lines containing the <code>LoggingMetricExporter</code> class name.</p>
<pre><code class="language-text">INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=tomcat.threadpool.currentThreadsBusy, ...}
INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=jvm.memory.used, ...}
</code></pre>
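<p>If you want to script this validation step, the metric name can be extracted from such log lines with a small helper. Note that the <code>MetricData{name=...}</code> text is a debug format, not a stable contract, so treat this as a local validation aid only:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the metric name from a LoggingMetricExporter line such as:
//   INFO ...LoggingMetricExporter - MetricData{name=jvm.memory.used, ...}
// Returns null for lines that do not contain metric data.
public class MetricLogParser {
    private static final Pattern NAME = Pattern.compile("MetricData\\{name=([^,}]+)");

    static String metricName(String logLine) {
        Matcher m = NAME.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "INFO io.opentelemetry.exporter.logging.LoggingMetricExporter"
                + " - MetricData{name=tomcat.threadpool.currentThreadsBusy, ...}";
        System.out.println(metricName(line)); // tomcat.threadpool.currentThreadsBusy
    }
}
```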
<h3>Step 4 - Send metrics to a Collector</h3>
<p>Once metric capture is validated, you are ready to send metrics to a collector.</p>
<p>You will have to:</p>
<ul>
<li>remove the <code>logging</code> exporter as it's no longer necessary for production</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<p>The <code>bin/setenv.sh</code> file should be modified to look like this:</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.jmx.target.system=tomcat \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<p>When using the Java agent, JVM metrics are automatically captured by the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry"><code>runtime-telemetry</code></a> module, so it is not necessary to include <code>jvm</code> in the <code>otel.jmx.target.system</code> configuration option.</p>
<h2>Option 2: Collect JMX metrics from outside the JVM with jmx-scraper</h2>
<p>When you cannot install an agent in the JVM or if only metrics are required, jmx-scraper lets you query JMX remotely and export metrics to an OTLP endpoint.</p>
<h3>Step 1 - Enable remote JMX on Tomcat</h3>
<p>Add JMX remote options to <code>bin/setenv.sh</code> and create access/password files.</p>
<blockquote>
<p><strong>Warning:</strong> This uses trivial credentials and disables SSL. Do not use this configuration in production.</p>
</blockquote>
<pre><code class="language-bash">mkdir -p /opt/jmx
cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.access
monitorRole readonly
EOF

cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.password
monitorRole monitorPass
EOF

chmod 600 ${CATALINA_HOME}/jmxremote.password

export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.access.file=${CATALINA_HOME}/jmxremote.access \
  -Dcom.sun.management.jmxremote.password.file=${CATALINA_HOME}/jmxremote.password \
  -Djava.rmi.server.hostname=127.0.0.1&quot;
</code></pre>
<h3>Step 2 - Download jmx-scraper</h3>
<p>The jmx-scraper is downloaded into <code>/opt/otel</code>, but you can choose any location on the host.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-jmx-scraper.jar \
  https://github.com/open-telemetry/opentelemetry-java-contrib/releases/latest/download/opentelemetry-jmx-scraper.jar
</code></pre>
<h3>Step 3 - Check the JMX connection</h3>
<p>Run jmx-scraper with the credentials from the previous step to confirm it can reach Tomcat. If the credentials are wrong, you will see authentication errors.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat \
  -test
</code></pre>
<p>You should see one of the following in the standard output:</p>
<ul>
<li><code>JMX connection test OK</code> if the connection and authentication are successful</li>
<li><code>JMX connection test ERROR</code> otherwise</li>
</ul>
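<p>Under the hood, the connection test amounts to opening a JMX connector connection and issuing a query. The following Java sketch illustrates the same mechanism; to stay self-contained it starts an in-process RMI connector server over the platform MBeanServer rather than connecting to the remote Tomcat endpoint. Against a real server you would pass the <code>service:jmx:rmi:///jndi/rmi://host:9010/jmxrmi</code> address and the credentials in the environment map instead:</p>

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

// Minimal JMX "connection test": connect through a JMX connector and read
// one attribute. An in-process connector server stands in for the remote
// Tomcat JVM so the example runs on its own.
public class JmxConnectionTest {
    static Integer readThreadCount() throws Exception {
        MBeanServer platform = ManagementFactory.getPlatformMBeanServer();
        // No host/port given: the RMI connector server picks a local endpoint
        // and encodes it into the address returned by getAddress().
        JMXConnectorServer server = JMXConnectorServerFactory.newJMXConnectorServer(
                new JMXServiceURL("service:jmx:rmi://"), null, platform);
        server.start();
        JMXConnector connector = JMXConnectorFactory.connect(server.getAddress());
        try {
            return (Integer) connector.getMBeanServerConnection()
                    .getAttribute(new ObjectName("java.lang:type=Threading"), "ThreadCount");
        } finally {
            connector.close();
            server.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("JMX connection test OK, live threads: " + readThreadCount());
    }
}
```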
<h3>Step 4 - Validate the emitted metrics</h3>
<p>Using the logging exporter lets you inspect metrics and attributes before sending them to a collector.</p>
<p>To capture both Tomcat and JVM metrics, set <code>otel.jmx.target.system</code> to <code>tomcat,jvm</code>.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.metrics.exporter=logging
</code></pre>
<h3>Step 5 - Send metrics to a Collector</h3>
<p>After validation, to send metrics to an OTLP endpoint, you will have to:</p>
<ul>
<li>remove the <code>-Dotel.metrics.exporter</code> flag to restore the default <code>otlp</code> exporter</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=&quot;Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<h2>Customizing the JMX Metrics Collection</h2>
<p>Once the built-in Tomcat and JVM mappings are flowing, you can add custom rules with <code>otel.jmx.config</code>. Create a YAML file and pass its path alongside <code>otel.jmx.target.system</code>.</p>
<p>For example, the following <code>custom.yaml</code> file captures the <code>custom.jvm.thread.count</code> metric from the <code>java.lang:type=Threading</code> MBean:</p>
<pre><code class="language-yaml">---
rules:
  - bean: &quot;java.lang:type=Threading&quot;
    mapping:
      ThreadCount:
        metric: custom.jvm.thread.count
        type: gauge
        unit: &quot;{thread}&quot;
        desc: Current number of live threads.
</code></pre>
<p>For a complete reference on the configuration format and syntax, refer to the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jmx-metrics">jmx-metrics</a> module in OpenTelemetry Java instrumentation.</p>
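<p>Before wiring a rule into YAML, you can sanity-check the MBean and attribute it targets from a scratch Java program. This sketch reads the same <code>java.lang:type=Threading</code> <code>ThreadCount</code> attribute that the example rule above maps to <code>custom.jvm.thread.count</code>, using the in-process platform MBeanServer:</p>

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Reads the MBean attribute targeted by the custom.yaml rule above,
// directly from the in-process platform MBeanServer.
public class ThreadCountMBean {
    static int liveThreads() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return (Integer) server.getAttribute(
                new ObjectName("java.lang:type=Threading"), "ThreadCount");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ThreadCount = " + liveThreads()); // at least 1 (the main thread)
    }
}
```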
<p>This custom configuration can be used with both jmx-scraper and the Java agent, as both support the <code>otel.jmx.config</code> configuration option. For example, with jmx-scraper:</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.jmx.config=/opt/otel/jmx/custom.yaml
</code></pre>
<p>You can pass multiple custom files as a comma-separated list to <code>otel.jmx.config</code> when you need to organize metrics by team or component.</p>
<h2>Using the JMX Metrics in Kibana</h2>
<p>Once you have collected the JMX metrics using one of the approaches described in this article, you can start using them in Kibana.
You can build custom dashboards and visualizations to explore and analyze the metrics, create custom alerts on top of them or build MCP tools and AI Agents to use them in your agentic workflows.</p>
<p>Here is an example of how you can use the JMX metrics in Kibana through ES|QL:</p>
<pre><code class="language-esql">TS metrics*
| WHERE telemetry.sdk.language == &quot;java&quot;
| WHERE service.name == ?instance
| STATS
    request_rate = SUM(RATE(tomcat.request.count))
  BY Time = BUCKET(@timestamp, 100, ?_tstart, ?_tend)
</code></pre>
<p>You can use the native metric and dimension names of the JMX metrics to build your queries.
With the <code>TS</code> command you get first-class support for time series aggregation functions and dimensions on your metrics.
Queries like these are the building blocks for your dashboards, alerts, workflows, and AI agent tools.</p>
<p>Here is an example of a dashboard that visualizes the typical JMX metrics for Apache Tomcat:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/tomcat_jmx_dashboard.png" alt="Tomcat Dashboard" /></p>
<h2>Conclusion</h2>
<p>In this article, we have seen how to collect JMX metrics with OpenTelemetry using the Java agent or jmx-scraper.
We have also seen how to use the JMX metrics in Kibana through ES|QL to build custom dashboards, alerts, workflows and AI agent tools.</p>
<p>This is just the beginning of what you can do with the JMX metrics and Elastic Observability.
Try it out yourself and explore the full potential of your JMX metrics when combined with powerful features provided by the Elastic Observability platform.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/jmx_header_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation]]></title>
            <link>https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation</link>
            <guid isPermaLink="false">continuous-profiling-distributed-tracing-correlation</guid>
            <pubDate>Thu, 28 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Frustrated by slow traces but unsure where the code bottleneck lies? Elastic Universal Profiling correlates profiling stacktraces with OpenTelemetry (OTel) traces, helping you identify and pinpoint the exact lines of code causing performance issues.]]></description>
            <content:encoded><![CDATA[<p>Observability goes beyond monitoring; it's about truly understanding your system. To achieve this comprehensive view, practitioners need a unified observability solution that natively combines insights from metrics, logs, traces, and crucially, <strong>continuous profiling</strong>. While metrics, logs, and traces offer valuable insights, they can't answer the all-important &quot;why.&quot; Continuous profiling signals act as a magnifying glass, providing granular code visibility into the system's hidden complexities. They fill the gap left by other data sources, enabling you to answer critical questions –– why is this trace slow? Where exactly in the code is the bottleneck residing?</p>
<p>Traces provide the &quot;what&quot; and &quot;where&quot; — what happened and where in your system. Continuous profiling refines this understanding by pinpointing the &quot;why&quot; and validating your hypotheses about the &quot;what.&quot; Just like a full-body MRI scan, Elastic's whole-system continuous profiling (powered by eBPF) uncovers unknown-unknowns in your system. This includes not just your code, but also third-party libraries and kernel activity triggered by your application transactions. This comprehensive visibility improves your mean-time-to-detection (MTTD) and mean-time-to-recovery (MTTR) KPIs.</p>
<p><em>[Related article:</em> <a href="https://www.elastic.co/blog/observability-profiling-metrics-logs-traces"><em>Why metrics, logs, and traces aren’t enough</em></a><em>]</em></p>
<h2>Bridging the disconnect between continuous profiling and OTel traces</h2>
<p>Historically, continuous profiling signals have been largely disconnected from OpenTelemetry (OTel) traces. Here's the exciting news: we're bridging this gap! We're introducing native correlation between continuous profiling signals and OTel traces, starting with Java.</p>
<p>Imagine this: You're troubleshooting a performance issue and identify a slow trace. Whole-system continuous profiling steps in, acting like an MRI scan for your entire codebase and system. It narrows down the culprit to the specific lines of code hogging CPU time within the context of your distributed trace. This empowers you to answer the &quot;why&quot; question with minimal effort and confidence, all within the same troubleshooting context.</p>
<p>Furthermore, by correlating continuous profiling with distributed tracing, Elastic Observability customers can measure the cloud cost and CO<sub>2</sub> impact of every code change at the service and transaction level.</p>
<p>This milestone is significant, especially considering the recent developments in the OTel community. With <a href="https://www.cncf.io/blog/2024/03/19/opentelemetry-announces-support-for-profiling/">OTel adopting profiling</a> and Elastic <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating the industry’s most advanced eBPF-based continuous profiling agent to OTel</a>, we're set for a game-changer in observability — empowering OTel end users with a correlated system visibility that goes from a trace span in the userspace down to the kernel.</p>
<p>Furthermore, achieving this goal, especially with Java, presented significant challenges and demanded serious engineering R&amp;D. This blog post will delve into these challenges, explore the approaches we considered in our proof-of-concepts, and explain how we arrived at a solution that can be easily extended to other OTel language agents. Most importantly, this solution correlates traces with profiling signals at the agent, not in the backend — to ensure optimal query performance and minimal reliance on vendor backend storage architectures.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/trace.png" alt="Profiling flamegraph for a specific trace.id" /></p>
<h2>Figuring out the active OTel trace and span</h2>
<p>The primary technical challenge in this endeavor is essentially the following: whenever the profiler interrupts an OTel instrumented process to capture a stacktrace, we need to be able to efficiently determine the active span and trace ID (per-thread) and the service name (per-process).</p>
<p>For the purpose of this blog, we'll focus on the recently released <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java instrumentation</a>, but the approach that we ended up with generalizes to any language that can load and call into a native library. So, how do we get our hands on those IDs?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/service-popout.png" alt="Profiling correlated with service.name, showing  CO2 and cloud cost impact by line of code." /></p>
<p>The OTel Java agent itself keeps track of the active span by storing a stack of spans in the <a href="https://opentelemetry.io/docs/concepts/context-propagation/#context">OpenTelemetryContext</a>, which itself is stored in a <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html">ThreadLocal</a> variable. We originally considered reading these Java structures directly from BPF, but we eventually decided against that approach. There is no documented specification on how ThreadLocals are implemented, and reliably reading and following the JVM's internal data-structures would incur a high maintenance burden. Any minor update to the JVM could change details of the structure layouts. To add to this, we would also have to reverse engineer how each JVM version lays out Java class fields in memory, as well as how all the high-level Java types used in the context objects are actually implemented under the hood. This approach further wouldn't generalize to any non-JVM language and needs to be repeated for any language that we wish to support.</p>
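<p>To make that bookkeeping concrete, here is a deliberately simplified Java model of per-thread span tracking: a stack of span IDs held in a ThreadLocal. This is our own illustration of the idea, not the actual OTel context implementation, which is immutable and considerably more general:</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model (not OTel's real implementation): each thread keeps its
// own stack of span IDs, so the innermost active span is always on top.
public class ActiveSpanTracker {
    private static final ThreadLocal<Deque<String>> SPANS =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void enter(String spanId) { SPANS.get().push(spanId); }
    static void exit()               { SPANS.get().pop(); }
    static String current()          { return SPANS.get().peek(); } // null if no span is active

    public static void main(String[] args) {
        enter("span-a");
        enter("span-b");               // nested span becomes the active one
        System.out.println(current()); // span-b
        exit();
        System.out.println(current()); // span-a
    }
}
```

<p>The profiler's problem is reading this per-thread state from outside the JVM, at arbitrary interrupt points, without cooperation from the Java code.</p>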
<p>After we had convinced ourselves that reading Java ThreadLocal directly is not the answer, we decided to look for more portable alternatives instead. The option that we ultimately settled with is to load and call into a C++ library that is responsible for making the required information available via a known and defined interface whenever the span changes.</p>
<p>Unlike Java's ThreadLocals, the details of how a native shared library should expose per-process and per-thread data are well-defined in the System V ABI specification and the architecture-specific ELF ABI documents.</p>
<h2>Exposing per-process information</h2>
<p>Exposing per-process data is easy: we simply declare a global variable . . .</p>
<pre><code class="language-cpp">void* elastic_tracecorr_process_storage_v1 = nullptr;
</code></pre>
<p>. . . and expose it via ELF symbols. When the user initializes the OTel library to set the service name, we allocate a buffer and populate it with data in a <a href="https://github.com/elastic/apm/blob/149cd3e39a77a58002344270ed2ad35357bdd02d/specs/agents/universal-profiling-integration.md#process-storage-layout">protocol that we defined for this purpose</a>. Once the buffer is fully populated, we update the global pointer to point to the buffer.</p>
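<p>That &quot;populate fully, then publish the pointer&quot; ordering is the classic safe-publication pattern: a concurrent reader observes either no buffer at all or a complete one, never a half-written buffer. A small Java analogue of the pattern (our own illustration, not the library's code):</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

// Illustration of publish-after-populate: the shared reference is only set
// once the buffer is fully written, so readers never see partial data.
public class ProcessStoragePublisher {
    private static final AtomicReference<byte[]> STORAGE = new AtomicReference<>();

    static void publish(String serviceName) {
        byte[] buffer = serviceName.getBytes(StandardCharsets.UTF_8); // populate first
        STORAGE.set(buffer);                                          // then publish
    }

    static String read() {
        byte[] buffer = STORAGE.get();
        return buffer == null ? null : new String(buffer, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(read()); // null: nothing published yet
        publish("tomcat-demo");
        System.out.println(read()); // tomcat-demo
    }
}
```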
<p>On the profiling agent side, we already have code in place that detects libraries and executables loaded into any process's address space. We normally use this mechanism to detect and analyze high-level language interpreters (e.g., libpython, libjvm) when they are loaded, but it also turned out to be a perfect fit to detect the OTel trace correlation library. When the library is detected in a process, we scan the exports, resolve the symbol, and read the per-process information directly from the instrumented process’ memory.</p>
<h2>Exposing per-thread information</h2>
<p>With the easy part out of the way, let's get to the nitty-gritty portion: exposing per-thread information via thread-local storage (TLS). So, what exactly is TLS, and how does it work? At the most basic level, the idea is to have <strong>one instance of a variable for every thread</strong>. Semantically you can think of it like having a global Map&lt;ThreadID, T&gt;, although that is not how it is implemented.</p>
<p>On Linux, there are two major options for thread locals: TSD and TLS.</p>
<h2>Thread-specific data (TSD)</h2>
<p>TSD is the older and probably more commonly known variant. It works by explicitly allocating a key via pthread_key_create — usually during process startup — and passing it to all threads that require access to the thread-local variable. The threads can then pass that key to the pthread_getspecific and pthread_setspecific functions to read and update the variable for the currently running thread.</p>
<p>TSD is simple, but for our purposes it has a range of drawbacks:</p>
<ul>
<li>
<p>The pthread_key_t structure is opaque and doesn't have a defined layout. Similar to the Java ThreadLocals, the underlying data-structures aren't defined by the ABI documents and different libc implementations (glibc, musl) will handle them differently.</p>
</li>
<li>
<p>We cannot call a function like pthread_getspecific from BPF, so we'd have to reverse engineer and reimplement the logic. Logic may change between libc versions, and we’d have to detect the version and support all variants that may come up in the wild.</p>
</li>
<li>
<p>TSD performance is not predictable and varies depending on how many thread local variables have been allocated in the process previously. This may not be a huge concern for Java specifically since spans are typically not swapped super rapidly, but it’d likely be quite noticeable for user-mode scheduling languages where the context might need to be swapped at every await point/coroutine yield.</p>
</li>
</ul>
<p>None of this is strictly prohibitive, but a lot of this is annoying at the very least. Let’s see if we can do better!</p>
<h2>Thread-local storage (TLS)</h2>
<p>Starting with C11 and C++11, both languages support thread local variables directly via the _Thread_local and thread_local storage specifiers, respectively. Declaring a variable as per-thread is now a matter of simply adding the keyword:</p>
<pre><code class="language-cpp">thread_local void* elastic_tracecorr_tls_v1 = nullptr;
</code></pre>
<p>You might assume that the compiler simply inserts calls to the corresponding pthread function calls when variables declared with this are accessed, but this is not actually the case. The reality is surprisingly complicated, and it turns out that there are four different models of TLS that the compiler can choose to generate. For some of those models, there are further multiple dialects that can be used to implement them. The different models and dialects come with various portability versus performance trade-offs. If you are interested in the details, I suggest reading this <a href="https://maskray.me/blog/2021-02-14-all-about-thread-local-storage">blog article</a> that does a great job at explaining them.</p>
<p>The TLS model and dialect are usually chosen by the compiler based on a somewhat opaque and complicated set of architecture-specific rules. Fortunately for us, both gcc and clang allow users to pick a particular one using the -ftls-model and -mtls-dialect arguments. The variant that we ended up picking for our purposes is -ftls-model=global-dynamic and -mtls-dialect=gnu2 (and desc on aarch64).</p>
<p>Let's take a look at the assembly that is being generated when accessing a thread_local variable under these settings. Our function:</p>
<pre><code class="language-cpp">void setThreadProfilingCorrelationBuffer(JNIEnv* jniEnv, jobject bytebuffer) {
  if (bytebuffer == nullptr) {
    elastic_tracecorr_tls_v1 = nullptr;
  } else {
    elastic_tracecorr_tls_v1 = jniEnv-&gt;GetDirectBufferAddress(bytebuffer);
  }
}
</code></pre>
<p>Is compiled to the following assembly code:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/assembly.png" alt="assembly" /></p>
<p>Both possible branches assign a value to our thread-local variable. Let’s focus on the right branch, corresponding to the nullptr case, to get rid of the noise from the GetDirectBufferAddress function call:</p>
<pre><code class="language-asm">lea   rax, elastic_tracecorr_tls_v1_tlsdesc  ;; Load some pointer into rax.
call  qword ptr [rax]                        ;; Read &amp; call function pointer at rax.
mov   qword ptr fs:[rax], 0                  ;; Assign 0 to the pointer returned by
                                             ;; the function that we just called.
</code></pre>
<p>The fs: portion of the mov instruction is the actual magic bit that makes the memory read per-thread. We’ll get to that later; let’s first look at the mysterious elastic_tracecorr_tls_v1_tlsdesc variable that the compiler emitted here. It’s an instance of the tlsdesc structure that is located somewhere in the .got.plt ELF section. The structure looks like this:</p>
<pre><code class="language-cpp">struct tlsdesc {
  // Function pointer used to retrieve the offset
  uint64_t (*resolver)(tlsdesc*);

  // TLS offset -- more on that later.
  uint64_t tp_offset;
}
</code></pre>
<p>The resolver field is initialized with nullptr and tp_offset with a per-executable offset. The first thread-local variable in an executable will usually have offset 0, the next one sizeof(first_var), and so on. At first glance this may appear to be similar to how TSD works, with the call to pthread_getspecific to resolve the actual offset, but there is a crucial difference. When the library is loaded, the resolver field is filled in with the address of __tls_get_addr by the loader (ld.so). __tls_get_addr is a relatively heavy function that allocates a TLS offset that is globally unique between all shared libraries in the process. It then proceeds by updating the tlsdesc structure itself, inserting the global offset and replacing the resolver function with a trivial one:</p>
<pre><code class="language-cpp">void* second_stage_resolver(tlsdesc* desc) {
  return (void*)desc-&gt;tp_offset;
}
</code></pre>
<p>In essence, this means that the first access to a tlsdesc based thread-local variable is rather expensive, but all subsequent ones are cheap. We further know that by the time that our C++ library starts publishing per-thread data, it must have gone through the initial resolving process already. Consequently, all that we need to do is to read the final offset from the process's memory and memorize it. We also refresh the offset every now and then to ensure that we really have the final offset, combating the unlikely but possible race condition that we read the offset before it was initialized. We can detect this case by comparing the resolver address against the address of the __tls_get_addr function exported by ld.so.</p>
<h2>Determining the TLS offset from an external process</h2>
<p>With that out of the way, the next question that arises is how to actually find the tlsdesc in memory so that we can read the offset. Intuitively one might expect that the dynamic symbol exported on the ELF file points to that descriptor, but that is not actually the case.</p>
<pre><code class="language-bash">$ readelf --wide --dyn-syms elastic-jvmti-linux-x64.so | grep elastic_tracecorr_tls_v1
328: 0000000000000000     8 TLS     GLOBAL DEFAULT   19 elastic_tracecorr_tls_v1
</code></pre>
<p>The dynamic symbol instead contains an offset relative to the start of the .tls ELF section and points to the initial value that libc initializes the TLS value with when it is allocated. So how does ld.so find the tlsdesc to fill in the initial resolver? In addition to the dynamic symbol, the compiler also emits a relocation record for our symbol, and that one actually points to the descriptor structure that we are looking for.</p>
<pre><code class="language-bash">$ readelf --relocs --wide elastic-jvmti-linux-x64.so | grep R_X86_64_TLSDESC
00000000000426e8  0000014800000024 R_X86_64_TLSDESC   0000000000000000 elastic_tracecorr_tls_v1 + 0
</code></pre>
<p>To read the final TLS offset, we thus simply have to:</p>
<ul>
<li>
<p>Wait for the event notifying us about a new shared library being loaded into a process</p>
</li>
<li>
<p>Do some cheap heuristics to detect our C++ library, avoiding the more expensive analysis below from being executed for every unrelated library on the system</p>
</li>
<li>
<p>Analyze the library on disk and scan ELF relocations for our per-thread variable to extract the tlsdesc address</p>
</li>
<li>
<p>Rebase that address to match where our library was loaded in that particular process</p>
</li>
<li>
<p>Read the offset from tlsdesc+8</p>
</li>
</ul>
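<p>Assuming Python purely for illustration, the relocation-scanning and rebasing steps above can be sketched as follows. The helper names are ours, and a real implementation would first locate the .rela.dyn section via the ELF headers; here we only decode the 24-byte ELF64 RELA entry format and compute the rebased descriptor address:</p>

```python
import struct

R_X86_64_TLSDESC = 36  # relocation type 0x24, as in the readelf output above

def parse_rela_entry(blob: bytes):
    """Decode one ELF64 RELA entry (24 bytes: r_offset, r_info, r_addend).
    r_info packs the symbol table index (high 32 bits) and type (low 32 bits)."""
    r_offset, r_info, r_addend = struct.unpack("<QQq", blob)
    return r_offset, r_info >> 32, r_info & 0xFFFFFFFF, r_addend

def tlsdesc_addr(load_bias: int, r_offset: int) -> int:
    # Rebase the link-time offset to where the library is actually mapped
    # in the target process (the load bias comes from /proc/&lt;pid&gt;/maps).
    return load_bias + r_offset

# The relocation shown earlier: r_offset 0x426e8, r_info 0x0000014800000024.
entry = struct.pack("<QQq", 0x426E8, 0x0000014800000024, 0)
r_offset, sym_index, reloc_type, _ = parse_rela_entry(entry)
assert reloc_type == R_X86_64_TLSDESC
assert sym_index == 328  # matches dynamic symbol index 328 from readelf

# The final TLS offset would then be read from the running process at
# tlsdesc_addr(bias, r_offset) + 8, i.e. the tp_offset field of the tlsdesc.
```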
<h2>Determining the TLS base</h2>
<p>Now that we have the offset, how do we use that to actually read the data that the library puts there for us? This brings us back to the magic fs: portion of the mov instruction that we discussed earlier. In X86, most memory operands can optionally be supplied with a segment register that influences the address translation.</p>
<p>Segments are an archaic construct from the early days of 16-bit X86, where they were used to extend the address space. Essentially, the architecture provides a range of segment registers that can be configured with different base addresses, thus allowing more than 16 bits’ worth of memory to be addressed. In the era of 64-bit processors, this is hardly a concern anymore. In fact, X86-64 aka AMD64 got rid of all but two of those segment registers: fs and gs.</p>
<p>So why keep two of them? It turns out that they are quite useful for the use-case of thread-local data. Since every thread can be configured to have its own base address in these segment registers, we can use it to point to a block of data for this specific thread. That is precisely what libc implementations on Linux are doing with the fs segment. The offset that we snatched from the process's memory earlier is used as an address with the fs segment register, and the CPU automatically adds it to the per-thread base address.</p>
<p>To retrieve the base address pointed to by the fs segment register in the kernel, we need to read its destination from the kernel’s task_struct for the thread that we happened to interrupt with our profiling timer event. Getting the task struct is easy because we are blessed with the bpf_get_current_task BPF helper function. BPF helpers are pretty much syscalls for BPF programs: we can just ask the Linux kernel to hand us the pointer.</p>
<p>Armed with the task pointer, we now have to read the thread.fsbase (X86-64) or thread.uw.tp_value (aarch64) field to get our desired base address that the user-mode process accesses via fs. This is where things get complicated one last time, at least if we wish to support older kernels without <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html">BTF support</a> (we do!). The <a href="https://github.com/torvalds/linux/blob/259f7d5e2baf87fcbb4fabc46526c9c47fed1914/include/linux/sched.h#L748">task_struct is huge</a> and there are hundreds of fields that can be present or not depending on how the kernel is configured. Being a core primitive of the scheduler, it is also constantly subject to changes between different kernel versions. On modern Linux distributions, the kernel is typically nice enough to tell us the offset via BTF. On older ones, the situation is more complicated. Since hardcoding the offset is clearly not an option if we hope the code to be portable, we instead have to figure out the offset by ourselves.</p>
<p>We do this by consulting /proc/kallsyms, a file with mappings between kernel functions and their addresses, and then using BPF to dump the compiled code of a kernel function that rarely changes and uses the desired offset. We dynamically disassemble and analyze the function and extract the offset directly from the assembly. For X86-64 specifically, we dump the <a href="https://elixir.bootlin.com/linux/v5.9.16/source/arch/x86/kernel/hw_breakpoint.c#L452">aout_dump_debugregs</a> function that accesses thread-&gt;ptrace_bps, which has consistently been 16 bytes away from the fsbase field that we are interested in for all kernels that we have ever looked at.</p>
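<p>To make the offset-extraction idea concrete, here is a deliberately tiny sketch in Python (our own illustration, not the agent's actual code, which uses a full disassembler). It recognizes just one encoding, mov r64, [r64 + disp32], which is the shape of a struct-field load such as the thread-&gt;ptrace_bps access mentioned above, and yields the 32-bit displacement:</p>

```python
import struct

def find_disp32_loads(code: bytes):
    """Scan x86-64 machine code for 'mov r64, [r64 + disp32]' and yield each
    32-bit displacement. Only REX.W (0x48) + opcode 0x8B + ModRM with
    mod == 0b10 is recognized; rm == 0b100 (SIB byte follows) is skipped."""
    i = 0
    while i + 7 <= len(code):
        rex, opcode, modrm = code[i], code[i + 1], code[i + 2]
        if rex == 0x48 and opcode == 0x8B and modrm >> 6 == 0b10 and modrm & 7 != 4:
            yield struct.unpack_from("<i", code, i + 3)[0]
            i += 7
        else:
            i += 1

# 'mov rax, [rdi + 0x1234]' encodes as 48 8b 87 34 12 00 00.
displacements = list(find_disp32_loads(bytes([0x48, 0x8B, 0x87, 0x34, 0x12, 0x00, 0x00])))
assert displacements == [0x1234]
```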
<h2>Reading TLS data from kernel</h2>
<p>With all the required offsets at our hands, we can now finally do what we set out to do in the first place: use them to enrich our stack traces with the OTel trace and span IDs that our C++ library prepared for us!</p>
<pre><code class="language-c">void maybe_add_otel_info(Trace* trace) {
  // Did user-mode insert a TLS offset for this process? Read it.
  TraceCorrProcInfo* proc = bpf_map_lookup_elem(&amp;tracecorr_procs, &amp;trace-&gt;pid);

  // No entry -&gt; process doesn't have the C++ library loaded.
  if (!proc) return;

  // Load the fsbase offset from our global configuration map.
  u32 key = 0;
  SystemConfig* syscfg = bpf_map_lookup_elem(&amp;system_config, &amp;key);

  // The BPF verifier rejects the program unless map lookup results are
  // null-checked before being dereferenced.
  if (!syscfg) return;

  // Read the fsbase offset from the kernel's task struct.
  u8* fsbase;
  u8* task = (u8*)bpf_get_current_task();
  bpf_probe_read_kernel(&amp;fsbase, sizeof(fsbase), task + syscfg-&gt;fsbase_offset);

  // Use the TLS offset to read the **pointer** to our TLS buffer.
  void* corr_buf_ptr;
  bpf_probe_read_user(
    &amp;corr_buf_ptr,
    sizeof(corr_buf_ptr),
    fsbase + proc-&gt;tls_offset
  );

  // Read the information that our library prepared for us.
  TraceCorrelationBuf corr_buf;
  bpf_probe_read_user(&amp;corr_buf, sizeof(corr_buf), corr_buf_ptr);

  // If the library reports that we are currently in a trace, store it into
  // the stack trace that will be reported to our user-land process.
  if (corr_buf.trace_present &amp;&amp; corr_buf.valid) {
    trace-&gt;otel_trace_id.as_int.hi = corr_buf.trace_id.as_int.hi;
    trace-&gt;otel_trace_id.as_int.lo = corr_buf.trace_id.as_int.lo;
    trace-&gt;otel_span_id.as_int = corr_buf.span_id.as_int;
  }
}
</code></pre>
<h2>Sending out the mappings</h2>
<p>From this point on, everything further is pretty simple. The C++ library sets up a unix datagram socket during startup and communicates the socket path to the profiler via the per-process data block. The stacktraces annotated with the OTel trace and span IDs are sent from BPF to our user-mode profiler process via perf event buffers, which in turn sends the mappings between OTel span and trace and stack trace hashes to the C++ library. Our extensions to the OTel instrumentation framework then read those mappings and insert the stack trace hashes into the OTel trace.</p>
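<p>The datagram exchange can be illustrated with a short Python sketch. Note that the message layout here is our own invention for demonstration purposes; the actual wire format between the profiler and the C++ library is not documented in this post:</p>

```python
import socket
import struct

def pack_mapping(trace_id: bytes, span_id: int, stack_hash: int) -> bytes:
    """Pack a (16-byte OTel trace ID, span ID, stack trace hash) mapping."""
    assert len(trace_id) == 16
    return trace_id + struct.pack("<QQ", span_id, stack_hash)

def unpack_mapping(msg: bytes):
    span_id, stack_hash = struct.unpack("<QQ", msg[16:])
    return msg[:16], span_id, stack_hash

# Demonstrate over a socketpair; the real library binds a named socket
# path and tells the profiler about it via the per-process data block.
profiler, library = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
profiler.send(pack_mapping(b"\x01" * 16, 0xDEAD, 0xBEEF))
received = unpack_mapping(library.recv(64))
assert received == (b"\x01" * 16, 0xDEAD, 0xBEEF)
profiler.close()
library.close()
```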
<p>This approach has a few major upsides compared to the perhaps more obvious alternative of sending out the OTel span and trace ID with the profiler’s stacktrace records. We want the stacktrace associations to be stored in the trace indices to allow filtering and aggregating stacktraces by the plethora of fields available on OTel traces. If we were to send out the trace IDs via the profiler's gRPC connection instead, we’d have to search for and update the corresponding OTel trace records in the profiling collector to insert the stack trace hashes.</p>
<p>This is not trivial: stacktraces are sent out rather frequently (every 5 seconds, as of writing) and the corresponding OTel trace might not have been sent and stored by the time the corresponding stack traces arrive in our cluster. We’d have to build a kind of delay queue and periodically retry updating the OTel trace documents, introducing avoidable database work and complexity in the collectors. With the approach of sending stacktrace mappings to the OTel instrumented process instead, the need for server-side merging vanishes entirely.</p>
<h2>Trace correlation in action</h2>
<p>With all the hard work out of the way, let’s take a look at what trace correlation looks like in action!</p>
&lt;Video vidyardUuid=&quot;JYTzQYeiJ6CK6K3hZ33sz5&quot; /&gt;
<h2>Future work: Supporting other languages</h2>
<p>We have demonstrated that trace correlation can work nicely for Java, but we have no intention of stopping there. The general approach that we discussed previously should work for any language that can efficiently load and call into our C++ library and doesn’t do user-mode scheduling with coroutines. The problem with user-mode scheduling is that the logical thread can change at any await/yield point, requiring us to update the trace IDs in TLS. Many such coroutine environments like Rust’s Tokio provide the ability to register a callback for whenever the active task is swapped, so they can be supported easily. Other languages, however, do not provide that option.</p>
<p>One prominent example in that category is Go: goroutines are built on user-mode scheduling, but to our knowledge there’s no way to instrument the scheduler. Such languages will need solutions that don’t go via the generic TLS path. For Go specifically, we have already built a prototype that uses pprof labels that are associated with a specific Goroutine, having Go’s scheduler update them for us automatically.</p>
<h2>Getting started</h2>
<p>We hope this blog post has given you an overview of correlating profiling signals to distributed tracing, and its benefits for end-users.</p>
<p>To get started, download the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel agent</a>, which contains the new trace correlation library. Additionally, you will need the latest version of Universal Profiling agent, bundled with <a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">Elastic Stack version 8.13</a>.</p>
<h2>Acknowledgment</h2>
<p>We appreciate <a href="https://github.com/trask">Trask Stalnaker</a>, maintainer of the OTel Java agent, for his feedback on our approach and for reviewing the early draft of this blog post.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/Under_highway_bridge.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Continuous profiling: The key to more efficient and cost-effective applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/continuous-profiling-efficient-cost-effective-applications</link>
            <guid isPermaLink="false">continuous-profiling-efficient-cost-effective-applications</guid>
            <pubDate>Fri, 27 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, we discuss why computational efficiency is important and how Elastic Universal Profiling enables your business to use continuous profiling in production environments to make the software that runs your business as efficient as possible.]]></description>
            <content:encoded><![CDATA[<p>Recently, Elastic Universal Profiling<sup>TM</sup> became <a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">generally available</a>. It is the part of our Observability solution that allows users to do <em>whole-system, continuous profiling</em> in production environments. If you're not familiar with continuous profiling, you are probably wondering what Universal Profiling is and why you should care. That's what we will address in this post.</p>
<h2>Efficiency is important (again)</h2>
<p>Before we jump into continuous profiling, let's start with the &quot;Why should I care?&quot; question. To do that, I'd like to talk a bit about efficiency and some large-scale trends happening in our industry that are making efficiency, specifically computational efficiency, important again. I say again because in the past, when memory and storage on a computer were very limited and you had to worry about every byte of code, efficiency was an important aspect of developing software.</p>
<h3>The end of Moore’s Law</h3>
<p>First, the <a href="https://en.wikipedia.org/wiki/Moore's_law">Moore's Law</a> era is drawing to a close. This was inevitable simply due to physical limits of how small you can make a transistor and the connections between them. For a long time, software developers had the luxury of not worrying about complexity and efficiency because the next generation of hardware would mitigate any negative cost or performance impact.</p>
<p><em>If you can't rely on an endless progression of ever faster hardware, you should be interested in computational efficiency.</em></p>
<h3>The move to Software-as-a-Service</h3>
<p>Another trend to consider is the shift from software vendors that sold customers software to run themselves to Software-as-a-Service businesses. A traditional software vendor didn't have to worry too much about the efficiency of their code. That issue largely fell to the customer to address; a new software version might dictate a hardware refresh to the latest and most performant. For a SaaS business, inefficient software usually degrades the customer’s experience and it certainly impacts the bottom line.</p>
<p><em>If you are a SaaS business in a competitive environment, you should be interested in computational efficiency.</em></p>
<h3>Cloud migration</h3>
<p>Next is the ongoing <a href="https://www.elastic.co/observability/cloud-migration">migration to cloud computing</a>. One of the benefits of cloud computing is the ease of scaling, both hardware and software. In the cloud, we are not constrained by the limits of our data centers or the next hardware purchase. Instead, we simply spin up more cloud instances to mitigate performance problems. In addition to infrastructure scalability, microservices architectures, containerization, and the rise of Kubernetes and similar orchestration tools mean that scaling services is simpler than ever. It's not uncommon to have thousands of instances of a service running in a cloud environment. This ease of scaling accounts for another trend, namely that many businesses are dealing with skyrocketing cloud computing costs.</p>
<p><em>If you are a business with ever increasing cloud costs, you should be interested in computational efficiency.</em></p>
<h3>Our changing climate</h3>
<p>Lastly, if none of those reasons pique your interest, let's consider a global problem that all of us should have in mind — namely, climate change. There are many things that need to be addressed to tackle climate change, but with our dependence on software in every part of our society, computational efficiency is certainly something we should be thinking about.</p>
<p>Thomas Dullien, distinguished engineer at Elastic and one of the founders of optimyze.cloud, points out that if you can save 20% on 800 servers, assuming 300W power consumption for each server, that code change is worth 160 metric tons of CO<sub>2</sub> saved per year. That may seem like a drop in the bucket, but if all businesses focus more on computational efficiency, it will make an impact. Also, let's not forget the financial benefits: those 160 metric tons of CO<sub>2</sub> savings also represent a significant annual cost savings.</p>
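<p>As a sanity check of those numbers (with one assumption of our own: a grid carbon intensity of roughly 0.38 kg CO2e per kWh, which the post does not state), the arithmetic works out as follows:</p>

```python
servers = 800
watts_per_server = 300
saving = 0.20
kg_co2_per_kwh = 0.38   # assumed grid carbon intensity, not from the post

kw_saved = servers * watts_per_server * saving / 1000   # 48 kW shaved off
kwh_per_year = kw_saved * 24 * 365                      # ~420,480 kWh per year
tons_co2 = kwh_per_year * kg_co2_per_kwh / 1000         # ~160 metric tons
assert abs(tons_co2 - 160) < 2
```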
<p><em>If you live on planet Earth, you should be interested in computational efficiency.</em></p>
<h2>Performance engineering</h2>
<p>Who's job is it to worry about computational efficiency? Application developers usually pay at least some attention to efficiency as they develop their code. Profiling is a common approach for a developer to understand the performance of their code, and there is an entire portfolio of profiling tools available. Frequently, however, schedule pressures trump time spent on performance analysis and computational efficiency. In addition, performance problems may not become apparent until an application is running at scale in production and interacting (and competing) with everything else in that environment. Many profiling tools are not well suited to use in a production environment because they require code instrumentation and recompilation and add significant overhead.</p>
<p>When inefficient code makes it into production and begins to cause performance problems, the next line of defense is the Operations or SRE team. Their mission is to keep everything humming, and performance problems will certainly draw attention. Observability tools such as APM can shed light on these types of issues and lead the team to a specific application or service, but these tools offer only limited visibility into the full system. Third-party libraries and operating system kernel functions remain hidden without a profiling solution in the production environment.</p>
<p>So, what can these teams do when there is a need to investigate a performance problem in production? That's where continuous profiling comes into the picture.</p>
<h2>Continuous profiling</h2>
<p>Continuous profiling is not a new idea. Google published a <a href="https://research.google/pubs/pub36575/">paper about it</a> in 2010 and began implementing continuous profiling in its environments around that time. Facebook and Netflix followed suit not long afterward.</p>
<p>Typically, continuous profiling tools have been the domain of dedicated performance engineering or operating system engineering teams, which are usually only found at extremely large scale enterprises like the ones mentioned above. The key idea is to run profiling on every server, all of the time. That way, when your observability tools point you to a specific part of an application, but you need a more detailed view into exactly where that application is consuming CPU resources, the profiling data will be there, ready to use.</p>
<p>Another benefit of continuous profiling is that it provides a view of CPU intensive software across your entire environment — whether that is a very CPU intensive function or the aggregate of a relatively small function that is run thousands of times a second in your environment.</p>
<p>While profiling tools are not new, most of them have significant gaps. Let's look at a couple of the most significant ones.</p>
<ul>
<li><strong>Limited visibility.</strong> Modern distributed applications are composed of a complex mix of building blocks, including custom software functions, third-party software libraries, networking software, operating system services, and more and more often, orchestration software such as <a href="https://kubernetes.io/">Kubernetes</a>. To fully understand what is happening in an application, you need visibility into each piece. However, even if a developer has the ability to profile their own code, everything else remains invisible. To make matters worse, most profiling tools require instrumenting the code, which adds overhead, and therefore even your developers’ own code is typically not profiled in production.</li>
<li><strong>Missing symbols in production.</strong> All of these code building blocks typically have descriptive names (some more intuitive than others) so that developers can understand and make sense of them. In a running program, these descriptive names are usually referred to as <strong>symbols</strong>. For a human being to make sense of the execution of a running application, these names are very important. Unfortunately, software running in production almost always has these human-readable symbols stripped away for space efficiency, since they are not needed by the CPU executing the software. Without the symbols, it is much more difficult to understand the full picture of what's happening in the application. To illustrate this, think of the last time you were in an SMS chat on your mobile device and you only had some of the people in the chat group in your address book, while the rest simply appeared as phone numbers — this makes it very hard to tell who is saying what.</li>
</ul>
<h2>Elastic Universal Profiling: Continuous profiling for all</h2>
<p>Our goal is to allow any business, large or small, to make computational efficiency a core consideration for all of the software that they run. Universal Profiling imposes very low overhead on your servers, so it can be used in production, and it provides visibility into everything running on every machine. It opens up the possibility of seeing the financial unit cost and CO<sub>2</sub> impact of every line of code running on every system in your business. How do we do that?</p>
<h3>Whole-system visibility — SIMPLE</h3>
<p>Universal Profiling is based on <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF</a>, which means that it imposes very low overhead (our goal is less than 1% CPU and less than 250MB of RAM) on your servers because it doesn't require code instrumentation. That low overhead means it can be run continuously, on every server, even in production.</p>
<p>eBPF also lets us deploy a single profiler agent on a host and peek inside the operating system to see every line of code executing on the CPU. That means we have visibility into all of those application building blocks described above — the operating system itself as well as <a href="https://en.wikipedia.org/wiki/Containerization_(computing)">containerization and orchestration frameworks</a> without complex configuration.</p>
<h3>All the symbols</h3>
<p>A key part of Universal Profiling is our hosted symbolization service. This means that symbols are not required on your servers, which not only eliminates the need to recompile software with symbols, but also helps reduce overhead by allowing the Universal Profiling agent to send very sparse data back to the Elasticsearch platform, where it is enriched with all of the missing symbols. Since we maintain a repository of the most popular third-party software libraries and Linux operating system symbols, the Universal Profiling UI can show you all the symbols.</p>
<h3>Your favorite language, and then some</h3>
<p>Universal Profiling is multilanguage. We support all of today’s popular programming languages, including Python, Go, Java (and any other JVM-based languages), Ruby, NodeJS, PHP, Perl, and of course, C and C++, which is critical since these languages still underlie so many third-party libraries used by the other languages. In addition, we support profiling <a href="https://en.wikipedia.org/wiki/Machine_code">native code</a>, a.k.a. machine language.</p>
<p>Speaking of native code, all profiling tools are tied to a specific type of CPU. Most tools today only support the Intel x86 CPU architecture. Universal Profiling supports both x86 and ARM-based processors. With the expanding use of ARM-based servers, especially in cloud environments, Universal Profiling future-proofs your continuous profiling.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-1-universal-profiling.png" alt="A flamegraph showing traces across Python, Native, Kernel, and Java code" /></p>
<p>Many businesses today employ polyglot programming — that is, they use multiple languages to build an application — and Universal Profiling is the only profiler available that can build a holistic view across all of these languages. This will help you look for hotspots in the environment, leading you to &quot;unknown unknowns&quot; that warrant deeper performance analysis. That might be a simple interest rate calculation that should be efficient and lightweight but, surprisingly, isn't. Or perhaps it is a service that is reused much more frequently than originally expected, resulting in thousands of instances running across your environment every second, making it a prime target for efficiency improvement.</p>
<h3>Visualize your impact</h3>
<p>Elastic Universal Profiling has an intuitive UI that immediately shows you the impact of any given function, including the time it spends executing on the CPU and how much that costs both in dollars and in carbon emissions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-2-universal-profiling-flamegraph.png" alt="Annualized dollar cost and CO2 emissions for any function" /></p>
<p>Finally, with the level of software complexity in most production environments, there's a good chance that making a code change will have unanticipated effects across the environment. That code change may be due to a new feature being rolled out or a change to improve efficiency. In either case, a differential view, before and after the change, will help you understand the impact.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-3.png" alt="Performance, CO2, and cost improvements of a more efficient hashing function" /></p>
<h2>Let's recap</h2>
<p>Computational efficiency is an important topic, both from the perspective of the ultra-competitive business climate we all work in and from living through the challenges of our planet's changing climate. Improving efficiency can be a challenging endeavor, but we can't even begin to attempt to make improvements without knowing where to focus our efforts. Elastic Universal Profiling is here to provide every business with visibility into computational efficiency.</p>
<p>How will you use Elastic Universal Profiling in your business?</p>
<ul>
<li>If you are an application developer or part of the site reliability team, Universal Profiling will provide you with unprecedented visibility into your applications that will not only help you troubleshoot performance problems in production, but also understand the impact of new features and deliver an optimal user experience.</li>
<li>If you are involved in cloud and infrastructure financial management and capacity planning, Universal Profiling will provide you with unprecedented visibility into the unit cost of every line of code that your business runs.</li>
<li>If you are involved in your business’s <a href="https://www.elastic.co/blog/sustainability-elastic-6-months-reflection">ESG</a> initiative, Universal Profiling will provide you with unprecedented visibility into your CO<sub>2</sub> emissions and open up new avenues for reducing your carbon footprint.</li>
</ul>
<p>These are just a few examples. For more ideas, read how <a href="https://www.elastic.co/customers/appomni">AppOmni benefits from Elastic Universal Profiling</a>.</p>
<p>You can <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">get started</a> with Elastic Universal Profiling right now!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/the-end-of-databases-A_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to capture custom metrics without app code changes using the Java Agent Plugin]]></title>
            <link>https://www.elastic.co/observability-labs/blog/custom-metrics-app-code-java-agent-plugin</link>
            <guid isPermaLink="false">custom-metrics-app-code-java-agent-plugin</guid>
            <pubDate>Mon, 10 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[When the application you're monitoring doesn't emit the custom metrics you'd like, and you can't directly change the app code, you can use a Java Agent Plugin to automatically instrument the application and emit the custom metrics you desire.]]></description>
            <content:encoded><![CDATA[<p>The Elastic APM Java Agent automatically tracks <a href="https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html">many metrics</a>, including those that are generated through <a href="https://micrometer.io/">Micrometer</a> or the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry Metrics API</a>. So if your application (or the libraries it includes) already exposes metrics from one of those APIs, installing the Elastic APM Java Agent is the only step required to capture them. You'll be able to visualize and configure thresholds, alerts, and anomaly detection — and anything else you want to use them for!</p>
<p>The next simplest option is to generate custom metrics directly from your code (e.g., by adding code using the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry Metrics API</a> directly into the application). The major downside of that approach is that it requires modifying the application, so if you can't or don't want to do that, you can easily produce the desired custom metrics by adding instrumentation to the Elastic APM Java Agent via a plugin.</p>
<p>This article deals with the situation where the application you are monitoring doesn't emit the custom metrics you'd like it to, and you can't directly change the code or config to make it do so. Instead, you can use a plugin to automatically instrument the application via the Elastic APM Java Agent, which will then make the application emit the custom metrics you desire.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-1-kibana-lens.png" alt="Using Elastic Kibana Lens to analyze APM telemetry on various measures" /></p>
<h2>Plugin basics</h2>
<p>The basics of the Elastic APM Java Agent, and how to easily plugin instrumentation, are detailed in the article &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot; Generating metrics from a plugin is just another type of instrumentation, and the referenced article provides detailed step-by-step instructions with a worked example of how to create a plugin with custom instrumentation.</p>
<p>For this article, I assume you understand how to create a plugin with custom instrumentation based on that previous article, as well as the example application (a simple webserver <a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/application/src/main/java/co/elastic/apm/example/webserver/ExampleBasicHttpServer.java">ExampleBasicHttpServer</a>) from our <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a>.</p>
<h2>The custom metric</h2>
<p>For our example application, an HTTP server (<a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/application/src/main/java/co/elastic/apm/example/webserver/ExampleBasicHttpServer.java">ExampleBasicHttpServer</a>), we'd like to add a custom metric, 'page_views', which increments each time the application handles a request. The instrumentation we'll add is therefore triggered by the same ExampleBasicHttpServer.handleRequest() method used in &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot;</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-2-15m-vis.png" alt="A 15-minute line visualization of the page_views metric using Elastic APM" /></p>
<h2>Using the Plugin/OpenTelemetry API</h2>
<p>Essentially, the only difference from that article is that for metrics we'll use the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry <em>metrics</em> API</a> instead of the <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">OpenTelemetry <em>tracing</em> API</a>.</p>
<p>Specifically, the advice applied to the handleRequest() method contains the following code:</p>
<pre><code class="language-java">// pageViewCounter is a field of type io.opentelemetry.api.metrics.LongCounter
if (pageViewCounter == null) {
    pageViewCounter = GlobalOpenTelemetry
        .getMeter(&quot;ExampleHttpServer&quot;)
        .counterBuilder(&quot;page_views&quot;)
        .setDescription(&quot;Page view count&quot;)
        .build();
}
pageViewCounter.add(1);
</code></pre>
<p>That is, lazily create the counter the first time it's needed, then increment it on each invocation of the ExampleBasicHttpServer.handleRequest() method.</p>
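<p>To see that lazy-create-then-increment shape in isolation, here is a self-contained sketch using only JDK types. The <code>Counter</code> class below is merely a stand-in for the OpenTelemetry <code>LongCounter</code> (which, in the real plugin, is built from <code>GlobalOpenTelemetry</code> at runtime), and all names here are illustrative rather than part of the agent's API:</p>
<pre><code class="language-java">import java.util.concurrent.atomic.AtomicLong;

// Stand-in sketch of the lazy counter pattern from the advice above,
// using plain JDK types instead of the OpenTelemetry metrics API.
public class LazyCounterSketch {

    // Plays the role of io.opentelemetry.api.metrics.LongCounter.
    static final class Counter {
        private final AtomicLong value = new AtomicLong();
        void add(long delta) { value.addAndGet(delta); }
        long get() { return value.get(); }
    }

    private static volatile Counter pageViewCounter;

    // Mirrors the advice body: lazily create the counter, then
    // increment it on every handled request.
    static void onRequestHandled() {
        if (pageViewCounter == null) {
            // In the plugin: GlobalOpenTelemetry.getMeter("ExampleHttpServer")
            //     .counterBuilder("page_views").build();
            pageViewCounter = new Counter();
        }
        pageViewCounter.add(1);
    }

    static long pageViews() {
        Counter c = pageViewCounter;
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            onRequestHandled();
        }
        System.out.println(pageViews()); // prints 3
    }
}
</code></pre>
<p>The unsynchronized null check means two threads could race to create the counter; in the real advice this is benign, because instruments requested with the same name from the same meter report into the same metric.</p>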
<p>Everything else — setting up instrumentation, finding the method to instrument, building the plugin — is the same as in the article &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot; The full metrics example is implemented in the <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a>, and the complete metrics instrumentation class is <a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/plugin/src/main/java/co/elastic/apm/example/webserver/plugin/ExampleMetricsInstrumentation.java">ExampleMetricsInstrumentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-3-bar-chart.png" alt="A 15-minute bar chart visualization of the page_views metric using Elastic APM" /></p>
<h2>Try it out!</h2>
<p>That's it! To run the agent with the plugin, build the plugin jar as described in &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>&quot; and place it in the directory specified by the plugins_dir configuration option. The <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a> provides a full tested implementation — just clone it and run mvn install to see it working.</p>
<p>The best place to get started with Elastic APM is in the cloud. Begin your <a href="https://cloud.elastic.co/registration?elektra=en-observability-application-performance-monitoring-page">free trial of Elastic Cloud</a> today!</p>
<blockquote>
<ul>
<li>The <a href="https://www.elastic.co/guide/en/apm/agent/java/current/index.html">Elastic APM Java Agent docs</a></li>
<li>The <a href="https://github.com/elastic/apm-agent-java/">Elastic APM Java Agent repo</a></li>
<li>The <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example</a> repo</li>
<li>The previous <a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a> article</li>
<li>The associated <a href="https://www.elastic.co/blog/regression-testing-your-java-agent-plugin">Regression testing your Java Agent Plugin</a> article</li>
<li>The <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry metrics API</a></li>
<li>The <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">OpenTelemetry tracing API</a></li>
<li><a href="https://micrometer.io/">Micrometer</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/capture-custom-metrics-blog-720x420.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Customize your data ingestion with Elastic input packages]]></title>
            <link>https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages</link>
            <guid isPermaLink="false">customize-data-ingestion-input-packages</guid>
            <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, learn about input packages and how they can provide a flexible solution to advanced users for customizing their ingestion experience in Elastic.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> has enabled the collection, transformation, and analysis of data flowing between external data sources and the Elastic Observability solution through <a href="https://www.elastic.co/integrations/">integrations</a>. Integration packages achieve this by encapsulating several components, including <a href="https://www.elastic.co/guide/en/fleet/current/create-standalone-agent-policy.html">agent configuration</a>, inputs for data collection, and assets like <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index templates</a>, and <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">visualizations</a>. The breadth of assets supported in the Elastic Stack grows day by day.</p>
<p>This blog dives into how input packages provide an extremely generic and flexible solution to the advanced users for customizing their ingestion experience in Elastic.</p>
<h2>What are input packages?</h2>
<p>An <a href="https://github.com/elastic/elastic-package">Elastic Package</a> is an artifact that contains a collection of assets that extend the Elastic Stack, providing new capabilities to accomplish a specific task like integration with an external data source. The most established type of Elastic package is the <a href="https://github.com/elastic/integrations">integration package</a>, which provides an end-to-end experience — from configuring Elastic Agent, to collecting signals from the data source, to ingesting them correctly and using the data once ingested.</p>
<p>However, advanced users may need to customize data collection, either because an integration does not exist for a specific data source, or even if it does, they want to collect additional signals or in a different way. Input packages are another type of <a href="https://github.com/elastic/elastic-package">Elastic package</a> that provides the capability to configure Elastic Agent to use the provided inputs in a custom way.</p>
<h2>Let’s look at an example</h2>
<p>Say hello to Julia, who works as an engineer at Ascio Innovation firm. She is currently working with Oracle Weblogic server and wants to get a set of metrics for monitoring it. She goes ahead and installs Elastic <a href="https://docs.elastic.co/integrations/oracle_weblogic">Oracle Weblogic Integration</a>, which uses Jolokia in the backend to fetch the metrics.</p>
<p>Now, her team wants to take its monitoring further and has the following requirements:</p>
<ol>
<li>
<p>We should be able to extract metrics beyond the default set supported by the Oracle Weblogic Integration.</p>
</li>
<li>
<p>We want to have our own bespoke pipelines, visualizations, and experience.</p>
</li>
<li>
<p>We should be able to identify the metrics coming in from two different instances of Weblogic Servers by having data mapped to separate <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a>.</p>
</li>
</ol>
<p>All the above requirements can be met by using the <a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a> to get a customized experience. Let's see how.</p>
<p>Julia can add the configuration of the Jolokia input package as below, fulfilling the <em>first requirement</em>: the hostname, the JMX mappings for the fields she wants to fetch from the JVM application, and the <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name to which the response fields will be mapped.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-1-config-parameters.png" alt="Configuration Parameters for Jolokia Input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-2-expanded-doc.png" alt="Metrics getting mapped to the index created by the ‘jolokia_first_dataset’" /></p>
<p>Julia can customize her data by writing her own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing her customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Also, she can then build her own bespoke dashboards, hence meeting her <em>second requirement.</em></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-3-ingest-pipelines.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>Let’s say now Julia wants to use another instance of Oracle Weblogic and get a different set of metrics.</p>
<p>This can be achieved by adding another instance of the Jolokia input package and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name as shown in the screenshot below. The resultant metrics will be mapped to a different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html">index</a>/data set, fulfilling her <em>third requirement</em> and letting Julia differentiate metrics coming in from two different instances of Oracle Weblogic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-4-jolokia.png" alt="jolokia metrics" /></p>
<p>The resultant metrics of the query will be indexed to the new data set, jolokia_second_dataset in the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-5-dataset.png" alt="dataset" /></p>
<p>As we can see above, the Jolokia input package provides the flexibility to get new metrics by specifying different JMX Mappings, which are not supported in the default Oracle Weblogic integration (the user gets metrics from a predetermined set of JMX Mappings).</p>
<p>The Jolokia input package can also be used to monitor any Java-based application that exposes its metrics through JMX, so a single input package can be used to collect metrics from multiple Java applications and services.</p>
<h2>Elastic input packages</h2>
<p>Elastic has supported input packages since the 8.8.0 release. Some input packages are now available in beta and will mature gradually:</p>
<ol>
<li>
<p><a href="https://docs.elastic.co/integrations/sql">SQL input package</a>: The SQL input package allows you to execute queries against any SQL database and store the results in Elasticsearch<sup>®</sup>.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/prometheus_input">Prometheus input package</a>: This input package can collect metrics from <a href="https://prometheus.io/docs/instrumenting/exporters/">Prometheus Exporters (Collectors)</a>. It can be used by any service that exports its metrics to a Prometheus endpoint.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a>: This input package collects metrics from <a href="https://jolokia.org/agent.html">Jolokia agents</a> running on a target JMX server or dedicated proxy server. It can be used to monitor any Java-based application that exposes its metrics through JMX.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/statsd_input">Statsd input package</a>: The statsd input package spawns a UDP server and listens for metrics in a StatsD-compatible format. This input can be used to collect metrics from services that send data over the StatsD protocol.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/gcp_metrics">GCP Metrics input package</a>: The GCP Metrics input package can collect custom metrics for any GCP service.</p>
</li>
</ol>
<h2>Try it out!</h2>
<p>Now that you know more about input packages, try building your own customized integration for your service through input packages, and get started with an <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> free trial.</p>
<p>We would love to hear from you about your experience with input packages on the Elastic <a href="https://discuss.elastic.co/">Discuss</a> forum or in <a href="https://github.com/elastic/integrations">the Elastic Integrations repository</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/customize-observability-input-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability: Streams Data Quality and Failure Store Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams</link>
            <guid isPermaLink="false">data-quality-and-failure-store-in-streams</guid>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Streams, a new AI-driven Elastic Observability feature, helps manage data quality with a failure store so you can monitor, troubleshoot, and retain high-quality data.]]></description>
            <content:encoded><![CDATA[<p>When working with observability and logging data, not all documents make it into Elasticsearch in pristine condition. Some may be dropped due to processing failures in ingest pipelines or mapping errors, while others may be partially ingested with ignored fields if a field's value is incompatible with the defined mappings. These issues can impact downstream analysis and dashboards. Streams data quality makes it easier than ever to monitor the health of your ingested data, identify potential issues, and take corrective action right from the UI. With data quality, you can now see exactly how well your Stream is performing and quickly understand whether your data quality is <strong>Good</strong>, <strong>Degraded</strong>, or <strong>Poor</strong>.</p>
<h2>What's in data quality</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/data-quality-tab.png" alt="Data quality tab" /></p>
<h3>At-a-glance summary</h3>
<p>The summary card shows:</p>
<ul>
<li><strong>Degraded documents</strong> - Documents that contain the <code>_ignored</code> field - see <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-ignored-field">this</a> for more info.</li>
<li><strong>Failed documents</strong> - Documents that were rejected at ingestion due to mapping conflicts or pipeline failures.</li>
</ul>
<p>The overall <strong>quality score</strong> (Good, Degraded, Poor) is automatically calculated based on the percentage of degraded and failed documents.</p>
<h3>Trends over time</h3>
<p>The tab includes a time-series chart so you can track how degraded and failed documents are accumulating over time. Use the <strong>date picker</strong> to zoom into a specific range and understand when problems are spiking.</p>
<h3>Quality issues table</h3>
<p>A detailed table lists the types of issues affecting your stream. For each issue, you can:</p>
<ul>
<li>See which fields are causing problems.</li>
<li>Review counts of affected documents.</li>
<li>Filter by issues that have not been solved yet (Current issues only).</li>
<li>Open a <strong>flyout</strong> to dive deeper into the cause of the issue and learn how to fix it.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/quality-issue-flyout.png" alt="Data quality issue flyout" /></p>
<h2>Monitoring degraded documents</h2>
<p>A degraded document is one that contains the <code>_ignored</code> field, which means one or more of its fields were ignored during indexing. One of the reasons could be that their values didn’t match the expected mappings. While the rest of the document is still indexed, a high number of degraded documents can affect query results, dashboards, and overall observability accuracy.</p>
<p>To help keep these issues under control, the Data quality tab provides visibility into the percentage of degraded documents in your stream.</p>
<h3>Set up a rule to stay ahead of issues</h3>
<p>You can use the <strong>Create rule</strong> button above the Degraded docs chart to define an alert that notifies you when the percentage of degraded documents crosses a certain threshold. This makes it easy to proactively monitor for mapping mismatches and ensure your data continues to meet quality expectations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/create-rule-button.png" alt="Create rule button" /></p>
<p>For more information on how to configure this rule, see <a href="https://www.elastic.co/docs/solutions/observability/incident-management/create-a-degraded-docs-rule#degraded-docs-rule-conditions">Degraded docs rule conditions</a>.</p>
<h2>Handling failed documents with the failure store</h2>
<p><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store"><strong>Failure store</strong></a> is a special index that captures documents rejected during ingestion. Instead of losing this data, the failure store retains it in a dedicated <code>::failures</code> index, allowing you to inspect the problematic documents, understand what went wrong, and fix the underlying issues.</p>
<p>In the Data quality tab, failed documents are only visible if your stream has a failure store enabled; viewing failure store documents requires at least the <code>read_failure_store</code> privilege. If the failure store is <strong>not enabled</strong>, you’ll see an <strong>“Enable failure store”</strong> link that opens a modal to configure it and set the retention period. Enabling the failure store requires the <code>manage_failure_store</code> privilege on the specific data stream. For further information about failure store security, refer to <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store#use-failure-store-searching">Searching failures</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/enable-fs-link.png" alt="Enable failure store link" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/failure-store-modal.png" alt="Failure store configuration modal" /></p>
<p>Once enabled, you can <strong>edit the failure store configuration</strong> or disable it at any time using the <strong>Edit</strong> button above the failed docs chart.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/edit-fs-button.png" alt="Edit failure store button" /></p>
<p>The failure store can also be configured in the Streams Retention tab - see <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams.mdx">this article</a> for more information.</p>
<h2>Technical implementation</h2>
<p>Under the hood, the <strong>Data quality</strong> tab builds on the existing <strong>Dataset quality</strong> plugin - the same one that powers the <a href="https://www.elastic.co/docs/solutions/observability/data-set-quality-monitoring"><strong>Dataset quality page</strong></a> in <strong>Stack Management</strong>. However, instead of working in the context of datasets following the Data stream naming scheme, it’s now tailored specifically for <strong>streams</strong>.</p>
<p>To determine the quality of a stream, the UI sends three <strong>ES|QL</strong> queries to the server:</p>
<ol>
<li><strong>All documents (including failures):</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream, myStream::failures | STATS doc_count = COUNT(*)
</code></pre>
<ol start="2">
<li><strong>Failed documents only:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream::failures | STATS failed_doc_count = COUNT(*)
</code></pre>
<ol start="3">
<li><strong>Degraded documents:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream METADATA _ignored | WHERE _ignored IS NOT NULL | STATS degraded_doc_count = COUNT(*)
</code></pre>
<p>The results of these queries are then used to calculate the <strong>percentages</strong> of failed and degraded documents. The overall data quality is determined using simple thresholds:</p>
<ul>
<li><strong>Good:</strong> Both percentages are 0%</li>
<li><strong>Degraded:</strong> Any percentage is greater than 0% but less than 3%</li>
<li><strong>Poor:</strong> Any percentage is 3% or above</li>
</ul>
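<p>The resulting classification amounts to a couple of comparisons. As a minimal sketch of the thresholds above (the class and method names here are ours, not the plugin's):</p>
<pre><code class="language-java">public class QualityScore {
    // Classify a stream from total/failed/degraded document counts using
    // the thresholds above: 0% is Good, under 3% is Degraded, else Poor.
    static String quality(long total, long failed, long degraded) {
        if (total == 0) return "Good";
        double worstPct = 100.0 * Math.max(failed, degraded) / total;
        if (worstPct == 0.0) return "Good";
        return worstPct < 3.0 ? "Degraded" : "Poor";
    }

    public static void main(String[] args) {
        System.out.println(quality(1000, 0, 0));  // Good
        System.out.println(quality(1000, 5, 10)); // Degraded (1%)
        System.out.println(quality(1000, 50, 0)); // Poor (5%)
    }
}
</code></pre>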
<p>For managing the <strong>failure store</strong>, Streams uses the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options">Update data stream options API</a> with the <code>failure_store</code> parameter to configure and update the failure store settings, including enabling the store and setting the retention period.</p>
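<p>In Kibana Dev Tools syntax, enabling the failure store through that API looks roughly like the following; treat this as a sketch and see the linked API reference for the full set of supported options:</p>
<pre><code class="language-json">PUT _data_stream/myStream/_options
{
  "failure_store": {
    "enabled": true
  }
}
</code></pre>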
<h2>Why you’ll love this</h2>
<p>The new <strong>Data quality</strong> tab gives you:</p>
<ul>
<li>Visibility into ingestion problems without digging into logs</li>
<li>A clear breakdown of degraded vs. failed documents</li>
<li>Insights into which fields are ignored and why</li>
<li>Tools to capture and troubleshoot failed docs with the failure store</li>
</ul>
<p>By surfacing data quality issues directly in the Streams UI, we’re making it easier to keep your data flowing reliably and to ensure your analytics are built on a strong foundation.</p>
<h2><strong>Try it out today</strong></h2>
<p>The <strong>data quality</strong> feature is available in <strong>Elastic Observability on Serverless</strong> and is coming soon for self-managed and Elastic Cloud users.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>For more information on Streams:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[A day in the life of an OpenTelemetry maintainer]]></title>
            <link>https://www.elastic.co/observability-labs/blog/day-opentelemetry-maintainer</link>
            <guid isPermaLink="false">day-opentelemetry-maintainer</guid>
            <pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we will look at what the role of a maintainer involves, and how maintainers keep open source projects alive.]]></description>
            <content:encoded><![CDATA[<p>I’m <a href="https://github.com/dmathieu">Damien</a>, an engineer at Elastic, a maintainer of the OpenTelemetry Go SDK, an approver on the OpenTelemetry Collector, and a member of several SIGs.
In this post, we’ll take a closer look at what it means to be a maintainer: the responsibilities they carry, the challenges they navigate, and the impact they have on both the project and the broader community.</p>
<p><a href="https://youtu.be/eZ3OrhxUAmU?t=72"><img src="https://www.elastic.co/observability-labs/assets/images/day-opentelemetry-maintainer/humans-of-otel.png" alt="Humans of OTel at KubeCon EU 2025" /></a></p>
<p>When people think about open source, they often picture lines of code, clever algorithms, or maybe a GitHub repo full of issues and pull requests.
What can be harder to see is the human side. The people who quietly keep things moving, who make sure contributions land smoothly and help the community grow in a healthy way.
That's the work of a maintainer.</p>
<p>Maintainers are more than just code reviewers. They are the stewards of the SIG's (<a href="https://github.com/open-telemetry/community#special-interest-groups">Special Interest Group</a>) health, direction, and community.
They balance technical oversight with mentorship, governance with collaboration, and long-term vision with the day-to-day realities of issues and pull requests.</p>
<h2>Open Source mentorship</h2>
<p>One of the most rewarding parts of being a maintainer is mentorship.
Every open source project depends on new contributors stepping in, learning the ropes, and eventually taking on more responsibility themselves.
As maintainers, we’re often the first point of contact for someone who’s never contributed to the project before.</p>
<p>Mentorship can look like many different things.
Sometimes it’s as simple as leaving a thoughtful code review that doesn’t just point out what’s wrong, but explains why a change matters.
Other times, it’s guiding a contributor through their first issue, helping them understand the project’s structure, or showing them how to run tests locally.
And every so often, it means stepping back to give someone room to try, even if they don’t get it right the first time.</p>
<p>The goal isn’t just to fix the immediate bug or land the pull request. It's to help contributors feel confident enough to come back again.
A healthy project grows by sharing knowledge, not hoarding it.
Mentorship is how maintainers make sure today's first-time contributor can become tomorrow's reviewer, and eventually, the next maintainer.</p>
<h2>Setting direction and priorities</h2>
<p>Another part of being a maintainer is shaping the project's roadmap.
Open source moves fast: there are always new ideas, bug reports, and feature requests.
Left unchecked, a project can easily become a grab bag of loosely connected changes.
Part of our job as maintainers is to make sure the work stays aligned with the bigger picture.</p>
<p>That means asking questions like:</p>
<ul>
<li>Does this feature fit with our long-term goals?</li>
<li>Is now the right time to tackle it?</li>
<li>Do we have the capacity to maintain it once it’s merged?</li>
</ul>
<p>Sometimes the answer is &quot;not yet&quot; or even &quot;no&quot;, and it’s on us to communicate that clearly while still encouraging contributions.</p>
<p>Roadmapping isn't about dictating every detail.
It's about setting priorities together with the community—listening to feedback, balancing what users need today with where the project should be tomorrow, and making tradeoffs that keep the project sustainable.</p>
<p>The roadmap gives everyone a shared sense of direction.
Contributors know where their work fits in, users can see what's coming next, and the project as a whole stays focused instead of scattered.</p>
<h2>Special Interest Group meetings</h2>
<p>One of the maintainer's roles is also to facilitate the frequent meetings that help their SIG communicate and plan its work.</p>
<p>Facilitating a SIG meeting isn’t about running through an agenda like a checklist.
It’s about creating space where everyone feels comfortable speaking up, from long-time contributors to someone joining their very first call.
That means keeping discussions focused, making sure quieter voices get heard, and helping the group reach consensus without letting debates drag on forever.</p>
<p>There's also a practical side: preparing the agenda ahead of time, documenting decisions so they’re visible to the wider community, and following up on action items afterward.</p>
<p>In many ways, SIG meetings are where the &quot;community&quot; part of open source really comes to life.
As maintainers, our role is to guide the conversation, not control it, making sure the project keeps moving forward while staying open and inclusive.</p>
<h2>Challenges</h2>
<p>Of course, maintaining isn’t all smooth sailing.
One of the hardest parts is balancing the constant flow of contributions with the need to keep the codebase healthy.
Every pull request represents someone's time and effort, and it’s important to honor that.
Yet, at the same time, not every change fits the project's standards or long-term goals.
Saying &quot;no&quot; gracefully is just as important as merging a great contribution.</p>
<p>Maintainers also find themselves balancing priorities that go beyond code.
Different contributors, and often the companies backing them, come with their own needs and expectations.
One team might want a new feature quickly, another might be focused on stability, while the community as a whole still needs clear direction.
Managing those competing priorities, and making decisions that serve the project rather than any single interest, is a constant challenge.</p>
<p>Conflicts are another reality. With so many people involved, it's inevitable that disagreements will happen.
Sometimes it's about technical design, sometimes about process, and occasionally about interpersonal dynamics.
Part of the maintainer role is helping to navigate those moments: keeping discussions respectful, finding common ground, and making sure decisions are made transparently.</p>
<p>And yet, despite the difficulties, the impact of this work is enormous: when maintainers succeed, the entire community thrives.</p>
<h2>The importance and impact of Open Source maintainers</h2>
<p>When maintainers do their job well, the effects ripple far beyond the codebase.
A well-tended project feels reliable and welcoming—contributors know their work will be reviewed thoughtfully, users trust the software to be stable, and the community grows because people want to come back.</p>
<p>Good project maintenance builds momentum.
A contributor who feels supported on their first pull request is more likely to return for a second.
Clear roadmaps and consistent standards give people confidence that their effort matters and will fit into the bigger picture. And when conflicts are handled with respect and transparency, it reinforces the culture of trust that makes open source sustainable.</p>
<p>The impact goes deeper than just keeping a project alive.
Effective maintainers create the conditions for others to succeed.
That's the real legacy of this role: not just code, but a thriving ecosystem and community built around it.</p>
<h2>Conclusion</h2>
<p>Being a maintainer is challenging work, but it’s also some of the most meaningful.
It’s about more than merging code. It's about stewardship, mentorship, and creating a community where people feel empowered to contribute.
Every healthy open source project owes its success to the care and commitment of its maintainers.</p>
<p>And while the challenges are real, the rewards are just as tangible: the chance to constantly learn, to collaborate on complex problems, and to connect with people from every corner of the world and every kind of background.</p>
<p>OpenTelemetry's maintainers embody this balance every day, helping the project grow while keeping its community strong.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/day-opentelemetry-maintainer/blog-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Debugging Azure Networking for Elastic Cloud Serverless]]></title>
            <link>https://www.elastic.co/observability-labs/blog/debugging-aks-packet-loss</link>
            <guid isPermaLink="false">debugging-aks-packet-loss</guid>
            <pubDate>Thu, 05 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic SREs uncovered and resolved unexpected packet loss in Azure Kubernetes Service (AKS), impacting Elastic Cloud Serverless performance.]]></description>
            <content:encoded><![CDATA[&lt;h2&gt; Summary of Findings &lt;/h2&gt; 
<p>Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.</p>
&lt;h2&gt; Setting the Scene &lt;/h2&gt;
<p><a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> is a fully managed solution that allows you to deploy and use Elastic for your use cases without managing the underlying infrastructure. Built on Kubernetes, it represents a shift in how you interact with Elasticsearch. Instead of managing clusters, nodes, data tiers, and scaling, you create serverless projects that are fully managed and automatically scaled by Elastic. This abstraction of infrastructure decisions allows you to focus solely on gaining value and insight from your data.</p>
<p>Elastic Cloud Serverless is generally available (GA) on AWS and GCP, and currently in <a href="https://www.elastic.co/guide/en/serverless/current/regions.html">Technical Preview on Azure</a>. As part of preparing Elastic Cloud Serverless for GA on Azure, we have been conducting extensive performance and scalability tests to ensure that our users get a consistent and reliable user experience.</p>
<p>In this post, we’ll take you behind the scenes of a deep technical investigation into a surprising performance issue that affected Serverless Elasticsearch in our Azure Kubernetes clusters. At first, the network seemed like the least likely place to look, especially with a high-speed 100 Gb/s interface on the host backing it. But as we dug deeper, with help from the Microsoft Azure team, that’s exactly where the problem led us.</p>
&lt;h2&gt; Unexpected Results! &lt;/h2&gt;
<p>While the high-level architectures and system design patterns of the major cloud providers’ systems are often similar, the implementations differ, and those differences can have dramatic impacts on a system’s performance characteristics.</p>
<p>One of the most significant differences between the different cloud providers is that the underlying hypervisor software and server hardware of the Virtual Machines can vary significantly, even between instance families of the same provider.</p>
<p>There is no way to fully abstract the hardware away from an application like Elasticsearch. Fundamentally, its performance is dictated by the CPU, memory, disks, and network interfaces on the physical server. In preparation for the Elastic Cloud Serverless GA on Azure, our Elasticsearch Performance team kicked off large-scale load testing against Serverless Elasticsearch projects running on <a href="https://docs.azure.cn/en-us/aks/what-is-aks">Azure Kubernetes Service (AKS)</a>, using <a href="https://azure.microsoft.com/en-us/blog/azure-cobalt-100-based-virtual-machines-are-now-generally-available/">ARM-based VMs</a> (we’re big fans!). Throughout this process, we relied heavily on Elastic tools to analyse system behaviour, identify bottlenecks, and validate performance under load.</p>
<p>To perform these scale and load tests, the Elasticsearch Performance team use <a href="https://github.com/elastic/rally">Rally</a>, an open-source benchmarking tool designed to measure the performance of Elasticsearch clusters. The workload (or in Rally nomenclature, ‘Track’) used for these tests was the <a href="https://github.com/elastic/rally-tracks/tree/master/github_archive">GitHub Archive Track</a>. Rally collects and sends test telemetry using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/python">official Python client</a> to a separate Elasticsearch cluster running <a href="https://www.elastic.co/observability">Elastic Observability</a>, which allows for monitoring and analysis during these scale and load tests in real time via <a href="https://www.elastic.co/docs/explore-analyze">Kibana</a>.</p>
<p>When we looked at the results, we observed that the indexing rate (the number of docs/s) for the Serverless projects was not only much lower than we had expected for the given hardware, but the throughput was also quite unstable. There were peaks and valleys, interspersed with frequent errors, whereas we were instead expecting a stable indexing rate for the duration of the test.</p>
<p>These tests are designed to push the system to its limits, and in doing so, they surfaced unexpected behavior in the form of unstable indexing throughput and intermittent errors. This was precisely the kind of problem we'd hoped to uncover prior to going GA, giving us the opportunity to work closely with Azure to resolve it.</p>
&lt;div align=&quot;center&quot;&gt;
![Indexing Rate with Packet Loss](/assets/images/debugging-aks-packet-loss/indexing-rate-before.png)
_A Kibana visualisation of Rally telemetry, showing fluctuating Elasticsearch indexing rates alongside spikes in 5xx and 4xx HTTP error responses._
&lt;/div&gt;
&lt;h2&gt; Debugging! &lt;/h2&gt;
<p>Debugging performance issues can feel a little bit like trying to find a <a href="https://www.youtube.com/watch?v=7AO4wz6gI3Q">‘Butterfly in a Hurricane’</a>, so it’s crucial to take a methodical approach to analysing application and system performance.</p>
<p>Using established methodologies helps you be more consistent and thorough in your debugging, and avoid missing things. We started with the <a href="https://www.brendangregg.com/usemethod.html">Utilisation Saturation and Errors (USE) Method</a>, looking at both the client and server side to identify any obvious bottlenecks in the system.</p>
<p>Elastic's Site Reliability Engineers (SREs) maintain a suite of custom <a href="https://www.elastic.co/docs/solutions/observability/get-started/what-is-elastic-observability">Elastic Observability</a> dashboards designed to visualise data collected from various <a href="https://www.elastic.co/docs/extend/integrations/what-is-an-integration">Elastic Integrations</a>. These dashboards provide deep visibility into the health and performance of Elastic Cloud infrastructure and systems.</p>
<p>For this investigation, we leveraged a custom dashboard built using metrics and log data from the <a href="https://www.elastic.co/docs/reference/integrations/system">System</a> and <a href="https://www.elastic.co/docs/reference/integrations/linux">Linux</a> Integrations:</p>
&lt;div align=&quot;center&quot;&gt;
  ![Node Overview Dashboard](/assets/images/debugging-aks-packet-loss/overview-dashboard.png)
  _One of many Elastic Observability dashboards built and maintained by the SRE team._
&lt;/div&gt;
<p>Following the USE Method, these dashboards highlight resource utilisation, saturation, and errors across our systems. With their help, we quickly identified that the AKS nodes hosting the Elasticsearch pods under test were dropping thousands of packets per second.</p>
&lt;div align=&quot;center&quot;&gt;
![Node Packet Loss Before Tuning](/assets/images/debugging-aks-packet-loss/packet-loss-before.png)
_A Kibana visualisation of [Elastic Agent's System Integration](https://www.elastic.co/docs/reference/integrations/system), showing the rate of packet drops per second for AKS nodes._
&lt;/div&gt;
<p>Dropping packets forces reliable protocols, such as TCP, to retransmit the missing packets. These retransmissions can introduce significant delays, which kills the throughput of any system where each client request is only sent once the previous one completes (known as a <a href="https://www.usenix.org/legacy/event/nsdi06/tech/full_papers/schroeder/schroeder.pdf">Closed System</a>).</p>
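Retransmissions show up in the kernel's own counters, so they are easy to spot without any extra tooling. As a minimal sketch (the column lookup is derived from the header row of <code>/proc/net/snmp</code>, so no field index is hard-coded):

```shell
# Print the kernel's cumulative TCP retransmission counter. The first "Tcp:"
# row of /proc/net/snmp is a header; we use it to locate the RetransSegs
# column instead of hard-coding a field index.
awk '/^Tcp:/ {
    if (!header_seen) { for (i = 1; i <= NF; i++) col[$i] = i; header_seen = 1 }
    else print "TCP segments retransmitted since boot:", $(col["RetransSegs"])
}' /proc/net/snmp
```

Sampling this counter before and after a load test gives a quick sense of how much loss the transport layer is absorbing.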
<p>To investigate further, we jumped onto one of the AKS nodes exhibiting the packet loss to check the basics. First off, we wanted to identify what type of packet drops or errors we were seeing; were they limited to specific pods, or affecting the host as a whole?</p>
<pre><code>root@aks-k8s-node-1:~# ip -s link show
2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    373507935420 134292481      0       0       0      15
    TX:    bytes   packets errors dropped carrier collsns
    644247778936 303191014      0       0       0       0
3: enP42266s1: &lt;BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP&gt; mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    386782548951 307000571      0       0 5321081       0
    TX:    bytes   packets errors dropped carrier collsns
    655758630548 477594747      0       0       0       0
    altname enP42266p0s2
15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>In this output you can see the <code>enP42266s1</code> interface is showing a significant number of packets in the <code>missed</code> column. That’s interesting, sure, but what does missed actually represent? And what is <code>enP42266s1</code>?</p>
<p>To understand, let’s look at roughly what happens when a packet arrives at the NIC:</p>
<ol>
<li>A packet arrives at the NIC from the network.</li>
<li>The NIC uses DMA (Direct Memory Access) to place the packet into a receive ring buffer allocated in memory by the kernel and mapped for use by the NIC. Since our NICs support multiple hardware queues, each queue has its own dedicated ring buffer, IRQ, and NAPI context.</li>
<li>The NIC raises a hardware interrupt (IRQ) to notify the CPU that a packet is ready.</li>
<li>The CPU runs the NIC driver’s IRQ handler. The driver schedules a NAPI (New API) poll to defer packet processing to a softirq context, a Linux kernel mechanism that defers work outside of the hard IRQ context for better batching, CPU efficiency, and scalability.</li>
<li>The NAPI poll function is executed in a softirq context (<code>NET_RX_SOFTIRQ</code>) and retrieves packets from the ring buffer. This polling continues either until the driver’s packet budget is exhausted (<code>net.core.netdev_budget</code>) or the time limit is hit (<code>net.core.netdev_budget_usecs</code>).</li>
<li>Each packet is wrapped in an <code>sk_buff</code> (socket buffer) structure, which includes metadata such as protocol headers, timestamps, and interface identifiers.</li>
<li>If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</li>
<li>Packets are then handed off to the kernel’s networking stack for routing, filtering, and protocol-specific processing (e.g. TCP, UDP).</li>
<li>Finally, packets reach the appropriate socket receive buffer, where they are available for consumption by the user-space application.</li>
</ol>
<p>Visualised, it looks something like this:</p>
&lt;div align=&quot;center&quot;&gt;
![Linux Packet Flow Diagram](/assets/images/debugging-aks-packet-loss/packet-flow.png)
_Image © 2018 Leandro Moreira. Used under the [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause). Source: [GitHub repository](https://github.com/leandromoreira/linux-network-performance-parameters)._
&lt;/div&gt;
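The per-CPU backlog from step 7 is directly observable: each row of <code>/proc/net/softnet_stat</code> corresponds to one CPU, and the second hexadecimal column counts packets dropped because that CPU's backlog queue was full. A small sketch to total them:

```shell
# Sum backlog drops (column 2, in hex) across all CPUs. A steadily rising
# total under load suggests net.core.netdev_max_backlog is too small.
total=0
while read -r _processed dropped _rest; do
    total=$(( total + 0x$dropped ))
done < /proc/net/softnet_stat
echo "backlog drops across all CPUs: $total"
```

Note that these drops are attributed to the kernel's input path rather than to any one interface, which is part of why they can be easy to miss.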
<p>The <code>missed</code> counter is incremented whenever the NIC tries to DMA a packet into a fully occupied <a href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffer</a>. The NIC essentially &quot;misses&quot; the chance to deliver the packet to the VM’s memory. However, what’s most interesting is that this counter seldom increments for VMs. This is because Virtual NICs are usually implemented as software via the hypervisor, which typically has much more flexible memory management compared to the physical NICs and can reduce the chance of ring buffer overflow.</p>
<p>We mentioned earlier that we’re building Azure Elasticsearch Serverless on top of Azure’s AKS service, which is important to note because all of our AKS nodes use an Azure feature called <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview">Accelerated Networking</a>. In this setup, network traffic is delivered directly to the VM’s network interface, bypassing the hypervisor. This is enabled by <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-">single root I/O virtualization (SR-IOV)</a>, which offers much lower latency and higher throughput than traditional VM networking. Each node is physically connected to a 100 Gb/s network interface, although the SR-IOV Virtual Function (VF) exposed to the VM typically provides only a fraction of that total bandwidth.</p>
<p>Despite the VM only having a fraction of the 100 Gb/s bandwidth, microbursts are still very possible. These physical interfaces are so fast that they can transmit and receive multiple packets in just nanoseconds, far faster than most buffers or processing queues can absorb. At these timescales, even a short-lived burst of traffic can overwhelm the receiver, leading to dropped packets and unpredictable latency.</p>
<p>Direct access to the SR-IOV interface means that our VMs are responsible for handling the hardware interrupts triggered by the NIC in a timely manner. If there's any delay in handling the hardware interrupt (e.g. waiting to be scheduled onto a CPU by the hypervisor), then network packets can be missed!</p>
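The counters shown by <code>ip -s link show</code> are also exposed per interface in sysfs, which makes them easy to poll from a script or metrics agent. A minimal sketch; the interface name here is illustrative (<code>lo</code>), whereas on the affected nodes the interesting one was the SR-IOV VF, <code>enP42266s1</code>:

```shell
# rx_missed_errors is the "missed" column, and rx_dropped/tx_dropped are the
# "dropped" columns from `ip -s link show`. Substitute the SR-IOV VF name
# (e.g. enP42266s1) when running this on a real node.
IFACE=lo
for counter in rx_missed_errors rx_dropped tx_dropped; do
    printf '%s %s=%s\n' "$IFACE" "$counter" "$(cat /sys/class/net/$IFACE/statistics/$counter)"
done
```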
&lt;h2&gt; Firstly - NIC-level Tuning &lt;/h2&gt;
<p>Since we'd confirmed that our VMs were using SR-IOV, we established that the <code>enP42266s1</code> and <code>eth0</code> interfaces <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works">were a bonded pair and acted as a single interface</a>. Knowing this, we reasoned that we should be able to adjust the ring buffer values directly using <code>ethtool</code>.</p>
<pre><code>root@aks-k8s-node-1:~# ethtool -g enP42266s1
Ring parameters for enP42266s1:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024
</code></pre>
<p>In the output above, we were using only 1/8th of the available ring buffer descriptors. These values were set by the OS defaults, which generally aim to balance performance and resource usage. Set too low, they risk packet drops under load; set too high, they can lead to unnecessary memory consumption. We knew that the VMs were backed by a virtual function carved out of the directly attached 100 Gb/s network interface, which is fast enough to deliver microbursts that could easily overwhelm small buffers. To better absorb those short, high-intensity bursts of traffic, we increased the NIC’s RX ring buffer size from 1024 to 8192. Using a privileged DaemonSet, we rolled out the change across all of our AKS nodes by installing <a href="https://en.wikipedia.org/wiki/Udev">a <code>udev</code> rule</a> to automatically increase the buffer size:</p>
<pre><code># Match Mellanox ConnectX network cards and run ethtool to update the ring buffer settings
ENV{INTERFACE}==&quot;en*&quot;, ENV{ID_NET_DRIVER}==&quot;mlx5_core&quot;, RUN+=&quot;/sbin/ethtool -G %k rx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE} tx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE}&quot;
</code></pre>
&lt;div align=&quot;center&quot;&gt;
![AKS Node Packet Loss after RX ring buffer change](/assets/images/debugging-aks-packet-loss/packet-loss-after.png)
_A Kibana visualisation of [Elastic Agent's System Integration](https://www.elastic.co/docs/reference/integrations/system), showing packet loss reduced by ~99% after increasing the NIC's RX ring buffer values._
&lt;/div&gt;
<p>As soon as the change had been applied to all AKS nodes, we stopped ‘missing’ RX packets! Fantastic! As a result of this simple change, we observed a significant improvement in our indexing throughput and stability.</p>
&lt;div align=&quot;center&quot;&gt;
![Indexing rate after RX ring buffer change](/assets/images/debugging-aks-packet-loss/indexing-rate-after.png)
_A Kibana visualisation of Rally telemetry, showing stable and improved Elasticsearch indexing rates after increasing the RX ring buffer size._
&lt;/div&gt;
<p>Job done, right? Not quite…</p>
&lt;h2&gt; Further improvements - Kernel-level Tuning &lt;/h2&gt;
<p>Eagle-eyed readers may have noticed two things:</p>
<ol>
<li>In the previous screenshot, despite adjusting the physical RX ring buffer values, we still observed a small number of <code>dropped</code> packets on the TX side.</li>
<li>In the original <code>ip -s link show</code> output, one of the ‘logical’ interfaces used by the Elasticsearch pod was showing <code>dropped</code> packets on both the TX and RX sides.</li>
</ol>
<pre><code>15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>So, we continued to dig. We’d eliminated ~99% of the packet loss, and the remaining loss rate wasn’t as significant as what we’d started with, but we still wanted to understand why it was occurring even after adjusting the RX ring buffer size of the NIC.</p>
<p>So what does <code>dropped</code> represent, and what is this <code>lxc0ca0ec41ecd2</code> interface? <code>dropped</code> is similar to <code>missed</code>, but only occurs when packets are deliberately dropped by the kernel or network interface. Crucially though, it doesn’t tell you why a packet was dropped. As for the <code>lxc0ca0ec41ecd2</code> interface, we use the <a href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium">Azure CNI Powered by Cilium</a> to provide the network functionality to our AKS clusters. Any pod spun up on an AKS node gets a ‘logical’ interface, which is a virtual ethernet (<code>veth</code>) pair that connects the pod’s network namespace with the host’s network namespace. It was here that we were dropping packets.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/aks-node-network-topology.png" alt="AKS Node Networking Diagram" /></p>
<p>In our experience, packet drops at this layer are unusual, so we started digging deeper into their cause. There are numerous ways to debug why a packet is being dropped, but one of the easiest is <a href="https://perfwiki.github.io/main/">to use <code>perf</code></a> to attach to the <code>skb:kfree_skb</code> tracepoint. The &quot;socket buffer&quot; (<code>skb</code>) is the primary data structure used to represent network packets in the Linux kernel. When a packet is dropped, its corresponding socket buffer is usually freed, triggering the <code>kfree_skb</code> tracepoint. Using <code>perf</code> to attach to this event allowed us to capture stack traces to analyze the cause of the drops.</p>
&lt;div align=&quot;center&quot;&gt;
```
# perf record -g -a -e skb:kfree_skb
```
&lt;/div&gt;
<p>We left this to run for ~10 minutes or so to capture as many drops as possible, and then, ‘heavily inspired’ by <a href="https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290">this GitHub Gist by Ivan Babrou</a>, we converted the stack traces into easier-to-read <a href="https://github.com/brendangregg/FlameGraph">Flamegraphs</a>:</p>
<pre><code># perf script | sed -e 's/skb:kfree_skb:.*reason:\(.*\)/\n\tfffff \1 (unknown)/' -e 's/^\(\w\+\)\s\+/kernel /' &gt; stacks.txt
cat stacks.txt | stackcollapse-perf.pl --all | perl -pe 's/.*?;//' | sed -e 's/.*irq_exit_rcu_\[k\];/irq_exit_rcu_[k];/' | flamegraph.pl --colors=java --hash --title=aks-k8s-node-1 --width=1440 --minwidth=0.005 &gt; aks-k8s-node-1.svg
</code></pre>
&lt;div align=&quot;center&quot;&gt;
![AKS Node Packet Loss Flamegraph](/assets/images/debugging-aks-packet-loss/aks-packet-loss-flamegraph.png)
_A Flamegraph showing the various stack trace ancestry of packet loss._
&lt;/div&gt;
<p>The flamegraph here shows how often different functions appeared in the stack traces for packet drops. Each box represents a function call, and wider boxes mean the function appears more frequently in the traces. The stack's ancestry builds upward, from earlier calls at the bottom to later calls at the top.</p>
<p>We quickly discovered that, unfortunately, the <code>skb_drop_reason</code> enum <a href="https://github.com/torvalds/linux/commit/c504e5c2f9648a1e5c2be01e8c3f59d394192bd3">was only added in kernel 5.17</a> (Azure’s node image at the time used 5.15). This meant there was no human-readable message telling us why packets were being dropped; all we got was <code>NOT_SPECIFIED</code>. To work out why, we needed to do a little sleuthing through the stack traces to determine what code paths were being taken when a packet was dropped.</p>
<p>In the flamegraph above you can see that many of the stack traces include <code>veth</code> driver function calls (e.g. <code>veth_xmit</code>), and many end abruptly with a call to the <code>enqueue_to_backlog</code> function. When many stacks end at the same function (like <code>enqueue_to_backlog</code>) it suggests that function is a common point where packets are being dropped. If you go back to the earlier explanation of what happens when a packet arrives at the NIC, you’ll notice that in step 7 we explained:</p>
<blockquote>
<p><em>7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</em></p>
</blockquote>
<p>Using the same privileged DaemonSet method as for the RX ring buffer adjustment, we raised the <code>net.core.netdev_max_backlog</code> kernel parameter from 1000 to 32768:</p>
<pre><code>/usr/sbin/sysctl -w net.core.netdev_max_backlog=32768
</code></pre>
<p>This value was based on the fact that we knew the hosts were using a 100 Gb/s SR-IOV NIC, even if the VM was allowed only a fraction of the total bandwidth. We acknowledge it’s worth revisiting this value in the future to see if it can be optimised to avoid wasting memory, but at the time, “perfect was the enemy of good”.</p>
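Reading the parameter back requires no privileges, so it's an easy post-rollout check that the DaemonSet actually landed on each node:

```shell
# Equivalent to `sysctl -n net.core.netdev_max_backlog`; after the rollout
# this should print the raised value (32768 in our case) rather than the
# default of 1000.
cat /proc/sys/net/core/netdev_max_backlog
```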
<p>We re-ran the load tests and compared the three sets of results we’d collected thus far.</p>
&lt;div align=&quot;center&quot;&gt;
![Final Indexing Rate Results](/assets/images/debugging-aks-packet-loss/indexing-rate-final.png)
_A Kibana visualisation of Rally results, comparing impact to median throughput after each configuration change._
&lt;/div&gt;
<table>
<thead>
<tr>
<th>Tuning Step</th>
<th>Packet Loss</th>
<th>Median indexing throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>High</td>
<td>~18,000 docs/s</td>
</tr>
<tr>
<td>+RX Buffer</td>
<td>~99% drop ↓</td>
<td>~26,000 (+ ~40% from baseline)</td>
</tr>
<tr>
<td>+Backlog &amp; +RX Buffer</td>
<td>Near zero</td>
<td>~29,000 (+ ~60% from baseline)</td>
</tr>
</tbody>
</table>
<p>Here you can see the P50 throughput in docs/s over the course of the hours-long load tests. Compared to the baseline, we saw a roughly <strong>~40%</strong> increase in throughput by only adjusting the RX ring buffer values, and a <strong>~60%</strong> increase with both the RX ring buffer and backlog changes! Hooray!</p>
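As a rough sanity check of those percentages against the median figures in the table:

```shell
# (26000/18000 - 1) and (29000/18000 - 1), expressed as percentages.
awk 'BEGIN {
    printf "RX buffer only:      %+.0f%%\n", (26000 / 18000 - 1) * 100
    printf "RX buffer + backlog: %+.0f%%\n", (29000 / 18000 - 1) * 100
}'
```

With the table's medians this works out to roughly +44% and +61%, consistent with the rounded figures quoted above.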
<p>A great result and one more step on our journey towards better Serverless Elasticsearch performance.</p>
&lt;h2&gt; Working with Azure &lt;/h2&gt;
<p>It’s great that we were able to quickly identify and mitigate the majority of our packet loss issues, but since we were running on AKS with Azure’s node images, it made sense to engage with Azure to understand why the defaults weren’t working for our workload.</p>
<p>We walked Azure through our investigation, mitigations, and results, and asked for additional validation of our mitigations. Azure Engineering confirmed that the host NICs were not discarding packets, meaning everything arriving at the host level was being passed through to the hypervisor. Further investigation confirmed that no loss or discards were occurring in the Azure network fabric or internal to the hypervisor, which shifted the focus from the host to the guest OS, and to why the guest kernel was slow to read packets off the <code>enP*</code> SR-IOV interfaces.</p>
<p>Given the complexity of our load testing scenario, which involved configuring multiple systems and tools (including <a href="https://www.elastic.co/observability">Elastic Observability</a>), we also developed a simplified reproduction of the packet loss issue using <a href="https://github.com/esnet/iperf"><code>iperf3</code></a>. This simplified test was created specifically to share with Azure for targeted analysis, complementing the broader monitoring and analysis enabled by Elastic Observability and Rally.</p>
<p>With this reproduction, Azure was able to confirm the increasing <code>missed</code> and <code>dropped</code> packet counters we had observed, and endorsed the RX ring buffer and <code>netdev_max_backlog</code> increases as the recommended mitigations.</p>
&lt;h2&gt; Conclusion &lt;/h2&gt;
<p>While cloud providers offer various abstractions to manage your resources, the underlying hardware ultimately determines your application's performance and stability. High-performance hardware often requires tuning at the operating system level, well beyond the default settings most environments ship with. In managed platforms like AKS, where Azure controls both the node images and infrastructure, it is easy to overlook the impact of low-level configurations such as network device ring buffer sizes or sysctls like <code>net.core.netdev_max_backlog</code>.</p>
<p>Our experience shows that even with the convenience of a managed Kubernetes service, performance issues can still emerge if these hardware parameters are not tuned appropriately. It was tempting to assume that high-speed 100 Gb/s network interfaces, directly attached to the VM using SR-IOV, would eliminate any chance of network-related bottlenecks. In reality, that assumption didn’t hold up.</p>
<p>Engaging early with Azure was essential, as they provided deeper visibility into the underlying infrastructure and worked with us to tune low-level, performance-critical settings. Combined with thorough load and scale testing and robust observability using tools like Elastic Observability, this collaboration helped us detect and rectify the issue early in order to deliver a consistent, reliable, and high-performing experience for our users.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/debugging-aks-packet-loss.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to deploy a Hello World web app with Elastic Observability on AWS App Runner]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-app-observability-aws-app-runner</link>
            <guid isPermaLink="false">deploy-app-observability-aws-app-runner</guid>
            <pubDate>Mon, 02 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on AWS App Runner.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability is the premier tool to provide visibility into web apps running in your environment. AWS App Runner is the serverless platform of choice to run your web apps that need to scale up and down massively to meet demand or minimize costs. Elastic Observability combined with AWS App Runner is the perfect solution for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>, in a way that’s straightforward to implement and manage.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to App Runner and then walk you through instrumenting it so that the application’s operations can be observed with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-1-create-deployment.png" alt="1 create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic® deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page, which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the Kibana® row of links, click on <strong>Open</strong> to open <strong>Kibana</strong> for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-2-my-deployment.png" alt="2 my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-3-apm.png" alt="3 apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-4-apm-agents.png" alt="4 apm agents" /></p>
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our AWS project for deploying apps to App Runner.</p>
<h2>AWS App Runner setup</h2>
<p>To start using AWS App Runner, you need an AWS account. If you’re a brand new user, go to <a href="https://aws.amazon.com">aws.amazon.com</a> to sign up for a new account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-5-start-building.png" alt="5 start building on aws today" /></p>
<h2>Set up AWS CloudShell</h2>
<p>We’ll create a Python Hello World app image and push it to AWS ECR using AWS CloudShell.</p>
<p>We’re going to use Docker to build the sample app image. Perform the following five steps to set up Docker within CloudShell.</p>
<ol>
<li>Open <a href="https://console.aws.amazon.com/cloudshell/">AWS CloudShell</a>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-6-welcome-to-aws-cloudshell.png" alt="6 welcome to aws cloudshell" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-7-aws-cloudshell.png" alt="7 aws cloudshell" /></p>
<ol start="2">
<li>Run the following two commands to install Docker in CloudShell:</li>
</ol>
<pre><code class="language-bash">sudo yum update -y
sudo amazon-linux-extras install docker
</code></pre>
<ol start="3">
<li>Start Docker by running the command:</li>
</ol>
<pre><code class="language-bash">sudo dockerd
</code></pre>
<ol start="4">
<li>With Docker running, open a new tab in CloudShell by clicking the <strong>Actions</strong> dropdown menu and selecting <strong>New tab</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-8-aws-cloudshell-with-code.png" alt="8 aws cloudshell with code" /></p>
<ol start="5">
<li>Authenticate Docker to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">aws ecr get-login-password --region us-east-2 | sudo docker login --username AWS --password-stdin &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com
</code></pre>
<h2>Build the Hello World web app image and push it to AWS ECR</h2>
<p>We’ll be using <a href="https://aws.amazon.com/ecr/">AWS ECR</a>, Amazon’s fully managed container registry for storing and deploying application images. To build and push the Hello World app image to AWS ECR, we’ll perform the following six steps in <a href="https://console.aws.amazon.com/cloudshell/">AWS CloudShell</a>:</p>
<ol>
<li>Run the command below in CloudShell to create a repository in AWS ECR.</li>
</ol>
<pre><code class="language-bash">aws ecr create-repository \
    --repository-name elastic-helloworld/web \
    --image-scanning-configuration scanOnPush=true \
    --region us-east-2
</code></pre>
<p><strong>“elastic-helloworld”</strong> will be the application’s name and <strong>“web”</strong> will be the service name.</p>
<ol start="2">
<li>In the newly created tab within CloudShell, clone a <a href="https://github.com/elastic/observability-examples/tree/main/aws/app-runner/helloworld">Python Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="3">
<li>Change directory to the location of the Hello World web app code by running the following command:</li>
</ol>
<pre><code class="language-bash">cd observability-examples/aws/app-runner/helloworld
</code></pre>
<ol start="4">
<li>Build the Hello World sample app from the application’s directory. Run the following Docker command in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker build -t elastic-helloworld/web .
</code></pre>
<ol start="5">
<li>Tag the application image. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker tag elastic-helloworld/web:latest &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<ol start="6">
<li>Push the application image to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker push &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<h2>Deploy a Hello World web app to AWS App Runner</h2>
<p>We’ll deploy the Python Hello World app to App Runner using the AWS App Runner console.</p>
<ol>
<li>Open the <a href="https://console.aws.amazon.com/apprunner/">App Runner console</a> and click the <strong>Create an App Runner service</strong> button.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-9-aws-app-runner.png" alt="9 aws app runner" /></p>
<ol start="2">
<li>On the Source and deployment page, set the following deployment details:</li>
</ol>
<ul>
<li>In the Source section, for Repository type, choose <strong>Container registry</strong>.</li>
<li>For Provider, choose <strong>Amazon ECR</strong>.</li>
<li>For Container image URI, choose <strong>Browse</strong> to select the Hello World application image that we previously pushed to AWS ECR.
<ul>
<li>In the Select Amazon ECR container image dialog box, for Image repository, select the <strong>“elastic-helloworld/web”</strong> repository.</li>
<li>For Image tag, select <strong>“latest”</strong> and then choose <strong>Continue</strong>.</li>
</ul>
</li>
<li>In the Deployment settings section, choose <strong>Automatic</strong>.</li>
<li>For ECR access role, choose <strong>Create new service role.</strong></li>
<li>Click <strong>Next</strong>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-10-source-and-deployment.png" alt="10 source and deployment" /></p>
<ol start="3">
<li>On the Configure service page, in the Service settings section, enter the service name <strong>“helloworld-app”</strong>. Leave all the other settings as they are and click <strong>Next</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-11-configure-service.png" alt="11 configure service" /></p>
<ol start="4">
<li>On the Review and create page, click <strong>Create &amp; deploy</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-12-review-and-create.png" alt="12 review and create" /></p>
<p>After a few minutes, the Hello World app will be deployed to App Runner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-13-helloworld-app.png" alt="13 hello world app green text" /></p>
<ol start="5">
<li>Click the <strong>Default domain</strong> URL to view the Hello World app running in App Runner.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-14-hello-world.png" alt="14 hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in App Runner, we’re now ready to add the minimal code necessary to start monitoring the app. To enable observability for the Hello World app in Elastic Cloud, we’ll perform the following five steps in <a href="https://console.aws.amazon.com/cloudshell">AWS CloudShell</a>:</p>
<ol>
<li>Edit the Dockerfile file to add the following OpenTelemetry environment variables along with the commands to install and run the Elastic APM agent. Use the “nano” text editor by typing “nano Dockerfile”. Be sure to replace the &lt;ELASTIC_APM_SERVER_URL&gt; text and the &lt;ELASTIC_APM_SECRET_TOKEN&gt; text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step. The updated Dockerfile should look something like this:</li>
</ol>
<pre><code class="language-dockerfile">FROM python:3.9-slim as base

# get packages
COPY requirements.txt .
RUN pip install -r requirements.txt

WORKDIR /app

# install opentelemetry packages
RUN pip install opentelemetry-distro opentelemetry-exporter-otlp
RUN opentelemetry-bootstrap -a install

ENV OTEL_EXPORTER_OTLP_ENDPOINT='&lt;ELASTIC_APM_SERVER_URL&gt;'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20&lt;ELASTIC_APM_SECRET_TOKEN&gt;'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp

COPY . .
ENV FLASK_APP=helloworld
ENV FLASK_RUN_HOST=0.0.0.0
ENV FLASK_RUN_PORT=8080
EXPOSE 8080
ENTRYPOINT [ &quot;opentelemetry-instrument&quot;, &quot;flask&quot;, &quot;run&quot; ]
</code></pre>
<p>Note: You can close the nano text editor and save the file by typing “Ctrl + x”. Press the “y” key and then the “Enter” key to save the changes.</p>
<ol start="2">
<li>Edit the helloworld.py file to add observability traces. In CloudShell, type “nano helloworld.py” to edit the file.</li>
</ol>
<ul>
<li>After the import statements at the top of the file, add the code required to initialize the OpenTelemetry tracer:</li>
</ul>
<pre><code class="language-python">from opentelemetry import trace
tracer = trace.get_tracer(&quot;hello-world&quot;)
</code></pre>
<ul>
<li>Replace the “Hello World!” output code…</li>
</ul>
<pre><code class="language-python">return &quot;&lt;h1&gt;Hello World!&lt;/h1&gt;&quot;
</code></pre>
<ul>
<li>… with the Hello Elastic Observability code block.</li>
</ul>
<pre><code class="language-python">return '''
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
Hello Elastic Observability - AWS App Runner - Python
&lt;/h1&gt;
&lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
&lt;/div&gt;
'''
</code></pre>
<ul>
<li>Then add a “hi” trace before the Hello Elastic Observability code block along with an additional “@app.after_request” method placed afterward to implement a “bye” trace.</li>
</ul>
<pre><code class="language-python">@app.route(&quot;/&quot;)
def helloworld():
	with tracer.start_as_current_span(&quot;hi&quot;) as span:
  	  logging.info(&quot;hello&quot;)
  	  return '''
   	 &lt;div style=&quot;text-align: center;&quot;&gt;
   	 &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   	 Hello Elastic Observability - AWS App Runner - Python
   	 &lt;/h1&gt;
   	 &lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
   	 &lt;/div&gt;
   	 '''

@app.after_request
def after_request(response):
	with tracer.start_as_current_span(&quot;bye&quot;):
  	  logging.info(&quot;goodbye&quot;)
  	  return response
</code></pre>
<p>The completed helloworld.py file should look something like this:</p>
<pre><code class="language-python">import logging
from flask import Flask

from opentelemetry import trace
tracer = trace.get_tracer(&quot;hello-world&quot;)

app = Flask(__name__)

@app.route(&quot;/&quot;)
def helloworld():
    with tracer.start_as_current_span(&quot;hi&quot;) as span:
   	 logging.info(&quot;hello&quot;)
   	 return '''
    	&lt;div style=&quot;text-align: center;&quot;&gt;
    	&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    	Hello Elastic Observability - AWS App Runner - Python
    	&lt;/h1&gt;
    	&lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
    	&lt;/div&gt;
    	'''

@app.after_request
def after_request(response):
    with tracer.start_as_current_span(&quot;bye&quot;):
   	 logging.info(&quot;goodbye&quot;)
   	 return response
</code></pre>
<p>Note: You can close the nano text editor and save the file by typing “Ctrl + x”. Press the “y” key and then the “Enter” key to save the changes.</p>
<ol start="3">
<li>Rebuild the updated Hello World sample app using Docker from within the application’s directory. Run the following command in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker build -t elastic-helloworld/web .
</code></pre>
<ol start="4">
<li>Tag the application image using Docker. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker tag elastic-helloworld/web:latest &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<ol start="5">
<li>Push the updated application image to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker push &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<p>Pushing the image to ECR will automatically deploy the new version of the Hello World app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-15-green-banner-successfully.png" alt="15 green banner successful deployment" /></p>
<p>Open the <a href="http://console.aws.amazon.com/apprunner">App Runner</a> console. After a few minutes, the Hello World app will be deployed to App Runner. Click the <strong>Default domain</strong> URL to view the updated Hello World app running in App Runner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-16-elastic-logo-text-top.png" alt="16 elastic" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>“/”</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the Trace Sample section to see the <strong>“/”</strong>, <strong>“hi”</strong>, and <strong>“bye”</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-17-trace-sample.png" alt="17 trace sample" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the complete process of deploying a web app to AWS App Runner that is instrumented with Elastic Observability. The end result is a web app that will scale up and down with usage, combined with the observability tools to monitor the web app as it serves one user or millions of users.</p>
<p>Now that you’ve seen how to deploy a serverless web app instrumented with observability, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about how to implement a complete observability solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/aws">Getting started with Elastic on AWS</a> for more examples of how you can drive the data insights you need by combining AWS’s cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/library-branding-elastic-observability-white-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Deploying Elastic Agent with Confluent Cloud's Elasticsearch Connector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</link>
            <guid isPermaLink="false">deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</guid>
            <pubDate>Wed, 22 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Confluent Cloud users can now use the updated Elasticsearch Sink Connector with Elastic Agent and Elastic Integrations for a fully-managed and highly scalable data ingest architecture.]]></description>
            <content:encoded><![CDATA[<p>Elastic and Confluent are key technology partners and we're pleased to announce new investments in that partnership. Built by the original creators of Apache Kafka®, Confluent's data streaming platform is a key component of many Enterprise ingest architectures, and it ensures that customers can guarantee delivery of critical Observability and Security data into their Elasticsearch clusters. Together, we've been working on key improvements to how our products fit together. With <a href="https://www.elastic.co/blog/elastic-agent-output-kafka-data-collection-streaming">Elastic Agent's new Kafka output</a> and Confluent's newly improved <a href="https://www.confluent.io/hub/confluentinc/kafka-connect-elasticsearch/">Elasticsearch Sink Connectors</a> it's never been easier to seamlessly collect data from the edge, stream it through Kafka, and into an Elasticsearch cluster.</p>
<p>In this blog, we examine a simple way to integrate Elastic Agent with Confluent Cloud's Kafka offering to reduce the operational burden of ingesting business-critical data.</p>
<h2>Benefits of Elastic Agent and Confluent Cloud</h2>
<p>When combined, Elastic Agent and Confluent Cloud's updated Elasticsearch Sink connector provide numerous advantages for organizations of all sizes, offering the flexibility to handle any type of data ingest workload in an efficient and resilient manner.</p>
<h3>Fully Managed</h3>
<p>When combined, Elastic Cloud Serverless and Confluent Cloud provide users with a fully managed service. This makes it effortless to deploy and ingest nearly unlimited data volumes without having to worry about nodes, clusters, or scaling.</p>
<h3>Full Elastic Integrations Support</h3>
<p>Sending data through Kafka is fully supported with any of the 300+ Elastic Integrations. In this blog post, we outline how to set up the connection between the two platforms. This ensures you can benefit from our investments in built-in alerts, SLOs, AI Assistants, and more.</p>
<h3>Decoupled Architecture</h3>
<p>Kafka acts as a resilient buffer between data sources (such as Elastic Agent and Logstash) and Elasticsearch, decoupling data producers from consumers. This can significantly reduce total cost of ownership by enabling you to size your Elasticsearch cluster based on typical data ingest volume, not maximum ingest volume. It also ensures system resilience during spikes in data volume.</p>
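<p>As a back-of-the-envelope illustration, with purely hypothetical ingest numbers, sizing for the typical rather than the peak rate is what drives the savings, since the Kafka topic absorbs the bursts:</p>
<pre><code class="language-python"># Hypothetical ingest profile: a steady 10 MB/s with short-lived 50 MB/s spikes.
# Without a buffer, Elasticsearch must be provisioned for the peak; with Kafka
# in between, it only needs headroom above the typical rate.
typical_mbps = 10.0   # sustained ingest rate (MB/s)
peak_mbps = 50.0      # spike rate the buffer must absorb (MB/s)
headroom = 1.2        # 20% safety margin over the typical rate

without_buffer = peak_mbps               # cluster sized for the worst case
with_buffer = typical_mbps * headroom    # cluster sized for the typical load
savings = 1 - with_buffer / without_buffer

print(f'Capacity without Kafka: {without_buffer} MB/s')
print(f'Capacity with Kafka buffer: {with_buffer} MB/s')
print(f'Capacity reduction: {savings:.0%}')  # prints: Capacity reduction: 76%
</code></pre>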
<h3>Ultimate control over your data</h3>
<p>With our new Output per Integration capability, customers can now send different data to different destinations using the same agent. Customers can easily send security logs directly to Confluent Cloud/Kafka, which can provide delivery guarantees, while sending less critical application logs and system metrics directly to Elasticsearch.</p>
<h2>Deploying the reference architecture</h2>
<p>In the following sections, we will walk you through one of the ways Confluent Kafka can be integrated with Elastic Agent and Elasticsearch using Confluent Cloud's Elasticsearch Sink Connector. As with any streaming and data collection technology, there are many ways a pipeline can be configured depending on the particular use case. This blog post will focus on a simple architecture that can be used as a starting point for more complex deployments.</p>
<p>Some of the highlights of this architecture are:</p>
<ul>
<li>Dynamic Kafka topic selection at Elastic Agents</li>
<li>Elasticsearch Sink Connectors for fully managed transfer from Confluent Kafka to Elasticsearch</li>
<li>Processing data leveraging Elastic's 300+ Integrations</li>
</ul>
<h3>Prerequisites</h3>
<p>Before getting started, ensure you have a Kafka cluster deployed in Confluent Cloud, an Elasticsearch cluster or project deployed in Elastic Cloud, and an installed and enrolled Elastic Agent.</p>
<h3>Configure Confluent Cloud Kafka Cluster for Elastic Agent</h3>
<p>Navigate to the Kafka cluster in Confluent Cloud, and select <code>Cluster Settings</code>. Locate and note the <code>Bootstrap Server</code> address; we will need this value later when we create the Kafka Output in Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/confluent-cluster-settings.png" alt="Confluent Cluster Settings" /></p>
<p>Navigate to <code>Topics</code> in the left-hand navigation menu and create two topics:</p>
<ol>
<li>A topic named <code>logs</code></li>
<li>A topic named <code>metrics</code></li>
</ol>
<p>Next, navigate to <code>API Keys</code> in the left-hand navigation menu:</p>
<ol>
<li>Click <code>+ Add API Key</code></li>
<li>Select the <code>Service Account</code> API key type</li>
<li>Provide a meaningful name for this API Key</li>
<li>Grant the key write permission to the <code>metrics</code> and <code>logs</code> topics</li>
<li>Create the key</li>
</ol>
<p>Note the provided Key and Secret; we will need them later when we configure the Kafka Output in Fleet.</p>
<h3>Configure Elasticsearch and Elastic Agent</h3>
<p>In this section, we will configure the Elastic Agent to send data to Confluent Cloud's Kafka cluster and we will configure Elasticsearch so it can receive data from the Confluent Cloud Elasticsearch Sink Connector.</p>
<h4>Configure Elastic Agent to send data to Confluent Cloud</h4>
<p>Elastic Fleet simplifies sending data to Kafka and Confluent Cloud. With Elastic Agent, a Kafka &quot;output&quot; can be easily attached to all data coming from an agent or it can be applied only to data coming from a specific data source.</p>
<p>Find <code>Fleet</code> in the left-hand navigation and click the <code>Settings</code> tab. On the <code>Settings</code> tab, find the <code>Outputs</code> section and click <code>Add Output</code>.</p>
<p>Perform the following steps to configure the new Kafka output:</p>
<ol>
<li>Provide a <code>Name</code> for the output</li>
<li>Set the <code>Type</code> to <code>Kafka</code></li>
<li>Populate the <code>Hosts</code> field with the <code>Bootstrap Server</code> address we noted earlier.</li>
<li>Under <code>Authentication</code>, populate the <code>Username</code> with the <code>API Key</code> and the <code>Password</code> with the <code>Secret</code> we noted earlier <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration.png" alt="Elastic Fleet Output" /></li>
<li>Under <code>Topics</code>, select <code>Dynamic Topic</code> and set <code>Topic from field</code> to <code>data_stream.type</code> <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration-dynamic-topic.png" alt="Kafka Output Dynamic Topic Configuration" /></li>
<li>Click <code>Save and apply settings</code></li>
</ol>
<p>Next, we will navigate to the <code>Agent Policies</code> tab in Fleet and click to edit the Agent Policy that we want to attach the Kafka output to. With the Agent Policy open, click the <code>Settings</code> tab and change <code>Output for integrations</code> and <code>Output for agent monitoring</code> to the Kafka output we just created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-agent-policy-kafka.png" alt="Agent Policy Output Configuration" /></p>
<p><strong>Selecting an Output per Elastic Integration</strong>: To set the Kafka output to be used for specific data sources, see the <a href="https://www.elastic.co/guide/en/fleet/master/integration-level-outputs.html">integration-level outputs documentation</a>.</p>
<p><strong>A note about Topic Selection</strong>: The <code>data_stream.type</code> field is a reserved field which Elastic Agent automatically sets to <code>logs</code> if the data we're sending is a log and <code>metrics</code> if the data we're sending is a metric. Enabling Dynamic Topic selection using <code>data_stream.type</code>, will cause Elastic Agent to automatically route metrics to a <code>metrics</code> topic and logs to a <code>logs</code> topic. For information on topic selection, see the Kafka Output's <a href="https://www.elastic.co/guide/en/fleet/master/kafka-output-settings.html#_topics_settings">Topics settings</a> documentation.</p>
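<p>The routing rule itself is simple. As a rough sketch (the event dicts below are illustrative, not actual Agent output), the selection logic amounts to reading a single field:</p>
<pre><code class="language-python"># Sketch of dynamic topic selection keyed on data_stream.type.
def select_topic(event):
    '''Return the Kafka topic name for an event, based on data_stream.type.'''
    return event['data_stream']['type']  # 'logs' or 'metrics'

log_event = {
    'message': 'disk full',
    'data_stream': {'type': 'logs', 'dataset': 'system.syslog', 'namespace': 'default'},
}
metric_event = {
    'system': {'cpu': {'pct': 0.42}},
    'data_stream': {'type': 'metrics', 'dataset': 'system.cpu', 'namespace': 'default'},
}

print(select_topic(log_event))     # prints: logs
print(select_topic(metric_event))  # prints: metrics
</code></pre>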
<h4>Configuring a publishing endpoint in Elasticsearch</h4>
<p>Next, we will set up two publishing endpoints (data streams) for the Confluent Cloud Sink Connector to use when publishing documents to Elasticsearch:</p>
<ol>
<li>We will create a data stream <code>logs-kafka.reroute-default</code> for handling <strong>logs</strong></li>
<li>We will create a data stream <code>metrics-kafka.reroute-default</code> for handling <strong>metrics</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-overview.png" alt="Sink Connector Overview" /></p>
<p>If we left the data in those data streams as-is, it would be available but unparsed and lacking vital enrichment. So we will also create two index templates and two ingest pipelines to make sure the data is processed by our Elastic Integrations.</p>
<h4>Creating the Elasticsearch Index Templates and Ingest Pipelines</h4>
<p>The following steps use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Dev Tools in Kibana</a>, but all of these steps can be completed via the REST API or using the relevant user interfaces in Stack Management.</p>
<p>First, we will create the Index Template and Ingest Pipeline for handling <strong>logs</strong>:</p>
<pre><code class="language-json">PUT _index_template/logs-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;logs-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/logs-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>Next, we will create the Index Template and Ingest Pipeline for handling <strong>metrics</strong>:</p>
<pre><code class="language-json">PUT _index_template/metrics-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;metrics-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;metrics-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/metrics-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p><strong>A note about rerouting</strong>: For a practical example of how this works, a document related to a Linux network metric would first land in <code>metrics-kafka.reroute-default</code>. This Ingest Pipeline would inspect the document and find <code>data_stream.dataset</code> set to <code>system.network</code> and <code>data_stream.namespace</code> set to <code>default</code>. It would use these values to reroute the document from <code>metrics-kafka.reroute-default</code> to <code>metrics-system.network-default</code>, where it would be processed by the <code>system</code> integration.</p>
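<p>That renaming can be sketched in a few lines (a simplified Python model of the pipeline's behavior, not the reroute processor implementation):</p>
<pre><code class="language-python"># Simplified model of the reroute pipeline: the target data stream name is
# rebuilt as type-dataset-namespace from fields on the document itself.
def reroute_target(doc, stream_type):
    ds = doc['data_stream']
    return '{0}-{1}-{2}'.format(stream_type, ds['dataset'], ds['namespace'])

doc = {
    'system': {'network': {'in': {'bytes': 1024}}},
    'data_stream': {'dataset': 'system.network', 'namespace': 'default'},
}

# Arrives in metrics-kafka.reroute-default, leaves for the integration's stream:
print(reroute_target(doc, 'metrics'))  # prints: metrics-system.network-default
</code></pre>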
<h3>Configure the Confluent Cloud Elasticsearch Sink Connector</h3>
<p>Now it's time to configure the Confluent Cloud Elasticsearch Sink Connector. We will perform the following steps twice and create two separate connectors, one connector for <strong>logs</strong> and one connector for <strong>metrics</strong>. Where the required settings differ, we will highlight the correct values.</p>
<p>Navigate to your Kafka cluster in Confluent Cloud and select Connectors from the left-hand navigation menu. On the Connectors page, select <code>Elasticsearch Service Sink</code> from the catalog of available connectors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-install.png" alt="Sink Connector Setup" /></p>
<p>Confluent Cloud presents a simplified workflow for the user to configure a connector. Here we will walk through each step of the process:</p>
<h4>Step 1: Topic Selection</h4>
<p>First, we will select the topic that the connector will consume data from based on which connector we are deploying:</p>
<ul>
<li>When deploying the Elasticsearch Sink Connector for <strong>logs</strong>, select the <code>logs</code> topic.</li>
<li>When deploying the Elasticsearch Sink Connector for <strong>metrics</strong>, select the <code>metrics</code> topic.</li>
</ul>
<h4>Step 2: Kafka Credentials</h4>
<p>Choose <code>KAFKA_API_KEY</code> as the cluster authentication mode. Provide the <code>API Key</code> and <code>Secret</code> noted earlier when we gathered the required Confluent Cloud cluster information. <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-credentials.png" alt="Sink Connector Credentials" /></p>
<h4>Step 3: Authentication</h4>
<p>Provide the Elasticsearch Endpoint address of our Elasticsearch cluster as the <code>Connection URI</code>. The <code>Connection user</code> and <code>Connection password</code> are the authentication information for the account in Elasticsearch that will be used by the Elasticsearch Sink Connector to write data to Elasticsearch.</p>
<h4>Step 4: Configuration</h4>
<p>In this step we will keep the <code>Input Kafka record value format</code> set to <code>JSON</code>. Next, expand <code>Advanced Configuration</code>.</p>
<ol>
<li>We will set <code>Data Stream Dataset</code> to <code>kafka.reroute</code></li>
<li>We will set <code>Data Stream Type</code> based on the connector we are deploying:
<ul>
<li>When deploying the Elasticsearch Sink Connector for logs, we will set <code>Data Stream Type</code> to <code>logs</code></li>
<li>When deploying the Elasticsearch Sink Connector for metrics, we will set <code>Data Stream Type</code> to <code>metrics</code></li>
</ul>
</li>
<li>The correct values for other settings will depend on the specific environment.</li>
</ol>
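<p>If you prefer automation over the UI wizard, the same settings can be captured in a connector configuration file. The sketch below is illustrative only; property names and required fields vary by connector version, so verify against your connector's documentation before using it. For the <strong>logs</strong> connector, it would look roughly like:</p>
<pre><code class="language-json">{
  &quot;name&quot;: &quot;elasticsearch-sink-logs&quot;,
  &quot;topics&quot;: &quot;logs&quot;,
  &quot;input.data.format&quot;: &quot;JSON&quot;,
  &quot;data.stream.type&quot;: &quot;LOGS&quot;,
  &quot;data.stream.dataset&quot;: &quot;kafka.reroute&quot;
}
</code></pre>
<p>The metrics connector would differ only in its <code>topics</code> and <code>data.stream.type</code> values.</p>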
<h4>Step 5: Sizing</h4>
<p>In this step, notice that Confluent Cloud provides a recommended minimum number of tasks for our deployment. Following the recommendation here is a good starting place for most deployments.</p>
<h4>Step 6: Review and Launch</h4>
<p>Review the <code>Connector configuration</code> and <code>Connector pricing</code> sections and if everything looks good, it's time to click <code>continue</code> and launch the connector! The connector may report as provisioning but will soon start consuming data from the Kafka topic and writing it to the Elasticsearch cluster.</p>
<p>You can now navigate to Discover in Kibana and find your logs flowing into Elasticsearch! Also check out the real-time metrics that Confluent Cloud provides for your new Elasticsearch Sink Connector deployments.</p>
<p>If you have only deployed the first <code>logs</code> sink connector, you can now repeat the steps above to deploy the second <code>metrics</code> sink connector.</p>
<h2>Enjoy your fully managed data ingest architecture</h2>
<p>If you followed the steps above, congratulations. You have successfully:</p>
<ol>
<li>Configured Elastic Agent to send logs and metrics to dedicated topics in Kafka</li>
<li>Created publishing endpoints (data streams) in Elasticsearch dedicated to handling data from the Elasticsearch Sink Connector</li>
<li>Configured managed Elasticsearch Sink connectors to consume data from multiple topics and publish that data to Elasticsearch</li>
</ol>
<p>Next you should enable additional integrations, deploy more Elastic Agents, explore your data in Kibana, and enjoy the benefits of a fully managed data ingest architecture with Elastic Serverless and Confluent Cloud!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Log Processing UX Design in Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/designing-log-processing-ux-for-streams</link>
            <guid isPermaLink="false">designing-log-processing-ux-for-streams</guid>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore log processing in Elastic Streams and the design decisions behind the Processing UX that make log data more accessible, consistent, and actionable.]]></description>
            <content:encoded><![CDATA[<p>This post is written from the perspective of the Elastic Observability design team. It’s aimed at developers and SREs who work with logs and ingest pipelines, and it explains how design decisions shaped the Processing experience in Streams.</p>
<h2>The Design Problem in Log Processing</h2>
<p>We rarely talk about how projects actually begin. </p>
<p>How do you design something that doesn't fully exist yet?</p>
<p>How do you align AI capabilities, system constraints, and real user pains into one coherent experience?</p>
<p><a href="https://www.elastic.co/elasticsearch/streams">Streams</a> gave us that challenge.</p>
<p>Logs are one of the richest signals in observability - but also one of the messiest. Streams is an agentic AI-powered solution that rethinks how teams work with logs to enable fast incident investigation and resolution. </p>
<p><em>Streams uses AI to partition and parse raw logs, extract relevant fields, reduce schema management overhead, and surface significant events like critical errors and anomalies.</em></p>
<p>This led us to make logs investigation-ready from the start, rather than forcing the Site Reliability Engineer to fight their data. But in order to enable such an experience, we had to carefully rethink a core concept and step in the process: Processing.</p>
<h2>Designing Processing UX in Elastic Streams</h2>
<p>Logs are powerful, but only if they are structured correctly. Today, a user onboarding logs via Elastic Agent with a custom integration would extract something as simple as an IP field by:</p>
<ul>
<li>Writing GROK patterns</li>
<li>Creating pipelines</li>
<li>Managing mappings</li>
<li>Testing transformations</li>
<li>Iterating repeatedly</li>
</ul>
<p>What sounds simple requires 20+ steps — and deep expertise most teams shouldn’t need. Our goal was clear: make this dramatically simpler.</p>
<p>Our early design question was:</p>
<p><em>“Can we reduce this experience to 2 meaningful steps instead of 20 technical ones?”</em></p>
<p>That question shaped how we approached the Stream UX.</p>
<h3>The Foundation</h3>
<p>Before we jumped into designing the UI in <a href="https://www.elastic.co/kibana">Kibana</a>, we defined a core mental model. </p>
<p>A <a href="https://www.elastic.co/elasticsearch/streams">Stream</a> is a collection of documents stored together that share:</p>
<ul>
<li>Retention</li>
<li>Configuration</li>
<li>Mappings</li>
<li>Processing rules</li>
<li>Lifecycle behaviour</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/1.png" alt="stream-architecture" /></p>
<p>The key design principle:</p>
<p><em>“A Stream should contain data that behaves consistently.”</em></p>
<h3>Why Does Data Consistency Matter?</h3>
<p>We started with an example to test our thinking. Take Nginx access and error logs.</p>
<p>Access logs describe request/response events:</p>
<p><code>192.168.1.10 - - [16/Feb/2026:12:32:10 +0000] &quot;GET /api/orders/123 HTTP/1.1&quot; 200 532 &quot;-&quot; &quot;Mozilla/5.0&quot;</code></p>
<p>Error logs describe diagnostic events:</p>
<p><code>2026/02/16 12:32:10 [error] 2719#2719: *342 connect() failed (111: Connection refused) while connecting to upstream…</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/2.png" alt="log-example" /></p>
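<p>To make the manual effort concrete: parsing just the access log line above traditionally means hand-writing a GROK pattern along these lines (a sketch; the field names are chosen for illustration):</p>
<pre><code>%{IPORHOST:client.ip} - %{DATA:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{DATA:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int} &quot;%{DATA:http.request.referrer}&quot; &quot;%{DATA:user_agent.original}&quot;
</code></pre>
<p>Multiply that by every log format in an environment, plus the mappings and tests around it, and the 20-step estimate starts to look conservative.</p>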
<p>If both live in the same Stream, that can cause:</p>
<ul>
<li>Processing logic conflicts</li>
<li>Field divergence</li>
<li>Mapping conflicts</li>
<li>Fundamentally harder investigations</li>
</ul>
<p>That insight clarified something critical: </p>
<p><strong>“<em>Processing isn’t just about extracting fields. It’s about protecting consistency.”</em></strong></p>
<h3>Making Complexity Manageable</h3>
<p>The ingest ecosystem isn’t small, simple, or hypothetical. Real pipelines use dozens of processors — from common ones like <code>rename</code>, <code>set</code>, <code>convert</code>, and <code>append</code>, to niche types like <code>urldecode</code> and <code>network_direction</code>.</p>
<p>The UI had to support both high-frequency actions and long-tail edge cases without losing structure. Currently Elasticsearch supports over <a href="https://www.elastic.co/docs/reference/enrich-processor">40 different ingest processors</a>. We had to make sure our interface could handle the different types.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/3.png" alt="card-sample" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/4.png" alt="processor-panel" /></p>
<p>We introduced a clear, nested structure for pipeline steps. Users could create, reorder, edit, or remove individual steps or grouped ones with confidence. The <a href="https://eui.elastic.co/docs/patterns/nested-drag-and-drop/">nested drag and drop</a> capability was also added as a pattern in our EUI library.</p>
<p>This gave us the context and foundation to work on integrating those concepts into a model that would be definitive for everything in Streams.</p>
<h3>Page Archetypes</h3>
<p>Processing is powerful - and risky. Changing a parsing condition or step might affect:</p>
<ul>
<li>Field availability</li>
<li>Search behaviour</li>
<li>Alerts</li>
<li>AI Insights</li>
<li>Investigations</li>
</ul>
<p>So we asked ourselves: how do we make something this powerful and important safe for the user? The answer led to a core page archetype:</p>
<p><strong>Create &gt; Preview &gt; Confirm</strong></p>
<p>This wasn’t a UI pattern added later. It emerged directly from our concept work and understanding what users would have to deal with.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/5.png" alt="create-preview-confirm" /></p>
<p>To support this archetype and core idea, we also introduced a split-screen structure.</p>
<p><strong>Left: Build</strong></p>
<p>This is where users would:</p>
<ul>
<li>Add processing steps</li>
<li>Define conditions</li>
<li>Apply rules</li>
<li>Leverage AI suggestions both as a whole pipeline creation or individual steps like a GROK processor</li>
</ul>
<p>It remained focused, intentional and structured.</p>
<p><strong>Right: Preview</strong></p>
<p>This is where users would:</p>
<ul>
<li>See real-life log samples</li>
<li>See extracted fields in context</li>
<li>Get immediate feedback on changes, with insights about the percentage of matched and unmatched documents</li>
<li>Open an optional drilldown side panel on the right</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/6.png" alt="split-screen-application" /></p>
<p>The preview panel became the anchor of confidence. This was not about visual symmetry, but about reinforcing experimentation, giving control over errors, and reducing mistakes. Knowing that users might want to switch their focus from interaction to detailed preview, we made both panels resizable, unlocking more flexibility and control across use cases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/7.png" alt="stream-architecture" /></p>
<h3>AI Automation</h3>
<p>Streams is agentic and AI powered. That added another layer of complexity for the design, but also another opportunity to unlock even more power and insights from users' log data. </p>
<p>AI introduced a new tension: how do you accelerate processing without turning it into a black box?</p>
<p>We established a few guardrails:</p>
<ul>
<li>Clear, concise suggestions</li>
<li>Visible impact through matched document metrics</li>
<li>Inspectability</li>
<li>Alignment with the Create → Preview → Confirm model</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/8.png" alt="ai-in-split-screen-model" /></p>
<p>Processing UX became the bridge between automation and human in the loop. Log data is one of the most powerful investigation signals. Every design decision reinforced that belief.</p>
<h2>What We Learned</h2>
<p>Designing for the future does not start with screens. It starts with:</p>
<ul>
<li>Edge case testing</li>
<li>Clear mental models</li>
<li>Strong and guiding principles</li>
<li>Behavioral consistency</li>
<li>Scalable and stress-tested archetypes</li>
</ul>
<p>We know that for users to unlock insightful discoveries from their logs, they need to process and manage their data effectively. We knew we were shaping their entire observability foundation.</p>
<p>Processing is about trust, control, and scalable data management.</p>
<p>Trust enables investigation speed.</p>
<p>Investigation speed enables resilience.</p>
<h2>Learn more</h2>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a>, and try Elastic's Serverless offering, which lets you play with all of the Streams functionality.
Want to know more about Streams? Check out the links below:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams"><em>Retention management</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Check the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/11.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Developer's Guide to Easy Ops: Demystifying OpenTelemetry's Magic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/developers-guide-to-easy-ops</link>
            <guid isPermaLink="false">developers-guide-to-easy-ops</guid>
            <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A Go-based Developer's 101 Guide to Easy Ops with OpenTelemetry and Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>The Introduction: From Code to Dash, Demystified</h1>
<p>Observability for developers has lately been distilled into implementing auto-instrumentation, allowing you to instantly connect your code with the larger observability world. This way of utilizing an upstream SDK is certainly the simplest and most production-ready, and works efficiently with the <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">Elastic Cloud Managed OTLP Endpoint</a>.</p>
<p>But what if you could not only add powerful tracing to your Go service but also <em>truly</em> understand how the magic works, rather than just copy-pasting configuration files or a line of code? In the same way that you build your knowledge of software development systems, observability, modernized by OpenTelemetry (OTel) standardization, is a rich, broad system that is valuable to understand. Here is an in-depth technical breakdown of every piece of simple OTel instrumentation using the Elastic Distributions of OpenTelemetry (EDOT) and Golang, from the ground up.</p>
<p>Telemetry is the automated collection, transmission and analysis of data from your application, which can apply to any observable distributed system. This data can range from regular health check calls with your application to real-time information about user interactions, requests, and transactions. Using the example application repository <a href="https://github.com/sophia-solo/otel-go-demo">here</a>, we’ll build a strong observability foundation to start observing our applications with confidence.</p>
<h2>Understanding the OpenTelemetry Flow</h2>
<p>Below, you will see the basic flow of your data when implementing observability with OTel in your system. Before we dive in, let’s go over the key players you need in order to implement observability solutions with OTel:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/spans"><strong>Span</strong></a>: This is a single, timed unit of a distributed trace that can represent a specific operation, such as a database query or an HTTP handler.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/traces"><strong>Trace</strong></a>: This is a detailed record of a single request’s journey through your system, AKA a hierarchy of your spans.</p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/specs/otel/trace/api/#tracer"><strong>Tracer</strong></a>: This is the handle for generating spans. You will typically have one per instrumentation library, for example <code>myapp/http</code>.</p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/specs/otel/trace/api/#tracerprovider"><strong>Tracer Provider</strong></a>: This is the cornerstone of the SDK. It creates Tracer instances, and you configure it at application startup.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/reference/apm/agents/go/custom-instrumentation-propagation"><strong>Context Propagation</strong></a>: The mechanism for passing trace context between operations and services, maintaining the relationship between parent and child spans.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/deploy-manage/monitor/stack-monitoring/es-monitoring-exporters"><strong>Exporter</strong></a>: This is the part that is responsible for sending your telemetry data to a vendor backend, and you can decide if you are sending it to the OTel Collector, EDOT Collector or an OTLP Endpoint.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/otel-flow.png" alt="Go OpenTelemetry App Flow" /></p>
<h2>Installing the Magic, Instrumentation Style</h2>
<p>OpenTelemetry provides instrumentation libraries that handle much of the tracing complexity for you. These libraries wrap common frameworks and libraries (like <code>net/http/otelhttp</code>) and automatically capture telemetry without requiring you to manually create spans for every operation.</p>
<p>However, before you're able to send any telemetry, OTel needs to know <em>who</em> (which service) is sending that data.</p>
<p>A <a href="https://opentelemetry.io/docs/concepts/resources/">resource</a> represents the specific entity, in this case <code>&quot;simple-go-service&quot;</code>, that is producing your telemetry data. Its identity is recorded as resource attributes, which can include pod names, service names or instances, and deployment environments: basically <em>anything</em> important to identifying your resource. This resource is your service's identity card, attached along with its attributes to every span and metric the service emits. Once your trace arrives, these attributes can answer <em>&quot;what version was running?&quot;</em> or <em>&quot;which service is this from?&quot;</em></p>
<pre><code>func initOTel(ctx context.Context, endpoint string) (func(context.Context) error, error) {
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(&quot;simple-go-service&quot;),
			semconv.ServiceVersion(&quot;1.0.0&quot;),
		),
	)
	if err != nil {
		return nil, err
	}
</code></pre>
<p>In the code above, <code>resource.New()</code> constructs the &quot;identity card&quot; of our Go service. The attributes attached to it use semantic conventions (<code>semconv</code>): standardized names for common metadata fields. These <a href="https://opentelemetry.io/docs/concepts/semantic-conventions/">semantic conventions</a> make sure that every OTel-compatible observability backend knows their meaning.</p>
<p>Now that we've bootstrapped our application with the <code>initOTel</code> function, we can continue to configure everything else!</p>
<p>Let’s begin instrumenting this application by building all the app components that we will need to implement modern observability tools. Below is our instrumentation using <code>otelhttp</code>, which will handle span creation after calling the specified API routes. </p>
<pre><code>http.Handle(&quot;/hello&quot;, otelhttp.NewHandler(http.HandlerFunc(handleHello), &quot;hello&quot;))
http.Handle(&quot;/api/data&quot;, otelhttp.NewHandler(http.HandlerFunc(handleData), &quot;data&quot;))
http.HandleFunc(&quot;/health&quot;, handleHealth)

// Example of a tracer within our handleHello() function
tracer = tp.Tracer(&quot;simple-go-service&quot;)

ctx, span := tracer.Start(ctx, &quot;process-hello&quot;)
defer span.End()
</code></pre>
<p>The key insight here is that <code>otelhttp.NewHandler</code> handles all the span lifecycle management for HTTP requests. You don't need to manually call <code>tracer.Start()</code> or <code>span.End()</code> for basic HTTP tracing since the library does this for you.</p>
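<p>To demystify what that wrapping looks like, here is a stdlib-only sketch of the middleware pattern that <code>otelhttp.NewHandler</code> is built on: wrap a handler, run code before and after the inner call. That before/after window is exactly where the real library starts and ends the span; this sketch only records timing and is not the actual otelhttp implementation.</p>
<pre><code>package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timed wraps a handler and measures the time around the inner call.
// otelhttp.NewHandler follows the same shape, but starts a span before
// next.ServeHTTP and ends it (recording status, duration, etc.) after.
func timed(next http.Handler, name string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r) // the span would cover exactly this call
		fmt.Printf("%s took %v\n", name, time.Since(start))
	})
}

func main() {
	h := timed(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}), "hello")

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/hello", nil))
	fmt.Println(rec.Body.String())
}
</code></pre>
<p>The wrapper never touches the response body, which is why instrumenting a handler this way is safe to add to existing routes.</p>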
<p>At application startup, the SDK uses the tracer provider set up below to create Tracer instances. These instances create and manage the spans contained within traces.</p>
<pre><code>traceExporter, err := otlptracegrpc.New(ctx,
	otlptracegrpc.WithEndpoint(endpoint),
	otlptracegrpc.WithInsecure(),
)
if err != nil {
	return nil, err
}

tp := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(traceExporter),
	sdktrace.WithResource(res),
)
otel.SetTracerProvider(tp)
tracer = tp.Tracer(&quot;simple-go-service&quot;)
</code></pre>
<p>Within our <code>initOTel</code> function, we will set up one of our most important signals: logs. First, we initialize the <code>logExporter</code> that will send logs to our OTel Collector using the gRPC protocol. Then the <code>LoggerProvider</code> wraps the <code>logExporter</code> in a batch processor that groups log entries together before sending them to the exporter, attaching resource metadata about the service along the way. Lastly, we create a standard Go structured logger (<code>slog</code>) on top of the <code>LoggerProvider</code>; it automatically includes trace context (such as span IDs) and batches your log with other logs. These are sent to your observability backend through the exporter along with your metrics and traces.</p>
<pre><code>logExporter, err := otlploggrpc.New(ctx,
		otlploggrpc.WithEndpoint(endpoint),
		otlploggrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	lp := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(logExporter)),
		sdklog.WithResource(res),
	)
	logger = slog.New(otelslog.NewHandler(&quot;simple-go-service&quot;, otelslog.WithLoggerProvider(lp)))
</code></pre>
<p>Below you can see how you can view your logs through Kibana in the APM UI. These logs are also color-coded: warnings are in yellow, errors are in red, and regular logs are in green.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/log-viewer.png" alt="Viewing your logs in the APM UI" /></p>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/metrics">Metrics</a> are set up in the next part of our code. Metrics are telemetry signals that track quantitative data from your application, such as response times and request counts. The metric exporter is initialized to send metric data to our EDOT Collector, and then on to our observability backend, Elastic Observability in this case, using gRPC. The meter provider in the next portion periodically collects and exports our metrics data and measurements, just as the tracer provider creates tracers. The key difference between the two providers is that the meter provider works on a timer, while the tracer provider exports spans as they complete.</p>
<pre><code>metricExporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(endpoint),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	mp := metric.NewMeterProvider(
		metric.WithReader(metric.NewPeriodicReader(metricExporter)),
		metric.WithResource(res),
	)
	otel.SetMeterProvider(mp)

	meter := mp.Meter(&quot;simple-go-service&quot;)
	requestCounter, _ = meter.Int64Counter(&quot;http.requests&quot;)
	requestDuration, _ = meter.Float64Histogram(&quot;http.duration&quot;)
</code></pre>
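<p>To make that timer-based model concrete, here is a stdlib-only sketch (not the SDK's implementation) of how a counter instrument and a periodic reader divide the work: the instrument only accumulates in memory, and the reader drains the current value on each tick. Flushing is invoked manually here for determinism instead of on a real timer.</p>
<pre><code>package main

import (
	"fmt"
	"sync/atomic"
)

// requestCount accumulates measurements in memory, like an
// Int64Counter instrument does inside the SDK.
var requestCount atomic.Int64

// recordRequest is what an instrumented handler would do per request.
func recordRequest() { requestCount.Add(1) }

// flush is what the periodic reader does on each tick: read the
// current value and hand it to the exporter.
func flush(name string) string {
	return fmt.Sprintf("%s=%d", name, requestCount.Load())
}

func main() {
	for i := 0; i != 5; i++ {
		recordRequest() // one increment per simulated request
	}
	fmt.Println(flush("http.requests")) // http.requests=5
}
</code></pre>
<p>Spans, by contrast, are pushed to the batcher as each one ends, which is why traces appear as requests complete while metrics arrive at the reader's interval.</p>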
<p>In order to finish initializing OpenTelemetry, we set up our propagators for context propagation. The text map propagator automatically injects your service's trace ID and span ID into outbound HTTP requests to other services, following the <a href="https://www.w3.org/TR/trace-context/">W3C Trace Context</a> standard. In short, this maintains the parent-child relationship between spans.</p>
<pre><code>otel.SetTextMapPropagator(propagation.TraceContext{})

	return func(ctx context.Context) error {
		tp.Shutdown(ctx)
		mp.Shutdown(ctx)
		lp.Shutdown(ctx)
		return nil
	}, nil
</code></pre>
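<p>The injected <code>traceparent</code> header has a fixed W3C layout: a version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags, all hex-encoded and dash-separated. Here is a stdlib-only sketch that builds one by hand purely to illustrate the wire format; in practice the SDK's propagator constructs and parses this for you.</p>
<pre><code>package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newTraceparent builds a W3C traceparent header value by hand:
// version "00", a random 16-byte trace ID, a random 8-byte parent
// span ID, and the trace flags ("01" = sampled), all hex-encoded.
func newTraceparent() string {
	traceID := make([]byte, 16)
	spanID := make([]byte, 8)
	rand.Read(traceID) // crypto/rand; failures are not expected here
	rand.Read(spanID)
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(traceID), hex.EncodeToString(spanID))
}

func main() {
	// Always 55 characters: 2 + 1 + 32 + 1 + 16 + 1 + 2.
	fmt.Println(newTraceparent())
}
</code></pre>
<p>A downstream service extracts this header, adopts the trace ID, and records the span ID as its parent, which is how the cross-service hierarchy is stitched together.</p>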
<p>Now that you know how these pieces work together, try to run the repository linked <a href="https://github.com/sophia-solo/otel-go-demo">here</a>, using the readme as your guide.</p>
<h3>Sidenote: Adding Custom Spans</h3>
<p>For getting an application emitting traces, this instrumentation works great! If you visit <code>localhost:8080/hello</code> after starting the Docker containers, the <code>otelhttp</code> middleware automatically creates spans for each HTTP request. However, basic instrumentation only shows essential application telemetry, such as response duration, URL paths, and status codes. You won’t know what happens between the request coming in and the request completing. OpenTelemetry truly gains power when you add custom spans. Unlike auto-instrumentation, where spans are created and closed automatically, custom spans require you to explicitly start and stop them.</p>
<p>Custom spans can track your application’s logic, such as specific business events or marking expensive operations, using a detailed hierarchy within each trace. In the <a href="https://github.com/sophia-solo/otel-go-demo">application</a> for this article, there are several custom spans that were created to track important operations:</p>
<ul>
<li>
<p><code>background-work</code>: This traces asynchronous processing that happens with the main request.</p>
</li>
<li>
<p><code>computation</code>: This measures a computation and then captures the result and the computation type.</p>
</li>
</ul>
<p>Custom spans add granular visibility into your application's behavior. For example, in <code>performComputation</code>:</p>
<pre><code>ctx, span := tracer.Start(ctx, &quot;computation&quot;)
defer span.End()

result := rand.Float64()
span.SetAttributes(
	attribute.String(&quot;comp.type&quot;, compType),
	attribute.Float64(&quot;comp.result&quot;, result),
)

logger.InfoContext(ctx, &quot;Computation completed&quot;, &quot;type&quot;, compType, &quot;result&quot;, result)

if result &lt; 0.3 {
	span.AddEvent(&quot;Low confidence result&quot;)
	logger.WarnContext(ctx, &quot;Low confidence computation&quot;, &quot;result&quot;, result)
}
</code></pre>
<p>The attributes set above become searchable and filterable in our Elastic Observability backend, allowing you to filter by <code>comp.type</code> and <code>comp.result</code>. If you query your data with “show me all computations where results are less than 0.3,” you will notice the span event added by <code>span.AddEvent(&quot;Low confidence result&quot;)</code> tacked on as a timestamped marker. This appears on your trace timeline as well, adding even more visibility to any unusual events. Below is a small example of the filtering that Kibana can accomplish from custom spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/computations.png" alt="Filtering attribute.Result to review borderline Low Confidence results" /></p>
<h1>The Data Pipeline: From Code to IRL</h1>
<p>Now that your application can export custom spans and telemetry over OTLP, the best hub for that data is the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>. It is a simple, standalone process that is able to receive, process, and export all of your telemetry data. Within this project, we use the Elastic Distributions of OpenTelemetry (<a href="https://www.elastic.co/docs/solutions/observability/get-started/opentelemetry/quickstart/self-managed/docker">EDOT</a>) Collector, a Collector optimized for use within your Elastic Stack. Since this is a self-managed Elastic instance, this article and the connected repository utilize the EDOT Collector through <code>elasticapm</code>, but for Elastic Cloud or Serverless projects, you can use the Elastic Managed OpenTelemetry Protocol (OTLP) Endpoint. As noted in the quickstart documentation <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">here</a>, the Elastic Cloud Managed OTLP Endpoint gets your data quickly and efficiently into your Elastic Stack through OTLP, without schema translation! This means that your telemetry hits Elastic instantly and your telemetry data remains vendor-neutral.</p>
<p>For most developers and SREs, this Collector is an amazing tool. It allows you to decouple your code from the observability backend: your application does not need to know its final destination, it can just send the data to the Collector, and your observability backend can change without touching your code. The OpenTelemetry Collector also acts as a gateway for multiple streams of data, and is able to accept various formats in order to unify them for export. Lastly, the OpenTelemetry Collector is able to offload processing work from your application: tasks such as retries, batching, and filtering can happen in the Collector, not in your application.</p>
<p>After trying out this article’s repository, try auto-instrumenting your application with <a href="https://www.elastic.co/docs/reference/opentelemetry">Elastic Distributions of OpenTelemetry</a> (EDOT) so that you can utilize the APM UI to its full potential! With <a href="https://github.com/elastic/start-local"><code>start-local</code></a>, you can use <a href="https://www.docker.com/">Docker</a> to install and run the latest versions of Elasticsearch and Kibana and instantly start monitoring your application.</p>
<h2>Understanding the Collector Configuration</h2>
<p>The Collector's behavior is defined in a configuration file (<code>otel-collector-config.yaml</code>). Let's break down each component.</p>
<p><strong>Receivers</strong> define how the Collector accepts telemetry data. Here, we're listening for both gRPC and HTTP traffic.</p>
<pre><code>receivers:
  # Receives data from other Collectors in Agent mode
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
</code></pre>
<p><strong>Connectors</strong> are specialized components that sit between pipelines; in this case, we are using the <code>elasticapm</code> Connector. This APM Connector exports our metrics, logs, and traces while simultaneously acting as a receiver for the <code>metrics/aggregated-otel-metrics</code> pipeline (see below). Without it, your raw OTLP data lands in Elasticsearch, but the APM UI has nothing to build its views from.</p>
<pre><code>connectors:
  elasticapm: {} # Elastic APM Connector
</code></pre>
<p><strong>Processors</strong> transform, filter, or enrich data as it passes through the EDOT Collector. The batch processor aggregates spans before export, reducing network overhead and improving efficiency, as well as limiting batch sizes. The batch/metrics processor does the same for APM metrics. Lastly, there is the Elastic APM processor. This processor ensures that your span fields are aligned and your trace views are complete; overall, it bridges the gap between Elastic's expectations and OpenTelemetry's formatting of your traces.</p>
<pre><code>processors:
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500
  batch/metrics:
    send_batch_max_size: 0 # Explicitly set to 0 to avoid splitting metrics requests
    timeout: 1s
  elasticapm: {} # Elastic APM Processor
</code></pre>
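<p>As a rough mental model (not the Collector's actual implementation), the batch processor's semantics can be sketched like this: buffer incoming items, flush when the batch size or the timeout is reached, and split oversized flushes at <code>send_batch_max_size</code>:</p>

```python
import time

class BatchProcessor:
    """Toy model of the Collector's batch processor semantics:
    flush when send_batch_size is reached or when the timeout elapses."""

    def __init__(self, send_batch_size=1000, timeout=1.0, send_batch_max_size=1500):
        self.send_batch_size = send_batch_size
        self.timeout = timeout
        self.send_batch_max_size = send_batch_max_size
        self.buffer = []
        self.last_flush = time.monotonic()
        self.flushed = []  # batches handed on to the exporters

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()

    def tick(self):
        """Called periodically; flushes whatever is buffered once the timeout elapses."""
        if self.buffer and time.monotonic() - self.last_flush >= self.timeout:
            self.flush()

    def flush(self):
        # Split into chunks no larger than send_batch_max_size (0 means no splitting,
        # matching the batch/metrics configuration above).
        max_size = self.send_batch_max_size or len(self.buffer)
        while self.buffer:
            self.flushed.append(self.buffer[:max_size])
            self.buffer = self.buffer[max_size:]
        self.last_flush = time.monotonic()

# Tiny sizes so the behavior is visible: flush at 3 items, split batches at 2.
bp = BatchProcessor(send_batch_size=3, timeout=0.0, send_batch_max_size=2)
for span in range(5):
    bp.add(span)
```

After the loop, the first three spans have been flushed as two batches of at most two items, while the remaining two wait in the buffer for the next size or timeout trigger.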
<p>As mentioned previously in the article, <strong>exporters</strong> send data to your observability backend. The debug exporter logs telemetry to the console (useful for development), while the Elasticsearch exporter sends traces to your Elastic stack.</p>
<pre><code>exporters:
  debug: {}
  elasticsearch/otel:
    endpoints:
      - ${ELASTIC_ENDPOINT} # Will be populated from environment variable
    user: elastic
    password: ${ELASTIC_PASSWORD}
    tls:
      ca_file: /config/certs/ca/ca.crt
    mapping:
      mode: otel
</code></pre>
<p><strong>Pipelines</strong> connect receivers, processors, and exporters into a data flow. These EDOT Collector pipelines receive OTLP data, batch it, and export it: traces and logs go to the <code>debug</code>, <code>elasticapm</code>, and <code>elasticsearch/otel</code> exporters, while metrics go to the <code>debug</code> and <code>elasticsearch/otel</code> exporters.</p>
<pre><code>service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [debug, elasticsearch/otel]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, elasticapm, elasticsearch/otel]
    traces:
      receivers: [otlp]
      processors: [batch, elasticapm]
      exporters: [debug, elasticapm, elasticsearch/otel]
    metrics/aggregated-otel-metrics:
      receivers:
        - elasticapm
      processors: [] # No processors defined in the original for this pipeline
      exporters:
        - debug
        - elasticsearch/otel
</code></pre>
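<p>Conceptually, each pipeline is just fan-out plumbing: data from the receivers passes through the processors in order and is then handed to every exporter. A toy sketch (the component names mirror the config above; the logic is illustrative, not the Collector's code):</p>

```python
def run_pipeline(data, processors, exporters):
    """Pass data through each processor in order, then fan out to all exporters."""
    for process in processors:
        data = process(data)
    return {name: data for name in exporters}

# Illustrative stand-in for the "batch" processor: group items into one batch.
def batch(items):
    return [items]

traces = ["span-a", "span-b"]
routed = run_pipeline(traces, [batch], ["debug", "elasticapm", "elasticsearch/otel"])
# Every exporter receives the same processed payload.
```

This is why adding or removing an exporter in the config changes where data lands without touching how it is received or processed.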
<h2>Debugging Your Code with Confidence in Kibana</h2>
<p>Elastic Observability, through Kibana and Streams, has native support for the OTLP Endpoint via the EDOT Collector, which was used in this project. Below, you can see that your data is automatically connected to Streams from the beginning, requiring no extra legwork! You can add conditions or Grok processors as your data streams in, and you'll instantly see your data's schema and data quality.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/streams-connection.png" alt="Streams built-in connection" /></p>
<p>Elastic also provides the Elastic Cloud Managed OTLP Endpoint for even easier storage, data processing, and scaling. With this managed endpoint, you can configure OpenTelemetry to send data directly to Elasticsearch, without any specialized Collectors. Whichever way you choose, once your traces are flowing, Kibana’s APM UI provides the powerful visualization and analysis capabilities you need to debug your code. You can drill down into individual requests, identify bottlenecks, find anomalies, and troubleshoot any issues that arise with confidence.</p>
<p>Here is one span of interest from this repository. Within Kibana, you can immediately filter by the Trace ID, finding other spans with the same Trace ID to visually see the entire trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/pre-filter-traces.png" alt="A span of interest among many" />
<img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/post-filter-traces.png" alt="The entire trace of the span" /></p>
<p>Kibana Discover also allows you to switch indices instantly without losing your filters, ensuring that you can also see the logs that correspond with the same Trace ID.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/log-trace.png" alt="Logs matching the Trace ID" /></p>
<p>In addition to manually checking your traces, you can have them correlated automatically within the APM UI (shown below). This easy trace visualization in the Kibana APM UI is readily available when using the <code>elasticapm</code> connector. Below is a visualization of a trace composed of spans from our project. Knowing both methods of correlating spans builds a solid foundation for using Kibana and the APM UI for observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/automatic-apm-trace.png" alt="Automatic trace span hierarchy in Kibana APM" /></p>
<p>Here is a fully built-out dashboard created from the repository featured in this article. The possibilities with Elastic Observability are endless!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/kibana-dashboard.png" alt="Full Kibana Dashboard" /></p>
<h2>Congrats, You’re Not “Just” a Developer Anymore!</h2>
<p>We’ve broken down the why and how behind OpenTelemetry’s basic components, including the TracerProvider, the span, the exporter, and the Collector. Here, you’ve done more than just implement a tracing tool: you now understand the complete data flow from your code to the graphs on your dashboard.</p>
<p>You can now speak the language of observability with confidence, not because you memorized a configuration file, but because you understand how telemetry moves through your system. You aren’t “just” a developer anymore; you’re now a developer who can truly see.</p>
<p>Try out the code repo above! Included in the <a href="">repository</a> is a <code>generate-traffic.sh</code> script. You can run it repeatedly to generate logs, traces, and metrics to play with in the APM UI. Also, check out our <a href="https://www.elastic.co/docs/release-notes/elasticsearch">release notes</a> page for the latest exciting Elastic updates.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The DNA of DATA Increasing Efficiency with the Elastic Common Schema]]></title>
            <link>https://www.elastic.co/observability-labs/blog/dna-of-data</link>
            <guid isPermaLink="false">dna-of-data</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic ECS helps improve semantic conversion of log fields. Learn how quantifying the benefits of normalized data, not just for infrastructure efficiency, but also data fidelity.]]></description>
<content:encoded><![CDATA[<p>The Elastic Common Schema is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users face a lower barrier when interpreting events of interest, resolving incidents, or hunting for unknown threats. However, there are also underlying infrastructure reasons that justify adopting the Elastic Common Schema.</p>
<p>In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source leveraged in this blog is a 3.3GB Nginx log file obtained from Kaggle. This dataset is represented in three categories: raw, with zero normalization; self, demonstrating commonly implemented mistakes I have observed over 5+ years of working with various users; and ECS, the optimal approach to data hygiene.</p>
<p>This hygiene is achieved through the parsing, enrichment, and mapping of ingested data, akin to sequencing DNA in order to express genetic traits. By understanding the data's structure and assigning the correct mapping, a more thorough expression of the data may be represented, stored, and searched upon.</p>
<p>If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/introducing-the-elastic-common-schema">Introducing the Elastic Common Schema</a></p>
</li>
<li>
<p><a href="https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs">Kaggle Web Server Logs</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a></p>
</li>
</ul>
<h2>Dataset Validation</h2>
<p>Before we begin, let us review how many documents exist and what we're required to ingest. We have 10,365,152 documents/events from our Nginx log file:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/access-logs.png" alt="nginx access logs" /></p>
<p>With 10,365,152 documents in our targeted end-state:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/end-state.png" alt="end state" /></p>
<h2>Dataset Ingestion: Raw &amp; Self</h2>
<p>To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. For the raw data ingest, we use a simple file input with no additional modifications or index templates:</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/raw/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-raw&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<p>For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/self/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
      grok {
        match =&gt; { &quot;message&quot; =&gt; &quot;%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \&quot;(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\&quot; (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}&quot; }
      }
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://myscluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-self&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
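<p>The Grok pattern above can be approximated with a named-group regex in Python; note that everything it extracts is a string, which matters later when we look at mappings. The sample log line below is invented for illustration:</p>

```python
import re

# Rough Python equivalent of the Grok pattern in the Logstash filter above.
NGINX_RE = re.compile(
    r'(?P<clientip>\S+) - (?P<requestClient>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<requestMethod>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d+) (?P<bytes_in>\d+) "(?P<bytes_out>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('203.0.113.7 - - [25/Sep/2024:12:00:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 4523 "-" "curl/8.0"')

fields = NGINX_RE.match(line).groupdict()
# Without explicit mappings, every extracted value is indexed as text:
all_strings = all(isinstance(v, str) for v in fields.values())
```

The parse succeeds, but numbers like <code>response</code> and <code>bytes_in</code> are still strings; without an index template, Elasticsearch maps them as text.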
<h2>Dataset Ingestion: ECS</h2>
<p>Elastic comes with many available integrations which contain everything you need to ensure that your data is ingested as efficiently as possible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/integrations.png" alt="integrations" /></p>
<p>For our use case of Nginx, we'll be using the associated integration's assets only.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-integration.png" alt="nginx integration" /></p>
<p>The installed assets are more than just dashboards: there are ingest pipelines which not only normalize but also enrich the data, while component templates simultaneously map the fields to their correct types. All we have to do is make sure that, as the data comes in, it traverses the ingest pipeline and uses these supplied mappings.</p>
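<p>Conceptually, such an ingest pipeline renames raw fields to their ECS equivalents and coerces types so aggregations work. The sketch below is a simplified stand-in for the integration's actual pipeline; the ECS field names are real, but the renaming table is abbreviated for illustration:</p>

```python
# Simplified stand-in for the Nginx integration's ingest pipeline:
# map self-parsed field names onto ECS fields and coerce numeric types.
ECS_RENAMES = {
    "clientip": "source.ip",
    "requestMethod": "http.request.method",
    "request": "url.original",
    "httpversion": "http.version",
    "user_agent": "user_agent.original",
}
ECS_INTEGERS = {
    "response": "http.response.status_code",
    "bytes_in": "http.response.body.bytes",
}

def to_ecs(raw: dict) -> dict:
    doc = {}
    for src, dest in ECS_RENAMES.items():
        if src in raw:
            doc[dest] = raw[src]
    for src, dest in ECS_INTEGERS.items():
        if src in raw:
            doc[dest] = int(raw[src])  # correct type enables avg/sum aggregations
    return doc

raw = {"clientip": "203.0.113.7", "requestMethod": "GET", "request": "/index.html",
       "httpversion": "1.1", "response": "200", "bytes_in": "4523",
       "user_agent": "curl/8.0"}
doc = to_ecs(raw)
```

Every source speaking the same field names is what makes one dashboard or detection rule work across all of them.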
<p>Create your index template, and select the supplied component templates provided from your integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs.png" alt="nginx-ecs" /></p>
<p>Think of the component templates like building blocks to an index template. These allow for the reuse of core settings, ensuring standardization is adopted across your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-template.png" alt="nginx-ecs-template" /></p>
<p>For our ingestion method, we merely point to the index name that we specified during index template creation, in this case <code>nginx-ecs</code>, and Elastic will handle the rest!</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/ecs/access.log&quot;
        #ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-ecs&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<h2>Data Fidelity Comparison</h2>
<p>Let's compare how many fields are available to search across the three indices, as well as the quality of the data. Our raw index has but 15 fields to search upon, with most being duplicates for aggregation purposes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw.png" alt="nginx-raw" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-1.png" alt="mapping-1" /></p>
<p>However, from a Discover perspective, we are limited to <code>6</code> fields!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw-discover.png" alt="nginx-raw-discover" /></p>
<p>Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self.png" alt="nginx-self" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-2.png" alt="mapping-2" /></p>
<p>From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mapping, the ease with which this data may be searched is less than ideal. A great example of this is attempting to calculate the average bytes_in on a text field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self-discover.png" alt="nginx-self-discover" /></p>
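<p>This failure mode is easy to reproduce outside Elasticsearch: averaging only works once the values are numbers, which is exactly what a correct mapping provides. A minimal illustration (the byte counts are made up):</p>

```python
# bytes_in as ingested by the self-parsed pipeline: text, not numbers.
bytes_in_text = ["4523", "812", "10240"]

try:
    avg = sum(bytes_in_text) / len(bytes_in_text)  # what an avg on a text field amounts to
except TypeError:
    avg = None  # aggregation is impossible on unparsed text

# With a numeric mapping (what the ECS component template provides):
bytes_in_long = [int(v) for v in bytes_in_text]
avg_long = sum(bytes_in_long) / len(bytes_in_long)
```

The same distinction in Elasticsearch is between a <code>text</code>/<code>keyword</code> mapping and a <code>long</code> mapping: only the latter supports metric aggregations like avg and sum.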
<p>Finally, with our ECS index, we have 71 fields available to us! Notice that, courtesy of the ingest pipeline, we have enriched fields of geographic information as well as event categorization fields.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-pipeline.png" alt="nginx-ecs-pipeline" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-3.png" alt="mapping-3" /></p>
<p>Now what about Discover? There are 51 fields directly available to us for searching purposes:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-discover.png" alt="nginx-ecs-discover" /></p>
<p>Using Discover as our basis, our self-parsed index offers 283% as many searchable fields as the raw index, whereas our ECS index offers 850%!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-1.png" alt="table-1" /></p>
<h2>Storage Utilization Comparison</h2>
<p>Surely, with all these fields, our ECS index must be dramatically larger than the self-normalized index, let alone the raw index? The results may surprise you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/total-storage.png" alt="total-storage" /></p>
<p>Accounting for the replica of our 3.3GB dataset, we can see that normalizing and mapping the data has a significant impact on the amount of storage required.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-2.png" alt="table-2" /></p>
<h2>Conclusion</h2>
<p>While any enriched dataset requires some additional storage, Elastic provides easy solutions that maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.</p>
<p>Let's review how we were able to maximize search while minimizing storage:</p>
<ul>
<li>Installing the integration assets for the dataset we are going to ingest.</li>
<li>Customizing the index template to leverage the included component templates, ensuring mapping and parsing are aligned to the Elastic Common Schema.</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/dna-of-data/dna-of-data.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[TLS Certificate Monitoring with the OpenTelemetry Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/edot-certificate-monitoring</link>
            <guid isPermaLink="false">edot-certificate-monitoring</guid>
            <pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to monitor TLS certificate expiration in Kubernetes clusters using the OpenTelemetry Collector, ensuring comprehensive visibility into both external and internal certificates, using Elastic Observability
]]></description>
            <content:encoded><![CDATA[<p>In modern distributed systems, TLS certificates are the glue that holds
everything together while keeping it safe. Certificates aren't only used for
encrypting user traffic; they are fundamental building blocks of trust for your
entire system.</p>
<p>Indeed, an expired certificate is <em>not</em> just a minor technical glitch.
It is a direct hit on your most critical systems:</p>
<ul>
<li>
<p>Your CI/CD pipeline grinds to a halt because it cannot trust the internal
image registry.</p>
</li>
<li>
<p>Your Single Sign-On (SSO) system fails, locking all your internal users out.</p>
</li>
<li>
<p>Your external clients see scary browser warnings, shattering user trust and
forcing support tickets.</p>
</li>
<li>
<p>Your SLOs burn due to services not being able to communicate with one another.</p>
</li>
</ul>
<p>In Kubernetes, certificates are usually dynamically generated and auto-renewed
by tools like <code>cert-manager</code>. In less fortunate scenarios, certificates might be
tucked away inside <code>Secrets</code> and <code>ConfigMaps</code>, making them hard to
inventory. It is neither rare nor unheard of to have a dozen critical
certificates and no centralized way to know when they are about to expire.</p>
<p>Additionally, only monitoring the certificates for external Load Balancers might
lead to huge <em>internal</em> risks, since many certificates never get exposed to
external users.</p>
<p>In this blog post, we will guide you through establishing comprehensive,
cluster-wide certificate monitoring using the OpenTelemetry Collector,
the <a href="https://github.com/enix/x509-certificate-exporter">x509-certificate-exporter</a>,
and Elastic Observability.</p>
<h2>Classical approach: HTTP monitoring</h2>
<p>The classical approach to monitoring TLS certificate expiration in Elastic
Observability is to treat it like any other service availability check. Historically,
this was accomplished using Heartbeat or, more recently, Elastic Observability's Synthetics.
These tools perform an external check against a public HTTPS endpoint and
automatically extract the certificate's validity dates, allowing you to
configure a
<a href="https://www.elastic.co/docs/solutions/observability/incident-management/create-tls-certificate-rule">Synthetics TLS certificate rule</a>
in Kibana to trigger an alert when expiration is within a specified threshold
(e.g., 30 days).</p>
<p>While effective for external-facing services, this &quot;classical&quot; approach has two
major shortcomings when dealing with Kubernetes:</p>
<ul>
<li>
<p>It only works for certificates exposed via HTTP(S), meaning you cannot use
this for internal services, databases, or message queues using other protocols.
In other words, this won't work to monitor common, critical TLS certificates
such as Kafka's.</p>
</li>
<li>
<p>The monitoring agent must have network access to the endpoint. In a segmented
or private Kubernetes environment, deploying agents with the necessary access
often introduces unnecessary complexity or security risks.</p>
</li>
</ul>
<p>To gain true cluster-wide visibility, we need to inspect the certificates at
their source: <em>inside</em> Kubernetes Secrets or ConfigMaps.</p>
<h2>A Kubernetes-native approach: monitor Secrets and ConfigMaps</h2>
<p>Monitoring TLS certificate expiration directly within Kubernetes Secrets and
ConfigMaps is the only reliable way to gain visibility into internal,
non-HTTP-exposed certificates, such as those used for service meshes, internal
registries, or databases. In this section, we will use the OpenTelemetry Collector to
monitor certificate expiration.</p>
<p>The OpenTelemetry Collector provides a mechanism to read
up-to-date information from the Kubernetes API, including Secrets, via the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver">k8sobjects receiver</a>.
However, this receiver only fetches <em>raw</em> TLS certificate resource data,
which the OpenTelemetry Transformation Language (OTTL) can not properly parse.
Therefore, we need to use a dedicated exporter to collect the certificate data
and expose the results in a digestible format.</p>
<h3>The industry-standard solution</h3>
<p>As mentioned above, simply reading certificate information from the Kubernetes API
is not a feasible solution. We will therefore use a specialized,
lightweight exporter (specifically, the popular
<a href="https://github.com/enix/x509-certificate-exporter">x509-certificate-exporter</a>)
to collect TLS certificate data and expose the results,
allowing the OpenTelemetry Collector's Prometheus receiver to seamlessly
scrape the data and send it to Elastic Observability.
This approach immediately and easily enables us to monitor both certificates
generated by <code>cert-manager</code> and self-managed ones, such as the ones created for
ECK.</p>
<p>A fully working configuration example and a script to set up a complete local
development environment are available <a href="https://github.com/elastic/edot-certificate-monitoring-blog-post">here</a>.
Feel free to use it to follow along as you read through this guide and try out the examples.
Please note that, while this repository uses the Elastic Distribution of OpenTelemetry (EDOT),
it can be easily adapted to use the OpenTelemetry Collector.</p>
<h4>Helm Chart Configuration</h4>
<p>We configured the <code>x509-certificate-exporter</code> with the official Helm Chart and
used the following minimal configuration:</p>
<pre><code class="language-yaml">secretsExporter:
  secretTypes:
  - type: kubernetes.io/tls
    key: tls.crt
  # For ECK that uses different secret types
  - type: Opaque
    key: tls.crt
  - type: Opaque
    key: ca.crt
  configMapKeys:
  - tls.crt
  - ca.crt

# Create a service to have a stable endpoint for scraping metrics
service:
  create: true
  # -- TCP port to expose the Service on
  port: 9793

# Disable prometheus service monitor and prometheus rules
prometheusServiceMonitor:
  create: false
prometheusRules:
  create: false
</code></pre>
<p>Refer to the reference <code>values.yaml</code> for insight into the plethora of
configuration options.</p>
<h4>OpenTelemetry Collector Configuration</h4>
<p>Afterward, we configured the OpenTelemetry Collector to scrape the metrics from the
service:</p>
<pre><code class="language-yaml">prometheus/cert-expiration:
  config:
    scrape_configs:
      - job_name: &quot;cert-expiration&quot;
        scrape_interval: 60m
        static_configs:
          - targets:
              - &quot;x509-certificate-exporter.monitoring.svc.cluster.local:9793&quot;
</code></pre>
<p>We deliberately used a long scrape interval of 60 minutes, because certificate
expiration is a low-frequency concern.</p>
<h4>Visualizing the data in Kibana</h4>
<p>Once the data is ingested, we can explore it using Discover. We can select the
<code>metrics-*</code> Data View and search for our
data with the filter <code>data_stream.dataset : &quot;prometheusreceiver.otel&quot;</code>.</p>
<p>An example document looks like the following:</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2025-12-19T09:43:45.317Z&quot;,
  &quot;_metric_names_hash&quot;: &quot;7d113f55b70019d9&quot;,
  &quot;attributes&quot;: {
    &quot;issuer_CN&quot;: &quot;tls-cert.example.com&quot;,
    &quot;issuer_O&quot;: &quot;TLS Cert&quot;,
    &quot;secret_key&quot;: &quot;tls.crt&quot;,
    &quot;secret_name&quot;: &quot;tls-cert-secret&quot;,
    &quot;secret_namespace&quot;: &quot;test-certs&quot;,
    &quot;serial_number&quot;: &quot;250887723804527203192865532237673843132727735771&quot;,
    &quot;subject_CN&quot;: &quot;tls-cert.example.com&quot;,
    &quot;subject_O&quot;: &quot;TLS Cert&quot;
  },
  &quot;data_stream&quot;: {
    &quot;dataset&quot;: &quot;prometheusreceiver.otel&quot;,
    &quot;namespace&quot;: &quot;default&quot;,
    &quot;type&quot;: &quot;metrics&quot;
  },
  &quot;metrics&quot;: {
    &quot;x509_cert_expired&quot;: 0,
    &quot;x509_cert_not_after&quot;: 1768488242,
    &quot;x509_cert_not_before&quot;: 1765896242
  },
  &quot;resource&quot;: {
    &quot;attributes&quot;: {
      &quot;server.address&quot;: &quot;x509-certificate-exporter.monitoring.svc.cluster.local&quot;,
      &quot;server.port&quot;: &quot;9793&quot;,
      &quot;service.instance.id&quot;: &quot;x509-certificate-exporter.monitoring.svc.cluster.local:9793&quot;,
      &quot;service.name&quot;: &quot;cert-expiration&quot;,
      &quot;url.scheme&quot;: &quot;http&quot;
    }
  },
  &quot;scope&quot;: {
    &quot;name&quot;: &quot;github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver&quot;,
    &quot;version&quot;: &quot;9.2.2&quot;
  }
}
</code></pre>
<p>The core metric reported by the <code>x509-certificate-exporter</code> is
<code>x509_cert_not_after</code>, which represents the Unix Epoch timestamp (in seconds) of the certificate's
expiration date. This metric has several attributes associated with it.
In the case of <code>Secrets</code>, the following attributes are relevant:</p>
<ul>
<li><code>secret_namespace</code>: The namespace of the Secret containing the certificate.</li>
<li><code>secret_name</code>: The name of the Secret containing the certificate.</li>
<li><code>secret_key</code>: The specific key within the Secret where the certificate is stored.</li>
</ul>
<p>In the case of <code>ConfigMaps</code>, we can infer the attributes of interest
from the <code>filepath</code> attribute.</p>
<p>Finally, we can leverage ES|QL to compute the remaining days until expiration.
In the following examples, we will use the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> command</a>,
which is optimized and recommended for interacting with time-series data.</p>
<p>For <code>Secrets</code>:</p>
<pre><code class="language-sql">TS metrics-*
| WHERE metrics.x509_cert_not_after is not NULL
| STATS expiration_date = MAX(LAST_OVER_TIME(metrics.x509_cert_not_after)) by attributes.secret_namespace, attributes.secret_name, attributes.secret_key
| EVAL remaining_days = DATE_DIFF(&quot;days&quot;, NOW(), TO_DATETIME (1000 * expiration_date))
| EVAL expiration_date = TO_DATETIME(1000 * expiration_date)
| SORT expiration_date ASC
</code></pre>
<p>And for <code>ConfigMaps</code>:</p>
<pre><code class="language-sql">TS metrics-*
| WHERE metrics.x509_cert_not_after IS NOT NULL
| WHERE attributes.filepath IS NOT NULL
| DISSECT attributes.filepath &quot;k8s/%{namespace}/%{configmap}&quot;
| WHERE configmap != &quot;kube-root-ca.crt&quot; // Filter out the Kubernetes API server certificate's signing CA
| STATS expiration_date = MAX(LAST_OVER_TIME(metrics.x509_cert_not_after)) by namespace, configmap, attributes.filename
| EVAL remaining_days = DATE_DIFF(&quot;days&quot;, NOW(), TO_DATETIME (1000 * expiration_date))
| EVAL expiration_date = TO_DATETIME(1000 * expiration_date)
| SORT expiration_date ASC
</code></pre>
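<p>The DISSECT step can be mirrored in Python to see what it extracts; the <code>k8s/&lt;namespace&gt;/&lt;configmap&gt;</code> shape follows the pattern used in the query above, and the helper function is illustrative:</p>

```python
def dissect_filepath(filepath: str):
    """Split a 'k8s/<namespace>/<configmap>' filepath, mirroring the DISSECT pattern."""
    prefix, namespace, configmap = filepath.split("/", 2)
    if prefix != "k8s":
        raise ValueError(f"unexpected filepath: {filepath!r}")
    return namespace, configmap

ns, cm = dissect_filepath("k8s/kube-system/extension-apiserver-authentication")
# Filter out the Kubernetes API server certificate's signing CA, as in the query:
keep = cm != "kube-root-ca.crt"
```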
<p>Based on these core queries, we can easily build a dashboard that shows the
remaining days until expiration for all the certificates in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-certificate-monitoring/dashboard.png" alt="Kibana Certificate Expiration Dashboard" /></p>
<p>and create alerts about certificates that are about to expire by adding a
condition after the query:</p>
<pre><code class="language-sql">WHERE remaining_days &lt; 30
</code></pre>
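<p>The remaining-days arithmetic used in these queries can be sanity-checked in plain Python: the metric is Unix seconds, so multiplying by 1000 yields a datetime, or the difference can simply be divided by 86,400 for days. The sample values come from the example document earlier; the helper function is illustrative:</p>

```python
from datetime import datetime, timezone

def remaining_days(not_after_epoch_s: int, now: datetime) -> float:
    """Days until certificate expiration, given x509_cert_not_after (Unix seconds)."""
    expiration = datetime.fromtimestamp(not_after_epoch_s, tz=timezone.utc)
    return (expiration - now).total_seconds() / 86_400

# Values from the example document above:
not_before = 1765896242  # x509_cert_not_before
not_after = 1768488242   # x509_cert_not_after

# Evaluated at issuance time, the certificate has its full lifetime left.
now = datetime.fromtimestamp(not_before, tz=timezone.utc)
days = remaining_days(not_after, now)  # 30.0: a 30-day certificate
```

An alerting threshold like <code>remaining_days &lt; 30</code> is then just a comparison on this value.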
<h3>Conclusion</h3>
<p>In this blog post, we explored how to monitor TLS certificate expiration
within a Kubernetes cluster using the OpenTelemetry Collector.
We discussed the limitations of traditional HTTP-based monitoring
approaches and introduced a Kubernetes-native solution leveraging the
<code>x509-certificate-exporter</code> to extract certificate expiration data directly from
Kubernetes Secrets and ConfigMaps. This method provides comprehensive visibility
into all certificates used within the cluster, including those not exposed via
HTTP(S).</p>
<p>For the sake of simplicity, we focused on monitoring certificate expiration
with the OpenTelemetry Collector on Kubernetes. However, this approach can easily be applied
with the classic Elastic Agent by leveraging the
<a href="https://www.elastic.co/docs/reference/integrations/prometheus_input">Prometheus input package</a>
(read more on how to use input packages
<a href="https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages">here</a>),
and it can also be extended to monitor certificates on virtual machines or
bare-metal servers by deploying the <code>x509-certificate-exporter</code> there.</p>
<p>Finally, it is worth knowing that Elastic Observability offers an officially supported
distribution of the OpenTelemetry Collector,
called <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry (EDOT)</a>.</p>
<p>If you are an Elastic user, consider using the EDOT Collector to monitor certificates with
OpenTelemetry: because it is supported by Elastic Observability, it is easier to manage and keep up to date. Alternatively, you can use upstream OTel components.</p>
<h3>What's next?</h3>
<p>Now that Elastic supports
<a href="https://www.elastic.co/docs/reference/fleet/alerting-rule-templates">Rule Templates</a>
and <a href="https://www.elastic.co/docs/solutions/observability/apm/opentelemetry">OpenTelemetry content packs</a>,
our near-term objective is to contribute to the integration repository to make
the setup of certificate monitoring even easier for our users.
Stay tuned for more updates on this!</p>
<p>Check out other resources on Elastic and OpenTelemetry:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">Elastic's OTLP Endpoint</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot">Elastic's EDOT PHP Contribution</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">OpenTelemetry SDK Central Management with EDOT</a></p>
<p>Also, sign up for <a href="https://cloud.elastic.co">Elastic Cloud</a> and try out your application with OpenTelemetry in Elastic.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/edot-certificate-monitoring/edot-certificate-monitoring.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Scale testing OpenTelemetry log ingestion on GCP with EDOT Cloud Forwarder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/edot-cloud-forwarder-gcp-load-testing</link>
            <guid isPermaLink="false">edot-cloud-forwarder-gcp-load-testing</guid>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how we load tested the EDOT Cloud Forwarder for GCP on Google Cloud Run and identified practical capacity limits per instance. We show how runtime tuning improves stability and translate the results into concrete configuration and scaling guidance.]]></description>
            <content:encoded><![CDATA[<p>EDOT Cloud Forwarder (ECF) for GCP is an event-triggered, serverless OpenTelemetry Collector deployment for Google Cloud. It runs the OpenTelemetry Collector on Cloud Run, ingests events from Pub/Sub and Google Cloud Storage, parses Google Cloud service logs into OpenTelemetry semantic conventions, and forwards the resulting OTLP data to Elastic, relying on Cloud Run for scaling, execution, and infrastructure lifecycle management.</p>
<p>To run ECF for GCP confidently at scale, you need to understand its capacity characteristics and sizing behavior. For ECF for GCP, which is part of the broader <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/gcp">ECF architecture</a>, we answered these questions through repeatable load testing, grounding every decision in measured data.</p>
<p>We'll introduce the test setup, explain each runtime setting, and share the capacity numbers we observed for a single instance.</p>
<h2>How we load tested EDOT Cloud Forwarder for GCP</h2>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/load-testing.png" alt="Load testing overview" /></p>
<p>The load testing architecture simulates a realistic, high-volume pipeline:</p>
<ol>
<li>We developed a load tester service that uploads generated log files to a GCS bucket as fast as possible.</li>
<li>Each file creation in this Google Cloud Storage (GCS) bucket then triggers an event notification to Pub/Sub.</li>
<li>Pub/Sub delivers push messages to a Cloud Run service where EDOT Cloud Forwarder fetches and processes these log files.</li>
</ol>
<p>Our setup exposes two primary tunable settings that directly influence Cloud Run scaling behavior and memory pressure:</p>
<ul>
<li>Request pressure using a concurrency setting (how many concurrent requests each ECF instance can handle).</li>
<li>Work per request using a log count setting (number of logs per file in each uploaded object).</li>
</ul>
<p>In our tests, we used a testing system that:</p>
<ul>
<li>Deploys the whole testing infrastructure. This includes the complete ECF infrastructure, a mock backend, etc.</li>
<li>Generates log files according to the configured log counts, using a Cloud Audit log of ~1.4 KB.</li>
<li>Runs a matrix of tests across all combinations of concurrency and log volume.</li>
<li>Produces a report for each tested concurrency level in which several stats are reported, such as CPU usage and memory consumption.</li>
</ul>
<p>For reproducibility and isolation, the <code>otlphttp</code> exporter in EDOT Cloud Forwarder uses a <strong>mock backend</strong> that always returns HTTP 200. This ensures all observed behavior is attributable to ECF itself, not downstream systems or network variability.</p>
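<p>As a sketch of the mock-backend idea (the actual test harness is internal; all names below are hypothetical), a minimal always-200 HTTP server in Go is enough to decouple the exporter from downstream variability:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newMockBackend returns a server that drains every request body and
// answers HTTP 200, so all observed behavior is attributable to the
// forwarder itself rather than the downstream system.
func newMockBackend() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body) // accept and discard the OTLP payload
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newMockBackend()
	defer srv.Close()

	// Simulate the otlphttp exporter posting a batch.
	resp, err := http.Post(srv.URL+"/v1/logs", "application/json", strings.NewReader(`{}`))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.StatusCode)
}
</code></pre>
<p>Pointing the exporter at such an endpoint removes network and backend variability from the measurements.</p>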
<h2>Step 1: Establish a stable runtime before measuring capacity</h2>
<p>Before asking how much load a single instance can handle, we first established a stable runtime baseline.</p>
<p>We quickly learned that a single flag, <code>cpu_idle</code>, can turn Cloud Run into a garbage-collector (GC) starvation trap. This is amplified by a known <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/gcp#limitations">limitation</a> of ECF's current architecture: the existing OpenTelemetry implementation reads whole log files into memory before processing them. Our goal was to eliminate configuration side effects so that the capacity tests reflected ECF's actual limits.</p>
<p>We focused on three runtime parameters:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>What it controls</th>
<th>Why it matters for ECF</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cpu_idle</code></td>
<td>Whether CPU is always allocated or only during requests</td>
<td>Dictates how much background time the garbage collector gets to reclaim memory</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td>Upper bound on Go heap size inside the container</td>
<td>Keeps the process from quietly growing until Cloud Run kills it on OOM</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td>Heap growth and collection aggressiveness in Go</td>
<td>Trades lower memory usage for higher CPU consumption</td>
</tr>
</tbody>
</table>
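<p>In the tests, <code>GOMEMLIMIT</code> and <code>GOGC</code> are set as environment variables on the Cloud Run service. As a sketch of their semantics, Go's <code>runtime/debug</code> package exposes programmatic equivalents (the values below mirror the 512 MiB container used later and are illustrative, not prescriptive):</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Keep the default GOGC=100: the heap may grow 100% beyond the live
	// set before the next collection is triggered.
	debug.SetGCPercent(100)

	// Mirror GOMEMLIMIT = 90% of a 512 MiB container: a soft limit the GC
	// works to stay under, so Go reacts before Cloud Run's OOM killer does.
	limit := int64(512&lt;&lt;20) * 90 / 100
	debug.SetMemoryLimit(limit)

	fmt.Printf("GOMEMLIMIT ~ %d MiB\n", limit&gt;&gt;20) // ~460 MiB
}
</code></pre>
<p>Note that <code>cpu_idle</code> has no Go-side equivalent: it is a Cloud Run service setting that controls whether the process gets CPU time between requests at all.</p>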
<p>All parameter-isolation tests use a single Cloud Run instance (min 0, max 1), fix concurrency for the scenario under study, and keep input files and test matrix identical across runs. This design lets us attribute differences directly to the parameter in question.</p>
<h3>CPU allocation: Stop starving the garbage collector</h3>
<p>Cloud Run offers two CPU allocation modes:</p>
<ul>
<li>Request-based (throttled). Enabled with <code>cpu_idle: true</code>. CPU is available only while a request is actively being processed.</li>
<li>Instance-based (always on). Enabled with <code>cpu_idle: false</code>. CPU remains available when idle, allowing background work such as garbage collection to run.</li>
</ul>
<p>The tests compared these modes under identical conditions:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Memory</td>
<td align="center">4 GiB (high enough to remove OOM as a factor)</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td align="center">90% of memory</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td align="center">Default (unset)</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">10</td>
</tr>
</tbody>
</table>
<h4>What we observed</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/cpu_allocation.png" alt="CPU allocation" /></p>
<p>With CPU allocated only on requests (<code>cpu_idle: true</code>):</p>
<ul>
<li>Memory variance was extreme (±71% RSS, ±213% heap).</li>
<li>Peak heap reached ~304 MB in the worst run.</li>
<li>We saw request refusals in the sample (90% success rate).</li>
</ul>
<p>With CPU always allocated (<code>cpu_idle: false</code>):</p>
<ul>
<li>Memory variance became tightly bounded (±8% RSS, ±32% heap).</li>
<li>Peak heap dropped to ~89 MB in the worst run.</li>
<li>We saw no refusals in the sample (100% success).</li>
</ul>
<p>From these runs we saw:</p>
<ul>
<li>When CPU is throttled, the Go garbage collector is effectively starved, leading to heap accumulation and large run-to-run variance.</li>
<li>When CPU is always available, garbage collection keeps pace with allocation, resulting in lower and more predictable memory usage.</li>
</ul>
<p><em>Takeaway:</em> for this set of tests, <code>cpu_idle: false</code> was the most stable baseline configuration. Request-based CPU throttling introduced artificial instability that makes capacity planning much harder.</p>
<h3>Go memory limit: <code>GOMEMLIMIT</code> in constrained containers</h3>
<p>Cloud Run enforces a hard memory limit at the container level. If the process exceeds it, the instance is OOM-killed.</p>
<p>We tested Cloud Run with:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Container memory</td>
<td align="center">512 MiB</td>
</tr>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">20</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td align="center">Default (unset)</td>
</tr>
<tr>
<td><code>cpu_idle</code></td>
<td align="center"><code>false</code></td>
</tr>
</tbody>
</table>
<p>The tests compared:</p>
<ul>
<li>No <code>GOMEMLIMIT</code> (Go relies on OS pressure).</li>
<li><code>GOMEMLIMIT=460MiB</code> (or 90% of container memory).</li>
</ul>
<p>The results were clear:</p>
<table>
<thead>
<tr>
<th><code>GOMEMLIMIT</code></th>
<th>Outcome</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Unset</td>
<td>Unstable; repeated OOM kills</td>
<td>Service never produced stable results</td>
</tr>
<tr>
<td><code>460MiB</code></td>
<td>Stable; runs completed</td>
<td>Worst-case peak RSS reached ~505 MB, but the process stayed within container limits</td>
</tr>
</tbody>
</table>
<p><em>Takeaway:</em> in a memory-constrained environment like Cloud Run, setting <code>GOMEMLIMIT</code> close to (but below) the container limit is essential for predictable behavior under load.</p>
<h3>GOGC: memory savings vs. reliability</h3>
<p>The <code>GOGC</code> parameter controls how much the heap can grow (in %) between GC cycles:</p>
<ul>
<li>Lower values (e.g., <code>GOGC=50</code>): more frequent collections, lower memory, higher CPU.</li>
<li>Higher values (e.g., <code>GOGC=100</code>): fewer collections, higher memory, lower CPU.</li>
</ul>
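<p>As a simplified model (the real Go pacer also accounts for <code>GOMEMLIMIT</code> and non-heap memory), the next collection is triggered roughly when the heap grows <code>GOGC</code> percent beyond the live heap left after the last GC:</p>
<pre><code class="language-go">package main

import "fmt"

// nextGCTarget approximates Go's pacing rule: a collection is triggered
// once the heap grows GOGC percent beyond the live heap.
func nextGCTarget(liveMB float64, gogc int) float64 {
	return liveMB * (1 + float64(gogc)/100)
}

func main() {
	for _, gogc := range []int{50, 75, 100} {
		fmt.Printf("GOGC=%d: 100 MB live -&gt; next GC at ~%.0f MB\n",
			gogc, nextGCTarget(100, gogc))
	}
	// GOGC=50 collects at ~150 MB, GOGC=100 at ~200 MB: lower GOGC means
	// the collector runs more often, spending CPU to keep the heap smaller.
}
</code></pre>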
<p>The tests covered: (1) <code>GOGC=50</code> (aggressive); (2) <code>GOGC=75</code> (moderate); (3) <code>GOGC=100</code> (default/unset).</p>
<p>Setup:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Container memory</td>
<td align="center">4  GiB (high enough to remove OOM as a factor)</td>
</tr>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">10 (safe level)</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td align="center">90% of memory</td>
</tr>
<tr>
<td><code>cpu_idle</code></td>
<td align="center"><code>false</code></td>
</tr>
</tbody>
</table>
<h4>What we observed</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/gogc.png" alt="GOGC" /></p>
<p>From the runs:</p>
<table>
<thead>
<tr>
<th align="center"><code>GOGC</code></th>
<th align="center">Peak RSS (sample)</th>
<th>CPU behavior</th>
<th align="center">Failure rate</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">50</td>
<td align="center">~267 MB</td>
<td>Very high; often saturating</td>
<td align="center">30%</td>
<td>GC consumed cycles needed for ingestion</td>
</tr>
<tr>
<td align="center">75</td>
<td align="center">~454 MB</td>
<td>~83.5% avg</td>
<td align="center">10%</td>
<td>GC consumed cycles needed for ingestion</td>
</tr>
<tr>
<td align="center">100 (default)</td>
<td align="center">~472 MB</td>
<td>~83.5% avg; leaves headroom for bursts</td>
<td align="center">0%</td>
<td></td>
</tr>
</tbody>
</table>
<p>The conclusion from these runs is clear: pushing <code>GOGC</code> down trades memory for reliability, and the trade is not favorable for ECF.</p>
<p><em>Takeaway:</em> for this workload, the default <code>GOGC=100</code> provided the best balance. Attempts to optimize memory by lowering <code>GOGC</code> directly reduced reliability.</p>
<h2>Step 2: Find capacity and breaking points</h2>
<p>With the runtime stabilized, we evaluated how much traffic a single instance can sustain by increasing concurrency until failures emerged.</p>
<p><em>How to read the tables:</em> each concurrency level was tested across 20 runs covering both light (240 logs per file, around 362KB file size) and heavy inputs (over 6k logs per file, around 8MB file size). Tables report baseline RSS from light workloads and peak values from the worst-case run.</p>
<h3>Concurrency 5: Stable baseline</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_5.png" alt="Concurrency 5" /></p>
<p>At concurrency 5, the service was solid.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">99.89 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">211.02 MB</td>
<td align="right">86.43%</td>
<td align="left">No</td>
</tr>
</tbody>
</table>
<p>This proved that a single instance handles a moderate load comfortably, with memory usage staying well within safe limits.</p>
<h3>Concurrency 10: Safe but volatile</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_10.png" alt="Concurrency 10" /></p>
<p>At concurrency 10, the system remained functional but with significant volatility.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">100.33 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">424.80 MB</td>
<td align="right">94.10%</td>
<td align="left">No (in sample)</td>
</tr>
</tbody>
</table>
<p>We also noticed that memory usage shows extreme variance:</p>
<ul>
<li>Best run RSS: 178 MB.</li>
<li>Worst run RSS: 425 MB.</li>
</ul>
<p>This behavior comes mainly from two effects:</p>
<ul>
<li>Bursty Pub/Sub delivery: 10 heavy requests may land at nearly the same instant.</li>
<li>The use of <code>io.ReadAll</code> inside the collector: each request reads the entire log file into memory.</li>
</ul>
<p>When all 10 requests arrived concurrently, we were effectively stacking ~10× the file size in RAM before the GC could clean up. When they were slightly staggered, the GC had time to reclaim memory between requests, leading to much lower peaks.</p>
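<p>The stacking effect can be sketched as follows; the hypothetical <code>simulate</code> function stands in for the collector's receive path, which buffers each payload whole with <code>io.ReadAll</code>:</p>
<pre><code class="language-go">package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// simulate stands in for the collector's receive path: io.ReadAll buffers
// the entire object in memory before any processing happens.
func simulate(payload io.Reader) int {
	data, err := io.ReadAll(payload)
	if err != nil {
		panic(err)
	}
	return len(data)
}

func main() {
	const concurrency = 10
	const fileSize = 8 &lt;&lt; 20 // ~8 MB heavy file from the test matrix

	var wg sync.WaitGroup
	sizes := make(chan int, concurrency)
	for i := 0; i &lt; concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each aligned request holds its full buffer simultaneously.
			sizes &lt;- simulate(bytes.NewReader(make([]byte, fileSize)))
		}()
	}
	wg.Wait()
	close(sizes)

	total := 0
	for n := range sizes {
		total += n
	}
	// 10 aligned heavy requests hold ~80 MiB of payload buffers at once,
	// before any per-request processing overhead is counted.
	fmt.Println(total)
}
</code></pre>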
<p>This leads to a crucial sizing insight:</p>
<ul>
<li>Do not size the service using average memory (for example, ~260 MB).</li>
<li>Size it for the worst observed burst (~425 MB) to avoid OOM or GC stalls.</li>
</ul>
<p>In practice, you should set the memory limit to at least 512 MiB per instance at concurrency 10.</p>
<h3>Concurrency 20: Unstable, systemic load shedding</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_20.png" alt="Concurrency 20" /></p>
<p>At concurrency 20, the system consistently began shedding load.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">97.44 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">482.42 MB</td>
<td align="right">88.90%</td>
<td align="left">Yes (every run)</td>
</tr>
</tbody>
</table>
<p>Even though memory and CPU metrics don't look drastically worse than at concurrency 10, behavior changes qualitatively: the service begins to refuse requests consistently.</p>
<h3>Concurrency 40: Failure mode</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_40.png" alt="Concurrency 40" /></p>
<p>At concurrency 40, the instance failed outright: memory and CPU were overwhelmed, and ingest reliability collapsed.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">100.20 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">1234.28 MB</td>
<td align="right">96.57%</td>
<td align="left">Yes (all runs)</td>
</tr>
</tbody>
</table>
<h3>The breaking point: a 1 vCPU instance's realistic limits</h3>
<table>
<thead>
<tr>
<th align="center">Concurrency</th>
<th align="center">Peak RSS (MB)</th>
<th align="center">Stability</th>
<th align="center">Refusals?</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">5</td>
<td align="center">211.02</td>
<td align="center">Low variance</td>
<td align="center">No</td>
<td>Stable baseline</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">424.80</td>
<td align="center">High variance</td>
<td align="center">No</td>
<td>Safe but volatile</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">482.42</td>
<td align="center">High variance</td>
<td align="center">Yes (Frequent)</td>
<td>Unstable (sheds load)</td>
</tr>
<tr>
<td align="center">40</td>
<td align="center">1234.28</td>
<td align="center">Extreme variance</td>
<td align="center">Yes (Always)</td>
<td>Failure (memory explosion)</td>
</tr>
</tbody>
</table>
<p>Combined with the CPU data (94% peak at concurrency 10), this supports a practical rule: <strong>for this workload</strong> and architecture, 10 concurrent heavy requests per 1 vCPU instance is the realistic upper bound.</p>
<h2>Turning findings into concrete recommendations</h2>
<p>These experiments lead to clear, actionable recommendations for running the ECF OpenTelemetry Collector on Cloud Run as part of the broader EDOT Cloud Forwarder deployment.</p>
<p>Scope: these recommendations apply to the workload and harness we tested (light vs. heavy log files up to 8MB, and Pub/Sub burst delivery), using the tuned runtime settings listed below. If your log sizes, request burstiness, or pipeline shape differ significantly, validate these limits against your own traffic.</p>
<h3>Runtime and container configuration</h3>
<table>
<thead>
<tr>
<th>Area</th>
<th>Recommendation</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU allocation</td>
<td>Set <code>cpu_idle: false</code> (always-on CPU)</td>
<td>Avoids GC starvation, stabilizes memory variance, and eliminates request failures caused by long GC pauses</td>
</tr>
<tr>
<td>Go memory limit</td>
<td>Set <code>GOMEMLIMIT</code> to ~90% of container memory</td>
<td>Enforces a heap boundary aligned with the Cloud Run limit so that Go reacts before the OS, preventing OOM kills</td>
</tr>
<tr>
<td>Garbage collection</td>
<td>Keep <code>GOGC</code> at 100 (default)</td>
<td>Lower <code>GOGC</code> reduces memory at the cost of higher CPU usage and measurable failure rates</td>
</tr>
</tbody>
</table>
<h3>Capacity and per-instance limits</h3>
<p>For a 1 vCPU Cloud Run instance running the ECF OpenTelemetry collector with the tuned runtime:</p>
<table>
<thead>
<tr>
<th>Limit</th>
<th>Recommendation</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hard concurrency</td>
<td>Cap concurrency at 10 requests per instance</td>
<td>At concurrency 10, CPU already reaches ~94% in the worst run; higher concurrency drives instability (refusals, GC stalls)</td>
</tr>
<tr>
<td>Memory</td>
<td>Use at least 512 MiB per instance (for concurrency 10)</td>
<td>Worst-case observed RSS is ~425 MB; 512 MiB provides a narrow but workable safety margin against burst alignment</td>
</tr>
</tbody>
</table>
<h3>Scaling strategy: horizontal, not vertical</h3>
<ul>
<li>Vertical scaling (increasing concurrency per instance) quickly runs into CPU and memory limits for this workload.</li>
<li>Horizontal scaling is a better fit: treat each instance as a worker with a hard limit of 10 concurrent heavy jobs.</li>
</ul>
<p>Practically:</p>
<ul>
<li>Configure the service so that no instance exceeds 10 concurrent requests.</li>
<li>Let autoscaling handle an increased load by adding instances, not by increasing per-instance concurrency.</li>
</ul>
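<p>Cloud Run enforces this cap itself through the service's concurrency setting. Purely as an illustration of the policy (all names here are hypothetical), the same behavior can be sketched in-process as a semaphore-guarded handler:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withConcurrencyCap mirrors the per-instance concurrency limit: requests
// beyond the cap are refused immediately instead of queuing until memory
// is exhausted. Pub/Sub push retries refused deliveries later.
func withConcurrencyCap(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem &lt;- struct{}{}:
			defer func() { &lt;-sem }()
			next.ServeHTTP(w, r)
		default:
			w.WriteHeader(http.StatusTooManyRequests) // shed load
		}
	})
}

func main() {
	process := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // stand-in for fetching and parsing the log file
	})
	srv := httptest.NewServer(withConcurrencyCap(10, process))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode)
}
</code></pre>
<p>Refusing early keeps each worker inside its measured 10-request envelope and pushes extra load onto the autoscaler, which adds instances instead.</p>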
<h2>Takeaways</h2>
<ul>
<li>Tuned runtime settings matter as much as raw resources: a single flag like <code>cpu_idle</code> can be the difference between predictable behavior and GC-driven chaos.</li>
<li>Go needs explicit limits in containers: <code>GOMEMLIMIT</code> must be set in memory-constrained environments; otherwise, OOM kills are inevitable under heavy ingestion.</li>
<li>&quot;Lower memory&quot; is not always better: aggressive GC tuning (<code>GOGC</code> &lt; 100) did reduce memory usage but directly increased failure rates.</li>
<li>Concurrency 10 is the realistic ceiling for a 1 vCPU ECF instance; beyond that, refusals and instability become the norm.</li>
<li>Horizontal scaling is the right model: each instance should be treated as a 10-request worker, with higher total throughput coming from more workers rather than more concurrency per worker.</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/cover-gcp.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using the Elastic Agent to monitor Amazon ECS and AWS Fargate with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-monitor-ecs-aws-fargate-observability</link>
            <guid isPermaLink="false">elastic-agent-monitor-ecs-aws-fargate-observability</guid>
            <pubDate>Thu, 15 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we’ll guide you through how to install the Elastic Agent with the AWS Fargate integration as a sidecar container to send host metrics and logs to Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h2>Serverless and AWS ECS Fargate</h2>
<p>AWS Fargate is a serverless, pay-as-you-go compute engine for Amazon Elastic Container Service (ECS) that runs Docker containers without requiring you to manage servers or clusters. With Fargate, you containerize your application and specify the OS, CPU and memory, networking, and IAM policies needed for launch. Additionally, AWS Fargate can be used with Elastic Kubernetes Service (EKS) in a <a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html">similar manner</a>.</p>
<p>Although the provisioning of servers would be handled by a third party, the need to understand the health and performance of containers within your serverless environment becomes even more vital in identifying root causes and system interruptions. Serverless still requires observability. Elastic Observability can provide observability for not only AWS ECS with Fargate, as we will discuss in this blog, but also for a number of AWS services (EC2, RDS, ELB, etc). See our <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">previous blog</a> on managing an EC2-based application with Elastic Observability.</p>
<h2>Gaining full visibility with Elastic Observability</h2>
<p>Elastic Observability is built on the three pillars of full system visibility: logs, metrics, and traces. Logs record all the events that have taken place in the system. Metrics track signals of system health, such as response time, CPU usage, memory usage, and latency. Traces show how your system performs as it executes requests.</p>
<p>Each pillar on its own offers some insight, but combining them lets you see the full scope of your system and how it handles increases in load or traffic over time. Connecting Elastic Observability to your serverless environment helps you resolve outages more quickly and perform root cause analysis to prevent future problems.</p>
<p>In this article, we’ll guide you through how to install the Elastic Agent with the <a href="https://docs.elastic.co/integrations/awsfargate">AWS Fargate</a> integration as a sidecar container to send host metrics and logs to Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-16_at_12.58.05_PM.png" alt="" /></p>
<h2>Prerequisites:</h2>
<ul>
<li>AWS account with AWS CLI configured</li>
<li>GitHub account</li>
<li>Elastic Cloud account</li>
<li>An app running on a container in AWS</li>
</ul>
<p>This tutorial is divided into two parts:</p>
<ol>
<li>Set up the Fleet server to be used by the sidecar container in AWS.</li>
<li>Create the sidecar container in AWS Fargate to send data back to Elastic Observability.</li>
</ol>
<h2>Part I: Set up the Fleet server</h2>
<p>First, let’s log in to Elastic Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image4.png" alt="" /></p>
<p>You can either create a new deployment or use an existing one.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image35.png" alt="" /></p>
<p>From the <strong>Home</strong> page, use the side panel to scroll to Management &gt; Fleet &gt; Agent policies. Click <strong>Add policy</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image30.png" alt="" /></p>
<p>Click <strong>Create agent policy</strong>. Here we’ll create a policy to attach to the Fleet agent.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image38.png" alt="" /></p>
<p>Give the policy a name and save changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image44.png" alt="" /></p>
<p>Click <strong>Create agent policy</strong>. You should see the agent policy AWS Fargate in the list of policies.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image42.png" alt="" /></p>
<p>Now that we have an agent policy, let’s add the integration to collect logs and metrics from the host. Click on <strong>AWS Fargate -&gt; Add integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image19.png" alt="" /></p>
<p>We’ll be adding to the policy AWS to collect overall AWS metrics and AWS Fargate to collect metrics from this integration. You can find each one by typing them in the search bar.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image34.png" alt="" /></p>
<p>Once you click on the integration, it will take you to its landing page, where you can add it to the policy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image48.png" alt="" /></p>
<p>For the AWS integration, the only collection settings that we will configure are Collect billing metrics, Collect logs from CloudWatch, Collect metrics from CloudWatch, Collect ECS metrics, and Collect Usage metrics. Everything else can be left disabled.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-15_at_11.35.28_AM.png" alt="" /></p>
<p>Another thing to keep in mind when using this integration is the set of permissions required to collect data from AWS. This can be found on the AWS integration page under AWS permissions. Take note of these permissions, as we will use them to create an IAM policy.</p>
<p>Next, we will add the AWS Fargate integration, which doesn’t require further configuration settings.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image37.png" alt="" /></p>
<p>Now that we have created the agent policy and attached the proper integrations, let’s create the agent that will implement the policy. Navigate back to the main Fleet page and click <strong>Add agent</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image41.png" alt="" /></p>
<p>Since we’ll be connecting to AWS Fargate through ECS, the host type should be set to this value. All the other default values can stay the same.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image15.png" alt="" /></p>
<p>Lastly, let’s create the enrollment token and attach the agent policy. This will enable AWS ECS Fargate to access Elastic and send data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image6.png" alt="" /></p>
<p>Once created, you should see the policy name, secret, and agent policy listed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image43.png" alt="" /></p>
<p>We’ll be using our Fleet credentials in the next step to send data to Elastic from AWS Fargate.</p>
<h2>Part II: Send data to Elastic Observability</h2>
<p>It’s time to create our ECS Cluster, Service, and task definition in order to start running the container.</p>
<p>Log in to your AWS account and navigate to ECS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image46.png" alt="" /></p>
<p>We’ll start by creating the cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image9.png" alt="" /></p>
<p>Give the cluster a name. For subnets, select only the first two, us-east-1a and us-east-1b.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image10.png" alt="" /></p>
<p>For the sake of the demo, we’ll keep the rest of the options set to default. Click <strong>Create</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image11.png" alt="" /></p>
<p>We should see the cluster we created listed below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-15_at_11.15.51_AM.png" alt="" /></p>
<p>Now that we’ve created our cluster to host our container, we want to create a task definition that will be used to set up our container. But before we do this, we will need to create a task role with an associated policy. This task role will allow for AWS metrics to be sent from AWS to the Elastic Agent.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image47.png" alt="" /></p>
<p>Navigate to IAM in AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image32.png" alt="" /></p>
<p>Go to <strong>Policies -&gt; Create policy</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image31.png" alt="" /></p>
<p>Now we will reference the AWS permissions from the Fleet AWS integration page and use them to configure the policy. In addition to these permissions, we will also add the GetAuthorizationToken action for ECR.</p>
<p>You can configure each one using the visual editor.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image22.png" alt="" /></p>
<p>Or, use the JSON option. Don’t forget to replace the &lt;account_id&gt; with your own.</p>
<pre><code class="language-json">{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;VisualEditor0&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [
        &quot;sqs:DeleteMessage&quot;,
        &quot;sqs:ChangeMessageVisibility&quot;,
        &quot;sqs:ReceiveMessage&quot;,
        &quot;ecr:GetDownloadUrlForLayer&quot;,
        &quot;ecr:UploadLayerPart&quot;,
        &quot;ecr:PutImage&quot;,
        &quot;sts:AssumeRole&quot;,
        &quot;rds:ListTagsForResource&quot;,
        &quot;ecr:BatchGetImage&quot;,
        &quot;ecr:CompleteLayerUpload&quot;,
        &quot;rds:DescribeDBInstances&quot;,
        &quot;logs:FilterLogEvents&quot;,
        &quot;ecr:InitiateLayerUpload&quot;,
        &quot;ecr:BatchCheckLayerAvailability&quot;
      ],
      &quot;Resource&quot;: [
        &quot;arn:aws:iam::&lt;account_id&gt;:role/*&quot;,
        &quot;arn:aws:logs:*:&lt;account_id&gt;:log-group:*&quot;,
        &quot;arn:aws:sqs:*:&lt;account_id&gt;:*&quot;,
        &quot;arn:aws:ecr:*:&lt;account_id&gt;:repository/*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:target-group:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:subgrp:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:pg:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:ri:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-snapshot:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cev:*/*/*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:og:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:es:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db-proxy-endpoint:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:secgrp:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-pg:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-endpoint:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db-proxy:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:snapshot:*&quot;
      ]
    },
    {
      &quot;Sid&quot;: &quot;VisualEditor1&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [
        &quot;sqs:ListQueues&quot;,
        &quot;organizations:ListAccounts&quot;,
        &quot;ec2:DescribeInstances&quot;,
        &quot;tag:GetResources&quot;,
        &quot;cloudwatch:GetMetricData&quot;,
        &quot;ec2:DescribeRegions&quot;,
        &quot;iam:ListAccountAliases&quot;,
        &quot;sns:ListTopics&quot;,
        &quot;sts:GetCallerIdentity&quot;,
        &quot;cloudwatch:ListMetrics&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;
    },
    {
      &quot;Sid&quot;: &quot;VisualEditor2&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: &quot;ecr:GetAuthorizationToken&quot;,
      &quot;Resource&quot;: &quot;*&quot;
    }
  ]
}
</code></pre>
<p>Review your changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image3.png" alt="" /></p>
<p>Now let’s attach this policy to a role. Navigate to <strong>IAM -&gt; Roles</strong>. Click <strong>Create role</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image45.png" alt="" /></p>
<p>Select AWS service as Trusted entity type and select EC2 as Use case. Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image24.png" alt="" /></p>
<p>Under permissions policies, select the policy we just created, as well as CloudWatchLogsFullAccess and AmazonEC2ContainerRegistryFullAccess. Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image27.png" alt="" /></p>
<p>Give the task role a name and description.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image39.png" alt="" /></p>
<p>Click <strong>Create role</strong>.</p>
<p>Now it’s time to create the task definition. Navigate to <strong>ECS -&gt; Task definitions</strong>. Click <strong>Create new task definition</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image21.png" alt="" /></p>
<p>Let’s give this task definition a name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image14.png" alt="" /></p>
<p>After giving the task definition a name, you’ll add the Fleet credentials to the container section; you can obtain these from Enrollment Tokens under Fleet in Elastic Cloud. This allows us to host the Elastic Agent on the ECS container as a sidecar and send data to Elastic using Fleet credentials.</p>
<ul>
<li>
<p>Container name: <strong>elastic-agent-container</strong></p>
</li>
<li>
<p>Image: <strong>docker.elastic.co/beats/elastic-agent:8.19.12</strong></p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image40.png" alt="" /></p>
<p>Now let’s add the environment variables:</p>
<ul>
<li>
<p>FLEET_ENROLL: <strong>yes</strong></p>
</li>
<li>
<p>FLEET_ENROLLMENT_TOKEN: <strong>&lt;enrollment-token&gt;</strong></p>
</li>
<li>
<p>FLEET_URL: <strong>&lt;fleet-server-url&gt;</strong></p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image26.png" alt="" /></p>
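<p>Taken together, the container settings and environment variables above correspond to a task definition fragment like the following. This JSON is an illustrative sketch: the family name is an assumption, the placeholders must be replaced with your own Fleet values, and a real task definition will also include CPU, memory, and role settings.</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;elastic-agent-fargate&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;FARGATE&quot;],
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;elastic-agent-container&quot;,
      &quot;image&quot;: &quot;docker.elastic.co/beats/elastic-agent:8.19.12&quot;,
      &quot;environment&quot;: [
        { &quot;name&quot;: &quot;FLEET_ENROLL&quot;, &quot;value&quot;: &quot;yes&quot; },
        { &quot;name&quot;: &quot;FLEET_ENROLLMENT_TOKEN&quot;, &quot;value&quot;: &quot;&lt;enrollment-token&gt;&quot; },
        { &quot;name&quot;: &quot;FLEET_URL&quot;, &quot;value&quot;: &quot;&lt;fleet-server-url&gt;&quot; }
      ]
    }
  ]
}
</code></pre>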
<p>For the sake of the demo, leave Environment, Monitoring, Storage, and Tags as their default values. Now we need to create a second container to run the image for the Go app stored in ECR. Click <strong>Add more containers</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image5.png" alt="" /></p>
<p>For Environment, we will reserve 1 vCPU and 3 GB of memory. Under Task role, search for the role we created that uses the IAM policy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image7.png" alt="" /></p>
<p>Review the changes, then click <strong>Create</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image25.png" alt="" /></p>
<p>You should see your new task definition included in the list.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image20.png" alt="" /></p>
<p>The final step is to create the service that will connect directly to the Fleet Server.<br />
Navigate to the cluster you created and click <strong>Create</strong> under the Service tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image18.png" alt="" /></p>
<p>Let’s get our service environment configured.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image28.png" alt="" /></p>
<p>Set up the deployment configuration. Here you should provide the name of the task definition you created in the previous step. Also, provide the service with a unique name. Set the number of <strong>desired tasks</strong> to 2 instead of 1.</p>
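<p>For reference, the same service can be sketched with the AWS CLI. The cluster, service, task definition, subnet, and security group names below are placeholders; adjust them to the resources you created above.</p>
<pre><code class="language-bash"># Create the ECS service on Fargate with two tasks (names and IDs are placeholders)
aws ecs create-service \
  --cluster my-fargate-cluster \
  --service-name elastic-agent-service \
  --task-definition elastic-agent-fargate \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration &quot;awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}&quot;
</code></pre>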
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image16.png" alt="" /></p>
<p>Click <strong>Create</strong>. Now your service is running two tasks in your cluster using the task definition you provided.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image33.png" alt="" /></p>
<p>To recap, we set up a Fleet server in Elastic Cloud to receive AWS Fargate data. We then created our AWS Fargate cluster task definition with the Fleet credentials implemented within the container. Lastly, we created the service to send data about our host to Elastic.</p>
<p>Now let’s verify our Elastic Agent is healthy and properly receiving data from AWS Fargate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image36.png" alt="" /></p>
<p>We can also view a better breakdown of our agent on the Observability Overview page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image2.png" alt="" /></p>
<p>If we drill down to hosts, by clicking on host name we should be able to see more granular data. For instance, we can see the CPU Usage of the Elastic Agent itself that is deployed in our AWS Fargate environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image8.png" alt="" /></p>
<p>Lastly, we can view the AWS Fargate dashboard generated using the data collected by our Elastic Agent. This is an out-of-the-box dashboard that can also be customized based on the data you would like to visualize.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image23.png" alt="" /></p>
<p>As you can see in the dashboard, we can filter by running tasks and see a list of the containers running in our environment. The dashboard also surfaces per-cluster CPU usage under CPU Utilization per Cluster.</p>
<p>The dashboard can pull data from different sources and in this case shows data for both AWS Fargate and the greater ECS cluster. The two containers at the bottom display the CPU and memory usage directly from ECS.</p>
<h2>Conclusion</h2>
<p>In this article, we showed how to send data from AWS Fargate to Elastic Observability using the Elastic Agent and Fleet. Serverless architectures are quickly becoming an industry standard for offloading the management of servers to third parties. However, this does not alleviate the responsibility of operations engineers to manage the data generated within these environments. Elastic Observability provides a way to not only ingest the data from serverless architectures, but also establish a roadmap to address future problems.</p>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><strong>More resources on serverless and observability and AWS:</strong></p>
<ul>
<li><a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">Analyze your AWS application’s service metrics on Elastic Observability (EC2, ELB, RDS, and NAT)</a></li>
<li><a href="https://www.elastic.co/blog/observability-apm-aws-lambda-serverless-functions">Get visibility into AWS Lambda serverless functions with Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/trace-based-testing-elastic-apm-tracetest">Trace-based testing with Elastic APM and Tracetest</a></li>
<li><a href="https://www.elastic.co/blog/aws-kinesis-data-firehose-elastic-observability-analytics">Sending AWS logs into Elastic via AWS Firehose</a></li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/blog-thumb-observability-pattern-color.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Pivoting Elastic's Data Ingestion to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry</link>
            <guid isPermaLink="false">elastic-agent-pivot-opentelemetry</guid>
            <pubDate>Tue, 03 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has fully embraced OpenTelemetry as the backbone of its data ingestion strategy, aligning with the open-source community and contributing to make it the best data collection platform for a broad user base. This move benefits users by providing enhanced flexibility, efficiency, and control over telemetry data.]]></description>
            <content:encoded><![CDATA[<h1>Introduction</h1>
<p>Elastic has fully embraced OpenTelemetry as the backbone of its data ingestion strategy, aligning with the open-source community and contributing to make it the best data collection platform for a broad user base. This move benefits users by providing enhanced flexibility, efficiency, and control over telemetry data.</p>
<h1>Why OpenTelemetry?</h1>
<p>OpenTelemetry provides a powerful set of capabilities that make it a compelling choice for open-source-focused users. Elastic is re-architecting its data ingest tools around OpenTelemetry to offer users vendor-agnostic flexibility, performance optimization through OTel's efficient data model for correlating telemetry, and enhanced flexibility and control over data pipelines. This move brings the benefits of open-source telemetry to Elastic users.</p>
<p>Elastic engineers are active contributors to several areas of the OTel project. Demonstrating its commitment to open source, Elastic continues to make significant <a href="https://opentelemetry.devstats.cncf.io/d/5/companies-table?orgId=1%5C&amp;var-period_name=Last%20year&amp;var-metric=contributions">contributions to OpenTelemetry</a>.</p>
<h1>OpenTelemetry as the Core of Elastic's Data Ingestion</h1>
<p>Elastic is transforming its data ingestion strategy by basing all ingestion mechanisms on OpenTelemetry components. Elastic currently supports the following OTel-based ingest architecture, which supports OTel SDKs and Collectors from upstream OpenTelemetry or from the Elastic Distributions of OpenTelemetry (EDOT).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/edot-components.png" alt="EDOT components" /></p>
<p>This marks a fundamental shift, ensuring a more standardized and scalable telemetry pipeline. All the existing Elastic ingest components will become OTel based.</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>OpenTelemetry plan</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Beats</strong></td>
<td>Beats architecture will be based on OTel.</td>
</tr>
<tr>
<td><strong>Elastic Agent</strong></td>
<td>Agent architecture will be based on OTel to support both Beats-based inputs and OTel receivers.</td>
</tr>
<tr>
<td><strong>Integrations</strong></td>
<td>Integrations catalogue will additionally include OTel-based modules for ease of configuration.</td>
</tr>
<tr>
<td><strong>Fleet central management</strong></td>
<td>Fleet will support monitoring of Elastic OTel collectors.</td>
</tr>
</tbody>
</table>
<p>Let's discuss how each component of Elastic's data ingestion platform will be based on an OpenTelemetry collector whilst still providing the same functionality to the user.</p>
<h2>Beats</h2>
<p>Elastic's traditional data shippers will be re-architected as OpenTelemetry Collectors, aligning with OTel's extensibility model. The current Beat architecture is essentially a pipeline of a few stages, as shown in the diagram below: an Input, Processors for enrichment, Queuing of events, and an Output for batching and writing the data to a specific destination.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/filebeat.png" alt="filebeat" /></p>
<h2>Beatreceiver Concept</h2>
<p>To ensure a smooth transition without major disruptions, a &quot;Beat receiver&quot; concept is being implemented. These <code>beatreceivers</code> (like <code>filebeatreceiver</code> or <code>metricbeatreceiver</code>) act as dedicated Beat inputs integrated into the OpenTelemetry Collector as native receivers. They support all existing inputs and processors, guaranteeing that the final architecture accepts the user's current configuration and delivers the same functionality as today's Beats, all without introducing any breaking changes.</p>
<p>An OTel-based Beats architecture will see the Input phase embedded as an OTel receiver (e.g., <code>filebeatreceiver</code> representing the functionality of <code>filebeat</code>). This receiver will only be available as part of the Elastic Distributions of OpenTelemetry in support of our current user base; it is not functionality that will be available upstream.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/filebeatreceiver.png" alt="filebeat" /></p>
<p>All the remaining components of the pipeline will be based on OTel. The new Beat will accept the same filebeat configuration (as an example) and will transform it into an OTel-based configuration in order to avoid any deployment disruption. It should be noted that in this architecture the Beats will continue to support only ECS-formatted data. To keep the Beat functionality in line with what exists today, the Elasticsearch exporter (as an example) will output ECS-formatted data only.</p>
<p>The following diagram illustrates the <code>beatreceiver</code> concept by showing how a basic <code>filebeat</code> configuration is automatically translated into an OpenTelemetry-based configuration. This new configuration retains the original inputs and processors but leverages the native OpenTelemetry pipeline and exporter to achieve the same overall <code>filebeat</code> functionality. Existing <code>filebeat</code> configurations will be automatically converted, eliminating the need for manual adjustments or introducing breaking changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-otel-config.png" alt="Filebeat OTel config" /></p>
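<p>To make the idea concrete, a translated configuration could look roughly like the YAML sketch below. This is illustrative only: the <code>filebeatreceiver</code> and <code>elasticsearch</code> component names follow the diagram, but the exact keys, paths, and endpoint are assumptions, not a finalized format.</p>
<pre><code class="language-yaml">receivers:
  filebeatreceiver:
    filebeat:
      inputs:
        - type: filestream
          id: app-logs
          paths:
            - /var/log/app/*.log
    processors:
      - add_host_metadata: ~

exporters:
  elasticsearch:
    endpoints: [&quot;https://localhost:9200&quot;]

service:
  pipelines:
    logs:
      receivers: [filebeatreceiver]
      exporters: [elasticsearch]
</code></pre>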
<h2>Elastic Agent</h2>
<p>Elastic Agent is a unified agent for data collection, security, and observability. It can also be deployed in an OpenTelemetry-only mode, enabling native OTel workflows. Elastic Agent is a supervisor that manages multiple Beats as sub-processes in order to provide a more comprehensive data collection tool. It is capable of translating the Agent Policy received from Fleet into configuration acceptable to the various sub-processes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-architecture.png" alt="Elastic Agent Architecture" /></p>
<p>Expanding on the Beat receiver concept described above, the Elastic Agent, which can currently be deployed as an OTel collector (see <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">blog</a>), will also be modified to a much simpler OTel-based architecture built on these receivers. As shown below, this architecture will streamline the components within the Elastic Agent and remove duplicated functionality such as queuing and output. Whilst supporting the current functionality, these changes will reduce the agent footprint and also reduce the number of connections opened to pipeline elements downstream of the agent (such as Elasticsearch clusters, Logstash, or Kafka brokers).</p>
<p>By moving to an OTel-based architecture, Elastic Agent is able to operate as a truly hybrid agent: it provides not only the Beats functionality but also allows our users to create OTel-native pipelines and take advantage of the plethora of functionality available as part of the open-source project.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-otel-architecture.png" alt="Elastic Agent OTel Architecture" /></p>
<p>Elastic's commitment to OpenTelemetry will deepen through increased contributions, resulting in OpenTelemetry receivers gradually superseding Beats receiver features. This evolution will eventually reduce the need for a distinct Beats receiver within the Elastic Agent architecture. The envisioned architecture will empower the Elastic Agent to transmit data in OTLP format as well, granting users the flexibility to select any OTLP-compatible backend, thereby upholding the principle of vendor neutrality.</p>
<h2>Fleet &amp; Integrations: Managing OpenTelemetry at Scale</h2>
<p>Elastic's centralized management system will support OpenTelemetry-based configurations, making large-scale deployments easier to manage. Managing thousands of telemetry agents at scale presents a significant challenge. Elastic's <strong>Fleet &amp; Integrations</strong> simplify this process by providing robust lifecycle management for these new OpenTelemetry-based Elastic agents.</p>
<p><strong>Key Capabilities Offered:</strong></p>
<ul>
<li>
<p><strong>Scalability:</strong> Manage up to 100K+ agents across distributed environments.</p>
</li>
<li>
<p><strong>Automated Upgrades:</strong> Staged rollouts and automatic upgrades ensure minimal downtime.</p>
</li>
<li>
<p><strong>Monitoring &amp; Diagnostics:</strong> Real-time status updates, failure detection, and diagnostic downloads improve system reliability.</p>
</li>
<li>
<p><strong>Policy-Based Configuration Management:</strong> Enables centralized control over agent configurations, improving consistency across deployments.</p>
</li>
<li>
<p><strong>Pre-Built Integrations:</strong> Elastic offers a catalog of <strong>470+ pre-built integrations</strong>, allowing users to ingest data seamlessly from various sources. These will also include OTel based packages making configuration much more efficient across a large deployment.</p>
</li>
</ul>
<p>The goal is for Fleet to also provide monitoring capabilities for native OTel collectors in a vendor-agnostic fashion.</p>
<h1>Conclusion</h1>
<p>Elastic's adoption of OpenTelemetry marks a significant milestone in the evolution of open-source observability. By standardizing on OpenTelemetry, Elastic is ensuring that its data ingestion strategy remains <strong>open, scalable, and future-proof</strong>.</p>
<p>For open-source users, this shift means:</p>
<ul>
<li>
<p>Greater interoperability across observability tools.</p>
</li>
<li>
<p>Enhanced flexibility in choosing telemetry backends.</p>
</li>
<li>
<p>A stronger commitment to <strong>community-driven</strong> observability standards.</p>
</li>
<li>
<p>Existing Beats and Elastic Agent users can <strong>seamlessly adopt OpenTelemetry</strong> without rearchitecting their pipelines.</p>
</li>
<li>
<p>OpenTelemetry users can <strong>integrate with Elastic's observability stack</strong> without additional complexity.</p>
</li>
</ul>
<p>Stay tuned for more updates as Elastic continues to expand its OpenTelemetry-based data collection capabilities! In the meantime, here are some other references:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">Elastic Distributions of OpenTelemetry (EDOT) Now GA</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery on Kubernetes now supported with EDOT Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/introducing-the-ottl-playground-for-opentelemetry">Introducing the OTTL Playground for OpenTelemetry</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/self-service-blog-image-templates.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows</link>
            <guid isPermaLink="false">elastic-agent-skills-observability-workflows</guid>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Agent Skills for Elastic Observability help SREs and developers run observability workflows through natural language to instrument apps with OpenTelemetry, search logs, manage SLOs, understand service health, and help with LLM observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a wide set of capabilities, from configuring OpenTelemetry instrumentation, writing ES|QL queries to search logs and metrics, defining SLOs with the correct indicator types and equation syntax, triaging noisy alert storms, and stitching together service health from multiple signals. SREs are now looking to autmoate further with AI Agents.</p>
<p>Elastic's Agent skills are open source packages that give your AI coding agent native Elastic expertise. If you're already using Elastic Agent Builder, you get AI agents that work natively with your Observability data. The <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a> deliver native platform expertise directly to your AI coding agent, so you can stop debugging AI-generated errors and start shipping production-ready code with the full depth of Elastic.</p>
<p>Skills can be used for specialized tasks across the Elastic stack — Elasticsearch, Kibana, Elastic Security, Elastic Observability, and more. Each skill lives in its own folder with a SKILL.md file containing metadata and instructions the agent follows.</p>
<p>Observability is releasing five skills that together cover the core workflows SREs and developers perform daily. Each of these workflows requires domain expertise and familiarity with specific APIs, index patterns, and Kibana workflows. For teams managing dozens of services across multiple environments, this work is repetitive, error-prone, and time-consuming.</p>
<p>This article walks through the current Observability skill set, shows an end-to-end workflow, and highlights where these skills are useful in day-to-day operations.</p>
<h2>Why this matters for observability teams</h2>
<p>Modern observability work is usually ad hoc and cross-cutting. In one hour, you may instrument a new service, inspect logs for an incident, check error-budget status, and validate service health across several signals.</p>
<p>Each step often needs different APIs, index patterns, and Kibana workflows. Agent Skills package this task knowledge into reusable units so an agent can execute these steps consistently.</p>
<h2>The observability skills</h2>
<p>The observability set currently focuses on five connected workflows:</p>
<ol>
<li><strong>Instrument applications</strong> Adds the Elastic Distributions of OpenTelemetry to Python, Java, or .NET services (tracing, metrics, logs) or helps migrate from the classic Elastic APM agents to EDOT, with correct OTLP endpoints and configuration</li>
<li><strong>Search logs</strong> Provides visibility into Elastic Streams — the data routing and processing layer for observability data.</li>
<li><strong>Manage SLOs</strong> Creates and manages Service-Level Objectives in Elastic Observability via the Kibana API — from data exploration through SLO definition, creation, and lifecycle management.</li>
<li><strong>Assess service health</strong> Provides a unified view of service health by combining signals from APM, infrastructure metrics, logs, SLOs, and alerts into a single assessment.</li>
<li><strong>Observe LLM applications</strong> Monitors and troubleshoots LLM-powered applications — tracking token usage, latency, error rates, and model performance across inference calls.</li>
</ol>
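<p>As a concrete example of the first workflow, instrumenting a service with EDOT typically comes down to setting the standard OpenTelemetry environment variables before starting the app. The endpoint, API key, and service names below are placeholders; only the variable names come from the OpenTelemetry specification.</p>
<pre><code class="language-bash"># Placeholder OTLP endpoint and credentials for your Elastic deployment
export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://my-deployment.ingest.example.elastic.cloud:443&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;api-key&gt;&quot;
export OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=cart-service,deployment.environment=production&quot;

# Zero-code instrumentation entry point for Python (assuming EDOT Python is installed)
opentelemetry-instrument python app.py
</code></pre>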
<h2>What Agent Skills are</h2>
<p>Agent Skills are self-contained folders with instructions, scripts, and resources that an AI agent loads dynamically for a specific task. Elastic publishes official skills in <a href="https://github.com/elastic/agent-skills">elastic/agent-skills</a>, based on the <a href="https://agentskills.io/">Agent Skills standard</a>.</p>
<p>At a practical level, this means:</p>
<ul>
<li>You describe the goal.</li>
<li>The agent selects the relevant skill or you specify it.</li>
<li>The skill applies the consistent, Elastic-recommended steps and API patterns for that job.</li>
</ul>
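<p>Under the hood, each skill is just a folder whose core file is a <code>SKILL.md</code> containing metadata and instructions. A minimal, hypothetical sketch is shown below; the frontmatter fields follow the Agent Skills standard, but the content is illustrative, not an actual Elastic skill:</p>
<pre><code class="language-markdown">---
name: logs-search
description: Search and summarize logs in Elastic Observability
---

## Instructions

1. Identify the relevant data stream or index pattern.
2. Query the logs through the Elasticsearch or Kibana APIs.
3. Summarize errors and anomalies, citing the matching documents.
</code></pre>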
<h2>Practical example: from incident question to root-cause</h2>
<p>As an SRE, you're notified that a specific customer is experiencing errors. Support has been trying to troubleshoot but needs help, and provides a transaction ID to investigate.</p>
<p>You've loaded Elastic's Agent Skills to Claude. You ask Claude:</p>
<p><code>Find out why transaction with id 01ba6cf8e60253bdeb26026caa3278a1 is having issues over the last 24 hours.</code></p>
<p>With Elastic's Observability skills added, Claude analyzes the issue for that specific transaction in Elastic:</p>
<ol>
<li>It uses the logs-search skill to narrow down likely causes.</li>
<li>It identifies the root cause.</li>
<li>It recommends a potential remediation.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/Analyze-logs-for-transaction.png" alt="Claude Code interaction for log-search skill" /></p>
<h2>How to get started</h2>
<p>Install Elastic skills with the <code>skills</code> CLI:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills
</code></pre>
<p>Install a specific skill directly:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills --skill logs-search 
</code></pre>
<p>Then run your agent and give it an outcome-focused request, for example:</p>
<pre><code class="language-text">My cart service is experiencing some slowness, are there any errors over the last 3 hours? Please give me a summary of these logs.
</code></pre>
<p>The key shift is that the request is outcome-first. The skill captures implementation details such as API order, field expectations, and verification steps.</p>
<h2>What's next</h2>
<p>The planned scope includes broader workflow coverage. As skills mature, teams can combine them into repeatable operating patterns that still support ad hoc investigation.</p>
<p>If you want to try this model now, get <a href="https://github.com/elastic/agent-skills">Elastic's Agent Skills</a> and start with one service and one workflow:</p>
<ol>
<li>Assess service health.</li>
<li>Run guided log investigation for one real incident.</li>
<li>Add SLO management after baseline telemetry quality is in place.</li>
<li>Understand how well your LLM is performing for your developers.</li>
</ol>
<p>This gives you a concrete way to evaluate agent-assisted observability work without changing your full operating model in one step.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/header2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting started with the Elastic AI Assistant for Observability and Amazon Bedrock]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-amazon-bedrock</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-amazon-bedrock</guid>
            <pubDate>Fri, 03 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to get started with the Elastic AI Assistant for Observability and Amazon Bedrock.]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released version <a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">8.13, which includes the general availability of Amazon Bedrock integration for the Elastic AI Assistant for Observability</a>. This blog post will walk through the step-by-step process of setting up the Elastic AI Assistant with Amazon Bedrock. Then, we’ll show you how to add content to the AI Assistant’s knowledge base to demonstrate how the power of Elasticsearch combined with Amazon Bedrock can supercharge the answers Elastic AI Assistant provides so that they are uniquely specific to your needs.</p>
<p>Managing applications and the infrastructure they run on requires advanced observability into the diverse types of data involved like logs, traces, profiles, and metrics. General purpose generative AI large language models (LLMs) offer a new capability to provide human readable guidance to your observability questions. However, they have limitations. Specifically, when it comes to providing answers about your application’s distinct observability data like real-time metrics, the LLMs require additional context to provide answers that will help to actually resolve issues. This is a limitation that the Elastic AI Assistant for Observability can uniquely solve.</p>
<p>Elastic Observability, serving as a central datastore of all the observability data flowing from your application, combined with the Elastic AI Assistant gives you the ability to generate a context window that can inform an LLM’s responses and vastly improve the answers it provides. For example, when you ask the Elastic AI Assistant a question about a specific issue happening in your application, it gathers up all the relevant details — current errors captured from logs or a related runbook that your team has stored in the Elastic AI Assistant’s knowledge base. Then, it sends that information to the Amazon Bedrock LLM as a context window from which it can better answer your observability questions.</p>
<p>Read on to follow the steps for setting up the Elastic AI Assistant for yourself.</p>
<h2>Set up the Elastic AI Assistant for Observability: Create an Amazon Bedrock connector in Elastic Cloud</h2>
<p>Start by creating an Elastic Cloud 8.13 deployment via the <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">AWS marketplace</a>. If you’re a new user of Elastic Cloud, you can create a new deployment with a 7-day free trial.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/1.png" alt="1" /></p>
<p>Sign in to the Elastic Cloud deployment you’ve created. From the top level menu, select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/2.png" alt="2" /></p>
<p>Select <strong>Connectors</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/3.png" alt="3" /></p>
<p>Click the <strong>Create connector</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/4.png" alt="4" /></p>
<h2>Enable Amazon Bedrock model access</h2>
<p>For populating the required connector settings, enable Amazon Bedrock model access in the AWS console using the following steps.</p>
<p>In a new browser tab, open <a href="https://console.aws.amazon.com/bedrock/">Amazon Bedrock</a> and click the <strong>Get started</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/5.png" alt="5" /></p>
<p>Currently, access to the Amazon Bedrock foundation models is granted by requesting access using the Bedrock <strong>Model access</strong> section in the AWS console.</p>
<p>Select <strong>Model access</strong> from the navigation menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/6.png" alt="6" /></p>
<p>To request access, select the foundation models that you want to access and click the <strong>Save Changes</strong> button. For this blog post, we will choose the Anthropic Claude models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/7.png" alt="7" /></p>
<p>Once granted, the <strong>Manage model access</strong> settings will indicate that you have access.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/8.png" alt="8" /></p>
<h3>Create AWS IAM User</h3>
<p>Create an <a href="https://aws.amazon.com/iam/">IAM</a> user and assign it a role with <a href="https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonBedrockFullAccess.html">Amazon Bedrock full access</a> and also <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">generate an IAM access key and secret key</a> in the console. If you already have an IAM user with a generated access key and secret key, you can use the existing credentials to access Amazon Bedrock.</p>
<h3>Configure Elastic connector to use Amazon Bedrock</h3>
<p>Back in the Elastic Cloud deployment create connector flyout, select the connector for Amazon Bedrock.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/9.png" alt="9" /></p>
<p>Enter a <strong>Name</strong> of your choice for the connector. Also, enter the <strong>Access Key</strong> and <strong>Key Secret</strong> that you copied in a previous step. Click the <strong>Save &amp; test</strong> button to create the connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/10.png" alt="10" /></p>
<p>Within the <strong>Edit Connector</strong> flyout window, click the <strong>Run</strong> button to confirm that the connector configuration is valid and can successfully connect to your Amazon Bedrock instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/11.png" alt="11" /></p>
<p>You should see confirmation that the connector test was successful.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/12.png" alt="12" /></p>
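<p>If you script your deployments, the same connector can also be created through Kibana's connectors API instead of the UI. The snippet below only builds the request body as a sketch; the <code>config</code> and <code>secrets</code> field names for the Bedrock connector type (<code>apiUrl</code>, <code>defaultModel</code>, <code>accessKey</code>, <code>secret</code>) are assumptions that you should verify against the connectors API documentation for your stack version:</p>
<pre><code class="language-python">import json

# Illustrative request body for POST /api/actions/connector in Kibana.
# The config/secrets field names are assumptions; check the connectors
# API docs for your stack version before relying on them.
def bedrock_connector_payload(name, access_key, secret_key,
                              region="us-east-1",
                              model="anthropic.claude-3-sonnet-20240229-v1:0"):
    return {
        "name": name,
        "connector_type_id": ".bedrock",
        "config": {
            "apiUrl": "https://bedrock-runtime." + region + ".amazonaws.com",
            "defaultModel": model,
        },
        "secrets": {"accessKey": access_key, "secret": secret_key},
    }

payload = bedrock_connector_payload("my-bedrock", "AKIA-EXAMPLE", "SECRET-EXAMPLE")
print(json.dumps(payload, indent=2))
</code></pre>
<p>You would send this body to <code>/api/actions/connector</code> on your Kibana endpoint, along with a <code>kbn-xsrf</code> header and your credentials.</p>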
<h3>Add an example logs record</h3>
<p>Now that the connector is configured, let's add a logs record to demonstrate how the Elastic AI Assistant can help you to better understand the diverse types of information contained within logs.</p>
<p>Use the Elastic Dev Tools to add a single logs record. Click the top-level menu and select <strong>Dev Tools</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/13.png" alt="13" /></p>
<p>Within the console area of Dev Tools, enter the following POST statement:</p>
<pre><code class="language-json">POST /logs-elastic_agent-default/_doc
{
    &quot;message&quot;: &quot;Status(StatusCode=\&quot;BadGateway\&quot;, Detail=\&quot;Error: The server encountered a temporary error and could not complete your request\&quot;).&quot;,
    &quot;@timestamp&quot;: &quot;2024-04-21T10:33:00.884Z&quot;,
    &quot;log&quot;: {
        &quot;level&quot;: &quot;error&quot;
    },
    &quot;service&quot;: {
        &quot;name&quot;: &quot;proxyService&quot;
    },
    &quot;host&quot;: {
        &quot;name&quot;: &quot;appserver-2&quot;
    }
}
</code></pre>
<p>Then run the POST command by clicking the green <strong>Run</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/14.png" alt="14" /></p>
<p>You should see a 201 response confirming that the example logs record was successfully created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/15.png" alt="15" /></p>
<h3>Use the Elastic AI Assistant</h3>
<p>Now that you have a log entry, let’s use the AI Assistant to see how it interacts with logs data. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/16.png" alt="16" /></p>
<p>Select <strong>Logs Explorer</strong> under Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/17.png" alt="17" /></p>
<p>In the Logs Explorer search box, enter the text “badgateway” and press the <strong>Enter</strong> key to perform the search.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/18.png" alt="18" /></p>
<p>Click the <strong>View all matches</strong> button to include all search results.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/19.png" alt="19" /></p>
<p>You should see the one log record that you previously inserted via Dev Tools. Click the expand icon in the <strong>actions</strong> column to see the log record’s details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/20.png" alt="20" /></p>
<p>You should see the expanded view of the logs record. Let’s use the AI Assistant to summarize it. Click on the <strong>What's this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/21.png" alt="21" /></p>
<p>We get a fairly generic answer back. Depending on the exception or error we're trying to analyze, this can still be really useful, but we can improve this response by adding additional documentation to the AI Assistant knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/22.png" alt="22" /></p>
<p>Let’s add an entry in AI Assistant’s knowledge base to improve its understanding of this specific logs message.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/23.png" alt="23" /></p>
<p>Click the <strong>AI Assistant</strong> button at the top right of the window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/24.png" alt="24" /></p>
<p>Click the <strong>Install Knowledge base</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/25.png" alt="25" /></p>
<p>Click the top-level menu and select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/26.png" alt="26" /></p>
<p>Then select <strong>AI Assistants</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/27.png" alt="27" /></p>
<p>Click <strong>Elastic AI Assistant for Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/28.png" alt="28" /></p>
<p>Select the <strong>Knowledge base</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/29.png" alt="29" /></p>
<p>Click the <strong>New entry</strong> button and select <strong>Single entry</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/30.png" alt="30" /></p>
<p>Give it the <strong>Name</strong> “proxyservice” and enter the following text as the <strong>Contents</strong> :</p>
<pre><code class="language-markdown">
I have the following runbook located on GitHub. Store this information in your knowledge base and always include the link to the runbook in your response if the topic is related to a bad gateway error.

Runbook Link: https://github.com/elastic/observability-aiops/blob/main/ai_assistant/runbooks/slos/502-errors.md

Runbook Title: Handling 502 Bad Gateway Errors

Summary: This is likely an issue with Nginx proxy configuration

Body: This runbook provides instructions for diagnosing and resolving 502 Bad Gateway errors in your system.
</code></pre>
<p>Click <strong>Save</strong> to save the new knowledge base entry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/31.png" alt="31" /></p>
<p>Now let’s go back to the Observability Logs Explorer. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/32.png" alt="32" /></p>
<p>Then select <strong>Explorer</strong> under <strong>Logs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/33.png" alt="33" /></p>
<p>Expand the same logs entry as you did previously and click the <strong>What’s this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/34.png" alt="34" /></p>
<p>The response you get now should be much more relevant.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/35.png" alt="35" /></p>
<h2>Try out the Elastic AI Assistant with a knowledge base filled with your own data</h2>
<p>Now you’ve seen the complete process of connecting the Elastic AI Assistant to Amazon Bedrock. You’ve also seen how to use the AI Assistant’s knowledge base to store custom remediation documentation like runbooks that the AI Assistant can leverage to generate more helpful responses. Steps like this can help you remediate issues more quickly when they happen. Try out the Elastic AI Assistant with your own logs and custom knowledge base.</p>
<p>Start a <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/AI_hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[The Elastic AI Assistant for Observability escapes Kibana!]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-escapes-kibana</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-escapes-kibana</guid>
            <pubDate>Mon, 08 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Harness the Elastic AI Assistant API to seamlessly blend Elastic's Observability capabilities into your daily workflow, from Slack to the command line, boosting efficiency and decision-making. Work smarter, not harder.]]></description>
            <content:encoded><![CDATA[<p><em>Note: The API described below is currently under development and undocumented, and thus it is not supported. Consider this a forward-looking blog. Features are not guaranteed to be released.</em></p>
<p>Elastic, time-saving assistants, generative models, APIs, Python, and the potential to show a new way of working with our technology? Of course, I would move this to the top of my project list!</p>
<p>If 2023 was the year of figuring out generative AI and retrieval augmented generation (RAG), then 2024 will be the year of productionalizing generative AI RAG applications. Companies are beginning to publish references and architectures, and businesses are integrating generative applications into their lines of business.</p>
<p>Elastic is following suit by integrating not one but two AI Assistants into Kibana: one in <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability</a> and one in <a href="https://www.elastic.co/guide/en/security/current/security-assistant.html">Security</a>. Today, we will be working with the former.</p>
<h2>The Elastic AI Assistant for Observability</h2>
<p>What is the Observability AI Assistant? Allow me to <a href="https://www.elastic.co/guide/en/security/current/security-assistant.html">quote the documentation</a>:</p>
<p><em>The AI Assistant uses generative AI to provide:</em></p>
<ul>
<li>
<p><strong>Contextual insights:</strong> <em>Open prompts throughout Observability that explain errors and messages and suggest remediation. This includes your own GitHub issues, runbooks, architectural images, etc. Essentially, anything internal that is useful for the SRE and stored in Elastic can be used to suggest resolution.</em> <a href="https://www.elastic.co/blog/sre-troubleshooting-ai-assistant-observability-runbooks"><em>Elastic AI Assistant for Observability uses RAG to get the most relevant internal information</em></a><em>.</em></p>
</li>
<li>
<p><strong>Chat:</strong> <em>Have conversations with the AI Assistant. Chat uses function calling to request, analyze, and visualize your data.</em></p>
</li>
</ul>
<p>In other words, it's a chatbot built into the Observability section of Kibana, allowing SREs and operations people to perform their work faster and more efficiently. In the theme of integrating generative AI into lines of business, these AI Assistants are integrated seamlessly into Kibana.</p>
<h2>Why “escape” Kibana?</h2>
<p>Kibana is a powerful tool, offering many functions and uses. The Observability section has rich UIs for logs, metrics, APM, and more. As much as I believe people in operations, SREs, and the like can get the majority of their work done in Kibana (given Elastic is collecting the relevant data), having worked in the real world, I know just about everyone has multiple tools they work with.</p>
<p>We want to integrate with people’s workflows as much as we want them to integrate with Elastic. As such, providing API access to the AI Assistants allows Elastic to meet you where you spend most of your time. Be it Slack, Teams, or any other app that can integrate with an API.</p>
<h2>API overview</h2>
<p>Enter the AI Assistant API. The API provides most of the functionality and efficiencies the AI Assistant brings in Kibana. Since the API handles most of the functionality, it’s like having a team of developers working to improve and develop new features for you.</p>
<p>The API provides access to ask questions in natural language via <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a> and a group of functions the large language model (LLM) can use to gather additional information from Elasticsearch, all out of the box.</p>
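<p>To make the shape of this concrete: a chat call boils down to posting a connector ID and a list of messages to a Kibana endpoint. Because the API is undocumented and unsupported (see the note at the top), the endpoint path and field names below are assumptions for illustration only, not a stable contract:</p>
<pre><code class="language-python">import json

# Hypothetical sketch of an AI Assistant chat request. The endpoint path
# and body schema are assumptions; the API is undocumented and may change.
KIBANA_URL = "https://my-deployment.kb.us-east-1.aws.elastic.cloud"  # hypothetical
CHAT_PATH = "/internal/observability_ai_assistant/chat/complete"     # assumed

def chat_request(question, connector_id):
    return {
        "url": KIBANA_URL + CHAT_PATH,
        "headers": {"kbn-xsrf": "true", "Content-Type": "application/json"},
        "body": {
            "connectorId": connector_id,
            "messages": [
                {"message": {"role": "user", "content": question}}
            ],
        },
    }

req = chat_request("Are there any active alerts?", "my-bedrock-connector")
print(json.dumps(req["body"], indent=2))
</code></pre>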
<h2>Command line</h2>
<p>Enough talk; let’s look at some examples!</p>
<p>The first example of using the AI Assistant outside of Kibana is on the command line. This command-line script allows you to ask questions and get responses. Essentially, the script uses the Elastic API to enable AI Assistant interactions on your CLI, outside of Kibana. Credit for this script goes to Almudena Sanz Olivé, senior software engineer on the Observability team, and of course to the rest of the development team for creating the assistant! NOTE: The AI Assistant API is not yet public, but Elastic is working on potentially releasing it. Stay tuned.</p>
<p>The script prints API information on a new line each time the LLM calls a function or Kibana runs a function to provide additional information about what is happening behind the scenes. The generated answer will also be written on a new line.</p>
<p>There are many ways to start a conversation with the AI Assistant. Let’s imagine I work for an ecommerce company and just checked some code into GitHub. I realize I need to check whether there are any active alerts that need to be worked on. Since I’m already on the command line, I can run the AI Assistant CLI and ask it to check for me.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/1.png" alt="Asking the AI Assistant to list all active alerts." /></p>
<p>There are nine active alerts. It's not the worst count I’ve seen by a long shot, but they should still be addressed. There are many ways to start here, but the one that caught my attention first was related to the SLO burn rate on the service-otel cart. This service handles our customers' checkout procedures.</p>
<p>I could ask the AI Assistant to investigate this more for me, but first, let me check if there are any runbooks our SRE team has loaded into the AI Assistant’s knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/2.png" alt="Ask the AI Assistant to check if there are runbooks to handle issues with a service." /></p>
<p>Fantastic! I can call my co-worker Luca Wintergerst and have him fix it. While I prefer tea these days, I’ll follow step two and grab a cup of coffee.</p>
<p>With that handled, let’s go have some fun with SlackBots.</p>
<h2>Slackbots</h2>
<p>Before coming to Elastic, I worked at E*Trade, where I was on a team responsible for managing several large Elasticsearch clusters. I spent a decent amount of time working in Kibana; however, as we worked on other technologies, I spent much more time outside of Kibana. One app I usually had open was Slack. Long story short, <a href="https://www.elastic.co/elasticon/tour/2018/chicago/elastic-at-etrade">I wrote a Slackbot</a> (skip to the 05:22 mark to see a brief demo of it) that could perform many operations with Elasticsearch.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/3.png" alt="Slackbot circa 2018 reporting on Elastic ML Anomalies for trade transactions by stock symbol" /></p>
<p>This worked really well. The only problem was writing all the code, including implementing basic natural language processing (NLP). All the searches were hard-coded, and the list of tasks was static.</p>
<h3>Creating an AI Slackbot today</h3>
<p>Implementing a Slackbot with the AI Assistant's API is far more straightforward today. The interaction with the bot is the same as we saw with the command-line interface, except that we are in Slack.</p>
<p>To start things off, I created a new Slackbot and named it <em>obsBurger</em>. I’m a Bob’s Burgers fan, and observability can be considered a stack of data. The Observability Burger, obsBurger for short, was born. This is the bot that connects directly to the AI Assistant API and performs all the same functions available within Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/4.png" alt="Just like in Kibana, I can ask ObsBurger (the AI Assistant) for a list of active alerts" /></p>
<h3>More bots!</h3>
<p>Connecting my Slackbot to the AI Assistant's API was so easy that I started brainstorming ideas to entertain myself.</p>
<p>Various personas will benefit from using the AI Assistant, especially Level One (L1) operations analysts. These people are generally new to observability and would typically need a lot of mentoring by a more senior employee to ramp up quickly. We could pretend to be an L1, test the Slackbot, or have fun with LLMs and prompt engineering!</p>
<p>I created a new Slackbot called <em>opsHuman</em>. This bot connects directly to Azure OpenAI using the same model the AI Assistant is configured to use. This virtual L1 uses the system prompt instructing it to behave as such.</p>
<p>You are OpsHuman, styled as a Level 1 operations expert with limited expertise in observability.<br />
Your primary role is to simulate a beginner's interaction with Elasticsearch Observability.</p>
<p>The full prompt is much longer and instructs how the LLM should behave when interacting with our AI Assistant.</p>
<h3>Let’s see it in action!</h3>
<p>To kick off the bot’s conversation, we “@” mention opsHuman, with the trigger command shiftstart, followed by the question we want our L1 to ask the AI Assistant.</p>
<p>@OpsHuman shiftstart are there any active alerts?</p>
<p>From there, OpsHuman will take our question and start a conversation with obsBurger, the AI Assistant.</p>
<p>@ObsBurger are there any active alerts?</p>
<p>From there, we sit back and let one of history's most advanced generative AI language models converse with itself!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/5.png" alt="Triggering the start of a two-bot conversation." /></p>
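<p>The bot-to-bot hand-off is mostly string plumbing: strip the leading @-mention, peel off the trigger command if present, and forward the remaining question. A minimal sketch of that parsing step (illustrative, not the actual bot code):</p>
<pre><code class="language-python"># Parse a Slack-style mention such as
#   "@OpsHuman shiftstart are there any active alerts?"
# into (trigger, question). Illustrative sketch, not the demo's real code.
def parse_mention(text, trigger="shiftstart"):
    text = text.strip()
    if text.startswith("@"):
        # Drop the "@botname" token
        parts = text.split(None, 1)
        text = parts[1] if len(parts) > 1 else ""
    if text.lower().startswith(trigger):
        return trigger, text[len(trigger):].strip()
    return None, text

cmd, question = parse_mention("@OpsHuman shiftstart are there any active alerts?")
print(cmd, question)
</code></pre>
<p>With the trigger stripped, the bot simply re-posts the question as an @-mention to the other bot, which is what keeps the conversation going.</p>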
<p>It’s fascinating to watch this conversation unfold. This is the same generative model, GPT-4-turbo, responding to two sets of API calls, with only different prompt instructions guiding the style and sophistication of the responses. When I first set this up, I watched the interaction several times, using a variety of initial questions to start the conversation. Most of the time, the L1 will spend several rounds asking questions about what the alerts mean, what a type of APM service does, and how to investigate and ultimately remediate any issue.</p>
<p>Because I initially didn’t have a way to actually stop the conversation, the two sides would agree they were happy with the conversation and investigation and get into a loop thanking the other.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/6.png" alt="Neither Slackbot wants to be the one to hang up first" /></p>
<h3>Iterating</h3>
<p>To give a little more structure to this currently open-ended demo, I set up a scenario where L1 is asked to perform an investigation, is given three rounds of interactions with obsBurger to collect information, and finally generates a summary report of the situation, which could be passed to Level 2 (note there is no L2 bot at this point in time, but you could program one!).</p>
<p>Once again, we start by having opsHuman investigate if there are any active alerts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/7.png" alt="Starting the investigation" /></p>
<p>Several rounds of investigation are performed until our limit has been reached. At that time, it will generate a summary of the situation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/8.png" alt="Level One, OpsHuman, summarizing the investigation" /></p>
<h2>How about something with a real-world application?</h2>
<p>As fun as watching two Slackbots talk to each other is, having an L1 speak to an AI Assistant isn’t very useful beyond a demo. So, I decided to see if I could modify opsHuman to be more beneficial for real-world applications.</p>
<p>The two main changes for this experiment were:</p>
<ol>
<li>
<p>Flip the profile of the bot from an entry-level personality to an expert.</p>
</li>
<li>
<p>Allow the number of interactions to expand, but encourage the bot to use as few as possible.</p>
</li>
</ol>
<p>With those points in mind, I cloned opsHuman into opsExpert and modified the prompt to be an expert in all things Elastic and observability.</p>
<p>You are OpsMaster, recognized as a senior operations and observability expert with extensive expertise in Elasticsearch, APM (Application Performance Monitoring), logs, metrics, synthetics, alerting, monitoring, OpenTelemetry, and infrastructure management.</p>
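<p>Mechanically, the persona flip is just a different system prompt in an otherwise identical chat request. A hedged sketch (the message format follows the OpenAI-style chat API; the prompt strings are abbreviated):</p>

```python
# Illustrative only: the persona swap is a different system prompt in an
# otherwise identical request. Prompt strings are abbreviated.

OPS_HUMAN = "You are OpsHuman, an entry-level L1 operations analyst."
OPS_EXPERT = ("You are OpsMaster, a senior operations and observability expert "
              "with extensive expertise in Elasticsearch, APM, logs, metrics, "
              "synthetics, alerting, monitoring, OpenTelemetry, and "
              "infrastructure management.")

def build_request(system_prompt: str, user_question: str) -> dict:
    # OpenAI-style chat payload: same model, only the system message differs.
    return {
        "model": "gpt-4-turbo",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_question},
        ],
    }

novice = build_request(OPS_HUMAN, "Are there any active alerts?")
expert = build_request(OPS_EXPERT, "Are there any active alerts?")
```

<p>Everything else about the two bots — model, API calls, conversation plumbing — stays the same; only the system message guides the style and sophistication of the replies.</p>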
<p>I started with the same command: Are there any active alerts? After getting the list of alerts, OpsExpert dove into data collection for its investigation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/9.png" alt="9 - opsexpert" /></p>
<p>After obsBurger (the AI Assistant) provided the requested information, OpsExpert investigated the two services that appeared to be the root of the alerts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/10.png" alt="10- opsexpert standby" /></p>
<p>After several more back-and-forth requests for and deliveries of relevant information, OpsExpert reached a conclusion for the active alerts related to the checkout service and wrote up a summary report.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/11.png" alt="11 - paymentservice" /></p>
<h2>Looking forward</h2>
<p>This is just one example of what you can accomplish by bringing the AI Assistant to where you operate. You could take this one step further and have it actually open an issue on GitHub:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/12.png" alt="12. -github issue created" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/13.png" alt="13 - jeffvestal commented" /></p>
<p>Or integrate it into any other tracking platform you use!</p>
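<p>For illustration, creating that GitHub issue can be sketched with GitHub’s documented REST endpoint (<code>POST /repos/{owner}/{repo}/issues</code>); the owner, repo, and token below are placeholders:</p>

```python
import json
import urllib.request

# Sketch of the issue-creation step using GitHub's REST API
# (POST /repos/{owner}/{repo}/issues). Owner, repo, and token are placeholders.

def open_issue_request(owner: str, repo: str, token: str,
                       title: str, body: str) -> urllib.request.Request:
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )

# The assistant's summary report would go in `body`; calling
# urllib.request.urlopen(req) would submit it.
req = open_issue_request("example-org", "example-repo", "TOKEN",
                         "Checkout service alert investigation",
                         "Summary generated by the AI Assistant")
```

<p>Any other tracker with an HTTP API could be wired up the same way.</p>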
<p>The team is focused on building functionality into the Kibana integration, so this is just the beginning of the API. As time progresses, new functionality will be added. Even at a preview stage, I hope this starts you thinking about how having a fully developed Observability AI Assistant accessible by a standard API can make your work life even easier. It could get us closer to my dream of sitting on a beach handling incidents from my phone!</p>
<h2>Try it yourself!</h2>
<p>You can explore the API yourself if running Elasticsearch version 8.13 or later. The demo code I used for the above examples is <a href="https://github.com/jeffvestal/obsburger">available on GitHub</a>.</p>
<p>As a reminder: as of Elastic version 8.13, current when this blog was written, the API is pre-beta and therefore unsupported. Use it with care, and do not use it in production yet.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/Running_away.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-microsoft-azure-openai</guid>
            <pubDate>Wed, 03 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to get started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI.]]></description>
<content:encoded><![CDATA[<p>Recently, Elastic <a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-12-0">announced</a> that the AI Assistant for Observability is now generally available for all Elastic users. The AI Assistant adds a new tool to Elastic Observability, providing large language model (LLM)-connected chat and contextual insights that explain errors and suggest remediation. Similar to how Microsoft Copilot is an AI companion that introduces new capabilities and increases developer productivity, the Elastic AI Assistant is an AI companion that can help you quickly gain additional value from your observability data.</p>
<p>This blog post presents a step-by-step guide to setting up the AI Assistant for Observability with Azure OpenAI as the backing LLM. Once the AI Assistant is set up, the post shows how to add documents to the AI Assistant’s knowledge base and demonstrates how the AI Assistant uses that knowledge base to improve its responses to specific questions.</p>
<h2>Set up the Elastic AI Assistant for Observability: Create an Azure OpenAI key</h2>
<p>Start by creating a Microsoft Azure OpenAI API key to authenticate requests from the Elastic AI Assistant. Head over to <a href="https://azure.microsoft.com/">Microsoft Azure and use an existing subscription or create a new one at the Azure portal</a>.</p>
<p>Currently, access to the Azure OpenAI service must be requested. See the <a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/quickstart?tabs=command-line%2Cpython-new&amp;pivots=programming-language-studio#prerequisites">official Microsoft documentation for the current prerequisites</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/1.png" alt="Watch what your data can do" /></p>
<p>In the Azure portal, select <strong>Azure OpenAI</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/2.png" alt="Azure OpenAI" /></p>
<p>In the Azure OpenAI service, click the <strong>Create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/3.png" alt="+Create" /></p>
<p>Enter an instance <strong>Name</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/4.png" alt="Basics Next" /></p>
<p>Select your network access preference for the Azure OpenAI instance and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/5.png" alt="Network Next" /></p>
<p>Add optional <strong>Tags</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/6.png" alt="Tags Next" /></p>
<p>Confirm your settings and click <strong>Create</strong> to create the Azure OpenAI instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/7.png" alt="Review + submit Create" /></p>
<p>Once the instance creation is complete, click the <strong>Go to resource</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/8.png" alt="go to resource" /></p>
<p>Click the <strong>Manage keys</strong> link to access the instance’s API key.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/9.png" alt="manage keys" /></p>
<p>Copy your Azure OpenAI <strong>API Key</strong> and the <strong>Endpoint</strong> and save them both in a safe place for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/10.png" alt="copy to clipboard" /></p>
<p>Next, click <strong>Model deployments</strong> to create a deployment within the Azure OpenAI instance you just created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/11.png" alt="model deployments" /></p>
<p>Click the <strong>Manage deployments</strong> button to open Azure OpenAI Studio.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/12.png" alt="manage deployments" /></p>
<p>Click the <strong>Create new deployment</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/13.png" alt="+ Create new deployment" /></p>
<p>Select the model type you want to use and enter a Deployment name. Note the Deployment name for use in a later step. Click the <strong>Create</strong> button to deploy the model.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/14.png" alt="deploy model" /></p>
<h2>Set up the Elastic AI Assistant for Observability: Create an OpenAI connector in Elastic Cloud</h2>
<p>The remainder of the instructions in this post will take place within <a href="https://cloud.elastic.co/registration">Elastic Cloud</a>. You can use an existing deployment or you can create a new Elastic Cloud deployment as a free trial if you’re trying Elastic Cloud for the first time. Another option to get started is to create an <a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/elastic.ec-azure-observability?tab=Overview">Elastic deployment from the Microsoft Azure Marketplace</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/15.png" alt="sign up trial" /></p>
<p>The next step is to create an Azure OpenAI connector in Elastic Cloud. In the <a href="https://cloud.elastic.co/home">Elastic Cloud console</a> for your deployment, select the top-level menu and then select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/16.png" alt="stack management" /></p>
<p>Select <strong>Connectors</strong> on the Stack Management page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/17.png" alt="connectors" /></p>
<p>Select <strong>Create connector</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/18.png" alt="create connector" /></p>
<p>Select the connector for Azure OpenAI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/19.png" alt="openai" /></p>
<p>Enter a <strong>Name</strong> of your choice for the connector. Select <strong>Azure OpenAI</strong> as the OpenAI provider.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/20.png" alt="openai connector" /></p>
<p>Enter the Endpoint URL using the following format:</p>
<ul>
<li>
<p>Replace <code>{your-resource-name}</code> with the <strong>name of the Azure Open AI instance</strong> that you created within the Azure portal in a previous step.</p>
</li>
<li>
<p>Replace <code>{deployment-id}</code> with the <strong>Deployment name</strong> that you specified when you created a model deployment within the Azure portal in a previous step.</p>
</li>
<li>
<p>Replace <code>{api-version}</code> with one of the valid <strong>Supported versions</strong> listed in the <a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/reference">Completions section of the Azure OpenAI reference page</a>.</p>
</li>
</ul>
<pre><code class="language-bash">https://{your-resource-name}.openai.azure.com/openai/deployments/{deployment-id}/chat/completions?api-version={api-version}
</code></pre>
<p>Your completed Endpoint URL should look something like this:</p>
<pre><code class="language-bash">https://example-openai-instance.openai.azure.com/openai/deployments/gpt-4-turbo/chat/completions?api-version=2024-02-01
</code></pre>
<p>Enter the API Key that you copied in a previous step. Then click the <strong>Save &amp; test</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/21.png" alt="save &amp; test" /></p>
<p>Within the <strong>Edit Connector</strong> flyout window, click the <strong>Run</strong> button to confirm that the connector configuration is valid and can successfully connect to your Azure OpenAI instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/22.png" alt="" /></p>
<p>A successful connector test should look something like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/23.png" alt="results" /></p>
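<p>If the connector test fails, it can help to reproduce the same request outside Kibana. A minimal sketch of the call the connector makes (the resource name, deployment ID, API version, and key below are placeholders):</p>

```python
import json
import urllib.request

# Sketch of the request the Azure OpenAI connector makes; the resource name,
# deployment ID, API version, and key below are placeholders.

def azure_chat_request(resource: str, deployment: str, api_version: str,
                       api_key: str) -> urllib.request.Request:
    url = (f"https://{resource}.openai.azure.com/openai/deployments/"
           f"{deployment}/chat/completions?api-version={api_version}")
    payload = json.dumps(
        {"messages": [{"role": "user", "content": "Hello, world"}]}
    ).encode("utf-8")
    # Azure OpenAI authenticates with an `api-key` header (not `Authorization`).
    return urllib.request.Request(url, data=payload, method="POST",
                                  headers={"api-key": api_key,
                                           "Content-Type": "application/json"})

req = azure_chat_request("example-openai-instance", "gpt-4-turbo",
                         "2024-02-01", "YOUR_API_KEY")
# urllib.request.urlopen(req) would return the model's JSON response.
```

<p>A 200 response with a chat completion confirms the endpoint, deployment, and key are all correct, which isolates any remaining problem to the connector configuration itself.</p>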
<h2>Add an example logs record</h2>
<p>Now that you have your Elastic Cloud deployment set up with an AI Assistant connector, let’s add an example logs record to demonstrate how the AI Assistant can help you to better understand logs data.</p>
<p>We’ll use the Elastic Dev Tools to add a single logs record. Click the top-level menu and select <strong>Dev Tools</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/24.png" alt="dev tools" /></p>
<p>Within the Console area of Dev Tools, enter the following POST statement:</p>
<pre><code class="language-bash">POST /logs-elastic_agent-default/_doc
{
  &quot;message&quot;: &quot;Status(StatusCode=\&quot;FailedPrecondition\&quot;, Detail=\&quot;Can't access cart storage. \nSystem.ApplicationException: Wasn't able to connect to redis \n  at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104 \n  at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168\&quot;).&quot;,
  &quot;@timestamp&quot;: &quot;2024-02-22T11:34:00.884Z&quot;,
  &quot;log&quot;: {
    &quot;level&quot;: &quot;error&quot;
  },
  &quot;service&quot;: {
    &quot;name&quot;: &quot;cartService&quot;
  },
  &quot;host&quot;: {
    &quot;name&quot;: &quot;appserver-1&quot;
  }
}
</code></pre>
<p>Then run the POST command by clicking the green <strong>Run</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/25.png" alt="click to send request" /></p>
<p>You should see a 201 response confirming that the example logs record was successfully created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/26.png" alt="201 response" /></p>
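<p>The Dev Tools console is convenient for a single record; to seed several sample records at once, you can build a <code>_bulk</code> request body instead (NDJSON: an action line followed by a document line, ending with a newline). A sketch using a simplified version of the document above:</p>

```python
import json
from datetime import datetime, timedelta, timezone

# Sketch: build a `_bulk` NDJSON body for several simplified log records
# targeting the same `logs-elastic_agent-default` data stream. Data streams
# require the `create` action.

def build_bulk_body(n: int) -> str:
    start = datetime(2024, 2, 22, 11, 34, tzinfo=timezone.utc)
    lines = []
    for i in range(n):
        lines.append(json.dumps({"create": {"_index": "logs-elastic_agent-default"}}))
        lines.append(json.dumps({
            "@timestamp": (start + timedelta(minutes=i)).isoformat(),
            "message": "Wasn't able to connect to redis",
            "log": {"level": "error"},
            "service": {"name": "cartService"},
            "host": {"name": "appserver-1"},
        }))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = build_bulk_body(3)
```

<p>POST the resulting string to <code>/_bulk</code> with <code>Content-Type: application/x-ndjson</code>.</p>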
<h2>Use the Elastic AI Assistant</h2>
<p>Now that you have a log record to work with, let’s jump over to the Observability Logs Explorer to see how the AI Assistant interacts with logs data. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/27.png" alt="observability" /></p>
<p>Select <strong>Logs Explorer</strong> to explore the logs data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/28.png" alt="explorer" /></p>
<p>In the Logs Explorer search box, enter the text “redis” and press the <strong>Enter</strong> key to perform the search.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/29.png" alt="redis" /></p>
<p>Click the <strong>View all matches</strong> button to include all search results.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/30.png" alt="view all matches" /></p>
<p>You should see the one log record that you previously inserted via Dev Tools. Click the expand icon to see the log record’s details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/31.png" alt="expand icon" /></p>
<p>You should see the expanded view of the logs record. Instead of trying to understand its contents ourselves, we'll use the AI Assistant to summarize it. Click on the <strong>What's this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/32.png" alt="What's this message?" /></p>
<p>We get a fairly generic answer back. Depending on the exception or error we're trying to analyze, this can still be really useful, but we can make this better by adding additional documentation to the AI Assistant knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/33.png" alt="log details" /></p>
<p>Let’s see how we can use the AI Assistant’s knowledge base to improve its understanding of this specific logs message.</p>
<h2>Create an Elastic AI Assistant knowledge base</h2>
<p>Select <strong>Overview</strong> from the <strong>Observability</strong> menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/34.png" alt="Select Overview from the Observability menu." /></p>
<p>Click the <strong>AI Assistant</strong> button at the top right of the window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/35.png" alt="AI Assistant" /></p>
<p>Click the <strong>Install Knowledge base</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/36.png" alt="Install Knowledge base" /></p>
<p>Click the top-level menu and select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/37.png" alt="Stack Management" /></p>
<p>Then select <strong>AI Assistants</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/38.png" alt="AI Assistants" /></p>
<p>Click <strong>Elastic AI Assistant for Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/39.png" alt="Elastic AI Assistant for Observability" /></p>
<p>Select the <strong>Knowledge base</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/40.png" alt="Knowledge base" /></p>
<p>Click the <strong>New entry</strong> button and select <strong>Single entry</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/41.png" alt="new entry" /></p>
<p>Give it the <strong>Name</strong> “cartservice” and enter the following text as the <strong>Contents</strong> :</p>
<pre><code class="language-markdown">Link: [Cartservice Intermittent connection issue](https://github.com/elastic/observability-examples/issues/25)
I have the following GitHub issue. Store this information in your knowledge base and always return the link to it if relevant.
GitHub Issue, return if relevant

Link: https://github.com/elastic/observability-examples/issues/25

Title: Cartservice Intermittent connection issue

Body:
The cartservice occasionally encounters storage errors due to an unreliable network connection.

The errors typically indicate a failure to connect to Redis, as seen in the error message:

Status(StatusCode=&quot;FailedPrecondition&quot;, Detail=&quot;Can't access cart storage.
System.ApplicationException: Wasn't able to connect to redis
at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104
at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168')'.
I just talked to the SRE team in Slack, they have plans to implement retries as a quick fix and address the network issue later.
</code></pre>
<p>Click <strong>Save</strong> to save the new knowledge base entry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/42.png" alt="save" /></p>
<p>Now let’s go back to the Observability Logs Explorer. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/43.png" alt="settings" /></p>
<p>Then select <strong>Explorer</strong> under <strong>Logs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/44.png" alt="explorer" /></p>
<p>Expand the same logs entry as you did previously and click the <strong>What’s this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/45.png" alt="What’s this message? button" /></p>
<p>The response you get now should be much more relevant.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/46.png" alt="log details" /></p>
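<p>Why does this work? Conceptually, the knowledge base behaves like retrieval-augmented generation: entries relevant to your question are retrieved and added to the model’s context before it answers. A toy sketch of the idea (word overlap standing in for the real semantic search; this is not Elastic’s implementation):</p>

```python
# Toy sketch of retrieval-augmented generation; NOT Elastic's implementation.
# Word overlap stands in for real semantic search over knowledge base entries.

KNOWLEDGE_BASE = [
    "Cartservice intermittent Redis connection issue; SRE team plans retries.",
    "Payment service latency runbook.",
]

def retrieve(question: str, docs: list[str]) -> str:
    # Pick the entry sharing the most words with the question.
    q = set(question.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(question: str) -> str:
    # Prepend the retrieved entry so the model can ground its answer in it.
    context = retrieve(question, KNOWLEDGE_BASE)
    return f"Context: {context}\n\nQuestion: {question}"

prompt = build_prompt("Why can't cartservice connect to redis?")
```

<p>With the GitHub issue text retrieved into context, the model can point at the known root cause and the SRE team’s planned fix instead of giving a generic answer.</p>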
<h2>Try out the Elastic AI Assistant with a knowledge base filled with your own data</h2>
<p>Now that you’ve seen how easy it is to set up the Elastic AI Assistant for Observability, go ahead and give it a try yourself. Sign up for a <a href="https://cloud.elastic.co/registration">free 14-day trial</a>. You can spin up an Elastic Cloud deployment in minutes and have your own search-powered AI knowledge base to help you get your most important work done.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/AI_hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic APM for iOS and Android Native apps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps</link>
            <guid isPermaLink="false">apm-ios-android-native-apps</guid>
            <pubDate>Thu, 08 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog provides an overview of the key capabilities included in the Elastic APM solution for iOS and Android native apps, as well as a walkthrough of the configuration details and troubleshooting workflow for a few error scenarios.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><strong>WARNING</strong>: This article shows information about the Android agent that is no longer accurate for versions <code>1.x</code>. Please refer to <a href="https://www.elastic.co/docs/reference/apm/agents/android">its documentation</a> to learn about its new APIs.</p>
</blockquote>
<p>Elastic® APM for iOS and Android native apps is generally available in stack release v8.12. The Elastic <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents are open source and are built on top of (i.e., as distributions of) the OpenTelemetry Swift and Android SDK/API, respectively.</p>
<h2>Overview of the Mobile APM solution</h2>
<p>The OpenTelemetry SDKs/APIs for iOS and Android support capabilities such as auto-instrumentation of HTTP requests, an API for manual instrumentation, a data model based on the OpenTelemetry semantic conventions, and buffering. Additionally, the Elastic APM agent distributions support an easier initialization process and novel features such as remote configuration and user-session-based sampling. Because the Elastic <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents are <em>distributions</em>, they are maintained per Elastic’s standard support T&amp;Cs.</p>
<p>Kibana® provides curated, pre-built dashboards for monitoring, data analysis, and troubleshooting. The <strong>Service Overview</strong> view shown below provides relevant frontend KPIs such as crash rate, HTTP requests, average app load time, and more, including the comparison view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/1.png" alt="1 - comparison view" /></p>
<p>Further, the geographic distribution of user traffic is available on a map at a country and regional level. The service overview dashboard also shows trends of metrics such as throughput, latency, failed transaction rate, and distribution of traffic by device make-model, network connection type, and app version.</p>
<p>The <strong>Transactions</strong> view shown below highlights the performance of the different transaction groups, including the distributed trace end-to-end of individual transactions with links to associated spans, errors and crashes. Further, users can see at a glance the distribution of traffic by device make and model, app version, and OS version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/2.png" alt="2- opbeans android" /></p>
<p>Tabular views such as the one highlighted below, located at the bottom of the <strong>Transactions</strong> tab, make it relatively easy to see how device make and model, app version, etc., impact latency and crash rate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/3.png" alt="3 - latency and crash rate" /></p>
<p>The <strong>Errors &amp; Crashes</strong> view shown below can be used to analyze the different error and crash groups. The unsymbolicated (iOS) or obfuscated (Android) stacktrace of the individual error or crash instance is also available in this view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/4.png" alt="4 - opbeans swift" /></p>
<p>The <strong>Service Map</strong> view shown below provides a visualization of the end-to-end service interdependencies, including any third-party APIs, proxy servers, and databases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/5.png" alt="5 - flowchart" /></p>
<p>The comprehensive pre-built dashboards for observing the mobile frontend in Kibana provide visibility into the sources of errors, crashes, and bottlenecks to ease troubleshooting of issues in the production environment. The underlying Elasticsearch® Platform also supports the ability to query raw data, build custom metrics and custom dashboards, alerting, SLOs, and anomaly detection. Altogether the platform provides a comprehensive set of tools to expedite root cause analysis and remediation, thereby facilitating a high velocity of innovation.</p>
<h2>Walkthrough of the debugging workflow for some error scenarios</h2>
<p>Next, we will provide a walkthrough of the configuration details and the troubleshooting workflow for a couple of error scenarios in iOS and Android native apps.</p>
<h3>Scenario 1</h3>
<p>In this example, we will debug a crash in an asynchronous method using Apple’s crash report <strong>symbolication</strong> as well as <strong>breadcrumbs</strong> to deduce the cause of the crash.</p>
<p><strong>Symbolication</strong><br />
In this scenario, users notice a spike in the crash occurrences of a particular crash group in the Errors &amp; Crashes tab and decide to investigate further. A new crash comes in on the Crashes tab, and the developer follows these steps to symbolicate the crash report locally.</p>
<ol>
<li>Copy the crash via the UI and paste it into a file named in the format <code>&lt;AppBinaryName&gt;_&lt;DateTime&gt;.ips</code>, for example <code>opbeans-swift_2024-01-18-114211.ips</code>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/6.png" alt="6 - Symbolication" /></p>
<ol start="2">
<li>Apple provides <a href="https://developer.apple.com/documentation/xcode/adding-identifiable-symbol-names-to-a-crash-report">detailed instructions</a> on how to symbolicate this file locally either automatically through Xcode or manually using the command line.</li>
</ol>
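<p>For the manual command-line route, Apple’s <code>atos</code> tool can resolve individual frame addresses against the app’s dSYM. The sketch below is illustrative only: the load address and frame address are placeholders you would read from the crash report, and the dSYM path depends on your build:</p>
<pre><code class="language-bash"># -o points at the DWARF binary inside the dSYM bundle,
# -l is the app binary's load address from the crash report,
# followed by the raw frame address to symbolicate (placeholder values shown)
atos -arch arm64 -o opbeans-swift.app.dSYM/Contents/Resources/DWARF/opbeans-swift -l 0x100e44000 0x100e4f6f8
</code></pre>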
<p><strong>Breadcrumbs</strong><br />
The second frame of the first thread shows that the crash is occurring in a Worker instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/7.png" alt="7 - Breadcrumbs" /></p>
<p>This instance is actually used in many places, and due to the asynchronous nature of this function, it’s not possible to determine immediately where this call is coming from. Nevertheless, we can use features of the OpenTelemetry SDK to add more context to these crashes and then put the pieces together to find the site of the crash.</p>
<p>By adding “breadcrumbs” around this Worker instance, it is possible to track down which calls to the Worker are actually associated with this crash.</p>
<p><strong>Example:</strong><br />
Create a logger provider in the Worker class as a public variable for ease of access, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/8.png" alt="8 - example code" /></p>
<p>Create breadcrumbs everywhere the Worker.doWork() function is called:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/9.png" alt="9 - Create breadcrumbs everywhere the Worker.doWork() function" /></p>
<p>Each of these breadcrumbs will use the same event <strong>name</strong>, “worker_breadcrumb”, so they can be consistently queried, and the differentiation will be done using the “<strong>source</strong>” attribute.</p>
<p>In this example, the Worker.doWork() function is being called from a CustomerRow struct (a table row which does work ‘onTapGesture’). If you were to call this method from multiple places in the CustomerRow struct, you could further differentiate the “<strong>source</strong>” attribute value, for example by appending the associated function (e.g., “CustomerRow#onTapGesture”).</p>
<p>Now that the app is reporting these breadcrumbs, we can use Discover to <strong>query</strong> for them, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/10.png" alt="10 - Discover to query" /></p>
<p><strong>Note:</strong> <em>Event <strong>names</strong> sent by the agent are translated to event <strong>action</strong> in Elastic Common Schema (ECS), so ensure the query uses this field.</em></p>
<ol>
<li>
<p>You can add a filter: <code>event.action : &quot;worker_breadcrumb&quot;</code> and it shows all events generated from this new breadcrumb.</p>
</li>
<li>
<p>You can also see the various sources: ProductRow, CustomerRow, CartRow, etc.</p>
</li>
<li>
<p>If you add <code>error.type : crash</code> to the query, you can see crashes alongside the breadcrumbs:</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/11.png" alt="11 - crashes along side the breadcrumbs" /></p>
<p>A crash and a breadcrumb next to each other in the timeline may come from completely different devices, so we need another differentiator. For each crash, we have metadata that contains the <strong>session.id</strong> associated with the crash, viewable from the Metadata tab. We can query using this <strong>session.id</strong> to ensure that the only data we are looking at in Discover is from a single user session (i.e., a single device) that resulted in the crash.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/12.png" alt="12. - session.id" /></p>
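<p>Putting the pieces together, the resulting Discover query in KQL looks roughly like the sketch below; the session ID is a placeholder for the value copied from the crash metadata, and the field names are the ones used throughout this walkthrough:</p>
<pre><code>session.id : &quot;&lt;session-id-from-crash-metadata&gt;&quot; and (event.action : &quot;worker_breadcrumb&quot; or error.type : &quot;crash&quot;)
</code></pre>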
<p>In Discover, we can now see the session event flow, on a single device, concerning the crash via the breadcrumbs, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/13.png" alt="13 - session event flow" /></p>
<p>It looks like the last breadcrumb before the crash came from the “CustomerRow” source. This gives the app developer a good place to start their root cause analysis.</p>
<h3>Scenario 2</h3>
<p><strong>Note:</strong> <em>This scenario requires the Elastic Android agent version “0.14.0” or higher.</em></p>
<p>An Android sample app has a form composed of two screens that are created using two fragments (<code>FirstPage</code> and <code>SecondPage</code>). In the first screen, the app makes a backend API call to get a key that identifies the form submission. This key is stored in memory in the app and must be available on the last screen where the form is sent; the key must be sent along with the form's data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/14.jpg" alt="14 - form submission" /></p>
<p><strong>The problem</strong><br />
We start to see a spike in crash occurrences (null pointer exceptions) in Kibana’s Errors &amp; Crashes tab that always seem to happen on the last screen of the form, when users click the &quot;FINISH&quot; button. Nevertheless, <strong>this is not always reproducible</strong>, so the root cause isn’t clear from the crash’s stacktrace alone. Here’s what it looks like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/15.png" alt="15 - stack trace" /></p>
<p>When we take a look at the code referenced in the stacktrace, this is what we can see:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/16.png" alt="16 - When we take a look at the code referenced in the stacktrace, this is what we can see:" /></p>
<p>This is the line where the crash happens, so it seems the variable “formId” (a static String located in “FirstPage”) was null by the time this code executed, raising a null pointer exception. This variable is set within the “FirstPage” fragment after the backend request retrieves the id, and the only way to reach the “SecondPage” is by passing through the “FirstPage.” So the stacktrace alone doesn’t help much: the pages must be opened in order, and the first one will always set the “formId” variable. It therefore doesn’t seem likely that formId could be null in “SecondPage.”</p>
<p><strong>Finding the root cause</strong><br />
Beyond the crash’s stacktrace, it is useful to look at complementary data that helps put the pieces together and gives a broader picture of what else happened in the app around the time of the crash. In this case, we know the form ID must come from our backend service, so we can start by ruling out an error in the backend call. We do this by checking the traces from the creation of our FirstPage fragment, where the form ID request is executed, in the Transaction details view:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/17.png" alt="17 - trace sample" /></p>
<p>The “Created” spans represent the time it took to create the first fragment. The topmost one shows the Activity creation, followed by the NavHostFragment, followed by “FirstScreen.” Not long after its creation, we see that a GET HTTP request to our backend is made to retrieve our form ID and, according to the traces, the GET request was successful. We can therefore rule out that there is an issue with the backend communication for this problem.</p>
<p>Another option is to look at the logs sent throughout the <a href="https://opentelemetry.io/docs/specs/semconv/general/session/">session</a> in which the crash occurred (we could look at all the logs coming from our app, but there would be too many to analyze for this one issue). To do so, we first copy one of the spans’ “session.id” values from the span details flyout; any span will work, since the same session ID is present in all the data sent from our app during the time the crash occurred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/18.png" alt="18 - red box highlighted" /></p>
<p><strong>Note:</strong> <em>The same session ID can also be found in the crash metadata.</em></p>
<p>Now that we have identified our session, we can open up the Logs Explorer view and take a look at all of our app’s logs within that same session, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/19.png" alt="19 - app's logs" /></p>
<p>By looking at the logs, and adding a few fields to show the app’s lifecycle status and the error types, we see the log events that are <a href="https://github.com/elastic/apm/blob/main/specs/agents/mobile/events.md">automatically collected</a> from our app. We can see the crash event at the top of the list as the latest one. We can also see our app’s lifecycle events, and if we keep scrolling through, we’ll get to some lifecycle events that are going to help find our root cause:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/20.png" alt="20 - root cause" /></p>
<p>We can see there are a couple of lifecycle events that tell us that the app was restarted during the session. This is an important hint because it means that the Android OS killed our app at some point, which is common when an app stays in the background for a while. With this information, we could try to reproduce the issue by forcing the OS to kill our app in the background and then see how it behaves when reopened from the recently opened apps menu.</p>
<p>After giving it a try, we reproduced the issue and found that the static “formId” variable was lost when the app was restarted, causing it to be null when the SecondPage fragment requested it. We can now research best practices for passing arguments to Fragments and change our code to avoid relying on static fields, instead storing and sharing values between screens, preventing this crash from happening again.</p>
<p><strong>Bonus:</strong> For this scenario, it was enough to rely on the events that are sent automatically by the APM Agent; however, if those aren’t enough for other cases, we can always send custom events in the places where we want to track the state changes of our app via the OpenTelemetry event API, as shown in the code snippet below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/21.png" alt="21 - black code box" /></p>
<h2>Make the most of your Elastic APM Experience</h2>
<p>In this post, we reviewed Elastic’s new Mobile APM solution available in 8.12. The new solution uses Elastic’s new <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents, which are open source and built as distributions of the OpenTelemetry Swift and Android SDK/API, respectively.</p>
<p>We also reviewed configuration details and the troubleshooting workflow for two error scenarios in iOS and Android native apps.</p>
<ul>
<li>
<p><strong>iOS scenario:</strong> Debug a crash in an asynchronous method using Apple’s crash report <strong>symbolication</strong> as well as <strong>breadcrumbs</strong> to deduce the cause of the crash.</p>
</li>
<li>
<p><strong>Android scenario:</strong> Analyze why users get a null pointer exception on the last screen of a form when they click the “FINISH” button. The cause isn’t always clear from the crash’s stack trace alone, and the issue isn’t easily reproducible.</p>
</li>
</ul>
<p>In both instances, we found the root cause of the crash using distributed traces from the mobile device as well as correlated logs. Hopefully this blog provided a review of how Elastic can help manage and monitor mobile native apps.</p>
<p>Elastic invites SREs and developers to experience our Mobile APM solution firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/141949-elastic-blogheaderimage.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Accelerate log analytics in Elastic Observability with Automatic Import powered by Search AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-automatic-import-logs-genai</link>
            <guid isPermaLink="false">elastic-automatic-import-logs-genai</guid>
            <pubDate>Wed, 04 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Migrate your logs to AI-driven log analytics in record time by automating custom data integrations]]></description>
<content:encoded><![CDATA[<p>Elastic is accelerating the adoption of <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a> by automating the ingestion of custom logs, which is increasingly important as the deployment of GenAI-based applications grows. These custom data sources must be ingested, parsed, and indexed effortlessly, enabling broader visibility and more straightforward root cause analysis (RCA) without requiring effort from Site Reliability Engineers (SREs). Achieving visibility across an enterprise IT environment is inherently challenging for SREs due to constant growth and change, such as new applications, added systems, and infrastructure migrations to the cloud. Until now, the onboarding of custom data has been costly and complex for SREs. With Automatic Import, SREs can concentrate on deploying, optimizing, and improving applications.</p>
<p>Automatic Import uses generative AI to automate the development of custom data integrations, reducing the time required from several days to less than 10 minutes and significantly lowering the learning curve for onboarding data. Powered by the  <a href="https://www.elastic.co/platform">Elastic Search AI Platform</a>, it provides model-agnostic access to leverage large language models (LLMs) and grounds answers in proprietary data through <a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">retrieval augmented generation (RAG)</a>. This capability is further enhanced by Elastic's expertise in enabling observability teams to utilize any type of data and the flexibility of its <a href="https://www.elastic.co/generative-ai/search-ai-lake">Search AI Lake</a>. Arriving at a crucial time when organizations face an explosion of applications and telemetry data, such as logs, Automatic Import streamlines the initial stages of data migration by simplifying data collection and normalization. It also addresses the challenges of building custom connectors, which can otherwise delay deployments, issue analysis, and impact customer experiences.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<h2>Enhancing AI Powered Observability with Automatic Import</h2>
<p><a href="https://www.elastic.co/observability">Automatic Import</a> builds on Elastic Observability’s AI-driven log analytics innovations—such as <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-getting-started.html">anomaly detection</a>, <a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-aiops.html">log rate and pattern analysis</a>, and the <a href="https://www.elastic.co/blog/introducing-elastic-ai-assistant">Elastic AI Assistant</a>—and further automates and simplifies SREs’ workflows. Automatic Import applies generative AI to automate the creation of custom data integrations, allowing SREs to focus on logs and other telemetry data. While Elastic provides over <a href="https://www.elastic.co/integrations/data-integrations">400 prebuilt data integrations</a>, Automatic Import allows SREs to extend integrations to fit their workflows and expand visibility into production environments.</p>
<p>In conjunction with Automatic Import, Elastic is introducing <a href="https://www.elastic.co/blog/ai-log-analytics-express-migration">Elastic Express Migration</a>, a commercial incentive program designed to overcome migration inertia from existing deployments and contracts, providing a faster adoption path for new customers.</p>
<p>Automatic Import leverages <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">Elastic Common Schema (ECS)</a> with public LLMs to process and analyze data in ECS format, which is also part of OpenTelemetry. Once the data is in, SREs can leverage Elastic’s RAG-based AI Assistant to solve root cause analysis (RCA) challenges in dynamic, complex environments.</p>
<h2>Configuring and using Automatic Import</h2>
<p>Automatic Import is available to everyone with an Enterprise license. Here is how it works:</p>
<ul>
<li>
<p>The user configures connectivity to an LLM and uploads sample data</p>
</li>
<li>
<p>Automatic Import then extrapolates what to expect from the data source. These log samples are paired with LLM prompts that have been honed by Elastic engineers to reliably produce conformant Elasticsearch ingest pipelines. </p>
</li>
<li>
<p>Automatic Import then iteratively builds, tests, and tweaks a custom ingest pipeline until it meets Elastic integration requirements.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-arch.png" alt="Create new integration Architecture" />
<em>Automatic Import powered by the Elastic Search AI Platform</em></p>
<p>Within minutes, a validated custom integration is created that accurately maps raw data into ECS and custom fields, populates contextual information (such as <code>related.*</code> fields), and categorizes events.</p>
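<p>Under the hood, the output is an ordinary Elasticsearch ingest pipeline. Purely as an illustration (this is a hand-written sketch with hypothetical field names, not actual Automatic Import output), a pipeline that renames a raw field to its ECS equivalent and populates a <code>related.*</code> field could look like this:</p>
<pre><code class="language-json">PUT _ingest/pipeline/my-custom-logs-sketch
{
  &quot;description&quot;: &quot;Illustrative sketch: map a raw field to ECS and populate related.*&quot;,
  &quot;processors&quot;: [
    { &quot;rename&quot;: { &quot;field&quot;: &quot;severity&quot;, &quot;target_field&quot;: &quot;log.level&quot;, &quot;ignore_missing&quot;: true } },
    { &quot;append&quot;: { &quot;field&quot;: &quot;related.ip&quot;, &quot;value&quot;: [&quot;{{{client_ip}}}&quot;], &quot;if&quot;: &quot;ctx.client_ip != null&quot; } }
  ]
}
</code></pre>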
<p>Automatic Import currently supports Anthropic models via <a href="https://www.elastic.co/guide/en/kibana/8.15/bedrock-action-type.html">Elastic’s connector for Amazon Bedrock</a>, and additional LLMs will be introduced soon. It supports JSON and NDJSON-based log formats currently.</p>
<h3>Automatic Import workflow</h3>
<p>SREs are constantly having to manage new tools and components that developers add into applications. Neo4j, for example, is a database that doesn’t have a prebuilt integration in Elastic. The following steps walk you through how to create an integration for Neo4j with Automatic Import:</p>
<ol>
<li>Start by navigating to <code>Integrations</code> -&gt; <code>Create new integration</code>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<ol start="2">
<li>Provide a name and description for the new data source.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-neo4j-setup.png" alt="Set up integration" /></p>
<ol start="3">
<li>Next, fill in other details and provide some sample data, anonymized as you see fit.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-pipline.png" alt="Set up pipeline" /></p>
<ol start="4">
<li>Click “Analyze logs” to submit integration details, sample logs, and expert-written instructions from Elastic to the specified LLM, which builds the integration package using generative AI. Automatic Import then fine-tunes the integration in an automated feedback loop until it is validated to meet Elastic requirements.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-analysis.png" alt="Analyze sample logs" /></p>
<ol start="5">
<li>Review what Automatic Import presents as recommended mappings to ECS fields and custom fields. You can easily adjust these settings if necessary.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-finished.png" alt="Review Analysis" /></p>
<ol start="6">
<li>After finalizing the integration, add it to Elastic Agent or view it in Kibana. It is now available alongside your other integrations and follows the same workflows as prebuilt integrations.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-success.png" alt="Creation complete" /></p>
<ol start="7">
<li>Upon deployment, you can begin analyzing newly ingested data immediately. Start by looking at the new Logs Explorer in Elastic Observability.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-explorer.png" alt="Look at logs" /></p>
<h2>Accelerate log analytics with Automatic Import</h2>
<p>Automatic Import lowers the time required to build and test custom data integrations from days to minutes, accelerating the switch to <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a>. Elastic Observability pairs the unique power of Automatic Import with Elastic’s deep library of prebuilt data integrations, enabling wider visibility and fast data onboarding, along with AI-based features, such as the Elastic AI Assistant to accelerate RCA and reduce operational overhead.</p>
<p>Interested in our <a href="https://www.elastic.co/splunk-replacement">Express Migration</a> program to level up to Elastic? <a href="https://www.elastic.co/splunk-interest?elektra=organic&amp;storm=CLP&amp;rogue=splunkobs-gic">Contact Elastic</a> to learn more. </p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em> </p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/elastic-auto-importv2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Traces in Discover for Deeper Application Insights in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-discover-traces-apm</link>
            <guid isPermaLink="false">elastic-discover-traces-apm</guid>
            <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic brings traces into Discover. See how you can apply the capabilities of ad-hoc data exploration and ES|QL to your tracing data.]]></description>
            <content:encoded><![CDATA[<p>In the world of observability, context is king. For years, Elastic APM has provided dedicated views and capabilities for understanding the health of your applications and services. When you need to know how your checkout service is performing, you can go straight to its dedicated page, view key metrics like latency and throughput, and directly access related transactions and errors. This entity-centric view is invaluable for targeted monitoring and diagnostics.</p>
<p>But what happens when the problem isn't neatly confined to a single service? What if you need to ask more complex, exploratory questions that span across your entire dataset? Questions like:</p>
<ul>
<li>
<p>Show me all traces where a specific user experienced a latency of over two seconds, and correlate it with any frontend errors that occurred at the same time.</p>
</li>
<li>
<p>Are there any slow database queries happening only for customers on our premium plan?</p>
</li>
<li>
<p>Which specific RPC call is the common source of failure across three different microservices?</p>
</li>
</ul>
<p>Historically, answering these questions has been possible, but it required navigating different UIs and manually piecing together clues, leading to a less-than-seamless experience, a common challenge across various observability platforms.</p>
<p>Today, we're excited to announce a key improvement for trace search and analytics. We are bringing native support for <strong>Traces into Discover</strong>, complete with an integrated trace waterfall view. You can now apply the full capabilities of ad-hoc data exploration and ES|QL to your tracing data.</p>
<h2>From Curated Views to Broader Data Exploration</h2>
<p>Discover is the primary interface for data exploration in the Elastic Stack. It's the workbench where you can freely explore, filter, and correlate all of your indexed data. By integrating traces into this environment, you can now move beyond APM's curated views and conduct more flexible investigations, searching by any trace attribute.</p>
<p>You can now easily search for individual spans or errors, filter by OpenTelemetry resource attributes and span attributes, and analyze complex scenarios, all without leaving the Discover interface you know and love.</p>
<h2>A Practical Scenario: Unraveling a Slow API</h2>
<p>Imagine a critical frontend API to place orders is experiencing intermittent slowdowns. Your team has the APM service view, which confirms the high latency, but the root cause isn't immediately obvious. The slowdown seems to be happening deep within a complex chain of microservice calls.</p>
<p>This is where the new Discover functionality is particularly useful.</p>
<p>Your investigation can now start directly in Discover with a broad ES|QL query to find the slowest transactions for that specific endpoint. The example below uses the <a href="https://otel.demo.elastic.co/">OpenTelemetry demo</a>, which you can try yourself.</p>
<p>ES|QL</p>
<pre><code class="language-sql">FROM traces-*
| WHERE span.name == &quot;oteldemo.CheckoutService/PlaceOrder&quot;
| SORT span.duration.us DESC
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/traces-discover-screenshot1-min.jpg" alt="Traces Discover Screenshot" /></p>
<p>This simple query reveals the most problematic transactions. From the results table, a single click on any trace opens a detailed, end-to-end trace waterfall view—right there in Discover. No context switching, no new browser tabs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/traces-discover-screenshot2-min.jpg" alt="Traces Discover Screenshot" /></p>
<p>The waterfall reveals that a downstream currency service is taking a long time to respond. But why? You can now refine your ES|QL query to ask a more sophisticated question, digging into the span attributes to find the specific downstream service call that is causing the bottleneck:</p>
<p>ES|QL</p>
<pre><code class="language-sql">FROM traces-*
| WHERE service.name == &quot;currency&quot; and span.name == &quot;Currency/Convert&quot;
| SORT span.duration.us DESC
</code></pre>
<p>With this query, you’ve instantly found the exact spans within the <code>currency</code> service that are impacting your place order API. You can see all span details and attributes, the duration, and the <code>trace.id</code> giving you full transaction context.</p>
<p>The workflow now uses a single tool, Discover, for an iterative process of discovery and refinement.</p>
<h2>Benefits of a Unified Experience</h2>
<p>These new capabilities simplify complex workflows that are otherwise difficult to achieve:</p>
<ul>
<li><strong>Correlate Everything:</strong> Combine trace filters with log messages, infrastructure metrics, or any other data you have in Elasticsearch. Find a slow trace and immediately see the corresponding logs from the affected pod, all in a single view.</li>
<li><strong>Enhanced Flexibility:</strong> Go beyond pre-defined filters. Use the full power of ES|QL to group, aggregate, and filter your trace data based on any attribute, providing comprehensive data analysis options.</li>
<li><strong>Integrated Experience:</strong> Move from a high-level ES|QL query to a detailed trace waterfall without ever breaking your investigative flow.</li>
</ul>
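<p>The aggregation capabilities mentioned above go beyond filtering and sorting. As a hedged sketch (reusing the index pattern and field names from the queries earlier in this post), the following ES|QL query ranks span names in the <code>currency</code> service by 95th-percentile latency:</p>
<pre><code class="language-sql">FROM traces-*
| WHERE service.name == &quot;currency&quot;
| STATS p95_us = PERCENTILE(span.duration.us, 95) BY span.name
| SORT p95_us DESC
</code></pre>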
<h2>Looking Ahead: Investigation in Discover</h2>
<p>Traces in Discover, powered by ES|QL, delivers a flexible and potent investigative toolset. This is just the beginning, with more to come, including improved correlation between spans, logs, and exceptions within Discover, additional ES|QL commands, and a more powerful UI to make analysis of complex traces easier.</p>
<p>We invite you to dive in and experience it for yourself. Bring your most complex questions and your trickiest bugs. You can now find answers more directly in Discover. This functionality is already available on Serverless. Existing users hosting Elastic themselves will need to upgrade to 8.19+ or 9.1+ to access it.</p>
<p>Try it out today on <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> or the <a href="https://otel.demo.elastic.co/">OpenTelemetry demo</a>. We look forward to hearing your feedback as you explore traces in Discover!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/cover.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distribution of OpenTelemetry Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-collector</guid>
            <pubDate>Fri, 09 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic's dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor-agnostic data collection framework.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to reinstrument their observability when switching platforms.</p>
<p>Over the past year, Elastic has made several notable contributions to the OpenTelemetry ecosystem. We <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated our Elastic Common Schema (ECS)</a> to OpenTelemetry, successfully <a href="https://opentelemetry.io/blog/2024/elastic-contributes-continuous-profiling-agent/">integrated the eBPF-based profiling agent</a>, and have consistently been one of the top contributing companies across the OpenTelemetry project. Additionally, Elastic has significantly improved upstream logging capabilities within OpenTelemetry with enhancements to key areas such as <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container logging</a>, further enhancing the framework’s robustness.</p>
<p>These efforts demonstrate our strategic focus on enhancing and expanding the capabilities of OpenTelemetry for the broader observability community and reinforce the vendor-agnostic benefits of using OpenTelemetry.</p>
<p>Today, we are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic’s dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor agnostic data collection framework.</p>
<h2>Elastic Agent as an OpenTelemetry Collector<a id="elastic-agent-as-an-opentelemetry-collector"></a></h2>
<p>Technically, the Elastic Distribution of OpenTelemetry Collector represents an evolution of the Elastic Agent. In its latest version, the Elastic Agent can operate in an OpenTelemetry mode. This mode invokes a module within the Elastic Agent which is essentially a distribution of the OpenTelemetry collector. It is crafted using a selection of upstream components from the contrib distribution.</p>
<p>The Elastic OpenTelemetry Collector also includes configuration for this set of <a href="https://github.com/elastic/elastic-agent/tree/main/internal/pkg/otel#components">upstream OpenTelemetry Collector components</a>, providing out-of-the-box functionality with Elastic Observability. This integration allows users to seamlessly utilize Elastic’s advanced observability features with minimal setup.</p>
<p>The technical preview version of the Elastic OpenTelemetry Collector has been tailored with out-of-the-box configurations for the use cases below, and we will keep adding more as we progress:</p>
<ul>
<li>
<p><strong><em>Collect and ship logs</em></strong>: Use the Elastic OpenTelemetry Collector to gather log data from various sources and ship it directly to Elastic, where it can be analyzed in Kibana Discover and Elastic Observability’s Explorer (also in Tech Preview in 8.15).</p>
</li>
<li>
<p><strong><em>Assess host health</em></strong>: Leverage the OpenTelemetry host metrics and Kubernetes receivers to evaluate the performance of hosts and pods. This data can then be visualized and analyzed in Elastic’s Infrastructure Observability UIs, providing deep insights into host performance and health. Details of how this is configured in the OTel Collector are outlined in this <a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">blog</a>.</p>
</li>
<li>
<p><strong>Kubernetes container logs</strong>: Additionally, users of the Elastic OpenTelemetry Collector benefit from out-of-the-box Kubernetes container and application logs enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> Elastic recently contributed to OTel. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
</li>
</ul>
<p>While the Elastic OpenTelemetry Collector comes pre-built and preconfigured for an easier onboarding and getting-started experience, Elastic is committed to the vision of vendor-neutral data collection. Thus, we strive to contribute Elastic-specific features back to the upstream OpenTelemetry components to advance and help grow the OpenTelemetry landscape and its capabilities.</p>
<p>Stay tuned for upcoming announcements sharing our plans to combine the best of Elastic Agent and OpenTelemetry Collector.</p>
<h2>Get started with the Elastic Distribution of OpenTelemetry Collector<a id="get-started-the-elastic-distribution-for-opentelemetry-collector"></a></h2>
<p>To get started with a guided onboarding flow for the Elastic Distribution of the OpenTelemetry Collector for Kubernetes, Linux, and Mac environments, visit the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/guided-onboarding.md">guided onboarding documentation</a>.</p>
<p>For more advanced manual configuration, follow the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/manual-configuration.md">manual configuration instructions</a>.</p>
<p>Once the Elastic Distribution of the OpenTelemetry Collector is set up and running, you’ll be able to analyze your systems within various features of the Elastic Observability solution.</p>
<p>Analyze the performance and health of your infrastructure through metrics and logs collected by OpenTelemetry Collector receivers, such as the host metrics receiver and the various Kubernetes receivers.</p>
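<p>To give a concrete idea of how such a pipeline is wired together, below is a minimal, illustrative collector configuration connecting the host metrics receiver to the Elasticsearch exporter. The endpoint and API key values are placeholders to replace with your own, and the exact component set of the Elastic distribution may differ by version:</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      network:

exporters:
  elasticsearch:
    endpoints: ["https://my-deployment.es.example.com:443"]  # placeholder
    api_key: ${env:ELASTIC_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [elasticsearch]
</code></pre>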
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/hosts.png" alt="OTel Monitoring Hosts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-daemonset-green-logs.png" alt="OTel Logs" /></p>
<p>With the Elastic OpenTelemetry Collector, container and application logs are enriched with Kubernetes metadata out of the box, making filtering, grouping, and log analysis easier and more efficient.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/explorer.png" alt="OTel Discover" /></p>
<p>The Elastic Distribution of the OpenTelemetry Collector supports tracing just like any other collector distribution built from upstream components. Explore and analyze the performance and runtime behavior of your applications and services through RED metrics, service maps, and distributed traces collected from OpenTelemetry SDKs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/apm.png" alt="OTel APM" /></p>
<p>The capabilities and features described above, packed into the Elastic OpenTelemetry Collector, can also be achieved with a custom build of the upstream OpenTelemetry Collector that includes the right set of upstream components. To do just that, follow our <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector/customization">guidance here</a>.</p>
<h2>Outlook<a id="outlook"></a></h2>
<p>The launch of the technical preview of the Elastic Distribution of OpenTelemetry Collector is another step on Elastic’s journey towards OpenTelemetry based observability. On that journey we are committed to a vendor-agnostic approach to data collection and therefore prioritize upstream contribution to OpenTelemetry over Elastic-specific data collection features.</p>
<p>Stay tuned to see more of Elastic’s contributions to OpenTelemetry and observe Elastic’s journey towards fully OpenTelemetry-based observability.</p>
<p>Additional resources for OpenTelemetry with Elastic:</p>
<ul>
<li>
<p>Elastic Distributions recently introduced:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry's Java SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry's Python SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry's NodeJS SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry's .NET SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry for iOS and Android</a></p>
</li>
</ul>
</li>
<li>
<p>Other Elastic OpenTelemetry resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
</ul>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-collector-announcement.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Announcing GA of Elastic distribution of the OpenTelemetry Java Agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-java-agent</guid>
            <pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic announces general availability of the Elastic distribution of the OpenTelemetry (OTel) Java Agent, a fully OTel-compatible agent with a rich set of useful additional features.]]></description>
            <content:encoded><![CDATA[<p>As Elastic continues its commitment to OpenTelemetry (OTel), we are excited to announce general availability of the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution of OpenTelemetry Java (EDOT Java)</a>. EDOT Java is a fully compatible drop-in replacement for the OTel Java agent that comes with a set of built-in, useful extensions for powerful additional features and improved usability with Elastic Observability. Use EDOT Java to start the OpenTelemetry SDK with your Java application, and automatically capture tracing data, performance metrics, and logs. Traces, metrics, and logs can be sent to any OpenTelemetry Protocol (OTLP) collector you choose.</p>
<p>With EDOT Java you have access to all the features of the OpenTelemetry Java agent plus:</p>
<ul>
<li>Access to SDK improvements and bug fixes contributed by the Elastic team before the changes are available upstream in OpenTelemetry repositories.</li>
<li>Access to optional features that can enhance OpenTelemetry data that is being sent to Elastic (for example, inferred spans and span stacktrace).</li>
</ul>
<p>In this blog post, we will explore the rationale behind our unique distribution, detailing the powerful additional features it brings to the table. We will provide an overview of how these enhancements can be utilized with our distribution, the standard OTel SDK, or the vanilla OTel Java agent. Stay tuned as we conclude with a look ahead at our future plans and what you can expect from Elastic contributions to OTel Java moving forward.</p>
<h2>Elastic Distribution of OpenTelemetry Java (EDOT Java)</h2>
<p>Until now, Elastic users looking to monitor their Java services through automatic instrumentation had two options: the proprietary Elastic APM Java agent or the vanilla OTel Java agent. While both agents offer robust capabilities and have reached a high level of maturity, each has its distinct advantages and limitations. The OTel Java agent provides extensive instrumentation across a broad spectrum of frameworks and libraries, is highly extensible, and natively emits OTel data. Conversely, the Elastic APM Java agent includes several powerful features absent in the OTel Java agent.</p>
<p>Elastic’s distribution of the OTel Java agent aims to bring together the best aspects of the proprietary Elastic Java agent and the OpenTelemetry Java agent. This distribution enhances the vanilla OTel Java agent with a set of additional features realized through extensions, while still being a fully compatible drop-in replacement.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/1.png" alt="Elastic distribution of the OpenTelemetry Java agent" /></p>
<p>Elastic’s commitment to OpenTelemetry not only focuses on standardizing data collection around OTel but also includes improving OTel components and integrating Elastic's data collection features into OTel. In this vein, our ultimate goal is to contribute as many features from Elastic’s distribution back to the upstream OTel Java agent; our distribution is designed in such a way that the additional features, realized as extensions, work directly with the OTel SDK. This means they can be used independent of Elastic’s distro — either with the Otel Java SDK or with the vanilla OTel Java agent. We’ll discuss these usage patterns further in the sections below.</p>
<h2>Features included</h2>
<p>The Elastic distribution of the OpenTelemetry Java agent includes a suite of extensions that deliver the features outlined below.</p>
<h3>Inferred spans</h3>
<p>In a <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">recent blog post</a>, we introduced inferred spans, a powerful feature designed to enhance distributed traces with additional profiling-based spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/2.png" alt="Inferred spans" /></p>
<p>Inferred spans (blue spans labeled “internal” in the above image) offer valuable insights into sources of latency within the code that might remain uncaptured by purely instrumentation-based traces. In other words, they fill in the gaps between instrumentation-based traces. The Elastic distribution of the OTel Java agent includes the inferred spans feature. It can be enabled by setting the following environment variable.</p>
<pre><code class="language-bash">ELASTIC_OTEL_INFERRED_SPANS_ENABLED=true
</code></pre>
<h3>Correlation with profiling</h3>
<p>With <a href="https://opentelemetry.io/blog/2024/profiling/">OpenTelemetry embracing profiling</a> and <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's proposal to donate its eBPF-based, continuous profiling agent</a>, a new frontier opens up in correlating distributed traces with continuous profiling data. This integration offers unprecedented code-level insights into latency issues and CO2 emission footprints, all within a clearly defined service, transaction, and trace context. To get started, follow <a href="https://www.elastic.co/observability-labs/blog/universal-profiling-with-java-apm-services-traces">this guide</a> to setup universal profiling and the OpenTelemetry integration. In order to get more background information on the feature, check out <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">this blog article</a>, where we explore how these technologies converge to enhance observability and environmental consciousness in software development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/3.png" alt="Correlation with profiling" /></p>
<p>Users of Elastic Universal Profiling can already leverage the Elastic distribution of the OTel Java agent to access this powerful integration. With Elastic's proposed donation of the profiling agent, we anticipate that this capability will soon be available to all OTel users who employ the OTel Java agent in conjunction with the new OTel eBPF profiling.</p>
<h3>Span stack traces</h3>
<p>In many cases, spans within a distributed trace are relatively coarse-grained, particularly when features like inferred spans are not used. Understanding precisely where in the code path a span originates can be incredibly valuable. To address this need, the Elastic distribution of the OTel Java agent includes the span stack traces feature. This functionality provides crucial insights by collecting corresponding stack traces for spans that exceed a configurable minimum duration, pinpointing exactly where a span is initiated in the code.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/4.png" alt="Span stack traces" /></p>
<p>This simple yet powerful feature significantly enhances problem troubleshooting, offering developers a clearer understanding of their application’s performance dynamics.</p>
<p>In the example above, it lets you get the call stack of a gRPC call, which can help you understand which code paths triggered it.</p>
<h3>Auto-detection of service and cloud resources</h3>
<p>In today's expansive and diverse cloud environments, which often include multiple regions and cloud providers, having information on where your services are operating is incredibly valuable. Particularly in Java services, where the service name is frequently embedded within the deployment artifacts, the ability to automatically retrieve service and cloud resource information marks a substantial leap in usability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/5.png" alt="Auto-detection of service and cloud resources" /></p>
<p>To address this need, the Elastic distribution of the OTel Java agent includes built-in auto detectors for service and cloud resources, specifically for AWS and GCP, sourced from <a href="https://github.com/open-telemetry/opentelemetry-java-contrib">the OpenTelemetry Java Contrib repository</a>. This feature, which is on by default, enhances observability and streamlines the management of services across various cloud platforms, making it a key asset for any cloud-based deployment.</p>
<h2>Ways to use the EDOT Java</h2>
<p>The Elastic distribution of the OTel Java agent is designed to meet our users exactly where they are, accommodating a variety of needs and strategic approaches. Whether you're looking to fully integrate new observability features or simply enhance existing setups, the Elastic distribution offers multiple technical pathways to leverage its capabilities. This flexibility ensures that users can tailor the agent's implementation to align perfectly with their specific operational requirements and goals.</p>
<h3>Using Elastic’s distribution directly</h3>
<p>The most straightforward path to harnessing the capabilities described above is by adopting the Elastic distribution of the OTel Java agent as a drop-in replacement for the standard OTel Java agent. Structurally, the Elastic distro functions as a wrapper around the OTel Java agent, maintaining full compatibility with all upstream configuration options and incorporating all its features. Additionally, it includes the advanced features described above that significantly augment its functionality. Users of the Elastic distribution will also benefit from the comprehensive technical support provided by Elastic, which will commence once the agent achieves general availability. To get started, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent">download the agent Jar file</a> and attach it to your application:</p>
<pre><code class="language-bash">java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions with the vanilla OTel Java agent</h3>
<p>If you prefer to continue using the vanilla OTel Java agent but wish to take advantage of the features described above, you have the flexibility to do so. We offer a separate agent extensions package specifically designed for this purpose. To integrate these enhancements, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-agentextension">download and place the extensions jar file</a> into a designated directory and configure the OTel Java agent extensions directory:</p>
<pre><code class="language-bash">OTEL_JAVAAGENT_EXTENSIONS=/pathto/elastic-otel-agentextension.jar
java -javaagent:/pathto/otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions manually with the OTel Java SDK</h3>
<p>If you build your instrumentations directly into your applications using the OTel API and rely on the OTel Java SDK instead of the automatic Java agent, you can still use the features we've discussed. Each feature is designed as a standalone component that can be integrated with the OTel Java SDK framework. To implement these features, simply refer to the specific descriptions for each one to learn how to configure the OTel Java SDK accordingly:</p>
<ul>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Setting up the inferred spans feature with the SDK</a></li>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/universal-profiling-integration">Setting up profiling correlation with the SDK</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">Setting up the span stack traces feature with the SDK</a></li>
<li>Setting up resource detectors with the SDK
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/resource-providers">Service resource detectors</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources">AWS resource detector</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources">GCP resource detector</a></li>
</ul>
</li>
</ul>
<p>This approach ensures that you can tailor your observability tools to meet your specific needs without compromising on functionality.</p>
<h2>Future plans and contributions</h2>
<p>We are committed to OpenTelemetry, and our contributions to the OpenTelemetry Java project will continue without limit. Not only are we focused on general improvements within the OTel Java project, but we are also committed to ensuring that the features discussed in this blog post become official extensions to the OpenTelemetry Java SDK/Agent and are included in the OpenTelemetry Java Contrib repository. We have already contributed the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">span stack trace feature</a> and initiated the contribution of the inferred spans feature, and we are eagerly anticipating the opportunity to add the profiling correlation feature following the successful integration of Elastic’s profiling agent.</p>
<p>Moreover, our efforts extend beyond the current enhancements; we are actively working to port more features from the Elastic APM Java agent to OpenTelemetry. A particularly ambitious yet thrilling endeavor is our project to enable dynamic configurability of the OpenTelemetry Java agent. This future enhancement will allow for the OpenTelemetry Agent Management Protocol (OpAMP) to be used to remotely and dynamically configure OTel Java agents, improving their adaptability and ease of use.</p>
<p>We encourage you to experience the new Elastic distribution of the OTel Java agent and share your feedback with us. Your insights are invaluable as we strive to enhance the capabilities and reach of OpenTelemetry, making it even more powerful and user-friendly.</p>
<p>Check out more information on Elastic Distributions of OpenTelemetry on <a href="https://github.com/elastic/opentelemetry?tab=readme-ov-file">GitHub</a> and in our latest <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">EDOT blog</a>.</p>
<p>Elastic provides the following components of EDOT:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT) iOS and Android</a></p>
</li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/observability-launch-series-3-java-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[EDOT SDK central configuration using OpAMP in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-sdk-central-configuration-opamp</guid>
            <pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure the Elastic Distributions of OpenTelemetry (EDOT) SDKs centrally via the EDOT Collector using OpAMP in Elastic Observability at scale.]]></description>
            <content:encoded><![CDATA[<p>Managing configuration changes across the large number of services instrumented with the Elastic Distributions of OpenTelemetry (EDOT) SDKs can be challenging and time-consuming.
OpenTelemetry defines the <a href="https://opentelemetry.io/docs/specs/opamp/">Open Agent Management Protocol</a> (OpAMP) for exactly this kind of remote management, and Elastic's proprietary APM agents have long provided a comparable central configuration capability in Elastic Observability.
Combining the two, you can now centrally manage your SDKs from Elastic Observability via the EDOT Collector, which uses OpAMP to dispatch configuration changes to the multitude of services using the EDOT SDKs.</p>
<p>In this article, we will explore the central configuration capabilities for EDOT SDKs with the EDOT Gateway Collector. You will learn how to configure the EDOT SDKs and the EDOT Gateway Collector to enable central configuration. Finally, we'll cover the configuration settings supported through central configuration.</p>
<h2>Central configuration based on the OpenTelemetry Open Agent Management Protocol</h2>
<p>The OpenTelemetry project provides OpAMP for, among other capabilities, the remote management of large fleets of data collection agents.
The central management of EDOT SDKs leverages OpAMP for dispatching configurations.
OpAMP is a client-server network protocol: the OpAMP server is part of the EDOT Collector, and the OpAMP client is part of each EDOT SDK.
The EDOT SDK polls the OpAMP server at regular intervals for configuration updates.
The OpAMP server in the EDOT Collector is provided by the <a href="https://github.com/elastic/opentelemetry-collector-components/blob/main/extension/apmconfigextension/README.md">Elastic APM central configuration extension</a>, which reads the configuration for the EDOT SDKs from Elasticsearch.
The extension, whose technical name is <code>apmconfigextension</code>, is included in the EDOT Collector but must be configured in the Collector configuration to be activated.
The OpAMP specification allows both WebSocket and plain HTTP transports; the EDOT SDKs pull their configuration from the EDOT Collector over plain HTTP.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/central-config-edot.png" alt="The central configuration architecture with EDOT SDKs and EDOT Collector" /></p>
<h2>Prerequisites</h2>
<p>Central configuration of EDOT SDKs requires a standalone EDOT Collector running in Gateway mode.
Other collectors, such as the OpenTelemetry contrib collector or a custom distribution of the collector that you build yourself, require the Elastic APM central configuration extension to be added.</p>
<h3>EDOT versions supporting central configuration</h3>
<p>The following table gives an overview of the EDOT SDK and EDOT Collector versions that provide central configuration support.
Applications and services must be instrumented with an EDOT SDK version listed below to pull and apply central configuration changes.</p>
<table>
<thead>
<tr>
<th>EDOT</th>
<th>Version</th>
</tr>
</thead>
<tbody>
<tr>
<td>Android</td>
<td>1.2.0+</td>
</tr>
<tr>
<td>iOS</td>
<td>1.4.0+</td>
</tr>
<tr>
<td>Java</td>
<td>1.5.0+</td>
</tr>
<tr>
<td>Node.js</td>
<td>1.2.0+</td>
</tr>
<tr>
<td>PHP</td>
<td>1.1.1+</td>
</tr>
<tr>
<td>Python</td>
<td>1.4.0+</td>
</tr>
<tr>
<td>Collector (Gateway mode)</td>
<td>8.19, 9.1+</td>
</tr>
</tbody>
</table>
<p>Central configuration is not blocking application startup for EDOT Java, Node.js, PHP and Python.
The application starts with default configuration or the configuration provided by environment variables.
When the central configuration settings are successfully pulled, they will take precedence over local configuration settings.
EDOT .NET currently does not support central configuration although it’s planned to add support.</p>
<p>Furthermore, central configuration for EDOT SDKs is not supported on Elastic Cloud Serverless or the Elastic Cloud Managed OTLP Endpoint, yet.
The following table gives an overview of the versions of the Elastic stack that support central configuration of EDOT SDKs.</p>
<table>
<thead>
<tr>
<th>Elastic</th>
<th>Version</th>
</tr>
</thead>
<tbody>
<tr>
<td>Self-managed</td>
<td>9.1.0+</td>
</tr>
<tr>
<td>Elastic Cloud Hosted</td>
<td>9.1.0+</td>
</tr>
</tbody>
</table>
<h3>Retrieve Elasticsearch endpoint and API key</h3>
<p>The Elastic APM central configuration extension needs the Elasticsearch endpoint in the configuration to be able to read the central configuration settings for the EDOT SDKs.
The Elasticsearch endpoint is the same that the Elasticsearch exporter uses to export telemetry data to Elasticsearch.
The <a href="https://www.elastic.co/docs/reference/opentelemetry/central-configuration">central configuration documentation</a> describes the steps to obtain the endpoint and the API key in more detail.</p>
<h2>Enable central configuration for EDOT</h2>
<p>To enable central configuration, the EDOT Collector needs the <code>apmconfigextension</code> to be configured as part of the <code>extensions</code> section.
This requires the Elasticsearch endpoint obtained above and an Elasticsearch API key.
The environment variable <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> needs to be set to enable central configuration in the EDOT SDK.</p>
<p>In the following, the configuration is explained for the EDOT Gateway Collector and the EDOT SDKs.</p>
<h3>Configure the EDOT Collector to enable central configuration</h3>
<p>Central configuration support in the EDOT Collector is enabled by adding the configuration of the Elastic APM central configuration extension configuration to the configuration file.
For the authentication of the <code>apmconfigextension</code> with the Elasticsearch endpoint, <code>bearertokenauth</code> authenticator is configured.
This configures a client type authenticator for outgoing requests to the Elasticsearch endpoint.
The <code>apmconfigextension</code> acts as client and Elasticsearch endpoint as server.
The <code>apmconfig</code> section configures the OpAMP server endpoint. EDOT SDKs will connect to the endpoint to fetch the configuration.
The <code>service</code> section activates the <code>apmconfig</code> and <code>bearertokenauth</code> extension.
The following code snippet shows the configuration excerpt of the EDOT Collector including the <code>bearertoken</code> authenticator configuration and the <code>apmconfig</code> configuration for central configuration.</p>
<pre><code class="language-yaml">extensions:
  bearertokenauth:
    scheme: &quot;APIKey&quot;
    token: &quot;&lt;ENCODED_ELASTICSEARCH_APIKEY&gt;&quot;
  apmconfig:
    source:
      elasticsearch:
        endpoint: &quot;&lt;YOUR_ELASTICSEARCH_ENDPOINT&gt;&quot;
        auth:
          authenticator: bearertokenauth
    opamp:
      protocols:
        http:
          # Default is localhost:4320
          # To specify a custom endpoint, uncomment the following line
          # and set it to the custom endpoint
          # endpoint: &quot;&lt;CUSTOM_OPAMP_ENDPOINT&gt;&quot;

service:
  extensions: [bearertokenauth, apmconfig]
</code></pre>
<p><a href="https://www.elastic.co/docs/reference/edot-collector/download">Download</a> the EDOT collector and include the configuration from the snippet above in the <code>otel.yml</code> configuration file to enable central configuration.
The <code>otel.yml</code> configuration file examples are available in the <a href="https://www.elastic.co/docs/reference/edot-collector/config/default-config-standalone#agent-mode">EDOT Collector documentation</a>.
Consider the example for direct ingestion into Elasticsearch.</p>
<h3>Configure the EDOT SDKs to enable central configuration</h3>
<p>To enable central configuration in the EDOT Java, Node.js, PHP, and Python SDKs, set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable to the OpAMP server endpoint of the EDOT Collector and set the required resource attributes.</p>
<h4>Enable central configuration of EDOT SDKs</h4>
<p>The following code snippet shows how to set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable with the <code>export</code> command in a shell.</p>
<pre><code class="language-bash">export ELASTIC_OTEL_OPAMP_ENDPOINT=&quot;http://&lt;your-opamp-end-point&gt;:4320/v1/opamp&quot;
</code></pre>
<p><code>&lt;your-opamp-end-point&gt;</code> must be set to the address or host name of the EDOT Gateway Collector that provides the OpAMP server endpoint for central configuration.</p>
<p>If you are using the mobile EDOT SDKs, the documentation shows how to activate central configuration support in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/android/configuration#central-configuration">EDOT Android</a> and <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/ios/configuration#central-configuration-edot">EDOT iOS</a>.</p>
<h4>Configure resource attributes</h4>
<p>Central configuration requires the OpenTelemetry resource attributes <code>service.name</code> and <code>deployment.environment.name</code> to be set.
While <code>service.name</code> is mandatory, <code>deployment.environment.name</code> is optional but recommended.
If <code>deployment.environment.name</code> is unset, no configuration can be created that applies to a whole environment.</p>
<p>Set the <code>OTEL_RESOURCE_ATTRIBUTES</code> environment variable, including <code>service.name</code> and <code>deployment.environment.name</code>, as in the following code snippet.
The key-value pairs are concatenated with a comma as separator and provided as the value of the <code>OTEL_RESOURCE_ATTRIBUTES</code> environment variable.</p>
<pre><code class="language-bash">export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment.name=production,service.name=my-app&quot;
</code></pre>
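<p>Putting it together, a service can be started with both environment variables set. The following sketch assumes a Java service instrumented with the EDOT Java agent; <code>elastic-otel-javaagent.jar</code> and <code>my-app.jar</code> are placeholder file names, not taken from this post.</p>
<pre><code class="language-bash"># Hypothetical example; replace the placeholder endpoint and file names
export ELASTIC_OTEL_OPAMP_ENDPOINT=&quot;http://&lt;your-opamp-end-point&gt;:4320/v1/opamp&quot;
export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment.name=production,service.name=my-app&quot;
java -javaagent:elastic-otel-javaagent.jar -jar my-app.jar
</code></pre>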
<h2>Supported configuration settings</h2>
<p>The following tables give an overview of the supported central configuration settings at the time of writing.</p>
<h3>Non-mobile EDOT SDKs</h3>
<p>The table shows the supported central configuration settings of the EDOT Java, Node.js, PHP, and Python SDKs.</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
<th>Java</th>
<th>Node.js</th>
<th>PHP</th>
<th>Python</th>
<th>Kibana</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>logging_level</code></td>
<td>The EDOT SDK's own logging level</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>1.1.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>deactivate_instrumentations</code></td>
<td>Turn off <strong>selected</strong> instrumentations</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>deactivate_all_instrumentations</code></td>
<td>Turn off <strong>all</strong> instrumentations</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_traces</code></td>
<td>Controls if traces should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_metrics</code></td>
<td>Controls if metrics should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_logs</code></td>
<td>Controls if logs should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>opamp_polling_interval</code></td>
<td>Time between consecutive central configuration pull requests</td>
<td>1.6.0+</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>9.2.0+ (planned)</td>
</tr>
<tr>
<td><code>sampling_rate</code></td>
<td>Trace sampling rate for head-based sampling</td>
<td>1.6.0+</td>
<td>-</td>
<td>-</td>
<td>1.7.0+</td>
<td>9.2.0+ (planned)</td>
</tr>
<tr>
<td><code>infer_spans</code></td>
<td>Activates/Deactivates inferred spans</td>
<td>1.7.0+</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>9.2.0+ (planned)</td>
</tr>
</tbody>
</table>
<h3>Mobile EDOT SDKs</h3>
<p>The table below shows the supported central configuration settings of the EDOT Android and iOS SDKs.</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
<th>Android</th>
<th>iOS</th>
<th>Kibana</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>recording</code></td>
<td>Record and send telemetry</td>
<td>1.2.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>session_sample_rate</code></td>
<td>Sampling rate for session-based sampling</td>
<td>1.2.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
</tbody>
</table>
<h2>Use Elastic Observability to change configuration settings of EDOT SDKs</h2>
<p>An application must produce and send telemetry data, otherwise the EDOT SDK will not appear in the Agent Configuration UI in Elastic Observability: Agent Configuration has no knowledge of an EDOT SDK until telemetry data is received from it.
The OpenTelemetry resource attribute <code>service.name</code> is used as the key to assign a configuration to an EDOT SDK.
Currently, EDOT SDKs do not show up in the Agent Explorer.</p>
<p>Go to Kibana -&gt; Observability -&gt; Applications -&gt; Service Inventory -&gt; Settings -&gt; Agent Configuration to create a new configuration for your EDOT SDK.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/edot-sdk-configuration-deactivate-all-instrumentations.png" alt="Deactivate all instrumentations of EDOT Java SDK in Kibana" /></p>
<p>The EDOT Java SDK configuration above deactivates all instrumentations by setting <code>deactivate_all_instrumentations</code> to <code>true</code>, which is useful when switching off the instrumentations is necessary.
The <code>sampling_rate</code> setting comes in handy when the sampling rate of the EDOT SDK should be changed; select a sampling rate according to your needs, as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/edot-sdk-configuration-sampling-rate.png" alt="Set the sampling rate of EDOT Java SDK in Kibana" /></p>
<h2>Disable central configuration</h2>
<p>Disabling central configuration is straightforward. To disable central configuration support in the EDOT SDKs, remove the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable and restart the application.
To disable the Elastic APM central configuration extension in the EDOT Collector, remove the <code>apmconfig</code> extension from the <code>service</code> section of the collector configuration and restart the collector.</p>
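<p>As a sketch, after removing <code>apmconfig</code> from the active extensions, the <code>service</code> section of the collector configuration would look like the following. The <code>apmconfig</code> and <code>bearertokenauth</code> definitions in the <code>extensions</code> section can also be deleted if nothing else uses them.</p>
<pre><code class="language-yaml"># Sketch: apmconfig is no longer activated; keep bearertokenauth only
# if another component still depends on it
service:
  extensions: [bearertokenauth]
</code></pre>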
<h2>Elastic's contribution to OpenTelemetry</h2>
<p>Elastic is committed to the OpenTelemetry project.
Elastic contributed the Java OpAMP client implementation to the OpenTelemetry project (<a href="https://github.com/open-telemetry/opentelemetry-java-contrib/pull/2021">GitHub PR</a>) and is working on the contributions for Python (<a href="https://github.com/open-telemetry/opentelemetry-python-contrib/pull/3635">GitHub PR</a>) and Node.js.
The OpAMP client for PHP will be part of a larger contribution (<a href="https://github.com/open-telemetry/community/issues/2846">GitHub issue</a>) that Elastic is making.</p>
<h2>Conclusion</h2>
<p>In this article, you learned how to centrally configure the EDOT SDKs at scale in Elastic Observability with the Gateway Collector and OpAMP.
You learned how to configure the <code>apmconfigextension</code> in the collector, how to set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable to enable central configuration in the EDOT SDKs, which versions of the SDKs and collector support central configuration, and which configuration settings are currently supported.
Now you can leverage central configuration in large deployments to manage the configuration of the EDOT SDKs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/elastic-distribution-opentelemetry-sdk-central-configuration-opamp.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Distributions of OpenTelemetry (EDOT) Now GA: Open-Source, Production-Ready OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga</link>
            <guid isPermaLink="false">elastic-distributions-opentelemetry-ga</guid>
            <pubDate>Wed, 02 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce General Availability of Elastic Distributions of OpenTelemetry (EDOT), which contains Elastic’s versions of the OpenTelemetry Collector and several language SDKs like Python, Java, .NET, and NodeJS. These help provide enhanced features and enterprise-grade support for EDOT.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to announce <strong>general availability of</strong> <strong>Elastic Distributions of OpenTelemetry (EDOT)!</strong> EDOT is a fully open distribution of the OpenTelemetry collector and language SDKs, providing SREs and developers with a stable, production-tested OTel ecosystem backed by enterprise-grade support.</p>
<p>While OTel components are feature-rich, enhancements through the community can take time, and support is left up to the community or to individual users and organizations. EDOT delivers the following benefits to end users:</p>
<ul>
<li>
<p>Production-ready, backed by expert OTel support</p>
</li>
<li>
<p>No vendor lock-in - no proprietary add-ons</p>
</li>
<li>
<p>Preserving OpenTelemetry standards - no schema conversion</p>
</li>
</ul>
<h2>EDOT Collector and SDKs are GA</h2>
<p>Elastic Distributions of OpenTelemetry (EDOT) is a curated collection of OpenTelemetry components: the EDOT Collector and language SDKs. It is designed to support OTel telemetry from applications and shared infrastructure such as hosts or Kubernetes.</p>
<p>Highlighted below are all the EDOT components that are now GA.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/edot-components.png" alt="EDOT Components" /></p>
<ul>
<li>
<p><strong>Elastic Distribution of OpenTelemetry (EDOT) Collector</strong> - The EDOT Collector is the OTel Collector with Elastic's set of receivers, processors, and exporters for sending OTel data to Elastic</p>
</li>
<li>
<p><strong>Elastic Distribution of OpenTelemetry (EDOT) SDKs</strong> <strong>&amp; zero-code instrumentation</strong> - Users have an option to instrument with the SDKs or choose to use zero-code instrumentation. Here are all the SDKs currently available in EDOT:</p>
<ul>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Java</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Python</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) NodeJS</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) .NET</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) PHP</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) iOS</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Android</p>
</li>
</ul>
</li>
</ul>
<p>Details and documentation for EDOT are available in our public <a href="https://www.elastic.co/docs/reference/opentelemetry/">EDOT documentation</a> and our <a href="https://github.com/elastic/opentelemetry">EDOT GitHub repository</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/EDOT-Overview.png" alt="EDOT overview" /></p>
<p>To learn more about the ease of use, particularly with Kubernetes, check out our previous blog <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Ingest Kubernetes and application telemetry in 3 steps with EDOT</a>.</p>
<h2>What an SRE gains with EDOT</h2>
<p><strong>Production-ready, Backed by Expert OTel Support</strong></p>
<p>Enterprises adopting OpenTelemetry often struggle with unreliable support, slow bug fixes, and untested updates, leading to operational risk, downtime, and increased troubleshooting effort. Without enterprise-grade guarantees, teams are left to resolve issues on their own, increasing maintenance overhead and slowing adoption.</p>
<p>EDOT delivers enterprise-grade support backed by OpenTelemetry experts, ensuring stability, proactive fixes beyond OpenTelemetry’s release cycles and production-tested reliability. With rapid issue resolution and expert guidance, EDOT enables organizations to confidently adopt and scale OpenTelemetry without operational disruptions or added maintenance burden.</p>
<p><strong>No Vendor Lock-In—No Proprietary Add-Ons</strong></p>
<p>Observability vendors have traditionally built proprietary agents and ingestion pipelines, allowing them to control data flows and lock in users.</p>
<p>Elastic Distributions of OpenTelemetry (EDOT) offers a fully open, vendor-neutral approach to observability. As a curated portfolio of OpenTelemetry components, EDOT enhances infrastructure and application monitoring with Elastic Observability—without proprietary modifications.</p>
<p>All enhancements and fixes are contributed back to the OpenTelemetry community, ensuring EDOT remains a stable, standards-compliant distribution that stays aligned with upstream OpenTelemetry. This guarantees interoperability, seamless upgrades, and freedom from vendor lock-in.</p>
<p><strong>Preserving OpenTelemetry Standards for Richer Context</strong></p>
<p>When vendors modify OpenTelemetry data and schemas by introducing proprietary translations that disrupt interoperability, they create vendor lock-in and increase complexity. These modifications force operations teams to manage custom integrations, convert schemas, and sometimes result in each signal requiring its own query language and tooling, adding unnecessary overhead and limiting flexibility.</p>
<p>Elastic has re-architected its platform with an OTel-first approach that preserves the OpenTelemetry data model. OTel data can now be used in its original specification to power Elastic dashboards, analytics, alerts, and other functionality without the need for schema conversions – it just works.</p>
<p>With Elasticsearch as a single backend for all OpenTelemetry signals, users can store and query observability data in a unified, OTel-native format. Combined with ES|QL, a powerful and flexible query language, SREs get effortless correlation of logs, metrics, and traces using OpenTelemetry resource attributes. The result is a faster, more intuitive way to analyze system health and performance—all in one place.</p>
<h2>Get Started Today</h2>
<p>EDOT is available to all Elastic customers. Whether you’re adopting OpenTelemetry for the first time or looking for a reliable distribution with enterprise-grade support, EDOT ensures a smooth, OpenTelemetry-first experience.</p>
<p>Check out our <a href="https://www.elastic.co/docs/reference/opentelemetry/">EDOT documentation</a> and our <a href="https://github.com/elastic/opentelemetry">EDOT GitHub repository</a> and get started today!</p>
<p>Additionally, check out some of the blogs detailing our components:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/edot-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry</link>
            <guid isPermaLink="false">elastic-distributions-opentelemetry</guid>
            <pubDate>Thu, 15 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce Elastic Distributions of OpenTelemetry (EDOT), which contains Elastic’s versions of the OpenTelemetry Collector and several language SDKs like Python, Java, .NET, and NodeJS. These help provide enhanced features and enterprise-grade support for EDOT.]]></description>
            <content:encoded><![CDATA[<p>We are announcing the availability of Elastic Distributions of OpenTelemetry (EDOT). These Elastic distributions, currently in tech preview,  have been developed to enhance the capabilities of standard OpenTelemetry distributions and improve existing OpenTelemetry support from Elastic. </p>
<p>The Elastic Distributions of OpenTelemetry (EDOT) are composed of OpenTelemetry (OTel) project components, the OTel Collector and language SDKs, which provide users with the necessary capabilities and out-of-the-box configurations, enabling quick and effortless infrastructure and application monitoring.</p>
<p>While OTel components are feature-rich, enhancements through the community can take time. Additionally, support is left up to the community or individual users and organizations. Hence, EDOT will bring the following to end users:</p>
<ul>
<li>
<p><strong>Deliver enhanced features earlier than OTel</strong>: By providing features unavailable in the “vanilla” OpenTelemetry components, we can quickly meet customers’ requirements while still providing an OpenTelemetry native and vendor-agnostic instrumentation for their applications. Elastic will continuously upstream these enhanced features.</p>
</li>
<li>
<p><strong>Enhanced OTel support</strong> - By maintaining Elastic distributions, we can better support customers with enhancements and fixes outside of the OTel release cycles. In addition, Elastic support can troubleshoot issues on the EDOT.</p>
</li>
</ul>
<p>EDOT currently includes the following tech preview components, and the list will grow over time:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT)  iOS and Android</a></p>
</li>
</ul>
<p>Details and documentation for all EDOT components are available in our public <a href="https://github.com/elastic/opentelemetry">OpenTelemetry GitHub repository</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-components.png" alt="EDOT Components" /></p>
<h2>Elastic Distribution of OpenTelemetry (EDOT) Collector</h2>
<p>The EDOT Collector, recently released with the 8.15 release of Elastic Observability, enhances Elastic's existing OTel capabilities. In addition to service monitoring, the EDOT Collector can forward application logs, infrastructure logs, and metrics using standard OpenTelemetry Collector receivers like the filelog and host metrics receivers.</p>
<p>Additionally, users of the Elastic Distribution of the OpenTelemetry Collector benefit from container logs automatically enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> that Elastic recently contributed. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
<p>This new collector distribution ensures that exported data is fully compatible with the Elastic Platform, enhancing the overall observability experience. Elastic also ensures that Elastic-curated UIs can seamlessly handle both the Elastic Common Schema (ECS) and OpenTelemetry formats.</p>
<h2>Elastic Distributions for Language SDKs</h2>
<p><a href="https://www.elastic.co/guide/en/apm/agent/index.html">Elastic's APM agents</a> have capabilities not yet available in the OTel SDKs. EDOT brings these capabilities into the OTel language SDKs while maintaining seamless integration with Elastic Observability. Elastic will release OTel versions of all its APM agents and continue to add language SDKs mirroring OTel.</p>
<h2>Continued support for Native OTel components</h2>
<p>EDOT does not preclude users from using native components. Users are still able to use:</p>
<ul>
<li>
<p><strong>OpenTelemetry Vanilla Language SDKs:</strong> use standard OpenTelemetry code instrumentation for many popular programming languages, sending OTLP traces to Elastic via the APM server.</p>
</li>
<li>
<p><strong>Upstream Distribution of OpenTelemetry Collector (Contrib or Custom):</strong> Send traces to Elastic via the APM server using the OpenTelemetry Collector with the OTLP receiver and OTLP exporter.</p>
</li>
</ul>
<p>Elastic is committed to contributing EDOT features or components upstream into the OpenTelemetry community, fostering a collaborative environment, and enhancing the overall OpenTelemetry ecosystem.</p>
<h2>Extending our commitment to vendor-agnostic data collection</h2>
<p>Elastic remains committed to supporting OpenTelemetry by being OTel-first and building a vendor-agnostic framework. As OpenTelemetry constantly grows its support of SDKs and components, Elastic will continue to refine and mirror EDOT to OpenTelemetry and push enhancements upstream.</p>
<p>Over the past year, Elastic has been active in OTel through its <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donation of Elastic Common Schema (ECS)</a>, contributions to the native <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">OpenTelemetry Collector</a> and language SDKs, and a recent <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">donation of its Universal Profiling agent</a> to OpenTelemetry. </p>
<p>EDOT  builds on our decision to fully adopt and recommend OpenTelemetry as the preferred solution for observing applications. With EDOT, Elastic customers can future-proof their investments and adopt OpenTelemetry, giving them vendor-neutral instrumentation with Elastic enterprise-grade support.</p>
<p>Our vision is that Elastic will work with the OpenTelemetry community to donate features through the standardization processes and contribute the code to implement them in the native OpenTelemetry components. In time, as OTel capabilities advance and many of the Elastic-exclusive features transition into OpenTelemetry, we look forward to no longer needing Elastic Distributions of OpenTelemetry. In the meantime, we can deliver those capabilities via our OpenTelemetry distributions.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry and Elastic: Working together to establish continuous profiling for the community]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</link>
            <guid isPermaLink="false">elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry is embracing profiling. Elastic is donating its whole-system continuous profiling agent to OpenTelemetry to further this advancement, empowering OTel users to improve computational efficiency and reduce their carbon footprint.]]></description>
            <content:encoded><![CDATA[<p>Profiling is emerging as a core pillar of observability, aptly dubbed the fourth pillar, with the OpenTelemetry (OTel) project leading this essential development. This blog post dives into the recent advancements in profiling within OTel and how Elastic® is actively contributing toward it.</p>
<p>At Elastic, we’re big believers in and contributors to the OpenTelemetry project. The project’s benefits of flexibility, performance, and vendor agnosticism have been making their rounds; we’ve seen a groundswell of customer interest.</p>
<p>To this end, after donating our <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq"><strong>Elastic Common Schema</strong></a> and our <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic-based Java agent approach</a>, we recently <a href="https://github.com/open-telemetry/community/issues/1918">announced our intent to donate our continuous profiling agent</a> — a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols, or service restarts.</p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster.</p>
<h2>Enabling profiling in OpenTelemetry: A step toward unified observability</h2>
<p>Elastic actively participates in the OTel community, particularly within the Profiling Special Interest Group (SIG). This group has been instrumental in defining the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data.</p>
<p>The recent merger of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marks a significant milestone. With the standardization of profiles as a core observability pillar alongside metrics, tracing, and logs, OTel offers a comprehensive suite of observability tools, empowering users to gain a holistic view of their applications' health and performance.</p>
<p>In line with this advancement, we are donating our whole-system, eBPF-based continuous profiling agent to OTel. In parallel, we are implementing the experimental OTel profiling signal in the agent to ensure and demonstrate OTel protocol compatibility, preparing it for a fully OTel-based collection of profiling signals that can be correlated with logs, metrics, and traces.</p>
<h2>Why is Elastic donating the eBPF-based profiling agent to OpenTelemetry?</h2>
<p>Computational efficiency has always been a critical concern for software professionals. However, in an era where every line of code affects both the bottom line and the environment, there's an additional reason to focus on it. Elastic is committed to helping the OpenTelemetry community enhance computational efficiency because efficient software not only reduces the cost of goods sold (COGS) but also reduces carbon footprint.</p>
<p>We have seen firsthand — both internally and from our customers' testimonials — how profiling insights aid in enhancing software efficiency. This results in an improved customer experience, lower resource consumption, and reduced cloud costs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/1-flamegraph.png" alt="A differential flamegraph showing regression in release comparison" /></p>
<p>Moreover, adopting a whole-system profiling strategy, such as <a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Elastic Universal Profiling</a>, differs significantly from traditional instrumentation profilers that focus solely on runtime. Elastic Universal Profiling provides whole-system visibility, profiling not only your own code but also third-party libraries, kernel operations, and other code you don't own. This comprehensive approach facilitates rapid optimizations by identifying non-optimal common libraries and uncovering &quot;unknown unknowns&quot; that consume CPU cycles. Often, a tipping point is reached when the resource consumption of libraries or certain daemon processes exceeds that of the applications themselves. Without system-wide profiling, along with the capabilities to slice data per service and aggregate total usage, pinpointing these resource-intensive components becomes a formidable challenge.</p>
<p>At Elastic, we have a customer with an extensive cloud footprint who plans to negotiate with their cloud provider to reclaim money for the significant compute resource consumed by the cloud provider's in-VM agents. These examples highlight the importance of whole-system profiling and the benefits that the OpenTelemetry community will gain if the donation proposal is accepted.</p>
<p>Specifically, OTel users will gain access to a lightweight, battle-tested production-grade continuous profiling agent with the following features:</p>
<ul>
<li>
<p>Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing, and the agent typically manages to stay way below that)</p>
</li>
<li>
<p>Support for native C/C++ executables without the need for DWARF debug information by leveraging .eh_frame data, as described in “<a href="https://www.elastic.co/blog/universal-profiling-frame-pointers-symbols-ebpf">How Universal Profiling unwinds stacks without frame pointers and symbols</a>”</p>
</li>
<li>
<p>Support for profiling system libraries without frame pointers and without debug symbols on the host</p>
</li>
<li>
<p>Support for mixed stacktraces between runtimes — stacktraces go from kernel space through unmodified system libraries all the way into high-level languages</p>
</li>
<li>
<p>Support for native code (C/C++, Rust, Zig, Go, etc.) without debug symbols on the host</p>
</li>
<li>
<p>Support for a broad set of high-level languages (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl), with .NET support in preparation</p>
</li>
<li>
<p><strong>100% non-intrusive:</strong> there's no need to load agents or libraries into the processes that are being profiled</p>
</li>
<li>
<p>No need for any reconfiguration, instrumentation, or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration</p>
</li>
<li>
<p>Support for x86 and Arm64 CPU architectures</p>
</li>
<li>
<p>Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains</p>
</li>
<li>
<p>Support for <a href="https://www.elastic.co/guide/en/observability/current/profiling-probabilistic-profiling.html">Probabilistic Profiling</a> to reduce data storage costs</p>
</li>
<li>
<p>. . . and more</p>
</li>
</ul>
<p>Elastic's commitment to enhancing computational efficiency and our belief in the OpenTelemetry vision underscore our dedication to advancing the observability ecosystem. By donating the profiling agent, Elastic is not only contributing technology but also dedicating a team of specialized profiling domain experts to co-maintain and advance the profiling capabilities within OpenTelemetry.</p>
<h2>How does this donation benefit the OTel community?</h2>
<p>Metrics, logs, and traces offer invaluable insights into system health. But what if you could unlock an even deeper level of visibility? Here's why profiling is a perfect complement to your OTel toolkit:</p>
<h3>1. Deep system visibility: Beyond the surface</h3>
<p>Think of whole-system profiling as an MRI scan for your fleet. It goes deeper into the internals of your system, revealing hidden performance issues lurking beneath the surface. You can identify &quot;unknown unknowns&quot; — inefficiencies you wouldn't have noticed otherwise — and gain a comprehensive understanding of how your system functions at its core.</p>
<h3>2. Cross-signal correlation: Answering &quot;why&quot; with confidence</h3>
<p>The Elastic Universal Profiling agent supports trace correlation with the OTel Java agent/SDK (with Go support coming soon!). This correlation enables OTel users to view profiling data by services or service endpoints, allowing for a more context-aware and targeted root cause analysis. This powerful combination allows you to pinpoint the exact cause of resource consumption at the trace level. No more guessing why specific functions hog CPU or why certain events occur. You can finally answer the critical &quot;why&quot; questions with precision, enabling targeted optimization efforts.</p>
<h3>3. Cost and sustainability optimization: Beyond performance</h3>
<p>Our approach to profiling goes beyond just performance gains. By correlating whole-system profiling data with tracing, we can help you measure the environmental impact and cloud cost associated with specific services and functionalities within your application. This empowers you to make data-driven decisions that optimize both performance and resource utilization, leading to a more sustainable and cost-effective operation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/2-universal-profiling.png" alt="A differential function insight, showing the performance, cost, and CO2 impact of a change" /></p>
<h2>Elastic's commitment to OpenTelemetry</h2>
<p>Elastic currently supports a growing list of Cloud Native Computing Foundation (CNCF) projects <a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">such as Kubernetes (K8S), Prometheus, Fluentd, Fluent Bit, and Istio</a>. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic’s application performance monitoring (APM)</a> also natively supports OTel, ensuring all APM capabilities are available with either Elastic or OTel agents or a combination of the two. In addition to the ECS contribution and ongoing collaboration with OTel SemConv, Elastic <a href="https://www.elastic.co/observability/opentelemetry">has continued to make contributions to other OTel projects</a>, including language SDKs (such as OTel Swift, OTel Go, OTel Ruby, and others), and participates in several <a href="https://github.com/open-telemetry/community#special-interest-groups">special interest groups (SIGs)</a> to establish OTel as a standard for observability and security.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">strengthening relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a>, or contribute to the <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal and join the conversation</a>.</p>
<p>Stay tuned for further updates as the profiling part of OTel continues to evolve.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/ecs-otel-announcement-1.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Connecting the Dots: ES|QL Joins for Richer Observability Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-esql-join-observability</link>
            <guid isPermaLink="false">elastic-esql-join-observability</guid>
            <pubDate>Thu, 29 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Now in tech preview, ES|QL LOOKUP JOIN lets you enrich logs, metrics, and traces at query time, with no need to denormalize at ingest. Add deployment, infra, or business context dynamically, reduce storage, and accelerate root cause analysis in Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>Connecting the Dots: ES|QL Joins for Richer Observability Insights</h1>
<p>You might have seen our recent announcement about the <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">arrival of SQL-style joins in Elasticsearch</a> with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at it in the context of Observability. How can this new join capability help engineers and SREs make sense of their logs, metrics, and traces, while also making Elasticsearch more storage efficient by reducing how much data has to be denormalized?</p>
<p><strong>Note:</strong> Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.</p>
<p>Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.</p>
<p>Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side. </p>
<p>By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search. </p>
<p>However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs. </p>
<p>With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.</p>
<h2>Why Lookup Joins for Observability</h2>
<p>Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.</p>
<p>Here are some examples:</p>
<ul>
<li>
<p><strong>Deployment Information:</strong> Which version of the code is generating these errors?</p>
</li>
<li>
<p><strong>Infrastructure Mapping:</strong> Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?</p>
</li>
<li>
<p><strong>Business Context:</strong> Are critical customers being affected by this slowdown?</p>
</li>
<li>
<p><strong>Team Ownership:</strong> Which team owns the service throwing these exceptions?</p>
</li>
</ul>
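<p>For instance, the team-ownership question could be answered in a single query. The sketch below assumes a hypothetical <code>service_owners_lkp</code> lookup index (created with <code>index.mode: lookup</code>) that maps <code>service.name</code> to a <code>team.name</code> field:</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | LOOKUP JOIN service_owners_lkp ON service.name
  | STATS error_count = COUNT(*) BY team.name
  | SORT error_count DESC
</code></pre>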
<p>Keeping this kind of information perfectly denormalized onto <em>every single</em> log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.</p>
<p><code>LOOKUP JOIN</code> is ideal here because:</p>
<ol>
<li>
<p><strong>Lookup Indices are Writable:</strong> Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your <em>next</em> ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.</p>
</li>
<li>
<p><strong>Flexibility:</strong> You decide <em>at query time</em> which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.</p>
</li>
<li>
<p><strong>Simpler Setup:</strong> As the original post highlighted, there are no enrich policies to manage. Just create an index with <code>index.mode: lookup</code> and load your data - up to 2 billion documents per lookup index.</p>
</li>
</ol>
<h2>Observability Use Cases &amp; Examples with ES|QL</h2>
<p>Let’s now look at a few examples to see how Lookup Joins can help.</p>
<h3>Enriching Error Logs with Deployment Context</h3>
<p>Let's say you're seeing a spike in errors for your <code>opbeans-ruby</code> service. You have logs flowing into a data stream, but they only contain the service name. The documents don’t have any information about the deployment activity itself.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
</code></pre>
<p>You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a <code>deployments_info_lkp</code> index (set with <code>index.mode: lookup</code>) that maps service names to their deployment times. This index could be updated from our CI/CD pipeline automatically any time a deployment happens.</p>
<pre><code class="language-bash">PUT /deployments_info_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;service&quot;: {
        &quot;properties&quot;: {
          &quot;name&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          },
          &quot;version&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          }
        }
      },
      &quot;deployment_time&quot;: {
        &quot;type&quot;: &quot;date&quot;
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-ruby&quot;, &quot;service.version&quot;: &quot;1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-go&quot;, &quot;service.version&quot;: &quot;1.1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
</code></pre>
<p>Using this information you can now write a query that joins these two sources.</p>
<p><em>ES|QL Query:</em></p>
<pre><code class="language-bash">FROM logs-* 
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
</code></pre>
<p>This alone is a good step towards troubleshooting the problem. You now have the <code>deployment_time</code> column available for each of your error documents. The last remaining step is to use it for further filtering.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover.png" alt="Discover" /></p>
<p>Any data joined from the lookup index can be treated like any other column in the ES|QL query. This means we can filter on it and check whether there was a recent deployment.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
  | KEEP message, service.name, service.version, deployment_time 
  | WHERE deployment_time &gt; NOW() - 2h
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover2.png" alt="Discover2" /></p>
<h3>Saving disk space using JOIN</h3>
<p>Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.</p>
<p>In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/logs-data-stream">logsdb index mode</a> enabled. The starting size for this index is 35.5mb. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-default?h=index,pri.store.size
### 
k8s-logs-default       35.5mb
</code></pre>
<p>Using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-disk-usage">disk usage API</a>, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried. </p>
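<p>The per-field breakdown itself comes from a single API call. As a sketch (the <code>run_expensive_tasks</code> flag is required, because the analysis scans the index to measure each field's footprint):</p>
<pre><code class="language-bash">POST /k8s-logs-default/_disk_usage?run_expensive_tasks=true
</code></pre>
<p>The response reports the store size each field contributes, which is how the roughly 5% for <code>host.os</code> and <code>cloud.*</code> was determined.</p>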
<pre><code class="language-bash">// Example host.os structure
&quot;os&quot;: {
  &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;,
  &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot;
}

// Example cloud structure
&quot;cloud&quot;: {
  &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;availability_zone&quot;: &quot;us-central1-c&quot;,
  &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; },
  &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; },
  &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; }
}
</code></pre>
<p>Instead of storing this information with every document, let's instead drop this information in an ingest pipeline.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/drop-host-os-cloud
{
  &quot;processors&quot;: [
      { &quot;remove&quot;: { &quot;field&quot;: &quot;host.os&quot; } },
      { &quot;set&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;value&quot;: &quot;{{cloud.instance.id}}&quot; } },    // Temporarily store the ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;cloud&quot; } },                               // Remove the entire cloud object
      { &quot;set&quot;: { &quot;field&quot;: &quot;cloud.instance.id&quot;, &quot;value&quot;: &quot;{{tmp1}}&quot; } }, // Restore just the cloud instance ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;ignore_missing&quot;: true } }         // Clean up temporary field
    ]
}
</code></pre>
<p>Reindexing (and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-forcemerge">force merging to one segment</a>) now shows the following size, resulting in approximately 5% less space. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-*?h=index,pri.store.size
### 
k8s-logs-default             35.5mb
k8s-logs-drop-cloud-os       33.7mb
</code></pre>
<p>Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.</p>
<pre><code class="language-bash"># Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;cloud.instance.id&quot;: {  # The join key we kept in the logs
        &quot;type&quot;: &quot;keyword&quot;
      },
      &quot;host.os&quot;: {            # The full host.os object we removed
        &quot;type&quot;: &quot;object&quot;,
        &quot;enabled&quot;: false      # Often don't need to search sub-fields here
      },
      &quot;cloud&quot;: {              # The full cloud object we removed (mostly)
        &quot;type&quot;: &quot;object&quot;,
        &quot;enabled&quot;: false      # Often don't need to search sub-fields here
      }
    }
  }
}

# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;instance_metadata_lkp&quot;, &quot;_id&quot;: &quot;5799032384800802653&quot; } }
{ &quot;cloud.instance.id&quot;: &quot;5799032384800802653&quot;, &quot;host.os&quot;: { &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;, &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot; }, &quot;cloud&quot;: { &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;availability_zone&quot;: &quot;us-central1-c&quot;, &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; }, &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; }, &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; } } }
</code></pre>
<p>With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:</p>
<pre><code class="language-bash">FROM logs-* 
  | LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id 
  | WHERE cloud.region == &quot;us-central1&quot;
</code></pre>
<p>This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.</p>
<p>It should be noted that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the “text” mapping of the <code>host.os.name</code> and <code>cloud.instance.name</code> fields. Make sure to use the disk usage API to evaluate whether this approach is worth it in your specific use case.</p>
<h2>Getting Started with Lookups for Observability</h2>
<p>Creating the necessary lookup indices is straightforward. As detailed in our <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">initial blog post</a>, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting <code>&quot;index.mode&quot;: &quot;lookup&quot;</code> in the index settings.</p>
<p>For Observability, consider automating the population of these lookup indices:</p>
<ul>
<li>
<p>Export data periodically from your CMDB, CRM, or HR systems.</p>
</li>
<li>
<p>Have your CI/CD pipeline update the <code>deployments_lkp</code> index upon successful deployment.</p>
</li>
<li>
<p>Use tools like Logstash with an <code>elasticsearch</code> output configured to write to your lookup index.</p>
</li>
</ul>
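<p>As a sketch of the Logstash option, an output block like the one below writes to the lookup index; the exact field used as <code>document_id</code> is illustrative. Keying documents by the join field means a redeployment overwrites the previous entry instead of accumulating duplicates:</p>
<pre><code class="language-bash">output {
  elasticsearch {
    hosts       =&gt; [&quot;https://localhost:9200&quot;]
    index       =&gt; &quot;deployments_info_lkp&quot;
    document_id =&gt; &quot;%{[service][name]}&quot;
  }
}
</code></pre>
<p>Note that the index should be created with <code>index.mode: lookup</code> beforehand; an index auto-created by Logstash would use the default index mode.</p>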
<h2>A Note on Performance and Alternatives</h2>
<p>While incredibly powerful, joins aren't free. Each <code>LOOKUP JOIN</code> adds processing overhead to your query. For contextual data that is <em>very</em> static (e.g., the cloud region a host <em>permanently</em> resides in) and needed in <em>almost every</em> query against that data, the traditional approach of enriching at ingest time might still be slightly more performant for those specific queries, trading upfront processing and storage for query speed.</p>
<p>However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – <code>LOOKUP JOIN</code> offers a compelling, efficient, and easier-to-manage solution.</p>
<h2>Conclusion</h2>
<p>ES|QL's <code>LOOKUP JOIN</code> makes it easy to correlate and enrich your logs, metrics, and traces with up-to-date external information <em>at query time</em>, so you can move faster from detecting problems to understanding their scope, impact, and root cause.</p>
<p>This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the &quot;Submit feedback&quot; button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/esql-join.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Migrating from Elastic’s Go APM agent to OpenTelemetry Go SDK]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-go-apm-agent-to-opentelemetry-go-sdk</link>
            <guid isPermaLink="false">elastic-go-apm-agent-to-opentelemetry-go-sdk</guid>
            <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[As OpenTelemetry is fast becoming an industry standard, Elastic is fast adopting it as well. In this post, we show you a safe and easy way to migrate your Go application from our APM agent to OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>As <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">we’ve already shared</a>, Elastic is committed to helping OpenTelemetry (OTel) succeed, which means, in some cases, building distributions of language SDKs.</p>
<p>Elastic is strategically standardizing on OTel for observability and security data collection. Additionally, Elastic is committed to working with the OTel community to become the best data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OTel beyond the recent contributions of the <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">Elastic Common Schema (ECS) to OpenTelemetry</a>, <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic in the OTel Java agent</a>, and the <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">upcoming profiling agent donation</a>.</p>
<p>Since version 7.14, the Elastic Stack has supported OTel natively, directly ingesting OpenTelemetry protocol (OTLP)-based traces, metrics, and logs.</p>
<p>The Go SDK is a bit different from the other language SDKs, as the Go language inherently lacks the dynamicity that would allow building a distribution that is not a fork.</p>
<p>Nevertheless, the absence of a distribution doesn’t mean you shouldn’t use OTel for data collection from Go applications with the Elastic Stack.</p>
<p>Elastic currently has an APM Go agent, but we recommend switching to the OTel Go SDK. In this post, we cover two ways you can do that migration:</p>
<ul>
<li>
<p>By replacing all telemetry in your application’s code (a “big bang migration”) and shipping the change</p>
</li>
<li>
<p>By splitting the migration into atomic changes, to reduce the risk of regressions</p>
</li>
</ul>
<h2>A big bang migration</h2>
<p>The simplest way to migrate from our APM Go agent to the OTel SDK may be by removing all telemetry provided by the agent and replacing it all with the new one.</p>
<h3>Automatic instrumentation</h3>
<p>Most of your instrumentation may be provided automatically, as it is part of the frameworks or libraries you are using.</p>
<p>For example, if you use the Elastic Go agent, you may be using our net/http auto instrumentation module like this:</p>
<pre><code class="language-go">import (
	&quot;fmt&quot;
	&quot;net/http&quot;

	&quot;go.elastic.co/apm/module/apmhttp/v2&quot;
)

func handler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, &quot;Hello World!&quot;)
}

func main() {
	http.ListenAndServe(
		&quot;:8080&quot;,
		apmhttp.Wrap(http.HandlerFunc(handler)),
	)
}
</code></pre>
<p>With OpenTelemetry, you would use the otelhttp module instead:</p>
<pre><code class="language-go">import (
	&quot;fmt&quot;
	&quot;net/http&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp&quot;
)

func handler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, &quot;Hello World!&quot;)
}

func main() {
	http.ListenAndServe(
		&quot;:8080&quot;,
		otelhttp.NewHandler(http.HandlerFunc(handler), &quot;http&quot;),
	)
}
</code></pre>
<p>You should perform this same change for every other module you use from our agent.</p>
<h3>Manual instrumentation</h3>
<p>Your application may also contain manual instrumentation: traces and spans created directly within your application code by calling the Elastic APM agent API.</p>
<p>You may be creating transactions and spans like this with Elastic’s APM SDK:</p>
<pre><code class="language-go">import (
	&quot;context&quot;

	&quot;go.elastic.co/apm/v2&quot;
)

func main() {
	ctx := context.Background()

	// Create a transaction, and assign it to the context.
	tx := apm.DefaultTracer().StartTransaction(&quot;GET /&quot;, &quot;request&quot;)
	defer tx.End()
	ctx = apm.ContextWithTransaction(ctx, tx)

	// Create a span.
	span, ctx := apm.StartSpan(ctx, &quot;span&quot;, &quot;custom&quot;)
	defer span.End()
}
</code></pre>
<p>OpenTelemetry uses the same API for both transactions and spans: what Elastic calls “transactions” are simply spans with no parent in OTel (“root spans”).</p>
<p>So, your instrumentation becomes the following:</p>
<pre><code class="language-go">import (
	&quot;context&quot;

	&quot;go.opentelemetry.io/otel&quot;
)

func main() {
	ctx := context.Background()
	tracer := otel.Tracer(&quot;my library&quot;)

	// Create a root span.
	// It is assigned to the returned context automatically.
	ctx, span := tracer.Start(ctx, &quot;GET /&quot;)
	defer span.End()

	// Create a child span (as the context now carries a parent).
	ctx, span = tracer.Start(ctx, &quot;span&quot;)
	defer span.End()
}
</code></pre>
<p>With a big bang migration, you will need to migrate everything before shipping it to production. You cannot split the migration into smaller chunks.</p>
<p>For small applications or ones that only use automatic instrumentation, that constraint may be fine. It allows you to quickly validate the migration and move on.</p>
<p>However, if you are working on a complex set of services, a large application, or one with a lot of manual instrumentation, you probably want to be able to ship code multiple times during the migration instead of all at once.</p>
<h2>An atomic migration</h2>
<p>An atomic migration is one where you ship small, self-contained changes gradually while your application keeps working normally, and only pull the final plug at the end, once you are ready to do so.</p>
<p>To help with atomic migrations, we provide a <a href="https://www.elastic.co/guide/en/apm/agent/go/master/opentelemetry.html">bridge between our APM Go agent and OpenTelemetry</a>.</p>
<p>This bridge lets you run our agent and OTel side by side, with instrumentation from both libraries in the same process sending data to the same location in the same format.</p>
<p>You can configure the OTel bridge with our agent like this:</p>
<pre><code class="language-go">import (
	&quot;log&quot;

	&quot;go.elastic.co/apm/module/apmotel/v2&quot;

	&quot;go.opentelemetry.io/otel&quot;
)

func main() {
	provider, err := apmotel.NewTracerProvider()
	if err != nil {
		log.Fatal(err)
	}
	otel.SetTracerProvider(provider)
}
</code></pre>
<p>Once this configuration is set, every span created by OTel will be transmitted to the Elastic APM agent.</p>
<p>With this bridge, you can make your migration much safer with the following process:</p>
<ul>
<li>
<p>Add the bridge to your application.</p>
</li>
<li>
<p>Switch one instrumentation (automatic or manual) at a time from the agent to OpenTelemetry, just as in the big bang migration above.</p>
</li>
<li>
<p>Remove the bridge and our agent, and configure OpenTelemetry to transmit the data via its SDK.</p>
</li>
</ul>
<p>Each of those steps can be a single change within your application and go to production right away.</p>
<p>If any issue arises during the migration process, you should then be able to see it immediately and fix it before moving on.</p>
<h2>Observability benefits from building with OTel</h2>
<p>As OTel is quickly becoming an industry standard, and Elastic is committed to making it even better, migrating to it can be very beneficial for your engineering teams.</p>
<p>In Go, whether you migrate in one big bang or via Elastic’s OTel bridge, doing so lets you benefit from instrumentations maintained by the global community, making your observability even more effective and helping you better understand what’s happening within your application.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Check out our code series on how to instrument with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Go manual instrumentation with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm">Using AI to analyze OpenTelemetry issues</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-go-apm-agent-to-opentelemetry-go-sdk/elastic-de-136675-V1_V1_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's contribution: Invokedynamic in the OpenTelemetry Java agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/invokedynamic-opentelemetry-java-agent</link>
            <guid isPermaLink="false">invokedynamic-opentelemetry-java-agent</guid>
            <pubDate>Thu, 19 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The instrumentation approach in OpenTelemetry's Java Agent comes with some limitations with respect to maintenance and testability. Elastic contributes an invokedynamic-based instrumentation approach that helps overcoming these limitations.]]></description>
<content:encoded><![CDATA[<p>As the second-largest and one of the most active Cloud Native Computing Foundation (CNCF) projects, <a href="https://opentelemetry.io/">OpenTelemetry</a> is well on its way to becoming the ubiquitous, unified standard and framework for observability. OpenTelemetry owes this success to its comprehensive and feature-rich toolset that allows users to retrieve valuable observability data from their applications with low effort. The OpenTelemetry Java agent is one of the most mature and feature-rich components in OpenTelemetry’s ecosystem. It provides automatic instrumentation for JVM-based applications and comes with broad coverage of auto-instrumentation modules for popular Java frameworks and libraries.</p>
<p>The original instrumentation approach used in the OpenTelemetry Java agent left the maintenance and development of auto-instrumentation modules subject to some restrictions. As part of <a href="https://www.elastic.co/blog/transforming-observability-ai-assistant-otel-standardization-continuous-profiling-log-analytics">our reinforced commitment to OpenTelemetry</a>, Elastic® helps evolve and improve OpenTelemetry projects and components. <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic’s contribution of the Elastic Common Schema</a> to OpenTelemetry was an important step for the open-source community. As another step in our commitment to OpenTelemetry, Elastic started contributing to the OpenTelemetry Java agent.</p>
<h2>Elastic’s invokedynamic-based instrumentation approach</h2>
<p>To overcome the above-mentioned limitations in developing and maintaining auto-instrumentation modules in the OpenTelemetry Java agent, Elastic started contributing its <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents"><strong>invokedynamic</strong>-based instrumentation approach</a> to the OpenTelemetry Java agent in July 2023.</p>
<p>To explain the improvement, you should know that in Java, a common approach to do auto-instrumentation of applications is through utilizing Java agents that do bytecode instrumentation at runtime. <a href="https://bytebuddy.net/#/">Byte Buddy</a> is a popular and widespread utility that helps with bytecode instrumentation without the need to deal with Java’s bytecode directly. Instrumentation logic that collects observability data from the target application’s code lives in so-called <em>advice methods</em>. Byte Buddy provides different ways of hooking these advice methods into the target application’s methods:</p>
<ul>
<li><em>Advice inlining:</em> The advice method’s code is being copied into the instrumented target method.</li>
<li><em>Static advice dispatching:</em> The instrumented target method invokes static advice methods that need to be visible to the instrumented code.</li>
<li><em>Advice dispatching with <strong>invokedynamic</strong>:</em> The instrumented target method uses the JVM’s <strong>invokedynamic</strong> bytecode instruction to call advice methods that are isolated from the instrumented code.</li>
</ul>
<p>These different approaches are described in great detail in our related blog post on <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents">Elastic’s Java APM agent using invokedynamic</a>. In a nutshell, both <em>advice inlining</em> and <em>dispatching to static advice methods</em> come with some limitations with respect to writing and maintaining the advice code. So far, the OpenTelemetry Java agent has used <em>advice inlining</em> for its bytecode instrumentation. The resulting limitations on developing instrumentations are <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/v1.30.0/docs/contributing/writing-instrumentation-module.md#use-advice-classes-to-write-code-that-will-get-injected-to-the-instrumented-library-classes">documented in corresponding developer guidelines</a>. Among other things, not being able to debug advice code is a painful restriction when developing and maintaining instrumentation code.</p>
<p>Elastic’s APM Java agent has been using the <strong>invokedynamic</strong> approach with its benefits for years — field-proven by thousands of customers. To help improve the OpenTelemetry Java agent, Elastic started contributing the <strong>invokedynamic</strong> approach with the goal to simplify and improve the development and maintainability of auto-instrumentation modules. The contribution proposal and the implementation outline is documented in more detail in <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8999">this GitHub issue</a>.</p>
<p>With the new approach in place, Elastic will help migrate existing instrumentations so the OTel Java community can benefit from the <strong>invokedynamic</strong>-based instrumentation approach.</p>
<blockquote>
<p>Elastic supports OTel natively, and has numerous capabilities to help you analyze your application with OTel. </p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Native OpenTelemetry support in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best Practices for instrumenting OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
</ul>
<p>Instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry (this is the application the team built to highlight <em>all</em> the languages below)</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual instrumentation</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/invokedynamic-opentelemetry-java-agent/24-crystals.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic’s Managed OTLP Endpoint: Simpler, Scalable OpenTelemetry for SREs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry</link>
            <guid isPermaLink="false">elastic-managed-otlp-endpoint-for-opentelemetry</guid>
            <pubDate>Thu, 14 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Streamline OpenTelemetry data ingestion with Elastic Observability's new managed OTLP endpoint available on Elastic Cloud Serverless. Get native OTel storage and Elastic-grade scaling for logs, metrics, and traces, simplifying observability for SREs.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to announce the <strong>managed OTLP endpoint for Elastic Observability Serverless.</strong> This feature marks a major milestone in Elastic’s shift to OpenTelemetry as the backbone of our data ingestion strategy and makes it dramatically easier to get high-fidelity OpenTelemetry data into Elastic Cloud.</p>
<h2>What is Elastic’s Managed OTLP Endpoint?</h2>
<p>The managed OTLP endpoint delivers on that promise, offering a fully hosted OpenTelemetry ingestion path that’s scalable, reliable, and designed from the ground up for OpenTelemetry.</p>
<p>OpenTelemetry SDKs, OpenTelemetry Collectors, and any OTLP-compliant service can send data to the OTLP endpoint. The endpoint is available on Elastic Cloud Serverless and is fully managed by Elastic, which minimizes the burden on customers of managing the OpenTelemetry ingestion layer. Whenever your production environment scales, the OTLP endpoint also autoscales, without any management from an SRE.</p>
<p>OpenTelemetry data is stored without any schema translation, preserving both semantic conventions and resource attributes. Additionally, it supports ingesting OTLP logs, metrics, and traces in a unified manner, ensuring consistent treatment across all telemetry data. This marks a significant improvement over the existing functionality, which primarily focuses on traces and APM use cases.</p>
<p>As a result, SREs gain: </p>
<ul>
<li>
<p><strong>Native OTLP ingestion</strong> with Elastic-managed reliability and scale</p>
</li>
<li>
<p><strong>OTel-native data storage</strong>, enabling richer analytics and future-proof observability</p>
</li>
<li>
<p><strong>Elastic-grade scaling</strong>, ready for production and multi-tenant workloads</p>
</li>
<li>
<p><strong>Frictionless onboarding</strong>, with a drop-in endpoint for logs, metrics, and traces</p>
</li>
</ul>
<h2>Native OTLP ingestion</h2>
<p>Whether you are using native OTel SDKs, OpenTelemetry Collector, EDOT, or other OpenTelemetry instrumentation, the OTLP endpoint will ingest any native OTLP data.</p>
<p>Observability data is notoriously bursty, and the managed OTLP endpoint automatically scales with it. A sudden spike in requests, a scaling event in Kubernetes, or a deployment gone sideways can lead to massive surges in telemetry, often when you need visibility the most. That’s exactly what the managed OTLP endpoint in Elastic Observability Serverless is built to handle.</p>
<p>This isn’t just a thin wrapper on a collector. It’s a <strong>multi-tenant, auto-scaling service</strong> architected to absorb high volumes of OpenTelemetry data without you having to manage infrastructure, pre-provision capacity, or worry about dropped data.</p>
<p>Whether you’re routing data directly from OpenTelemetry SDKs or via an intermediate Collector, Elastic handles the scale behind the scenes. The endpoint is designed to scale with your telemetry traffic and recover gracefully from bursts, giving you one less thing to monitor. Just point your instrumentation at the endpoint and let Elastic take care of the rest.</p>
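<p>In practice, pointing an SDK at the endpoint usually comes down to the two standard OTLP environment variables; the endpoint URL and API key below are placeholders for your project’s values:</p>

```shell
# Placeholders: substitute your project's managed OTLP endpoint and API key.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<your-otlp-endpoint>"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=ApiKey <your-api-key>"
```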
<h2>Natively stored OpenTelemetry </h2>
<p>With this feature, developers can now <strong>send OpenTelemetry signals directly to an Elastic Cloud</strong> <strong>Serverless project</strong> using the OTLP output of a Collector or SDK, regardless of the distribution (Contrib, EDOT, and any other distribution will work).</p>
<p>The endpoint also supports data forwarded from any OpenTelemetry Collector, SDK, or OTLP-compliant forwarder. This gives teams full control to send directly from an SDK, or to route, enrich, or batch telemetry through a Collector when needed. Elasticsearch stores OpenTelemetry data using the OpenTelemetry data model, including resource attributes, to identify emitting entities and enable ES|QL queries that correlate logs, metrics, and traces.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/resource-attributes.jpg" alt="OTel resource attributes" /></p>
<h2>Faster time-to-insight</h2>
<p>Whether you’re building in serverless, Kubernetes, or classic VMs, this endpoint lets you focus on instrumentation and insights, not ingestion plumbing. It dramatically shortens the time from telemetry to value while embracing the OpenTelemetry data model, preserving the original attributes and enabling built-in correlation.</p>
<h2>Easy connectivity to Managed OTLP Endpoint</h2>
<p>Connecting to the Managed OTLP endpoint is as simple as pointing your SDK’s or OTel Collector’s OTLP exporter at the Elastic Managed OTLP Endpoint URL with an authentication key. Finding your endpoint is straightforward: go to project management, then edit alias, and you will find your project’s OTLP endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/otlp-endpoint-config.jpg" alt="OTel OTLP endpoint" /></p>
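<p>For the Collector route, a minimal exporter configuration sketch looks like the following; the endpoint and API key are placeholders, and the <code>otlphttp</code> exporter shown here is the standard one shipped with every Collector distribution:</p>

```yaml
exporters:
  otlphttp:
    # Placeholders: use your project's managed OTLP endpoint and API key.
    endpoint: "https://<your-otlp-endpoint>"
    headers:
      Authorization: "ApiKey <your-api-key>"
```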
<h2>Get Started Today</h2>
<p>The managed OTLP endpoint can be used today <strong>on Elastic Observability Serverless</strong>. Support for <strong>Elastic Cloud Hosted</strong> deployments is coming soon.</p>
<p>For more detail and examples, follow <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">this guide</a>.</p>
<p>Whether you’re running microservices in Kubernetes, workloads in serverless, or apps on classic VMs, the OTLP endpoint helps you <strong>streamline your observability pipeline</strong>, <strong>standardize on OpenTelemetry</strong>, and <strong>accelerate your mean time to resolution (MTTR)</strong>.</p>
<p>Also check out our OTel resources on instrumenting and ingesting OTel data into Elastic:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">Elastic Distributions of OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Monitoring Kubernetes with Elastic and OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery with EDOT Collector</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/assembling-an-opentelemetry-nginx-ingress-controller-integration">Assembling an OpenTelemetry NGINX Ingress Controller Integration</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/otlp-endpoint.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Now GA: Managed OTLP Endpoint on Elastic Cloud Hosted]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted</link>
            <guid isPermaLink="false">elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted</guid>
            <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[The Elastic Managed OTLP Endpoint is now generally available on Elastic Cloud Hosted, bringing managed Kafka-backed resilience and native OTLP ingestion to any OpenTelemetry shipper.]]></description>
            <content:encoded><![CDATA[<p>The Elastic Managed OTLP Endpoint (mOTLP) is now generally available on Elastic Cloud Hosted. Any OTLP-compliant source, whether it's an upstream OpenTelemetry SDK, any Collector distribution, EDOT, or a custom forwarder, can send traces, metrics, and logs to Elastic Cloud without deploying or managing ingestion infrastructure.</p>
<p>mOTLP was <a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">already GA on Elastic Cloud Serverless</a>. With this release, Elastic Cloud Hosted deployments get the same managed ingestion path: an OpenTelemetry Collector-based architecture with Kafka-backed resilience that absorbs traffic spikes and protects against data loss during the moments that matter most.</p>
<h2>You only need to set two environment variables</h2>
<p>From the outside, sending OpenTelemetry data to Elastic using the Managed OTLP Endpoint is as simple as setting up these environment variables:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-motlp-endpoint&gt;&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;your-api-key&gt;&quot;
</code></pre>
<p>Two environment variables. That's the entire integration surface. No Collector gateway to deploy, no ingestion pipelines to manage, no credentials to distribute across edge agents.</p>
<p>What makes such a simple design possible is a full OpenTelemetry Collector-based ingestion architecture designed to handle the worst moments in production, from the traffic spike during a deployment rollout to the burst of error traces during an incident. These are the moments when telemetry matters most, and they're also the moments when ingestion is most likely to be overwhelmed.</p>
<p>The managed endpoint receives OTLP data through an OpenTelemetry Collector layer that buffers to a managed Kafka cluster before indexing into Elasticsearch. Kafka absorbs bursts, decouples ingestion from indexing, and provides durability guarantees that in-memory queues can't. If Elasticsearch is temporarily under pressure, data sits in Kafka rather than getting dropped. When pressure subsides, the buffer drains and everything catches up. This is the same <a href="https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures">resilience pattern</a> you'd build yourself with a Kafka exporter and Kafka receiver in a self-managed collector pipeline, except Elastic operates it for you.</p>
<p>This makes the entire path OpenTelemetry end-to-end:</p>
<ol>
<li>Your applications produce telemetry with OTel SDKs.</li>
<li>Your collectors (or SDKs directly) export over OTLP.</li>
<li>The managed endpoint receives that OTLP through an OTel Collector-based layer, buffers it through Kafka, and stores it natively in Elasticsearch using the OpenTelemetry data model.</li>
</ol>
<p>No proprietary protocols, no schema translation, no format conversion at any stage of the pipeline. Just pure OTel.</p>
<h2>Bring your OTLP data in, no matter the source</h2>
<p>The endpoint accepts standard OTLP over HTTP and gRPC. Any tool that speaks OTLP can send data to it.</p>
<p>This means you can send data from:</p>
<ul>
<li><strong>OpenTelemetry SDKs</strong> (upstream, EDOT, or any distribution) exporting directly from your application.</li>
<li><strong>OpenTelemetry Collectors</strong> (Contrib, EDOT, or custom builds) running as agents, gateways, or sidecars.</li>
<li><strong>EDOT Cloud Forwarder</strong> forwarding cloud provider logs and metrics.</li>
<li><strong>Any OTLP-compliant forwarder</strong> you've built or adopted.</li>
</ul>
<p>This is worth pausing on, because it's not how most vendors work.</p>
<p>Many observability backends require vendor-specific components in your pipeline. Even when those components live in the OpenTelemetry Collector Contrib repository, they often reshape your data on the way out. A vendor-specific exporter might flatten resource attributes, drop the hierarchy between resources and scopes, or translate semantic conventions into a proprietary schema. By the time your telemetry reaches the backend, it's no longer standard OpenTelemetry data. It just started as OpenTelemetry data.</p>
<p>Elastic doesn't require any of that. The managed endpoint ingests standard OTLP, which means you use the standard <code>otlphttp</code> or <code>otlp</code> exporter that ships with every OpenTelemetry Collector. No Elastic-specific exporter, no vendor plugin, no translation layer. Your data arrives in Elasticsearch with the same resource hierarchy, the same semantic conventions, and the same attribute structure your instrumentation produced.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted/motlp-reference-architecture.png" alt="Reference architecture: OTel SDKs and Collectors on the edge exporting over OTLP to the Managed OTLP Endpoint on Elastic Cloud" /></p>
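<p>As a sketch, a gateway-style Collector forwarding all three signals with only standard components might be configured like this (the endpoint and API key are placeholders):</p>

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    # Placeholders: your managed OTLP endpoint and API key.
    endpoint: "https://<your-motlp-endpoint>"
    headers:
      Authorization: "ApiKey <your-api-key>"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```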
<p>The practical result: teams adopt OpenTelemetry at different speeds and with different tools. Some start with the upstream SDK and a Contrib Collector. Others use EDOT for Elastic-specific optimizations. Many run a mix. The managed endpoint doesn't impose a choice, and it doesn't quietly reshape your data behind a vendor exporter.</p>
<p>For organizations already running OpenTelemetry in production, this means switching to Elastic or adding Elastic as a destination requires changing an exporter URL, not rearchitecting the pipeline or adding vendor-specific components.</p>
<h2>What you no longer have to manage</h2>
<p>Before the managed OTLP endpoint, sending OpenTelemetry data to Elastic Cloud Hosted required deploying your own OTLP-compatible ingestion layer, typically an <a href="https://www.elastic.co/docs/reference/edot-collector/modes#edot-collector-as-gateway">EDOT Collector running as a gateway</a>. That gateway needed to be sized, scaled, monitored, and kept available. If it went down, telemetry stopped flowing.</p>
<p>With mOTLP, that entire layer is Elastic's responsibility. Here's what moves off your plate:</p>
<ul>
<li>The endpoint scales with your traffic. On Elastic Cloud Hosted, rate limits scale dynamically based on Elasticsearch backpressure. No pre-provisioning required.</li>
<li>The Collector-to-Kafka buffer handles burst absorption and backpressure. You don't need to operate your own Kafka cluster.</li>
<li>Your shippers authenticate directly with the endpoint using an API key. No intermediate gateway holding and distributing backend credentials.</li>
<li>The endpoint is managed and multi-tenant. You don't need to run redundant Collector replicas or configure health checks for your ingestion layer.</li>
</ul>
<p>This doesn't mean you should remove all collectors from your architecture. Edge collectors (DaemonSet agents, sidecars, host agents) still serve a purpose: collecting infrastructure telemetry via pull-based receivers like <code>filelog</code> and <code>hostmetrics</code>, applying local transformations, and batching data before export. What changes is that the destination is now a managed endpoint rather than a self-operated gateway.</p>
<h2>Dynamic rate scaling: the system adapts to your cluster</h2>
<p>On Elastic Cloud Hosted, the managed endpoint doesn't have a fixed throughput ceiling. It uses dynamic rate scaling that adjusts based on your Elasticsearch cluster's capacity and current load.</p>
<p>This is a fundamentally different model from static rate limits. Instead of provisioning for peak and paying for idle capacity, the system continuously calibrates ingestion to what your cluster can actually handle. Sudden load spikes may still trigger temporary <code>429</code> responses while the system scales, but these resolve automatically.</p>
<p>If you are seeing consistent <code>429</code> errors, the signal is clear: your Elasticsearch cluster needs more capacity. Scaling the cluster reduces backpressure, which in turn raises the ingestion rate limit. The autoscaling capabilities in Elastic Cloud Hosted can help automate this, and AutoOps can assist by monitoring the deployment and recommending scaling or resource adjustments when capacity constraints are detected.</p>
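<p>If you run a Collector in front of the endpoint, the standard exporter retry and queue settings will also smooth over temporary <code>429</code>s while the system scales; the values below are illustrative rather than recommendations:</p>

```yaml
exporters:
  otlphttp:
    endpoint: "https://<your-motlp-endpoint>"  # placeholder
    headers:
      Authorization: "ApiKey <your-api-key>"   # placeholder
    # Standard exporterhelper settings: retry on transient errors (including 429)
    # and buffer batches in a queue while retrying.
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 1000
```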
<h2>Native OTLP storage: OTel from first mile to last</h2>
<p>The end-to-end OTel story doesn't stop at ingestion. Data that arrives through the managed endpoint is stored using the OpenTelemetry data model. Resource attributes, semantic conventions, and signal structure are preserved as-is. There's no translation to ECS or any other schema at any point in the pipeline.</p>
<p>This means the attribute names your SDK produces are the same ones you query in ES|QL and Discover: <code>service.name</code>, <code>http.request.method</code>, and <code>k8s.pod.name</code> are stored exactly as the OpenTelemetry specification defines them. You’re not debugging a mapping layer or wondering which schema translation dropped an attribute. What your instrumentation emits is what Elasticsearch stores and what you search.</p>
<p>If no specific dataset or namespace is configured, telemetry lands in default data streams: <code>traces-generic.otel-default</code>, <code>metrics-generic.otel-default</code>, and <code>logs-generic.otel-default</code>. You can route logs to dedicated datasets by setting the <code>data_stream.dataset</code> attribute, either in your collector configuration or via <code>OTEL_RESOURCE_ATTRIBUTES</code>:</p>
<pre><code class="language-yaml">processors:
  transform:
    log_statements:
      - set(log.attributes[&quot;data_stream.dataset&quot;], &quot;app.orders&quot;) where resource.attributes[&quot;service.name&quot;] == &quot;orders-service&quot;
</code></pre>
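<p>The environment-variable route is a one-liner on the workload side; the dataset name here is illustrative:</p>

```shell
# "app.orders" is an illustrative dataset name; logs from this workload are
# routed to a dedicated data stream instead of the generic default.
export OTEL_RESOURCE_ATTRIBUTES="data_stream.dataset=app.orders"
```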
<h2>The failure store: a safety net for mapping conflicts</h2>
<p>Even with careful schema design, mapping conflicts happen. A field that's a string in one service might be an integer in another. In traditional setups, these conflicts cause indexing failures and data loss.</p>
<p>The managed endpoint has the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">failure store</a> enabled by default for all OTLP data streams. Documents that fail indexing due to mapping conflicts or ingest pipeline exceptions are stored in a separate index rather than dropped. You can inspect failed documents from the <a href="https://www.elastic.co/docs/solutions/observability/data-set-quality-monitoring">Data Set Quality</a> page and fix the underlying issue without losing the data.</p>
<p>This is particularly valuable in OpenTelemetry environments where multiple teams instrument independently and attribute types can drift across services.</p>
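<p>Failed documents can also be searched directly via the failure-store selector (a sketch, assuming the default logs data stream and a version that supports the <code>::failures</code> selector):</p>
<pre><code class="language-bash">POST logs-generic.otel-default::failures/_search
{
  &quot;query&quot;: { &quot;match_all&quot;: {} }
}
</code></pre>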
<h2>Getting started</h2>
<p>The managed OTLP endpoint is available today on Elastic Cloud Hosted deployments (version 9.0+) in <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">supported regions</a>.</p>
<p>To find your endpoint:</p>
<ol>
<li>Log in to the <a href="https://cloud.elastic.co">Elastic Cloud Console</a>.</li>
<li>Select your deployment and go to <strong>Manage</strong>.</li>
<li>In the <strong>Application endpoints</strong> section, select <strong>Managed OTLP</strong> and copy the public endpoint.</li>
</ol>
<p>Then point any OTLP exporter at it:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-motlp-endpoint&gt;&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;your-api-key&gt;&quot;
</code></pre>
<p>That's it. Traces, metrics, and logs will start flowing into your deployment within seconds.</p>
<p>For a detailed walkthrough, follow the <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">Send data to the Elastic Cloud Managed OTLP Endpoint</a> quickstart.</p>
<h2>Learn more</h2>
<ul>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Managed OTLP Endpoint reference documentation</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures">Composing OpenTelemetry Reference Architectures</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector">EDOT Collector documentation</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry">OpenTelemetry at Elastic</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted/managed-otlp-endpoint-elastic-cloud-hosted.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's metrics analytics gets 5x faster]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-metrics-analytics</link>
            <guid isPermaLink="false">elastic-metrics-analytics</guid>
            <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore Elastic's metrics analytics enhancements, including faster ES|QL queries, TSDS updates and OpenTelemetry exponential histogram support.]]></description>
<content:encoded><![CDATA[<p>In our <a href="https://www.elastic.co/observability-labs/blog/metrics-explore-analyze-with-esql-discover">previous blog in this series</a>, we explored the fundamentals of analyzing metrics using the Elasticsearch Query Language (ES|QL) and the interactive power of Discover. Building on that foundation, we are excited to announce a suite of powerful enhancements to Time Series Data Streams (Elastic’s TSDB) and ES|QL, designed to make metrics analytics more comprehensive and dramatically faster!</p>
<p>These latest updates, available in v9.3 and in Serverless, introduce significant performance gains, sophisticated time series functions, and native OpenTelemetry exponential histogram support that directly benefit SREs and Observability practitioners.</p>
<h2>Query Performance and Storage Optimizations</h2>
<p>Speed is paramount when diagnosing incidents. Compared to prior releases, we have achieved a 5x+ improvement in query latency when wildcarding or filtering by dimensions. Additionally, storage efficiency for OpenTelemetry metrics data has improved by approximately 2x, significantly reducing the infrastructure footprint required to retain high-volume observability data. If you’re hungry to learn more about the architectural updates driving these optimizations, stay tuned… Tech blogs are on their way!</p>
<h2>Expanded Time Series Analytics in ES|QL</h2>
<p>The <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts">ES|QL <code>TS</code> source command</a>, which targets time series indices and enables <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time series aggregation functions</a>, has been significantly enhanced to support more complex analytics.</p>
<p>We have expanded the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-functions-operators">library of time series functions</a> to include essential tools for identifying anomalies and trends.</p>
<ul>
<li><code>PERCENTILE_OVER_TIME</code>, <code>STDDEV_OVER_TIME</code>, <code>VARIANCE_OVER_TIME</code>: Calculate the percentile, standard deviation, or variance of a field over time, which is critical for understanding distribution and variability in service latency or resource usage.</li>
</ul>
<p>Example: Seeing the worst-case latency in 5-minute intervals.</p>
<pre><code class="language-bash">TS metrics*  | STATS MAX(PERCENTILE_OVER_TIME(kafka.consumer.fetch_latency_avg, 99))
  BY TBUCKET(5m)
</code></pre>
<ul>
<li><code>DERIV</code>: This function calculates the derivative of a numeric field over time using linear regression, which is useful for analyzing the rate of change in system metrics.</li>
</ul>
<p>Example: trending gauge values over time.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(DERIV(container.memory.available))
  BY TBUCKET(1 hour)
</code></pre>
<ul>
<li><code>CLAMP</code>: To handle noisy data or outliers, this function limits sample values to a specified lower and upper bound.</li>
</ul>
<p>Example: handling saturation metrics (like CPU or memory utilization) where spikes or measurement errors can occasionally report values over 100%, making the rest of the data look like a flat line at the bottom of the chart.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(CLAMP(k8s.pod.memory.node.utilization, 0, 100))
  BY k8s.pod.name
</code></pre>
<ul>
<li><code>TRANGE</code>: This new function filters data to a specific time range using the <code>@timestamp</code> attribute, simplifying query syntax for time-bound investigations.</li>
</ul>
<p>Example: Filtering and showing metrics for the last 4 hours.</p>
<pre><code class="language-bash">TS metrics*  | WHERE TRANGE(4h) | STATS AVG(host.cpu.pct)
  BY TBUCKET(5m)
</code></pre>
<p><strong>Window Functions</strong>: To smooth results over specific periods, ES|QL now introduces window functions. Most time series aggregation functions now accept an optional second argument that specifies a sliding time window. For example, you can calculate a rate over a five-minute sliding window while bucketing results by minute.</p>
<p>Example: Calculating the average rate of requests per host for every minute, using values over a sliding window of 5 minutes.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(RATE(app.frontend.requests, 5m))
  BY TBUCKET(1m)
</code></pre>
<p>Accepted window values are currently limited to multiples of the time bucket interval in the <code>BY</code> clause. Windows that are smaller than the time bucket interval, or larger but not a multiple of it, will be supported in future releases.</p>
<h2>Native OpenTelemetry Exponential Histograms</h2>
<p>Elastic now provides native support for OpenTelemetry exponential histograms, enabling efficient ingest, querying, and downsampling of high-fidelity distribution data.</p>
<p>We have introduced a new <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">exponential_histogram</a> field type designed to capture distributions with fixed, exponentially spaced bucket boundaries. Because these fields are primarily intended for aggregations, the histogram is stored as compact doc values and is not indexed, optimizing storage efficiency. These fields are fully supported in ES|QL aggregation functions such as <code>PERCENTILES</code>, <code>AVG</code>, <code>MIN</code>, <code>MAX</code>, and <code>SUM</code>.</p>
<p>You can index documents with exponential histograms automatically through our <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp#configure-histogram-handling">OTLP endpoint</a> or manually. For example, let’s create an index with an exponential histogram field and a keyword field:</p>
<pre><code class="language-bash">PUT my-index-000001
{
  &quot;settings&quot;: {
    &quot;index&quot;: {
      &quot;mode&quot;: &quot;time_series&quot;,
      &quot;routing_path&quot;: [&quot;http.path&quot;],
      &quot;time_series&quot;: {
        &quot;start_time&quot;: &quot;2026-01-21T00:00:00Z&quot;,
        &quot;end_time&quot;: &quot;2026-01-25T00:00:00Z&quot;
     }
    }
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;@timestamp&quot;: {
        &quot;type&quot;: &quot;date&quot;
      },
      &quot;http.path&quot;: {
        &quot;type&quot;: &quot;keyword&quot;,
        &quot;time_series_dimension&quot;: true
      },
      &quot;responseTime&quot;: {
        &quot;type&quot;: &quot;exponential_histogram&quot;,
        &quot;time_series_metric&quot;: &quot;histogram&quot;
      }
    }
  }
}
</code></pre>
<p>Index a document with a full exponential histogram payload:</p>
<pre><code class="language-bash">POST my-index-000001/_doc
{
  &quot;@timestamp&quot;: &quot;2026-01-22T21:25:00.000Z&quot;,
  &quot;http.path&quot;: &quot;/foo&quot;,
  &quot;responseTime&quot;: {
    &quot;scale&quot;:3,
    &quot;sum&quot;:73.2,
    &quot;min&quot;:3.12,
    &quot;max&quot;:7.02,
    &quot;positive&quot;: {
      &quot;indices&quot;:[13,14,15,16,17,18,19,20,21,22],
      &quot;counts&quot;:[1,1,2,2,1,2,1,3,1,1]
    }
  }
}

POST my-index-000001/_doc
{
  &quot;@timestamp&quot;: &quot;2026-01-22T21:26:00.000Z&quot;,
  &quot;http.path&quot;: &quot;/bar&quot;,
  &quot;responseTime&quot;: {
    &quot;scale&quot;:3,
    &quot;sum&quot;:45.86,
    &quot;min&quot;:2.15,
    &quot;max&quot;:5.1,
    &quot;positive&quot;: {
      &quot;indices&quot;:[8,9,10,11,12,13,14,15,16,17,18],
      &quot;counts&quot;:[1,1,1,1,1,1,1,2,1,1,2]
    }
  }
}
</code></pre>
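<p>To interpret the <code>scale</code> and <code>indices</code> values above: in the OpenTelemetry data model, an exponential histogram with a given scale uses the base 2^(2^-scale), and bucket i covers the half-open range (base^i, base^(i+1)]. A small Python sketch of the boundary math:</p>
<pre><code class="language-python">def bucket_bounds(scale, index):
    # OpenTelemetry exponential histograms: base = 2 ** (2 ** -scale);
    # bucket i covers the half-open range (base ** i, base ** (i + 1)].
    base = 2.0 ** (2.0 ** -scale)
    return base ** index, base ** (index + 1)

# scale=3, index=13 (the first bucket of the /foo document above):
low, high = bucket_bounds(3, 13)  # roughly (3.08, 3.36], consistent with min=3.12
</code></pre>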
<p>And finally, query the time series index using ES|QL and the TS source command:</p>
<pre><code class="language-bash">TS my-index-000001  | STATS MIN(responseTime), MAX(responseTime),
        AVG(responseTime), MEDIAN(responseTime),
        PERCENTILE(responseTime, 90)
  BY http.path
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-metrics-analytics/exponential_histogram_esql_example.png" alt="ES|QL query results showing aggregations over an exponential histogram field" /></p>
<h2>Enhanced Downsampling</h2>
<p>Downsampling is essential for long-term data retention. We have introduced a new <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/downsampling-concepts#downsampling-methods">&quot;last value&quot; downsampling mode</a>. This method exchanges accuracy for storage efficiency and performance by keeping only the last sample value, providing a lightweight alternative to calculating aggregate metrics.</p>
<p>You can <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/run-downsampling">configure a time series data stream</a> for last value downsampling in a similar way as regular downsampling, just by setting the <code>downsampling_method</code> to <code>last_value</code>. For example, by using a data stream lifecycle:</p>
<pre><code class="language-bash">PUT _data_stream/my-data-stream/_lifecycle
{
  &quot;data_retention&quot;: &quot;7d&quot;,
  &quot;downsampling_method&quot;: &quot;last_value&quot;,
  &quot;downsampling&quot;: [
     {
       &quot;after&quot;: &quot;1m&quot;,
       &quot;fixed_interval&quot;: &quot;10m&quot;
      },
      {
        &quot;after&quot;: &quot;1d&quot;,
        &quot;fixed_interval&quot;: &quot;1h&quot;
      }
   ]
}
</code></pre>
<h2>In Conclusion</h2>
<p>These enhancements mark a significant step forward in Elastic's metrics analytics capabilities, delivering 5x+ faster queries, roughly 2x better storage efficiency, and specialized functions like <code>DERIV</code>, <code>CLAMP</code>, and <code>PERCENTILE_OVER_TIME</code>. With native support for OpenTelemetry exponential histograms and expanded downsampling options, SREs can now perform richer, more cost-effective analysis of their observability data. This release empowers teams to detect anomalies faster and manage long-term metrics retention with greater efficiency.</p>
<p>We welcome you to <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">try the new features</a> today!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-metrics-analytics/elastic_metrics_leaner_blog_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic MongoDB Atlas Integration: Complete Database Monitoring and Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-mongodb-atlas-integration</link>
            <guid isPermaLink="false">elastic-mongodb-atlas-integration</guid>
            <pubDate>Thu, 24 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Comprehensive MongoDB Atlas monitoring with Elastic's integration - track performance, security, and operations through real-time alerts, audit logs, and actionable insights.]]></description>
            <content:encoded><![CDATA[<p>In today's data-driven landscape, <a href="https://www.mongodb.com/products/platform/atlas-database">MongoDB Atlas</a> has emerged as the leading multi-cloud developer data platform, enabling organizations to work seamlessly with document-based data models while ensuring flexible schema design and easy scalability. However, as your Atlas deployments grow in complexity and criticality, comprehensive observability becomes essential for maintaining optimal performance, security, and reliability.</p>
<p>The Elastic <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> transforms how you monitor and troubleshoot your Atlas infrastructure by providing deep insights into every aspect of your deployment—from real-time alerts and audit trails to detailed performance metrics and organizational activities. This integration empowers teams to minimize Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) while gaining actionable insights for capacity planning and performance optimization.</p>
<h2>Why MongoDB Atlas Observability Matters</h2>
<p>MongoDB Atlas abstracts much of the operational complexity of running MongoDB, but this doesn't eliminate the need for monitoring. Modern applications demand:</p>
<ul>
<li><strong>Proactive Issue Detection</strong>: Identify performance bottlenecks, resource constraints, and security threats before they impact users</li>
<li><strong>Comprehensive Audit Trails</strong>: Track database operations, user activities, and configuration changes for compliance and security</li>
<li><strong>Performance Optimization</strong>: Monitor query performance, resource utilization, and capacity trends to optimize costs and user experience</li>
<li><strong>Operational Insights</strong>: Understand organizational activities, project changes, and infrastructure events across your multi-cloud deployments</li>
</ul>
<p>The Elastic <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> addresses these needs by collecting comprehensive telemetry data and presenting it through powerful visualizations and alerting capabilities.</p>
<h2>Integration Architecture and Data Streams</h2>
<p>The <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> leverages the <a href="https://www.mongodb.com/docs/atlas/reference/api-resources-spec/v2/">Atlas Administration API</a> to collect eight distinct data streams, each providing specific insights into different aspects of your Atlas deployment:</p>
<h3>Log Data Streams</h3>
<p><strong>Alert Logs</strong>: Capture real-time alerts generated by your Atlas instances, covering resource utilization thresholds (CPU, memory, disk space), database operations, security issues, and configuration changes. These alerts provide immediate visibility into critical events that require attention.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/alert_logs.png" alt="Alert Datastream" /></p>
<p><strong>Database Logs</strong>: Collect comprehensive operational logs from MongoDB instances, including incoming connections, executed commands, performance diagnostics, and issues encountered. These logs are invaluable for troubleshooting performance problems and understanding database behavior.</p>
<p><strong>MongoDB Audit Logs</strong>: Enable administrators to track system activity across deployments with multiple users and applications. These logs capture detailed events related to database operations including insertions, updates, deletions, user authentication, and access patterns—essential for security compliance and forensic analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/audit_logs.png" alt="Audit Datastream" /></p>
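<p>Once indexed, audit events can be sliced with ES|QL. The query below is an illustrative sketch for surfacing repeated authentication failures by user; verify the data stream pattern and ECS field names against the integration's exported fields:</p>
<pre><code class="language-bash">FROM logs-mongodb_atlas.*
  | WHERE event.outcome == &quot;failure&quot;
  | STATS failures = COUNT(*) BY user.name
  | SORT failures DESC
  | LIMIT 10
</code></pre>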
<p><strong>Organization Logs</strong>: Provide enterprise-level visibility into organizational activities, enabling tracking of significant actions involving database operations, billing changes, security modifications, host management, encryption settings, and user access management across teams.</p>
<p><strong>Project Logs</strong>: Offer project-specific event tracking, capturing detailed records of configuration modifications, user access changes, and general project activities. These logs are crucial for project-level auditing and change management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/project_logs.png" alt="Project Datastream" /></p>
<h3>Metrics Data Streams</h3>
<p><strong>Hardware Metrics</strong>: Collect comprehensive hardware performance data including CPU usage, memory consumption, JVM memory utilization, and overall system resource metrics for each process in your Atlas groups.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/hardware_metrics.png" alt="Hardware Datastream" /></p>
<p><strong>Disk Metrics</strong>: Monitor storage performance with detailed insights into I/O operations, read/write latency, and space utilization across all disk partitions used by MongoDB Atlas. These metrics help identify storage bottlenecks and plan capacity expansion.</p>
<p><strong>Process Metrics</strong>: Gather host-level metrics per MongoDB process, including detailed CPU usage patterns, I/O operation counts, memory utilization, and database-specific performance indicators like connection counts, operation rates, and cache utilization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/process_metrics.png" alt="Process Datastream" /></p>
<h2>Implementation Guide</h2>
<h3>Setting Up the Integration</h3>
<p>Getting started with MongoDB Atlas observability requires establishing API access and configuring the integration in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/setup.png" alt="Setup" /></p>
<ol>
<li>
<p><strong>Generate Atlas API Keys</strong>: Create <a href="https://www.mongodb.com/docs/atlas/configure-api-access/#grant-programmatic-access-to-an-organization">programmatic API keys</a> with Organization Owner permissions in the Atlas console, then invite these keys to your target projects with appropriate roles (Project Read Only for alerts/metrics, Project Data Access Read Only for audit logs).</p>
</li>
<li>
<p><strong>Enable Prerequisites</strong>: Enable database auditing in Atlas for projects where you want to collect audit and database logs. Gather your <a href="https://www.mongodb.com/docs/atlas/app-services/apps/metadata/#find-a-project-id">Project ID</a> and Organization ID from the Atlas UI.</p>
</li>
<li>
<p><strong>Configure in Kibana</strong>: Navigate to Management &gt; Integrations, search for &quot;MongoDB Atlas,&quot; and add the integration using your API credentials.</p>
</li>
</ol>
<p>The integration supports different permission levels for each data stream, ensuring you can collect operational metrics with minimal privileges while protecting sensitive audit data with elevated permissions.</p>
<h3>Considerations and Limitations</h3>
<ul>
<li><strong>Cluster Support</strong>: Log collection doesn't support M0 free clusters, M2/M5 shared clusters, or serverless instances</li>
<li><strong>Historical Data</strong>: Most log streams collect the previous 30 minutes of historical data</li>
<li><strong>Performance Impact</strong>: Large time spans may cause request timeouts; adjust HTTP Client Timeout accordingly</li>
</ul>
<h2>Real-World Use Cases and Benefits</h2>
<h3>Security and Compliance Monitoring</h3>
<p><strong>Audit Trail Management</strong>: Organizations in regulated industries leverage the audit logs to maintain comprehensive records of database access and modifications. The integration automatically parses and indexes audit events, making it easy to search for specific user activities, failed authentication attempts, or unauthorized access patterns.</p>
<p><strong>Security Incident Response</strong>: When security events occur, teams can quickly correlate alert logs with audit trails to understand the scope and timeline of incidents.</p>
<h3>Performance Optimization and Capacity Planning</h3>
<p><strong>Proactive Resource Management</strong>: By monitoring disk, hardware, and process metrics, teams can identify resource constraints before they impact application performance. For example, tracking disk I/O latency trends helps predict when storage upgrades are needed.</p>
<p><strong>Query Performance Analysis</strong>: Database logs combined with process metrics provide insights into slow queries, connection patterns, and resource utilization that enable database performance tuning.</p>
<h3>Operational Excellence</h3>
<p><strong>Multi-Environment Monitoring</strong>: Organizations running Atlas across development, staging, and production environments can standardize monitoring across all environments while maintaining environment-specific alerting thresholds.</p>
<p><strong>Change Management</strong>: Project and organization logs provide complete audit trails for infrastructure changes, enabling teams to correlate application issues with recent configuration modifications.</p>
<h2>Let's Try It!</h2>
<p>The MongoDB Atlas integration delivers comprehensive database observability that enables proactive management and optimization of your Atlas deployments. With pre-built dashboards and alerting capabilities, teams can gain immediate value while leveraging rich data streams for advanced analytics and custom monitoring solutions.</p>
<p>Deploy a cluster on <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> or <a href="https://www.elastic.co/cloud/serverless">Elastic Serverless</a>, or download the Elasticsearch stack, then spin up the MongoDB Atlas Integration, open the curated dashboards in Kibana and start monitoring your service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability monitors metrics for Google Cloud in just minutes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-monitors-metrics-google-cloud</link>
            <guid isPermaLink="false">observability-monitors-metrics-google-cloud</guid>
            <pubDate>Mon, 20 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to enable Elastic Observability for Google Cloud Platform metrics.]]></description>
            <content:encoded><![CDATA[<p>Developers and SREs choose to host their applications on Google Cloud Platform (GCP) for its reliability, speed, and ease of use. On Google Cloud, development teams are finding additional value in migrating to Kubernetes on GKE, leveraging the latest serverless options like Cloud Run, and improving traditional, tiered applications with managed services.</p>
<p>Elastic Observability offers 16 out-of-the-box integrations for Google Cloud services with more on the way. A full list of Google Cloud integrations can be found in <a href="https://docs.elastic.co/en/integrations/gcp">our online documentation</a>.</p>
<p>In addition to our native Google Cloud integrations, Elastic Observability aggregates not only logs but also metrics for Google Cloud services and the applications running on Google Cloud compute services (Compute Engine, Cloud Run, Cloud Functions, Kubernetes Engine). All this data can be analyzed visually and more intuitively using Elastic®’s advanced machine learning (ML) capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML based metrics correlations, read: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p>That’s right, Elastic offers metrics ingest, aggregation, and analysis for Google Cloud services and applications on Google Cloud compute services. Elastic is more than logs — it offers a unified observability solution for Google Cloud environments.</p>
<p>In this blog, I’ll review how Elastic Observability can monitor metrics for a three-tier web application running on Google Cloud services, which include:</p>
<ul>
<li>Google Cloud Run</li>
<li>Google Cloud SQL for PostgreSQL</li>
<li>Google Cloud Memorystore for Redis</li>
<li>Google Cloud VPC Network</li>
</ul>
<p>As you will see, once the integration is installed, metrics begin arriving within minutes, and you can immediately start reviewing them.</p>
<h2>Prerequisites and config</h2>
<p>Here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>Ensure you have a Google Cloud project and a Service Account with permissions to pull the necessary data from Google Cloud (<a href="https://docs.elastic.co/en/integrations/gcp#authentication">see details in our documentation</a>).</li>
<li>We used <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Google Cloud’s three-tier app</a> and deployed it using the Google Cloud console.</li>
<li>We’ll walk through installing the general <a href="https://docs.elastic.co/en/integrations/gcp">Elastic Google Cloud Platform Integration</a>, which covers the services we want to collect metrics for.</li>
<li>We will <em>not</em> cover application monitoring; instead, we will focus on how Google Cloud services can be easily monitored.</li>
<li>In order to see metrics, you will need to put load on the application. We’ve also created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three-tier application overview</h2>
<p>Before we dive into the Elastic configuration, let's review what we are monitoring. If you follow the <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Jump Start Solution: Three-tier web app</a> instructions for deploying the task-tracking app, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/1.png" alt="1" /></p>
<p>What’s deployed:</p>
<ul>
<li>Cloud Run frontend tier that renders an HTML client in the user's browser and enables user requests to be sent to the task-tracking app</li>
<li>Cloud Run middle tier API layer that communicates with the frontend and the database tier</li>
<li>Memorystore for Redis instance in the database tier, caching and serving data that is read frequently</li>
<li>Cloud SQL for PostgreSQL instance in the database tier, handling requests that can't be served from the in-memory Redis cache</li>
</ul>
<p>At the end of the blog, we will also provide a Playwright script that can be run to send requests to this app in order to load it with example data and exercise its functionality. This will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get the application, Google Cloud integration on Elastic, and what gets ingested.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/2.png" alt="2 - start free trial" /></p>
<h3>Step 1: Deploy the Google Cloud three-tier application</h3>
<p>Follow the instructions listed out in <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Jump Start Solution: Three-tier web app</a> choosing the <strong>Deploy through the console</strong> option for deployment.</p>
<h3>Step 2: Create a Google Cloud Service Account and download credentials file</h3>
<p>Once you’ve installed the app, the next step is to create a <em>Service Account</em> with a <em>Role</em> and a <em>Service Account Key</em> that will be used by Elastic’s integration to access data in your Google Cloud project.</p>
<p>Go to Google Cloud <a href="https://console.cloud.google.com/iam-admin/roles">IAM Roles</a> to create a Role with the necessary permissions. Click the <strong>CREATE ROLE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/3.png" alt="3" /></p>
<p>Give the Role a <strong>Title</strong> and an <strong>ID</strong>. Then add the 10 permissions listed below.</p>
<ul>
<li>cloudsql.instances.list</li>
<li>compute.instances.list</li>
<li>monitoring.metricDescriptors.list</li>
<li>monitoring.timeSeries.list</li>
<li>pubsub.subscriptions.consume</li>
<li>pubsub.subscriptions.create</li>
<li>pubsub.subscriptions.get</li>
<li>pubsub.topics.attachSubscription</li>
<li>redis.instances.list</li>
<li>run.services.list</li>
</ul>
<p>These permissions are a minimal set of what’s required for this blog post. You should add permissions for all the services for which you would like to collect metrics. If you need to add or remove permissions in the future, the Role’s permissions can be updated as many times as necessary.</p>
<p>Click the <strong>CREATE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/4.png" alt="4" /></p>
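<p>If you prefer the command line, the same Role can also be created with the gcloud CLI. The following is a sketch — the role ID elastic_metrics_reader is a hypothetical name, and your-project-id must be replaced with your own project ID:</p>
<pre><code class="language-bash"># Sketch: create a custom IAM role carrying the 10 permissions above.
# elastic_metrics_reader and your-project-id are placeholders.
gcloud iam roles create elastic_metrics_reader \
  --project=your-project-id \
  --title=&quot;Elastic Metrics Reader&quot; \
  --permissions=cloudsql.instances.list,compute.instances.list,\
monitoring.metricDescriptors.list,monitoring.timeSeries.list,\
pubsub.subscriptions.consume,pubsub.subscriptions.create,\
pubsub.subscriptions.get,pubsub.topics.attachSubscription,\
redis.instances.list,run.services.list
</code></pre>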
<p>Go to Google Cloud <a href="https://console.cloud.google.com/iam-admin/serviceaccounts">IAM Service Accounts</a> to create a Service Account that will be used by the Elastic integration for access to Google Cloud. Click the <strong>CREATE SERVICE ACCOUNT</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/5.png" alt="5" /></p>
<p>Enter a <strong>Service account name</strong> and a <strong>Service account ID</strong>. Click the <strong>CREATE AND CONTINUE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/6.png" alt="6" /></p>
<p>Then select the <strong>Role</strong> that you created previously and click the <strong>CONTINUE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/7.png" alt="7" /></p>
<p>Click the <strong>DONE</strong> button to complete the Service Account creation process.</p>
<p>Next select the Service Account you just created to see its details page. Under the <strong>KEYS</strong> tab, click the <strong>ADD KEY</strong> dropdown and select <strong>Create new key</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/8.png" alt="8" /></p>
<p>In the Create private key dialog window, with the <strong>Key type</strong> set as JSON, click the <strong>CREATE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/9.png" alt="9" /></p>
<p>The JSON key file will be automatically downloaded to your local computer’s <strong>Downloads</strong> folder. The credentials file will be named something like:</p>
<pre><code class="language-bash">your-project-id-12a1234b1234.json
</code></pre>
<p>You can rename the file to be something else. For the purpose of this blog, we’ll rename it to:</p>
<pre><code class="language-bash">credentials.json
</code></pre>
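<p>Step 2’s console clicks can also be scripted with the gcloud CLI. This sketch assumes the hypothetical account name elastic-metrics and the role ID elastic_metrics_reader, with your-project-id as a placeholder:</p>
<pre><code class="language-bash"># Sketch: create the Service Account, bind the custom Role to it,
# and download a JSON key. All names below are placeholders.
gcloud iam service-accounts create elastic-metrics \
  --project=your-project-id \
  --display-name=&quot;Elastic Metrics&quot;

gcloud projects add-iam-policy-binding your-project-id \
  --member=&quot;serviceAccount:elastic-metrics@your-project-id.iam.gserviceaccount.com&quot; \
  --role=&quot;projects/your-project-id/roles/elastic_metrics_reader&quot;

gcloud iam service-accounts keys create credentials.json \
  --iam-account=elastic-metrics@your-project-id.iam.gserviceaccount.com
</code></pre>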
<h3>Step 3: Create a Google Cloud VM instance</h3>
<p>To create the Compute Engine VM instance in Google Cloud, go to <a href="https://console.cloud.google.com/compute/instances">Compute Engine</a>. Then select <strong>CREATE INSTANCE.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/10.png" alt="10" /></p>
<p>Enter the following values for the VM instance details:</p>
<ul>
<li>Enter a <strong>Name</strong> of your choice for the VM instance.</li>
<li>Expand the <strong>Advanced Options</strong> section and the <strong>Networking</strong> sub-section.
<ul>
<li>Enter allow-ssh in the <strong>Network tags</strong> field.</li>
<li>Select the <strong>Network Interface</strong> to use the <strong>tiered-web-app-private-network</strong>, which is the network on which the Google Cloud three-tier web app is deployed.</li>
</ul>
</li>
</ul>
<p>Click the <strong>CREATE</strong> button to create the VM instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/11.png" alt="11" /></p>
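<p>The same VM can be created with the gcloud CLI. This is a sketch — the instance name, zone, and machine type below are assumptions, and your-project-id is a placeholder:</p>
<pre><code class="language-bash"># Sketch: create the VM on the app's private network with the
# allow-ssh network tag. Depending on the network's mode, you may
# also need to pass --subnet.
gcloud compute instances create elastic-agent-vm \
  --project=your-project-id \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --network=tiered-web-app-private-network \
  --tags=allow-ssh
</code></pre>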
<h3>Step 4: SSH into the Google Cloud VM instance and upload the credentials file</h3>
<p>To SSH into the Google Cloud VM instance you created in the previous step, you’ll need to create a firewall rule in <strong>tiered-web-app-private-network</strong>, the network where the VM instance resides.</p>
<p>Go to the Google Cloud <a href="https://console.cloud.google.com/net-security/firewall-manager/firewall-policies/list"><strong>Firewall policies</strong></a> page. Click the <strong>CREATE FIREWALL RULE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/12.png" alt="12" /></p>
<p>Enter the following values for the Firewall Rule.</p>
<ul>
<li>Enter a firewall rule <strong>Name</strong>.</li>
<li>Select <strong>tiered-web-app-private-network</strong> for the <strong>Network</strong>.</li>
<li>Enter allow-ssh for <strong>Target Tags</strong>.</li>
<li>Enter 0.0.0.0/0 for the <strong>Source IPv4 ranges</strong>.</li>
<li>Click <strong>TCP</strong> and set the <strong>Ports</strong> to <strong>22</strong>.</li>
</ul>
<p>Click <strong>CREATE</strong> to create the firewall rule.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/13.png" alt="13" /></p>
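<p>The equivalent firewall rule can be created with the gcloud CLI. A sketch, with your-project-id as a placeholder:</p>
<pre><code class="language-bash"># Sketch: allow SSH (TCP/22) to instances tagged allow-ssh on the
# app's network. Note that 0.0.0.0/0 opens SSH to the internet;
# consider narrowing the source range to your own IP.
gcloud compute firewall-rules create allow-ssh \
  --project=your-project-id \
  --network=tiered-web-app-private-network \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=allow-ssh
</code></pre>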
<p>After the new firewall rule is created, you can SSH into your VM instance. Go to the <a href="https://console.cloud.google.com/compute/instances">Google Cloud VM instances</a> page and select the VM instance you created in the previous step to see its details page. Click the <strong>SSH</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/14.png" alt="14" /></p>
<p>Once you are SSH’d inside the VM instance terminal window, click the <strong>UPLOAD FILE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/15.png" alt="15" /></p>
<p>Select the credentials.json file located on your local computer and click the <strong>Upload Files</strong> button to upload the file.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/16.png" alt="16" /></p>
<p>In the VM instance’s SSH terminal, run the following command to get the full path to your Google Cloud Service Account credentials file.</p>
<pre><code class="language-bash">realpath credentials.json
</code></pre>
<p>This should return the full path to your Google Cloud Service Account credentials file.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/17.png" alt="17" /></p>
<p>Copy the credentials file’s full path and save it in a handy location to be used in a later step.</p>
<h3>Step 5: Add the Elastic Google Cloud integration</h3>
<p>Navigate to the Google Cloud Platform integration in Elastic by selecting <strong>Integrations</strong> from the top-level menu. Search for google and click the <strong>Google Cloud Platform</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/18.png" alt="18" /></p>
<p>Click <strong>Add Google Cloud Platform</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/19.png" alt="19" /></p>
<p>Click <strong>Add integration only (skip agent installation)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/20.png" alt="20" /></p>
<p>Set the <strong>Project Id</strong> input text box to your Google Cloud project ID. Next, paste the credentials file’s full path into the <strong>Credentials File</strong> input text box.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/21.png" alt="21" /></p>
<p>As you can see, the general Elastic Google Cloud Platform Integration will collect a significant amount of data from 16 Google Cloud services. If you don’t want to install this general Elastic Google Cloud Platform Integration, you can select individual integrations to install. Click <strong>Save and continue</strong>.</p>
<p>You’ll be presented with a confirmation dialog window. Click <strong>Add Elastic Agent to your hosts</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/22.png" alt="22" /></p>
<p>This will display the instructions required to install the Elastic agent. Copy the command under the <strong>Linux Tar</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/23.png" alt="23" /></p>
<p>Next, you will need to SSH into the Google Cloud VM instance and run the commands copied from the <strong>Linux Tar</strong> tab. Go to <a href="https://console.cloud.google.com/compute/instances">Compute Engine</a>. Then click the name of the VM instance that you created in Step 3. Log in to the VM by clicking the <strong>SSH</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/14.png" alt="24 - instance" /></p>
<p>Once you are SSH’d inside the VM instance terminal window, run the commands copied previously from the <strong>Linux Tar</strong> tab in the <strong>Install Elastic Agent on your host</strong> instructions.</p>
<p>When the installation completes, you’ll see a confirmation message in the Install Elastic Agent on your host form. Click the <strong>Add the integration</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/25.png" alt="25 - add agent" /></p>
<p>Excellent! The Elastic agent is sending data to Elastic Cloud. Now let’s observe some metrics.</p>
<h3>Step 6: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put load on the application.</p>
<p>Here is a simple <a href="https://playwright.dev/">Playwright</a> script you can run to add traffic and exercise the functionality of the Google Cloud three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for Google Cloud Threetierapp&quot;, async ({ page }) =&gt; {
  // Replace this URL with the URL of your own deployed frontend
  await page.goto(&quot;https://tiered-web-app-fe-zg62dali3a-uc.a.run.app&quot;);
  // Insert 2 todo items
  await page.fill(&quot;id=todo-new&quot;, (Math.random() * 100).toString());
  await page.keyboard.press(&quot;Enter&quot;);
  await page.waitForTimeout(1000);
  await page.fill(&quot;id=todo-new&quot;, (Math.random() * 100).toString());
  await page.keyboard.press(&quot;Enter&quot;);
  await page.waitForTimeout(1000);
  // Click one todo item
  await page.getByRole(&quot;checkbox&quot;).nth(0).check();
  await page.waitForTimeout(1000);
  // Delete one todo item
  const deleteButton = page.getByText(&quot;delete&quot;).nth(0);
  await deleteButton.dispatchEvent(&quot;click&quot;);
  await page.waitForTimeout(4000);
});
</code></pre>
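<p>To keep traffic flowing, save the script in a project where Playwright is installed (for example via npm init playwright@latest) and run it in a loop from a shell. The file name tests/load.spec.ts below is just an assumption:</p>
<pre><code class="language-bash"># Sketch: run the Playwright test repeatedly to generate steady load.
while true; do
  npx playwright test tests/load.spec.ts
  sleep 5
done
</code></pre>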
<h3>Step 7: Go to Google Cloud dashboards in Elastic</h3>
<p>With Elastic Agent running, you can go to Elastic Dashboards to view what’s being ingested. Simply search for “dashboard” in Elastic and choose <strong>Dashboards.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/26.png" alt="26 - dashboard" /></p>
<p>This will open the Elastic Dashboards page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/27.png" alt="27" /></p>
<p>In the Dashboards search box, search for GCP and click the <strong>[Metrics GCP] CloudSQL PostgreSQL Overview</strong> dashboard, one of the many out-of-the-box dashboards available. Let’s see what comes up.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/28.png" alt="28" /></p>
<p>On the Cloud SQL dashboard, we can see a sampling of the many available metrics:</p>
<ul>
<li>Disk write ops</li>
<li>CPU utilization</li>
<li>Network sent and received bytes</li>
<li>Transaction count</li>
<li>Disk bytes used</li>
<li>Disk quota</li>
<li>Memory usage</li>
<li>Disk read ops</li>
</ul>
<p>Next let’s take a look at metrics for Cloud Run.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/29.png" alt="29 - line graphs" /></p>
<p>We’ve created a custom dashboard using the <strong>Create dashboard</strong> button on the Elastic Dashboards page. Here we see a few of the numerous available metrics:</p>
<ul>
<li>Container instance count</li>
<li>CPU utilization for the three-tier app frontend and API</li>
<li>Request count for the three-tier app frontend and API</li>
<li>Bytes in and out of the API</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/30.png" alt="30" /></p>
<p>This is a custom dashboard created for Memorystore, where we can see a sampling of the available metrics:</p>
<ul>
<li>Network traffic to the Memorystore Redis instance</li>
<li>Count of the keys stored in Memorystore Redis</li>
<li>CPU utilization of the Memorystore Redis instance</li>
<li>Memory usage of the Memorystore Redis instance</li>
</ul>
<p><strong>Congratulations, you have now started monitoring metrics from key Google Cloud services for your application!</strong></p>
<h2>What to monitor on Google Cloud next?</h2>
<h3>Add logs from Google Cloud Services</h3>
<p>Now that metrics are being monitored, you can add logging as well. There are several options for ingesting logs.</p>
<p>The Google Cloud Platform Integration in the Elastic Agent has four separate logs settings: audit logs, firewall logs, VPC Flow logs, and DNS logs. Just ensure you turn on what you wish to receive.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/31.png" alt="31" /></p>
<h3>Analyze your data with Elastic machine learning</h3>
<p>Once metrics, logs, or both are in Elastic, start analyzing your data with Elastic’s ML capabilities. A great review of these features can be found here:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM Telemetry to determine root causes in transactions</a></li>
<li><a href="https://www.elastic.co/elasticon/archive/2020/global/machine-learning-and-the-elastic-stack-everywhere-you-need-it">Introduction to Elastic Machine Learning</a></li>
</ul>
<h2>Conclusion: Monitoring Google Cloud service metrics with Elastic Observability is easy!</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help you monitor Google Cloud service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of Google Cloud service metrics.</li>
<li>It’s easy to set up ingest from Google Cloud services via the Elastic Agent.</li>
<li>Elastic Observability has multiple out-of-the-box Google Cloud service dashboards you can use to preliminarily review information and then modify for your needs.</li>
<li>For metrics not covered by out-of-the-box dashboards, custom dashboards can be easily created to visualize metrics that are important to you.</li>
<li>16 Google Cloud services are supported as part of Google Cloud Platform Integration on Elastic Observability, with more services being added regularly.</li>
<li>As noted in related blogs, you can analyze your Google Cloud service metrics with Elastic’s machine learning capabilities.</li>
</ul>
<p>Try it out for yourself by signing up via <a href="https://console.cloud.google.com/marketplace/product/elastic-prod/elastic-cloud">Google Cloud Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_google_cloud_platform_gcp_regions">Elastic Cloud regions on Google Cloud</a> around the world. Your Google Cloud Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with Google Cloud.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
<enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/serverless-launch-blog-image.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability monitors metrics for Microsoft Azure in just minutes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-monitors-metrics-microsoft-azure</link>
            <guid isPermaLink="false">observability-monitors-metrics-microsoft-azure</guid>
            <pubDate>Mon, 29 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to enable Elastic Observability for Microsoft Azure metrics.]]></description>
            <content:encoded><![CDATA[<p>Developers and SREs choose Microsoft Azure to run their applications because it is a trustworthy world-class cloud platform. It has also proven itself over the years as an extremely powerful and reliable infrastructure for hosting business-critical applications.</p>
<p>Elastic Observability offers over 25 out-of-the-box integrations for Microsoft Azure services with more on the way. A full list of Azure integrations can be found in <a href="https://docs.elastic.co/integrations/azure">our online documentation</a>.</p>
<p>Elastic Observability aggregates not only logs but also metrics for Azure services and the applications running on Azure compute services (Virtual Machines, Functions, Kubernetes Service, etc.). All this data can be analyzed visually and more intuitively using Elastic®’s advanced machine learning (ML) capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML-based metrics correlations, read <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p>That’s right, Elastic offers capabilities to collect, aggregate, and analyze metrics for Microsoft Azure services and applications running on Azure. Elastic Observability is for more than just capturing logs — it offers a unified observability solution for Microsoft Azure workloads.</p>
<p>In this blog, we’ll review how Elastic Observability can monitor metrics for a three-tier web application running on Microsoft Azure and leveraging:</p>
<ul>
<li>Microsoft Azure Virtual Machines</li>
<li>Microsoft Azure SQL database</li>
<li>Microsoft Azure Virtual Network</li>
</ul>
<p>As you will see, once the integration is installed, metrics will start arriving right away and you can immediately begin deriving insights.</p>
<h2>Prerequisites and config</h2>
<p>Here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have a Microsoft Azure account and an Azure service principal with permission to read monitoring data from Microsoft Azure (<a href="https://docs.elastic.co/integrations/azure_metrics/monitor#integration-specific-configuration-notes">see details in our documentation</a>).</li>
<li>This post does <em>not</em> cover application monitoring; instead, we will focus on how Microsoft Azure services can be easily monitored. If you want to get started with examples of application monitoring, see our <a href="https://github.com/elastic/observability-examples/tree/main/azure/container-apps">Hello World observability code samples</a>.</li>
<li>In order to see metrics, you will need to put load on the application. We’ve created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three-tier application overview</h2>
<p>Before we dive into the Elastic deployment setup and configuration, let's review what we are monitoring. If you follow the <a href="https://learn.microsoft.com/en-us/training/modules/n-tier-architecture/">Microsoft Learn N-tier example app</a> instructions for deploying the &quot;What's for Lunch?&quot; app, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-application-overview.png" alt="three tier application overview" /></p>
<p>What’s deployed:</p>
<ul>
<li>Microsoft Azure VM presentation tier that renders an HTML client in the user's browser and enables user requests to be sent to the “What’s for Lunch?” app</li>
<li>Microsoft Azure VM application tier that communicates with the presentation and the database tier</li>
<li>Microsoft Azure SQL instance in the database tier, handling requests from the application tier to store and serve data</li>
</ul>
<p>At the end of the blog, we will also provide a Playwright script that can be run to send requests to this app in order to load it with example data and exercise its functionality. This will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to deploy the example three-tier application, set up the Azure integration in Elastic, and visualize what gets ingested in Elastic’s Kibana® dashboards.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-free-trial.png" alt="elastic cloud free trial sign up" /></p>
<h3>Step 1: Deploy the Microsoft Azure three-tier application</h3>
<p>From the <a href="https://portal.azure.com/">Azure portal</a>, click the Cloud Shell icon at the top of the portal to open Cloud Shell…</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-open-cloud-shell.png" alt="open cloud shell" /></p>
<p>… and when the Cloud Shell first opens, select <strong>Bash</strong> as the shell type to use.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-cloud-shell-bash.png" alt="cloud shell bash" /></p>
<p>If you’re prompted that “You have no storage mounted,” then click the <strong>Create storage</strong> button to create a file store to be used for saving and editing files from Cloud Shell.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-storage.png" alt="cloud shell create storage" /></p>
<p>You should now see the open Cloud Shell terminal.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-cloud-shell-terminal.png" alt="cloud shell terminal" /></p>
<p>Run the following command in Cloud Shell to define the environment variables that we’ll be using in the Cloud Shell commands required to deploy and view the sample application.</p>
<p>Be sure to specify a valid RESOURCE_GROUP from your available <a href="https://portal.azure.com/#view/HubsExtension/BrowseResourceGroups">Resource Groups listed in the Azure portal</a>. Also specify a new password to replace the SpecifyNewPasswordHere placeholder text before running the command. See the Microsoft <a href="https://learn.microsoft.com/en-us/sql/relational-databases/security/password-policy?view=sql-server-ver16#password-complexity">password policy documentation</a> for password requirements.</p>
<pre><code class="language-bash">RESOURCE_GROUP=&quot;test&quot;
APP_PASSWORD=&quot;SpecifyNewPasswordHere&quot;
</code></pre>
<p>Run the following az deployment group create command, which will deploy the example three-tier web app in around five minutes.</p>
<pre><code class="language-bash">az deployment group create --resource-group $RESOURCE_GROUP --template-uri https://raw.githubusercontent.com/MicrosoftDocs/mslearn-n-tier-architecture/master/Deployment/azuredeploy.json --parameters password=$APP_PASSWORD
</code></pre>
<p>After the deployment has completed, run the following command, which returns the URL for the app.</p>
<pre><code class="language-bash">az deployment group show --output table --resource-group $RESOURCE_GROUP --name azuredeploy --query properties.outputs.webSiteUrl
</code></pre>
<p>Copy the web app URL and paste it into a browser to view the example “What’s for Lunch?” web app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-whats-for-lunch.png" alt="whats for lunch app" /></p>
<h3>Step 2: Create an Azure service principal and grant access permission</h3>
<p>Go to the <a href="https://portal.azure.com/">Microsoft Azure Portal</a>. Search for active directory and select <strong>Microsoft Entra ID</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-active-directory.png" alt="search active directory" /></p>
<p>Copy the <strong>Tenant ID</strong> for use in a later step in this blog post. This ID is required to configure Elastic Agent to connect to your Azure account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-overview.png" alt="your organization overview" /></p>
<p>In the navigation pane, select <strong>App registrations</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-overview-app-registrations.png" alt="your organization overview app registrations" /></p>
<p>Then click <strong>New registration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-new-registration.png" alt="your organization new registrations" /></p>
<p>Type the name of your application (this tutorial uses three-tier-app-azure) and click <strong>Register</strong> (accept the default values for other settings).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-register_an_application.png" alt="register an application" /></p>
<p>Copy the <strong>Application (client) ID</strong> and save it for later. This ID is required to configure Elastic Agent to connect to your Azure account.</p>
<p>In the navigation pane, select <strong>Certificates &amp; secrets</strong>, and then click <strong>New client secret</strong> to create a new security key.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-app-new-client-secret.png" alt="three tier app new client secret" /></p>
<p>Type a description of the secret and select an expiration. Click <strong>Add</strong> to create the client secret. Under <strong>Value</strong>, copy the secret value and save it (along with your client ID) for later.</p>
<p>After creating the Azure service principal, you need to grant it the correct permissions. In the Azure Portal, search for and select <strong>Subscriptions</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-subscriptions.png" alt="three tier subscriptions" /></p>
<p>In the Subscriptions page, click the name of your subscription. On the subscription details page, copy your <strong>Subscription ID</strong> and save it for a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-essentials-copy.png" alt="subscription essentials copy" /></p>
<p>In the navigation pane, select <strong>Access control (IAM)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-access-control.png" alt="subscription access control" /></p>
<p>Click <strong>Add</strong> and select <strong>Add role assignment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-access-control-add-role-assignment.png" alt="subscription access control add role assignment" /></p>
<p>On the <strong>Role</strong> tab, select the <strong>Monitoring Reader</strong> role and then click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-monitoring-readers.png" alt="add role assignment monitoring reader" /></p>
<p>On the <strong>Members</strong> tab, select the option to assign access to <strong>User, group, or service principal</strong>. Click <strong>Select members</strong>, and then search for and select the principal you created earlier. For the description, enter the name of your service principal. Click <strong>Next</strong> to review the role assignment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-description.png" alt="add role assignment description" /></p>
<p>Click <strong>Review + assign</strong> to grant the service principal access to your subscription.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-review-assign.png" alt="add role assignment review assign" /></p>
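<p>If you prefer the Azure CLI, the service principal creation and role assignment above can be done in a single command from Cloud Shell. This is a sketch — the SUBSCRIPTION_ID value is a placeholder, and the command prints the appId (client ID), password (client secret), and tenant (tenant ID) values you’ll need later:</p>
<pre><code class="language-bash"># Sketch: create the service principal and grant Monitoring Reader
# on the subscription in one step. Replace the placeholder ID.
SUBSCRIPTION_ID=&quot;your-subscription-id&quot;
az ad sp create-for-rbac \
  --name three-tier-app-azure \
  --role &quot;Monitoring Reader&quot; \
  --scopes &quot;/subscriptions/$SUBSCRIPTION_ID&quot;
</code></pre>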
<h3>Step 3: Create an Azure VM instance</h3>
<p>In the Azure Portal, search for and select <strong>Virtual machines</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-search-virtual-machines.png" alt="search virtual machines" /></p>
<p>On the <strong>Virtual machines</strong> page, click <strong>+ Create</strong> and select <strong>Azure virtual machine</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-virtual-machine.png" alt="azure virtual machine" /></p>
<p>On the Virtual machine creation page, enter a name like “metrics-vm” for the virtual machine name and select VM Size to be “Standard_D2s_v3 - 2 vcpus, 8 GiB memory.” Click the <strong>Next : Disks</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-macine-next-disks.png" alt="create a virtual machine next disks" /></p>
<p>On the <strong>Disks</strong> page, keep the default settings and click the <strong>Next : Networking</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-next-networking.png" alt="create a virtual machine next networking" /></p>
<p>On the <strong>Networking</strong> page, demo-vnet should be selected for <strong>Virtual network</strong> and demo-biz-subnet should be selected for <strong>Subnet</strong>. These resources are created as part of the three-tier example app’s deployment that was done in Step 1.</p>
<p>Click the <strong>Review + create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-review-create.png" alt="create virtual machine review create" /></p>
<p>On the <strong>Review</strong> page, click the <strong>Create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-validation-passed.png" alt="create virtual machine validation passed" /></p>
<h3>Step 4: Install the Azure Resource Metrics integration</h3>
<p>In your <a href="https://cloud.elastic.co/home">Elastic Cloud</a> deployment, navigate to the Elastic Azure integrations by selecting <strong>Integrations</strong> from the top-level menu. Search for “azure resource” and click the <strong>Azure Resource Metrics</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-integrations-azure-resource-metrics.png" alt="integrations azure resource metrics" /></p>
<p>Click <strong>Add Azure Resource Metrics</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-resource-metrics.png" alt="azure resource metrics" /></p>
<p>Click <strong>Add integration only (skip agent installation)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-integration-only.png" alt="add integration only" /></p>
<p>Enter the values that you saved previously for Client ID, Client Secret, Tenant ID, and Subscription ID.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-azure-resource-metrics-integration.png" alt="add azure resource metrics integration" /></p>
<p>As you can see, the Azure Resource Metrics integration will collect a significant amount of data from eight Azure services. Click <strong>Save and continue</strong>.</p>
<p>You’ll be presented with a confirmation dialog window. Click <strong>Add Elastic Agent to your hosts</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-resource-metrics-integration-added.png" alt="azure resource metrics integration added" /></p>
<p>This will display the instructions required to install the Elastic agent. Copy the command under the <strong>Linux Tar</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-agent.png" alt="add agent linux tar" /></p>
<p>Next, you will need to SSH into the Azure VM instance and run the commands copied from the <strong>Linux Tar</strong> tab. Go to <a href="https://portal.azure.com/#blade/HubsExtension/BrowseResourceBlade/resourceType/Microsoft.Compute/VirtualMachines">Azure Virtual Machines</a> in the Azure portal, then click the name of the VM instance that you created in Step 3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-metrics-vm.png" alt="metrics vm" /></p>
<p>Click the <strong>Select</strong> button in the <strong>SSH Using Azure CLI</strong> section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-metrics-vm-connect.png" alt="metrics vm connect" /></p>
<p>Select the “I understand …” checkbox and then click the <strong>Configure + connect</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-ssh-using-azure-cli.png" alt="ssh using azure cli" /></p>
<p>Once you are SSH’d into the VM instance terminal, run the commands you copied earlier from the <strong>Linux Tar</strong> tab of the <strong>Install Elastic Agent on your host</strong> instructions. When the installation completes, you’ll see a confirmation message in the <strong>Install Elastic Agent on your host</strong> form.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-agent-confirmed.png" alt="add agent confirmed" /></p>
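<p>For reference, the copied commands follow this general pattern (the version, Fleet URL, and enrollment token below are placeholders; use the exact commands from your own deployment’s <strong>Linux Tar</strong> tab):</p>
<pre><code class="language-bash"># Download and unpack the Elastic Agent (use the version shown in your Fleet UI)
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-&lt;version&gt;-linux-x86_64.tar.gz
tar xzvf elastic-agent-&lt;version&gt;-linux-x86_64.tar.gz
cd elastic-agent-&lt;version&gt;-linux-x86_64
# Enroll the agent against your Fleet URL with the generated enrollment token
sudo ./elastic-agent install --url=&lt;fleet-url&gt; --enrollment-token=&lt;enrollment-token&gt;
</code></pre>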
<p>Super! The Elastic agent is sending data to Elastic Cloud. Now let’s observe some metrics.</p>
<h3>Step 5: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put some load on the application.</p>
<p>Here is a simple script you can run with <a href="https://playwright.dev/">Playwright</a> to add traffic and exercise the functionality of the Azure three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for Microsoft Azure three tier app&quot;, async ({ page }) =&gt; {
  // Load web app
  await page.goto(&quot;http://20.172.198.231/&quot;);
  // Add lunch suggestions
  const suggestions = [&quot;tacos&quot;, &quot;sushi&quot;, &quot;pizza&quot;, &quot;burgers&quot;, &quot;salad&quot;, &quot;sandwiches&quot;];
  for (const suggestion of suggestions) {
    await page.fill(&quot;id=txtAdd&quot;, suggestion);
    await page.keyboard.press(&quot;Enter&quot;);
    await page.waitForTimeout(1000);
  }
  // Click the vote button for each suggestion
  for (let n = 1; n &lt;= 11; n += 2) {
    await page.getByRole(&quot;button&quot;).nth(n).click();
  }
  // Click the remove button for each suggestion, last to first
  for (let n = 12; n &gt;= 2; n -= 2) {
    await page.getByRole(&quot;button&quot;).nth(n).click();
  }
});
</code></pre>
<h3>Step 6: View Azure dashboards in Elastic</h3>
<p>With Elastic Agent running, you can go to Elastic Dashboards to view what’s being ingested. Simply search for “dashboard” in Elastic and choose <strong>Dashboard</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-dashboard.png" alt="dashboard" /></p>
<p>This will open the Elastic Dashboards page. In the Dashboards search box, search for “azure vm” and click the <strong>[Azure Metrics] Compute VMs Overview</strong> dashboard, one of the many out-of-the-box dashboards available.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-dashboards-create.png" alt="dashboards create" /></p>
<p>You will see a Dashboard populated with your deployed application’s VM metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-compute-vm.png" alt="azure compute vm" /></p>
<p>On the Azure Compute VM dashboard, we can see the following sampling of some of the many available metrics:</p>
<ul>
<li>CPU utilization</li>
<li>Available memory</li>
<li>Network sent and received bytes</li>
<li>Disk writes and reads metrics</li>
</ul>
<p>For metrics not covered by out-of-the-box dashboards, custom dashboards can be easily created to visualize metrics that are important to you.</p>
<p><strong>Congratulations, you have now started monitoring metrics from Microsoft Azure services for your application!</strong></p>
<h2>Analyze your data with Elastic AI Assistant</h2>
<p>Once metrics and logs (or either one) are in Elastic, start analyzing your data with <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">context-aware insights using the Elastic AI Assistant for Observability</a>.</p>
<h2>Conclusion: Monitoring Microsoft Azure service metrics with Elastic Observability is easy!</h2>
<p>We hope you’ve gotten an appreciation for how Elastic Observability can help you monitor Azure service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of Azure service metrics.</li>
<li>It’s easy to set up ingest from Azure services via the Elastic Agent.</li>
<li>Elastic Observability has multiple out-of-the-box Azure service dashboards you can use to preliminarily review information and then modify for your needs.</li>
</ul>
<p>Try it out for yourself by signing up via <a href="https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryItemDetailsBladeNopdl/id/elastic.ec-azure-pp">Microsoft Azure Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_azure_regions">Elastic Cloud regions on Microsoft Azure</a> around the world. Your Azure Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with Microsoft Azure.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/Azure_Dark_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Streams for Observability: Your first stop for investigations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations</link>
            <guid isPermaLink="false">elastic-observability-streams-ai-logs-investigations</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Introducing Elastic Streams, a new AI observability feature that transforms logs from a noisy and expensive data source into a primary investigation signal.]]></description>
            <content:encoded><![CDATA[<p>We're excited to introduce Streams, a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips the script by transforming raw logs into your most valuable asset to immediately identify not only the root cause, but also the why behind the root cause to enable instant resolution.</p>
<p>SREs today identify the &quot;what&quot; with metrics and the &quot;where&quot; with traces, which are important for troubleshooting. However, it's often the &quot;why&quot; that's needed for faster and more accurate incident resolution. The crucial “why” is buried in your logs, but the massive volume and unstructured nature of logs in modern microservice environments have made them difficult to use effectively. This has forced teams into a difficult position: either spend countless hours building and maintaining complex data pipelines to tame the chaos, or drop valuable log data to control costs and risk critical visibility gaps. As a result, when an incident occurs, SREs waste precious time manually hunting for clues and reverse-engineering data instead of quickly resolving the issue.</p>
<h2>Streams, from ingest to answers with logs</h2>
<p>Streams directly addresses this challenge by using AI to transform the chaos of raw logs into your clearest path to a solution, enabling logs to be the primary signal for investigations. It processes raw logs at scale ingested from any source and in any format (structured and unstructured), then partitions, parses, and helps manage retention and data quality. Streams reduces the need for SREs to constantly normalize data, manage custom schemas, or sift through endless noise. Streams also surfaces Significant Events, like major errors and anomalies, enabling you to be proactive in your investigations. SREs can now focus on resolving issues faster than ever by spending less time on data management and hunting through the noise.</p>
<p>Let's see Streams in action. In the demo below, watch an SRE tackle an issue with a critical trading application in production. In minutes, Streams processes the raw logs, pinpoints a Java out-of-memory error, and the AI Assistant guides the SRE straight to the root cause, turning hours of manual work into a quick fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-in-action.gif" alt="Streams in Action" /></p>
<p>Let's walk through some of the key Streams capabilities highlighted in the video:</p>
<ul>
<li><strong>AI-based partitioning</strong> - simplifies ingest by allowing SREs to send all logs to a single endpoint, without worrying about agents or integrations. Our AI automatically determines that logs are coming from two different systems, Hadoop and Spark. As more data comes through, it continues to learn and identify additional components, making segmentation effortless.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-01.png" alt="AI-based partitioning" /></p>
<ul>
<li><strong>AI-based parsing</strong> - eliminates the manual effort of building and managing log processing pipelines. In the demo, Streams automatically detects logs from Spark and generates a Grok rule that parses 100% of the fields.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-02.png" alt="AI-based parsing" /></p>
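<p>To make the idea concrete, here is a hypothetical Spark-style log line and a JavaScript regular expression doing roughly what such a Grok rule does. This is illustrative only: the rule Streams actually generates, its Grok syntax, and its field names will differ.</p>
<pre><code class="language-javascript">// A made-up log line in Spark's default log layout (date, level, component, message)
const line = '24/10/27 14:03:11 ERROR Executor: Exception in task 3.0 (TID 42)';
// Regex equivalent of a Grok pattern that splits the line into named fields
const pattern = /^(\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([\w.]+): (.*)$/;
const [, timestamp, level, component, message] = line.match(pattern);
// level is 'ERROR', component is 'Executor', message holds the exception text
</code></pre>
<p>Once parsed this way, each field becomes individually searchable and aggregatable, which is what makes the downstream analysis in the demo possible.</p>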
<ul>
<li><strong>Identifying Significant Events</strong> - cuts through the noise so you can focus immediately on key issues. Streams analyzes the parsed Spark logs and pinpoints the Java out-of-memory errors and exceptions. This provides SREs with a clear, actionable starting point for their investigations instead of forcing them to hunt through raw data.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-03.png" alt="Significant Events" /></p>
<ul>
<li><strong>AI Assistant</strong> - The AI Assistant provides instant root cause analysis, turning hours of work into immediate answers. After Streams identifies the Java OOM error, an SRE can analyze logs in Discover with the AI Assistant. Within moments, it determines the root cause is that Spark lacks sufficient memory for the datasets being processed, delivering a precise answer to guide remediation.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-04.png" alt="AI Assistant" /></p>
<p>One thing that isn't shown in the video is how easy Streams makes log ingestion. In the example above, we used the OTel Collector and simply configured the processors, exporters, and service sections in the values.yaml file for the OTel Collector's Helm chart:</p>
<pre><code class="language-yaml">processors:
  batch:
  transform/logs-streams:
    log_statements:
      - context: resource
        statements:
          - set(attributes[&quot;elasticsearch.index&quot;], &quot;logs&quot;)
exporters:
  debug:
  otlp/ingest:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch, transform/logs-streams]
      exporters: [otlp/ingest, debug]
</code></pre>
<p>With Streams you can use any log forwarder: the OTel Collector (as in the example above), Fluentd, Fluent Bit, and so on. This keeps ingestion simple and ensures you aren't locked into any specific log forwarder for Elastic.</p>
<p>As you've seen in this example, Streams helps SREs focus on finding the “why”, without the manual, error-prone work of making logs usable. What used to happen in hours can now be accomplished in minutes.</p>
<h2>Streams: Key Features and availability</h2>
<p>While the previous example shows how easy and fast it is to get to the root cause with partitioning, parsing, Significant Events, and the AI Assistant, Streams has more capabilities, which are highlighted in the following diagram:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-05.png" alt="Streams" /></p>
<p>All of these capabilities are available in two primary modes: Streams for data already indexed in Elasticsearch, and Logs Streams for ingesting raw logs directly. Both modes support AI-driven partitioning and parsing, the identification of Significant Events, and essential tools for managing data quality, retention, and cost-efficient storage.</p>
<p><strong>Streams (GA in 9.2)</strong></p>
<p>Provides foundational capabilities that reduce pipeline management for SREs. Streams works with logs from existing agents and integrations as well as raw, unstructured logs coming through Logs Streams. Key capabilities include:</p>
<ul>
<li>
<p>Streams Processing: simulate and refine log parsing using AI-powered Parsing or a point-and-click UI. Compare before-and-after states and modify schemas to simplify log processing.</p>
</li>
<li>
<p>Streams Retention Management: define time-based or advanced ILM policies directly in the UI, gain visibility into ingestion volume, and manage data in the failure store.</p>
</li>
<li>
<p>Streams Data Quality: detect and fix ingestion failures via a failure store that captures and exposes failed documents for inspection.</p>
</li>
</ul>
<p><strong>Logs Streams (Tech Preview)</strong></p>
<p>Enables SREs to ingest any log, in any format, directly into Elasticsearch, without the need for agents or integrations. Key capabilities include:</p>
<ul>
<li>
<p>Direct Ingestion with any log forwarder into Elasticsearch: send raw logs directly into the /logs index using any mechanism, such as the logs_index parameter in an OpenTelemetry collector.</p>
</li>
<li>
<p>AI-Driven Partitioning: automatically or manually segment a single log stream into distinct parts (e.g., by service or component) using contextual AI-based suggestions.</p>
</li>
</ul>
<p><strong>Significant Events (Tech Preview)</strong></p>
<p>Significant Events is available in both Streams and Logs Streams, and surfaces errors and anomalies that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other signals of change. These events act as actionable markers, giving SREs early warning and an investigative starting point before a service impact occurs.</p>
<h2>What does this mean for SREs in practice?</h2>
<p>With Elastic Streams, SREs no longer need to spend time wrangling data before they can investigate. Logs are the primary investigation signal because Streams gives SREs the ability to:</p>
<ul>
<li><strong>Log everything in any format, and don't worry about pipelines</strong> - Stop wasting time building and maintaining complex ingestion pipelines. Send logs in any format, structured or unstructured, from any source directly to a single Elastic endpoint, without needing specific agents. Use OTel collectors or any other data shipper to send logs to Elastic. Streams' AI-driven processing parses and structures your log data, making it immediately “ready for investigation.” This means you can adapt to new log formats on the fly without maintaining brittle configurations. Streams ensures you always have the data you need, the moment you need it.</li>
<li><strong>Don't just collect logs, get answers from them</strong> - Streams analyzes your data to surface “Significant Events,” proactively identifying critical errors, anomalies, and performance bottlenecks like out-of-memory exceptions. Instead of manually sifting through terabytes of data, you get a clear, prioritized starting point for your investigation. This allows you to go from symptom to solution in minutes, fixing issues before they impact users.</li>
<li><strong>Achieve complete visibility at a lower cost</strong> - Get comprehensive visibility across all your services without the expected expense. By intelligently structuring data and surfacing only the most critical events, Streams reduces operational complexity and dramatically cuts down root cause analysis time. This efficiency allows you to store all relevant log data cost-effectively, ensuring you never have to sacrifice crucial visibility to meet a budget. Get clearer answers faster and lower your total cost of ownership.</li>
</ul>
<h2>Conclusion</h2>
<p>Elastic Streams revolutionizes observability by transforming logs from a noisy and expensive data source into a primary investigation signal. Through AI-powered capabilities like automatic partitioning, parsing, retention management, and the surfacing of Significant Events, Streams empowers SREs to move beyond data management and directly pinpoint the root cause of issues. By reducing operational complexity, lowering storage costs, and providing complete visibility, Streams ensures that logs, enriched by AI, become the fastest path to resolution by answering the critical question of “why” for observability.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-launch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic to observe GKE Autopilot clusters]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observe-gke-autopilot-clusters</link>
            <guid isPermaLink="false">observe-gke-autopilot-clusters</guid>
            <pubDate>Wed, 15 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[See how deploying the Elastic Agent onto a GKE Autopilot cluster makes observing the cluster’s behavior easy. Kibana integrations make visualizing the behavior a simple addition to your observability dashboards.]]></description>
            <content:encoded><![CDATA[<p>Elastic has formally supported Google Kubernetes Engine (GKE) since January 2020, when Elastic Cloud on Kubernetes was announced. Since then, Google has expanded GKE, with new service offerings and delivery mechanisms. One of those new offerings is GKE Autopilot. Where GKE is a managed Kubernetes environment, GKE Autopilot is a mode of Kubernetes operation where Google manages your cluster configuration, scaling, security, and more. It is production ready and removes many of the challenges associated with tasks like workload management, deployment automation, and scalability rules. Autopilot lets you focus on building and deploying your application while Google manages everything else.</p>
<p>Elastic is committed to supporting Google Kubernetes Engine (GKE) in all of its delivery modes. In October, during the Google Cloud Next ‘22 event, we announced our intention to integrate and certify Elastic Agent on Anthos, Autopilot, Google Distributed Cloud, and more.</p>
<p>Since that event, we have worked together with Google to get the Elastic Agent certified for use on Anthos, but we didn’t stop there.</p>
<p>Today we are happy to <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">announce</a> that we have been certified for operation on GKE Autopilot.</p>
<h2>Hands on with Elastic and GKE Autopilot</h2>
<h3><a href="https://www.elastic.co/observability/kubernetes-monitoring">Kubernetes observability</a> has never been easier</h3>
<p>To show how easy it is to get started with Autopilot and Elastic, let's walk through deploying the Elastic Agent on an Autopilot cluster. I’ll show how easy it is to set up and monitor an Autopilot cluster with the Elastic Agent and observe the cluster’s behavior with Kibana integrations.</p>
<p>One of the main differences between GKE and GKE Autopilot is that Autopilot protects the system namespace “kube-system.” To increase the stability and security of a cluster, Autopilot prevents user space workloads from adding or modifying system pods. The default configuration for Elastic Agent is to install itself into the system namespace. The majority of the changes we will make here are to convince the Elastic Agent to run in a different namespace.</p>
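<p>As an illustration (the namespace name here is arbitrary), the change amounts to pointing the manifest's objects at a user namespace instead of kube-system:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  # The stock manifest targets kube-system, which Autopilot restricts;
  # a user-owned namespace works instead.
  namespace: elastic-agent
</code></pre>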
<h2>Let’s get started with Elastic Stack!</h2>
<p>While writing this article, I used the latest version of Elastic. The best way for you to get started with Elastic Observability is to:</p>
<ol>
<li>Get an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and look at this <a href="https://www.elastic.co/videos/training-how-to-series-cloud">tutorial</a> to help launch your first stack, or</li>
<li><a href="https://www.elastic.co/partners/google-cloud">Launch Elastic Cloud on your Google Account</a></li>
</ol>
<h2>Provisioning an Autopilot cluster and an Elastic stack</h2>
<p>To test the agent, I first deployed the recommended, default GKE Autopilot cluster. Elastic’s GKE integration supports kube-state-metrics (KSM), which increases the number of metrics available for reporting and dashboards. Like the Elastic Agent, KSM defaults to running in the system namespace, so I modified its manifest to work with Autopilot. For my testing, I also deployed a basic Elastic stack on Elastic Cloud in the same Google region as my Autopilot cluster. I used a fresh cluster deployed on Elastic’s managed service (ESS), but the process is the same if you are using an Elastic Cloud subscription purchased through the Google marketplace.</p>
<h2>Adding Elastic Observability to GKE Autopilot</h2>
<p>Because this is a brand new deployment, Elastic suggests adding integrations to it. Let’s add the Kubernetes integration into the new deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-welcome-to-elastic.png" alt="elastic agent GKE autopilot welcome" /></p>
<p>Elastic offers hundreds of integrations; filter the list by typing “kub” into the search bar (1) and then click the Kubernetes integration (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-integration.png" alt="elastic agent GKE autopilot kubernetes integration" /></p>
<p>The Kubernetes integration page gives you an overview of the integration and lets you manage the Kubernetes clusters you want to observe. We haven’t added a cluster yet, so I clicked “Add Kubernetes” to add the first integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes.png" alt="elastic agent GKE autopilot add kubernetes" /></p>
<p>I changed the integration name to reflect the Kubernetes offering type and then clicked “Save and continue” to accept the integration defaults.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes-integration.png" alt="elastic agent GKE autopilot add kubernetes integration" /></p>
<p>At this point, an Agent policy has been created. Now it’s time to install the agent. I clicked on the “Kubernetes” integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-agent-policy-1.png" alt="elastic agent GKE autopilot agent policy" /></p>
<p>Then I selected the “integration policies” tab (1) and clicked “Add agent” (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-agent.png" alt="elastic agent GKE autopilot add agent" /></p>
<p>Finally, I downloaded the full manifest for a standard GKE environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-download-manifest.png" alt="elastic agent GKE autopilot download manifest" /></p>
<p>We won’t be using this manifest directly, but it contains many of the values that we will need to deploy the agent on Autopilot in the next section.</p>
<p>The Elastic stack is ready and waiting for the Autopilot logs, metrics, and events. It’s time to connect Autopilot to this deployment using the Elastic Agent for GKE.</p>
<h2>Connect Autopilot to Elastic</h2>
<p>From the Google cloud terminal, I downloaded and edited the Elastic Agent manifest for GKE Autopilot.</p>
<pre><code class="language-bash">$ curl -L -o elastic-agent-managed-gke-autopilot.yaml \
https://raw.githubusercontent.com/elastic/elastic-agent/autopilotdocumentaton/docs/manifests/elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-cloud-shell-editor.png" alt="elastic agent GKE autopilot cloud shell editor" /></p>
<p>I used the cloud shell editor to configure the manifest for my Autopilot and Elastic clusters. For example, I updated the following:</p>
<pre><code class="language-yaml">containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>I also set the agent image tag to the version of the Elastic stack I installed (8.6.0).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-google-cloud.png" alt="elastic agent GKE autopilot google cloud" /></p>
<p>From the Integration manifest I downloaded earlier, I copied the values for FLEET_URL and FLEET_ENROLLMENT_TOKEN into this YAML file.</p>
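<p>In the manifest, those values land in the Elastic Agent container's environment variables, roughly like this (both values below are placeholders for your own deployment's settings):</p>
<pre><code class="language-yaml">env:
  - name: FLEET_URL
    value: https://my-deployment.fleet.us-central1.gcp.cloud.es.io:443
  - name: FLEET_ENROLLMENT_TOKEN
    value: PASTE-ENROLLMENT-TOKEN-HERE
</code></pre>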
<p>Now it’s time to apply the updated manifest to the Autopilot instance.</p>
<p>Before I commit, I always like to see what’s going to be created (and check for syntax errors) with a dry run.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply --dry-run=&quot;client&quot; -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-dry-run.png" alt="elastic agent GKE autopilot dry run" /></p>
<p>Everything looks good, so I’ll do it for real this time.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-autopilot-cluster.png" alt="elastic agent GKE autopilot cluster" /></p>
<p>After several minutes, metrics will start flowing from the Autopilot cluster directly into the Elastic deployment.</p>
<h2>Adding a workload to the Autopilot cluster</h2>
<p>Observing an Autopilot cluster without a workload is boring, so I deployed a modified version of Google’s <a href="https://github.com/bshetti/opentelemetry-microservices-demo">Hipster Shop</a> (which includes OpenTelemetry reporting):</p>
<pre><code class="language-bash">$ git clone https://github.com/bshetti/opentelemetry-microservices-demo
$ cd opentelemetry-microservices-demo
$ nano ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p>To point the application’s telemetry at our Elastic stack, I changed the exporter type from HTTP (otlphttp/elastic) to gRPC (otlp/elastic) everywhere it appeared. I then replaced OTEL_EXPORTER_OTLP_ENDPOINT with my APM endpoint and OTEL_EXPORTER_OTLP_HEADERS with my APM authorization bearer token.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-terminal-telemetry.png" alt="elastic agent GKE autopilot terminal telemetry" /></p>
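<p>After those edits, the exporter section of otelcollector.yaml looked roughly like this (a sketch; the endpoint and token are placeholders for the values from your own deployment):</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    endpoint: &quot;https://&lt;your-apm-endpoint&gt;:443&quot;
    headers:
      Authorization: &quot;Bearer &lt;your-secret-token&gt;&quot;
</code></pre>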
<p>Then I deployed the Hipster Shop.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/adservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/redis.yaml
$ kubectl create -f ./deploy-with-collector-k8s/cartservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/checkoutservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/currencyservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/emailservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/frontend.yaml
$ kubectl create -f ./deploy-with-collector-k8s/paymentservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/productcatalogservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/recommendationservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/shippingservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/loadgenerator.yaml
</code></pre>
<p>Once all of the shop’s pods were running, I deployed the OpenTelemetry collector.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-deployed-opentelemetry-collector.png" alt="elastic agent GKE autopilot deployed opentelemetry collector" /></p>
<h2>Observe and visualize Autopilot’s metrics</h2>
<p>Now that we have added the Elastic Agent to our Autopilot cluster and added a workload, let's take a look at some of the Kubernetes visualizations the integration provides out of the box.</p>
<p>The “[Metrics Kubernetes] Overview” is a great place to start. It provides a high-level view of the resources used by the cluster and allows me to drill into more specific dashboards that I find interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-visualization.png" alt="elastic agent GKE autopilot create visualization" /></p>
<p>For example, the “[Metrics Kubernetes] Pods” dashboard gives me a high-level view of the pods deployed in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-pod.png" alt="elastic agent GKE autopilot pod" /></p>
<p>The “[Metrics Kubernetes] Volumes” dashboard gives me an in-depth view into how storage is allocated and used in the Autopilot cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-filesystem-information.png" alt="elastic agent GKE autopilot filesystem information" /></p>
<h2>Creating an alert</h2>
<p>From here, I can easily discover patterns in my cluster’s behavior and even create alerts. Here is an example of an alert that notifies me if the main storage volume (called “volume”) exceeds 80% of its allocated space:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-rule-elasticsearch-query.png" alt="elastic agent GKE autopilot create rule" /></p>
<p>With a little work, I created this view from the standard dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" alt="elastic agent GKE autopilot kubernetes dashboard" /></p>
<h2>Conclusion</h2>
<p>Today I have shown how easy it is to monitor, observe, and generate alerts on a GKE Autopilot cluster. To get more information on what is possible, see the official Elastic documentation for <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">Autopilot observability with Elastic Agent</a>.</p>
<h2>Next steps</h2>
<p>If you don’t have Elastic yet, you can get started for free with an <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">Elastic Trial</a> today. Get more from Elastic and Google together with a <a href="https://console.cloud.google.com/marketplace/browse?q=Elastic&amp;utm_source=Elastic&amp;utm_medium=qwiklabs&amp;utm_campaign=Qwiklabs+to+Marketplace">Marketplace subscription</a>. Elastic does more than just integrate with GKE — check out the almost <a href="https://www.elastic.co/integrations">300 integrations</a> that Elastic provides.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's OpenTelemetry SDK for .NET]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-dotnet-applications</guid>
            <pubDate>Tue, 02 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Today, we are excited to announce the alpha release of our new Elastic distribution of the OpenTelemetry SDK for .NET. In this post, we cover a few likely questions you may have about this new distribution and explain how to get started.]]></description>
            <content:encoded><![CDATA[<p>We are thrilled to announce the alpha release of our new <a href="https://github.com/elastic/elastic-otel-dotnet/releases">Elastic® distribution of the OpenTelemetry SDK for .NET</a>. In this post, we cover a few reasonable questions you may have about this new distribution.</p>
<p>Download the <a href="https://www.nuget.org/packages/Elastic.OpenTelemetry">NuGet package</a> today if you want to try out this early access release. We welcome all feedback and suggestions to help us enhance the distribution before its stable release.</p>
<p><a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">Check out our announcement blog post</a> to learn more about OpenTelemetry and our decision to introduce OpenTelemetry distributions.</p>
<h2>The Elastic .NET OpenTelemetry distribution</h2>
<p>With the alpha release of the Elastic distribution of the .NET OpenTelemetry SDK, we are embracing OpenTelemetry as the preferred and recommended choice for instrumenting .NET applications.</p>
<p>In .NET, the runtime base class libraries (BCL) include types designed for native OpenTelemetry instrumentation, such as <a href="https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activity">Activity</a> and <a href="https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter">Meter</a>, making adopting OpenTelemetry-native instrumentation even more convenient.</p>
<p>The current alpha release of our distribution is consciously feature-limited. Our goal is to assess the fitness of the API design and ease of use, laying a solid foundation going forward. We acknowledge that it is likely not suited to all application scenarios, so while we welcome developers installing it to try it out, we don’t currently advise using it for production.</p>
<p>In subsequent releases, we plan to add more features as we move toward feature parity with the existing Elastic APM agent for .NET. Based on user feedback, we will refine the API and move toward a stable release. Until then, we may need to make some breaking API changes to support additional use cases.</p>
<p>The current alpha release supports installation in typical modern workloads such as <a href="https://dotnet.microsoft.com/en-us/apps/aspnet">ASP.NET Core</a> and <a href="https://learn.microsoft.com/en-us/dotnet/core/extensions/workers">worker services</a>. It best supports modern .NET runtimes, .NET 6.0 and later. We’d love to hear about other scenarios you think we should focus on next.</p>
<p>The types we introduce in the distribution support an easy switch from the “vanilla” OpenTelemetry SDK with no (or minimal) code changes. We expect that in most circumstances, merely adding the NuGet package is all that is required to get started.</p>
<p>The initial alpha releases add very little on top of the “vanilla” SDK from OpenTelemetry, but by adopting it early, you can shape its direction. We will deliver valuable enhancements to developers in subsequent releases.</p>
<p>If you’d like to follow the development of the distribution, the code is fully open source and <a href="https://github.com/elastic/elastic-otel-dotnet">available on GitHub</a>. We encourage you to raise issues for bugs or usability pain points you encounter.</p>
<h2>How do I get started?</h2>
<p>Getting started with the Elastic OpenTelemetry distribution is straightforward: simply add a package reference for the Elastic OpenTelemetry NuGet package to your project (csproj) file.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
</code></pre>
<p>After adding the package reference, you can use the Elastic OpenTelemetry distribution in your application. The distribution includes a transitive dependency on the OpenTelemetry SDK, so you do not need to add the OpenTelemetry SDK package to your project. Doing so will cause no harm and may be used to opt into newer SDK versions before the Elastic distribution references them.</p>
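<p>For example, to opt into a newer SDK version than the one the distribution currently references, you could add an explicit reference alongside it (the SDK version shown is illustrative):</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
&lt;PackageReference Include=&quot;OpenTelemetry&quot; Version=&quot;1.7.0&quot; /&gt;
</code></pre>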
<p>The Elastic OpenTelemetry distribution is designed to be easy to use and integrate into your applications, including those that have previously used the OpenTelemetry SDK directly. When the OpenTelemetry SDK is already being used, the only required change is to add the Elastic.OpenTelemetry NuGet package to the project. Doing so will automatically switch to the opinionated configuration provided by the Elastic distribution.</p>
<h3>ASP.NET Core example</h3>
<p>A common requirement is to instrument ASP.NET Core applications based on <strong>Microsoft.Extensions.Hosting</strong> libraries, which provide dependency injection via an <strong>IServiceProvider</strong>.</p>
<p>The OpenTelemetry SDK and the Elastic distribution include extension methods to enable observability features in your application by adding a few lines of code.</p>
<p>This example focuses on adding instrumentation to an ASP.NET Core minimal API application using the Elastic OpenTelemetry distribution. Similar steps can also be applied to instrument other ASP.NET Core workloads and host-based applications such as Worker Services.</p>
<p><em>NOTE: This example assumes that we start with a new</em> <a href="https://learn.microsoft.com/en-us/aspnet/core/tutorials/min-web-api"><em>minimal API project</em></a> <em>created using project templates available with the</em> <a href="https://dotnet.microsoft.com/en-us/download/dotnet/8.0"><em>.NET 8 SDK</em></a><em>. It also uses top-level statements inside a single Program.cs file.</em></p>
<p>Add the <strong>Elastic.OpenTelemetry</strong> package reference to the project (csproj) file.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
</code></pre>
<p>To take advantage of the OpenTelemetry SDK instrumentation for ASP.NET Core, also add the <strong>OpenTelemetry.Instrumentation.AspNetCore</strong> NuGet package.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;OpenTelemetry.Instrumentation.AspNetCore&quot; Version=&quot;1.7.1&quot; /&gt;
</code></pre>
<p>This package includes support to collect instrumentation (traces and metrics) for requests handled by ASP.NET Core endpoints.</p>
<p>Inside the <strong>Program.cs</strong> file of the ASP.NET Core application, add the following two using directives:</p>
<pre><code class="language-csharp">using OpenTelemetry;
using OpenTelemetry.Trace;
</code></pre>
<p>The OpenTelemetry SDK includes extension methods on the <strong>IServiceCollection</strong> to enable and configure the trace, metric, and log providers. The Elastic distribution overrides the default SDK registration, adding several opinionated defaults.</p>
<p>In the minimal API template, the <strong>WebApplicationBuilder</strong> exposes a <strong>Services</strong> property that can be used to register services with the dependency injection container. Ensure that the OpenTelemetry SDK is registered to enable tracing and metrics collection.</p>
<pre><code class="language-csharp">var builder = WebApplication.CreateBuilder(args);

builder.Services
  .AddHttpClient() // &lt;1&gt;
  .AddOpenTelemetry() // &lt;2&gt;
    .WithTracing(t =&gt; t.AddAspNetCoreInstrumentation()); // &lt;3&gt;
</code></pre>
<blockquote>
<p>&lt;1&gt; AddHttpClient registers the IHttpClientFactory service with the dependency injection container. This is <em>not</em> required to enable OpenTelemetry, but the example endpoint will use it to send an HTTP request.</p>
<p>&lt;2&gt; AddOpenTelemetry registers the OpenTelemetry SDK with the dependency injection container. When available, the Elastic distribution will override this to add opinionated defaults.</p>
<p>&lt;3&gt; Configures OpenTelemetry tracing to collect trace data produced by ASP.NET Core.</p>
</blockquote>
<p>With these limited changes to the Program.cs file, the application is now configured to use the OpenTelemetry SDK and the Elastic distribution to collect traces and metrics, which are exported via OTLP.</p>
<p>To demonstrate the tracing capabilities, we will define a single endpoint for the API via the <strong>WebApplication</strong>.</p>
<pre><code class="language-csharp">var app = builder.Build();

app.UseHttpsRedirection();

app.MapGet(&quot;/&quot;, (IHttpClientFactory httpClientFactory) =&gt;
  Api.HandleRoot(httpClientFactory)); // &lt;1&gt;

app.Run();
</code></pre>
<blockquote>
<p>&lt;1&gt; Maps an endpoint that handles requests to the application's root URL path. The handler will be supplied from a static class that we also need to add to the application. It accepts an <strong>IHttpClientFactory</strong> as a parameter, which will be injected from the dependency injection container at runtime and passed as an argument to the <strong>HandleRoot</strong> method.</p>
</blockquote>
<pre><code class="language-csharp">namespace Example.Api
{
  internal static class Api
  {
    public static async Task&lt;IResult&gt; HandleRoot(IHttpClientFactory httpClientFactory)
    {
      using var client = httpClientFactory.CreateClient();

      await Task.Delay(100); // simulate work
      var response = await client.GetAsync(&quot;http://elastic.co&quot;); // &lt;1&gt;
      await Task.Delay(50); // simulate work

      return response.StatusCode == System.Net.HttpStatusCode.OK ? Results.Ok() : Results.StatusCode(500);
    }
  }
}
</code></pre>
<blockquote>
<p>&lt;1&gt; This URL will require two redirects, allowing us to see multiple spans in the trace.</p>
</blockquote>
<p>This static class includes a <strong>HandleRoot</strong> method that matches the signature for the endpoint handler delegate.</p>
<p>After creating an <strong>HttpClient</strong> from the factory, it sends a GET request to the elastic.co website. On either side of the request is a short delay, used here to simulate some business logic being executed. The method returns a suitable status code based on the result of the external HTTP request.</p>
<p>If you’re following along, you will also need to include a using directive for the <strong>Example.Api</strong> namespace in your Program.cs file.</p>
<pre><code class="language-csharp">using Example.Api;
</code></pre>
<p>That is all of the code we require for now. The Elastic distribution will automatically enable the exporting of telemetry signals via the OTLP exporter. The OTLP exporter requires that endpoint(s) be configured. A common mechanism for configuring endpoints is via environment variables.</p>
<p>This demo uses an Elastic Cloud deployment as the destination for our observability data. To retrieve the endpoint information from Kibana® running in Elastic Cloud, navigate to the observability setup guides. Select the OpenTelemetry option to view the configuration details that should be supplied to the application.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/1-apm-agents.png" alt="apm agents image" /></p>
<p>Configure environment variables for the application either in launchSettings.json or in the environment where the application is running. The authorization header bearer token should be stored securely, in user secrets or a suitable key vault system.</p>
<p>At a minimum, we must configure two environment variables:</p>
<ul>
<li>
<p>OTEL_EXPORTER_OTLP_ENDPOINT</p>
</li>
<li>
<p>OTEL_EXPORTER_OTLP_HEADERS</p>
</li>
</ul>
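<p>For local development, these can be set as environment variables before launching the application; the endpoint and token shown here are placeholders for the values from your own deployment:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-deployment&gt;.apm.&lt;region&gt;.cloud.es.io&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;your-secret-token&gt;&quot;
</code></pre>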
<p>It is also highly recommended to configure at least a descriptive service name for the application using the OTEL_RESOURCE_ATTRIBUTES environment variable; otherwise, a generic default will be applied. For example:</p>
<pre><code class="language-json">&quot;OTEL_RESOURCE_ATTRIBUTES&quot;: &quot;service.name=minimal-api-example&quot;
</code></pre>
<p>Additional resource tags, such as version, can and should be added as appropriate. You can read more about the options for configuring resource attributes in the <a href="https://opentelemetry.io/docs/languages/net/resources/">OpenTelemetry .NET SDK documentation</a>.</p>
<p>Once configured, run the application and make an HTTP request to its root endpoint. A trace will be generated and exported to the configured OTLP endpoint.</p>
<p>To view the traces, you can use the Elastic APM Kibana UI. From the Kibana home page, go to the Observability area and find the trace under the APM &gt; Traces page. After selecting a suitable time frame and choosing the trace named “GET /,” you will be able to explore one or more trace samples.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/2-trace-sample.png" alt="trace sample" /></p>
<p>The above trace demonstrates the built-in instrumentation collection provided by the OpenTelemetry SDK and the optional <strong>OpenTelemetry.Instrumentation.AspNetCore</strong> package that we added.</p>
<p>It’s important to highlight that we would see a different trace above if we had used the “vanilla” SDK without the Elastic distribution. The HTTP spans that appear in blue in the screenshot would not be shown. By default, the OpenTelemetry SDK does not enable HTTP instrumentation, and it would require additional code to configure the instrumentation of outbound HTTP requests. The Elastic distribution takes the opinion that HTTP spans should be captured and enables this feature by default.</p>
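<p>For comparison, with the “vanilla” SDK, capturing outbound HTTP spans would require referencing the <strong>OpenTelemetry.Instrumentation.Http</strong> package and opting in explicitly, roughly like this (a sketch; this step is not needed when the Elastic distribution is present):</p>
<pre><code class="language-csharp">builder.Services
  .AddOpenTelemetry()
    .WithTracing(t =&gt; t
      .AddAspNetCoreInstrumentation()
      .AddHttpClientInstrumentation()); // requires OpenTelemetry.Instrumentation.Http
</code></pre>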
<p>It is also possible to add application-specific instrumentation to this application. Typically, this would require calling vendor-specific APIs, for example, the <a href="https://www.elastic.co/guide/en/apm/agent/dotnet/current/public-api.html#api-tracer-api">tracer API</a> in Elastic APM Agent. A significant benefit of choosing OpenTelemetry is the capability to use vendor-neutral APIs to instrument code with no vendor lock-in. We can see that in action by updating the <strong>Api</strong> class in the sample.</p>
<pre><code class="language-csharp">internal static class Api
{
  public static string ActivitySourceName = &quot;CustomActivitySource&quot;;
  private static readonly ActivitySource ActivitySource = new(ActivitySourceName);

  public static async Task&lt;IResult&gt; HandleRoot(IHttpClientFactory httpClientFactory)
  {
    using var activity = ActivitySource.StartActivity(&quot;DoingStuff&quot;, ActivityKind.Internal);
    activity?.SetTag(&quot;custom-tag&quot;, &quot;TagValue&quot;);

    using var client = httpClientFactory.CreateClient();

    await Task.Delay(100);
    var response = await client.GetAsync(&quot;http://elastic.co&quot;); // using this URL will require 2 redirects
    await Task.Delay(50);

    if (response.StatusCode == System.Net.HttpStatusCode.OK)
    {
      activity?.SetStatus(ActivityStatusCode.Ok);
      return Results.Ok();
    }

    activity?.SetStatus(ActivityStatusCode.Error);
    return Results.StatusCode(500);
  }
}
</code></pre>
<p>The preceding code snippet defines a private static <strong>ActivitySource</strong> field inside the <strong>Api</strong> class. Inside the <strong>HandleRoot</strong> method, an <strong>Activity</strong> is started from the ActivitySource, and a custom tag is set. The <strong>ActivitySource</strong> and <strong>Activity</strong> types are part of the .NET BCL (base class library) and live in the <strong>System.Diagnostics</strong> namespace. A using directive is required to use them.</p>
<pre><code class="language-csharp">using System.Diagnostics;
</code></pre>
<p>By using the Activity APIs to instrument the above code, we are not tied to any specific vendor APM solution. To learn more about using the .NET APIs to instrument code in an OpenTelemetry native way, visit the <a href="https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing-instrumentation-walkthroughs">Microsoft Learn page covering distributed tracing instrumentation</a>.</p>
<p>The last modification we must apply will instruct OpenTelemetry to observe spans from our application-specific <strong>ActivitySource</strong>. This is achieved by updating the registration of the OpenTelemetry components with the dependency injection framework.</p>
<pre><code class="language-csharp">builder.Services
  .AddHttpClient()
  .AddOpenTelemetry()
    .WithTracing(t =&gt; t
      .AddAspNetCoreInstrumentation()
      .AddSource(Api.ActivitySourceName)); // &lt;1&gt;
</code></pre>
<blockquote>
<p>&lt;1&gt; AddSource subscribes the OpenTelemetry SDK to spans (activities) produced by our application code.</p>
</blockquote>
<p>A new trace will be collected and exported after making these changes, rerunning the application, and requesting the root endpoint. The latest trace can be viewed in the Kibana observability UI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/3-timeline.png" alt="timeline" /></p>
<p>The trace waterfall now includes the internal “DoingStuff” span produced by the instrumentation that we added to our application code. The HTTP spans still appear and are now child spans of the “DoingStuff” span.</p>
<p>We’re working on writing more thorough documentation to be published on elastic.co. Until then, you can find more information in our repository <a href="https://github.com/elastic/elastic-otel-dotnet/blob/main/README.md">readme</a> and the <a href="https://github.com/elastic/elastic-otel-dotnet/tree/main/docs">docs folder</a>.</p>
<p>As the distribution is designed to extend the capabilities of the OpenTelemetry SDK with limited impact on the code used to register the SDK, we recommend visiting the <a href="https://opentelemetry.io/docs/languages/net/">OpenTelemetry documentation for .NET</a> to learn about instrumenting code and more advanced configuration of the SDK.</p>
<h2>What are the next steps?</h2>
<p>We are very excited to expand our support of the OpenTelemetry community and contribute to its future within the .NET ecosystem. This is the compelling next step toward greater collaboration between all observability vendors to provide a rich ecosystem supporting developers on their journey to improved application observability with zero vendor lock-in.</p>
<p>At this stage, we strongly appreciate any feedback the .NET community and our customers can provide to guide the direction of our OpenTelemetry distribution. Please <a href="https://www.nuget.org/packages/Elastic.OpenTelemetry">try out our distribution</a> and engage with us through our <a href="https://github.com/elastic/elastic-otel-dotnet">GitHub repository</a>.</p>
<p>In the coming weeks and months, we will focus on stabilizing the distribution's API and porting Elastic APM Agent features into the distribution. In parallel, we expect to start donating and contributing features to the broader OpenTelemetry community via the <a href="https://github.com/open-telemetry/">OpenTelemetry GitHub repositories</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/OTel-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's OpenTelemetry Distribution for Node.js]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-node-js</guid>
            <pubDate>Mon, 06 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic OpenTelemetry Distribution for Node.js. See how easy it is to instrument your Node.js applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We are delighted to announce the alpha release of the <a href="https://github.com/elastic/elastic-otel-node/tree/main/packages/opentelemetry-node#readme">Elastic OpenTelemetry Distribution for Node.js</a>. This distribution is a light wrapper around the OpenTelemetry Node.js SDK that makes it easier to get started using OpenTelemetry to observe your Node.js applications.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start, and we have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a> and <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a>. The Elastic OpenTelemetry Distribution for Node.js is the latest addition.</p>
<h2>Getting started</h2>
<p>To get started with the Elastic OTel Distribution for Node.js (the &quot;distro&quot;), you need only install and load a single npm dependency (@elastic/opentelemetry-node). The distro sets up the collection of traces, metrics, and logs for a number of popular Node.js packages. It sends data to any OTLP endpoint you configure. This could be a standard OTel Collector or, as shown below, an Elastic Observability cloud deployment.</p>
<pre><code class="language-bash">npm install --save @elastic/opentelemetry-node  # (1) install the SDK

# (2) configure it, for example:
export OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-west1.gcp.cloud.es.io
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ...REDACTED...&quot;
export OTEL_SERVICE_NAME=my-service

# (3) load and start it
node --require @elastic/opentelemetry-node my-service.js
</code></pre>
<h2>A small example with Express and PostgreSQL</h2>
<p>For a concrete example, let's look at a small Node.js &quot;Shortlinks&quot; service implemented using the <a href="https://expressjs.com/">Express</a> web framework and the <a href="https://node-postgres.com/">pg PostgreSQL client package</a>. This service provides a POST / route for creating short links (a short name for a URL) and a GET /:shortname route for using them.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/recent_shortlinks.png" alt="Recent shortlinks" /></p>
<p>The git repository is <a href="https://github.com/elastic/elastic-otel-node-example">here</a>. The <a href="https://github.com/elastic/elastic-otel-node-example#readme">README</a> shows how to create a free trial Elastic cloud deployment and get the appropriate OTEL_... config settings. Try it out (prerequisites are Docker and Node.js v20 or later):</p>
<pre><code class="language-bash">git clone https://github.com/elastic/elastic-otel-node-example.git
cd elastic-otel-node-example
npm install

cp config.env.template config.env
# Edit OTEL_ values in &quot;config.env&quot; to point to your collection endpoint.

npm run db:start
npm start
</code></pre>
<p>The only steps needed to set up observability are <a href="https://github.com/elastic/elastic-otel-node-example/blob/v1.0.0/package.json#L30-L33">these small changes</a> to the &quot;package.json&quot; file and configuring a few standard OTEL_... environment variables.</p>
<pre><code class="language-json">// ...
  &quot;scripts&quot;: {
    &quot;start&quot;: &quot;node --env-file=./config.env -r @elastic/opentelemetry-node lib/app.js&quot;
  },
  &quot;dependencies&quot;: {
    &quot;@elastic/opentelemetry-node&quot;: &quot;*&quot;,
  // ...
</code></pre>
<p>The result is an observable application using the industry-standard <a href="https://opentelemetry.io/">OpenTelemetry</a> — offering high-quality instrumentation of many popular Node.js libraries, a portable API to avoid vendor lock-in, and an active community.</p>
<p>With Elastic Observability, out-of-the-box benefits include rich trace viewing, service maps, integrated metrics and log analysis, and more. The distro ships <a href="https://github.com/open-telemetry/opentelemetry-js-contrib#readme">host metrics</a>, and Kibana provides a curated service metrics UI. Logs from the popular <a href="https://github.com/winstonjs/winston">Winston</a> and <a href="https://github.com/trentm/node-bunyan">Bunyan</a> logging frameworks are sent out of the box, with support planned for <a href="https://getpino.io">Pino</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/trace_sample.png" alt="trace sample screenshot" /></p>
<h2>What's next?</h2>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>As authors of the Node.js distribution, we are excited to work with the OpenTelemetry JavaScript community and to help make the JS API &amp; SDK a more robust, featureful, and obvious choice for JavaScript observability. Having a distro gives us the flexibility to build features on top of the vanilla OTel SDK. Currently, some advantages of the distro include: single package for installation, easy auto-instrumentation with reasonable default configuration, ESM enabled by default, and automatic logs telemetry sending. We will certainly contribute features upstream to the OTel JavaScript project when possible and will include additional features in the distro when it makes more sense for them to be there.</p>
<p>The Elastic OpenTelemetry Distribution for Node.js is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-node/blob/main/packages/opentelemetry-node/docs/getting-started.mdx">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-node/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-node/issues">the project issue tracker</a> or <a href="https://discuss.elastic.co/tags/c/apm/nodejs">Elastic's Node.js APM Discuss forum</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/Node-js.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's distribution of OpenTelemetry PHP]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-php</guid>
            <pubDate>Mon, 16 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic distribution of OpenTelemetry PHP. See how easy it is to instrument your PHP applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to introduce the first alpha release of <a href="https://github.com/elastic/elastic-otel-php">Elastic Distribution for OpenTelemetry PHP</a>. In this post, you’ll learn how to easily install and set up monitoring for your PHP applications.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start, and we have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a>, <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a>, <a href="https://github.com/elastic/elastic-otel-node#readme">Node.js</a> and  <a href="https://github.com/elastic/elastic-otel-python#readme">Python</a>. The Elastic distribution of OpenTelemetry PHP is the latest addition.</p>
<h2>Getting started</h2>
<p>To install Elastic Distribution for OpenTelemetry PHP for your application, download the appropriate package for your Linux distribution from <a href="https://github.com/elastic/elastic-otel-php/releases">https://github.com/elastic/elastic-otel-php/releases</a>.</p>
<p>Currently, we provide packages for DEB-, RPM-, and APK-based systems, for x86_64 and ARM64 processors.</p>
<p>For DEB-based systems, run the following command:</p>
<pre><code class="language-bash">dpkg -i &lt;package-file&gt;.deb
</code></pre>
<p>For RPM-based systems, run the following command:</p>
<pre><code class="language-bash">rpm -ivh &lt;package-file&gt;.rpm
</code></pre>
<p>For APK-based systems (Alpine), run the following command:</p>
<pre><code class="language-bash">apk add --allow-untrusted &lt;package-file&gt;.apk
</code></pre>
<p>The package installer automatically detects the installed PHP versions and updates their configuration, so the monitoring extension will be available after the next process restart (processes must be restarted to load the new php.ini configuration).
A few environment variables are needed to configure instrumentation of your services. These mainly concern the destination of your traces and the identification of your service. You’ll also need to provide the authorization headers for authentication with Elastic Observability Cloud and the Elastic Cloud endpoint where the data is sent.</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=&lt;url encoded apikey header value&gt;&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT=&lt;your elastic cloud url&gt;
</code></pre>
<p>where</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: The full URL of the endpoint where data will be sent.</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: A comma-separated list of <code>key=value</code> pairs that will be added to the headers of every request. This is typically used for authentication information.</li>
</ul>
<p>After restarting the application, you should see insights into the monitored applications in Kibana, such as service maps and trace views. In the example below, you can see trace details from the Aimeos application, built with the Laravel framework.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/traces-laravel.png" alt="Aimeos trace example" /></p>
<p>Below is an example of a Slim application using HttpAsyncClient:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/traces-slim.png" alt="Slim and HttpAsyncClient trace example" /></p>
<h2>What's next?</h2>
<p>In this alpha version, we support all modern PHP versions from 8.0 to 8.3 inclusive, providing instrumentation for PHP code, including popular frameworks and libraries like Laravel, Slim, and HttpAsyncClient, as well as native extensions such as PDO. In future releases, we plan to introduce additional features supported by OpenTelemetry, along with Elastic APM-exclusive features like Inferred Spans.</p>
<p>Stay tuned!</p>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>As authors of the PHP distribution, we are excited to work with the OpenTelemetry PHP community and to help make the PHP SDK a more robust, featureful, and obvious choice for PHP observability. Having a distro gives us the flexibility to build features on top of the vanilla OTel SDK. Currently, some advantages of the distro include: fully automatic installation and full auto-instrumentation. We will certainly contribute features upstream to the OTel PHP project when possible and will include additional features in the distro when it makes more sense for them to be there.</p>
<p>The Elastic OpenTelemetry Distribution of PHP is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-php/blob/main/docs/get-started.md">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-php/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-php/issues">the project issue tracker</a> or <a href="https://discuss.elastic.co/tags/c/apm/php">Elastic's PHP APM Discuss forum</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/php.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic, OpenLIT and OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-openlit-tracing</guid>
            <pubDate>Thu, 29 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Langchain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with the OpenLIT instrumentation library to ingest traces into Elastic Observability APM.]]></description>
            <content:encoded><![CDATA[<p>The realm of technology is evolving rapidly, and Large Language Models (LLMs) are at the forefront of this transformation. From chat bots to intelligent application copilots, LLMs are becoming increasingly sophisticated. As these applications grow more complex, ensuring their reliability and performance is paramount. This is where observability steps in, aided by OpenTelemetry and Elastic through the <a href="https://github.com/openlit/openlit">OpenLIT</a> instrumentation library. </p>
<p>OpenLIT is an open-source observability and evaluation tool that helps take your LLM apps from playground to debugging to production. With OpenLIT, you can choose from a <a href="https://docs.openlit.io/latest/integrations/introduction">range of integrations</a> (across LLMs, vector databases, frameworks, and GPUs) to start tracking LLM performance, usage, and costs without hassle. In this blog, we will look at tracking OpenAI and LangChain and sending the resulting telemetry to an OpenTelemetry-compatible endpoint like Elastic.</p>
<p><a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic supports OpenTelemetry natively</a>: it can ingest telemetry directly from the application (via the OpenTelemetry SDKs) or through a native OTel Collector, with no special agents needed. Additionally, <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic's EDOT</a> provides a supported set of OTel SDKs and an OTel Collector. In this blog, we connect our application directly to Elastic, without a collector, for simplicity.</p>
<h2>Why Observability Matters for LLM Applications</h2>
<p>Monitoring LLM applications is crucial for several reasons.</p>
<ol>
<li>
<p>It’s vital to keep track of how often LLMs are called, for usage and cost accounting.</p>
</li>
<li>
<p>Latency is important to track since the response time from the model can vary based on the inputs passed to the LLM.</p>
</li>
<li>
<p>Rate limiting is a common challenge, particularly for external LLMs, as applications depend more on these external API calls. When rate limits are hit, it can hinder these applications from performing their essential functions using these LLMs.</p>
</li>
</ol>
<p>By keeping a close eye on these aspects, you can not only save costs but also avoid hitting request limits, ensuring your LLM applications perform optimally.</p>
<h2>What are the signals that you should be looking at?</h2>
<p>Using Large Language Models (LLMs) in applications differs from using traditional machine learning (ML) models. Primarily, LLMs are often accessed through external API calls instead of being run locally or in-house. It is crucial to capture the sequence of events (using traces), especially in a RAG-based application where there can be events before and after LLM usage. Analyzing aggregated data (through metrics), such as request counts, tokens, and cost, is equally important for optimizing performance and managing costs. Here are the key signals to monitor:</p>
<h3>Traces</h3>
<p><strong>Request Metadata</strong>: This is important in the context of LLMs, given the variety of parameters (like temperature and top_p) that can drastically affect both the response quality and the cost. Specific aspects to monitor are:</p>
<ol>
<li>
<p>Temperature: Indicates the level of creativity or randomness desired from the model’s outputs. Varying this parameter can significantly impact the nature of the generated content.</p>
</li>
<li>
<p>top_p: Decides how selective the model is by choosing from a certain percentage of most likely words. A high “top_p” value means the model considers a wider range of words, making the text more varied.</p>
</li>
<li>
<p>Model Name or Version: Essential for tracking over time, as updates to the LLM might affect performance or response characteristics.</p>
</li>
<li>
<p>Prompt Details: The exact inputs sent to the LLM, which, unlike in-house ML models where inputs might be more controlled and homogeneous, can vary wildly and affect output complexity and cost implications.</p>
</li>
</ol>
<p><strong>Response Metadata</strong>: Given the API-based interaction with LLMs, tracking the specifics of the response is key for cost management and quality assessment:</p>
<ol>
<li>
<p>Tokens: Directly impacts cost and is a measure of response length and complexity.</p>
</li>
<li>
<p>Cost: Critical for budgeting, as API-based costs can scale with the number of requests and the complexity of each request.</p>
</li>
<li>
<p>Completion Details: Similar to the prompt details but from the response perspective, providing insights into the model’s output characteristics and potential areas of inefficiency or unexpected cost.</p>
</li>
</ol>
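<p>In practice, an instrumentation library records this request and response metadata as span attributes. The sketch below is illustrative only: it assembles such an attribute set as a plain Python dict, using attribute names from the OpenTelemetry GenAI semantic conventions (the exact names a given library emits may differ, and the values here are made up):</p>
<pre><code class="language-python"># Illustrative only: collect GenAI request/response metadata under
# OTel GenAI semantic-convention attribute names (values are made up).
def genai_span_attributes(request, response):
    return {
        'gen_ai.request.model': request['model'],
        'gen_ai.request.temperature': request['temperature'],
        'gen_ai.request.top_p': request['top_p'],
        'gen_ai.usage.input_tokens': response['input_tokens'],
        'gen_ai.usage.output_tokens': response['output_tokens'],
    }

attrs = genai_span_attributes(
    {'model': 'gpt-4', 'temperature': 0.2, 'top_p': 0.9},
    {'input_tokens': 42, 'output_tokens': 180},
)
</code></pre>
<p>Recording these as span attributes is what makes per-request filtering (by model, by temperature, by token count) possible later in Kibana.</p>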
<h3>Metrics</h3>
<p><strong>Request Volume</strong>: The total number of requests made to the LLM service. This helps in understanding the demand patterns and identifying any anomaly in usage, such as sudden spikes or drops.</p>
<p><strong>Request Duration</strong>: The time it takes for a request to be processed and a response to be received from the LLM. This includes network latency and the time the LLM takes to generate a response, providing insights into the performance and reliability of the LLM service.</p>
<p><strong>Costs and Tokens Counters</strong>: Keeping track of the total cost accrued and tokens consumed over time is essential for budgeting and cost optimization strategies. Monitoring these metrics can alert you to unexpected increases that may indicate inefficient use of the LLM or the need for optimization.</p>
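<p>The cost and token counters described above can be sketched in a few lines. This is a hedged illustration only: the per-1K-token prices below are placeholders, not real pricing, and instrumentation such as OpenLIT maintains counters like these for you:</p>
<pre><code class="language-python"># Running usage counters for LLM calls. Prices are placeholders.
PRICE_PER_1K = {'gpt-4': {'input': 0.03, 'output': 0.06}}

totals = {'input_tokens': 0, 'output_tokens': 0, 'cost_usd': 0.0}

def record_usage(model, input_tokens, output_tokens):
    price = PRICE_PER_1K[model]
    cost = (input_tokens * price['input'] + output_tokens * price['output']) / 1000.0
    totals['input_tokens'] += input_tokens
    totals['output_tokens'] += output_tokens
    totals['cost_usd'] += cost
    return cost

# 1000 input tokens and 500 output tokens at the placeholder prices:
record_usage('gpt-4', 1000, 500)
</code></pre>
<p>Monitoring totals like these over time (for example, as OTel counter metrics) is what lets you alert on unexpected cost increases.</p>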
<h2>Implementing Automatic Instrumentation with OpenLIT</h2>
<p><a href="https://openlit.io/">OpenLIT</a> automates telemetry data capture, simplifying the process for developers. Here’s a step-by-step guide to setting it up:</p>
<p><strong>1. Install the OpenLIT SDK</strong>:</p>
<p>First, you must install the following package: </p>
<pre><code class="language-bash">pip install openlit
</code></pre>
<p><strong>Note:</strong> OpenLIT currently supports Python, a popular language for Generative AI. The team is also working on expanding support to JavaScript soon.</p>
<p><strong>2. Get your Elastic APM Credentials</strong></p>
<ol>
<li>
<p>Sign in to your <a href="https://cloud.elastic.co">Elastic cloud account</a>.</p>
</li>
<li>
<p>Open the side navigation and click on APM under Observability.</p>
</li>
<li>
<p>Make sure the APM Server is running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/LangChainAppOTelAPMsetup.png" alt="LangChainChat App in Elastic APM" /></p>
<ol start="4">
<li>
<p>In the APM Agents section, select OpenTelemetry and jump directly to Step 5 (Configure OpenTelemetry in your application):</p>
</li>
<li>
<p>Copy and save the configuration values for <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code>.</p>
</li>
</ol>
<p><strong>3. Set Environment Variables</strong>:</p>
<p>The OpenTelemetry environment variables for Elastic can be set as follows on Linux (or in the code); see the <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html">Elastic OTel documentation</a>:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;YOUR_ELASTIC_APM_OTLP_URL&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;YOUR_ELASTIC_APM_AUTH&quot;
</code></pre>
<p><strong>Note:</strong> Make sure to replace the space after Bearer with %20:</p>
<p><code>OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20[APIKEY]&quot;</code></p>
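<p>Rather than editing the header by hand, you can compute the URL-encoded value. A quick sketch using Python's standard library (the API key below is a placeholder):</p>
<pre><code class="language-python">from urllib.parse import quote

api_key = 'MY_APIKEY'  # placeholder; use your real Elastic API key
header_value = quote('Bearer ' + api_key)  # the space becomes %20
print('OTEL_EXPORTER_OTLP_HEADERS=Authorization=' + header_value)
</code></pre>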
<p><strong>4. Initialize the SDK</strong>:</p>
<p>You will need to add the following to the LLM application code.</p>
<pre><code class="language-python">import openlit
openlit.init()
</code></pre>
<p>Optionally, you can customize the application name and environment by setting the <code>application_name</code> and <code>environment</code> attributes when initializing OpenLIT in your application. These variables configure the OTel attributes <code>service.name</code> and <code>deployment.environment</code>, respectively. For more details on other configuration settings, check out the <a href="https://github.com/openlit/openlit/tree/main/sdk/python#configuration">OpenLIT GitHub Repository</a>.</p>
<pre><code class="language-python">openlit.init(application_name=&quot;YourAppName&quot;, environment=&quot;Production&quot;)
</code></pre>
<p>The most popular libraries in GenAI are OpenAI (for accessing LLMs) and LangChain (for orchestrating steps). An example instrumentation of a LangChain- and OpenAI-based LLM application looks like this:</p>
<pre><code class="language-python">import getpass
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import openlit 

# Auto-instruments LLM and VectorDB calls, sending OTel traces and metrics to the configured endpoint
openlit.init()

os.environ[&quot;OPENAI_API_KEY&quot;] = getpass.getpass()
model = ChatOpenAI(model=&quot;gpt-4&quot;)
messages = [
    SystemMessage(content=&quot;Translate the following from English into Italian&quot;),
    HumanMessage(content=&quot;hi!&quot;),
]
model.invoke(messages)
</code></pre>
<h2>Visualizing Data with Kibana</h2>
<p>Once your LLM application is instrumented, visualizing the collected data is the next step. Follow the steps below to import a pre-built Kibana dashboard and get started:</p>
<ol>
<li>
<p>Copy the dashboard NDJSON provided <a href="https://docs.openlit.io/latest/connections/elastic#dashboard">here</a> and save it in a file with an extension <code>.ndjson</code>.</p>
</li>
<li>
<p>Log into your Elastic Instance.</p>
</li>
<li>
<p>Go to Stack Management &gt; Saved Objects.</p>
</li>
<li>
<p>Click Import and upload your file containing the dashboard NDJSON.</p>
</li>
<li>
<p>Click Import and you should have the dashboard available.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-1.jpg" alt="Elastic-dashboard-1" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-2.jpg" alt="Elastic-dashboard-2" /></p>
<p>The dashboard provides an in-depth overview of system metrics across key areas: Total Successful Requests, Request Duration Distribution, Request Rates, Usage Cost and Tokens, Top GenAI Models, GenAI Requests by Platform and Environment, and Token Consumption vs. Cost. These metrics collectively help identify peak usage times, latency issues, rate limits, and resource-allocation needs, facilitating performance tuning and cost management. This breakdown aids in understanding LLM performance, ensuring consistent operation across environments, planning budgets, and troubleshooting issues, ultimately optimizing overall system efficiency.</p>
<p>Also, you can see OpenTelemetry Traces from OpenLIT in Elastic APM, letting you look into each LLM request in detail. This setup ensures better system efficiency by helping with model performance checks, smooth running across environments, budget planning, and troubleshooting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-3.jpg" alt="Elastic-dashboard-3" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-4.jpg" alt="Elastic-dashboard-4" /></p>
<h2>Conclusion</h2>
<p>Observability is crucial for the efficient operation of LLM applications. OpenTelemetry's open standards and extensive support, combined with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic's APM</a>, <a href="https://www.elastic.co/observability/aiops">AIOps</a>, and <a href="https://www.elastic.co/observability/log-monitoring">analytics</a>, plus <a href="https://docs.openlit.io/latest/introduction">OpenLIT's</a> powerful and easy auto-instrumentation for 20+ GenAI tools from LLMs to vector databases, enable complete visibility into LLM performance.</p>
<p>Hopefully, this provides an easy-to-follow walk-through of instrumenting LangChain with OpenTelemetry and OpenLIT, and shows how easy it is to send traces to Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-openlit-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Observing Langchain applications with Elastic, OpenTelemetry, and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-tracing-langtrace</guid>
            <pubDate>Mon, 02 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Langchain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with Langtrace and ingest it into Elastic Observability APM]]></description>
            <content:encoded><![CDATA[<p>As AI-driven applications become increasingly complex, the need for robust tools to monitor and optimize their performance is more critical than ever. LangChain has rapidly emerged as a crucial framework in the AI development landscape, particularly for building applications powered by large language models (LLMs). As its adoption has soared among developers, the need for effective debugging and performance optimization tools has become increasingly apparent. One such essential tool is the ability to obtain and analyze traces from Langchain applications. Tracing provides invaluable insights into the execution flow, helping developers understand and improve their AI-driven systems. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic Observability's APM</a> provides an ability to trace your Langchain apps with OpenTelemetry, but you need third-party libraries.</p>
<p>There are several options for tracing LangChain. <a href="https://docs.langtrace.ai/introduction">Langtrace</a> is one such option. Langtrace is <a href="https://github.com/Scale3-Labs/langtrace">open-source</a> observability software that lets you capture, debug, and analyze traces and metrics from all your applications. Langtrace automatically captures traces from LLM APIs/inferences, vector databases, and LLM-based frameworks. It stands out due to its seamless integration with popular LLM frameworks and its ability to provide deep insights into complex AI workflows without requiring extensive manual instrumentation.</p>
<p>Langtrace has an SDK, a lightweight library that can be installed and imported into your project to collect traces. The traces are OpenTelemetry-based and can be exported to Elastic without using a Langtrace API key.</p>
<p>OpenTelemetry (OTel) is now broadly accepted as the industry standard for tracing. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. </p>
<p>Moreover, many LangChain-based applications have multiple components beyond just the LLM interactions, which makes using OpenTelemetry with LangChain essential for end-to-end visibility.</p>
<p>This blog covers how you can use the Langtrace SDK to trace a simple LangChain chat app that connects to Azure OpenAI, performs a search with DuckDuckGo, and exports the resulting traces to Elastic.</p>
<h1>Pre-requisites:</h1>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>, and become familiar with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic’s OpenTelemetry configuration</a></p>
</li>
<li>
<p>Have a LangChain app to instrument</p>
</li>
<li>
<p>Be familiar with using <a href="https://opentelemetry.io/docs/languages/python/libraries/">OpenTelemetry’s Python SDK</a> </p>
</li>
<li>
<p>An account with your favorite LLM provider (Azure OpenAI), with API keys</p>
</li>
<li>
<p>The application we used in this blog, called <code>langchainChat</code>, can be found in the <a href="https://github.com/elastic/observability-examples/tree/main/langchainChat">GitHub observability-examples repository</a>. It is built using Azure OpenAI and DuckDuckGo, but you can easily modify it for your LLM and search tool of choice.</p>
</li>
</ul>
<h1>App Overview and output in Elastic:</h1>
<p>To showcase the combined power of Langtrace and Elastic, we created a simple LangChain app that performs the following steps:</p>
<ol>
<li>
<p>Takes customer input on the command line. (Queries)</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain.</p>
</li>
<li>
<p>Utilizes chain tools to perform a search using DuckDuckGo.</p>
</li>
<li>
<p>The LLM processes the search results and returns the relevant information to the user.</p>
</li>
</ol>
<p>Here is a sample interaction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-cli.png" alt="Chat Interaction" /></p>
<p>Here is what the service view looks like after we ran a few queries.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-overview.png" alt="Service Overview" /></p>
<p>As you can see, Elastic Observability’s APM recognizes the LangChain app and also shows the average latency, throughput, and transactions. Our average latency is 30s since it takes that long for a human to type the query (twice).</p>
<p>You can also select other tabs to see dependencies, errors, metrics, and more. One interesting part of Elastic APM is the ability to have Universal Profiling (eBPF) output analyzed for this service as well. Here is our service’s dependency (Azure OpenAI), with its average latency, throughput, and failed transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-dependency.png" alt="Dependencies" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-dependency-metrics.png" alt="Dependency-metric" /></p>
<p>We see Azure OpenAI takes 4s on average to return results.</p>
<p>If we drill into transactions and look at the trace for our queries on Taylor Swift and Pittsburgh Steelers, we can see both queries and their corresponding spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-trace.png" alt="Trace for two queries" /></p>
<p>In this trace:</p>
<ol>
<li>
<p>The user makes a query</p>
</li>
<li>
<p>Azure OpenAI is called, but it uses a tool (DuckDuckGo) to obtain some results</p>
</li>
<li>
<p>Azure OpenAI reviews and returns a summary to the end user</p>
</li>
<li>
<p>Repeats for another query</p>
</li>
</ol>
<p>We also notice that the other long span (besides Azure OpenAI) is DuckDuckGo (~1000ms). We can look at the span individually and review the data:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-tools-span.png" alt="Span details" /></p>
<h1>Configuration:</h1>
<p>How do we make all this show up in Elastic? Let's go over the steps:</p>
<h2>OpenTelemetry Configuration</h2>
<p>To leverage the full capabilities of OpenTelemetry with Langtrace and Elastic, we need to configure the SDK to generate traces and properly set up Elastic’s endpoint and authorization. Detailed instructions can be found in the <a href="https://opentelemetry.io/docs/zero-code/python/#setup">OpenTelemetry Auto-Instrumentation setup documentation</a>.</p>
<h3>OpenTelemetry Environment variables:</h3>
<p>For Elastic, you can set the following OpenTelemetry environment variables either in your Linux/Mac environment or directly in the code:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=12345.apm.us-west-2.aws.cloud.es.io:443
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20ZZZZZZZ&quot;
OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=langchainChat,service.version=1.0,deployment.environment=production&quot;
</code></pre>
<p>In this setup:</p>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong> is configured to send traces to Elastic.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong> provides the necessary authorization for the Elastic APM server.</p>
</li>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong> define key attributes like the service name, version, and deployment environment.</p>
</li>
</ul>
<p>These values can be easily obtained from Elastic’s APM configuration screen under the OpenTelemetry section.</p>
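<p>If you prefer to set these values in code rather than in the shell, a minimal sketch using only Python's standard library (the endpoint, token, and attribute values below are the placeholders from this post, not real credentials):</p>

```python
import os

# Placeholder values -- substitute the OTLP endpoint and secret token
# shown in your own Elastic APM OpenTelemetry setup screen.
os.environ.setdefault(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "12345.apm.us-west-2.aws.cloud.es.io:443",
)
os.environ.setdefault(
    "OTEL_EXPORTER_OTLP_HEADERS",
    "Authorization=Bearer%20ZZZZZZZ",
)
os.environ.setdefault(
    "OTEL_RESOURCE_ATTRIBUTES",
    "service.name=langchainChat,service.version=1.0,deployment.environment=production",
)
```

<p>These assignments must run before the OpenTelemetry SDK initializes, since the exporter reads the variables at startup.</p>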
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-OTelAPMsetup.png" alt="Span details" /></p>
<p><strong>Note: No agent is required; the OTLP trace messages are sent directly to Elastic’s APM server, simplifying the setup process.</strong></p>
<h2>Langtrace Library:</h2>
<p>OpenTelemetry's auto-instrumentation can be extended to trace additional frameworks using instrumentation packages. For this blog post, you will need to install the Langtrace Python SDK:</p>
<pre><code class="language-bash">pip install langtrace-python-sdk
</code></pre>
<p>After installation, you can add the following code to your project:</p>
<pre><code class="language-python">from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from langtrace_python_sdk import langtrace, with_langtrace_root_span

# Initialize Langtrace with an OTLP exporter so spans are sent to the
# endpoint configured above (see the Langtrace docs for init options)
langtrace.init(custom_remote_exporter=OTLPSpanExporter())
</code></pre>
<h2>Instrumentation:</h2>
<p>Once the necessary libraries are installed and the environment variables are configured, you can use auto-instrumentation to trace your application. For example, run the following command to instrument your LangChain application with Elastic:</p>
<pre><code class="language-bash">opentelemetry-instrument python langtrace-elastic-demo.py
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-trace.png" alt="Trace for two queries" /></p>
<p>The Langtrace OpenTelemetry library correctly captures the flow with minimal manual instrumentation, apart from integrating the OpenTelemetry library. Additionally, the LLM spans captured by Langtrace include useful metadata such as token counts, model hyper-parameter settings, etc. Note that the generated spans follow the OTel GenAI semantic conventions described <a href="https://opentelemetry.io/docs/specs/semconv/attributes-registry/gen-ai/">here</a>.</p>
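<p>To illustrate those GenAI semantics, here is a hand-built sample of the kind of attributes such a span can carry; the attribute names come from the OTel GenAI registry linked above, while the values are made up for illustration:</p>

```python
# Illustrative span attributes following the OTel GenAI semantic
# conventions; values are invented, not captured from a real trace.
llm_span_attributes = {
    "gen_ai.request.model": "gpt-4",      # model the app asked for
    "gen_ai.request.temperature": 0.7,    # hyper-parameter recorded on the span
    "gen_ai.usage.input_tokens": 512,     # tokens in the prompt
    "gen_ai.usage.output_tokens": 128,    # tokens in the completion
}

# Token-usage attributes can be filtered out, e.g. to estimate cost:
usage = {k: v for k, v in llm_span_attributes.items() if k.startswith("gen_ai.usage.")}
```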
<p>In summary, the instrumentation process involves:</p>
<ol>
<li>
<p>Capturing user input (queries) from the command line.</p>
</li>
<li>
<p>Sending these queries to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Utilizing chain tools, such as DuckDuckGo, to perform searches.</p>
</li>
<li>
<p>Processing the results with the LLM and returning the relevant information to the user.</p>
</li>
</ol>
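<p>Stripped of the framework details, the loop above can be sketched in plain Python; the two stub functions below are hypothetical stand-ins for LangChain's Azure OpenAI model and the DuckDuckGo search tool:</p>

```python
# Hypothetical stubs -- in the real app these are LangChain's Azure
# OpenAI chat model and the DuckDuckGo search tool.
def search_duckduckgo(query: str) -> str:
    return f"search results for: {query}"

def ask_llm(query: str, tool_output: str) -> str:
    return f"summary of [{tool_output}] answering: {query}"

def handle_query(query: str) -> str:
    # Steps 1-3: the query arrives, and the LLM decides it needs fresh
    # data, so the search tool runs first.
    results = search_duckduckgo(query)
    # Step 4: the LLM summarizes the tool output for the user.
    return ask_llm(query, results)

print(handle_query("latest Pittsburgh Steelers news"))
```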
<h1>Conclusion</h1>
<p>By combining the power of <a href="https://langtrace.ai/">Langtrace</a> with Elastic, developers can achieve unparalleled visibility into their LangChain applications, ensuring optimized performance and quicker debugging. This powerful combination simplifies the complex task of monitoring AI-driven systems, enabling you to focus on what truly matters—delivering value to your users. Throughout this blog, we've covered the following essential steps and concepts:</p>
<ul>
<li>
<p>How to instrument LangChain with OpenTelemetry and Langtrace</p>
</li>
<li>
<p>How to properly initialize OpenTelemetry and add a custom span</p>
</li>
<li>
<p>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</p>
</li>
<li>
<p>How to view and analyze traces in Elastic Observability APM</p>
</li>
</ul>
<p>These steps provide a clear and actionable guide for developers looking to integrate robust tracing capabilities into their LangChain applications.</p>
<p>We hope this guide makes understanding and implementing OpenTelemetry tracing for LangChain simple, ensuring seamless integration with Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
<li>
<p><a href="https://docs.langtrace.ai/supported-integrations/observability-tools/elastic">Elastic APM - Langtrace AI Docs</a> </p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/elastic-langtrace.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing LangChain apps with Elastic, OpenLLMetry, and OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-tracing</guid>
            <pubDate>Fri, 02 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[LangChain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with OpenLLMetry and ingest it into Elastic Observability APM]]></description>
            <content:encoded><![CDATA[<p>LangChain has rapidly emerged as a crucial framework in the AI development landscape, particularly for building applications powered by large language models (LLMs). As its adoption has soared among developers, the need for effective debugging and performance optimization tools has become increasingly apparent. One such essential tool is the ability to obtain and analyze traces from LangChain applications. Tracing provides invaluable insights into the execution flow, helping developers understand and improve their AI-driven systems. </p>
<p>There are several options for tracing LangChain. One is LangSmith, which is ideal for detailed tracing and a complete breakdown of requests to large language models (LLMs); however, it is specific to LangChain. OpenTelemetry (OTel) is now broadly accepted as the industry standard for tracing. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers. </p>
<p>Moreover, many LangChain-based applications will have multiple components beyond just LLM interactions, so using OpenTelemetry with LangChain is essential. OpenLLMetry is an available option for tracing LangChain apps in addition to LangSmith.</p>
<p>This blog will show how you can get LangChain tracing into Elastic using the OpenLLMetry library <code>opentelemetry-instrumentation-langchain</code>.</p>
<h1>Pre-requisites:<a id="pre-requisites"></a></h1>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>, and become familiar with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic’s OpenTelemetry configuration</a></p>
</li>
<li>
<p>Have a LangChain app to instrument</p>
</li>
<li>
<p>Be familiar with using <a href="https://opentelemetry.io/docs/languages/python/libraries/">OpenTelemetry’s Python SDK</a> </p>
</li>
<li>
<p>An account on your favorite LLM, with API keys</p>
</li>
</ul>
<h1>Overview</h1>
<p>To highlight tracing, I created a simple LangChain app that does the following:</p>
<ol>
<li>
<p>Takes user input (queries) on the command line.</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Chain tools are set up to use search with Tavily.</p>
</li>
<li>
<p>The LLM processes the output and returns the relevant information to the user.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppCLI.png" alt="Chat Interaction" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppInAPM.png" alt="LangChainChat App in Elastic APM" /></p>
<p>As you can see, Elastic Observability’s APM recognizes the LangChain app and also shows the full trace (done with manual instrumentation):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAutoIntrument.png" alt="LangChainChat App in Elastic APM" /></p>
<p>As the above image shows:</p>
<ol>
<li>The user makes a query</li>
<li>Azure OpenAI is called, but it uses a tool (Tavily) to obtain some results</li>
<li>Azure OpenAI reviews and returns a summary to the end user</li>
</ol>
<p>The code was manually instrumented, but auto-instrumentation can also be used.</p>
<h1>OpenTelemetry Configuration<a id="opentelemetry-configuration"></a></h1>
<p>In using OpenTelemetry, we need to configure the SDK to generate traces and configure Elastic’s endpoint and authorization. Instructions can be found in <a href="https://opentelemetry.io/docs/zero-code/python/#setup">OpenTelemetry Auto-Instrumentation setup documentation</a>.</p>
<h2>OpenTelemetry Environment variables:<a id="opentelemetry-environment-variables"></a></h2>
<p>The OpenTelemetry environment variables for Elastic can be set as follows on Linux (or in the code):</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=12345.apm.us-west-2.aws.cloud.es.io:443
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20ZZZZZZZ&quot;
OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=langchainChat,service.version=1.0,deployment.environment=production&quot;
</code></pre>
<p>As you can see, <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is set to Elastic, and the corresponding authorization header is also provided. These can be easily obtained from Elastic’s APM configuration screen under the OpenTelemetry section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppOTelAPMsetup.png" alt="LangChainChat App in Elastic APM" /></p>
<p><strong>Note: No agent is needed; we simply send the OTLP trace messages directly to Elastic’s APM server.</strong> </p>
<h2>OpenLLMetry Library:<a id="openllmetry-library"></a></h2>
<p>OpenTelemetry's auto-instrumentation can be extended to trace other frameworks via instrumentation packages.</p>
<p>First, you must install the following package: </p>
<p><code>pip install opentelemetry-instrumentation-langchain</code></p>
<p>This library was developed by OpenLLMetry. </p>
<p>Then you will need to add the following to the code.</p>
<pre><code class="language-python">from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()
</code></pre>
<h2>Instrumentation<a id="instrumentation"></a></h2>
<p>Once the libraries are added and the environment variables are set, you can auto-instrument the application with the following command:</p>
<pre><code class="language-bash">opentelemetry-instrument python tavilyAzureApp.py
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAutoIntrument.png" alt="LangChainChat App in Elastic APM" /></p>
<p>The OpenLLMetry library captures the flow correctly with no manual instrumentation beyond adding the library itself:</p>
<ol>
<li>
<p>Takes user input (queries) on the command line.</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Chain tools are set up to use search with Tavily.</p>
</li>
<li>
<p>The LLM processes the output and returns the relevant information to the user.</p>
</li>
</ol>
<h3>Manual instrumentation<a id="manual-instrumentation"></a></h3>
<p>If you want more details out of the application, you will need to instrument it manually. To get more traces, follow my <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">Python instrumentation guide</a>, which walks you through setting up the necessary OpenTelemetry pieces. Additionally, you can look at the <a href="https://opentelemetry.io/docs/languages/python/instrumentation/">OTel documentation for instrumenting in Python</a>.</p>
<p>Note that the env variables <code>OTEL_EXPORTER_OTLP_HEADERS</code> and <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> are set as noted in the section above. You can also set up the <code>OTEL_RESOURCE_ATTRIBUTES</code>.</p>
<p>Once you follow the steps in either guide and initialize the tracer, you essentially just add a span wherever you want more detail. In the example below, only one line of code is added for span creation.</p>
<p>Note the placement of <code>with tracer.start_as_current_span(&quot;getting user query&quot;) as span:</code> below.</p>
<pre><code class="language-python">from opentelemetry import trace
import asyncio

# `chain` is the LangChain chain constructed earlier in the app

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(&quot;newsQuery&quot;)

async def chat_interface():
    print(&quot;Welcome to the AI Chat Interface!&quot;)
    print(&quot;Type 'quit' to exit the chat.&quot;)
    
    with tracer.start_as_current_span(&quot;getting user query&quot;) as span:
        while True:
            user_input = input(&quot;\nYou: &quot;).strip()
            
            if user_input.lower() == 'quit':
                print(&quot;Thank you for chatting. Goodbye!&quot;)
                break
        
            print(&quot;AI: Thinking...&quot;)
            try:
                result = await chain.ainvoke({&quot;query&quot;: user_input})
                print(f&quot;AI: {result.content}&quot;)
            except Exception as e:
                print(f&quot;An error occurred: {str(e)}&quot;)


if __name__ == &quot;__main__&quot;:
    asyncio.run(chat_interface())
</code></pre>
<p>As you can see, with manual instrumentation, we get the following trace:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppManualTrace.png" alt="LangChainChat App in Elastic APM" /></p>
<p>This calls out when we enter our query function, <code>async def chat_interface()</code>.</p>
<h1>Conclusion<a id="conclusion"></a></h1>
<p>In this blog, we discussed the following:</p>
<ul>
<li>
<p>How to manually instrument LangChain with OpenTelemetry</p>
</li>
<li>
<p>How to properly initialize OpenTelemetry and add a custom span</p>
</li>
<li>
<p>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</p>
</li>
<li>
<p>How to view traces in Elastic Observability APM</p>
</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting LangChain with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
<p>Also log into <a href="https://cloud.elastic.co">cloud.elastic.co</a> to try out Elastic with a free trial.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainBlogMainImage.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Unlock possibilities with native OpenTelemetry: prioritize reliability, not proprietary limitations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-native-kubernetes-observability</link>
            <guid isPermaLink="false">elastic-opentelemetry-native-kubernetes-observability</guid>
            <pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic now supports Elastic Distributions of OpenTelemetry (EDOT) deployment and management on Kubernetes, using OTel Operator. SREs can now access out-of the-box configurations and dashboards designed to streamline collector deployment, application auto-instrumentation and lifecycle management with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is emerging as the standard for data ingestion since it delivers a vendor-agnostic way to ingest data across all telemetry signals. Elastic Observability is leading the OTel evolution with the following announcements:</p>
<ul>
<li>
<p><strong>Native OTel Integrity:</strong> Elastic is now 100% OTel-native, retaining OTel data natively without requiring data translation. This eliminates the need for SREs to handle tedious schema conversions and develop customized views. All Elastic Observability capabilities—such as entity discovery, entity-centric insights, APM, infrastructure monitoring, and AI-driven issue analysis—now seamlessly work with native OTel data.</p>
</li>
<li>
<p><strong>Powerful end-to-end OTel-based Kubernetes observability with</strong> <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry"><strong>Elastic Distributions of OpenTelemetry (EDOT)</strong></a><strong>:</strong> Elastic now supports EDOT deployment and management on Kubernetes via the OTel Operator, enabling streamlined EDOT Collector deployment, application auto-instrumentation, and lifecycle management. With out-of-the-box OTel-based Kubernetes integration and dashboards, SREs gain instant, real-time visibility into cluster and application metrics, logs, and traces—with no manual configuration needed.</p>
</li>
</ul>
<p>For organizations, it signals our commitment to open standards, streamlined data collection, and delivering insights from native OpenTelemetry data. Bring the power of Elastic Observability to your Kubernetes and OpenTelemetry deployments for maximum visibility and performance. </p>
<h1>Fully native OTel architecture with in-depth data analysis</h1>
<p>Elastic’s OpenTelemetry-first architecture is 100% OTel-native, fully retaining the OTel data model, including OTel Semantic Conventions and Resource attributes, so your observability data remains in OpenTelemetry standards. OTel data in Elastic is also backward compatible with the Elastic Common Schema (ECS).</p>
<p>SREs now gain a holistic view of resources, as Elastic accurately identifies entities through OTel resource attributes. For example, in a Kubernetes environment, Elastic identifies containers, hosts, and services and connects these entities to logs, metrics, and traces.</p>
<p>Once OTel data is in Elastic’s scalable vector datastore, Elastic’s capabilities such as the AI Assistant, zero-config machine learning-based anomaly detection, pattern analysis, and latency correlation empower SREs to quickly analyze and pinpoint potential issues in production environments.</p>
<h1>Kubernetes insights with Elastic Distributions of OpenTelemetry (EDOT)</h1>
<p>EDOT reduces manual effort through automated onboarding and pre-configured dashboards. With EDOT and OpenTelemetry, Elastic makes Kubernetes monitoring straightforward and accessible for organizations of any size.</p>
<p>EDOT, paired with Elasticsearch, enables storage for all signal types—logs, metrics, traces, and soon profiling—while maintaining essential resource attributes and semantic conventions.</p>
<p>Elastic’s OpenTelemetry-native solution enables customers to quickly extract insights from their data rather than manage complex infrastructure to ingest data. Elastic automates the deployment and configuration of observability components to deliver a user experience focused on ease and scalability, making it well-suited for large-scale environments and diverse industry needs.</p>
<p>Let’s take a look at how Elastic’s EDOT enables visibility into Kubernetes environments.</p>
<h2>1. Simple 3-step OTel ingest with lifecycle management and auto-instrumentation </h2>
<p>Elastic leverages the upstream OpenTelemetry Operator to automate its EDOT lifecycle management—including deployment, scaling, and updates—allowing customers to focus on visibility into their Kubernetes infrastructure and applications instead of their observability infrastructure for data collection.</p>
<p>The Operator integrates with the EDOT Collector and language SDKs to provide a consistent, vendor-agnostic experience. For instance, when customers deploy a new application, they don’t need to manually configure instrumentation for various languages; the OpenTelemetry Operator manages this through auto-instrumentation, as supported by the upstream OpenTelemetry project.</p>
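<p>For illustration, the upstream Operator drives auto-instrumentation through an <code>Instrumentation</code> custom resource; a minimal sketch (the resource name and collector endpoint are placeholders, not an Elastic-specific manifest):</p>

```yaml
# Sketch of an upstream OpenTelemetry Operator Instrumentation resource.
# The endpoint is a placeholder for your in-cluster collector service.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4318
  propagators:
    - tracecontext
    - baggage
```

<p>A workload then opts in with a pod-template annotation such as <code>instrumentation.opentelemetry.io/inject-python: "true"</code>, and the Operator injects the language SDK automatically.</p>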
<p>This integration simplifies observability by ensuring consistent application instrumentation across the Kubernetes environment. Elastic’s collaboration with the upstream OpenTelemetry project strengthens this automation, enabling users to benefit from the latest updates and improvements in the OpenTelemetry ecosystem. By relying on open source tools like the OpenTelemetry Operator, Elastic ensures that its solutions stay aligned with the latest advancements in the OpenTelemetry project, reinforcing its commitment to open standards and community-driven development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/unified-otel-based-k8s-experience.png" alt="Unified OTel-based Kubernetes Experience" /></p>
<p>The diagram above shows how the operator can deploy multiple OTel collectors, helping SREs deploy individual EDOT Collectors for specific applications and infrastructure. This configuration improves availability for OTel ingest and the telemetry is sent directly to Elasticsearch servers via OTLP.</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Check out our recent blog on how to set this up</a>.</p>
<h2>2. Out-of-the-box OTel-based Kubernetes integration with dashboards</h2>
<p>Elastic delivers an OTel-based Kubernetes configuration for the OTel collector by packaging all necessary receivers, processors, and configurations for Kubernetes observability. This enables users to automatically collect, process, and analyze Kubernetes metrics, logs, and traces without the need to configure each component individually.</p>
<p>The OpenTelemetry Kubernetes Collector components provide essential building blocks, including receivers like the Kubernetes Receiver for cluster metrics, Kubeletstats Receiver for detailed node and container metrics, along with processors for data transformation and enrichment. By packaging these components, Elastic offers a turnkey solution that simplifies Kubernetes observability and eliminates the need for users to set up and configure individual collectors or processors.</p>
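<p>As a rough idea of what such a packaged configuration bundles together, here is an illustrative collector fragment; the receiver and processor names are upstream components, but the exact EDOT configuration ships with the integration and differs in detail (the Elasticsearch endpoint is a placeholder):</p>

```yaml
# Illustrative OTel Collector config for Kubernetes metrics (not the
# shipped EDOT configuration).
receivers:
  k8s_cluster: {}              # cluster-level metrics: pods, nodes, deployments
  kubeletstats:
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250   # node and container metrics
processors:
  k8sattributes: {}            # enrich telemetry with Kubernetes metadata
exporters:
  otlp:
    endpoint: https://my-deployment.apm.us-west-2.aws.cloud.es.io:443
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, kubeletstats]
      processors: [k8sattributes]
      exporters: [otlp]
```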
<p>This pre-packaged approach, which includes <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes_otel">OTel-native Kibana assets</a> such as dashboards, allows users to focus on analyzing their observability data rather than managing configuration details. Elastic’s Unified OpenTelemetry Experience ensures that users can harness OpenTelemetry’s full potential without needing deep expertise. Whether you’re monitoring resource usage, container health, or API server metrics, users gain comprehensive observability through EDOT.</p>
<p>For more details on OpenTelemetry Kubernetes Collector components, visit<a href="https://opentelemetry.io/docs/kubernetes/collector/components/"> OpenTelemetry Collector Components</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/otel-based-k8s-dashboard.png" alt="OTel-based Kubernetes Dashboard" /></p>
<h2>3. Streamlined ingest architecture with OTel data and Elasticsearch</h2>
<p>Elastic’s ingest architecture minimizes infrastructure overhead by enabling users to forward trace data directly into Elasticsearch with the EDOT Collector, removing the need for the Elastic APM server. This approach:</p>
<ul>
<li>
<p>Reduces the costs and complexity associated with maintaining additional infrastructure, allowing users to deploy, scale, and manage their observability solutions with fewer resources.</p>
</li>
<li>
<p>Allows all OTel data, metrics, logs, and traces to be ingested and stored in Elastic’s singular vector database store enabling further analysis with Elastic’s AI-driven capabilities.</p>
</li>
</ul>
<p>SREs can now reduce operational burdens while also gaining high performance analytics and observability insights provided by Elastic.</p>
<h1>Elastic’s ongoing commitment to open source and OpenTelemetry</h1>
<p>With <a href="https://www.elastic.co/blog/elasticsearch-is-open-source-again">Elasticsearch fully open source once again</a> under the AGPL license, Elastic reinforces its deep commitment to open standards and the open source community. This aligns with Elastic’s OpenTelemetry-first approach to observability, where Elastic Distributions of OpenTelemetry (EDOT) streamline OTel ingestion and schema auto-detection, providing real-time insights for Kubernetes and application telemetry.</p>
<p>As users increasingly adopt OTel as their schema and data collection architecture for observability, Elastic’s Distribution of OpenTelemetry (EDOT), currently in tech preview, enhances standard OpenTelemetry capabilities and improves troubleshooting while also serving as a commercially supported OTel distribution. EDOT, together with Elastic’s recent contributions of the Elastic Profiling Agent and Elastic Common Schema (ECS) to OpenTelemetry, reinforces Elastic’s commitment to establishing OpenTelemetry as the industry standard.</p>
<p>Customers can now embrace open standards and enjoy the advantages of an open, extensible platform that integrates seamlessly with their environment. The end result? Reduced costs, greater visibility, and vendor independence.</p>
<h1>Getting hands-on with Elastic Observability and EDOT</h1>
<p>Ready to try out the OTel Operator with EDOT collector and SDKs to see how Elastic utilizes ingested OTel data in APM, Discover, Analysis, and out-of-the-box dashboards? </p>
<ul>
<li>
<p><a href="https://cloud.elastic.co/">Get an account on Elastic Cloud</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Learn about Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">Utilize the OpenTelemetry Demo with EDOT</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Understand how you can monitor Kubernetes with EDOT</a></p>
</li>
<li>
<p><a href="https://github.com/elastic/opentelemetry">Utilize the EDOT Operator </a>and the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">EDOT OTel collector</a></p>
</li>
</ul>
<p>If you have your own application and want to configure EDOT auto-instrumentation for it, read the following blogs on Go, Java, PHP, and Python:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic OpenTelemetry Distribution for Python</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/Kubecon-main-blog.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Instrumenting your OpenAI-powered Python, Node.js, and Java Applications with EDOT]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai</link>
            <guid isPermaLink="false">elastic-opentelemetry-openai</guid>
            <pubDate>Thu, 23 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce OpenAI support in our Python, Node.js and Java EDOT SDKs. These add logs, metrics and tracing to applications that use OpenAI compatible services without any code change.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Last year, <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">we announced Elastic Distribution of OpenTelemetry</a> (a.k.a. EDOT) language SDKs, which collect logs, traces, and metrics from applications. At the time, we didn’t yet support Large Language Model (LLM) providers such as OpenAI, which limited the insight developers had into Generative AI (GenAI) applications.</p>
<p>In a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">prior post</a>, we reviewed LLM observability focus areas, such as token usage, chat latency, and knowing which tools (like DuckDuckGo) your application uses. With the right logs, traces, and metrics, developers can answer questions like &quot;Which version of a model generated this response?&quot; or &quot;What was the exact chat prompt created by my RAG application?&quot;</p>
<p>In the last six months, Elastic, alongside others in the OpenTelemetry community, invested a lot of energy in shared specifications for these areas, including code to collect LLM-related logs, metrics, and traces. Our goal was to extend the zero-code (agent) approach EDOT brings to GenAI use cases.</p>
<p>Today, we announce our first GenAI instrumentation capability in the EDOT language SDKs: OpenAI. Below, you’ll see how to observe GenAI applications using our Python, Node.js and Java EDOT SDKs.</p>
<h2>Example application</h2>
<p>Many of us may be familiar with <a href="https://chatgpt.com/">ChatGPT</a>, which is a frontend for OpenAI’s GPT model family. Using it, you can ask a question, and the assistant may answer correctly depending on what you ask and the text the LLM was trained on.</p>
<p>Here’s an example of an esoteric question answered by ChatGPT:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/chatgpt-screenshot.png" alt="ChatGPT answer" /></p>
<p>Our example application will simply ask this predefined question and print the result. We’ll write it in three languages: Python, JavaScript and Java.</p>
<p>We’ll execute each with a &quot;zero code&quot; (agent) approach, so that logs, metrics and traces are captured and visible in an Elastic Stack configured with Kibana and APM server. If you don’t have a stack running, use <a href="https://github.com/elastic/elasticsearch-labs/tree/main/docker">instructions from Elasticsearch Labs</a> to set one up.</p>
<p>Regardless of programming language, three variables are needed: the OpenAI API key, the location of your Elastic APM server, and the service name of the application. You’ll write these to a file named <code>.env</code>.</p>
<pre><code>OPENAI_API_KEY=sk-YOUR_API_KEY
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:8200
OTEL_SERVICE_NAME=openai-example
</code></pre>
<p>By default, the instrumentation does not capture the content sent to the OpenAI API in the GenAI log events. If you want to capture it, add the following:</p>
<pre><code>OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
</code></pre>
<p>Each time the application runs, it sends logs, traces, and metrics to the APM server. You can find them in Kibana by querying for the application &quot;openai-example&quot;:</p>
<p><a href="http://localhost:5601/app/apm/services/openai-example/transactions">http://localhost:5601/app/apm/services/openai-example/transactions</a></p>
<p>When you choose a trace, you’ll see the LLM request made by the OpenAI SDK, and HTTP traffic caused by it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-transaction-timeline.png" alt="Kibana transaction timeline" /></p>
<p>Select the logs tab to see the exact request and response to OpenAI. This data is critical for Q/A and evaluation use cases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-transaction-logs.png" alt="Kibana transaction logs" /></p>
<p>You can also go to the Metrics Explorer and make a graph of &quot;gen_ai.client.token.usage&quot; or &quot;gen_ai.client.operation.duration&quot; over all the times you ran the application:</p>
<p><a href="http://localhost:5601/app/metrics/explorer">http://localhost:5601/app/metrics/explorer</a></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-metrics-explorer.png" alt="Kibana Metrics Explorer" /></p>
<p>Read on to see exactly how this application is written and run in Python, Java, and Node.js. Those already using our EDOT language SDKs will be familiar with how this works.</p>
<h2>Python</h2>
<p>Assuming you have Python installed, the first step is to set up a virtual environment and install the required packages: the OpenAI client, a helper tool to read the <code>.env</code> file, and our <a href="https://github.com/elastic/elastic-otel-python">EDOT Python</a> package:</p>
<pre><code class="language-bash">python3 -m venv .venv
source .venv/bin/activate
pip install openai &quot;python-dotenv[cli]&quot; elastic-opentelemetry
</code></pre>
<p>Next, run <code>edot-bootstrap</code>, which analyzes the code and installs any relevant instrumentation available:</p>
<pre><code class="language-bash">edot-bootstrap --action=install
</code></pre>
<p>Now, create your <code>.env</code> file, as described earlier in this article, and save the source code below as <code>chat.py</code>:</p>
<pre><code class="language-python">import os

import openai

CHAT_MODEL = os.environ.get(&quot;CHAT_MODEL&quot;, &quot;gpt-4o-mini&quot;)


def main():
  client = openai.Client()

  messages = [
    {
      &quot;role&quot;: &quot;user&quot;,
      &quot;content&quot;: &quot;Answer in up to 3 words: Which ocean contains Bouvet Island?&quot;,
    }
  ]

  chat_completion = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
  print(chat_completion.choices[0].message.content)

if __name__ == &quot;__main__&quot;:
  main()
</code></pre>
<p>Now you can run everything with:</p>
<pre><code class="language-bash">dotenv run -- opentelemetry-instrument python chat.py
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example (along with the instructions) in the Python EDOT repository <a href="https://github.com/elastic/elastic-otel-python/tree/main/examples/openai">here</a>.</p>
<p>Finally, if you would like to try a more comprehensive example, take a look at <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">chatbot-rag-app</a> which uses OpenAI with Elasticsearch’s <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elser</a> retrieval model.</p>
<h2>Java</h2>
<p>There are multiple popular ways to initialize a Java project. Since we are using OpenAI, the first step is to add the <a href="https://central.sonatype.com/artifact/com.openai/openai-java"><code>com.openai:openai-java</code></a> dependency and write the source below as <code>Chat.java</code>.</p>
<pre><code class="language-java">package openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.*;


final class Chat {

  public static void main(String[] args) {
    String chatModel = System.getenv().getOrDefault(&quot;CHAT_MODEL&quot;, &quot;gpt-4o-mini&quot;);

    OpenAIClient client = OpenAIOkHttpClient.fromEnv();

    String message = &quot;Answer in up to 3 words: Which ocean contains Bouvet Island?&quot;;
    ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
        .addMessage(ChatCompletionUserMessageParam.builder()
          .content(message)
          .build())
        .model(chatModel)
        .build();

    ChatCompletion chatCompletion = client.chat().completions().create(params);
    System.out.println(chatCompletion.choices().get(0).message().content().get());
  }
}
</code></pre>
<p>Build the project so that all dependencies are in a single jar. For example, if using Gradle, you would use the <code>com.gradleup.shadow</code> plugin.</p>
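<p>As a rough sketch, a Gradle build using the shadow plugin might look like the following (the version strings are placeholders to fill in; this is illustrative, not the official example build):</p>
<pre><code class="language-groovy">plugins {
    id 'java'
    id 'com.gradleup.shadow' version 'PLUGIN_VERSION'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'com.openai:openai-java:OPENAI_VERSION'
}
</code></pre>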
<p>Next, create your <code>.env</code> file, as described earlier, and download <code>shdotenv</code>, which we’ll use to load it.</p>
<pre><code class="language-bash">curl -O -L https://github.com/ko1nksm/shdotenv/releases/download/v0.14.0/shdotenv
chmod +x ./shdotenv
</code></pre>
<p>At this point, you have a jar and configuration you can use to run the OpenAI example. The next step is to download the EDOT Java javaagent binary. This is the part that records and exports logs, metrics and traces.</p>
<pre><code class="language-bash">curl -o elastic-otel-javaagent.jar -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&amp;g=co.elastic.otel&amp;a=elastic-otel-javaagent&amp;v=LATEST'
</code></pre>
<p>Assuming you assembled a file named <code>openai-example-all.jar</code>, run it with EDOT like this:</p>
<pre><code class="language-bash">./shdotenv java -javaagent:elastic-otel-javaagent.jar -jar openai-example-all.jar
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example in the EDOT Java source repository <a href="https://github.com/elastic/elastic-otel-java/tree/main/examples/openai">here</a>.</p>
<h2>Node.js</h2>
<p>Assuming you already have npm installed and configured, run the following commands to initialize a project for the example. This includes the <a href="https://www.npmjs.com/package/openai">openai</a> package and <a href="https://www.npmjs.com/package/@elastic/opentelemetry-node"><code>@elastic/opentelemetry-node</code></a> (EDOT Node.js):</p>
<pre><code class="language-bash">npm init -y
npm install openai @elastic/opentelemetry-node
</code></pre>
<p>Next, create your <code>.env</code> file, as described earlier in this article, and save the source code below as <code>index.js</code>:</p>
<pre><code class="language-javascript">const {OpenAI} = require('openai');

let chatModel = process.env.CHAT_MODEL ?? 'gpt-4o-mini';

async function main() {
  const client = new OpenAI();
  const completion = await client.chat.completions.create({
    model: chatModel,
    messages: [
      {
        role: 'user',
        content: 'Answer in up to 3 words: Which ocean contains Bouvet Island?',
      },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main();
</code></pre>
<p>With this in place, run the above source with EDOT like this:</p>
<pre><code class="language-bash">node --env-file .env --require @elastic/opentelemetry-node index.js
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example in the EDOT Node.js source repository <a href="https://github.com/elastic/elastic-otel-node/tree/main/examples/openai">here</a>.</p>
<p>Finally, if you would like to try a more comprehensive example, take a look at <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/openai-embeddings">openai-embeddings</a> which uses OpenAI with Elasticsearch as a vector database!</p>
<h2>Closing Notes</h2>
<p>Above you’ve seen how to observe the official OpenAI SDK in three different languages, using Elastic Distribution of OpenTelemetry (EDOT).</p>
<p>It is important to note that some of the OpenAI SDKs, as well as the OpenTelemetry specifications around generative AI, are experimental. If you find this helps you, or you find glitches, please join our Slack and let us know.</p>
<p>Several LLM platforms accept requests from the OpenAI client SDK; just set <code>OPENAI_BASE_URL</code> and choose a relevant model. During development, we tested against OpenAI Platform and Azure OpenAI Service. We also ran integration tests against Ollama, contributing improvements to its OpenAI support, released in v0.5.12. Whatever your choice of OpenAI-compatible platform, we hope this new tooling helps you understand your LLM usage.</p>
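<p>As an illustrative sketch, pointing the earlier examples at a local Ollama server only requires overriding the base URL and model in your <code>.env</code> file (the model name below is just a placeholder for one you have pulled locally):</p>
<pre><code>OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=unused
CHAT_MODEL=qwen2.5:0.5b
</code></pre>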
<p>Finally, while the first Generative AI SDK instrumented with EDOT is OpenAI, you’ll see more soon. We are already working on Bedrock, and collaborating with others in the OpenTelemetry community for other platforms. Keep watching this blog for exciting updates.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/elastic-opentelemetry-openai.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Native OTel-based K8s & App Observability in 3 Steps with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator</link>
            <guid isPermaLink="false">elastic-opentelemetry-otel-operator</guid>
            <pubDate>Wed, 13 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's Distributions of OpenTelemetry are now supported with the OTel Operator, providing auto instrumentation of applications with EDOT SDKs, and deployment and lifecycle management of the EDOT OTel Collector for Kubernetes Observability. Learn how to configure this in 3 easy steps]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released its Elastic Distributions of OpenTelemetry (EDOT) which have been developed to enhance the capabilities of standard OpenTelemetry distributions and improve existing OpenTelemetry support from Elastic. EDOT helps Elastic deliver its new Unified OpenTelemetry Experience. SRE’s are no longer burdened with a set of tedious steps instrumenting and ingesting OTel data into Observability. SREs get a simple and frictionless way to instrument the OTel collector, and applications, and ingest all the OTel data into Elastic. The components of this experience include: (detailed in the overview blog)</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions for OpenTelemetry (EDOT)</a></p>
</li>
<li>
<p>Elastic’s configuration for the OpenTelemetry Operator providing:</p>
<ul>
<li>
<p>OTel Lifecycle management for the OTel collector and SDKs</p>
</li>
<li>
<p>Auto-instrumentation of apps, since most developers will not instrument them manually</p>
</li>
</ul>
</li>
<li>
<p>Pre-packaged receivers, processors, exporters, and configuration for the OTel Kubernetes Collector</p>
</li>
<li>
<p>Out-of-the-box OTel-based K8S dashboards for metrics and logs</p>
</li>
<li>
<p>Discovered inventory views for services, hosts, and containers</p>
</li>
<li>
<p>Direct OTel ingest into Elasticsearch for EDOT (bypassing ingest into APM server) - all your data (logs, metrics, and traces) is now stored in Elastic’s Search AI Lake</p>
</li>
<li>
<p>All ingested OTel data is used and displayed natively in Discover, APM, Inventory, etc.</p>
</li>
</ul>
<p>In this blog we will cover how to ingest OTel for K8S and your application in 3 easy steps:</p>
<ol>
<li>
<p>Copy the install commands from the UI</p>
</li>
<li>
<p>Add the OpenTelemetry Helm charts, install the OpenTelemetry Operator with Elastic’s Helm configuration, and set your Elastic endpoint and authentication</p>
</li>
<li>
<p>Annotate the app services you want to be auto-instrumented </p>
</li>
</ol>
<p>Then you can easily see Kubernetes metrics and logs, as well as application logs, metrics, and traces, in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/unified-otel-based-k8s-experience.png" alt="OpenTelemetry Unified Observability Experience" /></p>
<p>To follow this blog you will need to have:</p>
<ol>
<li>
<p>An account on cloud.elastic.co, with access to the Elasticsearch endpoint and authentication (API key)</p>
</li>
<li>
<p>A non-instrumented application with services based on Go, .NET, Python, or Java; auto-instrumentation happens through the OTel Operator. In this example, we will use the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
</li>
<li>
<p>A Kubernetes cluster, we used EKS in our setup</p>
</li>
<li>
<p>Helm and kubectl installed</p>
</li>
</ol>
<p>You can find the API key in the Integrations section of Elastic. More information is also available in the <a href="https://www.elastic.co/guide/en/kibana/current/api-keys.html">documentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-api-keys.png" alt="OpenTelemetry API Keys" /></p>
<h2>K8S and Application Observability in Elastic:</h2>
<p>Before we walk you through the steps, let's show you what is visible in Elastic.</p>
<p>Once the Operator starts the OTel Collector, you can see the following in Elastic:</p>
<h3>Kubernetes metrics:</h3>
<p>Using an out-of-the-box dashboard, you can see node metrics, overall cluster metrics, and status across pods, deployments, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Discovered Inventory for Hosts, services, and containers:</h3>
<p>This can be found at Observability-&gt;Inventory on the UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-inventory.png" alt="OTel-based Kubernetes inventory" /></p>
<h3>Detailed metrics, logs, and processor info on hosts:</h3>
<p>This can be found at Observability-&gt;Infrastructure-&gt;Hosts</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-hosts.png" alt="OTel-based Kubernetes host metrics" /></p>
<h3>K8S and application logs in Elastic’s New Discover (called Explorer)</h3>
<p>This can be found on Observability-&gt;Discover</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-ingest-logs.png" alt="OTel-based Kubernetes logs" /></p>
<h3>Application Service views (logs, metrics, and traces):</h3>
<p>This can be found on Observability-&gt;Application</p>
<p>Then select the service and drill down into different aspects.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<p>Above, we show how traces are displayed using native OTel data.</p>
<h2>Steps to install</h2>
<h3>Step 0. Follow the commands listed in the UI</h3>
<p>Under Add data-&gt;Kubernetes-&gt;Kubernetes Monitoring with EDOT</p>
<p>You will find the following instructions, which we will follow here.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-operator-install.png" alt="EDOT Operator Install" /></p>
<h3>Step 1. Install the EDOT config for the OpenTelemetry Operator</h3>
<p>Run the following commands. Make sure you have already authenticated to your Kubernetes cluster; that is where you will run the Helm commands provided below.</p>
<pre><code class="language-bash"># Install helm repo needed
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts --force-update
# Install needed secrets. Provide the Elasticsearch Endpoint URL and API key you have noted in previous steps
kubectl create ns opentelemetry-operator-system
kubectl create -n opentelemetry-operator-system secret generic elastic-secret-otel \
    --from-literal=elastic_endpoint='YOUR_ELASTICSEARCH_ENDPOINT' \
    --from-literal=elastic_api_key='YOUR_ELASTICSEARCH_API_KEY'
# Install the EDOT Operator
helm install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack --namespace opentelemetry-operator-system --create-namespace --values https://raw.githubusercontent.com/elastic/opentelemetry/refs/heads/main/resources/kubernetes/operator/helm/values.yaml --version 0.3.0
</code></pre>
<p>The values.yaml file configuration can be found <a href="https://github.com/elastic/opentelemetry/blob/main/resources/kubernetes/operator/helm/values.yaml">here</a>.</p>
<h3>Step 1b: Ensure OTel data is arriving in Elastic</h3>
<p>The simplest way to check is to go to Menu &gt; Dashboards &gt; <strong>[OTEL][Metrics Kubernetes] Cluster Overview</strong> and ensure that the following dashboard is being populated.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Step 2: Annotate the application with auto-instrumentation</h3>
<p>For this example, we’re only going to annotate one service, the favorite-java service in the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application</p>
<p>Use the following commands to initiate auto-instrumentation:</p>
<pre><code class="language-bash">#Annotate Java namespace
kubectl annotate namespace java instrumentation.opentelemetry.io/inject-java=&quot;opentelemetry-operator-system/elastic-instrumentation&quot;
#Restart the java-app to get the new annotation
kubectl rollout restart deployment java-app -n java
</code></pre>
<p>You can also add the annotation directly to your pod’s YAML:</p>
<pre><code class="language-yaml">metadata:
  name: my-app
  annotations:
    instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
</code></pre>
<p>These instructions are provided in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-sdk-annotate.png" alt="Annotate Application with EDOT SDK" /></p>
<h2>Check out the service data in Elastic APM</h2>
<p>Once the OTel data is in Elastic, you can see:</p>
<ul>
<li>
<p>Out-of-the-box dashboards for OTel-based Kubernetes metrics</p>
</li>
<li>
<p>Discovered resources such as services, hosts, and containers that are part of the Kubernetes clusters</p>
</li>
<li>
<p>Kubernetes metrics, host metrics, logs, processor info, anomaly detection, and universal profiling.</p>
</li>
<li>
<p>Log analytics in Elastic Discover</p>
</li>
<li>
<p>APM features that show app overview, transactions, dependencies, errors, and more:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-service.png" alt="Java service in Elastic APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<h2>Try it out</h2>
<p>Elastic’s Distribution of OpenTelemetry (EDOT) transforms the observability experience by streamlining Kubernetes and application instrumentation. With EDOT, SREs and developers can bypass complex setups, instantly gain deep visibility into Kubernetes clusters, and capture critical metrics, logs, and traces—all within Elastic Observability. By following just a few simple steps, you’re empowered with a unified, efficient monitoring solution that brings your OpenTelemetry data directly into Elastic. With robust, out-of-the-box dashboards, automatic application instrumentation, and seamless integration, EDOT not only saves time but also enhances the accuracy and accessibility of observability across your infrastructure. Start leveraging EDOT today to unlock a frictionless observability experience and keep your systems running smoothly and insightfully.</p>
<p>Additional resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">OpenTelemetry Demo with Elastic Distributions</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/OTel-operator.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic now providing distributions for OpenTelemetry SDKs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-sdk-distributions</link>
            <guid isPermaLink="false">elastic-opentelemetry-sdk-distributions</guid>
            <pubDate>Wed, 03 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Adopting OpenTelemetry native standards for instrumenting and observing applications]]></description>
            <content:encoded><![CDATA[<p>If you develop applications, you may have heard about <a href="https://opentelemetry.io/">OpenTelemetry</a>. At Elastic®, we are enthusiastic about OpenTelemetry as the future of standardized application instrumentation and observability.</p>
<p>In this post, we share our plans to expand our adoption of and commitment to OpenTelemetry with the introduction of Elastic distributions of the OpenTelemetry language SDKs, which will complement our existing Elastic APM agents.</p>
<h2>What is OpenTelemetry?</h2>
<p>OpenTelemetry is a vendor-neutral observability framework and toolkit that supports telemetry signals such as traces, metrics, and logs in applications and distributed microservice-based architectures.</p>
<p>Driven by a set of standards, OpenTelemetry is designed to provide a consistent approach to instrumenting and observing application behavior. OpenTelemetry is an incubating project developed under the Cloud Native Computing Foundation (<a href="https://www.cncf.io/">CNCF</a>) umbrella and is currently the second most active project, topped only by Kubernetes.</p>
<p>You can read more on the <a href="https://opentelemetry.io/docs/what-is-opentelemetry/">OpenTelemetry website</a> about the concepts, terminology, and techniques for adopting OpenTelemetry.</p>
<h2>A richer instrumentation landscape</h2>
<p>By adopting OpenTelemetry, software code can be instrumented in a vendor-agnostic fashion, with telemetry signals exported in a standardized format to one or more vendor backends, such as <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>. Its design provides flexibility for application owners to switch out vendor backends with no code changes and use <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry collectors</a> to send telemetry data to multiple backends.</p>
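<p>As a minimal sketch of that fan-out (the endpoint addresses are placeholders, not Elastic-specific settings), a single collector pipeline can receive OTLP data and export the same traces to two backends:</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/elastic:
    endpoint: &quot;my-apm-server:8200&quot;
  otlp/other:
    endpoint: &quot;other-backend:4317&quot;

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/elastic, otlp/other]
</code></pre>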
<p>Because OpenTelemetry is not a vendor-specific solution, it is much easier for language ecosystems to adopt it and provide robust instrumentations. Vendors don’t have to implement specific instrumentations themselves anymore. OpenTelemetry is a standard, and it is in the interest of library developers to introduce and maintain instrumentations from which all consumers can benefit.</p>
<p>As a result, more instrumentation libraries are available and better kept up to date. If your company has open-source libraries, you can also contribute and create your own instrumentations to make it easier for your customers to adopt OpenTelemetry and benefit from richer traces, metrics, and logging in their applications.</p>
<h2>Elastic and OpenTelemetry</h2>
<p>Elastic is deeply involved in OpenTelemetry. In 2023, we donated the <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic Common Schema</a>, which is being merged with the <a href="https://opentelemetry.io/docs/specs/semconv/">Semantic Conventions</a>. In 2024, we are in the process of donating our <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling agent based on eBPF</a>. We also have multiple contributors to various areas of OpenTelemetry across the organization.</p>
<p>We are therefore committed to helping OpenTelemetry succeed, which means, in some cases, beginning to shift away from Elastic-specific components and to recommend OpenTelemetry components instead.</p>
<p>Elastic is committed to supporting and contributing to OpenTelemetry. Our APM solution already accepts native OTLP (OpenTelemetry Protocol) data, and many of our APM agents already bridge data collection and transmission from applications instrumented with the OpenTelemetry APIs.</p>
<p>The next step on our journey is introducing Elastic distributions for the language SDKs and donating features upstream to the OpenTelemetry community by contributing to the OpenTelemetry SDK repositories.</p>
<h2>What is an OpenTelemetry distribution?</h2>
<p>An <a href="https://opentelemetry.io/docs/concepts/distributions/">OpenTelemetry distribution</a> is simply a customized version of one or more OpenTelemetry components. Each distribution extends the core functionality offered by the component while adhering to its API and existing features, utilizing built-in extension points.</p>
<h2>The Elastic OpenTelemetry SDK distributions</h2>
<p>With the release of Elastic distributions of the OpenTelemetry SDKs, we are extending our backing of OpenTelemetry as the preferred and recommended choice for instrumenting applications.</p>
<p>OpenTelemetry maintains and ships many language APIs and SDKs for observing applications using OpenTelemetry. The APIs provide a language-specific interface for instrumenting application code, while the SDK implements that API, enabling signals from observed applications to be collected and exported.</p>
<p>Our current work extends the OpenTelemetry language SDKs to introduce additional features and ensure that the exported data provides the most robust compatibility with our current backend while it evolves to become more OpenTelemetry native.</p>
<p>Additional features include reimplementing concepts currently available in the Elastic APM Agent but not part of the OpenTelemetry SDK. The distributions allow us to ship with opinionated defaults for all signals that are known to provide the best integration with <a href="https://www.elastic.co/observability">Elastic’s Observability</a> offering.</p>
<p>It’s undoubtedly possible to use the OpenTelemetry APIs to instrument code and then reference the OpenTelemetry SDK to enable the collection of the trace, metric, and log data that applications produce. Elastic APM accepts native OTLP data, so you can configure the OpenTelemetry SDK to export telemetry data directly to an Elastic backend. We refer to this setup as using the “vanilla” (a.k.a. “native”) OpenTelemetry SDK.</p>
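<p>Concretely, the “vanilla” SDK setup typically requires no code changes at all: the OpenTelemetry specification defines standard environment variables that every SDK reads at startup. The Python sketch below simply sets them; the endpoint and token are placeholders for your own Elastic APM deployment.</p>

```python
import os

# Standard OpenTelemetry SDK environment variables (defined by the OTel
# specification and read by any SDK at startup). The endpoint and token
# below are placeholder values, not a real deployment.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://my-deployment.apm.example.com:443"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Bearer my-secret-token"
os.environ["OTEL_SERVICE_NAME"] = "checkout-service"
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "deployment.environment=production"
```

<p>With these set, an application instrumented only with OpenTelemetry APIs exports OTLP data straight to the configured backend; switching vendors means changing the endpoint, not the code.</p>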
<p>Work is ongoing to improve support for storing and presenting OpenTelemetry data natively in our backend so that we can drive our observability UIs directly from the data from the various telemetry signals. Our work focuses on ensuring that the Elastic-curated UIs can seamlessly handle the ECS and OpenTelemetry formats. Alongside this effort, we are working on distributions of the language SDKs to support customers looking to adopt OpenTelemetry-native instrumentation in their applications.</p>
<p>The <a href="https://www.elastic.co/guide/en/apm/agent/index.html">current Elastic APM Agents</a> support features such as central configuration and span compression that are not part of the OpenTelemetry specification as of today. We are investing our engineering expertise to bring those features to a broader audience by contributing them to OpenTelemetry. Because standardization takes time, we can more rapidly bring these features to the OpenTelemetry community and our customers by providing distributions.</p>
<p>We believe the responsible choice is to concentrate on enabling and encouraging customers to favor vendor-neutral instrumentation in their code and reap the benefits of OpenTelemetry.</p>
<p>Distributions best serve our decision to fully adopt and recommend OpenTelemetry as the preferred solution for observing applications. By providing features that are currently unavailable in the “vanilla” OpenTelemetry SDK, we can support customers who want to adopt OpenTelemetry-native, vendor-agnostic instrumentation in their applications while still providing the same set of features and backend capabilities they enjoy today with the existing APM Agents. By maintaining Elastic distributions, we can also better support our customers with enhancements and fixes outside the release cycle of the “vanilla” OpenTelemetry SDKs, which we believe is a crucial factor in choosing a distribution.</p>
<p>Our vision is that Elastic will work with the OpenTelemetry community to donate features through the standardization processes and contribute the code to implement those in the native OpenTelemetry SDKs. In time, we hope to see many Elastic APM Agent-exclusive features transition into OpenTelemetry to the point where an Elastic distribution may no longer be necessary. In the meantime, we can deliver those capabilities via our OpenTelemetry distributions.</p>
<p>Application developers then have several options for instrumenting and collecting telemetry data from their applications:</p>
<ol>
<li>
<p><strong>Elastic APM Agent:</strong> The most fully featured option, but vendor-specific</p>
</li>
<li>
<p><strong>Elastic APM Agent with OpenTelemetry Bridge:</strong> Vendor-neutral instrumentation API, but with known limitations:</p>
<ol>
<li>Only supports bridging of traces (no metrics support)</li>
<li>Does not support OpenTelemetry span events</li>
</ol>
</li>
<li>
<p><strong>OpenTelemetry “vanilla” SDK:</strong> Fully supported today; however, it lacks some features of Elastic APM Agent, such as span compression</p>
</li>
<li>
<p><strong>Elastic OpenTelemetry Distribution:</strong></p>
<ol>
<li>Supports vendor-neutral instrumentation and no Elastic-specific configuration in code by default</li>
<li>Recommended defaults when using Elastic Observability as a backend</li>
<li>Use OpenTelemetry APIs to further customize our defaults; no new APIs to learn</li>
</ol>
</li>
</ol>
<p>While we continue to support all options to instrument your code for the foreseeable future, we think we are setting our customers up for success by introducing a fourth OpenTelemetry-native offering. We expect this will become the preferred default for Elastic customers in due time.</p>
<p>We currently have distributions in alpha release status for <a href="https://github.com/elastic/elastic-otel-dotnet">.NET</a> and <a href="https://github.com/elastic/elastic-otel-java">Java</a>, with additional language distributions coming very soon. We encourage you to check out those repositories, try out the distributions, and provide feedback to us via issues. Your valued input allows us to refine our designs and steer our direction to ensure that our distributions delight consumers.</p>
<p><a href="https://www.elastic.co/blog/elastic-opentelemetry-distribution-dotnet-applications"><em><strong>Learn about the alpha release of our new Elastic distribution of the OpenTelemetry SDK for .NET.</strong></em></a></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-sdk-distributions/OTel-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[FAQ - Elastic contributes its Universal Profiling agent to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry-faq</link>
            <guid isPermaLink="false">elastic-profiling-agent-acceptance-opentelemetry-faq</guid>
            <pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is advancing the adoption of OpenTelemetry with the contribution of its universal profiling agent. Elastic is committed to ensuring a vendor-agnostic ingestion and collection of observability and security telemetry through OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<h2>What is being announced?</h2>
<p>Elastic’s <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal</a> for contributing its Universal Profiling™ agent has now been accepted by the OpenTelemetry community. Elastic’s Universal Profiling agent, the industry’s most comprehensive fleet-wide Universal Profiling solution, empowers users to quickly identify performance bottlenecks, reduce cloud spend, and minimize their carbon footprint. With the contribution of the Elastic Universal Profiling Agent to OpenTelemetry, all customers will benefit from its features and capabilities.</p>
<h2>What do Elastic users need to know?</h2>
<p>Elastic’s contribution of the continuous profiling agent will not change the existing set of Elastic’s continuous profiling features or how we ingest and store profiling data. </p>
<p>Elastic will collaborate closely with the OTel community, not only to manage the addition of the continuous profiling agent to OTel but also to work with and help drive the Profiling Special Interest Group (SIG) in shaping OTel’s continuous profiling evolution. </p>
<p>Elastic has facilitated the definition of the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data. Moreover, the recent merge of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marked an additional milestone. </p>
<h2>Why is Elastic contributing its Profiling Agent to OTel?</h2>
<p>This contribution not only accelerates the standardization of continuous profiling but also makes continuous profiling the 4th key signal in observability. This empowers everyone in the observability community to continuously profile with a standardized agent. The addition of Elastic’s continuous profiling agent will:</p>
<ul>
<li>
<p>Align efforts around a single standard poised for broad adoption by users.</p>
</li>
<li>
<p>Drive better visibility and improvement of resource usage and cost management for operations.</p>
</li>
<li>
<p>Enable vendors and the community to focus on richer features versus dealing with data transformation tasks.</p>
</li>
<li>
<p>Enable continuous profiling to become the 4th key signal in Observability.</p>
</li>
<li>
<p>Increase continuous profiling adoption and the continued evolution and convergence of observability and security domains.</p>
</li>
</ul>
<h2>Why is continuous profiling needed by organizations?</h2>
<p>The contribution of Elastic’s continuous profiling agent now helps customers realize the following benefits of continuous profiling:</p>
<ul>
<li>Maximize gross margins: By reducing the computational resources needed to run applications, businesses can optimize their cloud spend and improve profitability. Whole-system continuous profiling is one way of identifying the most expensive applications (down to the lines of code) across diverse environments that may span multiple cloud providers. This principle aligns with the familiar adage, &quot;A penny saved is a penny earned.&quot; In the cloud context, every CPU cycle saved translates to money saved. </li>
</ul>
<ul>
<li>Minimize environmental impact: Energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, contributing to a reduction in carbon (CO2) footprint. </li>
</ul>
<ul>
<li>Accelerate engineering workflows: Continuous profiling provides detailed insights to help debug complex issues faster, guide development, and improve overall code quality.</li>
</ul>
<p>With these benefits, customers can now not only manage the overall application’s efficiency on the cloud, but also ensure the application is optimally developed.</p>
<h2>What is continuous profiling?</h2>
<p>Elastic’s continuous profiling agent is a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols or service restarts.   </p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stack traces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster. </p>
<p>To this end, it measures code efficiency in three dimensions: CPU utilization, CO2, and cloud cost. This approach resonates with the sustainability objectives of our customers, ensuring that Elastic continuous profiling aligns seamlessly with their strategic <a href="https://en.wikipedia.org/wiki/Environmental,_social,_and_corporate_governance">ESG</a> goals.</p>
<h2>Does Elastic support OpenTelemetry today?</h2>
<p><a href="https://www.elastic.co/observability/opentelemetry">Elastic supports OTel natively</a>. Elastic users can send OTel data directly from applications or through the OTel collector into Elastic APM, which processes both OTel SemConv and ECS. With this native OTel support, all <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM capabilities</a> are available with OTel. <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">See Elastic documentation to learn more about OTel integration</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry-faq/blog-elastic-otel-2.png" alt="Native OpenTelemetry Support in Elastic" /></p>
<h2>Where can I learn more about Elastic’s Universal Profiling?</h2>
<p>Elastic’s resources help you understand continuous profiling and how to use it in different scenarios:</p>
<hr />
<ul>
<li>
<p><a href="https://www.elastic.co/observability/universal-profiling">Elastic Universal Profiling home page</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-universal-profiling-agent-open-source">Elastic Universal Profiling agent going open source under Apache 2</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">Pinpointing performance issues with profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">Elastic releases Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Whole system profiling with Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Cost-effective applications with Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html">Elastic documentation on Universal Profiling</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry-faq/profiling-acceptance-faq.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic contributes its Universal Profiling agent to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry</link>
            <guid isPermaLink="false">elastic-profiling-agent-acceptance-opentelemetry</guid>
            <pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is advancing the adoption of OpenTelemetry with the contribution of its universal profiling agent. Elastic is committed to ensuring a vendor-agnostic ingestion and collection of observability and security telemetry through OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>Following great collaboration between Elastic and OpenTelemetry's profiling community, which included a thorough review process, the OpenTelemetry community has accepted Elastic's donation of our continuous profiling agent. This marks a significant milestone in helping establish profiling as the fourth telemetry signal in OpenTelemetry. Elastic’s eBPF-based continuous profiling agent observes code across different programming languages and runtimes, third-party libraries, kernel operations, and system resources with low CPU and memory overhead in production. SREs can now benefit from these capabilities: quickly identifying performance bottlenecks, maximizing resource utilization, reducing carbon footprint, and optimizing cloud spend.
Over the past year, we have been instrumental in <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">enhancing OpenTelemetry's Semantic Conventions</a> with the donation of the Elastic Common Schema (ECS), have contributed to the OpenTelemetry Collector and language SDKs, and have been working with OpenTelemetry’s Profiling Special Interest Group (SIG) to lay the foundation necessary to make profiling stable.</p>
<p>With today’s acceptance, we are officially contributing our continuous profiler technology to OpenTelemetry. We will also dedicate a team of profiling domain experts to co-maintain and advance the profiling capabilities within OTel.</p>
<p>We want to thank the OpenTelemetry community for the great and constructive cooperation on the donation proposal. We look forward to jointly establishing continuous profiling as an integral part of OpenTelemetry.</p>
<h2>What is continuous profiling?</h2>
<p>Profiling is a technique used to understand the behavior of a software application by collecting information about its execution. This includes tracking the duration of function calls, memory usage, CPU usage, and other system resources.</p>
<p>However, traditional profiling solutions have significant drawbacks limiting adoption in production environments:</p>
<ul>
<li>Significant cost and performance overhead due to code instrumentation</li>
<li>Disruptive service restarts</li>
<li>Inability to get visibility into third-party libraries</li>
</ul>
<p>Unlike traditional profiling, which is often done only in a specific development phase or under controlled test conditions, continuous profiling runs in the background with minimal overhead. This provides real-time, actionable insights without replicating issues in separate environments. SREs, DevOps, and developers can see how code affects performance and cost, making code and infrastructure improvements easier.</p>
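<p>To make the sampling idea concrete, here is a toy Python sketch — an illustration only, and in no way Elastic’s eBPF-based agent: a background thread periodically snapshots another thread’s call stack, and the functions that appear most often are where CPU time is going.</p>

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, interval, duration, counts):
    """Snapshot the target thread's call stack every `interval` seconds.

    Sampling needs no instrumentation of the profiled code; production
    agents (e.g. eBPF-based ones) capture stacks from the kernel instead.
    """
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)  # CPython-specific
        while frame is not None:
            counts[frame.f_code.co_name] += 1
            frame = frame.f_back
        time.sleep(interval)

def busy():
    # Stand-in for application work that burns CPU.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

counts = collections.Counter()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.current_thread().ident, 0.001, 0.5, counts),
)
sampler.start()
while sampler.is_alive():
    busy()
sampler.join()
# The hottest function dominates the samples, like the widest frame in a flame graph.
```

<p>Because only stack snapshots are taken, the profiled code never restarts and pays almost no overhead — which is exactly what makes always-on, production profiling practical.</p>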
<h2>Contribution of production-grade features</h2>
<p>Elastic Universal Profiling is a whole-system, always-on, continuous profiling solution that eliminates the need for code instrumentation, recompilation, on-host debug symbols or service restarts. Leveraging eBPF, Elastic Universal Profiling profiles every line of code running on a machine, including application code, kernel, and third-party libraries. The solution measures code efficiency in three dimensions, CPU utilization, CO2, and cloud cost, to help organizations manage efficient services by minimizing computational waste.</p>
<p>The Elastic profiling agent facilitates identifying non-optimal code paths, uncovering &quot;unknown unknowns&quot;, and provides comprehensive visibility into the runtime behavior of all applications. Elastic’s continuous profiling agent supports various runtimes and languages, such as C/C++, Rust, Zig, Go, Java, Python, Ruby, PHP, Node.js, V8, Perl, and .NET.</p>
<p>Additionally, organizations can meet sustainability objectives by minimizing computational wastage, ensuring seamless alignment with their strategic <a href="https://en.wikipedia.org/wiki/Environmental,_social,_and_corporate_governance">ESG</a> goals.</p>
<h2>Benefits to OpenTelemetry</h2>
<p>This contribution not only boosts the standardization of continuous profiling for observability but also accelerates the practical adoption of profiling as the fourth key signal in OTel. Customers get a vendor-agnostic way of collecting profiling data and enabling correlation with existing signals, like tracing, metrics, and logs, opening <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">new potential for observability insights and a more efficient troubleshooting experience</a>. </p>
<p>OTel-based continuous profiling unlocks the following possibilities for users:</p>
<ul>
<li>Improved customer experience: delivering consistent service quality and performance through continuous profiling ensures customers have an application that performs optimally, remains responsive, and is reliable.</li>
</ul>
<ul>
<li>Maximize gross margins: Businesses can optimize their cloud spend and improve profitability by reducing the computational resources needed to run applications. Whole system continuous profiling identifies the most expensive functions (down to the lines of code) across diverse environments that may span multiple cloud providers. In the cloud context, every CPU cycle saved translates to money saved. </li>
</ul>
<ul>
<li>Minimize environmental impact: energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a> ). More efficient code translates to lower energy consumption, reducing carbon (CO2) footprint. </li>
</ul>
<ul>
<li>Accelerate engineering workflows: continuous profiling provides detailed insights to help troubleshoot complex issues faster, guide development, and improve overall code quality.</li>
</ul>
<ul>
<li>Improved vendor neutrality and increased efficiency: an OTel eBPF-based profiling agent removes the need to use proprietary APM agents and offers a more efficient way to collect profiling telemetry.</li>
</ul>
<p>With these benefits, customers can now manage the overall application’s efficiency on the cloud while ensuring their engineering teams optimize it.</p>
<h2>What comes next?</h2>
<p>While the acceptance of Elastic’s donation of the profiling agent marks a significant milestone in the evolution of OTel’s eBPF-based continuous profiling capabilities, it represents the beginning of a broader journey. Moving forward, we will continue collaborating closely with the OTel Profiling and Collector SIGs to ensure seamless integration of the profiling agent within the broader OTel ecosystem. During this phase, users can test early preview versions of the OTel profiling integration by following the directions in the <a href="https://github.com/elastic/otel-profiling-agent/">otel-profiling-agent</a> repository.</p>
<p>Elastic remains deeply committed to OTel’s vision of enabling cross-signal correlation. We plan to further contribute to the community by sharing our innovative research and implementations, specifically those facilitating the correlation between profiling data and distributed traces, across several OTel language SDKs and the profiling agent.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">growing relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a> and learn how to contribute to the ongoing profiling work in the community.</p>
<h2>Additional Resources</h2>
<p>Additional details on Elastic’s Universal Profiling can be found in the <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry-faq">FAQ</a>.</p>
<p>For more insights into observability, visit Elastic Observability Labs, where OTel-specific articles are also available.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry/profiling-acceptance.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic's RAG-based AI Assistant: Analyze application issues with LLMs and private GitHub issues]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github</link>
            <guid isPermaLink="false">elastic-rag-ai-assistant-application-issues-llm-github</guid>
            <pubDate>Wed, 08 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we review how GitHub issues and other GitHub documents from internal and external GitHub repositories can be used in root cause analysis with Elastic’s RAG-based AI Assistant.]]></description>
            <content:encoded><![CDATA[<p>As an SRE, analyzing applications is more complex than ever. Not only do you have to ensure the application is running optimally to ensure great customer experiences, but you must also understand the inner workings in some cases to help troubleshoot. Analyzing issues in a production-based service is a team sport. It takes the SRE, DevOps, development, and support to get to the root cause and potentially remediate. If it's impacting, then it's even worse because there is a race against time. Regardless of the situation, there is a ton of information that needs to be consumed and processed. This includes not only what the customer is experiencing, but also internal data to help provide the most appropriate resolution.</p>
<p>Elastic’s AI Assistant helps improve analysis for SREs, DevOps, Devs, and others. In a single window, using natural language questions, you can analyze issues with not only the LLM’s general knowledge but also internal sources such as:</p>
<ul>
<li>
<p>Issues from internal GitHub repos, Jira, etc.</p>
</li>
<li>
<p>Documents from internal wiki sites from Confluence, etc.</p>
</li>
<li>
<p>Customer issues from your support service</p>
</li>
<li>
<p>And more</p>
</li>
</ul>
<p>In this blog, we will walk you through how to:</p>
<ol>
<li>
<p>Ingest an external GitHub repository (<a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry demo repo</a>) with code and issues into Elastic. Apply Elastic Learned Sparse EncodeR (ELSER) and store it in a specific index for the AI Assistant.</p>
</li>
<li>
<p>Ingest internal GitHub repository with runbook information into Elastic. Apply ELSER and store the processed data in a specific index for the AI Assistant.</p>
</li>
<li>
<p>Use these two indices when analyzing issues for the OpenTelemetry demo in Elastic using the AI Assistant.</p>
</li>
</ol>
<h2>3 simple questions using GitHub data with AI Assistant</h2>
<p>Before we walk through the steps for setting up data from GitHub, let’s review what an SRE can do with the AI Assistant and GitHub repos.</p>
<p>We initially connect to GitHub using an Elastic GitHub connector and ingest and process two repos: the OpenTelemetry demo repo (public) and an internal runbook repo (Elastic internal).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/1.png" alt="1 - elasticsearch connectors" /></p>
<p>With these two loaded and parsed by ELSER, we ask the AI Assistant some simple questions generally asked during analysis.</p>
<h3>How many issues are open for the OpenTelemetry demo?</h3>
<p>Since we ingested the entire repo (as of April 26, 2024) with a doc count of 1,529, we ask a simple question about the total number of open issues. We specifically tell the AI Assistant to search our internal index, ensuring the LLM asks Elastic for the answer rather than relying on its own training data.</p>
&lt;Video vidyardUuid=&quot;XyKWeYz21mdDkMfop7absQ&quot; loop={true} /&gt;
<h3>Are there any issues for the Rust based shippingservice?</h3>
<p>Elastic’s AI Assistant uses ELSER to traverse the loaded GitHub repo and finds the open issue against the shippingservice (which is the following <a href="https://github.com/open-telemetry/opentelemetry-demo/issues/346">issue</a> at the time of writing this post).</p>
&lt;Video vidyardUuid=&quot;TF1qgy3WH3cuLQdBvdX66A&quot; loop={true} /&gt;
<h3>Is there a runbook for the Cartservice?</h3>
<p>Since we loaded an internal GitHub repo with a few sample runbooks, the Elastic AI Assistant properly finds the runbook.</p>
&lt;Video vidyardUuid=&quot;kSukiZ6zYZDQDycs616ji8&quot; loop={true} /&gt;
<p>As we go through this blog, we will talk about how the AI Assistant finds these issues using ELSER and how you can configure it to use your own GitHub repos.</p>
<h2>Retrieval augmented generation (RAG) with Elastic AI Assistant</h2>
<p>Elastic has the most advanced RAG-based AI Assistant for both Observability and Security. It can help you analyze your data using:</p>
<ul>
<li>
<p>Your favorite LLM (OpenAI, Azure OpenAI, AWS Bedrock, etc.)</p>
</li>
<li>
<p>Any internal information (GitHub, Confluence, customer issues, etc.) you can either connect to or bring into Elastic’s indices</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/2.png" alt="Elastic AI Assistant — connecting internal and external information" /></p>
<p>The reason Elastic’s AI Assistant can do this is because it supports RAG, which helps retrieve internal information along with LLM-based knowledge.</p>
<p>Adding relevant internal information for an SRE into Elastic:</p>
<ul>
<li>
<p>As data comes in, such as from your GitHub repository, ELSER is applied to it, and embeddings (token-weight pairs stored in a sparse vector field) are added to capture the semantic meaning and context of the data.</p>
</li>
<li>
<p>This data (GitHub, Confluence, etc.) is processed with embeddings and is stored in an index that can be searched by the AI Assistant.</p>
</li>
</ul>
<p>When you query the AI Assistant for information:</p>
<ul>
<li>
<p>The query goes through the same ELSER inference process as the ingested data. The input query generates a “sparse vector,” which is used to find the most relevant, highest-ranked information in the ingested data (GitHub, Confluence, etc.).</p>
</li>
<li>
<p>The retrieved data is then combined with the query and also sent over to the LLM, which will then add its own knowledge base information (if there is anything to add), or it might ask Elastic (via function calls) to analyze, chart, or even search further. If a function call is made to Elastic and a response is provided, it will be added by the LLM to its response.</p>
</li>
<li>
<p>The result is the most contextually relevant answer, combining the LLM’s knowledge with anything relevant from your internal data.</p>
</li>
</ul>
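<p>The retrieval step can be sketched in a few lines of Python. This is a toy illustration, not ELSER itself: real ELSER inference expands text into thousands of learned token weights, whereas here simple term counts stand in for them, and the document snippets are invented.</p>

```python
# Toy sketch of sparse-vector retrieval. Term counts stand in for the
# learned token expansions a real model like ELSER would produce.

def sparse_vector(text):
    """Map text to a {token: weight} dict (a stand-in for model inference)."""
    weights = {}
    for token in text.lower().split():
        weights[token] = weights.get(token, 0.0) + 1.0
    return weights

def score(query_vec, doc_vec):
    """Dot product over the tokens the two sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Invented stand-ins for ingested GitHub documents.
docs = {
    "issue-346": "shippingservice panics under load rust tokio",
    "runbook-cart": "cartservice runbook restart redis cache",
}
doc_vecs = {doc_id: sparse_vector(text) for doc_id, text in docs.items()}

query_vec = sparse_vector("is there a runbook for the cartservice")
ranked = sorted(doc_vecs, key=lambda d: score(query_vec, doc_vecs[d]), reverse=True)
```

<p>The highest-scoring documents are what gets combined with the query and passed to the LLM, grounding its answer in your internal data.</p>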
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/3.png" alt="3 - elastic's RAG flowchart" /></p>
<h2>Application, prerequisites, and config</h2>
<p>If you want to try the steps in this blog, here are some prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry demo</a> running and connected to Elastic (<a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html#apm-instrument-apps-otel">APM documentation</a>)</p>
</li>
<li>
<p>Any internal GitHub repo with information that is useful for analysis (in our walkthrough, we use a GitHub repo that houses runbooks for different scenarios used in Elastic demos)</p>
</li>
<li>
<p>An account with your preferred or approved LLM provider (OpenAI, Azure OpenAI, AWS Bedrock)</p>
</li>
</ul>
<h2>Adding the GitHub repos to Elastic</h2>
<p>The first step is to set up the GitHub connector and connect it to your GitHub repo. Elastic has several connectors, for GitHub, Confluence, Google Drive, Jira, AWS S3, Microsoft Teams, Slack, and more. While we go over the GitHub connector in this blog, don’t forget about the other connectors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/4.png" alt="4 - select a connector" /></p>
<p>Once you select the GitHub connector and give it a name, you need to add two items:</p>
<ul>
<li>
<p>GitHub token</p>
</li>
<li>
<p>The repository path (here, open-telemetry/opentelemetry-demo)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/5.png" alt="5 - configuration" /></p>
<p>Next, attach an index in the wizard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/6.png" alt="6 - attach an index" /></p>
<h2>Create a pipeline and process the data with ELSER</h2>
<p>To add the embeddings discussed in the section above, complete the following steps for the connector:</p>
<ul>
<li>
<p>Create a pipeline in the configuration wizard.</p>
</li>
<li>
<p>Create a custom pipeline.</p>
</li>
<li>
<p>Add the ML inference pipeline.</p>
</li>
<li>
<p>Select the ELSER v2 ML model to add the embeddings.</p>
</li>
<li>
<p>Select the fields that need to be evaluated as part of the inference pipeline.</p>
</li>
<li>
<p>Test and save the inference pipeline and the overall pipeline.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/7.png" alt="7" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/8.png" alt="8" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/9.png" alt="9" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/10.png" alt="10" /></p>
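<p>Behind the wizard, these steps amount to an ingest pipeline containing an inference processor. Here is a minimal Python sketch of what such a pipeline body can look like; the description, source field, and output field names are illustrative assumptions (the wizard generates the real ones for the fields you select):</p>

```python
def build_elser_inference_pipeline(source_field: str = "body_content") -> dict:
    """Sketch of an ingest pipeline that runs ELSER v2 over one text field.

    Field names are illustrative; the connector's pipeline wizard
    uses the actual fields you selected for inference.
    """
    return {
        "description": "Add ELSER embeddings to incoming connector documents",
        "processors": [
            {
                "inference": {
                    "model_id": ".elser_model_2",
                    # map the document's text field to the model input,
                    # and choose where the sparse-vector output is stored
                    "input_output": [
                        {
                            "input_field": source_field,
                            "output_field": "ml.inference.predicted_value",
                        }
                    ],
                }
            }
        ],
    }

pipeline = build_elser_inference_pipeline()
```

<p>Each document that flows through the pipeline gets a sparse-vector field added, which is what the AI Assistant later searches.</p>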
<h2>Sync the data</h2>
<p>Now that the pipeline is created, you can start syncing the GitHub repo. As documents from the GitHub repo come in, they go through the pipeline and embeddings are added.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/11.png" alt="11" /></p>
<h2>Embeddings</h2>
<p>Once the pipeline is set up, sync the data in the connector. As data from the GitHub repository comes in, the inference pipeline processes it as follows:</p>
<ul>
<li>
<p>As data comes in from your GitHub repository, ELSER is applied to it, and embeddings (weights and tokens in a sparse vector field) are added to capture the semantic meaning and context of the data.</p>
</li>
<li>
<p>This data is processed with embeddings and is stored in an index that can be searched by the AI Assistant.</p>
</li>
</ul>
<p>When you look at the OpenTelemetry GitHub documents that were ingested, you will see how the weights and tokens are added to the predicted_value field in the index.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/12.png" alt="12" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/13.png" alt="13" /></p>
<p>These embeddings are used to find the most contextually relevant documents when the user asks the AI Assistant a question.</p>
<h2>Check if AI Assistant can use the index</h2>
<p>Elastic’s AI Assistant uses ELSER to search the loaded GitHub repo and find the open issue against the shippingservice (this <a href="https://github.com/open-telemetry/opentelemetry-demo/issues/346">issue</a> at the time of writing).</p>
&lt;Video vidyardUuid=&quot;TF1qgy3WH3cuLQdBvdX66A&quot; loop={true} /&gt;
<p>Based on the response, we can see that the AI Assistant can now use the index to find the issue and use it for further analysis.</p>
<h2>Conclusion</h2>
<p>You’ve now seen how easy Elastic’s RAG-based AI Assistant is to set up. You can bring in documents from multiple locations (GitHub, Confluence, Slack, etc.); we’ve shown the setup for GitHub and OpenTelemetry. This internal information can be useful in managing issues, accelerating resolution, and improving customer experiences. Check out our other blogs on how the AI Assistant can help SREs do better analysis, lower MTTR, and improve operations overall:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm">Analyzing OpenTelemetry apps with Elastic AI Assistant and APM</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-escapes-kibana">The Elastic AI Assistant for Observability escapes Kibana!</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">Elastic 8.13: GA of Amazon Bedrock in the Elastic AI Assistant for Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/sre-troubleshooting-ai-assistant-observability-runbooks">Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Context-aware insights using the Elastic AI Assistant for Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-amazon-bedrock">Getting started with the Elastic AI Assistant for Observability and Amazon Bedrock</a></p>
</li>
</ul>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environments. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/AI_fingertip_touching_human_fingertip.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic SQL inputs: A generic solution for database metrics observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sql-inputs-database-metrics-observability</link>
            <guid isPermaLink="false">sql-inputs-database-metrics-observability</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog dives into the functionality of generic SQL and provides various use cases for advanced users to ingest custom metrics to Elastic for database observability. We also introduce the new fetch-from-all-databases capability released in 8.10.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> SQL inputs (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> module and <a href="https://docs.elastic.co/integrations/sql">input package</a>) allow the user to execute <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> queries against many supported databases in a flexible way and ingest the resulting metrics to Elasticsearch<sup>®</sup>. This blog dives into the functionality of generic SQL and provides various use cases for <em>advanced users</em> to ingest custom metrics to Elastic for database observability. The blog also introduces the new fetch_from_all_databases capability, released in 8.10.</p>
<h2>Why “Generic SQL”?</h2>
<p>Elastic already has metricbeat modules and integration packages targeted at specific databases. One example is the <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-mysql.html">metricbeat module</a> for MySQL and the corresponding integration <a href="https://docs.elastic.co/en/integrations/mysql">package</a>. These Beats modules and integrations are customized for a specific database, and the metrics are extracted using pre-defined queries against that database. The queries used in these integrations and the corresponding metrics are <em>not</em> available for modification.</p>
<p>The <em>Generic SQL inputs</em> (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> or <a href="https://docs.elastic.co/integrations/sql">input package</a>), by contrast, can be used to scrape metrics from any supported database using the user's own SQL queries. The queries are provided by the user depending on the specific metrics to be extracted. This enables a much more powerful mechanism for metrics ingestion: users choose a specific driver, provide the relevant SQL queries, and the results get mapped to one or more Elasticsearch documents using a structured mapping process (the table/variable format explained later).</p>
<p>Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, <em>Generic SQL input</em> and <em>Generic SQL</em> are used interchangeably.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-1-genericSQL.png" alt="Generic SQL database metrics collection" /></p>
<h2>Functionalities details</h2>
<p>This section covers features that help with metrics extraction. We provide a brief description of the response format configuration, then dive into the merge_results functionality, which combines results from multiple SQL queries into a single document.</p>
<p>The next key functionality users may be interested in is collecting metrics from all custom databases, which is now possible with the fetch_from_all_databases feature.</p>
<p>Now let's dive into the specific functionalities:</p>
<h3>Different drivers supported</h3>
<p>Generic SQL can fetch metrics from different databases. The current version supports the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).</p>
<h3>Response format</h3>
<p>The response format in generic SQL determines whether the data is returned in table or variable format. Here’s an overview of both formats and their syntax.</p>
<p>Syntax: <code>response_format: table</code> or <code>variables</code></p>
<p><strong>Response format table</strong><br />
This mode generates a single event for each row. The table format has no restriction on the number of columns in the response.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: table
</code></pre>
<p>This query returns a response similar to this:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;counter_name&quot;:&quot;User Connections &quot;,
         &quot;cntr_value&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>The response generated above adds the counter_name as a key in the document.</p>
<p><strong>Response format variables</strong><br />
The variable format returns key:value pairs and expects the query to fetch exactly two columns.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: variables
</code></pre>
<p>The variable format uses the value of the first column in the query above as the key:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user connections &quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>In the above response, you can see the value of counter_name is used to generate the key in variable format.</p>
<h3>Response optimization: merge_results</h3>
<p>Generic SQL now supports merging multiple query responses into a single event. By enabling <strong>merge_results</strong>, users can significantly optimize the storage space of the metrics ingested to Elasticsearch. This mode compacts the generated output: instead of multiple documents, a single merged document is generated wherever applicable. Metrics of a similar kind, generated from multiple queries, are combined into a single event.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-2-output-merge-results.png" alt="Output of Merge results" /></p>
<p>Syntax: <code>merge_results: true</code> or <code>false</code></p>
<p>The example below shows how the data is loaded into Elasticsearch when merge_results is disabled.</p>
<p>Example:</p>
<p>In this example, we are using two different queries to fetch metrics from the performance counter.</p>
<pre><code class="language-yaml">merge_results: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
    response_format: table
  - query: &quot;SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'&quot;
    response_format: table
</code></pre>
<p>As you can see, the response for the above example generates a single document for each query.</p>
<p>The resulting document from the first query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>And resulting document from the second query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>When we enable the merge_results flag, both of the above metrics are combined and loaded into a single document.</p>
<p>You can see the merged document in the below example:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7,
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p><em>However, table queries can be merged only if each produces a single row. There is no restriction on merging variable queries.</em></p>
<h3>Introducing a new capability: fetch_from_all_databases</h3>
<p>This <a href="https://github.com/elastic/beats/pull/35688">new functionality</a> automatically fetches metrics from all system and user databases of Microsoft SQL Server when the fetch_from_all_databases flag is enabled.</p>
<p>The feature is available starting with the <a href="https://www.elastic.co/guide/en/beats/metricbeat/8.10/metricbeat-module-sql.html#_example_execute_given_queries_for_all_databases_present_in_a_server">8.10 release</a>. Prior to 8.10, users had to provide database names manually to fetch metrics from custom/user databases.</p>
<p>Syntax: <code>fetch_from_all_databases: true</code> or <code>false</code></p>
<p>Below is a sample query with the fetch_from_all_databases flag disabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';&quot;
</code></pre>
<p>The above query fetches metrics only for the provided database name. Here the input database is master, so metrics are fetched only for master.</p>
<p>Below is a sample query with the fetch_from_all_databases flag enabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: true
driver: &quot;mssql&quot;
sql_queries:
  - query: SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;
    response_format: table
</code></pre>
<p>The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.</p>
<p>Please note: currently this feature is supported only for Microsoft SQL Server, and it will be used internally by the MS SQL integration to support extracting metrics for <a href="https://github.com/elastic/integrations/issues/4108">all user DBs</a> by default.</p>
<h2>Using generic SQL: Metricbeat</h2>
<p>The generic <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">SQL metricbeat module</a> provides the flexibility to execute queries against different database drivers. The metricbeat module is generally available (GA) for production usage. You can find more information on configuring <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">the generic SQL module</a> for different drivers, with various examples.</p>
<h2>Using generic SQL: Input package</h2>
<p>The input package provides a flexible solution for advanced users to customize their ingestion experience in Elastic. Generic SQL is now also available as an SQL <a href="https://docs.elastic.co/integrations/sql">input package</a>. The input package is currently available for early users as a <strong>beta release</strong>. Let's walk through how to use generic SQL via the input package.</p>
<h3>Configuring the generic SQL input package</h3>
<p>The configuration options for the generic SQL input package are as below:</p>
<ul>
<li><strong>Driver:</strong> The SQL database for which you want to use the package. In this case, we will take mysql as an example.</li>
<li><strong>Hosts:</strong> Here the user enters the connection string used to connect to the database. It varies depending on which database/driver is being used. Refer <a href="https://docs.elastic.co/integrations/sql#hosts">here</a> for examples.</li>
<li><strong>SQL Queries:</strong> Here the user writes the SQL queries they want to run and specifies the response_format.</li>
<li><strong>Data set:</strong> The user specifies a <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#_data_stream_field_details">data set</a> name to which the response fields get mapped.</li>
<li><strong>Merge results:</strong> This is an advanced setting, used to merge the results of multiple queries into a single event.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-3-SQL-metrics-inputpackage.png" alt="Configuration parameters for SQL input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-4-expanded-document.png" alt="Metrics getting mapped to the index created by the ‘sql_first_dataset’" /></p>
<h3>Metrics extensibility with customized SQL queries</h3>
<p>Let's say a user is using the <a href="https://docs.elastic.co/integrations/mysql">MySQL integration</a>, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new customized SQL queries.</p>
<p>This can be achieved by adding an instance of SQL input package, writing the customized queries and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name as shown in the screenshot below.</p>
<p>This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-5-driver.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>When there are multiple queries, users can combine them into a single event by enabling the Merge Results toggle.</p>
<h3>Customizing user experience</h3>
<p>Users can customize their data by writing their own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing their customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Users can also build their own bespoke dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-6-ingest-pipeline.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries that are not supported in the default MySQL integration (where the user gets metrics from a predetermined set of queries).</p>
<p>The SQL input package also supports multiple drivers: mssql, postgresql, and oracle. A single input package can be used to cater to all these databases.</p>
<p>Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.</p>
<h2>Try it out!</h2>
<p>Now that you know about the various use cases and features of generic SQL, get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and try the <a href="https://docs.elastic.co/integrations/sql">SQL input package</a> for your SQL database to get a customized experience and metrics. If you are looking for newer metrics for some of our existing SQL-based integrations, like <a href="https://docs.elastic.co/en/integrations/microsoft_sqlserver">Microsoft SQL Server</a>, <a href="https://docs.elastic.co/integrations/oracle">Oracle</a>, and more, go ahead and give the SQL input package a whirl.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Smarter Alerting Arrives with Faster Triage, Clearer Groupings, and Actionable Guidance]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-stack-observability-alerting-upgrade</link>
            <guid isPermaLink="false">elastic-stack-observability-alerting-upgrade</guid>
            <pubDate>Thu, 04 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring the latest enhancements in Elastic Stack alerting, including improved related alert grouping, linking dashboards to alert rules, and embedding investigation guides into alerts.]]></description>
            <content:encoded><![CDATA[<p>In the 9.1 release, we've made significant upgrades to alerting to help SREs and operators cut through the noise, understand what's happening faster, and take meaningful action with less guesswork.</p>
<p>Here's what's new:</p>
<h2>Improved Related Alert Grouping with Relevance Scoring &amp; Reasoning</h2>
<p>We've enhanced our related alert detection to go beyond surface-level correlations. Alerts are now grouped based on a relevance score that reflects the strength of their relationship across dimensions like:</p>
<ul>
<li><strong>Shared entities or resources</strong> (e.g. same host, pod, or service)</li>
<li><strong>Temporal proximity</strong> (alerts firing within a suspiciously short window)</li>
<li><strong>Signal similarity</strong> (e.g. spikes in logs, metrics, and traces that point to the same failure mode)</li>
</ul>
<p>More importantly, we now <strong>show the why</strong>. You'll see why an alert is grouped: whether it shares the same Kubernetes pod, has similar log patterns, or was triggered by the same upstream anomaly. This gives users confidence in the grouping logic and accelerates root cause analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-1.jpg" alt="Related Alerts" /></p>
<h2>Link Dashboards to Alert Rules and Get Smart Suggestions</h2>
<p>You can now <strong>link dashboards directly to your alert rules</strong>, giving responders an instant visual lens into the metrics or logs that matter most for that alert. No more scrambling to remember which dashboard to check — just click and go.</p>
<p>And we've made this smarter too: Elastic will now <strong>suggest relevant dashboards</strong> based on the alert's source, rule logic, or monitored entities, helping users land on the right view without needing to configure anything upfront.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-2.jpg" alt="Related Alerting Dashboards" /></p>
<h2>Investigation Guides Embedded Into Alerts</h2>
<p>Every alert can now be configured with an <strong>investigation guide</strong>, a set of pre-configured, context-aware instructions or next steps tailored to the alert. Think of it as a playbook that's embedded right where and when you need it.</p>
<p>Use it to:</p>
<ul>
<li>Document your team's runbooks and standard triage steps or link to existing runbooks</li>
<li>Guide junior engineers or on-call responders through unfamiliar territory</li>
<li>Automate the first few steps of root cause analysis</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-3.jpg" alt="Investigation Guide" /></p>
<h2>Why This Matters</h2>
<p>These changes are all about reducing mean time to detect (MTTD) and mean time to resolve (MTTR). By:</p>
<ul>
<li>Grouping alerts more intelligently (and transparently)</li>
<li>Giving you the dashboards you need, when you need them</li>
<li>Embedding action-oriented guides in every alert</li>
</ul>
<p>We're bringing you closer to a truly streamlined incident response workflow: no swivel-chairing, no guesswork, just clarity.</p>
<p>Additionally, take a look at some of our other articles on Elastic Observability Labs related to analysis:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/ai-assistant">Using the AI Assistant in Elastic Observability to Accelerate Root Cause Analysis</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/log-analytics">All of the log analytics features in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry">Our latest on OpenTelemetry support in Elastic Observability</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/cover-alerting.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Streams Processing: Stop Fighting with Grok. Parse Your Logs in Streams.]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-processing</link>
            <guid isPermaLink="false">elastic-streams-processing</guid>
            <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams Processing works under the hood and how to use it to build, test, and deploy parsing logic on live data quickly.]]></description>
            <content:encoded><![CDATA[<p>With Streams, Elastic's new AI capability in 9.2, we make parsing your logs so simple that it's no longer a concern. Logs are generally messy: lots of fields, some understood, some unknown. You have to constantly keep up with their semantics and pattern-match to parse them properly. In some cases, even fields you know have different values or semantics; for instance, <code>timestamp</code> is the ingest time, not the event time. Or you can't filter by <code>log.level</code> or <code>user.id</code> at all because they're buried inside the <code>message</code> field. As a result, your dashboards are flat and not useful.</p>
<p>Fixing this used to mean leaving Kibana, learning Grok syntax, manually editing ingest pipeline JSON or a complicated Logstash config, and hoping you didn't break parsing for everything else.</p>
<p>We built Streams to fix this, and much more. It's your one place for data processing, built right into Kibana, that lets you build, test, and deploy parsing logic on live data in seconds. It turns a high-risk backend task into a fast, predictable, interactive UI workflow. You can use AI to generate Grok rules automatically from a sample of logs, or build them easily in the UI. Let's walk through an example.</p>
<h2>A Quick Walkthrough</h2>
<p>Let's fix a common &quot;unstructured&quot; log right now.</p>
<ol>
<li><strong>Start in Discover</strong>. You find a log that isn't structured. The <code>@timestamp</code> is wrong, and fields like <code>log.level</code> aren't being extracted, so your histograms are just a single-color bar.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/start-in-discover.png" alt="start in discover" /></p>
<ol start="2">
<li><strong>Inspect the log</strong>. Open the document flyout (the &quot;Inspect a single log event&quot; view). You'll see a button: <strong>&quot;Parse content in Streams&quot;</strong> (or &quot;Edit processing in Streams&quot;). Click it.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/inspect-the-log.png" alt="inspect the log" /></p>
<ol start="3">
<li><strong>Go to Processing</strong>. This takes you directly to the Streams processing tab, pre-loaded with sample documents from that data stream. Click <strong>&quot;Create your first step.&quot;</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/go-to-processing.png" alt="go to streams processing" /></p>
<ol start="4">
<li><strong>Generate a Pattern</strong>. The processor defaults to Grok, but you don't have to write any Grok yourself. Just click the <strong>&quot;Generate Pattern&quot;</strong> button. Streams analyzes 100 sample documents from your stream and suggests a Grok pattern for you. By default, this uses the Elastic Managed LLM, but you can configure your own.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/generate-pattern.png" alt="generate the pattern" /></p>
<ol start="5">
<li><strong>Accept and Simulate</strong>. Click &quot;Accept.&quot; Instantly, the UI runs a simulation across all 100 sample documents. You can make changes to the pattern or adjust field names, and the simulation re-runs with every keystroke.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/accept-and-simulate.png" alt="simulate and accept" /></p>
<p>When you're happy, you save it. Your new logs will now be parsed correctly.</p>
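<p>To make this concrete: for a raw line like <code>2025-12-01T10:15:00Z ERROR payment failed for user 42</code>, the grok processor Streams generates could look roughly like this (an illustrative sketch only; the sample line and field names are hypothetical):</p>

```json
{
  "grok": {
    "field": "message",
    "patterns": [
      "%{TIMESTAMP_ISO8601:custom.timestamp} %{LOGLEVEL:log.level} %{GREEDYDATA:custom.detail}"
    ]
  }
}
```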
<h2>Powerful Features for Messy, Real-World Logs</h2>
<p>That's the simple case. But real-world data is rarely that clean. Here are the features built to handle the complexity.</p>
<h3>The Interactive Grok UI</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-interactive-grok-ui.png" alt="interactive grok" /></p>
<p>When you use the Grok processor, the UI gives you a <strong>visual indication</strong> of what your pattern is extracting. You can see which parts of the <code>message</code> field are being mapped to which new field names. This immediate feedback means you're not just guessing. Autocompletion of Grok patterns and instant pattern validation are also built in.</p>
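<p>If you're curious what that extraction amounts to, Grok patterns ultimately compile down to named regular expressions. Here's a rough Python approximation (the pattern and field names are a deliberately simplified stand-in, not what Streams emits):</p>

```python
import re

# %{TIMESTAMP_ISO8601:...}, %{LOGLEVEL:...}, and %{GREEDYDATA:...} boil down
# to named capture groups like these.
PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<detail>.*)")

def parse(message: str) -> dict:
    """Extract structured fields from an unstructured log line."""
    match = PATTERN.match(message)
    return match.groupdict() if match else {}

print(parse("2025-12-01T10:15:00Z ERROR payment failed for user 42"))
# {'timestamp': '2025-12-01T10:15:00Z', 'level': 'ERROR', 'detail': 'payment failed for user 42'}
```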
<h3>The Diff Viewer</h3>
<p>How do you know exactly what changed? Expand any row in the simulation table. You'll get a diff view showing precisely which fields were added, removed, or modified for that specific document. No more guesswork.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-diff-viewer.png" alt="the diff viewer" /></p>
<h3>End to End Simulation and Detecting Failures</h3>
<p>This is the most critical part. Streams doesn't just simulate the processor; it simulates the entire indexing process. If you try to map a non-timestamp string (like the <code>message</code> field) directly to the <code>@timestamp</code> field, the simulation will show a failure. It catches the problem before you save, and before it can create a mapping conflict in your cluster. This safety net is what lets you move fast.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/end-to-end-simulation.png" alt="end to end simulation" /></p>
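<p>If you've used Elasticsearch's simulate API, this will feel familiar: a request body for <code>POST _ingest/pipeline/_simulate</code> lets you dry-run processors against sample docs, and Streams layers mapping checks on top. A minimal sketch (the pattern and sample doc are hypothetical):</p>

```json
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{LOGLEVEL:log.level} %{GREEDYDATA:detail}"]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "ERROR disk full on /var/data" } }
  ]
}
```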
<h3>Conditional Processing</h3>
<p>What if one data stream contains a large variety of logs? You can't use one Grok pattern for all.</p>
<p>Streams has conditional processing built for this. The UI lets you build &quot;if-then&quot; logic and shows you exactly what percentage of your sample documents are skipped or processed by each condition. Right now, the UI supports up to three levels of nesting, and we plan to add a YAML mode in the future for more complex logic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/conditional-processing.png" alt="conditional processing" /></p>
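<p>Under the hood, each branch becomes an <code>if</code> condition on an ingest processor. A hedged sketch of what one branch might compile to (the service name and pattern are hypothetical):</p>

```json
{
  "grok": {
    "if": "ctx.service?.name == 'nginx'",
    "field": "message",
    "patterns": ["%{IP:client.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.path}"]
  }
}
```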
<h3>Changing Your Test Data (Document Samples)</h3>
<p>A random 100-document sample isn't always helpful, especially in a massive, mixed stream from Kubernetes or a central message broker.</p>
<p>You can change the document sample to test your changes on a more specific set of logs. You can either provide documents manually (copy-paste) or, more powerfully, specify a KQL query to fetch 100 matching documents. For example, <code>service.name : &quot;data_processing&quot;</code> fetches 100 sample documents from that service to use in the simulation. Now you can build and test a processor on the exact logs you care about.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/changing-your-test-data.png" alt="changing your test data" /></p>
<h2>How Processing Works Under the Hood</h2>
<p>There’s no magic. In simple terms, it's a UI that makes our existing best practices more accessible. As of version 9.2, Streams runs exclusively on <strong>Elasticsearch ingest pipelines</strong>. (We plan to offer more than that; stay tuned.)</p>
<p>When you save your changes, Streams appends processing steps by:</p>
<ol>
<li>Locating the most specific <code>@custom</code> ingest pipeline for your data stream.</li>
<li>Adding a single <code>pipeline</code> processor to it.</li>
<li>This processor calls a new, dedicated pipeline named <code>&lt;stream-name&gt;@stream.processing</code>, which contains the Grok, conditional, and other logic you built in the UI.</li>
</ol>
<p>You can even see this for yourself by going to the <strong>Advanced tab</strong> in your Stream and clicking the pipeline name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/how-processing-works.png" alt="how processing works" /></p>
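<p>Put together, for a hypothetical stream named <code>logs.nginx</code>, the <code>@custom</code> pipeline ends up containing little more than a delegating processor (a sketch following the naming convention above, not verbatim output):</p>

```json
{
  "processors": [
    {
      "pipeline": {
        "name": "logs.nginx@stream.processing"
      }
    }
  ]
}
```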
<h2>Processing in OTel, Elastic Agent, Logstash, or Streams? What to Use?</h2>
<p>This is a fair question. You have lots of ways to parse data.</p>
<ul>
<li><strong>Best: Structured logging at the source</strong>. If you control the app writing the logs, make it log structured JSON in the format of your choice. This remains the best way to do logging, but it isn't always possible.</li>
<li><strong>Good, when available: Elastic Agent + Integrations</strong>. If an existing integration collects and parses your data, Streams won't do it any better. Use it!</li>
<li><strong>Good for tech-savvy users: OTel at the edge</strong>. Use OTel (with OTTL) to set yourself up for the future.</li>
<li><strong>The easy catch-all: Streams</strong>. Especially when an integration primarily just ships the data into Elastic, Streams can add a lot of value. The Kubernetes Logs integration is a good example: the integration ships the data, but most logs aren't parsed automatically because they can come from a wide variety of pods.</li>
</ul>
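<p>As a minimal sketch of the &quot;structured at the source&quot; option, here is what emitting JSON logs could look like in Python (the formatter and field names are our own illustration; adapt them to your schema):</p>

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, so nothing downstream
    ever needs a Grok pattern to recover log.level or the message."""
    def format(self, record):
        return json.dumps({
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "log.level": record.levelname,
            "message": record.getMessage(),
        })

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user logged in")  # emits a single structured JSON line
```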
<p>Think of Streams as your universal &quot;catch-all&quot; for everything that arrives unstructured. It's perfect for data from sources you don't control, for legacy systems, or for when you just need to fix a parsing error right now without a full application redeploy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/processing-in-otel.png" alt="processing in otel" /></p>
<p>A quick note on schemas: Streams can handle both ECS (Elastic Common Schema) and OTel (OpenTelemetry) data. By default, it assumes your target schema is ECS. However, Streams will automatically detect and adapt to the OTel schema if your Stream's name contains the word “otel”, or if you're using the special Logs Stream (currently in tech preview). You get the same visual parsing workflow regardless of the schema.</p>
<p>All processing changes can also be made using a Kibana API. Note that the API is still in tech preview while we mature some of the functionality.</p>
<h2>Summary</h2>
<p>Parsing logs shouldn't be a tedious, high-stakes, backend-only task. Streams moves the entire workflow from a complex, error-prone approach to an interactive UI right where you already are. You can now build, test, and deploy parsing logic with instant, safe feedback. This means you can stop fighting your logs and finally start using them. The next time you see a messy log, don't ignore it. Click &quot;Parse in Streams&quot; and fix it in 60 seconds.</p>
<p>Check out more log analytics articles in <a href="https://www.elastic.co/observability-labs/blog/tag/log-analytics">Elastic Observability Labs</a>.</p>
<p>Try out Elastic. Sign up for a trial at <a href="https://cloud.elastic.co/registration?fromURI=%2Fhome">Elastic Cloud</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/cover.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Synthetics Projects: A Git-friendly way to manage your synthetics monitors in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/synthetics-git-ops-observability</link>
            <guid isPermaLink="false">synthetics-git-ops-observability</guid>
            <pubDate>Thu, 23 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability can easily integrate into your DevOps git flow when managing applications with synthetics. Our new Synthetics Projects will enable you to develop and manage synthetics monitor configurations written in YAML with git.]]></description>
<content:encoded><![CDATA[<p>Elastic has an entirely new Heartbeat/Synthetics workflow that is superior to the current one. If you’re a current user of the Elastic Uptime app, read on to learn about the improved workflow you can use today and should eventually migrate toward.</p>
<p>We’ve recently released a beta feature that provides a Git-friendly, IaC-oriented workflow. You can now push Heartbeat monitors with the same ease with which you push code changes in Git or config changes in Terraform. The features discussed in this blog are all currently in beta, and we urge users trying these features out to upgrade to the latest stack version first. When these features become GA, this new workflow will be the preferred way of configuring monitors in the Elastic Stack. If you’re starting a new project, you may want to consider setting it up this way instead of via our more classic configuration.</p>
<p>Today, using Heartbeat is simple. You just need to write a little YAML, and monitoring data shows up in Elasticsearch, visible in the Uptime UI. While the UI is indeed simple, there’s some hidden complexity there that we’ve improved with a new UI (the Synthetics app) and augmented with an even more automation-friendly CLI workflow via our new Projects feature, discussed below.</p>
<p>How do you manage your configs written in YAML? Many of our users manage YAML in Git and use tooling such as Ansible, Helm, or similar to manage their infrastructure as code (IaC). Like many other organizations, Elastic heavily utilizes IaC in all parts of our operations, so it’s only natural that we developed a capability to provide similar support for the current Heartbeat capability and the upcoming synthetics monitoring capabilities.</p>
<h2>Projects: A new way to organize and distribute configs</h2>
<p>Let’s dive right into what we’re calling “Synthetics Projects” and how they differ from traditional Heartbeat config files. To use this feature, you would start by <a href="https://www.elastic.co/guide/en/observability/current/synthetics-get-started-project.html">creating a project</a> in a Git repo containing your configs. At a high level, setting up a project requires performing the following tasks:</p>
<ol>
<li>Run <code>npx @elastic/synthetics init</code> to create a project skeleton in a directory. See more details on the <a href="https://www.npmjs.com/package/@elastic/synthetics">npmjs.com</a> site.</li>
<li>Run <code>git init</code> and <code>git push</code> on the generated directory to version it as a Git repository.</li>
<li>Add your lightweight YAML files and browser JavaScript/TypeScript files to the journeys folder.</li>
<li>Test that it works by running the <code>npx @elastic/synthetics push</code> command to sync your project to your Elastic Stack.</li>
<li>Configure a CI/CD pipeline to test pull requests to your Git repo and to execute <code>npx @elastic/synthetics push</code> on merges to the main branch.</li>
</ol>
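<p>For reference, a lightweight monitor in that journeys folder is plain YAML. A minimal, hypothetical example (the id, URL, and schedule are placeholders):</p>

```yaml
# journeys/lightweight.yml — a hypothetical HTTP monitor
heartbeat.monitors:
  - type: http
    id: example-homepage
    name: Example homepage
    schedule: '@every 1m'
    urls:
      - https://example.com
```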
<p>So, once configured, adding, removing, and editing monitors involves:</p>
<ol>
<li>Editing a monitor’s config locally: YAML for lightweight monitors, or JavaScript/TypeScript for browser-based ones</li>
<li>Testing your local configs with <code>npx @elastic/synthetics journeys</code></li>
<li>Creating a new PR to your main branch via a Git push</li>
<li>Waiting for your CI server to perform the same validation and for someone else on your team to review your PR</li>
<li>Merging your result to the main branch</li>
<li>Waiting for your CI server to push the changes to your Elastic Stack</li>
</ol>
<p>We’ve depicted the flow of data in the diagram below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-elastic-flow-of-data-diagram.png" alt="" /></p>
<p>This is, in fact, the way many of our users work today, with other software taking the place of <code>npx @elastic/synthetics push</code>, as mentioned earlier. Indeed, in the future, we will most likely look into building a Terraform provider, though that isn’t something we’re actively working on now.</p>
<h2>Just have a few monitors? Use the GUI!</h2>
<p>The above approach is great for sophisticated users with larger numbers of configurations, but if you just want to monitor a few URLs, it’s overkill. If that sounds like you, consider the new Monitor Management UI in the Uptime app! It works in the exact same way, saving configs to your Elastic Stack, but with no need for Git, a project, or all that other infrastructure. Simply log in, fill out the form pictured below, and hit save. If you want to set up a private location, that is still done in the same way via Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-elastic-add-monitor.png" alt="" /></p>
<h2>What about my existing Fleet monitors?</h2>
<p>A small subset of users have monitors configured today using the Synthetics Fleet integration. If that describes you, you’ll want to move to either the GUI-based approach or the project-based approach, as those methods supersede direct usage of the Fleet integration, which will eventually be restricted to use via the above-described methods.</p>
<p>The Fleet approach is inferior in a few ways:</p>
<ol>
<li>It can only configure monitors for a single location.</li>
<li>It creates a different UX for monitors configured on the service versus private locations.</li>
<li>Its integration with the Uptime UI is less fluid.</li>
</ol>
<p>It’s rare for us to deprecate beta features, but in this case we had a clearly superior alternative. Maintaining both would have created a more confusing and unwieldy product. We don’t yet have an exact date for removing support for these monitors, but you can track this via <a href="https://github.com/elastic/kibana/issues/137508">this GitHub issue</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-charts-packages.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic Universal Profiling agent, a continuous profiling solution, is now open source]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-universal-profiling-agent-open-source</link>
            <guid isPermaLink="false">elastic-universal-profiling-agent-open-source</guid>
            <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[At Elastic, open source isn't just philosophy, it's our DNA. Dive into the future with our open-sourced Universal Profiling agent, revolutionizing software efficiency and sustainability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Universal Profiling™ agent is now open source! The industry’s most advanced fleetwide continuous profiling solution empowers users to identify performance bottlenecks, reduce cloud spend, and minimize their carbon footprint. This post explores the history of the agent, its move to open source, and its future integration with OpenTelemetry.</p>
<h2>Elastic Universal Profiling™ Agent goes open source under Apache 2</h2>
<p>At Elastic, open source is more than just a philosophy — it's our DNA. We believe the benefits of whole-system continuous profiling extend far beyond performance optimization. It's a win for businesses and the planet alike. For instance, since launching Elastic Universal Profiling in general availability (GA), we've observed a wide variety of use cases from customers.</p>
<p>These range from customers relying fully on Universal Profiling's <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-differential-views-intro">differential flame graphs and topN functions</a> for insights during release management to utilizing AI assistants for quickly optimizing expensive functions. This includes using profiling data to identify the optimal energy-efficient cloud region to run certain workloads. Additionally, customers are using insights that Universal Profiling provides to build evidence to challenge cloud provider bills. As it turns out, cloud providers' in-VM agents can consume a significant portion of the CPU time, which customers are billed for.</p>
<p>In a move that will empower the community to take advantage of continuous profiling's benefits, <strong>we're thrilled to announce that the Elastic Universal Profiling agent</strong> , a pioneering eBPF-based continuous profiling agent, <strong>is now open source under the Apache 2 license!</strong></p>
<p>This move democratizes <strong>hyper-scaler efficiency for everyone</strong> , opening exciting new possibilities for the future of continuous profiling, as well as its role in observability and <strong>OpenTelemetry</strong>.</p>
<h2>Implementation of the OpenTelemetry (OTel) Profiling protocol</h2>
<p>Our commitment to open source goes beyond just the agent itself. We recently <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">announced our intent to donate</a> the agent to OpenTelemetry and have further solidified this goal by implementing the experimental <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">OTel Profiling data model</a>. This allows the open-sourced eBPF-based continuous profiling agent to communicate seamlessly with OpenTelemetry backends.</p>
<p>But that's not all! We've also launched an innovative feature that <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">correlates profiling data with OpenTelemetry distributed traces</a>. This powerful capability offers a deeper level of insight into application performance, enabling the identification of bottlenecks with greater precision. Upon donating the Profiling agent to OTel, Elastic will also contribute critical components that enable distributed trace correlation within the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java agent</a> to the upstream OTel Java SDK. This underscores Elastic Observability's commitment to both open source and the support of open standards like OpenTelemetry while pushing the boundaries of what is possible in observability.</p>
<h2>What does this mean for Elastic Universal Profiling customers?</h2>
<p>We'd like to express our <strong>immense gratitude to all our customers</strong> who have been part of this journey, from the early stages of private beta to GA. Your feedback has been invaluable in shaping Universal Profiling into the powerful product it is today.</p>
<p>By open-sourcing the Universal Profiling agent and contributing it to OpenTelemetry, we're fostering a win-win situation for both you and the broader community. This move opens doors for innovation and collaboration, ultimately leading to a more robust and versatile whole-system continuous profiling solution for everyone.</p>
<p>Furthermore, we're actively working on exciting novel ways to integrate Universal Profiling seamlessly within Elastic Observability. Expect further announcements soon, outlining how you can unlock even greater value from your profiling data within a unified observability experience in a way that has never been done before.</p>
<p>The open-sourced agent is using the recently released (experimental) OTel Profiling <a href="https://github.com/open-telemetry/opentelemetry-proto/pull/534">signal</a>. As a precaution, we recommend not using it in production environments.</p>
<p>Please continue using the official Elastic distribution of the Universal Profiling agent until the agent is formally accepted by OTel and the protocol reaches a stable phase. There's no need to take any action at this time, and we will ensure to have a smooth transition plan in place for you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image1.png" alt="1 - Elastic Universal Profiling" /></p>
<h2>What does this mean for the OpenTelemetry community?</h2>
<p>OpenTelemetry is adopting continuous profiling as a key signal. By open-sourcing the eBPF-based profiling agent and working towards donating it to OTel, Elastic is making it possible to accelerate the standardization of continuous profiling within OpenTelemetry. This move has a massive impact on the observability community, empowering everyone to continuously profile their systems with a standardized protocol.</p>
<p>This is particularly timely as <a href="https://www.bbc.co.uk/news/technology-32335003">Moore's Law</a> slows down and cloud computing takes hold, making computational efficiency critical for businesses.</p>
<p>Here's how whole-system continuous profiling benefits you:</p>
<ul>
<li>
<p><strong>Maximize gross margins:</strong> By reducing the computational resources needed to run applications, businesses can optimize their cloud spend and improve profitability. Whole-system continuous profiling is one way of identifying the most expensive applications (down to the lines of code) across diverse environments that may span multiple cloud providers. This principle aligns with the familiar adage, <em>&quot;a penny saved is a penny earned.&quot;</em> In the cloud context, every CPU cycle saved translates to money saved.</p>
</li>
<li>
<p><strong>Minimize environmental impact:</strong> Energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, contributing to a reduction in carbon footprint.</p>
</li>
<li>
<p><strong>Accelerate engineering workflows:</strong> Continuous profiling provides detailed insights to help debug complex issues faster, guide development, and improve overall code quality.</p>
</li>
</ul>
<p>This is where Elastic Universal Profiling comes in — designed to help organizations run efficient services by minimizing computational wastage. To this end, it measures code efficiency in three dimensions: <strong>CPU utilization</strong>, <strong>CO<sub>2</sub></strong>, and <strong>cloud cost</strong>.</p>
<p>Elastic's journey with continuous profiling began by joining forces with <a href="https://www.elastic.co/about/press/elastic-and-optimyze-join-forces-to-deliver-continuous-profiling-of-infrastructure-applications-and-services">optimyze.cloud</a> — this became the foundation for <a href="https://www.elastic.co/observability/universal-profiling">Elastic Universal Profiling</a>. We are excited to see this product evolve into its next growth phase in the open-source world.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image2.png" alt="2 - car manufacturers" /></p>
<h2>Ready to give it a spin?</h2>
<p>As Elastic Universal Profiling transitions into this new open source era, the potential for transformative impact on performance optimization, cost efficiency, and environmental sustainability is immense. Elastic's approach — balancing innovation with responsibility — paves the way for a future where technology not only powers our world but does so in a way that is sustainable and accessible to all.</p>
<p>Get started with the open source Elastic Universal Profiling agent today! <a href="https://github.com/elastic/otel-profiling-agent/">Download it directly from GitHub</a> and follow the instructions in the repository.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image3.png" alt="3 - dripping graph and data" /></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/tree_tunnel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Universal Profiling: Delivering performance improvements and reduced costs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-universal-profiling-performance-improvements-reduced-costs</link>
            <guid isPermaLink="false">elastic-universal-profiling-performance-improvements-reduced-costs</guid>
            <pubDate>Mon, 22 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we’ll cover how a discovery by one of our engineers led to cost savings of thousands of dollars in our QA environment and magnitudes more once we deployed this change to production.]]></description>
            <content:encoded><![CDATA[<p>In today's age of cloud services and SaaS platforms, continuous improvement isn't just a goal — it's a necessity. Here at Elastic, we're always on the lookout for ways to fine-tune our systems, be it our internal tools or the Elastic Cloud service. Our recent investigation in performance optimization within our Elastic Cloud QA environment, guided by <a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Elastic Universal Profiling</a>, is a great example of how we turn data into actionable insights.</p>
<p>In this blog, we’ll cover how a discovery by one of our engineers led to savings of thousands of dollars in our QA environment and magnitudes more once we deployed this change to production.</p>
<h2>Elastic Universal Profiling: Our go-to tool for optimization</h2>
<p>In our suite of solutions for addressing performance challenges, Elastic Universal Profiling is a critical component. As an “always-on” profiler utilizing eBPF, it integrates seamlessly into our infrastructure and systematically collects comprehensive profiling data across the entirety of our system. Because it requires zero code instrumentation or reconfiguration, it’s easy to deploy on any host (including Kubernetes hosts) in our cloud — we’ve deployed it across our environment for Elastic Cloud.</p>
<p>All of our hosts run the profiling agent to collect this data, which gives us detailed insight into the performance of any service that we’re running.</p>
<h3>Spotting the opportunity</h3>
<p>It all started with what seemed like a routine check of our QA environment. One of our engineers was looking through the profiling data. With Universal Profiling in play, this initial discovery was relatively quick. We found a function that was not optimized and had heavy compute costs.</p>
<p>Let’s go through it step-by-step.</p>
<p>In order to spot expensive functions, we can simply view the list of TopN functions, which shows which functions, across all the services we run, use the most CPU.</p>
<p>To sort them by their impact, we sort descending on the “total CPU”:</p>
<ul>
<li>
<p><strong>Self CPU</strong> measures the CPU time that a function directly uses, not including the time spent in functions it calls. This metric helps identify functions that use a lot of CPU power on their own. By improving these functions, we can make them run faster and use less CPU.</p>
</li>
<li>
<p><strong>Total CPU</strong> adds up the CPU time used by the function and any functions it calls. This gives a complete picture of how much CPU a function and its related operations use. If a function has a high &quot;total CPU&quot; usage, it might be because it's calling other functions that use a lot of CPU.</p>
</li>
</ul>
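<p>The self-versus-total split is the same distinction Python's <code>cProfile</code> draws between <code>tottime</code> and <code>cumtime</code>. A small sketch with hypothetical functions makes it concrete:</p>

```python
import cProfile
import pstats

def hot_loop():
    # Does the actual work itself: high "self CPU" (tottime).
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def orchestrator():
    # Does little work directly but calls hot_loop() twice:
    # low "self CPU", high "total CPU" (cumtime).
    return hot_loop() + hot_loop()

profiler = cProfile.Profile()
profiler.enable()
orchestrator()
profiler.disable()

# tottime corresponds to "self CPU", cumtime to "total CPU".
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

<p>Sorting by <code>cumulative</code> surfaces <code>orchestrator</code> near the top even though almost all of its time is really spent inside <code>hot_loop</code> — which is exactly why sorting TopN functions by total CPU points you at the right call chains.</p>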
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/1.png" alt="1 - universal profiling" /></p>
<p>When our engineer reviewed the TopN functions list, one function called &quot;... <strong>inflateCompressedFrame</strong> …&quot; caught their attention. This is a common scenario: certain types of functions frequently become optimization targets. Here’s a simplified guide on what to look for and possible improvements:</p>
<ul>
<li>
<p><strong>Compression/decompression:</strong> Is there a more efficient algorithm? For example, switching from zlib to zlib-ng might offer better performance.</p>
</li>
<li>
<p><strong>Cryptographic hashing algorithms:</strong> Ensure the fastest algorithm is in use. Sometimes, a quicker non-cryptographic algorithm could be suitable, depending on the security requirements.</p>
</li>
<li>
<p><strong>Non-cryptographic hashing algorithms:</strong> Check if you're using the quickest option. xxh3, for instance, is often faster than other hashing algorithms.</p>
</li>
<li>
<p><strong>Garbage collection:</strong> Minimize heap allocations, especially in frequently used paths. Opt for data structures that don't rely on garbage collection.</p>
</li>
<li>
<p><strong>Heap memory allocations:</strong> These are typically resource-intensive. Consider alternatives like using jemalloc or mimalloc instead of the standard libc malloc() to reduce their impact.</p>
</li>
<li>
<p><strong>Page faults:</strong> Keep an eye out for &quot;exc_page_fault&quot; in your TopN Functions or flamegraph. They indicate areas where memory access patterns could be optimized.</p>
</li>
<li>
<p><strong>Excessive CPU usage by kernel functions:</strong> This may indicate too many system calls. Using larger buffers for read/write operations can reduce the number of syscalls.</p>
</li>
<li>
<p><strong>Serialization/deserialization:</strong> Processes like JSON encoding or decoding can often be accelerated by switching to a faster JSON library.</p>
</li>
</ul>
<p>Identifying these areas can help in pinpointing where performance can be notably improved.</p>
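<p>To make the syscall point concrete, here is a small illustrative Python sketch (our own example, not part of the original investigation) that counts how many read() calls are needed to consume a 1 MB file with a small versus a large buffer:</p>

```python
import os
import tempfile

def count_reads(path, bufsize):
    """Count the number of read() syscalls needed to consume the file."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            calls += 1
            if not os.read(fd, bufsize):
                break  # an empty read means EOF
    finally:
        os.close(fd)
    return calls

# Write 1 MB of data to a temporary file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1_000_000)
    path = f.name

small_buf = count_reads(path, 4 * 1024)    # 4 KiB buffer
large_buf = count_reads(path, 256 * 1024)  # 256 KiB buffer
os.unlink(path)
print(small_buf, large_buf)
```

<p>On a regular file, the 256 KiB buffer needs roughly 50x fewer read() calls than the 4 KiB buffer, which is exactly the kind of syscall reduction the checklist refers to.</p>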
<p>Clicking on the function from the TopN view shows it in the flamegraph. Note that the flamegraph is showing the samples from the full cloud QA infrastructure. In this view, we can tell that this function alone was accounting for &gt;US$6,000 annualized in this part of our QA environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/2.png" alt="2 - universal profiling flamegraph" /></p>
<p>After filtering for the thread, it became clearer what the function was doing. The following image shows a flamegraph of this thread across all of the hosts running in the QA environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/3.png" alt="3 - flamegraph shows hosts running in QA environment " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/4.png" alt="4 - hosts running in QA environment" /></p>
<p>Instead of looking at the thread across all hosts, we can also look at a flamegraph for just one specific host.</p>
<p>If we look at this one host at a time, we can see that the impact is even more severe. Keep in mind that the 17% from before was for the full infrastructure. Some hosts may not even be running this service and therefore bring down the average.</p>
<p>Filtering things down to a single host that has the service running, we can tell that this host is actually spending close to 70% of its CPU cycles on running this function.</p>
<p>The dollar cost here just for this one host would put the function at around US$600 per year.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/5.png" alt="5 - filtering" /></p>
<h2>Understanding the performance problem</h2>
<p>After identifying a potentially resource-intensive function, our next step involved collaborating with our Engineering teams to understand the function and work on a potential fix. Here's a straightforward breakdown of our approach:</p>
<ul>
<li><strong>Understanding the function:</strong> We began by analyzing what the function should do. It utilizes gzip for decompression. This insight led us to briefly consider the strategies mentioned earlier for reducing CPU usage, such as switching to a more efficient compression library like zlib-ng or moving to zstd compression.</li>
<li><strong>Evaluating the current implementation:</strong> The function currently relies on JDK's gzip decompression, which is expected to use native libraries under the hood. Our usual preference is Java or Ruby libraries when available because they simplify deployment. Opting for a native library directly would require us to manage different native versions for each OS and CPU we support, complicating our deployment process.</li>
<li><strong>Detailed analysis using flamegraph:</strong> A closer examination of the flamegraph revealed that the system encounters page faults and spends significant CPU cycles handling these.</li>
</ul>
<p><strong>Let’s start with understanding the Flamegraph:</strong></p>
<p>The last few non-jdk.* JVM frames (in green) show the allocation of a direct memory ByteBuffer started by Netty's DirectArena.newUnpooledChunk. Direct memory allocations are costly operations that should typically be avoided on an application's critical path.</p>
<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Elastic AI Assistant for Observability</a> is also useful in understanding and optimizing parts of the flamegraph. Especially for users new to Universal Profiling, it can add lots of context to the collected data and give the user a better understanding of them and provide potential solutions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/6.png" alt="6 - Detailed analysis using flamegraph" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/7.png" alt="7 - understanding flamegraph" /></p>
<p><strong>Netty's memory allocation</strong></p>
<p>Netty, a popular asynchronous event-driven network application framework, uses the maxOrder setting to determine the size of memory chunks allocated for managing objects within its applications. The formula for calculating the chunk size is chunkSize = pageSize &lt;&lt; maxOrder. The default maxOrder value of either 9 or 11 results in the default memory chunk size being 4MB or 16MB, respectively, assuming a page size of 8KB.</p>
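<p>The chunk size formula is easy to sanity-check. The following Python sketch (illustrative only) reproduces the two default chunk sizes mentioned above:</p>

```python
def chunk_size(page_size: int, max_order: int) -> int:
    """Netty's pooled-arena chunk size: chunkSize = pageSize << maxOrder."""
    return page_size << max_order

PAGE = 8 * 1024  # 8 KB page size, as assumed in the text

print(chunk_size(PAGE, 9))   # 4194304 bytes  (4 MB)
print(chunk_size(PAGE, 11))  # 16777216 bytes (16 MB)
```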
<p><strong>Impact on memory allocation</strong></p>
<p>Netty employs a PooledAllocator for efficient memory management, which allocates memory chunks in a pool of direct memory at startup. This allocator optimizes memory usage by reusing memory chunks for objects smaller than the defined chunk size. Any object that exceeds this threshold must be allocated outside of the PooledAllocator.</p>
<p>Allocating and releasing memory outside of this pooled context incurs a higher performance cost for several reasons:</p>
<ul>
<li><strong>Increased allocation overhead:</strong> Objects larger than the chunk size require individual memory allocation requests. These allocations are more time-consuming and resource-intensive compared to the fast, pooled allocation mechanism for smaller objects.</li>
<li><strong>Fragmentation and garbage collection (GC) pressure:</strong> Allocating larger objects outside the pool can lead to increased memory fragmentation. Furthermore, if these objects are allocated on the heap, it can increase GC pressure, leading to potential pauses and reduced application performance.</li>
<li><strong>Netty and the Beats/Agent input:</strong> Logstash's Beats and Elastic Agent inputs use Netty to receive and send data. During processing of a received data batch, decompressing the data frame requires creating a buffer large enough to store the uncompressed events. If this batch is larger than the chunk size, an unpooled chunk is needed, causing a direct memory allocation that slows performance. The universal profiler allowed us to confirm that this was the case from the DirectArena.newUnpooledChunk calls in the flamegraph.</li>
</ul>
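<p>To make the threshold behavior concrete, here is a hypothetical Python sketch (a deliberate simplification of Netty's allocator, not its actual code) of the pooled-versus-unpooled decision for a decompressed batch buffer:</p>

```python
PAGE_SIZE = 8 * 1024  # 8 KB page size, as assumed in the text

def allocation_path(buffer_bytes: int, max_order: int) -> str:
    """Toy model: requests up to the chunk size are served from the pooled
    arena; anything larger triggers a one-off unpooled allocation."""
    chunk = PAGE_SIZE << max_order
    return "pooled" if buffer_bytes <= chunk else "unpooled"

# A hypothetical 10 MB decompressed batch buffer:
batch = 10 * 1024 * 1024
print(allocation_path(batch, 9))   # chunk is 4 MB  -> "unpooled"
print(allocation_path(batch, 11))  # chunk is 16 MB -> "pooled"
```

<p>Under a 4 MB chunk size, batches like this fall onto the slow unpooled path; at 16 MB they fit in the pool again.</p>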
<h2>Fixing the performance problem in our environments</h2>
<p>We decided to implement a quick workaround to test our hypothesis. Apart from having to adjust the jvm.options file once, this approach does not have any major downsides.</p>
<p>The immediate workaround involves manually adjusting the maxOrder setting back to its previous value. This can be achieved by adding a specific flag to the config/jvm.options file in Logstash:</p>
<pre><code>-Dio.netty.allocator.maxOrder=11
</code></pre>
<p>This adjustment will revert the default chunk size to 16MB (chunkSize = pageSize &lt;&lt; maxOrder, or 16MB = 8KB &lt;&lt; 11), which aligns with the previous behavior of Netty, thereby reducing the overhead associated with allocating and releasing larger objects outside of the PooledAllocator.</p>
<p>After rolling out this change to some of our hosts in the QA environment, the impact was immediately visible in the profiling data.</p>
<p><strong>Single host:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/8.png" alt="8 - single host" /></p>
<p><strong>Multiple hosts:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/10.png" alt="9 - multiple hosts" /></p>
<p>We can also use the differential flamegraph view to see the impact.</p>
<p>For this specific thread, we’re comparing one day of data from early January to one day of data from early February across a subset of hosts. Both the overall performance improvements as well as the CO<sub>2</sub> and cost savings are dramatic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/11.png" alt="10. -cost savings" /></p>
<p>This same comparison can also be done for a single host. In this view, we’re comparing one host in early January to that same host in early February. The actual CPU usage on that host decreased by 50%, saving us approximately US$900 per year per host.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/12.png" alt="11 - comparisons" /></p>
<h2>Fixing the issue in Logstash</h2>
<p>In addition to the temporary workaround, we are working on shipping a proper fix for this behavior in Logstash. You can find more details in this <a href="https://github.com/elastic/logstash/issues/15765">issue</a>, but the potential candidates are:</p>
<ul>
<li><strong>Global default adjustment:</strong> One approach is to permanently set the maxOrder back to 11 for all instances by including this change in the jvm.options file. This global change would ensure that all Logstash instances use the larger default chunk size, reducing the need for allocations outside the pooled allocator.</li>
<li><strong>Custom allocator configuration:</strong> For more targeted interventions, we could customize the allocator settings specifically within the TCP, Beats, and HTTP inputs of Logstash. This would involve configuring the maxOrder value at initialization for these inputs, providing a tailored solution that addresses the performance issues in the most affected areas of data ingestion.</li>
<li><strong>Optimizing major allocation sites:</strong> Another solution focuses on altering the behavior of significant allocation sites within Logstash. For instance, modifying the frame decompression process in the Beats input to avoid using direct memory and instead default to heap memory could significantly reduce the performance impact. This approach would circumvent the limitations imposed by the reduced default chunk size, minimizing the reliance on large direct memory allocations.</li>
</ul>
<h2>Cost savings and performance enhancements</h2>
<p>Following the new configuration change for Logstash instances on January 23, the platform's daily function cost dramatically decreased to US$350 from an initial &gt;US$6,000, marking a significant 20x reduction. This change shows the potential for substantial cost savings through technical optimizations. However, it's important to note that these figures represent potential savings rather than direct cost reductions.</p>
<p>Just because a host uses fewer CPU resources doesn’t necessarily mean that we are also saving money. To actually benefit from this, the final step is to either reduce the number of VMs we have running or scale down the CPU resources of each one to match the new resource requirements.</p>
<p>This experience with Elastic Universal Profiling highlights how crucial detailed, real-time data analysis is in identifying areas for optimization that lead to significant performance enhancements and cost savings. By implementing targeted changes based on profiling insights, we've dramatically reduced CPU usage and operational costs in our QA environment with promising implications for broader production deployment.</p>
<p>Our findings demonstrate the benefits of an always-on, profiling-driven approach in cloud environments, providing a good foundation for future optimizations. As we scale these improvements, the potential for further cost savings and efficiency gains continues to grow.</p>
<p>All of this is also possible in your environments. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/money.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's collaboration with OpenTelemetry on improving the filelog receiver]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver</link>
            <guid isPermaLink="false">elastics-collaboration-opentelemetry-filelog-receiver</guid>
            <pubDate>Mon, 17 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is committed to helping OpenTelemetry advance its logging capabilities. Learn about our collaboration with the OpenTelemetry community on improving the capabilities and quality aspects of the OpenTelemetry Collector's filelog receiver.]]></description>
            <content:encoded><![CDATA[<p>As the newest generally available signal in OpenTelemetry (OTel), logging currently lags behind tracing and metrics in terms of feature scope and maturity.
At Elastic, we bring years of extensive experience with logging use cases and the challenges they present,
and we are committed to advancing OpenTelemetry's logging capabilities.</p>
<p>Over the past few months, we have taken a close look at the capabilities of the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.102.0/receiver/filelogreceiver/README.md">filelog receiver</a>
in the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, leveraging our expertise as the maintainers of <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> to help refine and expand its potential.
Our goal is to contribute meaningfully to the evolution of OpenTelemetry's logging features, ensuring they meet the high standards required for robust observability.</p>
<p>Specifically, we focused on verifying that the receiver covers the cases and aspects that have been pain points for us in the past with Filebeat
— such as failover handling, self-telemetry, test coverage, documentation, and usability.
Based on our exploration, we started conversations with the OTel project's maintainers, sharing observations and suggestions drawn from our experience.
Moreover, we've started putting up PRs to add documentation, make enhancements, improve tests, fix bugs, and even implement completely new features.</p>
<p>In this blog post we'll provide a sneak preview of the work that we've done so far in collaboration with the OpenTelemetry community and what's coming next as we continue to explore ways to improve the OpenTelemetry Collector for log collection.</p>
<h2>Enhancing the filelog receiver's telemetry</h2>
<p>Observability tools are software components like any other and, thus, need to be monitored as any other software to be able to debug problems and tune relevant settings.
In particular, users of the filelog receiver will want to know how it's performing.
It's important that the filelog receiver emits sufficient telemetry data for common troubleshooting and optimization use cases.
This includes sufficient logging and observable metrics providing insights into the filelog receiver's internal state.</p>
<p>While the filelog receiver already provided a good set of self-telemetry data, we identified some areas of improvement.
In particular, we contributed functionality to emit self-telemetry <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33237">logs on crucial events</a> like when log files are discovered, moved or truncated.
Another contribution adds <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31544">observable metrics about the filelog receiver’s internal state</a>, such as how many files are open and being harvested.
You can find more information on the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31256">respective tracking issue</a>.</p>
<h2>Improving the Kubernetes container logs parsing</h2>
<p>The filelog receiver has been able to parse Kubernetes container logs for some time now.
However, properly parsing logs from Kubernetes Pods required a fair bit of configuration to deal with different runtime formats and to extract important meta information, such as <code>k8s.pod.name</code>, <code>k8s.container.name</code>, etc.
With this in mind, we proposed abstracting this complex set of configuration options into a simpler, runtime-aware container parser and contributed this new feature to the filelog receiver.
With that new feature, setting up log collection for Kubernetes is dramatically easier: roughly eight lines of configuration versus about 80 lines before.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/container-parser-config-example.png" alt="1 - Usability improvement for parsing Kubernetes container logs" /></p>
<p>You can learn more about the details of the new <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser">container logs parser in the corresponding OpenTelemetry blog post</a>.</p>
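<p>For reference, a minimal filelog receiver configuration using the new container parser looks roughly like the following. This is a sketch based on the linked blog post; adjust the log path to your environment:</p>

```yaml
receivers:
  filelog:
    include_file_path: true        # the parser derives k8s metadata from the file path
    include:
      - /var/log/pods/*/*/*.log
    operators:
      - type: container            # replaces ~80 lines of per-runtime parsing config
```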
<h3>Evaluating test coverage</h3>
<p>Logs collection from files can run into different unexpected scenarios such as restarts, overload and error scenarios.
To ensure reliable and consistent collection of logs, it's important to ensure tests cover these kind of scenarios.
Based on our experience with testing Filebeat, we evaluated the existing filelog receiver tests with respect to those scenarios.
While most of the use cases and scenarios were already well tested, we identified a few scenarios where coverage could be improved to ensure reliable log collection.<br />
At the time of writing, we were working on contributing additional tests to address the identified coverage gaps.
You can learn more about it in <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32001">this GitHub issue</a>.</p>
<h3>Persistence evaluation</h3>
<p>Another important aspect of log collection that we often hear about from Elastic's logging users is failover handling and the delivery guarantees for logs.
Some logging use cases, for example audit logging, have strict delivery guarantee requirements.
Hence, it's important that the filelog receiver provides functionality to reliably handle situations, such as temporary unavailability of the logging backend or unexpected restarts of the OTel Collector.</p>
<p>Overall, the filelog receiver already has corresponding functionality to deal with such situations.
However, user documentation on how to set up reliable log collection, with tangible examples, was an area with potential for improvement.</p>
<p>In this regard, beyond verifying the persistence and offset tracking capabilities we worked on improving respective documentation
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31886">1</a> <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30914">2</a>
and also are collaborating on a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31074">community reported issue</a> to ensure delivery guarantees for logs.</p>
<h3>Helping users help themselves</h3>
<p>Elastic has a long and varied history of supporting customers who use our products for log ingestion.
Drawing from this experience, we've proposed a couple of documentation improvements to the OpenTelemetry Collector to help logging users get out of some tricky situations.</p>
<p><strong>Documenting the structure of the tracking file</strong></p>
<p>For every log file the filelog receiver ingests, it needs to track how far into the file it has already read, so it knows where to start reading from when new contents are added to the file.
By default, the filelog receiver doesn't persist this tracking information to disk, but it can be configured to do so.
We felt it would be useful to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32180">document the structure of this tracking file</a>. When ingestion stops unexpectedly,
peeking into this tracking file can often provide clues as to where the problem may lie.</p>
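<p>For completeness, persisting this tracking state is usually done via a storage extension. The following is a minimal sketch using the file_storage extension (the directory path is a placeholder; this is a fragment, not a complete Collector config):</p>

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # must exist and be writable

receivers:
  filelog:
    include:
      - /var/log/app/*.log
    storage: file_storage  # persist read offsets so ingestion resumes after restarts

service:
  extensions: [file_storage]
```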
<p><strong>Challenges with symlink target changes</strong></p>
<p>The filelog receiver periodically refreshes its memory of the files it's supposed to be ingesting.
The interval at which these refreshes happen is controlled by the <code>poll_interval</code> setting.
In certain setups log files being ingested by the filelog receiver are symlinks pointing to actual files.
Moreover, these symlinks can be updated to point to newer files over time.
If the symlink target changes twice before the filelog receiver has had a chance to refresh its memory, it will miss the first change and therefore not ingest the corresponding target file.
We've <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32217">documented this edge case</a>, suggesting that users with such setups set <code>poll_interval</code> to a sufficiently low value.</p>
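<p>In practice this means keeping <code>poll_interval</code> below the fastest expected rotation of the symlink target. A hypothetical example (the path is a placeholder):</p>

```yaml
receivers:
  filelog:
    include:
      - /var/log/service/current.log  # a symlink rotated by the log writer
    poll_interval: 50ms  # poll more often than the symlink can flip twice
```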
<h3>Planning ahead for the receiver's GA </h3>
<p>Last but not least, we have raised the topic of making the filelog receiver a generally available (GA) component.
For users, it's important to be able to rely on the stability of the functionality they use, without having to worry about breaking changes in minor version updates.
In this regard, for the filelog receiver we have kicked off a first plan with the maintainers to mark any issue that is a blocker for stability with a <code>required_for_ga</code>
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aopen+is%3Aissue+label%3Arelease%3Arequired-for-ga+label%3Areceiver/filelog">label</a>.
Once the OpenTelemetry Collector reaches version <code>v1.0.0</code>, we will be able to work towards GA for this receiver as well.</p>
<h2>Conclusion</h2>
<p>Overall, OTel's filelog receiver component is in good shape and provides important functionality for most log collection use cases.
Where there are still minor gaps or room for improvement, we are glad to contribute our expertise and experience from Filebeat use cases.
The above is just the beginning of our effort to help the OpenTelemetry Collector, and its log collection in particular, get closer to a stable version.
Moreover, we are happy to help the filelog receiver maintainers with general maintenance of the component: dealing with community issues and PRs, jointly working on the component's roadmap, and more.</p>
<p>We'd like to thank the OTel Collector group and, in particular, <a href="https://github.com/djaglowski">Daniel Jaglowski</a> for the great and constructive collaboration on the filelog receiver, so far!</p>
<p>Stay tuned to <a href="https://www.elastic.co/observability/opentelemetry">learn more about our future contributions and involvement in OpenTelemetry</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/otel-filelog-receiver.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to use Elasticsearch and Time Series Data Streams for observability metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/time-series-data-streams-observability-metrics</link>
            <guid isPermaLink="false">time-series-data-streams-observability-metrics</guid>
            <pubDate>Thu, 04 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[With Time Series Data Streams (TSDS), Elasticsearch introduces optimized storage for metrics time series. Check out how we use it for Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>Elasticsearch is used for a wide variety of data types — one of these is metrics. With the introduction of Metricbeat many years ago and later our APM Agents, the metric use case has become more popular. Over the years, Elasticsearch has made many improvements on how to handle things like metrics aggregations and sparse documents. At the same time, <a href="https://www.elastic.co/guide/en/kibana/current/tsvb.html">TSVB visualizations</a> were introduced to make visualizing metrics easier. One concept that was missing that exists for most other metric solutions is the concept of time series with dimensions.</p>
<p>Mid 2021, the Elasticsearch team <a href="https://github.com/elastic/elasticsearch/issues/74660">embarked</a> on making Elasticsearch a much better fit for metrics. The team created <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">Time Series Data Streams (TSDS)</a>, which were released in 8.7 as generally available (GA).</p>
<p>This blog post dives into how TSDS works and how we use it in Elastic Observability, as well as how you can use it for your own metrics.</p>
<h2>A quick introduction to TSDS</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">Time Series Data Streams (TSDS)</a> are built on top of data streams in Elasticsearch that are optimized for time series. To create a data stream for metrics, an additional setting on the data stream is needed. As we are using data streams, first an Index Template has to be created:</p>
<pre><code class="language-json">PUT _index_template/metrics-laptop
{
  &quot;index_patterns&quot;: [
    &quot;metrics-laptop-*&quot;
  ],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 200,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;time_series&quot;
    },
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;host.name&quot;: {
          &quot;type&quot;: &quot;keyword&quot;,
          &quot;time_series_dimension&quot;: true
        },
        &quot;packages.sent&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;time_series_metric&quot;: &quot;counter&quot;
        },
        &quot;memory.usage&quot;: {
          &quot;type&quot;: &quot;double&quot;,
          &quot;time_series_metric&quot;: &quot;gauge&quot;
        }
      }
    }
  }
}
</code></pre>
<p>Let's have a closer look at this template. At the top, we set the index pattern to metrics-laptop-*. Any pattern can be selected, but it is recommended to use the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a> for all your metrics. The next section sets &quot;index.mode&quot;: &quot;time_series&quot;, in combination with &quot;data_stream&quot;: {} to make sure it is a data stream.</p>
<h3>Dimensions</h3>
<p>Each time series data stream needs at least one dimension. In the example above, host.name is set as a dimension field with &quot;time_series_dimension&quot;: true. You can have up to 16 dimensions by default. Not every dimension must show up in each document. The dimensions define the time series. The general rule is to pick fields as dimensions that uniquely identify your time series. Often this is a unique description of the host/container, but for some metrics like disk metrics, the disk id is needed in addition. If you are curious about default recommended dimensions, have a look at this <a href="https://github.com/elastic/ecs/pull/2172">ECS contribution</a> with dimension properties.</p>
<h2>Reduced storage and increased query speed</h2>
<p>At this point, you already have a functioning time series data stream. Setting the index mode to time series automatically turns on <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">synthetic source</a>. By default, Elasticsearch typically duplicates data three times:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS#Row-oriented_systems">row-oriented storage</a> (_source field)</li>
<li><a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS#Column-oriented_systems">column-oriented storage</a> (doc_values: true for aggregations)</li>
<li>indices (index: true for filtering and search)</li>
</ul>
<p>With synthetic source, the _source field is not persisted; instead, it is reconstructed from the doc values. Especially in the metrics use case, there is little benefit to keeping the source.</p>
<p>Not storing it means a significant reduction in storage. Time series data streams sort the data based on the dimensions and the time stamp. This means data that is usually queried together is stored together, which speeds up query times. It also means that the data points for a single time series are stored alongside each other on disk. This enables further compression of the data as the rate at which a counter increases is often relatively constant.</p>
<h2>Metric types</h2>
<p>But to benefit from all the advantages of TSDS, the field properties of the metrics fields must be extended with the <code>time_series_metric: {type}</code>. Several <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html#time-series-metric">types are supported</a> — as an example, gauge and counter were used above. Giving Elasticsearch knowledge about the metric type allows Elasticsearch to offer more optimized queries for the different types and reduce storage usage further.</p>
<p>When you create your own templates for data streams under the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>, it is important that you set &quot;priority&quot;: 200 or higher, as otherwise the built-in default template will apply.</p>
<h2>Ingest a document</h2>
<p>Ingesting a document into a TSDS isn't in any way different from ingesting documents into Elasticsearch. You can use the following commands in Dev Tools to add a document, and then search for it and also check out the mappings. Note: You have to adjust the @timestamp field to be close to your current date and time.</p>
<pre><code class="language-bash"># Add a document with `host.name` as the dimension
POST metrics-laptop-default/_doc
{
  # This timestamp needs to be adjusted to be current
  &quot;@timestamp&quot;: &quot;2023-03-30T12:26:23+00:00&quot;,
  &quot;host.name&quot;: &quot;ruflin.com&quot;,
  &quot;packages.sent&quot;: 1000,
  &quot;memory.usage&quot;: 0.8
}

# Search for the added doc, _source will show up but is reconstructed
GET metrics-laptop-default/_search

# Check out the mappings
GET metrics-laptop-default
</code></pre>
<p>If you do search, it still shows _source but this is reconstructed from the doc values. The additional field added above is @timestamp. This is important as it is a required field for any data stream.</p>
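<p>Once a few documents are in, the gauge can be aggregated as usual. Here is a hypothetical Dev Tools example using the fields from the template above:</p>

```bash
# Average memory usage per host (hypothetical example)
GET metrics-laptop-default/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": { "field": "host.name" },
      "aggs": {
        "avg_memory": { "avg": { "field": "memory.usage" } }
      }
    }
  }
}
```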
<h2>Why is this all important for Observability?</h2>
<p>One of the advantages of the Elastic Observability solution is that in a single storage engine, all signals are brought together in a single place. Users can query logs, metrics, and traces together without having to jump from one system to another. Because of this, having a great storage and query engine not only for logs but also metrics is key for us.</p>
<h2>Usage of TSDS in integrations</h2>
<p>With <a href="https://www.elastic.co/integrations/data-integrations">integrations</a>, we give our users an out of the box experience to integrate with their infrastructure and services. If you are using our integrations, eventually you will automatically get all the benefits of TSDS for your metrics assuming you are on version 8.7 or newer.</p>
<p>Currently we are working through the list of our integration packages, adding the dimensions and metric type fields and then turning on TSDS for the metrics data streams. As soon as a package has all properties enabled, the only thing you have to do is upgrade the integration; everything else happens automatically in the background.</p>
<p>To visualize your time series in Kibana, use <a href="https://www.elastic.co/guide/en/kibana/current/lens.html">Lens</a>, which has native support built in for TSDS.</p>
<h2>Learn more</h2>
<p>If you switch over to TSDS, you will automatically benefit from all the future improvements Elasticsearch is making for metrics time series, be it more efficient storage, query performance, or new aggregation capabilities. If you want to learn more about how TSDS works under the hood and all available config options, check out the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">TSDS documentation</a>. What Elasticsearch supports in 8.7 is only the first iteration of the metrics time series in Elasticsearch.</p>
<p><a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">TSDS has been available since 8.7</a> and will be enabled automatically in more and more of our integrations as they are upgraded. All you will notice is lower storage usage and faster queries. Enjoy!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/time-series-data-streams-observability-metrics/ebpf-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability for Google Cloud’s Vertex AI platform - understand performance, cost and reliability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration</link>
            <guid isPermaLink="false">elevate-llm-observability-with-gcp-vertex-ai-integration</guid>
            <pubDate>Wed, 09 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Enhance LLM observability with Elastic's GCP Vertex AI Integration — gain actionable insights into model performance, resource efficiency, and operational reliability.]]></description>
            <content:encoded><![CDATA[<p>As organizations increasingly adopt large language models (LLMs) for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like <a href="https://cloud.google.com/vertex-ai">Google Cloud’s Vertex AI</a>.</p>
<h3>New Elastic Observability LLM integration with Google Cloud’s Vertex AI platform</h3>
<p>We are thrilled to announce general availability of monitoring LLMs hosted in Google Cloud through the <a href="https://www.elastic.co/docs/current/integrations/gcp_vertexai">Elastic integration with Vertex AI</a>. This integration enables users to experience enhanced LLM Observability by providing deep insights into the usage, cost, and operational performance of models on Vertex AI, including latency, errors, token usage, frequency of model invocations, as well as resources utilized by models. By leveraging this data, organizations can optimize resource usage, identify and resolve performance bottlenecks, and enhance model efficiency and accuracy.</p>
<h3>Observability needs for AI-powered applications using the Vertex AI platform</h3>
<p>Leveraging AI models creates unique needs around the observability and monitoring of AI-powered applications. Some of the challenges that come with using LLMs are the high cost of calling LLMs, the quality and safety of LLM responses, and the performance, reliability, and availability of the LLMs.</p>
<p>Lack of visibility into LLM observability data can make it harder for SREs and DevOps teams to ensure their AI-powered applications meet their service level objectives for reliability, performance, cost and quality of the AI-generated content and have enough telemetry data to troubleshoot related issues. Thus, robust LLM observability and detection of anomalies in the performance of models hosted on Google Cloud’s Vertex AI platform in real time is critical for the success of AI-powered applications.</p>
<p>Depending on the needs of their LLM applications, customers can make use of a growing list of models hosted on the Vertex AI platform such as Gemini 2.0 Pro, Gemini 2.0 Flash, and Imagen for image generation. Each model excels in specific areas and generates content in modalities including language, audio, vision, and code. No two models are the same; each model has specific performance characteristics. So, it is important that service operators are able to track the individual performance, behavior, and cost of each model.</p>
<h3>Unlocking Insights with Vertex AI Metrics</h3>
<p>The Elastic integration with Google Cloud’s Vertex AI platform collects a wide range of metrics from models hosted on Vertex AI, enabling users to monitor, analyze, and optimize their AI deployments effectively.</p>
<p>Once you use the integration, you can review all the metrics in the Vertex AI dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Overview.png" alt="Overview Dashboard" /></p>
<p>These metrics can be categorized into the following groups:</p>
<h4>1. Prediction Metrics</h4>
<p>Prediction metrics provide critical insights into model usage, performance bottlenecks, and reliability. These metrics help ensure smooth operations, optimize response times, and maintain robust, accurate predictions.</p>
<ul>
<li>
<p><strong>Prediction Count by Endpoint</strong>: Measures the total number of predictions across different endpoints.</p>
</li>
<li>
<p><strong>Prediction Latency</strong>: Provides insights into the time taken to generate predictions, allowing users to identify bottlenecks in performance.</p>
</li>
<li>
<p><strong>Prediction Errors</strong>: Monitors the count of failed predictions across endpoints.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Prediction.png" alt="Prediction Metrics" /></p>
<h4>2. Model Performance Metrics</h4>
<p>Model performance metrics provide crucial insights into deployment efficiency and responsiveness. These metrics help optimize model performance and ensure reliable operations.</p>
<ul>
<li>
<p><strong>Model Usage</strong>: Tracks the usage distribution among different model deployments.</p>
</li>
<li>
<p><strong>Token Usage</strong>: Tracks the number of tokens consumed by each model deployment, which is critical for understanding model efficiency.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/token_model_usage.png" alt="Token model usage" /></p>
<ul>
<li>
<p><strong>Invocation Rates</strong>: Tracks the frequency of invocations made by each model deployment.</p>
</li>
<li>
<p><strong>Model Invocation Latency</strong>: Measures the time taken to invoke a model, helping in diagnosing performance issues.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Invocation_Vertex.png" alt="Model Invocation Metrics" /></p>
<h4>3. Resource Utilization Metrics</h4>
<p>Resource utilization metrics are vital for monitoring resource efficiency and workload performance. They help optimize infrastructure, prevent bottlenecks, and ensure smooth operation of AI deployments.</p>
<ul>
<li>
<p><strong>CPU Utilization</strong>: Monitors CPU usage to ensure optimal resource allocation for AI workloads.</p>
</li>
<li>
<p><strong>Memory Usage</strong>: Tracks the memory consumed across all model deployments.</p>
</li>
<li>
<p><strong>Network Usage</strong>: Measures bytes sent and received, providing insights into data transfer during model interactions.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Resource_Utilization.png" alt="Resource Utilization Metrics" /></p>
<h4>4. Overview Metrics</h4>
<p>These metrics give an overview of the models deployed in Google Cloud’s Vertex AI platform. They are essential for tracking overall performance, optimizing efficiency, and identifying potential issues across deployments.</p>
<ul>
<li>
<p><strong>Total Invocations</strong>: The overall count of prediction invocations across all models and endpoints, providing a comprehensive view of activity.</p>
</li>
<li>
<p><strong>Total Tokens</strong>: The total number of tokens processed across all model interactions, offering insights into resource utilization and efficiency.</p>
</li>
<li>
<p><strong>Total Errors</strong>: The total count of errors encountered across all models and endpoints, helping identify reliability issues.</p>
</li>
</ul>
<p>All metrics can be filtered by <strong>region</strong>, offering localized insights for better analysis.</p>
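<p>As an illustration, such a region filter can be combined with an aggregation in a simple Dev Tools query. The data stream pattern and metric field name below are assumptions for illustration only; check the integration documentation for the exact names:</p>
<pre><code class="language-bash"># Hypothetical example: sum prediction counts for a single region
# (the data stream pattern and metric field name are placeholders)
GET metrics-gcp.vertexai*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: {
    &quot;term&quot;: { &quot;cloud.region&quot;: &quot;us-central1&quot; }
  },
  &quot;aggs&quot;: {
    &quot;total_predictions&quot;: {
      &quot;sum&quot;: { &quot;field&quot;: &quot;gcp.vertexai.prediction.count&quot; }
    }
  }
}
</code></pre>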
<p>Note: The Elastic integration with Vertex AI provides comprehensive visibility into both deployment models: provisioned throughput, where capacity is pre-allocated, and pay-as-you-go, where resources are consumed on demand.</p>
<h3>Conclusion</h3>
<p>This <a href="https://www.elastic.co/docs/current/integrations/gcp_vertexai">integration with Vertex AI</a> represents a significant step forward in enhancing the LLM Observability for users of Google Cloud’s Vertex AI platform. By unlocking a wealth of actionable data, organizations can assess the health, performance and cost of LLMs and troubleshoot operational issues, ensuring scalability, and accuracy in AI-driven applications.</p>
<p>Now that you know how the Vertex AI integration enhances LLM Observability, it’s your turn to try it out. Spin up an Elastic Cloud deployment and start monitoring your LLM applications hosted on Google Cloud’s Vertex AI platform.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/vertexai-title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[2025 observability trends: Maturing beyond the hype]]></title>
            <link>https://www.elastic.co/observability-labs/blog/emerging-trends-in-observability-2025</link>
            <guid isPermaLink="false">emerging-trends-in-observability-2025</guid>
            <pubDate>Thu, 27 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover what 500+ decision-makers revealed about OpenTelemetry adoption, GenAI integration, and LLM monitoring—insights that separate innovators from followers in Elastic's 2025 observability survey.]]></description>
            <content:encoded><![CDATA[<h1>2025 observability trends: Maturing beyond the hype</h1>
<p>Our latest survey of over 500 observability decision-makers reveals how dramatically the landscape has evolved as we move through 2025. What strikes me most is how observability has moved beyond its technical roots to become a true business imperative. Let’s dive into what we're seeing in the industry.</p>
<h2>The investment paradox of observability in 2025</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image5.png" alt="" /></p>
<p>Here's something fascinating: 96% of executives in our survey expect observability to remain a key investment area. Yet almost all of them (97%) are hitting roadblocks in realizing full value. And surprisingly, the primary hurdles are not technical or complicated in nature. Can you guess what they might be?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image10.png" alt="" /></p>
<p>In 2025, IT leaders are challenged with financial hurdles to their observability efforts. I'm seeing this tension play out constantly in conversations with leaders - they know they need to invest, but they're grappling with budget constraints, licensing costs, and proving ROI for their organizations. This creates an interesting dynamic where organizations must carefully balance increasing investment with rigorous cost optimization and business metrics.</p>
<p>What's particularly interesting is how this paradox is forcing organizations to become more strategic about their investments. Leaders are no longer just throwing money at the problem - they're thinking carefully about how to maximize value from every dollar spent.</p>
<h2>Why observability maturity is making all the difference</h2>
<p>The data really jumps out at me here. The gap between observability experts and newcomers tells a compelling story that I wasn't expecting to see. Expert organizations are significantly outperforming their peers across every key metric:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image9.png" alt="" /></p>
<ul>
<li>91% of expert organizations are deploying applications and infrastructure faster (compared to just 34% of those in early stages)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image11.png" alt="" /></p>
<ul>
<li>82% are successfully reducing operational costs (versus 56% of early-stage organizations)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image4.png" alt="" /></p>
<ul>
<li>71% achieve better MTTR for incidents (while only 40% of early-stage organizations do)</li>
</ul>
<p>What I find particularly fascinating is how some benefits go beyond just maturity levels. About 80% of organizations report better customer issue response times regardless of their maturity stage. It tells me that even basic observability delivers immediate customer-facing value. This is crucial information for organizations just starting their observability journey - they can expect to see tangible benefits right from the start. But the overarching story may be that observability maturity leads teams from reactive to proactive and allows them to focus on higher level, value-add activities.</p>
<h2>Cost management: the new imperative</h2>
<p>The numbers around cost management paint a clear picture of where the industry is heading - 97% of IT decision-makers are actively managing observability costs, and 86% feel personally responsible for business outcomes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image2.png" alt="" /></p>
<p>I'm seeing a clear trend where leaders are taking concrete steps in their day to day work:</p>
<ul>
<li>Consolidating their observability toolset while maintaining capabilities - they don’t want to lose anything</li>
<li>Implementing usage-based pricing models</li>
<li>Establishing clear ROI metrics</li>
<li>Creating cross-functional teams to optimize spending</li>
</ul>
<p>This isn't just about cutting costs - it's about being smarter with resources. Organizations are learning that more tools don't necessarily mean better observability.</p>
<h2>Two technologies reshaping the observability landscape</h2>
<h3>AI's growing impact</h3>
<p>The enthusiasm for AI is remarkable - 94% of respondents see its tremendous potential. What fascinates me is how concerns about Generative AI reliability have actually decreased from 64% to 55% over the past year.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image7.png" alt="" /></p>
<p>Leaders are particularly excited about:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image1.png" alt="" /></p>
<ul>
<li>Automated correlation of logs, metrics, and traces (72% of respondents)</li>
<li>Predictive analytics for preventing outages</li>
<li>Natural language interfaces for querying observability data</li>
<li>Automated root cause analysis</li>
</ul>
<p>The key shift I'm seeing for the upcoming year is the move from AI as a buzzword to AI as a practical tool delivering real value in observability workflows.</p>
<p>Generative AI capabilities paired with retrieval augmented generation (RAG) capabilities allow organizations to leverage the power of LLMs and private data (e.g., runbooks, alerts, business data) to deliver relevant and meaningful results and identify and solve problems faster while reducing noise.</p>
<h3>OpenTelemetry's continued momentum</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image3.png" alt="" /></p>
<p>Looking at expert organizations, 80% are either experimenting with or have deployed OpenTelemetry. This isn't just about technology adoption - it's about building for the future with open standards. The correlation between OpenTelemetry adoption and overall observability maturity is unmistakable.</p>
<p>What's particularly interesting is how OpenTelemetry is changing the vendor landscape. Organizations are increasingly demanding OpenTelemetry support from their vendors, seeing it as a way to future-proof their observability investments and avoid vendor lock-in. Thinking back to how Linux shifted the server landscape, can we expect to see the same in the observability domain?</p>
<h2>Business integration and insights deepen</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image8.png" alt="" /></p>
<p>Here's what I find most compelling: 64% of expert organizations are frequently correlating operational data with business outcomes, while only 9% of early-stage organizations do the same. This represents a fundamental shift from technical monitoring to business observability.</p>
<p>This isn't just about uptime anymore - organizations are increasingly using observability data to:</p>
<ul>
<li>Make informed business decisions</li>
<li>Improve customer experience</li>
<li>Optimize resource allocation</li>
<li>Drive innovation</li>
</ul>
<h2>Looking ahead</h2>
<p>As we continue through 2025, I'm seeing observability mature beyond its initial promise. Organizations are focusing less on basic implementation and more on delivering real business value through:</p>
<ul>
<li>Deeper business integration, like mapping system performance directly to revenue metrics</li>
<li>Optimized cost management through new data lake technology, efficient storage and intelligent retention</li>
<li>AI-enhanced capabilities powered by LLMs and Agentic AI</li>
<li>Standardized instrumentation through OpenTelemetry, reducing vendor lock-in</li>
</ul>
<p>The path to success in 2025 isn't just about having the right tools - it's about building mature practices that deliver measurable business value while managing costs effectively. The organizations that can balance these competing demands while maintaining focus on business outcomes are the ones pulling ahead.</p>
<p>What are you seeing in your organization's observability journey? Are these trends aligning with your experience?</p>
<p>If you would like to dig in deeper on emerging observability trends, download <a href="https://www.elastic.co/resources/observability/report/landscape-observability-report">our full report</a> or watch the on-demand webinar, <a href="https://www.elastic.co/virtual-events/observability-trends-2025">2025 Observability trends: Maturing beyond the hype and delivering results</a>!</p>
<p>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/trends.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to enable Kubernetes alerting with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/enable-kubernetes-alerting-observability</link>
            <guid isPermaLink="false">enable-kubernetes-alerting-observability</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In the Kubernetes world, different personas demand different kinds of insights. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.]]></description>
            <content:encoded><![CDATA[<p>In the Kubernetes world, different personas demand different kinds of insights. Developers are interested in granular metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once to quickly get notified when a problem occurs and spot where the root cause is. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.</p>
<h2>Why do we need alerts?</h2>
<p>Logs, metrics, and traces are just the foundation for building a complete <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">monitoring solution for Kubernetes clusters</a>. Their main goal is to provide debugging information and historical evidence for the infrastructure.</p>
<p>While out-of-the-box dashboards, infrastructure topology, and logs exploration through Kibana are already quite handy to perform ad-hoc analyses, adding notifications and active monitoring of infrastructure allows users to deal with problems detected as early as possible and even proactively take actions to prevent their Kubernetes environments from facing even more serious issues.</p>
<h3>How can this be achieved?</h3>
<p>By building alerts on top of their infrastructure, users can leverage the data and effectively correlate it to a specific notification, creating a wide range of possibilities to dynamically monitor and observe their Kubernetes cluster.</p>
<p>In this blog post, we will explore how users can leverage Elasticsearch’s search powers to define alerting rules in order to be notified when a specific condition occurs.</p>
<h2>SLIs, alerts, and SLOs: Why are they important for SREs?</h2>
<p>For site reliability engineers (SREs), the <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">incident response time</a> is tightly coupled with the success of everyday work. Monitoring, alerting, and actions will help to discover, resolve, or prevent issues in their systems.</p>
<blockquote>
<ul>
<li><em>An SLA (Service Level Agreement) is an agreement you create with your users to specify the level of service they can expect.</em></li>
<li><em>An SLO (Service Level Objective) is an agreement within an SLA about a specific metric like uptime or response time.</em></li>
<li><em>An SLI (Service Level Indicator) measures compliance with an SLO.</em></li>
</ul>
</blockquote>
<p>SREs’ day-to-day tasks and projects are driven by SLOs. By ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term, we lay the basis of a stable working infrastructure.</p>
<p>Having said this, identifying the high-level categories of SLOs is crucial in order to organize the work of an SRE. Then in each category of SLOs, SREs will need the corresponding SLIs that can cover the most important cases of their system under observation. Therefore, the decision of which SLIs we will need demands additional knowledge of the underlying system infrastructure.</p>
<p>One widely used approach to categorize SLIs and SLOs is the <a href="https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals">Four Golden Signals</a> method. The categories defined are Latency, Traffic, Errors, and Saturation.</p>
<p>A more specific approach is <a href="https://thenewstack.io/monitoring-microservices-red-method/">the RED method</a> developed by Tom Wilkie, who was an SRE at Google and used the Four Golden Signals. The RED method drops the saturation category because this one is mainly used for more advanced cases — and people remember better things that come in threes.</p>
<p>Focusing on Kubernetes infrastructure operators, we will consider the following groups of infrastructure SLIs/SLOs:</p>
<ul>
<li>Group 1: Latency of the control plane components (e.g., the apiserver)</li>
<li>Group 2: Resource utilization of the nodes/pods (how much cpu, memory, etc. is consumed)</li>
<li>Group 3: Errors (errors on logs or events or error count from components, network, etc.)</li>
</ul>
<h2>Creating alerts for a Kubernetes cluster</h2>
<p>Now that we have a complete outline of our goal to define alerts based on SLIs/SLOs, we will dive into defining the proper alerting. Alerts can be built using <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Kibana</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-create-rule.png" alt="kubernetes create rule" /></p>
<p>See Elastic <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">documentation</a>.</p>
<p>In this blog, we will define more complex alerts based on complex Elasticsearch queries provided by <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-getting-started.html">Watcher</a>’s functionality. <a href="https://www.elastic.co/guide/en/kibana/8.8/watcher-ui.html">Read more about Watcher</a> and how to properly use it in addition to the examples in this blog.</p>
<h3>Latency alerts</h3>
<p>For this kind of alert, we want to define the basic SLOs for a Kubernetes control plane, which will ensure that the basic control plane components can service the end users without an issue. For instance, facing high latencies in queries against the Kubernetes API Server is enough of a signal that action needs to be taken.</p>
<h3>Resource saturation</h3>
<p>The next group of alerting will be resource utilization. A node’s CPU utilization or changes in a node’s condition are critical signals for a cluster to ensure the smooth servicing of the workloads provisioned to run the applications that end users interact with.</p>
<h3>Error detection</h3>
<p>Last but not least, we will define alerts based on specific errors like the network error rate or Pod failures such as the OOMKilled situation. These are very useful indicators for SRE teams to either detect issues on the infrastructure level or notify developer teams about problematic workloads. One example that we will examine later is an application running as a Pod that constantly gets restarted because it hits its memory limit. In that case, the owners of this application need to get notified so they can act properly.</p>
<h2>From Kubernetes data to Elasticsearch queries</h2>
<p>With a solid plan for the alerts we want to implement, it's time to explore the data we have collected from the Kubernetes cluster and stored in Elasticsearch. For this we will consult the list of the available data fields that are ingested using the Elastic Agent Kubernetes <a href="https://docs.elastic.co/en/integrations/kubernetes">integration</a> (the full list of fields can be found <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html">here</a>). Using these fields we can create various alerts, such as:</p>
<ul>
<li>Node CPU utilization</li>
<li>Node Memory utilization</li>
<li>Network bandwidth utilization</li>
<li>Pod restarts</li>
<li>Pod CPU/memory utilization</li>
</ul>
<h3>CPU utilization alert</h3>
<p>Our first example will use the CPU utilization fields to calculate the Node’s CPU utilization and create an alert. For this alert, we leverage the metrics:</p>
<pre><code class="language-yaml">kubernetes.node.cpu.usage.nanocores
kubernetes.node.cpu.capacity.cores
</code></pre>
<p>The calculation (nodeUsage / 1000000000) / nodeCap, grouped by node name, gives us the CPU utilization of our cluster’s nodes as a fraction between 0 and 1.</p>
<p>The Watcher definition that implements this query can be created with the following API call to Elasticsearch:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;10m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-10m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.node OR data_stream.dataset: kubernetes.state_node&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;nodes&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.node.name&quot;,
                &quot;size&quot;: &quot;10000&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              },
              &quot;aggs&quot;: {
                &quot;nodeUsage&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.usage.nanocores&quot;
                  }
                },
                &quot;nodeCap&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.capacity.cores&quot;
                  }
                },
                &quot;nodeCPUUsagePCT&quot;: {
                  &quot;bucket_script&quot;: {
                    &quot;buckets_path&quot;: {
                      &quot;nodeUsage&quot;: &quot;nodeUsage&quot;,
                      &quot;nodeCap&quot;: &quot;nodeCap&quot;
                    },
                    &quot;script&quot;: {
                      &quot;source&quot;: &quot;( params.nodeUsage / 1000000000 ) / params.nodeCap * 100&quot;,
                      &quot;lang&quot;: &quot;painless&quot;,
                      &quot;params&quot;: {
                        &quot;_interval&quot;: 10000
                      }
                    },
                    &quot;gap_policy&quot;: &quot;skip&quot;
                  }
                }
              }
            }
          }
        },
        &quot;indices&quot;: [
          &quot;metrics-kubernetes*&quot;
        ]
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.nodes.buckets&quot;: {
        &quot;path&quot;: &quot;nodeCPUUsagePCT.value&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 80
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;log_hits&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.nodes.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;logging&quot;: {
        &quot;text&quot;: &quot;Kubernetes node found with high CPU usage: {{ctx.payload.key}} -&gt; {{ctx.payload.nodeCPUUsagePCT.value}}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Node CPU Usage&quot;
  }
}
</code></pre>
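<p>The <code>bucket_script</code> aggregation above converts the node's usage from nanocores to cores and expresses it as a percentage of the node's capacity. As a quick sanity check, the same arithmetic can be mirrored in plain Python (an illustrative sketch; the function name and sample values are ours, not part of the Watcher):</p>

```python
def node_cpu_usage_pct(usage_nanocores: float, capacity_cores: float) -> float:
    # 1 core = 1_000_000_000 nanocores; express usage as a percentage of capacity.
    return (usage_nanocores / 1_000_000_000) / capacity_cores * 100

# A node using 3.2 cores out of a 4-core capacity is at 80% CPU,
# which is exactly the watcher's alerting threshold.
print(node_cpu_usage_pct(3_200_000_000, 4))  # -> 80.0
```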
<h3>OOMKilled Pods detection and alerting</h3>
<p>Another Watcher we will explore detects Pods that have been restarted due to an OOMKilled error. This error is quite common in Kubernetes workloads, and detecting it early is useful for informing the team that owns the workload, so they can either investigate issues that could cause memory leaks or consider increasing the resources requested by the workload.</p>
<p>This information can be retrieved from a query like the following:</p>
<pre><code class="language-yaml">kubernetes.container.status.last_terminated_reason: OOMKilled
</code></pre>
<p>Here is how we can create the respective Watcher with an API call:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;1m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;search_type&quot;: &quot;query_then_fetch&quot;,
        &quot;indices&quot;: [
          &quot;*&quot;
        ],
        &quot;rest_total_hits_as_int&quot;: true,
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-1m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.state_container&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      },
                      {
                        &quot;exists&quot;: {
                          &quot;field&quot;: &quot;kubernetes.container.status.last_terminated_reason&quot;
                        }
                      },
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;kubernetes.container.status.last_terminated_reason: OOMKilled&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;pods&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.pod.name&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              }
            }
          }
        }
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.pods.buckets&quot;: {
        &quot;path&quot;: &quot;doc_count&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 1,
          &quot;quantifier&quot;: &quot;some&quot;
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/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&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Pod Terminated OOMKilled&quot;
  }
}
</code></pre>
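<p>The <code>array_compare</code> condition with the <code>some</code> quantifier fires when at least one aggregation bucket satisfies the comparison. In plain Python, the check is roughly equivalent to the following (a sketch; <code>condition_met</code> and the sample buckets are illustrative, not part of the Watcher API):</p>

```python
def condition_met(buckets, path="doc_count", threshold=1):
    # Watcher's array_compare with quantifier "some": true if any
    # element's value at `path` is >= the threshold.
    return any(bucket.get(path, 0) >= threshold for bucket in buckets)

buckets = [
    {"key": "frontend-5d4f8", "doc_count": 0},
    {"key": "payments-7b9cd", "doc_count": 2},  # OOMKilled twice in the window
]
print(condition_met(buckets))  # -> True
```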
<h3>From Kubernetes data to alerts summary</h3>
<p>So far, we have seen how to start from plain Kubernetes fields, use them in Elasticsearch queries, and build Watchers and alerts on top of them.</p>
<p>You can explore further data combinations and build queries and alerts following the examples provided here. A <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">full list of alerts</a> is available, as well as a <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">basic scripted way of installing them</a>.</p>
<p>These examples can be defined with simple actions that only log messages into the Elasticsearch logs. However, you can also use more advanced and useful outputs, such as Slack webhooks:</p>
<pre><code class="language-json">&quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  }
</code></pre>
<p>The result would be a Slack message like the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-k8s-cluster-alerting.png" alt="" /></p>
<h2>Next steps</h2>
<p>In our next steps, we would like to make these alerts part of our Kubernetes integration, which would mean that the predefined alerts are installed when users install or enable the Kubernetes integration. At the same time, we plan to implement some of these as Kibana’s native SLIs, giving our users the option to quickly define SLOs on top of the SLIs through a friendly user interface. If you’re interested in learning more, follow the public GitHub issues and feel free to provide your feedback:</p>
<ul>
<li><a href="https://github.com/elastic/package-spec/issues/484">https://github.com/elastic/package-spec/issues/484</a></li>
<li><a href="https://github.com/elastic/kibana/issues/150050">https://github.com/elastic/kibana/issues/150050</a></li>
</ul>
<p>For those who are eager to start using Kubernetes alerting today, here is what you need to do:</p>
<ol>
<li>Make sure that you have an Elastic cluster up and running. The fastest way to deploy your cluster is to spin up a <a href="https://www.elastic.co/elasticsearch/service">free trial of Elasticsearch Service</a>.</li>
<li>Install the latest Elastic Agent on your Kubernetes cluster following the respective <a href="https://www.elastic.co/guide/en/fleet/master/running-on-kubernetes-managed-by-fleet.html">documentation</a>.</li>
<li>Install our provided alerts that can be found at <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs</a> or at <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting</a>.</li>
</ol>
<p>Of course, if you have any questions, remember that we are always happy to help on the Discuss <a href="https://discuss.elastic.co/">forums</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/alert-management.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Bridging the Gap: End-to-End Observability from Cloud Native to Mainframe]]></title>
            <link>https://www.elastic.co/observability-labs/blog/end-to-end-o11y-from-cloud-native-to-mainframe</link>
            <guid isPermaLink="false">end-to-end-o11y-from-cloud-native-to-mainframe</guid>
            <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Achieving end-to-end observability in hybrid enterprise environments, where modern cloud-native applications interact with critical yet often opaque IBM mainframe systems, is a challenge. Pairing IBM Z Observability Connect, which enables OTel output, with Elastic Observability is a solution, transforming your mainframe black box into a fully observable component of your deployment.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>OpenTelemetry is emerging as the standard for modern observability. As a highly active project within the Cloud Native Computing Foundation (CNCF)—second only to Kubernetes—it has become the monitoring solution of choice for cloud-native applications. OpenTelemetry provides a unified method for collecting traces, metrics, and logs across Kubernetes, microservices, and infrastructure.</p>
<p>However, for many enterprises—especially in banking, insurance, healthcare, and government—the reality is more complex than just “cloud native.” Although most organizations have deployed mobile apps and adopted microservices architectures, much of their critical core processing still relies on IBM mainframe applications. These systems process credit card swipes, financial transactions, patient records, and premium calculations.</p>
<p>This creates a dilemma: while the modern distributed systems of the hybrid environment are well-observed, the critical backend remains a black box.</p>
<h2>The “Broken Trace”</h2>
<p>A common challenge we see with customers involves a request that originates from a modern mobile application. The request hits microservices running on Kubernetes, initiates a service call to the mainframe, and suddenly, visibility stops.</p>
<p>When latency spikes or a transaction fails, Site Reliability Engineers (SREs) are left guessing. Is it the network? The API gateway? Or underlying mainframe applications like CICS? Without a unified, end-to-end view of the services involved—from the frontend Node.js microservices to the backend CICS service—mean time to resolution (MTTR) becomes “mean time to innocence,” with teams simply proving it wasn't their microservice rather than fixing root causes.</p>
<p>We need a unified view where a trace flows seamlessly from a cloud-native frontend (like React) all the way into mainframe transactions.</p>
<h2>IBM Z Observability Connect</h2>
<p>With the recent release of <a href="https://www.ibm.com/docs/en/zapmc/7.1.0?topic=z-observability-connect-overview">Z Observability Connect</a>, IBM has introduced OpenTelemetry-native instrumentation into mainframe applications. This creates a bridge between modern cloud-native services and mainframe transactions.</p>
<p>This means the mainframe is no longer a special case; it acts just like any other microservice in a mesh. It functions as an OpenTelemetry data producer, emitting traces, metrics, and logs to OpenTelemetry-compliant backends like Elastic.</p>
<h2>The Architecture</h2>
<p>The architecture is straightforward:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/architecture.png" alt="architecture" /></p>
<ul>
<li><strong>The Collector</strong>: <a href="https://docs.google.com/document/d/1-0gDjeM6s63AaQio1j0Cb2Pfodkr6KfSw1-q847Gzes/edit?tab=t.0#heading=h.xa5hqxwq5lps">IBM Z Observability Connect</a> runs on z/OS. It collects logs, metrics, or traces and converts them into the OTLP (OpenTelemetry Protocol) format.</li>
<li><strong>The Processor</strong>: The <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Cloud Managed OTLP Endpoint</a> acts as a gateway collector, providing fully hosted, scalable, and reliable native OTLP ingestion.</li>
<li><strong>The Consumer</strong>: <a href="https://www.elastic.co/docs/solutions/observability/apm">Elastic APM</a> enables OpenTelemetry-native application performance monitoring, making it easy to pinpoint and fix performance problems quickly.</li>
</ul>
<h2>Putting it all together in Kubernetes</h2>
<p>We deploy an OpenTelemetry Collector within our Kubernetes cluster. This collector acts as a specialized gateway. It is configured to receive OTLP traffic directly from IBM Z Observability Connect on the mainframe and forward it securely to our observability backend, Elastic APM, by using the <code>otlp/elastic</code> exporter.</p>
<p>Here is the configuration for the OpenTelemetry Collector. Note the <code>exporters</code> section, which handles the authentication and batched transmission to Elastic:</p>
<pre><code>exporters:
  # Exporter to print the first 5 logs/metrics and then every 1000th
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 1000

  # Exporter to send logs and metrics to Elasticsearch Managed OTLP Input
  otlp/elastic:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}
    sending_queue:
      enabled: true
      sizer: bytes
      queue_size: 50000000 # 50MB uncompressed
      block_on_overflow: true
    batch:
      flush_timeout: 1s
      min_size: 1_000_000 # 1MB uncompressed
      max_size: 4_000_000 # 4MB uncompressed

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic, debug]
</code></pre>
<p><em>Note: We strongly recommend using environment variables for your endpoints and API keys to keep your manifest secure.</em></p>
<h2>Why the OTel specification matters</h2>
<p><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic’s managed OTLP endpoint</a> and observability solution are built with native OTel support and adhere to the OTel specification and semantic conventions. Once we wired everything up and the data started to flow, we noticed that some of the traces in Elastic APM were not being represented correctly.</p>
<p>Most observability solutions derive the so-called RED metrics (rate, error, and duration) for the most important spans in a trace—i.e., incoming and outgoing spans of each individual service. This allows for an efficient indication of a service’s performance without the need to comb through all of the tracing data to show something as simple as the latency of a service’s endpoint or the error rate on outgoing requests.</p>
<p>For an efficient calculation of such derived metrics for incoming spans on a service, the <a href="https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0182-otlp-remote-parent.md">OTel community</a> introduced the <code>SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK</code> and <code>SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK</code> flags on the span entities within the OTLP protocol. These flags provide an unambiguous indication of whether an individual span is an entry span and, thus, allow observability backends to efficiently calculate metrics for entry-level spans.</p>
<p>If these flags are set incorrectly for an entry span, the span cannot be recognized as an entry span, and metrics are not derived properly—leading to a broken experience. This is what we initially experienced with the ingested OTel data from the IBM mainframe instrumentation.</p>
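<p>The flag check itself is simple bit masking. The mask values below come from the SpanFlags definition in the OTLP <code>trace.proto</code>; the helper function is an illustrative sketch, not Elastic's actual backend implementation:</p>

```python
# SpanFlags bit masks from the OTLP trace.proto definition.
SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK = 0x00000100
SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK = 0x00000200

def parent_is_remote(span_flags: int):
    """Return True/False when the producer recorded whether the parent
    context was remote, or None when the flag was never set."""
    if not span_flags & SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK:
        return None  # no information: the backend cannot classify the span
    return bool(span_flags & SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK)

# A span whose parent context is remote (both bits set) is an entry span.
print(parent_is_remote(0x300))  # -> True
```

<p>If a producer sets <code>SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK</code> without the accompanying <code>HAS</code> bit, or omits both on a genuine entry span, the backend sees "unknown" and cannot derive entry-span metrics, which is exactly the broken experience described above.</p>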
<p>In a proprietary world, this might have been a dead end or a months-long troubleshooting exercise. However, since OpenTelemetry is an open standard, we were able to debug the issue rapidly and share our findings with IBM engineers, who quickly developed a fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/service_map.png" alt="service_map" /></p>
<h2>Streamline observability</h2>
<p>We now have end-to-end visibility that spans from modern mobile or web applications deep into the IBM mainframe. This unlocks significant value:</p>
<ul>
<li><strong>Unified Service Maps</strong>: You can visually see the dependency between the cloud-native cart service and the backend inventory system on z/OS.</li>
<li><strong>Single Pane of Glass</strong>: SREs no longer need to switch between modern observability tools and separate mainframe monitoring tools to view service health.</li>
<li><strong>Operational Efficiency</strong>: By eliminating the “blind spot” in the trace, you reduce the time spent on coordinating between cloud and mainframe teams, making issue resolution faster.</li>
</ul>
<h2>Conclusion</h2>
<p>If you are running hybrid workloads, it is time to stop treating your mainframe as a black box. With IBM Z Observability Connect, the Elastic Managed OTLP Endpoint, and Elastic APM, your entire stack can finally speak a single language: OpenTelemetry.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/end-to-end-o11y.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Demo with the Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry</link>
            <guid isPermaLink="false">opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry</guid>
            <pubDate>Mon, 07 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic is dedicated to supporting users in their journey with OpenTelemetry. Explore our public deployment of the OpenTelemetry Demo and see how Elastic's solutions enhance your observability experience.]]></description>
            <content:encoded><![CDATA[<p>Recently, Elastic <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">introduced the Elastic Distributions
(EDOT)</a>
for various OpenTelemetry components. We are proud to announce that these EDOT
components are now available in <a href="https://github.com/elastic/opentelemetry-demo">Elastic's fork of the OpenTelemetry
Demo</a>. We've also made public a
<a href="https://ela.st/demo-otel">Kibana endpoint</a>, allowing you to dive into the
demo’s live data and explore its capabilities firsthand. In this blog post,
we'll elaborate on the reasons behind the fork and explore the powerful new
features it introduces. We'll also provide a comprehensive overview of how
these enhancements can be leveraged with the Elastic Distributions of
OpenTelemetry (EDOT) for advanced error detection, as well as the EDOT
Collector—a cutting-edge evolution of the Elastic Agent—for seamless data
collection and analysis.</p>
<h2>What is the OpenTelemetry Demo?</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo</a>
is a microservices-based application created by OpenTelemetry's community to
showcase its capabilities in a realistic, distributed system environment.
This demo application, known as the OpenTelemetry Astronomy Shop, simulates an
e-commerce website composed of over 10 interconnected microservices (written in
multiple languages: Go, Java, .NET, Node.js, etc.), communicating via HTTP and
gRPC. Each service is fully instrumented with OpenTelemetry, generating
comprehensive traces, metrics, and logs. The demo serves as an invaluable
resource for understanding how to implement and use OpenTelemetry in real-world
applications.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/opentelemetry_demo_service_map.png" alt="1 - Service Map for the OpenTelemetry Demo Elastic
fork" /></p>
<p>One of the microservices, called <code>loadgenerator</code>, automatically starts
generating requests to the various endpoints of the demo, simulating a
real-world environment where multiple clients are interacting with the system.
This helps replicate the behavior of a busy, live application with concurrent
user activity.</p>
<h3>Elastic's fork</h3>
<p>Elastic recognized an opportunity to enhance the OpenTelemetry Demo by forking
it and integrating advanced Elastic features for deeper observability and
simpler monitoring. While forking is the <a href="https://github.com/open-telemetry/opentelemetry-demo?tab=readme-ov-file#demos-featuring-the-astronomy-shop">recommended OpenTelemetry
approach</a>,
we aim to leverage the robust foundation and latest updates from the upstream
version as much as possible. To achieve this, Elastic’s fork of the
OpenTelemetry Demo performs daily pulls from upstream, seamlessly integrating
them with Elastic-specific changes. To avoid conflicts, we continuously
contribute upstream, ensuring Elastic's modifications are always additive or
configurable through environment variables. One such contribution is the
<a href="https://github.com/elastic/opentelemetry-demo/blob/main/.env.override">.env.override
file</a>,
designed exclusively for vendor forks to override the microservices images and
configuration files used in the demo.</p>
<h2>Deeper Insights with Elastic Distributions</h2>
<p>In our current update of Elastic's OpenTelemetry Demo fork, we have replaced
some of the microservices OTel SDKs used for instrumentation with Elastic's
specialized distributions. These changes ensure deeper integration with
Elastic's observability tools, offering richer insights and more robust
monitoring capabilities. These are some of the fork's changes:</p>
<p><strong>Java services:</strong> The Ad, Fraud Detection, and Kafka services now utilize the
Elastic distribution of the OpenTelemetry Java Agent. One of the features
included in the distribution is span stack traces, which provide precise
information about where in the code path a span originated. Learn more about
the Elastic Java Agent
<a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/adservice_span_stacktrace.png" alt="2 - Ad Service span stack trace
example" /></p>
<p>The <strong>Cart service</strong> has been upgraded to use the Elastic distribution of the
OpenTelemetry .NET Agent. This replacement shows how the Elastic Distribution
of OpenTelemetry .NET (EDOT .NET) can be used to get started with
OpenTelemetry in your .NET applications with zero code changes. Discover more
about the Elastic .NET Agent in <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">this blog
post</a>.</p>
<p>In the <strong>Payment service</strong>, we've configured the Elastic distribution of the
OpenTelemetry Node.js Agent. The distribution ships with the host-metrics
extension, and Kibana provides a curated service metrics UI. Read more about
the Elastic Node.js Agent
<a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_service_host_metrics.png" alt="3 - Payment service host
metrics" /></p>
<p>The <strong>Recommendation service</strong> now leverages the EDOT Python, replacing the
standard OpenTelemetry Python agent. The Python distribution is another example
of a Zero-code (or Automatic) instrumentation, meaning that the distribution
will set up the OpenTelemetry SDK and enable all the recommended
instrumentations for you. Find out more about the Elastic Python Agent in <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">this
blog
post</a>.</p>
<p>It's important to highlight that the Elastic Distributions of OpenTelemetry
don't bundle proprietary software; they are built on top of the vanilla OTel
SDKs but offer advantages such as a single package for installation, easy
auto-instrumentation with reasonable default configuration, automatic log
telemetry, and more. Along these lines, the ultimate goal is to contribute as
many EDOT features as possible back to the upstream OpenTelemetry agents; they
are designed in such a way that the additional features, realized as
extensions, work directly with the OTel SDKs.</p>
<h2>Collecting Data with the Elastic Collector Distribution</h2>
<p>The OpenTelemetry Demo applications generate and send their signals to an
OpenTelemetry Collector OTLP endpoint. In the Demo's fork, the EDOT collector
is set up to forward all OTLP signals from the microservices to an <a href="https://www.elastic.co/guide/en/observability/current/apm.html">APM
server</a> OTLP
endpoint. Additionally, it sends all other metrics and logs collected by the
collector to an Elasticsearch endpoint.</p>
<p>If the fork is deployed in a Kubernetes environment, the collector will
automatically start collecting the system's metrics. The collector will be
configured to use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver">hostmetrics
receiver</a>
to monitor each K8s node's metrics, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kubeletstatsreceiver">kubeletstats
receiver</a>
to retrieve the Kubelet's metrics, and the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">filelog
receiver</a>,
which collects all of the cluster's logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/node_host_metrics.png" alt="4 - Host
metrics" /></p>
<p>Both the signals generated by the microservices and those collected by the EDOT
collector are enriched with Kubernetes metadata, allowing users to correlate
them seamlessly. This makes it easy to track and observe which Kubernetes nodes
and pods each service is running on, providing deep insights into both
application performance and infrastructure health.</p>
<p>Learn more about Elastic's OpenTelemetry Collector distribution:
<a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector</a></p>
<h2>Error detection with Elastic</h2>
<p>The OpenTelemetry Demo incorporates <a href="https://flagd.dev/">flagd</a>, a feature flag
evaluation engine used to simulate error scenarios. For example, the
<code>paymentServiceFailure</code> flag will force an error for every request to the
payment service <code>charge</code> endpoint. Since the service is instrumented with
OpenTelemetry, the error will be captured in the generated traces. We can then
use Kibana's powerful visualization and search tools to trace the error back to
its root cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_error.png" alt="5 - Payment service
error" />
<img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_trace_error.png" alt="6 - Payment service trace
error" /></p>
<p>Another available flag is named <code>adServiceHighCpu</code>, which causes a high CPU
load in the ad service. This increased CPU usage can be monitored either
through the service's metrics or the related metrics of its Kubernetes pod:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/adservice_high_cpu_error.png" alt="7 - AdService High CPU
error" /></p>
<p>The full list of simulated scenarios can be found at <a href="https://opentelemetry.io/docs/demo/feature-flags/">this
link</a>.</p>
<h2>Start your own exploration</h2>
<p>Ready to explore the OpenTelemetry Demo with Elastic and its enhanced
observability capabilities? Follow the link to Kibana and begin your own
exploration of how Elastic and OpenTelemetry can transform your approach to
observability.</p>
<p>Live demo: <a href="https://ela.st/demo-otel">https://ela.st/demo-otel</a></p>
<p>But that's not all—if you want to take it a step further, you can deploy the
OpenTelemetry Demo directly with your own Elasticsearch stack. Follow the steps
provided <a href="https://github.com/elastic/opentelemetry-demo">here</a> to set it up and
start gaining valuable insights from your own environment.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/elastic-oteldemo.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Reconciliation in Elastic Streams: A Robust Architecture Deep Dive]]></title>
            <link>https://www.elastic.co/observability-labs/blog/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</link>
            <guid isPermaLink="false">from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</guid>
            <pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic's engineering team refactored Streams using a reconciliation model inspired by Kubernetes & React to build a robust, extensible, and debuggable system.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new, unified approach to data management in the Elastic Stack. It wraps a set of existing Elasticsearch building blocks—data streams, index templates, ingest pipelines, retention policies—into a single, coherent primitive: the Stream. Instead of configuring these parts individually and in the right order, users can now rely on Streams to orchestrate them safely and automatically. With a unified UI in Kibana and a simplified API, Streams reduces cognitive load, lowers the risk of misconfiguration, and supports more flexible workflows like late binding—where users can ingest data first and decide how to process and route it later.</p>
<p>But behind that clean user experience lies a fast-moving, evolving codebase. In this post, we’ll explore how we rethought its architecture to keep up with product demands—while laying the groundwork for future flexibility and scale.</p>
<p>Rapid experimentation often leads to messy code—but before shipping to customers, we have to ask: If this succeeds, can we continue evolving it?
That question puts code health front and center. To move fast in the long term, we need a foundation that supports iteration.</p>
<p>When I joined the Streams team about six months ago, the project was moving fast through uncharted territory amid high uncertainty. This combination of speed and uncertainty created the perfect conditions for, well, spaghetti code—crafted by some of our most senior engineers, doing their best with a recipe missing a few ingredients.</p>
<p>The code was pragmatic and effective: it did exactly what it needed to do. But it was becoming increasingly difficult to understand and extend. Related logic was scattered across many files, with little separation of concerns, making it difficult to safely identify where and how to introduce changes. And the project still had a long road ahead.</p>
<p>Recently, we undertook a refactor of the underlying architecture—not just to bring greater clarity and structure to the codebase, but to establish clear phases that make it easier to debug and evolve. Our primary goal was to build a foundation that would let us continue moving quickly and confidently.
As a secondary goal, we aimed to enable new capabilities like bulk updates, dry runs, and system diagnostics.</p>
<p>In this post, we’ll briefly explore the challenges that prompted a new approach, share the architectural patterns that inspired us, explain how the new design works under the hood, and highlight what it enables for the future.</p>
<h2>The Challenges We Faced</h2>
<p>Streams aims to be a declarative model for data management. Users describe how data should flow: where it should go, what processing should happen along the way, and which mappings should apply. Behind the scenes, each API request results in one or more Elasticsearch resources being changed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/mess.png" alt="An image evoking a tangled mess" /></p>
<p>Before the refactor, the underlying code was increasingly difficult to reason about. There was no clear lifecycle that each request followed. Data was loaded only when it happened to be needed, validation was scattered across different functions, and cascading changes—like child streams reacting to parent updates—were applied recursively and implicitly. Elasticsearch requests could happen at any point during a request.</p>
<p>This led to several key challenges:</p>
<ul>
<li>
<p><strong>No clear place for validation</strong><br />
Without a single, centralized validation step, engineers weren’t sure where to add new checks—or whether existing ones would even run reliably. Some validations happened early, others late.</p>
</li>
<li>
<p><strong>No clear picture of the overall system state</strong><br />
Because there was no way to manage the system state as a whole, it was hard to reason about or validate it. We couldn’t easily check whether a change was valid in the context of all other existing streams or dependencies.</p>
</li>
<li>
<p><strong>Unpredictable side effects</strong><br />
Since Elasticsearch operations could occur at different points in the flow, failures were harder to handle or roll back. We didn’t have a clear “commit point” where the changes were executed.</p>
</li>
<li>
<p><strong>Tangled stream logic</strong><br />
Logic for different types of streams was mixed together in shared code paths, often guarded by conditionals. This made it hard to isolate behavior, test individual types, or add new ones without risking unintended consequences.</p>
</li>
</ul>
<p>These challenges made it clear: we needed a more structured foundation, one capable of supporting both the current complexity and future growth.</p>
<h2>What We Needed to Move Forward</h2>
<p>To move faster yet with confidence, we needed a foundation that could evolve gracefully, make behavior easier to reason about, and reduce the likelihood of unexpected side effects.</p>
<p>We aligned around a few key goals:</p>
<ul>
<li>
<p><strong>A clear request lifecycle</strong><br />
Each request should move through clear, well-defined phases: loading the current state, applying changes, validating the resulting state, determining the Elasticsearch actions, and executing the actions. This structure would help engineers understand where things happen—and why.</p>
</li>
<li>
<p><strong>A unified state model</strong><br />
We wanted a clear model of desired vs. current state—a single place to reason about the outcome of a change. This would enable safer validation, more efficient updates, and easier debugging by allowing us to compute the difference between the two states.</p>
</li>
<li>
<p><strong>A single commit point</strong><br />
All Elasticsearch changes should happen in one place, after everything’s validated and we know exactly what needs to change. This would reduce side effects, make failures easier to manage, and unlock support for dry runs.</p>
</li>
<li>
<p><strong>Isolated stream logic</strong><br />
We needed clearer separation between stream types so each could be developed and tested in isolation. This would simplify adding new types, reduce unintended side effects, and clarify whether changes belong to a stream type or the state management layer.</p>
</li>
<li>
<p><strong>Bulk operations and system introspection</strong><br />
Finally, we wanted to support features like bulk updates, dry runs, and health diagnostics—capabilities that were difficult or impossible with the old design. A more explicit and inspectable model of system state would make this possible.</p>
</li>
</ul>
<p>These goals became our north star as we explored new architectural patterns to get there, with a strong focus on comparing the current state with the desired state.</p>
<h2>Where We Drew Inspiration From</h2>
<p>Our new design drew inspiration from two well-known open source projects: <a href="https://kubernetes.io/">Kubernetes</a> and <a href="https://react.dev/">React</a>. Though very different, both share a central concept: reconciliation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/reconciliation.png" alt="An image showing a flow chart for reconciliation" /></p>
<p>Reconciliation means comparing two states, calculating their differences, and taking the necessary actions to move the system from its current state to its desired state.</p>
<ul>
<li>
<p>In <a href="https://kubernetes.io/docs/concepts/architecture/controller/">Kubernetes</a>, you declare the desired state of your resources, and the controller continuously works to align the cluster with that state.</p>
</li>
<li>
<p>In <a href="https://legacy.reactjs.org/docs/faq-internals.html">React</a>, each component defines how it should render, and the virtual DOM updates the real DOM efficiently to match that.</p>
</li>
</ul>
<p>We were also inspired by the <a href="https://mmapped.blog/posts/29-plan-execute">Plan/Execute</a> pattern, which separates decision making from execution. This sounded like exactly what we needed in order to perform all validations before committing to any actions—ensuring we could reason about and inspect the system's intent ahead of time.</p>
<p>These concepts resonated with our goals and made it clear that we required two key pieces:</p>
<ol>
<li>
<p>A model representing system state, responsible for comparing states and driving the overall workflow (like the Kubernetes controller loop).</p>
</li>
<li>
<p>A representation of individual streams that make up that state, handling the specific logic for each stream type (like React components).</p>
</li>
</ol>
<p>Each Stream is defined and stored in Elasticsearch. We recognized a disconnect between data management and state changes in our existing code, so we designed each stream to manage both. This fits naturally with the <a href="https://www.martinfowler.com/eaaCatalog/activeRecord.html">Active Record pattern</a>, where a class encapsulates both domain logic and persistence.</p>
<p>To make the system easier to extend and the state model’s interface simpler, we implemented an abstract Active Record class using the <a href="https://refactoring.guru/design-patterns/template-method">Template Method pattern</a>, clearly defining the interface new stream types must follow.</p>
<p>We did have some concerns that adopting these more advanced patterns—like reconciliation, the Active Record, and Template Method—might make it harder for new or less experienced engineers to get up to speed. While the code would be cleaner and more straightforward for those familiar with the patterns, we worried it could create a barrier for juniors or newcomers unfamiliar with these concepts.</p>
<p>In practice, however, we found the opposite: the code became easier to follow because the patterns provided a clear, consistent structure. More importantly, the architectural choices helped keep the focus on the domain itself, rather than on complex implementation details, making it more approachable for the whole team. The patterns are there but the code doesn't talk about them, it talks about the domain.</p>
<h2>How We Structured the System</h2>
<p>When a request hits one of our API endpoints in Kibana, the handler performs basic request validation, then passes the request to the Streams Client. The client’s job is to translate the request into one or more Change objects. Each Change represents the creation, modification, or deletion of a Stream.</p>
<p>These Change objects are then passed to a central class we introduced called <code>State</code>, which plays two key roles:</p>
<ul>
<li>
<p>It holds the set of Stream instances that make up the current version of the system.</p>
</li>
<li>
<p>It orchestrates the pipeline that applies changes and transitions from one state to another.</p>
</li>
</ul>
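<p>As a rough illustration (simplified, hypothetical names rather than the actual Kibana source), the relationship between Changes, Streams, and the <code>State</code> class might look like this:</p>
<pre><code class="language-typescript">// Illustrative sketch only; the real implementation lives in Kibana.
type Change =
  | { type: 'upsert'; name: string; definition: object }
  | { type: 'delete'; name: string };

interface Stream {
  name: string;
  clone(): Stream;
}

// State holds the set of Streams that make up one version of the system.
class State {
  constructor(private streams: { [name: string]: Stream }) {}

  has(name: string): boolean {
    return this.streams[name] !== undefined;
  }

  // Each Stream is responsible for cloning itself.
  clone(): State {
    const copy: { [name: string]: Stream } = {};
    for (const name of Object.keys(this.streams)) {
      copy[name] = this.streams[name].clone();
    }
    return new State(copy);
  }

  // Applying a Change never mutates the current state; it yields a new one.
  withChange(change: Change): State {
    const next = this.clone();
    if (change.type === 'delete') {
      delete next.streams[change.name];
    }
    // 'upsert' handling omitted; each Stream type defines its own reaction.
    return next;
  }
}
</code></pre>
<p>Keeping the starting state immutable is what later makes diffing, validation, and rollback planning possible.</p>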
<p>Let’s walk through the key phases the State class manages when applying a change.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/flow.png" alt="Flowchart of the phases" /></p>
<h3>Loading the Starting State</h3>
<p>First, the State class loads the current system state by reading the stored Stream definitions from Elasticsearch. This becomes our reference point for all subsequent comparisons—used during validation, diffing, and action planning.</p>
<h3>Applying Changes</h3>
<p>We begin by cloning the starting state. Each Stream is responsible for cloning itself.
Then we process each incoming Change:</p>
<ul>
<li>
<p>The change is presented to all Streams in the current state (creating a new one if needed).</p>
</li>
<li>
<p>Each Stream can react by updating itself and optionally emitting cascading changes—additional changes that ripple through related Streams.</p>
</li>
<li>
<p>Cascading changes are processed in a loop until no more are generated (or until we hit a safety threshold).</p>
</li>
</ul>
<p>We then move to the next requested Change.<br />
If any requested or cascading Change cannot be applied safely, the system aborts the entire request to prevent partial updates.</p>
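<p>As a sketch of that loop (hypothetical, simplified names; the real logic also tracks which Streams were touched), cascading changes can be drained with a queue and a safety threshold:</p>
<pre><code class="language-typescript">// Illustrative sketch of the cascading-change loop.
const MAX_CASCADES = 100; // safety threshold against endless ripple effects

type Change = { type: 'upsert' | 'delete'; name: string };

interface StreamLike {
  name: string;
  // A stream reacts to a change and may emit further cascading changes.
  applyChange(change: Change): Change[];
}

function applyChanges(streams: StreamLike[], requested: Change[]): void {
  for (const change of requested) {
    const queue: Change[] = [change];
    let processed = 0;
    while (queue.length > 0) {
      processed += 1;
      if (processed > MAX_CASCADES) {
        throw new Error('Too many cascading changes; aborting the request');
      }
      const current = queue.shift()!;
      for (const stream of streams) {
        queue.push(...stream.applyChange(current));
      }
    }
  }
}
</code></pre>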
<h3>Validating the Desired State</h3>
<p>Once we’ve applied all Changes and cascading effects, we run validations to ensure the resulting configuration is safe and consistent.</p>
<p>Each Stream is asked to validate itself in the context of the full desired state and the original starting state. This allows for both localized checks (within a Stream) and broader coordination (between related Streams). If any validation fails, we abort the request.</p>
<h3>Determining Actions</h3>
<p>Next, each Stream is asked to determine what Elasticsearch actions are needed to move from the starting state to the desired state. This is the first point where the system needs to consider which Elasticsearch resources back an individual Stream.</p>
<p>If the request is a dry run, we stop here and return a summary of what would happen. If it’s meant to be executed, we move to the next phase.</p>
<h3>Planning and Execution</h3>
<p>The list of Elasticsearch actions is handed off to a dedicated class called <code>ExecutionPlan</code>. This class handles:</p>
<ul>
<li>
<p>Resolving cross-stream dependencies that individual Streams cannot address alone.</p>
</li>
<li>
<p>Organizing the actions into the correct order to ensure safe application (e.g. to avoid data loss when routing rules change).</p>
</li>
<li>
<p>Maximizing parallelism wherever possible within those ordering constraints.</p>
</li>
</ul>
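<p>A minimal sketch of such an executor (hypothetical names; the real <code>ExecutionPlan</code> also resolves cross-stream dependencies) runs every action whose prerequisites are already complete, in parallel waves:</p>
<pre><code class="language-typescript">// Illustrative sketch: dependency-ordered execution with parallel waves.
interface EsAction {
  id: string;
  dependsOn: string[]; // ids of actions that must complete first
  run: () => unknown;  // typically returns a Promise
}

async function executePlan(actions: EsAction[]) {
  const done = new Set();
  let remaining = actions.slice();
  while (remaining.length > 0) {
    // Every action whose dependencies are satisfied can run in parallel.
    const ready = remaining.filter((a) => a.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error('Circular dependency between actions');
    }
    await Promise.all(ready.map((a) => a.run()));
    for (const a of ready) {
      done.add(a.id);
    }
    remaining = remaining.filter((a) => !done.has(a.id));
  }
}
</code></pre>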
<p>If the plan executes successfully, we return a success response from the API.</p>
<h3>Handling Failures</h3>
<p>If the plan fails during execution, the <code>State</code> class attempts a rollback—it computes a new plan that should return the system to its starting state (by going from the desired state to the starting state instead) and tries to execute it.</p>
<p>If the rollback also fails, we have a fallback mechanism: a “reset” operation that re-applies the known-good state stored in Elasticsearch, skipping diffing entirely.</p>
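<p>The overall failure-handling strategy can be sketched as nested fallbacks (a simplified, hypothetical shape; the real code derives the forward and backward plans by diffing states):</p>
<pre><code class="language-typescript">// Illustrative sketch of commit with rollback and reset fallbacks.
type Plan = () => unknown; // typically returns a Promise

async function commitWithFallbacks(
  applyForward: Plan,  // starting state to desired state
  applyBackward: Plan, // desired state back to starting state (rollback)
  reset: Plan          // re-apply the known-good stored state
) {
  try {
    await applyForward();
  } catch {
    try {
      await applyBackward();
    } catch {
      await reset();
    }
  }
}
</code></pre>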
<h3>A Closer Look at the Stream Active Record Classes</h3>
<p>All Streams in the State are subclasses of an abstract class called <code>StreamActiveRecord</code>. This class is responsible for:</p>
<ul>
<li>
<p>Tracking the change status of the Stream</p>
</li>
<li>
<p>Routing change application, validation, and action determination to the specialized template-method hooks that its concrete subclasses implement, based on the change status.</p>
</li>
</ul>
<p>These hooks are as follows:</p>
<ul>
<li>
<p>Apply upsert / Apply deletion</p>
</li>
<li>
<p>Validate upsert / Validate deletion</p>
</li>
<li>
<p>Determine actions for creation / change / deletion</p>
</li>
</ul>
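<p>In simplified form (hypothetical names, not the actual Kibana source), the Template Method shape looks like this: the base class routes by change status, and each concrete stream type only fills in the hooks:</p>
<pre><code class="language-typescript">// Illustrative sketch of the abstract Active Record layout.
type ChangeStatus = 'unchanged' | 'upserted' | 'deleted';

abstract class StreamActiveRecordSketch {
  protected status: ChangeStatus = 'unchanged';

  constructor(public readonly name: string) {}

  markAs(status: ChangeStatus): void {
    this.status = status;
  }

  // Generic routing lives in the base class...
  validate(): string[] {
    switch (this.status) {
      case 'upserted':
        return this.validateUpsert();
      case 'deleted':
        return this.validateDeletion();
      default:
        return [];
    }
  }

  // ...while concrete stream types implement the specialized hooks.
  protected abstract validateUpsert(): string[];
  protected abstract validateDeletion(): string[];
}

class ExampleStream extends StreamActiveRecordSketch {
  protected validateUpsert(): string[] {
    return this.name.length === 0 ? ['Stream name must not be empty'] : [];
  }
  protected validateDeletion(): string[] {
    return [];
  }
}
</code></pre>
<p>The same routing applies to change application and action determination; validation is shown here for brevity.</p>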
<p>With this architecture in place, we’ve created a clear, phased, and declarative flow from input to action—one that’s modular, testable, and resilient to failure. It cleanly separates generic stream lifecycle logic (like change tracking and orchestration) from stream-specific behaviors (such as what “upsert” means for a given Stream type), enabling a highly extensible system. This structure allows us to isolate side effects, validate with confidence, and reason more clearly about system-wide behavior—all while supporting dry runs and bulk operations.</p>
<p>Now that we’ve covered how it works, let’s explore what this unlocks—the capabilities, safety guarantees, and new workflows this design makes possible.</p>
<h2>What This Unlocks</h2>
<p>The reconciliation based design we landed on isn’t just easier to reason about—it directly addresses many of the core limitations we faced in the earlier version of the system.</p>
<p><strong>Bulk operations and dry runs, by design</strong></p>
<p>One of our key goals was to support bulk configuration changes across many Streams in a single request. The previous codebase made this difficult because the side effects were interleaved with decision-making logic, making it risky to apply multiple changes at once.</p>
<p>Now, bulk changes are the default. The <code>State</code> class handles any number of changes, tracks cascading effects automatically, and validates the end result as a whole. Whether you're updating one Stream or fifty, the pipeline handles it consistently.</p>
<p>Dry runs were another desired feature. Because actions are now computed in a side-effect-free step—before anything is sent to Elasticsearch—we can generate a full preview of what would happen. This includes both which Streams would change and what specific Elasticsearch operations would be performed. That visibility helps users and developers make confident, informed decisions.</p>
<p><strong>Easier debugging, better diagnostics</strong></p>
<p>In the old system, debugging required reconstructing the execution context and piecing together side effects. Now, every phase of the pipeline is explicit and can be tested and inspected in isolation.</p>
<p>Because validation and Elasticsearch actions are now tied directly to the Stream definition and lifecycle, any inconsistencies or errors are easier to trace to their source.</p>
<p><strong>Validated planning before execution</strong></p>
<p>Because we now validate and plan <em>before</em> making any changes, the risk of leaving the system in an inconsistent or partially-updated state has been greatly reduced. All actions are determined in advance, and only executed once we’re confident the entire set of changes is valid and coherent.</p>
<p>And if something does go wrong during execution, we can lean on the fact that both the starting and desired states are fully modeled in memory. This allows us to generate a rollback plan automatically, and when that’s not possible, fall back to a complete reset from the stored state. In short: safety is now built in, not bolted on.</p>
<p><strong>Extensible by default</strong></p>
<p>Adding a new type of Stream used to mean editing logic scattered across multiple files.
Now, it’s a focused, well-defined task. You subclass <code>StreamActiveRecord</code> and implement the handful of lifecycle hooks.</p>
<p>That’s it. The orchestration, tracking, and dependency handling are already wired up. That also means it’s easier to onboard new developers or experiment with new Stream types without fear of breaking unrelated parts of the system.</p>
<p><strong>Easier to test</strong></p>
<p>Because each Stream is now encapsulated and has clear, isolated responsibilities, testing is much simpler. You can test individual Stream classes by simulating specific inputs and asserting the resulting cascading changes, validation results, or Elasticsearch actions. There's no need to spin up a full end-to-end environment just to test a single validation.</p>
<h2>What’s Next</h2>
<p>At Elastic, we live by our Source Code, which states “Progress, SIMPLE Perfection”—a reminder to favor steady, incremental improvement over chasing perfection.</p>
<p>This new system is a solid foundation—but it’s only the beginning. Our focus so far has been on clarity, safety, and extensibility, and while we’ve addressed some long-standing pain points, there’s still plenty of room to evolve.</p>
<h3>Continuous improvement ahead</h3>
<p>We intentionally shipped this work with a sharp scope and have already identified several enhancements that we will be adding in the coming weeks:</p>
<ul>
<li>
<p><strong>Introduce a locking layer</strong><br />
To safely handle concurrent updates, we plan to introduce a locking mechanism that prevents race conditions during parallel modifications.</p>
</li>
<li>
<p><strong>Expose bulk and dry-run features via our APIs</strong><br />
The <code>State</code> class already supports them—now it’s time to make those capabilities available to users.</p>
</li>
<li>
<p><strong>Improve debugging output</strong><br />
Now that state transitions are modeled explicitly, we can expose clearer diagnostics to help both users and developers reason about changes.</p>
</li>
<li>
<p><strong>Avoid redundant Elasticsearch requests</strong><br />
Currently we make multiple redundant requests during validation. Introducing a lightweight in-memory cache would let us avoid reloading the same resource more than once.</p>
</li>
<li>
<p><strong>Improve access controls</strong><br />
Currently, we rely on Elasticsearch to enforce access control. Because a single change can touch many different resources, it’s difficult to determine up front which privileges are required. We plan to extend our action definitions with privilege metadata, enabling us to validate the full set of required permissions before executing any actions. This will let us detect and report missing privileges early—before the plan runs.</p>
</li>
<li>
<p><strong>Add APM instrumentation</strong><br />
With the system structured in distinct, well-defined phases, we’re now in a great position to add performance instrumentation. This will help us identify bottlenecks and improve responsiveness over time.</p>
</li>
</ul>
<h3>Revisiting responsibilities</h3>
<p>As our orchestration becomes more robust, we’re also re-evaluating where it should live. Large-scale bulk operations, for example, might eventually be better handled closer to Elasticsearch itself, where we can benefit from greater atomicity and tighter performance guarantees. That kind of deep integration would have been premature earlier on—when we were still figuring out the right abstractions and phases for the system. But now that the design has stabilized, we’re in a much better position to start that conversation.</p>
<h3>Built to evolve</h3>
<p>We designed this system with adaptability in mind. Whether improvements come in the form of internal refactors, better developer experience, or deeper collaboration with Elasticsearch, we’re in a strong position to keep evolving. The architecture is modular by design—and that gives us both the stability to rely on and the flexibility to grow.</p>
<h2>Wrapping Up</h2>
<p>Building robust, maintainable systems is never just about code — it’s about aligning architecture with the evolving needs and direction of the product. Our journey refactoring Streams reaffirmed that a thoughtful, phased approach not only improves technical clarity but also empowers teams to move faster and innovate more confidently.</p>
<p>If you’re working on complex systems facing similar challenges—whether tangled logic, unpredictable side effects, or the need for extensibility—you’re not alone. We hope our story offers some useful insights and inspiration as you shape your own path forward.</p>
<p>We welcome feedback and collaboration from the community—whether it’s in the form of questions, ideas, or code.</p>
<p>To learn more about Streams, explore:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
<p><em>Check out the</em> <a href="https://github.com/elastic/kibana/pull/211696"><em>pull request on GitHub</em></a> to dive into the code or join the conversation.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Future-proof your logs with ecs@mappings template]]></title>
            <link>https://www.elastic.co/observability-labs/blog/future-proof-your-logs-with-ecs-mappings-template</link>
            <guid isPermaLink="false">future-proof-your-logs-with-ecs-mappings-template</guid>
            <pubDate>Mon, 23 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore how the ecs@mappings component template in Elasticsearch simplifies data management by providing a centralized, official definition of Elastic Common Schema (ECS) mappings. Learn about its benefits, including reduced configuration hassles, improved data integrity, and enhanced performance for both integration developers and community users. Discover how this feature streamlines ECS field support across Elastic Agent integrations and future-proofs your data streams.]]></description>
            <content:encoded><![CDATA[<p>As the Elasticsearch ecosystem evolves, so do the tools and methodologies designed to streamline data management. One advancement that will significantly benefit our community is the <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template.</p>
<p><a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS (Elastic Common Schema)</a> is a standardized data model for logs and metrics. It defines a set of <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">common field names and data types</a> that help ensure consistency and compatibility.</p>
<p><code>ecs@mappings</code> is a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-component-template.html">component template</a> that offers an <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">Elastic-maintained</a> definition of ECS mappings. Each Elasticsearch release contains an always up-to-date definition of all ECS fields.</p>
<h3>Elastic Common Schema and OpenTelemetry</h3>
<p>Elastic will preserve our users' investment in Elastic Common Schema by <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">donating</a> ECS to OpenTelemetry. Elastic participates and collaborates with the OTel community to merge ECS and OpenTelemetry's Semantic Conventions over time.</p>
<h2>The Evolution of ECS Mappings</h2>
<p>Historically, users and integration developers have defined ECS (Elastic Common Schema) mappings manually within individual index templates and packages, each meticulously listing its fields. Although straightforward, this approach proved time-consuming and challenging to maintain.</p>
<p>To tackle this challenge, integration developers moved towards two primary methodologies:</p>
<ol>
<li>Referencing ECS mappings</li>
<li>Importing ECS mappings directly</li>
</ol>
<p>These methods were steps in the right direction but introduced their own challenges, such as the maintenance cost of keeping the ECS mappings up to date with Elasticsearch changes.</p>
<h2>Enter ecs@mappings</h2>
<p>The <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template supports all the field definitions in ECS, leveraging naming conventions and a set of dynamic templates.</p>
<p>Elastic started shipping the <code>ecs@mappings</code> component template with Elasticsearch v8.9.0, including it in the <a href="https://github.com/elastic/elasticsearch/blob/v8.14.2/x-pack/plugin/core/template-resources/src/main/resources/logs%40template.json"><code>logs-*-*</code> index template</a>.</p>
<p>With Elasticsearch v8.13.0, Elastic now includes <code>ecs@mappings</code> in the index templates of all the Elastic Agent integrations.</p>
<p>This move was a breakthrough because:</p>
<ul>
<li><strong>Centralized</strong> and official: With ecs@mappings, we now have an official definition of ECS mappings.</li>
<li><strong>Out-of-the-box functionality</strong>: ECS mappings are readily available, reducing the need for additional imports or references.</li>
<li><strong>Simplified maintenance</strong>: The need to manually keep up with ECS changes has diminished since the template from Elasticsearch itself remains up-to-date.</li>
</ul>
<h3>Enhanced Consistency and Reliability</h3>
<p>With <code>ecs@mappings</code>, ECS mappings become the single source of truth. This unified approach means fewer discrepancies and higher consistency in data streams across integrations.</p>
<h2>How Community Users Benefit</h2>
<p>Community users stand to gain significantly from the adoption of <code>ecs@mappings</code>. Here are the key advantages:</p>
<ol>
<li><strong>Reduced configuration hassles</strong>: Whether you are an advanced user or just getting started, the simplified setup means fewer configuration steps and fewer opportunities for errors.</li>
<li><strong>Improved data integrity</strong>: Since ecs@mappings ensures that field definitions are accurate and up-to-date, data integrity is maintained effortlessly.</li>
<li><strong>Better performance</strong>: With less overhead in maintaining and referencing ECS fields, your Elasticsearch operations run more smoothly.</li>
<li><strong>Enhanced documentation and discoverability</strong>: As we standardize ECS mappings, the documentation can be centralized, making it easier for users to discover and understand ECS fields.</li>
</ol>
<p>Let's explore how the <code>ecs@mappings</code> component template helps users achieve these benefits.</p>
<h3>Reduced configuration hassles</h3>
<p>Modern Elasticsearch versions come with out-of-the-box full ECS field support (see the “requirements” section later for specific versions).</p>
<p>For example, the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> installed on a supported Elasticsearch cluster already includes the <code>ecs@mappings</code> component template in its index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      ...,
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
    ...
</code></pre>
<p>There is no need to import or define any ECS field.</p>
<h3>Improved data integrity</h3>
<p>The <code>ecs@mappings</code> component template supports all the existing ECS fields. If you use any ECS field in your document, it will automatically be mapped to the expected type.</p>
<p>To ensure that <code>ecs@mappings</code> is always up to date with the <a href="https://github.com/elastic/ecs/">ECS repository</a>, we set up a daily <a href="https://github.com/elastic/elasticsearch/blob/6ae9dbfda7d71ae3f1bd2bddf9334d37b3294632/x-pack/plugin/stack/src/javaRestTest/java/org/elasticsearch/xpack/stack/EcsDynamicTemplatesIT.java#L49">automated test</a> to ensure that the component template supports all fields.</p>
<h3>Better Performance</h3>
<h4>Compact definitions</h4>
<p>The ECS field definition is exceptionally compact; at the time of this writing, it is 228 lines long and supports all ECS fields. To learn more, see the <code>ecs@mappings</code> component template <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">source code</a>.</p>
<p>It relies on naming conventions and uses <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.14/dynamic-templates.html">dynamic templates</a> to achieve this compactness.</p>
<h4>Lazy mapping</h4>
<p>Elasticsearch only adds existing document fields to the mapping, thanks to dynamic templates. The lazy mapping keeps memory overhead at a minimum, improving cluster performance and making field suggestions more relevant.</p>
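<p>As a simplified illustration of the approach (not the actual <code>ecs@mappings</code> source, which covers many more conventions), a single dynamic template can map every field whose path ends in <code>.ip</code> to the <code>ip</code> type by naming convention:</p>
<pre><code class="language-json">{
  &quot;dynamic_templates&quot;: [
    {
      &quot;ecs_ip&quot;: {
        &quot;path_match&quot;: &quot;*.ip&quot;,
        &quot;mapping&quot;: {
          &quot;type&quot;: &quot;ip&quot;
        }
      }
    }
  ]
}
</code></pre>
<p>Elasticsearch evaluates the template lazily: the <code>ip</code> mapping is only created when a matching field actually appears in an ingested document.</p>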
<h3>Enhanced documentation and discoverability</h3>
<p>All Elastic Agent integrations are migrating to the <code>ecs@mappings</code> component template. These integrations no longer need to add and maintain ECS field mappings and can reference the official <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">ECS Field Reference</a> or the ECS source code in the Git repository: <a href="https://github.com/elastic/ecs/">https://github.com/elastic/ecs/</a>.</p>
<h2>Getting started</h2>
<h3>Requirements</h3>
<p>To leverage the <code>ecs@mappings</code> component template, ensure the following stack version:</p>
<ul>
<li><strong>8.9.0</strong>: if your data stream uses the logs index template or you define your own index template.</li>
<li><strong>8.13.0</strong>: if your data stream uses the index template of an Elastic Agent integration.</li>
</ul>
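<p>You can verify that the component template is available on your cluster with a standard Elasticsearch API request:</p>
<pre><code class="language-json">GET _component_template/ecs@mappings
</code></pre>
<p>If the request returns the template definition, your stack meets the requirements.</p>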
<h3>Example</h3>
<p>We will use the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> to show you how <code>ecs@mappings</code> can handle mappings for any ECS field out of the box.</p>
<p>Imagine you want to ingest the following log event using the Custom AWS Logs integration:</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<h4>Dev Tools</h4>
<p>Kibana offers an excellent tool for experimenting with the Elasticsearch API: the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Dev Tools console</a>. With Dev Tools, you can run API requests quickly and without much friction.</p>
<p>To open the Dev Tools:</p>
<ul>
<li>Open <strong>Kibana</strong></li>
<li>Select <strong>Management &gt; Dev Tools &gt; Console</strong></li>
</ul>
<h4>Elasticsearch version &lt; 8.13</h4>
<p>On Elasticsearch versions before 8.13, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        }
      }
    }
  ]
}
</code></pre>
<p>As you can see, it does not include the <code>ecs@mappings</code> component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code>GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;keyword&quot;,
            &quot;ignore_above&quot;: 1024
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;long&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>These mappings do not align with ECS, so users and developers had to maintain them.</p>
<h4>Elasticsearch version &gt;= 8.13</h4>
<p>On Elasticsearch versions 8.13 and newer, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        },
        &quot;ignore_missing_component_templates&quot;: [
          &quot;logs-aws_logs.generic@custom&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>The index template for <code>logs-aws_logs.generic</code> now includes the <code>ecs@mappings</code> component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code class="language-json">GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;wildcard&quot;,
            &quot;fields&quot;: {
              &quot;text&quot;: {
                &quot;type&quot;: &quot;match_only_text&quot;
              }
            }
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;float&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>In Elasticsearch 8.13, fields like <code>command_line</code> and <code>custom_score</code> get their definition from ECS out-of-the-box.</p>
<p>These mappings align with ECS, so users and developers no longer have to maintain them. The same applies to the hundreds of other field definitions in the Elastic Common Schema. You get all of this by including a single component template of roughly 200 lines in your data stream.</p>
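<p>If you define your own index template rather than using an integration, you can pull in the same definitions by adding <code>ecs@mappings</code> to <code>composed_of</code>. A minimal sketch, where <code>my-logs</code> is a hypothetical template name:</p>
<pre><code class="language-json">PUT _index_template/my-logs
{
  &quot;index_patterns&quot;: [&quot;my-logs-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;composed_of&quot;: [&quot;ecs@mappings&quot;],
  &quot;priority&quot;: 500
}
</code></pre>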
<h2>Caveats</h2>
<p>Some aspects of how the <code>ecs@mappings</code> component template deals with data types are worth mentioning.</p>
<h3>ECS types are not enforced</h3>
<p>The <code>ecs@mappings</code> component template does not contain mappings for ECS fields where dynamic mapping already uses the correct field type. Therefore, if you send a field value with a compatible but wrong type, Elasticsearch will not coerce the value.</p>
<p>For example, if you send the following document with a <code>faas.coldstart</code> field (defined as <code>boolean</code> in ECS):</p>
<pre><code class="language-json">{
  &quot;faas.coldstart&quot;: &quot;true&quot;
}
</code></pre>
<p>Elasticsearch will map <code>faas.coldstart</code> as a <code>keyword</code> and not a <code>boolean</code>. Therefore, you need to make sure that the values you ingest to Elasticsearch use the right JSON field types, according to how they’re defined in ECS.</p>
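<p>Sending the value with its proper JSON type instead lets dynamic mapping produce the ECS-expected <code>boolean</code>:</p>
<pre><code class="language-json">{
  &quot;faas.coldstart&quot;: true
}
</code></pre>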
<p>This is the tradeoff for having a compact and efficient <code>ecs@mappings</code> component template. It also improves compatibility when dealing with a mix of ECS and custom fields, because documents won’t be rejected if the types are not consistent with the ones defined in ECS.</p>
<h2>Conclusion</h2>
<p>The introduction of <code>ecs@mappings</code> marks a significant improvement in managing ECS mappings within Elasticsearch. By centralizing and streamlining these definitions, we can ensure higher consistency, reduced maintenance, and better overall performance.</p>
<p>Whether you're an integration developer or a community user, moving to <code>ecs@mappings</code> represents a step towards more efficient and reliable Elasticsearch operations. As we continue incorporating feedback and evolving our tools, your journey with Elasticsearch will only get smoother and more rewarding.</p>
<p><strong>Join the Conversation</strong></p>
<p>Do you have questions or feedback about <code>ecs@mappings</code>? Join our helpful community of users on our <a href="https://discuss.elastic.co/">discussion forum</a> or <a href="https://ela.st/slack">Slack instance</a> and share your experiences. Your input is invaluable in helping us fine-tune these advancements for the entire community.</p>
<p>Happy mapping!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/future-proof-your-logs-with-ecs-mappings-template/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting more from your logs with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/getting-more-from-your-logs-with-opentelemetry</link>
            <guid isPermaLink="false">getting-more-from-your-logs-with-opentelemetry</guid>
            <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to evolve beyond basic log ingest by leveraging OpenTelemetry for ingestion, structured logging, geographic enrichment, and ES|QL analytics. Transform raw log data into actionable intelligence with practical examples and proactive observability strategies.]]></description>
            <content:encoded><![CDATA[<h1>Getting more from your logs with OpenTelemetry</h1>
<p>Most teams today still use their logging tools the way we have for decades: as a simple searchable lake, essentially grepping for logs, just from a centralized platform. There’s nothing wrong with this, and a centralized logging platform delivers a lot of value on its own. But how can you start to evolve beyond this basic log-and-search use case? Where can you be more effective with your incident investigations? In this blog, we start from where most of our customers are today and give you some practical tips on how to move beyond the simple logging use case.</p>
<h2>Ingestion</h2>
<p>Let's start at the beginning: ingest. Many of you are likely using older tools for ingestion today. If you want to be more forward-thinking here, it’s time to introduce you to OpenTelemetry. OpenTelemetry was once not very mature or capable for logging, but things have changed significantly, and Elastic has been working particularly hard to improve the logging capabilities in OpenTelemetry. So let's start by exploring how to bring logs into Elastic via the OpenTelemetry collector.</p>
<p>First, if you want to follow along, create a host to run the log generator and the OpenTelemetry collector.</p>
<p>Follow the instructions here to get the log generator running:</p>
<p><a href="https://github.com/davidgeorgehope/log-generator-bin/">https://github.com/davidgeorgehope/log-generator-bin/</a></p>
<p>To get the OpenTelemetry collector up and running in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a>, you can click on Add Data from the bottom left, then 'host' and finally 'opentelemetry'</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image14.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image7.png" alt="" /></p>
<p>Follow the instructions but don’t start the collector just yet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image16.png" alt="" /></p>
<p>Our host here is running a three-tier application with an Nginx frontend, a backend service, and a MySQL database. So let's start by bringing the logs into Elastic.</p>
<p>First, we’ll install the Elastic Distributions of OpenTelemetry. Before starting the collector, we will make a small change to the OpenTelemetry configuration file to expand the directories it searches for logs. Edit otel.yml using vi or your favorite editor:</p>
<pre><code class="language-bash">vi otel.yml
</code></pre>
<p>Instead of simply <code>/var/log/*.log</code>, we will use <code>/var/log/**/*.log</code> to bring in all our log files.</p>
<pre><code class="language-yaml">receivers:
  # Receiver for platform specific log files
  filelog/platformlogs:
    include: [ /var/log/**/*.log ]
    retry_on_failure:
      enabled: true
    start_at: end
    storage: file_storage
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>And we can see in Discover that the logs are coming in:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image8.png" alt="" /></p>
<p>One thing is immediately noticeable: without changing anything, we automatically get a bunch of useful additional information, such as the OS name and CPU details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image12.png" alt="" /></p>
<p>The OpenTelemetry collector has automatically started to enrich our logs, making them more useful for additional processing. But we can do significantly better!</p>
<p>To start with, we want to give our logs some structure. Let's edit the otel.yml file and add some OTTL to extract key data from our NGINX logs.</p>
<pre><code class="language-yaml">  transform/parse_nginx:
    trace_statements: []
    metric_statements: []
    log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
.....

   logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx,resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
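<p>To make the extraction concrete, here is a made-up access-log line in the shape these patterns expect, along with the attributes they would pull out of it (your log generator’s exact format may differ):</p>
<pre><code>10.0.0.1 - alice [11/Sep/2025:13:16:00 +0000] &quot;GET /api/products?req_id=abc123 HTTP/1.1&quot; 200 512
</code></pre>
<pre><code class="language-json">{
  &quot;client_ip&quot;: &quot;10.0.0.1&quot;,
  &quot;user&quot;: &quot;alice&quot;,
  &quot;timestamp_raw&quot;: &quot;11/Sep/2025:13:16:00 +0000&quot;,
  &quot;method&quot;: &quot;GET&quot;,
  &quot;path&quot;: &quot;/api/products&quot;,
  &quot;req_id&quot;: &quot;abc123&quot;,
  &quot;status&quot;: &quot;200&quot;,
  &quot;size&quot;: &quot;512&quot;
}
</code></pre>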
<p>Now when we start the Otel collector with this new configuration</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>We will see that we now have structured logs!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image17.png" alt="" /></p>
<h2>Store and Optimize</h2>
<p>To make sure all this additional structured data isn’t blowing out your budget, there are a few things you can do to maximize storage efficiency.</p>
<p>For example, you can use the filter processor in the OTel collector to granularly filter or drop irrelevant records and attributes, controlling the volume of data leaving the collector.</p>
<pre><code class="language-yaml">processors:
  filter/drop_logs_without_user_attributes:
    logs:
      log_record:
        - 'attributes[&quot;user&quot;] == nil'
  filter/drop_200_logs:
    logs:
      log_record:
        - 'attributes[&quot;status&quot;] == &quot;200&quot;'

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, filter/drop_logs_without_user_attributes, filter/drop_200_logs, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>The filter processor helps reduce noise, for example if you want to drop debug logs or logs from a noisy service. These are great ways to keep a lid on your observability spend.</p>
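<p>The same pattern works for severity-based filtering. For example, a filter that drops anything below INFO severity might look like this (a sketch using the filter processor’s OTTL conditions):</p>
<pre><code class="language-yaml">processors:
  filter/drop_debug_logs:
    logs:
      log_record:
        - 'severity_number &lt; SEVERITY_NUMBER_INFO'
</code></pre>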
<p>Additionally, for your most critical flows and logs where you don’t want to drop any data, Elastic has you covered: in version 9.x of Elastic, LogsDB is now switched on by default.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image15.png" alt="" /></p>
<p>With LogsDB, Elastic has reduced the storage footprint of log data in Elasticsearch by up to 65%, allowing you to store more observability and security data without exceeding your budget, while keeping all data accessible and searchable.</p>
<p>LogsDB achieves this by leveraging advanced compression techniques like ZSTD, delta encoding, and run-length encoding. It also reconstructs the <code>_source</code> field on demand (synthetic <code>_source</code>), saving about 40% more storage by not retaining the original JSON document. Synthetic <code>_source</code> represents the introduction of columnar storage within Elasticsearch.</p>
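<p>On self-managed clusters where you want to opt an index template into LogsDB explicitly, you can set the index mode in the template settings. A minimal sketch, where <code>my-app-logs</code> is a hypothetical template name:</p>
<pre><code class="language-json">PUT _index_template/my-app-logs
{
  &quot;index_patterns&quot;: [&quot;logs-myapp-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  },
  &quot;priority&quot;: 500
}
</code></pre>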
<h2>Analytics</h2>
<p>So we have our data in Elastic. It’s structured, and it conforms to the idea of a wide-event log, since it carries lots of good context: user IDs, request IDs, and data captured at the start of a request. Next, we’re going to look at the analytics side. First, let's take a stab at counting the number of errors per user in our application.</p>
<pre><code class="language-esql">FROM logs-generic.otel-default
| WHERE log.file.name == &quot;access.log&quot;
| WHERE attributes.status &gt;= &quot;400&quot;
| STATS error_count = COUNT(*) BY attributes.user
| SORT error_count DESC
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image9.png" alt="" /></p>
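<p>A small variation on the same query can surface which endpoints are failing, grouping by the <code>path</code> attribute we extracted earlier with OTTL:</p>
<pre><code class="language-esql">FROM logs-generic.otel-default
| WHERE log.file.name == &quot;access.log&quot; AND attributes.status &gt;= &quot;400&quot;
| STATS error_count = COUNT(*) BY attributes.path
| SORT error_count DESC
| LIMIT 10
</code></pre>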
<p>It’s pretty easy now to save this and put it on a dashboard; we just click the save button:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image5.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image6.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image3.png" alt="" /></p>
<p>Next, let's put something together to show the global impact. First, we will update our collector config to enrich our log data with geo location.</p>
<p>Update the OTTL configuration with this new line:</p>
<pre><code class="language-yaml">   log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
          - set(attributes[&quot;source.address&quot;], attributes[&quot;client_ip&quot;]) where attributes[&quot;client_ip&quot;] != nil
</code></pre>
<p>Next, add a new processor (you will need to download the GeoIP database from MaxMind):</p>
<pre><code class="language-yaml">geoip:
  context: record
  source:
    from: attributes
  providers:
    maxmind:
      database_path: /opt/geoip/GeoLite2-City.mmdb
</code></pre>
<p>And add it to the log pipeline after <code>transform/parse_nginx</code>:</p>
<pre><code class="language-yaml">service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, geoip, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>Once the data starts flowing we can add a map visualization:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image2.png" alt="" /></p>
<p>Add a layer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image4.png" alt="" /></p>
<p>Use ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image10.png" alt="" /></p>
<p>Use the following ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image13.png" alt="" /></p>
<p>And this should give you a map showing the locations of all your NGINX server requests!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image11.png" alt="" /></p>
<p>As you can see, analytics is a breeze with your new Otel data collection pipeline.</p>
<h2>Conclusion: Beyond log aggregation to operational intelligence</h2>
<p>The journey from basic log aggregation to structured, enriched observability represents more than a technical upgrade: it's a shift in how organizations approach system understanding and incident response. By adopting OpenTelemetry for ingestion, implementing intelligent filtering to manage costs, and leveraging LogsDB's storage optimizations, you're not just modernizing your ELK stack; you're building the foundation for proactive system management.</p>
<p>The structured logs, geographic enrichment, and analytical capabilities demonstrated here transform raw log data into actionable intelligence with ES|QL. Instead of reactive grepping through logs during incidents, you now have the infrastructure to identify patterns, track user journeys, and correlate issues across your entire stack before they become critical problems.</p>
<p>But here's the key question: Are you prepared to act on these insights? Having rich, structured data is only valuable if your organization can shift from a reactive &quot;find and fix&quot; mentality to a proactive &quot;predict and prevent&quot; approach. The real evolution isn't in your logging stack, it's in your operational culture.</p>
<p>Get started with this today in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/getting-more-from-your-logs-with-opentelemetry.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Getting started with OpenTelemetry instrumentation with a sample application]]></title>
            <link>https://www.elastic.co/observability-labs/blog/getting-started-opentelemetry-instrumentation-sample-app</link>
            <guid isPermaLink="false">getting-started-opentelemetry-instrumentation-sample-app</guid>
            <pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we’ll introduce you to a simple sample application: a UI for movie search, instrumented in different ways using Python, Go, Java, Node, and .NET. Additionally, we will show how to view your OpenTelemetry data in Elastic APM.]]></description>
            <content:encoded><![CDATA[<p>Application performance management (APM) has moved beyond traditional monitoring to become an essential tool for developers, offering deep insights into applications at the code level. With APM, teams can not only detect issues but also understand their root causes, optimizing software performance and end-user experiences. The modern landscape presents a wide range of APM tools and companies offering different solutions. Additionally, OpenTelemetry is becoming the open ingestion standard for APM. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data.</p>
<p>Elastic&lt;sup&gt;®&lt;/sup&gt; offers its own <a href="https://www.elastic.co/guide/en/apm/agent/index.html">APM Agents</a>, which can be used for instrumenting your code. In addition, Elastic also <a href="https://www.elastic.co/observability/opentelemetry">supports OpenTelemetry</a> natively.</p>
<p>Navigating the differences and understanding how to instrument applications using these tools can be challenging. That's where <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">our sample application, Elastiflix — a UI for movie search</a> — comes into play. We've crafted it to demonstrate the nuances of both OTEL and Elastic APM, guiding you through the process of the APM instrumentation and showcasing how you can use one or the other, depending on your preference.</p>
<h2>The sample application</h2>
<p>We deliberately kept the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">movie search UI really simple</a>. It displays some movies, has a search bar, and, at the time of writing, only one real functionality: you can add a movie to your list of favorites.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-1-luca.png" alt="luca" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-2-services.png" alt="services" /></p>
<h2>Services, languages, and instrumentation</h2>
<p>Our application has a few different services:</p>
<ul>
<li><strong>javascript-frontend:</strong> A React frontend, talking to the node service and Elasticsearch&lt;sup&gt;®&lt;/sup&gt;</li>
<li><strong>node-server:</strong> Node backend, talking to other backend services</li>
<li><strong>dotnet-login:</strong> A login service that returns a random username</li>
</ul>
<p>We reimplemented the “favorite” service in a few different languages, as we did not want to introduce additional complexity to the architecture of the application.</p>
<ul>
<li><strong>Go-favorite:</strong> A Go service that stores a list of favorite movies in Redis</li>
<li><strong>Java-favorite:</strong> A Java service that stores a list of favorite movies in Redis</li>
<li><strong>Python-favorite:</strong> A Python service that stores a list of favorite movies in Redis</li>
</ul>
<p>In addition, there are also some other supporting containers:</p>
<ul>
<li><strong>Movie-data-loader:</strong> Loads the movie database into your Elasticsearch cluster</li>
<li><strong>Redis:</strong> Used as a datastore for keeping track of the user’s favorites</li>
<li><strong>Locust:</strong> A load generator that talks to the node service to introduce artificial load</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-3-flowchart.png" alt="flowchart" /></p>
<p>The main difference compared to some other sample application repositories is that we’ve coded it in several languages, with each language version showcasing almost all possible types of instrumentation:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-4-types_of_instrumentation.png" alt="types of instrumentation" /></p>
<h3>Why this approach?</h3>
<p>While sample applications provide good insight into how tools work, they often showcase only one version, leaving developers to find all of the necessary modifications themselves. We've taken a different approach. By offering multiple versions, we intend to bridge the knowledge gap, making it straightforward for developers to see and comprehend the transition process from non-instrumented code to either Elastic or OTEL instrumented versions.</p>
<p>Instead of simply starting the already instrumented version, you can instrument the base version yourself, by following some of our other blogs. This will teach you much more than just looking at an already built version.</p>
<ul>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>Docker and Compose</li>
<li>Elastic Cloud Cluster (<a href="https://ela.st/freetrial">start your free trial</a>)</li>
</ul>
<p>Before starting the sample application, ensure you've set up your Elastic deployment details. Populate the .env file (located in the same directory as the compose files) with the necessary credentials. You can copy these from the Cloud UI and from within Kibana<sup>®</sup> under the path /app/home#/tutorial/apm.</p>
<p><strong>Cloud UI</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-5-deployment.png" alt="my deployment" /></p>
<p><strong>Kibana APM Tutorial</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-6-configure-agent.png" alt="Kibana APM Tutorial" /></p>
<pre><code class="language-bash">ELASTIC_APM_SERVER_URL=&quot;https://foobar.apm.us-central1.gcp.cloud.es.io&quot;
ELASTIC_APM_SECRET_TOKEN=&quot;secret123&quot;
ELASTICSEARCH_USERNAME=&quot;elastic&quot;
ELASTICSEARCH_PASSWORD=&quot;changeme&quot;
ELASTICSEARCH_URL=&quot;https://foobar.es.us-central1.gcp.cloud.es.io&quot;

</code></pre>
<h2>Starting the application</h2>
<p>You can start the sample app in three different ways, each corresponding to a different instrumentation scenario.</p>
<p>We provide public Docker images that you can use by supplying the --no-build flag. Otherwise, the images will be built from source on your machine, which will take around 5–10 minutes.</p>
<p><strong>1. Non-instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose.yml up -d --no-build
</code></pre>
<p><strong>2. Elastic instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose-elastic.yml up -d --no-build
</code></pre>
<p><strong>3. OpenTelemetry instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose-elastic-otel.yml up -d --no-build
</code></pre>
<p>After launching the desired version, explore the application at localhost:9000. We also deploy a load generator on localhost:8089, where you can increase the number of concurrent users. Note that the load generator talks directly to the Node.js backend service. If you want to generate RUM data from the JavaScript frontend, you have to manually browse to localhost:9000 and visit a few pages.</p>
<h2>Simulation and failure scenarios</h2>
<p>In the real world, applications are subject to varying conditions, random bugs, and misconfigurations. We've incorporated some of these to mimic potential real-life situations. You can find a list of possible environment variables <a href="https://github.com/elastic/observability-examples#scenario--feature-toggles">here</a>.</p>
<p><strong>Non-instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose.yml up -d

# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose.yml up -d
</code></pre>
<p><strong>Elastic instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose-elastic.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose-elastic.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose-elastic.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose-elastic.yml up -d

# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose-elastic.yml up -d

</code></pre>
<p><strong>OpenTelemetry instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose-elastic-otel.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose-elastic-otel.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose-elastic-otel.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose-elastic-otel.yml up -d


# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose-elastic-otel.yml up -d
</code></pre>
<h2>Mix Elastic and OTel</h2>
<p>Since the repository contains the services in all possible permutations, with the “favorite” service even written in multiple languages, you can also run them in a mixed mode. Elastic and OTel are fully compatible, so you could run some services instrumented with OTel while others run with the Elastic APM Agent. You can even run several variants in parallel, as we do for the “favorite” service.</p>
<p>Take a look at the existing compose files and simply copy one of the snippets for each service type.</p>
<pre><code class="language-yaml">favorite-java-otel-auto:
  build: java-favorite-otel-auto/.
  image: docker.elastic.co/demos/workshop/observability/elastiflix-java-favorite-otel-auto:${ELASTIC_VERSION}-${BUILD_NUMBER}
  depends_on:
    - redis
  networks:
    - app-network
  ports:
    - &quot;5004:5000&quot;
  environment:
    - ELASTIC_APM_SECRET_TOKEN=${ELASTIC_APM_SECRET_TOKEN}
    - OTEL_EXPORTER_OTLP_ENDPOINT=${ELASTIC_APM_SERVER_URL}
    - OTEL_METRICS_EXPORTER=otlp
    - OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
    - OTEL_SERVICE_NAME=java-favorite-otel-auto
    - OTEL_TRACES_EXPORTER=otlp
    - REDIS_HOST=redis
    - TOGGLE_SERVICE_DELAY=${TOGGLE_SERVICE_DELAY}
    - TOGGLE_CANARY_DELAY=${TOGGLE_CANARY_DELAY}
    - TOGGLE_CANARY_FAILURE=${TOGGLE_CANARY_FAILURE}
</code></pre>
<h2>Working with the source code</h2>
<p>The repository contains all possible permutations of each service.</p>
<ul>
<li>Subdirectories are named in the format $language-$serviceName-(elastic|otel)-(auto|manual). For example, <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">python-favorite-otel-auto</a> is the Python version of the “favorite” service, instrumented with OpenTelemetry using auto-instrumentation.</li>
<li>You can now compare this directory to the non-instrumented version of this service available under the directory <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">python-favorite</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-7-code.png" alt="code" /></p>
<p>This allows you to easily understand the difference between the two. You can also start from scratch with the non-instrumented version and try to instrument it yourself.</p>
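<p>To make the naming scheme concrete, here is a small, illustrative helper (not part of the repository; the class name and output format are my own) that decodes a directory name into its parts:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: decode a service directory name of the form
// $language-$serviceName[-(elastic|otel)-(auto|manual)] into its parts.
public class ServiceDirName {
    // Language and service name contain no dashes; the vendor/mode suffix is optional.
    private static final Pattern NAME =
            Pattern.compile("([a-z.]+)-([a-z]+)(?:-(elastic|otel)-(auto|manual))?");

    public static String describe(String dir) {
        Matcher m = NAME.matcher(dir);
        if (!m.matches()) {
            return "unrecognized: " + dir;
        }
        String vendor = (m.group(3) == null) ? "not instrumented" : m.group(3);
        String mode = (m.group(4) == null) ? "" : " (" + m.group(4) + ")";
        return m.group(2) + " service in " + m.group(1) + ", " + vendor + mode;
    }
}
```

<p>For example, describe("python-favorite-otel-auto") yields "favorite service in python, otel (auto)", while describe("python-favorite") reports the base, non-instrumented version.</p>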
<h2>Conclusion</h2>
<p>Monitoring is more than just observing; it's about understanding and optimizing. Our sample application seeks to guide you on your journey with Elastic APM or OpenTelemetry, providing you with the tools to build resilient and high-performing applications.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/email-thumbnail-generic-release-cloud_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Understanding APM: How to add extensions to the OpenTelemetry Java Agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/extensions-opentelemetry-java-agent</link>
            <guid isPermaLink="false">extensions-opentelemetry-java-agent</guid>
            <pubDate>Mon, 24 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides a comprehensive guide for Site Reliability Engineers (SREs) and IT Operations to gain visibility and traceability into applications, especially those written with non-standard frameworks or without access to the source code.]]></description>
            <content:encoded><![CDATA[<h2>Without code access, SREs and IT Operations cannot always get the visibility they need</h2>
<p>As an SRE, have you ever had a situation where you were working on an application that was written with non-standard frameworks, or you wanted to get some interesting business data from an application (number of orders processed for example) but you didn’t have access to the source code?</p>
<p>We all know this can be a challenging scenario resulting in visibility gaps, inability to fully trace code end to end, and missing critical business monitoring data that is useful for understanding the true impact of issues.</p>
<p>How can we solve this? One way is discussed in the following three blogs:</p>
<ul>
<li><a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">How to capture custom metrics without app code changes using the Java Agent Plugin</a></li>
<li><a href="https://www.elastic.co/blog/regression-testing-your-java-agent-plugin">Regression testing your Java Agent Plugin</a></li>
</ul>
<p>In those blogs, we develop a plugin for the Elastic<sup>®</sup> APM Agent to get access to critical business data for monitoring and to add tracing where none exists.</p>
<p>What we will discuss in this blog is how you can do the same with the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry Java Agent</a> using the Extensions framework.</p>
<h2>Basic concepts: How APM works</h2>
<p>Before we continue, let's first understand a few basic concepts and terms.</p>
<ul>
<li><strong>Java Agent:</strong> This is a tool that can be used to instrument (or modify) the bytecode of class files in the Java Virtual Machine (JVM). Java agents are used for many purposes like performance monitoring, logging, security, and more.</li>
<li><strong>Bytecode:</strong> This is the intermediary code generated by the Java compiler from your Java source code. This code is interpreted or compiled on the fly by the JVM to produce machine code that can be executed.</li>
<li><strong>Byte Buddy:</strong> Byte Buddy is a code generation and manipulation library for Java. It is used to create, modify, or adapt Java classes at runtime. In the context of a Java Agent, Byte Buddy provides a powerful and flexible way to modify bytecode. <strong>Both the Elastic APM Agent and the OpenTelemetry Agent use Byte Buddy under the covers.</strong></li>
</ul>
<p><strong>Now, let's talk about how automatic instrumentation works with Byte Buddy:</strong></p>
<p>Automatic instrumentation is the process by which an agent modifies the bytecode of your application's classes, often to insert monitoring code. The agent doesn't modify the source code directly, but rather the bytecode that is loaded into the JVM. This is done while the JVM is loading the classes, so the modifications are in effect during runtime.</p>
<p>Here's a simplified explanation of the process:</p>
<ol>
<li>
<p><strong>Start the JVM with the agent:</strong> When starting your Java application, you specify the Java agent with the -javaagent command line option. This instructs the JVM to load your agent before the main method of your application is invoked. At this point, the agent has the opportunity to set up class transformers.</p>
</li>
<li>
<p><strong>Register a class file transformer with Byte Buddy:</strong> Your agent will register a class file transformer with Byte Buddy. A transformer is a piece of code that is invoked every time a class is loaded into the JVM. This transformer receives the bytecode of the class and it can modify this bytecode before the class is actually used.</p>
</li>
<li>
<p><strong>Transform the bytecode:</strong> When your transformer is invoked, it will use Byte Buddy's API to modify the bytecode. Byte Buddy allows you to specify your transformations in a high-level, expressive way rather than manually writing complex bytecode. For example, you could specify a certain class and method within that class that you want to instrument and provide an &quot;interceptor&quot; that will add new behavior to that method.</p>
</li>
<li>
<p><strong>Use the transformed classes:</strong> Once the agent has set up its transformers, the JVM continues to load classes as usual. Each time a class is loaded, your transformers are invoked, allowing them to modify the bytecode. Your application then uses these transformed classes as if they were the original ones, but they now have the extra behavior that you've injected through your interceptor.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-1-flowchart-process.png" alt="flowchart process" /></p>
<p>In essence, automatic instrumentation with Byte Buddy is about modifying the behavior of your Java classes at runtime, without needing to alter the source code directly. This is especially useful for cross-cutting concerns like logging, monitoring, or security, as it allows you to centralize this code in your Java Agent, rather than scattering it throughout your application.</p>
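<p>To make steps 1 and 2 concrete, here is a minimal agent skeleton that uses only the JDK's java.lang.instrument API (Byte Buddy is deliberately left out to keep the sketch self-contained); the class names and the package filter are illustrative:</p>

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal sketch of the premain/transformer pattern described above.
// Started with: java -javaagent:logging-agent.jar -jar your-app.jar
public class LoggingAgent {

    // The JVM invokes premain before the application's main method runs,
    // giving the agent a chance to register class transformers.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    // Invoked for every class as its bytecode is loaded into the JVM.
    public static class LoggingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null && className.startsWith("org/davidgeorgehope/")) {
                System.out.println("Candidate for instrumentation: " + className);
            }
            // Returning null leaves the bytecode unchanged; a real agent
            // (e.g. via Byte Buddy) would return modified bytecode here.
            return null;
        }
    }
}
```

<p>A real agent JAR also needs a Premain-Class entry in its manifest so the JVM can find the premain method.</p>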
<h2>Application, prerequisites, and config</h2>
<p>There is a really simple application in <a href="https://github.com/davidgeorgehope/custom-instrumentation-examples">this GitHub repository</a> that is used throughout this blog. It simply asks you to input some text and then counts the number of words.</p>
<p>It’s also listed below:</p>
<pre><code class="language-java">package org.davidgeorgehope;
import java.util.Scanner;
import java.util.logging.Logger;

public class Main {
    private static Logger logger = Logger.getLogger(Main.class.getName());

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.println(&quot;Please enter your sentence:&quot;);
            String input = scanner.nextLine();
            Main main = new Main();
            int wordCount = main.countWords(input);
            System.out.println(&quot;The input contains &quot; + wordCount + &quot; word(s).&quot;);
        }
    }
    public int countWords(String input) {

        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

        if (input == null || input.isEmpty()) {
            return 0;
        }

        String[] words = input.split(&quot;\\s+&quot;);
        return words.length;
    }
}
</code></pre>
<p>For the purposes of this blog, we will be using Elastic Cloud to capture the data generated by OpenTelemetry — <a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs#create-an-elastic-cloud-account">follow the instructions here</a> to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p>Once you are started with Elastic Cloud, go grab the OpenTelemetry config from the APM pages:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-2-apm-agents.png" alt="apm agents" /></p>
<p>You will need this later.</p>
<p>Finally, <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases">download the OpenTelemetry Agent</a>.</p>
<h2>Firing up the application and OpenTelemetry</h2>
<p>Build this simple application and run it with the OpenTelemetry Agent like so, filling in the appropriate variables with the values you got earlier.</p>
<pre><code class="language-bash">java -javaagent:opentelemetry-javaagent.jar -Dotel.exporter.otlp.endpoint=XX -Dotel.exporter.otlp.headers=XX -Dotel.metrics.exporter=otlp -Dotel.logs.exporter=otlp -Dotel.resource.attributes=XX -Dotel.service.name=your-service-name -jar simple-java-1.0-SNAPSHOT.jar
</code></pre>
<p>You will find that nothing happens. The reason is that the OpenTelemetry Agent has no way of knowing what to monitor. APM with automatic instrumentation works because the agent “knows” about standard frameworks, like Spring or HTTPClient, and gets visibility by “injecting” trace code into those standard frameworks automatically.</p>
<p>It has no knowledge of org.davidgeorgehope.Main from our simple Java application.</p>
<p>Luckily, there is a way we can add this using the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/extensions/">OpenTelemetry Extensions framework</a>.</p>
<h2>The OpenTelemetry Extension</h2>
<p>In the repository above, aside from the simple-java application, there is also a plugin for Elastic APM and an extension for OpenTelemetry. The relevant files for OpenTelemetry Extension are located <a href="https://github.com/davidgeorgehope/custom-instrumentation-examples/tree/main/opentelemetry-custom-instrumentation/src/main/java/org/davidgeorgehope">here</a> — WordCountInstrumentation.java and WordCountInstrumentationModule.java .</p>
<p>You’ll notice that OpenTelemetry Extensions and Elastic APM Plugins both make use of Byte Buddy, which is a common library for code instrumentation. There are some key differences in the way the code is bootstrapped, though.</p>
<p>The WordCountInstrumentationModule class extends an OpenTelemetry-specific class, InstrumentationModule, whose purpose is to describe a set of TypeInstrumentation instances that need to be applied together to correctly instrument a specific library. The WordCountInstrumentation class is one such TypeInstrumentation.</p>
<p>Type instrumentations grouped in a module share helper classes, muzzle runtime checks, and applicable class loader criteria, and can only be enabled or disabled as a set.</p>
<p>This is a little different from how the Elastic APM Plugin works: with OpenTelemetry, the default method of injecting code is inline, and you can inject dependencies into the core application classloader using the InstrumentationModule configuration (as shown below). The Elastic APM method is safer, as it allows isolation of helper classes and makes it easier to debug with normal IDEs; we are contributing this method to OpenTelemetry. Here we inject the TypeInstrumentation class and the WordCountInstrumentation class into the classloader.</p>
<pre><code class="language-java">@Override
    public List&lt;String&gt; getAdditionalHelperClassNames() {
        return List.of(WordCountInstrumentation.class.getName(),&quot;io.opentelemetry.javaagent.extension.instrumentation.TypeInstrumentation&quot;);
    }
</code></pre>
<p>The other interesting part of the TypeInstrumentation class is the setup.</p>
<p>Here we give our instrumentation “group” a name. An InstrumentationModule needs to have at least one name. The user of the javaagent can suppress a chosen instrumentation by referring to it by one of its names. The instrumentation module names use kebab-case.</p>
<pre><code class="language-java">public WordCountInstrumentationModule() {
        super(&quot;wordcount-demo&quot;, &quot;wordcount&quot;);
    }
</code></pre>
<p>Apart from this, the class has methods to specify the loading order relative to other instrumentations if needed, and to specify the classes that implement TypeInstrumentation and are responsible for the bulk of the instrumentation work.</p>
<p>Now let's take a look at that WordCountInstrumentation class, which implements TypeInstrumentation:</p>
<pre><code class="language-java">// The WordCountInstrumentation class implements the TypeInstrumentation interface.
// This allows us to specify which types of classes (based on some matching criteria) will have their methods instrumented.

public class WordCountInstrumentation implements TypeInstrumentation {

    // The typeMatcher method is used to define which classes the instrumentation should apply to.
    // In this case, it's the &quot;org.davidgeorgehope.Main&quot; class.
    @Override
    public ElementMatcher&lt;TypeDescription&gt; typeMatcher() {
        logger.info(&quot;TEST typeMatcher&quot;);
        return ElementMatchers.named(&quot;org.davidgeorgehope.Main&quot;);
    }

    // In the transform method, we specify which methods of the classes matched above will be instrumented,
    // and also the advice (a piece of code) that will be added to these methods.
    @Override
    public void transform(TypeTransformer typeTransformer) {
        logger.info(&quot;TEST transform&quot;);
        typeTransformer.applyAdviceToMethod(namedOneOf(&quot;countWords&quot;),this.getClass().getName() + &quot;$WordCountAdvice&quot;);
    }

    // The WordCountAdvice class contains the actual pieces of code (advices) that will be added to the instrumented methods.
    @SuppressWarnings(&quot;unused&quot;)
    public static class WordCountAdvice {
        // This advice is added at the beginning of the instrumented method (OnMethodEnter).
        // It creates and starts a new span, and makes it active.
        @Advice.OnMethodEnter(suppress = Throwable.class)
        public static Scope onEnter(@Advice.Argument(value = 0) String input, @Advice.Local(&quot;otelSpan&quot;) Span span) {
            // Get a Tracer instance from OpenTelemetry.
            Tracer tracer = GlobalOpenTelemetry.getTracer(&quot;instrumentation-library-name&quot;,&quot;semver:1.0.0&quot;);
            System.out.print(&quot;Entering method&quot;);

            // Start a new span with the name &quot;mySpan&quot;.
            span = tracer.spanBuilder(&quot;mySpan&quot;).startSpan();

            // Make this new span the current active span.
            Scope scope = span.makeCurrent();

            // Return the Scope instance. This will be used in the exit advice to end the span's scope.
            return scope;
        }

        // This advice is added at the end of the instrumented method (OnMethodExit).
        // It first closes the span's scope, then checks if any exception was thrown during the method's execution.
        // If an exception was thrown, it sets the span's status to ERROR and ends the span.
        // If no exception was thrown, it sets a custom attribute &quot;wordCount&quot; on the span, and ends the span.
        @Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
        public static void onExit(@Advice.Return(readOnly = false) int wordCount,
                                  @Advice.Thrown Throwable throwable,
                                  @Advice.Local(&quot;otelSpan&quot;) Span span,
                                  @Advice.Enter Scope scope) {
            // Close the scope to end it.
            scope.close();

            // If an exception was thrown during the method's execution, set the span's status to ERROR.
            if (throwable != null) {
                span.setStatus(StatusCode.ERROR, &quot;Exception thrown in method&quot;);
            } else {
                // If no exception was thrown, set a custom attribute &quot;wordCount&quot; on the span.
                span.setAttribute(&quot;wordCount&quot;, wordCount);
            }

            // End the span. This makes it ready to be exported to the configured exporter (e.g. Elastic).
            span.end();
        }
    }
}
</code></pre>
<p>The target class for our instrumentation is defined in the typeMatcher method, and the method we want to instrument is defined in the transform method. We are targeting the Main class and the countWords method.</p>
<p>As you can see, we have an inner class here that does most of the work of defining an onEnter and onExit method, which tells us what to do when we enter the countWords method and when we exit the countWords method.</p>
<p>In the onEnter method, we create and start a new OpenTelemetry span, and in the onExit method, we end the span. If the method ends successfully, we also grab the word count and record it in the wordCount span attribute.</p>
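<p>Conceptually, the two advice methods make the instrumented countWords behave as if it had been rewritten like the sketch below. The tiny Span class here is a stand-in for the OpenTelemetry API (so the sketch stays self-contained and runnable); only the try/catch/finally shape and the attribute/status logic mirror the real advice:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the effect of the enter/exit advice pair on countWords.
public class AdviceSketch {

    // Stand-in for io.opentelemetry.api.trace.Span, just enough for the sketch.
    static class Span {
        final Map<String, Object> attributes = new HashMap<>();
        String status = "UNSET";
        boolean ended = false;
        void setAttribute(String key, Object value) { attributes.put(key, value); }
        void setStatus(String s) { status = s; }
        void end() { ended = true; }
    }

    static Span lastSpan; // kept only so the sketch can be inspected

    static int countWordsInstrumented(String input) {
        Span span = new Span();                          // OnMethodEnter: start the span
        lastSpan = span;
        try {
            int wordCount = (input == null || input.isEmpty())
                    ? 0
                    : input.split("\\s+").length;        // original method body
            span.setAttribute("wordCount", wordCount);   // OnMethodExit, no throwable
            return wordCount;
        } catch (RuntimeException e) {
            span.setStatus("ERROR");                     // OnMethodExit with throwable
            throw e;
        } finally {
            span.end();                                  // the span is always ended
        }
    }
}
```

<p>In the real extension, Byte Buddy weaves the equivalent of this wrapping into the bytecode at class-load time, so the source of Main never changes.</p>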
<p>Now let's take a look at what happens when we run this. The good news is that we have made this extremely simple by providing a Dockerfile that does all the work for you.</p>
<h2>Pulling this all together</h2>
<p><a href="https://github.com/davidgeorgehope/custom-instrumentation-examples/tree/main">Clone the GitHub repository</a> if you have not already done so, and before continuing, let’s take a quick look at the Dockerfile we are using.</p>
<pre><code class="language-dockerfile"># Build stage
FROM maven:3.8.7-openjdk-18 as build

COPY simple-java /home/app/simple-java
COPY opentelemetry-custom-instrumentation /home/app/opentelemetry-custom-instrumentation

WORKDIR /home/app/simple-java
RUN mvn install

WORKDIR /home/app/opentelemetry-custom-instrumentation
RUN mvn install

# Package stage
FROM maven:3.8.7-openjdk-18
COPY --from=build /home/app/simple-java/target/simple-java-1.0-SNAPSHOT.jar /usr/local/lib/simple-java-1.0-SNAPSHOT.jar
COPY --from=build /home/app/opentelemetry-custom-instrumentation/target/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar /usr/local/lib/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar

WORKDIR /

RUN curl -L -o opentelemetry-javaagent.jar https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT [&quot;/start.sh&quot;]
</code></pre>
<p>This Dockerfile works in two parts: during the Docker build process, we build the simple-java application from source, followed by the custom instrumentation, and then download the latest OpenTelemetry Java Agent. At runtime, we simply execute the start.sh file described below:</p>
<pre><code class="language-bash">#!/bin/sh
java \
-javaagent:/opentelemetry-javaagent.jar \
-Dotel.exporter.otlp.endpoint=${SERVER_URL} \
-Dotel.exporter.otlp.headers=&quot;Authorization=Bearer ${SECRET_KEY}&quot; \
-Dotel.metrics.exporter=otlp \
-Dotel.logs.exporter=otlp \
-Dotel.resource.attributes=service.name=simple-java,service.version=1.0,deployment.environment=production \
-Dotel.service.name=your-service-name \
-Dotel.javaagent.extensions=/usr/local/lib/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar \
-Dotel.javaagent.debug=true \
-jar /usr/local/lib/simple-java-1.0-SNAPSHOT.jar
</code></pre>
<p>There are two important things to note in this script. The first is that we set the -javaagent option to the opentelemetry-javaagent.jar; this starts the OpenTelemetry Java agent, which runs before any application code is executed.</p>
<p>Inside this JAR there has to be a class with a premain method, which the JVM looks for; this bootstraps the Java agent. As described above, every class that is loaded is essentially filtered through the agent code, so the agent can modify the bytecode before it is executed.</p>
<p>The second important thing is the otel.javaagent.extensions setting, which loads the extension we built to add instrumentation for our simple-java application.</p>
<p>Now run the following commands:</p>
<pre><code class="language-bash">docker build -t djhope99/custom-otel-instrumentation:1 .
docker run -it -e 'SERVER_URL=XXX' -e 'SECRET_KEY=XX' djhope99/custom-otel-instrumentation:1
</code></pre>
<p>If you use the SERVER_URL and SECRET_KEY you obtained earlier, you should see this connect to Elastic.</p>
<p>When it starts up, it will ask you to enter a sentence. Type a few sentences, pressing enter after each one. Do this a few times; there is a sleep in the code to force a long-running transaction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-3-codeblack.png" alt="code" /></p>
<p>Eventually you will see the service show up in the service map:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-4-services.png" alt="services" /></p>
<p>Traces will appear:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-5-your-service-name.png" alt="service name" /></p>
<p>And in the span you will see the wordcount attribute we collected:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-6-transaction-details.png" alt="transaction details" /></p>
<p>This attribute can be used for further dashboarding and AI/ML, including anomaly detection if you need it, and as you can see below, this is easy to do.</p>
<p>First click on the burger on the left side and select <strong>Dashboard</strong> to create a new dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-7-manage-deployment-analytics.png" alt="analytics" /></p>
<p>From here, click <strong>Create Visualization</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-8-visualization.png" alt="visualization" /></p>
<p>Search for the wordcount label in the APM index as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-9-dashboard-word.png" alt="dashboard" /></p>
<p>As you can see, because we created this attribute in the Span code as below with wordCount as a type “Integer,” we were able to automatically assign it as a numeric field in Elastic:</p>
<pre><code class="language-javascript">span.setAttribute(&quot;wordCount&quot;, wordCount);
</code></pre>
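<p>For illustration, the word-count value itself can be computed with plain Java along these lines (a sketch; the class and method names are ours, and in the real extension the value is recorded on the active span via setAttribute):</p>
<pre><code class="language-java">public class WordCount {
  // Count whitespace-separated words; because the value is an int, Elastic
  // maps the resulting span attribute to a numeric field automatically.
  static int countWords(String sentence) {
    String trimmed = sentence.trim();
    if (trimmed.isEmpty()) {
      return 0;
    }
    return trimmed.split(&quot;\\s+&quot;).length;
  }

  public static void main(String[] args) {
    int wordCount = countWords(&quot;hello from Elastic Observability&quot;);
    System.out.println(wordCount); // 4
    // In the extension: span.setAttribute(&quot;wordCount&quot;, wordCount);
  }
}
</code></pre>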
<p>From here we can drag and drop it into the visualization for display on our Dashboard! Super easy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-10-drag-drop.png" alt="drag and drop" /></p>
<h2>In conclusion</h2>
<p>This blog showed the valuable role the OpenTelemetry Java agent can play in filling visibility gaps and obtaining crucial business monitoring data, especially when access to the source code is not feasible.</p>
<p>We covered the basics of Java agents, bytecode, and Byte Buddy, followed by a closer look at how automatic instrumentation works with Byte Buddy.</p>
<p>We then demonstrated the OpenTelemetry Java Agent's extensions framework with a simple Java application, which underscored the agent's ability to inject trace code into the application to facilitate monitoring.</p>
<p>Finally, we detailed how to configure the agent and integrate an OpenTelemetry extension, and walked through running a sample application to put the pieces together in practice. We hope this is a useful resource for SREs and IT operations teams looking to get more from OpenTelemetry's automatic instrumentation.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future proof your observability platform with OpenTelemetry and Elastic</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/flexible-implementation-1680X980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to easily add application monitoring in Kubernetes pods]]></title>
            <link>https://www.elastic.co/observability-labs/blog/application-monitoring-kubernetes-pods</link>
            <guid isPermaLink="false">application-monitoring-kubernetes-pods</guid>
            <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog walks through installing the Elastic APM K8s Attacher and shows how to configure your system for both common and non-standard deployments of Elastic APM agents.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic® APM K8s Attacher</a> allows auto-installation of Elastic APM application agents (e.g., the Elastic APM Java agent) into applications running in your Kubernetes clusters. The mechanism uses a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, which is a standard Kubernetes component, but you don’t need to know all the details to use the Attacher. Essentially, you can install the Attacher, add one annotation to any Kubernetes deployment that has an application you want monitored, and that’s it!</p>
<p>In this blog, we’ll walk through a full example from scratch using a Java application. Apart from the Java code and using a JVM for the application, everything else works the same for the other languages supported by the Attacher.</p>
<h2>Prerequisites</h2>
<p>This walkthrough assumes that the following are already installed on the system: JDK 17, Docker, Kubernetes, and Helm.</p>
<h2>The example application</h2>
<p>While the application (shown below) is a Java application, it could easily be implemented in any language: it is just a simple loop that every 2 seconds calls the method chain methodA-&gt;methodB-&gt;methodC-&gt;methodD, with methodC sleeping for 10 milliseconds and methodD sleeping for 200 milliseconds. This application was chosen simply so we can clearly see in the Elastic APM UI that it is being monitored.</p>
<p>The Java application in full is shown here:</p>
<pre><code class="language-java">package test;

public class Testing implements Runnable {

  public static void main(String[] args) {
    new Thread(new Testing()).start();
  }

  public void run()
  {
    while(true) {
      try {Thread.sleep(2000);} catch (InterruptedException e) {}
      methodA();
    }
  }

  public void methodA() {methodB();}

  public void methodB() {methodC();}

  public void methodC() {
    System.out.println(&quot;methodC executed&quot;);
    try {Thread.sleep(10);} catch (InterruptedException e) {}
    methodD();
  }

  public void methodD() {
    System.out.println(&quot;methodD executed&quot;);
    try {Thread.sleep(200);} catch (InterruptedException e) {}
  }
}
</code></pre>
<p>We created a Docker image containing that simple Java application for you that can be pulled from the following Docker repository:</p>
<pre><code class="language-bash">docker.elastic.co/demos/apm/k8s-webhook-test
</code></pre>
<h2>Deploy the pod</h2>
<p>First we need a deployment config. We’ll call the config file webhook-test.yaml, and the contents are pretty minimal — just pull the image and run that as a pod &amp; container called webhook-test in the default namespace:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>This can be deployed normally using kubectl:</p>
<pre><code class="language-yaml">kubectl apply -f webhook-test.yaml
</code></pre>
<p>The result is exactly as expected:</p>
<pre><code class="language-bash">$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
webhook-test   1/1     Running   0          10s

$ kubectl logs webhook-test
methodC executed
methodD executed
methodC executed
methodD executed
</code></pre>
<p>So far, this is just setting up a standard Kubernetes application with no APM monitoring. Now we get to the interesting bit: adding in auto-instrumentation.</p>
<h2>Install Elastic APM K8s Attacher</h2>
<p>The first step is to install the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a>. This only needs to be done once for the cluster — once installed, it is always available. Before installation, we will define where the monitored data will go. As you will see later, we can decide or change this any time. For now, we’ll specify our own Elastic APM server, which is at <a href="https://myserver.somecloud:443">https://myserver.somecloud:443</a> — we also have a secret token for authorization to that Elastic APM server, which has value MY_SECRET_TOKEN. (If you want to set up a quick test Elastic APM server, you can do so at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a>).</p>
<p>There are two additional environment variables set for the application that are not generally needed but will help when we see the resulting UI content toward the end of the walkthrough (when the agent is auto-installed, these two variables tell the agent what name to give this application in the UI and what method to trace). Now we just need to define the custom yaml file to hold these. On installation, the custom yaml will be merged into the yaml for the Attacher:</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
</code></pre>
<p>That custom.yaml file is all we need to install the attacher (note we’ve only specified the default namespace for agent auto-installation for now — this can be easily changed, as you’ll see later). Next we’ll add the Elastic charts to helm — this only needs to be done once, then all Elastic charts are available to helm. This is the usual helm add repo command, specifically:</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co
</code></pre>
<p>Now the Elastic charts are available for installation (helm search repo would show you all the available charts). We’re going to use “elastic-webhook” as the helm release name, resulting in the following installation command:</p>
<pre><code class="language-bash">helm install elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>And that’s it, we now have the Elastic APM K8s Attacher installed and set to send data to the APM server defined in the custom.yaml file! (You can confirm installation with a helm list -A if you need.)</p>
<h2>Auto-install the Java agent</h2>
<p>The Elastic APM K8s Attacher is installed, but it doesn’t auto-install the APM application agents into every pod — that could lead to problems! Instead, the Attacher is deliberately limited to auto-installing agents into deployments selected by a) the namespaces listed in the custom.yaml, and b) a specific annotation, “co.elastic.apm/attach,” on deployments in those namespaces.</p>
<p>So for now, restarting the webhook-test pod we created above won’t change its behavior, as it isn’t yet set to be monitored. What we need to do is add the annotation, referencing the default agent configuration that was installed with the Attacher, which is called “java” for the Java agent. (We’ll see later how an agent configuration can be altered; the default configuration installs the latest agent version and leaves everything else at that version’s defaults.) Adding that annotation to the webhook-test yaml gives us the new yaml file contents (the additional config is labelled (1)):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  annotations: #(1)
    co.elastic.apm/attach: java #(1)
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>Applying this change gives us the application now monitored:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.45.0 …
</code></pre>
<p>And since the agent is now feeding data to our APM server, we can now see it in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/webhook-test-k8s-blog.png" alt="webhook-test" /></p>
<p>Note that the agent identifies Testing.methodB method as a trace root because of the ELASTIC_APM_TRACE_METHODS environment variable set to test.Testing#methodB in the custom.yaml — this tells the agent to specifically trace that method. The time taken by that method will be available in the UI for each invocation, but we don’t see the sub-methods . . . currently. In the next section, we’ll see how easy it is to customize the Attacher, and in doing so we’ll see more detail about the method chain being executed in the application.</p>
<h2>Customizing the agents</h2>
<p>In your systems, you’ll likely have development, testing, and production environments. You’ll want to specify the version of the agent to use rather than just pulling the latest, you’ll want debug output on for some applications or instances, and you’ll want specific options set to specific values. This sounds like a lot of effort, but the Attacher lets you make these kinds of changes in a very simple way. In this section, we’ll add a configuration that covers all of these changes so you can see just how easy it is to configure and enable.</p>
<p>We start at the custom.yaml file we defined above. This is the file that gets merged into the Attacher. Adding a new configuration with all the items listed in the last paragraph is easy — though first we need to decide a name for our new configuration. We’ll call it “java-interesting” here. The new custom.yaml in full is (the first part is just the same as before, the new config is simply appended):</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
    java-interesting:
      image: docker.elastic.co/observability/apm-agent-java:1.55.4
      artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
        ELASTIC_APM_ENVIRONMENT: &quot;testing&quot;
        ELASTIC_APM_LOG_LEVEL: &quot;debug&quot;
        ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED: &quot;true&quot;
        JAVA_TOOL_OPTIONS: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot;
</code></pre>
<p>Breaking the additional config down, we have:</p>
<ul>
<li>
<p>The name of the new config java-interesting</p>
</li>
<li>
<p>The APM Java agent image docker.elastic.co/observability/apm-agent-java</p>
<ul>
<li>With a specific version 1.55.4 instead of latest</li>
</ul>
</li>
<li>
<p>We need to specify the agent jar location (the attacher puts it here)</p>
<ul>
<li>artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;</li>
</ul>
</li>
<li>
<p>And then the environment variables</p>
</li>
<li>
<p>ELASTIC_APM_SERVER_URL as before</p>
</li>
<li>
<p>ELASTIC_APM_ENVIRONMENT set to testing, which is useful when filtering in the UI</p>
</li>
<li>
<p>ELASTIC_APM_LOG_LEVEL set to debug for more detailed agent output</p>
</li>
<li>
<p>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED turning this on (setting to true) will give us additional interesting information about the method chain being executed in the application</p>
</li>
<li>
<p>And lastly we need to set JAVA_TOOL_OPTIONS to &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot; to enable starting the agent — this is fundamentally how the attacher auto-attaches the Java agent</p>
</li>
</ul>
<p>More configurations and details about configuration options are <a href="https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html">here for the Java agent</a>, and <a href="https://www.elastic.co/guide/en/apm/agent/index.html">other language agents</a> are also available.</p>
<h2>The application traced with the new configuration</h2>
<p>And finally we just need to upgrade the attacher with the changed custom.yaml:</p>
<pre><code class="language-bash">helm upgrade elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>This is the same command as the original install, but now using upgrade. That’s it — add config to the custom.yaml and upgrade the attacher, and it’s done! Simple.</p>
<p>Of course we still need to use the new config on an app. In this case, we’ll edit the existing webhook-test.yaml file, replacing java with java-interesting, so the annotation line is now:</p>
<pre><code class="language-yaml">co.elastic.apm/attach: java-interesting
</code></pre>
<p>Applying the new pod config and restarting the pod, you can see the logs now hold debug output:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.44.0 …
… DEBUG co.elastic.apm.agent. …
… DEBUG co.elastic.apm.agent. …
</code></pre>
<p>More interesting is the UI. Now that inferred spans are enabled, the full method chain is visible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/trace-sample-k8s-blog.png" alt="trace sample" /></p>
<p>This gives the details for methodB (it takes 211 milliseconds because it calls methodC, which sleeps for 10ms and in turn calls methodD, which sleeps for 200ms). The times for methodC and methodD are inferred rather than recorded; if you needed accurate times, you would instead add those methods to trace_methods and have them traced too.</p>
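<p>If you did want recorded rather than inferred times for the sub-methods, the environment block in custom.yaml could list them explicitly. This is a sketch: trace_methods takes a comma-separated list of methods to instrument, and the method names here match our example application.</p>
<pre><code class="language-yaml">ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB,test.Testing#methodC,test.Testing#methodD&quot;
</code></pre>
<p>After changing custom.yaml, run the same helm upgrade command and restart the pod for the change to take effect.</p>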
<h2>Note on the ECK operator</h2>
<p>The <a href="https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-overview.html">Elastic Cloud on Kubernetes operator</a> allows you to install and manage a number of other Elastic components on Kubernetes. At the time of publication of this blog, the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a> is a separate component, and there is no conflict between these management mechanisms — they apply to different components and are independent of each other.</p>
<h2>Try it yourself!</h2>
<p>This walkthrough is easily repeated on your system, and you can make it more useful by replacing the example application with your own and the Docker registry with the one you use.</p>
<p><a href="https://www.elastic.co/observability/kubernetes-monitoring">Learn more about real-time monitoring with Kubernetes and Elastic Observability</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/139689_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to deploy Hello World Elastic Observability on Google Cloud Run]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-observability-google-cloud-run</link>
            <guid isPermaLink="false">deploy-observability-google-cloud-run</guid>
            <pubDate>Mon, 28 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on Google Cloud Run.]]></description>
<content:encoded><![CDATA[<p>Elastic Cloud Observability is the premier tool to provide visibility into your running web apps. Google Cloud Run is the serverless platform of choice to run your web apps that need to scale up massively and scale down to zero. Elastic Observability combined with Google Cloud Run is the perfect solution for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>, in a way that’s straightforward to implement and manage.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to Cloud Run and then walk you through the steps to instrument the Hello World web app to enable observation of the application’s operations with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-1-create-deployment.png" alt="create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic&lt;sup&gt;®&lt;/sup&gt; deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the <strong>Kibana</strong> row of links, click on <strong>Open</strong> to open <strong>Kibana</strong> for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-2-my-deployment.png" alt="my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-3-apm.png" alt="apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-4-apm-agents.png" alt="apm agents" /></p>
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our Google Cloud project for deploying apps to Cloud Run.</p>
<h2>Google Cloud Run setup</h2>
<p>First we’ll need a Google Cloud project, so let’s create one by going to the <a href="https://console.cloud.google.com">Google Cloud console</a> and creating a new project. Select the project menu and then click the <strong>New Project</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-5-google-cloud-gray-dropdown.png" alt="google cloud with gray dropdown" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-5-select-a-project.png" alt="select a project" /></p>
<p>Once the new project is created, we’ll need to enable the necessary APIs that our Hello World app will be using. This can be done by clicking this <a href="https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com,,run.googleapis.com,containerregistry.googleapis.com,cloudbuild.googleapis.com">enable APIs</a> link, which opens a page in the Google Cloud console that lists the APIs that will be enabled and allows us to confirm their activation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-6-enable-apis.png" alt="enable apis" /></p>
<p>After we’ve enabled the necessary APIs, we’ll need to set up the required permissions for our Hello World app, which can be done in the <a href="https://console.cloud.google.com/iam-admin">IAM section</a> of the Google Cloud Console. Within the IAM section, select the <strong>Compute Engine</strong> default service account and add the following roles:</p>
<ul>
<li>Logs Viewer</li>
<li>Monitoring Viewer</li>
<li>Pub/Sub Subscriber</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-7-principals.png" alt="principals" /></p>
<h2>Deploy a Hello World web app to Cloud Run</h2>
<p>We’ll perform the process of deploying a Node.js Hello World web app to Cloud Run using the handy Google Cloud tool called <a href="https://console.cloud.google.com/cloudshelleditor">Cloud Shell Editor</a>. To deploy the Hello World app, we’ll perform the following five steps:</p>
<ol>
<li>In Cloud Shell Editor, in the terminal window that appears at the bottom of the screen, clone a <a href="https://github.com/elastic/observability-examples/tree/main/gcp/run/helloworld">Node.js Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="2">
<li>Change directory to the location of the Hello World web app code.</li>
</ol>
<pre><code class="language-bash">cd gcp/run/helloworld
</code></pre>
<ol start="3">
<li>Build the Hello World app image and push the image to Google Container Registry by running the command below in the terminal. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud builds submit --tag gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="4">
<li>Deploy the Hello World app to Google Cloud Run by running the command below. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud run deploy elastic-helloworld --image gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="5">
<li>When the deployment process is complete, a Service URL will be displayed within the terminal. Copy and paste the Service URL in a browser to view the Hello World app running in Cloud Run.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-8-hello-world.png" alt="hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in Cloud Run, we’re now ready to add the minimal code necessary to start monitoring the app. To enable observability for the Hello World app in Elastic Cloud, we’ll perform the following six steps:</p>
<ol>
<li>In the Google Cloud Shell Editor, edit the Dockerfile file to add the following OpenTelemetry environment variables along with the commands to install and run the OpenTelemetry Node.js instrumentation. Replace the ELASTIC_APM_SERVER_URL text and the ELASTIC_APM_SECRET_TOKEN text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step.</li>
</ol>
<pre><code class="language-dockerfile">ENV OTEL_EXPORTER_OTLP_ENDPOINT='ELASTIC_APM_SERVER_URL'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ELASTIC_APM_SECRET_TOKEN'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp
RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node
CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<p>The updated Dockerfile should look something like this:</p>
<pre><code class="language-dockerfile">FROM node:18-slim
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm install --only=production
COPY . ./
ENV OTEL_EXPORTER_OTLP_ENDPOINT='https://******.apm.us-central1.gcp.cloud.es.io:443'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ******************'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp
RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node
CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<ol start="2">
<li>In the Google Cloud Shell Editor, edit the package.json file to add the Elastic APM dependency. The dependencies section in package.json should look something like this:</li>
</ol>
<pre><code class="language-json">&quot;dependencies&quot;: {
  	&quot;express&quot;: &quot;^4.18.2&quot;,
  	&quot;elastic-apm-node&quot;: &quot;^3.49.1&quot;
  },
</code></pre>
<ol start="3">
<li>In the Google Cloud Shell Editor, edit the index.js file:</li>
</ol>
<ul>
<li>Add the code required to initialize the OpenTelemetry tracer:</li>
</ul>
<pre><code class="language-javascript">const otel = require(&quot;@opentelemetry/api&quot;);
const tracer = otel.trace.getTracer(&quot;hello-world&quot;);
</code></pre>
<ul>
<li>Replace the “Hello World!” output code...</li>
</ul>
<pre><code class="language-javascript">res.send(`&lt;h1&gt;Hello World!&lt;/h1&gt;`);
</code></pre>
<p>...with the “Hello Elastic Observability” code block.</p>
<pre><code class="language-javascript">res.send(
  `&lt;div style=&quot;text-align: center;&quot;&gt;
   &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   Hello Elastic Observability - Google Cloud Run - Node.js
   &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
   &lt;/div&gt;`
);
</code></pre>
<ul>
<li>Add a trace “hi” before the “Hello Elastic Observability” code block and add a trace “bye” after the “Hello Elastic Observability” code block.</li>
</ul>
<pre><code class="language-javascript">tracer.startActiveSpan(&quot;hi&quot;, (span) =&gt; {
  console.log(&quot;hello&quot;);
  span.end();
});
res.send(
  `&lt;div style=&quot;text-align: center;&quot;&gt;
   &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   Hello Elastic Observability - Google Cloud Run - Node.js
   &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
   &lt;/div&gt;`
);
tracer.startActiveSpan(&quot;bye&quot;, (span) =&gt; {
  console.log(&quot;goodbye&quot;);
  span.end();
});
</code></pre>
<ul>
<li>The completed index.js file should look something like this:</li>
</ul>
<pre><code class="language-javascript">const otel = require(&quot;@opentelemetry/api&quot;);
const tracer = otel.trace.getTracer(&quot;hello-world&quot;);

const express = require(&quot;express&quot;);
const app = express();

app.get(&quot;/&quot;, (req, res) =&gt; {
  tracer.startActiveSpan(&quot;hi&quot;, (span) =&gt; {
    console.log(&quot;hello&quot;);
    span.end();
  });
  res.send(
    `&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    Hello Elastic Observability - Google Cloud Run - Node.js
    &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
    &lt;/div&gt;`
  );
  tracer.startActiveSpan(&quot;bye&quot;, (span) =&gt; {
    console.log(&quot;goodbye&quot;);
    span.end();
  });
});

const port = parseInt(process.env.PORT) || 8080;
app.listen(port, () =&gt; {
  console.log(`helloworld: listening on port ${port}`);
});
</code></pre>
<ol start="4">
<li>Rebuild the Hello World app image and push the image to the Google Container Registry by running the command below in the terminal. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud builds submit --tag gcr.io/your-project-id/elastic-helloworld
</code></pre>
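<p>If you'd rather not type the project ID by hand, a small sketch derives the build command from a variable; with the gcloud CLI configured you could populate it from your active configuration instead. The command is echoed as a dry run:</p>

```shell
# Hard-coded here so the sketch runs anywhere; with gcloud configured you
# could instead use: PROJECT_ID=$(gcloud config get-value project)
PROJECT_ID="your-project-id"

# Echoed as a dry run; remove "echo" to run the build for real.
echo gcloud builds submit --tag "gcr.io/$PROJECT_ID/elastic-helloworld"
```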
<ol start="5">
<li>Redeploy the Hello World app to Google Cloud Run by running the command below. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud run deploy elastic-helloworld --image gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="6">
<li>When the deployment process is complete, a Service URL will be displayed within the terminal. Copy and paste the Service URL in a browser to view the updated Hello World app running in Cloud Run.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-9-elastic-logo.png" alt="elastic logo" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>GET /</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the <strong>Trace Sample</strong> section to see the <strong>GET /</strong>, <strong>hi</strong>, and <strong>bye</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-10-trace-sample.png" alt="trace sample" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the entire process of deploying a web app to Google Cloud Run that is instrumented with Elastic Observability. The end result is a web app that will scale up and down with demand combined with the observability tools to monitor the web app as it serves a single user or millions of users.</p>
<p>Now that you’ve seen how to deploy a serverless web app instrumented with observability, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about how to implement a complete observability solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/google-cloud">Getting started with Elastic on Google Cloud</a> for more examples of how you can drive the data insights you need by combining <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a> and cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/illustration-dev-sec-ops-cloud-automations-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to deploy a Hello World web app with Elastic Observability on Azure Container Apps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-app-observability-azure-container-apps</link>
            <guid isPermaLink="false">deploy-app-observability-azure-container-apps</guid>
            <pubDate>Mon, 23 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on Azure Container Apps.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability is the optimal tool to provide visibility into your running web apps. Microsoft Azure Container Apps is a fully managed environment that enables you to run containerized applications on a serverless platform, so your applications scale up and down with demand. This lets you serve every customer’s need for availability while operating as efficiently as possible.</p>
<p>Using Elastic Observability and Azure Container Apps is a perfect combination for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to Azure Container Apps and then walk you through the steps to instrument the Hello World web app to enable observation of the application’s operations with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-1-create-deployment.png" alt="create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic® deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page, which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the <strong>Kibana</strong> row of links, click on <strong>Open</strong> to open Kibana® for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-2-my-deployment.png" alt="my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-3-apm.png" alt="apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-4-apm-agents.png" alt="apm agents" /></p>
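<p>The two saved values are used verbatim later on: the serverUrl becomes the OTLP endpoint, and the secretToken becomes a bearer token in the OTLP headers. A quick sketch with placeholder values shows the mapping:</p>

```shell
# Placeholder values; substitute the serverUrl and secretToken you copied.
APM_SERVER_URL="https://example.apm.us-east-2.aws.elastic-cloud.com:443"
APM_SECRET_TOKEN="example-token"

# These are the two environment variables the app's Dockerfile will set.
echo "OTEL_EXPORTER_OTLP_ENDPOINT=$APM_SERVER_URL"
echo "OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer $APM_SECRET_TOKEN"
```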
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our account in Azure for deploying apps to the Container Apps service.</p>
<h2>Azure Container Apps setup</h2>
<p>First we’ll need an Azure account, so let’s create one by going to the <a href="https://azure.microsoft.com">Microsoft Azure portal</a> and creating a new project. Click the <strong>Start free</strong> button and follow the steps to sign in or create a new account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-5-azure-start-free.png" alt="azure start free" /></p>
<h2>Deploy a Hello World web app to Container Apps</h2>
<p>We’ll deploy a C# Hello World web app to Container Apps using the handy Azure tool called <a href="https://azure.microsoft.com/en-us/get-started/azure-portal/cloud-shell">Cloud Shell</a>. To deploy the Hello World app, we’ll perform the following 12 steps:</p>
<ol>
<li>From the <a href="https://portal.azure.com/">Azure portal</a>, click the Cloud Shell icon at the top of the portal to open Cloud Shell…</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-6-cloud-shell.png" alt="cloud shell" /></p>
<p>… and when the Cloud Shell first opens, select <strong>Bash</strong> as the shell type to use.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-7-bash.png" alt="bash" /></p>
<ol start="2">
<li>If you’re prompted that “You have no storage mounted,” then click the <strong>Create storage</strong> button to create a file store to be used for saving and editing files from Cloud Shell.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-8-create-storage.png" alt="create storage" /></p>
<ol start="3">
<li>In Cloud Shell, clone a <a href="https://github.com/elastic/observability-examples/tree/main/azure/container-apps/helloworld">C# Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="4">
<li>Change directory to the location of the Hello World web app code.</li>
</ol>
<pre><code class="language-bash">cd observability-examples/azure/container-apps/helloworld
</code></pre>
<ol start="5">
<li>Define the environment variables that we’ll be using in the commands throughout this blog post.</li>
</ol>
<pre><code class="language-bash">RESOURCE_GROUP=&quot;helloworld-containerapps&quot;
LOCATION=&quot;centralus&quot;
ENVIRONMENT=&quot;env-helloworld-containerapps&quot;
APP_NAME=&quot;elastic-helloworld&quot;
</code></pre>
<ol start="6">
<li>Define a unique container registry name by running the following command.</li>
</ol>
<pre><code class="language-bash">ACR_NAME=&quot;helloworld&quot;$RANDOM
</code></pre>
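<p>Container registry names must be globally unique across Azure because they become part of the registry's DNS name. Bash's $RANDOM expands to an integer between 0 and 32767, so the command yields names like helloworld12345 (the exact suffix differs on each run):</p>

```shell
# $RANDOM is re-evaluated on every expansion, giving a fresh 0-32767 suffix.
ACR_NAME="helloworld"$RANDOM
echo "$ACR_NAME"
```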
<ol start="7">
<li>Create an Azure resource group by running the following command.</li>
</ol>
<pre><code class="language-bash">az group create --name $RESOURCE_GROUP --location &quot;$LOCATION&quot;
</code></pre>
<ol start="8">
<li>Run the following command to create a container registry in Azure Container Registry.</li>
</ol>
<pre><code class="language-bash">az acr create --resource-group $RESOURCE_GROUP \
--name $ACR_NAME --sku Basic --admin-enabled true
</code></pre>
<ol start="9">
<li>Build the app image and push it to Azure Container Registry by running the following command.</li>
</ol>
<pre><code class="language-bash">az acr build --registry $ACR_NAME --image $APP_NAME .
</code></pre>
<ol start="10">
<li>Register the Microsoft.OperationalInsights namespace as a provider by running the following command.</li>
</ol>
<pre><code class="language-bash">az provider register -n Microsoft.OperationalInsights --wait
</code></pre>
<ol start="11">
<li>Run the following command to create a Container Apps environment to deploy your app into.</li>
</ol>
<pre><code class="language-bash">az containerapp env create --name $ENVIRONMENT \
--resource-group $RESOURCE_GROUP --location &quot;$LOCATION&quot;
</code></pre>
<ol start="12">
<li>Create a new Container App by deploying the Hello World app’s image to Container Apps, using the following command.</li>
</ol>
<pre><code class="language-bash">az containerapp create \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --environment $ENVIRONMENT \
  --image $ACR_NAME.azurecr.io/$APP_NAME \
  --target-port 3500 \
  --ingress 'external' \
  --registry-server $ACR_NAME.azurecr.io \
  --query properties.configuration.ingress.fqdn
</code></pre>
<p>This command will output the deployed Hello World app's fully qualified domain name (FQDN). Copy and paste the FQDN into a browser to see your running Hello World app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-9-hello-world.png" alt="hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in Container Apps, we’re now ready to add the minimal code necessary to enable observability for the Hello World app in Elastic Cloud. We’ll perform the following eight steps:</p>
<ol>
<li>In Azure Cloud Shell, create a new file named Telemetry.cs by typing the following command.</li>
</ol>
<pre><code class="language-bash">touch Telemetry.cs
</code></pre>
<ol start="2">
<li>Open the Azure Cloud Shell file editor by typing the following command in Cloud Shell.</li>
</ol>
<pre><code class="language-bash">code .
</code></pre>
<ol start="3">
<li>In the Azure Cloud Shell editor, open the Telemetry.cs file and paste in the following code. Save the edited file in Cloud Shell by pressing the [Ctrl] + [s] keys on your keyboard (or if you’re on a macOS computer, use the [⌘] + [s] keys). This class file is used to create a tracer ActivitySource, which can generate trace Activity spans for observability.</li>
</ol>
<pre><code class="language-csharp">using System.Diagnostics;

public static class Telemetry
{
	public static readonly ActivitySource activitySource = new(&quot;Helloworld&quot;);
}
</code></pre>
<ol start="4">
<li>In the Azure Cloud Shell editor, edit the file named Dockerfile to add the following Elastic OpenTelemetry environment variables. Replace the ELASTIC_APM_SERVER_URL text and the ELASTIC_APM_SECRET_TOKEN text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step.</li>
</ol>
<p>Save the edited file in Cloud Shell by pressing the [Ctrl] + [s] keys on your keyboard (or if you’re on a macOS computer, use the [⌘] + [s] keys).</p>
<p>The updated Dockerfile should look something like this:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;helloworld.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./helloworld.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;helloworld.csproj&quot; -c Release -o /app/build

FROM build AS publish
RUN dotnet publish &quot;helloworld.csproj&quot; -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
EXPOSE 3500
ENV ASPNETCORE_URLS=http://+:3500

ENV OTEL_EXPORTER_OTLP_ENDPOINT='https://******.apm.us-east-2.aws.elastic-cloud.com:443'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ***********'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp

ENTRYPOINT [&quot;dotnet&quot;, &quot;helloworld.dll&quot;]
</code></pre>
<ol start="5">
<li>In the Azure Cloud Shell editor, edit the helloworld.csproj file to add the Elastic APM and OpenTelemetry dependencies. The updated helloworld.csproj file should look something like this:</li>
</ol>
<pre><code class="language-xml">
&lt;Project Sdk=&quot;Microsoft.NET.Sdk.Web&quot;&gt;

  &lt;PropertyGroup&gt;
	&lt;TargetFramework&gt;net7.0&lt;/TargetFramework&gt;
	&lt;Nullable&gt;enable&lt;/Nullable&gt;
	&lt;ImplicitUsings&gt;enable&lt;/ImplicitUsings&gt;
  &lt;/PropertyGroup&gt;
  &lt;ItemGroup&gt;
	&lt;PackageReference Include=&quot;Elastic.Apm&quot; Version=&quot;1.24.0&quot; /&gt;
	&lt;PackageReference Include=&quot;Elastic.Apm.NetCoreAll&quot; Version=&quot;1.24.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Exporter.Console&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Exporter.OpenTelemetryProtocol&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Extensions.Hosting&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Instrumentation.AspNetCore&quot; Version=&quot;1.5.0-beta.1&quot; /&gt;
  &lt;/ItemGroup&gt;

&lt;/Project&gt;
</code></pre>
<ol start="6">
<li>In the Azure Cloud Shell editor, edit the Program.cs file:</li>
</ol>
<ul>
<li>Add a using statement at the top of the file to import System.Diagnostics, which is used to create Activities that are equivalent to “spans” in OpenTelemetry. Also import the OpenTelemetry.Resources and OpenTelemetry.Trace packages.</li>
</ul>
<pre><code class="language-csharp">using System.Diagnostics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
</code></pre>
<ul>
<li>Update the “builder” initialization code block to include configuration to enable Elastic OpenTelemetry observability.</li>
</ul>
<pre><code class="language-csharp">builder.Services.AddOpenTelemetry().WithTracing(tracing =&gt; tracing
    .AddSource(&quot;Helloworld&quot;)
    .AddAspNetCoreInstrumentation()
    .AddOtlpExporter()
    .ConfigureResource(resource =&gt;
        resource.AddService(serviceName: &quot;helloworld&quot;))
);
builder.Services.AddControllers();
</code></pre>
<ul>
<li>Replace the “Hello World!” HTML output string…</li>
</ul>
<pre><code class="language-html">&lt;h1&gt;Hello World!&lt;/h1&gt;
</code></pre>
<ul>
<li>...with the “Hello Elastic Observability” HTML output string.</li>
</ul>
<pre><code class="language-html">&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    Hello Elastic Observability - Azure Container Apps - C#
  &lt;/h1&gt;
  &lt;img
    src=&quot;https://elastichelloworld.blob.core.windows.net/elastic-helloworld/elastic-logo.png&quot;
  /&gt;
&lt;/div&gt;
</code></pre>
<ul>
<li>Add a telemetry trace span around the output response utilizing the Telemetry class’ ActivitySource.</li>
</ul>
<pre><code class="language-csharp">using (Activity activity = Telemetry.activitySource.StartActivity(&quot;HelloSpan&quot;)!)
{
    Console.Write(&quot;hello&quot;);
    await context.Response.WriteAsync(output);
}
</code></pre>
<p>The updated Program.cs file should look something like this:</p>
<pre><code class="language-csharp">using System.Diagnostics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(tracing =&gt; tracing
    .AddSource(&quot;Helloworld&quot;)
    .AddAspNetCoreInstrumentation()
    .AddOtlpExporter()
    .ConfigureResource(resource =&gt;
        resource.AddService(serviceName: &quot;helloworld&quot;))
);
builder.Services.AddControllers();
var app = builder.Build();

string output =
&quot;&quot;&quot;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
Hello Elastic Observability - Azure Container Apps - C#
&lt;/h1&gt;
&lt;img src=&quot;https://elastichelloworld.blob.core.windows.net/elastic-helloworld/elastic-logo.png&quot;&gt;
&lt;/div&gt;
&quot;&quot;&quot;;

app.MapGet(&quot;/&quot;, async context =&gt;
{
    using (Activity activity = Telemetry.activitySource.StartActivity(&quot;HelloSpan&quot;)!)
    {
        Console.Write(&quot;hello&quot;);
        await context.Response.WriteAsync(output);
    }
});
app.Run();
</code></pre>
<ol start="7">
<li>Rebuild the Hello World app image and push the image to the Azure Container Registry by running the following command.</li>
</ol>
<pre><code class="language-bash">az acr build --registry $ACR_NAME --image $APP_NAME .
</code></pre>
<ol start="8">
<li>Redeploy the updated Hello World app to Azure Container Apps, using the following command.</li>
</ol>
<pre><code class="language-bash">az containerapp create \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --environment $ENVIRONMENT \
  --image $ACR_NAME.azurecr.io/$APP_NAME \
  --target-port 3500 \
  --ingress 'external' \
  --registry-server $ACR_NAME.azurecr.io \
  --query properties.configuration.ingress.fqdn
</code></pre>
<p>This command will output the deployed Hello World app's fully qualified domain name (FQDN). Copy and paste the FQDN into a browser to see the updated Hello World app running in Azure Container Apps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-10-elastic-hello-observability.png" alt="hello observability" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>GET /</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the <strong>Trace Sample</strong> section to see the <strong>GET /</strong> and <strong>HelloSpan</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-12-latency-distribution.png" alt="latency-distribution" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the entire process of deploying a web app to Azure Container Apps that is instrumented with Elastic Observability. This web app is now fully available on the web running on a platform that will auto-scale to serve visitors worldwide. And it’s instrumented for Elastic Observability APM using OpenTelemetry to ingest data into Elastic Cloud’s Kibana dashboards.</p>
<p>Now that you’ve seen how to deploy a Hello World web app with a basic observability setup, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about expanding to a full scale observability coverage solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/microsoft-azure">Getting started with Elastic on Microsoft Azure</a> for more examples of how you can drive the data insights you need by combining Microsoft Azure’s cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/library-branding-elastic-observability-midnight-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to monitor Kafka and Confluent Cloud with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-kafka-confluent-cloud-elastic-observability</link>
            <guid isPermaLink="false">monitor-kafka-confluent-cloud-elastic-observability</guid>
            <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>This blog will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability. (To monitor Kafka brokers that are not in Confluent Cloud, I recommend checking out <a href="https://www.elastic.co/blog/how-to-monitor-containerized-kafka-with-elastic-observability">this blog</a>.) We will instrument Kafka applications with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>, use the Confluent Cloud metrics endpoint to get data about brokers, and pull it all together with a unified Kafka and Confluent Cloud monitoring dashboard in <a href="https://www.elastic.co/observability">Elastic Observability</a>.</p>
<h2>Using full-stack Elastic Observability to understand Kafka and Confluent performance</h2>
<p>In the <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/">2023 Dice Tech Salary Report</a>, Elasticsearch and Kafka are ranked #3 and #5 out of the top 12 <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/salary-trends#Skills">most in-demand skills</a> at the moment, so it’s no surprise that we are seeing a large number of customers who are implementing data in motion with Kafka.</p>
<p><a href="https://www.elastic.co/integrations/data-integrations?search=kafka">Kafka</a> comes with some additional complexities that go beyond traditional architectures and which make observability an even more important topic. Understanding where the bottlenecks are in messaging and stream-based architectures can be tough. This is why you need a comprehensive observability solution with <a href="https://www.elastic.co/blog/aiops-use-cases-observability-operations">machine learning</a> to help you.</p>
<p>In this blog, we will explore how to get Kafka applications instrumented with <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Elastic APM</a>, how to collect performance data with JMX, and how you can use the Elasticsearch Platform to pull in data from Confluent Cloud — which is by far the easiest and most cost-effective way to implement Kafka architectures.</p>
<p>For this blog post, we will be following the code at this <a href="https://github.com/davidgeorgehope/multi-cloud">git repository</a>. There are three services here that are designed to run on two clouds, push data from one cloud to the other, and finally land it in Google BigQuery. We want to monitor all of this using Elastic Observability to give you a complete picture of Confluent Cloud and Kafka service performance. As a teaser, this is the goal:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="kafka producer metrics" /></p>
<h2>A look at the architecture</h2>
<p>As mentioned, we have three <a href="https://www.elastic.co/observability/cloud-monitoring">multi-cloud services</a> implemented in our example application.</p>
<p>The first service is a Spring WebFlux service that runs inside AWS EKS. This service will take a message from a REST Endpoint and simply put it straight on to a Kafka topic.</p>
<p>The second service, which is also a Spring WebFlux service hosted inside Google Cloud Platform (GCP) with its <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a>, will then pick this up and forward it to another service that will put the message into BigQuery.</p>
<p>These services are all instrumented using Elastic APM. For this blog, we have decided to use Spring config to inject and configure the APM agent. You could of course use the “-javaagent” argument to inject the agent instead if preferred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-obsevability-aws-kafka-google-cloud.png" alt="aws kafka google cloud" /></p>
<h2>Getting started with Elastic Observability and Confluent Cloud</h2>
<p>Before we dive into the application and its configuration, you will want to get an Elastic Cloud and Confluent Cloud account. You can sign up here for <a href="https://www.elastic.co/cloud/">Elastic</a> and here for <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a>. There are some initial configuration steps we need to do inside Confluent Cloud, as you will need to create three topics: gcpTopic, myTopic, and topic_2.</p>
<p>When you sign up for Confluent Cloud, you will be given an option of what type of cluster to create. For this walk-through, a Basic cluster is fine (as shown) — if you are careful about usage, it will not cost you a penny.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-create-cluster.png" alt="confluent create cluster" /></p>
<p>Once you have a cluster, go ahead and create the three topics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-topics.png" alt="confluent topics" /></p>
<p>For this walk-through, you will only need to create single partition topics as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-new-topic.png" alt="new topic" /></p>
<p>Now we are ready to set up the Elastic Cloud cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-a-deployment.png" alt="create a deployment" /></p>
<p>One thing to note here: when setting up the Elastic cluster, the defaults are mostly fine, with one minor tweak. Under “Advanced Settings,” add capacity for machine learning.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-machine-learning-instances.png" alt="machine learning instances" /></p>
<h2>Getting APM up and running</h2>
<p>The first thing we want to do here is get our Spring Boot Webflux-based services up and running. For this blog, I have decided to attach the agent using Spring configuration, as you can see below. For brevity, I have not listed all the JMX configuration information, but you can see those details in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/aws-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">GitHub</a>.</p>
<pre><code class="language-java">package com.elastic.multicloud;
import co.elastic.apm.attach.ElasticApmAttacher;
import jakarta.annotation.PostConstruct;
import lombok.Setter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

import java.util.HashMap;
import java.util.Map;

@Setter
@Configuration
@ConfigurationProperties(prefix = &quot;elastic.apm&quot;)
@ConditionalOnProperty(value = &quot;elastic.apm.enabled&quot;, havingValue = &quot;true&quot;)
public class ElasticApmConfig {

    private static final String SERVER_URL_KEY = &quot;server_url&quot;;
    private String serverUrl;

    private static final String SERVICE_NAME_KEY = &quot;service_name&quot;;
    private String serviceName;

    private static final String SECRET_TOKEN_KEY = &quot;secret_token&quot;;
    private String secretToken;

    private static final String ENVIRONMENT_KEY = &quot;environment&quot;;
    private String environment;

    private static final String APPLICATION_PACKAGES_KEY = &quot;application_packages&quot;;
    private String applicationPackages;

    private static final String LOG_LEVEL_KEY = &quot;log_level&quot;;
    private String logLevel;
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticApmConfig.class);

    @PostConstruct
    public void init() {
        LOGGER.info(environment);

        Map&lt;String, String&gt; apmProps = new HashMap&lt;&gt;(6);
        apmProps.put(SERVER_URL_KEY, serverUrl);
        apmProps.put(SERVICE_NAME_KEY, serviceName);
        apmProps.put(SECRET_TOKEN_KEY, secretToken);
        apmProps.put(ENVIRONMENT_KEY, environment);
        apmProps.put(APPLICATION_PACKAGES_KEY, applicationPackages);
        apmProps.put(LOG_LEVEL_KEY, logLevel);
        apmProps.put(&quot;enable_experimental_instrumentations&quot;,&quot;true&quot;);
          apmProps.put(&quot;capture_jmx_metrics&quot;,&quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;);


        ElasticApmAttacher.attach(apmProps);
    }
}
</code></pre>
<p>Now obviously this requires some dependencies, which you can see here in the Maven pom.xml.</p>
<pre><code class="language-xml">&lt;dependency&gt;
			&lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
			&lt;artifactId&gt;apm-agent-attach&lt;/artifactId&gt;
			&lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
		&lt;/dependency&gt;
		&lt;dependency&gt;
			&lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
			&lt;artifactId&gt;apm-agent-api&lt;/artifactId&gt;
			&lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
		&lt;/dependency&gt;
</code></pre>
<p>Strictly speaking, the agent-api is not required, but it can be useful if you want to add your own monitoring code (as per the example below). The agent will happily auto-instrument without it, though.</p>
<pre><code class="language-java">Transaction transaction = ElasticApm.currentTransaction();
Span span = ElasticApm.currentSpan()
        .startSpan(&quot;external&quot;, &quot;kafka&quot;, null)
        .setName(&quot;DAVID&quot;)
        .setServiceTarget(&quot;kafka&quot;, &quot;gcp-elastic-apm-spring-boot-integration&quot;);
try (final Scope scope = transaction.activate()) {
    span.injectTraceHeaders((name, value) -&gt; producerRecord.headers().add(name, value.getBytes()));
    return Mono.fromRunnable(() -&gt; {
        kafkaTemplate.send(producerRecord);
    });
} catch (Exception e) {
    span.captureException(e);
    throw e;
} finally {
    span.end();
}
</code></pre>
<p>Now we have enough code to get our agent bootstrapped.</p>
<p>To get the code from the GitHub repository up and running, you will need the following installed on your system and to ensure that you have the credentials for your GCP and AWS cloud.</p>
<ul>
<li>Java</li>
<li>Maven</li>
<li>Docker</li>
<li>Kubernetes CLI (kubectl)</li>
</ul>
<h3>Clone the project</h3>
<p>Clone the multi-cloud Spring project to your local machine.</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/multi-cloud
</code></pre>
<h3>Build the project</h3>
<p>From each service in the project (aws-multi-cloud, gcp-multi-cloud, gcp-bigdata-consumer-multi-cloud), run the following commands to build the project.</p>
<pre><code class="language-bash">mvn clean install
</code></pre>
<p>Now you can run the Java project locally.</p>
<pre><code class="language-bash">java -jar gcp-bigdata-consumer-multi-cloud-0.0.1-SNAPSHOT.jar --spring.config.location=/Users/davidhope/applicaiton-gcp.properties
</code></pre>
<p>That will just get the Java application running locally, but you can also deploy this to Kubernetes using EKS and GKE as shown below.</p>
<h3>Create a Docker image</h3>
<p>Create a Docker image from the built project using the dockerBuild.sh provided in the project. You may want to customize this shell script to upload the built docker image to your own docker repository.</p>
<pre><code class="language-bash">./dockerBuild.sh
</code></pre>
<h3>Create a namespace for each service</h3>
<pre><code class="language-bash">kubectl create namespace aws
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-1
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-2
</code></pre>
<p>Once you have the namespaces created, you can switch context using the following command:</p>
<pre><code class="language-bash">kubectl config set-context --current --namespace=my-namespace
</code></pre>
<h3>Configuration for each service</h3>
<p>Each service needs an application.properties file. I have put an example <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/application.properties">here</a>.</p>
<p>You will need to replace the following properties with those you find in Elastic.</p>
<pre><code class="language-bash">elastic.apm.server-url=
elastic.apm.secret-token=
</code></pre>
<p>These can be found by going into Elastic Cloud and clicking on <strong>Services</strong> inside APM and then <strong>Add Data</strong>, which should be visible in the top right corner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-data.png" alt="add data" /></p>
<p>From there you will see the following, which gives you the config information you need.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-apm-agents.png" alt="apm agents" /></p>
<p>You will need to replace the following properties with those you find in Confluent Cloud.</p>
<pre><code class="language-bash">elastic.kafka.producer.sasl-jaas-config=
</code></pre>
<p>This configuration comes from the Clients page in Confluent Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-new-client.png" alt="confluent new client" /></p>
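<p>Putting those pieces together, a minimal application.properties could look something like the following. This is a sketch, not the exact file from the repo: the placeholder values and service name are assumptions, and the kafka property value should be checked against the example properties file linked above.</p>
<pre><code class="language-bash">elastic.apm.enabled=true
elastic.apm.server-url=https://&lt;your-apm-server-url&gt;:443
elastic.apm.secret-token=&lt;your-secret-token&gt;
elastic.apm.service-name=aws-multi-cloud
elastic.apm.environment=production
elastic.apm.application-packages=com.elastic.multicloud
elastic.apm.log-level=INFO
elastic.kafka.producer.sasl-jaas-config=org.apache.kafka.common.security.plain.PlainLoginModule required username='&lt;API_KEY&gt;' password='&lt;API_SECRET&gt;';
</code></pre>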
<h3>Adding the config for each service in Kubernetes</h3>
<p>Once you have a fully configured application properties, you need to add it to your <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes environment</a> as below.</p>
<p>From the aws namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-1 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-2 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic bigdata-creds --from-file=elastic-product-marketing-e145e13fbc7c.json

kubectl create secret generic my-app-config-gcp-bigdata --from-file=application.properties
</code></pre>
<h3>Create a Kubernetes deployment</h3>
<p>Create a Kubernetes deployment YAML file and add your Docker image to it. You can use the deployment.yaml file provided in the project as a template. Make sure to update the image name in the file to match the name of the Docker image you just created.</p>
<pre><code class="language-bash">kubectl apply -f deployment.yaml
</code></pre>
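<p>If you are writing your own manifest rather than reusing the one in the repo, the config secret created earlier can be mounted along these lines. The mount path and resource names here are assumptions for illustration; only the <code>my-app-config</code> secret name comes from the steps above.</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-multi-cloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-multi-cloud
  template:
    metadata:
      labels:
        app: aws-multi-cloud
    spec:
      containers:
        - name: aws-multi-cloud
          image: &lt;your-repo&gt;/aws-multi-cloud:latest
          volumeMounts:
            - name: app-config
              mountPath: /config
              readOnly: true
      volumes:
        - name: app-config
          secret:
            secretName: my-app-config
</code></pre>
<p>The container would then be started with <code>--spring.config.location=/config/application.properties</code> so Spring picks up the mounted file.</p>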
<h3>Create a Kubernetes service</h3>
<p>Create a Kubernetes service YAML file and add your deployment to it. You can use the service.yaml file provided in the project as a template.</p>
<pre><code class="language-bash">kubectl apply -f service.yaml
</code></pre>
<h3>Access your application</h3>
<p>Your application is now running in a Kubernetes cluster. To access it, you can use the service's cluster IP and port. You can get the service's IP and port using the following command.</p>
<pre><code class="language-bash">kubectl get services
</code></pre>
<p>Now that you know where the service is, you can invoke it.</p>
<p>You can regularly poke the service endpoint using the following command.</p>
<pre><code class="language-bash">curl -X POST -H &quot;Content-Type: application/json&quot; -d '{&quot;name&quot;: &quot;linuxize&quot;, &quot;email&quot;: &quot;linuxize@example.com&quot;}' http://localhost:8080/api/my-objects/publish
</code></pre>
<p>With this up and running, you should see the following service map build out in the Elastic APM product.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-aws-elastic-apm-spring-boot.png" alt="aws elastic apm spring boot" /></p>
<p>And traces will contain a waterfall graph showing all the spans that have executed across this distributed application, allowing you to pinpoint where any issues are within each transaction.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<h2>JMX for Kafka Producer/Consumer metrics</h2>
<p>In the previous part of this blog, we briefly touched on the JMX metric configuration you can see below.</p>
<pre><code class="language-bash">&quot;capture_jmx_metrics&quot;,&quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;
</code></pre>
<p>We can use this “capture_jmx_metrics” setting to capture any Kafka Producer/Consumer JMX metrics we want to monitor.</p>
<p>Check out the documentation <a href="https://www.elastic.co/guide/en/apm/agent/java/current/config-jmx.html">here</a> to understand how to configure this and <a href="https://docs.confluent.io/platform/current/kafka/monitoring.html">here</a> to see the available JMX metrics you can monitor. In the <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">example code in GitHub</a>, we actually pull all the available metrics in, so you can check in there how to configure this.</p>
<p>One thing worth pointing out: be sure to use the “metric_name” property shown above, as it becomes quite difficult to find the metrics in Elastic Discover without an explicit name.</p>
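<p>As a further illustration, a consumer-side metric can be captured with the same syntax. The metric below (records-lag-max from the consumer fetch manager MBean) is one we chose for illustration; check the Confluent monitoring documentation linked above for the MBeans your client version actually exposes.</p>
<pre><code class="language-bash">object_name[kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*] attribute[records-lag-max:metric_name=kafka.consumer.records-lag-max]
</code></pre>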
<h2>Monitoring Confluent Cloud with Elastic Observability</h2>
<p>So we now have some good monitoring set up for Kafka Producers and Consumers and we can trace transactions between services down to the lines of code that are executing. The core part of our Kafka infrastructure is hosted in Confluent Cloud. How, then, do we get data from there into our <a href="https://www.elastic.co/observability">full stack observability solution</a>?</p>
<p>Luckily, Confluent has done a fantastic job of making this easy. It provides important Confluent Cloud metrics via an open Prometheus-based metrics URL. So let's get down to business and configure this to bring data into our <a href="https://www.elastic.co/observability">observability tool</a>.</p>
<p>The first step is to configure Confluent Cloud with the MetricsViewer. The MetricsViewer role provides service account access to the Metrics API for all clusters in an organization. This role also enables service accounts to import metrics into third-party metrics platforms.</p>
<p>To assign the MetricsViewer role to a new service account:</p>
<ol>
<li>In the administration menu (☰) in the upper-right corner of the Confluent Cloud user interface, click <strong>ADMINISTRATION &gt; Cloud API keys</strong>.</li>
<li>Click <strong>Add key</strong>.</li>
<li>Click the <strong>Granular access tile</strong> to set the scope for the API key. Click <strong>Next</strong>.</li>
<li>Click <strong>Create a new one</strong> and specify the service account name. Optionally, add a description. Click <strong>Next</strong>.</li>
<li>The API key and secret are generated for the service account. You will need this API key and secret to connect to the cluster, so be sure to safely store this information. Click <strong>Save</strong>. The new service account with the API key and associated ACLs is created. When you return to the API access tab, you can view the newly-created API key to confirm.</li>
<li>Return to Accounts &amp; access in the administration menu, and in the Accounts tab, click <strong>Service accounts</strong> to view your service accounts.</li>
<li>Select the service account that you want to assign the MetricsViewer role to.</li>
<li>In the service account’s details page, click <strong>Access</strong>.</li>
<li>In the tree view, open the resource where you want the service account to have the MetricsViewer role.</li>
<li>Click <strong>Add role assignment</strong> and select the MetricsViewer tile. Click <strong>Save</strong>.</li>
</ol>
<p>Next we can head to <a href="https://www.elastic.co/observability">Elastic Observability</a> and configure the Prometheus integration to pull in the metrics data.</p>
<p>Go to the integrations page in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations.png" alt="observability integrations" /></p>
<p>Find the Prometheus integration. We are using it because the Confluent Cloud metrics server can provide data in Prometheus format, and it works really well (good work, Confluent!).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations-prometheus.png" alt="integrations prometheus" /></p>
<p>Add Prometheus in the next page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-prometheus.png" alt="add prometheus" /></p>
<p>Configure the Prometheus plugin in the following way: In the hosts box, add the following URL, replacing the resource kafka id with the cluster id you want to monitor.</p>
<pre><code class="language-bash">https://api.telemetry.confluent.cloud:443/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-collect-prometheus-metrics.png" alt="collect prometheus metrics" /></p>
<p>Under the advanced options, add the username and password: these are the API key and secret you created in the Confluent Cloud API keys step above.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-http-config-options.png" alt="http config options" /></p>
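<p>Before attaching the policy to an Elastic Agent, you can sanity-check that the export endpoint and credentials work with a quick curl, substituting the API key and secret from the earlier step and your own cluster id:</p>
<pre><code class="language-bash">curl -s -u &lt;API_KEY&gt;:&lt;API_SECRET&gt; \
  &quot;https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw&quot;
</code></pre>
<p>A successful call returns metrics in Prometheus exposition format.</p>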
<p>Once the Integration is created, <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html#apply-a-policy">the policy needs to be applied</a> to an instance of a running Elastic Agent.</p>
<p>That’s it! It’s that easy to get all the data you need for a full stack observability monitoring solution.</p>
<p>Finally, let’s pull all this together in a dashboard.</p>
<h2>Pulling it all together</h2>
<p>Using Kibana to generate dashboards is super easy. If you configured everything the way we recommended above, you should find the metrics (producer/consumer/brokers) you need to create your own dashboard as per the following screenshot.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-dashboard-metrics.png" alt="dashboard metrics" /></p>
<p>Luckily, I made a dashboard for you and stored it in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/export.ndjson">GitHub</a>. Take a look below and use this to import it into your own environments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="producer metrics" /></p>
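<p>You can import the dashboard through <strong>Stack Management &gt; Saved Objects</strong> in Kibana, or script it with Kibana’s saved objects import API, roughly as follows (the Kibana host and credentials are placeholders):</p>
<pre><code class="language-bash">curl -X POST -u &lt;username&gt;:&lt;password&gt; \
  &quot;https://&lt;your-kibana-host&gt;:5601/api/saved_objects/_import?overwrite=true&quot; \
  -H &quot;kbn-xsrf: true&quot; \
  --form file=@export.ndjson
</code></pre>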
<h2>Adding the icing on the cake: machine learning anomaly detection</h2>
<p>Now that we have all the critical bits in place, we are going to add the icing on the cake: machine learning (ML)!</p>
<p>Within Kibana, let's head over to the Machine Learning tab in “Analytics.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-kibana-analytics.png" alt="kibana analytics" /></p>
<p>Go to the jobs page, where we’ll get started creating our first anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-your-first-anomaly-detection-job.png" alt="create your first anomaly detection job" /></p>
<p>The metrics data view contains what we need to create this new anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-metrics.png" alt="observability metrics" /></p>
<p>Use the wizard and select a “Single Metric.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-a-wizard.png" alt="use a wizard" /></p>
<p>Use the full data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-full-data.png" alt="use full data" /></p>
<p>In this example, we are going to look for anomalies in the connection count. We really do not want a major deviation here, as this could indicate something very bad occurring if we suddenly have too many or too few things connecting to our Kafka cluster.</p>
<p>Once you have selected the connection count metric, you can proceed through the wizard and eventually your ML job will be created and you should be able to view the data as per the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-single-metric-viewer.png" alt="single metric viewer" /></p>
<p>Congratulations, you have now created a machine learning job to alert you if there are any problems with your Kafka cluster, adding <a href="https://www.elastic.co/observability/aiops">a full AIOps solution</a> to your Kafka and Confluent observability!</p>
<h2>Summary</h2>
<p>We looked at monitoring Kafka-based solutions implemented on Confluent Cloud using Elastic Observability.</p>
<p>We covered the architecture of a multi-cloud solution involving AWS EKS, Confluent Cloud, and GCP GKE. We looked at how to instrument Kafka applications with Elastic APM, use JMX for Kafka Producer/Consumer metrics, integrate Prometheus, and set up machine learning anomaly detection.</p>
<p>We went through a detailed walk-through with code snippets, configuration steps, and deployment instructions included to help you get started.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/patterns-white-background-no-logo-observability_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to remove PII from your Elastic data in 3 easy steps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/remove-pii-data</link>
            <guid isPermaLink="false">remove-pii-data</guid>
            <pubDate>Tue, 20 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. With Elastic's intuitive ML interface and parsing capabilities, sensitive data can easily be redacted from unstructured data.]]></description>
            <content:encoded><![CDATA[<p>Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or other fields where data is sensitive, PII may inadvertently be captured and stored. Structured logs make it easy to identify, remove, and protect sensitive data fields; but what about unstructured messages? Or perhaps call center transcriptions?</p>
<p>Elasticsearch, with its long experience in <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning">machine learning</a>, provides various options to bring in custom models, such as large language models (LLMs), and also ships models of its own. These models help implement PII redaction.</p>
<p>If you would like to learn more about natural language processing, machine learning, and Elastic, please be sure to check out these related articles:</p>
<ul>
<li><a href="https://www.elastic.co/blog/introduction-to-nlp-with-pytorch-models">Introduction to modern natural language processing with PyTorch in Elasticsearch</a></li>
<li><a href="https://www.elastic.co/blog/how-to-deploy-natural-language-processing-nlp-getting-started">How to deploy natural language processing (NLP): Getting started</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">Elastic Redact Processor Documentation</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-sparse-encoder-ai-model">Introducing Elastic Learned Sparse Encoder: Elastic’s AI model for semantic search</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">Accessing machine learning models in Elastic</a></li>
</ul>
<p>In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.</p>
<p>Specifically, we’ll walk through setting up a <a href="https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example">named entity recognition (NER)</a> model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined with an ingest pipeline where we can use Elastic machine learning and data transformation capabilities to remove sensitive information from your data.</p>
<h2>Loading the trained model</h2>
<p>Before we begin, we must load our NER model into our Elasticsearch cluster. This is easily accomplished with Docker and the Elastic Eland client. From a command line, let’s clone the Eland client via git:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/eland.git
</code></pre>
<p>Navigate into the recently downloaded client:</p>
<pre><code class="language-bash">cd eland/
</code></pre>
<p>Now let’s build the client:</p>
<pre><code class="language-bash">docker build -t elastic/eland .
</code></pre>
<p>From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace your username, password, es-cluster-hostname, and esport.</p>
<p>If you’re using the Elastic Cloud or have signed certificates, simply run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>If you’re using self-signed certificates, run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>From here you’ll witness the Eland client in action downloading the trained model from <a href="https://huggingface.co/dslim/bert-base-NER">HuggingFace</a> and automatically deploying it into your cluster!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-huggingface.png" alt="huggingface code" /></p>
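<p>Once the upload completes, you can confirm the model is deployed and allocated from DevTools (the model id below is how Eland names this Hugging Face model; adjust it if your id differs):</p>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
</code></pre>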
<p>Synchronize your newly loaded trained model by clicking on the blue hyperlink via your Machine Learning Overview UI “Synchronize your jobs and trained models.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Machine-Learning-Overview-UI.png" alt="Machine Learning Overview UI" /></p>
<p>Now click the Synchronize button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Synchronize-button.png" alt="Synchronize button" /></p>
<p>That’s it! Congratulations, you just loaded your first trained model into Elastic!</p>
<h2>Create the redact processor and ingest pipeline</h2>
<p>From DevTools, let’s configure the redact processor along with our inference processor to take advantage of Elastic’s trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll be focusing on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/redact
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;value&quot;: &quot;{{{message}}}&quot;
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        }
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '&lt;' + item['class_name'] + '&gt;')\r\n                }\r\n                ctx['redacted']=msg&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL}&quot;,
          &quot;%{IP:IP_ADDRESS}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD}&quot;,
          &quot;%{SSN:SSN}&quot;,
          &quot;%{PHONE:PHONE}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}&quot;,
          &quot;SSN&quot;: &quot;\\d{3}-\\d{2}-\\d{4}&quot;,
          &quot;PHONE&quot;: &quot;\\d{3}-\\d{3}-\\d{4}&quot;
        }
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;pii_script-redact&quot;
      }
    }
  ]
}
</code></pre>
<p>OK, but what does each processor really do? Let’s walk through each processor in detail here:</p>
<ol>
<li>
<p>The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.</p>
</li>
<li>
<p>The INFERENCE processor calls the NER model we loaded to be used on the message field for identifying names, locations, and organizations.</p>
</li>
<li>
<p>The SCRIPT processor then replaces each entity the model detected with its entity class name inside the redacted field.</p>
</li>
<li>
<p>Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).</p>
</li>
<li>
<p>The REMOVE processor keeps the extraneous ml.* fields from being indexed; note we’ll add “message” to this processor once we validate data is being redacted properly.</p>
</li>
<li>
<p>The ON_FAILURE / SET processor captures any errors just in case we have them.</p>
</li>
</ol>
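<p>Before testing through the UI, you can also exercise the pipeline directly from DevTools with the _simulate API; the sample message here is invented for illustration:</p>
<pre><code class="language-bash">POST _ingest/pipeline/redact/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;Jane Doe can be reached at jane.doe@example.com or 555-123-4567.&quot;
      }
    }
  ]
}
</code></pre>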
<h2>Slice your PII</h2>
<p>Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Ingest-Pipelines.png" alt="Ingest Pipelines" /></p>
<p>Click on the Manage button, and then click Edit.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-Manage-button.png" alt="Manage button" /></p>
<p>Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-test-pipeline.png" alt="test pipeline" /></p>
<pre><code class="language-json">{
  &quot;_source&quot;: {
    &quot;message&quot;: &quot;John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2&quot;
  }
}
</code></pre>
<p>Simply press the Run the pipeline button, and you will then see the following output:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output-2.png" alt="pii output code" /></p>
<h2>What’s next?</h2>
<p>After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-manage-processor.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output.png" alt="pii output code 2" /></p>
<h2>Conclusion</h2>
<p>With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Loading a pre-trained named entity recognition model into an Elastic cluster</li>
<li>Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion</li>
<li>Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your data.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-post4-ai-search-B.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How we fixed head-based sampling in OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/how-we-fixed-head-based-sampling-in-opentelemetry</link>
            <guid isPermaLink="false">how-we-fixed-head-based-sampling-in-opentelemetry</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Head-based sampling can break throughput charts without sampling metadata. Learn how OpenTelemetry tracestate probability fields fixed this in Java, JS, and Python.]]></description>
            <content:encoded><![CDATA[<p>Head-based sampling in OpenTelemetry is cheap and practical, but it used to create a major analytics problem: sampled traces reduced raw span counts, so backend throughput charts became wrong. The fix was to carry sampling probability in <code>tracestate</code> so a backend can estimate how many original traces each sampled trace represents. This article explains the problem, the spec, and how we implemented the fix in OpenTelemetry Java, JavaScript, and Python.</p>
<h2>Why sampling creates a throughput problem</h2>
<p>Most production systems sample traces because sending every span is expensive. Two common approaches are:</p>
<ul>
<li><strong>Head-based sampling</strong>: decide at trace start whether to keep or drop the trace.</li>
<li><strong>Tail-based sampling</strong>: decide later, after seeing more or all spans from the trace.</li>
</ul>
<p>Head-based sampling is fast and low-cost because it decides early. But if your backend only sees 10% of traces, a naive throughput chart built from ingested traces can undercount real traffic by 10x.</p>
<p>In other words, without extra metadata, sampled telemetry loses the context needed to reconstruct volume-based metrics.</p>
<h2>The OpenTelemetry spec that solves it</h2>
<p>The OpenTelemetry specification defines a way to encode probability sampling information in <code>tracestate</code>:</p>
<ul>
<li><a href="https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/">Tracestate: Probability Sampling</a></li>
<li><a href="https://opentelemetry.io/docs/specs/otel/trace/sdk/#built-in-composablesamplers">SDK: Built-in Composable Samplers</a></li>
</ul>
<p>At a high level, the sampler writes enough information into <code>tracestate</code> for downstream systems to understand the effective sampling probability of a trace.
When this metadata is present and propagated correctly, throughput and rate-oriented analytics can stay accurate while still getting the cost benefits of head-based sampling.
Elastic Observability supports this spec and behavior out of the box.
If you use Elastic's distribution (EDOT) SDKs or correctly configure the upstream OpenTelemetry SDKs as described below, Elastic can estimate the original throughput metrics from sampled data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/how-we-fixed-head-based-sampling-in-opentelemetry/sampling-metadata.jpg" alt="Head-based sampling with tracestate probability propagation" /></p>
<p>For example, a sampled span might carry this entry:</p>
<pre><code class="language-text">tracestate: ot=th:fd70a4;rv:fe123456789abc
</code></pre>
<p>Using the spec rules:</p>
<ul>
<li><code>th</code> is the rejection threshold (<code>T</code>) with trailing zeros removed.</li>
<li><code>rv</code> is the 56-bit randomness value (<code>R</code>).</li>
<li>A participant keeps the span when <code>R &gt;= T</code>.</li>
</ul>
<p>Here, <code>th:fd70a4</code> expands to <code>T = 0xfd70a400000000</code>, and <code>rv</code> gives <code>R = 0xfe123456789abc</code>, so the span is kept because <code>R &gt;= T</code>.</p>
<p>In decimal, that is:</p>
<ul>
<li><code>T = 0xfd70a400000000 = 71,337,018,784,743,424</code></li>
<li><code>R = 0xfe123456789abc = 71,514,660,082,850,492</code></li>
<li><code>2^56 = 72,057,594,037,927,936</code></li>
</ul>
<p>Since <code>71,514,660,082,850,492 &gt;= 71,337,018,784,743,424</code>, this trace is sampled.</p>
<p>The backend can convert <code>T</code> into a representative count (the adjusted count):</p>
<pre><code class="language-text">probability = (2^56 - T) / 2^56
adjusted_count = 1 / probability = 2^56 / (2^56 - T)
</code></pre>
<p>Plugging in the values:</p>
<pre><code class="language-text">probability = (72,057,594,037,927,936 - 71,337,018,784,743,424) / 72,057,594,037,927,936
            = 720,575,253,184,512 / 72,057,594,037,927,936
           ~= 0.01

adjusted_count = 1 / 0.01 = 100
</code></pre>
<p>So in practice this is approximately 1% sampling, and each sampled span represents about 100 original spans.</p>
<p>That means a backend can do weighted calculations, for example:</p>
<pre><code class="language-text">extrapolated_throughput = sampled_throughput * adjusted_count
</code></pre>
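<p>To make the arithmetic concrete, here is a small Python sketch (an illustrative helper, not part of any OTel SDK) that parses the <code>th</code> and <code>rv</code> values from the example above and derives the adjusted count:</p>

```python
MAX = 2 ** 56  # th and rv encode 56-bit values

def parse_56bit(hex_str: str) -> int:
    # th is serialized with trailing zeros removed; pad back to 14 hex digits.
    return int(hex_str.ljust(14, "0"), 16)

T = parse_56bit("fd70a4")          # rejection threshold from the th field
R = parse_56bit("fe123456789abc")  # randomness value from the rv field

sampled = R >= T                   # the span is kept when R >= T
probability = (MAX - T) / MAX      # effective sampling probability
adjusted_count = 1 / probability   # traces this sampled trace represents

print(sampled, round(probability, 4), round(adjusted_count))
```

<p>Running this reproduces the numbers above: the trace is sampled, the probability is roughly 0.01, and the adjusted count is about 100.</p>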
<h2>What was missing before</h2>
<p>A few months ago, OpenTelemetry SDK users had the spec but no out-of-the-box implementation in major SDKs.
Hence, spans received at a backend didn't carry the sampling metadata that would allow the backend to estimate the original trace volume.
In practice, this meant teams adopting head-based sampling had limited options:</p>
<ol>
<li>accept skewed throughput numbers,</li>
<li>build custom sampler logic, or</li>
<li>switch to more complex sampling setups.</li>
</ol>
<p>For many teams, that made standard head-based sampling much less useful than it should have been.</p>
<h2>The fix: implementation across Java, JavaScript, and Python</h2>
<p>We implemented the spec-aligned behavior in three SDKs so teams can use standardized sampling metadata instead of custom workarounds.</p>
<ul>
<li><strong>Java</strong>: <a href="https://github.com/open-telemetry/opentelemetry-java/pull/7626">open-telemetry/opentelemetry-java#7626</a></li>
<li><strong>JavaScript</strong>: <a href="https://github.com/open-telemetry/opentelemetry-js/pull/5839">open-telemetry/opentelemetry-js#5839</a></li>
<li><strong>Python</strong>: <a href="https://github.com/open-telemetry/opentelemetry-python/pull/4714">open-telemetry/opentelemetry-python#4714</a></li>
</ul>
<p>All three PRs implemented the composite/probability sampling behavior so the root sampling decision is represented in <code>tracestate</code> and can be preserved across service boundaries.</p>
<h3>What changed conceptually</h3>
<p>The important shift is not &quot;sample more&quot; or &quot;sample less&quot;. It is:</p>
<ul>
<li>keep probabilistic head-based sampling,</li>
<li>propagate probability metadata with the trace,</li>
<li>let backends compute weighted rates from sampled data.</li>
</ul>
<p>This keeps ingestion costs manageable and restores correct aggregate analysis.</p>
<h2>Implementation walkthrough</h2>
<p>The exact API shape differs by language and release, but the rollout pattern is similar:</p>
<ol>
<li>Use the SDK sampler implementation that supports the probability/composite spec behavior.</li>
<li>Keep W3C Trace Context propagation enabled so <code>tracestate</code> moves across services.</li>
<li>Validate in your backend that throughput/rate charts use weighted interpretation when sampling metadata exists.</li>
</ol>
<h3>Java</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Java (EDOT Java), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces. You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/configuration#central-configuration-settings">central configuration</a> or by setting the <code>otel.traces.sampler.arg</code> Java system property / <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel Java SDK, use the following logic to configure the sampler:</p>
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.extension.incubator.trace.samplers.ComposableSampler;
import io.opentelemetry.sdk.extension.incubator.trace.samplers.CompositeSampler;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

// Use a sampling ratio. For example, 10% sampling:
double ratio = 0.1;

SdkTracerProvider tracerProvider =
    SdkTracerProvider.builder()
        .setSampler(
            CompositeSampler.wrap(
                ComposableSampler.parentThreshold(
                    ComposableSampler.probability(ratio)
                )
            )
        )
        // (other configuration, e.g., span processor, exporter)
        .build();

OpenTelemetry openTelemetry =
    OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .build();

Tracer tracer = openTelemetry.getTracer(&quot;my-instrumentation-library&quot;);

// You can now start spans with the configured sampler.
// Example:
tracer.spanBuilder(&quot;example-span&quot;).startSpan();
</code></pre>
<p>The example above shows how to configure head-based probability sampling using the OpenTelemetry Java SDK.
Let’s break down the key parts:</p>
<ul>
<li><strong>Importing necessary classes:</strong> The imports bring in the required OpenTelemetry APIs and sampler extensions.</li>
<li><strong>Setting the sampling ratio:</strong> The <code>ratio</code> variable controls the fraction of traces sampled (for example, 0.1 for 10%).</li>
<li><strong>Sampler configuration:</strong>
<ul>
<li><code>CompositeSampler</code> and <code>ComposableSampler</code> are used to set up a sampler that follows the OpenTelemetry specification for composite samplers, enabling more accurate probability-based head sampling.</li>
<li><code>ComposableSampler.probability(ratio)</code> specifies that traces are sampled at the configured ratio.</li>
<li><code>ComposableSampler.parentThreshold(...)</code> ensures parent sampling decisions are respected, which keeps trace context consistent across service boundaries.</li>
<li>Wrapping this in <code>CompositeSampler.wrap(...)</code> gives you a sampler compliant with the latest spec.</li>
<li>In current OTel Java, sampled root spans emit the <code>th</code> value in <code>tracestate</code>, and <code>rv</code> is preserved when it is already present from upstream context.</li>
</ul>
</li>
<li><strong>Tracer provider and OpenTelemetry setup:</strong>
<ul>
<li>The configured sampler is attached to the <code>SdkTracerProvider</code> which is then built into the <code>OpenTelemetrySdk</code> instance.</li>
</ul>
</li>
<li><strong>Using the configured tracer:</strong>
<ul>
<li>When you build a span (like <code>tracer.spanBuilder(&quot;example-span&quot;).startSpan()</code>), the SDK applies your sampling policy as you generate traces.</li>
</ul>
</li>
</ul>
<p>This pattern ensures that your head-based sampler not only controls costs (by sampling only a percentage of traces) but also carries and respects sampling metadata.
This, in turn, enables downstream backends (like Elastic or any OTel-compliant backend) to correctly calculate throughput/volume metrics, accounting for sampling, and provide more accurate operational measurements.</p>
<h3>Node.js</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Node.js (EDOT Node), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces.
You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/node/configuration#central-configuration-settings">central configuration</a> or by setting the <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel JavaScript SDK, use the following logic to configure the sampler:</p>
<pre><code class="language-javascript">const { NodeSDK } = require('@opentelemetry/sdk-node');
const {
  createCompositeSampler,
  createComposableParentThresholdSampler,
  createComposableTraceIDRatioBasedSampler,
} = require('@opentelemetry/sampler-composite');

// Example: sample 10% of new root traces and preserve parent decisions.
const sampler = createCompositeSampler(
  createComposableParentThresholdSampler(
    createComposableTraceIDRatioBasedSampler(0.1)
  )
);

const sdk = new NodeSDK({ sampler });
sdk.start();
</code></pre>
<p>This JavaScript snippet demonstrates how to configure head-based probability sampling using the upstream OpenTelemetry JavaScript SDK with the new composite sampler specification.</p>
<ul>
<li>
<p><strong>Imports:</strong><br />
The code imports utility functions—<code>createCompositeSampler</code>, <code>createComposableParentThresholdSampler</code>, and <code>createComposableTraceIDRatioBasedSampler</code>—from the <code>@opentelemetry/sampler-composite</code> extension, which implements the spec-compliant composable sampling logic.</p>
</li>
<li>
<p><strong>Sampler configuration:</strong><br />
The sampler is constructed to:</p>
<ul>
<li>Use <code>createComposableTraceIDRatioBasedSampler(0.1)</code> to sample 10% of all (root) traces.</li>
<li>Wrap this in <code>createComposableParentThresholdSampler</code>, so sampling respects the decision made by any parent span that might come from upstream (preserving distributed trace context).</li>
<li>Finally, the whole structure is wrapped in <code>createCompositeSampler</code>, which puts it into the form expected by the OTel SDK.</li>
</ul>
</li>
<li>
<p><strong>Usage:</strong><br />
The sampler is passed to the <code>NodeSDK</code> from <code>@opentelemetry/sdk-node</code> at startup. After this, all spans you create in your application will follow this sampling logic.</p>
</li>
</ul>
<p>This approach enables accurate, head-based sampling in OpenTelemetry JavaScript, following the latest OTel specification. It ensures your traces are sampled at the rate you set, while also propagating sampling-related metadata in the <code>tracestate</code> to downstream services and telemetry backends (e.g., Elastic, Jaeger). This is critical for volume adjustment, accurate metrics, and cost control in distributed tracing environments.</p>
<h3>Python</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Python (EDOT Python), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces.
You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/python/configuration#central-configuration-settings">central configuration</a> or by setting the <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel Python SDK, register a custom sampler entry point and point <code>OTEL_TRACES_SAMPLER</code> to it:</p>
<pre><code class="language-toml"># pyproject.toml
[project.entry-points.opentelemetry_traces_sampler]
parentbased_composite = &quot;your_package.sampling:ParentBasedCompositeSampler&quot;
</code></pre>
<pre><code class="language-python"># your_package/sampling.py
from __future__ import annotations

from typing import Sequence

from opentelemetry.context import Context
from opentelemetry.sdk.trace._sampling_experimental import (
    composable_parent_threshold,
    composable_traceid_ratio_based,
    composite_sampler,
)
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult
from opentelemetry.trace import Link, SpanKind, TraceState
from opentelemetry.util.types import Attributes


class ParentBasedCompositeSampler(Sampler):
    # The SDK passes OTEL_TRACES_SAMPLER_ARG as this constructor argument.
    def __init__(self, ratio_str: str | None):
        try:
            ratio = float(ratio_str) if ratio_str else 1.0
        except ValueError:
            ratio = 1.0
        self._delegate = composite_sampler(
            composable_parent_threshold(composable_traceid_ratio_based(ratio))
        )

    def should_sample(
        self,
        parent_context: Context | None,
        trace_id: int,
        name: str,
        kind: SpanKind | None = None,
        attributes: Attributes | None = None,
        links: Sequence[Link] | None = None,
        trace_state: TraceState | None = None,
    ) -&gt; SamplingResult:
        return self._delegate.should_sample(
            parent_context,
            trace_id,
            name,
            kind,
            attributes,
            links,
            trace_state,
        )

    def get_description(self) -&gt; str:
        return self._delegate.get_description()
</code></pre>
<pre><code class="language-bash">export OTEL_TRACES_SAMPLER=parentbased_composite
export OTEL_TRACES_SAMPLER_ARG=0.10
</code></pre>
<p>This Python snippet demonstrates how to configure probability-based head sampling in the OpenTelemetry Python SDK using a custom sampler that supports the <code>tracestate</code> probability propagation spec.</p>
<p>Here's what happens in the example:</p>
<ul>
<li>It imports experimental sampling APIs from <code>opentelemetry.sdk.trace._sampling_experimental</code> to build a composite sampler that encodes the sampling probability in the <code>tracestate</code> of each root span. This supports backend throughput correction.</li>
<li>The <code>ParentBasedCompositeSampler</code> class is a wrapper you can plug into the SDK. Its constructor accepts the sampling probability as a string (from <code>OTEL_TRACES_SAMPLER_ARG</code>) and defaults to 1.0, i.e., 100% sampling.</li>
<li><code>composite_sampler(composable_parent_threshold(composable_traceid_ratio_based(ratio)))</code> builds a sampler that:
<ol>
<li>Uses the probability to sample root spans.</li>
<li>Propagates the root sampling decision for child spans via parent-based threshold logic.</li>
<li>Embeds and respects the OpenTelemetry probability fields in <code>tracestate</code>.</li>
</ol>
</li>
<li>The example then shows the required environment variables to enable this sampling logic:
<pre><code class="language-bash">export OTEL_TRACES_SAMPLER=parentbased_composite
export OTEL_TRACES_SAMPLER_ARG=0.10
</code></pre>With these values, you enable parent-based composite sampling with a 10% sampling rate.</li>
</ul>
<p>This snippet enables standards-compliant probability sampling and propagation in OpenTelemetry Python, so throughput metrics can be accurately estimated by your backend based on the <code>tracestate</code> metadata.</p>
<h2>Validation</h2>
<p>To validate your setup and ensure accurate throughput metrics when using head-based sampling, follow these steps:</p>
<ol>
<li>Deploy the SDK in a controlled environment where you can manage both load and sampling rate.</li>
<li>Generate a steady, predictable load with a known throughput.</li>
<li>Set a fixed sampling rate for traces.</li>
<li>Send the resulting telemetry data to Elastic Observability.</li>
<li>Confirm that the reported throughput metrics in Elastic match your expectations.</li>
</ol>
<p>You can use the following ES|QL commands to compare the observed raw throughput (counted from sampled traces) to the extrapolated throughput available in the derived metrics:</p>
<p><strong>Raw throughput:</strong></p>
<pre><code class="language-esql">FROM traces-* 
| WHERE service.name == &quot;your-service&quot;
| WHERE transaction.name IS NOT NULL
| STATS count_transactions = COUNT(*),
    time_range = DATE_DIFF(&quot;minute&quot;, MIN(@timestamp), MAX(@timestamp))
| EVAL raw_throughput_per_min = count_transactions::double / time_range
</code></pre>
<p><strong>Extrapolated throughput:</strong></p>
<pre><code class="language-esql">FROM metrics-*
| WHERE service.name == &quot;your-service&quot;
| WHERE metricset.name == &quot;service_transaction&quot; AND metricset.interval == &quot;1m&quot;
| STATS count_transactions = COUNT(transaction.duration.summary),
    time_range = DATE_DIFF(&quot;minute&quot;, MIN(@timestamp), MAX(@timestamp))
| EVAL extrapolated_throughput_per_min = count_transactions::double / time_range
</code></pre>
<p>The <code>extrapolated_throughput_per_min</code> should be close to your real throughput rate, while the <code>raw_throughput_per_min</code> should be close to your real throughput multiplied by the configured sampling rate.</p>
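<p>As a quick sanity check of what those two queries should roughly report, consider a hypothetical service handling 1,000 requests per minute with 10% head-based sampling:</p>

```python
real_throughput = 1000.0  # hypothetical requests/min the service really handles
sampling_rate = 0.10      # configured head-based sampling rate
adjusted_count = 1 / sampling_rate

# Roughly what the two ES|QL queries should report:
raw_throughput = real_throughput * sampling_rate           # from sampled traces only
extrapolated_throughput = raw_throughput * adjusted_count  # weighted estimate

print(raw_throughput, extrapolated_throughput)
```

<p>If the raw and extrapolated values deviate significantly from this pattern, check that <code>tracestate</code> is propagated end to end.</p>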
<h2>Conclusion</h2>
<p>This work turns head-based sampling into a much safer default for teams that need both cost control and reliable operational metrics.
You no longer have to choose between affordable trace volume and trustworthy throughput calculations.</p>
<p>Before enabling head-based sampling with probability propagation, review the linked PRs and consult each SDK’s release notes to confirm the minimum version with full support (Java, JavaScript, and Python).
Start by enabling sampling-aware configuration in a single service, and once you verify correct <code>tracestate</code> propagation and backend metric accuracy, gradually roll out the change to additional services.
To maintain reliability, establish monitoring and automated regression tests for throughput accuracy, so you can spot any unintended metric drift when sampling rates or SDK components are updated.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/how-we-fixed-head-based-sampling-in-opentelemetry/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Improving the Elastic APM UI performance with continuous rollups and service metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/apm-ui-performance-continuous-rollups-service-metrics</link>
            <guid isPermaLink="false">apm-ui-performance-continuous-rollups-service-metrics</guid>
            <pubDate>Thu, 29 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[We made significant improvements to the UI performance in Elastic APM to make it scale with even the most demanding workloads, by pre-aggregating metrics at the service level, and storing the metrics at different levels of granularity.]]></description>
            <content:encoded><![CDATA[<p>In today's fast-paced digital landscape, the ability to monitor and optimize application performance is crucial for organizations striving to deliver exceptional user experiences. At Elastic, we recognize the significance of providing our user base with a reliable <a href="https://www.elastic.co/observability">observability platform</a> that scales with you as you’re onboarding thousands of services that produce terabytes of data each day. We have been diligently working behind the scenes to enhance our solution to meet the demands of even the largest deployments.</p>
<p>In this blog post, we are excited to share the significant strides we have made in improving the UI performance of Elastic APM. Maintaining a snappy user interface can be a challenge when interactively summarizing the massive amounts of data needed to provide an overview of the performance for an entire enterprise-scale service inventory. We want to assure our customers that we have listened, taken action, and made notable architectural changes to elevate the scalability and maturity of our solution.</p>
<h2>Architectural enhancements</h2>
<p>Our journey began back in the 7.x series where we noticed that doing ad-hoc aggregations on raw <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-transactions.html">transaction</a> data put Elasticsearch<sup>®</sup> under a lot of pressure in large-scale environments. Since then, we’ve begun to pre-aggregate the transactions into transaction metrics during ingestion. This has helped to keep the performance of the UI relatively stable. Regardless of how busy the monitored application is and how many transaction events it is creating, we’re just querying pre-aggregated metrics that are stored at a constant rate. We’ve enabled the metrics-powered UI by default in <a href="https://github.com/elastic/kibana/issues/92024">7.15</a>.</p>
<p>However, when showing an inventory of a large number of services over large time ranges, the number of metric data points that need to be aggregated can still be large enough to cause performance issues. We also create a time series for each distinct set of dimensions. The dimensions include metadata, such as the transaction name and the host name. Our <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_transaction_metrics">documentation</a> includes a full list of all available dimensions. If there’s a very high number of unique transaction names, which could be a result of improper instrumentation (see <a href="https://www.elastic.co/guide/en/kibana/current/troubleshooting.html#troubleshooting-too-many-transactions">docs</a> for more details), this will create a lot of individual time series that will need to be aggregated when requesting a summary of the service’s overall performance. Global labels that are added to the APM Agent configuration are also added as dimensions to these metrics, and therefore they can also impact the number of time series. Refer to the FAQs section below for more details.</p>
<p>Within the 8.7 and 8.8 releases, we’ve addressed these challenges with the following architectural enhancements that aim to reduce the number of documents Elasticsearch needs to search and aggregate on-the-fly, resulting in faster response times:</p>
<ul>
<li><strong>Pre-aggregation of transaction metrics into service metrics.</strong> Instead of aggregating all distinct time series that are created for each individual transaction name on-the-fly for every user request, we’re already pre-aggregating a summary time series for each service during data ingestion. Depending on how many unique transaction names the services have, this reduces the number of documents Elasticsearch needs to look up and aggregate by a factor of typically 10–100. This is particularly useful for the <a href="https://www.elastic.co/guide/en/kibana/master/services.html">service inventory</a> and the <a href="https://www.elastic.co/guide/en/kibana/master/service-overview.html">service overview</a> pages.</li>
<li><strong>Pre-aggregation of all metrics into different levels of granularity.</strong> The APM UI chooses the most appropriate level of granularity, depending on the selected time range. In addition to the metrics that are stored at a 1-minute granularity, we’re also summarizing and storing metrics at a 10-minute and 60-minute granularity level. For example, when looking at a 7-day period, the 60-minute data stream is queried instead of the 1-minute one, resulting in 60x fewer documents for Elasticsearch to examine. This makes sure that all graphs are rendered quickly, even when looking at larger time ranges.</li>
<li><strong>Safeguards on the number of unique transactions per service for which we are aggregating metrics.</strong> Our agents are designed to keep the cardinality of the transaction name low. But in the wild, we’ve seen some services that have a huge amount of unique transaction names. This used to cause performance problems in the UI because APM Server would create many time series that the UI needed to aggregate at query time. In order to protect APM Server from running out of memory when aggregating a large number of time series for each unique transaction name, metrics were published without aggregating when limits for the number of time series were reached. This resulted in a lot of individual metric documents that needed to be aggregated at query time. To address the problem, we've introduced a system where we aggregate metrics in a dedicated overflow bucket for each service when limits are reached. Refer to our <a href="https://www.elastic.co/guide/en/kibana/8.8/troubleshooting.html#troubleshooting-too-many-transactions">documentation</a> for more details.</li>
</ul>
<p>The exact factor of the document count reduction depends on various conditions. But to get a sense of a typical scenario: if your services, on average, have 10 instances, no instance-specific global labels, and 100 unique transaction names each, and you’re looking at time ranges that can leverage the 60m granularity, you’d see a reduction of the documents that Elasticsearch needs to aggregate by a factor of 180,000 (10 instances x 100 transaction names x 60m x 3, because we’re also collapsing the event.outcome dimension). While the response time of Elasticsearch aggregations doesn’t scale exactly linearly with the number of documents, there is a strong correlation.</p>
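<p>The back-of-the-envelope arithmetic above is easy to verify. A quick sketch (the instance, transaction-name, granularity, and outcome counts are the illustrative values from this example, not APM Server constants):</p>
<pre><code class="language-python"># Document-count reduction when pre-aggregated service metrics replace raw
# per-transaction documents. Values are the illustrative ones from the example
# above, not APM Server constants.
instances = 10            # service instances collapsed into one service-level series
transaction_names = 100   # unique transaction names collapsed per service
granularity = 60          # 60 one-minute documents summarized into one 60m document
outcome_dimension = 3     # event.outcome values collapsed (success, failure, unknown)

factor = instances * transaction_names * granularity * outcome_dimension
print(factor)  # 180000
</code></pre>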
<h2>FAQs</h2>
<h3>When upgrading to the latest version, will my old data also load faster?</h3>
<p>Updating to 8.8 doesn’t immediately make the UI faster. Because the improvements are powered by pre-aggregations that APM Server performs during ingestion, only new data will benefit from them. For that reason, make sure to update APM Server along with the rest of the stack. The UI can still display data that was ingested using an older version of the stack.</p>
<h3>If the UI is based on metrics, can I still slice and dice using custom labels?</h3>
<p>High cardinality analysis is a big strength of Elastic Observability, and this focus on pre-aggregated metrics does not compromise that in any way.</p>
<p>The UI implements a sophisticated fallback mechanism that uses service metrics, transaction metrics, or raw transaction events, depending on which filters are applied. We’re not creating metrics for each user.id, for example. But you can still filter the data by user.id, and the UI will then use raw transaction events. Chances are that you’re looking at a narrow slice of data when filtering by a dimension that is not available on the pre-aggregated metrics, so aggregations on the raw data are typically very fast.</p>
<p>Note that all global labels that are added to the APM agent configuration are part of the dimension of the pre-aggregated metrics, with the exception of RUM (see more details in <a href="https://github.com/elastic/apm-server/issues/11037">this issue</a>).</p>
<h3>Can I use the pre-aggregated metrics in custom dashboards?</h3>
<p>Yes! If you use <a href="https://www.elastic.co/guide/en/kibana/current/lens.html">Lens</a> and select the &quot;APM&quot; data view, you can filter on either metricset.name:service_transaction or metricset.name:transaction, depending on the level of detail you need. Transaction latency is captured in transaction.duration.histogram, and successful outcomes and failed outcomes are stored in event.success_count. If you don't need a distribution of values, you can also select the transaction.duration.summary field for your metric aggregations, which should be faster. If you want to calculate the failure rate, here's a <a href="https://www.elastic.co/guide/en/kibana/current/lens.html#lens-formulas">Lens formula</a>: 1 - (sum(event.success_count) / count(event.success_count)). Note that the only granularity supported here is 1m.</p>
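<p>The Lens formula above is plain arithmetic over two aggregations. A minimal sketch of the same computation, using made-up success counts rather than real APM documents:</p>
<pre><code class="language-python"># Failure rate = 1 - (sum(event.success_count) / count(event.success_count)),
# mirroring the Lens formula above. The sample values are made up: one 0/1
# success flag per event, as an illustration only.
success_counts = [1, 1, 0, 1, 0, 1, 1, 1]

failure_rate = 1 - (sum(success_counts) / len(success_counts))
print(failure_rate)  # 0.25
</code></pre>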
<h3>Do the additional metrics have an impact on the storage?</h3>
<p>While we’re storing more metrics than before, and we’re storing all metrics in different levels of granularity, we were able to offset that by enabling <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">synthetic source</a> for all metric data streams. We’ve even increased the default retention for the metrics in the coarse-grained granularity levels, so that the 60m rollup data streams are now stored for 390 days. Please consult our <a href="https://www.elastic.co/guide/en/apm/guide/current/apm-data-streams.html">documentation</a> for more information about the different metric data streams.</p>
<h3>Are there limits on the amount of time series that APM Server can aggregate?</h3>
<p>APM Server performs pre-aggregations in memory, which is fast, but consumes a considerable amount of memory. There are limits in place to protect APM Server from running out of memory, and from 8.7, most of them scale with available memory by default, meaning that allocating more memory to APM Server will allow it to handle more unique pre-aggregation groups like services and transactions. These limits are described in <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows">APM Server Data Model docs</a>.</p>
<p>On the APM Server roadmap, we have plans to move to an LSM-based approach where pre-aggregations are performed with the help of disks in order to reduce memory usage. This will enable APM Server to scale better with the input size and cardinality.</p>
<p>A common pitfall when working with pre-aggregations is to add instance-specific global labels to APM agents. This may exhaust the aggregation limits and cause metrics to be aggregated under the overflow bucket instead of the corresponding service. Therefore, make sure to follow the best practice of only adding a limited set of global labels to a particular service.</p>
<h2>Validation</h2>
<p>To validate the effectiveness of the new architecture, and to ensure that the accuracy of the data is not negatively affected, we prepared a test environment where we generated 35K+ transactions per minute over a timespan of 14 days, resulting in approximately 850 million documents.</p>
<p>We’ve tested the queries that power our service inventory, the service overview, and the transaction details using different time ranges (1d, 7d, 14d). Across the board, we’ve seen orders-of-magnitude improvements. In particular, queries across larger time ranges that benefit from using the coarse-grained metrics in addition to the pre-aggregated service metrics saw dramatic reductions in response time.</p>
<p>We’ve also validated that there’s no loss in accuracy when using the more coarse-grained metrics for larger time ranges.</p>
<p>Every environment will behave a bit differently, but we’re confident that the impressive improvements in response time will translate well to setups of even bigger scale.</p>
<h2>Planned improvements</h2>
<p>As mentioned in the FAQs section, the number of time series for transaction metrics can grow quickly, as it is the product of multiple dimensions. For example, given a service that runs on 100 hosts and has 100 transaction names that each have 4 transaction results, APM Server needs to track 40,000 (100 x 100 x 4) different time series for that service. This would even exceed the maximum per-service limit of 32,000 for APM Servers with 64GB of main memory.</p>
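<p>To see how quickly these dimensions multiply out, here is the example above as a quick computation (the numbers and the 32,000 per-service limit are taken from this example, not universal constants):</p>
<pre><code class="language-python"># Time series tracked for one service = hosts x transaction names x transaction results.
# Values taken from the example above; the 32,000 limit is the per-service
# example limit for an APM Server with 64GB of main memory.
hosts = 100
transaction_names = 100
transaction_results = 4

series = hosts * transaction_names * transaction_results
print(series)           # 40000
print(series > 32_000)  # True: exceeds the example per-service limit, so some
                        # metrics would overflow into a shared bucket
</code></pre>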
<p>As a result, the UI will show an entry for “Remaining Transactions” on the Service overview page, which tracks the transaction metrics for a service once it hits the limit. You may therefore not see all transaction names of your service. It may also be that all distinct transaction names are listed, but that the transaction metrics for some of the instances of that service are combined in the “Remaining Transactions” category.</p>
<p>We’re currently considering restructuring the dimensions for the metrics so that the combination of the transaction name and service instance-specific dimensions (such as the host name) does not lead to an explosion of time series. Stay tuned for more details.</p>
<h2>Conclusion</h2>
<p>The architectural improvements we’ve delivered in the past releases provide a step-function improvement in the scalability and responsiveness of our UI. Instead of having to aggregate massive amounts of data on-the-fly as users are navigating through the user interface, we pre-aggregate the results for the most common queries as data is coming in. This ensures we have the answers ready before users have even asked their most frequently asked questions, while still being able to answer ad-hoc questions.</p>
<p>We are excited to continue supporting our community members as they push boundaries on their growth journey, providing them with a powerful and mature platform that can effortlessly handle the demands of the largest workloads. Elastic is committed to its mission to enable everyone to find the answers that matter. From all data. In real time. At scale.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/apm-ui-performance-continuous-rollups-service-metrics/elastic-blog-header-ui.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Infrastructure monitoring with OpenTelemetry in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability</link>
            <guid isPermaLink="false">infrastructure-monitoring-with-opentelemetry-in-elastic-observability</guid>
            <pubDate>Wed, 24 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Integrating OpenTelemetry with Elastic Observability for Application and Infrastructure Monitoring Solutions.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, we recently made a decision to fully embrace OpenTelemetry as the premier data collection framework. As an Observability engineer, I firmly believe that vendor agnosticism is essential for delivering the greatest value to our customers. By committing to OpenTelemetry, we are not only staying current with technological advancements but also driving them forward. This investment positions us at the forefront of the industry, championing a more open and flexible approach to observability.</p>
<p>Elastic donated the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema (ECS)</a> to OpenTelemetry and is actively working to <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">converge</a> it with the semantic conventions. In the meantime, we are dedicated to supporting our users by ensuring they don’t have to navigate different standards. Our goal is to provide a seamless end-to-end experience while using OpenTelemetry with our application and infrastructure monitoring solutions. This commitment allows users to benefit from the best of both worlds without any friction.</p>
<p>In this blog, we explore how to use the OpenTelemetry (OTel) collector to capture core system metrics from various sources such as AWS EC2, Google Compute, Kubernetes clusters, and individual systems running Linux or MacOS.</p>
<h2>Powering Infrastructure UIs with Two Ingest Paths</h2>
<p>Elastic users who wish to have OpenTelemetry as their data collection mechanism can now monitor the health of the hosts where the OpenTelemetry collector is deployed using the Hosts and Inventory UIs available in Elastic Observability.</p>
<p>Elastic offers two distinct ingest paths to power Infrastructure UIs: the ElasticsearchExporter Ingest Path and the OTLP Exporter Ingest Path.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/IngestPath.png" alt="IngestPath" /></p>
<h3>ElasticsearchExporter Ingest Path:</h3>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#host-metrics-receiver">hostmetrics receiver</a> collects system-level metrics such as CPU, memory, and disk usage from the host machine in the OTel schema. The ElasticsearchExporter ingest path builds on this receiver: we've developed the <a href="https://github.com/elastic/opentelemetry-collector-components/tree/main/processor/elasticinframetricsprocessor#elastic-infra-metrics-processor">ElasticInfraMetricsProcessor</a>, which utilizes the <a href="https://github.com/elastic/opentelemetry-lib/tree/main?tab=readme-ov-file#opentelemetry-lib">opentelemetry-lib</a> to convert these metrics into a format that Elastic UIs understand.</p>
<p>For example, the <code>system.network.io</code> OTel metric includes a <code>direction</code> attribute  with values <code>receive</code> or <code>transmit</code>. These correspond to <code>system.network.in.bytes</code> and <code>system.network.out.bytes</code>, respectively, within Elastic.</p>
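<p>To make the renaming concrete, here is a simplified sketch of this one mapping in Python. It is illustrative only, not the actual ElasticInfraMetricsProcessor or opentelemetry-lib code, and covers just the <code>system.network.io</code> metric:</p>
<pre><code class="language-python"># Illustrative sketch: translate OTel system.network.io data points (with a
# "direction" attribute) into ECS-style field names. Not the real processor code.
ECS_NETWORK_FIELDS = {
    "receive": "system.network.in.bytes",
    "transmit": "system.network.out.bytes",
}

def map_network_io(data_points):
    """Map {direction, value} OTel data points to ECS-style documents."""
    docs = []
    for point in data_points:
        ecs_field = ECS_NETWORK_FIELDS[point["direction"]]
        docs.append({ecs_field: point["value"]})
    return docs

print(map_network_io([
    {"direction": "receive", "value": 1024},
    {"direction": "transmit", "value": 2048},
]))
# [{'system.network.in.bytes': 1024}, {'system.network.out.bytes': 2048}]
</code></pre>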
<p>The <a href="https://github.com/elastic/opentelemetry-collector-components/tree/main/processor/elasticinframetricsprocessor#elastic-infra-metrics-processor">processor</a> then forwards these metrics to the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/elasticsearchexporter#elasticsearch-exporter">Elasticsearch Exporter</a>, now enhanced to support exporting metrics in ECS mode. The exporter sends the metrics to an Elasticsearch endpoint, lighting up the Infrastructure UIs with insightful data.</p>
<p>To utilize this path, you can deploy the collector from the Elastic Collector Distro, available <a href="https://github.com/elastic/elastic-agent/blob/main/internal/pkg/otel/README.md">here</a>.</p>
<p>An example collector config for this Ingest Path:</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      process:
        metrics:
          process.open_file_descriptors:
            enabled: true
          process.memory.utilization:
            enabled: true
          process.disk.operations:
            enabled: true
      network:
      processes:
      load:
      disk:
      filesystem:

processors:
  resourcedetection/system:
    detectors: [&quot;system&quot;, &quot;ec2&quot;]
  elasticinframetrics:

exporters:  
  logging:
    verbosity: detailed
  elasticsearch/metrics: 
    endpoints: &lt;elasticsearch_endpoint&gt;
    api_key: &lt;api_key&gt;
    mapping:
      mode: ecs

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      processors: [resourcedetection/system, elasticinframetrics]
      exporters: [logging, elasticsearch/metrics]

</code></pre>
<p>The Elastic exporter path is ideal for users who would prefer using the custom Elastic Collector <a href="https://github.com/elastic/elastic-agent/blob/main/internal/pkg/otel/README.md">Distro</a>. This path includes the ElasticInfraMetricsProcessor, which sends data to Elasticsearch via the Elasticsearch exporter.</p>
<h3>OTLP Exporter Ingest Path:</h3>
<p>In the OTLP Exporter Ingest path, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#host-metrics-receiver">hostmetrics receiver</a> collects system-level metrics such as CPU, memory, and disk usage from the host machine in OTel Schema. These metrics are sent to the <a href="https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlpexporter#otlp-grpc-exporter">OTLP Exporter</a>, which forwards them to the <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html#apm-connect-open-telemetry-collector">APM Server endpoint</a>. The APM Server, using the same <a href="https://github.com/elastic/opentelemetry-lib/tree/main?tab=readme-ov-file#opentelemetry-lib">opentelemetry-lib</a>, converts these metrics into a format compatible with Elastic UIs. Subsequently, the APM Server pushes the metrics to Elasticsearch, powering the Infrastructure UIs.</p>
<p>An example collector configuration for the APM Ingest Path</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      process:
        metrics:
          process.open_file_descriptors:
            enabled: true
          process.memory.utilization:
            enabled: true
          process.disk.operations:
            enabled: true
      network:
      processes:
      load:
      disk:
      filesystem:

processors:
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]

exporters:
  otlphttp:
    endpoint: &lt;mis_endpoint&gt;
    tls:
      insecure: false
    headers:
      Authorization: ApiKey &lt;api_key&gt;
  logging:
    verbosity: detailed

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      processors: [resourcedetection/system]
      exporters: [logging, otlphttp]


</code></pre>
<p>The OTLP Exporter Ingest path can help existing users who are already using Elastic APM and want to see the Infrastructure UIs populated as well. These users can use the default <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib?tab=readme-ov-file#opentelemetry-collector-contrib">OpenTelemetry Collector</a>.</p>
<h2>A glimpse of the Infrastructure UIs</h2>
<p>The Infrastructure UIs showcase both host- and Kubernetes-level views. Below are a few glimpses of these UIs.</p>
<p>The Hosts Overview UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/HostUI.png" alt="HostUI" /></p>
<p>The Hosts Inventory UI
<img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Inventory.png" alt="InventoryUI" /></p>
<p>The Process-related Details of the Host</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Processes.png" alt="Processes" /></p>
<p>The Kubernetes Inventory UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/K8s.png" alt="K8s" /></p>
<p>Pod level Metrics</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Pod_Metrics.png" alt="Pod Metrics" /></p>
<p>Our next step is to create Infrastructure UIs powered by native OTel data, with dedicated OTel dashboards that run on this native data.</p>
<h2>Conclusion</h2>
<p>Elastic's integration with OpenTelemetry simplifies the observability landscape. While we are diligently working to align ECS with OpenTelemetry’s semantic conventions, our immediate priority is to support our users by simplifying their experience. With this added support, we aim to deliver a seamless, end-to-end experience for those using OpenTelemetry with our application and infrastructure monitoring solutions. We are excited to see how our users will leverage these capabilities to gain deeper insights into their systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Monitoring-infra-with-Otel.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Ingesting and analyzing Prometheus metrics with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ingesting-analyzing-prometheus-metrics-observability</link>
            <guid isPermaLink="false">ingesting-analyzing-prometheus-metrics-observability</guid>
            <pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.]]></description>
            <content:encoded><![CDATA[<p>In the world of monitoring and observability, <a href="https://prometheus.io/">Prometheus</a> has grown into the de-facto standard for monitoring in cloud-native environments because of its robust data collection mechanism, flexible querying capabilities, and integration with other tools for rich dashboarding and visualization.</p>
<p>Prometheus is primarily built for short-term metric storage, typically retaining data in-memory or on local disk storage, with a focus on real-time monitoring and alerting rather than historical analysis. While it offers valuable insights into current metric values and trends, it may pose economic challenges and fall short of the robust functionalities and capabilities necessary for in-depth historical analysis, long-term trend detection, and forecasting. This is particularly evident in large environments with a substantial number of targets or high data ingestion rates, where metric data accumulates rapidly.</p>
<p>Numerous organizations assess their unique needs and explore avenues to augment their Prometheus monitoring and observability capabilities. One effective approach is integrating Prometheus with Elastic®. In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.</p>
<h2>Integrate Prometheus with Elastic seamlessly</h2>
<p>Organizations that have configured their cloud-native applications to expose metrics in Prometheus format can seamlessly transmit the metrics to Elastic by using <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">Prometheus integration</a>. Elastic enables organizations to monitor their metrics in conjunction with all other data gathered through <a href="https://www.elastic.co/integrations/data-integrations">Elastic's extensive integrations</a>.</p>
<p>Go to Integrations and find the Prometheus integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-1-integrations.png" alt="1 - integrations" /></p>
<p>To gather metrics from Prometheus servers, the Elastic Agent is employed, with central management of Elastic agents handled through the <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet server</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-2-set-up-prometheus-integration.png" alt="2 - set up integration" /></p>
<p>After enrolling the Elastic Agent in Fleet, users can choose from the following methods to ingest Prometheus metrics into Elastic.</p>
<h3>1. Prometheus collectors</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-exporters-collectors">The Prometheus collectors</a> connect to the Prometheus server and pull metrics or scrape metrics from a Prometheus exporter.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-3-prometheus-collectors.png" alt="3 - Prometheus collectors" /></p>
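<p>For reference, what the collectors scrape is the standard Prometheus text exposition format. An exporter's /metrics endpoint returns plain text along these lines (the metric shown is a common node_exporter metric, used here purely as an example):</p>
<pre><code># HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 678.9
</code></pre>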
<h3>2. Prometheus queries</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-queries-promql">The Prometheus queries</a> execute specific Prometheus queries against <a href="https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries">Prometheus Query API</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-4-promtheus-queries.png" alt="4 - Prometheus queries" /></p>
<h3>3. Prometheus remote-write</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-server-remote-write">The Prometheus remote_write</a> can receive metrics from a Prometheus server that has configured the <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write">remote_write</a> setting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-5-prometheus-remote-write.png" alt="5 - Prometheus remote-write" /></p>
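<p>On the Prometheus side, this only requires a remote_write block in prometheus.yml pointing at the address the integration listens on (the host below is a placeholder; 9201 is the commonly documented default port, but verify it against your integration settings):</p>
<pre><code class="language-yaml"># prometheus.yml (snippet): forward samples to the Elastic Agent's
# remote_write listener. Host is a placeholder; confirm the port in your
# integration policy.
remote_write:
  - url: "http://&lt;elastic_agent_host&gt;:9201/write"
</code></pre>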
<p>After your Prometheus metrics are ingested, you have the option to visualize your data graphically within the <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and further segment it based on labels, such as hosts, containers, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-10-metrics-explorer.png" alt="10 - metrics explorer" /></p>
<p>You can also query your metrics data in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> and explore the fields of your individual documents within the details panel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-7-expanded-doc.png" alt="7 - expanded document" /></p>
<h2>Storing historical metrics with Elastic’s data tiering mechanism</h2>
<p>By exporting Prometheus metrics to Elasticsearch, organizations can extend the retention period and gain the ability to analyze metrics historically. Elastic optimizes data storage and access based on the frequency of data usage and the performance requirements of different data sets. The goal is to efficiently manage and store data, ensuring that it remains accessible when needed while keeping storage costs in check.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-8-hot-to-frozen.png" alt="8 - hot to frozen flow chart" /></p>
<p>After ingesting Prometheus metrics data, you have various retention options. You can set the duration for data to reside in the hot tier, which utilizes high IO hardware (SSD) and is more expensive. Alternatively, you can move the Prometheus metrics to the warm tier, employing cost-effective hardware like spinning disks (HDD) while maintaining consistent and efficient search performance. The cold tier mirrors the infrastructure of the warm tier for primary data but utilizes S3 for replica storage. Elastic automatically recovers replica indices from S3 in case of node or disk failure, ensuring search performance comparable to the warm tier while reducing disk cost.</p>
<p>The <a href="https://www.elastic.co/blog/introducing-elasticsearch-frozen-tier-searchbox-on-s3">frozen tier</a> allows direct searching of data stored in S3 or an object store, without the need for rehydration. The purpose is to further reduce storage costs for Prometheus metrics data that is less frequently accessed. By moving historical data into the frozen tier, organizations can optimize their storage infrastructure, ensuring that the recent, critical data remains in higher-performance tiers while less frequently accessed data is stored economically in the frozen tier. This way, organizations can perform historical analysis and trend detection, identify patterns and make informed decisions, and maintain compliance with regulatory standards in a cost-effective manner.</p>
<p>An alternative way to store your cloud-native metrics more efficiently is to use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Elastic Time Series Data Stream</a> (TSDS). TSDS can store your metrics data more efficiently with <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">~70% less disk space</a> than a regular data stream. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html">downsampling</a> functionality will further reduce the storage required by rolling up metrics within a fixed time interval into a single summary metric. This not only assists organizations in cutting down on storage expenses for metric data but also simplifies the metric infrastructure, making it easier for users to correlate metrics with logs and traces through a unified interface.</p>
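<p>As a hedged illustration, downsampling can be driven from an index lifecycle policy. A minimal sketch (the policy name, timings, and interval are illustrative, and the downsample action requires a recent Elasticsearch version that supports TSDS downsampling):</p>
<pre><code class="language-json">PUT _ilm/policy/prometheus-metrics-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "downsample": {
            "fixed_interval": "1h"
          }
        }
      }
    }
  }
}
</code></pre>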
<h2>Advanced analytics</h2>
<p>Besides <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>, Elasticsearch® provides more advanced analytics capabilities and empowers organizations to gain deeper, more valuable insights into their Prometheus metrics data.</p>
<p>Out of the box, Prometheus integration provides a default overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-9-advacned-analytics.png" alt="9 - adv analytics" /></p>
<p>From Metrics Explorer or Discover, users can also easily edit their Prometheus metrics visualizations in <a href="https://www.elastic.co/kibana/kibana-lens">Elastic Lens</a> or create new ones.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-6-metrics-explorer.png" alt="6 - metrics explorer" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-11-green-bars.png" alt="11 - green bars" /></p>
<p>Elastic Lens enables users to explore and visualize data intuitively through dynamic visualizations. This user-friendly interface eliminates the need for complex query languages, making data analysis accessible to a broader audience. Elasticsearch also offers other powerful visualization methods with <a href="https://www.elastic.co/guide/en/kibana/current/add-aggregation-based-visualization-panels.html">aggregations</a> and <a href="https://www.youtube.com/watch?v=I8NtctS33F0">filters</a>, enabling users to perform advanced analytics on their Prometheus metrics data, including short-term and historical data. To learn more, check out the <a href="https://www.elastic.co/videos/training-how-to-series-stack">how-to series: Kibana</a>.</p>
<h2>Anomaly detection and forecasting</h2>
<p>When analyzing data, maintaining a constant watch on the screen is simply not feasible, especially when dealing with millions of time series of Prometheus metrics. Engineers frequently encounter the challenge of differentiating normal from abnormal data points, which involves analyzing historical data patterns — a process that can be exceedingly time consuming and often exceeds human capabilities. Thus, there is a pressing need for a more intelligent approach to detect anomalies efficiently.</p>
<p>Setting up alerts may seem like an obvious solution, but relying solely on rule-based alerts with static thresholds can be problematic. What's normal on a Wednesday at 9:00 a.m. might be entirely different from a Sunday at 2:00 a.m. This often leads to complex and hard-to-maintain rules or wide alert ranges that end up missing crucial issues. Moreover, as your business, infrastructure, users, and products evolve, these fixed rules don't keep up, resulting in lots of false positives or, even worse, important issues slipping through the cracks without detection. A more intelligent and adaptable approach is needed to ensure accurate and timely anomaly detection.</p>
<p>Elastic's machine learning anomaly detection excels in such scenarios. It automatically models the normal behavior of your Prometheus data, learning trends and identifying anomalies, thereby reducing false positives and improving mean time to resolution (MTTR). With over 13 years of development experience in this field, Elastic has emerged as a trusted industry leader.</p>
<p>The key advantage of Elastic's machine learning anomaly detection lies in its unsupervised learning approach. By continuously observing real-time data, it acquires an understanding of the data's behavior over time. This includes grasping daily and weekly patterns, enabling it to establish a normalcy range of expected behavior. Behind the scenes, it constructs statistical models that allow accurate predictions, promptly identifying any unexpected variations. In cases where emerging data exhibits unusual trends, you can seamlessly integrate with alerting systems, operationalizing this valuable insight.</p>
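<p>Elastic's actual models are far more sophisticated (they capture trend, daily and weekly seasonality, and more), but the core idea of a learned normalcy range can be illustrated with a toy rolling-baseline sketch. Everything below is illustrative and is not Elastic's implementation:</p>

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Toy anomaly detector: flags values outside mean +/- 3 standard deviations
    of a sliding window. Purely illustrative; Elastic's ML models normal
    behavior with much richer statistics (trend, seasonality, etc.)."""

    def __init__(self, window=60):
        self.history = deque(maxlen=window)

    def observe(self, value):
        is_anomaly = False
        if len(self.history) >= 10:  # wait for some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            is_anomaly = abs(value - mu) > 3 * max(sigma, 1e-9)
        self.history.append(value)
        return is_anomaly

detector = RollingBaseline()
for v in [10, 11, 10, 9, 10, 11, 10, 9, 10, 11]:
    detector.observe(v)            # learn a baseline of 'normal' values
print(detector.observe(10.5))      # within the normalcy range -> False
print(detector.observe(100.0))     # far outside the range -> True
```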
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-12-LPO.png" alt="12 - LPO" /></p>
<p>Machine learning's ability to project into the future, forecasting data trends one day, a week, or even a month ahead, equips engineers not only with reporting capabilities but also with pattern recognition and failure prediction based on historical Prometheus data. This plays a crucial role in maintaining mission-critical workloads, offering organizations a proactive monitoring approach. By foreseeing and addressing issues before they escalate, organizations can avert downtime, cut costs, optimize resource utilization, and ensure uninterrupted availability of their vital applications and services.</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html#ml-ad-create-job">Creating a machine learning job</a> for your Prometheus data is a straightforward task with a few simple steps. Simply specify the data index and set the desired time range in the single metric view. The machine learning job will then automatically process the historical data, building statistical models behind the scenes. These models will enable the system to predict trends and identify anomalies effectively, providing valuable and actionable insights for your monitoring needs.</p>
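<p>If you prefer automation over the UI, an anomaly detection job can also be created through Elasticsearch's machine learning APIs. The sketch below only assembles a hypothetical request body; the metric field name, job ID, and description are placeholders for your own Prometheus data:</p>

```python
import json

# Hypothetical job definition for a Prometheus CPU metric. The field name
# and description are placeholders; substitute your own.
job_body = {
    'description': 'Mean CPU anomaly detection on Prometheus metrics',
    'analysis_config': {
        'bucket_span': '15m',
        'detectors': [
            {'function': 'mean', 'field_name': 'prometheus.metrics.node_cpu_usage'}
        ],
    },
    'data_description': {'time_field': '@timestamp'},
}

# The body would be sent to Elasticsearch as, for example:
#   PUT _ml/anomaly_detectors/prometheus-cpu-job
print(json.dumps(job_body, indent=2))
```

<p>Note that a datafeed pointing at the index holding your Prometheus metrics is also required before the job can run; see the linked documentation for the full workflow.</p>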
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-13-creating-ML-job.png" alt="13 - create ML job" /></p>
<p>In essence, Elastic machine learning empowers us to harness the capabilities of data scientists and effectively apply them in monitoring Prometheus metrics. By seamlessly detecting anomalies and predicting potential issues in advance, Elastic machine learning bridges the gap and enables IT professionals to benefit from the insights derived from advanced data analysis. This practical and accessible approach to anomaly detection equips organizations with a proactive stance toward maintaining the reliability of their systems.</p>
<h2>Try it out</h2>
<p><a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a> on Elastic Cloud and <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">ingest your Prometheus metrics into Elastic</a>. Enhance your Prometheus monitoring with Elastic Observability. Stay ahead of potential issues with advanced AI/ML anomaly detection and prediction capabilities. Eliminate data silos, reduce costs, and enhance overall response efficiency.</p>
<p>Elevate your monitoring capabilities with Elastic today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/illustration-machine-learning-anomaly-v2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distribution for OpenTelemetry Python]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-python</guid>
            <pubDate>Sun, 07 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic Distribution for OpenTelemetry Python. See how easy it is to instrument your Python applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We are delighted to announce the alpha release of the <a href="https://github.com/elastic/elastic-otel-python#readme">Elastic Distribution for OpenTelemetry Python</a>. This project is a customized OpenTelemetry distribution that allows us to configure better defaults for using OpenTelemetry with the Elastic cloud offering.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. We have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a>, <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a> and <a href="https://github.com/elastic/elastic-otel-node#readme">Node.js</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start. The Elastic Distribution for OpenTelemetry Python is the latest addition.</p>
<h2>Design choices</h2>
<p>We have chosen to provide a lean distribution that does not install all the instrumentations by default, but instead provides tools
to do so. We leverage the <code>opentelemetry-bootstrap</code> tool provided by the OpenTelemetry Python project, which scans the packages installed in your
environment and recognizes the libraries we are able to instrument. The tool can either just report the available instrumentations or
install them for you as well.
This allows you to avoid installing packages you are not going to need, or instrumenting libraries you are not interested in tracing.</p>
<h2>Getting started</h2>
<p>To get started with the Elastic Distribution for OpenTelemetry Python, you need to install the package <code>elastic-opentelemetry</code> in your project
environment. We'll use <code>pip</code> in our examples, but you are free to use any Python package and environment manager of your choice.</p>
<pre><code class="language-bash">pip install elastic-opentelemetry
</code></pre>
<p>Once you have installed our distro, you'll also have the <code>opentelemetry-bootstrap</code> command available. Running it:</p>
<pre><code class="language-bash">opentelemetry-bootstrap
</code></pre>
<p>will list all of the instrumentation packages available for your environment. For example, you can expect something like the following:</p>
<pre><code>opentelemetry-instrumentation-asyncio==0.46b0
opentelemetry-instrumentation-dbapi==0.46b0
opentelemetry-instrumentation-logging==0.46b0
opentelemetry-instrumentation-sqlite3==0.46b0
opentelemetry-instrumentation-threading==0.46b0
opentelemetry-instrumentation-urllib==0.46b0
opentelemetry-instrumentation-wsgi==0.46b0
opentelemetry-instrumentation-grpc==0.46b0
opentelemetry-instrumentation-requests==0.46b0
opentelemetry-instrumentation-system-metrics==0.46b0
opentelemetry-instrumentation-urllib3==0.46b0
</code></pre>
<p>It also provides a command option to install the packages automatically:</p>
<pre><code class="language-bash">opentelemetry-bootstrap --action=install
</code></pre>
<p>We advise running this command every time you release a new version of your application so that you can install, or simply review, the
instrumentation packages your code needs.</p>
<p>A few environment variables are needed to configure instrumentation for your services. These mostly
concern the destination of your traces, but they also identify your service.
A <em>service name</em> is required to distinguish your service from the others. You also need to provide
the <em>authorization</em> headers for authenticating with Elastic Observability cloud and the Elastic cloud endpoint where the data is sent.</p>
<p>The API key you get from your Elastic cloud serverless project must be <em>URL-encoded</em>. You can do that with the following Python snippet:</p>
<pre><code class="language-python">from urllib.parse import quote
print(quote(&quot;ApiKey &lt;your api key&gt;&quot;))
</code></pre>
<p>Once you have all your configuration values you can export via environment variables as below:</p>
<pre><code class="language-bash">export OTEL_RESOURCE_ATTRIBUTES=service.name=&lt;service-name&gt;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=&lt;url encoded apikey header value&gt;&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT=&lt;your elastic cloud url&gt;
</code></pre>
<p>We are done with the configuration, and the last piece of the puzzle is wrapping your service invocation with
<code>opentelemetry-instrument</code>, the wrapper that provides <em>zero-code instrumentation</em>. <em>Zero-code</em> (or automatic) instrumentation means
that the distribution will set up the OpenTelemetry SDK and enable all of the previously installed instrumentations for you.
Unfortunately, <em>zero-code</em> instrumentation does not cover all libraries, and some (web frameworks in particular) will require minimal manual
configuration.</p>
<p>For a web service running with gunicorn it may look like:</p>
<pre><code class="language-bash">opentelemetry-instrument gunicorn main:app
</code></pre>
<p>The result is an observable application using the industry-standard <a href="https://opentelemetry.io/">OpenTelemetry</a> — offering high-quality instrumentation of many popular Python libraries, a portable API to avoid vendor lock-in and an active community.</p>
<p>Using Elastic Observability, some of the out-of-the-box benefits you can expect are rich trace viewing, service maps, integrated metric and log analysis, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-python/traces-original.png" alt="trace sample screenshot" /></p>
<h2>What's next?</h2>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>In the Python ecosystem we are active reviewers and contributors of both the <a href="https://github.com/open-telemetry/opentelemetry-python/">opentelemetry-python</a> and <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/">opentelemetry-python-contrib</a> repositories.</p>
<p>The Elastic Distribution for OpenTelemetry Python is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-python/">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-python/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-python/issues">the project issue tracker</a>.</p>
<p>We are eager to know your use cases to help you succeed in your Observability journey.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<h2>Resources</h2>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry">https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-observability">https://www.elastic.co/observability-labs/blog/opentelemetry-observability</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-python/python.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing the OTTL Playground for OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/introducing-the-ottl-playground-for-opentelemetry</link>
            <guid isPermaLink="false">introducing-the-ottl-playground-for-opentelemetry</guid>
            <pubDate>Thu, 13 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce the OTTL Playground (https://ottl.run), a powerful and user-friendly tool designed to allow users to experiment with OpenTelemetry Transformation Language (OTTL) effortlessly. The playground provides a rich interface for users to create, modify, and test statements in real-time, making it easier to understand how different configurations impact the OpenTelemetry data transformation.]]></description>
            <content:encoded><![CDATA[<h2>OTTL Playground</h2>
<p>As the demand for observability and monitoring solutions grows, OpenTelemetry
has emerged as a key framework for collecting, processing, and exporting
telemetry data. Within this ecosystem, the OpenTelemetry Transformation Language
(<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl#opentelemetry-transformation-language">OTTL</a>)
is a powerful way to customize telemetry data transformation, but it can be daunting for
both new and experienced users alike.</p>
<p>To help address these challenges, we are thrilled to introduce the <strong>OTTL Playground</strong> (<a href="https://ottl.run">https://ottl.run</a>),
a powerful and user-friendly tool designed to allow users to experiment with OTTL
effortlessly. The playground provides a rich interface for users to create,
modify, and test statements in real-time, making it easier to understand how
different configurations impact the OpenTelemetry data transformation. Users can
instantly validate OTTL transformations, from input to output, along with diffs.
This allows new users to explore the nuances of OTTL without the risk of
disrupting production environments.</p>
<h2>How does it work?</h2>
<p>The OTTL Playground allows you to run OTTL statements using different processors and versions.
Currently, it supports the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a>
and the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor">filter processor</a>,
with additional evaluators potentially being added in the future.</p>
<p>To start exploring, simply visit <a href="https://ottl.run">https://ottl.run</a>.
Once the processor configuration and OTLP payload are filled in, click the “Run”
button in the top-right corner of the screen, and the result will instantly
appear in the result panel on the right.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-result-example.png" alt="OTTL Playground result example" /></p>
<p>The above example uses the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a>
to rename a trace resource attribute, and its effect on the data can be easily
spotted on the diff-based result panel.</p>
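<p>For reference, a transform processor configuration that renames a resource attribute might look like the following (the attribute names here are illustrative, not from the screenshot above):</p>

```yaml
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # copy the value to the new key, then drop the old one
          - set(attributes["service.name"], attributes["app.name"])
          - delete_key(attributes, "app.name")
```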
<p>Examples of processor configurations and OTLP payloads can be loaded by
selecting an option from the dropdown menu in the top-right corner of the
Configuration and OTLP Payload panels.</p>
<h2>Different result flavors</h2>
<p>The primary goal of the OTTL Playground is to help users understand how
processors and their OTTL statements impact telemetry data. The Playground
offers several types of results to aid in this understanding, including visual
results, JSON results, and execution debug logs.</p>
<ul>
<li><strong>Visual delta</strong>: Shows a diff-based comparison, providing an intuitive and
immediate understanding of how the data is transformed. This makes it easier
for users to grasp complex changes without delving into raw data.</li>
<li><strong>Annotated delta</strong>: This visualization is similar to the visual delta, but
instead of a graphical representation, it shows the JSON diff values, providing
a detailed step-by-step explanation of the changes made to the original data.</li>
<li><strong>JSON</strong>: Offers a detailed view of the data, allowing users to see the exact
output of their OTTL statements. This is particularly useful for debugging and
verifying precise data transformations.</li>
<li><strong>Execution logs</strong>: The OTTL and processors have very detailed debug logs, which
provide a step-by-step account of the processing. This is invaluable for
troubleshooting and understanding the sequence of transformations applied to
the data.</li>
</ul>
<p>Together, these features empower users to experiment confidently with OTTL
statements, ensuring they can fine-tune their telemetry data transformations
with clarity and precision.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-different-result-types.gif" alt="OTTL Playground different result types" /></p>
<h2>Sharing</h2>
<p>The OTTL Playground supports sharing configurations easily. By clicking on the
“Copy Link” button located in the top right corner of the interface, users can
generate a unique URL that encapsulates their current playground state. This
link can then be shared with colleagues or community members, allowing others to
quickly load and review the exact setup. This feature facilitates collaboration
and troubleshooting by enabling seamless sharing of specific OTTL configurations
and results.</p>
<p>Given that shareable links are public and might carry data in them, we advise
you to refrain from submitting any confidential information.</p>
<h2>Playground architecture</h2>
<p>The OTTL Playground is a static website that operates entirely within the
client's browser. It leverages <a href="https://webassembly.org/">WebAssembly</a>, compiled
from the actual OpenTelemetry Collector code, to deliver results identical to
those of a real collector distribution, all while maintaining near-native
performance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-architecture.png" alt="OTTL Playground architecture" /></p>
<p>The user interface of the OTTL Playground uses WebComponents, which offer
several benefits: this modular approach makes the UI more maintainable and
scalable, enables reuse, and allows developers to embed the
Playground elements in different projects.</p>
<h2>The road so far</h2>
<p>As we wrap up, it's important to note that the OTTL Playground is still in its
beta phase. This means we're actively working on refining its features and
improving its performance based on user feedback.</p>
<p>We're incredibly excited about the potential this tool holds for simplifying the
OTTL usage and enhancing collaboration within the community. We invite you to
explore the OTTL Playground, share your experiences, and help us shape it into a
more efficient and user-friendly tool. Stay tuned for more updates and new
features in the coming months!</p>
<p>This project is being developed in collaboration with the OTel community, and you can track
its progress and contribute through this <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33747">GitHub issue</a>.
The source code is available in this <a href="https://github.com/elastic/ottl-playground">repository</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/ottl-playground.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gaining new perspectives beyond logging: An introduction to application performance monitoring]]></title>
            <link>https://www.elastic.co/observability-labs/blog/introduction-apm-tracing-logging</link>
            <guid isPermaLink="false">introduction-apm-tracing-logging</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Change is on the horizon for the world of logging. In this post, we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.]]></description>
            <content:encoded><![CDATA[<h2>Prioritize customer experience with APM and tracing</h2>
<p>Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of these tools that can make our lives easier. One such tool that is currently underutilized is <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">application performance monitoring</a> (APM) and tracing, despite the fact that OpenTelemetry has made it possible to adopt at low friction.</p>
<p>Logging, however, is ubiquitous. Every software application has logs of some kind, and the default workflow for troubleshooting (even today) is to go from exceptions experienced by customers and systems to the logs and start from there to find a solution.</p>
<p>There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or log file at all or the problem is that the system is very slow? Logging alone cannot help solve these problems. This leaves users with half broken systems and poor user experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.</p>
<p>The question I find myself asking is why does the customer experience often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make this as important as logging. Users should stop going to logs by default and thinking primarily in logs, as many are doing today. This will also come with some required changes to mental models.</p>
<p>What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.</p>
<h2>Cultivating a new monitoring mindset: How to drive APM and tracing adoption</h2>
<p>To get teams to shift their troubleshooting mindset, what organizational changes need to be made?</p>
<p>Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to Observability or a CoE (Center of Excellence) with its own roadmap and priorities.</p>
<p>This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes.</p>
<p>From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.</p>
<p>In the next few sections, we will go through a four step journey to help you maximize your success with APM and tracing. This journey will take you through the following key steps on your journey to successful APM adoption:</p>
<ol>
<li><strong>Ingest:</strong> What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?</li>
<li><strong>Integrate:</strong> How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?</li>
<li><strong>Analytics and AIOPs:</strong> Improve the customer experience and reduce the noise through machine learning.</li>
<li><strong>Scale and total cost of ownership:</strong> Roll out enterprise-wide tracing and adopt strategies to deal with data volume.</li>
</ol>
<h2>1. Ingest</h2>
<p>Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.</p>
<h3>Getting started with instrumentation</h3>
<p>What options do we have for ingesting APM and trace data? There are many options, and we will discuss them to help guide you, but first let's take a step back. APM has a deep history: in the very first implementations of APM, people were concerned mainly with timing methods, like the one below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-timing-methods.png" alt="timing methods" /></p>
<p>Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.</p>
<p>From here things started to evolve, and one of the first additions to APM was to add in tracing.</p>
<p>For Java, it’s fairly trivial to implement a system to do this by using what's known as a Java agent. You just specify the <code>-javaagent</code> command line argument, and the agent code gets access to the dynamic compilation routines within Java so it can modify the code before it is compiled into machine code, allowing you to “wrap” specific methods with timing or tracing routines. So, auto-instrumenting Java was one of the first things that the original APM vendors did.</p>
<p><a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry has agents like this</a>, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling.</p>
<p>Things have moved on since then, and Node.js and Python are now popular.</p>
<p>As a result, ways of auto-instrumenting these language runtimes have appeared, which mostly work by injecting the libraries into the code before starting them up. OpenTelemetry has a way of doing this on Kubernetes with an Operator and sidecar <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">here</a>, which supports Python, Node.js, Java, and .NET.</p>
<p>The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this.</p>
<p>You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a>.</p>
<pre><code class="language-python">from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: &quot;your-service-name&quot;
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers=&quot;Authorization=Bearer%20&quot;+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()

# Initialize Flask app and instrument it
app = Flask(__name__)

@app.route(&quot;/completion&quot;)
@tracer.start_as_current_span(&quot;do_work&quot;)
def completion():
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, 1)
        return &quot;ok&quot;  # placeholder response; see the full example linked above
</code></pre>
<p>By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.</p>
<h3>Sampling</h3>
<p>Many people don’t realize that APM is an expensive process. It adds significant CPU and memory overhead to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.</p>
<p>Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.</p>
<p><strong>Head-based sampling</strong><br />
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.</p>
<p>With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.</p>
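<p>To make the ratio concrete, here is a minimal, stdlib-only sketch of how a ratio-based head sampler can decide deterministically from the trace ID. This only illustrates the idea (real OpenTelemetry SDKs ship their own ratio samplers), but it shows why every service in a trace reaches the same decision:</p>

```python
# Illustrative sketch of deterministic head-based (ratio) sampling.
# Compare the low 64 bits of the trace ID against a fixed bound, so
# every service computes the same decision for the same trace.

TRACE_ID_MASK = (1 << 64) - 1

def should_sample(trace_id: int, ratio: float) -> bool:
    """Sample the trace iff its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound

# A 0.5 ratio keeps roughly half of all traces, decided up front:
decisions = [should_sample(t, 0.5) for t in (0x1, 0x8000000000000001, 0xFFFF)]
```

<p>Because the decision is a pure function of the trace ID, it propagates consistently across services without any coordination.</p>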
<p><strong>Tail-based sampling</strong><br />
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold.</p>
<p>We recommend tail-based sampling because it has the highest likelihood of reducing noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that every trace must be generated and forwarded before the decision can be made, so more data flows out of your APM agents, which can use more CPU and memory on your application.</p>
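<p>The decision logic can be sketched as follows. This is a simplified illustration (real tail-based samplers, such as the OpenTelemetry Collector's <code>tail_sampling</code> processor, buffer spans per trace and support many policies), keeping only completed traces that contain errors or exceed a latency threshold:</p>

```python
# Simplified sketch of a tail-based sampling decision: all spans of a
# trace are buffered, and the keep/drop choice is made only once the
# trace is complete. The span dicts here are hypothetical.

def keep_trace(spans: list[dict], latency_threshold_ms: float = 500.0) -> bool:
    """Keep a completed trace if any span errored or total latency is high."""
    has_error = any(s.get("status") == "ERROR" for s in spans)
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    return has_error or (end - start) > latency_threshold_ms

fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]    # dropped
slow = [{"status": "OK", "start_ms": 0, "end_ms": 900}]       # kept: slow
errored = [{"status": "ERROR", "start_ms": 0, "end_ms": 50}]  # kept: error
```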
<h3>OpenTelemetry Semantic Conventions and Elastic Common Schema</h3>
<p>OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.</p>
<p>Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.</p>
<p>Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances.</p>
<p>Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security.</p>
<p>Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.</p>
<p>We advocate adhering to established schemas such as ECS when defining trace, metric, and log data in your applications, particularly when developing new code. This practice will save time and effort when addressing issues.</p>
<h2>2. Integrate</h2>
<p>Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for <a href="https://www.elastic.co/observability/synthetic-monitoring">synthetic monitoring</a> and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.</p>
<h3>APM + logs = superpowers!</h3>
<p>Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. <a href="https://www.elastic.co/guide/en/observability/master/logs-send-application.html">This is normally simple to enable</a>.</p>
<p>With this enabled, you will also get automated injection of useful fields like these:</p>
<ul>
<li>service.name, service.version, service.environment</li>
<li>trace.id, transaction.id, error.id</li>
</ul>
<p>This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-latency-distribution.png" alt="latency distribution" /></p>
<p>If this is available to you, we highly recommend turning it on.</p>
<h3>Deploying APM inside Kubernetes</h3>
<p>It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this.</p>
<p><strong>1. Auto instrumentation using sidecars</strong><br />
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications.</p>
<p>The init container is used simply to copy the required library or JAR file into the main application container at startup. Then, you can use <a href="https://kustomize.io/">Kustomize</a> to add the required command line arguments to bootstrap your agents.</p>
<p>If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even available as a flag to the Kubernetes CLI — simply execute <code>kubectl apply -k</code>.</p>
<p>OpenTelemetry has an <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">operator</a> that does all this for you automatically (without the need for Kustomize) for Java, .NET, Python, and Node.js, and many vendors also have their own operator or <a href="https://www.elastic.co/guide/en/apm/attacher/current/apm-attacher.html">helm charts</a> that can achieve the same result.</p>
<p><strong>2. Baking APM into containers or code</strong><br />
A second option for deploying APM in Kubernetes — and indeed any containerized environment — is using Docker to bake the APM agents and configuration into a Dockerfile.</p>
<p>Have a look at an example here using the OpenTelemetry Java Agent:</p>
<pre><code class="language-dockerfile"># Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar
</code></pre>
<p><strong>3. Tracing using a service mesh (Envoy/Istio)</strong><br />
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.</p>
<p>The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply turn on the OpenTelemetry collector that exists within the proxy — therefore this is likely the lowest overhead solution. <a href="https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/opentelemetry">Learn more about this option</a>.</p>
<h3>Synthetics and universal profiling</h3>
<p>Most APM vendors have add-ons to the primary APM use cases. Typically, we see synthetics and <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> being added to APM solutions. APM can integrate with both, and there is real value in bringing these technologies together to gain even more insight into issues.</p>
<p><strong>Synthetics</strong><br />
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.</p>
<p>This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases.</p>
<p>Synthetics can be integrated with APM by injecting an APM agent into the website when the script runs, so even if you didn’t build end user monitoring into your website initially, it can be injected at run time. This usually happens without any input from the user. From there, a trace ID for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack, such as the database.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-rainbow-sandals.png" alt="observability rainbow sandals" /></p>
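<p>The trace ID hand-off described above typically travels in a standard header. As a sketch of the W3C Trace Context format (the IDs generated here are random placeholders), a <code>traceparent</code> header can be built and parsed like this:</p>

```python
import secrets

# Build a W3C Trace Context "traceparent" header: version, 16-byte
# trace ID, 8-byte parent span ID, and trace flags (01 = sampled).
def make_traceparent(trace_id: bytes, span_id: bytes, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id.hex()}-{span_id.hex()}-{flags}"

header = make_traceparent(secrets.token_bytes(16), secrets.token_bytes(8))

# A downstream layer splits the same header to continue the trace:
version, trace_id_hex, span_id_hex, flags = header.split("-")
```

<p>Each layer that receives the header keeps the trace ID and substitutes its own span ID, which is what lets a request be followed end to end.</p>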
<p><strong>Universal profiling</strong><br />
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. <a href="https://www.elastic.co/observability/universal-profiling">“Continuous profiling”</a> is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.</p>
<p>Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF</a> can allow you to see <em>all</em> the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”</p>
<p>Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together.</p>
<p>Typically, profiles are viewed as “flame graphs,” like the one shown below. Each box represents the amount of “on-CPU” time spent executing a particular function.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-universal-profiling.png" alt="observability universal profiling" /></p>
<h2>3. Analytics and AIOps</h2>
<p>The interesting thing about APM is that it opens up a whole new world of analytics compared to logs alone. All of a sudden, you have access to the information flows from <em>inside</em> applications.</p>
<p>This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or look at failed trades in a brokerage app to see how much revenue those failures are costing. You can even then apply machine learning algorithms to project future spend or detect anomalies in this data, giving you a new window into how your business runs.</p>
<p>In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss getting SLIs and SLOs set up for APM data.</p>
<h3>Getting business data into your traces</h3>
<p>There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a> and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports <a href="https://opentelemetry.io/docs/instrumentation/java/extensions/">adding extensions in its auto-instrumentation agents</a>. Most other APM vendors usually have something similar.</p>
<pre><code class="language-python">def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, counters['completion_count'])
            span.set_attribute(&quot;token_count&quot;, token_count)
            span.set_attribute(&quot;prompt_tokens&quot;, prompt_tokens)
            span.set_attribute(&quot;completion_tokens&quot;, completion_tokens)
            span.set_attribute(&quot;model&quot;, response.model)
            span.set_attribute(&quot;cost&quot;, cost)
            span.set_attribute(&quot;response&quot;, strResponse)
        return response
    return wrapper
</code></pre>
<h3>Using business data for fun and profit</h3>
<p>Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking transactions — average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-customer-count.png" alt="customer count" /></p>
<p>A lot of this is powered by machine learning, which can classify transactions or do <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">anomaly detection</a>. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-fraud-12h.png" alt="fraud 12-h" /></p>
<h3>SLIs and SLOs</h3>
<p>Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations.</p>
<p>By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.</p>
<p>In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.</p>
<ul>
<li>Define SLOs and SLIs required to track them.</li>
<li>Define SLO budgets and how they are calculated. Reflect the business’s perspective and set realistic targets.</li>
<li>Define SLIs to be measured from a user experience perspective.</li>
<li>Define different alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.</li>
</ul>
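<p>The budget arithmetic behind these steps is straightforward. As an illustrative sketch (the numbers and helper names here are made up), a 99.95% availability SLO leaves an error budget of 0.05%, and the burn can be computed directly from the SLI:</p>

```python
# Sketch of SLI / SLO error-budget arithmetic with illustrative numbers.

def error_budget(slo_target: float) -> float:
    """Allowed failure fraction for a given SLO target, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target

def sli_availability(good: int, total: int) -> float:
    """Availability SLI: fraction of good events over all events."""
    return good / total

def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget already burned (can exceed 1.0)."""
    return (1.0 - sli) / error_budget(slo_target)

# 999,000 good requests out of 1,000,000 against a 99.95% target:
sli = sli_availability(999_000, 1_000_000)
burned = budget_consumed(sli, 0.9995)  # twice the allowed budget
```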
<p>Synthetic monitoring and end user monitoring (EUM) can also help gather the additional data required to understand latency, throughput, and error rate from the user’s perspective, which is where good, business-focused metrics matter most.</p>
<h2>4. Scale and total cost of ownership</h2>
<p>With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges because you can decompose unstructured logs and combine them with traces, which leads to additional efficiency. You can also use different sampling methods to deal with scale challenges (i.e., both techniques we previously mentioned).</p>
<p>In addition to this, at large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if the systems consuming the data go down or face outages, you are less likely to lose data.</p>
<p>With this configuration in place, your “Observability pipeline” architecture would look like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-opentelemetry-collector.png" alt="opentelemetry collector" /></p>
<p>This completely decouples your sources of data from your chosen observability solution, which will future-proof your observability stack, enable you to reach massive scale, and make you less reliant on specific vendor code for data collection.</p>
<p>Another thing we recommend doing is being intelligent about instrumentation. This will serve two benefits: you will get some CPU cycles back in the instrumented application, and your backend data collection systems will have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude those classes and methods from instrumentation.</p>
<p>And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage.</p>
<p>This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.</p>
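<p>A back-of-the-envelope calculation shows why tiering pays off; the per-GB prices below are made up purely for illustration:</p>

```python
# Illustrative only: compare monthly storage cost with and without
# tiering, using hypothetical per-GB prices.

HOT_PRICE_GB = 0.10   # hypothetical $/GB/month for fast storage
COLD_PRICE_GB = 0.02  # hypothetical $/GB/month for cheap storage

def monthly_cost(total_gb: float, cold_fraction: float) -> float:
    """Cost when cold_fraction of the data sits in the cheap tier."""
    hot_gb = total_gb * (1.0 - cold_fraction)
    cold_gb = total_gb * cold_fraction
    return hot_gb * HOT_PRICE_GB + cold_gb * COLD_PRICE_GB

single_tier = monthly_cost(10_000, 0.0)  # everything on fast storage
tiered = monthly_cost(10_000, 0.8)       # 80% moved to the cheap tier
```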
<h2>Conclusion</h2>
<p>In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.</p>
<p>In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.</p>
<p>By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.</p>
<p><a href="https://www.elastic.co/observability/application-performance-monitoring">Learn more about APM at Elastic</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/log-management-720x420_(2).jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available in the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic, we heavily invest in making OpenTelemetry the best, standardized ingest solution for Observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk you through a hands-on journey using the EDOT Collector covering various use cases you might encounter in the real world, highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here since, by the nature of this feature, it is minimal,
letting workloads define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT’s Collector configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e. set to <code>false</code>) to avoid log duplication.</p>
<p>Make sure that the receiver creator is properly added in the pipelines for
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a>
(in addition to removing the <code>filelog</code> receiver completely)
and <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a>
respectively.</p>
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector will automatically discover it and identify the specific annotations.
After this, two different receivers will be started, each responsible for one of the target containers.</p>
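<p>To see how the per-port scoping resolves, here is a hypothetical sketch (not the Collector's actual code) of grouping such annotations by signal and scope:</p>

```python
# Hypothetical sketch: group discovery annotations of the form
# "io.opentelemetry.discovery.<signal>.<scope>/<suffix>" so that each
# (signal, scope) pair gets its own configuration bundle. The real
# logic lives in the OpenTelemetry Collector's receiver creator.

PREFIX = "io.opentelemetry.discovery."

def group_hints(annotations: dict[str, str]) -> dict[tuple[str, str], dict[str, str]]:
    hints: dict[tuple[str, str], dict[str, str]] = {}
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue
        rest = key[len(PREFIX):]               # e.g. "metrics.6379/enabled"
        scope_part, _, suffix = rest.partition("/")
        signal, _, scope = scope_part.partition(".")
        hints.setdefault((signal, scope), {})[suffix] = value
    return hints

annotations = {
    "io.opentelemetry.discovery.metrics.6379/enabled": "true",
    "io.opentelemetry.discovery.metrics.6379/scraper": "redis",
    "io.opentelemetry.discovery.metrics.80/enabled": "true",
    "io.opentelemetry.discovery.metrics.80/scraper": "nginx",
}
hints = group_hints(annotations)  # one bundle per container port
```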
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration.
This is handy when we know how to parse specific technology logs, such as Apache server access logs.</p>
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations to the port and container levels to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
<h3>Explore and Analyze Data from Dynamic Targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can explore this data in Elastic. In Discover, we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is how it looks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
This makes us happy and confident, as it closes the feature gap between Elastic's own
monitoring agents and the OpenTelemetry Collector, making the Collector even better.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Troubleshooting Kafka-Logstash-Elasticsearch Performance Issues in delay-sensitive platforms]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kafka-logstash-elasticsearch-performance-issues</link>
            <guid isPermaLink="false">kafka-logstash-elasticsearch-performance-issues</guid>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to troubleshoot ingestion bottlenecks in data pipelines built with Kafka, Logstash and Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>Kafka is an open-source, distributed event streaming and queuing platform widely used with Elastic to build high-throughput, large-scale data pipelines, facilitate seamless data integration, and support mission-critical applications. Designs that include Kafka decouple the components of the data pipeline, ensuring scalability and robustness to failure by absorbing downstream back-pressure during traffic surges, maintenance activities, or other periods of performance degradation. </p>
<p>In addition to its queuing capabilities, Kafka can serve as a central processing middleware for data pre-processing and enrichment. This is particularly useful when such operations are impractical to perform directly downstream due to specific business or technical requirements or constraints.</p>
<p>For instance, integrating Kafka with stream processing engines like <a href="https://ksqldb.io/">KsqlDB</a> or <a href="https://materialize.com/">Materialize</a>, allows for advanced stream processing tasks, including SQL-based joins across topics and streams to enrich data at scale in real-time. The enriched datasets can then be ingested into Elasticsearch for further processing at subsequent stages.</p>
<p>Despite these benefits, adopting Kafka or similar queuing systems is arguably conditional. These systems introduce additional costs and complexity to the overall platform implementation and maintenance. They may also add processing overhead, delay data flow to the downstream, and risk becoming bottlenecks if not correctly sized or optimized to align with other pipeline components.</p>
<p>This article provides guidance for troubleshooting ingestion bottlenecks in data pipelines built with Kafka and Elastic. Identifying and fixing such issues can sometimes be challenging, particularly when changes are made across multiple parts of the system at the same time, which increases the number of variables in play and commonly leads to a longer process and inconsistent results.</p>
<p>Consider the below Security Operations Center (SOC) platform, where data is ingested from various sources via Elastic Agent. The data is queued and pre-processed in a Kafka cluster before being pulled by Logstash and forwarded to Elastic Security. In this environment, delays at any stage of the pipeline can result in critical security events going undetected by Elastic Security, emphasizing the importance of a well-optimized data pipeline.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image5.png" alt="" /></p>
<h2>Implement lag and throughput monitoring</h2>
<p>Ingestion bottlenecks usually materialize as limited throughput and event lags, which often correlate. Monitoring these two indicators is important to measure the impact of tuning attempts. </p>
<p><strong><em>Tip</em></strong><em>: With the anomaly detection features of machine learning you can use the</em> <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html"><em>Logs Anomalies page</em></a> <em>to detect and inspect log anomalies and the log partitions where the log anomalies occur.</em></p>
<p>End-to-end lag monitoring can be broken down into the various stages of the pipeline. The incremental improvements across those stages would collectively contribute to a significant reduction in the end-to-end lag:</p>
<p><strong>A) Ingest lag between the source and Kafka:</strong> This lag is the time difference between the real event-time, which is typically extracted from the event itself or added by the event producer (Elastic Agent for example), and the Kafka record timestamp, which can be added to the Logstash events via event <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-decorate_events">decoration</a> in the Kafka input plugin. </p>
<p>In most cases, this lag is influenced by the write performance of the Kafka cluster and network latency between the event source and Kafka. In some cases, the lag may also appear due to time configuration mismatches that make it look like there's a lag when there really isn't.</p>
<p><strong>B) Ingest lag between Kafka and Logstash:</strong> This lag is the time difference between the Kafka record's timestamp and the execution timestamp of the first filter in the Logstash pipeline. If your pipelines are using a persistent queue, note that this duration also includes the time spent in the PQ.</p>
<p>The Ruby filter below adds the current time to the event in the `logstash.start` field, to be used for comparison later.</p>
<pre><code>ruby {
 code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
}
</code></pre>
<p>The primary factors contributing to this ingestion lag include the consumption performance of the Kafka cluster, the Logstash input performance, data skew across the topic partitions, and, most importantly, backpressure propagating to the Logstash input plugin: while Logstash is busy processing the events it has already fetched, it does not fetch new events from the Kafka topic as quickly as they become available. </p>
<p>Network latency and reduced size of TCP read buffer (<a href="https://man7.org/linux/man-pages/man7/tcp.7.html">SO_RCVBUF</a>) on the Logstash host can also throttle Logstash from fetching the data from Kafka at the required rate.</p>
<p>Consumer lag serves as an effective indicator of this issue and can be viewed on Kafka's consumer group metrics. It is calculated as the difference between the log-end offset (the offset of the most recently produced message) and the current offset (the last committed offset by the consumer) for each partition.</p>
<pre><code>$KAFKA_HOME/bin/kafka-consumer-groups.sh  --bootstrap-server &lt;server:port&gt; --describe --group &lt;group_id&gt;
</code></pre>
<pre><code>GROUP                 TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET   LAG
logstash-cg-soc-1     windows-events    0          4498            17309            12811
logstash-cg-soc-1     windows-events    1          4470            17213            12743
...
</code></pre>
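<p>The lag formula above is a simple per-partition subtraction. A minimal sketch, using the illustrative offsets from the output above:</p>

```python
# Sketch: consumer lag per partition, computed exactly as in the
# kafka-consumer-groups.sh output above (illustrative values).
partitions = [
    {'partition': 0, 'current_offset': 4498, 'log_end_offset': 17309},
    {'partition': 1, 'current_offset': 4470, 'log_end_offset': 17213},
]

for p in partitions:
    # LAG = LOG-END-OFFSET - CURRENT-OFFSET
    p['lag'] = p['log_end_offset'] - p['current_offset']

# Summing across partitions gives the consumer group's total backlog.
total_lag = sum(p['lag'] for p in partitions)
print(total_lag)  # 25554
```

<p>Tracking this total over time distinguishes a transient backlog (lag shrinks once a surge passes) from a structural bottleneck (lag grows monotonically).</p>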
<p><strong>C) Ingest lag in the Logstash processing:</strong> This lag is the time difference between the first and last Logstash filters. To calculate it, add a filter at the end of the pipeline that records the `logstash.end` timestamp in the same way the `logstash.start` field was added before. The primary factors contributing to this lag are the efficiency of the filters (primarily the complexity and optimization of the transformations they perform), network access to external services for data loading, a limited <a href="https://www.elastic.co/guide/en/logstash/current/logstash-settings-file.html">number of pipeline workers and a small batch size</a>, and the amount of resources available to Logstash, particularly when running in virtual environments with resource contention.</p>
<p><strong>D) Ingest lag between Logstash and Elasticsearch:</strong> This lag is the time difference between the last applied Logstash filter in the pipeline and the timestamp when the event is ingested in Elasticsearch. The ECS field <a href="https://www.elastic.co/guide/en/ecs/current/ecs-event.html#field-event-ingested"><code>event.ingested</code></a> is automatically added by the Elastic integrations to record this value. For custom sources, the field should be added via an ingest pipeline:</p>
<pre><code>{
    &quot;processors&quot;: [
      {
        &quot;set&quot;: {
          &quot;field&quot;: &quot;event.ingested&quot;,
          &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;
        }
      }
…
</code></pre>
<p>If the data is undergoing heavy processing in Elasticsearch before indexing, it also pays to analyze the performance of each ingest processor in the pipeline to pinpoint and optimize the heaviest ones. <a href="https://github.com/elastic/integrations/pull/4597">Ingest pipelines monitoring dashboard</a> can help streamline this process.</p>
<p>The primary factors contributing to this phase’s lag are usually the Logstash output configuration like a small number of pipeline workers and batch size, slow indexing actions (like upserts), network latency, and how fast the Elasticsearch cluster can run the ingest pipelines and index the data. You can find more techniques about this last point <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed">here</a>.</p>
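<p>Taken together, the four stage lags (A through D) partition the end-to-end lag, so the per-stage numbers must sum to the total. A minimal sketch with hypothetical timestamps:</p>

```python
from datetime import datetime, timezone

# Hypothetical timestamps recorded at each stage of the pipeline (A-D above).
ts = {
    'event_time':      datetime(2026, 3, 11, 12, 0, 0, tzinfo=timezone.utc),
    'kafka_timestamp': datetime(2026, 3, 11, 12, 0, 2, tzinfo=timezone.utc),
    'logstash_start':  datetime(2026, 3, 11, 12, 0, 7, tzinfo=timezone.utc),
    'logstash_end':    datetime(2026, 3, 11, 12, 0, 8, tzinfo=timezone.utc),
    'event_ingested':  datetime(2026, 3, 11, 12, 0, 9, tzinfo=timezone.utc),
}

def millis(a, b):
    # Milliseconds elapsed between two stage timestamps.
    return int((ts[b] - ts[a]).total_seconds() * 1000)

lag = {
    'src_kfk':   millis('event_time', 'kafka_timestamp'),      # stage A
    'kfk_ls':    millis('kafka_timestamp', 'logstash_start'),  # stage B
    'within_ls': millis('logstash_start', 'logstash_end'),     # stage C
    'ls_es':     millis('logstash_end', 'event_ingested'),     # stage D
}
lag['end_end'] = millis('event_time', 'event_ingested')

# The stage lags sum to the end-to-end lag, so the largest term
# points directly at the most throttled phase.
assert sum(v for k, v in lag.items() if k != 'end_end') == lag['end_end']
print(lag['end_end'])  # 9000
```

<p>In this hypothetical breakdown, stage B dominates, which would point the investigation at Kafka consumption and input backpressure rather than at the filters or the Elasticsearch output.</p>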
<p>Visualizing these stages in Kibana helps identify the most throttled areas and analyze the impact of various parameter adjustments across the entire data pipeline during the tuning process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image4.png" alt="" /></p>
<h2>Isolate and fix the bottleneck</h2>
<p>Identifying the source of the bottleneck can be challenging without a systematic approach to isolating the behavior of each component and stage of the pipeline. To make the investigation approach more consistent, it is important to keep the source data consistent as well. One approach can be to use a dedicated topic with a replicated production workload, and repeat the test using different consumer groups.</p>
<p>Below is a set of benchmarks that can be run while monitoring the event lag and the pipeline throughput. The best results achieved in each tuning exercise can be used as the basis for the next one.</p>
<h2>First benchmark: Kafka input, no filters, null output</h2>
<p>This benchmark is aimed at assessing the throughput of the Kafka input in isolation, excluding the downstream impacts of the Logstash filters and outputs. Use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-outputs-sink.html">sink</a> plugin in the output section to discard the events without incurring IO overhead and get a theoretical maximum reading speed.</p>
<p>This test is better performed with and without a <a href="https://www.elastic.co/guide/en/logstash/current/persistent-queues.html#persistent-queues-architecture">persistence queue</a> to isolate the additional overhead at this stage. </p>
<p>It is helpful to use a unique consumer group_id for this test instead of the default `logstash`. Otherwise, this null-output pipeline might consume and drop events that should be processed by other pipelines.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
  sink { }
}
</code></pre>
<p>If the throughput from this test closely matches that of the original pipeline, the valve is most likely closed upstream: consuming the events from Kafka is itself the bottleneck. </p>
<p>Note that the maximum throughput is significantly impacted by the Kafka cluster's ability to handle consumer requests and by network latency. It is also bounded by the rate of events flowing into the Kafka topic once the consumer group has caught up with the topic.</p>
<p>A few things might be considered in this exercise: </p>
<ul>
<li>
<p><strong>Match consumers count to partitions count:</strong> Ideally, the total number of <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-consumer_threads">consumer threads</a> across all the pipelines that share the same consumer group_id, should be equal to the number of topic partitions for a perfect balance. Each Kafka topic-partition can be assigned to at-most one consumer within a consumer group at a time. So if you have more consumer threads than your topic partitions, some of those threads will not be assigned a partition. Partition-replicas do not count, as consumer threads consume messages from the leader partitions, not directly from replicas. Exceeding 1:1 ratio may also introduce unnecessary computational overhead in Logstash without any gains in read throughput. Incrementally increasing the partition count in the topic can potentially improve the throughput. Kafka 4.0 introduces early access to <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">KIP932</a>, which bypasses this 1:1 mapping requirement using share groups implementing a queuing semantic to the consumption model. The Share Groups are not supported in Logstash yet.</p>
</li>
<li>
<p><strong>Tune the input parameters for maximum throughput:</strong> Increasing <code>max.poll.records</code>, <code>fetch.max.bytes</code>, and <code>receive.buffer.bytes</code> can enhance performance. The TCP read buffer size is rarely the issue, but when it is, its impact is significant; note that this setting is bounded by the kernel's <code>net.core.rmem_max</code> value.</p>
</li>
<li>
<p><strong>Use fast disks with enough space if using persistent queues:</strong> The queue sits between the input and filter stages in the same process. The I/O performance of the storage directly impacts the input throughput. When the queue is full, Logstash puts back pressure on the inputs to stall the data flow.</p>
</li>
</ul>
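<p>The 1:1 assignment constraint from the first bullet can be sketched with a simplified round-robin model (real Kafka assignors such as range or cooperative-sticky differ in detail, but the at-most-one-consumer-per-partition rule holds): with P partitions and T consumer threads in one group, at most min(T, P) threads receive work.</p>

```python
# Simplified model of partition assignment within one consumer group:
# each partition goes to exactly one thread, so any threads beyond the
# partition count sit idle.
def assign(partitions, threads):
    assignment = {t: [] for t in range(threads)}
    for p in range(partitions):
        assignment[p % threads].append(p)
    return assignment

# 4 partitions, 6 consumer threads: threads 4 and 5 get nothing.
a = assign(partitions=4, threads=6)
active = sum(1 for parts in a.values() if parts)
print(active)  # 4
```

<p>This is why adding consumer threads beyond the partition count yields no read-throughput gain; increasing the partition count is what raises the parallelism ceiling.</p>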
<h2>Second benchmark: Kafka input, filters, no outputs</h2>
<p>This benchmark helps measure the impact of the filters on the input throughput using the best achieved input configuration from the first exercise. It quantifies the throttling effect on the input stream only caused by the events processing. Note that <a href="https://www.elastic.co/guide/en/logstash/current/lookup-enrichment.html">some filter plugins</a> are also IO-bound, like the plugins that use the network to enrich the events.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
...
}
output {
}
</code></pre>
<p>To increase the number of events processed simultaneously by the filters, try increasing the number of pipeline workers and the pipeline batch size, particularly if the pipeline <code>worker_utilization</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metric</a> is near 100 and Logstash is not consuming all available CPU. Increasing the number of workers <a href="https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html">past the number of available processors</a> can also yield better results, as some filter plugins may spend significant time in an I/O wait state, such as during external lookups. This also makes more efficient use of the Logstash host resources. </p>
<p>Optimizing the pipeline filters is the most effective approach to resolving this bottleneck: it can significantly reduce latency and increase throughput regardless of the pipeline's input configuration and Logstash resources. The per-plugin <code>worker_utilization</code> and <code>worker_millis_per_event</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metrics</a> are very useful for identifying where most of the resources are being spent, and consequently where optimization efforts should focus first. Some general best practices that usually yield improvements are using <a href="https://www.elastic.co/blog/do-you-grok-grok">anchors</a> in Grok patterns, switching to faster plugins like <a href="https://www.elastic.co/blog/logstash-dude-wheres-my-chainsaw-i-need-to-dissect-my-logs">dissect</a> whenever possible, optimizing Ruby filter code, eliminating unnecessary parsing, and improving network-based enrichments. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image3.png" alt="" />
<em>Source: <a href="https://www.elastic.co/blog/do-you-grok-grok">do you grok</a></em></p>
<p>In some cases, optimizing the pipeline may require a complete redesign of the ingestion workflow or the pipeline itself!</p>
<h2>Third benchmark: Kafka input, no filters, Elasticsearch output</h2>
<p>This benchmark helps quantify the throttling effect of the Elasticsearch output on the input throughput. The test can be divided into two phases: the first phase uses raw logs to isolate the impact of Elasticsearch indexing, while the second phase assesses the impact of ingest pipelines.</p>
<p><em>In case a pipeline is using multiple outputs, note that,</em> <a href="https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html#output-isolator-pattern"><em>by default</em></a><em>, a pipeline is blocked if any single output is blocked. This behavior is important in guaranteeing at-least-once delivery of data, but can cause the outputs to perform at the rate of the most clogged one.</em></p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
 elasticsearch {
   ...
 }
}
</code></pre>
<p>To increase throughput, consider progressively increasing the number of pipeline workers and the pipeline batch size. The prior guidance about the <code>worker_utilization</code> flow metric applies here too, although CPU availability plays a smaller role since this output is mostly IO-bound. Also watch the Elasticsearch output's rejection rates (e.g., response code 429, <code>es_rejected_execution_exception</code>, indicating explicit back-pressure) as a signal that the Elasticsearch cluster is busy processing other batches.</p>
<p>The Logstash output tries to send batches of events to the Elasticsearch Bulk API in a single request. However, if a batch exceeds 20 MB, the plugin splits it into multiple bulk requests. </p>
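<p>The splitting behavior can be illustrated with a greedy chunking sketch. This is not the plugin's exact implementation, just the idea: events accumulate into a bulk request until adding one more would exceed the 20 MB limit, at which point a new request is started.</p>

```python
# Illustrative sketch (not the plugin's exact logic): greedily split a
# batch of serialized event sizes into bulk requests of at most 20 MB each.
LIMIT = 20 * 1024 * 1024  # 20 MB

def split_batch(event_sizes, limit=LIMIT):
    requests, current, current_size = [], [], 0
    for size in event_sizes:
        # Start a new bulk request when the next event would overflow it.
        if current and current_size + size > limit:
            requests.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        requests.append(current)
    return requests

# 30 events of ~1 MB each end up split across two bulk requests.
reqs = split_batch([1_000_000] * 30)
print(len(reqs))  # 2
```

<p>This is also why proxy payload limits matter: each resulting bulk request can be close to 20 MB on the wire.</p>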
<p>If the Elasticsearch cluster is behind a proxy or API gateway, it's important to adjust the proxy limits to allow Logstash requests with large payloads to pass through to the Elasticsearch cluster. By default, most proxy servers have a much smaller maximum size for HTTP request payloads, which should be tuned in this case to accommodate larger requests. To identify potential issues, look for error code 413 in your proxy logs, as this indicates that the size of the Logstash request has exceeded the maximum payload size the proxy is configured to handle.</p>
<p>On the Elasticsearch cluster, tune your ingest pipelines' efficiency following the same general best practices discussed above for the Logstash pipelines. Also, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">tune for indexing speed</a> by using faster hardware, fewer index refreshes, and auto-generated IDs, and consider increasing the number of primary shards to enhance indexing parallelism if you have multiple nodes. Beware that excessively increasing this number can negatively impact search performance.</p>
<p>Finally, keep in mind that the Elasticsearch output plugin is mostly IO-bound, which means that your network latency and bandwidth significantly reduce the rate at which data is transferred and hence your output throughput.</p>
<h2>Reassemble your pipeline</h2>
<p>After tuning the pipeline in each of the previous phases separately, put all the parts back together to assess the real throughput and latency of the reassembled pipeline. At this point you should have reached the best performance a single Logstash host can deliver, and you can progressively add more instances to reach the target latency and throughput for a specific topic or data source. </p>
<h2>Example</h2>
<p>Below is an example of the configuration required on Logstash and Elasticsearch to implement the architecture above.</p>
<p>Logstash pipeline:</p>
<pre><code>input {
 kafka {
   bootstrap_servers =&gt; &quot;&lt;server&gt;:&lt;port&gt;&quot;
   topics =&gt; [&quot;&lt;topic-id&gt;&quot;]
   group_id =&gt; &quot;&lt;consumer-group-id&gt;&quot;
   decorate_events =&gt; &quot;extended&quot;
   auto_offset_reset =&gt; &quot;earliest&quot;
   codec =&gt; json {
   }
 }
}


filter {
 ruby {
   code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
 }


 mutate {
   add_field =&gt; {
     &quot;[kafka][timestamp]&quot; =&gt; &quot;%{[@metadata][kafka][timestamp]}&quot;
     &quot;[kafka][offset]&quot; =&gt; &quot;%{[@metadata][kafka][offset]}&quot;
     &quot;[kafka][consumer_group]&quot; =&gt; &quot;%{[@metadata][kafka][consumer_group]}&quot;
     &quot;[kafka][topic]&quot; =&gt; &quot;%{[@metadata][kafka][topic]}&quot;
   }
 }


 date {
   match =&gt; [&quot;[kafka][timestamp]&quot;, &quot;UNIX&quot;, &quot;UNIX_MS&quot;]
   target =&gt; &quot;[kafka][timestamp]&quot;
 }
 ...
 ruby {
   code =&gt; &quot;event.set('[logstash][end]', Time.now());&quot;
 }
}


output {
 elasticsearch {
   hosts =&gt; &quot;hosts&quot;
   api_key =&gt; &quot;api_key&quot;
   data_stream =&gt; true
   ssl =&gt; true
 }
}
</code></pre>
<p>Create an ingest pipeline for the lag calculation. Note that when using Elastic integrations, the ECS fields <code>*.end</code>, <code>*.start</code>, and <code>*.timestamp</code> are automatically mapped as dates.</p>
<pre><code>PUT _ingest/pipeline/calculate_ingest_lag
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.ingested&quot;,
        &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;if&quot;: &quot;ctx['@timestamp'] != null &amp;&amp; ctx?.kafka?.timestamp != null &amp;&amp; ctx?.logstash?.start != null &amp;&amp; ctx?.logstash?.end != null &amp;&amp; ctx?.event?.ingested != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot; 
  ctx.lag_in_millis = [:];
              ctx.lag_in_millis.src_kfk = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['kafka']['timestamp'])).toMillis(); 
              ctx.lag_in_millis.kfk_ls = Duration.between(ZonedDateTime.parse(ctx['kafka']['timestamp']), ZonedDateTime.parse(ctx['logstash']['start'])).toMillis();
              ctx.lag_in_millis.within_ls  = Duration.between(ZonedDateTime.parse(ctx['logstash']['start']), ZonedDateTime.parse(ctx['logstash']['end'])).toMillis();
              ctx.lag_in_millis.ls_es = Duration.between(ZonedDateTime.parse(ctx['logstash']['end']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis(); 
              ctx.lag_in_millis.end_end = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis();     
        &quot;&quot;&quot;
      }
    }
  ]
}
</code></pre>
<p>Use the pipeline to add the lag calculation to your Elastic integrations:</p>
<pre><code>PUT _ingest/pipeline/logs-system.integration@custom
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;calculate_ingest_lag&quot;,
        &quot;ignore_missing_pipeline&quot;: true,
        &quot;description&quot;: &quot;add ingest lag calculation to elastic_agent integration&quot;
      }
    }
  ]
}
</code></pre>
<h2>Kibana Dashboard and Alerts</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image6.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image2.png" alt="" /></p>
<p>Using the metrics mentioned above along with the <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html">Log Rate ML job</a>, you can set up <a href="https://www.elastic.co/guide/en/kibana/current/rule-types.html#observability-rules">Kibana alerts</a> that trigger on anomalous changes in throughput or delays, or simply when delays exceed defined thresholds.</p>
<h2>Time to try it out</h2>
<p>Start your <a href="https://cloud.elastic.co/registration?elektra=whats-new-elastic-7-14-blog">free 14-day trial of Elastic Cloud</a> to experience the latest version of <a href="https://www.elastic.co/security">Elastic</a>. Also, make sure to take advantage of the Elastic threat detection <a href="https://www.elastic.co/training/elastic-security-quick-start">training</a> to set yourself up for success.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/cover-resized.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Turn Dashboards Into an Investigation Tool with ES|QL Variable Controls]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kibana-dashboard-esql-variable-controls</link>
            <guid isPermaLink="false">kibana-dashboard-esql-variable-controls</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to use ES|QL variables in Kibana to turn a dashboard into an investigation tool, applying value and structure controls to uncover problems.]]></description>
            <content:encoded><![CDATA[<p>Static dashboards are useful until the first incident, where the default view hides the signal you need. ES|QL variable controls on a Kibana dashboard make it possible to go from a healthy-looking fleet overview to a clear root cause without editing a single query.</p>
<p>In this blog, we’ll show how these ES|QL variable controls turn dashboards into interactive investigation tools, and how to set them up to uncover problems that averages were hiding. By selecting a value in a control, every panel using that variable adapts.</p>
<h2>The dashboard</h2>
<p>This is a custom &quot;Infrastructure Overview&quot; dashboard monitoring 10 hosts across 3 AWS regions using OpenTelemetry host metrics. It has four line charts (CPU, memory, disk, load average) and ES|QL variable controls at the top.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/1-default-view.png" alt="Default dashboard view showing healthy fleet metrics aggregated by region with ES|QL variable controls visible at the top" /></p>
<p>With the default dashboard controls (AVG aggregation, region breakdown, 15-minute buckets, all hosts selected), everything looks healthy. Smooth diurnal cycles across all three regions.</p>
<p>But there is a problem hiding in this view.</p>
<h2>The problem with fixed queries</h2>
<p>A fixed chart query hardcodes decisions that need to change during an investigation:</p>
<ul>
<li>The aggregation function (AVG, MAX, MIN, MEDIAN)</li>
<li>The dimension used to slice the data (host, region, availability zone)</li>
<li>Which hosts are included or excluded</li>
<li>The time bucket interval (1m, 5m, 15m, 1h)</li>
</ul>
<p>With those baked in, every change means editing queries across multiple panels.</p>
<h2>ES|QL variable controls</h2>
<p>ES|QL variable controls inject user-selected values into queries at runtime. Two types:</p>
<ul>
<li><strong>Value controls</strong> (<code>?variable</code>): replace a value in the query, such as a time interval or a list of hostnames</li>
<li><strong>Structure controls</strong> (<code>??variable</code>): replace a function name or field name, such as the aggregation function or the dimension used to slice data</li>
</ul>
<p>One query pattern, reused across all panels.</p>
<h2>The query</h2>
<p>The original static CPU query looks like this:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != &quot;idle&quot;
| STATS AVG(system.cpu.utilization)
  BY BUCKET(@timestamp, 1 minute), resource.attributes.host.name
</code></pre>
<p>To adapt this query to use variable controls, each hardcoded part has to be replaced with a variable. The aggregation function, the time bucket, and the breakdown dimension are straightforward replacements. The hostname filter requires one extra step because we want the control to allow selecting multiple hosts at once, and filtering by a single value only matches one host at a time. <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions/mv_contains"><code>MV_CONTAINS</code></a> checks whether a value exists inside a multi-value list, so <code>MV_CONTAINS(?hostname, resource.attributes.host.name)</code> returns true if the field contains any of the selected values in the control.</p>
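<p>A minimal sketch of how <code>MV_CONTAINS</code> behaves on its own (the hostnames here are made up):</p>
<pre><code class="language-esql">ROW selected = [&quot;db-01&quot;, &quot;web-02&quot;]
| EVAL hit = MV_CONTAINS(selected, &quot;db-01&quot;)     // true: the list contains the value
| EVAL miss = MV_CONTAINS(selected, &quot;cache-03&quot;) // false: the value is not in the list
</code></pre>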
<p>After replacing each part, the query becomes:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != &quot;idle&quot;
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.cpu.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/5-esql.png" alt="ES|QL query with variable placeholders visible in the Lens editor" /></p>
<p>The same pattern applies to all four panels (CPU, Memory, Disk, Load). Changing any control updates every panel at once.</p>
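<p>As a sketch, the Memory panel could reuse the same shape; the metric name <code>system.memory.utilization</code> is our assumption here, since the blog only shows the CPU query:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.memory.utilization IS NOT NULL
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.memory.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown
</code></pre>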
<h2>The controls</h2>
<ul>
<li>
<p><strong>Hostname</strong> (<code>?hostname</code>): Filters to the hosts selected in the control. Configured as &quot;Values from a query&quot; with multi-select enabled. It runs an ES|QL query that returns available host names, and <code>MV_CONTAINS</code> in the chart queries enables selecting more than one.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/6-host-control-config-small.png" alt="Host control configuration showing Values from a query settings and the ES|QL query that populates the control" /></p>
</li>
<li>
<p><strong>Aggregation</strong> (<code>??aggregation</code>): Swaps the aggregation function. Static values control with <code>AVG</code>, <code>MAX</code>, <code>MIN</code>, <code>MEDIAN</code>.</p>
</li>
<li>
<p><strong>Time interval</strong> (<code>?interval</code>): Controls the time bucket size. Static values control with <code>1 minute</code>, <code>5 minutes</code>, <code>15 minutes</code>, <code>1 hour</code>.</p>
</li>
<li>
<p><strong>Breakdown</strong> (<code>??breakdown</code>): Swaps the dimension used to slice the data. Static values control with <code>resource.attributes.host.name</code>, <code>resource.attributes.cloud.region</code>, <code>resource.attributes.cloud.availability_zone</code>.</p>
</li>
</ul>
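<p>The &quot;Values from a query&quot; option behind the hostname control runs a short ES|QL query to populate the list. A sketch of what it might look like (the exact query in the screenshot is not reproduced here):</p>
<pre><code class="language-esql">FROM metrics-hostmetricsreceiver.otel-default
| STATS BY resource.attributes.host.name
</code></pre>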
<h2>The investigation</h2>
<p>The dashboard opens with AVG aggregation, region breakdown, 15-minute buckets, and all hosts selected. Nothing looks wrong. The first change is switching the aggregation from AVG to MAX and the time interval to 1 minute. A bump immediately appears in <code>us-east-1</code> around March 7, reaching roughly 68% where the normal peak sits around 57%. The average was hiding this because one host's intermittent spikes get averaged across the five hosts in the region.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/2-aggregation-max.png" alt="Dashboard after switching to MAX aggregation and 1-minute interval, showing a visible bump in us-east-1 on March 7" /></p>
<p>Next, switching the breakdown from region to host makes the culprit clear. <code>db-01</code> stands out with spikes to 65-70% while its normal baseline sits around 24%. Every other host follows its expected pattern.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/3-breakdown-host.png" alt="Host-level breakdown revealing db-01 with clear CPU spikes" /></p>
<p>Setting the hostname control to <code>db-01</code> alone isolates the incident: intermittent CPU bursts, not sustained saturation. Memory climbs from 85% to 93%, Load from 2.4 to 3.0, Disk from 67% to 73%. All four panels corroborate a 4-hour event window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/4-db01-filtered.png" alt="Dashboard filtered to db-01 only, all four panels showing correlated anomalies during the incident window" /></p>
<h2>Why structure your queries with variable controls</h2>
<p>A dashboard built with variable controls supports investigation paths that did not exist when the dashboard was built. Without them, every dashboard is a frozen perspective chosen at build time. When an incident does not match that perspective, someone has to edit queries or build a new dashboard under pressure. With controls, the panels adapt.</p>
<p>Value controls like <code>?hostname</code> and <code>?interval</code> handle what you filter and define the granularity of the data. Structure controls like <code>??aggregation</code> and <code>??breakdown</code> handle how you aggregate and how you slice. Panels sharing one query pattern means a fix or improvement applies everywhere, and a new investigation path is a single value added to a control. Together they turn a static dashboard into an investigation surface.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Kibana: How to create impactful visualizations with magic formulas? (part 1)]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kibana-impactful-visualizations-with-magic-formulas-part1</link>
            <guid isPermaLink="false">kibana-impactful-visualizations-with-magic-formulas-part1</guid>
            <pubDate>Mon, 09 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We will see how magic math formulas in the Kibana Lens editor can help to highlight high values.]]></description>
            <content:encoded><![CDATA[<h2>Kibana: How to create impactful visualizations with magic formulas? (part 1)</h2>
<h3>Introduction</h3>
<p>In the previous blog post, <a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing Intuitive Kibana Dashboards as a non-designer</a>, we highlighted the importance of creating intuitive dashboards. It demonstrated how simple changes (grouping themes, changing chart types, and more) can make a difference in understanding your data. When delivering courses like <a href="https://www.elastic.co/training/data-analysis-with-kibana">Data Analysis with Kibana</a> or <a href="https://www.elastic.co/training/elastic-observability-engineer">Elastic Observability Engineer</a>, we point to this blog post and how these changes help bring essential information to the surface. I like a complementary approach to reach this goal: using two colors to separate the highest data values from the common ones.</p>
<p>To illustrate this idea, we will use the <em>Sample flight data</em> dataset. Now, let’s compare two visualizations ranking the top 10 destination countries per total number of flights. Which visualization has a higher impact?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro.png" alt="Flights: Top 10 destinations" /></p>
<p>If you chose the second one, you may be wondering how this was done with the Kibana Lens editor. While preparing for the certification last year, I found a way to achieve this result. The secret is using two different layers and some magic formulas. This post will explain how math in Lens formulas helps create two data-color visualizations.</p>
<p>We will start with the first example that emphasizes only the highest value of the dataset we are focusing on. The second example describes how to highlight other high values (as shown in the illustration above).</p>
<p><em>[Note: the tips explained in this blog post can be applied from v7.15.]</em></p>
<h2>Only the highest value<a id="only-the-highest-value"></a></h2>
<p>To understand how math helps to separate high values from common ones, let’s start with this first example: emphasizing only the highest value.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-teaser.png" alt="1.1 flights: " /></p>
<p>We start with a bar horizontal chart:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-bar-horizontal-setup.png" alt="1.1 flights: Lens bar horizontal chart" /></p>
<p>We need to identify the highest value in the scope we are currently examining. For this, we use one of the <strong>overall_*</strong> functions: <strong>overall_max()</strong>, a pipeline function (equivalent to a pipeline aggregation in Query DSL). </p>
<p>In our example, we group the flights by destination country. This means we count the number of flights for each DestCountry (one bucket per country). <strong>overall_max()</strong> then selects the bucket with the highest value. </p>
<p>The math trick here is to divide the number of flights per bucket by the maximum value found among all buckets. Only one bucket will return 1: the bucket matching the max value found by overall_max(). All the other buckets will return a value between 0 and 1. We use <strong>floor()</strong> to ensure any 0.xxx value is rounded down to 0. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-floor.png" alt="1.1 flights: explaining floor()" /></p>
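<p>A quick worked example with made-up bucket counts, where the maximum bucket holds 2371 flights:</p>
<pre><code>DestCountry  count()  count()/overall_max(count())  floor(...)
IT           2371     1.00                          1
US           1987     0.84                          0
CN           1096     0.46                          0
</code></pre>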
<p>Now, we can multiply it by count() and we have our formula for the first layer!</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p>From here, in the Lens editor, we duplicate the layer and adjust the formula of the second layer, which contains the rest of the data. We subtract the first formula from count(). This is the other trick: in this layer, we just need to ensure the highest value is not represented, which happens exactly once, when count() = overall_max(count()) and the division returns 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-layer1-and-layer2.png" alt="1.1 flights: layer 1 + layer 2" /></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
<p>To achieve a nice merge of these two layers, we need to do the following adjustments in both:</p>
<ul>
<li>
<p>select <strong>bar horizontal stacked</strong></p>
</li>
<li>
<p>Vertical axis: change “Rank by” to Custom and ensure the Rank function is “Count”</p>
</li>
</ul>
<p>Here is the final setup of the two layers:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-final-2layers-setup.png" alt="1.1 flights: 2layers setup" /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
<p>This visualization also works well for time series data where you need to quickly highlight which time period (12h in the example below) had the highest number of flights:<br />
<img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-timeserie-example.png" alt="1.1 flights: timeseries example" /></p>
<h2>Above the surface<a id="above-the-surface"></a></h2>
<p>Building on what we have done earlier, we can extend the approach to get other high values above the surface. Let’s see which formula we used to create the visualization in the introduction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro-s1.png" alt="2.1 Flights: Top 10 destinations" /></p>
<p>For this visualization, we used a property of the <strong>round()</strong> function: any ratio of 0.5 or more rounds up to 1, so the formula brings in all the values greater than or equal to 50% of the highest value.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.1-explaination-round.png" alt="2.1 flights: round() &gt; 50% of max explanation" /></p>
<p>Let's duplicate our first visualization and swap out the floor() function with round().</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*round(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*round(count()/overall_max(count()))</code></p>
<p>It was an easy fix.<br />
What if we want to extend the first layer further by adding more high values?<br />
For instance, we would like all the values above the average.</p>
<p>To do this, we use <strong>overall_average()</strong> as the new reference value instead of overall_max() to separate the eligible values in Layer 1.</p>
<p>As we are comparing against the average value among all the buckets, the division might return values greater than 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-floor.png" alt="2.2 flights: round() explanation" /></p>
<p>Here, the <strong>clamp()</strong> function nicely solves this issue. </p>
<p>According to the formula reference, clamp() &quot;limits the value from a minimum to maximum&quot;. Combining clamp() and floor() ensures that there are only two possible output values: either the minimum (0) or the maximum (1) given as parameters.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-clamp.png" alt="2.2 flights: clamp() explanation" /></p>
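<p>With made-up counts and an overall average of 1200, the combination behaves like this (note the first row, where clamp() caps a floor() result of 2 back to 1):</p>
<pre><code>count()  count()/overall_average(count())  floor(...)  clamp(floor(...),0,1)
2600     2.17                              2           1
1250     1.04                              1           1
1096     0.91                              0           0
</code></pre>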
<p>Applied to our flights dataset, it highlights the country destinations that have more flights than the average:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-2.2-above-overall-average.png" alt="2.2 flights: above the overall average " /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p>It also opens up options for using other dynamic references. For instance, we could place all the values greater than 60% of the highest above the surface ( &gt; <code>0.6*overall_max(count())</code>).
We can tune our formula as follows: </p>
<pre><code>
count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)
</code></pre>
<h2>Conclusion<a id="conclusion"></a></h2>
<p>In the first part, we have seen the main tips allowing us to create a two-color histogram:</p>
<ul>
<li>
<p>Two layers: one for the highest value and one for the remaining values</p>
</li>
<li>
<p>Visualization type: bar horizontal/vertical <strong>stacked</strong></p>
</li>
<li>
<p>To separate the data, we use a formula where only the highest value returns 1 and all the other values return 0</p>
</li>
</ul>
<p>Then in the second part, we saw how to extend this principle to bring more high values above the surface. This approach can be summarized as follows:</p>
<ul>
<li>
<p>Start with layer 1 focusing on the high value: count()*&lt;formula returning 0 or 1&gt;</p>
</li>
<li>
<p>Duplicate the layer and adjust the formula:<br />
 ( count() - count()*&lt;formula returning 0 or 1&gt;)</p>
</li>
</ul>
<p>Finally, here are four generic formulas, ready to use to spice up your dashboards:</p>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1. Only the highest</strong></td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*floor(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*floor(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.1. Above the surface :</strong> high values (above 50% of the max value)</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*round(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*round(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.2. Above the surface :</strong> all values above the overall average</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.3. Above the surface :</strong> all the values greater than 60% of the highest</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
</tbody>
</table>
<p>Try these examples out for yourself by signing up for a <a href="https://cloud.elastic.co/registration?elektra=10-common-questions-kibana-blog">free trial of Elastic Cloud</a> or <a href="https://www.elastic.co/downloads/">download</a> the self-managed version of the Elastic Stack for free. If you have additional questions about getting started, head on over to the <a href="https://discuss.elastic.co/c/elastic-stack/kibana/7">Kibana forum</a> or check out the <a href="https://www.elastic.co/guide/en/kibana/current/index.html">Kibana documentation guide</a>.<br />
In the next blog post, we will see how the new function <strong>ifelse</strong>() (introduced in version 8.6) will greatly simplify the creation of visualizations with more advanced formulas.</p>
<p><strong>References</strong>:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing intuitive Kibana dashboards as a non-designer</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/kibana/current/lens.html#lens-formulas">Kibana: Lens editor - use formula to perform math</a></p>
</li>
<li>
<p>Discovering the clamp() function <a href="https://discuss.elastic.co/t/if-condition-in-kibana-table-visualization/305751/5">in this discussion (Thanks Marco!)</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/kibana-magic-formulas-p1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Managing your Kubernetes cluster with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-cluster-metrics-logs-monitoring</link>
            <guid isPermaLink="false">kubernetes-cluster-metrics-logs-monitoring</guid>
            <pubDate>Mon, 24 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Unify all of your Kubernetes metrics, log, and trace data on a single platform and dashboard, Elastic. From the infrastructure to the application layer Elastic Observability makes it easier for you to understand how your cluster is performing.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT manager, DevOps), you’re always struggling with how to manage technology and data sprawl. Kubernetes is becoming increasingly pervasive and a majority of these deployments will be in Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), or Azure Kubernetes Service (AKS). Some of you may be on a single cloud while others will have the added burden of managing clusters on multiple Kubernetes cloud services. In addition to cloud provider complexity, you also have to manage hundreds of deployed services generating more and more observability and telemetry data.</p>
<p>The day-to-day operations of understanding the status and health of your Kubernetes clusters and applications running on them, through the logs, metrics, and traces they generate, will likely be your biggest challenge. But as an operations engineer you will need all of that important data to help prevent, predict, and remediate issues. And you certainly don’t need that volume of metrics, logs and traces spread across multiple tools when you need to visualize and analyze Kubernetes telemetry data for troubleshooting and support.</p>
<p>Elastic Observability helps manage the sprawl of Kubernetes metrics and logs by providing extensive and centralized observability capabilities beyond just the logging that we are known for. Elastic Observability provides you with granular insights and context into the behavior of your Kubernetes clusters along with the applications running on them by unifying all of your metrics, log, and trace data through OpenTelemetry and APM agents.</p>
<p>Regardless of the cluster location (EKS, GKE, AKS, self-managed) or application, <a href="https://www.elastic.co/what-is/kubernetes-monitoring">Kubernetes monitoring</a> is made simple with Elastic Observability. All of the node, pod, container, application, and infrastructure (AWS, GCP, Azure) metrics, infrastructure and application logs, along with application traces are available in Elastic Observability.</p>
<p>In this blog we will show:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest metrics and log data through the Elastic Agent (easily deployed on your cluster as a DaemonSet) to retrieve logs and metrics from the host (system metrics, container stats) along with logs from all services running on top of Kubernetes.</li>
<li>How Elastic Observability can bring a unified telemetry experience (logs, metrics, traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" alt="Elastic Agent with Kubernetes Integration" /></p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>While we used GKE, you can use any location for your Kubernetes cluster.</li>
<li>We used a variant of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">HipsterShop</a> demo application. It was originally written by Google to showcase Kubernetes and is now available in a multitude of variants, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. To use the app, please go <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a> and follow the instructions to deploy it. You don’t need to deploy otelcollector for Kubernetes metrics to flow — we will cover this below.</li>
<li>Elastic supports native ingest from Prometheus and FluentD, but in this blog, we are showing direct ingest from the Kubernetes cluster via Elastic Agent. A follow-up blog will show how Elastic can also pull in telemetry from Prometheus or FluentD/Fluent Bit.</li>
</ul>
<h2>What can you observe and analyze with Elastic?</h2>
<p>Before we walk through the steps on getting Elastic set up to ingest and visualize Kubernetes cluster metrics and logs, let’s take a sneak peek at Elastic’s helpful dashboards.</p>
<p>As we noted, we ran a variant of HipsterShop on GKE and deployed Elastic Agents with Kubernetes integration as a DaemonSet on the GKE cluster. Upon deployment of the agents, Elastic starts ingesting metrics from the Kubernetes cluster (specifically from kube-state-metrics) and additionally Elastic will pull all log information from the cluster.</p>
<h3>Visualizing Kubernetes metrics on Elastic Observability</h3>
<p>Here are a few Kubernetes dashboards that will be available out of the box (OOTB) on Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="HipsterShop default namespace pod dashboard on Elastic Observability" /></p>
<p>In addition to the cluster overview dashboard and pod dashboard, Elastic has several useful OOTB dashboards:</p>
<ul>
<li>Kubernetes overview dashboard (see above)</li>
<li>Kubernetes pod dashboard (see above)</li>
<li>Kubernetes nodes dashboard</li>
<li>Kubernetes deployments dashboard</li>
<li>Kubernetes DaemonSets dashboard</li>
<li>Kubernetes StatefulSets dashboards</li>
<li>Kubernetes CronJob &amp; Jobs dashboards</li>
<li>Kubernetes services dashboards</li>
<li>More being added regularly</li>
</ul>
<p>Additionally, you can either customize these dashboards or build out your own.</p>
<h3>Working with logs on Elastic Observability</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-Logging-4.png" alt="Kubernetes container logs and Elastic Agent logs" /></p>
<p>As you can see from the screens above, not only can I get Kubernetes cluster metrics, but also all the Kubernetes logs simply by using the Elastic Agent in my Kubernetes cluster.</p>
<h3>Prevent, predict, and remediate issues</h3>
<p>In addition to helping manage metrics and logs, Elastic can help you detect and predict anomalies across your cluster telemetry. Simply turn on Machine Learning in Elastic against your data and watch it help you enhance your analysis work. As you can see below, Elastic is not only a unified observability location for your Kubernetes cluster logs and metrics, but it also provides extensive true machine learning capabilities to enhance your analysis and management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-AnomalyDetection-5.png" alt="Anomaly detection across logs on Elastic Observability" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-PodIssues-6.png" alt="Analyzing issues on a Kubernetes pod with Elastic Observability " /></p>
<p>In the top graph, anomaly detection across logs shows something potentially wrong in the September 21 to 23 time period. The bottom chart digs into the details by analyzing a single metric, kubernetes.pod.cpu.usage.node, which shows CPU issues early in September and again later in the month. You can run more complicated analyses on your cluster telemetry with Machine Learning, using multi-metric analysis (versus the single-metric issue shown above) along with population analysis.</p>
<p>Elastic gives you better machine learning capabilities to enhance your analysis of Kubernetes cluster telemetry. In the next section, let’s walk through how easy it is to get your telemetry data into Elastic.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get metrics, logs, and traces into Elastic from a HipsterShop application deployed on GKE.</p>
<p>First, pick your favorite version of Hipstershop — as we noted above, we used a variant of the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry-Demo</a> because it already has OTel. We slimmed it down for this blog, however (fewer services with some varied languages).</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FreeElasticCloud-7.png" alt="" /></p>
<h3>Step 1: Get a Kubernetes cluster and load your Kubernetes app into your cluster</h3>
<p>Get your app on a Kubernetes cluster in your Cloud service of choice or local Kubernetes platform. Once your app is up on Kubernetes, you should have the following pods (or some variant) running on the default namespace.</p>
<pre><code class="language-yaml">NAME                                    READY   STATUS    RESTARTS   AGE
adservice-8694798b7b-jbfxt              1/1     Running   0          4d3h
cartservice-67b598697c-hfsxv            1/1     Running   0          4d3h
checkoutservice-994ddc4c4-p9p2s         1/1     Running   0          4d3h
currencyservice-574f65d7f8-zc4bn        1/1     Running   0          4d3h
emailservice-6db78645b5-ppmdk           1/1     Running   0          4d3h
frontend-5778bfc56d-jjfxg               1/1     Running   0          4d3h
jaeger-686c775fbd-7d45d                 1/1     Running   0          4d3h
loadgenerator-c8f76d8db-gvrp7           1/1     Running   0          4d3h
otelcollector-5b87f4f484-4wbwn          1/1     Running   0          4d3h
paymentservice-6888bb469c-nblqj         1/1     Running   0          4d3h
productcatalogservice-66478c4b4-ff5qm   1/1     Running   0          4d3h
recommendationservice-648978746-8bzxc   1/1     Running   0          4d3h
redis-cart-96d48485f-gpgxd              1/1     Running   0          4d3h
shippingservice-67fddb767f-cq97d        1/1     Running   0          4d3h
</code></pre>
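<p>A quick way to confirm that everything in the listing above is healthy is to filter for pods that are not fully ready. A small shell sketch (the check_pods helper below is our own, not part of kubectl):</p>

```shell
# check_pods: read `kubectl get pods` output on stdin and print any pod
# that is not fully ready and Running, for example:
#   kubectl get pods -n default | check_pods
check_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, e.g. "1/1"
    if (ready[1] != ready[2] || $3 != "Running")
      print $1 ": " $3
  }'
}
```

<p>If the function prints nothing, every pod in the namespace reports itself as ready and Running.</p>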
<h3>Step 2: Turn on kube-state-metrics</h3>
<p>Next you will need to turn on <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a>.</p>
<p>First:</p>
<pre><code class="language-bash">git clone https://github.com/kubernetes/kube-state-metrics.git
</code></pre>
<p>Next, in the examples directory under the kube-state-metrics directory, apply the standard config.</p>
<pre><code class="language-bash">kubectl apply -f ./standard
</code></pre>
<p>This will turn on kube-state-metrics, and you should see a pod similar to this running in kube-system namespace.</p>
<pre><code class="language-yaml">kube-state-metrics-5f9dc77c66-qjprz                    1/1     Running   0          4d4h
</code></pre>
<h3>Step 3: Install the Elastic Agent with Kubernetes integration</h3>
<p><strong>Add Kubernetes Integration:</strong></p>
<ol>
<li><img src="https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt5a3ae745e98b9e37/635691670a58db35cbdbc0f6/ManagingKubernetes-Addk8sButton-8.png" alt="" /></li>
<li>In Elastic, go to Integrations, select the Kubernetes integration, and click Add Kubernetes.</li>
<li>Select a name for the Kubernetes integration.</li>
<li>Turn on kube-state-metrics in the configuration screen.</li>
<li>Give the configuration a name in the new-agent-policy-name text box.</li>
<li>Save the configuration. The integration with a policy is now created.</li>
</ol>
<p>You can read more about Elastic Agent policies and how they are used <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-K8sIntegration-9.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FleetManagement-10.png" alt="" /></p>
<ol>
<li>Add the Kubernetes integration.</li>
<li>In the second step, select the policy you just created.</li>
<li>In the third step of the Add Agent instructions, copy and paste or download the manifest.</li>
<li>Save the manifest as elastic-agent-managed-kubernetes.yaml on the machine where kubectl is configured, and run the following command.</li>
</ol>
<pre><code class="language-bash">kubectl apply -f elastic-agent-managed-kubernetes.yaml
</code></pre>
<p>You should see a number of agents come up as part of a DaemonSet in kube-system namespace.</p>
<pre><code class="language-yaml">NAME                                                   READY   STATUS    RESTARTS   AGE
elastic-agent-qr6hj                                    1/1     Running   0          4d7h
elastic-agent-sctmz                                    1/1     Running   0          4d7h
elastic-agent-x6zkw                                    1/1     Running   0          4d7h
elastic-agent-zc64h                                    1/1     Running   0          4d7h
</code></pre>
<p>In my cluster, I have four nodes, so four elastic-agent pods were started as part of the DaemonSet.</p>
<h3>Step 4: Look at Elastic’s out-of-the-box (OOTB) dashboards for Kubernetes metrics and start discovering Kubernetes logs</h3>
<p>That is it. You should see metrics flowing into all the dashboards. To view logs for specific pods, simply go into Discover in Kibana and search for a specific pod name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="Hipstershop default namespace pod dashboard on Elastic Observability" /></p>
<p>Additionally, you can browse all the pod logs directly in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKurbenetes-PodLogs-11.png" alt="frontendService and cartService logs" /></p>
<p>In the above example, I searched for frontendService and cartService logs.</p>
<h3>Step 5: Bonus!</h3>
<p>Because we used an OTel-based application, Elastic can even pull in the application traces. But that is a discussion for another blog.</p>
<p>Here is a quick peek at what Hipster Shop’s traces for a front end transaction look like in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-CheckOutTransaction-12.png" alt="Trace for Checkout transaction for HipsterShop" /></p>
<h2>Conclusion: Elastic Observability rocks for Kubernetes monitoring</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you manage Kubernetes clusters along with the complexity of the metrics, log, and trace data it generates for even a simple deployment.</p>
<p>A quick recap of the lessons learned:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest telemetry data through the Elastic Agent, which is easily deployed on your cluster as a DaemonSet and retrieves metrics from each host, such as system metrics, container stats, and metrics from all the services running on top of Kubernetes</li>
<li>What Elastic brings to a unified telemetry experience (Kubernetes logs, metrics, and traces) across all your Kubernetes cluster components (pods, nodes, services, any namespace, and more)</li>
<li>Why it’s worth exploring Elastic’s ML capabilities, which will reduce your <strong>MTTHH</strong> (mean time to happy hour)</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register</a> and try out the features and capabilities I’ve outlined above.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gain insights into Kubernetes errors with Elastic Observability logs and OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-errors-observability-logs-openai</link>
            <guid isPermaLink="false">kubernetes-errors-observability-logs-openai</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides an example of how one can analyze error messages in Elasticsearch with ChatGPT using the OpenAI API via Elasticsearch.]]></description>
<content:encoded><![CDATA[<p>As we’ve shown in previous blogs, Elastic<sup>®</sup> provides a way to ingest and manage telemetry from the <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes cluster</a> and the <a href="https://www.elastic.co/blog/opentelemetry-observability">application</a> running on it. Elastic provides out-of-the-box dashboards to help with tracking metrics, <a href="https://www.elastic.co/blog/log-management-observability-operations">log management and analytics</a>, <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">APM functionality</a> (which also supports <a href="https://www.elastic.co/blog/opentelemetry-observability">native OpenTelemetry</a>), and the ability to analyze everything with <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps features</a> and <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning?elektra=home">machine learning</a> (ML). While you can use pre-existing <a href="https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-search-relevance">ML models in Elastic</a>, <a href="https://www.elastic.co/blog/aiops-automation-analytics-elastic-observability-use-cases">out-of-the-box AIOps features</a>, or your own ML models, there is a need to dig deeper into the root cause of an issue.</p>
<p>Elastic helps reduce the operational work to support more efficient operations, but users still need a way to investigate and understand everything from the cause of an issue to the meaning of specific error messages. As an operations user, if you haven’t run into a particular error before and it isn’t covered in a runbook, you will likely go to Google and start searching for information.</p>
<p>OpenAI’s ChatGPT is becoming an interesting generative AI tool that helps provide more information using the models behind it. What if you could use OpenAI to obtain deeper insights (even simple semantics) for an error in your production or development environment? You can easily tie Elastic to OpenAI’s API to achieve this.</p>
<p>Kubernetes, a mainstay in most deployments (on-prem or in a cloud service provider), requires a significant amount of expertise — even if that expertise is to manage a service like GKE, EKS, or AKS.</p>
<p>In this blog, I will cover how you can use <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic’s watcher</a> capability to connect Elastic to OpenAI and ask it for more information about the error logs Elastic is ingesting from a Kubernetes cluster(s). More specifically, we will use <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure’s OpenAI Service</a>. Azure OpenAI is a partnership between Microsoft and OpenAI, so the same models from OpenAI are available in the Microsoft version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-azure-openai.png" alt="elastic azure openai" /></p>
<p>While this blog goes over a specific example, it can be modified for other types of errors Elastic receives in logs. Whether it's from AWS, the application, databases, etc., the configuration and script described in this blog can be modified easily.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used a GCP GKE Kubernetes cluster, but you can use any Kubernetes cluster service (on-prem or cloud based) of your choice.</li>
<li>We’re also running with a version of the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>We also have an Azure account and <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure OpenAI service configured</a>. You will need to get the appropriate tokens from Azure and the proper URL endpoint from Azure’s OpenAI service.</li>
<li>We will use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Elastic’s dev tools</a>, the console to be specific, to load up and run the script, which is an <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic watcher</a>.</li>
<li>We will also add a new index to store the results from the OpenAI query.</li>
</ul>
<p>Here is the configuration we will set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" alt="Configuration to analyze Kubernetes cluster errors" /></p>
<p>As we walk through the setup, we’ll also provide the alternative setup with OpenAI versus Azure OpenAI Service.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud and setting up your K8S cluster and application</li>
<li>Gaining Azure OpenAI authorization (alternative option with OpenAI)</li>
<li>Identifying Kubernetes error logs</li>
<li>Configuring the watcher with the right script</li>
<li>Comparing the output from Azure OpenAI/OpenAI versus ChatGPT UI</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-start-cloud-trial.png" alt="elastic start cloud trial" /></p>
<p>Once you have the Elastic Cloud login, set up your Kubernetes cluster and application. A complete step-by-step instructions blog is available <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">here</a>. This also provides an overview of how to see Kubernetes cluster metrics in Elastic and how to monitor them with dashboards.</p>
<h3>Step 1: Azure OpenAI Service and authorization</h3>
<p>When you log in to your Azure subscription and set up an instance of Azure OpenAI Service, you will be able to get your keys under Manage Keys.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-microsoft-azure-manage-keys.png" alt="microsoft azure manage keys" /></p>
<p>There are two keys for your OpenAI instance, but you only need KEY 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-pme-openai-keys-and-endpoint.png" alt="Used with permission from Microsoft." /></p>
<p>Additionally, you will need to get the service URL. See the image above with our service URL blanked out to understand where to get the KEY 1 and URL.</p>
<p>If you are using the standard OpenAI service rather than Azure OpenAI Service, you can get your keys at:</p>
<pre><code class="language-bash">https://platform.openai.com/account/api-keys
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-api-keys.png" alt="api keys" /></p>
<p>You will need to create a key and save it. Once you have the key, you can go to Step 2.</p>
<h3>Step 2: Identifying Kubernetes errors in Elastic logs</h3>
<p>As your Kubernetes cluster is running, <a href="https://docs.elastic.co/en/integrations/kubernetes">Elastic’s Kubernetes integration</a> running on the Elastic agent daemon set on your cluster is sending logs and metrics to Elastic. <a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">The telemetry is ingested, processed, and indexed</a>. Kubernetes logs are stored in an index called .ds-logs-kubernetes.container_logs-default-* (* is for the date), and an automatic data stream logs-kubernetes.container_logs is also pre-loaded. So while you can use some of the out-of-the-box dashboards to investigate the metrics, you can also look at all the logs in Elastic Discover.</p>
<p>While any error from Kubernetes can be daunting, the more nuanced issues occur with errors from the pods running in the kube-system namespace. Take the konnectivity-agent pod: it is essentially a network proxy agent running on the node to help establish tunnels and is a vital component of Kubernetes. Any error will cause the cluster to have connectivity issues and lead to a cascade of problems, so it’s important to understand and troubleshoot these errors.</p>
<p>When we filter out for error logs from the konnectivity agent, we see a good number of errors.</p>
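<p>The filter itself is a one-line KQL query in the Discover search bar; the field names below come from Elastic’s Kubernetes integration and are the same fields the watcher later in this blog searches on:</p>

```
kubernetes.container.name : "konnectivity-agent" and message : "error"
```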
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<p>But unfortunately, we still can’t understand what these errors mean.</p>
<p>Enter OpenAI to help us understand the issue better. Generally, you would take the error message from Discover and paste it with a question in ChatGPT (or run a Google search on the message).</p>
<p>One error in particular that we ran into but did not understand is:</p>
<pre><code class="language-bash">E0510 02:51:47.138292       1 client.go:388] could not read stream err=rpc error: code = Unavailable desc = error reading from server: read tcp 10.120.0.8:46156-&gt;35.230.74.219:8132: read: connection timed out serverID=632d489f-9306-4851-b96b-9204b48f5587 agentID=e305f823-5b03-47d3-a898-70031d9f4768
</code></pre>
<p>The OpenAI output is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-openai-output.png" alt="openai output" /></p>
<p>ChatGPT has given us a fairly nice set of ideas on why this rpc error is occurring against our konnectivity-agent.</p>
<p>So how can we get this output automatically for any error when those errors occur?</p>
<h3>Step 3: Configuring the watcher with the right script</h3>
<p><a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">What is an Elastic watcher?</a> Watcher is an Elasticsearch feature that you can use to create actions based on conditions, which are periodically evaluated using queries on your data. Watchers are helpful for analyzing mission-critical and business-critical streaming data. For example, you might watch application logs for errors causing larger operational issues.</p>
<p>Once a watcher is configured, it can be:</p>
<ol>
<li>Manually triggered</li>
<li>Run periodically</li>
<li>Created using a UI or a script</li>
</ol>
<p>In this scenario, we will use a script, as we can modify it easily and run it as needed.</p>
<p>We’re using the DevTools Console to enter the script and test it out:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-test-script.png" alt="test script" /></p>
<p>The script is listed at the end of the blog in the <strong>appendix</strong>. It can also be downloaded <a href="https://github.com/elastic/chatgpt-error-analysis"><strong>here</strong></a>.</p>
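<p>Once the PUT request in the appendix has created the watch, you can also fire it immediately from the Dev Tools console while iterating, without waiting for the five-minute schedule, using Elasticsearch’s execute watch API:</p>

```
POST _watcher/watch/chatgpt_analysis/_execute
```

<p>The response includes the watch execution record, which is handy for debugging each step of the input chain.</p>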
<p>The script does the following:</p>
<ol>
<li>It runs every five minutes.</li>
<li>It will search the logs for errors from the container konnectivity-agent.</li>
<li>It will take the first error’s message, transform it (re-format and clean up), and place it into a variable first_hit.</li>
</ol>
<pre><code class="language-json">&quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
</code></pre>
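<p>The replace call above strips double quotes from the log line so that the message can later be embedded inside the JSON body of the OpenAI request without producing invalid JSON. The same sanitization expressed in shell, purely as an illustration (sanitize_msg is a hypothetical helper, not part of the watcher):</p>

```shell
# Mirror the watcher transform: delete double quotes so the message can be
# inlined into a JSON string value without escaping issues.
sanitize_msg() {
  printf '%s' "$1" | tr -d '"'
}
```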
<ol start="4">
<li>The error message is sent into OpenAI with a query:</li>
</ol>
<pre><code class="language-yaml">What are the potential reasons for the following kubernetes error:
  {{ ctx.payload.second.first_hit }}
</code></pre>
<ol start="5">
<li>If the search found an error, it creates an index called chatgpt_k8s_analyzed (if it does not already exist) and indexes the error message, the pod name (konnectivity-agent-6676d5695b-ccsmx in our setup), and the OpenAI output into it.</li>
</ol>
<p>To see the results, we created a new data view called chatgpt_k8s_analyzed against the newly created index:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-edit-data-view.png" alt="edit data view" /></p>
<p>In Discover, the output on the data view provides us with the analysis of the errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-analysis-of-errors.png" alt="analysis of errors" /></p>
<p>For every error the script sees in the five-minute interval, it will get an analysis of the error. Alternatively, we could use a range query to analyze a specific time frame; the script would just need to be modified accordingly.</p>
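<p>Besides the data view, the stored results can be inspected directly from the Dev Tools console; for instance, the five most recent analyses (assuming the index name used in the appendix script):</p>

```
GET chatgpt_k8s_analyzed/_search
{
  "size": 5,
  "sort": [{ "timestamp": { "order": "desc" } }]
}
```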
<h3>Step 4: Output from Azure OpenAI/OpenAI vs. ChatGPT UI</h3>
<p>As you noticed above, we got roughly the same result from the Azure OpenAI API call as we did by testing our query in the ChatGPT UI. This is because we configured the API call to use the same (or a similar) model as the one selected in the UI.</p>
<p>For the API call, we used the following parameters:</p>
<pre><code class="language-json">&quot;request&quot;: {
             &quot;method&quot; : &quot;POST&quot;,
             &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
             &quot;headers&quot;: {&quot;api-key&quot; : &quot;XXXXXXX&quot;,
                         &quot;content-type&quot; : &quot;application/json&quot;
                        },
             &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
              &quot;connection_timeout&quot;: &quot;60s&quot;,
               &quot;read_timeout&quot;: &quot;60s&quot;
                            }
</code></pre>
<p>By setting the system role to You are a helpful assistant and using the gpt-35-turbo portion of the URL, we are setting the API to use the gpt-3.5-turbo model, which is the same model the ChatGPT UI uses by default.</p>
<p>Additionally, for Azure OpenAI Service, you will need to set the URL to something similar to the following:</p>
<pre><code class="language-bash">https://YOURSERVICENAME.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview
</code></pre>
<p>If you use OpenAI (versus Azure OpenAI Service), the request call (against <a href="https://api.openai.com/v1/completions">https://api.openai.com/v1/completions</a>) would be as such:</p>
<pre><code class="language-json">&quot;request&quot;: {
            &quot;scheme&quot;: &quot;https&quot;,
            &quot;host&quot;: &quot;api.openai.com&quot;,
            &quot;port&quot;: 443,
            &quot;method&quot;: &quot;post&quot;,
            &quot;path&quot;: &quot;\/v1\/completions&quot;,
            &quot;params&quot;: {},
            &quot;headers&quot;: {
               &quot;content-type&quot;: &quot;application\/json&quot;,
               &quot;authorization&quot;: &quot;Bearer YOUR_ACCESS_TOKEN&quot;
                        },
            &quot;body&quot;: &quot;{ \&quot;model\&quot;: \&quot;text-davinci-003\&quot;,  \&quot;prompt\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;,  \&quot;temperature\&quot;: 1,  \&quot;max_tokens\&quot;: 512,     \&quot;top_p\&quot;: 1.0,      \&quot;frequency_penalty\&quot;: 0.0,   \&quot;presence_penalty\&quot;: 0.0 }&quot;,
            &quot;connection_timeout_in_millis&quot;: 60000,
            &quot;read_timeout_millis&quot;: 60000
          }
</code></pre>
<p>If you are interested in creating a more OpenAI-based version, you can <a href="https://elastic-content-share.eu/downloads/watcher-job-to-integrate-chatgpt-in-elasticsearch/">download an alternative script</a> and look at <a href="https://mar1.hashnode.dev/unlocking-the-power-of-aiops-with-chatgpt-and-elasticsearch">another blog from an Elastic community member</a>.</p>
<h2>Gaining other insights beyond Kubernetes logs</h2>
<p>Now that the script is up and running, you can modify it using different:</p>
<ul>
<li>Inputs</li>
<li>Conditions</li>
<li>Actions</li>
<li>Transforms</li>
</ul>
<p>Learn more on how to modify it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-alerting.html">here</a>. Some examples of modifications could include:</p>
<ol>
<li>Look for error logs from application components (e.g., cartService, frontEnd, from the OTel demo), cloud service providers (e.g., AWS/Azure/GCP logs), and even logs from components such as Kafka, databases, etc.</li>
<li>Vary the time frame from running continuously to running over a specific <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html">range</a>.</li>
<li>Look for specific errors in the logs.</li>
<li>Query for analysis on a set of errors at once versus just one, which we demonstrated.</li>
</ol>
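<p>As a sketch of the second modification, the must clause of the watcher’s search body could gain a range filter so that the watch only considers a fixed window; the one-hour window here is an arbitrary example:</p>

```json
"query": {
  "bool": {
    "must": [
      { "match": { "kubernetes.container.name": "konnectivity-agent" } },
      { "match": { "message": "error" } },
      { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
    ]
  }
}
```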
<p>The modifications are endless, and of course you can run this with OpenAI rather than Azure OpenAI Service.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you connect to OpenAI services (Azure OpenAI, as we showed, or even OpenAI) to better analyze an error log message instead of having to run several Google searches and hunt for possible insights.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Developing an Elastic watcher script that can be used to find and send Kubernetes errors into OpenAI and insert them into a new index</li>
<li>Configuring Azure OpenAI Service or OpenAI with the right authorization and request parameters</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
<h2>Appendix</h2>
<p>Watcher script</p>
<pre><code class="language-bash">PUT _watcher/watch/chatgpt_analysis
{
    &quot;trigger&quot;: {
      &quot;schedule&quot;: {
        &quot;interval&quot;: &quot;5m&quot;
      }
    },
    &quot;input&quot;: {
      &quot;chain&quot;: {
          &quot;inputs&quot;: [
              {
                  &quot;first&quot;: {
                      &quot;search&quot;: {
                          &quot;request&quot;: {
                              &quot;search_type&quot;: &quot;query_then_fetch&quot;,
                              &quot;indices&quot;: [
                                &quot;logs-kubernetes*&quot;
                              ],
                              &quot;rest_total_hits_as_int&quot;: true,
                              &quot;body&quot;: {
                                &quot;query&quot;: {
                                  &quot;bool&quot;: {
                                    &quot;must&quot;: [
                                      {
                                        &quot;match&quot;: {
                                          &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot;
                                        }
                                      },
                                      {
                                        &quot;match&quot; : {
                                          &quot;message&quot;:&quot;error&quot;
                                        }
                                      }
                                    ]
                                  }
                                },
                                &quot;size&quot;: &quot;1&quot;
                              }
                            }
                        }
                    }
                },
                {
                    &quot;second&quot;: {
                        &quot;transform&quot;: {
                            &quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
                        }
                    }
                },
                {
                    &quot;third&quot;: {
                        &quot;http&quot;: {
                            &quot;request&quot;: {
                                &quot;method&quot; : &quot;POST&quot;,
                                &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
                                &quot;headers&quot;: {
                                    &quot;api-key&quot; : &quot;XXX&quot;,
                                    &quot;content-type&quot; : &quot;application/json&quot;
                                },
                                &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
                                &quot;connection_timeout&quot;: &quot;60s&quot;,
                                &quot;read_timeout&quot;: &quot;60s&quot;
                            }
                        }
                    }
                }
            ]
        }
    },
    &quot;condition&quot;: {
      &quot;compare&quot;: {
        &quot;ctx.payload.first.hits.total&quot;: {
          &quot;gt&quot;: 0
        }
      }
    },
    &quot;actions&quot;: {
        &quot;index_payload&quot; : {
            &quot;transform&quot;: {
                &quot;script&quot;: {
                    &quot;source&quot;: &quot;&quot;&quot;
                        def payload = [:];
                        payload.timestamp = new Date();
                        payload.pod_name = ctx.payload.first.hits.hits[0]._source.kubernetes.pod.name;
                        payload.error_message = ctx.payload.second.first_hit;
                        payload.chatgpt_analysis = ctx.payload.third.choices[0].message.content;
                        return payload;
                    &quot;&quot;&quot;
                }
            },
            &quot;index&quot; : {
                &quot;index&quot; : &quot;chatgpt_k8s_analyzed&quot;
            }
        }
    }
}
</code></pre>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
<p><em>Screenshots of Microsoft products used with permission from Microsoft.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Process Kubernetes logs with ease using Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-logs-elastic-streams-processing</link>
            <guid isPermaLink="false">kubernetes-logs-elastic-streams-processing</guid>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to process Kubernetes logs with Elastic Streams using conditional blocks, AI-generated Grok patterns, and selective drops to reduce noise and storage cost.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset, surfacing not only the root cause but also the why behind it to enable faster resolution.</p>
<p>Learn more in our previous article, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Introducing Streams</a>.</p>
<p>Many SREs deploy on cloud-native architectures, and Kubernetes is the baseline deployment platform of choice. Yet Kubernetes logs are messy by default. A single (data)stream often mixes access logs, JSON payloads, health checks, and internal service chatter.</p>
<p>Elastic Streams gives you a faster path. You can isolate subsets of logs with conditionals, use AI to generate Grok patterns from real samples, and drop documents you do not need before they add storage and query cost.</p>
<h2>Why Kubernetes logs get messy fast</h2>
<p>The default Kubernetes container logs stream can contain data from many services at once. In one sample, you might see:</p>
<ul>
<li>HTTP access logs from application pods</li>
<li>Verbose worker or batch job status logs</li>
<li>Platform and container lifecycle events with different formats</li>
</ul>
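<p>For instance, three consecutive lines from the same stream can look completely different. The lines below are illustrative examples, not output from a specific cluster:</p>
<pre><code class="language-text">10.1.4.22 - - [12/Mar/2026:09:14:03 +0000] &quot;GET /api/cart HTTP/1.1&quot; 200 512
{&quot;level&quot;:&quot;info&quot;,&quot;msg&quot;:&quot;batch 42 complete&quot;,&quot;duration_ms&quot;:1833}
Readiness probe succeeded for container &quot;checkout&quot;
</code></pre>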
<p>This is why &quot;one global parsing rule&quot; will fail: you need targeted processing logic per log shape or application type. Historically, this kind of custom processing has been error-prone and time-consuming.</p>
<h2>What Streams Processing changes</h2>
<p>Streams Processing (available in 9.2 and later) moves this workflow into a live, interactive experience:</p>
<ul>
<li>You build conditions and processors in the UI</li>
<li>You validate each change against sample documents before saving</li>
<li>You can use AI to generate extraction patterns from selected logs</li>
</ul>
<p>The result is a safer way to iterate on parsing logic without guessing.</p>
<h2>Walkthrough: parse custom application logs</h2>
<p>We'll start from your Kubernetes stream (<code>logs-kubernetes.containers_logs-default</code>) and create a conditional block that scopes processing to one service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/01-conditional-filter-litellm.png" alt="Conditional block filtering Kubernetes logs for litellm before parsing in Elastic Streams" /></p>
<p>Once the condition is saved, it automatically filters the sample data to the subset of logs that match the condition. This is indicated by the blue highlight in the preview.</p>
<p>Inside that block, we'll add a Grok processor and click <strong>Generate pattern</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/02-generate-pattern-button.png" alt="Generate pattern button in Elastic Streams using AI to process Kubernetes logs" /></p>
<p>This agentic process uses an LLM to generate a Grok pattern for parsing the logs. By default it uses the Elastic Inference Service, but you can configure it to use your own LLM.
Review the generated pattern and accept it once the sample set validates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/03-accept-generated-grok.png" alt="Accepting AI-generated Grok pattern after matching selected Uvicorn logs" /></p>
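<p>For reference, an AI-generated pattern for a Uvicorn-style access log line such as <code>INFO: 10.0.0.12:52735 - &quot;GET /health HTTP/1.1&quot; 200 OK</code> might look similar to the sketch below. The field names and exact pattern here are illustrative assumptions; what the generator produces depends on your actual log samples:</p>
<pre><code class="language-text">%{LOGLEVEL:log.level}:\s+%{IP:client.ip}:%{NUMBER:client.port} - &quot;%{WORD:http.request.method} %{URIPATH:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code} %{GREEDYDATA:http.response.status_text}
</code></pre>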
<h2>Walkthrough: drop noisy postgres-loadgen documents</h2>
<p>Not all logs are important enough to keep around forever. For example, logs from a load-testing tool such as a load generator are not useful for long-term analysis, so let's drop them.</p>
<p>To do this, we'll add a second conditional block for logs you intentionally do not want to index long-term.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/05-preview-selected-postgres-loadgen.png" alt="Selected tab preview of noisy postgres-loadgen documents before drop" /></p>
<p>Add a drop processor inside this block, then validate in the <strong>Dropped</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/07-preview-dropped-tab.png" alt="Dropped tab preview showing noisy Kubernetes logs excluded from indexing" /></p>
<h2>Save safely with live simulation</h2>
<p>One of the most useful parts of Streams is the preview-first workflow. You can inspect matched, parsed, skipped, failed, and dropped samples before making the change live.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/08-save-changes.png" alt="Save changes button after validating processing logic on live samples" /></p>
<h2>YAML mode and the equivalent API request</h2>
<p>The interactive builder works well for most edits, but advanced users can switch to YAML mode for direct control.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/11-yaml-mode.png" alt="Switching from interactive builder to YAML mode in Streams processing" /></p>
<p>You can also open <strong>Equivalent API Request</strong> to copy the payload for automation and Infrastructure as Code workflows.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/12-equivalent-api-request.png" alt="Equivalent API request panel for automating Streams processing" /></p>
<h2>A note on backwards compatibility</h2>
<p>Streams Processing builds on Elasticsearch ingest pipelines, so it works with the same ingestion model teams already use.</p>
<p>When you save processing changes, Streams appends logic through the stream processing pipeline model (for example via <code>@custom</code> conventions used by data streams). That means you can adopt conditionals, parsing, and selective dropping incrementally, without changing your Kubernetes log shippers.</p>
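<p>As a rough sketch of that model, the two steps from this walkthrough could be expressed as a plain ingest pipeline attached through the <code>@custom</code> convention. The pipeline name, condition fields, and Grok pattern below are illustrative assumptions, not the exact output Streams generates:</p>
<pre><code class="language-json">PUT _ingest/pipeline/logs-kubernetes.containers_logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;grok&quot;: {
        &quot;if&quot;: &quot;ctx.kubernetes?.container?.name == 'litellm'&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;patterns&quot;: [&quot;%{LOGLEVEL:log.level}:\\s+%{GREEDYDATA:event.original}&quot;],
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
        &quot;if&quot;: &quot;ctx.kubernetes?.container?.name == 'postgres-loadgen'&quot;
      }
    }
  ]
}
</code></pre>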
<h2>What's next?</h2>
<p>Streams Processing is continually gaining new capabilities. Check out the <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams documentation</a> for the latest updates.</p>
<p>Over the coming months more of this will be automated and moved to the background, reducing the manual effort required to process logs.</p>
<p>Another milestone we're working towards is offering this processing at read time rather than write time. Using ES|QL, this will let you iterate on your parsing logic without committing changes that are hard to revert.</p>
<p>You can also try this out with a free trial on <a href="https://cloud.elastic.co/">Elastic Serverless</a>.</p>
<p>Happy log analytics!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/cover.svg" length="0" type="image/svg+xml"/>
        </item>
        <item>
            <title><![CDATA[Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-agentic-ai-observability-amazon-bedrock-agentcore</link>
            <guid isPermaLink="false">llm-agentic-ai-observability-amazon-bedrock-agentcore</guid>
            <pubDate>Mon, 01 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how to achieve end-to-end observability for Amazon Bedrock AgentCore: from tracking service health and token costs to debugging complex reasoning loops with distributed tracing.]]></description>
            <content:encoded><![CDATA[<h2>Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability</h2>
<h3>Introduction</h3>
<p>We're excited to introduce Elastic Observability’s Amazon Bedrock AgentCore integration, which allows users to observe <a href="https://aws.amazon.com/bedrock/agentcore/">Amazon Bedrock AgentCore</a> and the agents' LLM interactions end-to-end. Agentic AI represents a fundamental shift in how we build applications. </p>
<p>Unlike standard LLM chatbots that simply generate text, agents can reason, plan, and execute multi-step workflows to complete complex tasks autonomously. Many times these agents run on a platform such as Amazon Bedrock AgentCore, which helps developers build, deploy, and scale agents. Amazon Bedrock AgentCore provides the secure, scalable, and modular infrastructure services (like agent runtime, memory, and identity) necessary for developers to deploy and operate highly capable AI agents built with any framework or model.</p>
<p>Using a platform such as Amazon Bedrock AgentCore is easy, but troubleshooting an agent is far more complex than debugging a standard microservice. Key challenges include:</p>
<ul>
<li>
<p><strong>Non-Deterministic Behavior:</strong> Agents may choose different tools or reasoning paths for the same prompt, making it difficult to reproduce bugs.</p>
</li>
<li>
<p><strong>&quot;Black Box&quot; Execution:</strong> When an agent fails or provides a hallucinated answer, it is often unclear if the issue lies in the LLM's reasoning, the context provided, or a failed tool execution.</p>
</li>
<li>
<p><strong>Cost &amp; Latency Blind Spots:</strong> A single user query can trigger recursive loops or expensive multi-step tool calls, leading to unexpected spikes in token usage and latency.</p>
</li>
</ul>
<p>To effectively observe these systems, you need to correlate signals from two distinct layers:</p>
<ol>
<li>
<p><strong>The Platform Layer (Amazon Bedrock AgentCore):</strong> You need to understand the overall health of the managed service. This includes high-level metrics like invocation counts, latency, throttling, and platform-level errors that affect all agents running in AgentCore.</p>
</li>
<li>
<p><strong>The Application Layer (Your Agentic Logic):</strong> You want to understand the granular &quot;why&quot; behind the behavior. This includes distributed traces, usually with OpenTelemetry, that visualize the full request lifecycle (e.g. waterfall view), identifying exactly which step in the reasoning chain failed or took too long.</p>
</li>
</ol>
<p><strong>Agentic AI Observability in Elastic</strong> provides a unified, end-to-end view of your agentic deployment by combining platform-level insights from Amazon Bedrock AgentCore, through the new <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock_agentcore">Amazon Bedrock AgentCore integration</a>, with deep application-level visibility from OpenTelemetry (OTel) traces, logs, and metrics from the agent. This unified view in Elastic allows you to observe, troubleshoot, and optimize your agentic applications from end to end without switching tools. Additionally, Elastic provides Agent Builder, which allows you to create agents to analyze any of the data from Amazon Bedrock AgentCore and the agents running on it.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-dashboard-runtime-gateway-traces.jpg" alt="Amazon Bedrock AgentCore Dashboards for Runtime and Gateway and APM Tracing" /></p>
<h2>Agentic AI Observability in Elastic</h2>
<p>As mentioned above there are two main parts to end-to-end Agentic AI Observability in Elastic.</p>
<ul>
<li>
<p><strong>Amazon Bedrock AgentCore Platform Observability -</strong> using platform logs and metrics,  Elastic provides comprehensive visibility into the high-level health of the AgentCore service by ingesting AWS vended logs and metrics across four critical components:</p>
<ul>
<li>
<p><strong>Runtime:</strong> Monitor core performance indicators such as agent errors, overall latency, throttle counts, and invocation rates for each endpoint.</p>
</li>
<li>
<p><strong>Gateway:</strong> Gain specific insights into gateway and tool call performance, including invocations, error rates, and latency.</p>
</li>
<li>
<p><strong>Memory:</strong> Track short-term and long-term memory operations, including event creation, retrieval, and listing, alongside performance analysis, errors, and latency metrics.</p>
</li>
<li>
<p><strong>Identity:</strong> Audit security and access health with logs on successful and failed access attempts.</p>
</li>
</ul>
</li>
</ul>
<ul>
<li><strong>Agent Observability with APM, logs and metrics -</strong> To understand <em>how</em> your agent is behaving, Elastic ingests OTel-native traces, metrics and logs from your application running within AgentCore. This allows you to visualize the full execution path, including LLM reasoning steps and tool calls, in a detailed waterfall diagram. </li>
</ul>
<ul>
<li><strong>Agentic AI Analysis</strong> - All of the data from Amazon Bedrock AgentCore and the agent running on it, can be analyzed with <strong>Elastic’s AI driven capabilities</strong>. These include:</li>
</ul>
<ul>
<li>
<p><strong>Elastic AgentCore SRE Agent built on Elastic Agent Builder</strong> - We don't just monitor agents; we provide you with one to assist your team. The <strong>AgentCore SRE Agent</strong> is a specialized assistant built using <strong>Elastic Agent Builder</strong>. It possesses specialized knowledge of AgentCore applications observed in Elastic.</p>
<ul>
<li>
<p><strong>How it helps:</strong> You can ask specific questions regarding your AgentCore environment, such as how to interpret a complex error log or why a specific trace shows latency.</p>
</li>
<li>
<p><strong>Get the Agent:</strong> You can deploy this agent yourself from our <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/elastic_agentcore_sre_agent">GitHub repository</a>.</p>
</li>
</ul>
</li>
<li>
<p><strong>Elastic Observability AI Assistant</strong> - Use natural language anywhere in Elastic’s UI to help you pinpoint issues, analyze something specific, or just learn what the problem is through LLM knowledge base. Additionally, SREs can interpret log messages, errors, metrics patterns, optimize code, write reports, and even identify and execute a runbook, or find a related github issue.</p>
</li>
<li>
<p><strong>Streams - AI-Driven Log Analysis:</strong> When you send AgentCore logs from your instrumented application into Elastic, you can parse and analyze them. Additionally, Streams surfaces <strong>Significant Events</strong> within your log stream, allowing you to focus immediately on what matters most.</p>
</li>
<li>
<p><strong>Dashboards and ES|QL:</strong> Data is only useful if you can act on it. Elastic provides out-of-the-box (OOTB) assets to accelerate your mean time to resolution (MTTR), and ES|QL to help you perform ad-hoc analysis on any signal.</p>
<ul>
<li>
<p><strong>OOTB Dashboards:</strong> Pre-built visualizations based on AgentCore service signals. These dashboards provide an immediate, high-level overview of the usage, health, and performance of your AgentCore runtime, gateway, memory, and identity components.</p>
</li>
<li>
<p><strong>OOTB Alert Templates:</strong> Pre-configured alerts for common agentic issues (e.g., high error rates, latency spikes, or unusual token consumption), allowing you to move from reactive to proactive troubleshooting immediately.</p>
</li>
</ul>
</li>
</ul>
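<p>To illustrate the kind of ad-hoc analysis ES|QL enables, the query below counts runtime user errors per endpoint. The index pattern and field names are illustrative assumptions; check the integration documentation for the exact fields it ships:</p>
<pre><code class="language-esql">FROM logs-aws_bedrock_agentcore.*
| WHERE error.type == &quot;user&quot;
| STATS error_count = COUNT(*) BY endpoint.name
| SORT error_count DESC
</code></pre>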
<h2>Onboarding Amazon Bedrock AgentCore signals into Elastic</h2>
<h3>Amazon Bedrock AgentCore Integration</h3>
<p>To get started with platform-level visibility, you need to enable the <strong>Amazon Bedrock AgentCore</strong> integration in Elastic. This integration automatically collects metrics and logs from your AgentCore runtime, gateway, memory, and identity components via Amazon CloudWatch.</p>
<p><strong>Setup Steps:</strong></p>
<ol>
<li>
<p><strong>Prepare AWS Environment:</strong> Ensure your AgentCore agents are deployed and running and that you have enabled logging on your AgentCore resources in the AWS console.</p>
</li>
<li>
<p><strong>Add the Integration:</strong></p>
<ul>
<li>
<p>In Elastic (Kibana), navigate to <strong>Integrations</strong>.</p>
</li>
<li>
<p>Search for <strong>&quot;Amazon Bedrock AgentCore&quot;</strong>. Select <strong>Add Amazon Bedrock AgentCore</strong>.</p>
</li>
</ul>
</li>
<li>
<p><strong>Configure &amp; Deploy:</strong></p>
<p>Configure Elastic's Amazon Bedrock AgentCore integration to collect CloudWatch metrics from your chosen AWS region at the specified collection interval. Logs will be added soon after the publication of this blog.</p>
</li>
</ol>
<h3>Onboard the Agent with OTel Instrumentation</h3>
<p>The next step is observing the application logic itself. The beauty of Amazon Bedrock AgentCore is that the application runtime often comes pre-instrumented. You simply need to tell it where to send the telemetry data.</p>
<p>For this example, we will use the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/travel_assistant"><strong>Travel Assistant</strong></a> from the Elastic Observability examples.</p>
<p>To instrument this agent, you do not need to modify the source code. Instead, when you invoke the agent using the <code>agentcore</code> CLI, you simply pass your Elastic connection details as environment variables. This redirects the OTel signals (traces, metrics, and logs) directly to the Elastic EDOT collector.</p>
<p><strong>Example Invoke Command:</strong> Run the following command to launch the agent and start streaming telemetry to Elastic:</p>
<pre><code class="language-bash">    agentcore launch \
    --env BEDROCK_MODEL_ID=&quot;us.anthropic.claude-3-5-sonnet-20240620-v1:0&quot; \
    --env OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;REPLACE_WITH_ELASTIC_ENDPOINT&gt;.region.cloud.elastic.co:443&quot; \
    --env OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;REPLACE_WITH_YOUR_API_KEY&gt;&quot; \
    --env OTEL_EXPORTER_OTLP_PROTOCOL=&quot;http/protobuf&quot; \
    --env OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
    --env OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
    --env OTEL_LOGS_EXPORTER=&quot;otlp&quot; \
    --env OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=travel_assistant,service.version=1.0.0&quot; \
    --env AGENT_OBSERVABILITY_ENABLED=&quot;true&quot; \
    --env DISABLE_ADOT_OBSERVABILITY=&quot;true&quot; \
    --env TAVILY_API_KEY=&quot;&lt;REPLACE_WITH_YOUR_TAVILY_KEY&gt;&quot;
</code></pre>
<p><strong>Key Configuration Parameters:</strong></p>
<ul>
<li>
<p><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: Your Elastic OTLP endpoint (ensure port 443 is specified).</p>
</li>
<li>
<p><code>OTEL_EXPORTER_OTLP_HEADERS</code>: The Authorization header containing your Elastic API Key.</p>
</li>
<li>
<p><code>DISABLE_ADOT_OBSERVABILITY=true</code>: This ensures the native AgentCore signals are routed exclusively to your defined endpoint (Elastic) rather than default AWS paths.</p>
</li>
</ul>
<h2>Analyzing Agentic Data in Elastic Observability</h2>
<p>As we walk through the analysis features below, we will use the Travel Assistant agent we instrumented earlier, alongside any other apps you may be running on AgentCore. For the purposes of this example, as a second agent, we will use the <a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/02-use-cases/customer-support-assistant"><strong>Customer Support Assistant</strong></a> from the AWS Labs AgentCore samples.</p>
<h3>Out-of-the-Box (OOTB) Dashboards</h3>
<p>Elastic populates a set of comprehensive dashboards based on Amazon Bedrock AgentCore service logs and metrics. These appear as a unified view with tabs, providing a &quot;single pane of glass&quot; into the operational health of your platform.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-integration-dashboards-runtime-gateway-memory-identity.gif" alt="Amazon Bedrock AgentCore out-of-the-box Dashboards for Runtime, Gateway, Memory and Identity" /></p>
<p>This view is divided into four key zones, each addressing a specific component of AgentCore: Runtime, Gateway, Memory, and Identity. Note that not all agentic applications use all four components. In our example, only the Customer Support Assistant uses all four, whereas the Travel Assistant uses only Runtime.</p>
<p><strong>Runtime Health</strong></p>
<hr />
<p>Visualize agent invocations, session metrics, error trends (system vs. user), and performance stats like latency and throttling, split per endpoint. This dashboard helps you answer questions like:</p>
<ul>
<li>&quot;How are my Travel Assistant agent and Customer Support agent performing in terms of overall traffic and latency, and are there any spikes in errors or throttling?&quot;</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-runtime-dashboard.jpg" alt="Amazon Bedrock AgentCore out-of-the-box Dashboard for AgentCore Runtime" /></p>
<p><strong>Gateway Performance</strong></p>
<hr />
<p>Analyze invocations across Lambda and MCP (Model Context Protocol), with detailed breakdowns for tool vs. non-tool calls. The dashboard highlights throttling detection, target execution times, and separates system errors from user errors.</p>
<ul>
<li><em>Question answered:</em> &quot;Are my external integrations (Lambda, MCP) performing efficiently, or are specific tool calls experiencing high latency, throttling, or system-level errors?&quot;</li>
</ul>
<p><strong>Memory Operations</strong></p>
<hr />
<p>Track core operations like event creation, retrieval, and listing, alongside deep dives into long-term memory processing. This includes extraction and consolidation metrics broken down by strategy type, as well as specific monitoring for throttling and system vs. user errors.</p>
<ul>
<li><em>Question answered:</em> &quot;Are failures in memory consolidation strategies or high retrieval latency preventing the agent from effectively recalling user context?&quot;</li>
</ul>
<p><strong>Identity &amp; Access</strong></p>
<hr />
<p>Monitor identity token fetch operations (workload, OAuth, API keys) and real-time authentication success/failure rates. The dashboard breaks down activity by provider and highlights throttling or capacity bottlenecks.</p>
<ul>
<li><em>Question answered:</em> &quot;Are authentication failures or token fetch bottlenecks from specific providers preventing agents from accessing required resources?&quot;</li>
</ul>
<h3>Out-of-the-Box (OOTB) Alert Templates</h3>
<p>Observability isn't just about looking at dashboards; it's about knowing when to act. To move from reactive checking to proactive monitoring, Elastic provides <strong>OOTB Alert Rule Templates</strong> (starting with Elastic version 9.2.1).</p>
<p>These templates eliminate guesswork by pre-selecting the optimal metrics to monitor and applying sensible thresholds. This configuration focuses on high-fidelity alerts for genuine anomalies, helping you catch critical issues early while minimizing alert fatigue.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/elastic-alert-rule-templates-for-amazon-bedrock-agentcore.jpg" alt="Amazon Bedrock AgentCore out-of-the-box Alert rule templates for AgentCore" /></p>
<p><strong>Suggested OOTB Alerts:</strong></p>
<ul>
<li>
<p><strong>Agent Runtime System Errors:</strong> Detects server-side errors (500 Internal Server Error) during agent runtime invocations, indicating infrastructure or service issues with AWS Bedrock AgentCore.</p>
</li>
<li>
<p><strong>Agent Runtime User Errors:</strong> Flags client-side errors (4xx) during agent runtime invocations, including validation failures (400), resource not found (404), access denied (403), and resource conflicts (409). This helps catch misconfigured permissions, invalid input, or missing resources early.</p>
</li>
<li>
<p><strong>Agent Runtime High Latency:</strong> Triggers when the average latency for agent runtime invocations exceeds 10 seconds (10,000ms). Latency measures the time elapsed between receiving a request and sending the final response token.</p>
</li>
</ul>
<h3>APM Tracing</h3>
<p>While logs and metrics tell you <em>that</em> an issue exists, <strong>APM Tracing</strong> tells you exactly <em>where</em> and <em>why</em> it is happening. By ingesting the OpenTelemetry signals from your instrumented agent, Elastic generates a detailed distributed trace (e.g. waterfall view) for every interaction. For further LLM details such as prompts, responses, and token usage, you can explore the APM logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/otel-native-strands-agent-traces-in-elastic-apm.jpg" alt="Amazon Bedrock AgentCore OTel-native distributed tracing waterfall diagram in Elastic APM" /></p>
<p>This allows you to peer inside the &quot;black box&quot; of the agent's execution flow:</p>
<ul>
<li>
<p><strong>Visualize the Chain of Thought:</strong> See the full sequence of events, from the user's initial prompt to the final response, including all intermediate reasoning steps.</p>
</li>
<li>
<p><strong>Pinpoint Tool Failures:</strong> Identify exactly which external tool (e.g., a Lambda function for flight booking or a knowledge base query) failed or timed out.</p>
</li>
<li>
<p><strong>Analyze Latency Contributors:</strong> Distinguish between latency caused by the LLM's generation time versus latency caused by slow downstream API calls.</p>
</li>
<li>
<p><strong>Debug with Context:</strong> Drill down into individual spans to see specific error messages, attributes, and metadata that explain why a particular step failed.</p>
</li>
</ul>
<h3>Conclusion</h3>
<p>As organizations move from experimental chatbots to complex, autonomous agents in production, the need for robust observability has never been greater. Agentic applications introduce new layers of complexity—non-deterministic behaviors, multi-step reasoning loops, and cost implications—that standard monitoring tools simply cannot see.</p>
<p>Elastic Agentic AI Observability for Amazon Bedrock AgentCore bridges this gap. By unifying platform-level health metrics from AgentCore with deep, transaction-level distributed tracing from OpenTelemetry, Elastic gives SREs and developers the complete picture. Whether you are debugging a failed tool call, optimizing latency, or controlling token costs, you have the visibility needed to run agentic AI with confidence.</p>
<p><strong>Complete Visibility: AgentCore + Amazon Bedrock:</strong> For the most comprehensive view, we recommend onboarding Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock"><strong>Amazon Bedrock</strong> integration</a> alongside AgentCore. While the AgentCore integration focuses on the orchestration layer—monitoring agent errors, tool latency, and invocations—the Bedrock integration provides deep visibility into the underlying foundation models themselves. This includes tracking model-specific latency, token usage, full prompts and responses, and even <strong>Guardrails</strong> usage and effectiveness. By combining both, you ensure complete coverage from the high-level agent workflow down to the raw model inference.</p>
<ul>
<li>
<p><strong>Read more:</strong><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock"> Monitor Amazon Bedrock with Elastic</a></p>
</li>
<li>
<p><strong>Read more:</strong><a href="https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails"> Amazon Bedrock Guardrails Observability</a></p>
</li>
</ul>
<p><strong>Get Started Today:</strong> Ready to see your agents in action?</p>
<ul>
<li>
<p><strong>Try it out:</strong> Log in to <a href="https://cloud.elastic.co/login">Elastic Cloud</a> and add the Amazon Bedrock AgentCore integration. Or use <a href="https://aws.amazon.com/marketplace/seller-profile?id=d8f59038-c24c-4a9d-a66d-6711d35d7305">Elastic from AWS Marketplace</a>.</p>
</li>
<li>
<p><strong>Explore the Code:</strong> Check out our GitHub repository for the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/travel_assistant">Travel assistant</a> which you saw in this blog, as well as the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/elastic_agentcore_sre_agent">AgentCore SRE Agent</a>.</p>
</li>
<li>
<p><strong>Learn More:</strong> Read the <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock_agentcore">full documentation</a> on setting up integration for Agentic AI Observability for Amazon Bedrock AgentCore.</p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/agentcore-blog.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[LLM observability with Elastic: Taming the LLM with Guardrails for Amazon Bedrock]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails</link>
            <guid isPermaLink="false">llm-observability-amazon-bedrock-guardrails</guid>
            <pubDate>Sun, 02 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic’s enhanced Amazon Bedrock integration for Observability now includes Guardrails monitoring, offering real-time visibility into AI safety mechanisms. Track guardrail performance, usage, and policy interventions with pre-built dashboards. Learn how to set up observability for Guardrails and monitor key signals to strengthen safeguards against hallucinations, harmful content, and policy violations.]]></description>
<content:encoded><![CDATA[<p>In a previous <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a>, we showed you how to set up observability for models hosted on Amazon Bedrock using Elastic’s integration. You can now effortlessly enable observability for your Amazon Bedrock guardrails using the enhanced <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Elastic Amazon Bedrock integration</a>. If you previously onboarded the Amazon Bedrock integration, just upgrade it and you will automatically get all guardrails-related updates. The enhanced integration provides a single pane of glass dashboard with two panels: one focused on overall Bedrock visualizations and another dedicated to Guardrails. You can now ingest and visualize metrics and logs specific to Guardrails, such as guardrail invocation count, invocation latency, text unit utilization, guardrail policy types associated with interventions, and more.</p>
<p>In this blog, we will show you how to set up observability for Amazon Bedrock Guardrails, how to make use of the enhanced dashboards, and which key signals to alert on for effective observability coverage of your Bedrock guardrails.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, please make sure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.16.2 or higher. Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. See <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">details in our documentation</a>.</li>
</ul>
<h2>Steps to create a guardrail for Amazon Bedrock</h2>
<p>Before you set up observability for the guardrails, ensure that you have configured guardrails for your model. Follow the steps below to create an Amazon Bedrock guardrail:</p>
<ol>
<li><strong>Access the Amazon Bedrock Console</strong>
<ul>
<li>Sign in to the AWS Management Console with appropriate permissions and navigate to the Amazon Bedrock console.</li>
</ul>
</li>
<li><strong>Navigate to Guardrails</strong>
<ul>
<li>From the left-hand menu, select <strong>Guardrails</strong>.</li>
</ul>
</li>
<li><strong>Create a New Guardrail</strong>
<ul>
<li>Select <strong>Create guardrail</strong>.</li>
<li>Provide a descriptive name, an optional brief description, and specify a message to display when the guardrail blocks the user prompt.
<ul>
<li>Example: <em>Sorry, I am not configured to answer such questions. Kindly ask a different question.</em></li>
</ul>
</li>
</ul>
</li>
<li><strong>Configure Guardrail Policies</strong>
<ul>
<li><strong>Content Filters</strong>: Adjust settings to block harmful content and prompt attacks.</li>
<li><strong>Denied Topics</strong>: Specify topics to block.</li>
<li><strong>Word Filters</strong>: Define specific words or phrases to block.</li>
<li><strong>Sensitive Information Filters</strong>: Set up filters to detect and remove sensitive information.</li>
<li><strong>Contextual Grounding</strong>:
<ul>
<li>Configure the <strong>Grounding Threshold</strong> to set the minimum confidence level for factual accuracy.</li>
<li>Set the <strong>Relevance Threshold</strong> to ensure responses align with user queries.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Review and Create</strong>
<ul>
<li>Review your settings and select <strong>Create</strong> to finalize the guardrail.</li>
</ul>
</li>
<li><strong>Create a Guardrail Version</strong>
<ul>
<li>In the <strong>Version</strong> section, select <strong>Create</strong>.</li>
<li>Optionally add a description, then select <strong>Create Version</strong>.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-policy-configuration.png" alt="Amazon Bedrock Guardrails Policy Configurations" /></p>
<p>After creating a version of your guardrail, it's important to note down the <strong>Guardrail ID</strong> and the <strong>Guardrail Version Name</strong>. These identifiers are essential when integrating the guardrail into your application, as you'll need to specify them during guardrail invocation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-creation-confirmations.png" alt="Amazon Bedrock Guardrails Policy Version" /></p>
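<p>The console steps above can also be scripted. The sketch below is a hedged example built around the boto3 Bedrock control-plane client: <code>create_guardrail</code> and <code>create_guardrail_version</code> are real boto3 APIs, but the guardrail name, description, and thresholds used here are illustrative placeholders.</p>

```python
def build_guardrail_request(name, blocked_message,
                            grounding_threshold=0.99, relevance_threshold=0.99):
    # Assemble kwargs for bedrock.create_guardrail with a contextual
    # grounding policy; the thresholds here are illustrative.
    return {
        "name": name,
        "description": "Guardrail with contextual grounding checks",
        "blockedInputMessaging": blocked_message,
        "blockedOutputsMessaging": blocked_message,
        "contextualGroundingPolicyConfig": {
            "filtersConfig": [
                {"type": "GROUNDING", "threshold": grounding_threshold},
                {"type": "RELEVANCE", "threshold": relevance_threshold},
            ]
        },
    }


def create_guardrail_with_version(bedrock_client, request):
    # bedrock_client is a boto3.client("bedrock") instance. Create the
    # guardrail, publish a first version, and return the two identifiers
    # needed when invoking a model with the guardrail applied.
    resp = bedrock_client.create_guardrail(**request)
    version = bedrock_client.create_guardrail_version(
        guardrailIdentifier=resp["guardrailId"])
    return resp["guardrailId"], version["version"]
```

<p>Passing <code>boto3.client(&quot;bedrock&quot;)</code> and the built request to <code>create_guardrail_with_version</code> returns the Guardrail ID and version name you would otherwise note down from the console.</p>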
<h2>Example code to integrate with Amazon Bedrock guardrails</h2>
<p>Integrating Amazon Bedrock into your Python application enables advanced language model interactions with customizable safety measures. By configuring guardrails, you can ensure that the model adheres to predefined policies, preventing it from generating inappropriate or sensitive content.</p>
<p>The following code demonstrates how to invoke Amazon Bedrock with guardrails to enforce contextual grounding in AI-generated responses. It sets up a Bedrock runtime client using AWS credentials, defines a reference grounding statement, and uses the Bedrock Converse API to process user queries with contextual constraints. The <strong>converse_with_guardrails</strong> function sends a user query alongside a predefined grounding reference, ensuring that responses align with the provided knowledge source.</p>
<h3>Setting Up Environment Variables</h3>
<p>Before running the script, configure the required <strong>AWS credentials</strong> and <strong>guardrail settings</strong> as environment variables. These variables allow the script to authenticate with Amazon Bedrock and apply the necessary guardrails for safe and controlled AI interactions.</p>
<p>Create a <strong>.env</strong> file in the same directory as your script and add:</p>
<pre><code class="language-bash">AWS_ACCESS_KEY=&quot;your-access-key&quot;
AWS_SECRET_KEY=&quot;your-secret-key&quot;
AWS_REGION=&quot;your-aws-region&quot;
GUARDRAIL_ID=&quot;your-guardrail-id&quot;
GUARDRAIL_VERSION=&quot;your-guardrail-version&quot;
CHAT_MODEL=&quot;your-model-id&quot;
</code></pre>
<h3>Create a Python script and run</h3>
<p>Create a Python script using the code below and execute it to interact with the Amazon Bedrock Guardrails you set up.</p>
<pre><code class="language-python">import os
import boto3
from dotenv import load_dotenv
import json
from botocore.exceptions import ClientError

# Load environment variables
load_dotenv()

# Function to check for hallucinations using contextual grounding
def check_hallucination(response):
    output_assessments = response.get(&quot;trace&quot;, {}).get(&quot;guardrail&quot;, {}).get(&quot;outputAssessments&quot;, {})

    # Default to passing scores so a missing contextual policy is not flagged
    grounding = relevance = 0.0
    grounding_threshold = relevance_threshold = 0.0

    # Iterate over all assessments and collect the contextual grounding filters
    for assessments in output_assessments.values():
        for assessment in assessments:
            contextual_policy = assessment.get(&quot;contextualGroundingPolicy&quot;, {})
            for filter_result in contextual_policy.get(&quot;filters&quot;, []):
                filter_type = filter_result.get(&quot;type&quot;)
                if filter_type == &quot;RELEVANCE&quot;:
                    relevance = filter_result.get(&quot;score&quot;, 0)
                    relevance_threshold = filter_result.get(&quot;threshold&quot;, 0)
                elif filter_type == &quot;GROUNDING&quot;:
                    grounding = filter_result.get(&quot;score&quot;, 0)
                    grounding_threshold = filter_result.get(&quot;threshold&quot;, 0)

    # Hallucination detected when either score falls below its threshold
    is_hallucination = relevance &lt; relevance_threshold or grounding &lt; grounding_threshold
    return is_hallucination, relevance, grounding, relevance_threshold, grounding_threshold

def converse_with_guardrails(bedrock_client, messages, grounding_reference):
   message = [
       {
           &quot;role&quot;: &quot;user&quot;,
           &quot;content&quot;: [
               {
                   &quot;guardContent&quot;: {
                       &quot;text&quot;: {
                           &quot;text&quot;: grounding_reference,
                           &quot;qualifiers&quot;: [&quot;grounding_source&quot;],
                       }
                   }
               },
               {
                   &quot;guardContent&quot;: {
                       &quot;text&quot;: {
                           &quot;text&quot;: messages,
                           &quot;qualifiers&quot;: [&quot;query&quot;],
                       }
                   }
               },
           ],
       }
   ]
   converse_config = {
       &quot;modelId&quot;: os.getenv('CHAT_MODEL'),
       &quot;messages&quot;: message,
       &quot;guardrailConfig&quot;: {
           &quot;guardrailIdentifier&quot;: os.getenv(&quot;GUARDRAIL_ID&quot;),
           &quot;guardrailVersion&quot;: os.getenv(&quot;GUARDRAIL_VERSION&quot;),
           &quot;trace&quot;: &quot;enabled&quot;
       },
       &quot;inferenceConfig&quot;: {
           &quot;temperature&quot;: 0.5       
       },
   }
   try:
       response = bedrock_client.converse(**converse_config)
       return response
   except ClientError as e:
       error_message = e.response['Error']['Message']
       print(f&quot;An error occurred: {error_message}&quot;)
       print(&quot;Converse config:&quot;)
       print(json.dumps(converse_config, indent=2))
       return None
  
def pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold):
   print(&quot;\n&quot; + &quot;=&quot;*60)
   print(&quot; Guardrail Assessment&quot;)
   print(&quot;=&quot;*60)
   # Extract response message safely
   response_text = response.get(&quot;output&quot;, {}).get(&quot;message&quot;, {}).get(&quot;content&quot;, [{}])[0].get(&quot;text&quot;, &quot;N/A&quot;)
   print(&quot;\n **Model Response:**&quot;)
   print(f&quot;   {response_text}&quot;)
   print(&quot;\n **Guardrail Assessment:**&quot;)
   print(f&quot;   Is Hallucination : {is_hallucination}&quot;)
   print(&quot;\n **Contextual Grounding Policy Scores:**&quot;)
   print(f&quot;   - Relevance Score : {relevance:.2f} (Threshold: {relevance_threshold:.2f})&quot;)
   print(f&quot;   - Grounding Score : {grounding:.2f} (Threshold: {grounding_threshold:.2f})&quot;)
   print(&quot;\n&quot; + &quot;=&quot;*60 + &quot;\n&quot;)
  
def main():
   bs = boto3.Session(
       aws_access_key_id=os.getenv('AWS_ACCESS_KEY'),
       aws_secret_access_key=os.getenv('AWS_SECRET_KEY'),
       region_name=os.getenv('AWS_REGION')
   )

   # Initialize Bedrock client
   bedrock_client = bs.client(&quot;bedrock-runtime&quot;)

   # Grounding reference
   grounding_reference = &quot;The Wright brothers made the first powered aircraft flight on December 17, 1903.&quot;

   # User query
   user_query = &quot;Who were the first to fly an airplane?&quot;
  
   # Get model response
   response = converse_with_guardrails(bedrock_client, user_query, grounding_reference)
   if response is None:
       return  # invocation failed; the error was already printed

   # Check for hallucinations
   is_hallucination, relevance, grounding, relevance_threshold, grounding_threshold = check_hallucination(response)

   # Print the results
   pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold)


if __name__ == &quot;__main__&quot;:
   main()
</code></pre>
<h3>Identifying Hallucinations with Contextual Grounding</h3>
<p>The contextual grounding feature proved effective in identifying potential hallucinations by comparing model responses against reference information. Relevance and grounding scores provided quantitative measures to assess the accuracy of model outputs.</p>
<p>The output from running the Python script, shown below, demonstrates how the <strong>Grounding Score</strong> helps detect hallucinations:</p>
<pre><code>============================================================
 Guardrail Assessment
============================================================

 **Model Response:**
   Sorry, I am not configured to answer such questions. Kindly ask a different question.

 **Guardrail Assessment:**
   Is Hallucination : True

 **Contextual Grounding Policy Scores:**
   - Relevance Score : 1.00 (Threshold: 0.99)
   - Grounding Score : 0.03 (Threshold: 0.99)

============================================================
</code></pre>
<p>Here, the <strong>Grounding Score</strong> of <strong>0.03</strong> is significantly lower than the configured threshold of <strong>0.99</strong>, indicating that the response lacks factual accuracy. Since the score falls below the threshold, the system flags the response as a hallucination, highlighting the need to monitor guardrail outputs to ensure AI safety.</p>
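<p>The decision rule at work here can be stated in one line: a response passes only when every contextual grounding filter's confidence score meets its configured threshold. A minimal sketch of that rule (a simplification for illustration, not the guardrail's internal logic):</p>

```python
def passes_grounding(filters):
    # filters: list of {"type", "score", "threshold"} dicts, as reported
    # in a guardrail trace. The response is considered valid only if
    # every filter's score is at or above its configured threshold.
    return all(f["score"] >= f["threshold"] for f in filters)
```

<p>For the trace above, the relevance filter passes (1.00 &gt;= 0.99) but the grounding filter fails (0.03 &lt; 0.99), so the response is flagged.</p>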
<h2>Configuring Amazon Bedrock Guardrails Metrics &amp; Logs Collection</h2>
<p>Elastic makes it easy to collect both logs and metrics from Amazon Bedrock Guardrails using the Amazon Bedrock integration. By default, Elastic provides a curated set of logs and metrics, but you can customize the configuration based on your needs. The integration supports Amazon S3 and Amazon CloudWatch Logs for log collection, along with metrics collection from your chosen AWS region at a specified interval.</p>
<p>Follow these steps to enable the collection of metrics and logs:</p>
<ol>
<li>
<p><strong>Navigate to Amazon Bedrock Settings</strong> - In the AWS Console, go to <strong>Amazon Bedrock</strong> and open the <strong>Settings</strong> section.</p>
</li>
<li>
<p><strong>Choose Logging Destination</strong> - Select whether to send logs to <strong>Amazon S3</strong> or <strong>Amazon CloudWatch Logs</strong>.</p>
</li>
<li>
<p><strong>Provide Required Details</strong></p>
<ul>
<li><strong>If using Amazon S3</strong>, logs can be collected from objects referenced in <strong>S3 notification events</strong> (read from an SQS queue) or by <strong>direct polling</strong> from an S3 bucket.</li>
<li><strong>If using CloudWatch Logs</strong>, you need to create a <strong>CloudWatch log group</strong> and note its <strong>ARN</strong>, as this will be required for configuring both <strong>Amazon Bedrock</strong> and the <strong>Elastic Amazon Bedrock integration</strong>.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-settings-configuraiton.png" alt="Amazon Bedrock settings" /></p>
<ol start="4">
<li><strong>Configure Elastic's Amazon Bedrock integration</strong> - In <strong>Elastic</strong>, set up the <strong>Amazon Bedrock integration</strong>, ensuring the logging destination matches the one configured in <strong>Amazon Bedrock</strong>. Logs from your selected source and metrics from your AWS region will be collected automatically.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-integrations-logs-config.png" alt="Amazon Bedrock integration logs configuration" /></p>
<ol start="5">
<li><strong>Accept Defaults or Customize Settings</strong> - Elastic provides a default configuration for logs and metrics collection. You can accept these defaults or adjust settings such as collection intervals to better fit your needs.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-integrations-metrics-config.png" alt="Amazon Bedrock integration guardrails metrics configuration" /></p>
<h2>Understanding the pre-configured dashboard for Amazon Bedrock Guardrails</h2>
<p>You can access the Amazon Bedrock Guardrails dashboard using either of the following methods:</p>
<ol>
<li>
<p><strong>Navigate to the Dashboard Menu</strong>  - Select the <strong>Dashboard</strong> menu option in <strong>Elastic</strong> and search for <strong>[Amazon Bedrock] Guardrails</strong> to open the dashboard.</p>
</li>
<li>
<p><strong>Navigate to the Integrations Menu</strong>  - Open the <strong>Integrations</strong> menu in <strong>Elastic</strong>, select <strong>Amazon Bedrock</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Amazon Bedrock] Guardrails</strong> from the dashboard assets.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-guardrails-overview-dashboard.png" alt="Amazon Bedrock settings" /></p>
<p>The Amazon Bedrock Guardrails dashboard in the Elastic integration provides insights into guardrail performance, tracking total invocations, API latency, text unit usage, and intervention rates. It analyzes policy-based interventions, highlighting trends, text consumption, and frequently triggered policies. The dashboard also showcases instances where guardrails modified or blocked responses and offers a detailed breakdown of invocations by policy and content source.</p>
<h3>Guardrail invocation overview</h3>
<p>This dashboard section provides a comprehensive summary of key metrics related to guardrail performance and usage:</p>
<ul>
<li><strong>Total guardrails API invocations</strong>: Displays the overall count of times guardrails were invoked.</li>
<li><strong>Average Guardrails API invocation latency</strong>: Shows the average response time for guardrail API calls, offering insights into system performance.</li>
<li><strong>Total text unit utilization</strong>: Indicates the volume of text processed during guardrail invocations. For text unit pricing, refer to the Amazon Bedrock pricing page.</li>
<li><strong>Invocations - with and without guardrail interventions</strong>: A pie chart representation showing the distribution of LLM invocations based on guardrail activity. It displays the count of invocations where no guardrail interventions occurred, those where guardrails intervened and detected policy violations, and those where guardrails intervened but found no violations.</li>
</ul>
<p>These metrics help users evaluate guardrail effectiveness, track intervention patterns, and optimize configurations to ensure policy enforcement while maintaining system performance.</p>
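<p>Beyond the dashboard, the same signals can be queried ad hoc. The ES|QL sketch below is an assumption: the <code>metrics-aws_bedrock*</code> index pattern follows the integration's data stream naming, and the latency field name is taken from the alerting example later in this post; verify both against your deployment.</p>

```esql
FROM metrics-aws_bedrock*
| WHERE aws_bedrock.guardrails.invocation_latency IS NOT NULL
| STATS invocations = COUNT(*),
        avg_latency = AVG(aws_bedrock.guardrails.invocation_latency)
```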
<h3>Guardrail policy types for interventions</h3>
<p>This section provides a comprehensive view of guardrail policy interventions and their impact:</p>
<ul>
<li><strong>Interventions by Policy Type</strong>: Bar charts display the number of interventions applied to user inputs and model outputs, categorized by policy type (e.g., Contextual Grounding Policy, Word Policy, Content Policy, Sensitive Information Policy, Topic Policy).</li>
<li><strong>Text Unit Utilization by Policy Type</strong>: Panels highlight the text units consumed by various policy interventions, separately for user inputs and model outputs.</li>
<li><strong>Policy Usage Trends</strong>: A word cloud visualization reveals the most frequently applied policy types, offering insights into intervention patterns.</li>
</ul>
<p>By analyzing intervention counts, text unit usage, and policy trends, users can identify frequently triggered policies, optimize guardrail settings, and ensure LLM interactions align with compliance and safety requirements.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-guardrails-overview.png" alt="Amazon Bedrock Guardrails dashboard overview and policy types sections" /></p>
<h3>Prompt and response where guardrails intervened</h3>
<p>This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding guardrail response. The text panel presents the prompt alongside the model's response after applying guardrail interventions. These interventions occur when input evaluation or model responses violate configured policies, leading to blocked or masked outputs.</p>
<p>The section also includes additional details to enhance visibility into how guardrails operate. It indicates whether a violation was detected, along with the violation type (e.g., <strong>GROUNDING</strong>, <strong>RELEVANCE</strong>) and the action taken (<strong>BLOCKED</strong>, <strong>NONE</strong>). For contextual grounding, the dashboard also shows the filter threshold, which defines the minimum confidence level required for a response to be considered valid, and the <strong>confidence score</strong>, which reflects how well the response aligns with the expected criteria.</p>
<p>By analyzing violations, actions taken, and confidence scores, users can adjust guardrail thresholds to balance blocking unsafe responses and allowing valid ones, ensuring optimal accuracy and compliance. This process is particularly crucial for detecting and mitigating hallucinations—instances where models generate information not grounded in source data. Implementing contextual grounding checks enables the identification of such ungrounded or irrelevant content, enhancing the reliability of applications like retrieval-augmented generation (RAG).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-intervened-logs.png" alt="Amazon Bedrock Guardrails logs where guardrails intervened" /></p>
<h3>Guardrail invocation by guardrail policy</h3>
<p>This section offers insights into the number of Guardrails API invocations, the overall latency, and the total text units, categorized by guardrail policy (identified by guardrail ARN) and policy version.</p>
<h3>Guardrail invocation by content source (Input &amp; Output)</h3>
<p>This section provides a detailed overview of critical metrics related to guardrail performance and usage. It includes the total number of guardrail invocations, the count of intervention invocations where policies were applied, the volume of text units consumed during these interventions for both user inputs and model outputs and the average guardrail API invocation latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-invocationby-policy-contentsource.png" alt="Amazon Bedrock Guardrails invocation by policy and content source" /></p>
<p>These insights help users understand how guardrails operate across different policies and content sources. By analyzing invocation counts, latency, and text unit consumption, users can assess policy effectiveness, track intervention patterns, and optimize configurations. Evaluating how guardrails interact with user inputs and model outputs ensures consistent enforcement, helping refine thresholds and improve compliance strategies.</p>
<h2>Configure SLOs and Alerts</h2>
<p>To create an SLO for monitoring <strong>contextual grounding accuracy</strong>, define a custom query SLI where <strong>good events</strong> are model responses that meet contextual grounding criteria, ensuring factual accuracy and alignment with the provided reference.</p>
<p>A suitable query for tracking good events is:</p>
<pre><code>gen_ai.prompt : &quot;*qualifiers[\\\&quot;grounding_source\\\&quot;]*&quot; and 
(gen_ai.compliance.violation_detected : false or 
not gen_ai.compliance.violation_detected : *)
</code></pre>
<p>The total query, which considers all relevant interactions that include a contextual grounding check, is:</p>
<pre><code>gen_ai.prompt : &quot;*qualifiers[\\\&quot;grounding_source\\\&quot;]*&quot;
</code></pre>
<p>Set an <strong>SLO target of 99.5%</strong>, ensuring that the vast majority of responses remain factually grounded. This helps detect hallucinations and misaligned outputs in real-time. By continuously monitoring contextual grounding accuracy, you can proactively address inconsistencies, retrain models, or refine RAG pipelines before inaccuracies impact end users.</p>
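<p>If you manage SLOs as code, the same definition can be expressed as a payload for Kibana's SLO API (<code>POST /api/observability/slos</code>). This is a hedged sketch: the index pattern is an assumption based on the integration's log data stream, the KQL filters are simplified wildcard variants of the queries above, and the exact schema should be verified against your stack version.</p>

```json
{
  "name": "Contextual grounding accuracy",
  "description": "Model responses that pass the contextual grounding check",
  "indicator": {
    "type": "sli.kql.custom",
    "params": {
      "index": "logs-aws_bedrock.invocation-*",
      "good": "gen_ai.prompt : *grounding_source* and (gen_ai.compliance.violation_detected : false or not gen_ai.compliance.violation_detected : *)",
      "total": "gen_ai.prompt : *grounding_source*",
      "timestampField": "@timestamp"
    }
  },
  "timeWindow": { "duration": "30d", "type": "rolling" },
  "budgetingMethod": "occurrences",
  "objective": { "target": 0.995 }
}
```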
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/slo-configurations.png" alt="SLO settings for Guardrails metrics" /></p>
<p>Elastic's alerting capabilities enable proactive monitoring of key performance metrics. For instance, by setting up an alert on the <strong>average aws_bedrock.guardrails.invocation_latency</strong> with a <strong>500ms</strong> threshold, you can promptly identify and address performance bottlenecks, ensuring that policy enforcement remains efficient without causing unexpected delays.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/alert-configurations.png" alt="Alert settings for Guardrails metrics" /></p>
<h2>Conclusion</h2>
<p>The Elastic Amazon Bedrock integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Amazon Bedrock including Guardrails. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.</p>
<p>If you haven’t already done so, read our previous <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a> on what you can do with the Amazon Bedrock integration, set up guardrails for your Bedrock models, and enable the <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Bedrock integration</a> to start observing your Bedrock models and guardrails today!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/llm-observability-aws-bedrock-illustration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with the new Amazon Bedrock Integration in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock</link>
            <guid isPermaLink="false">llm-observability-aws-bedrock</guid>
            <pubDate>Mon, 25 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's new Amazon Bedrock integration for Observability provides comprehensive insights into Amazon Bedrock LLM performance and usage. Learn about how LLM based metric and log collection in real-time with pre-built dashboards can effectively monitor and resolve LLM invocation errors and performance challenges.]]></description>
            <content:encoded><![CDATA[<p>As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>, while minimizing downtime and keeping costs in check.</p>
<p>Elastic is expanding support for LLM Observability with Elastic Observability's new <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock">Amazon Bedrock integration</a>. This new observability integration provides you with comprehensive visibility into the performance and usage of foundation models from Amazon and other leading AI companies, available through Amazon Bedrock. The integration offers an out-of-the-box experience by simplifying the collection of Amazon Bedrock metrics and logs, making it easier to gain actionable insights and effectively manage your models. It is simple to set up and comes with pre-built dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that use Amazon Bedrock.</p>
<p>This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, please make sure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.13 or higher.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
</ul>
<h2>Configuring Amazon Bedrock Logs Collection</h2>
<p>To collect Amazon Bedrock logs, you can choose from the following options:</p>
<ol>
<li>Amazon Simple Storage Service (Amazon S3) bucket</li>
<li>Amazon CloudWatch logs</li>
</ol>
<p><strong>S3 Bucket Logs Collection</strong>: When collecting logs from the Amazon S3 bucket, you can retrieve logs from Amazon S3 objects pointed to by Amazon S3 notification events, which are read from an SQS queue, or by directly polling a list of Amazon S3 objects in an Amazon S3 bucket. Refer to Elastic’s <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Custom AWS Logs</a> integration for more details.</p>
<p><strong>CloudWatch Logs Collection</strong>: In this option, you will need to create a <a href="https://console.aws.amazon.com/cloudwatch/">CloudWatch log group</a>. After creating the log group, be sure to note down the ARN of the newly created log group, as you will need it for the Amazon Bedrock settings configuration and Amazon Bedrock integration configuration for logs.</p>
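<p>Creating the log group and retrieving its ARN can also be scripted. In the sketch below, <code>create_log_group</code> and <code>describe_log_groups</code> are real CloudWatch Logs APIs in boto3, but the default log group name is an illustrative placeholder; pass in a <code>boto3.client(&quot;logs&quot;)</code> instance.</p>

```python
def ensure_log_group_arn(logs_client, name="/aws/bedrock/modelinvocations"):
    # logs_client is a boto3.client("logs") instance; the group name is a
    # placeholder. Create the log group if it does not already exist,
    # then look up and return its ARN.
    try:
        logs_client.create_log_group(logGroupName=name)
    except logs_client.exceptions.ResourceAlreadyExistsException:
        pass  # already present; reuse it
    groups = logs_client.describe_log_groups(logGroupNamePrefix=name)["logGroups"]
    return next(g["arn"] for g in groups if g["logGroupName"] == name)
```

<p>The returned ARN is the value you would paste into both the Amazon Bedrock settings and the Elastic integration configuration.</p>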
<p>Configure the Amazon Bedrock CloudWatch logs with the Log group ARN to start collecting CloudWatch logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/cloudwatch-logs-configuration.png" alt="" /></p>
<p>Visit the <a href="https://aws.amazon.com/console/">AWS Console</a>, navigate to the &quot;Settings&quot; section under <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>, and select your preferred method of collecting logs. Based on the Logging Destination you select in the Amazon Bedrock settings, you will need to enter either the Amazon S3 location or the CloudWatch log group ARN.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-logs-configuration.png" alt="" /></p>
<h2>Configuring Amazon Bedrock Metrics Collection</h2>
<p>Configure Elastic's Amazon Bedrock integration to collect Amazon Bedrock metrics from your chosen AWS region at the specified collection interval.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/cloudwatch-metrics-configuration.png" alt="" /></p>
<h2>Maximize Visibility with Out-of-the-Box Dashboards</h2>
<p>The Amazon Bedrock integration offers rich out-of-the-box visibility into the performance and usage of models in Amazon Bedrock, including text and image models. The <strong>Amazon Bedrock Overview</strong> dashboard provides a summarized view of invocations, errors, and latency across various models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-metric-summary.png" alt="" /></p>
<p>The <strong>Text / Chat metrics</strong> section in the <strong>Amazon Bedrock Overview</strong> dashboard provides insights into token usage for Text models in Amazon Bedrock. This includes use cases such as text content generation, summarization, translation, code generation, question answering, and sentiment analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-text-metrics.png" alt="" /></p>
<p>The <strong>Image metrics</strong> section in the <strong>Amazon Bedrock Overview</strong> dashboard offers valuable insights into the usage of Image models in Amazon Bedrock.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-image-metrics.png" alt="" /></p>
<p>The <strong>Logs</strong> section of the <strong>Amazon Bedrock Overview</strong> dashboard in Elastic provides detailed insights into the usage and performance of LLM requests. It enables you to monitor key details such as model name, version, LLM prompt and response, usage tokens, request size, completion tokens, response size, and any error codes tied to specific LLM requests.</p>
<p>The detailed logs provide full visibility into raw model interactions, capturing both the inputs (prompts) and the outputs (responses) generated by the models. This transparency enables you to analyze and optimize how your LLM handles different requests, allowing for more precise fine-tuning of both the prompt structure and the resulting model responses. By closely monitoring these interactions, you can refine prompt strategies and enhance the quality and reliability of model outputs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-logs-details.png" alt="" /></p>
<p>The <strong>Amazon Bedrock Overview</strong> dashboard also provides a comprehensive view of the initial and final response times. It includes a percentage comparison graph that highlights the performance differences between these response stages, enabling you to quickly identify efficiency improvements or potential bottlenecks in your LLM interactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-performance.png" alt="" /></p>
<h2>Creating Alerts and SLOs to Monitor Amazon Bedrock</h2>
<p>As with any Elastic integration, Amazon Bedrock <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock#collecting-bedrock-model-invocation-logs-from-s3-bucket">logs</a> and <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock#metrics">metrics</a> are fully integrated into Elastic Observability, allowing you to leverage features like SLOs, alerting, custom dashboards, and detailed logs exploration.</p>
<p>To create an alert, for example to monitor LLM invocation latency in Amazon Bedrock, you can apply a Custom Threshold rule on the Amazon Bedrock datastream. Set the rule to trigger an alert when the LLM invocation latency exceeds a defined threshold. This ensures proactive monitoring of model performance, allowing you to detect and address latency issues before they impact the user experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-alert-invocation-latency.png" alt="" /></p>
<p>When a violation occurs, the Alert Details view linked in the notification provides detailed context, including when the issue began, its current status, and any history of similar violations. This rich information enables rapid triaging, investigation, and root cause analysis to resolve issues efficiently.</p>
<p>Similarly, to create an SLO monitoring Amazon Bedrock invocation performance, you can define a custom query SLI where good events are Amazon Bedrock invocations that complete without client or server errors and with latency under 10 seconds. Set an appropriate SLO target, such as 99%. This helps you identify errors and latency issues in applications using LLMs, so you can take timely corrective action before they affect the overall user experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-slo-configuration.png" alt="" /></p>
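<p>As a concrete sketch of this SLI definition: given a set of invocation records, a good event is one with no client or server errors and latency under 10 seconds, and the SLI is the fraction of good events. The field names below (<code>client_errors</code>, <code>server_errors</code>, <code>invocation_latency_ms</code>) are illustrative assumptions, not the integration's actual field names:</p>

```python
def is_good_event(invocation: dict, latency_threshold_ms: float = 10_000) -> bool:
    """A 'good' Bedrock invocation: no client or server error,
    and invocation latency under the threshold (10 s here)."""
    return (
        invocation.get("client_errors", 0) == 0
        and invocation.get("server_errors", 0) == 0
        and invocation.get("invocation_latency_ms", 0) < latency_threshold_ms
    )

def sli(invocations: list[dict]) -> float:
    """Observed SLI: the fraction of good events."""
    if not invocations:
        return 1.0
    return sum(is_good_event(i) for i in invocations) / len(invocations)
```

<p>With a 99% SLO target, the error budget is the 1% of invocations allowed to be slow or failed over the SLO window.</p>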
<p>The image below highlights the SLOs, SLIs, and the remaining error budget for Amazon Bedrock models. The observed violations are the result of deliberately crafted long text-generation prompts, which led to extended response times. This example demonstrates how the system tracks performance against defined targets, helping you quickly identify latency issues and performance bottlenecks. By monitoring these metrics, you gain valuable insights for proactive issue triaging, allowing for timely corrective actions and an improved user experience for applications using LLMs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-slo-rundata.png" alt="" /></p>
<h2>Try it out today</h2>
<p>The Amazon Bedrock playgrounds provide a console environment to experiment with running inference on different models and configurations before deciding to use them in an application. Start your own 7-day free trial by signing up via AWS Marketplace and quickly spin up a deployment in minutes on any of the Elastic Cloud regions on AWS around the world.</p>
<p>Deploy a cluster on our <a href="https://www.elastic.co/cloud/elasticsearch-service">Elasticsearch Service</a>, <a href="https://www.elastic.co/downloads/">download</a> the Elasticsearch stack, or run <a href="https://aws.amazon.com/marketplace/seller-profile?id=d8f59038-c24c-4a9d-a66d-6711d35d7305">Elastic from AWS Marketplace</a>, then spin up the new technical preview of the Amazon Bedrock integration, open the curated dashboards in Kibana, and start monitoring your Amazon Bedrock service!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/LLM-observability-AWS-Bedrock.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic’s Azure AI Foundry Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-ai-foundry</link>
            <guid isPermaLink="false">llm-observability-azure-ai-foundry</guid>
            <pubDate>Fri, 25 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Gain comprehensive visibility into your generative AI workloads on Azure AI Foundry. Monitor token usage, latency, and cost, while leveraging built-in content filters to ensure safe and compliant application behavior—all with out-of-the-box observability powered by Elastic.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM Observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Azure AI Foundry, while minimizing downtime and keeping costs in check.</p>
<p>Elastic is expanding support for LLM Observability with Elastic Observability's new Azure AI Foundry integration, now available as a tech preview on Elastic Cloud. This new integration provides comprehensive visibility into the performance and usage of foundational models such as <strong>GPT-4, Mistral, and Llama</strong>, along with thousands of others from leading AI companies and from Azure, all available through Azure AI Foundry. The integration offers an out-of-the-box experience by simplifying the collection of metrics and logs, making it easier to gain actionable insights and effectively manage your models. It is simple to set up and comes with pre-built dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that use Azure AI Foundry.</p>
<p>This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Prerequisites</h2>
<p>To get started with the Azure AI Foundry integration, you will need:</p>
<ul>
<li>An account on Elastic Cloud and a deployed stack in Azure (<a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/elastic.ec-azure-pp?ocid=Elastic-Microsoft-Partner-Page-Get-Started">see instructions here</a>). Ensure you are using version 9.0.0 or higher.</li>
<li>An Azure account with permissions to pull the necessary data from Azure and Azure AI Foundry. See details in our <a href="https://www.elastic.co/docs/reference/integrations/azure_ai_foundry">documentation</a>.</li>
</ul>
<h2>Configuring Azure AI Foundry Integration</h2>
<p>To collect logs and metrics from Azure AI Foundry, configure Azure logs and metrics collection as described in the following links:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/docs/reference/integrations/azure_metrics#setup">Configure to receive Azure Metrics</a> - The integration collects Azure AI Foundry metrics directly from the service; make sure you have the client ID, subscription ID, and tenant ID from Azure AI Foundry in order to collect metrics.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_metrics.png" alt="Azure AI Foundry metrics" /></p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/reference/integrations/azure">Configure to receive Azure Logs</a> - in particular, ensure that you <a href="https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-create">configure an Azure event hub</a> so that Elastic can ingest logs. You will need the Azure event hub information when configuring the logs section of the Azure AI Foundry integration.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_logs.png" alt="Azure AI Foundry logs" /></p>
</li>
</ul>
<h2>Maximize Visibility with Out-of-the-box dashboards</h2>
<p>The Azure AI Foundry integration offers rich out-of-the-box visibility into the performance and usage of models in Azure AI Foundry, including text and image models. Several dashboards are currently available, with more coming as the integration goes GA.</p>
<ul>
<li>Azure AI Foundry Overview dashboard - provides a summarized view of invocations, errors, and latency across various models.</li>
<li>Azure AI Foundry Billing dashboard - provides total costs and daily usage costs from Azure Cognitive Services.</li>
<li>Azure AI Foundry Advanced Monitoring dashboard - focuses on logs generated by the Azure AI Foundry service when connected through the API Management service; provides request rate, error rate, model usage, latency, LLM prompt input, and response completion.</li>
</ul>
<p>Each dashboard provides specific insights important to SREs. Here is a quick overview of some of these insights:</p>
<ul>
<li>
<p><strong>Model Usage and Token Trends</strong> – Visualize token consumption and completion counts by model, endpoint, and time window.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_tokens.png" alt="Azure AI Foundry token usage metrics" /></p>
</li>
<li>
<p><strong>Latency Metrics</strong> – Monitor average and percentile latency per prompt, per endpoint, and correlate with prompt types or user IDs.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_model_latency.png" alt="Azure AI Foundry latency metrics" /></p>
</li>
<li>
<p><strong>Cost Estimation</strong> – Estimate API usage cost based on token consumption and model pricing.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_billing.png" alt="Azure AI Foundry cost estimation metrics" /></p>
</li>
<li>
<p><strong>Prompt/Completion Logging</strong> – View prompt-response pairs for debugging and quality monitoring.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_prompt_response.png" alt="Azure AI Foundry prompt/completions metrics" /></p>
</li>
<li>
<p><strong>Content Filtering and Guardrails</strong> – See which prompts or completions are being filtered, and why.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_guardrails.png" alt="Azure AI Foundry guardrails metrics" />
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_prompt_filtered.png" alt="Azure AI Foundry guardrails prompt filtered" /></p>
</li>
</ul>
<p>You can drill into specific users or sessions, slice by model type or region, and export reports for usage reviews or compliance.</p>
<hr />
<h2>Try it out today</h2>
<p>The Azure AI Foundry integration is currently available in Elastic Cloud (both serverless and hosted options). Start a 7-day free trial by signing up for Elastic Cloud directly or through Azure Marketplace.
Alternatively, deploy a cluster on our Elasticsearch Service, download the Elasticsearch stack, or run Elastic from Azure Marketplace, then spin up the new technical preview of the Azure AI Foundry integration, open the curated dashboards in Kibana, and start monitoring your Azure AI Foundry service!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Spend and Content Moderation on Azure OpenAI with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-content-filter</link>
            <guid isPermaLink="false">llm-observability-azure-openai-content-filter</guid>
            <pubDate>Tue, 13 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[We have added further capabilities to the Azure OpenAI GA package, which now offer content filter monitoring and enhancements to the billing insights!]]></description>
<content:encoded><![CDATA[<p>In a previous blog we showed you how to set up observability for your models hosted on Azure OpenAI using Elastic’s integration. We’ve since expanded the integration to include Azure OpenAI content filtering and cost analysis. If you previously onboarded the Azure OpenAI integration, just upgrade it and you will automatically get all of the new features discussed in this blog. The enhanced integration now provides multiple dashboards, including a general Azure OpenAI Overview, an Azure Provisioned Throughput Unit dashboard, an Azure Content Filtering dashboard, and an Azure OpenAI Billing dashboard.</p>
<p>In this blog we will cover how to use Azure OpenAI content filtering and how to track Azure OpenAI usage costs. Let’s first review what these two capabilities from Azure OpenAI enable you to do:</p>
<h2>Azure OpenAI Content Filtering: Enhancing AI Safety</h2>
<p>Content filtering for Azure OpenAI plays a critical role in addressing AI safety challenges by helping to mitigate the risks associated with harmful or inappropriate content generated by AI models. By implementing robust content filtering mechanisms, organizations can proactively identify and filter out potentially harmful content, such as hate speech, misinformation, or violent imagery, before it is disseminated to users. This helps prevent the spread of harmful content and reduces the potential negative impact on individuals and communities.</p>
<p>Monitoring Azure OpenAI content filtering is essential for staying proactive in addressing emerging content moderation challenges. By closely monitoring the system, businesses can quickly detect any new types of harmful content or patterns of misuse that may arise. This enables organizations to stay ahead of potential content moderation issues and take timely action to protect their users and uphold their brand reputation.</p>
<h2>Tracking Azure OpenAI Usage Costs</h2>
<p>Monitoring Azure OpenAI model usage costs is crucial for managing budget and resource allocation effectively. By keeping track of usage costs, organizations can optimize their operations to avoid unnecessary expenses and ensure that they are getting the best value from their investment in AI technologies. Additionally, it helps in forecasting future expenses and aids in scaling resources according to the demand without compromising performance or incurring excessive costs. Effective monitoring also allows for transparency and accountability, enabling better decision-making in terms of AI deployment and utilization within Azure environments.</p>
<p>As we walk through this blog, we will provide you with prerequisites to set up and use the pre-configured dashboards for both of these capabilities, which are part of the Azure OpenAI integration.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, you will need to:</p>
<ol>
<li>
<p>Set up and install the Azure billing integration to monitor the usage costs. Once the integration is installed, you can track the usage in the enhanced Azure OpenAI Billing dashboard.</p>
</li>
<li>
<p>Additionally, make sure you have enabled the Azure API Management service to access the Azure OpenAI models.</p>
</li>
</ol>
<h3>How to Use Azure API Management with Azure OpenAI:</h3>
<ul>
<li><strong>Provision an Azure OpenAI resource:</strong> Create an Azure OpenAI resource and select a model for your application.</li>
<li><strong>Create an API Management instance:</strong> Establish an Azure API Management instance to manage the Azure OpenAI APIs.</li>
<li><strong>Import the Azure OpenAI API:</strong> Import the Azure OpenAI API into your API Management instance using its OpenAPI specification.</li>
<li><strong>Configure Policies:</strong> Implement policies in API Management to manage request authentication, rate limiting, traffic shaping, and more.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_create_APM.png" alt="LLM Observability: Azure OpenAI Create API Management Service" /></p>
<h2>Steps to create a content filter for Azure OpenAI</h2>
<p>Before you set up observability for content filtering, ensure that you have configured Azure content filtering for your model. Follow the steps below to create an Azure OpenAI content filter:</p>
<ol>
<li><strong>Access the Azure OpenAI service console:</strong>
<ul>
<li>Sign in to the Azure Console with the appropriate permissions and navigate to the Azure OpenAI service console.</li>
</ul>
</li>
<li><strong>Navigate to Safety + security:</strong>
<ul>
<li>From the left-hand menu, select <strong>Safety + security</strong>.</li>
</ul>
</li>
<li><strong>Create a New Content filter:</strong>
<ul>
<li>Select <strong>Create content filter</strong>.</li>
<li>Configure the various content filter policies, including the following:
<ul>
<li><strong>Set input filter:</strong> Content will be annotated by category and blocked according to the threshold you set for prompts.</li>
<li><strong>Set output filter:</strong> Content will be annotated by category and blocked according to the threshold you set for response output.</li>
<li><strong>Blocklists:</strong> Define specific words or phrases to block.</li>
<li><strong>Deployments:</strong> Apply filters to model deployments.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Review and Create:</strong>
<ul>
<li>Review your settings and select Create to finalize the content filter configurations.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_create_content_filter.png" alt="LLM Observability: Azure OpenAI Create Content Filter" /></p>
<p>Customers can also configure content filters and create custom safety policies that are tailored to their use case requirements. The configurability feature allows customers to adjust the settings, separately for prompts and completions, to filter content for each content category at different severity levels.</p>
<h2>Content filter types</h2>
<ul>
<li>Content filtering categories: hate, sexual, violence, and self-harm.
<ul>
<li>Additional optional classification models detect jailbreak risk and known content for text and code.</li>
</ul>
</li>
<li>Severity levels within each content filter category: low, medium, and high.
<ul>
<li>Content detected at the 'safe' severity level is labeled in annotations but isn't subject to filtering and isn't configurable.</li>
</ul>
</li>
</ul>
<h2>Understanding the pre-configured dashboard for Azure OpenAI Content Filtering</h2>
<p>Now that you have set up the filter, you can see what is being filtered in Elastic through the Azure OpenAI content filtering dashboard. You can open it in either of two ways:</p>
<ol>
<li>Navigate to the Dashboard Menu – Select the <strong>Dashboard</strong> menu option in Elastic and search for <strong>[Azure OpenAI] Content Filtering Overview</strong> to open the dashboard.</li>
<li>Navigate to the Integrations Menu – Open the <strong>Integrations</strong> menu in Elastic, select <strong>Azure OpenAI</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Azure OpenAI] Content Filtering Overview</strong> from the dashboard assets.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_overview.png" alt="LLM Observability: Azure OpenAI Content Filtering Overview" /></p>
<p>The Azure OpenAI Content Filtering Overview dashboard in the Elastic integration provides insights into blocked requests, API latency, and error rates. It also provides a detailed breakdown of the content being filtered by the content filtering policy.</p>
<h2>Content Filter overview</h2>
<p>When the content filtering system detects harmful content, you receive either an error on the API call if the prompt was deemed inappropriate, or the <code>finish_reason</code> on the response will be <code>content_filter</code>, signifying that some of the completion was filtered.</p>
<p>This can be summarized as follows:</p>
<ul>
<li>
<p><strong>Prompt filters:</strong> Prompt content classified in a filtered category returns an HTTP 400 error.</p>
</li>
<li>
<p><strong>Non-streaming completion:</strong> When the content is filtered, non-streaming completions calls won't return any content. In rare cases with longer responses, a partial result can be returned; in these cases, the <code>finish_reason</code> is updated.</p>
</li>
<li>
<p><strong>Streaming completion:</strong> For streaming completions calls, segments are returned back to the user as they're completed. The service continues streaming until either reaching a stop token, length, or when content that is classified at a filtered category and severity level is detected.</p>
</li>
</ul>
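<p>In application code, these three outcomes can be told apart roughly as follows. The <code>choices</code>/<code>finish_reason</code> shape matches the chat completions response format; the helper itself is an illustrative sketch, not part of the Elastic integration:</p>

```python
def classify_filter_outcome(status_code, response):
    """Classify how Azure OpenAI's content filter affected a call:
    an HTTP 400 means the prompt itself was filtered; a finish_reason
    of "content_filter" means some or all of the completion was filtered."""
    if status_code == 400:
        return "prompt_filtered"
    choices = (response or {}).get("choices", [])
    if any(c.get("finish_reason") == "content_filter" for c in choices):
        return "completion_filtered"
    return "passed"
```

<p>An application might log the first two outcomes for review while surfacing a friendly message to the user instead of the raw error.</p>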
<h2>Prompt and response where content has been blocked</h2>
<p>This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding completion response. The panel below gives a view of the responses after the content filtering policy has been applied to prompts and completions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_logs.png" alt="LLM Observability: Azure OpenAI Content Filtered Logs" /></p>
<p>You can use the following code snippet to start integrating your current prompt and settings into your application to test the content filter:</p>
<pre><code>chat_prompt = [
   {
       &quot;role&quot;: &quot;user&quot;,
       &quot;content&quot;: &quot;How to kill a mocking bird?&quot;
   }
]
</code></pre>
<p>After running the code, you can see the content being filtered under the violence category at the medium severity level.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content-filter_response.png" alt="LLM Observability: Azure OpenAI Content Filtered Response" /></p>
<h2>Content filtered by content source (Input &amp; Output)</h2>
<p>The content filtering system helps monitor and moderate different categories of content based on severity levels. The categories typically include things like adult content, offensive language, hate speech, violence, and more. The severity levels indicate the degree of sensitivity or potential harm associated with the content. This panel helps the user to effectively monitor and filter out inappropriate or harmful content to maintain a safe environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_category_serverity.png" alt="LLM Observability: Azure OpenAI Content Filter Category &amp; Severity Level" /></p>
<p>These metrics can be categorized into the following groups:</p>
<ul>
<li><strong>Blocked requests by category:</strong> Provides insights into the total blocked requests by category.</li>
<li><strong>Severity distribution by categories:</strong> Monitors the blocked requests by categories and severity distribution. The severity distribution may be either low, medium or high.</li>
<li><strong>Content filtered categories:</strong> Provides insights into the content filtered categories over time.</li>
</ul>
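<p>To make the breakdown concrete, the sketch below tallies filtered requests by category and severity from a list of content filter results, mirroring the blocked-requests-by-category panel. The nested result shape is an illustrative assumption, not the integration's actual log schema:</p>

```python
from collections import Counter

def blocked_by_category(filter_results):
    """Tally filtered requests by (category, severity), as in the
    dashboard's blocked-requests breakdown."""
    counts = Counter()
    for result in filter_results:
        for category, verdict in result.items():
            if verdict.get("filtered"):
                counts[(category, verdict.get("severity"))] += 1
    return counts
```

<p>The same tallies, bucketed over time, would yield the "content filtered categories over time" view.</p>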
<h2>Reviewing the Azure OpenAI Billing dashboard</h2>
<p>You can now look at what you are spending on Azure OpenAI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_billing.png" alt="LLM Observability: Azure OpenAI Billing" /></p>
<p>Here is what you see on this dashboard:</p>
<ul>
<li><strong>Total costs:</strong> This measures the total usage cost across all the model deployments.</li>
<li><strong>Overall Usage by model:</strong> This tracks the total usage costs broken down by model.</li>
<li><strong>Daily usage:</strong> Monitors usage costs on a daily basis.</li>
<li><strong>Daily usage costs by model:</strong> Monitors daily usage costs broken down by model deployments.</li>
</ul>
<h2>Conclusion</h2>
<p>The Azure OpenAI integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Azure OpenAI along with content filtered responses. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.</p>
<p>Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic: Azure OpenAI Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2</link>
            <guid isPermaLink="false">llm-observability-azure-openai-v2</guid>
            <pubDate>Fri, 23 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We have added further capabilities to the Azure OpenAI GA package, which now offer prompt and response monitoring, PTU deployment performance tracking, and billing insights!]]></description>
            <content:encoded><![CDATA[<p>We recently announced GA of the Azure OpenAI integration. You can find details in our previous blog <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability: Azure OpenAI</a>.</p>
<p>Since then, we have added further capabilities to the Azure OpenAI GA package, which now offer prompt and response monitoring, PTU deployment performance tracking, and billing insights. Read on to learn more!</p>
<h2>Advanced Logging and Monitoring</h2>
<p>The initial GA release of the integration focused mainly on native logs, tracking the telemetry of the service through <strong>cognitive services logging</strong>. This version of the Azure OpenAI integration allows you to process the advanced logs, which give a more holistic view of OpenAI resource usage.</p>
<p>To achieve this, you have to set up the API Management service in Azure. The API Management service is a centralized place where you can put all of your OpenAI service endpoints and manage them end-to-end. Enable the API Management service and configure an Azure event hub to stream the logs.</p>
<p>To learn more about setting up the API Management service to access Azure OpenAI, please refer to the <a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/openai/architecture/log-monitor-azure-openai">Azure documentation</a>.</p>
<p>By using advanced logging, you can collect the following log data:</p>
<ul>
<li>Request input text</li>
<li>Response output text</li>
<li>Content filter results</li>
<li>Usage Information
<ul>
<li>Input prompt tokens</li>
<li>Output completion tokens</li>
<li>Total tokens</li>
</ul>
</li>
</ul>
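<p>As a sketch of how this usage information rolls up per model, the snippet below aggregates token counts from a list of log entries. The field names (<code>model</code>, <code>prompt_tokens</code>, and so on) are illustrative assumptions about the log shape, not the integration's actual field names:</p>

```python
def token_usage_by_model(entries):
    """Aggregate prompt, completion, and total tokens per model
    from advanced (API Management) log entries."""
    usage = {}
    for entry in entries:
        model = entry.get("model", "unknown")
        totals = usage.setdefault(model, {"prompt": 0, "completion": 0, "total": 0})
        totals["prompt"] += entry.get("prompt_tokens", 0)
        totals["completion"] += entry.get("completion_tokens", 0)
        totals["total"] += entry.get("total_tokens", 0)
    return usage
```

<p>This is the kind of rollup the dashboards compute for you; the sketch just shows what the per-model token view is built from.</p>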
<p>Azure OpenAI integration now collects the API Management Gateway logs. When a question from the user goes to the API Management, it logs the questions and the responses from the GPT models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure-openai-log-categories.png" alt="LLM Observability: Azure OpenAI Logs Overview" /></p>
<p>Here’s what a sample log looks like:
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-advance-log-monitoring.png" alt="LLM Observability: Azure OpenAI Advanced Logs" /></p>
<h3>Content filtered results</h3>
<p>Azure OpenAI’s content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. With Azure OpenAI model deployments, you can use the default content filter or create your own content filter.</p>
<p>The integration now collects the content-filtered result logs. In this example, let's create a custom filter in the Azure OpenAI Studio that generates an error log.</p>
<p>By leveraging the <strong>Azure Content Filters</strong>, you can create your own custom lists of terms or phrases to block or flag.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure-content-filters.png" alt="LLM Observability: Azure OpenAI Set Content Filter" /></p>
<p>And the document ingested in Elastic would look like this:
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-content-filter-logs.png" alt="LLM Observability: Azure OpenAI Content Filter Logs" />
This screenshot provides insights into the content filtered request.</p>
<h2>PTU Deployment Monitoring</h2>
<p><a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput">Provisioned throughput units (PTU)</a> are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions.</p>
<p>The curated dashboard for PTU Deployment gives comprehensive visibility into metrics such as request latency, active token usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.</p>
<p>Here are the essential PTU metrics captured by default:</p>
<ul>
<li><strong>Time to Response:</strong> Time taken for the first response to appear after a user sends a prompt.</li>
<li><strong>Active Tokens:</strong> Use this metric to understand your TPS or TPM based utilization for PTUs and compare to the benchmarks for target TPS or TPM scenarios.</li>
<li><strong>Provision-managed Utilization V2:</strong> Provides insights into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.</li>
<li><strong>Prompt Token Cache Match Rate:</strong> The prompt token cache hit ratio expressed as a percentage.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure_open_ai_ptu_deployment.png" alt="LLM Observability: Azure OpenAI PTU Deployment Metrics Monitoring" /></p>
<h2>Using Billing for cost</h2>
<p>Using the curated overview dashboard, you can now monitor the actual usage cost of your AI applications. Only one more step is needed to process the billing information.</p>
<p>You need to configure and install the <a href="https://www.elastic.co/docs/current/integrations/azure_billing">Azure billing metrics integration</a>. Once the installation is complete, the usage cost is visualized for the cognitive services in the Azure OpenAI overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure_openai_billing_overview.png" alt="LLM Observability: Azure OpenAI Usage Cost Monitoring" /></p>
<h2>Try it out today</h2>
<p>Deploy a cluster on our <a href="https://www.elastic.co/cloud/elasticsearch-service">Elasticsearch Service</a> or <a href="https://www.elastic.co/downloads/">download</a> the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability: Azure OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai</link>
            <guid isPermaLink="false">llm-observability-azure-openai</guid>
            <pubDate>Mon, 24 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We are excited to announce the general availability of the Azure OpenAI Integration that provides comprehensive Observability into the performance and usage of the Azure OpenAI Service!]]></description>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of the <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=azure">Azure OpenAI Integration</a> that provides comprehensive Observability into the performance and usage of the <a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service">Azure OpenAI Service</a>! Also see <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">Part 2 of this blog</a>.</p>
<p>While we have offered <a href="https://www.elastic.co/observability-labs/blog/monitor-openai-api-gpt-models-opentelemetry">visibility into LLM environments</a> for a while now, the addition of our Azure OpenAI integration enables richer out-of-the-box visibility into the performance and usage of your Azure OpenAI based applications, further enhancing LLM Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-monitoring.png" alt="LLM Observability: Azure OpenAI Monitoring" /></p>
<p>The Azure OpenAI integration leverages <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a>’s Azure integration capabilities to collect both logs (using <a href="https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/stream-monitoring-data-event-hubs">Azure EventHub</a>) and metrics (using <a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/metrics-index">Azure Monitor</a>) to provide deep visibility on the usage of the <a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service">Azure OpenAI Service</a>.</p>
<p>The integration includes an out-of-the-box dashboard that summarizes the most relevant aspects of the service usage, including request and error rates, token usage and chat completion latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-monitoring-overview.png" alt="LLM Observability: Azure OpenAI Monitoring Overview" /></p>
<h2>Creating Alerts and SLOs to monitor Azure OpenAI</h2>
<p>As with every other Elastic integration, all the <a href="https://www.elastic.co/docs/current/integrations/azure_openai#logs">logs</a> and <a href="https://www.elastic.co/docs/current/integrations/azure_openai#metrics">metrics</a> information is fully available to leverage in every capability in <a href="https://www.elastic.co/observability">Elastic Observability</a>, including <a href="https://www.elastic.co/guide/en/observability/current/slo.html">SLOs</a>, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts.html">alerting</a>, custom <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">dashboards</a>, in-depth <a href="https://www.elastic.co/guide/en/observability/current/monitor-logs.html">logs exploration</a>, etc.</p>
<p>To create an alert to monitor token usage, for example, start with the Custom Threshold rule on the Azure OpenAI datastream and set an aggregation condition to track and report violations of token usage past a certain threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-create-alert.png" alt="LLM Observability: Azure OpenAI Monitoring Alert Creation" /></p>
<p>When a violation occurs, the Alert Details view linked in the alert notification for that alert provides rich context surrounding the violation, such as when the violation started, its current status, and any previous history of such violations, enabling quick triaging, investigation and root cause analysis.</p>
<p>Similarly, to create an SLO to monitor error rates in Azure OpenAI calls, start with the custom query SLI definition, defining good events as responses with a result signature below 400 (i.e., non-errors), over a total that includes all responses. Then, by setting an appropriate SLO target such as 99%, start monitoring your Azure OpenAI error rate SLO over a period of 7, 30, or 90 days to track degradation and take action before it becomes a pervasive problem.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-create-slo.png" alt="LLM Observability: Azure OpenAI Monitoring SLO Creation" /></p>
<p>Please refer to the <a href="https://www.elastic.co/guide/en/observability/current/monitor-azure-openai.html">User Guide</a> to learn more and to get started!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/AI_fingertip_touching_human_fingertip.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[End to end LLM observability with Elastic: seeing into the opaque world of generative AI applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-elastic</link>
            <guid isPermaLink="false">llm-observability-elastic</guid>
            <pubDate>Wed, 02 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic’s LLM Observability delivers end-to-end visibility into the performance, reliability, cost, and compliance of LLMs across Amazon Bedrock, Azure OpenAI, Google Vertex AI, and OpenAI, empowering SREs to optimize and troubleshoot AI-powered applications.]]></description>
            <content:encoded><![CDATA[<p>In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) stand as beacons of innovation, offering unprecedented capabilities across industries. From generating human-like text and translating languages to providing personalized customer interactions, the possibilities with LLMs are vast and increasingly indispensable. Enterprises are deploying these models for everything, from automating customer support systems to enhancing creative writing processes. Imagine a virtual assistant not only answering questions but also drafting business proposals or a customer service bot that understands and responds with empathy—all powered by LLMs. However, with great power comes the need for great oversight.</p>
<p>Despite the transformative potential, LLMs introduce complex challenges that necessitate a new level of observability as LLMs are notoriously opaque. Enter LLM observability: a crucial component in the lifecycle management of LLMs. This aspect becomes vital for Service Reliability Engineers (SREs) and other key stakeholders tasked with ensuring seamless, error-free operations, cost control, and minimizing the risks associated with the unpredictable nature of LLM generated responses. SREs need insights into performance metrics, error frequencies, latency issues, the cost implications of running these sophisticated models, and the prompt and response exchange with the model. Traditional monitoring tools fall short in this high-stakes environment; what’s needed is a nuanced approach to address the unique observability demands that LLMs introduce.</p>
<h3>Elastic's LLM Observability Capabilities Address These Challenges</h3>
<p>With Elastic’s end-to-end LLM observability, you can cover a wide range of use cases. To achieve this, you can onboard two types of integrations: API-based logs and metrics, and APM instrumentation. Depending on your use case, you can choose to use one or both of the LLM integrations.</p>
<ol>
<li>
<p><strong>High level overview</strong>: via API-based logs and metrics. Monitoring LLM services from providers by ingesting a curated set of service metrics and logs like latency, invocation frequency, tokens, errors, and prompts and responses. Each LLM integration comes with out-of-the-box dashboards.</p>
</li>
<li>
<p><strong>Troubleshooting applications</strong>: via APM instrumentation. Fully OTel-native tracing and auto-instrumentation for LLM-based applications through Elastic Distributions of OpenTelemetry (EDOT). Additionally, you can use third-party libraries (Langtrace, OpenLIT, OpenLLMetry) together with Elastic to extend the coverage to additional LLM-related technologies.</p>
</li>
</ol>
<h4>High level overview: LLM Observability for Leading Providers</h4>
<p>Elastic offers tailored API-based integrations for four major LLM hosting providers:</p>
<ul>
<li>
<p>Azure OpenAI</p>
</li>
<li>
<p>OpenAI</p>
</li>
<li>
<p>Amazon Bedrock</p>
</li>
<li>
<p>Google Vertex AI</p>
</li>
</ul>
<p>These integrations bring a curated set of logs and metrics collection tailored to each provider. What this means for SREs is straightforward access to pre-configured dashboards that highlight the prompts and responses, usage patterns, performance metrics, and cost details across different models and providers.</p>
<p>For instance, SREs keen on identifying which LLM generates the most errors or insights about the models in terms of latency, cost, or usage frequency can leverage these integrations. Imagine having the capability to instantly visualize which LLM is slowing down processes or incurring high costs, thus enabling data-driven decisions to optimize operations.</p>
<h4>Troubleshooting applications: Tracing and Auto-Instrumentation of OpenAI, Amazon Bedrock and Google Vertex AI models</h4>
<p>Elastic supports OTLP tracing capabilities in EDOT for applications using OpenAI models and models hosted on Amazon Bedrock and Google Vertex AI. In addition, Elastic also supports LLM tracing from third party libraries (Langtrace, OpenLIT, OpenLLMetry). </p>
<p>Tracing offers a comprehensive map of an application's request flow, pinpointing granular details about each call within the system. For each transaction and span of a request, tracing shows critical information such as the specific models utilized, request duration, errors encountered, tokens used per request, and the prompts and responses exchanged with the LLM.</p>
<p>Tracing helps SREs troubleshoot performance issues with applications developed in languages like Python, Node.js, and Java. If an SRE needs to investigate latency or error issues, LLM tracing provides a zoomed-in view into the request lifecycle and allows for profound insights into whether a delay is application-specific, model-specific, or systemic across deployments.</p>
<h3>Use Cases: Bringing Elastic's Observability Features to Life</h3>
<p>Let’s explore some practical scenarios where Elastic’s observability tools shine:</p>
<h4>1. Understanding LLM Performance and Reliability</h4>
<p>An SRE team looking to optimize a customer support system powered by Azure OpenAI can utilize Elastic’s <a href="https://www.elastic.co/guide/en/integrations/current/azure_openai.html">Azure OpenAI integration</a> to quickly ascertain which model variants incur higher latency or error rates. This enhances decision-making regarding model deployment or even switching providers based on performance metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/Azure-OpenAI.png" alt="Azure OpenAI" /></p>
<p>Similarly SREs can also use in parallel integrations for <a href="https://www.elastic.co/guide/en/integrations/current/gcp_vertexai.html">Google Vertex AI</a>, <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Amazon Bedrock</a>, and <a href="https://www.elastic.co/guide/en/integrations/current/openai.html">OpenAI</a> for other applications using models hosted on these providers.</p>
<h4>2. Troubleshooting OpenAI-Powered Applications</h4>
<p>Consider an enterprise utilizing an OpenAI model for real-time user interactions. Encountering unexplained delays, an SRE can use OpenAI tracing to dissect the transaction pathway, identifying if one specific API call or model invocation is the bottleneck. The SRE can also check the out-of-the-box OpenAI integration dashboard to verify if the latency is only affecting this application or all model invocations across the organization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/OpenAI-tracing.png" alt="OpenAI Tracing" /></p>
<p>An engineer troubleshooting the LLM-based application can also check to see what were the prompt and response exchanges with the LLM during this request so they can rule out possible impact on performance due to the input.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/OpenAI-trace.png" alt="OpenAI Trace sample with logs " /></p>
<h4>3. Addressing Cost and Usage Concerns</h4>
<p>SREs are generally acutely aware of which LLM configurations are less cost-effective than required. Elastic’s integration dashboards, pre-configured to display model usage patterns, help mitigate unnecessary spending effectively. You can find out-of-the box dashboards for Azure OpenAI, OpenAI, Amazon Bedrock, and Google VertexAI models. These dashboards show key cost and usage information such as total invocations and tokens, as well as time series breakdown by model and endpoint. In addition, some integrations show more advanced usage information such as provisioned throughput units (PTU) as well as billing cost.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/GCP-Vertex-AI.png" alt="GCP Vertex AI" /></p>
<h4>4. Understanding LLM Compliance </h4>
<p>With the Elastic Amazon Bedrock integration for Guardrails, and Azure OpenAI integration for content filtering, SREs can swiftly address security concerns, like verifying if certain user interactions prompt policy violations. Elastic's observability logs clarify whether guardrails rightly blocked potentially harmful responses, bolstering compliance assurance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/Bedrock-Guardrails.png" alt="Bedrock-Guardrails.png" /></p>
<h3>Conclusion</h3>
<p>As LLMs continue to revolutionize the capabilities of modern applications, the role of observability becomes increasingly paramount. Elastic’s comprehensive observability framework empowers enterprises to harness the full potential of LLMs while maintaining robust operational insight and control. The integration with prominent LLM hosting providers and advanced tracing for OpenAI, Amazon Bedrock and Google Vertex AI models, equips SREs with the necessary arsenal to navigate the complex landscape of LLM-driven applications, ensuring they remain safe, reliable, efficient, and cost-effective.</p>
<p>In this new era of AI, balancing innovation with observability isn't just beneficial—it's essential. Whether optimizing performance, troubleshooting intricacies, or managing costs and compliance, Elastic stands at the forefront, ensuring your LLM journey is as seamless as it is groundbreaking.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/llm-e2e.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM observability: track usage and manage costs with Elastic's OpenAI integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-openai</link>
            <guid isPermaLink="false">llm-observability-openai</guid>
            <pubDate>Tue, 11 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's new OpenAI integration for Observability provides comprehensive insights into OpenAI model usage. With our pre-built dashboards and metrics, you can effectively track and monitor OpenAI model usage including GPT-4o and DALL·E.]]></description>
            <content:encoded><![CDATA[<p>In an era where AI-driven applications are becoming ubiquitous, understanding and managing the usage of language models is crucial. OpenAI has been at the forefront of developing advanced language models that power a multitude of applications, from chatbots to code generation. However, as applications grow in complexity and scale, observing crucial metrics that ensure optimal performance and cost-effectiveness becomes essential. Specific needs arise in areas such as performance and reliability monitoring, and cost management, which are pivotal for maximizing the potential of language models.</p>
<p>As organizations adopt OpenAI's diverse AI models, including language models like GPT-4o and GPT-3.5 Turbo, image models like DALL·E, and audio models like Whisper, comprehensive usage monitoring is crucial to track and optimize performance, reliability, usage and cost of each model.</p>
<p>Elastic's new <a href="https://www.elastic.co/guide/en/integrations/current/openai.html">OpenAI integration</a> offers a solution to the challenges faced by developers and businesses using these models. It is designed to provide a unified view of your OpenAI usage across all model types.</p>
<h3>Key benefits of the OpenAI integration</h3>
<p>OpenAI's usage-based pricing model applies across all these services, making it essential to track consumption and identify which models are being used to control costs and optimize deployments. The new OpenAI integration by Elastic utilizes the <a href="https://platform.openai.com/docs/api-reference/usage">OpenAI Usage API</a> to track consumption and identify specific models being used. It offers an out-of-the-box experience with pre-built dashboards, simplifying the process of monitoring your usage patterns.</p>
<p>Continue reading to learn about what you will get with the integration. We'll also show you the setup process, how to leverage the pre-built dashboards, and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Setting up the OpenAI Integration</h2>
<h3>Prerequisites</h3>
<p>To follow along with this blog, you will need:</p>
<ul>
<li>An Elastic cloud account (version 8.16.3 or higher). Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An OpenAI account with an <a href="https://platform.openai.com/docs/api-reference/admin-api-keys">Admin API key</a>.</li>
<li>Applications that use the OpenAI APIs.</li>
</ul>
<h3>Generating sample OpenAI usage data</h3>
<p>If you're new to OpenAI and eager to try this integration, you can quickly set it up and populate your dashboards with sample data. You'll just need to generate some usage by interacting with the OpenAI API. If you don't have an OpenAI API key, you can create one <a href="https://platform.openai.com/api-keys">here</a>. For more information on authentication, refer to the OpenAI <a href="https://platform.openai.com/docs/api-reference/authentication">documentation</a>.</p>
<p>The OpenAI documentation provides detailed examples for each of their API endpoints. Here are direct links to the relevant sections for generating sample usage data:</p>
<ul>
<li>Language models (completions): Use the Chat Completions API to generate text. See the examples <a href="https://platform.openai.com/docs/api-reference/chat/create">here</a>.</li>
<li>Audio models (text-to-speech): Generate audio from text using the Speech API. See the examples <a href="https://platform.openai.com/docs/api-reference/audio/createSpeech">here</a>.</li>
<li>Audio models (speech-to-text): Transcribe audio to text using the Transcriptions API. See the examples <a href="https://platform.openai.com/docs/api-reference/audio/createTranscription">here</a>.</li>
<li>Embeddings: Generate vector representations of text using the Embeddings API. See the examples <a href="https://platform.openai.com/docs/api-reference/embeddings">here</a>.</li>
<li>Image models: Create images from text prompts using the Image Generation API. See the examples <a href="https://platform.openai.com/docs/api-reference/images/create">here</a>.</li>
<li>Moderation: Check the contents with Moderation API. See the examples <a href="https://platform.openai.com/docs/api-reference/moderations">here</a>.</li>
</ul>
<p>There are more endpoints that you can explore to generate sample usage data.</p>
<p>After running these examples (using your API key), remember that the OpenAI <a href="https://platform.openai.com/docs/api-reference/usage">Usage API</a> has a delay. It may take some time (usually a few minutes) for the usage data to appear in your dashboard.</p>
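<p>If you prefer to script this, the sketch below (our example, not from the OpenAI documentation) defines request bodies for a few of these endpoints; the model names are examples that may change over time, and actually sending the requests incurs small API charges:</p>

```python
import json
import os
import urllib.request

def sample_payloads():
    """Request bodies for a few usage-generating endpoints, keyed by path.
    Model names are examples; check the current model list."""
    return {
        "/v1/chat/completions": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Say hello in one word."}],
        },
        "/v1/embeddings": {"model": "text-embedding-3-small",
                           "input": "observability"},
        "/v1/moderations": {"input": "A harmless test sentence."},
    }

def send(path, body):
    """POST one request to the OpenAI API using the key in OPENAI_API_KEY."""
    req = urllib.request.Request(
        "https://api.openai.com" + path,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# To generate the usage (costs a few tokens per call):
#   for path, body in sample_payloads().items():
#       print(path, "->", "ok" if send(path, body) else "error")
```

After a few minutes, the resulting usage should start appearing in the integration's dashboard, subject to the Usage API delay noted above.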
<h3>Configuration</h3>
<p>To connect the OpenAI integration to your OpenAI account, you'll need your OpenAI's <a href="https://platform.openai.com/settings/organization/admin-keys">Admin API key</a>. The integration will use this key to periodically retrieve usage data from the OpenAI <a href="https://platform.openai.com/docs/api-reference/usage">Usage API</a>.</p>
<p>The integration supports eight distinct <a href="https://www.elastic.co/guide/en/integrations/current/openai.html#openai-data-streams">data streams</a>, corresponding to different categories of OpenAI API usage:</p>
<ul>
<li>Audio speeches (text-to-speech)</li>
<li>Audio transcriptions (speech-to-text)</li>
<li>Code interpreter sessions</li>
<li>Completions (language models)</li>
<li>Embeddings</li>
<li>Images</li>
<li>Moderations</li>
<li>Vector stores</li>
</ul>
<p>By default, all data streams are enabled. However, you can disable any data streams that are not relevant to your usage. All enabled data streams are visualized in a single, comprehensive dashboard, providing a unified view of your usage.</p>
<p>For advanced users, the integration offers additional configuration options, including setting the bucket width and initial interval. These options are documented in detail in the official integration <a href="https://www.elastic.co/guide/en/integrations/current/openai.html#openai-collection-behavior">documentation</a>.</p>
<h2>Maximize visibility with the out-of-the-box dashboard</h2>
<p>You can access the OpenAI dashboard in two ways:</p>
<ol>
<li>Navigate to the Dashboards menu in the left side panel and search for &quot;OpenAI&quot;. In the search results, select <strong>[Metrics OpenAI] OpenAI Usage Overview</strong> to open the dashboard.</li>
<li>Alternatively, navigate to the Integrations Menu — Open the <strong>Integrations</strong> menu under the <strong>Management</strong> section in Elastic, select <strong>OpenAI</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Metrics OpenAI] OpenAI Usage Overview</strong> from the dashboards assets.</li>
</ol>
<h3>Understanding the pre-configured dashboard for OpenAI</h3>
<p>The pre-built dashboard provides a structured view of OpenAI's API consumption, displaying key metrics such as token usage, API call distribution, and model-wise invocation counts. It highlights top-performing projects, users, and API keys, along with breakdowns of image generation, audio transcription, and text-to-speech usage. By analyzing these insights, users can track usage patterns, and optimize AI-driven applications.</p>
<h3>OpenAI usage metrics overview</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-overview.png" alt="LLM Observability: OpenAI usage metrics overview" /></p>
<p>This dashboard section shows key usage metrics from OpenAI, including invocation rates, token usage, and the top-performing models. It also highlights the total number of invocations and tokens and the invocation count by object type. Understanding these insights can help users optimize model usage, reduce costs, and enhance efficiency when integrating AI models into their applications.</p>
<h3>Top performing Project, User, and API Key IDs</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-top-tables.png" alt="LLM Observability: Top performing Project, User, and API Key IDs" /></p>
<p>Here, you can analyze the top Project IDs, User IDs, and API Key IDs based on invocation counts. This data provides valuable insights to help organizations track usage patterns across different projects and applications.</p>
<h3>Token metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-token-metrics.png" alt="LLM Observability: Token metrics" /></p>
<p>In this dashboard section you can see token usage trends across various models. This can help you analyze trends across input types (e.g., audio, embeddings, moderations), output types (e.g., audio), and input cached tokens. This information can help developers fine-tune their prompts and optimize token consumption.</p>
<h3>Image generation metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-image-metrics.png" alt="LLM Observability: Image generation metrics" /></p>
<p>AI-generated images are becoming increasingly popular across industries. This section provides an overview of image generation metrics, including invocation rates by model and the most <a href="https://platform.openai.com/docs/guides/images#size-and-quality-options">common output dimensions</a>. These insights help assess invocation costs and analyze image generation usage.</p>
<h3>Audio transcription metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-audio-transcription-metrics.png" alt="LLM Observability: Audio transcription metrics" /></p>
<p>OpenAI's AI-powered transcription services make speech-to-text conversion easier than ever. This section tracks audio transcription metrics, including invocation rates and total transcribed seconds per model. Understanding these trends can help businesses optimize costs when building audio transcription-based applications.</p>
<h3>Audio speech metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-audio-speech.png" alt="LLM Observability: Audio speech metrics" /></p>
<p>OpenAI's text-to-speech (TTS) models deliver realistic voice synthesis for applications such as accessibility tools and virtual assistants. This section explores TTS invocation rates and the number of characters synthesized per model, offering insights into the adoption of AI-driven voice synthesis.</p>
<h2>Creating Alerts and SLOs to monitor OpenAI</h2>
<p>As with every other Elastic integration, all the logs and metrics information is fully available to leverage in every capability in <a href="https://www.elastic.co/observability">Elastic Observability</a>, including <a href="https://www.elastic.co/guide/en/observability/current/slo.html">SLOs</a>, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts.html">alerting</a>, custom <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">dashboards</a>, in-depth <a href="https://www.elastic.co/guide/en/observability/current/monitor-logs.html">logs exploration</a>, etc.</p>
<p>To proactively manage your OpenAI token usage and avoid unexpected costs, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts-rules.html">create</a> a custom threshold rule in Observability Alerts.</p>
<p><em>Example</em>: Target the relevant data stream, and configure the rule to sum the related tokens field (along with other token-related fields, if applicable). Set a threshold representing your desired usage limit, and the alert will notify you if this limit is exceeded within a specified timeframe, such as daily or hourly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-create-alert.png" alt="LLM Observability: Alert creation" /></p>
<p>When an alert condition is met, the Alert Details view linked from the notification provides detailed insights into the violation, such as when it started, its current status, and any history of similar violations, enabling proactive issue resolution and improving system resilience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-alert-overview.png" alt="LLM Observability: Alert overview" /></p>
<p><em>Example</em>: To create an SLO that monitors model distribution in OpenAI, start by defining a custom metric SLI definition, adding good events where <code>openai.base.model</code> contains <code>gpt-3.5*</code> and total events encompassing all OpenAI requests, grouped by <code>openai.base.project_id</code> and <code>openai.base.user_id</code>. Then, set an appropriate SLO target such as 80% and monitor this over a 7-day rolling window to identify projects and users that may be overusing more expensive models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-create-slo.png" alt="LLM Observability: SLO creation" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-slo-overview.png" alt="LLM Observability: SLO overview" /></p>
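<p>Conceptually, the SLI compares good events to total events per group. A rough ES|QL equivalent of that ratio, reusing the fields from the example above (the data stream pattern is a placeholder), might look like this:</p>
<pre><code class="language-esql">FROM metrics-openai.base-*
| STATS good = COUNT(*) WHERE openai.base.model LIKE &quot;gpt-3.5*&quot;,
        total = COUNT(*)
        BY openai.base.project_id, openai.base.user_id
| EVAL sli = 100.0 * good / total
| WHERE sli &lt; 80
</code></pre>
<p>Any project/user combination returned here is falling below the 80% target, i.e. sending more than 20% of its requests to models other than GPT-3.5.</p>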
<p>You can now track the distribution of requests across different OpenAI models by project and user. This example demonstrates how Elastic's OpenAI integration helps you optimize costs. By monitoring the percentage of requests handled by cost-efficient GPT-3.5 models — the SLI — against the 80% target (part of the SLO), you can quickly identify which specific projects or users are driving up costs through excessive usage of models like GPT-4-turbo, GPT-4o, etc. This visibility enables targeted optimization strategies, ensuring your AI initiatives remain cost-effective while still leveraging advanced capabilities.</p>
<h2>Conclusion, next steps and further reading</h2>
<p>You now know how Elastic's OpenAI integration provides an essential tool for anyone relying on OpenAI's models to power their applications. By offering a comprehensive and customizable dashboard, this integration empowers SREs and developers to effectively monitor performance, manage costs, and optimize their AI systems effortlessly. Now, it's your turn to onboard this application following the instructions in this blog and start monitoring your OpenAI usage! We'd love to hear from you on how you get on and always welcome ideas for enhancements.</p>
<p>To learn how to set up Application Performance Monitoring (APM) tracing of OpenAI-powered applications, read this <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">blog</a>. For further reading and more LLM observability use cases, explore Elastic's observability lab blogs <a href="https://www.elastic.co/observability-labs/blog/tag/llmobs">here</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/llm-observability-openai.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Serverless log analytics powered by Elasticsearch, in a new low priced tier]]></title>
            <link>https://www.elastic.co/observability-labs/blog/log-analytics-elastic-serverless-logs-essentials</link>
            <guid isPermaLink="false">log-analytics-elastic-serverless-logs-essentials</guid>
            <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability Logs Essentials delivers cost-effective, hassle-free log analytics on Elastic Cloud Serverless. SREs can ingest, search, enrich, analyze, store, and act on logs without the operational overhead of managing the deployment.]]></description>
            <content:encoded><![CDATA[<p>We're thrilled to introduce Elastic Observability Logs Essentials (Logs Essentials), a new tier in Elastic Cloud Serverless (SaaS). Built on the same robust stateless architecture as Elastic Observability Complete, it’s designed for Site Reliability Engineers (SREs) and developers seeking powerful, efficient, and economical log analytics, without the overhead of managing the Elastic Stack. As the leader in log management, Elasticsearch powers this new tier with unmatched search and analytics. </p>
<p>Logs Essentials is ideal for teams that want Elastic’s speed and scale without paying for premium features or managing the Elastic Stack. With Elastic Cloud Serverless, there’s no infrastructure to manage, and pricing is simple and predictable, making it easy to get started, stay supported, and focus on solving problems faster.</p>
<h2>Unmatched value for log analytics</h2>
<p>Logs Essentials empowers SREs and developers with analytics capabilities designed to help them quickly pinpoint the root cause of issues. </p>
<ul>
<li>
<p>Accelerate root cause analysis with fast, precise log search using filters, pattern matching, and event identification in seconds.</p>
</li>
<li>
<p>Gain deep contextual insights through ES|QL, Elastic’s powerful piped query language that supports structured exploration and joins across indices.</p>
</li>
<li>
<p>Detect issues proactively by setting alerts for error spikes or unusual log volumes, enabling timely incident response.</p>
</li>
<li>
<p>Visualize and monitor operational health with rich dashboards built in Kibana, giving teams a clear and actionable view of system behavior.</p>
</li>
</ul>
<p>Once on Logs Essentials, if you need SLOs, AI/ML, the AI Assistant, or other advanced features to analyze logs, or if you want to expand into traces and metrics, you should upgrade to <a href="https://www.elastic.co/pricing/serverless-observability">Observability Complete</a>.</p>
<h2>SaaS making it simple</h2>
<p>With Logs Essentials, SREs don’t have to worry about managing the powerful Elastic Stack. <a href="https://www.elastic.co/blog/journey-to-build-elastic-cloud-serverless">Elastic Cloud Serverless</a> automatically scales and adjusts to demand seamlessly without impacting performance, all while keeping costs low. There is no operational overhead of managing a deployment, and no need to be an Elastic Stack expert. SREs get the following benefits:</p>
<p><strong>No infrastructure to manage or scale:</strong> Elastic Cloud Serverless transitions from traditional stateful deployments to a fully stateless, autoscaling architecture, offloading storage to cloud-native object stores and orchestrating compute through Kubernetes. SRE teams can now focus solely on logs and insights, not capacity planning or cluster sizing.</p>
<p><strong>High reliability, resilience, and automation built-in:</strong> Elastic’s Cloud Serverless features multi-region deployments, automated control-plane and data-plane upgrades, automatic configuration updates, canary deployments, and capacity pool management to ensure always-on observability.</p>
<p>These capabilities deliver what SREs need: a hassle-free, scale-as-you-go, high-availability logging solution that empowers SREs to focus entirely on operational insights, not infrastructure.</p>
<h2>Affordable log analytics</h2>
<p>Logs Essentials offers a cost-effective and predictable path to log analytics. Elastic Cloud Serverless employs advanced autoscaling controllers that adjust compute and storage dynamically, enabling a flexible pricing model that charges based on real usage (ingest and retention), so SREs can “sign up and use” without upfront provisioning or surprise costs.</p>
<p>Instead of paying for idle capacity or managing infrastructure costs, users are billed based on ingest and retention, eliminating the guesswork and overprovisioning common in traditional observability solutions. SREs can simply sign up and start analyzing logs: no infrastructure to manage, no surprise costs, just transparent, cost-effective pricing for what they use.</p>
<h2>Logs Essentials in action</h2>
<p>Let’s walk through how a Site Reliability Engineer (SRE) would use Logs Essentials in a real-world scenario. Customers are unable to complete transactions on an ecommerce site, and the root cause isn’t clear. The issue could be in the front end, the back end, the database, or even the load balancer. Fortunately, logs are being collected from multiple components, including NGINX, MySQL, and the application itself. With Elastic Observability Logs Essentials, an SRE can quickly dive into these logs, starting with high-level symptoms and drilling down across services using powerful search, correlation via ES|QL, and dashboards.</p>
<p>The investigation continues as the SRE walks through several steps using ES|QL, search, and dashboards.</p>
<ul>
<li>There is an alert indicating a logs spike, triggered by a significant number of MySQL errors showing that the database table “orders” is full. The SRE also uses ES|QL to see how many errors have occurred in the last three hours.</li>
</ul>
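<p>The ES|QL query for this step could look like the following sketch; the data stream pattern and grouping field are assumptions based on a typical MySQL integration setup, not a prescribed query:</p>
<pre><code class="language-esql">FROM logs-mysql.error-*
| WHERE @timestamp &gt; NOW() - 3 hours
| STATS error_count = COUNT(*) BY host.name
| SORT error_count DESC
</code></pre>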
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-alerts.jpg" alt="Alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-sql-error.jpg" alt="MySQL error" /></p>
<ul>
<li>Next, the SRE assesses the impact on customers and potential revenue by looking at how many HTTP errors are occurring and which region is seeing them most. With a significant number of &gt;=400 responses, and the US as the main affected region, the issue is revenue impacting.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-nginx-400.jpg" alt="NGINX 400 Issues" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-geo.jpg" alt="GEO Analysis" /></p>
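<p>A query along these lines can surface the failing requests by region. The ECS field names below are typical for an NGINX integration but are assumptions that may differ in your setup:</p>
<pre><code class="language-esql">FROM logs-nginx.access-*
| WHERE http.response.status_code &gt;= 400
| STATS failed_requests = COUNT(*) BY source.geo.country_iso_code
| SORT failed_requests DESC
</code></pre>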
<ul>
<li>Next, the SRE looks at whether infrastructure is being impacted by finding the related Kubernetes cluster and pod. With this the SRE can further investigate whether the MySQL pod or the Kubernetes node is having CPU or memory utilization issues.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-k8s.jpg" alt="K8S Cluster Analysis" /></p>
<p>SREs can also create visualizations and dashboards easily through Observability Logs Essentials’ ES|QL, Discover, alerting, and dashboard capabilities.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-dashboard.jpg" alt="Dashboard" /></p>
<h2>Get started with Observability Logs Essentials</h2>
<p>By combining the trusted capabilities of Elasticsearch with the flexibility and scalability of Elastic Cloud Serverless, Logs Essentials delivers a streamlined, cost-effective solution that helps teams resolve incidents faster and with greater clarity. Whether you're troubleshooting critical outages, monitoring service health, or building dashboards for proactive insight, Logs Essentials gives you the tools you need — search, ES|QL, alerting, and visualization — in a package that’s simple to adopt and scale.</p>
<p>In order to get started, first <a href="https://cloud.elastic.co/serverless-registration">register on Elastic Cloud</a> and start a trial.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Process data from Elastic integrations with the integration filter plugin in Logstash]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-integration-filter-plugin</link>
            <guid isPermaLink="false">logstash-integration-filter-plugin</guid>
            <pubDate>Fri, 14 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Offload data processing operations outside of your Elastic deployment and onto Logstash by using the integration filter plugin.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-elastic_integration.html">Elastic Integration filter plugin</a> for Logstash allows you to process data from Elastic integrations through executing ingest pipelines within Logstash, before forwarding the data to Elastic.</p>
<h3>Why should I use it?</h3>
<p>This approach has the advantage of offloading data processing operations outside of your Elastic deployment and onto Logstash, giving you flexibility on where this should occur.</p>
<p>Additionally, with Logstash as the final route for your data before ingestion into Elastic, it could save you from having to open different ports and set different firewall rules for each agent or beats instance, as Logstash could aggregate all output from these components.</p>
<h3>Prerequisites</h3>
<p>You have an Elastic agent with one or more integrations inside an agent policy running on a server. If you need to install an Elastic agent, you can follow the guide <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">here.</a></p>
<h3>Steps</h3>
<p>We will:</p>
<ul>
<li>
<p>Install Logstash, but not run it until all steps are complete</p>
</li>
<li>
<p>Generate custom certificates and keys on our Logstash server, to enable secure communication between Fleet server and Logstash</p>
</li>
<li>
<p>Configure Fleet to add a Logstash output</p>
</li>
<li>
<p>Set up Logstash, including a custom pipeline that receives input from Elastic agent, uses the integration filter plugin, and finally forwards the events to Elastic</p>
</li>
<li>
<p>Start Logstash</p>
</li>
<li>
<p>Update an agent policy to use that new Logstash output</p>
</li>
</ul>
<h3>Installing Logstash</h3>
<p>Use <a href="https://www.elastic.co/guide/en/logstash/current/installing-logstash.html">this guide</a> to install Logstash on your server.</p>
<h3>Set up SSL/TLS on the Logstash server</h3>
<p>Use <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html">this guide</a> to create custom certificates and keys for securing the Logstash output connection that will be used by Fleet. We need to do this before we set up a custom pipeline file for Logstash, as we’ll refer to some of the certificate values in that config.</p>
<p>As per the guide, I <a href="https://www.elastic.co/downloads/elasticsearch">downloaded Elasticsearch</a> so I could use the certutil tool that is included, and extracted the contents.</p>
<h3>Add a Logstash output to Fleet in Kibana</h3>
<p>With our certificates and keys to hand, we can complete the steps necessary to <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html#add-ls-output">set up a Logstash output for Fleet</a> from within Kibana. Do not yet set the Logstash output on an agent policy, as we need to configure a custom pipeline in Logstash first.</p>
<h3>Set up a custom pipeline for Logstash</h3>
<p>We need to add a custom pipeline yml file, which will include our Elastic agent input and integration filter. The typical definition for a Logstash pipeline is this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/basic_logstash_pipeline.png" alt="Basic Logstash pipeline" /></p>
<p>Our custom pipeline yml file will start with the Elastic agent <em>input</em> plugin, the guide for which is <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elastic_agent.html">here.</a></p>
<p>We will then have the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-elastic_integration.html">integration filter</a>, and an <em>output</em> to Elastic Cloud that will be different depending on if you are ingesting to a hosted cloud deployment, or a serverless project.</p>
<p>Your completed file should look something like this:</p>
<pre><code class="language-yaml">input {
  elastic_agent {
    port =&gt; 5044
    ssl_enabled =&gt; true
    ssl_certificate_authorities =&gt; [&quot;/pathtoca/ca.crt&quot;]
    ssl_certificate =&gt; &quot;/pathtologstashcrt/logstash.crt&quot;
    ssl_key =&gt; &quot;/pathtologstashkey/logstash.pkcs8.key&quot;
    ssl_client_authentication =&gt; &quot;required&quot;
  }
}
filter {
  elastic_integration{
    cloud_id =&gt; &quot;Ross_is_Testing:123456&quot;
    cloud_auth =&gt; &quot;elastic:yourpasswordhere&quot;
  }
}
output {
    # For cloud hosted deployments
    elasticsearch {
        cloud_id =&gt; &quot;Ross_is_Testing:123456&quot;
        cloud_auth =&gt; &quot;elastic:yourpasswordhere&quot;
        data_stream =&gt; true
        ssl =&gt; true
        ecs_compatibility =&gt; v8
    }
    # For serverless projects
    elasticsearch {
        hosts =&gt; [&quot;https://projectname.es.us-east-1.aws.elastic.cloud:443&quot;]
        api_key =&gt; &quot;yourapikey-here&quot;
        data_stream =&gt; true
        ssl =&gt; true
        ecs_compatibility =&gt; v8
    }
}
</code></pre>
<p>The above syntax for the output section is valid: you <em>can</em> specify multiple outputs!</p>
<p>For cloud hosted deployments you can use the deployment’s CloudId for authentication, which you can get from the <a href="https://cloud.elastic.co/">cloud admin console</a>, on the deployment overview screen:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/deployment_overview.png" alt="Deployment overview screen" /></p>
<p>I'm also using a username and password, but you could instead specify an API key if desired.</p>
<p>For serverless projects, you’ll need to use your Elasticsearch endpoint and an API key to connect Logstash, as documented <a href="https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-logstash.html">here.</a> You can get the Elasticsearch endpoint from the manage project screen of the cloud admin console:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/serverless_endpoint.png" alt="Serverless project endpoint" /></p>
<p>Ensure the main pipelines.yml file for Logstash also includes a reference to our custom pipeline file:</p>
<pre><code class="language-yaml"># This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
- pipeline.id: fromagent
  path.config: &quot;/etc/logstash/conf.d/agent.conf&quot;
</code></pre>
<p>We can then start Logstash. As we haven’t yet updated an Elastic agent policy to use our Logstash output, no events will yet be going through Logstash.</p>
<h3>Update an agent policy to use our Logstash output</h3>
<p>With Logstash running, we can now <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html#use-ls-output">set our configured Logstash output on an agent policy</a> of our choosing.</p>
<h3>Complete</h3>
<p>Events from the integrations on the chosen agent policy will be sent through Logstash, and relevant ingest pipelines run within Logstash to process the data before sending to Elastic Cloud.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/logstash-filter.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Logstash Pipeline Management & Configuration with GitOps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-pipeline-management-configuration-gitops</link>
            <guid isPermaLink="false">logstash-pipeline-management-configuration-gitops</guid>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Stop treating Logstash like a black box. This guide shows you how to use GitOps to create auditable, automated, and resilient data pipelines. Eliminate config drift and boost security with this GitHub and Jenkins blueprint.]]></description>
            <content:encoded><![CDATA[<p>Is your Logstash environment a 'black box'? Are manual configuration changes leading to unexpected outages, security gaps, and countless hours spent on troubleshooting? It's time to stop treating observability infrastructure like a fragile art project. This blog post delivers a strategic blueprint for taming your Logstash pipelines, transforming them into a version-controlled, automated, and auditable asset. By adopting a GitOps approach, you can eliminate configuration drift, empower your teams to collaborate securely and ensure your observability platform is as resilient as the systems it monitors.</p>
<h2>From Fragile Art Project to Auditable Asset: How to Tame Your Logstash Configurations with Version Control and Automation</h2>
<p>Observability ensures system health, performance, and security. Logstash drives this by processing and routing your data. But as you scale, manual configuration management becomes a bottleneck. It leads to errors, outages, and security gaps. You need a better way.</p>
<p>This blog post shows you how to manage Logstash pipelines using GitOps. You will use Git as your single source of truth and automate deployments to increase the stability, security, and efficiency of your organisation’s observability infrastructure.</p>
<p>This blog post details the benefits of this methodology and provides a practical implementation model using <strong>GitHub</strong> for version control and <strong>Jenkins</strong> for Continuous Integration and Continuous Deployment (CI/CD).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-pipeline-management-configuration-gitops/ls-pipeline-gitops-flow.png" alt="Logstash Central Pipeline GitOps flow" /></p>
<h2>The Unsung Hero: Why Logstash Remains a Cornerstone of Enterprise Data Strategy</h2>
<p>In the evolving landscape of observability and data pipelines, <strong>Logstash</strong> remains one of the most powerful and reliable components in the Elastic ecosystem. While it may not always take the spotlight, its depth of capability, flexibility, and resilience make it essential for enterprises managing complex, varied data streams. Logstash offers four main benefits:</p>
<ul>
<li>
<p><strong>Extensive Integration Support:</strong> Logstash supports a wide array of input and output plugins — including Kafka, syslog, Beats, cloud services, and databases — making it ideal for ingesting data from diverse environments and routing it across your architecture.</p>
</li>
<li>
<p><strong>Advanced Data Transformation:</strong> With rich filtering capabilities and optional Ruby scripting, Logstash enables complex enrichment, field manipulation, and conditional routing — allowing teams to standardise and prepare data early in the pipeline.</p>
</li>
<li>
<p><strong>Offloading Elasticsearch Ingest Load:</strong> The <a href="https://www.elastic.co/docs/reference/logstash/using-logstash-with-elastic-integrations"><code>elastic_integration</code></a> filter replicates ingest pipeline logic in Logstash, enabling upstream transformations that reduce processing overhead on Elasticsearch and streamline the indexing path.</p>
</li>
<li>
<p><strong>Operational Resilience with Persistent Queues:</strong> Logstash’s persistent queue buffers data during downstream slowdowns or outages, helping smooth ingestion spikes, prevent data loss, and maintain stability under load.</p>
</li>
</ul>
<p>In modern CI/CD workflows, where automation and rapid iteration are standard, Logstash’s maturity and flexibility continue to make it a dependable choice — quietly powering the data flows that keep observability pipelines running strong.</p>
<h2>The Case for a GitOps-Driven Observability Strategy</h2>
<p>GitOps is a paradigm that applies proven DevOps best practices such as version control, collaboration, compliance, and CI/CD to infrastructure and configuration management. When applied to Logstash, this means that every pipeline configuration is treated as code—defined, versioned, reviewed, and deployed from a Git repository.</p>
<p>For enterprise environments, the adoption of a GitOps model for Logstash pipelines offers compelling advantages:</p>
<ul>
<li>
<p><strong>Enhanced Auditability and Compliance:</strong> Every pipeline modification is captured as a Git commit, creating an immutable, chronological audit trail. This provides unparalleled visibility into who made what change, when, and why, which is indispensable for meeting regulatory compliance requirements and conducting security audits.</p>
</li>
<li>
<p><strong>Improved System Stability and Reliability:</strong> The risk of deploying faulty configurations is drastically reduced. By enforcing a pull request (PR) workflow, all changes undergo peer review and automated validation <em>before</em> they are merged and deployed. In the event of an incident caused by a new configuration, a rollback is as fast and straightforward as reverting a Git commit.</p>
</li>
<li>
<p><strong>Increased Automation and Operational Efficiency:</strong> Automating the deployment lifecycle eliminates manual, error-prone configuration tasks. This frees up skilled engineers from routine operational duties, allowing them to focus on higher-value activities such as optimising data flows, improving analytics, and strengthening security postures.</p>
</li>
<li>
<p><strong>Fostered Cross-Team Collaboration:</strong> Git provides a universal and well-understood platform for collaboration. Development, Security, and Operations (DevSecOps) teams can work together seamlessly on a unified codebase. This shared ownership breaks down silos and ensures that pipeline configurations are robust, secure, and fit for purpose across the organization.</p>
</li>
</ul>
<h2>Implementation Model: GitHub and Jenkins</h2>
<p>This section details a practical framework for implementing a GitOps workflow for Logstash.</p>
<h3>1. Prerequisites</h3>
<ul>
<li>
<p>An established <strong>GitHub</strong> organisation or account.</p>
</li>
<li>
<p>A running <strong>Jenkins</strong> instance with the necessary plugins installed (e.g., Git, GitHub Integration).</p>
</li>
<li>
<p>A target <strong>Logstash</strong> environment where configurations will be deployed.</p>
</li>
<li>
<p>Working knowledge of Git, Jenkins pipelines, and Logstash configuration syntax.</p>
</li>
</ul>
<h3>2. Step 1: Establish a Centralised Git Repository</h3>
<p>The foundation of a GitOps workflow is a version-controlled repository.</p>
<ol>
<li>
<p><strong>Create a Repository:</strong> In GitHub, create a new repository (e.g., logstash-configurations). This will serve as the single source of truth for all pipeline configurations.</p>
</li>
<li>
<p><strong>Define a Directory Structure:</strong> A logical directory structure is crucial for managing configurations across different environments. A recommended structure is:</p>
</li>
</ol>
<pre><code>    /
    ├── pipelines/
    │   ├── development/
    │   │   ├── 01-input-beats.conf
    │   │   ├── 10-filter-nginx.conf
    │   │   └── 99-output-elasticsearch.conf
    │   ├── staging/
    │   │   └── ...
    │   └── production/
    │       └── ...
    └── Jenkinsfile
</code></pre>
<p>This structure clearly separates configurations by environment and allows for a modular and maintainable pipeline design.</p>
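<p>As an illustration, a minimal <code>01-input-beats.conf</code> in this layout might contain just the input stage; the port is a placeholder, and the comment explains why the numeric prefixes matter:</p>
<pre><code># Files in a pipeline directory are concatenated in lexical order,
# so the numeric prefixes (01-, 10-, 99-) control the stage ordering.
input {
  beats {
    port =&gt; 5044
  }
}
</code></pre>
<p>Filters and outputs live in their own similarly prefixed files, keeping each stage small and reviewable in a pull request.</p>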
<h3>3. Step 2: Automate Deployment with a Jenkins CI/CD Pipeline</h3>
<p>The Jenkins pipeline automates validation and deployment of the configurations from Git to your Logstash instances.</p>
<ol>
<li>
<p><strong>Create a</strong> <code>Jenkinsfile</code><strong>:</strong> Add a <code>Jenkinsfile</code> to the root of your repository to define the automation pipeline. This pipeline-as-code approach ensures the deployment process itself is version-controlled.</p>
</li>
<li>
<p><strong>Define the Pipeline Stages:</strong> The pipeline should include distinct stages for checking out code, validating configurations, and deploying to the target environment.</p>
<p>A sample <code>Jenkinsfile</code> could look as follows:</p>
<pre><code>
 pipeline {
     agent any

     // Trigger the pipeline on every push to the main branch
     triggers {
         githubPush()
     }

     stages {
         stage('Checkout') {
             steps {
                 // Clone the repository
                 git 'https://github.com/your-org/logstash-configurations.git'
             }
         }

         stage('Validate Staging Configs') {
             steps {
                 // Run Logstash's built-in config test
                 // This prevents syntax errors from reaching production
                 sh 'docker run --rm -v ${WORKSPACE}/pipelines/staging:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:9.3.1 logstash --config.test_and_exit'
             }
         }

         stage('Deploy to Staging') {
             // This stage requires Jenkins to have credentials to access the Staging server
             steps {
                 withCredentials([sshUserPrivateKey(credentialsId: 'staging-server-creds', keyFileVariable: 'KEY_FILE')]) {
                     sh '''
                         scp -i ${KEY_FILE} ${WORKSPACE}/pipelines/staging/*.conf user@staging-logstash-host:/etc/logstash/conf.d/
                         ssh -i ${KEY_FILE} user@staging-logstash-host 'sudo systemctl reload logstash'
                     '''
                 }
             }
         }

         // Optional: Add a manual approval step before deploying to production
         stage('Approval for Production') {
             steps {
                 input 'Deploy to Production?'
             }
         }

         stage('Deploy to Production') {
              steps {
                 // Similar deployment steps for the production environment
                 // using production credentials
              }
         }
     }
 }
</code></pre>
</li>
</ol>
<h3>4. The GitOps Workflow in Practice</h3>
<p>This setup enables a controlled, auditable, and automated workflow:</p>
<ol>
<li>
<p><strong>Branch Creation:</strong> An engineer creates a feature branch in Git to propose a change (e.g., <code>feature/add-syslog-input</code>).</p>
</li>
<li>
<p><strong>Configuration Change:</strong> The engineer modifies or adds a pipeline configuration file in their branch.</p>
</li>
<li>
<p><strong>Pull Request:</strong> A pull request is created in GitHub. This action can trigger automated checks in Jenkins to validate the syntax of the proposed changes.</p>
</li>
<li>
<p><strong>Peer Review:</strong> Team members review the changes for logic, security, and adherence to standards.</p>
</li>
<li>
<p><strong>Merge and Deploy:</strong> Upon approval, the PR is merged into the <code>main</code> branch. This merge automatically triggers the Jenkins pipeline, which deploys the validated configuration to the corresponding Logstash environment.</p>
</li>
</ol>
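<p>The automated syntax check mentioned in step 3 can be implemented with Logstash's built-in config test flag. A minimal sketch of such a Jenkins stage (the stage name and pipeline path are illustrative, and Logstash is assumed to be installed on the build agent):</p>
<pre><code class="language-groovy">stage('Validate Configuration') {
    steps {
        // --config.test_and_exit parses the configuration and exits,
        // failing the build on any syntax error
        sh 'logstash --config.test_and_exit -f ${WORKSPACE}/pipelines/staging/'
    }
}
</code></pre>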
<h2>Best Practices for Enterprise Adoption</h2>
<p>To successfully implement this model at an enterprise scale, consider the following best practices:</p>
<ul>
<li>
<p><strong>Branching Strategy:</strong> Adopt a consistent branching strategy, such as GitFlow, to manage features, releases, and hotfixes in an orderly manner. Protect your <code>main</code> or <code>production</code> branches with rules that require PR reviews and passing status checks before merging.</p>
</li>
<li>
<p><strong>Scalability:</strong> For large-scale deployments with many Logstash nodes, use configuration management tools like Ansible, Puppet, or Chef within your Jenkins pipeline to orchestrate the deployment across your entire fleet.</p>
</li>
<li>
<p><strong>Fostering a GitOps Culture:</strong> Successful adoption is as much about people and processes as it is about tools. Provide training and documentation to ensure all stakeholders understand the workflow and their role within it. Emphasise the collaborative benefits and the shared responsibility for maintaining a stable and secure observability platform.</p>
</li>
<li>
<p><strong>Pipeline Observability</strong> (<em>Optional</em>): Monitoring the health and performance of your CI/CD pipelines is crucial and recommended for early detection of issues, visibility into bottlenecks, and auditability. Elastic Observability provides native support for monitoring Jenkins pipelines using the Elastic <a href="https://plugins.jenkins.io/opentelemetry/">CI/CD Observability plugin</a>.</p>
</li>
<li>
<p><strong>Secrets Management:</strong> Never hardcode sensitive information (passwords, API keys) in your configuration files. Use a secrets management tool like HashiCorp Vault or AWS Secrets Manager, and have Logstash retrieve these secrets at runtime.</p>
</li>
</ul>
<p>A sample <em>snippet</em> to retrieve the secret and store it securely in a <a href="https://www.elastic.co/docs/reference/logstash/keystore">Logstash keystore</a> could look as follows:</p>
<pre><code>...

stage('Update Logstash Secret') {
    // Define the secret path and key in Vault
    def secretPath = 'secret/logstash/production'
    def secretKey = 'elasticsearch_password'
    def keystoreKey = 'ES_PWD' // The key name to be used in the Logstash keystore

    // Wrap the steps in withVault; the plugin exposes each requested
    // secret as an environment variable inside the block
    withVault(configuration: [vaultUrl: 'http://your-vault-server:8200',
                              vaultCredentialId: 'vault-approle-creds'],
              vaultSecrets: [[path: secretPath,
                              secretValues: [[envVar: 'ES_PASSWORD', vaultKey: secretKey]]]]) {

        // Use SSH credentials to access the Logstash server (credentials id is illustrative)
        withCredentials([sshUserPrivateKey(credentialsId: 'prod-server-creds', keyFileVariable: 'KEY_FILE')]) {
            sh &quot;&quot;&quot;
            ssh -i ${KEY_FILE} user@logstash-host &lt;&lt;'ENDSSH'
            # Pipe the secret directly into the logstash-keystore command
            # This avoids writing the secret to disk or exposing it in the process list
            echo &quot;${ES_PASSWORD}&quot; | sudo -u logstash /usr/share/logstash/bin/logstash-keystore add ${keystoreKey} --stdin

            # After updating the keystore, reload Logstash to apply the change
            sudo systemctl reload logstash
            ENDSSH
            &quot;&quot;&quot;
        }
    }
}
...

</code></pre>
<h2>Conclusion</h2>
<p>Adopting a GitOps approach for managing Logstash pipelines is a strategic move that aligns observability with modern DevSecOps principles. It replaces manual, opaque processes with an automated, transparent, and collaborative framework. For enterprise organisations, this leads to a more secure, resilient, and efficient observability infrastructure, empowering teams to derive maximum value from their data while minimising operational overhead and risk.</p>
<p>The above example is just a start; there’s a lot more you can do once you lay the foundation—GitOps is just the beginning. From branching automation to pipeline promotion workflows to building self-service deployment portals, the possibilities are limited only by your creativity (and maybe your CI minutes).</p>
<p>GitOps lays the foundation. To see the whole picture, you need to monitor your pipelines. <a href="https://cloud.elastic.co/registration">Start a trial</a> and try out Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/cicd">CI/CD Observability solution</a> to track build health and deployment trends. It connects code changes to production behavior, giving you deep visibility into your new automated workflow.</p>
<p>Build smarter pipelines. Monitor what matters. And let your GitOps-powered observability stack become the quiet hero of your DevSecOps story. See how <a href="https://www.elastic.co/elasticsearch/streams">Streams</a> can supercharge your data engineering with the next generation of AI-powered log management &amp; log processing.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-pipeline-management-configuration-gitops/continuous-improvement.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Convert Logstash pipelines to OpenTelemetry Collector Pipelines]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-to-otel</link>
            <guid isPermaLink="false">logstash-to-otel</guid>
            <pubDate>Fri, 25 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This guide helps Logstash users transition to OpenTelemetry by demonstrating how to convert common Logstash pipelines into equivalent OpenTelemetry Collector configurations. We will focus on the log signal.]]></description>
            <content:encoded><![CDATA[<h1>Convert Logstash pipelines to OpenTelemetry Collector Pipelines</h1>
<h2>Introduction</h2>
<p>Elastic's observability strategy is increasingly aligned with OpenTelemetry. With the recent launch of <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry</a>, we’re expanding our offering to make it easier to use OpenTelemetry. The Elastic Agent now offers an <a href="https://www.elastic.co/guide/en/fleet/current/otel-agent.html">&quot;otel&quot; mode</a>, enabling it to run a custom distribution of the OpenTelemetry Collector and seamlessly enhancing your observability onboarding and experience with Elastic.</p>
<p>This post is designed to assist users familiar with Logstash transitioning to OpenTelemetry by demonstrating how to convert some standard Logstash pipelines into corresponding OpenTelemetry Collector configurations.</p>
<h2>What is OpenTelemetry Collector and why should I care?</h2>
<p><a href="https://opentelemetry.io/">OpenTelemetry</a> is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to re-instrument their observability when switching platforms.</p>
<p>By embracing OpenTelemetry, you gain these benefits:</p>
<ul>
<li><strong>Unified Observability</strong>: By using the OpenTelemetry Collector, you can collect and manage logs, metrics, and traces from a single tool, providing holistic observability into your system's performance and behavior. This simplifies monitoring and debugging in complex, distributed environments like microservices.</li>
<li><strong>Flexibility and Scalability</strong>: Whether you're running a small service or a large distributed system, the OpenTelemetry Collector can be scaled to handle the amount of data generated, offering the flexibility to deploy as an agent (running alongside applications) or as a gateway (a centralized hub).</li>
<li><strong>Open Standards</strong>: Since OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF), it ensures that you're working with widely accepted standards, contributing to the long-term sustainability and compatibility of your observability stack.</li>
<li><strong>Simplified Telemetry Pipelines</strong>: The ability to build pipelines using receivers, processors, and exporters simplifies telemetry management by centralizing data flows and minimizing the need for multiple agents.</li>
</ul>
<p>In the next sections, we will explain how OpenTelemetry Collector and Logstash pipelines are structured and clarify how the components of each correspond.</p>
<h2>OTEL Collector Configuration</h2>
<p>An OpenTelemetry Collector <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a> has different sections:</p>
<ul>
<li><strong>Receivers</strong>: Collect data from different sources.</li>
<li><strong>Processors</strong>: Transform the data collected by receivers.</li>
<li><strong>Exporters</strong>: Send data to one or more destinations (backends or other collectors).</li>
<li><strong>Connectors</strong>: Link two pipelines together.</li>
<li><strong>Service</strong>: Defines which components are active.
<ul>
<li><strong>Pipelines</strong>: Combine the defined receivers, processors, exporters, and connectors to process the data.</li>
<li><strong>Extensions</strong>: Optional components that expand the capabilities of the Collector to accomplish tasks not directly involved with processing telemetry data (e.g., health monitoring).</li>
<li><strong>Telemetry</strong>: Where you can set up observability for the Collector itself (e.g., logging and monitoring).</li>
</ul>
</li>
</ul>
<p>We can visualize it schematically as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/otel-config-schema.png" alt="otel-config-schema" /></p>
<p>We refer to the official documentation <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration | OpenTelemetry</a> for an in-depth introduction to the components.</p>
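<p>Putting these sections together, a minimal illustrative configuration that receives logs over OTLP, batches them, and prints them to the console could look as follows (the <code>otlp</code> receiver, <code>batch</code> processor, and <code>debug</code> exporter all ship with the standard Collector distributions):</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  debug:

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
</code></pre>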
<h2>Logstash pipeline definition</h2>
<p>A <a href="https://www.elastic.co/guide/en/logstash/current/configuration-file-structure.html">Logstash pipeline</a> is composed of three main components:</p>
<ul>
<li>Input Plugins: Allow us to read data from different sources</li>
<li>Filters Plugins: Allow us to transform and filter the data</li>
<li>Output Plugins: Allow us to send the data</li>
</ul>
<p>Logstash also has a special input and a special output that allow pipeline-to-pipeline communication; we can consider this a concept similar to an OpenTelemetry connector.</p>
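<p>For reference, this pipeline-to-pipeline wiring uses the <code>pipeline</code> input and output plugins with a virtual address. A minimal sketch (the pipeline roles and address name are illustrative):</p>
<pre><code class="language-ruby"># In the upstream pipeline definition
output {
    pipeline { send_to =&gt; [&quot;downstream-address&quot;] }
}

# In the downstream pipeline definition
input {
    pipeline { address =&gt; &quot;downstream-address&quot; }
}
</code></pre>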
<h2>Logstash pipeline compared to Otel Collector components</h2>
<p>We can schematize how Logstash Pipeline and OTEL Collector pipeline components can relate to each other as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-pipeline-to-otel-pipeline.png" alt="logstash-pipeline-to-otel-pipeline" /></p>
<p>Enough theory! Let us dive into some examples.</p>
<h2>Convert a Logstash Pipeline into OpenTelemetry Collector Pipeline</h2>
<h3>Example 1: Parse and transform log line</h3>
<p>Let's consider the below line:</p>
<pre><code>2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read the line from the file <code>/tmp/demo-line.log</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-access_log-default</code>.</li>
<li>Extract the <code>@timestamp</code>, <code>user.name</code>, <code>client.ip</code>, <code>client.port</code>, <code>url.path</code> and <code>http.status.code</code>.</li>
<li>Drop log messages related to the <code>SYSTEM</code> user.</li>
<li>Parse the date timestamp with the relevant date format and store it in <code>@timestamp</code>.</li>
<li>Add a code <code>http.status.code_description</code> based on known codes' descriptions.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo-line.log&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;access_log&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
    grok { #[3]
        match =&gt; {
            &quot;message&quot; =&gt; &quot;%{TIMESTAMP_ISO8601:[date]}: user %{WORD:[user][name]} accessed from %{IP:[client][ip]}:%{NUMBER:[client][port]:int} path %{URIPATH:[url][path]} with error %{NUMBER:[http][status][code]}&quot;
        }
    }
    if &quot;_grokparsefailure&quot; not in [tags] {
        if [user][name] == &quot;SYSTEM&quot; { #[4]
            drop {}
        }
        date { #[5]
            match =&gt; [&quot;[date]&quot;, &quot;ISO8601&quot;]
            target =&gt; &quot;[@timestamp]&quot;
            timezone =&gt; &quot;UTC&quot;
            remove_field =&gt; [ &quot;date&quot; ]
        }
        translate { #[6]
            source =&gt; &quot;[http][status][code]&quot;
            target =&gt; &quot;[http][status][code_description]&quot;
            dictionary =&gt; {
                &quot;200&quot; =&gt; &quot;OK&quot;
                &quot;403&quot; =&gt; &quot;Permission denied&quot;
                &quot;404&quot; =&gt; &quot;Not Found&quot;
                &quot;500&quot; =&gt; &quot;Server Error&quot;
            }
            fallback =&gt; &quot;Unknown error&quot;
        }
    }
}

output {
    elasticsearch { #[7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog: #[1]
    start_at: beginning
    include:
      - /tmp/demo-line.log
    include_file_name: false
    include_file_path: true
    storage: file_storage 
    operators:
    # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;access_log_otel&quot; 
    - type: add #[2]
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/grok: #[3]
    log_statements:
      - context: log
        statements:
        - 'merge_maps(attributes, ExtractGrokPatterns(attributes[&quot;event.original&quot;], &quot;%{TIMESTAMP_ISO8601:date}: user %{WORD:user.name} accessed from %{IP:client.ip}:%{NUMBER:client.port:int} path %{URIPATH:url.path} with error %{NUMBER:http.status.code}&quot;, true), &quot;insert&quot;)'

  filter/exclude_system_user:  #[4]
    error_mode: ignore
    logs:
      log_record:
        - attributes[&quot;user.name&quot;] == &quot;SYSTEM&quot;

  transform/parse_date: #[5]
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;date&quot;], &quot;%Y-%m-%dT%H:%M:%S&quot;))
          - delete_key(attributes, &quot;date&quot;)
        conditions:
          - attributes[&quot;date&quot;] != nil

  transform/translate_status_code:  #[6]
    log_statements:
      - context: log
        conditions:
        - attributes[&quot;http.status.code&quot;] != nil
        statements:
        - set(attributes[&quot;http.status.code_description&quot;], &quot;OK&quot;)                where attributes[&quot;http.status.code&quot;] == &quot;200&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Permission Denied&quot;) where attributes[&quot;http.status.code&quot;] == &quot;403&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Not Found&quot;)         where attributes[&quot;http.status.code&quot;] == &quot;404&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Server Error&quot;)      where attributes[&quot;http.status.code&quot;] == &quot;500&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Unknown Error&quot;)     where attributes[&quot;http.status.code_description&quot;] == nil

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - resourcedetection/system
        - transform/grok
        - filter/exclude_system_user
        - transform/parse_date
        - transform/translate_status_code
      exporters:
        - elasticsearch
</code></pre>
<p>This will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-09-20T08:33:27.000Z&quot;,
    &quot;client&quot;: {
        &quot;ip&quot;: &quot;89.66.167.22&quot;,
        &quot;port&quot;: 10592
    },
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;access_log&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;http&quot;: {
        &quot;status&quot;: {
            &quot;code&quot;: &quot;404&quot;,
            &quot;code_description&quot;: &quot;Not Found&quot;
        }
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;path&quot;: &quot;/tmp/demo-line.log&quot;
        }
    },
    &quot;message&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;,
    &quot;url&quot;: {
        &quot;path&quot;: &quot;/blog&quot;
    },
    &quot;user&quot;: {
        &quot;name&quot;: &quot;frank&quot;
    }
}
</code></pre>
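<p>Since grok patterns are named regular expressions under the hood, the extraction in step 3 can be sanity-checked locally. A rough Python equivalent of the pattern (plain <code>re</code>, with simplified stand-ins for the grok sub-patterns):</p>
<pre><code class="language-python">import re

line = '2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404'

# Simplified stand-ins for TIMESTAMP_ISO8601, WORD, IP, NUMBER and URIPATH
pattern = (r'(\S+): user (\w+) accessed from ([\d.]+):(\d+) '
           r'path (\S+) with error (\d+)')

date, user, ip, port, path, status = re.match(pattern, line).groups()
print(user, ip, status)  # frank 89.66.167.22 404
</code></pre>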
<h3>Example 2: Parse and transform a NDJSON-formatted log file</h3>
<p>Let's consider the below json line:</p>
<pre><code class="language-json">{&quot;log_level&quot;:&quot;INFO&quot;,&quot;message&quot;:&quot;User login successful&quot;,&quot;service&quot;:&quot;auth-service&quot;,&quot;timestamp&quot;:&quot;2024-10-11 12:34:56.123 +0100&quot;,&quot;user&quot;:{&quot;id&quot;:&quot;A1230&quot;,&quot;name&quot;:&quot;john_doe&quot;}}
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read a line from the file <code>/tmp/demo.ndjson</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-json-default</code></li>
<li>Parse the JSON and assign relevant keys and values.</li>
<li>Parse the date.</li>
<li>Override the message field.</li>
<li>Rename fields to follow ECS conventions.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo.ndjson&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;json&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
  if [message] =~ /^\{.*/ {
    json { #[3] &amp; #[5]
        source =&gt; &quot;message&quot;
    }
  }
  date { #[4]
    match =&gt; [&quot;[timestamp]&quot;, &quot;yyyy-MM-dd HH:mm:ss.SSS Z&quot;]
    remove_field =&gt; &quot;[timestamp]&quot;
  }
  mutate {
    rename =&gt; { #[6]
      &quot;service&quot; =&gt; &quot;[service][name]&quot;
      &quot;log_level&quot; =&gt; &quot;[log][level]&quot;
    }
  }
}


output {
    elasticsearch { # [7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog/json: # [1]
    include: 
      - /tmp/demo.ndjson
    retry_on_failure:
      enabled: true
    start_at: beginning
    storage: file_storage 
    operators:
     # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;      
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;otel&quot; #[2]
    - type: add
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;     


extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/json_parse:  #[3]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), &quot;upsert&quot;)
        conditions: 
          - IsMatch(body, &quot;^\\{&quot;)
      

  transform/parse_date:  #[4]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;timestamp&quot;], &quot;%Y-%m-%d %H:%M:%S.%L %z&quot;))
          - delete_key(attributes, &quot;timestamp&quot;)
        conditions: 
          - attributes[&quot;timestamp&quot;] != nil

  transform/override_message_field: #[5]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(body, attributes[&quot;message&quot;])
          - delete_key(attributes, &quot;message&quot;)

  transform/set_log_severity: # [6]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes[&quot;log_level&quot;])          

  attributes/rename_attributes: #[6]
    actions:
      - key: service.name
        from_attribute: service
        action: insert
      - key: service
        action: delete
      - key: log_level
        action: delete

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs/json:
      receivers: 
        - filelog/json
      processors:
        - resourcedetection/system    
        - transform/json_parse
        - transform/parse_date        
        - transform/override_message_field
        - transform/set_log_severity
        - attributes/rename_attributes
      exporters: 
        - elasticsearch

</code></pre>
<p>This will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-10-11T11:34:56.123000000Z&quot;,
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;otel&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;{\&quot;log_level\&quot;:\&quot;INFO\&quot;,\&quot;message\&quot;:\&quot;User login successful\&quot;,\&quot;service\&quot;:\&quot;auth-service\&quot;,\&quot;timestamp\&quot;:\&quot;2024-10-11 12:34:56.123 +0100\&quot;,\&quot;user\&quot;:{\&quot;id\&quot;:\&quot;A1230\&quot;,\&quot;name\&quot;:\&quot;john_doe\&quot;}}&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;name&quot;: &quot;demo.ndjson&quot;
        },
        &quot;level&quot;: &quot;INFO&quot;
    },
    &quot;message&quot;: &quot;User login successful&quot;,
    &quot;service&quot;: {
        &quot;name&quot;: &quot;auth-service&quot;
    },
    &quot;user&quot;: {
        &quot;id&quot;: &quot;A1230&quot;,
        &quot;name&quot;: &quot;john_doe&quot;
    }
}

</code></pre>
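<p>As a quick sanity check of the date handling in step 4, the sample timestamp can be parsed in Python with the equivalent <code>strptime</code> directives; note that the <code>+0100</code> offset shifts the UTC result back by one hour:</p>
<pre><code class="language-python">from datetime import datetime, timezone

# Same field order as the OTTL format string '%Y-%m-%d %H:%M:%S.%L %z'
ts = datetime.strptime('2024-10-11 12:34:56.123 +0100', '%Y-%m-%d %H:%M:%S.%f %z')
print(ts.astimezone(timezone.utc).isoformat())  # 2024-10-11T11:34:56.123000+00:00
</code></pre>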
<h2>Conclusion</h2>
<p>In this post, we showed examples of how to convert a typical Logstash pipeline into an OpenTelemetry Collector pipeline for logs. While OpenTelemetry provides powerful tools for collecting and exporting logs, if your pipeline relies on complex transformations or scripting, Logstash remains a superior choice. This is because Logstash offers a broader range of built-in features and a more flexible approach to handling advanced data manipulation tasks.</p>
<h2>What's Next?</h2>
<p>Now that you've seen basic (but realistic) examples of converting a Logstash pipeline to OpenTelemetry, it's your turn to dive deeper. Depending on your needs, you can explore further and find more detailed resources in the following repositories:</p>
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector">OpenTelemetry Collector</a>: Learn about the core OpenTelemetry components, from receivers to exporters.</li>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">OpenTelemetry Collector Contrib</a>: Find community-contributed components for a wider range of integrations and features.</li>
<li><a href="https://github.com/elastic/opentelemetry-collector-components">Elastic's opentelemetry-collector-components</a>: Dive into Elastic's extensions for the OpenTelemetry Collector, offering more tailored features for Elastic Stack users.</li>
</ul>
<p>If you encounter specific challenges or need to handle more advanced use cases, these repositories will be an excellent resource for discovering additional components or integrations that can enhance your pipeline. All these repositories have a similar structure with folders named <code>receiver</code>, <code>processor</code>, <code>exporter</code>, <code>connector</code>, which should be familiar after reading this blog. Whether you are migrating a simple Logstash pipeline or tackling more complex data transformations, these tools and communities will provide the support you need for a successful OpenTelemetry implementation.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-otel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Managing your applications on Amazon ECS EC2-based clusters with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manage-applications-amazon-ecs-ec2-clusters-observability</link>
            <guid isPermaLink="false">manage-applications-amazon-ecs-ec2-clusters-observability</guid>
            <pubDate>Tue, 15 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to manage applications on Amazon ECS clusters based on EC2 instances and how simple it is to use Elastic agents with the AWS and docker integrations to provide a complete picture of your apps, ECS service, and corresponding EC2 instances.]]></description>
            <content:encoded><![CDATA[<p>In previous blogs, we explored how Elastic Observability can help you monitor various AWS services and analyze them effectively:</p>
<ul>
<li><a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">Managing fundamental AWS services such as Amazon EC2, Amazon RDS, Amazon VPC, and NAT gateway</a></li>
<li><a href="https://www.elastic.co/blog/aws-kinesis-data-firehose-elastic-observability-analytics">Data can be ingested into Elastic observability using a serverless forwarder or Amazon Kinesis Data Firehose</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">Ingesting and analyzing AWS VPC Flow logs</a></li>
</ul>
<p>One of the more heavily used AWS container services is Amazon ECS (Elastic Container Service). While there is a trend toward using Fargate to simplify the setup and management of ECS clusters, many users still prefer using Amazon ECS with EC2 instances. It may not be as straightforward or efficient as AWS Fargate, but it offers more control over the underlying infrastructure.</p>
<p>In our most recent blog, we explored how <a href="https://www.elastic.co/blog/elastic-agent-monitor-ecs-aws-fargate-elastic-observability">Elastic Observability helps manage Amazon ECS with Fargate</a>. In this blog, we will review how to manage an Amazon ECS cluster with EC2 instances using Elastic Observability.</p>
<p>In general, when setting up Amazon ECS-based clusters with EC2, you may or may not have access to the EC2 instances. This determines how you can monitor your EC2-based ECS cluster with Elastic Observability. There are two components you can use:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-1-amazon-ecs.png" alt="amazon ecs" /></p>
<p>As you can see in the diagram above, the two components are:</p>
<ol>
<li>
<p><strong>Baseline setup:</strong> The Elastic Agent running the AWS integration is configured to obtain ECS metrics and logs from CloudWatch. This agent runs on an instance that is not part of the ECS cluster because it allows you to see ALL ECS clusters and other AWS services, such as EKS, RDS, and EC2.</p>
</li>
<li>
<p><strong>Additional setup:</strong> If you have access to the EC2 instances in the ECS cluster, you can run Elastic’s Docker integration on each EC2 instance. This gives you significantly more detail on the containers than AWS Container Insights, and it does not require AWS CloudWatch, which can be fairly costly.</p>
</li>
</ol>
<p>Whether you use just the baseline setup or also the additional setup, you will have to set up AWS CloudWatch Container Insights for the ECS cluster. The Docker integration in the additional setup, however, can supplement AWS CloudWatch Container Insights with extra detail.</p>
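<p>For reference, Container Insights can be enabled on an existing ECS cluster with a single AWS CLI call (the cluster name is illustrative):</p>
<pre><code class="language-bash"># Enable CloudWatch Container Insights for one ECS cluster
aws ecs update-cluster-settings \
  --cluster my-ecs-cluster \
  --settings name=containerInsights,value=enabled
</code></pre>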
<p>Hence, we will review how you can monitor the various components of an EC2-based ECS cluster:</p>
<ul>
<li>EC2 instances in the ASG group</li>
<li>ECS services running in the ECS cluster</li>
<li>ECS tasks (containers)</li>
</ul>
<p>Also, we will review how you can obtain metrics and logs from the ECS cluster with and without AWS Cloudwatch. We’ll show you how to use:</p>
<ul>
<li>AWS CloudWatch Container Insights (from Cloudwatch)</li>
<li>Docker metrics (non-Cloudwatch)</li>
<li>Amazon ECS logs via Cloudwatch</li>
</ul>
<h2>Prerequisites and configuration</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) — ensure that you have both.</li>
<li>A <a href="https://hub.docker.com/_/nginx">nginx</a> container and a <a href="https://github.com/containerstack/alpine-stress">stress container</a> — we will use these two basic containers to help highlight the load on the Elastic ECS Cluster.</li>
<li>An ECS EC2 Cluster in an Auto Scaling Group — ensure you have access in order to load up the Elastic agent on the EC2 instances, or you can create an AMI and use that as the baseline image for your ECS cluster.</li>
<li>An EC2 instance anywhere in your account that is not part of the ECS cluster and has public access (to send metrics and logs)</li>
</ul>
<h2>What will you see in Elastic Observability once it's all set up?</h2>
<p>Suppose you use the baseline configuration: an ECS EC2 cluster with AWS CloudWatch Container Insights enabled, plus an Elastic Agent configured with the following integrations:</p>
<ul>
<li>ECS integration</li>
<li>EC2 integration</li>
<li>AWS Cloudwatch Integration with metrics and logging</li>
</ul>
<p>Then you will be able to get the following information in Elastic dashboards:</p>
<ul>
<li>Containers in the cluster (AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
<li>Services in the cluster (AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
<li>CPU and memory utilization of the ECS Cluster (Elastic Agent with ECS integration)</li>
<li>EC2 CPU and memory utilization of the instance in the cluster (Elastic Agent with EC2 integration)</li>
<li>CPU and memory utilization per container (via AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-2-containers-in-cluster.png" alt="containers in cluster" /></p>
<p>If you also apply the additional setup, an Elastic Agent with the Docker integration on each ECS EC2 instance, you will get a direct feed of metrics from Docker. The following metrics can be viewed:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-3-metrics-graphs.png" alt="metrics graphs" /></p>
<p>Let’s see how to set this all up.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up an ECS EC2 cluster and potentially setting up your own AMI</li>
<li>Setting up the containers <a href="https://hub.docker.com/_/nginx">nginx</a> and a <a href="https://github.com/containerstack/alpine-stress">stress container</a></li>
<li>Setting up the Elastic agent with docker container integration on the ECS EC2 instances</li>
<li>Setting up the Elastic agent with AWS, Cloudwatch, and ECS integrations on an independent EC2 instance</li>
</ul>
<h3>Step 1: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-4-free-trial.png" alt="free trial" /></p>
<h3>Step 2: Set up an ECS Cluster with EC2 instances</h3>
<p>When creating a cluster, you have two options when setting it up using the console:</p>
<ul>
<li>Create a new ASG, where you are limited to the preloaded set of Amazon Linux (2 or 2023) based AMIs</li>
<li>Set up your own ASG prior to creating the ECS cluster and select it from the options. This option gives you more control over the Linux version and lets you bake things like the Elastic Agent into the AMI used for the instances in the ASG.</li>
</ul>
<p>Whichever option you choose, you will need to turn on <strong>Container Insights</strong> (see the bottom part of the image below).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-5-infrastructure.png" alt="infrastructure" /></p>
<p>Once the cluster is set up, you can go to AWS CloudWatch, where you should see Container Insights for your cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-6-container-insights.png" alt="container insights" /></p>
<h3>Step 3: Set up Elastic agent with docker integration</h3>
<p>Next, you will need to add an Elastic Agent to each one of the instances. In Elastic Cloud, set up an agent policy with the Docker and System integrations, like so:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-7-ecs-ec2-cluster-policy.png" alt="cluster policy" /></p>
<p>Next, add an agent for the policy, then copy the appropriate install script (in our case it was Linux since we were running Amazon Linux 2), and run it on every EC2 instance in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-8-add-agent.png" alt="add agent" /></p>
<p>Once this is done, you should see the agents in Fleet, one per EC2 instance:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-9-fleet.png" alt="fleet" /></p>
<p>If you decide to set up an ECS EC2 cluster with your own ASG and don’t use the Amazon Linux AMIs (2 or 2023 version), you will have to:</p>
<ul>
<li>Pick your base image to base an AMI on</li>
<li><a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html">Add an ECS agent and register each instance to the AMI base image manually</a></li>
<li><a href="https://www.elastic.co/guide/en/fleet/current/install-standalone-elastic-agent.html">Add the Elastic agent — standalone version</a> — this step will require you to configure your Elastic endpoint and API key (or simply add the script in the “add agent” part of the configuration above when using the UI)</li>
<li>Create the AMI once all the above components are added</li>
<li>Use the newly created AMI in creating the ASG for ECS cluster</li>
</ul>
<h3>Step 4: Set up an Elastic agent with the AWS integration</h3>
<p>From the integrations tab in Elastic Cloud, select the AWS integration and click <strong>Add agent</strong>. You will then walk through the configuration of the AWS integration.</p>
<p>At a minimum, ensure that you have the following configuration options turned on:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-10-toggles.png" alt="toggles" /></p>
<p>This will ensure that not only EC2 metrics and logs are ingested but that all CloudWatch metrics and logs are also ingested. ECS metrics and logs are stored in CloudWatch.</p>
<p>If you want to ensure only logs from the specific ECS cluster are ingested, you can also restrict what to ingest by several parameters. In our setup, we are collecting only logs from Log Group with a prefix of /aws/ecs/containerinsights/EC2BasedCluster/.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-11-cloudwatch.png" alt="cloudwatch" /></p>
<p>Once this policy is set up, add an agent just as you did in Step 3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-12-add-agent-testing-aws.png" alt="add agent testing aws" /></p>
<p>However, this agent needs to be added to an EC2 instance which is independent of the ECS cluster.</p>
<p>Once installed, this agent will help pull in:</p>
<ul>
<li>All EC2 instance metrics across your account (which can be adjusted in the integration policy)</li>
<li>AWS CloudWatch Container Insights data from ECS</li>
<li>ECS metrics such as:
<ul>
<li>aws.ecs.metrics.CPUReservation.avg</li>
<li>aws.ecs.metrics.CPUUtilization.avg</li>
<li>aws.ecs.metrics.GPUReservation.avg</li>
<li>aws.ecs.metrics.MemoryReservation.avg</li>
<li>aws.ecs.metrics.MemoryUtilization.avg</li>
<li><a href="https://docs.elastic.co/integrations/aws/ecs">More - see the full list here</a></li>
</ul>
</li>
</ul>
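<p>Once ingested, these ECS metric fields follow a predictable <code>aws.ecs.metrics.&lt;MetricName&gt;.&lt;stat&gt;</code> pattern, which is handy if you build dashboard queries programmatically. A minimal sketch (the helper name here is ours, not part of the integration):</p>

```go
package main

import "fmt"

// ecsMetricField builds the Elastic field name for an AWS ECS
// CloudWatch metric, e.g. "aws.ecs.metrics.CPUUtilization.avg".
func ecsMetricField(metric, stat string) string {
	return fmt.Sprintf("aws.ecs.metrics.%s.%s", metric, stat)
}

func main() {
	// Print the field names for the metrics listed above.
	for _, m := range []string{"CPUReservation", "CPUUtilization", "MemoryReservation", "MemoryUtilization"} {
		fmt.Println(ecsMetricField(m, "avg"))
	}
}
```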
<h3>Step 5: Setting up services and containers</h3>
<p>For this configuration, we used an <a href="https://hub.docker.com/_/nginx">nginx</a> container and a <a href="https://github.com/containerstack/alpine-stress">stress container</a>.</p>
<p>In order to launch the services and containers on ECS, you will need to set up a task for each of these containers. More importantly, you will need to ensure that both of the following roles in each task definition:</p>
<p>&quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,</p>
<p>&quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,</p>
<p>have the following permissions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-13-permissions.png" alt="permissions" /></p>
<p>Most importantly, you should ensure that this permission is added:</p>
<p>AmazonEC2ContainerServiceforEC2Role</p>
<p>It will ensure containers can be brought up on the EC2 instances in the cluster.</p>
<p>Once you have the right permissions, then set up the following tasks.</p>
<p>Here is the task JSON for NGINX:</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;NGINX&quot;,
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;nginx&quot;,
      &quot;image&quot;: &quot;nginx:latest&quot;,
      &quot;cpu&quot;: 0,
      &quot;portMappings&quot;: [
        {
          &quot;name&quot;: &quot;nginx-80-tcp&quot;,
          &quot;containerPort&quot;: 80,
          &quot;hostPort&quot;: 80,
          &quot;protocol&quot;: &quot;tcp&quot;,
          &quot;appProtocol&quot;: &quot;http&quot;
        }
      ],
      &quot;essential&quot;: true,
      &quot;environment&quot;: [],
      &quot;environmentFiles&quot;: [],
      &quot;mountPoints&quot;: [],
      &quot;volumesFrom&quot;: [],
      &quot;ulimits&quot;: [],
      &quot;logConfiguration&quot;: {
        &quot;logDriver&quot;: &quot;awslogs&quot;,
        &quot;options&quot;: {
          &quot;awslogs-create-group&quot;: &quot;true&quot;,
          &quot;awslogs-group&quot;: &quot;/ecs/&quot;,
          &quot;awslogs-region&quot;: &quot;us-west-2&quot;,
          &quot;awslogs-stream-prefix&quot;: &quot;ecs&quot;
        },
        &quot;secretOptions&quot;: []
      }
    }
  ],
  &quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;networkMode&quot;: &quot;awsvpc&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;EC2&quot;],
  &quot;cpu&quot;: &quot;256&quot;,
  &quot;memory&quot;: &quot;512&quot;,
  &quot;runtimePlatform&quot;: {
    &quot;cpuArchitecture&quot;: &quot;X86_64&quot;,
    &quot;operatingSystemFamily&quot;: &quot;LINUX&quot;
  }
}
</code></pre>
<p>Here is the task JSON for stress container:</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;stressLoad&quot;,
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;stressLoad&quot;,
      &quot;image&quot;: &quot;containerstack/alpine-stress&quot;,
      &quot;cpu&quot;: 0,
      &quot;memory&quot;: 512,
      &quot;memoryReservation&quot;: 512,
      &quot;portMappings&quot;: [],
      &quot;essential&quot;: true,
      &quot;entryPoint&quot;: [&quot;sh&quot;, &quot;-c&quot;],
      &quot;command&quot;: [
        &quot;/usr/local/bin/stress --cpu 2 --io 2 --vm 1 --vm-bytes 128M --timeout 6000s&quot;
      ],
      &quot;environment&quot;: [],
      &quot;mountPoints&quot;: [],
      &quot;volumesFrom&quot;: [],
      &quot;logConfiguration&quot;: {
        &quot;logDriver&quot;: &quot;awslogs&quot;,
        &quot;options&quot;: {
          &quot;awslogs-create-group&quot;: &quot;true&quot;,
          &quot;awslogs-group&quot;: &quot;/ecs/&quot;,
          &quot;awslogs-region&quot;: &quot;us-west-2&quot;,
          &quot;awslogs-stream-prefix&quot;: &quot;ecs&quot;
        }
      }
    }
  ],
  &quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;networkMode&quot;: &quot;awsvpc&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;EC2&quot;],
  &quot;cpu&quot;: &quot;256&quot;,
  &quot;memory&quot;: &quot;512&quot;,
  &quot;runtimePlatform&quot;: {
    &quot;cpuArchitecture&quot;: &quot;X86_64&quot;,
    &quot;operatingSystemFamily&quot;: &quot;LINUX&quot;
  }
}
</code></pre>
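<p>Before registering either definition, it can save a round-trip to catch a missing role ARN locally. This sketch (our own helper, not part of the blog's setup) unmarshals the task JSON with Go's standard library and checks the two role fields:</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskDef captures only the fields we want to sanity-check.
type taskDef struct {
	Family           string `json:"family"`
	TaskRoleArn      string `json:"taskRoleArn"`
	ExecutionRoleArn string `json:"executionRoleArn"`
}

// checkTaskDef returns an error if the JSON is invalid or either
// role ARN is missing.
func checkTaskDef(raw []byte) error {
	var td taskDef
	if err := json.Unmarshal(raw, &td); err != nil {
		return fmt.Errorf("invalid task JSON: %w", err)
	}
	if td.TaskRoleArn == "" || td.ExecutionRoleArn == "" {
		return fmt.Errorf("task %q is missing a role ARN", td.Family)
	}
	return nil
}

func main() {
	raw := []byte(`{"family": "NGINX", "taskRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole", "executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole"}`)
	fmt.Println(checkTaskDef(raw)) // <nil> when both ARNs are present
}
```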
<p>Once you have defined the tasks, ensure you bring up each service (one for each task) with the launch type of EC2:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-14-environment.png" alt="environment" /></p>
<p>You should have two services running now.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-15-ec2basedcluster.png" alt="ec2basedcluster" /></p>
<h3>Step 6: Check on metrics and logs in Elastic Cloud</h3>
<p>Go to Elastic Cloud and ensure that you are getting metrics and logs from the ECS Cluster. First, check to see if you are receiving metrics by viewing the built-in dashboard called [Metrics Docker] Overview.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-16-docker.png" alt="Docker image" /></p>
<p><strong>With some work on this dashboard, adding in Container Insights metrics and Docker metrics, you should be able to see:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-3-metrics-graphs.png" alt="graphs" /></p>
<p>If you only have the ECS integration and the Elastic Agent from Step 4, then you will need to create a new dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-2-containers-in-cluster.png" alt="cluster" /></p>
<p>This dashboard can be set up with the following metrics:</p>
<ul>
<li>Containers in the cluster (containerInsights via Elastic Agent and AWS Cloudwatch integration). Set up a TSVB panel using the following metric: aws.dimensions.ClusterName : &quot;EC2BasedCluster&quot; with aws.containerinsights.metrics.TaskCount.max</li>
<li>Services in the cluster (containerInsights via Elastic Agent and AWS Cloudwatch integration). Use the following configuration to setup the chart:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-17-table.png" alt="table" /></p>
<ul>
<li>CPU and memory utilization of the ECS Cluster (Elastic Agent with ECS integration). Use the following configuration to set up both CPU and memory utilization charts:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-18-line.png" alt="line" /></p>
<ul>
<li>EC2 CPU and memory utilization of the instance in the cluster (Elastic Agent with EC2 integration). Use the following configuration to set up both CPU and memory utilization charts:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-19-bar-vertical-stacked.png" alt="bar vertical stacked" /></p>
<ul>
<li>(Not shown): CPU and memory utilization per container (via containerInsights via Elastic Agent and AWS Cloudwatch integration)</li>
</ul>
<h3>Step 7: Look at logs from your ECS cluster</h3>
<p>Since we set up AWS CloudWatch logs collection in Step 4, we can view these logs in Discover by filtering on the log group ARN /aws/ecs/containerinsights/EC2BasedCluster/.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-20-logs.png" alt="logs" /></p>
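<p>A filter along these lines scopes Discover to that Container Insights log group. Note that the field name <code>aws.cloudwatch.log_group</code> is an assumption here and may differ across AWS integration versions:</p>

```
aws.cloudwatch.log_group : "/aws/ecs/containerinsights/EC2BasedCluster/*"
```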
<h2>Summary</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help with <a href="https://www.elastic.co/observability/aws-monitoring">AWS monitoring</a> of your ECS service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingesting and analysis of AWS ECS service metrics and the corresponding EC2 metrics through the AWS integration on the Elastic Agent. It’s easy to set up ingest from AWS Services via the Elastic Agent.</li>
<li>Elastic Observability can also get container metrics via the Docker integration running on Elastic agents on each of the EC2 instances in the ECS EC2 auto scaling group.</li>
<li>Elastic has multiple out-of-the-box (OOTB) AWS service dashboards that can be used as baselines to get your own customized view.</li>
</ul>
<p>Ready to get started? Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/library-branding-elastic-observability-midnight-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of Go applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-apps-opentelemetry</guid>
            <pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Go applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Go packages and, in particular, work on instrumenting tracing in a Go application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and have a large support community reducing vendor lock-in.</p>
<p>In this blog post, we will show you how to manually instrument Go applications using OpenTelemetry. This approach is slightly more complex than using auto-instrumentation, but it gives you finer control over what gets instrumented.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the OpenTelemetry demo and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry. In this blog, we will use <a href="https://github.com/elastic/observability-examples">an alternative demo application</a>, which helps highlight manual instrumentation in a simple way.</p>
<p>Finally, we will discuss how Elastic supports mixed-mode applications, which run with Elastic and OpenTelemetry agents. The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/GO-flowhcart.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis and its alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Go application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Go</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code including the Dockerfile used in this blog can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple Go application, which can receive a GET request. Note that the code shown here is a slightly abbreviated version.</p>
<pre><code class="language-go">package main

import (
	&quot;log&quot;
	&quot;net/http&quot;
	&quot;os&quot;
	&quot;time&quot;

	&quot;github.com/go-redis/redis/v8&quot;

	&quot;github.com/sirupsen/logrus&quot;

	&quot;github.com/gin-gonic/gin&quot;
	&quot;strconv&quot;
	&quot;math/rand&quot;
)

var logger = &amp;logrus.Logger{
	Out:   os.Stderr,
	Hooks: make(logrus.LevelHooks),
	Level: logrus.InfoLevel,
	Formatter: &amp;logrus.JSONFormatter{
		FieldMap: logrus.FieldMap{
			logrus.FieldKeyTime:  &quot;@timestamp&quot;,
			logrus.FieldKeyLevel: &quot;log.level&quot;,
			logrus.FieldKeyMsg:   &quot;message&quot;,
			logrus.FieldKeyFunc:  &quot;function.name&quot;, // non-ECS
		},
		TimestampFormat: time.RFC3339Nano,
	},
}

func main() {
	delayTime, _ := strconv.Atoi(os.Getenv(&quot;TOGGLE_SERVICE_DELAY&quot;))

	redisHost := os.Getenv(&quot;REDIS_HOST&quot;)
	if redisHost == &quot;&quot; {
		redisHost = &quot;localhost&quot;
	}

	redisPort := os.Getenv(&quot;REDIS_PORT&quot;)
	if redisPort == &quot;&quot; {
		redisPort = &quot;6379&quot;
	}

	applicationPort := os.Getenv(&quot;APPLICATION_PORT&quot;)
	if applicationPort == &quot;&quot; {
		applicationPort = &quot;5000&quot;
	}

	// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})

	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)

	r.GET(&quot;/favorites&quot;, func(c *gin.Context) {
		// artificial sleep for delayTime
		time.Sleep(time.Duration(delayTime) * time.Millisecond)

		userID := c.Query(&quot;user_id&quot;)

		contextLogger(c).Infof(&quot;Getting favorites for user %q&quot;, userID)

		favorites, err := rdb.SMembers(c.Request.Context(), userID).Result()
		if err != nil {
			contextLogger(c).Error(&quot;Failed to get favorites for user %q&quot;, userID)
			c.String(http.StatusInternalServerError, &quot;Failed to get favorites&quot;)
			return
		}

		contextLogger(c).Infof(&quot;User %q has favorites %q&quot;, userID, favorites)

		c.JSON(http.StatusOK, gin.H{
			&quot;favorites&quot;: favorites,
		})
	})

	// Start server
	logger.Infof(&quot;App startup&quot;)
	log.Fatal(http.ListenAndServe(&quot;:&quot;+applicationPort, r))
	logger.Infof(&quot;App stopped&quot;)
}
</code></pre>
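<p>The <code>/favorites</code> handler above expects the user ID as a query parameter. As a quick client-side sketch, the request URL can be built with the standard library (host and port here are assumptions matching the defaults above):</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// favoritesURL builds the GET URL the service above serves,
// e.g. http://localhost:5000/favorites?user_id=1.
func favoritesURL(host, port, userID string) string {
	q := url.Values{}
	q.Set("user_id", userID) // query-escapes the user ID
	return fmt.Sprintf("http://%s:%s/favorites?%s", host, port, q.Encode())
}

func main() {
	fmt.Println(favoritesURL("localhost", "5000", "1"))
}
```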
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/elastic-blog-4-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional packages to our application.</p>
<pre><code class="language-go">import (
      &quot;github.com/go-redis/redis/extra/redisotel/v8&quot;
      &quot;go.opentelemetry.io/otel&quot;
      &quot;go.opentelemetry.io/otel/attribute&quot;
      &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace&quot;
    &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc&quot;

	&quot;go.opentelemetry.io/otel/propagation&quot;

	&quot;google.golang.org/grpc/credentials&quot;
	&quot;crypto/tls&quot;

      sdktrace &quot;go.opentelemetry.io/otel/sdk/trace&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin&quot;

	&quot;go.opentelemetry.io/otel/trace&quot;
	&quot;go.opentelemetry.io/otel/codes&quot;
)
</code></pre>
<p>This code imports necessary OpenTelemetry packages, including those for tracing, exporting, and instrumenting specific libraries like Redis.</p>
<p>Next, we read the &quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot; variable and initialize the exporter.</p>
<pre><code class="language-go">var (
    collectorURL = os.Getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;)
)
var tracer trace.Tracer


func initTracer() func(context.Context) error {
	tracer = otel.Tracer(&quot;go-favorite-otel-manual&quot;)

	// remove https:// from the collector URL if it exists
	collectorURL = strings.Replace(collectorURL, &quot;https://&quot;, &quot;&quot;, 1)
	secretToken := os.Getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;)
	if secretToken == &quot;&quot; {
		log.Fatal(&quot;ELASTIC_APM_SECRET_TOKEN is required&quot;)
	}

	secureOption := otlptracegrpc.WithInsecure()
    exporter, err := otlptrace.New(
        context.Background(),
        otlptracegrpc.NewClient(
            secureOption,
            otlptracegrpc.WithEndpoint(collectorURL),
			otlptracegrpc.WithHeaders(map[string]string{
				&quot;Authorization&quot;: &quot;Bearer &quot; + secretToken,
			}),
			otlptracegrpc.WithTLSCredentials(credentials.NewTLS(&amp;tls.Config{})),
        ),
    )

    if err != nil {
        log.Fatal(err)
    }

    otel.SetTracerProvider(
        sdktrace.NewTracerProvider(
            sdktrace.WithSampler(sdktrace.AlwaysSample()),
            sdktrace.WithBatcher(exporter),
        ),
    )
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.Baggage{},
			propagation.TraceContext{},
		),
	)
    return exporter.Shutdown
}
</code></pre>
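<p>The exporter above builds the Authorization header from ELASTIC_APM_SECRET_TOKEN directly. The full example later in this post instead reads the standard OTEL_EXPORTER_OTLP_HEADERS variable, which holds comma-separated key=value pairs. That parsing needs nothing beyond the standard library; the sketch below uses <code>strings.SplitN</code> so that '=' characters inside a value (common in bearer tokens) survive, a slight hardening over a plain <code>strings.Split</code>:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseOTLPHeaders converts the OTEL_EXPORTER_OTLP_HEADERS format
// ("key1=value1,key2=value2") into a map suitable for
// otlptracegrpc.WithHeaders.
func parseOTLPHeaders(raw string) map[string]string {
	headers := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		// SplitN keeps any '=' inside the value intact, which matters
		// for values like "Authorization=Bearer abc==".
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) == 2 {
			headers[strings.TrimSpace(parts[0])] = parts[1]
		}
	}
	return headers
}

func main() {
	h := parseOTLPHeaders("Authorization=Bearer secret,X-Tenant=demo")
	fmt.Println(h["Authorization"], h["X-Tenant"])
}
```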
<p>To instrument connections to Redis, we add a tracing hook to the client, and to instrument Gin, we add the OTel middleware. Since Gin is then fully instrumented, all interactions with our application are captured automatically, as are all outgoing connections to Redis.</p>
<pre><code class="language-go">// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})
	rdb.AddHook(redisotel.NewTracingHook())
	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)
	r.Use(otelgin.Middleware(&quot;go-favorite-otel-manual&quot;))
</code></pre>
<p><strong>Adding custom spans</strong><br />
Now that we have everything added and initialized, we can add custom spans.</p>
<p>If we want to have additional instrumentation for a part of our app, we simply start a custom span and then defer ending the span.</p>
<pre><code class="language-go">// start otel span
ctx := c.Request.Context()
ctx, span := tracer.Start(ctx, &quot;add_favorite_movies&quot;)
defer span.End()
</code></pre>
<p>For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite-otel-manual">GitHub</a>.</p>
<pre><code class="language-go">package main

import (
	&quot;log&quot;
	&quot;net/http&quot;
	&quot;os&quot;
	&quot;time&quot;
	&quot;context&quot;

	&quot;github.com/go-redis/redis/v8&quot;
	&quot;github.com/go-redis/redis/extra/redisotel/v8&quot;


	&quot;github.com/sirupsen/logrus&quot;

	&quot;github.com/gin-gonic/gin&quot;

  &quot;go.opentelemetry.io/otel&quot;
  &quot;go.opentelemetry.io/otel/attribute&quot;
  &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace&quot;
  &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc&quot;

	&quot;go.opentelemetry.io/otel/propagation&quot;

	&quot;google.golang.org/grpc/credentials&quot;
	&quot;crypto/tls&quot;

  sdktrace &quot;go.opentelemetry.io/otel/sdk/trace&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin&quot;

	&quot;go.opentelemetry.io/otel/trace&quot;

	&quot;strings&quot;
	&quot;strconv&quot;
	&quot;math/rand&quot;
	&quot;go.opentelemetry.io/otel/codes&quot;

)

var tracer trace.Tracer

func initTracer() func(context.Context) error {
	tracer = otel.Tracer(&quot;go-favorite-otel-manual&quot;)

	collectorURL = strings.Replace(collectorURL, &quot;https://&quot;, &quot;&quot;, 1)

	secureOption := otlptracegrpc.WithInsecure()

	// split otlpHeaders by comma and convert to map
	headers := make(map[string]string)
	for _, header := range strings.Split(otlpHeaders, &quot;,&quot;) {
		headerParts := strings.Split(header, &quot;=&quot;)

		if len(headerParts) == 2 {
			headers[headerParts[0]] = headerParts[1]
		}
	}

	exporter, err := otlptrace.New(
		context.Background(),
		otlptracegrpc.NewClient(
			secureOption,
			otlptracegrpc.WithEndpoint(collectorURL),
			otlptracegrpc.WithHeaders(headers),
			otlptracegrpc.WithTLSCredentials(credentials.NewTLS(&amp;tls.Config{})),
		),
	)
	if err != nil {
		log.Fatal(err)
	}

	otel.SetTracerProvider(
		sdktrace.NewTracerProvider(
			sdktrace.WithSampler(sdktrace.AlwaysSample()),
			sdktrace.WithBatcher(exporter),
			//sdktrace.WithResource(resources),
		),
	)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.Baggage{},
			propagation.TraceContext{},
		),
	)
	return exporter.Shutdown
}

var (
	collectorURL = os.Getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;)
	otlpHeaders  = os.Getenv(&quot;OTEL_EXPORTER_OTLP_HEADERS&quot;)
)


var logger = &amp;logrus.Logger{
	Out:   os.Stderr,
	Hooks: make(logrus.LevelHooks),
	Level: logrus.InfoLevel,
	Formatter: &amp;logrus.JSONFormatter{
		FieldMap: logrus.FieldMap{
			logrus.FieldKeyTime:  &quot;@timestamp&quot;,
			logrus.FieldKeyLevel: &quot;log.level&quot;,
			logrus.FieldKeyMsg:   &quot;message&quot;,
			logrus.FieldKeyFunc:  &quot;function.name&quot;, // non-ECS
		},
		TimestampFormat: time.RFC3339Nano,
	},
}

func main() {
	cleanup := initTracer()
	defer cleanup(context.Background())

	redisHost := os.Getenv(&quot;REDIS_HOST&quot;)
	if redisHost == &quot;&quot; {
		redisHost = &quot;localhost&quot;
	}

	redisPort := os.Getenv(&quot;REDIS_PORT&quot;)
	if redisPort == &quot;&quot; {
		redisPort = &quot;6379&quot;
	}

	applicationPort := os.Getenv(&quot;APPLICATION_PORT&quot;)
	if applicationPort == &quot;&quot; {
		applicationPort = &quot;5000&quot;
	}

	// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})
	rdb.AddHook(redisotel.NewTracingHook())


	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)
	r.Use(otelgin.Middleware(&quot;go-favorite-otel-manual&quot;))


	// Define routes
	r.GET(&quot;/&quot;, func(c *gin.Context) {
		contextLogger(c).Infof(&quot;Main request successful&quot;)
		c.String(http.StatusOK, &quot;Hello World!&quot;)
	})

	r.GET(&quot;/favorites&quot;, func(c *gin.Context) {
		// artificial sleep for delayTime
		time.Sleep(time.Duration(delayTime) * time.Millisecond)

		userID := c.Query(&quot;user_id&quot;)

		contextLogger(c).Infof(&quot;Getting favorites for user %q&quot;, userID)

		favorites, err := rdb.SMembers(c.Request.Context(), userID).Result()
		if err != nil {
			contextLogger(c).Errorf(&quot;Failed to get favorites for user %q&quot;, userID)
			c.String(http.StatusInternalServerError, &quot;Failed to get favorites&quot;)
			return
		}

		contextLogger(c).Infof(&quot;User %q has favorites %q&quot;, userID, favorites)

		c.JSON(http.StatusOK, gin.H{
			&quot;favorites&quot;: favorites,
		})
	})

	// Start server
	logger.Infof(&quot;App startup&quot;)
	log.Fatal(http.ListenAndServe(&quot;:&quot;+applicationPort, r))
}
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/">OTEL documentation</a>, we will use environment variables and pass in the configuration values that are found in your APM Agent’s configuration section.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Where to get these variables in Elastic Cloud and Kibana<sup>®</sup></strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/elastic-blog-GO-apm-agents.png" alt="GO apm agents" /></p>
<p>You will need to copy the OTEL_EXPORTER_OTLP_ENDPOINT as well as the OTEL_EXPORTER_OTLP_HEADERS.</p>
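<p>For reference, the value of OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs, which the sample app splits into a header map inside initTracer. A standalone sketch of that parsing follows; note it uses SplitN rather than the sample's Split, a slight tightening so that values containing an equals sign (such as padded Bearer tokens) survive intact:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseOTLPHeaders converts a comma-separated list of key=value pairs
// (the OTEL_EXPORTER_OTLP_HEADERS format) into a header map, mirroring
// the loop in the sample app's initTracer.
func parseOTLPHeaders(otlpHeaders string) map[string]string {
	headers := make(map[string]string)
	for _, header := range strings.Split(otlpHeaders, ",") {
		// SplitN keeps any '=' characters inside the value intact
		headerParts := strings.SplitN(header, "=", 2)
		if len(headerParts) == 2 {
			headers[headerParts[0]] = headerParts[1]
		}
	}
	return headers
}

func main() {
	h := parseOTLPHeaders("Authorization=Bearer abc123==,X-Tenant=movies")
	fmt.Println(h["Authorization"]) // Bearer abc123==
	fmt.Println(h["X-Tenant"])      // movies
}
```

Pairs without an equals sign are silently skipped, matching the sample app's behavior.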
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t  go-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=go-favorite-otel-manual&quot; \
       -p 5000:5000 \
       go-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using Docker compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites
# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h2>How do the traces show up in Elastic?</h2>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Go service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/GO-trace-samples.png" alt="trace samples" /></p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Go with OpenTelemetry</li>
<li>How to properly initialize OpenTelemetry and add a custom span</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Go with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-apps-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/observability-launch-series-5-go-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of Java applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-java-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-java-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry provides an observability framework for cloud-native software, allowing us to trace, monitor, and debug applications seamlessly. In this post, we'll explore how to manually instrument a Java application using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring—it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles: from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">manual instrumentation in your Java</a> application using Docker, without the need to refactor any part of your application code. We will use an <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">automatic instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several microservices written in .NET, Node.js, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis effort, and alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Java application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Java</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>In particular, we will be working through the following file:</p>
<pre><code class="language-bash">Elastiflix/java-favorite/src/main/java/com/movieapi/ApiServlet.java
</code></pre>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Set up OpenTelemetry</h3>
<p>The first step is to set up the OpenTelemetry SDK in your Java application. You can start by adding the OpenTelemetry Java SDK and its dependencies to your project's build file, such as Maven or Gradle. In our example application, we are using Maven. Add the dependencies below to your pom.xml:</p>
<pre><code class="language-xml">&lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry.instrumentation&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-logback-mdc-1.0&lt;/artifactId&gt;
      &lt;version&gt;1.25.1-alpha&lt;/version&gt;
    &lt;/dependency&gt;

    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-api&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-sdk&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-exporter-otlp&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-semconv&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-exporter-otlp-logs&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry.instrumentation&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-logback-appender-1.0&lt;/artifactId&gt;
      &lt;version&gt;1.25.1-alpha&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<p>And add the following bill of materials from OpenTelemetry too:</p>
<pre><code class="language-xml">&lt;dependencyManagement&gt;
    &lt;dependencies&gt;
      &lt;dependency&gt;
        &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
        &lt;artifactId&gt;opentelemetry-bom&lt;/artifactId&gt;
        &lt;version&gt;1.25.0&lt;/version&gt;
        &lt;type&gt;pom&lt;/type&gt;
        &lt;scope&gt;import&lt;/scope&gt;
      &lt;/dependency&gt;
      &lt;dependency&gt;
        &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
        &lt;artifactId&gt;opentelemetry-bom-alpha&lt;/artifactId&gt;
        &lt;version&gt;1.25.0-alpha&lt;/version&gt;
        &lt;type&gt;pom&lt;/type&gt;
        &lt;scope&gt;import&lt;/scope&gt;
      &lt;/dependency&gt;
    &lt;/dependencies&gt;
  &lt;/dependencyManagement&gt;
</code></pre>
<h3>Step 2. Add the application configuration</h3>
<p>We recommend that you add the following configuration to the application’s main method, to start before any application code. Doing it like this gives you a bit more control and flexibility and ensures that OpenTelemetry will be available at any stage of the application lifecycle. In the examples, we put this code before the Spring Boot Application startup. Elastic supports OTLP over HTTP and OTLP over GRPC. In this example, we are using GRPC.</p>
<pre><code class="language-java">String SERVICE_NAME = System.getenv(&quot;OTEL_SERVICE_NAME&quot;);

// set service name on all OTel signals
Resource resource = Resource.getDefault().merge(Resource.create(Attributes.of(
        ResourceAttributes.SERVICE_NAME, SERVICE_NAME,
        ResourceAttributes.SERVICE_VERSION, &quot;1.0&quot;,
        ResourceAttributes.DEPLOYMENT_ENVIRONMENT, &quot;production&quot;)));

// init OTel logger provider with export to OTLP
SdkLoggerProvider sdkLoggerProvider = SdkLoggerProvider.builder().setResource(resource)
        .addLogRecordProcessor(BatchLogRecordProcessor.builder(OtlpGrpcLogRecordExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// init OTel trace provider with export to OTLP
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder().setResource(resource).setSampler(Sampler.alwaysOn())
        .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// init OTel meter provider with export to OTLP
SdkMeterProvider sdkMeterProvider = SdkMeterProvider.builder().setResource(resource)
        .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// create sdk object and set it as global
OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
        .setTracerProvider(sdkTracerProvider)
        .setLoggerProvider(sdkLoggerProvider)
        .setMeterProvider(sdkMeterProvider)
        .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
        .build();

GlobalOpenTelemetry.set(sdk);
// connect logger
GlobalLoggerProvider.set(sdk.getSdkLoggerProvider());
// Add hook to close SDK, which flushes logs
Runtime.getRuntime().addShutdownHook(new Thread(sdk::close));
</code></pre>
<h3>Step 3. Create the Tracer and start the OpenTelemetry Span inside the TracingFilter</h3>
<p>In the Spring Boot example, you will notice that we have a TracingFilter class, which extends the OncePerRequestFilter class. This Filter is a component placed at the front of the request processing chain. Its primary roles are to intercept incoming requests and outgoing responses, performing tasks such as logging, authentication, transformation of request/response entities, and more. So what we do here is intercept the request as it comes into the Favorite service, so that we can pull out the headers, which may contain tracing information from upstream systems.</p>
<p>We start by using the OpenTelemetry Tracer, which is a core component of OpenTelemetry that allows you to create spans, start and stop them, and add attributes and events. In your Java code, import the necessary OpenTelemetry classes and create an instance of the Tracer within your application.</p>
<p>We use this to create a new downstream span, which will continue as a child of the span created in the upstream system, using the information we got from the upstream request. In our Elastiflix example, this will be the Node.js application.</p>
<pre><code class="language-java">@Override
protected void doFilterInternal(jakarta.servlet.http.HttpServletRequest request, jakarta.servlet.http.HttpServletResponse response, jakarta.servlet.FilterChain filterChain) throws jakarta.servlet.ServletException, IOException {
        Tracer tracer = GlobalOpenTelemetry.getTracer(SERVICE_NAME);

        Context extractedContext = GlobalOpenTelemetry.getPropagators()
                .getTextMapPropagator()
                .extract(Context.current(), request, getter);

        Span span = tracer.spanBuilder(request.getRequestURI())
                .setSpanKind(SpanKind.SERVER)
                .setParent(extractedContext)
                .startSpan();

        try (Scope scope = span.makeCurrent()) {
            filterChain.doFilter(request, response);
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
</code></pre>
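<p>Under the hood, what the propagator extracts from the request is the W3C trace context, carried in a <code>traceparent</code> header of the form <code>00-&lt;trace-id&gt;-&lt;span-id&gt;-&lt;flags&gt;</code>. As a rough, illustrative sketch of that wire format (this is not the OTel implementation, which you should always use in real code):</p>

```java
public class TraceparentDemo {
    // Split a W3C traceparent header (version-traceId-spanId-flags) into its
    // four fields; returns null when the header does not have the expected shape.
    static String[] parseTraceparent(String header) {
        if (header == null) {
            return null;
        }
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            return null;
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parseTraceparent(
                "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
        // prints the 32-char trace id and 16-char parent span id
        System.out.println("trace-id: " + parts[1]);
        System.out.println("span-id: " + parts[2]);
    }
}
```

The extract call in the filter above reads this header from the incoming request and reconstructs the parent span context, which is why the server span it starts joins the caller's trace instead of beginning a new one.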
<h3>Step 4. Instrument other interesting code with spans</h3>
<p>To instrument with spans and track specific regions of your code, you can use the Tracer's SpanBuilder to create spans. To accurately measure the duration of a specific operation, make sure to start and stop the spans at the appropriate locations in your code. Use the startSpan and endSpan methods provided by the Tracer to mark the beginning and end of the span. For example, you can create a span around a specific method or operation in your code, as shown here in the handleCanary method:</p>
<pre><code class="language-java">private void handleCanary() throws Exception {
    Span span = GlobalOpenTelemetry.getTracer(SERVICE_NAME).spanBuilder(&quot;handleCanary&quot;).startSpan();
    Scope scope = span.makeCurrent();

    // ...

    span.setStatus(StatusCode.OK);
    span.end();
    scope.close();
}
</code></pre>
<h3>Step 5. Add attributes and events to spans</h3>
<p>You can enhance the spans with additional attributes and events to provide more context and details about the operation being tracked. Attributes can be key-value pairs that describe the span, while events can be used to mark significant points in the span's lifecycle. This is also shown in the handleCanary method:</p>
<pre><code class="language-java">private void handleCanary() throws Exception {
    Span.current().setAttribute(&quot;canary&quot;, &quot;test-new-feature&quot;);
    Span.current().setAttribute(&quot;quiz_solution&quot;, &quot;correlations&quot;);

    Span.current().addEvent(&quot;a span event&quot;, Attributes
            .of(AttributeKey.longKey(&quot;someKey&quot;), Long.valueOf(93)));
}
</code></pre>
<h3>Step 6. Instrument backends</h3>
<p>Let's consider an example where we are instrumenting a Redis database call. We're using the Java OpenTelemetry SDK, and our goal is to create a trace that captures each &quot;Post User Favorites&quot; operation to the database.</p>
<p>Below is the Java method that performs the operation and collects telemetry data:</p>
<pre><code class="language-java">public void postUserFavorites(String user_id, String movieID) {
  ...
}
</code></pre>
<p>Let's go through it line by line:</p>
<p><strong>Initializing a span</strong><br />
The first important line of our method is where we initialize a span. A span represents a single operation within a trace, which could be a database call, a remote procedure call (RPC), or any segment of code that you want to measure.</p>
<pre><code class="language-java">Span span = GlobalOpenTelemetry.getTracer(SERVICE_NAME).spanBuilder(&quot;Redis.Post&quot;).setSpanKind(SpanKind.CLIENT).startSpan();
</code></pre>
<p><strong>Setting span attributes</strong><br />
Next, we add attributes to our span. Attributes are key-value pairs that provide additional information about the span. In order to get the backend call to appear correctly in the service map, it is critical that the attributes are set correctly for the backend call type. In this example, we set the db.system attribute to redis.</p>
<pre><code class="language-java">span.setAttribute(&quot;db.system&quot;, &quot;redis&quot;);
span.setAttribute(&quot;db.connection_string&quot;, redisHost);
span.setAttribute(
  &quot;db.statement&quot;,
  &quot;POST user_id &quot; + user_id + &quot; AND movie_id &quot; + movieID
);
</code></pre>
<p>This will ensure that calls to the Redis backend are tracked, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-3-flowchart.png" alt="flowchart" /></p>
<p><strong>Capturing the result of the operation</strong><br />
We then execute the operation we're interested in, within a try-catch block. If an exception occurs during the execution of the operation, we record it in the span.</p>
<pre><code class="language-java">try (Scope scope = span.makeCurrent()) {
    ...
} catch (Exception e) {
    span.setStatus(StatusCode.ERROR, &quot;Error while getting data from Redis&quot;);
    span.recordException(e);
}
</code></pre>
<p><strong>Closing resources</strong><br />
Finally, we close the Redis connection and end the span.</p>
<pre><code class="language-java">finally {
    jedis.close();
    span.end();
}
</code></pre>
<h3>Step 7. Configure logging</h3>
<p>Logging is an essential part of application monitoring and troubleshooting. OpenTelemetry allows you to integrate with existing logging frameworks, such as Logback or Log4j, to capture logs along with the telemetry data. Configure the logging framework of your choice to capture logs related to the instrumented spans. In our example application, check out the logback configuration, which shows how to export logs directly to Elastic.</p>
<pre><code class="language-xml">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;configuration debug=&quot;true&quot;&gt;

    &lt;appender name=&quot;otel-otlp&quot;
        class=&quot;io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender&quot;&gt;
        &lt;captureExperimentalAttributes&gt;false&lt;/captureExperimentalAttributes&gt;
        &lt;captureCodeAttributes&gt;true&lt;/captureCodeAttributes&gt;
        &lt;captureKeyValuePairAttributes&gt;true&lt;/captureKeyValuePairAttributes&gt;
    &lt;/appender&gt;

    &lt;appender name=&quot;STDOUT&quot; class=&quot;ch.qos.logback.core.ConsoleAppender&quot;&gt;
        &lt;encoder&gt;
            &lt;pattern&gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&lt;/pattern&gt;
        &lt;/encoder&gt;
    &lt;/appender&gt;

    &lt;root level=&quot;DEBUG&quot;&gt;
     &lt;appender-ref ref=&quot;otel-otlp&quot; /&gt;
        &lt;appender-ref ref=&quot;STDOUT&quot; /&gt;

    &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>
<h3>Step 8. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OTEL Java documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variable:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
</code></pre>
<p>As well as the token from:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the Docker image</strong></p>
<pre><code class="language-bash">docker build -t java-otel-manual-image .
</code></pre>
<p><strong>Run the Docker image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot; \
       -e ELASTIC_APM_SECRET_TOKEN=&quot;REPLACE WITH TOKEN&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;java-favorite-otel-manual&quot; \
       -p 5000:5000 \
       java-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 9. Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /favorites), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-5-services.png" alt="services" /></p>
<p>It begins by tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-6-java-fave-otel.png" alt="java favorite otel graph" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-7-graph1.png" alt="graph2" /></p>
<p>Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-8-graph2.png" alt="graph3" /></p>
<p>This gives you complete visibility across logs, metrics, and traces!</p>
<h2>Wrapping up</h2>
<p>Manually instrumenting your Java applications with OpenTelemetry gives you greater control over what to track and monitor. By following the steps outlined in this blog post, you can effectively monitor the performance of your Java applications, identify issues, and gain insights into the overall health of your application.</p>
<p>Remember, OpenTelemetry is a powerful tool, and proper instrumentation requires careful consideration of what metrics, traces, and logs are essential for your specific use case. Experiment with different configurations, leverage the OpenTelemetry SDK for Java documentation, and continuously iterate to achieve the observability goals of your application.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Java with OpenTelemetry</li>
<li>How to properly initialize and instrument spans</li>
<li>How to easily set <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> from Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provided an easy-to-understand walk-through of instrumenting Java with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-java-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/observability-launch-series-3-java-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of .NET applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-net-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-net-apps-opentelemetry</guid>
            <pubDate>Fri, 01 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we will look at how to manually instrument your .NET applications using OpenTelemetry, which provides a set of APIs, libraries, and agents to capture distributed traces and metrics from your application. You can analyze them in Elastic.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
<li>Explore manual instrumentation, enabling customized data collection to fit your unique needs.</li>
<li>Ensure monitoring consistency across layers with a standardized observability data framework.</li>
<li>Decouple development from operations, driving peak efficiency for both.</li>
</ul>
<p>In this post, we will dive into the methodology to instrument a .NET application manually using Docker.</p>
<h2>What's covered?</h2>
<ul>
<li>Instrumenting the .NET application manually</li>
<li>Creating a Docker image for a .NET application with the OpenTelemetry instrumentation baked in</li>
<li>Building and running the instrumented application in Docker, and sending traces to Elastic Observability</li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>An understanding of Docker and .NET</li>
<li>Elastic Cloud</li>
<li>Docker installed on your machine (we recommend Docker Desktop)</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/elastic-blog-2-free-trial.png" alt="" /></p>
<h2>Step 1. Getting started</h2>
<p>In our demonstration, we will manually instrument a .NET Core application - Login. This application simulates a simple user login service. In this example, we are only looking at Tracing since the OpenTelemetry logging instrumentation is currently at mixed maturity, as mentioned <a href="https://opentelemetry.io/docs/instrumentation/">here</a>.</p>
<p>The application has the following files:</p>
<ol>
<li>
<p>Program.cs</p>
</li>
<li>
<p>Startup.cs</p>
</li>
<li>
<p>Telemetry.cs</p>
</li>
<li>
<p>LoginController.cs</p>
</li>
</ol>
<h2>Step 2. Instrumenting the application</h2>
<p>When it comes to OpenTelemetry, the .NET ecosystem presents some unique aspects. While OpenTelemetry offers its own API, .NET implements OpenTelemetry's Tracing API through its native <strong>System.Diagnostics</strong> API. Pre-existing constructs such as <strong>ActivitySource</strong> and <strong>Activity</strong> are aptly repurposed to comply with OpenTelemetry.</p>
<p>That said, understanding the OpenTelemetry API and its terminology remains crucial for .NET developers. It's pivotal to gaining full command over instrumenting your applications, and as we've seen, it also extends to understanding elements of the <strong>System.Diagnostics</strong> API.</p>
<p>For those who might lean toward using the original OpenTelemetry APIs over the <strong>System.Diagnostics</strong> ones, there is also a way: OpenTelemetry provides an API shim for tracing. It enables developers to use OpenTelemetry APIs directly, and you can find more details in the OpenTelemetry API Shim documentation.</p>
<p>By integrating such practices into your .NET application, you can take full advantage of the powerful features OpenTelemetry provides, irrespective of whether you're using OpenTelemetry's API or the <strong>System.Diagnostics</strong> API.</p>
<p>In this blog, we stick to the default method and use the Activity convention that the <strong>System.Diagnostics</strong> API dictates.</p>
<p>To manually instrument a .NET application, you need to make changes in each of these files. Let's take a look at these changes one by one.</p>
<h3>Program.cs</h3>
<p>This is the entry point for our application. Here, we create an instance of IHostBuilder with default configurations. Notice how we set up a console logger with Serilog.</p>
<pre><code class="language-csharp">public static void Main(string[] args)
{
    Log.Logger = new LoggerConfiguration().WriteTo.Console().CreateLogger();
    CreateHostBuilder(args).Build().Run();
}
</code></pre>
<h3>Startup.cs</h3>
<p>In the <strong>Startup</strong>.cs file, we use the <strong>ConfigureServices</strong> method to add the OpenTelemetry Tracing.</p>
<pre><code class="language-csharp">public void ConfigureServices(IServiceCollection services)
{
    services.AddOpenTelemetry().WithTracing(builder =&gt; builder
        .AddSource(&quot;Login&quot;)
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter()
        .ConfigureResource(resource =&gt;
            resource.AddService(
                serviceName: &quot;Login&quot;))
    );
    services.AddControllers();
}
</code></pre>
<p>The WithTracing method enables tracing in OpenTelemetry. We add the OTLP (OpenTelemetry Protocol) exporter, a general-purpose telemetry data delivery protocol, and AddAspNetCoreInstrumentation, which automatically collects traces from our application. This is a critically important step that is not mentioned in the OpenTelemetry docs: without this call, instrumentation did not work for the Login application.</p>
<h3>Telemetry.cs</h3>
<p>This file contains the definition of our ActivitySource. The ActivitySource represents the source of the telemetry activities. It is named after the service name for your application, and this name can come from a configuration file, constants file, etc. We can use this ActivitySource to start activities.</p>
<pre><code class="language-csharp">using System.Diagnostics;

public static class Telemetry
{
    //...

    // Name it after the service name for your app.
    // It can come from a config file, constants file, etc.
    public static readonly ActivitySource LoginActivitySource = new(&quot;Login&quot;);

    //...
}
</code></pre>
<p>In our case, we've created an <strong>ActivitySource</strong> named <strong>Login</strong>. In our <strong>LoginController</strong>.cs, we use this <strong>LoginActivitySource</strong> to start a new activity when we begin our operations.</p>
<pre><code class="language-csharp">using (Activity activity = Telemetry.LoginActivitySource.StartActivity(&quot;SomeWork&quot;))
{
    // Perform operations here
}
</code></pre>
<p>This piece of code starts a new activity named <strong>SomeWork</strong>, performs some operations (in this case, generating a random user and logging them in), and then ends the activity. These activities are traced and can be analyzed later to understand the performance of the operations.</p>
<p>This <strong>ActivitySource</strong> is fundamental to OpenTelemetry's manual instrumentation. It represents the source of the activities and provides a way to start and stop activities.</p>
<h3>LoginController.cs</h3>
<p>In the <strong>LoginController</strong>.cs file, we are tracing the operations performed by the GET and POST methods. We start a new activity, <strong>SomeWork</strong>, before we begin our operations and dispose of it once we're done.</p>
<pre><code class="language-csharp">using (Activity activity = Telemetry.LoginActivitySource.StartActivity(&quot;SomeWork&quot;))
{
    var user = GenerateRandomUserResponse();
    Log.Information(&quot;User logged in: {UserName}&quot;, user);
    return user;
}
</code></pre>
<p>This will track the time taken by these operations and send this data to any configured telemetry backend via the OTLP exporter.</p>
<h2>Step 3. Base image setup</h2>
<p>Now that we have our application source code created and instrumented, it’s time to create a Dockerfile to build and run our .NET Login service.</p>
<p>Start with the .NET runtime image for the base layer of our Dockerfile:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app
EXPOSE 8000
</code></pre>
<p>Here, we're setting up the application's runtime environment.</p>
<h2>Step 4. Building the .NET application</h2>
<p>This is one of Docker's most useful features. Here, we compile our .NET application using the SDK image in a second build stage. Previously, teams would build on a different platform and then copy the compiled code into the Docker container; by building inside Docker all the way through, we can be much more confident the build will replicate from a developer's desktop into production.</p>
<pre><code class="language-dockerfile">FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;login.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./login.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;login.csproj&quot; -c Release -o /app/build
</code></pre>
<p>This section ensures that our .NET code is properly restored and compiled.</p>
<h2>Step 5. Publishing the application</h2>
<p>Once built, we'll publish the app:</p>
<pre><code class="language-dockerfile">FROM build AS publish
RUN dotnet publish &quot;login.csproj&quot; -c Release -o /app/publish
</code></pre>
<h2>Step 6. Preparing the final image</h2>
<p>Now, let's set up the final runtime image:</p>
<pre><code class="language-dockerfile">FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
</code></pre>
<h2>Step 7. Entry point setup</h2>
<p>Lastly, set the Docker image's entry point to start our .NET application:</p>
<pre><code class="language-dockerfile">ENTRYPOINT [&quot;/bin/bash&quot;, &quot;-c&quot;, &quot;dotnet login.dll&quot;]
</code></pre>
<h2>Step 8. Running the Docker image with environment variables</h2>
<p>To build and run the Docker image, you'd typically follow these steps:</p>
<h3>Build the Docker image</h3>
<p>First, you'd want to build the Docker image from your Dockerfile. Let's assume the Dockerfile is in the current directory, and you'd like to name/tag your image dotnet-login-otel-image.</p>
<pre><code class="language-bash">docker build -t dotnet-login-otel-image .
</code></pre>
<h3>Run the Docker image</h3>
<p>After building the image, you'd run it with the specified environment variables. For this, the docker <strong>run</strong> command is used with the -e flag for each environment variable.</p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot; \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;${ELASTIC_APM_SERVER_URL}&quot; \
       -e OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;dotnet-login-otel-manual&quot; \
       -e OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
       dotnet-login-otel-image
</code></pre>
<p>Make sure that <code>${ELASTIC_APM_SECRET_TOKEN}</code> and <code>${ELASTIC_APM_SERVER_URL}</code> are set in your shell environment, replace them with their actual values from the cloud as shown below.</p>
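<p>A quick way to avoid a silent misconfiguration is to fail fast when these variables are unset. A minimal sketch (a hypothetical check, shown in JavaScript for brevity):</p>
<pre><code class="language-javascript">// Return the names of required environment variables that are unset
// or empty (hypothetical helper, for illustration only).
function missingEnvVars(env, required) {
  return required.filter(function (name) {
    return !env[name] || env[name].length === 0;
  });
}

var missing = missingEnvVars(process.env, [
  'ELASTIC_APM_SECRET_TOKEN',
  'ELASTIC_APM_SERVER_URL',
]);
if (missing.length !== 0) {
  console.error('Missing required variables: ' + missing.join(', '));
}
</code></pre>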
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You can also use an environment file with docker run --env-file to make the command less verbose if you have multiple environment variables.</p>
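<p>For example, a hypothetical env file (placeholder values, not real credentials) might look like this:</p>
<pre><code class="language-bash"># otel.env: replace the placeholder values with your own
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-apm-server.example.com:443
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer your-secret-token
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
OTEL_SERVICE_NAME=dotnet-login-otel-manual
</code></pre>
<p>You would then start the container with <code>docker run --env-file otel.env dotnet-login-otel-image</code>.</p>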
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /login), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/services-2.png" alt="services" /></p>
<p>It begins by tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/manual-net-login.png" alt="login" /></p>
<p>And look at specific transactions, including the “SomeWork” activity/span we created in the code above:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/latency_distribution_graph.png" alt="latency distribution graph" /></p>
<p>There is clearly an outlier here, where one transaction took over 20ms. This is likely to be due to the CLR warming up.</p>
<h2>Wrapping up</h2>
<p>With the code here instrumented and the Dockerfile bootstrapping the application, you've transformed your simple .NET application into one that's instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument .NET with OpenTelemetry.</li>
<li>How to build and start the instrumented application using standard commands in a Dockerfile.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can instrument their applications with ease, gaining immediate insights into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic supports a mix of ingestion methods, whether auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then using OpenTelemetry across your applications later on, in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-net-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/observability-launch-series-4-net-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation with OpenTelemetry for Node.js applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-nodejs-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-nodejs-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Node.js applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Node.js libraries and in particular work on instrumenting tracing in a Node.js application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and have a large support community reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic®, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">manual instrumentation for OpenTelemetry</a> with the Node.js service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">auto-instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-1-config.png" alt="Configuration" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis, and alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Node.js application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Node.js</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple index.js file that can receive a POST request. See the full code <a href="https://github.com/elastic/observability-examples/blob/main/Elastiflix/node-server-otel-manual/index.js">here</a>.</p>
<pre><code class="language-javascript">const pino = require(&quot;pino&quot;);
const ecsFormat = require(&quot;@elastic/ecs-pino-format&quot;);
const log = pino({ ...ecsFormat({ convertReqRes: true }) });
const expressPino = require(&quot;express-pino-logger&quot;)({ logger: log });

var API_ENDPOINT_FAVORITES =
  process.env.API_ENDPOINT_FAVORITES || &quot;127.0.0.1:5000&quot;;
API_ENDPOINT_FAVORITES = API_ENDPOINT_FAVORITES.split(&quot;,&quot;);

const express = require(&quot;express&quot;);
const cors = require(&quot;cors&quot;)({ origin: true });
const cookieParser = require(&quot;cookie-parser&quot;);
const { json } = require(&quot;body-parser&quot;);

const PORT = process.env.PORT || 3001;

const app = express().use(cookieParser(), cors, json(), expressPino);

const axios = require(&quot;axios&quot;);

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  res.status(500).json({ error: err.message, code: err.code });
});

var favorites = {};

app.post(&quot;/api/favorites&quot;, (req, res) =&gt; {
  var randomIndex = Math.floor(Math.random() * API_ENDPOINT_FAVORITES.length);
  if (process.env.THROW_NOT_A_FUNCTION_ERROR == &quot;true&quot; &amp;&amp; Math.random() &lt; 0.5) {
    // randomly choose one of the endpoints
    axios
      .post(
        &quot;http://&quot; +
          API_ENDPOINT_FAVORITES[randomIndex] +
          &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        // quiz solution: &quot;42&quot;
        res.jsonn({ favorites: favorites });
      })
      .catch(function (error) {
        res.json({ error: error, favorites: [] });
      });
  } else {
    axios
      .post(
        &quot;http://&quot; +
          API_ENDPOINT_FAVORITES[randomIndex] +
          &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        res.json({ favorites: favorites });
      })
      .catch(function (error) {
        res.json({ error: error, favorites: [] });
      });
  }
});

app.listen(PORT, () =&gt; {
  console.log(`Server listening on ${PORT}`);
});
</code></pre>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional modules to our application.</p>
<pre><code class="language-javascript">const opentelemetry = require(&quot;@opentelemetry/api&quot;);
const { NodeTracerProvider } = require(&quot;@opentelemetry/sdk-trace-node&quot;);
const { BatchSpanProcessor } = require(&quot;@opentelemetry/sdk-trace-base&quot;);
const {
  OTLPTraceExporter,
} = require(&quot;@opentelemetry/exporter-trace-otlp-grpc&quot;);
const { Resource } = require(&quot;@opentelemetry/resources&quot;);
const {
  SemanticResourceAttributes,
} = require(&quot;@opentelemetry/semantic-conventions&quot;);

const { registerInstrumentations } = require(&quot;@opentelemetry/instrumentation&quot;);
const { HttpInstrumentation } = require(&quot;@opentelemetry/instrumentation-http&quot;);
const {
  ExpressInstrumentation,
} = require(&quot;@opentelemetry/instrumentation-express&quot;);
</code></pre>
<p>We start by creating a collectorOptions object with parameters such as the url and headers for connecting to the Elastic APM Server or OpenTelemetry collector.</p>
<pre><code class="language-javascript">const collectorOptions = {
  url: OTEL_EXPORTER_OTLP_ENDPOINT,
  headers: OTEL_EXPORTER_OTLP_HEADERS,
};
</code></pre>
<p>In order to pass additional parameters to OpenTelemetry, we will read the OTEL_RESOURCE_ATTRIBUTES variable and convert it into an object.</p>
<pre><code class="language-javascript">const envAttributes = process.env.OTEL_RESOURCE_ATTRIBUTES || &quot;&quot;;

// Parse the environment variable string into an object
const attributes = envAttributes.split(&quot;,&quot;).reduce((acc, curr) =&gt; {
  const [key, value] = curr.split(&quot;=&quot;);
  if (key &amp;&amp; value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});
</code></pre>
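<p>As a quick sanity check of this parsing logic — the attribute string below is illustrative, standing in for the environment variable — the reducer turns the comma-separated pairs into a plain object:</p>

```javascript
// Illustrative value; in the application this comes from the
// OTEL_RESOURCE_ATTRIBUTES environment variable.
const envAttributes = "service.version=1.0,deployment.environment=production";

// Split into "key=value" pairs and accumulate them into an object,
// trimming whitespace and skipping malformed entries.
const attributes = envAttributes.split(",").reduce((acc, curr) => {
  const [key, value] = curr.split("=");
  if (key && value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});

console.log(attributes);
// { 'service.version': '1.0', 'deployment.environment': 'production' }
```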
<p>Next, we use these parameters to populate the resource configuration.</p>
<pre><code class="language-javascript">const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    attributes[&quot;service.name&quot;] || &quot;node-server-otel-manual&quot;,
  [SemanticResourceAttributes.SERVICE_VERSION]:
    attributes[&quot;service.version&quot;] || &quot;1.0.0&quot;,
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
    attributes[&quot;deployment.environment&quot;] || &quot;production&quot;,
});
</code></pre>
<p>We then set up the trace provider using the previously created resource, followed by the exporter which takes the collectorOptions from before. The trace provider will allow us to create spans later.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that provides hooks for span start and end method invocations.</p>
<p>OpenTelemetry offers several span processors. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See the OpenTelemetry documentation</a>.</p>
<p>We also use the resource module, which allows us to specify attributes such as service.name, version, and more. See the <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OpenTelemetry semantic conventions documentation</a> for more details.</p>
<pre><code class="language-javascript">const tracerProvider = new NodeTracerProvider({
  resource: resource,
});

const exporter = new OTLPTraceExporter(collectorOptions);
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));
tracerProvider.register();
</code></pre>
<p>Next, we are going to register some instrumentations. This will automatically instrument Express and HTTP for us. While it’s possible to do this step fully manually, it would be complex and error-prone. This way we can ensure that any incoming and outgoing request is captured properly and that functionality such as distributed tracing works without any additional work.</p>
<pre><code class="language-javascript">registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
  tracerProvider: tracerProvider,
});
</code></pre>
<p>As a last step, we will now get an instance of the tracer that we can use to create custom spans.</p>
<pre><code class="language-javascript">const tracer = opentelemetry.trace.getTracer(&quot;node-server-otel-manual&quot;);
</code></pre>
<h3>Step 2. Adding custom spans</h3>
<p>Now that we have the modules added and initialized, we can add custom spans.</p>
<p>Our sample application has a POST request which calls a downstream service. If we want to have additional instrumentation for this part of our app, we simply wrap the function code with:</p>
<pre><code class="language-javascript">tracer.startActiveSpan('favorites',   tracer.startActiveSpan('favorites', (span) =&gt; {...
</code></pre>
<p>The wrapped code is as follows:</p>
<pre><code class="language-javascript">app.post(&quot;/api/favorites&quot;, (req, res, next) =&gt; {
  tracer.startActiveSpan(&quot;favorites&quot;, (span) =&gt; {
    axios
      .post(
        &quot;http://&quot; + API_ENDPOINT_FAVORITES + &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        span.end();
        res.json({ favorites: favorites });
      })
      .catch(next);
  });
});
</code></pre>
<p><strong>Automatic error handling</strong><br />
For automatic error handling, we are adding a function that we use in Express which captures the exception for any error that happens during runtime.</p>
<pre><code class="language-javascript">app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  const span = opentelemetry.trace.getActiveSpan();
  if (span) {
    span.recordException(err);
    span.end();
  }
  res.status(500).json({ error: err.message, code: err.code });
});
</code></pre>
<p><strong>Additional code</strong><br />
In addition to modules and span instrumentation, the sample application also checks some environment variables at startup. When sending data to Elastic without an OTel collector, the OTEL_EXPORTER_OTLP_HEADERS variable is required, as it contains the authentication. The same is true for OTEL_EXPORTER_OTLP_ENDPOINT, the host where we’ll send the telemetry data.</p>
<pre><code class="language-javascript">const OTEL_EXPORTER_OTLP_HEADERS = process.env.OTEL_EXPORTER_OTLP_HEADERS;
// error if secret token is not set
if (!OTEL_EXPORTER_OTLP_HEADERS) {
  throw new Error(&quot;OTEL_EXPORTER_OTLP_HEADERS environment variable is not set&quot;);
}

const OTEL_EXPORTER_OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
// error if server url is not set
if (!OTEL_EXPORTER_OTLP_ENDPOINT) {
  throw new Error(
    &quot;OTEL_EXPORTER_OTLP_ENDPOINT environment variable is not set&quot;
  );
}
</code></pre>
<p><strong>Final code</strong><br />
For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server-otel-manual">GitHub</a>.</p>
<pre><code class="language-javascript">const pino = require(&quot;pino&quot;);
const ecsFormat = require(&quot;@elastic/ecs-pino-format&quot;);
const log = pino({ ...ecsFormat({ convertReqRes: true }) });
const expressPino = require(&quot;express-pino-logger&quot;)({ logger: log });

// Add OpenTelemetry packages
const opentelemetry = require(&quot;@opentelemetry/api&quot;);
const { NodeTracerProvider } = require(&quot;@opentelemetry/sdk-trace-node&quot;);
const { BatchSpanProcessor } = require(&quot;@opentelemetry/sdk-trace-base&quot;);
const {
  OTLPTraceExporter,
} = require(&quot;@opentelemetry/exporter-trace-otlp-grpc&quot;);
const { Resource } = require(&quot;@opentelemetry/resources&quot;);
const {
  SemanticResourceAttributes,
} = require(&quot;@opentelemetry/semantic-conventions&quot;);

const { registerInstrumentations } = require(&quot;@opentelemetry/instrumentation&quot;);

// Import OpenTelemetry instrumentations
const { HttpInstrumentation } = require(&quot;@opentelemetry/instrumentation-http&quot;);
const {
  ExpressInstrumentation,
} = require(&quot;@opentelemetry/instrumentation-express&quot;);

var API_ENDPOINT_FAVORITES =
  process.env.API_ENDPOINT_FAVORITES || &quot;127.0.0.1:5000&quot;;
API_ENDPOINT_FAVORITES = API_ENDPOINT_FAVORITES.split(&quot;,&quot;);

const OTEL_EXPORTER_OTLP_HEADERS = process.env.OTEL_EXPORTER_OTLP_HEADERS;
// error if secret token is not set
if (!OTEL_EXPORTER_OTLP_HEADERS) {
  throw new Error(&quot;OTEL_EXPORTER_OTLP_HEADERS environment variable is not set&quot;);
}

const OTEL_EXPORTER_OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
// error if server url is not set
if (!OTEL_EXPORTER_OTLP_ENDPOINT) {
  throw new Error(
    &quot;OTEL_EXPORTER_OTLP_ENDPOINT environment variable is not set&quot;
  );
}

const collectorOptions = {
  // url is optional and can be omitted - default is http://localhost:4317
  // Unix domain sockets are also supported: 'unix:///path/to/socket.sock'
  url: OTEL_EXPORTER_OTLP_ENDPOINT,
  headers: OTEL_EXPORTER_OTLP_HEADERS,
};

const envAttributes = process.env.OTEL_RESOURCE_ATTRIBUTES || &quot;&quot;;

// Parse the environment variable string into an object
const attributes = envAttributes.split(&quot;,&quot;).reduce((acc, curr) =&gt; {
  const [key, value] = curr.split(&quot;=&quot;);
  if (key &amp;&amp; value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});

// Create and configure the resource object
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    attributes[&quot;service.name&quot;] || &quot;node-server-otel-manual&quot;,
  [SemanticResourceAttributes.SERVICE_VERSION]:
    attributes[&quot;service.version&quot;] || &quot;1.0.0&quot;,
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
    attributes[&quot;deployment.environment&quot;] || &quot;production&quot;,
});

// Create and configure the tracer provider
const tracerProvider = new NodeTracerProvider({
  resource: resource,
});
const exporter = new OTLPTraceExporter(collectorOptions);
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));
tracerProvider.register();

//Register instrumentations
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
  tracerProvider: tracerProvider,
});

const express = require(&quot;express&quot;);
const cors = require(&quot;cors&quot;)({ origin: true });
const cookieParser = require(&quot;cookie-parser&quot;);
const { json } = require(&quot;body-parser&quot;);

const PORT = process.env.PORT || 3001;

const app = express().use(cookieParser(), cors, json(), expressPino);

const axios = require(&quot;axios&quot;);

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  const span = opentelemetry.trace.getActiveSpan();
  if (span) {
    span.recordException(err);
    span.end();
  }
  res.status(500).json({ error: err.message, code: err.code });
});

const tracer = opentelemetry.trace.getTracer(&quot;node-server-otel-manual&quot;);

var favorites = {};

app.post(&quot;/api/favorites&quot;, (req, res, next) =&gt; {
  tracer.startActiveSpan(&quot;favorites&quot;, (span) =&gt; {
    var randomIndex = Math.floor(Math.random() * API_ENDPOINT_FAVORITES.length);

    if (
      process.env.THROW_NOT_A_FUNCTION_ERROR == &quot;true&quot; &amp;&amp;
      Math.random() &lt; 0.5
    ) {
      // randomly choose one of the endpoints
      axios
        .post(
          &quot;http://&quot; +
            API_ENDPOINT_FAVORITES[randomIndex] +
            &quot;/favorites?user_id=1&quot;,
          req.body
        )
        .then(function (response) {
          favorites = response.data;
          // quiz solution: &quot;42&quot;
          span.end();
          res.jsonn({ favorites: favorites });
        })
        .catch(next);
    } else {
      axios
        .post(
          &quot;http://&quot; +
            API_ENDPOINT_FAVORITES[randomIndex] +
            &quot;/favorites?user_id=1&quot;,
          req.body
        )
        .then(function (response) {
          favorites = response.data;
          span.end();
          res.json({ favorites: favorites });
        })
        .catch(next);
    }
  });
});

app.listen(PORT, () =&gt; {
  log.info(`Server listening on ${PORT}`);
});
</code></pre>
<h3>Step 3. Running the Docker image with environment variables</h3>
<p>We will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication details that the OTel exporter uses to send the data, along with some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoint and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t node-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=node-server-otel-manual&quot; \
       -p 3001:3001 \
       node-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on some downstream services that you may not have running on your machine.</p>
<pre><code class="language-bash">curl localhost:3001/api/login
curl localhost:3001/api/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:3001/api/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 4. Explore in Elastic APM</h3>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Node.js service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-4-graphs.png" alt="graphs" /></p>
<p>Notice how this mirrors the auto-instrumented version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-4-graphs.png" alt="graphs-2" /></p>
<h2>Is it worth it?</h2>
<p>This is the million-dollar question. Depending on the level of detail you need, manual instrumentation may be necessary. Manual instrumentation lets you add custom spans, custom labels, and metrics where you want or need them. It allows you to get a level of detail that otherwise would not be possible and is often important for tracking business-specific KPIs.</p>
<p>Your operations, and whether you need to troubleshoot or analyze the performance of specific parts of the code, will dictate when and what to instrument. But it’s helpful to know that you have the option to manually instrument.</p>
<p>You may have noticed that we didn’t instrument metrics yet; that topic deserves a blog of its own. We discussed logs in a <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">previous blog</a>.</p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Node.js with OpenTelemetry</li>
<li>The different modules needed when using Express</li>
<li>How to properly initialize and instrument spans</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS from Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Node.js with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-nodejs-apps-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/observability-launch-series-1-node-js-manual_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation with OpenTelemetry for Python applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-python-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Python applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Python libraries and in particular work on instrumenting tracing in a Python application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/python/manual/">manual instrumentation for OpenTelemetry</a> with the Python service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">automatic instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several microservices written in .NET, Node.js, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-1-config.png" alt="configuration" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis effort, as well as alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Python application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Python</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple Python Flask application that can receive a GET request. (This is a portion of the full <a href="https://github.com/elastic/observability-examples/blob/main/Elastiflix/python-favorite/main.py">main.py</a> file.)</p>
<pre><code class="language-python">from flask import Flask, request
import sys

import logging
import redis
import os
import ecs_logging
import datetime
import random
import time

redis_host = os.environ.get('REDIS_HOST') or 'localhost'
redis_port = os.environ.get('REDIS_PORT') or 6379

application_port = os.environ.get('APPLICATION_PORT') or 5000

app = Flask(__name__)

# Get the Logger
logger = logging.getLogger(&quot;app&quot;)
logger.setLevel(logging.DEBUG)

# Add an ECS formatter to the Handler
handler = logging.StreamHandler()
handler.setFormatter(ecs_logging.StdlibFormatter())
logger.addHandler(handler)
logging.getLogger('werkzeug').setLevel(logging.ERROR)
logging.getLogger('werkzeug').addHandler(handler)

r = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    user_id = str(request.args.get('user_id'))

    logger.info('Getting favorites for user ' + user_id, extra={
        &quot;event.dataset&quot;: &quot;favorite.log&quot;,
        &quot;user.id&quot;: request.args.get('user_id')
    })

    favorites = r.smembers(user_id)

    # convert to list
    favorites = list(favorites)
    logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
        &quot;event.dataset&quot;: &quot;favorite.log&quot;,
        &quot;user.id&quot;: user_id
    })
    return { &quot;favorites&quot;: favorites}

logger.info('App startup')
app.run(host='0.0.0.0', port=application_port)
logger.info('App Stopped')
</code></pre>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional libraries to our application.</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
</code></pre>
<p>This code imports necessary OpenTelemetry libraries, including those for tracing, exporting, and instrumenting specific libraries like Flask, Requests, and Redis.</p>
<p>Next we read the variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_HEADERS
OTEL_EXPORTER_OTLP_ENDPOINT
</code></pre>
<p>And then initialize the exporter.</p>
<pre><code class="language-python">otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')

otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')

exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)
</code></pre>
<p>In order to pass additional parameters to OpenTelemetry, we will read the OTEL_RESOURCE_ATTRIBUTES variable and convert it into a dictionary.</p>
<pre><code class="language-python">resource_attributes = os.environ.get('OTEL_RESOURCE_ATTRIBUTES') or 'service.version=1.0,deployment.environment=production'
key_value_pairs = resource_attributes.split(',')
result_dict = {}

for pair in key_value_pairs:
    key, value = pair.split('=', 1)
    result_dict[key] = value
</code></pre>
<p>Next, we use these parameters to populate the resource configuration.</p>
<pre><code class="language-python">resourceAttributes = {
     &quot;service.name&quot;: otel_service_name,
     &quot;service.version&quot;: result_dict['service.version'],
     &quot;deployment.environment&quot;: result_dict['deployment.environment']
}

resource = Resource.create(resourceAttributes)
</code></pre>
<p>We then set up the trace provider using the previously created resource. The trace provider will allow us to create spans later after getting a tracer instance from it.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that allows hooks for span start and end method invocations.</p>
<p>OpenTelemetry offers different span processors. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be configured to be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See OpenTelemetry documentation</a>.</p>
<p>Additionally, we added the resource module. This allows us to specify attributes such as service.name, version, and more. See <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OpenTelemetry semantic conventions documentation</a> for more details.</p>
<pre><code class="language-python">provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(otel_service_name)
</code></pre>
<p>Finally, because we are using Flask and Redis, we also add the following, which allows us to automatically instrument both Flask and Redis.</p>
<p>Technically, you could consider this “cheating.” We are using some parts of the Python auto-instrumentation. However, it’s generally a good idea to use the auto-instrumentation modules where they exist. This saves you a lot of time, and in addition, it ensures that functionality like distributed tracing will work automatically for any requests you receive or send.</p>
<pre><code class="language-python">FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
RedisInstrumentor().instrument()
</code></pre>
<h3>Step 2. Adding custom spans</h3>
<p>Now that we have everything added and initialized, we can add custom spans.</p>
<p>If we want additional instrumentation for a part of our app, we simply wrap the code of the GET /favorites handler with:</p>
<pre><code class="language-python">with tracer.start_as_current_span(&quot;add_favorite_movies&quot;, set_status_on_exception=True) as span:
        ...
</code></pre>
<p>The wrapped code is as follows:</p>
<pre><code class="language-python">@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    # add artificial delay if enabled
    if delay_time &gt; 0:
        time.sleep(max(0, random.gauss(delay_time/1000, delay_time/1000/10)))

    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
</code></pre>
<p><strong>Additional code</strong></p>
<p>In addition to modules and span instrumentation, the sample application also checks some environment variables at startup. When sending data to Elastic without an OTel collector, the OTEL_EXPORTER_OTLP_HEADERS variable is required as it contains the authentication. The same is true for OTEL_EXPORTER_OTLP_ENDPOINT, the host where we’ll send the telemetry data.</p>
<pre><code class="language-python">otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
# fail if secret token not set
if otel_exporter_otlp_headers is None:
    raise Exception('OTEL_EXPORTER_OTLP_HEADERS environment variable not set')


otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
# fail if server url not set
if otel_exporter_otlp_endpoint is None:
    raise Exception('OTEL_EXPORTER_OTLP_ENDPOINT environment variable not set')
else:
    exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)
</code></pre>
<p><strong>Final code</strong><br />
For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-manual">GitHub</a>.</p>
<pre><code class="language-python">from flask import Flask, request
import sys

import logging
import redis
import os
import ecs_logging
import datetime
import random
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Using the gRPC exporter, since per the OTel docs this is needed for any endpoint receiving OTLP.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
#from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware
from opentelemetry.sdk.resources import Resource

redis_host = os.environ.get('REDIS_HOST') or 'localhost'
redis_port = os.environ.get('REDIS_PORT') or 6379
otel_traces_exporter = os.environ.get('OTEL_TRACES_EXPORTER') or 'otlp'
otel_metrics_exporter = os.environ.get('OTEL_METRICS_EXPORTER') or 'otlp'
environment = os.environ.get('ENVIRONMENT') or 'dev'
otel_service_version = os.environ.get('OTEL_SERVICE_VERSION') or '1.0.0'
resource_attributes = os.environ.get('OTEL_RESOURCE_ATTRIBUTES') or 'service.version=1.0,deployment.environment=production'

otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
# fail if secret token not set
if otel_exporter_otlp_headers is None:
    raise Exception('OTEL_EXPORTER_OTLP_HEADERS environment variable not set')
#else:
#    otel_exporter_otlp_fheaders= f&quot;Authorization=Bearer%20{secret_token}&quot;

otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
# fail if server url not set
if otel_exporter_otlp_endpoint is None:
    raise Exception('OTEL_EXPORTER_OTLP_ENDPOINT environment variable not set')
else:
    exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)


key_value_pairs = resource_attributes.split(',')
result_dict = {}

for pair in key_value_pairs:
    key, value = pair.split('=', 1)
    result_dict[key] = value

resourceAttributes = {
     &quot;service.name&quot;: result_dict['service.name'],
     &quot;service.version&quot;: result_dict['service.version'],
     &quot;deployment.environment&quot;: result_dict['deployment.environment']
#     # Add more attributes as needed
}

resource = Resource.create(resourceAttributes)


provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(&quot;favorite&quot;)


application_port = os.environ.get('APPLICATION_PORT') or 5000

app = Flask(__name__)


FlaskInstrumentor().instrument_app(app)
#OpenTelemetryMiddleware().instrument()
RequestsInstrumentor().instrument()
RedisInstrumentor().instrument()

#app.wsgi_app = OpenTelemetryMiddleware(app.wsgi_app)

# Get the Logger
logger = logging.getLogger(&quot;app&quot;)
logger.setLevel(logging.DEBUG)

# Add an ECS formatter to the Handler
handler = logging.StreamHandler()
handler.setFormatter(ecs_logging.StdlibFormatter())
logger.addHandler(handler)
logging.getLogger('werkzeug').setLevel(logging.ERROR)
logging.getLogger('werkzeug').addHandler(handler)

r = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
        return { &quot;favorites&quot;: favorites}

logger.info('App startup')
app.run(host='0.0.0.0', port=application_port)
logger.info('App Stopped')
</code></pre>
<h3>Step 3. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/#configuring-the-agent">OTEL documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t  python-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=python-favorite-otel-manual&quot; \
       -p 3001:3001 \
       python-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites
# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 4. Explore traces, metrics, and logs in Elastic APM</h3>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Python service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-4-graph1.png" alt="graph-1" /></p>
<p>Notice how this is slightly different from the auto-instrumented version, as we now also have our custom span in this view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-5-graph2.png" alt="graph-2" /></p>
<h2>Is it worth it?</h2>
<p>This is the million-dollar question. Depending on what level of detail you need, it's potentially necessary to manually instrument. Manual instrumentation lets you add custom spans, custom labels, and metrics where you want or need them. It allows you to get a level of detail that otherwise would not be possible and is oftentimes important for tracking business-specific KPIs.</p>
<p>Your operations, and whether you need to troubleshoot or analyze the performance of specific parts of the code, will dictate when and what to instrument. But it’s helpful to know that you have the option to manually instrument.</p>
<p>You may have noticed that we didn’t yet instrument metrics; that is a topic for another blog. We discussed logs in a <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">previous blog</a>.</p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Python with OpenTelemetry</li>
<li>How to properly initialize OpenTelemetry and add a custom span</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Python with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-python-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/observability-launch-series-2-python-manual_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automating User Journeys for Synthetic Monitoring with MCP in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/mcp-elastic-synthetics</link>
            <guid isPermaLink="false">mcp-elastic-synthetics</guid>
            <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This post explores how you can automatically create user journeys with Synthetic Monitoring in Elastic Observability, TypeScript, and FastMCP, and walks through the app and its workflow.]]></description>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/solutions/observability/synthetics">Synthetic Monitoring in Elastic Observability</a> enables you to track user pathways using a global testing infrastructure, emulating the full user path to measure the impact of web applications. It also provides comprehensive insight into your website's performance, functionality, and availability from development to production, allowing you to identify and resolve issues before they affect your customers.</p>
<p>One of the main components of Elastic's Synthetic Monitoring is the ability to create user journeys, which can be done with or without code. There is a <a href="https://github.com/elastic/synthetics">Synthetics agent</a>, a CLI tool that guides you through the process of creating both heartbeat monitors and user journeys and deploying your code to Elastic Observability. If you are using code to create user journeys, you are using <a href="https://playwright.dev/">Playwright</a> under the hood with some additional configuration to make it easier to work with Elastic Observability.</p>
<p>To automatically create user journeys using TypeScript, you can create Playwright tests based on a prompt using <a href="https://www.warp.dev">Warp</a>, an AI-assisted terminal, <a href="https://deepmind.google/models/gemini/pro/">Gemini 2.5 Pro</a>, and <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a>. This application was built using Python and <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, which wraps the synthetic agent to deploy browser tests to Elastic automatically. This blog post will guide you through how the application works, how to use it, and its development process. You can find the complete code on <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics">GitHub</a>.</p>
<h2>Solution overview</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/01-diagram.jpg" alt="diagram" /></p>
<p>Currently, this solution is set up to run inside Warp as an <a href="https://docs.warp.dev/knowledge-and-collaboration/mcp">MCP server</a>; however, you can also use another client, such as <a href="https://claude.ai/download">Claude Desktop</a> or <a href="https://cursormcp.com/en">Cursor</a>. You create a Python script using <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, which allows you to define functions that are callable by an LLM. Within Warp, you create a JSON configuration file that points to your Python script and passes in all the environment variables you are working with. From there, toggle agent mode and select an LLM; there are many options to choose from, so be sure to check out <a href="https://docs.warp.dev/agents/using-agents">Warp's documentation</a> to learn more.</p>
<p>After that, ask a question about creating a synthetic test or call the MCP function you need directly. The following three functions can be used:</p>
<ul>
<li>
<p><code>diagnose_warp_mcp_config</code>
Used for debugging environment variable issues that may arise. This function likely won't be needed unless there is an issue with your configuration.</p>
</li>
<li>
<p><code>create_and_deploy_browser_test</code>
Will automatically create Playwright tests if given the test name, the URL you want to test, and a schedule. This approach uses a template-based method, rather than a machine learning-based method, and all the tests it outputs will appear similar.</p>
</li>
<li>
<p><code>llm_create_and_deploy_test_from_prompt</code>
Similar to <code>create_and_deploy_browser_test</code>, but the main difference is that it uses an LLM to create tests based on a prompt you give it. The tests should reflect the prompt you provided. To run this function you'll provide a test name, URL, prompt, and schedule.</p>
</li>
</ul>
<h2>Why create this solution as an MCP server?</h2>
<p>The reason this was developed as an MCP server, as opposed to a standalone script or a standard CLI, is that it can be interacted with in a more conversational manner. It enables an LLM to generate dynamic Playwright tests while maintaining consistent arguments, environment variables, and responses to ensure accuracy and reliability. Thus, it becomes a reliable workflow that other agents or developers can compose with additional tools. In other words, the MCP layer turns your LLM-based test authoring into a standardized, reusable capability instead of a one-off script. To learn more about the direction of MCP, be sure to check out our article on the <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">topic</a>.</p>
<h2>Implementation considerations</h2>
<p>When creating a solution like this one, be mindful of your use of tokens. An early version of this solution took approximately twenty minutes to create synthetic tests and ultimately led to severe rate-limiting.</p>
<p>Another issue faced during the building process was striking a balance between a template that makes Playwright scripts easy to generate and an LLM that creates Playwright scripts from prompts without feeling cookie-cutter. With a more LLM-driven approach, the scripts often didn't work or referenced parameters that didn't exist, while a more templated approach was reliable but repetitive. The final version of this solution attempts to balance the two by reusing elements of the template while adjusting the LLM's temperature parameter, which controls the randomness or creativity of a large language model's output.</p>
<p>While testing this solution, a failing test also emerged that required navigating past a pop-up. In more complex cases, this may serve as a building block that requires additional domain knowledge to create a complete passing Playwright test.</p>
<h2>How to get started</h2>
<h3>Prerequisites</h3>
<ul>
<li>This application was built with Python 3.12.1, but you can use any Python version higher than 3.10.</li>
<li>This application uses Elastic Observability version 9.1.2, but you can use any version of Elastic Observability higher than 8.10. You can also use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>.</li>
<li>You will also need an OpenAI API key to use the LLM capabilities of this application. Configure an environment variable for your OpenAI API key, which you can find on the API keys page in <a href="https://platform.openai.com/api-keys">OpenAI's developer portal</a>.</li>
</ul>
<h3>Step 1: Install the packages and clone the repository</h3>
<p>In order for this MCP server to run locally, you will need to install the following packages:</p>
<pre><code class="language-shell">pip install fastmcp openai
npm install -g playwright @elastic/synthetics
</code></pre>
<p>You will use <a href="https://gofastmcp.com/getting-started/welcome">FastMCP 2.0</a> to create the MCP server, and <a href="https://github.com/openai/openai-python">OpenAI</a> to generate tests based on prompts that you provide. Additionally, you will want to clone the repository to obtain a local copy of the server.</p>
<h3>Step 2: Set up a configuration file in Warp</h3>
<p>Inside of Warp, go to the side panel, select MCP servers, and click “add”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/02-add-mcp.jpg" alt="Add MCP Server" /></p>
<p>After that, you will be prompted to add a JSON configuration file that should resemble the following. Be sure to add your own Kibana URL, update the correct path, and include your own keys and tokens.</p>
<pre><code class="language-json">{
 &quot;elastic-synthetics&quot;: {
   &quot;command&quot;: &quot;python&quot;,
   &quot;args&quot;: [&quot;elastic_synthetics_server.py&quot;],
   &quot;env&quot;: {
     &quot;PYTHONPATH&quot;: &quot;.&quot;,
     &quot;ELASTIC_KIBANA_URL&quot;: &quot;https://your-kibana-url.elastic-cloud.com&quot;,
     &quot;ELASTIC_API_KEY&quot;: &quot;your-api-key-here&quot;,
     &quot;ELASTIC_PROJECT_ID&quot;: &quot;mcp-synthetics-demo&quot;,
     &quot;ELASTIC_SPACE&quot;: &quot;default&quot;,
     &quot;ELASTIC_AUTO_PUSH&quot;: &quot;true&quot;,
     &quot;ELASTIC_USE_JAVASCRIPT&quot;: &quot;false&quot;,
     &quot;ELASTIC_INSTALL_DEPENDENCIES&quot;: &quot;true&quot;,
     &quot;OPENAI_API_KEY&quot;: &quot;sk-your-openai-key&quot;,
     &quot;LLM_MODEL&quot;: &quot;gpt-4o&quot;
   },
   &quot;working_directory&quot;: &quot;/path/to/your/file&quot;,
   &quot;start_on_launch&quot;: true 
   }
}
</code></pre>
<h3>Step 3: Ask a question or call the tools directly</h3>
<p>Now that you've set everything up locally, toggle agent mode and select the LLM you wish to use. Gemini 2.5 Pro was chosen for this blog post because it provided a straightforward answer, while other LLMs tested returned very lengthy responses.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/03-agent-mode.jpg" alt="Agent mode" /></p>
<p>To start using the MCP tools, you can ask your MCP server a question that contains the test name, URL, prompt, and schedule.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/04-full-question-answer.jpg" alt="Full question and answer" /></p>
<p>You can also call the tool directly by typing <code>llm_create_and_deploy_test_from_prompt()</code>, and the program will prompt you for the relevant details:<br />
<img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/05-call-mcp-tool.jpg" alt="Call MCP Tool" /></p>
<p>Inside Kibana, you should see your monitor listed if you go to Applications and select Monitors under Synthetics. You can also find a link to your monitor in the response of your MCP tool.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/06-kibana-monitors.jpg" alt="Monitors in Kibana" /></p>
<h2>What's going on inside</h2>
<p>This code sample consists of three primary functions, each an MCP tool that you can call from your MCP client: <code>diagnose_warp_mcp_config</code>, <code>create_and_deploy_browser_test</code>, and <code>llm_create_and_deploy_test_from_prompt</code>.</p>
<h3>Debugging environment issues</h3>
<p>Various issues around environment variable loading came up while creating this application, so there was a need for an MCP tool that could be called to diagnose whatever errors may be present.</p>
<p>The tool <code>diagnose_warp_mcp_config</code> starts with the decorator <code>@mcp.tool()</code>, which allows it to be called and listed among the available tools. This tool is designed to help debug issues with Elastic-specific environment variables for troubleshooting purposes. First, it loads the environment variables and looks for the Elastic-specific ones. It then applies security masking so that sensitive information like API keys is hidden in the output, showing only the first eight characters followed by &quot;...&quot;. Finally, it determines whether the minimum required credentials (Kibana URL and API key) are present to proceed with deployment and provides a report letting you know about any issues that may exist.</p>
<pre><code class="language-py">@mcp.tool()
def diagnose_warp_mcp_config() -&gt; Dict[str, Any]:
   &quot;&quot;&quot;Diagnose Warp MCP environment configuration for Elastic Synthetics&quot;&quot;&quot;
   try:
       env_vars = load_env_from_warp_mcp()
      
       # Check for required variables
       kibana_url = env_vars.get('ELASTIC_KIBANA_URL') or env_vars.get('KIBANA_URL')
       api_key = env_vars.get('ELASTIC_API_KEY') or env_vars.get('API_KEY')
       project_id = env_vars.get('ELASTIC_PROJECT_ID') or env_vars.get('PROJECT_ID')
       space = env_vars.get('ELASTIC_SPACE') or env_vars.get('SPACE', 'default')
      
       # Mask sensitive values for display
       masked_vars = {}
       for key, value in env_vars.items():
           if 'API_KEY' in key or 'TOKEN' in key:
               masked_vars[key] = f&quot;{value[:8]}...&quot; if value and len(value) &gt; 8 else &quot;***&quot;
           else:
               masked_vars[key] = value
      
       deployment_ready = bool(kibana_url and api_key)
      
       return safe_json_response({
           &quot;status&quot;: &quot;success&quot;,
           &quot;environment_variables&quot;: masked_vars,
           &quot;required_check&quot;: {
               &quot;kibana_url&quot;: bool(kibana_url),
               &quot;api_key&quot;: bool(api_key),
               &quot;project_id&quot;: bool(project_id),
               &quot;space&quot;: bool(space)
           },
           &quot;deployment_ready&quot;: deployment_ready,
           &quot;recommendations&quot;: [
               &quot;Environment variables detected&quot; if env_vars else &quot;No environment variables found&quot;,
               &quot;Kibana URL configured&quot; if kibana_url else &quot;Missing ELASTIC_KIBANA_URL or KIBANA_URL&quot;,
               &quot;API Key configured&quot; if api_key else &quot;Missing ELASTIC_API_KEY or API_KEY&quot;,
               &quot;Ready for deployment&quot; if deployment_ready else &quot;Missing required credentials&quot;
           ]
       })
      
   except Exception as e:
       return safe_json_response({
           &quot;status&quot;: &quot;error&quot;,
           &quot;error&quot;: str(e),
           &quot;error_type&quot;: type(e).__name__
       })
</code></pre>
<h3>Creating synthetic tests based on a template</h3>
<p>While developing this solution to generate tests based on a prompt, the process wasn't always smooth. Early versions struggled with accuracy, hallucinations, and getting stuck in loops. To make progress, a logical next step was a version built on a test template, which made it possible to verify the mechanics of the solution, such as whether a test could pass and be deployed to Elastic correctly.</p>
<p>This solution automates the entire process of creating a synthetic browser test that will regularly check if a website is working correctly, then deploys it to Elastic Observability Synthetics. Similar to <code>diagnose_warp_mcp_config</code>, the MCP tool <code>create_and_deploy_browser_test</code> starts with the decorator <code>@mcp.tool()</code> and checks to make sure that the proper environment variables are loaded.</p>
<p>From there, it creates a TypeScript test file from templates, generating dynamic test steps based on the target website's characteristics: navigating to the website, verifying that the page title exists, checking page load performance, taking a screenshot, and verifying that the page content is visible. Finally, it saves the test file in a <code>synthetic_tests</code> directory.</p>
<p>Finally, it wraps Elastic's CLI tool <code>@elastic/synthetics</code> to push the test to Kibana, allowing you to set which geographic locations to run tests from, how often to run the test, and the project and workspace settings.</p>
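<p>As a rough illustration of that wrapper step, the sketch below shells out to the CLI from Python. The helper and flag names are assumptions for illustration, not the repo's actual code, and push flags can vary across <code>@elastic/synthetics</code> versions:</p>

```python
import subprocess
from typing import List, Optional


def build_push_command(kibana_url: str, api_key: str,
                       project_id: Optional[str] = None) -> List[str]:
    # Assemble the CLI invocation; treat the flag names as illustrative
    # and check them against your @elastic/synthetics version.
    cmd = ["npx", "@elastic/synthetics", "push", "--url", kibana_url, "--auth", api_key]
    if project_id:
        cmd += ["--id", project_id]
    return cmd


def push_tests(test_dir: str, kibana_url: str, api_key: str) -> bool:
    # Run the push from the directory holding the generated test files.
    result = subprocess.run(build_push_command(kibana_url, api_key),
                            cwd=test_dir, capture_output=True, text=True)
    return result.returncode == 0
```

<p>Checking the return code, rather than parsing stdout, keeps a wrapper like this robust against changes in the CLI's output format.</p>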
<p>You can check out the full code for this MCP tool <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics/blob/main/elastic_synthetics_server.py#L943">here</a>.</p>
<h3>Creating synthetic tests based on a prompt</h3>
<p>While creating browser tests from a template is a good starting point, the results felt generic and cookie-cutter. It did, however, provide a helpful structure to build an LLM-based function on top of.</p>
<p>The MCP tool <code>llm_create_and_deploy_test_from_prompt</code> begins by validating basic parameters, including locations, schedule, and directories. It then gathers information about the target website to inform the AI and initializes the OpenAI client with the GPT-4o model.</p>
<p>After setting up the LLM, it converts natural language requests into actual Playwright test code, then cleans and validates the AI-generated code to prevent issues like injection attacks or malformed syntax. It draws inspiration from the templated approach, wrapping AI-generated steps within a proven, reliable test framework template. Finally, it deploys the test to Elastic in a similar manner to the previous tool.</p>
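<p>The cleaning step can be pictured with a small sketch like the one below. The function names, fence-stripping regex, and blocklist are simplified stand-ins for what the actual tool does, not code from the repo:</p>

```python
import re

# Matches a fenced code block in the LLM reply; the fence string is built
# dynamically to avoid literal triple backticks inside this example.
_FENCE = "`" * 3
_BLOCK = re.compile(_FENCE + r"(?:\w+)?\n(.*?)" + _FENCE, re.DOTALL)


def extract_code(llm_output: str) -> str:
    # Pull the code body out of a reply that may wrap it in Markdown fences.
    match = _BLOCK.search(llm_output)
    return (match.group(1) if match else llm_output).strip()


# A very rough screen for patterns we never expect in a browser test step.
FORBIDDEN = ("require('child_process')", "eval(", "process.exit")


def looks_safe(code: str) -> bool:
    return not any(token in code for token in FORBIDDEN)
```

<p>Wrapping the validated steps inside a known-good template, as the tool does, then limits how much damage a malformed or malicious generation can do.</p>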
<p>You can find the code for this tool <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics/blob/main/elastic_synthetics_server.py#L1559">here</a>.</p>
<h2>Conclusion and next steps</h2>
<p>Synthetic monitoring in Elastic Observability makes it easy to test complete user journeys and keep your site reliable, with simple setup and a Playwright integration. A tool like this can provide a starting point for tests that you can iterate on afterward.</p>
<p>A solution like this is just the start of an MCP implementation that automatically generates Playwright tests for you and can be expanded in the future to include heartbeat monitors, utilize the <a href="https://github.com/microsoft/playwright-mcp">Playwright MCP server</a>, or consider experimenting with <a href="https://www.anthropic.com/news/claude-for-chrome">Claude for Chrome</a> to create synthetic testing.</p>
<p>Check out more articles on <a href="https://www.elastic.co/observability-labs/blog/tag/synthetics">Observability Labs on Synthetic Monitoring</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/retro.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Explore and Analyze Metrics with Ease in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/metrics-explore-analyze-with-esql-discover</link>
            <guid isPermaLink="false">metrics-explore-analyze-with-esql-discover</guid>
            <pubDate>Thu, 23 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The latest enhancements to ES|QL and Discover based metrics exploration unleash a potent set of tools for quick and effective metrics analytics.]]></description>
            <content:encoded><![CDATA[<h2>Metrics are critical in identifying the “what”</h2>
<p>As a core pillar of Observability, metrics offer a highly structured, quantitative view of system performance and health. They provide a crucial symptomatic perspective—revealing <em>what</em> is happening, such as high application latency, increasing service errors, or spiking container CPU utilization, which is essential for initiating alerting and triaging efforts. This capability for effective monitoring, alerting, and triaging is paramount to ensuring robust service delivery and achieving successful business outcomes.</p>
<p>Elastic Observability provides a comprehensive, end-to-end experience for metrics data. Elastic ensures that metrics data can be collected from numerous sources, enriched as needed and shipped to the Elastic Stack. Elastic efficiently stores this time series data, including high-cardinality metrics, utilizing the <a href="https://www.elastic.co/observability-labs/blog/time-series-data-streams-observability-metrics">TSDS index mode</a> (Time Series Data Stream), introduced in <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0#efficient-storage-of-metrics-with-tsdb,-now-generally-available">prior versions</a> and used across Elastic time series <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">integrations</a>. This foundation ensures comprehensive observability through out-of-the-box dashboards, alerts, SLOs, and streamlined data management.</p>
<p>Elastic Observability 9.2 provides enhancements to metrics exploration and analysis through powerful query language extensions and expanded UI capabilities. These enhancements focus on making analysis on TSDS data via counter rates and common aggregations over time easier and faster than ever before.</p>
<p>The main metrics enhancements center on these key features, offered as Tech Preview:</p>
<ol>
<li>Metrics analytics with TSDS and ES|QL</li>
<li>Interactive metrics exploration in Discover</li>
<li>OTLP endpoint for metrics</li>
</ol>
<h2>Metrics analytics with TSDS and ES|QL</h2>
<p>The introduction of the new <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> source command</a> in <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> (Elasticsearch Query Language) on TSDS metrics dramatically simplifies time series analysis.</p>
<p>The <code>TS</code> command is specifically designed to target only time series indices, differentiating it from the general <code>FROM</code> command. Its core power lies in enabling a dedicated suite of time series aggregation functions within the <code>STATS</code> command.</p>
<p>This mechanism utilizes a dual aggregation paradigm, which is standard for time series querying. These queries involve two aggregation functions:</p>
<ul>
<li>
<p><strong>Inner (Time Series) function:</strong> Applied implicitly per time series, often over bucketed time intervals.</p>
</li>
<li>
<p><strong>Outer (Regular) function:</strong> Used to aggregate the results of the inner function across groups. For instance, if you use <code>STATS SUM(RATE(search_requests)) BY TBUCKET(1 hour), host</code>, the <code>RATE()</code> function is the inner function applied per time series in hourly buckets, and <code>SUM()</code> is the outer function, summing these rates for each host and hourly bucket.</p>
</li>
</ul>
<p>If an ES|QL query using the <code>TS</code> command is missing an inner (time series) aggregation function, <code>LAST_OVER_TIME()</code> is implicitly assumed and used. For example, <code>TS metrics | STATS AVG(memory_usage)</code> is equivalent to <code>TS metrics | STATS AVG(LAST_OVER_TIME(memory_usage))</code>.</p>
<h3>Key time series aggregation functions available in ES|QL via <code>TS</code> command</h3>
<p>These functions allow for powerful analysis on time-series data:</p>
<table>
<thead>
<tr>
<th align="center">Function</th>
<th align="center">Description</th>
<th align="center">Example Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><code>RATE()</code> <strong>/</strong> <code>IRATE()</code></td>
<td align="center">Calculates the per-second average rate of increase of a counter (<code>RATE</code>), accounting for non-monotonic breaks like counter resets, making it the most appropriate function for counters, or the per-second rate of increase between the last two data points (<code>IRATE</code>), ignoring all but the last two points for high responsiveness.</td>
<td align="center">Calculating request per second (RPS) or throughput.</td>
</tr>
<tr>
<td align="center"><code>AVG_OVER_TIME()</code></td>
<td align="center">Calculates the average of a numeric field over the defined time range.</td>
<td align="center">Determining average resource usage over an hour.</td>
</tr>
<tr>
<td align="center"><code>SUM_OVER_TIME()</code></td>
<td align="center">Calculates the sum of a field over the time range.</td>
<td align="center">Total errors over a specific time window.</td>
</tr>
<tr>
<td align="center"><code>MAX_OVER_TIME()</code> <strong>/</strong> <code>MIN_OVER_TIME()</code></td>
<td align="center">Calculates the maximum or minimum value of a field over time.</td>
<td align="center">Identifying peak resource consumption.</td>
</tr>
<tr>
<td align="center"><code>DELTA()</code> <strong>/</strong> <code>IDELTA()</code></td>
<td align="center">Calculates the absolute change of a gauge field over a time window (<code>DELTA</code>) or specifically between the last two data points (<code>IDELTA</code>), making <code>IDELTA</code> more responsive to recent changes.</td>
<td align="center">Tracking changes in system gauge metrics (e.g., buffer size).</td>
</tr>
<tr>
<td align="center"><code>INCREASE()</code></td>
<td align="center">Calculates the absolute increase of a counter (<code>INCREASE</code>).</td>
<td align="center">Analyzing immediate rate changes in fast-moving counters.</td>
</tr>
<tr>
<td align="center"><code>FIRST_OVER_TIME()</code> <strong>/</strong> <code>LAST_OVER_TIME()</code></td>
<td align="center">Calculates the earliest or latest recorded value of a field, determined by the <code>@timestamp</code> field.</td>
<td align="center">Inspecting initial and final metric states within a bucket.</td>
</tr>
<tr>
<td align="center"><code>ABSENT_OVER_TIME()</code> <strong>/</strong> <code>PRESENT_OVER_TIME()</code></td>
<td align="center">Calculates the absence or presence of a field in the result over the time range.</td>
<td align="center">Identifying monitoring coverage gaps.</td>
</tr>
<tr>
<td align="center"><code>COUNT_OVER_TIME()</code> <strong>/</strong> <code>COUNT_DISTINCT_OVER_TIME()</code></td>
<td align="center">Calculates the total count or the count of distinct values of a field over time.</td>
<td align="center">Measuring frequency or cardinality changes.</td>
</tr>
</tbody>
</table>
<p>These functions, available with the <code>TS</code> command, allow SREs and Ops teams to easily perform rate calculations and other common aggregations, making efficient metrics analysis a routine part of observability workflows. And it’s much faster, too: internal performance testing shows that <code>TS</code> queries consistently outperform other ways of querying metrics data, often by an order of magnitude or more.</p>
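<p>To make the dual aggregation concrete, here is a small Python sketch that composes the hourly requests-per-second query from the example above as a plain string. The index pattern and the <code>search_requests</code> field are carried over from that example, and the commented-out client usage assumes the official <code>elasticsearch</code> Python package:</p>

```python
def rps_per_host_query(index: str = "metrics-*") -> str:
    # Hourly requests-per-second per host: RATE() is the inner
    # (per-time-series) function, SUM() the outer aggregation across series.
    return (
        f"TS {index} "
        "| STATS rps = SUM(RATE(search_requests)) BY TBUCKET(1 hour), host"
    )


# Running it with the Python client (connection details are placeholders):
# from elasticsearch import Elasticsearch
# client = Elasticsearch("https://localhost:9200", api_key="...")
# for row in client.esql.query(query=rps_per_host_query())["values"]:
#     print(row)
```

<p>Keeping the ES|QL text in a helper like this makes it easy to tweak the bucket size or swap <code>RATE()</code> for <code>IRATE()</code> when experimenting.</p>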
<h2>Interactive metrics exploration in Discover</h2>
<p>The 9.2 release introduces the capability to explore and analyze metrics directly and interactively within the Discover interface. In addition to exploring and analyzing logs and raw events, Discover now provides a dedicated environment for metrics exploration:</p>
<ul>
<li>
<p><strong>Easy start:</strong> Begin exploration simply by querying ingested metrics with <code>TS metrics-*</code>.</p>
</li>
<li>
<p><strong>Grid view and pre-applied aggregations:</strong> This command displays all metrics in a grid format at a glance, immediately applying the appropriate aggregations based on the metric type, such as <code>rate</code> versus <code>avg</code>.</p>
</li>
<li>
<p><strong>Search and group-by:</strong> Quickly search for specific metrics by name. Also easily group and analyze metrics by dimensions (labels) and specific values. This allows narrowing down to metrics and dimensions of choice for targeted analysis.</p>
</li>
<li>
<p><strong>Quick access to details:</strong> The interface also provides access to crucial information for each metric, including query and response details, the underlying ES|QL commands, the metric field type, and applicable dimensions.</p>
</li>
<li>
<p><strong>Easy tweaking and dashboarding:</strong> The system automatically populates ES|QL queries, aiding in making easy tweaks, slicing, and dicing the data. Once analyzed, metrics and resulting analyses can be added to new or existing dashboards with ease.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/metrics-explore-analyze-with-esql-discover/metrics-discover-ts-command.png" alt="Interactive metrics exploration in Discover" /></p>
<h2>OTLP endpoint for metrics</h2>
<p>We are also introducing a native OpenTelemetry Protocol (OTLP) endpoint specifically for metrics ingest directly into Elasticsearch. The endpoint especially benefits self-managed customers, and will be integrated into our <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Cloud Managed OTLP Endpoint</a> for Elastic-managed offerings. The native endpoint and related updates improve ingest performance and scalability of OTel metrics, providing up to 60% higher throughput via <code>_otlp</code>, and up to 25% higher throughput when using classic <code>_bulk</code> methods. </p>
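<p>For self-managed clusters, pointing an OpenTelemetry SDK or Collector at the native endpoint mostly comes down to an endpoint URL and an API-key header. The sketch below only builds that configuration; the <code>/_otlp/v1/metrics</code> path and header shape are assumptions inferred from the endpoint name above, so check the Elastic documentation for your deployment's exact ingest URL:</p>

```python
def otlp_exporter_config(es_url: str, api_key: str) -> dict:
    # NOTE: the path below is an assumption inferred from the `_otlp`
    # endpoint name; verify it against the docs for your Elastic version.
    return {
        "endpoint": f"{es_url.rstrip('/')}/_otlp/v1/metrics",
        "headers": {"Authorization": f"ApiKey {api_key}"},
    }


# With the OpenTelemetry Python SDK this would plug into the HTTP exporter:
# from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
# exporter = OTLPMetricExporter(**otlp_exporter_config("https://es.example.com", "<api-key>"))
```
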
<h2>In Conclusion</h2>
<p>By merging the power of ES|QL's new time series aggregations with the familiar interactive experience of Discover, Elastic 9.2 enables a potent set of metrics analytics tools. The tools significantly boost the exploration and analysis phase of any observability workflow. And we’re just getting started on unleashing the full power of metrics in Elastic Observability!</p>
<p>We welcome you to <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">try the new features</a> today!</p>
<p>Also, learn more about how we provide metrics analytics for AWS, Azure, GCP, Kubernetes, and LLMs on <a href="https://www.elastic.co/observability-labs">Observability Labs</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/metrics-explore-analyze-with-esql-discover/metrics-blog-image-ts-discover.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Migrating 1 billion log lines from OpenSearch to Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch</link>
            <guid isPermaLink="false">migrating-billion-log-lines-opensearch-elasticsearch</guid>
            <pubDate>Wed, 11 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to migrate 1 billion log lines from OpenSearch to Elasticsearch for improved performance and reduced disk usage. Discover the migration strategies, data transfer methods, and optimization techniques used in this guide.]]></description>
            <content:encoded><![CDATA[<p>What are the current options to migrate from OpenSearch to Elasticsearch&lt;sup&gt;®&lt;/sup&gt;?</p>
<p>OpenSearch is a fork of Elasticsearch 7.10 that has diverged quite a bit from Elasticsearch since then, resulting in a different set of features and also different performance, as <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">this benchmark</a> shows (hint: it’s currently much slower than Elasticsearch).</p>
<p>Given the differences between the two solutions, restoring a snapshot from OpenSearch is not possible, nor is reindex-from-remote, so our only option is then using something in between that will read from OpenSearch and write to Elasticsearch.</p>
<p>This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/blog-elastic-348gb-disk-space-logs.jpg" alt="1 - arrows" /></p>
<h2>1 billion log lines</h2>
<p>We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans one week (January 1–7, 2023).</p>
<p>We have in total 1,009,165,775 documents that take <strong>453.5GB</strong> of space in OpenSearch, including the replicas. That’s <strong>241.2KB per document</strong>. This is going to be important later when we enable a couple optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!</p>
<p>This billion log line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. We have primary shards of about 25GB in size, following the best practices for optimal shard sizing. A <code>GET _cat/indices</code> shows us the indices we are dealing with:</p>
<pre><code class="language-bash">index                              docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049  102519334   1   1         22.1gb     44.2gb
.ds-logs-myapplication-prod-000048  114273539   1   1         26.1gb     52.3gb
.ds-logs-myapplication-prod-000044  111093596   1   1         25.4gb     50.8gb
.ds-logs-myapplication-prod-000043  113821016   1   1         25.7gb     51.5gb
.ds-logs-myapplication-prod-000042  113859174   1   1         24.8gb     49.7gb
.ds-logs-myapplication-prod-000041  112400019   1   1         25.7gb     51.4gb
.ds-logs-myapplication-prod-000040  113362823   1   1         25.9gb     51.9gb
.ds-logs-myapplication-prod-000038  110994116   1   1         25.3gb     50.7gb
.ds-logs-myapplication-prod-000037  116842158   1   1         25.4gb     50.8gb
</code></pre>
<p>Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">benchmark</a>, the clusters are running in Kubernetes.</p>
<h2>Moving data from A to B</h2>
<p>Typically, moving data from one Elasticsearch cluster to another is as easy as a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html">snapshot and restore</a> if the clusters are compatible versions, or a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#reindex-from-remote">reindex from remote</a> if you need real-time synchronization and minimized downtime. These methods do not apply when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged since the 7.10 fork. However, there is one method that will work: scrolling.</p>
<h3>Scrolling</h3>
<p>Scrolling involves using an external tool, such as Logstash<sup>®</sup>, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration process if needed. Here are a couple of advantages of using Logstash:</p>
<ul>
<li><strong>Easy parallelization:</strong> It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.</li>
<li><strong>Queuing:</strong> Logstash automatically queues documents before sending.</li>
<li><strong>Automatic retries:</strong> In the event of a failure or an error during data transmission, Logstash will automatically attempt to resend the data; moreover, it will stop querying the source cluster as often, until the connection is re-established, all without manual intervention.</li>
</ul>
<p>Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left, similar to how a “cursor” works in relational databases.</p>
<p>A <a href="https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html">scrolled search</a> takes a snapshot in time by freezing the segments that make up the index as of the time the request is made, preventing those segments from merging. As a result, the scroll doesn’t see any changes that are made to the index after the initial search request has been made.</p>
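<p>The scroll loop itself can be sketched in a few lines of Python. The client here is duck-typed: both opensearch-py and elasticsearch-py expose <code>search</code> and <code>scroll</code> methods, though argument shapes vary slightly across client versions, so treat this as a sketch rather than version-exact code:</p>

```python
def scroll_documents(client, index: str, page_size: int = 500, keep_alive: str = "5m"):
    # Yield batches of hits from `index` using the scroll API. `client` can be
    # any object exposing search/scroll (OpenSearch and Elasticsearch clients
    # both do, with minor signature differences between versions).
    resp = client.search(index=index, scroll=keep_alive, size=page_size,
                         body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield hits
        resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        scroll_id = resp["_scroll_id"]
```

<p>Running this against the OpenSearch source and feeding each batch to a bulk indexer on the Elasticsearch side reproduces the basic pipeline, minus the queuing and retry behavior Logstash provides for free.</p>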
<h3>Migration strategies</h3>
<p>Reading from A and writing to B can be slow without optimization because it involves paginating through the results, transferring each batch over the network to Logstash, which assembles the documents into another batch and then transfers those batches over the network again to Elasticsearch, where the documents will be indexed. So when it comes to such large data sets, we must be very efficient and extract every bit of performance where we can.</p>
<p>Let’s start with the facts — what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/monitoring-overview.html">monitoring</a> functionality in Elastic<sup>®</sup> and then navigating to the index you want to inspect.</p>
<p><strong>Scrolling in the deep</strong><br />
The simplest approach for transferring the log lines would be to scroll over the entire data set in one pass and check on it later when it finishes. Here we will introduce our first two variables: <code>PAGE_SIZE</code> and <code>BATCH_SIZE</code>. The former is how many records we bring from the source every time we query it, and the latter is how many documents Logstash assembles together before writing them to the destination index.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-2-scrolling-in-the-deep.jpg" alt="Deep scrolling" /></p>
<p>With such a large data set, the scroll slows down as the pagination gets deeper: the indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-3-index-rate.png" alt="Indexing rate for a deep scroll" /></p>
<p><strong>Slice me nice</strong><br />
We can optimize scrolling by using an approach called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#slice-scroll">Sliced scroll</a>, where we split the index in different slices to consume them independently.</p>
<p>Here we will introduce our last two variables: <code>SLICES</code> and <code>WORKERS</code>. The number of slices cannot be too small, as the performance decreases drastically over time, and it cannot be too big, as the overhead of maintaining the scrolls would counter the benefits of smaller searches.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-4-slice-me-nice.jpg" alt="Sliced scroll" /></p>
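<p>The post's actual pipeline is built on Logstash, but the slicing mechanics can be sketched in plain Python: each worker attaches a <code>slice</code> clause to an otherwise identical search body, and a thread pool fans the workers out. The function names here are illustrative:</p>

```python
from concurrent.futures import ThreadPoolExecutor


def sliced_query(slice_id: int, max_slices: int) -> dict:
    # Each worker consumes one independent slice of the same index.
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }


def migrate_index(transfer_fn, slices: int = 4) -> None:
    # `transfer_fn(body)` is expected to scroll through its slice and
    # bulk-write the documents to the destination (sketch only).
    with ThreadPoolExecutor(max_workers=slices) as pool:
        list(pool.map(transfer_fn, [sliced_query(i, slices) for i in range(slices)]))
```
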
<p>Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.</p>
<table>
<thead>
<tr>
<th>SLICES</th>
<th>PAGE_SIZE</th>
<th>WORKERS</th>
<th>BATCH_SIZE</th>
<th>Average Indexing Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>500</td>
<td>3</td>
<td>500</td>
<td>13,319 docs/sec</td>
</tr>
<tr>
<td>3</td>
<td>1,000</td>
<td>3</td>
<td>1,000</td>
<td>13,048 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>250</td>
<td>4</td>
<td>250</td>
<td>10,199 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>500</td>
<td>4</td>
<td>500</td>
<td>12,692 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>1,000</td>
<td>4</td>
<td>1,000</td>
<td>10,900 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>500</td>
<td>5</td>
<td>500</td>
<td>12,647 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>1,000</td>
<td>5</td>
<td>1,000</td>
<td>10,334 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>2,000</td>
<td>5</td>
<td>2,000</td>
<td>10,405 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>10</td>
<td>250</td>
<td>14,083 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>4</td>
<td>1,000</td>
<td>12,014 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>500</td>
<td>4</td>
<td>1,000</td>
<td>10,956 docs/sec</td>
</tr>
</tbody>
</table>
<p>It looks like we have a good set of candidates for maximizing the throughput of a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling: even though search operations are single-threaded and each slice reads its data through sequential search requests, nothing prevents us from reading several indices in parallel.</p>
<p>By default, the maximum number of open scrolls is 500. This limit can be updated with the <code>search.max_open_scroll_context</code> cluster setting, but the default value is enough for this particular migration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-5-index-rate-volatile.png" alt="5 - indexing rate" /></p>
<h2>Let’s migrate</h2>
<h3>Preparing our destination indices</h3>
<p>We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index template</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-index-lifecycle.html">index lifecycle management</a> configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.</p>
<p><strong>Index lifecycle management policy</strong><br />
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:</p>
<pre><code class="language-bash">PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;actions&quot;: {
          &quot;rollover&quot;: {
            &quot;max_primary_shard_size&quot;: &quot;25gb&quot;
          }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;0d&quot;,
        &quot;actions&quot;: {
          &quot;forcemerge&quot;: {
            &quot;max_num_segments&quot;: 1
          }
        }
      }
    }
  }
}
</code></pre>
<p><strong>Index template (and saving 23% in disk space)</strong><br />
Since we are here, we’re going to go ahead and enable <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">Synthetic Source</a>, a clever feature that allows us to store and discard the original JSON document while still reconstructing it when needed from the stored fields.</p>
<p>For our example, enabling Synthetic Source resulted in a remarkable <strong>23.4% improvement in storage efficiency</strong> , reducing the size required to store a single document from 241.2KB in OpenSearch to just <strong>185KB</strong> in Elasticsearch.</p>
<p>Our full index template is therefore:</p>
<pre><code class="language-bash">PUT _index_template/logs-myapplication-reindex
{
  &quot;index_patterns&quot;: [
    &quot;logs-myapplication-reindex&quot;
  ],
  &quot;priority&quot;: 500,
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index&quot;: {
        &quot;lifecycle.name&quot;: &quot;logs-myapplication-lifecycle-policy&quot;,
        &quot;codec&quot;: &quot;best_compression&quot;,
        &quot;number_of_shards&quot;: &quot;1&quot;,
        &quot;number_of_replicas&quot;: &quot;1&quot;,
        &quot;query&quot;: {
          &quot;default_field&quot;: [
            &quot;message&quot;
          ]
        }
      }
    },
    &quot;mappings&quot;: {
      &quot;_source&quot;: {
        &quot;mode&quot;: &quot;synthetic&quot;
      },
      &quot;_data_stream_timestamp&quot;: {
        &quot;enabled&quot;: true
      },
      &quot;date_detection&quot;: false,
      &quot;properties&quot;: {
        &quot;@timestamp&quot;: {
          &quot;type&quot;: &quot;date&quot;
        },
        &quot;agent&quot;: {
          &quot;properties&quot;: {
            &quot;ephemeral_id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;aws&quot;: {
          &quot;properties&quot;: {
            &quot;cloudwatch&quot;: {
              &quot;properties&quot;: {
                &quot;ingestion_time&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_group&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_stream&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;cloud&quot;: {
          &quot;properties&quot;: {
            &quot;region&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;data_stream&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;namespace&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;ecs&quot;: {
          &quot;properties&quot;: {
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;event&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;ingested&quot;: {
              &quot;type&quot;: &quot;date&quot;
            }
          }
        },
        &quot;host&quot;: {
          &quot;type&quot;: &quot;object&quot;
        },
        &quot;input&quot;: {
          &quot;properties&quot;: {
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;log&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;properties&quot;: {
                &quot;path&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;message&quot;: {
          &quot;type&quot;: &quot;match_only_text&quot;
        },
        &quot;meta&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;metrics&quot;: {
          &quot;properties&quot;: {
            &quot;size&quot;: {
              &quot;type&quot;: &quot;long&quot;
            },
            &quot;tmin&quot;: {
              &quot;type&quot;: &quot;long&quot;
            }
          }
        },
        &quot;process&quot;: {
          &quot;properties&quot;: {
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;tags&quot;: {
          &quot;type&quot;: &quot;keyword&quot;,
          &quot;ignore_above&quot;: 1024
        }
      }
    }
  }
}
</code></pre>
<h3>Building a custom Logstash image</h3>
<p>We are going to use a containerized Logstash for this migration: both clusters sit on Kubernetes infrastructure, so it's easier to spin up a Pod that can communicate with both clusters.</p>
<p>Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:9.3.1 and just install the plugin:</p>
<pre><code class="language-dockerfile">FROM docker.elastic.co/logstash/logstash:9.3.1

USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch
</code></pre>
<h3>Writing a Logstash pipeline</h3>
<p>Now that we have our Logstash Docker image, we need to write a pipeline that reads from OpenSearch and writes to Elasticsearch.</p>
<p><strong>The</strong> <strong>input</strong></p>
<pre><code class="language-ruby">input {
    opensearch {
        hosts =&gt; [&quot;os-cluster:9200&quot;]
        ssl =&gt; true
        ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
        user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
        password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
        index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
        slices =&gt; &quot;${SOURCE_SLICES}&quot;
        size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
        scroll =&gt; &quot;5m&quot;
        docinfo =&gt; true
        docinfo_target =&gt; &quot;[@metadata][doc]&quot;
    }
}
</code></pre>
<p>Let’s break down the most important input parameters. The values are all represented as environment variables here:</p>
<ul>
<li><strong>hosts:</strong> Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.</li>
<li><strong>index:</strong> Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod,” a data stream whose actual backing indices look like .ds-logs-myapplication-prod-000049.</li>
<li><strong>size:</strong> Specifies the maximum number of logs to retrieve in each request.</li>
<li><strong>scroll:</strong> Defines how long a search context will be kept open on the OpenSearch server. In this case, it’s set to “5m,” which means each scroll request must be completed, and the next page requested, within five minutes.</li>
<li><strong>docinfo</strong> and <strong>docinfo_target:</strong> These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.</li>
</ul>
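<p>To build intuition for how those slices parallelize the export, here is a minimal sketch in plain Python (not the plugin's actual implementation): each document is deterministically assigned to one slice, so several workers can scroll the same index concurrently without overlap. The <code>slice_id</code> hash below is a stand-in for the server-side routing hash.</p>

```python
# Sketch (not the plugin's actual code) of how sliced scrolling splits one
# index read across N parallel workers: each document is assigned to exactly
# one slice, so together the slices cover the index with no overlap.

def slice_id(doc_id: str, max_slices: int) -> int:
    # Stand-in for the server-side routing hash.
    return hash(doc_id) % max_slices

def partition(doc_ids, max_slices):
    # Group documents the way N concurrent sliced scroll requests would.
    slices = {i: [] for i in range(max_slices)}
    for doc_id in doc_ids:
        slices[slice_id(doc_id, max_slices)].append(doc_id)
    return slices

docs = [f"doc-{n}" for n in range(1000)]
slices = partition(docs, max_slices=10)

# Every document lands in exactly one slice.
assert sum(len(s) for s in slices.values()) == len(docs)
assert len({d for s in slices.values() for d in s}) == len(docs)
```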
<p>The ssl and ca_file options are highly recommended if you are migrating between clusters in different infrastructures (e.g., separate cloud providers). You don’t need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS offering whose endpoint is reachable over the internet; then ssl =&gt; true alone suffices. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.</p>
<p><strong>The (optional)</strong> <strong>filter</strong><br />
We could use this to drop or alter the documents to be written to Elasticsearch if we wanted, but we are not going to, as we want to migrate the documents as is. We are only removing extra metadata fields that Logstash includes in all documents, such as &quot;@version&quot; and &quot;host&quot;. We are also removing the original &quot;data_stream&quot; as it contains the source data stream name, which might not be the same in the destination.</p>
<pre><code class="language-ruby">filter {
    mutate {
        remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
    }
}
</code></pre>
<p><strong>The</strong> <strong>output</strong><br />
The output is straightforward: we name our data stream logs-myapplication-reindex and use the original documents’ _id as document_id to ensure there are no duplicates. In Elasticsearch, data stream names follow the convention &lt;type&gt;-&lt;dataset&gt;-&lt;namespace&gt;, so our logs-myapplication-reindex data stream has “myapplication” as its dataset and “reindex” as its namespace.</p>
<pre><code class="language-ruby">elasticsearch {
    hosts =&gt; &quot;${ELASTICSEARCH_HOST}&quot;

    user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
    password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

    document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

    data_stream =&gt; &quot;true&quot;
    data_stream_type =&gt; &quot;logs&quot;
    data_stream_dataset =&gt; &quot;myapplication&quot;
    data_stream_namespace =&gt; &quot;reindex&quot;
}
</code></pre>
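<p>As a quick illustration of that naming convention, here is a tiny hypothetical helper (not a Logstash or Elasticsearch API) that builds and parses such names:</p>

```python
# Illustrative helper for the <type>-<dataset>-<namespace> data stream
# naming convention; purely a sketch, not part of any Elastic API.

def data_stream_name(ds_type: str, dataset: str, namespace: str) -> str:
    return f"{ds_type}-{dataset}-{namespace}"

def parse_data_stream(name: str) -> dict:
    # Naive split: assumes the dataset itself contains no "-".
    ds_type, dataset, namespace = name.split("-", 2)
    return {"type": ds_type, "dataset": dataset, "namespace": namespace}

name = data_stream_name("logs", "myapplication", "reindex")
assert name == "logs-myapplication-reindex"
assert parse_data_stream(name)["namespace"] == "reindex"
```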
<h3>Deploying Logstash</h3>
<p>We have a few options to deploy Logstash: it can be deployed <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash-command-line.html">locally from the command line</a>, as a <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash.html">systemd service</a>, via <a href="https://www.elastic.co/guide/en/logstash/current/docker.html">docker</a>, or on <a href="https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-logstash.html">Kubernetes</a>.</p>
<p>Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a <strong>Pod</strong> referencing our Docker image created earlier. Let’s put our pipeline inside a <strong>ConfigMap</strong> along with some configuration files (pipelines.yml and config.yml).</p>
<p>In the configuration below, SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE are conveniently exposed as environment variables, so you just need to fill them in.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: logstash-1
spec:
  containers:
    - name: logstash
      image: ugosan/logstash-opensearch-input:9.3.1
      imagePullPolicy: Always
      env:
        - name: SOURCE_INDEX_NAME
          value: &quot;.ds-logs-myapplication-prod-000037&quot;
        - name: SOURCE_SLICES
          value: &quot;10&quot;
        - name: SOURCE_PAGE_SIZE
          value: &quot;500&quot;
        - name: LOGSTASH_WORKERS
          value: &quot;4&quot;
        - name: LOGSTASH_BATCH_SIZE
          value: &quot;1000&quot;
        - name: OPENSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: username
        - name: OPENSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: password
        - name: ELASTICSEARCH_USERNAME
          value: &quot;elastic&quot;
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: es-cluster-es-elastic-user
              key: elastic
      resources:
        limits:
          memory: &quot;4Gi&quot;
          cpu: &quot;2500m&quot;
        requests:
          memory: &quot;1Gi&quot;
          cpu: &quot;300m&quot;
      volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: etc
          mountPath: /etc/logstash
          readOnly: true
  volumes:
    - name: config-volume
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipelines.yml
                  path: pipelines.yml
                - key: logstash.yml
                  path: logstash.yml
    - name: etc
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipeline.conf
                  path: pipelines/pipeline.conf
          - secret:
              name: os-cluster-http-cert
              items:
                - key: ca.crt
                  path: certificates/opensearch-ca.crt
          - secret:
              name: es-cluster-es-http-ca-internal
              items:
                - key: tls.crt
                  path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  pipelines.yml: |
    - pipeline.id: reindex-os-es
      path.config: &quot;/etc/logstash/pipelines/pipeline.conf&quot;
      pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
      pipeline.workers: ${LOGSTASH_WORKERS}
  logstash.yml: |
    log.level: info
    pipeline.unsafe_shutdown: true
    pipeline.ordered: false
  pipeline.conf: |
    input {
        opensearch {
          hosts =&gt; [&quot;os-cluster:9200&quot;]
          ssl =&gt; true
          ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
          user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
          password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
          index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
          slices =&gt; &quot;${SOURCE_SLICES}&quot;
          size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
          scroll =&gt; &quot;5m&quot;
          docinfo =&gt; true
          docinfo_target =&gt; &quot;[@metadata][doc]&quot;
        }
    }

    filter {
        mutate {
            remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
        }
    }

    output {
        elasticsearch {
            hosts =&gt; &quot;https://es-cluster-es-http:9200&quot;
            ssl =&gt; true
            ssl_certificate_authorities =&gt; [&quot;/etc/logstash/certificates/elasticsearch-ca.crt&quot;]
            ssl_verification_mode =&gt; &quot;full&quot;

            user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
            password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

            document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

            data_stream =&gt; &quot;true&quot;
            data_stream_type =&gt; &quot;logs&quot;
            data_stream_dataset =&gt; &quot;myapplication&quot;
            data_stream_namespace =&gt; &quot;reindex&quot;
        }
    }
</code></pre>
<h2>That’s it.</h2>
<p>After a couple of hours, we successfully migrated one billion documents from OpenSearch to Elasticsearch and saved around 23% on disk storage in the process. Now that the logs are in Elasticsearch, how about extracting actual business value from them? Logs contain so much valuable information: we can not only use AIOps features to <a href="https://www.elastic.co/guide/en/observability/current/categorize-logs.html#analyze-log-categories">automatically categorize</a> those logs, but also extract <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1122s">business metrics</a> and <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1906s">detect anomalies</a> in them. Give it a try.</p>
<table>
<thead>
<tr>
<th colspan="3">OpenSearch</th>
<th colspan="3">Elasticsearch</th>
<th></th>
</tr>
<tr>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Diff.</th>
</tr>
</thead>
<tbody>
<tr>
<td>.ds-logs-myapplication-prod-000037</td>
<td>116842158</td>
<td>27285520870</td>
<td>logs-myapplication-reindex-000037</td>
<td>116842158</td>
<td>21998435329</td>
<td>21.46%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000038</td>
<td>110994116</td>
<td>27263291740</td>
<td>logs-myapplication-reindex-000038</td>
<td>110994116</td>
<td>21540011082</td>
<td>23.45%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000040</td>
<td>113362823</td>
<td>27872438186</td>
<td>logs-myapplication-reindex-000040</td>
<td>113362823</td>
<td>22234641932</td>
<td>22.50%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000041</td>
<td>112400019</td>
<td>27618801653</td>
<td>logs-myapplication-reindex-000041</td>
<td>112400019</td>
<td>22059453868</td>
<td>22.38%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000042</td>
<td>113859174</td>
<td>26686723701</td>
<td>logs-myapplication-reindex-000042</td>
<td>113859174</td>
<td>21093766108</td>
<td>23.41%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000043</td>
<td>113821016</td>
<td>27657006598</td>
<td>logs-myapplication-reindex-000043</td>
<td>113821016</td>
<td>22059454752</td>
<td>22.52%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000044</td>
<td>111093596</td>
<td>27281936915</td>
<td>logs-myapplication-reindex-000044</td>
<td>111093596</td>
<td>21559513422</td>
<td>23.43%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000048</td>
<td>114273539</td>
<td>28111420495</td>
<td>logs-myapplication-reindex-000048</td>
<td>114273539</td>
<td>22264398939</td>
<td>23.21%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000049</td>
<td>102519334</td>
<td>23731274338</td>
<td>logs-myapplication-reindex-000049</td>
<td>102519334</td>
<td>19307250001</td>
<td>20.56%</td>
</tr>
</tbody>
</table>
<p>Interested in trying Elasticsearch? <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">Start our 14-day free trial</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-header-1-billion-log-lines.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[ML and AI Ops Observability with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ml-ai-ops-observability-opentelemetry-elastic</link>
            <guid isPermaLink="false">ml-ai-ops-observability-opentelemetry-elastic</guid>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to instrument ML and AI pipelines with OpenTelemetry and Elastic to correlate traces, logs, and metrics from notebooks to production inference services.]]></description>
            <content:encoded><![CDATA[<p>While isolated execution logs might work for local experiments, they are no longer enough for the new era of complex, production-ready Machine Learning (ML) pipelines and Artificial Intelligence (AI) agents. Modern ML and AI systems present three unique challenges:</p>
<ul>
<li><strong>Distributed components</strong>: A single request might hit an API gateway, retrieve data from a feature store, evaluate a predictive model in a Python inference service, query a vector database, and call an external LLM.</li>
<li><strong>Non-determinism</strong>: AI agents make autonomous decisions and tool calls. If an agent fails, you need a full trace to understand its reasoning loop and what external tools it tried to invoke.</li>
<li><strong>Context dependence</strong>: You don't just care <em>that</em> an error happened; you need to know <em>what model version</em> was running, <em>what hyperparameters</em> were used, <em>what the input data looked like</em>, and <em>what commit</em> introduced the change. Many of these attributes are custom to your app, and you need an Observability environment flexible enough to create new parameters on the fly and use them to find and fix issues.</li>
</ul>
<p>On top of that, with the increased use of AI agents to generate code and make autonomous decisions, Observability becomes key to understanding what is working and what is not. It creates a critical feedback loop to quickly fix problems. More than ever, ML and AI applications need to adopt the best practices of mature software engineering systems to succeed.</p>
<p>This guide shows how to use OpenTelemetry and Elastic to correlate traces, logs, and metrics to track runs, compare model behavior, and trace requests across Python and Go services with one shared context.</p>
<h2>Problem context: why AI systems are harder to debug</h2>
<p>Traditional services already have distributed failure modes, but ML and AI systems add more moving parts:</p>
<ul>
<li>notebook experiments and ad hoc jobs</li>
<li>batch training and evaluation pipelines</li>
<li>online inference services</li>
<li>external API calls, including LLM providers</li>
<li>changing model versions and hyperparameters</li>
</ul>
<p>When one prediction path gets slower or starts failing, plain isolated logs do not answer enough questions. You need to correlate:</p>
<ul>
<li><strong>what ran</strong> (run ID, model version, parameters)</li>
<li><strong>where time was spent</strong> (pipeline stage latencies)</li>
<li><strong>what was the result</strong> (model stats, predictions, API calls, compare with other runs)</li>
<li><strong>what changed</strong> (code, data, dependencies)</li>
</ul>
<p>In a future blog post, we'll show you how to set up automatic RCA and remediations with <a href="https://github.com/elastic/workflows/">Elastic Workflows</a> and our AI integrations. But as a first step, ML and AI pipelines need a robust Observability framework, which is very easy to set up with OpenTelemetry and Elastic.</p>
<h2>Solution overview</h2>
<p>OpenTelemetry gives you a standard way to emit traces, metrics, and logs. Elastic provides full OpenTelemetry ingestion, giving you a single place to store and query that telemetry. Kibana's UI is fully integrated with OpenTelemetry, allowing you to explore your services, service dependencies, service latencies, spans, and metrics out-of-the-box.</p>
<p>You can start with two deployment options:</p>
<ul>
<li><strong>Cloud</strong>: send OpenTelemetry data directly to Elastic Cloud Managed OTLP Endpoint (<a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP docs</a>), without the overhead of managing collectors</li>
<li><strong>Local</strong>: run Elastic and the EDOT Collector with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">start-local</a>; the EDOT Collector will automatically listen for OTLP data on <code>localhost:4317</code></li>
</ul>
<p>Both options let you keep your application code unchanged for the initial implementation.</p>
<h2>Step 1: zero-code baseline for Python services</h2>
<p>Start by installing the Elastic Distribution of OpenTelemetry Python (<a href="https://github.com/elastic/elastic-otel-python">EDOT Python</a>) package and running your script with the <code>opentelemetry-instrument</code> wrapper. Without modifying your application code, your Python services begin emitting standard telemetry right away: any logs exported via <code>logging</code>, plus metrics and traces for auto-instrumented libraries. This data can be routed directly to Elastic's managed OTLP endpoint or a local EDOT Collector.</p>
<pre><code class="language-bash">pip install elastic-opentelemetry
edot-bootstrap --action=install
</code></pre>
<p>Export the OpenTelemetry environment variables, then run <code>opentelemetry-instrument</code> on your script to enable auto-instrumentation.</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;motlp-endpoint&gt;&quot; # No need when using start-local with EDOT
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;key&gt;&quot; # No need when using start-local with EDOT
export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment=prod,service.version=1.0.0&quot; # Set the environment and version for your app
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000 # Choose the interval for your application metrics

opentelemetry-instrument --service_name=&lt;pipeline-name&gt; python3 &lt;your_python_script&gt;.py # Set your chosen name for your service
</code></pre>
<p>With this baseline, you quickly get:</p>
<ul>
<li>Centralized logs with trace context: any logs exported via <code>logging</code> become full-text searchable in Elastic and Kibana</li>
<li>Alerting on log errors</li>
<li>Process and system metrics, exported to Elastic automatically: visualize them to analyze memory usage (leaks, OOM errors), CPU utilization (bottlenecks, spikes), thread counts, and disk or network I/O saturation</li>
<li>Alerting on metrics</li>
<li>Spans for auto-instrumented libraries</li>
<li>Service latency baselines and error trends</li>
<li>Manual or anomaly-detection alerting on error rates, latencies, and throughput</li>
<li>Logs, metrics, and traces correlated in a single shared context, using OpenTelemetry for instrumentation and Elastic for analysis, so you can quickly find the root cause of issues</li>
</ul>
<p>Once ingested, Kibana immediately populates out-of-the-box dashboards. You can explore full-text searchable logs, monitor system and process metrics, investigate auto-instrumented trace waterfalls, map out your ML dependencies with service maps, and easily set up alerts for latency spikes, memory or CPU usage, or log errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-logs.png" alt="Logs in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-log-errors.png" alt="Log errors in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-log-errors.png" alt="Alerts on log errors" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-metrics.png" alt="System and process metrics" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-auto-instrumented-traces.png" alt="Auto instrumented traces" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-service-map.png" alt="Service map in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-latencies.png" alt="Alerts on latencies" /></p>
<p>For LLM-specific observability, OpenTelemetry provides official <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">Semantic Conventions for Generative AI</a> to standardize how you track token usage, model names, and prompts. These semantic conventions are still in development and not yet stable. Instrumentations for the most widely used libraries in this space are being developed in the <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai">OpenTelemetry Python Contrib repository</a>. Alternatively, you can implement these conventions manually in your custom spans. LLM-related OpenTelemetry logs, metrics, and traces sent to Elastic will be in context and automatically correlated with the rest of your application or application stack.</p>
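<p>To make this concrete, here is a minimal sketch of building such attributes by hand. The attribute names follow the current draft of the GenAI semantic conventions and may still change, so treat them as illustrative:</p>

```python
# Sketch of GenAI span attributes. The attribute names follow the current
# draft of the OpenTelemetry GenAI semantic conventions and may still
# change; always check the spec before relying on them.

def genai_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

# With a real tracer you would attach these to the span around the LLM call:
#   with tracer.start_as_current_span("llm.chat") as span:
#       for key, value in genai_attributes("gpt-4o", 820, 310).items():
#           span.set_attribute(key, value)
attrs = genai_attributes("gpt-4o", 820, 310)
assert attrs["gen_ai.request.model"] == "gpt-4o"
```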
<h2>Step 2: add ML-specific context with custom spans and log fields</h2>
<p>Auto-instrumentation is a starting point. For ML and AI Ops, add explicit spans around business stages and attach run metadata. Elastic's schema flexibility and dynamic mappings make it a perfect fit for custom attributes or metrics that are exclusive to your pipelines or specific experiments: there is no need to know what the data will look like before writing it. You can create new parameters on the fly, Elastic maps them automatically, and you can track them instantly.</p>
<p>Add custom fields and metric-like values as structured log fields so you can chart and alert on them later:</p>
<pre><code class="language-python">logger.info(&quot;training metrics&quot;, extra={
    &quot;ml.run_id&quot;: run_id,
    &quot;ml.training_accuracy&quot;: train_accuracy,
    &quot;ml.validation_accuracy&quot;: val_accuracy,
    &quot;ml.drift_detected&quot;: drift_detected,
})
</code></pre>
<p>Because Elastic handles dynamic mapping, any custom metrics or attributes you log, like model IDs, training accuracy, or drift detection, are instantly indexed and available to search in Discover or visualize in Dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-log-attributes.png" alt="Custom log attributes" /></p>
<p>This makes dashboards and rules practical:</p>
<ul>
<li>alert when <code>ml.validation_accuracy &lt; 0.8</code></li>
<li>alert when <code>ml.drift_detected == true</code></li>
<li>compare stage latency by <code>ml.model_version</code></li>
</ul>
<p>You can use these custom attributes to build targeted visualizations, and trigger alerts when ML-specific metrics like validation accuracy drop below a critical threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-charts-from-custom-log-attributes.png" alt="Charts from custom log attributes" /></p>
<p>Adding custom spans lets you break down the specific stages of your ML pipeline, such as data loading and model training, into their own measurable execution blocks, so you can analyze average latency or error rates for each stage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans.png" alt="Custom spans in code" /></p>
<pre><code class="language-python">from opentelemetry import trace

tracer = trace.get_tracer(&quot;ml.pipeline&quot;)

with tracer.start_as_current_span(&quot;load_data&quot;) as span:
    span.set_attribute(&quot;ml.run_id&quot;, run_id)
    span.set_attribute(&quot;ml.dataset&quot;, dataset_source)
    load_data()

with tracer.start_as_current_span(&quot;train_model&quot;) as span:
    span.set_attribute(&quot;ml.model_version&quot;, model_version)
    span.set_attribute(&quot;ml.learning_rate&quot;, learning_rate)
    train_model()
</code></pre>
<p>Custom spans appear in the APM UI alongside your traces, so you can explore their latency, their impact on total execution time, stack traces, and error rates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans-ui-in-elastic.png" alt="Custom spans UI in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-analysing-spans.png" alt="Analysing spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-latency-and-avg-latency-of-spans.png" alt="Latency and avg latency of spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-alerts-on-custom-log-metrics.png" alt="Alerts on custom log metrics" /></p>
<h2>Step 3: trace across Python and Go in production</h2>
<p>Real inference paths often cross service boundaries. In a production environment, for example, a user request might pass through a Go-based API before hitting your Python ML inference service. OpenTelemetry ensures tracing context is preserved seamlessly across these boundaries.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-service-map-with-multiple-services.png" alt="Service map with multiple services" /></p>
<p>In our example, we have a simple Go HTTP service that acts as the entry point and demonstrates OpenTelemetry instrumentation in Go. This REST API service stores and retrieves ML predictions by querying Elasticsearch based on data IDs from the source dataset. All of its endpoints are natively instrumented with OTel spans.</p>
<p>The full request lifecycle looks like this:</p>
<ol>
<li>The Go API receives the client request.</li>
<li>It searches Elasticsearch for an existing prediction or calls the Python model service to run inference.</li>
<li>The Python service loads features, runs the model, and returns predictions.</li>
</ol>
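<p>A minimal Python sketch of this lookup-or-infer flow (the function names and data shapes here are illustrative stand-ins, not the example services' actual code):</p>
<pre><code class="language-python"># Sketch of the request lifecycle: check for a cached prediction,
# fall back to running inference. All names below are hypothetical.

def search_cached_prediction(store: dict, data_id: str):
    """Stand-in for the Elasticsearch lookup by data ID."""
    return store.get(data_id)

def run_inference(data_id: str) -> dict:
    """Stand-in for calling the Python model service."""
    return {"data_id": data_id, "prediction": 0.87, "model_version": "v1"}

def handle_request(store: dict, data_id: str) -> dict:
    # 1. The API receives the client request (a data ID).
    # 2. Search the store for an existing prediction.
    cached = search_cached_prediction(store, data_id)
    if cached is not None:
        return cached
    # 3. Otherwise call the model service, then cache the result.
    result = run_inference(data_id)
    store[data_id] = result
    return result
</code></pre>
<p>With OpenTelemetry auto-instrumentation in place, each of these steps would show up as its own span in the trace.</p>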
<p>When both services use OpenTelemetry, trace context is propagated automatically through headers. In Elastic, you can inspect one end-to-end trace and locate latency or errors by service and span.</p>
<p>The resulting distributed trace in Elastic pieces the entire journey together. You can see the exact breakdown of time spent in the Go API versus the Python model, and correlate logs from both services in a single unified view.</p>
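<p>The context crossing that Go-to-Python boundary travels in the W3C <code>traceparent</code> HTTP header, which OpenTelemetry injects and extracts for you. As a sketch of what that header actually carries (pure Python, independent of any SDK; the IDs are the W3C spec's examples):</p>
<pre><code class="language-python"># W3C Trace Context: traceparent = version-traceid-parentid-flags
# e.g. 00-{32 hex trace id}-{16 hex span id}-01  (01 = sampled)

def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # shared by every span in the trace
        "span_id": span_id,     # the caller's span, i.e. the parent
        "sampled": flags == "01",
    }

header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(header)
</code></pre>
<p>The SDKs do this automatically on every outgoing and incoming request; the sketch only shows what is being propagated.</p>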
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-multiple-services.png" alt="Multiple services request flow" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-spans-per-service.png" alt="Spans per service" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-service-logs.png" alt="Go service logs" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-traces-in-discover.png" alt="Go traces in discover" /></p>
<h2>Validation checklist</h2>
<p>After instrumentation, validate with a short runbook:</p>
<ol>
<li>Confirm logs, metrics, and traces arrive for each service.</li>
<li>Verify your custom attributes (e.g. <code>run_id</code>, <code>model_version</code>, <code>llm_ground_truth_score</code>) are present in traces and logs.</li>
<li>Compare p95 latency per stage (<code>load_data</code>, <code>train_model</code>, <code>predict</code>).</li>
<li>Trigger a controlled failure and confirm error traces include stack context.</li>
<li>Test one rule for errors, one rule for latency spikes, and one rule for model-quality fields. Set up a connector and attach it to each rule so alerts reach you in Slack or email, or trigger an auto-remediation workflow.</li>
</ol>
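<p>Parts of this runbook can be automated. A toy sketch of items 2 and 3 over exported span data (the span dict shape here is hypothetical, not Elastic's document format):</p>
<pre><code class="language-python">import math

# Custom attributes from the article that every span should carry.
REQUIRED_ATTRS = {"run_id", "model_version"}

def missing_attributes(spans):
    """Names of spans lacking any required custom attribute."""
    return [s["name"] for s in spans
            if not REQUIRED_ATTRS.issubset(s.get("attributes", {}))]

def p95_latency_ms(spans, stage):
    """Nearest-rank p95 of duration_ms for spans of one stage."""
    durations = sorted(s["duration_ms"] for s in spans if s["name"] == stage)
    if not durations:
        raise ValueError(f"no spans for stage {stage!r}")
    rank = math.ceil(0.95 * len(durations)) - 1
    return durations[rank]
</code></pre>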
<h2>Conclusion and next steps</h2>
<p>OpenTelemetry gives ML and AI teams a unified telemetry layer, while Elastic makes that data instantly queryable and actionable across your entire lifecycle—from notebook experiments to production inference. By starting with zero-code instrumentation and incrementally adding ML-specific attributes and cross-language tracing, your team can easily adopt the Observability best practices of mature software engineering systems and succeed in the new era of complex AI operations.</p>
<p>Try this setup in <a href="https://cloud.elastic.co/registration">Elastic Cloud</a>, and use <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP</a> for a managed ingest path. If you want a local sandbox first, start with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">Elastic start-local + EDOT Collector</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[AIOps with Elastic Observability: Modern AIOps & Log Intelligence]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-aiops-elastic-observability</link>
            <guid isPermaLink="false">modern-aiops-elastic-observability</guid>
            <pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring modern AIOps capabilities, including anomaly detection, log intelligence, and log analysis & categorization with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>AIOps Blog Refresher: Unlocking Intelligence from Your Logs with Elastic</h1>
<p>Elastic has been leading the charge with AIOps, especially in the recent 9.2 update of Elastic Observability with Streams. The conversation around AIOps has shifted dramatically as we move through the year. DevOps and SRE teams aren't asking whether they need AIOps; they're asking how to leverage it more effectively to stay ahead of exponentially growing complexity.</p>
<p>The current challenge of AIOps is that modern cloud-native environments generate massive volumes of telemetry data, orders of magnitude more than past environments. But here's what many teams overlook: logs are the richest source of operational intelligence you have. Logs tell you exactly what happened and why, while metrics only tell you that something is wrong, and traces only tell you where. The problem is that most organizations are drowning in logs. Microservices (user authentication, inventory, and the like), serverless functions, and Kubernetes generate millions of log entries daily. Without AI and machine learning, finding meaningful patterns in this data takes too much time and energy.</p>
<h2>Log Intelligence Improvement: What's New in 2025</h2>
<p>Historically, unlocking log intelligence meant long manual effort: not only parsing through logs, but also structuring them. Elastic Observability has drastically changed how teams extract value from logs. Observability is no longer simple signal analysis; modern tools need to support proactive, log-driven investigations. At Elastic, that capability is Streams.</p>
<p>Streams, a new release from Elastic, is a collection of AI-driven tools that parse raw logs, enrich them with meaningful fields, and identify significant events. With Streams, SREs can maximize the value of their data, their logs, and their systems. With system reliability as the goal, Streams reduces pipeline management overhead and accelerates observability analysis. And it takes nearly no time to set up!</p>
<p>Here is how Streams powers the Elastic Observability capabilities available now.</p>
<h3>Advanced Log Rate Analysis</h3>
<p>Log rate analysis goes far beyond simply detecting spikes. Elastic's machine learning automatically identifies when log volumes deviate from expected baselines, then contextualizes these changes within your broader system performance. When your application suddenly generates more error logs, Elastic’s AIOps doesn't just alert you; it also determines whether it's a critical issue requiring immediate attention or just a temporary anomaly.</p>
<p>This matters to your analysis because not all log spikes are equal. A 10x increase in DEBUG logs might indicate verbose logging accidentally enabled in production. A 2x increase in ERROR logs could signal a cascading failure. Log rate analysis distinguishes between these scenarios automatically, giving your team the context needed to respond appropriately.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/log-analysis.png" alt="Log Analysis" /></p>
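<p>To see why a 2x ERROR increase can outrank a 10x DEBUG flood, here is a toy severity-weighted spike score (an illustration of the idea only, not Elastic's ML model):</p>
<pre><code class="language-python"># Toy spike score: ratio to baseline, weighted by log level severity.
SEVERITY_WEIGHT = {"DEBUG": 0.1, "INFO": 0.5, "WARN": 2.0, "ERROR": 5.0}

def spike_score(level: str, baseline_per_min: float, observed_per_min: float) -> float:
    ratio = observed_per_min / max(baseline_per_min, 1e-9)
    # Only count growth above the baseline, scaled by severity.
    return SEVERITY_WEIGHT[level] * max(ratio - 1.0, 0.0)

# A 10x DEBUG flood scores lower than a 2x ERROR increase:
debug = spike_score("DEBUG", 1000, 10000)  # 0.1 * 9 = 0.9
error = spike_score("ERROR", 50, 100)      # 5.0 * 1 = 5.0
</code></pre>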
<h3>Intelligent Log Categorization with Streams</h3>
<p>This is where AIOps shines with log data. Streams uses machine learning to automatically classify and group similar log patterns, dramatically reducing noise. Instead of manually parsing millions of entries, the system identifies common structures, groups related events, and surfaces the categories that matter most.</p>
<p>Logs are unstructured by nature, making them difficult to analyze at scale. Streams corrals chaotic log streams into organized, queryable patterns. Instantly, you can see that 80% of your errors fall into three categories, helping you prioritize where to focus remediation efforts. This approach helps you reduce noise and accelerate analysis, allowing teams to act on insights faster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/categories.png" alt="Log Categorizations" /></p>
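<p>The intuition behind grouping similar log patterns can be sketched with a simple token-masking approach (a toy illustration, not the algorithm Streams actually uses):</p>
<pre><code class="language-python">import re
from collections import Counter

def log_pattern(line: str) -> str:
    """Mask variable tokens so similar messages share one pattern."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "HEX", line)   # ids / hashes
    line = re.sub(r"\b\d+(\.\d+)*\b", "NUM", line)    # counts, durations, IPs
    return line

def categorize(lines):
    """Count how many raw lines collapse into each pattern."""
    return Counter(log_pattern(l) for l in lines)
</code></pre>
<p>Two timeout messages that differ only in their millisecond values collapse into one category, which is exactly the kind of noise reduction that makes millions of entries reviewable.</p>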
<h3>Multi-Dimensional Anomaly Detection</h3>
<p><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection">Anomaly detection</a> now simultaneously examines relationships between logs, metrics, and traces. A slight increase in response time might not trigger an alert by itself, but when correlated with unusual log patterns and memory consumption changes, the system recognizes it as an early warning sign.</p>
<p>Logs contain contextual information that metrics and traces can't capture: stack traces, user IDs, transaction details, error messages, and more. By correlating log anomalies with other signals, you get the full picture of what's happening in your system. This holistic view enables teams to catch issues earlier and understand their full impact across the stack.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/anomalies.png" alt="Anomaly Detection" /></p>
<h3>Enhanced Root Cause Analysis Powered by Significant Events</h3>
<p>When an issue occurs, Elastic's Streams accelerates root cause analysis through AI-assisted log parsing and <a href="https://www.elastic.co/docs/solutions/observability/streams/management/significant-events">“Significant events.”</a> Significant event queries can be defined by AI or manually, depending on whether you know which logs you are looking for. Elastic’s AIOps then traces the problem through your entire stack using these events, together with enriched log data and distributed tracing. The system correlates failed transactions with specific log entries, deployment events, and infrastructure changes, helping you understand not just what broke, but why and when.</p>
<p>Streams makes the analysis of your logs quick and automatic by going across your entire distributed system within seconds, grabbing relevant log entries such as stack traces, state information, error messages, and more. What used to require hours of manual investigation and deduction now happens automatically, freeing you and your team from tedious detective work and enabling faster resolution. </p>
<h2>Logs in Action: Real-World Impact</h2>
<p>Let's look at how these capabilities work together in practice. Imagine your payment processing service is experiencing intermittent failures - only 0.5% of transactions, but enough to concern your team. Traditional monitoring shows everything is mostly okay, but customers are still complaining.</p>
<p>Without Streams, an SRE might initially run some broad queries, manually sift through thousands of logs, struggle to connect all the dots, and ultimately not understand the correlation between the errors and recent system changes. </p>
<p>With Elastic Streams and AIOps, many of these potential problems are instantly mitigated:</p>
<ul>
<li>
<p>Streams automatically parses the payment service's logs, adding connection timeouts to a new category of significant events</p>
</li>
<li>
<p>Log rate analysis with Streams reveals that this significant event category has been slowly growing over the past month, with timeouts climbing from a handful of occurrences to a much larger number</p>
</li>
<li>
<p>Elastic’s built-in anomaly detection correlates these significant events with deployment data and identifies that they started appearing after a recent load balancer configuration change</p>
</li>
<li>
<p>Root cause analysis pinpoints the exact database connection pool setting that is too restrictive for peak load by tracing affected transactions through the previously enriched logs</p>
</li>
</ul>
<p>What usually takes 4-8 hours of manual log analysis is resolved in minutes, with Elastic automatically highlighting the relevant log entries that tell the complete story. This is the power of AIOps and Streams as applied to log intelligence.</p>
<h2>The Power of Unified Log Intelligence</h2>
<p>What sets Elastic apart is treating logs as a priority in your observability strategy. Elastic provides comprehensive log ingestion that centralizes petabytes of logs from across your infrastructure with flexible parsing and enrichment. The platform uses purpose-built machine learning models that understand log patterns, not generic algorithms retrofitted for log analysis.</p>
<p>Logs don't exist in isolation, which is why Elastic correlates log data with metrics, traces, and business events to provide complete context. And because log volumes can be massive, Elastic's tiered storage approach means you can retain years of logs for compliance and historical analysis without breaking the budget.</p>
<h2>Why Logs Matter More Than Ever</h2>
<p>Logs have become the cornerstone of effective AIOps for three critical reasons.</p>
<p>First off, logs capture what metrics can't. A metric tells you the CPU is at 80%, but a log tells you which process is consuming resources and why. This level of detail is essential for understanding not just that something is wrong, but what specifically is causing the problem.</p>
<p>Second, logs provide business context. Error messages contain user IDs, transaction details, and business logic failures that help you understand customer impact. When you're troubleshooting an issue, knowing which customers are affected and what they were trying to do is invaluable for prioritizing your response.</p>
<p>Third, logs enable true root cause analysis. Stack traces, error messages, and application state captured in logs are essential for understanding the why behind every incident. Without this information, teams are left guessing at root causes rather than definitively identifying and fixing them.</p>
<p>The teams winning with AIOps in 2025 aren't just monitoring metrics; they're extracting intelligence from their logs at scale, turning operational data into actionable insights.</p>
<h2>Transform Your Log Strategy Today</h2>
<p>Every hour your team spends manually searching through logs is an hour they're not spending on innovation. Every incident that could have been prevented through intelligent log analysis represents both technical debt and business risk.</p>
<p>Elastic Observability provides the foundation you need to unlock the intelligence hidden in your logs. With automatic categorization, anomaly detection, and ML-powered analysis, you can start seeing value immediately. Check out this recent <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">article</a> to get started with Elastic Streams and Observability today!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The observability gap: Why your monitoring strategy isn't ready for what's coming next]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-observability-opentelemetry-correlation-ai</link>
            <guid isPermaLink="false">modern-observability-opentelemetry-correlation-ai</guid>
            <pubDate>Mon, 25 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The increasing complexity of distributed applications and the observability data they generate creates challenges for SREs and IT Operations teams. Take a look at how you can close this observability gap with OpenTelemetry and the right strategy.]]></description>
            <content:encoded><![CDATA[<p>Anyone that’s been to London knows the announcements at the Tube to “Mind the gap” but what about the gap that’s developing in our monitoring and observability strategies? I’ve been through this toil before, and have run a distributed system that was humming along perfectly. My alerts were manageable, my dashboards made sense, and when things broke, I could usually track down the issue in a reasonable amount of time.</p>
<p>Fast forward 3-5 years and things have changed: we added Kubernetes, embraced microservices, and maybe even sprinkled in some AI-powered features. Suddenly, you're drowning in telemetry data, alert fatigue is real, and correlating issues across your distributed architecture feels stressful.</p>
<p>You're experiencing what I call the &quot;observability gap&quot;, where system complexity rockets ahead while our monitoring maturity crawls behind. Today, we're going to explore why this gap exists, what's driving it wider, and most importantly, how to close it using modern observability practices.</p>
<h2>The complexity rocket ship has left the station</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image2.jpg" alt="Observability Gap" /></p>
<p>Let's be honest about what we're dealing with. The scale and complexity of our infrastructure isn't growing linearly, it's exponential. We've gone from monolithic applications running on physical servers to container orchestration platforms managing hundreds of microservices, with AI algorithms now starting to make scaling decisions autonomously.</p>
<p>This trajectory shows no signs of slowing down. With AI-assisted coding accelerating development cycles and intelligent orchestration systems like Kubernetes evolving toward predictive scaling, we're looking at infrastructure that's not just complex, but dynamically complex.</p>
<p>Meanwhile, our observability tooling? It's stuck in the past, designed for a world where you knew exactly how many servers you had and could manually correlate logs with metrics by cross-referencing timestamps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image3.jpg" alt="Observability Gap part 2" /></p>
<h2>The telemetry data explosion (and why sampling isn't the answer)</h2>
<p>One of the first things teams notice as they scale is their observability bill climbing faster than their infrastructure costs. The knee-jerk reaction is often to start sampling data: downsampling metrics, head-sampling traces, deduplicating logs. While these techniques have their place, they're fundamentally at odds with where we're heading.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image4.jpg" alt="Data Management: Reduce fidelity of data" /></p>
<p>Here's the thing: ML and AI systems thrive on rich, contextual data. When you sample away the &quot;noise,&quot; you're often discarding the very signals that could help you understand system behavior patterns or predict failures. Instead of asking &quot;how can we collect less data?&quot;, the better question is &quot;how can we store and process all this data cost-effectively?&quot;</p>
<p>Modern storage architectures, particularly those leveraging object storage and advanced compression techniques like ZStandard, can achieve remarkable cost-to-value ratios. The secret is organizing related data together and moving it to cheaper storage tiers quickly. This approach lets you have your cake and eat it too: full-fidelity data retention without breaking the bank.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image5.jpg" alt="Data Management: Make Storage Cheaper" /></p>
<p>Of course, there is a balance here, and not all your applications are equal. As a first step, look at your most critical flows and applications and ensure they have the richest telemetry. Don't take a sledgehammer approach and sample all your data just to reduce bills when a scalpel is best.</p>
<h2>OpenTelemetry (OTel): the foundation everything else builds on</h2>
<p>If I had to pick the single most transformative change in observability during my career, it would be OpenTelemetry. Not because it's flashy or revolutionary in concept, but because it solves fundamental problems that have plagued us for years.</p>
<p>Before OTel, instrumenting applications meant vendor lock-in. Want to switch from vendor A to vendor B? Good luck re-instrumenting your entire codebase. Want to send the same telemetry to multiple backends? Hope you enjoy maintaining multiple agent configurations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image6.jpg" alt="What is OpenTelemetry" /></p>
<p>OpenTelemetry changes things completely. Here are the three main reasons why.</p>
<p><strong>Vendor Neutrality:</strong> Your instrumentation code becomes portable. The same OTEL SDK can send data to any compliant backend.</p>
<p><strong>OpenTelemetry Semantic Conventions:</strong> All your telemetry (logs, metrics, traces, profiles, wide-events) shares common metadata like service names, resource attributes, and trace context.</p>
<p><strong>Auto-Instrumentation:</strong> For most popular languages and frameworks, you get rich telemetry with zero code changes.</p>
<p>OTel also makes manual instrumentation incredibly valuable with minimal effort. Adding a single line like this:</p>
<p><code>baggage.set_baggage(&quot;customer.id&quot;, &quot;alice123&quot;)</code></p>
<p>in your authentication service means that customer ID automatically flows through every downstream service call, every database query, every log message. Suddenly, you can search all your telemetry data by customer ID across your entire distributed system.</p>
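<p>On the wire, baggage travels between services in the W3C <code>baggage</code> HTTP header as percent-encoded key=value pairs. A pure-Python sketch of that encoding (the OpenTelemetry SDK handles this for you; this is only illustrative):</p>
<pre><code class="language-python"># W3C Baggage header: comma-separated key=value members, percent-encoded.
from urllib.parse import quote, unquote

def encode_baggage(entries: dict) -> str:
    return ",".join(f"{quote(k)}={quote(str(v))}" for k, v in entries.items())

def decode_baggage(header: str) -> dict:
    out = {}
    for member in header.split(","):
        key, _, value = member.strip().partition("=")
        out[unquote(key)] = unquote(value)
    return out

# The single set_baggage call above produces a header like this downstream:
header = encode_baggage({"customer.id": "alice123"})
</code></pre>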
<p>The trajectory is clear: within a few years, OTel will be as ubiquitous and invisible as Kubernetes is becoming today. Runtimes will include it by default, cloud providers will offer OTel collectors at the edge, and frameworks will come pre-instrumented.</p>
<h2>Correlation: the secret sauce that makes everything click</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image7.jpg" alt="Why do we need correlation?" /></p>
<p>You get an alert about high latency. You check your metrics dashboard: yep, the 95th percentile is spiking. You switch to your tracing system and see some slow requests. You hop over to your logging system and find some error messages around the same time. Now comes the fun part: figuring out which logs correspond to which traces and whether they're related to the metric that alerted you.</p>
<p>This context-switching nightmare is exactly what proper correlation eliminates. When your telemetry data shares common identifiers (for example, trace IDs in logs, consistent service names, synchronized timestamps, or even customer IDs), you can seamlessly pivot between different signal types without losing context.</p>
<p>But correlation goes beyond just technical convenience. When you can search all your logs by customer.id and immediately see the traces and metrics for that customer's journey through your system, you transform how you approach support and debugging. When you can filter your entire observability stack by deployment version and instantly understand the impact of a release, you change how you think about deployments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image8.jpg" alt="How does this work?" /></p>
<p>Metrics? Yes, even metrics can be correlated by using OpenTelemetry exemplars. For example, in Python you would turn on exemplars as follows.</p>
<pre><code class="language-python"># Set up metrics with exemplars enabled
exemplar_filter = ExemplarFilter(trace_based=True)

exemplar_reservoir = ExemplarReservoir(
    exemplar_filter=exemplar_filter,
    max_exemplars=5
)
</code></pre>
<p>This associates a metric sample with the trace that is active when the sample is recorded, so your metrics are correlated with your traces.</p>
<h2>Then again, why correlate at all?</h2>
<p>So you may be thinking: this is great, and I can see it being a useful strategy. It is especially useful when you have metrics, logs, and traces in separate systems. However, pretty soon you realize it's a lot of effort when you could just combine all this data into a single data structure and avoid the need to correlate at all. The observability industry agrees and has recently been espousing the benefits of a new signal type called wide-events.</p>
<p>Wide-events are really just structured logs: the idea is to put metric data, trace data, and log data into the same wide data structure, which makes analysis much easier. Think about it: with a single data structure you can very quickly run queries and aggregations without having to join any data, which can get pretty expensive.</p>
<p>Additionally, you increase the information density per log record, which is particularly valuable for AI applications. AI gets a context-rich dataset to analyze with minimal latency: a single record with enough descriptive capability to quickly find the root cause of your issue without digging around in other data stores and reverse-engineering whatever schemas those stores use.</p>
<p>LLMs especially LOVE context and if you can give them all the context they need without having them try to find it, your investigation time will significantly reduce.</p>
<p>This isn't just about making SRE life easier (though it does that). It's about creating the rich, interconnected dataset that AI and ML systems need to understand your infrastructure's behavior patterns.</p>
<h2>AI-driven investigations</h2>
<p>Observability tools today have gotten pretty good at solving the alerting fatigue and dashboarding problems; things have become quite mature there. Alert correlation and other techniques drastically reduce the noise in these domains, not to mention a focus on being alerted by SLOs instead of pure technical metrics. Life has gotten better over the past few years for SREs here.</p>
<p>Alerts are one piece of the puzzle, but the latest AI techniques using LLMs and agentic AI can unlock time savings in a different spot: during investigations. Investigations are typically what drag on when you have an outage; the cognitive overload while the pressure is on is very real and stressful for SREs.</p>
<p>The good news is that once we get our data in good shape (correlated, enriched, and shaped as wide-events) and store it in full fidelity, we have the tools to drive faster investigations.</p>
<p>LLMs can take all that rich data and do some very powerful analysis that can cut down your investigation time. Let's walk through an example.</p>
<p>Imagine we have the following basic log. We only have a limited amount of data for an LLM to reason about. All it can tell is that a database failed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image9.jpg" alt="What is a basic log" /></p>
<p>Let's see what this looks like when we use a wide-event. Notice that we already see some significant benefits: we only had to visit the log from a single node, the node that serviced the request. We didn’t have to dig into downstream logs. This already makes life easier for the LLM; it doesn't have to figure out how to correlate multiple log lines, traces, and metrics, though we still have correlation IDs if we need to look in downstream systems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image10.jpg" alt="App Log" /></p>
<p>Next we have all this additional rich data that an LLM can use to reason about what happened. LLMs work best with context and if you can feed them as much context as possible they will work more effectively to reduce your investigation time.</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>How an LLM uses it</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>trace_id</code>, <code>parent_span_id</code></td>
<td>Thread every hop together without parsing free-text</td>
</tr>
<tr>
<td><code>status.code</code>, <code>error.*</code></td>
<td>Precise failure class; no NLP guess-work</td>
</tr>
<tr>
<td><code>db.*</code></td>
<td>Root-cause surface (&quot;postgres isn't provisioned&quot;)</td>
</tr>
<tr>
<td><code>user.id</code>, <code>cloud.region</code></td>
<td>Instant blast-radius queries</td>
</tr>
<tr>
<td><code>deployment.version</code></td>
<td>Correlation with new releases</td>
</tr>
</tbody>
</table>
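<p>Put together, a wide-event carrying the fields above might look like the following record (a hypothetical shape for illustration, not a prescribed schema):</p>
<pre><code class="language-python">import json

# Hypothetical wide-event: one record carrying trace, error, database,
# user, infrastructure, and deployment context together.
wide_event = {
    "@timestamp": "2025-08-25T12:00:00Z",
    "message": "checkout failed: could not connect to postgres",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "parent_span_id": "00f067aa0ba902b7",
    "status": {"code": "ERROR"},
    "error": {"type": "ConnectionError"},
    "db": {"system": "postgresql", "operation": "INSERT"},
    "user": {"id": "alice123"},
    "cloud": {"region": "us-east-1"},
    "deployment": {"version": "2025.08.1"},
}

# One record answers "what failed, for whom, where, and on which release"
# without joining three data stores.
line = json.dumps(wide_event)
</code></pre>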
<p>Notice that we didn’t get rid of the unstructured error message, this is still useful context! LLMs are great at processing unstructured text so this textual description helps it understand the problem even further.</p>
<p>Large language models shine when they’re handed complete, context-rich evidence, exactly what wide-event logging supplies. Invest once in richer logs, and every downstream AI workflow (summaries, anomaly detection, natural-language queries) becomes simpler, cheaper, and far more reliable.</p>
<h2>Building toward the future</h2>
<p>As I look ahead, three trends seem inevitable:</p>
<ol>
<li>
<p><strong>OpenTelemetry semantic conventions power wide-events:</strong> OTel semantic conventions will become as standard for creating wide-events as logging is today. Cloud providers, runtimes, and frameworks will use them by default.</p>
</li>
<li>
<p><strong>Making sense of logs with LLMs:</strong> Both improving the richness of your data and having LLMs automatically improve the richness of your existing logs will become essential for shortening investigation times.</p>
</li>
<li>
<p><strong>AI will be essential</strong>: As system complexity outpaces human cognitive ability to understand it, AI assistance will become necessary for maintaining reasonable investigation times.</p>
</li>
</ol>
<p>The organizations that start building toward this future now, adopting OpenTelemetry, investing in richer observability, and beginning to experiment with AI-assisted debugging will have a significant advantage as these trends accelerate.</p>
<h2>Your next steps</h2>
<p>If you're dealing with the observability gap in your own environment, here's where I'd start:</p>
<ol>
<li>
<p><strong>Evaluate your logs:</strong> Do your logs have the richness of data you need to shorten investigation times? Can LLMs help provide additional context?</p>
</li>
<li>
<p><strong>Start experimenting with OpenTelemetry:</strong> Even if you can't migrate everything immediately, instrumenting new services with OTel and using semantic conventions to produce wide-events gives you experience with the technology and starts building your enriched dataset.</p>
</li>
<li>
<p><strong>Add high-value context:</strong> Customer IDs, session IDs, deployment versions. Even small amounts of contextual metadata can dramatically improve your debugging capabilities.</p>
</li>
<li>
<p><strong>Think beyond storage costs:</strong> Instead of sampling data away, investigate modern storage architectures that let you keep everything at a reasonable cost for your most critical services.</p>
</li>
</ol>
<p>The complexity rocket ship has left the station, and it's not slowing down. The question isn't whether your observability strategy needs to evolve; it's whether you'll evolve it proactively or reactively. I know which approach leads to better sleep at night.</p>
<h2>Additional resources</h2>
<ul>
<li><a href="https://www.elastic.co/virtual-events/getting-started-logging">Getting started with logging on the ELK Stack webinar</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai">The next evolution of observability: unifying data with OpenTelemetry and generative AI blog</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Pivoting Elastic's Data Ingestion to OpenTelemetry blog</a></li>
</ul>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image1.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Monitor dbt pipelines with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability</link>
            <guid isPermaLink="false">monitor-dbt-pipelines-with-elastic-observability</guid>
            <pubDate>Fri, 26 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to set up a dbt monitoring system with Elastic that proactively alerts on data processing cost spikes, anomalies in rows per table, and data quality test failures]]></description>
            <content:encoded><![CDATA[<p>In the Data Analytics team within the Observability organization in Elastic, we use <a href="https://www.getdbt.com/product/what-is-dbt">dbt (dbt™, data build tool)</a> to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use <a href="https://docs.getdbt.com/docs/core/installation-overview">dbt core</a>, the <a href="https://github.com/dbt-labs/dbt-core">open-source project</a>, where you can develop from the command line and run your dbt project.</p>
<p>Our data transformation pipelines run daily and process the data that feeds our internal dashboards, reports, analyses, and Machine Learning (ML) models.</p>
<p>In the past, there have been incidents where the pipelines failed, the source tables contained wrong data, or a change we introduced into our SQL code caused data quality issues, and we only realized it once a weekly report showed an anomalous number of records. That’s why we built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and helps us, with visualizations and analyses, understand their root cause, saving us several hours or days of manual investigation.</p>
<p>We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.</p>
<p>The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTel and Elastic. Stay tuned.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/architecture.png" alt="1 - architecture" /></p>
<h2>Why monitor dbt pipelines with Elastic?</h2>
<p>With every invocation, dbt generates and saves one or more JSON files called <a href="https://docs.getdbt.com/reference/artifacts/dbt-artifacts">artifacts</a> containing log data on the invocation results. <code>dbt run</code> and <code>dbt test</code> invocation logs are <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">stored in the file <code>run_results.json</code></a>, as per the dbt documentation:</p>
<blockquote>
<p>This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many <code>run_results.json</code> can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.</p>
</blockquote>
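<p>For orientation, here is an abridged, hypothetical illustration of the shape of <code>run_results.json</code>, limited to the fields this post relies on (names and numbers are made up; see the dbt artifact docs for the full schema):</p>
<pre><code class="language-json">{
  &quot;metadata&quot;: { &quot;generated_at&quot;: &quot;2024-07-01T06:00:00Z&quot; },
  &quot;elapsed_time&quot;: 512.3,
  &quot;args&quot;: { &quot;which&quot;: &quot;run&quot; },
  &quot;results&quot;: [
    {
      &quot;unique_id&quot;: &quot;model.my_project.dim_users&quot;,
      &quot;status&quot;: &quot;success&quot;,
      &quot;execution_time&quot;: 42.7,
      &quot;adapter_response&quot;: {
        &quot;bytes_processed&quot;: 1073741824,
        &quot;bytes_billed&quot;: 1073741824,
        &quot;slot_ms&quot;: 93000,
        &quot;rows_affected&quot;: 250000
      }
    }
  ]
}
</code></pre>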
<p>Monitoring <code>dbt run</code> invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the <code>dbt run</code> logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.</p>
<p>Monitoring <code>dbt test</code> invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the <code>dbt test</code> logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the <code>dbt run</code> logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.</p>
<p>Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.</p>
<p>We can also correlate this information with other events ingested into Elastic, for example using the <a href="https://www.elastic.co/guide/en/enterprise-search/current/connectors-github.html">Elastic Github connector</a>, we can correlate data quality test failures or other anomalies with code changes to find the root cause of the commit or PR that caused the issues. By ingesting application logs into Elastic, we can also analyze if these issues in our pipelines have affected downstream applications, increasing latency, throughput or error rates using APM. Ingesting billing, revenue data or web traffic, we could also see the impact in business metrics.</p>
<h2>How to export dbt invocation logs to Elasticsearch</h2>
<p>We use the <a href="https://elasticsearch-py.readthedocs.io/en">Python Elasticsearch client</a> to send the dbt invocation logs to Elastic after we run our <code>dbt run</code> and <code>dbt test</code> processes daily in production. The setup just requires you to install the <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#installation">Elasticsearch Python client</a> and obtain your Elastic Cloud ID (go to <a href="https://cloud.elastic.co/deployments/">https://cloud.elastic.co/deployments/</a>, select your deployment and find the <code>Cloud ID</code>) and your Elastic Cloud API Key <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#connecting">(following this guide)</a>.</p>
<p>This Python helper function will index the results from your <code>run_results.json</code> file into the specified index. You just need to export these variables to the environment:</p>
<ul>
<li><code>RESULTS_FILE</code>: path to your <code>run_results.json</code> file</li>
<li><code>DBT_RUN_LOGS_INDEX</code>: the name you want to give to dbt run logs index in Elastic, e.g. <code>dbt_run_logs</code></li>
<li><code>DBT_TEST_LOGS_INDEX</code>: the name you want to give to the dbt test logs index in Elastic, e.g. <code>dbt_test_logs</code></li>
<li><code>ES_CLUSTER_CLOUD_ID</code></li>
<li><code>ES_CLUSTER_API_KEY</code></li>
</ul>
<p>Then call the function <code>log_dbt_es</code> from your python code or save this code as a python script and run it after executing your <code>dbt run</code> or <code>dbt test</code> commands:</p>
<pre><code>from elasticsearch import Elasticsearch, helpers
import os
import sys
import json

def log_dbt_es():
   RESULTS_FILE = os.environ[&quot;RESULTS_FILE&quot;]
   DBT_RUN_LOGS_INDEX = os.environ[&quot;DBT_RUN_LOGS_INDEX&quot;]
   DBT_TEST_LOGS_INDEX = os.environ[&quot;DBT_TEST_LOGS_INDEX&quot;]
   es_cluster_cloud_id = os.environ[&quot;ES_CLUSTER_CLOUD_ID&quot;]
   es_cluster_api_key = os.environ[&quot;ES_CLUSTER_API_KEY&quot;]


   es_client = Elasticsearch(
       cloud_id=es_cluster_cloud_id,
       api_key=es_cluster_api_key,
       request_timeout=120,
   )


   if not os.path.exists(RESULTS_FILE):
       print(f&quot;ERROR: {RESULTS_FILE} No dbt run results found.&quot;)
       sys.exit(1)


   with open(RESULTS_FILE, &quot;r&quot;) as json_file:
       results = json.load(json_file)
       timestamp = results[&quot;metadata&quot;][&quot;generated_at&quot;]
       metadata = results[&quot;metadata&quot;]
       elapsed_time = results[&quot;elapsed_time&quot;]
       args = results[&quot;args&quot;]
       docs = []
       for result in results[&quot;results&quot;]:
           if result[&quot;unique_id&quot;].split(&quot;.&quot;)[0] == &quot;test&quot;:
               result[&quot;_index&quot;] = DBT_TEST_LOGS_INDEX
           else:
               result[&quot;_index&quot;] = DBT_RUN_LOGS_INDEX
           result[&quot;@timestamp&quot;] = timestamp
           result[&quot;metadata&quot;] = metadata
           result[&quot;elapsed_time&quot;] = elapsed_time
           result[&quot;args&quot;] = args
           docs.append(result)
       _ = helpers.bulk(es_client, docs)
   return &quot;Done&quot;

# Call the function
log_dbt_es()
</code></pre>
<p>If you want to add/remove any other fields from <code>run_results.json</code>, you can modify the above function to do it.</p>
<p>Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.</p>
<p>Go to Discover, click on the data view selector on the top left and “Create a data view”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-create-dataview.png" alt="2 - discover create a data view" /></p>
<p>Now you can create a data view with your preferred name. Do this for both dbt run (<code>DBT_RUN_LOGS_INDEX</code> in your code) and dbt test (<code>DBT_TEST_LOGS_INDEX</code> in your code) indices:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/create-dataview.png" alt="3 - create a data view" /></p>
<p>Going back to Discover, you’ll be able to select the Data Views and explore the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-logs-explorer.png" alt="4 - discover logs explorer" /></p>
<h2>dbt run alerts, dashboards and ML jobs</h2>
<p>The invocation of <a href="https://docs.getdbt.com/reference/commands/run"><code>dbt run</code></a> executes compiled SQL model files against the current database. <code>dbt run</code> invocation logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique model identifier</li>
<li><code>execution_time</code>: Total time spent executing this model run</li>
</ul>
<p>The logs also contain the following metrics about the job execution from the adapter:</p>
<ul>
<li><code>adapter_response.bytes_processed</code></li>
<li><code>adapter_response.bytes_billed</code></li>
<li><code>adapter_response.slot_ms</code></li>
<li><code>adapter_response.rows_affected</code></li>
</ul>
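<p>To make the per-table rollup concrete, here is a small pure-Python sketch of the aggregation the dashboards and anomaly detection jobs perform at scale: summing <code>rows_affected</code> and <code>slot_ms</code> per <code>unique_id</code>. The documents and values are hypothetical stand-ins for the indexed run logs:</p>

```python
from collections import defaultdict

# Hypothetical documents mirroring the indexed dbt run logs
docs = [
    {"unique_id": "model.proj.table_a",
     "adapter_response": {"rows_affected": 1000, "slot_ms": 4000}},
    {"unique_id": "model.proj.table_a",
     "adapter_response": {"rows_affected": 1200, "slot_ms": 4600}},
    {"unique_id": "model.proj.table_b",
     "adapter_response": {"rows_affected": 50, "slot_ms": 300}},
]

# Sum each metric per table, as the Kibana visualizations do for us
totals = defaultdict(lambda: {"rows_affected": 0, "slot_ms": 0})
for doc in docs:
    for metric in ("rows_affected", "slot_ms"):
        totals[doc["unique_id"]][metric] += doc["adapter_response"][metric]
```

A time series of exactly these per-table sums is what the anomaly detection jobs model, so a sudden drop in daily rows or a spike in slot time for one table stands out immediately.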
<p>We have used Kibana to set up <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection jobs</a> on the above-mentioned metrics. You can configure a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">multi-metric job</a> split by <code>unique_id</code> to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use <a href="https://www.elastic.co/guide/en/machine-learning/8.14/ml-jobs-from-lens.html">this shortcut</a> to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-view-results.html">view the jobs</a> and add them to a dashboard using the three dots button in the anomaly timeline:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-add-to-dashboard.png" alt="5 - add ML job to dashboard" /></p>
<p>We have used the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-alerts.html">ML job to set up alerts</a> that send us emails/slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning &gt; Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-create-alert.png" alt="6 - create alert from ML job" /></p>
<p>We also use <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">Kibana dashboards</a> to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-dashboard.png" alt="7 - ML job in dashboard" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-slot-time.png" alt="8 - dashboard slot time chart" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-aggregated-metrics.png" alt="9 - dashboard aggregated metrics" /></p>
<h2>dbt test alerts and dashboards</h2>
<p>You may already be familiar with <a href="https://docs.getdbt.com/docs/build/data-tests">tests in dbt</a>, but if you’re not, dbt data tests are assertions you make about your models. Using the command <a href="https://docs.getdbt.com/reference/commands/test"><code>dbt test</code></a>, dbt will tell you if each test in your project passes or fails. <a href="https://docs.getdbt.com/docs/build/data-tests#example">Here is an example of how to set them up</a>. In our team, we use out-of-the-box dbt tests (<code>unique</code>, <code>not_null</code>, <code>accepted_values</code>, and <code>relationships</code>) and the packages <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">dbt_utils</a> and <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/">dbt_expectations</a> for some extra tests. When the command <code>dbt test</code> is run, it generates logs that are stored in <code>run_results.json</code>.</p>
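<p>As a quick illustration, a minimal <code>schema.yml</code> wiring up some of the out-of-the-box tests mentioned above might look like this (the model and column names are hypothetical):</p>
<pre><code class="language-yaml">version: 2

models:
  - name: users            # hypothetical model
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: plan
        tests:
          - accepted_values:
              values: [&quot;free&quot;, &quot;gold&quot;, &quot;platinum&quot;]
</code></pre>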
<p>dbt test logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique test identifier; tests contain the “test” prefix in their unique identifier</li>
<li><code>status</code>: Result of the test, <code>pass</code> or <code>fail</code></li>
<li><code>execution_time</code>: Total time spent executing this test</li>
<li><code>failures</code>: 0 if the test passes and 1 if it fails</li>
<li><code>message</code>: If the test fails, the reason why it failed</li>
</ul>
<p>The logs also contain the metrics about the job execution from the adapter.</p>
<p>We have set up alerts on document count (see <a href="https://www.elastic.co/guide/en/observability/8.14/custom-threshold-alert.html">guide</a>) that send us an email or Slack message whenever any test fails. The rule is defined on the dbt test Data View we created earlier, with a query filtering on <code>status:fail</code> to select the logs of failed tests, and a condition of document count greater than 0.
Whenever a test fails in production, we get an alert with links to the alert details and dashboards so we can troubleshoot it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/email-alert.png" alt="10 - alert" /></p>
<p>We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-tests.png" alt="11 - dashboard dbt tests" /></p>
<h2>Finding root causes with the AI Assistant</h2>
<p>The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot the incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and the increase of the slot time vs. the baseline. Then, we asked for the root cause, and the AI Assistant was able to find and provide us a link to a PR from our Github changelog that matched the start of the incident and was the most probable cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ai-assistant.png" alt="12 - ai assistant troubleshoot" /></p>
<h2>Conclusion</h2>
<p>As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.</p>
<p>dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/monitoring-dbt-with-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor OpenAI API and GPT models with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-openai-api-gpt-models-opentelemetry</link>
            <guid isPermaLink="false">monitor-openai-api-gpt-models-opentelemetry</guid>
            <pubDate>Tue, 04 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Get ready to be blown away by this game-changing approach to monitoring cutting-edge ChatGPT applications! As the ChatGPT phenomenon takes the world by storm, it's time to supercharge your monitoring game with OpenTelemetry and Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>ChatGPT is so hot right now, it broke the internet. As an avid user of ChatGPT and a developer of ChatGPT applications, I am incredibly excited by the possibilities of this technology. What I see happening is that there will be exponential growth of ChatGPT-based solutions, and people are going to need to monitor those solutions.</p>
<p>Since this is a pretty new technology, we wouldn’t want to burden our shiny new code with proprietary technology, would we? No, we would not, and that is why we are going to use OpenTelemetry to monitor our ChatGPT code in this blog. This is particularly relevant for me as I recently created a service to generate meeting notes from Zoom calls. If I am to release this into the wild, how much is it going to cost me and how do I make sure it is available?</p>
<h2>OpenAI APIs to the rescue</h2>
<p>The OpenAI API is pretty awesome, no doubt. It also gives us the information shown below in each API response, which can help us understand what we are being charged. By using the token counts, the model, and the pricing that OpenAI has put up on its website, we can calculate the cost. The question is, how do we get this information into our monitoring tools?</p>
<pre><code class="language-json">{
  &quot;choices&quot;: [
    {
      &quot;finish_reason&quot;: &quot;length&quot;,
      &quot;index&quot;: 0,
      &quot;logprobs&quot;: null,
      &quot;text&quot;: &quot;\n\nElastic is an amazing observability tool because it provides a comprehensive set of features for monitoring&quot;
    }
  ],
  &quot;created&quot;: 1680281710,
  &quot;id&quot;: &quot;cmpl-70CJq07gibupTcSM8xOWekOTV5FRF&quot;,
  &quot;model&quot;: &quot;text-davinci-003&quot;,
  &quot;object&quot;: &quot;text_completion&quot;,
  &quot;usage&quot;: {
    &quot;completion_tokens&quot;: 20,
    &quot;prompt_tokens&quot;: 9,
    &quot;total_tokens&quot;: 29
  }
}
</code></pre>
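<p>To make the cost arithmetic concrete: text-davinci-003 was billed per total token, and at the $0.02 per 1K-token rate used later in this post, the 29-token response above works out as follows (a worked example at that historical rate, not current OpenAI pricing):</p>

```python
# usage.total_tokens from the sample response above
total_tokens = 29

# davinci rate used in this post: $0.02 per 1,000 tokens
cost_usd = total_tokens * 0.02 / 1000  # about $0.00058 per request
```

Fractions of a cent per request add up quickly at scale, which is exactly why we want this number attached to every span.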
<h2>OpenTelemetry to the rescue</h2>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">OpenTelemetry</a> is truly a fantastic piece of work. It has had so much adoption and work committed to it over the years, and it seems to really be getting to the point where we can call it the Linux of Observability. We can use it to record logs, metrics, and traces and get those in a vendor neutral way into our favorite observability tool — in this case, Elastic Observability.</p>
<p>With the latest and greatest OTel libraries in Python, we can auto-instrument external calls, which will help us understand how OpenAI calls are performing. Let's take a sneak peek at our sample Python application, which implements Flask and the ChatGPT API and also has OpenTelemetry. If you want to try this yourself, take a look at the GitHub link at the end of this blog and follow these steps.</p>
<h3>Set up an Elastic Cloud account (if you don’t already have one)</h3>
<ol>
<li>Sign up for a two-week free trial at <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">https://www.elastic.co/cloud/elasticsearch-service/signup</a>.</li>
<li>Create a deployment.</li>
</ol>
<p>Once you are logged in, click <strong>Add integrations</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-cloud-deployment-add-integrations.png" alt="elastic cloud deployment add integrations" /></p>
<p>Click on <strong>APM Integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-apm-integration.png" alt="elastic apm integration" /></p>
<p>Then scroll down to get the details you need for this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-opentelemetry-download.png" alt="elastic opentelemetry download" /></p>
<p>Be sure to set the following environment variables, replacing the values with the details you got from Elastic above and your OpenAI API key from <a href="https://platform.openai.com/account/api-keys">here</a>, then run these export commands on the command line.</p>
<pre><code class="language-bash">export OPEN_AI_KEY=sk-abcdefgh5ijk2l173mnop3qrstuvwxyzab2cde47fP2g9jij
export OTEL_EXPORTER_OTLP_AUTH_HEADER=abc9ldeofghij3klmn
export OTEL_EXPORTER_OTLP_ENDPOINT=https://123456abcdef.apm.us-west2.gcp.elastic-cloud.com:443
</code></pre>
<p>And install the following Python libraries:</p>
<pre><code class="language-bash">pip3 install opentelemetry-api
pip3 install opentelemetry-sdk
pip3 install opentelemetry-exporter-otlp
pip3 install opentelemetry-instrumentation
pip3 install opentelemetry-instrumentation-requests
pip3 install openai
pip3 install flask
</code></pre>
<p>Here is a look at the code we are using for the example application. In the real world, this would be your own code. All this does is call OpenAI APIs with the following message: “Why is Elastic an amazing observability tool?”</p>
<pre><code class="language-python">import openai
from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# OpenTelemetry setup code here, feel free to replace the “your-service-name” attribute here.
resource = Resource(attributes={
    SERVICE_NAME: &quot;your-service-name&quot;
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers=&quot;Authorization=Bearer%20&quot;+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()



# Initialize Flask app and instrument it

app = Flask(__name__)
# Set OpenAI API key
openai.api_key = os.getenv('OPEN_AI_KEY')


@app.route(&quot;/completion&quot;)
@tracer.start_as_current_span(&quot;do_work&quot;)
def completion():
    response = openai.Completion.create(
        model=&quot;text-davinci-003&quot;,
        prompt=&quot;Why is Elastic an amazing observability tool?&quot;,
        max_tokens=20,
        temperature=0
    )
    return response.choices[0].text.strip()

if __name__ == &quot;__main__&quot;:
    app.run()
</code></pre>
<p>This code should be fairly familiar to anyone who has implemented OpenTelemetry with Python; there is no specific magic here. The magic happens inside the “monitor” code, which you can use freely to instrument your own OpenAI applications.</p>
<h2>Monkeying around</h2>
<p>Inside the monitor.py code, you will see we do something called “Monkey Patching.” Monkey patching is a technique in Python where you dynamically modify the behavior of a class or module at runtime by modifying its attributes or methods. Monkey patching allows you to change the functionality of a class or module without having to modify its source code. It can be useful in situations where you need to modify the behavior of an existing class or module that you don't have control over or cannot modify directly.</p>
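<p>Before looking at the OpenAI-specific patch, here is a tiny self-contained sketch of the technique itself; the class and function names are made up purely for illustration:</p>

```python
from functools import wraps

# A made-up class standing in for library code we can't edit
class Greeter:
    def greet(self, name):
        return f"hello {name}"

calls = {"count": 0}

def counted(func):
    # Wrap an existing callable to observe each invocation and its result
    @wraps(func)
    def wrapper(*args, **kwargs):
        calls["count"] += 1
        return func(*args, **kwargs)
    return wrapper

# Monkey-patch: swap the method at runtime without touching the class body
Greeter.greet = counted(Greeter.greet)

result = Greeter().greet("otel")
```

The caller still invokes `greet` as before; the wrapper sees every call and its return value, which is precisely the hook we need to steal OpenAI's usage metrics.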
<p>What we want to do here is modify the behavior of the “Completion” call so we can steal the response metrics and add them to our OpenTelemetry spans. You can see how we do that below:</p>
<pre><code class="language-python">from functools import wraps
import json

from opentelemetry import trace

# Running total of completion calls, shared across requests
counters = {'completion_count': 0}

def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)
        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)
        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, counters['completion_count'])
            span.set_attribute(&quot;token_count&quot;, token_count)
            span.set_attribute(&quot;prompt_tokens&quot;, prompt_tokens)
            span.set_attribute(&quot;completion_tokens&quot;, completion_tokens)
            span.set_attribute(&quot;model&quot;, response.model)
            span.set_attribute(&quot;cost&quot;, cost)
            span.set_attribute(&quot;response&quot;, strResponse)
        return response
    return wrapper
# Monkey-patch the openai.Completion.create function
openai.Completion.create = count_completion_requests_and_tokens(openai.Completion.create)
</code></pre>
<p>By adding all this data to our Span, we can actually send it to our OpenTelemetry OTLP endpoint (in this case it will be Elastic). The benefit of doing this is that you can easily use the data for search or to build dashboards and visualizations. In the final step, we also want to calculate the cost. We do this by implementing the following function, which will calculate the cost of a single request to the OpenAI APIs.</p>
<pre><code class="language-python">def calculate_cost(response):
    if response.model in ['gpt-4', 'gpt-4-0314']:
        cost = (response.usage.prompt_tokens * 0.03 + response.usage.completion_tokens * 0.06) / 1000
    elif response.model in ['gpt-4-32k', 'gpt-4-32k-0314']:
        cost = (response.usage.prompt_tokens * 0.06 + response.usage.completion_tokens * 0.12) / 1000
    elif 'gpt-3.5-turbo' in response.model:
        cost = response.usage.total_tokens * 0.002 / 1000
    elif 'davinci' in response.model:
        cost = response.usage.total_tokens * 0.02 / 1000
    elif 'curie' in response.model:
        cost = response.usage.total_tokens * 0.002 / 1000
    elif 'babbage' in response.model:
        cost = response.usage.total_tokens * 0.0005 / 1000
    elif 'ada' in response.model:
        cost = response.usage.total_tokens * 0.0004 / 1000
    else:
        cost = 0
    return cost
</code></pre>
<h2>Elastic to the rescue</h2>
<p>Once we are capturing all this data, it’s time to have some fun with it in Elastic. In Discover, we can see all the data points we sent over using the OpenTelemetry library:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-discover-apm.png" alt="elastic discover apm" /></p>
<p>With these labels in place, it is very easy to build a dashboard. Take a look at this one I built earlier (<a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/chatGPTDashboard.ndjson">which is also checked into my GitHub Repository</a>):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-labels-dashboard.png" alt="elastic labels dashboard" /></p>
<p>We can also see Transactions, Latency of the OpenAI service, and all the spans related to our ChatGPT service calls.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-observability-service-name.png" alt="observability service name" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-your-service-name.png" alt="elastic your service name" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-api-openai.png" alt="elastic api openai" /></p>
<p>In the transaction view, we can also see how long specific OpenAI calls have taken:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-latency-distribution.png" alt="elastic latency distribution" /></p>
<p>Some requests to OpenAI here have taken over 3 seconds. ChatGPT can be very slow, so it’s important for us to understand how slow this is and if users are becoming frustrated.</p>
<h2>Summary</h2>
<p>We looked at monitoring ChatGPT with OpenTelemetry and Elastic. ChatGPT is a worldwide phenomenon whose adoption will no doubt keep growing. Because it can be slow to return responses, it is critical to understand the performance of any code that uses this service.</p>
<p>There is also the issue of cost, since it’s incredibly important to understand if this service is eating into your margins and if what you are asking for is profitable for your business. With the current economic environment, we have to keep an eye on profitability.</p>
<p>Take a look at the code for this solution <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel">here</a>. And please feel free to use the “monitor” library to instrument your own OpenAI code.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/opentelemetry-graphic-ad-2-1920x1080.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor your Python data pipelines with OTEL]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel</link>
            <guid isPermaLink="false">monitor-your-python-data-pipelines-with-otel</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure OTEL for your data pipelines, detect any anomalies, analyze performance, and set up corresponding alerts with Elastic.]]></description>
<content:encoded><![CDATA[<p>This article delves into how to implement observability practices, particularly using <a href="https://opentelemetry.io/">OpenTelemetry (OTEL)</a> in Python, to enhance the monitoring and quality control of data pipelines with Elastic. While the examples in this article focus on ETL (Extract, Transform, Load) processes, where accurate and reliable data pipelines are crucial for Business Intelligence (BI), the strategies and tools discussed apply equally to Python processes used for Machine Learning (ML) models or other data processing tasks.</p>
<h2>Introduction</h2>
<p>Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.</p>
<p>In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into <a href="https://cloud.google.com/bigquery">Google BigQuery (BQ)</a>. This processed data then feeds into <a href="https://www.getdbt.com">DBT (Data Build Tool)</a> models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic see <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.</p>
<p>The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used, as long as there is a corresponding agent that supports OTEL instrumentation.</p>
<h2>Motivation</h2>
<p>Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:</p>
<ol>
<li>Data Quality Control:</li>
</ol>
<ul>
<li>Detecting anomalies in the data, such as unexpected drops in record counts.</li>
<li>Verifying that data transformations are applied correctly and consistently.</li>
<li>Ensuring the integrity and accuracy of the data loaded into the data warehouse.</li>
</ul>
<ol start="2">
<li>Performance Monitoring:</li>
</ol>
<ul>
<li>Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.</li>
<li>Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.</li>
</ul>
<ol start="3">
<li>Real-time Alerting:</li>
</ol>
<ul>
<li>Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.</li>
<li>Identifying the root cause of such incidents.</li>
<li>Proactively addressing incidents to minimize downtime and impact on business operations.</li>
</ul>
<p>Issues such as failed ETL jobs can even point to larger infrastructure problems or quality issues in the source data.</p>
<h2>Steps for Instrumentation</h2>
<p>Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.</p>
<h3>Step 1: Import Required Libraries</h3>
<p>We first need to install the following libraries.</p>
<pre><code class="language-sh">pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]
</code></pre>
<p>You can also add them to your project's <code>requirements.txt</code> file and install them with <code>pip install -r requirements.txt</code>.</p>
<h4>Explanation of Dependencies</h4>
<ol>
<li>
<p><strong>elastic-opentelemetry</strong>: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:</p>
<ul>
<li>
<p><strong>opentelemetry-distro</strong>: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.</p>
</li>
<li>
<p><strong>opentelemetry-exporter-otlp</strong>: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.</p>
</li>
<li>
<p><strong>opentelemetry-instrumentation-system-metrics</strong>: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.</p>
</li>
</ul>
</li>
<li>
<p><strong>google-cloud-bigquery[opentelemetry]</strong>: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.</p>
</li>
</ol>
<h3>Step 2: Export OTEL Variables</h3>
<p>Set the necessary OpenTelemetry (OTEL) variables by getting the OTEL configuration from Elastic's APM interface.</p>
<p>Go to APM -&gt; Services -&gt; Add data (top left corner).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-1.png" alt="1 - Get OTEL variables step 1" /></p>
<p>In this section you will find the steps to configure various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-2.png" alt="2 - Get OTEL variables step 2" /></p>
<p><strong>Find OTLP Endpoint</strong>:</p>
<ul>
<li>Look for the section related to OpenTelemetry or OTLP configuration.</li>
<li>The <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like <code>https://&lt;your-apm-server&gt;/otlp</code>.</li>
</ul>
<p><strong>Obtain OTLP Headers</strong>:</p>
<ul>
<li>In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.</li>
<li>Copy the necessary headers provided by the interface. They might look like <code>Authorization: Bearer &lt;your-token&gt;</code>.</li>
</ul>
<p>Note: you need to replace the whitespace between <code>Bearer</code> and your token with <code>%20</code> in the <code>OTEL_EXPORTER_OTLP_HEADERS</code> variable when using Python.</p>
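<p>If you prefer not to do the replacement by hand, the standard library can percent-encode the header value for you; the token below is a placeholder, not a real credential:</p>

```python
from urllib.parse import quote

token = "your-token"  # placeholder for the real secret token
# quote() percent-encodes the space between "Bearer" and the token as %20
header = "Authorization=" + quote(f"Bearer {token}")
print(header)  # Authorization=Bearer%20your-token
```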
<p>Alternatively you can use a different approach for authentication using API keys (see <a href="https://github.com/elastic/elastic-otel-python?tab=readme-ov-file#authentication">instructions</a>). If you are using our <a href="https://www.elastic.co/docs/current/serverless/general/what-is-serverless-elastic">serverless offering</a> you will need to use this approach instead.</p>
<p><strong>Set up the variables</strong>:</p>
<ul>
<li>Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command <code>source env.sh</code>.</li>
</ul>
<p>Below is a script to set these variables:</p>
<pre><code class="language-sh">#!/bin/bash
echo &quot;--- :otel: Setting OTEL variables&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server/otlp:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER=&quot;otlp,console&quot;
</code></pre>
<p>With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.</p>
<h4>Explanation of Variables</h4>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong>: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace <code>placeholder</code> with your actual OTLP endpoint.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong>: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace <code>placeholder</code> with your actual OTLP headers.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED</strong>: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOG_CORRELATION</strong>: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.</p>
</li>
<li>
<p><strong>OTEL_METRIC_EXPORT_INTERVAL</strong>: This variable specifies the metric export interval in milliseconds, in this case 5s.</p>
</li>
<li>
<p><strong>OTEL_LOGS_EXPORTER</strong>: This variable specifies the exporter to use for logs. Setting it to &quot;otlp&quot; means that logs will be exported using the OTLP protocol. Adding &quot;console&quot; exports logs to both the OTLP endpoint and the console. In our case, for better visibility on the infrastructure side, we choose to export to the console as well.</p>
</li>
<li>
<p><strong>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</strong>: This variable must be set to true when using the Elastic distribution, as it defaults to false.</p>
</li>
</ul>
<p>Note: <strong>OTEL_METRICS_EXPORTER</strong> and <strong>OTEL_TRACES_EXPORTER</strong>: These variables specify the exporters to use for metrics and traces. They are set to &quot;otlp&quot; by default, which means that metrics and traces will be exported using the OTLP protocol.</p>
<h3>Running Python ETLs</h3>
<p>We run Python ETLs with the following command:</p>
<pre><code class="language-sh">OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=x-ETL,service.version=1.0,deployment.environment=production&quot; opentelemetry-instrument python3 X_ETL.py
</code></pre>
<h4>Explanation of the Command</h4>
<ul>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong>: This variable specifies additional resource attributes, such as the <a href="https://www.elastic.co/guide/en/observability/current/apm.html">service name</a>, service version, and deployment environment, that will be included in all telemetry data. You can customize these values as needed and use a different service name for each script.</p>
</li>
<li>
<p><strong>opentelemetry-instrument</strong>: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.</p>
</li>
<li>
<p><strong>python3 X_ETL.py</strong>: This runs the specified Python script (<code>X_ETL.py</code>).</p>
</li>
</ul>
<h3>Tracing</h3>
<p>We export the traces via the default OTLP protocol.</p>
<p>Tracing is a key aspect of monitoring and understanding the performance of applications. <a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-spans.html">Spans</a> form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.</p>
<p>Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.</p>
<p>With the default instrumentation, the whole Python script would be a single span. In our case we have decided to manually add specific spans per the different phases of the Python process, to be able to measure their latency, throughput, error rate, etc individually. This is how we define spans manually:</p>
<pre><code class="language-python">from opentelemetry import trace

if __name__ == &quot;__main__&quot;:

    tracer = trace.get_tracer(&quot;main&quot;)
    with tracer.start_as_current_span(&quot;initialization&quot;) as span:
        # Init code
        …
    with tracer.start_as_current_span(&quot;search&quot;) as span:
        # Step 1 - Search code
        …
    with tracer.start_as_current_span(&quot;transform&quot;) as span:
        # Step 2 - Transform code
        …
    with tracer.start_as_current_span(&quot;load&quot;) as span:
        # Step 3 - Load code
        …
<p>You can explore traces in the APM interface as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/Traces-APM-Observability-Elastic.png" alt="3 - APM Traces view" /></p>
<h3>Metrics</h3>
<p>We export metrics via the default OTLP protocol as well, such as CPU usage and memory. No extra code needs to be added in the script itself.</p>
<p>Note: Remember to set <code>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</code> to true.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-metrics-apm-view.png" alt="4 - APM Metrics view" /></p>
<h3>Logging</h3>
<p>We export logs via the default OTLP protocol as well.</p>
<p>For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:</p>
<pre><code class="language-python">        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # &quot;slot_time_ms&quot;: job_details.slot_ms,
            &quot;job_id&quot;: job_details.job_id,
            &quot;job_type&quot;: job_details.job_type,
            &quot;state&quot;: job_details.state,
            &quot;path&quot;: job_details.path,
            &quot;job_created&quot;: job_details.created.isoformat(),
            &quot;job_ended&quot;: job_details.ended.isoformat(),
            &quot;execution_time_ms&quot;: (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            &quot;bytes_processed&quot;: job_details.output_bytes,
            &quot;rows_affected&quot;: job_details.output_rows,
            &quot;destination_table&quot;: job_details.destination.table_id,
            &quot;event&quot;: &quot;BigQuery Load Job&quot;, # Custom event type
            &quot;status&quot;: &quot;success&quot;, # Status of the step (success/error)
            &quot;category&quot;: category # ETL category tag 
        }

        logging.info(&quot;BigQuery load operation successful&quot;, extra=bq_fields)
</code></pre>
<p>This code shows how to extract BQ job stats, among them the execution time, bytes processed, rows affected, and destination table. You can also attach other metadata, as we do with a custom event type, status, and category.</p>
<p>Any logging call at or above the configured threshold (in this case INFO, set via <code>logging.getLogger().setLevel(logging.INFO)</code>) will create a log that is exported to Elastic. This means that Python scripts that already use <code>logging</code> need no changes to export logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-logs-apm-view.png" alt="5 - APM Logs view" /></p>
<p>For each of the log messages, you can go into the details view (click on the <code>…</code> when you hover over the log line and go into <code>View details</code>) to examine the metadata attached to the log message. You can also explore the logs in <a href="https://www.elastic.co/guide/en/kibana/8.14/discover.html">Discover</a>.</p>
<h4>Explanation of Logging Modification</h4>
<ul>
<li>
<p><strong>logging.info</strong>: This logs an informational message. The message &quot;BigQuery load operation successful&quot; is logged.</p>
</li>
<li>
<p><strong>extra=bq_fields</strong>: This adds additional context to the log entry using the <code>bq_fields</code> dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.</p>
</li>
</ul>
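<p>The mechanism behind <code>extra</code> can be shown with a small self-contained sketch: fields passed via <code>extra</code> become attributes on the <code>LogRecord</code>, which is the structured metadata the log exporter forwards. The capture handler and field values below are illustrative, not part of the pipeline described above.</p>

```python
import logging

class CaptureHandler(logging.Handler):
    """Collects records so we can inspect the attached extra fields."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
handler = CaptureHandler()
logger.addHandler(handler)

# Fields in `extra` are attached to the LogRecord as attributes
logger.info("BigQuery load operation successful",
            extra={"status": "success", "rows_affected": 42})

record = handler.records[0]
print(record.status, record.rows_affected)  # success 42
```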
<h2>Monitoring in Elastic's APM</h2>
<p>As shown, we can examine traces, metrics, and logs in the APM interface. To make the most of this data, we use nearly the whole suite of Elastic Observability features alongside Elastic’s ML capabilities.</p>
<h3>Rules and Alerts</h3>
<p>We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.</p>
<p>The <a href="https://www.elastic.co/guide/en/kibana/current/apm-alerts.html#apm-create-error-alert"><code>error count threshold</code> rule</a> is used to create a trigger when the number of errors in a service exceeds a defined threshold.</p>
<p>To create the rule go to Alerts and Insights -&gt; Rules -&gt; Create Rule -&gt; Error count threshold, set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), how often to run the check, and choose a connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/error-count-threshold.png" alt="6 - ETL Status Error Rule" /></p>
<p>Next, we create a rule of type <code>custom threshold</code> on a given ETL logs <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">data view</a> (create one for your index) filtering on &quot;labels.status: error&quot; to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count &gt; 0. In our case, in the last section of the rule config, we also set up Slack <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">alerts</a> every time the rule is activated. You can pick from a long list of <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">connectors</a> Elastic supports.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/etl-fail-status-rule.png" alt="7 - ETL Status Error Rule" /></p>
<p>Then we can set up alerts for failures. We add status to the logs metadata as shown in the code sample below for each of the steps in the ETLs. It then becomes available in ES via <code>labels.status</code>.</p>
<pre><code class="language-python">logging.info(
            &quot;Elasticsearch search operation successful&quot;,
            extra={
                &quot;event&quot;: &quot;Elasticsearch Search&quot;,
                &quot;status&quot;: &quot;success&quot;,
                &quot;category&quot;: category,
                &quot;index&quot;: index,
            },
        )
</code></pre>
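<p>The error path is the mirror image of the success log above. The following is a hedged sketch (the function name, simulated failure, and field values are illustrative) of how a failing step could be tagged so that the <code>labels.status: error</code> filter in the rule matches it:</p>

```python
import logging

logger = logging.getLogger("etl")
logging.basicConfig(level=logging.INFO)

def search_step(category, index):
    """Runs the search step and logs a status field used for alerting."""
    try:
        # Stand-in for the real Elasticsearch call, forced to fail here
        raise RuntimeError("simulated failure")
    except Exception:
        logger.error(
            "Elasticsearch search operation failed",
            extra={
                "event": "Elasticsearch Search",
                "status": "error",   # matched by the labels.status: error rule
                "category": category,
                "index": index,
            },
        )
        return "error"
    return "success"

print(search_step("daily", "logs-*"))  # error
```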
<h3>More Rules</h3>
<p>We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -&gt; Alerts and rules -&gt; Custom threshold rule -&gt; Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency.png" alt="8 - APM Custom Threshold - Latency" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency_2.png" alt="9 - APM Custom Threshold - Config" /></p>
<p>Alternatively, for finer-grained control, you can go with Alerts and rules -&gt; Anomaly rule, set up an anomaly job, and pick a threshold severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_anomaly_rule_config.png" alt="10 - APM Anomaly Rule - Config" /></p>
<h3>Anomaly detection job</h3>
<p>In this example, we set up an <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">anomaly detection job</a> on the number of documents before the transform, using a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">single metric job</a> to detect any anomalies in the incoming data source.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/single-metrics.png" alt="11 - Single Metrics" /></p>
<p>In the last step, you can set up alerting similar to what we did before by choosing a severity level threshold: every detected anomaly is assigned an anomaly score, which maps to a severity level, and you will receive an alert whenever an anomaly reaches the chosen severity.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-1.png" alt="12 - Anomaly detection Alerting - Severity" /></p>
<p>Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-connectors.png" alt="13 - Anomaly detection Alerting - Connectors" /></p>
<p>You can add the job's results to a custom dashboard via Add Panel -&gt; ML -&gt; Anomaly Swim Lane -&gt; Pick your job.</p>
<p>We also add jobs for the number of documents after the transform, and a multi-metric job on <code>execution_time_ms</code>, <code>bytes_processed</code>, and <code>rows_affected</code>, as done in <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>.</p>
<h2>Custom Dashboard</h2>
<p>Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on <code>labels.event</code> (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/custom_dashboard.png" alt="14 - Custom Dashboard" /></p>
<h2>Conclusion</h2>
<p>Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:</p>
<ul>
<li>Export existing logs (no need to add custom logging) alongside their execution context</li>
<li>Monitor the runtime behavior of our models</li>
<li>Track data quality issues</li>
<li>Identify and troubleshoot real-time incidents</li>
<li>Optimize performance bottlenecks and resource usage</li>
<li>Identify dependencies on other services and their latency</li>
<li>Optimize data transformation processes</li>
<li>Set up alerts on latency, data quality issues, transaction error rates, or CPU usage</li>
</ul>
<p>With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to more robust and accurate BI systems and reporting.</p>
<p>In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/main_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitoring Android applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitoring-android-applications-apm</link>
            <guid isPermaLink="false">monitoring-android-applications-apm</guid>
            <pubDate>Tue, 21 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has launched its APM agent for Android applications, allowing developers to track key aspects of applications to help troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interactions.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><strong>WARNING</strong>: This article shows information about the Android agent that is no longer accurate for versions <code>1.x</code>. Please refer to <a href="https://www.elastic.co/docs/reference/apm/agents/android">its documentation</a> to learn about its new APIs.</p>
</blockquote>
<p>People are handling more and more matters on their smartphones through mobile apps both privately and professionally. With thousands or even millions of users, ensuring great <a href="https://www.elastic.co/observability/application-performance-monitoring">application performance</a> and reliability is a key challenge for providers and operators of mobile apps and related backend services. Understanding the behavior of mobile apps, the occurrences and types of crashes, the <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">root causes of slow response times</a>, and the real user impact of backend issues is key to managing the performance of mobile apps and associated backend services.</p>
<p>Elastic has launched its application performance monitoring (<a href="https://www.elastic.co/observability/application-performance-monitoring">APM</a>) agent for Android applications, allowing developers to keep track of key aspects of their applications, from crashes and HTTP requests to screen rendering times and end-to-end distributed tracing. All of this helps troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interaction. The Elastic APM Android Agent automatically instruments your application and its dependencies so that you can simply “plug-and-play” the agent into your application without having to worry about changing your codebase much.</p>
<p>The Elastic APM Android Agent has been developed from scratch on top of OpenTelemetry, an open standard and framework for observability. Developers will be able to take full advantage of its capabilities, as well as the support provided by a huge and active community. If you’re familiar with OpenTelemetry and your application is already instrumented with OpenTelemetry, then you can simply reuse it all when switching to the Elastic APM Android Agent. But no worries if that’s not the case — the agent is configured to handle common traceable scenarios automatically without having to deep dive into the specifics of the OpenTelemetry API.</p>
<p>[Related article: <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a>]</p>
<h2>How it works</h2>
<p>The Elastic APM Android Agent is a combination of an SDK plus a Gradle plugin. The SDK contains utilities that will let you initialize and configure the agent’s behavior, as well as prepare and initialize the OpenTelemetry SDK. You can use the SDK for programmatic configuration and initialization of the agent, in particular for advanced and special use cases.</p>
<p>In most cases, a programmatic configuration and initialization won’t be necessary. Instead, you can use the provided Gradle plugin to configure the agent and automatically instrument your app. The Gradle plugin uses Byte Buddy and the official Android Gradle plugin API under the hood to automatically inject instrumentation code into your app through compile-time transformation of your application’s and its dependencies’ classes.</p>
<p>Compiling your app with the Elastic Android APM Agent Gradle Plugin configured and enabled will make your Android app report tracing data, metrics, and different events and logs at runtime.</p>
<h2>Using the Elastic APM Agent in an Android app</h2>
<p>Using a <a href="https://github.com/elastic/sample-app-android-apm">simple demo application</a>, we’ll walk through the steps in the “<a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">Set up the Agent</a>” guide to set up the Elastic Android APM Agent.</p>
<h3>Prerequisites</h3>
<p>For this example, you will need the following:</p>
<ul>
<li>An Elastic Stack with APM enabled (We recommend using Elastic’s Cloud offering. <a href="https://www.elastic.co/cloud/elasticsearch-service/signup?baymax=docs-body&amp;elektra=docs">Try it for free</a>.)</li>
<li>Java 11+</li>
<li><a href="https://developer.android.com/studio?gclid=Cj0KCQiAic6eBhCoARIsANlox87QsDnyjpKObQSivZz6DHMLTiL76CmqZGXTEqf4L7h3jQO7ljm8B14aAo4xEALw_wcB&amp;gclsrc=aw.ds">Android Studio</a></li>
<li><a href="https://developer.android.com/studio/run/emulator">Android Emulator, AVD device</a></li>
</ul>
<p>You’ll also need a way to push the app’s <a href="https://opentelemetry.io/docs/concepts/signals/">signals</a> into Elastic. For that, you will need Elastic APM’s <a href="https://www.elastic.co/guide/en/apm/guide/current/secret-token.html#create-secret-token">secret token</a>, which you’ll configure in the sample app later.</p>
<h3>Test project for our example</h3>
<p>To showcase an end-to-end scenario including distributed tracing, in this example, we’ll instrument a <a href="https://github.com/elastic/sample-app-android-apm">simple weather application</a> that comprises two Android UI fragments and a simple local backend service based on Spring Boot.</p>
<p>The first fragment will have a dropdown list with some city names and also a button that takes you to the second one, where you’ll see the selected city’s current temperature. If you pick a non-European city on the first screen, you’ll get an error from the (local) backend when you head to the second screen. This is to demonstrate how network and backend errors are captured and correlated in Elastic APM.</p>
<h3>Applying the Elastic APM Agent plugin</h3>
<p>In the following, we will explain <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">all the steps required to set up the Elastic APM Android Agent</a> from scratch for an Android application. In case you want to skip these instructions and see the agent in action right away, use the main branch of that repo and apply only Step (3.b) before continuing with the next Section (“Setting up the local backend service”).</p>
<ol>
<li>Clone the <a href="https://github.com/elastic/sample-app-android-apm">sample app</a> repo and open it in Android Studio.</li>
<li>Switch to the uninstrumented repo branch to start from a blank, uninstrumented Android application. You can run this command to switch to the uninstrumented branch:</li>
</ol>
<pre><code class="language-bash">git checkout uninstrumented
</code></pre>
<ol start="3">
<li>Follow the Elastic APM Android Agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">setup guide</a>:</li>
</ol>
<p>Add the <code>co.elastic.apm.android</code> plugin to the app/build.gradle file (please make sure to use the latest available version of the plugin, which you can find <a href="https://plugins.gradle.org/plugin/co.elastic.apm.android">here</a>).</p>
<p>Configure the agent’s connection to the Elastic APM backend by providing the <code>serverUrl</code> and <code>secretToken</code> in the <code>elasticApm</code> section of the app/build.gradle file.</p>
<pre><code class="language-java">// Android app's build.gradle file
plugins {
    //...
    id &quot;co.elastic.apm.android&quot; version &quot;[latest_version]&quot;
}

//...

elasticApm {
    // Minimal configuration
    serverUrl = &quot;https://your.elastic.apm.endpoint&quot;

    // Optional
    serviceName = &quot;weather-sample-app&quot;
    serviceVersion = &quot;0.0.1&quot;
    secretToken = &quot;your Elastic APM secret token&quot;
}
</code></pre>
<ol start="4">
<li>The only actual code change required is a one-liner to initialize the Elastic APM Android Agent in the Application.onCreate method. The application class for this sample app is located at app/src/main/java/co/elastic/apm/android/sample/MyApp.kt.</li>
</ol>
<pre><code class="language-kotlin">package co.elastic.apm.android.sample

import android.app.Application
import co.elastic.apm.android.sdk.ElasticApmAgent

class MyApp : Application() {

    override fun onCreate() {
        super.onCreate()
        ElasticApmAgent.initialize(this)
    }
}
</code></pre>
<p>Bear in mind that for this example, we’re not changing the agent’s default configuration — if you want more information about how to do so, take a look at the agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/configuration.html#_runtime_configuration">runtime configuration guide</a>.</p>
<p>Before launching our Android Weather App, we need to configure and start the local weather-backend service as described in the next section.</p>
<h3>Setting up the local backend service</h3>
<p>One of the key features the agent provides is distributed tracing, which allows you to see the full end-to-end story of an HTTP transaction, starting from our mobile app and traversing instrumented backend services used by the app. Elastic APM will show you the full picture as one distributed trace, which comes in very handy for troubleshooting issues, especially the ones related to high latency and backend errors.</p>
<p>As part of our sample app, we’re going to launch a simple local backend service that will handle our app’s HTTP requests. The backend service is instrumented with the <a href="https://www.elastic.co/guide/en/apm/agent/java/current/index.html">Elastic APM Java agent</a> to collect and send its own APM data over to Elastic APM, allowing it to correlate the mobile interactions with the processing of the backend requests.</p>
<p>In order to configure the local server, we need to set our Elastic APM endpoint and secret token (the same used for our Android app in the previous step) into the backend/src/main/resources/elasticapm.properties file:</p>
<pre><code class="language-bash">service_name=weather-backend
application_packages=co.elastic.apm.android.sample
server_url=YOUR_ELASTIC_APM_URL
secret_token=YOUR_ELASTIC_APM_SECRET_TOKEN
</code></pre>
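<p>As a general reference (not specific to this sample, which is already wired to pick up the properties file when you run it via Gradle), the Elastic APM Java agent is typically attached at JVM startup with the <code>-javaagent</code> flag, and the same settings can be passed as system properties. The paths and values below are placeholders:</p>
<pre><code class="language-bash"># Attach the Elastic APM Java agent at JVM startup (paths and values are placeholders)
java -javaagent:/path/to/elastic-apm-agent.jar \
     -Delastic.apm.service_name=weather-backend \
     -Delastic.apm.server_url=YOUR_ELASTIC_APM_URL \
     -Delastic.apm.secret_token=YOUR_ELASTIC_APM_SECRET_TOKEN \
     -jar your-backend.jar
</code></pre>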
<h3>Launching the demo</h3>
<p>Our sample app will get automatic instrumentation for the agent’s currently <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>, which means that we’ll get to see screen rendering spans as well as OkHttp requests out of the box. For frameworks not currently supported, you could apply manual instrumentation to enrich your APM data (see “Manual Instrumentation” below).</p>
<p>We are ready to launch the demo. (It is meant to be run in a local environment, using an Android emulator.) To do so, we need to:</p>
<ol>
<li>Launch the backend service by running <code>./gradlew bootRun</code> (or <code>gradlew.bat bootRun</code> on Windows) in a terminal in the root directory of the sample project. Alternatively, you can start the backend service from Android Studio.</li>
<li>Launch the weather sample app in an Android emulator (from Android Studio).</li>
</ol>
<p>Once everything is running, we need to navigate around in the app to generate some load that we can observe in Elastic APM. So, select a city, click <strong>Next</strong>, and repeat multiple times. Also make sure to select <strong>New York</strong> at least once; you will see that the weather forecast won’t work for that city. Below, we will use Elastic APM to find out what’s going wrong when New York is selected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-android-apm-city-selection.png" alt="apm android city selection" /></p>
<h2>First glance at the APM results</h2>
<p>Let’s open Kibana and navigate to the Observability solution.</p>
<p>Under the Services navigation item, you should see a list of two services: our Android app <strong>weather-sample-app</strong> and the corresponding backend service <strong>weather-backend</strong>. Click on the <strong>Service map</strong> tab to see a visualization of the dependencies between those services and any external services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-services.png" alt="apm android services" /></p>
<p>Click on <strong>weather-sample-app</strong> to dive into the dashboard for the Android app. The service view for mobile applications is in technical preview as of the publishing of this blog post, but you can already see insightful information about the app on that screen: the number of active sessions in the selected time frame, the number of HTTP requests emitted by the weather-sample-app, the geographical distribution of requests, and breakdowns by device model, OS version, network connection type, and app version. (Information on crashes and app load times is under development.)</p>
<p>For the purpose of demonstration, we kept this demo simple, so the data is less diversified and rather limited. However, this kind of data is particularly useful when you are monitoring a mobile app with higher usage numbers and more diversity in device models, OS versions, etc. Troubleshooting problems and performance issues becomes much easier when you can use these properties to filter and group your APM data. You can use the quick filters at the top to do so and see how the metrics adapt depending on your selection.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-sample-app.png" alt="apm android weather sample app" /></p>
<p>Now, let’s see how individual user interactions are processed, including downstream calls into the backend service. Under the Transactions tab (at the top), we see the different end-to-end transaction groups, including the two transactions for the FirstFragment and the SecondFragment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-latency-distribution.png" alt="apm android latency distribution" /></p>
<p>Let’s deep dive into the SecondFragment - View appearing transaction, to see the metrics (e.g., latency, throughput) for this transaction group and also the invocation waterfall view for the individual user interactions. As we can see in the following screenshot, after view creation, the fragment performs an HTTP GET request to 10.0.2.2, which takes ~130 milliseconds. In the same waterfall, we see that the HTTP call is processed by the weather-backend service, which itself conducts an HTTP call to api.open-meteo.com.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-trace-samples.png" alt="apm android trace samples" /></p>
<p>Now, when looking at the waterfall view for a request where New York was selected as the city, we see an error happening on the backend service that explains why the forecast didn’t work for New York. By clicking on the red <strong>View related error</strong> badge, you will get details on the error and the actual root cause of the problem.</p>
<p>The exception message on the weather-backend states that “This service can only retrieve geo locations for European cities!” That’s the problem with selecting New York as the city.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-backend.png" alt="apm android weather backend" /></p>
<h2>Manual instrumentation</h2>
<p>As previously mentioned, the Elastic APM Android Agent performs a good deal of automatic instrumentation on your behalf for the <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>; however, in some cases you might want extra instrumentation for your app’s specific use cases. For those cases, the OpenTelemetry API, which the Elastic APM Android Agent is built on, has you covered. The OpenTelemetry Java SDK contains tools to create custom spans, metrics, and logs. Since it is the base of the Elastic APM Android Agent, it is available for you to use without adding any extra dependencies to your project, and without any configuration to connect your custom signals to your Elastic environment, as the agent does that for you.</p>
<p>The way to start would be by getting OpenTelemetry’s instance like so:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
</code></pre>
<p>And then you can follow the instructions from the <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#acquiring-a-tracer">OpenTelemetry Java documentation</a> in order to create your custom signals. See the following example for the creation of a custom span:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
Tracer tracer = openTelemetry.getTracer(&quot;instrumentation-library-name&quot;, &quot;1.0.0&quot;);
Span span = tracer.spanBuilder(&quot;my span&quot;).startSpan();

// Make the span the current span
try (Scope ss = span.makeCurrent()) {
  // In this scope, the span is the current/active span
} finally {
    span.end();
}
</code></pre>
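<p>Custom metrics work the same way through the OpenTelemetry Metrics API. As a sketch (the meter and counter names below are illustrative, not part of the sample app), a counter can be created and incremented like this:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
Meter meter = openTelemetry.getMeter("instrumentation-library-name");

// A monotonic counter, e.g. for counting button taps in the app
LongCounter taps = meter.counterBuilder("ui.button.taps")
    .setDescription("Number of button taps")
    .setUnit("1")
    .build();

taps.add(1);
</code></pre>
<p>As with custom spans, the agent takes care of shipping these signals to your Elastic environment.</p>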
<h2>Conclusion</h2>
<p>In this blog post, we demonstrated how you can use the Elastic APM Android Agent to achieve end-to-end observability into your Android-based mobile applications. Setting up the agent is a matter of a few minutes and the provided insights allow you to analyze your app’s performance and its dependencies on backend services. With the Elastic APM Android Agent in place, you can leverage Elastic’s rich APM feature as well as the various possibilities to customize your analysis workflows through custom instrumentation and custom dashboards.</p>
<p>Are you curious? Then try it yourself. Sign up for a <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">free trial on the Elastic Cloud</a>, enrich your Android app with the Elastic APM Android agent as described in this blog, and explore the data in <a href="https://www.elastic.co/observability">Elastic’s Observability solution</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/illustration-indusrty-technology-social-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitoring Proxmox VE deployments with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitoring-proxmox-ve-with-elastic</link>
            <guid isPermaLink="false">monitoring-proxmox-ve-with-elastic</guid>
            <pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Monitoring Proxmox VE deployments, VMs, and Linux Containers with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>In this blog post, you will learn how to leverage Elastic Observability to monitor Proxmox VE and the software running on top of it, both in the form of Linux Containers (LXCs) and Virtual Machines (VMs).</p>
<h2>Why use Elastic Observability with Proxmox?</h2>
<p>Here at Elastic, we are passionate about efficiently managing and monitoring infrastructure and applications. Many of us have fun playing with home labs, oftentimes running Proxmox VE, a powerful open-source virtualization platform used to run virtual machines and Linux Containers (LXCs) with ease. While Proxmox provides robust tools for managing virtualized resources, gaining deep insights into the performance and health of your LXCs, VMs, and hosts requires a comprehensive monitoring solution. This blog post will guide you through leveraging the power of Elastic Observability, in conjunction with Elastic Agent, to effectively monitor your Proxmox VE deployment, ensuring optimal performance and proactive issue resolution thanks to Kibana Alerts.</p>
<h2>The homelab setup</h2>
<p>Our homelab setup centers around an Intel N100 mini PC, serving as the host for Proxmox VE. This setup is simple and minimal, yet effective for showcasing a few interesting capabilities. On top of this mini PC, we run several Linux Containers (LXCs) for various services, along with a dedicated virtual machine for Home Assistant.</p>
<h2>Elastic Agent installation and configuration</h2>
<p>Before beginning, it is worth noting that there are numerous ways to install and configure the Elastic Agent. For the sake of simplicity, we will showcase a setup in which only one instance of the Elastic Agent is running on the host machine. The Elastic Agent reports to an Elastic Cloud Observability deployment and is managed via Fleet, which makes it tremendously easy to upgrade and re-configure it whenever needed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/fleet-prox.jpg" alt="The Elastic Integrations enabled for our Proxmox host" /></p>
<h2>Diving into the host</h2>
<p>Kibana offers various panes that make it easy to assess a system's health at a glance.</p>
<p>As a first step, let's take a look at the <code>Infrastructure &gt; Hosts</code> page in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/kibana-infrastructure-hosts-proxmox.jpg" alt="The Infrastructure &gt; Hosts Kibana page for our Proxmox host" /></p>
<p>Here we can see various information about our Proxmox VE host (i.e. the mini PC). The top processes running on it are presented, including processes running in LXCs such as <code>pia-daemon</code>. We can also see a <code>kvm</code> process, specifically running a Home Assistant virtual machine, and a Proxmox <code>pve-firewall</code> process.</p>
<p>Let's now take a look at <code>Universal Profiling &gt; Flamegraph</code>. This graph shows how much CPU time is consumed by different stack traces from processes running on the host system. You can drill down into specific processes using the search bar at the top. For instance, you can filter by <code>kvm</code> to only see information regarding this specific process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/universal-profiling-flamegraph-kvm.jpg" alt="The Universal Profiling &gt; Flamegraph Kibana page for our Proxmox host" /></p>
<h2>The Observability AI Assistant</h2>
<p>All the Kibana panes we visited so far have proved to be highly interesting, but they struggle to answer urgent questions such as:</p>
<ul>
<li>did anything happen in our mini PC recently?</li>
<li>was there any significant change in functionality?</li>
<li>is there any precious information hidden among the thousands of data points collected?</li>
</ul>
<p>The Elastic Observability AI Assistant helps us by answering these questions in natural language. By default, on Elastic Cloud, it uses the Elastic-managed LLM connector, which means users do not need to configure anything to get started with it. It just works!</p>
<p>Let's go to the <code>Observability &gt; AI Assistant</code> pane in Kibana and let's try to ask a generic prompt such as: &quot;please give me an overview of the health of my <code>prox</code> host&quot;.</p>
<p>Let's then wait a minute so that it can dig into the data... et voilà, a wealth of relevant information arrives in the form of graphs and natural-language explanations. The Observability AI Assistant understood our question, went through all the data for our Proxmox host, ran analytics on it, and reported back in a matter of seconds!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/observability-ai-assistant-1.jpg" alt="The Observability AI Assistant's first reply" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/observability-ai-assistant-2.jpg" alt="The Observability AI Assistant's second reply" /></p>
<h2>Alerting upon disruption with Kibana Alerts</h2>
<p>As a final step, let's try to define a Kibana Alert to help us understand whether our host is overloaded. Let's head to <code>Observability &gt; Alerts &gt; Rules</code> and create a new rule. We will create a Custom Threshold rule that will fire if CPU usage for the host is higher than 80% on average for the last 15 minutes. Kibana will send us an email in case the rule fires. The rule is also configured to fire if no data appears for the last 15 minutes, which is extremely helpful as it would imply the presence of some issues to be debugged: broken network or no electricity in the house, a faulty Agent deployment, or even a hardware issue with the mini PC.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/rule-cpu-over-80.jpg" alt="The Kibana Alerting Rule for CPU being over 80 percent" /></p>
<h2>Conclusion</h2>
<p>In this blog post we showcased how to effectively use the Elastic Stack to monitor Proxmox VE deployments. If you would like to try out such a setup first-hand, you are more than welcome to enjoy <a href="https://www.elastic.co/cloud/cloud-trial-overview">Elastic Cloud's 14-day free trial</a>.</p>
<p>In future blog posts, we will investigate how to dig deeper into LXCs and VMs to gather even more information from our home lab and create more tailored alerts. Stay tuned!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/article-image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Native OpenTelemetry support in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/native-opentelemetry-support-in-elastic-observability</link>
            <guid isPermaLink="false">native-opentelemetry-support-in-elastic-observability</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic offers native support for OpenTelemetry by allowing for direct ingest of OpenTelemetry traces, metrics, and logs without conversion, and applying any Elastic feature against OTel data without degradation in capabilities.]]></description>
            <content:encoded><![CDATA[<p>NOTE: Since writing this blog, new OTel data ingest configurations are now available in Elastic. See recent <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a></p>
<p>OpenTelemetry is more than just becoming the open ingestion standard for observability. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data providing a de-facto standard for observability.</p>
<p>Elastic<sup>®</sup> is strategically standardizing on OpenTelemetry as the main data collection architecture for observability and security. Additionally, Elastic is committed to helping OpenTelemetry become the best de facto data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OpenTelemetry beyond the recent contribution of Elastic Common Schema (ECS) to OpenTelemetry (OTel).</p>
<p>Elastic has supported OpenTelemetry natively since Elastic 7.14, directly ingesting OpenTelemetry protocol (OTLP) traces, metrics, and logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-1-otel-config-options.png" alt="otel configuration options" /></p>
<p>In this blog, we’ll review the current OpenTelemetry support provided by Elastic, which includes the following:</p>
<ul>
<li><a href="#ingesting-opentelemetry-into-elastic"><strong>Easy ingest of distributed tracing and metrics</strong></a> for applications configured with OpenTelemetry agents for Python, NodeJS, Java, Go, and .NET</li>
<li><a href="#opentelemetry-logs-in-elastic"><strong>OpenTelemetry logs instrumentation and ingest</strong></a> using various configurations</li>
<li><a href="#opentelemetry-is-elastics-preferred-schema"><strong>Open semantic conventions</strong></a> for logs and more through ECS, which is not part of OpenTelemetry</li>
<li><a href="#elastic-observability-apm-and-machine-learning-capabilities"><strong>Machine learning based AIOps capabilities</strong></a>, such as latency correlations, failure correlations, anomaly detection, log spike analysis, predictive pattern analysis, Elastic AI Assistant support, and more, all apply to native OTLP telemetry.</li>
<li><a href="#elastic-allows-you-to-migrate-to-otel-on-your-schedule"><strong>Migrate applications to OpenTelemetry at your own speed</strong></a>. Elastic’s APM capabilities all work seamlessly even with a mix of services using OpenTelemetry and/or Elastic APM agents. You can even combine OpenTelemetry instrumentation with Elastic Agent.</li>
<li><a href="#integrated-kubernetes-and-opentelemetry-views-in-elastic"><strong>Integrated views and analysis with Kubernetes clusters</strong></a>, which most OpenTelemetry applications are running on. Elastic can highlight specific pods and containers related to each service when analyzing issues for applications based on OpenTelemetry.</li>
</ul>
<h2>Ingesting OpenTelemetry into Elastic</h2>
<p>If you’re interested in seeing how simple it is to ingest OpenTelemetry traces and metrics into Elastic, follow the steps outlined in this blog.</p>
<p>Let’s outline what Elastic provides for ingesting OpenTelemetry data. Here are all your options:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-2-flowchart.png" alt="flowchart" /></p>
<h3>Using the OpenTelemetry Collector</h3>
<p>When using the OpenTelemetry Collector, which is the most common configuration option, you simply have to add two key variables.</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file specified in the elastic/opentelemetry-demo repo configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<p>OTEL_EXPORTER_OTLP_ENDPOINT is Elastic’s APM Server<br />
OTEL_EXPORTER_OTLP_HEADERS Elastic Authorization</p>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic Cloud.</p>
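<p>In collector terms, these two values typically translate into an OTLP exporter pointing at the APM Server. A minimal sketch of that exporter section (the endpoint and token are placeholders) looks like this:</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    # Maps to OTEL_EXPORTER_OTLP_ENDPOINT
    endpoint: "https://your-apm-server:8200"
    headers:
      # Maps to OTEL_EXPORTER_OTLP_HEADERS
      Authorization: "Bearer your-secret-token"
</code></pre>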
<h3>Native OpenTelemetry agents embedded in code</h3>
<p>If you are thinking of using OpenTelemetry libraries in your code, you can simply point the service to Elastic’s APM server, because it supports the native OTLP protocol. No special Elastic conversion is needed.</p>
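<p>For example, an OTLP-emitting service can usually be pointed at Elastic purely through the standard OpenTelemetry environment variables (the endpoint, token, and service name below are placeholders):</p>
<pre><code class="language-bash"># Point OTLP export at Elastic's APM Server (placeholder values)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-apm-server:8200"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer your-secret-token"
export OTEL_RESOURCE_ATTRIBUTES="service.name=my-service,deployment.environment=dev"
</code></pre>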
<p>To demonstrate this effectively and provide some education on how to use OpenTelemetry, we have two applications you can use to learn from:</p>
<ul>
<li><a href="https://github.com/elastic/opentelemetry-demo">Elastic’s version of OpenTelemetry demo</a>: As with all the other observability vendors, we have our own forked version of the OpenTelemetry demo.</li>
<li><a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix:</a> This demo application is an example to help you learn how to instrument on various languages and telemetry signals.</li>
</ul>
<p>Check out our blogs on using the Elastiflix application and instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<p>We have created YouTube videos on these topics as well:</p>
<ul>
<li><a href="https://youtu.be/wMXMRsjFg-8?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 1)</a></li>
<li><a href="https://youtu.be/PX7s6RRLGaU?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 2)</a></li>
<li><a href="https://youtu.be/hXTlV_RnELc?feature=shared">Custom Java Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/E8g9u_uOFO4?feature=shared">Elastic APM - Automatic .NET Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/7J9M2JsHwRE?feature=shared">How to Manually Instrument .NET Applications with OpenTelemetry</a></li>
</ul>
<p>Given Elastic and OpenTelemetry’s vast user base, these provide a rich source of education for anyone trying to learn the intricacies of instrumenting with OpenTelemetry.</p>
<h3>Elastic Agents supporting OpenTelemetry</h3>
<p>If you’ve already deployed Elastic APM agents, you can still use them with OpenTelemetry. <a href="https://www.elastic.co/blog/opentelemetry-instrumentation-elastic-apm-agent-features">Elastic APM agents today are able to ship OpenTelemetry</a> spans as part of a trace. This means that if any component in your application emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM agent captures.</p>
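<p>For illustration, a component emitting a span through the vendor-neutral OpenTelemetry API might look like the sketch below (the tracer name and attribute are hypothetical); with an Elastic APM agent or an OTel SDK active, this span becomes part of the captured trace:</p>
<pre><code class="language-python"># Sketch: emitting a span via the OpenTelemetry API.
# 'checkout' and 'order.id' are illustrative names only.
from opentelemetry import trace

tracer = trace.get_tracer('checkout')

with tracer.start_as_current_span('charge-card') as span:
    span.set_attribute('order.id', '12345')
    # ... business logic runs inside the span ...
</code></pre>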
<h2>OpenTelemetry logs in Elastic</h2>
<p>If you look at the OpenTelemetry documentation, you will see that many language libraries are still experimental or not yet implemented; Java’s logging support is stable, per the documentation. Depending on your service’s language and your appetite for adventure, several options exist for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>In a previous blog, we discussed <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 different configurations to properly get logging data into Elastic for Java</a>. The blog explores the current state of the art of OpenTelemetry logging and provides guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for slf4j key-value pairs (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via OTel baggage</li>
<li>Use of an Elastic Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<p>Three models, which are covered in the blog, currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ul>
<li>Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards them to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards them to Elastic via an Elastic-defined protocol</li>
</ul>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
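<p>As an illustration of model (2), a minimal OpenTelemetry Collector configuration might look like the following sketch — the log path, endpoint, and authorization header are placeholders, not a definitive setup:</p>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [ /var/log/app/*.log ]   # placeholder path to your service logs

exporters:
  otlp:
    endpoint: &quot;https://your-elastic-endpoint:443&quot;   # placeholder
    headers:
      Authorization: &quot;Bearer your-token&quot;            # placeholder credential

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
</code></pre>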
<h2>OpenTelemetry is Elastic’s preferred schema</h2>
<p>Elastic recently contributed the <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">Elastic Common Schema (ECS) to the OpenTelemetry (OTel)</a> project, enabling a unified data specification for security and observability data within the OTel Semantic Conventions framework.</p>
<p>ECS, an open source specification, was developed with support from the Elastic user community to define a common set of fields to be used when storing event data in Elasticsearch<sup>®</sup>. ECS helps reduce management and storage costs stemming from data duplication, improving operational efficiency.</p>
<p>Similarly, OTel’s Semantic Conventions (SemConv) also specify common names for various kinds of operations and data. The benefit of using OTel SemConv is in following a common naming scheme that can be standardized across a codebase, libraries, and platforms for OTel users.</p>
<p>The merging of ECS and OTel SemConv will help advance OTel’s adoption and the continued evolution and convergence of observability and security domains.</p>
<h2>Elastic Observability APM and machine learning capabilities</h2>
<p>All of Elastic Observability’s APM capabilities are available with OTel data (read more on this in our blog, <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry</a>):</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-3-services.png" alt="services" /></p>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to reduce analysis and alerting effort and help lower MTTR. Here are some of the ML-based AIOps capabilities we have:</p>
<ul>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Anomaly detection:</strong></a> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your OpenTelemetry data — learning trends, periodicity, and more.</li>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Log categorization:</strong></a> Elastic also identifies patterns in your OpenTelemetry log events, so that you can take action more quickly.</li>
<li><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log spike detector</strong></a> helps identify reasons for increases in OpenTelemetry log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log pattern analysis</strong></a> helps you find patterns in unstructured log messages and makes it easier to examine your data.</li>
</ul>
<h2>Elastic allows you to migrate to OTel on your schedule</h2>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">statuses of its major functional components</a> — metrics, traces, and logs — vary by language. Thus, applications written in Java, Python, and JavaScript are good candidates to migrate first, as their metrics, traces, and logs (for Java) are stable.</p>
<p>For languages that are not yet fully supported, you can instrument services with Elastic agents, running your <a href="https://www.elastic.co/observability">full stack observability platform</a> in mixed mode (Elastic agents alongside OpenTelemetry agents).</p>
<p>Here is a simple example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-4-services2.png" alt="services 2" /></p>
<p>The above shows a simple variation of our standard Elastic Agent application with one service flipped to OTel — the newsletter-otel service. Each of the remaining services can be converted to OTel as development resources allow.</p>
<p>Hence you can migrate to OpenTelemetry with Elastic on your own schedule, converting services as the relevant language support reaches a stable state.</p>
<h2>Integrated Kubernetes and OpenTelemetry views in Elastic</h2>
<p>Elastic monitors your Kubernetes cluster using the Elastic Agent, which you can deploy on the same Kubernetes cluster where your OpenTelemetry application is running. You can therefore use OpenTelemetry for your application while Elastic also monitors the corresponding Kubernetes cluster.</p>
<p>There are two configurations for Kubernetes:</p>
<p><strong>1. Simply deploying the Elastic Agent DaemonSet on the Kubernetes cluster.</strong> We outline this in the article <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Managing your Kubernetes cluster with Elastic Observability</a>. This configuration pushes just the Kubernetes metrics and logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-5-cloud-nodes.png" alt="elastic cloud nodes" /></p>
<p><strong>2. Deploying the Elastic Agent with not only the Kubernetes DaemonSet, but also Elastic’s APM integration, the Defend (Security) integration, and the Network Packet Capture integration</strong> to provide more comprehensive Kubernetes cluster observability. We outline this configuration in the article <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-6-flowhcart.png" alt="flowchart" /></p>
<p>Both <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> examples use the OpenTelemetry demo, and in Elastic, we tie the Kubernetes information to the application so you can see Kubernetes details from your traces in APM. This provides a more integrated troubleshooting experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-7-pod-deets.png" alt="pod details" /></p>
<h2>Summary</h2>
<p>In essence, Elastic's commitment goes beyond mere support for OpenTelemetry. We are dedicated to ensuring our customers not only adopt OpenTelemetry but thrive with it. Through our solutions, expertise, and resources, we aim to elevate the observability journey for every business, turning data into actionable insights that drive growth and innovation.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/ecs-otel-announcement-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Network monitoring with Elastic: Unifying network observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/network-monitoring-with-elastic-unifying-network-observability</link>
            <guid isPermaLink="false">network-monitoring-with-elastic-unifying-network-observability</guid>
            <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to unify network monitoring using Elastic observability and AI. We'll showcase how to correlate network data, identify root causes and fix issues.]]></description>
            <content:encoded><![CDATA[<h2>Introduction: The Network Monitoring Fragmentation Problem</h2>
<p>In five years working with Enterprise accounts at Elastic, I have heard the same challenge again and again:</p>
<p><strong>&quot;We have several network monitoring tools, and we would love to correlate all of them into one platform.&quot;</strong></p>
<p>For many organizations, the barrier to true correlation isn't a lack of data, but where that data lives. Frequently, we see SNMP metrics, flow data, and logs isolated in purpose-built silos or dashboards. Without a unified data store and a proper correlation engine, piecing together the full narrative — from a topology change to a performance degradation — becomes a manual, time-consuming puzzle.</p>
<p>When an incident happens, engineers become <strong>human correlation engines</strong> — manually jumping between systems, copying timestamps, cross-referencing device names, and trying to piece together what actually happened. A simple question like &quot;Did this interface failure impact application performance?&quot; requires querying multiple tools and mentally correlating the results.</p>
<p>The real cost isn't the tool licenses — it's the time lost during critical incidents.</p>
<p>This lab is my answer to a fundamental question: <strong>Can Elastic become the unified foundation that actually correlates network data?</strong></p>
<p>More importantly, it demonstrates that Elastic is fully ready for network operations — capable of ingesting diverse telemetry and using AI to correlate relationships, identify root causes, and resolve issues in seconds instead of hours.</p>
<h2>The Problem: Network Observability is Broken</h2>
<p>Let me paint a typical scenario I encounter with enterprise network teams:</p>
<p><strong>The Fragmented Reality:</strong></p>
<ul>
<li>No single source of truth</li>
<li>Manual correlation during incidents (15-30 minutes per event)</li>
<li>Fragmented teams (network vs. platform engineers)</li>
<li>Limited automation capabilities</li>
<li>No AI-powered analysis</li>
</ul>
<p><strong>When a link goes down at 2 AM:</strong></p>
<ul>
<li>Notice the alert - 2 minutes</li>
<li>Log into monitoring tool to see the metric - 3 minutes</li>
<li>Switch to traffic analyzer to check impact - 5 minutes</li>
<li>Open log management to search for related messages - 10 minutes</li>
<li>Manually correlate timestamps across systems - 8 minutes</li>
<li>Create a ticket and copy context from multiple tools - 8 minutes</li>
</ul>
<p><strong>Time to initial diagnosis: 36 minutes</strong></p>
<p>This workflow is expensive, error-prone, and doesn't scale.</p>
<h2>The Vision: Elastic as a Unified Network Observability Platform</h2>
<p>What if you could:</p>
<ul>
<li>Collect SNMP metrics, NetFlow, traps, and topology data in <strong>one platform</strong></li>
<li>Correlate network events with application performance <strong>automatically</strong></li>
<li>Generate executive dashboards without separate BI tools</li>
<li>Use <strong>AI to analyze incidents in seconds</strong>, not hours</li>
<li>Trigger alerting from network events</li>
</ul>
<p>This is what this lab aims to demonstrate.</p>
<h2>What I Built: A Production-Grade Network Simulation</h2>
<p>To demonstrate how Elastic unifies network data, I needed a realistic environment that generates real-world telemetry. Enter <strong>Containerlab</strong> — a Docker-based tool that lets us build a realistic network simulation.</p>
<h3>Lab Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/network-monitoring-with-elastic-unifying-network-observability/lab-topology.jpg" alt="Lab Topology" /></p>
<p>I simulated a Service Provider core network with:</p>
<ul>
<li><strong>7 FRR routers</strong> forming an OSPF Area 0 mesh</li>
<li><strong>2 Ubuntu hosts</strong> for additional use cases</li>
<li><strong>2 Layer 2 switches</strong> for access layer segmentation</li>
<li><strong>3 telemetry collectors</strong> feeding Elastic Cloud</li>
</ul>
<p><strong>Total containers:</strong> 14</p>
<p><strong>Deployment time:</strong> 12-15 minutes (fully automated)</p>
<p><strong>Full deployment instructions and topology details are available in the <a href="https://github.com/DeBaker1974/Containerlab-OSPF">GitHub repository README</a>.</strong></p>
<h2>The Three Telemetry Pipelines: Proving Multi-Source Correlation</h2>
<p>What makes this lab production-ready is its <strong>hybrid observability approach</strong> — proving that Elastic can unify disparate network data sources.</p>
<table>
<thead>
<tr>
<th align="left">Pipeline</th>
<th align="left">Data Type</th>
<th align="left">Collection Method</th>
<th align="left">Collector</th>
<th align="left">Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><strong>SNMP Metrics</strong></td>
<td align="left">Interface stats, system health, LLDP topology</td>
<td align="left">Active polling</td>
<td align="left">OTEL Collector</td>
<td align="left">Capacity planning, trend analysis</td>
</tr>
<tr>
<td align="left"><strong>NetFlow</strong></td>
<td align="left">Traffic flows</td>
<td align="left">Push-based export</td>
<td align="left">Elastic Agent</td>
<td align="left">Top talkers, security investigation</td>
</tr>
<tr>
<td align="left"><strong>SNMP Traps</strong></td>
<td align="left">Interface up/down events</td>
<td align="left">Event-driven</td>
<td align="left">Logstash</td>
<td align="left">Real-time incident detection</td>
</tr>
</tbody>
</table>
<p>This unified architecture proves Elastic can replace multiple specialized network monitoring tools with a single platform.</p>
<h2>The Power of Correlation: One Platform, One Query</h2>
<p>When a network incident occurs, you need to answer questions like:</p>
<ul>
<li>Which interface failed? <em>(SNMP metrics)</em></li>
<li>What traffic was affected? <em>(NetFlow)</em></li>
<li>What was the sequence of events? <em>(SNMP traps)</em></li>
<li>Which devices are downstream? <em>(LLDP topology)</em></li>
</ul>
<p><strong>The Problem:</strong> modern tools offer separate modules glued together, forcing users to navigate different spaces for different sets of data.</p>
<p><strong>The Reality:</strong> You still have to pivot. You see a spike in the Metrics module, but to see why, you have to open the Logs module and manually align the time picker. The data lives in different tables or backends, making true correlation impossible without human intervention.</p>
<p><strong>The Elastic Difference:</strong> One Store, One Language, One AI</p>
<p>Elastic makes it simple. Whether it's an SNMP counter (metric), a NetFlow record (flow), or a Syslog message (log), it is all stored in a unified datastore powered by the Elasticsearch engine. This allows users to easily search across multiple datasets in a single query.</p>
<pre><code class="language-bash">FROM logs-*
| WHERE host.name == &quot;csr23&quot; AND interface.name == &quot;eth1&quot;
</code></pre>
<p><strong>Time required: 3 seconds</strong></p>
<p>Furthermore, as you will see later, the exact location of the data becomes agnostic to the user when leveraging the AI Assistant.</p>
<h2>Data Transformation: From Cryptic OIDs to Actionable Intelligence</h2>
<p>Raw SNMP traps are notoriously difficult to interpret at a glance. In our current lab setup, the data arrives looking like this:</p>
<pre><code class="language-bash">OID: 1.3.6.1.6.3.1.1.5.3
ifIndex: 2
ifDescr: eth1
</code></pre>
<p>While traditional Network Management Platforms (NMPs) handle OID translation natively, bringing that clarity into Elastic requires a specific configuration.</p>
<p>In this initial lab, we are intentionally working with this raw data to demonstrate how AI assistants can interpret these events even without pre-existing context.</p>
<p>However, the strategy for the next phase of this project is to implement Elasticsearch Ingest Pipelines. This will allow us to map raw OIDs to human-readable names. This step is crucial for bridging the gap between Network tools and Application Observability platforms, allowing network events to be instantly correlated with application errors and infrastructure logs.</p>
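<p>As a preview, such a mapping can start with a simple <code>set</code> processor keyed on the trap OID. The sketch below is illustrative only (the pipeline name is hypothetical); the full pipeline will be covered in the next post:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/snmp-trap-enrich
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.snmp?.trap_oid == '1.3.6.1.6.3.1.1.5.3'&quot;,
        &quot;field&quot;: &quot;event.action&quot;,
        &quot;value&quot;: &quot;interface-down&quot;
      }
    }
  ]
}
</code></pre>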
<p><strong>The Target State</strong></p>
<p>Once the pipeline is implemented in the next lab, we will transform that raw trap into searchable, meaningful data:</p>
<pre><code class="language-bash">{
  &quot;event.action&quot;: &quot;interface-down&quot;,
  &quot;host.name&quot;: &quot;csr23&quot;,
  &quot;interface.name&quot;: &quot;eth1&quot;,
  &quot;interface.oper_status_text&quot;: &quot;Link Down&quot;
}
</code></pre>
<p><strong>The result:</strong></p>
<ul>
<li>Human-readable fields</li>
<li>Searchable dimensions for filtering</li>
<li>Context for automation rules and dashboards</li>
<li>Correlation keys for joining with metrics and flows</li>
</ul>
<p>In our next blog post, we will walk through building the ingest pipeline that performs this transformation — step by step.</p>
<h2>Intelligent Alerting: From Noise to Actionable Intelligence</h2>
<p>Traditional network monitoring relies on simple threshold alerts — &quot;interface down,&quot; &quot;high CPU.&quot; These alerts flood your inbox but provide <strong>zero context</strong> about root cause, impact, or remediation.</p>
<h3>The Lab's Approach: ES|QL + AI Assistant</h3>
<p><strong>1. Semantic Detection with ES|QL</strong></p>
<p>Instead of generic threshold alerts, the lab uses ES|QL to detect specific event patterns:</p>
<pre><code class="language-bash">FROM logs-snmp.trap-prod
| WHERE snmp.trap_oid == &quot;1.3.6.1.6.3.1.1.5.3&quot;
| KEEP @timestamp, host.name, interface.name, message
</code></pre>
<p><strong>2. Automatic AI-Powered Investigation</strong></p>
<p>When the alert triggers, it invokes the <strong>Observability AI Assistant</strong> with a structured investigation prompt that:</p>
<ul>
<li>Performs immediate triage (which device, which interface, when)</li>
<li>Assesses OSPF impact and traffic rerouting</li>
<li>Correlates with other recent failures</li>
<li>Generates severity assessment and recommended actions</li>
</ul>
<h3>The Transformation</h3>
<table>
<thead>
<tr>
<th align="center">Traditional Alerting</th>
<th align="center">Intelligent Alerting (Elastic)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><strong>Email: &quot;Interface down on csr23&quot;</strong></td>
<td align="center">Structured analysis with device context</td>
</tr>
<tr>
<td align="center"><strong>Manual investigation: 20-30 min</strong></td>
<td align="center">AI-automated investigation: 90 seconds</td>
</tr>
<tr>
<td align="center"><strong>Engineer correlates across tools</strong></td>
<td align="center">Automatic cross-source correlation</td>
</tr>
<tr>
<td align="center"><strong>No business impact assessment</strong></td>
<td align="center">Severity + recommended actions included</td>
</tr>
</tbody>
</table>
<h2>Accelerating Incident Response with the Elastic AI Assistant</h2>
<p>This is where the Elastic AI Assistant demonstrates its operational value — moving beyond passive data collection to actively interpret and explain network events in real time.</p>
<p>When an engineer views a trap document in Discover and asks:</p>
<p><em><strong>&quot;Explain this log message&quot;</strong></em></p>
<p>The AI Assistant provides comprehensive analysis including:</p>
<ul>
<li><strong>What happened:</strong> Plain-language explanation of the SNMP trap</li>
<li><strong>Device context:</strong> Router role, interface purpose, network position</li>
<li><strong>Impact analysis:</strong> OSPF neighbor status, traffic rerouting assessment</li>
<li><strong>Root cause possibilities:</strong> Physical layer, link layer, administrative causes</li>
<li><strong>Recommended actions:</strong> Immediate steps, investigation queries, validation checks</li>
<li><strong>Severity assessment:</strong> Business and technical impact rating</li>
</ul>
<h3>Manual Triage vs. AI-Assisted Investigation</h3>
<table>
<thead>
<tr>
<th align="left">Before</th>
<th align="left">After (Elastic AI)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><strong>Google the OID → 5 min</strong></td>
<td align="left">Click &quot;Explain this log&quot; → 20 seconds</td>
</tr>
<tr>
<td align="left"><strong>Open network diagram → 3 min</strong></td>
<td align="left">Topology context auto-provided</td>
</tr>
<tr>
<td align="left"><strong>Query multiple tools → 10 min</strong></td>
<td align="left">Cross-source correlation instant</td>
</tr>
<tr>
<td align="left"><strong>Assess business impact → 5 min</strong></td>
<td align="left">Impact analysis auto-generated</td>
</tr>
<tr>
<td align="left"><strong>Total: ~28 minutes</strong></td>
<td align="left"><strong>Total: ~20 seconds</strong></td>
</tr>
</tbody>
</table>
<h2>The Value Proposition: One Platform, One Data Model, One AI</h2>
<h3>What This Lab Demonstrates</h3>
<p>Elastic provides:</p>
<ul>
<li><strong>One unified platform</strong> for metrics, logs, flows</li>
<li><strong>One data model</strong> (SemConv) for consistent correlation</li>
<li><strong>One search interface</strong> (Kibana) for all network data</li>
<li><strong>One AI assistant</strong> that understands all your network telemetry</li>
<li><strong>AI-powered alerting</strong> with automated investigation</li>
</ul>
<h3>Business Impact</h3>
<p><strong>Efficiency Gains:</strong></p>
<ul>
<li><strong>85% reduction in MTTR</strong> (36 min → 5 min for initial diagnosis)</li>
<li><strong>90% reduction</strong> in manual correlation time</li>
<li>Junior engineers gain access to <strong>AI-powered expert analysis</strong></li>
</ul>
<p><strong>Operational Benefits:</strong></p>
<ul>
<li>Network engineers focus on <strong>strategy, not tool-switching</strong></li>
<li><strong>Cross-functional collaboration</strong> in one platform</li>
<li><strong>Reduced tool sprawl</strong> and management overhead</li>
</ul>
<h2>Lessons Learned</h2>
<p>After building this lab, several key insights emerged regarding how network data fits into the broader observability ecosystem:</p>
<p><strong>1. Extending Observability to the Network</strong></p>
<p>Elastic is already the gold standard for high-volume logs and application traces. This lab demonstrates that the same engine seamlessly handles network telemetry without needing a separate, siloed tool.</p>
<ul>
<li>Scale: The same architecture that ingests petabytes of application logs easily handles millions of interface counters.</li>
<li>Structure: Native support for complex nested documents allows for rich SNMP trap data (variable bindings) without flattening or losing context.</li>
<li>Speed: Real-time search applies equally to network events, enabling sub-second troubleshooting.</li>
</ul>
<p><strong>2. OpenTelemetry Semantic Conventions (SemConv) as the Universal Translator</strong></p>
<p>The power isn't just in storing the data, but in standardizing it. By mapping SNMP and NetFlow to the <strong>OpenTelemetry Semantic Conventions (SemConv)</strong>, network data finally speaks the same language as the rest of the stack.</p>
<ul>
<li><strong>Unified Search:</strong> Query across firewall logs, server metrics, and switch telemetry in a single search bar.</li>
<li><strong>Instant Visualization:</strong> Pre-built dashboards work immediately because the field names are standardized.</li>
<li><strong>Cross-Domain Correlation</strong>: Easily correlates a spike in application latency with a specific interface saturation event.</li>
</ul>
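<p>Because the field names are standardized, a single ES|QL query can pull related events from several data streams at once. A sketch (the index patterns, host name, and time window are illustrative):</p>
<pre><code class="language-bash">FROM logs-*,metrics-*
| WHERE @timestamp &gt; NOW() - 15 minutes AND host.name == &quot;csr23&quot;
| SORT @timestamp
| KEEP @timestamp, event.dataset, message
</code></pre>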
<p><strong>3. AI Assistants Thrive on Context</strong></p>
<p>While the AI in this lab was powerful on its own, the experiment highlighted a critical realization: an AI Assistant becomes exponentially more effective when coupled with a specific Knowledge Base.</p>
<p><strong>Context is King:</strong> The AI delivers better root cause analysis when provided with rich metadata, such as device roles and topology maps. Without it, the advice remains generic.</p>
<p><strong>Pro Tip (and What’s Next):</strong></p>
<p>To get organization-specific advice rather than generic suggestions, you need to feed the AI your documentation.</p>
<ul>
<li><strong>The Goal:</strong> Create a Knowledge Base containing device roles, network topology diagrams, and troubleshooting procedures.</li>
<li><strong>The Next Step:</strong> In my next blog post, I will demonstrate exactly how to do this — connecting a Knowledge Base to the AI Assistant to enable fully context-aware troubleshooting.</li>
</ul>
<h2>Conclusion: Completing the Observability Picture</h2>
<p>Elastic is already widely recognized as the standard for Application and Security observability. The goal of this lab wasn't to ask if Elastic can handle networking, but to demonstrate the immense value of bringing network data into that existing ecosystem.</p>
<p>The verdict is clear: Elastic acts as that unified foundation. It effectively breaks down the silo between Network Engineering and the rest of IT.</p>
<p>This isn't just about consolidating dashboards or replacing legacy tools. It is about establishing the Elasticsearch AI Platform as the single source of truth where network telemetry sits right alongside application and infrastructure data.</p>
<p>By treating network data as a first-class citizen in the observability stack, we unlock automated correlation, AI-assisted investigation, and the speed required to resolve incidents before they impact the business. The capabilities are in place, and the foundation is solid — Elastic is ready to unify your network with the rest of your digital business.</p>
<h2>Ready to Try It Yourself?</h2>
<p>Check out <a href="https://github.com/DeBaker1974/Containerlab-OSPF">github.com/DeBaker1974/Containerlab-OSPF</a></p>
<p>The repository includes:</p>
<ul>
<li>Complete deployment scripts (12-15 minute automated setup)</li>
<li>Pre-configured telemetry pipelines</li>
<li>Kibana dashboards</li>
<li>Alert rules with AI Assistant integration</li>
<li>Detailed README</li>
</ul>
<p><strong>Not ready to build? Try Elastic Serverless:</strong> <a href="https://cloud.elastic.co/registration">Start a free 14-day trial</a> and explore AI-powered observability with your own data.</p>
<p><strong>Special thanks to the Containerlab and FRRouting communities for their incredible open-source tools, and to Sheriff Lawal (CCIE, CISSP), Sr. Manager, Solutions Architecture at Elastic, for mentoring on this project.</strong></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/network-monitoring-with-elastic-unifying-network-observability/article-image.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Observability's new Synthetic Monitoring: Designed for seamless GitOps management and SRE-focused workflows]]></title>
            <link>https://www.elastic.co/observability-labs/blog/new-synthetic-monitoring-observability</link>
            <guid isPermaLink="false">new-synthetic-monitoring-observability</guid>
            <pubDate>Thu, 20 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability introduces Synthetic Monitoring, a GitOps management and SRE-focused workflows game-changer. This tool provides visibility into critical flows and third-party dependencies, enhancing application performance and user experience.]]></description>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of Elastic Observability's all-new Synthetic Monitoring. This powerful tool, designed for streamlined GitOps management and Site Reliability Engineers (SRE) workflows, elevates your monitoring capabilities and empowers you to transform your application's performance.</p>
<p>As you read through the next few sections, you can also look at these additional resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/improve-business-outcomes-and-observability-with-synthetic-monitoring">On-demand webinar: Getting started with synthetic monitoring on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/uniting-testing-and-monitoring-with-synthetic-monitoring">How to create a CI/CD pipeline with GitHub actions and Elastic synthetic monitoring tests</a></li>
<li><a href="https://www.elastic.co/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">Creating end-to-end synthetics monitoring tests</a></li>
<li><a href="https://playwright.dev/">Playwright (what Elastic uses for synthetic monitoring tests)</a></li>
<li><a href="https://www.npmjs.com/package/@elastic/synthetics">Elastic’s NPM library for synthetics monitoring test development</a></li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-monitors.png" alt="observability monitors" /></p>
<h2>Synthetic Monitoring: The missing piece in your observability puzzle</h2>
<p>Synthetic Monitoring plays a vital role in complementing traditional logs- and traces-driven observability, offering a unique lens through which SREs can analyze their critical flows. In the dynamic world of digital applications, ensuring these flows are available and functioning as expected for end-users becomes critical. This is where Synthetic Monitoring shines, offering the only surefire method to gain visibility into these crucial aspects.</p>
<p>Moreover, with the rise in the use of third-party dependencies in modern web applications, Synthetic Monitoring becomes indispensable. These third-party elements, while often improving functionality and user experience, can become weak links leading to failures or downtime. Synthetic Monitoring can provide exclusive visibility into these dependencies, enabling teams to identify and address potential issues proactively.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-observability-network-requests.png" alt="observability network requests" /></p>
<p>By integrating Synthetic Monitoring into your Observability strategy, you can proactively identify and mitigate potential problems, preventing costly downtime and ensuring an optimal user experience. Our Synthetic Monitoring solution fits perfectly within this framework, providing a comprehensive tool to safeguard your applications' performance and reliability.</p>
<h2>SRE-focused solution</h2>
<p>Elevate your SRE workflows with our Synthetic Monitoring product, built with an SRE's needs in mind. Enjoy access to dedicated error detail pages that serve up all crucial information at a glance, allowing you to effortlessly triage and diagnose issues. Our comparison feature offers a side-by-side view of the last successful test run and the failed one, further simplifying issue resolution. With additional features such as performance trend analysis, proactive alerts, and seamless <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=ticketing">integration with incident management tools</a>, (such as <a href="https://www.elastic.co/blog/elastic-integrations-with-servicenow-itsm-sir-itom">ServiceNow</a>) our Synthetic Monitoring solution is the quintessential tool for maintaining smooth and reliable end-user experiences.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-observability-service-unavailable.png" alt="observability service unavailable" /></p>
<h2>A leap forward in GitOps management</h2>
<p>Experience an industry first in synthetic monitoring with our groundbreaking product, uniquely built on top of the powerful browser testing framework Playwright. This innovation enables you to manage monitors as code, allowing you to write and verify tests in pre-production before effortlessly pushing the test scripts into synthetic monitoring for ongoing testing in production.</p>
<p>For developers wishing to run tests locally, our solution integrates seamlessly with the <a href="https://www.npmjs.com/package/@elastic/synthetics">NPM library</a>. This flexibility ensures that our product not only eliminates the lag between code releases and testing updates, but also simplifies the management of large volumes of monitors and scripts.</p>
<p>Moreover, keeping scripts in source control further provides advantages such as version control, Role-Based Access Control (RBAC), and the opportunity to centralize your test code alongside your application code. In essence, our Playwright-based solution revolutionizes synthetic monitoring by streamlining the entire testing process, ensuring seamless and efficient monitoring in all environments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-open-editions.png" alt="observability open editions" /></p>
<h2>Managed testing infrastructure for comprehensive coverage without the hassle</h2>
<p>Our Synthetic Monitoring solution introduces an Elastic-first managed testing service, offering a global network of testing locations. At launch there are ten locations around the globe and we will be continuously growing our footprint. Eliminate the headaches of hardware management, capacity planning, scaling, updating, and security patching. Conduct both lightweight and full browser tests with ease and take advantage of features such as automatic scaling, built-in security, and seamless integration with Elastic Observability. For those use cases requiring a testing agent deployed within your own infrastructure, we offer support via Private Testing Locations. This enables your teams to focus on what matters most — delivering outstanding user experiences.</p>
<h2>Pricing and promotional period</h2>
<p>To celebrate the launch, we're providing a free promotional period for the managed testing service. From now until September 1, 2023, all test execution will be free of charge. After that, the browser test runs will be charged at a minimal $0.014 per test run. We will also have a unique flat rate for ping test execution set at $35/month/region for virtually unlimited lightweight test execution. We will not charge for test execution for private locations. <a href="https://www.elastic.co/pricing/">View our Pricing page</a> for more information.</p>
<h2>Try it out</h2>
<p>Don't miss out on this opportunity to experience our unique approach to Synthetic Monitoring. <a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-8-0">Upgrade your existing Elastic Stack to 8.8.0</a> to take advantage of our free promotional period.</p>
<p>Read about these capabilities and more in the Elastic Observability 8.8.0 <a href="https://www.elastic.co/guide/en/welcome-to-elastic/current/new.html">release notes</a>.</p>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p><em>Originally published October 25, 2022; updated May 23, 2023.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/the-end-of-databases-A_(1).jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[NGINX log analytics with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-log-analytics-with-genai-elastic</link>
            <guid isPermaLink="false">nginx-log-analytics-with-genai-elastic</guid>
            <pubDate>Fri, 05 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from NGINX easier.]]></description>
<content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. NGINX, widely used for web serving, load balancing, HTTP caching, and reverse proxying, is key to many applications and outputs a large volume of logs. NGINX’s access logs, which detail all requests made to the NGINX server, and its error logs, which record server-related issues, are key to managing and analyzing NGINX problems and to understanding what is happening in your application.</p>
<p>For managing NGINX, Elastic provides several capabilities:</p>
<ol>
<li>
<p>Easy ingest, parsing, and out-of-the-box dashboards. Check out the simple how-to in our <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>. Based on logs, these dashboards show several items over time, response codes, errors, top pages, data volume, browsers used, active connections, drop rates, and much more.</p>
</li>
<li>
<p>Out-of-the-box ML-based anomaly detection jobs for your NGINX logs. These jobs help pinpoint anomalies against request rates, IP address request rates, URL access, status codes, and visitor rate anomalies.</p>
</li>
<li>
<p>ES|QL, which helps you work through logs and build out charts during analysis.</p>
</li>
<li>
<p>Elastic’s GenAI Assistant provides a simple natural language interface that helps analyze all the logs and can pull out issues from ML jobs and even create dashboards. The Elastic AI Assistant also automatically uses ES|QL.</p>
</li>
<li>
<p>NGINX SLOs - Finally, Elastic provides the ability to define and monitor SLOs for your NGINX logs. While most SLOs are metrics-based, Elastic also allows you to create log-based SLOs. We detailed this in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>.</p>
</li>
</ol>
<p>NGINX logs are another example of why logs are great. Logging is an important part of observability, alongside the metrics and tracing we usually think of first. However, the volume of logs that an application and its underlying infrastructure output can be daunting, and NGINX is usually the starting point for most analyses.</p>
<p>In today’s blog, we’ll cover how the out-of-the-box ML-based anomaly detection jobs can help RCA, and how Elastic’s GenAI Assistant helps easily work through logs to pinpoint issues in minutes. </p>
<h2>Prerequisites and config<a id="prerequisites-and-config"></a></h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Bring up an <a href="https://docs.nginx.com/nginx/admin-guide/web-server/">NGINX server</a> on a host, or run an application with NGINX as a front end and drive traffic to it.</p>
</li>
<li>
<p>Install the NGINX integration and assets and review the dashboards as noted in the <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack.</p>
</li>
<li>
<p>To use the AI Assistant, you will need a trial license or an upgrade to Platinum.</p>
</li>
</ul>
<p>In our scenario, we use three months of data from our Elastic environment to highlight the features. Hence, you may need to run your application with traffic for some time to follow along.</p>
<h2>Analyzing the issues with AI Assistant<a id="analyzing-the-issues-with-ai-assistant"></a></h2>
<p>As detailed in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>, you can get alerted on issues via SLO monitoring against NGINX logs. Let’s assume you have an SLO based on status codes, as we outlined in that <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>. You can immediately analyze the issue via the AI Assistant. Because it’s a chat interface, we simply open the AI Assistant and work through some simple analysis (see the walkthrough below for a demo):</p>
<h3>AI Assistant analysis:<a id="ai-assistant-analysis"></a></h3>
<ul>
<li>
<p><strong><em>Using lens graph all http response status codes &lt; 400 and &gt;= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> - We wanted to understand the number of requests resulting in status code &gt;= 400 and graph the results. We see that 15% of the requests were not successful, hence the SLO alert being triggered.</p>
</li>
<li>
<p><strong><em>Which ip address (field source.address) has the highest number of http.response.status.code &gt;= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> - We were curious whether a specific IP address was behind the unsuccessful requests. 72.57.0.53, with a count of 25,227 occurrences, is notably high, but it does not account for all of the failed requests.</p>
</li>
<li>
<p><strong><em>What country (source.geo.country_iso_code) is source.address=72.57.0.53 coming from. Use filebeat-nginx-elasticco-anon-2017.</em></strong> - Again, we were curious whether it came from a specific country. The IP address 72.57.0.53 comes from the country with ISO code IN, which corresponds to India. Nothing out of the ordinary.</p>
</li>
<li>
<p><strong><em>Did source.address=72.57.0.53 have any (http.response.status.code &lt; 400) from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer -</em></strong> Oddly, the IP address in question had only 4,000+ successful responses. This means it’s not malicious and points to something else.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400), from source.address=72.57.0.53. Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code -</em></strong> We were curious whether there were any 502s; there were none, and most of the failures were 404s.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400). Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code</em></strong> - Regardless of a specific address, which status code above 400 occurs most often? This also points to 404.</p>
</li>
<li>
<p><strong><em>What does a high 404 count from a specific IP address mean from NGINX logs?</em></strong> - With this question, we want to understand the potential causes of this in our application. From the answers, we can rule out security probing and web scraping, since we validated that the specific address 72.57.0.53 also makes successful requests. It also rules out user error. Hence this potentially points to broken links or missing resources.</p>
</li>
</ul>
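<p>Under the hood, prompts like these are translated into ES|QL. As a rough sketch, a query equivalent to the second prompt could also be issued directly against the ES|QL <code>_query</code> endpoint (the index and field names are taken from the prompts above; <code>TO_INTEGER</code> is used because <code>http.response.status.code</code> is not mapped as an integer):</p>
<pre><code class="language-json">POST /_query
{
  &quot;query&quot;: &quot;FROM filebeat-nginx-elasticco-anon-2017 | WHERE TO_INTEGER(http.response.status.code) &gt;= 400 | STATS failures = COUNT(*) BY source.address | SORT failures DESC | LIMIT 10&quot;
}
</code></pre>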
<h3>Watch the flow:<a id="watch-the-flow"></a></h3>
<h3>Potential issue:</h3>
<p>It seems we potentially have an issue with the backend serving specific content, or with resources (database issues or broken links). This is causing the higher-than-normal rate of non-successful status codes (&gt;= 400).</p>
<h3>Key highlights from AI Assistant:</h3>
<p>As you watch the video, you will notice a few things:</p>
<ol>
<li>
<p>We analyzed millions of logs in a matter of minutes using a set of simple natural language queries. </p>
</li>
<li>
<p>We didn’t need to know any special query language. The AI Assistant used Elastic’s ES|QL, but it can use KQL as well.</p>
</li>
<li>
<p>The AI Assistant easily builds out graphs.</p>
</li>
<li>
<p>The AI Assistant accesses and uses internal information stored in Elastic’s indices, unlike a generic, search-engine-style AI Assistant. This is enabled through RAG, and the AI Assistant can also surface known issues from GitHub, runbooks, and other useful internal information.</p>
</li>
</ol>
<p>Check out the following <a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">blog</a> on how the AI Assistant uses RAG to retrieve internal information, specifically using GitHub issues and runbooks.</p>
<h2>Locating anomalies with ML</h2>
<p>While the AI Assistant is great for analyzing information, another important aspect of NGINX log management is handling log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs that analyze one or more metrics to look for anomalies. When using NGINX, there are several <a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">out-of-the-box anomaly detection jobs</a>. These work specifically on NGINX access logs.</p>
<ul>
<li>
<p>Low_request_rate_nginx - Detect low request rates</p>
</li>
<li>
<p>Source_ip_request_rate_nginx - Detect unusual source IPs - high request rates</p>
</li>
<li>
<p>Source_ip_url_count_nginx - Detect unusual source IPs - high distinct count of URLs</p>
</li>
<li>
<p>Status_code_rate_nginx - Detect unusual status code rates</p>
</li>
<li>
<p>Visitor_rate_nginx - Detect unusual visitor rates</p>
</li>
</ul>
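<p>Once a job is running, its anomaly records can also be pulled directly from the ML API. An illustrative sketch for the status code job (the job id is assumed to be <code>status_code_rate_nginx</code>; the installed id may differ), returning only high-scoring records:</p>
<pre><code class="language-json">GET _ml/anomaly_detectors/status_code_rate_nginx/results/records
{
  &quot;sort&quot;: &quot;record_score&quot;,
  &quot;desc&quot;: true,
  &quot;record_score&quot;: 75
}
</code></pre>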
<p>Since these come right out of the box, let’s look at the job Status_code_rate_nginx, which relates to our previous analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/nginx-ml-log-analytics.png" alt="NGINX ML Log Analytics" /></p>
<p>With a few simple clicks, we immediately get an analysis showing a specific IP address, 72.57.0.53, with a higher-than-normal number of non-successful requests. Notably, we found the same result using the AI Assistant.</p>
<p>We can take this further in conversation with the AI Assistant, look at the logs, or explore the other ML anomaly jobs.</p>
<h2>Conclusion:<a id="conclusion"></a></h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze NGINX logs without needing to know query syntax, where the data lives, or even the field names. Additionally, you’ve seen how SLOs can alert you to a potential issue or degradation in service.</p>
<p>Check out other resources on NGINX logs:</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">Out-of-the-box anomaly detection jobs for NGINX</a></p>
<p><a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">Using the NGINX integration to ingest and analyze NGINX Logs</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">NGINX Logs based SLOs in Elastic</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Using GitHub issues, runbooks, and other internal information for RCAs with Elastic’s RAG based AI Assistant</a></p>
<h2>Try it out<a id="try-it-out"></a></h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/blog-thumb-observability-pattern-color.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Exploring Nginx metrics with Elastic time series data streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-metrics-elastic-time-series-data-streams</link>
            <guid isPermaLink="false">nginx-metrics-elastic-time-series-data-streams</guid>
            <pubDate>Mon, 10 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elasticsearch recently released time series metrics as GA. In this blog, we dive into details of what a time series metric document is and the mapping used for enabling time series by using an existing OOTB Nginx integration.]]></description>
<content:encoded><![CDATA[<p>Elasticsearch<sup>®</sup> recently released time series data streams for metrics. This not only provides better metrics support in Elastic Observability, but it also helps reduce <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">storage costs</a>. We discussed this in a <a href="https://www.elastic.co/blog/elasticsearch-time-series-data-streams-observability-metrics">previous blog</a>.</p>
<p>In this blog, we dive into how to enable and use time series data streams by reviewing what a time series metrics <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html">document</a> is and the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mapping</a> used for enabling time series. In particular, we will showcase this by using Elastic Observability’s Nginx integration. As Elastic<sup>®</sup> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/tsds.html">time series data stream (TSDS)</a> metrics capabilities evolve, some of the scenarios below will change.</p>
<p>Elastic TSDS stores metrics in indices optimized for a time series database (<a href="https://en.wikipedia.org/wiki/Time_series_database">TSDB</a>), which is used to store time series metrics. <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">Elastic’s TSDB also got a significant optimization in 8.7</a> by reducing storage costs by upward of 70%.</p>
<h2>What is an Elastic time series data stream?</h2>
<p>A time series data stream (TSDS) models timestamped metrics data as one or more time series. In a TSDS, each Elasticsearch document represents an observation or data point in a specific time series. Although a TSDS can contain multiple time series, a document can only belong to one time series. A time series can’t span multiple data streams.</p>
<p>A regular <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data stream</a> can serve different uses, including logs. For metrics, however, a time series data stream is recommended. A time series data stream differs from a regular data stream in <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#differences-from-regular-data-stream">multiple ways</a>; notably, a TSDS defines one or more dimensions along with its metric fields.</p>
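<p>Concretely, what makes a data stream a TSDS is the <code>index.mode: time_series</code> setting, together with a routing path listing the dimension fields, in its index template. A simplified, illustrative sketch (the template name, index pattern, and routing path are reduced for brevity):</p>
<pre><code class="language-json">PUT _index_template/metrics-nginx-tsdb-example
{
  &quot;index_patterns&quot;: [&quot;metrics-nginx.stubstatus-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;time_series&quot;,
      &quot;index.routing_path&quot;: [&quot;nginx.stubstatus.hostname&quot;]
    }
  }
}
</code></pre>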
<h2>Nginx metrics as an example</h2>
<p><a href="https://www.elastic.co/integrations/data-integrations?solution=observability">Integrations</a> provide an easy way to ingest observability metrics for a large number of services and systems. We use the <a href="https://docs.elastic.co/en/integrations/nginx">Nginx</a> integration <a href="https://docs.elastic.co/en/integrations/nginx#metrics-reference">metrics</a> data set as an example here. This is one of the integrations on which time series has recently been enabled.</p>
<h2>Process of enabling TSDS on a package</h2>
<p>Time series is <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-mode">enabled</a> on a metrics data stream of an <a href="https://www.elastic.co/integrations/">integration</a> package after the relevant time series <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-metric">metrics</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension">dimension</a> mappings are added. Existing integrations with metrics data streams will come with time series metrics enabled, so users can use them as-is without any additional configuration.</p>
<p>The image below captures a high-level summary of a time series data stream, the corresponding index template, the time series indices and a single document. We will shortly dive into the details of each of the fields in the document.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-1-time-series-data-stream-2.png" alt="time series data stream" /></p>
<h2>TSDS metric document</h2>
<p>Below we provide a snippet of an ingested Elastic document with time series metrics and dimension together.</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2023-06-29T03:58:12.772Z&quot;,

  &quot;nginx&quot;: {
    &quot;stubstatus&quot;: {
      &quot;accepts&quot;: 202,
      &quot;active&quot;: 2,
      &quot;current&quot;: 3,
      &quot;dropped&quot;: 0,
      &quot;handled&quot;: 202,
      &quot;hostname&quot;: &quot;host.docker.internal:80&quot;,
      &quot;reading&quot;: 0,
      &quot;requests&quot;: 10217,
      &quot;waiting&quot;: 1,
      &quot;writing&quot;: 1
    }
  }
}
</code></pre>
<p><strong>Multiple metrics per document:</strong><br />
An ingested <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html">document</a> has a collection of fields, including metrics fields. Multiple related metrics fields can be part of a single document. A document is part of a single <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html">data stream</a>, and typically all the metrics it contains are related. All the metrics in a document are part of the same time series.</p>
<p><strong>Metric type and dimensions as mapping:</strong><br />
While the document contains the metrics details, the metric types and dimension details are defined as part of the field <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mapping</a>. All the time-series-relevant field mappings are defined collectively for a given data stream as part of package development. Integrations released with a time series data stream contain all the relevant time series field mappings as part of the package release. There are two additional mappings needed in particular: the <strong>time_series_metric</strong> mapping and the <strong>time_series_dimension</strong> mapping.</p>
<h2>Metrics types fields</h2>
<p>A document contains the metric type fields (as shown above). The mappings for the metric type fields are defined using the <strong>time_series_metric</strong> mapping in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index templates</a>, as given below:</p>
<pre><code class="language-json">&quot;nginx&quot;: {
    &quot;properties&quot;: {
       &quot;stubstatus&quot;: {
           &quot;properties&quot;: {
                &quot;accepts&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;active&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;current&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;dropped&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;handled&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;reading&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;requests&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;waiting&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;writing&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                }
           }
       }
    }
}
</code></pre>
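<p>These metric type mappings only take effect when the backing indices are created in time series mode. As a hedged sketch (the template name, index pattern, and routing path below are illustrative, not the exact ones the Nginx package ships), the relevant index template settings in Kibana Dev Tools could look like:</p>

```json
PUT _index_template/metrics-nginx.stubstatus
{
  "index_patterns": ["metrics-nginx.stubstatus-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["agent.id", "nginx.stubstatus.hostname"]
    }
  }
}
```

<p>With <code>index.mode: time_series</code> set, Elasticsearch routes documents by the fields listed in <code>index.routing_path</code> and enables time-series-specific storage optimizations.</p>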
<h2>Dimension fields</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension">Dimensions</a> are field names and values that, in combination, identify a document’s time series.</p>
<p>In Elastic time series, there are some additional considerations for dimensions:</p>
<ul>
<li>Dimension fields need to be defined for each time series; a time series cannot have zero dimension fields.</li>
<li>Keyword (or similar) type fields can be defined as dimensions.</li>
<li>There is currently a limit on the number of dimensions that can be defined in a data stream, though this restriction will likely be relaxed going forward.</li>
</ul>
<p>Dimensions are common to all the metrics in a single document within a data stream. Each time series data stream of a package (for example, Nginx) comes with a predefined set of dimension fields, as shown below.</p>
<p>A document typically contains more than one dimension field. In the case of Nginx, <em>agent.id</em> and <em>nginx.stubstatus.hostname</em> are two of the dimension fields. The mapping for the dimension fields is done using the <strong>time_series_dimension</strong> mapping as below:</p>
<pre><code class="language-json">&quot;agent&quot;: {
   &quot;properties&quot;: {
      &quot;id&quot;: {
         &quot;type&quot;: &quot;keyword&quot;,
         &quot;time_series_dimension&quot;: true
       }
    }
 },

&quot;nginx&quot;: {
   &quot;properties&quot;: {
       &quot;stubstatus&quot;: {
            &quot;properties&quot;: {
                &quot;hostname&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;time_series_dimension&quot;: true
                 }
            }
       }
    }
}
</code></pre>
<h2>Meta fields</h2>
<p>Documents ingested also have additional meta fields apart from the <em>metric</em> and <em>dimension</em> fields explained above. These additional fields provide richer query capabilities for the metrics.</p>
<p><strong>Example Elastic meta fields</strong></p>
<pre><code class="language-json">&quot;data_stream&quot;: {
      &quot;dataset&quot;: &quot;nginx.stubstatus&quot;,
      &quot;namespace&quot;: &quot;default&quot;,
      &quot;type&quot;: &quot;metrics&quot;
 }
</code></pre>
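<p>Together, the dimension, metric, and meta fields make it straightforward to scope queries to a single data stream. As a sketch (the field names are taken from the document above; the data stream name assumes the default namespace), an aggregation in Kibana Dev Tools that groups a gauge metric by a dimension could look like:</p>

```json
GET metrics-nginx.stubstatus-default/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": {
    "per_host": {
      "terms": { "field": "nginx.stubstatus.hostname" },
      "aggs": {
        "avg_active": { "avg": { "field": "nginx.stubstatus.active" } }
      }
    }
  }
}
```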
<h2>Discover and visualization in Kibana</h2>
<p>Elastic provides comprehensive search and visualization for time series metrics. Time series metrics can be searched as-is in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>. In the search below, counter and gauge metrics are marked with <em>different icons</em>. Below we also provide examples of visualizing the time series metrics using <a href="https://www.elastic.co/kibana/kibana-lens">Lens</a> and the out-of-the-box (OOTB) dashboard included in the Nginx integration package.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-2-discover-search-tsds.png" alt="Discover search for TSDS metrics" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-3-lens.png" alt="Maximum of counter field nginx.stubstatus.accepts visualized using Lens" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-4-median-gauge.png" alt="Median of gauge field nginx.stubstatus.active visualized using Lens" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-5-multiple-line-graphs.png" alt="OOTB Nginx dashboard with the TSDS metrics visualizations " /></p>
<h2>Try it out!</h2>
<p>We have provided a detailed example of a time series document ingested by the Elastic Nginx integration. We walked through how time series metrics are modeled in Elastic and the additional time series mappings, with examples. We also covered the dimension requirements for Elastic time series, as well as brief examples of searching, visualizing, and dashboarding TSDS metrics in Kibana<sup>®</sup>.</p>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the time series data stream capabilities discussed above. We would be interested in your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/elasticsearch-time-series-data-streams-observability-metrics">How to use Elasticsearch and Time Series Data Streams for observability metrics</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Time Series Data Stream in Elastic documentation</a> </li>
<li><a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">Efficient storage with Elastic Time Series Database</a></li>
<li><a href="https://www.elastic.co/integrations/">Elastic integrations catalog</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/time-series-data-streams-blog-720x420-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[A Practical Guide to end-to-end distributed tracing for Nginx with OpenTelemetry in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-opentelemetry-end-to-end-tracing</link>
            <guid isPermaLink="false">nginx-opentelemetry-end-to-end-tracing</guid>
            <pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrument Nginx with the OpenTelemetry tracing module and export spans to Elastic Observability's APM for full end-to-end distributed tracing.]]></description>
            <content:encoded><![CDATA[<h1>End-to-End Observability with Nginx and OpenTelemetry</h1>
<p>Nginx sits at the very front of most modern architectures: handling SSL, routing, load balancing, authentication, and more. Yet, despite its central role, it is often absent from distributed traces.<br />
That gap creates blind spots that impact performance debugging, user experience analysis, and system reliability.</p>
<p>This article explains <strong>why Nginx tracing is important</strong> in an application context, and provides a <strong>practical guide</strong> to enable the Nginx <a href="https://nginx.org/en/docs/ngx_otel_module.html">Otel</a> tracing module exporting spans directly to <a href="https://www.elastic.co/docs/solutions/observability/apm">Elastic APM</a>.</p>
<h2>Why Nginx Tracing Matters for Modern Observability</h2>
<p>Instrumenting only backend services gives you only half the picture.<br />
Nginx sees:</p>
<ul>
<li>every incoming request</li>
<li>client trace context</li>
<li>TLS negotiation</li>
<li>upstream errors (502, 504)</li>
<li>edge-layer latency</li>
<li>routing decisions</li>
</ul>
<p>If Nginx is not in your traces, your distributed trace is incomplete.</p>
<p>By adding OpenTelemetry tracing at this ingress layer, you unlock:</p>
<p><em>1. Full trace continuity</em> : From browser → Nginx → backend → database.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_1.png" alt="Nginx Trace Continuity" /></p>
<p><em>2. Accurate latency attribution</em> : Edge delays vs. backend delays are clearly separated, which unlocks Elastic <a href="https://www.elastic.co/docs/solutions/observability/apm/machine-learning">APM latency</a> anomaly detection for proactive alerting.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_2.png" alt="Nginx Latency Detection" /></p>
<p><em>3. Error root-cause clarity</em> : Nginx errors appear as spans instead of backend “mystery gaps”.</p>
<p><em>4. Complete service topology</em> : Your APM service map finally shows the real architecture.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_4.png" alt="Nginx APM Service Map" /></p>
<h2>Integrating Nginx with OpenTelemetry on Debian</h2>
<p>This guide explains how to install and configure the Nginx OpenTelemetry module on a Debian-based system. The configuration examples are tailored to send telemetry data directly to an Elastic APM endpoint, whether an <a href="https://www.elastic.co/docs/reference/opentelemetry">EDOT</a> Collector or <a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">mOtel</a> in the case of Elastic Serverless, enabling end-to-end distributed tracing.</p>
<h2>Installation on Debian</h2>
<p>The Nginx OTel module is not included in the standard Nginx packages; it must be installed separately on top of a working Nginx installation.</p>
<h3>Prerequisites</h3>
<p>First, update your package lists and install the Nginx OpenTelemetry module package.</p>
<pre><code class="language-bash">sudo apt update
sudo apt install -y nginx-module-otel
</code></pre>
<h3>Load the Module in Nginx</h3>
<p>Edit your main <code>/etc/nginx/nginx.conf</code> file to load the new module. This directive must be at the top level, before the <code>http</code> block.</p>
<pre><code class="language-nginx"># /etc/nginx/nginx.conf

load_module modules/ngx_otel_module.so;

events {
    # ...
}

http {
    # ...
}
</code></pre>
<p>Now, test your configuration and restart Nginx.</p>
<pre><code class="language-bash">sudo nginx -t
sudo systemctl restart nginx
</code></pre>
<h2>Configuration</h2>
<p>Configuration is split between the main <code>nginx.conf</code> file (for global settings) and your site-specific server block files.</p>
<h3>Global Configuration (<code>/etc/nginx/nginx.conf</code>)</h3>
<p>This configuration sets up the destination for your telemetry data and defines global variables used for CORS and tracing. These settings are placed inside the <code>http</code> block.</p>
<pre><code class="language-nginx">http {
    ...

    # --- OpenTelemetry Exporter Configuration ---
    # Defines where Nginx will send its telemetry data directly to Elastic APM or EDOT.
    otel_exporter {
        endpoint https://&lt;ELASTIC_URL&gt;:443;
        header Authorization &quot;Bearer &lt;TOKEN&gt;&quot;;
    }

    # --- OpenTelemetry Service Metadata ---
    # These attributes identify Nginx as a unique service in the APM UI.
    otel_service_name nginx;
    otel_resource_attr service.version 1.28.0;
    otel_resource_attr deployment.environment production;
    otel_trace_context propagate; # Needed to propagate the RUM traces to the backend

    # --- Helper Variables for Tracing and CORS ---
    # Creates the $trace_flags variable needed to build the outgoing traceparent header.
    map $otel_parent_sampled $trace_flags {
        default &quot;00&quot;; # Not sampled
        &quot;1&quot;     &quot;01&quot;; # Sampled
    }

    # Creates the $cors_origin variable for secure, multi-origin CORS handling.
    map $http_origin $cors_origin {
        default &quot;&quot;;
        &quot;http://&lt;URL_ORIGIN_1&gt;&quot; $http_origin; # Add your Origin here to allow CORS (no trailing slash)
        &quot;https://&lt;URL_ORIGIN_2&gt;&quot; $http_origin; # Add your other Origins here to allow CORS
    }
...
}
</code></pre>
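<p>The <code>map $http_origin $cors_origin</code> block above is an allow-list lookup: it echoes the request’s Origin header back when the origin is on the list and yields an empty string otherwise. A minimal Python sketch of the equivalent logic (the origins below are hypothetical placeholders, not values from this guide):</p>

```python
# Mirrors nginx's `map $http_origin $cors_origin` allow-list.
# The origins below are hypothetical placeholders.
ALLOWED_ORIGINS = {"http://app.example.com", "https://shop.example.com"}

def cors_origin(http_origin: str) -> str:
    # The `default ""` arm: unknown origins get an empty header value,
    # so the browser refuses the cross-origin response.
    return http_origin if http_origin in ALLOWED_ORIGINS else ""
```

<p>Returning the origin itself (rather than <code>*</code>) keeps the allow-list explicit and works with credentialed requests.</p>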
<h3>Server Block Configuration (<code>/etc/nginx/conf.d/site.conf</code>)</h3>
<p>This configuration enables tracing for a specific site, handles CORS preflight requests, and propagates the trace context to the backend service.</p>
<pre><code class="language-nginx">server {
    listen 443 ssl;
    server_name &lt;WEBSITE_URL&gt;;

    # --- OpenTelemetry Module Activation ---
    # Enable tracing for this server block.
    otel_trace on;
    otel_trace_context propagate;

    location / {
        # --- CORS Preflight (OPTIONS) Handling ---
        # Intercepts preflight requests and returns the correct CORS headers,
        # allowing the browser to proceed with the actual request.
        if ($request_method = 'OPTIONS') {
            add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
            add_header 'Access-Control-Allow-Headers' 'Content-Type, traceparent, tracestate' always;
            add_header 'Access-Control-Max-Age' 86400;
            add_header 'Access-Control-Allow-Origin' &quot;$cors_origin&quot; always;
            return 204;
        }

        # --- OpenTelemetry Trace Context Propagation ---
        # Manually constructs the W3C traceparent header and passes the tracestate
        # header to the backend, linking this trace to the upstream service.
        proxy_set_header traceparent      &quot;00-$otel_trace_id-$otel_span_id-$trace_flags&quot;;
        proxy_set_header tracestate       $http_tracestate;

        # --- Standard Proxy Headers ---
        proxy_set_header Host             $host;
        proxy_set_header X-Real-IP        $remote_addr;
        proxy_set_header X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # --- Forward to Backend ---
        # Passes the request to the actual application (e.g., localhost in this example).
        proxy_pass http://&lt;BACKEND_URL&gt;:8080;
    }
}
</code></pre>
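<p>The <code>proxy_set_header traceparent</code> line above assembles a W3C Trace Context header from <code>$otel_trace_id</code>, <code>$otel_span_id</code>, and the mapped <code>$trace_flags</code>. To make the format explicit, here is the same assembly sketched in Python with hypothetical IDs:</p>

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Assemble a W3C traceparent header: <version>-<trace-id>-<span-id>-<flags>.

    Mirrors nginx's "00-$otel_trace_id-$otel_span_id-$trace_flags", with the
    sampled bit mapped to "01"/"00" as in the `map $otel_parent_sampled` block.
    """
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

# Hypothetical IDs: 32 hex chars for the trace ID, 16 for the span ID.
print(build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True))
# → 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```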
<p>Test your configuration and restart Nginx.</p>
<pre><code class="language-bash">sudo nginx -t
sudo systemctl restart nginx
</code></pre>
<h2>Conclusion: Turning Nginx into a First-Class Observability Signal</h2>
<p>By enabling OpenTelemetry tracing directly in Nginx and exporting spans to Elastic APM (via EDOT or Elastic’s managed OTLP endpoint), you bring your ingress layer into the same observability model as the rest of your stack. The result is:</p>
<ul>
<li>true end-to-end trace continuity from the browser to backend services</li>
<li>clear separation between edge latency and application latency</li>
<li>immediate visibility into gateway-level failures and retries</li>
<li>accurate service maps that reflect real production traffic</li>
</ul>
<p>Most importantly, this approach aligns Nginx with modern observability standards. It avoids proprietary instrumentation, fits naturally into OpenTelemetry-based architectures, and scales consistently across hybrid and cloud-native environments.</p>
<h2>Try it out!</h2>
<p>Once Nginx tracing is in place, several natural extensions can further improve your observability posture:</p>
<ul>
<li>correlate Nginx traces with application <a href="https://www.elastic.co/docs/reference/apm/agents/go/log-correlation">logs and metrics using</a> Elastic’s unified observability</li>
<li>add Real User Monitoring (<a href="https://www.elastic.co/docs/solutions/observability/apm/apm-agents/real-user-monitoring-rum">RUM</a>) to close the loop from frontend to backend</li>
<li>introduce <a href="https://www.elastic.co/docs/solutions/observability/apm/transaction-sampling">sampling and tail-based</a> decisions at the collector level for cost control</li>
<li>use Elastic <a href="https://www.elastic.co/docs/solutions/observability/apm/service-map">APM service maps</a> and <a href="https://www.elastic.co/docs/reference/machine-learning/ootb-ml-jobs-apm">anomaly detection</a> to proactively detect edge-related issues</li>
</ul>
<p>Instrumenting Nginx is often the missing link in distributed tracing strategies. With OpenTelemetry and Elastic, that gap can now be closed in a clean, standards-based, and production-ready way.</p>
<p>If you want to experiment with this setup quickly, Elastic Serverless provides the fastest way to get started.
Sign up and try it out in just a few minutes using our trial environment available at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a> .</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_4.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's AIOps Labs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-logs-machine-learning-aiops</link>
            <guid isPermaLink="false">observability-logs-machine-learning-aiops</guid>
            <pubDate>Thu, 27 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Our machine learning-based AIOps capabilities help you analyze the root cause of issues allowing you to focus on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>In the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">previous blog</a> in our root cause analysis with logs series, we explored how to analyze logs in Elastic Observability with Elastic’s anomaly detection and log categorization capabilities. Elastic’s platform enables you to get started on machine learning (ML) quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured <a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">machine learning models</a> for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To get you started, there are several key features built into Elastic Observability to aid in analysis, bypassing the need to run specific ML models. These features help minimize the time and analysis of logs.</p>
<p>Let’s review the set of machine learning-based observability features in Elastic:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action more quickly.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. Read <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a> for an overview of this capability.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p>As we showed in the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">last blog</a>, using machine learning-based features helps minimize the extremely tedious and time-consuming process of analyzing data using traditional methods, such as alerting and simple pattern matching (visual or simple searching, etc.). Trying to find the needle in the haystack requires the use of some level of artificial intelligence due to the increasing amounts of telemetry data (logs, metrics, and traces) being collected across ever-growing applications.</p>
<p>In this blog post, we’ll cover two capabilities found in Elastic’s AIOps Labs: log spike detector and log pattern analysis. We’ll use the same data from the previous blog and analyze it using these two capabilities.</p>
<p><strong><em>We will cover log spike detector and log pattern analysis against the popular Hipster Shop app developed by Google, and modified recently by OpenTelemetry.</em></strong></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>Below, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Utilize a version of the popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and Security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map.png" alt="observability service map" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features. You might have a different set of issues depending on how you load the application and/or introduce specific feature flags.</p>
<p>As part of the walk-through, we’ll assume we are DevOps or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer-related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Log spike analysis</li>
<li>Log pattern analysis</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<p>Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map-service-details.png" alt="observability service map service details" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. Rather than jump into anomaly detection (see previous <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">blog</a>), let’s look at some of the potential issues by reviewing the service details in APM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-overview.png" alt="observability product catalog service overview" /></p>
<p>What we see for the productCatalogService is that there are latency issues, failed transactions, a large number of issues, and a dependency to PostgreSQL. When we look at the errors in more detail and drill down, we see they are all coming from <a href="https://pkg.go.dev/github.com/lib/pq">PQ - which is a PostgreSQL driver in Go</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-errors.png" alt="observability product catalog service errors" /></p>
<p>As we drill further, we still can’t tell why the productCatalogService is not able to pull information from the PostgreSQL database.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-error-group.png" alt="observability product catalog service error group" /></p>
<p>We see that there is a spike in errors, so let's see if we can glean further insight using one of our two options:</p>
<ul>
<li>Log rate spikes</li>
<li>Log pattern analysis</li>
</ul>
<h3>Log rate spikes</h3>
<p>Let’s start with the <strong>log rate spikes</strong> detector, found in the AIOps Labs section of Elastic’s machine learning features. We also pre-select analyzing the spike against a baseline history.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-postgres.png" alt="explain log rate spikes postgres" /></p>
<p>The log rate spikes detector has looked at all the logs from the spike and compared them to the baseline, and it's seeing higher-than-normal counts in specific log messages. From a visual inspection, we see that PostgreSQL log messages are high. We further filter this with postgres.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-pgbench.png" alt="explain log rates spikes pgbench" /></p>
<p>We immediately notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes a heavy load on the database host, likely causing higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
<h3>Log pattern analysis</h3>
<p>Instead of log rate spikes, let’s use log pattern analysis to investigate the spike in errors we saw in productCatalogService. In AIOps Labs, we simply select Log Pattern Analysis, use Logs data, filter the results with postgres (since we know it's related to PostgreSQL), and look at information from the message field of the logs we are processing. We see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-pattern-analysis.png" alt="observability explain log pattern analysis" /></p>
<p>Almost immediately we see the biggest pattern it finds is a log message where pgbench is updating the database. We can further directly drill into this log message from log pattern analysis into Discover and review the details and further analyze the messages.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-expanded-document.png" alt="expanded document" /></p>
<p>As we mentioned in the previous section, while it may or may not be the root cause, it quickly gives us a place to start and a potential root cause. A developer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
<h2>Conclusion</h2>
<p>Between the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">first blog</a> and this one, we’ve shown how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned in this blog.</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find the root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities (found in AIOps Labs in Elastic) in this blog:</p>
<ol>
<li><strong>Log rate spikes</strong> detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change, potentially across millions of log events spanning multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ol>
</li>
<li>
<p>You learned how simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand the machine learning that drives these features or perform any lengthy setup.</p>
</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities outlined above.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>Elastic and Elasticsearch are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitoring service performance: An overview of SLA calculation for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-sla-calculations-transforms</link>
            <guid isPermaLink="false">observability-sla-calculations-transforms</guid>
            <pubDate>Mon, 24 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Stack provides many valuable insights for different users, such as reports on service performance and if the service level agreement (SLA) is met. In this post, we’ll provide an overview of calculating an SLA for Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.</p>
<p><em>Since version 8.8, we have a built in functionality to calculate SLOs —</em> <a href="https://www.elastic.co/guide/en/observability/current/slo.html"><em>check out our guide</em></a><em>!</em></p>
<h2>Foundations of calculating an SLA</h2>
<p>There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different ways. Some examples include:</p>
<ul>
<li>Count of HTTP 2xx must be above 98% of all HTTP status</li>
<li>Response time of successful HTTP 2xx requests must be below x milliseconds</li>
<li>Synthetic monitor must be up at least 99%</li>
<li>95% of all batch transactions from the billing service need to complete within 4 seconds</li>
</ul>
<p>Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, letting you simply define an alert when availability is below 98% for the last hour.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-overview-monitor-details.png" alt="overview monitor details" /></p>
<p>I personally recommend using <a href="https://www.elastic.co/blog/new-synthetic-monitoring-observability">Elastic Synthetic Monitoring</a> whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.</p>
<p>Sometimes this is impossible, for example when you want to calculate the uptime of a specific Windows service that does not offer any TCP port or HTTP interaction. Here a caveat applies: just because the service is running does not necessarily mean that the service is working fine.</p>
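<p>To make targets like these concrete, it helps to translate an availability percentage into an allowed downtime budget. The following sketch (illustrative Python, not part of any Elastic tooling; the function name is our own) computes how much downtime a given target permits over a period:</p>
<pre><code class="language-python"># Convert an availability target into an allowed downtime budget.
# Illustrative helper only.

def downtime_budget_minutes(sla_percent, period_hours):
    # minutes of allowed downtime for the SLA over the period
    total_minutes = period_hours * 60
    return total_minutes * (1 - sla_percent / 100)

# A 98% target over a 30-day month allows 864 minutes (14.4 hours):
print(round(downtime_budget_minutes(98, 30 * 24)))
# A 99% target over the same month allows 432 minutes (7.2 hours):
print(round(downtime_budget_minutes(99, 30 * 24)))
</code></pre>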
<h2>Transforms to the rescue</h2>
<p>We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.</p>
<h3>Lens formula</h3>
<p>You can use Lens and a formula (for a deep dive into formulas, <a href="https://www.elastic.co/blog/how-tough-was-your-workout-take-a-closer-look-at-strava-data-through-kibana-lens">check out this blog</a>). Use the Search bar to filter down to the data you want, then use the formula option in Lens. We count all records with Running as the state and divide by the overall count of records. This is a nice solution when you need a quick, on-the-fly calculation.</p>
<pre><code class="language-sql">count(kql='windows.service.state: &quot;Running&quot; ')/count()
</code></pre>
<p>Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, indicating that a reboot happened and the service was therefore down for a moment. Lastly, we add a reference line set to our defined threshold of 98%. This ensures that a quick look at the visualization lets our eyes gauge whether we are above or below the threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-visualization.png" alt="visualization" /></p>
<h3>Transform</h3>
<p>What if you are not interested in just one service, but multiple services are needed for your SLA? That is where Transforms can solve the problem. A second issue is that the data above is only available inside the Lens visualization, so we cannot create any alerts on it.</p>
<p>Go to Transforms and create a pivot transform.</p>
<ol>
<li>
<p>Add the following filter to narrow it to only services data sets: data_stream.dataset: &quot;windows.service&quot;. If you are interested in a specific service, you can always add it to the search bar if you want to know if a specific remote management service is up in your entire fleet!</p>
</li>
<li>
<p>Select date_histogram(@timestamp) and set it to your chosen unit. By default, the Elastic Agent only collects service states every 60 seconds. I am going with 1 hour.</p>
</li>
<li>
<p>Select agent.name and windows.service.name as well.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration.png" alt="transform configuration" /></p>
<ol start="4">
<li>Now we need to define an aggregation type. We will use a value_count of windows.service.state. That just counts how many records have this value.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations.png" alt="aggregations" /></p>
<ol start="5">
<li>
<p>Rename the value_count to total_count.</p>
</li>
<li>
<p>Add value_count for windows.service.state a second time and use the pencil icon to edit it to terms, which aggregates for running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations-apply.png" alt="aggregations apply" /></p>
<ol start="7">
<li>
<p>This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.</p>
</li>
<li>
<p>Now, the preview shows us the count of records with any states and the count of running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-next.png" alt="transform configuration" /></p>
<ol start="9">
<li>
<p>Here comes the tricky part. We need to write some custom aggregations to calculate the percentage of uptime. Click on the copy icon next to the edit JSON config.</p>
</li>
<li>
<p>In a new tab, go to Dev Tools. Paste what you have in the clipboard.</p>
</li>
<li>
<p>Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter and run it. This will create a preview of what the data looks like. It should give you the same information as in the table preview.</p>
</li>
<li>
<p>Now, we need to calculate the percentage of up, which means adding a bucket script where we divide running.values by total_count, just like we did in the Lens visualization. If you name the columns differently or use more than a single value, you will need to adapt accordingly.</p>
</li>
</ol>
<pre><code class="language-json">&quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
</code></pre>
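<p>The bucket script is a plain ratio, the same calculation as the Lens formula earlier. As a quick sanity check of the arithmetic (a standalone Python sketch, not Painless):</p>
<pre><code class="language-python"># Reproduce the availability bucket script in plain Python.
# With 1-hour buckets and a 60-second collection interval, a fully
# healthy service yields 60 samples per bucket, all in state Running.

def availability(running_values, total_count):
    # mirrors the Painless script: params.up / params.total
    return running_values / total_count

print(availability(60, 60))  # every sample Running: 1.0
print(round(availability(59, 60), 4))  # one sample missed
</code></pre>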
<ol start="13">
<li>This is the entire transform for me:</li>
</ol>
<pre><code class="language-bash">POST _transform/_preview
{
  &quot;source&quot;: {
    &quot;index&quot;: [
      &quot;metrics-*&quot;
    ]
  },
  &quot;pivot&quot;: {
    &quot;group_by&quot;: {
      &quot;@timestamp&quot;: {
        &quot;date_histogram&quot;: {
          &quot;field&quot;: &quot;@timestamp&quot;,
          &quot;calendar_interval&quot;: &quot;1h&quot;
        }
      },
      &quot;agent.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;agent.name&quot;
        }
      },
      &quot;windows.service.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;windows.service.name&quot;
        }
      }
    },
    &quot;aggregations&quot;: {
      &quot;total_count&quot;: {
        &quot;value_count&quot;: {
          &quot;field&quot;: &quot;windows.service.state&quot;
        }
      },
      &quot;running&quot;: {
        &quot;filter&quot;: {
          &quot;term&quot;: {
            &quot;windows.service.state&quot;: &quot;Running&quot;
          }
        },
        &quot;aggs&quot;: {
          &quot;values&quot;: {
            &quot;value_count&quot;: {
              &quot;field&quot;: &quot;windows.service.state&quot;
            }
          }
        }
      },
      &quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
    }
  }
}
</code></pre>
<ol start="14">
<li>The preview in Dev Tools should work and be complete. If not, you must debug any errors. Most of the time, the problem is the bucket script and the path to the values; you might have named it up instead of running. This is what the preview looks like for me.</li>
</ol>
<pre><code class="language-json">{
  &quot;running&quot;: {
    &quot;values&quot;: 1
  },
  &quot;agent&quot;: {
    &quot;name&quot;: &quot;AnnalenasMac&quot;
  },
  &quot;@timestamp&quot;: &quot;2021-12-07T19:00:00.000Z&quot;,
  &quot;total_count&quot;: 1,
  &quot;availability&quot;: 1,
  &quot;windows&quot;: {
    &quot;service&quot;: {
      &quot;name&quot;: &quot;InstallService&quot;
    }
  }
},
</code></pre>
<ol start="15">
<li>Now we only paste the bucket script into the transform creation UI after selecting Edit JSON. It looks like this:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-pivot-configuration-object.png" alt="transform configuration pivot configuration object" /></p>
<ol start="16">
<li>Give your transform a name, set the destination index, and run it continuously. When selecting this, please also make sure not to use @timestamp. Instead, opt for event.ingested. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-checkpoints.html">Our documentation explains this in detail</a>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-details.png" alt="transform details" /></p>
<ol start="17">
<li>Click Next, then Create and start. This can take a bit, so don’t worry.</li>
</ol>
<p>To summarize, we have now created a pivot transform that uses a bucket script aggregation to calculate the percentage of time a service is running. One caveat: by default, Elastic Agent collects the service state only every 60 seconds. A service can be up at the exact moment of collection and down a few seconds later. If the metric is that important and no other monitoring options, such as <a href="https://www.elastic.co/blog/what-can-elastic-synthetics-tell-us-about-kibana-dashboards">Elastic Synthetics</a>, are possible, you might want to reduce the collection interval on the Agent side to retrieve the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. A super important server might collect the service state every 10 seconds because you need maximum granularity and confidence in the correctness of the metric, while for normal workstations where you just want to know whether your remote access solution is up the majority of the time, a single metric every 60 seconds is fine.</p>
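<p>To see why the collection interval matters, consider how far a single down sample moves the availability number within one bucket. A rough sketch (assuming 1-hour buckets and evenly spaced samples):</p>
<pre><code class="language-python"># Worst-case availability impact of one down sample in a 1-hour bucket,
# for different Elastic Agent collection intervals (illustrative only).

BUCKET_SECONDS = 3600

for interval_seconds in (60, 30, 10):
    samples_per_bucket = BUCKET_SECONDS // interval_seconds
    impact_percent = 100 / samples_per_bucket
    # e.g. at a 60-second interval, one down sample costs about 1.67%
    print(interval_seconds, samples_per_bucket, round(impact_percent, 2))
</code></pre>
<p>Shorter intervals shrink the swing each sample causes, at the cost of more collected data.</p>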
<p>After you have created the transform, an additional benefit is that the data is stored in an index in Elasticsearch. With the visualization alone, the metric is calculated for that visualization only and is not available anywhere else. Since the result is now indexed data, you can create a threshold alert that notifies your favorite connector (Slack, Teams, ServiceNow, email, and so <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">many more to choose from</a>).</p>
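<p>As a sketch of what such an alert could key on, a simple range query against the destination index surfaces any bucket below the target (here we assume the destination index is named windows-service, matching the data view mentioned below):</p>
<pre><code class="language-bash">GET windows-service/_search
{
  &quot;query&quot;: {
    &quot;range&quot;: {
      &quot;availability&quot;: {
        &quot;lt&quot;: 0.98
      }
    }
  }
}
</code></pre>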
<h2>Visualizing the transformed data</h2>
<p>The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This tells Lens that the field should automatically be formatted as a percentage, so you don’t need to select the format manually or do any calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn’t that cool? The same is possible for durations, like event.duration stored as nanoseconds: no more calculating on the fly and wondering whether you need to divide by 1,000 or 1,000,000.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-edit-field-availability.png" alt="edit field availability" /></p>
<p>We get this view by using a simple Lens visualization with @timestamp on the horizontal axis (minimum interval of 1 day) and the average of availability on the vertical axis. Don’t worry — the other data will be populated once the transform finishes. We add a reference line at 0.98 because our target is 98% uptime for the service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-line.png" alt="line" /></p>
<h2>Summary</h2>
<p>This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. This calculation method opens the door to many interesting use cases: change the bucket script and start calculating the number of sales or the average basket size. Interested in learning more about Elastic Synthetics? Read <a href="https://www.elastic.co/guide/en/observability/current/monitor-uptime-synthetics.html">our documentation</a> or check out our free <a href="https://www.elastic.co/training/synthetics-quick-start">Synthetic Monitoring Quick Start training</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/illustration-analytics-report-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[A train ride away from a million events per second with EDOT Cloud Forwarder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/one-million-events-per-second-with-edot-cloud-forwarder</link>
            <guid isPermaLink="false">one-million-events-per-second-with-edot-cloud-forwarder</guid>
            <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[EDOT Cloud Forwarder for AWS from Elastic Observability is now Generally Available. Deploy EDOT Cloud Forwarder and reliably handle one million events per second with zero intervention, zero data loss, and zero idle cost.]]></description>
            <content:encoded><![CDATA[<p>Infrastructure observability is critical for maintaining uptime, optimizing cloud environments, and securing the cloud perimeter. Cloud environments generate observability data at massive scale. VPC Flow Logs, ELB Access Logs, CloudTrail and CloudWatch logs can easily reach hundreds of thousands of events per second. Dealing with scale like this is a complex problem all by itself.</p>
<p>Today, we introduce <strong>EDOT Cloud Forwarder</strong>. Built on the OTel Collector, it is the simplest, fastest, and possibly most boring way to connect your cloud environment to Elastic Observability, and it is <strong>now Generally Available on AWS</strong>. With EDOT Cloud Forwarder you can get started in seconds, gain observability across your entire cloud estate, and easily handle telemetry at any volume.</p>
<h2>Deploying Cloud Forwarder from a District Line train</h2>
<p>So, I got to work deploying EDOT Cloud Forwarder in my AWS account. I was doing it on the commute, using nothing more than a decent 4G signal and hoping that my 27% battery would be enough.</p>
<p>I hit deploy on the Terraform template and waited. Soon I started seeing events flowing into Elastic Observability.</p>
<p>As the train pulled into Putney Bridge, the flow of logs peaked, and one million events per second scrolled across my screen.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/putney_bridge_1MEPS.png" alt="Putney_Bridge" /></p>
<p>Once deployed, I got the best three zeros I could expect:</p>
<ul>
<li><strong>Zero Intervention:</strong> I watched as traffic ramped up to a full 1M EPS. Lambda functions automatically scaled out from a few instances to the 60-65 concurrent executions needed. There were <strong>zero manual adjustments required</strong>. The scaling was instant and hands-free.</li>
<li><strong>Zero Data Loss:</strong> It achieved a consistent processing rate, with every single event indexed in Elasticsearch.</li>
<li><strong>Zero Idle Cost:</strong> When there are no events, Cloud Forwarder scales to zero - it has no fixed infrastructure cost. You only pay for the moment data is being processed, not for permanently over-provisioned servers sitting idle.</li>
</ul>
<p>Right before the train came to a standstill, I looked at the total cost for running Cloud Forwarder for the two minutes between Parsons Green and Putney Bridge - we forwarded about 120GB of telemetry and the total cost was below £0.10. Well, unless you count the £2.50 train ticket!</p>
<h2>Making Observability easy at any scale</h2>
<p>Getting started observing your infrastructure is hard, and once it is observable, deriving actionable value from it requires sifting and winnowing through massive volumes of telemetry data, sometimes millions of events per second.</p>
<p>The new EDOT Cloud Forwarder for AWS (also available in Preview for GCP and Azure) is easy to deploy, with just a Terraform template. To make sure that it was easy to get started with, we designed it to be as close to a <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws#quick-deployment-direct-link">single-click deployment</a> as possible:</p>
<p>Just click the link below to launch the CloudFormation stack in your AWS account:</p>
<p><a href="https://console.aws.amazon.com/cloudformation/home?%23/stacks/new?templateURL=https%3A%2F%2Fedot-cloud-forwarder.s3.amazonaws.com%2Fv1%2Flatest%2Fcloudformation%2Fs3_logs-cloudformation.yaml"><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/cloudformation-launch-stack.png" alt="Launch_stack" /></a></p>
<p>The best part? The fastest way to get started with Elastic Observability scales to any size workload! With EDOT Cloud Forwarder you have one solution which automatically scales down to zero and up to millions of events per second.</p>
<h2>So, what is EDOT Cloud Forwarder?</h2>
<p>EDOT Cloud Forwarder is a serverless OpenTelemetry Collector that, on AWS, runs as a Lambda function. In AWS, it is triggered by events and processes logs and metrics from services such as VPC Flow Logs, ELB Access Logs, CloudTrail, CloudWatch Logs and CloudWatch Metrics.</p>
<p>It has the following core capabilities:</p>
<ul>
<li>Collects observability and security data from Cloud Service Providers</li>
<li>Parses data into native OpenTelemetry format</li>
<li>Forwards data over OTLP</li>
<li>Scales up and down based on traffic</li>
</ul>
<p>ECF for AWS is a pure serverless solution: no VMs, containers, or Kubernetes control planes to manage.</p>
<h2>Off the train, a more controlled scenario</h2>
<p>For a more controlled testing scenario, we used synthetically generated VPC Flow Log data to show how easily EDOT Cloud Forwarder can sustain a million events per second, reliably and without data loss.</p>
<p>For the configuration, we left all EDOT Cloud Forwarder settings at their <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws#optional-settings">defaults</a>. In AWS, the Lambda max concurrency defaults to 5, which could be expected to handle up to around 50k EPS. For our scenario, we bumped this to 100 to ensure we'd have plenty of headroom for our test.</p>
<p>We ran the scenario in 10 minute stages, with each stage resulting in a larger data volume. We held ingest flat during the stage to provide short-term steady state windows for us to grab metrics.</p>
<p>We experienced no errors across all stages of the scenario, no retries outside expected behavior, and no data loss across 5.4 billion ingested events.</p>
<h2>Stats for Observability Nerds</h2>
<p>Because we know you love them:</p>
<h3>Incremental Load Stages</h3>
<p>We tested EDOT Cloud Forwarder using incremental load stages, gradually increasing traffic from approximately 300,000 events per second to over 1 million events per second.</p>
<p>The graph below shows the Elasticsearch ingestion rate throughout the entire test duration. You can see the clear progression as we ramped up through six distinct stages, with each plateau representing a 10 minute stabilization period.</p>
<p>The system handled each traffic increase smoothly, culminating in sustained 1 million documents per second ingestion with no bottlenecks or data loss.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/es-ingestion-rate.png" alt="ES Ingestion Rate" /></p>
<h3>Lambda</h3>
<p>As seen in CloudWatch metrics (no manual adjustments required), no errors and no throttles occurred during the full duration of the test.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-errors-throttles.png" alt="Lambda errors and throttles" /></p>
<h4>Concurrent executions</h4>
<p>60 to 65 instances running at the same time.
<img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-instances.png" alt="Lambda concurrent executions" /></p>
<h4>Average execution time per Lambda</h4>
<p>Each execution is taking about 5 seconds.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-execution-time.png" alt="Lambda average execution time" /></p>
<h4>Memory Usage</h4>
<p>Memory use stabilized at around 450 MB, below the default limit of 512 MB.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-memory-used.png" alt="Lambda memory usage" /></p>
<h3>Elasticsearch Indexing</h3>
<p>Elasticsearch indexed one million documents per second, with events visible in Discover within seconds and no indexing delays or bottlenecks.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/es-1m-eps.png" alt="1M EPS" /></p>
<h2>Efficient by design: Lambda at 1M EPS from S3</h2>
<p>Running ECF for AWS at 1M EPS costs about <strong>$3.87 per hour</strong>. Around 66 percent ($2.57 per hour) is data transfer (same region), 34 percent ($1.32 per hour) is Lambda compute, and less than 1 percent is S3 requests.</p>
<p>This is fully serverless with no idle cost. You only pay while events are forwarded. With S3, data arrives pre-batched in large objects, which keeps Lambda invocations low and compute costs tightly controlled. At sustained throughput, Lambda costs are comparable to EKS, but without cluster management or idle capacity.</p>
<h3>Other options: OTel Collector on EKS at 1M events per second</h3>
<p>An OTel Collector on EKS sized for 1M EPS has a baseline cost of about <strong>$3.69 per hour</strong>. Roughly $0.33 per hour of this is compute related: EC2 nodes, EKS control plane, and EBS. The rest comes from data transfer and SQS, which scale with traffic and do not change with utilization.</p>
<h4>Idle compute impact on EKS real costs</h4>
<p>Considering EKS is typically provisioned for peak load, the real cost of the compute portion is affected by idle capacity. At <strong>100 percent utilization</strong>, total cost is <strong>$3.69 per hour</strong>. At <strong>50 percent utilization</strong>, a common baseline to absorb burstiness, total cost rises to about <strong>$4.02 per hour</strong>. At <strong>30 percent utilization</strong>, it increases further to about <strong>$4.46 per hour</strong>.</p>
<h4>Pay for Work vs Pay for Capacity</h4>
<p>ECF for AWS delivers 1M EPS at a cost comparable to EKS at peak utilization, with no idle compute or capacity planning required. EKS can reach the same peak throughput, but total cost increases further as average utilization drops because compute capacity must be provisioned in advance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/otel-collector-lambda.png" alt="Collector-lambda" /></p>
<h2>Conclusion</h2>
<p>The boring truth: to the EDOT Cloud Forwarder, a million events per second is no different from any other workload.</p>
<p>With no infrastructure to deploy, no idle cost, and no manual scaling, I guess the best thing to do is to stop overthinking and start forwarding! It's like stepping onto the fastest train on the line: you just get on and you're instantly en route to your destination, effortlessly handling any distance, or in this case, any volume.</p>
<p>So, we're shipping it. ECF for AWS is now Generally Available.</p>
<h2>Get Started</h2>
<ol>
<li>Deploy EDOT Cloud Forwarder via CloudFormation (below) or using the AWS Serverless Application Repository</li>
</ol>
<p><a href="https://console.aws.amazon.com/cloudformation/home?%23/stacks/new?templateURL=https%3A%2F%2Fedot-cloud-forwarder.s3.amazonaws.com%2Fv1%2Flatest%2Fcloudformation%2Fs3_logs-cloudformation.yaml"><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/cloudformation-launch-stack.png" alt="Launch_stack" /></a></p>
<ol start="2">
<li>Create an Observability project using an <a href="https://cloud.elastic.co/login?redirectTo=%2Fhome">Elastic Cloud</a> free trial or deploy locally with start-local if you don't already have an Elastic project or deployment.</li>
</ol>
<p>Visit EDOT Cloud Forwarder for AWS <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws">Documentation</a> for more details.</p>
<p>Check out these other resources on OpenTelemetry at Elastic:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Discover how Elastic is evolving data ingestion with OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">Learn how OpAMP enables centralized configuration of OpenTelemetry SDKs</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/ecf-for-aws.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing, logs, and metrics for a RAG based Chatbot with Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry</link>
            <guid isPermaLink="false">openai-tracing-elastic-opentelemetry</guid>
            <pubDate>Fri, 24 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe an OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes and Docker.]]></description>
            <content:encoded><![CDATA[<p>As discussed in the following post, <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Elastic added instrumentation for OpenAI based applications in EDOT</a>. The application most commonly using LLMs is the chatbot. These chatbots not only use large language models (LLMs), but also use frameworks such as LangChain, plus search to improve contextual information during a conversation (Retrieval Augmented Generation, or RAG). Elastic's sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to properly pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch.</p>
<p>This app is also now instrumented with EDOT, and you can visualize the Chatbot's traces to OpenAI, as well as relevant logs and metrics from the application. By running the app with Docker as instructed in the GitHub repo, you can see these traces on a local stack. But how about running it against Serverless, Elastic Cloud, or even on Kubernetes?</p>
<p>In this blog, we will walk through how to set up Elastic's RAG based Chatbot application with Elastic Cloud and Kubernetes.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account — sign up now, and become familiar with Elastic's OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use version 8.17 or later.</p>
</li>
<li>
<p>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with the application and how to bring it up using Docker.</p>
</li>
<li>
<p>An account on OpenAI with API keys</p>
</li>
<li>
<p>Kubernetes cluster to run the RAG based Chatbot app</p>
</li>
<li>
<p>The instructions in this blog can also be found in <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples</a> on GitHub.</p>
</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the Chatbot app. Once it's up, you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will see a response based on the index that was created in Elasticsearch when the app initialized. Additionally, queries will be made to the LLM.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the application running on your K8s cluster or with Docker, and Elastic Cloud up and running, you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and can analyze the application logs and any specific log patterns, which saves you time in analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs-patterns.png" alt="Chatbot-log-patterns" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to Elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also drill into individual traces and look at the logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-trace.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs and traces, any instrumented metrics will also be ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up with Docker</h1>
<p>In order to properly set up the Chatbot-app on Docker with telemetry sent over to Elastic, a few things must be set up:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app</p>
</li>
<li>
<p>Modify the env file as noted in the GitHub README, with the following exception:</p>
</li>
</ol>
<p>Use your Elastic Cloud's <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> values instead.</p>
<p>You can find these in Elastic Cloud under <code>integrations-&gt;APM</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/otel-credentials.png" alt="OTel credentials" /></p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>Notice the <code>%20</code> in the headers. This is needed to account for the space in the credentials.</p>
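<p>For example, you can build the encoded header value from a raw token in the shell (the token here is a placeholder; the real value comes from your Elastic Cloud APM settings):</p>

```bash
# Hypothetical raw token value.
RAW_TOKEN="xxxxx"
# The OTLP header value is "Bearer <token>"; the space is percent-encoded as %20
# so it survives being read from an env file.
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20${RAW_TOKEN}"
echo "$OTEL_EXPORTER_OTLP_HEADERS"
# prints Authorization=Bearer%20xxxxx
```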
<ol start="3">
<li>
<p>Set the following to false - <code>OTEL_SDK_DISABLED=false</code></p>
</li>
<li>
<p>Set the envs for LLMs</p>
</li>
</ol>
<p>In this example we're using OpenAI, hence only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<ol start="5">
<li>Run the docker container as noted</li>
</ol>
<pre><code class="language-bash">docker compose up --build --force-recreate
</code></pre>
<ol start="6">
<li>
<p>Play with the app at <code>localhost:4000</code></p>
</li>
<li>
<p>Then log into Elastic Cloud and see the output as shown previously.</p>
</li>
</ol>
<h1>Run chatbot-rag-app on Kubernetes</h1>
<p>To set this up, you can follow the observability-examples repo, which has the Kubernetes yaml files being used. These also point to Elastic Cloud.</p>
<ol>
<li>
<p>Set up the Kubernetes Cluster (we're using EKS)</p>
</li>
<li>
<p>Get the appropriate ENV variables:</p>
</li>
</ol>
<ul>
<li>
<p>Find the <code>OTEL_EXPORTER_OTLP_ENDPOINT/HEADERS</code> variables as noted in the previous section for Docker.</p>
</li>
<li>
<p>Get your OpenAI Key</p>
</li>
<li>
<p>Your Elasticsearch URL, username, and password.</p>
</li>
</ul>
<ol start="3">
<li>Follow the instructions in the <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples GitHub repo</a> to run two Kubernetes yaml files.</li>
</ol>
<p>Essentially you need only replace the secret variables in k8s-deployment.yaml, and run</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
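<p>One way to replace those secret values without hand-editing is a quick sed substitution pass. This sketch operates on a two-line stand-in for the manifest; in practice the input is your copy of k8s-deployment.yaml, and the replacement endpoint and key below are placeholders:</p>

```bash
# Stand-in for k8s-deployment.yaml; only two of the secret stringData lines shown.
cat > deployment-sample.yaml <<'EOF'
  ELASTICSEARCH_URL: "https://yourelasticcloud.es.us-west-2.aws.found.io"
  OPENAI_API_KEY: "YYYYYYYY"
EOF
# Substitute the placeholders (the replacement values here are fake).
sed -i.bak \
  -e 's|yourelasticcloud.es.us-west-2.aws.found.io|my-deployment.es.us-west-2.aws.found.io|' \
  -e 's|YYYYYYYY|sk-replace-me|' \
  deployment-sample.yaml
cat deployment-sample.yaml
```

<p>After substituting into the real file, apply it with <code>kubectl create -f k8s-deployment.yaml</code> as shown above.</p>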
<p>The app needs to be running first, then we use the app to initialize Elasticsearch with indices for the app.</p>
<p><strong><em>init-index-job.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4
</code></pre>
<p><strong><em>k8s-deployment.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: chatbot-regular-secrets
type: Opaque
stringData:
  ELASTICSEARCH_URL: &quot;https://yourelasticcloud.es.us-west-2.aws.found.io&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://12345.apm.us-west-2.aws.cloud.es.io:443&quot;
  OPENAI_API_KEY: &quot;YYYYYYYY&quot;

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-regular
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatbot-regular
  template:
    metadata:
      labels:
        app: chatbot-regular
    spec:
      containers:
      - name: chatbot-regular
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=chatbot-regular,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
          value: &quot;true&quot;
        - name: OTEL_EXPERIMENTAL_RESOURCE_DETECTORS
          value: &quot;process_runtime,os,otel,telemetry_distro&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        - name: OTEL_METRIC_EXPORT_INTERVAL
          value: &quot;3000&quot;
        - name: OTEL_BSP_SCHEDULE_DELAY
          value: &quot;3000&quot;
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: chatbot-regular-service
spec:
  selector:
    app: chatbot-regular
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer
</code></pre>
<p><strong>Open App with LoadBalancer URL</strong></p>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP                                                               PORT(S)                                                                     AGE
chatbot-regular-service            LoadBalancer   10.100.130.44    xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP                                                                6d23h
</code></pre>
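<p>The EXTERNAL-IP column holds the app URL. As a small sketch, you can pull it out of that output with awk (the sample line below mirrors the output above):</p>

```bash
# The fourth whitespace-separated field of the service line is the load
# balancer hostname.
line='chatbot-regular-service            LoadBalancer   10.100.130.44    xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h'
echo "$line" | awk '{print $4}'
# prints xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com
```

<p>Against a live cluster, <code>kubectl get service chatbot-regular-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'</code> returns the same value directly.</p>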
<ol start="4">
<li>
<p>Play with app and review telemetry in Elastic</p>
</li>
<li>
<p>Once you go to the URL, you should see all the screens we described earlier in this blog.</p>
</li>
</ol>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven, RAG-based chat application. However, you still need to understand how well it performs, whether it's working properly, etc. Using OTel and Elastic’s EDOT gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.
Here are the other tracing blogs:</p>
<p>App Observability with LLMs (Tracing):</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/edot-openai-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing a RAG based Chatbot with Elastic Distributions of OpenTelemetry and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-langtrace-elastic</link>
            <guid isPermaLink="false">openai-tracing-langtrace-elastic</guid>
            <pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe an OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes with Langtrace.]]></description>
            <content:encoded><![CDATA[<p>Most AI-driven applications currently focus on increasing the value that an end user, such as an SRE, gets from AI. The main use case is the creation of various chatbots. These chatbots not only use large language models (LLMs), but also use frameworks such as LangChain, plus search to improve contextual information during a conversation (Retrieval Augmented Generation). Elastic’s sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to properly pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch. However, what about monitoring the application?</p>
<p>Elastic provides the ability to ingest OpenTelemetry data with native OTel SDKs, the off-the-shelf OTel Collector, or Elastic’s Distributions of OpenTelemetry (EDOT). EDOT enables you to bring in logs, metrics, and traces for your GenAI application and for K8s. However, you will also generally need libraries to help trace specific components of your application. For tracing GenAI applications, you can pick from a large set of libraries:</p>
<ul>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2">OpenTelemetry OpenAI Instrumentation-v2</a> - allows tracing LLM requests and logging of messages made by the OpenAI Python API library. (Note: the v2 library is built by OpenTelemetry; the non-v2 version is from a specific vendor, not OpenTelemetry.)</p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-vertexai">OpenTelemetry VertexAI Instrumentation</a> - allows tracing LLM requests and logging of messages made by the VertexAI Python API library</p>
</li>
<li>
<p><a href="https://docs.langtrace.ai/introduction">Langtrace</a> - a commercially available library that supports multiple LLMs in one library, with all traces being OTel native.</p>
</li>
<li>
<p>Elastic’s EDOT - which recently added tracing. See <a href="https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry">blog</a>.</p>
</li>
</ul>
<p>As you can see, OpenTelemetry is becoming the de facto mechanism for collecting and ingesting this telemetry. OpenTelemetry is growing its support here, but it is still early days.</p>
<p>In this blog, we will walk through how to, with minimal code, observe a RAG based chatbot application with tracing using Langtrace. We previously covered Langtrace in a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">blog</a> to highlight tracing Langchain.</p>
<p>In this blog we use Langtrace, which supports OpenAI, Amazon Bedrock, Cohere, and others in one library.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>An Elastic Cloud account — sign up now, and become familiar with Elastic’s OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use version 8.17 or later.</li>
</ul>
<ul>
<li>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with how to bring it up.</li>
</ul>
<ul>
<li>An account on your favorite LLM (OpenAI, Azure OpenAI, etc.), with API keys</li>
</ul>
<ul>
<li>Be familiar with EDOT to understand how we bring in logs, metrics, and traces from the application through the OTel Collector</li>
</ul>
<ul>
<li>Kubernetes cluster - I’ll be using Amazon EKS</li>
</ul>
<ul>
<li>Review the <a href="https://docs.langtrace.ai/introduction">Langtrace</a> documentation as well.</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the Chatbot app. Once it's up, you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will see a response based on the index that was created in Elasticsearch when the app initialized. Additionally, queries will be made to the LLM.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the OTel Collector with the EDOT configuration on your K8s cluster, and Elastic Cloud up and running, you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app, and can analyze the application logs, spot specific log patterns (which saves you time in analysis), and view logs from K8s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-log-patterns.png" alt="Chatbot-log-patterns" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs-detailed.png" alt="Chatbot-log-details" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to Elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also drill into individual traces and look at the logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-service-traces.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs and traces, any instrumented metrics will also be ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up</h1>
<p>In order to properly set up the Chatbot-app on K8s with telemetry sent over to Elastic, a few things must be set up:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app, and modify one of the python files.</p>
</li>
<li>
<p>Next, create a Docker container that can be used in Kubernetes. The Docker build <a href="https://github.com/elastic/elasticsearch-labs/blob/main/example-apps/chatbot-rag-app/Dockerfile">here</a> in the Chatbot-app is good to use.</p>
</li>
<li>
<p>Collect all the needed env variables. In this example we are using OpenAI, but the files can be modified for any of the LLMs; you will have to get a few environment variables loaded into the cluster. In the GitHub repo there is an env.example for Docker. You can pick and choose what is or isn't needed and adjust appropriately in the K8s file below.</p>
</li>
<li>
<p>Set up your K8s cluster, and then install the OpenTelemetry Collector with the appropriate yaml file and credentials. This will also help collect K8s cluster logs and metrics.</p>
</li>
<li>
<p>Utilize the two yaml files listed below to ensure you can run it on Kubernetes.</p>
</li>
</ol>
<ul>
<li>
<p>init-index-job.yaml - initializes the index in Elasticsearch with the local corporate information</p>
</li>
<li>
<p>k8s-deployment-chatbot-rag-app.yaml - initializes the application frontend and backend.</p>
</li>
</ul>
<ol start="6">
<li>
<p>Open the app on the load balancer URL against the chatbot-app service in K8s</p>
</li>
<li>
<p>Go to Elasticsearch and look at Discover for the logs, go to APM and review the traces for your chatbot-app, and finally check the metrics.</p>
</li>
</ol>
<h2>Modify the code for tracing with Langtrace</h2>
<p>Curl the app and untar it, then go to the chatbot-rag-app directory:</p>
<pre><code class="language-bash">curl https://codeload.github.com/elastic/elasticsearch-labs/tar.gz/main | 
tar -xz --strip=2 elasticsearch-labs-main/example-apps/chatbot-rag-app
cd elasticsearch-labs-main/example-apps/chatbot-rag-app
</code></pre>
<p>Next open the <code>app.py</code> file in the <code>api</code> directory and add the following</p>
<pre><code class="language-python">from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

FlaskInstrumentor().instrument_app(app)
</code></pre>
<p>into the code:</p>
<pre><code class="language-python">import os
import sys
from uuid import uuid4

from chat import ask_question
from flask import Flask, Response, jsonify, request
from flask_cors import CORS

from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

app = Flask(__name__, static_folder=&quot;../frontend/build&quot;, static_url_path=&quot;/&quot;)
CORS(app)

FlaskInstrumentor().instrument_app(app)

@app.route(&quot;/&quot;)
</code></pre>
<p>The added lines pull in the Langtrace library and the OpenTelemetry Flask instrumentation. This combination will provide an end-to-end trace from the HTTP call all the way down to the calls to Elasticsearch and to OpenAI (or other LLMs).</p>
<h2>Create the docker container</h2>
<p>Use the Dockerfile that is in the chatbot-rag-app directory, adding the following line:</p>
<p><code>RUN pip3 install --no-cache-dir langtrace-python-sdk</code></p>
<p>into the Dockerfile:</p>
<pre><code class="language-bash">COPY requirements.txt ./requirements.txt
RUN pip3 install -r ./requirements.txt
RUN pip3 install --no-cache-dir langtrace-python-sdk
COPY api ./api
COPY data ./data

EXPOSE 4000
</code></pre>
<p>This installs the <code>langtrace-python-sdk</code> into the Docker container so the Langtrace libraries can be used properly.</p>
<h2>Collecting the proper env variables:</h2>
<p>First collect the env variables from Elastic:</p>
<p>Envs for index initialization in Elastic:</p>
<pre><code class="language-bash">
ELASTICSEARCH_URL=https://aws.us-west-2.aws.found.io
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=elastic

# The name of the Elasticsearch indexes
ES_INDEX=workplace-app-docs
ES_INDEX_CHAT_HISTORY=workplace-app-docs-chat-history

</code></pre>
<p>The <code>ELASTICSEARCH_URL</code> can be found in cloud.elastic.co when you bring up your instance.
You will need to set up the user and password in Elastic.</p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer xxxxx&quot;
</code></pre>
<p>These credentials are found in Elastic under the APM integration, under OpenTelemetry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/otel-credentials.png" alt="OTel credentials" /></p>
<p>Envs for LLMs</p>
<p>In this example we’re using OpenAI, hence only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<p>All of these variables will be needed in the Kubernetes yamls in the next step.</p>
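<p>As an alternative sketch, these values can also be loaded into a Kubernetes Secret straight from an env file with <code>kubectl create secret generic --from-env-file</code>, avoiding pasting them into YAML; the filename and values below are placeholders:</p>

```bash
# Placeholder env file; real values come from the steps above.
cat > chatbot.env <<'EOF'
LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
EOF
# Against a live cluster you would run:
#   kubectl create secret generic genai-chatbot-langtrace-secrets --from-env-file=chatbot.env
# Here we just list the key=value pairs that command would load:
grep -v '^#' chatbot.env
```

<p>If you go this route, skip the Secret stanza in the manifest below and keep only the Deployment and Service.</p>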
<h2>Setup K8s cluster and load up OTel Collector with EDOT</h2>
<p>This step is outlined in the following <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a>. It’s a simple three-step process.</p>
<p>This step will bring in all the K8s cluster logs and metrics and set up the OTel Collector.</p>
<h2>Setup secrets, initialize indices, and start the app</h2>
<p>Now that the cluster is up and you have your environment variables, you will need to</p>
<ol>
<li>
<p>Install and run the <code>k8s-deployment.yaml</code> with the variables</p>
</li>
<li>
<p>Initialize the index</p>
</li>
</ol>
<p>Essentially run the following:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>Here are the two yamls you should use. They can also be found <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">here</a>.</p>
<p>k8s-deployment.yaml</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: genai-chatbot-langtrace-secrets
type: Opaque
stringData:
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://1234567.apm.us-west-2.aws.cloud.es.io:443&quot;
  ELASTICSEARCH_URL: &quot;YOUR_ELASTIC_SEARCH_URL&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OPENAI_API_KEY: &quot;XXXXXXX&quot;  

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-chatbot-langtrace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: genai-chatbot-langtrace
  template:
    metadata:
      labels:
        app: genai-chatbot-langtrace
    spec:
      containers:
      - name: genai-chatbot-langtrace
        image: 65765.amazonaws.com/genai-chatbot-langtrace2:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=genai-chatbot-langtrace,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: genai-chatbot-langtrace-service
spec:
  selector:
    app: genai-chatbot-langtrace
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer

</code></pre>
<p>init-index-job.yaml</p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
#update your image location for chatbot rag app
        image: your-image-location:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4

</code></pre>
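<p>If you apply this job manually, you can confirm that the index initialization completed before opening the app. As a rough sketch (the file name is an assumption based on the manifest above):</p>
<pre><code class="language-bash"># Create the one-off index initialization job
kubectl apply -f init-index-job.yaml

# Wait for it to finish (give up after 2 minutes)
kubectl wait --for=condition=complete job/init-elasticsearch-index-test --timeout=120s

# Inspect the job output if anything looks off
kubectl logs job/init-elasticsearch-index-test
</code></pre>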
<h2>Open App with LoadBalancer URL</h2>
<p>Run the kubectl get services command and note the external URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                        TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
chatbot-langtrace-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h

</code></pre>
<p>Play with the app and review the telemetry in Elastic.</p>
<p>Once you go to the URL, you should see all the screens we described at the beginning of this blog.</p>
<h1>Conclusion</h1>
<p>With Elastic's chatbot-rag-app, you have an example of how to build an OpenAI-driven, RAG-based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel, Elastic’s EDOT, and Langtrace gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully, this blog provides an outline of how to achieve that.</p>
<p>Here are other related blogs:</p>
<p>App Observability with LLMs (Tracing):</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/edot-openai-tracing.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator</link>
            <guid isPermaLink="false">openshift-container-logs-red-hat-logging-operator</guid>
            <pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to optimize OpenShift logs collected with Red Hat OpenShift Logging Operator, as well as format and route them efficiently in Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a>) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.</p>
<h2>Why use OpenShift Logging Operator?</h2>
<p>A lot of enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:</p>
<ul>
<li>
<p>It is developed and supported by Red Hat</p>
</li>
<li>
<p>It can automatically update the OpenShift cluster along with the operating system to make sure they remain compatible</p>
</li>
<li>
<p>It can speed up development life cycles with features like source-to-image</p>
</li>
<li>
<p>It provides enhanced security</p>
</li>
</ul>
<p>In our consulting experience, this latter aspect poses challenges and friction with OpenShift administrators when we try to install Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the host's file system to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html#_red_hat_openshift_configuration">official Elasticsearch® Documentation</a>). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.</p>
<h2>Which logs are we going to collect?</h2>
<p>In OpenShift Container Platform, we distinguish <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging.html#logging-architecture-overview_cluster-logging">three broad categories of logs</a>: audit, application, and infrastructure logs:</p>
<ul>
<li>
<p><strong>Audit logs</strong> describe the list of activities that affected the system by users, administrators, and other components.</p>
</li>
<li>
<p><strong>Application logs</strong> are composed of the container logs of the pods running in non-reserved namespaces.</p>
</li>
<li>
<p><strong>Infrastructure logs</strong> are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.</p>
</li>
</ul>
<p>For the sake of simplicity, we will consider only audit and application logs. We will describe how to format both in the format expected by the Kubernetes integration so you can get the most out of Elastic Observability.</p>
<h2>Getting started</h2>
<p>To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.</p>
<h3>Inside Elasticsearch</h3>
<p>We first <a href="https://www.elastic.co/guide/en/fleet/8.11/install-uninstall-integration-assets.html#install-integration-assets">install the Kubernetes integration assets</a>. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.</p>
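<p>Before wiring up the pipelines, you can sanity-check that the integration assets are in place by querying the index templates and ingest pipelines (the integration pipeline names are versioned, so a wildcard is used here):</p>
<pre><code class="language-bash">GET _index_template/logs-kubernetes.container_logs
GET _index_template/logs-kubernetes.audit_logs

# Integration pipeline names include the version, e.g. logs-kubernetes.container_logs-X.Y.Z
GET _ingest/pipeline/logs-kubernetes.container_logs*
</code></pre>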
<p>To format the logs received from the ClusterLogForwarder in <a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging-exported-fields.html">Exported fields | Logging | OpenShift Container Platform 4.14</a>. To get a list of exported fields of the Kubernetes integration, you can refer to <a href="https://www.elastic.co/guide/en/beats/filebeat/current/exported-fields-kubernetes-processor.html">Kubernetes fields | Filebeat Reference [8.11] | Elastic</a> and <a href="https://www.elastic.co/guide/en/observability/current/logs-app-fields.html">Logs app fields | Elastic Observability [8.11]</a>. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_ip&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.ip&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace_uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_id&quot;,
        &quot;target_field&quot;: &quot;container.id&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;container.id&quot;,
        &quot;pattern&quot;: &quot;%{container.runtime}://%{container.id}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_image&quot;,
        &quot;target_field&quot;: &quot;container.image.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.container.image&quot;,
        &quot;copy_from&quot;: &quot;container.image.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;kubernetes.container_name&quot;,
        &quot;field&quot;: &quot;container.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.container.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;level&quot;,
        &quot;target_field&quot;: &quot;log.level&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;file&quot;,
        &quot;target_field&quot;: &quot;log.file.path&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_owner&quot;,
        &quot;pattern&quot;: &quot;%{_tmp.parent_type}/%{_tmp.parent_name}&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;lowercase&quot;: {
        &quot;field&quot;: &quot;_tmp.parent_type&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod.{{_tmp.parent_type}}.name&quot;,
        &quot;value&quot;: &quot;{{_tmp.parent_name}}&quot;,
        &quot;if&quot;: &quot;ctx?._tmp?.parent_type != null&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_tmp&quot;,
          &quot;kubernetes.pod_owner&quot;
          ],
          &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes annotations&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes namespace_labels&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.namespace_labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith(&quot;app_kubernetes_io_component_&quot;)) {
            def sanitizedKey = k.replace(&quot;app_kubernetes_io_component_&quot;, &quot;app_kubernetes_io_component/&quot;);
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    }
    ]
}
</code></pre>
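<p>You can try the pipeline with the _simulate API before wiring it up. The sample document below is purely illustrative, with field names taken from the OpenShift exported fields:</p>
<pre><code class="language-bash">POST _ingest/pipeline/openshift-2-ecs/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;hostname&quot;: &quot;worker-0&quot;,
        &quot;level&quot;: &quot;info&quot;,
        &quot;kubernetes&quot;: {
          &quot;pod_name&quot;: &quot;my-app-5d9f&quot;,
          &quot;namespace_name&quot;: &quot;my-namespace&quot;,
          &quot;container_name&quot;: &quot;my-app&quot;,
          &quot;container_id&quot;: &quot;cri-o://0123abcd&quot;,
          &quot;pod_owner&quot;: &quot;ReplicaSet/my-app-5d9f&quot;
        }
      }
    }
  ]
}
</code></pre>
<p>The response should show the renamed ECS fields, for example kubernetes.pod.name, kubernetes.namespace, container.id (with container.runtime set to cri-o), and kubernetes.pod.replicaset.name.</p>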
<p>Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-audit-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot;
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 &amp;&amp; !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=[&quot;audit&quot;:audit];
        &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Move all top-level audit fields under the 'kubernetes.audit' object&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.kubernetes?.audit?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(&quot;.&quot;) &gt;= 0) {
              def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Normalize kubernetes audit annotations field as expected by the Integration&quot;
      }
    }
  ]
}
</code></pre>
<p>The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.</p>
<p>We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in <a href="https://www.elastic.co/guide/en/fleet/8.11/data-streams-pipeline-tutorial.html#data-streams-pipeline-one">Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11]</a>.</p>
<p>The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog <a href="https://www.elastic.co/blog/simplifying-log-data-management-flexible-routing-elastic">Simplifying log data management: Harness the power of flexible routing with Elastic</a> and our documentation <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">Reroute processor | Elasticsearch Guide [8.11] | Elastic</a>.</p>
<p>In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/app-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.container_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.container_logs-openshift&quot;
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-audit-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.audit_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.audit_logs-openshift&quot;
      }
    }
  ]
}
</code></pre>
<p>Note that because app-write and audit-write do not follow the data stream naming convention, we must set the destination field explicitly in the reroute processor. The reroute processor will also fill in the <a href="https://www.elastic.co/guide/en/ecs/8.11/ecs-data_stream.html">data_stream fields</a> for us. Elastic Agent performs this step automatically at the source.</p>
<p>Further, we create the app-write and audit-write indices with the default pipelines defined above so that incoming logs are rerouted according to our needs.</p>
<pre><code class="language-bash">PUT app-write
{
  &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;app-write-reroute-pipeline&quot;
   }
}


PUT audit-write
{
  &quot;settings&quot;: {
    &quot;index.default_pipeline&quot;: &quot;audit-write-reroute-pipeline&quot;
  }
}
</code></pre>
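<p>As a quick smoke test (assuming the integration assets and the pipelines above are installed), you can index a minimal document into app-write and check that it was rerouted to the data stream:</p>
<pre><code class="language-bash">POST app-write/_doc
{
  &quot;@timestamp&quot;: &quot;2024-01-16T10:00:00.000Z&quot;,
  &quot;message&quot;: &quot;reroute smoke test&quot;,
  &quot;hostname&quot;: &quot;worker-0&quot;
}

# The document should land in the rerouted data stream, not in app-write
GET logs-kubernetes.container_logs-openshift/_search?q=message:reroute
</code></pre>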
<p>Basically, what we did can be summarized in this picture:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog.png" alt="openshift-summary-blog" /></p>
<p>Take the container logs as an example. When the operator attempts to write to the app-write index, it invokes the default_pipeline “app-write-reroute-pipeline”, which formats the logs into ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which invokes, if it exists, the logs-kubernetes.container_logs@custom pipeline. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another dataset and namespace using the elastic.co/dataset and elastic.co/namespace annotations as described in the Kubernetes <a href="https://docs.elastic.co/integrations/kubernetes/container-logs#rerouting-based-on-pod-annotations">integration documentation</a>, which in turn can lead to the execution of another integration pipeline.</p>
<h3>Create a user for sending the logs</h3>
<p>We are going to use basic authentication because, at the time of writing, it is the only authentication method supported for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write and read the app-write and audit-write indices (required by the OpenShift collector) and grants auto_configure access to logs-*-* to allow custom Kubernetes rerouting:</p>
<pre><code class="language-bash">PUT _security/role/YOURROLE
{
    &quot;cluster&quot;: [
      &quot;monitor&quot;
    ],
    &quot;indices&quot;: [
      {
        &quot;names&quot;: [
          &quot;logs-*-*&quot;
        ],
        &quot;privileges&quot;: [
          &quot;auto_configure&quot;,
          &quot;create_doc&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      },
      {
        &quot;names&quot;: [
          &quot;app-write&quot;,
          &quot;audit-write&quot;
        ],
        &quot;privileges&quot;: [
          &quot;create_doc&quot;,
          &quot;read&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      }
    ],
    &quot;applications&quot;: [],
    &quot;run_as&quot;: [],
    &quot;metadata&quot;: {},
    &quot;transient_metadata&quot;: {
      &quot;enabled&quot;: true
    }

}



PUT _security/user/YOUR_USERNAME
{
  &quot;password&quot;: &quot;YOUR_PASSWORD&quot;,
  &quot;roles&quot;: [&quot;YOURROLE&quot;]
}
</code></pre>
<h3>On OpenShift</h3>
<p>On the OpenShift Cluster, we need to follow the <a href="https://docs.openshift.com/container-platform/4.14/logging/log_collection_forwarding/log-forwarding.html">official documentation</a> of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.</p>
<p>We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:</p>
<pre><code class="language-yaml">apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}
</code></pre>
<p>The Cluster Log Forwarder is the resource responsible for defining a daemon set that will forward the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials for the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
</code></pre>
<p>Finally, we create the ClusterLogForwarder resource:</p>
<pre><code class="language-yaml">kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: &quot;https://YOUR_ELASTICSEARCH_URL:443&quot;
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch
</code></pre>
<p>Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder would send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.</p>
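<p>After applying both resources, you can verify that the collector pods are running and shipping logs. A sketch (the manifest file name is an assumption, and collector pod labels may vary across logging versions):</p>
<pre><code class="language-bash">oc apply -f cluster-log-forwarder.yaml

# The operator deploys the collector as a daemon set in openshift-logging
oc get pods -n openshift-logging

# Tail the collector pods to check for connection errors to Elasticsearch
oc logs -n openshift-logging -l component=collector --tail=50
</code></pre>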
<h2>Result</h2>
<p>Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as host and cloud metadata, which do not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog-graphs.png" alt="openshift-summary-blog-graphs" /></p>
<p>In this post, we described how you can use the OpenShift Logging Operator to collect container logs and audit logs. We still recommend leveraging Elastic Agent to collect all your logs: it offers the best user experience, with no need to maintain or transform the logs to ECS format yourself. Additionally, Elastic Agent uses API keys as the authentication method and collects metadata like cloud information that allows you in the long run to do <a href="https://www.elastic.co/blog/optimize-cloud-resources-cost-apm-metadata-elastic-observability">more</a>.</p>
<p><a href="https://www.elastic.co/observability/log-monitoring">Learn more about log monitoring with the Elastic Stack</a>.</p>
<p><em>Have feedback on this blog?</em> <a href="https://github.com/herrBez/elastic-blog-openshift-logging/issues"><em>Share it here</em></a><em>.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/139687_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry for PHP: EDOT PHP joins the OpenTelemetry project]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot</link>
            <guid isPermaLink="false">opentelemetry-accepts-elastics-donation-of-edot</guid>
            <pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore Elastic’s donation of its EDOT PHP to the OpenTelemetry community and discover how it makes OpenTelemetry for PHP simpler and more accessible.]]></description>
            <content:encoded><![CDATA[<p>The OpenTelemetry community has officially accepted Elastic's proposal to contribute the <strong>Elastic Distribution of OpenTelemetry for PHP (EDOT PHP)</strong> — marking an important milestone in bringing first-class observability to one of the web's most widely used languages.</p>
<p>For decades, PHP has powered everything from small business websites to large-scale SaaS platforms. Yet observability in PHP has often required manual setup, compilers, custom extensions, or changes to application code — challenges that limited adoption in production environments.
This upcoming donation aims to change that, by making OpenTelemetry for PHP <strong>as easy to deploy as any other runtime</strong>.</p>
<h2>What's coming</h2>
<p>Once the contribution process is complete, EDOT PHP will become part of the OpenTelemetry project — providing a <strong>complete, production-ready distribution</strong> that's optimized for performance, simplicity, and scalability.</p>
<p>EDOT PHP introduces a new approach to PHP observability:</p>
<ul>
<li><strong>Simple installation</strong> - installing OpenTelemetry for PHP will be as straightforward as installing a standard system package. From that point, the agent automatically detects and instruments PHP applications — no code changes, no manual setup.</li>
<li><strong>Automatic agent loading</strong> - works transparently in cloud and container environments without modifying application deployments.</li>
<li><strong>Zero configuration</strong> - ships as a single, self-contained binary; no need to install or compile any external extensions.</li>
<li><strong>Native C++ performance</strong> - a built-in serializer written in C++ reduces telemetry overhead by up to <strong>5×</strong>.</li>
<li><strong>Automatic instrumentation</strong> - instruments popular frameworks and libraries out of the box.</li>
<li><strong>Inferred spans</strong> - reveals the behavior of even uninstrumented code paths, providing full trace coverage.</li>
<li><strong>Automatic root spans</strong> - ensures complete traces, even in legacy or partially instrumented applications.</li>
<li><strong>OpAMP readiness</strong> - while the OpenTelemetry community continues to standardize configuration schemas and management workflows, the implementation in EDOT PHP is fully prepared to support these upcoming specifications — ensuring seamless adoption once the OpAMP ecosystem matures.</li>
<li><strong>Asynchronous backend communication</strong> - telemetry data is exported to the OpenTelemetry Collector or backend <strong>asynchronously</strong>, without blocking the instrumented application.
This ensures that span and metric exports do not add latency to user requests or impact response times, even under heavy load.</li>
</ul>
<p>Together, these features make EDOT PHP the first truly <strong>zero-effort observability solution for PHP</strong> — from local testing to cloud-scale production systems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-accepts-elastics-donation-of-edot/performance.png" alt="Performance comparison" /></p>
<blockquote>
<p>The native C++ serializer and asynchronous export pipeline in EDOT PHP reduce average request time from <strong>49 ms</strong> to <strong>23 ms</strong>, more than <strong>2× faster</strong> than the pure PHP implementation.</p>
</blockquote>
<h2>Building on the existing foundation</h2>
<p>EDOT PHP doesn't replace the existing OpenTelemetry PHP SDK — it <strong>extends and strengthens it</strong>.
It packages the SDK, automatic instrumentation, and native extension into a single, unified agent package that works seamlessly with existing OpenTelemetry specifications and APIs.</p>
<p>By contributing this work, Elastic helps the OpenTelemetry community accelerate PHP adoption, align implementations across languages, and make distributed tracing truly universal.</p>
<blockquote>
<p>“This isn't a hand-off — it's a collaboration.
We're contributing years of development to help OpenTelemetry for PHP evolve faster, run more efficiently, and reach more users in every environment.”</p>
<ul>
<li><em>Elastic Observability team</em></li>
</ul>
</blockquote>
<h2>Ongoing improvements</h2>
<p>Elastic continues to invest in advancing EDOT PHP ahead of its integration into OpenTelemetry.
The team is currently focused on <strong>reducing resource usage and memory footprint</strong>, particularly in <strong>multi-worker server environments</strong> such as PHP-FPM or Apache prefork.
These optimizations aim to make the agent more predictable and efficient under heavy load — ensuring that telemetry remains lightweight even in large-scale production deployments.</p>
<p>Beyond that, we're exploring further improvements that can enhance both performance and interoperability.
Areas under investigation include smarter coordination in high-concurrency scenarios, better sharing of telemetry resources across workers, and future alignment with additional OpenTelemetry signals such as metrics and logs.</p>
<p>Together, these efforts will help make EDOT PHP not only faster, but also more adaptable and seamlessly integrated into diverse runtime architectures.</p>
<h2>Why it matters</h2>
<p>This contribution is about more than performance — it's about <strong>removing barriers</strong>.
By making OpenTelemetry for PHP installable as a simple system package and automatically loaded into running applications, the project opens observability to every PHP developer, operator, and platform provider.</p>
<p>For the OpenTelemetry ecosystem, it fills one of the last major language gaps, extending visibility to a vast portion of the internet — all under open governance and community collaboration.</p>
<h2>Looking ahead</h2>
<p>In the months ahead, Elastic and the OpenTelemetry PHP SIG will work closely on the technical integration, documentation, and community onboarding process.
Once the transition is complete, developers will gain a fully open, community-driven, and production-ready OpenTelemetry agent that “just works” — without friction, configuration, or code changes.</p>
<p>Together, we're building a future where <strong>observability just works — for every language, every framework, and every environment</strong>.</p>
<p>For more information:</p>
<p><a href="https://www.elastic.co/docs/reference/opentelemetry">EDOT documentation</a><br />
<a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">Learn about the Elastic managed OTLP endpoint</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-accepts-elastics-donation-of-edot/otel-php.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Composing OpenTelemetry Reference Architectures]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures</link>
            <guid isPermaLink="false">opentelemetry-collector-reference-architectures</guid>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A conceptual framework for reasoning about OpenTelemetry Collection architectures — edge, processing, and resilience layers that compose into the right pipeline for your environment.]]></description>
            <content:encoded><![CDATA[<p>Most OpenTelemetry tutorials end at the same place: an application instrumented with the SDK, exporting traces to a single collector, forwarding to a backend. It works. Then production happens.</p>
<p>Traffic grows. Teams want metrics derived from traces. The backend goes down for maintenance and you lose an hour of telemetry. A compliance requirement means PII must be stripped before data leaves the cluster. Suddenly, that single collector isn't enough — and the question becomes: what should the architecture actually look like?</p>
<p>The OpenTelemetry Collector is designed to be composed. It can run in multiple deployment modes, be chained into pipelines, and scaled independently at each stage. But the documentation describes individual components, not how to think about assembling them. That thinking is what this article is about.</p>
<p>What follows is a conceptual framework for reasoning about collector architectures — not a set of rigid templates. The building blocks described here are reference points. In practice, they combine, overlap, and adapt to your constraints. A tail sampling tier might also need Kafka-backed resilience. A gateway might absorb the role of a sampling tier at low volumes. The goal is to understand the concepts well enough to compose the right architecture for your situation, not to pick a pre-built one off a shelf.</p>
<h2>Three conceptual layers</h2>
<p>It helps to think about collector architectures in three layers: <strong>edge</strong>, <strong>processing</strong>, and <strong>resilience</strong>. These aren't physical tiers that must exist as separate deployments — they're categories of concern. A single collector can address multiple layers. A complex deployment might have several components within one layer. The layers are a thinking tool, not a deployment diagram.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/three-layers.png" alt="The three conceptual layers: Edge, Processing, and Resilience" /></p>
<h3>Edge: how telemetry enters the pipeline</h3>
<p>The edge layer is about the first hop — how telemetry gets from your applications and infrastructure into the pipeline. At this stage, the collector gathers data in two fundamentally different ways. <strong>Pull-based receivers</strong> like <code>filelog</code> and <code>hostmetrics</code> actively reach out to collect data — tailing log files on disk or scraping system-level metrics from the host. <strong>Push-based receivers</strong> like <code>otlp</code> listen for data sent to them — applications instrumented with OpenTelemetry SDKs export traces, metrics, and logs directly to the collector's OTLP endpoint. A single edge collector typically runs both: pull receivers for infrastructure telemetry the application doesn't know about, and push receivers for application telemetry the SDK produces. There are several common deployment patterns, and the right one depends on your environment and what you need to collect.</p>
<p><strong>DaemonSet Agent</strong> — One OpenTelemetry Collector per Kubernetes node, deployed as a DaemonSet. Applications export to the agent running on the same node (typically via status.hostIP:4317 using the Kubernetes Downward API). The agent also tails container log files from disk via the filelog receiver and scrapes host-level metrics via the hostmetrics receiver. This is the most common Kubernetes pattern because it handles both application and infrastructure telemetry with a single deployment, and applications only need to know about localhost.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/daemonset-agent.png" alt="DaemonSet Agent pattern: Application with OTel SDK exporting over OTLP to a per-node DaemonSet collector" /></p>
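<p>As a rough sketch, a DaemonSet agent combining push and pull receivers might be configured like this (the gateway endpoint and log path are placeholders for your environment; the component names are standard collector components):</p>
<pre><code>receivers:
  otlp:                      # push: application telemetry from SDKs
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  filelog:                   # pull: container logs from the node's disk
    include: [/var/log/pods/*/*/*.log]
  hostmetrics:               # pull: node-level system metrics
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  otlp:
    endpoint: gateway.observability.svc:4317   # placeholder next hop

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp, hostmetrics]
      exporters: [otlp]
    logs:
      receivers: [otlp, filelog]
      exporters: [otlp]
</code></pre>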
<p><strong>Sidecar Agent</strong> — One OpenTelemetry Collector per pod, deployed as a sidecar container. Each service gets its own collector with a custom configuration. This is required on managed container platforms like AWS Fargate or Azure Container Apps where DaemonSets aren't available, and it's useful when services have different processing requirements. When running alongside a DaemonSet, the sidecar handles application telemetry while the DaemonSet independently collects node-level telemetry — applications don't send to both.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/sidecar-agent.png" alt="Sidecar Agent pattern: Application with OTel SDK exporting over OTLP to a per-pod sidecar collector" /></p>
<p><strong>Host Agent</strong> — A standalone OpenTelemetry Collector running as a systemd service on bare-metal or VM hosts. It serves the same role as the DaemonSet agent but outside Kubernetes: collecting host metrics, tailing log files, and receiving OTLP from local applications.</p>
<p><strong>Direct SDK Export</strong> — Applications export directly to the next stage (gateway or backend) with no local collector. This is the simplest option but only works when you don't need infrastructure collection. For log collection, the recommended pattern is still to write to stdout and use a collector with the <code>filelog</code> receiver — even if the SDK is exporting traces and metrics directly.</p>
<p>These patterns aren't mutually exclusive. A Kubernetes cluster might run DaemonSet agents for infrastructure collection alongside sidecars for services that need custom processing. A VM environment might use host agents for some services and direct SDK export for others. The edge layer is about matching the collection pattern to the workload, not picking one pattern for everything.</p>
<h3>Processing: central policy, sampling, and transformation</h3>
<p>Not every architecture needs a processing layer. If your edge collectors can export directly to your backend and you don't need centralized policy, you can skip it to favour simplicity. But several scenarios push you toward central processing — and the way you address them can range from a single gateway to a multi-stage pipeline.</p>
<p><strong>Centralized policy (Gateway)</strong> — A pool of OpenTelemetry Collectors that sits between edge collectors and the backend. This is where you enforce consistent filtering, transformation, and PII redaction across all services. It's also where you manage backend credentials — edge collectors export to the gateway over OTLP, and only the gateway holds the API keys. Credential isolation is often the primary reason teams add a gateway.</p>
<p>Replica count scales with data volume. At low volumes (under 1K events/sec), 2 replicas co-located with workloads is sufficient. At medium volumes, 3–5 replicas on a dedicated node pool. At high volumes, 5–20+ replicas, potentially in a separate cluster. This is a general rule of thumb, and you should adapt it to your specific needs as loads might vary significantly between payload types.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/gateway-pool.png" alt="Gateway pattern: Load Balancer distributing traffic to a Gateway Pool of OTel Collectors" /></p>
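<p>A minimal gateway configuration sketch might look like the following. The redacted attribute and backend endpoint are illustrative placeholders; the point is that only this tier holds the backend credentials, read here from an environment variable:</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  attributes/pii:
    actions:
      - key: user.email          # hypothetical attribute to redact
        action: delete
  batch: {}

exporters:
  otlp:
    endpoint: https://backend.example.com:443   # placeholder backend
    headers:
      Authorization: ApiKey ${env:BACKEND_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/pii, batch]
      exporters: [otlp]
</code></pre>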
<p><strong>Tail-based sampling</strong> — Sampling decisions that consider the complete trace (e.g., &quot;keep all traces with errors, sample 10% of successful traces&quot;) require that all spans of a trace reach the same collector instance. This is achieved with the <code>loadbalancingexporter</code> using <code>routing_key: traceID</code>, which consistently routes spans from the same trace to the same downstream collector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/tail-sampling.png" alt="Tail sampling pattern: LB Exporter routing to Sampling Collectors with tail_sampling" /></p>
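<p>The first-stage side of this pattern is just an exporter configuration. A sketch, assuming a DNS-resolvable service in front of the sampling collectors (the hostname is a placeholder):</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID       # all spans of a trace go to one instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: sampling-collectors.observability.svc

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
</code></pre>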
<p>There's a critical subtlety here: if you're deriving span metrics (RED metrics) from traces using the <code>spanmetrics</code> connector, the derivation must happen <strong>before</strong> sampling. Otherwise, your metrics only reflect the sampled subset, not the true traffic. The correct pattern is a two-step pipeline within the sampling stage:</p>
<ol>
<li>Receive traces, derive spanmetrics from 100% of traffic, forward via a <code>forward</code> connector.</li>
<li>Apply <code>tail_sampling</code> to the forwarded traces, export only kept traces.</li>
<li>A separate metrics pipeline exports the derived RED metrics.</li>
</ol>
<p>This ensures accurate metrics regardless of your sampling rate.</p>
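<p>A sketch of this two-step pipeline on the sampling tier, using the <code>forward</code> and <code>spanmetrics</code> connectors (policy names and the backend endpoint are illustrative):</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics: {}              # derives RED metrics from spans
  forward: {}                  # hands traces from step 1 to step 2

processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder

service:
  pipelines:
    traces/derive:             # step 1: sees 100% of traffic
      receivers: [otlp]
      exporters: [spanmetrics, forward]
    traces/sample:             # step 2: samples the forwarded copy
      receivers: [forward]
      processors: [tail_sampling]
      exporters: [otlp]
    metrics:                   # step 3: exports derived RED metrics
      receivers: [spanmetrics]
      exporters: [otlp]
</code></pre>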
<p><strong>The key point about processing</strong> is that these capabilities — gateway policy, tail sampling, span metrics derivation — are not separate products or fixed modules. They're configurations of the same OpenTelemetry Collector. At low volumes, a single gateway deployment might handle policy enforcement, sampling, and metrics derivation all at once. At high volumes, you might split them into dedicated stages for independent scaling. The architecture adapts to your scale, not the other way around.</p>
<h3>Resilience: what happens when the backend is down</h3>
<p>The resilience layer determines how much data you're willing to lose during backend outages or collector restarts. This isn't a separate tier you bolt on — it's a property you apply to any stage of the pipeline.</p>
<p><strong>In-Memory Queues</strong> — The default. The collector's <code>sending_queue</code> retries failed exports with exponential backoff. If the collector process crashes or restarts, queued data is lost. This is acceptable for development and for workloads where some data loss during incidents is tolerable.</p>
<p><strong>Persistent Queues (WAL)</strong> — The <code>file_storage</code> extension writes queued data to disk before export. If the collector crashes, it resumes from where it left off after restart. In Kubernetes, this requires a PersistentVolumeClaim. This is the right choice for most production workloads — it survives collector restarts and brief backend outages without the operational complexity of an external message bus.</p>
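<p>Enabling this is a small configuration change on whichever stage needs it. A sketch, with the storage path standing in for a PVC mount in Kubernetes:</p>
<pre><code>extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # placeholder; a PVC mount in k8s

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder
    sending_queue:
      storage: file_storage             # back the queue with on-disk storage

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
</code></pre>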
<p><strong>Kafka Buffer</strong> — An external Kafka cluster sits between collectors and the backend. Producer collectors write to Kafka topics; consumer collectors read from Kafka and export to the backend. This provides the strongest durability guarantee — Kafka can buffer hours of telemetry during extended outages and enables replay. But it adds significant operational complexity.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/kafka-buffer.png" alt="Kafka buffer pattern: Collector Pool producing to Kafka, consumed by another Collector Pool" /></p>
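<p>Both sides of the buffer are ordinary collectors. A sketch of the relevant fragments for each side (broker address and topic name are placeholders):</p>
<pre><code># Producer collectors: write traces to Kafka instead of the backend
exporters:
  kafka:
    brokers: [kafka-0:9092]
    topic: otlp-traces
    encoding: otlp_proto

# Consumer collectors: read the same topic and export to the backend
receivers:
  kafka:
    brokers: [kafka-0:9092]
    topic: otlp-traces
    encoding: otlp_proto
</code></pre>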
<p>The important thing to understand is that resilience is orthogonal to the other layers. You can add persistent queues to an edge agent, a gateway, or a sampling tier. You can put Kafka in front of a gateway, in front of a sampling tier, or in front of the backend. A tail sampling deployment that needs to survive extended outages might use Kafka-backed ingestion — combining what might look like two separate &quot;modules&quot; into a single stage. The building blocks compose freely based on what you need to protect against.</p>
<h2>Where to start with your architecture</h2>
<p>The Agent + Gateway two-tier pattern is the de facto production standard, used by the vast majority of organizations running OpenTelemetry at scale. DaemonSet agents on every node handle local collection — pulling infrastructure telemetry via <code>filelog</code> and <code>hostmetrics</code>, receiving application telemetry via OTLP — while a centralized gateway pool enforces policy, manages credentials, and exports to the backend. Persistent queues (WAL) on the gateway protect against backend outages without external dependencies.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/where-to-start.png" alt="A Kubernetes architecture with DaemonSet agents, a processing tier with tail sampling and gateway pool, exporting over OTLP to an observability backend" /></p>
<p>Every other configuration either simplifies this pattern or extends it. Smaller environments might drop the gateway and export directly from agents. Larger ones might add a tail sampling tier with traceID-based load balancing, a Kafka buffer for extended resilience, or span metrics derivation before sampling. The building blocks described in the previous sections — edge, processing, resilience — are the modules you add or remove from this foundation.</p>
<p>The key is to start with the two-tier pattern and evolve incrementally:</p>
<ul>
<li>Need credential isolation or centralized PII redaction? You already have the gateway.</li>
<li>Need tail-based sampling? Add a load-balancing exporter and a sampling tier between agents and gateway.</li>
<li>Need hours of buffer during extended outages? Insert Kafka between agents and the processing tier.</li>
<li>Running on Fargate or Azure Container Apps? Swap DaemonSet agents for sidecars — the rest of the pipeline stays the same.</li>
</ul>
<p>Start here. Add modules as your needs grow. The architecture adapts to your scale, not the other way around.</p>
<h2>Decision points that shape your architecture</h2>
<p>When designing a collector architecture, these are the questions that determine which patterns you need:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Impact</th>
</tr>
</thead>
<tbody>
<tr>
<td>Do I need infrastructure telemetry (host metrics, disk logs)?</td>
<td>Determines whether you need a local collector or can use direct SDK export</td>
</tr>
<tr>
<td>Am I on a managed container platform (Fargate, ACA)?</td>
<td>Forces sidecar pattern instead of DaemonSet</td>
</tr>
<tr>
<td>Do I need centralized filtering, PII redaction, or credential isolation?</td>
<td>Adds a gateway stage</td>
</tr>
<tr>
<td>Do I need tail-based sampling?</td>
<td>Adds a sampling stage with load-balancing exporter and traceID routing</td>
</tr>
<tr>
<td>Do I want span-derived metrics (RED metrics)?</td>
<td>Requires spanmetrics before sampling in a two-step pipeline</td>
</tr>
<tr>
<td>How much data loss is acceptable during outages?</td>
<td>Determines in-memory queues vs. persistent queues vs. Kafka — applied to whichever stage needs protection</td>
</tr>
<tr>
<td>What is my expected data volume?</td>
<td>Determines whether capabilities can be co-located in a single deployment or need dedicated stages</td>
</tr>
</tbody>
</table>
<p>The answers to these questions don't map to a single &quot;correct&quot; architecture. They constrain the design space, and within those constraints, you make trade-offs between simplicity and capability.</p>
<h2>Exploring these patterns interactively</h2>
<p>If you'd rather explore how these building blocks compose than assemble them by hand, <a href="https://mlunadia.github.io/otel-blueprints/">OpenTelemetry Blueprints</a> is an open-source tool that generates reference architectures from your requirements.</p>
<p>Toggle your environment, signals, volume, resilience, and processing needs — and get a composed diagram with animated data flow, interactive tooltips, and reference collector configurations you can open directly in <a href="https://www.otelbin.io">OTelBin</a> for validation.</p>
<p><a href="https://mlunadia.github.io/otel-blueprints/"><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/architecture.png" alt="Screenshot of a composed architecture diagram showing a Kubernetes cluster with DaemonSet agent, processing tier, and observability backend" /></a></p>
<p>The generated configurations export via OTLP, so they work with any OTLP-compatible backend — including <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Observability</a>, which natively accepts and stores OTLP traces, metrics, and logs.</p>
<p>The architectures Blueprints generates are reference compositions — starting points for understanding how the building blocks fit together, not turnkey deployments. Every architecture should be adapted to your organisation's scale, security, networking, and compliance requirements. The patterns might combine or overlap differently in your environment than in anyone else's, and that's the point.</p>
<h2>Get started</h2>
<p>The architectures described here export over OTLP, so they work with any compatible backend. If you don't have one yet, the fastest way to see your telemetry flowing end-to-end is with Elastic Observability — it natively ingests OTLP traces, metrics, and logs with no additional configuration.</p>
<ol>
<li><a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Start a free trial</a> on Elastic Cloud Serverless — no credit card required.</li>
<li>Point your collector's OTLP exporter at the managed OTLP endpoint.</li>
<li>Explore your traces, metrics, and logs in Kibana within minutes.</li>
</ol>
<p>Check out these resources to go further:</p>
<ul>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic's managed OTLP endpoint documentation</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector">EDOT Collector — Elastic's distribution of the OpenTelemetry Collector</a></li>
<li><a href="https://mlunadia.github.io/otel-blueprints/">OpenTelemetry Blueprints — generate reference architectures interactively</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/opentelemetry-collector-reference-architectures.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitor your C++ Applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-cpp-elastic</link>
            <guid isPermaLink="false">opentelemetry-cpp-elastic</guid>
            <pubDate>Tue, 11 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article we use the OpenTelemetry C++ client to monitor C++ applications with Elastic APM]]></description>
            <content:encoded><![CDATA[
<h1>Introduction</h1>
<p>One of the main challenges that developers, SREs, and DevOps professionals face is the absence of comprehensive tooling that gives them visibility into their application stack. Many APM solutions on the market provide ways to monitor applications built on popular languages and frameworks (e.g., .NET, Java, Python) but fall short when it comes to C++ applications.</p>
<p>Luckily, Elastic has been one of the leading solutions in the observability space and a contributor to the OpenTelemetry project. Elastic’s unique position and extensive observability capabilities allow end users to monitor applications built with a wide range of languages and frameworks.</p>
<p>In this blog we will explore using Elastic APM to investigate C++ traces with the OpenTelemetry client, providing a comprehensive guide to instrumenting a C++ application with the OpenTelemetry client and connecting it to Elastic APM. While this blog reviews the upstream OTel C++ library, Elastic also maintains its own Elastic Distributions of OpenTelemetry (EDOT), which provide commercial support and are regularly upstreamed.</p>
<p>Here are some resources to help get you started:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Use OpenTelemetry with APM</a></p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-cpp">The OpenTelemetry C++ Client</a></p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/languages/cpp/">OpenTelemetry C++ Docs</a></p>
</li>
</ul>
<h1>Step by Step Guide</h1>
<h2>Prerequisites</h2>
<h3>Environment</h3>
<p>Choosing an environment is quite important, as support for the OTel client is limited. We have experimented with multiple operating systems; here are our suggestions:</p>
<ul>
<li>
<p>Ubuntu 22.04</p>
</li>
<li>
<p>Debian 11 Bullseye</p>
</li>
<li>
<p>For this guide we are focusing on Ubuntu 22.04.</p>
<ul>
<li>
<p>Machine: 2 vCPU, 4GB is sufficient.</p>
</li>
<li>
<p>Image: Ubuntu 22.04 LTS (x86_64).</p>
</li>
<li>
<p>Disk: ~30 GB is enough.</p>
</li>
</ul>
</li>
</ul>
<h2>Implementation method</h2>
<p>We experimented with multiple methods and found that the most suitable approach is to use a package manager. Building from source with tools such as CMake or Bazel is a viable solution, but in our testing we spent most of our time and effort fixing OS compatibility and dependency issues rather than focusing on sending data to our APM. Hence we decided to use a package manager instead.</p>
<p>The main issues we kept running into as we tested were:</p>
<ul>
<li>
<p>Compatibility of packages.</p>
</li>
<li>
<p>Availability of packages.</p>
</li>
<li>
<p>Dependencies of libraries and packages.</p>
</li>
</ul>
<p>In this guide we will use vcpkg, since it brings in all the dependencies required to run the OpenTelemetry C++ client.</p>
<h2>Installing required OS tools</h2>
<h3>Update package lists</h3>
<pre><code>    sudo apt-get update
</code></pre>
<h3>Install build essentials, cmake, git, and the SQLite dev library</h3>
<pre><code>    sudo apt-get install -y build-essential cmake git curl zip unzip sqlite3 libsqlite3-dev
</code></pre>
<p>sqlite3 and libsqlite3-dev allow us to build/run SQLite queries in our C++ code.</p>
<h3>Set Up vcpkg</h3>
<p>vcpkg is the C++ package manager that we’ll use to install the opentelemetry-cpp client.</p>
<pre><code>    # Clone vcpkg
    cd ~
    git clone https://github.com/microsoft/vcpkg.git
</code></pre>
<pre><code>    # Bootstrap
    cd ~/vcpkg
    ./bootstrap-vcpkg.sh
</code></pre>
<h3>Install OpenTelemetry C++ with OTLP gRPC</h3>
<p>In this guide we focus on trace export to Elastic. At the time of writing, vcpkg’s opentelemetry-cpp version 1.18.0 fully supports traces but has limited direct metrics exporting.</p>
<h3>Install the package</h3>
<pre><code>    cd ~/vcpkg
    ./vcpkg install opentelemetry-cpp[otlp-grpc]:x64-linux
</code></pre>
<h4>Note</h4>
<p>Sometimes when installing opentelemetry-cpp on Linux, vcpkg doesn’t install all the required packages. If you run into this, try running the install again with the <code>--allow-unsupported</code> flag:</p>
<pre><code>    ./vcpkg install opentelemetry-cpp[*]:x64-linux --allow-unsupported
</code></pre>
<h3>Verify</h3>
<pre><code>    ./vcpkg list | grep opentelemetry-cpp
</code></pre>
<p>The output should be something like this:</p>
<pre><code>opentelemetry-cpp:x64-linux 1.18.0
</code></pre>
<h2>Create the C++ Project with Database Spans</h2>
<p>We’ll build a sample in ~/otel-app that:</p>
<ul>
<li>
<p>Uses SQLite to do basic CREATE/INSERT/SELECT queries. This is helpful to showcase capturing transactions for apps that use databases on Elastic APM.</p>
</li>
<li>
<p>Generates random traces to showcase how they are captured in Elastic APM.</p>
</li>
</ul>
<p>This app generates random work: some transactions contain database queries and some are plain application traces. Each query runs in its own child span, so queries appear in APM as separate database spans.</p>
<pre><code># Below is the structure of our project
    otel-app/
    ├── main.cpp
    └── CMakeLists.txt
</code></pre>
<h3>Create App Project</h3>
<pre><code>    cd ~
    mkdir otel-app
    cd otel-app
</code></pre>
<p>Inside this project we will create two files</p>
<ul>
<li>
<p>main.cpp</p>
</li>
<li>
<p>CMakeLists.txt</p>
</li>
</ul>
<p>Keep in mind that main.cpp is where you configure the OTel exporters that send data to the Elastic cluster. For your own tech stack, this would be your application’s source code.</p>
<h4>Sample application code</h4>
<pre><code>    // main.cpp
    // Below we declare the required libraries that we will use to ship
    // traces to Elastic APM
    #include &lt;opentelemetry/exporters/otlp/otlp_grpc_exporter.h&gt;
    #include &lt;opentelemetry/sdk/trace/tracer_provider.h&gt;
    #include &lt;opentelemetry/sdk/trace/simple_processor.h&gt;
    #include &lt;opentelemetry/trace/provider.h&gt;

    #include &lt;sqlite3.h&gt;
    #include &lt;chrono&gt;
    #include &lt;iostream&gt;
    #include &lt;thread&gt;
    #include &lt;cstdlib&gt;  // for rand(), srand()
    #include &lt;ctime&gt;    // for time()

    // Namespace aliases
    namespace trace_api = opentelemetry::trace;
    namespace sdktrace  = opentelemetry::sdk::trace;
    namespace otlp      = opentelemetry::exporter::otlp;

    // Below we are using a helper function to run SQLITE statement inside 
    // child span
    bool ExecuteSql(sqlite3 *db, const std::string &amp;sql,
                    trace_api::Tracer &amp;tracer,
                    const std::string &amp;span_name)
    {
      // Starting the child span
      auto db_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(db_span);

        // Here we mark Database attributes for clarity in APM
        db_span-&gt;SetAttribute(&quot;db.system&quot;, &quot;sqlite&quot;);
        db_span-&gt;SetAttribute(&quot;db.statement&quot;, sql);

        char *errMsg = nullptr;
        int rc = sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &amp;errMsg);
        if (rc != SQLITE_OK)
        {
          db_span-&gt;AddEvent(&quot;SQLite error: &quot; + std::string(errMsg ? errMsg : &quot;unknown&quot;));
          sqlite3_free(errMsg);
          db_span-&gt;End();
          return false;
        }
        db_span-&gt;AddEvent(&quot;Query OK&quot;);
      }
      db_span-&gt;End();
      return true;
    }

    /**
     * DoNonDbWork - Simulate some other operation
     */
    void DoNonDbWork(trace_api::Tracer &amp;tracer, const std::string &amp;span_name)
    {
      auto child_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(child_span);
        // Just sleep or do some &quot;fake&quot; work
        std::cout &lt;&lt; &quot;[TRACE] Doing non-DB work for &quot; &lt;&lt; span_name &lt;&lt; &quot;...\n&quot;;
        std::this_thread::sleep_for(std::chrono::milliseconds(200 + rand() % 300));
        child_span-&gt;AddEvent(&quot;Finished non-DB work&quot;);
      }
      child_span-&gt;End();
    }

    int main()
    {
      // Seed random generator for example
      srand(static_cast&lt;unsigned&gt;(time(nullptr)));

      // 1) Create OTLP exporter for traces
      otlp::OtlpGrpcExporterOptions opts;
      auto exporter = std::make_unique&lt;otlp::OtlpGrpcExporter&gt;(opts);

      // 2) Simple Span Processor
      auto processor = std::make_unique&lt;sdktrace::SimpleSpanProcessor&gt;(std::move(exporter));

      // 3) Tracer Provider
      auto sdk_tracer_provider = std::make_shared&lt;sdktrace::TracerProvider&gt;(std::move(processor));
      auto tracer = sdk_tracer_provider-&gt;GetTracer(&quot;my-cpp-multi-app&quot;);

      // Prepare an in-memory SQLite DB (for random DB usage)
      sqlite3 *db = nullptr;
      int rc = sqlite3_open(&quot;:memory:&quot;, &amp;db);
      if (rc == SQLITE_OK)
      {
        // Create a table so we can do inserts/reads
        ExecuteSql(db, &quot;CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, info TEXT);&quot;,
                   *tracer.get(), &quot;db_create_table&quot;);
      }

      // Create the following loop to generate multiple transactions
      int num_transactions = 5;  // Change this variable to the desired number of transaction
      for (int i = 1; i &lt;= num_transactions; i++)
      {
        // Each iteration is a top-level transaction
        std::string transaction_name = &quot;transaction_&quot; + std::to_string(i);
        auto parent_span = tracer-&gt;StartSpan(transaction_name);
        {
          auto scope = tracer-&gt;WithActiveSpan(parent_span);

          std::cout &lt;&lt; &quot;\n=== Starting &quot; &lt;&lt; transaction_name &lt;&lt; &quot; ===\n&quot;;

          // Randomly select whether a transaction will interact with the database or not.
          bool doDb = (rand() % 2 == 0); // 50% chance

          if (doDb &amp;&amp; db)
          {
            // Insert random data
            std::string insert_sql = &quot;INSERT INTO items (info) VALUES ('Item &quot; + std::to_string(i) + &quot;');&quot;;
            ExecuteSql(db, insert_sql, *tracer.get(), &quot;db_insert_item&quot;);

            // Select from DB
            ExecuteSql(db, &quot;SELECT * FROM items;&quot;, *tracer.get(), &quot;db_select_items&quot;);
          }
          else
          {
            // Do some random non-DB tasks
            DoNonDbWork(*tracer.get(), &quot;non_db_task_1&quot;);
            DoNonDbWork(*tracer.get(), &quot;non_db_task_2&quot;);
          }

          // Sleep a little to simulate transaction time
          std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
        parent_span-&gt;End();
      }

      // Close DB
      sqlite3_close(db);

      // Extra sleep to ensure final flush
      std::cout &lt;&lt; &quot;\n[INFO] Sleeping 5 seconds to allow flush...\n&quot;;
      std::this_thread::sleep_for(std::chrono::seconds(5));
      std::cout &lt;&lt; &quot;[INFO] Exiting.\n&quot;;
      return 0;
    }
</code></pre>
<h5>What does the code do?</h5>
<p>We create five top-level “transaction_i” spans.</p>
<p>For each transaction, we randomly choose to do DB or non-DB work:</p>
<ul>
<li>If DB: insert a row, then select. Each operation is a child span.</li>
<li>If non-DB: we perform two “fake” tasks (child spans).</li>
</ul>
<p>Once we finish, we close the database connection and wait 5 seconds for data flush.</p>
<h4>Sample instruction file</h4>
<p>CMakeLists.txt: this file describes the project’s source files and build targets.</p>
<pre><code>    cmake_minimum_required(VERSION 3.10)

    # The vcpkg toolchain file must be set before project() to take effect;
    # alternatively, pass -DCMAKE_TOOLCHAIN_FILE on the cmake command line.
    set(CMAKE_TOOLCHAIN_FILE &quot;PATH-TO/vcpkg.cmake&quot; CACHE STRING &quot;Vcpkg toolchain file&quot;)

    project(OtelApp VERSION 1.0)

    # opentelemetry-cpp 1.18.0 requires at least C++14
    set(CMAKE_CXX_STANDARD 14)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)

    find_package(opentelemetry-cpp CONFIG REQUIRED)

    add_executable(otel_app main.cpp)

    # Link the OTLP gRPC exporter, trace library, and sqlite3
    target_link_libraries(otel_app PRIVATE
        opentelemetry-cpp::otlp_grpc_exporter
        opentelemetry-cpp::trace
        sqlite3
    )
</code></pre>
<h4>Declare Environment Variables</h4>
<p>Here we export our Elastic Cloud endpoint details as environment variables.</p>
<p>You can get that information as follows:</p>
<ol>
<li>
<p>Log in to your Elastic Cloud account.</p>
</li>
<li>
<p>Go to your deployment.</p>
</li>
<li>
<p>On the left-hand side, click the hamburger menu and scroll down to “Integrations.”</p>
</li>
<li>
<p>In the search bar inside Integrations, type “APM.”</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/APM-Search.png" alt="" /></p>
<ol start="5">
<li>
<p>Click on the APM integration.</p>
</li>
<li>
<p>Scroll down and click the OpenTelemetry option on the far left side.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted.png" alt="" /></p>
<ol start="7">
<li>You should see values similar to the screenshot below. Once you have copied the values to export, click Launch APM.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted2.png" alt="" /></p>
<p>Once you have copied the required values, export them:</p>
<pre><code>    export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;APM-ENDPOINT&quot;
    export OTEL_EXPORTER_OTLP_HEADERS=&quot;KEY&quot;
    export OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=my-app,service.version=1.0.0,deployment.environment=dev&quot;
</code></pre>
<p>Note that the Elastic OTEL_EXPORTER_OTLP_HEADERS value usually starts with “Authorization=Bearer”. Make sure you change the uppercase “A” in “Authorization” to a lowercase “a”, because the OTLP exporter expects a lowercase header key.</p>
<h3>Build and Run</h3>
<p>Once we have created the two files, we can build the application.</p>
<pre><code>cd ~/otel-app
mkdir -p build
cd build

cmake -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake \
      -DCMAKE_PREFIX_PATH=~/vcpkg/installed/x64-linux/share \
      ..
make
</code></pre>
<p>Once make succeeds, run the application:</p>
<pre><code>./otel_app
</code></pre>
<p>You should see console output similar to the following:</p>
<pre><code>
    === Starting transaction_1 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_2 ===
    [TRACE] Doing DB work for doDb_task_1...
    [TRACE] Doing DB work for doDb_task_2...

    === Starting transaction_3 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_4 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_5 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    [INFO] Sleeping 5 seconds to allow flush...
    [INFO] Exiting.
</code></pre>
<p>Once the script executes, you should be able to observe the traces in Elastic APM, similar to the screenshots below.</p>
<h3>Observe in Elastic APM</h3>
<p>Go to Elastic Cloud, open your deployment, and navigate to Observability &gt; APM.</p>
<p>Look for the app name in the service list (as defined by OTEL_RESOURCE_ATTRIBUTES).</p>
<p>Inside that service’s Traces tab, you’ll find multiple transactions such as “transaction_1”, “transaction_2”, and so on.</p>
<p>Expanding each transaction shows child spans:</p>
<ul>
<li>db_insert_item and db_select_items, if the random DB path was taken.</li>
<li>Otherwise, non_db_task_1 and non_db_task_2.</li>
</ul>
<p>You can see how some transactions make DB calls and some do not, each with different spans.</p>
<p>This variety demonstrates how a real application might produce multiple different “routes” or “operations.”</p>
<h4>Service Map</h4>
<p>If everything runs correctly, you should be able to view your services and see service maps for your application.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Service-Map.png" alt="" /></p>
<h4>Services</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Services.png" alt="" /></p>
<h4>My Elastic App</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Overview-transactions.png" alt="" /></p>
<h4>App Transactions</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Transactions2.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Trace-db.png" alt="" /></p>
<h4>Dependencies</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Dependecies.png" alt="" /></p>
<h4>Logs</h4>
<p>Navigate to your Logs window or Discover to see the incoming application logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs2.png" alt="" /></p>
<h4>Patterns</h4>
<p>Log pattern analysis helps you to find patterns in unstructured log messages and makes it easier to examine your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/patt2.png" alt="" /></p>
<h2>Final Recap</h2>
<p>Here is a quick summary of what we did:</p>
<ul>
<li>
<p>Provisioned an Ubuntu 22.04 machine.</p>
</li>
<li>
<p>Installed build tools, SQLite development libraries, and vcpkg.</p>
</li>
<li>
<p>Installed the opentelemetry-cpp client via vcpkg.</p>
</li>
<li>
<p>Created a minimal C++ project that emits application traces and captures database operations.</p>
</li>
<li>
<p>Linked the sqlite3 library in CMakeLists.txt.</p>
</li>
<li>
<p>Exported the Elastic OTLP endpoint &amp; token as environment variables (with a lowercase authorization=Bearer key!).</p>
</li>
<li>
<p>Ran the application and observed DB interactions and app traces in Elastic APM.</p>
</li>
<li>
<p>Observed application logs and patterns on Elastic logs and Discover.</p>
</li>
</ul>
<h2>FAQ &amp; Common Issues</h2>
<ul>
<li>Getting “Could not find package configuration file provided by opentelemetry-cpp”?</li>
</ul>
<p>Make sure you pass</p>
<pre><code>-DCMAKE_TOOLCHAIN_FILE=... and -DCMAKE_PREFIX_PATH=... 
</code></pre>
<p>to cmake, or embed them in CMakeLists.txt.</p>
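<p>If you prefer embedding, a minimal sketch of the equivalent settings in CMakeLists.txt looks like this (the paths are placeholders for your own vcpkg checkout):</p>

```cmake
# Must appear before project() so the toolchain takes effect.
set(CMAKE_TOOLCHAIN_FILE "$ENV{HOME}/vcpkg/scripts/buildsystems/vcpkg.cmake"
    CACHE STRING "Vcpkg toolchain file")
# Helps find_package() locate the opentelemetry-cpp config files.
list(APPEND CMAKE_PREFIX_PATH "$ENV{HOME}/vcpkg/installed/x64-linux/share")
```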
<ul>
<li>Crash: “validate_metadata: INTERNAL:Illegal header key”?</li>
</ul>
<p>Use an all-lowercase header key in OTEL_EXPORTER_OTLP_HEADERS, e.g.:</p>
<pre><code>authorization=Bearer &lt;token&gt;
</code></pre>
<ul>
<li>Missing otlp_grpc_metrics_exporter.h?</li>
</ul>
<p>Your vcpkg version of opentelemetry-cpp (1.18.0) lacks a direct metrics exporter for OTLP. For metrics, either upgrade the library or consider an OpenTelemetry Collector approach.</p>
<ul>
<li>No data in Elastic APM?</li>
</ul>
<p>Double-check your endpoint URL, Bearer token, firewall rules, and the service name shown in the APM app.</p>
<h2>Additional Resources:</h2>
<ul>
<li><a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud free trial</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/tag/opentelemetry">More Elastic OpenTelemetry Topics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Introducing Elastic Distributions of OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Introducing Elastic Distribution of OpenTelemetry Collector</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Instrumenting your OpenAI-powered Python, Node.js, and Java Applications with EDOT</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/blog-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to combine OpenTelemetry instrumentation with Elastic APM Agent features]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-instrumentation-apm-agent-features</link>
            <guid isPermaLink="false">opentelemetry-instrumentation-apm-agent-features</guid>
            <pubDate>Thu, 13 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This post shows you how you can combine the OpenTelemetry tracing APIs with Elastic APM Agents. You'll learn how OpenTelemetry spans became part of a trace that Elastic APM Agents report.]]></description>
            <content:encoded><![CDATA[<p>Elastic APM supports OpenTelemetry on multiple levels. One easy-to-understand scenario, which <a href="https://www.elastic.co/blog/opentelemetry-observability">we previously blogged about</a>, is the direct OpenTelemetry Protocol (OTLP) support in APM Server. This means that you can connect any OpenTelemetry agent to an Elastic APM Server, the APM Server will happily take that data and ingest it into Elasticsearch<sup>®</sup>, and you can view that OpenTelemetry data in the APM app in Kibana<sup>®</sup>.</p>
<p>This blog post will showcase a different use-case: within Elastic APM, we have <a href="https://www.elastic.co/guide/en/apm/agent/index.html">our own APM Agents</a>. Some of these have download numbers in the tens of millions, and some of them predate OpenTelemetry. Of course we realize OpenTelemetry is very important and it’s here to stay, so we wanted to make these agents OpenTelemetry compatible and illustrate them using <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualizations</a> in this blog.</p>
<p>Most of our Elastic APM Agents today are able to ship OpenTelemetry spans as part of a trace. This means that if you have any component in your application that emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM Agent captures. This can be a library you use that is already instrumented by the OpenTelemetry API, or it can be any other OpenTelemetry span that an application developer added into the application’s code for manual instrumentation.</p>
<p>This feature of the Elastic APM Agents not only reports those spans but also properly maintains parent-child relationships between all spans, making OpenTelemetry a first-class citizen for these agents. If, for example, an Elastic APM Agent starts a span for a specific action by auto-instrumentation and then within that span the OpenTelemetry API starts another span, then the OpenTelemetry span will be the child of the outer span created by the agent. This is reflected in the parent.id field of the spans. It’s the same the other way around as well: if a span is created by the OpenTelemetry API and within that span an Elastic APM agent captures another span, then the span created by the Elastic APM Agent will be the child of the other span created by the OpenTelemetry API.</p>
<p>This feature is present in the following agents:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/apm/agent/java/current/opentelemetry-bridge.html">Java</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/dotnet/master/opentelemetry-bridge.html">.NET</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/python/current/opentelemetry-bridge.html">Python</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/nodejs/current/opentelemetry-bridge.html">Node.js</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/go/current/opentelemetry.html">Go</a></li>
</ul>
<h2>Capturing OpenTelemetry spans in the Elastic .NET APM Agent</h2>
<p>As a first example, let’s take an ASP.NET Core application. We’ll put the .NET Elastic APM Agent into this application and turn on the feature that automatically bridges OpenTelemetry spans, so the Elastic APM Agent will make those spans part of the trace it reports.</p>
<p>The following code snippet shows a controller:</p>
<pre><code class="language-csharp">namespace SampleAspNetCoreApp.Controllers
{
	public class HomeController : Controller
	{
		private readonly SampleDataContext _sampleDataContext;
		private ActivitySource _activitySource = new ActivitySource(&quot;HomeController&quot;);
		public HomeController(SampleDataContext sampleDataContext) =&gt; _sampleDataContext = sampleDataContext;
		public async Task&lt;IActionResult&gt; Index()
		{
			await ReadGitHubStars();
			return View();
		}
		public async Task ReadGitHubStars()
		{
			using var activity = _activitySource.StartActivity();
			var httpClient = new HttpClient();
			httpClient.DefaultRequestHeaders.Add(&quot;User-Agent&quot;, &quot;APM-Sample-App&quot;);
			var responseMsg = await httpClient.GetAsync(&quot;https://api.github.com/repos/elastic/apm-agent-dotnet&quot;);
			var responseStr = await responseMsg.Content.ReadAsStringAsync();
			// …use responseStr
		}
	}
}
</code></pre>
<p>The Index method calls the ReadGitHubStars method, and after that we simply return the corresponding view.</p>
<p>The incoming HTTP call and the outgoing HTTP call made by the HttpClient are automatically captured by the Elastic APM Agent — this is part of the auto instrumentation we have offered for a very long time.</p>
<p>ReadGitHubStars is the method where we use the OpenTelemetry API. OpenTelemetry in .NET uses the ActivitySource and Activity APIs. The _activitySource.StartActivity() call creates an OpenTelemetry span that automatically takes the name of the method via the <a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.callermembernameattribute?view=net-7.0">CallerMemberNameAttribute</a> C# language feature, and this span ends when the method runs to completion.</p>
<p>Additionally, within this span we call the GitHub API with the HttpClient type. For this type, the .NET Elastic APM Agent again offers auto instrumentation, so the HTTP call will be also captured as a span by the agent automatically.</p>
<p>And here is how the waterfall chart for this transaction looks in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-1-trace-sample.png" alt="trace sample kibana" /></p>
<p>As you can see, the agent was able to capture the OpenTelemetry span as part of the trace.</p>
<h2>Bridging OpenTelemetry spans in Python by using the Python Elastic APM Agent</h2>
<p>Let’s see how this works in the case of Python. The idea is the same, so all the concepts introduced previously apply to this example as well.</p>
<p>We take a very simple Django example:</p>
<pre><code class="language-python">from django.http import HttpResponse
from elasticapm.contrib.opentelemetry import Tracer
import requests


def index(request):
   tracer = Tracer(__name__)
   with tracer.start_as_current_span(&quot;ReadGitHubStars&quot;):
       url = &quot;https://api.github.com/repos/elastic/apm-agent-python&quot;
       response = requests.get(url)
       return HttpResponse(response)
</code></pre>
<p>The first step to turn on capturing OpenTelemetry spans in Python is to import the Tracer implementation from elasticapm.contrib.opentelemetry.</p>
<p>And then on this Tracer you can start a new span — in this case, we manually name the span ReadGitHubStars.</p>
<p>Similarly to the previous example, the call to <a href="http://127.0.0.1:8000/otelsample/">http://127.0.0.1:8000/otelsample/</a> is captured by the Elastic APM Python Agent, and then the next span is created by the OpenTelemetry API, which, as you can see, is captured by the agent automatically, and then finally the HTTP call to the GitHub API is captured again by the auto instrumentation of the agent.</p>
<p>Here is how it looks in the waterfall chart:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-2-trace-sample-2.png" alt="water-flow chart" /></p>
<p>As already mentioned, the agent maintains the parent-child relationship for all the OTel spans. Let’s take a look at the parent.id of the GET api.github.com call:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-3-span-details.png" alt="OTel span details" /></p>
<p>As you can see, the id of this span is c98401c94d40b87a.</p>
<p>If we look at the span.id of the ReadGitHubStars OpenTelemetry span, then we can see that the id of this span is exactly c98401c94d40b87a — so the APM Agent internally maintains parent-child relationships across OpenTelemetry and non-OpenTelemetry spans, which makes OpenTelemetry spans first-class citizens in Elastic APM Agents.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-4-span-details-2.png" alt="OpenTelemetry spans first-class citizens in Elastic APM Agents" /></p>
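<p>The parent-child bookkeeping described above can be sketched conceptually: every span, whichever API created it, records the id of the span that was active when it started. The following is a toy illustration in plain Python — not the agent’s actual implementation, and all names here are made up for the sketch:</p>

```python
import secrets

class Span:
    """Toy span: stores its own id plus its parent's id (like parent.id in APM)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.id = secrets.token_hex(8)  # 16 hex chars, e.g. "c98401c94d40b87a"
        self.parent_id = parent.id if parent else None

# Agent transaction -> OTel span -> agent auto-instrumented HTTP span:
transaction = Span("GET /otelsample/")              # captured by the agent
otel = Span("ReadGitHubStars", parent=transaction)  # created via the OTel API
http = Span("GET api.github.com", parent=otel)      # agent auto-instrumentation

# The HTTP span's parent.id equals the OTel span's id, as in the screenshots.
assert http.parent_id == otel.id
assert otel.parent_id == transaction.id
```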
<h2>Other languages</h2>
<p>At this point, I'll stop replicating the exact same sample code in further languages — you get the point: in each language listed above, our Elastic APM Agents are able to bridge OpenTelemetry traces and show them in Kibana as native spans. We also <a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">blogged about using the same API in Java</a>, and you can see examples for the rest of the languages in the corresponding agent documentation (linked above).</p>
<h2>When to use this feature and when to use pure OpenTelemetry SDKs</h2>
<p>This is really up to you. If you want to only have pure OpenTelemetry usage in your applications and you really want to avoid any vendor-related software, then feel free to use OpenTelemetry SDKs directly — that is a use case we clearly support. If you go that route, this feature is not so relevant to you.</p>
<p>However, our Elastic APM Agents already have a very big user base and they offer features that are not present in OpenTelemetry. Some of these features are <a href="https://www.elastic.co/guide/en/apm/guide/current/span-compression.html">span compression</a>, <a href="https://www.elastic.co/guide/en/kibana/current/agent-configuration.html">central configuration</a>, <a href="https://www.elastic.co/guide/en/apm/agent/java/current/method-sampling-based.html">inferred spans</a>, distributed <a href="https://www.elastic.co/guide/en/apm/guide/current/configure-tail-based-sampling.html">tail based sampling</a> with multiple APM Servers, and many more.</p>
<p>If you are one of the many existing Elastic APM Agent users, or you plan to use an Elastic APM Agent because of the features mentioned above, then bridging OpenTelemetry spans enables you to still use the OpenTelemetry API and not rely on any vendor-related API. That way your developer teams can instrument your application with OpenTelemetry, you can use any third-party library already instrumented with OpenTelemetry, and Elastic APM Agents will happily report those spans as part of the traces they report. With this, you can combine the vendor-independent nature of OpenTelemetry with the feature-rich Elastic APM Agents.</p>
<p>The OpenTelemetry bridge feature is also a good tool to use if you wish to change your telemetry library from an Elastic APM Agent to OpenTelemetry (and vice-versa), as it allows you to use both libraries together and switch them using atomic changes.</p>
<h2>Next steps</h2>
<p>In this blog post, we discussed how you can bridge OpenTelemetry spans with Elastic APM Agents. Of course OpenTelemetry is more than just traces. We know that, and we plan to cover further areas: currently we are working on bridging OpenTelemetry metrics in our Elastic APM Agents in a very similar fashion. You can watch the progress <a href="https://github.com/elastic/apm/issues/691">here</a>.</p>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Learn more about adding Elastic APM as part of your Elastic Observability deployment</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/opentelemetry_apm-blog-720x420.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Automatic cloud resource attributes with OpenTelemetry Java]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-java-automatic-cloud-resource-attributes</link>
            <guid isPermaLink="false">opentelemetry-java-automatic-cloud-resource-attributes</guid>
            <pubDate>Thu, 27 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Capturing cloud resource attributes allows you to describe application cloud deployment details. In this article, we describe three distinct ways to enable them for Java applications using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>With OpenTelemetry, the observed entities (application, services, processes, …) are described through resource attributes. The definitions and the values of those attributes are defined in the <a href="https://opentelemetry.io/docs/concepts/semantic-conventions/">semantic conventions</a>.<br />
In practice, for a typical Java application running in a cloud environment like Google Cloud Platform (GCP), Amazon Web Services (AWS), or Azure, it means capturing the name of the cloud provider, the cloud service name, or the availability zone, in addition to per-provider attributes. Those attributes are then used to describe and qualify the observability signals (logs, traces, metrics); they are defined by semantic conventions in the <a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/">cloud resource attributes</a> section.</p>
<p>When using the <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a> or the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry instrumentation agent</a>, those attributes are not automatically captured by default. In this article we will show you first how to enable them with the SDK, then using the instrumentation agent and then we will show you how using the <a href="https://github.com/elastic/elastic-otel-java/">Elastic OpenTelemetry Distribution</a> makes it even easier.</p>
<h2>OpenTelemetry Java SDK</h2>
<p>The OpenTelemetry Java SDK does not capture any cloud resource attributes by default; however, it provides a pluggable service provider interface for registering resource attribute providers, and application developers have to supply the implementations.</p>
<p>Implementations for <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources">GCP</a> and <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources">AWS</a> are already included in the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">OpenTelemetry Java Contrib</a> repo, so if you are using one of those cloud providers then it's mostly a matter of adding those providers to your application dependencies. Thanks to autoconfiguration those should be automatically included and enabled once they are added to the application classpath. The <a href="https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure#resource-provider-spi">SDK documentation</a> provides all the details to add and configure those in your application.</p>
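<p>With Gradle, for example, this amounts to adding the contrib artifacts to your dependencies. The coordinates below are the ones published from the contrib repo; VERSION is a placeholder for a release compatible with your SDK:</p>

```kotlin
// build.gradle.kts — a sketch, not a complete build file
dependencies {
    runtimeOnly("io.opentelemetry.contrib:opentelemetry-aws-resources:VERSION")
    runtimeOnly("io.opentelemetry.contrib:opentelemetry-gcp-resources:VERSION")
}
```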
<p>If you are using a cloud provider for which no such implementation is available, then you still have the option to provide your own which is a straightforward implementation of the <a href="https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md#resource-provider-spi">ResourceProvider</a> SPI (Service Provider Interface). In order to keep things consistent, you will have to rely on the existing <a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/">cloud semantic conventions</a>.</p>
<p>Here is an example of a simple cloud resource attribute provider for a fictitious cloud provider named &quot;potatoes&quot;.</p>
<pre><code>package potatoes;

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;
import io.opentelemetry.sdk.autoconfigure.spi.ResourceProvider;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.semconv.incubating.CloudIncubatingAttributes;

public class PotatoesResourceProvider implements ResourceProvider {

  @Override
  public Resource createResource(ConfigProperties configProperties) {
    return Resource.create(Attributes.of(
        CloudIncubatingAttributes.CLOUD_PROVIDER, &quot;potatoes&quot;,
        CloudIncubatingAttributes.CLOUD_PLATFORM, &quot;french-fries&quot;,
        CloudIncubatingAttributes.CLOUD_REGION, &quot;garden&quot;));
  }
}
</code></pre>
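<p>For the SDK's autoconfiguration to pick this provider up, it also needs to be registered through the standard Java ServiceLoader mechanism, i.e. a service file on the classpath naming the implementation class:</p>

```
# src/main/resources/META-INF/services/io.opentelemetry.sdk.autoconfigure.spi.ResourceProvider
potatoes.PotatoesResourceProvider
```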
<h2>OpenTelemetry Java instrumentation</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java Instrumentation</a> provides a java agent that instruments the application at runtime automatically for an extensive set of frameworks and libraries (see <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md">supported technologies</a>).</p>
<p>Using instrumentation means that the application bytecode and the embedded libraries are modified automatically to make them behave as if explicit modifications were made in their source code to call the OpenTelemetry SDK in order to create traces, spans and metrics.</p>
<p>When an application is deployed with the OpenTelemetry instrumentation agent, the cloud resource attribute providers for GCP and AWS are included but, since version 2.2.0, not enabled by default. You can enable them <a href="https://opentelemetry.io/docs/languages/java/automatic/configuration/#enable-resource-providers-that-are-disabled-by-default">through configuration</a> by setting the following properties:</p>
<ul>
<li>
<p>For AWS: <code>otel.resource.providers.aws.enabled=true</code></p>
</li>
<li>
<p>For GCP: <code>otel.resource.providers.gcp.enabled=true</code></p>
</li>
</ul>
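<p>For example, when attaching the instrumentation agent, the AWS provider can be enabled as a JVM system property (the agent path and application jar below are placeholders):</p>

```
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.resource.providers.aws.enabled=true \
     -jar my-app.jar
```

<p>The usual autoconfiguration mapping also applies, so the equivalent environment variable is OTEL_RESOURCE_PROVIDERS_AWS_ENABLED=true.</p>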
<h2>Elastic OpenTelemetry Java Distribution</h2>
<p>The Elastic OpenTelemetry Java distribution relies on the OpenTelemetry Java instrumentation, which we often refer to as the vanilla OpenTelemetry agent, and it thus inherits all of its features.</p>
<p>One major difference though is that the resource attributes providers for GCP and AWS are included and enabled by default to provide a better onboarding experience without extra configuration.</p>
<p>The minor cost is that application startup might be slightly slower, since the providers have to call an HTTP(S) endpoint. This overhead is usually negligible compared to overall application startup but can become significant for some setups.</p>
<p>In order to reduce the startup overhead, or when the cloud provider is known in advance, you can selectively disable unused provider implementations through configuration:</p>
<ul>
<li>
<p>For AWS: <code>otel.resource.providers.aws.enabled=false</code></p>
</li>
<li>
<p>For GCP: <code>otel.resource.providers.gcp.enabled=false</code></p>
</li>
</ul>
<h2>Conclusion</h2>
<p>In this blog post, we introduced what OpenTelemetry cloud resource attributes are and how they can be used and configured in application deployments using either the OpenTelemetry SDK/API or the instrumentation agent.</p>
<p>When using the Elastic OpenTelemetry Java distribution, those resource providers are automatically provided and enabled for an easy and simple onboarding experience.</p>
<p>Another very interesting aspect of the cloud resource attribute providers available in the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib">opentelemetry-java-contrib</a> repository is that they are maintained by their respective vendors (Google and Amazon). For the end user, this means those implementations should be well tested and robust to changes in the underlying infrastructure. For solution vendors like Elastic, it means we don't have to re-implement and reverse-engineer the infrastructure details of every cloud provider, which shows that investing in those common components is a net win for the broader OpenTelemetry community.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-java-automatic-cloud-resource-attributes/flexible-implementation-1680X980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql</link>
            <guid isPermaLink="false">opentelemetry-kubernetes-esql</guid>
            <pubDate>Wed, 01 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL enhances operational efficiency, data analysis, and issue resolution for SREs. This blog covers the advantages of ES|QL in Elastic Observability and how it can apply to managing issues instrumented with OpenTelemetry and running on Kubernetes.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.</p>
<p>As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.</p>
<p>In Elastic 8.11, a technical preview is now available of <a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">Elastic’s new piped query language, ES|QL (Elasticsearch Query Language)</a>, which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.</p>
<h2>Advantages of ES|QL for SREs</h2>
<p>SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:</p>
<ul>
<li><strong>Improved operational efficiency:</strong> By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.</li>
<li><strong>Enhanced analysis with insights:</strong> ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.</li>
<li><strong>Reduced mean time to resolution:</strong> ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.</li>
</ul>
<p>ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.</p>
<p>In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.</p>
<p>You can also check out our <a href="https://www.youtube.com/watch?v=vm0pBWI2l9c">Elastic Observability ES|QL Demo</a>, which walks through ES|QL functionality for Observability.</p>
<h2>ES|QL with AI Assistant</h2>
<p>As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-1-services.png" alt="1 - services" /></p>
<p>Using Elastic AI Assistant, you can easily ask for analysis, and in particular, we check on what the overall latency is across the application services.</p>
<pre><code class="language-plaintext">My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS
</code></pre>
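<p>For reference, an assistant-generated query for this prompt typically has roughly the following shape. The exact query the AI Assistant produces may differ; the field names below assume ECS mappings:</p>

```sql
FROM traces-apm*
| WHERE @timestamp >= NOW() - 1 hour
| STATS avg_latency = AVG(transaction.duration.us) BY service.name
| SORT avg_latency DESC
```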
<p>The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:</p>
<ul>
<li>load generator</li>
<li>front-end proxy</li>
<li>frontendservice</li>
<li>checkoutservice</li>
</ul>
<p>From a simple natural-language prompt, the AI Assistant generated a single ES|QL query that listed the latencies across the services.</p>
<p>Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through <strong>Elastic APM failure correlation</strong>, it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-2-failed-transaction.png" alt="2 - failed transaction" /></p>
<h2>ES|QL insightful and contextual analysis in Discover</h2>
<p>Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.</p>
<p>We use the following query in ES|QL in Elastic Discover:</p>
<pre><code class="language-sql">from metrics-*
| where kubernetes.container.status.last_terminated_reason != &quot;&quot; and kubernetes.namespace == &quot;default&quot;
| stats reason_count = count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count &gt; 0
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-3-two-horizontal-bar-graphs.png" alt="3 - horizontal graph" /></p>
<p>ES|QL analyzes thousands of metric events from Kubernetes and highlights two services that are restarting because they were OOMKilled.</p>
<p>The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-4-understanding-oomkilled.png" alt="4 - understanding oomkilled" /></p>
<p>We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-5-split-bar-graphs.png" alt="5 - split bar graphs" /></p>
<p>The query shows that average memory usage is fairly high for both services.</p>
<p>We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.</p>
<h2>Actionable alerts with ES|QL</h2>
<p>Suspecting that this issue might recur, we create an alert based on the ES|QL query we just ran, tracking any service that exceeds 50% memory utilization.</p>
<p>We modify the last query to find any service with high memory usage:</p>
<pre><code class="language-sql">FROM metrics*
| WHERE @timestamp &gt;= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name | where avg_memory_usage &gt; .5
</code></pre>
<p>With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect it to PagerDuty here, but we could choose from multiple connectors such as ServiceNow, Opsgenie, or email.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-6-create-rule.png" alt="6 - create rule" /></p>
<p>With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.</p>
<h2>Make the most of your data with ES|QL</h2>
<p>In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a> now in technical preview.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/demo-gallery/observability">Elastic Observability Tour</a></li>
<li><a href="https://www.elastic.co/blog/log-management-observability-operations">The power of effective log management</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Transforming Observability with the AI Assistant</a></li>
<li><a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">ES|QL announcement blog</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/ES_QL_blog-720x420-05.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Independence with OpenTelemetry on Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-observability</link>
            <guid isPermaLink="false">opentelemetry-observability</guid>
            <pubDate>Tue, 15 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry has become a key component for observability given its open standards and developer-friendly tools. See how easily Elastic Observability integrates with OTel to provide a platform that minimizes vendor lock-in and maximizes flexibility.]]></description>
            <content:encoded><![CDATA[<p>The drive for faster, more scalable services is on the rise. Our day-to-day lives depend on apps, from a food delivery app to have your favorite meal delivered, to your banking app to manage your accounts, to even apps to schedule doctor’s appointments. These apps need to be able to grow from not only a features standpoint but also in terms of user capacity. The scale and need for global reach drives increasing complexity for these high-demand cloud applications.</p>
<p>In order to keep pace with demand, most of these online apps and services (for example, mobile applications, web pages, SaaS) are moving to a distributed microservice-based architecture and Kubernetes. Once you’ve migrated your app to the cloud, how do you manage and monitor production, scale, and availability of the service? <a href="https://opentelemetry.io/">OpenTelemetry</a> is quickly becoming the de facto standard for instrumentation and collecting application telemetry data for Kubernetes applications.</p>
<p><a href="https://www.elastic.co/what-is/opentelemetry">OpenTelemetry (OTel)</a> is an open source project providing a collection of tools, APIs, and SDKs that can be used to generate, collect, and export telemetry data (metrics, logs, and traces) to understand software performance and behavior. OpenTelemetry recently became a CNCF incubating project and has a significant amount of growing community and vendor support.</p>
<p>While OTel provides a standard way to instrument applications and a standard telemetry format, it doesn’t provide any backend or analytics components. Hence, using OTel libraries for application, infrastructure, and user experience monitoring gives you the flexibility to choose the <a href="https://www.elastic.co/observability">observability tool</a> that fits your needs. There is no longer any vendor lock-in for application performance monitoring (APM).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-1.png" alt="" /></p>
<p>Elastic Observability natively supports OpenTelemetry and its OpenTelemetry protocol (OTLP) to ingest traces, metrics, and logs. All of Elastic Observability’s APM capabilities are available with OTel data. Hence the following capabilities (and more) are available for OTel data:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you will now be able to use Elastic’s powerful machine learning capabilities to accelerate analysis, and its alerting to help reduce MTTR.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-2.png" alt="" /></p>
<p>Given its open source heritage, Elastic also supports other CNCF based projects, such as Prometheus, Fluentd, Fluent Bit, Istio, Kubernetes (K8S), and many more.</p>
<p>This blog will show:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a> through a few easy steps</li>
<li>Highlight some of the Elastic APM capabilities and features around OTel data and what you can do with this data once it’s in Elastic</li>
</ul>
<p>In follow-up blogs, we will detail how to use Elastic’s machine learning with OTel telemetry data, how to instrument OTel application metrics for specific languages, how we can support Prometheus ingest through the OTel collector, and more. Stay tuned!</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.</li>
<li>Additionally, we are using an OTel manually instrumented version of the application. No OTel automatic instrumentation was used in this blog configuration.</li>
<li>Location of our clusters. While we used Google Kubernetes Engine (GKE), you can use any Kubernetes platform of your choice.</li>
<li>While Elastic can ingest telemetry directly from OTel instrumented services, we will focus on the more traditional deployment, which uses the OpenTelemetry Collector.</li>
<li>Prometheus and Fluentd/Fluent Bit, traditionally used to pull Kubernetes data, are not used in this configuration. Follow-up blogs will showcase this.</li>
</ul>
<p>Here is the configuration we will get set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-3.png" alt="Configuration to ingest OpenTelemetry data used in this blog" /></p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through setting up an <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a>:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up a GKE cluster</li>
<li>Bringing up the application</li>
<li>Configuring Kubernetes OTel Collector configmap to point to Elastic Cloud</li>
<li>Using Elastic Observability APM with OTel data for improved visibility</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-4.png" alt="" /></p>
<h3>Step 1: Bring up a K8S cluster</h3>
<p>We used Google Kubernetes Engine (GKE), but you can use any Kubernetes platform of your choice.</p>
<p>There are no special requirements for Elastic to collect OpenTelemetry data from a Kubernetes cluster. Any normal Kubernetes cluster on GKE, EKS, AKS, or Kubernetes compliant cluster (self-deployed and managed) works.</p>
<h3>Step 2: Load the OpenTelemetry demo application on the cluster</h3>
<p>Get your application on a Kubernetes cluster in your cloud service of choice or local Kubernetes platform. The application I am using is available <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a>.</p>
<p>First clone the directory locally:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/opentelemetry-demo.git
</code></pre>
<p>(Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.)</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file in elastic/opentelemetry-demo configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: the Elastic APM Server endpoint</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-apm-agents.png" alt="elastic apm agents" /></p>
<p>Once you obtain this, the first step is to create a secret key on the cluster with your Elastic APM server endpoint, and your APM Secret Token with the following instruction:</p>
<pre><code class="language-bash">kubectl create secret generic elastic-secret \
  --from-literal=elastic_apm_endpoint='YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX' \
  --from-literal=elastic_apm_secret_token='YOUR_APM_SECRET_TOKEN'
</code></pre>
<p>Don't forget to replace:</p>
<ul>
<li>YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX: your Elastic APM endpoint ( <strong>without https:// prefix</strong> ) with OTEL_EXPORTER_OTLP_ENDPOINT</li>
<li>YOUR_APM_SECRET_TOKEN: your Elastic APM secret token OTEL_EXPORTER_OTLP_HEADERS</li>
</ul>
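<p>The Helm chart then wires these secret values into the collector's OTLP exporter configuration. As a rough sketch of what that exporter section looks like (the exporter key and environment variable names below are assumptions, not the exact demo config):</p>

```yaml
# Sketch of an OpenTelemetry Collector exporter pointing at Elastic APM.
# ELASTIC_APM_ENDPOINT / ELASTIC_APM_SECRET_TOKEN are assumed to be injected
# from the elastic-secret created above.
exporters:
  otlp/elastic:
    endpoint: "https://${env:ELASTIC_APM_ENDPOINT}"
    headers:
      Authorization: "Bearer ${env:ELASTIC_APM_SECRET_TOKEN}"
```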
<p>Now execute the following commands:</p>
<pre><code class="language-bash"># switch to the kubernetes/elastic-helm directory
cd kubernetes/elastic-helm

# add the open-telemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# deploy the demo through helm install
helm install -f values.yaml my-otel-demo open-telemetry/opentelemetry-demo
</code></pre>
<p>Once your application is up on Kubernetes, you will have the following pods (or some variant) running on the <strong>default</strong> namespace.</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Output should be similar to the following:</p>
<pre><code class="language-bash">NAME                                                  READY   STATUS    RESTARTS      AGE
my-otel-demo-accountingservice-5c77754b4f-vwph6       1/1     Running   0             5d4h
my-otel-demo-adservice-6b8b7c7dc5-mb7j5               1/1     Running   0             5d4h
my-otel-demo-cartservice-76d94b7dcd-2g4lf             1/1     Running   0             5d4h
my-otel-demo-checkoutservice-988bbdb88-hmkrp          1/1     Running   0             5d4h
my-otel-demo-currencyservice-6cf4b5f9f6-vz9t2         1/1     Running   0             5d4h
my-otel-demo-emailservice-868c98fd4b-lpr7n            1/1     Running   6 (18h ago)   5d4h
my-otel-demo-featureflagservice-8446ff9c94-lzd4w      1/1     Running   0             5d4h
my-otel-demo-ffspostgres-867945d9cf-zzwd7             1/1     Running   0             5d4h
my-otel-demo-frauddetectionservice-5c97c589b9-z8fhz   1/1     Running   0             5d4h
my-otel-demo-frontend-d85ccf677-zg9fp                 1/1     Running   0             5d4h
my-otel-demo-frontendproxy-6c5c4fccf6-qmldp           1/1     Running   0             5d4h
my-otel-demo-kafka-68bcc66794-dsbr6                   1/1     Running   0             5d4h
my-otel-demo-loadgenerator-64c545b974-xfccq           1/1     Running   1 (36h ago)   5d4h
my-otel-demo-otelcol-fdfd9c7cf-6lr2w                  1/1     Running   0             5d4h
my-otel-demo-paymentservice-7955c68859-ff7zg          1/1     Running   0             5d4h
my-otel-demo-productcatalogservice-67c879657b-wn2wj   1/1     Running   0             5d4h
my-otel-demo-quoteservice-748d754ffc-qcwm4            1/1     Running   0             5d4h
my-otel-demo-recommendationservice-df78894c7-lwm5v    1/1     Running   0             5d4h
my-otel-demo-redis-7d48567546-h4p4t                   1/1     Running   0             5d4h
my-otel-demo-shippingservice-f6fc76ddd-2v7qv          1/1     Running   0             5d4h
</code></pre>
<h3>Step 3: Open Kibana and use the APM Service Map to view your OTel instrumented Services</h3>
<p>In the Elastic Observability UI under APM, select Service Map to see your services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-APM.png" alt="elastic observability APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-OTEL-service-map.png" alt="elastic observability OTEL service map" /></p>
<p>If you are seeing this, then the OpenTelemetry Collector is sending data into Elastic:</p>
<p><em>Congratulations, you've deployed the OpenTelemetry demo application and successfully ingested its telemetry data into Elastic!</em></p>
<h3>Step 4: What can Elastic show me?</h3>
<p>Now that the OpenTelemetry data is ingested into Elastic, what can you do?</p>
<p>First, you can view the APM service map (as shown in the previous step) — this will give you a full view of all the services and the transaction flows between services.</p>
<p>Next, you can now check out individual services and the transactions being collected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-overview.png" alt="elastic observability frontend overview" /></p>
<p>As you can see, the frontend details are listed. Everything from:</p>
<ul>
<li>Average service latency</li>
<li>Throughput</li>
<li>Main transactions</li>
<li>Failed transaction rate</li>
<li>Errors</li>
<li>Dependencies</li>
</ul>
<p>Let’s get to the trace. In the Transactions tab, you can review all the types of transactions related to the frontend service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-transactions.png" alt="elastic observability frontend transactions" /></p>
<p>Selecting the HTTP POST transaction, we can see the full trace with all the spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-HTTP-POST.png" alt="Average latency for this transaction, throughput, any failures, and of course the trace!" /></p>
<p>Not only can you review the trace, but you can also analyze what is contributing to higher-than-normal latency for HTTP POST.</p>
<p>Elastic uses machine learning to help identify any potential latency issues across the services from the trace. It’s as simple as selecting the Latency Correlations tab and running the correlation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-correlations.png" alt="elastic observability latency correlations" /></p>
<p>This shows that the high latency transactions are occurring in checkout service with a medium correlation.</p>
<p>You can then drill down into logs directly from the trace view and review the logs associated with the trace to help identify and pinpoint potential issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-distribution.png" alt="elastic observability latency distribution" /></p>
<h3>Analyze your data with Elastic machine learning (ML)</h3>
<p>Once OpenTelemetry metrics are in Elastic, start analyzing your data through Elastic’s ML capabilities.</p>
<p>A great review of these features can be found here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM telemetry to determine root causes in transactions</a>. And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>. We’ll follow up with additional blogs on leveraging Elastic’s machine learning capabilities for OpenTelemetry data.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you ingest and analyze OpenTelemetry data with Elastic’s APM capabilities.</p>
<p>A quick recap of what we covered:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a>, through a few easy steps</li>
<li>Highlight some of the Elastic APM capabilities and features around OTel data and what you can do with this once it’s in Elastic</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/illustration-scalability-gear-1680x980_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing cloud resources and cost with APM metadata in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/optimize-cloud-resources-apm-observability</link>
            <guid isPermaLink="false">optimize-cloud-resources-apm-observability</guid>
            <pubDate>Wed, 16 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Optimize cloud costs with Elastic APM. Learn how to leverage cloud metadata, calculate pricing, and make smarter decisions for better performance.]]></description>
            <content:encoded><![CDATA[<p>Application performance monitoring (APM) is much more than capturing and tracking errors and stack traces. Today’s cloud-based businesses deploy applications across various regions and even cloud providers. So, harnessing the power of metadata provided by the Elastic APM agents becomes more critical. Leveraging the metadata, including crucial information like cloud region, provider, and machine type, allows us to track costs across the application stack. In this blog post, we look at how we can use cloud metadata to empower businesses to make smarter and cost-effective decisions, all while improving resource utilization and the user experience.</p>
<p>First, we need an example application that allows us to monitor infrastructure changes effectively. We use a Python Flask application with the Elastic Python APM agent. The application is a simple calculator taking the numbers as a REST request. We utilize Locust — a simple load-testing tool to evaluate performance under varying workloads.</p>
<p>The next step includes obtaining the pricing information associated with the cloud services. Every cloud provider is different. Most of them offer an option to retrieve pricing through an API. But today, we will focus on Google Cloud and will leverage their pricing calculator to retrieve relevant cost information.</p>
<h2>The calculator and Google Cloud pricing</h2>
<p>To perform a cost analysis, we need to know the cost of the machines in use. Google provides a billing <a href="https://cloud.google.com/billing/v1/how-tos/catalog-api">API</a> and <a href="https://cloud.google.com/billing/docs/reference/libraries#client-libraries-install-python">Client Library</a> to fetch the necessary data programmatically. In this blog, we are not covering the API approach. Instead, the <a href="https://cloud.google.com/products/calculator">Google Cloud Pricing Calculator</a> is enough. Select the machine type and region in the calculator and set the count to 1 instance. It will then report the total estimated cost for this machine. Doing this for an e2-standard-4 machine type results in 107.7071784 US$ for a runtime of 730 hours.</p>
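<p>The per-hour and per-minute prices used later in this post follow directly from that monthly estimate, since the calculator assumes a 730-hour month:</p>

```python
# Derive hourly and per-minute prices from the calculator's monthly estimate
# (730 hours is the month length the Google pricing calculator assumes).
monthly = 107.7071784           # US$ per month, e2-standard-4 in europe-west4
hourly = monthly / 730          # US$ 0.14754408 per hour (rounded)
per_minute = hourly / 60        # US$ 0.002459068 per minute (rounded)

print(round(hourly, 8), round(per_minute, 9))
```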
<p>Now, let’s go to our Kibana® where we will create a new index inside Dev Tools. Since we don’t want to analyze text, we tell Elasticsearch® to treat every string field as a keyword. The index name is cloud-billing. If I later want to add Azure and AWS pricing, I can append it to the same index.</p>
<pre><code class="language-bash">PUT cloud-billing
{
  &quot;mappings&quot;: {
    &quot;dynamic_templates&quot;: [
      {
        &quot;stringsaskeywords&quot;: {
          &quot;match&quot;: &quot;*&quot;,
          &quot;match_mapping_type&quot;: &quot;string&quot;,
          &quot;mapping&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          }
        }
      }
    ]
  }
}
</code></pre>
<p>Next up is crafting our billing document:</p>
<pre><code class="language-bash">POST cloud-billing/_doc/e2-standard-4_europe-west4
{
  &quot;machine&quot;: {
    &quot;enrichment&quot;: &quot;e2-standard-4_europe-west4&quot;
  },
  &quot;cloud&quot;: {
    &quot;machine&quot;: {
       &quot;type&quot;: &quot;e2-standard-4&quot;
    },
    &quot;region&quot;: &quot;europe-west4&quot;,
    &quot;provider&quot;: &quot;google&quot;
  },
  &quot;stats&quot;: {
    &quot;cpu&quot;: 4,
    &quot;memory&quot;: 8
  },
  &quot;price&quot;: {
    &quot;minute&quot;: 0.002459068,
    &quot;hour&quot;: 0.14754408,
    &quot;month&quot;: 107.7071784
  }
}
</code></pre>
<p>We create a document and set a custom ID. This ID combines the instance name and the region, since a machine's cost may differ per region. Automatic IDs could be problematic because I might want to update a machine’s cost regularly. I could use a timestamped index for that and only ever use the latest matching document, but this way I can simply overwrite the document and not worry about it. I calculated the price down to minute and hour rates as well. The most important thing is the machine.enrichment field, which is the same as the ID. The same instance type can exist in multiple regions, but our enrich processor supports only match or range, so we create a name that can match explicitly, as in e2-standard-4_europe-west4. It’s up to you to decide whether you want the cloud provider in there and make it google_e2-standard-4_europe-west4.</p>
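<p>The enrichment-key convention is trivial to generate programmatically. A minimal helper (hypothetical, just illustrating the naming scheme described above) could look like this:</p>

```python
def enrichment_key(machine_type, region, provider=None):
    # Build the machine.enrichment value, also used as the document _id.
    # Including the provider prefix is optional, as discussed above.
    parts = [provider, machine_type, region] if provider else [machine_type, region]
    return '_'.join(parts)

print(enrichment_key('e2-standard-4', 'europe-west4'))
# e2-standard-4_europe-west4
print(enrichment_key('e2-standard-4', 'europe-west4', provider='google'))
# google_e2-standard-4_europe-west4
```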
<h2>Calculating the cost</h2>
<p>There are multiple ways of achieving this in the Elastic Stack. In this case, we will use an enrich policy, ingest pipeline, and transform.</p>
<p>The enrich policy is rather easy to set up:</p>
<pre><code class="language-bash">PUT _enrich/policy/cloud-billing
{
  &quot;match&quot;: {
    &quot;indices&quot;: &quot;cloud-billing&quot;,
    &quot;match_field&quot;: &quot;machine.enrichment&quot;,
    &quot;enrich_fields&quot;: [&quot;price.minute&quot;, &quot;price.hour&quot;, &quot;price.month&quot;]
  }
}

POST _enrich/policy/cloud-billing/_execute
</code></pre>
<p>Don’t forget to run the _execute call at the end. This is necessary to create the internal index that the enrich processor in the ingest pipeline reads from. The ingest pipeline is rather minimalistic — it calls the enrichment and renames a field. This is where our machine.enrichment field comes in. One caveat around enrichment is that whenever you add new documents to the cloud-billing index, you need to rerun the _execute statement. The final script processor calculates the total cost from the count of unique machines seen.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/cloud-billing
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;_temp.machine_type&quot;,
        &quot;value&quot;: &quot;{{cloud.machine.type}}_{{cloud.region}}&quot;
      }
    },
    {
      &quot;enrich&quot;: {
        &quot;policy_name&quot;: &quot;cloud-billing&quot;,
        &quot;field&quot;: &quot;_temp.machine_type&quot;,
        &quot;target_field&quot;: &quot;enrichment&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;enrichment.price&quot;,
        &quot;target_field&quot;: &quot;price&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_temp&quot;,
          &quot;enrichment&quot;
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;ctx.total_price=ctx.count_machines*ctx.price.hour&quot;
      }
    }
  ]
}
</code></pre>
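<p>Before wiring the pipeline into a transform, you can sanity-check it with the simulate API. This is a sketch: the enrich policy must have been executed first, and the sample document carries only the fields the pipeline needs:</p>

```bash
POST _ingest/pipeline/cloud-billing/_simulate
{
  "docs": [
    {
      "_source": {
        "cloud": {
          "machine": { "type": "e2-standard-4" },
          "region": "europe-west4",
          "provider": "google"
        },
        "count_machines": 2
      }
    }
  ]
}
```

<p>The response should show the enriched price fields and a total_price of twice the hourly rate.</p>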
<p>Since this is all configured now, we are ready for our Transform. For this, we need a data view that matches the APM data streams: traces-apm*, metrics-apm.*, and logs-apm.*. For the Transform, go to the Transform UI in Kibana and configure it in the following way:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-1-transform-configuration.png" alt="transform configuration" /></p>
<p>We are doing an hourly breakdown; therefore, we get a document per service, per hour, per machine type. The interesting bit is the aggregations: I want to see the average CPU usage as well as the 75th, 95th, and 99th percentiles, which lets me see how CPU usage is distributed within each hour. At the bottom, give the transform a name, select cloud-costs as the destination index, and select the cloud-billing ingest pipeline.</p>
<p>Here is the entire transform as a JSON document:</p>
<pre><code class="language-bash">PUT _transform/cloud-billing
{
  &quot;source&quot;: {
    &quot;index&quot;: [
      &quot;traces-apm*&quot;,
      &quot;metrics-apm.*&quot;,
      &quot;logs-apm.*&quot;
    ],
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;bool&quot;: {
              &quot;should&quot;: [
                {
                  &quot;exists&quot;: {
                    &quot;field&quot;: &quot;cloud.provider&quot;
                  }
                }
              ],
              &quot;minimum_should_match&quot;: 1
            }
          }
        ]
      }
    }
  },
  &quot;pivot&quot;: {
    &quot;group_by&quot;: {
      &quot;@timestamp&quot;: {
        &quot;date_histogram&quot;: {
          &quot;field&quot;: &quot;@timestamp&quot;,
          &quot;calendar_interval&quot;: &quot;1h&quot;
        }
      },
      &quot;cloud.provider&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.provider&quot;
        }
      },
      &quot;cloud.region&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.region&quot;
        }
      },
      &quot;cloud.machine.type&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.machine.type&quot;
        }
      },
      &quot;service.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;service.name&quot;
        }
      }
    },
    &quot;aggregations&quot;: {
      &quot;avg_cpu&quot;: {
        &quot;avg&quot;: {
          &quot;field&quot;: &quot;system.cpu.total.norm.pct&quot;
        }
      },
      &quot;percentiles_cpu&quot;: {
        &quot;percentiles&quot;: {
          &quot;field&quot;: &quot;system.cpu.total.norm.pct&quot;,
          &quot;percents&quot;: [
            75,
            95,
            99
          ]
        }
      },
      &quot;avg_transaction_duration&quot;: {
        &quot;avg&quot;: {
          &quot;field&quot;: &quot;transaction.duration.us&quot;
        }
      },
      &quot;percentiles_transaction_duration&quot;: {
        &quot;percentiles&quot;: {
          &quot;field&quot;: &quot;transaction.duration.us&quot;,
          &quot;percents&quot;: [
            75,
            95,
            99
          ]
        }
      },
      &quot;count_machines&quot;: {
        &quot;cardinality&quot;: {
          &quot;field&quot;: &quot;cloud.instance.id&quot;
        }
      }
    }
  },
  &quot;dest&quot;: {
    &quot;index&quot;: &quot;cloud-costs&quot;,
    &quot;pipeline&quot;: &quot;cloud-billing&quot;
  },
  &quot;sync&quot;: {
    &quot;time&quot;: {
      &quot;delay&quot;: &quot;120s&quot;,
      &quot;field&quot;: &quot;@timestamp&quot;
    }
  },
  &quot;settings&quot;: {
    &quot;max_page_search_size&quot;: 1000
  }
}
</code></pre>
<p>Once the transform is created and running, we need a Kibana data view for the index cloud-costs. For the transaction duration fields, use the custom field formatter inside Kibana and set the format to “Duration” with “microseconds” as the input unit.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-2-cloud-costs.png" alt="cloud costs" /></p>
<p>With that, everything is arranged and ready to go.</p>
<h2>Observing infrastructure changes</h2>
<p>Below I created a dashboard that allows us to identify:</p>
<ul>
<li>How much cost a certain service generates</li>
<li>CPU usage</li>
<li>Memory usage</li>
<li>Transaction duration</li>
<li>Cost-saving potential</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-3-graphs.png" alt="graphs" /></p>
<p>From left to right, we want to focus on the very first chart. We have the bars representing the CPU as average in green and 95th percentile in blue on top. It goes from 0 to 100% and is normalized, meaning that even with 8 CPU cores, it will still read 100% usage and not 800%. The line graph represents the transaction duration, the average being in red, and the 95th percentile in purple. Last, we have the orange area at the bottom, which is the average memory usage on that host.</p>
<p>We immediately realize that our calculator does not need a lot of memory. Hovering over the graph reveals 2.89% memory usage. The e2-standard-8 machine that we are using has 32 GB of memory. We occasionally spike to 100% CPU in the 95th percentile. When this happens, the average transaction duration spikes to 2.5 milliseconds. Meanwhile, this machine costs us roughly 30 cents every hour. Using this information, we can now downsize to a better fit. The average CPU usage is around 11-13%, and the 95th percentile is not that far away.</p>
<p>Because we are using 8 CPUs, one could now say that 12.5% represents a full core, but that is just an assumption on paper. Nonetheless, we know there is a lot of headroom, and we can downscale quite a bit. In this case, I decided to go to 2 CPUs and 2 GB of RAM, known as e2-highcpu-2. This should fit my calculator application better. We barely touched the RAM: 2.89% of 32 GB is roughly 1 GB in use. After the change and reboot of the calculator machine, I started the same Locust test to identify my CPU usage and, more importantly, whether my transactions get slower, and if so, by how much. Ultimately, I want to decide whether 1 millisecond more latency is worth 10 more cents per hour. I added the change as an annotation in Lens.</p>
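<p>The memory claim is simple arithmetic: 2.89% of the 32 GB machine is well under a single gigabyte actually in use:</p>

```python
# Convert the observed memory-usage percentage into absolute gigabytes.
total_gb = 32
used_pct = 2.89
used_gb = total_gb * used_pct / 100
print(f'{used_gb:.2f} GB in use')  # 0.92 GB in use
```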
<p>After letting it run for a bit, we can now identify the smaller host's impact. In this case, we can see that the average did not change. However, the 95th percentile — as in 95% of all transactions are below this value — did spike up. It looks bad at first, but checking closely, it went from ~1.5 milliseconds to ~2.10 milliseconds, a ~0.6 millisecond increase. Now, you can decide whether avoiding that 0.6 millisecond increase is worth paying ~$180 more per month, or if the new latency is good enough.</p>
<h2>Conclusion</h2>
<p>Observability is more than just collecting logs, metrics, and traces. Linking user experience to cloud costs allows your business to identify areas where you can save money. Having the right tools at your disposal will help you generate those insights quickly. Making informed decisions about how to optimize your cloud cost and ultimately improve the user experience is the bottom-line goal.</p>
<p>The dashboard and data view can be found in my <a href="https://github.com/philippkahr/blogs/tree/main/apm-cost-optimisation">GitHub repository</a>. You can download the .ndjson file and import it using the Saved Objects inside Stack Management in Kibana.</p>
<h2>Caveats</h2>
<p>Pricing covers only the base machine: it excludes disks, static public IP addresses, and other additional costs such as operating system licenses. Furthermore, it excludes spot pricing, discounts, and free credits. Data transfer costs between services are also not included. We calculate cost purely from the rate of the service running; we are not checking billing intervals from Google Cloud, so in our case we effectively bill per minute regardless of what Google Cloud does. Counting unique instance.ids works as intended. However, a machine that is only running for one minute is still calculated at the hourly rate, so a machine running for one minute costs the same as one running for 50 minutes — at least in the way we calculate it. The transform uses calendar hour intervals; therefore, it's 8 am-9 am, 9 am-10 am, and so on.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/illustration-out-of-box-data-vis-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Find answers quickly, correlate OpenTelemetry traces with existing ECS logs in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-ecs-unification-elastic</link>
            <guid isPermaLink="false">otel-ecs-unification-elastic</guid>
            <pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog we will discuss how EDOT enables you to collect existing ECS logs while ensuring a seamless and transparent move to OTel semantic conventions. The key benefit is that applications can continue sending logs as they do today, which minimizes the effort and impact on application developers.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is the undisputed standard for vendor-neutral instrumentation. However, most established organizations don't start from a blank slate. You likely have a mature ecosystem of applications already logging in Elastic Common Schema (ECS), supported by years of refined dashboards and alerting rules.</p>
<p><strong>The challenge is clear:</strong> How do you adopt OTel’s unified observability without abandoning your proven ECS-based logging?</p>
<p>In this guide, we’ll demonstrate how to bridge this gap using the <strong>Elastic Distribution of OpenTelemetry (EDOT)</strong>. We will first show you how to leverage the EDOT Collector to ingest your logs into Elasticsearch, ensuring a seamless transition that unlocks the full power of OTel’s distributed tracing without breaking your current workflows.</p>
<p>Once the data is flowing, we will explore how Elasticsearch's underlying mapping architecture ensures that your existing filters and visualizations remain fully functional through two key features:</p>
<ul>
<li>
<p><strong>Field Aliases:</strong> We’ll explain how Elastic uses aliases to ensure that legacy dashboards looking for <code>log.level</code> (ECS) still work perfectly, even as your new telemetry arrives as <code>severity_text</code> (OTel).</p>
</li>
<li>
<p><strong>Passthrough Fields:</strong> We’ll show how Elastic’s native OTel mapping structures use passthrough fields to handle OTel attributes. This ensures your data remains searchable and performant without the need for complex, manual schema migrations.</p>
</li>
</ul>
<p>By combining EDOT for ingestion with these intelligent mapping structures, you can maintain your existing Java ECS logging while evolving toward a unified, OTel-native future.</p>
<h2>The ECS Foundation</h2>
<p>We begin with a Java application using <strong>Log4j2</strong> and the <strong>ecs-java-plugin</strong>. This setup generates structured JSON logs in the <a href="https://www.elastic.co/docs/reference/ecs">Elastic Common Schema (ECS)</a>, which Elastic handles natively by leveraging the ECS logging plugins that integrate easily with common logging libraries across various programming languages.</p>
<p>The following <strong>Log4j2 Configuration Extract</strong> assumes that the Log4j2 dependencies have already been configured to include the required ECS plugin libraries:</p>
<pre><code class="language-xml">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;Configuration status=&quot;DEBUG&quot;&gt;
    &lt;Appenders&gt;
        &lt;Console name=&quot;LogToConsole&quot; target=&quot;SYSTEM_OUT&quot;&gt;
            &lt;EcsLayout serviceName=&quot;logger-app&quot; serviceVersion=&quot;v1.0.0&quot;/&gt;
        &lt;/Console&gt;
    &lt;/Appenders&gt;
    &lt;Loggers&gt;
        &lt;Root level=&quot;info&quot;&gt;
            &lt;AppenderRef ref=&quot;LogToConsole&quot;/&gt;
        &lt;/Root&gt;
    &lt;/Loggers&gt;
&lt;/Configuration&gt;
</code></pre>
<p><strong>Note:</strong> We will come back to the <code>&lt;EcsLayout serviceName=&quot;logger-app&quot; serviceVersion=&quot;v1.0.0&quot;/&gt;</code> setting later in this article. With Kubernetes deployments, these values can be automatically populated by the EDOT Collector, and the setting can be simplified to <code>&lt;EcsLayout/&gt;</code>.</p>
<h2>Introducing the Elastic Distribution of OpenTelemetry (EDOT)</h2>
<p>The <a href="https://www.elastic.co/docs/reference/opentelemetry">Elastic Distribution of OpenTelemetry (EDOT)</a> is more than just a repackaging; it is a curated set of OTel components (Collector and SDKs) optimized for Elastic Observability. Released in v8.15, it allows you to collect traces, metrics, and logs using standard OTel receivers while benefiting from Elastic-contributed enhancements like powerful log parsing and Kubernetes metadata enrichment.</p>
<p>EDOT's Primary Benefits:</p>
<p><strong>Deliver Enhanced Features Earlier:</strong> Provides features not yet available in &quot;vanilla&quot; OTel components, which Elastic continuously contributes upstream.</p>
<p><strong>Enhanced OTel Support:</strong> Offers enterprise-grade support and maintenance for fixes outside of standard OTel release cycles.</p>
<p>The question then becomes: How can users transition their ingestion architecture to an OTel-native approach while maintaining the ability to collect logs in ECS format?</p>
<p>This involves replacing classic collection and instrumentation components (like Elastic Agent and the Elastic APM Java Agent). Let us show you how this can be done step by step with the full suite of components provided by EDOT. A comprehensive view of the EDOT architecture components in Kubernetes is shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/architecture.png" alt="EDOT reference architecture in K8s" /></p>
<p>In a Kubernetes environment, EDOT components are typically installed via an OTel Operator and HELM chart. The main components are:</p>
<ul>
<li><strong>EDOT Collector Cluster:</strong> deployment used to collect cluster-wide metrics.</li>
<li><strong>EDOT Collector Daemon:</strong> daemonset used to collect node metrics, logs, and application telemetry data.</li>
<li><strong>EDOT Collector Gateway:</strong> performs pre-processing, aggregation, and ingestion of data into Elastic.</li>
</ul>
<p>Elastic provides a curated configuration file for all the EDOT components, available as part of the OpenTelemetry Operator using the <code>opentelemetry-kube-stack</code> Helm chart and downloadable from <a href="https://github.com/elastic/elastic-agent/blob/main/deploy/helm/edot-collector/kube-stack/values.yml">here</a>.</p>
<h2>Achieving Correlation: SDK + Logging Context</h2>
<p>To link a log line to a specific trace, the <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java SDK</a> performs a &quot;handshake&quot; with your logging library.
When a trace is active, the SDK extracts the <code>trace_id</code> and <code>span_id</code> and injects them into the <strong>Mapped Diagnostic Context (MDC)</strong> of Log4j2. Even though your logs are in ECS format, they now carry the OTel DNA required for correlation.
While the EDOT SDK can collect logs directly, a generally more resilient approach is to stick to file collection. This is important because if the OTel Collector is down, logs written to a file are buffered locally on the disk, preventing the data loss that can occur if the SDK's in-memory queue reaches its limit and starts discarding new logs. For an in-depth discussion on this topic we refer to the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#log-instrumentation">OpenTelemetry Documentation</a>.</p>
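<p>For illustration (this is a hand-written sample, not captured output), an ECS log line emitted while a trace is active carries the correlation IDs alongside the usual ECS fields:</p>

```json
{
  "@timestamp": "2025-12-04T10:15:30.123Z",
  "log.level": "INFO",
  "message": "order processed",
  "ecs.version": "1.2.0",
  "service.name": "logger-app",
  "service.version": "v1.0.0",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331"
}
```

<p>The <code>trace_id</code> and <code>span_id</code> values shown are the standard W3C Trace Context example IDs, used here purely as placeholders.</p>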
<h2>Zero-Code Instrumentation</h2>
<p>The EDOT Java SDK is a customized version of the OpenTelemetry Java Agent. In Kubernetes, zero-code Java autoinstrumentation is supported by adding an <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/setup/k8s">annotation</a> in the pod template configuration in the deployment manifest:</p>
<pre><code class="language-yml">apiVersion: apps/v1
kind: Deployment
...
spec:
  ..
  template:
    metadata:
      # Auto-Instrumentation
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;opentelemetry-operator-system/elastic-instrumentation&quot;
</code></pre>
<h2>Collecting and Processing Logs with the EDOT Collector</h2>
<p>This is the most critical step. Our logs are now JSON, they are in the console output, and they contain trace IDs. Now, we need the EDOT Collector to pick them up and map them to the <strong>OpenTelemetry Log Data Model</strong>.</p>
<h3>EDOT Collector Configuration: Dynamic Workload Discovery and filelog receiver</h3>
<p>Applications running on containers become moving targets for monitoring systems. To handle this, we rely on <a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery on Kubernetes</a>. This allows the EDOT Collector to track pod lifecycles and dynamically attach log collection configurations based on specific annotations relying on the <code>k8s_observer</code> and the <code>receiver_creator</code> component.</p>
<p>In our example, we have a Deployment with a Pod consisting of one container. We use Kubernetes annotations to:</p>
<ol>
<li>
<p>Enable auto-instrumentation (Java).</p>
</li>
<li>
<p>Enable log collection for this pod.</p>
</li>
<li>
<p>Instruct the collector to parse the output as JSON immediately (json-parser configuration).</p>
</li>
<li>
<p>Add custom resource attributes (e.g., identify the application's source language).</p>
</li>
</ol>
<h4>Deployment Manifest Example</h4>
<pre><code class="language-yml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-app-deployment
  labels:
    app: logger-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logger-app
  template:
    metadata:
      annotations:
        # 1. Turn on Auto-Instrumentation
        instrumentation.opentelemetry.io/inject-java: &quot;opentelemetry-operator-system/elastic-instrumentation&quot;
        # 2. Enable Log Collection for this pod
        io.opentelemetry.discovery.logs/enabled: &quot;true&quot;
        # 3. Provide the parsing &quot;hint&quot; (Treat logs as JSON)
        io.opentelemetry.discovery.logs.ecs-log-producer/config: |
            operators:
            - type: container
              id: container-parser
            - type: json_parser
              id: json-parser
        # 4. Identify this application as Java (to allow for user interface rendering in Kibana)
        resource.opentelemetry.io/telemetry.sdk.language: &quot;java&quot;
      ...
</code></pre>
<p>This setup provides a bare-minimum configuration for ingesting ECS library logs.
Crucially, it decouples log collection from application logic. Developers simply need to provide a hint via annotations that their logs are in JSON format (structurally guaranteed by the ECS libraries). We then define the standardized enrichment and processing rules centrally at the <a href="https://www.elastic.co/docs/reference/edot-collector/components">processor</a> level in the (Daemon) EDOT Collector.</p>
<p>This centralization ensures consistency across the platform: if we need to update our standard formatting or enrichment strategies later, we apply the change once in the collector, and it automatically propagates to all services without developers needing to touch their manifests.</p>
<h4>(Daemon) EDOT Collector Configuration</h4>
<p>To enable this, we configure a Receiver Creator in the Daemon Collector. This component uses the <code>k8s_observer</code> extension to monitor the Kubernetes environment and automatically discover the target pods based on the annotations above.</p>
<pre><code class="language-yml">daemon:
  ...
  config:
    ...
    extensions:
      k8s_observer:
        auth_type: serviceAccount
        node: ${env:K8S_NODE_NAME}
        observe_nodes: true
        observe_pods: true
        observe_services: true
        ...
    receivers:
        receiver_creator/logs:
          watch_observers: [k8s_observer]
          discovery:
            enabled: true
    ...
...
</code></pre>
<p>Finally, we reference the <code>receiver_creator</code> in the pipeline instead of a static filelog receiver and we make sure to include the <code>k8s_observer</code> extension:</p>
<pre><code class="language-yml">daemon:
  ...
  config:
    ...
    service:
      extensions:
      - k8s_observer
      pipelines:
        # Pipeline for node-level logs
        logs/node:
          receivers:
            # - filelog             # We disable direct filelog receiver
            - receiver_creator/logs # Using the configured receiver_creator instead of filelog
          processors:
            - batch
            - k8sattributes
            - resourcedetection/system
          exporters:
            - otlp/gateway # Forward to the Gateway Collector for ingestion
</code></pre>
<h3>The Transformation Layer</h3>
<p>While the logs are structured, OTel sees them as generic attributes. To finalize the pipeline, we use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/transformprocessor/README.md">transform processor</a>, which allows us to modify and restructure telemetry signals using the OpenTelemetry Transformation Language (OTTL).</p>
<p>We use the processor to promote specific ECS fields into the top-level OpenTelemetry fields and to rename attributes according to the OpenTelemetry Semantic Conventions:</p>
<ul>
<li>Promote the <code>message</code> attribute to the top-level <code>Body</code> field.</li>
<li>Promote the <code>log.level</code> attribute to the OTel <code>SeverityText</code> field.</li>
<li>Move the <code>@timestamp</code> attribute to the OTel <code>Time</code> field.</li>
<li>Map <code>trace_id</code> and <code>span_id</code> to the right log context.</li>
</ul>
<p>The following provides a sample <code>transform</code> configuration:</p>
<pre><code class="language-yml"> processors:
    transform/ecs_handler:
      log_statements:
      - context: log
        conditions:
          - log.attributes[&quot;ecs.version&quot;] != nil
        statements:
          # Map ECS fields to OTel Log Model
          - set(log.body, log.attributes[&quot;message&quot;])
          - set(log.time, Time(log.attributes[&quot;@timestamp&quot;], &quot;%Y-%m-%dT%H:%M:%SZ&quot;))
          - set(log.trace_id.string, log.attributes[&quot;trace_id&quot;])
          - set(log.span_id.string, log.attributes[&quot;span_id&quot;])
          - set(log.severity_text, log.attributes[&quot;log.level&quot;])
          # Cleanup original keys to save space
          - delete_key(log.attributes, &quot;message&quot;)
          - delete_key(log.attributes, &quot;trace_id&quot;)
          - delete_key(log.attributes, &quot;span_id&quot;)

          # Add here additional transformations as needed...
</code></pre>
<p><strong>Note:</strong> When working with EDOT Collector and the OpenTelemetry Kube-Stack Helm Chart, resource attributes such as <code>service.name</code> and <code>service.version</code> are automatically populated based on a set of <a href="https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/">well-defined</a>
rules by the <code>k8sattributes</code> processor. Thus, on Kubernetes we do not need to extract those fields from the log content itself.</p>
<p>Make sure to use the newly created processor in the logs pipeline for the Daemon Collector:</p>
<pre><code class="language-yml">service:
  pipelines:
    logs/node:
      receivers:
        - receiver_creator/logs
      processors:
        - batch
        - k8sattributes
        - resourcedetection/system
        - transform/ecs_handler          # Newly created transform processor
      exporters:
        - otlp/gateway
</code></pre>
<h2>The Compatibility Layer: Bridging ECS and OTel</h2>
<p>To bridge the gap between the Elastic Common Schema (ECS) and OpenTelemetry (OTel), Elastic provides a &quot;compatibility layer&quot; built directly into its Observability solution relying on existing index templates and mappings. This architecture allows you to send OTel-native data while still using your legacy ECS-based dashboards, saved searches, and other associated objects.</p>
<p>This &quot;bridge&quot; relies on two key features:</p>
<ul>
<li>
<p><strong>Bridging ECS and OTel with Passthrough:</strong> OpenTelemetry (OTel) data often uses deeply nested structures (e.g., <code>resource.attributes.*</code>). Elasticsearch uses the <strong><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/passthrough">Passthrough</a></strong> object type to &quot;promote&quot; these nested attributes to the top level when performing a search query. Any new metadata added by the OTel Collector is automatically searchable without the user needing to know the full JSON path. This creates a &quot;virtual flattening&quot; layer and ensures that all fields that match in name are automatically compatible, even though they're stored in different namespaces (<code>attributes</code>/<code>resource.attributes</code> for OTel vs. top-level for ECS). To learn more about field and attribute alignment between ECS and the OTel Semantic Conventions, refer to this <a href="https://www.elastic.co/docs/reference/ecs/ecs-otel-alignment-details">page</a>.</p>
</li>
<li>
<p><strong>Bridging with Field Aliases</strong>: Elastic relies on OTel mapping templates that include <code>Field Aliases</code>. These aliases link OTel semantic names back to their equivalent ECS fields at query time, handling fields whose names do not align with the OTel naming conventions.</p>
</li>
</ul>
<p><em>The Benefit:</em> If you have an existing dashboard looking for <code>message</code> (ECS), but your data is now indexed as <code>body.text</code> (OTel), an alias allows the dashboard to aggregate and visualize data from both sources simultaneously. This ensures that your existing filters and KQL queries work flawlessly whether the data originated from a Filebeat agent or a modern OTel SDK agent.</p>
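<p>As a sketch (the <code>logs-*</code> index pattern is an assumption), a legacy-style query against the ECS field name needs no changes; the alias resolves it to the OTel-native field at query time:</p>

```bash
# Equivalent to the KQL filter:  message : "order processed"
# "message" resolves via the field alias to "body.text" in OTel-native indices
GET logs-*/_search
{
  "query": {
    "match": {
      "message": "order processed"
    }
  }
}
```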
<p>Some more details about field aliases and pass-through objects can be found <a href="https://www.elastic.co/docs/reference/opentelemetry/compatibility/data-streams#query-compatibility-with-classic-apm-data-streams">here</a>.</p>
<p>Here is an example of the provided mapping template:</p>
<pre><code class="language-yml">{
  &quot;mappings&quot;: {
    ...
    &quot;properties&quot;: {
      &quot;log&quot;: {
          &quot;properties&quot;: {
            &quot;level&quot;: {
              &quot;type&quot;: &quot;alias&quot;,
              &quot;path&quot;: &quot;severity_text&quot;
            }
          }
        },
      &quot;message&quot;: {
        &quot;type&quot;: &quot;alias&quot;,
        &quot;path&quot;: &quot;body.text&quot;
      }
    ...
    }
  }
 }
</code></pre>
<p>This architectural approach provides three major advantages for teams in transition:</p>
<ul>
<li>
<p><strong>Zero Reindexing:</strong> You don't have to rewrite or migrate old data. Aliases resolve at query time, meaning your old indices and new indices can coexist in the same visualization.</p>
</li>
<li>
<p><strong>Future-Proofing:</strong> As OTel becomes the primary standard (following the donation of ECS to the OTel project), Elastic is shifting its native UI to look for OTel fields first. These mappings ensure that your legacy ECS-native data still appears in OTel-native views.</p>
</li>
<li>
<p><strong>Unified Observability:</strong> It enables &quot;Correlation by Default.&quot; Because the aliases link <code>trace_id</code> (OTel) and <code>trace.id</code> (ECS), you can jump from a legacy log to a modern OTel trace without losing context or breaking the drill-down path.</p>
</li>
</ul>
<h2>Sending data to Elasticsearch</h2>
<p>If you are running Elastic Serverless or the latest Elastic Cloud Hosted (ECH) v9.2+, you now have access to a managed OTLP endpoint. This native functionality allows you to route telemetry directly from your Collector Gateway to Elasticsearch using the OTLP protocol.</p>
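<p>In a collector configuration, the managed endpoint is addressed with a standard OTLP exporter. The endpoint URL and API key below are placeholders; use the OTLP endpoint details shown for your own Serverless or ECH deployment:</p>
<pre><code class="language-yml">exporters:
  otlp:
    # Placeholder values; copy the managed OTLP endpoint and an API key
    # from your deployment's connection details.
    endpoint: &quot;https://my-deployment.ingest.elastic.cloud:443&quot;
    headers:
      Authorization: &quot;ApiKey my-api-key&quot;
</code></pre>
<p>Reference this exporter from the <code>exporters</code> list of the logs, metrics, and traces pipelines in your Collector Gateway.</p>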
<p>Because we mapped our ECS fields to the OTel model in the collector, Elasticsearch recognizes the correlation immediately. You get the best of both worlds:</p>
<ul>
<li>
<p><em><strong>Legacy Compatibility:</strong></em> Your old ECS-based dashboards still work (with minor tweaks).</p>
</li>
<li>
<p><em><strong>Modern Power:</strong></em> You can now click &quot;View Trace&quot; directly from a log entry in Kibana's Observability UI.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/discovery.jpg" alt="Discover" /></p>
<h2>Conclusion</h2>
<p>Transitioning to OpenTelemetry doesn't have to be a &quot;big bang&quot; migration. By using the EDOT SDK and Collector, you can:</p>
<ul>
<li>
<p><em><strong>Protect your investment</strong></em> in ECS-based logging libraries.</p>
</li>
<li>
<p><em><strong>Centralize complexity</strong></em> by handling schema translation in the collector rather than the application.</p>
</li>
<li>
<p><em><strong>Enable full correlation</strong></em> between traces and logs with zero code changes.</p>
</li>
</ul>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/blog-image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Data Quality Insights with the Instrumentation Score and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-instrumentation-score</link>
            <guid isPermaLink="false">otel-instrumentation-score</guid>
            <pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This post explores the Instrumentation Score for OpenTelemetry data quality, sharing practical insights, key learnings, and a hands-on look at implementing this approach with Elastic's powerful observability features.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry adoption is rapidly increasing and more companies rely on OpenTelemetry to collect observability data.
While OpenTelemetry offers clear specifications and semantic conventions to guide telemetry data collection, it also introduces significant flexibility.
With high flexibility comes high responsibility — many things can go wrong with OTel-based data collection, easily resulting in mediocre or low-quality telemetry.
Poor data quality can hinder backend analysis, confuse users, and degrade system performance.
To unlock actionable insights from OpenTelemetry data, maintaining high data quality is essential.
The <a href="https://instrumentation-score.com/">Instrumentation Score</a> initiative addresses this challenge by providing a standardized way to measure OpenTelemetry data quality.
Although the specification and tooling are still evolving, the underlying concepts are already compelling.
In this blog post, I’ll share my experience experimenting with the Instrumentation Score concept and demonstrate how to use the Elastic Stack — utilizing ES|QL, Kibana Task Manager, and Dashboards — to build a POC for data quality analysis based on this approach within Elastic Observability.</p>
<h2>Instrumentation Score - The Power of Rule-based Data Quality Analysis</h2>
<p>When you first hear the term &quot;Instrumentation Score&quot;, your initial reaction might be: &quot;OK, there's a <em>single</em>, percentage-like metric that tells me my instrumentation (i.e. OTel data) has a score of 60 out of 100.
So what? How does it help me?&quot;</p>
<p>However, the Instrumentation Score is much more than just a single number.
Its power lies in the individual rules from which the score is calculated.
The rule definitions' <code>rationale</code>, <code>impact level</code>, and <code>criteria</code> provide an evaluation framework that enables you to drill down into data quality issues and identify specific areas for improvement.
Also, the Instrumentation Score specification does not mandate specific tools and implementation details for calculating the score and rule evaluations.</p>
<p>As I explored the Instrumentation Score concepts, I developed the following mental model for deriving actionable insights.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/inst-score-drill-down.png" alt="Instrumentation Score Drill-down" /></p>
<h5>The Score</h5>
<p>The score itself is an indicator of the quality of your telemetry data. The lower the number, the more room for improvement with your data quality.
In general, if a score falls below 75, you should consider fixing your instrumentation and data collection.</p>
<h5>Breakdown by Instrumentation Score Rules</h5>
<p>Exploring the evaluation results of individual Instrumentation Score <em>rules</em> will give you insights into <em>what</em> is wrong with your data quality.
In addition, the rules' rationales explain <em>why</em> the violation of a rule is problematic.</p>
<p>As an example, let's take the <a href="https://github.com/instrumentation-score/spec/blob/main/rules/SPA-002.md"><code>SPA-002 rule</code></a>:</p>
<blockquote>
<p><strong>Description</strong>:</p>
<p>Traces do not contain orphan spans.</p>
<p><strong>Rationale</strong>:</p>
<p>Orphaned spans indicate potential issues in tracing instrumentation or data integrity. This can lead to incomplete or misleading trace data, hindering effective troubleshooting and performance analysis.</p>
</blockquote>
<p>If your data violates the <code>SPA-002</code> rule, you know <em>what</em> is wrong (i.e. you have broken traces), and the rationale explains why that is an issue (i.e. degraded analysis capabilities).</p>
<h5>Breakdown by Services</h5>
<p>When you have a large system with hundreds or maybe even thousands of entities (such as services, Kubernetes pods, etc.), a binary signal on all of the data — such as &quot;has a certain rule been passed or not&quot; — is not really actionable.
Is the data from all services violating a certain rule, or just a small subset of services?</p>
<p>Breaking down rule evaluation by services (and potentially other entity types) may help you to identify <em>where</em> there are issues with data quality.
For example, let's assume only one service — the <code>cart-service</code> — (out of your fifty services) is affected by the violation of rule <code>SPA-002</code>.
With that information, you can focus on fixing the instrumentation for the <code>cart-service</code> instead of having to check all fifty services.</p>
<p>Once you know which services (or other entities) violate which Instrumentation Score rules, you're very close to actionable insights.
However, there are two more things that I found to be extremely useful for data quality analysis when I was experimenting with the Instrumentation Score evaluation: (1) a quantitative indication of the extent, and (2) concrete examples of rule violation occurrences in your data.</p>
<h5>Quantifying the Rule Violation Extent</h5>
<p>The Instrumentation Score spec already defines an impact level (e.g. <code>NORMAL</code>, <code>IMPORTANT</code>, <code>CRITICAL</code>) per rule.
However, this only covers the &quot;importance&quot; of the rule itself, not the extent of a rule violation.
For example, if a single trace (out of a million traces) on your service has an orphan span, technically speaking the rule <code>SPA-002</code> is violated.
But is it really a relevant issue if only one out of a million traces is affected? Probably not. It definitely would be if half of your traces were broken.</p>
<p>Hence, having a quantitative indication of the extent of a rule violation per service — e.g. &quot;40% of your traces violate <code>SPA-002</code>&quot; — would provide additional information on how severe a rule violation actually is.</p>
<h5>Tangible Examples</h5>
<p>Finally, nothing is as meaningful and self-explanatory as tangible, concrete examples from your own data.
If the telemetry data of your <code>cart-service</code> violates <code>SPA-002</code> (i.e., has traces with orphan spans), wouldn't you want to see a concrete trace from that service that demonstrates the rule violation?
Analyzing concrete examples may give you hints about the root cause of broken traces — or, more generally, why your data violates Instrumentation Score rules.</p>
<h2>Instrumentation Score with Elastic</h2>
<p>The Instrumentation Score spec does not prescribe tool usage or implementation details for the calculation of the score and evaluation of the rules.
This allows for integrating the Instrumentation Score concept with whatever backend your OpenTelemetry data is being sent to.</p>
<p>With the goal of building a POC for an end-to-end integration of the Instrumentation Score with Elastic Observability, I combined the powerful capabilities of ES|QL with Kibana's task manager and dashboarding features.</p>
<p>Each Instrumentation Score rule can be formulated as an ES|QL query that covers the steps described above:</p>
<ul>
<li>rule passed or not</li>
<li>breakdown by services</li>
<li>calculation of the extent</li>
<li>sampling of an example occurrence</li>
</ul>
<p>Here is an example query for the <code>LOG-002</code> rule that checks the validity of the <code>severity_number</code> field:</p>
<pre><code class="language-esql">FROM logs-*.otel-* METADATA _id
| WHERE data_stream.type == &quot;logs&quot;
    AND @timestamp &gt; NOW() - 1h
| EVAL no_sev = severity_number IS NULL OR severity_number == 0
| STATS 
    logs_wo_severity = COUNT(*) WHERE no_sev,
    example = SAMPLE(_id, 1) WHERE no_sev,
    total = COUNT(*)
      BY service.name
| EVAL rule_passed = (logs_wo_severity == 0),
    extent = CASE(total != 0, logs_wo_severity / total, 0.0)
| KEEP rule_passed, service.name, example, extent
</code></pre>
<p>These rule evaluation queries are wrapped in a Kibana <code>instrumentation-score</code> plugin that utilizes the task manager for regular execution.
The <code>instrumentation-score</code> plugin then takes the results from all the evaluation queries for the different rules and calculates the final instrumentation score value (overall and broken down by service) following the <a href="https://github.com/instrumentation-score/spec/blob/main/specification.md#score-calculation-formula">Instrumentation Score spec's calculation formula</a>.
The resulting instrumentation score values, as well as the rule evaluation results (with the examples and extent) are then stored in separate Elasticsearch indices for consumption.</p>
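<p>The aggregation itself is easy to sketch. The following minimal Python sketch mimics an impact-weighted calculation in the spirit of the spec's formula; the weight values are illustrative assumptions, not the normative ones from the specification:</p>
<pre><code class="language-python"># Illustrative impact weights -- assumptions for this sketch,
# not the normative values from the Instrumentation Score spec.
IMPACT_WEIGHTS = {'NORMAL': 1, 'IMPORTANT': 2, 'CRITICAL': 4}

def instrumentation_score(rule_results):
    # rule_results: one entry per applicable rule, each carrying
    # the rule's impact level and whether the rule passed.
    total = sum(IMPACT_WEIGHTS[r['impact']] for r in rule_results)
    if total == 0:
        return None  # no applicable rules, so the score is undefined
    achieved = sum(IMPACT_WEIGHTS[r['impact']] for r in rule_results if r['passed'])
    return round(100 * achieved / total)

results = [
    {'impact': 'CRITICAL', 'passed': True},
    {'impact': 'IMPORTANT', 'passed': False},
    {'impact': 'NORMAL', 'passed': True},
]
print(instrumentation_score(results))  # 5 of 7 weight points, rounded to 71
</code></pre>
<p>Running the same aggregation per service, in addition to the overall value, is what enables the per-service breakdown.</p>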
<p>With the results stored in dedicated Elasticsearch indices, we can build Dashboards to visualize the Instrumentation Score insights and allow users to troubleshoot their data quality issues.</p>
<p>In this POC, I implemented a subset of the Instrumentation Score rules to prove out the approach.</p>
<p>The Instrumentation Score concept accommodates extension with your own custom rules.
In my POC, I did that as well, testing some quality rules that are not yet formalized in the Instrumentation Score spec
but that are important for Elastic Observability to extract the maximum value from OTel data.</p>
<h2>Applying the Instrumentation Score on the OpenTelemetry Demo</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo</a> is the most-used environment to play around with and showcase OpenTelemetry capabilities.
Initially, I thought the demo would be the worst environment to test my Instrumentation Score implementation.
After all, it's the showcase environment for OpenTelemetry, and I expected it to have an Instrumentation Score close to 100.
Surprisingly, that wasn't the case.</p>
<p>Let's start with the overview.</p>
<h3>The Overview</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-overview.png" alt="Dashboard Overview" /></p>
<p>This dashboard shows an overview of the Instrumentation Score results for the OpenTelemetry Demo environment.
The first thing you might notice is the very low overall score <code>35</code> (top-left corner).
The table in the bottom-left corner shows a breakdown of the score by services.
Somewhat surprisingly, all the service scores are higher than the overall score.
How is that possible?</p>
<p>The main reason is that Instrumentation Score rules have, by definition, a binary result — passed or not.
So it can happen that each service fails just a single, but distinct, rule: each individual service score is then not perfect, but also not too bad.
From the overall perspective, however, many rules have failed (each by a different service), leading to a very low overall score.</p>
<p>In the table on the right, we see the results for the individual rules with their description, impact level, and example occurrences.
We see that 7 out of 11 implemented rules have failed. Let's pick our favorite example from earlier — <code>SPA-002</code> (in row 5), the orphan spans rule.</p>
<p>With the dashboard indicating that the rule <code>SPA-002</code> has failed, we know that there are orphan spans somewhere in our OTel traces. But where exactly?</p>
<p>For further analysis, we have two ways to drill down: (1) into a specific rule to see which services violate a specific rule, or (2) into a specific service to see which rules are violated by that service.</p>
<h3>Rule Drilldown</h3>
<p>The following dashboard shows a detailed view into the rule evaluation results for individual rules.
In this case we selected rule <code>SPA-002</code> at the top.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-rule-spa-002.png" alt="Dashboard Overview" /></p>
<p>In addition to the rule's meta information, such as its description, rationale, and criteria, we see some statistics on the right.
For example, we see that 2 services have failed that rule, 16 passed, and for 19 services this rule is not applicable (e.g., because they don't have tracing data).
In the table below, we see which two services are impacted by this rule violation: the <code>frontend</code> and <code>frontend-proxy</code> services.
For each service, we also see the <em>extent</em>. In the case of the <code>frontend</code> service, around 20% of traces have orphan spans.
This information is crucial as it gives an indication of how severe the rule violation actually is.
If it had been under 1%, this problem might have been negligible, but with one trace out of five being broken, it definitely needs to be fixed.
Also, for each of the services, we have an example <code>span.id</code> that is referenced as the <code>parent.id</code> of other spans but for which no span could be found.
This allows us to perform further analyses (e.g., by investigating the referring spans in Kibana's Discover) on concrete example cases.</p>
<p>With that view, we now know that the <code>frontend</code> service has a good amount of broken traces.
But is that service also violating other rules? And, if yes, which?</p>
<h3>Service Drilldown</h3>
<p>To answer the above question we can switch to the <code>Per Service</code> Dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-service-frontend.png" alt="Dashboard Overview" /></p>
<p>In this dashboard, we see similar information to the overview dashboard, but filtered to a single selected service (the <code>frontend</code> service in this example).
In the table, we see that the <code>frontend</code> service violates three rules. We already know about <code>SPA-002</code> from the previous section.
In addition, the violation of the custom rule <code>SPA-C-001</code> shows that around 99% of transaction span names have high cardinality.
In Elastic Observability, <code>transactions</code> refer to service-local root spans (i.e., entry points into services).
In the example value, we see directly why the <code>span.name</code>s (here referred to as <code>transaction.name</code>s) have high cardinality.
The span name contains unique identifiers (here the session ID) as part of the URL that the span name is constructed from in the instrumentation.
As the <a href="https://www.elastic.co/docs/reference/edot-collector">EDOT Collector</a> derives metrics for transaction-type spans, we also can observe a violation of the <code>MET-001</code> which requires bound cardinality on metric dimensions.</p>
<p>As you can see, with the Instrumentation Score concept and a few different breakdown views, we were able to pinpoint data quality issues and identify which services and instrumentations need improvement to fix the issues.</p>
<h2>Learnings and Observations</h2>
<p>My experimentation with the Instrumentation Score was very insightful and showed me the power of this concept — though it's still in its early phase.
It is particularly insightful if the implementation and calculation include breakdowns by meaningful entities, such as services, K8s pods, hosts, etc.
With such a breakdown, you can narrow down data quality issues to a manageable scope, instead of having to sift through huge amounts of data and entities.</p>
<p>Furthermore, I realized that having some notion of problem extent (per rule and service), as well as concrete examples, helps make the problem more tangible.</p>
<p>Thinking further about the idea of rule violation <code>extent</code>, there might even be a way to incorporate that into the score formula itself.
In my humble opinion, this would make the score significantly more comparable and indicative of the actual impact.
I <a href="https://github.com/instrumentation-score/spec/issues/43">proposed this idea in an issue</a> on the Instrumentation Score project.</p>
<h2>Conclusion</h2>
<p>The Instrumentation Score is a powerful approach to ensuring a high level of data quality with OpenTelemetry.</p>
<p>Thank you to the maintainers — Antoine Toulme, Daniel Gomez Blanco, Juraci Paixão Kröhling, and Michele Mancioppi — for bringing this great project to life, and to all the contributors for their participation!</p>
<p>With proper implementation of the rules and score calculation, users can easily get actionable insights into what they need to fix in their instrumentation and data collection.
The Instrumentation Score rules are in an early stage and are steadily improved and extended.
I'm looking forward to what the community will build in the scope of this project in the future, and I hope to intensify my contributions as well.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/otel-instrumentation-grade.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Why Elastic donated its OpenTelemetry PHP distribution]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-php-distro-donation</link>
            <guid isPermaLink="false">otel-php-distro-donation</guid>
            <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn what the OpenTelemetry PHP distro donation changes for package-managed PHP environments, how it compares to existing options, and what contributors can do next.]]></description>
            <content:encoded><![CDATA[<p>We <a href="https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot">donated EDOT PHP</a> to make OpenTelemetry for PHP as easy to deploy as any other runtime.
PHP powers a significant number of websites and SaaS platforms, and we hope our contribution helps more teams adopt OpenTelemetry.
In many production environments, runtimes are locked down and building native extensions during deploy is not possible, so we focused on an OS-package-first path (<code>deb</code>, <code>rpm</code>, <code>apk</code>) for zero-code instrumentation.</p>
<p>Since the announcement of the donation, we have been actively working on the project and are about to release a first beta version.</p>
<p>In this post, we will walk through what was donated, why it matters for production PHP systems, how it relates to existing OpenTelemetry PHP projects, and the current status of the project.</p>
<h2>Why PHP observability can still be hard</h2>
<p>OpenTelemetry gives us a common standard, but deployment reality still matters. In many PHP environments, the blocker is not instrumentation APIs. The blocker is operations.</p>
<p>At Elastic, we are committed to open standards and to OpenTelemetry as the industry standard for observability data collection. To help the community with these operational constraints, we donated our EDOT PHP distribution to OpenTelemetry.</p>
<p>Common constraints include:</p>
<ul>
<li>Shared hosting or hardened servers without build toolchains</li>
<li>Production images that cannot be rebuilt frequently</li>
<li>Package-managed PHP runtimes where OS-native install flows are required</li>
<li>Teams that need adoption without app code changes</li>
</ul>
<p>This is where an OS-package distribution helps. If you can install a package and restart PHP, you can usually start collecting telemetry.</p>
<h2>What the OpenTelemetry PHP distro provides</h2>
<p>The project we donated combines native and PHP runtime components into one production path.
The key features we announced in our original donation proposal are implemented and we are close to a first beta release.
This includes:</p>
<ul>
<li>Native extension and loader artifacts so teams can install prebuilt components instead of compiling in restricted environments.</li>
<li>Runtime/bootstrap logic for auto-instrumentation so applications can emit telemetry with little or no code changes.</li>
<li>Packaging support for <code>deb</code>, <code>rpm</code>, and <code>apk</code> so rollout fits existing Linux package management and operations workflows.</li>
<li>Background telemetry sending and automatic root span behavior so trace data is captured consistently without custom bootstrapping logic or blocking of the main flow.</li>
<li>OTLP protobuf serialization works out of the box, with no need for the <code>ext-protobuf</code> extension. This means teams don't have to install extra dependencies, which is especially important in PHP environments where adding new extensions is difficult or restricted.</li>
<li>Inferred spans so users get added visibility into work that is not explicitly instrumented in application code.</li>
<li>URL grouping for transaction/root spans so high-cardinality route data is easier to aggregate and analyze.</li>
<li>Built-in OpAMP support utilizing an OpAMP client already present in the agent.</li>
</ul>
<p>For teams running PHP <code>8.1</code> through <code>8.4</code>, this gives a practical onboarding path that fits existing OS package operations.</p>
<h2>How installation looks in practice</h2>
<p>A typical flow is simple:</p>
<ol>
<li>Install distro package for your Linux platform.</li>
<li>Set exporter endpoint and auth headers.</li>
<li>Restart PHP processes.</li>
<li>Verify traces in your collector or backend.</li>
</ol>
<p>Example environment variables:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://your-collector.example:4318&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;token&gt;&quot;
</code></pre>
<p>The key idea is predictable rollout through standard operational controls rather than custom build steps in each application pipeline.</p>
<p>For a detailed guide on how to set up the distro, see the <a href="https://github.com/open-telemetry/opentelemetry-php-distro/blob/main/docs/getting-started/setup.md">setup documentation</a>.</p>
<h2>Relationship to existing OpenTelemetry PHP instrumentation</h2>
<p>The current message from maintainers is coexistence with clear differentiation:</p>
<ul>
<li><strong>Distro path</strong>: package-managed, production-first, zero-code onboarding</li>
<li><strong>Composer-centric path</strong>: more manual control and portability where that is needed</li>
</ul>
<p>This distinction matters for users choosing a starting point. If you control application packaging tightly and can build extensions as part of app install, Composer-oriented paths can still be a fit. If you need an operations-first rollout through OS packages, the distro can reduce adoption friction.</p>
<p>The donation discussion also raised an important usability concern: too many overlapping options can confuse users. That is why compatibility and long-term alignment across projects is a key follow-up topic. The original proposal details are available in the <a href="https://github.com/open-telemetry/community/issues/2846">OpenTelemetry community donation issue</a>.</p>
<h2>What to validate before broad production rollout</h2>
<p>If you want to test this path in your own environment, validate these points early:</p>
<ul>
<li><strong>Runtime coverage</strong>: verify your PHP version and SAPI mode (PHP-FPM, Apache <code>mod_php</code>, CLI)</li>
<li><strong>Packaging fit</strong>: confirm your distro package format and architecture support</li>
<li><strong>Telemetry behavior</strong>: check span completeness, service naming, and exporter reliability</li>
<li><strong>Operational safety</strong>: verify restart procedures, rollback steps, and version pinning policy</li>
</ul>
<p>A lightweight validation matrix can save rework later, especially when multiple runtime profiles exist in the same organization.</p>
<h2>The current status and what's next</h2>
<p>With the completion of the initially announced features described above, we reached a significant milestone and are about to release a first beta version.</p>
<p>However, the work does not stop here. More enhancements and features are planned, including:</p>
<h3>Class Shadowing</h3>
<p>In some situations it is possible that the PHP distro loads classes that are already loaded by the application itself.
This can lead to collisions and unexpected behavior.
We are working on implementing shadowing of classes and namespaces of the PHP distro dependencies to avoid collisions.
This new feature will increase the stability and reliability of the PHP distro across a wider range of applications.</p>
<h3>Declarative Configuration Support</h3>
<p>Just recently the OpenTelemetry community announced the <a href="https://opentelemetry.io/blog/2026/stable-declarative-config/">stability of the declarative configuration specification</a>.
Including native support for the declarative configuration specification in the PHP distro is a feature that is on our roadmap and one of the next major features we will be working on.
With declarative configuration support, teams will be able to configure the PHP distro in a more flexible way without having to modify their application code.</p>
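<p>As a preview, a declarative configuration file following the OpenTelemetry file configuration schema looks roughly like the following sketch; the format version, key names, and endpoint are illustrative, so check the specification for the exact schema:</p>
<pre><code class="language-yml"># Illustrative sketch; verify the format version and supported keys
# against the declarative configuration specification.
file_format: &quot;1.0&quot;
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            protocol: http/protobuf
            endpoint: &quot;https://your-collector.example:4318&quot;
</code></pre>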
<p>We are working on implementing declarative configuration support in the distro to allow for more fine-grained configuration of the agent.
This new feature will increase the flexibility and usability of the PHP distro.</p>
<h3>PHP 8.5 Support</h3>
<p>In November 2025, PHP 8.5 was released with major changes to the language and runtime.
We are working on supporting it in the PHP distro which will expand the scope of compatibility and usability of the PHP distro across a wider range of applications and environments.</p>
<h3>Central Configuration Support</h3>
<p>Both declarative configuration and the <a href="https://github.com/open-telemetry/opentelemetry-specification/pull/4738">new proposal around telemetry policies</a> are promising enablers for dynamic, central configuration of OTel SDKs and distros.
Once the discussion around telemetry policies is resolved, we will be able to implement central configuration support in the PHP distro.
This will combine the concepts around declarative configuration, telemetry policies and the OpenTelemetry Agent Management Protocol (OpAMP) to provide a more flexible and powerful way to configure the PHP distro centrally.</p>
<h3>Base EDOT on the upstream distribution</h3>
<p>With the new OpenTelemetry PHP distro reaching a first beta release, we are working on basing Elastic's OTel PHP distribution (EDOT PHP) on the upstream distribution to increase the compatibility and avoid feature drift.
This will ensure that the PHP distro is always up to date with the latest features and bug fixes from the OpenTelemetry community.</p>
<h2>Conclusion</h2>
<p>Our OpenTelemetry PHP distro donation is mainly about operational accessibility. It gives PHP teams a package-native way to adopt OpenTelemetry where build-time instrumentation is difficult.
As community alignment progresses, we expect this to become a clearer and lower-friction option for production PHP observability. Try out the <a href="https://github.com/open-telemetry/opentelemetry-php-distro">OpenTelemetry PHP distro repository</a>, document gaps, and feed findings back to the maintainers.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-php-distro-donation/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Profiles Signal Enters Alpha: Elastic’s Continuous Commitment to Profiling]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-profiling-alpha</link>
            <guid isPermaLink="false">otel-profiling-alpha</guid>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry Profiles has officially reached Alpha, entrenching profiling as the fourth observability signal. Elastic's core contribution of its eBPF profiling agent, continued OpenTelemetry Profiles signal work and commitment to a vendor-agnostic ecosystem are driving this industry-wide standard forward.]]></description>
            <content:encoded><![CDATA[<p>Following intensive collaboration between Elastic and the OpenTelemetry community, we are thrilled to announce that the OpenTelemetry Profiles signal has officially entered public Alpha.
This milestone is a testament to the community's dedication and marks a significant step towards establishing profiling as the fourth key observability signal in OpenTelemetry, alongside logs, metrics and traces.</p>
<p>As a core contributor, Elastic is proud to have accelerated this effort by previously donating its Universal Profiling™ eBPF-based continuous profiling agent to OpenTelemetry.
This production-grade agent enables whole-system visibility across all applications, covering a multitude of programming languages and runtimes including third-party libraries and kernel operations with minimal overhead.
It allows SREs and developers to quickly identify performance bottlenecks, maximize resource utilization, and optimize cloud spend.</p>
<p>Additionally, over the last two years, Elastic has been heavily contributing to the OpenTelemetry Collector, Semantic Conventions and Profiling Special Interest Groups (SIGs) to lay the technical foundation for the promotion of Profiles to Alpha.</p>
<p>This Alpha milestone not only boosts the standardization of continuous profiling but also accelerates the practical adoption of profiling as the fourth key signal in observability.
Customers now have a vendor-agnostic way of collecting profiling data and enabling correlation with existing signals, like logs, metrics and traces, unveiling new potential for observability insights and a more efficient troubleshooting experience.</p>
<h2>What is continuous profiling?</h2>
<p>Profiling is a technique used to understand the behavior of a software application by collecting information about its execution.
This includes tracking the duration of function calls, memory usage, CPU usage, and other system resources.</p>
<p>However, traditional profiling solutions have significant drawbacks limiting adoption in production environments:</p>
<ul>
<li>Significant cost and performance overhead due to code instrumentation</li>
<li>Disruptive service restarts</li>
<li>Inability to get visibility into third-party libraries</li>
</ul>
<p>Unlike traditional profiling, which is often done only in a specific development phase or under controlled test conditions, continuous profiling runs in the background with minimal overhead, eliminating the need for service restarts or manual intervention.
This provides real-time, actionable insights without replicating issues in separate environments.
SREs, DevOps, and developers can see how code affects performance and cost, making code and infrastructure improvements easier.</p>
<h2>Elastic's contribution: Powering the Alpha</h2>
<p>The Elastic-donated profiler now forms the reference eBPF-based profiler implementation within OpenTelemetry: <a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler/">opentelemetry-ebpf-profiler</a>.
With the Alpha release, the eBPF profiler operates as an OpenTelemetry Collector receiver and contains numerous improvements such as automatic Go symbolization and support for new language runtimes.
Operating as an OpenTelemetry Collector receiver enables the profiler to seamlessly leverage existing OpenTelemetry processing and filtering pipelines.</p>
<p>For example, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor">k8sattributesprocessor</a> can use the <code>container.id</code> resource attribute to automatically enrich every profile with its corresponding Kubernetes context.
This means you don't just see a raw stack trace; you see exactly which namespace, pod, and deployment produced it.</p>
<pre><code class="language-yaml">receivers:
  # Profiling receiver
  profiling: {}

processors:
  k8sattributes:
    passthrough: false 
    pod_association:
      - sources:
          - from: resource_attribute
            name: container.id
    extract:
      metadata:
        - &quot;k8s.namespace.name&quot;
        - &quot;k8s.deployment.name&quot;
        - &quot;k8s.replicaset.name&quot;
        - &quot;k8s.statefulset.name&quot;
        - &quot;k8s.daemonset.name&quot;
        - &quot;k8s.node.name&quot;
        - &quot;k8s.pod.name&quot;
        - &quot;k8s.pod.ip&quot;
        - &quot;k8s.pod.uid&quot;
</code></pre>
<p>Besides improvements to the eBPF profiler, Elastic has made significant contributions to:</p>
<ul>
<li>Correlating profiles with the information produced by OpenTelemetry eBPF instrumentation (<a href="https://opentelemetry.io/docs/zero-code/obi/">OBI</a>), a powerful auto-instrumentation tool that can enable distributed tracing.</li>
<li><a href="https://github.com/open-telemetry/opentelemetry-specification/pull/4719">Process Context Sharing OTEP</a> which is designed to bridge the gap between application SDKs and the profiler. This mechanism will allow OpenTelemetry SDKs to &quot;publish&quot; their resource attributes (like <code>service.name</code>) into a small, standardized memory region. Because this data is stored in the process's own memory map, the eBPF Profiler can instantly discover and associate it with its corresponding Profile.</li>
<li>Semantic conventions and integration of OpenTelemetry Profiles with Google's pprof format (transparent conversion)</li>
<li>OpenTelemetry Collector processing pipelines, allowing them to integrate better with the profiling receiver</li>
</ul>
<h2>Elastic's Next-Generation Profiling Development</h2>
<p>Elastic remains deeply committed to OpenTelemetry's vision and is pushing the boundaries of what is possible with profiling data.
We are dedicating a team of profiling domain experts to co-maintain and advance profiling capabilities within OpenTelemetry, while simultaneously working on groundbreaking features built on this new open standard.</p>
<p>Exciting areas of internal profiling-specific development include:</p>
<ul>
<li>OpenTelemetry Profiles derived Metrics: We are developing innovative ways to automatically generate actionable performance metrics directly from the raw OTel Profiles data, providing a new dimension for infrastructure modeling and alerting.</li>
<li>Rapid Integration with the Elastic Stack: We are making swift progress on first-class support for OTLP Profiles within the Elastic Stack, ensuring seamless ingestion (the ebpf-profiler receiver is already integrated with the <a href="https://github.com/elastic/elastic-agent/tree/main/internal/edot#components">Elastic Distributions of OpenTelemetry (EDOT) collector</a>), storage, and visualization of this new signal alongside your existing logs, metrics and traces.</li>
<li>AI-Powered Workflows: We are leveraging the deep insights provided by continuous profiling data to power new AI-driven workflows, enabling automatic root-cause analysis, anomaly detection, and intelligent optimization suggestions for both code and infrastructure.</li>
</ul>
<p>While the Alpha release marks a significant milestone, it is just the beginning.
We encourage the community to start testing early preview versions of the OTel Profiles integration and contribute to the ongoing profiling work.
To get started with an actual, local deployment, you can use the <a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler">OpenTelemetry eBPF profiler</a> in combination with a self-hosted <a href="https://www.elastic.co/docs/solutions/observability">Elastic Observability Stack</a> or <a href="https://github.com/elastic/devfiler">devfiler</a>, a standalone desktop application that acts as an OpenTelemetry Profiles compliant backend aimed at experimentation and development.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-profiling-alpha/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-1</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect and assess PII in your logs using Elasticsearch and NLP]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.</p>
<p>In <strong>Part 1</strong> of this blog, we will cover the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Here is the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h2>Tools and Techniques</h2>
<p>There are four general capabilities that we will use for this exercise.</p>
<ul>
<li>Named Entity Recognition Detection (NER)</li>
<li>Pattern Matching Detection</li>
<li>Log Sampling</li>
<li>Ingest Pipelines as Composable Processing</li>
</ul>
<h4>Named Entity Recognition (NER) Detection</h4>
<p>NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:</p>
<ul>
<li>Person: Names of individuals, including celebrities, politicians, and historical figures.</li>
<li>Organization: Names of companies, institutions, and organizations.</li>
<li>Location: Geographic locations, including cities, countries, and landmarks.</li>
<li>Event: Names of events, including conferences, meetings, and festivals.</li>
</ul>
<p>For our PII use case, we will choose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a>, which can be downloaded from <a href="https://huggingface.co">Hugging Face</a> and loaded into Elasticsearch as a trained model.</p>
<p><strong>Important Note:</strong>  NER / NLP Models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full logs volume through the NER Model. We will discuss the performance and scaling of the NER model in part 2 of the blog.</p>
<h4>Pattern Matching Detection</h4>
<p>In addition to NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">redact</a> processor is built for this use case.</p>
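<p>The redact processor works with Grok patterns, but its core mechanic is plain pattern substitution. As a rough sketch (the regexes below are hand-rolled illustrations, not the processor's built-in patterns):</p>

```python
import re

# Illustrative patterns only -- a real deployment would rely on the redact
# processor's Grok patterns rather than these hand-rolled regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(message: str) -> str:
    """Replace every match of each pattern with a labeled placeholder,
    similar in spirit to what the redact processor does with Grok."""
    for label, pattern in PII_PATTERNS.items():
        message = pattern.sub(f"<{label}>", message)
    return message
```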
<h4>Log Sampling</h4>
<p>Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.</p>
<h4>Ingest Pipelines as Composable Processing</h4>
<p>We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.</p>
<h2>Building the Processing Flow</h2>
<h4>Logs Sampling + Composable Ingest Pipelines</h4>
<p>The first thing we will do is set up a sampler for our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows sampling rates as low as ~0.01%, and marks the sampled logs with <code>sample.sampled: true</code>. Further processing of the logs is driven by the value of <code>sample.sampled</code>. The <code>sample.sample_rate</code> can be set here or &quot;passed in&quot; from the orchestration pipeline.</p>
<p>These commands should be run from Kibana -&gt; Dev Tools.</p>
<p>The code for the following three sections can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-1.json">here</a>.</p>
&lt;details open&gt;
  &lt;summary&gt;logs-sampler pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;if&quot;: &quot;ctx.sample.sample_rate == null&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 10000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Determine if keeping unsampled docs&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == null&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot; Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); &quot;&quot;&quot;,
        &quot;params&quot;: {
          &quot;max&quot;: 10000
        }
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.sample.random &lt;= ctx.sample.sample_rate&quot;,
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
         &quot;description&quot;: &quot;Drop unsampled document if applicable&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == false &amp;&amp; ctx.sample.sampled == false&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
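<p>Stripped of the ingest-pipeline plumbing, the sampling decision in the painless script above reduces to a single comparison. Here it is mirrored in Python for clarity (a sketch, using the same 0-10000 scale):</p>

```python
import random

def is_sampled(sample_rate: int, max_rate: int = 10000) -> bool:
    """Mirror of the logs-sampler script: draw a random integer in
    [0, max_rate) and keep the document when the draw is at or below
    sample_rate -- a rate of 1000 keeps roughly 10% of documents."""
    return random.randrange(max_rate) <= sample_rate
```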
<p>Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the <code>logs@custom</code> ingest pipeline that will be automatically called using the logs <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html#data-streams-pipelines">data stream framework</a> for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.</p>
<p>Next, we will create the <code>process-pii</code> pipeline. This is the core processing pipeline where we will orchestrate the PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Finally, we create the <code>logs@custom</code> pipeline, which will simply call our <code>process-pii</code> pipeline when the <code>data_stream.dataset</code> matches.</p>
&lt;details open&gt;
  &lt;summary&gt;logs@custom pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevel&quot;,
        &quot;value&quot;: &quot;logs@custom&quot;
      }
    },
        {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevelinfo&quot;,
        &quot;value&quot;: &quot;{{{data_stream.dataset}}}&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, 
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Now, let's test to see the sampling at work.</p>
<p>Load the data as described in the <a href="#data-loading-appendix">Data Loading Appendix</a>. We will use the sample data first and discuss how to test against your incoming or historical logs at the end of this blog.</p>
<p>If you look at Observability -&gt; Logs -&gt; Logs Explorer with the KQL filter <code>data_stream.dataset : pii</code> and break down by <code>sample.sampled</code>, you should see a split of approximately 10%.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-1-part-1.png" alt="PII Discover 1" /></p>
<p>At this point we have a composable ingest pipeline that is &quot;sampling&quot; logs. As a bonus, you can use this logs sampler for any other use cases you have as well.</p>
<h4>Loading, Configuration, and Execution of the NER Pipeline</h4>
<h5>Loading the NER Model</h5>
<p>You will need a Machine Learning node to run the NER model on. In this exercise, we are using an <a href="https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html">Elastic Cloud Hosted Deployment</a> on AWS with the <a href="https://www.elastic.co/guide/en/cloud/current/ec_selecting_the_right_configuration_for_you.html">CPU Optimized (ARM)</a> architecture. The NER inference will run on a Machine Learning AWS c5d node. GPU options will be available in the future, but today we will stick with the CPU architecture.</p>
<p>This exercise will use a single c5d node with 8 GB RAM and 4.2 vCPU, burstable up to 8.4 vCPU.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ml-node-part-1.png" alt="ML Node" /></p>
<p>Please refer to the official documentation on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-import-model.html">how to import an NLP-trained model into Elasticsearch</a> for complete instructions on uploading, configuring, and deploying the model.</p>
<p>The quickest way to get the model is using the Eland Docker method.</p>
<p>The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.</p>
<pre><code class="language-bash">docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

</code></pre>
<h5>Deploy and Start the NER Model</h5>
<p>In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>To deploy and start the NER model, we will use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.15/start-trained-model-deployment.html">Start trained model deployment API</a>.</p>
<p>We will configure the following:</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 byte cache, as we expect a low cache hit rate</li>
<li>8192 Queue</li>
</ul>
<pre><code># Start the model with 4 allocations x 1 thread, no cache, and a queue of 8192
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&amp;number_of_allocations=4&amp;threads_per_allocation=1&amp;queue_capacity=8192

</code></pre>
<p>You should get a response that looks something like this.</p>
<pre><code class="language-bash">{
  &quot;assignment&quot;: {
    &quot;task_parameters&quot;: {
      &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;deployment_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;model_bytes&quot;: 430974836,
      &quot;threads_per_allocation&quot;: 1,
      &quot;number_of_allocations&quot;: 4,
      &quot;queue_capacity&quot;: 8192,
      &quot;cache_size&quot;: &quot;0&quot;,
      &quot;priority&quot;: &quot;normal&quot;,
      &quot;per_deployment_memory_bytes&quot;: 430914596,
      &quot;per_allocation_memory_bytes&quot;: 629366952
    },
...
    &quot;assignment_state&quot;: &quot;started&quot;,
    &quot;start_time&quot;: &quot;2024-09-23T21:39:18.476066615Z&quot;,
    &quot;max_assigned_allocations&quot;: 4
  }
}
</code></pre>
<p>The NER model has been deployed and started and is ready to be used.</p>
<p>The following ingest pipeline implements the NER model via the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html">inference</a> processor.</p>
<p>There is a significant amount of code here, but only two items are of interest right now. The rest is conditional logic that drives additional behavior we will look at more closely later.</p>
<ol>
<li>
<p>The inference processor calls the NER model by ID (the model we loaded previously) and maps the <code>message</code> field to the model's <code>text_field</code> input, which is the text we want analyzed for PII.</p>
</li>
<li>
<p>The script processor loops through the message field and uses the data generated by the NER model to replace the identified PII with redacted placeholders. This looks more complex than it really is: it simply loops through the array of ML predictions, replaces each one in the message string with a labeled constant, and stores the result in a new field, <code>redact.message</code>. We will look at this a little closer in the following steps.</p>
</li>
</ol>
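<p>For readability, the replacement loop described in item 2 can be mirrored in Python (a sketch; the dictionary keys follow the <code>ml.inference</code> fields the pipeline produces):</p>

```python
def redact_entities(message, entities, skip_entity="NONE", minimum_score=0.0):
    """Mirror of the pipeline's painless script: swap each entity the NER
    model found for a labeled placeholder, honoring the redact.ner.*
    skip-entity and minimum-score settings."""
    for item in entities:
        if (item["class_name"] != skip_entity
                and item["class_probability"] >= minimum_score):
            message = message.replace(
                item["entity"], f"<REDACTNER-{item['class_name']}_NER>")
    return message
```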
<p>The code for the following three sections can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-2.json">here</a>.</p>
<p>The NER PII Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;logs-ner-pii-processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to keep ml results for debugging&quot;,
        &quot;field&quot;: &quot;redact.ner.keep_result&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to PER, LOC, ORG to skip, or NONE to not drop any replacement&quot;,
        &quot;field&quot;: &quot;redact.ner.skip_entity&quot;,
        &quot;value&quot;: &quot;NONE&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set the minimum class_probability required to redact; 0 redacts all matches&quot;,
        &quot;field&quot;: &quot;redact.ner.minimum_score&quot;,
        &quot;value&quot;: 0
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_NER_FAILED&quot;
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.ner.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.failure != 'REDACT_NER_FAILED'&quot;,
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;&quot;&quot;String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) &amp;&amp; 
          	  (item['class_probability'] &gt;= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '&lt;' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER&gt;')
          	}
          }
          ctx.redact.message = msg&quot;&quot;&quot;,
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_REPLACEMENT_SCRIPT_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.ml?.inference?.entities.size() &gt; 0&quot;, 
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.ner?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx.redact.ner.keep_result != true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>The updated PII Processor Pipeline, which now calls the NER Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    }
  ]
}

</code></pre>
&lt;/details&gt;
<p>Now reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<h3>Results</h3>
<p>Let's take a look at the results with the NER processing in place. In the Logs Explorer KQL query bar, execute the following query:
<code>data_stream.dataset : pii and ml.inference.entities.class_name : (&quot;PER&quot; and &quot;LOC&quot; and &quot;ORG&quot; )</code></p>
<p>Logs Explorer should look something like this; open the top message to see the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-2-part-1.png" alt="PII Discover 2" /></p>
<h4>NER Model Results</h4>
<p>Let's take a closer look at what these fields mean.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_name</code><br />
<strong>Sample Value:</strong> <code>[PER, PER, LOC, ORG, ORG]</code><br />
<strong>Description:</strong> An array of the named entity classes that the NER model has identified.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_probability</code><br />
<strong>Sample Value:</strong> <code>[0.999, 0.972, 0.896, 0.506, 0.595]</code><br />
<strong>Description:</strong> The class_probability is a value between 0 and 1 indicating how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. <strong>This is important, as in the next blog we will decide on a threshold to use for alerting and redaction.</strong>
You can see that in this example it identified a <code>LOC</code> as an <code>ORG</code>; we can find and filter these out by setting a threshold.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.entity</code><br />
<strong>Sample Value:</strong> <code>[Paul Buck, Steven Glens, South Amyborough, ME, Costco]</code><br />
<strong>Description:</strong> The array of entities identified that align positionally with the <code>class_name</code> and <code>class_probability</code>.</p>
<p><strong>Field:</strong> <code>ml.inference.predicted_value</code><br />
<strong>Sample Value:</strong> <code>[2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&amp;Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&amp;Steven+Glens), [South Amyborough](LOC&amp;South+Amyborough), [ME](ORG&amp;ME) 93580, Ordered from: [Costco](ORG&amp;Costco)</code><br />
<strong>Description:</strong> The predicted value of the model.</p>
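<p>Using the sample values above, a small sketch shows how a threshold could be applied to the parallel <code>ml.inference</code> arrays to surface low-confidence classifications (the 0.75 threshold is an arbitrary illustration):</p>

```python
def low_confidence_entities(class_names, probabilities, entities, threshold=0.75):
    """Pair the parallel ml.inference arrays and return the entities whose
    class_probability falls below the chosen threshold."""
    return [
        (entity, name, prob)
        for name, prob, entity in zip(class_names, probabilities, entities)
        if prob < threshold
    ]
```

<p>Run against the sample values above, this flags the two <code>ORG</code> classifications, including the <code>LOC</code> that was misidentified as an <code>ORG</code>.</p>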
<h4>PII Assessment Dashboard</h4>
<p>Let's take a quick look at a dashboard built to assess the PII in the data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-1.ndjson</code> file that can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson</a></p>
<p>More complete instructions on Kibana Saved Objects can be found <a href="https://www.elastic.co/guide/en/kibana/current/managing-saved-objects.html">here</a>.</p>
<p>After loading the dashboard, navigate to it, select an appropriate time range, and you should see something like the image below. It shows metrics such as sample rate, percent of logs with NER detections, NER score trends, etc. We will examine the assessment and actions in Part 2 of this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-dashboard-1-part-1.png" alt="PII Dashboard 1" /></p>
<h2>Summary and Next Steps</h2>
<p>In this first part of the blog, we have accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In the upcoming <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10,000 random logs in a file named <code>pii.log</code> with a mix of logs that do and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following:</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note:</strong> To reload the logs, you can simply re-run the command below. You can run it multiple times during this exercise and the logs will be loaded again. The new logs will not collide with previous runs, as each run gets a unique <code>run.id</code>, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
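<p>The collision-free reloads come from tagging every document in a load with the same unique <code>run.id</code>. A hypothetical sketch of that idea (the actual <code>load_logs.py</code> implementation may differ):</p>

```python
# Sketch: tag every document in one load with a shared, unique run.id
# so repeated loads never collide (hypothetical; load_logs.py may differ).
import uuid

def tag_docs_with_run_id(docs):
    """Attach the same unique run.id to every document in one load."""
    run_id = str(uuid.uuid4())
    for doc in docs:
        doc["run"] = {"id": run_id}
    print(f"Loaded {len(docs)} docs with run.id: {run_id}")
    return run_id

run_id = tag_docs_with_run_id([{"message": "log line 1"}, {"message": "log line 2"}])
```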
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ner-regex-assess-redact-part-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-2</guid>
            <pubDate>Tue, 22 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect, assess, and redact PII in your logs using Elasticsearch, NLP and Pattern Matching]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs being ingested into Elasticsearch.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we covered the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we will cover the following:</p>
<ul>
<li>Apply the <code>redact</code> regex pattern processor and assess the results</li>
<li>Create Alerts using ESQL</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Reminder of the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h3>Part 1 Prerequisites</h3>
<p>This blog picks up where <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a> left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.</p>
<ul>
<li>Loaded and configured NER Model</li>
<li>Installed all the composable ingest pipelines from Part 1 of the blog</li>
<li>Installed dashboard</li>
</ul>
<p>You can access the <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-blog-1-complete.json">complete solution for Blog 1 here</a>. Don't forget to load the dashboard, found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<h3>Applying the Redact Processor</h3>
<p>Next, we will apply the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html"><code>redact</code> processor</a>. It is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces any matches with literal placeholders. The <code>redact</code> processor is reasonably performant and can run at scale; we will discuss this in detail in the <a href="#production-scaling">production scaling</a> section at the end.</p>
<p>Elasticsearch comes packaged with a number of useful predefined <a href="https://github.com/elastic/elasticsearch/blob/8.15/libs/grok/src/main/resources/patterns/ecs-v1">patterns</a> that can be conveniently referenced by the <code>redact</code> processor. If one does not suit your needs, you can create a new pattern with a custom definition. The <code>redact</code> processor replaces every occurrence of a match; if there are multiple matches, they will all be replaced with the pattern name.</p>
<p>In the code below, we leverage some of the predefined patterns and construct several custom ones.</p>
<pre><code class="language-bash">        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,      &lt;&lt; Predefined
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,           &lt;&lt; Predefined
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;, &lt;&lt; Custom
          &quot;%{SSN:SSN_REGEX}&quot;,                 &lt;&lt; Custom
          &quot;%{PHONE:PHONE_REGEX}&quot;              &lt;&lt; Custom
        ]
</code></pre>
<p>We also replaced the PII with easily identifiable patterns we can use for assessment.</p>
<p>In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many &quot;secrets&quot; patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.</p>
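<p>To make the behavior concrete, here is a minimal Python sketch of the find-and-replace the <code>redact</code> processor performs, using the same custom pattern definitions as the pipeline below (the email pattern is a simplified stand-in for the predefined Grok <code>EMAILADDRESS</code> pattern, and the prefix/suffix mirror the pipeline's <code>REDACTPROC</code> markers):</p>

```python
import re

# The last three patterns mirror the pipeline's custom pattern_definitions;
# EMAIL_REGEX is a simplified stand-in for the Grok EMAILADDRESS pattern.
PATTERNS = {
    "EMAIL_REGEX": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "CREDIT_CARD_REGEX": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN_REGEX": r"\d{3}-\d{2}-\d{4}",
    "PHONE_REGEX": r"(?:\+\d{1,2}\s?)?1?-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}

def redact(message, prefix="<REDACTPROC-", suffix=">"):
    """Replace every occurrence of each pattern with its pattern name."""
    for name, pattern in PATTERNS.items():
        message = re.sub(pattern, f"{prefix}{name}{suffix}", message)
    return message

msg = ("Payment ok for jane@example.com, card 4444-5555-6666-7777, "
       "SSN 123-45-6789, phone (555) 123-4567")
print(redact(msg))
```

The same find-and-replace structure works for any secrets pattern you add to the dictionary.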
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-1.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;redact processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;prefix&quot;: &quot;&lt;REDACTPROC-&quot;,
        &quot;suffix&quot;: &quot;&gt;&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;,
          &quot;%{SSN:SSN_REGEX}&quot;,
          &quot;%{PHONE:PHONE_REGEX}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;&quot;&quot;\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}&quot;&quot;&quot;,
          &quot;SSN&quot;: &quot;&quot;&quot;\d{3}-\d{2}-\d{4}&quot;&quot;&quot;,
          &quot;PHONE&quot;: &quot;&quot;&quot;(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}&quot;&quot;&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_PROCESSOR_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.proc.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message.contains('REDACTPROC')&quot;,
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.proc?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>And now, we will add the <code>logs-pii-redact-processor</code> pipeline to the overall <code>process-pii</code> pipeline.</p>
&lt;details open&gt;
  &lt;summary&gt;updated process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated process-pii pipeline that now calls the NER and redact processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in the <a href="#reloading-the-logs">Reloading the logs</a> section. If you have not generated the logs yet, follow the instructions in the <a href="#data-loading-appendix">Data Loading Appendix</a>.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.message: REDACTPROC</code>. Then add the <code>redact.message</code> field to the table, and you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-1-part-2.png" alt="PII Discover Blog 2 Part 1" /></p>
<p>If you did not already load the dashboard from Blog Part 1, load it now via Kibana -&gt; Stack Management -&gt; Saved Objects -&gt; Import; it can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<p>It should look something like this now. Note that the REGEX portions of the dashboard are now active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-1-part-2.png" alt="PII Dashboards Blog 2 Part 1" /></p>
<h2>Checkpoint</h2>
<p>At this point, we have the following capabilities:</p>
<ul>
<li>Ability to sample incoming logs and apply this PII redaction</li>
<li>Detect and Assess PII with the NER/NLP and Pattern Matching</li>
<li>Assess the amount, type and quality of the PII detections</li>
</ul>
<p>This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.</p>
<ul>
<li>Clean up the working and unredacted data</li>
<li>Update the Dashboard to work with the cleaned-up data</li>
<li>Apply Role Based Access Control to protect the raw unredacted data</li>
<li>Create Alerts</li>
<li>Production and Scaling Considerations</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Applying to Production Systems</h2>
<h3>Cleanup working data and update the dashboard</h3>
<p>And now we will add the cleanup code to the overall <code>process-pii</code> pipeline.</p>
<p>In short, we set a flag <code>redact.enable: true</code> that directs the pipeline to move the unredacted <code>message</code> field to <code>raw.message</code> and then move the redacted message field <code>redact.message</code> to the <code>message</code> field. We will &quot;protect&quot; the <code>raw.message</code> field in the following section.</p>
<p><strong>NOTE:</strong> Of course, you can change this behavior if you want to completely delete the unredacted data. In this exercise, we will keep and protect it.</p>
<p>In addition we set <code>redact.cleanup: true</code> to clean up the NLP working data.</p>
<p>These fields allow a lot of control over what data you decide to keep and analyze.</p>
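<p>The rename-and-cleanup behavior can be sketched in a few lines of Python (a simplified model of the ingest document, not the actual processor implementation):</p>

```python
# Simplified model of the rename/remove processors added to process-pii:
# when PII was found and redaction is enabled, the unredacted message is
# preserved in raw.message and the redacted text becomes the message.

def finalize_doc(doc, redact_enable=True, cleanup=True):
    redact_meta = doc.get("redact", {})
    if redact_meta.get("pii", {}).get("found") and redact_enable:
        doc["raw"] = {"message": doc.pop("message")}   # keep original, protected later by RBAC
        doc["message"] = redact_meta.pop("message")    # redacted text becomes the message
    if cleanup:
        doc.pop("ml", None)                            # drop NLP working data
    return doc

doc = {
    "message": "SSN 123-45-6789",
    "redact": {"message": "SSN <REDACTPROC-SSN_REGEX>", "pii": {"found": True}},
    "ml": {"inference": {"entities": []}},
}
finalize_doc(doc)
print(doc["message"])         # redacted text
print(doc["raw"]["message"])  # original, to be protected with field-level security
```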
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-2.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;updated process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated process-pii pipeline that now calls the NER and redact processor pipelines and cleans up working data
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;raw.message&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;target_field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to clean up working data&quot;,
        &quot;field&quot;: &quot;redact.cleanup&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.cleanup == true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in the <a href="#reloading-the-logs">Reloading the logs</a> section.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.pii.found: true</code>. Then add the following fields to the table:</p>
<p><code>message</code>,<code>raw.message</code>,<code>redact.ner.found</code>,<code>redact.proc.found</code>,<code>redact.pii.found</code></p>
<p>You should see something like this
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-2-part-2.png" alt="PII Discover Part 2 Blog 2" /></p>
<p>We have everything we need to move forward with protecting the PII and Alerting on it.</p>
<p>Load the new dashboard that works on the cleaned-up data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-2.ndjson</code> file that can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-dashboard-part-2.ndjson">here</a>.</p>
<p>The new dashboard should look like this. Note: it uses different fields under the covers, since we have cleaned up the underlying data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-2-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Apply Role Based Access Control to protect the raw unredacted data</h3>
<p>Elasticsearch natively supports role-based access control, including field- and document-level access control; this dramatically reduces the operational and maintenance complexity required to secure our application.</p>
<p>We will create a Role that does not allow access to the <code>raw.message</code> field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the <code>message</code> field, but will not be able to access the protected <code>raw.message</code> field.</p>
<p><strong>NOTE:</strong> Since we only sampled 10% of the data in this exercise, the non-sampled <code>message</code> fields are not moved to <code>raw.message</code>, so they are still viewable; but this shows the capability you can apply in a production system.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-rbac.json">The code can be found here</a> for the following section of code.</p>
&lt;details open&gt;
  &lt;summary&gt;RBAC protect-pii role and user code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 &quot;cluster&quot;: [],
 &quot;indices&quot;: [
   {
     &quot;names&quot;: [
       &quot;logs-*&quot;
     ],
     &quot;privileges&quot;: [
       &quot;read&quot;,
       &quot;view_index_metadata&quot;
     ],
     &quot;field_security&quot;: {
       &quot;grant&quot;: [
         &quot;*&quot;
       ],
       &quot;except&quot;: [
         &quot;raw.message&quot;
       ]
     },
     &quot;allow_restricted_indices&quot;: false
   }
 ],
 &quot;applications&quot;: [
   {
     &quot;application&quot;: &quot;kibana-.kibana&quot;,
     &quot;privileges&quot;: [
       &quot;all&quot;
     ],
     &quot;resources&quot;: [
       &quot;*&quot;
     ]
   }
 ],
 &quot;run_as&quot;: [],
 &quot;metadata&quot;: {},
 &quot;transient_metadata&quot;: {
   &quot;enabled&quot;: true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 &quot;password&quot; : &quot;mypassword&quot;,
 &quot;roles&quot; : [ &quot;protect-pii&quot; ],
 &quot;full_name&quot; : &quot;Stephen Brown&quot;
}

</code></pre>
 &lt;/details&gt;
<p>Now log in from a separate window as the new user <code>stephen</code> with the <code>protect-pii</code> role. Go to Discover, put <code>redact.pii.found : true</code> in the KQL bar, and add the <code>message</code> field to the table. Notice that <code>raw.message</code> is not available.</p>
<p>You should see something like this
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-3-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Create an Alert when PII Detected</h3>
<p>Now, with the pipeline processing in place, creating an alert when PII is detected is easy. Review <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Alerting in Kibana</a> in detail if needed.</p>
<p>NOTE: <a href="#reloading-the-logs">Reload</a> the data if needed to have recent data.</p>
<p>First, we will create a simple <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL query</a> in Discover.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-esql-alert-blog-2.txt">The code can be found here.</a></p>
<pre><code>FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count &gt; 0
</code></pre>
<p>When you run this you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-esql-1-part-2.png" alt="PII ESQL Part 1 Blog 2" /></p>
<p>Now click the Alerts menu and select <code>Create search threshold rule</code>; we will create an alert that notifies us when PII is found.</p>
<p><strong>Select a time field: @timestamp
Set the time window: 5 minutes</strong></p>
<p>Assuming you loaded the data recently, when you run <strong>Test</strong> you should see something like:</p>
<p>pii_count: <code>343</code><br />
Alerts generated: <code>query matched</code></p>
<p>Add an action when the alert is Active.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Query matched</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>Add an Action for when the Alert is Recovered.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Recovered</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>When everything is set up, it should look like this; then click <code>Save</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-1-part2.png" alt="Alert Setup" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-2-part2.png" alt="Action Alert" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-3-part2.png" alt="Action Alert" /></p>
<p>You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.</p>
<pre><code>Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<p>And then if you wait you will get a Recovered alert that looks like this.</p>
<pre><code>Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<h3>Production Scaling</h3>
<h4>NER Scaling</h4>
<p>As we mentioned in <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#named-entity-recognition-ner-detection">Part 1 of this blog</a>, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.</p>
<p>Please review <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#loading-configuration-and-execution-of-the-ner-pipeline">the setup and configuration of the NER</a> model from Part 1 of the blog.</p>
<p>We chose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a> for our PII case.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>The metrics below are related to the model and configuration from Part 1 of the blog.</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate.
<strong>Note:</strong> If there are many repeated logs, the cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process</li>
<li>8192 Queue</li>
</ul>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           &quot;node&quot;: {
              &quot;0m4tq7tMRC2H5p5eeZoQig&quot;: {
.....
                &quot;attributes&quot;: {
                  &quot;xpack.installed&quot;: &quot;true&quot;,
                  &quot;region&quot;: &quot;us-west-1&quot;,
                  &quot;ml.allocated_processors&quot;: &quot;5&quot;, &lt;&lt; HERE 
.....
            },
            &quot;inference_count&quot;: 5040,
            &quot;average_inference_time_ms&quot;: 138.44285714285715, &lt;&lt; HERE 
            &quot;average_inference_time_ms_excluding_cache_hits&quot;: 138.44285714285715,
            &quot;inference_cache_hit_count&quot;: 0,
.....
            &quot;threads_per_allocation&quot;: 1,
            &quot;number_of_allocations&quot;: 4,  &lt;&lt;&lt; HERE
            &quot;peak_throughput_per_minute&quot;: 1550,
            &quot;throughput_last_minute&quot;: 1373,
            &quot;average_inference_time_ms_last_minute&quot;: 137.55280407865988,
            &quot;inference_cache_hit_count_last_minute&quot;: 0
          }
        ]
      }
    }
</code></pre>
<p>There are 3 key pieces of information above:</p>
<ul>
<li>
<p><code>&quot;ml.allocated_processors&quot;: &quot;5&quot;</code>
The number of physical cores / processors available</p>
</li>
<li>
<p><code>&quot;number_of_allocations&quot;: 4</code>
The number of allocations, which is at most 1 per physical core. <strong>Note</strong>: we could have used 5 allocations, but we only allocated 4 for this exercise</p>
</li>
<li>
<p><code>&quot;average_inference_time_ms&quot;: 138.44285714285715</code>
The average inference time per document.</p>
</li>
</ul>
<p>The throughput math for Inferences per Minute (IPM) per allocation (1 allocation per physical core) is straightforward, since an inference uses a single core and a single thread.</p>
<p>Then the Inferences per Min per Allocation is simply:</p>
<p><code>IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435</code></p>
<p>Which then lines up with the Total Inferences per Minute:</p>
<p><code>Total IPM = 435 IPM / allocation * 4 Allocations = ~1740</code></p>
<p>Suppose we want to do 10,000 IPMs, how many allocations (cores) would I need?</p>
<p><code>Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)</code></p>
<p>Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.</p>
<p><code>IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled</code></p>
<p>Then</p>
<p><code>Allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)</code></p>
<p><strong>Want it faster?</strong> It turns out there is a more lightweight NER model, <a href="https://huggingface.co/dslim/distilbert-NER">distilbert-NER</a>, that is faster; the tradeoff is slightly lower accuracy.</p>
<p>Running the logs through this model results in an inference time nearly twice as fast!</p>
<p><code>&quot;average_inference_time_ms&quot;: 66.0263959390863</code></p>
<p>Here is some quick math:
<code>IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909</code></p>
<p>Suppose we want to do 25,000 IPMs, how many allocations (cores) would I need?</p>
<p><code>Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)</code></p>
<p><strong>Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.</strong></p>
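<p>The sizing arithmetic above is easy to script. Here is a minimal Python sketch (the function names are ours, not part of any Elastic tooling) that turns an incoming EPS rate, a sampling fraction, and the observed <code>average_inference_time_ms</code> into a required allocation count:</p>

```python
import math

def ipm_per_allocation(avg_inference_ms: float) -> float:
    # Inferences per minute that one allocation (one core, one thread) can perform
    return 60_000 / avg_inference_ms

def required_allocations(eps: float, sampling: float, avg_inference_ms: float) -> int:
    # eps: incoming events per second
    # sampling: fraction of events routed to the NER model (0.01 == 1%)
    # avg_inference_ms: average_inference_time_ms from the trained model _stats
    sampled_ipm = eps * 60 * sampling
    return math.ceil(sampled_ipm / ipm_per_allocation(avg_inference_ms))

# 5000 EPS sampled at 1% with the ~138ms bert-base-NER inference time
print(required_allocations(5000, 0.01, 138))  # 7
```

<p>The same function reproduces the earlier answer as well: 10,000 IPM of unsampled traffic at 138ms per inference works out to 23 allocations.</p>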
<h4>Redact Processor Scaling</h4>
<p>In short, the <code>redact</code> processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.</p>
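<p>To build intuition for why well-constructed patterns matter, the sketch below mimics in plain Python what the <code>redact</code> processor does with Grok patterns inside Elasticsearch: precompile the expressions once, then substitute a named placeholder for each match. This is purely illustrative; the patterns and placeholders are ours:</p>

```python
import re

# Precompiled patterns are key to redact performance: compile once, reuse per event.
# These are illustrative stand-ins for Grok patterns such as
# %{EMAILADDRESS:EMAIL} and %{IP:IP_ADDRESS}.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP_ADDRESS>"),
]

def redact(message: str) -> str:
    # Replace each match with a named placeholder, as the redact processor does
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

print(redact("user bob@example.com logged in from 10.1.2.3"))
# user <EMAIL> logged in from <IP_ADDRESS>
```

<p>Tight, specific patterns like these stay cheap per event; broad, backtracking-heavy expressions are what typically cause regex-based redaction to fall behind at production volumes.</p>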
<h3>Assessing incoming logs</h3>
<p>If you want to assess incoming log data in a data stream, all you need to do is change the conditional in the <code>logs@custom</code> pipeline so that the <code>process-pii</code> pipeline is applied to the dataset you want. You can use any conditional that fits your use case.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<pre><code class="language-bash">    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, &lt;&lt;&lt; HERE
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
</code></pre>
<p>So, if for example your logs are coming into <code>logs-mycustomapp-default</code>, you would just change the conditional to:</p>
<pre><code>        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'mycustomapp'&quot;,
</code></pre>
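<p>Before pointing real traffic at the pipeline, you can exercise the conditional with Elasticsearch's <code>_simulate</code> endpoint. The helper below (ours, purely for illustration) builds a simulate request body with <code>data_stream.dataset</code> set, which you could POST to <code>_ingest/pipeline/logs@custom/_simulate</code>:</p>

```python
def simulate_body(dataset: str, message: str) -> dict:
    # Build a body for POST _ingest/pipeline/logs@custom/_simulate so that the
    # dataset conditional can be exercised before deploying a change
    return {
        "docs": [
            {
                "_source": {
                    "message": message,
                    "data_stream": {
                        "dataset": dataset,
                        "namespace": "default",
                    },
                }
            }
        ]
    }

body = simulate_body("mycustomapp", "2024-01-01 login from 10.1.2.3")
```

<p>If the conditional matches, the simulate response shows the document after the <code>process-pii</code> pipeline has run; if not, the document comes back untouched.</p>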
<h3>Assessing historical data</h3>
<p>If you have a historical (already ingested) data stream or index, you can run the assessment over it using the <code>_reindex</code> API.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<p>There are a couple of extra steps:
<a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-historical-data-blog-2.json">The code can be found here.</a></p>
<ol>
<li>First, we can set the parameters to keep ONLY the sampled data, as there is no reason to make a copy of all the unsampled data. In the <code>process-pii</code> pipeline, there is a setting, <code>sample.keep_unsampled</code>; setting it to <code>false</code> keeps only the sampled data.</li>
</ol>
<pre><code class="language-bash">    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: false &lt;&lt;&lt; SET TO false
      }
    },
</code></pre>
<ol start="2">
<li>Second, we will create a pipeline that will reroute the data to the correct data stream to run through all the PII assessment/detection pipelines. It also sets the correct <code>dataset</code> and <code>namespace</code></li>
</ol>
<pre><code class="language-bash">DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.dataset&quot;,
        &quot;value&quot;: &quot;pii&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.namespace&quot;,
        &quot;value&quot;: &quot;default&quot;
      }
    },
    {
      &quot;reroute&quot; : 
      {
        &quot;dataset&quot; : &quot;{{data_stream.dataset}}&quot;,
        &quot;namespace&quot;: &quot;{{data_stream.namespace}}&quot;
      }
    }
  ]
}
</code></pre>
<ol start="3">
<li>Finally, we can run a <code>_reindex</code> to select the data we want to assess. It is recommended to review the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html">_reindex</a> documentation before trying this. First, select the source data stream you want to assess; in this example, it is the <code>logs-generic-default</code> logs data stream. Note: I also added a <code>range</code> filter to select a specific time range. There is a bit of a &quot;trick&quot; we need to use, since we are rerouting the data to the data stream <code>logs-pii-default</code>: we set <code>&quot;index&quot;: &quot;logs-tmp-default&quot;</code> in the <code>_reindex</code>, and the correct data stream is then set by the pipeline. We must do that because <code>reroute</code> is a <code>noop</code> if it is called from/to the same data stream.</li>
</ol>
<pre><code class="language-bash">POST _reindex?wait_for_completion=false
{
  &quot;source&quot;: {
    &quot;index&quot;: &quot;logs-generic-default&quot;,
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;range&quot;: {
              &quot;@timestamp&quot;: {
                &quot;gte&quot;: &quot;now-1h/h&quot;,
                &quot;lt&quot;: &quot;now&quot;
              }
            }
          }
        ]
      }
    }
  },
  &quot;dest&quot;: {
    &quot;op_type&quot;: &quot;create&quot;,
    &quot;index&quot;: &quot;logs-tmp-default&quot;,
    &quot;pipeline&quot;: &quot;sendtopii&quot;
  }
}
</code></pre>
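<p>Since the data loading appendix already installs the <code>elasticsearch</code> Python client, you could also build the same <code>_reindex</code> request from a script. Here is a sketch; the helper name and defaults are ours, not an Elastic API:</p>

```python
def reindex_body(source_index: str, hours: int = 1,
                 dest_index: str = "logs-tmp-default",
                 pipeline: str = "sendtopii") -> dict:
    # Mirrors the _reindex call above: pull a recent time window from the
    # source and push each doc through the sendtopii pipeline, which then
    # reroutes it to logs-pii-default
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"@timestamp": {"gte": f"now-{hours}h/h", "lt": "now"}}}
                    ]
                }
            },
        },
        "dest": {"op_type": "create", "index": dest_index, "pipeline": pipeline},
    }

body = reindex_body("logs-generic-default")
# Submit this body via the Reindex API with wait_for_completion=false,
# then poll the returned task with the Tasks API
```
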
<h2>Summary</h2>
<p>At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.</p>
<p><a href="https://github.com/bvader/elastic-pii/tree/main/elastic/blog-complete-end-solution">The end-state solution can be found here</a>.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we covered the following:</p>
<ul>
<li>Redact PII using the NER model and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p><em><strong>So get to work and reduce risk in your logs!</strong></em></p>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10,000 random logs in a file named <code>pii.log</code>, with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note</strong>: To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise, and the logs will be loaded again. The new logs will not collide with previous runs, as there is a unique <code>run.id</code> for each run, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-ner-regex-assess-redact-part-2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Pruning incoming log volumes with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes</link>
            <guid isPermaLink="false">pruning-incoming-log-volumes</guid>
            <pubDate>Fri, 23 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[To drop or not to drop (events) is the question, not only in deciding what events and fields to remove from your logs but also in the various tools used. Learn about using Beats, Logstash, Elastic Agent, Ingest Pipelines, and OTel Collectors.]]></description>
            <content:encoded><![CDATA[<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/log/*.log
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_event:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              url.path: /profile
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_fields:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              http.response.status_code: 200
      fields: [&quot;event.message&quot;]
      ignore_missing: false
</code></pre>
<pre><code class="language-ruby">input {
  file {
    id =&gt; &quot;my-logging-app&quot;
    path =&gt; [ &quot;/var/tmp/other.log&quot;, &quot;/var/log/*.log&quot; ]
  }
}
filter {
  if [url][scheme] == &quot;http&quot; &amp;&amp; [url][path] == &quot;/profile&quot; {
    drop {
      percentage =&gt; 80
    }
  }
}
output {
  elasticsearch {
        hosts =&gt; &quot;https://my-elasticsearch:9200&quot;
        data_stream =&gt; &quot;true&quot;
    }
}
</code></pre>
<pre><code class="language-ruby"># Input configuration omitted
filter {
  if [url][scheme] == &quot;http&quot; &amp;&amp; [http][response][status_code] == 200 {
    drop {
      percentage =&gt; 80
    }
    mutate {
      remove_field =&gt; [ &quot;[event][message]&quot; ]
    }
  }
}
# Output configuration omitted
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ]
}
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app with failures&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set 'ingest.failure.message'&quot;,
        &quot;field&quot;: &quot;ingest.failure.message&quot;,
        &quot;value&quot;: &quot;Ingestion issue&quot;
        }
      }
  ]
}
</code></pre>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [/var/tmp/other.log, /var/log/*.log]
processors:
  filter/denylist:
    error_mode: ignore
    logs:
      log_record:
        - 'attributes[&quot;url.scheme&quot;] == &quot;http&quot;'
        - 'attributes[&quot;url.path&quot;] == &quot;/profile&quot;'
        - 'attributes[&quot;http.response.status_code&quot;] == 200'
  attributes/errors:
    actions:
      - key: error.message
        action: delete
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
exporters:
  # Exporters configuration omitted
service:
  pipelines:
    # Pipelines configuration omitted
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pruning-incoming-log-volumes/blog-thumb-elastic-on-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reduce-mttd-ml-machine-learning-observability</link>
            <guid isPermaLink="false">reduce-mttd-ml-machine-learning-observability</guid>
            <pubDate>Tue, 07 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Elastic’s machine learning capabilities help analyze the root cause of issues, allowing you to focus your time on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual or simple searching etc) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.</p>
<p>In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.</p>
<p>Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce time in finding that “needle” in the haystack.</p>
<p>Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To help get you started, there are several key features built into Elastic Observability to aid in analysis, helping bypass the need to run specific ML models. These features help minimize the time and analysis for logs.</p>
<p>Let’s review some of these built-in ML features:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
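<p>Elastic's models are far more sophisticated than this, but the core idea of “learn normal behavior, flag deviations” can be illustrated with a toy Python sketch (ours, for intuition only; it is not Elastic's algorithm):</p>

```python
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    # Flag points deviating more than `threshold` standard deviations from the
    # trailing window's mean -- a crude stand-in for modeling "normal behavior"
    # from recent history
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

latency_ms = [270, 280, 275, 272, 278, 274, 276, 273, 277, 275, 1100]  # spike at the end
print(anomalies(latency_ms))  # [10]
```

<p>Real anomaly detection also has to learn trends and periodicity and assign calibrated scores, which is exactly the heavy lifting Elastic does for you.</p>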
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action quicker.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p><strong>In this blog, we will cover anomaly detection and log categorization against the popular “Hipster Shop app” developed by Google, and modified recently by OpenTelemetry.</strong></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>In this blog, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Utilize a version of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">Anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map.png" alt="" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.</p>
<p>As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Anomaly detection</li>
<li>Log categorization</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<h3>Machine learning for anomaly detection</h3>
<p>Elastic will detect anomalies based on historical patterns and identify a probability of these issues.</p>
<p>Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map-anomaly-detection.png" alt="" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-single-metric-viewer.png" alt="" /></p>
<p>What you will see for the productCatalogService is that there is a severe spike in average transaction latency time, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). It’s likely that customers are potentially responding to the slowness of the site and that the company is losing potential transactions.</p>
<p>One step to take next is to review all the other potential anomalies that we saw in the service map in a larger picture. Use an anomaly explorer to view all the anomalies that have been identified.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer.png" alt="" /></p>
<p>Elastic is identifying numerous services with anomalies. productCatalogService has the highest score, and a good number of others (frontend, checkoutService, advertService, and more) also have high scores. However, this analysis is looking at just one metric.</p>
<p>Elastic can help detect anomalies across all types of data, such as Kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we will see a more comprehensive view of what is potentially causing this latency issue.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-job-selection.png" alt="" /></p>
<p>Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-timeline.png" alt="" /></p>
<p>In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see the category examples column). As we scroll through the results, we notice a potential postgreSQL issue from the categorization, which also has a high score of 94. Machine learning has identified a “rare mlcategory,” meaning that it has rarely occurred, pointing to a potential cause of the issue customers are seeing.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-machine-learning-service-name.png" alt="" /></p>
<p>We also notice that this issue is potentially caused by pgbench, a popular postgreSQL tool that helps benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment, as it causes heavy load on the database host, likely causing the higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, not the production environment.</p>
<h3>Machine learning for log categorization</h3>
<p>Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map versus initially exploring the anomaly. When we explore the service details for productCatalogService, we see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-product-catalog-service.png" alt="" /></p>
<p>The service details are identifying several things:</p>
<ol>
<li>There is abnormally high latency compared to the expected bounds of the service. We see that latency was recently higher than normal (upwards of 1s) compared to the average of 275ms.</li>
<li>There is also a high failure rate for the same time frame as the high latency (lower left chart “ <strong>Failed transaction rate</strong> ”).</li>
<li>Additionally, we can see the transactions, and one in particular, /ListProduct, has an abnormally high latency in addition to a high failure rate.</li>
<li>We see productCatalogService has a dependency on postgreSQL.</li>
<li>We also see errors all related to postgreSQL.</li>
</ol>
<p>We have the option of digging through the logs and analyzing them in Elastic, or we can use log categorization to identify the relevant logs more easily.</p>
<p>If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify postgresql logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized the postgresql logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-categories.png" alt="" /></p>
<p>We notice two additional items:</p>
<ul>
<li>There is a high count category (a message count of 23,797, with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence we search further for all pgbench-related logs in Categories.</li>
<li>We see an odd issue regarding terminating the connection (with a low count).</li>
</ul>
<p>While investigating the second error, which is severe, we can see logs from Categories before and after the error.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-timestamp.png" alt="" /></p>
<p>This troubleshooting shows postgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we have an idea that someone was running pgbench and this potentially overloaded the database, causing the latency issue that customers are seeing.</p>
<p>The next steps here could be to investigate anomaly detection and/or work with the developers to review the code and identify pgbench as part of the deployed configuration.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you further identify and get closer to pinpointing root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of lessons and what you learned:</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:</p>
<ol>
<li><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</li>
<li><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker.</li>
</ol>
</li>
<li>
<p>You learned how simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand the machine learning that drives these features or perform any lengthy setup.
Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above.</p>
</li>
</ul>
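<p>For readers curious about what the UI configures under the hood, a categorization job like the one driving these views can be sketched with the Elasticsearch ML anomaly detection job API. The job name, bucket span, and description below are illustrative assumptions, not the exact configuration Elastic creates for you:</p>

```console
PUT _ml/anomaly_detectors/log-categorization-demo
{
  "description": "Count log events per message category (illustrative job)",
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "detector_description": "count by mlcategory"
      }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```

<p>Counting by the special <code>mlcategory</code> field is what groups similar messages together and flags categories whose counts are anomalous.</p>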
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Live logs and prosper: fixing a fundamental flaw in observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams</link>
            <guid isPermaLink="false">reimagine-observability-elastic-streams</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Stop chasing symptoms. Learn how Streams in Elastic Observability fixes the fundamental flaw in observability, using AI to proactively find the 'why' in your logs for faster resolution.]]></description>
            <content:encoded><![CDATA[<p>SREs are often overwhelmed by dashboards and alerts that show what and where things are broken, but fail to reveal why. This industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial &quot;why&quot; is buried in information-rich logs, but their massive volume and unstructured nature have led the industry to throw them aside or treat them as second-class citizens. As a result, SREs are forced to turn every investigation into a high-stress, time-consuming hunt for clues. We can solve this problem with logs, but unlocking their potential requires us to reimagine how we work with them and improve the overall investigation journey.</p>
<h2>Observability, the broken promise</h2>
<p>To see why the current model fails, let’s look at the all-too-familiar challenge every SRE dreads: knowing a problem exists but needing to spend valuable time just trying to find where to even start the investigation.</p>
<p>Imagine you get a Slack message from the support team: &quot;a few high-value customers are reporting their payments are failing.&quot; You have no shortage of alerts, but most are just flagging symptoms. You don’t know where to start. You decide to check the logs to see if there is anything obvious, starting with the systems that have the high CPU alert.</p>
<p>You spend a few minutes searching and <code>grep</code>-ing through terabytes of logs for affected customer IDs, trying to piece together the problem. Nothing. You worry that you aren’t getting all the logs to reveal the problem, so you turn on more logging in the application. Now you’re knee-deep in data, desperately trying to find patterns, errors, or other &quot;hints&quot; that will give you a clue as to the <em>why</em>.</p>
<p>Finally, one of the broader log queries hits on an error code associated with an impacted customer ID. This is the first real clue. You pivot your search to this new error code and after an hour of digging, you finally uncover the error message. You've finally found the <em>why</em>, but it was a stressful, manual hunt that took far too much time and impacted dozens more customers.</p>
<p>This incident perfectly illustrates the broken promise of modern observability: The complete failure of the investigation process. Investigations are a manual, reactive process that SREs are forced into every day. At Elastic, we believe metrics, traces, and logs are all essential, but their roles, and the workflow between them, must be fundamentally re-imagined for effective investigations.</p>
<p>Observability is about having the clearest understanding possible of the <em>what</em>, <em>where</em>, and <em>why</em>. Metrics are essential for understanding the <em>what</em>. They are the heartbeat of your system, powering the dashboards and alerts that tell you when a threshold has been breached, like high CPU utilization or error rates. But they are aggregates; they show the symptom, rarely the root cause. Traces are good at identifying the <em>where</em>. They map the journey of a request through a distributed system, pinpointing the specific microservice or function where latency spikes or an error originates. Yet, their effectiveness hinges on complete and consistent code instrumentation, a constant dependency on development teams that can leave you with critical visibility gaps. Logs tell you the <em>why</em>. They contain all the rich, contextual, and unfiltered truth of an event. If we can more proactively and efficiently extract information from logs, we can greatly improve our overall understanding of our environments.</p>
<h2>Challenges of logs in modern environments</h2>
<p>While logs are in the standard toolbox, they have been neglected. SREs using today’s solutions deal with several major problems:</p>
<ul>
<li>First, due to their unstructured nature, it’s very difficult to parse and manage logs so that they’re useful. As a result, many SRE teams spend a lot of time building and maintaining complex pipelines to help manage this process.</li>
<li>Second, logs can get expensive at high volume, which leads teams to drop them on the floor to control costs, throwing away valuable information in the process. Consequently, when an incident occurs, you waste precious time hunting for the right logs and manually correlating across services.</li>
<li>Finally, nobody has built a log solution that proactively works to find the important signals in logs and to surface those critical <em>whys</em> to you when you need them. As a result, log-based investigations are too painful and slow.</li>
</ul>
<p>Why are we here? As applications became more complex, log volume became unmanageable. Instead of solving this with automation, the industry took a shortcut: it gave up on getting the most out of logs and prioritized more manageable but less informative signals.</p>
<p>This decision is the origin of the broken, reactive model. It forced observability into a manual loop of 'observing' alerts, rather than building automation that could help us truly understand our systems to improve how we root cause and resolve issues. This has transformed SREs from investigators into full-time data wranglers, wrestling with Grok patterns and fragile ETL scripts instead of solving outages. </p>
<h2>Introducing Streams to rethink how you use logs for investigations</h2>
<p>Streams is an agentic AI solution that simplifies working with logs to help SRE teams rapidly understand the <em>why</em> behind an issue for faster resolution. The combination of Elasticsearch and AI is turning manual management of noisy logs into automated workflows that identify patterns, context, and meaning, marking a fundamental shift in observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-01.png" alt="Streams" /></p>
<h4>Log everything in any format</h4>
<p>By applying the Elasticsearch platform for context engineering, bringing together retrieval and AI-driven parsing that keeps up with schema changes, we are reimagining the entire log pipeline.</p>
<p>Streams ingests raw logs from all your sources to a single destination. It then uses AI to partition incoming logs into their logical components and parses them to extract relevant fields for an SRE to validate, approve, or modify. Imagine a world where you simply point your logs to a single endpoint and everything just works: less wrestling with Grok patterns, configuring processors, and hunting for the right plugin, all of which significantly reduces complexity. Streams is a big step toward realizing that vision.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-02.png" alt="Streams" /></p>
<p>As a result, SREs are freed from managing complex ingestion pipelines, allowing them to spend less time on data wrangling and more time preventing service disruptions.</p>
<h4>Solve incidents faster with Significant Events </h4>
<p>Significant Events, a capability within Streams, uses AI to automatically surface major errors and anomalies, enabling you to be proactive in your investigations. So, instead of just combing through endless noise, you can focus on the events that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other significant signals of change. These events act as actionable markers, giving SREs early warning and clear focus to begin an investigation before service impact.</p>
<p>With this new foundation, logs will become your primary signal for investigation. The panicked, manual search for a needle in a digital haystack is about to be over. Significant Events acts like a smart metal detector that sifts through the chaos and only beeps when it finds issues, helping you to easily ignore all that hay and find the &quot;needle&quot; faster. </p>
<p>Now imagine the same scenario we started with. Instead of starting a frantic, time-consuming grep through terabytes of logs. Streams has already done the heavy lifting. Its AI-driven analysis has detected a new, anomalous pattern that began before your support team even knew about it and automatically surfaced it as a significant event. Rather than you hunting for a clue, the clue finds you. </p>
<p>With a single click, you have the <em>why</em>: a Java out-of-memory error in a specific service component. This is your starting point. You find the root cause in under two minutes and begin remediation. The customer impact is stopped, the dev team gets the specific error, and the problem is contained before it can escalate. In this case, metrics and traces were unhelpful in finding the <em>why</em>. The answer was waiting in the logs all along.</p>
<p>This ideal outcome is possible because you can both afford to keep every log and instantly find the signal within them. Elastic's cost-efficient architecture with powerful compression, searchable snapshots, and data tiering makes full retention a reality. From there, Streams automatically surfaces the significant event, ensuring that the answer is never lost in the noise.</p>
<p>Elastic is the only company that provides an AI-driven log-first approach to elevate your observability signals and make it dramatically faster and easier to get to <em>why</em>. This is built on our decades of leadership in search, relevance, and powerful analytics that provides the foundation for understanding logs at a deep, semantic level.</p>
<h2>The vision for Streams </h2>
<p>The partitioning, parsing, and Significant Events you see today is just the starting point. The next step in our vision is to use the Significant Events to automatically generate critical SRE artifacts. Imagine Streams creating intelligent alerts, on-the-fly investigation dashboards, and even data-driven SLOs based <em>only</em> on the events that actually impact service health. From there, the goal is to use AI to drive automated Root Cause Analysis (RCA) directly from log patterns and generate remediation runbooks, turning a multi-hour hunt into an instant resolution recommendation.</p>
<p>Once this AI-driven log foundation is in place, our vision for Streams expands to become a unified intelligence layer that operates across all your telemetry data. It’s not just about making each signal better in isolation, but about understanding the context and relationships between them to solve complex problems.</p>
<p>For metrics, Streams won’t just alert you to a single metric spike; it will detect correlated anomalies across multiple, seemingly unrelated metrics, e.g., p99 latency for a specific service, a rise in garbage collection time, and a drop in transaction success rate.</p>
<p>Similarly, for traces, it identifies when a new, unexpected service call (e.g., a new database or an external API) appears in a critical transaction path after a deployment, or when a specific span is suddenly responsible for a majority of errors across all traces, even if the overall error rate hasn't breached a threshold.</p>
<p>The goal is not to have separate streams for logs, metrics, and traces, but to weave them into a single narrative that automatically correlates all three signals. Ultimately, Streams is about fundamentally changing the goal from a human-led data-gathering exercise to proactive, AI-driven resolution.</p>
<p><em>For more on Streams:</em></p>
<p><em>Read the</em> <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations"><em>Streams launch blog</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics</link>
            <guid isPermaLink="false">service-level-objectives-slos-logs-metrics</guid>
            <pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This blog reviews this feature and how you can use it with Elastic's AI Assistant to meet SLOs.]]></description>
            <content:encoded><![CDATA[<p>In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.</p>
<p>Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.</p>
<p>Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.</p>
<p>To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation or define your own</a>. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.</p>
<p>Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.</p>
<p>In this blog, we will outline the following:</p>
<ul>
<li>
<p>What are SLOs? A Google SRE perspective</p>
</li>
<li>
<p>Several scenarios of defining and managing SLOs</p>
</li>
</ul>
<h2>Service Level Objective overview</h2>
<p>Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in <a href="https://sre.google/sre-book/table-of-contents/">Google's SRE Handbook</a>. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:</p>
<ul>
<li>
<p><strong>Service Level Indicators (SLIs):</strong> These are carefully selected metrics, such as uptime, latency, throughput, or error rates, that represent the aspects of a service that matter from an operations or business perspective. An SLI measures the level of service provided (latency, uptime, etc.) and is defined as a ratio of good events over total events, ranging between 0% and 100%.</p>
</li>
<li>
<p><strong>Service Level Objective (SLO):</strong> An SLO is the target value for a service level, measured as a percentage by an SLI. Above the threshold, the service is compliant. For example, if the SLI is service availability with a target of 99.9% successful responses, then any time failed responses exceed 0.1%, the SLO is out of compliance.</p>
</li>
<li>
<p><strong>Error budget:</strong> This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO target: the quantity of errors that can be tolerated.</p>
</li>
<li>
<p><strong>Burn rate:</strong> This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.</p>
</li>
</ul>
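<p>The relationships between these four concepts reduce to simple arithmetic. Here is a minimal sketch in Python, using illustrative numbers that are not taken from the handbook:</p>

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI: ratio of good events over total events (0.0 to 1.0)."""
    return good_events / total_events

def error_budget(slo_target: float) -> float:
    """Error budget: 100% minus the SLO target."""
    return 1.0 - slo_target

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Burn rate: how fast the budget is consumed; 1.0 is exactly on budget."""
    return observed_error_rate / error_budget(slo_target)

# 998,500 good responses out of 1,000,000, against a 99.9% availability SLO:
availability = sli(998_500, 1_000_000)     # 0.9985
budget = error_budget(0.999)               # 0.001
rate = burn_rate(1 - availability, 0.999)  # ~1.5: burning budget 1.5x too fast
```

<p>A sustained burn rate above 1.0 means the error budget will be exhausted before the time window ends, which is exactly the condition burn-rate alerts fire on.</p>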
<p>Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to <a href="https://sre.google/workbook/slo-document/">Google's SRE Handbook</a>.</p>
<p>One main thing to remember is that SLO monitoring is <em>not</em> incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.</p>
<p>In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.</p>
<p>Elastic®’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are used as described there. Hence users can perform the following on SLOs in Elastic:</p>
<ul>
<li>
<p>Define an SLO on an SLI such as KQL (log based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.</p>
</li>
<li>
<p>Utilize occurrence- versus timeslice-based budgeting. Occurrence budgeting computes the number of good events over the number of total events for the SLO. Timeslice budgeting breaks the overall time window into smaller slices of a defined duration and computes the number of good slices over the total slices. Timeslice targets are more accurate and are useful when calculating a service’s SLO against agreed-upon customer targets.</p>
</li>
<li>
<p>Manage all the SLOs in a singular location.</p>
</li>
<li>
<p>Trigger alerts from the defined SLO, whether the SLI falls below target, the error budget is exhausted, or the burn rate exceeds a defined threshold.</p>
</li>
<li>
<p>Create unique service level dashboards with SLO information for a more comprehensive view of the service.</p>
</li>
</ul>
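<p>The difference between the two budgeting methods is easiest to see with numbers. A small sketch with invented data, where one slice out of ten is very bad:</p>

```python
# Each tuple is (good_events, total_events) for one time slice.
slices = [(999, 1000)] * 9 + [(500, 1000)]  # one very bad slice out of ten

def occurrences_sli(slices):
    """Occurrences: good events over total events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total

def timeslice_sli(slices, slice_target=0.99):
    """Timeslices: slices meeting the per-slice target over total slices."""
    good_slices = sum(1 for g, t in slices if g / t >= slice_target)
    return good_slices / len(slices)

occurrences_sli(slices)  # 0.9491 -- the bad slice is diluted by the good ones
timeslice_sli(slices)    # 0.9    -- the bad slice counts fully against you
```

<p>This is why timeslice budgeting better reflects the customer experience of a short, sharp outage.</p>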
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/1-slo-blog.png" alt="Create alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/2-slo-blog.png" alt="Create dashboards" /></p>
<p>SREs need to be able to manage business metrics.</p>
<h2>SLOs based on logs: NGINX availability</h2>
<p>Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.</p>
<p>Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (nginx), a processing layer, and a database layer.</p>
<p>Let’s say that your processing layer is managing a significant number of requests. You want to ensure that the service stays up. The best way is to ensure that all http.response.status_code values are less than 500. Anything below 500 means the service is up, and any errors (like 404) are client errors rather than server errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/3-slo-blog.png" alt="expanded document" /></p>
<p>If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/4-slo-blog.png" alt="17k" /></p>
<p>Additionally, the number of messages with http.response.status_code &gt; 500 is minimal: roughly 17K.</p>
<p>Rather than creating an alert, we can create an SLO with this query:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/5-slo-blog.png" alt="edit SLO" /></p>
<p>We chose to use occurrences as the budgeting method to keep things simple.</p>
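<p>For reference, an SLO like this can also be created programmatically by POSTing to Kibana’s SLO API (<code>/api/observability/slos</code> as of 8.12). The request body below is a sketch only; the index pattern, queries, and exact field names are assumptions that may differ across versions:</p>

```json
{
  "name": "NGINX availability (sketch)",
  "indicator": {
    "type": "sli.kql.custom",
    "params": {
      "index": "logs-nginx*",
      "good": "http.response.status_code < 500",
      "total": "http.response.status_code : *",
      "timestampField": "@timestamp"
    }
  },
  "timeWindow": { "duration": "7d", "type": "rolling" },
  "budgetingMethod": "occurrences",
  "objective": { "target": 0.999 }
}
```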
<p>Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, the error budget, and any specific alerts against the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/6-slo-blog.png" alt="SLOs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/7-slo-blog.png" alt="nginx server availability " /></p>
<p>Not only do we get information about the violation, but we also get:</p>
<ul>
<li>
<p>Historical SLI (7 days)</p>
</li>
<li>
<p>Error budget burn down</p>
</li>
<li>
<p>Good vs. bad events (24 hours)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/8-slo-blog.png" alt="Percentages" /></p>
<p>We can see how we’ve easily burned through our error budget.</p>
<p>Hence something must be going on with nginx. To investigate, all we need to do is utilize the <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">AI Assistant</a>, and use its natural language interface to ask questions to help analyze the situation.</p>
<p>Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/9-slo-blog.png" alt="count of http response status code" /></p>
<p>As we can see, the number of 502s is minimal compared to the number of overall messages, but it is affecting our SLO.</p>
<p>However, it seems like nginx is having an issue. To remediate it, we also ask the AI Assistant how to address this error. Specifically, we ask if there is an internal runbook the SRE team has created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/10-slo-blog.png" alt="ai assistant thread" /></p>
<p>The AI Assistant retrieves a runbook the team has added to its knowledge base. We can now analyze and try to resolve or mitigate the issue with nginx.</p>
<p>While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:</p>
<ul>
<li>
<p>99% of requests occur under 200ms</p>
</li>
<li>
<p>99% of log messages are not errors</p>
</li>
</ul>
<h2>Application SLOs: OpenTelemetry demo cartservice</h2>
<p>A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>.</p>
<p>This demo has <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flags</a> to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.</p>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic supports OpenTelemetry by ingesting OTLP directly, with no need for an Elastic-specific agent</a>. You can send OpenTelemetry data directly from the application (through OTel libraries) or through the collector.</p>
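<p>For instance, pointing an OpenTelemetry Collector at Elastic is a standard OTLP exporter configuration. The endpoint and token below are placeholders; substitute your own deployment’s values and reference the exporter from your pipeline:</p>

```yaml
exporters:
  otlp/elastic:
    # Placeholder endpoint and secret token for an Elastic deployment.
    endpoint: "https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_SECRET_TOKEN}"
```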
<p>We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/11-slo-blog.png" alt="SLOs" /></p>
<p>We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/12-slo-blog.png" alt="cartservice-otel" /></p>
<p>As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/13-slo-blog.png" alt="apm" /></p>
<p>We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.</p>
<h2>Conclusion</h2>
<p>SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:</p>
<ul>
<li>
<p>SLOs can be based on logs. In Elastic, you can use KQL to essentially find and filter on specific logs and log fields to monitor and trigger SLOs.</p>
</li>
<li>
<p>AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.</p>
</li>
<li>
<p>APM service-based SLOs are easy to create and manage through integration with Elastic APM. We also use OTel telemetry to help monitor SLOs.</p>
</li>
</ul>
<p>For more information on SLOs in Elastic, check out <a href="https://www.elastic.co/guide/en/observability/current/slo.html">Elastic documentation</a> and the following resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">What’s new in Elastic Observability 8.12</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Introducing the Elastic AI Assistant</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic OpenTelemetry support</a></p>
</li>
</ul>
<p>Ready to get started? Sign up for <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/139686_-_Elastic_-_Headers_-_V1_3.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Simplifying log data management: Harness the power of flexible routing with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-log-data-management-flexible-routing</link>
            <guid isPermaLink="false">simplifying-log-data-management-flexible-routing</guid>
            <pubDate>Tue, 13 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The reroute processor, available as of Elasticsearch 8.8, allows customizable rules for routing documents, such as logs, into data streams for better control of processing, retention, and permissions with examples that you can try on your own.]]></description>
            <content:encoded><![CDATA[<p>In Elasticsearch 8.8, we’re introducing the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> in technical preview that makes it possible to send documents, such as logs, to different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana dev tools</a>.</p>
<p>Elastic Observability offers a wide range of <a href="https://www.elastic.co/integrations/data-integrations?solution=observability">integrations</a> that help you to monitor your applications and infrastructure. These integrations are added as policies to <a href="https://www.elastic.co/guide/en/fleet/current/elastic-agent-installation.html">Elastic agents</a>, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as <a href="https://www.elastic.co/guide/en/kinesis/current/aws-firehose-setup-guide.html">Amazon Kinesis Data Firehose</a>, <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes container logs</a>, and <a href="https://docs.elastic.co/integrations/tcp">syslog</a>. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the <a href="https://docs.elastic.co/en/integrations/nginx">Nginx</a> and <a href="https://docs.elastic.co/en/integrations/apache">Apache</a> integrations. That’s because in Elasticsearch, in combination with the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>, the processing and the schema are both encapsulated in a data stream.</p>
<p>The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You can use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use its prebuilt dashboards, or create custom ones with the fields, such as the URL, status code, and response time, that the Nginx pipeline has parsed out of the Nginx log message. You can also use the reroute processor to separate regular Nginx access logs from Nginx error logs, giving you a further level of separation and categorization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/blog-elastic-routing-pipeline.png" alt="routing pipeline" /></p>
<h2>Example use case</h2>
<p>To use the reroute processor, first:</p>
<ol>
<li>
<p>Ensure you are on Elasticsearch 8.8</p>
</li>
<li>
<p>Ensure you have permissions to manage indices and data streams</p>
</li>
<li>
<p>If you don’t already have an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a>, sign up for one</p>
</li>
</ol>
<p>Next, you’ll need to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/set-up-a-data-stream.html">set up a data stream</a> and create a custom Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html">ingest pipeline</a> that is called as the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#set-default-pipeline">default pipeline</a>. Below we go through this step by step for the “mydata” data set that we’ll simulate ingesting container logs into. We start with a basic example and extend it from there.</p>
<p>Run the following steps in the Kibana console, which is found at <strong>Management -&gt; Dev tools -&gt; Console</strong>. First, we need an ingest pipeline and a template for the data stream:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
      }
    }
  ]
}
</code></pre>
<p>This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:</p>
<pre><code class="language-bash">PUT _index_template/logs-mydata
{
  &quot;index_patterns&quot;: [
    &quot;logs-mydata-*&quot;
  ],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 200,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-mydata&quot;
    },
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;container.name&quot;: {
          &quot;type&quot;: &quot;keyword&quot;
        }
      }
    }
  }
}
</code></pre>
<p>The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
<p>We can check that it was ingested with the command below, which will show 1 result.</p>
<pre><code class="language-bash">GET logs-mydata-default/_search
</code></pre>
<p>Even without further configuration, the reroute processor already allows us to route documents. As soon as the reroute processor is specified, it will look for the data_stream.dataset and data_stream.namespace fields by default and will send documents to the corresponding data stream, according to the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a> logs-&lt;dataset&gt;-&lt;namespace&gt;. Let’s try this out:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-03-30T12:27:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  },
  &quot;data_stream&quot;: {
    &quot;dataset&quot;: &quot;myotherdata&quot;
  }
}
</code></pre>
<p>As can be seen with the GET logs-myotherdata-default/_search command, this document ended up in the logs-myotherdata-default data stream. But instead of using the default rules, we want to create our own rules for the field container.name. If container.name == foo, we want to send the document to logs-foo-default. For this, we modify our routing pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;foo&quot;,
        &quot;if&quot; : &quot;ctx.container?.name == 'foo'&quot;,
        &quot;dataset&quot;: &quot;foo&quot;
      }
    }
  ]
}
</code></pre>
<p>Let's test this with a document:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
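<p>Assuming the condition matched, this document should now have been rerouted into the logs-foo-default data stream rather than logs-mydata-default. You can verify this by searching the target data stream:</p>
<pre><code class="language-bash">GET logs-foo-default/_search
</code></pre>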
<p>While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;mydata&quot;,
        &quot;dataset&quot;: [
          &quot;{{container.name}}&quot;,
          &quot;mydata&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, the document is routed to a data stream named after that field’s value; otherwise, it falls back to mydata. This can be tested with:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo1&quot;
  }
}

POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo2&quot;
  }
}
</code></pre>
<p>This creates the data streams logs-foo1-default and logs-foo2-default.</p>
<p><em>NOTE: There is currently a limitation in the processor that requires the fields specified in a <code>{{field.reference}}</code> to be in nested object notation; a dotted field name does not currently work. You’ll also get errors when the document contains dotted field names for any data_stream.* field. This limitation will be <a href="https://github.com/elastic/elasticsearch/pull/96243">fixed</a> in 8.8.2 and 8.9.0.</em></p>
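<p>To illustrate the limitation, here is a sketch of the two document shapes. The first, nested form works with a <code>{{container.name}}</code> field reference; the second, dotted form does not route correctly until the fix lands:</p>
<pre><code class="language-bash"># Nested object notation: the field reference resolves
POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: { &quot;name&quot;: &quot;foo&quot; }
}

# Dotted field name: the field reference does not resolve
POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container.name&quot;: &quot;foo&quot;
}
</code></pre>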
<h2>API keys</h2>
<p>When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for <code>logs-*-*</code> as data could end up in any of these indices (see example further down).</p>
<p>We’re currently <a href="https://github.com/elastic/integrations/issues/5989">working</a> <a href="https://github.com/elastic/integrations/issues/6255">on</a> extending the API key permissions for our <a href="https://www.elastic.co/integrations/data-integrations">integrations</a> so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.</p>
<p>If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:</p>
<pre><code class="language-bash">POST /_security/api_key
{
  &quot;name&quot;: &quot;ingest_logs&quot;,
  &quot;role_descriptors&quot;: {
    &quot;ingest_logs&quot;: {
      &quot;cluster&quot;: [
        &quot;monitor&quot;
      ],
      &quot;indices&quot;: [
        {
          &quot;names&quot;: [
            &quot;logs-*-*&quot;
          ],
          &quot;privileges&quot;: [
            &quot;auto_configure&quot;,
            &quot;create_doc&quot;
          ]
        }
      ]
    }
  }
}
</code></pre>
<h2>Future plans</h2>
<p>In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt it in our data-sink integrations, such as syslog and Kubernetes. Elastic will provide default routing rules that work out of the box, and it will also be possible for users to add their own rules. If you are using our integrations, follow <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#pipelines-for-fleet-elastic-agent">this guide</a> on how to add a custom ingest pipeline.</p>
<h2>Try it out!</h2>
<p>This blog post has shown some sample use cases for document-based routing. Try it out on your own data by adjusting the commands for index templates and ingest pipelines, and get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> through a 7-day free trial. Let us know via <a href="https://ela.st/reroute-feedback">this feedback form</a> how you’re planning to use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> and whether you have suggestions for improvement.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/observability-digital-transformation-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How Streams in Elastic Observability Simplifies Retention Management]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams</link>
            <guid isPermaLink="false">simplifying-retention-management-with-streams</guid>
            <pubDate>Thu, 30 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams simplifies retention management in Elasticsearch with a unified view to monitor, visualize, and control data lifecycles using DSL or ILM.]]></description>
            <content:encoded><![CDATA[<p>Managing retention in Elasticsearch can get complicated fast. Between <a href="https://www.elastic.co/docs/manage-data/lifecycle/data-stream">Data stream lifecycle (DSL)</a>, <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">Index lifecycle management (ILM)</a>, templates, and individual index settings, keeping policies consistent across data streams often takes more effort than it should.</p>
<p><strong>Streams</strong> changes that. It introduces a clear, unified way to manage how long your data lives, whether you’re using DSL or ILM. From a single view, you can visualize ingestion, understand where data sits across tiers, and adjust retention with confidence, applying updates to a single stream without worrying about unintended changes elsewhere.</p>
<h3>Walkthrough: Exploring the Retention Tab</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/retention_view.png" alt="Retention view of a stream" /></p>
<p>Retention management lives in the <strong>Retention</strong> tab of each stream. This is your control panel for understanding how much data you’re storing, how quickly it’s growing, and how your lifecycle policies are applied. It’s also where you can monitor and configure the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">Failure store</a>, which tracks and retains documents that failed to be ingested.</p>
<h4>Metrics at a glance</h4>
<p>At the top of the view, you’ll find an overview of key metrics:</p>
<ul>
<li>Storage size: the total data volume currently held by the stream.</li>
<li>Ingestion averages: Streams extrapolates both daily and monthly averages from the selected time range to give you a sense of long-term trends.</li>
</ul>
<p>This combination of near-real-time and projected values helps you quickly spot when ingestion is ramping up and whether your retention policy aligns with it.</p>
<h4>Ingestion over time</h4>
<p>Below the metrics, a graph shows ingestion volume over time. This information is approximated based on the number of documents over time, multiplied by the average document size in the backing index. </p>
<h4>Visualizing lifecycle phases</h4>
<p>When an ILM policy is effective, the retention view becomes more visual. Streams displays a phase breakdown (hot, warm, cold, frozen) showing the data volume stored in each phase. This gives you a clear sense of how your data is distributed across the storage tiers and whether your lifecycle is doing what you expect.</p>
<h4>Failure store</h4>
<p>A failure store is a secondary set of indices inside a data stream, dedicated to storing documents that failed to be ingested. Within the Retention tab, you can toggle the Failure store on or off and configure its own retention period. We’ll cover the Failure store and Data quality in more detail in <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">this article</a>.</p>
<h3>Updating Retention</h3>
<p>Beyond visualizing your retention, Streams makes it easy to change how it’s managed.</p>
<h4>Switching between DSL and ILM</h4>
<p>You can freely switch a stream between DSL and ILM management, or update a DSL retention period, with just a few clicks. Streams takes care of updating the lifecycle settings at the data stream level, ensuring consistent retention across all existing backing indices, not just new ones.</p>
<p>Whether you prefer the simplicity of DSL or the fine-grained tiering of ILM, you can move between the two seamlessly.</p>
<p><em>Clicking “Edit data retention” opens a modal that allows you to update the stream’s configuration. From there you can update the ILM policy or set a custom retention period via DSL.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_ilm.png" alt="Modal view to set a lifecycle policy" /></p>
<p><em>You can set a custom period, or pick an Indefinite retention for your data.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_dsl.png" alt="Modal view to set a custom retention period" /></p>
<p><em>You can also update streams’ lifecycle via the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name">Upsert stream</a> or the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name-ingest">Update ingest stream settings</a> Kibana APIs.</em></p>
<h4>Inherit or defer: different strategies for different stream types</h4>
<p><strong>Classic streams</strong></p>
<p>For classic streams, you can default to the existing index template’s retention. Retention isn’t managed by Streams in this case; it follows the lifecycle configuration defined in the template just as it normally would.</p>
<p>This option is useful if you’re onboarding existing data streams and want to keep their lifecycle behavior intact while still benefiting from Streams’ visibility and monitoring features.</p>
<p><strong>Wired streams</strong></p>
<p>Wired streams live in a tree structure, and that hierarchy allows an inheritance model.</p>
<p>A child stream can inherit the lifecycle of its nearest ancestor that has a concrete policy (ILM or DSL). This keeps your configuration lean and consistent since you can set a single lifecycle at a higher level in the tree and let Streams automatically apply it to all relevant descendants.</p>
<p>If that ancestor’s lifecycle is later updated, Streams cascades the change down to all children that inherit it, so everything stays in sync.</p>
<p><em>In the figure below, we set a different retention for</em> <strong><em>logs.prod</em></strong> <em>and</em> <strong><em>logs.staging</em></strong> <em>environments. The child partitions of these environments automatically inherit the configuration.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/streams_tree.png" alt="A streams tree that shows inheritance" /></p>
<h4>How it works under the hood</h4>
<p>When you apply or update a lifecycle, <strong>Streams</strong> calls Elasticsearch’s <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-settings">/_data_stream/_settings</a> endpoint, a new API introduced in 8.19 / 9.1 for this purpose.</p>
<p>This API is key to keeping retention consistent:</p>
<ol>
<li>It applies the lifecycle directly at the data stream level, overriding any configuration from cluster settings or index templates.</li>
<li>It propagates the retention update to all existing backing indices, not just new ones, so retention remains uniform across your historical and future data.</li>
</ol>
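<p>As a rough sketch of what Streams does for you behind the scenes (the data stream and policy names here are hypothetical), applying an ILM policy through this API looks like:</p>
<pre><code class="language-bash">PUT _data_stream/logs-myapp-default/_settings
{
  &quot;index.lifecycle.name&quot;: &quot;my-retention-policy&quot;
}
</code></pre>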
<p>By centralizing lifecycle management at the data stream level and applying a consistent configuration across the backing indices, we remove the ambiguity that used to exist between template-level and index-level configurations. You always know which retention policy is actually in effect, and you can see it directly in the UI.</p>
<h3>Wrapping Up</h3>
<p>With Streams, retention management becomes clear and consistent. You can visualize ingestion, switch between DSL and ILM, or inherit policies across streams, all without diving into templates or manual index settings.</p>
<p>By unifying retention into a single view, Streams turns lifecycle management into something simple, predictable, and transparent.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic’s Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Smarter log analytics in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/smarter-log-analytics-in-elastic-observability</link>
            <guid isPermaLink="false">smarter-log-analytics-in-elastic-observability</guid>
            <pubDate>Mon, 10 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover smarter log handling with Kibana's latest features! The new Data Source Selector lets you easily filter logs by integrations like System Logs and Nginx. Smart Fields enhance log analysis by presenting data more intuitively. Simplify your workflow and uncover deeper insights today!]]></description>
            <content:encoded><![CDATA[<p>Discover a smarter way to handle your logs with Kibana's latest features! Our new Data Source selector makes it effortless to zero in on the logs you need, whether they're from System Logs or Application Logs by selecting your integrations or data views. Plus, with the introduction of Smart Fields, your log analysis is now more intuitive and insightful. Get ready to simplify your workflow and uncover deeper insights with these game-changing updates. Dive in and see how easy log exploration can be!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/smart-fields.png" alt="Smart fields" /></p>
<h2>Find the logs you’re looking for</h2>
<h3>Focus on logs from specific integrations or data views</h3>
<p>We've added the Data Source selector, a handy new feature for viewing specific logs. Now, you can easily filter your logs based on your integrations, like System Logs, Nginx, or Elastic APM, or switch between different data views, like logs or metrics. This new selector is all about making your data easier to find and helping you focus on what matters most in your analysis.</p>
<h2>Dive into your logs</h2>
<h3>Analyze logs with Smart Fields in Kibana</h3>
<p>Logs in Kibana have undergone a significant transformation, particularly in the way log data is presented. The once-basic table view has evolved with the introduction of Smart Fields, providing users with a more insightful and dynamic log analysis experience.</p>
<h4>Resource Smart Field - centralizing log source information</h4>
<p>The resource column further elevates the Logs Explorer page by providing users with a single column for exploring the resource that created the log event. This column groups various resource-indicating fields together, streamlining the investigation process. Currently, the following <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS</a> fields are grouped under this single column and we recommend including them in your logs:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-service.html#field-service-name">service.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-container.html#field-container-name">container.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-orchestrator.html#field-orchestrator-namespace">orchestrator.namespace</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-host.html#field-host-name">host.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-cloud.html#field-cloud-instance-id">cloud.instance.id</a></li>
</ul>
<p>We know this does not cover all use cases, and we would like your feedback on other fields that are important to you, to help us provide a tailored and user-centric log analysis experience.</p>
<h4>Content Smart Field - a deeper dive into log data</h4>
<p>The content column revolutionizes log analysis by seamlessly rendering <strong>log.level</strong> and <strong>message</strong> fields. Notably, it automatically handles fallbacks, ensuring a smooth transition when the actual message field is not available. This enhancement simplifies the log exploration process, offering users a more comprehensive understanding of their data.</p>
<h4>Actions column - unleashing additional columns</h4>
<p>As part of our commitment to empowering users, we are introducing the actions column, adding a layer of functionality to the document table. This column includes two powerful actions:</p>
<ul>
<li><strong>Degraded document indicator</strong>: This indicator provides insights into the quality of your data by showing which fields were ignored when the document was indexed and ended up in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ignored-field.html">_ignored</a> property of the document. To help analyze what caused the document to degrade, we suggest reading this blog: <a href="https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed">The antidote for index mapping exceptions: ignore_malformed</a>.</li>
<li><strong>Stacktrace indicator</strong>: This indicator informs users of the presence of stack traces in the document, making it easy to navigate through log documents and know whether they contain additional information.</li>
</ul>
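<p>The same degraded documents can also be found with a regular search, since the <code>_ignored</code> metadata field is queryable. For example, adjusting the index pattern to your own data:</p>
<pre><code class="language-bash">GET logs-*/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>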
<h3>Investigate individual logs by expanding log details</h3>
<p>Now, when you click the expand icon in the actions column, it opens up the <strong>Log details</strong> flyout for any log entry. This new feature gives you a detailed overview of the entry right at your fingertips. Inside the flyout, the <strong>Overview</strong> tab is neatly organized into four sections—Content breakdown, Service &amp; Infrastructure, Cloud, and Others—each offering a snapshot of the most crucial information. Plus, you'll find the same handy controls you're used to in the main table, like filtering in or out, adding or removing columns, and copying data, making it easier than ever to manage your logs directly from the flyout.</p>
<p>The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant</a> is fully integrated into this view providing contextual insights about the log event and helping to find similar messages.</p>
<h2>Experience a streamlined approach to log exploration</h2>
<p>These enhancements simplify the process of finding and focusing on specific logs and offer a more intuitive and insightful data presentation. Dive into your logs with these new tools, streamline your workflow, and uncover deeper insights with ease. Try it now and transform your log analysis!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/log-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks</link>
            <guid isPermaLink="false">sre-troubleshooting-ai-assistant-observability-runbooks</guid>
            <pubDate>Wed, 08 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Empower your SRE team with this guide to enriching Elastic's AI Assistant Knowledge Base with your organization's internal observability information for enhanced alert remediation and incident management.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Observability AI Assistant</a> helps users explore and analyze observability data using a natural language interface, by leveraging automatic function calling to request, analyze, and visualize your data to transform it into actionable observability. The Assistant can also set up a Knowledge Base, powered by <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elastic Learned Sparse EncodeR</a> (ELSER) to provide additional context and recommendations from private data, alongside the large language models (LLMs) using RAG (Retrieval Augmented Generation). Elastic’s Stack — as a vector database with out-of-the-box semantic search and connectors to LLM integrations and the Observability solution — is the perfect toolkit to extract the maximum value of combining your company's unique observability knowledge with generative AI.</p>
<h2>Enhanced troubleshooting for SREs</h2>
<p>Site reliability engineers (SREs) in large organizations often face challenges in locating the information needed for troubleshooting alerts, monitoring systems, or deriving insights, because resources are scattered and potentially outdated. This issue is particularly significant for less experienced SREs, who may require assistance even when a runbook exists. Recurring incidents pose another problem, as the on-call individual may lack knowledge about previous resolutions and subsequent steps. Mature SRE teams often invest considerable time in system improvements to minimize &quot;fire-fighting,&quot; utilizing extensive automation and documentation to support on-call personnel.</p>
<p>Elastic® addresses these challenges by combining generative AI models with relevant search results from your internal data using RAG. The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant's internal Knowledge Base</a>, powered by our semantic search retrieval model <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, can recall information at any point during a conversation, providing RAG responses based on internal knowledge.</p>
<p>This Knowledge Base can be enriched with your organization's information, such as runbooks, GitHub issues, internal documentation, and Slack messages, allowing the AI Assistant to provide specific assistance. The Assistant can also document and store specific information from an ongoing conversation with an SRE while troubleshooting issues, effectively creating runbooks for future reference. Furthermore, the Assistant can generate summaries of incidents, system status, runbooks, post-mortems, or public announcements.</p>
<p>This ability to retrieve, summarize, and present contextually relevant information is a game-changer for SRE teams, transforming the work from chasing documents and data to an intuitive, contextually sensitive user experience. The Knowledge Base (see <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html#obs-ai-requirements">requirements</a>) serves as a central repository of Observability knowledge, breaking documentation silos and integrating tribal knowledge, making this information accessible to SREs enhanced with the power of LLMs.</p>
<p>Your LLM provider may collect query telemetry when using the AI Assistant. If your data is confidential or has sensitive details, we recommend you verify the data treatment policy of the LLM connector you provided to the AI Assistant.</p>
<p>In this blog post, we will cover different ways to enrich your Knowledge Base (KB) with internal information. We will focus on a specific alert, indicating that there was an increase in logs with “502 Bad Gateway” errors that has surpassed the alert’s threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-1.png" alt="1 - threshold breached" /></p>
<h2>How to troubleshoot an alert with the Knowledge Base</h2>
<p>Before the KB has been enriched with internal information, when the SRE asks the AI Assistant about how to troubleshoot an alert, the response from the LLM will be based on the data it learned during training; however, the LLM is not able to answer questions related to private, recent, or emerging knowledge. In this case, when asking for the steps to troubleshoot the alert, the response will be based on generic information.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-2.png" alt="2 - troubleshooting steps" /></p>
<p>However, once the KB has been enriched with your runbooks, when your team receives a new alert on “502 Bad Gateway” Errors, they can use AI Assistant to access the internal knowledge to troubleshoot it, using semantic search to find the appropriate runbook in the Knowledge Base.</p>
<p>In this blog, we will cover different ways to add internal information on how to troubleshoot an alert to the Knowledge Base:</p>
<ol>
<li>
<p>Ask the assistant to remember the content of an existing runbook.</p>
</li>
<li>
<p>Ask the Assistant to summarize and store in the Knowledge Base the steps taken during a conversation and store it as a runbook.</p>
</li>
<li>
<p>Import your runbooks from GitHub or another external source to the Knowledge Base using our Connector and APIs.</p>
</li>
</ol>
<p>After the runbooks have been added to the KB, the AI Assistant is now able to recall the internal and specific information in the runbooks. By leveraging the retrieved information, the LLM could provide more accurate and relevant recommendations for troubleshooting the alert. This could include suggesting potential causes for the alert, steps to resolve the issue, preventative measures for future incidents, or asking the assistant to help execute the steps mentioned in the runbook using functions. With more accurate and relevant information at hand, the SRE could potentially resolve the alert more quickly, reducing downtime and improving service reliability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-10_at_9.52.38_AM.png" alt="3 - troubleshooting 502 Bad gateway" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-4.png" alt="4 - (5) test the backend directly" /></p>
<p>Your Knowledge Base documents will be stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. Keep in mind that LLMs have a restriction on the amount of information the model can read and write at once, called the token limit. Imagine you're reading a book, but you can only remember a certain number of words at a time. Once you've reached that limit, you start to forget the earlier words you've read. That's similar to how a token limit works in an LLM.</p>
<p>To keep runbooks within the token limit for Retrieval Augmented Generation (RAG) models, ensure the information is concise and relevant. Use bullet points for clarity, avoid repetition, and use links for additional information. Regularly review and update the runbooks to remove outdated or irrelevant information. The goal is to provide clear, concise, and effective troubleshooting information without compromising the quality due to token limit constraints. LLMs are great for summarization, so you could ask the AI Assistant to help you make the runbooks more concise.</p>
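<p>As a quick sanity check before adding a runbook, you can estimate its token count. The snippet below is an illustrative sketch using the common rule of thumb of roughly four characters per token for English text; the <code>estimate_tokens</code> and <code>fits_budget</code> helpers and the 4,000-token budget are assumptions for this example, not Elastic APIs, and the real limit depends on your LLM connector's tokenizer.</p>

```python
# Rough token-count estimate for a runbook before storing it in the
# Knowledge Base. The ~4-characters-per-token ratio is a common rule of
# thumb for English text, not an exact tokenizer; real limits depend on
# the LLM behind your connector.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_budget(text: str, budget: int = 4000) -> bool:
    """Flag runbooks that are likely to exceed a given token budget."""
    return estimate_tokens(text) <= budget

runbook = (
    "1. Check upstream health.\n"
    "2. Inspect nginx error logs.\n"
    "3. Restart the backend if unhealthy."
)
print(estimate_tokens(runbook), fits_budget(runbook))
```

<p>Runbooks that fail the check are good candidates for asking the AI Assistant to summarize them into a more concise version.</p>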
<h2>Ask the assistant to remember the content of an existing runbook</h2>
<p>The easiest way to store a runbook into the Knowledge Base is to just ask the AI Assistant to do it! Open a new conversation and ask “Can you store this runbook in the KB for future reference?” followed by pasting the content of the runbook in plain text.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-5.png" alt="5 - new conversation - let's work on this together" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-6.png" alt="6 - new conversation" /></p>
<p>The AI Assistant will then store it in the Knowledge Base for you automatically, as simple as that.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-7.png" alt="7 - storing a runbook" /></p>
<h2>Ask the Assistant to summarize and store the steps taken during a conversation in the Knowledge Base</h2>
<p>You can also ask the AI Assistant to remember something while having a conversation — for example, after troubleshooting an alert with the AI Assistant, you could ask it to &quot;remember how to troubleshoot this alert for next time.&quot; The AI Assistant will create a summary of the steps taken to troubleshoot the alert and add it to the Knowledge Base, effectively creating runbooks for future reference. Next time you are faced with a similar situation, the AI Assistant will recall this information and use it to assist you.</p>
<p>In the following demo, the user asks the Assistant to remember the steps that have been followed to troubleshoot the root cause of an alert, and also to ping the Slack channel when this happens again. In a later conversation with the Assistant, the user asks what can be done about a similar problem, and the AI Assistant is able to remember the steps and also reminds the user to ping the Slack channel.</p>
<p>After receiving the alert, you can open the AI Assistant chat and troubleshoot it. Once you have investigated the alert, ask the AI Assistant to summarize the analysis and the steps taken to find the root cause, so they can be remembered the next time a similar alert fires, and add extra instructions, such as warning the Slack channel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-8.png" alt="8. -teal box" /></p>
<p>The Assistant will use the built-in functions to summarize the steps and store them into your Knowledge Base, so they can be recalled in future conversations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_11.34.08_AM.png" alt="9 - Elastic assistant chat (CROP)" /></p>
<p>Open a new conversation, and ask what steps to take when troubleshooting an alert similar to the one we just investigated. The Assistant will be able to recall the information stored in the KB that is related to the specific alert, using semantic search based on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, and provide a summary of the steps taken to troubleshoot it, including the last indication of informing the Slack channel.</p>
&lt;Video vidyardUuid=&quot;p14Ss8soJDkW8YoCtKPrQF&quot; loop={true} /&gt;
<h2>Import your runbooks stored in GitHub to the Knowledge Base using APIs or our GitHub Connector</h2>
<p>You can also add proprietary data into the Knowledge Base programmatically by ingesting it (e.g., GitHub Issues, Markdown files, Jira tickets, text files) into Elastic.</p>
<p>If your organization has created runbooks that are stored in Markdown documents in GitHub, follow the steps in the next section of this blog post to index the runbook documents into your Knowledge Base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-10.png" alt="10 - github handling 502" /></p>
<p>The steps to ingest documents into the Knowledge Base are the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-11.png" alt="11 - using internal knowledge" /></p>
<h3>Ingest your organization’s knowledge into Elasticsearch</h3>
<p><strong>Option 1:</strong> <strong>Use the</strong> <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html"><strong>Elastic web crawler</strong></a> <strong>.</strong> Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">Elasticsearch® index</a> is created to hold and sync webpage content.</p>
<p><strong>Option 2: Use Elasticsearch's</strong> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html"><strong>Index API</strong></a> <strong>.</strong> <a href="https://www.elastic.co/guide/en/cloud/current/ec-ingest-guides.html">Watch tutorials</a> that demonstrate how you can use the Elasticsearch language clients to ingest data from an application.</p>
<p><strong>Option 3: Build your own connector.</strong> Follow the steps described in this blog: <a href="https://www.elastic.co/search-labs/how-to-create-customized-connectors-for-elasticsearch">How to create customized connectors for Elasticsearch</a>.</p>
<p><strong>Option 4: Use Elasticsearch</strong> <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-content-sources.html"><strong>Workplace Search connectors</strong></a> <strong>.</strong> For example, the <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html">GitHub connector</a> can automatically capture, sync, and index issues, Markdown files, pull requests, and repos.</p>
<ul>
<li>Follow the steps to <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html#github-configuration">configure the GitHub Connector in GitHub</a> to create an OAuth App from the GitHub platform.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-12.png" alt="12 - elastic workplace search" /></p>
<ul>
<li>Now you can connect a GitHub instance to your organization. Head to your organization’s <strong>Search &gt; Workplace Search</strong> administrative dashboard, and locate the Sources tab.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_10.19.19_AM.png" alt="13 - screenshot" /></p>
<ul>
<li>Select <strong>GitHub</strong> (or GitHub Enterprise) in the Configured Sources list, and follow the GitHub authentication flow as presented. Upon the successful authentication flow, you will be redirected to Workplace Search and will be prompted to select the Organization you would like to synchronize.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-14.png" alt="14 - configure and connect" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-15.png" alt="15 - how to add github" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-16.png" alt="16 - github" /></p>
<ul>
<li>After configuring the connector and selecting the organization, the content should be synchronized and you will be able to see it in Sources. If you don’t need to index all the available content, you can specify the indexing rules via the API. This will help shorten indexing times and limit the size of the index. See <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-customizing-indexing-rules.html">Customizing indexing</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-17.png" alt="17 - source overview" /></p>
<ul>
<li>The source has created an index in Elastic with the content (Issues, Markdown Files…) from your organization. You can find the index name by navigating to <strong>Stack Management &gt; Index Management</strong> , activating the <strong>Include hidden Indices</strong> button on the right, and searching for “GitHub.”</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-18.png" alt="18 - index mgmt" /></p>
<ul>
<li>You can explore the documents you have indexed by creating a Data View and exploring it in Discover. Go to <strong>Stack Management &gt; Kibana &gt; Data Views &gt; Create data view</strong> and introduce the data view Name, Index pattern (make sure you activate “Allow hidden and system indices” in advanced options), and Timestamp field:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-19.png" alt="19 - create data view" /></p>
<ul>
<li>You can now explore the documents in Discover using the data view:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-20.png" alt="20 - data view" /></p>
<h3>Reindex your internal runbooks into the AI Assistant’s Knowledge Base index, using its semantic search pipeline</h3>
<p>Your Knowledge Base documents are stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. To add your internal runbooks imported from GitHub to the KB, you just need to reindex the documents from the index you created in the previous step to the KB’s index. To add the semantic search capabilities to the documents in the KB, the reindex should also use the ELSER pipeline preconfigured for the KB, <em>.kibana-observability-ai-assistant-kb-ingest-pipeline</em>.</p>
<p>By creating a Data View with the KB index, you can explore the content in Discover.</p>
<p>Execute the query below in <strong>Management &gt; Dev Tools</strong>, making sure to replace the following placeholders, both in “_source” and “inline”:</p>
<ul>
<li>InternalDocsIndex : name of the index where your internal docs are stored</li>
<li>text_field : name of the field with the text of your internal docs</li>
<li>timestamp : name of the field of the timestamp in your internal docs</li>
<li>public : (true or false) if true, makes a document available to all users in the defined <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> (if one is defined) or in all spaces (if none is defined); if false, the document will be restricted to the user indicated in user.name</li>
<li>(optional) space : if defined, restricts the internal document to be available in a specific <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a></li>
<li>(optional) user.name : if defined, restricts the internal document to be available for a specific user</li>
<li>(optional) &quot;query&quot; filter to index only certain docs (see below)</li>
</ul>
<pre><code class="language-bash">POST _reindex
{
    &quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ]
    },
    &quot;dest&quot;: {
        &quot;index&quot;: &quot;.kibana-observability-ai-assistant-kb-000001&quot;,
        &quot;pipeline&quot;: &quot;.kibana-observability-ai-assistant-kb-ingest-pipeline&quot;
    },
    &quot;script&quot;: {
        &quot;inline&quot;: &quot;ctx._source.text=ctx._source.remove(\&quot;&lt;text_field&gt;\&quot;);ctx._source.namespace=\&quot;&lt;space&gt;\&quot;;ctx._source.is_correction=false;ctx._source.public=&lt;public&gt;;ctx._source.confidence=\&quot;high\&quot;;ctx._source['@timestamp']=ctx._source.remove(\&quot;&lt;timestamp&gt;\&quot;);ctx._source['user.name'] = \&quot;&lt;user.name&gt;\&quot;&quot;
    }
}
</code></pre>
<p>You may want to restrict which documents you reindex into the KB — for example, you may only want to reindex Markdown documents (like runbooks). You can do this by adding a “query” filter on the documents in the source. In the case of GitHub, runbooks are identified by the “type” field containing the string “file,” which you can add to the reindex query as shown below. To also include GitHub issues, add the string “issues” to the “type” terms in the query:</p>
<pre><code class="language-json">&quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ],
    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;]
      }
    }
</code></pre>
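<p>If you prefer to drive the reindex from a script rather than Dev Tools, the same request body can be assembled programmatically. The helper below is a hypothetical sketch: <code>build_kb_reindex_body</code> and the placeholder names (<code>my-github-index</code>, <code>body_content</code>, <code>last_updated</code>) are illustrative, not Elastic APIs. It mirrors the Dev Tools query above, so verify the field names against your own source index before running it.</p>

```python
# Sketch of building the _reindex request body for the Knowledge Base,
# mirroring the Dev Tools query above. All field and index names passed
# in are placeholders for your own source index.
def build_kb_reindex_body(source_index, text_field, timestamp_field,
                          public=True, space=None, user_name=None,
                          doc_types=("file",)):
    # Painless script that renames/sets the fields the KB expects.
    script = (
        f"ctx._source.text=ctx._source.remove('{text_field}');"
        "ctx._source.is_correction=false;"
        f"ctx._source.public={'true' if public else 'false'};"
        "ctx._source.confidence='high';"
        f"ctx._source['@timestamp']=ctx._source.remove('{timestamp_field}');"
    )
    if space:
        script += f"ctx._source.namespace='{space}';"
    if user_name:
        script += f"ctx._source['user.name']='{user_name}';"
    return {
        "source": {
            "index": source_index,
            "_source": [text_field, timestamp_field, "namespace",
                        "is_correction", "public", "confidence"],
            "query": {"terms": {"type": list(doc_types)}},
        },
        "dest": {
            "index": ".kibana-observability-ai-assistant-kb-000001",
            "pipeline": ".kibana-observability-ai-assistant-kb-ingest-pipeline",
        },
        "script": {"source": script},
    }

body = build_kb_reindex_body("my-github-index", "body_content", "last_updated")
# With an elasticsearch.Elasticsearch client, this could then be sent as:
# es.reindex(**body)
```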
<p>Great! Now that the data is stored in your Knowledge Base, you can ask the Observability AI Assistant any questions about it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-21.png" alt="21 - new conversation" /></p>
&lt;Video vidyardUuid=&quot;zRxsp1EYjmR4FW4yRtSxcr&quot; loop={true} /&gt;
&lt;Video vidyardUuid=&quot;vV5md3mVtY8KxUVjSvtT7V&quot; loop={true} /&gt;
<h2>Conclusion</h2>
<p>In conclusion, leveraging internal Observability knowledge and adding it to the Elastic Knowledge Base can greatly enhance the capabilities of the AI Assistant. By manually inputting information or programmatically ingesting documents, SREs can create a central repository of knowledge accessible through the power of Elastic and LLMs. The AI Assistant can recall this information, assist with incidents, and provide tailored observability to specific contexts using Retrieval Augmented Generation. By following the steps outlined in this article, organizations can unlock the full potential of their Elastic AI Assistant.</p>
<p><a href="https://www.elastic.co/generative-ai/ai-assistant">Start enriching your Knowledge Base with the Elastic AI Assistant today</a> and empower your SRE team with the tools they need to excel. Follow the steps outlined in this article and take your incident management and alert remediation processes to the next level. Your journey toward a more efficient and effective SRE operation begins now.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/11-hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Better RCAs with multi-agent AI Architecture]]></title>
            <link>https://www.elastic.co/observability-labs/blog/super-agent-architecture</link>
            <guid isPermaLink="false">super-agent-architecture</guid>
            <pubDate>Fri, 31 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how specialized LLM agents collaborate to tackle complex tasks with unparalleled efficiency]]></description>
            <content:encoded><![CDATA[<h2>What’s a multi agent architecture?</h2>
<p>You might have heard the term Agent pop up recently in different open source projects or from vendors focusing their go-to-market on GenAI. Indeed, while most GenAI applications today are focused on RAG, there is an increasing interest in isolating tasks that could be achieved with a more specialized model into what is called an Agent.</p>
<p>To be clear, an agent will be given a task, which could be a prompt, and execute the task by leveraging other models, data sources, and a knowledge base. Depending on the field of application, the results should ultimately look like generated text, pictures, charts, or sounds.</p>
<p>Now, a multi-agent architecture is the process of leveraging multiple agents around a given task by:</p>
<ul>
<li>Orchestrating complex system oversight with multiple agents</li>
<li>Analyzing and strategizing in real-time with strategic reasoning</li>
<li>Specializing agents: decomposing tasks into smaller, focused, expert-handled elements</li>
<li>Sharing insights for cohesive action plans, creating collaborative dynamics</li>
</ul>
<p>In a nutshell, multi-agent architecture's superpower is tackling intricate challenges beyond human speed and solving complex problems. It enables a couple of things:</p>
<ul>
<li>Scale the intelligence as the data and complexity grow. The tasks are decomposed into smaller work units, and the expert network grows accordingly.</li>
<li>Coordinate simultaneous actions across systems, scale collaboration</li>
<li>Evolving with data allows continuous adaptation with new data for cutting-edge decision-making.</li>
<li>Scalability, high performance, and resilience</li>
</ul>
<h2>Single Agent Vs Multi-Agent Architecture</h2>
<p>Before double-clicking on the multi-agent architecture, let’s talk about the single-agent architecture. The single-agent architecture is designed for straightforward tasks and a late feedback loop from the end user. There are multiple single-agent frameworks such as ReAct (Reason+Act), RAISE (ReAct+ Short/Long term memory), Reflexion, AutoGPT+P, and LATS (Language Agent Tree Search). The general process these architectures enable is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/single.png" alt="alt_text" /></p>
<p>The Agent takes an action, observes, executes, and decides on its own whether the result looks complete: it ends the process if finished, or resubmits the new results as an input action, and the process keeps going.</p>
<p>While this type of agent handles simple tasks well, such as a RAG application where a user asks a question and the agent returns an answer based on the LLM and a knowledge base, it has a couple of limitations:</p>
<ul>
<li>Endless execution loop: the agent is never satisfied with the output and reiterates.</li>
<li>Hallucinations</li>
<li>Lack of feedback loop or enough data to build a feedback loop</li>
<li>Lack of planning</li>
</ul>
<p>For these reasons, the need for a better self-evaluation loop, an externalized observation phase, and a division of labor is growing, motivating the multi-agent architecture.</p>
<p>Multi-agent architecture relies on taking a complex task, breaking it down into multiple smaller tasks, planning the resolution of these tasks, executing, evaluating, sharing insights, and delivering an outcome. For this, there is more than one agent; in fact, the minimum value for the network size N is N=2 with:</p>
<ul>
<li>A Manager</li>
<li>An Expert</li>
</ul>
<p>When N=2, the source task is simple enough to need only one expert agent, as the task cannot be broken down into multiple subtasks. Now, when the task is more complex, this is what the architecture can look like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/multi-vertical.png" alt="alt_text" /></p>
<p>With the help of an LLM, the Manager decomposes the tasks and delegates the resolutions to multiple agents. The above architecture is called vertical, since the agents directly send their results to the Manager. In a horizontal architecture, agents work and share insights together as groups, completing tasks through a volunteer-based system; they do not need a leader, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/multi-horizontal.png" alt="alt_text" /></p>
<p>A very good paper covering these two architectures with more insights can be found here: <a href="https://arxiv.org/abs/2404.11584">https://arxiv.org/abs/2404.11584</a></p>
<h2>Applying Vertical Multi-Agent Architecture to Observability</h2>
<p>Vertical Multi-Agent Architecture can have a manager, experts, and a communicator. This is particularly important when these architectures expose the task's result to an end user.</p>
<p>In the case of Observability, what we envision in this blog post is the scenario of an SRE running through a Root Cause Analysis (RCA) process. The high-level logic will look like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/maar-observability.png" alt="alt_text" /></p>
<ul>
<li>Communicator:
<ul>
<li>Read the initial command from the Human</li>
<li>Pass command to Manager</li>
<li>Provide status updates to Human</li>
<li>Provide a recommended resolution plan to the Human</li>
<li>Relay follow-up commands from Human to Manager</li>
</ul>
</li>
<li>Manager:
<ul>
<li>Read the initial command from the Communicator </li>
<li>Create working group </li>
<li>Assign Experts to group </li>
<li>Evaluate signals and recommendations from Experts </li>
<li>Generate recommended resolution plan </li>
<li>Execute plan (optional)</li>
</ul>
</li>
<li>Expert:
<ul>
<li>Each expert is tasked with singular expertise tied to an Elastic integration </li>
<li>Use o11y AI Assistant to triage and troubleshoot data related to their expertise </li>
<li>Work with other Experts as needed to correlate issues </li>
<li>Provide recommended root cause analysis for their expertise (if applicable) </li>
<li>Provide recommended resolution plan for their expertise (if applicable)</li>
</ul>
</li>
</ul>
<p>We believe that breaking down the experts by integration provides enough granularity in the case of observability and allows them to focus on a specific data source. Doing this also gives the manager a breakdown key when receiving a complex incident involving multiple data layers (application, network, datastores, infrastructures).</p>
<p>For example, a complex task initiated by an alert in an e-commerce application could be “Revenue dropped by 30% in the last hour.” This task would be submitted to the manager, who will look at all services, applications, datastores, network components, and infrastructure involved and decompose these into investigation tasks. Each expert would investigate within their specific scope and provide observations to the manager. The manager will be responsible for correlating and providing observations on what caused the problem.</p>
<h3>Core Architecture</h3>
<p>In the above example, we have decided to deploy the architecture on the below software architecture:</p>
<ul>
<li>The agent manager and expert agent are deployed on GCP or your favorite cloud provider</li>
<li>Most of the components are written in Python</li>
<li>A task management layer is necessary to queue the task to the expert</li>
<li>Expert agents are specifically deployed by integration/data source and converse with the Elastic AI Assistant deployed in Kibana.</li>
<li>The AI Assistant can access a real-time context to help the expert resolve their task.</li>
<li>Elasticsearch is used as the AI Assistant context and as the expert memory to build its experience.</li>
<li>The backend LLM here is GPT-4, now GPT-4o, running on Azure.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/core-architecture.png" alt="alt_text" /></p>
<h3>Agent Experience</h3>
<p>Agent experience is built from previous events stored in Elasticsearch, which the expert can search semantically for similar events. When the expert finds one, it retrieves the execution path stored in memory and executes it.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/agent-experience.png" alt="alt_text" /></p>
<p>The beauty of using the Elasticsearch Vector Database for this is the semantic query the agent can execute against the memory, and how the memory itself can be managed. Indeed, there is a notion of short- and long-term memory that could be very interesting in the case of observability: some events happen often and are probably worth storing in short-term memory because they are queried more often, while less queried but important events can be stored in longer-term memory on more cost-effective hardware.</p>
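<p>The recall flow can be illustrated with a small, self-contained sketch. This is a conceptual toy, not the actual implementation: the bag-of-words "embedding" and cosine similarity stand in for Elasticsearch's vector search and ELSER, and the <code>AgentMemory</code> class and its similarity threshold are assumptions made for the example.</p>

```python
# Toy sketch of agent "experience": past events are stored with an
# embedding, and a new event recalls the most similar past execution
# path. Elasticsearch vector search plays this role in the real
# architecture; the bag-of-words embedding here is purely illustrative.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class AgentMemory:
    def __init__(self):
        self.events = []  # (embedding, description, execution_path)

    def remember(self, description, execution_path):
        self.events.append((embed(description), description, execution_path))

    def recall(self, description, threshold=0.5):
        query = embed(description)
        scored = [(cosine(query, emb), path) for emb, _, path in self.events]
        best_score, best_path = max(scored, default=(0.0, None))
        return best_path if best_score >= threshold else None

memory = AgentMemory()
memory.remember("502 Bad Gateway spike on checkout service",
                ["check upstream health", "inspect nginx logs", "restart backend"])
path = memory.recall("502 Bad Gateway errors on checkout")
```

<p>A tiered version of this would simply keep two such stores, promoting frequently recalled events to the short-term tier.</p>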
<p>The other aspect of the Agent Experience is the semantic <a href="https://www.elastic.co/search-labs/blog/semantic-reranking-with-retrievers">reranking</a> feature with Elasticsearch. When the agent executes a task, reranking is used to surface the best outcome compared to past experience:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/agent-experience-build.png" alt="alt_text" /></p>
<p>If you are looking for a working example of the above, <a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-escapes-kibana">check this blog post</a> where two agents work together with the Elastic Observability AI Assistant on a root cause analysis (RCA):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/ops-burger.png" alt="alt_text" /></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/githubcopilot-aiassistant.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Supercharge Your vSphere Monitoring with Enhanced vSphere Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration</link>
            <guid isPermaLink="false">supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration</guid>
            <pubDate>Wed, 11 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Supercharge Your vSphere Monitoring with Enhanced vSphere Integration]]></description>
            <content:encoded><![CDATA[<p><a href="https://www.vmware.com/products/cloud-infrastructure/vsphere">vSphere</a> is VMware's cloud computing virtualization platform that provides a powerful suite for managing virtualized resources. It allows organizations to create, manage, and optimize virtual environments, providing advanced capabilities such as high availability, load balancing, and simplified resource allocation. vSphere enables efficient utilization of hardware resources, reducing costs while increasing the flexibility and scalability of IT infrastructure.</p>
<p>With the release of an upgraded <a href="https://www.elastic.co/docs/current/integrations/vsphere">vSphere integration</a>, we now support an enhanced set of metrics and datastreams. Package version 1.15.0 onwards introduces new datastreams that significantly improve the collection of performance metrics, providing deeper insights into your vSphere environment.</p>
<p>We have expanded the performance metrics to encompass a broader range of insights across all datastreams, while also introducing new datastreams for clusters, resource pools, and networks. This enhanced version includes a total of seven datastreams, featuring critical new metrics such as disk performance, memory utilization, and network status. These datastreams also offer detailed visibility into associated resources like hosts, clusters, and resource pools.</p>
<p>Each datastream also includes detailed alarm information, such as the alarm name, description, status (e.g., critical or warning), and the affected entity's name. To make the most of these insights, we’ve also introduced prebuilt dashboards, helping teams monitor and troubleshoot their vSphere environments with ease and precision.</p>
<h2>Overview of the Datastreams</h2>
<ul>
<li><strong>Host Datastream:</strong> This datastream monitors the disk performance of the host, including metrics such as disk latency, average read/write bytes, uptime, and status. It also captures network metrics, such as packet information, network bandwidth, and utilization, as well as CPU and memory usage of the host. Additionally, it lists associated datastores, virtual machines, and networks within vSphere.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/hosts.png" alt="Host Datastream" /></p>
<ul>
<li><strong>Virtual Machine Datastream:</strong> This datastream tracks the used and available CPU and memory resources of virtual machines, along with the uptime and status of each VM. It includes information about the host on which the VM is running, as well as detailed snapshot metrics like the number of snapshots, creation dates, and descriptions. Additionally, it provides insights into associated hosts and datastores.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/virtualmachine.png" alt="Virtual Machine Datastream" /></p>
<ul>
<li>
<p><strong>Datastore Datastream:</strong> This datastream provides information on the total, used, and available capacity of datastores, along with their overall status. It also captures metrics such as the average read/write rate and lists the hosts and virtual machines connected to each datastore.</p>
</li>
<li>
<p><strong>Datastore Cluster:</strong> A datastore cluster in vSphere is a collection of datastores grouped together for efficient storage management. This datastream provides details on the total capacity and free space in the storage pod, along with the list of datastores within the cluster.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/datastore.png" alt="Datastore Datastream" /></p>
<ul>
<li>
<p><strong>Resource Pool:</strong> Resource pools in vSphere serve as logical abstractions that allow flexible allocation of CPU and memory resources. This datastream captures memory metrics, including swapped, ballooned, and shared memory, as well as CPU metrics like distributed and static CPU entitlement. It also lists the virtual machines associated with each resource pool.</p>
</li>
<li>
<p><strong>Network Datastream:</strong> This datastream captures the overall configuration and status of the network, including network types (e.g., vSS, vDS). It also lists the hosts and virtual machines connected to each network.</p>
</li>
<li>
<p><strong>Cluster Datastream:</strong> A Cluster in vSphere is a collection of ESXi hosts and their associated virtual machines that function as a unified resource pool. Clustering in vSphere allows administrators to manage multiple hosts and resources centrally, providing high availability, load balancing, and scalability to the virtual environment. This datastream includes metrics indicating whether HA or admission control is enabled and lists the hosts, networks, and datastores associated with the cluster.</p>
</li>
</ul>
<h2>Alarms support in vSphere Integration</h2>
<p>Alarms are a vital part of the vSphere integration, providing real-time insights into critical events across your virtual environment. In Elastic’s updated vSphere integration, alarms are now reported for all entities. They include detailed information such as the alarm name, description, severity (e.g., critical or warning), affected entity, and triggered time. These alarms are seamlessly integrated into the datastreams, helping administrators and SREs quickly identify and resolve issues like resource shortages or performance bottlenecks.</p>
<h4>Example Alarm</h4>
<pre><code class="language-yaml">&quot;triggered_alarms&quot;: [
  {
    &quot;description&quot;: &quot;Default alarm to monitor host memory usage&quot;,
    &quot;entity_name&quot;: &quot;host_us&quot;,
    &quot;id&quot;: &quot;alarm-4.host-12&quot;,
    &quot;name&quot;: &quot;Host memory usage&quot;,
    &quot;status&quot;: &quot;red&quot;,
    &quot;triggered_time&quot;: &quot;2024-08-28T10:31:26.621Z&quot;
  }
]
</code></pre>
<p>This example highlights a triggered alarm for monitoring host memory usage, indicating a critical status (red) for the host &quot;host_us.&quot; Such alarms empower teams to act swiftly and maintain the stability of their vSphere environment.</p>
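<p>Because the alarms arrive as structured fields, downstream tooling can filter on them directly. A minimal Python sketch using the field names from the example document (the second alarm is invented purely for illustration):</p>

```python
# Filter the triggered_alarms field for critical (red) alarms.
# The first entry mirrors the example document above; the second is made up.
triggered_alarms = [
    {"name": "Host memory usage", "entity_name": "host_us",
     "status": "red", "triggered_time": "2024-08-28T10:31:26.621Z"},
    {"name": "Host CPU usage", "entity_name": "host_eu",
     "status": "yellow", "triggered_time": "2024-08-28T10:35:00.000Z"},
]

critical = [a for a in triggered_alarms if a["status"] == "red"]
for alarm in critical:
    print(f"{alarm['entity_name']}: {alarm['name']} ({alarm['triggered_time']})")
# host_us: Host memory usage (2024-08-28T10:31:26.621Z)
```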
<h2>Let's Try It Out!</h2>
<p>The new <a href="https://www.elastic.co/docs/current/integrations/vsphere">vSphere integration</a> in Elastic Cloud is more than just a monitoring tool; it’s a comprehensive solution that empowers you to manage and optimize your virtual environments effectively. With deeper insights and enhanced data granularity, you can ensure high availability, improved load balancing, and smarter resource allocation. Spin up an Elastic Cloud deployment and start monitoring your vSphere infrastructure.</p>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/title.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Tailoring span names and enriching spans without changing code with OpenTelemetry - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry</link>
            <guid isPermaLink="false">tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry</guid>
            <pubDate>Mon, 26 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[The OpenTelemetry Collector offers powerful capabilities to enrich and refine telemetry data before it reaches your observability tools. In this blog post, we'll explore how to leverage the Collector to create more meaningful transaction names in Elastic Observability, significantly enhancing the value of your monitoring data.]]></description>
            <content:encoded><![CDATA[<p>The OpenTelemetry Collector offers powerful capabilities to enrich and refine telemetry data before it reaches your observability tools. In this blog post, we'll explore how to leverage the Collector to create more meaningful transaction names in Elastic Observability, significantly enhancing the value of your monitoring data.</p>
<p>Consider this scenario: You have a transaction labeled simply as &quot;HTTP GET&quot; with an average response time of 5ms. However, this generic label masks a variety of distinct operations – payment processing, user logins, and adding items to a cart. Does that 5ms average truly represent the performance of these diverse actions? Clearly not.</p>
<p>The other problem is that span traces become mixed together: login spans and image-serving spans all land in the same bucket, which makes analyses like latency correlation difficult in Elastic.</p>
<p>We'll focus on a specific technique using the collector's attributes and transform processors to extract meaningful information from HTTP URLs and use it to create more descriptive span names. This approach not only improves the accuracy of your metrics but also enhances your ability to quickly identify and troubleshoot performance issues across your microservices architecture.</p>
<p>By using these processors in combination, we can quickly address the issue of overly generic transaction names, creating more granular and informative identifiers that provide accurate visibility into your services' performance.</p>
<p>However, it's crucial to approach this technique with caution. While more detailed transaction names can significantly improve observability, they can also lead to an unexpected challenge: cardinality explosion. As we dive into the implementation details, we'll also discuss how to strike the right balance between granularity and manageability, ensuring that our solution enhances rather than overwhelms our observability stack.</p>
<p>In the following sections, we'll walk through the configuration step-by-step, explaining how each processor contributes to our goal, and highlighting best practices to avoid potential pitfalls like cardinality issues. Whether you're new to OpenTelemetry or looking to optimize your existing setup, this guide will help you unlock more meaningful insights from your telemetry data.</p>
<h2>Prerequisites and configuration</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on Elastic Cloud and a deployed stack (see instructions <a href="https://www.elastic.co/cloud/">here</a>).</li>
<li>I am also using the OpenTelemetry demo in my environment; this is important to follow along with, as the demo exhibits the specific issue I want to address. You should clone the repository and follow the instructions <a href="https://github.com/elastic/opentelemetry-demo">here</a> to get it up and running. I recommend using Kubernetes; I will be doing this in my AWS EKS (Elastic Kubernetes Service) environment.</li>
</ul>
<h3>The OpenTelemetry Demo</h3>
<p>The OpenTelemetry Demo is a comprehensive, microservices-based application designed to showcase the capabilities and best practices of OpenTelemetry instrumentation. It simulates an e-commerce platform, incorporating various services such as frontend, cart, checkout, and payment processing. This demo serves as an excellent learning tool and reference implementation for developers and organizations looking to adopt OpenTelemetry.</p>
<p>The demo application generates traces, metrics, and logs across its interconnected services, demonstrating how OpenTelemetry can provide deep visibility into complex, distributed systems. It's particularly useful for experimenting with different collection, processing, and visualization techniques, making it an ideal playground for exploring observability concepts and tools like the OpenTelemetry Collector.</p>
<p>By using real-world scenarios and common architectural patterns, the OpenTelemetry Demo helps users understand how to effectively implement observability in their own applications and how to leverage the data for performance optimization and troubleshooting.</p>
<p>Once you have an Elastic Cloud instance and you fire up the OpenTelemetry demo, you should see something like this on the Elastic Service Map page:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image3.png" alt="" /></p>
<p>Navigating to the traces page will give you the following setup.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image1.png" alt="" /></p>
<p>As you can see, there are some very broad transaction names here, like HTTP GET, and the averages will not be very accurate for specific business functions within your services, as shown.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image6.png" alt="" /></p>
<p>So let's fix that with the OpenTelemetry Collector.</p>
<h2>The OpenTelemetry Collector</h2>
<p>The OpenTelemetry Collector is a vital component in the OpenTelemetry ecosystem, serving as a vendor-agnostic way to receive, process, and export telemetry data. It acts as a centralized observability pipeline that can collect traces, metrics, and logs from various sources, then transform and route this data to multiple backend systems.</p>
<p>The collector's flexible architecture allows for easy configuration and extension through a wide range of receivers, processors, and exporters which you can explore over <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">here</a>. I have personally found navigating the 'contrib' archive incredibly useful for finding techniques that I didn't know existed. This makes the OpenTelemetry Collector an invaluable tool for organizations looking to standardize their observability data pipeline, reduce overhead, and seamlessly integrate with different monitoring and analysis platforms.</p>
<p>Let's go back to our problem: how do we change the transaction names that Elastic is using to something more useful, so that our HTTP GET translates to something like payment-service/login? The first thing we do is take the full HTTP URL and consider which parts of it relate to our transaction. Looking at the span details, we see a URL:</p>
<pre><code>my-otel-demo-frontendproxy:8080/api/recommendations?productIds=&amp;sessionId=45a9f3a4-39d8-47ed-bf16-01e6e81c80bc&amp;currencyCode=
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image4.png" alt="" /></p>
<p>Now obviously we wouldn't want to create transaction names that map to every single session ID; that would lead to the cardinality explosion we talked about earlier. However, something like the first two parts of the URL, 'api/recommendations', looks like exactly the kind of thing we need.</p>
<h3>The attributes processor</h3>
<p>The OpenTelemetry Collector gives us a useful tool <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/attributesprocessor">here</a>: the attributes processor can help us extract parts of the URL to use later in our observability pipeline. Doing this is very simple; we build a regex like the one below. I should mention that I did not write this regex myself but had an LLM generate it for me; never fear regex again!</p>
<pre><code class="language-yaml">attributes:
  actions:
    - key: http.url
      action: extract
      pattern: '^(?P&lt;short_url&gt;https?://[^/]+(?:/[^/]+)*)(?:/(?P&lt;url_truncated_path&gt;[^/?]+/[^/?]+))(?:\?|/?$)'
</code></pre>
<p>This configuration is doing some heavy lifting for us, so let's break it down:</p>
<ul>
<li>We're using the attributes processor, which is perfect for manipulating span attributes.</li>
<li>We're targeting the http.url attribute of incoming spans.</li>
<li>The extract action tells the processor to pull out specific parts of the URL using our regex pattern.</li>
</ul>
<p>Now, about that regex - it's designed to extract two key pieces of information:</p>
<ol>
<li><code>short_url</code>: This captures the protocol, domain, and optionally the first path segment. For example, in &quot;<a href="https://example.com/api/users/profile">https://example.com/api/users/profile</a>&quot;, it would grab &quot;<a href="https://example.com/api">https://example.com/api</a>&quot;.</li>
<li><code>url_truncated_path</code>: This snags the next two path segments (if they exist). In our example, it would extract &quot;users/profile&quot;.</li>
</ol>
<p>Why is this useful? Well, it allows us to create more specific transaction names based on the URL structure, without including overly specific details that could lead to cardinality explosion. For instance, we avoid capturing unique IDs or query parameters that would create a new transaction name for every single request.</p>
<p>So, if we have a URL like &quot;<a href="https://example.com/api/users/profile?id=123">https://example.com/api/users/profile?id=123</a>&quot;, our extracted <code>url_truncated_path</code> would be &quot;users/profile&quot;. This gives us a nice balance - it's more specific than just &quot;HTTP GET&quot;, but not so specific that we end up with thousands of unique transaction names.</p>
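<p>You can sanity-check the pattern quickly in Python before deploying it (the collector uses Go's regexp engine, which should yield the same groups for URLs like these):</p>

```python
import re

# The extraction pattern from the attributes processor config above.
pattern = re.compile(
    r"^(?P<short_url>https?://[^/]+(?:/[^/]+)*)"
    r"(?:/(?P<url_truncated_path>[^/?]+/[^/?]+))(?:\?|/?$)"
)

m = pattern.match("https://example.com/api/users/profile?id=123")
print(m.group("short_url"))           # https://example.com/api
print(m.group("url_truncated_path"))  # users/profile
```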
<p>Now it's worth mentioning here that if you don't have an attribute you want to use for naming your transactions, it is worth looking at the options for your SDK or agent. As an example, the Java automatic instrumentation OTel agent has the <a href="https://opentelemetry.io/docs/zero-code/java/agent/instrumentation/http/#capturing-http-request-and-response-headers">following options</a> for capturing request and response headers. You can then use this data to name your transactions if the URL is insufficient!</p>
<p>In the next steps, we'll see how to use this extracted information to create more meaningful span names, providing better granularity in our observability data without overwhelming our system. Remember, the goal is to enhance our visibility, not to drown in a sea of overly specific metrics!</p>
<h3>The transform processor</h3>
<p>Now that we've extracted the relevant parts of our URLs, it's time to put that information to good use. Enter the transform processor - our next powerful tool in the OpenTelemetry Collector pipeline.</p>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a> allows us to modify various aspects of our telemetry data, including span names. Here's the configuration we'll use:</p>
<pre><code class="language-yaml">transform:
  trace_statements:
    - context: span
      statements:
        - set(name, attributes[&quot;url_truncated_path&quot;])
</code></pre>
<p>Let's break this down:</p>
<ul>
<li>We're using the transform processor, which gives us fine-grained control over our spans.</li>
<li>We're focusing on <code>trace_statements</code>, as we want to modify our trace spans.</li>
<li>The <code>context: span</code> tells the processor to apply these changes to each individual span.</li>
<li>Our statement is where the magic happens: we're setting the span's name to the value of the <code>url_truncated_path</code> attribute we extracted earlier.</li>
</ul>
<p>What does this mean in practice? Remember our previous example URL &quot;<a href="https://example.com/api/users/profile?id=123">https://example.com/api/users/profile?id=123</a>&quot;? Instead of a generic span name like &quot;HTTP GET&quot;, we'll now have a much more informative name: &quot;users/profile&quot;.</p>
<p>This transformation brings several benefits:</p>
<ol>
<li>Improved Readability: At a glance, you can now see what part of your application is being accessed.</li>
<li>Better Aggregation: You can easily group and analyze similar requests, like all operations on user profiles.</li>
<li>Balanced Cardinality: We're specific enough to be useful, but not so specific that we create a new span name for every unique URL.</li>
</ol>
<p>By combining the attribute extraction we did earlier with this transformation, we've created a powerful system for generating meaningful span names. This approach gives us deep insight into our application's behavior without the risk of cardinality explosion.</p>
<h2>Putting it All Together</h2>
<p>The resulting config for the OpenTelemetry Collector is below. Remember, this goes into <code>opentelemetry-demo/kubernetes/elastic-helm/configmap-deployment.yaml</code> and is applied with <code>kubectl apply -f configmap-deployment.yaml</code>.</p>
<pre><code class="language-yaml">---
apiVersion: v1
kind: ConfigMap
metadata:
  name: elastic-otelcol-agent
  namespace: default
  labels:
    app.kubernetes.io/name: otelcol

data:
  relay: |
    connectors:
      spanmetrics: {}
    exporters:
      debug: {}
      otlp/elastic:
        endpoint: ${env:ELASTIC_APM_ENDPOINT}
        compression: none
        headers:
          Authorization: Bearer ${ELASTIC_APM_SECRET_TOKEN}
    extensions:
    processors:
      batch: {}
      resource:
        attributes:
          - key: deployment.environment
            value: &quot;opentelemetry-demo&quot;
            action: upsert
      attributes:
        actions:
          - key: http.url
            action: extract
            pattern: '^(?P&lt;short_url&gt;https?://[^/]+(?:/[^/]+)*)(?:/(?P&lt;url_truncated_path&gt;[^/?]+/[^/?]+))(?:\?|/?$)'
      transform:
        trace_statements:
          - context: span
            statements:
              - set(name, attributes[&quot;url_truncated_path&quot;])
    receivers:
      httpcheck/frontendproxy:
        targets:
        - endpoint: http://example-frontendproxy:8080
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            cors:
              allowed_origins:
              - http://*
              - https://*
            endpoint: ${env:MY_POD_IP}:4318
    service:
      extensions:
      pipelines:
        logs:
          exporters:
          - debug
          - otlp/elastic
          processors:
          - batch
          - resource
          - attributes
          - transform
          receivers:
          - otlp
        metrics:
          exporters:
          - otlp/elastic
          - debug
          processors:
          - batch
          - resource
          receivers:
          - httpcheck/frontendproxy
          - otlp
          - spanmetrics
        traces:
          exporters:
          - otlp/elastic
          - debug
          - spanmetrics
          processors:
          - batch
          - resource
          - attributes
          - transform
          receivers:
          - otlp
      telemetry:
        metrics:
          address: ${env:MY_POD_IP}:8888
</code></pre>
<p>You'll notice that we tie everything together by adding our enrichment and transformations to the traces section in pipelines at the bottom of the collector config. This is the definition of our observability pipeline, bringing together all the pieces we've discussed to create more meaningful and actionable telemetry data.</p>
<p>By implementing this configuration, you're taking a significant step towards more insightful observability. You're not just collecting data; you're refining it to provide clear, actionable insights into your application's performance. Check out the final result below!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image2.png" alt="" /></p>
<h2>Ready to Take Your Observability to the Next Level?</h2>
<p>Implementing OpenTelemetry with Elastic Observability opens up a world of possibilities for understanding and optimizing your applications. But this is just the beginning! To further enhance your observability journey, check out these valuable resources:</p>
<ol>
<li><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/tag/opentelemetry">Explore More OpenTelemetry Content</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Using the OTel Operator for Injecting Java Agents</a></li>
<li><a href="https://www.elastic.co/what-is/opentelemetry">What is OpenTelemetry?</a></li>
</ol>
<p>We encourage you to dive deeper, experiment with these configurations, and see how they can transform your observability data. Remember, the key is to find the right balance between detail and manageability.</p>
<p>Have you implemented similar strategies in your observability pipeline? We'd love to hear about your experiences and insights. Share your thoughts in the comments below or reach out to us on our community forums.</p>
<p>Stay tuned for Part 2 of this series, where we will look at an advanced technique for getting even more granular: collecting span names, baggage, and data for metrics using a Java plugin, all without code changes.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/tailor.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[The next evolution of observability: unifying data with OpenTelemetry and generative AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</link>
            <guid isPermaLink="false">the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</guid>
            <pubDate>Wed, 11 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Generative AI and machine learning are revolutionizing observability, but siloed data hinders their true potential. This article explores how to break down data silos by unifying logs, metrics, and traces with OpenTelemetry, unlocking the full power of GenAI for natural language investigations, automated root cause analysis, and proactive issue resolution.]]></description>
            <content:encoded><![CDATA[<p>The Observability industry today stands at a critical juncture. While our applications generate more telemetry data than ever before, this wealth of information typically exists in siloed tools, separate systems for logs, metrics, and traces. Meanwhile, Generative AI is hurtling toward us like an asteroid about to make a tremendous impact on our industry.</p>
<p>As SREs, we've grown accustomed to jumping between dashboards, log aggregators, and trace visualizers when troubleshooting issues. But what if there was a better way? What if AI could analyze all your observability data holistically, answering complex questions in natural language, and identifying root causes automatically?</p>
<p>This is the next evolution of observability. But to harness this power, we need to rethink how we collect, store, and analyze our telemetry data.</p>
<h2>The problem: siloed data limits AI effectiveness</h2>
<p>Traditional observability setups separate data into distinct types:</p>
<ul>
<li>Metrics: Numeric measurements over time (CPU, memory, request rates)</li>
<li>Logs: Detailed event records with timestamps and context</li>
<li>Traces: Request journeys through distributed systems</li>
<li>Profiles: Code-level execution patterns showing resource consumption and performance bottlenecks at the function/line level</li>
</ul>
<p>This separation made sense historically due to the way the industry evolved. Different data types have traditionally had different cardinality, structure, access patterns and volume characteristics. However, this approach creates significant challenges for AI-powered analysis:</p>
<pre><code class="language-text">Metrics (Prometheus) → &quot;CPU spiked at 09:17:00&quot;
Logs (ELK) → &quot;Exception in checkout service at 09:17:32&quot; 
Traces (Jaeger) → &quot;Slow DB queries in order-service at 09:17:28&quot;
Profiles (Pyroscope) → &quot;calculate_discount() is taking 75% of CPU time&quot;
</code></pre>
<p>When these data sources live in separate systems, AI tools must either:</p>
<ol>
<li>Work with an incomplete picture (seeing only metrics but not the related logs)</li>
<li>Rely on complex, brittle integrations that often introduce timing skew</li>
<li>Force developers to manually correlate information across tools</li>
</ol>
<p>Imagine asking an AI, &quot;Why did checkout latency spike at 09:17?&quot; To answer comprehensively, it needs access to logs (to see the stack trace), traces (to understand the service path), and metrics (to identify resource strain). With siloed tools, the AI either sees only fragments of the story or requires complex ETL jobs that are slower than the incident itself.</p>
<h2>Why traditional machine learning (ML) falls short</h2>
<p>Traditional machine learning for observability typically focuses on anomaly detection within a single data dimension. It can tell you when metrics deviate from normal patterns, but struggles to provide context or root cause.</p>
<p>ML models trained on metrics alone might flag a latency spike, but can't connect it to a recent deployment (found in logs) or identify that it only affects requests to a specific database endpoint (found in traces). They behave like humans with extreme tunnel vision: they see only a fraction of the relevant information, and only the slice a specific vendor has chosen to give them an opinionated view into.</p>
<p>This limitation becomes particularly problematic in modern microservice architectures where problems frequently cascade across services. Without a unified view, traditional ML can detect symptoms but struggles to identify the underlying cause.</p>
<h2>The solution: unified data with enriched logs</h2>
<p>The solution is conceptually simple but transformative: unify metrics, logs, and traces into a single data store, ideally as enriched logs that carry every signal about a request in a single JSON document. Across the industry, we're about to see this merging of signals.</p>
<p>Think of traditional logs as simple text lines:</p>
<pre><code class="language-text">[2025-05-19 09:17:32] ERROR OrderService - Failed to process checkout for user 12345
</code></pre>
<p>Now imagine an enriched log that contains not just the error message, but also:</p>
<ul>
<li>The complete distributed trace context</li>
<li>Related metrics at that moment</li>
<li>System environment details</li>
<li>Business context (user ID, cart value, etc.)</li>
</ul>
<p>This approach creates a holistic view where every signal about the same event sits side-by-side, perfect for AI analysis.</p>
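<p>As a concrete sketch, such an enriched log could be stored as a single JSON document along these lines (the field names are illustrative, loosely following OpenTelemetry and ECS conventions):</p>
<pre><code class="language-json">{
  "@timestamp": "2025-05-19T09:17:32.412Z",
  "log.level": "ERROR",
  "message": "Failed to process checkout for user 12345",
  "service.name": "order-service",
  "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span.id": "00f067aa0ba902b7",
  "host.name": "checkout-pod-7f9c",
  "system.cpu.load": 3.7,
  "user.id": "12345",
  "cart.value": 149.99
}
</code></pre>
<p>Because the trace context, a metric sample, the environment, and the business context all live in one document, a single query (or a single prompt) can reach every signal at once.</p>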
<h2>How generative AI changes things</h2>
<p>Generative AI differs fundamentally from traditional ML in its ability to:</p>
<ol>
<li>Process unstructured data: Understanding free-form log messages and error text</li>
<li>Maintain context: Connecting related events across time and services</li>
<li>Answer natural language queries: Translating human questions into complex data analysis</li>
<li>Generate explanations: Providing reasoning alongside conclusions</li>
<li>Surface hidden patterns: Discovering correlations and anomalies in log data that would be impractical to find through manual analysis or traditional querying</li>
</ol>
<p>With access to unified observability data, GenAI can analyze complete system behavior patterns and correlate across previously disconnected signals.</p>
<p>For example, when asked &quot;Why is our checkout service slow?&quot; a GenAI model with access to unified data can:</p>
<ul>
<li>Analyze unified enriched logs to identify which specific operations are slow and to find errors or warnings in those components</li>
<li>Check attached metrics to understand resource utilization</li>
<li>Correlate all these signals with deployment events or configuration changes</li>
<li>Present a coherent explanation in natural language with supporting graphs and visualizations</li>
</ul>
<h2>Implementing unified observability with OpenTelemetry</h2>
<p>OpenTelemetry provides the perfect foundation for unified observability with its consistent schema across metrics, logs, and traces. Here's how to implement enriched logs in a Java application:</p>
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OrderProcessor {
    private static final Logger logger = LoggerFactory.getLogger(OrderProcessor.class);
    private final Tracer tracer;
    private final DoubleHistogram cpuUsageHistogram;
    private final OperatingSystemMXBean osBean;

    public OrderProcessor(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer(&quot;order-processor&quot;);
        Meter meter = openTelemetry.getMeter(&quot;order-processor&quot;);
        this.cpuUsageHistogram = meter.histogramBuilder(&quot;system.cpu.load&quot;)
                                      .setDescription(&quot;System CPU load&quot;)
                                      .setUnit(&quot;1&quot;)
                                      .build();
        this.osBean = ManagementFactory.getOperatingSystemMXBean();
    }

    public void processOrder(String orderId, double amount, String userId) {
        Span span = tracer.spanBuilder(&quot;processOrder&quot;).startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Add attributes to the span
            span.setAttribute(&quot;order.id&quot;, orderId);
            span.setAttribute(&quot;order.amount&quot;, amount);
            span.setAttribute(&quot;user.id&quot;, userId);
            // Populate MDC for structured logging
            MDC.put(&quot;trace_id&quot;, span.getSpanContext().getTraceId());
            MDC.put(&quot;span_id&quot;, span.getSpanContext().getSpanId());
            MDC.put(&quot;order_id&quot;, orderId);
            MDC.put(&quot;order_amount&quot;, String.valueOf(amount));
            MDC.put(&quot;user_id&quot;, userId);
            // Record CPU usage metric associated with the current trace context
            double cpuLoad = osBean.getSystemLoadAverage();
            if (cpuLoad &gt;= 0) {
                cpuUsageHistogram.record(cpuLoad);
                MDC.put(&quot;cpu_load&quot;, String.valueOf(cpuLoad));
            }
            // Log a structured message
            logger.info(&quot;Processing order&quot;);
            // Simulate business logic
            // ...
            span.setAttribute(&quot;order.status&quot;, &quot;completed&quot;);
            logger.info(&quot;Order processed successfully&quot;);
        } catch (Exception e) {
            span.recordException(e);
            span.setAttribute(&quot;order.status&quot;, &quot;failed&quot;);
            logger.error(&quot;Order processing failed&quot;, e);
        } finally {
            MDC.clear();
            span.end();
        }
    }
}
</code></pre>
<p>This code demonstrates how to:</p>
<ol>
<li>Create a span for the operation</li>
<li>Add business attributes</li>
<li>Add current CPU usage</li>
<li>Link everything with consistent IDs</li>
<li>Record exceptions and outcomes on the span so they reach the backend system</li>
</ol>
<p>When configured with an appropriate exporter, this creates enriched logs that contain both application events and their complete context.</p>
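<p>For instance, with a JSON layout wired into SLF4J, the &quot;Processing order&quot; line emitted above could come out roughly like this (the exact shape depends on your logging configuration; the MDC values are strings because of the String.valueOf calls):</p>
<pre><code class="language-json">{
  "@timestamp": "2025-05-19T09:17:31.812Z",
  "log.level": "INFO",
  "message": "Processing order",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "order_id": "ORD-1001",
  "order_amount": "149.99",
  "user_id": "12345",
  "cpu_load": "3.7"
}
</code></pre>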
<h2>Powerful queries across previously separate data</h2>
<p>Even when your data has not yet been enriched, there is still hope. First, with GenAI-powered ingestion it is possible to extract key fields that help correlate data, such as session IDs. This enriches your logs with the structure they need to behave like other signals. Below we can see Elastic's Auto Import mechanism, which automatically generates ingest pipelines and pulls unstructured information from logs into a structured format that is ideal for analytics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image4.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image2.png" alt="" /></p>
<p>Once you have this data in the same data store, you can perform powerful join queries that were previously impossible. For example, finding slow database queries that affected specific API endpoints:</p>
<pre><code class="language-sql">FROM logs-nginx.access-default 
| LOOKUP JOIN .ds-logs-mysql.slowlog-default-2025.05.01-000002 ON request_id 
| KEEP request_id, mysql.slowlog.query, url.query 
| WHERE mysql.slowlog.query IS NOT NULL
</code></pre>
<p>This query joins web server logs with database slow query logs, allowing you to directly correlate user-facing performance with database operations.</p>
<p>For GenAI interfaces, these complex queries can be generated automatically from natural language questions:</p>
<p>&quot;Show me all checkout failures that coincided with slow database queries&quot;</p>
<p>The AI translates this into appropriate queries across your unified data store, correlating application errors with database performance.</p>
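<p>Under the hood, that question might become an ES|QL query along these lines (the index and field names here are hypothetical and depend on your own data streams):</p>
<pre><code class="language-sql">FROM logs-checkout-default
| WHERE log.level == &quot;ERROR&quot;
| LOOKUP JOIN logs-mysql.slowlog-default ON request_id
| WHERE mysql.slowlog.query IS NOT NULL
| KEEP @timestamp, request_id, message, mysql.slowlog.query
</code></pre>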
<h2>Real-world applications and use cases</h2>
<h3>Natural language investigation</h3>
<p>Imagine asking your observability system:</p>
<p>&quot;Why did checkout latency spike at 09:17 yesterday?&quot;</p>
<p>A GenAI-powered system with unified data could respond:</p>
<p>&quot;Checkout latency increased by 230% at 09:17:32 following deployment v2.4.1 at 09:15. The root cause appears to be increased MySQL query times in the inventory-service. Specifically, queries to the 'product_availability' table are taking an average of 2300ms compared to the normal 95ms. This coincides with a CPU spike on database host db-03 and 24 'Lock wait timeout' errors in the inventory service logs.&quot;</p>
<p>Here's an example of Claude Desktop connected to <a href="https://github.com/elastic/mcp-server-elasticsearch">Elastic's MCP (Model Context Protocol) Server</a>, which demonstrates how powerful natural language investigations can be. We ask Claude to &quot;analyze my web traffic patterns,&quot; and as you can see, it correctly identifies what is happening in our demo environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image3.png" alt="" /></p>
<h3>Unknown problem detection</h3>
<p>GenAI can identify subtle patterns by correlating signals that would be missed in siloed systems. For example, it might notice that a specific customer ID appears in error logs only when a particular network path is taken through your microservices—indicating a data corruption issue affecting only certain user flows.</p>
<h3>Predictive maintenance</h3>
<p>By analyzing the unified historical patterns leading up to previous incidents, GenAI can identify emerging problems before they cause outages:</p>
<p>&quot;Warning: Current load pattern on authentication-service combined with increasing error rates in user-profile-service matches 87% of the signature that preceded the April 3rd outage. Recommend scaling user-profile-service pods immediately.&quot;</p>
<h2>The future: agentic AI for observability</h2>
<p>The next frontier is agentic AI: systems that not only analyze but also take action automatically.</p>
<p>These AI agents could:</p>
<ol>
<li>Continuously monitor all observability signals</li>
<li>Autonomously investigate anomalies</li>
<li>Implement fixes for known patterns</li>
<li>Learn from the effectiveness of previous interventions</li>
</ol>
<p>For example, an observability agent might:</p>
<ul>
<li>Detect increased error rates in a service</li>
<li>Analyze logs and traces to identify a memory leak</li>
<li>Correlate with recent code changes</li>
<li>Increase the memory limit temporarily</li>
<li>Create a detailed ticket with the root cause analysis</li>
<li>Monitor the fix effectiveness</li>
</ul>
<p>This is about creating systems that understand your application's behavior patterns deeply enough to maintain them proactively. You can see how this works in Elastic Observability: in the screenshot below, at the end of the root cause analysis (RCA) we send an email summary, but this step could trigger any action.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image1.png" alt="" /></p>
<h2>Business outcomes</h2>
<p>Unifying observability data for GenAI analysis delivers concrete benefits:</p>
<ul>
<li>Faster resolution times: Problems that previously required hours of manual correlation can be diagnosed in seconds</li>
<li>Fewer escalations: Junior engineers can leverage AI to investigate complex issues before involving specialists</li>
<li>Improved system reliability: Earlier detection and resolution of emerging issues</li>
<li>Better developer experience: Less time spent context-switching between tools</li>
<li>Enhanced capacity planning: More accurate prediction of resource needs</li>
</ul>
<h2>Implementation steps</h2>
<p>Ready to start your observability transformation? Here's a practical roadmap:</p>
<ol>
<li>Adopt OpenTelemetry: Standardize on OpenTelemetry for all telemetry data collection and use it to generate enriched logs.</li>
<li>Choose a unified storage solution: Select a platform that can efficiently store and query metrics, logs, traces and enriched logs together</li>
<li>Enrich your telemetry: Update application instrumentation to include relevant context</li>
<li>Create correlation IDs: Ensure every request carries consistent identifiers (such as trace and request IDs) across services</li>
<li>Implement semantic conventions: Follow consistent naming patterns across your telemetry data</li>
<li>Start with focused use cases: Begin with high-value scenarios like checkout flows or critical APIs</li>
<li>Leverage GenAI tools: Integrate tools that can analyze your unified data and respond to natural language queries</li>
</ol>
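<p>To make step 4 concrete, here is a minimal Node.js sketch of correlation ID propagation (the withCorrelationId helper and its field names are hypothetical; in an OpenTelemetry setup, the trace ID typically plays this role, as in the Java example above):</p>
<pre><code class="language-javascript">const crypto = require('crypto');

// Hypothetical helper: wraps a request handler so that every structured log
// line it emits carries the same correlation ID, letting logs, traces, and
// metrics for one request be joined later in a unified data store.
function withCorrelationId(handler) {
  return function (req) {
    // Reuse an ID propagated by an upstream service via the x-request-id
    // header; otherwise mint a new one at the edge of the system.
    const headers = req.headers || {};
    const requestId = headers['x-request-id'] || crypto.randomUUID();
    function log(level, message, fields) {
      return JSON.stringify(Object.assign({
        '@timestamp': new Date().toISOString(),
        'log.level': level,
        message: message,
        request_id: requestId,
      }, fields || {}));
    }
    return handler(req, { requestId: requestId, log: log });
  };
}
</code></pre>
<p>Every line a wrapped handler logs then shares one request_id, giving join queries like the ES|QL example earlier a key to pivot on.</p>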
<p>Remember, AI can only be as smart as the data you feed it. The quality and completeness of your telemetry data will determine the effectiveness of your AI-powered observability.</p>
<h2>Generative AI: an evolutionary catalyst for observability</h2>
<p>The unification of observability data for GenAI analysis represents an evolutionary leap forward comparable to the transition from Internet 1.0 to 2.0. Early adopters will gain a significant competitive advantage through faster problem resolution, improved system reliability, and more efficient operations. GenAI is a huge step toward increasing observability maturity and moving your team to a more proactive stance.</p>
<p>Think of traditional observability as a doctor trying to diagnose a patient while only able to see their heart rate. Unified observability with GenAI is like giving that doctor the complete health picture: vital signs, lab results, medical history, and genetic data, all accessible through natural conversation.</p>
<p>As SREs, we stand at the threshold of a new era in system observability. The asteroid of GenAI isn't a threat to be feared; it's an opportunity to evolve our practices and tools to build more reliable, understandable systems. The question isn't whether this transformation will happen, but who will lead it.</p>
<p>Will you?</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/title.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Trace your Azure Function application with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/trace-azure-function-application-observability</link>
            <guid isPermaLink="false">trace-azure-function-application-observability</guid>
            <pubDate>Tue, 16 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Serverless applications deployed on Azure Functions are growing in usage. This blog shows how to deploy a serverless application on Azure functions with Elastic Agent and use Elastic's APM capability to manage and troubleshoot issues.]]></description>
            <content:encoded><![CDATA[<p>Adoption of Azure Functions in cloud-native applications on Microsoft Azure has been increasing exponentially over the last few years. Serverless functions, such as the Azure Functions, provide a high level of abstraction from the underlying infrastructure and orchestration, given these tasks are managed by the cloud provider. Software development teams can then focus on the implementation of business and application logic. Some additional benefits include billing for serverless functions based on the actual compute and memory resources consumed, along with automatic on-demand scaling.</p>
<p>While the benefits of using serverless functions are manifold, it is also necessary to make them observable in the wider end-to-end microservices architecture context.</p>
<h2>Elastic Observability (APM) for Azure Functions: The architecture</h2>
<p><a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-7-0">Elastic Observability 8.7</a> introduced distributed tracing for Microsoft Azure Functions — available for the Elastic APM Agents for .NET, Node.js, and Python. Auto-instrumentation of HTTP requests is supported out-of-the-box, enabling the detection of performance bottlenecks and sources of errors.</p>
<p>The key components of the solution for observing Azure Functions are:</p>
<ol>
<li>The Elastic APM Agent for the relevant language</li>
<li>Elastic Observability</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-function.png" alt="azure function" /></p>
<p>The APM server validates and processes incoming events from individual APM Agents and transforms them into Elasticsearch documents. The APM Agent provides auto-instrumentation capabilities for the application being observed. The Node.js APM Agent can trace function invocations in an Azure Functions app.</p>
<h2>Setting up Elastic APM for Azure Functions</h2>
<p>To demonstrate the setup and usage of Elastic APM, we will use a <a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app">sample Node.js application</a>.</p>
<h3>Application overview</h3>
<p>The Node.js application has two <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook">HTTP-triggered</a> functions named &quot;<a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Hello/index.js">Hello</a>&quot; and &quot;<a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Goodbye/index.js">Goodbye</a>.&quot; Once deployed, they can be called as follows, and tracing data will be sent to the configured Elastic Observability deployment.</p>
<pre><code class="language-bash">curl -i https://&lt;APP_NAME&gt;.azurewebsites.net/api/hello
curl -i https://&lt;APP_NAME&gt;.azurewebsites.net/api/goodbye
</code></pre>
<h3>Setup</h3>
<p><strong>Step 0. Prerequisites</strong></p>
<p>To run the sample application, you will need:</p>
<ul>
<li>
<p>An installation of <a href="https://nodejs.org/">Node.js</a> (v14 or later)</p>
</li>
<li>
<p>Access to an Azure subscription with an appropriate role to create resources</p>
</li>
<li>
<p>The <a href="https://learn.microsoft.com/en-us/cli/azure/install-azure-cli">Azure CLI (az)</a> logged into an Azure subscription</p>
<ol>
<li>Use az login to login</li>
<li>See the output of az account show</li>
</ol>
</li>
<li>
<p>The <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-run-local?tabs=v4%2Cwindows%2Ccsharp%2Cportal%2Cbash#install-the-azure-functions-core-tools">Azure Functions Core Tools (func)</a> (func --version should show a 4.x version)</p>
</li>
<li>
<p>An Elastic Observability deployment to which monitoring data will be sent</p>
<ol>
<li>The simplest way to get started with Elastic APM Microsoft Azure is through Elastic Cloud. <a href="https://www.elastic.co/guide/en/elastic-stack-deploy/current/azure-marketplace-getting-started.html">Get started with Elastic Cloud on Azure Marketplace</a> or <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">sign up for a trial on Elastic Cloud</a>.</li>
</ol>
</li>
<li>
<p>The APM server URL (serverUrl) and secret token (secretToken) from your Elastic stack deployment for configuration below</p>
<ol>
<li><a href="https://www.elastic.co/guide/en/apm/guide/8.7/install-and-run.html">How to get the serverUrl and secretToken documentation</a></li>
</ol>
</li>
</ul>
<p><strong>Step 1. Clone the sample application repo and install dependencies</strong></p>
<pre><code class="language-bash">git clone https://github.com/elastic/azure-functions-apm-nodejs-sample-app.git
cd azure-functions-apm-nodejs-sample-app
npm install
</code></pre>
<p><strong>Step 2. Deploy the Azure Function App</strong><br />
Caution: Deploying a function app to Azure can incur <a href="https://azure.microsoft.com/en-us/pricing/details/functions/">costs</a>. The following setup uses the free tier of Azure Functions. Step 5 covers the clean-up of resources.</p>
<p><strong>Step 2.1</strong><br />
To avoid name collisions with others that have independently run this demo, we need a short unique identifier for some resource names that need to be globally unique. We'll call it the DEMO_ID. You can run the following to generate one and save it to DEMO_ID and the &quot;demo-id&quot; file.</p>
<pre><code class="language-bash">if [[ ! -f demo-id ]]; then node -e 'console.log(crypto.randomBytes(3).toString(&quot;hex&quot;))' &gt;demo-id; fi
export DEMO_ID=$(cat demo-id)
echo $DEMO_ID
</code></pre>
<p><strong>Step 2.2</strong><br />
Before you can deploy to Azure, you will need to create some Azure resources: a Resource Group, Storage Account, and the Function App. For this demo, you can use the following commands. (See <a href="https://learn.microsoft.com/en-us/azure/azure-functions/create-first-function-cli-node#create-supporting-azure-resources-for-your-function">this Azure docs section</a> for more details.)</p>
<pre><code class="language-bash">REGION=westus2   # Or use another region listed in 'az account list-locations'.
az group create --name &quot;AzureFnElasticApmNodeSample-rg&quot; --location &quot;$REGION&quot;
az storage account create --name &quot;eapmdemostor${DEMO_ID}&quot; --location &quot;$REGION&quot; \
    --resource-group &quot;AzureFnElasticApmNodeSample-rg&quot; --sku Standard_LRS
az functionapp create --name &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
    --resource-group &quot;AzureFnElasticApmNodeSample-rg&quot; \
    --consumption-plan-location &quot;$REGION&quot; --runtime node --runtime-version 18 \
    --functions-version 4 --storage-account &quot;eapmdemostor${DEMO_ID}&quot;
</code></pre>
<p><strong>Step 2.3</strong><br />
Next, configure your Function App with the APM server URL and secret token for your Elastic deployment. This can be done in the <a href="https://portal.azure.com/">Azure Portal</a> or with the az CLI.</p>
<p>In the Azure portal, browse to your Function App, then its Application Settings (<a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-how-to-use-azure-function-app-settings?tabs=portal#settings">Azure user guide</a>). You'll need to add two settings:</p>
<p>First, set your APM server URL and secret token in your shell:</p>
<pre><code class="language-bash">export ELASTIC_APM_SERVER_URL=&quot;&lt;your serverUrl&gt;&quot;
export ELASTIC_APM_SECRET_TOKEN=&quot;&lt;your secretToken&gt;&quot;
</code></pre>
<p>Or you can use the az functionapp config appsettings set ... CLI command as follows:</p>
<pre><code class="language-bash">az functionapp config appsettings set \
  -g &quot;AzureFnElasticApmNodeSample-rg&quot; -n &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
  --settings &quot;ELASTIC_APM_SERVER_URL=${ELASTIC_APM_SERVER_URL}&quot;
az functionapp config appsettings set \
  -g &quot;AzureFnElasticApmNodeSample-rg&quot; -n &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
  --settings &quot;ELASTIC_APM_SECRET_TOKEN=${ELASTIC_APM_SECRET_TOKEN}&quot;
</code></pre>
<p>The ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN values are stored in the Function App’s application settings and picked up automatically by the Elastic APM Agent. The agent is started by the initapm.js file with:</p>
<pre><code class="language-javascript">require(&quot;elastic-apm-node&quot;).start();
</code></pre>
<p>When you log in to Azure and look at the function’s configuration, you will see them set:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-functions-application-settings.png" alt="azure functions application settings" /></p>
<p><strong>Step 2.4</strong><br />
Now you can publish your app. (Re-run this command every time you make a code change.)</p>
<pre><code class="language-bash">func azure functionapp publish &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot;
</code></pre>
<p>You should log in to Azure to see the function running.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-function-app.png" alt="azure function app" /></p>
<p><strong>Step 3. Try it out</strong></p>
<pre><code class="language-bash">% curl https://azure-functions-apm-nodejs-sample-app-${DEMO_ID}.azurewebsites.net/api/Hello
{&quot;message&quot;:&quot;Hello.&quot;}
% curl https://azure-functions-apm-nodejs-sample-app-${DEMO_ID}.azurewebsites.net/api/Goodbye
{&quot;message&quot;:&quot;Goodbye.&quot;}
</code></pre>
<p>In a few moments, the APM app in your Elastic deployment will show tracing data for your Azure Function app.</p>
<p><strong>Step 4. Apply some load to your app</strong><br />
To get some more interesting data, you can run the following to generate some load on your deployed function app:</p>
<pre><code class="language-bash">npm run loadgen
</code></pre>
<p>This uses the <a href="https://github.com/mcollina/autocannon">autocannon</a> node package to generate some light load (2 concurrent users, each calling at 5 requests/s for 60s) on the &quot;Goodbye&quot; function.</p>
<p><strong>Step 5. Clean up resources</strong><br />
If you deployed to Azure, you should make sure to delete any resources so you don't incur any costs.</p>
<pre><code class="language-bash">az group delete --name &quot;AzureFnElasticApmNodeSample-rg&quot;
</code></pre>
<h2>Analyzing Azure Function APM data in Elastic</h2>
<p>Once you have successfully set up the sample application and started generating load, you should see APM data appearing in the Elastic Observability APM Services capability.</p>
<h2>Service map</h2>
<p>With the default setup, you will see two services in the APM Service map.</p>
<p>The main function: azure-functions-apm-nodejs-sample-app</p>
<p>And the endpoint where your function is accessible: azure-functions-apm-nodejs-sample-app-ec7d4c.azurewebsites.net</p>
<p>You will see that there is a connection between the two as your application is taking requests and answering through the endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<p>From the <a href="https://www.elastic.co/observability/application-performance-monitoring">APM Service</a> map you can further investigate the function, analyze traces, look at logs, and more.</p>
<h3>Service details</h3>
<p>When we dive into the details, we can see several items.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-azure-functions-apm.png" alt="observability azure functions apm" /></p>
<ul>
<li>Latency for the recent load we ran against the application</li>
<li>Transactions (Goodbye and Hello)</li>
<li>Average throughput</li>
<li>And more</li>
</ul>
<h3>Transaction details</h3>
<p>We can see transaction details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-get-api-goodbye.png" alt="observability get api goodbye" /></p>
<p>An individual trace shows us that the &quot;Goodbye&quot; function <a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Goodbye/index.js#L6-L10">calls the &quot;Hello&quot; function</a> in the same function app before returning:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-latency-distribution-trace-sample.png" alt="latency distribution trace sample" /></p>
<h3>Machine learning based latency correlation</h3>
<p>As we’ve mentioned in other blogs, we can also correlate issues such as higher-than-normal latency. Since we see a spike at 1s, we run the embedded latency correlation, which uses machine learning to help identify the components most likely contributing to the issue by analyzing logs, metrics, and traces.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-latency-distribution-correlations.png" alt="latency distribution correlations" /></p>
<p>The correlation indicated a potential cause (25% correlation) related to the host sending the load (my machine).</p>
<h3>Cold start detection</h3>
<p>Also, we can see the impact a <a href="https://azure.microsoft.com/en-ca/blog/understanding-serverless-cold-start/">cold start</a> can have on the latency of a request:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-trace-sample.png" alt="trace sample" /></p>
<h2>Summary</h2>
<p>Elastic Observability provides real-time monitoring of Azure Functions in your production environment for a broad range of use cases. Curated dashboards assist DevOps teams in performing root cause analysis for performance bottlenecks and errors. SRE teams can quickly view upstream and downstream dependencies, as well as perform analyses in the context of distributed microservices architecture.</p>
<h2>Learn more</h2>
<p>To learn how to add the Elastic APM Agent to an existing Node.js Azure Function app, read <a href="https://www.elastic.co/guide/en/apm/agent/nodejs/master/azure-functions.html">Monitoring Node.js Azure Functions</a>. Additional resources include:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-with-the-azure-integration-enhancement">How to deploy and manage Elastic Observability on Microsoft Azure</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/guide/current/apm-quick-start.html">Elastic APM Quickstart</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/09-road.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Trace-based testing with Elastic APM and Tracetest]]></title>
            <link>https://www.elastic.co/observability-labs/blog/trace-based-testing-apm-tracetest</link>
            <guid isPermaLink="false">trace-based-testing-apm-tracetest</guid>
            <pubDate>Wed, 15 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Want to run trace-based tests with Elastic APM? We're happy to announce that Tracetest now integrates with Elastic Observability APM. Check out this hands-on example of how Tracetest works with Elastic Observability APM and OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p><em>This post was originally published on the</em> <a href="https://tracetest.io/blog/tracetest-integration-elastic-trace-based-testing-application-performance-monitoring"><em>Tracetest blog</em></a><em>.</em></p>
<p>Want to run trace-based tests with Elastic APM? Today is your lucky day. We're happy to announce that Tracetest now integrates with Elastic Observability APM.</p>
<p>Check out this <a href="https://github.com/kubeshop/tracetest/tree/main/examples/tracetest-elasticapm-with-elastic-agent">hands-on example</a> of how Tracetest works with Elastic Observability APM and OpenTelemetry!</p>
<p><a href="https://tracetest.io/">Tracetest</a> is a <a href="https://www.cncf.io/">CNCF</a> project aiming to provide a solution for deep integration and system testing by leveraging the rich data in distributed system traces. In this blog, we intend to provide an introduction to Tracetest and its capabilities, including how it can be integrated with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic Application Performance Monitoring</a> and <a href="https://opentelemetry.io/">OpenTelemetry</a> to enhance the testing process.</p>
<h2>Your good friend distributed tracing</h2>
<p>Distributed tracing is a way to understand how a distributed system works by tracking the flow of requests through the system. It can be used for a variety of purposes, such as identifying and fixing performance issues, figuring out what went wrong when an error occurs, and making sure that the system is running smoothly. Here are a few examples of how distributed tracing can be used:</p>
<ul>
<li><strong>Monitoring performance:</strong> Distributed tracing can help you keep an eye on how your distributed system is performing by showing you what's happening in real time. This can help you spot and fix problems like bottlenecks or slow response times that can make the system less reliable.</li>
<li><strong>Finding the source of problems:</strong> When something goes wrong, distributed tracing can help you figure out what happened by showing you the sequence of events that led up to the problem. This can help you pinpoint the specific service or component that's causing the issue and fix it.</li>
<li><strong>Debugging:</strong> Distributed tracing can help you find and fix bugs by giving you detailed information about what's happening in the system. This can help you understand why certain requests are behaving in unexpected ways and how to fix them.</li>
<li><strong>Security:</strong> Distributed tracing can help you keep an eye on security by showing you who is making requests to the system, where they are coming from, and what services are being accessed.</li>
<li><strong>Optimization:</strong> Distributed tracing can help you optimize the performance of the system by providing insight into how requests are flowing through it, which can help you identify areas that can be made more efficient and reduce the number of requests that need to be handled.</li>
</ul>
<h2>Distributed tracing — Now also for testing</h2>
<p>Observability, previously only used in operations, is now being applied in other areas of development, such as testing. This shift has led to the emergence of <a href="https://www.infoq.com/articles/observability-driven-development/">&quot;Observability-driven development&quot;</a> and &quot;trace-based testing&quot; as new methods for using distributed tracing to test distributed applications.</p>
<p>Instead of just checking that certain parts of the code are working, trace-driven testing follows the path that a request takes as it goes through the system. This way, you can make sure that the entire system is working properly and that the right output is produced for a given input. By using distributed tracing, developers can record what happens during the test and then use that information to check that everything is working as it should.</p>
<p>This method of testing can help to find problems that may be hard to detect with other types of testing and can better validate that the new code is working as expected. Additionally, distributed tracing provides information about what is happening during the test, such as how long it takes for a request to be processed and which services are being used, which can help developers understand how the code behaves in a real-world scenario.</p>
<h2>Enter Tracetest</h2>
<p><a href="https://tracetest.io/">Tracetest</a> is a CNCF project that runs tests by verifying new traces against assertions previously created from traces captured on the real system. Here's how you can use Tracetest:</p>
<ul>
<li>Capture a known-good baseline trace. This becomes the golden standard against which you write your tests and assertions. Testing against a whole trace is a better way to verify how different parts of the system work together: it lets developers validate the entire request flow from start to finish and gives a more complete view of how the system is functioning than a set of disjointed assertions on the request execution.</li>
<li>Now you can start validating your code changes against the known-good behavior captured previously.</li>
<li>Tracetest can validate the resulting traces from the test and see if the system is working as it should. This can help you find problems that traditional testing methods might not catch.</li>
<li>Create reports: Tracetest can also create reports that summarize the results of the test so that you can share the information with your team.</li>
<li>Help you validate in production that the new requests follow the known path and run the predefined assertions against them.</li>
</ul>
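<p>To make the workflow above concrete, here is a minimal, hypothetical sketch in plain Python of the core idea: select spans from a captured trace and run assertions against them. This is purely illustrative of the concept; Tracetest's actual selector language and assertion engine are far richer, and all function and field names here are made up.</p>

```python
# Illustrative sketch only: not Tracetest's implementation.
# Span fields and helper names are hypothetical.

def select(spans, **attrs):
    """Return spans whose attributes match all given key/value pairs."""
    return [s for s in spans if all(s.get(k) == v for k, v in attrs.items())]

def run_assertions(spans, checks):
    """Apply each (description, predicate) check to every selected span."""
    results = []
    for span in spans:
        for desc, predicate in checks:
            results.append((desc, predicate(span)))
    return results

# A known-good baseline trace captured from the running system.
trace = [
    {"type": "http", "name": "GET /", "http.status_code": 200, "duration_ms": 120},
    {"type": "db", "name": "GET redis", "duration_ms": 4},
]

checks = [
    ("status is 200", lambda s: s["http.status_code"] == 200),
    ("completes in under 500ms", lambda s: s["duration_ms"] < 500),
]

results = run_assertions(select(trace, type="http", name="GET /"), checks)
print(results)  # every check passes for the baseline trace
```

<p>New traces produced by a code change can then be run through the same checks, surfacing regressions in status codes, latency, or trace shape.</p>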
<p>The APM tool in Kibana, which is a familiar UI for many developers, can provide extra information when used with Tracetest. The APM tool can show you how the system is performing during the test and help you find issues using the familiar user interface. For example, the APM tool can show you how requests are moving through the system, how long requests take to be processed, and which parts of the system are being used. This information can help you identify and fix problems during testing.</p>
<p>Furthermore, the APM tool can be set to show you all the data in real-time, which allows you to monitor the system's behavior during the test or even in production and helps you make sense of what Tracetest is showing.</p>
<h2>How Tracetest works with Elastic APM to test the application</h2>
<p>The components work together to provide a complete solution for testing distributed systems. The telemetry captured by the OpenTelemetry agent is sent to the Elastic APM Server, which processes and formats the data for indexing in Elasticsearch. The data can then be queried and analyzed using Kibana APM UI, and Tracetest can be used to conduct deep integration and system tests by utilizing the rich data contained in the distributed system trace.</p>
<p>For more details on Elastic's support for OpenTelemetry, check out <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-distributed-system-trace.png" alt="" /></p>
<ol>
<li>Tracetest initiates the test by sending a request to the application under test.</li>
<li>The application processes the request, and the built-in OpenTelemetry agent captures the telemetry data of the request. This data includes information such as request and response payloads, request and response headers, and any errors that occurred during the request processing. The agent then sends the captured telemetry data to the Elastic APM Server.</li>
<li>Elastic APM server consumes OpenTelemetry or Elastic APM spans and sends the data to be stored and indexed in Elasticsearch.</li>
<li>Tracetest polls Elasticsearch to retrieve the captured trace data. It makes use of Elasticsearch query to fetch the trace data. Tracetest compares the received trace data with the expected trace data and runs the assertions. This step is used to check whether the data received from the application matches the expected data and to check for any errors or issues that may have occurred during the request processing. Based on the results of the comparison, Tracetest will report any errors or issues found and will provide detailed information about the root cause of the problem. If the test passes, Tracetest will report that the test passed, and the test execution process will be completed.</li>
<li>The trace data is visible and can be analyzed in Kibana APM UI as well.</li>
</ol>
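<p>Step 4 above is essentially a poll-until-ready loop: the trace takes a moment to be ingested and indexed, so the test runner has to retry its Elasticsearch query until spans appear or a timeout expires. A hedged sketch of that pattern (the function names, timings, and stand-in data are illustrative, not Tracetest's code):</p>

```python
import time

def poll_for_trace(fetch_spans, timeout_s=30.0, interval_s=0.5):
    """Call fetch_spans() repeatedly until it returns a non-empty span list
    or the timeout expires. Returns the spans (empty list on timeout)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        spans = fetch_spans()
        if spans:
            return spans
        time.sleep(interval_s)
    return []

# Stand-in for the real Elasticsearch query by trace.id: the first two
# polls find nothing (not yet indexed), the third returns the spans.
attempts = iter([[], [], [{"name": "GET /", "trace.id": "abc123"}]])
spans = poll_for_trace(lambda: next(attempts), timeout_s=5.0, interval_s=0.01)
print(spans[0]["name"])  # GET /
```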
<h2>Running your first Tracetest environment with Elastic APM and Docker compose</h2>
<p>In your existing observability setup, you have the <a href="https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/">OpenTelemetry Nodejs agent</a> configured in your code and <a href="https://www.elastic.co/blog/opentelemetry-observability">sending OpenTelemetry traces to the Elastic APM server that then stores</a> them in Elasticsearch. Adding Tracetest to the infrastructure lets you write detailed trace-based tests based on the existing tracing infrastructure. Tracetest runs tests against endpoints and uses trace data to run assertions.</p>
<p>The example that we are going to run is from the Tracetest GitHub repository. It contains a docker-compose setup, which is a convenient way to run multiple services together in a defined environment. The example includes a sample application that has been instrumented with an OpenTelemetry agent. The example also includes the Tracetest server with its Postgres database, which is responsible for invoking the test, polling Elasticsearch to retrieve the captured trace data, comparing the received trace data with the expected trace data, and running the assertions. Finally, the example includes Elasticsearch, Kibana, and the Elastic APM server from the Elastic Stack.</p>
<p>To quickly access the example, you can run the following:</p>
<pre><code class="language-bash">git clone https://github.com/kubeshop/tracetest.git
cd tracetest/examples/tracetest-elasticapm-with-otel
docker-compose up -d
</code></pre>
<p>Once you have Tracetest set up, open <a href="http://localhost:11633">http://localhost:11633</a> in your browser to check out the Web UI.</p>
<p>Navigate to the Settings menu and ensure the connection to Elasticsearch is working by pressing Test Connection:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-configure-data-store.png" alt="" /></p>
<p>To create a test, click the Create dropdown and choose Create New Test. Select the HTTP Request and give it a name and description.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-create-new-test.png" alt="" /></p>
<p>For this simple example, GET the Node.js app, which runs at <a href="http://app:8080">http://app:8080</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-trace-request-details.png" alt="" /></p>
<p>With the test created, you can click the Trace tab to see the distributed trace. It’s simple, but you can start to see how it delivers immediate visibility into every transaction your HTTP request generates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-trigger.png" alt="" /></p>
<p>From here, you can continue by adding assertions.</p>
<p>To make an assertion based on the GET / span of our trace, select that span in the graph view and click <strong>Current span</strong> in the Test Spec modal. Or, copy this span selector directly, using the <a href="https://docs.tracetest.io/concepts/selectors/">Tracetest Selector Language</a>:</p>
<pre><code class="language-javascript">span[tracetest.span.type=&quot;http&quot; name=&quot;GET /&quot; http.target=&quot;/&quot; http.method=&quot;GET&quot;]
</code></pre>
<p>Below, add the attr:http.status_code attribute and the expected value, which is 200. You can add more complex assertions as well, like testing whether the span executes in less than 500ms. Add a new assertion for attr:tracetest.span.duration, choose &lt;, and add 500ms as the expected value.</p>
<p>You can check against other properties, return statuses, timing, and much more, but we’ll keep it simple for now.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-edit-test-spec.png" alt="" /></p>
<p>Then click <strong>Save Test Spec</strong>, followed by <strong>Publish</strong>, and you’ve created your first assertion. If you open the APM app in Kibana at <a href="https://localhost:5601">https://localhost:5601</a> (find the username and password in the examples/tracetest-elasticapm-with-otel/.env file), you will be able to navigate to the transaction generated by the test, representing the overall application call with three underlying spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-latency-distribution.png" alt="" /></p>
<h2>Summary</h2>
<p>Elastic APM and Tracetest are tools that can help make testing distributed applications easier by providing a more comprehensive view of the system's behavior and allowing developers to identify and diagnose performance issues more efficiently. Tracetest allows you to test the entire process from start to finish, making sure that everything is working as it should, by following the path that a request takes.</p>
<p>Elastic APM provides detailed information about the performance of a system, including how requests are flowing through the system, how long requests take to be processed, and which services are being called. Together, these tools can help developers to identify and fix issues more quickly, improve collaboration and communication among the team, and ultimately improve the overall quality of the system.</p>
<blockquote>
<ul>
<li>Elastic APM documentation: <a href="https://www.elastic.co/guide/en/apm/guide/current/index.html">https://www.elastic.co/guide/en/apm/guide/current/index.html</a></li>
<li>Tracetest documentation: <a href="https://tracetest.io/docs/">https://tracetest.io/docs/</a> </li>
<li>Tracetest Github page: <a href="https://github.com/kubeshop/tracetest">https://github.com/kubeshop/tracetest</a> </li>
<li>Elastic blog: <a href="https://www.elastic.co/blog/category/technical-topics">https://www.elastic.co/blog/category/technical-topics</a> </li>
<li>Elastic APM community forum: <a href="https://discuss.elastic.co/c/apm">https://discuss.elastic.co/c/apm</a> </li>
<li>Tracetest support: <a href="https://discord.com/channels/884464549347074049/963470167327772703">Discord channel</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/telescope-search-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Revealing unknowns in your tracing data with inferred spans in OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry</link>
            <guid isPermaLink="false">tracing-data-inferred-spans-opentelemetry</guid>
            <pubDate>Mon, 22 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Distributed tracing is essential in understanding complex systems, but it can miss latency issue details. By combining profiling techniques with distributed tracing, Elastic provides the inferred spans feature as an extension for the OTel Java SDK.]]></description>
            <content:encoded><![CDATA[<p>In the complex world of microservices and distributed systems, achieving transparency and understanding the intricacies and inefficiencies of service interactions and request flows has become a paramount challenge. Distributed tracing is essential in understanding distributed systems. But distributed tracing, whether manually applied or auto-instrumented, is usually rather coarse-grained. Hence, distributed tracing covers only a limited fraction of the system and can easily miss parts of the system that are the most useful to trace.</p>
<p>Addressing this gap, Elastic developed the concept of inferred spans as a powerful enhancement to traditional instrumentation-based tracing as an extension for the OpenTelemetry Java SDK/Agent. We are in the process of contributing this back to OpenTelemetry; until then, our <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">extension</a> can be seamlessly used with the existing OpenTelemetry Java SDK (as described below).</p>
<p>Inferred spans are designed to augment the visibility provided by instrumentation-based traces, shedding light on latency sources within the application or libraries that were previously uninstrumented. This feature significantly expands the utility of distributed tracing, allowing for a more comprehensive understanding of system behavior and facilitating a deeper dive into performance optimization.</p>
<h2>What is inferred spans?</h2>
<p>Inferred spans is an observability technique that combines distributed tracing with profiling techniques to illuminate the darker, unobserved corners of your application — areas where standard instrumentation techniques fall short. The inferred spans feature interweaves information derived from profiling stacktraces with instrumentation-based tracing data, allowing for the generation of new spans based on the insights drawn from profiling data.</p>
<p>This feature proves invaluable when dealing with custom code or third-party libraries that significantly contribute to the request latency but lack built-in or external instrumentation support. Often, identifying or crafting specific instrumentation for these segments can range from challenging to outright unfeasible. Moreover, certain scenarios exist where implementing instrumentation is impractical due to the potential for substantial performance overhead. For instance, instrumenting application locking mechanisms, despite their critical role, is not viable because of their ubiquitous nature and the significant latency overhead the instrumentation can introduce to application requests. Still, ideally, such latency issues would be visible within your distributed traces.</p>
<p>Inferred spans ensures a deeper visibility into your application’s performance dynamics including the above-mentioned scenarios.</p>
<h2>Inferred spans in action</h2>
<p>To demonstrate the inferred spans feature, we will use the Java implementation of the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite">Elastiflix demo application</a>. Elastiflix has an endpoint called favorites that makes some Redis calls and also includes an artificial delay. First, we use the plain OpenTelemetry Java Agent to instrument our application:</p>
<pre><code class="language-bash">java -javaagent:/path/to/otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://&lt;our-elastic-apm-endpoint&gt; \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-jar my-service-name.jar
</code></pre>
<p>With the OpenTelemetry Java Agent we get out-of-the-box instrumentation for HTTP entry points and calls to Redis for our Elastiflix application. The resulting traces contain spans for the POST /favorites entrypoint, as well as a few short spans for the calls to Redis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image2.png" alt="POST /favorites entrypoint" /></p>
<p>As you can see in the trace above, it’s not clear where most of the time is spent within the POST /favorites request.</p>
<p>Let’s see how inferred spans can shed light into these areas. You can use the inferred spans feature either manually with your OpenTelemetry SDK (see section below), package it as a drop-in extension for the upstream OpenTelemetry Java agent, or just use <a href="https://github.com/elastic/elastic-otel-java/tree/main">Elastic’s distribution of the OpenTelemetry Java agent</a> that comes with the inferred spans feature.</p>
<p>For convenience, we just download the <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent/0.0.1">agent jar</a> of the Elastic distribution and extend the configuration to enable the inferred spans feature:</p>
<pre><code class="language-bash">java -javaagent:/path/to/elastic-otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://XX.apm.europe-west3.gcp.cloud.es.io:443 \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-Delastic.otel.inferred.spans.enabled=true \
-jar my-service-name.jar
</code></pre>
<p>The only non-standard option here is elastic.otel.inferred.spans.enabled: the inferred spans feature is currently opt-in and therefore needs to be enabled explicitly. Running the same application with the inferred spans feature enabled yields more comprehensive traces:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image1.png" alt="more comprehensive traces" /></p>
<p>The inferred-spans (colored blue in the above screenshot) follow the naming pattern Class#method. With that, the inferred spans feature helps us pinpoint the exact methods that contribute the most to the overall latency of the request. Note that the parent-child relationship between the HTTP entry span, the Redis spans, and the inferred spans is reconstructed correctly, resulting in a fully functional trace structure.</p>
<p>Examining the handleDelay method within the Elastiflix application reveals the use of a straightforward sleep statement. Although the sleep method is not CPU-bound, the full duration of this delay is captured as inferred spans. This stems from employing the async-profiler's wall clock time profiling, as opposed to solely relying on CPU profiling. The ability of the inferred spans feature to reflect actual latency, including for I/O operations and other non-CPU-bound tasks, represents a significant advancement. It allows for diagnosing and resolving performance issues that extend beyond CPU limitations, offering a more nuanced view of system behavior.</p>
<h2>Using inferred spans with your own OpenTelemetry SDK</h2>
<p>OpenTelemetry is a highly extensible framework: Elastic embraces this extensibility by also publishing most extensions shipped with our OpenTelemetry Java Distro as standalone-extensions to the <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a>.</p>
<p>As a result, if you do not want to use our distro (e.g., because you don’t need or want bytecode instrumentation in your project), you can still use our extensions, such as the extension for the inferred spans feature. All you need to do is set up the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#initialize-the-sdk">OpenTelemetry SDK in your code</a> and add the inferred spans extension as a dependency:</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.otel&lt;/groupId&gt;
    &lt;artifactId&gt;inferred-spans&lt;/artifactId&gt;
    &lt;version&gt;{latest version}&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>During your SDK setup, you’ll have to initialize and register the extension:</p>
<pre><code class="language-java">InferredSpansProcessor inferredSpans = InferredSpansProcessor.builder()
  .samplingInterval(Duration.ofMillis(10)) //the builder offers all config options
  .build();
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
  .addSpanProcessor(inferredSpans)
  .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
    .setEndpoint(&quot;https://&lt;your-elastic-apm-endpoint&gt;&quot;)
    .addHeader(&quot;Authorization&quot;, &quot;Bearer &lt;secrettoken&gt;&quot;)
    .build()).build())
  .build();
inferredSpans.setTracerProvider(tracerProvider);
</code></pre>
<p>The inferred spans extension seamlessly integrates with the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#automatic-configuration">OpenTelemetry SDK Autoconfiguration mechanism</a>. By incorporating the OpenTelemetry SDK and its extensions as dependencies within your application code — rather than through an external agent — you gain the flexibility to configure them using the same environment variables or JVM properties. Once the inferred spans extension is included in your classpath, activating it for autoconfigured SDKs becomes straightforward. Simply enable it using the elastic.otel.inferred.spans.enabled property, as previously described, to leverage the full capabilities of this feature with minimal setup.</p>
<h2>How does inferred spans work?</h2>
<p>The inferred spans feature leverages the wall-clock profiling capabilities of the widely used <a href="https://github.com/async-profiler/async-profiler">async-profiler</a>, a low-overhead, popular production-time profiler in the Java ecosystem. It then transforms the profiling data into actionable spans as part of the distributed traces. But what mechanism allows for this transformation?</p>
<p>Essentially, the inferred spans extension engages with the lifecycle of span events, specifically when a span is either activated or deactivated across any thread via the <a href="https://opentelemetry.io/docs/specs/otel/context/">OpenTelemetry context</a>. Upon the activation of the initial span within a transaction, the extension commences a session of wall-clock profiling via the async-profiler, set to a predetermined duration. Concurrently, it logs the details of all span activations and deactivations, capturing their respective timestamps and the threads on which they occurred.</p>
<p>Following the completion of the profiling session, the extension processes the profiling data alongside the log of span events. By correlating the data, it reconstructs the inferred spans. It's important to note that, in certain complex scenarios, the correlation may assign an incorrect name to a span. To mitigate this and aid in accurate identification, the extension enriches the inferred spans with stacktrace segments under the code.stacktrace attribute, offering users clarity and insight into the precise methods implicated.</p>
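<p>The correlation step can be pictured with a toy model: given wall-clock stack samples (timestamp plus top frame) taken on one thread, consecutive samples that share the same frame inside a span's active window become one inferred child span, named Class#method after that frame. The sketch below is a deliberate simplification with made-up names; the real extension handles full stacktraces, nesting, multiple threads, and the attribution caveats described above.</p>

```python
def infer_spans(samples, span_window):
    """samples: list of (timestamp_ms, frame) taken on one thread.
    span_window: (start_ms, end_ms) during which a parent span was active.
    Groups consecutive samples sharing the same frame inside the window
    into inferred spans named after that frame."""
    start, end = span_window
    inferred = []
    current = None  # (frame, first_ts, last_ts)
    for ts, frame in samples:
        if not (start <= ts <= end):
            continue  # sample falls outside the parent span's lifetime
        if current and current[0] == frame:
            current = (frame, current[1], ts)  # extend the current run
        else:
            if current:
                inferred.append({"name": current[0], "start": current[1], "end": current[2]})
            current = (frame, ts, ts)
    if current:
        inferred.append({"name": current[0], "start": current[1], "end": current[2]})
    return inferred

# 10ms sampling while a parent span is active from t=0 to t=100:
samples = [(t, "FavoritesService#handleDelay") for t in range(0, 60, 10)] + \
          [(t, "RedisClient#get") for t in range(60, 90, 10)]
print(infer_spans(samples, (0, 100)))
```

<p>Note how the sleep inside a method like handleDelay shows up at its full duration, because wall-clock samples keep arriving while the thread is blocked.</p>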
<h2>Inferred spans vs. correlation of traces with profiling data</h2>
<p>In the wake of OpenTelemetry's recent <a href="https://opentelemetry.io/blog/2024/profiling/">announcement of the profiling signal</a>, coupled with <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's commitment to donating the Universal Profiling Agent</a> to OpenTelemetry, you might be wondering about how the inferred spans feature differentiates from merely correlating profiling data with distributed traces using span IDs and trace IDs. Rather than viewing these as competing functionalities, it's more accurate to consider them complementary.</p>
<p>The inferred spans feature and the correlation of tracing with profiling data both employ similar methodologies — melding tracing information with profiling data. However, they each shine in distinct areas. Inferred spans excels at identifying long-running methods that could escape notice with traditional CPU profiling, which is more adept at pinpointing CPU bottlenecks. A unique advantage of inferred spans is its ability to account for I/O time, capturing delays caused by operations like disk access that wouldn't typically be visible in CPU profiling flamegraphs.</p>
<p>However, the inferred spans feature has its limitations, notably in detecting latency issues arising from &quot;death by a thousand cuts&quot; — where a method, although not time-consuming per invocation, significantly impacts total latency due to being called numerous times across a request. While individual calls might not be captured as inferred spans due to their brevity, CPU-bound methods contributing to latency are unveiled through CPU profiling, as flamegraphs display the aggregate CPU time consumed by these methods.</p>
<p>An additional strength of the inferred spans feature lies in its data structure, offering a simplified tracing model that outlines typical parent-child relationships, execution order, and good latency estimates. This structure is achieved by integrating tracing data with span activation/deactivation events and profiling data, facilitating straightforward navigation and troubleshooting of latency issues within individual traces.</p>
<p>Correlating distributed tracing data with profiling data comes with a different set of advantages. Learn more about it in our related blog post, <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation</a>.</p>
<h2>What about the performance overhead?</h2>
<p>As mentioned before, the inferred spans functionality is based on the widely used async-profiler, known for its minimal impact on performance. However, the efficiency of profiling operations is not without its caveats, largely influenced by the specific configurations employed. A pivotal factor in this balancing act is the sampling interval — the longer the interval between samples, the lower the incurred overhead, albeit at the expense of potentially overlooking shorter methods that could be critical to the inferred spans feature discovery process.</p>
<p>Adjusting the probability-based trace sampling presents another way for optimization, directly influencing the overhead. For instance, setting trace sampling to 50% effectively halves the profiling load, making the inferred spans feature even more resource-efficient on average per request. This nuanced approach to tuning ensures that the inferred spans feature can be leveraged in real-world, production environments with a manageable performance footprint. When properly configured, this feature offers a potent, low-overhead solution for enhancing observability and diagnostic capabilities within production applications.</p>
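<p>As a back-of-the-envelope illustration of that relationship (all numbers here are invented, and this is a simplified cost model, not a measured benchmark): the expected number of profiler stack samples per second scales with the trace sample rate and inversely with the sampling interval.</p>

```python
def samples_per_second(requests_per_s, avg_request_ms, trace_sample_rate, sampling_interval_ms):
    """Rough expected profiler samples/second: only sampled traces are
    profiled, and each profiled request yields duration/interval samples."""
    profiled_requests = requests_per_s * trace_sample_rate
    return profiled_requests * (avg_request_ms / sampling_interval_ms)

full = samples_per_second(100, 50, 1.0, 10)  # profile every trace
half = samples_per_second(100, 50, 0.5, 10)  # 50% trace sampling
print(full, half)  # halving the sample rate halves the profiling load
```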
<h2>What’s next for inferred spans and OpenTelemetry?</h2>
<p>This blog post outlined and introduced the inferred spans feature available as an extension for the OpenTelemetry Java SDK and built into the newly introduced Elastic OpenTelemetry Java Distro. Inferred spans allows users to troubleshoot latency issues in areas of code that are not explicitly instrumented while utilizing traditional tracing data.</p>
<p>The feature is currently merely a port of the existing feature from the proprietary Elastic APM Agent. With Elastic embracing OpenTelemetry, we plan on contributing this extension to the upstream OpenTelemetry project. For that, we also plan on migrating the extension to the latest async-profiler 3.x release. <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Try out inferred spans for yourself</a> and see how it can help you diagnose performance problems in your applications.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/148360-Blog-header-image--Revealing-Unknowns-in-your-Tracing-Data-with-Inferred-Spans-in-OpenTelemetry_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Transforming Industries and the Critical Role of LLM Observability: How to use Elastic's LLM integrations in real-world scenarios]]></title>
            <link>https://www.elastic.co/observability-labs/blog/transforming-industries-and-the-critical-role-of-llm-observability</link>
            <guid isPermaLink="false">transforming-industries-and-the-critical-role-of-llm-observability</guid>
            <pubDate>Thu, 08 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog explores four industry specific use cases that use Large Language Models (LLMs) and highlights how Elastic's LLM observability integrations provide insights into the cost, performance, reliability and the prompts and response exchange with the LLM.]]></description>
            <content:encoded><![CDATA[<p>In today's tech-centric world, Large Language Models (LLMs) are transforming sectors from finance and healthcare to research. LLMs are starting to underpin products and services across the spectrum. Take for example recent <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#advanced-coding">advanced coding</a> developments in Google's Gemini 2.5 which enable it to use its reasoning capabilities to create a video game by producing the executable code from a short prompt.  Or <a href="https://www.aboutamazon.com/news/devices/new-alexa-generative-artificial-intelligence">new ways</a> to interact with Amazon's Alexa - for example, you could send a picture of a live music schedule, and have Alexa add the details to your calendar. And let's not forget Microsoft's <a href="https://blogs.microsoft.com/blog/2025/04/04/your-ai-companion/">personalization of Copilot</a> which remembers what you talk about, so it learns your likes and dislikes and details about your life; the name of your dog, that tricky project at work, what keeps you motivated to stick to your new workout routine.</p>
<p>Despite the widespread utility of LLMs, deploying these sophisticated tools in real-world scenarios poses distinct challenges, especially in managing their complex behaviors. For users such as Site Reliability Engineers (SREs), DevOps teams, and AI/ML engineers, ensuring the reliability, performance, and compliance of these models introduces an additional layer of complexity. This is where the concept of LLM Observability becomes essential. It offers crucial insights into the performance of these models, ensuring that these advanced AI systems operate both effectively and ethically.</p>
<h3>Why LLM Observability Matters and How Elastic Makes It Easy</h3>
<p>LLMs are not just another piece of software; they are sophisticated systems capable of human-like capabilities such as text generation, comprehension, and even coding. But with great power comes a greater need for oversight. The opaque nature of these models can obscure how decisions are made and content is generated. This makes it even more critical to implement robust observability to monitor and troubleshoot issues such as hallucinations, inappropriate content, cost overruns, errors, and performance degradation. By monitoring these models closely, we can safeguard against unexpected outcomes and maintain user trust.</p>
<h3>Real-World Scenarios</h3>
<p>Let's explore real-world scenarios where companies leverage LLM-powered applications to enhance productivity and user experience, and how Elastic's LLM observability solutions monitor critical aspects of these models.</p>
<h4>1. Generative AI for Customer Support</h4>
<p>Companies are increasingly leveraging LLMs and generative AI to enhance customer support, using platforms like Google Vertex AI to host these models efficiently. With the introduction of advanced AI models such as Google's Gemini, which is integrated into Vertex AI, businesses can deploy sophisticated chatbots that manage customer inquiries, from basic questions to complex issues, in real time. These AI systems understand and respond with natural language, offering instant support for issues such as product troubleshooting or managing orders, thus reducing wait times. They also learn from each interaction to continuously improve accuracy. This boosts customer satisfaction and allows human agents to focus on complex tasks, enhancing overall efficiency. AI tools can further empower customer care agents with real-time analytics, sentiment detection, and conversation summarization.</p>
<p>To support use cases like the AI-powered customer support described above, Elastic recently launched LLM observability integrations including support for <a href="https://www.elastic.co/guide/en/integrations/current/gcp_vertexai.html">LLMs hosted on GCP Vertex AI</a>. Customers who wish to monitor foundation models such as Gemini and Imagen hosted on Google Vertex AI can benefit from Elastic’s Vertex AI integration to get a deeper understanding of model behavior and performance, and ensure that their AI-driven tools are not only effective but also reliable. Customers get an out-of-the-box experience ingesting a curated set of metrics from Vertex AI, as well as a pre-configured dashboard.</p>
<p>By continuously tracking these metrics, customers can proactively manage their AI resources, optimize operations, and ultimately enhance the overall customer experience.</p>
<p>Let's look at some of the metrics you get from the Google Vertex AI integration which are helpful in the context of using generative AI for customer support.</p>
<ol>
<li><strong>Prediction Latency</strong>: Measures the time taken to complete predictions, critical for real-time customer interactions.</li>
<li><strong>Error Rate</strong>: Tracks errors in predictions, which is vital for maintaining the accuracy and reliability of AI-driven customer support.</li>
<li><strong>Prediction Count</strong>: Counts the number of predictions made, helping assess the scale of AI usage in customer interactions.</li>
<li><strong>Model Usage</strong>: Tracks how frequently the AI models are accessed by both virtual assistants and customer support tools.</li>
<li><strong>Total Invocations</strong>: Measures the total number of times the AI services are used, providing insights into user engagement and dependency on these tools.</li>
<li><strong>CPU and Memory Utilization</strong>: By observing CPU and memory usage, users can optimize resource allocation, ensuring that the AI tools are running efficiently without overloading the system.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/vertex-overview.png" alt="Vertex Overview" /></p>
<p>To learn more about how Elastic's Google Vertex AI integration can augment your LLM observability, have a quick read of this <a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">blog</a>.</p>
<h4>2. Transforming Healthcare with Generative AI</h4>
<p>The healthcare industry is embracing generative AI to enhance patient interactions and streamline operational workflows. By leveraging platforms like Amazon Bedrock, healthcare organizations deploy advanced large language models (LLMs) to power tools that convert doctor-patient conversations into structured medical notes, reducing administrative overhead and allowing clinicians to prioritize diagnosis and treatment. These AI-driven solutions provide real-time insights, enabling informed decision-making and improving patient outcomes. Additionally, patient-facing applications powered by LLMs offer secure access to health records, empowering individuals to manage their care proactively.</p>
<p>Robust observability is essential to maintain the reliability and performance of these generative AI applications in healthcare. Elastic’s <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Amazon Bedrock integration</a> equips providers with tools to monitor LLM behavior, capturing critical metrics like invocation latency, error rates, token usage and guardrail invocation. Pre-configured dashboards provide visibility into prompt and completion text, enabling teams to verify the accuracy of AI-generated outputs, such as medical notes, and detect issues like hallucinations.</p>
<p>Additionally, customers who configure Guardrails for Amazon Bedrock to filter harmful content like hate speech, personal insults, and other inappropriate topics, can use the Bedrock Integration to observe the prompts and responses that caused the guardrail to filter them out. This helps application developers take proactive actions to maintain a safe and positive user experience.</p>
<p>Some of the logs and metrics that can be helpful for customers using LLMs hosted on Amazon Bedrock include the following:</p>
<ol>
<li><strong>Invocation Details</strong>: This integration records invocation latency, counts, and throttles. These metrics are critical for ensuring that generative AI models respond quickly and accurately to patient queries or appointment scheduling tasks, maintaining a seamless user experience.</li>
<li><strong>Error Rates</strong>: Tracking error rates ensures that AI tools, such as patient query assistants or appointment systems, consistently deliver accurate and reliable results. By identifying and addressing issues early, healthcare providers can maintain trust in AI systems and prevent disruptions in critical patient interactions.</li>
<li><strong>Token Usage</strong>: In healthcare, tracking token usage helps identify resource-intensive queries, such as detailed patient record summaries or complex symptom analyses, ensuring efficient model operation. By monitoring token usage, healthcare providers can optimize costs for AI-powered tools while maintaining scalability to handle growing patient interactions.</li>
<li><strong>Prompt and Completion Text</strong>: Capturing prompt and completion text allows healthcare providers to analyze how AI models respond to specific patient queries or administrative tasks, ensuring meaningful and contextually accurate interactions. This insight helps refine prompts to improve the AI's understanding and ensures that generated responses, such as appointment details or treatment explanations, meet the quality standards expected in healthcare.</li>
<li><strong>Prompt and response where guardrails intervened</strong>: Being able to track requests and responses that were deemed inappropriate by guardrails helps healthcare providers monitor what information patients are asking for. With this information users can make continuous adjustments to the LLMs to ensure appropriate responses, balancing flexibility and rich communication on the one hand, and on the other, privacy protection, hallucination prevention, and harmful content filtering.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/aws-bedrock-overview.png" alt="Bedrock Overview" /></p>
<p>The Amazon Bedrock Guardrails OOTB dashboard:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/amazon-bedrock-gaurdrails.png" alt="Bedrock Guardrails Overview" /></p>
<p>To learn about the Amazon Bedrock Integration, read this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a>. To dive deeper into how the integration can help with observability of Guardrails for Amazon Bedrock, take a look at this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails">blog</a>.</p>
<h4>3.  Enhancing Telco Efficiency with GenAI</h4>
<p>The telecommunication industry can leverage services like Azure OpenAI to transform customer interactions, optimize operations, and enhance service delivery. By integrating advanced generative AI models, telcos can offer highly personalized and responsive customer experiences across multiple channels. AI-powered virtual assistants streamline customer support by automating routine queries and providing accurate, context-aware responses, reducing the workload on human agents and enabling them to focus on complex issues while improving efficiency and satisfaction. Additionally, AI-driven insights help telcos understand customer preferences, anticipate needs, and deliver tailored offerings that boost customer loyalty. Operationally, LLMs such as Azure OpenAI enhance internal processes by enabling smarter knowledge management and faster access to critical information.</p>
<p>Elastic's LLM observability integrations like the <a href="https://www.elastic.co/guide/en/integrations/current/azure_openai.html">Azure OpenAI integration</a> can provide visibility into AI performance and costs, empowering telecom providers to make data-driven decisions and enhance customer engagement. It can help optimize resource allocation by analyzing call patterns, predicting service demands, and identifying trends, enabling telcos to scale their AI operations efficiently while maintaining high service quality.</p>
<p>Some of the key metrics and logs from Azure OpenAI that can provide insights are:</p>
<ol>
<li><strong>Error Counts</strong>: It provides critical insights into failed requests and incomplete transactions, enabling telecom providers to proactively identify and resolve issues in AI-powered applications.</li>
<li><strong>Prompt Input and Completion Text</strong>: This captures the input queries provided to AI systems and the corresponding AI-generated outputs. These fields allow telecom providers to analyze customer queries, monitor response quality, and refine AI training datasets to improve relevance and accuracy.</li>
<li><strong>Response Latency</strong>: It measures the time taken by AI models to generate responses, ensuring that virtual assistants and automated systems deliver quick and efficient replies to customer queries.</li>
<li><strong>Token Usage</strong>: It tracks the number of input and output tokens processed by the AI model, offering insights into resource consumption and cost efficiency. This data helps telecom providers monitor AI usage patterns, optimize configurations, and scale resources effectively.</li>
<li><strong>Content Filter Results</strong>: In Azure OpenAI, this plays a crucial role in handling sensitive inputs provided by customers, ensuring compliance, safety, and responsible AI usage. This feature identifies and flags potentially inappropriate or harmful queries and responses in real time, enabling telecom providers to address sensitive topics with care and accuracy.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/azure-openai-overview.png" alt="Azureopenai Overview" /></p>
<p>The Azure OpenAI content filtering OOTB dashboard:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/azure-openai-contentfiltering.png" alt="Azureopenai Overview1" /></p>
<p>You can learn more about Elastic's Azure OpenAI integration from these two blogs - <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">Part 1</a> and <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">Part 2</a>.</p>
<h4>4. OpenAI Integration for Generative AI Applications</h4>
<p>As AI-powered solutions become integral to modern workflows, OpenAI's sophisticated models, including language models like GPT-4o and GPT-3.5 Turbo, image generation models like DALL·E, and audio processing models like Whisper, drive innovation across applications such as virtual assistants, content creation, and speech-to-text systems. With growing complexity and scale, ensuring these models perform reliably, remain cost-efficient, and adhere to ethical guidelines is paramount. Elastic's <a href="https://www.elastic.co/docs/reference/integrations/openai">OpenAI integration</a> provides a robust solution, offering deep visibility into model behaviour to support seamless and responsible AI deployments.</p>
<p>By tapping into the OpenAI Usage API, Elastic's integration delivers actionable insights through intuitive, pre-configured dashboards, enabling Site Reliability Engineers (SREs) and DevOps teams to monitor performance and optimize resource usage across OpenAI's diverse model portfolio. This unified observability approach empowers organizations to track critical metrics, identify inefficiencies, and maintain high-quality AI-driven experiences. The following key metrics from Elastic's OpenAI integration help organizations achieve effective oversight:</p>
<ol>
<li><strong>Request Latency</strong>: Measures the time taken for OpenAI models to process requests, ensuring responsive performance for real-time applications like chatbots or transcription services.</li>
<li><strong>Invocation Rates</strong>: Tracks the frequency of API calls across models, providing insights into usage patterns and helping identify high-demand workloads.</li>
<li><strong>Token Usage</strong>: Monitors input and output tokens (e.g., prompt, completion, cached tokens) to optimize costs and fine-tune prompts for efficient resource consumption.</li>
<li><strong>Error Counts</strong>: Captures failed requests or incomplete transactions, enabling proactive issue resolution to maintain application reliability.</li>
<li><strong>Image Generation Metrics</strong>: Tracks invocation rates and output dimensions for models like DALL·E, helping assess costs and usage trends in image-based applications.</li>
<li><strong>Audio Transcription Metrics</strong>: Monitors invocation rates and transcribed seconds for audio models like Whisper, supporting cost optimization in speech-to-text workflows.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/openai-overview.png" alt="Openai Overview" /></p>
<p>To learn more about Elastic's OpenAI integration, read this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-openai">blog</a>.</p>
<h4>Actionable LLM Observability</h4>
<p>Elastic's LLM observability integrations empower users to take proactive control of their AI operations through actionable insights and real-time alerts. For instance, by setting a predefined threshold for token count, Elastic can trigger automated alerts when usage exceeds this limit, notifying Site Reliability Engineers (SREs) or DevOps teams via email, Slack, or other preferred channels. This ensures prompt awareness of potential cost overruns or resource-intensive queries, enabling teams to adjust model configurations or scale resources swiftly to maintain operational efficiency.</p>
<p>In the example below, the rule is set to alert the user if token_count crosses a threshold of 500.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/slo-1.png" alt="SLO Overview" /></p>
<p>The alert is triggered when the token count exceeds the threshold, as seen below:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/slo-2.png" alt="SLO Overview1" /></p>
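<p>The threshold logic behind such a rule is simple. Here is a minimal, illustrative Python sketch of the check the rule performs; the 500-token threshold comes from the example above, while the function name and the sample token counts are hypothetical (in practice you would configure this in Elastic's alerting rules rather than hand-roll it):</p>

```python
# Minimal sketch of the threshold check an alert rule performs.
# The 500-token threshold mirrors the example rule above; the sample
# data and function are hypothetical.

def breached(token_counts, threshold=500):
    """Return the token counts that exceed the alert threshold."""
    return [count for count in token_counts if count > threshold]

# Hypothetical token counts from recent LLM invocations.
recent = [120, 480, 510, 950, 60]

over_limit = breached(recent)
if over_limit:
    # In Elastic, this is the point where the rule would notify the team
    # via email, Slack, or another configured channel.
    print(f"ALERT: {len(over_limit)} invocation(s) exceeded 500 tokens: {over_limit}")
```

In a real deployment the notification side is handled by the alert rule's connectors; the sketch only shows which invocations would cross the configured threshold.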
<p>Another example is tracking invocation spikes, such as when the number of predictions or API calls surpasses a defined Service Level Objective (SLO). For example, if a Bedrock AI-hosted model experiences a sudden surge in invocations due to increased customer interactions, Elastic can alert teams to investigate potential anomalies or scale infrastructure accordingly. These proactive measures help maintain the reliability and cost-effectiveness of LLM-powered applications.</p>
<p>By providing pre-configured dashboards and customizable alerts, Elastic ensures that organizations can respond to critical events in real time, keeping their AI systems aligned with cost and performance goals as well as standards for content safety and reliability.</p>
<h4>Conclusion</h4>
<p>LLMs are transforming industries, but their complexity requires effective observability to ensure their reliability and safe use. Elastic's LLM observability integrations provide a comprehensive solution, empowering businesses to monitor performance, manage resources, and address challenges like hallucinations and content safety. As LLMs become increasingly integral to various sectors, robust observability tools like those offered by Elastic ensure that these AI-driven innovations remain dependable, cost-effective, and aligned with ethical and safety standards.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/llmobs2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</link>
            <guid isPermaLink="false">troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</guid>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to immediately troubleshoot Kubernetes pod restarts and OOMKilled events with Elastic Agent Builder. We’ll show how to detect, analyze, and remediate failures.]]></description>
            <content:encoded><![CDATA[<h2>Initial Summary</h2>
<ul>
<li>Detect Kubernetes pod restarts and OOMKill events using Elastic Agent Builder</li>
<li>Analyze CPU and memory pressure using ES|QL over Kubernetes metrics</li>
<li>Generate troubleshooting summaries and remediation guidance</li>
</ul>
<p>This article explains how to use <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> to automatically detect, analyze, and remediate Kubernetes pod failures caused by resource pressure (CPU and memory), with a focus on pods experiencing frequent restarts and OOMKilled events. Elastic Agent Builder lets you quickly create precise agents that utilize all your data with powerful tools (such as ES|QL queries), chat interfaces, and custom agents.</p>
<h2>Introduction: What is the Elastic Agent Builder?</h2>
<p>Elastic has an AI Agent embedded that you can use to get more insights from all of the logs, metrics and traces that you’ve ingested. While that’s great, you can take it one step further and streamline the process by creating tools that the agent can use.</p>
<p>Giving the agent tools means it spends less time ‘thinking’ and quickly gets to assessing what’s important to you. For example, if I have a Kubernetes environment that needs monitoring, and I want to keep an eye on pod restarts and memory and CPU usage without hanging out at the terminal, I can have Elastic alert me if something goes wrong. </p>
<p>Having an alert is great, but how do I get the bigger picture, faster? You need to know what service is having (or creating) the issues, why, and how to fix it.</p>
<h2>Assumptions</h2>
<p>This guide assumes:</p>
<ul>
<li>A running Kubernetes cluster</li>
<li>An Elastic Observability deployment</li>
<li>Kubernetes metrics indexed in Elastic</li>
</ul>
<h2>Step 1: Create a New Elastic Agent</h2>
<p>In Elastic Observability, use the top search bar to search for Agents. Create a new agent.</p>
<p>This agent is going to be the Kubernetes Pod Troubleshooter agent, designed to help users troubleshoot pod restarts, OOMKill terminations and evaluate CPU or memory pressure. </p>
<p>The Kubernetes Pod Troubleshooter agent will:</p>
<ol>
<li>Identify pods that have restarted more than once</li>
<li>Filter for pods that are not in a running state</li>
<li>Retrieve the container termination reason (e.g., OOMKilled)</li>
<li>Analyze CPU and memory utilization for affected services</li>
<li>Flag resource utilization above 60% (warning) and 80% (critical)</li>
<li>Provide remediation recommendations</li>
</ol>
<p>The agent requires instructions to guide how the agent behaves when interacting with tools or responding to queries. This description can set tone, priorities or special behaviours. The instructions below tell the agent to execute the steps outlined above. </p>
<pre><code>You will help users troubleshoot problematic pods by searching the metrics for pods that have restarted more than once and the status is not running. Pods that have the highest number of restarts will be returned to the user.
Once the containers that are not running and have restarted multiple times are found, you will use their container ID or image name to look up the container status reason and reason for the last termination. You will return that reason to the user.
You will also begin basic troubleshooting steps, such as checking for insufficient cluster resources (CPU or memory) from the metrics and tools available.
Any CPU or memory utilization percentages over 60%, and definitely over 80%, should be flagged to the user with remediation steps.
</code></pre>
<p>Getting answers quickly is critical when troubleshooting high-value systems and environments. Using Tools ensures that the workflow is repeatable and that you can trust the results. You also get complete oversight of the process, as the Elastic Agent outlines every step and query that it took and you can explore the results in Discover.</p>
<p>You will create custom tools that the agent will run to complete the Kubernetes troubleshooting tasks that the custom instructions reference, such as: <code>look up the container status reason and reason for the last termination</code> and <code>checking for insufficient cluster resources (CPU or memory)</code>.</p>
<h2>Step 2: Create Tools - Pod Restarts</h2>
<p>The first tool takes the Kubernetes metrics and checks whether a pod has restarted and has a last-terminated reason; if so, the agent presents that information to the user.</p>
<p>This <code>pod-restarts</code> tool uses a custom ES|QL query that interrogates the Kubernetes metrics data coming from OTel.</p>
<p>The ES|QL query:</p>
<ol>
<li>Filters for containers that have restarted and have a reason for termination; then</li>
<li>Calculates the number of restarts; then</li>
<li>Returns the number of restarts and termination reason per service.</li>
</ol>
<pre><code>FROM metrics-k8sclusterreceiver.otel-default
| WHERE metrics.k8s.container.restarts &gt; 0
| WHERE resource.attributes.k8s.container.status.last_terminated_reason IS NOT NULL
| STATS total_restarts = SUM(metrics.k8s.container.restarts),
        reasons = VALUES(resource.attributes.k8s.container.status.last_terminated_reason) 
  BY resource.attributes.service.name
| SORT total_restarts DESC
</code></pre>
<h2>Step 3: Create Tools - Service Memory</h2>
<p>The custom tools can take input variables, which increases the speed and accuracy of the results.</p>
<p>A common reason for pods failing to schedule, or restarting often, is the cluster or nodes being under-resourced. The <code>pod-restarts</code> tool returns services that have many restarts and OOMKill termination reasons, which indicate memory pressure.</p>
<p>The <code>eval-pod-memory</code> tool uses a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Converts memory usage, requests, limits and utilization into megabytes; then</li>
<li>Calculates the average of each of those metrics; then</li>
<li>Groups them into 1 minute groupings and sorts them.</li>
</ol>
<pre><code>FROM metrics-*
| WHERE resource.attributes.service.name == ?servicename
| WHERE @timestamp &gt;= NOW() - 12 hours
| EVAL
   memory_usage_mb = metrics.container.memory.usage / 1024 / 1024,
   memory_request_mb = metrics.k8s.container.memory_request / 1024 / 1024,
   memory_limit_mb = metrics.k8s.container.memory_limit / 1024 / 1024,
   memory_utilization_pct = metrics.k8s.container.memory_limit_utilization * 100
| STATS
   avg_memory_usage = AVG(memory_usage_mb),
   avg_memory_request = AVG(memory_request_mb),
   avg_memory_limit = AVG(memory_limit_mb),
   avg_memory_utilization = AVG(memory_utilization_pct)
   BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
</code></pre>
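<p>For intuition, the bucketed-averages step (<code>STATS ... BY BUCKET(@timestamp, 1 minute)</code>) can be sketched in plain Python. This is only an illustration of the grouping logic; the function name and sample readings are hypothetical:</p>

```python
# Illustrative re-implementation of 1-minute bucketed averages, mirroring
# the STATS ... BY BUCKET(@timestamp, 1 minute) step above. The sample
# readings are hypothetical.
from collections import defaultdict
from datetime import datetime

def minute_buckets(samples):
    """Average memory usage (MB) per 1-minute bucket, sorted by bucket."""
    groups = defaultdict(list)
    for ts, usage_bytes in samples:
        bucket = ts.replace(second=0, microsecond=0)      # truncate to the minute
        groups[bucket].append(usage_bytes / 1024 / 1024)  # bytes -> MB
    return {b: sum(v) / len(v) for b, v in sorted(groups.items())}

samples = [
    (datetime(2026, 2, 25, 10, 0, 5), 52_428_800),   # 50 MB
    (datetime(2026, 2, 25, 10, 0, 35), 62_914_560),  # 60 MB
    (datetime(2026, 2, 25, 10, 1, 10), 57_671_680),  # 55 MB
]
for bucket, avg_mb in minute_buckets(samples).items():
    print(bucket.isoformat(), f"{avg_mb:.1f} MB")  # two buckets, 55.0 MB each
```

The ES|QL engine does this grouping server-side over the metrics index; the sketch only shows what the bucketing and averaging produce.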
<h2>Step 4: Create Tools - Service CPU</h2>
<p>As CPU usage is another common reason for pods to fail scheduling or be stuck in endless restart loops, the next tool will evaluate CPU usage, requests and limits.</p>
<p>The <code>eval-pod-cpu</code> tool uses a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Calculates the average for CPU usage, CPU request utilization and CPU limit utilization.</li>
</ol>
<pre><code>FROM metrics-kubeletstatsreceiver.otel-default
| WHERE k8s.container.name == ?servicename OR resource.attributes.k8s.container.name == ?servicename
| STATS
  avg_cpu_usage = AVG(container.cpu.usage),
  avg_cpu_request_utilization = AVG(k8s.container.cpu_request_utilization) * 100,
  avg_cpu_limit_utilization = AVG(k8s.container.cpu_limit_utilization) * 100
| LIMIT 100
</code></pre>
<h2>Step 5: Assign Tools to Kubernetes Pod Troubleshooter Agent</h2>
<p>Once all of the tools are built you need to assign them to the agent.</p>
<p>This image shows the Kubernetes Pod Troubleshooter agent with the three tools: <code>pod-restarts</code>, <code>eval-pod-cpu</code> and <code>eval-pod-memory</code> assigned to it and active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/kubernetes-pod-troubleshooter.png" alt="kubernetes-pod-troubleshooter" /></p>
<h2>Step 6: Test the Kubernetes Pod Troubleshooter Agent</h2>
<p>To simulate memory pressure, the OpenTelemetry demo is running inside the cluster. Artificially lowering the memory requests and limits and increasing the service load will cause pods to restart.</p>
<p>To do this with the OpenTelemetry demo in your cluster, follow these steps.</p>
<p>Reduce the cart service to one replica by scaling the deployment. Once that is complete, change the resources on the deployment by lowering the memory requests and limits as shown in this command:</p>
<pre><code>kubectl -n otel-demo scale deploy/cart --replicas=1
kubectl -n otel-demo set resources deploy/cart -c cart --requests=memory=50Mi --limits=memory=60Mi
</code></pre>
<p>The OpenTelemetry demo application comes with a load generator. Increase the simulated load on the demo site by modifying the users and spawn rate in the load-generator deployment, as shown in this command:</p>
<pre><code>kubectl -n otel-demo set env deploy/load-generator LOCUST_USERS=800 LOCUST_SPAWN_RATE=200 LOCUST_BROWSER_TRAFFIC_ENABLED=false
</code></pre>
<p>If you list all of your pods in the cluster or namespace, you should begin to see restarts.</p>
<p>You can now chat with the Kubernetes Pod Troubleshooter agent and ask “Are any of my Kubernetes pods having issues?”.</p>
<p>The screenshot shows the final response from the Kubernetes Pod Troubleshooter agent. It provides a problem summary of its findings from each tool, showing which services were experiencing the most restarts and memory and CPU utilization. </p>
<p>The threshold interpretations were described in the initial agent instructions, where &gt;60% utilization is a warning (sustained pressure) and &gt;80% utilization is critical (high likelihood of restarts or throttling). This aligns with findings presented by the Kubernetes Pod Troubleshooter agent, where the services that had the highest restarts were all above 90% memory utilization. The agent needs clearly defined threshold values to correctly assess the returned memory and CPU utilization values. </p>
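<p>As a minimal illustration, the threshold interpretation given to the agent can be expressed as a simple classification; the 60% and 80% cutoffs come from the agent instructions above, while the function name and the per-service sample values are hypothetical:</p>

```python
# Illustrative classification of utilization values using the thresholds
# from the agent instructions above (>60% warning, >80% critical).
# The function name and sample data are hypothetical.

def classify_utilization(pct):
    """Map a utilization percentage to the severity the agent reports."""
    if pct > 80:
        return "critical"   # high likelihood of restarts or throttling
    if pct > 60:
        return "warning"    # sustained pressure
    return "ok"

# Hypothetical per-service memory utilization percentages.
services = {"cart": 93.0, "checkout": 72.5, "frontend": 41.0}
for name, pct in services.items():
    print(f"{name}: {pct:.1f}% -> {classify_utilization(pct)}")
```

The agent applies this interpretation to the utilization values returned by the <code>eval-pod-memory</code> and <code>eval-pod-cpu</code> tools, which is why stating the thresholds explicitly in the instructions matters.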
<p>Problem summary returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/problem-summary-by-Kubernetes.png" alt="problem summary by Kubernetes" /></p>
<h2>Conclusion and Final Thoughts</h2>
<p>Elastic Agent Builder enables fast, repeatable Kubernetes troubleshooting by combining ES|QL-driven analysis with constrained AI reasoning.</p>
<p>Creating custom tools with specific ES|QL queries, chained so that downstream queries take input variables from the output of previous tools, eliminates or reduces error propagation and hallucinations. With generic AI troubleshooting that lacks purpose-built tools, you run the risk of the agent analyzing too many services that aren’t relevant to the issue at hand. This slows down the thinking process and generates longer responses, increasing the likelihood of error propagation and hallucinations.</p>
<p>With the Elastic Agent Builder, you can inspect the output of every tool to explore and verify the results.</p>
<p>Having a succinct problem summary is a game-changer, bringing your attention straight to the most affected services.</p>
<p>Reasoning returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/return-pod-troubleshooter-agent.png" alt="summary-returned-kubernetes-pod-troubleshooter" /></p>
<p>Not only that, but the agent can go one step further and offer recommendations for remediation based on what outputs the tools delivered.</p>
<p>Remediation recommendation returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/remediation-recommendation-kubernetes-pod-troubleshooter.png" alt="remediation-recommendation-kubernetes-pod-troubleshooter" /></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your Kubernetes clusters.</p>
<h2>Frequently Asked Questions</h2>
<p><strong>1. When should I use Elastic Agent Builder for troubleshooting?</strong></p>
<p>Elastic Agent Builder works best for troubleshooting if:</p>
<ul>
<li>
<p>You need repeatable, auditable troubleshooting workflows</p>
</li>
<li>
<p>You want deterministic analysis instead of free-form AI responses</p>
</li>
<li>
<p>You’re investigating something that is reported in the logs or metrics (e.g., pod restarts, OOMKills, or resource pressure)</p>
</li>
<li>
<p>You want to reduce mean time to resolution (MTTR)</p>
</li>
</ul>
<p><strong>2. Do I need OpenTelemetry to use Elastic Agent Builder for Kubernetes troubleshooting?</strong> </p>
<p>No, you don’t need to use OpenTelemetry. You have two options:</p>
<ul>
<li>
<p>You can collect logs and metrics from Kubernetes using the Elastic Agent; or </p>
</li>
<li>
<p>You can collect logs, traces and metrics with the Elastic Distro for OTel (EDOT) Collector</p>
</li>
</ul>
<p>Your choice of collector changes the field names used in the tools above. For example, Elastic Agent uses <code>kubernetes.container.memory.usage.bytes</code>, while the EDOT Collector uses <code>metrics.container.memory.usage</code>.</p>
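<p>For illustration, the same kind of filter can be phrased against either schema. The following is a sketch rather than a query from this guide: the <code>metrics-*</code> index pattern is an assumption, and only the two field names come from the example above, so verify both against your own index mappings before adapting a tool.</p>
<pre><code class="language-esql">// Elastic Agent (ECS-style field)
FROM metrics-*
| WHERE kubernetes.container.memory.usage.bytes IS NOT NULL
| LIMIT 10

// EDOT Collector (OTel-style field)
FROM metrics-*
| WHERE metrics.container.memory.usage IS NOT NULL
| LIMIT 10
</code></pre>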
<p><strong>3. Can this agent be adapted for node-level failures?</strong> </p>
<p>Yes, Elastic has hundreds of <a href="https://www.elastic.co/docs/reference/fleet#integrations">integrations</a>, including AWS (for EKS), Azure (for AKS), Google Cloud (for GKE), as well as host operating system monitoring.</p>
<p>The queries shown above would be modified to use the correct field.</p>
<p><strong>4. Can these tools be reused in automation workflows?</strong> </p>
<p>Yes, <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> can reuse the same scripted automations and AI agents you build in Elastic. An agent can handle the initial analysis and investigation (reducing manual effort), and the workflow can continue with structured steps, such as running Elasticsearch queries, transforming data, branching on conditions and calling external APIs or tools like Slack, Jira and PagerDuty. Workflows can also be exposed to Agent Builder as reusable tools, just like the tool created in this guide.</p>
<p>For more advanced automation from a similar scenario as described in this guide, learn how to <a href="https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server">integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability</a>.</p>
<p><strong>5. Can these tools be triggered by alerts?</strong> </p>
<p>Yes, alerts can trigger <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> and pass the alert context to the workflow. This workflow can be integrated with an Agent Builder agent, as described above.</p>
<p>Additionally, Elastic Alerts let you publish investigation guides alongside alerts so that an SRE has all of the information they need to begin investigating. Any troubleshooting or investigative agents can be linked from the investigation guide, so instead of following the manual processes the guide outlines, the SRE can let the agent handle the repetitive investigative work.</p>
<p><strong>6. How can I get started with Agent Builder?</strong></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a new fully managed, stateless architecture that auto-scales to meet your data, usage, and performance needs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Two sides of the same coin: Uniting testing and monitoring with Synthetic Monitoring]]></title>
            <link>https://www.elastic.co/observability-labs/blog/testing-monitoring-synthetic-monitoring</link>
            <guid isPermaLink="false">testing-monitoring-synthetic-monitoring</guid>
            <pubDate>Mon, 06 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[DevOps aims to establish complementary practices across development and operations. See how Playwright, @elastic/synthetics, GitHub Actions, and Elastic Synthetics can unite development and SRE teams in validating and monitoring the user experience.]]></description>
            <content:encoded><![CDATA[<p>Historically, software development and SRE have worked in silos with different cultural perspectives and priorities. The goal of DevOps is to establish common and complementary practices across software development and operations. However, for some organizations true collaboration is rare, and we still have a way to go to build effective DevOps partnerships.</p>
<p>Outside of cultural challenges, one of the most common reasons for this disconnect is using different tools to achieve similar goals — case in point, end-to-end (e2e) testing versus <a href="https://www.elastic.co/observability/synthetic-monitoring">synthetic monitoring</a>.</p>
<p>This blog shares an overview of these techniques. Using the example repository <a href="https://github.com/carlyrichmond/synthetics-replicator">carlyrichmond/synthetics-replicator</a>, we’ll also show how Playwright, @elastic/synthetics, and GitHub Actions can combine forces with Elastic Synthetics and the recorder to unite development and SRE teams in validating and monitoring the user experience for a simple web application hosted on a provider such as <a href="https://www.netlify.com/">Netlify</a>.</p>
<p>Elastic recently <a href="https://www.elastic.co/blog/new-synthetic-monitoring-observability">introduced synthetics monitoring</a>, and <a href="https://www.elastic.co/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">as highlighted in our prior blog</a>, it can replace e2e tests altogether. Uniting around a single tool to validate the user workflow early provides a common language to recreate user issues to validate fixes against.</p>
<h2>Synthetics Monitoring versus e2e tests</h2>
<p>If development and operations tools are at war, it’s difficult to unify their different cultures. Considering the definitions of these approaches shows that they in fact aim to achieve the same objective.</p>
<p>e2e tests are a suite of tests that recreate the user path, including clicks, user text entry, and navigations. Although many argue it’s about testing the integration of the layers of a software application, it’s the user workflow that e2e tests emulate. Meanwhile, Synthetic Monitoring, specifically a subset known as browser monitoring, is an application performance monitoring practice that emulates the user path through an application.</p>
<p>Both these techniques emulate the user path. If we use tooling that crosses the developer and operational divide, we can work together to build tests that can also provide production monitoring in our web applications.</p>
<h2>Creating user journeys</h2>
<p>When a new user workflow, or set of features that accomplish a key goal, is under development in our application, developers can use @elastic/synthetics to create user journeys. The initial project scaffolding can be generated using the init utility once installed, as in the below example. Note that Node.js must be installed prior to using this utility.</p>
<pre><code class="language-bash">npm install -g @elastic/synthetics
npx @elastic/synthetics init synthetics-replicator-tests
</code></pre>
<p>Before commencing the wizard, make sure you have your Elastic cluster information to hand and the Elastic Synthetics integration set up on your cluster. You will need:</p>
<ol>
<li>Monitor Management enabled within the Elastic Synthetics app, as per the prerequisites in the <a href="https://www.elastic.co/guide/en/observability/8.8/synthetics-get-started-project.html#_prerequisites">getting started documentation</a>.</li>
<li>The Cloud ID of your Elastic Cloud cluster if using Elastic Cloud. Alternatively, if you are hosting on-prem, your Kibana endpoint.</li>
<li>An API key generated from your cluster. There is a shortcut in the Synthetics application Settings to generate this key under the Project API Keys tab, as shown <a href="https://www.elastic.co/guide/en/observability/current/synthetics-get-started-project.html#synthetics-get-started-project-init">in the documentation</a>.</li>
</ol>
<p>This wizard will take you through and generate a sample project containing configuration and example monitor journeys, with a structure similar to the below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-replicator-tests.png" alt="synthetics replicator tests" /></p>
<p>For web developers, most of the elements, such as the README, package.json, and lock files, will be familiar. The main configuration for your monitors is available in <code>synthetics.config.ts</code>, as shown below. This configuration can be amended to include production- and development-specific settings, which is essential for combining forces: any journey can be reused as both an e2e test and a production monitor. Although not in this example, details of <a href="https://www.elastic.co/guide/en/observability/current/synthetics-private-location.html">private locations</a> can be included if you would prefer to monitor from your own dedicated Elastic instance rather than from Elastic infrastructure.</p>
<pre><code class="language-javascript">import type { SyntheticsConfig } from &quot;@elastic/synthetics&quot;;

export default (env) =&gt; {
  const config: SyntheticsConfig = {
    params: {
      url: &quot;http://localhost:5173&quot;,
    },
    playwrightOptions: {
      ignoreHTTPSErrors: false,
    },
    /**
     * Configure global monitor settings
     */
    monitor: {
      schedule: 10,
      locations: [&quot;united_kingdom&quot;],
      privateLocations: [],
    },
    /**
     * Project monitors settings
     */
    project: {
      id: &quot;synthetics-replicator-tests&quot;,
      url: &quot;https://elastic-deployment:port&quot;,
      space: &quot;default&quot;,
    },
  };
  if (env === &quot;production&quot;) {
    config.params = { url: &quot;https://synthetics-replicator.netlify.app/&quot; };
  }
  return config;
};
</code></pre>
<h2>Writing your first journey</h2>
<p>Although the above configuration applies to all monitors in the project, it can be overridden for a given test.</p>
<pre><code class="language-javascript">import { journey, step, monitor, expect, before } from &quot;@elastic/synthetics&quot;;

journey(&quot;Replicator Order Journey&quot;, ({ page, params }) =&gt; {
  // Only relevant for the push command to create
  // monitors in Kibana
  monitor.use({
    id: &quot;synthetics-replicator-monitor&quot;,
    schedule: 10,
  });

  // journey steps go here
});
</code></pre>
<p>The @elastic/synthetics wrapper exposes many <a href="https://www.elastic.co/guide/en/observability/current/synthetics-create-test.html#synthetics-syntax">standard test methods</a>, such as the before and after constructs that allow for setup and teardown of typical properties in the tests, as well as support for many common assertion helper methods. A full list of supported expect methods is available in the <a href="https://www.elastic.co/guide/en/observability/current/synthetics-create-test.html#synthetics-assertions-methods">documentation</a>. The Playwright page object is also exposed, which enables us to perform <a href="https://playwright.dev/docs/api/class-page">all the expected activities provided in the API</a>, such as locating page elements and simulating user events like the clicks depicted in the below example.</p>
<pre><code class="language-javascript">import { journey, step, monitor, expect, before } from &quot;@elastic/synthetics&quot;;

journey(&quot;Replicator Order Journey&quot;, ({ page, params }) =&gt; {
  // monitor configuration goes here

  before(async () =&gt; {
    await page.goto(params.url);
  });

  step(&quot;assert home page loads&quot;, async () =&gt; {
    const header = await page.locator(&quot;h1&quot;);
    expect(await header.textContent()).toBe(&quot;Replicatr&quot;);
  });

  step(&quot;assert move to order page&quot;, async () =&gt; {
    const orderButton = await page.locator(&quot;data-testid=order-button&quot;);
    await orderButton.click();

    const url = page.url();
    expect(url).toContain(&quot;/order&quot;);

    const menuTiles = await page.locator(&quot;data-testid=menu-item-card&quot;);
    expect(await menuTiles.count()).toBeGreaterThan(2);
  });

  // other steps go here
});
</code></pre>
<p>As you can see in the above example, it also exposes the journey and step constructs. These constructs mirror the behavior-driven development (BDD) practice of showing the user journey through the application in tests.</p>
<p>Developers are able to execute the tests against a locally running application as part of their feature development to see successful and failed steps in the user workflow. In the below example, the local server startup command is outlined in blue at the top. The monitor execution command is presented in red further down.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-replicator-npm-start.png" alt="" /></p>
<p>As you can see from the green ticks next to each journey step, each of our tests pass. Woo!</p>
<h2>Gating your CI pipelines</h2>
<p>It’s important to use the execution of the monitors within your CI pipeline as a gate for merging code changes and uploading the new version of your monitors. Each of the jobs in our <a href="https://github.com/carlyrichmond/synthetics-replicator/blob/main/.github/workflows/push-build-test-synthetics-replicator.yml">GitHub Actions workflow</a> will be discussed in this and the subsequent section.</p>
<p>The test job spins up a test instance and runs our user journeys to validate our changes, as illustrated below. This step should run for pull requests to validate developer changes, as well as on push.</p>
<pre><code class="language-yaml">jobs:
  test:
    env:
      NODE_ENV: development
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      - run: npm install
      - run: npm start &amp;
      - run: &quot;npm install @elastic/synthetics &amp;&amp; SYNTHETICS_JUNIT_FILE='junit-synthetics.xml' npx @elastic/synthetics . --reporter=junit&quot;
        working-directory: ./apps/synthetics-replicator-tests/journeys
      - name: Publish Unit Test Results
        uses: EnricoMi/publish-unit-test-result-action@v2
        if: always()
        with:
          junit_files: &quot;**/junit-*.xml&quot;
          check_name: Elastic Synthetics Tests
</code></pre>
<p>Note that, unlike the journey execution on our local machine, we make use of the --reporter=junit option when executing npx @elastic/synthetics to provide visibility of our passing, or sadly sometimes failing, journeys to the CI job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-tests.png" alt="" /></p>
<h2>Automatically upload monitors</h2>
<p>To ensure the latest monitors are available in Elastic Uptime, it’s advisable to push the monitors programmatically as part of the CI workflow, as the example below does. Our workflow has a second job, push, shown below, which uploads your monitors to your cluster and depends on the successful execution of our test job. Note that this job is configured in our workflow to run on push, to ensure changes have been validated rather than just raised within a pull request.</p>
<pre><code class="language-yaml">jobs:
  test: …
  push:
    env:
      NODE_ENV: production
      SYNTHETICS_API_KEY: ${{ secrets.SYNTHETICS_API_KEY }}
    needs: test
    defaults:
      run:
        working-directory: ./apps/synthetics-replicator-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      - run: npm install
      - run: npm run push
</code></pre>
<p>The @elastic/synthetics init wizard generates a push command for you when you create your project, which can be triggered from the project folder. This is shown below through the steps and <code>working-directory</code> configuration. The push command requires the API key from your Elastic cluster, which should be stored as a secret within a trusted vault and referenced via a workflow environment variable. It is also vital that monitors pass ahead of pushing the updated monitor configuration to your Elastic Synthetics instance, to avoid breaking your production monitoring. Unlike e2e tests running against a testing environment, broken monitors impact SRE activities, so any changes need to be validated. For that reason, applying a dependency on your test job via the <code>needs</code> option is recommended.</p>
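<p>For reference, the scaffolded project wires this up as an npm script similar to the following (a sketch; the exact script the wizard generates may differ between versions):</p>
<pre><code class="language-json">{
  &quot;scripts&quot;: {
    &quot;push&quot;: &quot;npx @elastic/synthetics push&quot;
  }
}
</code></pre>
<p>In CI, <code>npm run push</code> then reads the project id and Kibana URL from <code>synthetics.config.ts</code> and authenticates using the <code>SYNTHETICS_API_KEY</code> environment variable set by the workflow.</p>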
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-push-build-test-synthetics-replicator.png" alt="" /></p>
<h2>Monitoring using Elastic Synthetics</h2>
<p>Once monitors have been uploaded, they give a regular checkpoint to SRE teams as to whether the user workflow is functioning as intended — not just because they will run on a regular schedule as configured for the project and individual tests as shown previously, but also due to the ability to check the state of all monitor runs and execute them on demand.</p>
<p>The Monitors Overview tab gives us an immediate view of the status of all configured monitors, as well as the ability to run the monitor manually via the card ellipsis menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-monitors.png" alt="elastic observability monitors" /></p>
<p>From the Monitor screen, we can also navigate to an overview of an individual monitor execution to investigate failures.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-test-run-details.png" alt="test run details" /></p>
<p>The other monitoring superpower SREs now have is the integration of these monitors with familiar tools SREs already use to scrutinize the performance and availability of applications, such as APM, metrics, and logs. The aptly named <strong>Investigate</strong> menu allows easy navigation while SREs are investigating potential failures or bottlenecks.</p>
<p>There is also a balance between finding issues and being notified of potential problems automatically. SREs already familiar with setting rules and thresholds for notification of issues will be happy to know that this is also possible for browser monitors. The editing of an example rule is shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-rules.png" alt="elastic observability rules" /></p>
<p>The status of browser monitors can be configured not only to consider if any individual or collective monitors have been down several times, such as in the status check above, but also to gauge the overall availability by looking at the percentage of passed checks within a given time period. SREs are not only interested in reacting to issues in a traditional production management way — they want to improve the availability of applications, too.</p>
<h2>Recording user workflows</h2>
<p>The limitation of generating e2e tests through the development lifecycle is that sometimes teams miss things, and the prior toolset is geared toward development teams. Despite the best intentions to design an intuitive product using multi-discipline teams, users may use applications in unintended ways. Furthermore, the monitors written by developers will only cover those expected workflows and raise the alarm either when these monitors fail in production or when they start to behave differently if anomaly detection is applied to them.</p>
<p>When user issues arise, it’s useful to recreate that problem in the same format as our monitors. It’s also important to leverage the experience of SREs in generating user journeys, as they will consider failure cases intuitively where developers may struggle and focus on happy cases. However, not all SREs will have the experience or confidence to write these journeys using Playwright and @elastic/synthetics.</p>
&lt;Video vidyardUuid=&quot;NnJFuY5mpCdUNfLJSMAma3&quot; /&gt;
<p>Enter the Elastic Synthetics Recorder! The above video gives a walkthrough of how it can be used to record the steps in a user journey and export them to a JavaScript file for inclusion in your monitor project. This is useful for feeding back into the development phase and for testing the fixes developed to solve the problem. None of this works unless we all combine forces and use these monitors together.</p>
<h2>Try it out!</h2>
<p>As of 8.8, @elastic/synthetics and the Elastic Synthetics app are generally available, and the trusty recorder is in beta. Share your experiences of bridging the developer and operations divide with Synthetic Monitoring via the <a href="https://discuss.elastic.co/c/observability/uptime/75">Uptime category</a> in the Community Discuss forums or via <a href="https://ela.st/slack">Slack</a>.</p>
<p>Happy monitoring!</p>
<p><em>Originally published February 6, 2023; updated May 23, 2023.</em></p>
<blockquote>
<ol>
<li><a href="https://www.elastic.co/observability-labs/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">Why and how to replace end-to-end tests with synthetic monitors</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/monitor-uptime-synthetics.html#monitor-uptime-synthetics">Uptime and Synthetic Monitoring</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/synthetics-journeys.html">Scripting browser monitors</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/synthetics-recorder.html">Use the Synthetics Recorder</a></li>
<li><a href="https://playwright.dev/">Playwright</a></li>
<li><a href="https://docs.github.com/en/actions">GitHub Actions</a></li>
</ol>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/digital-experience-monitoring.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Universal Profiling: Detecting CO2 and energy efficiency]]></title>
            <link>https://www.elastic.co/observability-labs/blog/universal-profiling-detecting-co2-energy-efficiency</link>
            <guid isPermaLink="false">universal-profiling-detecting-co2-energy-efficiency</guid>
            <pubDate>Mon, 05 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Universal Profiling introduces the possibility to capture environmental impact. In this post, we compare Python and Go implementations and showcase the substantial CO2 savings achieved through code optimization.]]></description>
            <content:encoded><![CDATA[<p>A while ago, we posted a <a href="https://www.elastic.co/blog/importing-chess-games-elasticsearch-universal-profiling">blog</a> that detailed how we imported over 4 billion chess games with speed using Python and optimized the code leveraging our Universal Profiling&lt;sup&gt;TM&lt;/sup&gt;. This was based on Elastic Stack running on version 8.9. We are now on <a href="https://www.elastic.co/blog/whats-new-elastic-8-12-0">8.12</a>, and it is time to do a second part that shows how easy it is to observe compiled languages and how Elastic®’s Universal Profiling can help you determine the benefit of a rewrite, both from a cost and environmental friendliness angle.</p>
<h2>Why efficiency matters — for you and the environment</h2>
<p>Data centers are estimated to consume ~3% of global electricity, and their usage is expected to double by 2030.* The cost of a digital service is a close proxy to its computing efficiency, and thus, being more efficient is a win-win: less energy consumed, smaller bill.</p>
<p>At the same time, companies want the ability to scale to more users while spending less per user, and they are actively looking for ways to reduce their energy consumption.</p>
<p>In this spirit, <a href="https://www.elastic.co/observability/universal-profiling">Universal Profiling</a> comes equipped with data and visualizations to help determine where efficiency improvement efforts are worth the most.</p>
<p><a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Energy efficiency</a> measures how much a digital service consumes to produce an output given an input. It can be measured in multiple ways, and we at Elastic Observability chose CO<sub>2</sub> emissions and annualized CO<sub>2</sub> emissions (more details on them later).</p>
<p>Let’s take the example of an e-commerce website: the energy efficiency of the “search inventory” process could be calculated as the average CPU time needed to serve a user request. Once the baseline for this value is determined, changes to the software delivering the search process may result in more or less CPU time consumed for the same feature, resulting in less or more efficient code.</p>
<h2>How to set up and configure wattage and CO2</h2>
<p>You can find a “Settings” button in the top-right corner of the Universal Profiling views. From there, you can customize the coefficients used to calculate the CO<sub>2</sub> emissions tied to profiling data.</p>
<p>The values set here will be used only when the profiles gathered from host agents are not already associated with publicly known data certified by cloud providers. For example, suppose you have a hybrid cloud deployment with a portion of your workload running on-premise and a portion running in GCP. In that case, the values set here will only be used to calculate the CO<sub>2</sub> emissions for the on-premise machines; we already use the coefficients declared by GCP to calculate the emissions of those machines.</p>
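<p>To give a sense of how such a coefficient-based estimate fits together, the general shape of the calculation looks roughly like this (an illustration with placeholder names, not Elastic’s exact formula):</p>
<pre><code>annualized_co2_kg ≈ (cpu_core_seconds_per_year / 3600)  // CPU core-hours
                    * (per_core_watts / 1000)           // converts to kilowatt-hours
                    * datacenter_pue                    // power usage effectiveness
                    * kg_co2_per_kwh                    // regional carbon intensity
</code></pre>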
<h2>Python vs. Go</h2>
<p>Our first <a href="https://www.elastic.co/blog/importing-chess-games-elasticsearch-universal-profiling">blog post</a> implemented a solution in Python to read PGN chess games, a text-based game format. It showed how Universal Profiler can be leveraged to identify slow functions and help you rewrite your code to be faster and more efficient. At the end of it, we were happy with the Python version. It is still used today to grab the monthly updates from the <a href="https://database.lichess.org/">Lichess database</a> and ingest them into Elasticsearch®. I always wanted a reason to work more with Go, so we rewrote the Python implementation in Go. We leveraged goroutines and channels to send data through message passing. You can see more about it in our <a href="https://github.com/philippkahr/blogs/tree/main/universal-profiling">GitHub repository</a>.</p>
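<p>The fan-out pattern behind that rewrite can be sketched as follows. This is a minimal, self-contained illustration of goroutines and channels, not the actual code from the repository; <code>countMoves</code> is a hypothetical stand-in for real PGN parsing and indexing.</p>
<pre><code class="language-go">package main

import (
  &quot;fmt&quot;
  &quot;strings&quot;
  &quot;sync&quot;
)

// countMoves is a stand-in for real PGN parsing: it counts move tokens
// in the movetext, skipping tag-pair lines such as [Event &quot;...&quot;].
func countMoves(game string) int {
  n := 0
  for _, line := range strings.Split(game, &quot;\n&quot;) {
    if line == &quot;&quot; || strings.HasPrefix(line, &quot;[&quot;) {
      continue
    }
    n += len(strings.Fields(line))
  }
  return n
}

func main() {
  games := []string{
    &quot;[Event \&quot;Casual\&quot;]\n1. e4 e5 2. Nf3&quot;,
    &quot;[Event \&quot;Rated\&quot;]\n1. d4 d5&quot;,
  }

  in := make(chan string)
  out := make(chan int)

  // Fan out: a small pool of worker goroutines parses games concurrently.
  var wg sync.WaitGroup
  for w := 0; w &lt; 4; w++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for g := range in {
        out &lt;- countMoves(g)
      }
    }()
  }

  // Close the output channel once every worker has finished.
  go func() {
    wg.Wait()
    close(out)
  }()

  // Producer: stream games into the pipeline via message passing.
  go func() {
    for _, g := range games {
      in &lt;- g
    }
    close(in)
  }()

  total := 0
  for n := range out {
    total += n
  }
  fmt.Println(&quot;parsed move tokens:&quot;, total) // prints &quot;parsed move tokens: 8&quot;
}
</code></pre>
<p>The real importer would replace <code>countMoves</code> with PGN parsing and a bulk-indexing step; the point here is only the shape of the pipeline.</p>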
<p>Rewriting in Go also means switching from an interpreted language to a compiled one. As with everything in IT, this has benefits as well as disadvantages. One disadvantage is that we must ship debug symbols for the compiled binary. When we build the binary, we can use the symbtool program to ship the debug symbols. Without debug symbols, we see uninterpretable information as frames will be labeled with hexadecimal addresses in the flame graph rather than source code annotations.</p>
<p>First, make sure that your executable includes debug symbols. By default, Go builds with debug symbols. You can check this by running <code>file yourbinary</code>; the important part is that the output says it is not stripped.</p>
<pre><code class="language-bash">file lichess
lichess: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=gufIkqA61WnCh8haeW-2/lfn3ne3U_y8MGoFD4AvT/QJEykzbacbYEmEQpXH6U/MqVbk-402n1k3B8yPB6I, with debug_info, not stripped
</code></pre>
<p>Now we need to push the symbols using symbtool. You must create an Elasticsearch API key as the authentication method. In the Universal Profiler UI in Kibana®, an <strong>Add Data</strong> button in the top right corner will tell you exactly what to do. The command looks like the following; the <code>-e</code> flag is where you pass the path of your executable file, in our case <code>lichess</code>, as above.</p>
<pre><code class="language-bash">symbtool push-symbols executable -t &quot;ApiKey&quot; -u &quot;elasticsearch-url&quot; -e &quot;lichess&quot;
</code></pre>
<p>Now that debug symbols are available inside the cluster, we can run both implementations with the same file simultaneously and see what Universal Profiler can tell us about it.</p>
<h2>Identifying CO2 and energy efficiency savings</h2>
<p>Python is more frequently scheduled on the CPU. Thus, it runs more often on the hardware and contributes more to the machines’ resource usage.</p>
<p>We use the differential flame graph to identify and automatically calculate the difference in the following comparison. Filter on <code>process.thread.name: "python3.11"</code> for the baseline, and filter on <code>lichess</code> for the comparison.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/1-elastic-blog-uni-profiling.png" alt="1 - universal profiling" /></p>
<p>Looking at the impact of annualized CO<sub>2</sub> emissions, we see a decrease from 65.32kg of CO<sub>2</sub> for the Python solution to 16.78kg. That is a difference of 48.54kg of CO<sub>2</sub> saved over a year.</p>
<p>If we take a step back, we’ll want to figure out why Python produces so many more emissions. In the flame graph view, we filter down to just Python and can click on the first frame, called python3.11. A little popup tells us that it caused 32.95kg of emissions. That is nearly 50% of all emissions, caused by the runtime alone. Our program itself caused the other ~32kg of CO<sub>2</sub>. We immediately cut ~32kg of annual emissions by replacing the Python interpreter with Go.</p>
<p>We can lock that popup in place with a right click and then click <strong>Show more information</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/2-elastic-blog-uni-profiling.png" alt="2 - universal profiling graphs blue-orange" /></p>
<p>The <strong>Show more information</strong> link displays detailed information about the frame, like sample count, total CPU, core seconds, and dollar costs. We won’t go into more detail in this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/3-elastic-blog-uni-profiling.png" alt="3 impact estimates" /></p>
<h2>Reduce your carbon footprint today with Universal Profiling</h2>
<p>This blog post demonstrates that rewriting your code base can reduce your carbon footprint immensely. Using Universal Profiling, you could run a quick PoC to show how much carbon could be saved.</p>
<p>Learn how you can <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">get started</a> with Elastic Universal Profiling today.</p>
<blockquote>
<ul>
<li>Cluster for storing the data where three nodes, each 64GB RAM and 32 CPU cores, are running GCP on Elastic Cloud.</li>
<li>The machine for sending the data is a GCP e2-standard-32, thus 128GB RAM and 32 CPU cores with a 500GB balanced disk to read the games from.</li>
<li>The file used for the games is this <a href="https://database.lichess.org/standard/lichess_db_standard_rated_2023-12.pgn.zst">Lichess database</a> containing 96,909,211 games. The extracted file size is 211GB.</li>
</ul>
</blockquote>
<p><strong>Source:</strong></p>
<p><a href="https://media.ccc.de/v/camp2023-57070-energy_consumption_of_data_centers">https://media.ccc.de/v/camp2023-57070-energy_consumption_of_data_centers</a></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/141935_-_Blog_header_image-_Op1_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Combining Elastic Universal Profiling with Java APM Services and Traces]]></title>
            <link>https://www.elastic.co/observability-labs/blog/universal-profiling-with-java-apm-services-traces</link>
            <guid isPermaLink="false">universal-profiling-with-java-apm-services-traces</guid>
            <pubDate>Thu, 20 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to combine the power of Elastic universal profiling with APM data from Java services to easily pinpoint CPU bottlenecks. Compatible with both OpenTelemetry and the classic Elastic APM Agent.]]></description>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation">a previous blog post</a>, we introduced the technical details of how we managed to correlate eBPF profiling data with APM traces.
This time, we'll show you how to get this feature up and running to pinpoint CPU bottlenecks in your Java services! The correlation is supported for both OpenTelemetry and the classic Elastic APM Agent. We'll show you how to enable it for both.</p>
<h2>Demo Application</h2>
<p>For this blog post, we’ll be using the <a href="https://github.com/JonasKunz/cpu-burner">cpu-burner demo application</a> to showcase the correlation capabilities of APM, tracing, and profiling in Elastic. This application was built to continuously execute several CPU-intensive tasks:</p>
<ul>
<li>It computes Fibonacci numbers using the naive, recursive algorithm.</li>
<li>It hashes random data with the SHA-2 and SHA-3 hashing algorithms.</li>
<li>It performs numerous large background allocations to stress the garbage collector.</li>
</ul>
<p>The computations of the Fibonacci numbers and the hashing will each be visible as transactions in Elastic: They have been manually instrumented using the OpenTelemetry API.</p>
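<p>To give a concrete idea of what this looks like, here is a minimal, dependency-free sketch of the Fibonacci task (the actual code in the cpu-burner repository may differ; the OpenTelemetry span calls are indicated in comments so the sketch stays self-contained):</p>

```java
public class Fib {
    // The naive, recursive algorithm: deliberately exponential in n,
    // which makes it a good CPU burner for profiling demos.
    static long fib(int n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // In the demo, a call like this is wrapped in a manually created span,
        // roughly: Span span = tracer.spanBuilder("fibonacci").startSpan();
        //          try { fib(n); } finally { span.end(); }
        System.out.println(fib(30)); // 832040
    }
}
```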
<h2>Setting up Profiling and APM</h2>
<p>First, we’ll need to set up the universal profiling host agent on the host where the demo application will run. Starting from version 8.14.0, correlation with APM data is supported and enabled out of the box for the profiler. There is no special configuration needed; we can just follow the <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">standard setup guide</a>.
Note that at the time of writing, universal profiling only supports Linux.
On Windows, you'll have to use a VM to try the demo.
On macOS, you can use <a href="https://github.com/abiosoft/colima">colima</a> as docker engine and run the profiling host agent and the demo app in container images.</p>
<p>In addition, we’ll need to instrument our demo application with an APM agent. We can either use the <a href="https://github.com/elastic/apm-agent-java">classic Elastic APM agent</a> or the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Distribution</a>.</p>
<h3>Using the Classic Elastic APM Agent</h3>
<p>Starting with version 1.50.0, the classic Elastic APM agent ships with the capability to correlate the traces it captures with the profiling data from universal profiling. We’ll just need to enable it explicitly via the <strong>universal_profiling_integration_enabled</strong> config option. Here is the standard command line for running the demo application with the setting enabled:</p>
<pre><code class="language-shell">curl -o 'elastic-apm-agent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&amp;g=co.elastic.apm&amp;a=elastic-apm-agent&amp;v=LATEST'
java -javaagent:elastic-apm-agent.jar \
-Delastic.apm.service_name=cpu-burner-elastic \
-Delastic.apm.secret_token=XXXXX \
-Delastic.apm.server_url=&lt;elastic-apm-server-endpoint&gt; \
-Delastic.apm.application_packages=co.elastic.demo \
-Delastic.apm.universal_profiling_integration_enabled=true \
-jar ./target/cpu-burner.jar
</code></pre>
<h3>Using OpenTelemetry</h3>
<p>The feature is also available as an OpenTelemetry SDK extension.
This means you can use it as a plugin for the vanilla OpenTelemetry agent or add it to your OpenTelemetry SDK if you are not using an agent.
In addition, the feature ships by default with the Elastic OpenTelemetry Distribution for Java and can be used via any of the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">possible usage methods</a>.
While the extension is currently Elastic-specific, we are already working with the various OpenTelemetry SIGs on standardizing the correlation mechanism, especially now after the <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">eBPF profiling agent has been contributed</a>.</p>
<p>For this demo, we’ll be using the Elastic OpenTelemetry Distro Java agent to run the extension:</p>
<pre><code class="language-shell">curl -o 'elastic-otel-javaagent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&amp;g=co.elastic.otel&amp;a=elastic-otel-javaagent&amp;v=LATEST'
java -javaagent:./elastic-otel-javaagent.jar \
-Dotel.exporter.otlp.endpoint=&lt;elastic-cloud-OTLP-endpoint&gt; \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer XXXX&quot; \
-Dotel.service.name=cpu-burner-otel \
-Delastic.otel.universal.profiling.integration.enabled=true \
-jar ./target/cpu-burner.jar
</code></pre>
<p>Here, we explicitly enabled the profiling integration feature via the <strong>elastic.otel.universal.profiling.integration.enabled</strong> property. Note that with an upcoming release of the universal profiling feature, this won’t be necessary anymore! The OpenTelemetry extension will then automatically detect the presence of the profiler and enable the correlation feature based on that.</p>
<p>The demo repository also comes with a Dockerfile, so you can alternatively build and run the app in docker:</p>
<pre><code class="language-shell">docker build -t cpu-burner .
docker run --rm -e OTEL_EXPORTER_OTLP_ENDPOINT=&lt;elastic-cloud-OTLP-endpoint&gt; -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer XXXX&quot; cpu-burner
</code></pre>
<p>And that’s it for setup; we are now ready to inspect the correlated profiling data!</p>
<h2>Analyzing Service CPU Usage</h2>
<p>The first thing we can do now is head to the “Flamegraph” view in Universal Profiling and inspect flamegraphs filtered on APM services. Without the APM correlation, universal profiling is limited to filtering on infrastructure concepts, such as hosts, containers, and processes.
Below is a screencast showing a flamegraph filtered on the service name of our demo application:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/service-profiling.gif" alt="Universal Profiling Flamegraph filtered on the service name of our demo application" /></p>
<p>With this filter applied, we get a flamegraph aggregated over all instances of our service. If that is not desired, we could narrow down the filter, e.g. based on the host or container names. Note that the same service-level flamegraph view is also available on the “Universal Profiling” tab in the APM service UI.</p>
<p>The flamegraphs show exactly how the demo application is spending its CPU time, regardless of whether it is covered by instrumentation. From left to right, we can first see the time spent in application tasks: We can identify the background allocations not covered by APM transactions as well as the SHA-computation and Fibonacci transactions.
Interestingly, this application logic only covers roughly 60% of the total CPU time! The remaining time is spent mostly in the G1 garbage collector due to the high allocation rate of our application. The flamegraph shows all G1-related activities and the timing of the individual phases of concurrent tasks. We can easily identify those based on the native function names. This is made possible by universal profiling being capable of profiling and symbolizing the JVM’s C++ code in addition to the Java code.</p>
<h2>Pinpointing Transaction Bottlenecks</h2>
<p>While the service-level flamegraph already gives good insights on where our transactions consume the most CPU, this is mainly due to the simplicity of the demo application. In real-world applications, it can be much harder to pinpoint that certain stack frames come mostly from certain transactions. For this reason, the APM agent also correlates CPU profiling data from universal profiling on the transaction level.</p>
<p>We can navigate to the “Universal Profiling” tab on the transaction details page to get per-transaction flamegraphs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/navigate-to-transaction-profiles.gif" alt="Navigation to per-transaction profiling flamegraphs" /></p>
<p>For example, let’s have a look at the flamegraph of our transaction computing SHA-2 and SHA-3 hashes of randomly generated data:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-unfiltered.png" alt="Flamegraph for the hashing transaction" /></p>
<p>Interestingly, the flamegraph uncovers some unexpected results: The transactions spend more time computing the random bytes to be hashed rather than on the hashing itself! So if this were a real-world application, a possible optimization could be to use a more performant random number generator.</p>
<p>In addition, we can see that the MessageDigest.update call for computing the hash values fans out into two different code paths: One is a call into the <a href="https://www.bouncycastle.org/">BouncyCastle cryptography library</a>, the other one is a JVM stub routine, meaning that the JIT compiler has inserted special assembly code for a function.</p>
<p>The flamegraph shown in the screenshot displays the aggregated data for all “shaShenanigans” transactions in the given time filter. We can further filter this down using the transaction filter bar at the top. To make the best use of this, the demo application annotates the transactions with the hashing algorithm used via OpenTelemetry attributes:</p>
<pre><code class="language-java">public static void shaShenanigans(MessageDigest digest) {
    Span span = tracer.spanBuilder(&quot;shaShenanigans&quot;)
        .setAttribute(&quot;algorithm&quot;, digest.getAlgorithm())
        .startSpan();
    ...
    span.end();
}
</code></pre>
<p>So, let’s filter our flamegraph based on the used hashing algorithm:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-filter-bar.png" alt="Transaction Filter Bar" /></p>
<p>Note that “SHA-256” is the name of the JVM built-in SHA-2 256-bit implementation. This now gives the following flamegraph:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-sha-256.png" alt="Transaction Filter Bar" /></p>
<p>We can see that the BouncyCastle stack frames are gone and MessageDigest.update spends all its time in the JVM stub routines. Therefore, the stub routine is likely hand-crafted assembly from the JVM maintainers for the SHA2 algorithm.</p>
<p>If we instead filter on “SHA3-256”, we get the following result:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-sha3.png" alt="Transaction Filter Bar" /></p>
<p>Now, as expected, MessageDigest.update spends all its time in the BouncyCastle library for the SHA3 implementation. Note that the hashing here takes up more time in relation to the random data generation, showing that the SHA2 JVM stub routine is significantly faster than the BouncyCastle Java SHA3 implementation.</p>
<p>This filtering is not limited to custom attributes like those shown in this demo. You can filter on any transaction attribute, including latency, HTTP headers, and so on. For typical HTTP applications, for example, this allows analyzing the efficiency of the JSON serializer in use based on the payload size.
Note that while it is possible to filter on individual transaction instances (e.g. based on trace.id), this is not recommended: To allow continuous profiling in production systems, the profiler by default runs with a low sampling rate of 20 Hz. This means that for typical real-world applications, a single transaction execution will not yield enough data. Instead, we gain insights by monitoring multiple executions of a group of transactions over time and aggregating their samples, for example in a flamegraph.</p>
<h2>Summary</h2>
<p>A common reason for applications to degrade is overly high CPU usage. In this blog post, we showed how to combine universal profiling with APM to find the actual root cause in such cases: We explained how to analyze the CPU time using profiling flamegraphs on service and transaction levels.
In addition, we further drilled down into data using custom filters.
We used a simple demo application for this purpose, so go ahead and try it yourself with your own real-world applications to experience the full power of the feature!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[From Uptime to Synthetics in Elastic: Your migration Playbook]]></title>
            <link>https://www.elastic.co/observability-labs/blog/uptime-to-synthetics-guide</link>
            <guid isPermaLink="false">uptime-to-synthetics-guide</guid>
            <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Effortlessly migrate your existing Uptime TCP, ICMP, and HTTP monitors to Elastic Synthetics with this comprehensive guide, leveraging Private Locations and Synthetics Projects for efficient, future-proof monitoring.]]></description>
            <content:encoded><![CDATA[<p>Have you seen the warning that Uptime is deprecated and want to know how to easily migrate to Synthetics? Then you are in the right place.
Starting with version 8.15.0, uptime checks have been deprecated in favor of synthetic monitoring.</p>
<p>Many users may have a large number of TCP, ICMP, and HTTP monitors and need to migrate them to Synthetics. In this guide, we will explain how to perform this migration easily while ensuring that the result is future-proof and leaves room to develop more advanced checks such as <a href="https://www.elastic.co/docs/solutions/observability/synthetics/#monitoring-synthetics">Browser monitors</a>.</p>
<p>First, we must consider the number of monitors to migrate; if the number is small, the easiest way would be to do it manually through the <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-ui">Synthetics UI</a>. However, in this guide we will assume that we have dozens or hundreds of monitors to migrate, and doing it manually in the Synthetics UI is not an option.</p>
<h1>Private Location</h1>
<p>Traditionally, uptime monitors required a <a href="https://www.elastic.co/docs/reference/beats/heartbeat/">Heartbeat</a> to be deployed in your infrastructure, which indirectly allowed you to monitor endpoints or hosts on your private network. If this is still a requirement, you will need to either configure <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks">Private Location</a> or allow Elastic’s global managed infrastructure to <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks#monitor-via-access-control">access your private endpoints</a> (only on <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/cloud-hosted">ECH</a> &amp; <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/serverless">Serverless</a>).</p>
<p>In this guide, we will use Private Locations, which will allow you to monitor both internal and external resources. More details can be found here: <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks#monitor-via-private-agent">Monitor resources on private networks</a></p>
<h2>Step 1: Set up Fleet Server and Elastic Agent</h2>
<p>Private Locations are simply Elastic Agents enrolled in Fleet and managed through an agent policy. </p>
<p>If you don't have a Fleet Server yet, start <a href="https://www.elastic.co/docs/reference/fleet/fleet-server">setting up a Fleet Server</a>. This step is not necessary if you use ECH, as it comes by default.</p>
<p>Next, you will need to create an Agent Policy. Go to <strong>Observability → Monitors (Synthetics) → Settings (top right) → Private Location → + Create Location</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/create-private-location.png" alt="Create Private Location" /></p>
<p>Fill in the fields and create a new policy for this Private Location. It is important to know that a Private Location must be set up against an agent policy that runs on a <strong>single</strong> Elastic Agent.</p>
<h2>Step 2: Deploy the Elastic Agent</h2>
<p>Now we need to deploy the Elastic Agent that will be responsible for running all the monitors. We can use the same host we were using for Heartbeat. There is only one requirement: the host must be able to run Docker containers, since to take advantage of all the features of Synthetics, we must use the <code>elastic-agent-complete</code> Docker image.</p>
<ol>
<li>
<p>Go to <strong>Fleet –&gt; Enrollment tokens</strong> and note the enrollment token relevant to the policy you just created for the Private Location. Now go to <strong>Settings</strong> and note the default Fleet server host URL.</p>
</li>
<li>
<p>On the host, run the following commands. For more information on running Elastic Agent with Docker, refer to Run Elastic Agent in a container.</p>
</li>
</ol>
<pre><code class="language-sh">docker run \
  --env FLEET_ENROLL=1 \
  --env FLEET_URL={fleet_server_host_url} \
  --env FLEET_ENROLLMENT_TOKEN={enrollment_token} \
  --cap-add=NET_RAW \
  --cap-add=SETUID \
  --rm docker.elastic.co/elastic-agent/elastic-agent-complete:9.3.1
</code></pre>
<h1>Synthetic Project</h1>
<p>At this point, we already have the location from which our Synthetic monitors will run. Now we need to load our Uptime monitors as Synthetics.</p>
<p>As we mentioned earlier, there are two ways to do this: either manually through the Synthetics UI or through a Synthetics Project.
In our case, since we have so many monitors to migrate and don't want to do it manually, we will use <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-with-projects">Synthetics Projects</a>. </p>
<p>The great thing about Synthetics Projects is that they have some backward compatibility with the monitor definitions in <code>heartbeat.yml</code>, and we will be leveraging it.</p>
<h2>What's a Synthetics project?</h2>
<p>A Synthetics project is the most powerful and flexible way to manage synthetic monitors in Elastic, based on the Infrastructure as Code principle and compatible with GitOps flows. Instead of configuring monitors from the interface, you define them as code: .yml files for lightweight monitors and JavaScript or TypeScript scripts for browser-type monitors (journeys).</p>
<p>This approach allows you to structure your monitors in a repository, version them with Git, validate them, and deploy them automatically using CI/CD flows, providing traceability, reviews, and consistent deployments.</p>
<h2>Step 3: Initialize your Synthetics project</h2>
<p>You will no longer need to connect to the hosts where you deployed the Elastic Agent, as the remaining steps can be performed locally as long as you have connectivity to Kibana!</p>
<p>Since Synthetics Projects is based on Node.js, make sure you have it <a href="https://nodejs.org/en/download">installed</a>. </p>
<ol>
<li>Install the package:</li>
</ol>
<pre><code class="language-sh">npm install -g @elastic/synthetics
</code></pre>
<ol start="2">
<li>Confirm your system is set up correctly:</li>
</ol>
<pre><code class="language-sh">npx @elastic/synthetics -h
</code></pre>
<ol start="3">
<li>Start by creating your first Synthetics project. Run the command below to create a new Synthetics project named <code>synthetic-project-test</code> in the current directory.</li>
</ol>
<pre><code class="language-sh">npx @elastic/synthetics init synthetic-project-test
</code></pre>
<ol start="4">
<li>
<p>Follow the prompt instructions to configure the default variables for your Synthetics project. Make sure to at least <strong>select your Private Location.</strong> Once that’s done, set the <code>SYNTHETICS_API_KEY</code> environment variable in your terminal, which allows the project to authenticate with Kibana.</p>
<ol>
<li>
<p>To generate an API key, go to Synthetics in Kibana.</p>
</li>
<li>
<p>Click <strong>Settings</strong>.</p>
</li>
<li>
<p>Switch to the <strong>Project API Keys</strong> tab.</p>
</li>
<li>
<p>Click <strong>Generate Project API key</strong>.</p>
</li>
</ol>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/generate-api-key.png" alt="Generate API Key" /></p>
<p>More details for all the steps can be found here: <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-with-projects#synthetics-get-started-project-create-a-synthetics-project">Create monitors with a Synthetics project</a></p>
<h2>Step 4: Add your <code>heartbeat.yml</code> files</h2>
<p>Once the project is initialized, access the folder it has created and take a look at the project structure:</p>
<ul>
<li>
<p><code>journeys</code> is where you’ll add .ts and .js files defining your browser monitors. It currently contains files defining sample monitors.</p>
</li>
<li>
<p><code>lightweight</code> is where we’ll add our heartbeat.yml files defining our lightweight monitors. It currently contains a file defining sample monitors.</p>
</li>
</ul>
<p>Therefore, all we have to do is copy our <code>heartbeat.yml</code> files to this lightweight folder. Before copying <code>heartbeat.yml</code>, keep in mind that we don't need all of its content; we are only interested in the <code>heartbeat.monitors</code> part. <br />
We recommend considering splitting the file into logical groups. Instead of maintaining a single large YAML file, you could create multiple smaller YAML files, with each file representing either a single check or a group of related checks. This approach may simplify management and improve compatibility with GitOps workflows.<br />
Each YAML file should look like this:</p>
<pre><code>carles@synthetics-migration:synthetic-project-test/lightweight# cat heartbeat.yml

heartbeat.monitors:
- type: icmp
  schedule: '@every 10s'
  hosts: [&quot;localhost&quot;]
  id: my-icmp-service-synth
  name: My ICMP Service - Synthetic
- type: tcp
  schedule: '@every 10s'
  hosts: [&quot;myremotehost:8123&quot;]
  mode: any
  id: my-tcp-service-synth
  name: My TCP Service Synthetic
- type: http
  schedule: '@every 10s'
  urls: [&quot;http://elastic.co&quot;]
  id: my-http-service-synth
  name: My HTTP Service Synthetic
</code></pre>
<p>What we just did is define different ICMP, TCP, and HTTP checks as code.</p>
<p>Now we need to ask Synthetics project to create the monitors in Kibana based on what we have defined in our YAML files:</p>
<pre><code class="language-sh">npx @elastic/synthetics push --auth $SYNTHETICS_API_KEY --url &lt;kibana-url&gt;
</code></pre>
<p>Unfortunately, we do not support a 1-to-1 mapping of the Heartbeat schema to the lightweight schema, so you may encounter some errors during the execution of this command. One example is the definition of <code>schedule</code>: Heartbeat supports crontab expressions, but Synthetics projects require the <code>@every</code> syntax.</p>
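<p>If you have many monitors using crontab schedules, a small helper can mechanically rewrite the simple cases. The sketch below is illustrative only (not an official migration tool) and handles just the common minute-step pattern; anything more complex should be converted by hand:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScheduleConverter {
    // Rewrites a minute-step crontab expression such as "*/5 * * * *"
    // into the "@every" syntax expected by Synthetics projects.
    static String crontabToEvery(String cron) {
        Matcher m = Pattern.compile("^\\*/(\\d+)\\s+\\*\\s+\\*\\s+\\*\\s+\\*$")
                .matcher(cron.trim());
        if (m.matches()) {
            return "@every " + m.group(1) + "m";
        }
        throw new IllegalArgumentException("convert by hand: " + cron);
    }

    public static void main(String[] args) {
        System.out.println(crontabToEvery("*/5 * * * *"));  // @every 5m
        System.out.println(crontabToEvery("*/30 * * * *")); // @every 30m
    }
}
```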
<p>If no syntax errors were found, the command output will show that the monitors have been successfully created in Kibana!</p>
<p>Then, go to <strong>Synthetics</strong> in Kibana. You should see your newly pushed monitors running. You can also go to the Management tab to see the monitors' configuration settings.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using a custom agent with the OpenTelemetry Operator for Kubernetes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-elastic-agents</guid>
            <pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[<p>This is the second part of a two part series. The first part is available at <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>. In that first part I walk through setting up and installing the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configuring that for auto-instrumentation of a Java application using the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>.</p>
<p>In this second part, I show how to install <em>any</em> Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
<h2>Installation and configuration recap</h2>
<p>Part 1 of this series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>, details the installation and configuration of the OpenTelemetry operator and an Instrumentation resource. Here is an outline of the steps as a reminder:</p>
<ol>
<li>Install cert-manager, eg <code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml</code></li>
<li>Install the operator, eg <code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml</code></li>
<li>Create an Instrumentation resource</li>
<li>Add an annotation to either the deployment or the namespace</li>
<li>Deploy the application as normal</li>
</ol>
<p>In that first part, steps 3, 4 &amp; 5 were implemented for the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>. In this blog I’ll implement them for other agents, using the Elastic APM agents as examples. I assume that steps 1 &amp; 2 outlined above have already been done, ie that the operator is now installed. I will continue using the <code>banana</code> namespace for the examples, so ensure that namespace exists (<code>kubectl create namespace banana</code>). As per part 1, if you use any of the example instrumentation definitions below, you’ll need to substitute <code>my.apm.server.url</code> and <code>my-apm-secret-token</code> with the values appropriate for your collector.</p>
<h2>Using the Elastic Distribution for OpenTelemetry Java</h2>
<p>From version 0.4.0, the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution for OpenTelemetry Java</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define, and as it’s a distribution of the OpenTelemetry Java agent, all the standard OpenTelemetry environment variables apply:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-otel
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: docker.elastic.co/observability/elastic-otel-javaagent:1.9.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION
        value: &quot;50&quot;
</code></pre>
<p>I’ve included environment variables for switching on several features in the agent, including:</p>
<ul>
<li>ELASTIC_OTEL_INFERRED_SPANS_ENABLED to switch on the inferred spans feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION, above which span stack traces are automatically captured (the default is 5ms)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-otel&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans and stack traces</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-stack-trace.png" alt="Elastic APM UI showing methodB traced with stack traces and inferred spans" /></p>
<p>The additions from the features mentioned above are circled in red - inferred spans (for methodC and methodD) bottom left, and the stack trace top right. (Note that the pod included the <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE</code> environment variable set to <code>&quot;test.Testing[methodB]&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using the Elastic APM Java agent</h2>
<p>From version 1.50.0, the <a href="https://github.com/elastic/apm-agent-java">Elastic APM Java agent</a> includes the agent jar at the path /javaagent.jar in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-apm
  namespace: banana
spec:
  java:
    image: docker.elastic.co/observability/apm-agent-java:1.55.4
    env:
      - name: ELASTIC_APM_SERVER_URL
        value: &quot;https://my.apm.server.url&quot;
      - name: ELASTIC_APM_SECRET_TOKEN
        value: &quot;my-apm-secret-token&quot;
      - name: ELASTIC_APM_LOG_LEVEL
        value: &quot;INFO&quot;
      - name: ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_APM_LOG_SENDING
        value: &quot;true&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ul>
<li>ELASTIC_APM_LOG_LEVEL, set here to the default value (INFO), which could easily be switched to DEBUG</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED to switch on the inferred spans implementation, equivalent to the feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_APM_LOG_SENDING, which switches on sending logs to the APM UI; the logs are automatically correlated with transactions (for all common logging frameworks)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-apm&quot;
</code></pre>
<p>... to the pod yaml gets the application traced and displayed in the Elastic APM UI, including the inferred child spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-inferred-spans.png" alt="Elastic APM UI showing methodB traced with inferred spans" /></p>
<p>(Note that the pod included the <code>ELASTIC_APM_TRACE_METHODS</code> environment variable set to <code>&quot;test.Testing#methodB&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using an extension with the OpenTelemetry Java agent</h2>
<p>Setting up an Instrumentation resource for the OpenTelemetry Java agent is straightforward and was done in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a> of this two-part series - as you can see from the above examples, it’s just a matter of deciding on the docker image URL you want to use. However, if you want to include an <em>extension</em> in your deployment, this is a little more complex, but also supported by the operator. Basically, the extensions you want to include with the agent need to be in docker images - or you have to build an image which includes any extensions that are not already in images. Then, in the Instrumentation resource, you declare the images and the directories the extensions are in. As an example, I’ll show an Instrumentation which uses version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> together with the <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">inferred spans extension</a> from the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Java distribution</a>. The distro image includes the extension at path <code>/extensions/elastic-otel-agentextension.jar</code>. The Instrumentation resource allows either directories or file paths to be specified; here I’ll list the directory:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-plus-extension-instrumentation
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
</code></pre>
<p>Note that you can have multiple <code>image … dir</code> pairs, i.e., you can include multiple extensions from different images. Note also, if you are testing this specific configuration, that the inferred spans extension included here will be contributed to the OpenTelemetry contrib repo at some point after this blog is published, after which the extension may no longer be present in a later version of the referred image (since it will be available from the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">contrib repo</a> instead).</p>
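<p>For instance, a hypothetical <code>spec.java.extensions</code> list pulling extensions from two different images might look like the following sketch (the second image and its path are placeholders, not real artifacts):</p>
<pre><code>  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
      # hypothetical second extension image
      - image: &quot;example.registry.io/my-extensions:1.0.0&quot;
        dir: &quot;/my-extensions&quot;
</code></pre>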
<h2>Next steps</h2>
<p>Here I’ve shown how to use any agent with the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> and configure it for your system. In particular, the examples have showcased how to use the Elastic Java agents to auto-instrument Java applications running in your Kubernetes clusters, and how to enable their features, using Instrumentation resources. You can set this up either with zero configuration for deployments, or with just one annotation, which is generally the more flexible mechanism (you can have multiple Instrumentation resource definitions, and each deployment can select the appropriate one for its application).</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/blog-header-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-java-agents</guid>
            <pubDate>Thu, 11 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Walking through how to install and enable the OpenTelemetry Operator for Kubernetes to auto-instrument Java applications, with no configuration changes needed for deployments]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> has a number of <a href="https://opentelemetry.io/docs/languages/java/automatic/#setup">ways to install</a> the agent into a Java application. If you are running your Java applications in Kubernetes pods, there is a separate mechanism (which under the hood uses JAVA_TOOL_OPTIONS and other environment variables) to auto-instrument Java applications. This auto-instrumentation can be achieved with zero configuration of the applications and pods!</p>
<p>The mechanism to achieve zero-config auto-instrumentation of Java applications in Kubernetes is via the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>. This operator has many capabilities and the full documentation (and of course source) is available in the project itself. In this blog, I'll walk through installing, setting up and running zero-config auto-instrumentation of Java applications in Kubernetes using the OpenTelemetry Operator.</p>
<h2>Installing the OpenTelemetry Operator<a id="installing-the-opentelemetry-operator"></a></h2>
<p>At the time of writing this blog, the OpenTelemetry Operator requires <code>cert-manager</code> to be installed first, after which the operator itself can be installed. Installing from the web is straightforward. First install <code>cert-manager</code> (the version to install is specified in the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation):</p>
<pre><code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
</code></pre>
<p>Then, when the cert-manager pods are ready (<code>kubectl get pods -n cert-manager</code>) ...</p>
<pre><code>NAMESPACE      NAME                                         READY
cert-manager   cert-manager-67c98b89c8-rnr5s                1/1
cert-manager   cert-manager-cainjector-5c5695d979-q9hxz     1/1
cert-manager   cert-manager-webhook-7f9f8648b9-8gxgs        1/1
</code></pre>
<p>... you can install the OpenTelemetry Operator:</p>
<pre><code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
</code></pre>
<p>You can, of course, use a specific version of the operator instead of <code>latest</code>; here I’ve used <code>latest</code>.</p>
<h2>An Instrumentation resource<a id="an-instrumentation-resource"></a></h2>
<p>Now you need to add just one further Kubernetes resource to enable auto-instrumentation: an <code>Instrumentation</code> resource. I am going to use the <code>banana</code> namespace for my examples, so I have first created that namespace (<code>kubectl create namespace banana</code>). The auto-instrumentation is specified and configured by these Instrumentation resources. Here is a basic one which will allow every Java pod in the <code>banana</code> namespace to be auto-instrumented with version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: banana-instr
  namespace: banana
spec:
  exporter:
    endpoint: &quot;https://my.endpoint&quot;
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer MyAuth&quot;
</code></pre>
<p>Creating this resource (eg with <code>kubectl apply -f banana-instr.yaml</code>, assuming the above yaml was saved in file <code>banana-instr.yaml</code>) makes the <code>banana-instr</code> Instrumentation resource available for use. (Note you will need to change <code>my.endpoint</code> and <code>MyAuth</code> to values appropriate for your collector.) You can use this instrumentation immediately by adding an annotation to any deployment in the <code>banana</code> namespace:</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
</code></pre>
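<p>Note that for a <code>Deployment</code> (rather than a bare pod), the annotation belongs on the <em>pod template</em> metadata, not on the Deployment’s own metadata. A minimal sketch (the app name and image are placeholders):</p>
<pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: banana-deployment
  namespace: banana
spec:
  selector:
    matchLabels:
      app: banana-app
  template:
    metadata:
      labels:
        app: banana-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
    spec:
      containers:
        - name: banana-app
          image: example.registry.io/banana-app:latest
</code></pre>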
<p>The <code>banana-instr</code> Instrumentation resource is not yet set to be applied by <em>default</em> to all pods in the banana namespace. Currently it's zero-config as far as the <em>application</em> is concerned, but it requires an annotation added to a <em>pod or deployment</em>. To make it fully zero-config for <em>all pods</em> in the <code>banana</code> namespace, we need to add that annotation to the namespace itself, i.e., edit the namespace (<code>kubectl edit namespace banana</code>) so it has contents similar to:</p>
<pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: banana
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;banana-instr&quot;
...
</code></pre>
<p>Now we have a namespace that is going to auto-instrument <em>every</em> Java application deployed in the <code>banana</code> namespace with the 2.5.0 <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>!</p>
<h2>Trying it<a id="trying-it"></a></h2>
<p>There is a simple example Java application at <a href="http://docker.elastic.co/demos/apm/k8s-webhook-test">docker.elastic.co/demos/apm/k8s-webhook-test</a> which just repeatedly calls the chain <code>main-&gt;methodA-&gt;methodB-&gt;methodC-&gt;methodD</code> with some sleeps in the calls. Running this (<code>kubectl apply -f banana-app.yaml</code>) using a very basic pod definition:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: banana-app
  namespace: banana
  labels:
    app: banana-app
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: banana-app
      env: 
      - name: OTEL_INSTRUMENTATION_METHODS_INCLUDE
        value: &quot;test.Testing[methodB]&quot;
</code></pre>
<p>results in the app being auto-instrumented with no configuration changes! The resulting app shows up in any APM UI, such as Elastic APM</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/elastic-apm-ui-transaction.png" alt="Elastic APM UI showing methodB traced" /></p>
<p>As you can see, for this example I also added this env var to the pod yaml, <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE=&quot;test.Testing[methodB]&quot;</code> so that there were traces showing from methodB.</p>
<h2>The technology behind the auto-instrumentation<a id="the-technology-behind-the-auto-instrumentation"></a></h2>
<p>To use the auto-instrumentation there is no specific need to understand the underlying mechanisms, but for those of you interested, here’s a quick outline.</p>
<ol>
<li>The <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> installs a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, a standard Kubernetes component.</li>
<li>When deploying, Kubernetes first sends all definitions to the mutating webhook.</li>
<li>If the mutating webhook sees that the conditions for auto-instrumentation are met (i.e.,
<ol>
<li>there is an Instrumentation resource for that namespace and</li>
<li>the correct annotation for that Instrumentation is applied to the definition in some way, either from the definition itself or from the namespace),</li>
</ol>
</li>
<li>then the mutating webhook “mutates” the definition to include the environment defined by the Instrumentation resource.</li>
<li>The environment includes the explicit values defined in the env, as well as some implicit OpenTelemetry values (see the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation for full details).</li>
<li>And most importantly, the operator
<ol>
<li>pulls the image defined in the Instrumentation resource,</li>
<li>extracts the file at the path <code>/javaagent.jar</code> from that image (using shell command <code>cp</code>)</li>
<li>inserts it into the pod at path <code>/otel-auto-instrumentation-java/javaagent.jar</code></li>
<li>and adds the environment variable <code>JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation-java/javaagent.jar</code>.</li>
</ol>
</li>
<li>The JVM automatically picks up that JAVA_TOOL_OPTIONS environment variable on startup and applies it to the JVM command-line.</li>
</ol>
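<p>Put together, the mutated pod spec ends up looking roughly like the following sketch (names abbreviated; the exact init container name, volume name, and defaults the operator generates may differ):</p>
<pre><code>spec:
  initContainers:
    # added by the operator: copies the agent jar into a shared volume
    - name: opentelemetry-auto-instrumentation-java
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
      command: [&quot;cp&quot;, &quot;/javaagent.jar&quot;, &quot;/otel-auto-instrumentation-java/javaagent.jar&quot;]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  containers:
    - name: banana-app
      env:
        # the JVM picks this up on startup
        - name: JAVA_TOOL_OPTIONS
          value: &quot;-javaagent:/otel-auto-instrumentation-java/javaagent.jar&quot;
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  volumes:
    - name: opentelemetry-auto-instrumentation-java
      emptyDir: {}
</code></pre>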
<h2>Next steps<a id="next-steps"></a></h2>
<p>This walkthrough can be repeated in any Kubernetes cluster to demonstrate and experiment with auto-instrumentation (you will need to create the banana namespace first). In part 2 of this two part series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents">Using a custom agent with the OpenTelemetry Operator for Kubernetes</a>, I show how to install any Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Easily analyze AWS VPC Flow Logs with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability</link>
            <guid isPermaLink="false">vpc-flow-logs-monitoring-analytics-observability</guid>
            <pubDate>Mon, 23 Jan 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability can ingest and help analyze AWS VPC Flow Logs from your application’s VPC. Learn how to ingest AWS VPC Flow Logs through a step-by-step method into Elastic, then analyze it and apply OOTB machine learning for insights.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a full-stack observability solution, by supporting metrics, traces, and logs for applications and infrastructure. In <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">a previous blog</a>, I showed you an <a href="https://www.elastic.co/observability/aws-monitoring">AWS monitoring</a> infrastructure running a three-tier application. Specifically we reviewed metrics ingest and analysis on Elastic Observability for EC2, VPC, ELB, and RDS. In this blog, we will cover how to ingest logs from AWS, and more specifically, we will review how to get VPC Flow Logs into Elastic and what you can do with this data.</p>
<p>Logging is an important part of observability, even though metrics and tracing generally come to mind first. However, the volume of logs an application or the underlying infrastructure outputs can be daunting.</p>
<p>With Elastic Observability, there are three main mechanisms to ingest logs:</p>
<ul>
<li>The new Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53, etc.). We reviewed Elastic Agent metrics configuration for EC2, RDS (Aurora), ELB, and NAT metrics in this <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</li>
<li>Using <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s Serverless Forwarder (runs on Lambda and available in AWS SAR)</a> to send logs from Firehose, S3, CloudWatch, and other AWS services into Elastic.</li>
<li>Beta feature (contact your Elastic account team): Using AWS Firehose to directly insert logs from AWS into Elastic — specifically if you are running the Elastic stack on AWS infrastructure.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/Elastic-Observability-VPC-Flow-Logs.jpg" alt="" /></p>
<p>In this blog we will provide an overview of the second option, Elastic’s serverless forwarder collecting VPC Flow Logs from an application deployed on EC2 instances. Here’s what we'll cover:</p>
<ul>
<li>A walk-through on how to analyze VPC Flow Log info with Elastic’s Discover, dashboard, and ML analysis.</li>
<li>A detailed step-by-step overview and setup of the Elastic serverless forwarder on AWS as a pipeline for VPC Flow Logs into <a href="http://cloud.elastic.co">Elastic Cloud</a>.</li>
</ul>
<h2>Elastic’s serverless forwarder on AWS Lambda</h2>
<p>AWS users can quickly ingest logs stored in Amazon S3, CloudWatch, or Kinesis with the Elastic serverless forwarder, an AWS Lambda application, and view them in the Elastic Stack alongside other logs and metrics for centralized analytics. Once the serverless forwarder is configured and deployed from the AWS Serverless Application Repository (SAR), logs will be ingested and available in Elastic for analysis. See the following links for further configuration guidance:</p>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s serverless forwarder (runs on Lambda and available in AWS SAR)</a></li>
<li><a href="https://github.com/elastic/elastic-serverless-forwarder/blob/main/docs/README-AWS.md#s3_config_file">Serverless forwarder GitHub repo</a></li>
</ul>
<p>In our configuration we will ingest VPC Flow Logs into Elastic for the three-tier app deployed in the previous <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</p>
<p>There are three different configurations with the Elastic serverless forwarder:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-3-configurations.png" alt="" /></p>
<p>Logs can be directly ingested from:</p>
<ul>
<li><strong>Amazon CloudWatch:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from an Amazon CloudWatch log group, which is a commonly used endpoint to store VPC Flow Logs in AWS.</li>
<li><strong>Amazon Kinesis:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from Kinesis, which is another location to <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-firehose.html">publish VPC Flow Logs</a>.</li>
<li><strong>Amazon S3:</strong> Elastic serverless forwarder can pull VPC Flow Logs from Amazon S3 via SQS event notifications, which is a common endpoint to publish VPC Flow Logs in AWS.</li>
</ul>
<p>We will review how to utilize a common configuration, which is to send VPC Flow Logs to Amazon S3 and into Elastic Cloud in the second half of this blog.</p>
<p>But first let's review how to analyze VPC Flow Logs on Elastic.</p>
<h2>Analyzing VPC Flow Logs in Elastic</h2>
<p>Now that you have VPC Flow Logs in Elastic Cloud, how can you analyze them?</p>
<p>There are several analyses you can perform on the VPC Flow Log data:</p>
<ol>
<li>Use Elastic’s Analytics Discover capabilities to manually analyze the data.</li>
<li>Use Elastic Observability’s anomaly feature to identify anomalies in the logs.</li>
<li>Use an out-of-the-box (OOTB) dashboard to further analyze data.</li>
</ol>
<h3>Using Elastic Discover</h3>
<p>In Elastic analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard. With Discover, you can:</p>
<ul>
<li>View logs in bulk, within specific time frames</li>
<li>Look at individual details of each entry (document)</li>
<li>Filter for specific values</li>
<li>Analyze fields</li>
<li>Create and save searches</li>
<li>Build visualizations</li>
</ul>
<p>For a complete understanding of Discover and all of Elastic’s analytics capabilities, look at <a href="https://www.elastic.co/guide/en/kibana/current/discover.html#">Elastic documentation</a>.</p>
<p>For VPC Flow Logs, some important stats to understand are:</p>
<ul>
<li>How many logs were accepted/rejected</li>
<li>Where potential security violations occur (for example, source IPs from outside the VPC)</li>
<li>What port is generally being queried</li>
</ul>
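<p>Questions like these can be expressed directly as KQL queries in Discover. For example, the following filter isolates rejected flows aimed at telnet and SMB ports (field names here assume the AWS VPC Flow Logs integration's ECS mapping; adjust them to match your actual index):</p>
<pre><code>aws.vpcflow.action : &quot;REJECT&quot; and destination.port : (23 or 445)
</code></pre>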
<p>I’ve filtered the logs on the following:</p>
<ul>
<li>Amazon S3: bshettisartest</li>
<li>VPC Flow Log action: REJECT</li>
<li>VPC Network Interface: Webserver 1</li>
</ul>
<p>We want to see what IP addresses are trying to hit our web servers.</p>
<p>From that, we want to understand which IP addresses we are getting the most REJECTs from, so we simply find the <strong>source</strong>.ip field. Then, we can quickly get a breakdown showing that 185.242.53.156 has been the most-rejected source for the 3+ hours since we turned on VPC Flow Logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-100-hits.png" alt="" /></p>
<p>Additionally, I can see a visualization by selecting the “Visualize” button. We get the following, which we can add to a dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-add-to-a-dashboard.png" alt="" /></p>
<p>In addition to IP addresses, we also want to see which ports are being hit on our web servers.<br />
We select the destination port field, and the quick pop-up shows us a list of ports being targeted. We can see that port 23 is being targeted (generally used for telnet), port 445 (used for Microsoft Active Directory), and port 443 (used for HTTPS/SSL). We also see that these are all REJECTs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-reject.png" alt="" /></p>
<h3>Anomaly detection in Elastic Observability logs</h3>
<p>In addition to Discover, Elastic Observability provides the ability to detect anomalies in logs. In Elastic Observability -&gt; Logs -&gt; Anomalies, you can turn on machine learning for:</p>
<ul>
<li>Log rate: automatically detects anomalous log entry rates</li>
<li>Categorization: automatically categorizes log messages</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-detection-with-machine-learning.png" alt="" /></p>
<p>For our VPC Flow Log, we turned both on. And when we look at what has been detected for anomalous log entry rates, we see:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomalies.png" alt="" /></p>
<p>Elastic immediately detected a spike in logs when we turned on VPC Flow Logs for our application. The rate change is detected because we had also been ingesting VPC Flow Logs from another application for a couple of days prior to adding the application in this blog.</p>
<p>We can further drill down into this anomaly with machine learning and analyze further.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-explorer.png" alt="" /></p>
<p>There is more machine learning analysis you can utilize with your logs — check out <a href="https://www.elastic.co/guide/en/kibana/8.5/xpack-ml.html">Elastic machine learning documentation</a>.</p>
<p>Since we know that a spike exists, we can also use the Explain Log Rate Spikes capability in Machine Learning’s AIOps Labs. Additionally, we’ve grouped the results to see what is causing some of the spikes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-explain-log-rate-spikes.png" alt="" /></p>
<p>As we can see, a specific network interface is sending more VPC flow logs than others. We can drill down into this further in Discover.</p>
<h3>VPC Flow Log dashboard on Elastic Observability</h3>
<p>Finally, Elastic also provides an OOTB dashboard showing the top IP addresses hitting your VPC, where they are coming from geographically, the time series of the flows, and a summary of VPC Flow Log rejects within the time frame.</p>
<p>This is a baseline dashboard that can be enhanced with visualizations you find in Discover, as we reviewed in option 1 (Using Elastic’s Analytics Discover capabilities) above.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-action-geolocation.png" alt="" /></p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of configuring the Elastic Serverless Forwarder and Elastic Observability to ingest data.</p>
<h3>Prerequisites and config</h3>
<p>If you plan on following these steps, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Ensure you have an AWS account with permissions to pull the necessary data from AWS. Specifically, ensure you can configure the agent to pull data from AWS as needed. <a href="https://docs.elastic.co/integrations/aws#requirements">Please look at the documentation for details</a>.</li>
<li>We used <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three-tier app</a> and installed it as instructed in GitHub. (<a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">See blog on ingesting metrics from the AWS services supporting this app</a>.)</li>
<li>Configure and install Elastic’s Serverless Forwarder.</li>
<li>Ensure you turn on VPC Flow Logs for the VPC where the application is deployed and send the logs to Amazon S3.</li>
</ul>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-start-cloud-trial.png" alt="" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>Once logged in to Elastic Cloud, create a deployment on AWS. It’s important to ensure that the deployment is on AWS, because the Elastic Serverless Forwarder connects to an Elasticsearch endpoint that needs to be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-a-deployment.png" alt="" /></p>
<p>Once your deployment is created, make sure you copy the Elasticsearch endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-logs.png" alt="" /></p>
<p>The endpoint should be an AWS endpoint, such as:</p>
<pre><code class="language-bash">https://aws-logs.es.us-east-1.aws.found.io
</code></pre>
<h3>Step 2: Turn on Elastic’s AWS Integrations on AWS</h3>
<p>In your deployment’s Elastic Integration section, go to the AWS integration and select Install AWS assets.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-settings.png" alt="" /></p>
<h3>Step 3: Deploy your application</h3>
<p>Follow the instructions listed in the <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS three-tier app</a> repository on GitHub. The workshop itself is listed <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/85cd2bb2-7f79-4e96-bdee-8078e469752a/en-US">here</a>.</p>
<p>Once you’ve installed the app, get credentials from AWS. This will be needed for Elastic’s AWS integration.</p>
<p>There are several options for credentials:</p>
<ul>
<li>Use access keys directly</li>
<li>Use temporary security credentials</li>
<li>Use a shared credentials file</li>
<li>Use an IAM role Amazon Resource Name (ARN)</li>
</ul>
<p>View more details on specifics around necessary <a href="https://docs.elastic.co/en/integrations/aws#aws-credentials">credentials</a> and <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">permissions</a>.</p>
<h3>Step 4: Send VPC Flow Logs to Amazon S3 and set up Amazon SQS</h3>
<p>In the VPC for the application deployed in Step 3, configure VPC Flow Logs and point them to an Amazon S3 bucket, keeping the log format as the AWS default.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-flow-log.png" alt="" /></p>
<p>Create the VPC Flow log.</p>
<p>Next:</p>
<ul>
<li><a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-getting-started.html">Set up an Amazon SQS queue</a></li>
<li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html">Configure Amazon S3 event notifications</a></li>
</ul>
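<p>For reference, Step 4 can also be scripted with the AWS CLI. The sketch below is illustrative only: the VPC ID, queue name, region, and account ID are placeholders to substitute with your own values, and the SQS queue’s access policy must additionally allow S3 to send messages to it.</p>
<pre><code class="language-bash"># Create a flow log for the VPC, delivered to S3 in the AWS default format
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::bshettisartest

# Create the SQS queue that will receive S3 event notifications
aws sqs create-queue --queue-name vpc-flow-logs-queue

# Notify the queue whenever a new flow log object lands in the bucket
aws s3api put-bucket-notification-configuration \
  --bucket bshettisartest \
  --notification-configuration '{
    "QueueConfigurations": [{
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'
</code></pre>
<p>The console steps linked above achieve the same result; the CLI form is simply easier to repeat across environments.</p>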
<h3>Step 5: Set up Elastic Serverless Forwarder on AWS</h3>
<p>Follow instructions listed in <a href="https://www.elastic.co/guide/en/observability/8.5/aws-deploy-elastic-serverless-forwarder.html">Elastic’s documentation</a> and refer to the <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">previous blog</a> providing an overview. The important bits during the configuration in Lambda’s application repository are to ensure you:</p>
<ul>
<li>Specify the S3 Bucket in ElasticServerlessForwarderS3Buckets where the VPC Flow Logs are being sent. The value is the ARN of the S3 Bucket you created in Step 4.</li>
<li>Specify the configuration file path in ElasticServerlessForwarderS3ConfigFile. The value is the S3 URL in the format &quot;s3://bucket-name/config-file-name&quot; pointing to the configuration file (sarconfig.yaml).</li>
<li>Specify the S3 SQS Notifications queue used as the trigger of the Lambda function in ElasticServerlessForwarderS3SQSEvents. The value is the ARN of the SQS Queue you set up in Step 4.</li>
</ul>
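<p>As an illustration of the configuration file referenced above, a minimal <code>sarconfig.yaml</code> might look like the sketch below. The field names follow the Elastic Serverless Forwarder documentation; the queue ARN, Elasticsearch endpoint, API key, and data stream name are placeholder values.</p>
<pre><code class="language-bash"># Write a minimal forwarder configuration locally (placeholder values throughout)
cat > sarconfig.yaml <<'EOF'
inputs:
  - type: s3-sqs
    id: arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: https://aws-logs.es.us-east-1.aws.found.io
          api_key: <your-api-key>
          es_datastream_name: logs-aws.vpcflow-default
EOF

# Sanity-check the file before uploading
grep 'type:' sarconfig.yaml
</code></pre>
<p>You would then upload the file to the bucket referenced by <code>ElasticServerlessForwarderS3ConfigFile</code>, for example with <code>aws s3 cp sarconfig.yaml s3://bucket-name/sarconfig.yaml</code>.</p>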
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-application-settings.png" alt="" /></p>
<p>Once AWS CloudFormation finishes setting up the Elastic Serverless Forwarder, you should see two AWS Lambda functions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-functions.png" alt="" /></p>
<p>To check whether logs are coming in, go to the function with <strong>ApplicationElasticServer</strong> in its name, open the <strong>Monitor</strong> tab, and look at the <strong>logs</strong>. You should see the logs being pulled from S3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-function-overview.png" alt="" /></p>
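<p>If you prefer the command line, the same check can be done by tailing the forwarder’s log group. The log group name below is an assumption, so substitute the full name of your function containing <strong>ApplicationElasticServer</strong>.</p>
<pre><code class="language-bash"># Follow the forwarder's logs live (substitute your actual function name)
aws logs tail /aws/lambda/your-ApplicationElasticServer-function-name --follow
</code></pre>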
<h3>Step 6: Check and ensure you have logs in Elastic</h3>
<p>Now that steps 1–5 are complete, you can go to Elastic’s Discover capability, where you should see VPC Flow Logs coming in. In the image below, we’ve filtered by the Amazon S3 bucket <strong>bshettisartest</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-log-dashboard-filter.png" alt="" /></p>
<h2>Conclusion: Elastic Observability easily integrates with VPC Flow Logs for analytics, alerting, and insights</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help you manage AWS VPC Flow Logs. Here’s a quick recap of what you learned:</p>
<ul>
<li>A walk-through of how Elastic Observability provides enhanced analysis for VPC Flow Logs:
<ul>
<li>Using Elastic’s Analytics Discover capabilities to manually analyze the data</li>
<li>Leveraging Elastic Observability’s anomaly features to:
<ul>
<li>Identify anomalies in the VPC flow logs</li>
<li>Detect anomalous log entry rates</li>
<li>Automatically categorize log messages</li>
</ul>
</li>
<li>Using an OOTB dashboard to further analyze data</li>
</ul>
</li>
<li>A more detailed walk-through of how to set up the Elastic Serverless Forwarder</li>
</ul>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Web Frontend Instrumentation and Monitoring with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/web-frontend-instrumentation-with-opentelemetry</link>
            <guid isPermaLink="false">web-frontend-instrumentation-with-opentelemetry</guid>
            <pubDate>Mon, 04 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how frontend instrumentation differs from backend, and the current state of client web instrumentation in OpenTelemetry]]></description>
            <content:encoded><![CDATA[<p>DevOps, SRE and software engineering teams all require telemetry data to understand what's going on across their infrastructure and full-stack applications. Indeed, we have covered instrumentation of backend services in several language ecosystems using <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel) in the past. Yet for frontends, teams often still rely on RUM agents, or sadly on no instrumentation at all, due to the subtle differences in the metrics needed to understand what's going on.</p>
<p>In this blog, we will discuss the current state of client instrumentation for the browser, along with an example showing how to instrument a simple JavaScript frontend using <a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">the OpenTelemetry browser instrumentation</a>. Furthermore, we'll also share how the baggage propagators help us build a full picture of what is going on across the entire application by connecting backend traces with frontend signals. If you want to dive straight into the code, check out the repo <a href="https://github.com/carlyrichmond/otel-record-store">here</a>.</p>
<h2>Application Overview</h2>
<p>The application that we use for this blog is called <a href="https://github.com/carlyrichmond/otel-record-store">OTel Record Store</a>: a simple web application written with Svelte and JavaScript (although our implementation is compatible with other web frameworks), communicating with a Java backend. Both send telemetry signals to an Elastic backend.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/1-otel-frontend-sample-architecture.png" alt="Architecture" /></p>
<p>Eagle-eyed readers will notice that signals from our frontend pass through a proxy and collector. The proxy is required to ensure that the appropriate Cross-Origin headers are populated to allow the signals to pass into Elastic, as well as for traditional reasons such as security, privacy and access control:</p>
<pre><code class="language-nginx">events {}

http {

  server {

    listen 8123; 

    # Traces endpoint exposed as example, others available in code repo
    location /v1/traces {
      proxy_pass http://host.docker.internal:4318;
      # Apply CORS headers to ALL responses, including POST
      add_header 'Access-Control-Allow-Origin' 'http://localhost:4173' always;
      add_header 'Access-Control-Allow-Methods' 'POST, OPTIONS' always;
      add_header 'Access-Control-Allow-Headers' 'Content-Type' always;
      add_header 'Access-Control-Allow-Credentials' 'true' always;

      # Preflight requests receive a 204 No Content response
      if ($request_method = OPTIONS) {
        return 204;
      }
    }
  }
}
</code></pre>
<p>While collectors can also be used to add headers, in this example we have left the collector to perform traditional tasks such as routing and processing.</p>
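<p>To sanity-check the proxy's CORS behaviour before wiring up the browser, you can replay a preflight request by hand with <code>curl</code> while the proxy is running; the origin and port below match the nginx configuration above.</p>
<pre><code class="language-bash"># Simulate a browser preflight; per the config above, the proxy should answer
# 204 No Content with the Access-Control-Allow-* headers attached
curl -i -X OPTIONS http://localhost:8123/v1/traces \
  -H 'Origin: http://localhost:4173' \
  -H 'Access-Control-Request-Method: POST' \
  -H 'Access-Control-Request-Headers: Content-Type'
</code></pre>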
<h2>Prerequisites</h2>
<p>This example requires an Elastic cluster, run either locally via <a href="https://github.com/elastic/start-local">start-local</a>, in Elastic Cloud, or on Elastic Serverless. Here we use the Managed OTLP endpoint in Elastic Serverless. Any mechanism requires you to specify several key environment variables, listed in the <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/.env-example">.env-example file</a>:</p>
<pre><code class="language-zsh">ELASTIC_ENDPOINT=https://my-elastic-endpoint:443
ELASTIC_API_KEY=my-api-key
</code></pre>
<h3>Running the application</h3>
<p>To run our example, follow the steps in the <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/README.md">project README</a>, summarized below:</p>
<pre><code class="language-zsh"># Terminal 1: backend service, proxy and collector
docker-compose build
docker-compose up

# Terminal 2: frontend and sample telemetry data
cd records-ui
npm install
npm run generate
</code></pre>
<h2>Java Backend Instrumentation</h2>
<p>We will not cover the specifics of instrumentation of Java services with EDOT as there is already a great guide to get started <a href="https://github.com/elastic/elastic-otel-java">in the <code>elastic-otel-java</code> README</a>. The example is here purely for showcasing propagation that is important for investigating UI issues. All you need to know is that we make use of automatic instrumentation, sending logs, metrics and traces via <a href="https://opentelemetry.io/docs/specs/otel/protocol/">OpenTelemetry Protocol, or OTLP</a> using the below environment variables:</p>
<pre><code class="language-zsh">OTEL_RESOURCE_ATTRIBUTES=service.version=1,deployment.environment=dev
OTEL_SERVICE_NAME=record-store-server-java
OTEL_EXPORTER_OTLP_ENDPOINT=$ELASTIC_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey ${ELASTIC_API_KEY}&quot;
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
</code></pre>
<p>The instrumentation is then initialized using the <code>-javaagent</code> option:</p>
<pre><code class="language-dockerfile">ENV JAVA_TOOL_OPTIONS=&quot;-javaagent:./elastic-otel-javaagent-1.2.1.jar&quot;
</code></pre>
<h2>Client Instrumentation</h2>
<p>Now that we have established our prerequisites, let's dive into the instrumentation code for our simple web application. Although we'll cover the implementation in sections, the full solution is available <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/frontend.tracer.ts">here in <code>frontend.tracer.ts</code></a>.</p>
<h3>State of OTel Client Instrumentation</h3>
<p>At time of writing, the <a href="https://opentelemetry.io/docs/languages/js/">OpenTelemetry JavaScript SDK</a> has stable support for metrics and traces, with logs currently under development and therefore subject to breaking changes <a href="https://opentelemetry.io/docs/languages/js/">as listed in their documentation</a>:</p>
<table>
<thead>
<tr>
<th>Traces</th>
<th>Metrics</th>
<th>Logs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Stable</td>
<td>Stable</td>
<td>Development</td>
</tr>
</tbody>
</table>
<p>What differs from many other SDKs is the note warning that client instrumentation for the browser is experimental and mostly unspecified. It is subject to breaking changes, and many pieces such as plugin support for measuring Google Core Web Vitals are in progress, as reflected in the <a href="https://github.com/orgs/open-telemetry/projects/19/views/1">Client Instrumentation SIG project board</a>. In subsequent sections we'll show examples of signal capture, along with browser-specific instrumentations including document load, user interaction and Core Web Vitals capture.</p>
<h3>Resource Definition</h3>
<p>When instrumenting web UIs, we need to establish our UI as an OpenTelemetry <a href="https://opentelemetry.io/docs/languages/js/resources/">Resource</a>. By definition, resources are entities that produce telemetry information. We want to see our UI as an entity in our system that interacts with other entities, which can be specified using the following code:</p>
<pre><code class="language-ts">// Defines a Resource to include metadata like service.name, required by Elastic
import { resourceFromAttributes, detectResources } from '@opentelemetry/resources';

// Experimental detector for browser environment
import { browserDetector } from '@opentelemetry/opentelemetry-browser-detector';

// Provides standard semantic keys for attributes, like service.name
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const detectedResources = detectResources({ detectors: [browserDetector] });
let resource = resourceFromAttributes({
	[ATTR_SERVICE_NAME]: 'records-ui-web',
	'service.version': 1,
	'deployment.environment': 'dev'
});
resource = resource.merge(detectedResources);
</code></pre>
<p>A unique identifier for the service is required, and is common to all SDKs. What differs from other implementations is the inclusion of the <a href="https://www.npmjs.com/package/@opentelemetry/opentelemetry-browser-detector"><code>browserDetector</code></a> which, when merged with our defined resource attributes, adds browser attributes such as platform, brands (e.g. Chrome versus Edge) and whether a mobile browser is being used:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/2-otel-browser-attributes.png" alt="Sample Span JSON with Resource and Browser Attributes" /></p>
<p>Having this information on spans and errors is useful when diagnosing application and dependency compatibility issues with certain browsers (such as Internet Explorer from my time as an engineer 🤦).</p>
<h3>Logs</h3>
<p>Traditionally, frontend engineers rely on the DevTools console of their favourite browser to examine logs. Because UI log messages are only accessible within the browser, rather than forwarded to a file somewhere as is the common pattern with backend services, we lose visibility of this signal when triaging user issues.</p>
<p>OpenTelemetry defines the concept of an <a href="https://opentelemetry.io/docs/concepts/signals/logs/#log-record-exporter">exporter</a>, which allows us to send signals such as logs to a particular destination.</p>
<pre><code class="language-ts">// Get logger and severity constant imports
import { logs, SeverityNumber } from '@opentelemetry/api-logs';

// Provider and batch processor for sending logs
import { BatchLogRecordProcessor, LoggerProvider } from '@opentelemetry/sdk-logs';

// Export logs via OTLP
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';

// Configure logging to send to the collector via nginx
const logExporter = new OTLPLogExporter({
	url: 'http://localhost:8123/v1/logs' // nginx proxy
});

const loggerProvider = new LoggerProvider({
	resource: resource, // see resource initialisation above
	processors: [new BatchLogRecordProcessor(logExporter)]
});

logs.setGlobalLoggerProvider(loggerProvider);
</code></pre>
<p>Once the provider has been initialized, we need to get a hold of the logger to send our log records to Elastic rather than using good ol' <code>console.log('Help!')</code>:</p>
<pre><code class="language-ts">// Example gets logger and sends a message to Elastic
const logger = logs.getLogger('default', '1.0.0');
logger.emit({
	severityNumber: SeverityNumber.INFO,
	severityText: 'INFO',
	body: 'Logger initialized'
});
</code></pre>
<p>Log records will now be visible in the Discover and Logs views, allowing us to search for relevant outages as part of investigations and incidents:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/3-otel-log-discover.png" alt="Sample Logs in Discover" /></p>
<h3>Traces</h3>
<p>The power of traces in diagnosing issues in the UI lies in the visibility of not just what is going on within the web application, but also the connections to, and time taken by, calls to the labyrinth of services behind it. To instrument a web-based application, we need to make use of the <code>WebTracerProvider</code> with the <code>OTLPTraceExporter</code>, in a similar way to how exporters work for logs and metrics:</p>
<pre><code class="language-ts">/* Packages for exporting traces */

// Import the WebTracerProvider, which is the core provider for browser-based tracing
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';

// BatchSpanProcessor forwards spans to the exporter in batches to prevent flooding
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// Import the OTLP HTTP exporter for sending traces to the collector over HTTP
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Configure the OTLP exporter to talk to the collector via nginx
const exporter = new OTLPTraceExporter({
	url: 'http://localhost:8123/v1/traces' // nginx proxy
});

// Instantiate the trace provider and inject the resource
const provider = new WebTracerProvider({
	resource: resource,
	spanProcessors: [
		// Send each completed span through the OTLP exporter
		new BatchSpanProcessor(exporter)
	]
});
</code></pre>
<p>Next we need to register our provider. One thing that's slightly different in the web world is how we configure propagation. <a href="https://opentelemetry.io/docs/concepts/context-propagation/">Context propagation</a> in OpenTelemetry refers to the concept of moving context between services and processes which, in our case, allows us to correlate the web signals with those of backend services. Often this is done automatically. As you will see from the below snippet, there are three concepts that help us with propagation:</p>
<pre><code class="language-ts">// This context manager ensures span context is maintained across async boundaries in the browser
import { ZoneContextManager } from '@opentelemetry/context-zone';

// Context Propagation across signals
import {
	CompositePropagator,
	W3CBaggagePropagator,
	W3CTraceContextPropagator
} from '@opentelemetry/core';

// Provider instantiation code omitted

// Register the provider with propagation and set up the async context manager for spans
provider.register({
	contextManager: new ZoneContextManager(),
	propagator: new CompositePropagator({
		propagators: [new W3CBaggagePropagator(), new W3CTraceContextPropagator()]
	})
});
</code></pre>
<p>The first is the <code>ZoneContextManager</code>, which propagates context such as spans and traces across asynchronous operations. Web developers will be familiar with <a href="https://www.npmjs.com/package/zone.js?activeTab=readme">zone.js</a>, the library used by many JS frameworks to provide an execution context that persists across async tasks.</p>
<p>Additionally, we have combined the <code>W3CBaggagePropagator</code> and <code>W3CTraceContextPropagator</code> using the <code>CompositePropagator</code> to ensure key value pair attributes are passed between signals as per the <a href="https://w3c.github.io/baggage/">W3C specification defined here</a>. In the case of the <code>W3CTraceContextPropagator</code>, it allows the propagation of the <code>traceparent</code> and <code>tracestate</code> HTTP headers as per the <a href="https://www.w3.org/TR/trace-context-2/">specification located here</a>.</p>
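<p>To make the propagation concrete, a <code>traceparent</code> header is simply four dash-separated fields, which the sketch below pulls apart with standard shell tools. The header value is made up for illustration, as real values are generated by the propagator.</p>
<pre><code class="language-bash"># An illustrative traceparent header: version, trace-id, parent-id, trace-flags
traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# Split the header into its four fields
echo "$traceparent" | awk -F- '{ printf "version: %s\ntrace-id: %s\nparent-id: %s\nflags: %s\n", $1, $2, $3, $4 }'
</code></pre>
<p>Every service that honours the W3C specification forwards the same trace-id, which is what lets Elastic stitch the frontend spans and the Java backend spans into one trace.</p>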
<h4>Auto Instrumentation</h4>
<p>The simplest way to start instrumenting a web application is to register the web auto-instrumentations. At time of writing <a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/packages/auto-instrumentations-web#readme">the documentation</a> states that the following instrumentations can be configured via this approach:</p>
<ol>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-document-load">@opentelemetry/instrumentation-document-load</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-fetch">@opentelemetry/instrumentation-fetch</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-user-interaction">@opentelemetry/instrumentation-user-interaction</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-xml-http-request">@opentelemetry/instrumentation-xml-http-request</a></li>
</ol>
<p>Configuration for each instrumentation can be passed to <code>getWebAutoInstrumentations</code>, as shown in the below example configuring the fetch and XMLHttpRequest instrumentations:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Import the auto-instrumentations for web, which includes common libraries, frameworks and document load
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

// Enable automatic instrumentation for fetch and XMLHttpRequest calls
registerInstrumentations({
  instrumentations: [
    getWebAutoInstrumentations({
      '@opentelemetry/instrumentation-fetch': {
        propagateTraceHeaderCorsUrls: /.*/,
        clearTimingResources: true
      },
      '@opentelemetry/instrumentation-xml-http-request': {
        propagateTraceHeaderCorsUrls: /.*/
      }
    })
  ]
});
</code></pre>
<p>Taking the <code>@opentelemetry/instrumentation-fetch</code> instrumentation as an example, we are able to see traces for HTTP requests, and the propagators also ensure that the spans connect with our Java backend services to give a full picture of the time taken to process the request at each stage:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/4-otel-http-get-sample-trace.png" alt="Sample HTTP GET Trace" /></p>
<p>While auto-instrumentation is a great way to register common instrumentations, we can also instantiate instrumentations directly, as we'll see in the remainder of this article.</p>
<h4>Document Load Instrumentation</h4>
<p>Another consideration unique to web frontends is the time taken to load assets such as images, JavaScript files and even stylesheets. Assets that take considerable time to load can impact metrics such as <a href="https://web.dev/articles/fcp">First Contentful Paint</a>, and therefore the user experience. The <a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-document-load">OTel Document Load instrumentation</a> automatically instruments the time taken to load assets when using the <a href="https://www.npmjs.com/package/@opentelemetry/sdk-trace-web">@opentelemetry/sdk-trace-web</a> package.</p>
<p>It is simply a case of adding the instrumentation to the <code>instrumentations</code> array passed to <code>registerInstrumentations</code>:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations like page load and user interaction
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Document Load Instrumentation automatically creates spans for document load events
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';

// Configuration discussed above omitted

// Enable automatic span generation for document load and user click interactions
registerInstrumentations({
  instrumentations: [
    // Automatically tracks when the document loads
    new DocumentLoadInstrumentation({
      ignoreNetworkEvents: false,
      ignorePerformancePaintEvents: false
    }),
    // Other instrumentations omitted
  ]
});
</code></pre>
<p>This configuration will create a new trace conveniently named <code>documentLoad</code>, which will show us the time taken to load resources within the document, similar to the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/5-otel-document-load-example-trace.png" alt="Sample documentLoad Trace" /></p>
<p>Each span will have metadata attached to help us identify which resources are taking considerable time to load, such as this image example, where the resource takes <strong>837ms</strong> to load:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/6-otel-document-load-http-url-metadata.png" alt="documentLoad Trace Metadata" /></p>
<h4>Click Events</h4>
<p>You may wonder why we want to capture user interactions with web applications for diagnostic purposes. Being able to see the trigger points for errors can be useful in incidents to establish a timeline of what happened, and to determine whether users are indeed being impacted, as is the case with Real User Monitoring tools. But if we also consider the field of Digital Experience Monitoring, or DEM, software teams need details on the usage of application features to understand the user journey and how it could be improved in a data-driven way. Capturing user events is required for both.</p>
<p>The <a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-user-interaction">OTel UserInteraction instrumentation for web</a> is how we capture these events. Like the document load instrumentation, it depends on the <a href="https://www.npmjs.com/package/@opentelemetry/sdk-trace-web">@opentelemetry/sdk-trace-web</a> package, and when used with <code>zone.js</code> and the <code>ZoneContextManager</code> it also supports async operations.</p>
<p>Like other instrumentations it is added via <code>registerInstrumentations</code>:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations like page load and user interaction
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Automatically creates spans for user interactions like clicks
import { UserInteractionInstrumentation } from '@opentelemetry/instrumentation-user-interaction';

// Configuration discussed above omitted

// Enable automatic span generation for document load and user click interactions
registerInstrumentations({
  instrumentations: [
    // User events
    new UserInteractionInstrumentation({
      eventNames: ['click', 'input'] // instrument click and input events only
    }),
    // Other instrumentations omitted
  ]
});
</code></pre>
<p>It will capture and label spans for the user events we configure and, leveraging the propagators configured previously, can connect spans from other resources to the user event, as in the below example where we see the service call to get records when the user adds a search term to the <code>input</code> box:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/7-otel-user-interaction-input-sample-trace.png" alt="User Interaction input Sample Trace" /></p>
<h3>Metrics</h3>
<p>There are numerous measurements that serve as useful indicators of the availability and performance of web applications, such as latency, throughput or the number of 404 errors. <a href="https://developers.google.com/search/docs/appearance/core-web-vitals">Google Core Web Vitals</a> are a set of standard metrics used by web developers to measure the real-world user experience of web sites, including loading performance, reactivity to user input and visual stability. Given that, at the time of writing, <a href="https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1461">the Core Web Vitals plugin for OTel Browser is on the backlog</a>, let's try building our own custom instrumentation using the <a href="https://www.npmjs.com/package/web-vitals">web-vitals JS library</a> to capture these as <a href="https://opentelemetry.io/docs/concepts/signals/metrics/">OTel metrics</a>.</p>
<p>In OpenTelemetry you can create your own custom instrumentation by extending the <code>InstrumentationBase</code>, overriding the <code>constructor</code> to create the <code>MeterProvider</code>, <code>Meter</code> and <code>OTLPMetricExporter</code> that will allow us to send our Core Web Vital measurements to Elastic via our proxy, as presented in <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/web-vitals.instrumentation.ts"><code>web-vitals.instrumentation.ts</code></a>. Note that below we show only the LCP meter for succinctness, but the full example <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/web-vitals.instrumentation.ts">here</a> measures all web vitals.</p>
<pre><code class="language-ts">/* OpenTelemetry JS packages */
// Instrumentation base to create a custom Instrumentation for our provider
import {
	InstrumentationBase,
	type InstrumentationConfig,
	type InstrumentationModuleDefinition
} from '@opentelemetry/instrumentation';

// Metrics API
import {
	metrics,
	type ObservableGauge,
	type Meter,
	type Attributes,
	type ObservableResult
} from '@opentelemetry/api';

// Metrics SDK provider and reader, plus the OTLP metric exporter (imports added)
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

// Resource type used in the constructor
import type { Resource } from '@opentelemetry/resources';

export class WebVitalsInstrumentation extends InstrumentationBase {

  // Meter captures measurements at runtime
	private cwvMeter: Meter;

	/* Core Web Vitals Measures, LCP provided, others omitted */
	private lcp: ObservableGauge;

	constructor(config: InstrumentationConfig, resource: Resource) {
		super('WebVitalsInstrumentation', '1.0', config);

    // Create metric reader to process metrics and export using OTLP
		const metricReader = new PeriodicExportingMetricReader({
			exporter: new OTLPMetricExporter({
				url: 'http://localhost:8123/v1/metrics' // nginx proxy
			}),
			// Default is 60000ms (60 seconds).
			// Set to 10 seconds for demo purposes only.
			exportIntervalMillis: 10000
		});

    // Creating Meter Provider factory to send metrics
		const myServiceMeterProvider = new MeterProvider({
			resource: resource,
			readers: [metricReader]
		});
		metrics.setGlobalMeterProvider(myServiceMeterProvider);

    // Create web vitals meter
		this.cwvMeter = metrics.getMeter('core-web-vitals', '1.0.0');

		// Initialising CWV metric gauge instruments (LCP given as example, others omitted here)
		this.lcp = this.cwvMeter.createObservableGauge('lcp', { unit: 'ms', description: 'Largest Contentful Paint' });
	}

	protected init(): InstrumentationModuleDefinition | InstrumentationModuleDefinition[] | void {}

  // Other steps discussed later
}
</code></pre>
<p>You'll notice in our LCP example we have created an <code>ObservableGauge</code> to capture the value at the time it is read via a callback function. This can be set up when we <code>enable</code> our custom instrumentation, specifying that when the LCP event is triggered, the value is sent via <code>result.observe</code>:</p>
<pre><code class="language-ts">/* Web Vitals Frontend package, LCP shown as example*/
import { onLCP, type LCPMetric, type CLSMetric, type INPMetric, type TTFBMetric, type FCPMetric } from 'web-vitals';

/* OpenTelemetry JS packages */
// Instrumentation base to create a custom Instrumentation for our provider
import {
	InstrumentationBase,
	type InstrumentationConfig,
	type InstrumentationModuleDefinition
} from '@opentelemetry/instrumentation';

// Metrics API
import {
	metrics,
	type ObservableGauge,
	type Meter,
	type Attributes,
	type ObservableResult
} from '@opentelemetry/api';
 
// Other OTel Metrics imports omitted

// Time calculator via performance component
import { hrTime } from '@opentelemetry/core';

type CWVMetric = LCPMetric | CLSMetric | INPMetric | TTFBMetric | FCPMetric;

export class WebVitalsInstrumentation extends InstrumentationBase {

	/* Core Web Vitals Measures */
	private lcp: ObservableGauge;

	// Constructor and Initialization omitted

	enable() {
		// Capture Largest Contentful Paint, other vitals omitted
		onLCP(
			(metric) =&gt; {
				this.lcp.addCallback((result) =&gt; {
					this.sendMetric(metric, result);
				});
			},
			{ reportAllChanges: true }
		);
	}

  // Callback utility to add attributes and send captured metric
	private sendMetric(metric: CWVMetric, result: ObservableResult&lt;Attributes&gt;): void {
		const now = hrTime();

		const attributes = {
			startTime: now,
			'web_vital.name': metric.name,
			'web_vital.id': metric.id,
			'web_vital.navigationType': metric.navigationType,
			'web_vital.delta': metric.delta,
			'web_vital.value': metric.value,
			'web_vital.rating': metric.rating,
			// metric specific attributes
			'web_vital.entries': JSON.stringify(metric.entries)
		};

		result.observe(metric.value, attributes);
	}
}
</code></pre>
<p>To use our custom instrumentation, we register it just as we did in <code>frontend.tracer.ts</code> for the available web instrumentations that capture document-load and user-event telemetry:</p>
<pre><code class="language-ts">registerInstrumentations({
  instrumentations: [
    // Other web instrumentations omitted
    // Custom Web Vitals instrumentation
    new WebVitalsInstrumentation({}, resource)
    ]
});
</code></pre>
<p>The <code>lcp</code> metric, along with the attributes we specified in our <code>sendMetric</code> function, will be sent to our Elastic cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/8-otel-metric-elastic-discover-view.png" alt="LCP Metric in Discover" /></p>
<p>These metrics will not feed into the <a href="https://www.elastic.co/docs/solutions/observability/applications/user-experience">User Experience dashboard</a>, which is populated by the Elastic RUM agent, but we can create a dashboard leveraging the values to show the trends of each of our vitals:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/9-otel-core-web-vitals-dashboard.png" alt="Sample Core Web Vitals Dashboard" /></p>
<h2>Summary</h2>
<p>In this blog, we presented the current state of client instrumentation for the browser, along with an example showing how to instrument a simple JavaScript frontend using <a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">the OpenTelemetry browser instrumentation</a>. To reflect back on the code, check out the repo <a href="https://github.com/carlyrichmond/otel-record-store">here</a>. If you have any questions or want to learn from other developers, connect with the <a href="https://www.elastic.co/community">Elastic Community</a>.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://github.com/carlyrichmond/otel-record-store">OTel Record Store Application</a></li>
<li><a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">JavaScript Browser Instrumentation</a></li>
</ul>
</blockquote>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/web-blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Unlocking whole-system visibility with Elastic Universal Profiling™]]></title>
            <link>https://www.elastic.co/observability-labs/blog/whole-system-visibility-elastic-universal-profiling</link>
            <guid isPermaLink="false">whole-system-visibility-elastic-universal-profiling</guid>
            <pubDate>Mon, 25 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Visual profiling data can be overwhelming. This blog post aims to demystify continuous profiling and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling™.]]></description>
            <content:encoded><![CDATA[<h2>Identify, optimize, measure, repeat!</h2>
<p>SREs and developers who want to maintain robust, efficient systems need effective tools to measure and improve code performance. Profilers are invaluable for these tasks, as they can help you boost your app's throughput, ensure consistent system reliability, and gain a deeper understanding of your code's behavior at runtime. However, traditional profilers can be cumbersome to use, as they often require code recompilation and are limited to specific languages. They also tend to have a high overhead that negatively affects performance and makes them less suitable for quick, real-time debugging in production environments.</p>
<p>To address the limitations of traditional profilers, Elastic<sup>®</sup> recently <a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">announced the general availability of Elastic Universal Profiling</a>, a <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> product that is refreshingly straightforward to use, eliminating the need for instrumentation, recompilations, or restarts. Moreover, Elastic Universal Profiling does not require on-host debug symbols and is language-agnostic, allowing you to profile any process running on your machines — from your application's code to third-party libraries and even kernel functions.</p>
<p>However, even the most advanced tools require a certain level of expertise to interpret the data effectively. The wealth of visual profiling data — flamegraphs, stacktraces, or functions — can initially seem overwhelming. This blog post aims to demystify <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling.</p>
<p>Let’s begin.</p>
<h2>Stacktraces: The cornerstone for profiling</h2>
<h3>It all begins with a stacktrace — a snapshot capturing the cascade of function calls.</h3>
<p>A stacktrace is a snapshot of the call stack of an application at a specific point in time. It captures the sequence of function calls that the program has made up to that point. In this way, a stacktrace serves as a historical record of the call stack, allowing you to trace back the steps that led to a particular state in your application.</p>
<p>Further, stacktraces are the foundational data structure that profilers rely on to determine what an application is executing at any given moment. This is particularly useful when, for instance, your infrastructure monitoring indicates that your application servers are consuming 95% of CPU resources. While utilities such as 'top -H' can show the top processes that are consuming CPU, they lack the granularity needed to identify the specific lines of code (in the top process) responsible for the high usage.</p>
<p>In the case of Elastic Universal Profiling, <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF is used</a> to perform sampling of every process that is keeping a CPU core busy. Unlike most instrumentation profilers that focus solely on your application code, Elastic Universal Profiling provides whole-system visibility — it profiles not just your code, but also code you don't own, including third-party libraries and even kernel operations.</p>
<p>The diagram below shows how the Universal Profiling agent works at a very high level. Step 5 indicates the ingestion of the stacktraces into the profiling collector, a new part of the Elastic Stack.</p>
<p>_ <strong>Just</strong> _ <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">_ <strong>deploy the profiling host agent</strong> _</a> <em><strong>and receive profiling data (in Kibana</strong></em>&lt;sup&gt;&lt;em&gt;®&lt;/em&gt;&lt;/sup&gt;<em><strong>) a few minutes later.</strong></em> <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">_ <strong>Get started now</strong> _</a>_ <strong>.</strong> _</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-1-flowchart-linux.png" alt="High-level depiction of how the profiling agent works" /></p>
<ol>
<li>
<p>Unwinder eBPF programs (bytecode) are sent to the kernel.</p>
</li>
<li>
<p>The kernel verifies that the BPF program is safe. If accepted, the program is attached to the probes and executed when the event occurs.</p>
</li>
<li>
<p>The eBPF programs pass the collected data to userspace via maps.</p>
</li>
<li>
<p>The agent reads the collected data from maps. The data transferred from the agent to the maps are process-specific and interpreter-specific meta-information that help the eBPF unwinder programs perform unwinding.</p>
</li>
<li>
<p>Stacktraces, metrics, and metadata are pushed to the Elastic Stack.</p>
</li>
<li>
<p>Visualize data as flamegraphs, stacktraces, and functions via Kibana.</p>
</li>
</ol>
<p>While stacktraces are the key ingredient for most profiling tools, interpreting them can be tricky. Let's take a look at a simple example to make things a bit easier. The table below shows a group of stacktraces from a Java application and assigns each a percentage to indicate its share of CPU time consumption.</p>
<p><strong>Table 1: Grouped Stacktraces with CPU Time Percentage</strong></p>
<table>
<thead>
<tr>
<th>Percentage</th>
<th>Function Calls</th>
</tr>
</thead>
<tbody>
<tr>
<td>60%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction</td>
</tr>
<tr>
<td>20%</td>
<td>startApp -&gt; loadAccountDetails -&gt; fetchRecentTransactions</td>
</tr>
<tr>
<td>10%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; verifyFunds</td>
</tr>
<tr>
<td>2%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so</td>
</tr>
<tr>
<td>1%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so -&gt; vmlinux: asm_common_interrupt -&gt; vmlinux: asm_sysvec_apic_timer_interrupt</td>
</tr>
</tbody>
</table>
<p>The percentages above represent the relative frequency of each specific stacktrace compared to the total number of stacktraces collected over the observation period, not actual CPU usage percentages. Also, the libjvm.so and kernel frames (vmlinux:*) in the example are commonly observed with whole-system profilers like Elastic Universal Profiling.</p>
<p>From the table, we can see that <strong>60%</strong> of the time is spent in the sequence startApp; authenticateUser; processTransaction. An additional <strong>10%</strong> of the processing time is allocated to verifyFunds, a function invoked by processTransaction. Given these observations, it becomes evident that optimization initiatives would yield the most impact if centered on the processTransaction function, as it is one of the most expensive functions. However, real-world stacktraces can be far more intricate than this example. So how do we make sense of them quickly? The answer to this problem led to the creation of flamegraphs.</p>
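<p>The grouping behind Table 1 can be sketched in a few lines of TypeScript. This is an illustrative model only, not the profiler's implementation; the sample counts and function names are chosen to mirror the table:</p>

```typescript
// Illustrative sketch: how raw stacktrace samples are grouped into the
// relative-frequency table above. Counts roughly mirror Table 1.
type Stacktrace = string[];

const samples: Stacktrace[] = [
	...Array(6).fill(['startApp', 'authenticateUser', 'processTransaction']),
	...Array(2).fill(['startApp', 'loadAccountDetails', 'fetchRecentTransactions']),
	['startApp', 'authenticateUser', 'processTransaction', 'verifyFunds'],
	['startApp', 'authenticateUser', 'processTransaction', 'libjvm.so']
];

// Group identical stacks, then express each group as a share of all samples
function aggregate(traces: Stacktrace[]): Map<string, number> {
	const counts = new Map<string, number>();
	for (const trace of traces) {
		const key = trace.join(' -> ');
		counts.set(key, (counts.get(key) ?? 0) + 1);
	}
	const shares = new Map<string, number>();
	for (const [key, count] of counts) {
		shares.set(key, (count / traces.length) * 100);
	}
	return shares;
}

console.log(aggregate(samples).get('startApp -> authenticateUser -> processTransaction')); // 60
```

Each stacktrace's percentage is simply its frequency relative to the total number of samples, which is why the figures indicate relative, not absolute, CPU usage.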
<h2>Flamegraphs: A visualization of stacktraces</h2>
<p>While the above example may appear straightforward, it scarcely reflects the complexities encountered when aggregating multiple stacktraces across a fleet of machines on a continuous basis. The depth of the stack traces and the numerous branching paths can make it increasingly difficult to pinpoint where code is consuming resources. This is where flamegraphs, a concept popularized by <a href="https://www.brendangregg.com/flamegraphs.html">Brendan Gregg</a>, come into play.</p>
<p>A flamegraph is a visual interpretation of stacktraces, designed to quickly and accurately identify the functions that are consuming the most resources. Each function is represented by a rectangle, where the width of the rectangle represents the amount of time spent in the function, and the number of stacked rectangles represents the stack depth. The stack depth is the number of functions that were called to reach the current function.</p>
<p>Elastic Universal Profiling uses icicle graphs, an inverted variant of the standard flamegraph. In an icicle graph, the root function is at the top, and its child functions are shown below their parents, making it easier to see the hierarchy of functions and how they relate to each other.</p>
<p>In most flamegraphs, the y-axis represents stack depth, but there is no standardization for the x-axis. Some profiling tools use the x-axis to indicate the passage of time; in these instances, the graph is more accurately termed a flame chart. Others sort the x-axis alphabetically. Universal Profiling sorts functions on the x-axis based on relative CPU percentage utilization, starting with the function that consumes the most CPU time on the left, as shown in the example icicle graph below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-2-cpu-time.png" alt="Example icicle graph: The percentage represents relative CPU time, not the real CPU usage time. " /></p>
<h2>Debugging and optimizing performance issues: Stacktraces, TopN functions, flamegraphs</h2>
<p>SREs and SWEs can use Universal Profiling for troubleshooting, debugging, and performance optimization. It builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher-level runtimes, enabling you to <strong>identify performance regressions</strong>, <strong>reduce wasteful computations</strong>, and <strong>debug complex issues faster</strong>.</p>
<p>To this end, Universal Profiling offers three main visualizations: Stacktraces, TopN Functions, and flamegraphs.</p>
<h3>Stacktrace view</h3>
<p>The stacktraces view shows grouped stacktrace graphs by threads, hosts, Kubernetes deployments, and containers. It can be used to detect unexpected CPU spikes across threads and drill down into a smaller time range to investigate further with a flamegraph. Refer to the <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-stacktraces-intro">documentation</a> for details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-3-wave-patterns.png" alt="Notice the wave pattern in the stacktrace view, enabling you to drill down into a CPU spike " /></p>
<h3>TopN functions view</h3>
<p>Universal Profiling's topN functions view shows the most frequently sampled functions, broken down by CPU time, annualized CO<sub>2</sub>, and annualized cost estimates. You can use this view to identify the most expensive functions across your entire fleet, and then apply filters to focus on specific components for a more detailed analysis. Clicking on a function name will redirect you to the flamegraph, enabling you to examine the call hierarchy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-4-topN-functions-page.png" alt="TopN functions page" /></p>
<h3>Flamegraphs view</h3>
<p>The flamegraph page is where you will likely spend the most time, especially when debugging and optimizing. We recommend using the guide below to identify performance bottlenecks and optimization opportunities with flamegraphs. The three key properties to look for are <strong>width</strong>, <strong>hierarchy</strong>, and <strong>height</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-5-icivle-flamegraph.png" alt="Icicle flamegraph: We use the colors to determine different types of code (e.g., native, interpreted, kernel)." /></p>
<p><strong>Width matters:</strong> In icicle graphs, wider rectangles signify functions taking up more CPU time. Always read the graph from left to right and note the widest rectangles, as these are the prime hot spots.</p>
<p><strong>Hierarchy matters:</strong> Navigate the graph's stack to understand function relationships. This vertical examination will help you identify whether one or multiple functions are responsible for performance bottlenecks. This could also uncover opportunities for code improvements, such as swapping an inefficient library or avoiding unnecessary I/O operations.</p>
<p><strong>Height matters:</strong> Elevated or tall stacks in the graph usually point to deep call hierarchies. These can be an indicator of complex and less efficient code structures that may require attention.</p>
<p>Also, when navigating a flamegraph, you may want to search for specific function names to validate your assumptions about their presence. In the Universal Profiling flamegraphs view, there is a “Search” bar at the bottom left corner. You can input a regex, and matches will be highlighted in the flamegraph; by clicking the left and right arrows next to the Search bar, you can move across the occurrences and spot the callers and callees of the matched function.</p>
<p>In summary,</p>
<ul>
<li><strong>Scan</strong> horizontally from left to right, focusing on width for CPU-intensive functions.</li>
<li><strong>Examine</strong> vertically to examine the stack and spot bottlenecks.</li>
<li><strong>Look</strong> for <strong>towering stacks</strong> to identify potential complexities in the code.</li>
</ul>
<p>To recap, use topN functions to generate optimization hypotheses and validate them with stacktraces and/or flamegraphs. Use stacktraces to monitor CPU utilization trends and to delve into the finer details. Use flamegraphs to quickly debug and optimize your code, using width, hierarchy, and height as guides.</p>
<p><strong><em>Identify. Optimize. Measure. Repeat!</em></strong></p>
<h2>Measure the impact of your change</h2>
<h3>For the very first time in history, developers can now measure the performance (gained or lost), cloud cost, and carbon footprint impact of every deployed change.</h3>
<p>Once you have identified a performance issue and applied fixes or optimizations to your code, it is essential to measure the impact of your changes. The differential topN functions and differential flamegraph pages are invaluable for this, as they can help you identify regressions and measure your change impact not only in terms of performance but also in terms of carbon emissions and cost savings.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-6-uni-profiling.png" alt="A differential function view, showing the performance, CO2, and cost impact of a change" /></p>
<p>The Diff column indicates a change in the function’s rank.</p>
<p>You may need to use tags or other metadata, such as container and deployment name, in combination with time ranges to differentiate between the optimized and non-optimized changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-7-differential-flamegraph.png" alt="A differential flamegraph showing regression in A/B testing" /></p>
<h2>Universal Profiling: The key to optimizing application resources</h2>
<p>Computational efficiency is no longer just a nice-to-have, but a must-have from both a financial and environmental sustainability perspective. Elastic Universal Profiling provides unprecedented visibility into the runtime behavior of all your applications, so you can identify and optimize the most resource-intensive areas of your code. The result is not merely better-performing software but also reduced resource consumption, lower cloud costs, and a reduction in carbon footprint. Optimizing your code with Universal Profiling is not only the right thing to do for your business, it’s the right thing to do for our world.</p>
<p><a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">Get started</a> with Elastic Universal Profiling today.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/universal-profiling-blog-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Windows Event Log Monitoring with OpenTelemetry & Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/windows-event-monitoring-with-opentelemetry-and-elastic-streams</link>
            <guid isPermaLink="false">windows-event-monitoring-with-opentelemetry-and-elastic-streams</guid>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to enhance Windows Event Log monitoring with OpenTelemetry for standardized ingestion and Elastic Streams for smart partitioning and analysis.]]></description>
            <content:encoded><![CDATA[<p>For system administrators and SREs, Windows Event Logs are both a goldmine and a graveyard. They contain the critical data needed to diagnose the root cause of a server crash or a security breach, but they are often buried under gigabytes of noise. Traditionally, extracting value from these logs required brittle regex parsers, manual rule creation, and a significant amount of human intuition.</p>
<p>However, the landscape of log management is shifting. By combining the industry-standard ingestion of OpenTelemetry (OTel) with the AI-driven capabilities of Elastic Streams, we can change how we monitor Windows infrastructure. This approach isn't just about moving data; it also uses Large Language Models (LLMs) to understand it.</p>
<h2>The Challenge with Traditional Windows Logging</h2>
<p>Windows generates a massive variety of logs: System, Security, Application, Setup, and Forwarded Events. Within those categories, you have thousands of Event IDs. Historically, getting this data into an observability platform involved installing proprietary agents and configuring complex pipelines to strip out the XML headers and format the messages.</p>
<p>Once the data was ingested, you still had to figure out what &quot;bad&quot; looked like. You had to know in advance that Event ID 7031 indicated a service crash, and then write a specific alert for it. If you missed a specific Event ID, or if the format changed, your monitoring went dark.</p>
<h2>Step 1: Ingestion via OpenTelemetry</h2>
<p>The first step in modernizing this workflow is adopting OpenTelemetry. The OTel collector has matured significantly and now offers robust support for Windows environments. By installing the collector directly on Windows servers, you can configure receivers to tap into the event log subsystems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/otel-config.png" alt="OTel collector configuration for Windows Event Logs" /></p>
<p>The beauty of this approach is standardization. You aren't locked into a vendor-specific shipping agent. The OTel collector acts as a universal router, grabbing the logs and sending them to your observability backend; in this case, the Elastic <code>logs</code> index, which is designed to handle high-throughput streams.</p>
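<p>As a starting point, a minimal collector configuration for this setup might look like the following. This is a sketch only, assuming the otelcol-contrib distribution (which bundles the <code>windowseventlog</code> receiver); the channels, endpoint, and API key environment variable are placeholders you would replace with your own:</p>

```yaml
# Sketch only: assumes the otelcol-contrib distribution, which includes
# the windowseventlog receiver. Endpoint and API key are placeholders.
receivers:
  windowseventlog/system:
    channel: system
  windowseventlog/application:
    channel: application

exporters:
  otlphttp:
    endpoint: "https://my-deployment.ingest.elastic.cloud:443" # placeholder
    headers:
      Authorization: "ApiKey ${env:ELASTIC_API_KEY}"

service:
  pipelines:
    logs:
      receivers: [windowseventlog/system, windowseventlog/application]
      exporters: [otlphttp]
```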
<p>The key thing to pay attention to in this configuration is how we add this transform statement:</p>
<pre><code class="language-yaml">transform/logs-streams:
  log_statements:
    - context: resource
      statements:
        - set(attributes[&quot;elasticsearch.index&quot;], &quot;logs&quot;)
</code></pre>
<p>This works with the vanilla OpenTelemetry Collector. When the data arrives in Elastic, this attribute tells Elastic to use the new wired streams feature, which enables all the downstream AI capabilities we discuss in the later steps.</p>
<p>Check out my example configuration <a href="https://github.com/davidgeorgehope/otel-collector-windows/blob/main/config.yaml">here</a>.</p>
<h2>Step 2: AI-Driven Partitioning</h2>
<p>Once the data arrives, the next challenge is organization. Dumping all Windows logs into a single <code>logs-*</code> index is a recipe for slow queries and confusion. In the past, we split indices based on hardcoded fields. Now, we can use AI to &quot;fingerprint&quot; the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/ai-partitioning.png" alt="AI-driven partitioning of Windows logs" /></p>
<p>This process involves analyzing the incoming stream to identify patterns. The system looks at the structure and content of the logs to determine their origin. For example, it can distinguish between a <code>Windows Security Audit</code> log and a <code>Service Control Manager</code> log purely based on the data shape.</p>
<p>The result is automatic partitioning. The system creates separate, optimized &quot;buckets&quot; or streams for each data type. You get a clean separation of concerns: Security logs go to one stream and File Manager logs to another, without writing a single conditional routing rule. This partitioning is crucial for performance and for the next phase of the process: analysis.</p>
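<p>To make the idea concrete, the routing outcome can be modeled with a simple grouping by event provider. This is an illustrative sketch only; Streams fingerprints logs with AI rather than a hardcoded key, and the event shape and stream names below are assumptions, not the product schema:</p>

```typescript
// Illustrative sketch: models the *outcome* of partitioning, not the AI
// fingerprinting itself. Event shape and stream names are assumptions.
interface WindowsEvent {
	provider: string; // e.g. 'Service Control Manager'
	eventId: number;
	message: string;
}

function partitionByProvider(events: WindowsEvent[]): Map<string, WindowsEvent[]> {
	const streams = new Map<string, WindowsEvent[]>();
	for (const event of events) {
		// Derive a child stream name from the event's origin
		const key = `logs.${event.provider.toLowerCase().replace(/\s+/g, '-')}`;
		if (!streams.has(key)) streams.set(key, []);
		streams.get(key)!.push(event);
	}
	return streams;
}

const partitioned = partitionByProvider([
	{ provider: 'Service Control Manager', eventId: 7031, message: 'The service terminated unexpectedly.' },
	{ provider: 'Microsoft-Windows-Security-Auditing', eventId: 4625, message: 'An account failed to log on.' }
]);
console.log([...partitioned.keys()]);
// [ 'logs.service-control-manager', 'logs.microsoft-windows-security-auditing' ]
```

The real feature needs no such rule: it infers the grouping key from the shape and content of the data, which is what makes the routing automatic.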
<h2>Step 3: Significant Events and LLM Analysis</h2>
<p>Once your data is partitioned (e.g., into a dedicated <code>Service Control Manager</code> stream), you can apply GenAI models to analyze the semantic meaning of that stream.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/llm-analysis.png" alt="LLM analysis of log streams" /></p>
<p>In a traditional setup, the system sees text strings. In an AI-driven setup, the system understands context. When an LLM analyzes the <code>Service Control Manager</code> stream, it identifies what that system is responsible for. It knows that this specific component manages the starting and stopping of system services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/significant-events-suggestions.png" alt="Significant events suggestions from AI" /></p>
<p>Because the model understands the <em>purpose</em> of the log stream, it can generate suggestions for what constitutes a &quot;Significant Event.&quot; It doesn't need you to tell it to look for crashes; it knows that for a Service Manager, a crash is a critical failure.</p>
<h3>From Passive Storage to Proactive Suggestions</h3>
<p>The workflow effectively automates the creation of detection rules. The LLM scans the logs and generates a list of potential problems relevant to that specific dataset, such as:</p>
<ul>
<li><strong>Service Crashes:</strong> High severity anomalies where background processes terminate unexpectedly.</li>
<li><strong>Startup/Boot Failures:</strong> Critical errors preventing the OS from reaching a stable state.</li>
<li><strong>Permission Denials:</strong> Security-relevant events regarding service interactions.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/significant-events-list.png" alt="List of significant events detected" /></p>
<p>It bubbles these up as suggested observations. You can review a list of potential issues, see the severity the AI has assigned to them (e.g., Critical, Warning), and with a single click, generate the query required to find those logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/query-generation.png" alt="Auto-generated query for significant events" /></p>
<h2>Conclusion</h2>
<p>The combination of OpenTelemetry for standardized ingestion and AI-driven Streams for analysis turns the chaotic flood of Windows logs into a structured, actionable intelligence source. We are moving away from the era of &quot;log everything, look at nothing&quot; to an era where our tools understand our infrastructure as well as we do.</p>
<p>The barrier to effective monitoring is no longer technical complexity. Whether you are tracking security audits or debugging boot loops, leveraging LLMs to partition and analyze your streams is the new standard for observability.</p>
<p><a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Try Streams today</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/ai-partitioning.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>