<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 02 Apr 2026 07:00:58 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[3 models for logging with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/3-models-logging-opentelemetry</link>
            <guid isPermaLink="false">3-models-logging-opentelemetry</guid>
            <pubDate>Tue, 27 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[While OpenTelemetry increases the use of tracing and metrics among developers, logging continues to provide flexible, application-specific, event-driven data. Explore the current state of OpenTelemetry logging and guidance on the available approaches.]]></description>
            <content:encoded><![CDATA[<p>Arguably, <a href="https://www.elastic.co/blog/opentelemetry-observability">OpenTelemetry</a> exists to (greatly) increase usage of tracing and metrics among developers. That said, logging will continue to play a critical role in providing flexible, application-specific, event-driven data. Further, OpenTelemetry has the potential to bring added value to existing application logging flows:</p>
<ol>
<li>
<p>Common metadata across tracing, metrics, and logging to facilitate contextual correlation, including metadata passed between services as part of REST or RPC APIs; this is a critical element of service observability in the age of distributed, horizontally scaled systems</p>
</li>
<li>
<p>An optional unified data path for tracing, metrics, and logging to facilitate common tooling and signal routing to your observability backend</p>
</li>
</ol>
<p>Adoption of metrics and tracing among developers to date has been relatively small. Further, the number of proprietary vendors and APIs (compared to adoption rate) is relatively large. As such, OpenTelemetry took a greenfield approach to developing new, vendor-agnostic APIs for tracing and metrics. In contrast, most developers have nearly 100% log coverage across their services. Moreover, logging is largely supported by a small number of vendor-agnostic, open-source logging libraries and associated APIs (e.g., <a href="https://logback.qos.ch">Logback</a> and <a href="https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.logging.ilogger">ILogger</a>). Accordingly, <a href="https://opentelemetry.io/docs/specs/otel/logs/#introduction">OpenTelemetry’s approach to logging</a> meets developers where they already are, using hooks into existing, popular logging frameworks. In this way, developers can add OpenTelemetry as a log signal output without otherwise altering their code and investment in logging as an observability signal.</p>
<p>Notably, logging is the least mature of the OTel-supported observability signals. Depending on your service’s <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">language</a>, and your appetite for adventure, there exist several options for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>The intent of this article is to explore the current state of the art of <a href="https://www.elastic.co/blog/introduction-apm-tracing-logging-customer-experience">OpenTelemetry logging</a> and to provide guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via <a href="https://opentelemetry.io/docs/concepts/signals/baggage/">OTel baggage</a></li>
<li>Use of an Elastic<sup>®</sup> Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<h2>OpenTelemetry logging models</h2>
<p>Three models currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ol>
<li>
<p>Output logs from your service (alongside traces and metrics) using an embedded <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation library</a> to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, which then forwards to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> (or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>), which then forwards to Elastic via an Elastic-defined protocol</p>
</li>
</ol>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
<h2>Logging vs. span events</h2>
<p>It is worth noting that most APM systems, including OpenTelemetry, include provisions for <a href="https://opentelemetry.io/docs/instrumentation/ruby/manual/#add-span-events">span events</a>. Like log statements, span events contain arbitrary, textual data. Additionally, span events automatically carry any custom attributes (e.g., a “user ID”) applied to the parent span, which can help with correlation and context. In this regard, it may be advantageous to translate some existing log statements (inside spans) to span events. As the name implies, of course, span events can only be emitted from within a span and thus are not intended to be a general purpose replacement for logging.</p>
<p>Unlike logging, span events do not pass through existing logging frameworks and therefore cannot (practically) be written to a log file. Further, span events are technically emitted as part of trace data and follow the same data path and signal routing as other trace data.</p>
<h2>Polyfill appender</h2>
<p>Some of the demos make use of a custom Logback <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">“Polyfill appender”</a> (inspired by OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which provides support for attaching <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> to log messages for models (2) and (3).</p>
<h2>Elastic Common Schema</h2>
<p>For log messages to exhibit full fidelity within Elastic, they eventually need to be formatted in accordance with the <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">Elastic Common Schema</a> (ECS). In models (1) and (2), log messages remain formatted in OTel log semantics until ingested by the Elastic APM Server. The Elastic APM Server then translates OTel log semantics to ECS. In model (3), ECS is applied at the source.</p>
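<p>As a rough illustration (the field names shown are indicative, not exhaustive), the translation performed by the Elastic APM Server maps OTel log record fields to their ECS counterparts along these lines:</p>
<pre><code>OTel log record field        ECS field
---------------------        ---------
Timestamp                    @timestamp
SeverityText                 log.level
Body                         message
TraceId                      trace.id
SpanId                       span.id
Resource["service.name"]     service.name
</code></pre>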
<p>Notably, OpenTelemetry recently <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">adopted the Elastic Common Schema</a> as its standard for semantic conventions going forward! As such, it is anticipated that current OTel log semantics will be updated to align with ECS.</p>
<h2>Getting started</h2>
<p>The included demos center around a “POJO” (no assumed framework) Java project. Java is arguably the most mature of OTel-supported languages, particularly with respect to logging options. Notably, this singular Java project was designed to support the three models of logging discussed here. In practice, you would only implement one of these models (and corresponding project dependencies).</p>
<p>The demos assume you have a working <a href="https://www.docker.com/">Docker</a> environment and an <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> instance.</p>
<ol>
<li>
<p>git clone <a href="https://github.com/ty-elastic/otel-logging">https://github.com/ty-elastic/otel-logging</a></p>
</li>
<li>
<p>Create an .env file at the root of otel-logging with the following (appropriately filled-in) environment variables:</p>
</li>
</ol>
<pre><code class="language-bash"># the service name
OTEL_SERVICE_NAME=app4

# Filebeat vars
ELASTIC_CLOUD_ID=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)
ELASTIC_CLOUD_AUTH=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)

# apm vars
ELASTIC_APM_SERVER_ENDPOINT=(address of your Elastic Cloud APM server... i.e., https://xyz123.apm.us-central1.gcp.cloud.es.io:443)
ELASTIC_APM_SERVER_SECRET=(see https://www.elastic.co/guide/en/apm/guide/current/secret-token.html)
</code></pre>
<ol start="3">
<li>Start up the demo with the desired model:</li>
</ol>
<ul>
<li>If you want to demo logging via the OTel APM Agent, run <code>MODE=apm docker-compose up</code></li>
<li>If you want to demo logging via the OTel filelogreceiver, run <code>MODE=filelogreceiver docker-compose up</code></li>
<li>If you want to demo logging via Elastic Filebeat, run <code>MODE=filebeat docker-compose up</code></li>
</ul>
<ol start="4">
<li>Validate incoming span and correlated log data in your Elastic Cloud instance</li>
</ol>
<h2>Model 1: Logging via OpenTelemetry instrumentation</h2>
<p>This model aligns with the long-term goals of OpenTelemetry: <a href="https://opentelemetry.io/docs/specs/otel/logs/#opentelemetry-solution">integrated tracing, metrics, and logging (with common attributes) from your services</a> via the <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation libraries</a>, without dependency on log files and scrapers.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides a “Southbound hook” to Logback via the OTel <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-appender-1.0/library">Logback Appender</a>, which injects ServiceName, SpanID, TraceID, slf4j key-value pairs, and OTel baggage into log records and passes the composed records to the co-resident OpenTelemetry Instrumentation library. We further employ a <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/AddBaggageLogProcessor.java">custom LogRecordProcessor</a> to add baggage to the log record as attributes.</p>
<p>The OTel instrumentation library then formats the log statements per the <a href="https://opentelemetry.io/docs/specs/otel/logs/data-model/">OTel logging spec</a> and ships them via OTLP to either an OTel Collector for further routing and enrichment or directly to Elastic.</p>
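<p>As a minimal sketch, the Logback wiring for this model might look like the following (appender name and options are illustrative; the appender must also be installed against your OpenTelemetry instance at startup):</p>
<pre><code class="language-xml">&lt;configuration&gt;
  &lt;!-- forwards Logback events to the co-resident OpenTelemetry SDK --&gt;
  &lt;appender name="OpenTelemetry"
            class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender"&gt;
    &lt;!-- capture slf4j key-value pairs as log record attributes --&gt;
    &lt;captureKeyValuePairAttributes&gt;true&lt;/captureKeyValuePairAttributes&gt;
  &lt;/appender&gt;
  &lt;root level="INFO"&gt;
    &lt;appender-ref ref="OpenTelemetry"/&gt;
  &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>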
<p>Notably, as language support improves, this model can and will be supported by runtime agent binding with auto-instrumentation where available (e.g., no code changes required for runtime languages).</p>
<p>One distinguishing advantage of this model, beyond the simplicity it affords, is the ability to more easily tie together attributes and tracing metadata directly with log statements. This inherently makes logging more useful in the context of other OTel-supported observability signals.</p>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-1-architecture.png" alt="model 1 architecture" /></p>
<p>Although not explicitly pictured, an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> can be inserted in between the service and Elastic to facilitate additional enrichment and/or signal routing or duplication across observability backends.</p>
<h3>Pros</h3>
<ul>
<li>Simplified signal architecture and fewer “moving parts” (no files, disk utilization, or file rotation concerns)</li>
<li>Aligns with long-term OTel vision</li>
<li>Log statements can be (easily) decorated with OTel metadata</li>
<li>No polyfill adapter required to support structured logging with slf4j</li>
<li>No additional collectors/agents required</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Not available (yet) in many OTel-supported languages</li>
<li>No intermediate log file for ad-hoc, on-node debugging</li>
<li>Immature (alpha/experimental)</li>
<li>Unknown “glare” conditions, which could result in loss of log data if the service exits prematurely or if the backend is unable to accept log data for an extended period of time</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=apm docker-compose up</code></p>
<h2>Model 2: Logging via the OpenTelemetry Collector</h2>
<p>Given the cons of Model 1, it may be advantageous to consider a model that continues to leverage an actual log file intermediary between your services and your observability backend. Such a model is possible using an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> collocated with your services (e.g., on the same host), running the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md">filelogreceiver</a> to scrape service log files.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides a MDC Appender for Logback (<a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which adds SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>Notably, no log record structure is assumed by the OTel filelogreceiver. In the example provided, we employ the <a href="https://github.com/logfellow/logstash-logback-encoder">logstash-logback-encoder</a> to JSON-encode log messages. The logstash-logback-encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. Notably, logstash-logback-encoder doesn’t explicitly support <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a>. It does, however, support <a href="https://github.com/logfellow/logstash-logback-encoder#event-specific-custom-fields">Logback structured arguments</a>, and thus I use the <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">Polyfill Appender</a> to convert slf4j key-value pairs to Logback structured arguments.</p>
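<p>For reference, a minimal Logback configuration using the logstash-logback-encoder might look like this sketch (file path and appender name are illustrative; the encoder includes MDC content, and thus the SpanID, TraceID, and Baggage, by default):</p>
<pre><code class="language-xml">&lt;configuration&gt;
  &lt;appender name="JSON_FILE" class="ch.qos.logback.core.FileAppender"&gt;
    &lt;file&gt;/var/log/app/app.log&lt;/file&gt;
    &lt;!-- JSON-encodes each event, including MDC (SpanID, TraceID, Baggage) --&gt;
    &lt;encoder class="net.logstash.logback.encoder.LogstashEncoder"/&gt;
  &lt;/appender&gt;
  &lt;root level="INFO"&gt;
    &lt;appender-ref ref="JSON_FILE"/&gt;
  &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>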
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then <a href="https://github.com/ty-elastic/otel-logging/blob/main/collector/filelogreceiver.yml">configure</a> the OTel Collector to scrape this log file (using the filelogreceiver). Because no assumptions are made about the format of the log lines, you need to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/types/parsers.md#parsers">explicitly map fields</a> from your log schema to the OTel log schema.</p>
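<p>As a sketch of that mapping (file paths and JSON field names are illustrative and depend on your encoder’s output; the endpoint variables follow the demo’s <code>.env</code> file), a filelogreceiver configuration for JSON-encoded log lines might look like:</p>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [ /var/log/app/*.log ]
    operators:
      # parse each line as JSON into log record attributes
      - type: json_parser
      # map the encoded IDs onto the OTel log record's trace context
      - type: trace_parser
        trace_id:
          parse_from: attributes.trace_id
        span_id:
          parse_from: attributes.span_id

exporters:
  otlp:
    endpoint: ${ELASTIC_APM_SERVER_ENDPOINT}
    headers:
      Authorization: "Bearer ${ELASTIC_APM_SERVER_SECRET}"

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
</code></pre>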
<p>From there, the OTel Collector batches and ships the formatted log lines via OTLP to Elastic.</p>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-2-architecture.png" alt="model 2 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>Beta quality and not yet proven in the field</li>
<li>No support for slf4j key-value pairs</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=filelogreceiver docker-compose up</code></p>
<h2>Model 3: Logging via Elastic Agent (or Filebeat)</h2>
<p>Although the second model described affords some resilience as a function of the backing file, the OTel Collector filelogreceiver module is still decidedly <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">“beta”</a> in quality. Because of the importance of logs as a debugging tool, today I generally recommend that customers continue to import logs into Elastic using the field-proven <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> scrapers. Elastic Agent and Filebeat have many years of field maturity under their collective belt. Further, it is often advantageous to deploy Elastic Agent anyway to capture the multitude of signals outside the purview of OpenTelemetry (e.g., deep Kubernetes and host metrics, security, etc.).</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). As with model 2, we employ OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a> to add SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>From there, we employ the <a href="https://www.elastic.co/guide/en/ecs-logging/java/current/setup.html">Elastic ECS Encoder</a> to encode log statements compliant with the Elastic Common Schema. The Elastic ECS Encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. As with model 2, the Elastic ECS Encoder doesn’t support slf4j key-value pairs. Curiously, the Elastic ECS Encoder also doesn’t appear to support Logback structured arguments. Thus, within the Polyfill Appender, I add slf4j key-value pairs as MDC context. This is less than ideal, however, since MDC forces all values to be strings.</p>
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then configure Elastic Agent or Filebeat to scrape the log file. Notably, the Elastic ECS Encoder does not currently translate incoming OTel SpanID and TraceID variables on the MDC. Thus, we need to perform manual translation of these variables in the <a href="https://github.com/ty-elastic/otel-logging/blob/main/filebeat.yml">Filebeat (or Elastic Agent) configuration</a> to map them to their ECS equivalent.</p>
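<p>As an illustrative sketch (file paths and the incoming field names depend on your encoder’s output), the Filebeat side of that translation might look like:</p>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/app/*.json
    parsers:
      # each line is a self-contained JSON (ECS) document
      - ndjson:
          target: ""

processors:
  # map the OTel MDC fields to their ECS equivalents
  - rename:
      fields:
        - from: "trace_id"
          to: "trace.id"
        - from: "span_id"
          to: "span.id"
      ignore_missing: true
      fail_on_error: false
</code></pre>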
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-3-architecture.png" alt="model 3 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Robust and field-proven</li>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Native ECS format for easy manipulation in Elastic</li>
<li>Fleet-managed via Elastic Agent</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>No support for slf4j key-value pairs or Logback structured arguments</li>
<li>Requires translation of OTel SpanID and TraceID in Filebeat config</li>
<li>Disparate data paths for logs versus tracing and metrics</li>
<li>Vendor-specific logging format</li>
</ul>
<h3>Demo</h3>
<p><code>MODE=filebeat docker-compose up</code></p>
<h2>Recommendations</h2>
<p>For most customers, I currently recommend Model 3 — namely, write to logs in ECS format (with OTel SpanID, TraceID, and Baggage metadata) and collect them with an Elastic Agent installed on the node hosting the application or service. Elastic Agent (or Filebeat) today provides the most field-proven and robust means of capturing log files from applications and services with OpenTelemetry context.</p>
<p>Further, you can leverage this same Elastic Agent instance (ideally running in your <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html">Kubernetes daemonset</a>) to collect rich and robust metrics and logs from <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes</a> and many other supported services via <a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a>. Finally, Elastic Agent facilitates remote management via <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet</a>, avoiding bespoke configuration files.</p>
<p>Alternatively, for customers who either wish to keep their nodes vendor-neutral or use a consolidated signal routing system, I recommend Model 2, wherein an OpenTelemetry collector is used to scrape service log files. While workable and practiced by some early adopters in the field today, this model inherently carries some risk given the current beta nature of the OpenTelemetry filelogreceiver.</p>
<p>I generally do not recommend Model 1 given its limited language support, experimental/alpha status (the API could change), and current potential for data loss. That said, in time, with more language support and more thought to resilient designs, it has clear advantages both with regard to simplicity and richness of metadata.</p>
<h2>Extracting more value from your logs</h2>
<p>In contrast to tracing and metrics, most organizations have nearly 100% log coverage over their applications and services. This is an ideal beachhead upon which to build an application observability system. On the other hand, logs are notoriously noisy and unstructured; this is only amplified with the scale enabled by the hyperscalers and Kubernetes. Collecting log lines reliably is the easy part; making them useful at today’s scale is hard.</p>
<p>Given that logs are arguably the most challenging observability signal from which to extract value at scale, one should ideally give thoughtful consideration to a vendor’s support for logging in the context of other observability signals. Can they handle surges in log rates because of unexpected scale or an error or test scenario? Do they have the machine learning tool set to automatically recognize patterns in log lines, sort them into categories, and identify true anomalies? Can they provide cost-effective online searchability of logs over months or years without manual rehydration? Do they provide the tools to extract and analyze business KPIs buried in logs?</p>
<p>As an ardent and early supporter of OpenTelemetry, Elastic, of course, <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">natively ingests OTel traces, metrics, and logs</a>. And just like all logs coming into our system, logs coming from OTel-equipped sources avail themselves of our <a href="https://www.elastic.co/observability/log-monitoring">mature tooling and next-gen AI Ops technologies</a> to enable you to extract their full value. Interested? <a href="https://www.elastic.co/contact?storm=global-header-en">Reach out to our pre-sales team</a> to get started building with Elastic!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/log_infrastructure_apm_synthetics-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Accelerate OTel Adoption with Elastic Agent Hybrid Ingestion]]></title>
            <link>https://www.elastic.co/observability-labs/blog/hybrid-elastic-agent-opentelemetry-integration</link>
            <guid isPermaLink="false">hybrid-elastic-agent-opentelemetry-integration</guid>
            <pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Agent 9.2 brings hybrid ingestion to Elastic Observability, unifying native integrations and OpenTelemetry receivers to simplify large-scale OTel adoption without disruption.]]></description>
            <content:encoded><![CDATA[<h2>Hybrid Elastic Agent: The Most Practical Path to OpenTelemetry Adoption</h2>
<p>OpenTelemetry is quickly becoming the standard foundation for modern observability. Organizations want its open ecosystem, unified model, and vendor-neutral instrumentation—but moving a mature production environment to OTel is rarely straightforward.</p>
<p>Most teams already rely on battle-tested pipelines for logs, metrics, and security signals. They have dashboards tuned over years, operational practices built around existing data flows, and mission-critical systems where disruption simply isn’t an option.</p>
<p>This means the question isn’t &quot;Why OpenTelemetry?&quot;
It’s &quot;How do we get there without breaking what already works?&quot;</p>
<p>With hybrid ingestion, Elastic Observability introduces a way to ingest telemetry without disrupting existing data and dashboards. Released in Elastic 9.2, it’s a low-friction way to adopt OTel receivers alongside existing native Elastic integrations, managed centrally through Fleet.</p>
<p>This hybrid approach offers one of the most pragmatic and operationally safe routes to OTel adoption available today.</p>
<h3>The Challenge: Adopting OTel Without Disrupting the Present</h3>
<p>For many organizations, the path to OTel adoption is complicated by realities such as:</p>
<ul>
<li>Established log pipelines powering critical alerting</li>
<li>Legacy infrastructure that isn’t easily re-instrumented</li>
<li>Existing dashboards and visualizations built on Elastic-native datasets</li>
<li>Teams with different levels of OTel experience</li>
<li>Risk constraints that make large changes difficult to roll out</li>
</ul>
<p>Standardizing on OTel is the right long-term direction, but replacing everything at once is neither realistic nor desirable.</p>
<p>Teams need a way to bring OTel into their environment incrementally, while preserving continuity, reliability, and central governance.</p>
<h3>Elastic Agent 9.2+: Hybrid Ingestion as a Bridge to the Future</h3>
<p>Elastic Agent now supports two fully supported ingestion paths, both running inside the same unified agent:</p>
<ol>
<li>Elastic-native integrations</li>
</ol>
<p>Perfect for logs and host-level telemetry, with mature dashboards, alerts, and ECS mappings.</p>
<ol start="2">
<li>OpenTelemetry input integrations (OTel receivers)</li>
</ol>
<p>Powered by upstream OTel Collector components, managed directly from Fleet.</p>
<p>And crucially:</p>
<p>You can use both, simultaneously, on the same agent.</p>
<p>This hybrid ingestion model allows teams to:</p>
<ul>
<li>Continue collecting logs using native Elastic integrations</li>
<li>Begin collecting metrics or traces via OTel receivers</li>
<li>Maintain full control through Fleet</li>
<li>Introduce OTel exactly where and when it makes sense</li>
<li>Avoid running parallel agents or duplicate pipelines</li>
</ul>
<p>It’s a way to evolve—not replace—your observability strategy.</p>
<h3>A Practical Example: Adding OTel Inputs While Keeping Your Existing Pipelines</h3>
<p>Imagine a system where NGINX logs are already handled via Elastic-native integrations. These pipelines drive dashboards, audits, and critical alerts. Interrupting them isn’t an option.</p>
<p>At the same time, your platform team wants to standardize metrics and service telemetry using OpenTelemetry.</p>
<p>With Elastic Agent hybrid ingestion, both goals align:</p>
<ol>
<li>Keep your existing log integration in Fleet</li>
<li>Add an OTel input integration (e.g., OTel <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/nginxreceiver">nginxreceiver</a>)</li>
<li>Fleet deploys both inside the same Elastic Agent</li>
<li>Deployment is done at scale across your infrastructure from a single management console</li>
<li>Logs and OTel metrics flow into Elasticsearch side-by-side</li>
</ol>
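<p>For step 2, the receiver settings you supply through the integration mirror the upstream nginxreceiver configuration. For example (endpoint and interval are illustrative, and assume NGINX’s stub_status page is enabled):</p>
<pre><code class="language-yaml">receivers:
  nginx:
    # stub_status endpoint exposed by NGINX
    endpoint: "http://localhost:80/status"
    collection_interval: 10s
</code></pre>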
<p>No re-instrumentation.
No duplicate agents.
No loss of historical visibility.
No new tooling for operations.
No external deployment tool.</p>
<p>Whether the component is a web server, reverse proxy, database, JVM runtime, or custom service already instrumented in OTel, the workflow is the same.</p>
<h3>Why This Hybrid Approach Matters Strategically</h3>
<p>Hybrid ingestion is not simply a technical capability—it’s an organizational enabler for OpenTelemetry transformation.</p>
<p><strong>Incremental migration without downtime</strong></p>
<p>Teams can begin adopting OTel at the exact pace they’re comfortable with.
Existing collection signals remain stable. OTel metrics or logs are added progressively.</p>
<p><strong>Fleet remains your single control plane</strong></p>
<p>Fleet continues to manage:</p>
<ul>
<li>agent lifecycle</li>
<li>policy management</li>
<li>version upgrades</li>
<li>diagnostics and monitoring</li>
</ul>
<p>Even as OTel becomes part of your ingestion strategy.</p>
<p><strong>Consistent semantics across teams</strong></p>
<p>Adopting OTel receivers through <a href="https://www.elastic.co/docs/reference/edot-collector">EDOT</a> helps harmonize telemetry models across microservices, infrastructure, and applications.</p>
<p>OTel becomes the shared language—Elastic becomes the scalable backend.</p>
<p><strong>Future-proof flexibility</strong></p>
<p>When the day comes that a team needs advanced OTel features, custom pipelines, custom processors, or additional exporters, they can build their own <a href="https://www.elastic.co/docs/reference/edot-collector/custom-collector">EDOT custom collector</a> flavor and use it in their elastic-agent in hybrid mode.</p>
<p>This allows deep customization without abandoning the Elastic Agent runtime.</p>
<p><strong>No vendor lock-in—full ecosystem alignment</strong></p>
<p>Hybrid ingestion leverages upstream OpenTelemetry components directly.
This reinforces the open, vendor-neutral ecosystem organizations prefer when standardizing observability across teams while being supported by Elastic.</p>
<h3>What About Standalone Mode? (Advanced Use Cases)</h3>
<p>While Fleet-managed hybrid ingestion will meet the needs of most users, Elastic Agent in hybrid mode also supports standalone deployment, with the same functionality as the managed version:</p>
<ul>
<li>native integrations support</li>
<li>full control over OTel receivers, processors, and exporters</li>
<li>Elasticsearch output as the backend</li>
</ul>
<p>This is particularly useful for platform teams testing advanced OTel deployments or building custom telemetry strategies.</p>
<p>But it remains optional—the managed experience is still the default path.</p>
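<p>For illustration, a standalone collector configuration for an NGINX scenario might look like the following. This is a sketch under stated assumptions: the receiver and exporter options mirror the upstream <code>nginxreceiver</code> and Elasticsearch exporter settings, but the endpoint, deployment URL, and credential variable are placeholders you would adapt to your own environment.</p>

```yaml
receivers:
  nginx:
    endpoint: "http://localhost/status"   # NGINX stub_status page
    collection_interval: 10s

exporters:
  elasticsearch:
    endpoints: ["https://my-deployment.es.example.com:443"]  # placeholder deployment URL
    api_key: "${env:ELASTIC_API_KEY}"                        # placeholder credential

service:
  pipelines:
    metrics:
      receivers: [nginx]
      exporters: [elasticsearch]
```

<p>In recent versions, a file like this can be run through the collector runtime embedded in the agent (for example, via the <code>elastic-agent otel</code> subcommand), keeping the same receiver and exporter configuration you would use with a managed policy.</p>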
<h3>Conclusion: A Modern, Flexible Path Toward OpenTelemetry</h3>
<p>Migrating to OpenTelemetry is a journey, not a switch. With hybrid ingestion, Elastic provides a realistic, scalable, and low-risk pathway for organizations that want to adopt OTel gradually while maintaining operational continuity.</p>
<p>Elastic Agent 9.2+ enables teams to:</p>
<ul>
<li>retain reliable log integrations</li>
<li>introduce OTel inputs seamlessly</li>
<li>manage everything from Fleet</li>
<li>reduce complexity and operational overhead</li>
<li>expand into OTel at the right pace</li>
<li>stay aligned with open standards and best practices</li>
</ul>
<p>It brings the best of both worlds—Elastic-native richness and OTel-standard flexibility—into a single agent and a unified operational model.</p>
<p>Hybrid isn’t a workaround.
It’s the strategic bridge between where your observability platform is today and where it needs to go next.</p>
<h2>Technical Walkthrough: Deploying Hybrid Elastic Agent + EDOT in Fleet</h2>
<p>Before we close, let’s look at what this actually looks like in practice.
Conceptual advantages are important, but many teams want to see how hybrid ingestion works when deployed through Fleet.</p>
<p>The example below walks through a simple, production-ready setup using Elastic Agent 9.2, combining a native integration and an OTel input integration inside a single agent. The same approach can be applied to any service across your environment.</p>
<p>Here is a step-by-step guide showing how to deploy Elastic Agent 9.2 in <strong>Fleet-managed hybrid mode</strong>, using the OTel <code>nginxreceiver</code> as one concrete example.
This applies to any service with an OTel receiver (Redis, HAProxy, Kafka, JVM, etc.).</p>
<h3>Requirements</h3>
<ul>
<li>Elastic Stack <strong>9.2+</strong></li>
<li>Elastic Agent <strong>9.2+</strong></li>
<li>Fleet configured in Kibana</li>
<li>A host running your workload (NGINX in this example)</li>
<li>NGINX <code>stub_status</code> endpoint or any equivalent OTel metrics endpoint</li>
<li>API key with ingest privileges</li>
</ul>
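<p>For context, the <code>stub_status</code> page that the <code>nginxreceiver</code> scrapes is a small plain-text report. The sketch below parses that well-known format into a metrics dict, purely to illustrate what the receiver extracts; the sample numbers are made up, and this is not the receiver's actual implementation.</p>

```python
import re

# Illustrative sample of the plain-text format served by NGINX's stub_status module.
SAMPLE = """\
Active connections: 291
server accepts handled requests
 16630948 16630947 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text: str) -> dict:
    """Parse NGINX stub_status output into a flat metrics dict."""
    metrics = {}
    metrics["active_connections"] = int(
        re.search(r"Active connections:\s+(\d+)", text).group(1)
    )
    accepts, handled, requests = re.search(
        r"server accepts handled requests\s+(\d+)\s+(\d+)\s+(\d+)", text
    ).groups()
    metrics.update(accepts=int(accepts), handled=int(handled), requests=int(requests))
    reading, writing, waiting = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    ).groups()
    metrics.update(reading=int(reading), writing=int(writing), waiting=int(waiting))
    return metrics

print(parse_stub_status(SAMPLE))
```

<p>These are exactly the counters (active, accepted, handled, requests, reading, writing, waiting) that surface later as OTel metrics in the dashboard.</p>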
<h3>1. Create or Select an Agent Policy</h3>
<ol>
<li>In Kibana → <strong>Management → Fleet → Agent policies</strong></li>
<li>Create a new policy: <code>nginx-o11y</code></li>
<li>Enable system monitoring (recommended)</li>
<li>Save</li>
</ol>
<h3>2. Enroll Elastic Agent into the Policy</h3>
<p>From the policy page:</p>
<ol>
<li>Click <strong>Add agent</strong></li>
<li>Choose your OS</li>
<li>Copy the installation command</li>
<li>Run:</li>
</ol>
<pre><code class="language-bash">sudo elastic-agent install \
  --url=&lt;FLEET_URL&gt; \
  --enrollment-token=&lt;ENROLLMENT_TOKEN&gt;
</code></pre>
<p>You should soon see the agent appear as Healthy in Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image1.png" alt="" /></p>
<h3>3. Add the Native Integration (Logs)</h3>
<ol>
<li>In Fleet, go to Integrations.</li>
<li>Search for NGINX.</li>
<li>Click Add NGINX.</li>
<li>Select your <code>nginx-o11y</code> policy.</li>
<li>Only enable log collection (access + error logs).</li>
<li>Save and deploy.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image2.png" alt="" /></p>
<h3>4. Validate Log Collection</h3>
<ol>
<li>In Kibana, go to Analytics → Discover and search for:</li>
</ol>
<pre><code class="language-bash">data_stream.dataset : (&quot;nginx.access&quot; or &quot;nginx.error&quot;)
</code></pre>
<ol start="2">
<li>Or open the built-in dashboard:</li>
</ol>
<pre><code class="language-bash">Analytics → Dashboards → [Logs Nginx] Access and error logs
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image3.png" alt="" /></p>
<h3>5. Collecting NGINX Metrics via the OTel NGINX Receiver</h3>
<p>Elastic Agent 9.2+ allows Fleet to deploy OTel input integrations.<br />
This scenario uses the OpenTelemetry <code>nginxreceiver</code> through a Fleet-managed integration.</p>
<h4>5.1. Install the NGINX OpenTelemetry Integration Content</h4>
<ol>
<li>In Kibana, go to Management → Fleet → Integrations.</li>
<li>Search for NGINX OpenTelemetry Assets.</li>
<li>Click Add Integration.</li>
</ol>
<h4>5.2. Install the NGINX OpenTelemetry Input Integration</h4>
<ol>
<li>In Kibana, go to Management → Fleet → Integrations.</li>
<li>Search for NGINX OpenTelemetry Input Package.</li>
<li>Click Add Integration.</li>
<li>Assign it to your agent <code>nginx-o11y</code> policy.</li>
</ol>
<p>Provide the endpoint for the NGINX status page:</p>
<ul>
<li><strong>Endpoint</strong>: <code>http://localhost/status</code></li>
<li><strong>Collection interval</strong>: <code>10s</code></li>
</ul>
<p>Click <strong>Add integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image4.png" alt="" /></p>
<h3>6. Validate OTel Metrics</h3>
<ol>
<li>Go to <strong>Analytics → Dashboards</strong>.</li>
<li>Open: <strong>[Metrics Nginx OTEL Overview]</strong> Dashboard</li>
</ol>
<p>You should see metrics such as active connections, writes, reads, waiting, and request counts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/image5.png" alt="" /></p>
<h3>7. Closing thoughts</h3>
<p>This example highlights how straightforward hybrid ingestion becomes with Elastic Agent 9.2. By combining native integrations and OTel receivers within a single, centrally managed policy, you gain the flexibility to adopt OpenTelemetry where it adds the most value without disrupting existing pipelines or introducing operational overhead.</p>
<p>Whether you extend this pattern to additional services, experiment with other OTel receivers, or scale it across your fleet, the deployment model remains consistent, repeatable, and production-ready.</p>
<p>For more information on this and other Elastic Observability innovations, check out:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Discover how Elastic is evolving data ingestion with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">Learn how OpAMP enables centralized configuration of OpenTelemetry SDKs</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Explore how Streams reshape AI-driven log investigation workflows</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/hybrid-elastic-agent-opentelemetry-integration/feature-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Adding free and open Elastic APM as part of your Elastic Observability deployment]]></title>
            <link>https://www.elastic.co/observability-labs/blog/free-open-elastic-apm-observability-deployment</link>
            <guid isPermaLink="false">free-open-elastic-apm-observability-deployment</guid>
            <pubDate>Wed, 28 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to gather application trace data and store it alongside the logs and metrics from your applications and infrastructure with Elastic Observability and Elastic APM.]]></description>
            <content:encoded><![CDATA[<p>In a recent post, we showed you <a href="https://www.elastic.co/blog/getting-started-with-free-and-open-elastic-observability">how to get started with the free and open tier of Elastic Observability</a>. Below, we'll walk through what you need to do to expand your deployment so you can start gathering metrics from application performance monitoring (APM) or &quot;tracing&quot; data in your observability cluster, for free.</p>
<h2>What is APM?</h2>
<p>Application performance monitoring lets you see where your applications spend their time, what they are doing, what other applications or services they are calling, and what errors or exceptions they are encountering.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/screenshot-serverless-distributed-trace.png" alt="" /></p>
<p>APM also lets you see history and trends for key performance indicators, such as latency and throughput, as well as transaction and dependency information:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/ruby-overview.png" alt="" /></p>
<p>Whether you're setting up alerts for SLA breaches, trying to gauge the impact of your latest release, or deciding where to make the next improvement, APM can help with your root-cause analysis to help improve your users' experience and drive your mean time to resolution (MTTR) toward zero.</p>
<h2>Logical architecture</h2>
<p>Elastic APM relies on the APM Integration inside Elastic Agent, which forwards application trace and metric data from applications instrumented with APM agents to an Elastic Observability cluster. Elastic APM supports multiple agent flavors:</p>
<ul>
<li>Native Elastic APM Agents, available for <a href="https://www.elastic.co/guide/en/apm/agent/index.html">multiple languages</a>, including Java, .NET, Go, Ruby, Python, Node.js, PHP, and client-side JavaScript</li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/get-started/current/open-telemetry-elastic.html">OpenTelemetry</a></li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/get-started/current/opentracing.html">OpenTracing</a></li>
<li>Code instrumented with <a href="https://www.elastic.co/guide/en/apm/server/current/jaeger.html">Jaeger</a></li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-instrumented-services.png" alt="" /></p>
<p>In this blog, we'll provide a quick example of how to instrument code with the native Elastic APM Python agent, but the overall steps are similar for other languages.</p>
<p>Please note that there is a strong distinction between the <strong>Elastic APM Agent</strong> and the <strong>Elastic Agent</strong>. These are very different components, as you can see in the diagram above, so it's important not to confuse them.</p>
<h2>Install the Elastic Agent</h2>
<p>The first step is to install the Elastic Agent. You either need Fleet <a href="https://www.elastic.co/guide/en/fleet/current/add-a-fleet-server.html">installed first</a>, or you can install the Elastic Agent standalone. Install the Elastic Agent somewhere by <a href="https://www.elastic.co/guide/en/fleet/master/elastic-agent-installation.html">following this guide</a>. This will give you an APM Integration endpoint you can hit. Note that this step is not necessary in Elastic Cloud, as we host the APM Integration for you. Check that the Elastic Agent is up by running:</p>
<pre><code class="language-bash">curl &lt;ELASTIC_AGENT_HOSTNAME&gt;:8200
</code></pre>
<h2>Instrumenting sample code with an Elastic APM agent</h2>
<p>The instructions for the various language agents differ based on the programming language, but at a high level they have a similar flow. First, you add the dependency for the agent in the language's native spec, then you configure the agent to let it know how to find the APM Integration.</p>
<p>You can try out any flavor you'd like, but I am going to walk through the Python instructions using this Python example that <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample">I created</a>.</p>
<h3>Get the sample code (or use your own)</h3>
<p>To get started, I clone the GitHub repository then change to the directory:</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/PythonElasticAPMExample
cd PythonElasticAPMExample
</code></pre>
<h3>How to add the dependency</h3>
<p>Adding the Elastic APM dependency is simple: check the <code>app.py</code> file from <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample/blob/main/app.py">the GitHub repo</a> and you will notice the following lines of code.</p>
<pre><code class="language-python">import os

import elasticapm
from elasticapm import Client
from flask import Flask

app = Flask(__name__)
app.config[&quot;ELASTIC_APM&quot;] = {
    &quot;SERVICE_NAME&quot;: os.environ.get(&quot;APM_SERVICE_NAME&quot;, &quot;flask-app&quot;),
    &quot;SECRET_TOKEN&quot;: os.environ.get(&quot;APM_SECRET_TOKEN&quot;, &quot;&quot;),
    &quot;SERVER_URL&quot;: os.environ.get(&quot;APM_SERVER_URL&quot;, &quot;http://localhost:8200&quot;),
}
elasticapm.instrumentation.control.instrument()
client = Client(app.config[&quot;ELASTIC_APM&quot;])
</code></pre>
<p>The Python library for Flask is capable of auto-detecting transactions, but you can also start transactions in code, as we have done in this example:</p>
<pre><code class="language-python">@app.route(&quot;/&quot;)
def hello():
    client.begin_transaction('demo-transaction')
    client.end_transaction('demo-transaction', 'success')
    return 'Hello!'  # a Flask view must return a response
</code></pre>
<h3>Configure the agent</h3>
<p>The agents need to send application trace data to the APM Integration, and for that the integration has to be reachable. I configured the Elastic Agent to listen on my local host's IP, so anything in my subnet can send data to it. As you can see from the code below, we use docker-compose.yml to pass in the config via environment variables. Please edit these variables to match your own Elastic installation.</p>
<pre><code class="language-yaml"># docker-compose.yml
version: &quot;3.9&quot;
services:
  flask_app:
    build: .
    ports:
      - &quot;5001:5001&quot;
    environment:
      - PORT=5001
      - APM_SERVICE_NAME=flask-app
      - APM_SECRET_TOKEN=your_secret_token
      - APM_SERVER_URL=http://host.docker.internal:8200
</code></pre>
<p>Some commentary on the above:</p>
<ul>
<li><strong>service_name:</strong> If you leave this out it will just default to the application's name, but you can override that here.</li>
<li><strong>secret_token:</strong> <a href="https://www.elastic.co/guide/en/apm/server/current/secret-token.html">Secret tokens</a> allow you to authorize requests to the APM Server, but they require that the APM Server is set up with SSL/TLS and that a secret token has been set up. We're not using HTTPS between the agents and the APM Server, so we'll comment this one out.</li>
<li><strong>server_url:</strong> This is how the agent can reach the APM Integration inside Elastic Agent. Replace this with the name or IP of your host running Elastic Agent.</li>
</ul>
<p>Now that the Elastic APM side of the configuration is done, we simply follow the steps from the <a href="https://github.com/davidgeorgehope/PythonElasticAPMExample/blob/main/README.md">README</a> to start up.</p>
<pre><code class="language-bash">docker-compose up --build -d
</code></pre>
<p>The build step will take several minutes.</p>
<p>You can navigate to the running sample application by visiting <a href="http://localhost:5001">http://localhost:5001</a>. There's not a lot to the sample, but it does generate some APM data. To generate a bit of load, you can reload the page a few times or run a quick little script:</p>
<pre><code class="language-bash">#!/bin/bash
# load_test.sh
url=&quot;http://localhost:5001&quot;
for i in {1..1000}
do
  curl -s -o /dev/null $url
  sleep 1
done
</code></pre>
<p>This will just reload the page every second.</p>
<p>Back in Kibana, navigate back to the APM app (hamburger icon, then select <strong>APM</strong> ) and you should see our new flask-app service (I let mine run so it shows a bit more history):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-services.png" alt="" /></p>
<p>The Service Overview page provides an at-a-glance summary of the health of a service in one place. If you're a developer or an SRE, this is the page that will help you answer questions like:</p>
<ul>
<li>How did a new deployment impact performance?</li>
<li>What are the top impacted transactions?</li>
<li>How does performance correlate with underlying infrastructure?</li>
</ul>
<p>This view provides a list of all of the applications that have sent application trace data to Elastic APM in the specified period of time (in this case, the last 15 minutes). There are also sparklines showing mini graphs of latency, throughput, and error rate. Clicking on <strong>flask-app</strong> takes us to the <strong>service overview</strong> page, which shows the various transactions within the service (recall that my script is hitting the / endpoint, as seen in the <strong>Transactions</strong> section). We get bigger graphs for <strong>Latency</strong> , <strong>Throughput</strong> , <strong>Errors</strong> , and <strong>Error Rates</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-flask-app.png" alt="" /></p>
<p>When you're instrumenting real applications under real load, you'll see a lot more connectivity (and errors!).</p>
<p>Clicking on a transaction in the transaction view, in this case, our sample app's demo-transaction transaction, we can see exactly what operations were called:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-flask-app-demo-transaction.png" alt="" /></p>
<p>This includes detailed information about calls to external services, such as database queries:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-elastic-observability-span-details.png" alt="" /></p>
<h2>What's next?</h2>
<p>Now that you've got your Elastic Observability cluster up and running and collecting out-of-the-box application trace data, explore the public APIs for the languages that your applications are using, which allow you to take your APM data to the next level. The APIs allow you to add custom metadata, define business transactions, create custom spans, and more. You can find the public API specs for the various APM agents (such as <a href="https://www.elastic.co/guide/en/apm/agent/java/current/public-api.html">Java</a>, <a href="https://www.elastic.co/guide/en/apm/agent/ruby/current/api.html">Ruby</a>, <a href="https://www.elastic.co/guide/en/apm/agent/python/current/index.html">Python</a>, and more) on the APM agent <a href="https://www.elastic.co/guide/en/apm/agent/index.html">documentation pages</a>.</p>
<p>If you'd like to learn more about Elastic APM, check out <a href="https://www.elastic.co/webinars/introduction-to-elastic-apm-in-the-shift-to-cloud-native">our webinar on Elastic APM in the shift to cloud native</a> to see other ways that Elastic APM can help you in your ecosystem.</p>
<p>If you decide that you'd rather have us host your observability cluster, you can sign up for a free trial of the <a href="https://www.elastic.co/cloud/">Elasticsearch Service on Elastic Cloud</a> and change your agents to point to your new cluster.</p>
<p><em>Originally published May 5, 2021; updated April 6, 2023.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/free-open-elastic-apm-observability-deployment/blog-thumb-release-apm.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic Agent Builder & OpenTelemetry to Observe Devices]]></title>
            <link>https://www.elastic.co/observability-labs/blog/agent-builder-opentelemetry</link>
            <guid isPermaLink="false">agent-builder-opentelemetry</guid>
            <pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to use Elastic Agent Builder and OpenTelemetry to build IoT observability and gain insights into your appliance usage patterns and efficiency.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><em>“Anything that emits data can be instrumented and observed.”</em></p>
</blockquote>
<p>That’s the mindset that started this little experiment.</p>
<h2><strong>The Curiosity Behind It</strong></h2>
<p>Over the years, I’ve worked closely with customers to design IT solutions that are scalable, secure, and cost-effective. I’ve partnered with them on their cloud migration and digital transformation journeys, enabling full-stack Observability across their cloud and on-premise systems.</p>
<p>One day, I found myself looking at the appliances in my own home — the dishwasher, washer, dryer, and refrigerator — and realized that they, too, were generating valuable data. What if I could observe them? What if the same principles that power enterprise telemetry could help me understand my home appliances — their patterns, behavior, and efficiency?</p>
<p>That curiosity became the seed for this experiment: IoT Observability at home, powered by OpenTelemetry (EDOT) and Agent Builder.</p>
<h2><strong>Building the IoT Observability Foundation</strong></h2>
<p>The idea was simple:</p>
<ol>
<li>Treat every device as a data source.</li>
<li>Use OpenTelemetry to capture signals.</li>
<li>Use EDOT (Elastic Distribution of OpenTelemetry) as a unified collector and exporter.</li>
<li>Send all data to an Elastic Serverless Observability cluster.</li>
<li>Layer Agent Builder on top, to <em>talk</em> to the data using natural language.</li>
</ol>
<p>So now my dishwasher, washer, dryer, and refrigerator are all part of an Elastic-powered, home-scale telemetry pipeline.</p>
<h2><strong>Turning Signals into Stories</strong></h2>
<h3><strong>Technical overview: What does this system do?</strong></h3>
<p>I set up a system that connects my LG ThinQ smart appliances — washer, dryer, dishwasher, and refrigerator — to Home Assistant, turning everyday household devices into observable systems by sending metrics, logs, and traces to Elastic Cloud Serverless.</p>
<p>Key Capabilities:</p>
<p>✅ Natural language queries (Agent Builder)<br />
✅ Real-time appliance state monitoring<br />
✅ Anomaly detection<br />
✅ Full-stack observability</p>
<h3><strong>Architecture overview</strong></h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/architecture-overview.png" alt="Architecture overview Agent Builder OpenTelemetry" /></p>
<h3><strong>The Aha Moment</strong></h3>
<p>What is Agent Builder?</p>
<p><a href="https://www.elastic.co/search-labs/blog/ai-agentic-workflows-elastic-ai-agent-builder">Agent Builder</a> provides an out-of-the-box conversational agent to allow you to immediately start chatting with any data in Elasticsearch (or from external sources through integrations) with a full experience built in Kibana and accessible via API. Developers can also customize their tools to search specific indexes or use <a href="https://www.elastic.co/docs/explore-analyze/query-filter/languages/esql">ES|QL</a> for business logic, relevance tuning, or personalization.</p>
<p>It has the ability to transform natural language into intuitive, piped, multi-step ES|QL, giving the agent the power to do analytical and hybrid <a href="https://www.elastic.co/search-labs/blog/introduction-to-vector-search">semantic search</a>. Finally, developers can compose custom Agents based on a set of user-defined instructions and a configurable set of available tools, and these Agents can be interacted with via chat in Kibana or via APIs, MCP, and A2A.</p>
<p>Agent Builder creates a transformative experience, turning raw telemetry into an interactive dialogue. So instead of building complex queries manually, I can simply ask:</p>
<p>&quot;Can you show me a report for all my appliances?&quot;
…and voilà, I get the insights right in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/1.png" alt="Appliance activity summary with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/appliance-comparison.png" alt="Appliance comparison with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/graph.png" alt="Appliance activity graph with Agent Builder &amp; OpenTelemetry" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/analysis.png" alt="Appliance usage analysis with Agent Builder &amp; OpenTelemetry" /></p>
<h2>Conclusion</h2>
<p>This experiment reminded me that Observability isn’t limited to enterprise systems. Anything that emits data, whether it’s a Kubernetes pod or a coffee maker, has insights that can be uncovered. It could be any IoT devices for that matter, your data center thermostat, your office building badge scanners — all emit telemetry that can be valuable to ensuring safe and efficient operations. The same principles that help organizations gain visibility into production workloads can also bring insights, efficiency, and a sense of connection to the systems around us every day.</p>
<p>By combining OpenTelemetry (EDOT), Elastic Cloud Serverless, and Agent Builder, I realized how simple it can be to go from raw telemetry to conversation — turning metrics into meaning and data into dialogue.</p>
<p>This experiment showed me something simple yet profound: Observability is no longer just about dashboards and alerts; it’s about conversations. When data becomes conversational, insights become accessible to everyone — not just developers or SREs, but anyone curious enough to ask “why?”</p>
<blockquote>
<p>Anything that emits data can be observed.</p>
</blockquote>
<p>Now, with Agent Builder:</p>
<blockquote>
<p>Anything that emits data can also answer back.</p>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/agent-builder-opentelemetry/capture-custom-metrics.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Agentic CI/CD: Kubernetes Deployment Gates with Elastic MCP Server]]></title>
            <link>https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server</link>
            <guid isPermaLink="false">agentic-cicd-kubernetes-mcp-server</guid>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Deploy agentic CI/CD gates with Elastic MCP Server. Integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability (O11y)]]></description>
            <content:encoded><![CDATA[<p>The &quot;Build-Push-Deploy&quot; cycle is never simple. High-availability environments require automated guardrails, proactive checks that prevent a deployment from even starting if the target cluster is under stress. Today these are generally performed with APIs and scripts during the CI/CD process. Different gates are initiated during the process to ensure the application tests have passed, the artifact is clean, the infrastructure is stable, and many more. </p>
<p>With AI and agents, these gates are becoming more sophisticated. Increasingly, they rely on a <strong>Model Context Protocol (MCP)</strong> server for the check. This is a newer, more cutting-edge &quot;agentic&quot; approach: it allows your CI/CD pipeline to act as an intelligent agent that &quot;asks&quot; your cluster for its health status before making a change.</p>
<p>A standard Kubernetes deployment workflow generally follows these high-level steps:</p>
<ol>
<li>
<p>Verification Gate: Ensuring all automated testing has passed.</p>
</li>
<li>
<p>Artifact Creation: Building the Docker container.</p>
</li>
<li>
<p>Environment Gate: Verifying that the production Kubernetes environment, supporting infrastructure, and existing applications are healthy.</p>
</li>
<li>
<p>Kubernetes Deployment: Triggering the final release. Modern workflows often use GitOps tools like ArgoCD or Flux, where a simple image tag update in Docker Hub automatically synchronizes the cluster.</p>
</li>
</ol>
<p>Kubernetes health checks can range from simple to complex depending on your Service Level Objectives (SLOs) and operational maturity. Typically, the primary goal is to ensure the cluster is healthy and not nearing a resource bottleneck. Common &quot;red flag&quot; metrics used in these gates include:</p>
<table>
<thead>
<tr>
<th>Red Flag</th>
<th>Scenario</th>
<th>SRE Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Pod Count &gt; 90%</strong></td>
<td>High pod density</td>
<td>Approaching node-level scheduling limits.</td>
</tr>
<tr>
<td><strong>CPU Usage &gt; 70%</strong></td>
<td>High real-time load</td>
<td>Risk of CPU throttling during deployment.</td>
</tr>
<tr>
<td><strong>Memory Usage &gt; 80%</strong></td>
<td>Memory pressure</td>
<td>High risk of Out-of-Memory (OOM) kills.</td>
</tr>
<tr>
<td><strong>OOM Terminating Processes</strong></td>
<td>Resource limits reached</td>
<td>Inadequate pod configuration or sizing.</td>
</tr>
<tr>
<td><strong>Available vs. Requested</strong></td>
<td>Capacity imbalance</td>
<td>Risk of deployment failure due to insufficient reserved space.</td>
</tr>
</tbody>
</table>
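<p>The thresholds above translate naturally into a gate function. Here is a minimal sketch: the limit values come from the table, but the metric names in the dict are hypothetical, not the actual field names used by the agent's queries.</p>

```python
# Red-flag thresholds from the table above (percentages).
THRESHOLDS = {
    "pod_count_pct": 90.0,  # pod density vs. node scheduling limits
    "cpu_pct": 70.0,        # risk of CPU throttling during deployment
    "memory_pct": 80.0,     # risk of Out-of-Memory (OOM) kills
}

def evaluate_cluster(metrics: dict) -> tuple[bool, list[str]]:
    """Return (healthy, red_flags) for a snapshot of cluster metrics."""
    flags = [
        f"{name} at {metrics[name]:.1f}% (limit {limit:.0f}%)"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]
    return (not flags, flags)

# High pod density trips the gate even though CPU and memory are fine.
healthy, flags = evaluate_cluster({"pod_count_pct": 95.0, "cpu_pct": 40.0, "memory_pct": 60.0})
print(healthy, flags)
```

<p>In the pipeline described below, this decision is made by the AI agent from live metrics rather than hard-coded locally, but the pass/block contract is the same.</p>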
<p>I will show you how to build a CI/CD pipeline that integrates Observability AI Agents with GitHub Actions via a Model Context Protocol (MCP) server, creating automated pre-deployment health checks for Kubernetes clusters.</p>
<p>By introducing an observability checkpoint before deployment, we transform the pipeline into an intelligent system that:</p>
<ul>
<li>
<p><strong>Queries real-time metrics</strong> from Kubernetes clusters</p>
</li>
<li>
<p><strong>Analyzes capacity</strong> using custom ES|QL queries</p>
</li>
<li>
<p><strong>Makes autonomous decisions</strong> about deployment readiness</p>
</li>
<li>
<p><strong>Prevents failures proactively</strong> rather than reacting to them</p>
</li>
<li>
<p><strong>Provides actionable feedback</strong> to engineering teams</p>
</li>
</ul>
<p>Here is the “architecture” of what is being deployed and how it works in this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/diagram_mcp_flow.png" alt="Github Actions and Elastic MCP Server Architecture" /></p>
<p>As you can see, the flow uses Elastic Observability, which stores and analyzes Kubernetes OpenTelemetry metrics from the <code>opentelemetry-kube-stack-cluster-stats-collector</code> (deployed via the OpenTelemetry Operator).</p>
<p><a href="https://github.com/features/actions">GitHub Actions</a> calls the Observability Kubernetes Agent via the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic MCP server</a>, which has tools that check for some of the “red flag” issues identified in the table above.</p>
<p>Based on the results, GitHub Actions will either stop the process or continue to deploy the artifact via a trigger for <a href="https://argo-cd.readthedocs.io/en/stable/">ArgoCD</a>.</p>
<p>The Observability Kubernetes Agent, along with several of the tools it uses, was built with Elastic’s Agent Builder capability; both the agent and its tools are then exposed via the MCP server.</p>
<p>Hence the overall set of components used here include:</p>
<ol>
<li>
<p><strong>GitHub Actions</strong>: Orchestrates the build and deployment workflow</p>
</li>
<li>
<p><strong>Elastic MCP Server</strong>: Serverless endpoint that exposes AI agents</p>
</li>
<li>
<p><strong>Observability Kubernetes Agent</strong>: Custom agent with specialized ESQL tools</p>
</li>
<li>
<p><strong>Kubernetes Cluster</strong>: Target deployment environment with metrics collection</p>
</li>
<li>
<p><strong>ES|QL Query Tools</strong>: Precision queries for node and pod resource analysis</p>
</li>
</ol>
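<p>The gating behavior these components implement can be sketched in a few lines. The following TypeScript helper is a hypothetical illustration (the phrase matching is an assumption for this demo, not part of the Elastic MCP API): it inspects the agent’s natural-language reply and decides whether the workflow should fail.</p>
<pre><code class="language-ts">// Hypothetical sketch: decide whether to block the deploy based on the
// agent's natural-language health summary. The keyword check below is an
// illustrative assumption, not part of the Elastic MCP API.
function shouldBlockDeployment(agentResponse: string): boolean {
  const text = agentResponse.toLowerCase();
  // The agent flags capacity problems with phrases like these in its summary
  return text.includes('more than 25%') || text.includes('exceeding 25%');
}
</code></pre>
<p>In the workflow, a result of <code>true</code> corresponds to exiting with code 1, which is what blocks the downstream deploy job.</p>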
<h2>What Happens When a Kubernetes Health Check Fails in GitHub Actions?</h2>
<h3>How the Pipeline Blocks a Deployment Automatically</h3>
<p>When the cluster exceeds capacity thresholds, the workflow automatically blocks the deployment. In this scenario I didn’t actually load the cluster; instead, I used a simple check for whether more than 25% of resources were in use, a deliberately low threshold, to force the deployment to stop.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/github_actions_failure.png" alt="GitHub Actions workflow run with a failed health check blocking deployment" /></p>
<p>The workflow shows:</p>
<ul>
<li>
<p>Build Docker Image (28s)</p>
</li>
<li>
<p>Push to Docker Hub (5s)</p>
</li>
<li>
<p>K8s Health via Elastic O11y K8s Agent (16s) - <strong>FAILED</strong></p>
</li>
<li>
<p>Deploy to otel-test Cluster - <strong>BLOCKED</strong></p>
</li>
</ul>
<p><strong>Annotation</strong>: &quot;Cluster has resource issues - blocking deployment&quot;</p>
<h3>What Does the AI Agent's Health Check Response Look Like?</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/github_actions_logs.png" alt="GitHub Actions logs showing the agent's health check response" /></p>
<p>The agent provides detailed analysis:</p>
<pre><code class="language-bash">Step 1: Finding Kubernetes analysis agent...
Found agent: Observability Kubernetes Agent (kubernetes_analysis_agent)

Step 2: Querying cluster health...
Prompt: tell me if my cluster otel-test is using more than 25% memory or CPU on any of its nodes

Agent Response:
================================================================
Yes, your cluster &quot;otel-test&quot; has nodes and pods using more than 25% of resources.

++Node exceeding 25%:++
- ip-192-168-165-175.us-west-2.compute.internal
  - Memory: 36.44%
  - CPU: 7.99% (below threshold)

++All other nodes are below the 25% threshold++ for both CPU and memory.

While the query for pods doesn't show percentage values directly, the data indicates
normal resource usage patterns for the pods in your cluster, with none appearing to
consume excessive resources relative to their allocations.
================================================================

Cluster has resource issues - blocking deployment
Error: Process completed with exit code 1.

</code></pre>
<p>As you can see, a natural-language prompt was sent to the Observability Kubernetes Agent via MCP, rather than having to build custom logic or call another script.</p>
<p>This single check prevented:</p>
<ul>
<li>
<p>A deployment that would have failed</p>
</li>
<li>
<p>Wasted CI/CD minutes</p>
</li>
<li>
<p>Potential service degradation</p>
</li>
<li>
<p>Manual SRE intervention</p>
</li>
</ul>
<p>What it provided:</p>
<ul>
<li>Actionable intelligence for capacity planning</li>
</ul>
<h2>How to Build a Kubernetes Health Check Agent in Elastic</h2>
<p>Building the agent isn’t hard: <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Elastic’s Agent Builder</a> UI makes it easy to create the agent and have it running in minutes.</p>
<h3>How to Configure the Observability Kubernetes Agent</h3>
<p>Other than naming the agent, you need to provide it with some instructions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/elastic_agent_config.png" alt="Observability Kubernetes Agent configuration in Elastic Agent Builder" /></p>
<p><strong>Custom Instructions</strong>:</p>
<pre><code class="language-bash"># Agent Instructions

## Primary Role
You are a Kubernetes monitoring assistant that helps users analyze cluster performance
and resource utilization. Your primary goal is to provide clear, accurate information
about Kubernetes clusters using available data sources.

## Tool Selection Guidelines
1. When users ask about Kubernetes metrics, node performance, or cluster health:
   - Use ESQL tools for detailed analysis
   - Query metrics from kubeletstatsreceiver.otel-default

2. For alert-related queries:
   - Use the alerts tool to check active alerts

3. Always provide context about:
   - Time ranges queried
   - Cluster names
   - Resource thresholds

</code></pre>
<h3>How to Write ES|QL Queries for Kubernetes Node and Pod Metrics</h3>
<p>I created several tools that check node CPU and memory, pod CPU and memory, and pod OOM events. Additionally, the Observability Kubernetes Agent makes use of many of the out-of-the-box (OOTB) tools, such as observability_alerts.</p>
<p>Here is an example of the node CPU and memory tool, which uses a simple ES|QL query against OpenTelemetry metrics to check the CPU and memory utilization in the cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/elastic_tool_node_metrics.png" alt="Node metrics ES|QL tool in Elastic Agent Builder" /></p>
<p><strong>ES|QL Query</strong>:</p>
<pre><code class="language-bash">FROM metrics-kubeletstatsreceiver.otel-default
| WHERE resource.attributes.k8s.cluster.name == ?cluster_name
  AND @timestamp &gt; NOW() - 3 hours
| STATS
    avg_cpu_usage = AVG(metrics.k8s.node.cpu.usage),
    avg_memory_usage = AVG(metrics.k8s.node.memory.usage),
    avg_memory_available = AVG(metrics.k8s.node.memory.available),
    avg_memory_working_set = AVG(metrics.k8s.node.memory.working_set)
  BY resource.attributes.k8s.node.name
| EVAL
    cpu_usage_pct = avg_cpu_usage * 100,
    memory_usage_pct = (avg_memory_working_set / (avg_memory_working_set + avg_memory_available)) * 100
| SORT cpu_usage_pct DESC, memory_usage_pct DESC
| KEEP resource.attributes.k8s.node.name, cpu_usage_pct, memory_usage_pct
| LIMIT 100

</code></pre>
<p><strong>Parameters</strong>:</p>
<ul>
<li><code>cluster_name</code> (string): Name of the K8s cluster to analyze</li>
</ul>
<h3>How to Expose the Agent via the Elastic MCP Server</h3>
<p>Once configured, the agent is automatically available via Elastic's MCP server running in your Observability project. The MCP server provides a standardized interface that any MCP-compatible client can query.</p>
<p><strong>MCP Endpoint</strong>: <code>https://your-elastic-project.elastic.cloud/mcp</code></p>
<p><strong>Authentication</strong>: Uses Elastic API keys for secure access</p>
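<p>As a rough sketch, a CI step could assemble its request to the MCP endpoint like this. The payload shape and endpoint URL below are illustrative assumptions; only the <code>ApiKey</code> authorization scheme is standard Elastic API key usage — consult the MCP and Agent Builder documentation for the exact contract.</p>
<pre><code class="language-ts">// Hypothetical sketch of a CI step's request to the MCP endpoint. The URL
// and payload shape are assumptions for illustration.
function buildMcpRequest(endpoint: string, apiKey: string, prompt: string) {
  return {
    url: endpoint,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Elastic API keys use the ApiKey authorization scheme
        'Authorization': 'ApiKey ' + apiKey,
      },
      body: JSON.stringify({ prompt: prompt }),
    },
  };
}
</code></pre>
<p>The returned <code>url</code> and <code>init</code> can then be passed to <code>fetch</code> from the workflow script.</p>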
<h2>Why Agentic CI/CD Matters for Kubernetes Operations</h2>
<p>Agentic CI/CD represents an evolution in proactive deployment strategies. By integrating Elastic Observability AI agents with GitHub Actions via MCP, we've created a system that:</p>
<ul>
<li><strong>Prevents failures before they happen</strong></li>
<li><strong>Provides real-time cluster health insights</strong></li>
<li><strong>Makes data-driven deployment decisions</strong></li>
<li><strong>Reduces operational burden on SRE teams</strong></li>
<li><strong>Improves overall deployment reliability</strong></li>
</ul>
<p>This approach is at the cutting edge of modern CI/CD practices. While traditional pipelines focus solely on the &quot;Build-Push-Deploy&quot; cycle, agentic pipelines introduce automated pre-deployment guardrails using observability data, transforming your CI/CD infrastructure into an intelligent agent that actively protects production environments.</p>
<h2>Resources and Next Steps</h2>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your pipeline.</p>
<h3>Documentation</h3>
<ul>
<li>
<p><a href="https://www.elastic.co/elasticsearch/agent-builder">Elastic Agent Builder</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ESQL Query Language</a></p>
</li>
<li>
<p><a href="https://modelcontextprotocol.io/">Model Context Protocol</a></p>
</li>
<li>
<p><a href="https://docs.github.com/en/actions/using-workflows">GitHub Actions Workflows</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/agentic-cicd-kubernetes-mcp-server/diagram_mcp_flow.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Bringing observability insights from Elastic AI Assistant to the world of GitHub Copilot]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-assistant-to-github-copilot</link>
            <guid isPermaLink="false">ai-assistant-to-github-copilot</guid>
            <pubDate>Thu, 23 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[GitHub announced GitHub Copilot Extensions this week at Microsoft Build. We are working with the GitHub team to bring observability insights from Elastic AI Assistant to GitHub Copilot users.]]></description>
            <content:encoded><![CDATA[<p>GitHub <a href="https://github.blog/2024-05-21-introducing-github-copilot-extensions/">announced</a> GitHub Copilot Extensions this week at Microsoft Build. We are working with the GitHub team in the Limited Beta Program to explore bringing observability insights from Elastic AI Assistant to GitHub Copilot users.</p>
<p>Elastic’s GitHub Copilot Extension aims to combine the capabilities of GitHub Copilot and Elastic AI Assistant for Observability. This could enable developers to access critical insights from Elastic AI Assistant in GitHub Copilot Chat on GitHub.com, Visual Studio, and VS Code - places where they write their code.</p>
<p>Developers will be able to ask questions such as:</p>
<ul>
<li>What errors are active?</li>
<li>What’s the latest stacktrace for my application?</li>
<li>What caused a slowdown in the application after the last push to the dev environment?</li>
<li>How do I write an ES|QL query that my app will send to Elasticsearch?</li>
<li>What runbook from GitHub has been loaded into Elasticsearch and is related to the issue I’m investigating?</li>
</ul>
<p>And many more!</p>
<p><a href="https://build.microsoft.com/en-US/sessions/acc48a7a-b412-4b4f-88a6-53ef4b2cb2bc?source=/schedule">Watch Jeff's PoC Demo@Microsoft Build 2024</a></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-assistant-to-github-copilot/elastic-copilot-vscode.png" alt="Elastic's Copilot Extension in VSCode" /></p>
<p><em>Elastic AI Assistant surfaced in GitHub Copilot Chat from our Extension (Proof of Concept)</em></p>
<h2>What is the Elastic AI Assistant for Observability</h2>
<p>The Elastic AI Assistant for Observability, a user-centric tool, is a game-changer in providing contextual insights and streamlining troubleshooting within the Elastic Observability environment. By harnessing generative AI capabilities, the assistant offers open prompts that decipher error messages and propose remediation actions. It adopts a Retrieval-Augmented Generation (RAG) approach to fetch the most pertinent internal information, such as APM traces, log messages, SLOs, GitHub issues, runbooks, and more. This contextual assistance is a huge leap forward for Site Reliability Engineers (SREs) and operations teams, offering immediate, relevant solutions to issues based on existing documentation and resources, boosting developer productivity.</p>
<p>For more information on setting up and using the AI Assistant for Observability check out the blog <a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI</a>. Additionally, learn how <a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Elastic Observability AI Assistant uses RAG to help analyze application issues with GitHub issues</a>.</p>
<p>One unique feature of the AI Assistant is its API support. This allows you to take advantage of all the capabilities provided by the Elastic AI Assistant, and integrate them right into your workflow.</p>
<h2>What is a GitHub Copilot Extension</h2>
<p>GitHub Copilot Extensions, a new addition to GitHub Copilot, revolutionizes the developer experience by integrating a diverse array of tools and services directly into the developer's workflow. These unique extensions, crafted by partners, enable developers to interact with various services and tools using natural language within their Integrated Development Environment (IDE) or GitHub.com. This integration eliminates the need for context-switching, allowing developers to maintain their flow state, troubleshoot issues, and deploy solutions with unparalleled efficiency. These extensions will be accessible through GitHub Copilot Chat in the GitHub Marketplace, with options for organizations to create private extensions tailored to their internal tooling.</p>
<h2>What’s next</h2>
<p>We are participating in the GitHub Limited Beta Program as a partner and exploring the possibility of bringing the Elastic GitHub Copilot Extension to the GitHub Marketplace. We are excited to unlock insights from Elastic Observability for GitHub Copilot users, side by side with the code behind those services. Stay tuned!</p>
<p>Resources:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting Started with Elastic AI Assistant for Observability with Azure OpenAI</a></li>
<li><a href="https://ela.st/assistant-escapes">The Elastic AI Assistant for Observability escapes Kibana!</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Elastic Observability AI Assistant uses RAG to help analyze application issues with GitHub issues</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks">Troubleshooting with Elastic AI Assistant using your organization's runbooks</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">The AI Assistant Observability documentation</a></li>
<li><a href="https://github.blog/2024-05-21-introducing-github-copilot-extensions/">GitHub Copilot Extensions Blog Announcement</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL documentation</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-assistant-to-github-copilot/githubcopilot-aiassistant-C-2x.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI-driven incident response with logs: A technical deep dive in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-driven-incident-response-with-logs</link>
            <guid isPermaLink="false">ai-driven-incident-response-with-logs</guid>
            <pubDate>Mon, 20 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How Elastic combines ML anomaly detection, ES|QL, and the AI Assistant to accelerate incident response using logs.]]></description>
            <content:encoded><![CDATA[<h1>AI-driven incident response with logs: A technical deep dive in Elastic Observability</h1>
<p>Modern customer‑facing applications, whether e‑commerce sites, streaming platforms, or API gateways, run on fleets of microservices and cloud resources. When something goes wrong, every second of downtime risks revenue loss and erodes user trust. Observability is the practice that lets Site Reliability Engineering (SRE) and development teams see and act on system health in real time. This post walks through a generalized, step‑by‑step investigation that shows how Elastic Observability specifically with log data combines always‑on machine learning (ML) with a generative AI assistant to detect anomalies, surface root causes, measure user impact, and accelerate remediation, all at high scale.</p>
<h2>Anomaly Detection</h2>
<p>A production environment is ingesting millions of log lines per minute. Elastic’s AIOps jobs continuously profile normal log throughput and content without any manual rules. When log volume or message structure deviates beyond learned baselines, the platform automatically fires a high‑fidelity anomaly alert. Because the models are unsupervised, they adapt to changing traffic patterns and flag both sudden spikes (e.g., 10× error surge) and rare new log categories.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image3.png" alt="" /></p>
<p>In addition to looking directly for log spikes, Elastic trains seasonal, univariate models to predict expected event counts per bucket and applies statistical tests to classify outliers. Simultaneously, log categorization clusters similar messages with cosine similarity on token embeddings, making it trivial to identify a previously unseen error string.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image10.png" alt="" /></p>
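<p>To make the clustering idea concrete, here is a simplified TypeScript sketch of cosine similarity over crude term-frequency vectors. Elastic’s categorization operates on token embeddings at scale; this toy version only illustrates why two variants of the same error message cluster together while an unrelated message does not.</p>
<pre><code class="language-ts">// Toy version of the similarity measure behind log categorization.
// Vectors are assumed to have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i !== a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Crude term-frequency vector over a shared vocabulary (a stand-in for
// the token embeddings Elastic actually uses).
function tfVector(message: string, vocab: string[]): number[] {
  const tokens = message.toLowerCase().split(/\W+/);
  return vocab.map(function (term) {
    let count = 0;
    for (const token of tokens) {
      if (token === term) {
        count += 1;
      }
    }
    return count;
  });
}
</code></pre>
<p>With vectors built this way, two “table is full” variants score far higher against each other than either does against an unrelated connection-timeout line.</p>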
<h2>Investigating Alerts: Automated Pattern Analysis</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image9.png" alt="" /></p>
<p>Clicking the alert reveals more than a timestamp. Elastic’s ML job already correlates the spike with the dominant new log pattern <code>ERROR 1114 (HY000): table &quot;orders&quot; is full</code> and surfaces example lines. Instead of grep‑driven hunting, engineers get an immediate hypothesis about which subsystem is failing and why.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image4.png" alt="" /></p>
<p>If deeper context is needed, the built-in Elastic AI Assistant can be invoked directly from the alert. Thanks to Retrieval‑Augmented Generation (RAG) over your telemetry, the assistant explains the anomaly in plain language, references the exact log events, and proposes next steps without hallucinating.</p>
<h2>AI‑Assisted Root Cause Verification</h2>
<p>From within the same chat, you might ask, “Using Lens, create a single graph of all HTTP response status codes &gt;=400 from logs-nginx.access-default over the last 3 hours.” The assistant translates that intent into an ES|QL aggregation, retrieves the data, and renders a bar chart with no DSL knowledge required. If there are a number of errors with status codes at or above 400, you’ve validated that end‑users are impacted.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image7.png" alt="" /></p>
<h2>Global Impact Analysis with Enriched Logs</h2>
<p>Structured log enrichment (e.g., GeoIP, user ID, service tags) lets the assistant answer business questions on the fly. A query like “What are the top 10 source.geo.country_name with http.response.status.code&gt;=400 over the last 3 hours. Use logs-nginx.access-default. Provide counts for each country name.” surfaces whether the incident is regional or global.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image2.png" alt="" /></p>
<h2>Quantifying Business Impact</h2>
<p>Technical metrics alone rarely sway executives. Suppose historical data shows the application normally processes $1,000 in transactions per minute. The assistant can combine that baseline with real‑time failure counts to estimate revenue loss. Presenting financial impact alongside error graphs sharpens prioritization and justifies extraordinary remediation steps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image5.png" alt="" /></p>
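<p>The arithmetic behind such an estimate is simple. A hypothetical helper, using the $1,000-per-minute baseline mentioned above:</p>
<pre><code class="language-ts">// Illustrative only: real failure ratios and durations come from telemetry.
function estimateRevenueLoss(
  baselinePerMinute: number, // normal transaction volume, e.g. 1000 ($/min)
  failedRatio: number,       // fraction of transactions failing (0 to 1)
  durationMinutes: number    // incident duration so far
): number {
  return baselinePerMinute * failedRatio * durationMinutes;
}

// 40% of transactions failing for 15 minutes at a $1,000/min baseline:
// estimateRevenueLoss(1000, 0.4, 15) returns 6000
</code></pre>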
<h2>Pinpointing Infrastructure &amp; Ownership</h2>
<p>Every log is automatically enriched with Kubernetes, cloud, and custom metadata. A single question “Which pod and cluster emit the ‘table full’ error, and who owns it?” returns the full information about the pod, namespace and owner as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image1.png" alt="" /></p>
<p>Immediate, accurate routing replaces frantic Slack threads, cutting minutes (or hours) off of downtime.</p>
<p>Some of the magic happening here is because we can put instructions in the Elastic AI Assistant’s knowledge base to guide it. For example, this simple entry in the knowledge base is what allows the assistant to populate the response in the previous screenshot.</p>
<pre><code class="language-markdown">If asked about Kubernetes pod, namespace, cluster, location, or owner run the &quot;query&quot; tool.
1. Use the index `logs-mysql.error-default` unless another log location is specified.
2. Include the following fields in the query:
   - Pod: `agent.name`
   - Namespace: `data_stream.namespace`
   - Cluster Name: `orchestrator.cluster.name`
   - Cloud Provider: `cloud.provider`
   - Region: `cloud.region`
   - Availability Zone: `cloud.availability_zone`
   - Owner: `cloud.account.id`
3. Use the ES|QL query format:
   esql
   FROM logs-mysql.error-default
   | KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
   
4. Ensure the query is executed within the appropriate time range and context. 
</code></pre>
<h2>Leveraging Institutional Knowledge with RAG</h2>
<p>Elastic can index runbooks, GitHub issues, and wikis alongside telemetry. Asking “Find documentation on fixing a full orders table” retrieves and summarizes a prior runbook that details archiving old rows and adding a partition. Grounding remediation in proven procedures avoids guesswork and accelerates fixes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image6.png" alt="" /></p>
<h2>Automated Communication &amp; Documentation</h2>
<p>Good incident response includes timely stakeholder updates. A prompt such as “Draft an incident update email with root cause, impact, and next steps” lets the assistant assemble a structured message and send it via the alerting framework’s email or Slack connector complete with dashboard links and next‑update timelines. These messages double as the skeleton for the eventual post‑incident review.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image8.png" alt="" /></p>
<p>As before, some of the magic here comes from instructions in the Elastic AI Assistant’s knowledge base. For example, we can instruct the AI Assistant how to call the execute_connector API. This API can execute all kinds of connectors (not only email), so you could use it to tell the assistant to post to Slack, raise a ServiceNow ticket, or even execute webhooks.</p>
<pre><code class="language-markdown">Here are specific instructions to send an email. Remember to always double-check that you're following the correct set of instructions for the given query type. Provide clear, concise, and accurate information in your response.

## Email Instructions

If the user's query requires sending an email:
1. Use the `Elastic-Cloud-SMTP` connector with ID `elastic-cloud-email`.
2. Prepare the email parameters:
   - Recipient email address(es) in the `to` field (array of strings)
   - Subject in the `subject` field (string)
   - Email body in the `message` field (string)
3. Include
   - Details for the alert along with a link to the alert
   - Root cause analysis
   - Revenue impact
   - Remediation recommendations
   - Link to GitHub issue
   - All relevant information from this conversation
   - Link to the Business Health Dashboard
4. Send the email immediately. Do not ask the user for confirmation.
5. Execute the connector using this format:
   
   execute_connector(
     id=&quot;elastic-cloud-email&quot;,
     params={
       &quot;to&quot;: [&quot;recipient@example.com&quot;],
       &quot;subject&quot;: &quot;Your Email Subject&quot;,
       &quot;message&quot;: &quot;Your email content here.&quot;
     }
   )
   
6. Check the response and confirm if the email was sent successfully.
</code></pre>
<h2>Conclusion &amp; Key Takeaways</h2>
<p>Elastic Observability's combination of unsupervised ML, schema-aware data ingestion, and a context-rich RAG powered AI assistant enables teams to transform incident response from reactive firefighting into proactive, data-driven operations. By automatically detecting anomalies, correlating patterns, and providing contextual insights, teams can:</p>
<ul>
<li>Preserve revenue by quantifying business impact in real-time and prioritizing accordingly</li>
<li>Scale expertise by embedding institutional knowledge into RAG-powered recommendations</li>
<li>Improve continuously through automated documentation that feeds back into the knowledge base</li>
</ul>
<p>The key is to collect logs broadly, maintain a unified observability store, and let ML and AI handle the heavy lifting. The payoff isn't just reduced downtime, it's the transformation of incident response from a source of organizational stress into a competitive advantage.</p>
<p>Try out this exact scenario and get hands-on with this Elastic Logging Workshop: <a href="https://play.instruqt.com/elastic/invite/rx4yvknhpfci">https://play.instruqt.com/elastic/invite/rx4yvknhpfci</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/ai-driven-incident-response-with-logs.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI agent observability and monitoring with OTel, OpenLit & Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-observability-web-agents-openlit</link>
            <guid isPermaLink="false">ai-observability-web-agents-openlit</guid>
            <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to monitor AI web agents to identify performance bottlenecks, token waste, and hallucinations using OpenTelemetry, OpenLit, and Elastic]]></description>
            <content:encoded><![CDATA[<p>AI agents don't fail like traditional apps. They hallucinate, loop, burn tokens, and make unpredictable tool calls that standard monitoring was never designed to capture. Traditional APM tools show HTTP status codes and latency, but they miss the AI-specific failures that matter: prompt injection attempts, evaluation score degradation, and tool-calling loops.</p>
<p>This guide explains the key considerations for full-stack monitoring of AI web agents, exploring both best practices and practical examples using OpenLit, OpenTelemetry, and Elastic. Specifically, we'll cover monitoring an example web travel planner <a href="https://github.com/carlyrichmond/observing-ai-agents">located in this example repo</a>.</p>
<h2>Why is AI agent observability different?</h2>
<p>The aim of traditional monitoring is to detect and alert on failures, performance issues, inefficiencies and resource bottlenecks. Monitoring AI agents still adheres to this common goal, but there are several differences that must be considered:</p>
<ul>
<li>AI models are probabilistic, meaning that the same input can lead to different outputs. This makes it hard to define and monitor success based on a single correct answer.</li>
<li>AI systems can appear to function correctly on the surface, but their outputs may be suspect, incorrect, or biased without a way to immediately detect it. Telemetry must therefore be able to capture hidden capabilities such as tool call executions for SREs to scrutinize.</li>
<li>The dynamic and evolving nature of LLMs means their behavior can change dramatically between updates and versions due to changes in data, embeddings, or prompts. This makes monitoring and pre-production evaluation when upgrading vitally important for performance continuity.</li>
<li>Models are black boxes. For this reason it's often difficult to understand why an AI made a particular decision. This makes troubleshooting harder compared to systems with clear, explicit logic.</li>
<li>Beyond traditional metrics, AI output must be monitored for issues like hallucinations (generating false information), toxicity, and bias, which can damage user trust and lead to reputational harm.</li>
<li>The performance of an AI system can vary greatly depending on context, including user interaction. Capturing user prompts alongside telemetry helps establish a complete picture of system performance.</li>
<li>From a security perspective, AI agents can be vulnerable to adversarial attacks including data poisoning and obfuscation. Monitoring for unusual behavioral patterns and prompts is crucial to detect and mitigate these threats.</li>
</ul>
<p>For these reasons the metrics and tracing that SREs capture and investigate will differ.</p>
<h2>AI agent monitoring in practice</h2>
<p>Let's apply these concepts by instrumenting an actual AI agent and capturing telemetry. Here we shall be using the <a href="https://github.com/openlit/openlit">TypeScript SDK of OpenLit</a>, an open-source library that generates OpenTelemetry signals from LLM interactions in JavaScript applications. Specifically, we shall instrument a simple web travel planner agent, available <a href="https://github.com/carlyrichmond/observing-ai-agents">here</a>, that uses LLMs to generate travel recommendations based on user prompts and information from various tools. OpenLit works well for this type of project due to its TypeScript SDK and built-in capabilities for capturing LLM interactions, tool calls, and generating evaluation and guardrail metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/observable-travel-planner.gif" alt="Travel Planner Example Interaction" /></p>
<p>The architecture diagram shows the key components:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/observable-travel-planner-architecture.png" alt="Travel Planner Agent Architecture Diagram" /></p>
<p>The concepts and best practices discussed in this article can be applied to any AI agent regardless of the specific monitoring tools used. Many vendors have AI monitoring capabilities. Alternative open source technologies are also available for agentic monitoring, including <a href="https://www.langchain.com/langsmith/observability">LangSmith</a>, <a href="https://github.com/traceloop/openllmetry">OpenLLMetry</a>, or indeed manual instrumentation using <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry SDKs and the AI semantic conventions</a>.</p>
<h2>Prerequisites</h2>
<p>This project requires that the following prerequisites are met:</p>
<ul>
<li>Active Elastic cluster (Cloud, Serverless or self-managed)</li>
<li>OpenAI, Azure OpenAI, or compatible LLM provider API key</li>
<li>Node.js 18+ with npm or yarn</li>
<li>OTLP-compatible endpoint (Elastic Managed OTLP endpoint or OTel collector)</li>
</ul>
<p>The following environment variables should be set:</p>
<ul>
<li><code>OTEL_ENDPOINT</code>: Your Elastic OTLP endpoint or OTel collector URL</li>
<li><code>OPENAI_API_KEY</code>: API key for the evaluation/guardrail LLM</li>
<li><code>OPENAI_ENDPOINT</code>: Optional custom base URL for OpenAI-compatible providers</li>
</ul>
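<p>A missing endpoint or key typically surfaces only as silent telemetry loss or failed LLM calls, so it can help to validate these variables at startup. Below is a minimal sketch; the <code>requireEnv</code> helper and <code>loadConfig</code> function are illustrative, not part of OpenLit or the example project:</p>

```typescript
// Illustrative helper (not part of OpenLit): fail fast when a required
// environment variable is unset, rather than losing telemetry silently.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// OTEL_ENDPOINT and OPENAI_API_KEY are required; OPENAI_ENDPOINT is optional.
function loadConfig() {
  return {
    otelEndpoint: requireEnv("OTEL_ENDPOINT"),
    openAiApiKey: requireEnv("OPENAI_API_KEY"),
    openAiEndpoint: process.env.OPENAI_ENDPOINT, // may be undefined
  };
}
```

<p>Calling a check like this before <code>openlit.init</code> turns a misconfigured deployment into an immediate, visible error.</p>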
<h2>Basic instrumentation</h2>
<p>DevOps engineers and SREs often start with automatic instrumentation to obtain basic telemetry. This is possible with the <a href="https://docs.openlit.io/latest/openlit/quickstart-ai-observability#python-2">OpenLit Python SDK</a>; with TypeScript, however, we have to add our configuration manually to the AI entrypoint (here <code>api/chat/route.ts</code>).</p>
<p>First we install the dependency using our favourite package manager:</p>
<pre><code class="language-shell">npm install openlit
</code></pre>
<p>Then we add the OpenLit configuration to our entrypoint:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity 

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;, // akin to OTEL resource name
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT, // OTLP compatible endpoint (Elastic ingest or OTel collector)
  disableBatch: true, // batching disabled for demo purposes - not recommended for production use
});

// Post request handler
export async function POST(req: Request) {
   // AI logic omitted for brevity - see full code in repo
}
</code></pre>
<p>This instrumentation will automatically generate OpenTelemetry traces for all LLM interactions, including tool calls, and send them to the specified OTLP endpoint. Note that for production rather than demo usage, <code>environment</code> should be set to <code>production</code> and batching should not be disabled to ensure optimal network usage and protect the OTel backend.</p>
<p>Let's discuss the key telemetry signals that are generated in subsequent sections.</p>
<h3>Inputs</h3>
<p>The first rule of debugging AI agents is simple: if you don't capture the prompt, you can't reproduce the problem. Unlike traditional applications where inputs are predictable request parameters, AI agents consume free-form user prompts that can trigger wildly different behaviors based on subtle phrasing changes. OpenLit automatically captures system prompts and all user messages as structured attributes on your traces, giving you the exact input that caused your agent to hallucinate, loop, or fail.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-prompt-capture.png" alt="Elastic prompt capture example" /></p>
<p>The full conversation needs to be available to SREs so they can understand the context of failures and performance issues and identify patterns in the inputs that may be causing problems. However, these inputs are also useful for improving agent behavior: once sanitized of identifiable attributes such as PII, they can be reused as test messages to evaluate model performance and validate enhancements.</p>
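<p>As an illustration of that sanitization step, a simple redaction pass might mask obvious identifiers before prompts are stored for test reuse. This is a minimal sketch with illustrative patterns; production PII scrubbing needs a dedicated tool rather than two regular expressions:</p>

```typescript
// Illustrative redaction: mask email addresses and long digit runs
// (e.g. phone or booking numbers) before storing prompts for test reuse.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const LONG_DIGITS = /\b\d{7,}\b/g;

function sanitizePrompt(prompt: string): string {
  return prompt
    .replace(EMAIL, "[REDACTED_EMAIL]")
    .replace(LONG_DIGITS, "[REDACTED_NUMBER]");
}
```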
<p>Beyond prompts, we still need comprehensive logging, specifically capturing the full stack traces emitted by our applications. This is crucial for diagnosing issues that arise from the underlying infrastructure or codebase rather than the AI model itself. For example, the error below, sent to Elastic, shows a simple fetch error. We must not forget that traditional errors can still occur in AI applications, and capturing them is essential.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-error-log.png" alt="Elastic error log example" /></p>
<h3>Tracing</h3>
<p>Traces are essential for understanding the flow of requests through your AI agent, especially when it comes to tool calls. Generally, a trace is a hierarchy of spans, each of which is a single, timed unit representing a specific operation, such as a database query or an HTTP handler. In AI systems, spans also represent tool calls made by the LLM, along with the API calls and data retrieval steps performed within the tool execution.</p>
<p>Visualizing tool calling patterns is important in validating pre-production systems as well as monitoring production systems for several reasons:</p>
<ol>
<li>It helps us evaluate the tool calling capabilities of different models. LLMs make the choice of which tools to use based on the user prompt, system instructions and the tool metadata (such as name and description). By visualizing the tool calling patterns we can understand whether the model is correctly interpreting the tool metadata and making appropriate calls based on the prompt.</li>
<li>It allows us to identify inefficient or erroneous tool calling patterns. For example, if we see a pattern of repeated calls to the same tool with similar inputs, it may indicate that the model is stuck in a loop or not effectively utilizing the tools. Conversely, if a single tool is called where we would expect multiple tools to be called, it may indicate that the model is not recognizing that the prompt or system instructions require said tool(s).</li>
<li>Commonly occurring tool-calling patterns can also be identified to optimize the available tools. For example, if the location and weather tools are frequently called together, it may make sense to combine them into a single tool that provides both pieces of information in one call.</li>
</ol>
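<p>The repeated-call pattern in point 2 can even be flagged programmatically from trace data. Below is a hedged sketch that assumes spans have already been reduced to simple <code>{tool, input}</code> records; this shape is illustrative, not an OpenLit or OpenTelemetry API:</p>

```typescript
// Illustrative span summary: a tool name plus its serialized input.
interface ToolCall {
  tool: string;
  input: string;
}

// Flag a likely loop: the same tool invoked with identical input more
// than `threshold` times within a single trace.
function detectRepeatedCalls(calls: ToolCall[], threshold = 2): string[] {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = `${call.tool}:${call.input}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > threshold)
    .map(([key]) => key);
}
```

<p>Wiring a check like this into a scheduled query over trace data can surface looping agents before users report them.</p>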
<p>With the above configuration, we can see the traces for each tool call, as illustrated in the below example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-otel-trace-tool-call.png" alt="Elastic OTel trace tool calling example" /></p>
<h3>Metrics</h3>
<p>While tracing is essential for understanding the flow of requests and tool calls, metrics are crucial for monitoring the overall health and performance of your system, agentic or not. When considering metrics, many think solely of cost and total token usage. While both are important, they are not the only metrics that matter.</p>
<p>Through the example above, OpenLit automatically generates key metrics that can be used to evaluate agent performance, such as request latency, error rates, cost and token usage, which can be visualized in Elastic to identify trends and anomalies. Token usage specifically can be split by input, output and reasoning token counts, helping us identify optimization opportunities at key stages in the generation cycle. For example, an increase in input token counts may indicate a significant increase in context length that can be optimized via prompt and context engineering techniques.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-metrics-dashboard.png" alt="Elastic sample AI metrics dashboard" /></p>
<p>It's also important to capture traditional performance measures such as CPU, memory, and request counts for caches, traditional databases and vector databases. This helps us identify whether performance issues are caused by the AI model itself or by underlying infrastructure problems. Alerting on spikes in key measures, such as large increases in token usage or request volumes, is also considered best practice.</p>
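<p>As a sketch of such an alert condition, a token-usage spike can be approximated by comparing the latest count against a rolling baseline. The window contents and multiplier here are arbitrary illustrative choices, not Elastic defaults:</p>

```typescript
// Flag a token-usage spike: the latest count exceeds the mean of a
// recent window of counts by `factor`. Values here are illustrative.
function isTokenSpike(history: number[], latest: number, factor = 2): boolean {
  if (history.length === 0) return false; // no baseline yet
  const mean = history.reduce((sum, value) => sum + value, 0) / history.length;
  return latest > mean * factor;
}
```

<p>In practice the same condition would be expressed as an alert rule over the ingested metrics rather than application code, but the logic is the same.</p>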
<h2>Evaluation</h2>
<p>AI evaluation refers to the process of assessing the performance of AI models and the quality of the responses they generate. This involves monitoring various metrics and signals to ensure that the AI system is functioning as intended, providing accurate outputs, and not exhibiting undesirable behaviors such as hallucinations, toxicity or bias. While evaluation is typically a pre-production activity used to test and validate an agentic system, it's also important to continue monitoring these signals in production to identify issues over time.</p>
<p>There are several different evaluation methodologies that we can use. OpenLit makes use of <em>AI as a Judge</em>: using an LLM to evaluate the quality of the output generated by another LLM against a set of criteria. An example of this approach is depicted below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/llm-as-a-judge-example.png" alt="LLM as a Judge example (credit Zheng et al. 2023)" /></p>
<p>When considering evaluation from a monitoring viewpoint, it's important to identify hallucinations, bias, toxicity and potential injection issues in production. Hallucinations, bias and toxic responses expose us to reputational risk and loss of user trust. Out of the box, OpenLit identifies the following issues, calculating a score and providing an explanation for each:</p>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hallucinations</td>
<td>The LLM generates false or misleading information based on the provided context and its own knowledge</td>
</tr>
<tr>
<td>Bias</td>
<td>A generated response contains bias or statements negatively impacting protected groups and characteristics including but not limited to gender, ethnicity, socioeconomic status or religion</td>
</tr>
<tr>
<td>Toxicity</td>
<td>The LLM returns harmful or offensive content that is threatening, harassing or dismissive</td>
</tr>
</tbody>
</table>
<p>These issues can be identified using the below code:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;,
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available evaluations
const evalsAll = openlit.evals.All({
  provider: &quot;openai&quot;,
  collectMetrics: true, // Ensures evaluations are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Option 2: enable specific evaluations with custom configuration
const evalsHallucination = openlit.evals.Hallucination({
  provider: &quot;openai&quot;,
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure(&quot;gpt-4o&quot;),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) =&gt; {
        // Concatenate tool results and content as full evaluation context
        const toolResults = steps.flatMap((step) =&gt; {
          return step.content
            .filter((content) =&gt; content.type == &quot;tool-result&quot;)
            .map((c) =&gt; {
              return JSON.stringify(c.output);
            });
        });

        // Measure evaluation
        const evalResults = await evalsAll.measure({
          prompt: prompt,
          contexts: convertedMessages
            .map((m) =&gt; {
              return m.content.toString();
            })
            .concat(toolResults),
          text: text,
        });
        console.log(`Evals results: ${evalResults}`);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      &quot;Unable to generate a plan. Please try again later!&quot;
    );
  }
}
</code></pre>
<p>By using the <code>collectMetrics</code> option, the evaluation results are automatically exported as metrics to Elastic, allowing us to monitor the quality of our AI agent's outputs over time and identify trends or issues that may arise in production. The evaluation results can also be used to trigger alerts or automated responses if certain thresholds or <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">SLOs</a> are breached, such as a high evaluation score sustained for several minutes, an increased number of hallucinations detected over time, or the triggering of a toxic result.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-evals-example.png" alt="Evals example" /></p>
<p>The advantage of using LLMs to evaluate results is that they identify issues and inaccuracies far more quickly than manual quality checks. However, this methodology does have limitations, specifically:</p>
<ol>
<li>Increased cost and latency due to the additional requests to an LLM to evaluate results. This can be mitigated by using a smaller, cheaper model for evaluation, by only evaluating a sample of responses, or by using cached responses to reduce the number of LLM calls for similar questions.</li>
<li>LLM evaluations are prone to biases. Specifically, <a href="https://arxiv.org/pdf/2306.05685">Zheng et al. report in their 2023 paper</a> that LLM evaluations are subject to:</li>
</ol>
<ul>
<li>Positional bias, where an LLM prefers responses where the answer is located in a specific position in the response, and may miss correct answers located elsewhere in the reply.</li>
<li>Self-enhancement bias, where LLMs show preference for responses they have generated compared to other models. This can be a consideration if you wish to use cheaper, or self-hosted models for evaluation.</li>
<li>Verbosity bias, where they prefer more expansive responses over succinct replies.</li>
</ul>
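<p>The sampling mitigation mentioned above can be sketched as a deterministic, hash-based sampler, so a given conversation is consistently either evaluated or skipped rather than chosen at random on each request. The hash function and default rate here are illustrative assumptions:</p>

```typescript
// Simple deterministic 32-bit rolling hash of a conversation id.
function hashString(value: string): number {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
  }
  return hash;
}

// Evaluate only a fixed fraction of conversations, but always make the
// same decision for the same id so results stay comparable over time.
function shouldEvaluate(conversationId: string, sampleRate = 0.1): boolean {
  return hashString(conversationId) / 0xffffffff < sampleRate;
}
```

<p>Raising the sample rate for newly deployed prompts and lowering it once behavior stabilizes is one way to balance evaluation cost against coverage.</p>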
<h2>Guardrail monitoring</h2>
<p>In addition to assessing the quality of responses, we must also monitor for dangerous or irrelevant responses that could be harmful to users. The quality of built-in protections within models is patchy and model-dependent. Several research papers, including <a href="https://www.anthropic.com/research/agentic-misalignment">Anthropic's 2025 agentic misalignment paper</a>, show that in some cases models can resort to malicious behaviors, bypassing company policies and moral expectations.</p>
<p>Guardrail detection in monitoring tools allows us to identify risky responses generated by AI agents, such as generating harmful content, engaging in inappropriate interactions, or performing injection attacks in an attempt to compromise systems or elicit confidential information. Using OpenLit as our example, we are able to monitor for breaches of the following guardrail types:</p>
<table>
<thead>
<tr>
<th>Guardrail Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt Injection</td>
<td>Detection of malicious injection attempts, impersonation and other jailbreaking techniques</td>
</tr>
<tr>
<td>Sensitive Topics</td>
<td>Detection of content on controversial, sensitive or illegal topics such as politics, religion, adult content, substance abuse or violence</td>
</tr>
<tr>
<td>Restricted Topics</td>
<td>Detection of content that violates company policies, ethical guidelines or covers topics that the tool should avoid such as giving financial or legal advice</td>
</tr>
</tbody>
</table>
<p>These shields can be set up using OpenLit as per the below code:</p>
<pre><code class="language-ts">import openlit from &quot;openlit&quot;;

// Other imports omitted for brevity

// Allow streaming responses up to 30 seconds to address typically longer responses from LLMs
export const maxDuration = 30;

// Tools and Azure configuration omitted for brevity

openlit.init({
  applicationName: &quot;ai-travel-agent&quot;,
  environment: &quot;development&quot;,
  otlpEndpoint: process.env.OTEL_ENDPOINT,
  disableBatch: true,
});

// Choose one of the following approaches:
// Option 1: enable all available guardrails
const guardsAll = openlit.guard.All({
  provider: &quot;openai&quot;,
  collectMetrics: true,
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT,
  validTopics: [&quot;travel&quot;, &quot;culture&quot;],
  invalidTopics: [&quot;finance&quot;, &quot;software engineering&quot;],
});

// Option 2: enable specific guardrail types (for example, prompt injection detection)
const guardsPromptInjection = openlit.guard.PromptInjection({
  provider: &quot;openai&quot;,
  collectMetrics: true, // Ensures guardrail breaches are exported to Elastic
  apiKey: process.env.OPENAI_API_KEY,
  baseUrl: process.env.OPENAI_ENDPOINT
});

// Post request handler
export async function POST(req: Request) {
  const { messages, id } = await req.json();

  try {
    const convertedMessages = await convertToModelMessages(messages);
    const prompt = `You are a helpful assistant that returns travel itineraries...`;

    const result = streamText({
      model: azure(&quot;gpt-4o&quot;),
      system: prompt,
      messages: convertedMessages,
      stopWhen: stepCountIs(2),
      tools,
      experimental_telemetry: { isEnabled: true },
      onFinish: async ({ text, steps }) =&gt; {
        const guardrailResult = await guardsAll.detect(text);
        console.log(`Guardrail results: ${guardrailResult}`);
      },
    });

    // Return data stream to allow the useChat hook to handle the results as they are streamed through for a better user experience
    return result.toUIMessageStreamResponse();
  } catch (e) {
    console.error(e);
    return new NextResponse(
      &quot;Unable to generate a plan. Please try again later!&quot;
    );
  }
}
</code></pre>
<p>In the event of a guardrail breach, metrics containing detail of the breach are sent to Elastic, similar to the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/openlit-guardrail-breach-example.png" alt="Elastic guardrail breach example" /></p>
<p>Of course we can leverage dashboards to visualize trends of guardrail breaches, including metrics such as volumes by category, as shown in the below example (with the corresponding NDJSON available <a href="https://github.com/carlyrichmond/observing-ai-agents/blob/main/dashboard/llm-issues-dashboard.ndjson">here</a>):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/elastic-guardrail-dashboard.png" alt="Elastic LLM issues dashboard" /></p>
<p>We can also act on these breaches, notifying relevant teams through the <a href="https://www.elastic.co/docs/explore-analyze/alerting">available alerting tools</a>. Alerts should be triggered based on the severity and classification of the detected issue: for example, mentions of violence, illegal themes, or injection attacks may warrant immediate alerts, whereas minor inaccuracies may not. Guardrail breaches can also trigger automated responses, such as blocking the response from being sent to the user, or warning the user that their request has been flagged for review and triggering a human-in-the-loop response from the relevant teams.</p>
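<p>The escalation logic described above can be sketched as a small policy function mapping a guardrail verdict to an action. The verdict shape, category names and thresholds below are illustrative assumptions, not OpenLit's exact output:</p>

```typescript
// Illustrative verdict shape; real detectors return richer data.
interface GuardrailVerdict {
  category: string; // e.g. "prompt_injection", "violence", "minor_inaccuracy"
  score: number;    // 0 (benign) to 1 (severe)
}

type Action = "block" | "flag_for_review" | "allow";

// Severe categories block as soon as the detector is reasonably confident;
// anything else only escalates for human review at a high score.
const SEVERE_CATEGORIES = new Set(["prompt_injection", "violence", "illegal"]);

function resolveAction(verdict: GuardrailVerdict): Action {
  if (SEVERE_CATEGORIES.has(verdict.category) && verdict.score >= 0.5) {
    return "block";
  }
  if (verdict.score >= 0.8) {
    return "flag_for_review";
  }
  return "allow";
}
```

<p>Keeping the policy in one place like this makes the blocking thresholds reviewable and testable, rather than scattering them across handlers.</p>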
<h2>Conclusion</h2>
<p>AI agents are becoming more autonomous, more powerful, and more unpredictable. For this reason, it's important to introduce monitoring telemetry as early as possible in the development process and in organizational cultures. This article helps you understand how monitoring AI agents is different and how to do it using OpenLit to generate OpenTelemetry signals to send to Elastic. Check out the code <a href="https://github.com/carlyrichmond/observing-ai-agents">here</a> and start monitoring your AI agents in production.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://github.com/carlyrichmond/observing-ai-agents">Observing AI Agents Example</a></li>
<li><a href="https://docs.openlit.io/latest/sdk/overview">OpenLit SDK Documentation</a></li>
<li><a href="https://arxiv.org/pdf/2306.05685">Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | Zheng et al. 2023</a></li>
<li><a href="https://www.anthropic.com/research/agentic-misalignment">Agentic Misalignment: How LLMs could be insider threats | Anthropic</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-observability-web-agents-openlit/travel-planner-blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Observability for Amazon MQ with Elastic: Demystifying Messaging Flows with Real-Time Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/amazonmq-observability-rabbitmq-integration</link>
            <guid isPermaLink="false">amazonmq-observability-rabbitmq-integration</guid>
            <pubDate>Fri, 02 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[RabbitMQ, managed by Amazon MQ, enables asynchronous communication in distributed architectures but introduces operational risks such as retries, processing delays, and queue backlogs. Elastic’s Amazon MQ integration for RabbitMQ delivers deep observability into broker health, queue performance, message flow, and resource usage through Amazon CloudWatch metrics and logs. This blog outlines key operational risks associated with RabbitMQ and explains how Elastic observability helps maintain system reliability and optimize message delivery at scale.]]></description>
            <content:encoded><![CDATA[<h1>Observability for Amazon MQ with Elastic: Demystifying Messaging Flows with Real-Time Insights</h1>
<h2>Managing the Hidden Complexity of Message-Driven Architectures</h2>
<p>Amazon MQ is a managed message broker service for <a href="http://activemq.apache.org/">Apache ActiveMQ</a> Classic and <a href="https://www.rabbitmq.com/">RabbitMQ</a> that handles the setup, operation, and maintenance of message brokers. Messaging systems like RabbitMQ, managed by <a href="https://aws.amazon.com/amazon-mq/">Amazon MQ</a>, are pivotal in modern decoupled, event-driven applications. By serving as an intermediary between services, RabbitMQ facilitates asynchronous communication through message queuing, routing, and reliable delivery, making it an ideal fit for microservices, real-time pipelines, and event-driven architectures. However, this flexibility introduces operational challenges, such as retries, processing delays, consumer failures, and queue backlogs, which can gradually impact downstream performance and system reliability.</p>
<p>With Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a>, users gain deep visibility into message flow patterns, queue performance, and consumer health. This integration allows for the proactive detection of bottlenecks, helps optimize system behaviour, and ensures reliable message delivery at scale.</p>
<p>In this blog, we'll dive into the operational challenges of RabbitMQ in modern architectures, while also examining the common gaps and strategies for overcoming them.</p>
<h2>Why Observability for RabbitMQ on Amazon MQ Matters</h2>
<p>RabbitMQ brokers are integral to distributed systems, handling tasks ranging from order processing to payment workflows and notification delivery. Any disruption can cascade into significant downstream issues. Observability into RabbitMQ helps answer critical operational questions like:</p>
<ul>
<li>Is CPU and memory utilization increasing over time?</li>
<li>What are the trends in the message publish and confirmation rates?</li>
<li>Are consumers failing to acknowledge messages?</li>
<li>Which queues are experiencing abnormal growth?</li>
<li>Is the number of messages being dead-lettered increasing over time?</li>
</ul>
<h2>Enhanced Observability with Amazon MQ Integration</h2>
<p>Elastic provides a dedicated <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> for RabbitMQ that utilizes Amazon CloudWatch metrics and logs to deliver comprehensive observability data. This integration enables the ingestion of metrics related to connections, nodes, queues, exchanges, and system logs.</p>
<p>By deploying <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> with this integration, users can monitor:</p>
<ul>
<li><strong>Queue performance and Dead-letter queue (DLQ) metrics</strong> include total message count (<code>MessageCount.max</code>), messages ready for delivery (<code>MessageReadyCount.max</code>), and unacknowledged messages (<code>MessageUnacknowledgedCount.max</code>). The <code>MessageCount.max</code> metric tracks the total number of messages in a queue, including those that have been dead-lettered; monitoring this over time can help identify trends in message accumulation, which may suggest issues leading to dead-lettering.</li>
<li><strong>Consumer behaviour</strong> through metrics like consumer count (<code>ConsumerCount.max</code>) and acknowledgement rate (<code>AckRate.max</code>), which help identify underperforming consumers or potential backlogs.</li>
<li><strong>Messaging throughput</strong> by tracking publish (<code>PublishRate.max</code>), confirm (<code>ConfirmRate.max</code>), and acknowledgement rates in real time. These are crucial for understanding application messaging patterns and flow.</li>
<li><strong>Broker and node-level health,</strong> including memory usage (<code>RabbitMQMemUsed.max</code>), CPU utilization (<code>SystemCpuUtilization.max</code>), disk availability (<code>RabbitMQDiskFree.min</code>), and file descriptor usage (<code>RabbitMQFdUsed.max</code>). These indicators are essential for diagnosing resource saturation and avoiding service disruption.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-dashboard-overview.png" alt="" /></p>
<h2>Integrating Amazon MQ Metrics into Elastic Observability</h2>
<p>Elastic's Amazon MQ integration facilitates the ingestion of CloudWatch metrics and logs into Elastic Observability, delivering near real-time insights into RabbitMQ. The prebuilt Amazon MQ dashboard visualizes this data, providing a centralized view of broker health, messaging activity, and resource usage, helping users quickly detect and resolve issues. Elastic's <a href="https://www.elastic.co/docs/solutions/observability/incident-management/alerting">alerting</a> for Observability enables proactive notifications based on custom conditions, while its <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">SLO</a> capabilities allow users to define and track key performance targets, strengthening system reliability and service commitments. </p>
<p>Elastic brings together logs and metrics from Amazon MQ alongside data from a wide range of other services and applications, whether running in AWS, on-premises, or across multi-cloud environments, offering unified observability from a single platform.</p>
<h3>Prerequisites</h3>
<p>To follow along, ensure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.16.5 or higher. Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
</ul>
<h2>Architecture</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/rabbitmq_lambda_messageflow.png" alt="" /></p>
<h2>Tracing Audit Flows from RabbitMQ to AWS Lambda</h2>
<p>Consider a financial audit trail use case, where every user action, such as a funds transfer, is published to RabbitMQ. A Python-based AWS Lambda function consumes these messages, deduplicates them using the <strong>id</strong> field, and logs structured audit events for downstream analysis.</p>
<p>Sample payload sent through RabbitMQ:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;txn-849302&quot;,
  &quot;type&quot;: &quot;audit&quot;,
  &quot;payload&quot;: {
    &quot;user_id&quot;: &quot;u-10245&quot;,
    &quot;event&quot;: &quot;funds.transfer&quot;,
    &quot;amount&quot;: 1200.75,
    &quot;currency&quot;: &quot;USD&quot;,
    &quot;timestamp&quot;: &quot;T14:20:15Z&quot;,
    &quot;ip&quot;: &quot;192.168.0.8&quot;,
    &quot;location&quot;: &quot;New York, USA&quot;
  }
}
</code></pre>
<p>You can now correlate message publishing activity from RabbitMQ with AWS Lambda invocation logs, track processing latency, and configure alerts for conditions like drops in consumer throughput or an unexpected surge in RabbitMQ queue depth.</p>
<h3>AWS Lambda Function: Processing RabbitMQ Messages</h3>
<p>This Python-based AWS Lambda function processes audit events received from RabbitMQ. It deduplicates messages based on the <strong>id</strong> field and logs structured event data for downstream analysis or compliance. Save the code below in a file named <strong>app.py</strong>.</p>
<pre><code class="language-python">import json
import logging
import base64
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# In-memory set to track processed message IDs for deduplication
processed_ids = set()
def lambda_handler(event, context):
    logger.info(&quot;Lambda triggered by RabbitMQ event&quot;)
    if 'rmqMessagesByQueue' not in event:
        logger.warning(&quot;Invalid event: missing 'rmqMessagesByQueue'&quot;)
        return {'statusCode': 400, 'body': 'Invalid RabbitMQ event'}
    for queue_name, messages in event['rmqMessagesByQueue'].items():
        logger.info(f&quot;Processing queue: {queue_name}, Messages count: {len(messages)}&quot;)
        for msg in messages:
            try:
                raw_data = msg['data']
                decoded_json = base64.b64decode(raw_data).decode('utf-8')
                message = json.loads(decoded_json)
                logger.info(f&quot;Decoded message: {json.dumps(message)}&quot;)
                message_id = message.get('id')
                if not message_id:
                    logger.warning(&quot;Message missing 'id', skipping.&quot;)
                    continue
                if message_id in processed_ids:
                    logger.warning(f&quot;Duplicate message detected: {message_id}&quot;)
                    continue
                payload = message.get('payload', {})
                logger.info(f&quot;Processing message ID: {message_id}&quot;)
                logger.info(f&quot;Event Type: {message.get('type')}&quot;)
                logger.info(f&quot;User ID: {payload.get('user_id')}&quot;)
                logger.info(f&quot;Event: {payload.get('event')}&quot;)
                logger.info(f&quot;Amount: {payload.get('amount')} {payload.get('currency')}&quot;)
                logger.info(f&quot;Timestamp: {payload.get('timestamp')}&quot;)
                logger.info(f&quot;IP Address: {payload.get('ip')}&quot;)
                logger.info(f&quot;Location: {payload.get('location')}&quot;)
                processed_ids.add(message_id)
            except Exception as e:
                logger.error(f&quot;Error processing message: {str(e)}&quot;)
    return {'statusCode': 200, 'body': 'Messages processed successfully'}

</code></pre>
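<p>Before wiring the function to a real broker, you can sanity-check the event handling locally. The sketch below builds a synthetic Amazon MQ event (the queue key format <code>myQueue::/</code> and the single-message shape are assumptions modeled on the handler above, not output captured from AWS) and verifies that a base64-encoded payload round-trips through the same decode steps the handler performs:</p>

```python
import base64
import json

# Build a synthetic event in the shape lambda_handler expects:
# queue names map to lists of messages whose "data" field is
# base64-encoded JSON.
sample = {"id": "txn-1", "type": "audit", "payload": {"user_id": "u-1"}}
event = {
    "rmqMessagesByQueue": {
        "myQueue::/": [
            {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
        ]
    }
}

# Replicate the handler's decode step and confirm the payload round-trips.
msg = event["rmqMessagesByQueue"]["myQueue::/"][0]
decoded = json.loads(base64.b64decode(msg["data"]).decode("utf-8"))
assert decoded == sample
```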
<h3>Setting up AWS Secrets Manager</h3>
<p>To securely store and manage your RabbitMQ credentials, use AWS Secrets Manager.</p>
<ol>
<li>
<p><strong>Create a New Secret:</strong></p>
<ul>
<li>Navigate to the <a href="https://console.aws.amazon.com/secretsmanager/">AWS Secrets Manager console</a>.</li>
<li>Choose <strong>Store a new secret</strong>.</li>
<li>Select <strong>Other type of secret</strong>.</li>
<li>Enter the following key-value pairs:
<ul>
<li><code>username</code>: Your RabbitMQ username</li>
<li><code>password</code>: Your RabbitMQ password</li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Configure the Secret:</strong></p>
<ul>
<li>Provide a meaningful name, such as <code>RabbitMQAccess</code>.</li>
<li>Optionally, add tags and set rotation if needed.</li>
</ul>
</li>
<li>
<p><strong>Store the Secret:</strong></p>
<ul>
<li>Review the settings and store the secret. Note the ARN of the secret you have created.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/aws-secret-manager-configuration.png" alt="" /></li>
</ul>
</li>
</ol>
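<p>From application code, the secret can be read back with the AWS SDK. The helper below is a minimal sketch: it accepts any Secrets Manager client (e.g. <code>boto3.client("secretsmanager")</code>) so it can also be exercised against a stub in tests; the secret name <code>RabbitMQAccess</code> matches the one created above, and the caller's IAM role is assumed to allow <code>secretsmanager:GetSecretValue</code>:</p>

```python
import json

def get_rabbitmq_credentials(sm_client, secret_id="RabbitMQAccess"):
    """Return (username, password) from a Secrets Manager secret.

    sm_client is a Secrets Manager client, e.g.
    boto3.client("secretsmanager", region_name="us-east-1").
    The secret is expected to hold the username/password keys stored above.
    """
    resp = sm_client.get_secret_value(SecretId=secret_id)
    creds = json.loads(resp["SecretString"])
    return creds["username"], creds["password"]
```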
<h3>Setting up Amazon MQ for RabbitMQ</h3>
<p>To get started with RabbitMQ on Amazon MQ, follow these steps to set up your broker.</p>
<ul>
<li>Open the <a href="https://console.aws.amazon.com/amazonmq/">Amazon MQ console</a>.</li>
<li>Create a new broker with the <strong>RabbitMQ</strong> engine.</li>
<li>Choose your preferred deployment option: <strong>single-instance</strong> or <strong>clustered</strong>.</li>
<li>Use the same <strong>username</strong> and <strong>password</strong> that you previously stored in <strong>AWS Secrets Manager</strong>.</li>
<li>Under <strong>Additional settings</strong>, enable <strong>CloudWatch Logs</strong> for observability.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-cloudwatch-enable.png" alt="" /></li>
<li>Configure access and security settings, ensuring that the broker is accessible to your AWS Lambda function.</li>
</ul>
<ul>
<li>
<p>After the broker is created, note the following important details:</p>
<ul>
<li>ARN of the RabbitMQ broker.</li>
<li>RabbitMQ web console URL.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-configuration-summary.png" alt="" /></li>
</ul>
</li>
<li>
<p>You’ll need the RabbitMQ log group ARN to set up Elastic’s Amazon MQ integration for RabbitMQ. Follow these steps to locate it:</p>
<ul>
<li>Go to the <strong>General – Enabled Logs</strong> section of the broker. </li>
<li>Copy the <strong>CloudWatch log group ARN</strong>.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-loggroup-arn.png" alt="" /></li>
</ul>
</li>
</ul>
<h3>Create a RabbitMQ Queue</h3>
<p>Now that the RabbitMQ broker is configured, use the management console to create a queue where messages will be published.</p>
<ul>
<li>Access the RabbitMQ management console using the web console URL.</li>
<li>Create a new queue (example: <strong>myQueue</strong>) to receive messages.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/rabbitmq-create-queue.png" alt="" /></li>
</ul>
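<p>If you prefer to script this step instead of using the console, the RabbitMQ management HTTP API exposes queue creation as a <code>PUT</code> against <code>/api/queues/&lt;vhost&gt;/&lt;name&gt;</code>. The sketch below uses only the Python standard library; the console URL and credentials are the ones noted earlier, and the default vhost <code>/</code> is percent-encoded as <code>%2F</code> in the API path:</p>

```python
import base64
import json
import urllib.parse
import urllib.request

def queue_api_url(console_url, vhost="/", queue="myQueue"):
    """Management-API URL for a queue, e.g. .../api/queues/%2F/myQueue."""
    encoded_vhost = urllib.parse.quote(vhost, safe="")
    return f"{console_url.rstrip('/')}/api/queues/{encoded_vhost}/{queue}"

def create_queue(console_url, username, password, queue="myQueue"):
    """Declare a durable queue via the management API."""
    req = urllib.request.Request(
        queue_api_url(console_url, queue=queue),
        data=json.dumps({"durable": True}).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    urllib.request.urlopen(req)
```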
<h3>Build and deploy the AWS Lambda function</h3>
<p>In this section, we'll set up the Lambda function using AWS SAM, add the message processing logic, and deploy it to AWS. This Lambda function will be responsible for consuming messages from RabbitMQ and logging audit events.</p>
<p>Before continuing, make sure you have completed the following prerequisites.</p>
<ul>
<li>
<p><a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/prerequisites.html">AWS SAM prerequisites</a></p>
</li>
<li>
<p><a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html">Install the AWS SAM CLI</a></p>
</li>
</ul>
<p>Next, follow the steps outlined below to continue with the setup.</p>
<ol>
<li>In your command line, run the command <code>sam init</code> from a directory of your choice.</li>
<li>The AWS SAM CLI will walk you through the setup.
<ul>
<li>Select <strong>AWS Quick Start Templates</strong>.</li>
<li>Choose the <strong>Hello World Example</strong>.</li>
<li>Use the <strong>Python</strong> runtime and <strong>zip</strong> package type.</li>
<li>Proceed with the default options.</li>
<li>Name your application as <strong>sample-rabbitmq-app</strong>.</li>
<li>The AWS SAM CLI downloads your starting template and creates the application project directory structure.</li>
</ul>
</li>
<li>From your command line, move to the newly created sample-rabbitmq-app directory.
<ul>
<li>Replace the content of the <strong>hello_world/app.py</strong> file with the Lambda function code for RabbitMQ message processing shown earlier.</li>
<li>In the <strong>template.yaml</strong> file, update the file content with the values shown below.
<pre><code class="language-yaml">Resources:
  SampleRabbitMQApp:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: hello_world/
      Description: A starter AWS Lambda function.
      MemorySize: 128
      Timeout: 3
      Handler: app.lambda_handler
      Runtime: python3.10
      PackageType: Zip
      Policies:
        - Statement:
            - Effect: Allow
              Resource: '*'
              Action:
                - mq:DescribeBroker
                - secretsmanager:GetSecretValue
                - ec2:CreateNetworkInterface
                - ec2:DescribeNetworkInterfaces
                - ec2:DescribeVpcs
                - ec2:DeleteNetworkInterface
                - ec2:DescribeSubnets
                - ec2:DescribeSecurityGroups
    Events:
      MQEvent:
        Type: MQ
        Properties:
          Broker: &lt;ARN of the Broker&gt;
          Queues:
            - myQueue
          SourceAccessConfigurations:
            - Type: BASIC_AUTH
              URI: &lt;ARN of the secret&gt;
</code></pre></li>
</ul>
</li>
<li>Run the command <code>sam deploy --guided</code> and wait for the confirmation message. This deploys all of the resources.</li>
</ol>
<h3>Sending Audit Events to RabbitMQ and Triggering Lambda</h3>
<p>To test the end-to-end setup, simulate the flow by publishing audit event data into RabbitMQ using its web UI. Once the message is sent, it triggers the Lambda function. </p>
<ol>
<li>
<p>Navigate to the <a href="https://console.aws.amazon.com/amazon-mq/home">Amazon MQ console</a> and select your newly created broker.</p>
</li>
<li>
<p>Locate and open the RabbitMQ web console URL.<br />
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-webconsole-details.png" alt="" /></p>
</li>
<li>
<p>Under the <strong>Queues and Streams</strong> tab, select the target queue (example: <strong>myQueue</strong>).</p>
</li>
<li>
<p>Enter the message payload, and click <strong>Publish message</strong> to send it to the queue.<br />
Here’s a sample payload published via RabbitMQ:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;txn-849302&quot;,
  &quot;type&quot;: &quot;audit&quot;,
  &quot;payload&quot;: {
    &quot;user_id&quot;: &quot;u-10245&quot;,
    &quot;event&quot;: &quot;funds.transfer&quot;,
    &quot;amount&quot;: 1200.75,
    &quot;currency&quot;: &quot;USD&quot;,
    &quot;timestamp&quot;: &quot;T14:20:15Z&quot;,
    &quot;ip&quot;: &quot;192.168.0.8&quot;,
    &quot;location&quot;: &quot;New York, USA&quot;
  }
}
</code></pre>
</li>
<li>
<p>Navigate to the AWS Lambda function created earlier.</p>
</li>
<li>
<p>Under the <strong>Monitor</strong> tab, click <strong>View CloudWatch logs</strong>.</p>
</li>
<li>
<p>Check the latest log stream to confirm that the Lambda was triggered by Amazon MQ and that the message was processed successfully.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-lambda-logstream.png" alt="" /></p>
</li>
</ol>
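<p>Publishing through the web UI works for one-off tests, but repeated runs are easier to script. The RabbitMQ management API's publish endpoint routes a message through the default exchange (<code>amq.default</code>) straight to the queue named by <code>routing_key</code>; the sketch below (standard library only, reusing the console URL and credentials from earlier, with no error handling) builds and sends that request:</p>

```python
import base64
import json
import urllib.request

def build_publish_body(message, routing_key="myQueue"):
    """JSON body for POST /api/exchanges/%2F/amq.default/publish."""
    return json.dumps({
        "properties": {},
        "routing_key": routing_key,
        "payload": json.dumps(message),
        "payload_encoding": "string",
    }).encode("utf-8")

def publish_audit_event(console_url, username, password, message, queue="myQueue"):
    """Publish one audit-event message to the given queue."""
    url = f"{console_url.rstrip('/')}/api/exchanges/%2F/amq.default/publish"
    req = urllib.request.Request(
        url,
        data=build_publish_body(message, routing_key=queue),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    urllib.request.urlopen(req)
```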
<h1>Configuring Amazon MQ integration for Metrics and Logs collection</h1>
<p>Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> simplifies the collection of logs and metrics from RabbitMQ brokers managed by Amazon MQ. Logs are ingested via <strong>Amazon CloudWatch Logs</strong>, while metrics are fetched from the specified AWS region at a defined interval.</p>
<p>Elastic provides a default configuration for metrics collection. You can accept these defaults or adjust settings such as the <strong>Collection Period</strong> to better fit your needs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-metrics-configuration.png" alt="" /></p>
<p>To enable the collection of logs:</p>
<ol>
<li>Navigate to the <a href="https://console.aws.amazon.com/amazon-mq/home">Amazon MQ console</a> and select the newly created broker.</li>
<li>Click the <strong>Logs</strong> hyperlink under the <strong>General – Enabled Logs</strong> section to open the detailed log settings page.</li>
<li>From this page, copy the <strong>CloudWatch log group ARN</strong>.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-loggroup-arn.png" alt="" /></li>
<li>In <strong>Elastic</strong>, set up the <strong>Amazon MQ integration</strong> and paste the CloudWatch log group ARN.
<img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-logs-configuration.png" alt="" /></li>
<li><strong>Accept Defaults or Customize Settings</strong> – Elastic provides a <strong>default configuration</strong> for logs collection. You can accept these defaults or adjust settings such as <strong>collection intervals</strong> to better fit your needs.</li>
</ol>
<h3>Visualizing RabbitMQ Workloads with the Pre-Built Amazon MQ Dashboard</h3>
<p>You can access the RabbitMQ dashboard in either of two ways:</p>
<ol>
<li>
<p><strong>Dashboard menu</strong>: Select the Dashboard menu option in Elastic and search for <strong>[Amazon MQ] RabbitMQ Overview</strong> to open the dashboard.</p>
</li>
<li>
<p><strong>Integrations menu</strong>: Open the <strong>Integrations</strong> menu in Elastic, select <strong>Amazon MQ</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Amazon MQ] RabbitMQ Overview</strong> from the dashboard assets.</p>
</li>
</ol>
<p>The Amazon MQ RabbitMQ dashboard in the Elastic integration delivers a comprehensive overview of broker health and messaging activity. It provides real-time insights into broker resource utilization, queue and topic performance, connection trends, and messaging throughput. The dashboard helps users track system behavior, detect performance bottlenecks, and ensure reliable message delivery across distributed applications.</p>
<h4>Broker Metrics</h4>
<p>This section provides a centralized view of the overall health and performance of the RabbitMQ broker on Amazon MQ. The visualizations highlight the number of configured exchanges and queues, active broker connections, producers, consumers, and total messages in flight. System-level metrics such as CPU utilization, memory consumption, and free disk space help assess whether the broker has sufficient resources to handle current workloads.</p>
<p>Message flow metrics such as publish rate, confirmation rate, and acknowledgement rate are displayed to provide visibility into how messages are processed through the broker. Monitoring trends in these values helps detect message delivery issues, throughput degradation, or potential saturation of the broker under load.</p>
<h4>Node Metrics</h4>
<p>Node-level visibility helps identify resource imbalances across nodes in clustered RabbitMQ setups. This section includes per-node CPU usage, memory consumption, and available disk space, offering insight into the underlying infrastructure's ability to support broker operations.</p>
<h4>Queue Metrics</h4>
<p>Queue-specific insights are critical for understanding message delivery patterns and backlog conditions. This section details total messages, ready messages, and unacknowledged messages, segmented by broker, virtual host, and queue.</p>
<p>By observing how these counts change over time, users can identify slow consumers, message build-ups, or delivery issues that may affect application performance or lead to dropped messages under pressure.</p>
<h4>Logs</h4>
<p>This section displays log level, process ID, and raw message content. These logs provide immediate visibility into events such as connection failures, resource thresholds being hit, or unexpected queue behaviors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-rabbitmq-dashboard.png" alt="" /></p>
<h3>Detecting Queue Backlogs with Alerting Rules</h3>
<p>Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/incident-management/alerting">alerting</a> framework allows you to define rules that monitor critical RabbitMQ metrics and automatically trigger actions when specific thresholds are breached.</p>
<h4>Alert: Queue Backlog (Message Ready or Unacknowledged Messages)</h4>
<p>This alert helps detect queue backlog in Amazon MQ by evaluating two metrics:</p>
<ul>
<li><code>MessageUnacknowledgedCount.max</code></li>
<li><code>MessageReadyCount.max</code></li>
</ul>
<p>The alert is triggered if either condition persists for more than <strong>10 minutes</strong>:</p>
<ul>
<li><code>MessageUnacknowledgedCount.max</code> exceeds <strong>5,000</strong></li>
<li><code>MessageReadyCount.max</code> exceeds <strong>7,000</strong></li>
</ul>
<p>These thresholds should be adjusted based on typical message volume and consumer throughput. Sustained high values can indicate that consumers are not keeping up or that message delivery pipelines are congested, which, if not addressed, can result in processing delays or dropped messages.</p>
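<p>Conceptually, the rule's trigger condition reduces to a simple check over the two metric maxima (the 10-minute persistence requirement is handled by the rule's evaluation window, not shown here). The function below is an illustrative sketch using the example thresholds from above, which you should tune to your workload:</p>

```python
def queue_backlog_breached(unacked_max, ready_max,
                           unacked_threshold=5000, ready_threshold=7000):
    """Fires when either metric's max breaches its threshold:
    MessageUnacknowledgedCount.max > 5,000 or MessageReadyCount.max > 7,000."""
    return unacked_max > unacked_threshold or ready_max > ready_threshold
```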
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-alert-configuration.png" alt="" /></p>
<h3>Tracking Resource Utilization to Maintain RabbitMQ Performance</h3>
<p>Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/incident-management/service-level-objectives-slos">service-level objective (SLO)</a> capabilities allow you to define and monitor performance targets using key indicators like latency, availability, and error rates. Once configured, Elastic continuously evaluates these SLOs in real time, offering intuitive dashboards, alerts for threshold violations, and insights into error budget consumption. This enables teams to stay ahead of issues, ensuring service reliability and consistent performance.</p>
<h4>SLO: Node Resource Health (CPU, Memory, Disk)</h4>
<p>This SLO focuses on ensuring RabbitMQ brokers and nodes have sufficient resources to process messages without performance degradation. It tracks CPU, memory, and disk usage across RabbitMQ brokers and nodes to prevent resource exhaustion that could lead to service interruptions.</p>
<p><strong>Target thresholds:</strong></p>
<ul>
<li><code>SystemCpuUtilization.max</code> remains below <strong>85%</strong> for <strong>99%</strong> of the time.</li>
<li><code>RabbitMQMemUsed.max</code> remains below <strong>80%</strong> of <code>RabbitMQMemLimit.max</code> for <strong>99%</strong> of the time.</li>
<li><code>RabbitMQDiskFree.min</code> remains above <strong>25%</strong> of <code>RabbitMQDiskFreeLimit.max</code> for <strong>99%</strong> of the time.</li>
</ul>
<p>Sustained high values in CPU or memory usage can signal resource contention, which may result in slower message processing or downtime. Low disk availability may cause the broker to stop accepting messages, risking message loss. These thresholds are designed to catch early signs of resource saturation and ensure smooth, uninterrupted message flow across RabbitMQ deployments.</p>
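<p>As a rough sketch of how such an SLI is computed, the function below takes a series of metric samples (the field names mirror the CloudWatch metrics listed above; the exact evaluation Elastic performs is not shown here) and returns the fraction of samples in which all three resources are healthy. The SLO targets that fraction staying at or above 0.99:</p>

```python
def node_resource_sli(samples):
    """Fraction of samples where CPU, memory, and disk are all within target.

    Each sample is a dict of CloudWatch metric values for one interval.
    """
    def healthy(s):
        return (s["SystemCpuUtilization"] < 85
                and s["RabbitMQMemUsed"] < 0.80 * s["RabbitMQMemLimit"]
                and s["RabbitMQDiskFree"] > 0.25 * s["RabbitMQDiskFreeLimit"])
    return sum(healthy(s) for s in samples) / len(samples)
```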
<p><img src="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/amazonmq-slo-configuration.png" alt="" /></p>
<h2>Conclusion</h2>
<p>As RabbitMQ-based messaging architectures scale and become more complex, the need for in-depth visibility into system performance and potential issues deepens. Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_mq">Amazon MQ integration</a> brings that visibility front and center—helping you go beyond basic health checks to understand real-time messaging throughput, queue backlog trends, and resource saturation across your brokers and consumers.</p>
<p>By leveraging the prebuilt dashboards, configuring alerts and SLOs, you can proactively detect anomalies, fine-tune consumer performance, and ensure reliable delivery across your event-driven applications.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/amazonmq-observability-rabbitmq-integration/AmazonMQ-observability-RabbitMQ.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Analyzing OpenTelemetry apps with Elastic AI Assistant and APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm</link>
            <guid isPermaLink="false">analyzing-opentelemetry-apps-elastic-ai-assistant-apm</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides native OpenTelemetry support, but analyzing applications logs, metrics, and traces can be daunting. Elastic Observability not only provides AIOps features but also an AI Assistant (co-pilot) to help get to MTTR faster.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry is rapidly becoming the most expansive project within the Cloud Native Computing Foundation (CNCF), boasting as many commits as Kubernetes and garnering widespread support from customers. Numerous companies are adopting OpenTelemetry and integrating it into their applications. Elastic® offers detailed <a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">guides</a> on implementing OpenTelemetry for applications. However, like many applications, pinpointing and resolving issues can be time-consuming.</p>
<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Elastic AI Assistant</a> significantly enhances the process, not only in identifying but also in resolving issues. This is further enhanced by Elastic’s new Service Level Objective (SLO) capability, allowing you to streamline your entire site reliability engineering (SRE) process from detecting potential issues to enhancing the overall customer experience.</p>
<p>In this blog, we will demonstrate how you, as an SRE, can detect issues in a service equipped with OpenTelemetry. We will explore problem identification using Elastic APM, Elastic’s AIOps capabilities, and the Elastic AI Assistant.</p>
<p>We will illustrate this using the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>, with a <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flag (cartService)</a> that is activated.</p>
<p>Our walkthrough will encompass two scenarios:</p>
<ol>
<li>
<p>When the SLO for cart service becomes noncompliant, we will analyze the error through Elastic APM. The Elastic AI Assistant will assist by providing a runbook and a GitHub issue to facilitate issue analysis.</p>
</li>
<li>
<p>Should the SLO for the cart service be noncompliant, we will examine the trace that indicates a high failure rate. We will employ AIOps for failure correlation and the AI Assistant to analyze logs and Kubernetes metrics directly from the Assistant.</p>
</li>
</ol>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</p>
</li>
<li>
<p>Additionally you will need to connect your AI Assistant to your favorite LLM. We used Azure OpenAI GPT-4.</p>
</li>
<li>
<p>We also ran the OpenTelemetry Demo on Kubernetes, specifically on GKE.</p>
</li>
</ul>
<h2>SLO noncompliance</h2>
<p>Elastic APM recently released the SLO (Service Level Objective) feature in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation</a>, or ones you define yourself. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>We set up two SLOs for cart service:</p>
<ul>
<li>
<p><strong>Availability SLO</strong> , which monitors its availability by ensuring that transactions succeed. We set up the feature flag in the OpenTelemetry application, which generates an error for EmptyCart transactions 10% of the time.</p>
</li>
<li>
<p><strong>Latency SLO</strong> to ensure transactions are not going below a specific latency, which will reduce customer experiences.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/image1.png" alt="1 - SLOs" /></p>
<p>Because of the OTel cartservice feature flag, the availability SLO is triggered, and within the SLO details, we see that over a seven-day period the availability is well below our 99.9% target, at 95.5%. Additionally, all of the available error budget has been exhausted.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/image2.png" alt="2 - cart service otel" /></p>
<p>With SLOs, you can easily identify when customer experience degrades, or catch potential service issues before they get worse.</p>
<h2>Scenario 1: Analyzing APM trace and logs with AI Assistant</h2>
<p>Once the SLO is found as non-compliant, we can dive into cart service to investigate in Elastic APM. The following walks through the set of steps you can take in Elastic APM and how to use the AI Assistant to analyze the issue:</p>
&lt;Video vidyardUuid=&quot;FSpw53JN9Xu32V1kLQCE8z&quot; /&gt;
<p>From the video, we can see that once in APM, we took the following steps.</p>
<ol>
<li>
<p>Investigated the trace EmptyCart, which was experiencing larger than normal failure rates.</p>
</li>
<li>
<p>The trace showed a significant number of failures, which also resulted in slightly larger latency.</p>
</li>
<li>
<p>We used AIOps failure correlation to identify the potential component causing the failure, which correlated to a field value of FailedPrecondition.</p>
</li>
<li>
<p>While filtering on that value and reviewing the logs, we still couldn’t understand what this meant.</p>
</li>
<li>
<p>This is where you can use Elastic’s AI Assistant to further your understanding of the issue.</p>
</li>
</ol>
<p>AI Assistant helped us analyze the following:</p>
<ol>
<li>
<p>It helped us understand what the log message meant and that it was related to the Redis connection failure issue.</p>
</li>
<li>
<p>Because we couldn’t connect to Redis, we asked the AI Assistant to give us the metrics for the Redis Kubernetes pods.</p>
</li>
<li>
<p>We learned there were two pods for Redis from the logs over the last two hours.</p>
</li>
<li>
<p>However, we also learned that the memory of one seems to be increasing.</p>
</li>
<li>
<p>It seems that Redis restarted (hence the second pod), and with this information we could dive deeper into what happened to Redis.</p>
</li>
</ol>
<p>You can see how quickly we could correlate a significant amount of information, logs, metrics, and traces through the AI Assistant and Elastic’s APM capabilities. We didn’t have to go through multiple screens to hunt down information.</p>
<h2>Scenario 2: Analyzing APM error with AI Assistant</h2>
<p>Once the SLO is found as noncompliant, we can dive into cart service to investigate in Elastic APM. The following walks through the set of steps you can take in Elastic APM and use the AI Assistant to analyze the issue:</p>
&lt;Video vidyardUuid=&quot;dVScqDxPJWCPCeGu8WMoCw&quot; /&gt;
<p>From the video, we can see that once in APM, we took the following steps:</p>
<ol>
<li>
<p>We noticed a specific error for the APM service.</p>
</li>
<li>
<p>We investigated this in the error tab, and while we see it’s an issue with connection to Redis, we still need more information.</p>
</li>
<li>
<p>The AI Assistant helps us understand the stacktrace and provides some potential causes for the error and ways to diagnose and resolve it.</p>
</li>
<li>
<p>We also asked it for a runbook, created by our SRE team, which gives us steps to work through this particular issue.</p>
</li>
</ol>
<p>But as you can see, AI Assistant provides us not only with information about the error message but also how to diagnose it and potentially resolve it with an internal runbook.</p>
<h2>Achieving operational excellence, optimal performance, and reliability</h2>
<p>We’ve shown how an OpenTelemetry instrumented application (OTel demo) can be analyzed using Elastic’s features, especially the AI Assistant coupled with Elastic APM, AIOps, and the latest SLO features. Elastic significantly streamlines the process of identifying and resolving issues within your applications.</p>
<p>Through our detailed walkthrough of two distinct scenarios, we have seen how Elastic APM and the AI Assistant can efficiently analyze and address noncompliance with SLOs in a cart service. The ability to quickly correlate information, logs, metrics, and traces through these tools not only saves time but also enhances the overall effectiveness of the troubleshooting process.</p>
<p>The use of Elastic's AI Assistant in these scenarios underscores the value of integrating advanced AI capabilities into operational workflows. It goes beyond simple error analysis, offering insights into potential causes and providing actionable solutions, sometimes even with customized runbooks. This integration of technology fundamentally changes how SREs approach problem-solving, making the process more efficient and less reliant on manual investigation.</p>
<p>Overall, the advancements in Elastic’s APM, AIOps capabilities, and the AI Assistant, particularly in handling OpenTelemetry data, represent a significant step forward in operational excellence. These tools enable SREs to not only react swiftly to emerging issues but also proactively manage and optimize the performance and reliability of their services, thereby ensuring an enhanced customer experience.</p>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/service-level-objectives-slos-logs-metrics">Build better Service Level Objectives (SLOs) from logs and metrics</a></li>
<li><a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-12-0">Elastic Observability 8.12: GA for AI Assistant, SLO, and Mobile APM support</a></li>
<li><a href="https://www.elastic.co/blog/native-opentelemetry-support-in-elastic-observability">Native OpenTelemetry support in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Context-aware insights using the Elastic AI Assistant for Observability</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/analyzing-opentelemetry-apps-elastic-ai-assistant-apm/ecs-otel-announcement-3.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using Anomaly Detection in Elastic Cloud to Identify Fraud]]></title>
            <link>https://www.elastic.co/observability-labs/blog/anomaly-detection-to-identify-fraud</link>
            <guid isPermaLink="false">anomaly-detection-to-identify-fraud</guid>
            <pubDate>Thu, 30 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of using Elastic Cloud’s anomaly detection to analyze example credit card transactions to detect potential fraud.]]></description>
            <content:encoded><![CDATA[<p><strong>Fraud detection is one of the most pressing challenges facing the financial services industry today.</strong> With the rise of digital payments, app-based banking, and online financial services, the volume and sophistication of fraudulent activity have grown significantly. In recent years, high-profile incidents like the <a href="https://www.justice.gov/usao-nj/pr/eighteen-people-charged-international-200-million-credit-card-fraud-scam">$200 million credit card fraud scheme</a> uncovered by the U.S. Department of Justice, which involved the creation of thousands of fake identities, have highlighted just how advanced fraud operations have become. These threats pose serious risks to financial institutions and their customers, making real-time fraud prevention an absolute necessity.</p>
<p>Elastic Cloud provides a powerful solution to meet these challenges. Its scalable, high-performance platform enables organizations to ingest and analyze all data types efficiently (from transactional data to customers’ personal information to claims data), delivering actionable insights that empower fraud prevention teams to detect anomalies and stop fraud before it occurs. From identifying unusual spending patterns to uncovering hidden threats, Elastic Cloud offers the speed and flexibility needed to safeguard assets in an increasingly digital economy.</p>
<p>In this blog, we’ll walk you through how Elastic Cloud can be used to identify fraud within credit card transactions—a key area of focus due to the high volume of data and the significant potential for fraudulent activity.</p>
<p>We’ll use a <code>Node.js</code> code example to generate an example set of credit card transactions. The generated data includes an anomaly similar to one that might result from a type of fraud known as “Card Testing,” in which a malicious actor runs transactions to check whether stolen credit card data can still be used. We’ll then import the transactions into an Elastic Cloud index and use Elastic Observability’s Anomaly Detection feature to analyze them for potential signs of “Card Testing.”</p>
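<p>To make the idea concrete before running the real example, here is a minimal, hypothetical sketch of how such data could be generated. The field names (<code>timestamp</code>, <code>IPAddress</code>, <code>cardNumber</code>, <code>amount</code>) and the counts are assumptions for illustration, not necessarily those used by the actual repository:</p>

```javascript
// Hypothetical sketch of generating card-testing-style transaction data.
// Field names and counts are illustrative assumptions, not the actual
// schema used by the observability-examples repository.
function generateTransactions(total = 1000, anomalousCount = 100) {
  const transactions = [];
  const randomIp = () =>
    Array.from({ length: 4 }, () => Math.floor(Math.random() * 254) + 1).join('.');

  // Normal traffic: each transaction comes from its own random IP.
  for (let i = 0; i < total - anomalousCount; i++) {
    transactions.push({
      timestamp: new Date(Date.now() - Math.random() * 86400000).toISOString(),
      IPAddress: randomIp(),
      cardNumber: String(4000000000000000 + Math.floor(Math.random() * 1e9)),
      amount: Number((Math.random() * 200).toFixed(2)),
    });
  }

  // "Card Testing" anomaly: one IP fires many small transactions
  // with many different card numbers in a short window.
  const fraudIp = randomIp();
  for (let i = 0; i < anomalousCount; i++) {
    transactions.push({
      timestamp: new Date(Date.now() - i * 1000).toISOString(),
      IPAddress: fraudIp,
      cardNumber: String(4000000000000000 + Math.floor(Math.random() * 1e9)),
      amount: 1.0, // card testers typically use tiny amounts
    });
  }
  return transactions;
}

// Each line of an NDJSON file is one standalone JSON document.
const ndjson = generateTransactions().map((t) => JSON.stringify(t)).join('\n');
```

<p>The resulting <code>ndjson</code> string could be written to a file with <code>fs.writeFileSync</code>. The key property is that one IP address accounts for a burst of small transactions across many card numbers, which is exactly the pattern a population analysis can surface.</p>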
<h2>Performing fraud detection with Elastic Cloud</h2>
<h3>Generate example credit card transactions</h3>
<p>Begin the process by using a terminal on your local computer to run a <a href="https://github.com/elastic/observability-examples/tree/main/anomaly-detection">Node.js code example</a> that will generate some example credit card transaction data.</p>
<p>Within your terminal window, run the following <strong>git clone</strong> command to clone the Github repository containing the Node.js code example:</p>
<pre><code>git clone https://github.com/elastic/observability-examples
</code></pre>
<p>Run the following <strong>cd</strong> command to change directory to the code example folder:</p>
<pre><code>cd observability-examples/anomaly-detection
</code></pre>
<p>Run the following <strong>npm install</strong> command to install the code example’s dependencies:</p>
<pre><code>npm install
</code></pre>
<p>Enter the following <strong>node</strong> command to run the code example, which generates a newline-delimited JSON (NDJSON) file named transactions.ndjson containing 1,000 example credit card transactions:</p>
<pre><code>node generate-transactions.js 
</code></pre>
<p>Now that we've got some credit card transaction data, we can import the transactions into Elastic Cloud to analyze the data.</p>
<h3>Import transactions data into an Elastic Cloud index</h3>
<p>We’ll start the import process in <a href="https://cloud.elastic.co/">Elastic Cloud</a> by creating an Elastic Serverless project in which we can import and analyze the transaction data. Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-project.png" alt="Create Elastic serverless project" /></p>
<p>Click <strong>Next</strong> in the <strong>Elastic for Observability</strong> project type tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project.png" alt="Create Elastic Observability serverless project" /></p>
<p>Click <strong>Create project</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-confirm.png" alt="Create Elastic Observability serverless project confirm" /></p>
<p>Click <strong>Continue</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/create-serverless-observability-project-continue.png" alt="Create Elastic Observability serverless project continue" /></p>
<p>Select the <strong>Application</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-application-data-import.png" alt="Select application data import" /></p>
<p>Enter the text “Upload” into the search box.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-search-for-upload-option.png" alt="Data import search for upload option" /></p>
<p>Select the <strong>Upload a file</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-tile.png" alt="Data import select upload tile" /></p>
<p>Click <strong>Select or drag and drop a file.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-upload-file-selector.png" alt="Data import select upload file selector" /></p>
<p>Select the <strong>transactions.ndjson</strong> file on your local computer that was created from running the Node.js code example in a previous step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file.png" alt="Data import select local file" /></p>
<p>Click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import.png" alt="Data import select local file import" /></p>
<p>Enter an <strong>Index name</strong> and click <strong>Import</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/data-import-select-local-file-import-enter-index.png" alt="Data import select local file import enter index" /></p>
<p>You’ll see a confirmation when the import process completes and the new index is successfully created.</p>
<h3>Use Anomaly Detection to analyze credit card transactions</h3>
<p>Anomaly Detection is a powerful tool that can analyze your data to find unusual patterns that would otherwise be difficult, if not impossible, to manually uncover. Now that we've got transaction data loaded into an index, let's use anomaly detection to analyze it. Click <strong>Machine learning</strong> in the navigation menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning.png" alt="Select-machine-learning" /></p>
<p>Select <strong>Anomaly Detection Jobs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-anomaly-detection-jobs.png" alt="Select machine learning anomaly detection jobs" /></p>
<p>Click <strong>Create anomaly detection job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-create-anomaly-detection-job.png" alt="Select machine learning create anomaly detection job" /></p>
<p>Select the Index containing the imported transactions as the data source of the anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-machine-learning-index-for-anomaly-detection-job.png" alt="Select machine learning index for anomaly-detection-job" /></p>
<p>As mentioned above, one form of credit card fraud is “Card Testing,” in which a malicious actor tests a batch of credit cards to determine whether they are still valid.</p>
<p>We can analyze the transaction data in our index to detect fraudulent “Card Testing” by using the anomaly detection <strong>Population</strong> wizard. Select the <strong>Population</strong> wizard tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-for-anomaly-detection-job.png" alt="Select population wizard for anomaly detection job" /></p>
<p>Click <strong>Use full data</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job.png" alt="Select population wizard use full data anomaly detection job" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/select-population-wizard-use-full-data-anomaly-detection-job-next.png" alt="Select population wizard use full data anomaly detection job next" /></p>
<p>Click the <strong>Population field</strong> selector and select <strong>IPAddress</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population.png" alt="Configure anomaly detection job population" /></p>
<p>Click the <strong>Add metric</strong> option.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-count.png" alt="Configure anomaly detection job population select count" /></p>
<p>Select <strong>Count(Event rate)</strong> as the metric to be added.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-select-add-metric.png" alt="Configure anomaly detection job population select add metric" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-create-next.png" alt="Configure anomaly detection job population create next" /></p>
<p>Enter a <strong>Job ID</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-enter-job-id-next.png" alt="Configure anomaly detection job population enter job id next" /></p>
<p>Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/configure-anomaly-detection-job-population-confirm-create-next.png" alt="Configure anomaly detection job population confirm create next" /></p>
<p>Click <strong>Create job</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-create-job.png" alt="Anomaly detection job create job" /></p>
<p>Once the job completes, click <strong>View results</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-view-results.png" alt="Anomaly detection job view results" /></p>
<p>You should see that an anomaly has been detected: a specific IP address has been identified performing an exceedingly high number of transactions with multiple credit cards on a single day.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected.png" alt="Anomaly detection job anomaly detected" /></p>
<p>You can click the red highlighted segments in the timeline to see more details that can help you evaluate possible remediation actions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-job-anomaly-detected-details.png" alt="Anomaly detection job anomaly detected details" /></p>
<p>In just a few steps, we were able to create a machine learning job that grouped all the transactions by the IP address that sent them and identified slices of time where one IP sent an unusually large number of requests compared to other IPs. Our fraudster!</p>
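<p>Under the hood, the wizard creates a machine learning job. For readers who prefer to script this step, an equivalent population job could be created directly through the Elasticsearch machine learning API, roughly as sketched below. The job name, <code>bucket_span</code>, and the <code>IPAddress</code> and <code>timestamp</code> field names are illustrative assumptions based on this example’s data:</p>

```json
PUT _ml/anomaly_detectors/transactions-card-testing
{
  "description": "Count transactions per IP, compared against the population",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "over_field_name": "IPAddress",
        "detector_description": "count over IPAddress"
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
```

<p>A datafeed pointing at the transactions index would still be needed to feed the job; the wizard configures both for you.</p>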
<h2>Take the next step in fraud prevention</h2>
<p>Fraud detection is an ongoing battle for organizations across industries, and the stakes are higher than ever. As digital payments, insurance claims, and online banking continue to dominate, the need for robust, real-time solutions to detect and prevent fraud is critical. In this blog, we demonstrated how Elastic Cloud empowers organizations to address this challenge effectively.</p>
<p>By using Elastic Cloud’s powerful capabilities, we ingested and analyzed a dataset of credit card transactions to detect potential fraudulent activity, such as “Card Testing.” From ingesting data into an Elastic index to leveraging machine learning-powered anomaly detection, this step-by-step process highlighted how Elastic Cloud can uncover hidden patterns and provide actionable insights to fraud prevention teams.</p>
<p>This example is just the beginning of what Elastic Cloud can do. Its scalable architecture, flexible tools, and powerful analytics make it an invaluable asset for any organization looking to protect their customers and assets from fraud. Whether it's detecting unusual spending patterns, identifying compromised accounts, or monitoring large-scale operations, Elastic Cloud provides the speed, precision, and efficiency financial services organizations need to stay one step ahead of fraudsters.</p>
<p>As fraud continues to evolve, so must the tools we use to combat it. Elastic Cloud gives you the power to meet these challenges head-on, enabling your institution to provide a safer, more secure experience for your customers.</p>
<p>Ready to explore more? View a <a href="https://elastic.navattic.com/fraud-detection">guided tour</a> of all the steps in this blog post or create an <a href="https://cloud.elastic.co/projects">Elastic Serverless Observability project</a> and start analyzing your data for anomalies today.</p>
<p><strong>Related resources:</strong></p>
<ul>
<li><strong>Overview:</strong> <a href="https://www.elastic.co/accelerate-fraud-detection-and-prevention-with-elastic">Accelerate fraud detection and prevention with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/elastic-ai-fraud-detection-financial-services">AI-powered fraud detection: Protecting financial services with Elastic</a></li>
<li><strong>Blog:</strong> <a href="https://www.elastic.co/blog/financial-services-fraud-generative-ai-attack-surface">Fraud in financial services: Leaning on generative AI to protect a rapidly expanding attack surface</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/anomaly-detection-to-identify-fraud/anomaly-detection-to-identify-fraud.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The antidote for index mapping exceptions: ignore_malformed]]></title>
            <link>https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed</link>
            <guid isPermaLink="false">antidote-index-mapping-exceptions-ignore-malformed</guid>
            <pubDate>Thu, 03 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[How an almost unknown setting called ignore_malformed can make the difference between dropping a document entirely if a single field is malformed or just ignoring that field and ingesting the document anyway.]]></description>
            <content:encoded><![CDATA[<p>In this article, I'll explain how the setting <em>ignore_malformed</em> can make the difference between a 100% dropping rate and a 100% success rate, even with ignoring some malformed fields.</p>
<p>As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.</p>
<p>During my experience, I have seen all sorts of issues, and users have very different requirements. But at some point, most of them encounter a very common problem with Elasticsearch: <em>index mapping exceptions</em>.</p>
<h2>How mappings work</h2>
<p>Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">index mapping</a> or simply <em>mapping</em>) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.</p>
<p>In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html">Dynamic Mapping</a>.</p>
<h2>What happens when data is malformed?</h2>
<p>Whether you specified a mapping upfront or Elasticsearch inferred one automatically, Elasticsearch will drop an entire document if even a single field doesn't match the index mapping, and return an error instead. This is not much different from what happens with SQL databases or other NoSQL data stores with inferred schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.</p>
<p>A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.</p>
<p>The above situation sounds catastrophic, but it's entirely real: I have seen it many times while on call for support or on <a href="https://discuss.elastic.co/latest">discuss.elastic.co</a>. It is even more likely to happen if you have user-generated documents, since you don't have full control over the quality of your data.</p>
<p>Luckily, there is a setting in Elasticsearch that not many people know about that solves exactly these problems. It has been available since <a href="https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html">Elasticsearch 2.0</a>. That is ancient history, given that the latest version of the stack at the time of writing is <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
<p>Let's now dive into how to use this Elasticsearch feature.</p>
<h2>A toy use case</h2>
<p>To make it easier to interact with Elasticsearch, I am going to use <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana® Dev Tools</a> in this tutorial.</p>
<p>The following examples are taken from the official documentation on <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/ignore-malformed.html#ignore-malformed">ignore_malformed</a>. I am here to expand on those examples by providing a few more details about what happens behind the scenes and on how to search for ignored fields. We are going to use the index name <em>my-index</em>, but feel free to change that to whatever you like.</p>
<p>First, we want to create an index mapping with two fields called <em>number_one</em> and <em>number_two</em>. Both fields have type <em>integer</em>, but only one of them has <strong>ignore_malformed</strong> set to true; the other inherits the default value <em>ignore_malformed: false</em> instead.</p>
<pre><code class="language-json">PUT my-index
{
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;number_one&quot;: {
        &quot;type&quot;: &quot;integer&quot;,
        &quot;ignore_malformed&quot;: true
      },
      &quot;number_two&quot;: {
        &quot;type&quot;: &quot;integer&quot;
      }
    }
  }
}
</code></pre>
<p>If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:</p>
<pre><code class="language-json">{
  &quot;acknowledged&quot;: true,
  &quot;shards_acknowledged&quot;: true,
  &quot;index&quot;: &quot;my-index&quot;
}
</code></pre>
<p>To double-check that the above mapping has been created correctly, we can query the newly created index with the command:</p>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get the following result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        }
      }
    }
  }
}
</code></pre>
<p>Now we can ingest two sample documents — both invalid:</p>
<pre><code class="language-bash">PUT my-index/_doc/1
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_one&quot;: &quot;foo&quot;
}

PUT my-index/_doc/2
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: &quot;foo&quot;
}
</code></pre>
<p>The document with <em>id=1</em> is correctly ingested, while the document with <em>id=2</em> fails with the following error. The only difference between the two documents is which field receives the sample string “foo” instead of an integer.</p>
<pre><code class="language-json">{
  &quot;error&quot;: {
    &quot;root_cause&quot;: [
      {
        &quot;type&quot;: &quot;document_parsing_exception&quot;,
        &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;
      }
    ],
    &quot;type&quot;: &quot;document_parsing_exception&quot;,
    &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;,
    &quot;caused_by&quot;: {
      &quot;type&quot;: &quot;number_format_exception&quot;,
      &quot;reason&quot;: &quot;For input string: \&quot;foo\&quot;&quot;
    }
  },
  &quot;status&quot;: 400
}
</code></pre>
<p>Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same: the entire document is not ingested because part of it doesn’t conform to the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem, and one we need a better way to handle.</p>
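<p>As an aside, the same partial-failure behavior shows up when ingesting with the Bulk API: the response reports <em>errors: true</em>, and each item carries its own status, so only the malformed documents are rejected. A minimal sketch with illustrative document IDs (shown for illustration only; running it adds extra documents to the index):</p>

```bash
POST my-index/_bulk
{ "index": { "_id": "10" } }
{ "text": "Some text value", "number_two": 5 }
{ "index": { "_id": "11" } }
{ "text": "Some text value", "number_two": "foo" }
```

<p>The first item is indexed; the second fails with a parsing error, just like the single-document request above.</p>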
<p>Now that at least one document has been ingested, you can try searching with the following query:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;fields&quot;: [
    &quot;*&quot;
  ]
}
</code></pre>
<p>Here, the parameter <em>fields</em> is required to show the values of those fields that have been ignored. More on this later.</p>
<p>From the result, you can see that only the first document (with <em>id=1</em>) has been ingested correctly while the second document (with <em>id=2</em>) has been completely dropped.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 14,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: null,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: null,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        },
        &quot;fields&quot;: {
          &quot;text&quot;: [&quot;Some text value&quot;],
          &quot;text.keyword&quot;: [&quot;Some text value&quot;]
        },
        &quot;ignored_field_values&quot;: {
          &quot;number_one&quot;: [&quot;foo&quot;]
        },
        &quot;sort&quot;: [&quot;1&quot;]
      }
    ]
  }
}
</code></pre>
<p>From the above JSON response, you will notice some things, such as:</p>
<ul>
<li>A new field called <strong>_ignored</strong> of type array with the list of all fields that were ignored while ingesting the document</li>
<li>A new field called <strong>ignored_field_values</strong> with a dictionary of the ignored fields and their values</li>
<li>The field called <strong>_source</strong> contains the original document unmodified. This is especially useful if you want to fix the problems with the mapping later.</li>
<li>The field called <strong>text</strong> was not present in the original mapping, but it is now included since Elasticsearch automatically inferred its type. In fact, if you query the mapping of the index <strong>my-index</strong> again via the command:</li>
</ul>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get this result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        },
        &quot;text&quot;: {
          &quot;type&quot;: &quot;text&quot;,
          &quot;fields&quot;: {
            &quot;keyword&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 256
            }
          }
        }
      }
    }
  }
}
</code></pre>
<p>Finally, if you ingest some valid documents like the following command:</p>
<pre><code class="language-bash">PUT my-index/_doc/3
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: 10
}
</code></pre>
<p>You can check how many documents have at least one ignored field with the following <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html">Exists query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>
<p>You can also see that, out of the two documents ingested (with <em>id=1</em> and <em>id=3</em>), only the document with <em>id=1</em> contains an ignored field.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 193,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: 1,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: 1,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        }
      }
    ]
  }
}
</code></pre>
<p>Alternatively, you can search for all documents that have a specific field being ignored with this <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html">Terms query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;terms&quot;: {
      &quot;_ignored&quot;: [ &quot;number_one&quot;]
    }
  }
}
</code></pre>
<p>The result in this case will be the same as the previous one, since only one ingested document has that exact field ignored.</p>
<h2>Conclusion</h2>
<p>Because we are big fans of this setting, we've enabled <strong>ignore_malformed</strong> by default for all Elastic integrations and in the <a href="https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/template-resources/src/main/resources/logs-settings.json#L13">default index template for logs data streams</a> as of 8.9.0. More information can be found in the official documentation for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.9/ignore-malformed.html">ignore_malformed</a>.</p>
<p>And since I personally work on this feature, I can reassure you that it is a game changer.</p>
<p>You can set <strong>ignore_malformed</strong> manually on any cluster before Elastic Stack 8.9.0, or you can use the defaults that we set for you starting with <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
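<p>For example, rather than setting the flag field by field, you can enable it for every field in an index (where the field type supports it) through the <em>index.mapping.ignore_malformed</em> index setting. A minimal sketch, with an illustrative index name:</p>

```json
PUT my-lenient-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "number_one": { "type": "integer" },
      "number_two": { "type": "integer" }
    }
  }
}
```

<p>With this setting in place, both fields tolerate malformed values without <em>ignore_malformed: true</em> being repeated on each field definition.</p>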
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/antidote-index-mapping-exceptions-ignore-malformed/illustration-stack-modernize-solutions-1689x980_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Achieving seamless API management: Introducing AWS API Gateway integration with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/api-management-aws-api-gateway-integration</link>
            <guid isPermaLink="false">api-management-aws-api-gateway-integration</guid>
            <pubDate>Thu, 14 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[With Elastic's AWS API Gateway integration, application owners and developers unlock the capability to proactively identify and resolve problems, fine-tune resource utilization, and provide extraordinary digital experiences to their users.]]></description>
            <content:encoded><![CDATA[<p><a href="https://aws.amazon.com/api-gateway/">AWS API Gateway</a> is a powerful service that redefines API management. It serves as a gateway for creating, deploying, and managing APIs, enabling businesses to establish seamless connections between different applications and services. With features like authentication, authorization, and traffic control, API Gateway ensures the security and reliability of API interactions.</p>
<p>In an era where APIs serve as the backbone of modern applications, having the means to maintain visibility and control over these vital components is absolutely essential. In this blog post, we dive deep into the comprehensive observability solution offered by Elastic&lt;sup&gt;®&lt;/sup&gt;, ensuring real-time visibility, advanced analytics, and actionable insights, empowering you to fine-tune your API Gateway for optimal performance.</p>
<p>For application owners and developers, this integration stands as a beacon of empowerment. Elastic's meticulous orchestration of the seamless merging of metrics, logs, and traces, built upon the robust <a href="https://www.elastic.co/elastic-stack">ELK Stack</a> foundation, equips them with potent real-time monitoring and analysis tools. These tools facilitate precise performance optimization and swift issue resolution, all within a secure and dependable environment.</p>
<p>With Elastic's AWS API Gateway integration, application owners and developers unlock the capability to proactively identify and resolve problems, fine-tune resource utilization, and provide extraordinary digital experiences to their users.</p>
<h2>Architecture</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-1-architecture.png" alt="architecture" /></p>
<h2>Why the AWS API Gateway integration matters</h2>
<p>API Gateway now serves as the foundation of contemporary application development, simplifying the process of creating and overseeing APIs on a large scale. Yet, monitoring and troubleshooting these API endpoints can be challenging. With the new AWS API Gateway integration introduced by Elastic, you can gain the following:</p>
<ul>
<li><strong>Unprecedented visibility:</strong> Monitor your API Gateway endpoints' performance, error rates, and usage metrics in real time. Get a comprehensive view of your APIs' health and performance.</li>
<li><strong>Log analysis:</strong> Dive deep into API Gateway logs with ease. Our integration enables you to collect and analyze logs for HTTP, REST, and WebSocket API types, helping you troubleshoot issues and gain valuable insights.</li>
<li><strong>Rapid issue resolution:</strong> Identify and resolve issues in your API Gateway workflows faster than ever. <a href="https://www.elastic.co/observability">Elastic Observability's</a> powerful search and analytics tools help you pinpoint problems with ease.</li>
<li><strong>Alerting and notifications:</strong> Set up custom alerts based on API Gateway metrics and logs. Receive notifications when performance thresholds are breached, ensuring that you can take action promptly.</li>
<li><strong>Optimized costs:</strong> Visualize resource usage and performance metrics for your API Gateway deployments. Use these insights to optimize resource allocation and reduce operational costs.</li>
<li><strong>Custom dashboards:</strong> Create customized dashboards and visualizations tailored to your API Gateway monitoring needs. Stay in control with real-time data and actionable insights.</li>
<li><strong>Effortless integration:</strong> Seamlessly connect your AWS API Gateway to our observability solution. Our intuitive setup process ensures a smooth integration experience.</li>
<li><strong>Scalability:</strong> Whether you have a handful of APIs or a complex API Gateway landscape, our observability solution scales to meet your needs. Grow confidently as your API infrastructure expands.</li>
</ul>
<h2>How to get started</h2>
<p>Getting started with the AWS API Gateway integration in Elastic Observability is straightforward. Here's a quick overview of the steps:</p>
<h3>Prerequisites and configurations</h3>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>
<p>You will need an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack and agent. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS API Gateway logging and analysis.</p>
</li>
<li>
<p>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</p>
</li>
<li>
<p>You can monitor API execution by using CloudWatch, which collects and processes raw data from API Gateway into readable, near-real-time metrics and logs. Details on the required steps to enable logging can be found <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-logging.html">here</a>.</p>
</li>
</ol>
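<p>For orientation, REST API execution logs land in a CloudWatch log group whose name follows a fixed convention, <code>API-Gateway-Execution-Logs_{rest-api-id}/{stage-name}</code>. The sketch below (the API ID and stage name are invented placeholders) simply derives that name:</p>
<pre><code class="language-python"># Build the CloudWatch log group name that API Gateway uses for
# REST API execution logs: API-Gateway-Execution-Logs_{api-id}/{stage}
def execution_log_group(api_id, stage):
    return f'API-Gateway-Execution-Logs_{api_id}/{stage}'

# Hypothetical API ID and stage, purely for illustration
print(execution_log_group('a1b2c3d4e5', 'prod'))
</code></pre>
<p>Knowing this convention helps when pointing the integration's CloudWatch log collection at the right log group.</p>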
<h3>Step 1. Create an account with Elastic</h3>
<p><a href="https://cloud.elastic.co/registration?fromURI=/home">Create an account on Elastic Cloud</a> by following the steps provided.</p>
<h3>Step 2. Add integration</h3>
<ul>
<li>Log in to your Elastic Cloud deployment.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-2-signup.png" alt="signup" /></p>
<ul>
<li>Click on <strong>Add integrations</strong>. You will be navigated to a catalog of supported integrations.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-3-welcome-home.png" alt="welcome home dashboard" /></p>
<ul>
<li>Search and select <strong>AWS API Gateway</strong>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-4-integrations.png" alt="Integration " /></p>
<h3>Step 3. Configure integration</h3>
<ul>
<li>Click on the <strong>Add AWS API Gateway</strong> button and provide the required details.</li>
<li>If this is your first time adding an AWS integration, you’ll need to <a href="https://www.elastic.co/guide/en/fleet/current/elastic-agent-installation.html">configure and enroll the Elastic Agent</a> on an AWS instance.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-5-aws-api-gateway.png" alt="aws-api-gateway" /></p>
<ul>
<li>Then complete the “Configure integration” form, providing all the necessary information required for agents to collect the AWS API Gateway metrics and associated CloudWatch logs. Multiple AWS credential methods are supported, including access keys, temporary security credentials, and IAM role ARN. Please see the <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/security-iam.html">IAM security and access documentation</a> for more details. You can choose to collect API Gateway metrics, API Gateway logs via S3, or API Gateway logs via CloudWatch.</li>
<li>Click on the <strong>Save and continue</strong> button at the bottom of the page.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-6-add-aws-integration.png" alt="add-aws-integration" /></p>
<h3>Step 4. Analyze and monitor</h3>
<p>Explore the data using the out-of-the-box dashboards available for the integration. Select <strong>Discover</strong> from the Elastic Cloud top-level menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-7-discover-dashboard.png" alt="discover-dashboard" /></p>
<p>Or, create custom dashboards, set up alerts, and gain actionable insights into your API Gateway service performance.</p>
<p>Here are key monitoring metrics collected through this integration across REST APIs, HTTP APIs, and WebSocket APIs:</p>
<ul>
<li><strong>4XXError</strong> – The number of client-side errors captured in a given period</li>
<li><strong>5XXError</strong> – The number of server-side errors captured in a given period</li>
<li><strong>CacheHitCount</strong> – The number of requests served from the API cache in a given period</li>
<li><strong>CacheMissCount</strong> – The number of requests served from the backend in a given period, when API caching is enabled</li>
<li><strong>Count</strong> – The total number of API requests in a given period</li>
<li><strong>IntegrationLatency</strong> – The time between when API Gateway relays a request to the backend and when it receives a response from the backend</li>
<li><strong>Latency</strong> – The time between when API Gateway receives a request from a client and when it returns a response to the client — the latency includes the integration latency and other API Gateway overhead</li>
<li><strong>DataProcessed</strong> – The amount of data processed in bytes</li>
<li><strong>ConnectCount</strong> – The number of messages sent to the $connect route integration</li>
<li><strong>MessageCount</strong> – The number of messages sent to the WebSocket API, either from or to the client</li>
</ul>
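<p>These metrics combine into useful derived indicators: subtracting IntegrationLatency from Latency approximates the overhead API Gateway itself adds, and comparing CacheHitCount to total cache lookups gives the hit ratio. A minimal sketch with invented sample values:</p>
<pre><code class="language-python"># Illustrative sample values; in practice these come from CloudWatch
latency_ms = 240.0              # Latency
integration_latency_ms = 210.0  # IntegrationLatency
count = 10_000                  # Count
errors_4xx = 120                # 4XXError
errors_5xx = 30                 # 5XXError
cache_hits = 6_500              # CacheHitCount
cache_misses = 3_500            # CacheMissCount

# Overhead added by API Gateway itself, beyond the backend call
gateway_overhead_ms = latency_ms - integration_latency_ms
# Fraction of requests that failed on either side
error_rate = (errors_4xx + errors_5xx) / count
# Fraction of cacheable requests served from the API cache
cache_hit_ratio = cache_hits / (cache_hits + cache_misses)

print('overhead_ms:', gateway_overhead_ms)
print('error_rate:', error_rate)
print('cache_hit_ratio:', cache_hit_ratio)
</code></pre>
<p>The same calculations can be expressed as Lens formulas when building custom dashboards in Kibana.</p>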
<p><img src="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/elastic-blog-8-graphs.png" alt="graphs" /></p>
<h2>Conclusion</h2>
<p>The native integration of AWS API Gateway into Elastic Observability marks a significant advancement in streamlining the monitoring and management of your APIs. With this integration, you gain access to a wealth of insights, real-time visibility, and powerful analytics tools, empowering you to optimize your API performance, enhance security, and troubleshoot with ease. Don't miss out on this opportunity to take your API management to the next level, ensuring your digital assets operate at their best, all while providing a seamless experience for your users. Embrace this integration, and stay at the forefront of API observability in the ever-evolving world of digital technology.</p>
<p>Visit our <a href="https://docs.elastic.co/integrations/aws/apigateway">documentation</a> to learn more about Elastic Observability and the AWS API Gateway integration, or <a href="https://www.elastic.co/contact">contact our sales team</a> to get started!</p>
<h2>Start a free trial today</h2>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da&amp;sc_channel=el&amp;ultron=gobig&amp;hulk=regpage&amp;blade=elasticweb&amp;gambit=mp-b">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/api-management-aws-api-gateway-integration/illustration-midnight-bg-aws-elastic-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Assembling an OpenTelemetry NGINX Ingress Controller Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/assembling-an-opentelemetry-nginx-ingress-controller-integration</link>
            <guid isPermaLink="false">assembling-an-opentelemetry-nginx-ingress-controller-integration</guid>
            <pubDate>Wed, 15 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post explores how to set up an OpenTelemetry integration for the NGINX Ingress Controller, detailing the configuration process, key transformations, and upcoming enhancements for modular configuration support.]]></description>
            <content:encoded><![CDATA[<p>Our vision is clear: to support OpenTelemetry within Elastic. A key aspect of
this transition is integrations — how can we seamlessly adapt all existing
integrations to fit the OpenTelemetry model?</p>
<p>Elastic integrations are designed to simplify observability by providing tools
to ingest application data, process it through Ingest pipelines, and deliver
prebuilt dashboards for visualization. With OpenTelemetry support, data
collection and processing will transition to the OpenTelemetry Collector, while
dashboards will need to adopt the OpenTelemetry data structure.</p>
<h2>From a Log to an Integration</h2>
<p>Although the concept of an OpenTelemetry Integration has not yet been officially
defined, we envision it as a structured collection of artifacts that enables users
to start monitoring an application from scratch. Each artifact has a specific role;
for example, an OpenTelemetry Collector configuration file, which must be
integrated into the main Collector setup. This bundled configuration instructs
the Collector on how to gather and process data from the relevant application.</p>
<p>In the OpenTelemetry Collector, data collection is handled by the <a href="https://opentelemetry.io/docs/collector/configuration/#receivers">receivers</a>
component. Some receivers are tailored for specific applications, such as Kafka
or MySQL, while others are designed to support general data collection methods.
The specialized receivers combine data gathering and transformation within a
single component. For the more generic receivers, however, additional components
are needed to refine and transform the incoming data into a more
application-specific format. Let’s take a look at how we can build an
integration for monitoring an NGINX Ingress Controller.</p>
<p>Ingress-NGINX is an Ingress controller for Kubernetes, using NGINX as a
reverse proxy and load balancer. Widely adopted, it plays a crucial role in
directing external traffic into Kubernetes services, making its usage,
performance, and health essential to observe. How can we start observing the external
requests made to our Ingress controller? Fortunately, the NGINX Ingress Controller
generates a structured log entry for each processed request. This structured
format ensures that each log entry follows a consistent structure, making it
straightforward to parse and generate consistent output.</p>
<pre><code>log_format upstreaminfo '$remote_addr - $remote_user [$time_local]
	&quot;$request&quot; ' '$status $body_bytes_sent &quot;$http_referer&quot; &quot;$http_user_agent&quot; '
	'$request_length $request_time [$proxy_upstream_name]
	[$proxy_alternative_upstream_name] $upstream_addr ' '$upstream_response_length
	$upstream_response_time $upstream_status $req_id';
</code></pre>
<p>Definitions of all the fields can be found
<a href="https://github.com/kubernetes/ingress-nginx/blob/controller-v1.11.3/docs/user-guide/nginx-configuration/log-format.md">here</a>.</p>
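<p>To make the format concrete, the sketch below parses a hypothetical log line in the upstreaminfo format with an equivalent regular expression (the sample values are invented; the field names follow the NGINX variables above):</p>
<pre><code class="language-python">import re

# A hypothetical access-log line in the upstreaminfo format (values invented)
line = ('203.0.113.7 - alice [22/Jan/2025:10:00:00 +0000] '
        '"GET /products HTTP/1.1" 200 512 "-" "Mozilla/5.0" '
        '320 0.005 [upstream-default-backend] [] 10.0.0.5:80 '
        '512 0.004 200 7f1d2a')

# Rough regex mirroring the variables in the log_format directive
pattern = re.compile(
    r'(\S+) - (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\d+) "([^"]*)" "([^"]*)" '
    r'(\d+) (\S+) \[([^\]]*)\] \[([^\]]*)\] (\S+) (\d+) (\S+) (\d+) (\S+)')

names = ['remote_addr', 'remote_user', 'time_local', 'request', 'status',
         'body_bytes_sent', 'http_referer', 'http_user_agent',
         'request_length', 'request_time', 'proxy_upstream_name',
         'proxy_alternative_upstream_name', 'upstream_addr',
         'upstream_response_length', 'upstream_response_time',
         'upstream_status', 'req_id']

fields = dict(zip(names, pattern.match(line).groups()))
print(fields['remote_addr'], fields['status'], fields['request'])
</code></pre>
<p>This is exactly the kind of consistent structure that makes the log straightforward to parse, as we will do next with the OpenTelemetry Collector.</p>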
<p>The OpenTelemetry Contrib Collector does not include a receiver capable of
reading and parsing all fields in an NGINX Ingress log. There are two primary
reasons for this:</p>
<ul>
<li><strong>Application Diversity</strong>: The landscape of applications is vast, with each
generating logs in unique formats. Developing and maintaining a dedicated
receiver for every application would be resource-intensive and difficult to
scale.</li>
<li><strong>Data Source Flexibility</strong>: Receivers are typically designed to collect data
from a specific source, like an HTTP endpoint. However, in some cases, we
may want to parse logs from an alternate source, such as an NGINX Ingress
log file stored in an AWS S3 bucket.</li>
</ul>
<p>These challenges can be addressed by combining receivers and processors.
Receivers handle the collection of raw data, while processors can extract
specific values when a known data structure is detected. Do we need a dedicated
processor to parse NGINX logs? Not necessarily. The transform processor can
handle this by modifying telemetry data according to a specified configuration.
This configuration is written in the OpenTelemetry Transformation Language
(OTTL), a language for transforming OpenTelemetry data based on the
<a href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/processing.md">OpenTelemetry Collector Processing
Exploration</a>.</p>
<p>The concept of processors in OpenTelemetry is quite similar to the Ingest
pipeline strategy currently used in Elastic integrations. The main challenge,
therefore, lies in migrating Ingest pipeline configurations to OpenTelemetry
Collector configurations. For a deeper dive into the challenges of such
migrations, check out this
<a href="https://www.elastic.co/observability-labs/blog/logstash-to-otel">article</a>.</p>
<p>For reference, you can view the current Elastic NGINX Ingress
Controller Ingest pipeline configuration in the following link: <a href="https://github.com/elastic/integrations/blob/main/packages/nginx_ingress_controller/data_stream/access/elasticsearch/ingest_pipeline/default.yml">Elastic NGINX
Ingress Controller Ingest
Pipeline</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-pipeline-to-otel-pipeline.png" alt="logstash-pipeline-to-otel-pipeline" /></p>
<p>Let’s start with the data collection. By default, the NGINX Ingress Controller
logs to stdout, and Kubernetes captures and stores these logs in a file.
Assuming that the
OpenTelemetry Collector running the following configuration has access to the
Kubernetes Pod logs, we can use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">filelog
receiver</a>
to read the controller logs:</p>
<pre><code class="language-yaml">receivers:
  filelog/nginx:
    include_file_path: true
    include: [/var/log/pods/*nginx-ingress-nginx-controller*/controller/*.log]
    operators:
      - id: container-parser
        type: container
</code></pre>
<p>This configuration is designed to exclusively read the controller's pod logs,
focusing on their default file path within a Kubernetes node. Furthermore, since
the controller's log entries do not inherently include their associated
Kubernetes metadata, the <code>container-parser</code> operator has been added to
bridge this gap. This operator appends Kubernetes-specific attributes, such as
<code>k8s.pod.name</code> and <code>k8s.namespace.name</code>, based solely on information available
from the filename. For a detailed overview of the <code>container-parser</code> operator, see
the following <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">OpenTelemetry blog
post</a>.</p>
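<p>Everything the container parser infers comes from the log file path alone: Kubernetes stores Pod logs under <code>/var/log/pods/&lt;namespace&gt;_&lt;pod&gt;_&lt;uid&gt;/&lt;container&gt;/</code>. A rough sketch of that extraction (not the operator's actual implementation; the path below is invented):</p>
<pre><code class="language-python"># Sketch of deriving k8s metadata from a Pod log path alone.
# Kubernetes writes container logs under:
#   /var/log/pods/NAMESPACE_POD_UID/CONTAINER/N.log
def k8s_attributes_from_path(path):
    parts = path.split('/')
    # Pod and namespace names cannot contain underscores, so this split is safe
    namespace, pod, uid = parts[4].split('_')
    return {
        'k8s.namespace.name': namespace,
        'k8s.pod.name': pod,
        'k8s.pod.uid': uid,
        'k8s.container.name': parts[5],
    }

attrs = k8s_attributes_from_path(
    '/var/log/pods/ingress_nginx-ingress-nginx-controller-abc12_0e1f/controller/0.log')
print(attrs['k8s.pod.name'], attrs['k8s.namespace.name'])
</code></pre>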
<h3>Avoiding duplicated logs</h3>
<p>The configuration outlined in this blog is designed for Kubernetes environments,
where the collector runs as a Kubernetes Pod. In such setups, handling Pod
restarts properly is crucial. By default, the <code>filelog</code> receiver reads the entire
content of log files on startup. This behavior can lead to duplicate log entries
being reprocessed and sent through the pipeline if the collector Pod is
restarted.</p>
<p>To make the configuration resilient to restarts, you can use a storage extension
to track file offsets. These offsets allow the <code>filelog</code> receiver to resume
reading from the last processed position in the log file after a restart. Below
is an example of how to add a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage">file storage extension</a> and update the <code>filelog</code>
receiver configuration to store the offsets in a file:</p>
<pre><code class="language-yaml">extensions:
  file_storage:
    directory: /var/lib/otelcol

receivers:
  filelog/nginx:
    storage: file_storage
    ...
</code></pre>
<p><strong>Important</strong>: The <code>/var/lib/otelcol</code> directory must be mounted as part of a
Kubernetes persistent volume to ensure the stored offsets persist across Pod
restarts.</p>
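<p>On a DaemonSet deployment, one way to satisfy this is a hostPath volume on each node (a minimal sketch; the container and volume names are illustrative):</p>
<pre><code class="language-yaml"># Excerpt from a collector DaemonSet spec: persist offsets on the node
containers:
  - name: otel-collector
    volumeMounts:
      - name: otelcol-state
        mountPath: /var/lib/otelcol
volumes:
  - name: otelcol-state
    hostPath:
      path: /var/lib/otelcol
      type: DirectoryOrCreate
</code></pre>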
<h3>Data transformation with OpenTelemetry processors</h3>
<p>Now it’s time to parse the structured log fields and transform them into
queryable OpenTelemetry fields. Initially, we considered using regular
expressions with the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#extract_patterns">extract_patterns
function</a>
available in the OpenTelemetry Transformation Language (OTTL). However, Elastic
recently contributed a new OTTL function,
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#extractgrokpatterns">ExtractGrokPatterns</a>,
based on Grok—a regular expression dialect that supports reusable, aliased
expressions. The function’s underlying library <a href="https://github.com/elastic/go-grok">Elastic
Go-Grok</a> ships with numerous predefined grok
patterns that simplify working with pattern matching, like <code>%{NUMBER}</code>,
which will match any number type: &quot;123&quot;, &quot;456.789&quot;, &quot;-0.123&quot;.</p>
<p>Each Ingress Controller log entry begins with the client's source IP address
(which may be a single IP or a list of IPs) and the username provided via Basic
authentication, represented as “$remote_addr - $remote_user”. The Grok <code>%{IP}</code> pattern
can be used to parse either an IPv4 or IPv6 address from the remote_addr field,
while the <code>%{GREEDYDATA}</code> pattern can capture the remote_user value.</p>
<p>For example, the following OTTL configuration will transform an unstructured
body message to a structured one with two fields:</p>
<ul>
<li>Parses a single IP address and assigns it to the <code>source.address</code> key.</li>
<li>Delimited by a “-”, captures the optional value of the authenticated username
in the <code>user.name</code> key.</li>
</ul>
<pre><code class="language-yaml">transform/parse_nginx_ingress_access/log:
  log_statements:
    - context: log
      statements:
        - set(body, ExtractGrokPatterns(body, &quot;%{IP:source.address} - (-|%{GREEDYDATA:user.name})&quot;, true))
</code></pre>
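<p>Since Grok patterns compile down to regular expressions, the effect of that statement can be approximated in plain Python (a simplified sketch: the regex below matches IPv4 only, whereas the Grok IP pattern also handles IPv6, and the real OTTL function returns a key-value map that replaces the log body):</p>
<pre><code class="language-python">import re

# Regex roughly equivalent to the Grok pattern
# '%{IP:source.address} - (-|%{GREEDYDATA:user.name})'
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+) - (?:-|(.*))')

address, name = pattern.match('203.0.113.7 - alice').groups()
print(address, name)

address, name = pattern.match('203.0.113.7 - -').groups()
print(name)  # None: no authenticated user
</code></pre>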
<p>The screenshot below illustrates the transformation process, showing the
original input data alongside the resulting structured format (diff):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/data-diff.png" alt="data-transform-diff" /></p>
<p>In real-world scenarios, NGINX Ingress Controller logs may begin with a list of
IP addresses or, at times, a domain name. These variations can be handled with
an extended Grok pattern. Similarly, we can use Grok to parse an HTTP UserAgent
and URL strings, but additional OTTL functions, such as
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#url">URL</a>
or
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/ottlfuncs#useragent">UserAgent</a>,
are required to extract meaningful data from these fields.</p>
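<p>To illustrate the kind of decomposition such functions perform, Python's standard library splits a URL into comparable components (a sketch only; the OTTL URL function emits attributes following OpenTelemetry semantic conventions, not this exact shape):</p>
<pre><code class="language-python">from urllib.parse import urlsplit

# Hypothetical request URL, broken into scheme, host, port, path, and query
parts = urlsplit('https://shop.example.com:8443/products?category=books')
print(parts.scheme, parts.hostname, parts.port, parts.path, parts.query)
</code></pre>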
<p>The complete configuration is available in the documentation for Elastic’s
OpenTelemetry NGINX Ingress Controller integration: <a href="https://github.com/elastic/integrations/blob/main/packages/nginx_ingress_controller_otel/docs/README.md">Integration
Documentation</a>.</p>
<h2>Usage</h2>
<p>The Elastic OpenTelemetry NGINX Ingress Controller integration is currently in <strong>Technical
preview</strong>. To access it, you must enable the &quot;Display beta integrations&quot; toggle
in the <strong>Integrations</strong> menu within Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/kibana-integration.png" alt="kibana-nginx-integration" /></p>
<p>Installing the Elastic OpenTelemetry NGINX Ingress Controller integration makes
a couple of dashboards available in Kibana. One of these
dashboards provides insights into access events for the controller, displaying
information such as HTTP response status codes over time, request volume per
URL, distribution of incoming requests by browser, top requested pages, and
more. The screenshot below shows the NGINX Ingress Controller Access Logs
dashboard, displaying data from a controller routing requests to an
<a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry Demo</a> deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/main-dashboard.png" alt="nginx-ingress-controller-otel-access-dashboard" /></p>
<p>The second dashboard focuses on errors within the NGINX Ingress Controller, highlighting the
volume of error events generated over time:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/error-access-dashboard.png" alt="nginx-ingress-controller-otel-error-access-dashboard" /></p>
<p>To start gathering and processing controller logs, we recommend incorporating
the OpenTelemetry Collector pipeline outlined in the integration’s documentation
into your collector configuration: <a href="https://www.elastic.co/guide/en/integrations/current/nginx_ingress_controller_otel.html">Integration
Documentation</a>.
Keep in mind that this configuration requires access to the Kubernetes node's
Pod logs, typically stored in <code>/var/log/pods/*</code>. To ensure proper access, we recommend
deploying the OpenTelemetry Collector as a DaemonSet in Kubernetes, as this
deployment type allows the collector to access the necessary log directory on
each node.</p>
<p>The OpenTelemetry Collector configuration service pipeline should include a
similar configuration:</p>
<pre><code class="language-yaml">service:
  extensions: [file_storage]
  pipelines:
    logs/nginx_ingress_controller:
      receivers:
        - filelog/nginx
      processors:
        - transform/parse_nginx_ingress_access/log
        - transform/parse_nginx_ingress_error/log
        - resourcedetection/system
      exporters:
        - elasticsearch
</code></pre>
<h3>Adding GeoIP Metadata</h3>
<p>As an optional enhancement, the OpenTelemetry Collector <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/geoipprocessor">GeoIP processor</a> can be configured and added to the pipeline to enrich each NGINX Ingress Controller log with geographical attributes, such as the request’s originating country, region, and city, enabling geo maps in Kibana to visualize traffic distribution and geographic patterns.</p>
<p>While the OpenTelemetry GeoIP processor is similar to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/geoip-processor.html">Elastic's GeoIP
processor</a>,
it requires users to provide their own local GeoLite2 database. The following
configuration extends the Integration’s configuration to include the GeoIP
processor with a <a href="https://dev.maxmind.com/geoip/geolite2-free-geolocation-data/">MaxMind GeoLite2 database</a>.</p>
<pre><code class="language-yaml">processors:
  geoip:
    context: record
    providers:
      maxmind:
        database_path: /tmp/GeoLite2-City.mmdb

service:
  extensions: [file_storage]
  pipelines:
    logs/nginx_ingress_controller:
      receivers:
        - filelog/nginx
      processors:
        - transform/parse_nginx_ingress_access/log
        - transform/parse_nginx_ingress_error/log
        - resourcedetection/system
        - geoip
      exporters:
        - elasticsearch
</code></pre>
<p>Sample Kibana Map with the OpenTelemetry NGINX Ingress Controller integration:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/geoip-map.png" alt="geoip-map-dashboard" /></p>
<h2>Next steps</h2>
<h3>OpenTelemetry Log Event</h3>
<p>A closer look at the OTTL integration’s statements reveals that the raw log
message is replaced by the parsed fields. In other words, the configuration
transforms the body log field* from a string into a structured map of key-value
pairs, as seen in “set(body, ExtractGrokPatterns(body, ...))”. This approach is
based on treating each NGINX Ingress Controller log entry as an <a href="https://opentelemetry.io/docs/specs/otel/logs/event-api/#event-data-model">OpenTelemetry
Event</a>—a
specialized type of LogRecord. Events are OpenTelemetry’s standardized semantic
formatting for LogRecords, containing an
“<a href="https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md#event-definition">event.name</a>”
attribute which defines the structure of the body field. An NGINX Ingress
Controller log record aligns well with the OpenTelemetry Event data model. It
follows a structured format and clearly distinguishes between two event types:
access logs and error logs. There is an ongoing PR to incorporate the NGINX
Ingress controller log into the OpenTelemetry semantic convention:
<a href="https://github.com/open-telemetry/semantic-conventions/pull/982">https://github.com/open-telemetry/semantic-conventions/pull/982</a></p>
<h3>Operating system breakdown</h3>
<p>Each controller log contains the source UserAgent, from which the integration
extracts the browser that originated the request. This information is valuable
for understanding user access patterns, as it provides insights into the types
of browsers commonly interacting with your services. Additionally, an <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35458">ongoing
pull
request</a>
into OTTL aims to extend this functionality by extracting operating system (OS)
details as well, providing even deeper insights into the environments
interacting with the NGINX Ingress Controller.</p>
<h3>Configuration encapsulation</h3>
<p>Setting up the configuration for the NGINX Ingress Controller integration can be
somewhat tedious, as it involves adding several complex processor configurations
to the existing collector pipelines. This process can quickly become cumbersome,
especially for non-expert users or in cases where the collector configuration is
already quite complex. In an ideal scenario, users would simply reference a
pre-defined integration configuration, and the collector would automatically
&quot;unwrap&quot; all the necessary components into the corresponding pipelines. This
would significantly simplify the setup process, making it more accessible and
reducing the risk of misconfigurations. To address this, there is a
<a href="https://github.com/open-telemetry/opentelemetry-collector/pull/11631">RFC</a>
(Request for Comments) proposing support for shareable, modular configurations
within the OpenTelemetry Collector. This feature would allow users to easily
collect signals from specific services or applications by referencing modular
configurations, streamlining the setup and enhancing usability for complex
scenarios.</p>
<p>*The OpenTelemetry community is currently discussing whether structured
body-extracted information should be stored in the attributes or body field.
For details, see this <a href="https://github.com/open-telemetry/semantic-conventions/issues/1651">ongoing issue</a>.</p>
<blockquote>
<p>This product includes GeoLite2 data created by MaxMind, available from <a href="https://www.maxmind.com">https://www.maxmind.com</a></p>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/assembling-an-opentelemetry-nginx-ingress-controller-integration/ingress-controller.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of Go applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-go-applications-opentelemetry</guid>
            <pubDate>Wed, 02 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting Go applications with OpenTelemetry provides insights into application performance, dependencies, and errors. We'll show you how to automatically instrument a Go application using Docker, with no changes to your application code.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the
cloud-native realm, DevOps and SRE teams are increasingly emerging as essential
partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act
as the stewards of application reliability, scalability, and top-tier
performance. The challenge? These teams require a cutting-edge observability
solution, one that encompasses full-stack insights, empowering them to rapidly
manage, monitor, and rectify potential disruptions before they culminate into
operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere
monitoring — it demands limitless data collection, precision in processing, and
the correlation of this data into actionable insights. However, the road to
achieving this holistic view is paved with obstacles, from navigating version
incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following
benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and
ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide
a complete system view.</li>
<li>Improve your application oversight through richer and enhanced
instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior
instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying
onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every
investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/">automatic instrumentation in
your Go</a>
application using Docker, without the need to refactor any part of your
application code. We will use an <a href="https://github.com/elastic/observability-examples">application called
Elastiflix</a>, which helps
highlight auto-instrumentation in a simple way.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called
<a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a
movie-streaming application. It consists of several micro-services written in
.NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand
how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for
OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data.
Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will
also be able to use Elastic’s powerful machine learning capabilities to speed up
analysis, and alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>.</li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Go application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Go</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found
on
<a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">GitHub</a>.</p>
<p>The following steps will show you how to instrument this application and run it
on the command line or in Docker. If you are interested in a more complete OTel
example, take a look at the docker-compose file
<a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>,
which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the
<a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic
Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-2-trial.png" alt="free trial" /></p>
<h3>Step 1. Run the Docker Image with auto-instrumentation</h3>
<p>We are going to use automatic instrumentation with the Go service from the
<a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">Elastiflix demo
application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/go-favorite
</code></pre>
<p>Per the <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/main/docs/getting-started.md">OpenTelemetry Automatic Instrumentation for Go
documentation</a>,
you will configure the application to be auto-instrumented using
docker-compose.</p>
<p>As specified in the <a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/main/docs/getting-started.md">OTEL Go
documentation</a>,
we will use environment variables and pass in the configuration values to
enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM
server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and
authentication token that the OTel exporter will use to send the data, as well
as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/apm/onboarding?agent=openTelemetry</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
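<p>If you want to sanity-check these two values in a shell before pasting them into the <code>go-auto</code> service of the compose file, you can export them locally first. The endpoint and token below are hypothetical placeholders, not real credentials — substitute the values you copied from Kibana:</p>
<pre><code class="language-bash"># Hypothetical placeholder values -- substitute the ones copied from Kibana.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer my-secret-token"

# The endpoint should be an https URL, and the headers value is a
# comma-separated list of key=value pairs (here, a single Authorization key).
echo "Endpoint:   ${OTEL_EXPORTER_OTLP_ENDPOINT}"
echo "Header key: ${OTEL_EXPORTER_OTLP_HEADERS%%=*}"
</code></pre>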
<p>Update the <code>docker-compose.yml</code> file at the top of the <code>Elastiflix</code> repository,
adding a <code>go-auto</code> service and updating the <code>favorite-go</code> one:</p>
<pre><code class="language-yaml">  favorite-go:
    build: go-favorite/.
    image: docker.elastic.co/demos/workshop/observability/elastiflix-go-favorite:${ELASTIC_VERSION}-${BUILD_NUMBER}
    depends_on:
      - redis
    networks:
      - app-network
    ports:
      - &quot;5001:5000&quot;
    environment:
      - REDIS_HOST=redis
      - TOGGLE_SERVICE_DELAY=${TOGGLE_SERVICE_DELAY:-0}
      - TOGGLE_CANARY_DELAY=${TOGGLE_CANARY_DELAY:-0}
      - TOGGLE_CANARY_FAILURE=${TOGGLE_CANARY_FAILURE:-0}
    volumes:
      - favorite-go:/app
  go-auto:
    image: otel/autoinstrumentation-go
    privileged: true
    pid: &quot;host&quot;
    networks:
      - app-network
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: &quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot;
      OTEL_EXPORTER_OTLP_HEADERS: &quot;REPLACE WITH OTEL_EXPORTER_OTLP_HEADERS&quot;
      OTEL_GO_AUTO_TARGET_EXE: &quot;/app/main&quot;
      OTEL_SERVICE_NAME: &quot;go-favorite&quot;
      OTEL_PROPAGATORS: &quot;tracecontext,baggage&quot;
    volumes:
      - favorite-go:/app
      - /proc:/host/proc
</code></pre>
<p>And, at the bottom of the file:</p>
<pre><code class="language-yaml">volumes:
  favorite-go:
networks:
  app-network:
    driver: bridge
</code></pre>
<p>Finally, in the configuration for the main node app, you will want to tell Elastiflix to call the Go favorites app by replacing the line:</p>
<pre><code class="language-yaml">environment:
  - API_ENDPOINT_FAVORITES=favorite-java:5000
</code></pre>
<p>with:</p>
<pre><code class="language-yaml">environment:
  - API_ENDPOINT_FAVORITES=favorite-go:5000
</code></pre>
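<p>Once the compose file is updated and the stack is running (for example, via <code>docker compose up</code>), a small loop like the following can generate traffic so traces start flowing. This is a sketch, assuming the <code>5001:5000</code> port mapping configured above and the service's <code>/favorites</code> endpoint:</p>
<pre><code class="language-bash"># Generate a little traffic against the instrumented Go service.
# Assumes the stack is up and favorite-go is published on host port 5001.
BASE_URL="${BASE_URL:-http://localhost:5001}"
for i in 1 2 3; do
  curl -s "${BASE_URL}/favorites" || echo "request ${i} failed (is the stack up?)"
  sleep 1
done
</code></pre>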
<h3>Step 2: Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your
instrumented service (in our case, this is /favorites), and you should see the
app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Elastic APM will begin by tracking throughput and latency, critical metrics for
SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-5-services2.png" alt="services-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/elastic-blog-6-graph-colored.png" alt="graph colored lines" /></p>
<p>This gives you complete visibility across metrics and traces!</p>
<h2>Summary</h2>
<p>With this docker-compose setup, you've transformed your simple Go application
into one that's automatically instrumented with OpenTelemetry. This will aid
greatly in understanding application performance, tracing errors, and gaining
insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development,
especially in distributed systems. With tools like OpenTelemetry, understanding
complex systems becomes a bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Go with OpenTelemetry.</li>
<li>Using standard commands in a Docker file, auto-instrumentation was done
efficiently, without adding code in multiple places, keeping the setup
manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE
teams can auto-instrument their applications with ease, gaining immediate
insights into the health of the entire application stack and reducing mean time
to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be
using auto-instrumentation of open-source OpenTelemetry or manual
instrumentation with its native APM agents, you can plan your migration to OTel
by focusing on a few applications first and then using OpenTelemetry across your
applications later on in a manner that best fits your business needs.</p>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-java-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/observability-labs/blog/auto-instrument-nodejs-apps-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-nodejs-apps-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-net-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-instrumentation</a> <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-apps-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-go-applications-opentelemetry/observability-launch-series-3-go-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automatic instrumentation with OpenTelemetry for Node.js applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrument-nodejs-apps-opentelemetry</link>
            <guid isPermaLink="false">auto-instrument-nodejs-apps-opentelemetry</guid>
            <pubDate>Wed, 30 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to auto-instrument Node.js applications using OpenTelemetry. With standard commands in a Docker file, applications can be instrumented quickly without writing code in multiple places, enabling rapid change, scale, and easier management.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">automatic instrumentation for OpenTelemetry</a> with the Node.js service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-1-otel-config-options.png" alt="options" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis, and alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own <strong>Node.js</strong> application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Node.js</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-2-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Node.js Service</h3>
<p>We are going to use automatic instrumentation with Node.js service from the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/node-server-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">OpenTelemetry JavaScript documentation</a> and <a href="https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node">@opentelemetry/auto-instrumentations-node</a> documentation, you will simply install the appropriate node packages using npm.</p>
<pre><code class="language-bash">npm install --save @opentelemetry/api
npm install --save @opentelemetry/auto-instrumentations-node
</code></pre>
<p>If you are running the Node.js service on the command line, then here is how you can run auto-instrument with Node.js.</p>
<pre><code class="language-bash">node --require '@opentelemetry/auto-instrumentations-node/register' app.js
</code></pre>
<p>For our application, we do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile">FROM node:14

WORKDIR /app

COPY [&quot;package.json&quot;, &quot;./&quot;]
RUN ls
RUN npm install --production
COPY . .

RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node


EXPOSE 3001

CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/js/automatic/">OTel documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication token that the OTel exporter will use to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path /app/home#/tutorial/apm.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t node-otel-auto-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;node-server-otel-auto&quot; \
       -p 3001:3001 \
       node-otel-auto-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on some downstream services that you may not have running on your machine.</p>
<pre><code class="language-bash">curl localhost:3001/api/login
curl localhost:3001/api/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:3001/api/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces, metrics, and logs in Elastic APM</h3>
<p>Exploring the Services section in Elastic APM, you’ll see the Node service displayed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Clicking on the node-server-otel-auto service, you can see that it is ingesting telemetry data using OpenTelemetry.</p>
<h2>Summary</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Node.js with OpenTelemetry</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently, without adding code in multiple places, keeping the setup manageable</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be using auto-instrumentation of open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then using OpenTelemetry across your applications later on in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-apps-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrument-nodejs-apps-opentelemetry/observability-launch-series-1-node-js-auto_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of Java applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-java-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-java-applications-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting Java applications with OpenTelemetry provides insights into application performance, dependencies, and errors. We'll show you how to automatically instrument a Java application using Docker, with no changes to your application code.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">automatic instrumentation in your Java</a> application using Docker, without the need to refactor any part of your application code. We will use an <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie-streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis time, along with alerting to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>.</li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Java application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Java</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-2-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Java service</h3>
<p>We are going to use automatic instrumentation with the Java service from the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite-otel-auto">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/java-favorite-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry Automatic Instrumentation for Java documentation</a>, you simply need to download the Java agent and attach it to your application.</p>
<p>Create a local otel directory and download the OpenTelemetry Java agent, <code>opentelemetry-javaagent.jar</code>, into it:</p>
<pre><code class="language-bash">&gt;mkdir /otel

&gt;curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar –output /otel/opentelemetry-javaagent.jar
</code></pre>
<p>If you are going to run the service on the command line, then you can use the following command:</p>
<pre><code class="language-bash">java -javaagent:/otel/opentelemetry-javaagent.jar \
-jar /usr/src/app/target/favorite-0.0.1-SNAPSHOT.jar --server.port=5000
</code></pre>
<p>For our application, we will do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile"># Start with a base image containing the Java runtime
FROM maven:3.8.2-openjdk-17-slim as build

# Make port 5000 available to the world outside this container
EXPOSE 5000

# Change to the app directory
WORKDIR /usr/src/app

# Copy the local code to the container
COPY . .

# Build the application
RUN mvn clean install

USER root
RUN apt-get update &amp;&amp; apt-get install -y zip curl
RUN mkdir /otel
RUN curl -L -o /otel/opentelemetry-javaagent.jar https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v1.28.0/opentelemetry-javaagent.jar

COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT [&quot;/start.sh&quot;]
</code></pre>
<h3>Step 2. Running the Docker Image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OTEL Java documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication details the OTel exporter needs to send data, along with some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
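<p>Once copied from Kibana, the values look something like the following. The endpoint and token here are placeholders for illustration only; use the real values from your own deployment:</p>
<pre><code class="language-bash"># Placeholder values -- copy the real ones from Kibana
export OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-east-1.aws.cloud.es.io:443
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer abc123tokenvalue'
</code></pre>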
<p><strong>Build the Docker image</strong></p>
<pre><code class="language-bash">docker build -t java-otel-auto-image .
</code></pre>
<p><strong>Run the Docker image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot; \
       -e ELASTIC_APM_SECRET_TOKEN=&quot;REPLACE WITH THE BIT AFTER Authorization=Bearer &quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;java-favorite-otel-auto&quot; \
       -p 5000:5000 \
       java-otel-auto-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /favorites), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>It will begin tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-5-services2.png" alt="services-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-6-graph-colored.png" alt="graph colored lines" /></p>
<p>Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel Agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-7-graph-no-colors.png" alt="graph-no-colors" /></p>
<p>This gives you complete visibility across logs, metrics, and traces!</p>
<h2>Basic concepts: How APM works with Java</h2>
<p>Before we continue, let's first understand a few basic concepts and terms.</p>
<ul>
<li><strong>Java Agent:</strong> This is a tool that can be used to instrument (or modify) the bytecode of class files in the Java Virtual Machine (JVM). Java agents are used for many purposes like performance monitoring, logging, security, and more.</li>
<li><strong>Bytecode:</strong> This is the intermediary code generated by the Java compiler from your Java source code. This code is interpreted or compiled on the fly by the JVM to produce machine code that can be executed.</li>
<li><strong>Byte Buddy:</strong> Byte Buddy is a code generation and manipulation library for Java. It is used to create, modify, or adapt Java classes at runtime. In the context of a Java Agent, Byte Buddy provides a powerful and flexible way to modify bytecode. <strong>Both the Elastic APM Agent and the OpenTelemetry Agent use Byte Buddy under the covers.</strong></li>
</ul>
<p><strong>Now, let's talk about how automatic instrumentation works with Byte Buddy:</strong></p>
<p>Automatic instrumentation is the process by which an agent modifies the bytecode of your application's classes, often to insert monitoring code. The agent doesn't modify the source code directly, but rather the bytecode that is loaded into the JVM. This is done while the JVM is loading the classes, so the modifications are in effect during runtime.</p>
<p>Here's a simplified explanation of the process:</p>
<ol>
<li>
<p><strong>Start the JVM with the agent:</strong> When starting your Java application, you specify the Java agent with the -javaagent command line option. This instructs the JVM to load your agent before the main method of your application is invoked. At this point, the agent has the opportunity to set up class transformers.</p>
</li>
<li>
<p><strong>Register a class file transformer with Byte Buddy:</strong> Your agent will register a class file transformer with Byte Buddy. A transformer is a piece of code that is invoked every time a class is loaded into the JVM. This transformer receives the bytecode of the class, and it can modify this bytecode before the class is actually used.</p>
</li>
<li>
<p><strong>Transform the bytecode:</strong> When your transformer is invoked, it will use Byte Buddy's API to modify the bytecode. Byte Buddy allows you to specify your transformations in a high-level, expressive way rather than manually writing complex bytecode. For example, you could specify a certain class and method within that class that you want to instrument and provide an &quot;interceptor&quot; that will add new behavior to that method.</p>
</li>
<li>
<p><strong>Use the transformed classes:</strong> Once the agent has set up its transformers, the JVM continues to load classes as usual. Each time a class is loaded, your transformers are invoked, allowing them to modify the bytecode. Your application then uses these transformed classes as if they were the original ones, but they now have the extra behavior that you've injected through your interceptor.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/elastic-blog-8-flowchart.png" alt="flowchart" /></p>
<p>In essence, automatic instrumentation with Byte Buddy is about modifying the behavior of your Java classes at runtime, without needing to alter the source code directly. This is especially useful for cross-cutting concerns like logging, monitoring, or security, as it allows you to centralize this code in your Java Agent, rather than scattering it throughout your application.</p>
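<p>To make this concrete, here is a minimal, hypothetical sketch of a Java agent using only the JDK's <code>java.lang.instrument</code> API. The class and package names are illustrative, not taken from the OpenTelemetry agent; the real agent layers Byte Buddy on top of this same mechanism:</p>
<pre><code class="language-java">import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class TimingAgent {

    // Invoked by the JVM before main() when the app is started with
    // -javaagent:timing-agent.jar
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new TimingTransformer());
    }

    static class TimingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class&lt;?&gt; classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // Returning null tells the JVM to keep the original bytecode.
            if (className == null || !className.startsWith(&quot;com/example/favorite/&quot;)) {
                return null;
            }
            // A real agent would hand classfileBuffer to Byte Buddy (or ASM)
            // here and return rewritten bytecode, e.g. with timer calls
            // wrapped around each method. This sketch passes it through.
            return classfileBuffer;
        }
    }
}
</code></pre>
<p>The OpenTelemetry agent does exactly this at scale: its transformers match hundreds of known framework classes and inject span start/stop logic around their methods.</p>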
<h2>Summary</h2>
<p>With this Dockerfile, you've transformed your simple Java application into one that's automatically instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Java with OpenTelemetry.</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently, without adding code in multiple places, which keeps the setup manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can auto-instrument their applications with ease, gaining immediate insight into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications later on in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-java-applications-opentelemetry/observability-launch-series-3-java-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Auto-instrumentation of .NET applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-net-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-net-applications-opentelemetry</guid>
            <pubDate>Fri, 01 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry provides an observability framework for cloud-native software, allowing us to trace, monitor, and debug applications seamlessly. In this post, we'll explore how to automatically instrument a .NET application using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
<li>Explore manual instrumentation, enabling customized data collection to fit your unique needs.</li>
<li>Ensure monitoring consistency across layers with a standardized observability data framework.</li>
<li>Decouple development from operations, driving peak efficiency for both.</li>
</ul>
<p>Given this context, OpenTelemetry emerges as an unmatched observability solution for cloud-native software, seamlessly enabling tracing, monitoring, and debugging. One of its strengths is the ability to auto-instrument applications, allowing developers the luxury of collecting invaluable telemetry without delving into code modifications.</p>
<p>In this post, we will dive into the methodology to instrument a .NET application using Docker, blending the best of both worlds: powerful observability without the code hassles.</p>
<h2>What's covered?</h2>
<ul>
<li>How APM works with .NET using CLR Profiler functionality</li>
<li>Creating a Docker image for a .NET application with the OpenTelemetry instrumentation baked in</li>
<li>Installing and running the OpenTelemetry .NET Profiler for automatic instrumentation</li>
</ul>
<h2>How APM works with .NET using CLR Profiler functionality</h2>
<p>Before we delve into the details, let's clear up some confusion around .NET Profilers and CPU Profilers like Elastic<sup>®</sup>’s Universal Profiling tool — we don’t want to get these two things mixed up, as they have very different purposes.</p>
<p>When discussing profiling tools, especially in the context of .NET, it's not uncommon to encounter confusion between a &quot;.NET profiler&quot; and a &quot;CPU profiler.&quot; Though both are used to diagnose and optimize applications, they serve different primary purposes and operate at different levels. Let's clarify the distinction:</p>
<h3>.NET Profiler</h3>
<ol>
<li>
<p><strong>Scope:</strong> Specifically targets .NET applications. It is designed to work with the .NET runtime (i.e., the Common Language Runtime (CLR)).</p>
</li>
<li>
<p><strong>Functionality:</strong> Hooks into the CLR through the Profiler API to observe runtime events (JIT compilation, method enter/leave, garbage collection) and to rewrite IL in order to inject instrumentation.</p>
</li>
<li>
<p><strong>Use cases:</strong> APM auto-instrumentation, fine-grained code-level analysis, and diagnosing CLR-specific behavior.</p>
</li>
</ol>
<h3>CPU Profiler</h3>
<ol>
<li>
<p><strong>Scope:</strong> More general than a .NET profiler. It can profile any application, irrespective of the language or runtime, as long as it runs on the CPU being profiled.</p>
</li>
<li>
<p><strong>Functionality:</strong> Periodically samples call stacks to measure where CPU time is being spent, independent of language or runtime.</p>
</li>
<li>
<p><strong>Use cases:</strong> Finding CPU hotspots and understanding CPU resource utilization across any application on the host.</p>
</li>
</ol>
<p>While both .NET profilers and CPU profilers aid in optimizing and diagnosing application performance, their approach and depth differ. A .NET profiler offers deep insights specifically into the .NET ecosystem, allowing for fine-grained analysis and instrumentation. In contrast, a CPU profiler provides a broader view, focusing on CPU usage patterns across any application, regardless of its development platform.</p>
<p>It's worth noting that for comprehensive profiling of a .NET application, you might use both: the .NET profiler to understand code-level behaviors specific to .NET and the CPU profiler to get an overview of CPU resource utilization.</p>
<p>Now that we've cleared that up, let's focus on the .NET Profiler, which we are discussing in this blog for automatic instrumentation of .NET applications. First, let's familiarize ourselves with some foundational concepts and terminologies relevant to a .NET Profiler:</p>
<ul>
<li><strong>CLR (Common Language Runtime):</strong> CLR is a core component of the .NET framework, acting as the execution engine for .NET apps. It provides key services like memory management, exception handling, and type safety.</li>
<li><strong>Profiler API:</strong> .NET provides a set of APIs for profiling applications. These APIs let tools and developers monitor or manipulate .NET applications during runtime.</li>
<li><strong>IL (Intermediate Language):</strong> After compiling, .NET source code turns into IL, a low-level, platform-agnostic representation. This IL code is then compiled just-in-time (JIT) into machine code by the CLR during application execution.</li>
<li><strong>JIT compilation:</strong> JIT stands for just-in-time. In .NET, the CLR compiles IL to native code just before its execution.</li>
</ul>
<p>Now, let's explore how automatic instrumentation works using CLR Profiler.</p>
<p>Automatic instrumentation in .NET, much like Java's bytecode instrumentation, revolves around modifying the behavior of your application's methods during runtime, without changing the actual source code.</p>
<p>Here’s a step-by-step breakdown:</p>
<ol>
<li>
<p><strong>Attach the profiler:</strong> When launching your .NET application, you'll have to specify to load the profiler. The CLR checks for the presence of a profiler by reading environment variables. If it finds one, the CLR initializes the profiler before any user code is executed.</p>
</li>
<li>
<p><strong>Use Profiler API to monitor events:</strong> The Profiler API allows a profiler to monitor various events. For instance, method JIT compilation events can be tracked. When a method is about to be JIT compiled, the profiler gets notified.</p>
</li>
<li>
<p><strong>Manipulate IL code:</strong> Upon getting notified of a JIT compilation, the profiler can manipulate the IL code of the method. Using the Profiler API, the profiler can insert, delete, or replace IL instructions. This is analogous to how Java agents modify bytecode. For example, if you want to measure a method's execution time, you'd modify the IL to insert calls to start and stop a timer at the beginning and end of the method, respectively.</p>
</li>
<li>
<p><strong>Execution of transformed code:</strong> Once the IL has been modified, the JIT compiler will translate it into machine code. The application will then execute this machine code, which includes the additions made by the profiler.</p>
</li>
<li>
<p><strong>Gather and report data:</strong> The added instrumentation can collect various data, such as method execution times or call counts. This data can then be relayed to an application performance management (APM) tool, which can provide insights, visualizations, and alerts based on the data.</p>
</li>
</ol>
<p>In essence, automatic instrumentation with CLR Profiler is about modifying the behavior of your .NET methods at runtime. This is invaluable for monitoring, diagnosing, and fine-tuning the performance of .NET applications without intruding on the application's actual source code.</p>
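<p>Under the hood, the CLR decides whether to load a profiler by reading a handful of environment variables at startup. The <code>instrument.sh</code> script we use later sets these for you; the variable names below are the standard CLR ones, but the profiler CLSID and library path shown are illustrative of what the OpenTelemetry .NET auto-instrumentation uses (the exact path varies by version and platform), so check the script for the authoritative values:</p>
<pre><code class="language-bash"># Tell the CLR to enable the profiling API
export CORECLR_ENABLE_PROFILING=1
# CLSID of the profiler to load (here: the OTel .NET auto-instrumentation)
export CORECLR_PROFILER='{918728DD-259F-4A6A-AC2B-B85E1B658318}'
# Native library implementing the Profiler API callbacks
export CORECLR_PROFILER_PATH=$OTEL_DOTNET_AUTO_HOME/linux-x64/OpenTelemetry.AutoInstrumentation.Native.so
</code></pre>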
<h2>Prerequisites</h2>
<ul>
<li>A basic understanding of Docker and .NET</li>
<li>Elastic Cloud</li>
<li>Docker installed on your machine (we recommend <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a>)</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/elastic-blog-2-free-trial.png" alt="" /></p>
<h2>Step 1. Base image setup</h2>
<p>Start with the .NET runtime image for the base layer of our Dockerfile:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app
EXPOSE 8000
</code></pre>
<p>Here, we're setting up the application's runtime environment.</p>
<h2>Step 2. Building the .NET application</h2>
<p>This feature of Docker is just the best. Here, we compile our .NET application using the SDK image. In the bad old days, we used to build on a different platform and then copy the compiled code into the Docker container. By using Docker all the way through, we can be much more confident that our build will replicate from a developer’s desktop to production.</p>
<pre><code class="language-dockerfile">FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;login.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./login.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;login.csproj&quot; -c Release -o /app/build
</code></pre>
<p>This section ensures that our .NET code is properly restored and compiled.</p>
<h2>Step 3. Publishing the application</h2>
<p>Once built, we'll publish the app:</p>
<pre><code class="language-dockerfile">FROM build AS publish
RUN dotnet publish &quot;login.csproj&quot; -c Release -o /app/publish
</code></pre>
<h2>Step 4. Preparing the final image</h2>
<p>Now, let's set up the final runtime image:</p>
<pre><code class="language-dockerfile">FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
</code></pre>
<h2>Step 5. Installing OpenTelemetry</h2>
<p>We'll install dependencies and download the OpenTelemetry auto-instrumentation script:</p>
<pre><code class="language-dockerfile">RUN apt-get update &amp;&amp; apt-get install -y zip curl
RUN mkdir /otel
RUN curl -L -o /otel/otel-dotnet-install.sh https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/releases/download/v0.7.0/otel-dotnet-auto-install.sh
RUN chmod +x /otel/otel-dotnet-install.sh
</code></pre>
<h2>Step 6. Configure OpenTelemetry</h2>
<p>Designate where OpenTelemetry should reside and execute the installation script. Note that the ENV OTEL_DOTNET_AUTO_HOME is required as the script looks for it:</p>
<pre><code class="language-dockerfile">ENV OTEL_DOTNET_AUTO_HOME=/otel
RUN /bin/bash /otel/otel-dotnet-install.sh
</code></pre>
<h2>Step 7. Additional configuration</h2>
<p>Make sure the auto-instrumentation and platform detection scripts are executable and run the platform detection script.</p>
<pre><code class="language-dockerfile">COPY platform-detection.sh /otel/
RUN chmod +x /otel/instrument.sh
RUN chmod +x /otel/platform-detection.sh &amp;&amp; /otel/platform-detection.sh
</code></pre>
<p>This platform detection script checks whether the Docker build is for ARM64 and applies a workaround to get the OpenTelemetry instrumentation working on macOS. If you happen to be running locally on a Mac with an M1 or M2 (ARM64) processor, you will be grateful for this script.</p>
<h2>Step 8. Entry point setup</h2>
<p>Lastly, set the Docker image's entry point to both source the OpenTelemetry instrumentation, which sets up the environment variables required to bootstrap the .NET Profiler, and then we start our .NET application:</p>
<pre><code class="language-dockerfile">ENTRYPOINT [&quot;/bin/bash&quot;, &quot;-c&quot;, &quot;source /otel/instrument.sh &amp;&amp; dotnet login.dll&quot;]
</code></pre>
<h2>Step 9. Running the Docker image with environment variables</h2>
<p>To build and run the Docker image, you'd typically follow these steps:</p>
<h3>Build the Docker image</h3>
<p>First, you'd want to build the Docker image from your Dockerfile. Let's assume the Dockerfile is in the current directory, and you'd like to name/tag your image dotnet-login-otel-image.</p>
<pre><code class="language-bash">docker build -t dotnet-login-otel-image .
</code></pre>
<h3>Run the Docker image</h3>
<p>After building the image, you'd run it with the specified environment variables. For this, the docker <strong>run</strong> command is used with the -e flag for each environment variable.</p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot; \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;${ELASTIC_APM_SERVER_URL}&quot; \
       -e OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;dotnet-login-otel-auto&quot; \
       -e OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
       dotnet-login-otel-image
</code></pre>
<p>Make sure that <code>${ELASTIC_APM_SECRET_TOKEN}</code> and <code>${ELASTIC_APM_SERVER_URL}</code> are set in your shell environment with the actual values from your Elastic Cloud deployment, as shown below.</p>
<p><strong>Getting Elastic Cloud variables</strong></p>
<p>You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You can also use an environment file with docker run --env-file to make the command less verbose if you have multiple environment variables.</p>
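<p>For example, a hypothetical <code>otel.env</code> file carrying the same variables as the <code>-e</code> flags above might look like this (all values here are placeholders):</p>
<pre><code class="language-bash"># Hypothetical env file; one KEY=VALUE per line, no quoting needed
cat &gt; otel.env &lt;&lt;'EOF'
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer my-secret-token
OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-east-1.aws.cloud.es.io:443
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
OTEL_SERVICE_NAME=dotnet-login-otel-auto
EOF

# The run command then shortens to:
# docker run --env-file otel.env dotnet-login-otel-image
</code></pre>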
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /login), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/services-3.png" alt="services" /></p>
<p>Elastic APM starts by tracking throughput and latency, critical metrics for SREs to keep an eye on.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/dotnet-login-otel-auto-1.png" alt="dotnet-login-otel-auto-1" /></p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/dotnet-login-otel-auto-2.png" alt="dotnet-login-otel-auto-2" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/specific_transactions.png" alt="specific transactions" /></p>
<p>There is clearly an outlier here, where one transaction took over 200ms. This is likely due to the .NET CLR warming up. Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/otel_agent.png" alt="otel agent" /></p>
<h2>Wrapping up</h2>
<p>With this Dockerfile, you've transformed your simple .NET application into one that's automatically instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument .NET with OpenTelemetry.</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently and without adding code in multiple places, which keeps the setup manageable.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can auto-instrument their applications with ease, gaining immediate insight into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-net-applications-opentelemetry/observability-launch-series-4-net-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automatic instrumentation with OpenTelemetry for Python applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry</link>
            <guid isPermaLink="false">auto-instrumentation-python-applications-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to auto-instrument Python applications using OpenTelemetry. With standard commands in a Docker file, applications can be instrumented quickly without writing code in multiple places, enabling rapid change, scale, and easier management.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualizations</a> and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/python/">automatic instrumentation for OpenTelemetry</a> with the Python service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>, which helps highlight auto-instrumentation in a simple way.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie-streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-1-otel-config-options.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis time, along with alerting, to help reduce MTTR.</p>
<h3>Prerequisites</h3>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Python application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Python</li>
</ul>
<h3>View the example source code</h3>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-2-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Configure auto-instrumentation for the Python Service</h3>
<p>We are going to use automatic instrumentation with Python service from the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>.</p>
<p>We will be using the following service from Elastiflix:</p>
<pre><code class="language-bash">Elastiflix/python-favorite-otel-auto
</code></pre>
<p>Per the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/">OpenTelemetry Automatic Instrumentation for Python documentation</a>, you simply install the appropriate Python packages using pip install.</p>
<pre><code class="language-bash">&gt;pip install opentelemetry-distro \
	opentelemetry-exporter-otlp

&gt;opentelemetry-bootstrap -a install
</code></pre>
<p>If you are running the Python service on the command line, then you can use the following command:</p>
<pre><code class="language-bash">opentelemetry-instrument python main.py
</code></pre>
<p>For our application, we do this as part of the Dockerfile.</p>
<p><strong>Dockerfile</strong></p>
<pre><code class="language-dockerfile">FROM python:3.9-slim as base

# get packages
COPY requirements.txt .
RUN pip install -r requirements.txt
WORKDIR /favoriteservice

#install opentelemetry packages
RUN pip install opentelemetry-distro \
	opentelemetry-exporter-otlp

RUN opentelemetry-bootstrap -a install

# Add the application
COPY . .

EXPOSE 5000
ENTRYPOINT [ &quot;opentelemetry-instrument&quot;, &quot;python&quot;, &quot;main.py&quot;]
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/#configuring-the-agent">OTEL Python documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t python-otel-auto-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20&lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;python-favorite-otel-auto&quot; \
       -p 5000:5000 \
       python-otel-auto-image
</code></pre>
<p><strong>Important:</strong> Note that the “OTEL_EXPORTER_OTLP_HEADERS” variable has the whitespace after Bearer escaped as “%20” — this is a requirement for Python.</p>
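<p>As a minimal sketch, you can build the escaped header value in the shell before passing it to <code>docker run</code>; <code>TOKEN</code> here is a placeholder for your real secret token:</p>
<pre><code class="language-bash"># TOKEN is a placeholder -- substitute your real secret token
TOKEN=&quot;my-secret-token&quot;
# The space after &quot;Bearer&quot; must be percent-encoded as %20
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20${TOKEN}&quot;
echo &quot;$OTEL_EXPORTER_OTLP_HEADERS&quot;
# Authorization=Bearer%20my-secret-token
</code></pre>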
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 3: Explore traces, metrics, and logs in Elastic APM</h3>
<p>Exploring the Services section in Elastic APM, you’ll see the Python service displayed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-4-services.png" alt="services" /></p>
<p>Clicking on the python-favorite-otel-auto service, you can see that it is ingesting telemetry data using OpenTelemetry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/elastic-blog-5-graph-view.png" alt="graph view" /></p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to auto-instrument Python with OpenTelemetry</li>
<li>Using standard commands in a Dockerfile, auto-instrumentation was done efficiently and without adding code in multiple places</li>
</ul>
<p>Since Elastic can support a mix of methods for ingesting data, whether it be auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then rolling OpenTelemetry out across your applications in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/auto-instrumentation-python-applications-opentelemetry/observability-launch-series-2-python-auto_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automated Error Triage: From Reactive to Autonomous]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-error-triaging</link>
            <guid isPermaLink="false">automated-error-triaging</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to automate error triage by using Elasticsearch log clustering and AI agents, turning production logs into actionable root cause reports.]]></description>
            <content:encoded><![CDATA[<p>The engineering feedback loop is often pictured as a clean cycle: shipping a feature, monitoring its health, triaging issues, identifying bugs, and deploying fixes. However, in large-scale cloud environments, the path from monitoring to identification frequently becomes a bottleneck. When thousands of Kibana instances running on Elastic Cloud emit millions of logs across a vast codebase, the lag between an error occurring and an engineer understanding its root cause—the Maintenance Gap—can stretch from hours to months.</p>
<p>To close this gap, we built an automated pipeline that moves beyond simple monitoring. By automating the discovery and investigation phases, we have shifted the focus of the engineer from &quot;what happened?&quot; to &quot;is this fix correct?&quot;</p>
<h2><strong>The Bottleneck in the Feedback Loop</strong></h2>
<p>In a high-velocity engineering environment, the path from deployment to resolution involves several distinct stages: <strong>Ship</strong>, <strong>Monitor</strong>, <strong>Triage</strong>, <strong>Identify</strong>, <strong>Fix</strong>, and <strong>Review/Deploy</strong>.</p>
<p>Velocity typically stalls during triage and identification. While catastrophic failures are reported immediately, smaller errors—intermittent UI glitches or failed background tasks—often go unreported. This dependency on manual reporting creates an inflated time to resolution; by the time a report is filed and routed, the issue may have already impacted the fleet for days.</p>
<p>By automating discovery and investigation, even these &quot;paper cut&quot; bugs are quantified before they accumulate into significant technical debt. The goal is to ensure that by the time a developer enters the cycle to write a fix, the detective work is already complete.</p>
<h2><strong>Discovery: Automated Log Clustering</strong></h2>
<p>The first challenge in this process is signal-to-noise. In a massive production environment, creating a ticket for every error event is unmanageable.</p>
<p>Instead of analyzing individual log lines, we automate the triage process using ES|QL's <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions/categorize">CATEGORIZE grouping function</a>. <code>CATEGORIZE</code> clusters text messages into groups of similarly formatted values, turning unstructured telemetry into a prioritized backlog of distinct error patterns.</p>
<p>For example, a query like the following runs on a rolling window across all Kibana error logs:</p>
<pre><code class="language-esql">FROM kibana-server-logs
| WHERE log.level == &quot;ERROR&quot;
    AND @timestamp &gt;= NOW() - 7 days
| STATS count = COUNT(*) BY category = CATEGORIZE(message)
| SORT count DESC
</code></pre>
<p>The result is a table of regex-like categories and their occurrence counts:</p>
<table>
<thead>
<tr>
<th>count</th>
<th>category</th>
</tr>
</thead>
<tbody>
<tr>
<td>1,247</td>
<td><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></td>
</tr>
<tr>
<td>812</td>
<td><code>.*?Connection.+?error.*?</code></td>
</tr>
<tr>
<td>3</td>
<td><code>.*?Disconnected.*?</code></td>
</tr>
</tbody>
</table>
<p>A category like <code>TypeError Cannot read properties of undefined reading document</code> with 1,200+ hits over the past week tells us there is a real, recurring defect worth investigating. A category like <code>Connection error</code> spread uniformly across the fleet is more likely infrastructure noise.</p>
<p>The output is used to automatically file prioritized issues in a backlog, each enriched with the category, its regex, the occurrence count, and deep links into the raw telemetry. This automation ensures the feedback loop no longer waits for a user report to trigger an investigation; the discovery is proactive and immediate. These prioritized clusters then serve as the direct input for our autonomous investigation agent.</p>
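<p>A scheduled job could feed that backlog by running the same query programmatically against Elasticsearch's ES|QL REST API (<code>POST /_query</code>). The sketch below is illustrative; <code>ES_URL</code> and <code>ES_API_KEY</code> are placeholders for your own cluster endpoint and credentials:</p>
<pre><code class="language-bash"># The CATEGORIZE query as a single-line string
ESQL_QUERY='FROM kibana-server-logs | WHERE log.level == &quot;ERROR&quot; AND @timestamp &gt;= NOW() - 7 days | STATS count = COUNT(*) BY category = CATEGORIZE(message) | SORT count DESC'

# Commented out -- requires a live cluster and credentials:
# curl -s -X POST &quot;${ES_URL}/_query?format=csv&quot; \
#      -H &quot;Authorization: ApiKey ${ES_API_KEY}&quot; \
#      -H &quot;Content-Type: application/json&quot; \
#      -d &quot;{\&quot;query\&quot;: \&quot;${ESQL_QUERY//\&quot;/\\\&quot;}\&quot;}&quot;
</code></pre>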
<h2><strong>Investigation: The Automated Detective</strong></h2>
<p>Once an error pattern is identified, the pipeline moves to the identification phase. We deployed an AI agent to run a complete investigation of the issue. Navigating a codebase of Kibana's complexity is a significant time sink; the agent accelerates this by correlating information across the stack using <strong>ES|QL (Elasticsearch Query Language)</strong>.</p>
<h3><strong>Protocol-Driven Investigation</strong></h3>
<p>It is important to distinguish this agent from a traditional automation script. The agent does not follow a hardcoded state machine; instead, it is provided with a protocol that outlines investigation goals and available tools.</p>
<p>The protocol prescribes a phased approach: understand the error, analyze its distribution, correlate with other data sources, find the source, and report. Each phase is described in terms of goals, not commands. The following excerpt shows how the protocol defines the first investigation step:</p>
<pre><code class="language-markdown">### Phase 1: Understand the Error
- Review the pre-extracted error details from the backlog issue
- Check for similar/overlapping error backlog issues (include closed!)
  - the categorization is often imperfect; closed issues may have
    valuable context about fixes
- Query for error overview statistics
- Get sample error messages to understand the actual content
</code></pre>
<p>The agent is also provided with an ES|QL reference guide and a library of query templates. Here is one of the templates for analyzing version distribution (a common first step to determine whether an error is a regression):</p>
<pre><code class="language-esql">FROM logging-*:cluster-kibana-*
| WHERE @timestamp &gt;= NOW() - 4 hours
    AND log.level == &quot;ERROR&quot;
    AND message : &quot;TypeError Cannot read properties&quot;
| STATS
    error_count = COUNT(*),
    deployments = COUNT_DISTINCT(ece.deployment)
  BY `docker.container.labels.org.label-schema.version`
| SORT error_count DESC
</code></pre>
<p>Because the agent has the autonomy to choose which tools to call—and in what order—based on the results of previous queries, it can adapt its strategy to the specific error. It might decide to skip proxy analysis if the telemetry suggests a background task failure, or it might dive deep into git history if ES|QL reveals the bug only exists on a specific version. This flexibility allows it to navigate the nuance of a massive codebase without requiring a pre-defined path for every possible failure mode.</p>
<h3><strong>Lessons Learned: Query Discipline</strong></h3>
<p>Direct LLM access to production clusters requires tactical constraints to manage costs and performance. We codified several requirements into the investigation workflow to ensure efficiency:</p>
<ul>
<li>
<p><strong>Query Budgets</strong>: The agent is restricted to <code>~15-20</code> queries per investigation, forcing it to form a hypothesis before retrieving data.</p>
</li>
<li>
<p><strong>The 4-Hour Rule</strong>: The agent starts with a small time window (the most recent <code>1-4</code> hours) to leverage caches and reduce compute costs.</p>
</li>
<li>
<p><strong>Optimal Operators</strong>: The agent prefers equality filters and the MATCH (:) operator over LIKE or regex, which can make queries <code>50-1000×</code> faster.</p>
</li>
<li>
<p><strong>Fail-Fast Timeouts</strong>: Every query has a strict timeout, requiring the agent to refine its filters rather than retrying expensive operations.</p>
</li>
</ul>
<h2><strong>Source Code Contextualization</strong></h2>
<p>To complete the identification phase, the agent correlates telemetry with the git history and source files. It uses the stack trace and log patterns to narrow its search, parsing through potential code matches faster than a manual search. By identifying the specific line of code producing the error and checking recent PRs, the agent links a production symptom directly to its technical root cause.</p>
<h2><strong>Real-World Case Study: The Streams UI Crash</strong></h2>
<p>The value of this autonomous investigation is best illustrated by the rare edge cases it uncovers. In one instance, the clustering system surfaced a sporadic pattern:</p>
<p><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></p>
<p>A human might have dismissed this as generic telemetry noise, but the agent's investigation revealed a reproducible race condition in the Streams UI:</p>
<ol>
<li>
<p><strong>Quantification</strong>: Using ES|QL, the agent analyzed the error distribution and identified the specific application context (Streams) and the relevant loggers.</p>
</li>
<li>
<p><strong>Code Analysis</strong>: It identified a logic error in <code>processor_outcome_preview.tsx</code>. The code was indexing into an array (<code>originalSamples[currentDoc.index].document</code>) without verifying the element existed.</p>
</li>
<li>
<p><strong>Root Cause</strong>: The agent realized that when a user changed filters while a row was expanded, <code>currentDoc.index</code> became stale before the next render cleared it.</p>
</li>
<li>
<p><strong>Outcome</strong>: The agent provided a suggested fix (guarding the access) and recommended a regression test around filter changes during row expansion.</p>
</li>
</ol>
<p>This case highlights the <strong>economic scale</strong> of autonomous triage. Sifting through thousands of &quot;noisy&quot; logs to find the few that represent real, fixable UI crashes is a non-starter for senior engineers. Agents process this volume at a fraction of the cost, acting as a high-fidelity filter that ensures human time is only spent on verified, actionable issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/error-report.png" alt="Automated error investigation report showing the agent's analysis of a Streams UI crash, including root cause identification and suggested fix" /></p>
<h2><strong>The Future of Engineering Velocity</strong></h2>
<p>Automating triaging and identification is the first step. We are currently layering in the ability to pass these findings to a coding agent for draft Pull Requests. Beyond production errors, we are also investigating <strong>agentic exploratory testing</strong> to stress-test features during the pre-release phase and catch bugs before they ever reach a user.</p>
<p>This autonomous layer is <strong>complementary to, not a replacement for, classic quality gates</strong>. Unit tests, API-level checks, and UI integration tests remain the primary defense. Our approach provides a safety net for the failures that inevitably bypass these gates in a complex environment, ensuring they are addressed with the same rigor as pre-release bugs.</p>
<p>As we move toward a more agent-driven development process, the ability to rapidly validate that changes are safe and to control overall quality is the primary bottleneck for engineering velocity. While code generation itself is becoming a commodity, the &quot;reasoning&quot; required to verify that a change is both correct and safe remains the most critical hurdle. By focusing our automation on the discovery and root-cause analysis of failures, we ensure that our engineering teams can scale their impact without being buried by the operational weight of maintaining quality. The goal is to build a system that can understand, diagnose, and eventually fix itself.</p>
<p>For more information on Elastic and its observability capabilities, check out <a href="https://www.elastic.co/observability">Elastic Observability</a>. You can also <a href="https://cloud.elastic.co">sign up for a free trial</a> to try it out yourself.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/automated-error-triaging.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Automated log parsing in Streams with ML]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-log-parsing-ml-streams</link>
            <guid isPermaLink="false">automated-log-parsing-ml-streams</guid>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automated-log-parsing.png" alt="" /></p>
<p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models <a href="https://www.elastic.co/what-is/large-language-models">(LLMs)</a>, with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p>
<p>We are happy to announce that this feature is already available in <a href="https://www.elastic.co/elasticsearch/streams">Streams</a>!</p>
<h2>Dataset Description</h2>
<p>We chose the <a href="https://github.com/logpai/loghub">Loghub</a> log collection for our PoC. For our investigation, we selected representative samples from the following key areas:</p>
<ul>
<li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li>
<li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li>
<li>Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li>
<li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li>
<li>Supercomputers: To test performance on high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li>
</ul>
<p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with a microservice architecture.</p>
<p>Log examples:</p>
<pre><code class="language-text">[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed
</code></pre>
<p>In addition, we created a Kubernetes cluster with a typical web application + database setup to mine extra logs in the most common domain.</p>
<p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p>
<h2>Few-Shot Log Parsing with an LLM</h2>
<p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p>
<p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p>
<h3>High Confidence &amp; Context Awareness</h3>
<p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Beyond simple pattern matching, the model showed a capacity for log understanding: it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p>
<h3>The &quot;Goldilocks&quot; Dilemma of Input Samples</h3>
<p>Our experiments quickly surfaced a significant lack of robustness caused by extreme <strong>sensitivity to the input sample</strong>. The model's performance fluctuates wildly based on the specific log examples included in the prompt. We observed a log similarity problem: the input sample needs to contain logs that are just diverse enough:</p>
<ul>
<li>Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to overspecify. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny fraction of the logs and extract unusable fields.</li>
<li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li>
</ul>
<h3>The Context Window Constraint</h3>
<p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model's output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p>
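<p>A minimal sketch of the chunking idea, restricted to a character budget (the delimiter-aware splitting we actually used is more involved, and the function name and budget here are illustrative):</p>
<pre><code class="language-python">def chunk_logs(lines, max_chars=4000):
    # Greedily pack whole log lines into chunks that fit a prompt budget,
    # so each LLM call sees a bounded, line-aligned slice of the input.
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # plus one for the newline
    if current:
        chunks.append(current)
    return chunks
</code></pre>
<p>Entity-based delimiters (e.g., splitting at timestamps) can be layered on top by pre-splitting the input before it reaches the character budget.</p>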
<h3>The consistency &amp; standardization gap</h3>
<p>Even when the model successfully generated rules, we noted slight inconsistencies:</p>
<ul>
<li>Service naming variations: The model proposes different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li>
<li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>). We normalized names using a standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a>.</li>
<li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li>
</ul>
<h2>Log Format Fingerprint</h2>
<p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p>
<p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p>
<p>The mapping logic is simple to ensure speed and consistency:</p>
<ol>
<li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li>
<li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li>
<li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li>
<li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li>
</ol>
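<p>A compact Python rendering of these four rules (a simplified illustration; the production implementation in Streams may differ in rule ordering and edge-case handling):</p>
<pre><code class="language-python">import re

def log_format_fingerprint(line: str) -> str:
    # Whitespace normalization: collapse runs of spaces, tabs, newlines.
    masked = re.sub(r'\s+', ' ', line)
    # Text abstraction: any run of letters becomes a single 'a'.
    masked = re.sub(r'[A-Za-z]+', 'a', masked)
    # Digit abstraction: any run of digits becomes a single '0'.
    masked = re.sub(r'[0-9]+', '0', masked)
    # Merge space-separated text runs ('a a a') into one 'a'; punctuation
    # and other symbols pass through untouched as structural markers.
    return re.sub(r'a( a)+', 'a', masked)
</code></pre>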
<p>Let's look at an example of how this mapping allows us to transform the logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/transform-logs.png" alt="" /></p>
<p>As a result, we obtain the following log masks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/log-masks.png" alt="" /></p>
<p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p>
<p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group before we ever invoke an LLM.</p>
<h2>Bonus Part: Instant Implementation with ES|QL</h2>
<p>It’s as easy as passing this query in Discover.</p>
<pre><code class="language-esql">FROM loghub
| EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, &quot;[ \t\n]+&quot;, &quot; &quot;), &quot;[A-Za-z]+&quot;, &quot;a&quot;), &quot;[0-9]+&quot;, &quot;0&quot;), &quot;a( a)+&quot;, &quot;a&quot;)
| STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources = VALUES(filename), example = TOP(raw_message, 3, &quot;desc&quot;) BY SUBSTRING(pattern, 0, 15)
| SORT total_count DESC
| LIMIT 100
</code></pre>
<p><strong>Query breakdown:</strong></p>
<p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p>
<p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to '0', text to 'a', etc.) and save the result in a “pattern” field.</p>
<p><strong>STATS</strong> [column1 =] expression1, … <strong>BY</strong> SUBSTRING(pattern, 0, 15):</p>
<p>This is the clustering step. We group logs that share the first 15 characters of their pattern and compute aggregated fields per group: the total log count, the ratio of the sample, the list of log data sources, and three example logs.</p>
<p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100: Surfaces the top 100 most frequent log patterns.</p>
<p>The query results on LogHub are displayed below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/query-results.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/results.png" alt="" /></p>
<p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering (&gt;60%) in 13 out of 16 sources, all without requiring additional cleaning, preprocessing, or fine-tuning.</p>
<p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. It provides immediate insights into log relationships and effectively manages large log clusters.</p>
<ul>
<li>
<p><strong>Versatility as a primitive</strong>: Thanks to the <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics and visualisations, and as a building block in log analysis pipelines for high-volume use cases.</p>
</li>
<li>
<p><strong>Flexibility</strong>: LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p>
</li>
<li>
<p><strong>Deterministic stability</strong>: Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p>
</li>
<li>
<p><strong>Performance and memory</strong>: LFF requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p>
</li>
</ul>
<h2>Combining Log Format Fingerprint with an LLM</h2>
<p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p>
<p>The objective was to demonstrate that LFF acts as an effective compression layer. We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p>
<h2>Execution Pipeline</h2>
<p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-ai-excecusion-pipeline.png" alt="" /></p>
<ol>
<li>Two-stage hierarchical clustering</li>
</ol>
<ul>
<li>Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.</li>
<li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li>
<li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses by matching the first N characters of the format fingerprint. This grouping strategy effectively collects lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li>
</ul>
<ol start="2">
<li>Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</li>
</ol>
<ul>
<li>We select representative logs from each valid subclass within the broader metaclass.</li>
<li>To manage an edge case of too numerous subclasses, we apply random down-sampling to fit the target window size.</li>
</ul>
<ol start="3">
<li>Rule generation. Finally, for each metaclass, we prompt the LLM to generate a regex parsing rule that fits all logs in the provided sample. For our PoC, we used the GPT-4o mini model.</li>
</ol>
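<p>The three pipeline stages above can be sketched as follows. This is an illustrative reconstruction, not the Streams implementation; the function name, the input shape (fingerprints precomputed and mapped to their logs), and the per-subclass cap of three representatives are our own assumptions:</p>
<pre><code class="language-python">import random
from collections import defaultdict

def build_llm_samples(logs_by_fingerprint, min_share=0.05, prefix_len=5, window=30):
    total = sum(len(v) for v in logs_by_fingerprint.values())
    # 1a. Subclasses: logs sharing an identical fingerprint (the dict keys).
    # 1b. Outlier cleaning: drop subclasses below 5% of the total volume.
    kept = {fp: logs for fp, logs in logs_by_fingerprint.items()
            if len(logs) / total >= min_share}
    # 1c. Metaclasses: group surviving subclasses by fingerprint prefix.
    metaclasses = defaultdict(list)
    for fp, logs in kept.items():
        metaclasses[fp[:prefix_len]].append(logs)
    # 2. Stratified sampling: take representatives from every subclass,
    #    then randomly down-sample so each metaclass fits the window.
    samples = {}
    for prefix, subclasses in metaclasses.items():
        sample = [log for logs in subclasses for log in logs[:3]]
        if len(sample) > window:
            sample = random.sample(sample, window)
        samples[prefix] = sample
    return samples
</code></pre>
<p>Each resulting per-metaclass sample is then handed to the LLM for rule generation (stage 3).</p>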
<h2>Experimental Results &amp; Observations</h2>
<p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-results.png" alt="" /></p>
<p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model's high fidelity in source attribution, with minimal scattering.</p>
<h2>Insights from Our Performance Benchmarks</h2>
<ul>
<li>Optimal baseline: a context window of 30–40 log samples per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both regex and Grok patterns.</li>
<li>Input minimisation: we pushed the input size down to 10 logs per category for regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[One-Step Ingest for CloudWatch Logs and Metrics into Elastic Observability with Amazon Data Firehose]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-data-firehose-onboarding</link>
            <guid isPermaLink="false">aws-data-firehose-onboarding</guid>
            <pubDate>Tue, 26 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[AWS users can now leverage the new guided onboarding workflow to ingest CloudWatch logs and metrics in Elastic Cloud and explore the usage and performance of over twenty AWS services within minutes, using the provided CloudFormation template.]]></description>
            <content:encoded><![CDATA[<h2>Overview of the new Quickstart guided workflow</h2>
<p>Elastic Observability has been supporting AWS logs ingest with Amazon Data Firehose over the last few releases. To makes configuration easier, we introduced, in 8.16, a one step guided workflow to onboard all CloudWatch logs and metrics from a single region. The configuration uses a pre-populated CloudFormation template, to automatically create a Amazon Data Firehose and connect to Elastic Observability. Additionally, all the relevant Elastic AWS Integrations are auto-installed. The configuration ensures ingestion for metrics from all namespaces and a policy to ingest logs from all existing log groups. Any new metric namespaces and log groups post setup will also be ingested automatically. Additionally, the CloudFormation template can also be customized and deployed in a production environment using infra-as-code.</p>
<p>This allows SREs to to start monitoring the usage and health of their popular AWS services using pre-built dashboards within minutes. This blog reviews how to setup this quickstart workflow, and the out-of-the box dashboards that will be populated from it.</p>
<h2>Onboarding data using Amazon Data Firehose</h2>
<p>In order to utilize this guided workflow, a user needs the built-in superuser role in Kibana. A deployment of the hosted Elasticsearch service, version 8.16, on <a href="https://cloud.elastic.co/login?redirectTo=%2Fhome">Elastic Cloud</a> is required. Further, an active AWS account and the necessary permissions to create delivery streams, run CloudFormation, and create CloudWatch log groups and metric streams are needed.</p>
<p>Let’s walk through the steps required to onboard data using this workflow. There should be some CloudWatch logs and metrics already available in the customer account. The screenshot below shows an example where a number of CloudWatch metrics namespaces already exist.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudWatch-Metrics.png" alt="CloudWatch metrics already present" /></p>
<p>Similarly, a number of CloudWatch log groups are already present in this customer account as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudWatch-Log-Groups.png" alt="CloudWatch logs already present" /></p>
<p>This guided workflow is accessible from the ‘Add data’ left navigation option in the Elastic Observability app. The user needs to select the ‘Cloud’ option and click on the ‘AWS’ tile. The Amazon Firehose quickstart onboarding workflow is available at the top left and is labeled as a Quickstart option, as shown below.  </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-Onboarding-Firehose-Card.png" alt="Firehose onboarding tile" /></p>
<p>The Data Firehose delivery stream can be created either using the AWS CLI or the AWS console, as shown in step 2 of the guided workflow below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-Firehose-Flow-Start.png" alt="Firehose onboarding step 1" /></p>
<p>By clicking on the ‘Create Firehose Stream in AWS’ button under the ‘Via AWS Console’ tab, the user will be taken to the AWS console and the menu for creating the CloudFormation stack, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudFormation-Template-Form-1.png" alt="Firehose onboarding aws console" /></p>
<p>The CloudFormation (CF) template provided by Elastic has prepopulated default settings including the Elasticsearch endpoint and the API key, as shown in the screenshot above. The user can review these defaults in the AWS console and proceed by clicking on the ‘Create stack’ button, as shown below. Note that this stack creates IAM resources and so the checkbox acknowledging that must be checked to move forward. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-CloudFormation-Template-Form-2.png" alt="CF template 2" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/CloudFormation-Template-Complete.png" alt="CF template complete" /></p>
<p>Once the CloudFormation stack has been created in AWS, the user can switch back to Kibana. By default, the CF stack will consist of separate delivery streams for CloudWatch logs and metrics, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Firehose-Streams.png" alt="Firehose streams" /></p>
<p>In Kibana, under step 3 ‘Visualize your data’ of the workflow, the incoming data starts to appear, categorized by AWS service type as shown below. The page refreshes automatically every 5 seconds, and new services appear at the bottom of the list.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-AWS-Services-Detected-1.png" alt="Services detected 01" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Kibana-AWS-Services-Detected-2.png" alt="Services detected 02" /></p>
<p>For each detected AWS service, the workflow recommends one or two pre-built dashboards for exploring the health and usage of that service. For example, the pre-built dashboard shown below provides a quick overview of NAT Gateway usage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/NAT-Gateway-Dashboard.png" alt="Nat Gateway dashboard" /></p>
<p>In addition to pre-built dashboards, Discover can also be used to explore the ingested CloudWatch logs, as shown below. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/ECS-Logs.png" alt="Discover for logs" /></p>
<p>AWS Usage overview can be explored using the pre-built dashboard shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-Usage-Dashboard.png" alt="AWS usage" /></p>
<h2>Customisation options</h2>
<p>The region needs to be selected/modified in the AWS console as shown below, before starting with the CF stack creation. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/AWS-Console-Region-Selector.png" alt="AWS region selector" /></p>
<p>The <code>EnableCloudWatchLogs</code> and <code>EnableCloudWatchMetrics</code> parameters can be changed, in the AWS console or in the CF template, to disable the collection of logs or metrics respectively.</p>
<p>The <code>MetricNameFilters</code> parameter in the CF template or console can be used to exclude specific namespace-metric names pairs from collection.</p>
<p>To facilitate as-code deployment in production environments, the CF template provided by Elastic can be used together with the Terraform resource <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudformation_stack">aws_cloudformation_stack</a>, as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/Terraform-Template.png" alt="Terraform template" /></p>
<h2>Start your own exploration </h2>
<p>The new guided onboarding workflow for AWS utilizes the Amazon Firehose delivery stream to collect all available CloudWatch logs &amp; metrics, from a single customer account and a single region. The workflow also installs AWS Integration packages in the Elastic stack, enabling users to start monitoring the usage and performance of their common AWS services using pre-built dashboards, within minutes. Some of the AWS services that can be monitored using this workflow are listed below. A complete list of over twenty services that are supported by this workflow along with additional details are available <a href="https://www.elastic.co/guide/en/observability/current/collect-data-with-aws-firehose.html">here</a>.</p>
<table>
<thead>
<tr>
<th>AWS service</th>
<th>Data types</th>
</tr>
</thead>
<tbody>
<tr>
<td>VPC Flow Logs</td>
<td>Logs</td>
</tr>
<tr>
<td>API Gateway</td>
<td>Logs, Metrics</td>
</tr>
<tr>
<td>CloudTrail</td>
<td>Logs</td>
</tr>
<tr>
<td>Network Firewall</td>
<td>Logs, Metrics</td>
</tr>
<tr>
<td>WAF</td>
<td>Logs</td>
</tr>
<tr>
<td>EC2</td>
<td>Metrics</td>
</tr>
<tr>
<td>RDS</td>
<td>Metrics</td>
</tr>
</tbody>
</table>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-data-firehose-onboarding/154567_Image 21.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Unleash the power of Elastic and Amazon Kinesis Data Firehose to enhance observability and data analytics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics</link>
            <guid isPermaLink="false">aws-kinesis-data-firehose-observability-analytics</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to directly ingest logs into Elastic Cloud in real time for centralized alerting, troubleshooting, and analytics across your cloud and on-premises infrastructure.]]></description>
            <content:encoded><![CDATA[<p>As more organizations leverage the Amazon Web Services (AWS) cloud platform and services to drive operational efficiency and bring products to market, managing logs becomes a critical component of maintaining visibility and safeguarding multi-account AWS environments. Traditionally, logs are stored in Amazon Simple Storage Service (Amazon S3) and then shipped to an external monitoring and analysis solution for further processing.</p>
<p>To simplify this process and reduce management overhead, AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to ingest logs into Elastic Cloud in AWS in real time and view them in the Elastic Stack alongside other logs for centralized analytics. This eliminates the necessity for time-consuming and expensive procedures such as VM provisioning or data shipper operations.</p>
<p>Elastic Observability unifies logs, metrics, and application performance monitoring (APM) traces for a full contextual view across your hybrid <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">AWS environments alongside their on-premises data sets</a>. Elastic Observability enables you to track and monitor performance <a href="https://www.elastic.co/observability/aws-monitoring">across a broad range of AWS services</a>, including AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Amazon Simple Storage Service (S3), AWS CloudTrail, AWS Network Firewall, and more.</p>
<p>In this blog, we will walk you through how to use the Amazon Kinesis Data Firehose integration — <a href="https://aws.amazon.com/blogs/big-data/accelerate-data-insights-with-elastic-and-amazon-kinesis-data-firehose/">Elastic is listed in the Amazon Kinesis Firehose</a> drop-down list — to simplify your architecture and send logs to Elastic, so you can monitor and safeguard your multi-account AWS environments.</p>
<h2>Announcing the Kinesis Firehose method</h2>
<p>Elastic currently provides both agent-based and serverless mechanisms, and we are pleased to announce the addition of the Kinesis Firehose method. This new method enables customers to directly ingest logs from AWS into Elastic, supplementing our existing options.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=pnGXjljuEnY"><strong>Elastic Agent</strong></a> pulls metrics and logs from CloudWatch and S3 where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53) and ingests them into Elastic Cloud.</li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3"><strong>Elastic’s Serverless Forwarder</strong></a> (runs on Lambda and is available in the AWS Serverless Application Repository) sends logs from Kinesis Data Streams, Amazon S3, and AWS CloudWatch log groups into Elastic. To learn more about this topic, please see this <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">blog post</a>.</li>
<li><a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html"><strong>Amazon Kinesis Firehose</strong></a> directly ingests logs from AWS into Elastic (specifically, if you are running Elastic Cloud on AWS).</li>
</ul>
<p>In this blog, we will cover the last option since we have recently released the Amazon Kinesis Data Firehose integration. Specifically, we'll review:</p>
<ul>
<li>A general overview of the Amazon Kinesis Data Firehose integration and how it works with AWS</li>
<li>Step-by-step instructions to set up the Amazon Kinesis Data Firehose integration on AWS and on <a href="http://cloud.elastic.co">Elastic Cloud</a></li>
</ul>
<p>By the end of this blog, you'll be equipped with the knowledge and tools to simplify your AWS log management with Elastic Observability and Amazon Kinesis Data Firehose.</p>
<h2>Prerequisites and configurations</h2>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>You will need an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack on AWS. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS Firehose Log ingestion.</li>
<li>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</li>
<li>Finally, be sure to turn on VPC Flow Logs for the VPC where your application is deployed and send them to AWS Firehose.</li>
</ol>
<h2>Elastic’s Amazon Kinesis Data Firehose integration</h2>
<p>Elastic has collaborated with AWS to offer a seamless integration of Amazon Kinesis Data Firehose with Elastic, enabling direct ingestion of data from Amazon Kinesis Data Firehose into Elastic without the need for Agents or Beats. All you need to do is configure the Amazon Kinesis Data Firehose delivery stream to send its data to Elastic's endpoint. In this configuration, we will demonstrate how to ingest VPC Flow logs and Firewall logs into Elastic. You can follow a similar process to ingest other logs from your AWS environment into Elastic.</p>
<p>There are three distinct configurations for ingesting VPC Flow and Network Firewall logs into Elastic: one sends logs through CloudWatch, another routes them through S3, and each has its own setup. CloudWatch and S3 let you store logs and forward them later, whereas Kinesis Data Firehose ingests them immediately. In this blog post, we will focus on the third configuration, which sends VPC Flow logs and Network Firewall logs directly to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" alt="AWS elastic configuration" /></p>
<p>We will guide you through the configuration of the easiest setup, which involves directly sending VPC Flow logs and Firewalls logs to Amazon Kinesis Data Firehose and then into Elastic Cloud.</p>
<p><strong>Note:</strong> This setup is only compatible with Elastic Cloud on AWS; it cannot be used with self-managed deployments, on-premises deployments, or Elastic deployments on other cloud providers.</p>
<h2>Setting it all up</h2>
<p>To begin setting up the integration between Amazon Kinesis Data Firehose and Elastic, let's go through the necessary steps.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Create an account on Elastic Cloud by following the instructions provided to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/Screenshot_2023-05-18_at_6.00.28_PM.png" alt="elastic free trial" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>You can deploy Elastic on AWS via two different approaches: through the UI or through Terraform. We’ll start first with the UI option.</p>
<p>After logging into Elastic Cloud, create a deployment on Elastic. It's crucial to make sure that the deployment is on Elastic Cloud on AWS since the Amazon Kinesis Data Firehose connects to a specific endpoint that must be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-a-deployment.png" alt="create a deployment" /></p>
<p>After your deployment is created, it's essential to copy the Elasticsearch endpoint to ensure a seamless configuration process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-O11y-log.png" alt="O11y log" /></p>
<p>Copy the Elasticsearch HTTP endpoint; you will need it when configuring the Amazon Kinesis Data Firehose destination. Here's an example of what the endpoint should look like:</p>
<pre><code class="language-bash">https://elastic-O11y-log.es.us-east-1.aws.found.io
</code></pre>
<h3><em>Alternative approach using Terraform</em></h3>
<p>An alternative approach to deploying Elastic Cloud on AWS is by using Terraform. It's also an effective way to automate and streamline the deployment process.</p>
<p>To begin, simply create a Terraform configuration file that outlines the necessary infrastructure. This file should include resources for your Elastic Cloud deployment and any required IAM roles and policies. By using this approach, you can simplify the deployment process and ensure consistency across environments.</p>
<p>One easy way to create your Elastic Cloud deployment with Terraform is to use this Github <a href="https://github.com/aws-ia/terraform-elastic-cloud">repo</a>. This resource lets you specify the region, version, and deployment template for your Elastic Cloud deployment, as well as any additional settings you require.</p>
<h3>Step 2: To turn on Elastic's AWS integrations, navigate to the Elastic Integration section in your deployment</h3>
<p>To install AWS assets in your deployment's Elastic Integration section, follow these steps:</p>
<ol>
<li>Log in to your Elastic Cloud deployment and open <strong>Kibana</strong>.</li>
<li>To get started, go to the <strong>management</strong> section of Kibana and click on &quot;<strong>Integrations</strong>.&quot;</li>
<li>Navigate to the <strong>AWS</strong> integration and click on the &quot;Install AWS Assets&quot; button in the <strong>settings</strong>. This step is important, as it installs the necessary assets, such as <strong>dashboards</strong> and <strong>ingest pipelines</strong>, to enable data ingestion from AWS services into Elastic.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-aws-settings.png" alt="aws settings" /></p>
<h3>Step 3: Set up the Amazon Kinesis Data Firehose delivery stream on the AWS Console</h3>
<p>You can set up the Kinesis Data Firehose delivery stream via two different approaches: through the AWS Management Console or through Terraform. We’ll start first with the console option.</p>
<p>To set up the Kinesis Data Firehose delivery stream on AWS, follow these <a href="https://docs.aws.amazon.com/firehose/latest/dev/create-destination.html#create-destination-elastic">steps</a>:</p>
<ol>
<li>
<p>Go to the AWS Management Console and select Amazon Kinesis Data Firehose.</p>
</li>
<li>
<p>Click on Create delivery stream.</p>
</li>
<li>
<p>Choose a delivery stream name and select Direct PUT or other sources as the source.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-delivery-stream.png" alt="create delivery stream" /></p>
<ol start="4">
<li>
<p>Choose Elastic as the destination.</p>
</li>
<li>
<p>In the Elastic destination section, enter the Elastic endpoint URL that you copied from your Elastic Cloud deployment.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-destination-settings.png" alt="destination settings" /></p>
<ol start="6">
<li>
<p>Choose the content encoding and retry duration as shown above.</p>
</li>
<li>
<p>Enter the appropriate parameter values for your AWS log type. For example, for VPC Flow logs, you would set the <strong>es_datastream_name</strong> parameter to <strong>logs-aws.vpcflow-default</strong>.</p>
</li>
<li>
<p>Configure the Amazon S3 bucket as the source backup for the Amazon Kinesis Data Firehose delivery stream failed data or all data, and configure any required tags for the delivery stream.</p>
</li>
<li>
<p>Review the settings and click on Create delivery stream.</p>
</li>
</ol>
<p>In the example above, we are using the <strong>es_datastream_name</strong> parameter to pull in VPC Flow logs through the <strong>logs-aws.vpcflow-default</strong> datastream. Depending on your use case, this parameter can be configured with one of the following types of logs:</p>
<ul>
<li>logs-aws.cloudfront_logs-default (AWS CloudFront logs)</li>
<li>logs-aws.ec2_logs-default (EC2 logs in AWS CloudWatch)</li>
<li>logs-aws.elb_logs-default (Amazon Elastic Load Balancing logs)</li>
<li>logs-aws.firewall_logs-default (AWS Network Firewall logs)</li>
<li>logs-aws.route53_public_logs-default (Amazon Route 53 public DNS queries logs)</li>
<li>logs-aws.route53_resolver_logs-default (Amazon Route 53 DNS queries &amp; responses logs)</li>
<li>logs-aws.s3access-default (Amazon S3 server access log)</li>
<li>logs-aws.vpcflow-default (AWS VPC flow logs)</li>
<li>logs-aws.waf-default (AWS WAF Logs)</li>
</ul>
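<p>If you prefer scripting over the console, the same delivery stream can be created with the AWS CLI. The sketch below is illustrative only: the stream name, endpoint URL, API key, role, and bucket ARNs are placeholders to replace with your own values, and the <strong>es_datastream_name</strong> common attribute selects one of the data streams listed above.</p>
<pre><code class="language-bash"># Placeholder values: replace the endpoint, API key, and ARNs with your own.
echo '{
  "EndpointConfiguration": {
    "Url": "https://elastic-O11y-log.es.us-east-1.aws.found.io",
    "Name": "ElasticCloudEndpoint",
    "AccessKey": "REPLACE_WITH_ELASTIC_API_KEY"
  },
  "RequestConfiguration": {
    "ContentEncoding": "GZIP",
    "CommonAttributes": [
      { "AttributeName": "es_datastream_name",
        "AttributeValue": "logs-aws.vpcflow-default" }
    ]
  },
  "S3BackupMode": "FailedDataOnly",
  "S3Configuration": {
    "RoleARN": "REPLACE_WITH_FIREHOSE_ROLE_ARN",
    "BucketARN": "REPLACE_WITH_BACKUP_BUCKET_ARN"
  },
  "RoleARN": "REPLACE_WITH_FIREHOSE_ROLE_ARN"
}' > firehose-elastic.json

# Validate the JSON locally before using it.
python3 -m json.tool firehose-elastic.json > /dev/null
echo "config OK"

# Create the stream (requires AWS credentials):
# aws firehose create-delivery-stream \
#   --delivery-stream-name vpcflow-to-elastic \
#   --delivery-stream-type DirectPut \
#   --http-endpoint-destination-configuration file://firehose-elastic.json
</code></pre>
<p>The <strong>CommonAttributes</strong> entry plays the same role as the parameter field in the console: Firehose forwards it with each request so the Elastic endpoint can route documents to the right data stream.</p>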
<h3><em>Alternative approach using Terraform</em></h3>
<p>Using the <strong>aws_kinesis_firehose_delivery_stream</strong> resource in <strong>Terraform</strong> is another way to create a Kinesis Firehose delivery stream, letting you specify the delivery stream name, data source, and destination (in this case, an Elasticsearch HTTP endpoint). To authenticate, you'll need to provide the endpoint URL and an API key. Leveraging this Terraform resource automates and streamlines your deployment process, resulting in greater consistency and efficiency.</p>
<p>Here's an example code that shows you how to create a Kinesis Firehose delivery stream with Terraform that sends data to an Elasticsearch HTTP endpoint:</p>
<pre><code class="language-hcl">resource &quot;aws_kinesis_firehose_delivery_stream&quot; &quot;Elasticcloud_stream&quot; {
  name        = &quot;terraform-kinesis-firehose-ElasticCloud-stream&quot;
  destination = &quot;http_endpoint&quot;

  s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.bucket.arn
    buffer_size        = 5
    buffer_interval    = 300
    compression_format = &quot;GZIP&quot;
  }

  http_endpoint_configuration {
    # Use your deployment's Elasticsearch HTTP endpoint, not cloud.elastic.co
    url                = &quot;https://elastic-O11y-log.es.us-east-1.aws.found.io&quot;
    name               = &quot;ElasticCloudEndpoint&quot;
    access_key         = &quot;ElasticApi-key&quot;
    buffering_size     = 5
    buffering_interval = 300
    role_arn           = aws_iam_role.firehose.arn
    s3_backup_mode     = &quot;FailedDataOnly&quot;
  }
}
</code></pre>
<h3>Step 4: Configure VPC Flow Logs to send to Amazon Kinesis Data Firehose</h3>
<p>To complete the setup, you'll need to configure VPC Flow logs in the VPC where your application is deployed and send them to the Amazon Kinesis Data Firehose delivery stream you set up in Step 3.</p>
<p>Enabling VPC flow logs in AWS is a straightforward process. Here are the steps to enable VPC flow logs in your AWS account:</p>
<ol>
<li>
<p>Select the VPC for which you want to enable flow logs.</p>
</li>
<li>
<p>In the VPC dashboard, click on &quot;Flow Logs&quot; under the &quot;Logs&quot; section.</p>
</li>
<li>
<p>Click on the &quot;Create Flow Log&quot; button to create a new flow log.</p>
</li>
<li>
<p>In the &quot;Create Flow Log&quot; wizard, provide the following information:</p>
</li>
</ol>
<ul>
<li>Provide a name for your flow log.</li>
<li>Choose the VPC and the network interface(s) for which you want to enable flow logs.</li>
<li>Choose the target for your flow logs: in this case, Amazon Kinesis Data Firehose in the same AWS account.</li>
<li>Choose the flow log format: either the AWS default or a custom format.</li>
</ul>
<ol start="5">
<li>
<p>Configure the IAM role for the flow logs. If you have an existing IAM role, select it. Otherwise, create a new IAM role that grants the necessary permissions for the flow logs.</p>
</li>
<li>
<p>Review the flow log configuration and click &quot;Create.&quot;</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-flow-log-settings.png" alt="flow log settings" /></p>
<p>Create the VPC Flow log.</p>
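<p>Equivalently, the flow log can be created from the AWS CLI. The sketch below composes the command and prints it for review rather than executing it; the VPC ID and delivery stream ARN are placeholders.</p>
<pre><code class="language-bash"># Placeholders: replace the VPC ID and delivery stream ARN with your own.
VPC_ID="vpc-0123456789abcdef0"
FIREHOSE_ARN="arn:aws:firehose:us-east-1:123456789012:deliverystream/my-delivery-stream"

# Compose the command; review it, then run it with valid AWS credentials.
CMD="aws ec2 create-flow-logs --resource-type VPC --resource-ids $VPC_ID --traffic-type ALL --log-destination-type kinesis-data-firehose --log-destination $FIREHOSE_ARN"
echo "$CMD"
# eval "$CMD"
</code></pre>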
<h3>Step 5: After a few minutes, check if flows are coming into Elastic</h3>
<p>To confirm that the VPC Flow logs are ingesting into Elastic, you can check the logs in Kibana. You can do this by searching for the index in the Kibana Discover tab and filtering the results by the appropriate index and time range. If VPC Flow logs are flowing in, you should see a list of documents representing the VPC Flow logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<h3>Step 6: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] VPC Flow Log Overview dashboard</h3>
<p>Finally, there is an Elastic out-of-the-box (OOTB) VPC Flow logs dashboard that displays the top IP addresses that are hitting your VPC, their geographic location, time series of the flows, and a summary of VPC flow log rejects within the selected time frame. This dashboard can provide valuable insights into your network traffic and potential security threats.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-VPC-flow-log-map.png" alt="vpc flow log map" /></p>
<p><em>Note: For additional VPC flow log analysis capabilities, please refer to</em> <a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability"><em>this blog</em></a><em>.</em></p>
<h3>Step 7: Configure AWS Network Firewall Logs to send to Kinesis Firehose</h3>
<p>To create a Kinesis Data Firehose delivery stream for AWS Network Firewall logs, first log in to the AWS Management Console, navigate to the Kinesis service, select &quot;Data Firehose&quot;, and follow the step-by-step instructions shown in Step 3. Specify the Elasticsearch endpoint and API key, add the parameter <strong>es_datastream_name=logs-aws.firewall_logs-default</strong>, and create the delivery stream.</p>
<p>Second, to set up a Network Firewall rule group to send logs to the Kinesis Firehose, go to the Network Firewall section of the console, create a rule group, add a rule to allow traffic to the Kinesis endpoint, and attach the rule group to your Network Firewall configuration. Finally, test the configuration by sending traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to Elastic, with any failed data backed up to your S3 bucket.</p>
<p>Follow the instructions below to set up the firewall rule and logging.</p>
<ol>
<li>Set up a Network Firewall rule group to send logs to Amazon Kinesis Data Firehose:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Rule groups&quot; in the left menu and then click &quot;Create rule group.&quot;</li>
<li>Choose &quot;Stateless&quot; or &quot;Stateful&quot; depending on your requirements, and give your rule group a name. Click &quot;Create rule group.&quot;</li>
<li>Add a rule to the rule group to allow traffic to the Kinesis Firehose endpoint. For example, if you are using the us-east-1 region, you would add a rule like this:</li>
</ul>
<pre><code class="language-json">{
  &quot;RuleDefinition&quot;: {
    &quot;Actions&quot;: [
      {
        &quot;Type&quot;: &quot;AWS::KinesisFirehose::DeliveryStream&quot;,
        &quot;Options&quot;: {
          &quot;DeliveryStreamArn&quot;: &quot;arn:aws:firehose:us-east-1:12387389012:deliverystream/my-delivery-stream&quot;
        }
      }
    ],
    &quot;MatchAttributes&quot;: {
      &quot;Destination&quot;: {
        &quot;Addresses&quot;: [&quot;api.firehose.us-east-1.amazonaws.com&quot;]
      },
      &quot;Protocol&quot;: {
        &quot;Numeric&quot;: 6,
        &quot;Type&quot;: &quot;TCP&quot;
      },
      &quot;PortRanges&quot;: [
        {
          &quot;From&quot;: 443,
          &quot;To&quot;: 443
        }
      ]
    }
  },
  &quot;RuleOptions&quot;: {
    &quot;CustomTCPStarter&quot;: {
      &quot;Enabled&quot;: true,
      &quot;PortNumber&quot;: 443
    }
  }
}
</code></pre>
<ul>
<li>Save the rule group.</li>
</ul>
<ol start="2">
<li>Attach the rule group to your Network Firewall configuration:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Firewall configurations&quot; in the left menu and select the configuration you want to attach the rule group to.</li>
<li>Scroll down to &quot;Associations&quot; and click &quot;Edit.&quot;</li>
<li>Select the rule group you created in step 1 and click &quot;Save.&quot;</li>
</ul>
<ol start="3">
<li>Test the configuration:</li>
</ol>
<ul>
<li>Send traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.</li>
</ul>
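<p>As an alternative to the rule-group approach above, AWS Network Firewall can also be pointed at the delivery stream through its logging configuration. The sketch below assumes placeholder firewall and stream names; it validates the JSON locally and leaves the actual AWS call commented out.</p>
<pre><code class="language-bash"># Placeholders: the firewall name and delivery stream name are examples.
echo '{
  "LogDestinationConfigs": [
    {
      "LogType": "FLOW",
      "LogDestinationType": "KinesisDataFirehose",
      "LogDestination": { "deliveryStream": "my-delivery-stream" }
    }
  ]
}' > nfw-logging.json

# Validate the JSON locally before applying it.
python3 -m json.tool nfw-logging.json > /dev/null
echo "config OK"

# Apply it (requires AWS credentials):
# aws network-firewall update-logging-configuration \
#   --firewall-name my-firewall \
#   --logging-configuration file://nfw-logging.json
</code></pre>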
<h3>Step 8: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] Firewall Log dashboard</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-firewall-log-dashboard.png" alt="firewall log dashboard" /></p>
<h2>Wrapping up</h2>
<p>We’re excited to bring you this latest integration for AWS Cloud and Kinesis Data Firehose into production. The ability to consolidate logs and metrics to gain visibility across your cloud and on-premises environment is crucial for today’s distributed environments and applications.</p>
<p>From EC2, Cloudwatch, Lambda, ECS and SAR, <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=aws">Elastic Integrations</a> allow you to quickly and easily get started with ingesting your telemetry data for monitoring, analytics, and observability. Elastic is constantly delivering frictionless customer experiences, allowing anytime, anywhere access to all of your telemetry data — this streamlined, native integration with AWS is the latest example of our commitment.</p>
<h2>Start a free trial today</h2>
<p>You can begin with a <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">7-day free trial</a> of Elastic Cloud within the AWS Marketplace to start monitoring and improving your users' experience today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Wait… Elastic Observability monitors metrics for AWS services in just minutes?]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy</link>
            <guid isPermaLink="false">aws-service-metrics-monitor-observability-easy</guid>
            <pubDate>Mon, 21 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Get metrics and logs from your AWS deployment and Elastic Observability in just minutes! We’ll show you how to use Elastic integrations to quickly monitor and manage the performance of your applications and AWS services to streamline troubleshooting.]]></description>
            <content:encoded><![CDATA[<p>The transition to distributed applications is in full swing, driven mainly by our need to be “always-on” as consumers and fast-paced businesses. That need is driving deployments to have more complex requirements along with the ability to be globally diverse and rapidly innovate.</p>
<p>Cloud is becoming the de facto deployment option for today’s applications. Many cloud deployments choose to host their applications on AWS for the globally diverse set of regions it covers and the myriad of services (for faster development and innovation) available, as well as to drive operational and capital costs down. On AWS, development teams are finding additional value in migrating to Kubernetes on Amazon EKS, testing out the latest serverless options, and improving traditional, tiered applications with better services.</p>
<p>Elastic Observability offers 30 out-of-the-box integrations for AWS services with more to come.</p>
<p>A quick review highlighting some of the integrations and capabilities can be found in a previous post:</p>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-seamlessly-ingest-logs-and-metrics-into-a-unified-platform-with-ready-to-use-integrations">Elastic and AWS: Seamlessly ingest logs and metrics into a unified platform with ready-to-use integrations</a>.</li>
</ul>
<p>Some additional posts on key AWS service integrations on Elastic are:</p>
<ul>
<li><a href="https://www.elastic.co/blog/observability-apm-aws-lambda-serverless-functions">APM (metrics, traces and logs) for serverless functions on AWS Lambda with Elastic</a></li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Log ingestion from AWS Services into Elastic via serverless forwarder on Lambda</a></li>
<li><a href="https://www.elastic.co/blog/new-elastic-and-amazon-s3-storage-lens-integration-simplify-management-control-costs-and-reduce-risk">Elastic’s Amazon S3 Storage Lens Integration: Simplify management, control costs, and reduce risk</a></li>
<li><a href="https://www.elastic.co/blog/elastic-cloud-with-aws-firelens-accelerate-time-to-insight-with-agentless-data-ingestion">Ingest your container logs into Elastic Cloud with AWS FireLens</a></li>
</ul>
<p>A full list of AWS integrations can be found in Elastic’s online documentation:</p>
<ul>
<li><a href="https://docs.elastic.co/en/integrations/aws">Full list of AWS integrations</a></li>
</ul>
<p>In addition to our native AWS integrations, Elastic Observability aggregates not only logs but also metrics for AWS services and the applications running on AWS compute services (EC2, Lambda, EKS/ECS/Fargate). All this data can be analyzed visually and more intuitively using Elastic’s advanced machine learning capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML based metrics correlations:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a></li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-get-the-most-value-from-your-data-sets">Elastic and AWS: Get the most value from your data sets</a></li>
</ul>
<p>That’s right, Elastic offers metrics ingest, aggregation, and analysis for AWS services and applications on AWS compute services (EC2, Lambda, EKS/ECS/Fargate). Elastic is more than logs — it offers a unified observability solution for AWS environments.</p>
<p>In this blog, I’ll review how Elastic Observability can monitor metrics for a simple AWS application running on AWS services which include:</p>
<ul>
<li>AWS EC2</li>
<li>AWS ELB</li>
<li>AWS RDS (AuroraDB)</li>
<li>AWS NAT Gateways</li>
</ul>
<p>As you will see, once the integration is installed, metrics will arrive instantly and you can immediately start reviewing metrics.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>Ensure you have an AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
<li>We used <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three tier app</a> and installed it as instructed in git.</li>
<li>We’ll walk through installing the general <a href="https://docs.elastic.co/en/integrations/aws">Elastic AWS Integration</a>, which covers the four services we want to collect metrics for.<br />
(<a href="https://docs.elastic.co/en/integrations/aws#reference">Full list of services supported by the Elastic AWS Integration</a>)</li>
<li>We will <em>not</em> cover application monitoring, since other blogs cover <a href="https://www.elastic.co/observability/aws-monitoring">AWS application monitoring</a> (metrics, logs, and tracing). Instead, we will focus on how easily AWS services can be monitored.</li>
<li>In order to see metrics, you will need to load the application. We’ve also created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three tier application overview</h2>
<p>Before we dive into the Elastic configuration, let's review what we are monitoring. If you follow the instructions for <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">aws-three-tier-web-architecture-workshop</a>, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-three-tier.png" alt="" /></p>
<p>What’s deployed:</p>
<ul>
<li>1 VPC with 6 subnets</li>
<li>2 AZs</li>
<li>2 web servers per AZ</li>
<li>2 application servers per AZ</li>
<li>1 External facing application load balancer</li>
<li>1 Internal facing application load balancer</li>
<li>2 NAT gateways to manage traffic to the application layer</li>
<li>1 Internet gateway</li>
<li>1 RDS Aurora DB with a read replica</li>
</ul>
<p>At the end of the blog, we also provide a Playwright script to load this app, which will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get the application, AWS integration on Elastic, and what gets ingested.</p>
<h3>Step 0: Load up the AWS Three Tier application and get your credentials</h3>
<p>Follow the instructions in <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s Three Tier app</a> repository and the accompanying workshop, listed <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/85cd2bb2-7f79-4e96-bdee-8078e469752a/en-US">here</a>.</p>
<p>Once you’ve installed the app, get credentials from AWS. This will be needed for Elastic’s AWS integration.</p>
<p>There are several options for credentials:</p>
<ul>
<li>Use access keys directly</li>
<li>Use temporary security credentials</li>
<li>Use a shared credentials file</li>
<li>Use an IAM role Amazon Resource Name (ARN)</li>
</ul>
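<p>For example, the shared credentials file option is a small INI-style file under <strong>~/.aws/</strong>. The sketch below writes an example file with placeholder values (to <strong>credentials.example</strong>, so an existing real credentials file is not overwritten):</p>
<pre><code class="language-bash"># Example shared credentials file; the key values are placeholders.
mkdir -p "$HOME/.aws"
echo '[default]
aws_access_key_id = REPLACE_WITH_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_WITH_SECRET_ACCESS_KEY' > "$HOME/.aws/credentials.example"
echo "wrote $HOME/.aws/credentials.example"
</code></pre>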
<p>For more details, see our documentation on the necessary <a href="https://docs.elastic.co/en/integrations/aws#aws-credentials">credentials</a> and <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">permissions</a>.</p>
<h3>Step 1: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-get-an-account.png" alt="" /></p>
<h3>Step 2: Install the Elastic AWS integration</h3>
<p>Navigate to the AWS integration on Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-install-aws-integration.png" alt="" /></p>
<p>Select Add AWS integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-add-aws-integration.png" alt="" /></p>
<p>This is where you will add your credentials and it will be stored as a policy in Elastic. This policy will be used as part of the install for the agent in the next step.</p>
<p>As you can see, the general Elastic AWS Integration will collect a significant amount of data from 30 AWS services. If you don’t want to install this general Elastic AWS Integration, you can select individual integrations to install.</p>
<h3>Step 3: Install the Elastic Agent with AWS integration</h3>
<p>Now that you have created an integration policy, navigate to the Fleet section under Management in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-install-elastic-agent.png" alt="" /></p>
<p>Select the name of the policy you created in the last step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-name-policy.png" alt="" /></p>
<p>Follow step 3 of the instructions in the <strong>Add agent</strong> window. This will require you to:</p>
<p>1: Bring up an EC2 instance</p>
<ul>
<li>t2.medium is minimum</li>
<li>Linux - your choice of which</li>
<li>Ensure you allow for Open reservation on the EC2 instance when you Launch it</li>
</ul>
<p>2: Log in to the instance and run the commands under the Linux Tar tab (below is an example)</p>
<pre><code class="language-bash">curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.5.0-linux-x86_64.tar.gz
tar xzvf elastic-agent-8.5.0-linux-x86_64.tar.gz
cd elastic-agent-8.5.0-linux-x86_64
sudo ./elastic-agent install --url=https://37845638732625692c8ee914d88951dd96.fleet.us-central1.gcp.cloud.es.io:443 --enrollment-token=jkhfglkuwyvrquevuytqoeiyri
</code></pre>
<h3>Step 4: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put load on the application.</p>
<p>Here is a simple <a href="https://playwright.dev/">Playwright</a> script you can run to drive traffic to the website of the AWS three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for AWS Threetierapp&quot;, async ({ page }) =&gt; {
  await page.goto(
    &quot;http://web-tier-external-lb-1897463036.us-west-1.elb.amazonaws.com/#/db&quot;
  );

  await page.fill(
    &quot;#transactions &gt; tbody &gt; tr &gt; td:nth-child(2) &gt; input&quot;,
    (Math.random() * 100).toString()
  );
  await page.fill(
    &quot;#transactions &gt; tbody &gt; tr &gt; td:nth-child(3) &gt; input&quot;,
    (Math.random() * 100).toString()
  );
  await page.waitForTimeout(1000);
  await page.click(
    &quot;#transactions &gt; tbody &gt; tr:nth-child(2) &gt; td:nth-child(1) &gt; input[type=button]&quot;
  );
  await page.waitForTimeout(4000);
});
</code></pre>
<p>This script will launch three browsers, but you can limit the load to a single browser in the playwright.config.ts file.</p>
<p>For this exercise, we ran this traffic for approximately five hours with an interval of five minutes while testing the website.</p>
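<p>If you want to keep the generated load light, a minimal config sketch restricting the run to a single Chromium worker might look like the following (shown in plain JavaScript as <code>playwright.config.js</code>; the TypeScript version is analogous, and the exact project names are illustrative):</p>

```javascript
// playwright.config.js — limit the load test to one browser at a time.
// By default Playwright's starter config defines chromium, firefox, and
// webkit projects, which is why the script above launches three browsers.
const config = {
  workers: 1, // run a single worker instead of the default parallel workers
  projects: [
    { name: "chromium", use: { browserName: "chromium" } }, // only Chromium
  ],
};

module.exports = config;
```

<p>With this in place, <code>npx playwright test</code> drives only one browser per run.</p>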
<h3>Step 5: Go to AWS dashboards</h3>
<p>Now that your Elastic Agent is running, you can go to the related AWS dashboards to view what’s being ingested.</p>
<p>To search for the AWS Integration dashboards, simply search for them in the Elastic search bar. The relevant ones for this blog are:</p>
<ul>
<li>[Metrics AWS] EC2 Overview</li>
<li>[Metrics AWS] ELB Overview</li>
<li>[Metrics AWS] RDS Overview</li>
<li>[Metrics AWS] NAT Gateway</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-search-aws-integration-dashboards.png" alt="" /></p>
<p>Let's see what comes up!</p>
<p>All of these dashboards are available out of the box. For all of the following images, we’ve narrowed the views to only the items relevant to our app.</p>
<p>Across all dashboards, we’ve limited the timeframe to when we ran the traffic generator.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-dashboard-traffic-generator.png" alt="Elastic Observability EC2 Overview Dashboard" /></p>
<p>After filtering for our 4 EC2 instances (2 web servers and 2 application servers), we can see the following:</p>
<p>1: All 4 instances are up and running with no failures in status checks.</p>
<p>2: We see the average CPU utilization across the timeframe and nothing looks abnormal.</p>
<p>3: We see the network bytes flow in and out, aggregating over time as the database is loaded with rows.</p>
<p>While this exercise shows a small portion of the metrics that can be viewed, more are available from AWS EC2. The metrics listed on <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html">AWS documentation</a> are all available, including the dimensions to help narrow the search for specific instances, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-overview-dashboard.png" alt="Elastic Observability ELB Overview Dashboard" /></p>
<p>For the ELB dashboard, we filter for our 2 load balancers (external web load balancer and internal application load balancer).</p>
<p>With the out-of-the-box dashboard, you can see application ELB-specific metrics. Graphs can be added for a good portion of the application ELB-specific metrics listed in the <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html">AWS Docs</a>.</p>
<p>For our two load balancers, we can see:</p>
<p>1: Both the hosts (EC2 instances connected to the ELBs) are healthy.</p>
<p>2: Load Balancer Capacity Units (how much you are using) and request counts both went up as expected during the traffic generation time frame.</p>
<p>3: We chose to show 4XX and 2XX counts. 4XX counts help identify issues with the application or connectivity to the application servers.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-transaction-blocked.png" alt="Elastic Observability RDS Overview Dashboard" /></p>
<p>For AuroraDB, which is deployed in RDS, we’ve filtered for just the primary and secondary instances of Aurora on the dashboard.</p>
<p>Just as with EC2 and ELB, most RDS metrics from CloudWatch are also available for creating new charts and graphs. In this dashboard, we’ve narrowed it down to showing:</p>
<p>1: Insert throughput &amp; Select throughput</p>
<p>2: Write latency</p>
<p>3: CPU usage</p>
<p>4: General number of connections during the timeframe</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-aws-nat-dashboard.png" alt=" Elastic Observability AWS NAT Dashboard" /></p>
<p>We filtered to look only at our 2 NAT gateways, which front the application servers. As with the other dashboards, additional metrics are available to build graphs and charts as needed.</p>
<p>For the NAT dashboard we can see the following:</p>
<p>1: The NAT gateways are healthy, with no packet drops</p>
<p>2: An expected number of active connections from the web server</p>
<p>3: A fairly normal set of metrics for bytes in and out</p>
<p><strong>Congratulations, you have now started monitoring metrics from key AWS services for your application!</strong></p>
<h2>What to monitor on AWS next?</h2>
<h3>Add logs from AWS Services</h3>
<p>Now that metrics are being monitored, you can add logging as well. There are several options for ingesting logs.</p>
<ol>
<li>The AWS integration in the Elastic Agent has a logs setting; just ensure you turn on what you wish to receive. To ingest the Aurora logs from RDS, we simply turn on Collect logs from CloudWatch in the Elastic Agent policy (see below), then update the agent through the Fleet management UI.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-collect-logs.png" alt="" /></p>
<ol start="2">
<li>You can install the <a href="https://github.com/elastic/elastic-serverless-forwarder/blob/main/docs/README-AWS.md#deploying-elastic-serverless-forwarder">Lambda logs forwarder</a>. This option will pull logs from multiple locations. See the architecture diagram below.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-elastic-lambda-logs-forwarder.png" alt="" /></p>
<p>A review of this option is also found in the following <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">blog</a>.</p>
<h3>Analyze your data with Elastic Machine Learning</h3>
<p>Once metrics and logs (or either one) are in Elastic, start analyzing your data through Elastic’s ML capabilities. A great review of these features can be found here:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM Telemetry to determine root causes in transactions</a></li>
<li><a href="https://www.elastic.co/elasticon/archive/2020/global/machine-learning-and-the-elastic-stack-everywhere-you-need-it">Introduction to Elastic Machine Learning</a></li>
</ul>
<p>And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>.</p>
<h2>Conclusion: Monitoring AWS service metrics with Elastic Observability is easy!</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you monitor AWS service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of AWS service metrics</li>
<li>It’s easy to set up ingest from AWS Services via the Elastic Agent</li>
<li>Elastic Observability has multiple out-of-the-box (OOTB) AWS service dashboards you can use to preliminarily review information, then modify for your needs</li>
<li>30+ AWS services are supported as part of AWS Integration on Elastic Observability, with more services being added regularly</li>
<li>As noted in related blogs, you can analyze your AWS service metrics with Elastic’s machine learning capabilities</li>
</ul>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-service-metrics-monitor-observability-easy/blog-charts-packages.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AWS VPC Flow log analysis with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic</link>
            <guid isPermaLink="false">aws-vpc-flow-log-analysis-with-genai-elastic</guid>
            <pubDate>Fri, 07 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from AWS VPC Flows easier.]]></description>
<content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. In managing AWS deployments, VPC flow logs are critical for performance, network visibility, security, compliance, and overall management of your AWS environment. Several examples of what they can reveal:</p>
<ol>
<li>
<p>Where traffic is coming from and going to, both into and out of the deployment and within it. This helps identify unusual or unauthorized communications</p>
</li>
<li>
<p>Traffic volumes, where spikes or drops could indicate service issues in production or an increase in customer traffic</p>
</li>
<li>
<p>Latency and performance bottlenecks - with VPC flow logs, you can look at latency for a flow (inflows and outflows) and understand patterns</p>
</li>
<li>
<p>Accepted and rejected traffic, which helps determine where potential security threats and misconfigurations lie. </p>
</li>
</ol>
<p>AWS VPC flow logs are a great example of the value of logging. Logging is an important part of observability, alongside metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting with VPC flow logs, but those logs also provide a significant amount of insight.</p>
<p>Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<p>In today’s blog, we’ll cover how Elastic’s other features make analyzing VPC flow logs and performing root cause analysis (RCA) even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there is no unauthorized or unusual activity:</p>
<ol>
<li>
<p>Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading</p>
</li>
<li>
<p>Create an ML job to analyze different fields of the VPC Flow log</p>
</li>
<li>
<p>Use our newly released RAG-based AI Assistant to analyze the logs without needing to know Elastic’s query language or even how to build graphs in Elastic</p>
</li>
<li>
<p>Use ES|QL to help understand and analyze latency patterns.</p>
</li>
</ol>
<p>In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights from VPC flow logs beyond just REJECT/ACCEPT.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Follow the instructions in the <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS three-tier app</a> repository to get the app installed, and bring in the <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS VPC Flow logs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack</p>
</li>
<li>
<p>To use the AI Assistant, you will need a trial license or an upgrade to Platinum.</p>
</li>
</ul>
<h2>SLO with VPC Flow Logs</h2>
<p>Elastic’s SLO capability follows the definitions and semantics described in the Google SRE Handbook. Users can perform the following with SLOs in Elastic:</p>
<ul>
<li>Define an SLO on logs, not just metrics - users can use a KQL (log-based) query, service availability, service latency, a custom metric, a histogram metric, or a timeslice metric.</li>
<li>Define SLO, SLI, Error budget and burn rates. Users can also use occurrence versus time slice-based budgeting. </li>
<li>Manage, with dashboards, all the SLOs in a singular location.</li>
<li>Trigger alerts from the defined SLO - for example, when the SLI is off target, the error budget burn rate is too high, or the error rate crosses a threshold.</li>
</ul>
<p>Setting up an SLO for VPC is easy. You simply create a query defining the events you want to track. In our case, we count all the good events where <em>aws.vpcflow.action=ACCEPT</em>, and we set the target at 85%. </p>
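<p>To make the budget arithmetic concrete, here is a small JavaScript sketch of occurrence-based SLO math; the function name and example numbers are illustrative, not Elastic’s implementation:</p>

```javascript
// Occurrence-based SLO math (illustrative only).
// Good events: flows with aws.vpcflow.action === "ACCEPT"; target: 85%.
function sloStatus(goodEvents, totalEvents, target = 0.85) {
  const sli = goodEvents / totalEvents;          // observed ratio of good events
  const errorBudget = 1 - target;                // allowed ratio of bad events (15%)
  const badRatio = 1 - sli;                      // observed ratio of bad events
  const budgetConsumed = badRatio / errorBudget; // > 1 means the budget is exhausted
  return { sli, budgetConsumed, compliant: sli >= target };
}

// e.g. 8,000 ACCEPTs out of 10,000 flows: SLI 0.80, budget consumed ~1.33x
console.log(sloStatus(8000, 10000));
```

<p>In other words, once the ratio of REJECTs exceeds the 15% error budget, the SLO is out of compliance, which is exactly what the dashboard below reports.</p>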
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOsetup.png" alt="Setting up SLO for VPC FLow log" /></p>
<p>As the following example shows, over the last 7 days we have exceeded our error budget by 43% and have been out of compliance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOMiss.png" alt="VPC Flow Reject SLO" /></p>
<h2>Analyzing the SLO with AI Assistant</h2>
<p>Now that we see there is an issue with the VPC flows, we can work with the AI Assistant to start analyzing the SLO. Because it's a chat interface, we simply open the AI Assistant and work through some simple analysis (see the animated GIF below for a demo):</p>
<h3>AI Assistant analysis:</h3>
<ul>
<li>
<p><strong>what were the top 3 source.address that had <em>aws.vpcflow.action=REJECT</em> over the last 7 days, which is causing this SLO issue?</strong> - We wanted to see what could be causing the loss in error budget. Were there any particular source.addresses causing a heavy reject rate?</p>
<ul>
<li>
<p>The answer: A table with the highest count = 42670 and <em>source.address = 79.110.62.185</em></p>
</li>
<li>
<p>A single <em>source.address</em> is causing the loss in SLO. </p>
</li>
</ul>
</li>
<li>
<p><strong>What is the largest number of  <em>aws.vpcflow.action=REJECT</em> in a 30 min time frame for the last 3 days where the <em>source.address=79.110.62.185</em>?</strong> - After understanding that a specific source.address is causing the loss in SLO, we want to understand the averages. </p>
<ul>
<li>
<p><strong>The answer:</strong> &quot;The largest number of <em>aws.vpcflow.action=REJECT</em> in a 30-minute time frame for the last 3 days where the <em>source.address</em> is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z.&quot;</p>
</li>
<li>
<p>This means the REJECT rate must be low but fairly consistent, rather than spiky, over the last 7 days.</p>
</li>
</ul>
</li>
<li>
<p><strong>for the logs with <em>source.address</em>=&quot;79.110.62.185&quot; was there any country code present in the <em>source.geo.country_iso_code</em> field? If yes, what is the value?</strong> - Since the REJECT rate from this address was low but consistent, we next checked where the traffic originates.</p>
<ul>
<li><strong>The answer:</strong> Yes, there is a country code present in the <em>source.geo.country_iso_code</em> field for logs with <em>source.address</em>=&quot;79.110.62.185&quot;. The value is BG (Bulgaria).</li>
</ul>
</li>
<li>
<p><strong>Is there a specific destination.address where <em>source.address=79.110.62.185</em> is getting a <em>aws.vpcflow.action=REJECT</em>? Give me both the destination.address and the number of REJECTs for that destination.address.</strong></p>
<ul>
<li><strong>The answer:</strong> A destination.address of 10.0.0.27 received 53433 REJECTs in this time frame.</li>
</ul>
</li>
<li>
<p><strong>Graph the number of REJECT vs ACCEPT for <em>source.address</em>=&quot;79.110.62.185&quot; over the last 7 days, on a daily basis, in a single graph</strong> - We asked this question to compare ACCEPT and REJECT over time. </p>
<ul>
<li><strong>The answer:</strong> See the animated GIF below; the generated graph is fairly stable.</li>
</ul>
</li>
<li>
<p><strong>Were there any source.address values that had a spike in reject rate within a 30-minute period over the last 30 days?</strong> - We wanted to see if there were any other spikes. </p>
<ul>
<li><strong>The answer</strong> - Yes, there was a source.address that had a spike in high reject rates in a 30-minute period over the last 30 days. <em>source.address</em>: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z</li>
</ul>
</li>
</ul>
<hr />
<h3>Watch the flow</h3>
&lt;Video vidyardUuid=&quot;1jvEpzfkci9j6AoL42XWA3&quot; /&gt;
<h3>Potential issue:</h3>
<p>The server handling requests from source <strong><em>79.110.62.185</em></strong> is potentially having an issue.</p>
<p>Again using logs, we asked the AI Assistant for the <em>eni</em> IDs where the internal IP address was 10.0.0.27.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlow-findingwebserver.png" alt="Finding the issue - webserver" /></p>
<p>From our AWS console, we know that this is the webserver. After further analysis in Elastic and with the developers, we realized a recently installed new version was causing a problem with connections.</p>
<h2>Locating anomalies with ML</h2>
<p>While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.</p>
<p>VPC Flow logs come with a large amount of information. The full set of fields is listed in <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-logs-basics">AWS docs</a>. We will use a specific subset to help detect anomalies.</p>
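<p>To illustrate what a raw record holds, here is a minimal JavaScript parser for the default (version 2) flow log record format; the sample line and the <code>parseFlowRecord</code> helper are hypothetical, assembled from the field order in the AWS docs:</p>

```javascript
// Minimal parser for the default VPC Flow Logs record format (version 2).
// Field order per AWS docs: version account-id interface-id srcaddr dstaddr
// srcport dstport protocol packets bytes start end action log-status
const FIELDS = [
  "version", "account_id", "interface_id", "srcaddr", "dstaddr",
  "srcport", "dstport", "protocol", "packets", "bytes",
  "start", "end", "action", "log_status",
];

function parseFlowRecord(line) {
  const values = line.trim().split(/\s+/);
  return Object.fromEntries(FIELDS.map((f, i) => [f, values[i]]));
}

// Hypothetical record mirroring the rejected traffic discussed above
const record = parseFlowRecord(
  "2 123456789010 eni-abc123de 79.110.62.185 10.0.0.27 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK"
);
console.log(record.srcaddr, record.action); // 79.110.62.185 REJECT
```

<p>Fields like <code>srcaddr</code>, <code>dstaddr</code>, and <code>action</code> map to the <em>source.address</em>, <em>destination.address</em>, and <em>aws.vpcflow.action</em> fields used by the detectors and influencers below.</p>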
<p>We set up anomaly detection for aws.vpcflow.action=REJECT, which requires multi-metric anomaly detection in Elastic.</p>
<p>The config we used utilizes:</p>
<p>Detectors:</p>
<ul>
<li>
<p>destination.address</p>
</li>
<li>
<p>destination.port</p>
</li>
</ul>
<p>Influencers:</p>
<ul>
<li>
<p>source.address</p>
</li>
<li>
<p>aws.vpcflow.action</p>
</li>
<li>
<p>destination.geo.region_iso_code</p>
</li>
</ul>
<p>The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against <em>destination.address</em> values from a specific <em>source.address</em> and/or <em>destination.geo.region_iso_code</em> location.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowanomalysetup.png" alt="Anomaly detection job config" /></p>
<p>Once run, the job reveals something interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowAnomalyDetection.png" alt="Anomaly detected" /></p>
<p>Notice that <em>source.address</em> 185.244.212.67 has had a high REJECT rate in the last 30 days. </p>
<p>Notice where we found this before? In the AI Assistant!</p>
<p>While we can find this sort of anomaly with the AI Assistant, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are any issues with the webserver, as we found above, or even potential security attacks.</p>
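<p>The kind of spike the job flags can be sketched in a few lines of JavaScript. This is purely illustrative (a median-plus-MAD threshold over per-interval REJECT counts), not the model Elastic’s ML actually uses; the counts are made up, with the 8975-count interval mirroring the spike found above:</p>

```javascript
// Illustrative spike detection over per-interval REJECT counts using a
// median + MAD (median absolute deviation) threshold.
function findSpikes(counts, threshold = 5) {
  const sorted = [...counts].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  const deviations = counts.map((c) => Math.abs(c - median)).sort((a, b) => a - b);
  const mad = deviations[Math.floor(deviations.length / 2)] || 1;
  return counts
    .map((count, interval) => ({ interval, count, score: (count - median) / mad }))
    .filter((p) => p.score > threshold); // keep only extreme outliers
}

// Hypothetical 30-minute REJECT counts for one source.address
const rejectsPerInterval = [210, 198, 225, 215, 8975, 220, 205];
console.log(findSpikes(rejectsPerInterval)); // flags the 8975-count interval
```

<p>A robust statistic like the median keeps a single huge spike from inflating the baseline, which is why the steady ~200-REJECT intervals are not flagged.</p>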
<h2>Conclusion</h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC flows without needing to know the query syntax, where the data lives, or even the field names. You’ve also seen how Elastic can alert you to a potential issue or degradation in service via an SLO. Check out our other blogs on AWS VPC Flow analysis in Elastic:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/21-cubes.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using Azure SRE Agent and Elasticsearch to boost SRE productivity]]></title>
            <link>https://www.elastic.co/observability-labs/blog/azure-sre-agent-elasticsearch</link>
            <guid isPermaLink="false">azure-sre-agent-elasticsearch</guid>
            <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to integrate the Azure SRE Agent with Elasticsearch to benefit from AI-driven autonomous operations, smarter detection, and proactive prevention.]]></description>
<content:encoded><![CDATA[<p>If you’re a Site Reliability Engineer (SRE), you know the feeling: the cloud landscape is growing, and the architectural complexity is crushing. You’re constantly jumping between fragmented toolsets, spending too much time on manual, repetitive tasks just to manage compute, storage, and networking services. That constant toil leads to high Mean Time to Recovery (MTTR) and, let's be honest, serious operational burnout.</p>
<p>This is why adopting an AI-driven approach isn't just critical—it’s necessary to solve modern system challenges. Autonomous agents can automate complete operational workflows with minimal human intervention, empowering SRE teams to move beyond constant reactive issue resolution toward proactive system engineering. But here’s the key: the effectiveness of any autonomous agent depends entirely on the quality of its underlying data. By seamlessly integrating the Azure SRE Agent with Elastic Observability, we’re not just offering simple automation; we’re giving organizations a strategy to enter a new phase of governed, AI-driven autonomous operations. </p>
<p>In this blog, we’ll go over how Elastic Observability and the Azure SRE Agent work together, how this integration empowers SREs with AI-driven operations, and how to get started.</p>
<h2>The Power of Choice: Why Elastic Observability is the Foundation for AI-Driven Ops</h2>
<p>For the modern SRE, Elastic Observability serves as the indispensable high-fidelity data foundation. Elastic transforms environmental complexity into a strategic asset by providing a unified, search-powered view of Logs, Metrics, and Traces.</p>
<p>The Azure SRE Agent requires more than just raw data; it requires governed, real-time production insights. Elastic delivers this through <strong>ES|QL</strong>, our piped-query language that allows for high-speed telemetry correlation and transformation. Specifically optimized for <strong>Elastic 9.2.0+</strong> and <strong>Elasticsearch Serverless</strong> projects, this integration utilizes the Model Context Protocol (MCP) to provide the agent with deep system context.</p>
<p><strong>Pro-Tip:</strong> To leverage this integration, ensure that the <strong>Agent Builder</strong> feature is enabled within your Elastic deployment, as this serves as the gateway for the agent to access your production environment securely.</p>
<h2>Better Together: The Value of the Elastic and Azure SRE Agent Integration</h2>
<p>Combining Elastic’s search-powered observability with Azure’s agentic automation creates a &quot;Better Together&quot; ecosystem that provides several strategic advantages:</p>
<ul>
<li>
<p><strong>Smarter Detection &amp; Remediation:</strong> Infuse Elastic’s real-time governed data and causal analysis into Azure SRE Agent workflows. This allows the agent to not only identify a symptom but also understand the underlying root cause.</p>
</li>
<li>
<p><strong>Context-Rich Investigation:</strong> SREs can accelerate triage by providing the agent with full production context—including the blast radius of an incident—directly where the SRE works. This eliminates the &quot;swivel-chair&quot; effect of switching between monitoring dashboards.</p>
</li>
<li>
<p><strong>Proactive Prevention:</strong> By utilizing historical trends and real-time signals from Elastic, the Azure SRE Agent can stop regressions and performance degradations before they impact the end-user experience.</p>
</li>
<li>
<p><strong>Natural Language Interaction:</strong> Through the Elasticsearch MCP server, SREs can query complex clusters using natural language, making deep data exploration accessible without needing to master complex query syntax.</p>
</li>
</ul>
<h2>Practical Scenarios: Elastic-Powered SRE in Action</h2>
<p>This integration empowers SREs to solve real-world problems through conversational automation:</p>
<ol>
<li>
<p><strong>Incident Triage:</strong> An SRE prompts the agent: <em>&quot;Search for errors in the last hour across all logs indices.&quot;</em> The agent invokes the MCP tools in Agent Builder to return a prioritized list of error logs, identifying a service spike in seconds.</p>
</li>
<li>
<p><strong>Performance Analysis:</strong> To identify a recurring pattern, an SRE commands: <em>&quot;Run an ES|QL query to find the top 10 error types.&quot;</em> The agent uses ES|QL to aggregate telemetry, allowing the team to prioritize development fixes based on frequency.</p>
</li>
<li>
<p><strong>Infrastructure Health:</strong> During a suspected Azure resource failure, an SRE can check the data layer by asking: <em>&quot;Show me metric information for my cluster.&quot;</em> By invoking MCP tools, the agent determines if a node failure is impacting data availability.</p>
</li>
</ol>
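<p>As an illustration of the second scenario, a query like the following could surface the top 10 error types. The index pattern and ECS field names here are assumptions, and the ES|QL is shown embedded in a JavaScript string, much as an agent might construct it before invoking the MCP tools:</p>

```javascript
// Hypothetical ES|QL query an agent might run for "find the top 10 error types".
// Assumes logs-* indices with ECS fields log.level and error.type.
const topErrorsQuery = `
FROM logs-*
| WHERE log.level == "error"
| STATS error_count = COUNT(*) BY error.type
| SORT error_count DESC
| LIMIT 10
`;
console.log(topErrorsQuery.trim());
```

<p>The agent aggregates telemetry with a query of this shape, then returns the ranked error types so the team can prioritize fixes by frequency.</p>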
<h2>Practical How-to Guide: Integrating Elastic with the Azure SRE Agent</h2>
<ol>
<li>In Elastic, via your Kibana interface, create an API key and save it:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-1-2.png" alt="" /></p>
<ol start="2">
<li>Find and copy your MCP Endpoint in Agent Builder:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-2.png" alt="" /></p>
<ol start="3">
<li>In the Azure portal, find the SRE Agent service:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-3.png" alt="" /></p>
<ol start="4">
<li>Create an Agent:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-4.png" alt="" /></p>
<ol start="5">
<li>Add the Elastic Connector:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-5.png" alt="" /></p>
<ol start="6">
<li>Talk to your agent. Use “/agent” to select your agent in the chat interface:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/image-6.png" alt="" /></p>
<h2>Conclusions</h2>
<p>The integration of Elastic Observability and the Azure SRE Agent represents a strategic leap forward for cloud operations. By combining Elastic's superior data depth and ES|QL engine with Azure’s autonomous automation, organizations can drastically reduce MTTR, eliminate toil, and maximize the ROI of their Azure investments.</p>
<h2>Next Steps</h2>
<p>Explore the <a href="https://marketplace.microsoft.com/en-us/product/elastic.ec-azure-observability?tab=Overview">Elasticsearch Observability</a> solution on the Microsoft Marketplace and visit the <a href="https://sre.azure.com">Azure SRE Agent resource</a> to begin your trial of Elastic-centric autonomous operations today.</p>
<p>Learn more by checking out the following links:</p>
<ul>
<li><a href="https://techcommunity.microsoft.com/blog/appsonazureblog/get-started-with-elasticsearch-mcp-server-in-azure-sre-agent/4492896">Microsoft: Get started with Elasticsearch MCP server in Azure SRE Agent</a></li>
<li><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP Server</a></li>
<li><a href="https://learn.microsoft.com/en-us/azure/sre-agent/custom-mcp-server">Azure SRE Agent MCP Overview</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/azure-sre-agent-elasticsearch/cover.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Best practices for instrumenting OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/best-practices-instrumenting-opentelemetry</link>
            <guid isPermaLink="false">best-practices-instrumenting-opentelemetry</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrumenting OpenTelemetry is complex. Even using auto-instrumentation requires understanding details about your application and OpenTelemetry configuration options. We’ll cover the best practices for instrumenting applications for OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is steadily gaining broad industry adoption. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining momentum, with major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data, providing a de facto standard for observability. With that, teams can rely on vendor-agnostic, future-proof instrumentation of their applications that allows them to switch observability backends without additional overhead in adapting instrumentation.</p>
<p>Teams that have chosen OpenTelemetry for instrumentation face a choice between different instrumentation techniques and data collection approaches. Determining how to instrument and what mechanism to use can be challenging. In this blog, we will go over Elastic’s recommendations around some best practices for OpenTelemetry instrumentation:</p>
<ul>
<li><strong>Automatic or manual?</strong> We’ll cover the need for one versus the other and provide recommendations based on your situation.</li>
<li><strong>Collector or direct from the application?</strong> While the traditional option is to use a collector, observability tools like Elastic Observability can take telemetry from OpenTelemetry applications directly.</li>
<li><strong>What to instrument from OTel SDKs.</strong> Traces and metrics are well supported (<a href="https://opentelemetry.io/docs/instrumentation/">per the table in OTel docs</a>), but logs are still in progress. Elastic<sup>®</sup> is accelerating that progress with its <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">contribution of ECS to OTel</a>. Regardless of the status from OTel, you need to test and ensure these instrumentations work for you.</li>
<li><strong>Advantages and disadvantages of OpenTelemetry</strong></li>
</ul>
<h2>OTel automatic or manual instrumentation: Which one should I use?</h2>
<p>While there are two ways to instrument your applications with OpenTelemetry — automatic and manual — there isn’t a perfect answer, as it depends on your needs. There are pros and cons of using one versus another, such as:</p>
<ul>
<li>Auto-magic experience vs. control over instrumentation</li>
<li>Customization vs. out-of-the-box data</li>
<li>Instrumentation overhead</li>
<li>Simplicity vs. flexibility</li>
</ul>
<p>Additionally, you might even land on a combination depending on availability and need.</p>
<p>Let’s review both automatic and manual instrumentation and explore specific recommendations.</p>
<h3>Auto-instrumentation</h3>
<p>For most programming languages and runtimes, OpenTelemetry provides an auto-instrumentation approach for gathering telemetry data. Auto-instrumentation provides a set of pre-defined, out-of-the-box instrumentation modules for well-known frameworks and libraries. With that, users can gather telemetry data (such as traces, metrics, and logs) from well-known frameworks and libraries used by their application with minimal or even no code changes.</p>
<p>Here are some of the apparent benefits of using auto-instrumentation:</p>
<ul>
<li>Quicker development and path to production. Auto-instrumentation saves time by accelerating the process of integrating telemetry into an application, allowing more focus on other critical tasks.</li>
<li>Simpler maintenance by only having to update one line, which is usually the container start command where auto-instrumentation is configured, versus having to update multiple lines of code across multiple classes, methods, and services.</li>
<li>Easier to keep up with the latest features and improvements in the OpenTelemetry project without manually updating the instrumentation of used libraries and/or code.</li>
</ul>
<p>There are also some disadvantages and limitations of the auto-instrumentation approach:</p>
<ul>
<li>Auto-instrumentation collects telemetry data only for the frameworks and libraries in use for which an explicit auto-instrumentation module exists. In particular, it’s unlikely that auto-instrumentation would collect telemetry data for “exotic” or custom libraries.</li>
<li>Auto-instrumentation does not capture telemetry for pure custom code (that does not use well-known libraries underneath).</li>
<li>Auto-instrumentation modules come with a pre-defined, opinionated instrumentation logic that provides sufficient and meaningful information in the vast majority of cases. However, in some custom edge cases, the information value, structure, or level of detail of the data provided by auto-instrumentation modules might be not sufficient.</li>
<li>Depending on the runtime, technology, and size of the target application, auto-instrumentation may come with a (slightly) higher start-up or runtime overhead compared to manual instrumentation. In the majority of cases, this overhead is negligible but may become a problem in some edge cases.</li>
</ul>
<p><a href="https://github.com/elastic/workshops-instruqt/blob/main/Elastiflix/python-favorite-otel-auto/Dockerfile">Here</a> is an example of a Python application that was auto-instrumented with OpenTelemetry. If you had a Python application locally, you would add the code below to auto-instrument:</p>
<pre><code class="language-bash">opentelemetry-instrument \
    --traces_exporter otlp \
    --metrics_exporter otlp \
    --service_name your-service-name \
    --exporter_otlp_endpoint http://localhost:4317 \
    python main.py
</code></pre>
<p><a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Learn more about auto-instrumentation with OpenTelemetry for Python applications</a>.</p>
<p>Finally, auto-instrumentation lets developers get value from OpenTelemetry without first mastering its APIs, avoiding the complexities that come with manual instrumentation. However, manual instrumentation might still be preferred for specific use cases or when custom requirements cannot be fully addressed by auto-instrumentation.</p>
<h3>Combination: Automatic and Manual</h3>
<p>Before we proceed with manual instrumentation, you can also use a combination of automatic and manual instrumentation. As we noted above, if you start to understand the application’s behavior, then you can determine if you need some additional instrumentation for code that is not being traced by auto-instrumentation.</p>
<p>Additionally, because not all the auto-instrumentation is equal across the OTel language set, you will probably need to manually instrument in some cases — for example, if auto-instrumentation of a Flask-based Python application doesn’t automatically show middleware calls like calls to the requests library. In this situation, you will have to go with manual instrumentation for the Python application if you want to also see middleware tracing. However, as these libraries mature, more support options will become available.</p>
<p>A combination is where most developers will ultimately land when the application gets to near production quality.</p>
<h3>Manual instrumentation</h3>
<p>If the auto-instrumentation does not cover your needs, you want more control over the instrumentation, or you’d like to treat instrumentation as code, manual instrumentation is likely the right choice for you. As described above, you can use it as an enhancement to auto-instrumentation or switch entirely to manual instrumentation. If you eventually go down the path of manual instrumentation, it definitely provides more flexibility, but it also means you will not only have to code in the traces and metrics but also maintain that code regularly.</p>
<p>As new features are added and changes to the libraries are made, the maintenance for the code may or may not be cumbersome. It’s a decision that requires some forethought.</p>
<p>Here are some reasons why you would potentially use manual instrumentation:</p>
<ul>
<li>You may already have some OTel instrumented applications using auto-instrumentation and need to add more telemetry for specific functions or libraries (like DBs or middleware), thus you will have to add manual instrumentation.</li>
<li>You need more flexibility and control in terms of the application language and what you’d like to instrument.</li>
<li>In case there's no auto-instrumentation available for your programming language and the technologies in use, manual instrumentation would be the way to go for your applications built using these languages.</li>
<li>You might have to instrument for logging with an alternative approach, as logging is not yet stable for all the programming languages.</li>
<li>You need to customize and enrich your telemetry data for your specific use cases — for example, you have a multi-tenant application and you need to get each tenant’s information and then use manual instrumentation via the OpenTelemetry SDK.</li>
</ul>
<p><strong>Recommendations for manual instrumentation</strong><br />
Manual instrumentation will require specific configuration to ensure you have the best experience with OTel. Below are Elastic’s recommendations (as outlined by the <a href="https://www.cncf.io/blog/2020/06/26/opentelemetry-best-practices-overview-part-2-2/">CNCF</a>) for gaining the most benefit when instrumenting manually:</p>
<ol>
<li>
<p>Ensure that your provider configuration and tracer initialization is done properly.</p>
</li>
<li>
<p>Ensure you set up spans in all the functions you want traced.</p>
</li>
<li>
<p>Set up resource attributes correctly.</p>
</li>
<li>
<p>Use batch rather than simple processing.</p>
</li>
</ol>
<p>Let’s review these individually:</p>
<p><strong>1. Ensure that your provider configuration and tracer initialization is done properly.</strong><br />
The general rule of thumb is to ensure you configure all your variables and tracer initialization in the front of the application. Using the <a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix application’s Python favorite service</a> as an example, we can see:</p>
<p><em>Tracer being set up globally</em></p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

...


resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(otel_service_name)
</code></pre>
<p>In the above, we’ve added the OpenTelemetry trace module and imported TracerProvider, which is the entry point of the API. It provides access to the Tracer, the class responsible for creating spans.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that provides hooks for span start and end method invocations.</p>
<p>In OpenTelemetry, different span processors are offered. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be configured to be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See OpenTelemetry Documentation</a>.</p>
<p>The variable otel_service_name is set via environment variables; related settings (i.e., the OTLP endpoint and others) are also set up globally. See below:</p>
<pre><code class="language-python">otel_service_name = os.environ.get('OTEL_SERVICE_NAME') or 'favorite_otel_manual'
environment = os.environ.get('ENVIRONMENT') or 'dev'
otel_service_version = os.environ.get('OTEL_SERVICE_VERSION') or '1.0.0'

otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
</code></pre>
<p>In the above code, we initialize several variables, which fall into two groups:</p>
<p><strong>Resource variables (we will cover this later in this article):</strong></p>
<ul>
<li>otel_service_name – This helps set the name of the service (service.name) in OTel Resource attributes.</li>
<li>otel_service_version – This helps set the version of the service (service.version) in OTel Resource attributes.</li>
<li>environment – This helps set the deployment.environment variable in OTel Resource attributes.</li>
</ul>
<p><strong>Exporter variables:</strong></p>
<ul>
<li>otel_exporter_otlp_endpoint – This helps set the OTLP endpoint where traces, logs, and metrics are sent. Elastic would be an OTLP endpoint. You can also use OTEL_TRACES_EXPORTER or OTEL_METRICS_EXPORTER if you want to only send traces and/or metrics to specific endpoints.</li>
<li>otel_exporter_otlp_headers – This sets the authorization header needed for the endpoint.</li>
</ul>
<p>The separation of your provider and tracer configuration allows you to use any OpenTelemetry provider and tracing framework that you choose.</p>
<p><strong>2. Set up your spans inside the application functions themselves.</strong><br />
Make sure your spans end and are in the right context so you can track the relationships between spans. In our Python favorite application, the function that retrieves a user’s favorite movies shows:</p>
<pre><code class="language-python">@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    # add artificial delay if enabled
    if delay_time &gt; 0:
        time.sleep(max(0, random.gauss(delay_time/1000, delay_time/1000/10)))

    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
        return { &quot;favorites&quot;: favorites}
</code></pre>
<p>While you can instrument every function, it’s strongly recommended that you instrument only what you need to avoid a flood of data. What to instrument depends not only on development needs but also on what SREs, and potentially the business, need to observe in the application. Instrument for your target use cases.</p>
<p>Also, avoid instrumenting trivial utility methods and functions, or ones intended to be called extensively (e.g., getters/setters). Instrumenting those produces a huge amount of telemetry data with very low additional value.</p>
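<p>To make the &quot;instrument what you need&quot; advice concrete, here is a stdlib-only sketch. The ToyTracer is a stand-in for a real OTel tracer, used only so the example runs without the SDK: the business-relevant function gets a span, while the trivial utility deliberately does not:</p>

```python
from contextlib import contextmanager

# Stand-in for an OTel tracer so the example is self-contained; real code
# would use trace.get_tracer(...) as shown earlier in this post.
class ToyTracer:
    def __init__(self):
        self.finished = []  # names of spans that have ended

    @contextmanager
    def start_as_current_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)

tracer = ToyTracer()

def get_favorites(user_id):
    # Business-relevant operation: worth a span.
    with tracer.start_as_current_span("get_favorites"):
        return sorted(["movie-1", "movie-2"])

def normalize(title):
    # Trivial utility called extensively: deliberately left uninstrumented
    # to avoid flooding the backend with low-value spans.
    return title.strip().lower()
```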
<p><strong>3. Set resource attributes and use semantic conventions</strong></p>
<p><em><strong>Resource attributes</strong></em><br />
Attributes such as service.name, service.version, deployment.environment, and the cloud.* attributes are important for tracking the version, environment, cloud provider, etc. of a specific service. Resource attributes describe resources such as hosts, systems, processes, and services, and do not change during the lifetime of the resource. They are a great help in correlating data, providing additional context to telemetry data and, thus, helping narrow down root causes of problems during troubleshooting. While auto-instrumentation sets these up for you, with manual instrumentation you need to ensure your application also sends them.</p>
<p>Check out OpenTelemetry’s list of attributes that can be set in the <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OTel documentation</a>.</p>
<p>In our auto-instrumented Python application from above, here is how we set up resource attributes:</p>
<pre><code class="language-bash">opentelemetry-instrument \
    --traces_exporter console,otlp \
    --metrics_exporter console \
    --service_name your-service-name \
    --exporter_otlp_endpoint 0.0.0.0:4317 \
    python myapp.py
</code></pre>
<p>However, when instrumenting manually, you need to add your resource attributes and ensure you have consistent values across your application’s code. Resource attributes have been defined by OpenTelemetry’s Resource Semantic Convention and can be found <a href="https://opentelemetry.io/docs/specs/semconv/resource/">here</a>. In fact, your organization should have a resource attribute convention that is applied across all applications.</p>
<p>These attributes are added to your metrics, traces, and logs, helping you filter out data, correlate, and make more sense out of them.</p>
<p>Here is an example of setting resource attributes in our Python service:</p>
<pre><code class="language-python">resource_attributes = {
    &quot;service.name&quot;: otel_service_name,
    &quot;service.version&quot;: otel_service_version,
    &quot;deployment.environment&quot;: environment
}

resource = Resource.create(resource_attributes)

provider = TracerProvider(resource=resource)
</code></pre>
<p>We’ve set up service.name, service.version, and deployment.environment. You can set up as many resource attributes as you need, but you must ensure you pass them into the tracer provider with provider = TracerProvider(resource=resource).</p>
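<p>One way to keep resource attributes consistent across services is to combine in-code defaults with the standard OTEL_RESOURCE_ATTRIBUTES environment variable (comma-separated key=value pairs). The SDK performs a similar merge inside Resource.create(); the stdlib-only sketch below assumes, for illustration, that environment values override in-code defaults — check your SDK’s actual precedence rules:</p>

```python
import os

# Parse the standard OTEL_RESOURCE_ATTRIBUTES format ("k=v,k2=v2") and merge
# it over in-code defaults. Precedence here (env wins) is an assumption for
# illustration; verify against your SDK.
def resource_from_env(defaults):
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    parsed = dict(pair.split("=", 1) for pair in raw.split(",") if "=" in pair)
    return {**defaults, **parsed}

os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "deployment.environment=prod"
attrs = resource_from_env({
    "service.name": "favorite",
    "deployment.environment": "dev",
})
```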
<p><em><strong>Semantic conventions</strong></em><br />
In addition to setting the appropriate resource attributes in code, it is important to follow OpenTelemetry’s semantic conventions for the specific technologies and infrastructure your application is built on. For example, if you need to instrument databases, there is no automatic instrumentation; you will have to manually instrument tracing for database calls. In doing so, you should use the <a href="https://opentelemetry.io/docs/specs/semconv/database/database-spans/">semantic conventions for database calls in OpenTelemetry</a>.</p>
<p>Similarly, if you are trying to trace Kafka or RabbitMQ, you can follow the <a href="https://opentelemetry.io/docs/specs/semconv/messaging/">OpenTelemetry semantic conventions for messaging systems</a>.</p>
<p>There are multiple semantic conventions across several areas and signal types that can be followed using OpenTelemetry — <a href="https://opentelemetry.io/docs/specs/semconv/">check out the details</a>.</p>
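<p>As a small example of following a semantic convention, the attribute keys below come from the OpenTelemetry database conventions (db.system, db.name, db.operation, db.statement); the values are made up:</p>

```python
# Attribute keys per the OTel database semantic conventions; the values are
# illustrative only.
db_span_attributes = {
    "db.system": "postgresql",
    "db.name": "movies",
    "db.operation": "SELECT",
    "db.statement": "SELECT title FROM favorites WHERE user_id = %s",
}

# In real code these would be set on the active span, e.g.:
# with tracer.start_as_current_span("SELECT movies.favorites") as span:
#     for key, value in db_span_attributes.items():
#         span.set_attribute(key, value)
```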
<p><strong>4. Use batch or simple processing?</strong><br />
Using simple or batch processing depends on your specific observability requirements. The advantages of batch processing include improved efficiency and reduced network overhead. Batch processing allows you to process telemetry data in batches, enabling more efficient data handling and resource utilization. On the other hand, batch processing increases the lag time for telemetry data to appear in the backend, as the span processor needs to wait for a sufficient amount of data before sending it over to the backend.</p>
Using simple or batch processing depends on your specific observability requirements. The advantages of batch processing include improved efficiency and reduced network overhead. Batch processing allows you to process telemetry data in batches, enabling more efficient data handling and resource utilization. On the other hand, batch processing increases the lag time for telemetry data to appear in the backend, as the span processor needs to wait for a sufficient amount of data to send over to the backend.</p>
<p>With simple processing, you send your telemetry data as soon as the data is generated, resulting in real-time observability. However, you will need to prepare for higher network overhead and more resources required to process all the separate data transmissions.</p>
<p>Here is what we used to set this up in Python:</p>
<pre><code class="language-python">from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)
</code></pre>
<p>Your observability goals and budgetary constraints are the deciding factors when choosing batch or simple processing. A hybrid approach can also be implemented. If real-time insights are critical for an ecommerce application, for example, then simple processing would be the better approach. For other applications where real-time insights are not crucial, consider batch processing. Often, experimenting with both approaches and seeing how your observability backend handles the data is a fruitful exercise to hone in on what approach works best for the business.</p>
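<p>The trade-off above can be seen in a stdlib-only simulation. These classes only mimic the buffering behavior of the SDK’s processors — they are not the OTel API: with 10 spans and a batch size of 5, the simple path makes one export call per span while the batch path makes two:</p>

```python
class RecordingExporter:
    """Records each "network" export call and the spans it carried."""
    def __init__(self):
        self.calls = []

    def export(self, spans):
        self.calls.append(list(spans))

class SimpleProcessor:
    """Exports every span immediately: real-time, but one call per span."""
    def __init__(self, exporter):
        self.exporter = exporter

    def on_end(self, span):
        self.exporter.export([span])

class BatchProcessor:
    """Buffers spans and exports in bulk: fewer calls, higher latency."""
    def __init__(self, exporter, max_batch=512):
        self.exporter, self.max_batch, self.buffer = exporter, max_batch, []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []

simple_exp, batch_exp = RecordingExporter(), RecordingExporter()
simple = SimpleProcessor(simple_exp)
batch = BatchProcessor(batch_exp, max_batch=5)
for i in range(10):
    simple.on_end(f"span-{i}")
    batch.on_end(f"span-{i}")
batch.flush()  # drain any remainder, as SDK shutdown would
```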
<h2>Use the OpenTelemetry Collector or go direct?</h2>
<p>When starting out with OpenTelemetry, ingesting and transmitting telemetry data directly to a backend such as Elastic is a good way to get started. Often, you would be using the OTel direct method in the development phase and in a local environment.</p>
<p>However, as you deploy your applications to production, the applications become fully responsible for ingesting and sending telemetry data. The amount of data sent in a local environment or during development is minuscule compared to a production environment. With millions or even billions of users interacting with your applications, the work of ingesting and sending telemetry data on top of the core application functions can become resource-intensive. Thus, offloading the collection, processing, and exporting of telemetry data to a backend such as Elastic using the vendor-agnostic OTel Collector enables your applications to perform more efficiently, leading to a better customer experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-1-microservices-flowchart.png" alt="1 microservices flowchart" /></p>
<h3>Advantages of using the OpenTelemetry Collector</h3>
<p>For cloud-native and microservices-based applications, the OpenTelemetry Collector provides the flexibility to handle multiple data formats and, more importantly, offloads the resources required from the application to manage telemetry data. The result: reduced application overhead and ease of management as the telemetry configuration can now be managed in one place.</p>
<p>The OTel Collector is the most common configuration because the OTel Collector is used:</p>
<ul>
<li>To enrich the telemetry data with additional context information — for example, on Kubernetes, the OTel Collector would take the responsibility to enrich all the telemetry with the corresponding K8s pod and node information (labels, pod-name, etc.)</li>
<li>To provide uniform and consistent processing or transform telemetry data in a central place (i.e., OTel Collector) rather than take on the burden of syncing configuration across hundreds of services to ensure consistent processing</li>
<li>To aggregate metrics across multiple instances of a service, which is only doable on the OTel Collector (not within individual SDKs/agents)</li>
</ul>
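<p>The last point — cross-instance aggregation — is easy to picture: the Collector sees metric points from every instance of a service and can sum them before export, which no single SDK instance can do. A stdlib sketch with made-up data:</p>

```python
from collections import defaultdict

# Made-up counter points as (service, instance, value); in practice these
# would arrive at the Collector from separate pods.
points = [
    ("checkout", "pod-a", 120),
    ("checkout", "pod-b", 80),
    ("cart", "pod-c", 40),
]

# Sum the counter per service across all of its instances.
totals = defaultdict(int)
for service, _instance, value in points:
    totals[service] += value
```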
<p>Key features of the OpenTelemetry Collector include:</p>
<ul>
<li><strong>Simple setup:</strong> The <a href="https://opentelemetry.io/docs/collector/getting-started/">setup documentation</a> is clear and comprehensive. We also have an example setup using Elastic and the OTel Collector documented from <a href="https://www.elastic.co/blog/opentelemetry-observability">this blog</a>.</li>
<li><strong>Flexibility:</strong> The OTel Collector offers many configuration options and allows you to easily integrate into your existing <a href="https://www.elastic.co/observability">observability solution</a>. However, <a href="https://opentelemetry.io/docs/collector/distributions/">OpenTelemetry’s pre-built distributions</a> allow you to start quickly and build the features that you need. <a href="https://github.com/bshetti/opentelemetry-microservices-demo/blob/main/deploy-with-collector-k8s/otelcollector.yaml">Here</a> as well as below is an example of the code that we used to build our collector for an application running on Kubernetes.</li>
</ul>
<pre><code class="language-yaml">---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otelcollector
spec:
  selector:
    matchLabels:
      app: otelcollector
  template:
    metadata:
      labels:
        app: otelcollector
    spec:
      serviceAccountName: default
      terminationGracePeriodSeconds: 5
      containers:
        - command:
            - &quot;/otelcol&quot;
            - &quot;--config=/conf/otel-collector-config.yaml&quot;
          image: otel/opentelemetry-collector:0.61.0
          name: otelcollector
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
</code></pre>
<ul>
<li><strong>Collect host metrics:</strong> Using the OTel Collector allows you to capture infrastructure metrics, including CPU, RAM, storage capacity, and more. This means you won’t need to install a separate infrastructure agent to collect host metrics. An example OTel configuration for ingesting host metrics is below.</li>
</ul>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    scrapers:
      cpu:
      disk:
</code></pre>
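<p>A receiver on its own does nothing until it is wired into a pipeline with an exporter. The following sketch is illustrative and untested — the endpoint and API key are placeholders, so check the hostmetrics receiver and OTLP exporter documentation for the exact fields your Collector version supports:</p>

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  otlp:
    endpoint: "https://your-elastic-otlp-endpoint:443" # placeholder
    headers:
      # Placeholder credentials; use your backend's auth scheme.
      Authorization: "ApiKey YOUR_API_KEY"

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
```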
<ul>
<li><strong>Security:</strong> The OTel Collector operates in a secure manner by default. It can filter out sensitive information based on your configuration. OpenTelemetry provides <a href="https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md">these security guidelines</a> to ensure your security needs are met.</li>
<li><strong>Tail-based sampling for distributed tracing:</strong> With OpenTelemetry, you can specify the sampling strategy you would like to use for capturing traces. Tail-based sampling is available by default with the OTel Collector. With tail-based sampling, you control and thereby reduce the amount of trace data collected. More importantly, you capture the most relevant traces, enabling you to spot issues within your microservices applications much faster.</li>
</ul>
<h2>What about logs?</h2>
<p>OpenTelemetry’s approach to ingesting metrics and traces is a “clean-sheet design.” OTel developed a new API for metrics and traces and implementations for multiple languages. For logs, on the other hand, due to the broad adoption and existence of legacy log solutions and libraries, support from OTel is the least mature.</p>
<p>Today, OpenTelemetry’s solution for logs is to provide integration hooks to existing solutions. Longer term though, OpenTelemetry aims to incorporate context aggregation with logs thus easing logging correlation with metrics and traces. <a href="https://opentelemetry.io/docs/specs/otel/logs/#opentelemetry-solution">Learn more about OpenTelemetry’s vision</a>.</p>
<p>Elastic has written up its recommendations in the following article: <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a>. Here is a brief summary of what Elastic recommends:</p>
<ol>
<li>
<p>Output logs from your service (alongside traces and metrics) using an embedded <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation library</a> to Elastic via the OTLP protocol.</p>
</li>
<li>
<p>Write logs from your service to a file scraped by the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, which then forwards them to Elastic via the OTLP protocol.</p>
</li>
<li>
<p>Write logs from your service to a file scraped by <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> (or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>), which then forwards them to Elastic via an Elastic-defined protocol.</p>
</li>
</ol>
<p>The third approach, where logs are scraped by Elastic Agent, is the recommended one, as it relies on Elastic's widely adopted and proven method for capturing logs from applications and services. The first two approaches, although both use OTel instrumentation, are not yet mature and aren't ready for production-level applications.</p>
<p>Get more details about the three approaches in this <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">Elastic blog</a> which includes a deep-dive discussion with hands-on implementation, architecture, advantages, and disadvantages.</p>
<h2>It’s not all sunshine and roses</h2>
<p>OpenTelemetry is definitely beneficial to obtaining observability for modern cloud-native distributed applications. Having a standardized framework for ingesting telemetry reduces operational expenses and allows the organization to focus more on application innovation. Even with all the advantages of using OTel, there are some limitations that you should be aware of as well.</p>
<p>But first, here are the advantages of using OpenTelemetry:</p>
<ul>
<li><strong>Standardized instrumentation:</strong> Having a consistent method for instrumenting systems up and down the stack gives organizations more operational efficiency and cost-effective observability.</li>
<li><strong>Auto-instrumentation:</strong> OTel gives organizations the ability to auto-instrument popular libraries and frameworks, enabling them to get up and running quickly with minimal changes to the codebase.</li>
<li><strong>Vendor neutrality:</strong> Organizations don’t have to be tied to one vendor for their observability needs. In fact, they can use several at once, whether to evaluate a new backend or to pursue a best-of-breed approach.</li>
<li><strong>Future-proof instrumentation:</strong> Since OpenTelemetry is open source with a vast ecosystem of support, your organization will be using technology that is continually improved and can scale and grow with the business.</li>
</ul>
<p>There are some limitations as well:</p>
<ul>
<li>Instrumenting with OTel is a fork-lift upgrade. Organizations must be aware that time and effort need to be invested to migrate proprietary instrumentation to OpenTelemetry.</li>
<li>The <a href="https://opentelemetry.io/docs/instrumentation/">language SDKs</a> are at different maturity levels, so applications relying on alpha, beta, or experimental functionality may not see the full benefits in the short term.</li>
</ul>
<p>Over time, the disadvantages will be reduced, especially as the maturity level of the functional components improves. Check the <a href="https://opentelemetry.io/status/">OpenTelemetry status page</a> for updates on the status of the language SDKs, the collector, and overall specifications.</p>
<h2>Using Elastic and migrating to OpenTelemetry at your speed</h2>
<p>Transitioning to OpenTelemetry is a challenge for most organizations, as it requires retooling existing proprietary APM agents on almost all applications. This can be daunting, but OpenTelemetry agents provide a mechanism, known as auto-instrumentation, that avoids modifying the source code. With auto-instrumentation, the only code change is removing the proprietary APM agent code. You should also ensure you have an <a href="https://www.elastic.co/blog/opentelemetry-observability">observability tool that natively supports OTel</a> without the need for additional agents, such as <a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic Observability</a>.</p>
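<p>For a Python service, auto-instrumentation can look like the following sketch. The package names and the <code>opentelemetry-instrument</code> wrapper come from the OpenTelemetry Python distribution; the endpoint and token are placeholders you would replace with your own APM Server values.</p>

```shell
# Install the OTel Python distro and detect/install instrumentations
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the unmodified application; no source changes required.
# Endpoint and token below are placeholders for your deployment.
OTEL_EXPORTER_OTLP_ENDPOINT="https://my-apm-server.example.com:8200" \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <secret-token>" \
opentelemetry-instrument --service_name my-service python app.py
```

<p>Everything telemetry-related is configured via flags and environment variables, which is what makes ripping out a proprietary agent the only code change.</p>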
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic recently donated Elastic Common Schema (ECS) in its entirety to OTel</a>. The goal was to help OTel reach a standardized logging format. ECS, developed by the Elastic community over the past few years, gives OTel a path to a more mature logging solution.</p>
<p>Elastic provides native OTel support. You can send OTel telemetry directly into Elastic Observability without a collector or the processing normally performed in one.</p>
<p>Here are the configuration options in Elastic for OpenTelemetry:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-2-otel-config-options.png" alt="" /></p>
<p>Most of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you can also use Elastic’s powerful machine learning capabilities to speed up analysis, and its alerting to help reduce MTTR.</p>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">status of its major functional components</a> (metrics, traces, and logs) still varies by language. Applications written in Java, Python, and JavaScript are therefore good places to start, as their metrics and traces (and, for Java, logs) are stable.</p>
<p>For languages that are not yet supported, you can instrument services with Elastic Agents instead, running your observability platform in mixed mode (Elastic Agents alongside OpenTelemetry agents).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/elastic-blog-3-services.png" alt="services" /></p>
<p>We ran a variation of our standard Elastic Agent application with one service, newsletter-otel, flipped to OTel. Each of the remaining services can be converted to OTel as development resources allow.</p>
<p>As a result, you can take advantage of the benefits of OpenTelemetry, which include:</p>
<ul>
<li><strong>Standardization:</strong> OpenTelemetry provides a standard approach to telemetry collection, enabling consistency of processes and easier integration of different components.</li>
<li><strong>Vendor-agnostic:</strong> Since OpenTelemetry is open source, it is designed to be vendor-agnostic, allowing DevOps and SRE teams to work with other monitoring and observability backends, reducing vendor lock-in.</li>
<li><strong>Flexibility and extensibility:</strong> With its flexible architecture and inherent design for extensibility, OpenTelemetry enables teams to create custom instrumentation and enrich their own telemetry data.</li>
<li><strong>Community and support:</strong> OpenTelemetry has a growing community of contributors and adopters. In fact, Elastic contributed to developing a common schema for metrics, logs, traces, and security events. Learn more <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">here</a>.</li>
</ul>
<p>Once the other languages reach a stable state, you can then continue your migration to OpenTelemetry agents.</p>
<h2>Summary</h2>
<p>OpenTelemetry has become the de facto standard for ingesting metrics, traces, and logs from cloud-native applications. It provides a vendor-agnostic framework for collecting telemetry data, enabling you to use the observability backend of your choice.</p>
<p>Auto-instrumentation using OpenTelemetry is the fastest way for you to ingest your telemetry data and is an optimal way to get started with OTel. However, using manual instrumentation provides more flexibility, so it is often the next step in gaining deeper insights from your telemetry data.</p>
<p><a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> in Elastic works whether you ingest data directly or through the OTel Collector. For local development, sending data directly to your observability backend is a great way to get started; for production workloads, however, the OTel Collector is recommended. The collector handles data ingestion and processing, letting your applications focus on functionality rather than telemetry tasks.</p>
<p>Logging functionality is still at a nascent stage with OpenTelemetry, while ingesting metrics and traces is well established. For logs, if you’ve started down the OTel path, you can send your logs to Elastic using the OTLP protocol. Since Elastic has a very mature logging solution, a better approach would be to use an Elastic Agent to ingest logs.</p>
<p>Although the long-term benefits are clear, organizations need to be aware that adopting OpenTelemetry means owning their own instrumentation. Thus, appropriate resources and effort need to be incorporated into the development lifecycle. Over time, however, OpenTelemetry brings standardization to telemetry data ingestion, offering organizations vendor choice, scalability, flexibility, and future-proofing of investments.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/best-practices-instrumenting-opentelemetry/ecs-otel-announcement-3.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Best Practices for Log Management: Leveraging Logs for Faster Problem Resolution]]></title>
            <link>https://www.elastic.co/observability-labs/blog/best-practices-logging</link>
            <guid isPermaLink="false">best-practices-logging</guid>
            <pubDate>Wed, 11 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore effective log management strategies to improve system reliability and performance. Learn about data collection, processing, analysis, and cost-effective management of logs in complex software environments.]]></description>
            <content:encoded><![CDATA[<p>In today's rapid software development landscape, efficient log management is crucial for maintaining system reliability and performance. With expanding and complex infrastructure and application components, the responsibilities of operations and development teams are ever-growing and multifaceted. This blog post outlines best practices for effective log management, addressing the challenges of growing data volumes, complex infrastructures, and the need for quick problem resolution.</p>
<h2>Understanding Logs and Their Importance</h2>
<p>Logs are records of events occurring within your infrastructure, typically including a timestamp, a message detailing the event, and metadata identifying the source. They are invaluable for diagnosing issues, providing early warnings, and speeding up problem resolution. Logs are often the primary signal that developers enable, offering significant detail for debugging, performance analysis, security, and compliance management.</p>
<h2>The Logging Journey</h2>
<p>The logging journey involves three basic steps: collection and ingestion, processing and enrichment, and analysis and rationalization. Let's explore each step in detail, covering some of the best practices for each section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/blog-elastic-collection-and-ingest.png" alt="Logging Journey" /></p>
<h3>1. Log Collection and Ingestion</h3>
<h4>Collect Everything Relevant and Actionable</h4>
<p>The first step is to collect all logs into a central location. This involves identifying all your applications and systems and collecting their logs. Comprehensive data collection ensures no critical information is missed, providing a complete picture of your system's behavior. In the event of an incident, having all logs in one place can significantly reduce the time to resolution. It's generally better to collect more data than you need, as you can always filter out irrelevant information later and delete logs as soon as they are no longer needed.</p>
<h4>Leverage Integrations</h4>
<p>Elastic provides over 300 integrations that simplify data onboarding. These integrations not only collect data but also come with dashboards, saved searches, and pipelines to parse the data. Utilizing these integrations can significantly reduce manual effort and ensure data consistency.</p>
<h4>Consider Ingestion Capacity and Costs</h4>
<p>An important aspect of log collection is ensuring you have sufficient ingestion capacity at a manageable cost. When assessing solutions, be cautious about those that charge significantly more for high cardinality data, as this can lead to unexpectedly high costs in observability solutions. We'll talk more about cost effective log management later in this post.</p>
<h4>Use Kafka for Large Projects</h4>
<p>For larger organizations, implementing Kafka can improve log data management. Kafka acts as a buffer, making the system more reliable and easier to manage. It allows different teams to send data to a centralized location, which can then be ingested into Elastic.</p>
<h3>2. Processing and Enrichment</h3>
<h4>Adopt Elastic Common Schema (ECS)</h4>
<p>One key aspect of log collection is normalizing as much as possible across all of your applications and infrastructure; having a common semantic schema is crucial. Elastic contributed Elastic Common Schema (ECS) to OpenTelemetry (OTel), helping accelerate the adoption of OTel-based observability and security. This move towards a more normalized way to define and ingest logs (as well as metrics and traces) is beneficial for the industry.</p>
<p>Using ECS helps standardize field names and data structures, making data analysis and correlation easier. This common schema ensures your data is organized predictably, facilitating more efficient querying and reporting. Learn more about ECS <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">here</a>.</p>
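<p>As a minimal illustration of that normalization, the sketch below renames hypothetical application-specific field names to their ECS equivalents. The input field names are invented; the ECS target names are real.</p>

```python
# Map custom application log fields onto ECS field names so that
# data from different services lines up in the same indices.
ECS_FIELD_MAP = {
    "ts": "@timestamp",
    "severity": "log.level",
    "msg": "message",
    "hostname": "host.name",
    "client_ip": "source.ip",
}

def to_ecs(raw: dict) -> dict:
    """Rename known fields to ECS; keep unknown fields under 'labels'."""
    out = {}
    for key, value in raw.items():
        if key in ECS_FIELD_MAP:
            out[ECS_FIELD_MAP[key]] = value
        else:
            out.setdefault("labels", {})[key] = value
    return out

record = {"ts": "2024-09-11T12:00:00Z", "severity": "error",
          "msg": "payment failed", "client_ip": "10.0.0.7",
          "region": "us-east-1"}
```

<p>Keeping unmapped fields under <code>labels</code> mirrors the common ECS convention of preserving custom metadata without polluting the top-level namespace.</p>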
<h4>Optimize Mappings for High Volume Data</h4>
<p>For high cardinality fields or those rarely used, consider optimizing or removing them from the index. This can improve performance by reducing the amount of data that needs to be indexed and searched. Our documentation has sections to tune your setup for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html">disk usage</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">search speed</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">indexing speed</a>.</p>
<h4>Managing Structured vs. Unstructured Logs</h4>
<p>Structured logs are generally preferable as they offer more value and are easier to work with. They have a predefined format and fields, simplifying information extraction and analysis. For custom logs without pre-built integrations, you may need to define your own parsing rules.</p>
<p>For unstructured logs, full-text search capabilities can help mitigate limitations. By indexing logs, full-text search allows users to search for specific keywords or phrases efficiently, even within large volumes of unstructured data. This is one of the main differentiators of Elastic's observability solution. You can simply search for any keyword or phrase and get results in real-time, without needing to write complex regular expressions or parsing rules at query time.</p>
<h4>Schema-on-Read vs. Schema-on-Write</h4>
<p>There are two main approaches to processing log data:</p>
<ol>
<li>
<p>Schema-on-read: Some observability dashboarding capabilities can perform runtime transformations to extract fields from non-parsed sources on the fly. This is helpful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large volumes of data.</p>
</li>
<li>
<p>Schema-on-write: This approach offers better performance and more control over the data. The schema is defined upfront, and the data is structured and validated at the time of writing. This allows for faster processing and analysis of the data, which is beneficial for enrichment.</p>
</li>
</ol>
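<p>A minimal schema-on-write sketch: parse each raw line once at ingest time, so queries later run against ready-made fields. The log line format and regular expression below are hypothetical.</p>

```python
import re

# A hypothetical single-line log format: timestamp, level, service,
# quoted message, and a status code.
LINE = '2024-09-11T12:00:01Z ERROR checkout "null pointer" status=500'

PATTERN = re.compile(
    r'(?P<timestamp>\S+)\s+(?P<level>\w+)\s+(?P<service>\S+)\s+'
    r'"(?P<message>[^"]*)"\s+status=(?P<status>\d+)'
)

def parse(line: str) -> dict:
    """Structure a line at write time; keep unparsed lines as raw text."""
    match = PATTERN.match(line)
    if match is None:
        return {"message": line}   # still searchable via full-text search
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    return doc
```

<p>Paying the parsing cost once at write time is what makes later queries and aggregations on fields like <code>status</code> fast, which is the core trade-off against schema-on-read.</p>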
<h3>3. Analysis and Rationalization</h3>
<h4>Full-Text Search</h4>
<p>Elastic's full-text search capabilities, powered by Elasticsearch, allow you to quickly find relevant logs. The Kibana Query Language (KQL) enhances search efficiency, enabling you to filter and drill down into the data to identify issues rapidly.</p>
<p>Here are a few examples of KQL queries:</p>
<pre><code>// Filter documents where a field exists
http.request.method: *

// Filter documents that match a specific value
http.request.method: GET

// Search all fields for a specific value
Hello

// Filter documents where a text field contains specific terms
http.request.body.content: &quot;null pointer&quot;

// Filter documents within a range
http.response.bytes &lt; 10000

// Combine range queries
http.response.bytes &gt; 10000 and http.response.bytes &lt;= 20000

// Use wildcards to match patterns
http.response.status_code: 4*

// Negate a query
not http.request.method: GET

// Combine multiple queries with AND/OR
http.request.method: GET and http.response.status_code: 400
</code></pre>
<h4>Machine Learning Integration</h4>
<p>Machine learning can automate the detection of anomalies and patterns within your log data. Elastic offers features like log rate analysis that automatically identify deviations from normal behavior. By leveraging machine learning, you can proactively address potential issues before they escalate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/screenshot-machine-learning-smv-anomaly.png" alt="Machine Learning" /></p>
<p>Organizations should employ a diverse set of machine learning algorithms and techniques to effectively uncover unknown-unknowns in log files. Unsupervised machine learning algorithms should be used for anomaly detection on real-time data, with rate-controlled alerting based on severity.</p>
<p>By automatically identifying influencers, users can gain valuable context for automated root cause analysis (RCA). Log pattern analysis brings categorization to unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data.</p>
<p>Take a look at the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-overview.html">documentation</a> to get started with machine learning in Elastic.</p>
<h4>Dashboarding and Alerting</h4>
<p>Building dashboards and setting up alerting helps you monitor your logs in real-time. Dashboards provide a visual representation of your logs, making it easier to identify patterns and anomalies. Alerting can notify you when specific events occur, allowing you to take action quickly.</p>
<h2>Cost-Effective Log Management</h2>
<h3>Use Data Tiers</h3>
<p>Implementing index lifecycle management to move data across hot, warm, cold, and frozen tiers can significantly reduce storage costs. This approach ensures that only the most frequently accessed data is stored on expensive, high-performance storage, while older data is moved to more cost-effective storage solutions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/ilm.png" alt="ILM" /></p>
<p>Our documentation explains how to set up <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html">Index Lifecycle Management</a>.</p>
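<p>For illustration, a lifecycle policy of the kind you would send to the <code>_ilm/policy</code> API might look like the sketch below. The rollover sizes and phase timings are assumptions to be tuned per workload, not recommendations.</p>

```python
import json

# Illustrative ILM policy body (e.g. for PUT _ilm/policy/logs-policy).
# Timings and sizes below are placeholders; adjust to your retention needs.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb",
                                 "max_age": "7d"}
                }
            },
            "warm": {"min_age": "7d",
                     "actions": {"shrink": {"number_of_shards": 1}}},
            "cold": {"min_age": "30d", "actions": {}},
            "frozen": {"min_age": "60d",
                       "actions": {"searchable_snapshot":
                                   {"snapshot_repository": "my-repo"}}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```

<p>Each phase moves data to progressively cheaper storage until the delete phase removes it, which is exactly the cost curve described above.</p>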
<h3>Compression and Index Sorting</h3>
<p>Applying best compression settings and using index sorting can further reduce the data footprint. Optimizing the way data is stored on disk can lead to substantial savings in storage costs and improve retrieval performance. As of 8.15, Elasticsearch provides an indexing mode called &quot;logsdb&quot;. This is a highly optimized way of storing log data. This new way of indexing data uses 2.5 times less disk space than the default mode. You can read more about it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/logs-data-stream.html">here</a>. This mode automatically applies the best combination of settings for compression, index sorting, and other optimizations that weren't accessible to users before.</p>
<h3>Snapshot Lifecycle Management (SLM)</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/slm.png" alt="SLM" /></p>
<p>SLM allows you to back up your data and delete it from the main cluster, freeing up resources. If needed, data can be restored quickly for analysis, ensuring that you maintain the ability to investigate historical events without incurring high storage costs.</p>
<p>Learn more about SLM in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management.html">documentation</a>.</p>
<h3>Dealing with Large Amounts of Log Data</h3>
<p>Managing large volumes of log data can be challenging. Here are some strategies to optimize log management:</p>
<ol>
<li>Develop a logs deletion policy. Evaluate what data to collect and when to delete it.</li>
<li>Consider discarding DEBUG logs or even INFO logs earlier, and delete dev and staging environment logs sooner.</li>
<li>Aggregate short windows of identical log lines, which is especially useful for TCP security event logging.</li>
<li>For applications and code you control, consider moving some logs into traces to reduce log volume while maintaining detailed information.</li>
</ol>
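<p>The aggregation idea in step 3 can be sketched as follows; the batch-based windowing and the repeated-count message format are simplifications for illustration.</p>

```python
from collections import Counter

def aggregate(lines):
    """Collapse identical log lines within a window into one line plus a
    repeat count, cutting volume for chatty sources such as TCP security
    event logs. Here the 'window' is simply a pre-collected batch."""
    out = []
    for line, n in Counter(lines).items():
        out.append(line if n == 1 else f"{line} (repeated {n}x)")
    return out

batch = [
    "TCP connection denied from 10.0.0.5",
    "TCP connection denied from 10.0.0.5",
    "TCP connection denied from 10.0.0.5",
    "disk usage at 81%",
]
```

<p>A production shipper would key on a normalized message (timestamps stripped) and flush on a timer, but the volume reduction principle is the same.</p>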
<h3>Centralized vs. Decentralized Log Storage</h3>
<p>Data locality is an important consideration when managing log data. The costs of ingressing and egressing large amounts of log data can be prohibitively high, especially when dealing with cloud providers.</p>
<p>In the absence of regional redundancy requirements, your organization may not need to send all log data to a central location. Consider keeping log data local to the datacenter where it was generated to reduce ingress and egress costs.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html">Cross-cluster search</a> functionality enables users to search across multiple logging clusters simultaneously, reducing the amount of data that needs to be transferred over the network.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-ccr.html">Cross-cluster replication</a> is useful for maintaining business continuity in the event of a disaster, ensuring data availability even during an outage in one datacenter.</p>
<h2>Monitoring and Performance</h2>
<h3>Monitor Your Log Management System</h3>
<p>Using a dedicated monitoring cluster can help you track the performance of your Elastic deployment. <a href="https://www.elastic.co/guide/en/kibana/current/xpack-monitoring.html">Stack monitoring</a> provides metrics on search and indexing activity, helping you identify and resolve performance bottlenecks.</p>
<h3>Adjust Bulk Size and Refresh Interval</h3>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">Optimizing these settings</a> can balance performance and resource usage. Increasing bulk size and refresh interval can improve indexing efficiency, especially for high-throughput environments.</p>
<h2>Logging Best Practices</h2>
<h3>Adjust Log Levels</h3>
<p>Ensure that log levels are appropriately set for all applications. Customize log formats to facilitate easier ingestion and analysis. Properly configured log levels can reduce noise and make it easier to identify critical issues.</p>
<h3>Use Modern Logging Frameworks</h3>
<p>Implement logging frameworks that support structured logging. Adding metadata to logs enhances their usefulness for analysis. Structured logging formats, such as JSON, allow logs to be easily parsed and queried, improving the efficiency of log analysis.
If you fully control the application and are already using structured logging, consider using <a href="https://github.com/elastic/ecs-logging">Elastic's version of these libraries</a>, which can automatically parse logs into ECS fields.</p>
<h3>Leverage APM and Metrics</h3>
<p>For custom-built applications, Application Performance Monitoring (APM) provides deeper insights into application performance, complementing traditional logging. APM tracks transactions across services, helping you understand dependencies and identify performance bottlenecks.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/apm.png" alt="APM" /></p>
<p>Consider collecting metrics alongside logs. Metrics can provide insights into your system's performance, such as CPU usage, memory usage, and network traffic. If you're already collecting logs from your systems, adding metrics collection is usually a quick process.</p>
<p>Traces can provide even deeper insights into specific transactions or request paths, especially in cloud-native environments. They offer more contextual information and excel at tracking dependencies across services. However, implementing tracing is only possible for applications you own, and not all developers have fully embraced it yet.</p>
<p>A combined logging and tracing strategy is recommended, where traces provide coverage for newer instrumented apps, and logging supports legacy applications and systems you don't own the source code for.</p>
<h2>Conclusion</h2>
<p>Effective log management is essential for maintaining system reliability and performance in today's complex software environments. By following these best practices, you can optimize your log management process, reduce costs, and improve problem resolution times.</p>
<p>Key takeaways include:</p>
<ul>
<li>Ensure comprehensive log collection with a focus on normalization and common schemas.</li>
<li>Use appropriate processing and enrichment techniques, balancing between structured and unstructured logs.</li>
<li>Leverage full-text search and machine learning for efficient log analysis.</li>
<li>Implement cost-effective storage strategies and smart data retention policies.</li>
<li>Enhance your logging strategy with APM, metrics, and traces for a complete observability solution.</li>
</ul>
<p>Continuously evaluate and adjust your strategies to keep pace with the growing volume and complexity of log data, and you'll be well-equipped to ensure the reliability, performance, and security of your applications and infrastructure.</p>
<p>Check out our other blogs:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">Build better Service Level Objectives (SLOs) from logs and metrics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic">AWS VPC Flow log analysis with GenAI in Elastic</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch">Migrating 1 billion log lines from OpenSearch to Elasticsearch</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes">Pruning incoming log volumes with Elastic</a></li>
</ul>
<p>Ready to get started? Use Elastic Observability on Elastic Cloud — the hosted Elasticsearch service that includes all of the latest features.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/best-practices-log-management.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Revolutionizing big data management: Unveiling the power of Amazon EMR and Elastic integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/big-data-management-amazon-emr-elastic-integration</link>
            <guid isPermaLink="false">big-data-management-amazon-emr-elastic-integration</guid>
            <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Amazon EMR allows you to easily run and scale big data workloads. With Elastic’s native integration, you'll find the confidence to monitor, analyze, and optimize your EMR clusters, opening up exciting opportunities for your data-driven initiatives.]]></description>
<content:encoded><![CDATA[<p>In the dynamic realm of data processing, Amazon EMR takes center stage as an AWS-provided big data service, offering a cost-effective conduit for running Apache Spark and a plethora of other open-source applications. While the capabilities of EMR are impressive, vigilant monitoring holds the key to unlocking its full potential. This blog post explains the pivotal role of monitoring Amazon EMR clusters, highlighting the integration with Elastic<sup>®</sup>.</p>
<p>Elastic can make it easier for organizations to transform data into actionable insights and stop threats quickly with unified visibility across your environment — so mission-critical applications can keep running smoothly no matter what. From a free trial and fast deployment to sending logs to Elastic securely and frictionlessly, all you need to do is point and click to capture, store, and search data from your AWS services.</p>
<h2>Monitoring EMR via Elastic Observability</h2>
<p>In this article, we will delve into the following key aspects:</p>
<ul>
<li><strong>Enabling EMR cluster metrics for Elastic integration:</strong> Learn the intricacies of configuring an EMR cluster to emit metrics that Elastic can effectively extract, paving the way for insightful analysis.</li>
<li><strong>Harnessing Kibana<sup>®</sup> dashboards for EMR workload analysis:</strong> Discover the potential of utilizing Kibana dashboards to dissect metrics related to an EMR workload. By gaining a deeper understanding, we open the doors to optimization opportunities.</li>
</ul>
<h3>Key benefits of AWS EMR integration</h3>
<ul>
<li><strong>Comprehensive monitoring:</strong> Monitor the health and performance of your EMR clusters in real time. Track metrics related to cluster status and utilization, node status, IO, and many others, allowing you to identify bottlenecks and optimize your data processing.</li>
<li><strong>Log analysis:</strong> Dive deep into EMR logs with ease. Our integration enables you to collect and analyze logs from your clusters, helping you troubleshoot issues and gain valuable insights.</li>
<li><strong>Cost optimization:</strong> Understand the cost implications of your EMR clusters. By monitoring resource utilization, you can identify opportunities to optimize your cluster configurations and reduce costs.</li>
<li><strong>Alerting and notifications:</strong> Set up custom alerts based on EMR metrics and logs. Receive notifications when performance thresholds are breached, ensuring that you can take action promptly.</li>
<li><strong>Seamless integration:</strong> Our integration is designed for ease of use. Getting started is simple, and you can start monitoring your EMR clusters quickly.</li>
</ul>
<p>Accompanying these discussions is an illustrative solution architecture diagram, providing a visual representation of the intricacies and interactions within the proposed solution.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-1-flowchart-aws-emr.png" alt="1" /></p>
<h2>How to get started</h2>
<p>Getting started with AWS EMR integration in Observability is easy. Here's a quick overview of the steps:</p>
<h3>Prerequisites and configurations</h3>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>
<p>You will need an account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack and agent. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS EMR logging and analysis.</p>
</li>
<li>
<p>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</p>
</li>
<li>
<p>Finally, be sure to turn on EMR monitoring for the EMR cluster when you deploy the cluster.</p>
</li>
</ol>
<h3>Step 1: Create an account with Elastic</h3>
<p><a href="https://cloud.elastic.co/registration?fromURI=/home">Create an account on Elastic Cloud</a> by following the steps provided.</p>
<h3>Step 2: Add integration</h3>
<ol>
<li>Log in to your <a href="https://cloud.elastic.co/registration">Elastic Cloud on AWS</a> deployment.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-2-free-trial.png" alt="2 free trial" /></p>
<ol start="2">
<li>Click on <strong>Add Integration</strong>. You will be navigated to a catalog of supported integrations.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-3-welcome-home.png" alt="3 welcome home" /></p>
<ol start="3">
<li>Search and select <strong>Amazon EMR</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-4-integrations.png" alt="4 integrations" /></p>
<h3>Step 3: Configure integration</h3>
<ol>
<li>
<p>Click on the <strong>Add Amazon EMR</strong> button and provide the required details.</p>
</li>
<li>
<p>Provide the required access credentials to connect to your EMR instance.</p>
</li>
<li>
<p>You can choose to collect EMR metrics, EMR logs via S3, or EMR logs via CloudWatch.</p>
</li>
<li>
<p>Click on the <strong>Save and continue</strong> button at the bottom of the page.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-5-amazon-emr.png" alt="5 amazon emr" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-6-add-amazon-emr.png" alt="6 add amazon emr integration" /></p>
<h3>Step 4: Analyze and monitor</h3>
<p>Explore the data using the out-of-the-box dashboards available for the integration. Select <strong>Discover</strong> from the Elastic Cloud top-level menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-7-manage-deployment.png" alt="7 manage deployment" /></p>
<p>Or, create custom dashboards, set up alerts, and gain actionable insights into your EMR clusters' performance.</p>
<p>This integration streamlines the collection of vital metrics and logs, including Cluster Status, Node Status, IO, and Cluster Capacity. Some metrics gathered include:</p>
<ul>
<li><strong>IsIdle:</strong> Indicates that a cluster is no longer performing work, but is still alive and accruing charges</li>
<li><strong>ContainerAllocated:</strong> The number of resource containers allocated by the ResourceManager</li>
<li><strong>ContainerReserved:</strong> The number of containers reserved</li>
<li><strong>CoreNodesRunning:</strong> The number of core nodes working</li>
<li><strong>CoreNodesPending:</strong> The number of core nodes waiting to be assigned</li>
<li><strong>MRActiveNodes:</strong> The number of nodes presently running MapReduce tasks or jobs</li>
<li><strong>MRLostNodes:</strong> The number of nodes allocated to MapReduce that have been marked in a LOST state</li>
<li><strong>HDFSUtilization:</strong> The percentage of HDFS storage currently used</li>
<li><strong>HDFSBytesRead/Written:</strong> The number of bytes read/written from HDFS (This metric aggregates MapReduce jobs only, and does not apply for other workloads on Amazon EMR.)</li>
<li><strong>TotalUnitsRequested/TotalNodesRequested/TotalVCPURequested:</strong> The target total number of units/nodes/vCPUs in a cluster as determined by managed scaling</li>
</ul>
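<p>As an illustration of the analysis these metrics enable, the sketch below queries the ingested data from the Dev Tools console to find clusters that are sitting idle. The index pattern and field names (<code>metrics-aws.emr-*</code>, <code>aws.emr.metrics.IsIdle.avg</code>, <code>aws.dimensions.JobFlowId</code>) follow the usual layout of the Elastic AWS integrations but are assumptions here; check your own documents in Discover for the exact names before using this.</p>
<pre><code>GET metrics-aws.emr-*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: {
    &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-1h&quot; } }
  },
  &quot;aggs&quot;: {
    &quot;clusters&quot;: {
      &quot;terms&quot;: { &quot;field&quot;: &quot;aws.dimensions.JobFlowId&quot; },
      &quot;aggs&quot;: {
        &quot;idle&quot;: { &quot;avg&quot;: { &quot;field&quot;: &quot;aws.emr.metrics.IsIdle.avg&quot; } }
      }
    }
  }
}
</code></pre>
<p>A cluster whose <code>idle</code> average stays at 1 over the window is alive but doing no work, and is still accruing charges: a natural candidate for an alert.</p>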
<p><img src="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/elastic-blog-8-pie-graphs.png" alt="8 pie graph" /></p>
<h2>Conclusion</h2>
<p>Elastic is committed to fulfilling all your observability requirements, offering an effortless experience. Our integrations are designed to simplify the process of ingesting telemetry data, granting you convenient access to critical information for monitoring, analytics, and observability. The native AWS EMR integration underscores our dedication to delivering seamless solutions for your data needs. With this integration, you'll find the confidence to monitor, analyze, and optimize your EMR clusters, opening up exciting opportunities for your data-driven initiatives.</p>
<h2>Start a free trial today</h2>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da&amp;sc_channel=el&amp;ultron=gobig&amp;hulk=regpage&amp;blade=elasticweb&amp;gambit=mp-b">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/big-data-management-amazon-emr-elastic-integration/21-cubes.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from Cloudwatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs against giving up resources budgeted otherwise for workloads to be hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, after being turned on, these logs are orchestrated to write to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud providers. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default. Hence, it is important to choose selectively how much logging is done so that the organization’s audit requirements are met without excess. This is configured in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is submitted to the <code>kube-apiserver</code>. Not all flavors of cloud-provider-hosted Kubernetes clusters allow you to work with the <code>kube-apiserver</code> directly, however. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be enabled only through the control plane.</p>
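<p>For illustration, a minimal audit policy file looks like the sketch below. It records secret-related operations at <code>Metadata</code> level (so secret values are never captured) and everything else with full request and response bodies. On EKS you do not submit this file yourself; the control plane applies its own policy when you enable the audit log type.</p>
<pre><code>apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Never record secret payloads, only who touched which secret and when
  - level: Metadata
    resources:
      - group: &quot;&quot;
        resources: [&quot;secrets&quot;]
  # Record full request and response bodies for everything else
  - level: RequestResponse
</code></pre>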
<p><strong>In this blog we will be using Amazon Elastic Kubernetes Service (Amazon EKS), with the Kubernetes audit logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret named “empty-secret”, created by an admin user on EKS, is logged on AWS CloudWatch in the following format:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
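<p>Unwrapped from its CloudWatch envelope, the <code>message</code> field of such an entry is a standard <code>audit.k8s.io/v1</code> Event. A trimmed, illustrative example (names, IDs, and IPs are made up) looks like this:</p>
<pre><code>{
  &quot;kind&quot;: &quot;Event&quot;,
  &quot;apiVersion&quot;: &quot;audit.k8s.io/v1&quot;,
  &quot;level&quot;: &quot;Metadata&quot;,
  &quot;auditID&quot;: &quot;a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d&quot;,
  &quot;stage&quot;: &quot;ResponseComplete&quot;,
  &quot;verb&quot;: &quot;create&quot;,
  &quot;user&quot;: { &quot;username&quot;: &quot;kubernetes-admin&quot; },
  &quot;sourceIPs&quot;: [&quot;203.0.113.10&quot;],
  &quot;objectRef&quot;: {
    &quot;resource&quot;: &quot;secrets&quot;,
    &quot;namespace&quot;: &quot;default&quot;,
    &quot;name&quot;: &quot;empty-secret&quot;
  },
  &quot;responseStatus&quot;: { &quot;code&quot;: 201 }
}
</code></pre>
<p>It is this nested JSON that we need to lift into the <code>kubernetes.audit</code> field for the Kubernetes integration to parse.</p>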
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being written to CloudWatch, let’s discuss how to get them ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch, but using this integration by default ingests the CloudWatch JSON as-is; that is, the real audit log JSON stays nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
<p>Elasticsearch has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS mappings designed for parsing Kubernetes audit logs</a>, already implemented in the Kubernetes integration, for the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>What we’re going to do is:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module (in our case, AWS CloudWatch, since this is where the logs reside). We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read the logs from CloudWatch. <strong>Note:</strong> there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this to follow best practices of isolation and composability)</p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>kubernetes.audit_logs-default</code>, which in turn applies all the proper mappings and ingest pipelines for the incoming message</p>
</li>
<li>
<p>The overall flow will be</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create an AWS Custom Logs (CloudWatch) integration:</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log group ARN, add Tags, and enable Preserve original event if you want to, then Save this integration and exit the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to extend what the generic managed pipeline does. We find the name for the custom pipeline by looking at the managed pipeline that is created as an asset when the AWS Custom Logs integration is installed. In this case, we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the two requests below. Here, we extract the <code>message</code> field from the CloudWatch JSON and put the value in a field called <code>kubernetes.audit</code>. Then, we reroute the message to the default Kubernetes audit dataset that comes with the Kubernetes integration</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    &quot;processors&quot;: [
      {
        &quot;pipeline&quot;: {
          &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
          &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
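<p>Before sending live traffic through these pipelines, you can dry-run them with the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/simulate-pipeline-api.html">simulate pipeline API</a>. The sketch below feeds a minimal, made-up document through the routing pipeline; inspect the result to confirm that <code>message</code> is replaced by a parsed <code>kubernetes.audit</code> object (the final <code>reroute</code> only takes full effect during real ingestion).</p>
<pre><code>POST _ingest/pipeline/logs-aws_logs.generic@custom/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;{\&quot;kind\&quot;:\&quot;Event\&quot;,\&quot;apiVersion\&quot;:\&quot;audit.k8s.io/v1\&quot;,\&quot;verb\&quot;:\&quot;create\&quot;,\&quot;user\&quot;:{\&quot;username\&quot;:\&quot;kubernetes-admin\&quot;}}&quot;
      }
    }
  ]
}
</code></pre>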
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> that writes to the pipeline called <code>logs-kubernetes.audit_logs-1.62.2</code> by default</p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic works against the field <code>kubernetes.audit</code>. This is why the json processor in the above code snippet creates a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which has the format <code>logs-&lt;datasetname&gt;-version</code>)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
<h3>3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will use Elastic Agent and enroll using Fleet and the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>, and for this exercise we will deploy using Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which provides the ability to see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the AWS Custom Logs (CloudWatch) integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream and apply all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Centrally Managing OTel Collectors with Elastic Agent and Fleet]]></title>
            <link>https://www.elastic.co/observability-labs/blog/centrally-managed-otel-collectors-with-elastic-fleet</link>
            <guid isPermaLink="false">centrally-managed-otel-collectors-with-elastic-fleet</guid>
            <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[How Elastic Agent 9.3 unifies Beats and OpenTelemetry (OTel) data collection and delivers central management with Elastic Fleet.]]></description>
            <content:encoded><![CDATA[<p>&quot;The dream of OpenTelemetry is vendor-neutral, standardised observability.
The challenge nobody mentions is how you operate hundreds, or thousands, of those collectors in production.&quot;</p>
<p>OpenTelemetry has won the hearts of the industry.
Adoption is accelerating: the CNCF's 2024 Observability survey found OTel to be the fastest-growing project in the foundation's history, with the OTel Collector registering hundreds of millions of downloads.
The proposition is compelling: write instrumentation once, ship it anywhere, avoid lock-in.</p>
<p>But here is what every platform team discovers once they cross into production: the collector sprawl problem.
Hundreds of collector instances deployed across regions, Kubernetes namespaces, and bare-metal hosts. Configuration drift creeping in.
An upgrade that has to be co-ordinated across a fleet of independent processes. A security patch that someone has to manually roll out to each one.
And zero visibility into which collectors are running, healthy, or stuck.</p>
<p>This is the gap between &quot;deploying OpenTelemetry&quot; and &quot;operating OpenTelemetry at scale.&quot;
With Elastic 9.3, Elastic Agent closes that gap entirely.
The Elastic Agent is now built on Elastic's Distribution of the OpenTelemetry Collector (EDOT) and, when managed by Fleet, gives platform teams a single control plane for configuring, updating, and monitoring every OTel collector in their estate — all while remaining compatible with the Beats-based integrations they already rely on.</p>
<h2>The Collector Sprawl Problem and Why It Matters</h2>
<p>OpenTelemetry's success has created a quiet operational debt for many organisations.
Individual teams adopt the collector for their services: logs here, metrics there, a custom pipeline for the new microservice.
Without a centralised management layer, each of these collectors becomes an independent snowflake: its own config file, its own upgrade cycle, its own failure domain.</p>
<p>The consequences are predictable.
Configuration drift means collectors running different versions of the same pipeline, producing subtly incompatible data.
Compliance teams ask &quot;show me all the places data is collected and where it goes&quot;, and the honest answer is a spreadsheet that's already out of date.</p>
<p>This isn't a niche problem.
A Gartner analysis of enterprise observability programmes consistently identifies operational overhead as the top barrier to expanding OTel adoption beyond initial pilots.
The technology works. The tooling to manage it at scale is what's been missing.</p>
<h2>How Elastic Agent Became an OTel Collector</h2>
<p>To understand the significance of this, it helps to understand what Elastic Agent used to be, and what it is now.</p>
<p>Before version 9.3, Elastic Agent acted as a supervisor process: it managed a collection of separate Beats sub-processes (Filebeat, Metricbeat, Winlogbeat and so on), each running its own input/output lifecycle and each consuming its own memory footprint.
The agent coordinated them, but the fundamental model was a collection of discrete daemons running under a parent.</p>
<p>With 9.3, that model has been replaced.
Elastic Agent is now itself an instance of the EDOT Collector: Elastic's hardened, production-supported distribution of the upstream OTel Collector.
The architectural shift has three important consequences.</p>
<p><strong>First</strong>, the process model simplifies dramatically.
Instead of a supervisor managing multiple sub-process lifecycles, there is a single EDOT Collector process.
This means a smaller memory footprint, fewer things that can fail independently, and fewer processes to observe for health and performance.</p>
<p><strong>Second</strong>, Beats functionality is preserved, not discarded.
Rather than forcing a breaking migration, Elastic has introduced <em>Beats Receivers</em>: beat inputs and processors re-packaged as native OTel receiver components.
A Filestream input is enabled by a <code>filebeatreceiver</code>.
The same Filebeat configuration YAML you write today is automatically translated into the corresponding EDOT receiver configuration at runtime.
Existing integrations, dashboards, and ingest pipelines continue to work without modification.</p>
<p><strong>Third</strong>, the agent is now a first-class participant in the OTel ecosystem.
It speaks OTLP natively, it runs standard OTel receivers, and it can be configured to sit alongside any other OTel-compatible tool in a modern observability pipeline.</p>
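<p>To make the Beats Receivers point concrete, here is a heavily simplified sketch of what such a collector configuration looks like: a Beats-derived <code>filebeatreceiver</code> and a native <code>otlp</code> receiver living side by side in one OTel pipeline definition. The exact keys, paths, and endpoint are illustrative assumptions; in practice, Fleet generates the real configuration for you from the agent policy.</p>
<pre><code>receivers:
  filebeatreceiver:
    filebeat:
      inputs:
        - type: filestream        # same input config you would write for Filebeat
          id: app-logs
          paths:
            - /var/log/app/*.log
    output:
      otelconsumer: {}            # hand events to the OTel pipeline
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  elasticsearch:
    endpoint: https://my-deployment.es.example.com:443   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [filebeatreceiver, otlp]
      exporters: [elasticsearch]
</code></pre>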
<h2>Central Management with Fleet: Configuration, Lifecycle, and Visibility</h2>
<p>The architectural shift above would be valuable on its own. But it becomes transformative when combined with Elastic Fleet, the centralised management plane for Elastic Agents.</p>
<p>Fleet gives platform and SRE teams a single console from which to manage every Elastic Agent (and by extension, every EDOT Collector instance) in their estate.
The capabilities break into three categories: configuration management, lifecycle management, and fleet-wide observability.</p>
<h3>Configuration management at scale</h3>
<p>With Fleet, you define an <em>Agent Policy</em> — a declarative description of what a collector should do.
What data should it collect?
Via which receivers?
Where should it export?
The policy is authored once in Fleet's UI (or via its API), and pushed automatically to every agent enrolled in that policy.
Change the policy, and every affected collector receives the update.
No SSH.
No Ansible playbook to maintain.
No configuration drift.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/policy-health.jpg" alt="Fleet Policy Health" /></p>
<p>Fleet pushes policies to enrolled agents across any environment. Agents send heartbeat and health data back, giving a live inventory of every collector in the estate.</p>
<h3>Lifecycle management: upgrades, enrolment, and remediation</h3>
<p>Perhaps the most operationally significant benefit of Fleet management is lifecycle control.
With Fleet, upgrading a collector is a policy action: select the target version, select the scope (all agents, a specific policy group, a canary subset), and click.
Fleet orchestrates the rolling upgrade, tracking status per agent and surfacing failures immediately.</p>
<p>This changes the security calculus fundamentally.
When a vulnerability is disclosed in the OTel Collector binary, patching is a Fleet operation measured in minutes, not a change-management ceremony measured in days across SSH sessions to individual hosts.</p>
<p>Fleet also handles enrolment and de-enrolment.
New hosts added to your infrastructure can be auto-enrolled into the appropriate policy based on tags or deployment tooling.
Agents on decommissioned hosts can be removed from Fleet's inventory, ensuring your observability map reflects your actual infrastructure.</p>
<h3>Fleet-wide observability of your collectors</h3>
<p>Every Fleet-managed Elastic Agent ships monitoring telemetry about itself: CPU and memory consumption, event throughput, error rates, pipeline latency.
This data flows into Elastic and is surfaced in the Fleet UI, giving you a live dashboard of every collector in your estate, not just the ones you happen to be watching.</p>
<p>For the first time, &quot;how healthy is my observability pipeline?&quot; becomes a question with a real-time, fleet-wide answer.
You can identify agents that have stopped sending data, agents consuming unexpectedly high resources, and agents that have fallen behind on queue processing — before those problems surface as gaps in your monitoring data.</p>
<p>In the near future, this capability will be extended to standalone (non-Fleet-managed) agents and to third-party OTel collectors provided by other vendors.
These collectors can be configured via some other means, yet still be monitored in Fleet, covering both resource consumption and component pipeline health.</p>
<h2>The Hybrid Agent: Beats Data and OTel Data, Simultaneously</h2>
<p>One of the most practically significant capabilities introduced in 9.3 is what Elastic calls the <em>Hybrid Agent</em>: an Elastic Agent that can run both Beats-based receivers and native OTel receivers in the same pipeline, at the same time.
This does not change anything for existing installations.</p>
<p>This matters enormously for real-world adoption. Most organisations arriving at OTel in 2025 and 2026 are not starting from a blank slate.
They have years of investment in Beats-based integrations: Filebeat-powered log collection, Metricbeat-powered host metrics, bespoke ingest pipelines in Elasticsearch that normalise and enrich that data into ECS (Elastic Common Schema) format.
The business value locked in those integrations (the dashboards, the alerts, the correlation logic) is not something they can afford to throw away in order to &quot;go OTel.&quot;</p>
<p>The Hybrid Agent solves this by making the two worlds coexist.
For example, in a single agent policy you can simultaneously configure:</p>
<ul>
<li>A <code>filebeatreceiver</code> collecting application logs in ECS format, routed through your existing ingest pipeline to its existing data stream</li>
<li>A native OTel <code>filelog</code> receiver collecting OTel-native telemetry from your new services instrumented with the OTel SDK, stored in OTel-native data streams without touching ingest pipelines</li>
<li>An OTel <code>hostmetrics</code> receiver collecting system metrics in semantic convention format alongside your existing Metricbeat-derived system metrics</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/hybrid-agent.jpg" alt="Hybrid Agent" /></p>
<p>The two lanes are independent.
Beats-receiver data travels through ingest pipelines and lands in ECS-formatted data streams, exactly as it always has.
Native OTel data follows OTel semantic conventions and is stored directly in OTel-native data streams, bypassing ingest pipelines.
Your existing dashboards and alerts continue to work. Your new OTel-native workloads get the full OTel experience.
The same agent, the same Fleet policy, the same management console.</p>
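<p>Under the hood, the two lanes map naturally onto separate named pipelines in the collector's service graph. A sketch, with illustrative receiver and exporter names:</p>
<pre><code>service:
  pipelines:
    logs/ecs:                   # Beats lane: ECS documents, via ingest pipelines
      receivers: [filebeatreceiver]
      exporters: [elasticsearch/ecs]
    logs/otel:                  # OTel-native lane: semantic conventions, no ingest pipelines
      receivers: [filelog, otlp]
      exporters: [elasticsearch/otel]
</code></pre>
<p>Because the lanes never share a pipeline, neither can interfere with the other's format or routing.</p>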
<p>This co-existence is the practical answer to the question every platform team eventually faces: &quot;We want to adopt OTel properly but we can't break what we already have.&quot;
The Hybrid Agent lets you migrate incrementally, service by service, on your timeline.</p>
<h2>The Integration Catalogue: Turning Configuration into a One-Click Operation</h2>
<p>Configuration management at scale is only as good as the configurations themselves.
Elastic's integration catalogue — over 500 packages covering everything from NGINX and PostgreSQL to AWS CloudTrail and Kubernetes — extends naturally to the Hybrid Agent model.</p>
<p>From 9.3 onwards, the catalogue includes <em>OTel integration packages</em> alongside the existing Beats-based ones. Each OTel package contains two components:</p>
<ul>
<li>An <em>Input package</em>: the configuration for the corresponding OTel receiver (receivers, processors, pipeline wiring), ready to be applied to a Hybrid Agent policy</li>
<li>A <em>Content package</em>: the assets associated with the application: pre-built dashboards, alerts, index templates, and saved queries, all calibrated for OTel semantic convention data</li>
</ul>
<p>When an operator adds an OTel integration to an Agent Policy in Fleet, the receiver configuration is pushed to all enrolled agents.
When those agents start ingesting data and it arrives in Elasticsearch, the content package assets are automatically installed based on metadata in the data received.
The dashboard is ready before you've had time to wonder where it is.</p>
<p>The same policy can hold both OTel integrations and legacy Beats integrations.
A real-world agent policy might simultaneously collect system metrics via the OTel <code>hostmetrics</code> receiver, application logs via the <code>filebeatreceiver</code>, and APM data via OTLP — all from one policy, all managed from Fleet, all visible in a unified Kibana experience.</p>
<p>A technical walkthrough of how this is done for NGINX data collection can be found <a href="https://www.elastic.co/observability-labs/blog/hybrid-elastic-agent-opentelemetry-integration">here</a> for reference.
Management of Elastic Agents is currently done via the existing Fleet protocols; in the near future this will move to OpAMP, so that Fleet can also manage third-party OTel Collectors.</p>
<p>For organisations on platforms not yet in Elastic's OS support matrix, third-party OTel Collectors (such as Red Hat's OpenShift-native collector) can send data to Elastic using the OTLP exporter and be observed alongside all other collectors in their fleet.</p>
<h2>What This Means in Practice: A Migration Story</h2>
<p>Consider a mid-sized platform team operating 200 Linux hosts across three regions, currently running Elastic Agent 8.x with a mix of Filebeat and Metricbeat integrations.
Their new services are being instrumented with the OTel SDK and they want to standardise on OTel going forward without disrupting the monitoring coverage they already have.</p>
<p>With a Fleet-managed upgrade to 9.3, their existing agents become Hybrid Agents automatically.
Their Filebeat and Metricbeat configurations are internally translated to Beats receiver configurations and continue to run unmodified.
Their existing dashboards still populate. Their ingest pipelines still fire. Nothing breaks.</p>
<p>They then add OTel integration packages to their Fleet policies for each new service. The OTel-instrumented microservices start sending OTLP data, received by native OTel receivers in the same agents.
OTel-native dashboards appear automatically in Kibana. They now have both data universes in one place, managed from one console, visible in one interface.</p>
<p>Over the following quarters, as Beats-based integrations for their remaining services are superseded by OTel equivalents in the catalogue, they migrate them one by one, updating the Agent Policy in Fleet and watching the transition happen across all 200 hosts simultaneously, without touching a single one directly.</p>
<h2>Looking Forward</h2>
<p>Elastic has made a clear architectural bet: OpenTelemetry is the future of observability data collection, and the right response to that future is not to build a parallel OTel tool alongside the existing stack — it is to evolve the existing stack into OTel.
The Hybrid Agent and EDOT Collector are the result of that bet.</p>
<p>Fleet central management is the operational layer that makes that bet practical at scale.
OpenTelemetry gives you standardised, vendor-neutral instrumentation.
Fleet gives you the operational control plane to manage those collectors like the production infrastructure they are, not like artisanal YAML files scattered across your estate.</p>
<p>The collector sprawl problem is solvable.
The answer is a managed, policy-driven, centrally observable fleet of EDOT Collectors, and in Elastic 9.3, that answer is production-ready today.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/centrally-managed-otel-collectors-with-elastic-fleet/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Collecting JMX metrics with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/collecting-jmx-metrics-opentelemetry</link>
            <guid isPermaLink="false">collecting-jmx-metrics-opentelemetry</guid>
            <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to collect Tomcat JMX metrics with OpenTelemetry using the Java agent or jmx-scraper, then extend coverage with custom YAML rules and validate output.]]></description>
            <content:encoded><![CDATA[<p>Java Management Extensions (JMX) is the JVM's built-in management interface, exposing runtime and component metrics such as memory, threads, and request pools. It is useful for collecting operational telemetry from Java services without changing application code.</p>
<p>Collecting JMX metrics with OpenTelemetry can be done in two main ways depending on your environment, requirements and constraints:</p>
<ul>
<li>from inside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Instrumentation Java</a> agent (or <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>)</li>
<li>from outside the JVM with the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper">jmx-scraper</a>.</li>
</ul>
<p>Throughout this article, we will use the term &quot;Java agent&quot; to refer to the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java instrumentation</a> agent. Everything here also applies to Elastic's own distribution (<a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java</a>), which is based on it and provides the same features.</p>
<p>This walkthrough uses a <a href="https://tomcat.apache.org/">Tomcat</a> server as the target and shows how to validate which metrics are emitted with the logging exporter.</p>
<p>The configuration examples in this article use Java system properties passed as <code>-D</code> flags in the JVM startup command; the equivalent environment variables can be used instead.</p>
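<p>The mapping between the two forms follows the standard OpenTelemetry configuration convention: the environment variable name is the system property name uppercased, with dots and dashes replaced by underscores. A tiny sketch of that rule (the helper class and method names are ours, not part of any OTel API):</p>

```java
import java.util.Locale;

// Sketch of the OpenTelemetry naming convention: the environment variable
// equivalent of a system property is the property name uppercased, with
// '.' and '-' replaced by '_'. Helper names here are ours, not an OTel API.
public class OtelPropertyNames {
    static String toEnvVar(String systemProperty) {
        return systemProperty.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("otel.metrics.exporter"));  // OTEL_METRICS_EXPORTER
        System.out.println(toEnvVar("otel.jmx.target.system")); // OTEL_JMX_TARGET_SYSTEM
    }
}
```

<p>For example, <code>-Dotel.jmx.target.system=tomcat</code> and <code>OTEL_JMX_TARGET_SYSTEM=tomcat</code> are equivalent.</p>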
<h2>Prerequisites</h2>
<ul>
<li>A local <a href="https://tomcat.apache.org/">Tomcat</a> install (or any JVM app you can start with custom JVM flags)</li>
<li>Java 8+ on the host (recent Tomcat versions may require a newer Java version)</li>
<li>An OpenTelemetry Collector endpoint if you want to ship metrics beyond local logging</li>
</ul>
<h2>Choosing between the Java agent and jmx-scraper</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/collection_options.png" alt="Java agent vs jmx-scraper" /></p>
<p>Use the Java agent (or EDOT Java) when you can modify JVM startup flags and want in-process collection with full context from the running application: this lets you capture traces, logs, and metrics with a single tool deployment.</p>
<p>Use jmx-scraper when you cannot install an agent on the JVM or prefer out-of-process collection from a separate host. This requires configuring the JVM and the network for remote JMX access, including authentication and credential management.</p>
<p>Both approaches rely on the same JMX metric mappings: use the logging exporter for validation, then switch to OTLP to send metrics to a Collector or any other OTLP endpoint.</p>
<h2>Option 1: Collect JMX metrics inside the JVM with the Java agent</h2>
<p>OpenTelemetry Java instrumentation ships with a curated set of JMX metric mappings. For Tomcat, you just need to enable the Java agent and set <code>otel.jmx.target.system=tomcat</code>.</p>
<h3>Step 1 - Download the OpenTelemetry Java agent</h3>
<p>The agent is downloaded into <code>/opt/otel</code>, but you can choose any location on the host.
Make sure the path is consistent with the <code>-javaagent</code> flag in the next step.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
</code></pre>
<h3>Step 2 - Configure Tomcat with <code>bin/setenv.sh</code></h3>
<p>Create or update <code>bin/setenv.sh</code> so Tomcat launches with the agent and JMX target system enabled.</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.metrics.exporter=otlp,logging \
  -Dotel.jmx.target.system=tomcat&quot;
</code></pre>
<p>This will configure the agent to log metrics (using the <code>logging</code> exporter) in addition to sending them to the Collector.</p>
<h3>Step 3 - Validate the emitted metrics</h3>
<p>Start Tomcat and watch stdout.</p>
<pre><code class="language-bash">./bin/catalina.sh run
</code></pre>
<p>By default, metrics are sampled and exported every minute, so you might have to wait a bit for the first metrics to be logged.
If needed, you can use the <code>otel.metric.export.interval</code> configuration option to adjust the export frequency.</p>
<p>You should see logging exporter output with JVM and Tomcat metrics. Look for lines containing the <code>LoggingMetricExporter</code> class name.</p>
<pre><code class="language-text">INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=tomcat.threadpool.currentThreadsBusy, ...}
INFO io.opentelemetry.exporter.logging.LoggingMetricExporter - MetricData{name=jvm.memory.used, ...}
</code></pre>
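<p>If you want to script this validation step, the metric name can be extracted from such log lines with a small helper. Note that the <code>MetricData{name=...}</code> text is a debug format, not a stable contract, so treat this as a local validation aid only:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the metric name from a LoggingMetricExporter line such as:
//   INFO ...LoggingMetricExporter - MetricData{name=jvm.memory.used, ...}
// Returns null for lines that do not contain metric data.
public class MetricLogParser {
    private static final Pattern NAME = Pattern.compile("MetricData\\{name=([^,}]+)");

    static String metricName(String logLine) {
        Matcher m = NAME.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "INFO io.opentelemetry.exporter.logging.LoggingMetricExporter"
                + " - MetricData{name=tomcat.threadpool.currentThreadsBusy, ...}";
        System.out.println(metricName(line)); // tomcat.threadpool.currentThreadsBusy
    }
}
```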
<h3>Step 4 - Send metrics to a Collector</h3>
<p>Once metric capture is validated, you are ready to send metrics to a collector.</p>
<p>You will have to:</p>
<ul>
<li>remove the <code>logging</code> exporter as it's no longer necessary for production</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<p>The <code>bin/setenv.sh</code> file should be modified to look like this:</p>
<pre><code class="language-bash">#!/bin/bash
export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=tomcat-demo \
  -Dotel.jmx.target.system=tomcat \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<p>When using the Java agent, JVM metrics are automatically captured by the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry"><code>runtime-telemetry</code></a> module, so it is not necessary to include <code>jvm</code> in the <code>otel.jmx.target.system</code> configuration option.</p>
<h2>Option 2: Collect JMX metrics from outside the JVM with jmx-scraper</h2>
<p>When you cannot install an agent in the JVM or if only metrics are required, jmx-scraper lets you query JMX remotely and export metrics to an OTLP endpoint.</p>
<h3>Step 1 - Enable remote JMX on Tomcat</h3>
<p>Add JMX remote options to <code>bin/setenv.sh</code> and create access/password files.</p>
<blockquote>
<p><strong>Warning:</strong> This uses trivial credentials and disables SSL. Do not use this configuration in production.</p>
</blockquote>
<pre><code class="language-bash">mkdir -p /opt/jmx
cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.access
monitorRole readonly
EOF

cat &lt;&lt;EOF &gt; ${CATALINA_HOME}/jmxremote.password
monitorRole monitorPass
EOF

chmod 600 ${CATALINA_HOME}/jmxremote.password

export CATALINA_OPTS=&quot;$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.access.file=${CATALINA_HOME}/jmxremote.access \
  -Dcom.sun.management.jmxremote.password.file=${CATALINA_HOME}/jmxremote.password \
  -Djava.rmi.server.hostname=127.0.0.1&quot;
</code></pre>
<h3>Step 2 - Download jmx-scraper</h3>
<p>The jmx-scraper is downloaded into <code>/opt/otel</code>, but you can choose any location on the host.</p>
<pre><code class="language-bash">mkdir -p /opt/otel
curl -L -o /opt/otel/opentelemetry-jmx-scraper.jar \
  https://github.com/open-telemetry/opentelemetry-java-contrib/releases/latest/download/opentelemetry-jmx-scraper.jar
</code></pre>
<h3>Step 3 - Check the JMX connection</h3>
<p>Run jmx-scraper with the credentials from the previous step to confirm it can reach Tomcat. If the credentials are wrong, you will see authentication errors.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat \
  -test
</code></pre>
<p>You should see one of the following in the standard output:</p>
<ul>
<li><code>JMX connection test OK</code> if the connection and authentication are successful</li>
<li><code>JMX connection test ERROR</code> otherwise</li>
</ul>
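<p>Under the hood, the connection test amounts to opening a JMX connector connection and issuing a query. The following Java sketch illustrates the same mechanism; to stay self-contained it starts an in-process RMI connector server over the platform MBeanServer rather than connecting to the remote Tomcat endpoint. Against a real server you would pass the <code>service:jmx:rmi:///jndi/rmi://host:9010/jmxrmi</code> address and the credentials in the environment map instead:</p>

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

// Minimal JMX "connection test": connect through a JMX connector and read
// one attribute. An in-process connector server stands in for the remote
// Tomcat JVM so the example runs on its own.
public class JmxConnectionTest {
    static Integer readThreadCount() throws Exception {
        MBeanServer platform = ManagementFactory.getPlatformMBeanServer();
        // No host/port given: the RMI connector server picks a local endpoint
        // and encodes it into the address returned by getAddress().
        JMXConnectorServer server = JMXConnectorServerFactory.newJMXConnectorServer(
                new JMXServiceURL("service:jmx:rmi://"), null, platform);
        server.start();
        JMXConnector connector = JMXConnectorFactory.connect(server.getAddress());
        try {
            return (Integer) connector.getMBeanServerConnection()
                    .getAttribute(new ObjectName("java.lang:type=Threading"), "ThreadCount");
        } finally {
            connector.close();
            server.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("JMX connection test OK, live threads: " + readThreadCount());
    }
}
```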
<h3>Step 4 - Validate the emitted metrics</h3>
<p>Using the logging exporter lets you inspect metrics and attributes before sending them to a collector.</p>
<p>To capture both Tomcat and JVM metrics, set <code>otel.jmx.target.system</code> to <code>tomcat,jvm</code>.</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.metrics.exporter=logging
</code></pre>
<h3>Step 5 - Send metrics to a Collector</h3>
<p>After validation, to send metrics to an OTLP endpoint, you will have to:</p>
<ul>
<li>remove the <code>-Dotel.metrics.exporter</code> flag to restore the default <code>otlp</code> exporter</li>
<li>configure the OTLP endpoint (<code>otel.exporter.otlp.endpoint</code>) and headers (<code>otel.exporter.otlp.headers</code>) if needed</li>
</ul>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.exporter.otlp.endpoint=https://your-collector:4317 \
  -Dotel.exporter.otlp.headers=&quot;Authorization=Bearer &lt;your-token&gt;&quot;
</code></pre>
<h2>Customizing the JMX Metrics Collection</h2>
<p>Once the built-in Tomcat and JVM mappings are flowing, you can add custom rules with <code>otel.jmx.config</code>. Create a YAML file and pass its path alongside <code>otel.jmx.target.system</code>.</p>
<p>For example, the following <code>custom.yaml</code> file captures the <code>custom.jvm.thread.count</code> metric from the <code>java.lang:type=Threading</code> MBean:</p>
<pre><code class="language-yaml">---
rules:
  - bean: &quot;java.lang:type=Threading&quot;
    mapping:
      ThreadCount:
        metric: custom.jvm.thread.count
        type: gauge
        unit: &quot;{thread}&quot;
        desc: Current number of live threads.
</code></pre>
<p>For a complete reference on the configuration format and syntax, refer to the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jmx-metrics">jmx-metrics</a> module in OpenTelemetry Java instrumentation.</p>
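<p>Before wiring a rule into YAML, you can sanity-check the MBean and attribute it targets from a scratch Java program. This sketch reads the same <code>java.lang:type=Threading</code> <code>ThreadCount</code> attribute that the example rule above maps to <code>custom.jvm.thread.count</code>, using the in-process platform MBeanServer:</p>

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Reads the MBean attribute targeted by the custom.yaml rule above,
// directly from the in-process platform MBeanServer.
public class ThreadCountMBean {
    static int liveThreads() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return (Integer) server.getAttribute(
                new ObjectName("java.lang:type=Threading"), "ThreadCount");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ThreadCount = " + liveThreads()); // at least 1 (the main thread)
    }
}
```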
<p>This custom configuration can be used with both jmx-scraper and the Java agent, as both support the <code>otel.jmx.config</code> configuration option. For example, with jmx-scraper:</p>
<pre><code class="language-bash">java -jar /opt/otel/opentelemetry-jmx-scraper.jar \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi \
  -Dotel.jmx.username=monitorRole \
  -Dotel.jmx.password=monitorPass \
  -Dotel.jmx.target.system=tomcat,jvm \
  -Dotel.jmx.config=/opt/otel/jmx/custom.yaml
</code></pre>
<p>You can pass multiple custom files as a comma-separated list to <code>otel.jmx.config</code> when you need to organize metrics by team or component.</p>
<h2>Using the JMX Metrics in Kibana</h2>
<p>Once you have collected the JMX metrics using one of the approaches described in this article, you can start using them in Kibana.
You can build custom dashboards and visualizations to explore and analyze the metrics, create custom alerts on top of them or build MCP tools and AI Agents to use them in your agentic workflows.</p>
<p>Here is an example of how you can use the JMX metrics in Kibana through ES|QL:</p>
<pre><code class="language-esql">TS metrics*
| WHERE telemetry.sdk.language == &quot;java&quot;
| WHERE service.name == ?instance
| STATS
    request_rate = SUM(RATE(tomcat.request.count))
  BY Time = BUCKET(@timestamp, 100, ?_tstart, ?_tend)
</code></pre>
<p>You can use the native metric and dimension names of the JMX metrics to build your queries.
With the <code>TS</code> command you get first-class support for time series aggregation functions and dimensions on your metrics.
Queries like these are the building blocks for your dashboards, alerts, workflows, and AI agent tools.</p>
<p>Here is an example of a dashboard that visualizes the typical JMX metrics for Apache Tomcat:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/tomcat_jmx_dashboard.png" alt="Tomcat Dashboard" /></p>
<h2>Conclusion</h2>
<p>In this article, we have seen how to collect JMX metrics with OpenTelemetry using the Java agent or jmx-scraper.
We have also seen how to use the JMX metrics in Kibana through ES|QL to build custom dashboards, alerts, workflows and AI agent tools.</p>
<p>This is just the beginning of what you can do with the JMX metrics and Elastic Observability.
Try it out yourself and explore the full potential of your JMX metrics when combined with powerful features provided by the Elastic Observability platform.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/collecting-jmx-metrics-opentelemetry/jmx_header_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation]]></title>
            <link>https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation</link>
            <guid isPermaLink="false">continuous-profiling-distributed-tracing-correlation</guid>
            <pubDate>Thu, 28 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Frustrated by slow traces but unsure where the code bottleneck lies? Elastic Universal Profiling correlates profiling stacktraces with OpenTelemetry (OTel) traces, helping you identify and pinpoint the exact lines of code causing performance issues.]]></description>
            <content:encoded><![CDATA[<p>Observability goes beyond monitoring; it's about truly understanding your system. To achieve this comprehensive view, practitioners need a unified observability solution that natively combines insights from metrics, logs, traces, and crucially, <strong>continuous profiling</strong>. While metrics, logs, and traces offer valuable insights, they can't answer the all-important &quot;why.&quot; Continuous profiling signals act as a magnifying glass, providing granular code visibility into the system's hidden complexities. They fill the gap left by other data sources, enabling you to answer critical questions –– why is this trace slow? Where exactly in the code is the bottleneck residing?</p>
<p>Traces provide the &quot;what&quot; and &quot;where&quot; — what happened and where in your system. Continuous profiling refines this understanding by pinpointing the &quot;why&quot; and validating your hypotheses about the &quot;what.&quot; Just like a full-body MRI scan, Elastic's whole-system continuous profiling (powered by eBPF) uncovers unknown-unknowns in your system. This includes not just your code, but also third-party libraries and kernel activity triggered by your application transactions. This comprehensive visibility improves your mean-time-to-detection (MTTD) and mean-time-to-recovery (MTTR) KPIs.</p>
<p><em>[Related article:</em> <a href="https://www.elastic.co/blog/observability-profiling-metrics-logs-traces"><em>Why metrics, logs, and traces aren’t enough</em></a><em>]</em></p>
<h2>Bridging the disconnect between continuous profiling and OTel traces</h2>
<p>Historically, continuous profiling signals have been largely disconnected from OpenTelemetry (OTel) traces. Here's the exciting news: we're bridging this gap! We're introducing native correlation between continuous profiling signals and OTel traces, starting with Java.</p>
<p>Imagine this: You're troubleshooting a performance issue and identify a slow trace. Whole-system continuous profiling steps in, acting like an MRI scan for your entire codebase and system. It narrows down the culprit to the specific lines of code hogging CPU time within the context of your distributed trace. This empowers you to answer the &quot;why&quot; question with minimal effort and confidence, all within the same troubleshooting context.</p>
<p>Furthermore, by correlating continuous profiling with distributed tracing, Elastic Observability customers can measure the cloud cost and CO<sub>2</sub> impact of every code change at the service and transaction level.</p>
<p>This milestone is significant, especially considering the recent developments in the OTel community. With <a href="https://www.cncf.io/blog/2024/03/19/opentelemetry-announces-support-for-profiling/">OTel adopting profiling</a> and Elastic <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating the industry’s most advanced eBPF-based continuous profiling agent to OTel</a>, we're set for a game-changer in observability — empowering OTel end users with a correlated system visibility that goes from a trace span in the userspace down to the kernel.</p>
<p>Furthermore, achieving this goal, especially with Java, presented significant challenges and demanded serious engineering R&amp;D. This blog post will delve into these challenges, explore the approaches we considered in our proof-of-concepts, and explain how we arrived at a solution that can be easily extended to other OTel language agents. Most importantly, this solution correlates traces with profiling signals at the agent, not in the backend — to ensure optimal query performance and minimal reliance on vendor backend storage architectures.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/trace.png" alt="Profiling flamegraph for a specific trace.id" /></p>
<h2>Figuring out the active OTel trace and span</h2>
<p>The primary technical challenge in this endeavor is essentially the following: whenever the profiler interrupts an OTel instrumented process to capture a stacktrace, we need to be able to efficiently determine the active span and trace ID (per-thread) and the service name (per-process).</p>
<p>For the purpose of this blog, we'll focus on the recently released <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java instrumentation</a>, but the approach that we ended up with generalizes to any language that can load and call into a native library. So, how do we get our hands on those IDs?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/service-popout.png" alt="Profiling correlated with service.name, showing  CO2 and cloud cost impact by line of code." /></p>
<p>The OTel Java agent itself keeps track of the active span by storing a stack of spans in the <a href="https://opentelemetry.io/docs/concepts/context-propagation/#context">OpenTelemetryContext</a>, which itself is stored in a <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html">ThreadLocal</a> variable. We originally considered reading these Java structures directly from BPF, but we eventually decided against that approach. There is no documented specification on how ThreadLocals are implemented, and reliably reading and following the JVM's internal data-structures would incur a high maintenance burden. Any minor update to the JVM could change details of the structure layouts. To add to this, we would also have to reverse engineer how each JVM version lays out Java class fields in memory, as well as how all the high-level Java types used in the context objects are actually implemented under the hood. This approach further wouldn't generalize to any non-JVM language and needs to be repeated for any language that we wish to support.</p>
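<p>To make that bookkeeping concrete, here is a deliberately simplified Java model of per-thread span tracking: a stack of span IDs held in a ThreadLocal. This is our own illustration of the idea, not the actual OTel context implementation, which is immutable and considerably more general:</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model (not OTel's real implementation): each thread keeps its
// own stack of span IDs, so the innermost active span is always on top.
public class ActiveSpanTracker {
    private static final ThreadLocal<Deque<String>> SPANS =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void enter(String spanId) { SPANS.get().push(spanId); }
    static void exit()               { SPANS.get().pop(); }
    static String current()          { return SPANS.get().peek(); } // null if no span is active

    public static void main(String[] args) {
        enter("span-a");
        enter("span-b");               // nested span becomes the active one
        System.out.println(current()); // span-b
        exit();
        System.out.println(current()); // span-a
    }
}
```

<p>The profiler's problem is reading this per-thread state from outside the JVM, at arbitrary interrupt points, without cooperation from the Java code.</p>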
<p>After we had convinced ourselves that reading Java ThreadLocal directly is not the answer, we decided to look for more portable alternatives instead. The option that we ultimately settled with is to load and call into a C++ library that is responsible for making the required information available via a known and defined interface whenever the span changes.</p>
<p>Unlike Java's ThreadLocals, the details of how a native shared library should expose per-process and per-thread data are well-defined in the System V ABI specification and the architecture-specific ELF ABI documents.</p>
<h2>Exposing per-process information</h2>
<p>Exposing per-process data is easy: we simply declare a global variable . . .</p>
<pre><code class="language-cpp">void* elastic_tracecorr_process_storage_v1 = nullptr;
</code></pre>
<p>. . . and expose it via ELF symbols. When the user initializes the OTel library to set the service name, we allocate a buffer and populate it with data in a <a href="https://github.com/elastic/apm/blob/149cd3e39a77a58002344270ed2ad35357bdd02d/specs/agents/universal-profiling-integration.md#process-storage-layout">protocol that we defined for this purpose</a>. Once the buffer is fully populated, we update the global pointer to point to the buffer.</p>
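<p>That &quot;populate fully, then publish the pointer&quot; ordering is the classic safe-publication pattern: a concurrent reader observes either no buffer at all or a complete one, never a half-written buffer. A small Java analogue of the pattern (our own illustration, not the library's code):</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

// Illustration of publish-after-populate: the shared reference is only set
// once the buffer is fully written, so readers never see partial data.
public class ProcessStoragePublisher {
    private static final AtomicReference<byte[]> STORAGE = new AtomicReference<>();

    static void publish(String serviceName) {
        byte[] buffer = serviceName.getBytes(StandardCharsets.UTF_8); // populate first
        STORAGE.set(buffer);                                          // then publish
    }

    static String read() {
        byte[] buffer = STORAGE.get();
        return buffer == null ? null : new String(buffer, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(read()); // null: nothing published yet
        publish("tomcat-demo");
        System.out.println(read()); // tomcat-demo
    }
}
```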
<p>On the profiling agent side, we already have code in place that detects libraries and executables loaded into any process's address space. We normally use this mechanism to detect and analyze high-level language interpreters (e.g., libpython, libjvm) when they are loaded, but it also turned out to be a perfect fit to detect the OTel trace correlation library. When the library is detected in a process, we scan the exports, resolve the symbol, and read the per-process information directly from the instrumented process’ memory.</p>
<h2>Exposing per-thread information</h2>
<p>With the easy part out of the way, let's get to the nitty-gritty portion: exposing per-thread information via thread-local storage (TLS). So, what exactly is TLS, and how does it work? At the most basic level, the idea is to have <strong>one instance of a variable for every thread</strong>. Semantically you can think of it like having a global Map&lt;ThreadID, T&gt;, although that is not how it is implemented.</p>
<p>On Linux, there are two major options for thread locals: TSD and TLS.</p>
<h2>Thread-specific data (TSD)</h2>
<p>TSD is the older and probably more commonly known variant. It works by explicitly allocating a key via pthread_key_create — usually during process startup — and passing it to all threads that require access to the thread-local variable. The threads can then pass that key to the pthread_getspecific and pthread_setspecific functions to read and update the variable for the currently running thread.</p>
<p>TSD is simple, but for our purposes it has a range of drawbacks:</p>
<ul>
<li>
<p>The pthread_key_t structure is opaque and doesn't have a defined layout. Similar to the Java ThreadLocals, the underlying data-structures aren't defined by the ABI documents and different libc implementations (glibc, musl) will handle them differently.</p>
</li>
<li>
<p>We cannot call a function like pthread_getspecific from BPF, so we'd have to reverse engineer and reimplement the logic. Logic may change between libc versions, and we’d have to detect the version and support all variants that may come up in the wild.</p>
</li>
<li>
<p>TSD performance is not predictable and varies depending on how many thread local variables have been allocated in the process previously. This may not be a huge concern for Java specifically since spans are typically not swapped super rapidly, but it’d likely be quite noticeable for user-mode scheduling languages where the context might need to be swapped at every await point/coroutine yield.</p>
</li>
</ul>
<p>None of this is strictly prohibitive, but a lot of this is annoying at the very least. Let’s see if we can do better!</p>
<h2>Thread-local storage (TLS)</h2>
<p>Starting with C11 and C++11, both languages support thread local variables directly via the _Thread_local and thread_local storage specifiers, respectively. Declaring a variable as per-thread is now a matter of simply adding the keyword:</p>
<pre><code class="language-cpp">thread_local void* elastic_tracecorr_tls_v1 = nullptr;
</code></pre>
<p>You might assume that the compiler simply inserts calls to the corresponding pthread function calls when variables declared with this are accessed, but this is not actually the case. The reality is surprisingly complicated, and it turns out that there are four different models of TLS that the compiler can choose to generate. For some of those models, there are further multiple dialects that can be used to implement them. The different models and dialects come with various portability versus performance trade-offs. If you are interested in the details, I suggest reading this <a href="https://maskray.me/blog/2021-02-14-all-about-thread-local-storage">blog article</a> that does a great job at explaining them.</p>
<p>The TLS model and dialect are usually chosen by the compiler based on a somewhat opaque and complicated set of architecture-specific rules. Fortunately for us, both gcc and clang allow users to pick a particular one using the -ftls-model and -mtls-dialect arguments. The variant that we ended up picking for our purposes is -ftls-model=global-dynamic and -mtls-dialect=gnu2 (and desc on aarch64).</p>
<p>Let's take a look at the assembly that is being generated when accessing a thread_local variable under these settings. Our function:</p>
<pre><code class="language-cpp">void setThreadProfilingCorrelationBuffer(JNIEnv* jniEnv, jobject bytebuffer) {
  if (bytebuffer == nullptr) {
    elastic_tracecorr_tls_v1 = nullptr;
  } else {
    elastic_tracecorr_tls_v1 = jniEnv-&gt;GetDirectBufferAddress(bytebuffer);
  }
}
</code></pre>
<p>Is compiled to the following assembly code:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/assembly.png" alt="assembly" /></p>
<p>Both possible branches assign a value to our thread-local variable. Let’s focus on the right branch, corresponding to the nullptr case, to get rid of the noise from the GetDirectBufferAddress function call:</p>
<pre><code class="language-asm">lea   rax, elastic_tracecorr_tls_v1_tlsdesc  ;; Load some pointer into rax.
call  qword ptr [rax]                        ;; Read &amp; call function pointer at rax.
mov   qword ptr fs:[rax], 0                  ;; Assign 0 to the pointer returned by
                                             ;; the function that we just called.
</code></pre>
<p>The fs: portion of the mov instruction is the actual magic bit that makes the memory read per-thread. We’ll get to that later; let’s first look at the mysterious elastic_tracecorr_tls_v1_tlsdesc variable that the compiler emitted here. It’s an instance of the tlsdesc structure that is located somewhere in the .got.plt ELF section. The structure looks like this:</p>
<pre><code class="language-cpp">struct tlsdesc {
  // Function pointer used to retrieve the offset
  uint64_t (*resolver)(tlsdesc*);

  // TLS offset -- more on that later.
  uint64_t tp_offset;
}
</code></pre>
<p>The resolver field is initialized with nullptr and tp_offset with a per-executable offset. The first thread-local variable in an executable will usually have offset 0, the next one sizeof(first_var), and so on. At first glance this may appear to be similar to how TSD works, with the call to pthread_getspecific to resolve the actual offset, but there is a crucial difference. When the library is loaded, the resolver field is filled in with the address of __tls_get_addr by the loader (ld.so). __tls_get_addr is a relatively heavy function that allocates a TLS offset that is globally unique between all shared libraries in the process. It then proceeds by updating the tlsdesc structure itself, inserting the global offset and replacing the resolver function with a trivial one:</p>
<pre><code class="language-cpp">void* second_stage_resolver(tlsdesc* desc) {
  return (void*)desc-&gt;tp_offset;
}
</code></pre>
<p>In essence, this means that the first access to a tlsdesc based thread-local variable is rather expensive, but all subsequent ones are cheap. We further know that by the time that our C++ library starts publishing per-thread data, it must have gone through the initial resolving process already. Consequently, all that we need to do is to read the final offset from the process's memory and memorize it. We also refresh the offset every now and then to ensure that we really have the final offset, combating the unlikely but possible race condition that we read the offset before it was initialized. We can detect this case by comparing the resolver address against the address of the __tls_get_addr function exported by ld.so.</p>
<h2>Determining the TLS offset from an external process</h2>
<p>With that out of the way, the next question that arises is how to actually find the tlsdesc in memory so that we can read the offset. Intuitively one might expect that the dynamic symbol exported on the ELF file points to that descriptor, but that is not actually the case.</p>
<pre><code class="language-bash">$ readelf --wide --dyn-syms elastic-jvmti-linux-x64.so | grep elastic_tracecorr_tls_v1
328: 0000000000000000     8 TLS     GLOBAL DEFAULT   19 elastic_tracecorr_tls_v1
</code></pre>
<p>The dynamic symbol instead contains an offset relative to the start of the .tls ELF section and points to the initial value that libc initializes the TLS value with when it is allocated. So how does ld.so find the tlsdesc to fill in the initial resolver? In addition to the dynamic symbol, the compiler also emits a relocation record for our symbol, and that one actually points to the descriptor structure that we are looking for.</p>
<pre><code class="language-bash">$ readelf --relocs --wide elastic-jvmti-linux-x64.so | grep R_X86_64_TLSDESC
00000000000426e8  0000014800000024 R_X86_64_TLSDESC   0000000000000000 elastic_tracecorr_tls_v1 + 0
</code></pre>
<p>To read the final TLS offset, we thus simply have to:</p>
<ul>
<li>
<p>Wait for the event notifying us about a new shared library being loaded into a process</p>
</li>
<li>
<p>Do some cheap heuristics to detect our C++ library, avoiding the more expensive analysis below from being executed for every unrelated library on the system</p>
</li>
<li>
<p>Analyze the library on disk and scan ELF relocations for our per-thread variable to extract the tlsdesc address</p>
</li>
<li>
<p>Rebase that address to match where our library was loaded in that particular process</p>
</li>
<li>
<p>Read the offset from tlsdesc+8</p>
</li>
</ul>
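<p>Assuming Python purely for illustration, the relocation-scanning and rebasing steps above can be sketched as follows. The helper names are ours, and a real implementation would first locate the .rela.dyn section via the ELF headers; here we only decode the 24-byte ELF64 RELA entry format and compute the rebased descriptor address:</p>

```python
import struct

R_X86_64_TLSDESC = 36  # relocation type 0x24, as in the readelf output above

def parse_rela_entry(blob: bytes):
    """Decode one ELF64 RELA entry (24 bytes: r_offset, r_info, r_addend).
    r_info packs the symbol table index (high 32 bits) and type (low 32 bits)."""
    r_offset, r_info, r_addend = struct.unpack("<QQq", blob)
    return r_offset, r_info >> 32, r_info & 0xFFFFFFFF, r_addend

def tlsdesc_addr(load_bias: int, r_offset: int) -> int:
    # Rebase the link-time offset to where the library is actually mapped
    # in the target process (the load bias comes from /proc/&lt;pid&gt;/maps).
    return load_bias + r_offset

# The relocation shown earlier: r_offset 0x426e8, r_info 0x0000014800000024.
entry = struct.pack("<QQq", 0x426E8, 0x0000014800000024, 0)
r_offset, sym_index, reloc_type, _ = parse_rela_entry(entry)
assert reloc_type == R_X86_64_TLSDESC
assert sym_index == 328  # matches dynamic symbol index 328 from readelf

# The final TLS offset would then be read from the running process at
# tlsdesc_addr(bias, r_offset) + 8, i.e. the tp_offset field of the tlsdesc.
```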
<h2>Determining the TLS base</h2>
<p>Now that we have the offset, how do we use that to actually read the data that the library puts there for us? This brings us back to the magic fs: portion of the mov instruction that we discussed earlier. In X86, most memory operands can optionally be supplied with a segment register that influences the address translation.</p>
<p>Segments are an archaic construct from the early days of 16-bit X86, where they were used to extend the address space. Essentially, the architecture provides a range of segment registers that can be configured with different base addresses, thus allowing more than 16 bits’ worth of memory to be addressed. In the era of 64-bit processors, this is hardly a concern anymore. In fact, X86-64 aka AMD64 got rid of all but two of those segment registers: fs and gs.</p>
<p>So why keep two of them? It turns out that they are quite useful for the use-case of thread-local data. Since every thread can be configured to have its own base address in these segment registers, we can use it to point to a block of data for this specific thread. That is precisely what libc implementations on Linux are doing with the fs segment. The offset that we snatched from the process's memory earlier is used as an address with the fs segment register, and the CPU automatically adds it to the per-thread base address.</p>
<p>To retrieve the base address pointed to by the fs segment register in the kernel, we need to read its destination from the kernel’s task_struct for the thread that we happened to interrupt with our profiling timer event. Getting the task struct is easy because we are blessed with the bpf_get_current_task BPF helper function. BPF helpers are pretty much syscalls for BPF programs: we can just ask the Linux kernel to hand us the pointer.</p>
<p>Armed with the task pointer, we now have to read the thread.fsbase (X86-64) or thread.uw.tp_value (aarch64) field to get our desired base address that the user-mode process accesses via fs. This is where things get complicated one last time, at least if we wish to support older kernels without <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html">BTF support</a> (we do!). The <a href="https://github.com/torvalds/linux/blob/259f7d5e2baf87fcbb4fabc46526c9c47fed1914/include/linux/sched.h#L748">task_struct is huge</a> and there are hundreds of fields that can be present or not depending on how the kernel is configured. Being a core primitive of the scheduler, it is also constantly subject to changes between different kernel versions. On modern Linux distributions, the kernel is typically nice enough to tell us the offset via BTF. On older ones, the situation is more complicated. Since hardcoding the offset is clearly not an option if we hope the code to be portable, we instead have to figure out the offset by ourselves.</p>
<p>We do this by consulting /proc/kallsyms, a file with mappings between kernel functions and their addresses, and then using BPF to dump the compiled code of a kernel function that rarely changes and uses the desired offset. We dynamically disassemble and analyze the function and extract the offset directly from the assembly. For X86-64 specifically, we dump the <a href="https://elixir.bootlin.com/linux/v5.9.16/source/arch/x86/kernel/hw_breakpoint.c#L452">aout_dump_debugregs</a> function that accesses thread-&gt;ptrace_bps, which has consistently been 16 bytes away from the fsbase field that we are interested in for all kernels that we have ever looked at.</p>
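<p>To make the offset-extraction idea concrete, here is a deliberately tiny sketch in Python (our own illustration, not the agent's actual code, which uses a full disassembler). It recognizes just one encoding, mov r64, [r64 + disp32], which is the shape of a struct-field load such as the thread-&gt;ptrace_bps access mentioned above, and yields the 32-bit displacement:</p>

```python
import struct

def find_disp32_loads(code: bytes):
    """Scan x86-64 machine code for 'mov r64, [r64 + disp32]' and yield each
    32-bit displacement. Only REX.W (0x48) + opcode 0x8B + ModRM with
    mod == 0b10 is recognized; rm == 0b100 (SIB byte follows) is skipped."""
    i = 0
    while i + 7 <= len(code):
        rex, opcode, modrm = code[i], code[i + 1], code[i + 2]
        if rex == 0x48 and opcode == 0x8B and modrm >> 6 == 0b10 and modrm & 7 != 4:
            yield struct.unpack_from("<i", code, i + 3)[0]
            i += 7
        else:
            i += 1

# 'mov rax, [rdi + 0x1234]' encodes as 48 8b 87 34 12 00 00.
displacements = list(find_disp32_loads(bytes([0x48, 0x8B, 0x87, 0x34, 0x12, 0x00, 0x00])))
assert displacements == [0x1234]
```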
<h2>Reading TLS data from kernel</h2>
<p>With all the required offsets at our hands, we can now finally do what we set out to do in the first place: use them to enrich our stack traces with the OTel trace and span IDs that our C++ library prepared for us!</p>
<pre><code class="language-c">void maybe_add_otel_info(Trace* trace) {
  // Did user-mode insert a TLS offset for this process? Read it.
  TraceCorrProcInfo* proc = bpf_map_lookup_elem(&amp;tracecorr_procs, &amp;trace-&gt;pid);

  // No entry -&gt; process doesn't have the C++ library loaded.
  if (!proc) return;

  // Load the fsbase offset from our global configuration map.
  u32 key = 0;
  SystemConfig* syscfg = bpf_map_lookup_elem(&amp;system_config, &amp;key);

  // The BPF verifier rejects the program unless map lookup results are
  // null-checked before being dereferenced.
  if (!syscfg) return;

  // Read the fsbase offset from the kernel's task struct.
  u8* fsbase;
  u8* task = (u8*)bpf_get_current_task();
  bpf_probe_read_kernel(&amp;fsbase, sizeof(fsbase), task + syscfg-&gt;fsbase_offset);

  // Use the TLS offset to read the **pointer** to our TLS buffer.
  void* corr_buf_ptr;
  bpf_probe_read_user(
    &amp;corr_buf_ptr,
    sizeof(corr_buf_ptr),
    fsbase + proc-&gt;tls_offset
  );

  // Read the information that our library prepared for us.
  TraceCorrelationBuf corr_buf;
  bpf_probe_read_user(&amp;corr_buf, sizeof(corr_buf), corr_buf_ptr);

  // If the library reports that we are currently in a trace, store it into
  // the stack trace that will be reported to our user-land process.
  if (corr_buf.trace_present &amp;&amp; corr_buf.valid) {
    trace-&gt;otel_trace_id.as_int.hi = corr_buf.trace_id.as_int.hi;
    trace-&gt;otel_trace_id.as_int.lo = corr_buf.trace_id.as_int.lo;
    trace-&gt;otel_span_id.as_int = corr_buf.span_id.as_int;
  }
}
</code></pre>
<h2>Sending out the mappings</h2>
<p>From this point on, everything further is pretty simple. The C++ library sets up a unix datagram socket during startup and communicates the socket path to the profiler via the per-process data block. The stacktraces annotated with the OTel trace and span IDs are sent from BPF to our user-mode profiler process via perf event buffers, which in turn sends the mappings between OTel span and trace and stack trace hashes to the C++ library. Our extensions to the OTel instrumentation framework then read those mappings and insert the stack trace hashes into the OTel trace.</p>
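<p>The datagram exchange can be illustrated with a short Python sketch. Note that the message layout here is our own invention for demonstration purposes; the actual wire format between the profiler and the C++ library is not documented in this post:</p>

```python
import socket
import struct

def pack_mapping(trace_id: bytes, span_id: int, stack_hash: int) -> bytes:
    """Pack a (16-byte OTel trace ID, span ID, stack trace hash) mapping."""
    assert len(trace_id) == 16
    return trace_id + struct.pack("<QQ", span_id, stack_hash)

def unpack_mapping(msg: bytes):
    span_id, stack_hash = struct.unpack("<QQ", msg[16:])
    return msg[:16], span_id, stack_hash

# Demonstrate over a socketpair; the real library binds a named socket
# path and tells the profiler about it via the per-process data block.
profiler, library = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
profiler.send(pack_mapping(b"\x01" * 16, 0xDEAD, 0xBEEF))
received = unpack_mapping(library.recv(64))
assert received == (b"\x01" * 16, 0xDEAD, 0xBEEF)
profiler.close()
library.close()
```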
<p>This approach has a few major upsides compared to the perhaps more obvious alternative of sending out the OTel span and trace ID with the profiler’s stacktrace records. We want the stacktrace associations to be stored in the trace indices to allow filtering and aggregating stacktraces by the plethora of fields available on OTel traces. If we were to send out the trace IDs via the profiler's gRPC connection instead, we’d have to search for and update the corresponding OTel trace records in the profiling collector to insert the stack trace hashes.</p>
<p>This is not trivial: stacktraces are sent out rather frequently (every 5 seconds, as of writing) and the corresponding OTel trace might not have been sent and stored by the time the corresponding stack traces arrive in our cluster. We’d have to build a kind of delay queue and periodically retry updating the OTel trace documents, introducing avoidable database work and complexity in the collectors. With the approach of sending stacktrace mappings to the OTel instrumented process instead, the need for server-side merging vanishes entirely.</p>
<h2>Trace correlation in action</h2>
<p>With all the hard work out of the way, let’s take a look at what trace correlation looks like in action!</p>
&lt;Video vidyardUuid=&quot;JYTzQYeiJ6CK6K3hZ33sz5&quot; /&gt;
<h2>Future work: Supporting other languages</h2>
<p>We have demonstrated that trace correlation can work nicely for Java, but we have no intention of stopping there. The general approach that we discussed previously should work for any language that can efficiently load and call into our C++ library and doesn’t do user-mode scheduling with coroutines. The problem with user-mode scheduling is that the logical thread can change at any await/yield point, requiring us to update the trace IDs in TLS. Many such coroutine environments like Rust’s Tokio provide the ability to register a callback for whenever the active task is swapped, so they can be supported easily. Other languages, however, do not provide that option.</p>
<p>One prominent example in that category is Go: goroutines are built on user-mode scheduling, but to our knowledge there’s no way to instrument the scheduler. Such languages will need solutions that don’t go via the generic TLS path. For Go specifically, we have already built a prototype that uses pprof labels that are associated with a specific Goroutine, having Go’s scheduler update them for us automatically.</p>
<h2>Getting started</h2>
<p>We hope this blog post has given you an overview of correlating profiling signals to distributed tracing, and its benefits for end-users.</p>
<p>To get started, download the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel agent</a>, which contains the new trace correlation library. Additionally, you will need the latest version of Universal Profiling agent, bundled with <a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">Elastic Stack version 8.13</a>.</p>
<h2>Acknowledgment</h2>
<p>We appreciate <a href="https://github.com/trask">Trask Stalnaker</a>, maintainer of the OTel Java agent, for his feedback on our approach and for reviewing the early draft of this blog post.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-distributed-tracing-correlation/Under_highway_bridge.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Continuous profiling: The key to more efficient and cost-effective applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/continuous-profiling-efficient-cost-effective-applications</link>
            <guid isPermaLink="false">continuous-profiling-efficient-cost-effective-applications</guid>
            <pubDate>Fri, 27 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, we discuss why computational efficiency is important and how Elastic Universal Profiling enables your business to use continuous profiling in production environments to make the software that runs your business as efficient as possible.]]></description>
            <content:encoded><![CDATA[<p>Recently, Elastic Universal Profiling<sup>TM</sup> became <a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">generally available</a>. It is the part of our Observability solution that allows users to do <em>whole-system, continuous profiling</em> in production environments. If you're not familiar with continuous profiling, you are probably wondering what Universal Profiling is and why you should care. That's what we will address in this post.</p>
<h2>Efficiency is important (again)</h2>
<p>Before we jump into continuous profiling, let's start with the &quot;Why should I care?&quot; question. To do that, I'd like to talk a bit about efficiency and some large-scale trends happening in our industry that are making efficiency, specifically computational efficiency, important again. I say again because in the past, when memory and storage on a computer were very limited and you had to worry about every byte of code, efficiency was an important aspect of developing software.</p>
<h3>The end of Moore’s Law</h3>
<p>First, the <a href="https://en.wikipedia.org/wiki/Moore's_law">Moore's Law</a> era is drawing to a close. This was inevitable simply due to physical limits of how small you can make a transistor and the connections between them. For a long time, software developers had the luxury of not worrying about complexity and efficiency because the next generation of hardware would mitigate any negative cost or performance impact.</p>
<p><em>If you can't rely on an endless progression of ever faster hardware, you should be interested in computational efficiency.</em></p>
<h3>The move to Software-as-a-Service</h3>
<p>Another trend to consider is the shift from software vendors that sold customers software to run themselves to Software-as-a-Service businesses. A traditional software vendor didn't have to worry too much about the efficiency of their code. That issue largely fell to the customer to address; a new software version might dictate a hardware refresh to the latest and most performant. For a SaaS business, inefficient software usually degrades the customer’s experience and it certainly impacts the bottom line.</p>
<p><em>If you are a SaaS business in a competitive environment, you should be interested in computational efficiency.</em></p>
<h3>Cloud migration</h3>
<p>Next is the ongoing <a href="https://www.elastic.co/observability/cloud-migration">migration to cloud computing</a>. One of the benefits of cloud computing is the ease of scaling, both hardware and software. In the cloud, we are not constrained by the limits of our data centers or the next hardware purchase. Instead, we simply spin up more cloud instances to mitigate performance problems. In addition to infrastructure scalability, microservices architectures, containerization, and the rise of Kubernetes and similar orchestration tools mean that scaling services is simpler than ever. It's not uncommon to have thousands of instances of a service running in a cloud environment. This ease of scaling accounts for another trend, namely that many businesses are dealing with skyrocketing cloud computing costs.</p>
<p><em>If you are a business with ever increasing cloud costs, you should be interested in computational efficiency.</em></p>
<h3>Our changing climate</h3>
<p>Lastly, if none of those reasons pique your interest, let's consider a global problem that all of us should have in mind — namely, climate change. There are many things that need to be addressed to tackle climate change, but with our dependence on software in every part of our society, computational efficiency is certainly something we should be thinking about.</p>
<p>Thomas Dullien, distinguished engineer at Elastic and one of the founders of optimyze.cloud, points out that if you can save 20% on 800 servers, assuming 300W power consumption for each server, that code change is worth 160 metric tons of CO<sub>2</sub> saved per year. That may seem like a drop in the bucket, but if all businesses focus more on computational efficiency, it will make an impact. Also, let's not forget the financial benefits: those 160 metric tons of CO<sub>2</sub> savings also represent a significant annual cost savings.</p>
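<p>As a sanity check of those numbers (with one assumption of our own: a grid carbon intensity of roughly 0.38 kg CO2e per kWh, which the post does not state), the arithmetic works out as follows:</p>

```python
servers = 800
watts_per_server = 300
saving = 0.20
kg_co2_per_kwh = 0.38   # assumed grid carbon intensity, not from the post

kw_saved = servers * watts_per_server * saving / 1000   # 48 kW shaved off
kwh_per_year = kw_saved * 24 * 365                      # ~420,480 kWh per year
tons_co2 = kwh_per_year * kg_co2_per_kwh / 1000         # ~160 metric tons
assert abs(tons_co2 - 160) < 2
```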
<p><em>If you live on planet Earth, you should be interested in computational efficiency.</em></p>
<h2>Performance engineering</h2>
<p>Who's job is it to worry about computational efficiency? Application developers usually pay at least some attention to efficiency as they develop their code. Profiling is a common approach for a developer to understand the performance of their code, and there is an entire portfolio of profiling tools available. Frequently, however, schedule pressures trump time spent on performance analysis and computational efficiency. In addition, performance problems may not become apparent until an application is running at scale in production and interacting (and competing) with everything else in that environment. Many profiling tools are not well suited to use in a production environment because they require code instrumentation and recompilation and add significant overhead.</p>
<p>When inefficient code makes it into production and begins to cause performance problems, the next line of defense is the Operations or SRE team. Their mission is to keep everything humming, and performance problems will certainly draw attention. Observability tools such as APM can shed light on these types of issues and lead the team to a specific application or service, but these tools offer only limited visibility into the full system. Third-party libraries and operating system kernel functions remain hidden without a profiling solution in the production environment.</p>
<p>So, what can these teams do when there is a need to investigate a performance problem in production? That's where continuous profiling comes into the picture.</p>
<h2>Continuous profiling</h2>
<p>Continuous profiling is not a new idea. Google published a <a href="https://research.google/pubs/pub36575/">paper about it</a> in 2010 and began implementing continuous profiling in its environments around that time. Facebook and Netflix followed suit not long afterward.</p>
<p>Typically, continuous profiling tools have been the domain of dedicated performance engineering or operating system engineering teams, which are usually only found at extremely large scale enterprises like the ones mentioned above. The key idea is to run profiling on every server, all of the time. That way, when your observability tools point you to a specific part of an application, but you need a more detailed view into exactly where that application is consuming CPU resources, the profiling data will be there, ready to use.</p>
<p>Another benefit of continuous profiling is that it provides a view of CPU intensive software across your entire environment — whether that is a very CPU intensive function or the aggregate of a relatively small function that is run thousands of times a second in your environment.</p>
<p>While profiling tools are not new, most of them have significant gaps. Let's look at a couple of the most significant ones.</p>
<ul>
<li><strong>Limited visibility.</strong> Modern distributed applications are composed of a complex mix of building blocks, including custom software functions, third-party software libraries, networking software, operating system services, and more and more often, orchestration software such as <a href="https://kubernetes.io/">Kubernetes</a>. To fully understand what is happening in an application, you need visibility into each piece. However, even if a developer has the ability to profile their own code, everything else remains invisible. To make matters worse, most profiling tools require instrumenting the code, which adds overhead, and therefore even your developers’ own code is typically not profiled in production.</li>
<li><strong>Missing symbols in production.</strong> All of these code building blocks typically have descriptive names (some more intuitive than others) so that developers can understand and make sense of them. In a running program, these descriptive names are usually referred to as <strong>symbols</strong>. For a human being to make sense of the execution of a running application, these names are very important. Unfortunately, software running in production almost always has these human-readable symbols stripped away for space efficiency, since they are not needed by the CPU executing the software. Without the symbols, it is much more difficult to understand the full picture of what's happening in the application. To illustrate this, think of the last time you were in an SMS chat on your mobile device and you only had some of the people in the chat group in your address book, while the rest simply appeared as phone numbers — this makes it very hard to tell who is saying what.</li>
</ul>
<h2>Elastic Universal Profiling: Continuous profiling for all</h2>
<p>Our goal is to allow any business, large or small, to make computational efficiency a core consideration for all of the software that they run. Universal Profiling imposes very low overhead on your servers, so it can be used in production, and it provides visibility into everything running on every machine. It opens up the possibility of seeing the financial unit cost and CO<sub>2</sub> impact of every line of code running on every system in your business. How do we do that?</p>
<h3>Whole-system visibility — SIMPLE</h3>
<p>Universal Profiling is based on <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF</a>, which means that it imposes very low overhead (our goal is less than 1% CPU and less than 250MB of RAM) on your servers because it doesn't require code instrumentation. That low overhead means it can be run continuously, on every server, even in production.</p>
<p>eBPF also lets us deploy a single profiler agent on a host and peek inside the operating system to see every line of code executing on the CPU. That means we have visibility into all of those application building blocks described above — the operating system itself as well as <a href="https://en.wikipedia.org/wiki/Containerization_(computing)">containerization and orchestration frameworks</a> without complex configuration.</p>
<h3>All the symbols</h3>
<p>A key part of Universal Profiling is our hosted symbolization service. This means that symbols are not required on your servers, which not only eliminates the need to recompile software with symbols, but also helps reduce overhead by allowing the Universal Profiling agent to send very sparse data back to the Elasticsearch platform, where it is enriched with all of the missing symbols. Since we maintain a repository of the most popular third-party software libraries and Linux operating system symbols, the Universal Profiling UI can show you all the symbols.</p>
<h3>Your favorite language, and then some</h3>
<p>Universal Profiling is multilanguage. We support all of today’s popular programming languages, including Python, Go, Java (and any other JVM-based languages), Ruby, NodeJS, PHP, Perl, and of course, C and C++, which is critical since these languages still underlie so many third-party libraries used by the other languages. In addition, we support profiling <a href="https://en.wikipedia.org/wiki/Machine_code">native code</a>, a.k.a. machine language.</p>
<p>Speaking of native code, all profiling tools are tied to a specific type of CPU. Most tools today only support the Intel x86 CPU architecture. Universal Profiling supports both x86 and ARM-based processors. With the expanding use of ARM-based servers, especially in cloud environments, Universal Profiling future-proofs your continuous profiling.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-1-universal-profiling.png" alt="A flamegraph showing traces across Python, Native, Kernel, and Java code" /></p>
<p>Many businesses today employ polyglot programming — that is, they use multiple languages to build an application — and Universal Profiling is the only profiler available that can build a holistic view across all of these languages. This will help you look for hotspots in the environment, leading you to &quot;unknown unknowns&quot; that warrant deeper performance analysis. That might be a simple interest rate calculation that should be efficient and lightweight but, surprisingly, isn't. Or perhaps it is a service that is reused much more frequently than originally expected, resulting in thousands of instances running across your environment every second, making it a prime target for efficiency improvement.</p>
<h3>Visualize your impact</h3>
<p>Elastic Universal Profiling has an intuitive UI that immediately shows you the impact of any given function, including the time it spends executing on the CPU and how much that costs both in dollars and in carbon emissions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-2-universal-profiling-flamegraph.png" alt="Annualized dollar cost and CO2 emissions for any function" /></p>
<p>Finally, with the level of software complexity in most production environments, there's a good chance that making a code change will have unanticipated effects across the environment. That code change may be due to a new feature being rolled out or a change to improve efficiency. In either case, a differential view, before and after the change, will help you understand the impact.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/elastic-blog-3.png" alt="Performance, CO2, and cost improvements of a more efficient hashing function" /></p>
<h2>Let's recap</h2>
<p>Computational efficiency is an important topic, both from the perspective of the ultra-competitive business climate we all work in and from living through the challenges of our planet's changing climate. Improving efficiency can be a challenging endeavor, but we can't even begin to attempt to make improvements without knowing where to focus our efforts. Elastic Universal Profiling is here to provide every business with visibility into computational efficiency.</p>
<p>How will you use Elastic Universal Profiling in your business?</p>
<ul>
<li>If you are an application developer or part of the site reliability team, Universal Profiling will provide you with unprecedented visibility into your applications that will not only help you troubleshoot performance problems in production, but also understand the impact of new features and deliver an optimal user experience.</li>
<li>If you are involved in cloud and infrastructure financial management and capacity planning, Universal Profiling will provide you with unprecedented visibility into the unit cost of every line of code that your business runs.</li>
<li>If you are involved in your business’s <a href="https://www.elastic.co/blog/sustainability-elastic-6-months-reflection">ESG</a> initiative, Universal Profiling will provide you with unprecedented visibility into your CO<sub>2</sub> emissions and open up new avenues for reducing your carbon footprint.</li>
</ul>
<p>These are just a few examples. For more ideas, read how <a href="https://www.elastic.co/customers/appomni">AppOmni benefits from Elastic Universal Profiling</a>.</p>
<p>You can <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">get started</a> with Elastic Universal Profiling right now!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/continuous-profiling-efficient-cost-effective-applications/the-end-of-databases-A_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to capture custom metrics without app code changes using the Java Agent Plugin]]></title>
            <link>https://www.elastic.co/observability-labs/blog/custom-metrics-app-code-java-agent-plugin</link>
            <guid isPermaLink="false">custom-metrics-app-code-java-agent-plugin</guid>
            <pubDate>Mon, 10 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[When the application you're monitoring doesn't emit the custom metrics you'd like, and you can't directly change the app code, you can use a Java Agent Plugin to automatically instrument the application and emit the custom metrics you desire.]]></description>
            <content:encoded><![CDATA[<p>The Elastic APM Java Agent automatically tracks <a href="https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html">many metrics</a>, including those that are generated through <a href="https://micrometer.io/">Micrometer</a> or the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry Metrics API</a>. So if your application (or the libraries it includes) already exposes metrics from one of those APIs, installing the Elastic APM Java Agent is the only step required to capture them. You'll be able to visualize and configure thresholds, alerts, and anomaly detection — and anything else you want to use them for!</p>
<p>The next simplest option is to generate custom metrics directly from your code (e.g., by adding code using the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry Metrics API</a> directly into the application). The major downside of that approach is that it requires modifying the application, so if you can't or don't want to do that, you can easily produce the desired custom metrics by adding instrumentation to the Elastic APM Java Agent via a plugin.</p>
<p>This article deals with the situation where the application you are monitoring doesn't emit the custom metrics you'd like it to, and you can't directly change the code or config to make it do so. Instead, you can use a plugin to automatically instrument the application via the Elastic APM Java Agent, which will then make the application emit the custom metrics you desire.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-1-kibana-lens.png" alt="Using Elastic Kibana Lens to analyze APM telemetry on various measures" /></p>
<h2>Plugin basics</h2>
<p>The basics of the Elastic APM Java Agent, and how to easily plugin instrumentation, are detailed in the article &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot; Generating metrics from a plugin is just another type of instrumentation, and the referenced article provides detailed step-by-step instructions with a worked example of how to create a plugin with custom instrumentation.</p>
<p>For this article, I assume you understand how to create a plugin with custom instrumentation based on that previous article, as well as the example application (a simple webserver <a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/application/src/main/java/co/elastic/apm/example/webserver/ExampleBasicHttpServer.java">ExampleBasicHttpServer</a>) from our <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a>.</p>
<h2>The custom metric</h2>
<p>For our example application, an HTTP server (<a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/application/src/main/java/co/elastic/apm/example/webserver/ExampleBasicHttpServer.java">ExampleBasicHttpServer</a>), we'd like to add a custom metric, 'page_views', which increments each time the application handles a request. The instrumentation we'll add is therefore triggered by the same ExampleBasicHttpServer.handleRequest() method used in &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot;</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-2-15m-vis.png" alt="A 15-minute line visualization of the page_views metric using Elastic APM" /></p>
<h2>Using the Plugin/OpenTelemetry API</h2>
<p>Essentially, the only difference from that article is that for metrics we'll use the <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry <em>metrics</em> API</a> instead of the <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">OpenTelemetry <em>tracing</em> API</a>.</p>
<p>Specifically, the advice applied to the handleRequest() method contains the following code:</p>
<pre><code class="language-java">// pageViewCounter is a field of type io.opentelemetry.api.metrics.LongCounter
if (pageViewCounter == null) {
    pageViewCounter = GlobalOpenTelemetry
        .getMeter(&quot;ExampleHttpServer&quot;)
        .counterBuilder(&quot;page_views&quot;)
        .setDescription(&quot;Page view count&quot;)
        .build();
}
pageViewCounter.add(1);
</code></pre>
<p>That is, lazily create the counter the first time it's needed, then increment it on each invocation of the ExampleBasicHttpServer.handleRequest() method.</p>
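<p>To see that lazy-create-then-increment shape in isolation, here is a self-contained sketch using only JDK types. The <code>Counter</code> class below is merely a stand-in for the OpenTelemetry <code>LongCounter</code> (which, in the real plugin, is built from <code>GlobalOpenTelemetry</code> at runtime), and all names here are illustrative rather than part of the agent's API:</p>
<pre><code class="language-java">import java.util.concurrent.atomic.AtomicLong;

// Stand-in sketch of the lazy counter pattern from the advice above,
// using plain JDK types instead of the OpenTelemetry metrics API.
public class LazyCounterSketch {

    // Plays the role of io.opentelemetry.api.metrics.LongCounter.
    static final class Counter {
        private final AtomicLong value = new AtomicLong();
        void add(long delta) { value.addAndGet(delta); }
        long get() { return value.get(); }
    }

    private static volatile Counter pageViewCounter;

    // Mirrors the advice body: lazily create the counter, then
    // increment it on every handled request.
    static void onRequestHandled() {
        if (pageViewCounter == null) {
            // In the plugin: GlobalOpenTelemetry.getMeter("ExampleHttpServer")
            //     .counterBuilder("page_views").build();
            pageViewCounter = new Counter();
        }
        pageViewCounter.add(1);
    }

    static long pageViews() {
        Counter c = pageViewCounter;
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            onRequestHandled();
        }
        System.out.println(pageViews()); // prints 3
    }
}
</code></pre>
<p>The unsynchronized null check means two threads could race to create the counter; in the real advice this is benign, because instruments requested with the same name from the same meter report into the same metric.</p>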
<p>Everything else — setting up instrumentation, finding the method to instrument, building the plugin — is the same as in the article &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>.&quot; The full metrics example is implemented in the <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a>, and the complete metrics instrumentation class is <a href="https://github.com/elastic/apm-agent-java-plugin-example/blob/main/plugin/src/main/java/co/elastic/apm/example/webserver/plugin/ExampleMetricsInstrumentation.java">ExampleMetricsInstrumentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/elastic-blog-3-bar-chart.png" alt="A 15-minute bar chart visualization of the page_views metric using Elastic APM" /></p>
<h2>Try it out!</h2>
<p>That's it! To run the agent with the plugin, build the plugin jar as described in &quot;<a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a>&quot; and place it in the directory specified by the plugins_dir configuration option. The <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example repo</a> provides a full tested implementation — just clone it and run mvn install to see it working.</p>
<p>The best place to get started with Elastic APM is in the cloud. Begin your <a href="https://cloud.elastic.co/registration?elektra=en-observability-application-performance-monitoring-page">free trial of Elastic Cloud</a> today!</p>
<blockquote>
<ul>
<li>The <a href="https://www.elastic.co/guide/en/apm/agent/java/current/index.html">Elastic APM Java Agent docs</a></li>
<li>The <a href="https://github.com/elastic/apm-agent-java/">Elastic APM Java Agent repo</a></li>
<li>The <a href="https://github.com/elastic/apm-agent-java-plugin-example">plugin example</a> repo</li>
<li>The previous <a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a> article</li>
<li>The associated <a href="https://www.elastic.co/blog/regression-testing-your-java-agent-plugin">Regression testing your Java Agent Plugin</a> article</li>
<li>The <a href="https://opentelemetry.io/docs/specs/otel/metrics/api/">OpenTelemetry metrics API</a></li>
<li>The <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">OpenTelemetry tracing API</a></li>
<li><a href="https://micrometer.io/">Micrometer</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/custom-metrics-app-code-java-agent-plugin/capture-custom-metrics-blog-720x420.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Customize your data ingestion with Elastic input packages]]></title>
            <link>https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages</link>
            <guid isPermaLink="false">customize-data-ingestion-input-packages</guid>
            <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, learn about input packages and how they can provide a flexible solution to advanced users for customizing their ingestion experience in Elastic.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> has enabled the collection, transformation, and analysis of data flowing between external data sources and the Elastic Observability solution through <a href="https://www.elastic.co/integrations/">integrations</a>. Integration packages achieve this by encapsulating several components, including <a href="https://www.elastic.co/guide/en/fleet/current/create-standalone-agent-policy.html">agent configuration</a>, inputs for data collection, and assets like <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index templates</a>, and <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">visualizations</a>. The breadth of assets supported in the Elastic Stack grows day by day.</p>
<p>This blog dives into how input packages provide an extremely generic and flexible solution to the advanced users for customizing their ingestion experience in Elastic.</p>
<h2>What are input packages?</h2>
<p>An <a href="https://github.com/elastic/elastic-package">Elastic Package</a> is an artifact that contains a collection of assets that extend the Elastic Stack, providing new capabilities to accomplish a specific task like integration with an external data source. The most established type of Elastic package is the <a href="https://github.com/elastic/integrations">integration package</a>, which provides an end-to-end experience — from configuring Elastic Agent, to collecting signals from the data source, to ingesting them correctly and using the data once ingested.</p>
<p>However, advanced users may need to customize data collection, either because an integration does not exist for a specific data source, or even if it does, they want to collect additional signals or in a different way. Input packages are another type of <a href="https://github.com/elastic/elastic-package">Elastic package</a> that provides the capability to configure Elastic Agent to use the provided inputs in a custom way.</p>
<h2>Let’s look at an example</h2>
<p>Say hello to Julia, who works as an engineer at Ascio Innovation firm. She is currently working with Oracle Weblogic server and wants to get a set of metrics for monitoring it. She goes ahead and installs Elastic <a href="https://docs.elastic.co/integrations/oracle_weblogic">Oracle Weblogic Integration</a>, which uses Jolokia in the backend to fetch the metrics.</p>
<p>Now, her team wants to take its monitoring further and has the following requirements:</p>
<ol>
<li>
<p>We should be able to extract metrics beyond the default set supported by the Oracle Weblogic Integration.</p>
</li>
<li>
<p>We want to have our own bespoke pipelines, visualizations, and experience.</p>
</li>
<li>
<p>We should be able to identify the metrics coming in from two different instances of Weblogic Servers by having data mapped to separate <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a>.</p>
</li>
</ol>
<p>All the above requirements can be met by using the <a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a> to get a customized experience. Let's see how.</p>
<p>Julia can add the configuration of the Jolokia input package as below, fulfilling the <em>first requirement</em>: the hostname, the JMX mappings for the fields she wants to fetch from the JVM application, and the <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name to which the response fields will be mapped.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-1-config-parameters.png" alt="Configuration Parameters for Jolokia Input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-2-expanded-doc.png" alt="Metrics getting mapped to the index created by the ‘jolokia_first_dataset’" /></p>
<p>Julia can customize her data by writing her own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing her customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Also, she can then build her own bespoke dashboards, hence meeting her <em>second requirement.</em></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-3-ingest-pipelines.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>Let’s say now Julia wants to use another instance of Oracle Weblogic and get a different set of metrics.</p>
<p>This can be achieved by adding another instance of the Jolokia input package and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name as shown in the screenshot below. The resultant metrics will be mapped to a different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html">index</a>/data set, fulfilling her <em>third requirement</em> and letting Julia differentiate metrics coming in from two different instances of Oracle Weblogic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-4-jolokia.png" alt="jolokia metrics" /></p>
<p>The resultant metrics of the query will be indexed to the new data set, jolokia_second_dataset in the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-5-dataset.png" alt="dataset" /></p>
<p>As we can see above, the Jolokia input package provides the flexibility to get new metrics by specifying different JMX Mappings, which are not supported in the default Oracle Weblogic integration (the user gets metrics from a predetermined set of JMX Mappings).</p>
<p>The Jolokia input package can also be used to monitor any Java-based application that exposes its metrics through JMX, so a single input package can be used to collect metrics from multiple Java applications and services.</p>
<h2>Elastic input packages</h2>
<p>Elastic has supported input packages since the 8.8.0 release. Some input packages are now available in beta and will mature gradually:</p>
<ol>
<li>
<p><a href="https://docs.elastic.co/integrations/sql">SQL input package</a>: The SQL input package allows you to execute queries against any SQL database and store the results in Elasticsearch<sup>®</sup>.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/prometheus_input">Prometheus input package</a>: This input package can collect metrics from <a href="https://prometheus.io/docs/instrumenting/exporters/">Prometheus Exporters (Collectors)</a>. It can be used by any service that exports its metrics to a Prometheus endpoint.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a>: This input package collects metrics from <a href="https://jolokia.org/agent.html">Jolokia agents</a> running on a target JMX server or dedicated proxy server. It can be used to monitor any Java-based application that exposes its metrics through JMX.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/statsd_input">Statsd input package</a>: The statsd input package spawns a UDP server and listens for metrics in a StatsD-compatible format. This input can be used to collect metrics from services that send data over the StatsD protocol.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/gcp_metrics">GCP Metrics input package</a>: The GCP Metrics input package can collect custom metrics for any GCP service.</p>
</li>
</ol>
<h2>Try it out!</h2>
<p>Now that you know more about input packages, try building your own customized integration for your service through input packages, and get started with an <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> free trial.</p>
<p>We would love to hear from you about your experience with input packages on the Elastic <a href="https://discuss.elastic.co/">Discuss</a> forum or in <a href="https://github.com/elastic/integrations">the Elastic Integrations repository</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/customize-observability-input-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability: Streams Data Quality and Failure Store Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams</link>
            <guid isPermaLink="false">data-quality-and-failure-store-in-streams</guid>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Streams, a new AI-driven Elastic Observability feature, helps manage data quality with a failure store so you can monitor, troubleshoot, and retain high-quality data.]]></description>
            <content:encoded><![CDATA[<p>When working with observability and logging data, not all documents make it into Elasticsearch in pristine condition. Some may be dropped due to processing failures in ingest pipelines or mapping errors, while others may be partially ingested with ignored fields if a field's value is incompatible with the defined mappings. These issues can impact downstream analysis and dashboards. Streams data quality makes it easier than ever to monitor the health of your ingested data, identify potential issues, and take corrective action right from the UI. With data quality, you can now see exactly how well your Stream is performing and quickly understand whether your data quality is <strong>Good</strong>, <strong>Degraded</strong>, or <strong>Poor</strong>.</p>
<h2>What's in data quality</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/data-quality-tab.png" alt="Data quality tab" /></p>
<h3>At-a-glance summary</h3>
<p>The summary card shows:</p>
<ul>
<li><strong>Degraded documents</strong> - Documents that contain the <code>_ignored</code> field - see <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-ignored-field">this</a> for more info.</li>
<li><strong>Failed documents</strong> - Documents that were rejected at ingestion due to mapping conflicts or pipeline failures.</li>
</ul>
<p>The overall <strong>quality score</strong> (Good, Degraded, Poor) is automatically calculated based on the percentage of degraded and failed documents.</p>
<h3>Trends over time</h3>
<p>The tab includes a time-series chart so you can track how degraded and failed documents are accumulating over time. Use the <strong>date picker</strong> to zoom into a specific range and understand when problems are spiking.</p>
<h3>Quality issues table</h3>
<p>A detailed table lists the types of issues affecting your stream. For each issue, you can:</p>
<ul>
<li>See which fields are causing problems.</li>
<li>Review counts of affected documents.</li>
<li>Filter by issues that have not been solved yet (Current issues only).</li>
<li>Open a <strong>flyout</strong> to dive deeper into the cause of the issue and learn how to fix it.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/quality-issue-flyout.png" alt="Data quality issue flyout" /></p>
<h2>Monitoring degraded documents</h2>
<p>A degraded document is one that contains the <code>_ignored</code> field, which means one or more of its fields were ignored during indexing. One of the reasons could be that their values didn’t match the expected mappings. While the rest of the document is still indexed, a high number of degraded documents can affect query results, dashboards, and overall observability accuracy.</p>
<p>To help keep these issues under control, the Data quality tab provides visibility into the percentage of degraded documents in your stream.</p>
<h3>Set up a rule to stay ahead of issues</h3>
<p>You can use the <strong>Create rule</strong> button above the Degraded docs chart to define an alert that notifies you when the percentage of degraded documents crosses a certain threshold. This makes it easy to proactively monitor for mapping mismatches and ensure your data continues to meet quality expectations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/create-rule-button.png" alt="Create rule button" /></p>
<p>For more information on how to configure this rule, see <a href="https://www.elastic.co/docs/solutions/observability/incident-management/create-a-degraded-docs-rule#degraded-docs-rule-conditions">Degraded docs rule conditions</a>.</p>
<h2>Handling failed documents with the failure store</h2>
<p><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store"><strong>Failure store</strong></a> is a special index that captures documents rejected during ingestion. Instead of losing this data, the failure store retains it in a dedicated <code>::failures</code> index, allowing you to inspect the problematic documents, understand what went wrong, and fix the underlying issues.</p>
<p>In the Data quality tab, failed documents are only visible if your stream has a failure store enabled; viewing failure store documents requires at least the <code>read_failure_store</code> privilege. If the failure store is <strong>not enabled</strong>, you’ll see an <strong>“Enable failure store”</strong> link that opens a modal to configure it and set the retention period. Enabling the failure store requires the <code>manage_failure_store</code> privilege on the specific data stream. For further information about failure store security, refer to <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store#use-failure-store-searching">Searching failures</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/enable-fs-link.png" alt="Enable failure store link" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/failure-store-modal.png" alt="Failure store configuration modal" /></p>
<p>Once enabled, you can <strong>edit the failure store configuration</strong> or disable it at any time using the <strong>Edit</strong> button above the failed docs chart.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/edit-fs-button.png" alt="Edit failure store button" /></p>
<p>The failure store can also be configured in the Streams Retention tab - see <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams.mdx">this article</a> for more information.</p>
<h2>Technical implementation</h2>
<p>Under the hood, the <strong>Data quality</strong> tab builds on the existing <strong>Dataset quality</strong> plugin - the same one that powers the <a href="https://www.elastic.co/docs/solutions/observability/data-set-quality-monitoring"><strong>Dataset quality page</strong></a> in <strong>Stack Management</strong>. However, instead of working in the context of datasets following the Data stream naming scheme, it’s now tailored specifically for <strong>streams</strong>.</p>
<p>To determine the quality of a stream, the UI sends three <strong>ES|QL</strong> queries to the server:</p>
<ol>
<li><strong>All documents (including failures):</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream, myStream::failures | STATS doc_count = COUNT(*)
</code></pre>
<ol start="2">
<li><strong>Failed documents only:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream::failures | STATS failed_doc_count = COUNT(*)
</code></pre>
<ol start="3">
<li><strong>Degraded documents:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream METADATA _ignored | WHERE _ignored IS NOT NULL | STATS degraded_doc_count = COUNT(*)
</code></pre>
<p>The results of these queries are then used to calculate the <strong>percentages</strong> of failed and degraded documents. The overall data quality is determined using simple thresholds:</p>
<ul>
<li><strong>Good:</strong> Both percentages are 0%</li>
<li><strong>Degraded:</strong> Any percentage is greater than 0% but less than 3%</li>
<li><strong>Poor:</strong> Any percentage is 3% or above</li>
</ul>
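<p>The resulting classification amounts to a couple of comparisons. As a minimal sketch of the thresholds above (the class and method names here are ours, not the plugin's):</p>
<pre><code class="language-java">public class QualityScore {
    // Classify a stream from total/failed/degraded document counts using
    // the thresholds above: 0% is Good, under 3% is Degraded, else Poor.
    static String quality(long total, long failed, long degraded) {
        if (total == 0) return "Good";
        double worstPct = 100.0 * Math.max(failed, degraded) / total;
        if (worstPct == 0.0) return "Good";
        return worstPct < 3.0 ? "Degraded" : "Poor";
    }

    public static void main(String[] args) {
        System.out.println(quality(1000, 0, 0));  // Good
        System.out.println(quality(1000, 5, 10)); // Degraded (1%)
        System.out.println(quality(1000, 50, 0)); // Poor (5%)
    }
}
</code></pre>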
<p>For managing the <strong>failure store</strong>, Streams uses the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options">Update data stream options API</a> with the <code>failure_store</code> parameter to configure and update the failure store settings, including enabling the store and setting the retention period.</p>
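<p>In Kibana Dev Tools syntax, enabling the failure store through that API looks roughly like the following; treat this as a sketch and see the linked API reference for the full set of supported options:</p>
<pre><code class="language-json">PUT _data_stream/myStream/_options
{
  "failure_store": {
    "enabled": true
  }
}
</code></pre>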
<h2>Why you’ll love this</h2>
<p>The new <strong>Data quality</strong> tab gives you:</p>
<ul>
<li>Visibility into ingestion problems without digging into logs</li>
<li>A clear breakdown of degraded vs. failed documents</li>
<li>Insights into which fields are ignored and why</li>
<li>Tools to capture and troubleshoot failed docs with the failure store</li>
</ul>
<p>By surfacing data quality issues directly in the Streams UI, we’re making it easier to keep your data flowing reliably and to ensure your analytics are built on a strong foundation.</p>
<h2><strong>Try it out today</strong></h2>
<p>The <strong>data quality</strong> feature is available in <strong>Elastic Observability on Serverless</strong> and is coming soon for self-managed and Elastic Cloud users.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>For more information on Streams:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[A day in the life of an OpenTelemetry maintainer]]></title>
            <link>https://www.elastic.co/observability-labs/blog/day-opentelemetry-maintainer</link>
            <guid isPermaLink="false">day-opentelemetry-maintainer</guid>
            <pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we will look at what the role of a maintainer involves, and how maintainers keep open source projects alive.]]></description>
            <content:encoded><![CDATA[<p>I’m <a href="https://github.com/dmathieu">Damien</a>, an engineer at Elastic, a maintainer of the OpenTelemetry Go SDK, an approver on the OpenTelemetry Collector, and a member of several SIGs.
In this post, we’ll take a closer look at what it means to be a maintainer: the responsibilities they carry, the challenges they navigate, and the impact they have on both the project and the broader community.</p>
<p><a href="https://youtu.be/eZ3OrhxUAmU?t=72"><img src="https://www.elastic.co/observability-labs/assets/images/day-opentelemetry-maintainer/humans-of-otel.png" alt="Humans of OTel at KubeCon EU 2025" /></a></p>
<p>When people think about open source, they often picture lines of code, clever algorithms, or maybe a GitHub repo full of issues and pull requests.
What can be harder to see is the human side. The people who quietly keep things moving, who make sure contributions land smoothly and help the community grow in a healthy way.
That's the work of a maintainer.</p>
<p>Maintainers are more than just code reviewers. They are the stewards of the SIG's (<a href="https://github.com/open-telemetry/community#special-interest-groups">Special Interest Group</a>) health, direction, and community.
They balance technical oversight with mentorship, governance with collaboration, and long-term vision with the day-to-day realities of issues and pull requests.</p>
<h2>Open Source mentorship</h2>
<p>One of the most rewarding parts of being a maintainer is mentorship.
Every open source project depends on new contributors stepping in, learning the ropes, and eventually taking on more responsibility themselves.
As maintainers, we’re often the first point of contact for someone who’s never contributed to the project before.</p>
<p>Mentorship can look like many different things.
Sometimes it’s as simple as leaving a thoughtful code review that doesn’t just point out what’s wrong, but explains why a change matters.
Other times, it’s guiding a contributor through their first issue, helping them understand the project’s structure, or showing them how to run tests locally.
And every so often, it means stepping back to give someone room to try, even if they don’t get it right the first time.</p>
<p>The goal isn’t just to fix the immediate bug or land the pull request. It's to help contributors feel confident enough to come back again.
A healthy project grows by sharing knowledge, not hoarding it.
Mentorship is how maintainers make sure today's first-time contributor can become tomorrow's reviewer, and eventually, the next maintainer.</p>
<h2>Setting direction and priorities</h2>
<p>Another part of being a maintainer is shaping the project's roadmap.
Open source moves fast: there are always new ideas, bug reports, and feature requests.
Left unchecked, a project can easily become a grab bag of loosely connected changes.
Part of our job as maintainers is to make sure the work stays aligned with the bigger picture.</p>
<p>That means asking questions like:</p>
<ul>
<li>Does this feature fit with our long-term goals?</li>
<li>Is now the right time to tackle it?</li>
<li>Do we have the capacity to maintain it once it’s merged?</li>
</ul>
<p>Sometimes the answer is &quot;not yet&quot; or even &quot;no&quot;, and it’s on us to communicate that clearly while still encouraging contributions.</p>
<p>Roadmapping isn't about dictating every detail.
It's about setting priorities together with the community—listening to feedback, balancing what users need today with where the project should be tomorrow, and making tradeoffs that keep the project sustainable.</p>
<p>The roadmap gives everyone a shared sense of direction.
Contributors know where their work fits in, users can see what's coming next, and the project as a whole stays focused instead of scattered.</p>
<h2>Special Interest Group meetings</h2>
<p>One of the maintainer's roles is also to facilitate the frequent meetings that help their SIG communicate and plan its work.</p>
<p>Facilitating a SIG meeting isn’t about running through an agenda like a checklist.
It’s about creating space where everyone feels comfortable speaking up, from long-time contributors to someone joining their very first call.
That means keeping discussions focused, making sure quieter voices get heard, and helping the group reach consensus without letting debates drag on forever.</p>
<p>There's also a practical side: preparing the agenda ahead of time, documenting decisions so they’re visible to the wider community, and following up on action items afterward.</p>
<p>In many ways, SIG meetings are where the &quot;community&quot; part of open source really comes to life.
As maintainers, our role is to guide the conversation, not control it, making sure the project keeps moving forward while staying open and inclusive.</p>
<h2>Challenges</h2>
<p>Of course, maintaining isn’t all smooth sailing.
One of the hardest parts is balancing the constant flow of contributions with the need to keep the codebase healthy.
Every pull request represents someone's time and effort, and it’s important to honor that.
Yet, at the same time, not every change fits the project's standards or long-term goals.
Saying &quot;no&quot; gracefully is just as important as merging a great contribution.</p>
<p>Maintainers also find themselves balancing priorities that go beyond code.
Different contributors, and often the companies backing them, come with their own needs and expectations.
One team might want a new feature quickly, another might be focused on stability, while the community as a whole still needs clear direction.
Managing those competing priorities, and making decisions that serve the project rather than any single interest, is a constant challenge.</p>
<p>Conflicts are another reality. With so many people involved, it's inevitable that disagreements will happen.
Sometimes it's about technical design, sometimes about process, and occasionally about interpersonal dynamics.
Part of the maintainer role is helping to navigate those moments: keeping discussions respectful, finding common ground, and making sure decisions are made transparently.</p>
<p>And yet, despite the difficulties, the impact of this work is enormous: when maintainers succeed, the entire community thrives.</p>
<h2>The importance and impact of Open Source maintainers</h2>
<p>When maintainers do their job well, the effects ripple far beyond the codebase.
A well-tended project feels reliable and welcoming—contributors know their work will be reviewed thoughtfully, users trust the software to be stable, and the community grows because people want to come back.</p>
<p>Good project maintenance builds momentum.
A contributor who feels supported on their first pull request is more likely to return for a second.
Clear roadmaps and consistent standards give people confidence that their effort matters and will fit into the bigger picture. And when conflicts are handled with respect and transparency, it reinforces the culture of trust that makes open source sustainable.</p>
<p>The impact goes deeper than just keeping a project alive.
Effective maintainers create the conditions for others to succeed.
That's the real legacy of this role: not just code, but a thriving ecosystem and community built around it.</p>
<h2>Conclusion</h2>
<p>Being a maintainer is challenging work, but it’s also some of the most meaningful.
It’s about more than merging code. It's about stewardship, mentorship, and creating a community where people feel empowered to contribute.
Every healthy open source project owes its success to the care and commitment of its maintainers.</p>
<p>And while the challenges are real, the rewards are just as tangible: the chance to constantly learn, to collaborate on complex problems, and to connect with people from every corner of the world and every kind of background.</p>
<p>OpenTelemetry's maintainers embody this balance every day, helping the project grow while keeping its community strong.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/day-opentelemetry-maintainer/blog-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Debugging Azure Networking for Elastic Cloud Serverless]]></title>
            <link>https://www.elastic.co/observability-labs/blog/debugging-aks-packet-loss</link>
            <guid isPermaLink="false">debugging-aks-packet-loss</guid>
            <pubDate>Thu, 05 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic SREs uncovered and resolved unexpected packet loss in Azure Kubernetes Service (AKS), impacting Elastic Cloud Serverless performance.]]></description>
            <content:encoded><![CDATA[&lt;h2&gt; Summary of Findings &lt;/h2&gt; 
<p>Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.</p>
&lt;h2&gt; Setting the Scene &lt;/h2&gt;
<p><a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> is a fully managed solution that allows you to deploy and use Elastic for your use cases without managing the underlying infrastructure. Built on Kubernetes, it represents a shift in how you interact with Elasticsearch. Instead of managing clusters, nodes, data tiers, and scaling, you create serverless projects that are fully managed and automatically scaled by Elastic. This abstraction of infrastructure decisions allows you to focus solely on gaining value and insight from your data.</p>
<p>Elastic Cloud Serverless is generally available (GA) on AWS and GCP, and currently in <a href="https://www.elastic.co/guide/en/serverless/current/regions.html">Technical Preview on Azure</a>. As part of preparing Elastic Cloud Serverless for GA on Azure, we have been conducting extensive performance and scalability tests to ensure that our users get a consistent and reliable user experience.</p>
<p>In this post, we’ll take you behind the scenes of a deep technical investigation into a surprising performance issue that affected Serverless Elasticsearch in our Azure Kubernetes clusters. At first, the network seemed like the least likely place to look, especially with a high-speed 100 Gb/s interface on the host backing it. But as we dug deeper, with help from the Microsoft Azure team, that’s exactly where the problem led us.</p>
&lt;h2&gt; Unexpected Results! &lt;/h2&gt;
<p>While the high-level architectures and system design patterns of the major cloud providers’ systems are often similar, the implementations differ, and those differences can have dramatic impacts on a system’s performance characteristics.</p>
<p>One of the most significant differences between the different cloud providers is that the underlying hypervisor software and server hardware of the Virtual Machines can vary significantly, even between instance families of the same provider.</p>
<p>There is no way to fully abstract the hardware away from an application like Elasticsearch. Fundamentally, its performance is dictated by the CPU, memory, disks, and network interfaces on the physical server. In preparation for the Elastic Cloud Serverless GA on Azure, our Elasticsearch Performance team kicked off large-scale load testing against Serverless Elasticsearch projects running on <a href="https://docs.azure.cn/en-us/aks/what-is-aks">Azure Kubernetes Service (AKS)</a>, using <a href="https://azure.microsoft.com/en-us/blog/azure-cobalt-100-based-virtual-machines-are-now-generally-available/">ARM-based VMs</a> (we’re big fans!). Throughout this process, we relied heavily on Elastic tools to analyse system behaviour, identify bottlenecks, and validate performance under load.</p>
<p>To perform these scale and load tests, the Elasticsearch Performance team use <a href="https://github.com/elastic/rally">Rally</a>, an open-source benchmarking tool designed to measure the performance of Elasticsearch clusters. The workload (or in Rally nomenclature, ‘Track’) used for these tests was the <a href="https://github.com/elastic/rally-tracks/tree/master/github_archive">GitHub Archive Track</a>. Rally collects and sends test telemetry using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/python">official Python client</a> to a separate Elasticsearch cluster running <a href="https://www.elastic.co/observability">Elastic Observability</a>, which allows for monitoring and analysis during these scale and load tests in real time via <a href="https://www.elastic.co/docs/explore-analyze">Kibana</a>.</p>
<p>When we looked at the results, we observed that the indexing rate (the number of docs/s) for the Serverless projects was not only much lower than we had expected for the given hardware, but the throughput was also quite unstable. There were peaks and valleys, interspersed with frequent errors, whereas we were instead expecting a stable indexing rate for the duration of the test.</p>
<p>These tests are designed to push the system to its limits, and in doing so, they surfaced unexpected behavior in the form of unstable indexing throughput and intermittent errors. This was precisely the kind of problem we'd hoped to uncover prior to going GA, giving us the opportunity to work closely with Azure to resolve it.</p>
&lt;div align=&quot;center&quot;&gt;
![Indexing Rate with Packet Loss](/assets/images/debugging-aks-packet-loss/indexing-rate-before.png)
_A Kibana visualisation of Rally telemetry, showing fluctuating Elasticsearch indexing rates alongside spikes in 5xx and 4xx HTTP error responses._
&lt;/div&gt;
&lt;h2&gt; Debugging! &lt;/h2&gt;
<p>Debugging performance issues can feel a little bit like trying to find a <a href="https://www.youtube.com/watch?v=7AO4wz6gI3Q">‘Butterfly in a Hurricane’</a>, so it’s crucial to take a methodical approach to analysing application and system performance.</p>
<p>Using established methodologies helps you be more consistent and thorough in your debugging, and avoid missing things. We started with the <a href="https://www.brendangregg.com/usemethod.html">Utilisation Saturation and Errors (USE) Method</a>, looking at both the client and server side to identify any obvious bottlenecks in the system.</p>
<p>Elastic's Site Reliability Engineers (SREs) maintain a suite of custom <a href="https://www.elastic.co/docs/solutions/observability/get-started/what-is-elastic-observability">Elastic Observability</a> dashboards designed to visualise data collected from various <a href="https://www.elastic.co/docs/extend/integrations/what-is-an-integration">Elastic Integrations</a>. These dashboards provide deep visibility into the health and performance of Elastic Cloud infrastructure and systems.</p>
<p>For this investigation, we leveraged a custom dashboard built using metrics and log data from the <a href="https://www.elastic.co/docs/reference/integrations/system">System</a> and <a href="https://www.elastic.co/docs/reference/integrations/linux">Linux</a> Integrations:</p>
&lt;div align=&quot;center&quot;&gt;
  ![Node Overview Dashboard](/assets/images/debugging-aks-packet-loss/overview-dashboard.png)
  _One of many Elastic Observability dashboards built and maintained by the SRE team._
&lt;/div&gt;
<p>Following the USE Method, these dashboards highlight resource utilisation, saturation, and errors across our systems. With their help, we quickly identified that the AKS nodes hosting the Elasticsearch pods under test were dropping thousands of packets per second.</p>
&lt;div align=&quot;center&quot;&gt;
![Node Packet Loss Before Tuning](/assets/images/debugging-aks-packet-loss/packet-loss-before.png)
_A Kibana visualisation of [Elastic Agent's System Integration](https://www.elastic.co/docs/reference/integrations/system), showing the rate of packet drops per second for AKS nodes._
&lt;/div&gt;
<p>Dropping packets forces reliable protocols, such as TCP, to retransmit the missing packets. These retransmissions can introduce significant delays, which kills the throughput of any system where each client request is only sent once the previous one completes (known as a <a href="https://www.usenix.org/legacy/event/nsdi06/tech/full_papers/schroeder/schroeder.pdf">Closed System</a>).</p>
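Retransmissions show up in the kernel's own counters, so they are easy to spot without any extra tooling. As a minimal sketch (the column lookup is derived from the header row of <code>/proc/net/snmp</code>, so no field index is hard-coded):

```shell
# Print the kernel's cumulative TCP retransmission counter. The first "Tcp:"
# row of /proc/net/snmp is a header; we use it to locate the RetransSegs
# column instead of hard-coding a field index.
awk '/^Tcp:/ {
    if (!header_seen) { for (i = 1; i <= NF; i++) col[$i] = i; header_seen = 1 }
    else print "TCP segments retransmitted since boot:", $(col["RetransSegs"])
}' /proc/net/snmp
```

Sampling this counter before and after a load test gives a quick sense of how much loss the transport layer is absorbing.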
<p>To investigate further, we jumped onto one of the AKS nodes exhibiting the packet loss to check the basics. First off, we wanted to identify what type of packet drops or errors we were seeing; were they limited to specific pods, or affecting the host as a whole?</p>
<pre><code>root@aks-k8s-node-1:~# ip -s link show
2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    373507935420 134292481      0       0       0      15
    TX:    bytes   packets errors dropped carrier collsns
    644247778936 303191014      0       0       0       0
3: enP42266s1: &lt;BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP&gt; mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    386782548951 307000571      0       0 5321081       0
    TX:    bytes   packets errors dropped carrier collsns
    655758630548 477594747      0       0       0       0
    altname enP42266p0s2
15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>In this output you can see the <code>enP42266s1</code> interface is showing a significant number of packets in the <code>missed</code> column. That’s interesting, sure, but what does missed actually represent? And what is <code>enP42266s1</code>?</p>
<p>To understand, let’s look at roughly what happens when a packet arrives at the NIC:</p>
<ol>
<li>A packet arrives at the NIC from the network.</li>
<li>The NIC uses DMA (Direct Memory Access) to place the packet into a receive ring buffer allocated in memory by the kernel and mapped for use by the NIC. Since our NICs support multiple hardware queues, each queue has its own dedicated ring buffer, IRQ, and NAPI context.</li>
<li>The NIC raises a hardware interrupt (IRQ) to notify the CPU that a packet is ready.</li>
<li>The CPU runs the NIC driver’s IRQ handler. The driver schedules a NAPI (New API) poll to defer packet processing to a softirq context, a Linux kernel mechanism that defers work outside of the hard IRQ context for better batching, CPU efficiency, and scalability.</li>
<li>The NAPI poll function is executed in a softirq context (<code>NET_RX_SOFTIRQ</code>) and retrieves packets from the ring buffer. This polling continues either until the driver’s packet budget is exhausted (<code>net.core.netdev_budget</code>) or the time limit is hit (<code>net.core.netdev_budget_usecs</code>).</li>
<li>Each packet is wrapped in an <code>sk_buff</code> (socket buffer) structure, which includes metadata such as protocol headers, timestamps, and interface identifiers.</li>
<li>If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</li>
<li>Packets are then handed off to the kernel’s networking stack for routing, filtering, and protocol-specific processing (e.g. TCP, UDP).</li>
<li>Finally, packets reach the appropriate socket receive buffer, where they are available for consumption by the user-space application.</li>
</ol>
<p>Visualised, it looks something like this:</p>
&lt;div align=&quot;center&quot;&gt;
![Linux Packet Flow Diagram](/assets/images/debugging-aks-packet-loss/packet-flow.png)
_Image © 2018 Leandro Moreira. Used under the [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause). Source: [GitHub repository](https://github.com/leandromoreira/linux-network-performance-parameters)._
&lt;/div&gt;
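The per-CPU backlog from step 7 is directly observable: each row of <code>/proc/net/softnet_stat</code> corresponds to one CPU, and the second hexadecimal column counts packets dropped because that CPU's backlog queue was full. A small sketch to total them:

```shell
# Sum backlog drops (column 2, in hex) across all CPUs. A steadily rising
# total under load suggests net.core.netdev_max_backlog is too small.
total=0
while read -r _processed dropped _rest; do
    total=$(( total + 0x$dropped ))
done < /proc/net/softnet_stat
echo "backlog drops across all CPUs: $total"
```

Note that these drops are attributed to the kernel's input path rather than to any one interface, which is part of why they can be easy to miss.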
<p>The <code>missed</code> counter is incremented whenever the NIC tries to DMA a packet into a fully occupied <a href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffer</a>. The NIC essentially &quot;misses&quot; the chance to deliver the packet to the VM’s memory. However, what’s most interesting is that this counter seldom increments for VMs. This is because Virtual NICs are usually implemented as software via the hypervisor, which typically has much more flexible memory management compared to the physical NICs and can reduce the chance of ring buffer overflow.</p>
<p>We mentioned earlier that we’re building Azure Elasticsearch Serverless on top of Azure’s AKS service, which is important to note because all of our AKS nodes use an Azure feature called <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview">Accelerated Networking</a>. In this setup, network traffic is delivered directly to the VM’s network interface, bypassing the hypervisor. This is enabled by <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-">single root I/O virtualization (SR-IOV)</a>, which offers much lower latency and higher throughput than traditional VM networking. Each node is physically connected to a 100 Gb/s network interface, although the SR-IOV Virtual Function (VF) exposed to the VM typically provides only a fraction of that total bandwidth.</p>
<p>Despite the VM only having a fraction of the 100 Gb/s bandwidth, microbursts are still very possible. These physical interfaces are so fast that they can transmit and receive multiple packets in just nanoseconds, far faster than most buffers or processing queues can absorb. At these timescales, even a short-lived burst of traffic can overwhelm the receiver, leading to dropped packets and unpredictable latency.</p>
<p>Direct access to the SR-IOV interface means that our VMs are responsible for handling the hardware interrupts triggered by the NIC in a timely manner. If there's any delay in handling the hardware interrupt (e.g. waiting to be scheduled onto a CPU by the hypervisor), then network packets can be missed!</p>
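The counters shown by <code>ip -s link show</code> are also exposed per interface in sysfs, which makes them easy to poll from a script or metrics agent. A minimal sketch; the interface name here is illustrative (<code>lo</code>), whereas on the affected nodes the interesting one was the SR-IOV VF, <code>enP42266s1</code>:

```shell
# rx_missed_errors is the "missed" column, and rx_dropped/tx_dropped are the
# "dropped" columns from `ip -s link show`. Substitute the SR-IOV VF name
# (e.g. enP42266s1) when running this on a real node.
IFACE=lo
for counter in rx_missed_errors rx_dropped tx_dropped; do
    printf '%s %s=%s\n' "$IFACE" "$counter" "$(cat /sys/class/net/$IFACE/statistics/$counter)"
done
```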
&lt;h2&gt; Firstly - NIC-level Tuning &lt;/h2&gt;
<p>Since we'd confirmed that our VMs were using SR-IOV, we established that the <code>enP42266s1</code> and <code>eth0</code> interfaces <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works">were a bonded pair and acted as a single interface</a>. Knowing this, we reasoned that we should be able to adjust the ring buffer values directly using <code>ethtool</code>.</p>
<pre><code>root@aks-k8s-node-1:~# ethtool -g enP42266s1
Ring parameters for enP42266s1:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024
</code></pre>
<p>In the output above, we were using only 1/8th of the available ring buffer descriptors. These values were set by the OS defaults, which generally aim to balance performance and resource usage. Set too low, they risk packet drops under load; set too high, they can lead to unnecessary memory consumption. We knew that the VMs were backed by a virtual function carved out of the directly attached 100 Gb/s network interface, which is fast enough to deliver microbursts that could easily overwhelm small buffers. To better absorb those short, high-intensity bursts of traffic, we increased the NIC’s RX ring buffer size from 1024 to 8192. Using a privileged DaemonSet, we rolled out the change across all of our AKS nodes by installing <a href="https://en.wikipedia.org/wiki/Udev">a <code>udev</code> rule</a> to automatically increase the buffer size:</p>
<pre><code># Match Mellanox ConnectX network cards and run ethtool to update the ring buffer settings
ENV{INTERFACE}==&quot;en*&quot;, ENV{ID_NET_DRIVER}==&quot;mlx5_core&quot;, RUN+=&quot;/sbin/ethtool -G %k rx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE} tx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE}&quot;
</code></pre>
&lt;div align=&quot;center&quot;&gt;
![AKS Node Packet Loss after RX ring buffer change](/assets/images/debugging-aks-packet-loss/packet-loss-after.png)
_A Kibana visualisation of [Elastic Agent's System Integration](https://www.elastic.co/docs/reference/integrations/system), showing packet loss reduced by ~99% after increasing the NIC's RX ring buffer values._
&lt;/div&gt;
<p>As soon as the change had been applied to all AKS nodes, we stopped ‘missing’ RX packets! Fantastic! As a result of this simple change, we observed a significant improvement in our indexing throughput and stability.</p>
&lt;div align=&quot;center&quot;&gt;
![Indexing rate after RX ring buffer change](/assets/images/debugging-aks-packet-loss/indexing-rate-after.png)
_A Kibana visualisation of Rally telemetry, showing stable and improved Elasticsearch indexing rates after increasing the RX ring buffer size._
&lt;/div&gt;
<p>Job done, right? Not quite…</p>
&lt;h2&gt; Further improvements - Kernel-level Tuning &lt;/h2&gt;
<p>Eagle-eyed readers may have noticed two things:</p>
<ol>
<li>In the previous screenshot, despite adjusting the physical RX ring buffer values, we still observed a small number of <code>dropped</code> packets on the TX side.</li>
<li>In the original <code>ip -s link show</code> output, one of the ‘logical’ interfaces used by the Elasticsearch pod was showing <code>dropped</code> packets on both the TX and RX sides.</li>
</ol>
<pre><code>15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>So, we continued to dig. We’d eliminated ~99% of the packet loss, and the remaining loss rate wasn’t as significant as what we’d started with, but we still wanted to understand why it was occurring even after adjusting the RX ring buffer size of the NIC.</p>
<p>So what does <code>dropped</code> represent, and what is this <code>lxc0ca0ec41ecd2</code> interface? <code>dropped</code> is similar to <code>missed</code>, but only occurs when packets are deliberately dropped by the kernel or network interface. Crucially though, it doesn’t tell you why a packet was dropped. As for the <code>lxc0ca0ec41ecd2</code> interface, we use the <a href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium">Azure CNI Powered by Cilium</a> to provide the network functionality to our AKS clusters. Any pod spun up on an AKS node gets a ‘logical’ interface, which is a virtual ethernet (<code>veth</code>) pair that connects the pod’s network namespace with the host’s network namespace. It was here that we were dropping packets.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/aks-node-network-topology.png" alt="AKS Node Networking Diagram" /></p>
<p>In our experience, packet drops at this layer are unusual, so we started digging deeper into their cause. There are numerous ways to debug why a packet is being dropped, but one of the easiest is <a href="https://perfwiki.github.io/main/">to use <code>perf</code></a> to attach to the <code>skb:kfree_skb</code> tracepoint. The &quot;socket buffer&quot; (<code>skb</code>) is the primary data structure used to represent network packets in the Linux kernel. When a packet is dropped, its corresponding socket buffer is usually freed, triggering the <code>kfree_skb</code> tracepoint. Using <code>perf</code> to attach to this event allowed us to capture stack traces to analyze the cause of the drops.</p>
&lt;div align=&quot;center&quot;&gt;
```
# perf record -g -a -e skb:kfree_skb
```
&lt;/div&gt;
<p>We left this to run for ~10 minutes or so to capture as many drops as possible, and then, ‘heavily inspired’ by <a href="https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290">this GitHub Gist by Ivan Babrou</a>, we converted the stack traces into easier-to-read <a href="https://github.com/brendangregg/FlameGraph">Flamegraphs</a>:</p>
<pre><code># perf script | sed -e 's/skb:kfree_skb:.*reason:\(.*\)/\n\tfffff \1 (unknown)/' -e 's/^\(\w\+\)\s\+/kernel /' &gt; stacks.txt
cat stacks.txt | stackcollapse-perf.pl --all | perl -pe 's/.*?;//' | sed -e 's/.*irq_exit_rcu_\[k\];/irq_exit_rcu_[k];/' | flamegraph.pl --colors=java --hash --title=aks-k8s-node-1 --width=1440 --minwidth=0.005 &gt; aks-k8s-node-1.svg
</code></pre>
&lt;div align=&quot;center&quot;&gt;
![AKS Node Packet Loss Flamegraph](/assets/images/debugging-aks-packet-loss/aks-packet-loss-flamegraph.png)
_A Flamegraph showing the various stack trace ancestry of packet loss._
&lt;/div&gt;
<p>The flamegraph here shows how often different functions appeared in the stack traces for packet drops. Each box represents a function call, and wider boxes mean the function appears more frequently in the traces. The stack's ancestry builds upward, from earlier calls at the bottom to later calls at the top.</p>
<p>We quickly discovered that, unfortunately, the <code>skb_drop_reason</code> enum <a href="https://github.com/torvalds/linux/commit/c504e5c2f9648a1e5c2be01e8c3f59d394192bd3">was only added in kernel 5.17</a> (Azure’s node image at the time used 5.15). This meant there was no human-readable message telling us why packets were being dropped; all we got was <code>NOT_SPECIFIED</code>. To work out why, we needed to do a little sleuthing through the stack traces to determine what code paths were being taken when a packet was dropped.</p>
<p>In the flamegraph above you can see that many of the stack traces include <code>veth</code> driver function calls (e.g. <code>veth_xmit</code>), and many end abruptly with a call to the <code>enqueue_to_backlog</code> function. When many stacks end at the same function (like <code>enqueue_to_backlog</code>) it suggests that function is a common point where packets are being dropped. If you go back to the earlier explanation of what happens when a packet arrives at the NIC, you’ll notice that in step 7 we explained:</p>
<blockquote>
<p><em>7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</em></p>
</blockquote>
<p>Using the same privileged DaemonSet method as for the RX ring buffer adjustment, we raised the <code>net.core.netdev_max_backlog</code> kernel parameter from 1000 to 32768:</p>
<pre><code>/usr/sbin/sysctl -w net.core.netdev_max_backlog=32768
</code></pre>
<p>This value was based on the fact that we knew the hosts were using a 100 Gb/s SR-IOV NIC, even if the VM was allowed only a fraction of the total bandwidth. We acknowledge it’s worth revisiting this value in the future to see if it can be optimised to avoid wasting memory, but at the time, “perfect was the enemy of good”.</p>
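Reading the parameter back requires no privileges, so it's an easy post-rollout check that the DaemonSet actually landed on each node:

```shell
# Equivalent to `sysctl -n net.core.netdev_max_backlog`; after the rollout
# this should print the raised value (32768 in our case) rather than the
# default of 1000.
cat /proc/sys/net/core/netdev_max_backlog
```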
<p>We re-ran the load tests and compared the three sets of results we’d collected thus far.</p>
&lt;div align=&quot;center&quot;&gt;
![Final Indexing Rate Results](/assets/images/debugging-aks-packet-loss/indexing-rate-final.png)
_A Kibana visualisation of Rally results, comparing impact to median throughput after each configuration change._
&lt;/div&gt;
<table>
<thead>
<tr>
<th>Tuning Step</th>
<th>Packet Loss</th>
<th>Median indexing throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>High</td>
<td>~18,000 docs/s</td>
</tr>
<tr>
<td>+RX Buffer</td>
<td>~99% drop ↓</td>
<td>~26,000 (+ ~40% from baseline)</td>
</tr>
<tr>
<td>+Backlog &amp; +RX Buffer</td>
<td>Near zero</td>
<td>~29,000 (+ ~60% from baseline)</td>
</tr>
</tbody>
</table>
<p>Here you can see the P50 throughput in docs/s over the course of the hours-long load tests. Compared to the baseline, we saw a roughly <strong>~40%</strong> increase in throughput by only adjusting the RX ring buffer values, and a <strong>~60%</strong> increase with both the RX ring buffer and backlog changes! Hooray!</p>
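As a rough sanity check of those percentages against the median figures in the table:

```shell
# (26000/18000 - 1) and (29000/18000 - 1), expressed as percentages.
awk 'BEGIN {
    printf "RX buffer only:      %+.0f%%\n", (26000 / 18000 - 1) * 100
    printf "RX buffer + backlog: %+.0f%%\n", (29000 / 18000 - 1) * 100
}'
```

With the table's medians this works out to roughly +44% and +61%, consistent with the rounded figures quoted above.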
<p>A great result and one more step on our journey towards better Serverless Elasticsearch performance.</p>
&lt;h2&gt; Working with Azure &lt;/h2&gt;
<p>It’s great that we were able to quickly identify and mitigate the majority of our packet loss issues, but since we were running on AKS with Azure’s node images, it made sense to engage with Azure to understand why the defaults weren’t working for our workload.</p>
<p>We walked Azure through our investigation, mitigations, and results, and asked for additional validation of our mitigations. Azure Engineering confirmed that the host NICs were not discarding packets, meaning everything arriving at the host level was being passed through to the hypervisor. Further investigation confirmed that no loss or discards were occurring in the Azure network fabric or internal to the hypervisor, which shifted the focus from the host to the guest OS, and to why the guest kernel was slow to read packets off the <code>enP*</code> SR-IOV interfaces.</p>
<p>Given the complexity of our load testing scenario, which involved configuring multiple systems and tools (including <a href="https://www.elastic.co/observability">Elastic Observability</a>), we also developed a simplified reproduction of the packet loss issue using <a href="https://github.com/esnet/iperf"><code>iperf3</code></a>. This simplified test was created specifically to share with Azure for targeted analysis, complementing the broader monitoring and analysis enabled by Elastic Observability and Rally.</p>
<p>With this reproduction, Azure was able to confirm the increasing <code>missed</code> and <code>dropped</code> packet counters we had observed, and endorsed the RX ring buffer and <code>netdev_max_backlog</code> increases as the recommended mitigations.</p>
&lt;h2&gt; Conclusion &lt;/h2&gt;
<p>While cloud providers offer various abstractions to manage your resources, the underlying hardware ultimately determines your application's performance and stability. High-performance hardware often requires tuning at the operating system level, well beyond the default settings most environments ship with. In managed platforms like AKS, where Azure controls both the node images and infrastructure, it is easy to overlook the impact of low-level configurations such as network device ring buffer sizes or sysctls like <code>net.core.netdev_max_backlog</code>.</p>
<p>Our experience shows that even with the convenience of a managed Kubernetes service, performance issues can still emerge if these hardware parameters are not tuned appropriately. It was tempting to assume that high-speed 100 Gb/s network interfaces, directly attached to the VM using SR-IOV, would eliminate any chance of network-related bottlenecks. In reality, that assumption didn’t hold up.</p>
<p>Engaging early with Azure was essential, as they provided deeper visibility into the underlying infrastructure and worked with us to tune low-level, performance-critical settings. Combined with thorough load and scale testing and robust observability using tools like Elastic Observability, this collaboration helped us detect and rectify the issue early in order to deliver a consistent, reliable, and high-performing experience for our users.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/debugging-aks-packet-loss.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to deploy a Hello World web app with Elastic Observability on AWS App Runner]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-app-observability-aws-app-runner</link>
            <guid isPermaLink="false">deploy-app-observability-aws-app-runner</guid>
            <pubDate>Mon, 02 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on AWS App Runner.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability is the premier tool to provide visibility into web apps running in your environment. AWS App Runner is the serverless platform of choice to run your web apps that need to scale up and down massively to meet demand or minimize costs. Elastic Observability combined with AWS App Runner is the perfect solution for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>, in a way that’s straightforward to implement and manage.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to App Runner and then walk you through instrumenting it so that the application’s operations can be observed with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-1-create-deployment.png" alt="1 create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic® deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page, which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the Kibana® row of links, click on <strong>Open</strong> to open <strong>Kibana</strong> for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-2-my-deployment.png" alt="2 my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-3-apm.png" alt="3 apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-4-apm-agents.png" alt="4 apm agents" /></p>
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our AWS project for deploying apps to App Runner.</p>
<h2>AWS App Runner setup</h2>
<p>To start using AWS App Runner, you need an AWS account. If you’re a brand new user, go to <a href="https://aws.amazon.com">aws.amazon.com</a> to sign up for a new account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-5-start-building.png" alt="5 start building on aws today" /></p>
<h2>Set up AWS CloudShell</h2>
<p>We’ll create a Python Hello World app image and push it to AWS ECR using AWS CloudShell.</p>
<p>We’re going to use Docker to build the sample app image. Perform the following five steps to set up Docker within CloudShell.</p>
<ol>
<li>Open <a href="https://console.aws.amazon.com/cloudshell/">AWS CloudShell</a>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-6-welcome-to-aws-cloudshell.png" alt="6 welcome to aws cloudshell" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-7-aws-cloudshell.png" alt="7 aws cloudshell" /></p>
<ol start="2">
<li>Run the following two commands to install Docker in CloudShell:</li>
</ol>
<pre><code class="language-bash">sudo yum update -y
sudo amazon-linux-extras install docker
</code></pre>
<ol start="3">
<li>Start Docker by running the command:</li>
</ol>
<pre><code class="language-bash">sudo dockerd
</code></pre>
<ol start="4">
<li>With Docker running, open a new tab in CloudShell by clicking the <strong>Actions</strong> dropdown menu and selecting <strong>New tab</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-8-aws-cloudshell-with-code.png" alt="8 aws cloudshell with code" /></p>
<ol start="5">
<li>Authenticate Docker to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">aws ecr get-login-password --region us-east-2 | sudo docker login --username AWS --password-stdin &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com
</code></pre>
<h2>Build the Hello World web app image and push it to AWS ECR</h2>
<p>We’ll be using <a href="https://aws.amazon.com/ecr/">AWS ECR</a>, Amazon’s fully managed container registry for storing and deploying application images. To build and push the Hello World app image to AWS ECR, we’ll perform the following six steps in <a href="https://console.aws.amazon.com/cloudshell/">AWS CloudShell</a>:</p>
<ol>
<li>Run the command below in CloudShell to create a repository in AWS ECR.</li>
</ol>
<pre><code class="language-bash">aws ecr create-repository \
    --repository-name elastic-helloworld/web \
    --image-scanning-configuration scanOnPush=true \
    --region us-east-2
</code></pre>
<p><strong>“elastic-helloworld”</strong> will be the application’s name and <strong>“web”</strong> will be the service name.</p>
<ol start="2">
<li>In the newly created tab within CloudShell, clone a <a href="https://github.com/elastic/observability-examples/tree/main/aws/app-runner/helloworld">Python Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="3">
<li>Change directory to the location of the Hello World web app code by running the following command:</li>
</ol>
<pre><code class="language-bash">cd observability-examples/aws/app-runner/helloworld
</code></pre>
<ol start="4">
<li>Build the Hello World sample app from the application’s directory. Run the following Docker command in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker build -t elastic-helloworld/web .
</code></pre>
<ol start="5">
<li>Tag the application image. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker tag elastic-helloworld/web:latest &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<ol start="6">
<li>Push the application image to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the command below, and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker push &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<h2>Deploy a Hello World web app to AWS App Runner</h2>
<p>We’ll deploy the Python Hello World app to App Runner using the AWS App Runner console.</p>
<ol>
<li>Open the <a href="https://console.aws.amazon.com/apprunner/">App Runner console</a> and click the <strong>Create an App Runner service</strong> button.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-9-aws-app-runner.png" alt="9 aws app runner" /></p>
<ol start="2">
<li>On the Source and deployment page, set the following deployment details:</li>
</ol>
<ul>
<li>In the Source section, for Repository type, choose <strong>Container registry</strong>.</li>
<li>For Provider, choose <strong>Amazon ECR</strong>.</li>
<li>For Container image URI, choose <strong>Browse</strong> to select the Hello World application image that we previously pushed to AWS ECR.
<ul>
<li>In the Select Amazon ECR container image dialog box, for Image repository, select the <strong>“elastic-helloworld/web”</strong> repository.</li>
<li>For Image tag, select <strong>“latest”</strong> and then choose <strong>Continue</strong>.</li>
</ul>
</li>
<li>In the Deployment settings section, choose <strong>Automatic</strong>.</li>
<li>For ECR access role, choose <strong>Create new service role.</strong></li>
<li>Click <strong>Next</strong>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-10-source-and-deployment.png" alt="10 source and deployment" /></p>
<ol start="3">
<li>On the Configure service page, in the Service settings section, enter the service name <strong>“helloworld-app”</strong>. Leave all the other settings as they are and click <strong>Next</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-11-configure-service.png" alt="11 configure service" /></p>
<ol start="4">
<li>On the Review and create page, click <strong>Create &amp; deploy</strong>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-12-review-and-create.png" alt="12 review and create" /></p>
<p>After a few minutes, the Hello World app will be deployed to App Runner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-13-helloworld-app.png" alt="13 hello world app green text" /></p>
<ol start="5">
<li>Click the <strong>Default domain</strong> URL to view the Hello World app running in App Runner.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-14-hello-world.png" alt="14 hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in App Runner, we’re now ready to add the minimal code necessary to start monitoring the app. To enable observability for the Hello World app in Elastic Cloud, we’ll perform the following five steps in <a href="https://console.aws.amazon.com/cloudshell">AWS CloudShell</a>:</p>
<ol>
<li>Edit the Dockerfile file to add the following OpenTelemetry environment variables along with the commands to install and run the Elastic APM agent. Use the “nano” text editor by typing “nano Dockerfile”. Be sure to replace the &lt;ELASTIC_APM_SERVER_URL&gt; text and the &lt;ELASTIC_APM_SECRET_TOKEN&gt; text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step. The updated Dockerfile should look something like this:</li>
</ol>
<pre><code class="language-dockerfile">FROM python:3.9-slim as base

# get packages
COPY requirements.txt .
RUN pip install -r requirements.txt

WORKDIR /app

# install opentelemetry packages
RUN pip install opentelemetry-distro opentelemetry-exporter-otlp
RUN opentelemetry-bootstrap -a install

ENV OTEL_EXPORTER_OTLP_ENDPOINT='&lt;ELASTIC_APM_SERVER_URL&gt;'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20&lt;ELASTIC_APM_SECRET_TOKEN&gt;'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp

COPY . .
ENV FLASK_APP=helloworld
ENV FLASK_RUN_HOST=0.0.0.0
ENV FLASK_RUN_PORT=8080
EXPOSE 8080
ENTRYPOINT [ &quot;opentelemetry-instrument&quot;, &quot;flask&quot;, &quot;run&quot; ]
</code></pre>
<p>Note: You can close the nano text editor and save the file by typing “Ctrl + x”. Press the “y” key and then the “Enter” key to save the changes.</p>
<ol start="2">
<li>Edit the helloworld.py file to add observability traces. In CloudShell, type “nano helloworld.py” to edit the file.</li>
</ol>
<ul>
<li>After the import statements at the top of the file, add the code required to initialize the OpenTelemetry tracer:</li>
</ul>
<pre><code class="language-python">from opentelemetry import trace
tracer = trace.get_tracer(&quot;hello-world&quot;)
</code></pre>
<ul>
<li>Replace the “Hello World!” output code…</li>
</ul>
<pre><code class="language-python">return &quot;&lt;h1&gt;Hello World!&lt;/h1&gt;&quot;
</code></pre>
<ul>
<li>… with the Hello Elastic Observability code block.</li>
</ul>
<pre><code class="language-python">return '''
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
Hello Elastic Observability - AWS App Runner - Python
&lt;/h1&gt;
&lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
&lt;/div&gt;
'''
</code></pre>
<ul>
<li>Then add a “hi” trace before the Hello Elastic Observability code block along with an additional “@app.after_request” method placed afterward to implement a “bye” trace.</li>
</ul>
<pre><code class="language-python">@app.route(&quot;/&quot;)
def helloworld():
	with tracer.start_as_current_span(&quot;hi&quot;) as span:
  	  logging.info(&quot;hello&quot;)
  	  return '''
   	 &lt;div style=&quot;text-align: center;&quot;&gt;
   	 &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   	 Hello Elastic Observability - AWS App Runner - Python
   	 &lt;/h1&gt;
   	 &lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
   	 &lt;/div&gt;
   	 '''

@app.after_request
def after_request(response):
	with tracer.start_as_current_span(&quot;bye&quot;):
  	  logging.info(&quot;goodbye&quot;)
  	  return response
</code></pre>
<p>The completed helloworld.py file should look something like this:</p>
<pre><code class="language-python">import logging
from flask import Flask

from opentelemetry import trace
tracer = trace.get_tracer(&quot;hello-world&quot;)

app = Flask(__name__)

@app.route(&quot;/&quot;)
def helloworld():
    with tracer.start_as_current_span(&quot;hi&quot;) as span:
   	 logging.info(&quot;hello&quot;)
   	 return '''
    	&lt;div style=&quot;text-align: center;&quot;&gt;
    	&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    	Hello Elastic Observability - AWS App Runner - Python
    	&lt;/h1&gt;
    	&lt;img src=&quot;https://elastic-helloworld.s3.us-east-2.amazonaws.com/elastic-logo.png&quot;&gt;
    	&lt;/div&gt;
    	'''

@app.after_request
def after_request(response):
    with tracer.start_as_current_span(&quot;bye&quot;):
   	 logging.info(&quot;goodbye&quot;)
   	 return response
</code></pre>
<p>Note: You can close the nano text editor and save the file by typing “Ctrl + x”. Press the “y” key and then the “Enter” key to save the changes.</p>
<ol start="3">
<li>Rebuild the updated Hello World sample app using Docker from within the application’s directory. Run the following command in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker build -t elastic-helloworld/web .
</code></pre>
<ol start="4">
<li>Tag the application image using Docker. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker tag elastic-helloworld/web:latest &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<ol start="5">
<li>Push the updated application image to ECR. Replace &lt;account_id&gt; with your AWS Account ID in the Docker command below and then run it in CloudShell.</li>
</ol>
<pre><code class="language-bash">sudo docker push &lt;account_id&gt;.dkr.ecr.us-east-2.amazonaws.com/elastic-helloworld/web:latest
</code></pre>
<p>Pushing the image to ECR will automatically deploy the new version of the Hello World app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-15-green-banner-successfully.png" alt="15 green banner successful deployment" /></p>
<p>Open the <a href="http://console.aws.amazon.com/apprunner">App Runner</a> console. After a few minutes, the Hello World app will be deployed to App Runner. Click the <strong>Default domain</strong> URL to view the updated Hello World app running in App Runner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-16-elastic-logo-text-top.png" alt="16 elastic" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>“/”</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the Trace Sample section to see the <strong>“/”</strong>, <strong>“hi”</strong>, and <strong>“bye”</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/elastic-blog-17-trace-sample.png" alt="17 trace sample" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the complete process of deploying a web app to AWS App Runner that is instrumented with Elastic Observability. The end result is a web app that will scale up and down with usage, combined with the observability tools to monitor the web app as it serves one user or millions of users.</p>
<p>Now that you’ve seen how to deploy a serverless web app instrumented with observability, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about how to implement a complete observability solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/aws">Getting started with Elastic on AWS</a> for more examples of how you can drive the data insights you need by combining AWS’s cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-aws-app-runner/library-branding-elastic-observability-white-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Deploying Elastic Agent with Confluent Cloud's Elasticsearch Connector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</link>
            <guid isPermaLink="false">deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</guid>
            <pubDate>Wed, 22 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Confluent Cloud users can now use the updated Elasticsearch Sink Connector with Elastic Agent and Elastic Integrations for a fully-managed and highly scalable data ingest architecture.]]></description>
            <content:encoded><![CDATA[<p>Elastic and Confluent are key technology partners and we're pleased to announce new investments in that partnership. Built by the original creators of Apache Kafka®, Confluent's data streaming platform is a key component of many Enterprise ingest architectures, and it ensures that customers can guarantee delivery of critical Observability and Security data into their Elasticsearch clusters. Together, we've been working on key improvements to how our products fit together. With <a href="https://www.elastic.co/blog/elastic-agent-output-kafka-data-collection-streaming">Elastic Agent's new Kafka output</a> and Confluent's newly improved <a href="https://www.confluent.io/hub/confluentinc/kafka-connect-elasticsearch/">Elasticsearch Sink Connectors</a> it's never been easier to seamlessly collect data from the edge, stream it through Kafka, and into an Elasticsearch cluster.</p>
<p>In this blog, we examine a simple way to integrate Elastic Agent with Confluent Cloud's Kafka offering to reduce the operational burden of ingesting business-critical data.</p>
<h2>Benefits of Elastic Agent and Confluent Cloud</h2>
<p>When combined, Elastic Agent and Confluent Cloud's updated Elasticsearch Sink connector provide numerous advantages for organizations of all sizes, offering the flexibility to handle any type of data ingest workload in an efficient and resilient manner.</p>
<h3>Fully Managed</h3>
<p>When combined, Elastic Cloud Serverless and Confluent Cloud provide users with a fully managed service. This makes it effortless to deploy and ingest nearly unlimited data volumes without having to worry about nodes, clusters, or scaling.</p>
<h3>Full Elastic Integrations Support</h3>
<p>Sending data through Kafka is fully supported with any of the 300+ Elastic Integrations. In this blog post, we outline how to set up the connection between the two platforms. This ensures you can benefit from our investments in built-in alerts, SLOs, AI Assistants, and more.</p>
<h3>Decoupled Architecture</h3>
<p>Kafka acts as a resilient buffer between data sources (such as Elastic Agent and Logstash) and Elasticsearch, decoupling data producers from consumers. This can significantly reduce total cost of ownership by enabling you to size your Elasticsearch cluster based on typical data ingest volume, not maximum ingest volume. It also ensures system resilience during spikes in data volume.</p>
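<p>As a back-of-the-envelope illustration, with purely hypothetical ingest numbers, sizing for the typical rather than the peak rate is what drives the savings, since the Kafka topic absorbs the bursts:</p>
<pre><code class="language-python"># Hypothetical ingest profile: a steady 10 MB/s with short-lived 50 MB/s spikes.
# Without a buffer, Elasticsearch must be provisioned for the peak; with Kafka
# in between, it only needs headroom above the typical rate.
typical_mbps = 10.0   # sustained ingest rate (MB/s)
peak_mbps = 50.0      # spike rate the buffer must absorb (MB/s)
headroom = 1.2        # 20% safety margin over the typical rate

without_buffer = peak_mbps               # cluster sized for the worst case
with_buffer = typical_mbps * headroom    # cluster sized for the typical load
savings = 1 - with_buffer / without_buffer

print(f'Capacity without Kafka: {without_buffer} MB/s')
print(f'Capacity with Kafka buffer: {with_buffer} MB/s')
print(f'Capacity reduction: {savings:.0%}')  # prints: Capacity reduction: 76%
</code></pre>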
<h3>Ultimate control over your data</h3>
<p>With our new Output per Integration capability, customers can now send different data to different destinations using the same agent. Customers can easily send security logs directly to Confluent Cloud/Kafka, which can provide delivery guarantees, while sending less critical application logs and system metrics directly to Elasticsearch.</p>
<h2>Deploying the reference architecture</h2>
<p>In the following sections, we will walk you through one of the ways Confluent Kafka can be integrated with Elastic Agent and Elasticsearch using Confluent Cloud's Elasticsearch Sink Connector. As with any streaming and data collection technology, there are many ways a pipeline can be configured depending on the particular use case. This blog post will focus on a simple architecture that can be used as a starting point for more complex deployments.</p>
<p>Some of the highlights of this architecture are:</p>
<ul>
<li>Dynamic Kafka topic selection at Elastic Agents</li>
<li>Elasticsearch Sink Connectors for fully managed transfer from Confluent Kafka to Elasticsearch</li>
<li>Processing data leveraging Elastic's 300+ Integrations</li>
</ul>
<h3>Prerequisites</h3>
<p>Before getting started, ensure you have a Kafka cluster deployed in Confluent Cloud, an Elasticsearch cluster or project deployed in Elastic Cloud, and an installed and enrolled Elastic Agent.</p>
<h3>Configure Confluent Cloud Kafka Cluster for Elastic Agent</h3>
<p>Navigate to the Kafka cluster in Confluent Cloud, and select <code>Cluster Settings</code>. Locate and note the <code>Bootstrap Server</code> address; we will need this value later when we create the Kafka Output in Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/confluent-cluster-settings.png" alt="Confluent Cluster Settings" /></p>
<p>Navigate to <code>Topics</code> in the left-hand navigation menu and create two topics:</p>
<ol>
<li>A topic named <code>logs</code></li>
<li>A topic named <code>metrics</code></li>
</ol>
<p>Next, navigate to <code>API Keys</code> in the left-hand navigation menu:</p>
<ol>
<li>Click <code>+ Add API Key</code></li>
<li>Select the <code>Service Account</code> API key type</li>
<li>Provide a meaningful name for this API Key</li>
<li>Grant the key write permission to the <code>metrics</code> and <code>logs</code> topics</li>
<li>Create the key</li>
</ol>
<p>Note the provided Key and Secret; we will need them later when we configure the Kafka Output in Fleet.</p>
<h3>Configure Elasticsearch and Elastic Agent</h3>
<p>In this section, we will configure the Elastic Agent to send data to Confluent Cloud's Kafka cluster and we will configure Elasticsearch so it can receive data from the Confluent Cloud Elasticsearch Sink Connector.</p>
<h4>Configure Elastic Agent to send data to Confluent Cloud</h4>
<p>Elastic Fleet simplifies sending data to Kafka and Confluent Cloud. With Elastic Agent, a Kafka &quot;output&quot; can be easily attached to all data coming from an agent or it can be applied only to data coming from a specific data source.</p>
<p>Find <code>Fleet</code> in the left-hand navigation and click the <code>Settings</code> tab. On the <code>Settings</code> tab, find the <code>Outputs</code> section and click <code>Add Output</code>.</p>
<p>Perform the following steps to configure the new Kafka output:</p>
<ol>
<li>Provide a <code>Name</code> for the output</li>
<li>Set the <code>Type</code> to <code>Kafka</code></li>
<li>Populate the <code>Hosts</code> field with the <code>Bootstrap Server</code> address we noted earlier.</li>
<li>Under <code>Authentication</code>, populate the <code>Username</code> with the <code>API Key</code> and the <code>Password</code> with the <code>Secret</code> we noted earlier <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration.png" alt="Elastic Fleet Output" /></li>
<li>Under <code>Topics</code>, select <code>Dynamic Topic</code> and set <code>Topic from field</code> to <code>data_stream.type</code> <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration-dynamic-topic.png" alt="Kafka Output Dynamic Topic Configuration" /></li>
<li>Click <code>Save and apply settings</code></li>
</ol>
<p>Next, we will navigate to the <code>Agent Policies</code> tab in Fleet and click to edit the Agent Policy that we want to attach the Kafka output to. With the Agent Policy open, click the <code>Settings</code> tab and change <code>Output for integrations</code> and <code>Output for agent monitoring</code> to the Kafka output we just created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-agent-policy-kafka.png" alt="Agent Policy Output Configuration" /></p>
<p><strong>Selecting an Output per Elastic Integration</strong>: To set the Kafka output to be used for specific data sources, see the <a href="https://www.elastic.co/guide/en/fleet/master/integration-level-outputs.html">integration-level outputs documentation</a>.</p>
<p><strong>A note about Topic Selection</strong>: The <code>data_stream.type</code> field is a reserved field which Elastic Agent automatically sets to <code>logs</code> if the data we're sending is a log and <code>metrics</code> if the data we're sending is a metric. Enabling Dynamic Topic selection using <code>data_stream.type</code>, will cause Elastic Agent to automatically route metrics to a <code>metrics</code> topic and logs to a <code>logs</code> topic. For information on topic selection, see the Kafka Output's <a href="https://www.elastic.co/guide/en/fleet/master/kafka-output-settings.html#_topics_settings">Topics settings</a> documentation.</p>
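<p>The routing rule itself is simple. As a rough sketch (the event dicts below are illustrative, not actual Agent output), the selection logic amounts to reading a single field:</p>
<pre><code class="language-python"># Sketch of dynamic topic selection keyed on data_stream.type.
def select_topic(event):
    '''Return the Kafka topic name for an event, based on data_stream.type.'''
    return event['data_stream']['type']  # 'logs' or 'metrics'

log_event = {
    'message': 'disk full',
    'data_stream': {'type': 'logs', 'dataset': 'system.syslog', 'namespace': 'default'},
}
metric_event = {
    'system': {'cpu': {'pct': 0.42}},
    'data_stream': {'type': 'metrics', 'dataset': 'system.cpu', 'namespace': 'default'},
}

print(select_topic(log_event))     # prints: logs
print(select_topic(metric_event))  # prints: metrics
</code></pre>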
<h4>Configuring a publishing endpoint in Elasticsearch</h4>
<p>Next, we will set up two publishing endpoints (data streams) for the Confluent Cloud Sink Connector to use when publishing documents to Elasticsearch:</p>
<ol>
<li>We will create a data stream <code>logs-kafka.reroute-default</code> for handling <strong>logs</strong></li>
<li>We will create a data stream <code>metrics-kafka.reroute-default</code> for handling <strong>metrics</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-overview.png" alt="Sink Connector Overview" /></p>
<p>If we left the data in those data streams as-is, it would be available but unparsed and lacking vital enrichment. So we will also create two index templates and two ingest pipelines to make sure the data is processed by our Elastic Integrations.</p>
<h4>Creating the Elasticsearch Index Templates and Ingest Pipelines</h4>
<p>The following steps use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Dev Tools in Kibana</a>, but all of these steps can be completed via the REST API or using the relevant user interfaces in Stack Management.</p>
<p>First, we will create the Index Template and Ingest Pipeline for handling <strong>logs</strong>:</p>
<pre><code class="language-json">PUT _index_template/logs-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;logs-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/logs-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>Next, we will create the Index Template and Ingest Pipeline for handling <strong>metrics</strong>:</p>
<pre><code class="language-json">PUT _index_template/metrics-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;metrics-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;metrics-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/metrics-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p><strong>A note about rerouting</strong>: For a practical example of how this works, a document related to a Linux network metric would first land in <code>metrics-kafka.reroute-default</code>. This Ingest Pipeline would inspect the document and find <code>data_stream.dataset</code> set to <code>system.network</code> and <code>data_stream.namespace</code> set to <code>default</code>. It would use these values to reroute the document from <code>metrics-kafka.reroute-default</code> to <code>metrics-system.network-default</code>, where it would be processed by the <code>system</code> integration.</p>
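<p>That renaming can be sketched in a few lines (a simplified Python model of the pipeline's behavior, not the reroute processor implementation):</p>
<pre><code class="language-python"># Simplified model of the reroute pipeline: the target data stream name is
# rebuilt as type-dataset-namespace from fields on the document itself.
def reroute_target(doc, stream_type):
    ds = doc['data_stream']
    return '{0}-{1}-{2}'.format(stream_type, ds['dataset'], ds['namespace'])

doc = {
    'system': {'network': {'in': {'bytes': 1024}}},
    'data_stream': {'dataset': 'system.network', 'namespace': 'default'},
}

# Arrives in metrics-kafka.reroute-default, leaves for the integration's stream:
print(reroute_target(doc, 'metrics'))  # prints: metrics-system.network-default
</code></pre>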
<h3>Configure the Confluent Cloud Elasticsearch Sink Connector</h3>
<p>Now it's time to configure the Confluent Cloud Elasticsearch Sink Connector. We will perform the following steps twice and create two separate connectors, one connector for <strong>logs</strong> and one connector for <strong>metrics</strong>. Where the required settings differ, we will highlight the correct values.</p>
<p>Navigate to your Kafka cluster in Confluent Cloud and select Connectors from the left-hand navigation menu. On the Connectors page, select <code>Elasticsearch Service Sink</code> from the catalog of available connectors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-install.png" alt="Sink Connector Setup" /></p>
<p>Confluent Cloud presents a simplified workflow for the user to configure a connector. Here we will walk through each step of the process:</p>
<h4>Step 1: Topic Selection</h4>
<p>First, we will select the topic that the connector will consume data from based on which connector we are deploying:</p>
<ul>
<li>When deploying the Elasticsearch Sink Connector for <strong>logs</strong>, select the <code>logs</code> topic.</li>
<li>When deploying the Elasticsearch Sink Connector for <strong>metrics</strong>, select the <code>metrics</code> topic.</li>
</ul>
<h4>Step 2: Kafka Credentials</h4>
<p>Choose <code>KAFKA_API_KEY</code> as the cluster authentication mode. Provide the <code>API Key</code> and <code>Secret</code> noted earlier when we gathered the required Confluent Cloud cluster information. <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-credentials.png" alt="Sink Connector Credentials" /></p>
<h4>Step 3: Authentication</h4>
<p>Provide the Elasticsearch Endpoint address of our Elasticsearch cluster as the <code>Connection URI</code>. The <code>Connection user</code> and <code>Connection password</code> are the authentication information for the account in Elasticsearch that will be used by the Elasticsearch Sink Connector to write data to Elasticsearch.</p>
<h4>Step 4: Configuration</h4>
<p>In this step we will keep the <code>Input Kafka record value format</code> set to <code>JSON</code>. Next, expand <code>Advanced Configuration</code>.</p>
<ol>
<li>We will set <code>Data Stream Dataset</code> to <code>kafka.reroute</code></li>
<li>We will set <code>Data Stream Type</code> based on the connector we are deploying:
<ul>
<li>When deploying the Elasticsearch Sink Connector for logs, we will set <code>Data Stream Type</code> to <code>logs</code></li>
<li>When deploying the Elasticsearch Sink Connector for metrics, we will set <code>Data Stream Type</code> to <code>metrics</code></li>
</ul>
</li>
<li>The correct values for other settings will depend on the specific environment.</li>
</ol>
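<p>If you prefer automation over the UI wizard, the same settings can be captured in a connector configuration file. The sketch below is illustrative only; property names and required fields vary by connector version, so verify against your connector's documentation before using it. For the <strong>logs</strong> connector, it would look roughly like:</p>
<pre><code class="language-json">{
  &quot;name&quot;: &quot;elasticsearch-sink-logs&quot;,
  &quot;topics&quot;: &quot;logs&quot;,
  &quot;input.data.format&quot;: &quot;JSON&quot;,
  &quot;data.stream.type&quot;: &quot;LOGS&quot;,
  &quot;data.stream.dataset&quot;: &quot;kafka.reroute&quot;
}
</code></pre>
<p>The metrics connector would differ only in its <code>topics</code> and <code>data.stream.type</code> values.</p>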
<h4>Step 5: Sizing</h4>
<p>In this step, notice that Confluent Cloud provides a recommended minimum number of tasks for our deployment. Following the recommendation here is a good starting place for most deployments.</p>
<h4>Step 6: Review and Launch</h4>
<p>Review the <code>Connector configuration</code> and <code>Connector pricing</code> sections and if everything looks good, it's time to click <code>continue</code> and launch the connector! The connector may report as provisioning but will soon start consuming data from the Kafka topic and writing it to the Elasticsearch cluster.</p>
<p>You can now navigate to Discover in Kibana and find your logs flowing into Elasticsearch! Also check out the real-time metrics that Confluent Cloud provides for your new Elasticsearch Sink Connector deployments.</p>
<p>If you have only deployed the first <code>logs</code> sink connector, you can now repeat the steps above to deploy the second <code>metrics</code> sink connector.</p>
<h2>Enjoy your fully managed data ingest architecture</h2>
<p>If you followed the steps above, congratulations. You have successfully:</p>
<ol>
<li>Configured Elastic Agent to send logs and metrics to dedicated topics in Kafka</li>
<li>Created publishing endpoints (data streams) in Elasticsearch dedicated to handling data from the Elasticsearch Sink Connector</li>
<li>Configured managed Elasticsearch Sink connectors to consume data from multiple topics and publish that data to Elasticsearch</li>
</ol>
<p>Next you should enable additional integrations, deploy more Elastic Agents, explore your data in Kibana, and enjoy the benefits of a fully managed data ingest architecture with Elastic Serverless and Confluent Cloud!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Log Processing UX Design in Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/designing-log-processing-ux-for-streams</link>
            <guid isPermaLink="false">designing-log-processing-ux-for-streams</guid>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore log processing in Elastic Streams and the design decisions behind the Processing UX that make log data more accessible, consistent, and actionable.]]></description>
            <content:encoded><![CDATA[<p>This post is written from the perspective of the Elastic Observability design team. It’s aimed at developers and SREs who work with logs and ingest pipelines, and it explains how design decisions shaped the Processing experience in Streams.</p>
<h2>The Design Problem in Log Processing</h2>
<p>We rarely talk about how projects actually begin. </p>
<p>How do you design something that doesn't fully exist yet?</p>
<p>How do you align AI capabilities, system constraints, and real user pains into one coherent experience?</p>
<p><a href="https://www.elastic.co/elasticsearch/streams">Streams</a> gave us that challenge.</p>
<p>Logs are one of the richest signals in observability - but also one of the messiest. Streams is an agentic AI-powered solution that rethinks how teams work with logs to enable fast incident investigation and resolution. </p>
<p><em>Streams uses AI to partition and parse raw logs, extract relevant fields, reduce schema management overhead, and surface significant events like critical errors and anomalies.</em></p>
<p>This led us to make logs investigation-ready from the start, rather than forcing the Site Reliability Engineer to fight their data. But in order to enable such an experience, we had to carefully rethink a core concept and step in the process: Processing.</p>
<h2>Designing Processing UX in Elastic Streams</h2>
<p>Logs are powerful, but only if they are structured correctly. Today, a user onboarding logs via Elastic Agent with a custom integration would extract something as simple as an IP field by:</p>
<ul>
<li>Writing GROK patterns</li>
<li>Creating pipelines</li>
<li>Managing mappings</li>
<li>Testing transformations</li>
<li>Iterating repeatedly</li>
</ul>
<p>What sounds simple requires 20+ steps — and deep expertise most teams shouldn’t need. Our goal was clear: make this dramatically simpler.</p>
<p>Our early design question was:</p>
<p><em>“Can we reduce this experience to 2 meaningful steps instead of 20 technical ones?”</em></p>
<p>That question shaped how we approached the Stream UX.</p>
<h3>The Foundation</h3>
<p>Before we jumped into designing the UI in <a href="https://www.elastic.co/kibana">Kibana</a>, we defined a core mental model. </p>
<p>A <a href="https://www.elastic.co/elasticsearch/streams">Stream</a> is a collection of documents stored together that share:</p>
<ul>
<li>Retention</li>
<li>Configuration</li>
<li>Mappings</li>
<li>Processing rules</li>
<li>Lifecycle behaviour</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/1.png" alt="stream-architecture" /></p>
<p>The key design principle:</p>
<p><em>“A Stream should contain data that behaves consistently.”</em></p>
<h3>Why Does Data Consistency Matter?</h3>
<p>We started with an example to test our thinking. Take Nginx access and error logs.</p>
<p>Access logs describe request/response events:</p>
<p><code>192.168.1.10 - - [16/Feb/2026:12:32:10 +0000] &quot;GET /api/orders/123 HTTP/1.1&quot; 200 532 &quot;-&quot; &quot;Mozilla/5.0&quot;</code></p>
<p>Error logs describe diagnostic events:</p>
<p><code>2026/02/16 12:32:10 [error] 2719#2719: *342 connect() failed (111: Connection refused) while connecting to upstream…</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/2.png" alt="log-example" /></p>
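<p>To make the manual effort concrete: parsing just the access log line above traditionally means hand-writing a GROK pattern along these lines (a sketch; the field names are chosen for illustration):</p>
<pre><code>%{IPORHOST:client.ip} - %{DATA:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{DATA:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int} &quot;%{DATA:http.request.referrer}&quot; &quot;%{DATA:user_agent.original}&quot;
</code></pre>
<p>Multiply that by every log format in an environment, plus the mappings and tests around it, and the 20-step estimate starts to look conservative.</p>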
<p>If both live in the same Stream, that can cause:</p>
<ul>
<li>Processing logic conflicts</li>
<li>Field divergence</li>
<li>Mapping conflicts</li>
<li>Fundamentally harder investigations</li>
</ul>
<p>That insight clarified something critical: </p>
<p><strong>“<em>Processing isn’t just about extracting fields. It’s about protecting consistency.”</em></strong></p>
<h3>Making Complexity Manageable</h3>
<p>The ingest ecosystem isn’t small, simple, or hypothetical. Real pipelines use dozens of processors — from common ones like <code>rename</code>, <code>set</code>, <code>convert</code>, and <code>append</code>, to niche types like <code>urldecode</code> and <code>network_direction</code>.</p>
<p>The UI had to support both high-frequency actions and long-tail edge cases without losing structure. Currently Elasticsearch supports over <a href="https://www.elastic.co/docs/reference/enrich-processor">40 different ingest processors</a>. We had to make sure our interface could handle the different types.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/3.png" alt="card-sample" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/4.png" alt="processor-panel" /></p>
<p>We introduced a clear, nested structure for pipeline steps. Users could create, reorder, edit, or remove individual steps or grouped ones with confidence. The <a href="https://eui.elastic.co/docs/patterns/nested-drag-and-drop/">nested drag and drop</a> capability was also added as a pattern in our EUI library.</p>
<p>This gave us the context and foundation to work on integrating those concepts into a model that would be definitive for everything in Streams.</p>
<h3>Page Archetypes</h3>
<p>Processing is powerful - and risky. Changing a parsing condition or step might affect:</p>
<ul>
<li>Field availability</li>
<li>Search behaviour</li>
<li>Alerts</li>
<li>AI Insights</li>
<li>Investigations</li>
</ul>
<p>So we asked ourselves: how do we make something this powerful and important safe for the user? The answer led to a core page archetype:</p>
<p><strong>Create &gt; Preview &gt; Confirm</strong></p>
<p>This wasn’t a UI pattern added later. It emerged directly from our concept work and understanding what users would have to deal with.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/5.png" alt="create-preview-confirm" /></p>
<p>To support this archetype and core idea, we also introduced a split-screen structure.</p>
<p><strong>Left: Build</strong></p>
<p>This is where users would:</p>
<ul>
<li>Add processing steps</li>
<li>Define conditions</li>
<li>Apply rules</li>
<li>Leverage AI suggestions both as a whole pipeline creation or individual steps like a GROK processor</li>
</ul>
<p>It remained focused, intentional and structured.</p>
<p><strong>Right: Preview</strong></p>
<p>This is where users would:</p>
<ul>
<li>See real-life log samples</li>
<li>See extracted fields in context</li>
<li>Get immediate feedback on changes, with insights about the percentage of matched and unmatched documents</li>
<li>Open an optional drilldown side panel on the right</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/6.png" alt="split-screen-application" /></p>
<p>The preview panel became the anchor of confidence. This was not about visual symmetry, but about reinforcing experimentation, giving control over errors, and reducing mistakes. Knowing that users might want to switch their focus from interaction to detailed preview, we made both panels resizable, unlocking more flexibility and control across use cases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/7.png" alt="stream-architecture" /></p>
<h3>AI Automation</h3>
<p>Streams is agentic and AI powered. That added another layer of complexity for the design, but also another opportunity to unlock even more power and insights from users' log data. </p>
<p>AI introduced a new tension: how do you accelerate processing without turning it into a black box?</p>
<p>We established a few guardrails:</p>
<ul>
<li>Clear, concise suggestions</li>
<li>Visible impact through matched document metrics</li>
<li>Inspectability</li>
<li>Alignment with the Create → Preview → Confirm model</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/8.png" alt="ai-in-split-screen-model" /></p>
<p>Processing UX became the bridge between automation and human in the loop. Log data is one of the most powerful investigation signals. Every design decision reinforced that belief.</p>
<h2>What We Learned</h2>
<p>Designing for the future does not start with screens. It starts with:</p>
<ul>
<li>Edge case testing</li>
<li>Clear mental models</li>
<li>Strong and guiding principles</li>
<li>Behavioral consistency</li>
<li>Scalable and stress-tested archetypes</li>
</ul>
<p>We know that for users to unlock insightful discoveries from their logs, they need to process and manage their data effectively. We knew we were shaping their entire observability foundation.</p>
<p>Processing is about trust, control, and scalable data management.</p>
<p>Trust enables investigation speed.</p>
<p>Investigation speed enables resilience.</p>
<h2>Learn more</h2>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a>, and try Elastic's Serverless offering, which lets you play with all of the Streams functionality.
Want to know more about Streams? Check out the links below:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams"><em>Retention management</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Check the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/11.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Developer's Guide to Easy Ops: Demystifying OpenTelemetry's Magic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/developers-guide-to-easy-ops</link>
            <guid isPermaLink="false">developers-guide-to-easy-ops</guid>
            <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A Go-based Developer's 101 Guide to Easy Ops with OpenTelemetry and Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>The Introduction: From Code to Dash, Demystified</h1>
<p>Observability for developers has lately been distilled into implementing auto-instrumentation, allowing you to instantly connect your code with the larger observability world. This way of utilizing an upstream SDK is certainly the simplest and most production-ready, and works efficiently with the <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">Elastic Cloud Managed OTLP Endpoint</a>.</p>
<p>But what if you could not only add powerful tracing to your Go service but also <em>truly</em> understand how the magic works, rather than just copy-pasting configuration files or a line of code? In the same way that you build your knowledge of software development systems, observability, modernized by OpenTelemetry (OTel) standardization, is a rich, broad system that is valuable to understand. Here is an in-depth technical breakdown of every piece of simple OTel instrumentation using the Elastic Distributions of OpenTelemetry (EDOT) and Golang, from the ground up.</p>
<p>Telemetry is the automated collection, transmission and analysis of data from your application, which can apply to any observable distributed system. This data can range from regular health check calls with your application to real-time information about user interactions, requests, and transactions. Using the example application repository <a href="https://github.com/sophia-solo/otel-go-demo">here</a>, we’ll build a strong observability foundation to start observing our applications with confidence.</p>
<h2>Understanding the OpenTelemetry Flow</h2>
<p>Below, you will see the basic flow of your data when implementing observability with OTel in your system. Before we dive in, let’s go over the key players you need in order to implement observability solutions with OTel:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/spans"><strong>Span</strong></a>: This is a single, timed unit of a distributed trace that can represent a specific operation, such as a database query or an HTTP handler.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/traces"><strong>Trace</strong></a>: This is a detailed record of a single request’s journey through your system, AKA a hierarchy of your spans.</p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/specs/otel/trace/api/#tracer"><strong>Tracer</strong></a>: This is the handle for generating spans. You will typically have one per instrumentation library, for example <code>myapp/http</code>.</p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/specs/otel/trace/api/#tracerprovider"><strong>Tracer Provider</strong></a>: This is the cornerstone of the SDK. It creates Tracer instances, and you configure it at application startup.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/reference/apm/agents/go/custom-instrumentation-propagation"><strong>Context Propagation</strong></a>: The mechanism for passing trace context between operations and services, maintaining the relationship between parent and child spans.</p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/deploy-manage/monitor/stack-monitoring/es-monitoring-exporters"><strong>Exporter</strong></a>: This is the part that is responsible for sending your telemetry data to a vendor backend, and you can decide if you are sending it to the OTel Collector, EDOT Collector or an OTLP Endpoint.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/otel-flow.png" alt="Go OpenTelemetry App Flow" /></p>
<h2>Installing the Magic, Instrumentation Style</h2>
<p>OpenTelemetry provides instrumentation libraries that handle much of the tracing complexity for you. These libraries wrap common frameworks and libraries (like <code>net/http/otelhttp</code>) and automatically capture telemetry without requiring you to manually create spans for every operation.</p>
<p>However, before you're able to send any telemetry, OTel needs to know <em>who</em> (which service) is sending that data.</p>
<p>A <a href="https://opentelemetry.io/docs/concepts/resources/">resource</a> represents the specific entity, in this case <code>&quot;simple-go-service&quot;</code>, that is producing your telemetry data. Its identity is recorded as resource attributes, which can include pod names, service names or instances, and deployment environments: basically <em>anything</em> important to identifying your resource. This resource is your service's identity card, attached along with its attributes to every span and metric the service emits. Once your trace arrives, these attributes can answer <em>&quot;what version was running?&quot;</em> or <em>&quot;which service is this from?&quot;</em></p>
<pre><code>func initOTel(ctx context.Context, endpoint string) (func(context.Context) error, error) {
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(&quot;simple-go-service&quot;),
			semconv.ServiceVersion(&quot;1.0.0&quot;),
		),
	)
	if err != nil {
		return nil, err
	}
</code></pre>
<p>In the code above, <code>resource.New()</code> constructs the &quot;identity card&quot; of our Go service. The attributes attached to it use semantic conventions (<code>semconv</code>): standardized names for common metadata fields. These <a href="https://opentelemetry.io/docs/concepts/semantic-conventions/">semantic conventions</a> make sure that every OTel-compatible observability backend knows their meaning.</p>
<p>Now that we've bootstrapped our application with the <code>initOTel</code> function, we can continue to configure everything else!</p>
<p>Let’s begin instrumenting this application by building all the app components that we will need to implement modern observability tools. Below is our instrumentation using <code>otelhttp</code>, which will handle span creation after calling the specified API routes. </p>
<pre><code>http.Handle(&quot;/hello&quot;, otelhttp.NewHandler(http.HandlerFunc(handleHello), &quot;hello&quot;))
http.Handle(&quot;/api/data&quot;, otelhttp.NewHandler(http.HandlerFunc(handleData), &quot;data&quot;))
http.HandleFunc(&quot;/health&quot;, handleHealth)

// Example of a tracer within our handleHello() function
tracer = tp.Tracer(&quot;simple-go-service&quot;)

ctx, span := tracer.Start(ctx, &quot;process-hello&quot;)
defer span.End()
</code></pre>
<p>The key insight here is that <code>otelhttp.NewHandler</code> handles all the span lifecycle management for HTTP requests. You don't need to manually call <code>tracer.Start()</code> or <code>span.End()</code> for basic HTTP tracing since the library does this for you.</p>
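<p>To demystify what that wrapping looks like, here is a stdlib-only sketch of the middleware pattern that <code>otelhttp.NewHandler</code> is built on: wrap a handler, run code before and after the inner call. That before/after window is exactly where the real library starts and ends the span; this sketch only records timing and is not the actual otelhttp implementation.</p>
<pre><code>package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timed wraps a handler and measures the time around the inner call.
// otelhttp.NewHandler follows the same shape, but starts a span before
// next.ServeHTTP and ends it (recording status, duration, etc.) after.
func timed(next http.Handler, name string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r) // the span would cover exactly this call
		fmt.Printf("%s took %v\n", name, time.Since(start))
	})
}

func main() {
	h := timed(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}), "hello")

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/hello", nil))
	fmt.Println(rec.Body.String())
}
</code></pre>
<p>The wrapper never touches the response body, which is why instrumenting a handler this way is safe to add to existing routes.</p>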
<p>At application startup, the SDK uses the tracer provider set up below to create Tracer instances. These instances create and manage the spans contained within traces.</p>
<pre><code>traceExporter, err := otlptracegrpc.New(ctx,
	otlptracegrpc.WithEndpoint(endpoint),
	otlptracegrpc.WithInsecure(),
)
if err != nil {
	return nil, err
}

tp := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(traceExporter),
	sdktrace.WithResource(res),
)
otel.SetTracerProvider(tp)
tracer = tp.Tracer(&quot;simple-go-service&quot;)
</code></pre>
<p>Within our <code>initOTel</code> function, we will set up one of our most important signals: logs. First, we initialize the <code>logExporter</code> that will send logs to our OTel Collector using the gRPC protocol. Then the <code>LoggerProvider</code> wraps the <code>logExporter</code> in a batch processor that groups log entries together before sending them to the exporter, attaching resource metadata about the service along the way. Lastly, we create a standard Go structured logger (<code>slog</code>) on top of the <code>LoggerProvider</code>; it automatically includes trace context (such as span IDs) and batches your log with other logs. These are sent to your observability backend through the exporter along with your metrics and traces.</p>
<pre><code>logExporter, err := otlploggrpc.New(ctx,
		otlploggrpc.WithEndpoint(endpoint),
		otlploggrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	lp := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(logExporter)),
		sdklog.WithResource(res),
	)
	logger = slog.New(otelslog.NewHandler(&quot;simple-go-service&quot;, otelslog.WithLoggerProvider(lp)))
</code></pre>
<p>Below you can see how you can view your logs through Kibana in the APM UI. These logs are also color-coded: warnings are in yellow, errors are in red, and regular logs are in green.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/log-viewer.png" alt="Viewing your logs in the APM UI" /></p>
<p><a href="https://www.elastic.co/docs/solutions/observability/apm/metrics">Metrics</a> are set up in the next part of our code. Metrics are telemetry signals that track quantitative data from your application, such as response times and request counts. The metric exporter is initialized to send metric data to our EDOT Collector, and then on to our observability backend, Elastic Observability in this case, using gRPC. The meter provider in the next portion periodically collects and exports our metrics data and measurements, just as the tracer provider creates tracers. The key difference between the two providers is that the meter provider works on a timer, while the tracer provider exports spans as they complete.</p>
<pre><code>metricExporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(endpoint),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	mp := metric.NewMeterProvider(
		metric.WithReader(metric.NewPeriodicReader(metricExporter)),
		metric.WithResource(res),
	)
	otel.SetMeterProvider(mp)

	meter := mp.Meter(&quot;simple-go-service&quot;)
	requestCounter, _ = meter.Int64Counter(&quot;http.requests&quot;)
	requestDuration, _ = meter.Float64Histogram(&quot;http.duration&quot;)
</code></pre>
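<p>To make that timer-based model concrete, here is a stdlib-only sketch (not the SDK's implementation) of how a counter instrument and a periodic reader divide the work: the instrument only accumulates in memory, and the reader drains the current value on each tick. Flushing is invoked manually here for determinism instead of on a real timer.</p>
<pre><code>package main

import (
	"fmt"
	"sync/atomic"
)

// requestCount accumulates measurements in memory, like an
// Int64Counter instrument does inside the SDK.
var requestCount atomic.Int64

// recordRequest is what an instrumented handler would do per request.
func recordRequest() { requestCount.Add(1) }

// flush is what the periodic reader does on each tick: read the
// current value and hand it to the exporter.
func flush(name string) string {
	return fmt.Sprintf("%s=%d", name, requestCount.Load())
}

func main() {
	for i := 0; i != 5; i++ {
		recordRequest() // one increment per simulated request
	}
	fmt.Println(flush("http.requests")) // http.requests=5
}
</code></pre>
<p>Spans, by contrast, are pushed to the batcher as each one ends, which is why traces appear as requests complete while metrics arrive at the reader's interval.</p>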
<p>In order to finish initializing OpenTelemetry, we set up our propagators for context propagation. The text map propagator automatically injects your service's trace ID and span ID into outbound HTTP requests to other services, following the <a href="https://www.w3.org/TR/trace-context/">W3C Trace Context</a> standard. In short, this maintains the parent-child relationship between spans.</p>
<pre><code>otel.SetTextMapPropagator(propagation.TraceContext{})

	return func(ctx context.Context) error {
		tp.Shutdown(ctx)
		mp.Shutdown(ctx)
		lp.Shutdown(ctx)
		return nil
	}, nil
</code></pre>
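<p>The injected <code>traceparent</code> header has a fixed W3C layout: a version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags, all hex-encoded and dash-separated. Here is a stdlib-only sketch that builds one by hand purely to illustrate the wire format; in practice the SDK's propagator constructs and parses this for you.</p>
<pre><code>package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newTraceparent builds a W3C traceparent header value by hand:
// version "00", a random 16-byte trace ID, a random 8-byte parent
// span ID, and the trace flags ("01" = sampled), all hex-encoded.
func newTraceparent() string {
	traceID := make([]byte, 16)
	spanID := make([]byte, 8)
	rand.Read(traceID) // crypto/rand; failures are not expected here
	rand.Read(spanID)
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(traceID), hex.EncodeToString(spanID))
}

func main() {
	// Always 55 characters: 2 + 1 + 32 + 1 + 16 + 1 + 2.
	fmt.Println(newTraceparent())
}
</code></pre>
<p>A downstream service extracts this header, adopts the trace ID, and records the span ID as its parent, which is how the cross-service hierarchy is stitched together.</p>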
<p>Now that you know how these pieces work together, try to run the repository linked <a href="https://github.com/sophia-solo/otel-go-demo">here</a>, using the readme as your guide.</p>
<h3>Sidenote: Adding Custom Spans</h3>
<p>For getting an application emitting traces, this instrumentation works great! If you visit <code>localhost:8080/hello</code> after starting the Docker containers, the <code>otelhttp</code> middleware automatically creates spans for each HTTP request. However, basic instrumentation only shows essential application telemetry, such as response duration, URL paths, and status codes. You won’t know what happens between the request coming in and the request completing. OpenTelemetry truly gains power when you add custom spans. Unlike auto-instrumentation, where spans are created and closed automatically, custom spans require you to explicitly start and stop them.</p>
<p>Custom spans can track your application’s logic, such as specific business events or marking expensive operations, using a detailed hierarchy within each trace. In the <a href="https://github.com/sophia-solo/otel-go-demo">application</a> for this article, there are several custom spans that were created to track important operations:</p>
<ul>
<li>
<p><code>background-work</code>: This traces asynchronous processing that happens with the main request.</p>
</li>
<li>
<p><code>computation</code>: This measures a computation and then captures the result and the computation type.</p>
</li>
</ul>
<p>Custom spans add granular visibility into your application's behavior. For example, in <code>performComputation</code>:</p>
<pre><code>ctx, span := tracer.Start(ctx, &quot;computation&quot;)
defer span.End()

result := rand.Float64()
span.SetAttributes(
	attribute.String(&quot;comp.type&quot;, compType),
	attribute.Float64(&quot;comp.result&quot;, result),
)

logger.InfoContext(ctx, &quot;Computation completed&quot;, &quot;type&quot;, compType, &quot;result&quot;, result)

if result &lt; 0.3 {
	span.AddEvent(&quot;Low confidence result&quot;)
	logger.WarnContext(ctx, &quot;Low confidence computation&quot;, &quot;result&quot;, result)
}
</code></pre>
<p>The attributes set above become searchable and filterable in our Elastic Observability backend, allowing you to filter by <code>comp.type</code> and <code>comp.result</code>. If you query your data with “show me all computations where results are less than 0.3,” you will notice the span event added by <code>span.AddEvent(&quot;Low confidence result&quot;)</code> tacked on as a timestamped marker. This appears on your trace timeline as well, adding even more visibility to any unusual events. Below is a small example of the filtering that Kibana can accomplish from custom spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/computations.png" alt="Filtering attribute.Result to review borderline Low Confidence results" /></p>
<h1>The Data Pipeline: From Code to IRL</h1>
<p>Now that your application can export custom spans and telemetry over OTLP, the best hub for that data is the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>. It is a simple, standalone process that is able to receive, process, and export all of your telemetry data. Within this project, we use the Elastic Distributions of OpenTelemetry (<a href="https://www.elastic.co/docs/solutions/observability/get-started/opentelemetry/quickstart/self-managed/docker">EDOT</a>) Collector, a Collector optimized for use within your Elastic Stack. Since this is a self-managed Elastic instance, this article and the connected repository utilize the EDOT Collector through <code>elasticapm</code>, but for Elastic Cloud or Serverless projects, you can use the Elastic Managed OpenTelemetry Protocol (OTLP) Endpoint. As noted in the quickstart documentation <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">here</a>, the Elastic Cloud Managed OTLP Endpoint gets your data quickly and efficiently into your Elastic Stack through OTLP, without schema translation! This means that your telemetry hits Elastic instantly and your telemetry data remains vendor-neutral.</p>
<p>For most developers and SREs, this Collector is an amazing tool. It allows you to decouple your code from the observability backend: your application does not need to know its final destination, it can just send the data to the Collector, and your observability backend can change without touching your code. The OpenTelemetry Collector also acts as a gateway for multiple streams of data, and is able to accept various formats in order to unify them for export. Lastly, the OpenTelemetry Collector is able to offload processing work from your application: tasks such as retries, batching, and filtering can happen in the Collector, not in your application.</p>
<p>After trying out this article’s repository, try auto-instrumenting your application with <a href="https://www.elastic.co/docs/reference/opentelemetry">Elastic Distributions of OpenTelemetry</a> (EDOT) so that you can utilize the APM UI to its full potential! With <a href="https://github.com/elastic/start-local"><code>start-local</code></a>, you can use <a href="https://www.docker.com/">Docker</a> to install and run the latest versions of Elasticsearch and Kibana and instantly start monitoring your application.</p>
<h2>Understanding the Collector Configuration</h2>
<p>The Collector's behavior is defined in a configuration file (<code>otel-collector-config.yaml</code>). Let's break down each component.</p>
<p><strong>Receivers</strong> define how the Collector accepts telemetry data. Here, we're listening for both gRPC and HTTP traffic.</p>
<pre><code>receivers:
  # Receives data from other Collectors in Agent mode
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
</code></pre>
<p><strong>Connectors</strong> are specialized components that sit between pipelines; in this case, we are using the <code>elasticapm</code> Connector. This APM Connector exports our metrics, logs, and traces while simultaneously acting as a receiver for the <code>metrics/aggregated-otel-metrics</code> pipeline (see below). Without it, your raw OTLP data lands in Elasticsearch, but the APM UI has nothing to build its views from.</p>
<pre><code>connectors:
  elasticapm: {} # Elastic APM Connector
</code></pre>
<p><strong>Processors</strong> transform, filter, or enrich data as it passes through the EDOT Collector. The batch processor aggregates spans before export, reducing network overhead and improving efficiency, as well as limiting batch sizes. The batch/metrics processor does the same for APM metrics. Lastly, there is the Elastic APM processor. This processor ensures that your span fields are aligned and your trace views are complete; overall, it bridges the gap between Elastic's expectations and OpenTelemetry's formatting of your traces.</p>
<pre><code>processors:
  batch:
    send_batch_size: 1000
    timeout: 1s
    send_batch_max_size: 1500
  batch/metrics:
    send_batch_max_size: 0 # Explicitly set to 0 to avoid splitting metrics requests
    timeout: 1s
  elasticapm: {} # Elastic APM Processor
</code></pre>
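<p>As a rough mental model (not the Collector's actual implementation), the batch processor's semantics can be sketched like this: buffer incoming items, flush when the batch size or the timeout is reached, and split oversized flushes at <code>send_batch_max_size</code>:</p>

```python
import time

class BatchProcessor:
    """Toy model of the Collector's batch processor semantics:
    flush when send_batch_size is reached or when the timeout elapses."""

    def __init__(self, send_batch_size=1000, timeout=1.0, send_batch_max_size=1500):
        self.send_batch_size = send_batch_size
        self.timeout = timeout
        self.send_batch_max_size = send_batch_max_size
        self.buffer = []
        self.last_flush = time.monotonic()
        self.flushed = []  # batches handed on to the exporters

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()

    def tick(self):
        """Called periodically; flushes whatever is buffered once the timeout elapses."""
        if self.buffer and time.monotonic() - self.last_flush >= self.timeout:
            self.flush()

    def flush(self):
        # Split into chunks no larger than send_batch_max_size (0 means no splitting,
        # matching the batch/metrics configuration above).
        max_size = self.send_batch_max_size or len(self.buffer)
        while self.buffer:
            self.flushed.append(self.buffer[:max_size])
            self.buffer = self.buffer[max_size:]
        self.last_flush = time.monotonic()

# Tiny sizes so the behavior is visible: flush at 3 items, split batches at 2.
bp = BatchProcessor(send_batch_size=3, timeout=0.0, send_batch_max_size=2)
for span in range(5):
    bp.add(span)
```

After the loop, the first three spans have been flushed as two batches of at most two items, while the remaining two wait in the buffer for the next size or timeout trigger.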
<p>As mentioned previously in the article, <strong>exporters</strong> send data to your observability backend. The debug exporter logs telemetry to the console (useful for development), while the Elasticsearch exporter sends traces to your Elastic stack.</p>
<pre><code>exporters:
  debug: {}
  elasticsearch/otel:
    endpoints:
      - ${ELASTIC_ENDPOINT} # Will be populated from environment variable
    user: elastic
    password: ${ELASTIC_PASSWORD}
    tls:
      ca_file: /config/certs/ca/ca.crt
    mapping:
      mode: otel
</code></pre>
<p><strong>Pipelines</strong> connect receivers, processors, and exporters into a data flow. These EDOT Collector pipelines receive OTLP data, batch it, and export it: traces and logs go to the <code>debug</code>, <code>elasticapm</code>, and <code>elasticsearch/otel</code> exporters, while metrics go to the <code>debug</code> and <code>elasticsearch/otel</code> exporters.</p>
<pre><code>service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [debug, elasticsearch/otel]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, elasticapm, elasticsearch/otel]
    traces:
      receivers: [otlp]
      processors: [batch, elasticapm]
      exporters: [debug, elasticapm, elasticsearch/otel]
    metrics/aggregated-otel-metrics:
      receivers:
        - elasticapm
      processors: [] # No processors defined in the original for this pipeline
      exporters:
        - debug
        - elasticsearch/otel
</code></pre>
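<p>Conceptually, each pipeline is just fan-out plumbing: data from the receivers passes through the processors in order and is then handed to every exporter. A toy sketch (the component names mirror the config above; the logic is illustrative, not the Collector's code):</p>

```python
def run_pipeline(data, processors, exporters):
    """Pass data through each processor in order, then fan out to all exporters."""
    for process in processors:
        data = process(data)
    return {name: data for name in exporters}

# Illustrative stand-in for the "batch" processor: group items into one batch.
def batch(items):
    return [items]

traces = ["span-a", "span-b"]
routed = run_pipeline(traces, [batch], ["debug", "elasticapm", "elasticsearch/otel"])
# Every exporter receives the same processed payload.
```

This is why adding or removing an exporter in the config changes where data lands without touching how it is received or processed.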
<h2>Debugging Your Code with Confidence in Kibana</h2>
<p>Elastic Observability, through Kibana and Streams, has native support for the OTLP Endpoint via the EDOT Collector, which was used in this project. Below, you can see that your data is automatically connected to Streams from the beginning, requiring no extra legwork! You can add conditions or Grok processors as your data streams in, and you'll instantly see your data's schema and data quality.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/streams-connection.png" alt="Streams built-in connection" /></p>
<p>Elastic also provides the Elastic Cloud Managed OTLP Endpoint for even easier storage, data processing, and scaling. With this managed endpoint, you can configure OpenTelemetry to send data directly to Elasticsearch, without any specialized Collectors. Whichever way you choose, once your traces are flowing, Kibana’s APM UI provides the powerful visualization and analysis capabilities you need to debug your code. You can drill down into individual requests, identify bottlenecks, find anomalies, and troubleshoot any issues that arise with confidence.</p>
<p>Here is one span of interest from this repository. Within Kibana, you can immediately filter by the Trace ID, finding other spans with the same Trace ID to visually see the entire trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/pre-filter-traces.png" alt="A span of interest among many" />
<img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/post-filter-traces.png" alt="The entire trace of the span" /></p>
<p>Kibana Discover also allows you to switch indices instantly without losing your filters, ensuring that you can also see the logs that correspond with the same Trace ID.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/log-trace.png" alt="Logs matching the Trace ID" /></p>
<p>In addition to manually checking your traces, you can have them correlated automatically within the APM UI (shown below). This easy trace visualization in the Kibana APM UI is readily available when using the <code>elasticapm</code> connector. Below is a visualization of a trace composed of spans from our project. Knowing both methods of correlating spans builds a solid foundation for using Kibana and the APM UI for observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/automatic-apm-trace.png" alt="Automatic trace span hierarchy in Kibana APM" /></p>
<p>Here is a fully built-out dashboard created from the repository featured in this article. The possibilities with Elastic Observability are endless!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/kibana-dashboard.png" alt="Full Kibana Dashboard" /></p>
<h2>Congrats, You’re Not “Just” a Developer Anymore!</h2>
<p>We’ve broken down the why and how behind OpenTelemetry’s basic components, including the TracerProvider, the span, the exporter, and the Collector. Here, you’ve done more than just implement a tracing tool: you now understand the complete data flow from your code to the graphs on your dashboard.</p>
<p>You can now speak the language of observability with confidence, not because you memorized a configuration file, but because you understand how telemetry moves through your system. You aren’t “just” a developer anymore; you’re now a developer who can truly see.</p>
<p>Try out the code repo above! Included in the <a href="">repository</a> is a <code>generate-traffic.sh</code> script. You can run it repeatedly to generate logs, traces, and metrics to play with in the APM UI. Also, check out our <a href="https://www.elastic.co/docs/release-notes/elasticsearch">release notes</a> page for the latest exciting Elastic updates.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/developers-guide-to-easy-ops/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The DNA of DATA Increasing Efficiency with the Elastic Common Schema]]></title>
            <link>https://www.elastic.co/observability-labs/blog/dna-of-data</link>
            <guid isPermaLink="false">dna-of-data</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic ECS helps improve semantic conversion of log fields. Learn how quantifying the benefits of normalized data, not just for infrastructure efficiency, but also data fidelity.]]></description>
<content:encoded><![CDATA[<p>The Elastic Common Schema is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users face a lower barrier when interpreting events of interest, resolving incidents, or hunting for unknown threats. However, there are also underlying infrastructure reasons that justify adopting the Elastic Common Schema.</p>
<p>In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source leveraged in this blog is a 3.3GB Nginx log file obtained from Kaggle. This dataset is represented in three categories: raw, with zero normalization; self, demonstrating commonly implemented mistakes I have observed over 5+ years of working with various users; and ECS, the optimal approach to data hygiene.</p>
<p>This hygiene is achieved through the parsing, enrichment, and mapping of ingested data, akin to sequencing DNA in order to express genetic traits. By understanding the data's structure and assigning the correct mapping, a more thorough expression of the data may be represented, stored, and searched upon.</p>
<p>If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/introducing-the-elastic-common-schema">Introducing the Elastic Common Schema</a></p>
</li>
<li>
<p><a href="https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs">Kaggle Web Server Logs</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a></p>
</li>
</ul>
<h2>Dataset Validation</h2>
<p>Before we begin, let us review how many documents exist and what we're required to ingest. We have 10,365,152 documents/events from our Nginx log file:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/access-logs.png" alt="nginx access logs" /></p>
<p>With 10,365,152 documents in our targeted end-state:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/end-state.png" alt="end state" /></p>
<h2>Dataset Ingestion: Raw &amp; Self</h2>
<p>To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. For the raw data ingest, we use a simple file input with no additional modifications or index templates:</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/raw/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-raw&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<p>For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/self/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
      grok {
        match =&gt; { &quot;message&quot; =&gt; &quot;%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \&quot;(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\&quot; (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}&quot; }
      }
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://myscluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-self&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
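<p>The Grok pattern above can be approximated with a named-group regex in Python; note that everything it extracts is a string, which matters later when we look at mappings. The sample log line below is invented for illustration:</p>

```python
import re

# Rough Python equivalent of the Grok pattern in the Logstash filter above.
NGINX_RE = re.compile(
    r'(?P<clientip>\S+) - (?P<requestClient>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<requestMethod>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d+) (?P<bytes_in>\d+) "(?P<bytes_out>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('203.0.113.7 - - [25/Sep/2024:12:00:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 4523 "-" "curl/8.0"')

fields = NGINX_RE.match(line).groupdict()
# Without explicit mappings, every extracted value is indexed as text:
all_strings = all(isinstance(v, str) for v in fields.values())
```

The parse succeeds, but numbers like <code>response</code> and <code>bytes_in</code> are still strings; without an index template, Elasticsearch maps them as text.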
<h2>Dataset Ingestion: ECS</h2>
<p>Elastic comes with many available integrations which contain everything you need to ensure that your data is ingested as efficiently as possible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/integrations.png" alt="integrations" /></p>
<p>For our use case of Nginx, we'll be using the associated integration's assets only.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-integration.png" alt="nginx integration" /></p>
<p>The installed assets are more than just dashboards: there are ingest pipelines which not only normalize but also enrich the data, while component templates simultaneously map the fields to their correct types. All we have to do is make sure that, as the data comes in, it traverses the ingest pipeline and uses these supplied mappings.</p>
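<p>Conceptually, such an ingest pipeline renames raw fields to their ECS equivalents and coerces types so aggregations work. The sketch below is a simplified stand-in for the integration's actual pipeline; the ECS field names are real, but the renaming table is abbreviated for illustration:</p>

```python
# Simplified stand-in for the Nginx integration's ingest pipeline:
# map self-parsed field names onto ECS fields and coerce numeric types.
ECS_RENAMES = {
    "clientip": "source.ip",
    "requestMethod": "http.request.method",
    "request": "url.original",
    "httpversion": "http.version",
    "user_agent": "user_agent.original",
}
ECS_INTEGERS = {
    "response": "http.response.status_code",
    "bytes_in": "http.response.body.bytes",
}

def to_ecs(raw: dict) -> dict:
    doc = {}
    for src, dest in ECS_RENAMES.items():
        if src in raw:
            doc[dest] = raw[src]
    for src, dest in ECS_INTEGERS.items():
        if src in raw:
            doc[dest] = int(raw[src])  # correct type enables avg/sum aggregations
    return doc

raw = {"clientip": "203.0.113.7", "requestMethod": "GET", "request": "/index.html",
       "httpversion": "1.1", "response": "200", "bytes_in": "4523",
       "user_agent": "curl/8.0"}
doc = to_ecs(raw)
```

Every source speaking the same field names is what makes one dashboard or detection rule work across all of them.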
<p>Create your index template, and select the supplied component templates provided from your integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs.png" alt="nginx-ecs" /></p>
<p>Think of the component templates like building blocks to an index template. These allow for the reuse of core settings, ensuring standardization is adopted across your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-template.png" alt="nginx-ecs-template" /></p>
<p>For our ingestion method, we merely point to the index name that we specified during index template creation, in this case <code>nginx-ecs</code>, and Elastic will handle the rest!</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/ecs/access.log&quot;
        #ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-ecs&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<h2>Data Fidelity Comparison</h2>
<p>Let's compare how many fields are available to search across the three indices, as well as the quality of the data. Our raw index has but 15 fields to search upon, with most being duplicates for aggregation purposes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw.png" alt="nginx-raw" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-1.png" alt="mapping-1" /></p>
<p>However, from a Discover perspective, we are limited to <code>6</code> fields!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw-discover.png" alt="nginx-raw-discover" /></p>
<p>Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self.png" alt="nginx-self" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-2.png" alt="mapping-2" /></p>
<p>From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mapping, the ease with which this data may be searched is less than ideal. A great example of this is attempting to calculate the average bytes_in on a text field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self-discover.png" alt="nginx-self-discover" /></p>
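<p>This failure mode is easy to reproduce outside Elasticsearch: averaging only works once the values are numbers, which is exactly what a correct mapping provides. A minimal illustration (the byte counts are made up):</p>

```python
# bytes_in as ingested by the self-parsed pipeline: text, not numbers.
bytes_in_text = ["4523", "812", "10240"]

try:
    avg = sum(bytes_in_text) / len(bytes_in_text)  # what an avg on a text field amounts to
except TypeError:
    avg = None  # aggregation is impossible on unparsed text

# With a numeric mapping (what the ECS component template provides):
bytes_in_long = [int(v) for v in bytes_in_text]
avg_long = sum(bytes_in_long) / len(bytes_in_long)
```

The same distinction in Elasticsearch is between a <code>text</code>/<code>keyword</code> mapping and a <code>long</code> mapping: only the latter supports metric aggregations like avg and sum.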
<p>Finally, with our ECS index, we have 71 fields available to us! Notice that, courtesy of the ingest pipeline, we have enriched fields of geographic information as well as event categorization fields.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-pipeline.png" alt="nginx-ecs-pipeline" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-3.png" alt="mapping-3" /></p>
<p>Now what about Discover? There are 51 fields directly available to us for searching purposes:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-discover.png" alt="nginx-ecs-discover" /></p>
<p>Using Discover as our basis, our self-parsed index offers 283% as many searchable fields as the raw index, whereas our ECS index offers 850%!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-1.png" alt="table-1" /></p>
<h2>Storage Utilization Comparison</h2>
<p>Surely, with all these fields, our ECS index must be dramatically larger than the self-normalized index, let alone the raw index? The results may surprise you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/total-storage.png" alt="total-storage" /></p>
<p>Accounting for the replica of our 3.3GB dataset, we can see that normalizing and mapping the data has a significant impact on the amount of storage required.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-2.png" alt="table-2" /></p>
<h2>Conclusion</h2>
<p>While any enriched dataset requires some additional storage, Elastic provides easy solutions that maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.</p>
<p>Let's review how we were able to maximize search while minimizing storage:</p>
<ul>
<li>Installing the integration assets for the dataset we are going to ingest.</li>
<li>Customizing the index template to leverage the included component templates, ensuring mapping and parsing are aligned to the Elastic Common Schema.</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/dna-of-data/dna-of-data.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[TLS Certificate Monitoring with the OpenTelemetry Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/edot-certificate-monitoring</link>
            <guid isPermaLink="false">edot-certificate-monitoring</guid>
            <pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to monitor TLS certificate expiration in Kubernetes clusters using the OpenTelemetry Collector, ensuring comprehensive visibility into both external and internal certificates, using Elastic Observability
]]></description>
            <content:encoded><![CDATA[<p>In modern distributed systems, TLS certificates are the glue that holds
everything together while keeping it safe. Certificates aren't only used for
encrypting user traffic; they are fundamental building blocks of trust for your
entire system.</p>
<p>Indeed, an expired certificate is <em>not</em> just a minor technical glitch.
It is a direct hit on your most critical systems:</p>
<ul>
<li>
<p>Your CI/CD pipeline grinds to a halt because it cannot trust the internal
image registry.</p>
</li>
<li>
<p>Your Single Sign-On (SSO) system fails, locking all your internal users out.</p>
</li>
<li>
<p>Your external clients see scary browser warnings, shattering user trust and
forcing support tickets.</p>
</li>
<li>
<p>Your SLOs burn due to services not being able to communicate with one another.</p>
</li>
</ul>
<p>In Kubernetes, certificates are usually dynamically generated and auto-renewed
by tools like <code>cert-manager</code>. In less fortunate scenarios, certificates might be
tucked away inside <code>Secrets</code> and <code>ConfigMaps</code>, making them hard to
inventory. It is neither rare nor unheard of to have a dozen critical
certificates and no centralized way to know when they are about to expire.</p>
<p>Additionally, only monitoring the certificates for external Load Balancers might
lead to huge <em>internal</em> risks, since many certificates never get exposed to
external users.</p>
<p>In this blog post, we will guide you through establishing comprehensive,
cluster-wide certificate monitoring using the OpenTelemetry Collector,
the <a href="https://github.com/enix/x509-certificate-exporter">x509-certificate-exporter</a>,
and Elastic Observability.</p>
<h2>Classical approach: HTTP monitoring</h2>
<p>The classical approach to monitoring TLS certificate expiration in Elastic
Observability is to treat it like any other service availability check. Historically,
this was accomplished using Heartbeat or, more recently, Elastic Observability's Synthetics.
These tools perform an external check against a public HTTPS endpoint and
automatically extract the certificate's validity dates, allowing you to
configure a
<a href="https://www.elastic.co/docs/solutions/observability/incident-management/create-tls-certificate-rule">Synthetics TLS certificate rule</a>
in Kibana to trigger an alert when expiration is within a specified threshold
(e.g., 30 days).</p>
<p>While effective for external-facing services, this &quot;classical&quot; approach has two
major shortcomings when dealing with Kubernetes:</p>
<ul>
<li>
<p>It only works for certificates exposed via HTTP(S), meaning you cannot use
this for internal services, databases, or message queues using other protocols.
In other words, this won't work to monitor common, critical TLS certificates
such as Kafka's.</p>
</li>
<li>
<p>The monitoring agent must have network access to the endpoint. In a segmented
or private Kubernetes environment, deploying agents with the necessary access
often introduces unnecessary complexity or security risks.</p>
</li>
</ul>
<p>To gain true cluster-wide visibility, we need to inspect the certificates at
their source: <em>inside</em> Kubernetes Secrets or ConfigMaps.</p>
<h2>A Kubernetes-native approach: monitor Secrets and ConfigMaps</h2>
<p>Monitoring TLS certificate expiration directly within Kubernetes Secrets and
ConfigMaps is the only reliable way to gain visibility into internal,
non-HTTP-exposed certificates, such as those used for service meshes, internal
registries, or databases. In this section, we will use the OpenTelemetry Collector to
monitor certificate expiration.</p>
<p>The OpenTelemetry Collector provides a mechanism to read
up-to-date information from the Kubernetes API, including Secrets, via the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver">k8sobjects receiver</a>.
However, this receiver only fetches <em>raw</em> TLS certificate resource data,
which the OpenTelemetry Transformation Language (OTTL) can not properly parse.
Therefore, we need to use a dedicated exporter to collect the certificate data
and expose the results in a digestible format.</p>
<h3>The industry-standard solution</h3>
<p>As mentioned above, simply reading certificate information from the Kubernetes API
is not a feasible solution. We will therefore use a specialized,
lightweight exporter (specifically, the popular
<a href="https://github.com/enix/x509-certificate-exporter">x509-certificate-exporter</a>)
to collect TLS certificate data and expose the results,
allowing the OpenTelemetry Collector's Prometheus receiver to seamlessly
scrape the data and send it to Elastic Observability.
This approach immediately and easily enables us to monitor both certificates
generated by <code>cert-manager</code> and self-managed ones, such as the ones created for
ECK.</p>
<p>A fully working configuration example and a script to set up a complete local
development environment are available <a href="https://github.com/elastic/edot-certificate-monitoring-blog-post">here</a>.
Feel free to use it to follow along as you read through this guide and try out the examples.
Please note that, while this repository uses the Elastic Distribution of OpenTelemetry (EDOT),
it can be easily adapted to use the OpenTelemetry Collector.</p>
<h4>Helm Chart Configuration</h4>
<p>We configured the <code>x509-certificate-exporter</code> with the official Helm Chart and
used the following minimal configuration:</p>
<pre><code class="language-yaml">secretsExporter:
  secretTypes:
  - type: kubernetes.io/tls
    key: tls.crt
  # For ECK that uses different secret types
  - type: Opaque
    key: tls.crt
  - type: Opaque
    key: ca.crt
  configMapKeys:
  - tls.crt
  - ca.crt

# Create a service to have a stable endpoint for scraping metrics
service:
  create: true
  # -- TCP port to expose the Service on
  port: 9793

# Disable prometheus service monitor and prometheus rules
prometheusServiceMonitor:
  create: false
prometheusRules:
  create: false
</code></pre>
<p>Refer to the reference <code>values.yaml</code> for insight into the plethora of
configuration options.</p>
<h4>OpenTelemetry Collector Configuration</h4>
<p>Afterward, we configured the OpenTelemetry Collector to scrape the metrics from the
service:</p>
<pre><code class="language-yaml">prometheus/cert-expiration:
  config:
    scrape_configs:
      - job_name: &quot;cert-expiration&quot;
        scrape_interval: 60m
        static_configs:
          - targets:
              - &quot;x509-certificate-exporter.monitoring.svc.cluster.local:9793&quot;
</code></pre>
<p>We deliberately used a long scrape interval of 60 minutes, because certificate
expiration is a low-frequency concern.</p>
<h4>Visualizing the data in Kibana</h4>
<p>Once the data is ingested, we can explore it using Discover. We can select the
<code>metrics-*</code> Data View and search for our
data with the filter <code>data_stream.dataset : &quot;prometheusreceiver.otel&quot;</code>.</p>
<p>An example document looks like the following:</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2025-12-19T09:43:45.317Z&quot;,
  &quot;_metric_names_hash&quot;: &quot;7d113f55b70019d9&quot;,
  &quot;attributes&quot;: {
    &quot;issuer_CN&quot;: &quot;tls-cert.example.com&quot;,
    &quot;issuer_O&quot;: &quot;TLS Cert&quot;,
    &quot;secret_key&quot;: &quot;tls.crt&quot;,
    &quot;secret_name&quot;: &quot;tls-cert-secret&quot;,
    &quot;secret_namespace&quot;: &quot;test-certs&quot;,
    &quot;serial_number&quot;: &quot;250887723804527203192865532237673843132727735771&quot;,
    &quot;subject_CN&quot;: &quot;tls-cert.example.com&quot;,
    &quot;subject_O&quot;: &quot;TLS Cert&quot;
  },
  &quot;data_stream&quot;: {
    &quot;dataset&quot;: &quot;prometheusreceiver.otel&quot;,
    &quot;namespace&quot;: &quot;default&quot;,
    &quot;type&quot;: &quot;metrics&quot;
  },
  &quot;metrics&quot;: {
    &quot;x509_cert_expired&quot;: 0,
    &quot;x509_cert_not_after&quot;: 1768488242,
    &quot;x509_cert_not_before&quot;: 1765896242
  },
  &quot;resource&quot;: {
    &quot;attributes&quot;: {
      &quot;server.address&quot;: &quot;x509-certificate-exporter.monitoring.svc.cluster.local&quot;,
      &quot;server.port&quot;: &quot;9793&quot;,
      &quot;service.instance.id&quot;: &quot;x509-certificate-exporter.monitoring.svc.cluster.local:9793&quot;,
      &quot;service.name&quot;: &quot;cert-expiration&quot;,
      &quot;url.scheme&quot;: &quot;http&quot;
    }
  },
  &quot;scope&quot;: {
    &quot;name&quot;: &quot;github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver&quot;,
    &quot;version&quot;: &quot;9.2.2&quot;
  }
}
</code></pre>
<p>The core metric reported by the <code>x509-certificate-exporter</code> is
<code>x509_cert_not_after</code>, which represents the Unix Epoch timestamp (in seconds) of the certificate's
expiration date. This metric has several attributes associated with it.
In the case of <code>Secrets</code>, the following attributes are relevant:</p>
<ul>
<li><code>secret_namespace</code>: The namespace of the Secret containing the certificate.</li>
<li><code>secret_name</code>: The name of the Secret containing the certificate.</li>
<li><code>secret_key</code>: The specific key within the Secret where the certificate is stored.</li>
</ul>
<p>In the case of <code>ConfigMaps</code>, we can infer the attributes of interest
from the <code>filepath</code> attribute.</p>
<p>Finally, we can leverage ES|QL to compute the remaining days until expiration.
In the following examples, we will use the <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> command</a>,
which is optimized and recommended for interacting with time-series data.</p>
<p>For <code>Secrets</code>:</p>
<pre><code class="language-sql">TS metrics-*
| WHERE metrics.x509_cert_not_after is not NULL
| STATS expiration_date = MAX(LAST_OVER_TIME(metrics.x509_cert_not_after)) by attributes.secret_namespace, attributes.secret_name, attributes.secret_key
| EVAL remaining_days = DATE_DIFF(&quot;days&quot;, NOW(), TO_DATETIME (1000 * expiration_date))
| EVAL expiration_date = TO_DATETIME(1000 * expiration_date)
| SORT expiration_date ASC
</code></pre>
<p>And for <code>ConfigMaps</code>:</p>
<pre><code class="language-sql">TS metrics-*
| WHERE metrics.x509_cert_not_after IS NOT NULL
| WHERE attributes.filepath IS NOT NULL
| DISSECT attributes.filepath &quot;k8s/%{namespace}/%{configmap}&quot;
| WHERE configmap != &quot;kube-root-ca.crt&quot; // Filter out the Kubernetes API server certificate's signing CA
| STATS expiration_date = MAX(LAST_OVER_TIME(metrics.x509_cert_not_after)) by namespace, configmap, attributes.filename
| EVAL remaining_days = DATE_DIFF(&quot;days&quot;, NOW(), TO_DATETIME (1000 * expiration_date))
| EVAL expiration_date = TO_DATETIME(1000 * expiration_date)
| SORT expiration_date ASC
</code></pre>
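<p>The DISSECT step can be mirrored in Python to see what it extracts; the <code>k8s/&lt;namespace&gt;/&lt;configmap&gt;</code> shape follows the pattern used in the query above, and the helper function is illustrative:</p>

```python
def dissect_filepath(filepath: str):
    """Split a 'k8s/<namespace>/<configmap>' filepath, mirroring the DISSECT pattern."""
    prefix, namespace, configmap = filepath.split("/", 2)
    if prefix != "k8s":
        raise ValueError(f"unexpected filepath: {filepath!r}")
    return namespace, configmap

ns, cm = dissect_filepath("k8s/kube-system/extension-apiserver-authentication")
# Filter out the Kubernetes API server certificate's signing CA, as in the query:
keep = cm != "kube-root-ca.crt"
```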
<p>Based on these core queries, we can easily build a dashboard that shows the
remaining days until expiration for all the certificates in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-certificate-monitoring/dashboard.png" alt="Kibana Certificate Expiration Dashboard" /></p>
<p>and create alerts about certificates that are about to expire by adding a
condition after the query:</p>
<pre><code class="language-sql">WHERE remaining_days &lt; 30
</code></pre>
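<p>The remaining-days arithmetic used in these queries can be sanity-checked in plain Python: the metric is Unix seconds, so multiplying by 1000 yields a datetime, or the difference can simply be divided by 86,400 for days. The sample values come from the example document earlier; the helper function is illustrative:</p>

```python
from datetime import datetime, timezone

def remaining_days(not_after_epoch_s: int, now: datetime) -> float:
    """Days until certificate expiration, given x509_cert_not_after (Unix seconds)."""
    expiration = datetime.fromtimestamp(not_after_epoch_s, tz=timezone.utc)
    return (expiration - now).total_seconds() / 86_400

# Values from the example document above:
not_before = 1765896242  # x509_cert_not_before
not_after = 1768488242   # x509_cert_not_after

# Evaluated at issuance time, the certificate has its full lifetime left.
now = datetime.fromtimestamp(not_before, tz=timezone.utc)
days = remaining_days(not_after, now)  # 30.0: a 30-day certificate
```

An alerting threshold like <code>remaining_days &lt; 30</code> is then just a comparison on this value.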
<h3>Conclusion</h3>
<p>In this blog post, we explored how to monitor TLS certificate expiration
within a Kubernetes cluster using the OpenTelemetry Collector.
We discussed the limitations of traditional HTTP-based monitoring
approaches and introduced a Kubernetes-native solution leveraging the
<code>x509-certificate-exporter</code> to extract certificate expiration data directly from
Kubernetes Secrets and ConfigMaps. This method provides comprehensive visibility
into all certificates used within the cluster, including those not exposed via
HTTP(S).</p>
<p>For the sake of simplicity, we focused on monitoring certificate expiration
with the OpenTelemetry Collector on Kubernetes. However, this approach can easily be applied
with the classic Elastic Agent by leveraging the
<a href="https://www.elastic.co/docs/reference/integrations/prometheus_input">Prometheus input package</a>
(read more on how to use input packages
<a href="https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages">here</a>),
and it can also be extended to monitor certificates on virtual machines or
bare-metal servers by deploying the <code>x509-certificate-exporter</code> there.</p>
<p>Finally, it is worth knowing that Elastic Observability offers an officially supported
distribution of the OpenTelemetry Collector,
called <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry (EDOT)</a>.</p>
<p>If you are an Elastic user, consider using the EDOT Collector to monitor certificates with
OpenTelemetry: because it is supported by Elastic Observability, it is easier to manage and keep up to date. Alternatively, you can use upstream OTel components.</p>
<h3>What's next?</h3>
<p>Now that Elastic supports
<a href="https://www.elastic.co/docs/reference/fleet/alerting-rule-templates">Rule Templates</a>
and <a href="https://www.elastic.co/docs/solutions/observability/apm/opentelemetry">OpenTelemetry content packs</a>,
our near-term objective is to contribute to the integration repository to make
the setup of certificate monitoring even easier for our users.
Stay tuned for more updates on this!</p>
<p>Check out other resources on Elastic and OpenTelemetry:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">Elastic's OTLP Endpoint</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot">Elastic's EDOT PHP Contribution</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">OpenTelemetry SDK Central Management with EDOT</a></p>
<p>Also, sign up for <a href="https://cloud.elastic.co">Elastic Cloud</a> and try out your application with OpenTelemetry in Elastic.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/edot-certificate-monitoring/edot-certificate-monitoring.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Scale testing OpenTelemetry log ingestion on GCP with EDOT Cloud Forwarder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/edot-cloud-forwarder-gcp-load-testing</link>
            <guid isPermaLink="false">edot-cloud-forwarder-gcp-load-testing</guid>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how we load tested the EDOT Cloud Forwarder for GCP on Google Cloud Run and identified practical capacity limits per instance. We show how runtime tuning improves stability and translate the results into concrete configuration and scaling guidance.]]></description>
            <content:encoded><![CDATA[<p>EDOT Cloud Forwarder (ECF) for GCP is an event-triggered, serverless OpenTelemetry Collector deployment for Google Cloud. It runs the OpenTelemetry Collector on Cloud Run, ingests events from Pub/Sub and Google Cloud Storage, parses Google Cloud service logs into OpenTelemetry semantic conventions, and forwards the resulting OTLP data to Elastic, relying on Cloud Run for scaling, execution, and infrastructure lifecycle management.</p>
<p>To run ECF for GCP confidently at scale, you need to understand its capacity characteristics and sizing behavior. For ECF for GCP, which is part of the broader <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/gcp">ECF architecture</a>, we answered these questions through repeatable load testing, grounding every decision in measured data.</p>
<p>We'll introduce the test setup, explain each runtime setting, and share the capacity numbers we observed for a single instance.</p>
<h2>How we load tested EDOT Cloud Forwarder for GCP</h2>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/load-testing.png" alt="Load testing overview" /></p>
<p>The load testing architecture simulates a realistic, high-volume pipeline:</p>
<ol>
<li>We developed a load tester service that uploads generated log files to a GCS bucket as fast as possible.</li>
<li>Each file creation in this Google Cloud Storage (GCS) bucket then triggers an event notification to Pub/Sub.</li>
<li>Pub/Sub delivers push messages to a Cloud Run service where EDOT Cloud Forwarder fetches and processes these log files.</li>
</ol>
<p>Our setup exposes two primary tunable settings that directly influence Cloud Run scaling behavior and memory pressure:</p>
<ul>
<li>Request pressure using a concurrency setting (how many concurrent requests each ECF instance can handle).</li>
<li>Work per request using a log count setting (number of logs per file in each uploaded object).</li>
</ul>
<p>In our tests, we used a testing system that:</p>
<ul>
<li>Deploys the whole testing infrastructure. This includes the complete ECF infrastructure, a mock backend, etc.</li>
<li>Generates log files according to the configured log counts, using a Cloud Audit log of ~1.4 KB.</li>
<li>Runs a matrix of tests across all combinations of concurrency and log volume.</li>
<li>Produces a report for each tested concurrency level in which several stats are reported, such as CPU usage and memory consumption.</li>
</ul>
<p>For reproducibility and isolation, the <code>otlphttp</code> exporter in EDOT Cloud Forwarder uses a <strong>mock backend</strong> that always returns HTTP 200. This ensures all observed behavior is attributable to ECF itself, not downstream systems or network variability.</p>
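<p>As a sketch of the mock-backend idea (the actual test harness is internal; all names below are hypothetical), a minimal always-200 HTTP server in Go is enough to decouple the exporter from downstream variability:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newMockBackend returns a server that drains every request body and
// answers HTTP 200, so all observed behavior is attributable to the
// forwarder itself rather than the downstream system.
func newMockBackend() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body) // accept and discard the OTLP payload
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newMockBackend()
	defer srv.Close()

	// Simulate the otlphttp exporter posting a batch.
	resp, err := http.Post(srv.URL+"/v1/logs", "application/json", strings.NewReader(`{}`))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.StatusCode)
}
</code></pre>
<p>Pointing the exporter at such an endpoint removes network and backend variability from the measurements.</p>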
<h2>Step 1: Establish a stable runtime before measuring capacity</h2>
<p>Before asking how much load a single instance can handle, we first established a stable runtime baseline.</p>
<p>We quickly learned that a single flag, <code>cpu_idle</code>, can turn Cloud Run into a garbage-collector (GC) starvation trap. This is amplified by a known <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/gcp#limitations">limitation</a> of ECF's current architecture: the existing OpenTelemetry implementation reads whole log files into memory before processing them. Our goal was to eliminate configuration side effects so that the capacity tests reflected ECF's actual limits.</p>
<p>We focused on three runtime parameters:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>What it controls</th>
<th>Why it matters for ECF</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cpu_idle</code></td>
<td>Whether CPU is always allocated or only during requests</td>
<td>Dictates how much background time the garbage collector gets to reclaim memory</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td>Upper bound on Go heap size inside the container</td>
<td>Keeps the process from quietly growing until Cloud Run kills it on OOM</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td>Heap growth and collection aggressiveness in Go</td>
<td>Trades lower memory usage for higher CPU consumption</td>
</tr>
</tbody>
</table>
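<p>In the tests, <code>GOMEMLIMIT</code> and <code>GOGC</code> are set as environment variables on the Cloud Run service. As a sketch of their semantics, Go's <code>runtime/debug</code> package exposes programmatic equivalents (the values below mirror the 512 MiB container used later and are illustrative, not prescriptive):</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Keep the default GOGC=100: the heap may grow 100% beyond the live
	// set before the next collection is triggered.
	debug.SetGCPercent(100)

	// Mirror GOMEMLIMIT = 90% of a 512 MiB container: a soft limit the GC
	// works to stay under, so Go reacts before Cloud Run's OOM killer does.
	limit := int64(512&lt;&lt;20) * 90 / 100
	debug.SetMemoryLimit(limit)

	fmt.Printf("GOMEMLIMIT ~ %d MiB\n", limit&gt;&gt;20) // ~460 MiB
}
</code></pre>
<p>Note that <code>cpu_idle</code> has no Go-side equivalent: it is a Cloud Run service setting that controls whether the process gets CPU time between requests at all.</p>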
<p>All parameter-isolation tests use a single Cloud Run instance (min 0, max 1), fix concurrency for the scenario under study, and keep input files and test matrix identical across runs. This design lets us attribute differences directly to the parameter in question.</p>
<h3>CPU allocation: Stop starving the garbage collector</h3>
<p>Cloud Run offers two CPU allocation modes:</p>
<ul>
<li>Request-based (throttled). Enabled with <code>cpu_idle: true</code>. CPU is available only while a request is actively being processed.</li>
<li>Instance-based (always on). Enabled with <code>cpu_idle: false</code>. CPU remains available when idle, allowing background work such as garbage collection to run.</li>
</ul>
<p>The tests compared these modes under identical conditions:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Memory</td>
<td align="center">4 GiB (high enough to remove OOM as a factor)</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td align="center">90% of memory</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td align="center">Default (unset)</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">10</td>
</tr>
</tbody>
</table>
<h4>What we observed</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/cpu_allocation.png" alt="CPU allocation" /></p>
<p>With CPU allocated only on requests (<code>cpu_idle: true</code>):</p>
<ul>
<li>Memory variance was extreme (±71% RSS, ±213% heap).</li>
<li>Peak heap reached ~304 MB in the worst run.</li>
<li>We saw request refusals in the sample (90% success rate).</li>
</ul>
<p>With CPU always allocated (<code>cpu_idle: false</code>):</p>
<ul>
<li>Memory variance became tightly bounded (±8% RSS, ±32% heap).</li>
<li>Peak heap dropped to ~89 MB in the worst run.</li>
<li>We saw no refusals in the sample (100% success).</li>
</ul>
<p>From these runs we saw:</p>
<ul>
<li>When CPU is throttled, the Go garbage collector is effectively starved, leading to heap accumulation and large run-to-run variance.</li>
<li>When CPU is always available, garbage collection keeps pace with allocation, resulting in lower and more predictable memory usage.</li>
</ul>
<p><em>Takeaway:</em> for this set of tests, <code>cpu_idle: false</code> was the most stable baseline configuration. Request-based CPU throttling introduced artificial instability that makes capacity planning much harder.</p>
<h3>Go memory limit: <code>GOMEMLIMIT</code> in constrained containers</h3>
<p>Cloud Run enforces a hard memory limit at the container level. If the process exceeds it, the instance is OOM-killed.</p>
<p>We tested Cloud Run with:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Container memory</td>
<td align="center">512 MiB</td>
</tr>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">20</td>
</tr>
<tr>
<td><code>GOGC</code></td>
<td align="center">Default (unset)</td>
</tr>
<tr>
<td><code>cpu_idle</code></td>
<td align="center"><code>false</code></td>
</tr>
</tbody>
</table>
<p>The tests compared:</p>
<ul>
<li>No <code>GOMEMLIMIT</code> (Go relies on OS pressure).</li>
<li><code>GOMEMLIMIT=460MiB</code> (or 90% of container memory).</li>
</ul>
<p>The results were clear:</p>
<table>
<thead>
<tr>
<th><code>GOMEMLIMIT</code></th>
<th>Outcome</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Unset</td>
<td>Unstable; repeated OOM kills</td>
<td>Service never produced stable results</td>
</tr>
<tr>
<td><code>460MiB</code></td>
<td>Stable; runs completed</td>
<td>Worst-case peak RSS reached ~505 MB, but the process stayed within container limits</td>
</tr>
</tbody>
</table>
<p><em>Takeaway:</em> in a memory-constrained environment like Cloud Run, setting <code>GOMEMLIMIT</code> close to (but below) the container limit is essential for predictable behavior under load.</p>
<h3>GOGC: memory savings vs. reliability</h3>
<p>The <code>GOGC</code> parameter controls how much the heap can grow (in %) between GC cycles:</p>
<ul>
<li>Lower values (e.g., <code>GOGC=50</code>): more frequent collections, lower memory, higher CPU.</li>
<li>Higher values (e.g., <code>GOGC=100</code>): fewer collections, higher memory, lower CPU.</li>
</ul>
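<p>As a simplified model (the real Go pacer also accounts for <code>GOMEMLIMIT</code> and non-heap memory), the next collection is triggered roughly when the heap grows <code>GOGC</code> percent beyond the live heap left after the last GC:</p>
<pre><code class="language-go">package main

import "fmt"

// nextGCTarget approximates Go's pacing rule: a collection is triggered
// once the heap grows GOGC percent beyond the live heap.
func nextGCTarget(liveMB float64, gogc int) float64 {
	return liveMB * (1 + float64(gogc)/100)
}

func main() {
	for _, gogc := range []int{50, 75, 100} {
		fmt.Printf("GOGC=%d: 100 MB live -&gt; next GC at ~%.0f MB\n",
			gogc, nextGCTarget(100, gogc))
	}
	// GOGC=50 collects at ~150 MB, GOGC=100 at ~200 MB: lower GOGC means
	// the collector runs more often, spending CPU to keep the heap smaller.
}
</code></pre>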
<p>The tests covered: (1) <code>GOGC=50</code> (aggressive); (2) <code>GOGC=75</code> (moderate); (3) <code>GOGC=100</code> (default/unset).</p>
<p>Setup:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Container memory</td>
<td align="center">4  GiB (high enough to remove OOM as a factor)</td>
</tr>
<tr>
<td>vCPU</td>
<td align="center">1</td>
</tr>
<tr>
<td>Concurrency</td>
<td align="center">10 (safe level)</td>
</tr>
<tr>
<td><code>GOMEMLIMIT</code></td>
<td align="center">90% of memory</td>
</tr>
<tr>
<td><code>cpu_idle</code></td>
<td align="center"><code>false</code></td>
</tr>
</tbody>
</table>
<h4>What we observed</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/gogc.png" alt="GOGC" /></p>
<p>From the runs:</p>
<table>
<thead>
<tr>
<th align="center"><code>GOGC</code></th>
<th align="center">Peak RSS (sample)</th>
<th>CPU behavior</th>
<th align="center">Failure rate</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">50</td>
<td align="center">~267 MB</td>
<td>Very high; often saturating</td>
<td align="center">30%</td>
<td>GC consumed cycles needed for ingestion</td>
</tr>
<tr>
<td align="center">75</td>
<td align="center">~454 MB</td>
<td>~83.5% avg</td>
<td align="center">10%</td>
<td>GC consumed cycles needed for ingestion</td>
</tr>
<tr>
<td align="center">100 (default)</td>
<td align="center">~472 MB</td>
<td>~83.5% avg; leaves headroom for bursts</td>
<td align="center">0%</td>
<td></td>
</tr>
</tbody>
</table>
<p>The conclusion from these runs is clear: pushing <code>GOGC</code> down trades memory for reliability, and the trade is not favorable for ECF.</p>
<p><em>Takeaway:</em> for this workload, the default <code>GOGC=100</code> provided the best balance. Attempts to optimize memory by lowering <code>GOGC</code> directly reduced reliability.</p>
<h2>Step 2: Find capacity and breaking points</h2>
<p>With the runtime stabilized, we evaluated how much traffic a single instance can sustain by increasing concurrency until failures emerged.</p>
<p><em>How to read the tables:</em> each concurrency level was tested across 20 runs covering both light (240 logs per file, around 362KB file size) and heavy inputs (over 6k logs per file, around 8MB file size). Tables report baseline RSS from light workloads and peak values from the worst-case run.</p>
<h3>Concurrency 5: Stable baseline</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_5.png" alt="Concurrency 5" /></p>
<p>At concurrency 5, the service was solid.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">99.89 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">211.02 MB</td>
<td align="right">86.43%</td>
<td align="left">No</td>
</tr>
</tbody>
</table>
<p>This proved that a single instance handles a moderate load comfortably, with memory usage staying well within safe limits.</p>
<h3>Concurrency 10: Safe but volatile</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_10.png" alt="Concurrency 10" /></p>
<p>At concurrency 10, the system remained functional but with significant volatility.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">100.33 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">424.80 MB</td>
<td align="right">94.10%</td>
<td align="left">No (in sample)</td>
</tr>
</tbody>
</table>
<p>We also noticed that memory usage shows extreme variance:</p>
<ul>
<li>Best run RSS: 178 MB.</li>
<li>Worst run RSS: 425 MB.</li>
</ul>
<p>This behavior comes mainly from two effects:</p>
<ul>
<li>Bursty Pub/Sub delivery: 10 heavy requests may land at nearly the same instant.</li>
<li>The use of <code>io.ReadAll</code> inside the collector: each request reads the entire log file into memory.</li>
</ul>
<p>When all 10 requests arrived concurrently, we were effectively stacking ~10× the file size in RAM before the GC could clean up. When they were slightly staggered, the GC had time to reclaim memory between requests, leading to much lower peaks.</p>
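<p>The stacking effect can be sketched as follows; the hypothetical <code>simulate</code> function stands in for the collector's receive path, which buffers each payload whole with <code>io.ReadAll</code>:</p>
<pre><code class="language-go">package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// simulate stands in for the collector's receive path: io.ReadAll buffers
// the entire object in memory before any processing happens.
func simulate(payload io.Reader) int {
	data, err := io.ReadAll(payload)
	if err != nil {
		panic(err)
	}
	return len(data)
}

func main() {
	const concurrency = 10
	const fileSize = 8 &lt;&lt; 20 // ~8 MB heavy file from the test matrix

	var wg sync.WaitGroup
	sizes := make(chan int, concurrency)
	for i := 0; i &lt; concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each aligned request holds its full buffer simultaneously.
			sizes &lt;- simulate(bytes.NewReader(make([]byte, fileSize)))
		}()
	}
	wg.Wait()
	close(sizes)

	total := 0
	for n := range sizes {
		total += n
	}
	// 10 aligned heavy requests hold ~80 MiB of payload buffers at once,
	// before any per-request processing overhead is counted.
	fmt.Println(total)
}
</code></pre>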
<p>This leads to a crucial sizing insight:</p>
<ul>
<li>Do not size the service using average memory (for example, ~260 MB).</li>
<li>Size it for the worst observed burst (~425 MB) to avoid OOM or GC stalls.</li>
</ul>
<p>In practice, you should set the memory limit to at least 512 MiB per instance at concurrency 10.</p>
<h3>Concurrency 20: Unstable, systemic load shedding</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_20.png" alt="Concurrency 20" /></p>
<p>At concurrency 20, the system consistently began shedding load.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">97.44 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">482.42 MB</td>
<td align="right">88.90%</td>
<td align="left">Yes (every run)</td>
</tr>
</tbody>
</table>
<p>Even though memory and CPU metrics don't look drastically worse than at concurrency 10, behavior changes qualitatively: the service begins to refuse requests consistently.</p>
<h3>Concurrency 40: Failure mode</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/concurrency_40.png" alt="Concurrency 40" /></p>
<p>At concurrency 40, the instance failed outright: memory and CPU were overwhelmed, and ingest reliability collapsed.</p>
<table>
<thead>
<tr>
<th align="left">Case</th>
<th align="right">Memory (RSS)</th>
<th align="right">CPU utilization</th>
<th align="left">Requests refused</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Baseline (lightest workload avg)</td>
<td align="right">100.20 MB</td>
<td align="right"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Worst run</td>
<td align="right">1234.28 MB</td>
<td align="right">96.57%</td>
<td align="left">Yes (all runs)</td>
</tr>
</tbody>
</table>
<h3>The breaking point: a 1 vCPU instance's realistic limits</h3>
<table>
<thead>
<tr>
<th align="center">Concurrency</th>
<th align="center">Peak RSS (MB)</th>
<th align="center">Stability</th>
<th align="center">Refusals?</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">5</td>
<td align="center">211.02</td>
<td align="center">Low variance</td>
<td align="center">No</td>
<td>Stable baseline</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">424.80</td>
<td align="center">High variance</td>
<td align="center">No</td>
<td>Safe but volatile</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">482.42</td>
<td align="center">High variance</td>
<td align="center">Yes (Frequent)</td>
<td>Unstable (sheds load)</td>
</tr>
<tr>
<td align="center">40</td>
<td align="center">1234.28</td>
<td align="center">Extreme variance</td>
<td align="center">Yes (Always)</td>
<td>Failure (memory explosion)</td>
</tr>
</tbody>
</table>
<p>Combined with the CPU data (94% peak at concurrency 10), this supports a practical rule: <strong>for this workload</strong> and architecture, 10 concurrent heavy requests per 1 vCPU instance is the realistic upper bound.</p>
<h2>Turning findings into concrete recommendations</h2>
<p>These experiments lead to clear, actionable recommendations for running the ECF OpenTelemetry Collector on Cloud Run as part of the broader EDOT Cloud Forwarder deployment.</p>
<p>Scope: these recommendations apply to the workload and harness we tested (light vs. heavy log files up to 8MB, and Pub/Sub burst delivery), using the tuned runtime settings listed below. If your log sizes, request burstiness, or pipeline shape differ significantly, validate these limits against your own traffic.</p>
<h3>Runtime and container configuration</h3>
<table>
<thead>
<tr>
<th>Area</th>
<th>Recommendation</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU allocation</td>
<td>Set <code>cpu_idle: false</code> (always-on CPU)</td>
<td>Avoids GC starvation, stabilizes memory variance, and eliminates request failures caused by long GC pauses</td>
</tr>
<tr>
<td>Go memory limit</td>
<td>Set <code>GOMEMLIMIT</code> to ~90% of container memory</td>
<td>Enforces a heap boundary aligned with the Cloud Run limit so that Go reacts before the OS, preventing OOM kills</td>
</tr>
<tr>
<td>Garbage collection</td>
<td>Keep <code>GOGC</code> at 100 (default)</td>
<td>Lower <code>GOGC</code> reduces memory at the cost of higher CPU usage and measurable failure rates</td>
</tr>
</tbody>
</table>
<h3>Capacity and per-instance limits</h3>
<p>For a 1 vCPU Cloud Run instance running the ECF OpenTelemetry collector with the tuned runtime:</p>
<table>
<thead>
<tr>
<th>Limit</th>
<th>Recommendation</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hard concurrency</td>
<td>Cap concurrency at 10 requests per instance</td>
<td>At concurrency 10, CPU already reaches ~94% in the worst run; higher concurrency drives instability (refusals, GC stalls)</td>
</tr>
<tr>
<td>Memory</td>
<td>Use at least 512 MiB per instance (for concurrency 10)</td>
<td>Worst-case observed RSS is ~425 MB; 512 MiB provides a narrow but workable safety margin against burst alignment</td>
</tr>
</tbody>
</table>
<h3>Scaling strategy: horizontal, not vertical</h3>
<ul>
<li>Vertical scaling (increasing concurrency per instance) quickly runs into CPU and memory limits for this workload.</li>
<li>Horizontal scaling is a better fit: treat each instance as a worker with a hard limit of 10 concurrent heavy jobs.</li>
</ul>
<p>Practically:</p>
<ul>
<li>Configure the service so that no instance exceeds 10 concurrent requests.</li>
<li>Let autoscaling handle an increased load by adding instances, not by increasing per-instance concurrency.</li>
</ul>
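<p>Cloud Run enforces this cap itself through the service's concurrency setting. Purely as an illustration of the policy (all names here are hypothetical), the same behavior can be sketched in-process as a semaphore-guarded handler:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withConcurrencyCap mirrors the per-instance concurrency limit: requests
// beyond the cap are refused immediately instead of queuing until memory
// is exhausted. Pub/Sub push retries refused deliveries later.
func withConcurrencyCap(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem &lt;- struct{}{}:
			defer func() { &lt;-sem }()
			next.ServeHTTP(w, r)
		default:
			w.WriteHeader(http.StatusTooManyRequests) // shed load
		}
	})
}

func main() {
	process := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // stand-in for fetching and parsing the log file
	})
	srv := httptest.NewServer(withConcurrencyCap(10, process))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode)
}
</code></pre>
<p>Refusing early keeps each worker inside its measured 10-request envelope and pushes extra load onto the autoscaler, which adds instances instead.</p>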
<h2>Takeaways</h2>
<ul>
<li>Tuned runtime settings matter as much as raw resources: a single flag like <code>cpu_idle</code> can be the difference between predictable behavior and GC-driven chaos.</li>
<li>Go needs explicit limits in containers: <code>GOMEMLIMIT</code> must be set in memory-constrained environments; otherwise, OOM kills are inevitable under heavy ingestion.</li>
<li>&quot;Lower memory&quot; is not always better: aggressive GC tuning (<code>GOGC</code> &lt; 100) did reduce memory usage but directly increased failure rates.</li>
<li>Concurrency 10 is the realistic ceiling for a 1 vCPU ECF instance; beyond that, refusals and instability become the norm.</li>
<li>Horizontal scaling is the right model: each instance should be treated as a 10-request worker, with higher total throughput coming from more workers rather than more concurrency per worker.</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/edot-cloud-forwarder-gcp-load-testing/cover-gcp.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using the Elastic Agent to monitor Amazon ECS and AWS Fargate with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-monitor-ecs-aws-fargate-observability</link>
            <guid isPermaLink="false">elastic-agent-monitor-ecs-aws-fargate-observability</guid>
            <pubDate>Thu, 15 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we’ll guide you through how to install the Elastic Agent with the AWS Fargate integration as a sidecar container to send host metrics and logs to Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h2>Serverless and AWS ECS Fargate</h2>
<p>AWS Fargate is a serverless, pay-as-you-go compute engine for Amazon Elastic Container Service (ECS) that runs Docker containers without requiring you to manage servers or clusters. With Fargate, you containerize your application and specify the OS, CPU and memory, networking, and IAM policies needed for launch. Additionally, AWS Fargate can be used with Elastic Kubernetes Service (EKS) in a <a href="https://docs.aws.amazon.com/eks/latest/userguide/fargate.html">similar manner</a>.</p>
<p>Although the provisioning of servers would be handled by a third party, the need to understand the health and performance of containers within your serverless environment becomes even more vital in identifying root causes and system interruptions. Serverless still requires observability. Elastic Observability can provide observability for not only AWS ECS with Fargate, as we will discuss in this blog, but also for a number of AWS services (EC2, RDS, ELB, etc). See our <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">previous blog</a> on managing an EC2-based application with Elastic Observability.</p>
<h2>Gaining full visibility with Elastic Observability</h2>
<p>Elastic Observability is built on the three pillars of full system visibility: logs, metrics, and traces. Logs record all the events that have taken place in the system. Metrics track signals of system health, such as response time, CPU usage, memory usage, and latency. Traces show how your system performs as it executes requests.</p>
<p>Each pillar on its own offers some insight, but combining them lets you see the full scope of your system and how it handles increases in load or traffic over time. Connecting Elastic Observability to your serverless environment helps you resolve outages more quickly and perform root cause analysis to prevent future problems.</p>
<p>In this article, we’ll guide you through how to install the Elastic Agent with the <a href="https://docs.elastic.co/integrations/awsfargate">AWS Fargate</a> integration as a sidecar container to send host metrics and logs to Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-16_at_12.58.05_PM.png" alt="" /></p>
<h2>Prerequisites:</h2>
<ul>
<li>AWS account with AWS CLI configured</li>
<li>GitHub account</li>
<li>Elastic Cloud account</li>
<li>An app running on a container in AWS</li>
</ul>
<p>This tutorial is divided into two parts:</p>
<ol>
<li>Set up the Fleet server to be used by the sidecar container in AWS.</li>
<li>Create the sidecar container in AWS Fargate to send data back to Elastic Observability.</li>
</ol>
<h2>Part I: Set up the Fleet server</h2>
<p>First, let’s log in to Elastic Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image4.png" alt="" /></p>
<p>You can either create a new deployment or use an existing one.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image35.png" alt="" /></p>
<p>From the <strong>Home</strong> page, use the side panel to scroll to Management &gt; Fleet &gt; Agent policies. Click <strong>Add policy</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image30.png" alt="" /></p>
<p>Click <strong>Create agent policy</strong>. Here we’ll create a policy to attach to the Fleet agent.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image38.png" alt="" /></p>
<p>Give the policy a name and save changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image44.png" alt="" /></p>
<p>Click <strong>Create agent policy</strong>. You should see the agent policy AWS Fargate in the list of policies.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image42.png" alt="" /></p>
<p>Now that we have an agent policy, let’s add the integration to collect logs and metrics from the host. Click on <strong>AWS Fargate -&gt; Add integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image19.png" alt="" /></p>
<p>We’ll be adding to the policy AWS to collect overall AWS metrics and AWS Fargate to collect metrics from this integration. You can find each one by typing them in the search bar.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image34.png" alt="" /></p>
<p>Once you click on the integration, it will take you to its landing page, where you can add it to the policy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image48.png" alt="" /></p>
<p>For the AWS integration, the only collection settings that we will configure are Collect billing metrics, Collect logs from CloudWatch, Collect metrics from CloudWatch, Collect ECS metrics, and Collect Usage metrics. Everything else can be left disabled.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-15_at_11.35.28_AM.png" alt="" /></p>
<p>Another thing to keep in mind when using this integration is the set of permissions required to collect data from AWS. This can be found on the AWS integration page under AWS permissions. Take note of these permissions, as we will use them to create an IAM policy.</p>
<p>Next, we will add the AWS Fargate integration, which doesn’t require further configuration settings.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image37.png" alt="" /></p>
<p>Now that we have created the agent policy and attached the proper integrations, let’s create the agent that will implement the policy. Navigate back to the main Fleet page and click <strong>Add agent</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image41.png" alt="" /></p>
<p>Since we’ll be connecting to AWS Fargate through ECS, the host type should be set to this value. All the other default values can stay the same.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image15.png" alt="" /></p>
<p>Lastly, let’s create the enrollment token and attach the agent policy. This will enable AWS ECS Fargate to access Elastic and send data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image6.png" alt="" /></p>
<p>Once created, you should see the policy name, secret, and agent policy listed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image43.png" alt="" /></p>
<p>We’ll be using our Fleet credentials in the next step to send data to Elastic from AWS Fargate.</p>
<h2>Part II: Send data to Elastic Observability</h2>
<p>It’s time to create our ECS Cluster, Service, and task definition in order to start running the container.</p>
<p>Log in to your AWS account and navigate to ECS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image46.png" alt="" /></p>
<p>We’ll start by creating the cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image9.png" alt="" /></p>
<p>Give the cluster a name. For subnets, select only the first two, us-east-1a and us-east-1b.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image10.png" alt="" /></p>
<p>For the sake of the demo, we’ll keep the rest of the options set to default. Click <strong>Create</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image11.png" alt="" /></p>
<p>We should see the cluster we created listed below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/Screenshot_2023-06-15_at_11.15.51_AM.png" alt="" /></p>
<p>Now that we’ve created our cluster to host our container, we want to create a task definition that will be used to set up our container. But before we do this, we will need to create a task role with an associated policy. This task role will allow for AWS metrics to be sent from AWS to the Elastic Agent.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image47.png" alt="" /></p>
<p>Navigate to IAM in AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image32.png" alt="" /></p>
<p>Go to <strong>Policies -&gt; Create policy</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image31.png" alt="" /></p>
<p>Now we will reference the AWS permissions from the Fleet AWS integration page and use them to configure the policy. In addition to these permissions, we will also add the GetAuthorizationToken action for ECR.</p>
<p>You can configure each one using the visual editor.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image22.png" alt="" /></p>
<p>Or, use the JSON option. Don’t forget to replace the &lt;account_id&gt; with your own.</p>
<pre><code class="language-json">{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;VisualEditor0&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [
        &quot;sqs:DeleteMessage&quot;,
        &quot;sqs:ChangeMessageVisibility&quot;,
        &quot;sqs:ReceiveMessage&quot;,
        &quot;ecr:GetDownloadUrlForLayer&quot;,
        &quot;ecr:UploadLayerPart&quot;,
        &quot;ecr:PutImage&quot;,
        &quot;sts:AssumeRole&quot;,
        &quot;rds:ListTagsForResource&quot;,
        &quot;ecr:BatchGetImage&quot;,
        &quot;ecr:CompleteLayerUpload&quot;,
        &quot;rds:DescribeDBInstances&quot;,
        &quot;logs:FilterLogEvents&quot;,
        &quot;ecr:InitiateLayerUpload&quot;,
        &quot;ecr:BatchCheckLayerAvailability&quot;
      ],
      &quot;Resource&quot;: [
        &quot;arn:aws:iam::&lt;account_id&gt;:role/*&quot;,
        &quot;arn:aws:logs:*:&lt;account_id&gt;:log-group:*&quot;,
        &quot;arn:aws:sqs:*:&lt;account_id&gt;:*&quot;,
        &quot;arn:aws:ecr:*:&lt;account_id&gt;:repository/*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:target-group:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:subgrp:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:pg:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:ri:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-snapshot:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cev:*/*/*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:og:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:es:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db-proxy-endpoint:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:secgrp:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-pg:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:cluster-endpoint:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:db-proxy:*&quot;,
        &quot;arn:aws:rds:*:&lt;account_id&gt;:snapshot:*&quot;
      ]
    },
    {
      &quot;Sid&quot;: &quot;VisualEditor1&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [
        &quot;sqs:ListQueues&quot;,
        &quot;organizations:ListAccounts&quot;,
        &quot;ec2:DescribeInstances&quot;,
        &quot;tag:GetResources&quot;,
        &quot;cloudwatch:GetMetricData&quot;,
        &quot;ec2:DescribeRegions&quot;,
        &quot;iam:ListAccountAliases&quot;,
        &quot;sns:ListTopics&quot;,
        &quot;sts:GetCallerIdentity&quot;,
        &quot;cloudwatch:ListMetrics&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;
    },
    {
      &quot;Sid&quot;: &quot;VisualEditor2&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: &quot;ecr:GetAuthorizationToken&quot;,
      &quot;Resource&quot;: &quot;*&quot;
    }
  ]
}
</code></pre>
<p>Review your changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image3.png" alt="" /></p>
<p>Now let’s attach this policy to a role. Navigate to <strong>IAM -&gt; Roles</strong>. Click <strong>Create role</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image45.png" alt="" /></p>
<p>Select AWS service as Trusted entity type and select EC2 as Use case. Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image24.png" alt="" /></p>
<p>Under permissions policies, select the policy we just created, as well as CloudWatchLogsFullAccess and AmazonEC2ContainerRegistryFullAccess. Click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image27.png" alt="" /></p>
<p>Give the task role a name and description.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image39.png" alt="" /></p>
<p>Click <strong>Create role</strong>.</p>
<p>Now it’s time to create the task definition. Navigate to <strong>ECS -&gt; Task definitions</strong>. Click <strong>Create new task definition</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image21.png" alt="" /></p>
<p>Let’s give this task definition a name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image14.png" alt="" /></p>
<p>After giving the task definition a name, you’ll add the Fleet credentials to the container section; you can obtain these from Enrollment Tokens under Fleet in Elastic Cloud. This allows us to host the Elastic Agent on the ECS container as a sidecar and send data to Elastic using Fleet credentials.</p>
<ul>
<li>
<p>Container name: <strong>elastic-agent-container</strong></p>
</li>
<li>
<p>Image: <strong>docker.elastic.co/beats/elastic-agent:8.19.12</strong></p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image40.png" alt="" /></p>
<p>Now let’s add the environment variables:</p>
<ul>
<li>
<p>FLEET_ENROLL: <strong>yes</strong></p>
</li>
<li>
<p>FLEET_ENROLLMENT_TOKEN: <strong>&lt;enrollment-token&gt;</strong></p>
</li>
<li>
<p>FLEET_URL: <strong>&lt;fleet-server-url&gt;</strong></p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image26.png" alt="" /></p>
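<p>Taken together, the container settings and environment variables above correspond to a task definition fragment like the following. This JSON is an illustrative sketch: the family name is an assumption, the placeholders must be replaced with your own Fleet values, and a real task definition will also include CPU, memory, and role settings.</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;elastic-agent-fargate&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;FARGATE&quot;],
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;elastic-agent-container&quot;,
      &quot;image&quot;: &quot;docker.elastic.co/beats/elastic-agent:8.19.12&quot;,
      &quot;environment&quot;: [
        { &quot;name&quot;: &quot;FLEET_ENROLL&quot;, &quot;value&quot;: &quot;yes&quot; },
        { &quot;name&quot;: &quot;FLEET_ENROLLMENT_TOKEN&quot;, &quot;value&quot;: &quot;&lt;enrollment-token&gt;&quot; },
        { &quot;name&quot;: &quot;FLEET_URL&quot;, &quot;value&quot;: &quot;&lt;fleet-server-url&gt;&quot; }
      ]
    }
  ]
}
</code></pre>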
<p>For the sake of the demo, leave Environment, Monitoring, Storage, and Tags as their default values. Now we need to create a second container to run the image for the Go app stored in ECR. Click <strong>Add more containers</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image5.png" alt="" /></p>
<p>For Environment, we will reserve 1 vCPU and 3 GB of memory. Under Task role, search for the role we created that uses the IAM policy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image7.png" alt="" /></p>
<p>Review the changes, then click <strong>Create</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image25.png" alt="" /></p>
<p>You should see your new task definition included in the list.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image20.png" alt="" /></p>
<p>The final step is to create the service that will connect directly to the Fleet Server.<br />
Navigate to the cluster you created and click <strong>Create</strong> under the Service tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image18.png" alt="" /></p>
<p>Let’s get our service environment configured.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image28.png" alt="" /></p>
<p>Set up the deployment configuration. Here you should provide the name of the task definition you created in the previous step. Also, provide the service with a unique name. Set the number of <strong>desired tasks</strong> to 2 instead of 1.</p>
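<p>For reference, the same service can be sketched with the AWS CLI. The cluster, service, task definition, subnet, and security group names below are placeholders; adjust them to the resources you created above.</p>
<pre><code class="language-bash"># Create the ECS service on Fargate with two tasks (names and IDs are placeholders)
aws ecs create-service \
  --cluster my-fargate-cluster \
  --service-name elastic-agent-service \
  --task-definition elastic-agent-fargate \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration &quot;awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}&quot;
</code></pre>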
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image16.png" alt="" /></p>
<p>Click <strong>Create</strong>. Now your service is running two tasks in your cluster using the task definition you provided.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image33.png" alt="" /></p>
<p>To recap, we set up a Fleet server in Elastic Cloud to receive AWS Fargate data. We then created our AWS Fargate cluster task definition with the Fleet credentials implemented within the container. Lastly, we created the service to send data about our host to Elastic.</p>
<p>Now let’s verify our Elastic Agent is healthy and properly receiving data from AWS Fargate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image36.png" alt="" /></p>
<p>We can also view a better breakdown of our agent on the Observability Overview page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image2.png" alt="" /></p>
<p>If we drill down to hosts, by clicking on host name we should be able to see more granular data. For instance, we can see the CPU Usage of the Elastic Agent itself that is deployed in our AWS Fargate environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image8.png" alt="" /></p>
<p>Lastly, we can view the AWS Fargate dashboard generated using the data collected by our Elastic Agent. This is an out-of-the-box dashboard that can also be customized based on the data you would like to visualize.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/image23.png" alt="" /></p>
<p>As you can see in the dashboard, we can filter by running tasks and see a list of the containers running in our environment. The dashboard also surfaces per-cluster CPU usage under CPU Utilization per Cluster.</p>
<p>The dashboard can pull data from different sources and in this case shows data for both AWS Fargate and the greater ECS cluster. The two containers at the bottom display the CPU and memory usage directly from ECS.</p>
<h2>Conclusion</h2>
<p>In this article, we showed how to send data from AWS Fargate to Elastic Observability using the Elastic Agent and Fleet. Serverless architectures are quickly becoming an industry standard for offloading the management of servers to third parties. However, this does not alleviate the responsibility of operations engineers to manage the data generated within these environments. Elastic Observability provides a way to not only ingest the data from serverless architectures, but also establish a roadmap to address future problems.</p>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><strong>More resources on serverless and observability and AWS:</strong></p>
<ul>
<li><a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">Analyze your AWS application’s service metrics on Elastic Observability (EC2, ELB, RDS, and NAT)</a></li>
<li><a href="https://www.elastic.co/blog/observability-apm-aws-lambda-serverless-functions">Get visibility into AWS Lambda serverless functions with Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/trace-based-testing-elastic-apm-tracetest">Trace-based testing with Elastic APM and Tracetest</a></li>
<li><a href="https://www.elastic.co/blog/aws-kinesis-data-firehose-elastic-observability-analytics">Sending AWS logs into Elastic via AWS Firehose</a></li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-monitor-ecs-aws-fargate-observability/blog-thumb-observability-pattern-color.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Pivoting Elastic's Data Ingestion to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry</link>
            <guid isPermaLink="false">elastic-agent-pivot-opentelemetry</guid>
            <pubDate>Tue, 03 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has fully embraced OpenTelemetry as the backbone of its data ingestion strategy, aligning with the open-source community and contributing to make it the best data collection platform for a broad user base. This move benefits users by providing enhanced flexibility, efficiency, and control over telemetry data.]]></description>
            <content:encoded><![CDATA[<h1>Introduction</h1>
<p>Elastic has fully embraced OpenTelemetry as the backbone of its data ingestion strategy, aligning with the open-source community and contributing to make it the best data collection platform for a broad user base. This move benefits users by providing enhanced flexibility, efficiency, and control over telemetry data.</p>
<h1>Why OpenTelemetry?</h1>
<p>OpenTelemetry provides a powerful set of capabilities that make it a compelling choice for open-source-focused users. Elastic is re-architecting its data ingest tools around OpenTelemetry to offer users vendor-agnostic flexibility, performance optimization through OTel's efficient data model for correlating telemetry, and enhanced flexibility and control over data pipelines. This move brings the benefits of open-source telemetry to Elastic users.</p>
<p>Elastic engineers are active contributors to several areas of the OTel project. Demonstrating its commitment to open source, Elastic continues to make significant <a href="https://opentelemetry.devstats.cncf.io/d/5/companies-table?orgId=1%5C&amp;var-period_name=Last%20year&amp;var-metric=contributions">contributions to OpenTelemetry</a>.</p>
<h1>OpenTelemetry as the Core of Elastic's Data Ingestion</h1>
<p>Elastic is transforming its data ingestion strategy by basing all ingestion mechanisms on OpenTelemetry components. Elastic currently supports the following OTel-based ingest architecture, which supports OTel SDKs and Collectors from upstream OpenTelemetry or from the Elastic Distributions of OpenTelemetry (EDOT).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/edot-components.png" alt="EDOT components" /></p>
<p>This marks a fundamental shift, ensuring a more standardized and scalable telemetry pipeline. All the existing Elastic ingest components will become OTel based.</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>OpenTelemetry plan</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Beats</strong></td>
<td>Beats architecture will be based on OTel.</td>
</tr>
<tr>
<td><strong>Elastic Agent</strong></td>
<td>Agent architecture will be based on OTel to support both Beats-based inputs and OTel receivers.</td>
</tr>
<tr>
<td><strong>Integrations</strong></td>
<td>Integrations catalogue will additionally include OTel-based modules for ease of configuration.</td>
</tr>
<tr>
<td><strong>Fleet central management</strong></td>
<td>Fleet will support monitoring of Elastic OTel collectors.</td>
</tr>
</tbody>
</table>
<p>Let's discuss how each component of Elastic's data ingestion platform will be based on an OpenTelemetry collector whilst still providing the same functionality to the user.</p>
<h2>Beats</h2>
<p>Elastic's traditional data shippers will be re-architected as OpenTelemetry Collectors, aligning with OTel's extensibility model. The current Beat architecture is essentially a pipeline of a few stages, as shown in the diagram below: an Input, Processors for enrichment, Queuing of events, and an Output for batching and writing the data to a specific destination.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/filebeat.png" alt="filebeat" /></p>
<h2>Beatreceiver Concept</h2>
<p>To ensure a smooth transition without major disruptions, a &quot;Beat receiver&quot; concept is being implemented. These <code>beatreceivers</code> (like <code>filebeatreceiver</code> or <code>metricbeatreceiver</code>) act as dedicated Beat inputs integrated into the OpenTelemetry Collector as native receivers. They support all existing inputs and processors, guaranteeing that the final architecture accepts the user's current configuration and delivers the same functionality as today's Beats, all without introducing any breaking changes.</p>
<p>An OTel-based Beats architecture will see the Input phase embedded as an OTel receiver (e.g., <code>filebeatreceiver</code> representing the functionality of <code>filebeat</code>). This receiver will only be available as part of the Elastic Distributions of OpenTelemetry in support of our current user base; it is not functionality that will be available upstream.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/filebeatreceiver.png" alt="filebeat" /></p>
<p>All the remaining components of the pipeline will be based on OTel. The new Beat will accept the same filebeat configuration (as an example) and will transform it into an OTel-based configuration in order to avoid any deployment disruption. It should be noted that in this architecture the Beats will continue to support only ECS-formatted data. To keep the Beat functionality in line with what exists today, the Elasticsearch exporter (as an example) will output ECS-formatted data only.</p>
<p>The following diagram illustrates the <code>beatreceiver</code> concept by showing how a basic <code>filebeat</code> configuration is automatically translated into an OpenTelemetry-based configuration. This new configuration retains the original inputs and processors but leverages the native OpenTelemetry pipeline and exporter to achieve the same overall <code>filebeat</code> functionality. Existing <code>filebeat</code> configurations will be automatically converted, eliminating the need for manual adjustments or introducing breaking changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-otel-config.png" alt="Filebeat OTel config" /></p>
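<p>To make the idea concrete, a translated configuration could look roughly like the YAML sketch below. This is illustrative only: the <code>filebeatreceiver</code> and <code>elasticsearch</code> component names follow the diagram, but the exact keys, paths, and endpoint are assumptions, not a finalized format.</p>
<pre><code class="language-yaml">receivers:
  filebeatreceiver:
    filebeat:
      inputs:
        - type: filestream
          id: app-logs
          paths:
            - /var/log/app/*.log
    processors:
      - add_host_metadata: ~

exporters:
  elasticsearch:
    endpoints: [&quot;https://localhost:9200&quot;]

service:
  pipelines:
    logs:
      receivers: [filebeatreceiver]
      exporters: [elasticsearch]
</code></pre>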
<h2>Elastic Agent</h2>
<p>Elastic Agent is a unified agent for data collection, security, and observability. It can also be deployed in an OpenTelemetry-only mode, enabling native OTel workflows. Elastic Agent is a supervisor that manages multiple Beats as sub-processes in order to provide a more comprehensive data collection tool. It is capable of translating the Agent Policy received from Fleet into configuration acceptable to the various sub-processes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-architecture.png" alt="Elastic Agent Architecture" /></p>
<p>Expanding on the Beat receiver concept described above, the Elastic Agent, which can currently be deployed as an OTel collector (see <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">blog</a>), will also be modified to a much simpler OTel-based architecture built on these receivers. As shown below, this architecture will streamline the components within the Elastic Agent and remove duplicated functionality such as queuing and output. Whilst supporting the current functionality, these changes will reduce the agent footprint and also reduce the number of connections opened to pipeline elements downstream of the agent (such as Elasticsearch clusters, Logstash, or Kafka brokers).</p>
<p>By moving to an OTel-based architecture, Elastic Agent is able to operate as a truly hybrid agent: it provides not only the Beats functionality but also allows our users to create OTel-native pipelines and take advantage of the plethora of functionality available as part of the open-source project.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/elastic-agent-otel-architecture.png" alt="Elastic Agent OTel Architecture" /></p>
<p>Elastic's commitment to OpenTelemetry will deepen through increased contributions, resulting in OpenTelemetry receivers gradually superseding Beats receiver features. This evolution will eventually reduce the need for a distinct Beats receiver within the Elastic Agent architecture. The envisioned architecture will empower the Elastic Agent to transmit data in OTLP format as well, granting users the flexibility to select any OTLP-compatible backend, thereby upholding the principle of vendor neutrality.</p>
<h2>Fleet &amp; Integrations: Managing OpenTelemetry at Scale</h2>
<p>Elastic's centralized management system will support OpenTelemetry-based configurations, making large-scale deployments easier to manage. Managing thousands of telemetry agents at scale presents a significant challenge. Elastic's <strong>Fleet &amp; Integrations</strong> simplify this process by providing robust lifecycle management for these new OpenTelemetry-based Elastic agents.</p>
<p><strong>Key Capabilities Offered:</strong></p>
<ul>
<li>
<p><strong>Scalability:</strong> Manage up to 100K+ agents across distributed environments.</p>
</li>
<li>
<p><strong>Automated Upgrades:</strong> Staged rollouts and automatic upgrades ensure minimal downtime.</p>
</li>
<li>
<p><strong>Monitoring &amp; Diagnostics:</strong> Real-time status updates, failure detection, and diagnostic downloads improve system reliability.</p>
</li>
<li>
<p><strong>Policy-Based Configuration Management:</strong> Enables centralized control over agent configurations, improving consistency across deployments.</p>
</li>
<li>
<p><strong>Pre-Built Integrations:</strong> Elastic offers a catalog of <strong>470+ pre-built integrations</strong>, allowing users to ingest data seamlessly from various sources. These will also include OTel based packages making configuration much more efficient across a large deployment.</p>
</li>
</ul>
<p>The goal is for Fleet to also provide monitoring capabilities for native OTel collectors in a vendor-agnostic fashion.</p>
<h1>Conclusion</h1>
<p>Elastic's adoption of OpenTelemetry marks a significant milestone in the evolution of open-source observability. By standardizing on OpenTelemetry, Elastic is ensuring that its data ingestion strategy remains <strong>open, scalable, and future-proof</strong>.</p>
<p>For open-source users, this shift means:</p>
<ul>
<li>
<p>Greater interoperability across observability tools.</p>
</li>
<li>
<p>Enhanced flexibility in choosing telemetry backends.</p>
</li>
<li>
<p>A stronger commitment to <strong>community-driven</strong> observability standards.</p>
</li>
<li>
<p>Existing Beats and Elastic Agent users can <strong>seamlessly adopt OpenTelemetry</strong> without rearchitecting their pipelines.</p>
</li>
<li>
<p>OpenTelemetry users can <strong>integrate with Elastic's observability stack</strong> without additional complexity.</p>
</li>
</ul>
<p>Stay tuned for more updates as Elastic continues to expand its OpenTelemetry-based data collection capabilities! In the meantime, here are some other references:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">Elastic Distributions of OpenTelemetry (EDOT) Now GA</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery on Kubernetes now supported with EDOT Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/introducing-the-ottl-playground-for-opentelemetry">Introducing the OTTL Playground for OpenTelemetry</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-pivot-opentelemetry/self-service-blog-image-templates.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows</link>
            <guid isPermaLink="false">elastic-agent-skills-observability-workflows</guid>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Agent Skills for Elastic Observability help SREs and developers run observability workflows through natural language to instrument apps with OpenTelemetry, search logs, manage SLOs, understand service health, and help with LLM observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a wide set of capabilities, from configuring OpenTelemetry instrumentation, writing ES|QL queries to search logs and metrics, defining SLOs with the correct indicator types and equation syntax, triaging noisy alert storms, and stitching together service health from multiple signals. SREs are now looking to autmoate further with AI Agents.</p>
<p>Elastic's Agent skills are open source packages that give your AI coding agent native Elastic expertise. If you're already using Elastic Agent Builder, you get AI agents that work natively with your Observability data. The <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a> deliver native platform expertise directly to your AI coding agent, so you can stop debugging AI-generated errors and start shipping production-ready code with the full depth of Elastic.</p>
<p>Skills can be used for specialized tasks across the Elastic stack — Elasticsearch, Kibana, Elastic Security, Elastic Observability, and more. Each skill lives in its own folder with a SKILL.md file containing metadata and instructions the agent follows.</p>
<p>Observability is releasing five skills that together cover the core workflows SREs and developers perform daily. Each of these workflows requires domain expertise and familiarity with specific APIs, index patterns, and Kibana workflows. For teams managing dozens of services across multiple environments, this work is repetitive, error-prone, and time-consuming.</p>
<p>This article walks through the current Observability skill set, shows an end-to-end workflow, and highlights where these skills are useful in day-to-day operations.</p>
<h2>Why this matters for observability teams</h2>
<p>Modern observability work is usually ad hoc and cross-cutting. In one hour, you may instrument a new service, inspect logs for an incident, check error-budget status, and validate service health across several signals.</p>
<p>Each step often needs different APIs, index patterns, and Kibana workflows. Agent Skills package this task knowledge into reusable units so an agent can execute these steps consistently.</p>
<h2>The observability skills</h2>
<p>The observability set currently focuses on five connected workflows:</p>
<ol>
<li><strong>Instrument applications</strong> Adds the Elastic Distributions of OpenTelemetry to Python, Java, or .NET services (tracing, metrics, logs) or helps migrate from the classic Elastic APM agents to EDOT, with correct OTLP endpoints and configuration</li>
<li><strong>Search logs</strong> Provides visibility into Elastic Streams — the data routing and processing layer for observability data.</li>
<li><strong>Manage SLOs</strong> Creates and manages Service-Level Objectives in Elastic Observability via the Kibana API — from data exploration through SLO definition, creation, and lifecycle management.</li>
<li><strong>Assess service health</strong> Provides a unified view of service health by combining signals from APM, infrastructure metrics, logs, SLOs, and alerts into a single assessment.</li>
<li><strong>Observe LLM applications</strong> Monitors and troubleshoots LLM-powered applications — tracking token usage, latency, error rates, and model performance across inference calls.</li>
</ol>
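<p>As a concrete example of the first workflow, instrumenting a service with EDOT typically comes down to setting the standard OpenTelemetry environment variables before starting the app. The endpoint, API key, and service names below are placeholders; only the variable names come from the OpenTelemetry specification.</p>
<pre><code class="language-bash"># Placeholder OTLP endpoint and credentials for your Elastic deployment
export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://my-deployment.ingest.example.elastic.cloud:443&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;api-key&gt;&quot;
export OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=cart-service,deployment.environment=production&quot;

# Zero-code instrumentation entry point for Python (assuming EDOT Python is installed)
opentelemetry-instrument python app.py
</code></pre>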
<h2>What Agent Skills are</h2>
<p>Agent Skills are self-contained folders with instructions, scripts, and resources that an AI agent loads dynamically for a specific task. Elastic publishes official skills in <a href="https://github.com/elastic/agent-skills">elastic/agent-skills</a>, based on the <a href="https://agentskills.io/">Agent Skills standard</a>.</p>
<p>At a practical level, this means:</p>
<ul>
<li>You describe the goal.</li>
<li>The agent selects the relevant skill or you specify it.</li>
<li>The skill applies the consistent, Elastic-recommended steps and API patterns for that job.</li>
</ul>
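<p>Under the hood, each skill is just a folder whose core file is a <code>SKILL.md</code> containing metadata and instructions. A minimal, hypothetical sketch is shown below; the frontmatter fields follow the Agent Skills standard, but the content is illustrative, not an actual Elastic skill:</p>
<pre><code class="language-markdown">---
name: logs-search
description: Search and summarize logs in Elastic Observability
---

## Instructions

1. Identify the relevant data stream or index pattern.
2. Query the logs through the Elasticsearch or Kibana APIs.
3. Summarize errors and anomalies, citing the matching documents.
</code></pre>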
<h2>Practical example: from incident question to root-cause</h2>
<p>As an SRE, you're notified that a specific customer is experiencing errors. Support has been trying to troubleshoot but needs help, and provides a transaction ID to investigate.</p>
<p>You've loaded Elastic's Agent Skills to Claude. You ask Claude:</p>
<p><code>Find out why transaction with id 01ba6cf8e60253bdeb26026caa3278a1 is having issues over the last 24 hours.</code></p>
<p>With Elastic's Observability skills added, Claude analyzes the issue for that specific transaction in Elastic:</p>
<ol>
<li>It uses the logs-search skill to narrow down likely causes.</li>
<li>It identifies the root cause.</li>
<li>It recommends a potential remediation.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/Analyze-logs-for-transaction.png" alt="Claude Code interaction for log-search skill" /></p>
<h2>How to get started</h2>
<p>Install Elastic skills with the <code>skills</code> CLI:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills
</code></pre>
<p>Install a specific skill directly:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills --skill logs-search 
</code></pre>
<p>Then run your agent and give it an outcome-focused request, for example:</p>
<pre><code class="language-text">My cart service is experiencing some slowness, are there any errors over the last 3 hours? Please give me a summary of these logs.
</code></pre>
<p>The key shift is that the request is outcome-first. The skill captures implementation details such as API order, field expectations, and verification steps.</p>
<h2>What's next</h2>
<p>The planned scope includes broader workflow coverage. As skills mature, teams can combine them into repeatable operating patterns that still support ad hoc investigation.</p>
<p>If you want to try this model now, get <a href="https://github.com/elastic/agent-skills">Elastic's Agent Skills</a> and start with one service and one workflow:</p>
<ol>
<li>Assess service health.</li>
<li>Run guided log investigation for one real incident.</li>
<li>Add SLO management after baseline telemetry quality is in place.</li>
<li>Understand how well your LLM is performing for your developers.</li>
</ol>
<p>This gives you a concrete way to evaluate agent-assisted observability work without changing your full operating model in one step.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/header2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting started with the Elastic AI Assistant for Observability and Amazon Bedrock]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-amazon-bedrock</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-amazon-bedrock</guid>
            <pubDate>Fri, 03 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to get started with the Elastic AI Assistant for Observability and Amazon Bedrock.]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released version <a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">8.13, which includes the general availability of Amazon Bedrock integration for the Elastic AI Assistant for Observability</a>. This blog post will walk through the step-by-step process of setting up the Elastic AI Assistant with Amazon Bedrock. Then, we’ll show you how to add content to the AI Assistant’s knowledge base to demonstrate how the power of Elasticsearch combined with Amazon Bedrock can supercharge the answers Elastic AI Assistant provides so that they are uniquely specific to your needs.</p>
<p>Managing applications and the infrastructure they run on requires advanced observability into the diverse types of data involved like logs, traces, profiles, and metrics. General purpose generative AI large language models (LLMs) offer a new capability to provide human readable guidance to your observability questions. However, they have limitations. Specifically, when it comes to providing answers about your application’s distinct observability data like real-time metrics, the LLMs require additional context to provide answers that will help to actually resolve issues. This is a limitation that the Elastic AI Assistant for Observability can uniquely solve.</p>
<p>Elastic Observability, serving as a central datastore of all the observability data flowing from your application, combined with the Elastic AI Assistant gives you the ability to generate a context window that can inform an LLM’s responses and vastly improve the answers it provides. For example, when you ask the Elastic AI Assistant a question about a specific issue happening in your application, it gathers up all the relevant details — current errors captured from logs or a related runbook that your team has stored in the Elastic AI Assistant’s knowledge base. Then, it sends that information to the Amazon Bedrock LLM as a context window from which it can better answer your observability questions.</p>
<p>Read on to follow the steps for setting up the Elastic AI Assistant for yourself.</p>
<h2>Set up the Elastic AI Assistant for Observability: Create an Amazon Bedrock connector in Elastic Cloud</h2>
<p>Start by creating an Elastic Cloud 8.13 deployment via the <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">AWS marketplace</a>. If you’re a new user of Elastic Cloud, you can create a new deployment with a 7-day free trial.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/1.png" alt="1" /></p>
<p>Sign in to the Elastic Cloud deployment you’ve created. From the top level menu, select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/2.png" alt="2" /></p>
<p>Select <strong>Connectors</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/3.png" alt="3" /></p>
<p>Click the <strong>Create connector</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/4.png" alt="4" /></p>
<h2>Enable Amazon Bedrock model access</h2>
<p>For populating the required connector settings, enable Amazon Bedrock model access in the AWS console using the following steps.</p>
<p>In a new browser tab, open <a href="https://console.aws.amazon.com/bedrock/">Amazon Bedrock</a> and click the <strong>Get started</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/5.png" alt="5" /></p>
<p>Currently, access to the Amazon Bedrock foundation models is granted by requesting access using the Bedrock <strong>Model access</strong> section in the AWS console.</p>
<p>Select <strong>Model access</strong> from the navigation menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/6.png" alt="6" /></p>
<p>To request access, select the foundation models that you want to access and click the <strong>Save Changes</strong> button. For this blog post, we will choose the Anthropic Claude models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/7.png" alt="7" /></p>
<p>Once granted, the <strong>Manage model access</strong> settings will indicate that you have access.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/8.png" alt="8" /></p>
<h3>Create AWS IAM User</h3>
<p>Create an <a href="https://aws.amazon.com/iam/">IAM</a> user and assign it a role with <a href="https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonBedrockFullAccess.html">Amazon Bedrock full access</a> and also <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">generate an IAM access key and secret key</a> in the console. If you already have an IAM user with a generated access key and secret key, you can use the existing credentials to access Amazon Bedrock.</p>
<h3>Configure Elastic connector to use Amazon Bedrock</h3>
<p>Back in the Elastic Cloud deployment create connector flyout, select the connector for Amazon Bedrock.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/9.png" alt="9" /></p>
<p>Enter a <strong>Name</strong> of your choice for the connector. Also, enter the <strong>Access Key</strong> and <strong>Key Secret</strong> that you copied in a previous step. Click the <strong>Save &amp; test</strong> button to create the connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/10.png" alt="10" /></p>
<p>Within the <strong>Edit Connector</strong> flyout window, click the <strong>Run</strong> button to confirm that the connector configuration is valid and can successfully connect to your Amazon Bedrock instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/11.png" alt="11" /></p>
<p>You should see confirmation that the connector test was successful.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/12.png" alt="12" /></p>
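<p>If you script your deployments, the same connector can also be created through Kibana's connectors API instead of the UI. The snippet below only builds the request body as a sketch; the <code>config</code> and <code>secrets</code> field names for the Bedrock connector type (<code>apiUrl</code>, <code>defaultModel</code>, <code>accessKey</code>, <code>secret</code>) are assumptions that you should verify against the connectors API documentation for your stack version:</p>
<pre><code class="language-python">import json

# Illustrative request body for POST /api/actions/connector in Kibana.
# The config/secrets field names are assumptions; check the connectors
# API docs for your stack version before relying on them.
def bedrock_connector_payload(name, access_key, secret_key,
                              region="us-east-1",
                              model="anthropic.claude-3-sonnet-20240229-v1:0"):
    return {
        "name": name,
        "connector_type_id": ".bedrock",
        "config": {
            "apiUrl": "https://bedrock-runtime." + region + ".amazonaws.com",
            "defaultModel": model,
        },
        "secrets": {"accessKey": access_key, "secret": secret_key},
    }

payload = bedrock_connector_payload("my-bedrock", "AKIA-EXAMPLE", "SECRET-EXAMPLE")
print(json.dumps(payload, indent=2))
</code></pre>
<p>You would send this body to <code>/api/actions/connector</code> on your Kibana endpoint, along with a <code>kbn-xsrf</code> header and your credentials.</p>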
<h3>Add an example logs record</h3>
<p>Now that the connector is configured, let's add a logs record to demonstrate how the Elastic AI Assistant can help you to better understand the diverse types of information contained within logs.</p>
<p>Use the Elastic Dev Tools to add a single logs record. Click the top-level menu and select <strong>Dev Tools</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/13.png" alt="13" /></p>
<p>Within the console area of Dev Tools, enter the following POST statement:</p>
<pre><code class="language-json">POST /logs-elastic_agent-default/_doc
{
    &quot;message&quot;: &quot;Status(StatusCode=\&quot;BadGateway\&quot;, Detail=\&quot;Error: The server encountered a temporary error and could not complete your request\&quot;).&quot;,
    &quot;@timestamp&quot;: &quot;2024-04-21T10:33:00.884Z&quot;,
    &quot;log&quot;: {
        &quot;level&quot;: &quot;error&quot;
    },
    &quot;service&quot;: {
        &quot;name&quot;: &quot;proxyService&quot;
    },
    &quot;host&quot;: {
        &quot;name&quot;: &quot;appserver-2&quot;
    }
}
</code></pre>
<p>Then run the POST command by clicking the green <strong>Run</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/14.png" alt="14" /></p>
<p>You should see a 201 response confirming that the example logs record was successfully created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/15.png" alt="15" /></p>
<h3>Use the Elastic AI Assistant</h3>
<p>Now that you have a log entry, let’s use the AI Assistant to see how it interacts with logs data. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/16.png" alt="16" /></p>
<p>Select <strong>Logs Explorer</strong> under Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/17.png" alt="17" /></p>
<p>In the Logs Explorer search box, enter the text “badgateway” and press the <strong>Enter</strong> key to perform the search.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/18.png" alt="18" /></p>
<p>Click the <strong>View all matches</strong> button to include all search results.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/19.png" alt="19" /></p>
<p>You should see the one log record that you previously inserted via Dev Tools. Click the expand icon in the <strong>actions</strong> column to see the log record’s details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/20.png" alt="20" /></p>
<p>You should see the expanded view of the logs record. Let’s use the AI Assistant to summarize it. Click on the <strong>What's this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/21.png" alt="21" /></p>
<p>We get a fairly generic answer back. Depending on the exception or error we're trying to analyze, this can still be really useful, but we can improve this response by adding additional documentation to the AI Assistant knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/22.png" alt="22" /></p>
<p>Let’s add an entry in AI Assistant’s knowledge base to improve its understanding of this specific logs message.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/23.png" alt="23" /></p>
<p>Click the <strong>AI Assistant</strong> button at the top right of the window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/24.png" alt="24" /></p>
<p>Click the <strong>Install Knowledge base</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/25.png" alt="25" /></p>
<p>Click the top-level menu and select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/26.png" alt="26" /></p>
<p>Then select <strong>AI Assistants</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/27.png" alt="27" /></p>
<p>Click <strong>Elastic AI Assistant for Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/28.png" alt="28" /></p>
<p>Select the <strong>Knowledge base</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/29.png" alt="29" /></p>
<p>Click the <strong>New entry</strong> button and select <strong>Single entry</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/30.png" alt="30" /></p>
<p>Give it the <strong>Name</strong> “proxyservice” and enter the following text as the <strong>Contents</strong> :</p>
<pre><code class="language-markdown">
I have the following runbook located on GitHub. Store this information in your knowledge base and always include the link to the runbook in your response if the topic is related to a bad gateway error.

Runbook Link: https://github.com/elastic/observability-aiops/blob/main/ai_assistant/runbooks/slos/502-errors.md

Runbook Title: Handling 502 Bad Gateway Errors

Summary: This is likely an issue with Nginx proxy configuration

Body: This runbook provides instructions for diagnosing and resolving 502 Bad Gateway errors in your system.
</code></pre>
<p>Click <strong>Save</strong> to save the new knowledge base entry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/31.png" alt="31" /></p>
<p>Now let’s go back to the Observability Logs Explorer. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/32.png" alt="32" /></p>
<p>Then select <strong>Explorer</strong> under <strong>Logs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/33.png" alt="33" /></p>
<p>Expand the same logs entry as you did previously and click the <strong>What’s this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/34.png" alt="34" /></p>
<p>The response you get now should be much more relevant.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/35.png" alt="35" /></p>
<h2>Try out the Elastic AI Assistant with a knowledge base filled with your own data</h2>
<p>Now you’ve seen the complete process of connecting the Elastic AI Assistant to Amazon Bedrock. You’ve also seen how to use the AI Assistant’s knowledge base to store custom remediation documentation like runbooks that the AI Assistant can leverage to generate more helpful responses. Steps like this can help you remediate issues more quickly when they happen. Try out the Elastic AI Assistant with your own logs and custom knowledge base.</p>
<p>Start a <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-amazon-bedrock/AI_hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[The Elastic AI Assistant for Observability escapes Kibana!]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-escapes-kibana</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-escapes-kibana</guid>
            <pubDate>Mon, 08 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Harness the Elastic AI Assistant API to seamlessly blend Elastic's Observability capabilities into your daily workflow, from Slack to the command line, boosting efficiency and decision-making. Work smarter, not harder.]]></description>
            <content:encoded><![CDATA[<p><em>Note: The API described below is currently under development and undocumented, and thus it is not supported. Consider this a forward-looking blog. Features are not guaranteed to be released.</em></p>
<p>Elastic, time-saving assistants, generative models, APIs, Python, and the potential to show a new way of working with our technology? Of course, I would move this to the top of my project list!</p>
<p>If 2023 was the year of figuring out generative AI and retrieval augmented generation (RAG), then 2024 will be the year of productionalizing generative AI RAG applications. Companies are beginning to publish references and architectures, and businesses are integrating generative applications into their lines of business.</p>
<p>Elastic is following suit by integrating not one but two AI Assistants into Kibana: one in <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability</a> and one in <a href="https://www.elastic.co/guide/en/security/current/security-assistant.html">Security</a>. Today, we will be working with the former.</p>
<h2>The Elastic AI Assistant for Observability</h2>
<p>What is the Observability AI Assistant? Allow me to <a href="https://www.elastic.co/guide/en/security/current/security-assistant.html">quote the documentation</a>:</p>
<p><em>The AI Assistant uses generative AI to provide:</em></p>
<ul>
<li>
<p><strong>Contextual insights:</strong> <em>Open prompts throughout Observability that explain errors and messages and suggest remediation. This includes your own GitHub issues, runbooks, architectural images, etc. Essentially, anything internal that is useful for the SRE and stored in Elastic can be used to suggest resolution.</em> <a href="https://www.elastic.co/blog/sre-troubleshooting-ai-assistant-observability-runbooks"><em>Elastic AI Assistant for Observability uses RAG to get the most relevant internal information</em></a><em>.</em></p>
</li>
<li>
<p><strong>Chat:</strong> <em>Have conversations with the AI Assistant. Chat uses function calling to request, analyze, and visualize your data.</em></p>
</li>
</ul>
<p>In other words, it's a chatbot built into the Observability section of Kibana, allowing SREs and operations people to perform their work faster and more efficiently. In the theme of integrating generative AI into lines of business, these AI Assistants are integrated seamlessly into Kibana.</p>
<h2>Why “escape” Kibana?</h2>
<p>Kibana is a powerful tool, offering many functions and uses. The Observability section has rich UIs for logs, metrics, APM, and more. As much as I believe people in operations, SREs, and the like can get the majority of their work done in Kibana (given Elastic is collecting the relevant data), having worked in the real world, I know just about everyone has multiple tools they work with.</p>
<p>We want to integrate with people’s workflows as much as we want them to integrate with Elastic. As such, providing API access to the AI Assistants allows Elastic to meet you where you spend most of your time. Be it Slack, Teams, or any other app that can integrate with an API.</p>
<h2>API overview</h2>
<p>Enter the AI Assistant API. The API provides most of the functionality and efficiencies the AI Assistant brings in Kibana. Since the API handles most of the functionality, it’s like having a team of developers working to improve and develop new features for you.</p>
<p>The API provides access to ask questions in natural language via <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a> and a group of functions the large language model (LLM) can use to gather additional information from Elasticsearch, all out of the box.</p>
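<p>To make the shape of this concrete: a chat call boils down to posting a connector ID and a list of messages to a Kibana endpoint. Because the API is undocumented and unsupported (see the note at the top), the endpoint path and field names below are assumptions for illustration only, not a stable contract:</p>
<pre><code class="language-python">import json

# Hypothetical sketch of an AI Assistant chat request. The endpoint path
# and body schema are assumptions; the API is undocumented and may change.
KIBANA_URL = "https://my-deployment.kb.us-east-1.aws.elastic.cloud"  # hypothetical
CHAT_PATH = "/internal/observability_ai_assistant/chat/complete"     # assumed

def chat_request(question, connector_id):
    return {
        "url": KIBANA_URL + CHAT_PATH,
        "headers": {"kbn-xsrf": "true", "Content-Type": "application/json"},
        "body": {
            "connectorId": connector_id,
            "messages": [
                {"message": {"role": "user", "content": question}}
            ],
        },
    }

req = chat_request("Are there any active alerts?", "my-bedrock-connector")
print(json.dumps(req["body"], indent=2))
</code></pre>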
<h2>Command line</h2>
<p>Enough talk; let’s look at some examples!</p>
<p>The first example of using the AI Assistant outside of Kibana is on the command line. This command-line script allows you to ask questions and get responses. Essentially, the script uses the Elastic API to enable AI Assistant interactions on your CLI, outside of Kibana. Credit for this script goes to Almudena Sanz Olivé, senior software engineer on the Observability team, and of course to the rest of the development team for creating the assistant! NOTE: The AI Assistant API is not yet public, but Elastic is working on potentially releasing it. Stay tuned.</p>
<p>The script prints API information on a new line each time the LLM calls a function or Kibana runs a function to provide additional information about what is happening behind the scenes. The generated answer will also be written on a new line.</p>
<p>There are many ways to start a conversation with the AI Assistant. Let’s imagine I work for an ecommerce company and just checked some code into GitHub. I realize I need to check whether there are any active alerts that need to be worked on. Since I’m already on the command line, I can run the AI Assistant CLI and ask it to check for me.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/1.png" alt="Asking the AI Assistant to list all active alerts." /></p>
<p>There are nine active alerts. It's not the worst count I’ve seen by a long shot, but they should still be addressed. There are many ways to start here, but the one that caught my attention first was related to the SLO burn rate on the service-otel cart. This service handles our customers' checkout procedures.</p>
<p>I could ask the AI Assistant to investigate this more for me, but first, let me check if there are any runbooks our SRE team has loaded into the AI Assistant’s knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/2.png" alt="Ask the AI Assistant to check if there are runbooks to handle issues with a service." /></p>
<p>Fantastic! I can call my co-worker Luca Wintergerst and have him fix it. While I prefer tea these days, I’ll follow step two and grab a cup of coffee.</p>
<p>With that handled, let’s go have some fun with SlackBots.</p>
<h2>Slackbots</h2>
<p>Before coming to Elastic, I worked at E*Trade, where I was on a team responsible for managing several large Elasticsearch clusters. I spent a decent amount of time working in Kibana; however, as we worked on other technologies, I spent much more time outside of Kibana. One app I usually had open was Slack. Long story short, <a href="https://www.elastic.co/elasticon/tour/2018/chicago/elastic-at-etrade">I wrote a Slackbot</a> (skip to the 05:22 mark to see a brief demo of it) that could perform many operations with Elasticsearch.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/3.png" alt="Slackbot circa 2018 reporting on Elastic ML Anomalies for trade transactions by stock symbol" /></p>
<p>This worked really well. The only problem was writing all the code, including implementing basic natural language processing (NLP). All the searches were hard-coded, and the list of tasks was static.</p>
<h3>Creating an AI Slackbot today</h3>
<p>Implementing a Slackbot with the AI Assistant's API is far more straightforward today. The interaction with the bot is the same as we saw with the command-line interface, except that we are in Slack.</p>
<p>To start things off, I created a new Slackbot and named it <em>obsBurger</em>. I’m a Bob’s Burgers fan, and observability can be considered a stack of data. The Observability Burger, obsBurger for short, was born. This is the bot that connects directly to the AI Assistant API and performs all the same functions available within Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/4.png" alt="Just like in Kibana, I can ask ObsBurger (the AI Assistant) for a list of active alerts" /></p>
<h3>More bots!</h3>
<p>Connecting my Slackbot to the AI Assistant's API was so easy that I started brainstorming ideas to entertain myself.</p>
<p>Various personas will benefit from using the AI Assistant, especially Level One (L1) operations analysts. These people are generally new to observability and would typically need a lot of mentoring by a more senior employee to ramp up quickly. We could pretend to be an L1, test the Slackbot, or have fun with LLMs and prompt engineering!</p>
<p>I created a new Slackbot called <em>opsHuman</em>. This bot connects directly to Azure OpenAI using the same model the AI Assistant is configured to use. This virtual L1 uses the system prompt instructing it to behave as such.</p>
<p>You are OpsHuman, styled as a Level 1 operations expert with limited expertise in observability.<br />
Your primary role is to simulate a beginner's interaction with Elasticsearch Observability.</p>
<p>The full prompt is much longer and instructs how the LLM should behave when interacting with our AI Assistant.</p>
<h3>Let’s see it in action!</h3>
<p>To kick off the bot’s conversation, we “@” mention opsHuman, with the trigger command shiftstart, followed by the question we want our L1 to ask the AI Assistant.</p>
<p>@OpsHuman shiftstart are there any active alerts?</p>
<p>From there, OpsHuman will take our question and start a conversation with obsBurger, the AI Assistant.</p>
<p>@ObsBurger are there any active alerts?</p>
<p>From there, we sit back and let one of history's most advanced generative AI language models converse with itself!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/5.png" alt="Triggering the start of a two-bot conversation." /></p>
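<p>The bot-to-bot hand-off is mostly string plumbing: strip the leading @-mention, peel off the trigger command if present, and forward the remaining question. A minimal sketch of that parsing step (illustrative, not the actual bot code):</p>
<pre><code class="language-python"># Parse a Slack-style mention such as
#   "@OpsHuman shiftstart are there any active alerts?"
# into (trigger, question). Illustrative sketch, not the demo's real code.
def parse_mention(text, trigger="shiftstart"):
    text = text.strip()
    if text.startswith("@"):
        # Drop the "@botname" token
        parts = text.split(None, 1)
        text = parts[1] if len(parts) > 1 else ""
    if text.lower().startswith(trigger):
        return trigger, text[len(trigger):].strip()
    return None, text

cmd, question = parse_mention("@OpsHuman shiftstart are there any active alerts?")
print(cmd, question)
</code></pre>
<p>With the trigger stripped, the bot simply re-posts the question as an @-mention to the other bot, which is what keeps the conversation going.</p>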
<p>It’s fascinating to watch this conversation unfold. This is the same generative model, GPT-4-turbo, responding to two sets of API calls, with only different prompt instructions guiding the style and sophistication of the responses. When I first set this up, I watched the interaction several times, using a variety of initial questions to start the conversation. Most of the time, the L1 will spend several rounds asking questions about what the alerts mean, what a type of APM service does, and how to investigate and ultimately remediate any issue.</p>
<p>Because I initially didn’t have a way to actually stop the conversation, the two sides would agree they were happy with the conversation and investigation and get into a loop thanking the other.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/6.png" alt="Neither Slackbot wants to be the one to hang up first" /></p>
<h3>Iterating</h3>
<p>To give a little more structure to this currently open-ended demo, I set up a scenario where L1 is asked to perform an investigation, is given three rounds of interactions with obsBurger to collect information, and finally generates a summary report of the situation, which could be passed to Level 2 (note there is no L2 bot at this point in time, but you could program one!).</p>
<p>Once again, we start by having opsHuman investigate if there are any active alerts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/7.png" alt="Starting the investigation" /></p>
<p>Several rounds of investigation are performed until our limit has been reached. At that time, it will generate a summary of the situation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/8.png" alt="Level One, OpsHuman, summarizing the investigation" /></p>
<h2>How about something with a real-world application?</h2>
<p>As fun as watching two Slackbots talk to each other is, having an L1 speak to an AI Assistant isn’t very useful beyond a demo. So, I decided to see if I could modify opsHuman to be more beneficial for real-world applications.</p>
<p>The two main changes for this experiment were:</p>
<ol>
<li>
<p>Flip the profile of the bot from an entry-level personality to an expert.</p>
</li>
<li>
<p>Allow the number of interactions to expand, but encourage the bot to use as few as possible.</p>
</li>
</ol>
<p>With those points in mind, I cloned opsHuman into opsExpert and modified the prompt to be an expert in all things Elastic and observability.</p>
<p>You are OpsMaster, recognized as a senior operations and observability expert with extensive expertise in Elasticsearch, APM (Application Performance Monitoring), logs, metrics, synthetics, alerting, monitoring, OpenTelemetry, and infrastructure management.</p>
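<p>Mechanically, the persona flip is just a different system prompt in an otherwise identical chat request. A hedged sketch (the message format follows the OpenAI-style chat API; the prompt strings are abbreviated):</p>

```python
# Illustrative only: the persona swap is a different system prompt in an
# otherwise identical request. Prompt strings are abbreviated.

OPS_HUMAN = "You are OpsHuman, an entry-level L1 operations analyst."
OPS_EXPERT = ("You are OpsMaster, a senior operations and observability expert "
              "with extensive expertise in Elasticsearch, APM, logs, metrics, "
              "synthetics, alerting, monitoring, OpenTelemetry, and "
              "infrastructure management.")

def build_request(system_prompt: str, user_question: str) -> dict:
    # OpenAI-style chat payload: same model, only the system message differs.
    return {
        "model": "gpt-4-turbo",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_question},
        ],
    }

novice = build_request(OPS_HUMAN, "Are there any active alerts?")
expert = build_request(OPS_EXPERT, "Are there any active alerts?")
```

<p>Everything else about the two bots — model, API calls, conversation plumbing — stays the same; only the system message guides the style and sophistication of the replies.</p>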
<p>I started with the same command: Are there any active alerts? After getting the list of alerts, OpsExpert dove into data collection for its investigation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/9.png" alt="9 - opsexpert" /></p>
<p>After obsBurger (the AI Assistant) provided the requested information, OpsExpert investigated the two services that appeared to be the root of the alerts.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/10.png" alt="10- opsexpert standby" /></p>
<p>After several more back-and-forth requests for and deliveries of relevant information, OpsExpert reached a conclusion for the active alerts related to the checkout service and wrote up a summary report.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/11.png" alt="11 - paymentservice" /></p>
<h2>Looking forward</h2>
<p>This is just one example of what you can accomplish by bringing the AI Assistant to where you operate. You could take this one step further and have it actually open an issue on GitHub:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/12.png" alt="12. -github issue created" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/13.png" alt="13 - jeffvestal commented" /></p>
<p>Or integrate it into any other tracking platform you use!</p>
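<p>For illustration, creating that GitHub issue can be sketched with GitHub’s documented REST endpoint (<code>POST /repos/{owner}/{repo}/issues</code>); the owner, repo, and token below are placeholders:</p>

```python
import json
import urllib.request

# Sketch of the issue-creation step using GitHub's REST API
# (POST /repos/{owner}/{repo}/issues). Owner, repo, and token are placeholders.

def open_issue_request(owner: str, repo: str, token: str,
                       title: str, body: str) -> urllib.request.Request:
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )

# The assistant's summary report would go in `body`; calling
# urllib.request.urlopen(req) would submit it.
req = open_issue_request("example-org", "example-repo", "TOKEN",
                         "Checkout service alert investigation",
                         "Summary generated by the AI Assistant")
```

<p>Any other tracker with an HTTP API could be wired up the same way.</p>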
<p>The team is focused on building functionality into the Kibana integration, so this is just the beginning of the API. As time progresses, new functionality will be added. Even at a preview stage, I hope this starts you thinking about how having a fully developed Observability AI Assistant accessible by a standard API can make your work life even easier. It could get us closer to my dream of sitting on a beach handling incidents from my phone!</p>
<h2>Try it yourself!</h2>
<p>You can explore the API yourself if running Elasticsearch version 8.13 or later. The demo code I used for the above examples is <a href="https://github.com/jeffvestal/obsburger">available on GitHub</a>.</p>
<p>As a reminder: as of Elastic version 8.13, current when this blog was written, the API is pre-beta and therefore unsupported. Use it with care, and do not use it in production yet.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-escapes-kibana/Running_away.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-microsoft-azure-openai</link>
            <guid isPermaLink="false">elastic-ai-assistant-observability-microsoft-azure-openai</guid>
            <pubDate>Wed, 03 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to get started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI.]]></description>
<content:encoded><![CDATA[<p>Recently, Elastic <a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-12-0">announced</a> that the AI Assistant for Observability is now generally available for all Elastic users. The AI Assistant adds a new tool to Elastic Observability, providing large language model (LLM)-connected chat and contextual insights that explain errors and suggest remediation. Similar to how Microsoft Copilot is an AI companion that introduces new capabilities and increases developer productivity, the Elastic AI Assistant is an AI companion that can help you quickly gain additional value from your observability data.</p>
<p>This blog post presents a step-by-step guide to setting up the AI Assistant for Observability with Azure OpenAI as the backing LLM. Once the AI Assistant is set up, the post shows how to add documents to the AI Assistant’s knowledge base and demonstrates how the AI Assistant uses that knowledge base to improve its responses to specific questions.</p>
<h2>Set up the Elastic AI Assistant for Observability: Create an Azure OpenAI key</h2>
<p>Start by creating a Microsoft Azure OpenAI API key to authenticate requests from the Elastic AI Assistant. Head over to <a href="https://azure.microsoft.com/">Microsoft Azure and use an existing subscription or create a new one at the Azure portal</a>.</p>
<p>Currently, access to the Azure OpenAI service must be requested. See the <a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/quickstart?tabs=command-line%2Cpython-new&amp;pivots=programming-language-studio#prerequisites">official Microsoft documentation for the current prerequisites</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/1.png" alt="Watch what your data can do" /></p>
<p>In the Azure portal, select <strong>Azure OpenAI</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/2.png" alt="Azure OpenAI" /></p>
<p>In the Azure OpenAI service, click the <strong>Create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/3.png" alt="+Create" /></p>
<p>Enter an instance <strong>Name</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/4.png" alt="Basics Next" /></p>
<p>Select your network access preference for the Azure OpenAI instance and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/5.png" alt="Network Next" /></p>
<p>Add optional <strong>Tags</strong> and click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/6.png" alt="Tags Next" /></p>
<p>Confirm your settings and click <strong>Create</strong> to create the Azure OpenAI instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/7.png" alt="Review + submit Create" /></p>
<p>Once the instance creation is complete, click the <strong>Go to resource</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/8.png" alt="go to resource" /></p>
<p>Click the <strong>Manage keys</strong> link to access the instance’s API key.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/9.png" alt="manage keys" /></p>
<p>Copy your Azure OpenAI <strong>API Key</strong> and the <strong>Endpoint</strong> and save them both in a safe place for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/10.png" alt="copy to clipboard" /></p>
<p>Next, click <strong>Model deployments</strong> to create a deployment within the Azure OpenAI instance you just created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/11.png" alt="model deployments" /></p>
<p>Click the <strong>Manage deployments</strong> button to open Azure OpenAI Studio.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/12.png" alt="manage deployments" /></p>
<p>Click the <strong>Create new deployment</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/13.png" alt="+ Create new deployment" /></p>
<p>Select the model type you want to use and enter a Deployment name. Note the Deployment name for use in a later step. Click the <strong>Create</strong> button to deploy the model.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/14.png" alt="deploy model" /></p>
<h2>Set up the Elastic AI Assistant for Observability: Create an OpenAI connector in Elastic Cloud</h2>
<p>The remainder of the instructions in this post will take place within <a href="https://cloud.elastic.co/registration">Elastic Cloud</a>. You can use an existing deployment or you can create a new Elastic Cloud deployment as a free trial if you’re trying Elastic Cloud for the first time. Another option to get started is to create an <a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/elastic.ec-azure-observability?tab=Overview">Elastic deployment from the Microsoft Azure Marketplace</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/15.png" alt="sign up trial" /></p>
<p>The next step is to create an Azure OpenAI connector in Elastic Cloud. In the <a href="https://cloud.elastic.co/home">Elastic Cloud console</a> for your deployment, select the top-level menu and then select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/16.png" alt="stack management" /></p>
<p>Select <strong>Connectors</strong> on the Stack Management page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/17.png" alt="connectors" /></p>
<p>Select <strong>Create connector</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/18.png" alt="create connector" /></p>
<p>Select the connector for Azure OpenAI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/19.png" alt="openai" /></p>
<p>Enter a <strong>Name</strong> of your choice for the connector. Select <strong>Azure OpenAI</strong> as the OpenAI provider.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/20.png" alt="openai connector" /></p>
<p>Enter the Endpoint URL using the following format:</p>
<ul>
<li>
<p>Replace <code>{your-resource-name}</code> with the <strong>name of the Azure Open AI instance</strong> that you created within the Azure portal in a previous step.</p>
</li>
<li>
<p>Replace <code>{deployment-id}</code> with the <strong>Deployment name</strong> that you specified when you created a model deployment within the Azure portal in a previous step.</p>
</li>
<li>
<p>Replace <code>{api-version}</code> with one of the valid <strong>Supported versions</strong> listed in the <a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/reference">Completions section of the Azure OpenAI reference page</a>.</p>
</li>
</ul>
<pre><code class="language-bash">https://{your-resource-name}.openai.azure.com/openai/deployments/{deployment-id}/chat/completions?api-version={api-version}
</code></pre>
<p>Your completed Endpoint URL should look something like this:</p>
<pre><code class="language-bash">https://example-openai-instance.openai.azure.com/openai/deployments/gpt-4-turbo/chat/completions?api-version=2024-02-01
</code></pre>
<p>Enter the API Key that you copied in a previous step. Then click the <strong>Save &amp; test</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/21.png" alt="save &amp; test" /></p>
<p>Within the <strong>Edit Connector</strong> flyout window, click the <strong>Run</strong> button to confirm that the connector configuration is valid and can successfully connect to your Azure OpenAI instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/22.png" alt="" /></p>
<p>A successful connector test should look something like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/23.png" alt="results" /></p>
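<p>If the connector test fails, it can help to reproduce the same request outside Kibana. A minimal sketch of the call the connector makes (the resource name, deployment ID, API version, and key below are placeholders):</p>

```python
import json
import urllib.request

# Sketch of the request the Azure OpenAI connector makes; the resource name,
# deployment ID, API version, and key below are placeholders.

def azure_chat_request(resource: str, deployment: str, api_version: str,
                       api_key: str) -> urllib.request.Request:
    url = (f"https://{resource}.openai.azure.com/openai/deployments/"
           f"{deployment}/chat/completions?api-version={api_version}")
    payload = json.dumps(
        {"messages": [{"role": "user", "content": "Hello, world"}]}
    ).encode("utf-8")
    # Azure OpenAI authenticates with an `api-key` header (not `Authorization`).
    return urllib.request.Request(url, data=payload, method="POST",
                                  headers={"api-key": api_key,
                                           "Content-Type": "application/json"})

req = azure_chat_request("example-openai-instance", "gpt-4-turbo",
                         "2024-02-01", "YOUR_API_KEY")
# urllib.request.urlopen(req) would return the model's JSON response.
```

<p>A 200 response with a chat completion confirms the endpoint, deployment, and key are all correct, which isolates any remaining problem to the connector configuration itself.</p>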
<h2>Add an example logs record</h2>
<p>Now that you have your Elastic Cloud deployment set up with an AI Assistant connector, let’s add an example logs record to demonstrate how the AI Assistant can help you to better understand logs data.</p>
<p>We’ll use the Elastic Dev Tools to add a single logs record. Click the top-level menu and select <strong>Dev Tools</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/24.png" alt="dev tools" /></p>
<p>Within the Console area of Dev Tools, enter the following POST statement:</p>
<pre><code class="language-bash">POST /logs-elastic_agent-default/_doc
{
  &quot;message&quot;: &quot;Status(StatusCode=\&quot;FailedPrecondition\&quot;, Detail=\&quot;Can't access cart storage. \nSystem.ApplicationException: Wasn't able to connect to redis \n  at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104 \n  at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168\&quot;).&quot;,
  &quot;@timestamp&quot;: &quot;2024-02-22T11:34:00.884Z&quot;,
  &quot;log&quot;: {
    &quot;level&quot;: &quot;error&quot;
  },
  &quot;service&quot;: {
    &quot;name&quot;: &quot;cartService&quot;
  },
  &quot;host&quot;: {
    &quot;name&quot;: &quot;appserver-1&quot;
  }
}
</code></pre>
<p>Then run the POST command by clicking the green <strong>Run</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/25.png" alt="click to send request" /></p>
<p>You should see a 201 response confirming that the example logs record was successfully created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/26.png" alt="201 response" /></p>
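<p>The Dev Tools console is convenient for a single record; to seed several sample records at once, you can build a <code>_bulk</code> request body instead (NDJSON: an action line followed by a document line, ending with a newline). A sketch using a simplified version of the document above:</p>

```python
import json
from datetime import datetime, timedelta, timezone

# Sketch: build a `_bulk` NDJSON body for several simplified log records
# targeting the same `logs-elastic_agent-default` data stream. Data streams
# require the `create` action.

def build_bulk_body(n: int) -> str:
    start = datetime(2024, 2, 22, 11, 34, tzinfo=timezone.utc)
    lines = []
    for i in range(n):
        lines.append(json.dumps({"create": {"_index": "logs-elastic_agent-default"}}))
        lines.append(json.dumps({
            "@timestamp": (start + timedelta(minutes=i)).isoformat(),
            "message": "Wasn't able to connect to redis",
            "log": {"level": "error"},
            "service": {"name": "cartService"},
            "host": {"name": "appserver-1"},
        }))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = build_bulk_body(3)
```

<p>POST the resulting string to <code>/_bulk</code> with <code>Content-Type: application/x-ndjson</code>.</p>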
<h2>Use the Elastic AI Assistant</h2>
<p>Now that you have a log record to work with, let’s jump over to the Observability Logs Explorer to see how the AI Assistant interacts with logs data. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/27.png" alt="observability" /></p>
<p>Select <strong>Logs Explorer</strong> to explore the logs data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/28.png" alt="explorer" /></p>
<p>In the Logs Explorer search box, enter the text “redis” and press the <strong>Enter</strong> key to perform the search.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/29.png" alt="redis" /></p>
<p>Click the <strong>View all matches</strong> button to include all search results.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/30.png" alt="view all matches" /></p>
<p>You should see the one log record that you previously inserted via Dev Tools. Click the expand icon to see the log record’s details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/31.png" alt="expand icon" /></p>
<p>You should see the expanded view of the logs record. Instead of trying to understand its contents ourselves, we'll use the AI Assistant to summarize it. Click on the <strong>What's this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/32.png" alt="What's this message?" /></p>
<p>We get a fairly generic answer back. Depending on the exception or error we're trying to analyze, this can still be really useful, but we can make this better by adding additional documentation to the AI Assistant knowledge base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/33.png" alt="log details" /></p>
<p>Let’s see how we can use the AI Assistant’s knowledge base to improve its understanding of this specific logs message.</p>
<h2>Create an Elastic AI Assistant knowledge base</h2>
<p>Select <strong>Overview</strong> from the <strong>Observability</strong> menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/34.png" alt="Select Overview from the Observability menu." /></p>
<p>Click the <strong>AI Assistant</strong> button at the top right of the window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/35.png" alt="AI Assistant" /></p>
<p>Click the <strong>Install Knowledge base</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/36.png" alt="Install Knowledge base" /></p>
<p>Click the top-level menu and select <strong>Stack Management</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/37.png" alt="Stack Management" /></p>
<p>Then select <strong>AI Assistants</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/38.png" alt="AI Assistants" /></p>
<p>Click <strong>Elastic AI Assistant for Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/39.png" alt="Elastic AI Assistant for Observability" /></p>
<p>Select the <strong>Knowledge base</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/40.png" alt="Knowledge base" /></p>
<p>Click the <strong>New entry</strong> button and select <strong>Single entry</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/41.png" alt="new entry" /></p>
<p>Give it the <strong>Name</strong> “cartservice” and enter the following text as the <strong>Contents</strong> :</p>
<pre><code class="language-markdown">Link: [Cartservice Intermittent connection issue](https://github.com/elastic/observability-examples/issues/25)
I have the following GitHub issue. Store this information in your knowledge base and always return the link to it if relevant.
GitHub Issue, return if relevant

Link: https://github.com/elastic/observability-examples/issues/25

Title: Cartservice Intermittent connection issue

Body:
The cartservice occasionally encounters storage errors due to an unreliable network connection.

The errors typically indicate a failure to connect to Redis, as seen in the error message:

Status(StatusCode=&quot;FailedPrecondition&quot;, Detail=&quot;Can't access cart storage.
System.ApplicationException: Wasn't able to connect to redis
at cartservice.cartstore.RedisCartStore.EnsureRedisConnected() in /usr/src/app/src/cartstore/RedisCartStore.cs:line 104
at cartservice.cartstore.RedisCartStore.EmptyCartAsync(String userId) in /usr/src/app/src/cartstore/RedisCartStore.cs:line 168')'.
I just talked to the SRE team in Slack, they have plans to implement retries as a quick fix and address the network issue later.
</code></pre>
<p>Click <strong>Save</strong> to save the new knowledge base entry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/42.png" alt="save" /></p>
<p>Now let’s go back to the Observability Logs Explorer. Click the top-level menu and select <strong>Observability</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/43.png" alt="settings" /></p>
<p>Then select <strong>Explorer</strong> under <strong>Logs</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/44.png" alt="explorer" /></p>
<p>Expand the same logs entry as you did previously and click the <strong>What’s this message?</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/45.png" alt="What’s this message? button" /></p>
<p>The response you get now should be much more relevant.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/46.png" alt="log details" /></p>
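<p>Why does this work? Conceptually, the knowledge base behaves like retrieval-augmented generation: entries relevant to your question are retrieved and added to the model’s context before it answers. A toy sketch of the idea (word overlap standing in for the real semantic search; this is not Elastic’s implementation):</p>

```python
# Toy sketch of retrieval-augmented generation; NOT Elastic's implementation.
# Word overlap stands in for real semantic search over knowledge base entries.

KNOWLEDGE_BASE = [
    "Cartservice intermittent Redis connection issue; SRE team plans retries.",
    "Payment service latency runbook.",
]

def retrieve(question: str, docs: list[str]) -> str:
    # Pick the entry sharing the most words with the question.
    q = set(question.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(question: str) -> str:
    # Prepend the retrieved entry so the model can ground its answer in it.
    context = retrieve(question, KNOWLEDGE_BASE)
    return f"Context: {context}\n\nQuestion: {question}"

prompt = build_prompt("Why can't cartservice connect to redis?")
```

<p>With the GitHub issue text retrieved into context, the model can point at the known root cause and the SRE team’s planned fix instead of giving a generic answer.</p>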
<h2>Try out the Elastic AI Assistant with a knowledge base filled with your own data</h2>
<p>Now that you’ve seen how easy it is to set up the Elastic AI Assistant for Observability, go ahead and give it a try yourself. Sign up for a <a href="https://cloud.elastic.co/registration">free 14-day trial</a>. You can spin up an Elastic Cloud deployment in minutes and have your own search-powered AI knowledge base to help you get your most important work done.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-ai-assistant-observability-microsoft-azure-openai/AI_hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic APM for iOS and Android Native apps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps</link>
            <guid isPermaLink="false">apm-ios-android-native-apps</guid>
            <pubDate>Thu, 08 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog provides an overview of the key capabilities included in the Elastic APM solution for iOS and Android native apps, as well as a walkthrough of the configuration details and troubleshooting workflow for a few error scenarios.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><strong>WARNING</strong>: This article shows information about the Android agent that is no longer accurate for versions <code>1.x</code>. Please refer to <a href="https://www.elastic.co/docs/reference/apm/agents/android">its documentation</a> to learn about its new APIs.</p>
</blockquote>
<p>Elastic® APM for iOS and Android native apps is generally available in stack release v8.12. The Elastic <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents are open source and are built on top of (i.e., as distributions of) the OpenTelemetry Swift and Android SDK/API, respectively.</p>
<h2>Overview of the Mobile APM solution</h2>
<p>The OpenTelemetry SDKs/APIs for iOS and Android support capabilities such as auto-instrumentation of HTTP requests, an API for manual instrumentation, a data model based on the OpenTelemetry semantic conventions, and buffering. Additionally, the Elastic APM agent distributions support an easier initialization process and novel features such as remote configuration and user-session-based sampling. Because the Elastic <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents are <em>distributions</em>, they are maintained per Elastic’s standard support T&amp;Cs.</p>
<p>Kibana® provides curated, pre-built dashboards for monitoring, data analysis, and troubleshooting. The <strong>Service Overview</strong> view shown below provides relevant frontend KPIs such as crash rate, HTTP requests, average app load time, and more, including the comparison view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/1.png" alt="1 - comparison view" /></p>
<p>Further, the geographic distribution of user traffic is available on a map at a country and regional level. The service overview dashboard also shows trends of metrics such as throughput, latency, failed transaction rate, and distribution of traffic by device make-model, network connection type, and app version.</p>
<p>The <strong>Transactions</strong> view shown below highlights the performance of the different transaction groups, including the distributed trace end-to-end of individual transactions with links to associated spans, errors and crashes. Further, users can see at a glance the distribution of traffic by device make and model, app version, and OS version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/2.png" alt="2- opbeans android" /></p>
<p>Tabular views such as the one highlighted below, located at the bottom of the <strong>Transactions</strong> tab, make it relatively easy to see how device make and model, app version, etc., impact latency and crash rate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/3.png" alt="3 - latency and crash rate" /></p>
<p>The <strong>Errors &amp; Crashes</strong> view shown below can be used to analyze the different error and crash groups. The unsymbolicated (iOS) or obfuscated (Android) stacktrace of the individual error or crash instance is also available in this view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/4.png" alt="4 - opbeans swift" /></p>
<p>The <strong>Service Map</strong> view shown below provides a visualization of the end-to-end service interdependencies, including any third-party APIs, proxy servers, and databases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/5.png" alt="5 - flowchart" /></p>
<p>The comprehensive pre-built dashboards for observing the mobile frontend in Kibana provide visibility into the sources of errors, crashes, and bottlenecks to ease troubleshooting of issues in the production environment. The underlying Elasticsearch® Platform also supports the ability to query raw data, build custom metrics and custom dashboards, alerting, SLOs, and anomaly detection. Altogether the platform provides a comprehensive set of tools to expedite root cause analysis and remediation, thereby facilitating a high velocity of innovation.</p>
<h2>Walkthrough of the debugging workflow for some error scenarios</h2>
<p>Next, we will provide a walkthrough of the configuration details and the troubleshooting workflow for a couple of error scenarios in iOS and Android native apps.</p>
<h3>Scenario 1</h3>
<p>In this example, we will debug a crash in an asynchronous method using Apple’s crash report <strong>symbolication</strong> as well as <strong>breadcrumbs</strong> to deduce the cause of the crash.</p>
<p><strong>Symbolication</strong><br />
In this scenario, users notice a spike in the crash occurrences of a particular crash group in the Errors &amp; Crashes tab and decide to investigate further. A new crash comes in on the Crashes tab, and the developer follows these steps to symbolicate the crash report locally.</p>
<ol>
<li>Copy the crash via the UI and paste it into a file named in the format <code>&lt;AppBinaryName&gt;_&lt;DateTime&gt;.ips</code>, for example <code>opbeans-swift_2024-01-18-114211.ips</code>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/6.png" alt="6 - Symbolication" /></p>
<ol start="2">
<li>Apple provides <a href="https://developer.apple.com/documentation/xcode/adding-identifiable-symbol-names-to-a-crash-report">detailed instructions</a> on how to symbolicate this file locally either automatically through Xcode or manually using the command line.</li>
</ol>
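<p>For the manual command-line route, Apple’s <code>atos</code> tool can resolve individual frame addresses against the app’s dSYM. The sketch below is illustrative only: the load address and frame address are placeholders you would read from the crash report, and the dSYM path depends on your build:</p>
<pre><code class="language-bash"># -o points at the DWARF binary inside the dSYM bundle,
# -l is the app binary's load address from the crash report,
# followed by the raw frame address to symbolicate (placeholder values shown)
atos -arch arm64 -o opbeans-swift.app.dSYM/Contents/Resources/DWARF/opbeans-swift -l 0x100e44000 0x100e4f6f8
</code></pre>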
<p><strong>Breadcrumbs</strong><br />
The second frame of the first thread shows that the crash is occurring in a Worker instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/7.png" alt="7 - Breadcrumbs" /></p>
<p>This instance is actually used in many places, and due to the asynchronous nature of this function, it’s not possible to determine immediately where this call is coming from. Nevertheless, we can use features of the OpenTelemetry SDK to add more context to these crashes and then put the pieces together to find the site of the crash.</p>
<p>By adding “breadcrumbs” around this Worker instance, it is possible to track down which calls to the Worker are actually associated with this crash.</p>
<p><strong>Example:</strong><br />
Create a logger provider in the Worker class as a public variable for ease of access, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/8.png" alt="8 - example code" /></p>
<p>Create breadcrumbs everywhere the Worker.doWork() function is called:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/9.png" alt="9 - Create breadcrumbs everywhere the Worker.doWork() function" /></p>
<p>Each of these breadcrumbs will use the same event <strong>name</strong>, “worker_breadcrumb”, so they can be consistently queried, and the differentiation will be done using the “<strong>source</strong>” attribute.</p>
<p>In this example, the Worker.doWork() function is being called from a CustomerRow struct (a table row which does work ‘onTapGesture’). If you were to call this method from multiple places in the CustomerRow struct, you could further differentiate the “<strong>source</strong>” attribute value, for example by appending the associated function (e.g., “CustomerRow#onTapGesture”).</p>
<p>Now that the app is reporting these breadcrumbs, we can use Discover to <strong>query</strong> for them, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/10.png" alt="10 - Discover to query" /></p>
<p><strong>Note:</strong> <em>Event <strong>names</strong> sent by the agent are translated to event <strong>action</strong> in Elastic Common Schema (ECS), so ensure the query uses this field.</em></p>
<ol>
<li>
<p>You can add a filter: <code>event.action : &quot;worker_breadcrumb&quot;</code> and it shows all events generated from this new breadcrumb.</p>
</li>
<li>
<p>You can also see the various sources: ProductRow, CustomerRow, CartRow, etc.</p>
</li>
<li>
<p>If you add <code>error.type : crash</code> to the query, you can see crashes alongside the breadcrumbs:</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/11.png" alt="11 - crashes along side the breadcrumbs" /></p>
<p>A crash and a breadcrumb next to each other in the timeline may come from completely different devices, so we need another differentiator. For each crash, we have metadata that contains the <strong>session.id</strong> associated with the crash, viewable from the Metadata tab. We can query using this <strong>session.id</strong> to ensure that the only data we are looking at in Discover is from a single user session (i.e., a single device) that resulted in the crash.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/12.png" alt="12. - session.id" /></p>
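<p>Putting the pieces together, the resulting Discover query in KQL looks roughly like the sketch below; the session ID is a placeholder for the value copied from the crash metadata, and the field names are the ones used throughout this walkthrough:</p>
<pre><code>session.id : &quot;&lt;session-id-from-crash-metadata&gt;&quot; and (event.action : &quot;worker_breadcrumb&quot; or error.type : &quot;crash&quot;)
</code></pre>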
<p>In Discover, we can now see the session event flow, on a single device, concerning the crash via the breadcrumbs, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/13.png" alt="13 - session event flow" /></p>
<p>It looks like the last breadcrumb before the crash came from the “CustomerRow” source. This gives the app developer a good place to start their root cause analysis.</p>
<h3>Scenario 2</h3>
<p><strong>Note:</strong> <em>This scenario requires the Elastic Android agent version “0.14.0” or higher.</em></p>
<p>An Android sample app has a form composed of two screens that are created using two fragments (<code>FirstPage</code> and <code>SecondPage</code>). In the first screen, the app makes a backend API call to get a key that identifies the form submission. This key is stored in memory in the app and must be available on the last screen where the form is sent; the key must be sent along with the form's data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/14.jpg" alt="14 - form submission" /></p>
<p><strong>The problem</strong><br />
We start to see a spike in crash occurrences (null pointer exceptions) in Kibana’s Errors &amp; Crashes tab that always seem to happen on the last screen of the form, when users click the &quot;FINISH&quot; button. Nevertheless, <strong>this is not always reproducible</strong>, so the root cause isn’t clear from the crash’s stacktrace alone. Here’s what it looks like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/15.png" alt="15 - stack trace" /></p>
<p>When we take a look at the code referenced in the stacktrace, this is what we can see:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/16.png" alt="16 - When we take a look at the code referenced in the stacktrace, this is what we can see:" /></p>
<p>This is the line where the crash happens, so it seems the variable “formId” (a static String located in “FirstPage”) was null by the time this code executed, raising a null pointer exception. This variable is set within the “FirstPage” fragment after the backend request retrieves the id, and the only way to reach the “SecondPage” is by passing through the “FirstPage.” So the stacktrace alone doesn’t help much: the pages must be opened in order, and the first one will always set the “formId” variable. It therefore doesn’t seem likely that formId could be null in “SecondPage.”</p>
<p><strong>Finding the root cause</strong><br />
Beyond the crash’s stacktrace, it is useful to look at complementary data that helps put the pieces together and gives a broader picture of what else happened in the app around the time of the crash. In this case, we know the form ID must come from our backend service, so we can start by ruling out an error in the backend call. We do this by checking the traces from the creation of our FirstPage fragment, where the form ID request is executed, in the Transaction details view:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/17.png" alt="17 - trace sample" /></p>
<p>The “Created” spans represent the time it took to create the first fragment. The topmost one shows the Activity creation, followed by the NavHostFragment, followed by “FirstScreen.” Not long after its creation, we see that a GET HTTP request to our backend is made to retrieve our form ID and, according to the traces, the GET request was successful. We can therefore rule out that there is an issue with the backend communication for this problem.</p>
<p>Another option is to look at the logs sent throughout the <a href="https://opentelemetry.io/docs/specs/semconv/general/session/">session</a> in which the crash occurred (we could look at all the logs coming from our app, but there would be too many to analyze for this one issue). To do so, we first copy one of the spans’ “session.id” values from the span details flyout; any span will work, since the same session ID is present in all the data sent from our app during the time the crash occurred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/18.png" alt="18 - red box highlighted" /></p>
<p><strong>Note:</strong> <em>The same session ID can also be found in the crash metadata.</em></p>
<p>Now that we have identified our session, we can open up the Logs Explorer view and take a look at all of our app’s logs within that same session, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/19.png" alt="19 - app's logs" /></p>
<p>By looking at the logs, and adding a few fields to show the app’s lifecycle status and the error types, we see the log events that are <a href="https://github.com/elastic/apm/blob/main/specs/agents/mobile/events.md">automatically collected</a> from our app. We can see the crash event at the top of the list as the latest one. We can also see our app’s lifecycle events, and if we keep scrolling through, we’ll get to some lifecycle events that are going to help find our root cause:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/20.png" alt="20 - root cause" /></p>
<p>We can see there are a couple of lifecycle events that tell us that the app was restarted during the session. This is an important hint because it means that the Android OS killed our app at some point, which is common when an app stays in the background for a while. With this information, we could try to reproduce the issue by forcing the OS to kill our app in the background and then see how it behaves when reopened from the recently opened apps menu.</p>
<p>After giving it a try, we reproduced the issue and found that the static “formId” variable was lost when the app was restarted, causing it to be null when the SecondPage fragment requested it. We can now research best practices for passing arguments to Fragments and change our code to avoid relying on static fields, instead storing and sharing values between screens, preventing this crash from happening again.</p>
<p><strong>Bonus:</strong> For this scenario, it was enough to rely on the events that are sent automatically by the APM Agent; however, if those aren’t enough for other cases, we can always send custom events in the places where we want to track the state changes of our app via the OpenTelemetry event API, as shown in the code snippet below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/21.png" alt="21 - black code box" /></p>
<h2>Make the most of your Elastic APM Experience</h2>
<p>In this post, we reviewed Elastic’s new Mobile APM solution available in 8.12. The new solution uses Elastic’s new <a href="https://github.com/elastic/apm-agent-ios">iOS</a> and <a href="https://github.com/elastic/apm-agent-android">Android</a> APM agents, which are open source and built as distributions of the OpenTelemetry Swift and Android SDK/API, respectively.</p>
<p>We also reviewed configuration details and the troubleshooting workflow for two error scenarios in iOS and Android native apps.</p>
<ul>
<li>
<p><strong>iOS scenario:</strong> Debug a crash in an asynchronous method using Apple’s crash report <strong>symbolication</strong> as well as <strong>breadcrumbs</strong> to deduce the cause of the crash.</p>
</li>
<li>
<p><strong>Android scenario:</strong> Analyze why users get a null pointer exception on the last screen of a form when they click the “FINISH” button. The cause isn’t always clear from the crash’s stack trace alone, and the issue isn’t easily reproducible.</p>
</li>
</ul>
<p>In both instances, we found the root cause of the crash using distributed traces from the mobile device as well as correlated logs. Hopefully this blog provided a review of how Elastic can help manage and monitor mobile native apps.</p>
<p>Elastic invites SREs and developers to experience our Mobile APM solution firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/apm-ios-android-native-apps/141949-elastic-blogheaderimage.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Accelerate log analytics in Elastic Observability with Automatic Import powered by Search AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-automatic-import-logs-genai</link>
            <guid isPermaLink="false">elastic-automatic-import-logs-genai</guid>
            <pubDate>Wed, 04 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Migrate your logs to AI-driven log analytics in record time by automating custom data integrations]]></description>
<content:encoded><![CDATA[<p>Elastic is accelerating the adoption of <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a> by automating the ingestion of custom logs, which is increasingly important as the deployment of GenAI-based applications grows. These custom data sources must be ingested, parsed, and indexed effortlessly, enabling broader visibility and more straightforward root cause analysis (RCA) without requiring effort from Site Reliability Engineers (SREs). Achieving visibility across an enterprise IT environment is inherently challenging for SREs due to constant growth and change, such as new applications, added systems, and infrastructure migrations to the cloud. Until now, the onboarding of custom data has been costly and complex for SREs. With Automatic Import, SREs can concentrate on deploying, optimizing, and improving applications.</p>
<p>Automatic Import uses generative AI to automate the development of custom data integrations, reducing the time required from several days to less than 10 minutes and significantly lowering the learning curve for onboarding data. Powered by the  <a href="https://www.elastic.co/platform">Elastic Search AI Platform</a>, it provides model-agnostic access to leverage large language models (LLMs) and grounds answers in proprietary data through <a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">retrieval augmented generation (RAG)</a>. This capability is further enhanced by Elastic's expertise in enabling observability teams to utilize any type of data and the flexibility of its <a href="https://www.elastic.co/generative-ai/search-ai-lake">Search AI Lake</a>. Arriving at a crucial time when organizations face an explosion of applications and telemetry data, such as logs, Automatic Import streamlines the initial stages of data migration by simplifying data collection and normalization. It also addresses the challenges of building custom connectors, which can otherwise delay deployments, issue analysis, and impact customer experiences.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<h2>Enhancing AI Powered Observability with Automatic Import</h2>
<p><a href="https://www.elastic.co/observability">Automatic Import</a> builds on Elastic Observability’s AI-driven log analytics innovations—such as <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-getting-started.html">anomaly detection</a>, <a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-aiops.html">log rate and pattern analysis</a>, and the <a href="https://www.elastic.co/blog/introducing-elastic-ai-assistant">Elastic AI Assistant</a>—and further automates and simplifies SREs’ workflows. Automatic Import applies generative AI to automate the creation of custom data integrations, allowing SREs to focus on logs and other telemetry data. While Elastic provides over <a href="https://www.elastic.co/integrations/data-integrations">400 prebuilt data integrations</a>, Automatic Import allows SREs to extend integrations to fit their workflows and expand visibility into production environments.</p>
<p>In conjunction with Automatic Import, Elastic is introducing <a href="https://www.elastic.co/blog/ai-log-analytics-express-migration">Elastic Express Migration</a>, a commercial incentive program designed to overcome migration inertia from existing deployments and contracts, providing a faster adoption path for new customers.</p>
<p>Automatic Import leverages <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">Elastic Common Schema (ECS)</a> with public LLMs to process and analyze data in ECS format, which is also part of OpenTelemetry. Once the data is in, SREs can leverage Elastic’s RAG-based AI Assistant to solve root cause analysis (RCA) challenges in dynamic, complex environments.</p>
<h2>Configuring and using Automatic Import</h2>
<p>Automatic Import is available to everyone with an Enterprise license. Here is how it works:</p>
<ul>
<li>
<p>The user configures connectivity to an LLM and uploads sample data</p>
</li>
<li>
<p>Automatic Import then extrapolates what to expect from the data source. These log samples are paired with LLM prompts that have been honed by Elastic engineers to reliably produce conformant Elasticsearch ingest pipelines. </p>
</li>
<li>
<p>Automatic Import then iteratively builds, tests, and tweaks a custom ingest pipeline until it meets Elastic integration requirements.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-arch.png" alt="Create new integration Architecture" />
<em>Automatic Import powered by the Elastic Search AI Platform</em></p>
<p>Within minutes, a validated custom integration is created that accurately maps raw data into ECS and custom fields, populates contextual information (such as <code>related.*</code> fields), and categorizes events.</p>
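<p>Under the hood, the output is an ordinary Elasticsearch ingest pipeline. Purely as an illustration (this is a hand-written sketch with hypothetical field names, not actual Automatic Import output), a pipeline that renames a raw field to its ECS equivalent and populates a <code>related.*</code> field could look like this:</p>
<pre><code class="language-json">PUT _ingest/pipeline/my-custom-logs-sketch
{
  &quot;description&quot;: &quot;Illustrative sketch: map a raw field to ECS and populate related.*&quot;,
  &quot;processors&quot;: [
    { &quot;rename&quot;: { &quot;field&quot;: &quot;severity&quot;, &quot;target_field&quot;: &quot;log.level&quot;, &quot;ignore_missing&quot;: true } },
    { &quot;append&quot;: { &quot;field&quot;: &quot;related.ip&quot;, &quot;value&quot;: [&quot;{{{client_ip}}}&quot;], &quot;if&quot;: &quot;ctx.client_ip != null&quot; } }
  ]
}
</code></pre>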
<p>Automatic Import currently supports Anthropic models via <a href="https://www.elastic.co/guide/en/kibana/8.15/bedrock-action-type.html">Elastic’s connector for Amazon Bedrock</a>, and additional LLMs will be introduced soon. It supports JSON and NDJSON-based log formats currently.</p>
<h3>Automatic Import workflow</h3>
<p>SREs are constantly having to manage new tools and components that developers add into applications. Neo4j, for example, is a database that doesn’t have a prebuilt integration in Elastic. The following steps walk you through how to create an integration for Neo4j with Automatic Import:</p>
<ol>
<li>Start by navigating to <code>Integrations</code> -&gt; <code>Create new integration</code>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<ol start="2">
<li>Provide a name and description for the new data source.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-neo4j-setup.png" alt="Set up integration" /></p>
<ol start="3">
<li>Next, fill in other details and provide some sample data, anonymized as you see fit.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-pipline.png" alt="Set up pipeline" /></p>
<ol start="4">
<li>Click “Analyze logs” to submit integration details, sample logs, and expert-written instructions from Elastic to the specified LLM, which builds the integration package using generative AI. Automatic Import then fine-tunes the integration in an automated feedback loop until it is validated to meet Elastic requirements.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-analysis.png" alt="Analyze sample logs" /></p>
<ol start="5">
<li>Review what Automatic Import presents as recommended mappings to ECS fields and custom fields. You can easily adjust these settings if necessary.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-finished.png" alt="Review Analysis" /></p>
<ol start="6">
<li>After finalizing the integration, add it to Elastic Agent or view it in Kibana. It is now available alongside your other integrations and follows the same workflows as prebuilt integrations.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-success.png" alt="Creation complete" /></p>
<ol start="7">
<li>Upon deployment, you can begin analyzing newly ingested data immediately. Start by looking at the new Logs Explorer in Elastic Observability.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-explorer.png" alt="Look at logs" /></p>
<h2>Accelerate log analytics with Automatic Import</h2>
<p>Automatic Import lowers the time required to build and test custom data integrations from days to minutes, accelerating the switch to <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a>. Elastic Observability pairs the unique power of Automatic Import with Elastic’s deep library of prebuilt data integrations, enabling wider visibility and fast data onboarding, along with AI-based features, such as the Elastic AI Assistant to accelerate RCA and reduce operational overhead.</p>
<p>Interested in our <a href="https://www.elastic.co/splunk-replacement">Express Migration</a> program to level up to Elastic? <a href="https://www.elastic.co/splunk-interest?elektra=organic&amp;storm=CLP&amp;rogue=splunkobs-gic">Contact Elastic</a> to learn more. </p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em> </p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/elastic-auto-importv2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Traces in Discover for Deeper Application Insights in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-discover-traces-apm</link>
            <guid isPermaLink="false">elastic-discover-traces-apm</guid>
            <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic brings traces into Discover. See how you can apply the capabilities of ad-hoc data exploration and ES|QL to your tracing data.]]></description>
            <content:encoded><![CDATA[<p>In the world of observability, context is king. For years, Elastic APM has provided dedicated views and capabilities for understanding the health of your applications and services. When you need to know how your checkout service is performing, you can go straight to its dedicated page, view key metrics like latency and throughput, and directly access related transactions and errors. This entity-centric view is invaluable for targeted monitoring and diagnostics.</p>
<p>But what happens when the problem isn't neatly confined to a single service? What if you need to ask more complex, exploratory questions that span across your entire dataset? Questions like:</p>
<ul>
<li>
<p>Show me all traces where a specific user experienced a latency of over two seconds, and correlate it with any frontend errors that occurred at the same time.</p>
</li>
<li>
<p>Are there any slow database queries happening only for customers on our premium plan?</p>
</li>
<li>
<p>Which specific RPC call is the common source of failure across three different microservices?</p>
</li>
</ul>
<p>Historically, answering these questions has been possible, but it required navigating different UIs and manually piecing together clues, leading to a less-than-seamless experience, a common challenge across various observability platforms.</p>
<p>Today, we're excited to announce a key improvement for trace search and analytics. We are bringing native support for <strong>Traces into Discover</strong>, complete with an integrated trace waterfall view. You can now apply the full capabilities of ad-hoc data exploration and ES|QL to your tracing data.</p>
<h2>From Curated Views to Broader Data Exploration</h2>
<p>Discover is the primary interface for data exploration in the Elastic Stack. It's the workbench where you can freely explore, filter, and correlate all of your indexed data. By integrating traces into this environment, you can now move beyond APM's curated views and conduct more flexible investigations, searching by any trace attribute.</p>
<p>You can now easily search for individual spans or errors, filter by OpenTelemetry resource attributes and span attributes, and analyze complex scenarios, all without leaving the Discover interface you know and love.</p>
<h2>A Practical Scenario: Unraveling a Slow API</h2>
<p>Imagine a critical frontend API to place orders is experiencing intermittent slowdowns. Your team has the APM service view, which confirms the high latency, but the root cause isn't immediately obvious. The slowdown seems to be happening deep within a complex chain of microservice calls.</p>
<p>This is where the new Discover functionality is particularly useful.</p>
<p>Your investigation can now start directly in Discover with a broad ES|QL query to find the slowest transactions for that specific endpoint. The example below uses the <a href="https://otel.demo.elastic.co/">OpenTelemetry demo</a>, which you can try yourself.</p>
<p>ES|QL</p>
<pre><code class="language-sql">FROM traces-*
| WHERE span.name == &quot;oteldemo.CheckoutService/PlaceOrder&quot;
| SORT span.duration.us DESC
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/traces-discover-screenshot1-min.jpg" alt="Traces Discover Screenshot" /></p>
<p>This simple query reveals the most problematic transactions. From the results table, a single click on any trace opens a detailed, end-to-end trace waterfall view—right there in Discover. No context switching, no new browser tabs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/traces-discover-screenshot2-min.jpg" alt="Traces Discover Screenshot" /></p>
<p>The waterfall reveals that a downstream currency service is taking a long time to respond. But why? You can now refine your ES|QL query to ask a more sophisticated question, digging into the span attributes to find the specific downstream service call that is causing the bottleneck:</p>
<p>ES|QL</p>
<pre><code class="language-sql">FROM traces-*
| WHERE service.name == &quot;currency&quot; and span.name == &quot;Currency/Convert&quot;
| SORT span.duration.us DESC
</code></pre>
<p>With this query, you’ve instantly found the exact spans within the <code>currency</code> service that are impacting your place order API. You can see all span details and attributes, the duration, and the <code>trace.id</code> giving you full transaction context.</p>
<p>The workflow now uses a single tool, Discover, for an iterative process of discovery and refinement.</p>
<h2>Benefits of a Unified Experience</h2>
<p>These new capabilities simplify complex workflows that are otherwise difficult to achieve:</p>
<ul>
<li><strong>Correlate Everything:</strong> Combine trace filters with log messages, infrastructure metrics, or any other data you have in Elasticsearch. Find a slow trace and immediately see the corresponding logs from the affected pod, all in a single view.</li>
<li><strong>Enhanced Flexibility:</strong> Go beyond pre-defined filters. Use the full power of ES|QL to group, aggregate, and filter your trace data based on any attribute, providing comprehensive data analysis options.</li>
<li><strong>Integrated Experience:</strong> Move from a high-level ES|QL query to a detailed trace waterfall without ever breaking your investigative flow.</li>
</ul>
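<p>The aggregation capabilities mentioned above go beyond filtering and sorting. As a hedged sketch (reusing the index pattern and field names from the queries earlier in this post), the following ES|QL query ranks span names in the <code>currency</code> service by 95th-percentile latency:</p>
<pre><code class="language-sql">FROM traces-*
| WHERE service.name == &quot;currency&quot;
| STATS p95_us = PERCENTILE(span.duration.us, 95) BY span.name
| SORT p95_us DESC
</code></pre>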
<h2>Looking Ahead: Investigation in Discover</h2>
<p>Traces in Discover, powered by ES|QL, delivers a flexible and potent investigative toolset. This is just the beginning, with more to come, including improved correlation between spans, logs, and exceptions within Discover, additional ES|QL commands, and a more powerful UI to make analysis of complex traces easier.</p>
<p>We invite you to dive in and experience it for yourself. Bring your most complex questions and your trickiest bugs. You can now find answers more directly in Discover. This functionality is already available on Serverless. Existing users hosting Elastic themselves will need to upgrade to 8.19+ or 9.1+ to access it.</p>
<p>Try it out today on <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> or the <a href="https://otel.demo.elastic.co/">OpenTelemetry demo</a>. We look forward to hearing your feedback as you explore traces in Discover!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-discover-traces-apm/cover.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distribution of OpenTelemetry Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-collector</guid>
            <pubDate>Fri, 09 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic's dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor-agnostic data collection framework.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to reinstrument their observability when switching platforms.</p>
<p>Over the past year, Elastic has made several notable contributions to the OpenTelemetry ecosystem. We <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated our Elastic Common Schema (ECS)</a> to OpenTelemetry, successfully <a href="https://opentelemetry.io/blog/2024/elastic-contributes-continuous-profiling-agent/">integrated the eBPF-based profiling agent</a>, and have consistently been one of the top contributing companies across the OpenTelemetry project. Additionally, Elastic has significantly improved upstream logging capabilities within OpenTelemetry with enhancements to key areas such as <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container logging</a>, further enhancing the framework’s robustness.</p>
<p>These efforts demonstrate our strategic focus on enhancing and expanding the capabilities of OpenTelemetry for the broader observability community and reinforce the vendor-agnostic benefits of using OpenTelemetry.</p>
<p>Today, we are thrilled to announce the technical preview of the Elastic Distribution of OpenTelemetry Collector. This new offering underscores Elastic’s dedication to this important framework and highlights our ongoing contributions to make OpenTelemetry the best vendor agnostic data collection framework.</p>
<h2>Elastic Agent as an OpenTelemetry Collector<a id="elastic-agent-as-an-opentelemetry-collector"></a></h2>
<p>Technically, the Elastic Distribution of OpenTelemetry Collector represents an evolution of the Elastic Agent. In its latest version, the Elastic Agent can operate in an OpenTelemetry mode. This mode invokes a module within the Elastic Agent which is essentially a distribution of the OpenTelemetry collector. It is crafted using a selection of upstream components from the contrib distribution.</p>
<p>The Elastic OpenTelemetry Collector also includes configuration for this set of <a href="https://github.com/elastic/elastic-agent/tree/main/internal/pkg/otel#components">upstream OpenTelemetry Collector components</a>, providing out-of-the-box functionality with Elastic Observability. This integration allows users to seamlessly utilize Elastic’s advanced observability features with minimal setup.</p>
<p>The technical preview version of the Elastic OpenTelemetry Collector has been tailored with out-of-the-box configurations for the use cases below, and we will keep adding more as we progress:</p>
<ul>
<li>
<p><strong><em>Collect and ship logs</em></strong>: Use the Elastic OpenTelemetry Collector to gather log data from various sources and ship it directly to Elastic, where it can be analyzed in Kibana Discover and Elastic Observability’s Explorer (also in Tech Preview in 8.15).</p>
</li>
<li>
<p><strong><em>Assess host health</em></strong>: Leverage the OpenTelemetry host metrics and Kubernetes receivers to evaluate the performance of hosts and pods. This data can then be visualized and analyzed in Elastic’s Infrastructure Observability UIs, providing deep insights into host performance and health. Details of how this is configured in the OTel Collector are outlined in this <a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">blog</a>.</p>
</li>
<li>
<p><strong>Kubernetes container logs</strong>: Additionally, users of the Elastic OpenTelemetry Collector benefit from out-of-the-box Kubernetes container and application logs enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> Elastic recently contributed to OTel. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
</li>
</ul>
<p>While the Elastic OpenTelemetry Collector comes pre-built and preconfigured for an easier onboarding and getting-started experience, Elastic is committed to the vision of vendor-neutral data collection. Thus, we strive to contribute Elastic-specific features back to the upstream OpenTelemetry components to advance and help grow the OpenTelemetry landscape and its capabilities.</p>
<p>Stay tuned for upcoming announcements sharing our plans to combine the best of Elastic Agent and OpenTelemetry Collector.</p>
<h2>Get started with the Elastic Distribution of OpenTelemetry Collector<a id="get-started-the-elastic-distribution-for-opentelemetry-collector"></a></h2>
<p>To get started with a guided onboarding flow for the Elastic Distribution of the OpenTelemetry Collector for Kubernetes, Linux, and Mac environments, visit the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/guided-onboarding.md">guided onboarding documentation</a>.</p>
<p>For more advanced manual configuration, follow the <a href="https://github.com/elastic/opentelemetry/blob/main/docs/manual-configuration.md">manual configuration instructions</a>.</p>
<p>Once the Elastic Distribution of the OpenTelemetry Collector is set up and running, you’ll be able to analyze your systems within various features of the Elastic Observability solution.</p>
<p>Analyze the performance and health of your infrastructure through metrics and logs collected by OpenTelemetry Collector receivers, such as the host metrics receiver and the various Kubernetes receivers.</p>
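<p>To give a concrete idea of how such a pipeline is wired together, below is a minimal, illustrative collector configuration connecting the host metrics receiver to the Elasticsearch exporter. The endpoint and API key values are placeholders to replace with your own, and the exact component set of the Elastic distribution may differ by version:</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      network:

exporters:
  elasticsearch:
    endpoints: ["https://my-deployment.es.example.com:443"]  # placeholder
    api_key: ${env:ELASTIC_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [elasticsearch]
</code></pre>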
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/hosts.png" alt="OTel Monitoring Hosts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-daemonset-green-logs.png" alt="OTel Logs" /></p>
<p>With the Elastic OpenTelemetry Collector, container and application logs are enriched with Kubernetes metadata out of the box, making filtering, grouping, and log analysis easier and more efficient.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/explorer.png" alt="OTel Discover" /></p>
<p>The Elastic Distribution of the OpenTelemetry Collector supports tracing just like any other collector distribution built from upstream components. Explore and analyze the performance and runtime behavior of your applications and services through RED metrics, service maps, and distributed traces collected from OpenTelemetry SDKs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/apm.png" alt="OTel APM" /></p>
<p>The capabilities and features described above, packed into the Elastic OpenTelemetry Collector, can also be achieved with a custom build of the upstream OpenTelemetry Collector that includes the right set of upstream components. To do just that, follow our <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector/customization">guidance here</a>.</p>
<h2>Outlook<a id="outlook"></a></h2>
<p>The launch of the technical preview of the Elastic Distribution of OpenTelemetry Collector is another step on Elastic’s journey towards OpenTelemetry based observability. On that journey we are committed to a vendor-agnostic approach to data collection and therefore prioritize upstream contribution to OpenTelemetry over Elastic-specific data collection features.</p>
<p>Stay tuned to see more of Elastic’s contributions to OpenTelemetry and observe Elastic’s journey towards fully OpenTelemetry-based observability.</p>
<p>Additional resources for OpenTelemetry with Elastic:</p>
<ul>
<li>
<p>Elastic Distributions recently introduced:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry's Java SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry's Python SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry's NodeJS SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry's .NET SDK</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry for iOS and Android</a></p>
</li>
</ul>
</li>
<li>
<p>Other Elastic OpenTelemetry resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
</ul>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-collector/otel-collector-announcement.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Announcing GA of Elastic distribution of the OpenTelemetry Java Agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-java-agent</guid>
            <pubDate>Thu, 12 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic announces general availability of the Elastic distribution of the OpenTelemetry (OTel) Java Agent, a fully OTel-compatible agent with a rich set of useful additional features.]]></description>
            <content:encoded><![CDATA[<p>As Elastic continues its commitment to OpenTelemetry (OTel), we are excited to announce general availability of the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution of OpenTelemetry Java (EDOT Java)</a>. EDOT Java is a fully compatible drop-in replacement for the OTel Java agent that comes with a set of built-in, useful extensions for powerful additional features and improved usability with Elastic Observability. Use EDOT Java to start the OpenTelemetry SDK with your Java application, and automatically capture tracing data, performance metrics, and logs. Traces, metrics, and logs can be sent to any OpenTelemetry Protocol (OTLP) collector you choose.</p>
<p>With EDOT Java you have access to all the features of the OpenTelemetry Java agent plus:</p>
<ul>
<li>Access to SDK improvements and bug fixes contributed by the Elastic team before the changes are available upstream in OpenTelemetry repositories.</li>
<li>Access to optional features that can enhance OpenTelemetry data that is being sent to Elastic (for example, inferred spans and span stacktrace).</li>
</ul>
<p>In this blog post, we will explore the rationale behind our unique distribution, detailing the powerful additional features it brings to the table. We will provide an overview of how these enhancements can be utilized with our distribution, the standard OTel SDK, or the vanilla OTel Java agent. Stay tuned as we conclude with a look ahead at our future plans and what you can expect from Elastic contributions to OTel Java moving forward.</p>
<h2>Elastic Distribution of OpenTelemetry Java (EDOT Java)</h2>
<p>Until now, Elastic users looking to monitor their Java services through automatic instrumentation had two options: the proprietary Elastic APM Java agent or the vanilla OTel Java agent. While both agents offer robust capabilities and have reached a high level of maturity, each has its distinct advantages and limitations. The OTel Java agent provides extensive instrumentation across a broad spectrum of frameworks and libraries, is highly extensible, and natively emits OTel data. Conversely, the Elastic APM Java agent includes several powerful features absent in the OTel Java agent.</p>
<p>Elastic’s distribution of the OTel Java agent aims to bring together the best aspects of the proprietary Elastic Java agent and the OpenTelemetry Java agent. This distribution enhances the vanilla OTel Java agent with a set of additional features realized through extensions, while still being a fully compatible drop-in replacement.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/1.png" alt="Elastic distribution of the OpenTelemetry Java agent" /></p>
<p>Elastic’s commitment to OpenTelemetry not only focuses on standardizing data collection around OTel but also includes improving OTel components and integrating Elastic's data collection features into OTel. In this vein, our ultimate goal is to contribute as many features from Elastic’s distribution back to the upstream OTel Java agent; our distribution is designed in such a way that the additional features, realized as extensions, work directly with the OTel SDK. This means they can be used independent of Elastic’s distro — either with the Otel Java SDK or with the vanilla OTel Java agent. We’ll discuss these usage patterns further in the sections below.</p>
<h2>Features included</h2>
<p>The Elastic distribution of the OpenTelemetry Java agent includes a suite of extensions that deliver the features outlined below.</p>
<h3>Inferred spans</h3>
<p>In a <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">recent blog post</a>, we introduced inferred spans, a powerful feature designed to enhance distributed traces with additional profiling-based spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/2.png" alt="Inferred spans" /></p>
<p>Inferred spans (blue spans labeled “internal” in the above image) offer valuable insights into sources of latency within the code that might remain uncaptured by purely instrumentation-based traces. In other words, they fill in the gaps between instrumentation-based traces. The Elastic distribution of the OTel Java agent includes the inferred spans feature. It can be enabled by setting the following environment variable.</p>
<pre><code class="language-bash">ELASTIC_OTEL_INFERRED_SPANS_ENABLED=true
</code></pre>
<h3>Correlation with profiling</h3>
<p>With <a href="https://opentelemetry.io/blog/2024/profiling/">OpenTelemetry embracing profiling</a> and <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's proposal to donate its eBPF-based, continuous profiling agent</a>, a new frontier opens up in correlating distributed traces with continuous profiling data. This integration offers unprecedented code-level insights into latency issues and CO2 emission footprints, all within a clearly defined service, transaction, and trace context. To get started, follow <a href="https://www.elastic.co/observability-labs/blog/universal-profiling-with-java-apm-services-traces">this guide</a> to setup universal profiling and the OpenTelemetry integration. In order to get more background information on the feature, check out <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">this blog article</a>, where we explore how these technologies converge to enhance observability and environmental consciousness in software development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/3.png" alt="Correlation with profiling" /></p>
<p>Users of Elastic Universal Profiling can already leverage the Elastic distribution of the OTel Java agent to access this powerful integration. With Elastic's proposed donation of the profiling agent, we anticipate that this capability will soon be available to all OTel users who employ the OTel Java agent in conjunction with the new OTel eBPF profiling.</p>
<h3>Span stack traces</h3>
<p>In many cases, spans within a distributed trace are relatively coarse-grained, particularly when features like inferred spans are not used. Understanding precisely where in the code path a span originates can be incredibly valuable. To address this need, the Elastic distribution of the OTel Java agent includes the span stack traces feature. This functionality provides crucial insights by collecting corresponding stack traces for spans that exceed a configurable minimum duration, pinpointing exactly where a span is initiated in the code.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/4.png" alt="Span stack traces" /></p>
<p>This simple yet powerful feature significantly enhances problem troubleshooting, offering developers a clearer understanding of their application’s performance dynamics.</p>
<p>In the example above, it lets you get the call stack of a gRPC call, which can help you understand which code paths triggered it.</p>
<h3>Auto-detection of service and cloud resources</h3>
<p>In today's expansive and diverse cloud environments, which often include multiple regions and cloud providers, having information on where your services are operating is incredibly valuable. Particularly in Java services, where the service name is frequently embedded within the deployment artifacts, the ability to automatically retrieve service and cloud resource information marks a substantial leap in usability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/5.png" alt="Auto-detection of service and cloud resources" /></p>
<p>To address this need, the Elastic distribution of the OTel Java agent includes built-in auto detectors for service and cloud resources, specifically for AWS and GCP, sourced from <a href="https://github.com/open-telemetry/opentelemetry-java-contrib">the OpenTelemetry Java Contrib repository</a>. This feature, which is on by default, enhances observability and streamlines the management of services across various cloud platforms, making it a key asset for any cloud-based deployment.</p>
<h2>Ways to use the EDOT Java</h2>
<p>The Elastic distribution of the OTel Java agent is designed to meet our users exactly where they are, accommodating a variety of needs and strategic approaches. Whether you're looking to fully integrate new observability features or simply enhance existing setups, the Elastic distribution offers multiple technical pathways to leverage its capabilities. This flexibility ensures that users can tailor the agent's implementation to align perfectly with their specific operational requirements and goals.</p>
<h3>Using Elastic’s distribution directly</h3>
<p>The most straightforward path to harnessing the capabilities described above is by adopting the Elastic distribution of the OTel Java agent as a drop-in replacement for the standard OTel Java agent. Structurally, the Elastic distro functions as a wrapper around the OTel Java agent, maintaining full compatibility with all upstream configuration options and incorporating all its features. Additionally, it includes the advanced features described above that significantly augment its functionality. Users of the Elastic distribution will also benefit from the comprehensive technical support provided by Elastic, which will commence once the agent achieves general availability. To get started, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent">download the agent Jar file</a> and attach it to your application:</p>
<pre><code class="language-bash">java -javaagent:/pathto/elastic-otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions with the vanilla OTel Java agent</h3>
<p>If you prefer to continue using the vanilla OTel Java agent but wish to take advantage of the features described above, you have the flexibility to do so. We offer a separate agent extensions package specifically designed for this purpose. To integrate these enhancements, simply <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-agentextension">download and place the extensions jar file</a> into a designated directory and configure the OTel Java agent extensions directory:</p>
<pre><code class="language-bash">OTEL_JAVAAGENT_EXTENSIONS=/pathto/elastic-otel-agentextension.jar
java -javaagent:/pathto/otel-javaagent.jar -jar myapp.jar
</code></pre>
<h3>Using Elastic’s extensions manually with the OTel Java SDK</h3>
<p>If you build your instrumentations directly into your applications using the OTel API and rely on the OTel Java SDK instead of the automatic Java agent, you can still use the features we've discussed. Each feature is designed as a standalone component that can be integrated with the OTel Java SDK framework. To implement these features, simply refer to the specific descriptions for each one to learn how to configure the OTel Java SDK accordingly:</p>
<ul>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Setting up the inferred spans feature with the SDK</a></li>
<li><a href="https://github.com/elastic/elastic-otel-java/tree/main/universal-profiling-integration">Setting up profiling correlation with the SDK</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">Setting up the span stack traces feature with the SDK</a></li>
<li>Setting up resource detectors with the SDK
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/resource-providers">Service resource detectors</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources">AWS resource detector</a></li>
<li><a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources">GCP resource detector</a></li>
</ul>
</li>
</ul>
<p>This approach ensures that you can tailor your observability tools to meet your specific needs without compromising on functionality.</p>
<h2>Future plans and contributions</h2>
<p>We are committed to OpenTelemetry, and our contributions to the OpenTelemetry Java project will continue without limit. Not only are we focused on general improvements within the OTel Java project, but we are also committed to ensuring that the features discussed in this blog post become official extensions to the OpenTelemetry Java SDK/Agent and are included in the OpenTelemetry Java Contrib repository. We have already contributed the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/span-stacktrace">span stack trace feature</a> and initiated the contribution of the inferred spans feature, and we are eagerly anticipating the opportunity to add the profiling correlation feature following the successful integration of Elastic’s profiling agent.</p>
<p>Moreover, our efforts extend beyond the current enhancements; we are actively working to port more features from the Elastic APM Java agent to OpenTelemetry. A particularly ambitious yet thrilling endeavor is our project to enable dynamic configurability of the OpenTelemetry Java agent. This future enhancement will allow for the OpenTelemetry Agent Management Protocol (OpAMP) to be used to remotely and dynamically configure OTel Java agents, improving their adaptability and ease of use.</p>
<p>We encourage you to experience the new Elastic distribution of the OTel Java agent and share your feedback with us. Your insights are invaluable as we strive to enhance the capabilities and reach of OpenTelemetry, making it even more powerful and user-friendly.</p>
<p>Check out more information on Elastic Distributions of OpenTelemetry on <a href="https://github.com/elastic/opentelemetry?tab=readme-ov-file">GitHub</a> and in our latest <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">EDOT blog</a>.</p>
<p>Elastic provides the following components of EDOT:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT) iOS and Android</a></p>
</li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-java-agent/observability-launch-series-3-java-auto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[EDOT SDK central configuration using OpAMP in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp</link>
            <guid isPermaLink="false">elastic-distribution-opentelemetry-sdk-central-configuration-opamp</guid>
            <pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure the Elastic Distributions of OpenTelemetry (EDOT) SDKs centrally via the EDOT Collector using OpAMP in Elastic Observability at scale.]]></description>
            <content:encoded><![CDATA[<p>Managing configuration changes across the large number of services instrumented with the Elastic Distributions of OpenTelemetry (EDOT) SDKs can be challenging and time-consuming.
OpenTelemetry defines the <a href="https://opentelemetry.io/docs/specs/opamp/">Open Agent Management Protocol</a> (OpAMP) for exactly this kind of remote management, and Elastic's proprietary APM agents have long provided a comparable central configuration capability in Elastic Observability.
Combining the two, you can now centrally manage your SDKs from Elastic Observability via the EDOT Collector, which uses OpAMP to dispatch configuration changes to the multitude of services using the EDOT SDKs.</p>
<p>In this article, we will explore the central configuration capabilities for EDOT SDKs with the EDOT Gateway Collector. You will learn how to configure the EDOT SDKs and the EDOT Gateway Collector to enable central configuration. Finally, we'll cover the configuration settings supported through central configuration.</p>
<h2>Central configuration based on the OpenTelemetry Open Agent Management Protocol</h2>
<p>The OpenTelemetry project provides OpAMP for, among other capabilities, the remote management of large fleets of data collection agents.
The central management of EDOT SDKs leverages OpAMP for dispatching configurations.
OpAMP is a client-server network protocol: the OpAMP server is part of the EDOT Collector, and the OpAMP client is part of each EDOT SDK.
The EDOT SDK polls the OpAMP server at regular intervals for configuration updates.
The OpAMP server in the EDOT Collector is provided by the <a href="https://github.com/elastic/opentelemetry-collector-components/blob/main/extension/apmconfigextension/README.md">Elastic APM central configuration extension</a>, which reads the configuration for the EDOT SDKs from Elasticsearch.
The extension, whose technical name is <code>apmconfigextension</code>, is included in the EDOT Collector but must be configured in the Collector configuration to be activated.
The OpAMP specification allows both WebSocket and plain HTTP transports; the EDOT SDKs pull their configuration from the EDOT Collector over plain HTTP.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/central-config-edot.png" alt="The central configuration architecture with EDOT SDKs and EDOT Collector" /></p>
<h2>Prerequisites</h2>
<p>Central configuration of EDOT SDKs requires a standalone EDOT Collector running in Gateway mode.
Other collectors, such as the OpenTelemetry contrib collector or a custom distribution of the collector that you build yourself, require the Elastic APM central configuration extension to be added.</p>
<h3>EDOT versions supporting central configuration</h3>
<p>The following table gives an overview of the EDOT SDK and EDOT Collector versions that provide central configuration support.
Applications and services must be instrumented with an EDOT SDK version listed below to pull and apply central configuration changes.</p>
<table>
<thead>
<tr>
<th>EDOT</th>
<th>Version</th>
</tr>
</thead>
<tbody>
<tr>
<td>Android</td>
<td>1.2.0+</td>
</tr>
<tr>
<td>iOS</td>
<td>1.4.0+</td>
</tr>
<tr>
<td>Java</td>
<td>1.5.0+</td>
</tr>
<tr>
<td>Node.js</td>
<td>1.2.0+</td>
</tr>
<tr>
<td>PHP</td>
<td>1.1.1+</td>
</tr>
<tr>
<td>Python</td>
<td>1.4.0+</td>
</tr>
<tr>
<td>Collector (Gateway mode)</td>
<td>8.19, 9.1+</td>
</tr>
</tbody>
</table>
<p>Central configuration is not blocking application startup for EDOT Java, Node.js, PHP and Python.
The application starts with default configuration or the configuration provided by environment variables.
When the central configuration settings are successfully pulled, they will take precedence over local configuration settings.
EDOT .NET currently does not support central configuration although it’s planned to add support.</p>
<p>Furthermore, central configuration for EDOT SDKs is not supported on Elastic Cloud Serverless or the Elastic Cloud Managed OTLP Endpoint, yet.
The following table gives an overview of the versions of the Elastic stack that support central configuration of EDOT SDKs.</p>
<table>
<thead>
<tr>
<th>Elastic</th>
<th>Version</th>
</tr>
</thead>
<tbody>
<tr>
<td>Self-managed</td>
<td>9.1.0+</td>
</tr>
<tr>
<td>Elastic Cloud Hosted</td>
<td>9.1.0+</td>
</tr>
</tbody>
</table>
<h3>Retrieve Elasticsearch endpoint and API key</h3>
<p>The Elastic APM central configuration extension needs the Elasticsearch endpoint in the configuration to be able to read the central configuration settings for the EDOT SDKs.
The Elasticsearch endpoint is the same that the Elasticsearch exporter uses to export telemetry data to Elasticsearch.
The <a href="https://www.elastic.co/docs/reference/opentelemetry/central-configuration">central configuration documentation</a> describes the steps to obtain the endpoint and the API key in more detail.</p>
<h2>Enable central configuration for EDOT</h2>
<p>To enable central configuration, the EDOT Collector needs the <code>apmconfigextension</code> to be configured as part of the <code>extensions</code> section.
This requires the Elasticsearch endpoint obtained above and an Elasticsearch API key.
The environment variable <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> needs to be set to enable central configuration in the EDOT SDK.</p>
<p>In the following, the configuration is explained for the EDOT Gateway Collector and the EDOT SDKs.</p>
<h3>Configure the EDOT Collector to enable central configuration</h3>
<p>Central configuration support in the EDOT Collector is enabled by adding the configuration of the Elastic APM central configuration extension configuration to the configuration file.
For the authentication of the <code>apmconfigextension</code> with the Elasticsearch endpoint, <code>bearertokenauth</code> authenticator is configured.
This configures a client type authenticator for outgoing requests to the Elasticsearch endpoint.
The <code>apmconfigextension</code> acts as client and Elasticsearch endpoint as server.
The <code>apmconfig</code> section configures the OpAMP server endpoint. EDOT SDKs will connect to the endpoint to fetch the configuration.
The <code>service</code> section activates the <code>apmconfig</code> and <code>bearertokenauth</code> extension.
The following code snippet shows the configuration excerpt of the EDOT Collector including the <code>bearertoken</code> authenticator configuration and the <code>apmconfig</code> configuration for central configuration.</p>
<pre><code class="language-yaml">extensions:
  bearertokenauth:
    scheme: &quot;APIKey&quot;
    token: &quot;&lt;ENCODED_ELASTICSEARCH_APIKEY&gt;&quot;
  apmconfig:
    source:
      elasticsearch:
        endpoint: &quot;&lt;YOUR_ELASTICSEARCH_ENDPOINT&gt;&quot;
        auth:
          authenticator: bearertokenauth
    opamp:
      protocols:
        http:
          # Default is localhost:4320
          # To specify a custom endpoint, uncomment the following line
          # and set it to the custom endpoint
          # endpoint: &quot;&lt;CUSTOM_OPAMP_ENDPOINT&gt;&quot;

service:
  extensions: [bearertokenauth, apmconfig]
</code></pre>
<p><a href="https://www.elastic.co/docs/reference/edot-collector/download">Download</a> the EDOT collector and include the configuration from the snippet above in the <code>otel.yml</code> configuration file to enable central configuration.
The <code>otel.yml</code> configuration file examples are available in the <a href="https://www.elastic.co/docs/reference/edot-collector/config/default-config-standalone#agent-mode">EDOT Collector documentation</a>.
Consider the example for direct ingestion into Elasticsearch.</p>
<h3>Configure the EDOT SDKs to enable central configuration</h3>
<p>To enable central configuration in the EDOT Java, Node.js, PHP, and Python SDKs, set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable to the OpAMP server endpoint of the EDOT Collector and set the required resource attributes.</p>
<h4>Enable central configuration of EDOT SDKs</h4>
<p>The following code snippet shows how to set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable with the <code>export</code> command in a shell.</p>
<pre><code class="language-bash">export ELASTIC_OTEL_OPAMP_ENDPOINT=&quot;http://&lt;your-opamp-end-point&gt;:4320/v1/opamp&quot;
</code></pre>
<p><code>&lt;your-opamp-end-point&gt;</code> must be set to the address or host name of the EDOT Gateway Collector that provides the OpAMP server endpoint for central configuration.</p>
<p>If you are using the mobile EDOT SDKs, the documentation shows how to activate central configuration support in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/android/configuration#central-configuration">EDOT Android</a> and <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/ios/configuration#central-configuration-edot">EDOT iOS</a>.</p>
<h4>Configure resource attributes</h4>
<p>Central configuration requires the OpenTelemetry resource attributes <code>service.name</code> and <code>deployment.environment.name</code> to be set.
While <code>service.name</code> is mandatory, <code>deployment.environment.name</code> is optional but recommended.
If <code>deployment.environment.name</code> is unset, no configuration can be created that applies to a whole environment.</p>
<p>Set the <code>OTEL_RESOURCE_ATTRIBUTES</code> environment variable, including <code>service.name</code> and <code>deployment.environment.name</code>, as in the following code snippet.
The key-value pairs are concatenated with a comma as separator and provided as the value of the <code>OTEL_RESOURCE_ATTRIBUTES</code> environment variable.</p>
<pre><code class="language-bash">export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment.name=production,service.name=my-app&quot;
</code></pre>
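<p>Putting it together, a service can be started with both environment variables set. The following sketch assumes a Java service instrumented with the EDOT Java agent; <code>elastic-otel-javaagent.jar</code> and <code>my-app.jar</code> are placeholder file names, not taken from this post.</p>
<pre><code class="language-bash"># Hypothetical example; replace the placeholder endpoint and file names
export ELASTIC_OTEL_OPAMP_ENDPOINT=&quot;http://&lt;your-opamp-end-point&gt;:4320/v1/opamp&quot;
export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment.name=production,service.name=my-app&quot;
java -javaagent:elastic-otel-javaagent.jar -jar my-app.jar
</code></pre>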
<h2>Supported configuration settings</h2>
<p>The following tables give an overview of the supported central configuration settings at the time of writing.</p>
<h3>Non-mobile EDOT SDKs</h3>
<p>The table shows the supported central configuration settings of the EDOT Java, Node.js, PHP, and Python SDKs.</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
<th>Java</th>
<th>Node.js</th>
<th>PHP</th>
<th>Python</th>
<th>Kibana</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>logging_level</code></td>
<td>The EDOT SDK's own logging level</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>1.1.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>deactivate_instrumentations</code></td>
<td>Turn off <strong>selected</strong> instrumentations</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>deactivate_all_instrumentations</code></td>
<td>Turn off <strong>all</strong> instrumentations</td>
<td>1.5.0+</td>
<td>1.2.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_traces</code></td>
<td>Controls if traces should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_metrics</code></td>
<td>Controls if metrics should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>send_logs</code></td>
<td>Controls if logs should be sent</td>
<td>1.5.0+</td>
<td>1.3.0+</td>
<td>-</td>
<td>-</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>opamp_polling_interval</code></td>
<td>Time between consecutive central configuration pull requests</td>
<td>1.6.0+</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>9.2.0+ (planned)</td>
</tr>
<tr>
<td><code>sampling_rate</code></td>
<td>Trace sampling rate for head-based sampling</td>
<td>1.6.0+</td>
<td>-</td>
<td>-</td>
<td>1.7.0+</td>
<td>9.2.0+ (planned)</td>
</tr>
<tr>
<td><code>infer_spans</code></td>
<td>Activates/Deactivates inferred spans</td>
<td>1.7.0+</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>9.2.0+ (planned)</td>
</tr>
</tbody>
</table>
<h3>Mobile EDOT SDKs</h3>
<p>The table below shows the supported central configuration settings of the EDOT Android and iOS SDKs.</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
<th>Android</th>
<th>iOS</th>
<th>Kibana</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>recording</code></td>
<td>Record and send telemetry</td>
<td>1.2.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
<tr>
<td><code>session_sample_rate</code></td>
<td>Sampling rate for session-based sampling</td>
<td>1.2.0+</td>
<td>1.4.0+</td>
<td>9.1.0+</td>
</tr>
</tbody>
</table>
<h2>Use Elastic Observability to change configuration settings of EDOT SDKs</h2>
<p>An application must produce and send telemetry data, otherwise the EDOT SDK will not appear in the Agent Configuration UI in Elastic Observability: Agent Configuration has no knowledge of an EDOT SDK until telemetry data is received from it.
The OpenTelemetry resource attribute <code>service.name</code> is used as the key to assign a configuration to an EDOT SDK.
Currently, EDOT SDKs do not show up in the Agent Explorer.</p>
<p>Go to Kibana -&gt; Observability -&gt; Applications -&gt; Service Inventory -&gt; Settings -&gt; Agent Configuration to create a new configuration for your EDOT SDK.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/edot-sdk-configuration-deactivate-all-instrumentations.png" alt="Deactivate all instrumentations of EDOT Java SDK in Kibana" /></p>
<p>The EDOT Java SDK configuration above deactivates all instrumentations by setting <code>deactivate_all_instrumentations</code> to <code>true</code>, which is useful when switching off the instrumentations is necessary.
The <code>sampling_rate</code> setting comes in handy when the sampling rate of the EDOT SDK should be changed; select a sampling rate according to your needs, as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/edot-sdk-configuration-sampling-rate.png" alt="Set the sampling rate of EDOT Java SDK in Kibana" /></p>
<h2>Disable central configuration</h2>
<p>Disabling central configuration is straightforward. To disable central configuration support in the EDOT SDKs, remove the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable and restart the application.
To disable the Elastic APM central configuration extension in the EDOT Collector, remove the <code>apmconfig</code> extension from the <code>service</code> section of the collector configuration and restart the collector.</p>
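<p>As a sketch, after removing <code>apmconfig</code> from the active extensions, the <code>service</code> section of the collector configuration would look like the following. The <code>apmconfig</code> and <code>bearertokenauth</code> definitions in the <code>extensions</code> section can also be deleted if nothing else uses them.</p>
<pre><code class="language-yaml"># Sketch: apmconfig is no longer activated; keep bearertokenauth only
# if another component still depends on it
service:
  extensions: [bearertokenauth]
</code></pre>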
<h2>Elastic's contribution to OpenTelemetry</h2>
<p>Elastic is committed to the OpenTelemetry project.
Elastic contributed the Java OpAMP client implementation to the OpenTelemetry project (<a href="https://github.com/open-telemetry/opentelemetry-java-contrib/pull/2021">GitHub PR</a>) and is working on the contributions for Python (<a href="https://github.com/open-telemetry/opentelemetry-python-contrib/pull/3635">GitHub PR</a>) and Node.js.
The OpAMP client for PHP will be part of a larger contribution (<a href="https://github.com/open-telemetry/community/issues/2846">GitHub issue</a>) that Elastic is making.</p>
<h2>Conclusion</h2>
<p>In this article, you learned how to centrally configure the EDOT SDKs at scale in Elastic Observability with the Gateway Collector and OpAMP.
You learned how to configure the <code>apmconfigextension</code> in the collector, how to set the <code>ELASTIC_OTEL_OPAMP_ENDPOINT</code> environment variable to enable central configuration in the EDOT SDKs, which versions of the SDKs and collector support central configuration, and which configuration settings are currently supported.
Now you can leverage central configuration in large deployments to manage the configuration of the EDOT SDKs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distribution-opentelemetry-sdk-central-configuration-opamp/elastic-distribution-opentelemetry-sdk-central-configuration-opamp.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Distributions of OpenTelemetry (EDOT) Now GA: Open-Source, Production-Ready OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga</link>
            <guid isPermaLink="false">elastic-distributions-opentelemetry-ga</guid>
            <pubDate>Wed, 02 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce General Availability of Elastic Distributions of OpenTelemetry (EDOT), which contains Elastic’s versions of the OpenTelemetry Collector and several language SDKs like Python, Java, .NET, and NodeJS. These help provide enhanced features and enterprise-grade support for EDOT.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to announce <strong>general availability of</strong> <strong>Elastic Distributions of OpenTelemetry (EDOT)!</strong> EDOT is a fully open distribution of the OpenTelemetry collector and language SDKs, providing SREs and developers with a stable, production-tested OTel ecosystem backed by enterprise-grade support.</p>
<p>While OTel components are feature-rich, enhancements through the community can take time, and support is left up to the community or to individual users and organizations. EDOT delivers the following benefits to end users:</p>
<ul>
<li>
<p>Production-ready, backed by expert OTel support</p>
</li>
<li>
<p>No vendor lock-in - no proprietary add-ons</p>
</li>
<li>
<p>Preserving OpenTelemetry standards - no schema conversion</p>
</li>
</ul>
<h2>EDOT Collector and SDKs are GA</h2>
<p>Elastic Distributions of OpenTelemetry (EDOT) is a curated collection of OpenTelemetry components: the EDOT Collector and language SDKs. It is designed to support OTel telemetry from applications and shared infrastructure such as hosts or Kubernetes.</p>
<p>Highlighted below are all the EDOT components that are now GA.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/edot-components.png" alt="EDOT Components" /></p>
<ul>
<li>
<p><strong>Elastic Distribution of OpenTelemetry (EDOT) Collector</strong> - The EDOT Collector is the OTel Collector with Elastic's set of receivers, processors, and exporters for sending OTel data to Elastic</p>
</li>
<li>
<p><strong>Elastic Distribution of OpenTelemetry (EDOT) SDKs</strong> <strong>&amp; zero-code instrumentation</strong> - Users have an option to instrument with the SDKs or choose to use zero-code instrumentation. Here are all the SDKs currently available in EDOT:</p>
<ul>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Java</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Python</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) NodeJS</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) .NET</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) PHP</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) iOS</p>
</li>
<li>
<p>Elastic Distribution of OpenTelemetry (EDOT) Android</p>
</li>
</ul>
</li>
</ul>
<p>Details and documentation for EDOT are available in our public <a href="https://www.elastic.co/docs/reference/opentelemetry/">EDOT documentation</a> and our <a href="https://github.com/elastic/opentelemetry">EDOT GitHub repository</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/EDOT-Overview.png" alt="EDOT overview" /></p>
<p>To learn more about the ease of use, particularly with Kubernetes, check out our previous blog <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Ingest Kubernetes and application telemetry in 3 steps with EDOT</a>.</p>
<h2>What an SRE gains with EDOT</h2>
<p><strong>Production-ready, Backed by Expert OTel Support</strong></p>
<p>Enterprises adopting OpenTelemetry often struggle with unreliable support, slow bug fixes, and untested updates, leading to operational risk, downtime, and increased troubleshooting effort. Without enterprise-grade guarantees, teams are left to resolve issues on their own, increasing maintenance overhead and slowing adoption.</p>
<p>EDOT delivers enterprise-grade support backed by OpenTelemetry experts, ensuring stability, proactive fixes beyond OpenTelemetry’s release cycles and production-tested reliability. With rapid issue resolution and expert guidance, EDOT enables organizations to confidently adopt and scale OpenTelemetry without operational disruptions or added maintenance burden.</p>
<p><strong>No Vendor Lock-In—No Proprietary Add-Ons</strong></p>
<p>Observability vendors have traditionally built proprietary agents and ingestion pipelines, allowing them to control data flows and lock in users.</p>
<p>Elastic Distributions of OpenTelemetry (EDOT) offers a fully open, vendor-neutral approach to observability. As a curated portfolio of OpenTelemetry components, EDOT enhances infrastructure and application monitoring with Elastic Observability—without proprietary modifications.</p>
<p>All enhancements and fixes are contributed back to the OpenTelemetry community, ensuring EDOT remains a stable, standards-compliant distribution that stays aligned with upstream OpenTelemetry. This guarantees interoperability, seamless upgrades, and freedom from vendor lock-in.</p>
<p><strong>Preserving OpenTelemetry Standards for Richer Context</strong></p>
<p>When vendors modify OpenTelemetry data and schemas by introducing proprietary translations that disrupt interoperability, they create vendor lock-in and increase complexity. These modifications force operations teams to manage custom integrations, convert schemas, and sometimes result in each signal requiring its own query language and tooling, adding unnecessary overhead and limiting flexibility.</p>
<p>Elastic has re-architected its platform with an OTel-first approach that preserves the OpenTelemetry data model. OTel data can now be used in its original specification to power Elastic dashboards, analytics, alerts, and other functionality without the need for schema conversions – it just works.</p>
<p>With Elasticsearch as a single backend for all OpenTelemetry signals, users can store and query observability data in a unified, OTel-native format. Combined with ES|QL, a powerful and flexible query language, SREs get effortless correlation of logs, metrics, and traces using OpenTelemetry resource attributes. The result is a faster, more intuitive way to analyze system health and performance—all in one place.</p>
<h2>Get Started Today</h2>
<p>EDOT is available to all Elastic customers. Whether you’re adopting OpenTelemetry for the first time or looking for a reliable distribution with enterprise-grade support, EDOT ensures a smooth, OpenTelemetry-first experience.</p>
<p>Check out our <a href="https://www.elastic.co/docs/reference/opentelemetry/">EDOT documentation</a> and our <a href="https://github.com/elastic/opentelemetry">EDOT GitHub repository</a> and get started today!</p>
<p>Additionally, check out some of the blogs detailing our components:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry-ga/edot-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry</link>
            <guid isPermaLink="false">elastic-distributions-opentelemetry</guid>
            <pubDate>Thu, 15 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce Elastic Distributions of OpenTelemetry (EDOT), which contains Elastic’s versions of the OpenTelemetry Collector and several language SDKs like Python, Java, .NET, and NodeJS. These help provide enhanced features and enterprise-grade support for EDOT.]]></description>
            <content:encoded><![CDATA[<p>We are announcing the availability of Elastic Distributions of OpenTelemetry (EDOT). These Elastic distributions, currently in tech preview,  have been developed to enhance the capabilities of standard OpenTelemetry distributions and improve existing OpenTelemetry support from Elastic. </p>
<p>The Elastic Distributions of OpenTelemetry (EDOT) are composed of OpenTelemetry (OTel) project components, the OTel Collector and language SDKs, which provide users with the necessary capabilities and out-of-the-box configurations, enabling quick and effortless infrastructure and application monitoring.</p>
<p>While OTel components are feature-rich, enhancements through the community can take time. Additionally, support is left up to the community or individual users and organizations. Hence, EDOT will bring the following to end users:</p>
<ul>
<li>
<p><strong>Deliver enhanced features earlier than OTel</strong>: By providing features unavailable in the “vanilla” OpenTelemetry components, we can quickly meet customers’ requirements while still providing an OpenTelemetry native and vendor-agnostic instrumentation for their applications. Elastic will continuously upstream these enhanced features.</p>
</li>
<li>
<p><strong>Enhanced OTel support</strong> - By maintaining Elastic distributions, we can better support customers with enhancements and fixes outside of the OTel release cycles. In addition, Elastic support can troubleshoot issues on the EDOT.</p>
</li>
</ul>
<p>EDOT currently includes the following tech preview components, and the list will grow over time:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Elastic Distribution of OpenTelemetry (EDOT) Collector</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution of OpenTelemetry (EDOT) Java</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic Distribution of OpenTelemetry (EDOT) Python</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">Elastic Distribution of OpenTelemetry (EDOT) NodeJS</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">Elastic Distribution of OpenTelemetry (EDOT) .NET</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/apm-ios-android-native-apps">Elastic Distribution of OpenTelemetry (EDOT)  iOS and Android</a></p>
</li>
</ul>
<p>Details and documentation for all EDOT components are available in our public <a href="https://github.com/elastic/opentelemetry">OpenTelemetry GitHub repository</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-components.png" alt="EDOT Components" /></p>
<h2>Elastic Distribution of OpenTelemetry (EDOT) Collector</h2>
<p>The EDOT Collector, recently released with the 8.15 release of Elastic Observability, enhances Elastic's existing OTel capabilities. In addition to service monitoring, the EDOT Collector can forward application logs, infrastructure logs, and metrics using standard OpenTelemetry Collector receivers like the filelog and host metrics receivers.</p>
<p>Additionally, users of the Elastic Distribution of the OpenTelemetry Collector benefit from container logs automatically enriched with Kubernetes metadata by leveraging the powerful <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser/">container log parser</a> that Elastic recently contributed. This OpenTelemetry-based enrichment enhances the context and value of the collected logs, providing deeper insights and more effective troubleshooting capabilities.</p>
<p>This new collector distribution ensures that exported data is fully compatible with the Elastic Platform, enhancing the overall observability experience. Elastic also ensures that Elastic-curated UIs can seamlessly handle both the Elastic Common Schema (ECS) and OpenTelemetry formats.</p>
<h2>Elastic Distributions for Language SDKs</h2>
<p><a href="https://www.elastic.co/guide/en/apm/agent/index.html">Elastic's APM agents</a> have capabilities not yet available in the OTel SDKs. EDOT brings these capabilities into the OTel language SDKs while maintaining seamless integration with Elastic Observability. Elastic will release OTel versions of all its APM agents and continue to add language SDKs mirroring OTel.</p>
<h2>Continued support for Native OTel components</h2>
<p>EDOT does not preclude users from using native components. Users are still able to use:</p>
<ul>
<li>
<p><strong>OpenTelemetry Vanilla Language SDKs:</strong> use standard OpenTelemetry code instrumentation for many popular programming languages, sending OTLP traces to Elastic via the APM server.</p>
</li>
<li>
<p><strong>Upstream Distribution of OpenTelemetry Collector (Contrib or Custom):</strong> Send traces to Elastic via the APM server using the OpenTelemetry Collector with the OTLP receiver and OTLP exporter.</p>
</li>
</ul>
<p>Elastic is committed to contributing EDOT features or components upstream into the OpenTelemetry community, fostering a collaborative environment, and enhancing the overall OpenTelemetry ecosystem.</p>
<h2>Extending our commitment to vendor-agnostic data collection</h2>
<p>Elastic remains committed to supporting OpenTelemetry by being OTel-first and building a vendor-agnostic framework. As OpenTelemetry constantly grows its support of SDKs and components, Elastic will continue to refine and mirror EDOT to OpenTelemetry and push enhancements upstream.</p>
<p>Over the past year, Elastic has been active in OTel through its <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donation of Elastic Common Schema (ECS)</a>, contributions to the native <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">OpenTelemetry Collector</a> and language SDKs, and a recent <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">donation of its Universal Profiling agent</a> to OpenTelemetry. </p>
<p>EDOT  builds on our decision to fully adopt and recommend OpenTelemetry as the preferred solution for observing applications. With EDOT, Elastic customers can future-proof their investments and adopt OpenTelemetry, giving them vendor-neutral instrumentation with Elastic enterprise-grade support.</p>
<p>Our vision is that Elastic will work with the OpenTelemetry community to donate features through the standardization processes and contribute the code to implement them in the native OpenTelemetry components. In time, as OTel capabilities advance and many of the Elastic-exclusive features transition into OpenTelemetry, we look forward to no longer needing Elastic Distributions of OpenTelemetry. In the meantime, we can deliver those capabilities via our OpenTelemetry distributions.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-distributions-opentelemetry/edot-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry and Elastic: Working together to establish continuous profiling for the community]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</link>
            <guid isPermaLink="false">elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry</guid>
            <pubDate>Tue, 12 Mar 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry is embracing profiling. Elastic is donating its whole-system continuous profiling agent to OpenTelemetry to further this advancement, empowering OTel users to improve computational efficiency and reduce their carbon footprint.]]></description>
            <content:encoded><![CDATA[<p>Profiling is emerging as a core pillar of observability, aptly dubbed the fourth pillar, with the OpenTelemetry (OTel) project leading this essential development. This blog post dives into the recent advancements in profiling within OTel and how Elastic® is actively contributing toward it.</p>
<p>At Elastic, we’re big believers in and contributors to the OpenTelemetry project. The project’s benefits of flexibility, performance, and vendor agnosticism have been making their rounds; we’ve seen a groundswell of customer interest.</p>
<p>To this end, after donating our <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq"><strong>Elastic Common Schema</strong></a> and our <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic-based Java agent approach</a>, we recently <a href="https://github.com/open-telemetry/community/issues/1918">announced our intent to donate our continuous profiling agent</a> — a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols, or service restarts.</p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster.</p>
<h2>Enabling profiling in OpenTelemetry: A step toward unified observability</h2>
<p>Elastic actively participates in the OTel community, particularly within the Profiling Special Interest Group (SIG). This group has been instrumental in defining the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data.</p>
<p>The recent merger of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marks a significant milestone. With the standardization of profiles as a core observability pillar alongside metrics, tracing, and logs, OTel offers a comprehensive suite of observability tools, empowering users to gain a holistic view of their applications' health and performance.</p>
<p>In line with this advancement, we are donating our whole-system, eBPF-based continuous profiling agent to OTel. In parallel, we are implementing the experimental OTel profiling signal in the agent to ensure and demonstrate OTel protocol compatibility, preparing it for a fully OTel-based collection of profiling signals that can be correlated with logs, metrics, and traces.</p>
<h2>Why is Elastic donating the eBPF-based profiling agent to OpenTelemetry?</h2>
<p>Computational efficiency has always been a critical concern for software professionals. However, in an era where every line of code affects both the bottom line and the environment, there's an additional reason to focus on it. Elastic is committed to helping the OpenTelemetry community enhance computational efficiency because efficient software not only reduces the cost of goods sold (COGS) but also reduces carbon footprint.</p>
<p>We have seen firsthand — both internally and from our customers' testimonials — how profiling insights aid in enhancing software efficiency. This results in an improved customer experience, lower resource consumption, and reduced cloud costs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/1-flamegraph.png" alt="A differential flamegraph showing regression in release comparison" /></p>
<p>Moreover, adopting a whole-system profiling strategy, such as <a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Elastic Universal Profiling</a>, differs significantly from traditional instrumentation profilers that focus solely on runtime. Elastic Universal Profiling provides whole-system visibility, profiling not only your own code but also third-party libraries, kernel operations, and other code you don't own. This comprehensive approach facilitates rapid optimizations by identifying non-optimal common libraries and uncovering &quot;unknown unknowns&quot; that consume CPU cycles. Often, a tipping point is reached when the resource consumption of libraries or certain daemon processes exceeds that of the applications themselves. Without system-wide profiling, along with the capabilities to slice data per service and aggregate total usage, pinpointing these resource-intensive components becomes a formidable challenge.</p>
<p>At Elastic, we have a customer with an extensive cloud footprint who plans to negotiate with their cloud provider to reclaim money for the significant compute resource consumed by the cloud provider's in-VM agents. These examples highlight the importance of whole-system profiling and the benefits that the OpenTelemetry community will gain if the donation proposal is accepted.</p>
<p>Specifically, OTel users will gain access to a lightweight, battle-tested production-grade continuous profiling agent with the following features:</p>
<ul>
<li>
<p>Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing, and the agent typically manages to stay way below that)</p>
</li>
<li>
<p>Support for native C/C++ executables without the need for DWARF debug information by leveraging .eh_frame data, as described in “<a href="https://www.elastic.co/blog/universal-profiling-frame-pointers-symbols-ebpf">How Universal Profiling unwinds stacks without frame pointers and symbols</a>”</p>
</li>
<li>
<p>Support for profiling system libraries without frame pointers and without debug symbols on the host</p>
</li>
<li>
<p>Support for mixed stacktraces between runtimes — stacktraces go from kernel space through unmodified system libraries all the way into high-level languages</p>
</li>
<li>
<p>Support for native code (C/C++, Rust, Zig, Go, etc.) without debug symbols on the host</p>
</li>
<li>
<p>Support for a broad set of high-level languages (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl), with .NET support in preparation</p>
</li>
<li>
<p><strong>100% non-intrusive:</strong> there's no need to load agents or libraries into the processes that are being profiled</p>
</li>
<li>
<p>No need for any reconfiguration, instrumentation, or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration</p>
</li>
<li>
<p>Support for x86 and Arm64 CPU architectures</p>
</li>
<li>
<p>Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains</p>
</li>
<li>
<p>Support for <a href="https://www.elastic.co/guide/en/observability/current/profiling-probabilistic-profiling.html">Probabilistic Profiling</a> to reduce data storage costs</p>
</li>
<li>
<p>. . . and more</p>
</li>
</ul>
<p>Elastic's commitment to enhancing computational efficiency and our belief in the OpenTelemetry vision underscore our dedication to advancing the observability ecosystem. By donating the profiling agent, Elastic is not only contributing technology but also dedicating a team of specialized profiling domain experts to co-maintain and advance the profiling capabilities within OpenTelemetry.</p>
<h2>How does this donation benefit the OTel community?</h2>
<p>Metrics, logs, and traces offer invaluable insights into system health. But what if you could unlock an even deeper level of visibility? Here's why profiling is a perfect complement to your OTel toolkit:</p>
<h3>1. Deep system visibility: Beyond the surface</h3>
<p>Think of whole-system profiling as an MRI scan for your fleet. It goes deeper into the internals of your system, revealing hidden performance issues lurking beneath the surface. You can identify &quot;unknown unknowns&quot; — inefficiencies you wouldn't have noticed otherwise — and gain a comprehensive understanding of how your system functions at its core.</p>
<h3>2. Cross-signal correlation: Answering &quot;why&quot; with confidence</h3>
<p>The Elastic Universal Profiling agent supports trace correlation with the OTel Java agent/SDK (with Go support coming soon!). This correlation enables OTel users to view profiling data by services or service endpoints, allowing for a more context-aware and targeted root cause analysis. This powerful combination allows you to pinpoint the exact cause of resource consumption at the trace level. No more guessing why specific functions hog CPU or why certain events occur. You can finally answer the critical &quot;why&quot; questions with precision, enabling targeted optimization efforts.</p>
<h3>3. Cost and sustainability optimization: Beyond performance</h3>
<p>Our approach to profiling goes beyond just performance gains. By correlating whole-system profiling data with tracing, we can help you measure the environmental impact and cloud cost associated with specific services and functionalities within your application. This empowers you to make data-driven decisions that optimize both performance and resource utilization, leading to a more sustainable and cost-effective operation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/2-universal-profiling.png" alt="A differential function insight, showing the performance, cost, and CO2 impact of a change" /></p>
<h2>Elastic's commitment to OpenTelemetry</h2>
<p>Elastic currently supports a growing list of Cloud Native Computing Foundation (CNCF) projects <a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">such as Kubernetes (K8S), Prometheus, Fluentd, Fluent Bit, and Istio</a>. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic’s application performance monitoring (APM)</a> also natively supports OTel, ensuring all APM capabilities are available with either Elastic or OTel agents or a combination of the two. In addition to the ECS contribution and ongoing collaboration with OTel SemConv, Elastic <a href="https://www.elastic.co/observability/opentelemetry">has continued to make contributions to other OTel projects</a>, including language SDKs (such as OTel Swift, OTel Go, OTel Ruby, and others), and participates in several <a href="https://github.com/open-telemetry/community#special-interest-groups">special interest groups (SIGs)</a> to establish OTel as a standard for observability and security.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">strengthening relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a>, or contribute to the <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal and join the conversation</a>.</p>
<p>Stay tuned for further updates as the profiling part of OTel continues to evolve.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry/ecs-otel-announcement-1.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Connecting the Dots: ES|QL Joins for Richer Observability Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-esql-join-observability</link>
            <guid isPermaLink="false">elastic-esql-join-observability</guid>
            <pubDate>Thu, 29 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Now in tech preview, ES|QL LOOKUP JOIN lets you enrich logs, metrics, and traces at query time, with no need to denormalize at ingest. Add deployment, infra, or business context dynamically, reduce storage, and accelerate root cause analysis in Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>Connecting the Dots: ES|QL Joins for Richer Observability Insights</h1>
<p>You might have seen our recent announcement about the <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">arrival of SQL-style joins in Elasticsearch</a> with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at it in the context of Observability. How can this new join capability help engineers and SREs make sense of their logs, metrics, and traces, while also making Elasticsearch more storage efficient by reducing how much data has to be denormalized?</p>
<p><strong>Note:</strong> Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.</p>
<p>Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.</p>
<p>Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side. </p>
<p>By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search. </p>
<p>However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs. </p>
<p>With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.</p>
<h2>Why Lookup Joins for Observability</h2>
<p>Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.</p>
<p>Here are some examples:</p>
<ul>
<li>
<p><strong>Deployment Information:</strong> Which version of the code is generating these errors?</p>
</li>
<li>
<p><strong>Infrastructure Mapping:</strong> Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?</p>
</li>
<li>
<p><strong>Business Context:</strong> Are critical customers being affected by this slowdown?</p>
</li>
<li>
<p><strong>Team Ownership:</strong> Which team owns the service throwing these exceptions?</p>
</li>
</ul>
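<p>For instance, the team-ownership question could be answered in a single query. The sketch below assumes a hypothetical <code>service_owners_lkp</code> lookup index (created with <code>index.mode: lookup</code>) that maps <code>service.name</code> to a <code>team.name</code> field:</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | LOOKUP JOIN service_owners_lkp ON service.name
  | STATS error_count = COUNT(*) BY team.name
  | SORT error_count DESC
</code></pre>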
<p>Keeping this kind of information perfectly denormalized onto <em>every single</em> log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.</p>
<p><code>LOOKUP JOIN</code> is ideal here because:</p>
<ol>
<li>
<p><strong>Lookup Indices are Writable:</strong> Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your <em>next</em> ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.</p>
</li>
<li>
<p><strong>Flexibility:</strong> You decide <em>at query time</em> which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.</p>
</li>
<li>
<p><strong>Simpler Setup:</strong> As the original post highlighted, there are no enrich policies to manage. Just create an index with <code>index.mode: lookup</code> and load your data - up to 2 billion documents per lookup index.</p>
</li>
</ol>
<h2>Observability Use Cases &amp; Examples with ES|QL</h2>
<p>Let’s now look at a few examples to see how Lookup Joins can help.</p>
<h3>Enriching Error Logs with Deployment Context</h3>
<p>Let's say you're seeing a spike in errors for your <code>opbeans-ruby</code> service. You have logs flowing into a data stream, but they only contain the service name. The documents don’t have any information about the deployment activity itself.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
</code></pre>
<p>You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a <code>deployments_info_lkp</code> index (set with <code>index.mode: lookup</code>) that maps service names to their deployment times. This index could be updated from our CI/CD pipeline automatically any time a deployment happens.</p>
<pre><code class="language-bash">PUT /deployments_info_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;service&quot;: {
        &quot;properties&quot;: {
          &quot;name&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          },
          &quot;version&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          }
        }
      },
      &quot;deployment_time&quot;: {
        &quot;type&quot;: &quot;date&quot;
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-ruby&quot;, &quot;service.version&quot;: &quot;1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-go&quot;, &quot;service.version&quot;: &quot;1.1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
</code></pre>
<p>Using this information you can now write a query that joins these two sources.</p>
<p><em>ES|QL Query:</em></p>
<pre><code class="language-bash">FROM logs-* 
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
</code></pre>
<p>This alone is a good step towards troubleshooting the problem. You now have the <code>deployment_time</code> column available for each of your error documents. The last remaining step is to use it for further filtering.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover.png" alt="Discover" /></p>
<p>Any data joined from the lookup index can be treated like any other column in the ES|QL query. This means we can filter on it and check whether there was a recent deployment.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
  | KEEP message, service.name, service.version, deployment_time 
  | WHERE deployment_time &gt; NOW() - 2h
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover2.png" alt="Discover2" /></p>
<h3>Saving disk space using JOIN</h3>
<p>Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.</p>
<p>In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/logs-data-stream">logsdb index mode</a> enabled. The starting size for this index is 35.5mb. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-default?h=index,pri.store.size
### 
k8s-logs-default       35.5mb
</code></pre>
<p>Using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-disk-usage">disk usage API</a>, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried. </p>
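<p>The per-field breakdown itself comes from a single API call. As a sketch (the <code>run_expensive_tasks</code> flag is required, because the analysis scans the index to measure each field's footprint):</p>
<pre><code class="language-bash">POST /k8s-logs-default/_disk_usage?run_expensive_tasks=true
</code></pre>
<p>The response reports the store size each field contributes, which is how the roughly 5% for <code>host.os</code> and <code>cloud.*</code> was determined.</p>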
<pre><code class="language-bash">// Example host.os structure
&quot;os&quot;: {
  &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;,
  &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot;
}

// Example cloud structure
&quot;cloud&quot;: {
  &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;availability_zone&quot;: &quot;us-central1-c&quot;,
  &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; },
  &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; },
  &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; }
}
</code></pre>
<p>Instead of storing this information with every document, let's instead drop this information in an ingest pipeline.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/drop-host-os-cloud
{
  &quot;processors&quot;: [
      { &quot;remove&quot;: { &quot;field&quot;: &quot;host.os&quot; } },
      { &quot;set&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;value&quot;: &quot;{{cloud.instance.id}}&quot; } },    // Temporarily store the ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;cloud&quot; } },                               // Remove the entire cloud object
      { &quot;set&quot;: { &quot;field&quot;: &quot;cloud.instance.id&quot;, &quot;value&quot;: &quot;{{tmp1}}&quot; } }, // Restore just the cloud instance ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;ignore_missing&quot;: true } }         // Clean up temporary field
    ]
}
</code></pre>
<p>Reindexing (and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-forcemerge">force merging to one segment</a>) now shows the following size, resulting in approximately 5% less space. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-*?h=index,pri.store.size
### 
k8s-logs-default             35.5mb
k8s-logs-drop-cloud-os       33.7mb
</code></pre>
<p>Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.</p>
<pre><code class="language-bash"># Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;cloud.instance.id&quot;: {  # The join key we kept in the logs
        &quot;type&quot;: &quot;keyword&quot;
      },
      &quot;host.os&quot;: {            # The full host.os object we removed
        &quot;type&quot;: &quot;object&quot;,
        &quot;enabled&quot;: false      # Often don't need to search sub-fields here
      },
      &quot;cloud&quot;: {              # The full cloud object we removed (mostly)
        &quot;type&quot;: &quot;object&quot;,
        &quot;enabled&quot;: false      # Often don't need to search sub-fields here
      }
    }
  }
}

# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;instance_metadata_lkp&quot;, &quot;_id&quot;: &quot;5799032384800802653&quot; } }
{ &quot;cloud.instance.id&quot;: &quot;5799032384800802653&quot;, &quot;host.os&quot;: { &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;, &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot; }, &quot;cloud&quot;: { &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;availability_zone&quot;: &quot;us-central1-c&quot;, &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; }, &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; }, &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; } } }
</code></pre>
<p>With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:</p>
<pre><code class="language-bash">FROM logs-* 
  | LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id 
  | WHERE cloud.region == &quot;us-central1&quot;
</code></pre>
<p>This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.</p>
<p>It should be noted that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the “text” mapping of the <code>host.os.name</code> and <code>cloud.instance.name</code> fields. Make sure to use the disk usage API to evaluate whether this approach is worth it in your specific use case.</p>
<h2>Getting Started with Lookups for Observability</h2>
<p>Creating the necessary lookup indices is straightforward. As detailed in our <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">initial blog post</a>, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting <code>&quot;index.mode&quot;: &quot;lookup&quot;</code> in the index settings.</p>
<p>For Observability, consider automating the population of these lookup indices:</p>
<ul>
<li>
<p>Export data periodically from your CMDB, CRM, or HR systems.</p>
</li>
<li>
<p>Have your CI/CD pipeline update the <code>deployments_lkp</code> index upon successful deployment.</p>
</li>
<li>
<p>Use tools like Logstash with an <code>elasticsearch</code> output configured to write to your lookup index.</p>
</li>
</ul>
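<p>As a sketch of the Logstash option, an output block like the one below writes to the lookup index; the exact field used as <code>document_id</code> is illustrative. Keying documents by the join field means a redeployment overwrites the previous entry instead of accumulating duplicates:</p>
<pre><code class="language-bash">output {
  elasticsearch {
    hosts       =&gt; [&quot;https://localhost:9200&quot;]
    index       =&gt; &quot;deployments_info_lkp&quot;
    document_id =&gt; &quot;%{[service][name]}&quot;
  }
}
</code></pre>
<p>Note that the index should be created with <code>index.mode: lookup</code> beforehand; an index auto-created by Logstash would use the default index mode.</p>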
<h2>A Note on Performance and Alternatives</h2>
<p>While incredibly powerful, joins aren't free. Each <code>LOOKUP JOIN</code> adds processing overhead to your query. For contextual data that is <em>very</em> static (e.g., the cloud region a host <em>permanently</em> resides in) and needed in <em>almost every</em> query against that data, the traditional approach of enriching at ingest time might still be slightly more performant for those specific queries, trading upfront processing and storage for query speed.</p>
<p>However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – <code>LOOKUP JOIN</code> offers a compelling, efficient, and easier-to-manage solution.</p>
<h2>Conclusion</h2>
<p>ES|QL's <code>LOOKUP JOIN</code> makes it easy to correlate and enrich your logs, metrics, and traces with up-to-date external information <em>at query time</em>, so you can move faster from detecting problems to understanding their scope, impact, and root cause.</p>
<p>This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the &quot;Submit feedback&quot; button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/esql-join.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Migrating from Elastic’s Go APM agent to OpenTelemetry Go SDK]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-go-apm-agent-to-opentelemetry-go-sdk</link>
            <guid isPermaLink="false">elastic-go-apm-agent-to-opentelemetry-go-sdk</guid>
            <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[As OpenTelemetry is fast becoming an industry standard, Elastic is fast adopting it as well. In this post, we show you a safe and easy way to migrate your Go application from our APM agent to OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>As <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">we’ve already shared</a>, Elastic is committed to helping OpenTelemetry (OTel) succeed, which means, in some cases, building distributions of language SDKs.</p>
<p>Elastic is strategically standardizing on OTel for observability and security data collection. Additionally, Elastic is committed to working with the OTel community to become the best data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OTel beyond the recent contributions of the <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">Elastic Common Schema (ECS) to OpenTelemetry</a>, <a href="https://www.elastic.co/blog/elastic-invokedynamic-opentelemetry-java-agent">invokedynamic in the OTel Java agent</a>, and the <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">upcoming profiling agent donation</a>.</p>
<p>Since version 7.14, the Elastic Stack has supported OTel natively, directly ingesting OpenTelemetry protocol (OTLP)-based traces, metrics, and logs.</p>
<p>The Go SDK is a bit different from the other language SDKs, as the Go language inherently lacks the dynamicity that would allow building a distribution that is not a fork.</p>
<p>Nevertheless, the absence of a distribution doesn’t mean you shouldn’t use OTel for data collection from Go applications with the Elastic Stack.</p>
<p>Elastic currently has an APM Go agent, but we recommend switching to the OTel Go SDK. In this post, we cover two ways you can do that migration:</p>
<ul>
<li>
<p>By replacing all telemetry in your application’s code (a “big bang migration”) and shipping the change</p>
</li>
<li>
<p>By splitting the migration into atomic changes, to reduce the risk of regressions</p>
</li>
</ul>
<h2>A big bang migration</h2>
<p>The simplest way to migrate from our APM Go agent to the OTel SDK may be by removing all telemetry provided by the agent and replacing it all with the new one.</p>
<h3>Automatic instrumentation</h3>
<p>Most of your instrumentation may be provided automatically, as it is part of the frameworks or libraries you are using.</p>
<p>For example, if you use the Elastic Go agent, you may be using our net/http auto instrumentation module like this:</p>
<pre><code class="language-go">import (
	&quot;fmt&quot;
	&quot;net/http&quot;

	&quot;go.elastic.co/apm/module/apmhttp/v2&quot;
)

func handler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, &quot;Hello World!&quot;)
}

func main() {
	http.ListenAndServe(
		&quot;:8080&quot;,
		apmhttp.Wrap(http.HandlerFunc(handler)),
	)
}
</code></pre>
<p>With OpenTelemetry, you would use the otelhttp module instead:</p>
<pre><code class="language-go">import (
	&quot;fmt&quot;
	&quot;net/http&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp&quot;
)

func handler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, &quot;Hello World!&quot;)
}

func main() {
	http.ListenAndServe(
		&quot;:8080&quot;,
		otelhttp.NewHandler(http.HandlerFunc(handler), &quot;http&quot;),
	)
}
</code></pre>
<p>You should perform this same change for every other module you use from our agent.</p>
<h3>Manual instrumentation</h3>
<p>Your application may also contain manual instrumentation: traces and spans created directly within your application code by calling the Elastic APM agent API.</p>
<p>You may be creating transactions and spans like this with Elastic’s APM SDK:</p>
<pre><code class="language-go">import (
	&quot;context&quot;

	&quot;go.elastic.co/apm/v2&quot;
)

func main() {
	ctx := context.Background()

	// Create a transaction, and assign it to the context.
	tx := apm.DefaultTracer().StartTransaction(&quot;GET /&quot;, &quot;request&quot;)
	defer tx.End()
	ctx = apm.ContextWithTransaction(ctx, tx)

	// Create a span.
	span, ctx := apm.StartSpan(ctx, &quot;span&quot;, &quot;custom&quot;)
	defer span.End()
}
</code></pre>
<p>OpenTelemetry uses the same API for both transactions and spans: what Elastic calls “transactions” are simply spans with no parent in OTel (“root spans”).</p>
<p>So, your instrumentation becomes the following:</p>
<pre><code class="language-go">import (
	&quot;context&quot;

	&quot;go.opentelemetry.io/otel&quot;
)

func main() {
	ctx := context.Background()
	tracer := otel.Tracer(&quot;my library&quot;)

	// Create a root span.
	// It is assigned to the returned context automatically.
	ctx, span := tracer.Start(ctx, &quot;GET /&quot;)
	defer span.End()

	// Create a child span (as the context now carries a parent).
	ctx, span = tracer.Start(ctx, &quot;span&quot;)
	defer span.End()
}
</code></pre>
<p>With a big bang migration, you will need to migrate everything before shipping it to production. You cannot split the migration into smaller chunks.</p>
<p>For small applications or ones that only use automatic instrumentation, that constraint may be fine. It allows you to quickly validate the migration and move on.</p>
<p>However, if you are working on a complex set of services, a large application, or one with a lot of manual instrumentation, you probably want to be able to ship code multiple times during the migration instead of all at once.</p>
<h2>An atomic migration</h2>
<p>An atomic migration is one where you ship small, self-contained changes gradually while your application keeps working normally, and only pull the final plug at the end, once you are ready to do so.</p>
<p>To help with atomic migrations, we provide a <a href="https://www.elastic.co/guide/en/apm/agent/go/master/opentelemetry.html">bridge between our APM Go agent and OpenTelemetry</a>.</p>
<p>This bridge lets you run our agent and OTel side by side, with instrumentation from both libraries in the same process sending data to the same location in the same format.</p>
<p>You can configure the OTel bridge with our agent like this:</p>
<pre><code class="language-go">import (
	&quot;log&quot;

	&quot;go.elastic.co/apm/module/apmotel/v2&quot;

	&quot;go.opentelemetry.io/otel&quot;
)

func main() {
	provider, err := apmotel.NewTracerProvider()
	if err != nil {
		log.Fatal(err)
	}
	otel.SetTracerProvider(provider)
}
</code></pre>
<p>Once this configuration is set, every span created by OTel will be transmitted to the Elastic APM agent.</p>
<p>With this bridge, you can make your migration much safer with the following process:</p>
<ul>
<li>
<p>Add the bridge to your application.</p>
</li>
<li>
<p>Switch one instrumentation (automatic or manual) at a time from the agent to OpenTelemetry, just as in the big bang migration above.</p>
</li>
<li>
<p>Remove the bridge and our agent, and configure OpenTelemetry to transmit the data via its SDK.</p>
</li>
</ul>
<p>Each of those steps can be a single change within your application and go to production right away.</p>
<p>If any issue arises during the migration process, you should then be able to see it immediately and fix it before moving on.</p>
<h2>Observability benefits from building with OTel</h2>
<p>As OTel is quickly becoming an industry standard, and Elastic is committed to making it even better, migrating to it can be very beneficial for your engineering teams.</p>
<p>In Go, whether you migrate in one big bang or via Elastic’s OTel bridge, doing so lets you benefit from instrumentations maintained by the global community, making your observability even more effective and helping you better understand what’s happening within your application.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Check out our code series on how to instrument with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Go manual instrumentation with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting with OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm">Using AI to analyze OpenTelemetry issues</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-go-apm-agent-to-opentelemetry-go-sdk/elastic-de-136675-V1_V1_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's contribution: Invokedynamic in the OpenTelemetry Java agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/invokedynamic-opentelemetry-java-agent</link>
            <guid isPermaLink="false">invokedynamic-opentelemetry-java-agent</guid>
            <pubDate>Thu, 19 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The instrumentation approach in OpenTelemetry's Java Agent comes with some limitations with respect to maintenance and testability. Elastic contributes an invokedynamic-based instrumentation approach that helps overcoming these limitations.]]></description>
<content:encoded><![CDATA[<p>As the second-largest and one of the most active Cloud Native Computing Foundation (CNCF) projects, <a href="https://opentelemetry.io/">OpenTelemetry</a> is well on its way to becoming the ubiquitous, unified standard and framework for observability. OpenTelemetry owes this success to its comprehensive and feature-rich toolset that allows users to retrieve valuable observability data from their applications with low effort. The OpenTelemetry Java agent is one of the most mature and feature-rich components in OpenTelemetry’s ecosystem. It provides automatic instrumentation for JVM-based applications and comes with broad coverage of auto-instrumentation modules for popular Java frameworks and libraries.</p>
<p>The original instrumentation approach used in the OpenTelemetry Java agent left the maintenance and development of auto-instrumentation modules subject to some restrictions. As part of <a href="https://www.elastic.co/blog/transforming-observability-ai-assistant-otel-standardization-continuous-profiling-log-analytics">our reinforced commitment to OpenTelemetry</a>, Elastic® helps evolve and improve OpenTelemetry projects and components. <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic’s contribution of the Elastic Common Schema</a> to OpenTelemetry was an important step for the open-source community. As another step in our commitment to OpenTelemetry, Elastic started contributing to the OpenTelemetry Java agent.</p>
<h2>Elastic’s invokedynamic-based instrumentation approach</h2>
<p>To overcome the above-mentioned limitations in developing and maintaining auto-instrumentation modules in the OpenTelemetry Java agent, Elastic started contributing its <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents"><strong>invokedynamic</strong>-based instrumentation approach</a> to the OpenTelemetry Java agent in July 2023.</p>
<p>To explain the improvement, you should know that in Java, a common approach to do auto-instrumentation of applications is through utilizing Java agents that do bytecode instrumentation at runtime. <a href="https://bytebuddy.net/#/">Byte Buddy</a> is a popular and widespread utility that helps with bytecode instrumentation without the need to deal with Java’s bytecode directly. Instrumentation logic that collects observability data from the target application’s code lives in so-called <em>advice methods</em>. Byte Buddy provides different ways of hooking these advice methods into the target application’s methods:</p>
<ul>
<li><em>Advice inlining:</em> The advice method’s code is being copied into the instrumented target method.</li>
<li><em>Static advice dispatching:</em> The instrumented target method invokes static advice methods that need to be visible to the instrumented code.</li>
<li><em>Advice dispatching with <strong>invokedynamic</strong>:</em> The instrumented target method uses the JVM’s <strong>invokedynamic</strong> bytecode instruction to call advice methods that are isolated from the instrumented code.</li>
</ul>
<p>These different approaches are described in great detail in our related blog post on <a href="https://www.elastic.co/blog/embracing-invokedynamic-to-tame-class-loaders-in-java-agents">Elastic’s Java APM agent using invokedynamic</a>. In a nutshell, both <em>advice inlining</em> and <em>dispatching to static advice methods</em> come with some limitations with respect to writing and maintaining the advice code. So far, the OpenTelemetry Java agent has used <em>advice inlining</em> for its bytecode instrumentation. The resulting limitations on developing instrumentations are <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/v1.30.0/docs/contributing/writing-instrumentation-module.md#use-advice-classes-to-write-code-that-will-get-injected-to-the-instrumented-library-classes">documented in corresponding developer guidelines</a>. Among other things, not being able to debug advice code is a painful restriction when developing and maintaining instrumentation code.</p>
<p>Elastic’s APM Java agent has been using the <strong>invokedynamic</strong> approach with its benefits for years — field-proven by thousands of customers. To help improve the OpenTelemetry Java agent, Elastic started contributing the <strong>invokedynamic</strong> approach with the goal to simplify and improve the development and maintainability of auto-instrumentation modules. The contribution proposal and the implementation outline is documented in more detail in <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8999">this GitHub issue</a>.</p>
<p>With the new approach in place, Elastic will help migrate existing instrumentations so the OTel Java community can benefit from the <strong>invokedynamic</strong>-based instrumentation approach.</p>
<blockquote>
<p>Elastic supports OTel natively, and has numerous capabilities to help you analyze your application with OTel. </p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Native OpenTelemetry support in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best Practices for instrumenting OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
</ul>
<p>Instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry (this is the application the team built to highlight <em>all</em> the languages below)</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual instrumentation</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/invokedynamic-opentelemetry-java-agent/24-crystals.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic’s Managed OTLP Endpoint: Simpler, Scalable OpenTelemetry for SREs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry</link>
            <guid isPermaLink="false">elastic-managed-otlp-endpoint-for-opentelemetry</guid>
            <pubDate>Thu, 14 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Streamline OpenTelemetry data ingestion with Elastic Observability's new managed OTLP endpoint available on Elastic Cloud Serverless. Get native OTel storage and Elastic-grade scaling for logs, metrics, and traces, simplifying observability for SREs.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to announce the <strong>managed OTLP endpoint for Elastic Observability Serverless.</strong> This feature marks a major milestone in Elastic’s shift to OpenTelemetry as the backbone of our data ingestion strategy and makes it dramatically easier to get high-fidelity OpenTelemetry data into Elastic Cloud.</p>
<h2>What is Elastic’s Managed OTLP Endpoint?</h2>
<p>The managed OTLP endpoint delivers on that promise, offering a fully hosted OpenTelemetry ingestion path that’s scalable, reliable, and designed from the ground up for OpenTelemetry.</p>
<p>OpenTelemetry SDKs, OpenTelemetry Collectors, and any OTLP-compliant service can send data to the OTLP endpoint. The endpoint is available on Elastic Cloud Serverless and is fully managed by Elastic, which minimizes the burden on customers of managing the OpenTelemetry ingestion layer. Whenever your production environment scales, the OTLP endpoint also autoscales, without any management from an SRE.</p>
<p>OpenTelemetry data is stored without any schema translation, preserving both semantic conventions and resource attributes. Additionally, it supports ingesting OTLP logs, metrics, and traces in a unified manner, ensuring consistent treatment across all telemetry data. This marks a significant improvement over the existing functionality, which primarily focuses on traces and APM use cases.</p>
<p>As a result, SREs gain: </p>
<ul>
<li>
<p><strong>Native OTLP ingestion</strong> with Elastic-managed reliability and scale</p>
</li>
<li>
<p><strong>OTel-native data storage</strong>, enabling richer analytics and future-proof observability</p>
</li>
<li>
<p><strong>Elastic-grade scaling</strong>, ready for production and multi-tenant workloads</p>
</li>
<li>
<p><strong>Frictionless onboarding</strong>, with a drop-in endpoint for logs, metrics, and traces</p>
</li>
</ul>
<h2>Native OTLP ingestion</h2>
<p>Whether you are using native OTel SDKs, OpenTelemetry Collector, EDOT, or other OpenTelemetry instrumentation, the OTLP endpoint will ingest any native OTLP data.</p>
<p>Observability data is notoriously bursty, and the managed OTLP endpoint automatically scales with it. A sudden spike in requests, a scaling event in Kubernetes, or a deployment gone sideways can lead to massive surges in telemetry, often when you need visibility the most. That’s exactly what the managed OTLP endpoint in Elastic Observability Serverless is built to handle.</p>
<p>This isn’t just a thin wrapper on a collector. It’s a <strong>multi-tenant, auto-scaling service</strong> architected to absorb high volumes of OpenTelemetry data without you having to manage infrastructure, pre-provision capacity, or worry about dropped data.</p>
<p>Whether you’re routing data directly from OpenTelemetry SDKs or via an intermediate Collector, Elastic handles the scale behind the scenes. The endpoint is designed to scale with your telemetry traffic and recover gracefully from bursts, giving you one less thing to monitor. Just point your instrumentation at the endpoint and let Elastic take care of the rest.</p>
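<p>In practice, pointing an SDK at the endpoint usually comes down to the two standard OTLP environment variables; the endpoint URL and API key below are placeholders for your project’s values:</p>

```shell
# Placeholders: substitute your project's managed OTLP endpoint and API key.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<your-otlp-endpoint>"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=ApiKey <your-api-key>"
```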
<h2>Natively stored OpenTelemetry </h2>
<p>With this feature, developers can now <strong>send OpenTelemetry signals directly to an Elastic Cloud</strong> <strong>Serverless project</strong> using the OTLP output of a Collector or SDK, regardless of the distribution (Contrib, EDOT, and any other distribution will work).</p>
<p>The endpoint also supports data forwarded from any OpenTelemetry Collector, SDK, or OTLP-compliant forwarder. This gives teams full control to send directly from an SDK, or to route, enrich, or batch telemetry through a Collector when needed. Elasticsearch stores OpenTelemetry data using the OpenTelemetry data model, including resource attributes, to identify emitting entities and enable ES|QL queries that correlate logs, metrics, and traces.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/resource-attributes.jpg" alt="OTel resource attributes" /></p>
<h2>Faster time-to-insight</h2>
<p>Whether you’re building in serverless, Kubernetes, or classic VMs, this endpoint lets you focus on instrumentation and insights, not ingestion plumbing. It dramatically shortens the time from telemetry to value while embracing the OpenTelemetry data model, preserving the original attributes and enabling built-in correlation.</p>
<h2>Easy connectivity to Managed OTLP Endpoint</h2>
<p>Connecting to the Managed OTLP endpoint is as simple as pointing your SDK’s or OTel Collector’s OTLP exporter at the Elastic Managed OTLP Endpoint URL with an authentication key. Finding your endpoint is straightforward: go to project management, then edit alias, and you will find your project’s OTLP endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/otlp-endpoint-config.jpg" alt="OTel OTLP endpoint" /></p>
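<p>For the Collector route, a minimal exporter configuration sketch looks like the following; the endpoint and API key are placeholders, and the <code>otlphttp</code> exporter shown here is the standard one shipped with every Collector distribution:</p>

```yaml
exporters:
  otlphttp:
    # Placeholders: use your project's managed OTLP endpoint and API key.
    endpoint: "https://<your-otlp-endpoint>"
    headers:
      Authorization: "ApiKey <your-api-key>"
```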
<h2>Get Started Today</h2>
<p>The managed OTLP endpoint can be used today <strong>on Elastic Observability Serverless</strong>. Support for <strong>Elastic Cloud Hosted</strong> deployments is coming soon.</p>
<p>For more detail and examples, follow <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">this guide</a>.</p>
<p>Whether you’re running microservices in Kubernetes, workloads in serverless, or apps on classic VMs, the OTLP endpoint helps you <strong>streamline your observability pipeline</strong>, <strong>standardize on OpenTelemetry</strong>, and <strong>accelerate your mean time to resolution (MTTR)</strong>.</p>
<p>Also check out our OTel resources on instrumenting and ingesting OTel data into Elastic:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry-ga">Elastic Distributions of OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Monitoring Kubernetes with Elastic and OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery with EDOT Collector</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/assembling-an-opentelemetry-nginx-ingress-controller-integration">Assembling an OpenTelemetry NGINX Ingress Controller Integration</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-for-opentelemetry/otlp-endpoint.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Now GA: Managed OTLP Endpoint on Elastic Cloud Hosted]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted</link>
            <guid isPermaLink="false">elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted</guid>
            <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[The Elastic Managed OTLP Endpoint is now generally available on Elastic Cloud Hosted, bringing managed Kafka-backed resilience and native OTLP ingestion to any OpenTelemetry shipper.]]></description>
            <content:encoded><![CDATA[<p>The Elastic Managed OTLP Endpoint (mOTLP) is now generally available on Elastic Cloud Hosted. Any OTLP-compliant source, whether it's an upstream OpenTelemetry SDK, any Collector distribution, EDOT, or a custom forwarder, can send traces, metrics, and logs to Elastic Cloud without deploying or managing ingestion infrastructure.</p>
<p>mOTLP was <a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">already GA on Elastic Cloud Serverless</a>. With this release, Elastic Cloud Hosted deployments get the same managed ingestion path: an OpenTelemetry Collector-based architecture with Kafka-backed resilience that absorbs traffic spikes and protects against data loss during the moments that matter most.</p>
<h2>You only need to set two environment variables</h2>
<p>From the outside, sending OpenTelemetry data to Elastic using the Managed OTLP Endpoint is as simple as setting up these environment variables:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-motlp-endpoint&gt;&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;your-api-key&gt;&quot;
</code></pre>
<p>Two environment variables. That's the entire integration surface. No Collector gateway to deploy, no ingestion pipelines to manage, no credentials to distribute across edge agents.</p>
<p>What makes such a simple design possible is a full OpenTelemetry Collector-based ingestion architecture designed to handle the worst moments in production, from the traffic spike during a deployment rollout to the burst of error traces during an incident. These are the moments when telemetry matters most, and they're also the moments when ingestion is most likely to be overwhelmed.</p>
<p>The managed endpoint receives OTLP data through an OpenTelemetry Collector layer that buffers to a managed Kafka cluster before indexing into Elasticsearch. Kafka absorbs bursts, decouples ingestion from indexing, and provides durability guarantees that in-memory queues can't. If Elasticsearch is temporarily under pressure, data sits in Kafka rather than getting dropped. When pressure subsides, the buffer drains and everything catches up. This is the same <a href="https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures">resilience pattern</a> you'd build yourself with a Kafka exporter and Kafka receiver in a self-managed collector pipeline, except Elastic operates it for you.</p>
<p>This makes the entire path OpenTelemetry end-to-end:</p>
<ol>
<li>Your applications produce telemetry with OTel SDKs.</li>
<li>Your collectors (or SDKs directly) export over OTLP.</li>
<li>The managed endpoint receives that OTLP through an OTel Collector-based layer, buffers it through Kafka, and stores it natively in Elasticsearch using the OpenTelemetry data model.</li>
</ol>
<p>No proprietary protocols, no schema translation, no format conversion at any stage of the pipeline. Just pure OTel.</p>
<h2>Bring your OTLP data in, no matter the source</h2>
<p>The endpoint accepts standard OTLP over HTTP and gRPC. Any tool that speaks OTLP can send data to it.</p>
<p>This means you can send data from:</p>
<ul>
<li><strong>OpenTelemetry SDKs</strong> (upstream, EDOT, or any distribution) exporting directly from your application.</li>
<li><strong>OpenTelemetry Collectors</strong> (Contrib, EDOT, or custom builds) running as agents, gateways, or sidecars.</li>
<li><strong>EDOT Cloud Forwarder</strong> forwarding cloud provider logs and metrics.</li>
<li><strong>Any OTLP-compliant forwarder</strong> you've built or adopted.</li>
</ul>
<p>This is worth pausing on, because it's not how most vendors work.</p>
<p>Many observability backends require vendor-specific components in your pipeline. Even when those components live in the OpenTelemetry Collector Contrib repository, they often reshape your data on the way out. A vendor-specific exporter might flatten resource attributes, drop the hierarchy between resources and scopes, or translate semantic conventions into a proprietary schema. By the time your telemetry reaches the backend, it's no longer standard OpenTelemetry data. It just started as OpenTelemetry data.</p>
<p>Elastic doesn't require any of that. The managed endpoint ingests standard OTLP, which means you use the standard <code>otlphttp</code> or <code>otlp</code> exporter that ships with every OpenTelemetry Collector. No Elastic-specific exporter, no vendor plugin, no translation layer. Your data arrives in Elasticsearch with the same resource hierarchy, the same semantic conventions, and the same attribute structure your instrumentation produced.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted/motlp-reference-architecture.png" alt="Reference architecture: OTel SDKs and Collectors on the edge exporting over OTLP to the Managed OTLP Endpoint on Elastic Cloud" /></p>
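<p>As a sketch, a gateway-style Collector forwarding all three signals with only standard components might be configured like this (the endpoint and API key are placeholders):</p>

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    # Placeholders: your managed OTLP endpoint and API key.
    endpoint: "https://<your-motlp-endpoint>"
    headers:
      Authorization: "ApiKey <your-api-key>"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```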
<p>The practical result: teams adopt OpenTelemetry at different speeds and with different tools. Some start with the upstream SDK and a Contrib Collector. Others use EDOT for Elastic-specific optimizations. Many run a mix. The managed endpoint doesn't impose a choice, and it doesn't quietly reshape your data behind a vendor exporter.</p>
<p>For organizations already running OpenTelemetry in production, this means switching to Elastic or adding Elastic as a destination requires changing an exporter URL, not rearchitecting the pipeline or adding vendor-specific components.</p>
<h2>What you no longer have to manage</h2>
<p>Before the managed OTLP endpoint, sending OpenTelemetry data to Elastic Cloud Hosted required deploying your own OTLP-compatible ingestion layer, typically an <a href="https://www.elastic.co/docs/reference/edot-collector/modes#edot-collector-as-gateway">EDOT Collector running as a gateway</a>. That gateway needed to be sized, scaled, monitored, and kept available. If it went down, telemetry stopped flowing.</p>
<p>With mOTLP, that entire layer is Elastic's responsibility. Here's what moves off your plate:</p>
<ul>
<li>The endpoint scales with your traffic. On Elastic Cloud Hosted, rate limits scale dynamically based on Elasticsearch backpressure. No pre-provisioning required.</li>
<li>The Collector-to-Kafka buffer handles burst absorption and backpressure. You don't need to operate your own Kafka cluster.</li>
<li>Your shippers authenticate directly with the endpoint using an API key. No intermediate gateway holding and distributing backend credentials.</li>
<li>The endpoint is managed and multi-tenant. You don't need to run redundant Collector replicas or configure health checks for your ingestion layer.</li>
</ul>
<p>This doesn't mean you should remove all collectors from your architecture. Edge collectors (DaemonSet agents, sidecars, host agents) still serve a purpose: collecting infrastructure telemetry via pull-based receivers like <code>filelog</code> and <code>hostmetrics</code>, applying local transformations, and batching data before export. What changes is that the destination is now a managed endpoint rather than a self-operated gateway.</p>
<h2>Dynamic rate scaling: the system adapts to your cluster</h2>
<p>On Elastic Cloud Hosted, the managed endpoint doesn't have a fixed throughput ceiling. It uses dynamic rate scaling that adjusts based on your Elasticsearch cluster's capacity and current load.</p>
<p>This is a fundamentally different model from static rate limits. Instead of provisioning for peak and paying for idle capacity, the system continuously calibrates ingestion to what your cluster can actually handle. Sudden load spikes may still trigger temporary <code>429</code> responses while the system scales, but these resolve automatically.</p>
<p>If you are seeing consistent <code>429</code> errors, the signal is clear: your Elasticsearch cluster needs more capacity. Scaling the cluster reduces backpressure, which in turn raises the ingestion rate limit. The autoscaling capabilities in Elastic Cloud Hosted can help automate this, and AutoOps can assist by monitoring the deployment and recommending scaling or resource adjustments when capacity constraints are detected.</p>
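<p>If you run a Collector in front of the endpoint, the standard exporter retry and queue settings will also smooth over temporary <code>429</code>s while the system scales; the values below are illustrative rather than recommendations:</p>

```yaml
exporters:
  otlphttp:
    endpoint: "https://<your-motlp-endpoint>"  # placeholder
    headers:
      Authorization: "ApiKey <your-api-key>"   # placeholder
    # Standard exporterhelper settings: retry on transient errors (including 429)
    # and buffer batches in a queue while retrying.
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 1000
```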
<h2>Native OTLP storage: OTel from first mile to last</h2>
<p>The end-to-end OTel story doesn't stop at ingestion. Data that arrives through the managed endpoint is stored using the OpenTelemetry data model. Resource attributes, semantic conventions, and signal structure are preserved as-is. There's no translation to ECS or any other schema at any point in the pipeline.</p>
<p>This means the attribute names your SDK produces are the same ones you query in ES|QL and Discover: <code>service.name</code>, <code>http.request.method</code>, and <code>k8s.pod.name</code> are stored exactly as the OpenTelemetry specification defines them. You’re not debugging a mapping layer or wondering which schema translation dropped an attribute. What your instrumentation emits is what Elasticsearch stores and what you search.</p>
<p>If no specific dataset or namespace is configured, telemetry lands in default data streams: <code>traces-generic.otel-default</code>, <code>metrics-generic.otel-default</code>, and <code>logs-generic.otel-default</code>. You can route logs to dedicated datasets by setting the <code>data_stream.dataset</code> attribute, either in your collector configuration or via <code>OTEL_RESOURCE_ATTRIBUTES</code>:</p>
<pre><code class="language-yaml">processors:
  transform:
    log_statements:
      - set(log.attributes[&quot;data_stream.dataset&quot;], &quot;app.orders&quot;) where resource.attributes[&quot;service.name&quot;] == &quot;orders-service&quot;
</code></pre>
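<p>The environment-variable route is a one-liner on the workload side; the dataset name here is illustrative:</p>

```shell
# "app.orders" is an illustrative dataset name; logs from this workload are
# routed to a dedicated data stream instead of the generic default.
export OTEL_RESOURCE_ATTRIBUTES="data_stream.dataset=app.orders"
```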
<h2>The failure store: a safety net for mapping conflicts</h2>
<p>Even with careful schema design, mapping conflicts happen. A field that's a string in one service might be an integer in another. In traditional setups, these conflicts cause indexing failures and data loss.</p>
<p>The managed endpoint has the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">failure store</a> enabled by default for all OTLP data streams. Documents that fail indexing due to mapping conflicts or ingest pipeline exceptions are stored in a separate index rather than dropped. You can inspect failed documents from the <a href="https://www.elastic.co/docs/solutions/observability/data-set-quality-monitoring">Data Set Quality</a> page and fix the underlying issue without losing the data.</p>
<p>This is particularly valuable in OpenTelemetry environments where multiple teams instrument independently and attribute types can drift across services.</p>
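<p>Failed documents can also be searched directly via the failure-store selector (a sketch, assuming the default logs data stream and a version that supports the <code>::failures</code> selector):</p>
<pre><code class="language-bash">POST logs-generic.otel-default::failures/_search
{
  &quot;query&quot;: { &quot;match_all&quot;: {} }
}
</code></pre>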
<h2>Getting started</h2>
<p>The managed OTLP endpoint is available today on Elastic Cloud Hosted deployments (version 9.0+) in <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">supported regions</a>.</p>
<p>To find your endpoint:</p>
<ol>
<li>Log in to the <a href="https://cloud.elastic.co">Elastic Cloud Console</a>.</li>
<li>Select your deployment and go to <strong>Manage</strong>.</li>
<li>In the <strong>Application endpoints</strong> section, select <strong>Managed OTLP</strong> and copy the public endpoint.</li>
</ol>
<p>Then point any OTLP exporter at it:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-motlp-endpoint&gt;&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;your-api-key&gt;&quot;
</code></pre>
<p>That's it. Traces, metrics, and logs will start flowing into your deployment within seconds.</p>
<p>For a detailed walkthrough, follow the <a href="https://www.elastic.co/docs/solutions/observability/get-started/quickstart-elastic-cloud-otel-endpoint">Send data to the Elastic Cloud Managed OTLP Endpoint</a> quickstart.</p>
<h2>Learn more</h2>
<ul>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Managed OTLP Endpoint reference documentation</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures">Composing OpenTelemetry Reference Architectures</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector">EDOT Collector documentation</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry">OpenTelemetry at Elastic</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-managed-otlp-endpoint-ga-elastic-cloud-hosted/managed-otlp-endpoint-elastic-cloud-hosted.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's metrics analytics gets 5x faster]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-metrics-analytics</link>
            <guid isPermaLink="false">elastic-metrics-analytics</guid>
            <pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore Elastic's metrics analytics enhancements, including faster ES|QL queries, TSDS updates and OpenTelemetry exponential histogram support.]]></description>
<content:encoded><![CDATA[<p>In our <a href="https://www.elastic.co/observability-labs/blog/metrics-explore-analyze-with-esql-discover">previous blog in this series</a>, we explored the fundamentals of analyzing metrics using the Elasticsearch Query Language (ES|QL) and the interactive power of Discover. Building on that foundation, we are excited to announce a suite of powerful enhancements to Time Series Data Streams (Elastic’s TSDB) and ES|QL, designed to make metrics analytics more comprehensive and dramatically faster!</p>
<p>These latest updates, available in v9.3 and in Serverless, introduce significant performance gains, sophisticated time series functions, and native OpenTelemetry exponential histogram support that directly benefit SREs and Observability practitioners.</p>
<h2>Query Performance and Storage Optimizations</h2>
<p>Speed is paramount when diagnosing incidents. Compared to prior releases, we have achieved a 5x+ improvement in query latency when wildcarding or filtering by dimensions. Additionally, storage efficiency for OpenTelemetry metrics data has improved by approximately 2x, significantly reducing the infrastructure footprint required to retain high-volume observability data. If you’re hungry to learn more about the architectural updates driving these optimizations, stay tuned… Tech blogs are on their way!</p>
<h2>Expanded Time Series Analytics in ES|QL</h2>
<p>The <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts">ES|QL <code>TS</code> source command</a>, which targets time series indices and enables <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/time-series-aggregation-functions">time series aggregation functions</a>, has been significantly enhanced to support more complex analytics.</p>
<p>We have expanded the <a href="https://www.elastic.co/docs/reference/query-languages/esql/esql-functions-operators">library of time series functions</a> to include essential tools for identifying anomalies and trends.</p>
<ul>
<li><code>PERCENTILE_OVER_TIME</code>, <code>STDDEV_OVER_TIME</code>, <code>VARIANCE_OVER_TIME</code>: Calculate the percentile, standard deviation, or variance of a field over time, which is critical for understanding distribution and variability in service latency or resource usage.</li>
</ul>
<p>Example: Seeing the worst-case latency in 5-minute intervals.</p>
<pre><code class="language-bash">TS metrics*  | STATS MAX(PERCENTILE_OVER_TIME(kafka.consumer.fetch_latency_avg, 99))
  BY TBUCKET(5m)
</code></pre>
<ul>
<li><code>DERIV</code>: This function calculates the derivative of a numeric field over time using linear regression, which is useful for analyzing the rate of change in system metrics.</li>
</ul>
<p>Example: trending gauge values over time.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(DERIV(container.memory.available))
  BY TBUCKET(1 hour)
</code></pre>
<ul>
<li><code>CLAMP</code>: To handle noisy data or outliers, this function limits sample values to a specified lower and upper bound.</li>
</ul>
<p>Example: handling saturation metrics (like CPU or memory utilization) where spikes or measurement errors can occasionally report values over 100%, making the rest of the data look like a flat line at the bottom of the chart.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(CLAMP(k8s.pod.memory.node.utilization, 0, 100))
  BY k8s.pod.name
</code></pre>
<ul>
<li><code>TRANGE</code>: This new function filters data to a specific time range using the <code>@timestamp</code> attribute, simplifying query syntax for time-bound investigations.</li>
</ul>
<p>Example: Filtering and showing metrics for the last 4 hours.</p>
<pre><code class="language-bash">TS metrics*  | WHERE TRANGE(4h) | STATS AVG(host.cpu.pct)
  BY TBUCKET(5m)
</code></pre>
<p><strong>Window Functions</strong>: To smooth results over specific periods, ES|QL now introduces window functions. Most time series aggregation functions now accept an optional second argument that specifies a sliding time window. For example, you can calculate a rate over a five-minute sliding window while bucketing results by minute.</p>
<p>Example: Calculating the average rate of requests per host for every minute, using values over a sliding window of 5 minutes.</p>
<pre><code class="language-bash">TS metrics*  | STATS AVG(RATE(app.frontend.requests, 5m))
  BY TBUCKET(1m)
</code></pre>
<p>Accepted window values are currently limited to multiples of the time bucket interval in the <code>BY</code> clause. Windows that are smaller than the time bucket interval, or larger but not a multiple of it, will be supported in future releases.</p>
<h2>Native OpenTelemetry Exponential Histograms</h2>
<p>Elastic now provides native support for OpenTelemetry exponential histograms, enabling efficient ingest, querying, and downsampling of high-fidelity distribution data.</p>
<p>We have introduced a new <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/exponential-histogram">exponential_histogram</a> field type designed to capture distributions with fixed, exponentially spaced bucket boundaries. Because these fields are primarily intended for aggregations, the histogram is stored as compact doc values and is not indexed, optimizing storage efficiency. These fields are fully supported in ES|QL aggregation functions such as <code>PERCENTILES</code>, <code>AVG</code>, <code>MIN</code>, <code>MAX</code>, and <code>SUM</code>.</p>
<p>You can index documents with exponential histograms automatically through our <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/tsds-ingest-otlp#configure-histogram-handling">OTLP endpoint</a> or manually. For example, let’s create an index with an exponential histogram field and a keyword field:</p>
<pre><code class="language-bash">PUT my-index-000001
{
  &quot;settings&quot;: {
    &quot;index&quot;: {
      &quot;mode&quot;: &quot;time_series&quot;,
      &quot;routing_path&quot;: [&quot;http.path&quot;],
      &quot;time_series&quot;: {
        &quot;start_time&quot;: &quot;2026-01-21T00:00:00Z&quot;,
        &quot;end_time&quot;: &quot;2026-01-25T00:00:00Z&quot;
     }
    }
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;@timestamp&quot;: {
        &quot;type&quot;: &quot;date&quot;
      },
      &quot;http.path&quot;: {
        &quot;type&quot;: &quot;keyword&quot;,
        &quot;time_series_dimension&quot;: true
      },
      &quot;responseTime&quot;: {
        &quot;type&quot;: &quot;exponential_histogram&quot;,
        &quot;time_series_metric&quot;: &quot;histogram&quot;
      }
    }
  }
}
</code></pre>
<p>Index a document with a full exponential histogram payload:</p>
<pre><code class="language-bash">POST my-index-000001/_doc
{
  &quot;@timestamp&quot;: &quot;2026-01-22T21:25:00.000Z&quot;,
  &quot;http.path&quot;: &quot;/foo&quot;,
  &quot;responseTime&quot;: {
    &quot;scale&quot;:3,
    &quot;sum&quot;:73.2,
    &quot;min&quot;:3.12,
    &quot;max&quot;:7.02,
    &quot;positive&quot;: {
      &quot;indices&quot;:[13,14,15,16,17,18,19,20,21,22],
      &quot;counts&quot;:[1,1,2,2,1,2,1,3,1,1]
    }
  }
}

POST my-index-000001/_doc
{
  &quot;@timestamp&quot;: &quot;2026-01-22T21:26:00.000Z&quot;,
  &quot;http.path&quot;: &quot;/bar&quot;,
  &quot;responseTime&quot;: {
    &quot;scale&quot;:3,
    &quot;sum&quot;:45.86,
    &quot;min&quot;:2.15,
    &quot;max&quot;:5.1,
    &quot;positive&quot;: {
      &quot;indices&quot;:[8,9,10,11,12,13,14,15,16,17,18],
      &quot;counts&quot;:[1,1,1,1,1,1,1,2,1,1,2]
    }
  }
}
</code></pre>
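<p>To interpret the <code>scale</code> and <code>indices</code> values above: in the OpenTelemetry data model, an exponential histogram with a given scale uses the base 2^(2^-scale), and bucket i covers the half-open range (base^i, base^(i+1)]. A small Python sketch of the boundary math:</p>
<pre><code class="language-python">def bucket_bounds(scale, index):
    # OpenTelemetry exponential histograms: base = 2 ** (2 ** -scale);
    # bucket i covers the half-open range (base ** i, base ** (i + 1)].
    base = 2.0 ** (2.0 ** -scale)
    return base ** index, base ** (index + 1)

# scale=3, index=13 (the first bucket of the /foo document above):
low, high = bucket_bounds(3, 13)  # roughly (3.08, 3.36], consistent with min=3.12
</code></pre>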
<p>And finally, query the time series index using ES|QL and the TS source command:</p>
<pre><code class="language-bash">TS my-index-000001  | STATS MIN(responseTime), MAX(responseTime),
        AVG(responseTime), MEDIAN(responseTime),
        PERCENTILE(responseTime, 90)
  BY http.path
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-metrics-analytics/exponential_histogram_esql_example.png" alt="ES|QL query results showing aggregations over an exponential histogram field" /></p>
<h2>Enhanced Downsampling</h2>
<p>Downsampling is essential for long-term data retention. We have introduced a new <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/downsampling-concepts#downsampling-methods">&quot;last value&quot; downsampling mode</a>. This method exchanges accuracy for storage efficiency and performance by keeping only the last sample value, providing a lightweight alternative to calculating aggregate metrics.</p>
<p>You can <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/run-downsampling">configure a time series data stream</a> for last value downsampling in a similar way as regular downsampling, just by setting the <code>downsampling_method</code> to <code>last_value</code>. For example, by using a data stream lifecycle:</p>
<pre><code class="language-bash">PUT _data_stream/my-data-stream/_lifecycle
{
  &quot;data_retention&quot;: &quot;7d&quot;,
  &quot;downsampling_method&quot;: &quot;last_value&quot;,
  &quot;downsampling&quot;: [
     {
       &quot;after&quot;: &quot;1m&quot;,
       &quot;fixed_interval&quot;: &quot;10m&quot;
      },
      {
        &quot;after&quot;: &quot;1d&quot;,
        &quot;fixed_interval&quot;: &quot;1h&quot;
      }
   ]
}
</code></pre>
<h2>In Conclusion</h2>
<p>These enhancements mark a significant step forward in Elastic's metrics analytics capabilities, delivering 5x+ faster queries, roughly 2x better storage efficiency, and specialized functions like <code>DERIV</code>, <code>CLAMP</code>, and <code>PERCENTILE_OVER_TIME</code>. With native support for OpenTelemetry exponential histograms and expanded downsampling options, SREs can now perform richer, more cost-effective analysis of their observability data. This release empowers teams to detect anomalies faster and manage long-term metrics retention with greater efficiency.</p>
<p>We welcome you to <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">try the new features</a> today!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-metrics-analytics/elastic_metrics_leaner_blog_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic MongoDB Atlas Integration: Complete Database Monitoring and Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-mongodb-atlas-integration</link>
            <guid isPermaLink="false">elastic-mongodb-atlas-integration</guid>
            <pubDate>Thu, 24 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Comprehensive MongoDB Atlas monitoring with Elastic's integration - track performance, security, and operations through real-time alerts, audit logs, and actionable insights.]]></description>
            <content:encoded><![CDATA[<p>In today's data-driven landscape, <a href="https://www.mongodb.com/products/platform/atlas-database">MongoDB Atlas</a> has emerged as the leading multi-cloud developer data platform, enabling organizations to work seamlessly with document-based data models while ensuring flexible schema design and easy scalability. However, as your Atlas deployments grow in complexity and criticality, comprehensive observability becomes essential for maintaining optimal performance, security, and reliability.</p>
<p>The Elastic <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> transforms how you monitor and troubleshoot your Atlas infrastructure by providing deep insights into every aspect of your deployment—from real-time alerts and audit trails to detailed performance metrics and organizational activities. This integration empowers teams to minimize Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) while gaining actionable insights for capacity planning and performance optimization.</p>
<h2>Why MongoDB Atlas Observability Matters</h2>
<p>MongoDB Atlas abstracts much of the operational complexity of running MongoDB, but this doesn't eliminate the need for monitoring. Modern applications demand:</p>
<ul>
<li><strong>Proactive Issue Detection</strong>: Identify performance bottlenecks, resource constraints, and security threats before they impact users</li>
<li><strong>Comprehensive Audit Trails</strong>: Track database operations, user activities, and configuration changes for compliance and security</li>
<li><strong>Performance Optimization</strong>: Monitor query performance, resource utilization, and capacity trends to optimize costs and user experience</li>
<li><strong>Operational Insights</strong>: Understand organizational activities, project changes, and infrastructure events across your multi-cloud deployments</li>
</ul>
<p>The Elastic <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> addresses these needs by collecting comprehensive telemetry data and presenting it through powerful visualizations and alerting capabilities.</p>
<h2>Integration Architecture and Data Streams</h2>
<p>The <a href="https://www.elastic.co/docs/reference/integrations/mongodb_atlas">MongoDB Atlas integration</a> leverages the <a href="https://www.mongodb.com/docs/atlas/reference/api-resources-spec/v2/">Atlas Administration API</a> to collect eight distinct data streams, each providing specific insights into different aspects of your Atlas deployment:</p>
<h3>Log Data Streams</h3>
<p><strong>Alert Logs</strong>: Capture real-time alerts generated by your Atlas instances, covering resource utilization thresholds (CPU, memory, disk space), database operations, security issues, and configuration changes. These alerts provide immediate visibility into critical events that require attention.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/alert_logs.png" alt="Alert Datastream" /></p>
<p><strong>Database Logs</strong>: Collect comprehensive operational logs from MongoDB instances, including incoming connections, executed commands, performance diagnostics, and issues encountered. These logs are invaluable for troubleshooting performance problems and understanding database behavior.</p>
<p><strong>MongoDB Audit Logs</strong>: Enable administrators to track system activity across deployments with multiple users and applications. These logs capture detailed events related to database operations including insertions, updates, deletions, user authentication, and access patterns—essential for security compliance and forensic analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/audit_logs.png" alt="Audit Datastream" /></p>
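<p>Once indexed, audit events can be sliced with ES|QL. The query below is an illustrative sketch for surfacing repeated authentication failures by user; verify the data stream pattern and ECS field names against the integration's exported fields:</p>
<pre><code class="language-bash">FROM logs-mongodb_atlas.*
  | WHERE event.outcome == &quot;failure&quot;
  | STATS failures = COUNT(*) BY user.name
  | SORT failures DESC
  | LIMIT 10
</code></pre>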
<p><strong>Organization Logs</strong>: Provide enterprise-level visibility into organizational activities, enabling tracking of significant actions involving database operations, billing changes, security modifications, host management, encryption settings, and user access management across teams.</p>
<p><strong>Project Logs</strong>: Offer project-specific event tracking, capturing detailed records of configuration modifications, user access changes, and general project activities. These logs are crucial for project-level auditing and change management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/project_logs.png" alt="Project Datastream" /></p>
<h3>Metrics Data Streams</h3>
<p><strong>Hardware Metrics</strong>: Collect comprehensive hardware performance data including CPU usage, memory consumption, JVM memory utilization, and overall system resource metrics for each process in your Atlas groups.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/hardware_metrics.png" alt="Hardware Datastream" /></p>
<p><strong>Disk Metrics</strong>: Monitor storage performance with detailed insights into I/O operations, read/write latency, and space utilization across all disk partitions used by MongoDB Atlas. These metrics help identify storage bottlenecks and plan capacity expansion.</p>
<p><strong>Process Metrics</strong>: Gather host-level metrics per MongoDB process, including detailed CPU usage patterns, I/O operation counts, memory utilization, and database-specific performance indicators like connection counts, operation rates, and cache utilization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/process_metrics.png" alt="Process Datastream" /></p>
<h2>Implementation Guide</h2>
<h3>Setting Up the Integration</h3>
<p>Getting started with MongoDB Atlas observability requires establishing API access and configuring the integration in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/setup.png" alt="Setup" /></p>
<ol>
<li>
<p><strong>Generate Atlas API Keys</strong>: Create <a href="https://www.mongodb.com/docs/atlas/configure-api-access/#grant-programmatic-access-to-an-organization">programmatic API keys</a> with Organization Owner permissions in the Atlas console, then invite these keys to your target projects with appropriate roles (Project Read Only for alerts/metrics, Project Data Access Read Only for audit logs).</p>
</li>
<li>
<p><strong>Enable Prerequisites</strong>: Enable database auditing in Atlas for projects where you want to collect audit and database logs. Gather your <a href="https://www.mongodb.com/docs/atlas/app-services/apps/metadata/#find-a-project-id">Project ID</a> and Organization ID from the Atlas UI.</p>
</li>
<li>
<p><strong>Configure in Kibana</strong>: Navigate to Management &gt; Integrations, search for &quot;MongoDB Atlas,&quot; and add the integration using your API credentials.</p>
</li>
</ol>
<p>The integration supports different permission levels for each data stream, ensuring you can collect operational metrics with minimal privileges while protecting sensitive audit data with elevated permissions.</p>
<h3>Considerations and Limitations</h3>
<ul>
<li><strong>Cluster Support</strong>: Log collection doesn't support M0 free clusters, M2/M5 shared clusters, or serverless instances</li>
<li><strong>Historical Data</strong>: Most log streams collect the previous 30 minutes of historical data</li>
<li><strong>Performance Impact</strong>: Large time spans may cause request timeouts; adjust HTTP Client Timeout accordingly</li>
</ul>
<h2>Real-World Use Cases and Benefits</h2>
<h3>Security and Compliance Monitoring</h3>
<p><strong>Audit Trail Management</strong>: Organizations in regulated industries leverage the audit logs to maintain comprehensive records of database access and modifications. The integration automatically parses and indexes audit events, making it easy to search for specific user activities, failed authentication attempts, or unauthorized access patterns.</p>
<p><strong>Security Incident Response</strong>: When security events occur, teams can quickly correlate alert logs with audit trails to understand the scope and timeline of incidents.</p>
<h3>Performance Optimization and Capacity Planning</h3>
<p><strong>Proactive Resource Management</strong>: By monitoring disk, hardware, and process metrics, teams can identify resource constraints before they impact application performance. For example, tracking disk I/O latency trends helps predict when storage upgrades are needed.</p>
<p><strong>Query Performance Analysis</strong>: Database logs combined with process metrics provide insights into slow queries, connection patterns, and resource utilization that enable database performance tuning.</p>
<h3>Operational Excellence</h3>
<p><strong>Multi-Environment Monitoring</strong>: Organizations running Atlas across development, staging, and production environments can standardize monitoring across all environments while maintaining environment-specific alerting thresholds.</p>
<p><strong>Change Management</strong>: Project and organization logs provide complete audit trails for infrastructure changes, enabling teams to correlate application issues with recent configuration modifications.</p>
<h2>Let's Try It!</h2>
<p>The MongoDB Atlas integration delivers comprehensive database observability that enables proactive management and optimization of your Atlas deployments. With pre-built dashboards and alerting capabilities, teams can gain immediate value while leveraging rich data streams for advanced analytics and custom monitoring solutions.</p>
<p>Deploy a cluster on <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> or <a href="https://www.elastic.co/cloud/serverless">Elastic Serverless</a>, or download the Elasticsearch stack, then spin up the MongoDB Atlas Integration, open the curated dashboards in Kibana and start monitoring your service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-mongodb-atlas-integration/title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability monitors metrics for Google Cloud in just minutes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-monitors-metrics-google-cloud</link>
            <guid isPermaLink="false">observability-monitors-metrics-google-cloud</guid>
            <pubDate>Mon, 20 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to enable Elastic Observability for Google Cloud Platform metrics.]]></description>
            <content:encoded><![CDATA[<p>Developers and SREs choose to host their applications on Google Cloud Platform (GCP) for its reliability, speed, and ease of use. On Google Cloud, development teams are finding additional value in migrating to Kubernetes on GKE, leveraging the latest serverless options like Cloud Run, and improving traditional, tiered applications with managed services.</p>
<p>Elastic Observability offers 16 out-of-the-box integrations for Google Cloud services with more on the way. A full list of Google Cloud integrations can be found in <a href="https://docs.elastic.co/en/integrations/gcp">our online documentation</a>.</p>
<p>In addition to our native Google Cloud integrations, Elastic Observability aggregates not only logs but also metrics for Google Cloud services and the applications running on Google Cloud compute services (Compute Engine, Cloud Run, Cloud Functions, Kubernetes Engine). All this data can be analyzed visually and more intuitively using Elastic®’s advanced machine learning (ML) capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML based metrics correlations, read: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p>That’s right, Elastic offers metrics ingest, aggregation, and analysis for Google Cloud services and applications on Google Cloud compute services. Elastic is more than logs — it offers a unified observability solution for Google Cloud environments.</p>
<p>In this blog, I’ll review how Elastic Observability can monitor metrics for a three-tier web application running on Google Cloud services, which include:</p>
<ul>
<li>Google Cloud Run</li>
<li>Google Cloud SQL for PostgreSQL</li>
<li>Google Cloud Memorystore for Redis</li>
<li>Google Cloud VPC Network</li>
</ul>
<p>As you will see, once the integration is installed, metrics begin arriving within minutes, and you can immediately start reviewing them.</p>
<h2>Prerequisites and config</h2>
<p>Here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>Ensure you have a Google Cloud project and a Service Account with permissions to pull the necessary data from Google Cloud (<a href="https://docs.elastic.co/en/integrations/gcp#authentication">see details in our documentation</a>).</li>
<li>We used <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Google Cloud’s three-tier app</a> and deployed it using the Google Cloud console.</li>
<li>We’ll walk through installing the general <a href="https://docs.elastic.co/en/integrations/gcp">Elastic Google Cloud Platform Integration</a>, which covers the services we want to collect metrics for.</li>
<li>We will <em>not</em> cover application monitoring; instead, we will focus on how Google Cloud services can be easily monitored.</li>
<li>In order to see metrics, you will need to put load on the application. We’ve also created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three-tier application overview</h2>
<p>Before we dive into the Elastic configuration, let's review what we are monitoring. If you follow the <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Jump Start Solution: Three-tier web app</a> instructions for deploying the task-tracking app, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/1.png" alt="1" /></p>
<p>What’s deployed:</p>
<ul>
<li>Cloud Run frontend tier that renders an HTML client in the user's browser and enables user requests to be sent to the task-tracking app</li>
<li>Cloud Run middle tier API layer that communicates with the frontend and the database tier</li>
<li>Memorystore for Redis instance in the database tier, caching and serving data that is read frequently</li>
<li>Cloud SQL for PostgreSQL instance in the database tier, handling requests that can't be served from the in-memory Redis cache</li>
</ul>
<p>At the end of the blog, we will also provide a Playwright script that can be run to send requests to this app in order to load it with example data and exercise its functionality. This will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get the application, Google Cloud integration on Elastic, and what gets ingested.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/2.png" alt="2 - start free trial" /></p>
<h3>Step 1: Deploy the Google Cloud three-tier application</h3>
<p>Follow the instructions listed out in <a href="https://cloud.google.com/architecture/application-development/three-tier-web-app">Jump Start Solution: Three-tier web app</a> choosing the <strong>Deploy through the console</strong> option for deployment.</p>
<h3>Step 2: Create a Google Cloud Service Account and download credentials file</h3>
<p>Once you’ve installed the app, the next step is to create a <em>Service Account</em> with a <em>Role</em> and a <em>Service Account Key</em> that will be used by Elastic’s integration to access data in your Google Cloud project.</p>
<p>Go to Google Cloud <a href="https://console.cloud.google.com/iam-admin/roles">IAM Roles</a> to create a Role with the necessary permissions. Click the <strong>CREATE ROLE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/3.png" alt="3" /></p>
<p>Give the Role a <strong>Title</strong> and an <strong>ID</strong>. Then add the 10 permissions listed below.</p>
<ul>
<li>cloudsql.instances.list</li>
<li>compute.instances.list</li>
<li>monitoring.metricDescriptors.list</li>
<li>monitoring.timeSeries.list</li>
<li>pubsub.subscriptions.consume</li>
<li>pubsub.subscriptions.create</li>
<li>pubsub.subscriptions.get</li>
<li>pubsub.topics.attachSubscription</li>
<li>redis.instances.list</li>
<li>run.services.list</li>
</ul>
<p>These permissions are a minimal set of what’s required for this blog post. You should add permissions for all the services for which you would like to collect metrics. If you need to add or remove permissions in the future, the Role’s permissions can be updated as many times as necessary.</p>
<p>Click the <strong>CREATE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/4.png" alt="4" /></p>
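<p>If you prefer the command line, the same Role can also be created with the gcloud CLI. The following is a sketch — the role ID elastic_metrics_reader is a hypothetical name, and your-project-id must be replaced with your own project ID:</p>
<pre><code class="language-bash"># Sketch: create a custom IAM role carrying the 10 permissions above.
# elastic_metrics_reader and your-project-id are placeholders.
gcloud iam roles create elastic_metrics_reader \
  --project=your-project-id \
  --title=&quot;Elastic Metrics Reader&quot; \
  --permissions=cloudsql.instances.list,compute.instances.list,\
monitoring.metricDescriptors.list,monitoring.timeSeries.list,\
pubsub.subscriptions.consume,pubsub.subscriptions.create,\
pubsub.subscriptions.get,pubsub.topics.attachSubscription,\
redis.instances.list,run.services.list
</code></pre>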
<p>Go to Google Cloud <a href="https://console.cloud.google.com/iam-admin/serviceaccounts">IAM Service Accounts</a> to create a Service Account that will be used by the Elastic integration for access to Google Cloud. Click the <strong>CREATE SERVICE ACCOUNT</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/5.png" alt="5" /></p>
<p>Enter a <strong>Service account name</strong> and a <strong>Service account ID</strong>. Click the <strong>CREATE AND CONTINUE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/6.png" alt="6" /></p>
<p>Then select the <strong>Role</strong> that you created previously and click the <strong>CONTINUE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/7.png" alt="7" /></p>
<p>Click the <strong>DONE</strong> button to complete the Service Account creation process.</p>
<p>Next select the Service Account you just created to see its details page. Under the <strong>KEYS</strong> tab, click the <strong>ADD KEY</strong> dropdown and select <strong>Create new key</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/8.png" alt="8" /></p>
<p>In the Create private key dialog window, with the <strong>Key type</strong> set as JSON, click the <strong>CREATE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/9.png" alt="9" /></p>
<p>The JSON key file will be automatically downloaded to your local computer’s <strong>Downloads</strong> folder. The credentials file will be named something like:</p>
<pre><code class="language-bash">your-project-id-12a1234b1234.json
</code></pre>
<p>You can rename the file to be something else. For the purpose of this blog, we’ll rename it to:</p>
<pre><code class="language-bash">credentials.json
</code></pre>
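<p>Step 2’s console clicks can also be scripted with the gcloud CLI. This sketch assumes the hypothetical account name elastic-metrics and the role ID elastic_metrics_reader, with your-project-id as a placeholder:</p>
<pre><code class="language-bash"># Sketch: create the Service Account, bind the custom Role to it,
# and download a JSON key. All names below are placeholders.
gcloud iam service-accounts create elastic-metrics \
  --project=your-project-id \
  --display-name=&quot;Elastic Metrics&quot;

gcloud projects add-iam-policy-binding your-project-id \
  --member=&quot;serviceAccount:elastic-metrics@your-project-id.iam.gserviceaccount.com&quot; \
  --role=&quot;projects/your-project-id/roles/elastic_metrics_reader&quot;

gcloud iam service-accounts keys create credentials.json \
  --iam-account=elastic-metrics@your-project-id.iam.gserviceaccount.com
</code></pre>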
<h3>Step 3: Create a Google Cloud VM instance</h3>
<p>To create the Compute Engine VM instance in Google Cloud, go to <a href="https://console.cloud.google.com/compute/instances">Compute Engine</a>. Then select <strong>CREATE INSTANCE.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/10.png" alt="10" /></p>
<p>Enter the following values for the VM instance details:</p>
<ul>
<li>Enter a <strong>Name</strong> of your choice for the VM instance.</li>
<li>Expand the <strong>Advanced Options</strong> section and the <strong>Networking</strong> sub-section.
<ul>
<li>Enter allow-ssh in the <strong>Network tags</strong> field.</li>
<li>Select the <strong>Network Interface</strong> to use the <strong>tiered-web-app-private-network</strong>, which is the network on which the Google Cloud three-tier web app is deployed.</li>
</ul>
</li>
</ul>
<p>Click the <strong>CREATE</strong> button to create the VM instance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/11.png" alt="11" /></p>
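<p>The same VM can be created with the gcloud CLI. This is a sketch — the instance name, zone, and machine type below are assumptions, and your-project-id is a placeholder:</p>
<pre><code class="language-bash"># Sketch: create the VM on the app's private network with the
# allow-ssh network tag. Depending on the network's mode, you may
# also need to pass --subnet.
gcloud compute instances create elastic-agent-vm \
  --project=your-project-id \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --network=tiered-web-app-private-network \
  --tags=allow-ssh
</code></pre>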
<h3>Step 4: SSH into the Google Cloud VM instance and upload the credentials file</h3>
<p>To SSH into the Google Cloud VM instance you created in the previous step, you’ll need to create a firewall rule in <strong>tiered-web-app-private-network</strong>, the network where the VM instance resides.</p>
<p>Go to the Google Cloud <a href="https://console.cloud.google.com/net-security/firewall-manager/firewall-policies/list"><strong>Firewall policies</strong></a> page. Click the <strong>CREATE FIREWALL RULE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/12.png" alt="12" /></p>
<p>Enter the following values for the Firewall Rule.</p>
<ul>
<li>Enter a firewall rule <strong>Name</strong>.</li>
<li>Select <strong>tiered-web-app-private-network</strong> for the <strong>Network</strong>.</li>
<li>Enter allow-ssh for <strong>Target Tags</strong>.</li>
<li>Enter 0.0.0.0/0 for the <strong>Source IPv4 ranges</strong>.</li>
<li>Click <strong>TCP</strong> and set the <strong>Ports</strong> to <strong>22</strong>.</li>
</ul>
<p>Click <strong>CREATE</strong> to create the firewall rule.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/13.png" alt="13" /></p>
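<p>The equivalent firewall rule can be created with the gcloud CLI. A sketch, with your-project-id as a placeholder:</p>
<pre><code class="language-bash"># Sketch: allow SSH (TCP/22) to instances tagged allow-ssh on the
# app's network. Note that 0.0.0.0/0 opens SSH to the internet;
# consider narrowing the source range to your own IP.
gcloud compute firewall-rules create allow-ssh \
  --project=your-project-id \
  --network=tiered-web-app-private-network \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=allow-ssh
</code></pre>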
<p>After the new firewall rule is created, you can SSH into your VM instance. Go to the <a href="https://console.cloud.google.com/compute/instances">Google Cloud VM instances</a> page and select the VM instance you created in the previous step to see its details page. Click the <strong>SSH</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/14.png" alt="14" /></p>
<p>Once you are SSH’d inside the VM instance terminal window, click the <strong>UPLOAD FILE</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/15.png" alt="15" /></p>
<p>Select the credentials.json file located on your local computer and click the <strong>Upload Files</strong> button to upload the file.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/16.png" alt="16" /></p>
<p>In the VM instance’s SSH terminal, run the following command to get the full path to your Google Cloud Service Account credentials file.</p>
<pre><code class="language-bash">realpath credentials.json
</code></pre>
<p>This should return the full path to your Google Cloud Service Account credentials file.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/17.png" alt="17" /></p>
<p>Copy the credentials file’s full path and save it in a handy location to be used in a later step.</p>
<h3>Step 5: Add the Elastic Google Cloud integration</h3>
<p>Navigate to the Google Cloud Platform integration in Elastic by selecting <strong>Integrations</strong> from the top-level menu. Search for google and click the <strong>Google Cloud Platform</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/18.png" alt="18" /></p>
<p>Click <strong>Add Google Cloud Platform</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/19.png" alt="19" /></p>
<p>Click <strong>Add integration only (skip agent installation)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/20.png" alt="20" /></p>
<p>Set the <strong>Project Id</strong> input text box to your Google Cloud project ID. Next, paste the credentials file’s full path into the <strong>Credentials File</strong> input text box.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/21.png" alt="21" /></p>
<p>As you can see, the general Elastic Google Cloud Platform Integration will collect a significant amount of data from 16 Google Cloud services. If you don’t want to install this general Elastic Google Cloud Platform Integration, you can select individual integrations to install. Click <strong>Save and continue</strong>.</p>
<p>You’ll be presented with a confirmation dialog window. Click <strong>Add Elastic Agent to your hosts</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/22.png" alt="22" /></p>
<p>This will display the instructions required to install the Elastic agent. Copy the command under the <strong>Linux Tar</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/23.png" alt="23" /></p>
<p>Next, you will need to SSH into the Google Cloud VM instance and run the commands copied from the <strong>Linux Tar</strong> tab. Go to <a href="https://console.cloud.google.com/compute/instances">Compute Engine</a>. Then click the name of the VM instance that you created in Step 3. Log in to the VM by clicking the <strong>SSH</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/14.png" alt="24 - instance" /></p>
<p>Once you are SSH’d inside the VM instance terminal window, run the commands copied previously from the <strong>Linux Tar</strong> tab in the <strong>Install Elastic Agent on your host</strong> instructions.</p>
<p>When the installation completes, you’ll see a confirmation message in the Install Elastic Agent on your host form. Click the <strong>Add the integration</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/25.png" alt="25 - add agent" /></p>
<p>Excellent! The Elastic agent is sending data to Elastic Cloud. Now let’s observe some metrics.</p>
<h3>Step 6: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put load on the application.</p>
<p>Here is a simple <a href="https://playwright.dev/">Playwright</a> script you can run to add traffic and exercise the functionality of the Google Cloud three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for Google Cloud Threetierapp&quot;, async ({ page }) =&gt; {
  // Replace this URL with the URL of your own deployed frontend
  await page.goto(&quot;https://tiered-web-app-fe-zg62dali3a-uc.a.run.app&quot;);
  // Insert 2 todo items
  await page.fill(&quot;id=todo-new&quot;, (Math.random() * 100).toString());
  await page.keyboard.press(&quot;Enter&quot;);
  await page.waitForTimeout(1000);
  await page.fill(&quot;id=todo-new&quot;, (Math.random() * 100).toString());
  await page.keyboard.press(&quot;Enter&quot;);
  await page.waitForTimeout(1000);
  // Click one todo item
  await page.getByRole(&quot;checkbox&quot;).nth(0).check();
  await page.waitForTimeout(1000);
  // Delete one todo item
  const deleteButton = page.getByText(&quot;delete&quot;).nth(0);
  await deleteButton.dispatchEvent(&quot;click&quot;);
  await page.waitForTimeout(4000);
});
</code></pre>
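<p>To keep traffic flowing, save the script in a project where Playwright is installed (for example via npm init playwright@latest) and run it in a loop from a shell. The file name tests/load.spec.ts below is just an assumption:</p>
<pre><code class="language-bash"># Sketch: run the Playwright test repeatedly to generate steady load.
while true; do
  npx playwright test tests/load.spec.ts
  sleep 5
done
</code></pre>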
<h3>Step 7: Go to Google Cloud dashboards in Elastic</h3>
<p>With Elastic Agent running, you can go to Elastic Dashboards to view what’s being ingested. Simply search for “dashboard” in Elastic and choose <strong>Dashboards.</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/26.png" alt="26 - dashboard" /></p>
<p>This will open the Elastic Dashboards page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/27.png" alt="27" /></p>
<p>In the Dashboards search box, search for GCP and click the <strong>[Metrics GCP] CloudSQL PostgreSQL Overview</strong> dashboard, one of the many out-of-the-box dashboards available. Let’s see what comes up.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/28.png" alt="28" /></p>
<p>On the Cloud SQL dashboard, we can see a sampling of the many available metrics:</p>
<ul>
<li>Disk write ops</li>
<li>CPU utilization</li>
<li>Network sent and received bytes</li>
<li>Transaction count</li>
<li>Disk bytes used</li>
<li>Disk quota</li>
<li>Memory usage</li>
<li>Disk read ops</li>
</ul>
<p>Next let’s take a look at metrics for Cloud Run.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/29.png" alt="29 - line graphs" /></p>
<p>We’ve created a custom dashboard using the <strong>Create dashboard</strong> button on the Elastic Dashboards page. Here we see a few of the numerous available metrics:</p>
<ul>
<li>Container instance count</li>
<li>CPU utilization for the three-tier app frontend and API</li>
<li>Request count for the three-tier app frontend and API</li>
<li>Bytes in and out of the API</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/30.png" alt="30" /></p>
<p>This is a custom dashboard created for Memorystore, where we can see a sampling of the available metrics:</p>
<ul>
<li>Network traffic to the Memorystore Redis instance</li>
<li>Count of the keys stored in Memorystore Redis</li>
<li>CPU utilization of the Memorystore Redis instance</li>
<li>Memory usage of the Memorystore Redis instance</li>
</ul>
<p><strong>Congratulations, you have now started monitoring metrics from key Google Cloud services for your application!</strong></p>
<h2>What to monitor on Google Cloud next?</h2>
<h3>Add logs from Google Cloud Services</h3>
<p>Now that metrics are being monitored, you can add logging as well. There are several options for ingesting logs.</p>
<p>The Google Cloud Platform Integration in the Elastic Agent has four separate logs settings: audit logs, firewall logs, VPC Flow logs, and DNS logs. Just ensure you turn on what you wish to receive.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/31.png" alt="31" /></p>
<h3>Analyze your data with Elastic machine learning</h3>
<p>Once metrics, logs, or both are in Elastic, start analyzing your data with Elastic’s ML capabilities. A great review of these features can be found here:</p>
<ul>
<li><a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM Telemetry to determine root causes in transactions</a></li>
<li><a href="https://www.elastic.co/elasticon/archive/2020/global/machine-learning-and-the-elastic-stack-everywhere-you-need-it">Introduction to Elastic Machine Learning</a></li>
</ul>
<h2>Conclusion: Monitoring Google Cloud service metrics with Elastic Observability is easy!</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help you monitor Google Cloud service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of Google Cloud service metrics.</li>
<li>It’s easy to set up ingest from Google Cloud services via the Elastic Agent.</li>
<li>Elastic Observability has multiple out-of-the-box Google Cloud service dashboards you can use to preliminarily review information and then modify for your needs.</li>
<li>For metrics not covered by out-of-the-box dashboards, custom dashboards can be easily created to visualize metrics that are important to you.</li>
<li>16 Google Cloud services are supported as part of Google Cloud Platform Integration on Elastic Observability, with more services being added regularly.</li>
<li>As noted in related blogs, you can analyze your Google Cloud service metrics with Elastic’s machine learning capabilities.</li>
</ul>
<p>Try it out for yourself by signing up via <a href="https://console.cloud.google.com/marketplace/product/elastic-prod/elastic-cloud">Google Cloud Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_google_cloud_platform_gcp_regions">Elastic Cloud regions on Google Cloud</a> around the world. Your Google Cloud Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with Google Cloud.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
<enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-google-cloud/serverless-launch-blog-image.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability monitors metrics for Microsoft Azure in just minutes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-monitors-metrics-microsoft-azure</link>
            <guid isPermaLink="false">observability-monitors-metrics-microsoft-azure</guid>
            <pubDate>Mon, 29 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow this step-by-step process to enable Elastic Observability for Microsoft Azure metrics.]]></description>
            <content:encoded><![CDATA[<p>Developers and SREs choose Microsoft Azure to run their applications because it is a trustworthy world-class cloud platform. It has also proven itself over the years as an extremely powerful and reliable infrastructure for hosting business-critical applications.</p>
<p>Elastic Observability offers over 25 out-of-the-box integrations for Microsoft Azure services with more on the way. A full list of Azure integrations can be found in <a href="https://docs.elastic.co/integrations/azure">our online documentation</a>.</p>
<p>Elastic Observability aggregates not only logs but also metrics for Azure services and the applications running on Azure compute services (Virtual Machines, Functions, Kubernetes Service, etc.). All this data can be analyzed visually and more intuitively using Elastic®’s advanced machine learning (ML) capabilities, which help detect performance issues and surface root causes before end users are affected.</p>
<p>For more details on how Elastic Observability provides application performance monitoring (APM) capabilities such as service maps, tracing, dependencies, and ML-based metrics correlations, read <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p>That’s right, Elastic offers capabilities to collect, aggregate, and analyze metrics for Microsoft Azure services and applications running on Azure. Elastic Observability is for more than just capturing logs — it offers a unified observability solution for Microsoft Azure workloads.</p>
<p>In this blog, we’ll review how Elastic Observability can monitor metrics for a three-tier web application running on Microsoft Azure and leveraging:</p>
<ul>
<li>Microsoft Azure Virtual Machines</li>
<li>Microsoft Azure SQL database</li>
<li>Microsoft Azure Virtual Network</li>
</ul>
<p>As you will see, once the integration is installed, metrics will start arriving right away and you can immediately begin deriving insights.</p>
<h2>Prerequisites and config</h2>
<p>Here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have a Microsoft Azure account and an Azure service principal with permission to read monitoring data from Microsoft Azure (<a href="https://docs.elastic.co/integrations/azure_metrics/monitor#integration-specific-configuration-notes">see details in our documentation</a>).</li>
<li>This post does <em>not</em> cover application monitoring; instead, we will focus on how Microsoft Azure services can be easily monitored. If you want to get started with examples of application monitoring, see our <a href="https://github.com/elastic/observability-examples/tree/main/azure/container-apps">Hello World observability code samples</a>.</li>
<li>In order to see metrics, you will need to put load on the application. We’ve created a Playwright script to drive traffic to the application.</li>
</ul>
<h2>Three-tier application overview</h2>
<p>Before we dive into the Elastic deployment setup and configuration, let's review what we are monitoring. If you follow the <a href="https://learn.microsoft.com/en-us/training/modules/n-tier-architecture/">Microsoft Learn N-tier example app</a> instructions for deploying the &quot;What's for Lunch?&quot; app, you will have the following deployed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-application-overview.png" alt="three tier application overview" /></p>
<p>What’s deployed:</p>
<ul>
<li>Microsoft Azure VM presentation tier that renders an HTML client in the user's browser and enables user requests to be sent to the “What’s for Lunch?” app</li>
<li>Microsoft Azure VM application tier that communicates with the presentation and the database tier</li>
<li>Microsoft Azure SQL instance in the database tier, handling requests from the application tier to store and serve data</li>
</ul>
<p>At the end of the blog, we will also provide a Playwright script that can be run to send requests to this app in order to load it with example data and exercise its functionality. This will help drive metrics to “light up” the dashboards.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to deploy the example three-tier application, set up the Azure integration in Elastic, and visualize what gets ingested in Elastic’s Kibana® dashboards.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-free-trial.png" alt="elastic cloud free trial sign up" /></p>
<h3>Step 1: Deploy the Microsoft Azure three-tier application</h3>
<p>From the <a href="https://portal.azure.com/">Azure portal</a>, click the Cloud Shell icon at the top of the portal to open Cloud Shell…</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-open-cloud-shell.png" alt="open cloud shell" /></p>
<p>… and when the Cloud Shell first opens, select <strong>Bash</strong> as the shell type to use.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-cloud-shell-bash.png" alt="cloud shell bash" /></p>
<p>If you’re prompted that “You have no storage mounted,” then click the <strong>Create storage</strong> button to create a file store to be used for saving and editing files from Cloud Shell.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-storage.png" alt="cloud shell create storage" /></p>
<p>You should now see the open Cloud Shell terminal.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-cloud-shell-terminal.png" alt="cloud shell terminal" /></p>
<p>Run the following command in Cloud Shell to define the environment variables that we’ll be using in the Cloud Shell commands required to deploy and view the sample application.</p>
<p>Be sure to specify a valid RESOURCE_GROUP from your available <a href="https://portal.azure.com/#view/HubsExtension/BrowseResourceGroups">Resource Groups listed in the Azure portal</a>. Also specify a new password to replace the SpecifyNewPasswordHere placeholder text before running the command. See the Microsoft <a href="https://learn.microsoft.com/en-us/sql/relational-databases/security/password-policy?view=sql-server-ver16#password-complexity">password policy documentation</a> for password requirements.</p>
<pre><code class="language-bash">RESOURCE_GROUP=&quot;test&quot;
APP_PASSWORD=&quot;SpecifyNewPasswordHere&quot;
</code></pre>
<p>Run the following az deployment group create command, which will deploy the example three-tier web app in around five minutes.</p>
<pre><code class="language-bash">az deployment group create --resource-group $RESOURCE_GROUP --template-uri https://raw.githubusercontent.com/MicrosoftDocs/mslearn-n-tier-architecture/master/Deployment/azuredeploy.json --parameters password=$APP_PASSWORD
</code></pre>
<p>After the deployment has completed, run the following command, which returns the URL for the app.</p>
<pre><code class="language-bash">az deployment group show --output table --resource-group $RESOURCE_GROUP --name azuredeploy --query properties.outputs.webSiteUrl
</code></pre>
<p>Copy the web app URL and paste it into a browser to view the example “What’s for Lunch?” web app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-whats-for-lunch.png" alt="whats for lunch app" /></p>
<h3>Step 2: Create an Azure service principal and grant access permission</h3>
<p>Go to the <a href="https://portal.azure.com/">Microsoft Azure Portal</a>. Search for active directory and select <strong>Microsoft Entra ID</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-active-directory.png" alt="search active directory" /></p>
<p>Copy the <strong>Tenant ID</strong> for use in a later step in this blog post. This ID is required to configure Elastic Agent to connect to your Azure account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-overview.png" alt="your organization overview" /></p>
<p>In the navigation pane, select <strong>App registrations</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-overview-app-registrations.png" alt="your organization overview app registrations" /></p>
<p>Then click <strong>New registration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-your-organization-new-registration.png" alt="your organization new registrations" /></p>
<p>Type the name of your application (this tutorial uses three-tier-app-azure) and click <strong>Register</strong> (accept the default values for other settings).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-register_an_application.png" alt="register an application" /></p>
<p>Copy the <strong>Application (client) ID</strong> and save it for later. This ID is required to configure Elastic Agent to connect to your Azure account.</p>
<p>In the navigation pane, select <strong>Certificates &amp; secrets</strong>, and then click <strong>New client secret</strong> to create a new security key.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-app-new-client-secret.png" alt="three tier app new client secret" /></p>
<p>Type a description of the secret and select an expiration. Click <strong>Add</strong> to create the client secret. Under <strong>Value</strong>, copy the secret value and save it (along with your client ID) for later.</p>
<p>After creating the Azure service principal, you need to grant it the correct permissions. In the Azure Portal, search for and select <strong>Subscriptions</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-three-tier-subscriptions.png" alt="three tier subscriptions" /></p>
<p>In the Subscriptions page, click the name of your subscription. On the subscription details page, copy your <strong>Subscription ID</strong> and save it for a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-essentials-copy.png" alt="subscription essentials copy" /></p>
<p>In the navigation pane, select <strong>Access control (IAM)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-access-control.png" alt="subscription access control" /></p>
<p>Click <strong>Add</strong> and select <strong>Add role assignment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-subscription-access-control-add-role-assignment.png" alt="subscription access control add role assignment" /></p>
<p>On the <strong>Role</strong> tab, select the <strong>Monitoring Reader</strong> role and then click <strong>Next</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-monitoring-readers.png" alt="add role assignment monitoring reader" /></p>
<p>On the <strong>Members</strong> tab, select the option to assign access to <strong>User, group, or service principal</strong>. Click <strong>Select members</strong>, and then search for and select the principal you created earlier. For the description, enter the name of your service principal. Click <strong>Next</strong> to review the role assignment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-description.png" alt="add role assignment description" /></p>
<p>Click <strong>Review + assign</strong> to grant the service principal access to your subscription.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-role-assignment-review-assign.png" alt="add role assignment review assign" /></p>
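<p>If you prefer the Azure CLI, the service principal creation and role assignment above can be done in a single command from Cloud Shell. This is a sketch — the SUBSCRIPTION_ID value is a placeholder, and the command prints the appId (client ID), password (client secret), and tenant (tenant ID) values you’ll need later:</p>
<pre><code class="language-bash"># Sketch: create the service principal and grant Monitoring Reader
# on the subscription in one step. Replace the placeholder ID.
SUBSCRIPTION_ID=&quot;your-subscription-id&quot;
az ad sp create-for-rbac \
  --name three-tier-app-azure \
  --role &quot;Monitoring Reader&quot; \
  --scopes &quot;/subscriptions/$SUBSCRIPTION_ID&quot;
</code></pre>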
<h3>Step 3: Create an Azure VM instance</h3>
<p>In the Azure Portal, search for and select <strong>Virtual machines</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-search-virtual-machines.png" alt="search virtual machines" /></p>
<p>On the <strong>Virtual machines</strong> page, click <strong>+ Create</strong> and select <strong>Azure virtual machine</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-virtual-machine.png" alt="azure virtual machine" /></p>
<p>On the Virtual machine creation page, enter a name like “metrics-vm” for the virtual machine name and select VM Size to be “Standard_D2s_v3 - 2 vcpus, 8 GiB memory.” Click the <strong>Next : Disks</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-macine-next-disks.png" alt="create a virtual machine next disks" /></p>
<p>On the <strong>Disks</strong> page, keep the default settings and click the <strong>Next : Networking</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-next-networking.png" alt="create a virtual machine next networking" /></p>
<p>On the <strong>Networking</strong> page, demo-vnet should be selected for <strong>Virtual network</strong> and demo-biz-subnet should be selected for <strong>Subnet</strong>. These resources are created as part of the three-tier example app’s deployment that was done in Step 1.</p>
<p>Click the <strong>Review + create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-review-create.png" alt="create virtual machine review create" /></p>
<p>On the <strong>Review</strong> page, click the <strong>Create</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-create-virtual-machine-validation-passed.png" alt="create virtual machine validation passed" /></p>
<h3>Step 4: Install the Azure Resource Metrics integration</h3>
<p>In your <a href="https://cloud.elastic.co/home">Elastic Cloud</a> deployment, navigate to the Elastic Azure integrations by selecting <strong>Integrations</strong> from the top-level menu. Search for “azure resource” and click the <strong>Azure Resource Metrics</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-integrations-azure-resource-metrics.png" alt="integrations azure resource metrics" /></p>
<p>Click <strong>Add Azure Resource Metrics</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-resource-metrics.png" alt="azure resource metrics" /></p>
<p>Click <strong>Add integration only (skip agent installation)</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-integration-only.png" alt="add integration only" /></p>
<p>Enter the values that you saved previously for Client ID, Client Secret, Tenant ID, and Subscription ID.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-azure-resource-metrics-integration.png" alt="add azure resource metrics integration" /></p>
<p>As you can see, the Azure Resource Metrics integration will collect a significant amount of data from eight Azure services. Click <strong>Save and continue</strong>.</p>
<p>You’ll be presented with a confirmation dialog window. Click <strong>Add Elastic Agent to your hosts</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-resource-metrics-integration-added.png" alt="azure resource metrics integration added" /></p>
<p>This will display the instructions required to install the Elastic agent. Copy the command under the <strong>Linux Tar</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-agent.png" alt="add agent linux tar" /></p>
<p>Next, you will need to SSH into the Azure VM instance and run the commands copied from the <strong>Linux Tar</strong> tab. Go to <a href="https://portal.azure.com/#blade/HubsExtension/BrowseResourceBlade/resourceType/Microsoft.Compute/VirtualMachines">Azure Virtual Machines</a> in the Azure portal, then click the name of the VM instance that you created in Step 3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-metrics-vm.png" alt="metrics vm" /></p>
<p>Click the <strong>Select</strong> button in the <strong>SSH Using Azure CLI</strong> section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-metrics-vm-connect.png" alt="metrics vm connect" /></p>
<p>Select the “I understand …” checkbox and then click the <strong>Configure + connect</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-ssh-using-azure-cli.png" alt="ssh using azure cli" /></p>
<p>Once you are SSH’d into the VM instance terminal, run the commands you copied earlier from the <strong>Linux Tar</strong> tab of the <strong>Install Elastic Agent on your host</strong> instructions. When the installation completes, you’ll see a confirmation message in the <strong>Install Elastic Agent on your host</strong> form.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-add-agent-confirmed.png" alt="add agent confirmed" /></p>
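<p>For reference, the copied commands follow this general pattern (the version, Fleet URL, and enrollment token below are placeholders; use the exact commands from your own deployment’s <strong>Linux Tar</strong> tab):</p>
<pre><code class="language-bash"># Download and unpack the Elastic Agent (use the version shown in your Fleet UI)
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-&lt;version&gt;-linux-x86_64.tar.gz
tar xzvf elastic-agent-&lt;version&gt;-linux-x86_64.tar.gz
cd elastic-agent-&lt;version&gt;-linux-x86_64
# Enroll the agent against your Fleet URL with the generated enrollment token
sudo ./elastic-agent install --url=&lt;fleet-url&gt; --enrollment-token=&lt;enrollment-token&gt;
</code></pre>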
<p>Super! The Elastic agent is sending data to Elastic Cloud. Now let’s observe some metrics.</p>
<h3>Step 5: Run traffic against the application</h3>
<p>While getting the application running is fairly easy, there is nothing to monitor or observe with Elastic unless you put some load on the application.</p>
<p>Here is a simple script you can run with <a href="https://playwright.dev/">Playwright</a> to add traffic and exercise the functionality of the Azure three-tier application:</p>
<pre><code class="language-javascript">import { test, expect } from &quot;@playwright/test&quot;;

test(&quot;homepage for Microsoft Azure three tier app&quot;, async ({ page }) =&gt; {
  // Load web app
  await page.goto(&quot;http://20.172.198.231/&quot;);
  // Add lunch suggestions
  const suggestions = [&quot;tacos&quot;, &quot;sushi&quot;, &quot;pizza&quot;, &quot;burgers&quot;, &quot;salad&quot;, &quot;sandwiches&quot;];
  for (const suggestion of suggestions) {
    await page.fill(&quot;id=txtAdd&quot;, suggestion);
    await page.keyboard.press(&quot;Enter&quot;);
    await page.waitForTimeout(1000);
  }
  // Click the vote button for each suggestion
  for (let n = 1; n &lt;= 11; n += 2) {
    await page.getByRole(&quot;button&quot;).nth(n).click();
  }
  // Click the remove button for each suggestion, last to first
  for (let n = 12; n &gt;= 2; n -= 2) {
    await page.getByRole(&quot;button&quot;).nth(n).click();
  }
});
</code></pre>
<h3>Step 6: View Azure dashboards in Elastic</h3>
<p>With Elastic Agent running, you can go to Elastic Dashboards to view what’s being ingested. Simply search for “dashboard” in Elastic and choose <strong>Dashboard</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-dashboard.png" alt="dashboard" /></p>
<p>This will open the Elastic Dashboards page. In the Dashboards search box, search for “azure vm” and click the <strong>[Azure Metrics] Compute VMs Overview</strong> dashboard, one of the many out-of-the-box dashboards available.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-dashboards-create.png" alt="dashboards create" /></p>
<p>You will see a Dashboard populated with your deployed application’s VM metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/blog-elastic-azure-compute-vm.png" alt="azure compute vm" /></p>
<p>On the Azure Compute VM dashboard, we can see the following sampling of some of the many available metrics:</p>
<ul>
<li>CPU utilization</li>
<li>Available memory</li>
<li>Network sent and received bytes</li>
<li>Disk writes and reads metrics</li>
</ul>
<p>For metrics not covered by out-of-the-box dashboards, custom dashboards can be easily created to visualize metrics that are important to you.</p>
<p><strong>Congratulations, you have now started monitoring metrics from Microsoft Azure services for your application!</strong></p>
<h2>Analyze your data with Elastic AI Assistant</h2>
<p>Once metrics and logs (or either one) are in Elastic, start analyzing your data with <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">context-aware insights using the Elastic AI Assistant for Observability</a>.</p>
<h2>Conclusion: Monitoring Microsoft Azure service metrics with Elastic Observability is easy!</h2>
<p>We hope you’ve gotten an appreciation for how Elastic Observability can help you monitor Azure service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingest and analysis of Azure service metrics.</li>
<li>It’s easy to set up ingest from Azure services via the Elastic Agent.</li>
<li>Elastic Observability has multiple out-of-the-box Azure service dashboards you can use to preliminarily review information and then modify for your needs.</li>
</ul>
<p>Try it out for yourself by signing up via <a href="https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryItemDetailsBladeNopdl/id/elastic.ec-azure-pp">Microsoft Azure Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_azure_regions">Elastic Cloud regions on Microsoft Azure</a> around the world. Your Azure Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with Microsoft Azure.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-monitors-metrics-microsoft-azure/Azure_Dark_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Streams for Observability: Your first stop for investigations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations</link>
            <guid isPermaLink="false">elastic-observability-streams-ai-logs-investigations</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Introducing Elastic Streams, a new AI observability feature that transforms logs from a noisy and expensive data source into a primary investigation signal.]]></description>
            <content:encoded><![CDATA[<p>We're excited to introduce Streams, a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips the script by transforming raw logs into your most valuable asset to immediately identify not only the root cause, but also the why behind the root cause to enable instant resolution.</p>
<p>SREs today identify the &quot;what&quot; with metrics and the &quot;where&quot; with traces, which are important for troubleshooting. However, it's often the &quot;why&quot; that's needed for faster and more accurate incident resolution. The crucial “why” is buried in your logs, but the massive volume and unstructured nature of logs in modern microservice environments have made them difficult to use effectively. This has forced teams into a difficult position: either spend countless hours building and maintaining complex data pipelines to tame the chaos, or drop valuable log data to control costs and risk critical visibility gaps. As a result, when an incident occurs, SREs waste precious time manually hunting for clues and reverse-engineering data instead of quickly resolving the issue.</p>
<h2>Streams, from ingest to answers with logs</h2>
<p>Streams directly addresses this challenge by using AI to transform the chaos of raw logs into your clearest path to a solution, enabling logs to be the primary signal for investigations. It processes raw logs at scale ingested from any source and in any format (structured and unstructured), then partitions, parses, and helps manage retention and data quality. Streams reduces the need for SREs to constantly normalize data, manage custom schemas, or sift through endless noise. Streams also surfaces Significant Events, like major errors and anomalies, enabling you to be proactive in your investigations. SREs can now focus on resolving issues faster than ever by spending less time on data management and hunting through the noise.</p>
<p>Let's see Streams in action. In the demo below, watch an SRE tackle an issue with a critical trading application in production. In minutes, Streams processes the raw logs, pinpoints a Java out-of-memory error, and the AI Assistant guides the SRE straight to the root cause, turning hours of manual work into a quick fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-in-action.gif" alt="Streams in Action" /></p>
<p>Let's walk through some of the key Streams capabilities highlighted in the video:</p>
<ul>
<li><strong>AI-based partitioning</strong> - simplifies ingest by allowing SREs to send all logs to a single endpoint, without worrying about agents or integrations. Our AI automatically determines that logs are coming from two different systems, Hadoop and Spark. As more data comes through, it continues to learn and identify additional components, making segmentation effortless.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-01.png" alt="AI-based partitioning" /></p>
<ul>
<li><strong>AI-based parsing</strong> - eliminates the manual effort of building and managing log processing pipelines. In the demo, Streams automatically detects logs from Spark and generates a Grok rule that parses 100% of the fields.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-02.png" alt="AI-based parsing" /></p>
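<p>To make the idea concrete, here is a hypothetical Spark-style log line and a JavaScript regular expression doing roughly what such a Grok rule does. This is illustrative only: the rule Streams actually generates, its Grok syntax, and its field names will differ.</p>
<pre><code class="language-javascript">// A made-up log line in Spark's default log layout (date, level, component, message)
const line = '24/10/27 14:03:11 ERROR Executor: Exception in task 3.0 (TID 42)';
// Regex equivalent of a Grok pattern that splits the line into named fields
const pattern = /^(\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([\w.]+): (.*)$/;
const [, timestamp, level, component, message] = line.match(pattern);
// level is 'ERROR', component is 'Executor', message holds the exception text
</code></pre>
<p>Once parsed this way, each field becomes individually searchable and aggregatable, which is what makes the downstream analysis in the demo possible.</p>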
<ul>
<li><strong>Identifying Significant Events</strong> - cuts through the noise so you can focus immediately on key issues. Streams analyzes the parsed Spark logs and pinpoints the Java out-of-memory errors and exceptions. This provides SREs with a clear, actionable starting point for their investigations instead of forcing them to hunt through raw data.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-03.png" alt="Significant Events" /></p>
<ul>
<li><strong>AI Assistant</strong> - The AI Assistant provides instant root cause analysis, turning hours of work into immediate answers. After Streams identifies the Java OOM error, an SRE can analyze logs in Discover with the AI Assistant. Within moments, it determines the root cause is that Spark lacks sufficient memory for the datasets being processed, delivering a precise answer to guide remediation.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-04.png" alt="AI Assistant" /></p>
<p>One thing that isn't shown in the video is how easy Streams makes log ingestion. In the example above, we used the OTel Collector and simply configured the processors, exporters, and service sections in the values.yaml file for the OTel Collector's Helm chart:</p>
<pre><code class="language-yaml">processors:
  batch:
  transform/logs-streams:
    log_statements:
      - context: resource
        statements:
          - set(attributes[&quot;elasticsearch.index&quot;], &quot;logs&quot;)
exporters:
  debug:
  otlp/ingest:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch, transform/logs-streams]
      exporters: [otlp/ingest, debug]
</code></pre>
<p>With Streams you can use any log forwarder: the OTel Collector (as in the example above), Fluentd, Fluent Bit, and so on. This keeps ingestion simple and ensures you aren't locked into any specific log forwarder for Elastic.</p>
<p>As you've seen in this example, Streams helps SREs focus on finding the “why”, without the manual, error-prone work of making logs usable. What used to happen in hours can now be accomplished in minutes.</p>
<h2>Streams: Key Features and availability</h2>
<p>While the previous example shows how easy and fast it is to get to the root cause with partitioning, parsing, Significant Events, and the AI Assistant, Streams has more capabilities, which are highlighted in the following diagram:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-05.png" alt="Streams" /></p>
<p>All of these capabilities are available in two primary modes: Streams for data already indexed in Elasticsearch, and Logs Streams for ingesting raw logs directly. Both modes support AI-driven partitioning and parsing, the identification of Significant Events, and essential tools for managing data quality, retention, and cost-efficient storage.</p>
<p><strong>Streams (GA in 9.2)</strong></p>
<p>Provides foundational capabilities that reduce pipeline management for SREs. Streams works with logs from existing agents and integrations as well as raw, unstructured logs coming through Logs Streams. Key capabilities include:</p>
<ul>
<li>
<p>Streams Processing: simulate and refine log parsing using AI-powered Parsing or a point-and-click UI. Compare before-and-after states and modify schemas to simplify log processing.</p>
</li>
<li>
<p>Streams Retention Management: define time-based or advanced ILM policies directly in the UI, gain visibility into ingestion volume, and manage data in the failure store.</p>
</li>
<li>
<p>Streams Data Quality: detect and fix ingestion failures via a failure store that captures and exposes failed documents for inspection.</p>
</li>
</ul>
<p><strong>Logs Streams (Tech Preview)</strong></p>
<p>Enables SREs to ingest any log, in any format, directly into Elasticsearch, without the need for agents or integrations. Key capabilities include:</p>
<ul>
<li>
<p>Direct Ingestion with any log forwarder into Elasticsearch: send raw logs directly into the /logs index using any mechanism, such as the logs_index parameter in an OpenTelemetry collector.</p>
</li>
<li>
<p>AI-Driven Partitioning: automatically or manually segment a single log stream into distinct parts (e.g., by service or component) using contextual AI-based suggestions.</p>
</li>
</ul>
<p><strong>Significant Events (Tech Preview)</strong></p>
<p>Significant Events is available in both Streams and Logs Streams, and surfaces errors and anomalies that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other signals of change. These events act as actionable markers, giving SREs early warning and an investigative starting point before a service impact occurs.</p>
<h2>What does this mean for SREs in practice?</h2>
<p>With Elastic Streams, SREs no longer need to spend time wrangling data before they can investigate. Logs are the primary investigation signal because Streams gives SREs the ability to:</p>
<ul>
<li><strong>Log everything in any format, and don't worry about pipelines</strong> - Stop wasting time building and maintaining complex ingestion pipelines. Send logs in any format, structured or unstructured, from any source directly to a single Elastic endpoint, without needing specific agents. Use OTel collectors or any other data shipper to send logs to Elastic. Streams' AI-driven processing parses and structures your log data, making it immediately “ready for investigation.” This means you can adapt to new log formats on the fly without maintaining brittle configurations. Streams ensures you always have the data you need, the moment you need it.</li>
<li><strong>Don't just collect logs, get answers from them</strong> - Streams analyzes your data to surface “Significant Events,” proactively identifying critical errors, anomalies, and performance bottlenecks like out-of-memory exceptions. Instead of manually sifting through terabytes of data, you get a clear, prioritized starting point for your investigation. This allows you to go from symptom to solution in minutes, fixing issues before they impact users.</li>
<li><strong>Achieve complete visibility at a lower cost</strong> - Get comprehensive visibility across all your services without the expected expense. By intelligently structuring data and surfacing only the most critical events, Streams reduces operational complexity and dramatically cuts down root cause analysis time. This efficiency allows you to store all relevant log data cost-effectively, ensuring you never have to sacrifice crucial visibility to meet a budget. Get clearer answers faster and lower your total cost of ownership.</li>
</ul>
<h2>Conclusion</h2>
<p>Elastic Streams revolutionizes observability by transforming logs from a noisy and expensive data source into a primary investigation signal. Through AI-powered capabilities like automatic partitioning, parsing, retention management, and the surfacing of Significant Events, Streams empowers SREs to move beyond data management and directly pinpoint the root cause of issues. By reducing operational complexity, lowering storage costs, and providing complete visibility, Streams ensures that logs, enriched by AI, become the fastest path to resolution by answering the critical question of “why” for observability.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-launch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic to observe GKE Autopilot clusters]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observe-gke-autopilot-clusters</link>
            <guid isPermaLink="false">observe-gke-autopilot-clusters</guid>
            <pubDate>Wed, 15 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[See how deploying the Elastic Agent onto a GKE Autopilot cluster makes observing the cluster’s behavior easy. Kibana integrations make visualizing the behavior a simple addition to your observability dashboards.]]></description>
            <content:encoded><![CDATA[<p>Elastic has formally supported Google Kubernetes Engine (GKE) since January 2020, when Elastic Cloud on Kubernetes was announced. Since then, Google has expanded GKE, with new service offerings and delivery mechanisms. One of those new offerings is GKE Autopilot. Where GKE is a managed Kubernetes environment, GKE Autopilot is a mode of Kubernetes operation where Google manages your cluster configuration, scaling, security, and more. It is production ready and removes many of the challenges associated with tasks like workload management, deployment automation, and scalability rules. Autopilot lets you focus on building and deploying your application while Google manages everything else.</p>
<p>Elastic is committed to supporting Google Kubernetes Engine (GKE) in all of its delivery modes. In October, during the Google Cloud Next ‘22 event, we announced our intention to integrate and certify Elastic Agent on Anthos, Autopilot, Google Distributed Cloud, and more.</p>
<p>Since that event, we have worked together with Google to get the Elastic Agent certified for use on Anthos, but we didn’t stop there.</p>
<p>Today we are happy to <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">announce</a> that we have been certified for operation on GKE Autopilot.</p>
<h2>Hands on with Elastic and GKE Autopilot</h2>
<h3><a href="https://www.elastic.co/observability/kubernetes-monitoring">Kubernetes observability</a> has never been easier</h3>
<p>To show how easy it is to get started with Autopilot and Elastic, let's walk through deploying the Elastic Agent on an Autopilot cluster. I’ll show how easy it is to set up and monitor an Autopilot cluster with the Elastic Agent and observe the cluster’s behavior with Kibana integrations.</p>
<p>One of the main differences between GKE and GKE Autopilot is that Autopilot protects the system namespace “kube-system.” To increase the stability and security of a cluster, Autopilot prevents user space workloads from adding or modifying system pods. The default configuration for Elastic Agent is to install itself into the system namespace. The majority of the changes we will make here are to convince the Elastic Agent to run in a different namespace.</p>
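<p>As an illustration (the namespace name here is arbitrary), the change amounts to pointing the manifest's objects at a user namespace instead of kube-system:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  # The stock manifest targets kube-system, which Autopilot restricts;
  # a user-owned namespace works instead.
  namespace: elastic-agent
</code></pre>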
<h2>Let’s get started with Elastic Stack!</h2>
<p>While writing this article, I used the latest version of Elastic. The best way for you to get started with Elastic Observability is to:</p>
<ol>
<li>Get an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and look at this <a href="https://www.elastic.co/videos/training-how-to-series-cloud">tutorial</a> to help launch your first stack, or</li>
<li><a href="https://www.elastic.co/partners/google-cloud">Launch Elastic Cloud on your Google Account</a></li>
</ol>
<h2>Provisioning an Autopilot cluster and an Elastic stack</h2>
<p>To test the agent, I first deployed the recommended, default GKE Autopilot cluster. Elastic’s GKE integration supports kube-state-metrics (KSM), which increases the number of metrics available for reporting and dashboards. Like the Elastic Agent, KSM defaults to running in the system namespace, so I modified its manifest to work with Autopilot. For my testing, I also deployed a basic Elastic stack on Elastic Cloud in the same Google region as my Autopilot cluster. I used a fresh cluster deployed on Elastic’s managed service (ESS), but the process is the same if you are using an Elastic Cloud subscription purchased through the Google marketplace.</p>
<h2>Adding Elastic Observability to GKE Autopilot</h2>
<p>Because this is a brand new deployment, Elastic suggests adding integrations to it. Let’s add the Kubernetes integration into the new deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-welcome-to-elastic.png" alt="elastic agent GKE autopilot welcome" /></p>
<p>Elastic offers hundreds of integrations; filter the list by typing “kub” into the search bar (1) and then click the Kubernetes integration (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-integration.png" alt="elastic agent GKE autopilot kubernetes integration" /></p>
<p>The Kubernetes integration page gives you an overview of the integration and lets you manage the Kubernetes clusters you want to observe. We haven’t added a cluster yet, so I clicked “Add Kubernetes” to add the first integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes.png" alt="elastic agent GKE autopilot add kubernetes" /></p>
<p>I changed the integration name to reflect the Kubernetes offering type and then clicked “Save and continue” to accept the integration defaults.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes-integration.png" alt="elastic agent GKE autopilot add kubernetes integration" /></p>
<p>At this point, an Agent policy has been created. Now it’s time to install the agent. I clicked on the “Kubernetes” integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-agent-policy-1.png" alt="elastic agent GKE autopilot agent policy" /></p>
<p>Then I selected the “integration policies” tab (1) and clicked “Add agent” (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-agent.png" alt="elastic agent GKE autopilot add agent" /></p>
<p>Finally, I downloaded the full manifest for a standard GKE environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-download-manifest.png" alt="elastic agent GKE autopilot download manifest" /></p>
<p>We won’t be using this manifest directly, but it contains many of the values that we will need to deploy the agent on Autopilot in the next section.</p>
<p>The Elastic stack is ready and waiting for the Autopilot logs, metrics, and events. It’s time to connect Autopilot to this deployment using the Elastic Agent for GKE.</p>
<h2>Connect Autopilot to Elastic</h2>
<p>From the Google cloud terminal, I downloaded and edited the Elastic Agent manifest for GKE Autopilot.</p>
<pre><code class="language-bash">$ curl -L -o elastic-agent-managed-gke-autopilot.yaml \
https://raw.githubusercontent.com/elastic/elastic-agent/autopilotdocumentaton/docs/manifests/elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-cloud-shell-editor.png" alt="elastic agent GKE autopilot cloud shell editor" /></p>
<p>I used the cloud shell editor to configure the manifest for my Autopilot and Elastic clusters. For example, I updated the following:</p>
<pre><code class="language-yaml">containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>I also set the agent image tag to the version of the Elastic stack I installed (8.6.0).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-google-cloud.png" alt="elastic agent GKE autopilot google cloud" /></p>
<p>From the Integration manifest I downloaded earlier, I copied the values for FLEET_URL and FLEET_ENROLLMENT_TOKEN into this YAML file.</p>
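<p>In the manifest, those values land in the Elastic Agent container's environment variables, roughly like this (both values below are placeholders for your own deployment's settings):</p>
<pre><code class="language-yaml">env:
  - name: FLEET_URL
    value: https://my-deployment.fleet.us-central1.gcp.cloud.es.io:443
  - name: FLEET_ENROLLMENT_TOKEN
    value: PASTE-ENROLLMENT-TOKEN-HERE
</code></pre>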
<p>Now it’s time to apply the updated manifest to the Autopilot instance.</p>
<p>Before I commit, I always like to see what’s going to be created (and check for syntax errors) with a dry run.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply --dry-run=&quot;client&quot; -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-dry-run.png" alt="elastic agent GKE autopilot dry run" /></p>
<p>Everything looks good, so I’ll do it for real this time.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-autopilot-cluster.png" alt="elastic agent GKE autopilot cluster" /></p>
<p>After several minutes, metrics will start flowing from the Autopilot cluster directly into the Elastic deployment.</p>
<h2>Adding a workload to the Autopilot cluster</h2>
<p>Observing an Autopilot cluster without a workload is boring, so I deployed a modified version of Google’s <a href="https://github.com/bshetti/opentelemetry-microservices-demo">Hipster Shop</a> (which includes OpenTelemetry reporting):</p>
<pre><code class="language-bash">$ git clone https://github.com/bshetti/opentelemetry-microservices-demo
$ cd opentelemetry-microservices-demo
$ nano ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p>To point the application’s telemetry at our Elastic stack, I changed the exporter type from HTTP (otlphttp/elastic) to gRPC (otlp/elastic) everywhere it appeared. I then replaced OTEL_EXPORTER_OTLP_ENDPOINT with my APM endpoint and OTEL_EXPORTER_OTLP_HEADERS with my APM authorization bearer token.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-terminal-telemetry.png" alt="elastic agent GKE autopilot terminal telemetry" /></p>
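<p>After those edits, the exporter section of otelcollector.yaml looked roughly like this (a sketch; the endpoint and token are placeholders for the values from your own deployment):</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    endpoint: &quot;https://&lt;your-apm-endpoint&gt;:443&quot;
    headers:
      Authorization: &quot;Bearer &lt;your-secret-token&gt;&quot;
</code></pre>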
<p>Then I deployed the Hipster Shop.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/adservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/redis.yaml
$ kubectl create -f ./deploy-with-collector-k8s/cartservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/checkoutservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/currencyservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/emailservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/frontend.yaml
$ kubectl create -f ./deploy-with-collector-k8s/paymentservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/productcatalogservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/recommendationservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/shippingservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/loadgenerator.yaml
</code></pre>
<p>Once all of the shop’s pods were running, I deployed the OpenTelemetry collector.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-deployed-opentelemetry-collector.png" alt="elastic agent GKE autopilot deployed opentelemetry collector" /></p>
<h2>Observe and visualize Autopilot’s metrics</h2>
<p>Now that we have added the Elastic Agent to our Autopilot cluster and added a workload, let's take a look at some of the Kubernetes visualizations the integration provides out of the box.</p>
<p>The “[Metrics Kubernetes] Overview” is a great place to start. It provides a high-level view of the resources used by the cluster and allows me to drill into more specific dashboards that I find interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-visualization.png" alt="elastic agent GKE autopilot create visualization" /></p>
<p>For example, the “[Metrics Kubernetes] Pods” dashboard gives me a high-level view of the pods deployed in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-pod.png" alt="elastic agent GKE autopilot pod" /></p>
<p>The “[Metrics Kubernetes] Volumes” dashboard gives me an in-depth view into how storage is allocated and used in the Autopilot cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-filesystem-information.png" alt="elastic agent GKE autopilot filesystem information" /></p>
<h2>Creating an alert</h2>
<p>From here, I can easily discover patterns in my cluster’s behavior and even create alerts. Here is an example of an alert that notifies me if the main storage volume (called “volume”) exceeds 80% of its allocated space:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-rule-elasticsearch-query.png" alt="elastic agent GKE autopilot create rule" /></p>
<p>With a little work, I created this view from the standard dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" alt="elastic agent GKE autopilot kubernetes dashboard" /></p>
<h2>Conclusion</h2>
<p>Today I have shown how easy it is to monitor, observe, and generate alerts on a GKE Autopilot cluster. To get more information on what is possible, see the official Elastic documentation for <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">Autopilot observability with Elastic Agent</a>.</p>
<h2>Next steps</h2>
<p>If you don’t have Elastic yet, you can get started for free with an <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">Elastic Trial</a> today. Get more from Elastic and Google together with a <a href="https://console.cloud.google.com/marketplace/browse?q=Elastic&amp;utm_source=Elastic&amp;utm_medium=qwiklabs&amp;utm_campaign=Qwiklabs+to+Marketplace">Marketplace subscription</a>. Elastic does more than just integrate with GKE — check out the almost <a href="https://www.elastic.co/integrations">300 integrations</a> that Elastic provides.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's OpenTelemetry SDK for .NET]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-dotnet-applications</guid>
            <pubDate>Tue, 02 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Today, we are excited to announce the alpha release of our new Elastic distribution of the OpenTelemetry SDK for .NET. In this post, we cover a few likely questions you may have about this new distribution and explain how to get started.]]></description>
            <content:encoded><![CDATA[<p>We are thrilled to announce the alpha release of our new <a href="https://github.com/elastic/elastic-otel-dotnet/releases">Elastic® distribution of the OpenTelemetry SDK for .NET</a>. In this post, we cover a few reasonable questions you may have about this new distribution.</p>
<p>Download the <a href="https://www.nuget.org/packages/Elastic.OpenTelemetry">NuGet package</a> today if you want to try out this early access release. We welcome all feedback and suggestions to help us enhance the distribution before its stable release.</p>
<p><a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">Check out our announcement blog post</a> to learn more about OpenTelemetry and our decision to introduce OpenTelemetry distributions.</p>
<h2>The Elastic .NET OpenTelemetry distribution</h2>
<p>With the alpha release of the Elastic distribution of the .NET OpenTelemetry SDK, we are embracing OpenTelemetry as the preferred and recommended choice for instrumenting .NET applications.</p>
<p>In .NET, the runtime base class libraries (BCL) include types designed for native OpenTelemetry instrumentation, such as <a href="https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activity">Activity</a> and <a href="https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.metrics.meter">Meter</a>, making adopting OpenTelemetry-native instrumentation even more convenient.</p>
<p>The current alpha release of our distribution is consciously feature-limited. Our goal is to assess the fitness of the API design and ease of use, laying a solid foundation going forward. We acknowledge that it is likely not suited to all application scenarios, so while we welcome developers installing it to try it out, we don’t currently advise using it for production.</p>
<p>In subsequent releases, we plan to add more features as we move toward feature parity with the existing Elastic APM agent for .NET. Based on user feedback, we will refine the API and move toward a stable release. Until then, we may need to make some breaking API changes to support additional use cases.</p>
<p>The current alpha release supports installation in typical modern workloads such as <a href="https://dotnet.microsoft.com/en-us/apps/aspnet">ASP.NET Core</a> and <a href="https://learn.microsoft.com/en-us/dotnet/core/extensions/workers">worker services</a>. It best supports modern .NET runtimes, .NET 6.0 and later. We’d love to hear about other scenarios you think we should focus on next.</p>
<p>The types we introduce in the distribution support an easy switch from the “vanilla” OpenTelemetry SDK with no (or minimal) code changes. We expect that in most circumstances, merely adding the NuGet package is all that is required to get started.</p>
<p>The initial alpha releases add very little on top of the “vanilla” SDK from OpenTelemetry, but by adopting it early, you can shape its direction. We will deliver valuable enhancements to developers in subsequent releases.</p>
<p>If you’d like to follow the development of the distribution, the code is fully open source and <a href="https://github.com/elastic/elastic-otel-dotnet">available on GitHub</a>. We encourage you to raise issues for bugs or usability pain points you encounter.</p>
<h2>How do I get started?</h2>
<p>Getting started with the Elastic OpenTelemetry distribution is straightforward: simply add a package reference for the Elastic OpenTelemetry NuGet package to your project (csproj) file.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
</code></pre>
<p>After adding the package reference, you can use the Elastic OpenTelemetry distribution in your application. The distribution includes a transitive dependency on the OpenTelemetry SDK, so you do not need to add the OpenTelemetry SDK package to your project. Doing so will cause no harm and may be used to opt into newer SDK versions before the Elastic distribution references them.</p>
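<p>For example, to opt into a newer SDK version than the one the distribution currently references, you could add an explicit reference alongside it (the SDK version shown is illustrative):</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
&lt;PackageReference Include=&quot;OpenTelemetry&quot; Version=&quot;1.7.0&quot; /&gt;
</code></pre>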
<p>The Elastic OpenTelemetry distribution is designed to be easy to use and integrate into your applications, including those that have previously used the OpenTelemetry SDK directly. When the OpenTelemetry SDK is already being used, the only required change is to add the Elastic.OpenTelemetry NuGet package to the project. Doing so will automatically switch to the opinionated configuration provided by the Elastic distribution.</p>
<h3>ASP.NET Core example</h3>
<p>A common requirement is to instrument ASP.NET Core applications based on <strong>Microsoft.Extensions.Hosting</strong> libraries, which provide dependency injection via an <strong>IServiceProvider</strong>.</p>
<p>The OpenTelemetry SDK and the Elastic distribution include extension methods to enable observability features in your application by adding a few lines of code.</p>
<p>This example focuses on adding instrumentation to an ASP.NET Core minimal API application using the Elastic OpenTelemetry distribution. Similar steps can also be applied to instrument other ASP.NET Core workloads and host-based applications such as Worker Services.</p>
<p><em>NOTE: This example assumes that we start with a new</em> <a href="https://learn.microsoft.com/en-us/aspnet/core/tutorials/min-web-api"><em>minimal API project</em></a> <em>created using project templates available with the</em> <a href="https://dotnet.microsoft.com/en-us/download/dotnet/8.0"><em>.NET 8 SDK</em></a><em>. It also uses top-level statements inside a single Program.cs file.</em></p>
<p>Add the <strong>Elastic.OpenTelemetry</strong> package reference to the project (csproj) file.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;Elastic.OpenTelemetry&quot; Version=&quot;1.0.0-alpha.1&quot; /&gt;
</code></pre>
<p>To take advantage of the OpenTelemetry SDK instrumentation for ASP.NET Core, also add the <strong>OpenTelemetry.Instrumentation.AspNetCore</strong> NuGet package.</p>
<pre><code class="language-xml">&lt;PackageReference Include=&quot;OpenTelemetry.Instrumentation.AspNetCore&quot; Version=&quot;1.7.1&quot; /&gt;
</code></pre>
<p>This package includes support to collect instrumentation (traces and metrics) for requests handled by ASP.NET Core endpoints.</p>
<p>Inside the <strong>Program.cs</strong> file of the ASP.NET Core application, add the following two using directives:</p>
<pre><code class="language-csharp">using OpenTelemetry;
using OpenTelemetry.Trace;
</code></pre>
<p>The OpenTelemetry SDK includes extension methods on the <strong>IServiceCollection</strong> to enable and configure the trace, metric, and log providers. The Elastic distribution overrides the default SDK registration, adding several opinionated defaults.</p>
<p>In the minimal API template, the <strong>WebApplicationBuilder</strong> exposes a <strong>Services</strong> property that can be used to register services with the dependency injection container. Ensure that the OpenTelemetry SDK is registered to enable tracing and metrics collection.</p>
<pre><code class="language-csharp">var builder = WebApplication.CreateBuilder(args);

builder.Services
  .AddHttpClient() // &lt;1&gt;
  .AddOpenTelemetry() // &lt;2&gt;
    .WithTracing(t =&gt; t.AddAspNetCoreInstrumentation()); // &lt;3&gt;
</code></pre>
<blockquote>
<p>&lt;1&gt; AddHttpClient registers the IHttpClientFactory service with the dependency injection container. This is <em>not</em> required to enable OpenTelemetry, but the example endpoint will use it to send an HTTP request.</p>
<p>&lt;2&gt; AddOpenTelemetry registers the OpenTelemetry SDK with the dependency injection container. When available, the Elastic distribution will override this to add opinionated defaults.</p>
<p>&lt;3&gt; Configures OpenTelemetry tracing to collect trace data produced by ASP.NET Core.</p>
</blockquote>
<p>With these limited changes to the Program.cs file, the application is now configured to use the OpenTelemetry SDK and the Elastic distribution to collect traces and metrics, which are exported via OTLP.</p>
<p>To demonstrate the tracing capabilities, we will define a single endpoint for the API via the <strong>WebApplication</strong>.</p>
<pre><code class="language-csharp">var app = builder.Build();

app.UseHttpsRedirection();

app.MapGet(&quot;/&quot;, (IHttpClientFactory httpClientFactory) =&gt;
  Api.HandleRoot(httpClientFactory)); // &lt;1&gt;

app.Run();
</code></pre>
<blockquote>
<p>&lt;1&gt; Maps an endpoint that handles requests to the application's root URL path. The handler will be supplied from a static class that we also need to add to the application. It accepts an <strong>IHttpClientFactory</strong> as a parameter, which will be injected from the dependency injection container at runtime and passed as an argument to the <strong>HandleRoot</strong> method.</p>
</blockquote>
<pre><code class="language-csharp">namespace Example.Api
{
  internal static class Api
  {
    public static async Task&lt;IResult&gt; HandleRoot(IHttpClientFactory httpClientFactory)
    {
      using var client = httpClientFactory.CreateClient();

      await Task.Delay(100); // simulate work
      var response = await client.GetAsync(&quot;http://elastic.co&quot;); // &lt;1&gt;
      await Task.Delay(50); // simulate work

      return response.StatusCode == System.Net.HttpStatusCode.OK ? Results.Ok() : Results.StatusCode(500);
    }
  }
}
</code></pre>
<blockquote>
<p>&lt;1&gt; This URL will require two redirects, allowing us to see multiple spans in the trace.</p>
</blockquote>
<p>This static class includes a <strong>HandleRoot</strong> method that matches the signature for the endpoint handler delegate.</p>
<p>After creating an <strong>HttpClient</strong> from the factory, it sends a GET request to the elastic.co website. On either side of the request is a short delay, used here to simulate some business logic being executed. The method returns a suitable status code based on the result of the external HTTP request.</p>
<p>If you’re following along, you will also need to include a using directive for the <strong>Example.Api</strong> namespace in your Program.cs file.</p>
<pre><code class="language-csharp">using Example.Api;
</code></pre>
<p>That is all of the code we require for now. The Elastic distribution will automatically enable the exporting of telemetry signals via the OTLP exporter. The OTLP exporter requires that endpoint(s) be configured. A common mechanism for configuring endpoints is via environment variables.</p>
<p>This demo uses an Elastic Cloud deployment as the destination for our observability data. To retrieve the endpoint information from Kibana® running in Elastic Cloud, navigate to the observability setup guides. Select the OpenTelemetry option to view the configuration details that should be supplied to the application.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/1-apm-agents.png" alt="apm agents image" /></p>
<p>Configure environment variables for the application either in launchSettings.json or in the environment where the application is running. The authorization header bearer token should be stored securely, in user secrets or a suitable key vault system.</p>
<p>At a minimum, we must configure two environment variables:</p>
<ul>
<li>
<p>OTEL_EXPORTER_OTLP_ENDPOINT</p>
</li>
<li>
<p>OTEL_EXPORTER_OTLP_HEADERS</p>
</li>
</ul>
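<p>For local development, these can be set as environment variables before launching the application; the endpoint and token shown here are placeholders for the values from your own deployment:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;your-deployment&gt;.apm.&lt;region&gt;.cloud.es.io&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;your-secret-token&gt;&quot;
</code></pre>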
<p>It is also highly recommended to configure at least a descriptive service name for the application using the OTEL_RESOURCE_ATTRIBUTES environment variable; otherwise, a generic default will be applied. For example:</p>
<pre><code class="language-json">&quot;OTEL_RESOURCE_ATTRIBUTES&quot;: &quot;service.name=minimal-api-example&quot;
</code></pre>
<p>Additional resource tags, such as version, can and should be added as appropriate. You can read more about the options for configuring resource attributes in the <a href="https://opentelemetry.io/docs/languages/net/resources/">OpenTelemetry .NET SDK documentation</a>.</p>
<p>Once configured, run the application and make an HTTP request to its root endpoint. A trace will be generated and exported to the configured OTLP endpoint.</p>
<p>To view the traces, you can use the Elastic APM Kibana UI. From the Kibana home page, go to the Observability area and find the trace under the APM &gt; Traces page. After selecting a suitable time frame and choosing the trace named “GET /,” you will be able to explore one or more trace samples.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/2-trace-sample.png" alt="trace sample" /></p>
<p>The above trace demonstrates the built-in instrumentation collection provided by the OpenTelemetry SDK and the optional <strong>OpenTelemetry.Instrumentation.AspNetCore</strong> package that we added.</p>
<p>It’s important to highlight that we would see a different trace above if we had used the “vanilla” SDK without the Elastic distribution. The HTTP spans that appear in blue in the screenshot would not be shown. By default, the OpenTelemetry SDK does not enable HTTP instrumentation, and it would require additional code to configure the instrumentation of outbound HTTP requests. The Elastic distribution takes the opinion that HTTP spans should be captured and enables this feature by default.</p>
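<p>For comparison, with the “vanilla” SDK, capturing outbound HTTP spans would require referencing the <strong>OpenTelemetry.Instrumentation.Http</strong> package and opting in explicitly, roughly like this (a sketch; this step is not needed when the Elastic distribution is present):</p>
<pre><code class="language-csharp">builder.Services
  .AddOpenTelemetry()
    .WithTracing(t =&gt; t
      .AddAspNetCoreInstrumentation()
      .AddHttpClientInstrumentation()); // requires OpenTelemetry.Instrumentation.Http
</code></pre>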
<p>It is also possible to add application-specific instrumentation to this application. Typically, this would require calling vendor-specific APIs, for example, the <a href="https://www.elastic.co/guide/en/apm/agent/dotnet/current/public-api.html#api-tracer-api">tracer API</a> in Elastic APM Agent. A significant benefit of choosing OpenTelemetry is the capability to use vendor-neutral APIs to instrument code with no vendor lock-in. We can see that in action by updating the <strong>Api</strong> class in the sample.</p>
<pre><code class="language-csharp">internal static class Api
{
  public static string ActivitySourceName = &quot;CustomActivitySource&quot;;
  private static readonly ActivitySource ActivitySource = new(ActivitySourceName);

  public static async Task&lt;IResult&gt; HandleRoot(IHttpClientFactory httpClientFactory)
  {
    using var activity = ActivitySource.StartActivity(&quot;DoingStuff&quot;, ActivityKind.Internal);
    activity?.SetTag(&quot;custom-tag&quot;, &quot;TagValue&quot;);

    using var client = httpClientFactory.CreateClient();

    await Task.Delay(100);
    var response = await client.GetAsync(&quot;http://elastic.co&quot;); // using this URL will require 2 redirects
    await Task.Delay(50);

    if (response.StatusCode == System.Net.HttpStatusCode.OK)
    {
      activity?.SetStatus(ActivityStatusCode.Ok);
      return Results.Ok();
    }

    activity?.SetStatus(ActivityStatusCode.Error);
    return Results.StatusCode(500);
  }
}
</code></pre>
<p>The preceding code snippet defines a private static <strong>ActivitySource</strong> field inside the <strong>Api</strong> class. Inside the <strong>HandleRoot</strong> method, an <strong>Activity</strong> is started from the ActivitySource, and a custom tag is set. The <strong>ActivitySource</strong> and <strong>Activity</strong> types are part of the .NET BCL (base class library) and live in the <strong>System.Diagnostics</strong> namespace. A using directive is required to use them.</p>
<pre><code class="language-csharp">using System.Diagnostics;
</code></pre>
<p>By using the Activity APIs to instrument the above code, we are not tied to any specific vendor APM solution. To learn more about using the .NET APIs to instrument code in an OpenTelemetry native way, visit the <a href="https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing-instrumentation-walkthroughs">Microsoft Learn page covering distributed tracing instrumentation</a>.</p>
<p>The last modification we must apply will instruct OpenTelemetry to observe spans from our application-specific <strong>ActivitySource</strong>. This is achieved by updating the registration of the OpenTelemetry components with the dependency injection framework.</p>
<pre><code class="language-csharp">builder.Services
  .AddHttpClient()
  .AddOpenTelemetry()
    .WithTracing(t =&gt; t
      .AddAspNetCoreInstrumentation()
      .AddSource(Api.ActivitySourceName)); // &lt;1&gt;
</code></pre>
<blockquote>
<p>&lt;1&gt; AddSource subscribes the OpenTelemetry SDK to spans (activities) produced by our application code.</p>
</blockquote>
<p>A new trace will be collected and exported after making these changes, rerunning the application, and requesting the root endpoint. The latest trace can be viewed in the Kibana observability UI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/3-timeline.png" alt="timeline" /></p>
<p>The trace waterfall now includes the internal “DoingStuff” span produced by the instrumentation that we added to our application code. The HTTP spans still appear and are now child spans of the “DoingStuff” span.</p>
<p>We’re working on writing more thorough documentation to be published on elastic.co. Until then, you can find more information in our repository <a href="https://github.com/elastic/elastic-otel-dotnet/blob/main/README.md">readme</a> and the <a href="https://github.com/elastic/elastic-otel-dotnet/tree/main/docs">docs folder</a>.</p>
<p>As the distribution is designed to extend the capabilities of the OpenTelemetry SDK with limited impact on the code used to register the SDK, we recommend visiting the <a href="https://opentelemetry.io/docs/languages/net/">OpenTelemetry documentation for .NET</a> to learn about instrumenting code and more advanced configuration of the SDK.</p>
<h2>What are the next steps?</h2>
<p>We are very excited to expand our support of the OpenTelemetry community and contribute to its future within the .NET ecosystem. This is the compelling next step toward greater collaboration between all observability vendors to provide a rich ecosystem supporting developers on their journey to improved application observability with zero vendor lock-in.</p>
<p>At this stage, we strongly appreciate any feedback the .NET community and our customers can provide to guide the direction of our OpenTelemetry distribution. Please <a href="https://www.nuget.org/packages/Elastic.OpenTelemetry">try out our distribution</a> and engage with us through our <a href="https://github.com/elastic/elastic-otel-dotnet">GitHub repository</a>.</p>
<p>In the coming weeks and months, we will focus on stabilizing the distribution's API and porting Elastic APM Agent features into the distribution. In parallel, we expect to start donating and contributing features to the broader OpenTelemetry community via the <a href="https://github.com/open-telemetry/">OpenTelemetry GitHub repositories</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-dotnet-applications/OTel-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's OpenTelemetry Distribution for Node.js]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-node-js</guid>
            <pubDate>Mon, 06 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic OpenTelemetry Distribution for Node.js. See how easy it is to instrument your Node.js applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We are delighted to announce the alpha release of the <a href="https://github.com/elastic/elastic-otel-node/tree/main/packages/opentelemetry-node#readme">Elastic OpenTelemetry Distribution for Node.js</a>. This distribution is a light wrapper around the OpenTelemetry Node.js SDK that makes it easier to get started using OpenTelemetry to observe your Node.js applications.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start, and we have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a> and <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a>. The Elastic OpenTelemetry Distribution for Node.js is the latest addition.</p>
<h2>Getting started</h2>
<p>To get started with the Elastic OTel Distribution for Node.js (the &quot;distro&quot;), you need only install and load a single npm dependency (@elastic/opentelemetry-node). The distro sets up the collection of traces, metrics, and logs for a number of popular Node.js packages. It sends data to any OTLP endpoint you configure. This could be a standard OTel Collector or, as shown below, an Elastic Observability cloud deployment.</p>
<pre><code class="language-bash">npm install --save @elastic/opentelemetry-node  # (1) install the SDK

# (2) configure it, for example:
export OTEL_EXPORTER_OTLP_ENDPOINT=https://my-deployment.apm.us-west1.gcp.cloud.es.io
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ...REDACTED...&quot;
export OTEL_SERVICE_NAME=my-service

# (3) load and start it
node --require @elastic/opentelemetry-node my-service.js
</code></pre>
<h2>A small example with Express and PostgreSQL</h2>
<p>For a concrete example, let's look at a small Node.js &quot;Shortlinks&quot; service implemented using the <a href="https://expressjs.com/">Express</a> web framework and the <a href="https://node-postgres.com/">pg PostgreSQL client package</a>. This service provides a POST / route for creating short links (a short name for a URL) and a GET /:shortname route for using them.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/recent_shortlinks.png" alt="Recent shortlinks" /></p>
<p>The git repository is <a href="https://github.com/elastic/elastic-otel-node-example">here</a>. The <a href="https://github.com/elastic/elastic-otel-node-example#readme">README</a> shows how to create a free trial Elastic cloud deployment and get the appropriate OTEL_... config settings. Try it out (prerequisites are Docker and Node.js v20 or later):</p>
<pre><code class="language-bash">git clone https://github.com/elastic/elastic-otel-node-example.git
cd elastic-otel-node-example
npm install

cp config.env.template config.env
# Edit OTEL_ values in &quot;config.env&quot; to point to your collection endpoint.

npm run db:start
npm start
</code></pre>
<p>The only steps needed to set up observability are <a href="https://github.com/elastic/elastic-otel-node-example/blob/v1.0.0/package.json#L30-L33">these small changes</a> to the &quot;package.json&quot; file and configuring a few standard OTEL_... environment variables.</p>
<pre><code class="language-json">// ...
  &quot;scripts&quot;: {
    &quot;start&quot;: &quot;node --env-file=./config.env -r @elastic/opentelemetry-node lib/app.js&quot;
  },
  &quot;dependencies&quot;: {
    &quot;@elastic/opentelemetry-node&quot;: &quot;*&quot;,
  // ...
</code></pre>
<p>The result is an observable application using the industry-standard <a href="https://opentelemetry.io/">OpenTelemetry</a> — offering high-quality instrumentation of many popular Node.js libraries, a portable API to avoid vendor lock-in, and an active community.</p>
<p>With Elastic Observability, out-of-the-box benefits include rich trace viewing, service maps, integrated metrics and log analysis, and more. The distro ships <a href="https://github.com/open-telemetry/opentelemetry-js-contrib#readme">host metrics</a>, and Kibana provides a curated service metrics UI. Logs from the popular <a href="https://github.com/winstonjs/winston">Winston</a> and <a href="https://github.com/trentm/node-bunyan">Bunyan</a> logging frameworks are sent out of the box, with support planned for <a href="https://getpino.io">Pino</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/trace_sample.png" alt="trace sample screenshot" /></p>
<h2>What's next?</h2>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>As authors of the Node.js distribution, we are excited to work with the OpenTelemetry JavaScript community and to help make the JS API &amp; SDK a more robust, featureful, and obvious choice for JavaScript observability. Having a distro gives us the flexibility to build features on top of the vanilla OTel SDK. Currently, some advantages of the distro include: single package for installation, easy auto-instrumentation with reasonable default configuration, ESM enabled by default, and automatic logs telemetry sending. We will certainly contribute features upstream to the OTel JavaScript project when possible and will include additional features in the distro when it makes more sense for them to be there.</p>
<p>The Elastic OpenTelemetry Distribution for Node.js is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-node/blob/main/packages/opentelemetry-node/docs/getting-started.mdx">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-node/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-node/issues">the project issue tracker</a> or <a href="https://discuss.elastic.co/tags/c/apm/nodejs">Elastic's Node.js APM Discuss forum</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-node-js/Node-js.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic's distribution of OpenTelemetry PHP]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-php</guid>
            <pubDate>Mon, 16 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic distribution of OpenTelemetry PHP. See how easy it is to instrument your PHP applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We’re excited to introduce the first alpha release of <a href="https://github.com/elastic/elastic-otel-php">Elastic Distribution for OpenTelemetry PHP</a>. In this post, you’ll learn how to easily install and set up monitoring for your PHP applications.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start, and we have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a>, <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a>, <a href="https://github.com/elastic/elastic-otel-node#readme">Node.js</a> and  <a href="https://github.com/elastic/elastic-otel-python#readme">Python</a>. The Elastic distribution of OpenTelemetry PHP is the latest addition.</p>
<h2>Getting started</h2>
<p>To install Elastic Distribution for OpenTelemetry PHP for your application, download the appropriate package for your Linux distribution from <a href="https://github.com/elastic/elastic-otel-php/releases">https://github.com/elastic/elastic-otel-php/releases</a>.</p>
<p>Currently, we provide packages for DEB-, RPM-, and APK-based systems, for x86_64 and ARM64 processors.</p>
<p>For DEB-based systems, run the following command:</p>
<pre><code class="language-bash">dpkg -i &lt;package-file&gt;.deb
</code></pre>
<p>For RPM-based systems, run the following command:</p>
<pre><code class="language-bash">rpm -ivh &lt;package-file&gt;.rpm
</code></pre>
<p>For APK-based systems (Alpine), run the following command:</p>
<pre><code class="language-bash">apk add --allow-untrusted &lt;package-file&gt;.apk
</code></pre>
<p>The package installer automatically detects the installed PHP versions and updates their configuration, so the monitoring extension will be available after the next process restart (processes must be restarted to load the new php.ini configuration).
A few environment variables are needed to configure instrumentation of your services. These mainly concern the destination of your traces and the identification of your service. You’ll also need to provide the authorization headers for authentication with Elastic Observability Cloud and the Elastic Cloud endpoint where the data is sent.</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=&lt;url encoded apikey header value&gt;&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT=&lt;your elastic cloud url&gt;
</code></pre>
<p>where</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: The full URL of the endpoint where data will be sent.</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: A comma-separated list of <code>key=value</code> pairs that will be added to the headers of every request. This is typically used for authentication information.</li>
</ul>
<p>After restarting the application, you should see insights into the monitored applications in Kibana, such as service maps and trace views. In the example below, you can see trace details from the Aimeos application, built with the Laravel framework.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/traces-laravel.png" alt="Aimeos trace example" /></p>
<p>Below is an example of a Slim application using HttpAsyncClient:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/traces-slim.png" alt="Slim and HttpAsyncClient trace example" /></p>
<h2>What's next?</h2>
<p>In this alpha version, we support all modern PHP versions from 8.0 to 8.3 inclusive, providing instrumentation for PHP code, including popular frameworks and libraries like Laravel, Slim, and HttpAsyncClient, as well as native extensions such as PDO. In future releases, we plan to introduce additional features supported by OpenTelemetry, along with Elastic APM-exclusive features like Inferred Spans.</p>
<p>Stay tuned!</p>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>As authors of the PHP distribution, we are excited to work with the OpenTelemetry PHP community and to help make the PHP SDK a more robust, featureful, and obvious choice for PHP observability. Having a distro gives us the flexibility to build features on top of the vanilla OTel SDK. Currently, some advantages of the distro include: fully automatic installation and full auto-instrumentation. We will certainly contribute features upstream to the OTel PHP project when possible and will include additional features in the distro when it makes more sense for them to be there.</p>
<p>The Elastic OpenTelemetry Distribution of PHP is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-php/blob/main/docs/get-started.md">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-php/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-php/issues">the project issue tracker</a> or <a href="https://discuss.elastic.co/tags/c/apm/php">Elastic's PHP APM Discuss forum</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-php/php.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic, OpenLIT and OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-openlit-tracing</guid>
            <pubDate>Thu, 29 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Langchain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with the OpenLIT instrumentation library to ingest traces into Elastic Observability APM.]]></description>
            <content:encoded><![CDATA[<p>The realm of technology is evolving rapidly, and Large Language Models (LLMs) are at the forefront of this transformation. From chat bots to intelligent application copilots, LLMs are becoming increasingly sophisticated. As these applications grow more complex, ensuring their reliability and performance is paramount. This is where observability steps in, aided by OpenTelemetry and Elastic through the <a href="https://github.com/openlit/openlit">OpenLIT</a> instrumentation library. </p>
<p>OpenLIT is an open-source observability and evaluation tool that helps take your LLM apps from playground to debugging to production. With OpenLIT, you can choose from a <a href="https://docs.openlit.io/latest/integrations/introduction">range of integrations</a> (across LLMs, vector databases, frameworks, and GPUs) to start tracking LLM performance, usage, and costs without hassle. In this blog, we will look at tracking OpenAI and LangChain and sending the resulting telemetry to an OpenTelemetry-compatible endpoint like Elastic.</p>
<p><a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic supports OpenTelemetry natively</a>: it can ingest telemetry directly from the application (via the OpenTelemetry SDKs) or through a native OTel Collector, with no special agents needed. Additionally, <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic's EDOT</a> provides a supported set of OTel SDKs and an OTel Collector. In this blog, we connect our application directly to Elastic, without a collector, for simplicity.</p>
<h2>Why Observability Matters for LLM Applications</h2>
<p>Monitoring LLM applications is crucial for several reasons.</p>
<ol>
<li>
<p>It’s vital to keep track of how often LLMs are called, for usage and cost accounting.</p>
</li>
<li>
<p>Latency is important to track since the response time from the model can vary based on the inputs passed to the LLM.</p>
</li>
<li>
<p>Rate limiting is a common challenge, particularly for external LLMs, as applications depend more on these external API calls. When rate limits are hit, it can hinder these applications from performing their essential functions using these LLMs.</p>
</li>
</ol>
<p>By keeping a close eye on these aspects, you can not only save costs but also avoid hitting request limits, ensuring your LLM applications perform optimally.</p>
<h2>What are the signals that you should be looking at?</h2>
<p>Using Large Language Models (LLMs) in applications differs from using traditional machine learning (ML) models. Primarily, LLMs are often accessed through external API calls instead of being run locally or in-house. It is crucial to capture the sequence of events (using traces), especially in a RAG-based application where there can be events before and after LLM usage. Analyzing aggregated data (through metrics), such as request counts, tokens, and cost, is equally important for optimizing performance and managing costs. Here are the key signals to monitor:</p>
<h3>Traces</h3>
<p><strong>Request Metadata</strong>: This is important in the context of LLMs, given the variety of parameters (like temperature and top_p) that can drastically affect both the response quality and the cost. Specific aspects to monitor are:</p>
<ol>
<li>
<p>Temperature: Indicates the level of creativity or randomness desired from the model’s outputs. Varying this parameter can significantly impact the nature of the generated content.</p>
</li>
<li>
<p>top_p: Decides how selective the model is by choosing from a certain percentage of most likely words. A high “top_p” value means the model considers a wider range of words, making the text more varied.</p>
</li>
<li>
<p>Model Name or Version: Essential for tracking over time, as updates to the LLM might affect performance or response characteristics.</p>
</li>
<li>
<p>Prompt Details: The exact inputs sent to the LLM, which, unlike in-house ML models where inputs might be more controlled and homogeneous, can vary wildly and affect output complexity and cost implications.</p>
</li>
</ol>
<p><strong>Response Metadata</strong>: Given the API-based interaction with LLMs, tracking the specifics of the response is key for cost management and quality assessment:</p>
<ol>
<li>
<p>Tokens: Directly impacts cost and is a measure of response length and complexity.</p>
</li>
<li>
<p>Cost: Critical for budgeting, as API-based costs can scale with the number of requests and the complexity of each request.</p>
</li>
<li>
<p>Completion Details: Similar to the prompt details but from the response perspective, providing insights into the model’s output characteristics and potential areas of inefficiency or unexpected cost.</p>
</li>
</ol>
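<p>In practice, an instrumentation library records this request and response metadata as span attributes. The sketch below is illustrative only: it assembles such an attribute set as a plain Python dict, using attribute names from the OpenTelemetry GenAI semantic conventions (the exact names a given library emits may differ, and the values here are made up):</p>
<pre><code class="language-python"># Illustrative only: collect GenAI request/response metadata under
# OTel GenAI semantic-convention attribute names (values are made up).
def genai_span_attributes(request, response):
    return {
        'gen_ai.request.model': request['model'],
        'gen_ai.request.temperature': request['temperature'],
        'gen_ai.request.top_p': request['top_p'],
        'gen_ai.usage.input_tokens': response['input_tokens'],
        'gen_ai.usage.output_tokens': response['output_tokens'],
    }

attrs = genai_span_attributes(
    {'model': 'gpt-4', 'temperature': 0.2, 'top_p': 0.9},
    {'input_tokens': 42, 'output_tokens': 180},
)
</code></pre>
<p>Recording these as span attributes is what makes per-request filtering (by model, by temperature, by token count) possible later in Kibana.</p>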
<h3>Metrics</h3>
<p><strong>Request Volume</strong>: The total number of requests made to the LLM service. This helps in understanding the demand patterns and identifying any anomaly in usage, such as sudden spikes or drops.</p>
<p><strong>Request Duration</strong>: The time it takes for a request to be processed and a response to be received from the LLM. This includes network latency and the time the LLM takes to generate a response, providing insights into the performance and reliability of the LLM service.</p>
<p><strong>Costs and Tokens Counters</strong>: Keeping track of the total cost accrued and tokens consumed over time is essential for budgeting and cost optimization strategies. Monitoring these metrics can alert you to unexpected increases that may indicate inefficient use of the LLM or the need for optimization.</p>
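<p>The cost and token counters described above can be sketched in a few lines. This is a hedged illustration only: the per-1K-token prices below are placeholders, not real pricing, and instrumentation such as OpenLIT maintains counters like these for you:</p>
<pre><code class="language-python"># Running usage counters for LLM calls. Prices are placeholders.
PRICE_PER_1K = {'gpt-4': {'input': 0.03, 'output': 0.06}}

totals = {'input_tokens': 0, 'output_tokens': 0, 'cost_usd': 0.0}

def record_usage(model, input_tokens, output_tokens):
    price = PRICE_PER_1K[model]
    cost = (input_tokens * price['input'] + output_tokens * price['output']) / 1000.0
    totals['input_tokens'] += input_tokens
    totals['output_tokens'] += output_tokens
    totals['cost_usd'] += cost
    return cost

# 1000 input tokens and 500 output tokens at the placeholder prices:
record_usage('gpt-4', 1000, 500)
</code></pre>
<p>Monitoring totals like these over time (for example, as OTel counter metrics) is what lets you alert on unexpected cost increases.</p>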
<h2>Implementing Automatic Instrumentation with OpenLIT</h2>
<p><a href="https://openlit.io/">OpenLIT</a> automates telemetry data capture, simplifying the process for developers. Here’s a step-by-step guide to setting it up:</p>
<p><strong>1. Install the OpenLIT SDK</strong>:</p>
<p>First, you must install the following package: </p>
<pre><code class="language-bash">pip install openlit
</code></pre>
<p><strong>Note:</strong> OpenLIT currently supports Python, a popular language for Generative AI. The team is also working on expanding support to JavaScript soon.</p>
<p><strong>2. Get your Elastic APM Credentials</strong></p>
<ol>
<li>
<p>Sign in to your <a href="https://cloud.elastic.co">Elastic cloud account</a>.</p>
</li>
<li>
<p>Open the side navigation and click on APM under Observability.</p>
</li>
<li>
<p>Make sure the APM Server is running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/LangChainAppOTelAPMsetup.png" alt="LangChainChat App in Elastic APM" /></p>
<ol start="4">
<li>
<p>In the APM Agents section, select OpenTelemetry and jump directly to Step 5 (Configure OpenTelemetry in your application):</p>
</li>
<li>
<p>Copy and save the configuration values for <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code>.</p>
</li>
</ol>
<p><strong>3. Set Environment Variables</strong>:</p>
<p>The OpenTelemetry environment variables for Elastic can be set as follows on Linux (or in the code); see the <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html">Elastic OTel documentation</a>:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;YOUR_ELASTIC_APM_OTLP_URL&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;YOUR_ELASTIC_APM_AUTH&quot;
</code></pre>
<p><strong>Note:</strong> Make sure to replace the space after Bearer with %20:</p>
<p><code>OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20[APIKEY]&quot;</code></p>
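<p>Rather than editing the header by hand, you can compute the URL-encoded value. A quick sketch using Python's standard library (the API key below is a placeholder):</p>
<pre><code class="language-python">from urllib.parse import quote

api_key = 'MY_APIKEY'  # placeholder; use your real Elastic API key
header_value = quote('Bearer ' + api_key)  # the space becomes %20
print('OTEL_EXPORTER_OTLP_HEADERS=Authorization=' + header_value)
</code></pre>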
<p><strong>4. Initialize the SDK</strong>:</p>
<p>You will need to add the following to the LLM application code.</p>
<pre><code class="language-python">import openlit
openlit.init()
</code></pre>
<p>Optionally, you can customize the application name and environment by setting the <code>application_name</code> and <code>environment</code> attributes when initializing OpenLIT in your application. These variables configure the OTel attributes <code>service.name</code> and <code>deployment.environment</code>, respectively. For more details on other configuration settings, check out the <a href="https://github.com/openlit/openlit/tree/main/sdk/python#configuration">OpenLIT GitHub Repository</a>.</p>
<pre><code class="language-python">openlit.init(application_name=&quot;YourAppName&quot;, environment=&quot;Production&quot;)
</code></pre>
<p>The most popular libraries in GenAI are OpenAI (for accessing LLMs) and LangChain (for orchestrating steps). An example instrumentation of a LangChain- and OpenAI-based LLM application looks like this:</p>
<pre><code class="language-python">import getpass
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import openlit 

# Auto-instruments LLM and VectorDB calls, sending OTel traces and metrics to the configured endpoint
openlit.init()

os.environ[&quot;OPENAI_API_KEY&quot;] = getpass.getpass()
model = ChatOpenAI(model=&quot;gpt-4&quot;)
messages = [
    SystemMessage(content=&quot;Translate the following from English into Italian&quot;),
    HumanMessage(content=&quot;hi!&quot;),
]
model.invoke(messages)
</code></pre>
<h2>Visualizing Data with Kibana</h2>
<p>Once your LLM application is instrumented, visualizing the collected data is the next step. Follow the steps below to import a pre-built Kibana dashboard and get started:</p>
<ol>
<li>
<p>Copy the dashboard NDJSON provided <a href="https://docs.openlit.io/latest/connections/elastic#dashboard">here</a> and save it in a file with an extension <code>.ndjson</code>.</p>
</li>
<li>
<p>Log into your Elastic Instance.</p>
</li>
<li>
<p>Go to Stack Management &gt; Saved Objects.</p>
</li>
<li>
<p>Click Import and upload your file containing the dashboard NDJSON.</p>
</li>
<li>
<p>Click Import and you should have the dashboard available.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-1.jpg" alt="Elastic-dashboard-1" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-2.jpg" alt="Elastic-dashboard-2" /></p>
<p>The dashboard provides an in-depth overview of system metrics across key areas: Total Successful Requests, Request Duration Distribution, Request Rates, Usage Cost and Tokens, Top GenAI Models, GenAI Requests by Platform and Environment, and Token Consumption vs. Cost. These metrics collectively help identify peak usage times, latency issues, rate limits, and resource-allocation needs, facilitating performance tuning and cost management. This breakdown aids in understanding LLM performance, ensuring consistent operation across environments, planning budgets, and troubleshooting issues, ultimately optimizing overall system efficiency.</p>
<p>Also, you can see OpenTelemetry Traces from OpenLIT in Elastic APM, letting you look into each LLM request in detail. This setup ensures better system efficiency by helping with model performance checks, smooth running across environments, budget planning, and troubleshooting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-3.jpg" alt="Elastic-dashboard-3" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-dashboard-4.jpg" alt="Elastic-dashboard-4" /></p>
<h2>Conclusion</h2>
<p>Observability is crucial for the efficient operation of LLM applications. OpenTelemetry's open standards and extensive support, combined with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic's APM</a>, <a href="https://www.elastic.co/observability/aiops">AIOps</a>, and <a href="https://www.elastic.co/observability/log-monitoring">analytics</a>, plus <a href="https://docs.openlit.io/latest/introduction">OpenLIT's</a> powerful and easy auto-instrumentation for 20+ GenAI tools from LLMs to vector databases, enable complete visibility into LLM performance.</p>
<p>Hopefully, this provides an easy-to-follow walk-through of instrumenting LangChain with OpenTelemetry and OpenLIT, and shows how easy it is to send traces to Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-openlit-tracing/elastic-openlit-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Observing Langchain applications with Elastic, OpenTelemetry, and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-tracing-langtrace</guid>
            <pubDate>Mon, 02 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Langchain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with Langtrace and ingest it into Elastic Observability APM]]></description>
            <content:encoded><![CDATA[<p>As AI-driven applications become increasingly complex, the need for robust tools to monitor and optimize their performance is more critical than ever. LangChain has rapidly emerged as a crucial framework in the AI development landscape, particularly for building applications powered by large language models (LLMs). As its adoption has soared among developers, the need for effective debugging and performance optimization tools has become increasingly apparent. One such essential tool is the ability to obtain and analyze traces from Langchain applications. Tracing provides invaluable insights into the execution flow, helping developers understand and improve their AI-driven systems. <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic Observability's APM</a> provides an ability to trace your Langchain apps with OpenTelemetry, but you need third-party libraries.</p>
<p>There are several options for tracing LangChain. <a href="https://docs.langtrace.ai/introduction">Langtrace</a> is one such option. Langtrace is <a href="https://github.com/Scale3-Labs/langtrace">open-source</a> observability software that lets you capture, debug, and analyze traces and metrics from all your applications. Langtrace automatically captures traces from LLM APIs/inferences, vector databases, and LLM-based frameworks. It stands out due to its seamless integration with popular LLM frameworks and its ability to provide deep insights into complex AI workflows without requiring extensive manual instrumentation.</p>
<p>Langtrace has an SDK, a lightweight library that can be installed and imported into your project to collect traces. The traces are OpenTelemetry-based and can be exported to Elastic without using a Langtrace API key.</p>
<p>OpenTelemetry (OTel) is now broadly accepted as the industry standard for tracing. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. </p>
<p>Moreover, many LangChain-based applications have multiple components beyond just the LLM interactions, which makes using OpenTelemetry with LangChain essential for end-to-end visibility.</p>
<p>This blog covers how you can use the Langtrace SDK to trace a simple LangChain chat app that connects to Azure OpenAI, performs a search with DuckDuckGo, and exports the resulting traces to Elastic.</p>
<h1>Pre-requisites:</h1>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>, and become familiar with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic’s OpenTelemetry configuration</a></p>
</li>
<li>
<p>Have a LangChain app to instrument</p>
</li>
<li>
<p>Be familiar with using <a href="https://opentelemetry.io/docs/languages/python/libraries/">OpenTelemetry’s Python SDK</a> </p>
</li>
<li>
<p>An account with your favorite LLM provider (Azure OpenAI), with API keys</p>
</li>
<li>
<p>The application we used in this blog, called <code>langchainChat</code>, can be found in the <a href="https://github.com/elastic/observability-examples/tree/main/langchainChat">GitHub observability-examples repository</a>. It is built using Azure OpenAI and DuckDuckGo, but you can easily modify it for your LLM and search tool of choice.</p>
</li>
</ul>
<h1>App Overview and output in Elastic:</h1>
<p>To showcase the combined power of Langtrace and Elastic, we created a simple LangChain app that performs the following steps:</p>
<ol>
<li>
<p>Takes customer input on the command line. (Queries)</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain.</p>
</li>
<li>
<p>Utilizes chain tools to perform a search using DuckDuckGo.</p>
</li>
<li>
<p>The LLM processes the search results and returns the relevant information to the user.</p>
</li>
</ol>
<p>Here is a sample interaction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-cli.png" alt="Chat Interaction" /></p>
<p>Here is what the service view looks like after we ran a few queries.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-overview.png" alt="Service Overview" /></p>
<p>As you can see, Elastic Observability’s APM recognizes the LangChain app and also shows the average latency, throughput, and transactions. Our average latency is 30s since it takes that long for a human to type the query (twice).</p>
<p>You can also select other tabs to see dependencies, errors, metrics, and more. One interesting part of Elastic APM is the ability to have Universal Profiling (eBPF) output analyzed for this service as well. Here is our service’s dependency (Azure OpenAI), with its average latency, throughput, and failed transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-dependency.png" alt="Dependencies" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-dependency-metrics.png" alt="Dependency-metric" /></p>
<p>We see Azure OpenAI takes 4s on average to return results.</p>
<p>If we drill into transactions and look at the trace for our queries on Taylor Swift and Pittsburgh Steelers, we can see both queries and their corresponding spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-trace.png" alt="Trace for two queries" /></p>
<p>In this trace:</p>
<ol>
<li>
<p>The user makes a query</p>
</li>
<li>
<p>Azure OpenAI is called, but it uses a tool (DuckDuckGo) to obtain some results</p>
</li>
<li>
<p>Azure OpenAI reviews and returns a summary to the end user</p>
</li>
<li>
<p>Repeats for another query</p>
</li>
</ol>
<p>We also notice that the other long span (besides Azure OpenAI) is DuckDuckGo (~1000ms). We can look at the span individually and review the data:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-tools-span.png" alt="Span details" /></p>
<h1>Configuration:</h1>
<p>How do we make all this show up in Elastic? Let's go over the steps:</p>
<h2>OpenTelemetry Configuration</h2>
<p>To leverage the full capabilities of OpenTelemetry with Langtrace and Elastic, we need to configure the SDK to generate traces and properly set up Elastic’s endpoint and authorization. Detailed instructions can be found in the <a href="https://opentelemetry.io/docs/zero-code/python/#setup">OpenTelemetry Auto-Instrumentation setup documentation</a>.</p>
<h3>OpenTelemetry Environment variables:</h3>
<p>For Elastic, you can set the following OpenTelemetry environment variables either in your Linux/Mac environment or directly in the code:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=12345.apm.us-west-2.aws.cloud.es.io:443
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20ZZZZZZZ&quot;
OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=langchainChat,service.version=1.0,deployment.environment=production&quot;
</code></pre>
<p>In this setup:</p>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong> is configured to send traces to Elastic.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong> provides the necessary authorization for the Elastic APM server.</p>
</li>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong> define key attributes like the service name, version, and deployment environment.</p>
</li>
</ul>
<p>These values can be easily obtained from Elastic’s APM configuration screen under the OpenTelemetry section.</p>
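<p>If you prefer to set these values in code rather than in the shell, a minimal sketch using only Python's standard library (the endpoint, token, and attribute values below are the placeholders from this post, not real credentials):</p>

```python
import os

# Placeholder values -- substitute the OTLP endpoint and secret token
# shown in your own Elastic APM OpenTelemetry setup screen.
os.environ.setdefault(
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "12345.apm.us-west-2.aws.cloud.es.io:443",
)
os.environ.setdefault(
    "OTEL_EXPORTER_OTLP_HEADERS",
    "Authorization=Bearer%20ZZZZZZZ",
)
os.environ.setdefault(
    "OTEL_RESOURCE_ATTRIBUTES",
    "service.name=langchainChat,service.version=1.0,deployment.environment=production",
)
```

<p>These assignments must run before the OpenTelemetry SDK initializes, since the exporter reads the variables at startup.</p>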
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-OTelAPMsetup.png" alt="Span details" /></p>
<p><strong>Note: No agent is required; the OTLP trace messages are sent directly to Elastic’s APM server, simplifying the setup process.</strong></p>
<h2>Langtrace Library:</h2>
<p>OpenTelemetry's auto-instrumentation can be extended to trace additional frameworks using instrumentation packages. For this blog post, you will need to install the Langtrace Python SDK:</p>
<pre><code class="language-bash">pip install langtrace-python-sdk
</code></pre>
<p>After installation, you can add the following code to your project:</p>
<pre><code class="language-python">from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from langtrace_python_sdk import langtrace, with_langtrace_root_span

# Initialize Langtrace with an OTLP exporter so spans are sent to the
# endpoint configured above (see the Langtrace docs for init options)
langtrace.init(custom_remote_exporter=OTLPSpanExporter())
</code></pre>
<h2>Instrumentation:</h2>
<p>Once the necessary libraries are installed and the environment variables are configured, you can use auto-instrumentation to trace your application. For example, run the following command to instrument your LangChain application with Elastic:</p>
<pre><code class="language-bash">opentelemetry-instrument python langtrace-elastic-demo.py
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/langchainchat-trace.png" alt="Trace for two queries" /></p>
<p>The Langtrace OpenTelemetry library correctly captures the flow with minimal manual instrumentation, apart from integrating the OpenTelemetry library. Additionally, the LLM spans captured by Langtrace include useful metadata such as token counts, model hyper-parameter settings, etc. Note that the generated spans follow the OTel GenAI semantic conventions described <a href="https://opentelemetry.io/docs/specs/semconv/attributes-registry/gen-ai/">here</a>.</p>
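<p>To illustrate those GenAI semantics, here is a hand-built sample of the kind of attributes such a span can carry; the attribute names come from the OTel GenAI registry linked above, while the values are made up for illustration:</p>

```python
# Illustrative span attributes following the OTel GenAI semantic
# conventions; values are invented, not captured from a real trace.
llm_span_attributes = {
    "gen_ai.request.model": "gpt-4",      # model the app asked for
    "gen_ai.request.temperature": 0.7,    # hyper-parameter recorded on the span
    "gen_ai.usage.input_tokens": 512,     # tokens in the prompt
    "gen_ai.usage.output_tokens": 128,    # tokens in the completion
}

# Token-usage attributes can be filtered out, e.g. to estimate cost:
usage = {k: v for k, v in llm_span_attributes.items() if k.startswith("gen_ai.usage.")}
```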
<p>In summary, the instrumentation process involves:</p>
<ol>
<li>
<p>Capturing user input (queries) from the command line.</p>
</li>
<li>
<p>Sending these queries to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Utilizing chain tools, such as DuckDuckGo, to perform searches.</p>
</li>
<li>
<p>Processing the results with the LLM and returning the relevant information to the user.</p>
</li>
</ol>
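<p>Stripped of the framework details, the loop above can be sketched in plain Python; the two stub functions below are hypothetical stand-ins for LangChain's Azure OpenAI model and the DuckDuckGo search tool:</p>

```python
# Hypothetical stubs -- in the real app these are LangChain's Azure
# OpenAI chat model and the DuckDuckGo search tool.
def search_duckduckgo(query: str) -> str:
    return f"search results for: {query}"

def ask_llm(query: str, tool_output: str) -> str:
    return f"summary of [{tool_output}] answering: {query}"

def handle_query(query: str) -> str:
    # Steps 1-3: the query arrives, and the LLM decides it needs fresh
    # data, so the search tool runs first.
    results = search_duckduckgo(query)
    # Step 4: the LLM summarizes the tool output for the user.
    return ask_llm(query, results)

print(handle_query("latest Pittsburgh Steelers news"))
```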
<h1>Conclusion</h1>
<p>By combining the power of <a href="https://langtrace.ai/">Langtrace</a> with Elastic, developers can achieve unparalleled visibility into their LangChain applications, ensuring optimized performance and quicker debugging. This powerful combination simplifies the complex task of monitoring AI-driven systems, enabling you to focus on what truly matters—delivering value to your users. Throughout this blog, we've covered the following essential steps and concepts:</p>
<ul>
<li>
<p>How to instrument LangChain with OpenTelemetry and Langtrace</p>
</li>
<li>
<p>How to properly initialize OpenTelemetry and add a custom span</p>
</li>
<li>
<p>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</p>
</li>
<li>
<p>How to view and analyze traces in Elastic Observability APM</p>
</li>
</ul>
<p>These steps provide a clear and actionable guide for developers looking to integrate robust tracing capabilities into their LangChain applications.</p>
<p>We hope this guide makes understanding and implementing OpenTelemetry tracing for LangChain simple, ensuring seamless integration with Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
<li>
<p><a href="https://docs.langtrace.ai/supported-integrations/observability-tools/elastic">Elastic APM - Langtrace AI Docs</a> </p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing-langtrace/elastic-langtrace.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing LangChain apps with Elastic, OpenLLMetry, and OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing</link>
            <guid isPermaLink="false">elastic-opentelemetry-langchain-tracing</guid>
            <pubDate>Fri, 02 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[LangChain applications are growing in use. The ability to build out RAG-based applications, simple AI Assistants, and more is becoming the norm. Observing these applications is even harder. Given the various options that are out there, this blog shows how to use OpenTelemetry instrumentation with OpenLLMetry and ingest it into Elastic Observability APM]]></description>
            <content:encoded><![CDATA[<p>LangChain has rapidly emerged as a crucial framework in the AI development landscape, particularly for building applications powered by large language models (LLMs). As its adoption has soared among developers, the need for effective debugging and performance optimization tools has become increasingly apparent. One such essential tool is the ability to obtain and analyze traces from LangChain applications. Tracing provides invaluable insights into the execution flow, helping developers understand and improve their AI-driven systems. </p>
<p>There are several options for tracing LangChain. One is LangSmith, which is ideal for detailed tracing and a complete breakdown of requests to large language models (LLMs); however, it is specific to LangChain. OpenTelemetry (OTel) is now broadly accepted as the industry standard for tracing. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers. </p>
<p>Moreover, many LangChain-based applications will have multiple components beyond just LLM interactions, so using OpenTelemetry with LangChain is essential. OpenLLMetry is an available option for tracing LangChain apps in addition to LangSmith.</p>
<p>This blog will show how you can get LangChain tracing into Elastic using the OpenLLMetry library <code>opentelemetry-instrumentation-langchain</code>.</p>
<h1>Pre-requisites:<a id="pre-requisites"></a></h1>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a>, and become familiar with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic’s OpenTelemetry configuration</a></p>
</li>
<li>
<p>Have a LangChain app to instrument</p>
</li>
<li>
<p>Be familiar with using <a href="https://opentelemetry.io/docs/languages/python/libraries/">OpenTelemetry’s Python SDK</a> </p>
</li>
<li>
<p>An account on your favorite LLM, with API keys</p>
</li>
</ul>
<h1>Overview</h1>
<p>To highlight tracing, I created a simple LangChain app that does the following:</p>
<ol>
<li>
<p>Takes user input (queries) on the command line.</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Chain tools are set up to use search with Tavily.</p>
</li>
<li>
<p>The LLM processes the output and returns the relevant information to the user.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppCLI.png" alt="Chat Interaction" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppInAPM.png" alt="LangChainChat App in Elastic APM" /></p>
<p>As you can see, Elastic Observability’s APM recognizes the LangChain app and also shows the full trace (done with manual instrumentation):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAutoIntrument.png" alt="LangChainChat App in Elastic APM" /></p>
<p>As the above image shows:</p>
<ol>
<li>The user makes a query</li>
<li>Azure OpenAI is called, but it uses a tool (Tavily) to obtain some results</li>
<li>Azure OpenAI reviews and returns a summary to the end user</li>
</ol>
<p>The code was manually instrumented, but auto-instrumentation can also be used.</p>
<h1>OpenTelemetry Configuration<a id="opentelemetry-configuration"></a></h1>
<p>In using OpenTelemetry, we need to configure the SDK to generate traces and configure Elastic’s endpoint and authorization. Instructions can be found in <a href="https://opentelemetry.io/docs/zero-code/python/#setup">OpenTelemetry Auto-Instrumentation setup documentation</a>.</p>
<h2>OpenTelemetry Environment variables:<a id="opentelemetry-environment-variables"></a></h2>
<p>The OpenTelemetry environment variables for Elastic can be set as follows on Linux (or in the code):</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=12345.apm.us-west-2.aws.cloud.es.io:443
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20ZZZZZZZ&quot;
OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=langchainChat,service.version=1.0,deployment.environment=production&quot;
</code></pre>
<p>As you can see, <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is set to Elastic, and the corresponding authorization header is also provided. These can be easily obtained from Elastic’s APM configuration screen under the OpenTelemetry section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppOTelAPMsetup.png" alt="LangChainChat App in Elastic APM" /></p>
<p><strong>Note: No agent is needed; we simply send the OTLP trace messages directly to Elastic’s APM server.</strong> </p>
<h2>OpenLLMetry Library:<a id="openllmetry-library"></a></h2>
<p>OpenTelemetry's auto-instrumentation can be extended to trace other frameworks via instrumentation packages.</p>
<p>First, you must install the following package: </p>
<p><code>pip install opentelemetry-instrumentation-langchain</code></p>
<p>This library was developed by OpenLLMetry. </p>
<p>Then you will need to add the following to the code.</p>
<pre><code class="language-python">from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()
</code></pre>
<h2>Instrumentation<a id="instrumentation"></a></h2>
<p>Once the libraries are added and the environment variables are set, you can auto-instrument the application with the following command:</p>
<pre><code class="language-bash">opentelemetry-instrument python tavilyAzureApp.py
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAutoIntrument.png" alt="LangChainChat App in Elastic APM" /></p>
<p>The OpenLLMetry library captures the flow correctly with no manual instrumentation beyond adding the library itself:</p>
<ol>
<li>
<p>Takes user input (queries) on the command line.</p>
</li>
<li>
<p>Sends these to the Azure OpenAI LLM via a LangChain chain.</p>
</li>
<li>
<p>Chain tools are set up to use search with Tavily.</p>
</li>
<li>
<p>The LLM processes the output and returns the relevant information to the user.</p>
</li>
</ol>
<h3>Manual instrumentation<a id="manual-instrumentation"></a></h3>
<p>If you want more details out of the application, you will need to instrument it manually. To get more traces, follow my <a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">Python instrumentation guide</a>, which walks you through setting up the necessary OpenTelemetry pieces. Additionally, you can look at the <a href="https://opentelemetry.io/docs/languages/python/instrumentation/">OTel documentation for instrumenting in Python</a>.</p>
<p>Note that the env variables <code>OTEL_EXPORTER_OTLP_HEADERS</code> and <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> are set as noted in the section above. You can also set up the <code>OTEL_RESOURCE_ATTRIBUTES</code>.</p>
<p>Once you follow the steps in either guide and initialize the tracer, you essentially just add a span wherever you want more detail. In the example below, only one line of code is added for span creation.</p>
<p>Note the placement of <code>with tracer.start_as_current_span(&quot;getting user query&quot;) as span:</code> below.</p>
<pre><code class="language-python">from opentelemetry import trace
import asyncio

# `chain` is the LangChain chain constructed earlier in the app

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(&quot;newsQuery&quot;)

async def chat_interface():
    print(&quot;Welcome to the AI Chat Interface!&quot;)
    print(&quot;Type 'quit' to exit the chat.&quot;)
    
    with tracer.start_as_current_span(&quot;getting user query&quot;) as span:
        while True:
            user_input = input(&quot;\nYou: &quot;).strip()
            
            if user_input.lower() == 'quit':
                print(&quot;Thank you for chatting. Goodbye!&quot;)
                break
        
            print(&quot;AI: Thinking...&quot;)
            try:
                result = await chain.ainvoke({&quot;query&quot;: user_input})
                print(f&quot;AI: {result.content}&quot;)
            except Exception as e:
                print(f&quot;An error occurred: {str(e)}&quot;)


if __name__ == &quot;__main__&quot;:
    asyncio.run(chat_interface())
</code></pre>
<p>As you can see, with manual instrumentation, we get the following trace:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainAppManualTrace.png" alt="LangChainChat App in Elastic APM" /></p>
<p>This calls out when we enter our query function, <code>async def chat_interface()</code>.</p>
<h1>Conclusion<a id="conclusion"></a></h1>
<p>In this blog, we discussed the following:</p>
<ul>
<li>
<p>How to manually instrument LangChain with OpenTelemetry</p>
</li>
<li>
<p>How to properly initialize OpenTelemetry and add a custom span</p>
</li>
<li>
<p>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</p>
</li>
<li>
<p>How to view traces in Elastic Observability APM</p>
</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting LangChain with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<p><strong>Additional resources for OpenTelemetry with Elastic:</strong></p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></p>
</li>
<li>
<p>Instrumentation resources:</p>
<ul>
<li>
<p>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual instrumentation </a></p>
</li>
<li>
<p>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual instrumentation</a></p>
</li>
<li>
<p>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual instrumentation</a></p>
</li>
</ul>
</li>
</ul>
<p>Also log into <a href="https://cloud.elastic.co">cloud.elastic.co</a> to try out Elastic with a free trial.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-langchain-tracing/LangChainBlogMainImage.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Unlock possibilities with native OpenTelemetry: prioritize reliability, not proprietary limitations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-native-kubernetes-observability</link>
            <guid isPermaLink="false">elastic-opentelemetry-native-kubernetes-observability</guid>
            <pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic now supports Elastic Distributions of OpenTelemetry (EDOT) deployment and management on Kubernetes, using OTel Operator. SREs can now access out-of the-box configurations and dashboards designed to streamline collector deployment, application auto-instrumentation and lifecycle management with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is emerging as the standard for data ingestion since it delivers a vendor-agnostic way to ingest data across all telemetry signals. Elastic Observability is leading the OTel evolution with the following announcements:</p>
<ul>
<li>
<p><strong>Native OTel Integrity:</strong> Elastic is now 100% OTel-native, retaining OTel data natively without requiring data translation. This eliminates the need for SREs to handle tedious schema conversions and develop customized views. All Elastic Observability capabilities—such as entity discovery, entity-centric insights, APM, infrastructure monitoring, and AI-driven issue analysis—now seamlessly work with native OTel data.</p>
</li>
<li>
<p><strong>Powerful end-to-end OTel-based Kubernetes observability with</strong> <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry"><strong>Elastic Distributions of OpenTelemetry (EDOT)</strong></a><strong>:</strong> Elastic now supports EDOT deployment and management on Kubernetes via the OTel Operator, enabling streamlined EDOT Collector deployment, application auto-instrumentation, and lifecycle management. With out-of-the-box OTel-based Kubernetes integration and dashboards, SREs gain instant, real-time visibility into cluster and application metrics, logs, and traces—with no manual configuration needed.</p>
</li>
</ul>
<p>For organizations, it signals our commitment to open standards, streamlined data collection, and delivering insights from native OpenTelemetry data. Bring the power of Elastic Observability to your Kubernetes and OpenTelemetry deployments for maximum visibility and performance. </p>
<h1>Fully native OTel architecture with in-depth data analysis</h1>
<p>Elastic’s OpenTelemetry-first architecture is 100% OTel-native, fully retaining the OTel data model, including OTel Semantic Conventions and Resource attributes, so your observability data remains in OpenTelemetry standards. OTel data in Elastic is also backward compatible with the Elastic Common Schema (ECS).</p>
<p>SREs now gain a holistic view of resources, as Elastic accurately identifies entities through OTel resource attributes. For example, in a Kubernetes environment, Elastic identifies containers, hosts, and services and connects these entities to logs, metrics, and traces.</p>
<p>Once OTel data is in Elastic’s scalable vector datastore, Elastic’s capabilities such as the AI Assistant, zero-config machine learning-based anomaly detection, pattern analysis, and latency correlation empower SREs to quickly analyze and pinpoint potential issues in production environments.</p>
<h1>Kubernetes insights with Elastic Distributions of OpenTelemetry (EDOT)</h1>
<p>EDOT reduces manual effort through automated onboarding and pre-configured dashboards. With EDOT and OpenTelemetry, Elastic makes Kubernetes monitoring straightforward and accessible for organizations of any size.</p>
<p>EDOT, paired with Elasticsearch, enables storage for all signal types—logs, metrics, traces, and soon profiling—while maintaining essential resource attributes and semantic conventions.</p>
<p>Elastic’s OpenTelemetry-native solution enables customers to quickly extract insights from their data rather than manage complex infrastructure to ingest data. Elastic automates the deployment and configuration of observability components to deliver a user experience focused on ease and scalability, making it well-suited for large-scale environments and diverse industry needs.</p>
<p>Let’s take a look at how Elastic’s EDOT enables visibility into Kubernetes environments.</p>
<h2>1. Simple 3-step OTel ingest with lifecycle management and auto-instrumentation </h2>
<p>Elastic leverages the upstream OpenTelemetry Operator to automate its EDOT lifecycle management—including deployment, scaling, and updates—allowing customers to focus on visibility into their Kubernetes infrastructure and applications instead of their observability infrastructure for data collection.</p>
<p>The Operator integrates with the EDOT Collector and language SDKs to provide a consistent, vendor-agnostic experience. For instance, when customers deploy a new application, they don’t need to manually configure instrumentation for various languages; the OpenTelemetry Operator manages this through auto-instrumentation, as supported by the upstream OpenTelemetry project.</p>
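<p>For illustration, the upstream Operator drives auto-instrumentation through an <code>Instrumentation</code> custom resource; a minimal sketch (the resource name and collector endpoint are placeholders, not an Elastic-specific manifest):</p>

```yaml
# Sketch of an upstream OpenTelemetry Operator Instrumentation resource.
# The endpoint is a placeholder for your in-cluster collector service.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4318
  propagators:
    - tracecontext
    - baggage
```

<p>A workload then opts in with a pod-template annotation such as <code>instrumentation.opentelemetry.io/inject-python: "true"</code>, and the Operator injects the language SDK automatically.</p>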
<p>This integration simplifies observability by ensuring consistent application instrumentation across the Kubernetes environment. Elastic’s collaboration with the upstream OpenTelemetry project strengthens this automation, enabling users to benefit from the latest updates and improvements in the OpenTelemetry ecosystem. By relying on open source tools like the OpenTelemetry Operator, Elastic ensures that its solutions stay aligned with the latest advancements in the OpenTelemetry project, reinforcing its commitment to open standards and community-driven development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/unified-otel-based-k8s-experience.png" alt="Unified OTel-based Kubernetes Experience" /></p>
<p>The diagram above shows how the operator can deploy multiple OTel collectors, helping SREs deploy individual EDOT Collectors for specific applications and infrastructure. This configuration improves availability for OTel ingest and the telemetry is sent directly to Elasticsearch servers via OTLP.</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Check out our recent blog on how to set this up</a>.</p>
<h2>2. Out-of-the-box OTel-based Kubernetes integration with dashboards</h2>
<p>Elastic delivers an OTel-based Kubernetes configuration for the OTel collector by packaging all necessary receivers, processors, and configurations for Kubernetes observability. This enables users to automatically collect, process, and analyze Kubernetes metrics, logs, and traces without the need to configure each component individually.</p>
<p>The OpenTelemetry Kubernetes Collector components provide essential building blocks, including receivers like the Kubernetes Receiver for cluster metrics, Kubeletstats Receiver for detailed node and container metrics, along with processors for data transformation and enrichment. By packaging these components, Elastic offers a turnkey solution that simplifies Kubernetes observability and eliminates the need for users to set up and configure individual collectors or processors.</p>
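<p>As a rough idea of what such a packaged configuration bundles together, here is an illustrative collector fragment; the receiver and processor names are upstream components, but the exact EDOT configuration ships with the integration and differs in detail (the Elasticsearch endpoint is a placeholder):</p>

```yaml
# Illustrative OTel Collector config for Kubernetes metrics (not the
# shipped EDOT configuration).
receivers:
  k8s_cluster: {}              # cluster-level metrics: pods, nodes, deployments
  kubeletstats:
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250   # node and container metrics
processors:
  k8sattributes: {}            # enrich telemetry with Kubernetes metadata
exporters:
  otlp:
    endpoint: https://my-deployment.apm.us-west-2.aws.cloud.es.io:443
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, kubeletstats]
      processors: [k8sattributes]
      exporters: [otlp]
```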
<p>This pre-packaged approach, which includes <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes_otel">OTel-native Kibana assets</a> such as dashboards, allows users to focus on analyzing their observability data rather than managing configuration details. Elastic’s Unified OpenTelemetry Experience ensures that users can harness OpenTelemetry’s full potential without needing deep expertise. Whether you’re monitoring resource usage, container health, or API server metrics, users gain comprehensive observability through EDOT.</p>
<p>For more details on OpenTelemetry Kubernetes Collector components, visit<a href="https://opentelemetry.io/docs/kubernetes/collector/components/"> OpenTelemetry Collector Components</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/otel-based-k8s-dashboard.png" alt="OTel-based Kubernetes Dashboard" /></p>
<h2>3. Streamlined ingest architecture with OTel data and Elasticsearch</h2>
<p>Elastic’s ingest architecture minimizes infrastructure overhead by enabling users to forward trace data directly into Elasticsearch with the EDOT Collector, removing the need for the Elastic APM server. This approach:</p>
<ul>
<li>
<p>Reduces the costs and complexity associated with maintaining additional infrastructure, allowing users to deploy, scale, and manage their observability solutions with fewer resources.</p>
</li>
<li>
<p>Allows all OTel data, metrics, logs, and traces to be ingested and stored in Elastic’s singular vector database store enabling further analysis with Elastic’s AI-driven capabilities.</p>
</li>
</ul>
<p>SREs can now reduce operational burdens while also gaining high performance analytics and observability insights provided by Elastic.</p>
<h1>Elastic’s ongoing commitment to open source and OpenTelemetry</h1>
<p>With <a href="https://www.elastic.co/blog/elasticsearch-is-open-source-again">Elasticsearch fully open source once again</a> under the AGPL license, Elastic reinforces its deep commitment to open standards and the open source community. This aligns with Elastic’s OpenTelemetry-first approach to observability, where Elastic Distributions of OpenTelemetry (EDOT) streamline OTel ingestion and schema auto-detection, providing real-time insights for Kubernetes and application telemetry.</p>
<p>As users increasingly adopt OTel as their schema and data collection architecture for observability, Elastic’s Distribution of OpenTelemetry (EDOT), currently in tech preview, enhances standard OpenTelemetry capabilities and improves troubleshooting while also serving as a commercially supported OTel distribution. EDOT, together with Elastic’s recent contributions of the Elastic Profiling Agent and Elastic Common Schema (ECS) to OpenTelemetry, reinforces Elastic’s commitment to establishing OpenTelemetry as the industry standard.</p>
<p>Customers can now embrace open standards and enjoy the advantages of an open, extensible platform that integrates seamlessly with their environment. The end result? Reduced costs, greater visibility, and vendor independence.</p>
<h1>Getting hands-on with Elastic Observability and EDOT</h1>
<p>Ready to try out the OTel Operator with EDOT collector and SDKs to see how Elastic utilizes ingested OTel data in APM, Discover, Analysis, and out-of-the-box dashboards? </p>
<ul>
<li>
<p><a href="https://cloud.elastic.co/">Get an account on Elastic Cloud</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Learn about Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">Utilize the OpenTelemetry Demo with EDOT</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Understand how you can monitor Kubernetes with EDOT</a></p>
</li>
<li>
<p><a href="https://github.com/elastic/opentelemetry">Utilize the EDOT Operator </a>and the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">EDOT OTel collector</a></p>
</li>
</ul>
<p>If you have your own application and want to configure EDOT auto-instrumentation for it, read the following blogs on Go, Java, PHP, and Python:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic OpenTelemetry Distribution for Python</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/Kubecon-main-blog.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Instrumenting your OpenAI-powered Python, Node.js, and Java Applications with EDOT]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai</link>
            <guid isPermaLink="false">elastic-opentelemetry-openai</guid>
            <pubDate>Thu, 23 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce OpenAI support in our Python, Node.js and Java EDOT SDKs. These add logs, metrics and tracing to applications that use OpenAI compatible services without any code change.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Last year, <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">we announced Elastic Distribution of OpenTelemetry</a> (a.k.a. EDOT) language SDKs, which collect logs, traces, and metrics from applications. At the time, we didn’t yet support Large Language Model (LLM) providers such as OpenAI, which limited the insight developers had into Generative AI (GenAI) applications.</p>
<p>In a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">prior post</a>, we reviewed LLM observability focus areas, such as token usage, chat latency, and knowing which tools (like DuckDuckGo) your application uses. With the right logs, traces, and metrics, developers can answer questions like &quot;Which version of a model generated this response?&quot; or &quot;What was the exact chat prompt created by my RAG application?&quot;</p>
<p>In the last six months, Elastic, alongside others in the OpenTelemetry community, invested a lot of energy in shared specifications for these areas, including code to collect LLM-related logs, metrics, and traces. Our goal was to extend the zero-code (agent) approach EDOT brings to GenAI use cases.</p>
<p>Today, we announce our first GenAI instrumentation capability in the EDOT language SDKs: OpenAI. Below, you’ll see how to observe GenAI applications using our Python, Node.js and Java EDOT SDKs.</p>
<h2>Example application</h2>
<p>Many of us may be familiar with <a href="https://chatgpt.com/">ChatGPT</a>, which is a frontend for OpenAI’s GPT model family. Using it, you can ask a question, and the assistant may answer correctly depending on what you ask and the text the LLM was trained on.</p>
<p>Here’s an example of an esoteric question answered by ChatGPT:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/chatgpt-screenshot.png" alt="ChatGPT answer" /></p>
<p>Our example application will simply ask this predefined question and print the result. We’ll write it in three languages: Python, JavaScript and Java.</p>
<p>We’ll execute each with a &quot;zero code&quot; (agent) approach, so that logs, metrics and traces are captured and visible in an Elastic Stack configured with Kibana and APM server. If you don’t have a stack running, use <a href="https://github.com/elastic/elasticsearch-labs/tree/main/docker">instructions from Elasticsearch Labs</a> to set one up.</p>
<p>Regardless of programming language, three variables are needed: the OpenAI API key, the location of your Elastic APM server, and the service name of the application. You’ll write these to a file named <code>.env</code>.</p>
<pre><code>OPENAI_API_KEY=sk-YOUR_API_KEY
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:8200
OTEL_SERVICE_NAME=openai-example
</code></pre>
<p>By default, the instrumentation does not capture the content sent to the OpenAI API in the GenAI log events. If you want to capture it, add the following:</p>
<pre><code>OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
</code></pre>
<p>Each time the application runs, it sends logs, traces, and metrics to the APM server. You can find them in Kibana by querying for the application &quot;openai-example&quot;:</p>
<p><a href="http://localhost:5601/app/apm/services/openai-example/transactions">http://localhost:5601/app/apm/services/openai-example/transactions</a></p>
<p>When you choose a trace, you’ll see the LLM request made by the OpenAI SDK, and HTTP traffic caused by it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-transaction-timeline.png" alt="Kibana transaction timeline" /></p>
<p>Select the logs tab to see the exact request and response to OpenAI. This data is critical for Q/A and evaluation use cases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-transaction-logs.png" alt="Kibana transaction logs" /></p>
<p>You can also go to the Metrics Explorer and make a graph of &quot;gen_ai.client.token.usage&quot; or &quot;gen_ai.client.operation.duration&quot; over all the times you ran the application:</p>
<p><a href="http://localhost:5601/app/metrics/explorer">http://localhost:5601/app/metrics/explorer</a></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/kibana-metrics-explorer.png" alt="Kibana Metrics Explorer" /></p>
<p>Read on to see exactly how this application is written and run in Python, Java, and Node.js. Those already using our EDOT language SDKs will be familiar with how this works.</p>
<h2>Python</h2>
<p>Assuming you have Python installed, the first step is to set up a virtual environment and install the required packages: the OpenAI client, a helper tool to read the <code>.env</code> file, and our <a href="https://github.com/elastic/elastic-otel-python">EDOT Python</a> package:</p>
<pre><code class="language-bash">python3 -m venv .venv
source .venv/bin/activate
pip install openai &quot;python-dotenv[cli]&quot; elastic-opentelemetry
</code></pre>
<p>Next, run <code>edot-bootstrap</code>, which analyzes the code and installs any relevant instrumentation available:</p>
<pre><code class="language-bash">edot-bootstrap --action=install
</code></pre>
<p>Now, create your <code>.env</code> file, as described earlier in this article, and save the source code below as <code>chat.py</code>:</p>
<pre><code class="language-python">import os

import openai

CHAT_MODEL = os.environ.get(&quot;CHAT_MODEL&quot;, &quot;gpt-4o-mini&quot;)


def main():
  client = openai.Client()

  messages = [
    {
      &quot;role&quot;: &quot;user&quot;,
      &quot;content&quot;: &quot;Answer in up to 3 words: Which ocean contains Bouvet Island?&quot;,
    }
  ]

  chat_completion = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
  print(chat_completion.choices[0].message.content)

if __name__ == &quot;__main__&quot;:
  main()
</code></pre>
<p>Now you can run everything with:</p>
<pre><code class="language-bash">dotenv run -- opentelemetry-instrument python chat.py
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example (along with the instructions) in the Python EDOT repository <a href="https://github.com/elastic/elastic-otel-python/tree/main/examples/openai">here</a>.</p>
<p>Finally, if you would like to try a more comprehensive example, take a look at <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">chatbot-rag-app</a> which uses OpenAI with Elasticsearch’s <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elser</a> retrieval model.</p>
<h2>Java</h2>
<p>There are multiple popular ways to initialize a Java project. Since we are using OpenAI, the first step is to add the <a href="https://central.sonatype.com/artifact/com.openai/openai-java"><code>com.openai:openai-java</code></a> dependency and write the source below as <code>Chat.java</code>.</p>
<pre><code class="language-java">package openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.*;


final class Chat {

  public static void main(String[] args) {
    String chatModel = System.getenv().getOrDefault(&quot;CHAT_MODEL&quot;, &quot;gpt-4o-mini&quot;);

    OpenAIClient client = OpenAIOkHttpClient.fromEnv();

    String message = &quot;Answer in up to 3 words: Which ocean contains Bouvet Island?&quot;;
    ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
        .addMessage(ChatCompletionUserMessageParam.builder()
          .content(message)
          .build())
        .model(chatModel)
        .build();

    ChatCompletion chatCompletion = client.chat().completions().create(params);
    System.out.println(chatCompletion.choices().get(0).message().content().get());
  }
}
</code></pre>
<p>Build the project so that all dependencies are in a single jar. For example, if using Gradle, you would use the <code>com.gradleup.shadow</code> plugin.</p>
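<p>As a rough sketch, a Gradle build using the shadow plugin might look like the following (the version strings are placeholders to fill in; this is illustrative, not the official example build):</p>
<pre><code class="language-groovy">plugins {
    id 'java'
    id 'com.gradleup.shadow' version 'PLUGIN_VERSION'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'com.openai:openai-java:OPENAI_VERSION'
}
</code></pre>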
<p>Next, create your <code>.env</code> file, as described earlier, and download <code>shdotenv</code>, which we’ll use to load it.</p>
<pre><code class="language-bash">curl -O -L https://github.com/ko1nksm/shdotenv/releases/download/v0.14.0/shdotenv
chmod +x ./shdotenv
</code></pre>
<p>At this point, you have a jar and configuration you can use to run the OpenAI example. The next step is to download the EDOT Java javaagent binary. This is the part that records and exports logs, metrics and traces.</p>
<pre><code class="language-bash">curl -o elastic-otel-javaagent.jar -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=snapshots&amp;g=co.elastic.otel&amp;a=elastic-otel-javaagent&amp;v=LATEST'
</code></pre>
<p>Assuming you assembled a file named <code>openai-example-all.jar</code>, run it with EDOT like this:</p>
<pre><code class="language-bash">./shdotenv java -javaagent:elastic-otel-javaagent.jar -jar openai-example-all.jar
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example in the EDOT Java source repository <a href="https://github.com/elastic/elastic-otel-java/tree/main/examples/openai">here</a>.</p>
<h2>Node.js</h2>
<p>Assuming you already have npm installed and configured, run the following commands to initialize a project for the example. This includes the <a href="https://www.npmjs.com/package/openai">openai</a> package and <a href="https://www.npmjs.com/package/@elastic/opentelemetry-node"><code>@elastic/opentelemetry-node</code></a> (EDOT Node.js):</p>
<pre><code class="language-bash">npm init -y
npm install openai @elastic/opentelemetry-node
</code></pre>
<p>Next, create your <code>.env</code> file, as described earlier in this article, and save the source code below as <code>index.js</code>:</p>
<pre><code class="language-javascript">const {OpenAI} = require('openai');

let chatModel = process.env.CHAT_MODEL ?? 'gpt-4o-mini';

async function main() {
  const client = new OpenAI();
  const completion = await client.chat.completions.create({
    model: chatModel,
    messages: [
      {
        role: 'user',
        content: 'Answer in up to 3 words: Which ocean contains Bouvet Island?',
      },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main();
</code></pre>
<p>With this in place, run the above source with EDOT like this:</p>
<pre><code class="language-bash">node --env-file .env --require @elastic/opentelemetry-node index.js
</code></pre>
<p>Finally, look for a trace for the service named &quot;openai-example&quot; in Kibana. You should see a transaction named &quot;chat gpt-4o-mini&quot;.</p>
<p>Rather than copy/pasting above, you can find a working copy of this example in the EDOT Node.js source repository <a href="https://github.com/elastic/elastic-otel-node/tree/main/examples/openai">here</a>.</p>
<p>Finally, if you would like to try a more comprehensive example, take a look at <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/openai-embeddings">openai-embeddings</a> which uses OpenAI with Elasticsearch as a vector database!</p>
<h2>Closing Notes</h2>
<p>Above you’ve seen how to observe the official OpenAI SDK in three different languages, using Elastic Distribution of OpenTelemetry (EDOT).</p>
<p>It is important to note that some of the OpenAI SDKs, as well as the OpenTelemetry specifications around generative AI, are experimental. If you find this helps you, or you find glitches, please join our Slack and let us know.</p>
<p>Several LLM platforms accept requests from the OpenAI client SDK; just set <code>OPENAI_BASE_URL</code> and choose a relevant model. During development, we tested against OpenAI Platform and Azure OpenAI Service. We also ran integration tests against Ollama, contributing improvements to its OpenAI support, released in v0.5.12. Whatever your choice of OpenAI-compatible platform, we hope this new tooling helps you understand your LLM usage.</p>
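<p>As an illustrative sketch, pointing the earlier examples at a local Ollama server only requires overriding the base URL and model in your <code>.env</code> file (the model name below is just a placeholder for one you have pulled locally):</p>
<pre><code>OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=unused
CHAT_MODEL=qwen2.5:0.5b
</code></pre>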
<p>Finally, while the first Generative AI SDK instrumented with EDOT is OpenAI, you’ll see more soon. We are already working on Bedrock, and collaborating with others in the OpenTelemetry community for other platforms. Keep watching this blog for exciting updates.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-openai/elastic-opentelemetry-openai.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Native OTel-based K8s & App Observability in 3 Steps with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator</link>
            <guid isPermaLink="false">elastic-opentelemetry-otel-operator</guid>
            <pubDate>Wed, 13 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's Distributions of OpenTelemetry are now supported with the OTel Operator, providing auto instrumentation of applications with EDOT SDKs, and deployment and lifecycle management of the EDOT OTel Collector for Kubernetes Observability. Learn how to configure this in 3 easy steps]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released its Elastic Distributions of OpenTelemetry (EDOT) which have been developed to enhance the capabilities of standard OpenTelemetry distributions and improve existing OpenTelemetry support from Elastic. EDOT helps Elastic deliver its new Unified OpenTelemetry Experience. SRE’s are no longer burdened with a set of tedious steps instrumenting and ingesting OTel data into Observability. SREs get a simple and frictionless way to instrument the OTel collector, and applications, and ingest all the OTel data into Elastic. The components of this experience include: (detailed in the overview blog)</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions for OpenTelemetry (EDOT)</a></p>
</li>
<li>
<p>Elastic’s configuration for the OpenTelemetry Operator providing:</p>
<ul>
<li>
<p>OTel Lifecycle management for the OTel collector and SDKs</p>
</li>
<li>
<p>Auto-instrumentation of apps, since most developers will not instrument them manually</p>
</li>
</ul>
</li>
<li>
<p>Pre-packaged receivers, processors, exporters, and configuration for the OTel Kubernetes Collector</p>
</li>
<li>
<p>Out-of-the-box OTel-based K8S dashboards for metrics and logs</p>
</li>
<li>
<p>Discovered inventory views for services, hosts, and containers</p>
</li>
<li>
<p>Direct OTel ingest into Elasticsearch for EDOT (bypassing ingest into APM server) - all your data (logs, metrics, and traces) is now stored in Elastic’s Search AI Lake</p>
</li>
<li>
<p>All ingested OTel data is used and displayed natively in Discover, APM, Inventory, etc.</p>
</li>
</ul>
<p>In this blog we will cover how to ingest OTel for K8S and your application in 3 easy steps:</p>
<ol>
<li>
<p>Copy the install commands from the UI</p>
</li>
<li>
<p>Add the OpenTelemetry Helm charts, install the OpenTelemetry Operator with Elastic’s Helm configuration, and set your Elastic endpoint and authentication</p>
</li>
<li>
<p>Annotate the app services you want to be auto-instrumented </p>
</li>
</ol>
<p>Then you can easily see Kubernetes metrics and logs, as well as application logs, metrics, and traces, in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/unified-otel-based-k8s-experience.png" alt="OpenTelemetry Unified Observability Experience" /></p>
<p>To follow this blog you will need to have:</p>
<ol>
<li>
<p>An account on cloud.elastic.co, with access to the Elasticsearch endpoint and authentication (API key)</p>
</li>
<li>
<p>A non-instrumented application with services based on Go, .NET, Python, or Java; auto-instrumentation happens through the OTel Operator. In this example, we will use the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
</li>
<li>
<p>A Kubernetes cluster, we used EKS in our setup</p>
</li>
<li>
<p>Helm and kubectl installed</p>
</li>
</ol>
<p>You can find the API key in the Integrations section of Elastic. More information is also available in the <a href="https://www.elastic.co/guide/en/kibana/current/api-keys.html">documentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-api-keys.png" alt="OpenTelemetry API Keys" /></p>
<h2>K8S and Application Observability in Elastic:</h2>
<p>Before we walk you through the steps, let's show you what is visible in Elastic.</p>
<p>Once the Operator starts the OTel Collector, you can see the following in Elastic:</p>
<h3>Kubernetes metrics:</h3>
<p>Using an out-of-the-box dashboard, you can see node metrics, overall cluster metrics, and status across pods, deployments, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Discovered Inventory for Hosts, services, and containers:</h3>
<p>This can be found at Observability-&gt;Inventory on the UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-inventory.png" alt="OTel-based Kubernetes inventory" /></p>
<h3>Detailed metrics, logs, and processor info on hosts:</h3>
<p>This can be found at Observability-&gt;Infrastructure-&gt;Hosts</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-hosts.png" alt="OTel-based Kubernetes host metrics" /></p>
<h3>K8S and application logs in Elastic’s New Discover (called Explorer)</h3>
<p>This can be found on Observability-&gt;Discover</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-ingest-logs.png" alt="OTel-based Kubernetes logs" /></p>
<h3>Application Service views (logs, metrics, and traces):</h3>
<p>This can be found on Observability-&gt;Application</p>
<p>Then select the service and drill down into different aspects.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<p>Above, we show how traces are displayed using native OTel data.</p>
<h2>Steps to install</h2>
<h3>Step 0. Follow the commands listed in the UI</h3>
<p>Under Add data-&gt;Kubernetes-&gt;Kubernetes Monitoring with EDOT</p>
<p>You will find the following instructions, which we will follow here.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-operator-install.png" alt="EDOT Operator Install" /></p>
<h3>Step 1. Install the EDOT config for the OpenTelemetry Operator</h3>
<p>Run the following commands. Make sure you have already authenticated to your Kubernetes cluster; that is where you will run the Helm commands provided below.</p>
<pre><code class="language-bash"># Install helm repo needed
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts --force-update
# Install needed secrets. Provide the Elasticsearch Endpoint URL and API key you have noted in previous steps
kubectl create ns opentelemetry-operator-system
kubectl create -n opentelemetry-operator-system secret generic elastic-secret-otel \
    --from-literal=elastic_endpoint='YOUR_ELASTICSEARCH_ENDPOINT' \
    --from-literal=elastic_api_key='YOUR_ELASTICSEARCH_API_KEY'
# Install the EDOT Operator
helm install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack --namespace opentelemetry-operator-system --create-namespace --values https://raw.githubusercontent.com/elastic/opentelemetry/refs/heads/main/resources/kubernetes/operator/helm/values.yaml --version 0.3.0
</code></pre>
<p>The values.yaml file configuration can be found <a href="https://github.com/elastic/opentelemetry/blob/main/resources/kubernetes/operator/helm/values.yaml">here</a>.</p>
<h3>Step 1b: Ensure OTel data is arriving in Elastic</h3>
<p>The simplest way to check is to go to Menu &gt; Dashboards &gt; <strong>[OTEL][Metrics Kubernetes] Cluster Overview</strong> and ensure that the following dashboard is being populated.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Step 2: Annotate the application with auto-instrumentation</h3>
<p>For this example, we’re only going to annotate one service, the favorite-java service in the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application</p>
<p>Use the following commands to initiate auto-instrumentation:</p>
<pre><code class="language-bash">#Annotate Java namespace
kubectl annotate namespace java instrumentation.opentelemetry.io/inject-java=&quot;opentelemetry-operator-system/elastic-instrumentation&quot;
#Restart the java-app to get the new annotation
kubectl rollout restart deployment java-app -n java
</code></pre>
<p>You can also add the annotation directly to your pod’s YAML:</p>
<pre><code class="language-yaml">metadata:
  name: my-app
  annotations:
    instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
</code></pre>
<p>These instructions are provided in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-sdk-annotate.png" alt="Annotate Application with EDOT SDK" /></p>
<h2>Check out the service data in Elastic APM</h2>
<p>Once the OTel data is in Elastic, you can see:</p>
<ul>
<li>
<p>Out-of-the-box dashboards for OTel-based Kubernetes metrics</p>
</li>
<li>
<p>Discovered resources such as services, hosts, and containers that are part of the Kubernetes clusters</p>
</li>
<li>
<p>Kubernetes metrics, host metrics, logs, processor info, anomaly detection, and universal profiling.</p>
</li>
<li>
<p>Log analytics in Elastic Discover</p>
</li>
<li>
<p>APM features that show app overview, transactions, dependencies, errors, and more:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-service.png" alt="Java service in Elastic APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<h2>Try it out</h2>
<p>Elastic’s Distribution of OpenTelemetry (EDOT) transforms the observability experience by streamlining Kubernetes and application instrumentation. With EDOT, SREs and developers can bypass complex setups, instantly gain deep visibility into Kubernetes clusters, and capture critical metrics, logs, and traces—all within Elastic Observability. By following just a few simple steps, you’re empowered with a unified, efficient monitoring solution that brings your OpenTelemetry data directly into Elastic. With robust, out-of-the-box dashboards, automatic application instrumentation, and seamless integration, EDOT not only saves time but also enhances the accuracy and accessibility of observability across your infrastructure. Start leveraging EDOT today to unlock a frictionless observability experience and keep your systems running smoothly and insightfully.</p>
<p>Additional resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">OpenTelemetry Demo with Elastic Distributions</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/OTel-operator.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic now providing distributions for OpenTelemetry SDKs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-sdk-distributions</link>
            <guid isPermaLink="false">elastic-opentelemetry-sdk-distributions</guid>
            <pubDate>Wed, 03 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Adopting OpenTelemetry native standards for instrumenting and observing applications]]></description>
            <content:encoded><![CDATA[<p>If you develop applications, you may have heard about <a href="https://opentelemetry.io/">OpenTelemetry</a>. At Elastic®, we are enthusiastic about OpenTelemetry as the future of standardized application instrumentation and observability.</p>
<p>In this post, we share our plans to expand our adoption of and commitment to OpenTelemetry with the introduction of Elastic distributions of the OpenTelemetry language SDKs, which will complement our existing Elastic APM agents.</p>
<h2>What is OpenTelemetry?</h2>
<p>OpenTelemetry is a vendor-neutral observability framework and toolkit that supports telemetry signals such as traces, metrics, and logs in applications and distributed microservice-based architectures.</p>
<p>Driven by a set of standards, OpenTelemetry is designed to provide a consistent approach to instrumenting and observing application behavior. OpenTelemetry is an incubating project developed under the Cloud Native Computing Foundation (<a href="https://www.cncf.io/">CNCF</a>) umbrella and is currently the second most active project, topped only by Kubernetes.</p>
<p>You can read more on the <a href="https://opentelemetry.io/docs/what-is-opentelemetry/">OpenTelemetry website</a> about the concepts, terminology, and techniques for adopting OpenTelemetry.</p>
<h2>A richer instrumentation landscape</h2>
<p>By adopting OpenTelemetry, software code can be instrumented in a vendor-agnostic fashion, with telemetry signals exported in a standardized format to one or more vendor backends, such as <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>. Its design provides flexibility for application owners to switch out vendor backends with no code changes and use <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry collectors</a> to send telemetry data to multiple backends.</p>
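<p>As a minimal sketch of that fan-out (the endpoint addresses are placeholders, not Elastic-specific settings), a single collector pipeline can receive OTLP data and export the same traces to two backends:</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/elastic:
    endpoint: &quot;my-apm-server:8200&quot;
  otlp/other:
    endpoint: &quot;other-backend:4317&quot;

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/elastic, otlp/other]
</code></pre>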
<p>Because OpenTelemetry is not a vendor-specific solution, it is much easier for language ecosystems to adopt it and provide robust instrumentations. Vendors don’t have to implement specific instrumentations themselves anymore. OpenTelemetry is a standard, and it is in the interest of library developers to introduce and maintain instrumentations from which all consumers can benefit.</p>
<p>As a result, more instrumentation libraries are available and better kept up to date. If your company has open-source libraries, you can also contribute and create your own instrumentations to make it easier for your customers to adopt OpenTelemetry and benefit from richer traces, metrics, and logging in their applications.</p>
<h2>Elastic and OpenTelemetry</h2>
<p>Elastic is deeply involved in OpenTelemetry. In 2023, we donated the <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Elastic Common Schema</a>, which is being merged with the <a href="https://opentelemetry.io/docs/specs/semconv/">Semantic Conventions</a>. In 2024, we are in the process of donating our <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling agent based on eBPF</a>. We also have multiple contributors to various areas of OpenTelemetry across the organization.</p>
<p>We are therefore committed to helping OpenTelemetry succeed, which means, in some cases, beginning to shift away from Elastic-specific components and to recommend OpenTelemetry components instead.</p>
<p>Elastic is committed to supporting and contributing to OpenTelemetry. Our APM solution already accepts native OTLP (OpenTelemetry Protocol) data, and many of our APM agents already bridge data collection and transmission from applications instrumented with the OpenTelemetry APIs.</p>
<p>The next step on our journey is introducing Elastic distributions for the language SDKs and donating features upstream to the OpenTelemetry community by contributing to the OpenTelemetry SDK repositories.</p>
<h2>What is an OpenTelemetry distribution?</h2>
<p>An <a href="https://opentelemetry.io/docs/concepts/distributions/">OpenTelemetry distribution</a> is simply a customized version of one or more OpenTelemetry components. Each distribution extends the core functionality offered by the component while adhering to its API and existing features, utilizing built-in extension points.</p>
<h2>The Elastic OpenTelemetry SDK distributions</h2>
<p>With the release of Elastic distributions of the OpenTelemetry SDKs, we are extending our backing of OpenTelemetry as the preferred and recommended choice for instrumenting applications.</p>
<p>OpenTelemetry maintains and ships many language APIs and SDKs for observing applications using OpenTelemetry. The APIs provide a language-specific interface for instrumenting application code, while the SDK implements that API, enabling signals from observed applications to be collected and exported.</p>
<p>Our current work extends the OpenTelemetry language SDKs to introduce additional features and ensure that the exported data provides the most robust compatibility with our current backend while it evolves to become more OpenTelemetry native.</p>
<p>Additional features include reimplementing concepts currently available in the Elastic APM Agent but not part of the OpenTelemetry SDK. The distributions allow us to ship with opinionated defaults for all signals that are known to provide the best integration with <a href="https://www.elastic.co/observability">Elastic’s Observability</a> offering.</p>
<p>It’s undoubtedly possible to use the OpenTelemetry APIs to instrument code and then reference the OpenTelemetry SDK to enable the collection of the trace, metric, and log data that applications produce. Elastic APM accepts native OTLP data, so you can configure the OpenTelemetry SDK to export telemetry data directly to an Elastic backend. We refer to this setup as using the “vanilla” (a.k.a. “native”) OpenTelemetry SDK.</p>
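<p>Concretely, the “vanilla” SDK setup typically requires no code changes at all: the OpenTelemetry specification defines standard environment variables that every SDK reads at startup. The Python sketch below simply sets them; the endpoint and token are placeholders for your own Elastic APM deployment.</p>

```python
import os

# Standard OpenTelemetry SDK environment variables (defined by the OTel
# specification and read by any SDK at startup). The endpoint and token
# below are placeholder values, not a real deployment.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://my-deployment.apm.example.com:443"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Bearer my-secret-token"
os.environ["OTEL_SERVICE_NAME"] = "checkout-service"
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "deployment.environment=production"
```

<p>With these set, an application instrumented only with OpenTelemetry APIs exports OTLP data straight to the configured backend; switching vendors means changing the endpoint, not the code.</p>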
<p>Work is ongoing to improve support for storing and presenting OpenTelemetry data natively in our backend so that we can drive our observability UIs directly from the data from the various telemetry signals. Our work focuses on ensuring that the Elastic-curated UIs can seamlessly handle the ECS and OpenTelemetry formats. Alongside this effort, we are working on distributions of the language SDKs to support customers looking to adopt OpenTelemetry-native instrumentation in their applications.</p>
<p>The <a href="https://www.elastic.co/guide/en/apm/agent/index.html">current Elastic APM Agents</a> support features such as central configuration and span compression that are not part of the OpenTelemetry specification as of today. We are investing our engineering expertise to bring those features to a broader audience by contributing them to OpenTelemetry. Because standardization takes time, we can more rapidly bring these features to the OpenTelemetry community and our customers by providing distributions.</p>
<p>We believe the responsible choice is to concentrate on enabling and encouraging customers to favor vendor-neutral instrumentation in their code and reap the benefits of OpenTelemetry.</p>
<p>Distributions best serve our decision to fully adopt and recommend OpenTelemetry as the preferred solution for observing applications. By providing features that are currently unavailable in the “vanilla” OpenTelemetry SDK, we can support customers who want to adopt OpenTelemetry-native, vendor-agnostic instrumentation in their applications while still providing the same set of features and backend capabilities they enjoy today with the existing APM Agents. By maintaining Elastic distributions, we can also better support our customers with enhancements and fixes outside the release cycle of the “vanilla” OpenTelemetry SDKs, which we believe is a crucial factor in choosing a distribution.</p>
<p>Our vision is that Elastic will work with the OpenTelemetry community to donate features through the standardization processes and contribute the code to implement those in the native OpenTelemetry SDKs. In time, we hope to see many Elastic APM Agent-exclusive features transition into OpenTelemetry to the point where an Elastic distribution may no longer be necessary. In the meantime, we can deliver those capabilities via our OpenTelemetry distributions.</p>
<p>Application developers then have several options for instrumenting and collecting telemetry data from their applications:</p>
<ol>
<li>
<p><strong>Elastic APM Agent:</strong> The most fully featured option, but vendor-specific</p>
</li>
<li>
<p><strong>Elastic APM Agent with OpenTelemetry Bridge:</strong> Vendor-neutral instrumentation API, but with known limitations:</p>
<ol>
<li>Only supports bridging of traces (no metrics support)</li>
<li>Does not support OpenTelemetry span events</li>
</ol>
</li>
<li>
<p><strong>OpenTelemetry “vanilla” SDK:</strong> Fully supported today; however, it lacks some features of Elastic APM Agent, such as span compression</p>
</li>
<li>
<p><strong>Elastic OpenTelemetry Distribution:</strong></p>
<ol>
<li>Supports vendor-neutral instrumentation and no Elastic-specific configuration in code by default</li>
<li>Recommended defaults when using Elastic Observability as a backend</li>
<li>Use OpenTelemetry APIs to further customize our defaults; no new APIs to learn</li>
</ol>
</li>
</ol>
<p>While we continue to support all options to instrument your code for the foreseeable future, we think we are setting our customers up for success by introducing a fourth OpenTelemetry-native offering. We expect this will become the preferred default for Elastic customers in due time.</p>
<p>We currently have distributions in alpha release status for <a href="https://github.com/elastic/elastic-otel-dotnet">.NET</a> and <a href="https://github.com/elastic/elastic-otel-java">Java</a>, with additional language distributions coming very soon. We encourage you to check out those repositories, try out the distributions, and provide feedback to us via issues. Your valued input allows us to refine our designs and steer our direction to ensure that our distributions delight consumers.</p>
<p><a href="https://www.elastic.co/blog/elastic-opentelemetry-distribution-dotnet-applications"><em><strong>Learn about the alpha release of our new Elastic distribution of the OpenTelemetry SDK for .NET.</strong></em></a></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-sdk-distributions/OTel-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[FAQ - Elastic contributes its Universal Profiling agent to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry-faq</link>
            <guid isPermaLink="false">elastic-profiling-agent-acceptance-opentelemetry-faq</guid>
            <pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is advancing the adoption of OpenTelemetry with the contribution of its universal profiling agent. Elastic is committed to ensuring a vendor-agnostic ingestion and collection of observability and security telemetry through OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<h2>What is being announced?</h2>
<p>Elastic’s <a href="https://github.com/open-telemetry/community/issues/1918">donation proposal</a> for contributing its Universal Profiling™ agent has now been accepted by the OpenTelemetry community. Elastic’s Universal Profiling agent, the industry’s most comprehensive fleet-wide Universal Profiling solution, empowers users to quickly identify performance bottlenecks, reduce cloud spend, and minimize their carbon footprint. With the contribution of the Elastic Universal Profiling Agent to OpenTelemetry, all customers will benefit from its features and capabilities.</p>
<h2>What do Elastic users need to know?</h2>
<p>Elastic’s contribution of the continuous profiling agent will not change the existing set of Elastic’s continuous profiling features or how we ingest and store profiling data. </p>
<p>Elastic will collaborate closely with the OTel community, not only to manage the addition of the continuous profiling agent to OTel but also to work with and help drive the Profiling Special Interest Group (SIG) in shaping OTel’s continuous profiling evolution. </p>
<p>Elastic has facilitated the definition of the OTel <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">Profiling Data Model</a>, a crucial step toward standardizing profiling data. Moreover, the recent merge of the <a href="https://github.com/open-telemetry/oteps/pull/239">OpenTelemetry Enhancement Proposal (OTEP) introducing profiling support to the OpenTelemetry Protocol (OTLP)</a> marked an additional milestone. </p>
<h2>Why is Elastic contributing its Profiling Agent to OTel?</h2>
<p>This contribution not only accelerates the standardization of continuous profiling but also makes continuous profiling the 4th key signal in observability. This empowers everyone in the observability community to continuously profile with a standardized agent. The addition of Elastic’s continuous profiling agent will:</p>
<ul>
<li>
<p>Align efforts around a single standard poised for broad adoption by users.</p>
</li>
<li>
<p>Drive better visibility and improvement of resource usage and cost management for operations.</p>
</li>
<li>
<p>Enable vendors and the community to focus on richer features versus dealing with data transformation tasks.</p>
</li>
<li>
<p>Enable continuous profiling to become the 4th key signal in Observability.</p>
</li>
<li>
<p>Increase continuous profiling adoption and the continued evolution and convergence of observability and security domains.</p>
</li>
</ul>
<h2>Why is continuous profiling needed by organizations?</h2>
<p>The contribution of Elastic’s continuous profiling agent now helps customers realize the following benefits of continuous profiling:</p>
<ul>
<li>Maximize gross margins: By reducing the computational resources needed to run applications, businesses can optimize their cloud spend and improve profitability. Whole-system continuous profiling is one way of identifying the most expensive applications (down to the lines of code) across diverse environments that may span multiple cloud providers. This principle aligns with the familiar adage, &quot;A penny saved is a penny earned.&quot; In the cloud context, every CPU cycle saved translates to money saved. </li>
</ul>
<ul>
<li>Minimize environmental impact: Energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, contributing to a reduction in carbon (CO2) footprint. </li>
</ul>
<ul>
<li>Accelerate engineering workflows: Continuous profiling provides detailed insights to help debug complex issues faster, guide development, and improve overall code quality.</li>
</ul>
<p>With these benefits, customers can now not only manage the overall application’s efficiency on the cloud, but also ensure the application is optimally developed.</p>
<h2>What is continuous profiling?</h2>
<p>Elastic’s continuous profiling agent is a whole-system, always-on, continuous profiling solution that eliminates the need for run-time/bytecode instrumentation, recompilation, on-host debug symbols or service restarts.   </p>
<p>Profiling helps organizations run efficient services by minimizing computational wastage, thereby reducing operational costs. Leveraging <a href="https://ebpf.io/">eBPF</a>, the Elastic profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stack traces that go from the kernel, through userspace native code, all the way into code running in higher level runtimes, enabling you to identify performance regressions, reduce wasteful computations, and debug complex issues faster. </p>
<p>To this end, it measures code efficiency in three dimensions: CPU utilization, CO2, and cloud cost. This approach resonates with the sustainability objectives of our customers, ensuring that Elastic continuous profiling aligns seamlessly with their strategic <a href="https://en.wikipedia.org/wiki/Environmental,_social,_and_corporate_governance">ESG</a> goals.</p>
<h2>Does Elastic support OpenTelemetry today?</h2>
<p><a href="https://www.elastic.co/observability/opentelemetry">Elastic supports OTel natively</a>. Elastic users can send OTel data directly from applications or through the OTel collector into Elastic APM, which processes both OTel SemConv and ECS. With this native OTel support, all <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM capabilities</a> are available with OTel. <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">See Elastic documentation to learn more about OTel integration</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry-faq/blog-elastic-otel-2.png" alt="Native OpenTelemetry Support in Elastic" /></p>
<h2>Where can I learn more about Elastic’s Universal Profiling?</h2>
<p>Elastic’s resources help you understand continuous profiling and how to use it in different scenarios:</p>
<hr />
<ul>
<li>
<p><a href="https://www.elastic.co/observability/universal-profiling">Elastic Universal Profiling home page</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-universal-profiling-agent-open-source">Elastic Universal Profiling agent going open source under Apache 2</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">Pinpointing performance issues with profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">Elastic releases Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/whole-system-visibility-elastic-universal-profiling">Whole system profiling with Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Cost-effective applications with Universal Profiling</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html">Elastic documentation on Universal Profiling</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry-faq/profiling-acceptance-faq.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic contributes its Universal Profiling agent to OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry</link>
            <guid isPermaLink="false">elastic-profiling-agent-acceptance-opentelemetry</guid>
            <pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is advancing the adoption of OpenTelemetry with the contribution of its universal profiling agent. Elastic is committed to ensuring a vendor-agnostic ingestion and collection of observability and security telemetry through OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>Following great collaboration between Elastic and OpenTelemetry's profiling community, which included a thorough review process, the OpenTelemetry community has accepted Elastic's donation of our continuous profiling agent. This marks a significant milestone in helping establish profiling as the fourth telemetry signal in OpenTelemetry. Elastic’s eBPF-based continuous profiling agent observes code across different programming languages and runtimes, third-party libraries, kernel operations, and system resources with low CPU and memory overhead in production. SREs can now benefit from these capabilities: quickly identifying performance bottlenecks, maximizing resource utilization, reducing carbon footprint, and optimizing cloud spend.
Over the past year, we have been instrumental in <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">enhancing OpenTelemetry's Semantic Conventions</a> with the donation of the Elastic Common Schema (ECS), have contributed to the OpenTelemetry Collector and language SDKs, and have been working with OpenTelemetry’s Profiling Special Interest Group (SIG) to lay the foundation necessary to make profiling stable.</p>
<p>With today’s acceptance, we are officially contributing our continuous profiler technology to OpenTelemetry. We will also dedicate a team of profiling domain experts to co-maintain and advance the profiling capabilities within OTel.</p>
<p>We want to thank the OpenTelemetry community for the great and constructive cooperation on the donation proposal. We look forward to jointly establishing continuous profiling as an integral part of OpenTelemetry.</p>
<h2>What is continuous profiling?</h2>
<p>Profiling is a technique used to understand the behavior of a software application by collecting information about its execution. This includes tracking the duration of function calls, memory usage, CPU usage, and other system resources.</p>
<p>However, traditional profiling solutions have significant drawbacks limiting adoption in production environments:</p>
<ul>
<li>Significant cost and performance overhead due to code instrumentation</li>
<li>Disruptive service restarts</li>
<li>Inability to get visibility into third-party libraries</li>
</ul>
<p>Unlike traditional profiling, which is often done only in a specific development phase or under controlled test conditions, continuous profiling runs in the background with minimal overhead. This provides real-time, actionable insights without replicating issues in separate environments. SREs, DevOps, and developers can see how code affects performance and cost, making code and infrastructure improvements easier.</p>
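<p>To make the sampling idea concrete, here is a toy Python sketch — an illustration only, and in no way Elastic’s eBPF-based agent: a background thread periodically snapshots another thread’s call stack, and the functions that appear most often are where CPU time is going.</p>

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, interval, duration, counts):
    """Snapshot the target thread's call stack every `interval` seconds.

    Sampling needs no instrumentation of the profiled code; production
    agents (e.g. eBPF-based ones) capture stacks from the kernel instead.
    """
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)  # CPython-specific
        while frame is not None:
            counts[frame.f_code.co_name] += 1
            frame = frame.f_back
        time.sleep(interval)

def busy():
    # Stand-in for application work that burns CPU.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

counts = collections.Counter()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.current_thread().ident, 0.001, 0.5, counts),
)
sampler.start()
while sampler.is_alive():
    busy()
sampler.join()
# The hottest function dominates the samples, like the widest frame in a flame graph.
```

<p>Because only stack snapshots are taken, the profiled code never restarts and pays almost no overhead — which is exactly what makes always-on, production profiling practical.</p>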
<h2>Contribution of production-grade features</h2>
<p>Elastic Universal Profiling is a whole-system, always-on, continuous profiling solution that eliminates the need for code instrumentation, recompilation, on-host debug symbols or service restarts. Leveraging eBPF, Elastic Universal Profiling profiles every line of code running on a machine, including application code, kernel, and third-party libraries. The solution measures code efficiency in three dimensions, CPU utilization, CO2, and cloud cost, to help organizations manage efficient services by minimizing computational waste.</p>
<p>The Elastic profiling agent facilitates identifying non-optimal code paths, uncovering &quot;unknown unknowns&quot;, and provides comprehensive visibility into the runtime behavior of all applications. Elastic’s continuous profiling agent supports various runtimes and languages, such as C/C++, Rust, Zig, Go, Java, Python, Ruby, PHP, Node.js, V8, Perl, and .NET.</p>
<p>Additionally, organizations can meet sustainability objectives by minimizing computational wastage, ensuring seamless alignment with their strategic <a href="https://en.wikipedia.org/wiki/Environmental,_social,_and_corporate_governance">ESG</a> goals.</p>
<h2>Benefits to OpenTelemetry</h2>
<p>This contribution not only boosts the standardization of continuous profiling for observability but also accelerates the practical adoption of profiling as the fourth key signal in OTel. Customers get a vendor-agnostic way of collecting profiling data and enabling correlation with existing signals, like tracing, metrics, and logs, opening <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">new potential for observability insights and a more efficient troubleshooting experience</a>. </p>
<p>OTel-based continuous profiling unlocks the following possibilities for users:</p>
<ul>
<li>Improved customer experience: delivering consistent service quality and performance through continuous profiling ensures customers have an application that performs optimally, remains responsive, and is reliable.</li>
</ul>
<ul>
<li>Maximize gross margins: Businesses can optimize their cloud spend and improve profitability by reducing the computational resources needed to run applications. Whole system continuous profiling identifies the most expensive functions (down to the lines of code) across diverse environments that may span multiple cloud providers. In the cloud context, every CPU cycle saved translates to money saved. </li>
</ul>
<ul>
<li>Minimize environmental impact: energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a> ). More efficient code translates to lower energy consumption, reducing carbon (CO2) footprint. </li>
</ul>
<ul>
<li>Accelerate engineering workflows: continuous profiling provides detailed insights to help troubleshoot complex issues faster, guide development, and improve overall code quality.</li>
</ul>
<ul>
<li>Improved vendor neutrality and increased efficiency: an OTel eBPF-based profiling agent removes the need to use proprietary APM agents and offers a more efficient way to collect profiling telemetry.</li>
</ul>
<p>With these benefits, customers can now manage the overall application’s efficiency on the cloud while ensuring their engineering teams optimize it.</p>
<h2>What comes next?</h2>
<p>While the acceptance of Elastic’s donation of the profiling agent marks a significant milestone in the evolution of OTel’s eBPF-based continuous profiling capabilities, it represents the beginning of a broader journey. Moving forward, we will continue collaborating closely with the OTel Profiling and Collector SIGs to ensure seamless integration of the profiling agent within the broader OTel ecosystem. During this phase, users can test early preview versions of the OTel profiling integration by following the directions in the <a href="https://github.com/elastic/otel-profiling-agent/">otel-profiling-agent</a> repository.</p>
<p>Elastic remains deeply committed to OTel’s vision of enabling cross-signal correlation. We plan to further contribute to the community by sharing our innovative research and implementations, specifically those facilitating the correlation between profiling data and distributed traces, across several OTel language SDKs and the profiling agent.</p>
<p>We are excited about our <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">growing relationship with OTel</a> and the opportunity to donate our profiling agent in a way that benefits both the Elastic community and the broader OTel community. Learn more about <a href="https://www.elastic.co/observability/opentelemetry">Elastic’s OpenTelemetry support</a> and learn how to contribute to the ongoing profiling work in the community.</p>
<h2>Additional Resources</h2>
<p>Additional details on Elastic’s Universal Profiling can be found in the <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry-faq">FAQ</a>.</p>
<p>For more insights into observability, visit Elastic Observability Labs, where OTel-specific articles are also available.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-profiling-agent-acceptance-opentelemetry/profiling-acceptance.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic's RAG-based AI Assistant: Analyze application issues with LLMs and private GitHub issues]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github</link>
            <guid isPermaLink="false">elastic-rag-ai-assistant-application-issues-llm-github</guid>
            <pubDate>Wed, 08 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we review how GitHub issues and other GitHub documents from internal and external GitHub repositories can be used in root cause analysis with Elastic’s RAG-based AI Assistant.]]></description>
            <content:encoded><![CDATA[<p>As an SRE, analyzing applications is more complex than ever. Not only do you have to ensure the application is running optimally to ensure great customer experiences, but you must also understand the inner workings in some cases to help troubleshoot. Analyzing issues in a production-based service is a team sport. It takes the SRE, DevOps, development, and support to get to the root cause and potentially remediate. If it's impacting, then it's even worse because there is a race against time. Regardless of the situation, there is a ton of information that needs to be consumed and processed. This includes not only what the customer is experiencing, but also internal data to help provide the most appropriate resolution.</p>
<p>Elastic’s AI Assistant helps improve analysis for SREs, DevOps, Devs, and others. In a single window, using natural language questions, you can analyze issues with not only the LLM’s general knowledge but also internal sources such as:</p>
<ul>
<li>
<p>Issues from internal GitHub repos, Jira, etc.</p>
</li>
<li>
<p>Documents from internal wiki sites from Confluence, etc.</p>
</li>
<li>
<p>Customer issues from your support service</p>
</li>
<li>
<p>And more</p>
</li>
</ul>
<p>In this blog, we will walk you through how to:</p>
<ol>
<li>
<p>Ingest an external GitHub repository (<a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry demo repo</a>) with code and issues into Elastic. Apply Elastic Learned Sparse EncodeR (ELSER) and store it in a specific index for the AI Assistant.</p>
</li>
<li>
<p>Ingest internal GitHub repository with runbook information into Elastic. Apply ELSER and store the processed data in a specific index for the AI Assistant.</p>
</li>
<li>
<p>Use these two indices when analyzing issues for the OpenTelemetry demo in Elastic using the AI Assistant.</p>
</li>
</ol>
<h2>3 simple questions using GitHub data with AI Assistant</h2>
<p>Before we walk through the steps for setting up data from GitHub, let’s review what an SRE can do with the AI Assistant and GitHub repos.</p>
<p>We initially connect to GitHub using an Elastic GitHub connector and ingest and process two repos: the OpenTelemetry demo repo (public) and an internal runbook repo (Elastic internal).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/1.png" alt="1 - elasticsearch connectors" /></p>
<p>With these two loaded and parsed by ELSER, we ask the AI Assistant some simple questions generally asked during analysis.</p>
<h3>How many issues are open for the OpenTelemetry demo?</h3>
<p>Since we ingested the entire repo (as of April 26, 2024) with a doc count of 1,529, we ask a simple question about the total number of open issues. We specifically tell the AI Assistant to search our internal index, ensuring the LLM asks Elastic for the answer rather than relying on its own training data.</p>
&lt;Video vidyardUuid=&quot;XyKWeYz21mdDkMfop7absQ&quot; loop={true} /&gt;
<h3>Are there any issues for the Rust based shippingservice?</h3>
<p>Elastic’s AI Assistant uses ELSER to traverse the loaded GitHub repo and finds the open issue against the shippingservice (which is the following <a href="https://github.com/open-telemetry/opentelemetry-demo/issues/346">issue</a> at the time of writing this post).</p>
&lt;Video vidyardUuid=&quot;TF1qgy3WH3cuLQdBvdX66A&quot; loop={true} /&gt;
<h3>Is there a runbook for the Cartservice?</h3>
<p>Since we loaded an internal GitHub repo with a few sample runbooks, the Elastic AI Assistant properly finds the runbook.</p>
&lt;Video vidyardUuid=&quot;kSukiZ6zYZDQDycs616ji8&quot; loop={true} /&gt;
<p>As we go through this blog, we will talk about how the AI Assistant finds these issues using ELSER and how you can configure it to use your own GitHub repos.</p>
<h2>Retrieval augmented generation (RAG) with Elastic AI Assistant</h2>
<p>Elastic has the most advanced RAG-based AI Assistant for both Observability and Security. It can help you analyze your data using:</p>
<ul>
<li>
<p>Your favorite LLM (OpenAI, Azure OpenAI, AWS Bedrock, etc.)</p>
</li>
<li>
<p>Any internal information (GitHub, Confluence, customer issues, etc.) you can either connect to or bring into Elastic’s indices</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/2.png" alt="Elastic AI Assistant — connecting internal and external information" /></p>
<p>The reason Elastic’s AI Assistant can do this is because it supports RAG, which helps retrieve internal information along with LLM-based knowledge.</p>
<p>Adding relevant internal information for an SRE into Elastic:</p>
<ul>
<li>
<p>As data comes in, such as from your GitHub repository, ELSER is applied to it, and embeddings (token-weight pairs stored in a sparse vector field) are added to capture the semantic meaning and context of the data.</p>
</li>
<li>
<p>This data (GitHub, Confluence, etc.) is processed with embeddings and is stored in an index that can be searched by the AI Assistant.</p>
</li>
</ul>
<p>When you query the AI Assistant for information:</p>
<ul>
<li>
<p>The query goes through the same ELSER inference process as the ingested data. The input query generates a “sparse vector,” which is used to find the most relevant, highest-ranked information in the ingested data (GitHub, Confluence, etc.).</p>
</li>
<li>
<p>The retrieved data is then combined with the query and also sent over to the LLM, which will then add its own knowledge base information (if there is anything to add), or it might ask Elastic (via function calls) to analyze, chart, or even search further. If a function call is made to Elastic and a response is provided, it will be added by the LLM to its response.</p>
</li>
<li>
<p>The result is the most contextually relevant answer, combining the LLM’s knowledge with anything relevant from your internal data.</p>
</li>
</ul>
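<p>The retrieval step can be sketched in a few lines of Python. This is a toy illustration, not ELSER itself: real ELSER inference expands text into thousands of learned token weights, whereas here simple term counts stand in for them, and the document snippets are invented.</p>

```python
# Toy sketch of sparse-vector retrieval. Term counts stand in for the
# learned token expansions a real model like ELSER would produce.

def sparse_vector(text):
    """Map text to a {token: weight} dict (a stand-in for model inference)."""
    weights = {}
    for token in text.lower().split():
        weights[token] = weights.get(token, 0.0) + 1.0
    return weights

def score(query_vec, doc_vec):
    """Dot product over the tokens the two sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Invented stand-ins for ingested GitHub documents.
docs = {
    "issue-346": "shippingservice panics under load rust tokio",
    "runbook-cart": "cartservice runbook restart redis cache",
}
doc_vecs = {doc_id: sparse_vector(text) for doc_id, text in docs.items()}

query_vec = sparse_vector("is there a runbook for the cartservice")
ranked = sorted(doc_vecs, key=lambda d: score(query_vec, doc_vecs[d]), reverse=True)
```

<p>The highest-scoring documents are what gets combined with the query and passed to the LLM, grounding its answer in your internal data.</p>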
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/3.png" alt="3 - elastic's RAG flowchart" /></p>
<h2>Application, prerequisites, and config</h2>
<p>If you want to try the steps in this blog, here are some prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry demo</a> running and connected to Elastic (<a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html#apm-instrument-apps-otel">APM documentation</a>)</p>
</li>
<li>
<p>Any internal GitHub repo with information that is useful for analysis (in our walkthrough, we use a GitHub repo that houses runbooks for different scenarios used in Elastic demos)</p>
</li>
<li>
<p>An account with your preferred or approved LLM provider (OpenAI, Azure OpenAI, AWS Bedrock)</p>
</li>
</ul>
<h2>Adding the GitHub repos to Elastic</h2>
<p>The first step is to set up the GitHub connector and connect it to your GitHub repo. Elastic has several connectors, for GitHub, Confluence, Google Drive, Jira, AWS S3, Microsoft Teams, Slack, and more. While we go over the GitHub connector in this blog, don’t forget about the other connectors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/4.png" alt="4 - select a connector" /></p>
<p>Once you select the GitHub connector and give it a name, you need to add two items:</p>
<ul>
<li>
<p>GitHub token</p>
</li>
<li>
<p>The repository path (here, open-telemetry/opentelemetry-demo)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/5.png" alt="5 - configuration" /></p>
<p>Next, attach an index in the wizard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/6.png" alt="6 - attach an index" /></p>
<h2>Create a pipeline and process the data with ELSER</h2>
<p>To add the embeddings discussed in the section above, complete the following steps for the connector:</p>
<ul>
<li>
<p>Create a pipeline in the configuration wizard.</p>
</li>
<li>
<p>Create a custom pipeline.</p>
</li>
<li>
<p>Add the ML inference pipeline.</p>
</li>
<li>
<p>Select the ELSER v2 ML model to add the embeddings.</p>
</li>
<li>
<p>Select the fields that need to be evaluated as part of the inference pipeline.</p>
</li>
<li>
<p>Test and save the inference pipeline and the overall pipeline.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/7.png" alt="7" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/8.png" alt="8" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/9.png" alt="9" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/10.png" alt="10" /></p>
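<p>Behind the wizard, these steps amount to an ingest pipeline containing an inference processor. Here is a minimal Python sketch of what such a pipeline body can look like; the description, source field, and output field names are illustrative assumptions (the wizard generates the real ones for the fields you select):</p>

```python
def build_elser_inference_pipeline(source_field: str = "body_content") -> dict:
    """Sketch of an ingest pipeline that runs ELSER v2 over one text field.

    Field names are illustrative; the connector's pipeline wizard
    uses the actual fields you selected for inference.
    """
    return {
        "description": "Add ELSER embeddings to incoming connector documents",
        "processors": [
            {
                "inference": {
                    "model_id": ".elser_model_2",
                    # map the document's text field to the model input,
                    # and choose where the sparse-vector output is stored
                    "input_output": [
                        {
                            "input_field": source_field,
                            "output_field": "ml.inference.predicted_value",
                        }
                    ],
                }
            }
        ],
    }

pipeline = build_elser_inference_pipeline()
```

<p>Each document that flows through the pipeline gets a sparse-vector field added, which is what the AI Assistant later searches.</p>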
<h2>Sync the data</h2>
<p>Now that the pipeline is created, you can start syncing the GitHub repo. As documents from the GitHub repo come in, they go through the pipeline and embeddings are added.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/11.png" alt="11" /></p>
<h2>Embeddings</h2>
<p>Once the pipeline is set up, sync the data in the connector. As data from the GitHub repository comes in, the inference pipeline processes it as follows:</p>
<ul>
<li>
<p>As data comes in from your GitHub repository, ELSER is applied to it, and embeddings (weights and tokens in a sparse vector field) are added to capture the semantic meaning and context of the data.</p>
</li>
<li>
<p>This data is processed with embeddings and is stored in an index that can be searched by the AI Assistant.</p>
</li>
</ul>
<p>When you look at the OpenTelemetry GitHub documents that were ingested, you will see how the weights and tokens are added to the predicted_value field in the index.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/12.png" alt="12" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/13.png" alt="13" /></p>
<p>These embeddings are used to find the most contextually relevant documents when the user asks the AI Assistant a question.</p>
<h2>Check if AI Assistant can use the index</h2>
<p>Elastic’s AI Assistant uses ELSER to search the loaded GitHub repo and find the open issue against the shippingservice (this <a href="https://github.com/open-telemetry/opentelemetry-demo/issues/346">issue</a> at the time of writing).</p>
&lt;Video vidyardUuid=&quot;TF1qgy3WH3cuLQdBvdX66A&quot; loop={true} /&gt;
<p>Based on the response, we can see that the AI Assistant can now use the index to find the issue and use it for further analysis.</p>
<h2>Conclusion</h2>
<p>You’ve now seen how easy Elastic’s RAG-based AI Assistant is to set up. You can bring in documents from multiple locations (GitHub, Confluence, Slack, etc.); we’ve shown the setup for GitHub and OpenTelemetry. This internal information can be useful in managing issues, accelerating resolution, and improving customer experiences. Check out our other blogs on how the AI Assistant can help SREs do better analysis, lower MTTR, and improve operations overall:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/analyzing-opentelemetry-apps-elastic-ai-assistant-apm">Analyzing OpenTelemetry apps with Elastic AI Assistant and APM</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-escapes-kibana">The Elastic AI Assistant for Observability escapes Kibana!</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-microsoft-azure-openai">Getting started with the Elastic AI Assistant for Observability and Microsoft Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/whats-new-elastic-8-13-0">Elastic 8.13: GA of Amazon Bedrock in the Elastic AI Assistant for Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/sre-troubleshooting-ai-assistant-observability-runbooks">Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Context-aware insights using the Elastic AI Assistant for Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/elastic-ai-assistant-observability-amazon-bedrock">Getting started with the Elastic AI Assistant for Observability and Amazon Bedrock</a></p>
</li>
</ul>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environments. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-rag-ai-assistant-application-issues-llm-github/AI_fingertip_touching_human_fingertip.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic SQL inputs: A generic solution for database metrics observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sql-inputs-database-metrics-observability</link>
            <guid isPermaLink="false">sql-inputs-database-metrics-observability</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog dives into the functionality of generic SQL and provides various use cases for advanced users to ingest custom metrics to Elastic for database observability. We also introduce the new fetch-from-all-databases capability released in 8.10.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> SQL inputs (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> module and <a href="https://docs.elastic.co/integrations/sql">input package</a>) allow the user to execute <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> queries against many supported databases in a flexible way and ingest the resulting metrics to Elasticsearch<sup>®</sup>. This blog dives into the functionality of generic SQL and provides various use cases for <em>advanced users</em> to ingest custom metrics to Elastic for database observability. The blog also introduces the new fetch_from_all_databases capability, released in 8.10.</p>
<h2>Why “Generic SQL”?</h2>
<p>Elastic already has metricbeat modules and integration packages targeted at specific databases. One example is the <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-mysql.html">metricbeat module</a> for MySQL and the corresponding integration <a href="https://docs.elastic.co/en/integrations/mysql">package</a>. These Beats modules and integrations are customized for a specific database, and the metrics are extracted using pre-defined queries against that database. The queries used in these integrations and the corresponding metrics are <em>not</em> available for modification.</p>
<p>The <em>Generic SQL inputs</em> (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> or <a href="https://docs.elastic.co/integrations/sql">input package</a>), by contrast, can be used to scrape metrics from any supported database using the user's own SQL queries. The queries are provided by the user depending on the specific metrics to be extracted. This enables a much more powerful mechanism for metrics ingestion: users choose a specific driver, provide the relevant SQL queries, and the results get mapped to one or more Elasticsearch documents using a structured mapping process (the table/variable format explained later).</p>
<p>Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, <em>Generic SQL input</em> and <em>Generic SQL</em> are used interchangeably.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-1-genericSQL.png" alt="Generic SQL database metrics collection" /></p>
<h2>Functionalities details</h2>
<p>This section covers features that help with metrics extraction. We provide a brief description of the response format configuration, then dive into the merge_results functionality, which combines results from multiple SQL queries into a single document.</p>
<p>The next key functionality users may be interested in is collecting metrics from all custom databases, which is now possible with the fetch_from_all_databases feature.</p>
<p>Now let's dive into the specific functionalities:</p>
<h3>Different drivers supported</h3>
<p>Generic SQL can fetch metrics from different databases. The current version supports the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).</p>
<h3>Response format</h3>
<p>The response format in generic SQL determines whether the data is returned in table or variable format. Here’s an overview of both formats and their syntax.</p>
<p>Syntax: <code>response_format: table</code> or <code>variables</code></p>
<p><strong>Response format table</strong><br />
This mode generates a single event for each row. The table format has no restriction on the number of columns in the response.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: table
</code></pre>
<p>This query returns a response similar to this:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;counter_name&quot;:&quot;User Connections &quot;,
         &quot;cntr_value&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>The response generated above adds the counter_name as a key in the document.</p>
<p><strong>Response format variables</strong><br />
The variable format returns key:value pairs and expects the query to fetch exactly two columns.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: variables
</code></pre>
<p>The variable format uses the value of the first column in the query above as the key:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user connections &quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>In the above response, you can see the value of counter_name is used to generate the key in variable format.</p>
<h3>Response optimization: merge_results</h3>
<p>Generic SQL now supports merging multiple query responses into a single event. By enabling <strong>merge_results</strong>, users can significantly optimize the storage space of the metrics ingested to Elasticsearch. This mode compacts the generated output: instead of multiple documents, a single merged document is generated wherever applicable. Metrics of a similar kind, generated from multiple queries, are combined into a single event.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-2-output-merge-results.png" alt="Output of Merge results" /></p>
<p>Syntax: <code>merge_results: true</code> or <code>false</code></p>
<p>The example below shows how the data is loaded into Elasticsearch when merge_results is disabled.</p>
<p>Example:</p>
<p>In this example, we are using two different queries to fetch metrics from the performance counter.</p>
<pre><code class="language-yaml">merge_results: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
    response_format: table
  - query: &quot;SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'&quot;
    response_format: table
</code></pre>
<p>As you can see, the response for the above example generates a single document for each query.</p>
<p>The resulting document from the first query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>And resulting document from the second query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>When we enable the merge_results flag, both of the above metrics are combined and loaded into a single document.</p>
<p>You can see the merged document in the below example:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7,
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p><em>However, table queries can be merged only if each produces a single row. There is no restriction on merging variable queries.</em></p>
<h3>Introducing a new capability: fetch_from_all_databases</h3>
<p>This <a href="https://github.com/elastic/beats/pull/35688">new functionality</a> automatically fetches metrics from all system and user databases of Microsoft SQL Server when the fetch_from_all_databases flag is enabled.</p>
<p>The feature is available starting with the <a href="https://www.elastic.co/guide/en/beats/metricbeat/8.10/metricbeat-module-sql.html#_example_execute_given_queries_for_all_databases_present_in_a_server">8.10 release</a>. Prior to 8.10, users had to provide database names manually to fetch metrics from custom/user databases.</p>
<p>Syntax: <code>fetch_from_all_databases: true</code> or <code>false</code></p>
<p>Below is a sample query with the fetch_from_all_databases flag disabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';&quot;
</code></pre>
<p>The above query fetches metrics only for the provided database name. Here the input database is master, so metrics are fetched only for master.</p>
<p>Below is a sample query with the fetch_from_all_databases flag enabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: true
driver: &quot;mssql&quot;
sql_queries:
  - query: SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;
    response_format: table
</code></pre>
<p>The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.</p>
<p>Please note: currently this feature is supported only for Microsoft SQL Server, and it will be used internally by the MS SQL integration to support extracting metrics for <a href="https://github.com/elastic/integrations/issues/4108">all user DBs</a> by default.</p>
<h2>Using generic SQL: Metricbeat</h2>
<p>The generic <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">SQL metricbeat module</a> provides the flexibility to execute queries against different database drivers. The metricbeat module is generally available (GA) for production usage. You can find more information on configuring <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">the generic SQL module</a> for different drivers, with various examples.</p>
<h2>Using generic SQL: Input package</h2>
<p>The input package provides a flexible solution for advanced users to customize their ingestion experience in Elastic. Generic SQL is now also available as an SQL <a href="https://docs.elastic.co/integrations/sql">input package</a>. The input package is currently available for early users as a <strong>beta release</strong>. Let's walk through how to use generic SQL via the input package.</p>
<h3>Configuring the generic SQL input package</h3>
<p>The configuration options for the generic SQL input package are as below:</p>
<ul>
<li><strong>Driver:</strong> The SQL database for which you want to use the package. In this case, we will take mysql as an example.</li>
<li><strong>Hosts:</strong> Here the user enters the connection string used to connect to the database. It varies depending on which database/driver is being used. Refer <a href="https://docs.elastic.co/integrations/sql#hosts">here</a> for examples.</li>
<li><strong>SQL Queries:</strong> Here the user writes the SQL queries they want to run and specifies the response_format.</li>
<li><strong>Data set:</strong> The user specifies a <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#_data_stream_field_details">data set</a> name to which the response fields get mapped.</li>
<li><strong>Merge results:</strong> This is an advanced setting, used to merge the results of multiple queries into a single event.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-3-SQL-metrics-inputpackage.png" alt="Configuration parameters for SQL input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-4-expanded-document.png" alt="Metrics getting mapped to the index created by the ‘sql_first_dataset’" /></p>
<h3>Metrics extensibility with customized SQL queries</h3>
<p>Let's say a user is using the <a href="https://docs.elastic.co/integrations/mysql">MySQL integration</a>, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new customized SQL queries.</p>
<p>This can be achieved by adding an instance of SQL input package, writing the customized queries and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name as shown in the screenshot below.</p>
<p>This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-5-driver.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>When there are multiple queries, users can combine them into a single event by enabling the Merge Results toggle.</p>
<h3>Customizing user experience</h3>
<p>Users can customize their data by writing their own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing their customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Users can also build their own bespoke dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-6-ingest-pipeline.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries that are not supported in the default MySQL integration (where the user gets metrics from a predetermined set of queries).</p>
<p>The SQL input package also supports multiple drivers: mssql, postgresql, and oracle. A single input package can be used to cater to all these databases.</p>
<p>Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.</p>
<h2>Try it out!</h2>
<p>Now that you know about the various use cases and features of generic SQL, get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and try the <a href="https://docs.elastic.co/integrations/sql">SQL input package</a> for your SQL database to get a customized experience and metrics. If you are looking for newer metrics for some of our existing SQL-based integrations, like <a href="https://docs.elastic.co/en/integrations/microsoft_sqlserver">Microsoft SQL Server</a>, <a href="https://docs.elastic.co/integrations/oracle">Oracle</a>, and more, go ahead and give the SQL input package a whirl.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Smarter Alerting Arrives with Faster Triage, Clearer Groupings, and Actionable Guidance]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-stack-observability-alerting-upgrade</link>
            <guid isPermaLink="false">elastic-stack-observability-alerting-upgrade</guid>
            <pubDate>Thu, 04 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring the latest enhancements in Elastic Stack alerting, including improved related alert grouping, linking dashboards to alert rules, and embedding investigation guides into alerts.]]></description>
            <content:encoded><![CDATA[<p>In the 9.1 release, we've made significant upgrades to alerting to help SREs and operators cut through the noise, understand what's happening faster, and take meaningful action with less guesswork.</p>
<p>Here's what's new:</p>
<h2>Improved Related Alert Grouping with Relevance Scoring &amp; Reasoning</h2>
<p>We've enhanced our related alert detection to go beyond surface-level correlations. Alerts are now grouped based on a relevance score that reflects the strength of their relationship across dimensions like:</p>
<ul>
<li><strong>Shared entities or resources</strong> (e.g. same host, pod, or service)</li>
<li><strong>Temporal proximity</strong> (alerts firing within a suspiciously short window)</li>
<li><strong>Signal similarity</strong> (e.g. spikes in logs, metrics, and traces that point to the same failure mode)</li>
</ul>
<p>More importantly, we now <strong>show the why</strong>. You'll see why an alert is grouped: whether it shares the same Kubernetes pod, has similar log patterns, or was triggered by the same upstream anomaly. This gives users confidence in the grouping logic and accelerates root cause analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-1.jpg" alt="Related Alerts" /></p>
<h2>Link Dashboards to Alert Rules and Get Smart Suggestions</h2>
<p>You can now <strong>link dashboards directly to your alert rules</strong>, giving responders an instant visual lens into the metrics or logs that matter most for that alert. No more scrambling to remember which dashboard to check — just click and go.</p>
<p>And we've made this smarter too: Elastic will now <strong>suggest relevant dashboards</strong> based on the alert's source, rule logic, or monitored entities, helping users land on the right view without needing to configure anything upfront.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-2.jpg" alt="Related Alerting Dashboards" /></p>
<h2>Investigation Guides Embedded Into Alerts</h2>
<p>Every alert can now be configured with an <strong>investigation guide</strong>, a set of pre-configured, context-aware instructions or next steps tailored to the alert. Think of it as a playbook that's embedded right where and when you need it.</p>
<p>Use it to:</p>
<ul>
<li>Document your team's runbooks and standard triage steps or link to existing runbooks</li>
<li>Guide junior engineers or on-call responders through unfamiliar territory</li>
<li>Automate the first few steps of root cause analysis</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-3.jpg" alt="Investigation Guide" /></p>
<h2>Why This Matters</h2>
<p>These changes are all about reducing mean time to detect (MTTD) and mean time to resolve (MTTR). By:</p>
<ul>
<li>Grouping alerts more intelligently (and transparently)</li>
<li>Giving you the dashboards you need, when you need them</li>
<li>Embedding action-oriented guides in every alert</li>
</ul>
<p>We're bringing you closer to a truly streamlined incident response workflow: no swivel-chairing, no guesswork, just clarity.</p>
<p>Additionally, take a look at some of our other articles on Elastic Observability Labs related to analysis:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/ai-assistant">Using the AI Assistant in Elastic Observability to Accelerate Root Cause Analysis</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/log-analytics">All of the log analytics features in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry">Our latest on OpenTelemetry support in Elastic Observability</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/cover-alerting.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Streams Processing: Stop Fighting with Grok. Parse Your Logs in Streams.]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-processing</link>
            <guid isPermaLink="false">elastic-streams-processing</guid>
            <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams Processing works under the hood and how to use it to build, test, and deploy parsing logic on live data quickly.]]></description>
            <content:encoded><![CDATA[<p>With Streams, Elastic's new AI capability in 9.2, we make parsing your logs so simple that it's no longer a concern. Logs are generally messy: lots of fields, some understood, some unknown. You have to constantly keep up with their semantics and pattern-match to parse them properly. In some cases, even fields you know have different values or semantics; for instance, <code>timestamp</code> is the ingest time, not the event time. Or you can't filter by <code>log.level</code> or <code>user.id</code> at all because they're buried inside the <code>message</code> field. As a result, your dashboards are flat and not useful.</p>
<p>Fixing this used to mean leaving Kibana, learning Grok syntax, manually editing ingest pipeline JSON or a complicated Logstash config, and hoping you didn't break parsing for everything else.</p>
<p>We built Streams to fix this, and much more. It's your one place for data processing, built right into Kibana, that lets you build, test, and deploy parsing logic on live data in seconds. It turns a high-risk backend task into a fast, predictable, interactive UI workflow. You can use AI to generate Grok rules automatically from a sample of logs, or build them easily in the UI. Let's walk through an example.</p>
<h2>A Quick Walkthrough</h2>
<p>Let's fix a common &quot;unstructured&quot; log right now.</p>
<ol>
<li><strong>Start in Discover</strong>. You find a log that isn't structured. The <code>@timestamp</code> is wrong, and fields like <code>log.level</code> aren't being extracted, so your histograms are just a single-color bar.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/start-in-discover.png" alt="start in discover" /></p>
<ol start="2">
<li><strong>Inspect the log</strong>. Open the document flyout (the &quot;Inspect a single log event&quot; view). You'll see a button: <strong>&quot;Parse content in Streams&quot;</strong> (or &quot;Edit processing in Streams&quot;). Click it.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/inspect-the-log.png" alt="inspect the log" /></p>
<ol start="3">
<li><strong>Go to Processing</strong>. This takes you directly to the Streams processing tab, pre-loaded with sample documents from that data stream. Click <strong>&quot;Create your first step.&quot;</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/go-to-processing.png" alt="go to streams processing" /></p>
<ol start="4">
<li><strong>Generate a Pattern</strong>. The processor defaults to Grok, but you don't have to write any Grok yourself. Just click the <strong>&quot;Generate Pattern&quot;</strong> button. Streams analyzes 100 sample documents from your stream and suggests a Grok pattern for you. By default, this uses the Elastic Managed LLM, but you can configure your own.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/generate-pattern.png" alt="generate the pattern" /></p>
<ol start="5">
<li><strong>Accept and Simulate</strong>. Click &quot;Accept.&quot; Instantly, the UI runs a simulation across all 100 sample documents. You can make changes to the pattern or adjust field names, and the simulation re-runs with every keystroke.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/accept-and-simulate.png" alt="simulate and accept" /></p>
<p>When you're happy, you save it. Your new logs will now be parsed correctly.</p>
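<p>To make this concrete: for a raw line like <code>2025-12-01T10:15:00Z ERROR payment failed for user 42</code>, the grok processor Streams generates could look roughly like this (an illustrative sketch only; the sample line and field names are hypothetical):</p>

```json
{
  "grok": {
    "field": "message",
    "patterns": [
      "%{TIMESTAMP_ISO8601:custom.timestamp} %{LOGLEVEL:log.level} %{GREEDYDATA:custom.detail}"
    ]
  }
}
```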
<h2>Powerful Features for Messy, Real-World Logs</h2>
<p>That's the simple case. But real-world data is rarely that clean. Here are the features built to handle the complexity.</p>
<h3>The Interactive Grok UI</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-interactive-grok-ui.png" alt="interactive grok" /></p>
<p>When you use the Grok processor, the UI gives you a <strong>visual indication</strong> of what your pattern is extracting. You can see which parts of the <code>message</code> field are being mapped to which new field names. This immediate feedback means you're not just guessing. Autocompletion of Grok patterns and instant pattern validation are also built in.</p>
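<p>If you're curious what that extraction amounts to, Grok patterns ultimately compile down to named regular expressions. Here's a rough Python approximation (the pattern and field names are a deliberately simplified stand-in, not what Streams emits):</p>

```python
import re

# %{TIMESTAMP_ISO8601:...}, %{LOGLEVEL:...}, and %{GREEDYDATA:...} boil down
# to named capture groups like these.
PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<detail>.*)")

def parse(message: str) -> dict:
    """Extract structured fields from an unstructured log line."""
    match = PATTERN.match(message)
    return match.groupdict() if match else {}

print(parse("2025-12-01T10:15:00Z ERROR payment failed for user 42"))
# {'timestamp': '2025-12-01T10:15:00Z', 'level': 'ERROR', 'detail': 'payment failed for user 42'}
```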
<h3>The Diff Viewer</h3>
<p>How do you know exactly what changed? Expand any row in the simulation table. You'll get a diff view showing precisely which fields were added, removed, or modified for that specific document. No more guesswork.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-diff-viewer.png" alt="the diff viewer" /></p>
<h3>End to End Simulation and Detecting Failures</h3>
<p>This is the most critical part. Streams doesn't just simulate the processor; it simulates the entire indexing process. If you try to map a non-timestamp string (like the <code>message</code> field) directly to the <code>@timestamp</code> field, the simulation will show a failure. It catches the problem before you save, and before it can create a mapping conflict in your cluster. This safety net is what lets you move fast.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/end-to-end-simulation.png" alt="end to end simulation" /></p>
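<p>If you've used Elasticsearch's simulate API, this will feel familiar: a request body for <code>POST _ingest/pipeline/_simulate</code> lets you dry-run processors against sample docs, and Streams layers mapping checks on top. A minimal sketch (the pattern and sample doc are hypothetical):</p>

```json
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{LOGLEVEL:log.level} %{GREEDYDATA:detail}"]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "ERROR disk full on /var/data" } }
  ]
}
```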
<h3>Conditional Processing</h3>
<p>What if one data stream contains a large variety of logs? You can't use one Grok pattern for all.</p>
<p>Streams has conditional processing built for this. The UI lets you build &quot;if-then&quot; logic and shows you exactly what percentage of your sample documents are skipped or processed by each condition. Right now, the UI supports up to three levels of nesting, and we plan to add a YAML mode in the future for more complex logic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/conditional-processing.png" alt="conditional processing" /></p>
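<p>Under the hood, each branch becomes an <code>if</code> condition on an ingest processor. A hedged sketch of what one branch might compile to (the service name and pattern are hypothetical):</p>

```json
{
  "grok": {
    "if": "ctx.service?.name == 'nginx'",
    "field": "message",
    "patterns": ["%{IP:client.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.path}"]
  }
}
```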
<h3>Changing Your Test Data (Document Samples)</h3>
<p>A random 100-document sample isn't always helpful, especially in a massive, mixed stream from Kubernetes or a central message broker.</p>
<p>You can change the document sample to test your changes on a more specific set of logs. You can either provide documents manually (copy-paste) or, more powerfully, specify a KQL query to fetch 100 matching documents. For example, <code>service.name : &quot;data_processing&quot;</code> fetches 100 sample documents from that service to use in the simulation. Now you can build and test a processor on the exact logs you care about.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/changing-your-test-data.png" alt="changing your test data" /></p>
<h2>How Processing Works Under the Hood</h2>
<p>There’s no magic. In simple terms, it's a UI that makes our existing best practices more accessible. As of version 9.2, Streams runs exclusively on <strong>Elasticsearch ingest pipelines</strong>. (We plan to offer more than that; stay tuned.)</p>
<p>When you save your changes, Streams appends processing steps by:</p>
<ol>
<li>Locating the most specific <code>@custom</code> ingest pipeline for your data stream.</li>
<li>Adding a single <code>pipeline</code> processor to it.</li>
<li>This processor calls a new, dedicated pipeline named <code>&lt;stream-name&gt;@stream.processing</code>, which contains the Grok, conditional, and other logic you built in the UI.</li>
</ol>
<p>You can even see this for yourself by going to the <strong>Advanced tab</strong> in your Stream and clicking the pipeline name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/how-processing-works.png" alt="how processing works" /></p>
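<p>Put together, for a hypothetical stream named <code>logs.nginx</code>, the <code>@custom</code> pipeline ends up containing little more than a delegating processor (a sketch following the naming convention above, not verbatim output):</p>

```json
{
  "processors": [
    {
      "pipeline": {
        "name": "logs.nginx@stream.processing"
      }
    }
  ]
}
```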
<h2>Processing in OTel, Elastic Agent, Logstash, or Streams? What to Use?</h2>
<p>This is a fair question. You have lots of ways to parse data.</p>
<ul>
<li><strong>Best: Structured logging at the source</strong>. If you control the app writing the logs, make it log structured JSON in the format of your choice. This remains the best way to do logging, but it isn't always possible.</li>
<li><strong>Good, when available: Elastic Agent + Integrations</strong>. If an existing integration collects and parses your data, Streams won't do it any better. Use it!</li>
<li><strong>Good for tech-savvy users: OTel at the edge</strong>. Use OTel (with OTTL) to set yourself up for the future.</li>
<li><strong>The easy catch-all: Streams</strong>. Especially when an integration primarily just ships the data into Elastic, Streams can add a lot of value. The Kubernetes Logs integration is a good example: the integration ships the data, but most logs aren't parsed automatically because they can come from a wide variety of pods.</li>
</ul>
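<p>As a minimal sketch of the &quot;structured at the source&quot; option, here is what emitting JSON logs could look like in Python (the formatter and field names are our own illustration; adapt them to your schema):</p>

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, so nothing downstream
    ever needs a Grok pattern to recover log.level or the message."""
    def format(self, record):
        return json.dumps({
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "log.level": record.levelname,
            "message": record.getMessage(),
        })

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user logged in")  # emits a single structured JSON line
```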
<p>Think of Streams as your universal &quot;catch-all&quot; for everything that arrives unstructured. It's perfect for data from sources you don't control, for legacy systems, or for when you just need to fix a parsing error right now without a full application redeploy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/processing-in-otel.png" alt="processing in otel" /></p>
<p>A quick note on schemas: Streams can handle both ECS (Elastic Common Schema) and OTel (OpenTelemetry) data. By default, it assumes your target schema is ECS. However, Streams will automatically detect and adapt to the OTel schema if your Stream's name contains the word “otel”, or if you're using the special Logs Stream (currently in tech preview). You get the same visual parsing workflow regardless of the schema.</p>
<p>All processing changes can also be made using a Kibana API. Note that the API is still in tech preview while we mature some of the functionality.</p>
<h2>Summary</h2>
<p>Parsing logs shouldn't be a tedious, high-stakes, backend-only task. Streams moves the entire workflow from a complex, error-prone approach to an interactive UI right where you already are. You can now build, test, and deploy parsing logic with instant, safe feedback. This means you can stop fighting your logs and finally start using them. The next time you see a messy log, don't ignore it. Click &quot;Parse in Streams&quot; and fix it in 60 seconds.</p>
<p>Check out more log analytics articles in <a href="https://www.elastic.co/observability-labs/blog/tag/log-analytics">Elastic Observability Labs</a>.</p>
<p>Try out Elastic. Sign up for a trial at <a href="https://cloud.elastic.co/registration?fromURI=%2Fhome">Elastic Cloud</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/cover.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Synthetics Projects: A Git-friendly way to manage your synthetics monitors in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/synthetics-git-ops-observability</link>
            <guid isPermaLink="false">synthetics-git-ops-observability</guid>
            <pubDate>Thu, 23 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability can easily integrate into your DevOps git flow when managing applications with synthetics. Our new Synthetics Projects will enable you to develop and manage synthetics monitor configurations written in YAML with git.]]></description>
<content:encoded><![CDATA[<p>Elastic has an entirely new Heartbeat/Synthetics workflow that is superior to the current one. If you’re a current user of the Elastic Uptime app, read on to learn about the improved workflow you can use today and should eventually migrate toward.</p>
<p>We’ve recently released a beta feature that provides a Git-friendly, IaC-oriented workflow. You can now push Heartbeat monitors with the same ease with which you push code changes in Git or config changes in Terraform. The features discussed in this blog are all currently in beta, and we urge users trying these features out to upgrade to the latest stack version first. When these features become GA, this new workflow will be the preferred way of configuring monitors in the Elastic Stack. If you’re starting a new project, you may want to consider setting it up this way instead of via our more classic configuration.</p>
<p>Today, using Heartbeat is simple. You just need to write a little YAML, and monitoring data shows up in Elasticsearch, visible in the Uptime UI. While the UI is indeed simple, there’s some hidden complexity there that we’ve improved with a new UI (the Synthetics app) and augmented with an even more automation-friendly CLI workflow via our new Projects feature, discussed below.</p>
<p>How do you manage your configs written in YAML? Many of our users manage YAML in Git and use tooling such as Ansible, Helm, or similar to manage their infrastructure as code (IaC). Like many other organizations, Elastic heavily utilizes IaC in all parts of our operations, so it’s only natural that we developed a capability to provide similar support for the current Heartbeat capability and the upcoming synthetics monitoring capabilities.</p>
<h2>Projects: A new way to organize and distribute configs</h2>
<p>Let’s dive right into what we’re calling “Synthetics Projects” and how they differ from traditional Heartbeat config files. To use this feature, you would start by <a href="https://www.elastic.co/guide/en/observability/current/synthetics-get-started-project.html">creating a project</a> in a Git repo containing your configs. At a high level, setting up a project requires performing the following tasks:</p>
<ol>
<li>Run <code>npx @elastic/synthetics init</code> to create a project skeleton in a directory. See more details on the <a href="https://www.npmjs.com/package/@elastic/synthetics">npmjs.com</a> site.</li>
<li>Run <code>git init</code> and <code>git push</code> on the generated directory to version it as a Git repository.</li>
<li>Add your lightweight YAML files and browser JavaScript/TypeScript files to the journeys folder.</li>
<li>Test that it works by running the <code>npx @elastic/synthetics push</code> command to sync your project to your Elastic Stack.</li>
<li>Configure a CI/CD pipeline to test pull requests to your Git repo and to execute <code>npx @elastic/synthetics push</code> on merges to the main branch.</li>
</ol>
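<p>For reference, a lightweight monitor in that journeys folder is plain YAML. A minimal, hypothetical example (the id, URL, and schedule are placeholders):</p>

```yaml
# journeys/lightweight.yml — a hypothetical HTTP monitor
heartbeat.monitors:
  - type: http
    id: example-homepage
    name: Example homepage
    schedule: '@every 1m'
    urls:
      - https://example.com
```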
<p>So, once configured, adding, removing, and editing monitors involves:</p>
<ol>
<li>Editing a monitor’s config locally: YAML for lightweight monitors, or JavaScript/TypeScript for browser-based ones</li>
<li>Testing your local configs with <code>npx @elastic/synthetics journeys</code></li>
<li>Creating a new PR to your main branch via a Git push</li>
<li>Waiting for your CI server to perform the same validation and for someone else on your team to review your PR</li>
<li>Merging your result to the main branch</li>
<li>Waiting for your CI server to push the changes to your Elastic Stack</li>
</ol>
<p>We’ve depicted the flow of data in the diagram below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-elastic-flow-of-data-diagram.png" alt="" /></p>
<p>This is, in fact, the way many of our users work today, with other software taking the place of <code>npx @elastic/synthetics push</code>, as mentioned earlier. Indeed, in the future, we will most likely look into building a Terraform provider, though that isn’t something we’re actively working on now.</p>
<h2>Just have a few monitors? Use the GUI!</h2>
<p>The above approach is great for sophisticated users with larger numbers of configurations, but if you just want to monitor a few URLs, it’s overkill. If that sounds like you, consider the new Monitor Management UI in the Uptime app! It works in the exact same way, saving configs to your Elastic Stack, but with no need for Git, a project, or all that other infrastructure. Simply log in, fill out the form pictured below, and hit save. If you want to set up a private location, that is still done in the same way via Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-elastic-add-monitor.png" alt="" /></p>
<h2>What about my existing Fleet monitors?</h2>
<p>A small subset of users have monitors configured today using the Synthetics Fleet integration. If that describes you, you’ll want to move to either the GUI-based approach or the project-based approach, as those methods supersede direct usage of the Fleet integration, which will eventually be restricted to use via the above-described methods.</p>
<p>The Fleet approach is inferior in a few ways:</p>
<ol>
<li>It can only configure monitors for a single location.</li>
<li>It creates a different UX for monitors configured on the service versus private locations.</li>
<li>Its integration with the Uptime UI is less fluid.</li>
</ol>
<p>It’s rare for us to deprecate beta features, but in this case we had a clearly superior alternative. Maintaining both would have created a more confusing and unwieldy product. We don’t yet have an exact date for removing support for these monitors, but you can track this via <a href="https://github.com/elastic/kibana/issues/137508">this GitHub issue</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/synthetics-git-ops-observability/blog-charts-packages.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elastic Universal Profiling agent, a continuous profiling solution, is now open source]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-universal-profiling-agent-open-source</link>
            <guid isPermaLink="false">elastic-universal-profiling-agent-open-source</guid>
            <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[At Elastic, open source isn't just philosophy, it's our DNA. Dive into the future with our open-sourced Universal Profiling agent, revolutionizing software efficiency and sustainability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Universal Profiling™ agent is now open source! The industry’s most advanced fleetwide continuous profiling solution empowers users to identify performance bottlenecks, reduce cloud spend, and minimize their carbon footprint. This post explores the history of the agent, its move to open source, and its future integration with OpenTelemetry.</p>
<h2>Elastic Universal Profiling™ Agent goes open source under Apache 2</h2>
<p>At Elastic, open source is more than just a philosophy — it's our DNA. We believe the benefits of whole-system continuous profiling extend far beyond performance optimization. It's a win for businesses and the planet alike. For instance, since launching Elastic Universal Profiling in general availability (GA), we've observed a wide variety of use cases from customers.</p>
<p>These range from customers relying fully on Universal Profiling's <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-differential-views-intro">differential flame graphs and topN functions</a> for insights during release management to utilizing AI assistants for quickly optimizing expensive functions. This includes using profiling data to identify the optimal energy-efficient cloud region to run certain workloads. Additionally, customers are using insights that Universal Profiling provides to build evidence to challenge cloud provider bills. As it turns out, cloud providers' in-VM agents can consume a significant portion of the CPU time, which customers are billed for.</p>
<p>In a move that will empower the community to take advantage of continuous profiling's benefits, <strong>we're thrilled to announce that the Elastic Universal Profiling agent</strong> , a pioneering eBPF-based continuous profiling agent, <strong>is now open source under the Apache 2 license!</strong></p>
<p>This move democratizes <strong>hyper-scaler efficiency for everyone</strong> , opening exciting new possibilities for the future of continuous profiling, as well as its role in observability and <strong>OpenTelemetry</strong>.</p>
<h2>Implementation of the OpenTelemetry (OTel) Profiling protocol</h2>
<p>Our commitment to open source goes beyond just the agent itself. We recently <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">announced our intent to donate</a> the agent to OpenTelemetry and have further solidified this goal by implementing the experimental <a href="https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md">OTel Profiling data model</a>. This allows the open-sourced eBPF-based continuous profiling agent to communicate seamlessly with OpenTelemetry backends.</p>
<p>But that's not all! We've also launched an innovative feature that <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">correlates profiling data with OpenTelemetry distributed traces</a>. This powerful capability offers a deeper level of insight into application performance, enabling the identification of bottlenecks with greater precision. Upon donating the Profiling agent to OTel, Elastic will also contribute critical components that enable distributed trace correlation within the <a href="https://github.com/elastic/elastic-otel-java">Elastic distribution of the OTel Java agent</a> to the upstream OTel Java SDK. This underscores Elastic Observability's commitment to both open source and the support of open standards like OpenTelemetry while pushing the boundaries of what is possible in observability.</p>
<h2>What does this mean for Elastic Universal Profiling customers?</h2>
<p>We'd like to express our <strong>immense gratitude to all our customers</strong> who have been part of this journey, from the early stages of private beta to GA. Your feedback has been invaluable in shaping Universal Profiling into the powerful product it is today.</p>
<p>By open-sourcing the Universal Profiling agent and contributing it to OpenTelemetry, we're fostering a win-win situation for both you and the broader community. This move opens doors for innovation and collaboration, ultimately leading to a more robust and versatile whole-system continuous profiling solution for everyone.</p>
<p>Furthermore, we're actively working on exciting novel ways to integrate Universal Profiling seamlessly within Elastic Observability. Expect further announcements soon, outlining how you can unlock even greater value from your profiling data within a unified observability experience in a way that has never been done before.</p>
<p>The open-sourced agent is using the recently released (experimental) OTel Profiling <a href="https://github.com/open-telemetry/opentelemetry-proto/pull/534">signal</a>. As a precaution, we recommend not using it in production environments.</p>
<p>Please continue using the official Elastic distribution of the Universal Profiling agent until the agent is formally accepted by OTel and the protocol reaches a stable phase. There's no need to take any action at this time, and we will ensure to have a smooth transition plan in place for you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image1.png" alt="1 - Elastic Universal Profiling" /></p>
<h2>What does this mean for the OpenTelemetry community?</h2>
<p>OpenTelemetry is adopting continuous profiling as a key signal. By open-sourcing the eBPF-based profiling agent and working towards donating it to OTel, Elastic is making it possible to accelerate the standardization of continuous profiling within OpenTelemetry. This move has a massive impact on the observability community, empowering everyone to continuously profile their systems with a standardized protocol.</p>
<p>This is particularly timely as <a href="https://www.bbc.co.uk/news/technology-32335003">Moore's Law</a> slows down and cloud computing takes hold, making computational efficiency critical for businesses.</p>
<p>Here's how whole-system continuous profiling benefits you:</p>
<ul>
<li>
<p><strong>Maximize gross margins:</strong> By reducing the computational resources needed to run applications, businesses can optimize their cloud spend and improve profitability. Whole-system continuous profiling is one way of identifying the most expensive applications (down to the lines of code) across diverse environments that may span multiple cloud providers. This principle aligns with the familiar adage, <em>&quot;a penny saved is a penny earned.&quot;</em> In the cloud context, every CPU cycle saved translates to money saved.</p>
</li>
<li>
<p><strong>Minimize environmental impact:</strong> Energy consumption associated with computing is a growing concern (source: <a href="https://energy.mit.edu/news/energy-efficient-computing/">MIT Energy Initiative</a>). More efficient code translates to lower energy consumption, contributing to a reduction in carbon footprint.</p>
</li>
<li>
<p><strong>Accelerate engineering workflows:</strong> Continuous profiling provides detailed insights to help debug complex issues faster, guide development, and improve overall code quality.</p>
</li>
</ul>
<p>This is where Elastic Universal Profiling comes in — designed to help organizations run efficient services by minimizing computational wastage. To this end, it measures code efficiency in three dimensions: <strong>CPU utilization</strong>, <strong>CO<sub>2</sub></strong>, and <strong>cloud cost</strong>.</p>
<p>Elastic's journey with continuous profiling began by joining forces with <a href="https://www.elastic.co/about/press/elastic-and-optimyze-join-forces-to-deliver-continuous-profiling-of-infrastructure-applications-and-services">optimyze.cloud</a> — this became the foundation for <a href="https://www.elastic.co/observability/universal-profiling">Elastic Universal Profiling</a>. We are excited to see this product evolve into its next growth phase in the open-source world.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image2.png" alt="2 - car manufacturers" /></p>
<h2>Ready to give it a spin?</h2>
<p>As Elastic Universal Profiling transitions into this new open source era, the potential for transformative impact on performance optimization, cost efficiency, and environmental sustainability is immense. Elastic's approach — balancing innovation with responsibility — paves the way for a future where technology not only powers our world but does so in a way that is sustainable and accessible to all.</p>
<p>Get started with the open source Elastic Universal Profiling agent today! <a href="https://github.com/elastic/otel-profiling-agent/">Download it directly from GitHub</a> and follow the instructions in the repository.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/image3.png" alt="3 - dripping graph and data" /></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-agent-open-source/tree_tunnel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Universal Profiling: Delivering performance improvements and reduced costs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-universal-profiling-performance-improvements-reduced-costs</link>
            <guid isPermaLink="false">elastic-universal-profiling-performance-improvements-reduced-costs</guid>
            <pubDate>Mon, 22 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we’ll cover how a discovery by one of our engineers led to cost savings of thousands of dollars in our QA environment and magnitudes more once we deployed this change to production.]]></description>
            <content:encoded><![CDATA[<p>In today's age of cloud services and SaaS platforms, continuous improvement isn't just a goal — it's a necessity. Here at Elastic, we're always on the lookout for ways to fine-tune our systems, be it our internal tools or the Elastic Cloud service. Our recent investigation in performance optimization within our Elastic Cloud QA environment, guided by <a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Elastic Universal Profiling</a>, is a great example of how we turn data into actionable insights.</p>
<p>In this blog, we’ll cover how a discovery by one of our engineers led to savings of thousands of dollars in our QA environment and magnitudes more once we deployed this change to production.</p>
<h2>Elastic Universal Profiling: Our go-to tool for optimization</h2>
<p>In our suite of solutions for addressing performance challenges, Elastic Universal Profiling is a critical component. As an “always-on” profiler utilizing eBPF, it integrates seamlessly into our infrastructure and systematically collects comprehensive profiling data across the entirety of our system. Because it requires zero code instrumentation or reconfiguration, it’s easy to deploy on any host (including Kubernetes hosts) in our cloud — we’ve deployed it across our environment for Elastic Cloud.</p>
<p>All of our hosts run the profiling agent to collect this data, which gives us detailed insight into the performance of any service that we’re running.</p>
<h3>Spotting the opportunity</h3>
<p>It all started with what seemed like a routine check of our QA environment. One of our engineers was looking through the profiling data. With Universal Profiling in play, this initial discovery was relatively quick. We found a function that was not optimized and had heavy compute costs.</p>
<p>Let’s go through it step-by-step.</p>
<p>In order to spot expensive functions, we can simply view the list of TopN functions, which shows which functions, across all the services we run, use the most CPU.</p>
<p>To sort them by their impact, we sort descending on the “total CPU”:</p>
<ul>
<li>
<p><strong>Self CPU</strong> measures the CPU time that a function directly uses, not including the time spent in functions it calls. This metric helps identify functions that use a lot of CPU power on their own. By improving these functions, we can make them run faster and use less CPU.</p>
</li>
<li>
<p><strong>Total CPU</strong> adds up the CPU time used by the function and any functions it calls. This gives a complete picture of how much CPU a function and its related operations use. If a function has a high &quot;total CPU&quot; usage, it might be because it's calling other functions that use a lot of CPU.</p>
</li>
</ul>
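<p>The self-versus-total split is the same distinction Python's <code>cProfile</code> draws between <code>tottime</code> and <code>cumtime</code>. A small sketch with hypothetical functions makes it concrete:</p>

```python
import cProfile
import pstats

def hot_loop():
    # Does the actual work itself: high "self CPU" (tottime).
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def orchestrator():
    # Does little work directly but calls hot_loop() twice:
    # low "self CPU", high "total CPU" (cumtime).
    return hot_loop() + hot_loop()

profiler = cProfile.Profile()
profiler.enable()
orchestrator()
profiler.disable()

# tottime corresponds to "self CPU", cumtime to "total CPU".
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

<p>Sorting by <code>cumulative</code> surfaces <code>orchestrator</code> near the top even though almost all of its time is really spent inside <code>hot_loop</code> — which is exactly why sorting TopN functions by total CPU points you at the right call chains.</p>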
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/1.png" alt="1 - universal profiling" /></p>
<p>When our engineer reviewed the TopN functions list, one function called &quot;... <strong>inflateCompressedFrame</strong> …&quot; caught their attention. This is a common scenario: certain types of functions frequently become optimization targets. Here’s a simplified guide on what to look for and possible improvements:</p>
<ul>
<li>
<p><strong>Compression/decompression:</strong> Is there a more efficient algorithm? For example, switching from zlib to zlib-ng might offer better performance.</p>
</li>
<li>
<p><strong>Cryptographic hashing algorithms:</strong> Ensure the fastest algorithm is in use. Sometimes, a quicker non-cryptographic algorithm could be suitable, depending on the security requirements.</p>
</li>
<li>
<p><strong>Non-cryptographic hashing algorithms:</strong> Check if you're using the quickest option. xxh3, for instance, is often faster than other hashing algorithms.</p>
</li>
<li>
<p><strong>Garbage collection:</strong> Minimize heap allocations, especially in frequently used paths. Opt for data structures that don't rely on garbage collection.</p>
</li>
<li>
<p><strong>Heap memory allocations:</strong> These are typically resource-intensive. Consider alternatives like using jemalloc or mimalloc instead of the standard libc malloc() to reduce their impact.</p>
</li>
<li>
<p><strong>Page faults:</strong> Keep an eye out for &quot;exc_page_fault&quot; in your TopN Functions or flamegraph. They indicate areas where memory access patterns could be optimized.</p>
</li>
<li>
<p><strong>Excessive CPU usage by kernel functions:</strong> This may indicate too many system calls. Using larger buffers for read/write operations can reduce the number of syscalls.</p>
</li>
<li>
<p><strong>Serialization/deserialization:</strong> Processes like JSON encoding or decoding can often be accelerated by switching to a faster JSON library.</p>
</li>
</ul>
<p>Identifying these areas can help in pinpointing where performance can be notably improved.</p>
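<p>To make the syscall point concrete, here is a small illustrative Python sketch (our own example, not part of the original investigation) that counts how many read() calls are needed to consume a 1 MB file with a small versus a large buffer:</p>

```python
import os
import tempfile

def count_reads(path, bufsize):
    """Count the number of read() syscalls needed to consume the file."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            calls += 1
            if not os.read(fd, bufsize):
                break  # an empty read means EOF
    finally:
        os.close(fd)
    return calls

# Write 1 MB of data to a temporary file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1_000_000)
    path = f.name

small_buf = count_reads(path, 4 * 1024)    # 4 KiB buffer
large_buf = count_reads(path, 256 * 1024)  # 256 KiB buffer
os.unlink(path)
print(small_buf, large_buf)
```

<p>On a regular file, the 256 KiB buffer needs roughly 50x fewer read() calls than the 4 KiB buffer, which is exactly the kind of syscall reduction the checklist refers to.</p>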
<p>Clicking on the function from the TopN view shows it in the flamegraph. Note that the flamegraph is showing the samples from the full cloud QA infrastructure. In this view, we can tell that this function alone was accounting for &gt;US$6,000 annualized in this part of our QA environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/2.png" alt="2 - universal profiling flamegraph" /></p>
<p>After filtering for the thread, it became clearer what the function was doing. The following image shows a flamegraph of this thread across all of the hosts running in the QA environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/3.png" alt="3 - flamegraph shows hosts running in QA environment " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/4.png" alt="4 - hosts running in QA environment" /></p>
<p>Instead of looking at the thread across all hosts, we can also look at a flamegraph for just one specific host.</p>
<p>If we look at this one host at a time, we can see that the impact is even more severe. Keep in mind that the 17% from before was for the full infrastructure. Some hosts may not even be running this service and therefore bring down the average.</p>
<p>Filtering things down to a single host that has the service running, we can tell that this host is actually spending close to 70% of its CPU cycles on running this function.</p>
<p>The dollar cost here just for this one host would put the function at around US$600 per year.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/5.png" alt="5 - filtering" /></p>
<h2>Understanding the performance problem</h2>
<p>After identifying a potentially resource-intensive function, our next step involved collaborating with our Engineering teams to understand the function and work on a potential fix. Here's a straightforward breakdown of our approach:</p>
<ul>
<li><strong>Understanding the function:</strong> We began by analyzing what the function should do. It utilizes gzip for decompression. This insight led us to briefly consider the strategies mentioned earlier for reducing CPU usage, such as switching to a more efficient compression library like zlib-ng or moving to zstd compression.</li>
<li><strong>Evaluating the current implementation:</strong> The function currently relies on JDK's gzip decompression, which is expected to use native libraries under the hood. Our usual preference is Java or Ruby libraries when available because they simplify deployment. Opting for a native library directly would require us to manage different native versions for each OS and CPU we support, complicating our deployment process.</li>
<li><strong>Detailed analysis using flamegraph:</strong> A closer examination of the flamegraph revealed that the system encounters page faults and spends significant CPU cycles handling these.</li>
</ul>
<p><strong>Let’s start with understanding the Flamegraph:</strong></p>
<p>The last few non-jdk.* JVM frames (in green) show the allocation of a direct memory ByteBuffer started by Netty's DirectArena.newUnpooledChunk. Direct memory allocations are costly operations that should typically be avoided on an application's critical path.</p>
<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Elastic AI Assistant for Observability</a> is also useful in understanding and optimizing parts of the flamegraph. Especially for users new to Universal Profiling, it can add lots of context to the collected data and give the user a better understanding of them and provide potential solutions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/6.png" alt="6 - Detailed analysis using flamegraph" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/7.png" alt="7 - understanding flamegraph" /></p>
<p><strong>Netty's memory allocation</strong></p>
<p>Netty, a popular asynchronous event-driven network application framework, uses the maxOrder setting to determine the size of memory chunks allocated for managing objects within its applications. The formula for calculating the chunk size is chunkSize = pageSize &lt;&lt; maxOrder. The default maxOrder value of either 9 or 11 results in the default memory chunk size being 4MB or 16MB, respectively, assuming a page size of 8KB.</p>
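<p>The chunk size formula is easy to sanity-check. The following Python sketch (illustrative only) reproduces the two default chunk sizes mentioned above:</p>

```python
def chunk_size(page_size: int, max_order: int) -> int:
    """Netty's pooled-arena chunk size: chunkSize = pageSize << maxOrder."""
    return page_size << max_order

PAGE = 8 * 1024  # 8 KB page size, as assumed in the text

print(chunk_size(PAGE, 9))   # 4194304 bytes  (4 MB)
print(chunk_size(PAGE, 11))  # 16777216 bytes (16 MB)
```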
<p><strong>Impact on memory allocation</strong></p>
<p>Netty employs a PooledAllocator for efficient memory management, which allocates memory chunks in a pool of direct memory at startup. This allocator optimizes memory usage by reusing memory chunks for objects smaller than the defined chunk size. Any object that exceeds this threshold must be allocated outside of the PooledAllocator.</p>
<p>Allocating and releasing memory outside of this pooled context incurs a higher performance cost for several reasons:</p>
<ul>
<li><strong>Increased allocation overhead:</strong> Objects larger than the chunk size require individual memory allocation requests. These allocations are more time-consuming and resource-intensive compared to the fast, pooled allocation mechanism for smaller objects.</li>
<li><strong>Fragmentation and garbage collection (GC) pressure:</strong> Allocating larger objects outside the pool can lead to increased memory fragmentation. Furthermore, if these objects are allocated on the heap, it can increase GC pressure, leading to potential pauses and reduced application performance.</li>
<li><strong>Netty and the Beats/Agent input:</strong> Logstash's Beats and Elastic Agent inputs use Netty to receive and send data. During processing of a received data batch, decompressing the data frame requires creating a buffer large enough to store the uncompressed events. If this batch is larger than the chunk size, an unpooled chunk is needed, causing a direct memory allocation that slows performance. The universal profiler allowed us to confirm that this was the case from the DirectArena.newUnpooledChunk calls in the flamegraph.</li>
</ul>
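<p>To make the threshold behavior concrete, here is a hypothetical Python sketch (a deliberate simplification of Netty's allocator, not its actual code) of the pooled-versus-unpooled decision for a decompressed batch buffer:</p>

```python
PAGE_SIZE = 8 * 1024  # 8 KB page size, as assumed in the text

def allocation_path(buffer_bytes: int, max_order: int) -> str:
    """Toy model: requests up to the chunk size are served from the pooled
    arena; anything larger triggers a one-off unpooled allocation."""
    chunk = PAGE_SIZE << max_order
    return "pooled" if buffer_bytes <= chunk else "unpooled"

# A hypothetical 10 MB decompressed batch buffer:
batch = 10 * 1024 * 1024
print(allocation_path(batch, 9))   # chunk is 4 MB  -> "unpooled"
print(allocation_path(batch, 11))  # chunk is 16 MB -> "pooled"
```

<p>Under a 4 MB chunk size, batches like this fall onto the slow unpooled path; at 16 MB they fit in the pool again.</p>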
<h2>Fixing the performance problem in our environments</h2>
<p>We decided to implement a quick workaround to test our hypothesis. Apart from having to adjust the jvm.options file once, this approach does not have any major downsides.</p>
<p>The immediate workaround involves manually adjusting the maxOrder setting back to its previous value. This can be achieved by adding a specific flag to the config/jvm.options file in Logstash:</p>
<pre><code>-Dio.netty.allocator.maxOrder=11
</code></pre>
<p>This adjustment will revert the default chunk size to 16MB (chunkSize = pageSize &lt;&lt; maxOrder, or 16MB = 8KB &lt;&lt; 11), which aligns with the previous behavior of Netty, thereby reducing the overhead associated with allocating and releasing larger objects outside of the PooledAllocator.</p>
<p>After rolling out this change to some of our hosts in the QA environment, the impact was immediately visible in the profiling data.</p>
<p><strong>Single host:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/8.png" alt="8 - single host" /></p>
<p><strong>Multiple hosts:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/10.png" alt="9 - multiple hosts" /></p>
<p>We can also use the differential flamegraph view to see the impact.</p>
<p>For this specific thread, we’re comparing one day of data from early January to one day of data from early February across a subset of hosts. Both the overall performance improvements as well as the CO<sub>2</sub> and cost savings are dramatic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/11.png" alt="10. -cost savings" /></p>
<p>This same comparison can also be done for a single host. In this view, we’re comparing one host in early January to that same host in early February. The actual CPU usage on that host decreased by 50%, saving us approximately US$900 per year per host.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/12.png" alt="11 - comparisons" /></p>
<h2>Fixing the issue in Logstash</h2>
<p>In addition to the temporary workaround, we are working on shipping a proper fix for this behavior in Logstash. You can find more details in this <a href="https://github.com/elastic/logstash/issues/15765">issue</a>, but the potential candidates are:</p>
<ul>
<li><strong>Global default adjustment:</strong> One approach is to permanently set the maxOrder back to 11 for all instances by including this change in the jvm.options file. This global change would ensure that all Logstash instances use the larger default chunk size, reducing the need for allocations outside the pooled allocator.</li>
<li><strong>Custom allocator configuration:</strong> For more targeted interventions, we could customize the allocator settings specifically within the TCP, Beats, and HTTP inputs of Logstash. This would involve configuring the maxOrder value at initialization for these inputs, providing a tailored solution that addresses the performance issues in the most affected areas of data ingestion.</li>
<li><strong>Optimizing major allocation sites:</strong> Another solution focuses on altering the behavior of significant allocation sites within Logstash. For instance, modifying the frame decompression process in the Beats input to avoid using direct memory and instead default to heap memory could significantly reduce the performance impact. This approach would circumvent the limitations imposed by the reduced default chunk size, minimizing the reliance on large direct memory allocations.</li>
</ul>
<h2>Cost savings and performance enhancements</h2>
<p>Following the new configuration change for Logstash instances on January 23, the platform's daily function cost dramatically decreased to US$350 from an initial &gt;US$6,000, marking a significant 20x reduction. This change shows the potential for substantial cost savings through technical optimizations. However, it's important to note that these figures represent potential savings rather than direct cost reductions.</p>
<p>Just because a host uses fewer CPU resources doesn’t necessarily mean that we are also saving money. To actually benefit from this, the final step is to either reduce the number of VMs we have running or scale down the CPU resources of each one to match the new resource requirements.</p>
<p>This experience with Elastic Universal Profiling highlights how crucial detailed, real-time data analysis is in identifying areas for optimization that lead to significant performance enhancements and cost savings. By implementing targeted changes based on profiling insights, we've dramatically reduced CPU usage and operational costs in our QA environment with promising implications for broader production deployment.</p>
<p>Our findings demonstrate the benefits of an always-on, profiling-driven approach in cloud environments, providing a good foundation for future optimizations. As we scale these improvements, the potential for further cost savings and efficiency gains continues to grow.</p>
<p>All of this is also possible in your environments. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-universal-profiling-performance-improvements-reduced-costs/money.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic's collaboration with OpenTelemetry on improving the filelog receiver]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver</link>
            <guid isPermaLink="false">elastics-collaboration-opentelemetry-filelog-receiver</guid>
            <pubDate>Mon, 17 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is committed to helping OpenTelemetry advance its logging capabilities. Learn about our collaboration with the OpenTelemetry community on improving the capabilities and quality aspects of the OpenTelemetry Collector's filelog receiver.]]></description>
            <content:encoded><![CDATA[<p>As the newest generally available signal in OpenTelemetry (OTel), logging currently lags behind tracing and metrics in terms of feature scope and maturity.
At Elastic, we bring years of extensive experience with logging use cases and the challenges they present,
and we are committed to advancing OpenTelemetry's logging capabilities.</p>
<p>Over the past few months, we have taken a close look at the capabilities of the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.102.0/receiver/filelogreceiver/README.md">filelog receiver</a>
in the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, leveraging our expertise as the maintainers of <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> to help refine and expand its potential.
Our goal is to contribute meaningfully to the evolution of OpenTelemetry's logging features, ensuring they meet the high standards required for robust observability.</p>
<p>Specifically, we focused on verifying that the receiver covers the cases and aspects that have been pain points for us in the past with Filebeat
— such as failover handling, self-telemetry, test coverage, documentation, and usability.
Based on our exploration, we started conversations with the OTel project's maintainers, sharing observations and suggestions drawn from our experience.
Moreover, we've started putting up PRs to add documentation, make enhancements, improve tests, fix bugs, and even implement completely new features.</p>
<p>In this blog post we'll provide a sneak preview of the work that we've done so far in collaboration with the OpenTelemetry community and what's coming next as we continue to explore ways to improve the OpenTelemetry Collector for log collection.</p>
<h2>Enhancing the filelog receiver's telemetry</h2>
<p>Observability tools are software components like any other and, thus, need to be monitored as any other software to be able to debug problems and tune relevant settings.
In particular, users of the filelog receiver will want to know how it's performing.
It's important that the filelog receiver emits sufficient telemetry data for common troubleshooting and optimization use cases.
This includes sufficient logging and observable metrics providing insights into the filelog receiver's internal state.</p>
<p>While the filelog receiver already provided a good set of self-telemetry data, we identified some areas of improvement.
In particular, we contributed functionality to emit self-telemetry <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33237">logs on crucial events</a> like when log files are discovered, moved or truncated.
Another contribution adds <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31544">observable metrics about the filelog receiver’s internal state</a>, such as how many files are open and being harvested.
You can find more information on the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31256">respective tracking issue</a>.</p>
<h2>Improving the Kubernetes container logs parsing</h2>
<p>The filelog receiver has been able to parse Kubernetes container logs for some time now.
However, properly parsing logs from Kubernetes Pods required a fair bit of configuration to deal with different runtime formats and to extract important meta information, such as <code>k8s.pod.name</code>, <code>k8s.container.name</code>, etc.
With this in mind, we proposed abstracting this complex set of configuration options into a simpler, runtime-aware container parser and contributed this new feature to the filelog receiver.
With that new feature, setting up log collection for Kubernetes is dramatically easier: roughly eight lines of configuration versus about 80 lines before.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/container-parser-config-example.png" alt="1 - Usability improvement for parsing Kubernetes container logs" /></p>
<p>You can learn more about the details of the new <a href="https://opentelemetry.io/blog/2024/otel-collector-container-log-parser">container logs parser in the corresponding OpenTelemetry blog post</a>.</p>
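<p>For reference, a minimal filelog receiver configuration using the new container parser looks roughly like the following. This is a sketch based on the linked blog post; adjust the log path to your environment:</p>

```yaml
receivers:
  filelog:
    include_file_path: true        # the parser derives k8s metadata from the file path
    include:
      - /var/log/pods/*/*/*.log
    operators:
      - type: container            # replaces ~80 lines of per-runtime parsing config
```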
<h3>Evaluating test coverage</h3>
<p>Logs collection from files can run into different unexpected scenarios such as restarts, overload and error scenarios.
To ensure reliable and consistent collection of logs, it's important to ensure tests cover these kind of scenarios.
Based on our experience with testing Filebeat, we evaluated the existing filelog receiver tests with respect to those scenarios.
While most of the use cases and scenarios were already well tested, we identified a few scenarios where coverage could be improved to ensure reliable log collection.<br />
At the time of writing, we were working on contributing additional tests to address the identified coverage gaps.
You can learn more about it in <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32001">this GitHub issue</a>.</p>
<h3>Persistence evaluation</h3>
<p>Another important aspect of log collection that we often hear about from Elastic's logging users is failover handling and the delivery guarantees for logs.
Some logging use cases, for example audit logging, have strict delivery guarantee requirements.
Hence, it's important that the filelog receiver provides functionality to reliably handle situations, such as temporary unavailability of the logging backend or unexpected restarts of the OTel Collector.</p>
<p>Overall, the filelog receiver already has corresponding functionality to deal with such situations.
However, user documentation on how to set up reliable log collection, with tangible examples, was an area with potential for improvement.</p>
<p>In this regard, beyond verifying the persistence and offset tracking capabilities we worked on improving respective documentation
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31886">1</a> <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30914">2</a>
and also are collaborating on a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31074">community reported issue</a> to ensure delivery guarantees for logs.</p>
<h3>Helping users help themselves</h3>
<p>Elastic has a long and varied history of supporting customers who use our products for log ingestion.
Drawing from this experience, we've proposed a couple of documentation improvements to the OpenTelemetry Collector to help logging users get out of some tricky situations.</p>
<p><strong>Documenting the structure of the tracking file</strong></p>
<p>For every log file the filelog receiver ingests, it needs to track how far into the file it has already read, so it knows where to start reading from when new contents are added to the file.
By default, the filelog receiver doesn't persist this tracking information to disk, but it can be configured to do so.
We felt it would be useful to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32180">document the structure of this tracking file</a>. When ingestion stops unexpectedly,
peeking into this tracking file can often provide clues as to where the problem may lie.</p>
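<p>For completeness, persisting this tracking state is usually done via a storage extension. The following is a minimal sketch using the file_storage extension (the directory path is a placeholder; this is a fragment, not a complete Collector config):</p>

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # must exist and be writable

receivers:
  filelog:
    include:
      - /var/log/app/*.log
    storage: file_storage  # persist read offsets so ingestion resumes after restarts

service:
  extensions: [file_storage]
```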
<p><strong>Challenges with symlink target changes</strong></p>
<p>The filelog receiver periodically refreshes its memory of the files it's supposed to be ingesting.
The interval at which these refreshes happen is controlled by the <code>poll_interval</code> setting.
In certain setups log files being ingested by the filelog receiver are symlinks pointing to actual files.
Moreover, these symlinks can be updated to point to newer files over time.
If the symlink target changes twice before the filelog receiver has had a chance to refresh its memory, it will miss the first change and therefore not ingest the corresponding target file.
We've <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32217">documented this edge case</a>, suggesting that users with such setups set <code>poll_interval</code> to a sufficiently low value.</p>
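<p>In practice this means keeping <code>poll_interval</code> below the fastest expected rotation of the symlink target. A hypothetical example (the path is a placeholder):</p>

```yaml
receivers:
  filelog:
    include:
      - /var/log/service/current.log  # a symlink rotated by the log writer
    poll_interval: 50ms  # poll more often than the symlink can flip twice
```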
<h3>Planning ahead for the receiver's GA </h3>
<p>Last but not least, we have raised the topic of making the filelog receiver a generally available (GA) component.
For users, it's important to be able to rely on the stability of the functionality they use, without having to worry about breaking changes in minor version updates.
In this regard, for the filelog receiver we have kicked off a first plan with the maintainers to mark any issue that is a blocker for stability with a <code>required_for_ga</code>
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aopen+is%3Aissue+label%3Arelease%3Arequired-for-ga+label%3Areceiver/filelog">label</a>.
Once the OpenTelemetry Collector reaches version <code>v1.0.0</code>, we will be able to work towards GA for this receiver as well.</p>
<h2>Conclusion</h2>
<p>Overall, OTel's filelog receiver component is in good shape and provides important functionality for most log collection use cases.
Where there are still minor gaps or room for improvement, we are glad to contribute our expertise and experience from Filebeat use cases.
The above is just the beginning of our effort to help the OpenTelemetry Collector, and its log collection in particular, get closer to a stable version.
Moreover, we are happy to help the filelog receiver maintainers with general maintenance of the component: dealing with community issues and PRs, jointly working on the component's roadmap, and more.</p>
<p>We'd like to thank the OTel Collector group and, in particular, <a href="https://github.com/djaglowski">Daniel Jaglowski</a> for the great and constructive collaboration on the filelog receiver, so far!</p>
<p>Stay tuned to <a href="https://www.elastic.co/observability/opentelemetry">learn more about our future contributions and involvement in OpenTelemetry</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastics-collaboration-opentelemetry-filelog-receiver/otel-filelog-receiver.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to use Elasticsearch and Time Series Data Streams for observability metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/time-series-data-streams-observability-metrics</link>
            <guid isPermaLink="false">time-series-data-streams-observability-metrics</guid>
            <pubDate>Thu, 04 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[With Time Series Data Streams (TSDS), Elasticsearch introduces optimized storage for metrics time series. Check out how we use it for Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>Elasticsearch is used for a wide variety of data types — one of these is metrics. With the introduction of Metricbeat many years ago and later our APM Agents, the metric use case has become more popular. Over the years, Elasticsearch has made many improvements on how to handle things like metrics aggregations and sparse documents. At the same time, <a href="https://www.elastic.co/guide/en/kibana/current/tsvb.html">TSVB visualizations</a> were introduced to make visualizing metrics easier. One concept that was missing that exists for most other metric solutions is the concept of time series with dimensions.</p>
<p>Mid 2021, the Elasticsearch team <a href="https://github.com/elastic/elasticsearch/issues/74660">embarked</a> on making Elasticsearch a much better fit for metrics. The team created <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">Time Series Data Streams (TSDS)</a>, which were released in 8.7 as generally available (GA).</p>
<p>This blog post dives into how TSDS works and how we use it in Elastic Observability, as well as how you can use it for your own metrics.</p>
<h2>A quick introduction to TSDS</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">Time Series Data Streams (TSDS)</a> are built on top of data streams in Elasticsearch that are optimized for time series. To create a data stream for metrics, an additional setting on the data stream is needed. As we are using data streams, first an Index Template has to be created:</p>
<pre><code class="language-json">PUT _index_template/metrics-laptop
{
  &quot;index_patterns&quot;: [
    &quot;metrics-laptop-*&quot;
  ],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 200,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;time_series&quot;
    },
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;host.name&quot;: {
          &quot;type&quot;: &quot;keyword&quot;,
          &quot;time_series_dimension&quot;: true
        },
        &quot;packages.sent&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;time_series_metric&quot;: &quot;counter&quot;
        },
        &quot;memory.usage&quot;: {
          &quot;type&quot;: &quot;double&quot;,
          &quot;time_series_metric&quot;: &quot;gauge&quot;
        }
      }
    }
  }
}
</code></pre>
<p>Let's have a closer look at this template. At the top, we set the index pattern to metrics-laptop-*. Any pattern can be selected, but it is recommended to use the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a> for all your metrics. The next section sets &quot;index.mode&quot;: &quot;time_series&quot;, in combination with &quot;data_stream&quot;: {} to make sure it is a data stream.</p>
<h3>Dimensions</h3>
<p>Each time series data stream needs at least one dimension. In the example above, host.name is set as a dimension field with &quot;time_series_dimension&quot;: true. You can have up to 16 dimensions by default. Not every dimension must show up in each document. The dimensions define the time series. The general rule is to pick fields as dimensions that uniquely identify your time series. Often this is a unique description of the host/container, but for some metrics like disk metrics, the disk id is needed in addition. If you are curious about default recommended dimensions, have a look at this <a href="https://github.com/elastic/ecs/pull/2172">ECS contribution</a> with dimension properties.</p>
<h2>Reduced storage and increased query speed</h2>
<p>At this point, you already have a functioning time series data stream. Setting the index mode to time series automatically turns on <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">synthetic source</a>. By default, Elasticsearch typically duplicates data three times:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS#Row-oriented_systems">row-oriented storage</a> (_source field)</li>
<li><a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS#Column-oriented_systems">column-oriented storage</a> (doc_values: true for aggregations)</li>
<li>indices (index: true for filtering and search)</li>
</ul>
<p>With synthetic source, the _source field is not persisted; instead, it is reconstructed from the doc values. Especially in the metrics use case, there is little benefit to keeping the source.</p>
<p>Not storing it means a significant reduction in storage. Time series data streams sort the data based on the dimensions and the time stamp. This means data that is usually queried together is stored together, which speeds up query times. It also means that the data points for a single time series are stored alongside each other on disk. This enables further compression of the data as the rate at which a counter increases is often relatively constant.</p>
<h2>Metric types</h2>
<p>But to benefit from all the advantages of TSDS, the field properties of the metrics fields must be extended with the <code>time_series_metric: {type}</code>. Several <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html#time-series-metric">types are supported</a> — as an example, gauge and counter were used above. Giving Elasticsearch knowledge about the metric type allows Elasticsearch to offer more optimized queries for the different types and reduce storage usage further.</p>
<p>When you create your own templates for data streams under the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>, it is important that you set &quot;priority&quot;: 200 or higher, as otherwise the built-in default template will apply.</p>
<h2>Ingest a document</h2>
<p>Ingesting a document into a TSDS isn't in any way different from ingesting documents into Elasticsearch. You can use the following commands in Dev Tools to add a document, and then search for it and also check out the mappings. Note: You have to adjust the @timestamp field to be close to your current date and time.</p>
<pre><code class="language-bash"># Add a document with `host.name` as the dimension
POST metrics-laptop-default/_doc
{
  # This timestamp needs to be adjusted to be current
  &quot;@timestamp&quot;: &quot;2023-03-30T12:26:23+00:00&quot;,
  &quot;host.name&quot;: &quot;ruflin.com&quot;,
  &quot;packages.sent&quot;: 1000,
  &quot;memory.usage&quot;: 0.8
}

# Search for the added doc, _source will show up but is reconstructed
GET metrics-laptop-default/_search

# Check out the mappings
GET metrics-laptop-default
</code></pre>
<p>If you do search, it still shows _source but this is reconstructed from the doc values. The additional field added above is @timestamp. This is important as it is a required field for any data stream.</p>
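<p>Once a few documents are in, the gauge can be aggregated as usual. Here is a hypothetical Dev Tools example using the fields from the template above:</p>

```bash
# Average memory usage per host (hypothetical example)
GET metrics-laptop-default/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": { "field": "host.name" },
      "aggs": {
        "avg_memory": { "avg": { "field": "memory.usage" } }
      }
    }
  }
}
```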
<h2>Why is this all important for Observability?</h2>
<p>One of the advantages of the Elastic Observability solution is that in a single storage engine, all signals are brought together in a single place. Users can query logs, metrics, and traces together without having to jump from one system to another. Because of this, having a great storage and query engine not only for logs but also metrics is key for us.</p>
<h2>Usage of TSDS in integrations</h2>
<p>With <a href="https://www.elastic.co/integrations/data-integrations">integrations</a>, we give our users an out of the box experience to integrate with their infrastructure and services. If you are using our integrations, eventually you will automatically get all the benefits of TSDS for your metrics assuming you are on version 8.7 or newer.</p>
<p>Currently we are working through the list of our integration packages, adding the dimensions and metric type fields and then turning on TSDS for the metrics data streams. As soon as a package has all properties enabled, the only thing you have to do is upgrade the integration; everything else happens automatically in the background.</p>
<p>To visualize your time series in Kibana, use <a href="https://www.elastic.co/guide/en/kibana/current/lens.html">Lens</a>, which has native support built in for TSDS.</p>
<h2>Learn more</h2>
<p>If you switch over to TSDS, you will automatically benefit from all the future improvements Elasticsearch is making for metrics time series, be it more efficient storage, query performance, or new aggregation capabilities. If you want to learn more about how TSDS works under the hood and all available config options, check out the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/tsds.html">TSDS documentation</a>. What Elasticsearch supports in 8.7 is only the first iteration of the metrics time series in Elasticsearch.</p>
<p><a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">TSDS has been available since 8.7</a> and will be enabled automatically in more and more of our integrations as they are upgraded. All you will notice is lower storage usage and faster queries. Enjoy!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/time-series-data-streams-observability-metrics/ebpf-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability for Google Cloud’s Vertex AI platform - understand performance, cost and reliability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration</link>
            <guid isPermaLink="false">elevate-llm-observability-with-gcp-vertex-ai-integration</guid>
            <pubDate>Wed, 09 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Enhance LLM observability with Elastic's GCP Vertex AI Integration — gain actionable insights into model performance, resource efficiency, and operational reliability.]]></description>
            <content:encoded><![CDATA[<p>As organizations increasingly adopt large language models (LLMs) for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like <a href="https://cloud.google.com/vertex-ai">Google Cloud’s Vertex AI</a>.</p>
<h3>New Elastic Observability LLM integration with Google Cloud’s Vertex AI platform</h3>
<p>We are thrilled to announce general availability of monitoring LLMs hosted in Google Cloud through the <a href="https://www.elastic.co/docs/current/integrations/gcp_vertexai">Elastic integration with Vertex AI</a>. This integration enables users to experience enhanced LLM Observability by providing deep insights into the usage, cost, and operational performance of models on Vertex AI, including latency, errors, token usage, frequency of model invocations, as well as resources utilized by models. By leveraging this data, organizations can optimize resource usage, identify and resolve performance bottlenecks, and enhance model efficiency and accuracy.</p>
<h3>Observability needs for AI-powered applications using the Vertex AI platform</h3>
<p>Leveraging AI models creates unique needs around the observability and monitoring of AI-powered applications. Some of the challenges that come with using LLMs are the high cost of calling LLMs, the quality and safety of LLM responses, and the performance, reliability, and availability of the LLMs.</p>
<p>Lack of visibility into LLM observability data can make it harder for SREs and DevOps teams to ensure their AI-powered applications meet their service level objectives for reliability, performance, cost and quality of the AI-generated content and have enough telemetry data to troubleshoot related issues. Thus, robust LLM observability and detection of anomalies in the performance of models hosted on Google Cloud’s Vertex AI platform in real time is critical for the success of AI-powered applications.</p>
<p>Depending on the needs of their LLM applications, customers can make use of a growing list of models hosted on the Vertex AI platform such as Gemini 2.0 Pro, Gemini 2.0 Flash, and Imagen for image generation. Each model excels in specific areas and generates content in modalities including language, audio, vision, and code. No two models are the same; each model has specific performance characteristics. So, it is important that service operators are able to track the individual performance, behavior, and cost of each model.</p>
<h3>Unlocking Insights with Vertex AI Metrics</h3>
<p>The Elastic integration with Google Cloud’s Vertex AI platform collects a wide range of metrics from models hosted on Vertex AI, enabling users to monitor, analyze, and optimize their AI deployments effectively.</p>
<p>Once you use the integration, you can review all the metrics in the Vertex AI dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Overview.png" alt="Overview Dashboard" /></p>
<p>These metrics can be categorized into the following groups:</p>
<h4>1. Prediction Metrics</h4>
<p>Prediction metrics provide critical insights into model usage, performance bottlenecks, and reliability. These metrics help ensure smooth operations, optimize response times, and maintain robust, accurate predictions.</p>
<ul>
<li>
<p><strong>Prediction Count by Endpoint</strong>: Measures the total number of predictions across different endpoints.</p>
</li>
<li>
<p><strong>Prediction Latency</strong>: Provides insights into the time taken to generate predictions, allowing users to identify bottlenecks in performance.</p>
</li>
<li>
<p><strong>Prediction Errors</strong>: Monitors the count of failed predictions across endpoints.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Prediction.png" alt="Prediction Metrics" /></p>
<h4>2. Model Performance Metrics</h4>
<p>Model performance metrics provide crucial insights into deployment efficiency and responsiveness. These metrics help optimize model performance and ensure reliable operations.</p>
<ul>
<li>
<p><strong>Model Usage</strong>: Tracks the usage distribution among different model deployments.</p>
</li>
<li>
<p><strong>Token Usage</strong>: Tracks the number of tokens consumed by each model deployment, which is critical for understanding model efficiency.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/token_model_usage.png" alt="Token model usage" /></p>
<ul>
<li>
<p><strong>Invocation Rates</strong>: Tracks the frequency of invocations made by each model deployment.</p>
</li>
<li>
<p><strong>Model Invocation Latency</strong>: Measures the time taken to invoke a model, helping in diagnosing performance issues.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Invocation_Vertex.png" alt="Model Invocation Metrics" /></p>
<h4>3. Resource Utilization Metrics</h4>
<p>Resource utilization metrics are vital for monitoring resource efficiency and workload performance. They help optimize infrastructure, prevent bottlenecks, and ensure smooth operation of AI deployments.</p>
<ul>
<li>
<p><strong>CPU Utilization</strong>: Monitors CPU usage to ensure optimal resource allocation for AI workloads.</p>
</li>
<li>
<p><strong>Memory Usage</strong>: Tracks the memory consumed across all model deployments.</p>
</li>
<li>
<p><strong>Network Usage</strong>: Measures bytes sent and received, providing insights into data transfer during model interactions.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/Resource_Utilization.png" alt="Resource Utilization Metrics" /></p>
<h4>4. Overview Metrics</h4>
<p>These metrics give an overview of the models deployed in Google Cloud’s Vertex AI platform. They are essential for tracking overall performance, optimizing efficiency, and identifying potential issues across deployments.</p>
<ul>
<li>
<p><strong>Total Invocations</strong>: The overall count of prediction invocations across all models and endpoints, providing a comprehensive view of activity.</p>
</li>
<li>
<p><strong>Total Tokens</strong>: The total number of tokens processed across all model interactions, offering insights into resource utilization and efficiency.</p>
</li>
<li>
<p><strong>Total Errors</strong>: The total count of errors encountered across all models and endpoints, helping identify reliability issues.</p>
</li>
</ul>
<p>All metrics can be filtered by <strong>region</strong>, offering localized insights for better analysis.</p>
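<p>As an illustration, such a region filter can be combined with an aggregation in a simple Dev Tools query. The data stream pattern and metric field name below are assumptions for illustration only; check the integration documentation for the exact names:</p>
<pre><code class="language-bash"># Hypothetical example: sum prediction counts for a single region
# (the data stream pattern and metric field name are placeholders)
GET metrics-gcp.vertexai*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: {
    &quot;term&quot;: { &quot;cloud.region&quot;: &quot;us-central1&quot; }
  },
  &quot;aggs&quot;: {
    &quot;total_predictions&quot;: {
      &quot;sum&quot;: { &quot;field&quot;: &quot;gcp.vertexai.prediction.count&quot; }
    }
  }
}
</code></pre>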
<p>Note: The Elastic integration with Vertex AI provides comprehensive visibility into both deployment models: provisioned throughput, where capacity is pre-allocated, and pay-as-you-go, where resources are consumed on demand.</p>
<h3>Conclusion</h3>
<p>This <a href="https://www.elastic.co/docs/current/integrations/gcp_vertexai">integration with Vertex AI</a> represents a significant step forward in enhancing the LLM Observability for users of Google Cloud’s Vertex AI platform. By unlocking a wealth of actionable data, organizations can assess the health, performance and cost of LLMs and troubleshoot operational issues, ensuring scalability, and accuracy in AI-driven applications.</p>
<p>Now that you know how the Vertex AI integration enhances LLM Observability, it’s your turn to try it out. Spin up an Elastic Cloud deployment and start monitoring your LLM applications hosted on Google Cloud’s Vertex AI platform.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elevate-llm-observability-with-gcp-vertex-ai-integration/vertexai-title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[2025 observability trends: Maturing beyond the hype]]></title>
            <link>https://www.elastic.co/observability-labs/blog/emerging-trends-in-observability-2025</link>
            <guid isPermaLink="false">emerging-trends-in-observability-2025</guid>
            <pubDate>Thu, 27 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover what 500+ decision-makers revealed about OpenTelemetry adoption, GenAI integration, and LLM monitoring—insights that separate innovators from followers in Elastic's 2025 observability survey.]]></description>
            <content:encoded><![CDATA[<h1>2025 observability trends: Maturing beyond the hype</h1>
<p>Our latest survey of over 500 observability decision-makers reveals how dramatically the landscape has evolved as we move through 2025. What strikes me most is how observability has moved beyond its technical roots to become a true business imperative. Let’s dive into what we're seeing in the industry.</p>
<h2>The investment paradox of observability in 2025</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image5.png" alt="" /></p>
<p>Here's something fascinating: 96% of executives in our survey expect observability to remain a key investment area. Yet almost all of them (97%) are hitting roadblocks in realizing full value. And surprisingly, the primary hurdles are not technical or complicated in nature. Can you guess what they might be?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image10.png" alt="" /></p>
<p>In 2025, IT leaders are challenged with financial hurdles to their observability efforts. I'm seeing this tension play out constantly in conversations with leaders - they know they need to invest, but they're grappling with budget constraints, licensing costs, and proving ROI for their organizations. This creates an interesting dynamic where organizations must carefully balance increasing investment with rigorous cost optimization and business metrics.</p>
<p>What's particularly interesting is how this paradox is forcing organizations to become more strategic about their investments. Leaders are no longer just throwing money at the problem - they're thinking carefully about how to maximize value from every dollar spent.</p>
<h2>Why observability maturity is making all the difference</h2>
<p>The data really jumps out at me here. The gap between observability experts and newcomers tells a compelling story that I wasn't expecting to see. Expert organizations are significantly outperforming their peers across every key metric:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image9.png" alt="" /></p>
<ul>
<li>91% of expert organizations are deploying applications and infrastructure faster (compared to just 34% of those in early stages)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image11.png" alt="" /></p>
<ul>
<li>82% are successfully reducing operational costs (versus 56% of early-stage organizations)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image4.png" alt="" /></p>
<ul>
<li>71% achieve better MTTR for incidents (while only 40% of early-stage organizations do)</li>
</ul>
<p>What I find particularly fascinating is how some benefits go beyond just maturity levels. About 80% of organizations report better customer issue response times regardless of their maturity stage. It tells me that even basic observability delivers immediate customer-facing value. This is crucial information for organizations just starting their observability journey - they can expect to see tangible benefits right from the start. But the overarching story may be that observability maturity leads teams from reactive to proactive and allows them to focus on higher level, value-add activities.</p>
<h2>Cost management: the new imperative</h2>
<p>The numbers around cost management paint a clear picture of where the industry is heading - 97% of IT decision-makers are actively managing observability costs, and 86% feel personally responsible for business outcomes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image2.png" alt="" /></p>
<p>I'm seeing a clear trend where leaders are taking concrete steps in their day to day work:</p>
<ul>
<li>Consolidating their observability toolset while maintaining capabilities - they don’t want to lose anything</li>
<li>Implementing usage-based pricing models</li>
<li>Establishing clear ROI metrics</li>
<li>Creating cross-functional teams to optimize spending</li>
</ul>
<p>This isn't just about cutting costs - it's about being smarter with resources. Organizations are learning that more tools don't necessarily mean better observability.</p>
<h2>Two technologies reshaping the observability landscape</h2>
<h3>AI's growing impact</h3>
<p>The enthusiasm for AI is remarkable - 94% of respondents see its tremendous potential. What fascinates me is how concerns about Generative AI reliability have actually decreased from 64% to 55% over the past year.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image7.png" alt="" /></p>
<p>Leaders are particularly excited about:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image1.png" alt="" /></p>
<ul>
<li>Automated correlation of logs, metrics, and traces (72% of respondents)</li>
<li>Predictive analytics for preventing outages</li>
<li>Natural language interfaces for querying observability data</li>
<li>Automated root cause analysis</li>
</ul>
<p>The key shift I'm seeing for the upcoming year is the move from AI as a buzzword to AI as a practical tool delivering real value in observability workflows.</p>
<p>Generative AI capabilities paired with retrieval augmented generation (RAG) capabilities allow organizations to leverage the power of LLMs and private data (e.g., runbooks, alerts, business data) to deliver relevant and meaningful results and identify and solve problems faster while reducing noise.</p>
<h3>OpenTelemetry's continued momentum</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image3.png" alt="" /></p>
<p>Looking at expert organizations, 80% are either experimenting with or have deployed OpenTelemetry. This isn't just about technology adoption - it's about building for the future with open standards. The correlation between OpenTelemetry adoption and overall observability maturity is unmistakable.</p>
<p>What's particularly interesting is how OpenTelemetry is changing the vendor landscape. Organizations are increasingly demanding OpenTelemetry support from their vendors, seeing it as a way to future-proof their observability investments and avoid vendor lock-in. Thinking back to how Linux shifted the server landscape, can we expect to see the same in the observability domain?</p>
<h2>Business integration and insights deepen</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/image8.png" alt="" /></p>
<p>Here's what I find most compelling: 64% of expert organizations are frequently correlating operational data with business outcomes, while only 9% of early-stage organizations do the same. This represents a fundamental shift from technical monitoring to business observability.</p>
<p>This isn't just about uptime anymore - organizations are increasingly using observability data to:</p>
<ul>
<li>Make informed business decisions</li>
<li>Improve customer experience</li>
<li>Optimize resource allocation</li>
<li>Drive innovation</li>
</ul>
<h2>Looking ahead</h2>
<p>As we continue through 2025, I'm seeing observability mature beyond its initial promise. Organizations are focusing less on basic implementation and more on delivering real business value through:</p>
<ul>
<li>Deeper business integration, like mapping system performance directly to revenue metrics</li>
<li>Optimized cost management through new data lake technology, efficient storage and intelligent retention</li>
<li>AI-enhanced capabilities powered by LLMs and Agentic AI</li>
<li>Standardized instrumentation through OpenTelemetry, reducing vendor lock-in</li>
</ul>
<p>The path to success in 2025 isn't just about having the right tools - it's about building mature practices that deliver measurable business value while managing costs effectively. The organizations that can balance these competing demands while maintaining focus on business outcomes are the ones pulling ahead.</p>
<p>What are you seeing in your organization's observability journey? Are these trends aligning with your experience?</p>
<p>If you would like to dig in deeper on emerging observability trends, download <a href="https://www.elastic.co/resources/observability/report/landscape-observability-report">our full report</a> or watch the on-demand webinar, <a href="https://www.elastic.co/virtual-events/observability-trends-2025">2025 Observability trends: Maturing beyond the hype and delivering results</a>!</p>
<p>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/emerging-trends-in-observability-2025/trends.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to enable Kubernetes alerting with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/enable-kubernetes-alerting-observability</link>
            <guid isPermaLink="false">enable-kubernetes-alerting-observability</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In the Kubernetes world, different personas demand different kinds of insights. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.]]></description>
            <content:encoded><![CDATA[<p>In the Kubernetes world, different personas demand different kinds of insights. Developers are interested in granular metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once to quickly get notified when a problem occurs and spot where the root cause is. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.</p>
<h2>Why do we need alerts?</h2>
<p>Logs, metrics, and traces are just the foundation for building a complete <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">monitoring solution for Kubernetes clusters</a>. Their main goal is to provide debugging information and historical evidence for the infrastructure.</p>
<p>While out-of-the-box dashboards, infrastructure topology, and logs exploration through Kibana are already quite handy to perform ad-hoc analyses, adding notifications and active monitoring of infrastructure allows users to deal with problems detected as early as possible and even proactively take actions to prevent their Kubernetes environments from facing even more serious issues.</p>
<h3>How can this be achieved?</h3>
<p>By building alerts on top of their infrastructure, users can leverage the data and effectively correlate it to a specific notification, creating a wide range of possibilities to dynamically monitor and observe their Kubernetes cluster.</p>
<p>In this blog post, we will explore how users can leverage Elasticsearch’s search powers to define alerting rules in order to be notified when a specific condition occurs.</p>
<h2>SLIs, alerts, and SLOs: Why are they important for SREs?</h2>
<p>For site reliability engineers (SREs), the <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">incident response time</a> is tightly coupled with the success of everyday work. Monitoring, alerting, and actions will help to discover, resolve, or prevent issues in their systems.</p>
<blockquote>
<ul>
<li><em>An SLA (Service Level Agreement) is an agreement you create with your users to specify the level of service they can expect.</em></li>
<li><em>An SLO (Service Level Objective) is an agreement within an SLA about a specific metric like uptime or response time.</em></li>
<li><em>An SLI (Service Level Indicator) measures compliance with an SLO.</em></li>
</ul>
</blockquote>
<p>SREs’ day-to-day tasks and projects are driven by SLOs. By ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term, we lay the basis of a stable working infrastructure.</p>
<p>Having said this, identifying the high-level categories of SLOs is crucial in order to organize the work of an SRE. Then in each category of SLOs, SREs will need the corresponding SLIs that can cover the most important cases of their system under observation. Therefore, the decision of which SLIs we will need demands additional knowledge of the underlying system infrastructure.</p>
<p>One widely used approach to categorize SLIs and SLOs is the <a href="https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals">Four Golden Signals</a> method. The categories defined are Latency, Traffic, Errors, and Saturation.</p>
<p>A more specific approach is <a href="https://thenewstack.io/monitoring-microservices-red-method/">the RED method</a> developed by Tom Wilkie, who was an SRE at Google and used the Four Golden Signals. The RED method drops the saturation category because this one is mainly used for more advanced cases — and people remember better things that come in threes.</p>
<p>Focusing on Kubernetes infrastructure operators, we will consider the following groups of infrastructure SLIs/SLOs:</p>
<ul>
<li>Group 1: Latency of the control plane components (e.g., the apiserver)</li>
<li>Group 2: Resource utilization of the nodes/pods (how much cpu, memory, etc. is consumed)</li>
<li>Group 3: Errors (errors on logs or events or error count from components, network, etc.)</li>
</ul>
<h2>Creating alerts for a Kubernetes cluster</h2>
<p>Now that we have a complete outline of our goal to define alerts based on SLIs/SLOs, we will dive into defining the proper alerting. Alerts can be built using <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Kibana</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-create-rule.png" alt="kubernetes create rule" /></p>
<p>See Elastic <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">documentation</a>.</p>
<p>In this blog, we will define more complex alerts based on complex Elasticsearch queries provided by <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-getting-started.html">Watcher</a>’s functionality. <a href="https://www.elastic.co/guide/en/kibana/8.8/watcher-ui.html">Read more about Watcher</a> and how to properly use it in addition to the examples in this blog.</p>
<h3>Latency alerts</h3>
<p>For this kind of alert, we want to define the basic SLOs for a Kubernetes control plane, which will ensure that the basic control plane components can service the end users without an issue. For instance, facing high latencies in queries against the Kubernetes API Server is enough of a signal that action needs to be taken.</p>
<h3>Resource saturation</h3>
<p>The next group of alerting will be resource utilization. A node’s CPU utilization or changes in a node’s condition are critical signals for a cluster to ensure the smooth servicing of the workloads provisioned to run the applications that end users interact with.</p>
<h3>Error detection</h3>
<p>Last but not least, we will define alerts based on specific errors like the network error rate or Pod failures such as the OOMKilled situation. These are very useful indicators for SRE teams to either detect issues on the infrastructure level or notify developer teams about problematic workloads. One example that we will examine later is an application running as a Pod that constantly gets restarted because it hits its memory limit. In that case, the owners of this application need to get notified so they can act properly.</p>
<h2>From Kubernetes data to Elasticsearch queries</h2>
<p>With a solid plan for the alerts we want to implement, it's time to explore the data we have collected from the Kubernetes cluster and stored in Elasticsearch. For this we will consult the list of the available data fields that are ingested using the Elastic Agent Kubernetes <a href="https://docs.elastic.co/en/integrations/kubernetes">integration</a> (the full list of fields can be found <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html">here</a>). Using these fields we can create various alerts, such as:</p>
<ul>
<li>Node CPU utilization</li>
<li>Node Memory utilization</li>
<li>Network bandwidth utilization</li>
<li>Pod restarts</li>
<li>Pod CPU/memory utilization</li>
</ul>
<h3>CPU utilization alert</h3>
<p>Our first example will use the CPU utilization fields to calculate the Node’s CPU utilization and create an alert. For this alert, we leverage the metrics:</p>
<pre><code class="language-yaml">kubernetes.node.cpu.usage.nanocores
kubernetes.node.cpu.capacity.cores
</code></pre>
<p>The calculation (nodeUsage / 1000000000) / nodeCap, grouped by node name, gives us the CPU utilization of our cluster’s nodes as a fraction between 0 and 1.</p>
<p>The Watcher definition that implements this query can be created with the following API call to Elasticsearch:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;10m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-10m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.node OR data_stream.dataset: kubernetes.state_node&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;nodes&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.node.name&quot;,
                &quot;size&quot;: &quot;10000&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              },
              &quot;aggs&quot;: {
                &quot;nodeUsage&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.usage.nanocores&quot;
                  }
                },
                &quot;nodeCap&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.capacity.cores&quot;
                  }
                },
                &quot;nodeCPUUsagePCT&quot;: {
                  &quot;bucket_script&quot;: {
                    &quot;buckets_path&quot;: {
                      &quot;nodeUsage&quot;: &quot;nodeUsage&quot;,
                      &quot;nodeCap&quot;: &quot;nodeCap&quot;
                    },
                    &quot;script&quot;: {
                      &quot;source&quot;: &quot;( params.nodeUsage / 1000000000 ) / params.nodeCap * 100&quot;,
                      &quot;lang&quot;: &quot;painless&quot;,
                      &quot;params&quot;: {
                        &quot;_interval&quot;: 10000
                      }
                    },
                    &quot;gap_policy&quot;: &quot;skip&quot;
                  }
                }
              }
            }
          }
        },
        &quot;indices&quot;: [
          &quot;metrics-kubernetes*&quot;
        ]
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.nodes.buckets&quot;: {
        &quot;path&quot;: &quot;nodeCPUUsagePCT.value&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 80
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;log_hits&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.nodes.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;logging&quot;: {
        &quot;text&quot;: &quot;Kubernetes node found with high CPU usage: {{ctx.payload.key}} -&gt; {{ctx.payload.nodeCPUUsagePCT.value}}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Node CPU Usage&quot;
  }
}
</code></pre>
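<p>The <code>bucket_script</code> aggregation above converts the node's usage from nanocores to cores and expresses it as a percentage of the node's capacity. As a quick sanity check, the same arithmetic can be mirrored in plain Python (an illustrative sketch; the function name and sample values are ours, not part of the Watcher):</p>

```python
def node_cpu_usage_pct(usage_nanocores: float, capacity_cores: float) -> float:
    # 1 core = 1_000_000_000 nanocores; express usage as a percentage of capacity.
    return (usage_nanocores / 1_000_000_000) / capacity_cores * 100

# A node using 3.2 cores out of a 4-core capacity is at 80% CPU,
# which is exactly the watcher's alerting threshold.
print(node_cpu_usage_pct(3_200_000_000, 4))  # -> 80.0
```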
<h3>OOMKilled Pods detection and alerting</h3>
<p>Another Watcher we will explore detects Pods that have been restarted due to an OOMKilled error. This error is quite common in Kubernetes workloads, and detecting it early is useful for informing the team that owns the workload, so they can either investigate issues that could cause memory leaks or consider increasing the resources requested by the workload.</p>
<p>This information can be retrieved from a query like the following:</p>
<pre><code class="language-yaml">kubernetes.container.status.last_terminated_reason: OOMKilled
</code></pre>
<p>Here is how we can create the respective Watcher with an API call:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;1m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;search_type&quot;: &quot;query_then_fetch&quot;,
        &quot;indices&quot;: [
          &quot;*&quot;
        ],
        &quot;rest_total_hits_as_int&quot;: true,
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-1m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.state_container&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      },
                      {
                        &quot;exists&quot;: {
                          &quot;field&quot;: &quot;kubernetes.container.status.last_terminated_reason&quot;
                        }
                      },
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;kubernetes.container.status.last_terminated_reason: OOMKilled&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;pods&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.pod.name&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              }
            }
          }
        }
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.pods.buckets&quot;: {
        &quot;path&quot;: &quot;doc_count&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 1,
          &quot;quantifier&quot;: &quot;some&quot;
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/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&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Pod Terminated OOMKilled&quot;
  }
}
</code></pre>
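<p>The <code>array_compare</code> condition with the <code>some</code> quantifier fires when at least one aggregation bucket satisfies the comparison. In plain Python, the check is roughly equivalent to the following (a sketch; <code>condition_met</code> and the sample buckets are illustrative, not part of the Watcher API):</p>

```python
def condition_met(buckets, path="doc_count", threshold=1):
    # Watcher's array_compare with quantifier "some": true if any
    # element's value at `path` is >= the threshold.
    return any(bucket.get(path, 0) >= threshold for bucket in buckets)

buckets = [
    {"key": "frontend-5d4f8", "doc_count": 0},
    {"key": "payments-7b9cd", "doc_count": 2},  # OOMKilled twice in the window
]
print(condition_met(buckets))  # -> True
```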
<h3>From Kubernetes data to alerts summary</h3>
<p>So far, we have seen how to start from plain Kubernetes fields, use them in Elasticsearch queries, and build Watchers and alerts on top of them.</p>
<p>You can explore further data combinations and build queries and alerts following the examples provided here. A <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">full list of alerts</a> is available, as well as a <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">basic scripted way of installing them</a>.</p>
<p>These examples can be defined with simple actions that only log messages into the Elasticsearch logs. However, you can also use more advanced and useful outputs, such as Slack webhooks:</p>
<pre><code class="language-json">&quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  }
</code></pre>
<p>The result would be a Slack message like the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-k8s-cluster-alerting.png" alt="" /></p>
<h2>Next steps</h2>
<p>In our next steps, we would like to make these alerts part of our Kubernetes integration, which would mean that the predefined alerts are installed when users install or enable the Kubernetes integration. At the same time, we plan to implement some of these as Kibana’s native SLIs, giving our users the option to quickly define SLOs on top of the SLIs through a friendly user interface. If you’re interested in learning more, follow the public GitHub issues and feel free to provide your feedback:</p>
<ul>
<li><a href="https://github.com/elastic/package-spec/issues/484">https://github.com/elastic/package-spec/issues/484</a></li>
<li><a href="https://github.com/elastic/kibana/issues/150050">https://github.com/elastic/kibana/issues/150050</a></li>
</ul>
<p>For those who are eager to start using Kubernetes alerting today, here is what you need to do:</p>
<ol>
<li>Make sure that you have an Elastic cluster up and running. The fastest way to deploy your cluster is to spin up a <a href="https://www.elastic.co/elasticsearch/service">free trial of Elasticsearch Service</a>.</li>
<li>Install the latest Elastic Agent on your Kubernetes cluster following the respective <a href="https://www.elastic.co/guide/en/fleet/master/running-on-kubernetes-managed-by-fleet.html">documentation</a>.</li>
<li>Install our provided alerts that can be found at <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs</a> or at <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting</a>.</li>
</ol>
<p>Of course, if you have any questions, remember that we are always happy to help on the Discuss <a href="https://discuss.elastic.co/">forums</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/alert-management.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Bridging the Gap: End-to-End Observability from Cloud Native to Mainframe]]></title>
            <link>https://www.elastic.co/observability-labs/blog/end-to-end-o11y-from-cloud-native-to-mainframe</link>
            <guid isPermaLink="false">end-to-end-o11y-from-cloud-native-to-mainframe</guid>
            <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Achieving end-to-end observability in hybrid enterprise environments, where modern cloud-native applications interact with critical yet often opaque IBM mainframe systems, is a challenge. Pairing IBM Z Observability Connect, which enables OTel output, with Elastic Observability is a solution, transforming your mainframe black box into a fully observable component of your deployment.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>OpenTelemetry is emerging as the standard for modern observability. As a highly active project within the Cloud Native Computing Foundation (CNCF)—second only to Kubernetes—it has become the monitoring solution of choice for cloud-native applications. OpenTelemetry provides a unified method for collecting traces, metrics, and logs across Kubernetes, microservices, and infrastructure.</p>
<p>However, for many enterprises—especially in banking, insurance, healthcare, and government—the reality is more complex than just “cloud native.” Although most organizations have deployed mobile apps and adopted microservices architectures, much of their critical core processing still relies on IBM mainframe applications. These systems process credit card swipes, financial transactions, patient records, and premium calculations.</p>
<p>This creates a dilemma: while the modern distributed systems of the hybrid environment are well-observed, the critical backend remains a black box.</p>
<h2>The “Broken Trace”</h2>
<p>A common challenge we see with customers involves a request that originates from a modern mobile application. The request hits microservices running on Kubernetes, initiates a service call to the mainframe, and suddenly, visibility stops.</p>
<p>When latency spikes or a transaction fails, Site Reliability Engineers (SREs) are left guessing. Is it the network? The API gateway? Or underlying mainframe applications like CICS? Without a unified, end-to-end view of the services involved—from the frontend Node.js microservices to the backend CICS service—mean time to resolution (MTTR) becomes “mean time to innocence,” with teams simply proving it wasn't their microservice rather than fixing root causes.</p>
<p>We need a unified view where a trace flows seamlessly from a cloud-native frontend (like React) all the way into mainframe transactions.</p>
<h2>IBM Z Observability Connect</h2>
<p>With the recent release of <a href="https://www.ibm.com/docs/en/zapmc/7.1.0?topic=z-observability-connect-overview">Z Observability Connect</a>, IBM has introduced OpenTelemetry-native instrumentation into mainframe applications. This creates a bridge between modern cloud-native services and mainframe transactions.</p>
<p>This means the mainframe is no longer a special case; it acts just like any other microservice in a mesh. It functions as an OpenTelemetry data producer, emitting traces, metrics, and logs to OpenTelemetry-compliant backends like Elastic.</p>
<h2>The Architecture</h2>
<p>The architecture is straightforward:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/architecture.png" alt="architecture" /></p>
<ul>
<li><strong>The Collector</strong>: <a href="https://docs.google.com/document/d/1-0gDjeM6s63AaQio1j0Cb2Pfodkr6KfSw1-q847Gzes/edit?tab=t.0#heading=h.xa5hqxwq5lps">IBM Z Observability Connect</a> runs on z/OS. It collects logs, metrics, or traces and converts them into the OTLP (OpenTelemetry Protocol) format.</li>
<li><strong>The Processor</strong>: The <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Cloud Managed OTLP Endpoint</a> acts as a gateway collector, providing fully hosted, scalable, and reliable native OTLP ingestion.</li>
<li><strong>The Consumer</strong>: <a href="https://www.elastic.co/docs/solutions/observability/apm">Elastic APM</a> enables OpenTelemetry-native application performance monitoring, making it easy to pinpoint and fix performance problems quickly.</li>
</ul>
<h2>Putting it all together in Kubernetes</h2>
<p>We deploy an OpenTelemetry Collector within our Kubernetes cluster. This collector acts as a specialized gateway. It is configured to receive OTLP traffic directly from IBM Z Observability Connect on the mainframe and forward it securely to our observability backend, Elastic APM, by using the <code>otlp/elastic</code> exporter.</p>
<p>Here is the configuration for the OpenTelemetry Collector. Note the <code>exporters</code> section, which handles the authentication and batched transmission to Elastic:</p>
<pre><code>exporters:
  # Exporter to print the first 5 logs/metrics and then every 1000th
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 1000

  # Exporter to send logs and metrics to Elasticsearch Managed OTLP Input
  otlp/elastic:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}
    sending_queue:
      enabled: true
      sizer: bytes
      queue_size: 50000000 # 50MB uncompressed
      block_on_overflow: true
    batch:
      flush_timeout: 1s
      min_size: 1_000_000 # 1MB uncompressed
      max_size: 4_000_000 # 4MB uncompressed

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic, debug]
</code></pre>
<p><em>Note: We strongly recommend using environment variables for your endpoints and API keys to keep your manifest secure.</em></p>
<h2>Why the OTel specification matters</h2>
<p><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic’s managed OTLP endpoint</a> and observability solution are built with native OTel support and adhere to the OTel specification and semantic conventions. Once we wired everything up and the data started to flow, we noticed that some of the traces in Elastic APM were not being represented correctly.</p>
<p>Most observability solutions derive the so-called RED metrics (rate, error, and duration) for the most important spans in a trace—i.e., incoming and outgoing spans of each individual service. This allows for an efficient indication of a service’s performance without the need to comb through all of the tracing data to show something as simple as the latency of a service’s endpoint or the error rate on outgoing requests.</p>
<p>For an efficient calculation of such derived metrics for incoming spans on a service, the <a href="https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0182-otlp-remote-parent.md">OTel community</a> introduced the <code>SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK</code> and <code>SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK</code> flags on the span entities within the OTLP protocol. These flags provide an unambiguous indication of whether an individual span is an entry span and, thus, allow observability backends to efficiently calculate metrics for entry-level spans.</p>
<p>If these flags are set incorrectly for an entry span, the span cannot be recognized as an entry span, and metrics are not derived properly—leading to a broken experience. This is what we initially experienced with the ingested OTel data from the IBM mainframe instrumentation.</p>
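<p>The flag check itself is simple bit masking. The mask values below come from the SpanFlags definition in the OTLP <code>trace.proto</code>; the helper function is an illustrative sketch, not Elastic's actual backend implementation:</p>

```python
# SpanFlags bit masks from the OTLP trace.proto definition.
SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK = 0x00000100
SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK = 0x00000200

def parent_is_remote(span_flags: int):
    """Return True/False when the producer recorded whether the parent
    context was remote, or None when the flag was never set."""
    if not span_flags & SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK:
        return None  # no information: the backend cannot classify the span
    return bool(span_flags & SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK)

# A span whose parent context is remote (both bits set) is an entry span.
print(parent_is_remote(0x300))  # -> True
```

<p>If a producer sets <code>SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK</code> without the accompanying <code>HAS</code> bit, or omits both on a genuine entry span, the backend sees "unknown" and cannot derive entry-span metrics, which is exactly the broken experience described above.</p>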
<p>In a proprietary world, this might have been a dead end or a months-long troubleshooting exercise. However, since OpenTelemetry is an open standard, we were able to debug the issue rapidly and share our findings with IBM engineers, who quickly developed a fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/service_map.png" alt="service_map" /></p>
<h2>Streamline observability</h2>
<p>We now have end-to-end visibility that spans from modern mobile or web applications deep into the IBM mainframe. This unlocks significant value:</p>
<ul>
<li><strong>Unified Service Maps</strong>: You can visually see the dependency between the cloud-native cart service and the backend inventory system on z/OS.</li>
<li><strong>Single Pane of Glass</strong>: SREs no longer need to switch between modern observability tools and separate mainframe monitoring tools to view service health.</li>
<li><strong>Operational Efficiency</strong>: By eliminating the “blind spot” in the trace, you reduce the time spent on coordinating between cloud and mainframe teams, making issue resolution faster.</li>
</ul>
<h2>Conclusion</h2>
<p>If you are running hybrid workloads, it is time to stop treating your mainframe as a black box. With IBM Z Observability Connect, the Elastic Managed OTLP Endpoint, and Elastic APM, your entire stack can finally speak a single language: OpenTelemetry.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/end-to-end-o11y-from-cloud-native-to-mainframe/end-to-end-o11y.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Demo with the Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry</link>
            <guid isPermaLink="false">opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry</guid>
            <pubDate>Mon, 07 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic is dedicated to supporting users in their journey with OpenTelemetry. Explore our public deployment of the OpenTelemetry Demo and see how Elastic's solutions enhance your observability experience.]]></description>
            <content:encoded><![CDATA[<p>Recently, Elastic <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">introduced the Elastic Distributions
(EDOT)</a>
for various OpenTelemetry components. We are proud to announce that these EDOT
components are now available in <a href="https://github.com/elastic/opentelemetry-demo">Elastic's fork of the OpenTelemetry
Demo</a>. We've also made public a
<a href="https://ela.st/demo-otel">Kibana endpoint</a>, allowing you to dive into the
demo’s live data and explore its capabilities firsthand. In this blog post,
we'll elaborate on the reasons behind the fork and explore the powerful new
features it introduces. We'll also provide a comprehensive overview of how
these enhancements can be leveraged with the Elastic Distributions of
OpenTelemetry (EDOT) for advanced error detection, as well as the EDOT
Collector—a cutting-edge evolution of the Elastic Agent—for seamless data
collection and analysis.</p>
<h2>What is the OpenTelemetry Demo?</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo</a>
is a microservices-based application created by OpenTelemetry's community to
showcase its capabilities in a realistic, distributed system environment.
This demo application, known as the OpenTelemetry Astronomy Shop, simulates an
e-commerce website composed of over 10 interconnected microservices (written in
multiple languages: Go, Java, .NET, Node.js, etc.), communicating via HTTP and
gRPC. Each service is fully instrumented with OpenTelemetry, generating
comprehensive traces, metrics, and logs. The demo serves as an invaluable
resource for understanding how to implement and use OpenTelemetry in real-world
applications.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/opentelemetry_demo_service_map.png" alt="1 - Service Map for the OpenTelemetry Demo Elastic
fork" /></p>
<p>One of the microservices, called <code>loadgenerator</code>, automatically starts
generating requests to the various endpoints of the demo, simulating a
real-world environment where multiple clients are interacting with the system.
This helps replicate the behavior of a busy, live application with concurrent
user activity.</p>
<h3>Elastic's fork</h3>
<p>Elastic recognized an opportunity to enhance the OpenTelemetry Demo by forking
it and integrating advanced Elastic features for deeper observability and
simpler monitoring. While forking is the <a href="https://github.com/open-telemetry/opentelemetry-demo?tab=readme-ov-file#demos-featuring-the-astronomy-shop">recommended OpenTelemetry
approach</a>,
we aim to leverage the robust foundation and latest updates from the upstream
version as much as possible. To achieve this, Elastic’s fork of the
OpenTelemetry Demo performs daily pulls from upstream, seamlessly integrating
them with Elastic-specific changes. To avoid conflicts, we continuously
contribute upstream, ensuring Elastic's modifications are always additive or
configurable through environment variables. One such contribution is the
<a href="https://github.com/elastic/opentelemetry-demo/blob/main/.env.override">.env.override
file</a>,
designed exclusively for vendor forks to override the microservices images and
configuration files used in the demo.</p>
<h2>Deeper Insights with Elastic Distributions</h2>
<p>In our current update of Elastic's OpenTelemetry Demo fork, we have replaced
some of the microservices OTel SDKs used for instrumentation with Elastic's
specialized distributions. These changes ensure deeper integration with
Elastic's observability tools, offering richer insights and more robust
monitoring capabilities. These are some of the fork's changes:</p>
<p><strong>Java services:</strong> The Ad, Fraud Detection, and Kafka services now utilize the
Elastic distribution of the OpenTelemetry Java Agent. One of the features
included in the distribution is span stack traces, which provide precise
information about where in the code path a span originated. Learn more about
the Elastic Java Agent
<a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/adservice_span_stacktrace.png" alt="2 - Ad Service span stack trace
example" /></p>
<p>The <strong>Cart service</strong> has been upgraded to use the Elastic distribution of the
OpenTelemetry .NET Agent. This replacement shows how the Elastic Distribution
of OpenTelemetry .NET (EDOT .NET) can be used to get started with
OpenTelemetry in your .NET applications with zero code changes. Discover more
about the Elastic .NET Agent in <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">this blog
post</a>.</p>
<p>In the <strong>Payment service</strong>, we've configured the Elastic distribution of the
OpenTelemetry Node.js Agent. The distribution ships with the host-metrics
extension, and Kibana provides a curated service metrics UI. Read more about
the Elastic Node.js Agent
<a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_service_host_metrics.png" alt="3 - Payment service host
metrics" /></p>
<p>The <strong>Recommendation service</strong> now leverages the EDOT Python, replacing the
standard OpenTelemetry Python agent. The Python distribution is another example
of a Zero-code (or Automatic) instrumentation, meaning that the distribution
will set up the OpenTelemetry SDK and enable all the recommended
instrumentations for you. Find out more about the Elastic Python Agent in <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">this
blog
post</a>.</p>
<p>It's important to highlight that the Elastic Distributions of OpenTelemetry
don't bundle proprietary software; they are built on top of the vanilla OTel
SDKs but offer advantages such as a single package for installation, easy
auto-instrumentation with reasonable default configuration, automatic log
telemetry, and more. Along these lines, the ultimate goal is to contribute as
many EDOT features as possible back to the upstream OpenTelemetry agents; they
are designed in such a way that the additional features, realized as
extensions, work directly with the OTel SDKs.</p>
<h2>Collecting Data with the Elastic Collector Distribution</h2>
<p>The OpenTelemetry Demo applications generate and send their signals to an
OpenTelemetry Collector OTLP endpoint. In the Demo's fork, the EDOT collector
is set up to forward all OTLP signals from the microservices to an <a href="https://www.elastic.co/guide/en/observability/current/apm.html">APM
server</a> OTLP
endpoint. Additionally, it sends all other metrics and logs collected by the
collector to an Elasticsearch endpoint.</p>
<p>If the fork is deployed in a Kubernetes environment, the collector will
automatically start collecting the system's metrics. The collector will be
configured to use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver">hostmetrics
receiver</a>
to monitor each K8s node's metrics, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kubeletstatsreceiver">kubeletstats
receiver</a>
to retrieve the Kubelet's metrics, and the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">filelog
receiver</a>,
which collects all of the cluster's logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/node_host_metrics.png" alt="4 - Host
metrics" /></p>
<p>Both the signals generated by the microservices and those collected by the EDOT
collector are enriched with Kubernetes metadata, allowing users to correlate
them seamlessly. This makes it easy to track and observe which Kubernetes nodes
and pods each service is running on, providing deep insights into both
application performance and infrastructure health.</p>
<p>Learn more about Elastic's OpenTelemetry Collector distribution:
<a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector</a></p>
<h2>Error detection with Elastic</h2>
<p>The OpenTelemetry Demo incorporates <a href="https://flagd.dev/">flagd</a>, a feature flag
evaluation engine used to simulate error scenarios. For example, the
<code>paymentServiceFailure</code> flag will force an error for every request to the
payment service <code>charge</code> endpoint. Since the service is instrumented with
OpenTelemetry, the error will be captured in the generated traces. We can then
use Kibana's powerful visualization and search tools to trace the error back to
its root cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_error.png" alt="5 - Payment service
error" />
<img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/payment_trace_error.png" alt="6 - Payment service trace
error" /></p>
<p>Another available flag is named <code>adServiceHighCpu</code>, which causes a high CPU
load in the ad service. This increased CPU usage can be monitored either
through the service's metrics or the related metrics of its Kubernetes pod:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/adservice_high_cpu_error.png" alt="7 - AdService High CPU
error" /></p>
<p>The full list of simulated scenarios can be found at <a href="https://opentelemetry.io/docs/demo/feature-flags/">this
link</a>.</p>
<h2>Start your own exploration</h2>
<p>Ready to explore the OpenTelemetry Demo with Elastic and its enhanced
observability capabilities? Follow the link to Kibana and begin your own
exploration of how Elastic and OpenTelemetry can transform your approach to
observability.</p>
<p>Live demo: <a href="https://ela.st/demo-otel">https://ela.st/demo-otel</a></p>
<p>But that's not all—if you want to take it a step further, you can deploy the
OpenTelemetry Demo directly with your own Elasticsearch stack. Follow the steps
provided <a href="https://github.com/elastic/opentelemetry-demo">here</a> to set it up and
start gaining valuable insights from your own environment.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry/elastic-oteldemo.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Reconciliation in Elastic Streams: A Robust Architecture Deep Dive]]></title>
            <link>https://www.elastic.co/observability-labs/blog/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</link>
            <guid isPermaLink="false">from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</guid>
            <pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic's engineering team refactored Streams using a reconciliation model inspired by Kubernetes & React to build a robust, extensible, and debuggable system.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new, unified approach to data management in the Elastic Stack. It wraps a set of existing Elasticsearch building blocks—data streams, index templates, ingest pipelines, retention policies—into a single, coherent primitive: the Stream. Instead of configuring these parts individually and in the right order, users can now rely on Streams to orchestrate them safely and automatically. With a unified UI in Kibana and a simplified API, Streams reduces cognitive load, lowers the risk of misconfiguration, and supports more flexible workflows like late binding—where users can ingest data first and decide how to process and route it later.</p>
<p>But behind that clean user experience lies a fast-moving, evolving codebase. In this post, we’ll explore how we rethought its architecture to keep up with product demands—while laying the groundwork for future flexibility and scale.</p>
<p>Rapid experimentation often leads to messy code—but before shipping to customers, we have to ask: If this succeeds, can we continue evolving it?
That question puts code health front and center. To move fast in the long term, we need a foundation that supports iteration.</p>
<p>When I joined the Streams team about six months ago, the project was moving fast through uncharted territory amid high uncertainty. This combination of speed and uncertainty created the perfect conditions for, well, spaghetti code—crafted by some of our most senior engineers, doing their best with a recipe missing a few ingredients.</p>
<p>The code was pragmatic and effective: it did exactly what it needed to do. But it was becoming increasingly difficult to understand and extend. Related logic was scattered across many files, with little separation of concerns, making it difficult to safely identify where and how to introduce changes. And the project still had a long road ahead.</p>
<p>Recently, we undertook a refactor of the underlying architecture—not just to bring greater clarity and structure to the codebase, but to establish clear phases that make it easier to debug and evolve. Our primary goal was to build a foundation that would let us continue moving quickly and confidently.
As a secondary goal, we aimed to enable new capabilities like bulk updates, dry runs, and system diagnostics.</p>
<p>In this post, we’ll briefly explore the challenges that prompted a new approach, share the architectural patterns that inspired us, explain how the new design works under the hood, and highlight what it enables for the future.</p>
<h2>The Challenges We Faced</h2>
<p>Streams aims to be a declarative model for data management. Users describe how data should flow: where it should go, what processing should happen along the way, and which mappings should apply. Behind the scenes, each API request results in one or more Elasticsearch resources being changed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/mess.png" alt="An image evoking a tangled mess" /></p>
<p>Before the refactor, the underlying code was increasingly difficult to reason about. There was no clear lifecycle that each request followed. Data was loaded only when it happened to be needed, validation was scattered across different functions, and cascading changes—like child streams reacting to parent updates—were applied recursively and implicitly. Elasticsearch requests could happen at any point during a request.</p>
<p>This led to several key challenges:</p>
<ul>
<li>
<p><strong>No clear place for validation</strong><br />
Without a single, centralized validation step, engineers weren’t sure where to add new checks—or whether existing ones would even run reliably. Some validations happened early, others late.</p>
</li>
<li>
<p><strong>No clear picture of the overall system state</strong><br />
Because there was no way to manage the system state as a whole, it was hard to reason about or validate it. We couldn’t easily check whether a change was valid in the context of all other existing streams or dependencies.</p>
</li>
<li>
<p><strong>Unpredictable side effects</strong><br />
Since Elasticsearch operations could occur at different points in the flow, failures were harder to handle or roll back. We didn’t have a clear “commit point” where the changes were executed.</p>
</li>
<li>
<p><strong>Tangled stream logic</strong><br />
Logic for different types of streams was mixed together in shared code paths, often guarded by conditionals. This made it hard to isolate behavior, test individual types, or add new ones without risking unintended consequences.</p>
</li>
</ul>
<p>These challenges made it clear: we needed a more structured foundation, one capable of supporting both the current complexity and future growth.</p>
<h2>What We Needed to Move Forward</h2>
<p>To move faster yet with confidence, we needed a foundation that could evolve gracefully, make behavior easier to reason about, and reduce the likelihood of unexpected side effects.</p>
<p>We aligned around a few key goals:</p>
<ul>
<li>
<p><strong>A clear request lifecycle</strong><br />
Each request should move through clear, well-defined phases: loading the current state, applying changes, validating the resulting state, determining the Elasticsearch actions, and executing the actions. This structure would help engineers understand where things happen—and why.</p>
</li>
<li>
<p><strong>A unified state model</strong><br />
We wanted a clear model of desired vs. current state—a single place to reason about the outcome of a change. This would enable safer validation, more efficient updates, and easier debugging by allowing us to compute the difference between the two states.</p>
</li>
<li>
<p><strong>A single commit point</strong><br />
All Elasticsearch changes should happen in one place, after everything’s validated and we know exactly what needs to change. This would reduce side effects, make failures easier to manage, and unlock support for dry runs.</p>
</li>
<li>
<p><strong>Isolated stream logic</strong><br />
We needed clearer separation between stream types so each could be developed and tested in isolation. This would simplify adding new types, reduce unintended side effects, and clarify whether changes belong to a stream type or the state management layer.</p>
</li>
<li>
<p><strong>Bulk operations and system introspection</strong><br />
Finally, we wanted to support features like bulk updates, dry runs, and health diagnostics—capabilities that were difficult or impossible with the old design. A more explicit and inspectable model of system state would make this possible.</p>
</li>
</ul>
<p>These goals became our north star as we explored new architectural patterns to get there, with a strong focus on comparing the current state with the desired state.</p>
<h2>Where We Drew Inspiration From</h2>
<p>Our new design drew inspiration from two well-known open source projects: <a href="https://kubernetes.io/">Kubernetes</a> and <a href="https://react.dev/">React</a>. Though very different, both share a central concept: reconciliation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/reconciliation.png" alt="An image showing a flow chart for reconciliation" /></p>
<p>Reconciliation means comparing two states, calculating their differences, and taking the necessary actions to move the system from its current state to its desired state.</p>
<ul>
<li>
<p>In <a href="https://kubernetes.io/docs/concepts/architecture/controller/">Kubernetes</a>, you declare the desired state of your resources, and the controller continuously works to align the cluster with that state.</p>
</li>
<li>
<p>In <a href="https://legacy.reactjs.org/docs/faq-internals.html">React</a>, each component defines how it should render, and the virtual DOM updates the real DOM efficiently to match that.</p>
</li>
</ul>
<p>We were also inspired by the <a href="https://mmapped.blog/posts/29-plan-execute">Plan/Execute</a> pattern, which separates decision making from execution. This sounded like exactly what we needed in order to perform all validations before committing to any actions—ensuring we could reason about and inspect the system's intent ahead of time.</p>
<p>These concepts resonated with our goals and made it clear that we required two key pieces:</p>
<ol>
<li>
<p>A model representing system state, responsible for comparing states and driving the overall workflow (like the Kubernetes controller loop).</p>
</li>
<li>
<p>A representation of individual streams that make up that state, handling the specific logic for each stream type (like React components).</p>
</li>
</ol>
<p>Each Stream is defined and stored in Elasticsearch. We recognized a disconnect between data management and state changes in our existing code, so we designed each stream to manage both. This fits naturally with the <a href="https://www.martinfowler.com/eaaCatalog/activeRecord.html">Active Record pattern</a>, where a class encapsulates both domain logic and persistence.</p>
<p>To make the system easier to extend and the state model’s interface simpler, we implemented an abstract Active Record class using the <a href="https://refactoring.guru/design-patterns/template-method">Template Method pattern</a>, clearly defining the interface new stream types must follow.</p>
<p>We did have some concerns that adopting these more advanced patterns—like reconciliation, the Active Record, and Template Method—might make it harder for new or less experienced engineers to get up to speed. While the code would be cleaner and more straightforward for those familiar with the patterns, we worried it could create a barrier for juniors or newcomers unfamiliar with these concepts.</p>
<p>In practice, however, we found the opposite: the code became easier to follow because the patterns provided a clear, consistent structure. More importantly, the architectural choices helped keep the focus on the domain itself, rather than on complex implementation details, making it more approachable for the whole team. The patterns are there but the code doesn't talk about them, it talks about the domain.</p>
<h2>How We Structured the System</h2>
<p>When a request hits one of our API endpoints in Kibana, the handler performs basic request validation, then passes the request to the Streams Client. The client’s job is to translate the request into one or more Change objects. Each Change represents the creation, modification, or deletion of a Stream.</p>
<p>These Change objects are then passed to a central class we introduced called <code>State</code>, which plays two key roles:</p>
<ul>
<li>
<p>It holds the set of Stream instances that make up the current version of the system.</p>
</li>
<li>
<p>It orchestrates the pipeline that applies changes and transitions from one state to another.</p>
</li>
</ul>
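<p>As a rough illustration (simplified, hypothetical names rather than the actual Kibana source), the relationship between Changes, Streams, and the <code>State</code> class might look like this:</p>
<pre><code class="language-typescript">// Illustrative sketch only; the real implementation lives in Kibana.
type Change =
  | { type: 'upsert'; name: string; definition: object }
  | { type: 'delete'; name: string };

interface Stream {
  name: string;
  clone(): Stream;
}

// State holds the set of Streams that make up one version of the system.
class State {
  constructor(private streams: { [name: string]: Stream }) {}

  has(name: string): boolean {
    return this.streams[name] !== undefined;
  }

  // Each Stream is responsible for cloning itself.
  clone(): State {
    const copy: { [name: string]: Stream } = {};
    for (const name of Object.keys(this.streams)) {
      copy[name] = this.streams[name].clone();
    }
    return new State(copy);
  }

  // Applying a Change never mutates the current state; it yields a new one.
  withChange(change: Change): State {
    const next = this.clone();
    if (change.type === 'delete') {
      delete next.streams[change.name];
    }
    // 'upsert' handling omitted; each Stream type defines its own reaction.
    return next;
  }
}
</code></pre>
<p>Keeping the starting state immutable is what later makes diffing, validation, and rollback planning possible.</p>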
<p>Let’s walk through the key phases the State class manages when applying a change.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/flow.png" alt="Flowchart of the phases" /></p>
<h3>Loading the Starting State</h3>
<p>First, the State class loads the current system state by reading the stored Stream definitions from Elasticsearch. This becomes our reference point for all subsequent comparisons—used during validation, diffing, and action planning.</p>
<h3>Applying Changes</h3>
<p>We begin by cloning the starting state. Each Stream is responsible for cloning itself.
Then we process each incoming Change:</p>
<ul>
<li>
<p>The change is presented to all Streams in the current state (creating a new one if needed).</p>
</li>
<li>
<p>Each Stream can react by updating itself and optionally emitting cascading changes—additional changes that ripple through related Streams.</p>
</li>
<li>
<p>Cascading changes are processed in a loop until no more are generated (or until we hit a safety threshold).</p>
</li>
</ul>
<p>We then move to the next requested Change.<br />
If any requested or cascading Change cannot be applied safely, the system aborts the entire request to prevent partial updates.</p>
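<p>As a sketch of that loop (hypothetical, simplified names; the real logic also tracks which Streams were touched), cascading changes can be drained with a queue and a safety threshold:</p>
<pre><code class="language-typescript">// Illustrative sketch of the cascading-change loop.
const MAX_CASCADES = 100; // safety threshold against endless ripple effects

type Change = { type: 'upsert' | 'delete'; name: string };

interface StreamLike {
  name: string;
  // A stream reacts to a change and may emit further cascading changes.
  applyChange(change: Change): Change[];
}

function applyChanges(streams: StreamLike[], requested: Change[]): void {
  for (const change of requested) {
    const queue: Change[] = [change];
    let processed = 0;
    while (queue.length > 0) {
      processed += 1;
      if (processed > MAX_CASCADES) {
        throw new Error('Too many cascading changes; aborting the request');
      }
      const current = queue.shift()!;
      for (const stream of streams) {
        queue.push(...stream.applyChange(current));
      }
    }
  }
}
</code></pre>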
<h3>Validating the Desired State</h3>
<p>Once we’ve applied all Changes and cascading effects, we run validations to ensure the resulting configuration is safe and consistent.</p>
<p>Each Stream is asked to validate itself in the context of the full desired state and the original starting state. This allows for both localized checks (within a Stream) and broader coordination (between related Streams). If any validation fails, we abort the request.</p>
<h3>Determining Actions</h3>
<p>Next, each Stream is asked to determine what Elasticsearch actions are needed to move from the starting state to the desired state. This is the first point where the system needs to consider which Elasticsearch resources back an individual Stream.</p>
<p>If the request is a dry run, we stop here and return a summary of what would happen. If it’s meant to be executed, we move to the next phase.</p>
<h3>Planning and Execution</h3>
<p>The list of Elasticsearch actions is handed off to a dedicated class called <code>ExecutionPlan</code>. This class handles:</p>
<ul>
<li>
<p>Resolving cross-stream dependencies that individual Streams cannot address alone.</p>
</li>
<li>
<p>Organizing the actions into the correct order to ensure safe application (e.g. to avoid data loss when routing rules change).</p>
</li>
<li>
<p>Maximizing parallelism wherever possible within those ordering constraints.</p>
</li>
</ul>
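<p>A minimal sketch of such an executor (hypothetical names; the real <code>ExecutionPlan</code> also resolves cross-stream dependencies) runs every action whose prerequisites are already complete, in parallel waves:</p>
<pre><code class="language-typescript">// Illustrative sketch: dependency-ordered execution with parallel waves.
interface EsAction {
  id: string;
  dependsOn: string[]; // ids of actions that must complete first
  run: () => unknown;  // typically returns a Promise
}

async function executePlan(actions: EsAction[]) {
  const done = new Set();
  let remaining = actions.slice();
  while (remaining.length > 0) {
    // Every action whose dependencies are satisfied can run in parallel.
    const ready = remaining.filter((a) => a.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error('Circular dependency between actions');
    }
    await Promise.all(ready.map((a) => a.run()));
    for (const a of ready) {
      done.add(a.id);
    }
    remaining = remaining.filter((a) => !done.has(a.id));
  }
}
</code></pre>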
<p>If the plan executes successfully, we return a success response from the API.</p>
<h3>Handling Failures</h3>
<p>If the plan fails during execution, the <code>State</code> class attempts a rollback—it computes a new plan that should return the system to its starting state (by going from the desired state to the starting state instead) and tries to execute it.</p>
<p>If the rollback also fails, we have a fallback mechanism: a “reset” operation that re-applies the known-good state stored in Elasticsearch, skipping diffing entirely.</p>
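<p>The overall failure-handling strategy can be sketched as nested fallbacks (a simplified, hypothetical shape; the real code derives the forward and backward plans by diffing states):</p>
<pre><code class="language-typescript">// Illustrative sketch of commit with rollback and reset fallbacks.
type Plan = () => unknown; // typically returns a Promise

async function commitWithFallbacks(
  applyForward: Plan,  // starting state to desired state
  applyBackward: Plan, // desired state back to starting state (rollback)
  reset: Plan          // re-apply the known-good stored state
) {
  try {
    await applyForward();
  } catch {
    try {
      await applyBackward();
    } catch {
      await reset();
    }
  }
}
</code></pre>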
<h3>A Closer Look at the Stream Active Record Classes</h3>
<p>All Streams in the State are subclasses of an abstract class called <code>StreamActiveRecord</code>. This class is responsible for:</p>
<ul>
<li>
<p>Tracking the change status of the Stream</p>
</li>
<li>
<p>Routing change application, validation, and action determination to the specialized template-method hooks that its concrete subclasses implement, based on the change status.</p>
</li>
</ul>
<p>These hooks are as follows:</p>
<ul>
<li>
<p>Apply upsert / Apply deletion</p>
</li>
<li>
<p>Validate upsert / Validate deletion</p>
</li>
<li>
<p>Determine actions for creation / change / deletion</p>
</li>
</ul>
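<p>In simplified form (hypothetical names, not the actual Kibana source), the Template Method shape looks like this: the base class routes by change status, and each concrete stream type only fills in the hooks:</p>
<pre><code class="language-typescript">// Illustrative sketch of the abstract Active Record layout.
type ChangeStatus = 'unchanged' | 'upserted' | 'deleted';

abstract class StreamActiveRecordSketch {
  protected status: ChangeStatus = 'unchanged';

  constructor(public readonly name: string) {}

  markAs(status: ChangeStatus): void {
    this.status = status;
  }

  // Generic routing lives in the base class...
  validate(): string[] {
    switch (this.status) {
      case 'upserted':
        return this.validateUpsert();
      case 'deleted':
        return this.validateDeletion();
      default:
        return [];
    }
  }

  // ...while concrete stream types implement the specialized hooks.
  protected abstract validateUpsert(): string[];
  protected abstract validateDeletion(): string[];
}

class ExampleStream extends StreamActiveRecordSketch {
  protected validateUpsert(): string[] {
    return this.name.length === 0 ? ['Stream name must not be empty'] : [];
  }
  protected validateDeletion(): string[] {
    return [];
  }
}
</code></pre>
<p>The same routing applies to change application and action determination; validation is shown here for brevity.</p>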
<p>With this architecture in place, we’ve created a clear, phased, and declarative flow from input to action—one that’s modular, testable, and resilient to failure. It cleanly separates generic stream lifecycle logic (like change tracking and orchestration) from stream-specific behaviors (such as what “upsert” means for a given Stream type), enabling a highly extensible system. This structure allows us to isolate side effects, validate with confidence, and reason more clearly about system-wide behavior—all while supporting dry runs and bulk operations.</p>
<p>Now that we’ve covered how it works, let’s explore what this unlocks—the capabilities, safety guarantees, and new workflows this design makes possible.</p>
<h2>What This Unlocks</h2>
<p>The reconciliation based design we landed on isn’t just easier to reason about—it directly addresses many of the core limitations we faced in the earlier version of the system.</p>
<p><strong>Bulk operations and dry runs, by design</strong></p>
<p>One of our key goals was to support bulk configuration changes across many Streams in a single request. The previous codebase made this difficult because the side effects were interleaved with decision-making logic, making it risky to apply multiple changes at once.</p>
<p>Now, bulk changes are the default. The <code>State</code> class handles any number of changes, tracks cascading effects automatically, and validates the end result as a whole. Whether you're updating one Stream or fifty, the pipeline handles it consistently.</p>
<p>Dry runs were another desired feature. Because actions are now computed in a side-effect-free step—before anything is sent to Elasticsearch—we can generate a full preview of what would happen. This includes both which Streams would change and what specific Elasticsearch operations would be performed. That visibility helps users and developers make confident, informed decisions.</p>
<p><strong>Easier debugging, better diagnostics</strong></p>
<p>In the old system, debugging required reconstructing the execution context and piecing together side effects. Now, every phase of the pipeline is explicit and can be tested and inspected in isolation.</p>
<p>Because validation and Elasticsearch actions are now tied directly to the Stream definition and lifecycle, any inconsistencies or errors are easier to trace to their source.</p>
<p><strong>Validated planning before execution</strong></p>
<p>Because we now validate and plan <em>before</em> making any changes, the risk of leaving the system in an inconsistent or partially-updated state has been greatly reduced. All actions are determined in advance, and only executed once we’re confident the entire set of changes is valid and coherent.</p>
<p>And if something does go wrong during execution, we can lean on the fact that both the starting and desired states are fully modeled in memory. This allows us to generate a rollback plan automatically, and when that’s not possible, fall back to a complete reset from the stored state. In short: safety is now built in, not bolted on.</p>
<p><strong>Extensible by default</strong></p>
<p>Adding a new type of Stream used to mean editing logic scattered across multiple files.
Now, it’s a focused, well-defined task. You subclass <code>StreamActiveRecord</code> and implement the handful of lifecycle hooks.</p>
<p>That’s it. The orchestration, tracking, and dependency handling are already wired up. That also means it’s easier to onboard new developers or experiment with new Stream types without fear of breaking unrelated parts of the system.</p>
<p><strong>Easier to test</strong></p>
<p>Because each Stream is now encapsulated and has clear, isolated responsibilities, testing is much simpler. You can test individual Stream classes by simulating specific inputs and asserting the resulting cascading changes, validation results, or Elasticsearch actions. There's no need to spin up a full end-to-end environment just to test a single validation.</p>
<h2>What’s Next</h2>
<p>At Elastic, we live by our Source Code, which states “Progress, SIMPLE Perfection”—a reminder to favor steady, incremental improvement over chasing perfection.</p>
<p>This new system is a solid foundation—but it’s only the beginning. Our focus so far has been on clarity, safety, and extensibility, and while we’ve addressed some long-standing pain points, there’s still plenty of room to evolve.</p>
<h3>Continuous improvement ahead</h3>
<p>We intentionally shipped this work with a sharp scope and have already identified several enhancements that we will be adding in the coming weeks:</p>
<ul>
<li>
<p><strong>Introduce a locking layer</strong><br />
To safely handle concurrent updates, we plan to introduce a locking mechanism that prevents race conditions during parallel modifications.</p>
</li>
<li>
<p><strong>Expose bulk and dry-run features via our APIs</strong><br />
The <code>State</code> class already supports them—now it’s time to make those capabilities available to users.</p>
</li>
<li>
<p><strong>Improve debugging output</strong><br />
Now that state transitions are modeled explicitly, we can expose clearer diagnostics to help both users and developers reason about changes.</p>
</li>
<li>
<p><strong>Avoid redundant Elasticsearch requests</strong><br />
Currently we make multiple redundant requests during validation. Introducing a lightweight in-memory cache would let us avoid reloading the same resource more than once.</p>
</li>
<li>
<p><strong>Improve access controls</strong><br />
Currently, we rely on Elasticsearch to enforce access control. Because a single change can touch many different resources, it’s difficult to determine up front which privileges are required. We plan to extend our action definitions with privilege metadata, enabling us to validate the full set of required permissions before executing any actions. This will let us detect and report missing privileges early—before the plan runs.</p>
</li>
<li>
<p><strong>Add APM instrumentation</strong><br />
With the system structured in distinct, well-defined phases, we’re now in a great position to add performance instrumentation. This will help us identify bottlenecks and improve responsiveness over time.</p>
</li>
</ul>
<h3>Revisiting responsibilities</h3>
<p>As our orchestration becomes more robust, we’re also re-evaluating where it should live. Large-scale bulk operations, for example, might eventually be better handled closer to Elasticsearch itself, where we can benefit from greater atomicity and tighter performance guarantees. That kind of deep integration would have been premature earlier on—when we were still figuring out the right abstractions and phases for the system. But now that the design has stabilized, we’re in a much better position to start that conversation.</p>
<h3>Built to evolve</h3>
<p>We designed this system with adaptability in mind. Whether improvements come in the form of internal refactors, better developer experience, or deeper collaboration with Elasticsearch, we’re in a strong position to keep evolving. The architecture is modular by design—and that gives us both the stability to rely on and the flexibility to grow.</p>
<h2>Wrapping Up</h2>
<p>Building robust, maintainable systems is never just about code — it’s about aligning architecture with the evolving needs and direction of the product. Our journey refactoring Streams reaffirmed that a thoughtful, phased approach not only improves technical clarity but also empowers teams to move faster and innovate more confidently.</p>
<p>If you’re working on complex systems facing similar challenges—whether tangled logic, unpredictable side effects, or the need for extensibility—you’re not alone. We hope our story offers some useful insights and inspiration as you shape your own path forward.</p>
<p>We welcome feedback and collaboration from the community—whether it’s in the form of questions, ideas, or code.</p>
<p>To learn more about Streams, explore:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
<p><em>Check out the</em> <a href="https://github.com/elastic/kibana/pull/211696"><em>pull request on GitHub</em></a> to dive into the code or join the conversation.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Future-proof your logs with ecs@mappings template]]></title>
            <link>https://www.elastic.co/observability-labs/blog/future-proof-your-logs-with-ecs-mappings-template</link>
            <guid isPermaLink="false">future-proof-your-logs-with-ecs-mappings-template</guid>
            <pubDate>Mon, 23 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore how the ecs@mappings component template in Elasticsearch simplifies data management by providing a centralized, official definition of Elastic Common Schema (ECS) mappings. Learn about its benefits, including reduced configuration hassles, improved data integrity, and enhanced performance for both integration developers and community users. Discover how this feature streamlines ECS field support across Elastic Agent integrations and future-proofs your data streams.]]></description>
            <content:encoded><![CDATA[<p>As the Elasticsearch ecosystem evolves, so do the tools and methodologies designed to streamline data management. One advancement that will significantly benefit our community is the <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template.</p>
<p><a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS (Elastic Common Schema)</a> is a standardized data model for logs and metrics. It defines a set of <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">common field names and data types</a> that help ensure consistency and compatibility.</p>
<p><code>ecs@mappings</code> is a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-component-template.html">component template</a> that offers an <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">Elastic-maintained</a> definition of ECS mappings. Each Elasticsearch release contains an always up-to-date definition of all ECS fields.</p>
<h3>Elastic Common Schema and OpenTelemetry</h3>
<p>Elastic will preserve our users' investment in Elastic Common Schema by <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">donating</a> ECS to OpenTelemetry. Elastic participates and collaborates with the OTel community to merge ECS and OpenTelemetry's Semantic Conventions over time.</p>
<h2>The Evolution of ECS Mappings</h2>
<p>Historically, users and integration developers have defined ECS (Elastic Common Schema) mappings manually within individual index templates and packages, each meticulously listing its fields. Although straightforward, this approach proved time-consuming and challenging to maintain.</p>
<p>To tackle this challenge, integration developers moved towards two primary methodologies:</p>
<ol>
<li>Referencing ECS mappings</li>
<li>Importing ECS mappings directly</li>
</ol>
<p>These methods were steps in the right direction but introduced their own challenges, such as the maintenance cost of keeping the ECS mappings up to date with Elasticsearch changes.</p>
<h2>Enter ecs@mappings</h2>
<p>The <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template supports all the field definitions in ECS, leveraging naming conventions and a set of dynamic templates.</p>
<p>Elastic started shipping the <code>ecs@mappings</code> component template with Elasticsearch v8.9.0, including it in the <a href="https://github.com/elastic/elasticsearch/blob/v8.14.2/x-pack/plugin/core/template-resources/src/main/resources/logs%40template.json"><code>logs-*-*</code> index template</a>.</p>
<p>With Elasticsearch v8.13.0, Elastic now includes <code>ecs@mappings</code> in the index templates of all the Elastic Agent integrations.</p>
<p>This move was a breakthrough because:</p>
<ul>
<li><strong>Centralized</strong> and official: With ecs@mappings, we now have an official definition of ECS mappings.</li>
<li><strong>Out-of-the-box functionality</strong>: ECS mappings are readily available, reducing the need for additional imports or references.</li>
<li><strong>Simplified maintenance</strong>: The need to manually keep up with ECS changes has diminished since the template from Elasticsearch itself remains up-to-date.</li>
</ul>
<h3>Enhanced Consistency and Reliability</h3>
<p>With <code>ecs@mappings</code>, ECS mappings become the single source of truth. This unified approach means fewer discrepancies and higher consistency in data streams across integrations.</p>
<h2>How Community Users Benefit</h2>
<p>Community users stand to gain significantly from the adoption of <code>ecs@mappings</code>. Here are the key advantages:</p>
<ol>
<li><strong>Reduced configuration hassles</strong>: Whether you are an advanced user or just getting started, the simplified setup means fewer configuration steps and fewer opportunities for errors.</li>
<li><strong>Improved data integrity</strong>: Since ecs@mappings ensures that field definitions are accurate and up-to-date, data integrity is maintained effortlessly.</li>
<li><strong>Better performance</strong>: With less overhead in maintaining and referencing ECS fields, your Elasticsearch operations run more smoothly.</li>
<li><strong>Enhanced documentation and discoverability</strong>: As we standardize ECS mappings, the documentation can be centralized, making it easier for users to discover and understand ECS fields.</li>
</ol>
<p>Let's explore how the <code>ecs@mappings</code> component template helps users achieve these benefits.</p>
<h3>Reduced configuration hassles</h3>
<p>Modern Elasticsearch versions come with out-of-the-box full ECS field support (see the “requirements” section later for specific versions).</p>
<p>For example, the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> installed on a supported Elasticsearch cluster already includes the <code>ecs@mappings</code> component template in its index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      ...,
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
    ...
</code></pre>
<p>There is no need to import or define any ECS field.</p>
<h3>Improved data integrity</h3>
<p>The <code>ecs@mappings</code> component template supports all the existing ECS fields. If you use any ECS field in your document, it will automatically be mapped to the expected type.</p>
<p>To ensure that <code>ecs@mappings</code> is always up to date with the <a href="https://github.com/elastic/ecs/">ECS repository</a>, we set up a daily <a href="https://github.com/elastic/elasticsearch/blob/6ae9dbfda7d71ae3f1bd2bddf9334d37b3294632/x-pack/plugin/stack/src/javaRestTest/java/org/elasticsearch/xpack/stack/EcsDynamicTemplatesIT.java#L49">automated test</a> to ensure that the component template supports all fields.</p>
<h3>Better Performance</h3>
<h4>Compact definitions</h4>
<p>The ECS field definition is exceptionally compact; at the time of this writing, it is 228 lines long and supports all ECS fields. To learn more, see the <code>ecs@mappings</code> component template <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">source code</a>.</p>
<p>It relies on naming conventions and uses <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.14/dynamic-templates.html">dynamic templates</a> to achieve this compactness.</p>
<h4>Lazy mapping</h4>
<p>Elasticsearch only adds existing document fields to the mapping, thanks to dynamic templates. The lazy mapping keeps memory overhead at a minimum, improving cluster performance and making field suggestions more relevant.</p>
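<p>As a simplified illustration of the approach (not the actual <code>ecs@mappings</code> source, which covers many more conventions), a single dynamic template can map every field whose path ends in <code>.ip</code> to the <code>ip</code> type by naming convention:</p>
<pre><code class="language-json">{
  &quot;dynamic_templates&quot;: [
    {
      &quot;ecs_ip&quot;: {
        &quot;path_match&quot;: &quot;*.ip&quot;,
        &quot;mapping&quot;: {
          &quot;type&quot;: &quot;ip&quot;
        }
      }
    }
  ]
}
</code></pre>
<p>Elasticsearch evaluates the template lazily: the <code>ip</code> mapping is only created when a matching field actually appears in an ingested document.</p>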
<h3>Enhanced documentation and discoverability</h3>
<p>All Elastic Agent integrations are migrating to the <code>ecs@mappings</code> component template. These integrations no longer need to add and maintain ECS field mappings and can reference the official <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">ECS Field Reference</a> or the ECS source code in the Git repository: <a href="https://github.com/elastic/ecs/">https://github.com/elastic/ecs/</a>.</p>
<h2>Getting started</h2>
<h3>Requirements</h3>
<p>To leverage the <code>ecs@mappings</code> component template, ensure the following stack version:</p>
<ul>
<li><strong>8.9.0</strong>: if your data stream uses the logs index template or you define your own index template.</li>
<li><strong>8.13.0</strong>: if your data stream uses the index template of an Elastic Agent integration.</li>
</ul>
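<p>You can verify that the component template is available on your cluster with a standard Elasticsearch API request:</p>
<pre><code class="language-json">GET _component_template/ecs@mappings
</code></pre>
<p>If the request returns the template definition, your stack meets the requirements.</p>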
<h3>Example</h3>
<p>We will use the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> to show you how <code>ecs@mappings</code> can handle mappings for any ECS field out of the box.</p>
<p>Imagine you want to ingest the following log event using the Custom AWS Logs integration:</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<h4>Dev Tools</h4>
<p>Kibana offers an excellent tool for experimenting with the Elasticsearch API: the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Dev Tools console</a>. With Dev Tools, you can run API requests quickly and without much friction.</p>
<p>To open the Dev Tools:</p>
<ul>
<li>Open <strong>Kibana</strong></li>
<li>Select <strong>Management &gt; Dev Tools &gt; Console</strong></li>
</ul>
<h4>Elasticsearch version &lt; 8.13</h4>
<p>On Elasticsearch versions before 8.13, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        }
      }
    }
  ]
}
</code></pre>
<p>As you can see, it does not include the <code>ecs@mappings</code> component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code>GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;keyword&quot;,
            &quot;ignore_above&quot;: 1024
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;long&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>These mappings do not align with ECS, so users and developers had to maintain them.</p>
<h4>Elasticsearch version &gt;= 8.13</h4>
<p>On Elasticsearch versions 8.13 and newer, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        },
        &quot;ignore_missing_component_templates&quot;: [
          &quot;logs-aws_logs.generic@custom&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>The index template for <code>logs-aws_logs.generic</code> now includes the <code>ecs@mappings</code> component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code class="language-json">GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;wildcard&quot;,
            &quot;fields&quot;: {
              &quot;text&quot;: {
                &quot;type&quot;: &quot;match_only_text&quot;
              }
            }
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;float&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>In Elasticsearch 8.13, fields like <code>command_line</code> and <code>custom_score</code> get their definition from ECS out-of-the-box.</p>
<p>These mappings align with ECS, so users and developers no longer have to maintain them. The same applies to the hundreds of other field definitions in the Elastic Common Schema. You get all of this by including a single component template of roughly 200 lines in your data stream.</p>
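<p>If you define your own index template rather than using an integration, you can pull in the same definitions by adding <code>ecs@mappings</code> to <code>composed_of</code>. A minimal sketch, where <code>my-logs</code> is a hypothetical template name:</p>
<pre><code class="language-json">PUT _index_template/my-logs
{
  &quot;index_patterns&quot;: [&quot;my-logs-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;composed_of&quot;: [&quot;ecs@mappings&quot;],
  &quot;priority&quot;: 500
}
</code></pre>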
<h2>Caveats</h2>
<p>Some aspects of how the <code>ecs@mappings</code> component template deals with data types are worth mentioning.</p>
<h3>ECS types are not enforced</h3>
<p>The <code>ecs@mappings</code> component template does not contain mappings for ECS fields where dynamic mapping already uses the correct field type. Therefore, if you send a field value with a compatible but wrong type, Elasticsearch will not coerce the value.</p>
<p>For example, if you send the following document with a <code>faas.coldstart</code> field (defined as <code>boolean</code> in ECS):</p>
<pre><code class="language-json">{
  &quot;faas.coldstart&quot;: &quot;true&quot;
}
</code></pre>
<p>Elasticsearch will map <code>faas.coldstart</code> as a <code>keyword</code> and not a <code>boolean</code>. Therefore, you need to make sure that the values you ingest to Elasticsearch use the right JSON field types, according to how they’re defined in ECS.</p>
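<p>Sending the value with its proper JSON type instead lets dynamic mapping produce the ECS-expected <code>boolean</code>:</p>
<pre><code class="language-json">{
  &quot;faas.coldstart&quot;: true
}
</code></pre>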
<p>This is the tradeoff for having a compact and efficient <code>ecs@mappings</code> component template. It also improves compatibility when dealing with a mix of ECS and custom fields, because documents won’t be rejected if the types are not consistent with the ones defined in ECS.</p>
<h2>Conclusion</h2>
<p>The introduction of <code>ecs@mappings</code> marks a significant improvement in managing ECS mappings within Elasticsearch. By centralizing and streamlining these definitions, we can ensure higher consistency, reduced maintenance, and better overall performance.</p>
<p>Whether you're an integration developer or a community user, moving to <code>ecs@mappings</code> represents a step towards more efficient and reliable Elasticsearch operations. As we continue incorporating feedback and evolving our tools, your journey with Elasticsearch will only get smoother and more rewarding.</p>
<p><strong>Join the Conversation</strong></p>
<p>Do you have questions or feedback about <code>ecs@mappings</code>? Join our helpful community of users on our <a href="https://discuss.elastic.co/">discussion forum</a> or <a href="https://ela.st/slack">Slack instance</a> and share your experiences. Your input is invaluable in helping us fine-tune these advancements for the entire community.</p>
<p>Happy mapping!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/future-proof-your-logs-with-ecs-mappings-template/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting more from your logs with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/getting-more-from-your-logs-with-opentelemetry</link>
            <guid isPermaLink="false">getting-more-from-your-logs-with-opentelemetry</guid>
            <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to evolve beyond basic log ingest by leveraging OpenTelemetry for ingestion, structured logging, geographic enrichment, and ES|QL analytics. Transform raw log data into actionable intelligence with practical examples and proactive observability strategies.]]></description>
            <content:encoded><![CDATA[<h1>Getting more from your logs with OpenTelemetry</h1>
<p>Most teams today still use their logging tools the way we have for decades: as a simple searchable lake, essentially grepping for logs, just from a centralized platform. There’s nothing wrong with this, and a centralized logging platform delivers a lot of value on its own. But how can you start to evolve beyond this basic log-and-search use case? Where can you be more effective with your incident investigations? In this blog, we start from where most of our customers are today and give you some practical tips on how to move beyond the simple logging use case.</p>
<h2>Ingestion</h2>
<p>Let's start at the beginning: ingest. Many of you are likely using older tools for ingestion today. If you want to be more forward-thinking here, it’s time to introduce you to OpenTelemetry. OpenTelemetry was once not very mature or capable for logging, but things have changed significantly, and Elastic has been working particularly hard to improve the logging capabilities in OpenTelemetry. So let's start by exploring how to bring logs into Elastic via the OpenTelemetry collector.</p>
<p>First, if you want to follow along, create a host to run the log generator and the OpenTelemetry collector.</p>
<p>Follow the instructions here to get the log generator running:</p>
<p><a href="https://github.com/davidgeorgehope/log-generator-bin/">https://github.com/davidgeorgehope/log-generator-bin/</a></p>
<p>To get the OpenTelemetry collector up and running in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a>, you can click on Add Data from the bottom left, then 'host' and finally 'opentelemetry'</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image14.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image7.png" alt="" /></p>
<p>Follow the instructions but don’t start the collector just yet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image16.png" alt="" /></p>
<p>Our host here is running a three-tier application with an Nginx frontend, a backend service, and a MySQL database. So let's start by bringing the logs into Elastic.</p>
<p>First, we’ll install the Elastic Distributions of OpenTelemetry. Before starting the collector, we will make a small change to the OpenTelemetry configuration file to expand the directories it searches for logs. Edit otel.yml using vi or your favorite editor:</p>
<pre><code class="language-bash">vi otel.yml
</code></pre>
<p>Instead of simply <code>/var/log/*.log</code>, we will use <code>/var/log/**/*.log</code> to bring in all our log files.</p>
<pre><code class="language-yaml">receivers:
  # Receiver for platform specific log files
  filelog/platformlogs:
    include: [ /var/log/**/*.log ]
    retry_on_failure:
      enabled: true
    start_at: end
    storage: file_storage
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>And we can see in Discover that the logs are coming in:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image8.png" alt="" /></p>
<p>One thing is immediately noticeable: without changing anything, we automatically get a bunch of useful additional information, such as the OS name and CPU details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image12.png" alt="" /></p>
<p>The OpenTelemetry collector has automatically started to enrich our logs, making them more useful for additional processing. But we can do significantly better!</p>
<p>To start with, we want to give our logs some structure. Let's edit the otel.yml file and add some OTTL to extract key data from our NGINX logs.</p>
<pre><code class="language-yaml">  transform/parse_nginx:
    trace_statements: []
    metric_statements: []
    log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
.....

   logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx,resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
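<p>To make the extraction concrete, here is a made-up access-log line in the shape these patterns expect, along with the attributes they would pull out of it (your log generator’s exact format may differ):</p>
<pre><code>10.0.0.1 - alice [11/Sep/2025:13:16:00 +0000] &quot;GET /api/products?req_id=abc123 HTTP/1.1&quot; 200 512
</code></pre>
<pre><code class="language-json">{
  &quot;client_ip&quot;: &quot;10.0.0.1&quot;,
  &quot;user&quot;: &quot;alice&quot;,
  &quot;timestamp_raw&quot;: &quot;11/Sep/2025:13:16:00 +0000&quot;,
  &quot;method&quot;: &quot;GET&quot;,
  &quot;path&quot;: &quot;/api/products&quot;,
  &quot;req_id&quot;: &quot;abc123&quot;,
  &quot;status&quot;: &quot;200&quot;,
  &quot;size&quot;: &quot;512&quot;
}
</code></pre>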
<p>Now when we start the Otel collector with this new configuration</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>We will see that we now have structured logs!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image17.png" alt="" /></p>
<h2>Store and Optimize</h2>
<p>To make sure all this additional structured data isn’t blowing out your budget, there are a few things you can do to maximize storage efficiency.</p>
<p>For example, you can use the filter processor in the OTel collector to granularly filter or drop irrelevant records and attributes, controlling the volume of data leaving the collector.</p>
<pre><code class="language-yaml">processors:
  filter/drop_logs_without_user_attributes:
    logs:
      log_record:
        - 'attributes[&quot;user&quot;] == nil'
  filter/drop_200_logs:
    logs:
      log_record:
        - 'attributes[&quot;status&quot;] == &quot;200&quot;'

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, filter/drop_logs_without_user_attributes, filter/drop_200_logs, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>The filter processor helps reduce noise, for example if you want to drop debug logs or logs from a noisy service. These are great ways to keep a lid on your observability spend.</p>
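<p>The same pattern works for severity-based filtering. For example, a filter that drops anything below INFO severity might look like this (a sketch using the filter processor’s OTTL conditions):</p>
<pre><code class="language-yaml">processors:
  filter/drop_debug_logs:
    logs:
      log_record:
        - 'severity_number &lt; SEVERITY_NUMBER_INFO'
</code></pre>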
<p>Additionally, for your most critical flows and logs where you don’t want to drop any data, Elastic has you covered: in version 9.x of Elastic, LogsDB is now switched on by default.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image15.png" alt="" /></p>
<p>With LogsDB, Elastic has reduced the storage footprint of log data in Elasticsearch by up to 65%, allowing you to store more observability and security data without exceeding your budget, while keeping all data accessible and searchable.</p>
<p>LogsDB achieves this by leveraging advanced compression techniques like ZSTD, delta encoding, and run-length encoding. It also reconstructs the <code>_source</code> field on demand (synthetic <code>_source</code>), saving about 40% more storage by not retaining the original JSON document. Synthetic <code>_source</code> represents the introduction of columnar storage within Elasticsearch.</p>
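<p>On self-managed clusters where you want to opt an index template into LogsDB explicitly, you can set the index mode in the template settings. A minimal sketch, where <code>my-app-logs</code> is a hypothetical template name:</p>
<pre><code class="language-json">PUT _index_template/my-app-logs
{
  &quot;index_patterns&quot;: [&quot;logs-myapp-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  },
  &quot;priority&quot;: 500
}
</code></pre>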
<h2>Analytics</h2>
<p>So we have our data in Elastic. It’s structured, and it conforms to the idea of a wide-event log, since it carries lots of good context: user IDs, request IDs, and data captured at the start of a request. Next, we’re going to look at the analytics side. First, let's take a stab at counting the number of errors per user in our application.</p>
<pre><code class="language-esql">FROM logs-generic.otel-default
| WHERE log.file.name == &quot;access.log&quot;
| WHERE attributes.status &gt;= &quot;400&quot;
| STATS error_count = COUNT(*) BY attributes.user
| SORT error_count DESC
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image9.png" alt="" /></p>
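<p>A small variation on the same query can surface which endpoints are failing, grouping by the <code>path</code> attribute we extracted earlier with OTTL:</p>
<pre><code class="language-esql">FROM logs-generic.otel-default
| WHERE log.file.name == &quot;access.log&quot; AND attributes.status &gt;= &quot;400&quot;
| STATS error_count = COUNT(*) BY attributes.path
| SORT error_count DESC
| LIMIT 10
</code></pre>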
<p>It’s pretty easy now to save this and put it on a dashboard; we just click the save button:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image5.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image6.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image3.png" alt="" /></p>
<p>Next, let's put something together to show the global impact. First, we will update our collector config to enrich our log data with geo location.</p>
<p>Update the OTTL configuration with this new line:</p>
<pre><code class="language-yaml">   log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
          - set(attributes[&quot;source.address&quot;], attributes[&quot;client_ip&quot;]) where attributes[&quot;client_ip&quot;] != nil
</code></pre>
<p>Next, add a new processor (you will need to download the GeoIP database from MaxMind):</p>
<pre><code class="language-yaml">geoip:
  context: record
  source:
    from: attributes
  providers:
    maxmind:
      database_path: /opt/geoip/GeoLite2-City.mmdb
</code></pre>
<p>And add it to the log pipeline after <code>transform/parse_nginx</code>:</p>
<pre><code class="language-yaml">service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, geoip, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>Once the data starts flowing we can add a map visualization:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image2.png" alt="" /></p>
<p>Add a layer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image4.png" alt="" /></p>
<p>Use ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image10.png" alt="" /></p>
<p>Use the following ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image13.png" alt="" /></p>
<p>And this should give you a map showing the locations of all your NGINX server requests!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image11.png" alt="" /></p>
<p>As you can see, analytics is a breeze with your new Otel data collection pipeline.</p>
<h2>Conclusion: Beyond log aggregation to operational intelligence</h2>
<p>The journey from basic log aggregation to structured, enriched observability represents more than a technical upgrade: it's a shift in how organizations approach system understanding and incident response. By adopting OpenTelemetry for ingestion, implementing intelligent filtering to manage costs, and leveraging LogsDB's storage optimizations, you're not just modernizing your ELK stack; you're building the foundation for proactive system management.</p>
<p>The structured logs, geographic enrichment, and analytical capabilities demonstrated here transform raw log data into actionable intelligence with ES|QL. Instead of reactive grepping through logs during incidents, you now have the infrastructure to identify patterns, track user journeys, and correlate issues across your entire stack before they become critical problems.</p>
<p>But here's the key question: Are you prepared to act on these insights? Having rich, structured data is only valuable if your organization can shift from a reactive &quot;find and fix&quot; mentality to a proactive &quot;predict and prevent&quot; approach. The real evolution isn't in your logging stack, it's in your operational culture.</p>
<p>Get started with this today in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/getting-more-from-your-logs-with-opentelemetry.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Getting started with OpenTelemetry instrumentation with a sample application]]></title>
            <link>https://www.elastic.co/observability-labs/blog/getting-started-opentelemetry-instrumentation-sample-app</link>
            <guid isPermaLink="false">getting-started-opentelemetry-instrumentation-sample-app</guid>
            <pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we’ll introduce you to a simple sample application: a UI for movie search, instrumented in different ways using Python, Go, Java, Node, and .NET. Additionally, we will show how to view your OpenTelemetry data in Elastic APM.]]></description>
            <content:encoded><![CDATA[<p>Application performance management (APM) has moved beyond traditional monitoring to become an essential tool for developers, offering deep insights into applications at the code level. With APM, teams can not only detect issues but also understand their root causes, optimizing software performance and end-user experiences. The modern landscape presents a wide range of APM tools and companies offering different solutions. Additionally, OpenTelemetry is becoming the open ingestion standard for APM. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data.</p>
<p>Elastic&lt;sup&gt;®&lt;/sup&gt; offers its own <a href="https://www.elastic.co/guide/en/apm/agent/index.html">APM Agents</a>, which can be used for instrumenting your code. In addition, Elastic also <a href="https://www.elastic.co/observability/opentelemetry">supports OpenTelemetry</a> natively.</p>
<p>Navigating the differences and understanding how to instrument applications using these tools can be challenging. That's where <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">our sample application, Elastiflix — a UI for movie search</a> — comes into play. We've crafted it to demonstrate the nuances of both OTEL and Elastic APM, guiding you through the process of the APM instrumentation and showcasing how you can use one or the other, depending on your preference.</p>
<h2>The sample application</h2>
<p>We deliberately kept the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">movie search UI really simple</a>. It displays some movies, has a search bar, and, at the time of writing, only one real functionality: you can add a movie to your list of favorites.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-1-luca.png" alt="luca" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-2-services.png" alt="services" /></p>
<h2>Services, languages, and instrumentation</h2>
<p>Our application has a few different services:</p>
<ul>
<li><strong>javascript-frontend:</strong> A React frontend, talking to the node service and Elasticsearch&lt;sup&gt;®&lt;/sup&gt;</li>
<li><strong>node-server:</strong> Node backend, talking to other backend services</li>
<li><strong>dotnet-login:</strong> A login service that returns a random username</li>
</ul>
<p>We reimplemented the “favorite” service in a few different languages, as we did not want to introduce additional complexity to the architecture of the application.</p>
<ul>
<li><strong>Go-favorite:</strong> A Go service that stores a list of favorite movies in Redis</li>
<li><strong>Java-favorite:</strong> A Java service that stores a list of favorite movies in Redis</li>
<li><strong>Python-favorite:</strong> A Python service that stores a list of favorite movies in Redis</li>
</ul>
<p>In addition, there are also some other supporting containers:</p>
<ul>
<li><strong>Movie-data-loader:</strong> Loads the movie database into your Elasticsearch cluster</li>
<li><strong>Redis:</strong> Used as a datastore for keeping track of the user’s favorites</li>
<li><strong>Locust:</strong> A load generator that talks to the node service to introduce artificial load</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-3-flowchart.png" alt="flowchart" /></p>
<p>The main difference compared to some other sample application repositories is that we’ve coded it in several languages, with each language version showcasing almost all possible types of instrumentation:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-4-types_of_instrumentation.png" alt="types of instrumentation" /></p>
<h3>Why this approach?</h3>
<p>While sample applications provide good insight into how tools work, they often showcase only one version, leaving developers to find all of the necessary modifications themselves. We've taken a different approach. By offering multiple versions, we intend to bridge the knowledge gap, making it straightforward for developers to see and comprehend the transition process from non-instrumented code to either Elastic or OTEL instrumented versions.</p>
<p>Instead of simply starting the already instrumented version, you can instrument the base version yourself, by following some of our other blogs. This will teach you much more than just looking at an already built version.</p>
<ul>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>Docker and Compose</li>
<li>Elastic Cloud Cluster (<a href="https://ela.st/freetrial">start your free trial</a>)</li>
</ul>
<p>Before starting the sample application, ensure you've set up your Elastic deployment details. Populate the .env file (located in the same directory as the compose files) with the necessary credentials. You can copy these from the Cloud UI and from within Kibana<sup>®</sup> under the path /app/home#/tutorial/apm.</p>
<p><strong>Cloud UI</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-5-deployment.png" alt="my deployment" /></p>
<p><strong>Kibana APM Tutorial</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-6-configure-agent.png" alt="Kibana APM Tutorial" /></p>
<pre><code class="language-bash">ELASTIC_APM_SERVER_URL=&quot;https://foobar.apm.us-central1.gcp.cloud.es.io&quot;
ELASTIC_APM_SECRET_TOKEN=&quot;secret123&quot;
ELASTICSEARCH_USERNAME=&quot;elastic&quot;
ELASTICSEARCH_PASSWORD=&quot;changeme&quot;
ELASTICSEARCH_URL=&quot;https://foobar.es.us-central1.gcp.cloud.es.io&quot;

</code></pre>
<h2>Starting the application</h2>
<p>You can start the sample app in three different ways, each corresponding to a different instrumentation scenario.</p>
<p>We provide public Docker images that you can use by supplying the --no-build flag. Otherwise, the images will be built from source on your machine, which will take around 5–10 minutes.</p>
<p><strong>1. Non-instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose.yml up -d --no-build
</code></pre>
<p><strong>2. Elastic instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose-elastic.yml up -d --no-build
</code></pre>
<p><strong>3. OpenTelemetry instrumented version</strong></p>
<pre><code class="language-bash">cd Elastiflix
docker-compose -f docker-compose-elastic-otel.yml up -d --no-build
</code></pre>
<p>After launching the desired version, explore the application at localhost:9000. We also deploy a load generator on localhost:8089, where you can increase the number of concurrent users. Note that the load generator talks directly to the Node.js backend service. If you want to generate RUM data from the JavaScript frontend, you have to manually browse to localhost:9000 and visit a few pages.</p>
<h2>Simulation and failure scenarios</h2>
<p>In the real world, applications are subject to varying conditions, random bugs, and misconfigurations. We've incorporated some of these to mimic potential real-life situations. You can find a list of possible environment variables <a href="https://github.com/elastic/observability-examples#scenario--feature-toggles">here</a>.</p>
<p><strong>Non-instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose.yml up -d

# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose.yml up -d
</code></pre>
<p><strong>Elastic instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose-elastic.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose-elastic.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose-elastic.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose-elastic.yml up -d

# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose-elastic.yml up -d

</code></pre>
<p><strong>OpenTelemetry instrumented scenarios</strong></p>
<pre><code class="language-bash"># healthy
docker-compose -f docker-compose-elastic-otel.yml up -d

# pause redis for 5 seconds, every 30 seconds
TOGGLE_CLIENT_PAUSE=true docker-compose -f docker-compose-elastic-otel.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 docker-compose -f docker-compose-elastic-otel.yml up -d

# add artificial delay to python service, 100ms, delay 50% of requests by 1000ms, and fail 20% of them
TOGGLE_SERVICE_DELAY=100 TOGGLE_CANARY_DELAY=1000 TOGGLE_CANARY_FAILURE=0.2 docker-compose -f docker-compose-elastic-otel.yml up -d


# throw error in nodejs service, 50% of the time
THROW_NOT_A_FUNCTION_ERROR=true docker-compose -f docker-compose-elastic-otel.yml up -d
</code></pre>
<h2>Mix Elastic and OTel</h2>
<p>Since the repository contains the services in all possible permutations, with the “favorite” service even written in multiple languages, you can also run them in a mixed mode. Elastic and OTel are fully compatible, so you could run some services instrumented with OTel while others run with the Elastic APM Agent. You can even run several variants in parallel, as we do for the “favorite” service.</p>
<p>Take a look at the existing compose files and simply copy one of the snippets for each service type.</p>
<pre><code class="language-yaml">favorite-java-otel-auto:
  build: java-favorite-otel-auto/.
  image: docker.elastic.co/demos/workshop/observability/elastiflix-java-favorite-otel-auto:${ELASTIC_VERSION}-${BUILD_NUMBER}
  depends_on:
    - redis
  networks:
    - app-network
  ports:
    - &quot;5004:5000&quot;
  environment:
    - ELASTIC_APM_SECRET_TOKEN=${ELASTIC_APM_SECRET_TOKEN}
    - OTEL_EXPORTER_OTLP_ENDPOINT=${ELASTIC_APM_SERVER_URL}
    - OTEL_METRICS_EXPORTER=otlp
    - OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
    - OTEL_SERVICE_NAME=java-favorite-otel-auto
    - OTEL_TRACES_EXPORTER=otlp
    - REDIS_HOST=redis
    - TOGGLE_SERVICE_DELAY=${TOGGLE_SERVICE_DELAY}
    - TOGGLE_CANARY_DELAY=${TOGGLE_CANARY_DELAY}
    - TOGGLE_CANARY_FAILURE=${TOGGLE_CANARY_FAILURE}
</code></pre>
<h2>Working with the source code</h2>
<p>The repository contains all possible permutations of each service.</p>
<ul>
<li>Subdirectories are named in the format $language-$serviceName-(elastic|otel)-(auto|manual). For example, <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">python-favorite-otel-auto</a> is the Python version of the “favorite” service, instrumented with OpenTelemetry using auto-instrumentation.</li>
<li>You can now compare this directory to the non-instrumented version of this service available under the directory <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">python-favorite</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/elastic-blog-7-code.png" alt="code" /></p>
<p>This allows you to easily understand the difference between the two. You can also start from scratch with the non-instrumented version and try to instrument it yourself.</p>
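<p>To make the naming scheme concrete, here is a small, illustrative helper (not part of the repository; the class name and output format are my own) that decodes a directory name into its parts:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: decode a service directory name of the form
// $language-$serviceName[-(elastic|otel)-(auto|manual)] into its parts.
public class ServiceDirName {
    // Language and service name contain no dashes; the vendor/mode suffix is optional.
    private static final Pattern NAME =
            Pattern.compile("([a-z.]+)-([a-z]+)(?:-(elastic|otel)-(auto|manual))?");

    public static String describe(String dir) {
        Matcher m = NAME.matcher(dir);
        if (!m.matches()) {
            return "unrecognized: " + dir;
        }
        String vendor = (m.group(3) == null) ? "not instrumented" : m.group(3);
        String mode = (m.group(4) == null) ? "" : " (" + m.group(4) + ")";
        return m.group(2) + " service in " + m.group(1) + ", " + vendor + mode;
    }
}
```

<p>For example, describe("python-favorite-otel-auto") yields "favorite service in python, otel (auto)", while describe("python-favorite") reports the base, non-instrumented version.</p>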
<h2>Conclusion</h2>
<p>Monitoring is more than just observing; it's about understanding and optimizing. Our sample application seeks to guide you on your journey with Elastic APM or OpenTelemetry, providing you with the tools to build resilient and high-performing applications.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/getting-started-opentelemetry-instrumentation-sample-app/email-thumbnail-generic-release-cloud_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Understanding APM: How to add extensions to the OpenTelemetry Java Agent]]></title>
            <link>https://www.elastic.co/observability-labs/blog/extensions-opentelemetry-java-agent</link>
            <guid isPermaLink="false">extensions-opentelemetry-java-agent</guid>
            <pubDate>Mon, 24 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides a comprehensive guide for Site Reliability Engineers (SREs) and IT Operations to gain visibility and traceability into applications, especially those written with non-standard frameworks or without access to the source code.]]></description>
            <content:encoded><![CDATA[<h2>Without code access, SREs and IT Operations cannot always get the visibility they need</h2>
<p>As an SRE, have you ever had a situation where you were working on an application that was written with non-standard frameworks, or you wanted to get some interesting business data from an application (number of orders processed for example) but you didn’t have access to the source code?</p>
<p>We all know this can be a challenging scenario resulting in visibility gaps, inability to fully trace code end to end, and missing critical business monitoring data that is useful for understanding the true impact of issues.</p>
<p>How can we solve this? One way is discussed in the following three blogs:</p>
<ul>
<li><a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">Create your own instrumentation with the Java Agent Plugin</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">How to capture custom metrics without app code changes using the Java Agent Plugin</a></li>
<li><a href="https://www.elastic.co/blog/regression-testing-your-java-agent-plugin">Regression testing your Java Agent Plugin</a></li>
</ul>
<p>In those blogs, we develop a plugin for the Elastic<sup>®</sup> APM Agent to get access to critical business data for monitoring and to add tracing where none exists.</p>
<p>What we will discuss in this blog is how you can do the same with the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry Java Agent</a> using the Extensions framework.</p>
<h2>Basic concepts: How APM works</h2>
<p>Before we continue, let's first understand a few basic concepts and terms.</p>
<ul>
<li><strong>Java Agent:</strong> This is a tool that can be used to instrument (or modify) the bytecode of class files in the Java Virtual Machine (JVM). Java agents are used for many purposes like performance monitoring, logging, security, and more.</li>
<li><strong>Bytecode:</strong> This is the intermediary code generated by the Java compiler from your Java source code. This code is interpreted or compiled on the fly by the JVM to produce machine code that can be executed.</li>
<li><strong>Byte Buddy:</strong> Byte Buddy is a code generation and manipulation library for Java. It is used to create, modify, or adapt Java classes at runtime. In the context of a Java Agent, Byte Buddy provides a powerful and flexible way to modify bytecode. <strong>Both the Elastic APM Agent and the OpenTelemetry Agent use Byte Buddy under the covers.</strong></li>
</ul>
<p><strong>Now, let's talk about how automatic instrumentation works with Byte Buddy:</strong></p>
<p>Automatic instrumentation is the process by which an agent modifies the bytecode of your application's classes, often to insert monitoring code. The agent doesn't modify the source code directly, but rather the bytecode that is loaded into the JVM. This is done while the JVM is loading the classes, so the modifications are in effect during runtime.</p>
<p>Here's a simplified explanation of the process:</p>
<ol>
<li>
<p><strong>Start the JVM with the agent:</strong> When starting your Java application, you specify the Java agent with the -javaagent command line option. This instructs the JVM to load your agent before the main method of your application is invoked. At this point, the agent has the opportunity to set up class transformers.</p>
</li>
<li>
<p><strong>Register a class file transformer with Byte Buddy:</strong> Your agent will register a class file transformer with Byte Buddy. A transformer is a piece of code that is invoked every time a class is loaded into the JVM. This transformer receives the bytecode of the class and it can modify this bytecode before the class is actually used.</p>
</li>
<li>
<p><strong>Transform the bytecode:</strong> When your transformer is invoked, it will use Byte Buddy's API to modify the bytecode. Byte Buddy allows you to specify your transformations in a high-level, expressive way rather than manually writing complex bytecode. For example, you could specify a certain class and method within that class that you want to instrument and provide an &quot;interceptor&quot; that will add new behavior to that method.</p>
</li>
<li>
<p><strong>Use the transformed classes:</strong> Once the agent has set up its transformers, the JVM continues to load classes as usual. Each time a class is loaded, your transformers are invoked, allowing them to modify the bytecode. Your application then uses these transformed classes as if they were the original ones, but they now have the extra behavior that you've injected through your interceptor.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-1-flowchart-process.png" alt="flowchart process" /></p>
<p>In essence, automatic instrumentation with Byte Buddy is about modifying the behavior of your Java classes at runtime, without needing to alter the source code directly. This is especially useful for cross-cutting concerns like logging, monitoring, or security, as it allows you to centralize this code in your Java Agent, rather than scattering it throughout your application.</p>
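<p>To make steps 1 and 2 concrete, here is a minimal agent skeleton that uses only the JDK's java.lang.instrument API (Byte Buddy is deliberately left out to keep the sketch self-contained); the class names and the package filter are illustrative:</p>

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal sketch of the premain/transformer pattern described above.
// Started with: java -javaagent:logging-agent.jar -jar your-app.jar
public class LoggingAgent {

    // The JVM invokes premain before the application's main method runs,
    // giving the agent a chance to register class transformers.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }

    // Invoked for every class as its bytecode is loaded into the JVM.
    public static class LoggingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null && className.startsWith("org/davidgeorgehope/")) {
                System.out.println("Candidate for instrumentation: " + className);
            }
            // Returning null leaves the bytecode unchanged; a real agent
            // (e.g. via Byte Buddy) would return modified bytecode here.
            return null;
        }
    }
}
```

<p>A real agent JAR also needs a Premain-Class entry in its manifest so the JVM can find the premain method.</p>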
<h2>Application, prerequisites, and config</h2>
<p>There is a really simple application in <a href="https://github.com/davidgeorgehope/custom-instrumentation-examples">this GitHub repository</a> that is used throughout this blog. It simply asks you to input some text and then counts the number of words.</p>
<p>It’s also listed below:</p>
<pre><code class="language-java">package org.davidgeorgehope;
import java.util.Scanner;
import java.util.logging.Logger;

public class Main {
    private static Logger logger = Logger.getLogger(Main.class.getName());

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.println(&quot;Please enter your sentence:&quot;);
            String input = scanner.nextLine();
            Main main = new Main();
            int wordCount = main.countWords(input);
            System.out.println(&quot;The input contains &quot; + wordCount + &quot; word(s).&quot;);
        }
    }
    public int countWords(String input) {

        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

        if (input == null || input.isEmpty()) {
            return 0;
        }

        String[] words = input.split(&quot;\\s+&quot;);
        return words.length;
    }
}
</code></pre>
<p>For the purposes of this blog, we will be using Elastic Cloud to capture the data generated by OpenTelemetry — <a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs#create-an-elastic-cloud-account">follow the instructions here</a> to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p>Once you are started with Elastic Cloud, go grab the OpenTelemetry config from the APM pages:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-2-apm-agents.png" alt="apm agents" /></p>
<p>You will need this later.</p>
<p>Finally, <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases">download the OpenTelemetry Agent</a>.</p>
<h2>Firing up the application and OpenTelemetry</h2>
<p>Build this simple application and run it with the OpenTelemetry Agent like so, filling in the appropriate variables with the values you got earlier.</p>
<pre><code class="language-bash">java -javaagent:opentelemetry-javaagent.jar -Dotel.exporter.otlp.endpoint=XX -Dotel.exporter.otlp.headers=XX -Dotel.metrics.exporter=otlp -Dotel.logs.exporter=otlp -Dotel.resource.attributes=XX -Dotel.service.name=your-service-name -jar simple-java-1.0-SNAPSHOT.jar
</code></pre>
<p>You will find that nothing happens. The reason is that the OpenTelemetry Agent has no way of knowing what to monitor. APM with automatic instrumentation works because the agent “knows” about standard frameworks, like Spring or HTTPClient, and gets visibility by “injecting” trace code into those standard frameworks automatically.</p>
<p>It has no knowledge of org.davidgeorgehope.Main from our simple Java application.</p>
<p>Luckily, there is a way we can add this using the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/extensions/">OpenTelemetry Extensions framework</a>.</p>
<h2>The OpenTelemetry Extension</h2>
<p>In the repository above, aside from the simple-java application, there is also a plugin for Elastic APM and an extension for OpenTelemetry. The relevant files for OpenTelemetry Extension are located <a href="https://github.com/davidgeorgehope/custom-instrumentation-examples/tree/main/opentelemetry-custom-instrumentation/src/main/java/org/davidgeorgehope">here</a> — WordCountInstrumentation.java and WordCountInstrumentationModule.java .</p>
<p>You’ll notice that OpenTelemetry Extensions and Elastic APM Plugins both make use of Byte Buddy, which is a common library for code instrumentation. There are some key differences in the way the code is bootstrapped, though.</p>
<p>The WordCountInstrumentationModule class extends an OpenTelemetry-specific class, InstrumentationModule, whose purpose is to describe a set of TypeInstrumentation instances that need to be applied together to correctly instrument a specific library. The WordCountInstrumentation class is one such TypeInstrumentation.</p>
<p>Type instrumentations grouped in a module share helper classes, muzzle runtime checks, and applicable class loader criteria, and can only be enabled or disabled as a set.</p>
<p>This is a little different from how the Elastic APM Plugin works: with OpenTelemetry, the default method of injecting code is inline, and you can inject dependencies into the core application classloader using the InstrumentationModule configuration (as shown below). The Elastic APM method is safer, as it allows isolation of helper classes and makes it easier to debug with normal IDEs; we are contributing this method to OpenTelemetry. Here we inject the TypeInstrumentation class and the WordCountInstrumentation class into the classloader.</p>
<pre><code class="language-java">@Override
    public List&lt;String&gt; getAdditionalHelperClassNames() {
        return List.of(WordCountInstrumentation.class.getName(),&quot;io.opentelemetry.javaagent.extension.instrumentation.TypeInstrumentation&quot;);
    }
</code></pre>
<p>The other interesting part of the TypeInstrumentation class is the setup.</p>
<p>Here we give our instrumentation “group” a name. An InstrumentationModule needs to have at least one name. The user of the javaagent can suppress a chosen instrumentation by referring to it by one of its names. The instrumentation module names use kebab-case.</p>
<pre><code class="language-java">public WordCountInstrumentationModule() {
        super(&quot;wordcount-demo&quot;, &quot;wordcount&quot;);
    }
</code></pre>
<p>Apart from this, the class has methods to specify the loading order relative to other instrumentations if needed, and to specify the classes that implement TypeInstrumentation and are responsible for the bulk of the instrumentation work.</p>
<p>Now let's take a look at that WordCountInstrumentation class, which implements TypeInstrumentation:</p>
<pre><code class="language-java">// The WordCountInstrumentation class implements the TypeInstrumentation interface.
// This allows us to specify which types of classes (based on some matching criteria) will have their methods instrumented.

public class WordCountInstrumentation implements TypeInstrumentation {

    // The typeMatcher method is used to define which classes the instrumentation should apply to.
    // In this case, it's the &quot;org.davidgeorgehope.Main&quot; class.
    @Override
    public ElementMatcher&lt;TypeDescription&gt; typeMatcher() {
        logger.info(&quot;TEST typeMatcher&quot;);
        return ElementMatchers.named(&quot;org.davidgeorgehope.Main&quot;);
    }

    // In the transform method, we specify which methods of the classes matched above will be instrumented,
    // and also the advice (a piece of code) that will be added to these methods.
    @Override
    public void transform(TypeTransformer typeTransformer) {
        logger.info(&quot;TEST transform&quot;);
        typeTransformer.applyAdviceToMethod(namedOneOf(&quot;countWords&quot;),this.getClass().getName() + &quot;$WordCountAdvice&quot;);
    }

    // The WordCountAdvice class contains the actual pieces of code (advices) that will be added to the instrumented methods.
    @SuppressWarnings(&quot;unused&quot;)
    public static class WordCountAdvice {
        // This advice is added at the beginning of the instrumented method (OnMethodEnter).
        // It creates and starts a new span, and makes it active.
        @Advice.OnMethodEnter(suppress = Throwable.class)
        public static Scope onEnter(@Advice.Argument(value = 0) String input, @Advice.Local(&quot;otelSpan&quot;) Span span) {
            // Get a Tracer instance from OpenTelemetry.
            Tracer tracer = GlobalOpenTelemetry.getTracer(&quot;instrumentation-library-name&quot;,&quot;semver:1.0.0&quot;);
            System.out.print(&quot;Entering method&quot;);

            // Start a new span with the name &quot;mySpan&quot;.
            span = tracer.spanBuilder(&quot;mySpan&quot;).startSpan();

            // Make this new span the current active span.
            Scope scope = span.makeCurrent();

            // Return the Scope instance. This will be used in the exit advice to end the span's scope.
            return scope;
        }

        // This advice is added at the end of the instrumented method (OnMethodExit).
        // It first closes the span's scope, then checks if any exception was thrown during the method's execution.
        // If an exception was thrown, it sets the span's status to ERROR and ends the span.
        // If no exception was thrown, it sets a custom attribute &quot;wordCount&quot; on the span, and ends the span.
        @Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
        public static void onExit(@Advice.Return(readOnly = false) int wordCount,
                                  @Advice.Thrown Throwable throwable,
                                  @Advice.Local(&quot;otelSpan&quot;) Span span,
                                  @Advice.Enter Scope scope) {
            // Close the scope to end it.
            scope.close();

            // If an exception was thrown during the method's execution, set the span's status to ERROR.
            if (throwable != null) {
                span.setStatus(StatusCode.ERROR, &quot;Exception thrown in method&quot;);
            } else {
                // If no exception was thrown, set a custom attribute &quot;wordCount&quot; on the span.
                span.setAttribute(&quot;wordCount&quot;, wordCount);
            }

            // End the span. This makes it ready to be exported to the configured exporter (e.g. Elastic).
            span.end();
        }
    }
}
</code></pre>
<p>The target class for our instrumentation is defined in the typeMatcher method, and the method we want to instrument is defined in the transform method. We are targeting the Main class and the countWords method.</p>
<p>As you can see, we have an inner class here that does most of the work of defining an onEnter and onExit method, which tells us what to do when we enter the countWords method and when we exit the countWords method.</p>
<p>In the onEnter method, we create and start a new OpenTelemetry span, and in the onExit method, we end the span. If the method ends successfully, we also grab the word count and record it in the wordCount span attribute.</p>
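<p>Conceptually, the two advice methods make the instrumented countWords behave as if it had been rewritten like the sketch below. The tiny Span class here is a stand-in for the OpenTelemetry API (so the sketch stays self-contained and runnable); only the try/catch/finally shape and the attribute/status logic mirror the real advice:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the effect of the enter/exit advice pair on countWords.
public class AdviceSketch {

    // Stand-in for io.opentelemetry.api.trace.Span, just enough for the sketch.
    static class Span {
        final Map<String, Object> attributes = new HashMap<>();
        String status = "UNSET";
        boolean ended = false;
        void setAttribute(String key, Object value) { attributes.put(key, value); }
        void setStatus(String s) { status = s; }
        void end() { ended = true; }
    }

    static Span lastSpan; // kept only so the sketch can be inspected

    static int countWordsInstrumented(String input) {
        Span span = new Span();                          // OnMethodEnter: start the span
        lastSpan = span;
        try {
            int wordCount = (input == null || input.isEmpty())
                    ? 0
                    : input.split("\\s+").length;        // original method body
            span.setAttribute("wordCount", wordCount);   // OnMethodExit, no throwable
            return wordCount;
        } catch (RuntimeException e) {
            span.setStatus("ERROR");                     // OnMethodExit with throwable
            throw e;
        } finally {
            span.end();                                  // the span is always ended
        }
    }
}
```

<p>In the real extension, Byte Buddy weaves the equivalent of this wrapping into the bytecode at class-load time, so the source of Main never changes.</p>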
<p>Now let's take a look at what happens when we run this. The good news is that we have made this extremely simple by providing a Dockerfile that does all the work for you.</p>
<h2>Pulling this all together</h2>
<p><a href="https://github.com/davidgeorgehope/custom-instrumentation-examples/tree/main">Clone the GitHub repository</a> if you have not already done so, and before continuing, let’s take a quick look at the Dockerfile we are using.</p>
<pre><code class="language-dockerfile"># Build stage
FROM maven:3.8.7-openjdk-18 as build

COPY simple-java /home/app/simple-java
COPY opentelemetry-custom-instrumentation /home/app/opentelemetry-custom-instrumentation

WORKDIR /home/app/simple-java
RUN mvn install

WORKDIR /home/app/opentelemetry-custom-instrumentation
RUN mvn install

# Package stage
FROM maven:3.8.7-openjdk-18
COPY --from=build /home/app/simple-java/target/simple-java-1.0-SNAPSHOT.jar /usr/local/lib/simple-java-1.0-SNAPSHOT.jar
COPY --from=build /home/app/opentelemetry-custom-instrumentation/target/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar /usr/local/lib/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar

WORKDIR /

RUN curl -L -o opentelemetry-javaagent.jar https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT [&quot;/start.sh&quot;]
</code></pre>
<p>This Dockerfile works in two parts: during the Docker build process, we build the simple-java application from source, followed by the custom instrumentation, and then download the latest OpenTelemetry Java Agent. At runtime, we simply execute the start.sh file described below:</p>
<pre><code class="language-bash">#!/bin/sh
java \
-javaagent:/opentelemetry-javaagent.jar \
-Dotel.exporter.otlp.endpoint=${SERVER_URL} \
-Dotel.exporter.otlp.headers=&quot;Authorization=Bearer ${SECRET_KEY}&quot; \
-Dotel.metrics.exporter=otlp \
-Dotel.logs.exporter=otlp \
-Dotel.resource.attributes=service.name=simple-java,service.version=1.0,deployment.environment=production \
-Dotel.service.name=your-service-name \
-Dotel.javaagent.extensions=/usr/local/lib/opentelemetry-custom-instrumentation-1.0-SNAPSHOT.jar \
-Dotel.javaagent.debug=true \
-jar /usr/local/lib/simple-java-1.0-SNAPSHOT.jar
</code></pre>
<p>There are two important things to note in this script. The first is that we set the -javaagent option to the opentelemetry-javaagent.jar; this starts the OpenTelemetry Java agent, which runs before any application code is executed.</p>
<p>Inside this JAR there has to be a class with a premain method, which the JVM looks for; this bootstraps the Java agent. As described above, every class that is loaded is essentially filtered through the agent code, so the agent can modify the bytecode before it is executed.</p>
<p>The second important thing is the otel.javaagent.extensions setting, which loads the extension we built to add instrumentation for our simple-java application.</p>
<p>Now run the following commands:</p>
<pre><code class="language-bash">docker build -t djhope99/custom-otel-instrumentation:1 .
docker run -it -e 'SERVER_URL=XXX' -e 'SECRET_KEY=XX' djhope99/custom-otel-instrumentation:1
</code></pre>
<p>If you use the SERVER_URL and SECRET_KEY you obtained earlier, you should see this connect to Elastic.</p>
<p>When it starts up, it will ask you to enter a sentence. Type a few sentences, pressing enter after each one. Do this a few times; there is a sleep in the code to force a long-running transaction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-3-codeblack.png" alt="code" /></p>
<p>Eventually you will see the service show up in the service map:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-4-services.png" alt="services" /></p>
<p>Traces will appear:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-5-your-service-name.png" alt="service name" /></p>
<p>And in the span you will see the wordcount attribute we collected:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-6-transaction-details.png" alt="transaction details" /></p>
<p>This attribute can be used for further dashboarding and AI/ML, including anomaly detection if you need it, and as you can see below, this is easy to do.</p>
<p>First click on the burger on the left side and select <strong>Dashboard</strong> to create a new dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-7-manage-deployment-analytics.png" alt="analytics" /></p>
<p>From here, click <strong>Create Visualization</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-8-visualization.png" alt="visualization" /></p>
<p>Search for the wordcount label in the APM index as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-9-dashboard-word.png" alt="dashboard" /></p>
<p>As you can see, because we created this attribute in the Span code as below with wordCount as a type “Integer,” we were able to automatically assign it as a numeric field in Elastic:</p>
<pre><code class="language-javascript">span.setAttribute(&quot;wordCount&quot;, wordCount);
</code></pre>
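<p>For illustration, the word-count value itself can be computed with plain Java along these lines (a sketch; the class and method names are ours, and in the real extension the value is recorded on the active span via setAttribute):</p>
<pre><code class="language-java">public class WordCount {
  // Count whitespace-separated words; because the value is an int, Elastic
  // maps the resulting span attribute to a numeric field automatically.
  static int countWords(String sentence) {
    String trimmed = sentence.trim();
    if (trimmed.isEmpty()) {
      return 0;
    }
    return trimmed.split(&quot;\\s+&quot;).length;
  }

  public static void main(String[] args) {
    int wordCount = countWords(&quot;hello from Elastic Observability&quot;);
    System.out.println(wordCount); // 4
    // In the extension: span.setAttribute(&quot;wordCount&quot;, wordCount);
  }
}
</code></pre>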
<p>From here we can drag and drop it into the visualization for display on our Dashboard! Super easy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/elastic-blog-10-drag-drop.png" alt="drag and drop" /></p>
<h2>In conclusion</h2>
<p>This blog showed the valuable role the OpenTelemetry Java agent can play in filling visibility gaps and obtaining crucial business monitoring data, especially when access to the source code is not feasible.</p>
<p>We covered the basics of Java agents, bytecode, and Byte Buddy, followed by a closer look at how automatic instrumentation works with Byte Buddy.</p>
<p>We then demonstrated the OpenTelemetry Java Agent's extensions framework with a simple Java application, which underscored the agent's ability to inject trace code into the application to facilitate monitoring.</p>
<p>Finally, we detailed how to configure the agent and integrate an OpenTelemetry extension, and walked through running a sample application to put the pieces together in practice. We hope this is a useful resource for SREs and IT operations teams looking to get more from OpenTelemetry's automatic instrumentation.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/monitor-openai-api-gpt-models-opentelemetry-elastic">Monitor OpenAI API and GPT models with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future proof your observability platform with OpenTelemetry and Elastic</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/extensions-opentelemetry-java-agent/flexible-implementation-1680X980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to easily add application monitoring in Kubernetes pods]]></title>
            <link>https://www.elastic.co/observability-labs/blog/application-monitoring-kubernetes-pods</link>
            <guid isPermaLink="false">application-monitoring-kubernetes-pods</guid>
            <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog walks through installing the Elastic APM K8s Attacher and shows how to configure your system for both common and non-standard deployments of Elastic APM agents.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic® APM K8s Attacher</a> allows auto-installation of Elastic APM application agents (e.g., the Elastic APM Java agent) into applications running in your Kubernetes clusters. The mechanism uses a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, which is a standard Kubernetes component, but you don’t need to know all the details to use the Attacher. Essentially, you can install the Attacher, add one annotation to any Kubernetes deployment that has an application you want monitored, and that’s it!</p>
<p>In this blog, we’ll walk through a full example from scratch using a Java application. Apart from the Java code and using a JVM for the application, everything else works the same for the other languages supported by the Attacher.</p>
<h2>Prerequisites</h2>
<p>This walkthrough assumes that the following are already installed on the system: JDK 17, Docker, Kubernetes, and Helm.</p>
<h2>The example application</h2>
<p>While the application (shown below) is a Java application, it could easily be implemented in any language: it is just a simple loop that every 2 seconds calls the method chain methodA-&gt;methodB-&gt;methodC-&gt;methodD, with methodC sleeping for 10 milliseconds and methodD sleeping for 200 milliseconds. This application was chosen simply so we can clearly see in the Elastic APM UI that it is being monitored.</p>
<p>The Java application in full is shown here:</p>
<pre><code class="language-java">package test;

public class Testing implements Runnable {

  public static void main(String[] args) {
    new Thread(new Testing()).start();
  }

  public void run()
  {
    while(true) {
      try {Thread.sleep(2000);} catch (InterruptedException e) {}
      methodA();
    }
  }

  public void methodA() {methodB();}

  public void methodB() {methodC();}

  public void methodC() {
    System.out.println(&quot;methodC executed&quot;);
    try {Thread.sleep(10);} catch (InterruptedException e) {}
    methodD();
  }

  public void methodD() {
    System.out.println(&quot;methodD executed&quot;);
    try {Thread.sleep(200);} catch (InterruptedException e) {}
  }
}
</code></pre>
<p>We created a Docker image containing that simple Java application for you that can be pulled from the following Docker repository:</p>
<pre><code class="language-bash">docker.elastic.co/demos/apm/k8s-webhook-test
</code></pre>
<h2>Deploy the pod</h2>
<p>First we need a deployment config. We’ll call the config file webhook-test.yaml, and the contents are pretty minimal — just pull the image and run that as a pod &amp; container called webhook-test in the default namespace:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>This can be deployed normally using kubectl:</p>
<pre><code class="language-yaml">kubectl apply -f webhook-test.yaml
</code></pre>
<p>The result is exactly as expected:</p>
<pre><code class="language-bash">$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
webhook-test   1/1     Running   0          10s

$ kubectl logs webhook-test
methodC executed
methodD executed
methodC executed
methodD executed
</code></pre>
<p>So far, this is just setting up a standard Kubernetes application with no APM monitoring. Now we get to the interesting bit: adding in auto-instrumentation.</p>
<h2>Install Elastic APM K8s Attacher</h2>
<p>The first step is to install the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a>. This only needs to be done once for the cluster — once installed, it is always available. Before installation, we will define where the monitored data will go. As you will see later, we can decide or change this any time. For now, we’ll specify our own Elastic APM server, which is at <a href="https://myserver.somecloud:443">https://myserver.somecloud:443</a> — we also have a secret token for authorization to that Elastic APM server, which has value MY_SECRET_TOKEN. (If you want to set up a quick test Elastic APM server, you can do so at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a>).</p>
<p>There are two additional environment variables set for the application that are not generally needed but will help when we see the resulting UI content toward the end of the walkthrough (when the agent is auto-installed, these two variables tell the agent what name to give this application in the UI and what method to trace). Now we just need to define the custom yaml file to hold these. On installation, the custom yaml will be merged into the yaml for the Attacher:</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
</code></pre>
<p>That custom.yaml file is all we need to install the attacher (note we’ve only specified the default namespace for agent auto-installation for now — this can be easily changed, as you’ll see later). Next we’ll add the Elastic charts to helm — this only needs to be done once, then all Elastic charts are available to helm. This is the usual helm add repo command, specifically:</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co
</code></pre>
<p>Now the Elastic charts are available for installation (helm search repo would show you all the available charts). We’re going to use “elastic-webhook” as the helm release name, resulting in the following installation command:</p>
<pre><code class="language-bash">helm install elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>And that’s it, we now have the Elastic APM K8s Attacher installed and set to send data to the APM server defined in the custom.yaml file! (You can confirm installation with a helm list -A if you need.)</p>
<h2>Auto-install the Java agent</h2>
<p>The Elastic APM K8s Attacher is installed, but it doesn’t auto-install the APM application agents into every pod — that could lead to problems! Instead, the Attacher is deliberately limited to auto-installing agents into deployments selected by a) the namespaces listed in the custom.yaml, and b) a specific annotation, “co.elastic.apm/attach,” on deployments in those namespaces.</p>
<p>So for now, restarting the webhook-test pod we created above won’t change its behavior, as it isn’t yet set to be monitored. What we need to do is add the annotation, referencing the default agent configuration that was installed with the Attacher, which is called “java” for the Java agent. (We’ll see later how an agent configuration can be altered; the default configuration installs the latest agent version and leaves everything else at that version’s defaults.) Adding that annotation to the webhook-test yaml gives us the new yaml file contents (the additional config is labelled (1)):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  annotations: #(1)
    co.elastic.apm/attach: java #(1)
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>Applying this change gives us the application now monitored:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.45.0 …
</code></pre>
<p>And since the agent is now feeding data to our APM server, we can now see it in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/webhook-test-k8s-blog.png" alt="webhook-test" /></p>
<p>Note that the agent identifies Testing.methodB method as a trace root because of the ELASTIC_APM_TRACE_METHODS environment variable set to test.Testing#methodB in the custom.yaml — this tells the agent to specifically trace that method. The time taken by that method will be available in the UI for each invocation, but we don’t see the sub-methods . . . currently. In the next section, we’ll see how easy it is to customize the Attacher, and in doing so we’ll see more detail about the method chain being executed in the application.</p>
<h2>Customizing the agents</h2>
<p>In your systems, you’ll likely have development, testing, and production environments. You’ll want to specify the version of the agent to use rather than just pulling the latest, you’ll want debug output on for some applications or instances, and you’ll want specific options set to specific values. This sounds like a lot of effort, but the Attacher lets you make these kinds of changes in a very simple way. In this section, we’ll add a configuration that covers all of these changes so you can see just how easy it is to configure and enable.</p>
<p>We start at the custom.yaml file we defined above. This is the file that gets merged into the Attacher. Adding a new configuration with all the items listed in the last paragraph is easy — though first we need to decide a name for our new configuration. We’ll call it “java-interesting” here. The new custom.yaml in full is (the first part is just the same as before, the new config is simply appended):</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
    java-interesting:
      image: docker.elastic.co/observability/apm-agent-java:1.55.4
      artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
        ELASTIC_APM_ENVIRONMENT: &quot;testing&quot;
        ELASTIC_APM_LOG_LEVEL: &quot;debug&quot;
        ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED: &quot;true&quot;
        JAVA_TOOL_OPTIONS: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot;
</code></pre>
<p>Breaking the additional config down, we have:</p>
<ul>
<li>
<p>The name of the new config java-interesting</p>
</li>
<li>
<p>The APM Java agent image docker.elastic.co/observability/apm-agent-java</p>
<ul>
<li>With a specific version 1.55.4 instead of latest</li>
</ul>
</li>
<li>
<p>We need to specify the agent jar location (the attacher puts it here)</p>
<ul>
<li>artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;</li>
</ul>
</li>
<li>
<p>And then the environment variables</p>
</li>
<li>
<p>ELASTIC_APM_SERVER_URL as before</p>
</li>
<li>
<p>ELASTIC_APM_ENVIRONMENT set to testing, which is useful when filtering in the UI</p>
</li>
<li>
<p>ELASTIC_APM_LOG_LEVEL set to debug for more detailed agent output</p>
</li>
<li>
<p>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED turning this on (setting to true) will give us additional interesting information about the method chain being executed in the application</p>
</li>
<li>
<p>And lastly we need to set JAVA_TOOL_OPTIONS to &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot; to enable starting the agent — this is fundamentally how the attacher auto-attaches the Java agent</p>
</li>
</ul>
<p>More configurations and details about configuration options are <a href="https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html">here for the Java agent</a>, and <a href="https://www.elastic.co/guide/en/apm/agent/index.html">other language agents</a> are also available.</p>
<h2>The application traced with the new configuration</h2>
<p>And finally we just need to upgrade the attacher with the changed custom.yaml:</p>
<pre><code class="language-bash">helm upgrade elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>This is the same command as the original install, but now using upgrade. That’s it — add config to the custom.yaml and upgrade the attacher, and it’s done! Simple.</p>
<p>Of course we still need to use the new config on an app. In this case, we’ll edit the existing webhook-test.yaml file, replacing java with java-interesting, so the annotation line is now:</p>
<pre><code class="language-yaml">co.elastic.apm/attach: java-interesting
</code></pre>
<p>Applying the new pod config and restarting the pod, you can see the logs now hold debug output:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.44.0 …
… DEBUG co.elastic.apm.agent. …
… DEBUG co.elastic.apm.agent. …
</code></pre>
<p>More interesting is the UI. Now that inferred spans are enabled, the full method chain is visible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/trace-sample-k8s-blog.png" alt="trace sample" /></p>
<p>This gives the details for methodB (it takes 211 milliseconds because it calls methodC, which sleeps for 10ms and in turn calls methodD, which sleeps for 200ms). The times for methodC and methodD are inferred rather than recorded; if you needed accurate times, you would instead add those methods to trace_methods and have them traced too.</p>
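<p>If you did want recorded rather than inferred times for the sub-methods, the environment block in custom.yaml could list them explicitly. This is a sketch: trace_methods takes a comma-separated list of methods to instrument, and the method names here match our example application.</p>
<pre><code class="language-yaml">ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB,test.Testing#methodC,test.Testing#methodD&quot;
</code></pre>
<p>After changing custom.yaml, run the same helm upgrade command and restart the pod for the change to take effect.</p>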
<h2>Note on the ECK operator</h2>
<p>The <a href="https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-overview.html">Elastic Cloud on Kubernetes operator</a> allows you to install and manage a number of other Elastic components on Kubernetes. At the time of publication of this blog, the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a> is a separate component, and there is no conflict between these management mechanisms — they apply to different components and are independent of each other.</p>
<h2>Try it yourself!</h2>
<p>This walkthrough is easily repeated on your system, and you can make it more useful by replacing the example application with your own and the Docker registry with the one you use.</p>
<p><a href="https://www.elastic.co/observability/kubernetes-monitoring">Learn more about real-time monitoring with Kubernetes and Elastic Observability</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/139689_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to deploy Hello World Elastic Observability on Google Cloud Run]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-observability-google-cloud-run</link>
            <guid isPermaLink="false">deploy-observability-google-cloud-run</guid>
            <pubDate>Mon, 28 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on Google Cloud Run.]]></description>
<content:encoded><![CDATA[<p>Elastic Cloud Observability is the premier tool to provide visibility into your running web apps. Google Cloud Run is the serverless platform of choice to run your web apps that need to scale up massively and scale down to zero. Elastic Observability combined with Google Cloud Run is the perfect solution for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>, in a way that’s straightforward to implement and manage.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to Cloud Run and then walk you through the steps to instrument the Hello World web app to enable observation of the application’s operations with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-1-create-deployment.png" alt="create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic&lt;sup&gt;®&lt;/sup&gt; deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the <strong>Kibana</strong> row of links, click on <strong>Open</strong> to open <strong>Kibana</strong> for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-2-my-deployment.png" alt="my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-3-apm.png" alt="apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-4-apm-agents.png" alt="apm agents" /></p>
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our Google Cloud project for deploying apps to Cloud Run.</p>
<h2>Google Cloud Run setup</h2>
<p>First we’ll need a Google Cloud project, so let’s create one by going to the <a href="https://console.cloud.google.com">Google Cloud console</a> and creating a new project. Select the project menu and then click the <strong>New Project</strong> button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-5-google-cloud-gray-dropdown.png" alt="google cloud with gray dropdown" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-5-select-a-project.png" alt="select a project" /></p>
<p>Once the new project is created, we’ll need to enable the necessary APIs that our Hello World app will be using. This can be done by clicking this <a href="https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com,,run.googleapis.com,containerregistry.googleapis.com,cloudbuild.googleapis.com">enable APIs</a> link, which opens a page in the Google Cloud console that lists the APIs that will be enabled and allows us to confirm their activation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-6-enable-apis.png" alt="enable apis" /></p>
<p>After we’ve enabled the necessary APIs, we’ll need to set up the required permissions for our Hello World app, which can be done in the <a href="https://console.cloud.google.com/iam-admin">IAM section</a> of the Google Cloud Console. Within the IAM section, select the <strong>Compute Engine</strong> default service account and add the following roles:</p>
<ul>
<li>Logs Viewer</li>
<li>Monitoring Viewer</li>
<li>Pub/Sub Subscriber</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-7-principals.png" alt="principals" /></p>
<h2>Deploy a Hello World web app to Cloud Run</h2>
<p>We’ll perform the process of deploying a Node.js Hello World web app to Cloud Run using the handy Google Cloud tool called <a href="https://console.cloud.google.com/cloudshelleditor">Cloud Shell Editor</a>. To deploy the Hello World app, we’ll perform the following five steps:</p>
<ol>
<li>In Cloud Shell Editor, in the terminal window that appears at the bottom of the screen, clone a <a href="https://github.com/elastic/observability-examples/tree/main/gcp/run/helloworld">Node.js Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="2">
<li>Change directory to the location of the Hello World web app code.</li>
</ol>
<pre><code class="language-bash">cd gcp/run/helloworld
</code></pre>
<ol start="3">
<li>Build the Hello World app image and push the image to Google Container Registry by running the command below in the terminal. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud builds submit --tag gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="4">
<li>Deploy the Hello World app to Google Cloud Run by running the command below. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud run deploy elastic-helloworld --image gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="5">
<li>When the deployment process is complete, a Service URL will be displayed within the terminal. Copy and paste the Service URL in a browser to view the Hello World app running in Cloud Run.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-8-hello-world.png" alt="hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in Cloud Run, we’re now ready to add the minimal code necessary to start monitoring the app. To enable observability for the Hello World app in Elastic Cloud, we’ll perform the following six steps:</p>
<ol>
<li>In the Google Cloud Shell Editor, edit the Dockerfile file to add the following OpenTelemetry environment variables along with the commands to install and run the OpenTelemetry Node.js instrumentation. Replace the ELASTIC_APM_SERVER_URL text and the ELASTIC_APM_SECRET_TOKEN text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step.</li>
</ol>
<pre><code class="language-dockerfile">ENV OTEL_EXPORTER_OTLP_ENDPOINT='ELASTIC_APM_SERVER_URL'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ELASTIC_APM_SECRET_TOKEN'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp
RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node
CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<p>The updated Dockerfile should look something like this:</p>
<pre><code class="language-dockerfile">FROM node:18-slim
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm install --only=production
COPY . ./
ENV OTEL_EXPORTER_OTLP_ENDPOINT='https://******.apm.us-central1.gcp.cloud.es.io:443'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ******************'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp
RUN npm install --save @opentelemetry/api
RUN npm install --save @opentelemetry/auto-instrumentations-node
CMD [&quot;node&quot;, &quot;--require&quot;, &quot;@opentelemetry/auto-instrumentations-node/register&quot;, &quot;index.js&quot;]
</code></pre>
<ol start="2">
<li>In the Google Cloud Shell Editor, edit the package.json file to add the Elastic APM dependency. The dependencies section in package.json should look something like this:</li>
</ol>
<pre><code class="language-json">&quot;dependencies&quot;: {
  	&quot;express&quot;: &quot;^4.18.2&quot;,
  	&quot;elastic-apm-node&quot;: &quot;^3.49.1&quot;
  },
</code></pre>
<ol start="3">
<li>In the Google Cloud Shell Editor, edit the index.js file:</li>
</ol>
<ul>
<li>Add the code required to initialize the OpenTelemetry tracer:</li>
</ul>
<pre><code class="language-javascript">const otel = require(&quot;@opentelemetry/api&quot;);
const tracer = otel.trace.getTracer(&quot;hello-world&quot;);
</code></pre>
<ul>
<li>Replace the “Hello World!” output code...</li>
</ul>
<pre><code class="language-javascript">res.send(`&lt;h1&gt;Hello World!&lt;/h1&gt;`);
</code></pre>
<p>...with the “Hello Elastic Observability” code block.</p>
<pre><code class="language-javascript">res.send(
  `&lt;div style=&quot;text-align: center;&quot;&gt;
   &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   Hello Elastic Observability - Google Cloud Run - Node.js
   &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
   &lt;/div&gt;`
);
</code></pre>
<ul>
<li>Add a trace “hi” before the “Hello Elastic Observability” code block and add a trace “bye” after the “Hello Elastic Observability” code block.</li>
</ul>
<pre><code class="language-javascript">tracer.startActiveSpan(&quot;hi&quot;, (span) =&gt; {
  console.log(&quot;hello&quot;);
  span.end();
});
res.send(
  `&lt;div style=&quot;text-align: center;&quot;&gt;
   &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
   Hello Elastic Observability - Google Cloud Run - Node.js
   &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
   &lt;/div&gt;`
);
tracer.startActiveSpan(&quot;bye&quot;, (span) =&gt; {
  console.log(&quot;goodbye&quot;);
  span.end();
});
</code></pre>
<ul>
<li>The completed index.js file should look something like this:</li>
</ul>
<pre><code class="language-javascript">const otel = require(&quot;@opentelemetry/api&quot;);
const tracer = otel.trace.getTracer(&quot;hello-world&quot;);

const express = require(&quot;express&quot;);
const app = express();

app.get(&quot;/&quot;, (req, res) =&gt; {
  tracer.startActiveSpan(&quot;hi&quot;, (span) =&gt; {
    console.log(&quot;hello&quot;);
    span.end();
  });
  res.send(
    `&lt;div style=&quot;text-align: center;&quot;&gt;
    &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    Hello Elastic Observability - Google Cloud Run - Node.js
    &lt;/h1&gt;
   &lt;img src=&quot;https://storage.googleapis.com/elastic-helloworld/elastic-logo.png&quot;&gt;
    &lt;/div&gt;`
  );
  tracer.startActiveSpan(&quot;bye&quot;, (span) =&gt; {
    console.log(&quot;goodbye&quot;);
    span.end();
  });
});

const port = parseInt(process.env.PORT) || 8080;
app.listen(port, () =&gt; {
  console.log(`helloworld: listening on port ${port}`);
});
</code></pre>
<ol start="4">
<li>Rebuild the Hello World app image and push the image to the Google Container Registry by running the command below in the terminal. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud builds submit --tag gcr.io/your-project-id/elastic-helloworld
</code></pre>
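<p>If you'd rather not type the project ID by hand, a small sketch derives the build command from a variable; with the gcloud CLI configured you could populate it from your active configuration instead. The command is echoed as a dry run:</p>

```shell
# Hard-coded here so the sketch runs anywhere; with gcloud configured you
# could instead use: PROJECT_ID=$(gcloud config get-value project)
PROJECT_ID="your-project-id"

# Echoed as a dry run; remove "echo" to run the build for real.
echo gcloud builds submit --tag "gcr.io/$PROJECT_ID/elastic-helloworld"
```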
<ol start="5">
<li>Redeploy the Hello World app to Google Cloud Run by running the command below. Be sure to replace your-project-id in the command below with your actual Google Cloud project ID.</li>
</ol>
<pre><code class="language-bash">gcloud run deploy elastic-helloworld --image gcr.io/your-project-id/elastic-helloworld
</code></pre>
<ol start="6">
<li>When the deployment process is complete, a Service URL will be displayed within the terminal. Copy and paste the Service URL in a browser to view the updated Hello World app running in Cloud Run.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-9-elastic-logo.png" alt="elastic logo" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>GET /</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the <strong>Trace Sample</strong> section to see the <strong>GET /</strong>, <strong>hi</strong>, and <strong>bye</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/elastic-blog-10-trace-sample.png" alt="trace sample" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the entire process of deploying a web app to Google Cloud Run that is instrumented with Elastic Observability. The end result is a web app that will scale up and down with demand combined with the observability tools to monitor the web app as it serves a single user or millions of users.</p>
<p>Now that you’ve seen how to deploy a serverless web app instrumented with observability, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about how to implement a complete observability solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/google-cloud">Getting started with Elastic on Google Cloud</a> for more examples of how you can drive the data insights you need by combining <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a> and cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-observability-google-cloud-run/illustration-dev-sec-ops-cloud-automations-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to deploy a Hello World web app with Elastic Observability on Azure Container Apps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploy-app-observability-azure-container-apps</link>
            <guid isPermaLink="false">deploy-app-observability-azure-container-apps</guid>
            <pubDate>Mon, 23 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Follow the step-by-step process of instrumenting Elastic Observability for a Hello World web app running on Azure Container Apps.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability is the optimal tool to provide visibility into your running web apps. Microsoft Azure Container Apps is a fully managed environment that enables you to run containerized applications on a serverless platform, so your applications scale up and down with demand. This lets you serve every customer’s need for availability while operating as efficiently as possible.</p>
<p>Using Elastic Observability and Azure Container Apps is a perfect combination for developers to deploy <a href="https://www.elastic.co/blog/observability-powerful-flexible-efficient">web apps that are auto-scaled with fully observable operations</a>.</p>
<p>This blog post will show you how to deploy a simple Hello World web app to Azure Container Apps and then walk you through the steps to instrument the Hello World web app to enable observation of the application’s operations with Elastic Cloud.</p>
<h2>Elastic Observability setup</h2>
<p>We’ll start with setting up an Elastic Cloud deployment, which is where observability will take place for the web app we’ll be deploying.</p>
<p>From the <a href="https://cloud.elastic.co">Elastic Cloud console</a>, select <strong>Create deployment</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-1-create-deployment.png" alt="create deployment" /></p>
<p>Enter a deployment name and click <strong>Create deployment</strong>. It takes a few minutes for your deployment to be created. While waiting, you are prompted to save the admin credentials for your deployment, which provides you with superuser access to your Elastic® deployment. Keep these credentials safe as they are shown only once.</p>
<p>Elastic Observability requires an APM Server URL and an APM Secret token for an app to send observability data to Elastic Cloud. Once the deployment is created, we’ll copy the Elastic Observability server URL and secret token and store them somewhere safely for adding to our web app code in a later step.</p>
<p>To copy the APM Server URL and the APM Secret Token, go to <a href="https://cloud.elastic.co/home">Elastic Cloud</a>. Then go to the <a href="https://cloud.elastic.co/deployments">Deployments</a> page, which lists all of the deployments you have created. Select the deployment you want to use, which will open the deployment details page. In the <strong>Kibana</strong> row of links, click on <strong>Open</strong> to open Kibana® for your deployment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-2-my-deployment.png" alt="my deployment" /></p>
<p>Select <strong>Integrations</strong> from the top-level menu. Then click the <strong>APM</strong> tile.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-3-apm.png" alt="apm" /></p>
<p>On the APM Agents page, copy the secretToken and the serverUrl values and save them for use in a later step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-4-apm-agents.png" alt="apm agents" /></p>
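<p>The two saved values are used verbatim later on: the serverUrl becomes the OTLP endpoint, and the secretToken becomes a bearer token in the OTLP headers. A quick sketch with placeholder values shows the mapping:</p>

```shell
# Placeholder values; substitute the serverUrl and secretToken you copied.
APM_SERVER_URL="https://example.apm.us-east-2.aws.elastic-cloud.com:443"
APM_SECRET_TOKEN="example-token"

# These are the two environment variables the app's Dockerfile will set.
echo "OTEL_EXPORTER_OTLP_ENDPOINT=$APM_SERVER_URL"
echo "OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer $APM_SECRET_TOKEN"
```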
<p>Now that we’ve completed the Elastic Cloud setup, the next step is to set up our account in Azure for deploying apps to the Container Apps service.</p>
<h2>Azure Container Apps setup</h2>
<p>First we’ll need an Azure account, so let’s create one by going to the <a href="https://azure.microsoft.com">Microsoft Azure portal</a> and creating a new project. Click the <strong>Start free</strong> button and follow the steps to sign in or create a new account.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-5-azure-start-free.png" alt="azure start free" /></p>
<h2>Deploy a Hello World web app to Container Apps</h2>
<p>We’ll deploy a C# Hello World web app to Container Apps using the handy Azure tool called <a href="https://azure.microsoft.com/en-us/get-started/azure-portal/cloud-shell">Cloud Shell</a>. To deploy the Hello World app, we’ll perform the following 12 steps:</p>
<ol>
<li>From the <a href="https://portal.azure.com/">Azure portal</a>, click the Cloud Shell icon at the top of the portal to open Cloud Shell…</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-6-cloud-shell.png" alt="cloud shell" /></p>
<p>… and when the Cloud Shell first opens, select <strong>Bash</strong> as the shell type to use.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-7-bash.png" alt="bash" /></p>
<ol start="2">
<li>If you’re prompted that “You have no storage mounted,” then click the <strong>Create storage</strong> button to create a file store to be used for saving and editing files from Cloud Shell.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-8-create-storage.png" alt="create storage" /></p>
<ol start="3">
<li>In Cloud Shell, clone a <a href="https://github.com/elastic/observability-examples/tree/main/azure/container-apps/helloworld">C# Hello World sample app</a> repo from GitHub by entering the following command.</li>
</ol>
<pre><code class="language-bash">git clone https://github.com/elastic/observability-examples
</code></pre>
<ol start="4">
<li>Change directory to the location of the Hello World web app code.</li>
</ol>
<pre><code class="language-bash">cd observability-examples/azure/container-apps/helloworld
</code></pre>
<ol start="5">
<li>Define the environment variables that we’ll be using in the commands throughout this blog post.</li>
</ol>
<pre><code class="language-bash">RESOURCE_GROUP=&quot;helloworld-containerapps&quot;
LOCATION=&quot;centralus&quot;
ENVIRONMENT=&quot;env-helloworld-containerapps&quot;
APP_NAME=&quot;elastic-helloworld&quot;
</code></pre>
<ol start="6">
<li>Define a unique container registry name by running the following command.</li>
</ol>
<pre><code class="language-bash">ACR_NAME=&quot;helloworld&quot;$RANDOM
</code></pre>
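<p>Container registry names must be globally unique across Azure because they become part of the registry's DNS name. Bash's $RANDOM expands to an integer between 0 and 32767, so the command yields names like helloworld12345 (the exact suffix differs on each run):</p>

```shell
# $RANDOM is re-evaluated on every expansion, giving a fresh 0-32767 suffix.
ACR_NAME="helloworld"$RANDOM
echo "$ACR_NAME"
```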
<ol start="7">
<li>Create an Azure resource group by running the following command.</li>
</ol>
<pre><code class="language-bash">az group create --name $RESOURCE_GROUP --location &quot;$LOCATION&quot;
</code></pre>
<ol start="8">
<li>Run the following command to create a container registry in Azure Container Registry.</li>
</ol>
<pre><code class="language-bash">az acr create --resource-group $RESOURCE_GROUP \
--name $ACR_NAME --sku Basic --admin-enabled true
</code></pre>
<ol start="9">
<li>Build the app image and push it to Azure Container Registry by running the following command.</li>
</ol>
<pre><code class="language-bash">az acr build --registry $ACR_NAME --image $APP_NAME .
</code></pre>
<ol start="10">
<li>Register the Microsoft.OperationalInsights namespace as a provider by running the following command.</li>
</ol>
<pre><code class="language-bash">az provider register -n Microsoft.OperationalInsights --wait
</code></pre>
<ol start="11">
<li>Run the following command to create a Container Apps environment to deploy your app into.</li>
</ol>
<pre><code class="language-bash">az containerapp env create --name $ENVIRONMENT \
--resource-group $RESOURCE_GROUP --location &quot;$LOCATION&quot;
</code></pre>
<ol start="12">
<li>Create a new Container App by deploying the Hello World app’s image to Container Apps, using the following command.</li>
</ol>
<pre><code class="language-bash">az containerapp create \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --environment $ENVIRONMENT \
  --image $ACR_NAME.azurecr.io/$APP_NAME \
  --target-port 3500 \
  --ingress 'external' \
  --registry-server $ACR_NAME.azurecr.io \
  --query properties.configuration.ingress.fqdn
</code></pre>
<p>This command will output the deployed Hello World app's fully qualified domain name (FQDN). Copy and paste the FQDN into a browser to see your running Hello World app.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-9-hello-world.png" alt="hello world" /></p>
<h2>Instrument the Hello World web app with Elastic Observability</h2>
<p>With a web app successfully running in Container Apps, we’re now ready to add the minimal code necessary to enable observability for the Hello World app in Elastic Cloud. We’ll perform the following eight steps:</p>
<ol>
<li>In Azure Cloud Shell, create a new file named Telemetry.cs by typing the following command.</li>
</ol>
<pre><code class="language-bash">touch Telemetry.cs
</code></pre>
<ol start="2">
<li>Open the Azure Cloud Shell file editor by typing the following command in Cloud Shell.</li>
</ol>
<pre><code class="language-bash">code .
</code></pre>
<ol start="3">
<li>In the Azure Cloud Shell editor, open the Telemetry.cs file and paste in the following code. Save the edited file in Cloud Shell by pressing the [Ctrl] + [s] keys on your keyboard (or if you’re on a macOS computer, use the [⌘] + [s] keys). This class file is used to create a tracer ActivitySource, which can generate trace Activity spans for observability.</li>
</ol>
<pre><code class="language-csharp">using System.Diagnostics;

public static class Telemetry
{
	public static readonly ActivitySource activitySource = new(&quot;Helloworld&quot;);
}
</code></pre>
<ol start="4">
<li>In the Azure Cloud Shell editor, edit the file named Dockerfile to add the following Elastic OpenTelemetry environment variables. Replace the ELASTIC_APM_SERVER_URL text and the ELASTIC_APM_SECRET_TOKEN text with the APM Server URL and the APM Secret Token values that you copied and saved in an earlier step.</li>
</ol>
<p>Save the edited file in Cloud Shell by pressing the [Ctrl] + [s] keys on your keyboard (or if you’re on a macOS computer, use the [⌘] + [s] keys).</p>
<p>The updated Dockerfile should look something like this:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;helloworld.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./helloworld.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;helloworld.csproj&quot; -c Release -o /app/build

FROM build AS publish
RUN dotnet publish &quot;helloworld.csproj&quot; -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
EXPOSE 3500
ENV ASPNETCORE_URLS=http://+:3500

ENV OTEL_EXPORTER_OTLP_ENDPOINT='https://******.apm.us-east-2.aws.elastic-cloud.com:443'
ENV OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer ***********'
ENV OTEL_LOG_LEVEL=info
ENV OTEL_METRICS_EXPORTER=otlp
ENV OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
ENV OTEL_SERVICE_NAME=helloworld
ENV OTEL_TRACES_EXPORTER=otlp

ENTRYPOINT [&quot;dotnet&quot;, &quot;helloworld.dll&quot;]
</code></pre>
<ol start="5">
<li>In the Azure Cloud Shell editor, edit the helloworld.csproj file to add the Elastic APM and OpenTelemetry dependencies. The updated helloworld.csproj file should look something like this:</li>
</ol>
<pre><code class="language-xml">
&lt;Project Sdk=&quot;Microsoft.NET.Sdk.Web&quot;&gt;

  &lt;PropertyGroup&gt;
	&lt;TargetFramework&gt;net7.0&lt;/TargetFramework&gt;
	&lt;Nullable&gt;enable&lt;/Nullable&gt;
	&lt;ImplicitUsings&gt;enable&lt;/ImplicitUsings&gt;
  &lt;/PropertyGroup&gt;
  &lt;ItemGroup&gt;
	&lt;PackageReference Include=&quot;Elastic.Apm&quot; Version=&quot;1.24.0&quot; /&gt;
	&lt;PackageReference Include=&quot;Elastic.Apm.NetCoreAll&quot; Version=&quot;1.24.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Exporter.Console&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Exporter.OpenTelemetryProtocol&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Extensions.Hosting&quot; Version=&quot;1.6.0&quot; /&gt;
	&lt;PackageReference Include=&quot;OpenTelemetry.Instrumentation.AspNetCore&quot; Version=&quot;1.5.0-beta.1&quot; /&gt;
  &lt;/ItemGroup&gt;

&lt;/Project&gt;
</code></pre>
<ol start="6">
<li>In the Azure Cloud Shell editor, edit the Program.cs file:</li>
</ol>
<ul>
<li>Add a using statement at the top of the file to import System.Diagnostics, which is used to create Activities that are equivalent to “spans” in OpenTelemetry. Also import the OpenTelemetry.Resources and OpenTelemetry.Trace packages.</li>
</ul>
<pre><code class="language-csharp">using System.Diagnostics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
</code></pre>
<ul>
<li>Update the “builder” initialization code block to include configuration to enable Elastic OpenTelemetry observability.</li>
</ul>
<pre><code class="language-csharp">builder.Services.AddOpenTelemetry().WithTracing(tracing =&gt; tracing
    .AddSource(&quot;Helloworld&quot;)
    .AddAspNetCoreInstrumentation()
    .AddOtlpExporter()
    .ConfigureResource(resource =&gt;
        resource.AddService(serviceName: &quot;helloworld&quot;))
);
builder.Services.AddControllers();
</code></pre>
<ul>
<li>Replace the “Hello World!” HTML output string…</li>
</ul>
<pre><code class="language-html">&lt;h1&gt;Hello World!&lt;/h1&gt;
</code></pre>
<ul>
<li>...with the “Hello Elastic Observability” HTML output string.</li>
</ul>
<pre><code class="language-html">&lt;div style=&quot;text-align: center;&quot;&gt;
  &lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
    Hello Elastic Observability - Azure Container Apps - C#
  &lt;/h1&gt;
  &lt;img
    src=&quot;https://elastichelloworld.blob.core.windows.net/elastic-helloworld/elastic-logo.png&quot;
  /&gt;
&lt;/div&gt;
</code></pre>
<ul>
<li>Add a telemetry trace span around the output response utilizing the Telemetry class’ ActivitySource.</li>
</ul>
<pre><code class="language-csharp">using (Activity activity = Telemetry.activitySource.StartActivity(&quot;HelloSpan&quot;)!)
{
    Console.Write(&quot;hello&quot;);
    await context.Response.WriteAsync(output);
}
</code></pre>
<p>The updated Program.cs file should look something like this:</p>
<pre><code class="language-csharp">using System.Diagnostics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(tracing =&gt; tracing
    .AddSource(&quot;Helloworld&quot;)
    .AddAspNetCoreInstrumentation()
    .AddOtlpExporter()
    .ConfigureResource(resource =&gt;
        resource.AddService(serviceName: &quot;helloworld&quot;))
);
builder.Services.AddControllers();
var app = builder.Build();

string output =
&quot;&quot;&quot;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;h1 style=&quot;color: #005A9E; font-family:'Verdana'&quot;&gt;
Hello Elastic Observability - Azure Container Apps - C#
&lt;/h1&gt;
&lt;img src=&quot;https://elastichelloworld.blob.core.windows.net/elastic-helloworld/elastic-logo.png&quot;&gt;
&lt;/div&gt;
&quot;&quot;&quot;;

app.MapGet(&quot;/&quot;, async context =&gt;
{
    using (Activity activity = Telemetry.activitySource.StartActivity(&quot;HelloSpan&quot;)!)
    {
        Console.Write(&quot;hello&quot;);
        await context.Response.WriteAsync(output);
    }
});
app.Run();
</code></pre>
<ol start="7">
<li>Rebuild the Hello World app image and push the image to the Azure Container Registry by running the following command.</li>
</ol>
<pre><code class="language-bash">az acr build --registry $ACR_NAME --image $APP_NAME .
</code></pre>
<ol start="8">
<li>Redeploy the updated Hello World app to Azure Container Apps, using the following command.</li>
</ol>
<pre><code class="language-bash">az containerapp create \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --environment $ENVIRONMENT \
  --image $ACR_NAME.azurecr.io/$APP_NAME \
  --target-port 3500 \
  --ingress 'external' \
  --registry-server $ACR_NAME.azurecr.io \
  --query properties.configuration.ingress.fqdn
</code></pre>
<p>This command will output the deployed Hello World app's fully qualified domain name (FQDN). Copy and paste the FQDN into a browser to see the updated Hello World app running in Azure Container Apps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-10-elastic-hello-observability.png" alt="hello observability" /></p>
<h2>Observe the Hello World web app</h2>
<p>Now that we’ve instrumented the web app to send observability data to Elastic Observability, we can use Elastic Cloud to monitor the web app’s operations.</p>
<ol>
<li>
<p>In Elastic Cloud, select the Observability <strong>Services</strong> menu item.</p>
</li>
<li>
<p>Click the <strong>helloworld</strong> service.</p>
</li>
<li>
<p>Click the <strong>Transactions</strong> tab.</p>
</li>
<li>
<p>Scroll down and click the <strong>GET /</strong> transaction.</p>
</li>
<li>
<p>Scroll down to the <strong>Trace Sample</strong> section to see the <strong>GET /</strong> and <strong>HelloSpan</strong> trace samples.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/elastic-blog-12-latency-distribution.png" alt="latency-distribution" /></p>
<h2>Observability made to scale</h2>
<p>You’ve seen the entire process of deploying a web app to Azure Container Apps that is instrumented with Elastic Observability. This web app is now fully available on the web running on a platform that will auto-scale to serve visitors worldwide. And it’s instrumented for Elastic Observability APM using OpenTelemetry to ingest data into Elastic Cloud’s Kibana dashboards.</p>
<p>Now that you’ve seen how to deploy a Hello World web app with a basic observability setup, visit <a href="https://www.elastic.co/observability">Elastic Observability</a> to learn more about expanding to a full scale observability coverage solution for your apps. Or visit <a href="https://www.elastic.co/getting-started/microsoft-azure">Getting started with Elastic on Microsoft Azure</a> for more examples of how you can drive the data insights you need by combining Microsoft Azure’s cloud computing services with Elastic’s search-powered platform.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploy-app-observability-azure-container-apps/library-branding-elastic-observability-midnight-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to monitor Kafka and Confluent Cloud with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-kafka-confluent-cloud-elastic-observability</link>
            <guid isPermaLink="false">monitor-kafka-confluent-cloud-elastic-observability</guid>
            <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>This blog will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability. (To monitor Kafka brokers that are not in Confluent Cloud, I recommend checking out <a href="https://www.elastic.co/blog/how-to-monitor-containerized-kafka-with-elastic-observability">this blog</a>.) We will instrument Kafka applications with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>, use the Confluent Cloud metrics endpoint to get data about brokers, and pull it all together with a unified Kafka and Confluent Cloud monitoring dashboard in <a href="https://www.elastic.co/observability">Elastic Observability</a>.</p>
<h2>Using full-stack Elastic Observability to understand Kafka and Confluent performance</h2>
<p>In the <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/">2023 Dice Tech Salary Report</a>, Elasticsearch and Kafka are ranked #3 and #5 out of the top 12 <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/salary-trends#Skills">most in-demand skills</a> at the moment, so it’s no surprise that we are seeing a large number of customers who are implementing data in motion with Kafka.</p>
<p><a href="https://www.elastic.co/integrations/data-integrations?search=kafka">Kafka</a> comes with some additional complexities that go beyond traditional architectures and which make observability an even more important topic. Understanding where the bottlenecks are in messaging and stream-based architectures can be tough. This is why you need a comprehensive observability solution with <a href="https://www.elastic.co/blog/aiops-use-cases-observability-operations">machine learning</a> to help you.</p>
<p>In this blog, we will explore how to get Kafka applications instrumented with <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Elastic APM</a>, how to collect performance data with JMX, and how you can use the Elasticsearch Platform to pull in data from Confluent Cloud — which is by far the easiest and most cost-effective way to implement Kafka architectures.</p>
<p>For this blog post, we will be following the code at this <a href="https://github.com/davidgeorgehope/multi-cloud">git repository</a>. There are three services here that are designed to run on two clouds, push data from one cloud to the other, and finally land it in Google BigQuery. We want to monitor all of this using Elastic Observability to give you a complete picture of Confluent Cloud and Kafka service performance. As a teaser, this is the goal:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="kafka producer metrics" /></p>
<h2>A look at the architecture</h2>
<p>As mentioned, we have three <a href="https://www.elastic.co/observability/cloud-monitoring">multi-cloud services</a> implemented in our example application.</p>
<p>The first service is a Spring WebFlux service that runs inside AWS EKS. This service will take a message from a REST Endpoint and simply put it straight on to a Kafka topic.</p>
<p>The second service, which is also a Spring WebFlux service hosted inside Google Cloud Platform (GCP) with its <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a>, will then pick this up and forward it to another service that will put the message into BigQuery.</p>
<p>These services are all instrumented using Elastic APM. For this blog, we have decided to use Spring config to inject and configure the APM agent. You could of course use the “-javaagent” argument to inject the agent instead if preferred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-obsevability-aws-kafka-google-cloud.png" alt="aws kafka google cloud" /></p>
<h2>Getting started with Elastic Observability and Confluent Cloud</h2>
<p>Before we dive into the application and its configuration, you will want to get an Elastic Cloud and Confluent Cloud account. You can sign up here for <a href="https://www.elastic.co/cloud/">Elastic</a> and here for <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a>. There are some initial configuration steps we need to do inside Confluent Cloud, as you will need to create three topics: gcpTopic, myTopic, and topic_2.</p>
<p>When you sign up for Confluent Cloud, you will be given an option of what type of cluster to create. For this walk-through, a Basic cluster is fine (as shown) — if you are careful about usage, it will not cost you a penny.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-create-cluster.png" alt="confluent create cluster" /></p>
<p>Once you have a cluster, go ahead and create the three topics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-topics.png" alt="confluent topics" /></p>
<p>For this walk-through, you will only need to create single partition topics as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-new-topic.png" alt="new topic" /></p>
<p>Now we are ready to set up the Elastic Cloud cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-a-deployment.png" alt="create a deployment" /></p>
<p>One thing to note here: when setting up the Elastic cluster, the defaults are mostly fine, with one minor tweak. Under “Advanced Settings,” add capacity for machine learning.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-machine-learning-instances.png" alt="machine learning instances" /></p>
<h2>Getting APM up and running</h2>
<p>The first thing we want to do here is get our Spring Boot Webflux-based services up and running. For this blog, I have decided to attach the agent using Spring configuration, as you can see below. For brevity, I have not listed all the JMX configuration information, but you can see those details in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/aws-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">GitHub</a>.</p>
<pre><code class="language-java">package com.elastic.multicloud;
import co.elastic.apm.attach.ElasticApmAttacher;
import jakarta.annotation.PostConstruct;
import lombok.Setter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

import java.util.HashMap;
import java.util.Map;

@Setter
@Configuration
@ConfigurationProperties(prefix = &quot;elastic.apm&quot;)
@ConditionalOnProperty(value = &quot;elastic.apm.enabled&quot;, havingValue = &quot;true&quot;)
public class ElasticApmConfig {

    private static final String SERVER_URL_KEY = &quot;server_url&quot;;
    private String serverUrl;

    private static final String SERVICE_NAME_KEY = &quot;service_name&quot;;
    private String serviceName;

    private static final String SECRET_TOKEN_KEY = &quot;secret_token&quot;;
    private String secretToken;

    private static final String ENVIRONMENT_KEY = &quot;environment&quot;;
    private String environment;

    private static final String APPLICATION_PACKAGES_KEY = &quot;application_packages&quot;;
    private String applicationPackages;

    private static final String LOG_LEVEL_KEY = &quot;log_level&quot;;
    private String logLevel;
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticApmConfig.class);

    @PostConstruct
    public void init() {
        LOGGER.info(environment);

        Map&lt;String, String&gt; apmProps = new HashMap&lt;&gt;(6);
        apmProps.put(SERVER_URL_KEY, serverUrl);
        apmProps.put(SERVICE_NAME_KEY, serviceName);
        apmProps.put(SECRET_TOKEN_KEY, secretToken);
        apmProps.put(ENVIRONMENT_KEY, environment);
        apmProps.put(APPLICATION_PACKAGES_KEY, applicationPackages);
        apmProps.put(LOG_LEVEL_KEY, logLevel);
        apmProps.put(&quot;enable_experimental_instrumentations&quot;,&quot;true&quot;);
          apmProps.put(&quot;capture_jmx_metrics&quot;,&quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;);


        ElasticApmAttacher.attach(apmProps);
    }
}
</code></pre>
<p>Now obviously this requires some dependencies, which you can see here in the Maven pom.xml.</p>
<pre><code class="language-xml">&lt;dependency&gt;
			&lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
			&lt;artifactId&gt;apm-agent-attach&lt;/artifactId&gt;
			&lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
		&lt;/dependency&gt;
		&lt;dependency&gt;
			&lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
			&lt;artifactId&gt;apm-agent-api&lt;/artifactId&gt;
			&lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
		&lt;/dependency&gt;
</code></pre>
<p>Strictly speaking, the agent-api is not required, but it can be useful if you want to add your own monitoring code (as per the example below). The agent will happily auto-instrument without it, though.</p>
<pre><code class="language-java">Transaction transaction = ElasticApm.currentTransaction();
Span span = ElasticApm.currentSpan()
        .startSpan(&quot;external&quot;, &quot;kafka&quot;, null)
        .setName(&quot;DAVID&quot;)
        .setServiceTarget(&quot;kafka&quot;, &quot;gcp-elastic-apm-spring-boot-integration&quot;);
try (final Scope scope = transaction.activate()) {
    span.injectTraceHeaders((name, value) -&gt; producerRecord.headers().add(name, value.getBytes()));
    return Mono.fromRunnable(() -&gt; {
        kafkaTemplate.send(producerRecord);
    });
} catch (Exception e) {
    span.captureException(e);
    throw e;
} finally {
    span.end();
}
</code></pre>
<p>Now we have enough code to get our agent bootstrapped.</p>
<p>To get the code from the GitHub repository up and running, you will need the following installed on your system and to ensure that you have the credentials for your GCP and AWS cloud.</p>
<ul>
<li>Java</li>
<li>Maven</li>
<li>Docker</li>
<li>Kubernetes CLI (kubectl)</li>
</ul>
<h3>Clone the project</h3>
<p>Clone the multi-cloud Spring project to your local machine.</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/multi-cloud
</code></pre>
<h3>Build the project</h3>
<p>From each service in the project (aws-multi-cloud, gcp-multi-cloud, gcp-bigdata-consumer-multi-cloud), run the following commands to build the project.</p>
<pre><code class="language-bash">mvn clean install
</code></pre>
<p>Now you can run the Java project locally.</p>
<pre><code class="language-bash">java -jar gcp-bigdata-consumer-multi-cloud-0.0.1-SNAPSHOT.jar --spring.config.location=/Users/davidhope/applicaiton-gcp.properties
</code></pre>
<p>That will just get the Java application running locally, but you can also deploy this to Kubernetes using EKS and GKE as shown below.</p>
<h3>Create a Docker image</h3>
<p>Create a Docker image from the built project using the dockerBuild.sh provided in the project. You may want to customize this shell script to upload the built docker image to your own docker repository.</p>
<pre><code class="language-bash">./dockerBuild.sh
</code></pre>
<h3>Create a namespace for each service</h3>
<pre><code class="language-bash">kubectl create namespace aws
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-1
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-2
</code></pre>
<p>Once you have the namespaces created, you can switch context using the following command:</p>
<pre><code class="language-bash">kubectl config set-context --current --namespace=my-namespace
</code></pre>
<h3>Configuration for each service</h3>
<p>Each service needs an application.properties file. I have put an example <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/application.properties">here</a>.</p>
<p>You will need to replace the following properties with those you find in Elastic.</p>
<pre><code class="language-bash">elastic.apm.server-url=
elastic.apm.secret-token=
</code></pre>
<p>These can be found by going into Elastic Cloud and clicking on <strong>Services</strong> inside APM and then <strong>Add Data</strong>, which should be visible in the top right corner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-data.png" alt="add data" /></p>
<p>From there you will see the following, which gives you the config information you need.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-apm-agents.png" alt="apm agents" /></p>
<p>You will need to replace the following properties with those you find in Confluent Cloud.</p>
<pre><code class="language-bash">elastic.kafka.producer.sasl-jaas-config=
</code></pre>
<p>This configuration comes from the Clients page in Confluent Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-new-client.png" alt="confluent new client" /></p>
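<p>Putting those pieces together, a minimal application.properties could look something like the following. This is a sketch, not the exact file from the repo: the placeholder values and service name are assumptions, and the kafka property value should be checked against the example properties file linked above.</p>
<pre><code class="language-bash">elastic.apm.enabled=true
elastic.apm.server-url=https://&lt;your-apm-server-url&gt;:443
elastic.apm.secret-token=&lt;your-secret-token&gt;
elastic.apm.service-name=aws-multi-cloud
elastic.apm.environment=production
elastic.apm.application-packages=com.elastic.multicloud
elastic.apm.log-level=INFO
elastic.kafka.producer.sasl-jaas-config=org.apache.kafka.common.security.plain.PlainLoginModule required username='&lt;API_KEY&gt;' password='&lt;API_SECRET&gt;';
</code></pre>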
<h3>Adding the config for each service in Kubernetes</h3>
<p>Once you have a fully configured application properties, you need to add it to your <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes environment</a> as below.</p>
<p>From the aws namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-1 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-2 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic bigdata-creds --from-file=elastic-product-marketing-e145e13fbc7c.json

kubectl create secret generic my-app-config-gcp-bigdata --from-file=application.properties
</code></pre>
<h3>Create a Kubernetes deployment</h3>
<p>Create a Kubernetes deployment YAML file and add your Docker image to it. You can use the deployment.yaml file provided in the project as a template. Make sure to update the image name in the file to match the name of the Docker image you just created.</p>
<pre><code class="language-bash">kubectl apply -f deployment.yaml
</code></pre>
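<p>If you are writing your own manifest rather than reusing the one in the repo, the config secret created earlier can be mounted along these lines. The mount path and resource names here are assumptions for illustration; only the <code>my-app-config</code> secret name comes from the steps above.</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-multi-cloud
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-multi-cloud
  template:
    metadata:
      labels:
        app: aws-multi-cloud
    spec:
      containers:
        - name: aws-multi-cloud
          image: &lt;your-repo&gt;/aws-multi-cloud:latest
          volumeMounts:
            - name: app-config
              mountPath: /config
              readOnly: true
      volumes:
        - name: app-config
          secret:
            secretName: my-app-config
</code></pre>
<p>The container would then be started with <code>--spring.config.location=/config/application.properties</code> so Spring picks up the mounted file.</p>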
<h3>Create a Kubernetes service</h3>
<p>Create a Kubernetes service YAML file and add your deployment to it. You can use the service.yaml file provided in the project as a template.</p>
<pre><code class="language-bash">kubectl apply -f service.yaml
</code></pre>
<h3>Access your application</h3>
<p>Your application is now running in a Kubernetes cluster. To access it, you can use the service's cluster IP and port. You can get the service's IP and port using the following command.</p>
<pre><code class="language-bash">kubectl get services
</code></pre>
<p>Now that you know where the service is, you can invoke it.</p>
<p>You can regularly poke the service endpoint using the following command.</p>
<pre><code class="language-bash">curl -X POST -H &quot;Content-Type: application/json&quot; -d '{&quot;name&quot;: &quot;linuxize&quot;, &quot;email&quot;: &quot;linuxize@example.com&quot;}' http://localhost:8080/api/my-objects/publish
</code></pre>
<p>With this up and running, you should see the following service map build out in the Elastic APM product.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-aws-elastic-apm-spring-boot.png" alt="aws elastic apm spring boot" /></p>
<p>And traces will contain a waterfall graph showing all the spans that have executed across this distributed application, allowing you to pinpoint where any issues are within each transaction.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<h2>JMX for Kafka Producer/Consumer metrics</h2>
<p>In the previous part of this blog, we briefly touched on the JMX metric configuration you can see below.</p>
<pre><code class="language-bash">&quot;capture_jmx_metrics&quot;,&quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;
</code></pre>
<p>We can use this “capture_jmx_metrics” setting to capture any Kafka Producer/Consumer JMX metrics we want to monitor.</p>
<p>Check out the documentation <a href="https://www.elastic.co/guide/en/apm/agent/java/current/config-jmx.html">here</a> to understand how to configure this and <a href="https://docs.confluent.io/platform/current/kafka/monitoring.html">here</a> to see the available JMX metrics you can monitor. In the <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">example code in GitHub</a>, we actually pull all the available metrics in, so you can check in there how to configure this.</p>
<p>One thing worth pointing out: be sure to use the “metric_name” property shown above, as it becomes quite difficult to find the metrics in Elastic Discover without an explicit name.</p>
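<p>As a further illustration, a consumer-side metric can be captured with the same syntax. The metric below (records-lag-max from the consumer fetch manager MBean) is one we chose for illustration; check the Confluent monitoring documentation linked above for the MBeans your client version actually exposes.</p>
<pre><code class="language-bash">object_name[kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*] attribute[records-lag-max:metric_name=kafka.consumer.records-lag-max]
</code></pre>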
<h2>Monitoring Confluent Cloud with Elastic Observability</h2>
<p>So we now have some good monitoring set up for Kafka Producers and Consumers and we can trace transactions between services down to the lines of code that are executing. The core part of our Kafka infrastructure is hosted in Confluent Cloud. How, then, do we get data from there into our <a href="https://www.elastic.co/observability">full stack observability solution</a>?</p>
<p>Luckily, Confluent has done a fantastic job of making this easy. It provides important Confluent Cloud metrics via an open Prometheus-based metrics URL. So let's get down to business and configure this to bring data into our <a href="https://www.elastic.co/observability">observability tool</a>.</p>
<p>The first step is to configure Confluent Cloud with the MetricsViewer. The MetricsViewer role provides service account access to the Metrics API for all clusters in an organization. This role also enables service accounts to import metrics into third-party metrics platforms.</p>
<p>To assign the MetricsViewer role to a new service account:</p>
<ol>
<li>In the administration menu (☰) in the upper-right corner of the Confluent Cloud user interface, click <strong>ADMINISTRATION &gt; Cloud API keys</strong>.</li>
<li>Click <strong>Add key</strong>.</li>
<li>Click the <strong>Granular access tile</strong> to set the scope for the API key. Click <strong>Next</strong>.</li>
<li>Click <strong>Create a new one</strong> and specify the service account name. Optionally, add a description. Click <strong>Next</strong>.</li>
<li>The API key and secret are generated for the service account. You will need this API key and secret to connect to the cluster, so be sure to safely store this information. Click <strong>Save</strong>. The new service account with the API key and associated ACLs is created. When you return to the API access tab, you can view the newly-created API key to confirm.</li>
<li>Return to Accounts &amp; access in the administration menu, and in the Accounts tab, click <strong>Service accounts</strong> to view your service accounts.</li>
<li>Select the service account that you want to assign the MetricsViewer role to.</li>
<li>In the service account’s details page, click <strong>Access</strong>.</li>
<li>In the tree view, open the resource where you want the service account to have the MetricsViewer role.</li>
<li>Click <strong>Add role assignment</strong> and select the MetricsViewer tile. Click <strong>Save</strong>.</li>
</ol>
<p>Next we can head to <a href="https://www.elastic.co/observability">Elastic Observability</a> and configure the Prometheus integration to pull in the metrics data.</p>
<p>Go to the integrations page in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations.png" alt="observability integrations" /></p>
<p>Find the Prometheus integration. We are using it because the Confluent Cloud metrics server can provide data in Prometheus format, and it works really well (good work, Confluent!).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations-prometheus.png" alt="integrations prometheus" /></p>
<p>Add Prometheus in the next page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-prometheus.png" alt="add prometheus" /></p>
<p>Configure the Prometheus plugin in the following way: In the hosts box, add the following URL, replacing the resource kafka id with the cluster id you want to monitor.</p>
<pre><code class="language-bash">https://api.telemetry.confluent.cloud:443/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-collect-prometheus-metrics.png" alt="collect prometheus metrics" /></p>
<p>Under the advanced options, add the username and password: these are the API key and secret you created in the Confluent Cloud API keys step above.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-http-config-options.png" alt="http config options" /></p>
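<p>Before attaching the policy to an Elastic Agent, you can sanity-check that the export endpoint and credentials work with a quick curl, substituting the API key and secret from the earlier step and your own cluster id:</p>
<pre><code class="language-bash">curl -s -u &lt;API_KEY&gt;:&lt;API_SECRET&gt; \
  &quot;https://api.telemetry.confluent.cloud/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw&quot;
</code></pre>
<p>A successful call returns metrics in Prometheus exposition format.</p>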
<p>Once the Integration is created, <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html#apply-a-policy">the policy needs to be applied</a> to an instance of a running Elastic Agent.</p>
<p>That’s it! It’s that easy to get all the data you need for a full stack observability monitoring solution.</p>
<p>Finally, let’s pull all this together in a dashboard.</p>
<h2>Pulling it all together</h2>
<p>Using Kibana to generate dashboards is super easy. If you configured everything the way we recommended above, you should find the metrics (producer/consumer/brokers) you need to create your own dashboard as per the following screenshot.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-dashboard-metrics.png" alt="dashboard metrics" /></p>
<p>Luckily, I made a dashboard for you and stored it in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/export.ndjson">GitHub</a>. Take a look below and use this to import it into your own environments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="producer metrics" /></p>
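<p>You can import the dashboard through <strong>Stack Management &gt; Saved Objects</strong> in Kibana, or script it with Kibana’s saved objects import API, roughly as follows (the Kibana host and credentials are placeholders):</p>
<pre><code class="language-bash">curl -X POST -u &lt;username&gt;:&lt;password&gt; \
  &quot;https://&lt;your-kibana-host&gt;:5601/api/saved_objects/_import?overwrite=true&quot; \
  -H &quot;kbn-xsrf: true&quot; \
  --form file=@export.ndjson
</code></pre>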
<h2>Adding the icing on the cake: machine learning anomaly detection</h2>
<p>Now that we have all the critical bits in place, we are going to add the icing on the cake: machine learning (ML)!</p>
<p>Within Kibana, let's head over to the Machine Learning tab in “Analytics.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-kibana-analytics.png" alt="kibana analytics" /></p>
<p>Go to the jobs page, where we’ll get started creating our first anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-your-first-anomaly-detection-job.png" alt="create your first anomaly detection job" /></p>
<p>The metrics data view contains what we need to create this new anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-metrics.png" alt="observability metrics" /></p>
<p>Use the wizard and select a “Single Metric.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-a-wizard.png" alt="use a wizard" /></p>
<p>Use the full data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-full-data.png" alt="use full data" /></p>
<p>In this example, we are going to look for anomalies in the connection count. We really do not want a major deviation here, as this could indicate something very bad occurring if we suddenly have too many or too few things connecting to our Kafka cluster.</p>
<p>Once you have selected the connection count metric, you can proceed through the wizard and eventually your ML job will be created and you should be able to view the data as per the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-single-metric-viewer.png" alt="single metric viewer" /></p>
<p>Congratulations, you have now created a machine learning job to alert you if there are any problems with your Kafka cluster, adding <a href="https://www.elastic.co/observability/aiops">a full AIOps solution</a> to your Kafka and Confluent observability!</p>
<h2>Summary</h2>
<p>We looked at monitoring Kafka-based solutions implemented on Confluent Cloud using Elastic Observability.</p>
<p>We covered the architecture of a multi-cloud solution involving AWS EKS, Confluent Cloud, and GCP GKE. We looked at how to instrument Kafka applications with Elastic APM, use JMX for Kafka Producer/Consumer metrics, integrate Prometheus, and set up machine learning anomaly detection.</p>
<p>We went through a detailed walk-through with code snippets, configuration steps, and deployment instructions included to help you get started.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/patterns-white-background-no-logo-observability_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to remove PII from your Elastic data in 3 easy steps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/remove-pii-data</link>
            <guid isPermaLink="false">remove-pii-data</guid>
            <pubDate>Tue, 20 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. With Elastic's intuitive ML interface and parsing capabilities, sensitive data can easily be redacted from unstructured data.]]></description>
            <content:encoded><![CDATA[<p>Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or other fields where data is sensitive, PII may inadvertently be captured and stored. Structured logs make it easy to identify, remove, and protect sensitive data fields; but what about unstructured messages? Or perhaps call center transcriptions?</p>
<p>Elasticsearch, with its long experience in <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning">machine learning</a>, provides various options to bring in custom models, such as large language models (LLMs), and also ships models of its own. These models help implement PII redaction.</p>
<p>If you would like to learn more about natural language processing, machine learning, and Elastic, please be sure to check out these related articles:</p>
<ul>
<li><a href="https://www.elastic.co/blog/introduction-to-nlp-with-pytorch-models">Introduction to modern natural language processing with PyTorch in Elasticsearch</a></li>
<li><a href="https://www.elastic.co/blog/how-to-deploy-natural-language-processing-nlp-getting-started">How to deploy natural language processing (NLP): Getting started</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">Elastic Redact Processor Documentation</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-sparse-encoder-ai-model">Introducing Elastic Learned Sparse Encoder: Elastic’s AI model for semantic search</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">Accessing machine learning models in Elastic</a></li>
</ul>
<p>In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.</p>
<p>Specifically, we’ll walk through setting up a <a href="https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example">named entity recognition (NER)</a> model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined with an ingest pipeline where we can use Elastic machine learning and data transformation capabilities to remove sensitive information from your data.</p>
<h2>Loading the trained model</h2>
<p>Before we begin, we must load our NER model into our Elasticsearch cluster. This is easily accomplished with Docker and the Elastic Eland client. From a command line, let’s clone the Eland client via git:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/eland.git
</code></pre>
<p>Navigate into the recently downloaded client:</p>
<pre><code class="language-bash">cd eland/
</code></pre>
<p>Now let’s build the client:</p>
<pre><code class="language-bash">docker build -t elastic/eland .
</code></pre>
<p>From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace your username, password, es-cluster-hostname, and esport.</p>
<p>If you’re using the Elastic Cloud or have signed certificates, simply run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>If you’re using self-signed certificates, run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>From here you’ll witness the Eland client in action downloading the trained model from <a href="https://huggingface.co/dslim/bert-base-NER">HuggingFace</a> and automatically deploying it into your cluster!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-huggingface.png" alt="huggingface code" /></p>
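<p>Once the upload completes, you can confirm the model is deployed and allocated from DevTools (the model id below is how Eland names this Hugging Face model; adjust it if your id differs):</p>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
</code></pre>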
<p>Synchronize your newly loaded trained model by clicking on the blue hyperlink via your Machine Learning Overview UI “Synchronize your jobs and trained models.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Machine-Learning-Overview-UI.png" alt="Machine Learning Overview UI" /></p>
<p>Now click the Synchronize button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Synchronize-button.png" alt="Synchronize button" /></p>
<p>That’s it! Congratulations, you just loaded your first trained model into Elastic!</p>
<h2>Create the redact processor and ingest pipeline</h2>
<p>From DevTools, let’s configure the redact processor along with our inference processor to take advantage of Elastic’s trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll be focusing on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/redact
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;value&quot;: &quot;{{{message}}}&quot;
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        }
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '&lt;' + item['class_name'] + '&gt;')\r\n                }\r\n                ctx['redacted']=msg&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL}&quot;,
          &quot;%{IP:IP_ADDRESS}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD}&quot;,
          &quot;%{SSN:SSN}&quot;,
          &quot;%{PHONE:PHONE}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}&quot;,
          &quot;SSN&quot;: &quot;\\d{3}-\\d{2}-\\d{4}&quot;,
          &quot;PHONE&quot;: &quot;\\d{3}-\\d{3}-\\d{4}&quot;
        }
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;pii_script-redact&quot;
      }
    }
  ]
}
</code></pre>
<p>OK, but what does each processor really do? Let’s walk through each processor in detail here:</p>
<ol>
<li>
<p>The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.</p>
</li>
<li>
<p>The INFERENCE processor calls the NER model we loaded to be used on the message field for identifying names, locations, and organizations.</p>
</li>
<li>
<p>The SCRIPT processor then replaces each entity the model detected with its entity class name inside the redacted field.</p>
</li>
<li>
<p>Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).</p>
</li>
<li>
<p>The REMOVE processor keeps the extraneous ml.* fields from being indexed; note we’ll add “message” to this processor once we validate data is being redacted properly.</p>
</li>
<li>
<p>The ON_FAILURE / SET processor captures any errors just in case we have them.</p>
</li>
</ol>
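<p>Before testing through the UI, you can also exercise the pipeline directly from DevTools with the _simulate API; the sample message here is invented for illustration:</p>
<pre><code class="language-bash">POST _ingest/pipeline/redact/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;Jane Doe can be reached at jane.doe@example.com or 555-123-4567.&quot;
      }
    }
  ]
}
</code></pre>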
<h2>Slice your PII</h2>
<p>Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Ingest-Pipelines.png" alt="Ingest Pipelines" /></p>
<p>Click on the Manage button, and then click Edit.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-Manage-button.png" alt="Manage button" /></p>
<p>Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-test-pipeline.png" alt="test pipeline" /></p>
<pre><code class="language-json">{
  &quot;_source&quot;: {
    &quot;message&quot;: &quot;John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2&quot;
  }
}
</code></pre>
<p>Simply press the Run the pipeline button, and you will then see the following output:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output-2.png" alt="pii output code" /></p>
<h2>What’s next?</h2>
<p>After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-manage-processor.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output.png" alt="pii output code 2" /></p>
<h2>Conclusion</h2>
<p>With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Loading a pre-trained named entity recognition model into an Elastic cluster</li>
<li>Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion</li>
<li>Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your data.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-post4-ai-search-B.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How we fixed head-based sampling in OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/how-we-fixed-head-based-sampling-in-opentelemetry</link>
            <guid isPermaLink="false">how-we-fixed-head-based-sampling-in-opentelemetry</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Head-based sampling can break throughput charts without sampling metadata. Learn how OpenTelemetry tracestate probability fields fixed this in Java, JS, and Python.]]></description>
            <content:encoded><![CDATA[<p>Head-based sampling in OpenTelemetry is cheap and practical, but it used to create a major analytics problem: sampled traces reduced raw span counts, so backend throughput charts became wrong. The fix was to carry sampling probability in <code>tracestate</code> so a backend can estimate how many original traces each sampled trace represents. This article explains the problem, the spec, and how we implemented the fix in OpenTelemetry Java, JavaScript, and Python.</p>
<h2>Why sampling creates a throughput problem</h2>
<p>Most production systems sample traces because sending every span is expensive. Two common approaches are:</p>
<ul>
<li><strong>Head-based sampling</strong>: decide at trace start whether to keep or drop the trace.</li>
<li><strong>Tail-based sampling</strong>: decide later, after seeing more or all spans from the trace.</li>
</ul>
<p>Head-based sampling is fast and low-cost because it decides early. But if your backend only sees 10% of traces, a naive throughput chart built from ingested traces can undercount real traffic by 10x.</p>
<p>In other words, without extra metadata, sampled telemetry loses the context needed to reconstruct volume-based metrics.</p>
<h2>The OpenTelemetry spec that solves it</h2>
<p>The OpenTelemetry specification defines a way to encode probability sampling information in <code>tracestate</code>:</p>
<ul>
<li><a href="https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/">Tracestate: Probability Sampling</a></li>
<li><a href="https://opentelemetry.io/docs/specs/otel/trace/sdk/#built-in-composablesamplers">SDK: Built-in Composable Samplers</a></li>
</ul>
<p>At a high level, the sampler writes enough information into <code>tracestate</code> for downstream systems to understand the effective sampling probability of a trace.
When this metadata is present and propagated correctly, throughput and rate-oriented analytics can stay accurate while still getting the cost benefits of head-based sampling.
Elastic Observability supports this spec and behavior out of the box.
If you use Elastic's distribution (EDOT) SDKs or correctly configure the upstream OpenTelemetry SDKs as described below, Elastic can estimate the original throughput metrics from sampled data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/how-we-fixed-head-based-sampling-in-opentelemetry/sampling-metadata.jpg" alt="Head-based sampling with tracestate probability propagation" /></p>
<p>For example, a sampled span might carry this entry:</p>
<pre><code class="language-text">tracestate: ot=th:fd70a4;rv:fe123456789abc
</code></pre>
<p>Using the spec rules:</p>
<ul>
<li><code>th</code> is the rejection threshold (<code>T</code>) with trailing zeros removed.</li>
<li><code>rv</code> is the 56-bit randomness value (<code>R</code>).</li>
<li>A participant keeps the span when <code>R &gt;= T</code>.</li>
</ul>
<p>Here, <code>th:fd70a4</code> expands to <code>T = 0xfd70a400000000</code>, and <code>rv</code> gives <code>R = 0xfe123456789abc</code>, so the span is kept because <code>R &gt;= T</code>.</p>
<p>In decimal, that is:</p>
<ul>
<li><code>T = 0xfd70a400000000 = 71,337,018,784,743,424</code></li>
<li><code>R = 0xfe123456789abc = 71,514,660,082,850,492</code></li>
<li><code>2^56 = 72,057,594,037,927,936</code></li>
</ul>
<p>Since <code>71,514,660,082,850,492 &gt;= 71,337,018,784,743,424</code>, this trace is sampled.</p>
<p>The backend can convert <code>T</code> into a representative count (the adjusted count):</p>
<pre><code class="language-text">probability = (2^56 - T) / 2^56
adjusted_count = 1 / probability = 2^56 / (2^56 - T)
</code></pre>
<p>Plugging in the values:</p>
<pre><code class="language-text">probability = (72,057,594,037,927,936 - 71,337,018,784,743,424) / 72,057,594,037,927,936
            = 720,575,253,184,512 / 72,057,594,037,927,936
           ~= 0.01

adjusted_count = 1 / 0.01 = 100
</code></pre>
<p>So in practice this is approximately 1% sampling, and each sampled span represents about 100 original spans.</p>
<p>That means a backend can do weighted calculations, for example:</p>
<pre><code class="language-text">extrapolated_throughput = sampled_throughput * adjusted_count
</code></pre>
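<p>To make the arithmetic concrete, here is a small Python sketch (an illustrative helper, not part of any OTel SDK) that parses the <code>th</code> and <code>rv</code> values from the example above and derives the adjusted count:</p>

```python
MAX = 2 ** 56  # th and rv encode 56-bit values

def parse_56bit(hex_str: str) -> int:
    # th is serialized with trailing zeros removed; pad back to 14 hex digits.
    return int(hex_str.ljust(14, "0"), 16)

T = parse_56bit("fd70a4")          # rejection threshold from the th field
R = parse_56bit("fe123456789abc")  # randomness value from the rv field

sampled = R >= T                   # the span is kept when R >= T
probability = (MAX - T) / MAX      # effective sampling probability
adjusted_count = 1 / probability   # traces this sampled trace represents

print(sampled, round(probability, 4), round(adjusted_count))
```

<p>Running this reproduces the numbers above: the trace is sampled, the probability is roughly 0.01, and the adjusted count is about 100.</p>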
<h2>What was missing before</h2>
<p>A few months ago, OpenTelemetry SDK users had the spec but no out-of-the-box implementation in major SDKs.
Hence, spans received at a backend didn't carry the sampling metadata that would allow the backend to estimate the original trace volume.
In practice, this meant teams adopting head-based sampling had limited options:</p>
<ol>
<li>accept skewed throughput numbers,</li>
<li>build custom sampler logic, or</li>
<li>switch to more complex sampling setups.</li>
</ol>
<p>For many teams, that made standard head-based sampling much less useful than it should have been.</p>
<h2>The fix: implementation across Java, JavaScript, and Python</h2>
<p>We implemented the spec-aligned behavior in three SDKs so teams can use standardized sampling metadata instead of custom workarounds.</p>
<ul>
<li><strong>Java</strong>: <a href="https://github.com/open-telemetry/opentelemetry-java/pull/7626">open-telemetry/opentelemetry-java#7626</a></li>
<li><strong>JavaScript</strong>: <a href="https://github.com/open-telemetry/opentelemetry-js/pull/5839">open-telemetry/opentelemetry-js#5839</a></li>
<li><strong>Python</strong>: <a href="https://github.com/open-telemetry/opentelemetry-python/pull/4714">open-telemetry/opentelemetry-python#4714</a></li>
</ul>
<p>All three PRs implemented the composite/probability sampling behavior so the root sampling decision is represented in <code>tracestate</code> and can be preserved across service boundaries.</p>
<h3>What changed conceptually</h3>
<p>The important shift is not &quot;sample more&quot; or &quot;sample less&quot;. It is:</p>
<ul>
<li>keep probabilistic head-based sampling,</li>
<li>propagate probability metadata with the trace,</li>
<li>let backends compute weighted rates from sampled data.</li>
</ul>
<p>This keeps ingestion costs manageable and restores correct aggregate analysis.</p>
<h2>Implementation walkthrough</h2>
<p>The exact API shape differs by language and release, but the rollout pattern is similar:</p>
<ol>
<li>Use the SDK sampler implementation that supports the probability/composite spec behavior.</li>
<li>Keep W3C Trace Context propagation enabled so <code>tracestate</code> moves across services.</li>
<li>Validate in your backend that throughput/rate charts use weighted interpretation when sampling metadata exists.</li>
</ol>
<h3>Java</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Java (EDOT Java), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces. You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/configuration#central-configuration-settings">central configuration</a> or by setting the <code>otel.traces.sampler.arg</code> Java system property / <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel Java SDK, use the following logic to configure the sampler:</p>
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.extension.incubator.trace.samplers.ComposableSampler;
import io.opentelemetry.sdk.extension.incubator.trace.samplers.CompositeSampler;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

// Use a sampling ratio. For example, 10% sampling:
double ratio = 0.1;

SdkTracerProvider tracerProvider =
    SdkTracerProvider.builder()
        .setSampler(
            CompositeSampler.wrap(
                ComposableSampler.parentThreshold(
                    ComposableSampler.probability(ratio)
                )
            )
        )
        // (other configuration, e.g., span processor, exporter)
        .build();

OpenTelemetry openTelemetry =
    OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .build();

Tracer tracer = openTelemetry.getTracer(&quot;my-instrumentation-library&quot;);

// You can now start spans with the configured sampler.
// Example:
tracer.spanBuilder(&quot;example-span&quot;).startSpan();
</code></pre>
<p>The example above shows how to configure head-based probability sampling using the OpenTelemetry Java SDK.
Let’s break down the key parts:</p>
<ul>
<li><strong>Importing necessary classes:</strong> The imports bring in the required OpenTelemetry APIs and sampler extensions.</li>
<li><strong>Setting the sampling ratio:</strong> The <code>ratio</code> variable controls the fraction of traces sampled (for example, 0.1 for 10%).</li>
<li><strong>Sampler configuration:</strong>
<ul>
<li><code>CompositeSampler</code> and <code>ComposableSampler</code> are used to set up a sampler that follows the OpenTelemetry specification for composite samplers, enabling more accurate probability-based head sampling.</li>
<li><code>ComposableSampler.probability(ratio)</code> specifies that traces are sampled at the configured ratio.</li>
<li><code>ComposableSampler.parentThreshold(...)</code> ensures parent sampling decisions are respected, which keeps trace context consistent across service boundaries.</li>
<li>Wrapping this in <code>CompositeSampler.wrap(...)</code> gives you a sampler compliant with the latest spec.</li>
<li>In current OTel Java, sampled root spans emit the <code>th</code> value in <code>tracestate</code>, and <code>rv</code> is preserved when it is already present from upstream context.</li>
</ul>
</li>
<li><strong>Tracer provider and OpenTelemetry setup:</strong>
<ul>
<li>The configured sampler is attached to the <code>SdkTracerProvider</code> which is then built into the <code>OpenTelemetrySdk</code> instance.</li>
</ul>
</li>
<li><strong>Using the configured tracer:</strong>
<ul>
<li>When you build a span (like <code>tracer.spanBuilder(&quot;example-span&quot;).startSpan()</code>), the SDK applies your sampling policy as you generate traces.</li>
</ul>
</li>
</ul>
<p>This pattern ensures that your head-based sampler not only controls costs (by sampling only a percentage of traces) but also carries and respects sampling metadata.
This, in turn, enables downstream backends (like Elastic or any OTel-compliant backend) to correctly calculate throughput/volume metrics, accounting for sampling, and provide more accurate operational measurements.</p>
<h3>Node.js</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Node.js (EDOT Node), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces.
You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/node/configuration#central-configuration-settings">central configuration</a> or by setting the <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel JavaScript SDK, use the following logic to configure the sampler:</p>
<pre><code class="language-javascript">const { NodeSDK } = require('@opentelemetry/sdk-node');
const {
  createCompositeSampler,
  createComposableParentThresholdSampler,
  createComposableTraceIDRatioBasedSampler,
} = require('@opentelemetry/sampler-composite');

// Example: sample 10% of new root traces and preserve parent decisions.
const sampler = createCompositeSampler(
  createComposableParentThresholdSampler(
    createComposableTraceIDRatioBasedSampler(0.1)
  )
);

const sdk = new NodeSDK({ sampler });
sdk.start();
</code></pre>
<p>This JavaScript snippet demonstrates how to configure head-based probability sampling using the upstream OpenTelemetry JavaScript SDK with the new composite sampler specification.</p>
<ul>
<li>
<p><strong>Imports:</strong><br />
The code imports utility functions—<code>createCompositeSampler</code>, <code>createComposableParentThresholdSampler</code>, and <code>createComposableTraceIDRatioBasedSampler</code>—from the <code>@opentelemetry/sampler-composite</code> extension, which implements the spec-compliant composable sampling logic.</p>
</li>
<li>
<p><strong>Sampler configuration:</strong><br />
The sampler is constructed to:</p>
<ul>
<li>Use <code>createComposableTraceIDRatioBasedSampler(0.1)</code> to sample 10% of all (root) traces.</li>
<li>Wrap this in <code>createComposableParentThresholdSampler</code>, so sampling respects the decision made by any parent span that might come from upstream (preserving distributed trace context).</li>
<li>Finally, the whole structure is wrapped in <code>createCompositeSampler</code>, which puts it into the form expected by the OTel SDK.</li>
</ul>
</li>
<li>
<p><strong>Usage:</strong><br />
The sampler is passed to the <code>NodeSDK</code> from <code>@opentelemetry/sdk-node</code> at startup. After this, all spans you create in your application will follow this sampling logic.</p>
</li>
</ul>
<p>This approach enables accurate, head-based sampling in OpenTelemetry JavaScript, following the latest OTel specification. It ensures your traces are sampled at the rate you set, while also propagating sampling-related metadata in the <code>tracestate</code> to downstream services and telemetry backends (e.g., Elastic, Jaeger). This is critical for volume adjustment, accurate metrics, and cost control in distributed tracing environments.</p>
<h3>Python</h3>
<p>If you are using the Elastic Distribution of OpenTelemetry Python (EDOT Python), the sampler is already configured by default to use the probability/composite spec behavior.
By default, EDOT comes with a sampling rate of 100% for all traces.
You can change the sampling rate by setting the <code>sampling_rate</code> in <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/python/configuration#central-configuration-settings">central configuration</a> or by setting the <code>OTEL_TRACES_SAMPLER_ARG</code> environment variable.</p>
<p>For the upstream OTel Python SDK, register a custom sampler entry point and point <code>OTEL_TRACES_SAMPLER</code> to it:</p>
<pre><code class="language-toml"># pyproject.toml
[project.entry-points.opentelemetry_traces_sampler]
parentbased_composite = &quot;your_package.sampling:ParentBasedCompositeSampler&quot;
</code></pre>
<pre><code class="language-python"># your_package/sampling.py
from __future__ import annotations

from typing import Sequence

from opentelemetry.context import Context
from opentelemetry.sdk.trace._sampling_experimental import (
    composable_parent_threshold,
    composable_traceid_ratio_based,
    composite_sampler,
)
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult
from opentelemetry.trace import Link, SpanKind, TraceState
from opentelemetry.util.types import Attributes


class ParentBasedCompositeSampler(Sampler):
    # The SDK passes OTEL_TRACES_SAMPLER_ARG as this constructor argument.
    def __init__(self, ratio_str: str | None):
        try:
            ratio = float(ratio_str) if ratio_str else 1.0
        except ValueError:
            ratio = 1.0
        self._delegate = composite_sampler(
            composable_parent_threshold(composable_traceid_ratio_based(ratio))
        )

    def should_sample(
        self,
        parent_context: Context | None,
        trace_id: int,
        name: str,
        kind: SpanKind | None = None,
        attributes: Attributes | None = None,
        links: Sequence[Link] | None = None,
        trace_state: TraceState | None = None,
    ) -&gt; SamplingResult:
        return self._delegate.should_sample(
            parent_context,
            trace_id,
            name,
            kind,
            attributes,
            links,
            trace_state,
        )

    def get_description(self) -&gt; str:
        return self._delegate.get_description()
</code></pre>
<pre><code class="language-bash">export OTEL_TRACES_SAMPLER=parentbased_composite
export OTEL_TRACES_SAMPLER_ARG=0.10
</code></pre>
<p>This Python snippet demonstrates how to configure probability-based head sampling in the OpenTelemetry Python SDK using a custom sampler that supports the <code>tracestate</code> probability propagation spec.</p>
<p>Here's what happens in the example:</p>
<ul>
<li>It imports experimental sampling APIs from <code>opentelemetry.sdk.trace._sampling_experimental</code> to build a composite sampler that encodes the sampling probability in the <code>tracestate</code> of each root span. This supports backend throughput correction.</li>
<li>The <code>ParentBasedCompositeSampler</code> class is a wrapper you can plug into the SDK. Its constructor accepts the sampling probability as a string (from <code>OTEL_TRACES_SAMPLER_ARG</code>) and defaults to 1.0, i.e., 100% sampling.</li>
<li><code>composite_sampler(composable_parent_threshold(composable_traceid_ratio_based(ratio)))</code> builds a sampler that:
<ol>
<li>Uses the probability to sample root spans.</li>
<li>Propagates the root sampling decision for child spans via parent-based threshold logic.</li>
<li>Embeds and respects the OpenTelemetry probability fields in <code>tracestate</code>.</li>
</ol>
</li>
<li>The example then shows the required environment variables to enable this sampling logic:
<pre><code class="language-bash">export OTEL_TRACES_SAMPLER=parentbased_composite
export OTEL_TRACES_SAMPLER_ARG=0.10
</code></pre>With these values, you enable parent-based composite sampling with a 10% sampling rate.</li>
</ul>
<p>This snippet enables standards-compliant probability sampling and propagation in OpenTelemetry Python, so throughput metrics can be accurately estimated by your backend based on the <code>tracestate</code> metadata.</p>
<h2>Validation</h2>
<p>To validate your setup and ensure accurate throughput metrics when using head-based sampling, follow these steps:</p>
<ol>
<li>Deploy the SDK in a controlled environment where you can manage both load and sampling rate.</li>
<li>Generate a steady, predictable load with a known throughput.</li>
<li>Set a fixed sampling rate for traces.</li>
<li>Send the resulting telemetry data to Elastic Observability.</li>
<li>Confirm that the reported throughput metrics in Elastic match your expectations.</li>
</ol>
<p>You can use the following ES|QL commands to compare the observed raw throughput (counted from sampled traces) to the extrapolated throughput available in the derived metrics:</p>
<p><strong>Raw throughput:</strong></p>
<pre><code class="language-esql">FROM traces-* 
| WHERE service.name == &quot;your-service&quot;
| WHERE transaction.name IS NOT NULL
| STATS count_transactions = COUNT(*),
    time_range = DATE_DIFF(&quot;minute&quot;, MIN(@timestamp), MAX(@timestamp))
| EVAL raw_throughput_per_min = count_transactions::double / time_range
</code></pre>
<p><strong>Extrapolated throughput:</strong></p>
<pre><code class="language-esql">FROM metrics-*
| WHERE service.name == &quot;your-service&quot;
| WHERE metricset.name == &quot;service_transaction&quot; AND metricset.interval == &quot;1m&quot;
| STATS count_transactions = COUNT(transaction.duration.summary),
    time_range = DATE_DIFF(&quot;minute&quot;, MIN(@timestamp), MAX(@timestamp))
| EVAL extrapolated_throughput_per_min = count_transactions::double / time_range
</code></pre>
<p>The <code>extrapolated_throughput_per_min</code> should be close to your real throughput rate, while the <code>raw_throughput_per_min</code> should be close to your real throughput multiplied by the configured sampling rate.</p>
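<p>As a quick sanity check of what those two queries should roughly report, consider a hypothetical service handling 1,000 requests per minute with 10% head-based sampling:</p>

```python
real_throughput = 1000.0  # hypothetical requests/min the service really handles
sampling_rate = 0.10      # configured head-based sampling rate
adjusted_count = 1 / sampling_rate

# Roughly what the two ES|QL queries should report:
raw_throughput = real_throughput * sampling_rate           # from sampled traces only
extrapolated_throughput = raw_throughput * adjusted_count  # weighted estimate

print(raw_throughput, extrapolated_throughput)
```

<p>If the raw and extrapolated values deviate significantly from this pattern, check that <code>tracestate</code> is propagated end to end.</p>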
<h2>Conclusion</h2>
<p>This work turns head-based sampling into a much safer default for teams that need both cost control and reliable operational metrics.
You no longer have to choose between affordable trace volume and trustworthy throughput calculations.</p>
<p>Before enabling head-based sampling with probability propagation, review the linked PRs and consult each SDK’s release notes to confirm the minimum version with full support (Java, JavaScript, and Python).
Start by enabling sampling-aware configuration in a single service, and once you verify correct <code>tracestate</code> propagation and backend metric accuracy, gradually roll out the change to additional services.
To maintain reliability, establish monitoring and automated regression tests for throughput accuracy, so you can spot any unintended metric drift when sampling rates or SDK components are updated.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/how-we-fixed-head-based-sampling-in-opentelemetry/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Improving the Elastic APM UI performance with continuous rollups and service metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/apm-ui-performance-continuous-rollups-service-metrics</link>
            <guid isPermaLink="false">apm-ui-performance-continuous-rollups-service-metrics</guid>
            <pubDate>Thu, 29 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[We made significant improvements to the UI performance in Elastic APM to make it scale with even the most demanding workloads, by pre-aggregating metrics at the service level, and storing the metrics at different levels of granularity.]]></description>
            <content:encoded><![CDATA[<p>In today's fast-paced digital landscape, the ability to monitor and optimize application performance is crucial for organizations striving to deliver exceptional user experiences. At Elastic, we recognize the significance of providing our user base with a reliable <a href="https://www.elastic.co/observability">observability platform</a> that scales with you as you’re onboarding thousands of services that produce terabytes of data each day. We have been diligently working behind the scenes to enhance our solution to meet the demands of even the largest deployments.</p>
<p>In this blog post, we are excited to share the significant strides we have made in improving the UI performance of Elastic APM. Maintaining a snappy user interface can be a challenge when interactively summarizing the massive amounts of data needed to provide an overview of the performance for an entire enterprise-scale service inventory. We want to assure our customers that we have listened, taken action, and made notable architectural changes to elevate the scalability and maturity of our solution.</p>
<h2>Architectural enhancements</h2>
<p>Our journey began back in the 7.x series where we noticed that doing ad-hoc aggregations on raw <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-transactions.html">transaction</a> data put Elasticsearch<sup>®</sup> under a lot of pressure in large-scale environments. Since then, we’ve begun to pre-aggregate the transactions into transaction metrics during ingestion. This has helped to keep the performance of the UI relatively stable. Regardless of how busy the monitored application is and how many transaction events it is creating, we’re just querying pre-aggregated metrics that are stored at a constant rate. We’ve enabled the metrics-powered UI by default in <a href="https://github.com/elastic/kibana/issues/92024">7.15</a>.</p>
<p>However, when showing an inventory of a large number of services over large time ranges, the number of metric data points that need to be aggregated can still be large enough to cause performance issues. We also create a time series for each distinct set of dimensions. The dimensions include metadata, such as the transaction name and the host name. Our <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_transaction_metrics">documentation</a> includes a full list of all available dimensions. If there’s a very high number of unique transaction names, which could be a result of improper instrumentation (see <a href="https://www.elastic.co/guide/en/kibana/current/troubleshooting.html#troubleshooting-too-many-transactions">docs</a> for more details), this will create a lot of individual time series that will need to be aggregated when requesting a summary of the service’s overall performance. Global labels that are added to the APM Agent configuration are also added as dimensions to these metrics, and therefore they can also impact the number of time series. Refer to the FAQs section below for more details.</p>
<p>Within the 8.7 and 8.8 releases, we’ve addressed these challenges with the following architectural enhancements that aim to reduce the number of documents Elasticsearch needs to search and aggregate on-the-fly, resulting in faster response times:</p>
<ul>
<li><strong>Pre-aggregation of transaction metrics into service metrics.</strong> Instead of aggregating all distinct time series that are created for each individual transaction name on-the-fly for every user request, we’re already pre-aggregating a summary time series for each service during data ingestion. Depending on how many unique transaction names the services have, this reduces the number of documents Elasticsearch needs to look up and aggregate by a factor of typically 10–100. This is particularly useful for the <a href="https://www.elastic.co/guide/en/kibana/master/services.html">service inventory</a> and the <a href="https://www.elastic.co/guide/en/kibana/master/service-overview.html">service overview</a> pages.</li>
<li><strong>Pre-aggregation of all metrics into different levels of granularity.</strong> The APM UI chooses the most appropriate level of granularity, depending on the selected time range. In addition to the metrics that are stored at a 1-minute granularity, we’re also summarizing and storing metrics at a 10-minute and 60-minute granularity level. For example, when looking at a 7-day period, the 60-minute data stream is queried instead of the 1-minute one, resulting in 60x fewer documents for Elasticsearch to examine. This makes sure that all graphs are rendered quickly, even when looking at larger time ranges.</li>
<li><strong>Safeguards on the number of unique transactions per service for which we are aggregating metrics.</strong> Our agents are designed to keep the cardinality of the transaction name low. But in the wild, we’ve seen some services that have a huge amount of unique transaction names. This used to cause performance problems in the UI because APM Server would create many time series that the UI needed to aggregate at query time. In order to protect APM Server from running out of memory when aggregating a large number of time series for each unique transaction name, metrics were published without aggregating when limits for the number of time series were reached. This resulted in a lot of individual metric documents that needed to be aggregated at query time. To address the problem, we've introduced a system where we aggregate metrics in a dedicated overflow bucket for each service when limits are reached. Refer to our <a href="https://www.elastic.co/guide/en/kibana/8.8/troubleshooting.html#troubleshooting-too-many-transactions">documentation</a> for more details.</li>
</ul>
<p>The exact factor of the document count reduction depends on various conditions. But to get a sense of a typical scenario: if your services, on average, have 10 instances, no instance-specific global labels, and 100 unique transaction names each, and you’re looking at time ranges that can leverage the 60m granularity, you’d see a reduction of the documents that Elasticsearch needs to aggregate by a factor of 180,000 (10 instances x 100 transaction names x 60m x 3, because we’re also collapsing the event.outcome dimension). While the response time of Elasticsearch aggregations doesn’t scale exactly linearly with the number of documents, there is a strong correlation.</p>
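<p>The back-of-the-envelope arithmetic above is easy to verify. A quick sketch (the instance, transaction-name, granularity, and outcome counts are the illustrative values from this example, not APM Server constants):</p>
<pre><code class="language-python"># Document-count reduction when pre-aggregated service metrics replace raw
# per-transaction documents. Values are the illustrative ones from the example
# above, not APM Server constants.
instances = 10            # service instances collapsed into one service-level series
transaction_names = 100   # unique transaction names collapsed per service
granularity = 60          # 60 one-minute documents summarized into one 60m document
outcome_dimension = 3     # event.outcome values collapsed (success, failure, unknown)

factor = instances * transaction_names * granularity * outcome_dimension
print(factor)  # 180000
</code></pre>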
<h2>FAQs</h2>
<h3>When upgrading to the latest version, will my old data also load faster?</h3>
<p>Updating to 8.8 doesn’t immediately make the UI faster. Because the improvements are powered by pre-aggregations that APM Server performs during ingestion, only new data will benefit from them. For that reason, make sure to update APM Server along with the rest of the stack. The UI can still display data that was ingested using an older version of the stack.</p>
<h3>If the UI is based on metrics, can I still slice and dice using custom labels?</h3>
<p>High cardinality analysis is a big strength of Elastic Observability, and this focus on pre-aggregated metrics does not compromise that in any way.</p>
<p>The UI implements a sophisticated fallback mechanism that uses service metrics, transaction metrics, or raw transaction events, depending on which filters are applied. We’re not creating metrics for each user.id, for example. But you can still filter the data by user.id, and the UI will then use raw transaction events. Chances are that you’re looking at a narrow slice of data when filtering by a dimension that is not available on the pre-aggregated metrics, so aggregations on the raw data are typically very fast.</p>
<p>Note that all global labels that are added to the APM agent configuration are part of the dimension of the pre-aggregated metrics, with the exception of RUM (see more details in <a href="https://github.com/elastic/apm-server/issues/11037">this issue</a>).</p>
<h3>Can I use the pre-aggregated metrics in custom dashboards?</h3>
<p>Yes! If you use <a href="https://www.elastic.co/guide/en/kibana/current/lens.html">Lens</a> and select the &quot;APM&quot; data view, you can filter on either metricset.name:service_transaction or metricset.name:transaction, depending on the level of detail you need. Transaction latency is captured in transaction.duration.histogram, and successful outcomes and failed outcomes are stored in event.success_count. If you don't need a distribution of values, you can also select the transaction.duration.summary field for your metric aggregations, which should be faster. If you want to calculate the failure rate, here's a <a href="https://www.elastic.co/guide/en/kibana/current/lens.html#lens-formulas">Lens formula</a>: 1 - (sum(event.success_count) / count(event.success_count)). Note that the only granularity supported here is 1m.</p>
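<p>The Lens formula above is plain arithmetic over two aggregations. A minimal sketch of the same computation, using made-up success counts rather than real APM documents:</p>
<pre><code class="language-python"># Failure rate = 1 - (sum(event.success_count) / count(event.success_count)),
# mirroring the Lens formula above. The sample values are made up: one 0/1
# success flag per event, as an illustration only.
success_counts = [1, 1, 0, 1, 0, 1, 1, 1]

failure_rate = 1 - (sum(success_counts) / len(success_counts))
print(failure_rate)  # 0.25
</code></pre>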
<h3>Do the additional metrics have an impact on the storage?</h3>
<p>While we’re storing more metrics than before, and we’re storing all metrics in different levels of granularity, we were able to offset that by enabling <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">synthetic source</a> for all metric data streams. We’ve even increased the default retention for the metrics in the coarse-grained granularity levels, so that the 60m rollup data streams are now stored for 390 days. Please consult our <a href="https://www.elastic.co/guide/en/apm/guide/current/apm-data-streams.html">documentation</a> for more information about the different metric data streams.</p>
<h3>Are there limits on the amount of time series that APM Server can aggregate?</h3>
<p>APM Server performs pre-aggregations in memory, which is fast, but consumes a considerable amount of memory. There are limits in place to protect APM Server from running out of memory, and from 8.7, most of them scale with available memory by default, meaning that allocating more memory to APM Server will allow it to handle more unique pre-aggregation groups like services and transactions. These limits are described in <a href="https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows">APM Server Data Model docs</a>.</p>
<p>On the APM Server roadmap, we have plans to move to an LSM-based approach where pre-aggregations are performed with the help of disks in order to reduce memory usage. This will enable APM Server to scale better with the input size and cardinality.</p>
<p>A common pitfall when working with pre-aggregations is to add instance-specific global labels to APM agents. This may exhaust the aggregation limits and cause metrics to be aggregated under the overflow bucket instead of the corresponding service. Therefore, make sure to follow the best practice of only adding a limited set of global labels to a particular service.</p>
<h2>Validation</h2>
<p>To validate the effectiveness of the new architecture, and to ensure that the accuracy of the data is not negatively affected, we prepared a test environment where we generated 35K+ transactions per minute over a timespan of 14 days, resulting in approximately 850 million documents.</p>
<p>We’ve tested the queries that power our service inventory, the service overview, and the transaction details using different time ranges (1d, 7d, 14d). Across the board, we’ve seen orders-of-magnitude improvements. In particular, queries across larger time ranges that benefit from using the coarse-grained metrics in addition to the pre-aggregated service metrics saw dramatic reductions in response time.</p>
<p>We’ve also validated that there’s no loss in accuracy when using the more coarse-grained metrics for larger time ranges.</p>
<p>Every environment will behave a bit differently, but we’re confident that the impressive improvements in response time will translate well to setups of even bigger scale.</p>
<h2>Planned improvements</h2>
<p>As mentioned in the FAQs section, the number of time series for transaction metrics can grow quickly, as it is the product of multiple dimensions. For example, given a service that runs on 100 hosts and has 100 transaction names that each have 4 transaction results, APM Server needs to track 40,000 (100 x 100 x 4) different time series for that service. This would even exceed the maximum per-service limit of 32,000 for APM Servers with 64GB of main memory.</p>
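<p>To see how quickly these dimensions multiply out, here is the example above as a quick computation (the numbers and the 32,000 per-service limit are taken from this example, not universal constants):</p>
<pre><code class="language-python"># Time series tracked for one service = hosts x transaction names x transaction results.
# Values taken from the example above; the 32,000 limit is the per-service
# example limit for an APM Server with 64GB of main memory.
hosts = 100
transaction_names = 100
transaction_results = 4

series = hosts * transaction_names * transaction_results
print(series)           # 40000
print(series > 32_000)  # True: exceeds the example per-service limit, so some
                        # metrics would overflow into a shared bucket
</code></pre>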
<p>As a result, the UI will show an entry for “Remaining Transactions” on the Service overview page, which tracks the transaction metrics for a service once it hits the limit. You may therefore not see all transaction names of your service. It may also be that all distinct transaction names are listed, but that the transaction metrics for some of the instances of that service are combined in the “Remaining Transactions” category.</p>
<p>We’re currently considering restructuring the dimensions for the metrics so that the combination of the transaction name and service instance-specific dimensions (such as the host name) does not lead to an explosion of time series. Stay tuned for more details.</p>
<h2>Conclusion</h2>
<p>The architectural improvements we’ve delivered in the past releases provide a step-function improvement in the scalability and responsiveness of our UI. Instead of having to aggregate massive amounts of data on-the-fly as users are navigating through the user interface, we pre-aggregate the results for the most common queries as data is coming in. This ensures we have the answers ready before users have even asked their most frequently asked questions, while still being able to answer ad-hoc questions.</p>
<p>We are excited to continue supporting our community members as they push boundaries on their growth journey, providing them with a powerful and mature platform that can effortlessly handle the demands of the largest workloads. Elastic is committed to its mission to enable everyone to find the answers that matter. From all data. In real time. At scale.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/apm-ui-performance-continuous-rollups-service-metrics/elastic-blog-header-ui.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Infrastructure monitoring with OpenTelemetry in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability</link>
            <guid isPermaLink="false">infrastructure-monitoring-with-opentelemetry-in-elastic-observability</guid>
            <pubDate>Wed, 24 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Integrating OpenTelemetry with Elastic Observability for Application and Infrastructure Monitoring Solutions.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, we recently made a decision to fully embrace OpenTelemetry as the premier data collection framework. As an Observability engineer, I firmly believe that vendor agnosticism is essential for delivering the greatest value to our customers. By committing to OpenTelemetry, we are not only staying current with technological advancements but also driving them forward. This investment positions us at the forefront of the industry, championing a more open and flexible approach to observability.</p>
<p>Elastic donated the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema (ECS)</a> to OpenTelemetry and is actively working to <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">converge</a> it with the semantic conventions. In the meantime, we are dedicated to supporting our users by ensuring they don’t have to navigate different standards. Our goal is to provide a seamless end-to-end experience while using OpenTelemetry with our application and infrastructure monitoring solutions. This commitment allows users to benefit from the best of both worlds without any friction.</p>
<p>In this blog, we explore how to use the OpenTelemetry (OTel) collector to capture core system metrics from various sources such as AWS EC2, Google Compute, Kubernetes clusters, and individual systems running Linux or MacOS.</p>
<h2>Powering Infrastructure UIs with Two Ingest Paths</h2>
<p>Elastic users who wish to have OpenTelemetry as their data collection mechanism can now monitor the health of the hosts where the OpenTelemetry collector is deployed using the Hosts and Inventory UIs available in Elastic Observability.</p>
<p>Elastic offers two distinct ingest paths to power Infrastructure UIs: the ElasticsearchExporter Ingest Path and the OTLP Exporter Ingest Path.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/IngestPath.png" alt="IngestPath" /></p>
<h3>ElasticsearchExporter Ingest Path:</h3>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#host-metrics-receiver">hostmetrics receiver</a> collects system-level metrics such as CPU, memory, and disk usage from the host machine in the OTel schema. The ElasticsearchExporter ingest path builds on this receiver: we've developed the <a href="https://github.com/elastic/opentelemetry-collector-components/tree/main/processor/elasticinframetricsprocessor#elastic-infra-metrics-processor">ElasticInfraMetricsProcessor</a>, which utilizes the <a href="https://github.com/elastic/opentelemetry-lib/tree/main?tab=readme-ov-file#opentelemetry-lib">opentelemetry-lib</a> to convert these metrics into a format that Elastic UIs understand.</p>
<p>For example, the <code>system.network.io</code> OTel metric includes a <code>direction</code> attribute  with values <code>receive</code> or <code>transmit</code>. These correspond to <code>system.network.in.bytes</code> and <code>system.network.out.bytes</code>, respectively, within Elastic.</p>
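<p>To make the renaming concrete, here is a simplified sketch of this one mapping in Python. It is illustrative only, not the actual ElasticInfraMetricsProcessor or opentelemetry-lib code, and covers just the <code>system.network.io</code> metric:</p>
<pre><code class="language-python"># Illustrative sketch: translate OTel system.network.io data points (with a
# "direction" attribute) into ECS-style field names. Not the real processor code.
ECS_NETWORK_FIELDS = {
    "receive": "system.network.in.bytes",
    "transmit": "system.network.out.bytes",
}

def map_network_io(data_points):
    """Map {direction, value} OTel data points to ECS-style documents."""
    docs = []
    for point in data_points:
        ecs_field = ECS_NETWORK_FIELDS[point["direction"]]
        docs.append({ecs_field: point["value"]})
    return docs

print(map_network_io([
    {"direction": "receive", "value": 1024},
    {"direction": "transmit", "value": 2048},
]))
# [{'system.network.in.bytes': 1024}, {'system.network.out.bytes': 2048}]
</code></pre>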
<p>The <a href="https://github.com/elastic/opentelemetry-collector-components/tree/main/processor/elasticinframetricsprocessor#elastic-infra-metrics-processor">processor</a> then forwards these metrics to the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/elasticsearchexporter#elasticsearch-exporter">Elasticsearch Exporter</a>, now enhanced to support exporting metrics in ECS mode. The exporter sends the metrics to an Elasticsearch endpoint, lighting up the Infrastructure UIs with insightful data.</p>
<p>To utilize this path, you can deploy the collector from the Elastic Collector Distro, available <a href="https://github.com/elastic/elastic-agent/blob/main/internal/pkg/otel/README.md">here</a>.</p>
<p>An example collector config for this Ingest Path:</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      process:
        metrics:
          process.open_file_descriptors:
            enabled: true
          process.memory.utilization:
            enabled: true
          process.disk.operations:
            enabled: true
      network:
      processes:
      load:
      disk:
      filesystem:

processors:
  resourcedetection/system:
    detectors: [&quot;system&quot;, &quot;ec2&quot;]
  elasticinframetrics:

exporters:  
  logging:
    verbosity: detailed
  elasticsearch/metrics: 
    endpoints: &lt;elasticsearch_endpoint&gt;
    api_key: &lt;api_key&gt;
    mapping:
      mode: ecs

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      processors: [resourcedetection/system, elasticinframetrics]
      exporters: [logging, elasticsearch/metrics]

</code></pre>
<p>The Elastic exporter path is ideal for users who would prefer using the custom Elastic Collector <a href="https://github.com/elastic/elastic-agent/blob/main/internal/pkg/otel/README.md">Distro</a>. This path includes the ElasticInfraMetricsProcessor, which sends data to Elasticsearch via the Elasticsearch exporter.</p>
<h3>OTLP Exporter Ingest Path:</h3>
<p>In the OTLP Exporter Ingest path, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#host-metrics-receiver">hostmetrics receiver</a> collects system-level metrics such as CPU, memory, and disk usage from the host machine in OTel Schema. These metrics are sent to the <a href="https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlpexporter#otlp-grpc-exporter">OTLP Exporter</a>, which forwards them to the <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry-direct.html#apm-connect-open-telemetry-collector">APM Server endpoint</a>. The APM Server, using the same <a href="https://github.com/elastic/opentelemetry-lib/tree/main?tab=readme-ov-file#opentelemetry-lib">opentelemetry-lib</a>, converts these metrics into a format compatible with Elastic UIs. Subsequently, the APM Server pushes the metrics to Elasticsearch, powering the Infrastructure UIs.</p>
<p>An example collector configuration for the APM Ingest Path</p>
<pre><code class="language-yaml">receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
          system.cpu.logical.count:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      process:
        metrics:
          process.open_file_descriptors:
            enabled: true
          process.memory.utilization:
            enabled: true
          process.disk.operations:
            enabled: true
      network:
      processes:
      load:
      disk:
      filesystem:

processors:
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]

exporters:
  otlphttp:
    endpoint: &lt;mis_endpoint&gt;
    tls:
      insecure: false
    headers:
      Authorization: ApiKey &lt;api_key&gt;
  logging:
    verbosity: detailed

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      processors: [resourcedetection/system]
      exporters: [logging, otlphttp]


</code></pre>
<p>The OTLP Exporter Ingest path can help existing users who are already using Elastic APM and want to see the Infrastructure UIs populated as well. These users can use the default <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib?tab=readme-ov-file#opentelemetry-collector-contrib">OpenTelemetry Collector</a>.</p>
<h2>A glimpse of the Infrastructure UIs</h2>
<p>The Infrastructure UIs showcase both host- and Kubernetes-level views. Below are a few glimpses of these UIs.</p>
<p>The Hosts Overview UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/HostUI.png" alt="HostUI" /></p>
<p>The Hosts Inventory UI
<img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Inventory.png" alt="InventoryUI" /></p>
<p>The Process-related Details of the Host</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Processes.png" alt="Processes" /></p>
<p>The Kubernetes Inventory UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/K8s.png" alt="K8s" /></p>
<p>Pod level Metrics</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Pod_Metrics.png" alt="Pod Metrics" /></p>
<p>Our next step is to create Infrastructure UIs powered by native OTel data, with dedicated OTel dashboards that run on this native data.</p>
<h2>Conclusion</h2>
<p>Elastic's integration with OpenTelemetry simplifies the observability landscape. While we are diligently working to align ECS with OpenTelemetry’s semantic conventions, our immediate priority is to support our users by simplifying their experience. With this added support, we aim to deliver a seamless, end-to-end experience for those using OpenTelemetry with our application and infrastructure monitoring solutions. We are excited to see how our users will leverage these capabilities to gain deeper insights into their systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/infrastructure-monitoring-with-opentelemetry-in-elastic-observability/Monitoring-infra-with-Otel.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Ingesting and analyzing Prometheus metrics with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ingesting-analyzing-prometheus-metrics-observability</link>
            <guid isPermaLink="false">ingesting-analyzing-prometheus-metrics-observability</guid>
            <pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.]]></description>
            <content:encoded><![CDATA[<p>In the world of monitoring and observability, <a href="https://prometheus.io/">Prometheus</a> has grown into the de-facto standard for monitoring in cloud-native environments because of its robust data collection mechanism, flexible querying capabilities, and integration with other tools for rich dashboarding and visualization.</p>
<p>Prometheus is primarily built for short-term metric storage, typically retaining data in-memory or on local disk storage, with a focus on real-time monitoring and alerting rather than historical analysis. While it offers valuable insights into current metric values and trends, it may pose economic challenges and fall short of the robust functionalities and capabilities necessary for in-depth historical analysis, long-term trend detection, and forecasting. This is particularly evident in large environments with a substantial number of targets or high data ingestion rates, where metric data accumulates rapidly.</p>
<p>Numerous organizations assess their unique needs and explore avenues to augment their Prometheus monitoring and observability capabilities. One effective approach is integrating Prometheus with Elastic®. In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.</p>
<h2>Integrate Prometheus with Elastic seamlessly</h2>
<p>Organizations that have configured their cloud-native applications to expose metrics in Prometheus format can seamlessly transmit the metrics to Elastic by using <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">Prometheus integration</a>. Elastic enables organizations to monitor their metrics in conjunction with all other data gathered through <a href="https://www.elastic.co/integrations/data-integrations">Elastic's extensive integrations</a>.</p>
<p>Go to Integrations and find the Prometheus integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-1-integrations.png" alt="1 - integrations" /></p>
<p>To gather metrics from Prometheus servers, the Elastic Agent is employed, with central management of Elastic agents handled through the <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet server</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-2-set-up-prometheus-integration.png" alt="2 - set up integration" /></p>
<p>After enrolling the Elastic Agent in Fleet, users can choose from the following methods to ingest Prometheus metrics into Elastic.</p>
<h3>1. Prometheus collectors</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-exporters-collectors">The Prometheus collectors</a> connect to the Prometheus server and pull metrics or scrape metrics from a Prometheus exporter.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-3-prometheus-collectors.png" alt="3 - Prometheus collectors" /></p>
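<p>For reference, what the collectors scrape is the standard Prometheus text exposition format. An exporter's /metrics endpoint returns plain text along these lines (the metric shown is a common node_exporter metric, used here purely as an example):</p>
<pre><code># HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 678.9
</code></pre>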
<h3>2. Prometheus queries</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-queries-promql">The Prometheus queries</a> execute specific Prometheus queries against <a href="https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries">Prometheus Query API</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-4-promtheus-queries.png" alt="4 - Prometheus queries" /></p>
<h3>3. Prometheus remote-write</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-server-remote-write">The Prometheus remote_write</a> can receive metrics from a Prometheus server that has configured the <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write">remote_write</a> setting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-5-prometheus-remote-write.png" alt="5 - Prometheus remote-write" /></p>
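<p>On the Prometheus side, this only requires a remote_write block in prometheus.yml pointing at the address the integration listens on (the host below is a placeholder; 9201 is the commonly documented default port, but verify it against your integration settings):</p>
<pre><code class="language-yaml"># prometheus.yml (snippet): forward samples to the Elastic Agent's
# remote_write listener. Host is a placeholder; confirm the port in your
# integration policy.
remote_write:
  - url: "http://&lt;elastic_agent_host&gt;:9201/write"
</code></pre>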
<p>After your Prometheus metrics are ingested, you have the option to visualize your data graphically within the <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and further segment it based on labels, such as hosts, containers, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-10-metrics-explorer.png" alt="10 - metrics explorer" /></p>
<p>You can also query your metrics data in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> and explore the fields of your individual documents within the details panel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-7-expanded-doc.png" alt="7 - expanded document" /></p>
<h2>Storing historical metrics with Elastic’s data tiering mechanism</h2>
<p>By exporting Prometheus metrics to Elasticsearch, organizations can extend the retention period and gain the ability to analyze metrics historically. Elastic optimizes data storage and access based on the frequency of data usage and the performance requirements of different data sets. The goal is to efficiently manage and store data, ensuring that it remains accessible when needed while keeping storage costs in check.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-8-hot-to-frozen.png" alt="8 - hot to frozen flow chart" /></p>
<p>After ingesting Prometheus metrics data, you have various retention options. You can set the duration for data to reside in the hot tier, which utilizes high IO hardware (SSD) and is more expensive. Alternatively, you can move the Prometheus metrics to the warm tier, employing cost-effective hardware like spinning disks (HDD) while maintaining consistent and efficient search performance. The cold tier mirrors the infrastructure of the warm tier for primary data but utilizes S3 for replica storage. Elastic automatically recovers replica indices from S3 in case of node or disk failure, ensuring search performance comparable to the warm tier while reducing disk cost.</p>
<p>The <a href="https://www.elastic.co/blog/introducing-elasticsearch-frozen-tier-searchbox-on-s3">frozen tier</a> allows direct searching of data stored in S3 or an object store, without the need for rehydration. The purpose is to further reduce storage costs for Prometheus metrics data that is less frequently accessed. By moving historical data into the frozen tier, organizations can optimize their storage infrastructure, ensuring that the recent, critical data remains in higher-performance tiers while less frequently accessed data is stored economically in the frozen tier. This way, organizations can perform historical analysis and trend detection, identify patterns and make informed decisions, and maintain compliance with regulatory standards in a cost-effective manner.</p>
<p>An alternative way to store your cloud-native metrics more efficiently is to use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Elastic Time Series Data Stream</a> (TSDS). TSDS can store your metrics data more efficiently with <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">~70% less disk space</a> than a regular data stream. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html">downsampling</a> functionality will further reduce the storage required by rolling up metrics within a fixed time interval into a single summary metric. This not only assists organizations in cutting down on storage expenses for metric data but also simplifies the metric infrastructure, making it easier for users to correlate metrics with logs and traces through a unified interface.</p>
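<p>As a hedged illustration, downsampling can be driven from an index lifecycle policy. A minimal sketch (the policy name, timings, and interval are illustrative, and the downsample action requires a recent Elasticsearch version that supports TSDS downsampling):</p>
<pre><code class="language-json">PUT _ilm/policy/prometheus-metrics-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "downsample": {
            "fixed_interval": "1h"
          }
        }
      }
    }
  }
}
</code></pre>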
<h2>Advanced analytics</h2>
<p>Besides <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>, Elasticsearch® provides more advanced analytics capabilities and empowers organizations to gain deeper, more valuable insights into their Prometheus metrics data.</p>
<p>Out of the box, Prometheus integration provides a default overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-9-advacned-analytics.png" alt="9 - adv analytics" /></p>
<p>From Metrics Explorer or Discover, users can also easily edit their Prometheus metrics visualizations in <a href="https://www.elastic.co/kibana/kibana-lens">Elastic Lens</a> or create new ones.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-6-metrics-explorer.png" alt="6 - metrics explorer" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-11-green-bars.png" alt="11 - green bars" /></p>
<p>Elastic Lens enables users to explore and visualize data intuitively through dynamic visualizations. This user-friendly interface eliminates the need for complex query languages, making data analysis accessible to a broader audience. Elasticsearch also offers other powerful visualization methods with <a href="https://www.elastic.co/guide/en/kibana/current/add-aggregation-based-visualization-panels.html">aggregations</a> and <a href="https://www.youtube.com/watch?v=I8NtctS33F0">filters</a>, enabling users to perform advanced analytics on their Prometheus metrics data, including short-term and historical data. To learn more, check out the <a href="https://www.elastic.co/videos/training-how-to-series-stack">how-to series: Kibana</a>.</p>
<h2>Anomaly detection and forecasting</h2>
<p>When analyzing data, maintaining a constant watch on the screen is simply not feasible, especially when dealing with millions of time series of Prometheus metrics. Engineers frequently encounter the challenge of differentiating normal from abnormal data points, which involves analyzing historical data patterns — a process that can be exceedingly time consuming and often exceeds human capabilities. Thus, there is a pressing need for a more intelligent approach to detect anomalies efficiently.</p>
<p>Setting up alerts may seem like an obvious solution, but relying solely on rule-based alerts with static thresholds can be problematic. What's normal on a Wednesday at 9:00 a.m. might be entirely different from a Sunday at 2:00 a.m. This often leads to complex and hard-to-maintain rules or wide alert ranges that end up missing crucial issues. Moreover, as your business, infrastructure, users, and products evolve, these fixed rules don't keep up, resulting in lots of false positives or, even worse, important issues slipping through the cracks without detection. A more intelligent and adaptable approach is needed to ensure accurate and timely anomaly detection.</p>
<p>Elastic's machine learning anomaly detection excels in such scenarios. It automatically models the normal behavior of your Prometheus data, learning trends and identifying anomalies, thereby reducing false positives and improving mean time to resolution (MTTR). With over 13 years of development experience in this field, Elastic has emerged as a trusted industry leader.</p>
<p>The key advantage of Elastic's machine learning anomaly detection lies in its unsupervised learning approach. By continuously observing real-time data, it acquires an understanding of the data's behavior over time. This includes grasping daily and weekly patterns, enabling it to establish a normalcy range of expected behavior. Behind the scenes, it constructs statistical models that allow accurate predictions, promptly identifying any unexpected variations. In cases where emerging data exhibits unusual trends, you can seamlessly integrate with alerting systems, operationalizing this valuable insight.</p>
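<p>Elastic's actual models are far more sophisticated (they capture trend, daily and weekly seasonality, and more), but the core idea of a learned normalcy range can be illustrated with a toy rolling-baseline sketch. Everything below is illustrative and is not Elastic's implementation:</p>

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Toy anomaly detector: flags values outside mean +/- 3 standard deviations
    of a sliding window. Purely illustrative; Elastic's ML models normal
    behavior with much richer statistics (trend, seasonality, etc.)."""

    def __init__(self, window=60):
        self.history = deque(maxlen=window)

    def observe(self, value):
        is_anomaly = False
        if len(self.history) >= 10:  # wait for some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            is_anomaly = abs(value - mu) > 3 * max(sigma, 1e-9)
        self.history.append(value)
        return is_anomaly

detector = RollingBaseline()
for v in [10, 11, 10, 9, 10, 11, 10, 9, 10, 11]:
    detector.observe(v)            # learn a baseline of 'normal' values
print(detector.observe(10.5))      # within the normalcy range -> False
print(detector.observe(100.0))     # far outside the range -> True
```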
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-12-LPO.png" alt="12 - LPO" /></p>
<p>Machine learning's ability to project into the future, forecasting data trends one day, a week, or even a month ahead, equips engineers not only with reporting capabilities but also with pattern recognition and failure prediction based on historical Prometheus data. This plays a crucial role in maintaining mission-critical workloads, offering organizations a proactive monitoring approach. By foreseeing and addressing issues before they escalate, organizations can avert downtime, cut costs, optimize resource utilization, and ensure uninterrupted availability of their vital applications and services.</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html#ml-ad-create-job">Creating a machine learning job</a> for your Prometheus data is a straightforward task with a few simple steps. Simply specify the data index and set the desired time range in the single metric view. The machine learning job will then automatically process the historical data, building statistical models behind the scenes. These models will enable the system to predict trends and identify anomalies effectively, providing valuable and actionable insights for your monitoring needs.</p>
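<p>If you prefer automation over the UI, an anomaly detection job can also be created through Elasticsearch's machine learning APIs. The sketch below only assembles a hypothetical request body; the metric field name, job ID, and description are placeholders for your own Prometheus data:</p>

```python
import json

# Hypothetical job definition for a Prometheus CPU metric. The field name
# and description are placeholders; substitute your own.
job_body = {
    'description': 'Mean CPU anomaly detection on Prometheus metrics',
    'analysis_config': {
        'bucket_span': '15m',
        'detectors': [
            {'function': 'mean', 'field_name': 'prometheus.metrics.node_cpu_usage'}
        ],
    },
    'data_description': {'time_field': '@timestamp'},
}

# The body would be sent to Elasticsearch as, for example:
#   PUT _ml/anomaly_detectors/prometheus-cpu-job
print(json.dumps(job_body, indent=2))
```

<p>Note that a datafeed pointing at the index holding your Prometheus metrics is also required before the job can run; see the linked documentation for the full workflow.</p>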
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-13-creating-ML-job.png" alt="13 - create ML job" /></p>
<p>In essence, Elastic machine learning empowers us to harness the capabilities of data scientists and effectively apply them in monitoring Prometheus metrics. By seamlessly detecting anomalies and predicting potential issues in advance, Elastic machine learning bridges the gap and enables IT professionals to benefit from the insights derived from advanced data analysis. This practical and accessible approach to anomaly detection equips organizations with a proactive stance toward maintaining the reliability of their systems.</p>
<h2>Try it out</h2>
<p><a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a> on Elastic Cloud and <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">ingest your Prometheus metrics into Elastic</a>. Enhance your Prometheus monitoring with Elastic Observability. Stay ahead of potential issues with advanced AI/ML anomaly detection and prediction capabilities. Eliminate data silos, reduce costs, and enhance overall response efficiency.</p>
<p>Elevate your monitoring capabilities with Elastic today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/illustration-machine-learning-anomaly-v2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Distribution for OpenTelemetry Python]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python</link>
            <guid isPermaLink="false">elastic-opentelemetry-distribution-python</guid>
            <pubDate>Sun, 07 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the first alpha release of the Elastic Distribution for OpenTelemetry Python. See how easy it is to instrument your Python applications with OpenTelemetry in this blog post.]]></description>
            <content:encoded><![CDATA[<p>We are delighted to announce the alpha release of the <a href="https://github.com/elastic/elastic-otel-python#readme">Elastic Distribution for OpenTelemetry Python</a>. This project is a customized OpenTelemetry distribution that allows us to configure better defaults for using OpenTelemetry with the Elastic cloud offering.</p>
<h2>Background</h2>
<p>Elastic is standardizing on OpenTelemetry (OTel) for observability and security data collection. As part of that effort, we are <a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">providing distributions of the OpenTelemetry Language SDKs</a>. We have recently released alpha distributions for <a href="https://github.com/elastic/elastic-otel-java#readme">Java</a>, <a href="https://github.com/elastic/elastic-otel-dotnet#readme">.NET</a> and <a href="https://github.com/elastic/elastic-otel-node#readme">Node.js</a>. Our <a href="https://github.com/elastic/apm-agent-android#readme">Android</a> and <a href="https://github.com/elastic/apm-agent-ios#readme">iOS</a> SDKs have been OpenTelemetry-based from the start. The Elastic Distribution for OpenTelemetry Python is the latest addition.</p>
<h2>Design choices</h2>
<p>We have chosen to provide a lean distribution that does not install all the instrumentations by default, but instead provides tools
to do so. We leverage the <code>opentelemetry-bootstrap</code> tool provided by the OpenTelemetry Python project, which scans the packages installed in your
environment and recognizes the libraries we are able to instrument. The tool can either just report the available instrumentations or
install them for you as well.
This allows you to avoid installing packages you are not going to need, or instrumenting libraries you are not interested in tracing.</p>
<h2>Getting started</h2>
<p>To get started with the Elastic Distribution for OpenTelemetry Python, you need to install the package <code>elastic-opentelemetry</code> in your project
environment. We'll use <code>pip</code> in our examples, but you are free to use any Python package and environment manager of your choice.</p>
<pre><code class="language-bash">pip install elastic-opentelemetry
</code></pre>
<p>Once you have installed our distro, you'll also have the <code>opentelemetry-bootstrap</code> command available. Running it:</p>
<pre><code class="language-bash">opentelemetry-bootstrap
</code></pre>
<p>will list all of the instrumentation packages available for your environment. For example, you can expect something like the following:</p>
<pre><code>opentelemetry-instrumentation-asyncio==0.46b0
opentelemetry-instrumentation-dbapi==0.46b0
opentelemetry-instrumentation-logging==0.46b0
opentelemetry-instrumentation-sqlite3==0.46b0
opentelemetry-instrumentation-threading==0.46b0
opentelemetry-instrumentation-urllib==0.46b0
opentelemetry-instrumentation-wsgi==0.46b0
opentelemetry-instrumentation-grpc==0.46b0
opentelemetry-instrumentation-requests==0.46b0
opentelemetry-instrumentation-system-metrics==0.46b0
opentelemetry-instrumentation-urllib3==0.46b0
</code></pre>
<p>It also provides a command option to install the packages automatically:</p>
<pre><code class="language-bash">opentelemetry-bootstrap --action=install
</code></pre>
<p>We advise running this command every time you release a new version of your application so that you can install, or simply review, the
instrumentation packages your code needs.</p>
<p>A few environment variables are needed to configure instrumentation for your services. These mostly
concern the destination of your traces, but they also identify your service.
A <em>service name</em> is required to distinguish your service from the others. You also need to provide
the <em>authorization</em> headers for authenticating with Elastic Observability cloud and the Elastic cloud endpoint where the data is sent.</p>
<p>The API key you get from your Elastic cloud serverless project must be <em>URL-encoded</em>. You can do that with the following Python snippet:</p>
<pre><code class="language-python">from urllib.parse import quote
print(quote(&quot;ApiKey &lt;your api key&gt;&quot;))
</code></pre>
<p>Once you have all your configuration values you can export via environment variables as below:</p>
<pre><code class="language-bash">export OTEL_RESOURCE_ATTRIBUTES=service.name=&lt;service-name&gt;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=&lt;url encoded apikey header value&gt;&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT=&lt;your elastic cloud url&gt;
</code></pre>
<p>We are done with the configuration, and the last piece of the puzzle is wrapping your service invocation with
<code>opentelemetry-instrument</code>, the wrapper that provides <em>zero-code instrumentation</em>. <em>Zero-code</em> (or automatic) instrumentation means
that the distribution will set up the OpenTelemetry SDK and enable all of the previously installed instrumentations for you.
Unfortunately, <em>zero-code</em> instrumentation does not cover all libraries, and some (web frameworks in particular) will require minimal manual
configuration.</p>
<p>For a web service running with gunicorn it may look like:</p>
<pre><code class="language-bash">opentelemetry-instrument gunicorn main:app
</code></pre>
<p>The result is an observable application using the industry-standard <a href="https://opentelemetry.io/">OpenTelemetry</a> — offering high-quality instrumentation of many popular Python libraries, a portable API to avoid vendor lock-in and an active community.</p>
<p>Using Elastic Observability, some of the out-of-the-box benefits you can expect are rich trace viewing, service maps, integrated metric and log analysis, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-python/traces-original.png" alt="trace sample screenshot" /></p>
<h2>What's next?</h2>
<p>Elastic is committed to helping OpenTelemetry succeed and to helping our customers use OpenTelemetry effectively in their systems. Last year, we <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">donated ECS</a> and continue to work on integrating it with OpenTelemetry Semantic Conventions. More recently, we are working on <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">donating our eBPF-based profiler</a> to OpenTelemetry. We contribute to many of the language SDKs and other OpenTelemetry projects.</p>
<p>In the Python ecosystem we are active reviewers and contributors of both the <a href="https://github.com/open-telemetry/opentelemetry-python/">opentelemetry-python</a> and <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/">opentelemetry-python-contrib</a> repositories.</p>
<p>The Elastic Distribution for OpenTelemetry Python is currently an alpha. Please <a href="https://github.com/elastic/elastic-otel-python/">try it out</a> and let us know if it might work for you. Watch for the <a href="https://github.com/elastic/elastic-otel-python/releases">latest releases here</a>. You can engage with us on <a href="https://github.com/elastic/elastic-otel-python/issues">the project issue tracker</a>.</p>
<p>We are eager to know your use cases to help you succeed in your Observability journey.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<h2>Resources</h2>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions">https://www.elastic.co/blog/elastic-opentelemetry-sdk-distributions</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications">https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-dotnet-applications</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js">https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-node-js</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry">https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry">https://www.elastic.co/observability-labs/blog/auto-instrumentation-python-applications-opentelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-observability">https://www.elastic.co/observability-labs/blog/opentelemetry-observability</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-distribution-python/python.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Introducing the OTTL Playground for OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/introducing-the-ottl-playground-for-opentelemetry</link>
            <guid isPermaLink="false">introducing-the-ottl-playground-for-opentelemetry</guid>
            <pubDate>Thu, 13 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic is proud to introduce the OTTL Playground (https://ottl.run), a powerful and user-friendly tool designed to allow users to experiment with OpenTelemetry Transformation Language (OTTL) effortlessly. The playground provides a rich interface for users to create, modify, and test statements in real-time, making it easier to understand how different configurations impact the OpenTelemetry data transformation.]]></description>
            <content:encoded><![CDATA[<h2>OTTL Playground</h2>
<p>As the demand for observability and monitoring solutions grows, OpenTelemetry
has emerged as a key framework for collecting, processing, and exporting
telemetry data. Within this ecosystem, the OpenTelemetry Transformation Language
(<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl#opentelemetry-transformation-language">OTTL</a>)
is a powerful way to customize telemetry data transformation, but it can be daunting for
both new and experienced users alike.</p>
<p>To help address these challenges, we are thrilled to introduce the <strong>OTTL Playground</strong> (<a href="https://ottl.run">https://ottl.run</a>),
a powerful and user-friendly tool designed to allow users to experiment with OTTL
effortlessly. The playground provides a rich interface for users to create,
modify, and test statements in real-time, making it easier to understand how
different configurations impact the OpenTelemetry data transformation. Users can
instantly validate OTTL transformations, from input to output, along with diffs.
This allows new users to explore the nuances of OTTL without the risk of
disrupting production environments.</p>
<h2>How does it work?</h2>
<p>The OTTL Playground allows you to run OTTL statements using different processors and versions.
Currently, it supports the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a>
and the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor">filter processor</a>,
with additional evaluators potentially being added in the future.</p>
<p>To start exploring, simply visit <a href="https://ottl.run">https://ottl.run</a>.
Once the processor configuration and OTLP payload are filled in, click the “Run”
button in the top-right corner of the screen, and the result will instantly
appear in the result panel on the right.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-result-example.png" alt="OTTL Playground result example" /></p>
<p>The above example uses the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a>
to rename a trace resource attribute, and its effect on the data can be easily
spotted on the diff-based result panel.</p>
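<p>For reference, a transform processor configuration that renames a resource attribute might look like the following (the attribute names here are illustrative, not from the screenshot above):</p>

```yaml
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # copy the value to the new key, then drop the old one
          - set(attributes["service.name"], attributes["app.name"])
          - delete_key(attributes, "app.name")
```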
<p>Examples of processor configurations and OTLP payloads can be loaded by
selecting an option from the dropdown menu in the top-right corner of the
Configuration and OTLP Payload panels.</p>
<h2>Different result flavors</h2>
<p>The primary goal of the OTTL Playground is to help users understand how
processors and their OTTL statements impact telemetry data. The Playground
offers several types of results to aid in this understanding, including visual
results, JSON results, and execution debug logs.</p>
<ul>
<li><strong>Visual delta</strong>: Shows a diff-based comparison, providing an intuitive and
immediate understanding of how the data is transformed. This makes it easier
for users to grasp complex changes without delving into raw data.</li>
<li><strong>Annotated delta</strong>: This visualization is similar to the visual delta, but
instead of a graphical representation, it shows the JSON diff values, providing
a detailed step-by-step explanation of the changes made to the original data.</li>
<li><strong>JSON</strong>: Offers a detailed view of the data, allowing users to see the exact
output of their OTTL statements. This is particularly useful for debugging and
verifying precise data transformations.</li>
<li><strong>Execution logs</strong>: The OTTL and processors have very detailed debug logs, which
provide a step-by-step account of the processing. This is invaluable for
troubleshooting and understanding the sequence of transformations applied to
the data.</li>
</ul>
<p>Together, these features empower users to experiment confidently with OTTL
statements, ensuring they can fine-tune their telemetry data transformations
with clarity and precision.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-different-result-types.gif" alt="OTTL Playground different result types" /></p>
<h2>Sharing</h2>
<p>The OTTL Playground supports sharing configurations easily. By clicking on the
“Copy Link” button located in the top right corner of the interface, users can
generate a unique URL that encapsulates their current playground state. This
link can then be shared with colleagues or community members, allowing others to
quickly load and review the exact setup. This feature facilitates collaboration
and troubleshooting by enabling seamless sharing of specific OTTL configurations
and results.</p>
<p>Given that shareable links are public and might carry data in them, we advise
you to refrain from submitting any confidential information.</p>
<h2>Playground architecture</h2>
<p>The OTTL Playground is a static website that operates entirely within the
client's browser. It leverages <a href="https://webassembly.org/">WebAssembly</a>, compiled
from the actual OpenTelemetry Collector code, to deliver results identical to
those of a real collector distribution, all while maintaining near-native
performance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/playground-architecture.png" alt="OTTL Playground architecture" /></p>
<p>The user interface of the OTTL Playground uses WebComponents, which offer
several benefits: this modular approach makes the UI more maintainable and
scalable, enables reuse, and allows developers to embed the
Playground elements in different projects.</p>
<h2>The road so far</h2>
<p>As we wrap up, it's important to note that the OTTL Playground is still in its
beta phase. This means we're actively working on refining its features and
improving its performance based on user feedback.</p>
<p>We're incredibly excited about the potential this tool holds for simplifying the
OTTL usage and enhancing collaboration within the community. We invite you to
explore the OTTL Playground, share your experiences, and help us shape it into a
more efficient and user-friendly tool. Stay tuned for more updates and new
features in the coming months!</p>
<p>This project is being developed in collaboration with the OTel community, and you can track
its progress and contribute through this <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/33747">GitHub issue</a>.
The source code is available in this <a href="https://github.com/elastic/ottl-playground">repository</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/introducing-the-ottl-playground-for-opentelemetry/ottl-playground.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gaining new perspectives beyond logging: An introduction to application performance monitoring]]></title>
            <link>https://www.elastic.co/observability-labs/blog/introduction-apm-tracing-logging</link>
            <guid isPermaLink="false">introduction-apm-tracing-logging</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Change is on the horizon for the world of logging. In this post, we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.]]></description>
            <content:encoded><![CDATA[<h2>Prioritize customer experience with APM and tracing</h2>
<p>Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of these tools that can make our lives easier. One such tool that is currently underutilized is <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">application performance monitoring</a> (APM) and tracing, despite the fact that OpenTelemetry has made it possible to adopt at low friction.</p>
<p>Logging, however, is ubiquitous. Every software application has logs of some kind, and the default workflow for troubleshooting (even today) is to go from exceptions experienced by customers and systems to the logs and start from there to find a solution.</p>
<p>There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or log file at all or the problem is that the system is very slow? Logging alone cannot help solve these problems. This leaves users with half broken systems and poor user experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.</p>
<p>The question I find myself asking is why does the customer experience often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make this as important as logging. Users should stop going to logs by default and thinking primarily in logs, as many are doing today. This will also come with some required changes to mental models.</p>
<p>What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.</p>
<h2>Cultivating a new monitoring mindset: How to drive APM and tracing adoption</h2>
<p>To get teams to shift their troubleshooting mindset, what organizational changes need to be made?</p>
<p>Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to Observability or a CoE (Center of Excellence) with its own roadmap and priorities.</p>
<p>This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes.</p>
<p>From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.</p>
<p>In the next few sections, we will go through a four step journey to help you maximize your success with APM and tracing. This journey will take you through the following key steps on your journey to successful APM adoption:</p>
<ol>
<li><strong>Ingest:</strong> What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?</li>
<li><strong>Integrate:</strong> How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?</li>
<li><strong>Analytics and AIOPs:</strong> Improve the customer experience and reduce the noise through machine learning.</li>
<li><strong>Scale and total cost of ownership:</strong> Roll out enterprise-wide tracing and adopt strategies to deal with data volume.</li>
</ol>
<h2>1. Ingest</h2>
<p>Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.</p>
<h3>Getting started with instrumentation</h3>
<p>What options do we have for ingesting APM and trace data? There are many options, and we will discuss them to help guide you, but first let's take a step back. APM has a deep history: in the very first implementations of APM, people were concerned mainly with timing methods, like the one below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-timing-methods.png" alt="timing methods" /></p>
<p>Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.</p>
<p>From here things started to evolve, and one of the first additions to APM was to add in tracing.</p>
<p>For Java, it’s fairly trivial to implement a system to do this by using what's known as a Java agent. You just specify the <code>-javaagent</code> command line argument, and the agent code gets access to the dynamic compilation routines within Java so it can modify the code before it is compiled into machine code, allowing you to “wrap” specific methods with timing or tracing routines. So, auto-instrumenting Java was one of the first things that the original APM vendors did.</p>
<p><a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry has agents like this</a>, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling.</p>
<p>Things have moved on since then, and Node.js and Python are now popular.</p>
<p>As a result, ways of auto-instrumenting these language runtimes have appeared, which mostly work by injecting the libraries into the code before starting them up. OpenTelemetry has a way of doing this on Kubernetes with an Operator and sidecar <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">here</a>, which supports Python, Node.js, Java, and .NET.</p>
<p>The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this.</p>
<p>You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a>.</p>
<pre><code class="language-python">from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: &quot;your-service-name&quot;
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers=&quot;Authorization=Bearer%20&quot;+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()

# Initialize Flask app and instrument it
app = Flask(__name__)

@app.route(&quot;/completion&quot;)
@tracer.start_as_current_span(&quot;do_work&quot;)
def completion():
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, 1)
        return &quot;ok&quot;  # placeholder response; see the full example linked above
</code></pre>
<p>By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.</p>
<h3>Sampling</h3>
<p>Many people don’t realize that APM is an expensive process. It adds significant CPU and memory overhead to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.</p>
<p>Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.</p>
<p><strong>Head-based sampling</strong><br />
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.</p>
<p>With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.</p>
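<p>To make the ratio concrete, here is a minimal, stdlib-only sketch of how a ratio-based head sampler can decide deterministically from the trace ID. This only illustrates the idea (real OpenTelemetry SDKs ship their own ratio samplers), but it shows why every service in a trace reaches the same decision:</p>

```python
# Illustrative sketch of deterministic head-based (ratio) sampling.
# Compare the low 64 bits of the trace ID against a fixed bound, so
# every service computes the same decision for the same trace.

TRACE_ID_MASK = (1 << 64) - 1

def should_sample(trace_id: int, ratio: float) -> bool:
    """Sample the trace iff its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound

# A 0.5 ratio keeps roughly half of all traces, decided up front:
decisions = [should_sample(t, 0.5) for t in (0x1, 0x8000000000000001, 0xFFFF)]
```

<p>Because the decision is a pure function of the trace ID, it propagates consistently across services without any coordination.</p>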
<p><strong>Tail-based sampling</strong><br />
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold.</p>
<p>We recommend tail-based sampling because it has the highest likelihood of reducing noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that every trace must be generated and forwarded before the decision can be made, so more data flows out of your APM agents, which can use more CPU and memory on your application.</p>
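<p>The decision logic can be sketched as follows. This is a simplified illustration (real tail-based samplers, such as the OpenTelemetry Collector's <code>tail_sampling</code> processor, buffer spans per trace and support many policies), keeping only completed traces that contain errors or exceed a latency threshold:</p>

```python
# Simplified sketch of a tail-based sampling decision: all spans of a
# trace are buffered, and the keep/drop choice is made only once the
# trace is complete. The span dicts here are hypothetical.

def keep_trace(spans: list[dict], latency_threshold_ms: float = 500.0) -> bool:
    """Keep a completed trace if any span errored or total latency is high."""
    has_error = any(s.get("status") == "ERROR" for s in spans)
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    return has_error or (end - start) > latency_threshold_ms

fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]    # dropped
slow = [{"status": "OK", "start_ms": 0, "end_ms": 900}]       # kept: slow
errored = [{"status": "ERROR", "start_ms": 0, "end_ms": 50}]  # kept: error
```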
<h3>OpenTelemetry Semantic Conventions and Elastic Common Schema</h3>
<p>OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.</p>
<p>Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.</p>
<p>Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances.</p>
<p>Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security.</p>
<p>Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.</p>
<p>We advocate adhering to established schemas such as ECS when defining trace, metric, and log data in your applications, particularly when developing new code. This practice will save time and effort when addressing issues.</p>
<h2>2. Integrate</h2>
<p>Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for <a href="https://www.elastic.co/observability/synthetic-monitoring">synthetic monitoring</a> and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.</p>
<h3>APM + logs = superpowers!</h3>
<p>Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. <a href="https://www.elastic.co/guide/en/observability/master/logs-send-application.html">This is normally simple to enable</a>.</p>
<p>With this enabled, you will also get automated injection of useful fields like these:</p>
<ul>
<li>service.name, service.version, service.environment</li>
<li>trace.id, transaction.id, error.id</li>
</ul>
<p>This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-latency-distribution.png" alt="latency distribution" /></p>
<p>If this is available to you, we highly recommend turning it on.</p>
<h3>Deploying APM inside Kubernetes</h3>
<p>It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this.</p>
<p><strong>1. Auto instrumentation using sidecars</strong><br />
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications.</p>
<p>The init container is used simply to copy the required library or JAR file into the main application container at startup. Then, you can use <a href="https://kustomize.io/">Kustomize</a> to add the required command line arguments to bootstrap your agents.</p>
<p>If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even available as a flag to the Kubernetes CLI — simply execute <code>kubectl apply -k</code>.</p>
<p>OpenTelemetry has an <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">operator</a> that does all this for you automatically (without the need for Kustomize) for Java, .NET, Python, and Node.js, and many vendors also have their own operator or <a href="https://www.elastic.co/guide/en/apm/attacher/current/apm-attacher.html">helm charts</a> that can achieve the same result.</p>
<p><strong>2. Baking APM into containers or code</strong><br />
A second option for deploying APM in Kubernetes — and indeed any containerized environment — is using Docker to bake the APM agents and configuration into a Dockerfile.</p>
<p>Have a look at an example here using the OpenTelemetry Java Agent:</p>
<pre><code class="language-dockerfile"># Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar
</code></pre>
<p><strong>3. Tracing using a service mesh (Envoy/Istio)</strong><br />
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.</p>
<p>The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply turn on the OpenTelemetry collector that exists within the proxy — therefore this is likely the lowest overhead solution. <a href="https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/opentelemetry">Learn more about this option</a>.</p>
<h3>Synthetics and universal profiling</h3>
<p>Most APM vendors have add-ons to the primary APM use cases. Typically, we see synthetics and <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> being added to APM solutions. APM can integrate with both, and there is real value in bringing these technologies together to gain even more insight into issues.</p>
<p><strong>Synthetics</strong><br />
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.</p>
<p>This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases.</p>
<p>Synthetics can be integrated with APM by injecting an APM agent into the website when the script runs, so even if you didn’t build end user monitoring into your website initially, it can be injected at run time. This usually happens without any input from the user. From there, a trace ID for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack, such as the database.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-rainbow-sandals.png" alt="observability rainbow sandals" /></p>
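<p>The trace ID hand-off described above typically travels in a standard header. As a sketch of the W3C Trace Context format (the IDs generated here are random placeholders), a <code>traceparent</code> header can be built and parsed like this:</p>

```python
import secrets

# Build a W3C Trace Context "traceparent" header: version, 16-byte
# trace ID, 8-byte parent span ID, and trace flags (01 = sampled).
def make_traceparent(trace_id: bytes, span_id: bytes, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id.hex()}-{span_id.hex()}-{flags}"

header = make_traceparent(secrets.token_bytes(16), secrets.token_bytes(8))

# A downstream layer splits the same header to continue the trace:
version, trace_id_hex, span_id_hex, flags = header.split("-")
```

<p>Each layer that receives the header keeps the trace ID and substitutes its own span ID, which is what lets a request be followed end to end.</p>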
<p><strong>Universal profiling</strong><br />
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. <a href="https://www.elastic.co/observability/universal-profiling">“Continuous profiling”</a> is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.</p>
<p>Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF</a> can allow you to see <em>all</em> the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”</p>
<p>Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together.</p>
<p>Typically, profiles are viewed as “flame graphs,” like the one shown below. Each box represents the amount of “on-CPU” time spent executing a particular function.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-universal-profiling.png" alt="observability universal profiling" /></p>
<h2>3. Analytics and AIOps</h2>
<p>The interesting thing about APM is that it opens up a whole new world of analytics compared to logs alone. All of a sudden, you have access to the information flows from <em>inside</em> applications.</p>
<p>This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or look at failed trades in a brokerage app to see how much revenue those failures are costing. You can even then apply machine learning algorithms to project future spend or detect anomalies in this data, giving you a new window into how your business runs.</p>
<p>In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss getting SLIs and SLOs set up for APM data.</p>
<h3>Getting business data into your traces</h3>
<p>There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a> and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports <a href="https://opentelemetry.io/docs/instrumentation/java/extensions/">adding extensions in its auto-instrumentation agents</a>. Most other APM vendors usually have something similar.</p>
<pre><code class="language-python">def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, counters['completion_count'])
            span.set_attribute(&quot;token_count&quot;, token_count)
            span.set_attribute(&quot;prompt_tokens&quot;, prompt_tokens)
            span.set_attribute(&quot;completion_tokens&quot;, completion_tokens)
            span.set_attribute(&quot;model&quot;, response.model)
            span.set_attribute(&quot;cost&quot;, cost)
            span.set_attribute(&quot;response&quot;, strResponse)
        return response
    return wrapper
</code></pre>
<h3>Using business data for fun and profit</h3>
<p>Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking transactions — average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-customer-count.png" alt="customer count" /></p>
<p>A lot of this is powered by machine learning, which can classify transactions or do <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">anomaly detection</a>. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-fraud-12h.png" alt="fraud 12-h" /></p>
<h3>SLIs and SLOs</h3>
<p>Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations.</p>
<p>By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.</p>
<p>In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.</p>
<ul>
<li>Define SLOs and SLIs required to track them.</li>
<li>Define SLO budgets and how they are calculated. Reflect the business’s perspective and set realistic targets.</li>
<li>Define SLIs to be measured from a user experience perspective.</li>
<li>Define different alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.</li>
</ul>
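<p>The budget arithmetic behind these steps is straightforward. As an illustrative sketch (the numbers and helper names here are made up), a 99.95% availability SLO leaves an error budget of 0.05%, and the burn can be computed directly from the SLI:</p>

```python
# Sketch of SLI / SLO error-budget arithmetic with illustrative numbers.

def error_budget(slo_target: float) -> float:
    """Allowed failure fraction for a given SLO target, e.g. 0.999 -> 0.001."""
    return 1.0 - slo_target

def sli_availability(good: int, total: int) -> float:
    """Availability SLI: fraction of good events over all events."""
    return good / total

def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget already burned (can exceed 1.0)."""
    return (1.0 - sli) / error_budget(slo_target)

# 999,000 good requests out of 1,000,000 against a 99.95% target:
sli = sli_availability(999_000, 1_000_000)
burned = budget_consumed(sli, 0.9995)  # twice the allowed budget
```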
<p>Synthetic monitoring and end user monitoring (EUM) can also help gather the additional data required to understand latency, throughput, and error rate from the user’s perspective, which is where good, business-focused metrics matter most.</p>
<h2>4. Scale and total cost of ownership</h2>
<p>With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges because you can decompose unstructured logs and combine them with traces, which leads to additional efficiency. You can also use different sampling methods to deal with scale challenges (i.e., both techniques we previously mentioned).</p>
<p>In addition to this, at large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if the systems consuming the data go down or face outages, you are less likely to lose data.</p>
<p>With this configuration in place, your “Observability pipeline” architecture would look like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-opentelemetry-collector.png" alt="opentelemetry collector" /></p>
<p>This completely decouples your sources of data from your chosen observability solution, which will future-proof your observability stack, enable you to reach massive scale, and make you less reliant on specific vendor code for data collection.</p>
<p>Another thing we recommend doing is being intelligent about instrumentation. This will serve two benefits: you will get some CPU cycles back in the instrumented application, and your backend data collection systems will have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude those classes and methods from instrumentation.</p>
<p>And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage.</p>
<p>This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.</p>
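<p>A back-of-the-envelope calculation shows why tiering pays off; the per-GB prices below are made up purely for illustration:</p>

```python
# Illustrative only: compare monthly storage cost with and without
# tiering, using hypothetical per-GB prices.

HOT_PRICE_GB = 0.10   # hypothetical $/GB/month for fast storage
COLD_PRICE_GB = 0.02  # hypothetical $/GB/month for cheap storage

def monthly_cost(total_gb: float, cold_fraction: float) -> float:
    """Cost when cold_fraction of the data sits in the cheap tier."""
    hot_gb = total_gb * (1.0 - cold_fraction)
    cold_gb = total_gb * cold_fraction
    return hot_gb * HOT_PRICE_GB + cold_gb * COLD_PRICE_GB

single_tier = monthly_cost(10_000, 0.0)  # everything on fast storage
tiered = monthly_cost(10_000, 0.8)       # 80% moved to the cheap tier
```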
<h2>Conclusion</h2>
<p>In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.</p>
<p>In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.</p>
<p>By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.</p>
<p><a href="https://www.elastic.co/observability/application-performance-monitoring">Learn more about APM at Elastic</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/log-management-720x420_(2).jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available in the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic, we heavily invest in making OpenTelemetry the best, standardized ingest solution for Observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk you through a hands-on journey using the EDOT Collector covering various use cases you might encounter in the real world, highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here since, by the nature of this feature, it is minimal,
letting workloads define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT’s Collector configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e. set to <code>false</code>) to avoid log duplication.</p>
<p>Make sure that the receiver creator is properly added in the pipelines for
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a>
(in addition to removing the <code>filelog</code> receiver completely)
and <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a>
respectively.</p>
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector will automatically discover it and identify the specific annotations.
After this, two different receivers will be started, each responsible for one of the target containers.</p>
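<p>To see how the per-port scoping resolves, here is a hypothetical sketch (not the Collector's actual code) of grouping such annotations by signal and scope:</p>

```python
# Hypothetical sketch: group discovery annotations of the form
# "io.opentelemetry.discovery.<signal>.<scope>/<suffix>" so that each
# (signal, scope) pair gets its own configuration bundle. The real
# logic lives in the OpenTelemetry Collector's receiver creator.

PREFIX = "io.opentelemetry.discovery."

def group_hints(annotations: dict[str, str]) -> dict[tuple[str, str], dict[str, str]]:
    hints: dict[tuple[str, str], dict[str, str]] = {}
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue
        rest = key[len(PREFIX):]               # e.g. "metrics.6379/enabled"
        scope_part, _, suffix = rest.partition("/")
        signal, _, scope = scope_part.partition(".")
        hints.setdefault((signal, scope), {})[suffix] = value
    return hints

annotations = {
    "io.opentelemetry.discovery.metrics.6379/enabled": "true",
    "io.opentelemetry.discovery.metrics.6379/scraper": "redis",
    "io.opentelemetry.discovery.metrics.80/enabled": "true",
    "io.opentelemetry.discovery.metrics.80/scraper": "nginx",
}
hints = group_hints(annotations)  # one bundle per container port
```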
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration.
This is handy when we know how to parse specific technology logs, such as Apache server access logs.</p>
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations to the port and container levels to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
<h3>Explore and Analyze Data from Dynamic Targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can explore this data in Elastic. In Discover, we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is how it looks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
This makes us happy and confident, as it closes the feature gap between Elastic's own
monitoring agents and the OpenTelemetry Collector, making the Collector even better.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Troubleshooting Kafka-Logstash-Elasticsearch Performance Issues in delay-sensitive platforms]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kafka-logstash-elasticsearch-performance-issues</link>
            <guid isPermaLink="false">kafka-logstash-elasticsearch-performance-issues</guid>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to troubleshoot ingestion bottlenecks in data pipelines built with Kafka, Logstash and Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>Kafka is an open-source, distributed event streaming and queuing platform widely used with Elastic to build high-throughput, large-scale data pipelines, facilitate seamless data integration, and support mission-critical applications. Designs that include Kafka decouple the components of the data pipeline, ensuring scalability and robustness to failure by absorbing downstream back-pressure during traffic surges, maintenance activities, or other periods of performance degradation. </p>
<p>In addition to its queuing capabilities, Kafka can serve as a central processing middleware for data pre-processing and enrichment. This is particularly useful when such operations are impractical to perform directly downstream due to specific business or technical requirements or constraints.</p>
<p>For instance, integrating Kafka with stream processing engines like <a href="https://ksqldb.io/">KsqlDB</a> or <a href="https://materialize.com/">Materialize</a>, allows for advanced stream processing tasks, including SQL-based joins across topics and streams to enrich data at scale in real-time. The enriched datasets can then be ingested into Elasticsearch for further processing at subsequent stages.</p>
<p>Despite these benefits, adopting Kafka or similar queuing systems is arguably conditional. These systems introduce additional costs and complexity to the overall platform implementation and maintenance. They may also add processing overhead, delay data flow to the downstream, and risk becoming bottlenecks if not correctly sized or optimized to align with other pipeline components.</p>
<p>This article provides guidance for troubleshooting ingestion bottlenecks in data pipelines built with Kafka and Elastic. Identifying and fixing such issues can sometimes be challenging, particularly when changes are made across multiple parts of the system at the same time, which increases the number of variables in play and commonly leads to a longer process and inconsistent results.</p>
<p>Consider the below Security Operations Center (SOC) platform, where data is ingested from various sources via Elastic Agent. The data is queued and pre-processed in a Kafka cluster before being pulled by Logstash and forwarded to Elastic Security. In this environment, delays at any stage of the pipeline can result in critical security events going undetected by Elastic Security, emphasizing the importance of a well-optimized data pipeline.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image5.png" alt="" /></p>
<h2>Implement lag and throughput monitoring</h2>
<p>Ingestion bottlenecks usually materialize as limited throughput and event lags, which often correlate. Monitoring these two indicators is important to measure the impact of tuning attempts. </p>
<p><strong><em>Tip</em></strong><em>: With the anomaly detection features of machine learning you can use the</em> <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html"><em>Logs Anomalies page</em></a> <em>to detect and inspect log anomalies and the log partitions where the log anomalies occur.</em></p>
<p>End-to-end lag monitoring can be broken down into the various stages of the pipeline. The incremental improvements across those stages would collectively contribute to a significant reduction in the end-to-end lag:</p>
<p><strong>A) Ingest lag between the source and Kafka:</strong> This lag is the time difference between the real event-time, which is typically extracted from the event itself or added by the event producer (Elastic Agent for example), and the Kafka record timestamp, which can be added to the Logstash events via event <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-decorate_events">decoration</a> in the Kafka input plugin. </p>
<p>In most cases, this lag is influenced by the write performance of the Kafka cluster and network latency between the event source and Kafka. In some cases, the lag may also appear due to time configuration mismatches that make it look like there's a lag when there really isn't.</p>
<p><strong>B) Ingest lag between Kafka and Logstash:</strong> This lag is the time difference between the Kafka record's timestamp and the execution timestamp of the first filter in the Logstash pipeline. If your pipelines are using a persistent queue, note that this duration also includes the time spent in the PQ.</p>
<p>The Ruby filter below adds the current time to the event in the `logstash.start` field, to be used for comparison later.</p>
<pre><code>ruby {
 code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
}
</code></pre>
<p>The primary factors contributing to this ingestion lag include the consumption performance of the Kafka cluster, the Logstash input performance, data skew across the topic partitions, and, most importantly, backpressure propagating to the Logstash input plugin: while Logstash is busy processing the events it has already fetched, it does not fetch new events from the Kafka topic as quickly as they become available. </p>
<p>Network latency and reduced size of TCP read buffer (<a href="https://man7.org/linux/man-pages/man7/tcp.7.html">SO_RCVBUF</a>) on the Logstash host can also throttle Logstash from fetching the data from Kafka at the required rate.</p>
<p>Consumer lag serves as an effective indicator of this issue and can be viewed on Kafka's consumer group metrics. It is calculated as the difference between the log-end offset (the offset of the most recently produced message) and the current offset (the last committed offset by the consumer) for each partition.</p>
<pre><code>$KAFKA_HOME/bin/kafka-consumer-groups.sh  --bootstrap-server &lt;server:port&gt; --describe --group &lt;group_id&gt;
</code></pre>
<pre><code>GROUP                 TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET   LAG
logstash-cg-soc-1     windows-events    0          4498            17309            12811
logstash-cg-soc-1     windows-events    1          4470            17213            12743
...
</code></pre>
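<p>The lag formula above is a simple per-partition subtraction. A minimal sketch, using the illustrative offsets from the output above:</p>

```python
# Sketch: consumer lag per partition, computed exactly as in the
# kafka-consumer-groups.sh output above (illustrative values).
partitions = [
    {'partition': 0, 'current_offset': 4498, 'log_end_offset': 17309},
    {'partition': 1, 'current_offset': 4470, 'log_end_offset': 17213},
]

for p in partitions:
    # LAG = LOG-END-OFFSET - CURRENT-OFFSET
    p['lag'] = p['log_end_offset'] - p['current_offset']

# Summing across partitions gives the consumer group's total backlog.
total_lag = sum(p['lag'] for p in partitions)
print(total_lag)  # 25554
```

<p>Tracking this total over time distinguishes a transient backlog (lag shrinks once a surge passes) from a structural bottleneck (lag grows monotonically).</p>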
<p><strong>C) Ingest lag in the Logstash processing:</strong> This lag is the time difference between the first and last Logstash filters. To calculate it, add a filter at the end of the pipeline that records the `logstash.end` timestamp in the same way the `logstash.start` field was added before. The primary factors contributing to this lag are the efficiency of the filters (primarily the complexity and optimization of the transformations they perform), network access to external services for data loading, a limited <a href="https://www.elastic.co/guide/en/logstash/current/logstash-settings-file.html">number of pipeline workers and a small batch size</a>, and the amount of resources available to Logstash, particularly when running in virtual environments with resource contention.</p>
<p><strong>D) Ingest lag between Logstash and Elasticsearch:</strong> This lag is the time difference between the last applied Logstash filter in the pipeline and the timestamp when the event is ingested in Elasticsearch. The ECS field <a href="https://www.elastic.co/guide/en/ecs/current/ecs-event.html#field-event-ingested"><code>event.ingested</code></a> is automatically added by the Elastic integrations to record this value. For custom sources, the field should be added via an ingest pipeline:</p>
<pre><code>{
    &quot;processors&quot;: [
      {
        &quot;set&quot;: {
          &quot;field&quot;: &quot;event.ingested&quot;,
          &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;
        }
      }
…
</code></pre>
<p>If the data is undergoing heavy processing in Elasticsearch before indexing, it also pays to analyze the performance of each ingest processor in the pipeline to pinpoint and optimize the heaviest ones. <a href="https://github.com/elastic/integrations/pull/4597">Ingest pipelines monitoring dashboard</a> can help streamline this process.</p>
<p>The primary factors contributing to this phase’s lag are usually the Logstash output configuration like a small number of pipeline workers and batch size, slow indexing actions (like upserts), network latency, and how fast the Elasticsearch cluster can run the ingest pipelines and index the data. You can find more techniques about this last point <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed">here</a>.</p>
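<p>Taken together, the four stage lags (A through D) partition the end-to-end lag, so the per-stage numbers must sum to the total. A minimal sketch with hypothetical timestamps:</p>

```python
from datetime import datetime, timezone

# Hypothetical timestamps recorded at each stage of the pipeline (A-D above).
ts = {
    'event_time':      datetime(2026, 3, 11, 12, 0, 0, tzinfo=timezone.utc),
    'kafka_timestamp': datetime(2026, 3, 11, 12, 0, 2, tzinfo=timezone.utc),
    'logstash_start':  datetime(2026, 3, 11, 12, 0, 7, tzinfo=timezone.utc),
    'logstash_end':    datetime(2026, 3, 11, 12, 0, 8, tzinfo=timezone.utc),
    'event_ingested':  datetime(2026, 3, 11, 12, 0, 9, tzinfo=timezone.utc),
}

def millis(a, b):
    # Milliseconds elapsed between two stage timestamps.
    return int((ts[b] - ts[a]).total_seconds() * 1000)

lag = {
    'src_kfk':   millis('event_time', 'kafka_timestamp'),      # stage A
    'kfk_ls':    millis('kafka_timestamp', 'logstash_start'),  # stage B
    'within_ls': millis('logstash_start', 'logstash_end'),     # stage C
    'ls_es':     millis('logstash_end', 'event_ingested'),     # stage D
}
lag['end_end'] = millis('event_time', 'event_ingested')

# The stage lags sum to the end-to-end lag, so the largest term
# points directly at the most throttled phase.
assert sum(v for k, v in lag.items() if k != 'end_end') == lag['end_end']
print(lag['end_end'])  # 9000
```

<p>In this hypothetical breakdown, stage B dominates, which would point the investigation at Kafka consumption and input backpressure rather than at the filters or the Elasticsearch output.</p>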
<p>Visualizing these stages in Kibana helps identify the most throttled areas and analyze the impact of various parameter adjustments across the entire data pipeline during the tuning process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image4.png" alt="" /></p>
<h2>Isolate and fix the bottleneck</h2>
<p>Identifying the source of the bottleneck can be challenging without a systematic approach to isolating the behavior of each component and stage of the pipeline. To make the investigation approach more consistent, it is important to keep the source data consistent as well. One approach can be to use a dedicated topic with a replicated production workload, and repeat the test using different consumer groups.</p>
<p>Below is a set of benchmarks that can be run while monitoring the event lag and the pipeline throughput. The best results achieved in each tuning exercise can be used as the basis for the next one.</p>
<h2>First benchmark: Kafka input, no filters, null output</h2>
<p>This benchmark is aimed at assessing the throughput of the Kafka input in isolation, excluding the downstream impacts of the Logstash filters and outputs. Use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-outputs-sink.html">sink</a> plugin in the output section to discard the events without incurring IO overhead and get a theoretical maximum reading speed.</p>
<p>This test is better performed with and without a <a href="https://www.elastic.co/guide/en/logstash/current/persistent-queues.html#persistent-queues-architecture">persistence queue</a> to isolate the additional overhead at this stage. </p>
<p>It is helpful to use a unique consumer group_id for this test instead of the default `logstash`. Otherwise, this null-output pipeline might consume and drop events that should be processed by other pipelines.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
  sink { }
}
</code></pre>
<p>If the throughput from this test closely matches that of the original pipeline, the valve is most likely closed upstream: consuming the events from Kafka is itself the bottleneck. </p>
<p>Note that the maximum throughput is significantly impacted by the Kafka cluster's ability to handle consumer requests and by network latency. It is also bounded by the rate of events flowing into the Kafka topic once the consumer group has caught up with the topic.</p>
<p>A few things might be considered in this exercise: </p>
<ul>
<li>
<p><strong>Match consumers count to partitions count:</strong> Ideally, the total number of <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-consumer_threads">consumer threads</a> across all the pipelines that share the same consumer group_id, should be equal to the number of topic partitions for a perfect balance. Each Kafka topic-partition can be assigned to at-most one consumer within a consumer group at a time. So if you have more consumer threads than your topic partitions, some of those threads will not be assigned a partition. Partition-replicas do not count, as consumer threads consume messages from the leader partitions, not directly from replicas. Exceeding 1:1 ratio may also introduce unnecessary computational overhead in Logstash without any gains in read throughput. Incrementally increasing the partition count in the topic can potentially improve the throughput. Kafka 4.0 introduces early access to <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">KIP932</a>, which bypasses this 1:1 mapping requirement using share groups implementing a queuing semantic to the consumption model. The Share Groups are not supported in Logstash yet.</p>
</li>
<li>
<p><strong>Tune the input parameters for maximum throughput:</strong> Increasing <code>max.poll.records</code>, <code>fetch.max.bytes</code>, and <code>receive.buffer.bytes</code> can enhance performance. The TCP read buffer size is rarely the issue, but when it is, its impact is significant; note that this setting is bounded by the kernel's <code>net.core.rmem_max</code> value.</p>
</li>
<li>
<p><strong>Use fast disks with enough space if using persistent queues:</strong> The queue sits between the input and filter stages in the same process. The I/O performance of the storage directly impacts the input throughput. When the queue is full, Logstash puts back pressure on the inputs to stall the data flow.</p>
</li>
</ul>
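<p>The 1:1 assignment constraint from the first bullet can be sketched with a simplified round-robin model (real Kafka assignors such as range or cooperative-sticky differ in detail, but the at-most-one-consumer-per-partition rule holds): with P partitions and T consumer threads in one group, at most min(T, P) threads receive work.</p>

```python
# Simplified model of partition assignment within one consumer group:
# each partition goes to exactly one thread, so any threads beyond the
# partition count sit idle.
def assign(partitions, threads):
    assignment = {t: [] for t in range(threads)}
    for p in range(partitions):
        assignment[p % threads].append(p)
    return assignment

# 4 partitions, 6 consumer threads: threads 4 and 5 get nothing.
a = assign(partitions=4, threads=6)
active = sum(1 for parts in a.values() if parts)
print(active)  # 4
```

<p>This is why adding consumer threads beyond the partition count yields no read-throughput gain; increasing the partition count is what raises the parallelism ceiling.</p>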
<h2>Second benchmark: Kafka input, filters, no outputs</h2>
<p>This benchmark helps measure the impact of the filters on the input throughput using the best achieved input configuration from the first exercise. It quantifies the throttling effect on the input stream only caused by the events processing. Note that <a href="https://www.elastic.co/guide/en/logstash/current/lookup-enrichment.html">some filter plugins</a> are also IO-bound, like the plugins that use the network to enrich the events.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
...
}
output {
}
</code></pre>
<p>To increase the number of events processed simultaneously by the filters, try increasing the number of pipeline workers and the pipeline batch size, particularly if the pipeline <code>worker_utilization</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metric</a> is near 100 and Logstash is not consuming all available CPU. Increasing the number of workers <a href="https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html">past the number of available processors</a> can also yield better results, as some filter plugins may spend significant time in an I/O wait state, such as during external lookups. This also makes more efficient use of the Logstash host resources. </p>
<p>Optimizing the pipeline filters is the most effective approach to resolving this bottleneck: it can significantly reduce latency and increase throughput regardless of the pipeline's input configuration and Logstash resources. The per-plugin <code>worker_utilization</code> and <code>worker_millis_per_event</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metrics</a> are very useful for identifying where most of the resources are being spent, and consequently where optimization efforts should focus first. Some general best practices that usually yield improvements are using <a href="https://www.elastic.co/blog/do-you-grok-grok">anchors</a> in Grok patterns, switching to faster plugins like <a href="https://www.elastic.co/blog/logstash-dude-wheres-my-chainsaw-i-need-to-dissect-my-logs">dissect</a> whenever possible, optimizing Ruby filter code, eliminating unnecessary parsing, and improving network-based enrichments. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image3.png" alt="" />
<em>Source: <a href="https://www.elastic.co/blog/do-you-grok-grok">do you grok</a></em></p>
<p>In some cases, optimizing the pipeline may require a complete redesign of the ingestion workflow or the pipeline itself!</p>
<h2>Third benchmark: Kafka input, no filters, Elasticsearch output</h2>
<p>This benchmark helps quantify the throttling effect of the Elasticsearch output on the input throughput. The test can be divided into two phases: the first phase uses raw logs to isolate the impact of Elasticsearch indexing, while the second phase assesses the impact of ingest pipelines.</p>
<p><em>In case a pipeline is using multiple outputs, note that,</em> <a href="https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html#output-isolator-pattern"><em>by default</em></a><em>, a pipeline is blocked if any single output is blocked. This behavior is important in guaranteeing at-least-once delivery of data, but can cause the outputs to perform at the rate of the most clogged one.</em></p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
 elasticsearch {
   ...
 }
}
</code></pre>
<p>To increase throughput, consider progressively increasing the number of pipeline workers and the pipeline batch size. The prior guidance about the <code>worker_utilization</code> flow metric applies here too, although CPU availability plays a smaller role since this output is mostly IO-bound. Also watch the Elasticsearch output's rejection rates (e.g., response code 429, <code>es_rejected_execution_exception</code>, indicating explicit back-pressure) as a signal that the Elasticsearch cluster is busy processing other batches.</p>
<p>The Logstash output tries to send batches of events to the Elasticsearch Bulk API in a single request. However, if a batch exceeds 20 MB, the plugin splits it into multiple bulk requests. </p>
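<p>The splitting behavior can be illustrated with a greedy chunking sketch. This is not the plugin's exact implementation, just the idea: events accumulate into a bulk request until adding one more would exceed the 20 MB limit, at which point a new request is started.</p>

```python
# Illustrative sketch (not the plugin's exact logic): greedily split a
# batch of serialized event sizes into bulk requests of at most 20 MB each.
LIMIT = 20 * 1024 * 1024  # 20 MB

def split_batch(event_sizes, limit=LIMIT):
    requests, current, current_size = [], [], 0
    for size in event_sizes:
        # Start a new bulk request when the next event would overflow it.
        if current and current_size + size > limit:
            requests.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        requests.append(current)
    return requests

# 30 events of ~1 MB each end up split across two bulk requests.
reqs = split_batch([1_000_000] * 30)
print(len(reqs))  # 2
```

<p>This is also why proxy payload limits matter: each resulting bulk request can be close to 20 MB on the wire.</p>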
<p>If the Elasticsearch cluster is behind a proxy or API gateway, it's important to adjust the proxy limits to allow Logstash requests with large payloads to pass through to the Elasticsearch cluster. By default, most proxy servers have a much smaller maximum size for HTTP request payloads, which should be tuned in this case to accommodate larger requests. To identify potential issues, look for error code 413 in your proxy logs, as this indicates that the size of the Logstash request has exceeded the maximum payload size the proxy is configured to handle.</p>
<p>On the Elasticsearch cluster, tune your ingest pipelines' efficiency following the same general best practices discussed above for the Logstash pipelines. Also, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">tune for indexing speed</a> by using faster hardware, fewer index refreshes, and auto-generated IDs, and consider increasing the number of primary shards to enhance indexing parallelism if you have multiple nodes. Beware that excessively increasing this number can negatively impact search performance.</p>
<p>Finally, keep in mind that the Elasticsearch output plugin is mostly IO-bound, which means that your network latency and bandwidth significantly reduce the rate at which data is transferred and hence your output throughput.</p>
<h2>Reassemble your pipeline</h2>
<p>After tuning the pipeline in each of the previous phases separately, put all the parts back together to assess the real throughput and latency of the reassembled pipeline. At this point you should have reached the best performance a single Logstash host can deliver, and you can progressively add more instances to reach the target latency and throughput for a specific topic or data source. </p>
<h2>Example</h2>
<p>Below is an example of the configuration required on Logstash and Elasticsearch to implement the architecture above.</p>
<p>Logstash pipeline:</p>
<pre><code>input {
 kafka {
   bootstrap_servers =&gt; &quot;&lt;server&gt;:&lt;port&gt;&quot;
   topics =&gt; [&quot;&lt;topic-id&gt;&quot;]
   group_id =&gt; &quot;&lt;consumer-group-id&gt;&quot;
   decorate_events =&gt; &quot;extended&quot;
   auto_offset_reset =&gt; &quot;earliest&quot;
   codec =&gt; json {
   }
 }
}


filter {
 ruby {
   code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
 }


 mutate {
   add_field =&gt; {
     &quot;[kafka][timestamp]&quot; =&gt; &quot;%{[@metadata][kafka][timestamp]}&quot;
     &quot;[kafka][offset]&quot; =&gt; &quot;%{[@metadata][kafka][offset]}&quot;
     &quot;[kafka][consumer_group]&quot; =&gt; &quot;%{[@metadata][kafka][consumer_group]}&quot;
     &quot;[kafka][topic]&quot; =&gt; &quot;%{[@metadata][kafka][topic]}&quot;
   }
 }


 date {
   match =&gt; [&quot;[kafka][timestamp]&quot;, &quot;UNIX&quot;, &quot;UNIX_MS&quot;]
   target =&gt; &quot;[kafka][timestamp]&quot;
 }
 ...
 ruby {
   code =&gt; &quot;event.set('[logstash][end]', Time.now());&quot;
 }
}


output {
 elasticsearch {
   hosts =&gt; &quot;hosts&quot;
   api_key =&gt; &quot;api_key&quot;
   data_stream =&gt; true
   ssl =&gt; true
 }
}
</code></pre>
<p>Create an ingest pipeline for the lag calculation. Note that when using Elastic integrations, the ECS fields <code>*.end</code>, <code>*.start</code>, and <code>*.timestamp</code> are automatically mapped as dates.</p>
<pre><code>PUT _ingest/pipeline/calculate_ingest_lag
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.ingested&quot;,
        &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;if&quot;: &quot;ctx['@timestamp'] != null &amp;&amp; ctx?.kafka?.timestamp != null &amp;&amp; ctx?.logstash?.start != null &amp;&amp; ctx?.logstash?.end != null &amp;&amp; ctx?.event?.ingested != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot; 
  ctx.lag_in_millis = [:];
              ctx.lag_in_millis.src_kfk = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['kafka']['timestamp'])).toMillis(); 
              ctx.lag_in_millis.kfk_ls = Duration.between(ZonedDateTime.parse(ctx['kafka']['timestamp']), ZonedDateTime.parse(ctx['logstash']['start'])).toMillis();
              ctx.lag_in_millis.within_ls  = Duration.between(ZonedDateTime.parse(ctx['logstash']['start']), ZonedDateTime.parse(ctx['logstash']['end'])).toMillis();
              ctx.lag_in_millis.ls_es = Duration.between(ZonedDateTime.parse(ctx['logstash']['end']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis(); 
              ctx.lag_in_millis.end_end = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis();     
        &quot;&quot;&quot;
      }
    }
  ]
}
</code></pre>
<p>Use the pipeline to add the lag calculation to your Elastic integrations:</p>
<pre><code>PUT _ingest/pipeline/logs-system.integration@custom
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;calculate_ingest_lag&quot;,
        &quot;ignore_missing_pipeline&quot;: true,
        &quot;description&quot;: &quot;add ingest lag calculation to elastic_agent integration&quot;
      }
    }
  ]
}
</code></pre>
<h2>Kibana Dashboard and Alerts</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image6.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image2.png" alt="" /></p>
<p>Using the metrics mentioned above along with the <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html">Log Rate ML job</a>, you can set up <a href="https://www.elastic.co/guide/en/kibana/current/rule-types.html#observability-rules">Kibana alerts</a> that trigger on anomalous changes in throughput or delays, or simply when delays exceed defined thresholds.</p>
<h2>Time to try it out</h2>
<p>Start your <a href="https://cloud.elastic.co/registration?elektra=whats-new-elastic-7-14-blog">free 14-day trial of Elastic Cloud</a> to experience the latest version of <a href="https://www.elastic.co/security">Elastic</a>. Also, make sure to take advantage of the Elastic threat detection <a href="https://www.elastic.co/training/elastic-security-quick-start">training</a> to set yourself up for success.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/cover-resized.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Turn Dashboards Into an Investigation Tool with ES|QL Variable Controls]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kibana-dashboard-esql-variable-controls</link>
            <guid isPermaLink="false">kibana-dashboard-esql-variable-controls</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to use ES|QL variables in Kibana to turn a dashboard into an investigation tool, applying value and structure controls to uncover problems.]]></description>
            <content:encoded><![CDATA[<p>Static dashboards are useful until the first incident, where the default view hides the signal you need. ES|QL variable controls on a Kibana dashboard make it possible to go from a healthy-looking fleet overview to a clear root cause without editing a single query.</p>
<p>In this blog, we’ll show how these ES|QL variable controls turn dashboards into interactive investigation tools, and how to set them up to uncover problems that averages were hiding. By selecting a value in a control, every panel using that variable adapts.</p>
<h2>The dashboard</h2>
<p>This is a custom &quot;Infrastructure Overview&quot; dashboard monitoring 10 hosts across 3 AWS regions using OpenTelemetry host metrics. It has four line charts (CPU, memory, disk, load average) and ES|QL variable controls at the top.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/1-default-view.png" alt="Default dashboard view showing healthy fleet metrics aggregated by region with ES|QL variable controls visible at the top" /></p>
<p>With the default dashboard controls (AVG aggregation, region breakdown, 15-minute buckets, all hosts selected), everything looks healthy. Smooth diurnal cycles across all three regions.</p>
<p>But there is a problem hiding in this view.</p>
<h2>The problem with fixed queries</h2>
<p>A fixed chart query hardcodes decisions that need to change during an investigation:</p>
<ul>
<li>The aggregation function (AVG, MAX, MIN, MEDIAN)</li>
<li>The dimension used to slice the data (host, region, availability zone)</li>
<li>Which hosts are included or excluded</li>
<li>The time bucket interval (1m, 5m, 15m, 1h)</li>
</ul>
<p>With those baked in, every change means editing queries across multiple panels.</p>
<h2>ES|QL variable controls</h2>
<p>ES|QL variable controls inject user-selected values into queries at runtime. Two types:</p>
<ul>
<li><strong>Value controls</strong> (<code>?variable</code>): replace a value in the query, such as a time interval or a list of hostnames</li>
<li><strong>Structure controls</strong> (<code>??variable</code>): replace a function name or field name, such as the aggregation function or the dimension used to slice data</li>
</ul>
<p>One query pattern, reused across all panels.</p>
<h2>The query</h2>
<p>The original static CPU query looks like this:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != &quot;idle&quot;
| STATS AVG(system.cpu.utilization)
  BY BUCKET(@timestamp, 1 minute), resource.attributes.host.name
</code></pre>
<p>To adapt this query to use variable controls, each hardcoded part has to be replaced with a variable. The aggregation function, the time bucket, and the breakdown dimension are straightforward replacements. The hostname filter requires one extra step because we want the control to allow selecting multiple hosts at once, and filtering by a single value only matches one host at a time. <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/mv-functions/mv_contains"><code>MV_CONTAINS</code></a> checks whether a value exists inside a multi-value list, so <code>MV_CONTAINS(?hostname, resource.attributes.host.name)</code> returns true if the field contains any of the selected values in the control.</p>
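<p>A minimal sketch of how <code>MV_CONTAINS</code> behaves on its own (the hostnames here are made up):</p>
<pre><code class="language-esql">ROW selected = [&quot;db-01&quot;, &quot;web-02&quot;]
| EVAL hit = MV_CONTAINS(selected, &quot;db-01&quot;)     // true: the list contains the value
| EVAL miss = MV_CONTAINS(selected, &quot;cache-03&quot;) // false: the value is not in the list
</code></pre>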
<p>After replacing each part, the query becomes:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.cpu.utilization IS NOT NULL
  AND attributes.state != &quot;idle&quot;
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.cpu.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/5-esql.png" alt="ES|QL query with variable placeholders visible in the Lens editor" /></p>
<p>The same pattern applies to all four panels (CPU, Memory, Disk, Load). Changing any control updates every panel at once.</p>
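<p>As a sketch, the Memory panel could reuse the same shape; the metric name <code>system.memory.utilization</code> is our assumption here, since the blog only shows the CPU query:</p>
<pre><code class="language-esql">TS metrics-hostmetricsreceiver.otel-default
| WHERE system.memory.utilization IS NOT NULL
  AND MV_CONTAINS(?hostname, resource.attributes.host.name)
| STATS ??aggregation(system.memory.utilization)
  BY BUCKET(@timestamp, ?interval), ??breakdown
</code></pre>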
<h2>The controls</h2>
<ul>
<li>
<p><strong>Hostname</strong> (<code>?hostname</code>): Filters to the hosts selected in the control. Configured as &quot;Values from a query&quot; with multi-select enabled. It runs an ES|QL query that returns available host names, and <code>MV_CONTAINS</code> in the chart queries enables selecting more than one.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/6-host-control-config-small.png" alt="Host control configuration showing Values from a query settings and the ES|QL query that populates the control" /></p>
</li>
<li>
<p><strong>Aggregation</strong> (<code>??aggregation</code>): Swaps the aggregation function. Static values control with <code>AVG</code>, <code>MAX</code>, <code>MIN</code>, <code>MEDIAN</code>.</p>
</li>
<li>
<p><strong>Time interval</strong> (<code>?interval</code>): Controls the time bucket size. Static values control with <code>1 minute</code>, <code>5 minutes</code>, <code>15 minutes</code>, <code>1 hour</code>.</p>
</li>
<li>
<p><strong>Breakdown</strong> (<code>??breakdown</code>): Swaps the dimension used to slice the data. Static values control with <code>resource.attributes.host.name</code>, <code>resource.attributes.cloud.region</code>, <code>resource.attributes.cloud.availability_zone</code>.</p>
</li>
</ul>
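<p>The &quot;Values from a query&quot; option behind the hostname control runs a short ES|QL query to populate the list. A sketch of what it might look like (the exact query in the screenshot is not reproduced here):</p>
<pre><code class="language-esql">FROM metrics-hostmetricsreceiver.otel-default
| STATS BY resource.attributes.host.name
</code></pre>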
<h2>The investigation</h2>
<p>The dashboard opens with AVG aggregation, region breakdown, 15-minute buckets, and all hosts selected. Nothing looks wrong. The first change is switching the aggregation from AVG to MAX and the time interval to 1 minute. A bump immediately appears in <code>us-east-1</code> around March 7, reaching roughly 68% where the normal peak sits around 57%. The average was hiding this because one host's intermittent spikes get averaged across the five hosts in the region.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/2-aggregation-max.png" alt="Dashboard after switching to MAX aggregation and 1-minute interval, showing a visible bump in us-east-1 on March 7" /></p>
<p>Next, switching the breakdown from region to host makes the culprit clear. <code>db-01</code> stands out with spikes to 65-70% while its normal baseline sits around 24%. Every other host follows its expected pattern.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/3-breakdown-host.png" alt="Host-level breakdown revealing db-01 with clear CPU spikes" /></p>
<p>Setting the hostname control to <code>db-01</code> alone isolates the incident: intermittent CPU bursts, not sustained saturation. Memory climbs from 85% to 93%, Load from 2.4 to 3.0, Disk from 67% to 73%. All four panels corroborate a 4-hour event window.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/4-db01-filtered.png" alt="Dashboard filtered to db-01 only, all four panels showing correlated anomalies during the incident window" /></p>
<h2>Why structure your queries with variable controls</h2>
<p>A dashboard built with variable controls supports investigation paths that did not exist when the dashboard was built. Without them, every dashboard is a frozen perspective chosen at build time. When an incident does not match that perspective, someone has to edit queries or build a new dashboard under pressure. With controls, the panels adapt.</p>
<p>Value controls like <code>?hostname</code> and <code>?interval</code> handle what you filter and define the granularity of the data. Structure controls like <code>??aggregation</code> and <code>??breakdown</code> handle how you aggregate and how you slice. Panels sharing one query pattern means a fix or improvement applies everywhere, and a new investigation path is a single value added to a control. Together they turn a static dashboard into an investigation surface.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kibana-dashboard-esql-variable-controls/header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Kibana: How to create impactful visualizations with magic formulas? (part 1)]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kibana-impactful-visualizations-with-magic-formulas-part1</link>
            <guid isPermaLink="false">kibana-impactful-visualizations-with-magic-formulas-part1</guid>
            <pubDate>Mon, 09 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We will see how magic math formulas in the Kibana Lens editor can help to highlight high values.]]></description>
            <content:encoded><![CDATA[<h2>Kibana: How to create impactful visualizations with magic formulas? (part 1)</h2>
<h3>Introduction</h3>
<p>In the previous blog post, <a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing Intuitive Kibana Dashboards as a non-designer</a>, we highlighted the importance of creating intuitive dashboards. It demonstrated how simple changes (grouping themes, changing chart types, and more) can make a difference in understanding your data. When delivering courses like <a href="https://www.elastic.co/training/data-analysis-with-kibana">Data Analysis with Kibana</a> or <a href="https://www.elastic.co/training/elastic-observability-engineer">Elastic Observability Engineer</a>, we point to this blog post and how these changes help bring essential information to the surface. I like a complementary approach to reach this goal: using two colors to separate the highest data values from the common ones.</p>
<p>To illustrate this idea, we will use the <em>Sample flight data</em> dataset. Now, let’s compare two visualizations ranking the top 10 destination countries per total number of flights. Which visualization has a higher impact?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro.png" alt="Flights: Top 10 destinations" /></p>
<p>If you chose the second one, you may be wondering how this was done with the Kibana Lens editor. While preparing for the certification last year, I found a way to achieve this result. The secret is using two different layers and some magic formulas. This post will explain how math in Lens formulas helps create two data-color visualizations.</p>
<p>We will start with the first example that emphasizes only the highest value of the dataset we are focusing on. The second example describes how to highlight other high values (as shown in the illustration above).</p>
<p><em>[Note: the tips explained in this blog post can be applied from v7.15.]</em></p>
<h2>Only the highest value<a id="only-the-highest-value"></a></h2>
<p>To understand how math helps to separate high values from common ones, let’s start with this first example: emphasizing only the highest value.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-teaser.png" alt="1.1 flights: " /></p>
<p>We start with a bar horizontal chart:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-bar-horizontal-setup.png" alt="1.1 flights: Lens bar horizontal chart" /></p>
<p>We need to identify the highest value in the scope we are currently examining. For this, we use one of the <strong>overall_*</strong> functions: <strong>overall_max()</strong>, a pipeline function (equivalent to a pipeline aggregation in Query DSL). </p>
<p>In our example, we group the flights by destination country. This means we count the number of flights for each DestCountry (one bucket per country). <strong>overall_max()</strong> then selects the bucket with the highest value. </p>
<p>The math trick here is to divide the number of flights per bucket by the maximum value found among all buckets. Only one bucket will return 1: the bucket matching the max value found by overall_max(). All the other buckets will return a value between 0 and 1. We use <strong>floor()</strong> to ensure any 0.xxx value is rounded down to 0. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-floor.png" alt="1.1 flights: explaining floor()" /></p>
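<p>A quick worked example with made-up bucket counts, where the maximum bucket holds 2371 flights:</p>
<pre><code>DestCountry  count()  count()/overall_max(count())  floor(...)
IT           2371     1.00                          1
US           1987     0.84                          0
CN           1096     0.46                          0
</code></pre>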
<p>Now, we can multiply it by count() and we have our formula for the first layer!</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p>From here, in the Lens editor, we duplicate the layer and adjust the formula of the second layer, which contains the rest of the data. We subtract the first formula from count(). This is the other trick: in this layer, we just need to ensure the highest value is not represented, which happens exactly once, when count() = overall_max(count()) and the division returns 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-layer1-and-layer2.png" alt="1.1 flights: layer 1 + layer 2" /></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
<p>To achieve a nice merge of these two layers, we need to do the following adjustments in both:</p>
<ul>
<li>
<p>select <strong>bar horizontal stacked</strong></p>
</li>
<li>
<p>Vertical axis: change “Rank by” to Custom and ensure the Rank function is “Count”</p>
</li>
</ul>
<p>Here is the final setup of the two layers:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-final-2layers-setup.png" alt="1.1 flights: 2layers setup" /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
<p>This visualization also works well for time series data where you need to quickly highlight which time period (12h in the example below) had the highest number of flights:<br />
<img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-timeserie-example.png" alt="1.1 flights: timeseries example" /></p>
<h2>Above the surface<a id="above-the-surface"></a></h2>
<p>Building on what we have done earlier, we can extend the approach to get other high values above the surface. Let’s see which formula we used to create the visualization in the introduction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro-s1.png" alt="2.1 Flights: Top 10 destinations" /></p>
<p>For this visualization, we used a property of the <strong>round()</strong> function: any ratio of 0.5 or more rounds up to 1, so the formula brings in all the values greater than or equal to 50% of the highest value.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.1-explaination-round.png" alt="2.1 flights: round() &gt; 50% of max explanation" /></p>
<p>Let's duplicate our first visualization and swap out the floor() function with round().</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*round(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*round(count()/overall_max(count()))</code></p>
<p>It was an easy fix.<br />
What if we want to extend the first layer further by adding more high values?<br />
For instance, we would like all the values above the average.</p>
<p>To do this, we use <strong>overall_average()</strong> as the new reference value instead of overall_max() to separate the eligible values in Layer 1.</p>
<p>As we are comparing against the average value among all the buckets, the division might return values greater than 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-floor.png" alt="2.2 flights: round() explanation" /></p>
<p>Here, the <strong>clamp()</strong> function nicely solves this issue. </p>
<p>According to the formula reference, clamp() &quot;limits the value from a minimum to maximum&quot;. Combining clamp() and floor() ensures that there are only two possible output values: either the minimum (0) or the maximum (1) given as parameters.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-clamp.png" alt="2.2 flights: clamp() explanation" /></p>
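<p>With made-up counts and an overall average of 1200, the combination behaves like this (note the first row, where clamp() caps a floor() result of 2 back to 1):</p>
<pre><code>count()  count()/overall_average(count())  floor(...)  clamp(floor(...),0,1)
2600     2.17                              2           1
1250     1.04                              1           1
1096     0.91                              0           0
</code></pre>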
<p>Applied to our flights dataset, it highlights the country destinations that have more flights than the average:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-2.2-above-overall-average.png" alt="2.2 flights: above the overall average " /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p>It also opens up options for using other dynamic references. For instance, we could place all the values greater than 60% of the highest above the surface ( &gt; <code>0.6*overall_max(count())</code>).
We can tune our formula as follows: </p>
<pre><code>
count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)
</code></pre>
<h2>Conclusion<a id="conclusion"></a></h2>
<p>In the first part, we have seen the main tips allowing us to create a two-color histogram:</p>
<ul>
<li>
<p>Two layers: one for the highest value and one for the remaining values</p>
</li>
<li>
<p>Visualization type: bar horizontal/vertical <strong>stacked</strong></p>
</li>
<li>
<p>To separate the data, we use a formula where only the highest value returns 1 and all the other values return 0</p>
</li>
</ul>
<p>Then in the second part, we saw how to extend this principle to bring more high values above the surface. This approach can be summarized as follows:</p>
<ul>
<li>
<p>Start with layer 1 focusing on the high value: count()*&lt;formula returning 0 or 1&gt;</p>
</li>
<li>
<p>Duplicate the layer and adjust the formula:<br />
 ( count() - count()*&lt;formula returning 0 or 1&gt;)</p>
</li>
</ul>
<p>Finally, here are four generic formulas, ready to use to spice up your dashboards:</p>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1. Only the highest</strong></td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*floor(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*floor(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.1. Above the surface :</strong> high values (above 50% of the max value)</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*round(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*round(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.2. Above the surface :</strong> all values above the overall average</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.3. Above the surface :</strong> all the values greater than 60% of the highest</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
</tbody>
</table>
<p>Try these examples out for yourself by signing up for a <a href="https://cloud.elastic.co/registration?elektra=10-common-questions-kibana-blog">free trial of Elastic Cloud</a> or <a href="https://www.elastic.co/downloads/">download</a> the self-managed version of the Elastic Stack for free. If you have additional questions about getting started, head on over to the <a href="https://discuss.elastic.co/c/elastic-stack/kibana/7">Kibana forum</a> or check out the <a href="https://www.elastic.co/guide/en/kibana/current/index.html">Kibana documentation guide</a>.<br />
In the next blog post, we will see how the new function <strong>ifelse</strong>() (introduced in version 8.6) will greatly simplify the creation of visualizations with more advanced formulas.</p>
<p><strong>References</strong>:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing intuitive Kibana dashboards as a non-designer</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/kibana/current/lens.html#lens-formulas">Kibana: Lens editor - use formula to perform math</a></p>
</li>
<li>
<p>Discovering the clamp() function <a href="https://discuss.elastic.co/t/if-condition-in-kibana-table-visualization/305751/5">in this discussion (Thanks Marco!)</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/kibana-magic-formulas-p1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Managing your Kubernetes cluster with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-cluster-metrics-logs-monitoring</link>
            <guid isPermaLink="false">kubernetes-cluster-metrics-logs-monitoring</guid>
            <pubDate>Mon, 24 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Unify all of your Kubernetes metrics, log, and trace data on a single platform and dashboard, Elastic. From the infrastructure to the application layer Elastic Observability makes it easier for you to understand how your cluster is performing.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT manager, DevOps), you’re always struggling with how to manage technology and data sprawl. Kubernetes is becoming increasingly pervasive and a majority of these deployments will be in Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), or Azure Kubernetes Service (AKS). Some of you may be on a single cloud while others will have the added burden of managing clusters on multiple Kubernetes cloud services. In addition to cloud provider complexity, you also have to manage hundreds of deployed services generating more and more observability and telemetry data.</p>
<p>The day-to-day operations of understanding the status and health of your Kubernetes clusters and applications running on them, through the logs, metrics, and traces they generate, will likely be your biggest challenge. But as an operations engineer you will need all of that important data to help prevent, predict, and remediate issues. And you certainly don’t need that volume of metrics, logs and traces spread across multiple tools when you need to visualize and analyze Kubernetes telemetry data for troubleshooting and support.</p>
<p>Elastic Observability helps manage the sprawl of Kubernetes metrics and logs by providing extensive and centralized observability capabilities beyond just the logging that we are known for. Elastic Observability provides you with granular insights and context into the behavior of your Kubernetes clusters along with the applications running on them by unifying all of your metrics, log, and trace data through OpenTelemetry and APM agents.</p>
<p>Regardless of the cluster location (EKS, GKE, AKS, self-managed) or application, <a href="https://www.elastic.co/what-is/kubernetes-monitoring">Kubernetes monitoring</a> is made simple with Elastic Observability. All of the node, pod, container, application, and infrastructure (AWS, GCP, Azure) metrics, infrastructure and application logs, along with application traces are available in Elastic Observability.</p>
<p>In this blog we will show:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest metrics and log data through the Elastic Agent (easily deployed on your cluster as a DaemonSet) to retrieve logs and metrics from the host (system metrics, container stats) along with logs from all services running on top of Kubernetes.</li>
<li>How Elastic Observability can bring a unified telemetry experience (logs, metrics, traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" alt="Elastic Agent with Kubernetes Integration" /></p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>While we used GKE, you can use any location for your Kubernetes cluster.</li>
<li>We used a variant of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">HipsterShop</a> demo application. It was originally written by Google to showcase Kubernetes and is now available in a multitude of variants, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. To use the app, please go <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a> and follow the instructions to deploy it. You don’t need to deploy otelcollector for Kubernetes metrics to flow — we will cover this below.</li>
<li>Elastic supports native ingest from Prometheus and FluentD, but in this blog, we are showing direct ingest from the Kubernetes cluster via Elastic Agent. A follow-up blog will show how Elastic can also pull in telemetry from Prometheus or FluentD/Fluent Bit.</li>
</ul>
<h2>What can you observe and analyze with Elastic?</h2>
<p>Before we walk through the steps on getting Elastic set up to ingest and visualize Kubernetes cluster metrics and logs, let’s take a sneak peek at Elastic’s helpful dashboards.</p>
<p>As we noted, we ran a variant of HipsterShop on GKE and deployed Elastic Agents with Kubernetes integration as a DaemonSet on the GKE cluster. Upon deployment of the agents, Elastic starts ingesting metrics from the Kubernetes cluster (specifically from kube-state-metrics) and additionally Elastic will pull all log information from the cluster.</p>
<h3>Visualizing Kubernetes metrics on Elastic Observability</h3>
<p>Here are a few Kubernetes dashboards that will be available out of the box (OOTB) on Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="HipsterShop default namespace pod dashboard on Elastic Observability" /></p>
<p>In addition to the cluster overview dashboard and pod dashboard, Elastic has several useful OOTB dashboards:</p>
<ul>
<li>Kubernetes overview dashboard (see above)</li>
<li>Kubernetes pod dashboard (see above)</li>
<li>Kubernetes nodes dashboard</li>
<li>Kubernetes deployments dashboard</li>
<li>Kubernetes DaemonSets dashboard</li>
<li>Kubernetes StatefulSets dashboards</li>
<li>Kubernetes CronJob &amp; Jobs dashboards</li>
<li>Kubernetes services dashboards</li>
<li>More being added regularly</li>
</ul>
<p>Additionally, you can either customize these dashboards or build out your own.</p>
<h3>Working with logs on Elastic Observability</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-Logging-4.png" alt="Kubernetes container logs and Elastic Agent logs" /></p>
<p>As you can see from the screens above, not only can I get Kubernetes cluster metrics, but also all the Kubernetes logs simply by using the Elastic Agent in my Kubernetes cluster.</p>
<h3>Prevent, predict, and remediate issues</h3>
<p>In addition to helping manage metrics and logs, Elastic can help you detect and predict anomalies across your cluster telemetry. Simply turn on Machine Learning in Elastic against your data and watch it help you enhance your analysis work. As you can see below, Elastic is not only a unified observability location for your Kubernetes cluster logs and metrics, but it also provides extensive true machine learning capabilities to enhance your analysis and management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-AnomalyDetection-5.png" alt="Anomaly detection across logs on Elastic Observability" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-PodIssues-6.png" alt="Analyzing issues on a Kubernetes pod with Elastic Observability " /></p>
<p>In the top graph, anomaly detection across logs shows something potentially wrong in the September 21 to 23 time period. The bottom chart digs into the details by analyzing a single metric, kubernetes.pod.cpu.usage.node, which shows CPU issues early in September and again later in the month. You can run more complicated analyses on your cluster telemetry with Machine Learning, using multi-metric analysis (versus the single-metric issue shown above) along with population analysis.</p>
<p>Elastic gives you better machine learning capabilities to enhance your analysis of Kubernetes cluster telemetry. In the next section, let’s walk through how easy it is to get your telemetry data into Elastic.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get metrics, logs, and traces into Elastic from a HipsterShop application deployed on GKE.</p>
<p>First, pick your favorite version of Hipstershop — as we noted above, we used a variant of the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry-Demo</a> because it already has OTel. We slimmed it down for this blog, however (fewer services with some varied languages).</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FreeElasticCloud-7.png" alt="" /></p>
<h3>Step 1: Get a Kubernetes cluster and load your Kubernetes app into your cluster</h3>
<p>Get your app on a Kubernetes cluster in your Cloud service of choice or local Kubernetes platform. Once your app is up on Kubernetes, you should have the following pods (or some variant) running on the default namespace.</p>
<pre><code class="language-yaml">NAME                                    READY   STATUS    RESTARTS   AGE
adservice-8694798b7b-jbfxt              1/1     Running   0          4d3h
cartservice-67b598697c-hfsxv            1/1     Running   0          4d3h
checkoutservice-994ddc4c4-p9p2s         1/1     Running   0          4d3h
currencyservice-574f65d7f8-zc4bn        1/1     Running   0          4d3h
emailservice-6db78645b5-ppmdk           1/1     Running   0          4d3h
frontend-5778bfc56d-jjfxg               1/1     Running   0          4d3h
jaeger-686c775fbd-7d45d                 1/1     Running   0          4d3h
loadgenerator-c8f76d8db-gvrp7           1/1     Running   0          4d3h
otelcollector-5b87f4f484-4wbwn          1/1     Running   0          4d3h
paymentservice-6888bb469c-nblqj         1/1     Running   0          4d3h
productcatalogservice-66478c4b4-ff5qm   1/1     Running   0          4d3h
recommendationservice-648978746-8bzxc   1/1     Running   0          4d3h
redis-cart-96d48485f-gpgxd              1/1     Running   0          4d3h
shippingservice-67fddb767f-cq97d        1/1     Running   0          4d3h
</code></pre>
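<p>A quick way to confirm that everything in the listing above is healthy is to filter for pods that are not fully ready. A small shell sketch (the check_pods helper below is our own, not part of kubectl):</p>

```shell
# check_pods: read `kubectl get pods` output on stdin and print any pod
# that is not fully ready and Running, for example:
#   kubectl get pods -n default | check_pods
check_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, e.g. "1/1"
    if (ready[1] != ready[2] || $3 != "Running")
      print $1 ": " $3
  }'
}
```

<p>If the function prints nothing, every pod in the namespace reports itself as ready and Running.</p>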
<h3>Step 2: Turn on kube-state-metrics</h3>
<p>Next you will need to turn on <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a>.</p>
<p>First:</p>
<pre><code class="language-bash">git clone https://github.com/kubernetes/kube-state-metrics.git
</code></pre>
<p>Next, in the examples directory under the kube-state-metrics directory, apply the standard config.</p>
<pre><code class="language-bash">kubectl apply -f ./standard
</code></pre>
<p>This will turn on kube-state-metrics, and you should see a pod similar to this running in kube-system namespace.</p>
<pre><code class="language-yaml">kube-state-metrics-5f9dc77c66-qjprz                    1/1     Running   0          4d4h
</code></pre>
<h3>Step 3: Install the Elastic Agent with Kubernetes integration</h3>
<p><strong>Add Kubernetes Integration:</strong></p>
<ol>
<li><img src="https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt5a3ae745e98b9e37/635691670a58db35cbdbc0f6/ManagingKubernetes-Addk8sButton-8.png" alt="" /></li>
<li>In Elastic, go to Integrations, select the Kubernetes integration, and click Add Kubernetes.</li>
<li>Select a name for the Kubernetes integration.</li>
<li>Turn on kube-state-metrics in the configuration screen.</li>
<li>Give the configuration a name in the new-agent-policy-name text box.</li>
<li>Save the configuration. The integration with a policy is now created.</li>
</ol>
<p>You can read more about Elastic Agent policies and how they are used <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-K8sIntegration-9.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FleetManagement-10.png" alt="" /></p>
<ol>
<li>Add the Kubernetes integration.</li>
<li>In the second step, select the policy you just created.</li>
<li>In the third step of the Add Agent instructions, copy and paste or download the manifest.</li>
<li>Save the manifest as elastic-agent-managed-kubernetes.yaml on the machine where kubectl is configured, and run the following command.</li>
</ol>
<pre><code class="language-bash">kubectl apply -f elastic-agent-managed-kubernetes.yaml
</code></pre>
<p>You should see a number of agents come up as part of a DaemonSet in kube-system namespace.</p>
<pre><code class="language-yaml">NAME                                                   READY   STATUS    RESTARTS   AGE
elastic-agent-qr6hj                                    1/1     Running   0          4d7h
elastic-agent-sctmz                                    1/1     Running   0          4d7h
elastic-agent-x6zkw                                    1/1     Running   0          4d7h
elastic-agent-zc64h                                    1/1     Running   0          4d7h
</code></pre>
<p>In my cluster, I have four nodes, so four elastic-agent pods were started as part of the DaemonSet.</p>
<h3>Step 4: Look at Elastic’s out-of-the-box (OOTB) dashboards for Kubernetes metrics and start discovering Kubernetes logs</h3>
<p>That is it. You should see metrics flowing into all the dashboards. To view logs for specific pods, simply go into Discover in Kibana and search for a specific pod name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="Hipstershop default namespace pod dashboard on Elastic Observability" /></p>
<p>Additionally, you can browse all the pod logs directly in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKurbenetes-PodLogs-11.png" alt="frontendService and cartService logs" /></p>
<p>In the above example, I searched for frontendService and cartService logs.</p>
<h3>Step 5: Bonus!</h3>
<p>Because we used an OTel-based application, Elastic can even pull in the application traces. But that is a discussion for another blog.</p>
<p>Here is a quick peek at what Hipster Shop’s traces for a front end transaction look like in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-CheckOutTransaction-12.png" alt="Trace for Checkout transaction for HipsterShop" /></p>
<h2>Conclusion: Elastic Observability rocks for Kubernetes monitoring</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you manage Kubernetes clusters along with the complexity of the metrics, log, and trace data it generates for even a simple deployment.</p>
<p>A quick recap of the lessons learned:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest telemetry data through the Elastic Agent, which is easily deployed on your cluster as a DaemonSet and retrieves metrics from each host, such as system metrics, container stats, and metrics from all the services running on top of Kubernetes</li>
<li>What Elastic brings to a unified telemetry experience (Kubernetes logs, metrics, and traces) across all your Kubernetes cluster components (pods, nodes, services, any namespace, and more)</li>
<li>Why it’s worth exploring Elastic’s ML capabilities, which will reduce your <strong>MTTHH</strong> (mean time to happy hour)</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register</a> and try out the features and capabilities I’ve outlined above.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gain insights into Kubernetes errors with Elastic Observability logs and OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-errors-observability-logs-openai</link>
            <guid isPermaLink="false">kubernetes-errors-observability-logs-openai</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides an example of how one can analyze error messages in Elasticsearch with ChatGPT using the OpenAI API via Elasticsearch.]]></description>
<content:encoded><![CDATA[<p>As we’ve shown in previous blogs, Elastic<sup>®</sup> provides a way to ingest and manage telemetry from the <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes cluster</a> and the <a href="https://www.elastic.co/blog/opentelemetry-observability">application</a> running on it. Elastic provides out-of-the-box dashboards to help with tracking metrics, <a href="https://www.elastic.co/blog/log-management-observability-operations">log management and analytics</a>, <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">APM functionality</a> (which also supports <a href="https://www.elastic.co/blog/opentelemetry-observability">native OpenTelemetry</a>), and the ability to analyze everything with <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps features</a> and <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning?elektra=home">machine learning</a> (ML). While you can use pre-existing <a href="https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-search-relevance">ML models in Elastic</a>, <a href="https://www.elastic.co/blog/aiops-automation-analytics-elastic-observability-use-cases">out-of-the-box AIOps features</a>, or your own ML models, there is a need to dig deeper into the root cause of an issue.</p>
<p>Elastic helps reduce the operational work to support more efficient operations, but users still need a way to investigate and understand everything from the cause of an issue to the meaning of specific error messages. As an operations user, if you haven’t run into a particular error before and it isn’t covered in a runbook, you will likely go to Google and start searching for information.</p>
<p>OpenAI’s ChatGPT is becoming an interesting generative AI tool that helps provide more information using the models behind it. What if you could use OpenAI to obtain deeper insights (even simple semantics) for an error in your production or development environment? You can easily tie Elastic to OpenAI’s API to achieve this.</p>
<p>Kubernetes, a mainstay in most deployments (on-prem or in a cloud service provider), requires a significant amount of expertise — even if that expertise is to manage a service like GKE, EKS, or AKS.</p>
<p>In this blog, I will cover how you can use <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic’s watcher</a> capability to connect Elastic to OpenAI and ask it for more information about the error logs Elastic is ingesting from a Kubernetes cluster(s). More specifically, we will use <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure’s OpenAI Service</a>. Azure OpenAI is a partnership between Microsoft and OpenAI, so the same models from OpenAI are available in the Microsoft version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-azure-openai.png" alt="elastic azure openai" /></p>
<p>While this blog goes over a specific example, it can be modified for other types of errors Elastic receives in logs. Whether it's from AWS, the application, databases, etc., the configuration and script described in this blog can be modified easily.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used a GCP GKE Kubernetes cluster, but you can use any Kubernetes cluster service (on-prem or cloud based) of your choice.</li>
<li>We’re also running with a version of the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>We also have an Azure account and <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure OpenAI service configured</a>. You will need to get the appropriate tokens from Azure and the proper URL endpoint from Azure’s OpenAI service.</li>
<li>We will use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Elastic’s dev tools</a>, the console to be specific, to load up and run the script, which is an <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic watcher</a>.</li>
<li>We will also add a new index to store the results from the OpenAI query.</li>
</ul>
<p>Here is the configuration we will set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" alt="Configuration to analyze Kubernetes cluster errors" /></p>
<p>As we walk through the setup, we’ll also provide the alternative setup with OpenAI versus Azure OpenAI Service.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud and setting up your K8S cluster and application</li>
<li>Gaining Azure OpenAI authorization (alternative option with OpenAI)</li>
<li>Identifying Kubernetes error logs</li>
<li>Configuring the watcher with the right script</li>
<li>Comparing the output from Azure OpenAI/OpenAI versus ChatGPT UI</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-start-cloud-trial.png" alt="elastic start cloud trial" /></p>
<p>Once you have the Elastic Cloud login, set up your Kubernetes cluster and application. A complete step-by-step instructions blog is available <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">here</a>. This also provides an overview of how to see Kubernetes cluster metrics in Elastic and how to monitor them with dashboards.</p>
<h3>Step 1: Azure OpenAI Service and authorization</h3>
<p>When you log in to your Azure subscription and set up an instance of Azure OpenAI Service, you will be able to get your keys under Manage Keys.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-microsoft-azure-manage-keys.png" alt="microsoft azure manage keys" /></p>
<p>There are two keys for your OpenAI instance, but you only need KEY 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-pme-openai-keys-and-endpoint.png" alt="Used with permission from Microsoft." /></p>
<p>Additionally, you will need to get the service URL. See the image above with our service URL blanked out to understand where to get the KEY 1 and URL.</p>
<p>If you are using the standard OpenAI service rather than Azure OpenAI Service, you can get your keys at:</p>
<pre><code class="language-bash">https://platform.openai.com/account/api-keys
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-api-keys.png" alt="api keys" /></p>
<p>You will need to create a key and save it. Once you have the key, you can go to Step 2.</p>
<h3>Step 2: Identifying Kubernetes errors in Elastic logs</h3>
<p>As your Kubernetes cluster is running, <a href="https://docs.elastic.co/en/integrations/kubernetes">Elastic’s Kubernetes integration</a> running on the Elastic agent daemon set on your cluster is sending logs and metrics to Elastic. <a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">The telemetry is ingested, processed, and indexed</a>. Kubernetes logs are stored in an index called .ds-logs-kubernetes.container_logs-default-* (* is for the date), and an automatic data stream logs-kubernetes.container_logs is also pre-loaded. So while you can use some of the out-of-the-box dashboards to investigate the metrics, you can also look at all the logs in Elastic Discover.</p>
<p>While any error from Kubernetes can be daunting, the more nuanced issues occur with errors from the pods running in the kube-system namespace. Take the konnectivity-agent pod: it is essentially a network proxy agent running on the node to help establish tunnels and is a vital component of Kubernetes. Any error will cause the cluster to have connectivity issues and lead to a cascade of problems, so it’s important to understand and troubleshoot these errors.</p>
<p>When we filter out for error logs from the konnectivity agent, we see a good number of errors.</p>
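<p>The filter itself is a one-line KQL query in the Discover search bar; the field names below come from Elastic’s Kubernetes integration and are the same fields the watcher later in this blog searches on:</p>

```
kubernetes.container.name : "konnectivity-agent" and message : "error"
```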
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<p>But unfortunately, we still can’t understand what these errors mean.</p>
<p>Enter OpenAI to help us understand the issue better. Generally, you would take the error message from Discover and paste it with a question in ChatGPT (or run a Google search on the message).</p>
<p>One error in particular that we ran into but did not understand is:</p>
<pre><code class="language-bash">E0510 02:51:47.138292       1 client.go:388] could not read stream err=rpc error: code = Unavailable desc = error reading from server: read tcp 10.120.0.8:46156-&gt;35.230.74.219:8132: read: connection timed out serverID=632d489f-9306-4851-b96b-9204b48f5587 agentID=e305f823-5b03-47d3-a898-70031d9f4768
</code></pre>
<p>The OpenAI output is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-openai-output.png" alt="openai output" /></p>
<p>ChatGPT has given us a fairly nice set of ideas on why this rpc error is occurring against our konnectivity-agent.</p>
<p>So how can we get this output automatically for any error when those errors occur?</p>
<h3>Step 3: Configuring the watcher with the right script</h3>
<p><a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">What is an Elastic watcher?</a> Watcher is an Elasticsearch feature that you can use to create actions based on conditions, which are periodically evaluated using queries on your data. Watchers are helpful for analyzing mission-critical and business-critical streaming data. For example, you might watch application logs for errors causing larger operational issues.</p>
<p>Once a watcher is configured, it can be:</p>
<ol>
<li>Manually triggered</li>
<li>Run periodically</li>
<li>Created using a UI or a script</li>
</ol>
<p>In this scenario, we will use a script, as we can modify it easily and run it as needed.</p>
<p>We’re using the DevTools Console to enter the script and test it out:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-test-script.png" alt="test script" /></p>
<p>The script is listed at the end of the blog in the <strong>appendix</strong>. It can also be downloaded <a href="https://github.com/elastic/chatgpt-error-analysis"><strong>here</strong></a>.</p>
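<p>Once the PUT request in the appendix has created the watch, you can also fire it immediately from the Dev Tools console while iterating, without waiting for the five-minute schedule, using Elasticsearch’s execute watch API:</p>

```
POST _watcher/watch/chatgpt_analysis/_execute
```

<p>The response includes the watch execution record, which is handy for debugging each step of the input chain.</p>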
<p>The script does the following:</p>
<ol>
<li>It runs every five minutes.</li>
<li>It will search the logs for errors from the container konnectivity-agent.</li>
<li>It will take the first error’s message, transform it (re-format and clean up), and place it into a variable first_hit.</li>
</ol>
<pre><code class="language-json">&quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
</code></pre>
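<p>The replace call above strips double quotes from the log line so that the message can later be embedded inside the JSON body of the OpenAI request without producing invalid JSON. The same sanitization expressed in shell, purely as an illustration (sanitize_msg is a hypothetical helper, not part of the watcher):</p>

```shell
# Mirror the watcher transform: delete double quotes so the message can be
# inlined into a JSON string value without escaping issues.
sanitize_msg() {
  printf '%s' "$1" | tr -d '"'
}
```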
<ol start="4">
<li>The error message is sent into OpenAI with a query:</li>
</ol>
<pre><code class="language-yaml">What are the potential reasons for the following kubernetes error:
  {{ ctx.payload.second.first_hit }}
</code></pre>
<ol start="5">
<li>If the search found an error, it creates an index called chatgpt_k8s_analyzed (if it does not already exist) and indexes the error message, the pod name (konnectivity-agent-6676d5695b-ccsmx in our setup), and the OpenAI output into it.</li>
</ol>
<p>To see the results, we created a new data view called chatgpt_k8s_analyzed against the newly created index:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-edit-data-view.png" alt="edit data view" /></p>
<p>In Discover, the output on the data view provides us with the analysis of the errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-analysis-of-errors.png" alt="analysis of errors" /></p>
<p>For every error the script sees in the five-minute interval, it will get an analysis of the error. Alternatively, we could use a range query to analyze a specific time frame; the script would just need to be modified accordingly.</p>
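<p>Besides the data view, the stored results can be inspected directly from the Dev Tools console; for instance, the five most recent analyses (assuming the index name used in the appendix script):</p>

```
GET chatgpt_k8s_analyzed/_search
{
  "size": 5,
  "sort": [{ "timestamp": { "order": "desc" } }]
}
```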
<h3>Step 4: Output from Azure OpenAI/OpenAI vs. ChatGPT UI</h3>
<p>As you noticed above, we got roughly the same result from the Azure OpenAI API call as we did by testing our query in the ChatGPT UI. This is because we configured the API call to use the same (or a similar) model as the one selected in the UI.</p>
<p>For the API call, we used the following parameters:</p>
<pre><code class="language-json">&quot;request&quot;: {
             &quot;method&quot; : &quot;POST&quot;,
             &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
             &quot;headers&quot;: {&quot;api-key&quot; : &quot;XXXXXXX&quot;,
                         &quot;content-type&quot; : &quot;application/json&quot;
                        },
             &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
              &quot;connection_timeout&quot;: &quot;60s&quot;,
               &quot;read_timeout&quot;: &quot;60s&quot;
                            }
</code></pre>
<p>By setting the system role to You are a helpful assistant and using the gpt-35-turbo portion of the URL, we are setting the API to use the gpt-3.5-turbo model, which is the same model the ChatGPT UI uses by default.</p>
<p>Additionally, for Azure OpenAI Service, you will need to set the URL to something similar to the following:</p>
<pre><code class="language-bash">https://YOURSERVICENAME.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview
</code></pre>
<p>If you use OpenAI (versus Azure OpenAI Service), the request call (against <a href="https://api.openai.com/v1/completions">https://api.openai.com/v1/completions</a>) would be as such:</p>
<pre><code class="language-json">&quot;request&quot;: {
            &quot;scheme&quot;: &quot;https&quot;,
            &quot;host&quot;: &quot;api.openai.com&quot;,
            &quot;port&quot;: 443,
            &quot;method&quot;: &quot;post&quot;,
            &quot;path&quot;: &quot;\/v1\/completions&quot;,
            &quot;params&quot;: {},
            &quot;headers&quot;: {
               &quot;content-type&quot;: &quot;application\/json&quot;,
               &quot;authorization&quot;: &quot;Bearer YOUR_ACCESS_TOKEN&quot;
                        },
            &quot;body&quot;: &quot;{ \&quot;model\&quot;: \&quot;text-davinci-003\&quot;,  \&quot;prompt\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;,  \&quot;temperature\&quot;: 1,  \&quot;max_tokens\&quot;: 512,     \&quot;top_p\&quot;: 1.0,      \&quot;frequency_penalty\&quot;: 0.0,   \&quot;presence_penalty\&quot;: 0.0 }&quot;,
            &quot;connection_timeout_in_millis&quot;: 60000,
            &quot;read_timeout_millis&quot;: 60000
          }
</code></pre>
<p>If you are interested in creating a more OpenAI-based version, you can <a href="https://elastic-content-share.eu/downloads/watcher-job-to-integrate-chatgpt-in-elasticsearch/">download an alternative script</a> and look at <a href="https://mar1.hashnode.dev/unlocking-the-power-of-aiops-with-chatgpt-and-elasticsearch">another blog from an Elastic community member</a>.</p>
<h2>Gaining other insights beyond Kubernetes logs</h2>
<p>Now that the script is up and running, you can modify it using different:</p>
<ul>
<li>Inputs</li>
<li>Conditions</li>
<li>Actions</li>
<li>Transforms</li>
</ul>
<p>Learn more on how to modify it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-alerting.html">here</a>. Some examples of modifications could include:</p>
<ol>
<li>Look for error logs from application components (e.g., cartService, frontEnd, from the OTel demo), cloud service providers (e.g., AWS/Azure/GCP logs), and even logs from components such as Kafka, databases, etc.</li>
<li>Vary the time frame from running continuously to running over a specific <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html">range</a>.</li>
<li>Look for specific errors in the logs.</li>
<li>Query for analysis on a set of errors at once versus just one, which we demonstrated.</li>
</ol>
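<p>As a sketch of the second modification, the must clause of the watcher’s search body could gain a range filter so that the watch only considers a fixed window; the one-hour window here is an arbitrary example:</p>

```json
"query": {
  "bool": {
    "must": [
      { "match": { "kubernetes.container.name": "konnectivity-agent" } },
      { "match": { "message": "error" } },
      { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
    ]
  }
}
```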
<p>The modifications are endless, and of course you can run this with OpenAI rather than Azure OpenAI Service.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you connect to OpenAI services (Azure OpenAI, as we showed, or even OpenAI) to better analyze an error log message instead of having to run several Google searches and hunt for possible insights.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Developing an Elastic watcher script that can be used to find and send Kubernetes errors into OpenAI and insert them into a new index</li>
<li>Configuring Azure OpenAI Service or OpenAI with the right authorization and request parameters</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
<h2>Appendix</h2>
<p>Watcher script</p>
<pre><code class="language-bash">PUT _watcher/watch/chatgpt_analysis
{
    &quot;trigger&quot;: {
      &quot;schedule&quot;: {
        &quot;interval&quot;: &quot;5m&quot;
      }
    },
    &quot;input&quot;: {
      &quot;chain&quot;: {
          &quot;inputs&quot;: [
              {
                  &quot;first&quot;: {
                      &quot;search&quot;: {
                          &quot;request&quot;: {
                              &quot;search_type&quot;: &quot;query_then_fetch&quot;,
                              &quot;indices&quot;: [
                                &quot;logs-kubernetes*&quot;
                              ],
                              &quot;rest_total_hits_as_int&quot;: true,
                              &quot;body&quot;: {
                                &quot;query&quot;: {
                                  &quot;bool&quot;: {
                                    &quot;must&quot;: [
                                      {
                                        &quot;match&quot;: {
                                          &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot;
                                        }
                                      },
                                      {
                                        &quot;match&quot; : {
                                          &quot;message&quot;:&quot;error&quot;
                                        }
                                      }
                                    ]
                                  }
                                },
                                &quot;size&quot;: &quot;1&quot;
                              }
                            }
                        }
                    }
                },
                {
                    &quot;second&quot;: {
                        &quot;transform&quot;: {
                            &quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
                        }
                    }
                },
                {
                    &quot;third&quot;: {
                        &quot;http&quot;: {
                            &quot;request&quot;: {
                                &quot;method&quot; : &quot;POST&quot;,
                                &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
                                &quot;headers&quot;: {
                                    &quot;api-key&quot; : &quot;XXX&quot;,
                                    &quot;content-type&quot; : &quot;application/json&quot;
                                },
                                &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
                                &quot;connection_timeout&quot;: &quot;60s&quot;,
                                &quot;read_timeout&quot;: &quot;60s&quot;
                            }
                        }
                    }
                }
            ]
        }
    },
    &quot;condition&quot;: {
      &quot;compare&quot;: {
        &quot;ctx.payload.first.hits.total&quot;: {
          &quot;gt&quot;: 0
        }
      }
    },
    &quot;actions&quot;: {
        &quot;index_payload&quot; : {
            &quot;transform&quot;: {
                &quot;script&quot;: {
                    &quot;source&quot;: &quot;&quot;&quot;
                        def payload = [:];
                        payload.timestamp = new Date();
                        payload.pod_name = ctx.payload.first.hits.hits[0]._source.kubernetes.pod.name;
                        payload.error_message = ctx.payload.second.first_hit;
                        payload.chatgpt_analysis = ctx.payload.third.choices[0].message.content;
                        return payload;
                    &quot;&quot;&quot;
                }
            },
            &quot;index&quot; : {
                &quot;index&quot; : &quot;chatgpt_k8s_analyzed&quot;
            }
        }
    }
}
</code></pre>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
<p><em>Screenshots of Microsoft products used with permission from Microsoft.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Process Kubernetes logs with ease using Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-logs-elastic-streams-processing</link>
            <guid isPermaLink="false">kubernetes-logs-elastic-streams-processing</guid>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to process Kubernetes logs with Elastic Streams using conditional blocks, AI-generated Grok patterns, and selective drops to reduce noise and storage cost.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset, surfacing not only the root cause but also the why behind it to enable faster resolution.</p>
<p>Learn more in our previous article, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Introducing Streams</a>.</p>
<p>Many SREs deploy on cloud-native architectures, and Kubernetes is the baseline deployment platform of choice. Yet Kubernetes logs are messy by default. A single (data)stream often mixes access logs, JSON payloads, health checks, and internal service chatter.</p>
<p>Elastic Streams gives you a faster path. You can isolate subsets of logs with conditionals, use AI to generate Grok patterns from real samples, and drop documents you do not need before they add storage and query cost.</p>
<h2>Why Kubernetes logs get messy fast</h2>
<p>The default Kubernetes container logs stream can contain data from many services at once. In one sample, you might see:</p>
<ul>
<li>HTTP access logs from application pods</li>
<li>Verbose worker or batch job status logs</li>
<li>Platform and container lifecycle events with different formats</li>
</ul>
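<p>For instance, three consecutive lines from the same stream can look completely different. The lines below are illustrative examples, not output from a specific cluster:</p>
<pre><code class="language-text">10.1.4.22 - - [12/Mar/2026:09:14:03 +0000] &quot;GET /api/cart HTTP/1.1&quot; 200 512
{&quot;level&quot;:&quot;info&quot;,&quot;msg&quot;:&quot;batch 42 complete&quot;,&quot;duration_ms&quot;:1833}
Readiness probe succeeded for container &quot;checkout&quot;
</code></pre>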
<p>This is why &quot;one global parsing rule&quot; will fail: you need targeted processing logic per log shape or application type. Historically, this kind of custom processing has been error-prone and time-consuming.</p>
<h2>What Streams Processing changes</h2>
<p>Streams Processing (available in 9.2 and later) moves this workflow into a live, interactive experience:</p>
<ul>
<li>You build conditions and processors in the UI</li>
<li>You validate each change against sample documents before saving</li>
<li>You can use AI to generate extraction patterns from selected logs</li>
</ul>
<p>The result is a safer way to iterate on parsing logic without guessing.</p>
<h2>Walkthrough: parse custom application logs</h2>
<p>We'll start from your Kubernetes stream (<code>logs-kubernetes.containers_logs-default</code>) and create a conditional block that scopes processing to one service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/01-conditional-filter-litellm.png" alt="Conditional block filtering Kubernetes logs for litellm before parsing in Elastic Streams" /></p>
<p>Once the condition is saved, it automatically filters the sample data to the subset of logs that match the condition. This is indicated by the blue highlight in the preview.</p>
<p>Inside that block, we'll add a Grok processor and click <strong>Generate pattern</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/02-generate-pattern-button.png" alt="Generate pattern button in Elastic Streams using AI to process Kubernetes logs" /></p>
<p>This agentic process uses an LLM to generate a Grok pattern for parsing the logs. By default it uses the Elastic Inference Service, but you can configure it to use your own LLM.
Review the generated pattern and accept it once the sample set validates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/03-accept-generated-grok.png" alt="Accepting AI-generated Grok pattern after matching selected Uvicorn logs" /></p>
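<p>For reference, an AI-generated pattern for a Uvicorn-style access log line such as <code>INFO: 10.0.0.12:52735 - &quot;GET /health HTTP/1.1&quot; 200 OK</code> might look similar to the sketch below. The field names and exact pattern here are illustrative assumptions; what the generator produces depends on your actual log samples:</p>
<pre><code class="language-text">%{LOGLEVEL:log.level}:\s+%{IP:client.ip}:%{NUMBER:client.port} - &quot;%{WORD:http.request.method} %{URIPATH:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code} %{GREEDYDATA:http.response.status_text}
</code></pre>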
<h2>Walkthrough: drop noisy postgres-loadgen documents</h2>
<p>Not all logs are important enough to keep around forever. For example, logs from a load-testing tool such as a load generator are not useful for long-term analysis, so let's drop them.</p>
<p>To do this, we'll add a second conditional block for logs you intentionally do not want to index long-term.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/05-preview-selected-postgres-loadgen.png" alt="Selected tab preview of noisy postgres-loadgen documents before drop" /></p>
<p>Add a drop processor inside this block, then validate in the <strong>Dropped</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/07-preview-dropped-tab.png" alt="Dropped tab preview showing noisy Kubernetes logs excluded from indexing" /></p>
<h2>Save safely with live simulation</h2>
<p>One of the most useful parts of Streams is the preview-first workflow. You can inspect matched, parsed, skipped, failed, and dropped samples before making the change live.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/08-save-changes.png" alt="Save changes button after validating processing logic on live samples" /></p>
<h2>YAML mode and the equivalent API request</h2>
<p>The interactive builder works well for most edits, but advanced users can switch to YAML mode for direct control.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/11-yaml-mode.png" alt="Switching from interactive builder to YAML mode in Streams processing" /></p>
<p>You can also open <strong>Equivalent API Request</strong> to copy the payload for automation and Infrastructure as Code workflows.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/12-equivalent-api-request.png" alt="Equivalent API request panel for automating Streams processing" /></p>
<h2>A note on backwards compatibility</h2>
<p>Streams Processing builds on Elasticsearch ingest pipelines, so it works with the same ingestion model teams already use.</p>
<p>When you save processing changes, Streams appends logic through the stream processing pipeline model (for example via <code>@custom</code> conventions used by data streams). That means you can adopt conditionals, parsing, and selective dropping incrementally, without changing your Kubernetes log shippers.</p>
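<p>As a rough sketch of that model, the two steps from this walkthrough could be expressed as a plain ingest pipeline attached through the <code>@custom</code> convention. The pipeline name, condition fields, and Grok pattern below are illustrative assumptions, not the exact output Streams generates:</p>
<pre><code class="language-json">PUT _ingest/pipeline/logs-kubernetes.containers_logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;grok&quot;: {
        &quot;if&quot;: &quot;ctx.kubernetes?.container?.name == 'litellm'&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;patterns&quot;: [&quot;%{LOGLEVEL:log.level}:\\s+%{GREEDYDATA:event.original}&quot;],
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
        &quot;if&quot;: &quot;ctx.kubernetes?.container?.name == 'postgres-loadgen'&quot;
      }
    }
  ]
}
</code></pre>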
<h2>What's next?</h2>
<p>Streams Processing is continually gaining new capabilities. Check out the <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams documentation</a> for the latest updates.</p>
<p>Over the coming months more of this will be automated and moved to the background, reducing the manual effort required to process logs.</p>
<p>Another milestone we're working towards is offering this processing at read time rather than write time. Using ES|QL, this will let you iterate on your parsing logic without committing changes that are hard to revert.</p>
<p>You can also try this out with a free trial on <a href="https://cloud.elastic.co/">Elastic Serverless</a>.</p>
<p>Happy log analytics!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/cover.svg" length="0" type="image/svg+xml"/>
        </item>
        <item>
            <title><![CDATA[Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-agentic-ai-observability-amazon-bedrock-agentcore</link>
            <guid isPermaLink="false">llm-agentic-ai-observability-amazon-bedrock-agentcore</guid>
            <pubDate>Mon, 01 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how to achieve end-to-end observability for Amazon Bedrock AgentCore: from tracking service health and token costs to debugging complex reasoning loops with distributed tracing.]]></description>
            <content:encoded><![CDATA[<h2>Troubleshooting your Agents and Amazon Bedrock AgentCore with Elastic Observability</h2>
<h3>Introduction</h3>
<p>We're excited to introduce Elastic Observability’s Amazon Bedrock AgentCore integration, which allows users to observe <a href="https://aws.amazon.com/bedrock/agentcore/">Amazon Bedrock AgentCore</a> and the agents' LLM interactions end-to-end. Agentic AI represents a fundamental shift in how we build applications. </p>
<p>Unlike standard LLM chatbots that simply generate text, agents can reason, plan, and execute multi-step workflows to complete complex tasks autonomously. Many times these agents run on a platform such as Amazon Bedrock AgentCore, which helps developers build, deploy, and scale agents. Amazon Bedrock AgentCore provides the secure, scalable, and modular infrastructure services (like agent runtime, memory, and identity) necessary for developers to deploy and operate highly capable AI agents built with any framework or model.</p>
<p>Using a platform such as Amazon Bedrock AgentCore is easy, but troubleshooting an agent is far more complex than debugging a standard microservice. Key challenges include:</p>
<ul>
<li>
<p><strong>Non-Deterministic Behavior:</strong> Agents may choose different tools or reasoning paths for the same prompt, making it difficult to reproduce bugs.</p>
</li>
<li>
<p><strong>&quot;Black Box&quot; Execution:</strong> When an agent fails or provides a hallucinated answer, it is often unclear if the issue lies in the LLM's reasoning, the context provided, or a failed tool execution.</p>
</li>
<li>
<p><strong>Cost &amp; Latency Blind Spots:</strong> A single user query can trigger recursive loops or expensive multi-step tool calls, leading to unexpected spikes in token usage and latency.</p>
</li>
</ul>
<p>To effectively observe these systems, you need to correlate signals from two distinct layers:</p>
<ol>
<li>
<p><strong>The Platform Layer (Amazon Bedrock AgentCore):</strong> You need to understand the overall health of the managed service. This includes high-level metrics like invocation counts, latency, throttling, and platform-level errors that affect all agents running in AgentCore.</p>
</li>
<li>
<p><strong>The Application Layer (Your Agentic Logic):</strong> You want to understand the granular &quot;why&quot; behind the behavior. This includes distributed traces, usually with OpenTelemetry, that visualize the full request lifecycle (e.g. waterfall view), identifying exactly which step in the reasoning chain failed or took too long.</p>
</li>
</ol>
<p><strong>Agentic AI Observability in Elastic</strong> provides a unified, end-to-end view of your agentic deployment by combining platform-level insights from Amazon Bedrock AgentCore, through the new <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock_agentcore">Amazon Bedrock AgentCore integration</a>, with deep application-level visibility from OpenTelemetry (OTel) traces, logs, and metrics from the agent. This unified view in Elastic allows you to observe, troubleshoot, and optimize your agentic applications from end to end without switching tools. Additionally, Elastic provides Agent Builder, which allows you to create agents to analyze any of the data from Amazon Bedrock AgentCore and the agents running on it.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-dashboard-runtime-gateway-traces.jpg" alt="Amazon Bedrock AgentCore Dashboards for Runtime and Gateway and APM Tracing" /></p>
<h2>Agentic AI Observability in Elastic</h2>
<p>As mentioned above there are two main parts to end-to-end Agentic AI Observability in Elastic.</p>
<ul>
<li>
<p><strong>Amazon Bedrock AgentCore Platform Observability -</strong> using platform logs and metrics,  Elastic provides comprehensive visibility into the high-level health of the AgentCore service by ingesting AWS vended logs and metrics across four critical components:</p>
<ul>
<li>
<p><strong>Runtime:</strong> Monitor core performance indicators such as agent errors, overall latency, throttle counts, and invocation rates for each endpoint.</p>
</li>
<li>
<p><strong>Gateway:</strong> Gain specific insights into gateway and tool call performance, including invocations, error rates, and latency.</p>
</li>
<li>
<p><strong>Memory:</strong> Track short-term and long-term memory operations, including event creation, retrieval, and listing, alongside performance analysis, errors, and latency metrics.</p>
</li>
<li>
<p><strong>Identity:</strong> Audit security and access health with logs on successful and failed access attempts.</p>
</li>
</ul>
</li>
</ul>
<ul>
<li><strong>Agent Observability with APM, logs and metrics -</strong> To understand <em>how</em> your agent is behaving, Elastic ingests OTel-native traces, metrics and logs from your application running within AgentCore. This allows you to visualize the full execution path, including LLM reasoning steps and tool calls, in a detailed waterfall diagram. </li>
</ul>
<ul>
<li><strong>Agentic AI Analysis</strong> - All of the data from Amazon Bedrock AgentCore and the agent running on it, can be analyzed with <strong>Elastic’s AI driven capabilities</strong>. These include:</li>
</ul>
<ul>
<li>
<p><strong>Elastic AgentCore SRE Agent built on Elastic Agent Builder</strong> - We don't just monitor agents; we provide you with one to assist your team. The <strong>AgentCore SRE Agent</strong> is a specialized assistant built using <strong>Elastic Agent Builder</strong>. It possesses specialized knowledge of AgentCore applications observed in Elastic.</p>
<ul>
<li>
<p><strong>How it helps:</strong> You can ask specific questions regarding your AgentCore environment, such as how to interpret a complex error log or why a specific trace shows latency.</p>
</li>
<li>
<p><strong>Get the Agent:</strong> You can deploy this agent yourself from our <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/elastic_agentcore_sre_agent">GitHub repository</a>.</p>
</li>
</ul>
</li>
<li>
<p><strong>Elastic Observability AI Assistant</strong> - Use natural language anywhere in Elastic’s UI to help you pinpoint issues, analyze something specific, or just learn what the problem is through LLM knowledge base. Additionally, SREs can interpret log messages, errors, metrics patterns, optimize code, write reports, and even identify and execute a runbook, or find a related github issue.</p>
</li>
<li>
<p><strong>Streams - AI-Driven Log Analysis:</strong> When you send AgentCore logs from your instrumented application into Elastic, you can parse and analyze them. Additionally, Streams surfaces <strong>Significant Events</strong> within your log stream, allowing you to focus immediately on what matters most.</p>
</li>
<li>
<p><strong>Dashboards and ES|QL:</strong> Data is only useful if you can act on it. Elastic provides out-of-the-box (OOTB) assets to accelerate your mean time to resolution (MTTR), and ES|QL to help you perform ad-hoc analysis on any signal.</p>
<ul>
<li>
<p><strong>OOTB Dashboards:</strong> Pre-built visualizations based on AgentCore service signals. These dashboards provide an immediate, high-level overview of the usage, health, and performance of your AgentCore runtime, gateway, memory, and identity components.</p>
</li>
<li>
<p><strong>OOTB Alert Templates:</strong> Pre-configured alerts for common agentic issues (e.g., high error rates, latency spikes, or unusual token consumption), allowing you to move from reactive to proactive troubleshooting immediately.</p>
</li>
</ul>
</li>
</ul>
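<p>To illustrate the kind of ad-hoc analysis ES|QL enables, the query below counts runtime user errors per endpoint. The index pattern and field names are illustrative assumptions; check the integration documentation for the exact fields it ships:</p>
<pre><code class="language-esql">FROM logs-aws_bedrock_agentcore.*
| WHERE error.type == &quot;user&quot;
| STATS error_count = COUNT(*) BY endpoint.name
| SORT error_count DESC
</code></pre>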
<h2>Onboarding Amazon Bedrock AgentCore signals into Elastic</h2>
<h3>Amazon Bedrock AgentCore Integration</h3>
<p>To get started with platform-level visibility, you need to enable the <strong>Amazon Bedrock AgentCore</strong> integration in Elastic. This integration automatically collects metrics and logs from your AgentCore runtime, gateway, memory, and identity components via Amazon CloudWatch.</p>
<p><strong>Setup Steps:</strong></p>
<ol>
<li>
<p><strong>Prepare AWS Environment:</strong> Ensure your AgentCore agents are deployed and running and that you have enabled logging on your AgentCore resources in the AWS console.</p>
</li>
<li>
<p><strong>Add the Integration:</strong></p>
<ul>
<li>
<p>In Elastic (Kibana), navigate to <strong>Integrations</strong>.</p>
</li>
<li>
<p>Search for <strong>&quot;Amazon Bedrock AgentCore&quot;</strong>. Select <strong>Add Amazon Bedrock AgentCore</strong>.</p>
</li>
</ul>
</li>
<li>
<p><strong>Configure &amp; Deploy:</strong></p>
<p>Configure Elastic's Amazon Bedrock AgentCore integration to collect CloudWatch metrics from your chosen AWS region at the specified collection interval. Logs will be added soon after the publication of this blog.</p>
</li>
</ol>
<h3>Onboard the Agent with OTel Instrumentation</h3>
<p>The next step is observing the application logic itself. The beauty of Amazon Bedrock AgentCore is that the application runtime often comes pre-instrumented. You simply need to tell it where to send the telemetry data.</p>
<p>For this example, we will use the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/travel_assistant"><strong>Travel Assistant</strong></a> from the Elastic Observability examples.</p>
<p>To instrument this agent, you do not need to modify the source code. Instead, when you invoke the agent using the <code>agentcore</code> CLI, you simply pass your Elastic connection details as environment variables. This redirects the OTel signals (traces, metrics, and logs) directly to the Elastic EDOT collector.</p>
<p><strong>Example Invoke Command:</strong> Run the following command to launch the agent and start streaming telemetry to Elastic:</p>
<pre><code class="language-bash">    agentcore launch \
    --env BEDROCK_MODEL_ID=&quot;us.anthropic.claude-3-5-sonnet-20240620-v1:0&quot; \
    --env OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;REPLACE_WITH_ELASTIC_ENDPOINT&gt;.region.cloud.elastic.co:443&quot; \
    --env OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;REPLACE_WITH_YOUR_API_KEY&gt;&quot; \
    --env OTEL_EXPORTER_OTLP_PROTOCOL=&quot;http/protobuf&quot; \
    --env OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
    --env OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
    --env OTEL_LOGS_EXPORTER=&quot;otlp&quot; \
    --env OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=travel_assistant,service.version=1.0.0&quot; \
    --env AGENT_OBSERVABILITY_ENABLED=&quot;true&quot; \
    --env DISABLE_ADOT_OBSERVABILITY=&quot;true&quot; \
    --env TAVILY_API_KEY=&quot;&lt;REPLACE_WITH_YOUR_TAVILY_KEY&gt;&quot;
</code></pre>
<p><strong>Key Configuration Parameters:</strong></p>
<ul>
<li>
<p><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: Your Elastic OTLP endpoint (ensure port 443 is specified).</p>
</li>
<li>
<p><code>OTEL_EXPORTER_OTLP_HEADERS</code>: The Authorization header containing your Elastic API Key.</p>
</li>
<li>
<p><code>DISABLE_ADOT_OBSERVABILITY=true</code>: This ensures the native AgentCore signals are routed exclusively to your defined endpoint (Elastic) rather than default AWS paths.</p>
</li>
</ul>
<h2>Analyzing Agentic Data in Elastic Observability</h2>
<p>As we walk through the analysis features below, we will use the Travel Assistant agent we instrumented earlier, alongside any other apps you may be running on AgentCore. For the purposes of this example, as a second agent, we will use the <a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/02-use-cases/customer-support-assistant"><strong>Customer Support Assistant</strong></a> from the AWS Labs AgentCore samples.</p>
<h3>Out-of-the-Box (OOTB) Dashboards</h3>
<p>Elastic populates a set of comprehensive dashboards based on Amazon Bedrock AgentCore service logs and metrics. These appear as a unified view with tabs, providing a &quot;single pane of glass&quot; into the operational health of your platform.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-integration-dashboards-runtime-gateway-memory-identity.gif" alt="Amazon Bedrock AgentCore out-of-the-box Dashboards for Runtime, Gateway, Memory and Identity" /></p>
<p>This view is divided into four key zones, each addressing a specific component of AgentCore: Runtime, Gateway, Memory, and Identity. Note that not all agentic applications use all four components. In our example, only the Customer Support Assistant uses all four, whereas the Travel Assistant uses only Runtime.</p>
<p><strong>Runtime Health</strong></p>
<hr />
<p>Visualize agent invocations, session metrics, error trends (system vs. user), and performance stats like latency and throttling, split per endpoint. This dashboard helps you answer questions like:</p>
<ul>
<li>&quot;How are my Travel Assistant agent and Customer Support agent performing in terms of overall traffic and latency, and are there any spikes in errors or throttling?&quot;</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/amazon-bedrock-agentcore-runtime-dashboard.jpg" alt="Amazon Bedrock AgentCore out-of-the-box Dashboard for AgentCore Runtime" /></p>
<p><strong>Gateway Performance</strong></p>
<hr />
<p>Analyze invocations across Lambda and MCP (Model Context Protocol), with detailed breakdowns for tool vs. non-tool calls. The dashboard highlights throttling detection, target execution times, and separates system errors from user errors.</p>
<ul>
<li><em>Question answered:</em> &quot;Are my external integrations (Lambda, MCP) performing efficiently, or are specific tool calls experiencing high latency, throttling, or system-level errors?&quot;</li>
</ul>
<p><strong>Memory Operations</strong></p>
<hr />
<p>Track core operations like event creation, retrieval, and listing, alongside deep dives into long-term memory processing. This includes extraction and consolidation metrics broken down by strategy type, as well as specific monitoring for throttling and system vs. user errors.</p>
<ul>
<li><em>Question answered:</em> &quot;Are failures in memory consolidation strategies or high retrieval latency preventing the agent from effectively recalling user context?&quot;</li>
</ul>
<p><strong>Identity &amp; Access</strong></p>
<hr />
<p>Monitor identity token fetch operations (workload, OAuth, API keys) and real-time authentication success/failure rates. The dashboard breaks down activity by provider and highlights throttling or capacity bottlenecks.</p>
<ul>
<li><em>Question answered:</em> &quot;Are authentication failures or token fetch bottlenecks from specific providers preventing agents from accessing required resources?&quot;</li>
</ul>
<h3>Out-of-the-Box (OOTB) Alert Templates</h3>
<p>Observability isn't just about looking at dashboards; it's about knowing when to act. To move from reactive checking to proactive monitoring, Elastic provides <strong>OOTB Alert Rule Templates</strong> (starting with Elastic version 9.2.1).</p>
<p>These templates eliminate guesswork by pre-selecting the optimal metrics to monitor and applying sensible thresholds. This configuration focuses on high-fidelity alerts for genuine anomalies, helping you catch critical issues early while minimizing alert fatigue.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/elastic-alert-rule-templates-for-amazon-bedrock-agentcore.jpg" alt="Amazon Bedrock AgentCore out-of-the-box Alert rule templates for AgentCore" /></p>
<p><strong>Suggested OOTB Alerts:</strong></p>
<ul>
<li>
<p><strong>Agent Runtime System Errors:</strong> Detects server-side errors (500 Internal Server Error) during agent runtime invocations, indicating infrastructure or service issues with AWS Bedrock AgentCore.</p>
</li>
<li>
<p><strong>Agent Runtime User Errors:</strong> Flags client-side errors (4xx) during agent runtime invocations, including validation failures (400), resource not found (404), access denied (403), and resource conflicts (409). This helps catch misconfigured permissions, invalid input, or missing resources early.</p>
</li>
<li>
<p><strong>Agent Runtime High Latency:</strong> Triggers when the average latency for agent runtime invocations exceeds 10 seconds (10,000ms). Latency measures the time elapsed between receiving a request and sending the final response token.</p>
</li>
</ul>
<h3>APM Tracing</h3>
<p>While logs and metrics tell you <em>that</em> an issue exists, <strong>APM Tracing</strong> tells you exactly <em>where</em> and <em>why</em> it is happening. By ingesting the OpenTelemetry signals from your instrumented agent, Elastic generates a detailed distributed trace (e.g. waterfall view) for every interaction. For further LLM details such as prompts, responses, and token usage, you can explore the APM logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/otel-native-strands-agent-traces-in-elastic-apm.jpg" alt="Amazon Bedrock AgentCore OTel-native distributed tracing waterfall diagram in Elastic APM" /></p>
<p>This allows you to peer inside the &quot;black box&quot; of the agent's execution flow:</p>
<ul>
<li>
<p><strong>Visualize the Chain of Thought:</strong> See the full sequence of events, from the user's initial prompt to the final response, including all intermediate reasoning steps.</p>
</li>
<li>
<p><strong>Pinpoint Tool Failures:</strong> Identify exactly which external tool (e.g., a Lambda function for flight booking or a knowledge base query) failed or timed out.</p>
</li>
<li>
<p><strong>Analyze Latency Contributors:</strong> Distinguish between latency caused by the LLM's generation time versus latency caused by slow downstream API calls.</p>
</li>
<li>
<p><strong>Debug with Context:</strong> Drill down into individual spans to see specific error messages, attributes, and metadata that explain why a particular step failed.</p>
</li>
</ul>
<h3>Conclusion</h3>
<p>As organizations move from experimental chatbots to complex, autonomous agents in production, the need for robust observability has never been greater. Agentic applications introduce new layers of complexity—non-deterministic behaviors, multi-step reasoning loops, and cost implications—that standard monitoring tools simply cannot see.</p>
<p>Elastic Agentic AI Observability for Amazon Bedrock AgentCore bridges this gap. By unifying platform-level health metrics from AgentCore with deep, transaction-level distributed tracing from OpenTelemetry, Elastic gives SREs and developers the complete picture. Whether you are debugging a failed tool call, optimizing latency, or controlling token costs, you have the visibility needed to run agentic AI with confidence.</p>
<p><strong>Complete Visibility: AgentCore + Amazon Bedrock:</strong> For the most comprehensive view, we recommend onboarding Elastic’s <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock"><strong>Amazon Bedrock</strong> integration</a> alongside AgentCore. While the AgentCore integration focuses on the orchestration layer—monitoring agent errors, tool latency, and invocations—the Bedrock integration provides deep visibility into the underlying foundation models themselves. This includes tracking model-specific latency, token usage, full prompts and responses, and even <strong>Guardrails</strong> usage and effectiveness. By combining both, you ensure complete coverage from the high-level agent workflow down to the raw model inference.</p>
<ul>
<li>
<p><strong>Read more:</strong><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock"> Monitor Amazon Bedrock with Elastic</a></p>
</li>
<li>
<p><strong>Read more:</strong><a href="https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails"> Amazon Bedrock Guardrails Observability</a></p>
</li>
</ul>
<p><strong>Get Started Today:</strong> Ready to see your agents in action?</p>
<ul>
<li>
<p><strong>Try it out:</strong> Log in to <a href="https://cloud.elastic.co/login">Elastic Cloud</a> and add the Amazon Bedrock AgentCore integration. Or use <a href="https://aws.amazon.com/marketplace/seller-profile?id=d8f59038-c24c-4a9d-a66d-6711d35d7305">Elastic from AWS Marketplace</a>.</p>
</li>
<li>
<p><strong>Explore the Code:</strong> Check out our GitHub repository for the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/travel_assistant">Travel assistant</a> which you saw in this blog, as well as the <a href="https://github.com/elastic/observability-examples/tree/main/aws/amazon-bedrock-agentcore/elastic_agentcore_sre_agent">AgentCore SRE Agent</a>.</p>
</li>
<li>
<p><strong>Learn More:</strong> Read the <a href="https://www.elastic.co/docs/reference/integrations/aws_bedrock_agentcore">full documentation</a> on setting up integration for Agentic AI Observability for Amazon Bedrock AgentCore.</p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-agentic-ai-observability-amazon-bedrock-agentcore/agentcore-blog.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[LLM observability with Elastic: Taming the LLM with Guardrails for Amazon Bedrock]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails</link>
            <guid isPermaLink="false">llm-observability-amazon-bedrock-guardrails</guid>
            <pubDate>Sun, 02 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic’s enhanced Amazon Bedrock integration for Observability now includes Guardrails monitoring, offering real-time visibility into AI safety mechanisms. Track guardrail performance, usage, and policy interventions with pre-built dashboards. Learn how to set up observability for Guardrails and monitor key signals to strengthen safeguards against hallucinations, harmful content, and policy violations.]]></description>
<content:encoded><![CDATA[<p>In a previous <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a>, we showed you how to set up observability for models hosted on Amazon Bedrock using Elastic’s integration. You can now effortlessly enable observability for your Amazon Bedrock guardrails using the enhanced <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Elastic Amazon Bedrock integration</a>. If you previously onboarded the Amazon Bedrock integration, just upgrade it and you will automatically get all guardrails-related updates. The enhanced integration provides a single pane of glass dashboard with two panels: one focused on overall Bedrock visualizations and another dedicated to Guardrails. You can now ingest and visualize metrics and logs specific to Guardrails, such as guardrail invocation count, invocation latency, text unit utilization, guardrail policy types associated with interventions, and more.</p>
<p>In this blog, we will show you how to set up observability for Amazon Bedrock Guardrails, how to make use of the enhanced dashboards, and which key signals to alert on for effective observability coverage of your Bedrock guardrails.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, please make sure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co/">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.16.2 or higher. Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. See <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">details in our documentation</a>.</li>
</ul>
<h2>Steps to create a guardrail for Amazon Bedrock</h2>
<p>Before you set up observability for the guardrails, ensure that you have configured guardrails for your model. Follow the steps below to create an Amazon Bedrock guardrail:</p>
<ol>
<li><strong>Access the Amazon Bedrock Console</strong>
<ul>
<li>Sign in to the AWS Management Console with appropriate permissions and navigate to the Amazon Bedrock console.</li>
</ul>
</li>
<li><strong>Navigate to Guardrails</strong>
<ul>
<li>From the left-hand menu, select <strong>Guardrails</strong>.</li>
</ul>
</li>
<li><strong>Create a New Guardrail</strong>
<ul>
<li>Select <strong>Create guardrail</strong>.</li>
<li>Provide a descriptive name, an optional brief description, and specify a message to display when the guardrail blocks the user prompt.
<ul>
<li>Example: <em>Sorry, I am not configured to answer such questions. Kindly ask a different question.</em></li>
</ul>
</li>
</ul>
</li>
<li><strong>Configure Guardrail Policies</strong>
<ul>
<li><strong>Content Filters</strong>: Adjust settings to block harmful content and prompt attacks.</li>
<li><strong>Denied Topics</strong>: Specify topics to block.</li>
<li><strong>Word Filters</strong>: Define specific words or phrases to block.</li>
<li><strong>Sensitive Information Filters</strong>: Set up filters to detect and remove sensitive information.</li>
<li><strong>Contextual Grounding</strong>:
<ul>
<li>Configure the <strong>Grounding Threshold</strong> to set the minimum confidence level for factual accuracy.</li>
<li>Set the <strong>Relevance Threshold</strong> to ensure responses align with user queries.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Review and Create</strong>
<ul>
<li>Review your settings and select <strong>Create</strong> to finalize the guardrail.</li>
</ul>
</li>
<li><strong>Create a Guardrail Version</strong>
<ul>
<li>In the <strong>Version</strong> section, select <strong>Create</strong>.</li>
<li>Optionally add a description, then select <strong>Create Version</strong>.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-policy-configuration.png" alt="Amazon Bedrock Guardrails Policy Configurations" /></p>
<p>After creating a version of your guardrail, it's important to note down the <strong>Guardrail ID</strong> and the <strong>Guardrail Version Name</strong>. These identifiers are essential when integrating the guardrail into your application, as you'll need to specify them during guardrail invocation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-creation-confirmations.png" alt="Amazon Bedrock Guardrails Policy Version" /></p>
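<p>The console steps above can also be scripted. The sketch below is a hedged example built around the boto3 Bedrock control-plane client: <code>create_guardrail</code> and <code>create_guardrail_version</code> are real boto3 APIs, but the guardrail name, description, and thresholds used here are illustrative placeholders.</p>

```python
def build_guardrail_request(name, blocked_message,
                            grounding_threshold=0.99, relevance_threshold=0.99):
    # Assemble kwargs for bedrock.create_guardrail with a contextual
    # grounding policy; the thresholds here are illustrative.
    return {
        "name": name,
        "description": "Guardrail with contextual grounding checks",
        "blockedInputMessaging": blocked_message,
        "blockedOutputsMessaging": blocked_message,
        "contextualGroundingPolicyConfig": {
            "filtersConfig": [
                {"type": "GROUNDING", "threshold": grounding_threshold},
                {"type": "RELEVANCE", "threshold": relevance_threshold},
            ]
        },
    }


def create_guardrail_with_version(bedrock_client, request):
    # bedrock_client is a boto3.client("bedrock") instance. Create the
    # guardrail, publish a first version, and return the two identifiers
    # needed when invoking a model with the guardrail applied.
    resp = bedrock_client.create_guardrail(**request)
    version = bedrock_client.create_guardrail_version(
        guardrailIdentifier=resp["guardrailId"])
    return resp["guardrailId"], version["version"]
```

<p>Passing <code>boto3.client(&quot;bedrock&quot;)</code> and the built request to <code>create_guardrail_with_version</code> returns the Guardrail ID and version name you would otherwise note down from the console.</p>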
<h2>Example code to integrate with Amazon Bedrock guardrails</h2>
<p>Integrating Amazon Bedrock into your Python application enables advanced language model interactions with customizable safety measures. By configuring guardrails, you can ensure that the model adheres to predefined policies, preventing it from generating inappropriate or sensitive content.</p>
<p>The following code demonstrates how to invoke Amazon Bedrock with guardrails to enforce contextual grounding in AI-generated responses. It sets up a Bedrock runtime client using AWS credentials, defines a reference grounding statement, and uses the Bedrock Converse API to process user queries with contextual constraints. The <strong>converse_with_guardrails</strong> function sends a user query alongside a predefined grounding reference, ensuring that responses align with the provided knowledge source.</p>
<h3>Setting Up Environment Variables</h3>
<p>Before running the script, configure the required <strong>AWS credentials</strong> and <strong>guardrail settings</strong> as environment variables. These variables allow the script to authenticate with Amazon Bedrock and apply the necessary guardrails for safe and controlled AI interactions.</p>
<p>Create a <strong>.env</strong> file in the same directory as your script and add:</p>
<pre><code class="language-bash">AWS_ACCESS_KEY=&quot;your-access-key&quot;
AWS_SECRET_KEY=&quot;your-secret-key&quot;
AWS_REGION=&quot;your-aws-region&quot;
GUARDRAIL_ID=&quot;your-guardrail-id&quot;
GUARDRAIL_VERSION=&quot;your-guardrail-version&quot;
CHAT_MODEL=&quot;your-model-id&quot;
</code></pre>
<h3>Create a Python script and run</h3>
<p>Create a Python script using the code below and execute it to interact with the Amazon Bedrock Guardrails you set up.</p>
<pre><code class="language-python">import os
import boto3
from dotenv import load_dotenv
import json
from botocore.exceptions import ClientError

# Load environment variables
load_dotenv()

# Function to check for hallucinations using contextual grounding
def check_hallucination(response):
    output_assessments = response.get(&quot;trace&quot;, {}).get(&quot;guardrail&quot;, {}).get(&quot;outputAssessments&quot;, {})

    # Default to passing scores so a missing contextual policy is not flagged
    grounding = relevance = 0.0
    grounding_threshold = relevance_threshold = 0.0

    # Iterate over all assessments and collect the contextual grounding filters
    for assessments in output_assessments.values():
        for assessment in assessments:
            contextual_policy = assessment.get(&quot;contextualGroundingPolicy&quot;, {})
            for filter_result in contextual_policy.get(&quot;filters&quot;, []):
                filter_type = filter_result.get(&quot;type&quot;)
                if filter_type == &quot;RELEVANCE&quot;:
                    relevance = filter_result.get(&quot;score&quot;, 0)
                    relevance_threshold = filter_result.get(&quot;threshold&quot;, 0)
                elif filter_type == &quot;GROUNDING&quot;:
                    grounding = filter_result.get(&quot;score&quot;, 0)
                    grounding_threshold = filter_result.get(&quot;threshold&quot;, 0)

    # Hallucination detected when either score falls below its threshold
    is_hallucination = relevance &lt; relevance_threshold or grounding &lt; grounding_threshold
    return is_hallucination, relevance, grounding, relevance_threshold, grounding_threshold

def converse_with_guardrails(bedrock_client, messages, grounding_reference):
   message = [
       {
           &quot;role&quot;: &quot;user&quot;,
           &quot;content&quot;: [
               {
                   &quot;guardContent&quot;: {
                       &quot;text&quot;: {
                           &quot;text&quot;: grounding_reference,
                           &quot;qualifiers&quot;: [&quot;grounding_source&quot;],
                       }
                   }
               },
               {
                   &quot;guardContent&quot;: {
                       &quot;text&quot;: {
                           &quot;text&quot;: messages,
                           &quot;qualifiers&quot;: [&quot;query&quot;],
                       }
                   }
               },
           ],
       }
   ]
   converse_config = {
       &quot;modelId&quot;: os.getenv('CHAT_MODEL'),
       &quot;messages&quot;: message,
       &quot;guardrailConfig&quot;: {
           &quot;guardrailIdentifier&quot;: os.getenv(&quot;GUARDRAIL_ID&quot;),
           &quot;guardrailVersion&quot;: os.getenv(&quot;GUARDRAIL_VERSION&quot;),
           &quot;trace&quot;: &quot;enabled&quot;
       },
       &quot;inferenceConfig&quot;: {
           &quot;temperature&quot;: 0.5       
       },
   }
   try:
       response = bedrock_client.converse(**converse_config)
       return response
   except ClientError as e:
       error_message = e.response['Error']['Message']
       print(f&quot;An error occurred: {error_message}&quot;)
       print(&quot;Converse config:&quot;)
       print(json.dumps(converse_config, indent=2))
       return None
  
def pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold):
   print(&quot;\n&quot; + &quot;=&quot;*60)
   print(&quot; Guardrail Assessment&quot;)
   print(&quot;=&quot;*60)
   # Extract response message safely
   response_text = response.get(&quot;output&quot;, {}).get(&quot;message&quot;, {}).get(&quot;content&quot;, [{}])[0].get(&quot;text&quot;, &quot;N/A&quot;)
   print(&quot;\n **Model Response:**&quot;)
   print(f&quot;   {response_text}&quot;)
   print(&quot;\n **Guardrail Assessment:**&quot;)
   print(f&quot;   Is Hallucination : {is_hallucination}&quot;)
   print(&quot;\n **Contextual Grounding Policy Scores:**&quot;)
   print(f&quot;   - Relevance Score : {relevance:.2f} (Threshold: {relevance_threshold:.2f})&quot;)
   print(f&quot;   - Grounding Score : {grounding:.2f} (Threshold: {grounding_threshold:.2f})&quot;)
   print(&quot;\n&quot; + &quot;=&quot;*60 + &quot;\n&quot;)
  
def main():
   bs = boto3.Session(
       aws_access_key_id=os.getenv('AWS_ACCESS_KEY'),
       aws_secret_access_key=os.getenv('AWS_SECRET_KEY'),
       region_name=os.getenv('AWS_REGION')
   )

   # Initialize Bedrock client
   bedrock_client = bs.client(&quot;bedrock-runtime&quot;)

   # Grounding reference
   grounding_reference = &quot;The Wright brothers made the first powered aircraft flight on December 17, 1903.&quot;

   # User query
   user_query = &quot;Who were the first to fly an airplane?&quot;
  
   # Get model response
   response = converse_with_guardrails(bedrock_client, user_query, grounding_reference)
   if response is None:
       return  # invocation failed; the error was already printed

   # Check for hallucinations
   is_hallucination, relevance, grounding, relevance_threshold, grounding_threshold = check_hallucination(response)

   # Print the results
   pretty_print_response(response, is_hallucination, relevance, relevance_threshold, grounding, grounding_threshold)


if __name__ == &quot;__main__&quot;:
   main()
</code></pre>
<h3>Identifying Hallucinations with Contextual Grounding</h3>
<p>The contextual grounding feature proved effective in identifying potential hallucinations by comparing model responses against reference information. Relevance and grounding scores provided quantitative measures to assess the accuracy of model outputs.</p>
<p>The output from running the Python script, shown below, demonstrates how the <strong>Grounding Score</strong> helps detect hallucinations:</p>
<pre><code>============================================================
 Guardrail Assessment
============================================================

 **Model Response:**
   Sorry, I am not configured to answer such questions. Kindly ask a different question.

 **Guardrail Assessment:**
   Is Hallucination : True

 **Contextual Grounding Policy Scores:**
   - Relevance Score : 1.00 (Threshold: 0.99)
   - Grounding Score : 0.03 (Threshold: 0.99)

============================================================
</code></pre>
<p>Here, the <strong>Grounding Score</strong> of <strong>0.03</strong> is significantly lower than the configured threshold of <strong>0.99</strong>, indicating that the response lacks factual accuracy. Since the score falls below the threshold, the system flags the response as a hallucination, highlighting the need to monitor guardrail outputs to ensure AI safety.</p>
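<p>The decision rule at work here can be stated in one line: a response passes only when every contextual grounding filter's confidence score meets its configured threshold. A minimal sketch of that rule (a simplification for illustration, not the guardrail's internal logic):</p>

```python
def passes_grounding(filters):
    # filters: list of {"type", "score", "threshold"} dicts, as reported
    # in a guardrail trace. The response is considered valid only if
    # every filter's score is at or above its configured threshold.
    return all(f["score"] >= f["threshold"] for f in filters)
```

<p>For the trace above, the relevance filter passes (1.00 &gt;= 0.99) but the grounding filter fails (0.03 &lt; 0.99), so the response is flagged.</p>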
<h2>Configuring Amazon Bedrock Guardrails Metrics &amp; Logs Collection</h2>
<p>Elastic makes it easy to collect both logs and metrics from Amazon Bedrock Guardrails using the Amazon Bedrock integration. By default, Elastic provides a curated set of logs and metrics, but you can customize the configuration based on your needs. The integration supports Amazon S3 and Amazon CloudWatch Logs for log collection, along with metrics collection from your chosen AWS region at a specified interval.</p>
<p>Follow these steps to enable the collection of metrics and logs:</p>
<ol>
<li>
<p><strong>Navigate to Amazon Bedrock Settings</strong> - In the AWS Console, go to <strong>Amazon Bedrock</strong> and open the <strong>Settings</strong> section.</p>
</li>
<li>
<p><strong>Choose Logging Destination</strong> - Select whether to send logs to <strong>Amazon S3</strong> or <strong>Amazon CloudWatch Logs</strong>.</p>
</li>
<li>
<p><strong>Provide Required Details</strong></p>
<ul>
<li><strong>If using Amazon S3</strong>, logs can be collected from objects referenced in <strong>S3 notification events</strong> (read from an SQS queue) or by <strong>direct polling</strong> from an S3 bucket.</li>
<li><strong>If using CloudWatch Logs</strong>, you need to create a <strong>CloudWatch log group</strong> and note its <strong>ARN</strong>, as this will be required for configuring both <strong>Amazon Bedrock</strong> and the <strong>Elastic Amazon Bedrock integration</strong>.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-settings-configuraiton.png" alt="Amazon Bedrock settings" /></p>
<ol start="4">
<li><strong>Configure Elastic's Amazon Bedrock integration</strong> - In <strong>Elastic</strong>, set up the <strong>Amazon Bedrock integration</strong>, ensuring the logging destination matches the one configured in <strong>Amazon Bedrock</strong>. Logs from your selected source and metrics from your AWS region will be collected automatically.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-integrations-logs-config.png" alt="Amazon Bedrock integration logs configuration" /></p>
<ol start="5">
<li><strong>Accept Defaults or Customize Settings</strong> - Elastic provides a default configuration for logs and metrics collection. You can accept these defaults or adjust settings such as collection intervals to better fit your needs.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-integrations-metrics-config.png" alt="Amazon Bedrock integration guardrails metrics configuration" /></p>
<h2>Understanding the pre-configured dashboard for Amazon Bedrock Guardrails</h2>
<p>You can access the Amazon Bedrock Guardrails dashboard using either of the following methods:</p>
<ol>
<li>
<p><strong>Navigate to the Dashboard Menu</strong>  - Select the <strong>Dashboard</strong> menu option in <strong>Elastic</strong> and search for <strong>[Amazon Bedrock] Guardrails</strong> to open the dashboard.</p>
</li>
<li>
<p><strong>Navigate to the Integrations Menu</strong>  - Open the <strong>Integrations</strong> menu in <strong>Elastic</strong>, select <strong>Amazon Bedrock</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Amazon Bedrock] Guardrails</strong> from the dashboard assets.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-guardrails-overview-dashboard.png" alt="Amazon Bedrock settings" /></p>
<p>The Amazon Bedrock Guardrails dashboard in the Elastic integration provides insights into guardrail performance, tracking total invocations, API latency, text unit usage, and intervention rates. It analyzes policy-based interventions, highlighting trends, text consumption, and frequently triggered policies. The dashboard also showcases instances where guardrails modified or blocked responses and offers a detailed breakdown of invocations by policy and content source.</p>
<h3>Guardrail invocation overview</h3>
<p>This dashboard section provides a comprehensive summary of key metrics related to guardrail performance and usage:</p>
<ul>
<li><strong>Total guardrails API invocations</strong>: Displays the overall count of times guardrails were invoked.</li>
<li><strong>Average Guardrails API invocation latency</strong>: Shows the average response time for guardrail API calls, offering insights into system performance.</li>
<li><strong>Total text unit utilization</strong>: Indicates the volume of text processed during guardrail invocations. For text unit pricing, refer to the Amazon Bedrock pricing page.</li>
<li><strong>Invocations - with and without guardrail interventions</strong>: A pie chart representation showing the distribution of LLM invocations based on guardrail activity. It displays the count of invocations where no guardrail interventions occurred, those where guardrails intervened and detected policy violations, and those where guardrails intervened but found no violations.</li>
</ul>
<p>These metrics help users evaluate guardrail effectiveness, track intervention patterns, and optimize configurations to ensure policy enforcement while maintaining system performance.</p>
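<p>Beyond the dashboard, the same signals can be queried ad hoc. The ES|QL sketch below is an assumption: the <code>metrics-aws_bedrock*</code> index pattern follows the integration's data stream naming, and the latency field name is taken from the alerting example later in this post; verify both against your deployment.</p>

```esql
FROM metrics-aws_bedrock*
| WHERE aws_bedrock.guardrails.invocation_latency IS NOT NULL
| STATS invocations = COUNT(*),
        avg_latency = AVG(aws_bedrock.guardrails.invocation_latency)
```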
<h3>Guardrail policy types for interventions</h3>
<p>This section provides a comprehensive view of guardrail policy interventions and their impact:</p>
<ul>
<li><strong>Interventions by Policy Type</strong>: Bar charts display the number of interventions applied to user inputs and model outputs, categorized by policy type (e.g., Contextual Grounding Policy, Word Policy, Content Policy, Sensitive Information Policy, Topic Policy).</li>
<li><strong>Text Unit Utilization by Policy Type</strong>: Panels highlight the text units consumed by various policy interventions, separately for user inputs and model outputs.</li>
<li><strong>Policy Usage Trends</strong>: A word cloud visualization reveals the most frequently applied policy types, offering insights into intervention patterns.</li>
</ul>
<p>By analyzing intervention counts, text unit usage, and policy trends, users can identify frequently triggered policies, optimize guardrail settings, and ensure LLM interactions align with compliance and safety requirements.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/amazon-bedrock-guardrails-overview.png" alt="Amazon Bedrock Guardrails dashboard overview and policy types sections" /></p>
<h3>Prompt and response where guardrails intervened</h3>
<p>This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding guardrail response. The text panel presents the prompt alongside the model's response after applying guardrail interventions. These interventions occur when input evaluation or model responses violate configured policies, leading to blocked or masked outputs.</p>
<p>The section also includes additional details to enhance visibility into how guardrails operate. It indicates whether a violation was detected, along with the violation type (e.g., <strong>GROUNDING</strong>, <strong>RELEVANCE</strong>) and the action taken (<strong>BLOCKED</strong>, <strong>NONE</strong>). For contextual grounding, the dashboard also shows the filter threshold, which defines the minimum confidence level required for a response to be considered valid, and the <strong>confidence score</strong>, which reflects how well the response aligns with the expected criteria.</p>
<p>By analyzing violations, actions taken, and confidence scores, users can adjust guardrail thresholds to balance blocking unsafe responses and allowing valid ones, ensuring optimal accuracy and compliance. This process is particularly crucial for detecting and mitigating hallucinations—instances where models generate information not grounded in source data. Implementing contextual grounding checks enables the identification of such ungrounded or irrelevant content, enhancing the reliability of applications like retrieval-augmented generation (RAG).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-intervened-logs.png" alt="Amazon Bedrock Guardrails logs where guardrails intervened" /></p>
<h3>Guardrail invocation by guardrail policy</h3>
<p>This section offers insights into the number of Guardrails API invocations, the overall latency, and the total text units, categorized by guardrail policy (identified by guardrail ARN) and policy version.</p>
<h3>Guardrail invocation by content source (Input &amp; Output)</h3>
<p>This section provides a detailed overview of critical metrics related to guardrail performance and usage. It includes the total number of guardrail invocations, the count of intervention invocations where policies were applied, the volume of text units consumed during these interventions for both user inputs and model outputs and the average guardrail API invocation latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/guardrails-invocationby-policy-contentsource.png" alt="Amazon Bedrock Guardrails invocation by policy and content source" /></p>
<p>These insights help users understand how guardrails operate across different policies and content sources. By analyzing invocation counts, latency, and text unit consumption, users can assess policy effectiveness, track intervention patterns, and optimize configurations. Evaluating how guardrails interact with user inputs and model outputs ensures consistent enforcement, helping refine thresholds and improve compliance strategies.</p>
<h2>Configure SLOs and Alerts</h2>
<p>To create an SLO for monitoring <strong>contextual grounding accuracy</strong>, define a custom query SLI where <strong>good events</strong> are model responses that meet contextual grounding criteria, ensuring factual accuracy and alignment with the provided reference.</p>
<p>A suitable query for tracking good events is:</p>
<pre><code>gen_ai.prompt : &quot;*qualifiers[\\\&quot;grounding_source\\\&quot;]*&quot; and 
(gen_ai.compliance.violation_detected : false or 
not gen_ai.compliance.violation_detected : *)
</code></pre>
<p>The total query, which considers all relevant interactions that include a contextual grounding check, is:</p>
<pre><code>gen_ai.prompt : &quot;*qualifiers[\\\&quot;grounding_source\\\&quot;]*&quot;
</code></pre>
<p>Set an <strong>SLO target of 99.5%</strong>, ensuring that the vast majority of responses remain factually grounded. This helps detect hallucinations and misaligned outputs in real-time. By continuously monitoring contextual grounding accuracy, you can proactively address inconsistencies, retrain models, or refine RAG pipelines before inaccuracies impact end users.</p>
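<p>If you manage SLOs as code, the same definition can be expressed as a payload for Kibana's SLO API (<code>POST /api/observability/slos</code>). This is a hedged sketch: the index pattern is an assumption based on the integration's log data stream, the KQL filters are simplified wildcard variants of the queries above, and the exact schema should be verified against your stack version.</p>

```json
{
  "name": "Contextual grounding accuracy",
  "description": "Model responses that pass the contextual grounding check",
  "indicator": {
    "type": "sli.kql.custom",
    "params": {
      "index": "logs-aws_bedrock.invocation-*",
      "good": "gen_ai.prompt : *grounding_source* and (gen_ai.compliance.violation_detected : false or not gen_ai.compliance.violation_detected : *)",
      "total": "gen_ai.prompt : *grounding_source*",
      "timestampField": "@timestamp"
    }
  },
  "timeWindow": { "duration": "30d", "type": "rolling" },
  "budgetingMethod": "occurrences",
  "objective": { "target": 0.995 }
}
```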
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/slo-configurations.png" alt="SLO settings for Guardrails metrics" /></p>
<p>Elastic's alerting capabilities enable proactive monitoring of key performance metrics. For instance, by setting up an alert on the <strong>average aws_bedrock.guardrails.invocation_latency</strong> with a <strong>500ms</strong> threshold, you can promptly identify and address performance bottlenecks, ensuring that policy enforcement remains efficient without causing unexpected delays.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/alert-configurations.png" alt="Alert settings for Guardrails metrics" /></p>
<h2>Conclusion</h2>
<p>The Elastic Amazon Bedrock integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Amazon Bedrock including Guardrails. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.</p>
<p>If you haven’t already done so, read our previous <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a> on what you can do with the Amazon Bedrock integration, set up guardrails for your Bedrock models, and enable the <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Bedrock integration</a> to start observing your Bedrock models and guardrails today!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-amazon-bedrock-guardrails/llm-observability-aws-bedrock-illustration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with the new Amazon Bedrock Integration in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock</link>
            <guid isPermaLink="false">llm-observability-aws-bedrock</guid>
            <pubDate>Mon, 25 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's new Amazon Bedrock integration for Observability provides comprehensive insights into Amazon Bedrock LLM performance and usage. Learn about how LLM based metric and log collection in real-time with pre-built dashboards can effectively monitor and resolve LLM invocation errors and performance challenges.]]></description>
            <content:encoded><![CDATA[<p>As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>, while minimizing downtime and keeping costs in check.</p>
<p>Elastic is expanding support for LLM Observability with Elastic Observability's new <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock">Amazon Bedrock integration</a>. This new observability integration provides you with comprehensive visibility into the performance and usage of foundation models from Amazon and other leading AI companies, available through Amazon Bedrock. The integration offers an out-of-the-box experience by simplifying the collection of Amazon Bedrock metrics and logs, making it easier to gain actionable insights and effectively manage your models. It is simple to set up and comes with pre-built dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that use Amazon Bedrock.</p>
<p>This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, please make sure you have:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack in AWS (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>). Ensure you are using version 8.13 or higher.</li>
<li>An AWS account with permissions to pull the necessary data from AWS. <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">See details in our documentation</a>.</li>
</ul>
<h2>Configuring Amazon Bedrock Logs Collection</h2>
<p>To collect Amazon Bedrock logs, you can choose from the following options:</p>
<ol>
<li>Amazon Simple Storage Service (Amazon S3) bucket</li>
<li>Amazon CloudWatch logs</li>
</ol>
<p><strong>S3 Bucket Logs Collection</strong>: When collecting logs from the Amazon S3 bucket, you can retrieve logs from Amazon S3 objects pointed to by Amazon S3 notification events, which are read from an SQS queue, or by directly polling a list of Amazon S3 objects in an Amazon S3 bucket. Refer to Elastic’s <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Custom AWS Logs</a> integration for more details.</p>
<p><strong>CloudWatch Logs Collection</strong>: In this option, you will need to create a <a href="https://console.aws.amazon.com/cloudwatch/">CloudWatch log group</a>. After creating the log group, be sure to note down the ARN of the newly created log group, as you will need it for the Amazon Bedrock settings configuration and Amazon Bedrock integration configuration for logs.</p>
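<p>Creating the log group and retrieving its ARN can also be scripted. In the sketch below, <code>create_log_group</code> and <code>describe_log_groups</code> are real CloudWatch Logs APIs in boto3, but the default log group name is an illustrative placeholder; pass in a <code>boto3.client(&quot;logs&quot;)</code> instance.</p>

```python
def ensure_log_group_arn(logs_client, name="/aws/bedrock/modelinvocations"):
    # logs_client is a boto3.client("logs") instance; the group name is a
    # placeholder. Create the log group if it does not already exist,
    # then look up and return its ARN.
    try:
        logs_client.create_log_group(logGroupName=name)
    except logs_client.exceptions.ResourceAlreadyExistsException:
        pass  # already present; reuse it
    groups = logs_client.describe_log_groups(logGroupNamePrefix=name)["logGroups"]
    return next(g["arn"] for g in groups if g["logGroupName"] == name)
```

<p>The returned ARN is the value you would paste into both the Amazon Bedrock settings and the Elastic integration configuration.</p>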
<p>Configure the Amazon Bedrock CloudWatch logs with the Log group ARN to start collecting CloudWatch logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/cloudwatch-logs-configuration.png" alt="" /></p>
<p>Visit the <a href="https://aws.amazon.com/console/">AWS Console</a>, navigate to the &quot;Settings&quot; section under <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>, and select your preferred method of collecting logs. Based on the Logging Destination you select in the Amazon Bedrock settings, you will need to enter either the Amazon S3 location or the CloudWatch log group ARN.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-logs-configuration.png" alt="" /></p>
<h2>Configuring Amazon Bedrock Metrics Collection</h2>
<p>Configure Elastic's Amazon Bedrock integration to collect Amazon Bedrock metrics from your chosen AWS region at the specified collection interval.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/cloudwatch-metrics-configuration.png" alt="" /></p>
<h2>Maximize Visibility with Out-of-the-Box Dashboards</h2>
<p>The Amazon Bedrock integration offers rich out-of-the-box visibility into the performance and usage of models in Amazon Bedrock, including text and image models. The <strong>Amazon Bedrock Overview</strong> dashboard provides a summarized view of invocations, errors, and latency across various models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-metric-summary.png" alt="" /></p>
<p>The <strong>Text / Chat metrics</strong> section in the <strong>Amazon Bedrock Overview</strong> dashboard provides insights into token usage for Text models in Amazon Bedrock. This includes use cases such as text content generation, summarization, translation, code generation, question answering, and sentiment analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-text-metrics.png" alt="" /></p>
<p>The <strong>Image metrics</strong> section in the <strong>Amazon Bedrock Overview</strong> dashboard offers valuable insights into the usage of Image models in Amazon Bedrock.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-image-metrics.png" alt="" /></p>
<p>The <strong>Logs</strong> section of the <strong>Amazon Bedrock Overview</strong> dashboard in Elastic provides detailed insights into the usage and performance of LLM requests. It enables you to monitor key details such as model name, version, LLM prompt and response, usage tokens, request size, completion tokens, response size, and any error codes tied to specific LLM requests.</p>
<p>The detailed logs provide full visibility into raw model interactions, capturing both the inputs (prompts) and the outputs (responses) generated by the models. This transparency enables you to analyze and optimize how your LLM handles different requests, allowing for more precise fine-tuning of both the prompt structure and the resulting model responses. By closely monitoring these interactions, you can refine prompt strategies and enhance the quality and reliability of model outputs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-logs-details.png" alt="" /></p>
<p>The <strong>Amazon Bedrock Overview</strong> dashboard also provides a comprehensive view of the initial and final response times. It includes a percentage comparison graph that highlights the performance differences between these response stages, enabling you to quickly identify efficiency improvements or potential bottlenecks in your LLM interactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-dashboard-performance.png" alt="" /></p>
<h2>Creating Alerts and SLOs to Monitor Amazon Bedrock</h2>
<p>As with any Elastic integration, Amazon Bedrock <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock#collecting-bedrock-model-invocation-logs-from-s3-bucket">logs</a> and <a href="https://www.elastic.co/docs/current/integrations/aws_bedrock#metrics">metrics</a> are fully integrated into Elastic Observability, allowing you to leverage features like SLOs, alerting, custom dashboards, and detailed logs exploration.</p>
<p>To create an alert, for example to monitor LLM invocation latency in Amazon Bedrock, you can apply a Custom Threshold rule on the Amazon Bedrock datastream. Set the rule to trigger an alert when the LLM invocation latency exceeds a defined threshold. This ensures proactive monitoring of model performance, allowing you to detect and address latency issues before they impact the user experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-alert-invocation-latency.png" alt="" /></p>
<p>When a violation occurs, the Alert Details view linked in the notification provides detailed context, including when the issue began, its current status, and any history of similar violations. This rich information enables rapid triaging, investigation, and root cause analysis to resolve issues efficiently.</p>
<p>Similarly, to create an SLO monitoring Amazon Bedrock invocation performance, you can define a custom query SLI where good events are Amazon Bedrock invocations that complete without client or server errors and with latency under 10 seconds. Set an appropriate SLO target, such as 99%. This helps you identify errors and latency issues in applications using LLMs, so you can take timely corrective action before they affect the overall user experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-slo-configuration.png" alt="" /></p>
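<p>As a concrete sketch of this SLI definition: given a set of invocation records, a good event is one with no client or server errors and latency under 10 seconds, and the SLI is the fraction of good events. The field names below (<code>client_errors</code>, <code>server_errors</code>, <code>invocation_latency_ms</code>) are illustrative assumptions, not the integration's actual field names:</p>

```python
def is_good_event(invocation: dict, latency_threshold_ms: float = 10_000) -> bool:
    """A 'good' Bedrock invocation: no client or server error,
    and invocation latency under the threshold (10 s here)."""
    return (
        invocation.get("client_errors", 0) == 0
        and invocation.get("server_errors", 0) == 0
        and invocation.get("invocation_latency_ms", 0) < latency_threshold_ms
    )

def sli(invocations: list[dict]) -> float:
    """Observed SLI: the fraction of good events."""
    if not invocations:
        return 1.0
    return sum(is_good_event(i) for i in invocations) / len(invocations)
```

<p>With a 99% SLO target, the error budget is the 1% of invocations allowed to be slow or failed over the SLO window.</p>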
<p>The image below highlights the SLOs, SLIs, and the remaining error budget for Amazon Bedrock models. The observed violations are the result of deliberately crafted long text-generation prompts, which led to extended response times. This example demonstrates how the system tracks performance against defined targets, helping you quickly identify latency issues and performance bottlenecks. By monitoring these metrics, you gain valuable insights for proactive issue triaging, allowing for timely corrective actions and an improved user experience for applications using LLMs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/aws-bedrock-slo-rundata.png" alt="" /></p>
<h2>Try it out today</h2>
<p>The Amazon Bedrock playgrounds provide a console environment to experiment with running inference on different models and configurations before deciding to use them in an application. Start your own 7-day free trial by signing up via AWS Marketplace and quickly spin up a deployment in minutes on any of the Elastic Cloud regions on AWS around the world.</p>
<p>Deploy a cluster on our <a href="https://www.elastic.co/cloud/elasticsearch-service">Elasticsearch Service</a>, <a href="https://www.elastic.co/downloads/">download</a> the Elasticsearch stack, or run <a href="https://aws.amazon.com/marketplace/seller-profile?id=d8f59038-c24c-4a9d-a66d-6711d35d7305">Elastic from AWS Marketplace</a>, then spin up the new technical preview of the Amazon Bedrock integration, open the curated dashboards in Kibana, and start monitoring your Amazon Bedrock service!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-aws-bedrock/LLM-observability-AWS-Bedrock.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic’s Azure AI Foundry Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-ai-foundry</link>
            <guid isPermaLink="false">llm-observability-azure-ai-foundry</guid>
            <pubDate>Fri, 25 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Gain comprehensive visibility into your generative AI workloads on Azure AI Foundry. Monitor token usage, latency, and cost, while leveraging built-in content filters to ensure safe and compliant application behavior—all with out-of-the-box observability powered by Elastic.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>As organizations increasingly adopt LLMs for AI-powered applications such as content creation, Retrieval-Augmented Generation (RAG), and data analysis, SREs and developers face new challenges. Tasks like monitoring workflows, analyzing input and output, managing query latency, and controlling costs become critical. LLM Observability helps address these issues by providing clear insights into how these models perform, allowing teams to quickly identify bottlenecks, optimize configurations, and improve reliability. With better observability, SREs can confidently scale LLM applications, especially on platforms like Azure AI Foundry, while minimizing downtime and keeping costs in check.</p>
<p>Elastic is expanding support for LLM Observability with Elastic Observability's new Azure AI Foundry integration, now available as a tech preview on Elastic Cloud. This new integration provides comprehensive visibility into the performance and usage of foundational models such as <strong>GPT-4, Mistral, and Llama</strong>, along with thousands of others from leading AI companies and from Azure, all available through Azure AI Foundry. The integration offers an out-of-the-box experience by simplifying the collection of metrics and logs, making it easier to gain actionable insights and effectively manage your models. It is simple to set up and comes with pre-built dashboards. With real-time insights, SREs can now monitor, optimize, and troubleshoot LLM applications that use Azure AI Foundry.</p>
<p>This blog will walk through the features available to SREs, such as monitoring invocations, errors, and latency information across various models, along with the usage and performance of LLM requests. Additionally, the blog will show how easy it is to set up and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Prerequisites</h2>
<p>To get started with the Azure AI Foundry integration, you will need:</p>
<ul>
<li>An account on Elastic Cloud and a deployed stack in Azure (<a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/elastic.ec-azure-pp?ocid=Elastic-Microsoft-Partner-Page-Get-Started">see instructions here</a>). Ensure you are using version 9.0.0 or higher.</li>
<li>An Azure account with permissions to pull the necessary data from Azure and Azure AI Foundry. See details in our <a href="https://www.elastic.co/docs/reference/integrations/azure_ai_foundry">documentation</a>.</li>
</ul>
<h2>Configuring Azure AI Foundry Integration</h2>
<p>To collect logs and metrics from Azure AI Foundry, configure Azure logs and metrics collection as described in the following links:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/docs/reference/integrations/azure_metrics#setup">Configure to receive Azure Metrics</a> - The integration collects Azure AI Foundry metrics directly from the service; make sure you have the client ID, subscription ID, and tenant ID from Azure AI Foundry in order to collect metrics.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_metrics.png" alt="Azure AI Foundry metrics" /></p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/reference/integrations/azure">Configure to receive Azure Logs</a> - in particular, ensure that you <a href="https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-create">configure an Azure event hub</a> so that Elastic can ingest logs. You will need the Azure event hub information when configuring the logs section of the Azure AI Foundry integration.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_logs.png" alt="Azure AI Foundry logs" /></p>
</li>
</ul>
<h2>Maximize Visibility with Out-of-the-box dashboards</h2>
<p>The Azure AI Foundry integration offers rich out-of-the-box visibility into the performance and usage of models in Azure AI Foundry, including text and image models. Several dashboards are currently available, with more coming as the integration goes GA.</p>
<ul>
<li>Azure AI Foundry Overview dashboard - provides a summarized view of invocations, errors, and latency across various models.</li>
<li>Azure AI Foundry Billing dashboard - provides total costs and daily usage costs from Azure Cognitive Services.</li>
<li>Azure AI Foundry Advanced Monitoring dashboard - focuses on logs generated by the Azure AI Foundry service when connected through the API Management service; provides request rate, error rate, model usage, latency, LLM prompt input, and response completion.</li>
</ul>
<p>Each dashboard provides specific insights important to SREs. Here is a quick overview of some of these insights:</p>
<ul>
<li>
<p><strong>Model Usage and Token Trends</strong> – Visualize token consumption and completion counts by model, endpoint, and time window.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_tokens.png" alt="Azure AI Foundry token usage metrics" /></p>
</li>
<li>
<p><strong>Latency Metrics</strong> – Monitor average and percentile latency per prompt, per endpoint, and correlate with prompt types or user IDs.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_model_latency.png" alt="Azure AI Foundry latency metrics" /></p>
</li>
<li>
<p><strong>Cost Estimation</strong> – Estimate API usage cost based on token consumption and model pricing.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_billing.png" alt="Azure AI Foundry cost estimation metrics" /></p>
</li>
<li>
<p><strong>Prompt/Completion Logging</strong> – View prompt-response pairs for debugging and quality monitoring.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_prompt_response.png" alt="Azure AI Foundry prompt/completions metrics" /></p>
</li>
<li>
<p><strong>Content Filtering and Guardrails</strong> – See which prompts or completions are being filtered, and why.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_guardrails.png" alt="Azure AI Foundry guardrails metrics" />
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/azure_ai_foundry_prompt_filtered.png" alt="Azure AI Foundry guardrails prompt filtered" /></p>
</li>
</ul>
<p>You can drill into specific users or sessions, slice by model type or region, and export reports for usage reviews or compliance.</p>
<hr />
<h2>Try it out today</h2>
<p>The Azure AI Foundry integration is currently available in Elastic Cloud (both serverless and hosted options). Start a 7-day free trial by signing up for Elastic Cloud directly or through Azure Marketplace.
Alternatively, deploy a cluster on our Elasticsearch Service, download the Elasticsearch stack, or run Elastic from Azure Marketplace, then spin up the new technical preview of the Azure AI Foundry integration, open the curated dashboards in Kibana, and start monitoring your Azure AI Foundry service!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-ai-foundry/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Spend and Content Moderation on Azure OpenAI with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-content-filter</link>
            <guid isPermaLink="false">llm-observability-azure-openai-content-filter</guid>
            <pubDate>Tue, 13 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[We have added further capabilities to the Azure OpenAI GA package, which now offer content filter monitoring and enhancements to the billing insights!]]></description>
<content:encoded><![CDATA[<p>In a previous blog we showed you how to set up observability for your models hosted on Azure OpenAI using Elastic’s integration. We’ve since expanded the integration to include Azure OpenAI content filtering and cost analysis. If you previously onboarded the Azure OpenAI integration, just upgrade it and you will automatically get all of the new features discussed in this blog. The enhanced integration now provides multiple dashboards, including a general Azure OpenAI Overview, an Azure Provisioned Throughput Unit dashboard, an Azure Content Filtering dashboard, and an Azure OpenAI Billing dashboard.</p>
<p>In this blog we will cover how to use Azure OpenAI content filtering and how to track Azure OpenAI usage costs. Let’s first review what these two capabilities from Azure OpenAI enable you to do:</p>
<h2>Azure OpenAI Content Filtering: Enhancing AI Safety</h2>
<p>Content filtering for Azure OpenAI plays a critical role in addressing AI safety challenges by helping to mitigate the risks associated with harmful or inappropriate content generated by AI models. By implementing robust content filtering mechanisms, organizations can proactively identify and filter out potentially harmful content, such as hate speech, misinformation, or violent imagery, before it is disseminated to users. This helps prevent the spread of harmful content and reduces the potential negative impact on individuals and communities.</p>
<p>Monitoring Azure OpenAI content filtering is essential for staying proactive in addressing emerging content moderation challenges. By closely monitoring the system, businesses can quickly detect any new types of harmful content or patterns of misuse that may arise. This enables organizations to stay ahead of potential content moderation issues and take timely action to protect their users and uphold their brand reputation.</p>
<h2>Tracking Azure OpenAI Usage Costs</h2>
<p>Monitoring Azure OpenAI model usage costs is crucial for managing budget and resource allocation effectively. By keeping track of usage costs, organizations can optimize their operations to avoid unnecessary expenses and ensure that they are getting the best value from their investment in AI technologies. Additionally, it helps in forecasting future expenses and aids in scaling resources according to the demand without compromising performance or incurring excessive costs. Effective monitoring also allows for transparency and accountability, enabling better decision-making in terms of AI deployment and utilization within Azure environments.</p>
<p>As we walk through this blog, we will provide you with prerequisites to set up and use the pre-configured dashboards for both of these capabilities, which are part of the Azure OpenAI integration.</p>
<h2>Prerequisites</h2>
<p>To follow along with this blog, you will need to:</p>
<ol>
<li>
<p>Set up and install the Azure billing integration to monitor the usage costs. Once the integration is installed, you can track the usage in the enhanced Azure OpenAI Billing dashboard.</p>
</li>
<li>
<p>Additionally, make sure you have enabled the Azure API Management service to access the Azure OpenAI models.</p>
</li>
</ol>
<h3>How to Use Azure API Management with Azure OpenAI:</h3>
<ul>
<li><strong>Provision an Azure OpenAI resource:</strong> Create an Azure OpenAI resource and select a model for your application.</li>
<li><strong>Create an API Management instance:</strong> Establish an Azure API Management instance to manage the Azure OpenAI APIs.</li>
<li><strong>Import the Azure OpenAI API:</strong> Import the Azure OpenAI API into your API Management instance using its OpenAPI specification.</li>
<li><strong>Configure Policies:</strong> Implement policies in API Management to manage request authentication, rate limiting, traffic shaping, and more.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_create_APM.png" alt="LLM Observability: Azure OpenAI Create API Management Service" /></p>
<h2>Steps to create a content filter for Azure OpenAI</h2>
<p>Before you set up observability for content filtering, ensure that you have configured Azure content filtering for your model. Follow the steps below to create an Azure OpenAI content filter:</p>
<ol>
<li><strong>Access the Azure OpenAI service console:</strong>
<ul>
<li>Sign in to the Azure Console with the appropriate permissions and navigate to the Azure OpenAI service console.</li>
</ul>
</li>
<li><strong>Navigate to Safety + security:</strong>
<ul>
<li>From the left-hand menu, select <strong>Safety + security</strong>.</li>
</ul>
</li>
<li><strong>Create a New Content filter:</strong>
<ul>
<li>Select <strong>Create content filter</strong>.</li>
<li>Configure the various content filter policies, including the following:
<ul>
<li><strong>Set input filter:</strong> Content will be annotated by category and blocked according to the threshold you set for prompts.</li>
<li><strong>Set output filter:</strong> Content will be annotated by category and blocked according to the threshold you set for response output.</li>
<li><strong>Blocklists:</strong> Define specific words or phrases to block.</li>
<li><strong>Deployments:</strong> Apply filters to model deployments.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Review and Create:</strong>
<ul>
<li>Review your settings and select Create to finalize the content filter configurations.</li>
</ul>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_create_content_filter.png" alt="LLM Observability: Azure OpenAI Create Content Filter" /></p>
<p>Customers can also configure content filters and create custom safety policies that are tailored to their use case requirements. The configurability feature allows customers to adjust the settings, separately for prompts and completions, to filter content for each content category at different severity levels.</p>
<h2>Content filter types</h2>
<ul>
<li>Content filtering categories: hate, sexual, violence, and self-harm.
<ul>
<li>Additional optional classification models detect jailbreak risk and known content for text and code.</li>
</ul>
</li>
<li>Severity levels within each content filter category: low, medium, and high.
<ul>
<li>Content detected at the 'safe' severity level is labeled in annotations but isn't subject to filtering and isn't configurable.</li>
</ul>
</li>
</ul>
<h2>Understanding the pre-configured dashboard for Azure OpenAI Content Filtering</h2>
<p>Now that you have set up the filter, you can see what is being filtered in Elastic through the Azure OpenAI content filtering dashboard. You can open it in either of two ways:</p>
<ol>
<li>Navigate to the Dashboard Menu – Select the <strong>Dashboard</strong> menu option in Elastic and search for <strong>[Azure OpenAI] Content Filtering Overview</strong> to open the dashboard.</li>
<li>Navigate to the Integrations Menu – Open the <strong>Integrations</strong> menu in Elastic, select <strong>Azure OpenAI</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Azure OpenAI] Content Filtering Overview</strong> from the dashboard assets.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_overview.png" alt="LLM Observability: Azure OpenAI Content Filtering Overview" /></p>
<p>The Azure OpenAI Content Filtering Overview dashboard in the Elastic integration provides insights into blocked requests, API latency, and error rates. It also provides a detailed breakdown of the content being filtered by the content filtering policy.</p>
<h2>Content Filter overview</h2>
<p>When the content filtering system detects harmful content, you receive either an error on the API call if the prompt was deemed inappropriate, or the <code>finish_reason</code> on the response will be <code>content_filter</code>, signifying that some of the completion was filtered.</p>
<p>This can be summarized as follows:</p>
<ul>
<li>
<p><strong>Prompt filters:</strong> Prompt content classified in a filtered category returns an HTTP 400 error.</p>
</li>
<li>
<p><strong>Non-streaming completion:</strong> When the content is filtered, non-streaming completions calls won't return any content. In rare cases with longer responses, a partial result can be returned; in these cases, the <code>finish_reason</code> is updated.</p>
</li>
<li>
<p><strong>Streaming completion:</strong> For streaming completions calls, segments are returned back to the user as they're completed. The service continues streaming until either reaching a stop token, length, or when content that is classified at a filtered category and severity level is detected.</p>
</li>
</ul>
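<p>In application code, these three outcomes can be told apart roughly as follows. The <code>choices</code>/<code>finish_reason</code> shape matches the chat completions response format; the helper itself is an illustrative sketch, not part of the Elastic integration:</p>

```python
def classify_filter_outcome(status_code, response):
    """Classify how Azure OpenAI's content filter affected a call:
    an HTTP 400 means the prompt itself was filtered; a finish_reason
    of "content_filter" means some or all of the completion was filtered."""
    if status_code == 400:
        return "prompt_filtered"
    choices = (response or {}).get("choices", [])
    if any(c.get("finish_reason") == "content_filter" for c in choices):
        return "completion_filtered"
    return "passed"
```

<p>An application might log the first two outcomes for review while surfacing a friendly message to the user instead of the raw error.</p>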
<h2>Prompt and response where content has been blocked</h2>
<p>This dashboard section displays the original LLM prompt, inputs from various sources (API calls, applications, or chat interfaces), and the corresponding completion response. The panel below gives a view of the responses after the content filtering policy has been applied to prompts and completions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_logs.png" alt="LLM Observability: Azure OpenAI Content Filtered Logs" /></p>
<p>You can use the following code snippet to start integrating your current prompt and settings into your application to test the content filter:</p>
<pre><code>chat_prompt = [
   {
       &quot;role&quot;: &quot;user&quot;,
       &quot;content&quot;: &quot;How to kill a mocking bird?&quot;
   }
]
</code></pre>
<p>After running the code, you can see the content being filtered under the violence category at the medium severity level.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content-filter_response.png" alt="LLM Observability: Azure OpenAI Content Filtered Response" /></p>
<h2>Content filtered by content source (Input &amp; Output)</h2>
<p>The content filtering system helps monitor and moderate different categories of content based on severity levels. The categories typically include things like adult content, offensive language, hate speech, violence, and more. The severity levels indicate the degree of sensitivity or potential harm associated with the content. This panel helps the user to effectively monitor and filter out inappropriate or harmful content to maintain a safe environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_content_filter_category_serverity.png" alt="LLM Observability: Azure OpenAI Content Filter Category &amp; Severity Level" /></p>
<p>These metrics can be categorized into the following groups:</p>
<ul>
<li><strong>Blocked requests by category:</strong> Provides insights into the total blocked requests by category.</li>
<li><strong>Severity distribution by categories:</strong> Monitors the blocked requests by categories and severity distribution. The severity distribution may be either low, medium or high.</li>
<li><strong>Content filtered categories:</strong> Provides insights into the content filtered categories over time.</li>
</ul>
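<p>To make the breakdown concrete, the sketch below tallies filtered requests by category and severity from a list of content filter results, mirroring the blocked-requests-by-category panel. The nested result shape is an illustrative assumption, not the integration's actual log schema:</p>

```python
from collections import Counter

def blocked_by_category(filter_results):
    """Tally filtered requests by (category, severity), as in the
    dashboard's blocked-requests breakdown."""
    counts = Counter()
    for result in filter_results:
        for category, verdict in result.items():
            if verdict.get("filtered"):
                counts[(category, verdict.get("severity"))] += 1
    return counts
```

<p>The same tallies, bucketed over time, would yield the "content filtered categories over time" view.</p>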
<h2>Reviewing the Azure OpenAI Billing dashboard</h2>
<p>You can now look at what you are spending on Azure OpenAI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/llm-observability-azure_openai_billing.png" alt="LLM Observability: Azure OpenAI Billing" /></p>
<p>Here is what you see on this dashboard:</p>
<ul>
<li><strong>Total costs:</strong> This measures the total usage cost across all the model deployments.</li>
<li><strong>Overall Usage by model:</strong> This tracks the total usage costs broken down by model.</li>
<li><strong>Daily usage:</strong> Monitors usage costs on a daily basis.</li>
<li><strong>Daily usage costs by model:</strong> Monitors daily usage costs broken down by model deployments.</li>
</ul>
<h2>Conclusion</h2>
<p>The Azure OpenAI integration makes it easy for you to collect a curated set of metrics and logs for your LLM-powered applications using Azure OpenAI along with content filtered responses. It comes with an out-of-the-box dashboard which you can further customize for your specific needs.</p>
<p>Deploy a cluster on our Elasticsearch Service or download the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-content-filter/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability with Elastic: Azure OpenAI Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2</link>
            <guid isPermaLink="false">llm-observability-azure-openai-v2</guid>
            <pubDate>Fri, 23 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We have added further capabilities to the Azure OpenAI GA package, which now offer prompt and response monitoring, PTU deployment performance tracking, and billing insights!]]></description>
            <content:encoded><![CDATA[<p>We recently announced GA of the Azure OpenAI integration. You can find details in our previous blog <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability: Azure OpenAI</a>.</p>
<p>Since then, we have added further capabilities to the Azure OpenAI GA package, which now offer prompt and response monitoring, PTU deployment performance tracking, and billing insights. Read on to learn more!</p>
<h2>Advanced Logging and Monitoring</h2>
<p>The initial GA release of the integration focused mainly on native logs, tracking the telemetry of the service through <strong>cognitive services logging</strong>. This version of the Azure OpenAI integration allows you to process the advanced logs, which give a more holistic view of OpenAI resource usage.</p>
<p>To achieve this, you have to set up the API Management service in Azure. The API Management service is a centralized place where you can put all of your OpenAI service endpoints and manage them end-to-end. Enable the API Management service and configure an Azure event hub to stream the logs.</p>
<p>To learn more about setting up the API Management service to access Azure OpenAI, please refer to the <a href="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/openai/architecture/log-monitor-azure-openai">Azure documentation</a>.</p>
<p>By using advanced logging, you can collect the following log data:</p>
<ul>
<li>Request input text</li>
<li>Response output text</li>
<li>Content filter results</li>
<li>Usage Information
<ul>
<li>Input prompt tokens</li>
<li>Output completion tokens</li>
<li>Total tokens</li>
</ul>
</li>
</ul>
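<p>As a sketch of how this usage information rolls up per model, the snippet below aggregates token counts from a list of log entries. The field names (<code>model</code>, <code>prompt_tokens</code>, and so on) are illustrative assumptions about the log shape, not the integration's actual field names:</p>

```python
def token_usage_by_model(entries):
    """Aggregate prompt, completion, and total tokens per model
    from advanced (API Management) log entries."""
    usage = {}
    for entry in entries:
        model = entry.get("model", "unknown")
        totals = usage.setdefault(model, {"prompt": 0, "completion": 0, "total": 0})
        totals["prompt"] += entry.get("prompt_tokens", 0)
        totals["completion"] += entry.get("completion_tokens", 0)
        totals["total"] += entry.get("total_tokens", 0)
    return usage
```

<p>This is the kind of rollup the dashboards compute for you; the sketch just shows what the per-model token view is built from.</p>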
<p>Azure OpenAI integration now collects the API Management Gateway logs. When a question from the user goes to the API Management, it logs the questions and the responses from the GPT models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure-openai-log-categories.png" alt="LLM Observability: Azure OpenAI Logs Overview" /></p>
<p>Here’s what a sample log looks like:
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-advance-log-monitoring.png" alt="LLM Observability: Azure OpenAI Advanced Logs" /></p>
<h3>Content filtered results</h3>
<p>Azure OpenAI’s content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. With Azure OpenAI model deployments, you can use the default content filter or create your own content filter.</p>
<p>The integration now collects the content-filtered result logs. In this example, let's create a custom filter in the Azure OpenAI Studio that generates an error log.</p>
<p>By leveraging the <strong>Azure Content Filters</strong>, you can create your own custom lists of terms or phrases to block or flag.
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure-content-filters.png" alt="LLM Observability: Azure OpenAI Set Content Filter" /></p>
<p>And the document ingested in Elastic would look like this:
<img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-content-filter-logs.png" alt="LLM Observability: Azure OpenAI Content Filter Logs" />
This screenshot provides insights into the content filtered request.</p>
<h2>PTU Deployment Monitoring</h2>
<p><a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput">Provisioned throughput units (PTU)</a> are units of model processing capacity that you can reserve and deploy for processing prompts and generating completions.</p>
<p>The curated dashboard for PTU Deployment gives comprehensive visibility into metrics such as request latency, active token usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.</p>
<p>Here are the essential PTU metrics captured by default:</p>
<ul>
<li><strong>Time to Response:</strong> Time taken for the first response to appear after a user sends a prompt.</li>
<li><strong>Active Tokens:</strong> Use this metric to understand your TPS or TPM based utilization for PTUs and compare to the benchmarks for target TPS or TPM scenarios.</li>
<li><strong>Provision-managed Utilization V2:</strong> Provides insights into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.</li>
<li><strong>Prompt Token Cache Match Rate:</strong> The prompt token cache hit ratio expressed as a percentage.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure_open_ai_ptu_deployment.png" alt="LLM Observability: Azure OpenAI PTU Deployment Metrics Monitoring" /></p>
<h2>Using Billing for cost</h2>
<p>Using the curated overview dashboard, you can now monitor the actual usage cost of your AI applications. Only one more step is needed to process the billing information.</p>
<p>You need to configure and install the <a href="https://www.elastic.co/docs/current/integrations/azure_billing">Azure billing metrics integration</a>. Once the installation is complete, the usage cost is visualized for the cognitive services in the Azure OpenAI overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/llm-observability-azure_openai_billing_overview.png" alt="LLM Observability: Azure OpenAI Usage Cost Monitoring" /></p>
<h2>Try it out today</h2>
<p>Deploy a cluster on our <a href="https://www.elastic.co/cloud/elasticsearch-service">Elasticsearch Service</a> or <a href="https://www.elastic.co/downloads/">download</a> the stack, spin up the new Azure OpenAI integration, open the curated dashboards in Kibana and start monitoring your Azure OpenAI service!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai-v2/LLM-observability.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM Observability: Azure OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai</link>
            <guid isPermaLink="false">llm-observability-azure-openai</guid>
            <pubDate>Mon, 24 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We are excited to announce the general availability of the Azure OpenAI Integration that provides comprehensive Observability into the performance and usage of the Azure OpenAI Service!]]></description>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of the <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=azure">Azure OpenAI Integration</a> that provides comprehensive Observability into the performance and usage of the <a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service">Azure OpenAI Service</a>! Also see <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">Part 2 of this blog</a>.</p>
<p>While we have offered <a href="https://www.elastic.co/observability-labs/blog/monitor-openai-api-gpt-models-opentelemetry">visibility into LLM environments</a> for a while now, the addition of our Azure OpenAI integration enables richer out-of-the-box visibility into the performance and usage of your Azure OpenAI based applications, further enhancing LLM Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-monitoring.png" alt="LLM Observability: Azure OpenAI Monitoring" /></p>
<p>The Azure OpenAI integration leverages <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a>’s Azure integration capabilities to collect both logs (using <a href="https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/stream-monitoring-data-event-hubs">Azure EventHub</a>) and metrics (using <a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/metrics-index">Azure Monitor</a>) to provide deep visibility on the usage of the <a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service">Azure OpenAI Service</a>.</p>
<p>The integration includes an out-of-the-box dashboard that summarizes the most relevant aspects of the service usage, including request and error rates, token usage and chat completion latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-monitoring-overview.png" alt="LLM Observability: Azure OpenAI Monitoring Overview" /></p>
<h2>Creating Alerts and SLOs to monitor Azure OpenAI</h2>
<p>As with every other Elastic integration, all the <a href="https://www.elastic.co/docs/current/integrations/azure_openai#logs">logs</a> and <a href="https://www.elastic.co/docs/current/integrations/azure_openai#metrics">metrics</a> information is fully available to leverage in every capability in <a href="https://www.elastic.co/observability">Elastic Observability</a>, including <a href="https://www.elastic.co/guide/en/observability/current/slo.html">SLOs</a>, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts.html">alerting</a>, custom <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">dashboards</a>, in-depth <a href="https://www.elastic.co/guide/en/observability/current/monitor-logs.html">logs exploration</a>, etc.</p>
<p>To create an alert to monitor token usage, for example, start with the Custom Threshold rule on the Azure OpenAI datastream and set an aggregation condition to track and report violations of token usage past a certain threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-create-alert.png" alt="LLM Observability: Azure OpenAI Monitoring Alert Creation" /></p>
<p>When a violation occurs, the Alert Details view linked in the alert notification for that alert provides rich context surrounding the violation, such as when the violation started, its current status, and any previous history of such violations, enabling quick triaging, investigation and root cause analysis.</p>
<p>Similarly, to create an SLO to monitor error rates in Azure OpenAI calls, start with the custom query SLI definition, defining good events as responses with a result signature below 400 (i.e., non-errors), over a total that includes all responses. Then, by setting an appropriate SLO target such as 99%, start monitoring your Azure OpenAI error rate SLO over a period of 7, 30, or 90 days to track degradation and take action before it becomes a pervasive problem.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/llm-observability-azure-openai-create-slo.png" alt="LLM Observability: Azure OpenAI Monitoring SLO Creation" /></p>
<p>Please refer to the <a href="https://www.elastic.co/guide/en/observability/current/monitor-azure-openai.html">User Guide</a> to learn more and to get started!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-azure-openai/AI_fingertip_touching_human_fingertip.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[End to end LLM observability with Elastic: seeing into the opaque world of generative AI applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-elastic</link>
            <guid isPermaLink="false">llm-observability-elastic</guid>
            <pubDate>Wed, 02 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic’s LLM Observability delivers end-to-end visibility into the performance, reliability, cost, and compliance of LLMs across Amazon Bedrock, Azure OpenAI, Google Vertex AI, and OpenAI, empowering SREs to optimize and troubleshoot AI-powered applications.]]></description>
            <content:encoded><![CDATA[<p>In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) stand as beacons of innovation, offering unprecedented capabilities across industries. From generating human-like text and translating languages to providing personalized customer interactions, the possibilities with LLMs are vast and increasingly indispensable. Enterprises are deploying these models for everything, from automating customer support systems to enhancing creative writing processes. Imagine a virtual assistant not only answering questions but also drafting business proposals or a customer service bot that understands and responds with empathy—all powered by LLMs. However, with great power comes the need for great oversight.</p>
<p>Despite the transformative potential, LLMs introduce complex challenges that necessitate a new level of observability as LLMs are notoriously opaque. Enter LLM observability: a crucial component in the lifecycle management of LLMs. This aspect becomes vital for Service Reliability Engineers (SREs) and other key stakeholders tasked with ensuring seamless, error-free operations, cost control, and minimizing the risks associated with the unpredictable nature of LLM generated responses. SREs need insights into performance metrics, error frequencies, latency issues, the cost implications of running these sophisticated models, and the prompt and response exchange with the model. Traditional monitoring tools fall short in this high-stakes environment; what’s needed is a nuanced approach to address the unique observability demands that LLMs introduce.</p>
<h3>Elastic's LLM Observability Capabilities Address These Challenges</h3>
<p>With Elastic’s end-to-end LLM observability, you can cover a wide range of use cases. To achieve this, you can onboard two types of integrations: API-based logs and metrics, and APM instrumentation. Depending on your use case, you can choose to use one or both of the LLM integrations.</p>
<ol>
<li>
<p><strong>High level overview</strong>: via API-based logs and metrics. Monitoring LLM services from providers by ingesting a curated set of service metrics and logs like latency, invocation frequency, tokens, errors, and prompts and responses. Each LLM integration comes with out-of-the-box dashboards.</p>
</li>
<li>
<p><strong>Troubleshooting applications</strong>: via APM instrumentation. Fully OTel-native tracing and auto-instrumentation for LLM-based applications through Elastic Distributions of OpenTelemetry (EDOT). Additionally, you can use third-party libraries (Langtrace, OpenLIT, OpenLLMetry) together with Elastic to extend the coverage to additional LLM-related technologies.</p>
</li>
</ol>
<h4>High level overview: LLM Observability for Leading Providers</h4>
<p>Elastic offers tailored API-based integrations for four major LLM hosting providers:</p>
<ul>
<li>
<p>Azure OpenAI</p>
</li>
<li>
<p>OpenAI</p>
</li>
<li>
<p>Amazon Bedrock</p>
</li>
<li>
<p>Google Vertex AI</p>
</li>
</ul>
<p>These integrations bring a curated set of logs and metrics collection tailored to each provider. What this means for SREs is straightforward access to pre-configured dashboards that highlight the prompts and responses, usage patterns, performance metrics, and cost details across different models and providers.</p>
<p>For instance, SREs keen on identifying which LLM generates the most errors or insights about the models in terms of latency, cost, or usage frequency can leverage these integrations. Imagine having the capability to instantly visualize which LLM is slowing down processes or incurring high costs, thus enabling data-driven decisions to optimize operations.</p>
<h4>Troubleshooting applications: Tracing and Auto-Instrumentation of OpenAI, Amazon Bedrock and Google Vertex AI models</h4>
<p>Elastic supports OTLP tracing capabilities in EDOT for applications using OpenAI models and models hosted on Amazon Bedrock and Google Vertex AI. In addition, Elastic also supports LLM tracing from third party libraries (Langtrace, OpenLIT, OpenLLMetry). </p>
<p>Tracing offers a comprehensive map of an application's request flow, pinpointing granular details about each call within the system. For each transaction and span of a request, tracing shows critical information such as the specific models utilized, request duration, errors encountered, tokens used per request, and the prompts and responses exchanged with the LLM.</p>
<p>Tracing helps SREs troubleshoot performance issues with applications developed in languages like Python, Node.js, and Java. If an SRE needs to investigate latency or error issues, LLM tracing provides a zoomed-in view into the request lifecycle and allows for profound insights into whether a delay is application-specific, model-specific, or systemic across deployments.</p>
<h3>Use Cases: Bringing Elastic's Observability Features to Life</h3>
<p>Let’s explore some practical scenarios where Elastic’s observability tools shine:</p>
<h4>1. Understanding LLM Performance and Reliability</h4>
<p>An SRE team looking to optimize a customer support system powered by Azure OpenAI can utilize Elastic’s <a href="https://www.elastic.co/guide/en/integrations/current/azure_openai.html">Azure OpenAI integration</a> to quickly ascertain which model variants incur higher latency or error rates. This enhances decision-making regarding model deployment or even switching providers based on performance metrics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/Azure-OpenAI.png" alt="Azure OpenAI" /></p>
<p>Similarly SREs can also use in parallel integrations for <a href="https://www.elastic.co/guide/en/integrations/current/gcp_vertexai.html">Google Vertex AI</a>, <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Amazon Bedrock</a>, and <a href="https://www.elastic.co/guide/en/integrations/current/openai.html">OpenAI</a> for other applications using models hosted on these providers.</p>
<h4>2. Troubleshooting OpenAI-Powered Applications</h4>
<p>Consider an enterprise utilizing an OpenAI model for real-time user interactions. Encountering unexplained delays, an SRE can use OpenAI tracing to dissect the transaction pathway, identifying if one specific API call or model invocation is the bottleneck. The SRE can also check the out-of-the-box OpenAI integration dashboard to verify if the latency is only affecting this application or all model invocations across the organization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/OpenAI-tracing.png" alt="OpenAI Tracing" /></p>
<p>An engineer troubleshooting the LLM-based application can also check to see what were the prompt and response exchanges with the LLM during this request so they can rule out possible impact on performance due to the input.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/OpenAI-trace.png" alt="OpenAI Trace sample with logs " /></p>
<h4>3. Addressing Cost and Usage Concerns</h4>
<p>SREs are generally acutely aware of which LLM configurations are less cost-effective than required. Elastic’s integration dashboards, pre-configured to display model usage patterns, help mitigate unnecessary spending effectively. You can find out-of-the box dashboards for Azure OpenAI, OpenAI, Amazon Bedrock, and Google VertexAI models. These dashboards show key cost and usage information such as total invocations and tokens, as well as time series breakdown by model and endpoint. In addition, some integrations show more advanced usage information such as provisioned throughput units (PTU) as well as billing cost.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/GCP-Vertex-AI.png" alt="GCP Vertex AI" /></p>
<h4>4. Understanding LLM Compliance </h4>
<p>With the Elastic Amazon Bedrock integration for Guardrails, and Azure OpenAI integration for content filtering, SREs can swiftly address security concerns, like verifying if certain user interactions prompt policy violations. Elastic's observability logs clarify whether guardrails rightly blocked potentially harmful responses, bolstering compliance assurance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/Bedrock-Guardrails.png" alt="Bedrock-Guardrails.png" /></p>
<h3>Conclusion</h3>
<p>As LLMs continue to revolutionize the capabilities of modern applications, the role of observability becomes increasingly paramount. Elastic’s comprehensive observability framework empowers enterprises to harness the full potential of LLMs while maintaining robust operational insight and control. The integration with prominent LLM hosting providers and advanced tracing for OpenAI, Amazon Bedrock and Google Vertex AI models, equips SREs with the necessary arsenal to navigate the complex landscape of LLM-driven applications, ensuring they remain safe, reliable, efficient, and cost-effective.</p>
<p>In this new era of AI, balancing innovation with observability isn't just beneficial—it's essential. Whether optimizing performance, troubleshooting intricacies, or managing costs and compliance, Elastic stands at the forefront, ensuring your LLM journey is as seamless as it is groundbreaking.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-elastic/llm-e2e.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[LLM observability: track usage and manage costs with Elastic's OpenAI integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/llm-observability-openai</link>
            <guid isPermaLink="false">llm-observability-openai</guid>
            <pubDate>Tue, 11 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's new OpenAI integration for Observability provides comprehensive insights into OpenAI model usage. With our pre-built dashboards and metrics, you can effectively track and monitor OpenAI model usage including GPT-4o and DALL·E.]]></description>
            <content:encoded><![CDATA[<p>In an era where AI-driven applications are becoming ubiquitous, understanding and managing the usage of language models is crucial. OpenAI has been at the forefront of developing advanced language models that power a multitude of applications, from chatbots to code generation. However, as applications grow in complexity and scale, observing crucial metrics that ensure optimal performance and cost-effectiveness becomes essential. Specific needs arise in areas such as performance and reliability monitoring, and cost management, which are pivotal for maximizing the potential of language models.</p>
<p>As organizations adopt OpenAI's diverse AI models, including language models like GPT-4o and GPT-3.5 Turbo, image models like DALL·E, and audio models like Whisper, comprehensive usage monitoring is crucial to track and optimize performance, reliability, usage and cost of each model.</p>
<p>Elastic's new <a href="https://www.elastic.co/guide/en/integrations/current/openai.html">OpenAI integration</a> offers a solution to the challenges faced by developers and businesses using these models. It is designed to provide a unified view of your OpenAI usage across all model types.</p>
<h3>Key benefits of the OpenAI integration</h3>
<p>OpenAI's usage-based pricing model applies across all these services, making it essential to track consumption and identify which models are being used to control costs and optimize deployments. The new OpenAI integration by Elastic utilizes the <a href="https://platform.openai.com/docs/api-reference/usage">OpenAI Usage API</a> to track consumption and identify specific models being used. It offers an out-of-the-box experience with pre-built dashboards, simplifying the process of monitoring your usage patterns.</p>
<p>Continue reading to learn about what you will get with the integration. We'll also show you the setup process, how to leverage the pre-built dashboards, and what insights you can gain from Elastic for LLM Observability.</p>
<h2>Setting up the OpenAI Integration</h2>
<h3>Prerequisites</h3>
<p>To follow along with this blog, you will need:</p>
<ul>
<li>An Elastic cloud account (version 8.16.3 or higher). Alternatively, you can use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a fully managed solution that eliminates infrastructure management, automatically scales based on usage, and lets you focus entirely on extracting value from your data.</li>
<li>An OpenAI account with an <a href="https://platform.openai.com/docs/api-reference/admin-api-keys">Admin API key</a>.</li>
<li>Applications that use the OpenAI APIs.</li>
</ul>
<h3>Generating sample OpenAI usage data</h3>
<p>If you're new to OpenAI and eager to try this integration, you can quickly set it up and populate your dashboards with sample data. You'll just need to generate some usage by interacting with the OpenAI API. If you don't have an OpenAI API key, you can create one <a href="https://platform.openai.com/api-keys">here</a>. For more information on authentication, refer to the OpenAI <a href="https://platform.openai.com/docs/api-reference/authentication">documentation</a>.</p>
<p>The OpenAI documentation provides detailed examples for each of their API endpoints. Here are direct links to the relevant sections for generating sample usage data:</p>
<ul>
<li>Language models (completions): Use the Chat Completions API to generate text. See the examples <a href="https://platform.openai.com/docs/api-reference/chat/create">here</a>.</li>
<li>Audio models (text-to-speech): Generate audio from text using the Speech API. See the examples <a href="https://platform.openai.com/docs/api-reference/audio/createSpeech">here</a>.</li>
<li>Audio models (speech-to-text): Transcribe audio to text using the Transcriptions API. See the examples <a href="https://platform.openai.com/docs/api-reference/audio/createTranscription">here</a>.</li>
<li>Embeddings: Generate vector representations of text using the Embeddings API. See the examples <a href="https://platform.openai.com/docs/api-reference/embeddings">here</a>.</li>
<li>Image models: Create images from text prompts using the Image Generation API. See the examples <a href="https://platform.openai.com/docs/api-reference/images/create">here</a>.</li>
<li>Moderation: Check the contents with Moderation API. See the examples <a href="https://platform.openai.com/docs/api-reference/moderations">here</a>.</li>
</ul>
<p>There are more endpoints that you can explore to generate sample usage data.</p>
<p>After running these examples (using your API key), remember that the OpenAI <a href="https://platform.openai.com/docs/api-reference/usage">Usage API</a> has a delay. It may take some time (usually a few minutes) for the usage data to appear in your dashboard.</p>
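<p>If you prefer to script this, the sketch below (our example, not from the OpenAI documentation) defines request bodies for a few of these endpoints; the model names are examples that may change over time, and actually sending the requests incurs small API charges:</p>

```python
import json
import os
import urllib.request

def sample_payloads():
    """Request bodies for a few usage-generating endpoints, keyed by path.
    Model names are examples; check the current model list."""
    return {
        "/v1/chat/completions": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Say hello in one word."}],
        },
        "/v1/embeddings": {"model": "text-embedding-3-small",
                           "input": "observability"},
        "/v1/moderations": {"input": "A harmless test sentence."},
    }

def send(path, body):
    """POST one request to the OpenAI API using the key in OPENAI_API_KEY."""
    req = urllib.request.Request(
        "https://api.openai.com" + path,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# To generate the usage (costs a few tokens per call):
#   for path, body in sample_payloads().items():
#       print(path, "->", "ok" if send(path, body) else "error")
```

After a few minutes, the resulting usage should start appearing in the integration's dashboard, subject to the Usage API delay noted above.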
<h3>Configuration</h3>
<p>To connect the OpenAI integration to your OpenAI account, you'll need your OpenAI's <a href="https://platform.openai.com/settings/organization/admin-keys">Admin API key</a>. The integration will use this key to periodically retrieve usage data from the OpenAI <a href="https://platform.openai.com/docs/api-reference/usage">Usage API</a>.</p>
<p>The integration supports eight distinct <a href="https://www.elastic.co/guide/en/integrations/current/openai.html#openai-data-streams">data streams</a>, corresponding to different categories of OpenAI API usage:</p>
<ul>
<li>Audio speeches (text-to-speech)</li>
<li>Audio transcriptions (speech-to-text)</li>
<li>Code interpreter sessions</li>
<li>Completions (language models)</li>
<li>Embeddings</li>
<li>Images</li>
<li>Moderations</li>
<li>Vector stores</li>
</ul>
<p>By default, all data streams are enabled. However, you can disable any data streams that are not relevant to your usage. All enabled data streams are visualized in a single, comprehensive dashboard, providing a unified view of your usage.</p>
<p>For advanced users, the integration offers additional configuration options, including setting the bucket width and initial interval. These options are documented in detail in the official integration <a href="https://www.elastic.co/guide/en/integrations/current/openai.html#openai-collection-behavior">documentation</a>.</p>
<h2>Maximize visibility with the out-of-the-box dashboard</h2>
<p>You can access the OpenAI dashboard in two ways:</p>
<ol>
<li>Navigate to the Dashboards menu in the left side panel and search for &quot;OpenAI&quot;. In the search results, select <strong>[Metrics OpenAI] OpenAI Usage Overview</strong> to open the dashboard.</li>
<li>Alternatively, navigate to the Integrations Menu — Open the <strong>Integrations</strong> menu under the <strong>Management</strong> section in Elastic, select <strong>OpenAI</strong>, go to the <strong>Assets</strong> tab, and choose <strong>[Metrics OpenAI] OpenAI Usage Overview</strong> from the dashboards assets.</li>
</ol>
<h3>Understanding the pre-configured dashboard for OpenAI</h3>
<p>The pre-built dashboard provides a structured view of OpenAI's API consumption, displaying key metrics such as token usage, API call distribution, and model-wise invocation counts. It highlights top-performing projects, users, and API keys, along with breakdowns of image generation, audio transcription, and text-to-speech usage. By analyzing these insights, users can track usage patterns, and optimize AI-driven applications.</p>
<h3>OpenAI usage metrics overview</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-overview.png" alt="LLM Observability: OpenAI usage metrics overview" /></p>
<p>This dashboard section shows key usage metrics from OpenAI, including invocation rates, token usage, and the top-performing models. It also highlights the total number of invocations and tokens and the invocation count by object type. Understanding these insights can help users optimize model usage, reduce costs, and enhance efficiency when integrating AI models into their applications.</p>
<h3>Top performing Project, User, and API Key IDs</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-top-tables.png" alt="LLM Observability: Top performing Project, User, and API Key IDs" /></p>
<p>Here, you can analyze the top Project IDs, User IDs, and API Key IDs based on invocation counts. This data provides valuable insights to help organizations track usage patterns across different projects and applications.</p>
<h3>Token metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-token-metrics.png" alt="LLM Observability: Token metrics" /></p>
<p>In this dashboard section you can see token usage trends across various models. This can help you analyze trends across input types (e.g., audio, embeddings, moderations), output types (e.g., audio), and input cached tokens. This information can help developers fine-tune their prompts and optimize token consumption.</p>
<h3>Image generation metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-image-metrics.png" alt="LLM Observability: Image generation metrics" /></p>
<p>AI-generated images are becoming increasingly popular across industries. This section provides an overview of image generation metrics, including invocation rates by model and the most <a href="https://platform.openai.com/docs/guides/images#size-and-quality-options">common output dimensions</a>. These insights help assess invocation costs and analyze image generation usage.</p>
<h3>Audio transcription metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-audio-transcription-metrics.png" alt="LLM Observability: Audio transcription metrics" /></p>
<p>OpenAI's AI-powered transcription services make speech-to-text conversion easier than ever. This section tracks audio transcription metrics, including invocation rates and total transcribed seconds per model. Understanding these trends can help businesses optimize costs when building audio transcription-based applications.</p>
<h3>Audio speech metrics</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-audio-speech.png" alt="LLM Observability: Audio speech metrics" /></p>
<p>OpenAI's text-to-speech (TTS) models deliver realistic voice synthesis for applications such as accessibility tools and virtual assistants. This section explores TTS invocation rates and the number of characters synthesized per model, offering insights into the adoption of AI-driven voice synthesis.</p>
<h2>Creating Alerts and SLOs to monitor OpenAI</h2>
<p>As with every other Elastic integration, all the logs and metrics information is fully available to leverage in every capability in <a href="https://www.elastic.co/observability">Elastic Observability</a>, including <a href="https://www.elastic.co/guide/en/observability/current/slo.html">SLOs</a>, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts.html">alerting</a>, custom <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">dashboards</a>, in-depth <a href="https://www.elastic.co/guide/en/observability/current/monitor-logs.html">logs exploration</a>, etc.</p>
<p>To proactively manage your OpenAI token usage and avoid unexpected costs, <a href="https://www.elastic.co/guide/en/observability/current/create-alerts-rules.html">create</a> a custom threshold rule in Observability Alerts.</p>
<p><em>Example</em>: Target the relevant data stream, and configure the rule to sum the related tokens field (along with other token-related fields, if applicable). Set a threshold representing your desired usage limit, and the alert will notify you if this limit is exceeded within a specified timeframe, such as daily or hourly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-create-alert.png" alt="LLM Observability: Alert creation" /></p>
<p>When an alert condition is met, the Alert Details view linked from the notification provides detailed insights into the violation, such as when it started, its current status, and any history of similar violations, enabling proactive issue resolution and improving system resilience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-alert-overview.png" alt="LLM Observability: Alert overview" /></p>
<p><em>Example</em>: To create an SLO that monitors model distribution in OpenAI, start by defining a custom metric SLI definition, adding good events where <code>openai.base.model</code> contains <code>gpt-3.5*</code> and total events encompassing all OpenAI requests, grouped by <code>openai.base.project_id</code> and <code>openai.base.user_id</code>. Then, set an appropriate SLO target such as 80% and monitor this over a 7-day rolling window to identify projects and users that may be overusing more expensive models.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-create-slo.png" alt="LLM Observability: SLO creation" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/dashboard-slo-overview.png" alt="LLM Observability: SLO overview" /></p>
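<p>Conceptually, the SLI compares good events to total events per group. A rough ES|QL equivalent of that ratio, reusing the fields from the example above (the data stream pattern is a placeholder), might look like this:</p>
<pre><code class="language-esql">FROM metrics-openai.base-*
| STATS good = COUNT(*) WHERE openai.base.model LIKE &quot;gpt-3.5*&quot;,
        total = COUNT(*)
        BY openai.base.project_id, openai.base.user_id
| EVAL sli = 100.0 * good / total
| WHERE sli &lt; 80
</code></pre>
<p>Any project/user combination returned here is falling below the 80% target, i.e. sending more than 20% of its requests to models other than GPT-3.5.</p>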
<p>You can now track the distribution of requests across different OpenAI models by project and user. This example demonstrates how Elastic's OpenAI integration helps you optimize costs. By monitoring the percentage of requests handled by cost-efficient GPT-3.5 models — the SLI — against the 80% target (part of the SLO), you can quickly identify which specific projects or users are driving up costs through excessive usage of models like GPT-4-turbo, GPT-4o, etc. This visibility enables targeted optimization strategies, ensuring your AI initiatives remain cost-effective while still leveraging advanced capabilities.</p>
<h2>Conclusion, next steps and further reading</h2>
<p>You now know how Elastic's OpenAI integration provides an essential tool for anyone relying on OpenAI's models to power their applications. By offering a comprehensive and customizable dashboard, this integration empowers SREs and developers to effectively monitor performance, manage costs, and optimize their AI systems effortlessly. Now, it's your turn to onboard this application following the instructions in this blog and start monitoring your OpenAI usage! We'd love to hear from you on how you get on and always welcome ideas for enhancements.</p>
<p>To learn how to set up Application Performance Monitoring (APM) tracing of OpenAI-powered applications, read this <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">blog</a>. For further reading and more LLM observability use cases, explore Elastic's observability lab blogs <a href="https://www.elastic.co/observability-labs/blog/tag/llmobs">here</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/llm-observability-openai/llm-observability-openai.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Serverless log analytics powered by Elasticsearch, in a new low priced tier]]></title>
            <link>https://www.elastic.co/observability-labs/blog/log-analytics-elastic-serverless-logs-essentials</link>
            <guid isPermaLink="false">log-analytics-elastic-serverless-logs-essentials</guid>
            <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability Logs Essentials delivers cost-effective, hassle-free log analytics on Elastic Cloud Serverless. SREs can ingest, search, enrich, analyze, store, and act on logs without the operational overhead of managing the deployment.]]></description>
            <content:encoded><![CDATA[<p>We're thrilled to introduce Elastic Observability Logs Essentials (Logs Essentials), a new tier in Elastic Cloud Serverless (SaaS). Built on the same robust stateless architecture as Elastic Observability Complete, it’s designed for Site Reliability Engineers (SREs) and developers seeking powerful, efficient, and economical log analytics, without the overhead of managing the Elastic Stack. As the leader in log management, Elasticsearch powers this new tier with unmatched search and analytics. </p>
<p>Logs Essentials is ideal for teams that want Elastic’s speed and scale without paying for premium features or managing the Elastic Stack. With Elastic Cloud Serverless, there’s no infrastructure to manage, and pricing is simple and predictable, making it easy to get started, stay supported, and focus on solving problems faster.</p>
<h2>Unmatched value for log analytics</h2>
<p>Logs Essentials empowers SREs and developers with analytics capabilities designed to help them quickly pinpoint the root cause of issues. </p>
<ul>
<li>
<p>Accelerate root cause analysis with fast, precise log search using filters, pattern matching, and event identification in seconds.</p>
</li>
<li>
<p>Gain deep contextual insights through ES|QL, Elastic’s powerful piped query language that supports structured exploration and joins across indices.</p>
</li>
<li>
<p>Detect issues proactively by setting alerts for error spikes or unusual log volumes, enabling timely incident response.</p>
</li>
<li>
<p>Visualize and monitor operational health with rich dashboards built in Kibana, giving teams a clear and actionable view of system behavior.</p>
</li>
</ul>
<p>Once on Logs Essentials, if you need SLOs, AI/ML, the AI Assistant, or other advanced features to analyze logs, or if you want to expand into traces and metrics, you should upgrade to <a href="https://www.elastic.co/pricing/serverless-observability">Observability Complete</a>.</p>
<h2>SaaS making it simple</h2>
<p>With Logs Essentials, SREs don’t have to worry about managing the powerful Elastic Stack. <a href="https://www.elastic.co/blog/journey-to-build-elastic-cloud-serverless">Elastic Cloud Serverless</a> automatically scales and adjusts to demand seamlessly without impacting performance, all while keeping costs low. There is no operational overhead of managing a deployment, and no need to be an Elastic Stack expert. SREs get the following benefits:</p>
<p><strong>No infrastructure to manage or scale:</strong> Elastic Cloud Serverless transitions from traditional stateful deployments to a fully stateless, autoscaling architecture, offloading storage to cloud-native object stores and orchestrating compute through Kubernetes. SRE teams can now focus solely on logs and insights, not capacity planning or cluster sizing.</p>
<p><strong>High reliability, resilience, and automation built-in:</strong> Elastic’s Cloud Serverless features multi-region deployments, automated control-plane and data-plane upgrades, automatic configuration updates, canary deployments, and capacity pool management to ensure always-on observability.</p>
<p>These capabilities deliver what SREs need: a hassle-free, scale-as-you-go, high-availability logging solution that empowers SREs to focus entirely on operational insights, not infrastructure.</p>
<h2>Affordable log analytics</h2>
<p>Logs Essentials offers a cost-effective and predictable path to log analytics. Elastic Cloud Serverless employs advanced autoscaling controllers that adjust compute and storage dynamically, enabling a flexible pricing model that charges based on real usage (ingest and retention), so SREs can “sign up and use” without upfront provisioning or surprise costs.</p>
<p>Instead of paying for idle capacity or managing infrastructure costs, users are billed based on ingest and retention, eliminating the guesswork and overprovisioning common in traditional observability solutions. SREs can simply sign up and start analyzing logs: no infrastructure to manage, no surprise costs, just transparent, cost-effective pricing for what they use.</p>
<h2>Logs Essentials in action</h2>
<p>Let’s walk through how a Site Reliability Engineer (SRE) would use Logs Essentials in a real-world scenario. Customers are unable to complete transactions on an ecommerce site, and the root cause isn’t clear. The issue could be in the front end, the back end, the database, or even the load balancer. Fortunately, logs are being collected from multiple components, including NGINX, MySQL, and the application itself. With Elastic Observability Logs Essentials, an SRE can quickly dive into these logs, starting with high-level symptoms and drilling down across services using powerful search, correlation via ES|QL, and dashboards.</p>
<p>The investigation continues as the SRE walks through several steps using ES|QL, search, and dashboards.</p>
<ul>
<li>There is an alert indicating a logs spike, triggered by a significant number of MySQL errors showing that the database table “orders” is full. The SRE also uses ES|QL to see how many errors have occurred in the last three hours.</li>
</ul>
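<p>The ES|QL query for this step could look like the following sketch; the data stream pattern and grouping field are assumptions based on a typical MySQL integration setup, not a prescribed query:</p>
<pre><code class="language-esql">FROM logs-mysql.error-*
| WHERE @timestamp &gt; NOW() - 3 hours
| STATS error_count = COUNT(*) BY host.name
| SORT error_count DESC
</code></pre>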
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-alerts.jpg" alt="Alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-sql-error.jpg" alt="MySQL error" /></p>
<ul>
<li>Next, the SRE assesses the impact on customers and potential revenue by looking at how many HTTP errors are occurring and which region is seeing them most. With a significant number of &gt;=400 responses, and the US as the main affected region, the issue is revenue impacting.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-nginx-400.jpg" alt="NGINX 400 Issues" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-geo.jpg" alt="GEO Analysis" /></p>
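<p>A query along these lines can surface the failing requests by region. The ECS field names below are typical for an NGINX integration but are assumptions that may differ in your setup:</p>
<pre><code class="language-esql">FROM logs-nginx.access-*
| WHERE http.response.status_code &gt;= 400
| STATS failed_requests = COUNT(*) BY source.geo.country_iso_code
| SORT failed_requests DESC
</code></pre>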
<ul>
<li>Next, the SRE looks at whether infrastructure is being impacted by finding the related Kubernetes cluster and pod. With this the SRE can further investigate whether the MySQL pod or the Kubernetes node is having CPU or memory utilization issues.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-k8s.jpg" alt="K8S Cluster Analysis" /></p>
<p>SREs can also create visualizations and dashboards easily through Observability Logs Essentials’ ES|QL, Discover, alerting, and dashboard capabilities.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-dashboard.jpg" alt="Dashboard" /></p>
<h2>Get started with Observability Logs Essentials</h2>
<p>By combining the trusted capabilities of Elasticsearch with the flexibility and scalability of Elastic Cloud Serverless, Logs Essentials delivers a streamlined, cost-effective solution that helps teams resolve incidents faster and with greater clarity. Whether you're troubleshooting critical outages, monitoring service health, or building dashboards for proactive insight, Logs Essentials gives you the tools you need — search, ES|QL, alerting, and visualization — in a package that’s simple to adopt and scale.</p>
<p>In order to get started, first <a href="https://cloud.elastic.co/serverless-registration">register on Elastic Cloud</a> and start a trial.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Process data from Elastic integrations with the integration filter plugin in Logstash]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-integration-filter-plugin</link>
            <guid isPermaLink="false">logstash-integration-filter-plugin</guid>
            <pubDate>Fri, 14 Mar 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Offload data processing operations outside of your Elastic deployment and onto Logstash by using the integration filter plugin.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-elastic_integration.html">Elastic Integration filter plugin</a> for Logstash allows you to process data from Elastic integrations through executing ingest pipelines within Logstash, before forwarding the data to Elastic.</p>
<h3>Why should I use it?</h3>
<p>This approach has the advantage of offloading data processing operations outside of your Elastic deployment and onto Logstash, giving you flexibility on where this should occur.</p>
<p>Additionally, with Logstash as the final route for your data before ingestion into Elastic, it could save you from having to open different ports and set different firewall rules for each agent or beats instance, as Logstash could aggregate all output from these components.</p>
<h3>Prerequisites</h3>
<p>You have an Elastic agent with one or more integrations inside an agent policy running on a server. If you need to install an Elastic agent, you can follow the guide <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">here.</a></p>
<h3>Steps</h3>
<p>We will:</p>
<ul>
<li>
<p>Install Logstash, but not run it until all steps are complete</p>
</li>
<li>
<p>Generate custom certificates and keys on our Logstash server, to enable secure communication between Fleet server and Logstash</p>
</li>
<li>
<p>Configure Fleet to add a Logstash output</p>
</li>
<li>
<p>Set up Logstash, including a custom pipeline that receives input from Elastic agent, uses the integration filter plugin, and finally forwards the events to Elastic</p>
</li>
<li>
<p>Start Logstash</p>
</li>
<li>
<p>Update an agent policy to use that new Logstash output</p>
</li>
</ul>
<h3>Installing Logstash</h3>
<p>Use <a href="https://www.elastic.co/guide/en/logstash/current/installing-logstash.html">this guide</a> to install Logstash on your server.</p>
<h3>Set up SSL/TLS on the Logstash server</h3>
<p>Use <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html">this guide</a> to create custom certificates and keys for securing the Logstash output connection that will be used by Fleet. We need to do this before we set up a custom pipeline file for Logstash, as we’ll refer to some of the certificate values in that config.</p>
<p>As per the guide, I <a href="https://www.elastic.co/downloads/elasticsearch">downloaded Elasticsearch</a> so I could use the certutil tool that is included, and extracted the contents.</p>
<h3>Add a Logstash output to Fleet in Kibana</h3>
<p>With our certificates and keys to hand, we can complete the steps necessary to <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html#add-ls-output">set up a Logstash output for Fleet</a> from within Kibana. Do not yet set the Logstash output on an agent policy, as we need to configure a custom pipeline in Logstash first.</p>
<h3>Set up a custom pipeline for Logstash</h3>
<p>We need to add a custom pipeline yml file, which will include our Elastic agent input and integration filter. The typical definition for a Logstash pipeline is this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/basic_logstash_pipeline.png" alt="Basic Logstash pipeline" /></p>
<p>Our custom pipeline yml file will start with the Elastic agent <em>input</em> plugin, the guide for which is <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elastic_agent.html">here.</a></p>
<p>We will then have the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-elastic_integration.html">integration filter</a>, and an <em>output</em> to Elastic Cloud that will be different depending on if you are ingesting to a hosted cloud deployment, or a serverless project.</p>
<p>Your completed file should look something like this:</p>
<pre><code class="language-yaml">input {
  elastic_agent {
    port =&gt; 5044
    ssl_enabled =&gt; true
    ssl_certificate_authorities =&gt; [&quot;/pathtoca/ca.crt&quot;]
    ssl_certificate =&gt; &quot;/pathtologstashcrt/logstash.crt&quot;
    ssl_key =&gt; &quot;/pathtologstashkey/logstash.pkcs8.key&quot;
    ssl_client_authentication =&gt; &quot;required&quot;
  }
}
filter {
  elastic_integration{
    cloud_id =&gt; &quot;Ross_is_Testing:123456&quot;
    cloud_auth =&gt; &quot;elastic:yourpasswordhere&quot;
  }
}
output {
    # For cloud hosted deployments
    elasticsearch {
        cloud_id =&gt; &quot;Ross_is_Testing:123456&quot;
        cloud_auth =&gt; &quot;elastic:yourpasswordhere&quot;
        data_stream =&gt; true
        ssl =&gt; true
        ecs_compatibility =&gt; v8
    }
    # For serverless projects
    elasticsearch {
        hosts =&gt; [&quot;https://projectname.es.us-east-1.aws.elastic.cloud:443&quot;]
        api_key =&gt; &quot;yourapikey-here&quot;
        data_stream =&gt; true
        ssl =&gt; true
        ecs_compatibility =&gt; v8
    }
}
</code></pre>
<p>The above syntax for the output section is valid: you <em>can</em> specify multiple outputs!</p>
<p>For cloud hosted deployments you can use the deployment’s CloudId for authentication, which you can get from the <a href="https://cloud.elastic.co/">cloud admin console</a>, on the deployment overview screen:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/deployment_overview.png" alt="Deployment overview screen" /></p>
<p>I'm also using a username and password, but you could instead specify an API key if desired.</p>
<p>For serverless projects, you’ll need to use your Elasticsearch endpoint and an API key to connect Logstash, as documented <a href="https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-logstash.html">here.</a> You can get the Elasticsearch endpoint from the manage project screen of the cloud admin console:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/serverless_endpoint.png" alt="Serverless project endpoint" /></p>
<p>Ensure the main pipelines.yml file for Logstash also includes a reference to our custom pipeline file:</p>
<pre><code class="language-yaml"># This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
- pipeline.id: fromagent
  path.config: &quot;/etc/logstash/conf.d/agent.conf&quot;
</code></pre>
<p>We can then start Logstash. As we haven’t yet updated an Elastic agent policy to use our Logstash output, no events will yet be going through Logstash.</p>
<h3>Update an agent policy to use our Logstash output</h3>
<p>With Logstash running, we can now <a href="https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html#use-ls-output">set our configured Logstash output on an agent policy</a> of our choosing.</p>
<h3>Complete</h3>
<p>Events from the integrations on the chosen agent policy will be sent through Logstash, and relevant ingest pipelines run within Logstash to process the data before sending to Elastic Cloud.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-integration-filter-plugin/logstash-filter.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Logstash Pipeline Management & Configuration with GitOps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-pipeline-management-configuration-gitops</link>
            <guid isPermaLink="false">logstash-pipeline-management-configuration-gitops</guid>
            <pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Stop treating Logstash like a black box. This guide shows you how to use GitOps to create auditable, automated, and resilient data pipelines. Eliminate config drift and boost security with this GitHub and Jenkins blueprint.]]></description>
            <content:encoded><![CDATA[<p>Is your Logstash environment a 'black box'? Are manual configuration changes leading to unexpected outages, security gaps, and countless hours spent on troubleshooting? It's time to stop treating observability infrastructure like a fragile art project. This blog post delivers a strategic blueprint for taming your Logstash pipelines, transforming them into a version-controlled, automated, and auditable asset. By adopting a GitOps approach, you can eliminate configuration drift, empower your teams to collaborate securely and ensure your observability platform is as resilient as the systems it monitors.</p>
<h2>From Fragile Art Project to Auditable Asset: How to Tame Your Logstash Configurations with Version Control and Automation</h2>
<p>Observability ensures system health, performance, and security. Logstash drives this by processing and routing your data. But as you scale, manual configuration management becomes a bottleneck. It leads to errors, outages, and security gaps. You need a better way.</p>
<p>This blog post shows you how to manage Logstash pipelines using GitOps. You will use Git as your single source of truth and automate deployments to increase the stability, security, and efficiency of your organisation’s observability infrastructure.</p>
<p>This blog post details the benefits of this methodology and provides a practical implementation model using <strong>GitHub</strong> for version control and <strong>Jenkins</strong> for Continuous Integration and Continuous Deployment (CI/CD).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-pipeline-management-configuration-gitops/ls-pipeline-gitops-flow.png" alt="Logstash Central Pipeline GitOps flow" /></p>
<h2>The Unsung Hero: Why Logstash Remains a Cornerstone of Enterprise Data Strategy</h2>
<p>In the evolving landscape of observability and data pipelines, <strong>Logstash</strong> remains one of the most powerful and reliable components in the Elastic ecosystem. While it may not always take the spotlight, its depth of capability, flexibility, and resilience make it essential for enterprises managing complex, varied data streams. Logstash offers four main benefits:</p>
<ul>
<li>
<p><strong>Extensive Integration Support:</strong> Logstash supports a wide array of input and output plugins — including Kafka, syslog, Beats, cloud services, and databases — making it ideal for ingesting data from diverse environments and routing it across your architecture.</p>
</li>
<li>
<p><strong>Advanced Data Transformation:</strong> With rich filtering capabilities and optional Ruby scripting, Logstash enables complex enrichment, field manipulation, and conditional routing — allowing teams to standardise and prepare data early in the pipeline.</p>
</li>
<li>
<p><strong>Offloading Elasticsearch Ingest Load:</strong> The <a href="https://www.elastic.co/docs/reference/logstash/using-logstash-with-elastic-integrations"><code>elastic_integration</code></a> filter replicates ingest pipeline logic in Logstash, enabling upstream transformations that reduce processing overhead on Elasticsearch and streamline the indexing path.</p>
</li>
<li>
<p><strong>Operational Resilience with Persistent Queues:</strong> Logstash’s persistent queue buffers data during downstream slowdowns or outages, helping smooth ingestion spikes, prevent data loss, and maintain stability under load.</p>
</li>
</ul>
<p>In modern CI/CD workflows, where automation and rapid iteration are standard, Logstash’s maturity and flexibility continue to make it a dependable choice — quietly powering the data flows that keep observability pipelines running strong.</p>
<h2>The Case for a GitOps-Driven Observability Strategy</h2>
<p>GitOps is a paradigm that applies proven DevOps best practices such as version control, collaboration, compliance, and CI/CD to infrastructure and configuration management. When applied to Logstash, this means that every pipeline configuration is treated as code—defined, versioned, reviewed, and deployed from a Git repository.</p>
<p>For enterprise environments, the adoption of a GitOps model for Logstash pipelines offers compelling advantages:</p>
<ul>
<li>
<p><strong>Enhanced Auditability and Compliance:</strong> Every pipeline modification is captured as a Git commit, creating an immutable, chronological audit trail. This provides unparalleled visibility into who made what change, when, and why, which is indispensable for meeting regulatory compliance requirements and conducting security audits.</p>
</li>
<li>
<p><strong>Improved System Stability and Reliability:</strong> The risk of deploying faulty configurations is drastically reduced. By enforcing a pull request (PR) workflow, all changes undergo peer review and automated validation <em>before</em> they are merged and deployed. In the event of an incident caused by a new configuration, a rollback is as fast and straightforward as reverting a Git commit.</p>
</li>
<li>
<p><strong>Increased Automation and Operational Efficiency:</strong> Automating the deployment lifecycle eliminates manual, error-prone configuration tasks. This frees up skilled engineers from routine operational duties, allowing them to focus on higher-value activities such as optimising data flows, improving analytics, and strengthening security postures.</p>
</li>
<li>
<p><strong>Fostered Cross-Team Collaboration:</strong> Git provides a universal and well-understood platform for collaboration. Development, Security, and Operations (DevSecOps) teams can work together seamlessly on a unified codebase. This shared ownership breaks down silos and ensures that pipeline configurations are robust, secure, and fit for purpose across the organization.</p>
</li>
</ul>
<h2>Implementation Model: GitHub and Jenkins</h2>
<p>This section details a practical framework for implementing a GitOps workflow for Logstash.</p>
<h3>1. Prerequisites</h3>
<ul>
<li>
<p>An established <strong>GitHub</strong> organisation or account.</p>
</li>
<li>
<p>A running <strong>Jenkins</strong> instance with the necessary plugins installed (e.g., Git, GitHub Integration).</p>
</li>
<li>
<p>A target <strong>Logstash</strong> environment where configurations will be deployed.</p>
</li>
<li>
<p>Working knowledge of Git, Jenkins pipelines, and Logstash configuration syntax.</p>
</li>
</ul>
<h3>2. Step 1: Establish a Centralised Git Repository</h3>
<p>The foundation of a GitOps workflow is a version-controlled repository.</p>
<ol>
<li>
<p><strong>Create a Repository:</strong> In GitHub, create a new repository (e.g., logstash-configurations). This will serve as the single source of truth for all pipeline configurations.</p>
</li>
<li>
<p><strong>Define a Directory Structure:</strong> A logical directory structure is crucial for managing configurations across different environments. A recommended structure is:</p>
</li>
</ol>
<pre><code>    /
    ├── pipelines/
    │   ├── development/
    │   │   ├── 01-input-beats.conf
    │   │   ├── 10-filter-nginx.conf
    │   │   └── 99-output-elasticsearch.conf
    │   ├── staging/
    │   │   └── ...
    │   └── production/
    │       └── ...
    └── Jenkinsfile
</code></pre>
<p>This structure clearly separates configurations by environment and allows for a modular and maintainable pipeline design.</p>
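<p>As an illustration, a minimal <code>01-input-beats.conf</code> in this layout might contain just the input stage; the port is a placeholder, and the comment explains why the numeric prefixes matter:</p>
<pre><code># Files in a pipeline directory are concatenated in lexical order,
# so the numeric prefixes (01-, 10-, 99-) control the stage ordering.
input {
  beats {
    port =&gt; 5044
  }
}
</code></pre>
<p>Filters and outputs live in their own similarly prefixed files, keeping each stage small and reviewable in a pull request.</p>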
<h3>3. Step 2: Automate Deployment with a Jenkins CI/CD Pipeline</h3>
<p>The Jenkins pipeline automates validation and deployment of the configurations from Git to your Logstash instances.</p>
<ol>
<li>
<p><strong>Create a</strong> <code>Jenkinsfile</code><strong>:</strong> Add a <code>Jenkinsfile</code> to the root of your repository to define the automation pipeline. This pipeline-as-code approach ensures the deployment process itself is version-controlled.</p>
</li>
<li>
<p><strong>Define the Pipeline Stages:</strong> The pipeline should include distinct stages for checking out code, validating configurations, and deploying to the target environment.</p>
<p>A sample <code>Jenkinsfile</code> could look as follows:</p>
<pre><code>
 pipeline {
     agent any

     // Trigger the pipeline on every push to the main branch
     triggers {
         githubPush()
     }

     stages {
         stage('Checkout') {
             steps {
                 // Clone the repository
                 git 'https://github.com/your-org/logstash-configurations.git'
             }
         }

         stage('Validate Staging Configs') {
             steps {
                 // Run Logstash's built-in config test
                 // This prevents syntax errors from reaching production
                 sh 'docker run --rm -v ${WORKSPACE}/pipelines/staging:/usr/share/logstash/pipeline/ docker.elastic.co/logstash/logstash:9.3.1 logstash --config.test_and_exit'
             }
         }

         stage('Deploy to Staging') {
             // This stage requires Jenkins to have credentials to access the Staging server
             steps {
                 withCredentials([sshUserPrivateKey(credentialsId: 'staging-server-creds', keyFileVariable: 'KEY_FILE')]) {
                     sh '''
                         scp -i ${KEY_FILE} ${WORKSPACE}/pipelines/staging/*.conf user@staging-logstash-host:/etc/logstash/conf.d/
                         ssh -i ${KEY_FILE} user@staging-logstash-host 'sudo systemctl reload logstash'
                     '''
                 }
             }
         }

         // Optional: Add a manual approval step before deploying to production
         stage('Approval for Production') {
             steps {
                 input 'Deploy to Production?'
             }
         }

         stage('Deploy to Production') {
              steps {
                 // Similar deployment steps for the production environment
                 // using production credentials
              }
         }
     }
 }
</code></pre>
</li>
</ol>
<h3>4. The GitOps Workflow in Practice</h3>
<p>This setup enables a controlled, auditable, and automated workflow:</p>
<ol>
<li>
<p><strong>Branch Creation:</strong> An engineer creates a feature branch in Git to propose a change (e.g., <code>feature/add-syslog-input</code>).</p>
</li>
<li>
<p><strong>Configuration Change:</strong> The engineer modifies or adds a pipeline configuration file in their branch.</p>
</li>
<li>
<p><strong>Pull Request:</strong> A pull request is created in GitHub. This action can trigger automated checks in Jenkins to validate the syntax of the proposed changes.</p>
</li>
<li>
<p><strong>Peer Review:</strong> Team members review the changes for logic, security, and adherence to standards.</p>
</li>
<li>
<p><strong>Merge and Deploy:</strong> Upon approval, the PR is merged into the <code>main</code> branch. This merge automatically triggers the Jenkins pipeline, which deploys the validated configuration to the corresponding Logstash environment.</p>
</li>
</ol>
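<p>The automated syntax check mentioned in step 3 can be implemented with Logstash's built-in config test flag. A minimal sketch of such a Jenkins stage (the stage name and pipeline path are illustrative, and Logstash is assumed to be installed on the build agent):</p>
<pre><code class="language-groovy">stage('Validate Configuration') {
    steps {
        // --config.test_and_exit parses the configuration and exits,
        // failing the build on any syntax error
        sh 'logstash --config.test_and_exit -f ${WORKSPACE}/pipelines/staging/'
    }
}
</code></pre>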
<h2>Best Practices for Enterprise Adoption</h2>
<p>To successfully implement this model at an enterprise scale, consider the following best practices:</p>
<ul>
<li>
<p><strong>Branching Strategy:</strong> Adopt a consistent branching strategy, such as GitFlow, to manage features, releases, and hotfixes in an orderly manner. Protect your <code>main</code> or <code>production</code> branches with rules that require PR reviews and passing status checks before merging.</p>
</li>
<li>
<p><strong>Scalability:</strong> For large-scale deployments with many Logstash nodes, use configuration management tools like Ansible, Puppet, or Chef within your Jenkins pipeline to orchestrate the deployment across your entire fleet.</p>
</li>
<li>
<p><strong>Fostering a GitOps Culture:</strong> Successful adoption is as much about people and processes as it is about tools. Provide training and documentation to ensure all stakeholders understand the workflow and their role within it. Emphasise the collaborative benefits and the shared responsibility for maintaining a stable and secure observability platform.</p>
</li>
<li>
<p><strong>Pipeline Observability</strong> (<em>Optional</em>): Monitoring the health and performance of your CI/CD pipelines is crucial and recommended for early detection of issues, visibility into bottlenecks, and auditability. Elastic Observability provides native support for monitoring Jenkins pipelines using the Elastic <a href="https://plugins.jenkins.io/opentelemetry/">CI/CD Observability plugin</a>.</p>
</li>
<li>
<p><strong>Secrets Management:</strong> Never hardcode sensitive information (passwords, API keys) in your configuration files. Use a secrets management tool like HashiCorp Vault or AWS Secrets Manager, and have Logstash retrieve these secrets at runtime.</p>
</li>
</ul>
<p>A sample <em>snippet</em> to retrieve the secret and store it securely in a <a href="https://www.elastic.co/docs/reference/logstash/keystore">Logstash keystore</a> could look as follows:</p>
<pre><code>...

stage('Update Logstash Secret') {
    // Define the secret path and key in Vault
    def secretPath = 'secret/logstash/production'
    def secretKey = 'elasticsearch_password'
    def keystoreKey = 'ES_PWD' // The key name to be used in the Logstash keystore

    // Wrap the steps in withVault; the plugin exposes each requested
    // secret as an environment variable inside the block
    withVault(configuration: [vaultUrl: 'http://your-vault-server:8200',
                              vaultCredentialId: 'vault-approle-creds'],
              vaultSecrets: [[path: secretPath,
                              secretValues: [[envVar: 'ES_PASSWORD', vaultKey: secretKey]]]]) {

        // Use SSH credentials to access the Logstash server (credentials id is illustrative)
        withCredentials([sshUserPrivateKey(credentialsId: 'prod-server-creds', keyFileVariable: 'KEY_FILE')]) {
            sh &quot;&quot;&quot;
            ssh -i ${KEY_FILE} user@logstash-host &lt;&lt;'ENDSSH'
            # Pipe the secret directly into the logstash-keystore command
            # This avoids writing the secret to disk or exposing it in the process list
            echo &quot;${ES_PASSWORD}&quot; | sudo -u logstash /usr/share/logstash/bin/logstash-keystore add ${keystoreKey} --stdin

            # After updating the keystore, reload Logstash to apply the change
            sudo systemctl reload logstash
            ENDSSH
            &quot;&quot;&quot;
        }
    }
}
...

</code></pre>
<h2>Conclusion</h2>
<p>Adopting a GitOps approach for managing Logstash pipelines is a strategic move that aligns observability with modern DevSecOps principles. It replaces manual, opaque processes with an automated, transparent, and collaborative framework. For enterprise organisations, this leads to a more secure, resilient, and efficient observability infrastructure, empowering teams to derive maximum value from their data while minimising operational overhead and risk.</p>
<p>The above example is just a start; there’s a lot more you can do once you lay the foundation—GitOps is just the beginning. From branching automation to pipeline promotion workflows to building self-service deployment portals, the possibilities are limited only by your creativity (and maybe your CI minutes).</p>
<p>GitOps lays the foundation. To see the whole picture, you need to monitor your pipelines. <a href="https://cloud.elastic.co/registration">Start a trial</a> and try out Elastic’s <a href="https://www.elastic.co/docs/solutions/observability/cicd">CI/CD Observability solution</a> to track build health and deployment trends. It connects code changes to production behavior, giving you deep visibility into your new automated workflow.</p>
<p>Build smarter pipelines. Monitor what matters. And let your GitOps-powered observability stack become the quiet hero of your DevSecOps story. See how <a href="https://www.elastic.co/elasticsearch/streams">Streams</a> can supercharge your data engineering with the next generation of AI-powered log management &amp; log processing.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-pipeline-management-configuration-gitops/continuous-improvement.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Convert Logstash pipelines to OpenTelemetry Collector Pipelines]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-to-otel</link>
            <guid isPermaLink="false">logstash-to-otel</guid>
            <pubDate>Fri, 25 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This guide helps Logstash users transition to OpenTelemetry by demonstrating how to convert common Logstash pipelines into equivalent OpenTelemetry Collector configurations. We will focus on the log signal.]]></description>
            <content:encoded><![CDATA[<h1>Convert Logstash pipelines to OpenTelemetry Collector Pipelines</h1>
<h2>Introduction</h2>
<p>Elastic's observability strategy is increasingly aligned with OpenTelemetry. With the recent launch of <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry</a>, we’re expanding our offering to make it easier to use OpenTelemetry. The Elastic Agent now offers an <a href="https://www.elastic.co/guide/en/fleet/current/otel-agent.html">&quot;otel&quot; mode</a>, enabling it to run a custom distribution of the OpenTelemetry Collector and seamlessly enhancing your observability onboarding and experience with Elastic.</p>
<p>This post is designed to assist users familiar with Logstash transitioning to OpenTelemetry by demonstrating how to convert some standard Logstash pipelines into corresponding OpenTelemetry Collector configurations.</p>
<h2>What is OpenTelemetry Collector and why should I care?</h2>
<p><a href="https://opentelemetry.io/">OpenTelemetry</a> is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to re-instrument their observability when switching platforms.</p>
<p>By embracing OpenTelemetry, you gain these benefits:</p>
<ul>
<li><strong>Unified Observability</strong>: By using the OpenTelemetry Collector, you can collect and manage logs, metrics, and traces from a single tool, providing holistic observability into your system's performance and behavior. This simplifies monitoring and debugging in complex, distributed environments like microservices.</li>
<li><strong>Flexibility and Scalability</strong>: Whether you're running a small service or a large distributed system, the OpenTelemetry Collector can be scaled to handle the amount of data generated, offering the flexibility to deploy as an agent (running alongside applications) or as a gateway (a centralized hub).</li>
<li><strong>Open Standards</strong>: Since OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF), it ensures that you're working with widely accepted standards, contributing to the long-term sustainability and compatibility of your observability stack.</li>
<li><strong>Simplified Telemetry Pipelines</strong>: The ability to build pipelines using receivers, processors, and exporters simplifies telemetry management by centralizing data flows and minimizing the need for multiple agents.</li>
</ul>
<p>In the next sections, we will explain how OpenTelemetry Collector and Logstash pipelines are structured and clarify how the components of each correspond.</p>
<h2>OTEL Collector Configuration</h2>
<p>An OpenTelemetry Collector <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a> has different sections:</p>
<ul>
<li><strong>Receivers</strong>: Collect data from different sources.</li>
<li><strong>Processors</strong>: Transform the data collected by receivers.</li>
<li><strong>Exporters</strong>: Send data to one or more destinations (backends or other collectors).</li>
<li><strong>Connectors</strong>: Link two pipelines together.</li>
<li><strong>Service</strong>: Defines which components are active.
<ul>
<li><strong>Pipelines</strong>: Combine the defined receivers, processors, exporters, and connectors to process the data.</li>
<li><strong>Extensions</strong>: Optional components that expand the capabilities of the Collector to accomplish tasks not directly involved with processing telemetry data (e.g., health monitoring).</li>
<li><strong>Telemetry</strong>: Where you can set up observability for the Collector itself (e.g., logging and monitoring).</li>
</ul>
</li>
</ul>
<p>We can visualize it schematically as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/otel-config-schema.png" alt="otel-config-schema" /></p>
<p>We refer to the official documentation <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration | OpenTelemetry</a> for an in-depth introduction to the components.</p>
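<p>Putting these sections together, a minimal illustrative configuration that receives logs over OTLP, batches them, and prints them to the console could look as follows (the <code>otlp</code> receiver, <code>batch</code> processor, and <code>debug</code> exporter all ship with the standard Collector distributions):</p>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  debug:

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
</code></pre>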
<h2>Logstash pipeline definition</h2>
<p>A <a href="https://www.elastic.co/guide/en/logstash/current/configuration-file-structure.html">Logstash pipeline</a> is composed of three main components:</p>
<ul>
<li>Input Plugins: Allow us to read data from different sources</li>
<li>Filters Plugins: Allow us to transform and filter the data</li>
<li>Output Plugins: Allow us to send the data</li>
</ul>
<p>Logstash also has a special input and a special output that allow pipeline-to-pipeline communication; we can consider this a concept similar to an OpenTelemetry connector.</p>
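<p>For reference, this pipeline-to-pipeline wiring uses the <code>pipeline</code> input and output plugins with a virtual address. A minimal sketch (the pipeline roles and address name are illustrative):</p>
<pre><code class="language-ruby"># In the upstream pipeline definition
output {
    pipeline { send_to =&gt; [&quot;downstream-address&quot;] }
}

# In the downstream pipeline definition
input {
    pipeline { address =&gt; &quot;downstream-address&quot; }
}
</code></pre>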
<h2>Logstash pipeline compared to Otel Collector components</h2>
<p>We can schematize how Logstash Pipeline and OTEL Collector pipeline components can relate to each other as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-pipeline-to-otel-pipeline.png" alt="logstash-pipeline-to-otel-pipeline" /></p>
<p>Enough theory! Let us dive into some examples.</p>
<h2>Convert a Logstash Pipeline into OpenTelemetry Collector Pipeline</h2>
<h3>Example 1: Parse and transform log line</h3>
<p>Let's consider the below line:</p>
<pre><code>2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read the line from the file <code>/tmp/demo-line.log</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-access_log-default</code>.</li>
<li>Extract the <code>@timestamp</code>, <code>user.name</code>, <code>client.ip</code>, <code>client.port</code>, <code>url.path</code> and <code>http.status.code</code>.</li>
<li>Drop log messages related to the <code>SYSTEM</code> user.</li>
<li>Parse the date timestamp with the relevant date format and store it in <code>@timestamp</code>.</li>
<li>Add a code <code>http.status.code_description</code> based on known codes' descriptions.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo-line.log&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;access_log&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
    grok { #[3]
        match =&gt; {
            &quot;message&quot; =&gt; &quot;%{TIMESTAMP_ISO8601:[date]}: user %{WORD:[user][name]} accessed from %{IP:[client][ip]}:%{NUMBER:[client][port]:int} path %{URIPATH:[url][path]} with error %{NUMBER:[http][status][code]}&quot;
        }
    }
    if &quot;_grokparsefailure&quot; not in [tags] {
        if [user][name] == &quot;SYSTEM&quot; { #[4]
            drop {}
        }
        date { #[5]
            match =&gt; [&quot;[date]&quot;, &quot;ISO8601&quot;]
            target =&gt; &quot;[@timestamp]&quot;
            timezone =&gt; &quot;UTC&quot;
            remove_field =&gt; [ &quot;date&quot; ]
        }
        translate { #[6]
            source =&gt; &quot;[http][status][code]&quot;
            target =&gt; &quot;[http][status][code_description]&quot;
            dictionary =&gt; {
                &quot;200&quot; =&gt; &quot;OK&quot;
                &quot;403&quot; =&gt; &quot;Permission denied&quot;
                &quot;404&quot; =&gt; &quot;Not Found&quot;
                &quot;500&quot; =&gt; &quot;Server Error&quot;
            }
            fallback =&gt; &quot;Unknown error&quot;
        }
    }
}

output {
    elasticsearch { #[7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog: #[1]
    start_at: beginning
    include:
      - /tmp/demo-line.log
    include_file_name: false
    include_file_path: true
    storage: file_storage 
    operators:
    # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;access_log_otel&quot; 
    - type: add #[2]
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/grok: #[3]
    log_statements:
      - context: log
        statements:
        - 'merge_maps(attributes, ExtractGrokPatterns(attributes[&quot;event.original&quot;], &quot;%{TIMESTAMP_ISO8601:date}: user %{WORD:user.name} accessed from %{IP:client.ip}:%{NUMBER:client.port:int} path %{URIPATH:url.path} with error %{NUMBER:http.status.code}&quot;, true), &quot;insert&quot;)'

  filter/exclude_system_user:  #[4]
    error_mode: ignore
    logs:
      log_record:
        - attributes[&quot;user.name&quot;] == &quot;SYSTEM&quot;

  transform/parse_date: #[5]
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;date&quot;], &quot;%Y-%m-%dT%H:%M:%S&quot;))
          - delete_key(attributes, &quot;date&quot;)
        conditions:
          - attributes[&quot;date&quot;] != nil

  transform/translate_status_code:  #[6]
    log_statements:
      - context: log
        conditions:
        - attributes[&quot;http.status.code&quot;] != nil
        statements:
        - set(attributes[&quot;http.status.code_description&quot;], &quot;OK&quot;)                where attributes[&quot;http.status.code&quot;] == &quot;200&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Permission Denied&quot;) where attributes[&quot;http.status.code&quot;] == &quot;403&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Not Found&quot;)         where attributes[&quot;http.status.code&quot;] == &quot;404&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Server Error&quot;)      where attributes[&quot;http.status.code&quot;] == &quot;500&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Unknown Error&quot;)     where attributes[&quot;http.status.code_description&quot;] == nil

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - resourcedetection/system
        - transform/grok
        - filter/exclude_system_user
        - transform/parse_date
        - transform/translate_status_code
      exporters:
        - elasticsearch
</code></pre>
<p>This will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-09-20T08:33:27.000Z&quot;,
    &quot;client&quot;: {
        &quot;ip&quot;: &quot;89.66.167.22&quot;,
        &quot;port&quot;: 10592
    },
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;access_log&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;http&quot;: {
        &quot;status&quot;: {
            &quot;code&quot;: &quot;404&quot;,
            &quot;code_description&quot;: &quot;Not Found&quot;
        }
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;path&quot;: &quot;/tmp/demo-line.log&quot;
        }
    },
    &quot;message&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;,
    &quot;url&quot;: {
        &quot;path&quot;: &quot;/blog&quot;
    },
    &quot;user&quot;: {
        &quot;name&quot;: &quot;frank&quot;
    }
}
</code></pre>
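<p>Since grok patterns are named regular expressions under the hood, the extraction in step 3 can be sanity-checked locally. A rough Python equivalent of the pattern (plain <code>re</code>, with simplified stand-ins for the grok sub-patterns):</p>
<pre><code class="language-python">import re

line = '2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404'

# Simplified stand-ins for TIMESTAMP_ISO8601, WORD, IP, NUMBER and URIPATH
pattern = (r'(\S+): user (\w+) accessed from ([\d.]+):(\d+) '
           r'path (\S+) with error (\d+)')

date, user, ip, port, path, status = re.match(pattern, line).groups()
print(user, ip, status)  # frank 89.66.167.22 404
</code></pre>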
<h3>Example 2: Parse and transform a NDJSON-formatted log file</h3>
<p>Let's consider the below json line:</p>
<pre><code class="language-json">{&quot;log_level&quot;:&quot;INFO&quot;,&quot;message&quot;:&quot;User login successful&quot;,&quot;service&quot;:&quot;auth-service&quot;,&quot;timestamp&quot;:&quot;2024-10-11 12:34:56.123 +0100&quot;,&quot;user&quot;:{&quot;id&quot;:&quot;A1230&quot;,&quot;name&quot;:&quot;john_doe&quot;}}
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read a line from the file <code>/tmp/demo.ndjson</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-json-default</code></li>
<li>Parse the JSON and assign relevant keys and values.</li>
<li>Parse the date.</li>
<li>Override the message field.</li>
<li>Rename fields to follow ECS conventions.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo.ndjson&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;json&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
  if [message] =~ /^\{.*/ {
    json { #[3] &amp; #[5]
        source =&gt; &quot;message&quot;
    }
  }
  date { #[4]
    match =&gt; [&quot;[timestamp]&quot;, &quot;yyyy-MM-dd HH:mm:ss.SSS Z&quot;]
    remove_field =&gt; &quot;[timestamp]&quot;
  }
  mutate {
    rename =&gt; { #[6]
      &quot;service&quot; =&gt; &quot;[service][name]&quot;
      &quot;log_level&quot; =&gt; &quot;[log][level]&quot;
    }
  }
}


output {
    elasticsearch { # [7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog/json: # [1]
    include: 
      - /tmp/demo.ndjson
    retry_on_failure:
      enabled: true
    start_at: beginning
    storage: file_storage 
    operators:
     # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;      
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;otel&quot; #[2]
    - type: add
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;     


extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/json_parse:  #[3]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), &quot;upsert&quot;)
        conditions: 
          - IsMatch(body, &quot;^\\{&quot;)
      

  transform/parse_date:  #[4]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;timestamp&quot;], &quot;%Y-%m-%d %H:%M:%S.%L %z&quot;))
          - delete_key(attributes, &quot;timestamp&quot;)
        conditions: 
          - attributes[&quot;timestamp&quot;] != nil

  transform/override_message_field: #[5]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(body, attributes[&quot;message&quot;])
          - delete_key(attributes, &quot;message&quot;)

  transform/set_log_severity: # [6]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes[&quot;log_level&quot;])          

  attributes/rename_attributes: #[6]
    actions:
      - key: service.name
        from_attribute: service
        action: insert
      - key: service
        action: delete
      - key: log_level
        action: delete

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs/json:
      receivers: 
        - filelog/json
      processors:
        - resourcedetection/system    
        - transform/json_parse
        - transform/parse_date        
        - transform/override_message_field
        - transform/set_log_severity
        - attributes/rename_attributes
      exporters: 
        - elasticsearch

</code></pre>
<p>This will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-10-11T11:34:56.123000000Z&quot;,
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;otel&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;{\&quot;log_level\&quot;:\&quot;INFO\&quot;,\&quot;message\&quot;:\&quot;User login successful\&quot;,\&quot;service\&quot;:\&quot;auth-service\&quot;,\&quot;timestamp\&quot;:\&quot;2024-10-11 12:34:56.123 +0100\&quot;,\&quot;user\&quot;:{\&quot;id\&quot;:\&quot;A1230\&quot;,\&quot;name\&quot;:\&quot;john_doe\&quot;}}&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;name&quot;: &quot;demo.ndjson&quot;
        },
        &quot;level&quot;: &quot;INFO&quot;
    },
    &quot;message&quot;: &quot;User login successful&quot;,
    &quot;service&quot;: {
        &quot;name&quot;: &quot;auth-service&quot;
    },
    &quot;user&quot;: {
        &quot;id&quot;: &quot;A1230&quot;,
        &quot;name&quot;: &quot;john_doe&quot;
    }
}

</code></pre>
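<p>As a quick sanity check of the date handling in step 4, the sample timestamp can be parsed in Python with the equivalent <code>strptime</code> directives; note that the <code>+0100</code> offset shifts the UTC result back by one hour:</p>
<pre><code class="language-python">from datetime import datetime, timezone

# Same field order as the OTTL format string '%Y-%m-%d %H:%M:%S.%L %z'
ts = datetime.strptime('2024-10-11 12:34:56.123 +0100', '%Y-%m-%d %H:%M:%S.%f %z')
print(ts.astimezone(timezone.utc).isoformat())  # 2024-10-11T11:34:56.123000+00:00
</code></pre>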
<h2>Conclusion</h2>
<p>In this post, we showed examples of how to convert a typical Logstash pipeline into an OpenTelemetry Collector pipeline for logs. While OpenTelemetry provides powerful tools for collecting and exporting logs, if your pipeline relies on complex transformations or scripting, Logstash remains a superior choice. This is because Logstash offers a broader range of built-in features and a more flexible approach to handling advanced data manipulation tasks.</p>
<h2>What's Next?</h2>
<p>Now that you've seen basic (but realistic) examples of converting a Logstash pipeline to OpenTelemetry, it's your turn to dive deeper. Depending on your needs, you can explore further and find more detailed resources in the following repositories:</p>
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector">OpenTelemetry Collector</a>: Learn about the core OpenTelemetry components, from receivers to exporters.</li>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">OpenTelemetry Collector Contrib</a>: Find community-contributed components for a wider range of integrations and features.</li>
<li><a href="https://github.com/elastic/opentelemetry-collector-components">Elastic's opentelemetry-collector-components</a>: Dive into Elastic's extensions for the OpenTelemetry Collector, offering more tailored features for Elastic Stack users.</li>
</ul>
<p>If you encounter specific challenges or need to handle more advanced use cases, these repositories will be an excellent resource for discovering additional components or integrations that can enhance your pipeline. All these repositories have a similar structure with folders named <code>receiver</code>, <code>processor</code>, <code>exporter</code>, <code>connector</code>, which should be familiar after reading this blog. Whether you are migrating a simple Logstash pipeline or tackling more complex data transformations, these tools and communities will provide the support you need for a successful OpenTelemetry implementation.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-otel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Managing your applications on Amazon ECS EC2-based clusters with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manage-applications-amazon-ecs-ec2-clusters-observability</link>
            <guid isPermaLink="false">manage-applications-amazon-ecs-ec2-clusters-observability</guid>
            <pubDate>Tue, 15 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to manage applications on Amazon ECS clusters based on EC2 instances and how simple it is to use Elastic agents with the AWS and docker integrations to provide a complete picture of your apps, ECS service, and corresponding EC2 instances.]]></description>
            <content:encoded><![CDATA[<p>In previous blogs, we explored how Elastic Observability can help you monitor various AWS services and analyze them effectively:</p>
<ul>
<li><a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">Managing fundamental AWS services such as Amazon EC2, Amazon RDS, Amazon VPC, and NAT gateway</a></li>
<li><a href="https://www.elastic.co/blog/aws-kinesis-data-firehose-elastic-observability-analytics">Data can be ingested into Elastic observability using a serverless forwarder or Amazon Kinesis Data Firehose</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">Ingesting and analyzing AWS VPC Flow logs</a></li>
</ul>
<p>One of the more heavily used AWS container services is Amazon ECS (Elastic Container Service). While there is a trend toward using Fargate to simplify the setup and management of ECS clusters, many users still prefer using Amazon ECS with EC2 instances. It may not be as straightforward or efficient as AWS Fargate, but it offers more control over the underlying infrastructure.</p>
<p>In our most recent blog, we explored how <a href="https://www.elastic.co/blog/elastic-agent-monitor-ecs-aws-fargate-elastic-observability">Elastic Observability helps manage Amazon ECS with Fargate</a>. In this blog, we will review how to manage an Amazon ECS cluster with EC2 instances using Elastic Observability.</p>
<p>In general, when setting up Amazon ECS-based clusters with EC2, you may or may not have access to the EC2 instances. This determines how you can monitor your EC2-based ECS cluster with Elastic Observability. There are two components you can use:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-1-amazon-ecs.png" alt="amazon ecs" /></p>
<p>As you can see in the diagram above, the two components are:</p>
<ol>
<li>
<p><strong>Baseline setup:</strong> The Elastic Agent running the AWS integration is configured to obtain ECS metrics and logs from CloudWatch. This agent runs on an instance that is not part of the ECS cluster because it allows you to see ALL ECS clusters and other AWS services, such as EKS, RDS, and EC2.</p>
</li>
<li>
<p><strong>Additional setup:</strong> If you have access to the EC2 instances in the ECS cluster, you can run Elastic’s Docker integration on each EC2 instance. This gives you significantly more detail on the containers than AWS Container Insights, and it does not require AWS CloudWatch, which can be fairly costly.</p>
</li>
</ol>
<p>Whether you use just the baseline setup or also the additional setup, you will have to set up AWS CloudWatch Container Insights for the ECS cluster. The Docker integration in the additional setup, however, can supplement AWS CloudWatch Container Insights with extra detail.</p>
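<p>For reference, Container Insights can be enabled on an existing ECS cluster with a single AWS CLI call (the cluster name is illustrative):</p>
<pre><code class="language-bash"># Enable CloudWatch Container Insights for one ECS cluster
aws ecs update-cluster-settings \
  --cluster my-ecs-cluster \
  --settings name=containerInsights,value=enabled
</code></pre>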
<p>Hence, we will review how you can monitor the various components of an EC2-based ECS cluster:</p>
<ul>
<li>EC2 instances in the ASG group</li>
<li>ECS services running in the ECS cluster</li>
<li>ECS tasks (containers)</li>
</ul>
<p>Also, we will review how you can obtain metrics and logs from the ECS cluster with and without AWS Cloudwatch. We’ll show you how to use:</p>
<ul>
<li>AWS CloudWatch Container Insights (from Cloudwatch)</li>
<li>Docker metrics (non-Cloudwatch)</li>
<li>Amazon ECS logs via Cloudwatch</li>
</ul>
<h2>Prerequisites and configuration</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>An account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) — ensure that you have both.</li>
<li>A <a href="https://hub.docker.com/_/nginx">nginx</a> container and a <a href="https://github.com/containerstack/alpine-stress">stress container</a> — we will use these two basic containers to help highlight the load on the Elastic ECS Cluster.</li>
<li>An ECS EC2 Cluster in an Auto Scaling Group — ensure you have access in order to load up the Elastic agent on the EC2 instances, or you can create an AMI and use that as the baseline image for your ECS cluster.</li>
<li>An EC2 instance anywhere in your account that is not part of the ECS cluster and has public access (to send metrics and logs)</li>
</ul>
<h2>What will you see in Elastic Observability once it's all set up?</h2>
<p>Suppose you use the baseline configuration: an ECS EC2 cluster with AWS CloudWatch Container Insights enabled, plus an Elastic Agent configured with the following integrations:</p>
<ul>
<li>ECS integration</li>
<li>EC2 integration</li>
<li>AWS Cloudwatch Integration with metrics and logging</li>
</ul>
<p>Then you will be able to get the following information in Elastic dashboards:</p>
<ul>
<li>Containers in the cluster (AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
<li>Services in the cluster (AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
<li>CPU and memory utilization of the ECS Cluster (Elastic Agent with ECS integration)</li>
<li>EC2 CPU and memory utilization of the instance in the cluster (Elastic Agent with EC2 integration)</li>
<li>CPU and memory utilization per container (via AWS CloudWatch Container Insights via Elastic Agent and AWS Cloudwatch integration)</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-2-containers-in-cluster.png" alt="containers in cluster" /></p>
<p>If you also apply the additional setup, an Elastic Agent with the Docker integration on each ECS EC2 instance, you will get a direct feed of metrics from Docker. The following metrics can be viewed:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-3-metrics-graphs.png" alt="metrics graphs" /></p>
<p>Let’s see how to set this all up.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up an ECS EC2 cluster and potentially setting up your own AMI</li>
<li>Setting up the containers <a href="https://hub.docker.com/_/nginx">nginx</a> and a <a href="https://github.com/containerstack/alpine-stress">stress container</a></li>
<li>Setting up the Elastic agent with docker container integration on the ECS EC2 instances</li>
<li>Setting up the Elastic agent with AWS, Cloudwatch, and ECS integrations on an independent EC2 instance</li>
</ul>
<h3>Step 1: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-4-free-trial.png" alt="free trial" /></p>
<h3>Step 2: Set up an ECS Cluster with EC2 instances</h3>
<p>When creating a cluster, you have two options when setting it up using the console:</p>
<ul>
<li>Create a new ASG, where you are limited to the preloaded set of Amazon Linux (2 or 2023) based AMIs</li>
<li>Set up your own ASG prior to creating the ECS cluster and select it from the options. This option gives you more control over the Linux version and lets you bake things like the Elastic Agent into the AMI used for the instances in the ASG.</li>
</ul>
<p>Whichever option you choose, you will need to turn on <strong>Container Insights</strong> (see the bottom part of the image below).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-5-infrastructure.png" alt="infrastructure" /></p>
<p>Once the cluster is set up, you can go to AWS CloudWatch, where you should see Container Insights for your cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-6-container-insights.png" alt="container insights" /></p>
<h3>Step 3: Set up Elastic agent with docker integration</h3>
<p>Next, you will need to add an Elastic Agent to each one of the instances. In Elastic Cloud, set up an agent policy with the Docker and System integrations, like so:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-7-ecs-ec2-cluster-policy.png" alt="cluster policy" /></p>
<p>Next, add an agent for the policy, then copy the appropriate install script (in our case it was Linux since we were running Amazon Linux 2), and run it on every EC2 instance in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-8-add-agent.png" alt="add agent" /></p>
<p>Once this is done, you should see the agents in Fleet, one per EC2 instance:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-9-fleet.png" alt="fleet" /></p>
<p>If you decide to set up an ECS EC2 cluster with your own ASG and don’t use the Amazon Linux AMIs (2 or 2023 version), you will have to:</p>
<ul>
<li>Pick your base image to base an AMI on</li>
<li><a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html">Add an ECS agent and register each instance to the AMI base image manually</a></li>
<li><a href="https://www.elastic.co/guide/en/fleet/current/install-standalone-elastic-agent.html">Add the Elastic agent — standalone version</a> — this step will require you to configure your Elastic endpoint and API key (or simply add the script in the “add agent” part of the configuration above when using the UI)</li>
<li>Create the AMI once all the above components are added</li>
<li>Use the newly created AMI in creating the ASG for ECS cluster</li>
</ul>
<h3>Step 4: Set up an Elastic agent with the AWS integration</h3>
<p>From the integrations tab in Elastic Cloud, select the AWS integration and click <strong>Add agent</strong>. You will then walk through the configuration of the AWS integration.</p>
<p>At a minimum, ensure that you have the following configuration options turned on:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-10-toggles.png" alt="toggles" /></p>
<p>This will ensure that not only EC2 metrics and logs are ingested but that all CloudWatch metrics and logs are also ingested. ECS metrics and logs are stored in CloudWatch.</p>
<p>If you want to ensure only logs from the specific ECS cluster are ingested, you can also restrict what to ingest by several parameters. In our setup, we are collecting only logs from Log Group with a prefix of /aws/ecs/containerinsights/EC2BasedCluster/.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-11-cloudwatch.png" alt="cloudwatch" /></p>
<p>Once this policy is set up, add an agent just as you did in Step 3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-12-add-agent-testing-aws.png" alt="add agent testing aws" /></p>
<p>However, this agent needs to be added to an EC2 instance which is independent of the ECS cluster.</p>
<p>Once installed, this agent will help pull in:</p>
<ul>
<li>All EC2 instance metrics across your account (which can be adjusted in the integration policy)</li>
<li>AWS CloudWatch Container Insights data from ECS</li>
<li>ECS metrics such as:
<ul>
<li>aws.ecs.metrics.CPUReservation.avg</li>
<li>aws.ecs.metrics.CPUUtilization.avg</li>
<li>aws.ecs.metrics.GPUReservation.avg</li>
<li>aws.ecs.metrics.MemoryReservation.avg</li>
<li>aws.ecs.metrics.MemoryUtilization.avg</li>
<li><a href="https://docs.elastic.co/integrations/aws/ecs">More - see the full list here</a></li>
</ul>
</li>
</ul>
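<p>Once ingested, these ECS metric fields follow a predictable <code>aws.ecs.metrics.&lt;MetricName&gt;.&lt;stat&gt;</code> pattern, which is handy if you build dashboard queries programmatically. A minimal sketch (the helper name here is ours, not part of the integration):</p>

```go
package main

import "fmt"

// ecsMetricField builds the Elastic field name for an AWS ECS
// CloudWatch metric, e.g. "aws.ecs.metrics.CPUUtilization.avg".
func ecsMetricField(metric, stat string) string {
	return fmt.Sprintf("aws.ecs.metrics.%s.%s", metric, stat)
}

func main() {
	// Print the field names for the metrics listed above.
	for _, m := range []string{"CPUReservation", "CPUUtilization", "MemoryReservation", "MemoryUtilization"} {
		fmt.Println(ecsMetricField(m, "avg"))
	}
}
```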
<h3>Step 5: Setting up services and containers</h3>
<p>For this configuration, we used an <a href="https://hub.docker.com/_/nginx">nginx</a> container and a <a href="https://github.com/containerstack/alpine-stress">stress container</a>.</p>
<p>In order to launch the services and containers on ECS, you will need to set up a task for each of these containers. More importantly, you will need to ensure that both of the following roles in each task definition:</p>
<p>&quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,</p>
<p>&quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,</p>
<p>have the following permissions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-13-permissions.png" alt="permissions" /></p>
<p>Most importantly, you should ensure that this permission is added:</p>
<p>AmazonEC2ContainerServiceforEC2Role</p>
<p>It will ensure containers can be brought up on the EC2 instances in the cluster.</p>
<p>Once you have the right permissions, then set up the following tasks.</p>
<p>Here is the task JSON for NGINX:</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;NGINX&quot;,
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;nginx&quot;,
      &quot;image&quot;: &quot;nginx:latest&quot;,
      &quot;cpu&quot;: 0,
      &quot;portMappings&quot;: [
        {
          &quot;name&quot;: &quot;nginx-80-tcp&quot;,
          &quot;containerPort&quot;: 80,
          &quot;hostPort&quot;: 80,
          &quot;protocol&quot;: &quot;tcp&quot;,
          &quot;appProtocol&quot;: &quot;http&quot;
        }
      ],
      &quot;essential&quot;: true,
      &quot;environment&quot;: [],
      &quot;environmentFiles&quot;: [],
      &quot;mountPoints&quot;: [],
      &quot;volumesFrom&quot;: [],
      &quot;ulimits&quot;: [],
      &quot;logConfiguration&quot;: {
        &quot;logDriver&quot;: &quot;awslogs&quot;,
        &quot;options&quot;: {
          &quot;awslogs-create-group&quot;: &quot;true&quot;,
          &quot;awslogs-group&quot;: &quot;/ecs/&quot;,
          &quot;awslogs-region&quot;: &quot;us-west-2&quot;,
          &quot;awslogs-stream-prefix&quot;: &quot;ecs&quot;
        },
        &quot;secretOptions&quot;: []
      }
    }
  ],
  &quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;networkMode&quot;: &quot;awsvpc&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;EC2&quot;],
  &quot;cpu&quot;: &quot;256&quot;,
  &quot;memory&quot;: &quot;512&quot;,
  &quot;runtimePlatform&quot;: {
    &quot;cpuArchitecture&quot;: &quot;X86_64&quot;,
    &quot;operatingSystemFamily&quot;: &quot;LINUX&quot;
  }
}
</code></pre>
<p>Here is the task JSON for stress container:</p>
<pre><code class="language-json">{
  &quot;family&quot;: &quot;stressLoad&quot;,
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;stressLoad&quot;,
      &quot;image&quot;: &quot;containerstack/alpine-stress&quot;,
      &quot;cpu&quot;: 0,
      &quot;memory&quot;: 512,
      &quot;memoryReservation&quot;: 512,
      &quot;portMappings&quot;: [],
      &quot;essential&quot;: true,
      &quot;entryPoint&quot;: [&quot;sh&quot;, &quot;-c&quot;],
      &quot;command&quot;: [
        &quot;/usr/local/bin/stress --cpu 2 --io 2 --vm 1 --vm-bytes 128M --timeout 6000s&quot;
      ],
      &quot;environment&quot;: [],
      &quot;mountPoints&quot;: [],
      &quot;volumesFrom&quot;: [],
      &quot;logConfiguration&quot;: {
        &quot;logDriver&quot;: &quot;awslogs&quot;,
        &quot;options&quot;: {
          &quot;awslogs-create-group&quot;: &quot;true&quot;,
          &quot;awslogs-group&quot;: &quot;/ecs/&quot;,
          &quot;awslogs-region&quot;: &quot;us-west-2&quot;,
          &quot;awslogs-stream-prefix&quot;: &quot;ecs&quot;
        }
      }
    }
  ],
  &quot;taskRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;executionRoleArn&quot;: &quot;arn:aws:iam::xxxxx:role/ecsTaskExecutionRole&quot;,
  &quot;networkMode&quot;: &quot;awsvpc&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;EC2&quot;],
  &quot;cpu&quot;: &quot;256&quot;,
  &quot;memory&quot;: &quot;512&quot;,
  &quot;runtimePlatform&quot;: {
    &quot;cpuArchitecture&quot;: &quot;X86_64&quot;,
    &quot;operatingSystemFamily&quot;: &quot;LINUX&quot;
  }
}
</code></pre>
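<p>Before registering either definition, it can save a round-trip to catch a missing role ARN locally. This sketch (our own helper, not part of the blog's setup) unmarshals the task JSON with Go's standard library and checks the two role fields:</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskDef captures only the fields we want to sanity-check.
type taskDef struct {
	Family           string `json:"family"`
	TaskRoleArn      string `json:"taskRoleArn"`
	ExecutionRoleArn string `json:"executionRoleArn"`
}

// checkTaskDef returns an error if the JSON is invalid or either
// role ARN is missing.
func checkTaskDef(raw []byte) error {
	var td taskDef
	if err := json.Unmarshal(raw, &td); err != nil {
		return fmt.Errorf("invalid task JSON: %w", err)
	}
	if td.TaskRoleArn == "" || td.ExecutionRoleArn == "" {
		return fmt.Errorf("task %q is missing a role ARN", td.Family)
	}
	return nil
}

func main() {
	raw := []byte(`{"family": "NGINX", "taskRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole", "executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole"}`)
	fmt.Println(checkTaskDef(raw)) // <nil> when both ARNs are present
}
```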
<p>Once you have defined the tasks, ensure you bring up each service (one for each task) with the launch type of EC2:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-14-environment.png" alt="environment" /></p>
<p>You should have two services running now.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-15-ec2basedcluster.png" alt="ec2basedcluster" /></p>
<h3>Step 6: Check on metrics and logs in Elastic Cloud</h3>
<p>Go to Elastic Cloud and ensure that you are getting metrics and logs from the ECS Cluster. First, check to see if you are receiving metrics by viewing the built-in dashboard called [Metrics Docker] Overview.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-16-docker.png" alt="Docker image" /></p>
<p><strong>With some work on this dashboard, adding in Container Insights metrics and Docker metrics, you should be able to see:</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-3-metrics-graphs.png" alt="graphs" /></p>
<p>If you only have the ECS integration and the Elastic Agent from Step 4, then you will need to create a new dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-2-containers-in-cluster.png" alt="cluster" /></p>
<p>This dashboard can be set up with the following metrics:</p>
<ul>
<li>Containers in the cluster (containerInsights via Elastic Agent and AWS Cloudwatch integration). Set up a TSVB panel using the following metric: aws.dimensions.ClusterName : &quot;EC2BasedCluster&quot; with aws.containerinsights.metrics.TaskCount.max</li>
<li>Services in the cluster (containerInsights via Elastic Agent and AWS Cloudwatch integration). Use the following configuration to setup the chart:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-17-table.png" alt="table" /></p>
<ul>
<li>CPU and memory utilization of the ECS Cluster (Elastic Agent with ECS integration). Use the following configuration to set up both CPU and memory utilization charts:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-18-line.png" alt="line" /></p>
<ul>
<li>EC2 CPU and memory utilization of the instance in the cluster (Elastic Agent with EC2 integration). Use the following configuration to set up both CPU and memory utilization charts:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-19-bar-vertical-stacked.png" alt="bar vertical stacked" /></p>
<ul>
<li>(Not shown): CPU and memory utilization per container (via containerInsights via Elastic Agent and AWS Cloudwatch integration)</li>
</ul>
<h3>Step 7: Look at logs from your ECS cluster</h3>
<p>Since we set up AWS CloudWatch logs collection in Step 4, we can view these logs in Discover by filtering on the log group ARN /aws/ecs/containerinsights/EC2BasedCluster/.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/elastic-blog-20-logs.png" alt="logs" /></p>
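<p>A filter along these lines scopes Discover to that Container Insights log group. Note that the field name <code>aws.cloudwatch.log_group</code> is an assumption here and may differ across AWS integration versions:</p>

```
aws.cloudwatch.log_group : "/aws/ecs/containerinsights/EC2BasedCluster/*"
```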
<h2>Summary</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help with <a href="https://www.elastic.co/observability/aws-monitoring">AWS monitoring</a> of your ECS service metrics. Here’s a quick recap of what you learned:</p>
<ul>
<li>Elastic Observability supports ingesting and analysis of AWS ECS service metrics and the corresponding EC2 metrics through the AWS integration on the Elastic Agent. It’s easy to set up ingest from AWS Services via the Elastic Agent.</li>
<li>Elastic Observability can also get container metrics via the Docker integration running on Elastic agents on each of the EC2 instances in the ECS EC2 auto scaling group.</li>
<li>Elastic has multiple out-of-the-box (OOTB) AWS service dashboards that can be used as baselines to get your own customized view.</li>
</ul>
<p>Ready to get started? Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manage-applications-amazon-ecs-ec2-clusters-observability/library-branding-elastic-observability-midnight-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of Go applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-apps-opentelemetry</guid>
            <pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Go applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Go packages and, in particular, work on instrumenting tracing in a Go application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and have a large support community reducing vendor lock-in.</p>
<p>In this blog post, we will show you how to manually instrument Go applications using OpenTelemetry. This approach is slightly more complex than using auto-instrumentation, but it gives you finer control over what gets instrumented.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the OpenTelemetry demo and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry. In this blog, we will use <a href="https://github.com/elastic/observability-examples">an alternative demo application</a>, which helps highlight manual instrumentation in a simple way.</p>
<p>Finally, we will discuss how Elastic supports mixed-mode applications, which run with Elastic and OpenTelemetry agents. The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/GO-flowhcart.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis and its alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Go application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Go</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code including the Dockerfile used in this blog can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple Go application, which can receive a GET request. Note that the code shown here is a slightly abbreviated version.</p>
<pre><code class="language-go">package main

import (
	&quot;log&quot;
	&quot;net/http&quot;
	&quot;os&quot;
	&quot;time&quot;

	&quot;github.com/go-redis/redis/v8&quot;

	&quot;github.com/sirupsen/logrus&quot;

	&quot;github.com/gin-gonic/gin&quot;
	&quot;strconv&quot;
	&quot;math/rand&quot;
)

var logger = &amp;logrus.Logger{
	Out:   os.Stderr,
	Hooks: make(logrus.LevelHooks),
	Level: logrus.InfoLevel,
	Formatter: &amp;logrus.JSONFormatter{
		FieldMap: logrus.FieldMap{
			logrus.FieldKeyTime:  &quot;@timestamp&quot;,
			logrus.FieldKeyLevel: &quot;log.level&quot;,
			logrus.FieldKeyMsg:   &quot;message&quot;,
			logrus.FieldKeyFunc:  &quot;function.name&quot;, // non-ECS
		},
		TimestampFormat: time.RFC3339Nano,
	},
}

func main() {
	delayTime, _ := strconv.Atoi(os.Getenv(&quot;TOGGLE_SERVICE_DELAY&quot;))

	redisHost := os.Getenv(&quot;REDIS_HOST&quot;)
	if redisHost == &quot;&quot; {
		redisHost = &quot;localhost&quot;
	}

	redisPort := os.Getenv(&quot;REDIS_PORT&quot;)
	if redisPort == &quot;&quot; {
		redisPort = &quot;6379&quot;
	}

	applicationPort := os.Getenv(&quot;APPLICATION_PORT&quot;)
	if applicationPort == &quot;&quot; {
		applicationPort = &quot;5000&quot;
	}

	// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})

	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)

	r.GET(&quot;/favorites&quot;, func(c *gin.Context) {
		// artificial sleep for delayTime
		time.Sleep(time.Duration(delayTime) * time.Millisecond)

		userID := c.Query(&quot;user_id&quot;)

		contextLogger(c).Infof(&quot;Getting favorites for user %q&quot;, userID)

		favorites, err := rdb.SMembers(c.Request.Context(), userID).Result()
		if err != nil {
			contextLogger(c).Error(&quot;Failed to get favorites for user %q&quot;, userID)
			c.String(http.StatusInternalServerError, &quot;Failed to get favorites&quot;)
			return
		}

		contextLogger(c).Infof(&quot;User %q has favorites %q&quot;, userID, favorites)

		c.JSON(http.StatusOK, gin.H{
			&quot;favorites&quot;: favorites,
		})
	})

	// Start server
	logger.Infof(&quot;App startup&quot;)
	log.Fatal(http.ListenAndServe(&quot;:&quot;+applicationPort, r))
	logger.Infof(&quot;App stopped&quot;)
}
</code></pre>
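<p>The <code>/favorites</code> handler above expects the user ID as a query parameter. As a quick client-side sketch, the request URL can be built with the standard library (host and port here are assumptions matching the defaults above):</p>

```go
package main

import (
	"fmt"
	"net/url"
)

// favoritesURL builds the GET URL the service above serves,
// e.g. http://localhost:5000/favorites?user_id=1.
func favoritesURL(host, port, userID string) string {
	q := url.Values{}
	q.Set("user_id", userID) // query-escapes the user ID
	return fmt.Sprintf("http://%s:%s/favorites?%s", host, port, q.Encode())
}

func main() {
	fmt.Println(favoritesURL("localhost", "5000", "1"))
}
```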
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/elastic-blog-4-free-trial.png" alt="free trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional packages to our application.</p>
<pre><code class="language-go">import (
      &quot;github.com/go-redis/redis/extra/redisotel/v8&quot;
      &quot;go.opentelemetry.io/otel&quot;
      &quot;go.opentelemetry.io/otel/attribute&quot;
      &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace&quot;
    &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc&quot;

	&quot;go.opentelemetry.io/otel/propagation&quot;

	&quot;google.golang.org/grpc/credentials&quot;
	&quot;crypto/tls&quot;

      sdktrace &quot;go.opentelemetry.io/otel/sdk/trace&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin&quot;

	&quot;go.opentelemetry.io/otel/trace&quot;
	&quot;go.opentelemetry.io/otel/codes&quot;
)
</code></pre>
<p>This code imports necessary OpenTelemetry packages, including those for tracing, exporting, and instrumenting specific libraries like Redis.</p>
<p>Next, we read the &quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot; variable and initialize the exporter.</p>
<pre><code class="language-go">var (
    collectorURL = os.Getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;)
)
var tracer trace.Tracer


func initTracer() func(context.Context) error {
	tracer = otel.Tracer(&quot;go-favorite-otel-manual&quot;)

	// remove https:// from the collector URL if it exists
	collectorURL = strings.Replace(collectorURL, &quot;https://&quot;, &quot;&quot;, 1)
	secretToken := os.Getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;)
	if secretToken == &quot;&quot; {
		log.Fatal(&quot;ELASTIC_APM_SECRET_TOKEN is required&quot;)
	}

	secureOption := otlptracegrpc.WithInsecure()
    exporter, err := otlptrace.New(
        context.Background(),
        otlptracegrpc.NewClient(
            secureOption,
            otlptracegrpc.WithEndpoint(collectorURL),
			otlptracegrpc.WithHeaders(map[string]string{
				&quot;Authorization&quot;: &quot;Bearer &quot; + secretToken,
			}),
			otlptracegrpc.WithTLSCredentials(credentials.NewTLS(&amp;tls.Config{})),
        ),
    )

    if err != nil {
        log.Fatal(err)
    }

    otel.SetTracerProvider(
        sdktrace.NewTracerProvider(
            sdktrace.WithSampler(sdktrace.AlwaysSample()),
            sdktrace.WithBatcher(exporter),
        ),
    )
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.Baggage{},
			propagation.TraceContext{},
		),
	)
    return exporter.Shutdown
}
</code></pre>
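<p>The exporter above builds the Authorization header from ELASTIC_APM_SECRET_TOKEN directly. The full example later in this post instead reads the standard OTEL_EXPORTER_OTLP_HEADERS variable, which holds comma-separated key=value pairs. That parsing needs nothing beyond the standard library; the sketch below uses <code>strings.SplitN</code> so that '=' characters inside a value (common in bearer tokens) survive, a slight hardening over a plain <code>strings.Split</code>:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseOTLPHeaders converts the OTEL_EXPORTER_OTLP_HEADERS format
// ("key1=value1,key2=value2") into a map suitable for
// otlptracegrpc.WithHeaders.
func parseOTLPHeaders(raw string) map[string]string {
	headers := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		// SplitN keeps any '=' inside the value intact, which matters
		// for values like "Authorization=Bearer abc==".
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) == 2 {
			headers[strings.TrimSpace(parts[0])] = parts[1]
		}
	}
	return headers
}

func main() {
	h := parseOTLPHeaders("Authorization=Bearer secret,X-Tenant=demo")
	fmt.Println(h["Authorization"], h["X-Tenant"])
}
```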
<p>To instrument connections to Redis, we add a tracing hook to the client, and to instrument Gin, we add the OTel middleware. Since Gin is then fully instrumented, all interactions with our application are captured automatically, as are all outgoing connections to Redis.</p>
<pre><code class="language-go">// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})
	rdb.AddHook(redisotel.NewTracingHook())
	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)
	r.Use(otelgin.Middleware(&quot;go-favorite-otel-manual&quot;))
</code></pre>
<p><strong>Adding custom spans</strong><br />
Now that we have everything added and initialized, we can add custom spans.</p>
<p>If we want to have additional instrumentation for a part of our app, we simply start a custom span and then defer ending the span.</p>
<pre><code class="language-go">// start otel span
ctx := c.Request.Context()
ctx, span := tracer.Start(ctx, &quot;add_favorite_movies&quot;)
defer span.End()
</code></pre>
<p>For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/go-favorite-otel-manual">GitHub</a>.</p>
<pre><code class="language-go">package main

import (
	&quot;log&quot;
	&quot;net/http&quot;
	&quot;os&quot;
	&quot;time&quot;
	&quot;context&quot;

	&quot;github.com/go-redis/redis/v8&quot;
	&quot;github.com/go-redis/redis/extra/redisotel/v8&quot;


	&quot;github.com/sirupsen/logrus&quot;

	&quot;github.com/gin-gonic/gin&quot;

  &quot;go.opentelemetry.io/otel&quot;
  &quot;go.opentelemetry.io/otel/attribute&quot;
  &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace&quot;
  &quot;go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc&quot;

	&quot;go.opentelemetry.io/otel/propagation&quot;

	&quot;google.golang.org/grpc/credentials&quot;
	&quot;crypto/tls&quot;

  sdktrace &quot;go.opentelemetry.io/otel/sdk/trace&quot;

	&quot;go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin&quot;

	&quot;go.opentelemetry.io/otel/trace&quot;

	&quot;strings&quot;
	&quot;strconv&quot;
	&quot;math/rand&quot;
	&quot;go.opentelemetry.io/otel/codes&quot;

)

var tracer trace.Tracer

func initTracer() func(context.Context) error {
	tracer = otel.Tracer(&quot;go-favorite-otel-manual&quot;)

	collectorURL = strings.Replace(collectorURL, &quot;https://&quot;, &quot;&quot;, 1)

	secureOption := otlptracegrpc.WithInsecure()

	// split otlpHeaders by comma and convert to map
	headers := make(map[string]string)
	for _, header := range strings.Split(otlpHeaders, &quot;,&quot;) {
		headerParts := strings.Split(header, &quot;=&quot;)

		if len(headerParts) == 2 {
			headers[headerParts[0]] = headerParts[1]
		}
	}

	exporter, err := otlptrace.New(
		context.Background(),
		otlptracegrpc.NewClient(
			secureOption,
			otlptracegrpc.WithEndpoint(collectorURL),
			otlptracegrpc.WithHeaders(headers),
			otlptracegrpc.WithTLSCredentials(credentials.NewTLS(&amp;tls.Config{})),
		),
	)
	if err != nil {
		log.Fatal(err)
	}

	otel.SetTracerProvider(
		sdktrace.NewTracerProvider(
			sdktrace.WithSampler(sdktrace.AlwaysSample()),
			sdktrace.WithBatcher(exporter),
			//sdktrace.WithResource(resources),
		),
	)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.Baggage{},
			propagation.TraceContext{},
		),
	)
	return exporter.Shutdown
}

var (
	collectorURL = os.Getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;)
	otlpHeaders  = os.Getenv(&quot;OTEL_EXPORTER_OTLP_HEADERS&quot;)
)


var logger = &amp;logrus.Logger{
	Out:   os.Stderr,
	Hooks: make(logrus.LevelHooks),
	Level: logrus.InfoLevel,
	Formatter: &amp;logrus.JSONFormatter{
		FieldMap: logrus.FieldMap{
			logrus.FieldKeyTime:  &quot;@timestamp&quot;,
			logrus.FieldKeyLevel: &quot;log.level&quot;,
			logrus.FieldKeyMsg:   &quot;message&quot;,
			logrus.FieldKeyFunc:  &quot;function.name&quot;, // non-ECS
		},
		TimestampFormat: time.RFC3339Nano,
	},
}

func main() {
	cleanup := initTracer()
	defer cleanup(context.Background())

	redisHost := os.Getenv(&quot;REDIS_HOST&quot;)
	if redisHost == &quot;&quot; {
		redisHost = &quot;localhost&quot;
	}

	redisPort := os.Getenv(&quot;REDIS_PORT&quot;)
	if redisPort == &quot;&quot; {
		redisPort = &quot;6379&quot;
	}

	applicationPort := os.Getenv(&quot;APPLICATION_PORT&quot;)
	if applicationPort == &quot;&quot; {
		applicationPort = &quot;5000&quot;
	}

	// Initialize Redis client
	rdb := redis.NewClient(&amp;redis.Options{
		Addr:     redisHost + &quot;:&quot; + redisPort,
		Password: &quot;&quot;,
		DB:       0,
	})
	rdb.AddHook(redisotel.NewTracingHook())


	// Initialize router
	r := gin.New()
	r.Use(logrusMiddleware)
	r.Use(otelgin.Middleware(&quot;go-favorite-otel-manual&quot;))


	// Define routes
	r.GET(&quot;/&quot;, func(c *gin.Context) {
		contextLogger(c).Infof(&quot;Main request successful&quot;)
		c.String(http.StatusOK, &quot;Hello World!&quot;)
	})

	r.GET(&quot;/favorites&quot;, func(c *gin.Context) {
		// artificial sleep for delayTime
		time.Sleep(time.Duration(delayTime) * time.Millisecond)

		userID := c.Query(&quot;user_id&quot;)

		contextLogger(c).Infof(&quot;Getting favorites for user %q&quot;, userID)

		favorites, err := rdb.SMembers(c.Request.Context(), userID).Result()
		if err != nil {
			contextLogger(c).Errorf(&quot;Failed to get favorites for user %q&quot;, userID)
			c.String(http.StatusInternalServerError, &quot;Failed to get favorites&quot;)
			return
		}

		contextLogger(c).Infof(&quot;User %q has favorites %q&quot;, userID, favorites)

		c.JSON(http.StatusOK, gin.H{
			&quot;favorites&quot;: favorites,
		})
	})

	// Start server
	logger.Infof(&quot;App startup&quot;)
	log.Fatal(http.ListenAndServe(&quot;:&quot;+applicationPort, r))
}
</code></pre>
<h3>Step 2. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/">OTEL documentation</a>, we will use environment variables and pass in the configuration values that are found in your APM Agent’s configuration section.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Where to get these variables in Elastic Cloud and Kibana<sup>®</sup></strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/elastic-blog-GO-apm-agents.png" alt="GO apm agents" /></p>
<p>You will need to copy the OTEL_EXPORTER_OTLP_ENDPOINT as well as the OTEL_EXPORTER_OTLP_HEADERS.</p>
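<p>For reference, the value of OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs, which the sample app splits into a header map inside initTracer. A standalone sketch of that parsing follows; note it uses SplitN rather than the sample's Split, a slight tightening so that values containing an equals sign (such as padded Bearer tokens) survive intact:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseOTLPHeaders converts a comma-separated list of key=value pairs
// (the OTEL_EXPORTER_OTLP_HEADERS format) into a header map, mirroring
// the loop in the sample app's initTracer.
func parseOTLPHeaders(otlpHeaders string) map[string]string {
	headers := make(map[string]string)
	for _, header := range strings.Split(otlpHeaders, ",") {
		// SplitN keeps any '=' characters inside the value intact
		headerParts := strings.SplitN(header, "=", 2)
		if len(headerParts) == 2 {
			headers[headerParts[0]] = headerParts[1]
		}
	}
	return headers
}

func main() {
	h := parseOTLPHeaders("Authorization=Bearer abc123==,X-Tenant=movies")
	fmt.Println(h["Authorization"]) // Bearer abc123==
	fmt.Println(h["X-Tenant"])      // movies
}
```

Pairs without an equals sign are silently skipped, matching the sample app's behavior.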
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t  go-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=go-favorite-otel-manual&quot; \
       -p 5000:5000 \
       go-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using Docker compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites
# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h2>How do the traces show up in Elastic?</h2>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Go service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/GO-trace-samples.png" alt="trace samples" /></p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Go with OpenTelemetry</li>
<li>How to properly initialize OpenTelemetry and add a custom span</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Go with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-apps-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-apps-opentelemetry/observability-launch-series-5-go-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of Java applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-java-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-java-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry provides an observability framework for cloud-native software, allowing us to trace, monitor, and debug applications seamlessly. In this post, we'll explore how to manually instrument a Java application using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring—it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles: from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
</ul>
<p>In this blog, we will explore how you can use <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">manual instrumentation in your Java</a> application using Docker, without the need to refactor any part of your application code. We will use an <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">automatic instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several microservices written in .NET, Node.js, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-1-config.png" alt="Elastic configuration options for OpenTelemetry" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis effort, and alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Java application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Java</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>In particular, we will be working through the following file:</p>
<pre><code class="language-bash">Elastiflix/java-favorite/src/main/java/com/movieapi/ApiServlet.java
</code></pre>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Set up OpenTelemetry</h3>
<p>The first step is to set up the OpenTelemetry SDK in your Java application. You can start by adding the OpenTelemetry Java SDK and its dependencies to your project's build file, such as Maven or Gradle. In our example application, we are using Maven. Add the dependencies below to your pom.xml:</p>
<pre><code class="language-xml">&lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry.instrumentation&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-logback-mdc-1.0&lt;/artifactId&gt;
      &lt;version&gt;1.25.1-alpha&lt;/version&gt;
    &lt;/dependency&gt;

    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-api&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-sdk&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-exporter-otlp&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-semconv&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-exporter-otlp-logs&lt;/artifactId&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;io.opentelemetry.instrumentation&lt;/groupId&gt;
      &lt;artifactId&gt;opentelemetry-logback-appender-1.0&lt;/artifactId&gt;
      &lt;version&gt;1.25.1-alpha&lt;/version&gt;
    &lt;/dependency&gt;
</code></pre>
<p>And add the following bill of materials from OpenTelemetry too:</p>
<pre><code class="language-xml">&lt;dependencyManagement&gt;
    &lt;dependencies&gt;
      &lt;dependency&gt;
        &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
        &lt;artifactId&gt;opentelemetry-bom&lt;/artifactId&gt;
        &lt;version&gt;1.25.0&lt;/version&gt;
        &lt;type&gt;pom&lt;/type&gt;
        &lt;scope&gt;import&lt;/scope&gt;
      &lt;/dependency&gt;
      &lt;dependency&gt;
        &lt;groupId&gt;io.opentelemetry&lt;/groupId&gt;
        &lt;artifactId&gt;opentelemetry-bom-alpha&lt;/artifactId&gt;
        &lt;version&gt;1.25.0-alpha&lt;/version&gt;
        &lt;type&gt;pom&lt;/type&gt;
        &lt;scope&gt;import&lt;/scope&gt;
      &lt;/dependency&gt;
    &lt;/dependencies&gt;
  &lt;/dependencyManagement&gt;
</code></pre>
<h3>Step 2. Add the application configuration</h3>
<p>We recommend that you add the following configuration to the application’s main method, to start before any application code. Doing it like this gives you a bit more control and flexibility and ensures that OpenTelemetry will be available at any stage of the application lifecycle. In the examples, we put this code before the Spring Boot Application startup. Elastic supports OTLP over HTTP and OTLP over GRPC. In this example, we are using GRPC.</p>
<pre><code class="language-java">String SERVICE_NAME = System.getenv(&quot;OTEL_SERVICE_NAME&quot;);

// set service name on all OTel signals
Resource resource = Resource.getDefault().merge(Resource.create(Attributes.of(
        ResourceAttributes.SERVICE_NAME, SERVICE_NAME,
        ResourceAttributes.SERVICE_VERSION, &quot;1.0&quot;,
        ResourceAttributes.DEPLOYMENT_ENVIRONMENT, &quot;production&quot;)));

// init OTel logger provider with export to OTLP
SdkLoggerProvider sdkLoggerProvider = SdkLoggerProvider.builder().setResource(resource)
        .addLogRecordProcessor(BatchLogRecordProcessor.builder(OtlpGrpcLogRecordExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// init OTel trace provider with export to OTLP
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder().setResource(resource).setSampler(Sampler.alwaysOn())
        .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// init OTel meter provider with export to OTLP
SdkMeterProvider sdkMeterProvider = SdkMeterProvider.builder().setResource(resource)
        .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder()
                .setEndpoint(System.getenv(&quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;))
                .addHeader(&quot;Authorization&quot;, &quot;Bearer &quot; + System.getenv(&quot;ELASTIC_APM_SECRET_TOKEN&quot;))
                .build()).build())
        .build();

// create sdk object and set it as global
OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
        .setTracerProvider(sdkTracerProvider)
        .setLoggerProvider(sdkLoggerProvider)
        .setMeterProvider(sdkMeterProvider)
        .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
        .build();

GlobalOpenTelemetry.set(sdk);
// connect logger
GlobalLoggerProvider.set(sdk.getSdkLoggerProvider());
// Add hook to close SDK, which flushes logs
Runtime.getRuntime().addShutdownHook(new Thread(sdk::close));
</code></pre>
<h3>Step 3. Create the Tracer and start the OpenTelemetry Span inside the TracingFilter</h3>
<p>In the Spring Boot example, you will notice that we have a TracingFilter class, which extends the OncePerRequestFilter class. This Filter is a component placed at the front of the request processing chain. Its primary roles are to intercept incoming requests and outgoing responses, performing tasks such as logging, authentication, transformation of request/response entities, and more. So what we do here is intercept the request as it comes into the Favorite service, so that we can pull out the headers, which may contain tracing information from upstream systems.</p>
<p>We start by using the OpenTelemetry Tracer, which is a core component of OpenTelemetry that allows you to create spans, start and stop them, and add attributes and events. In your Java code, import the necessary OpenTelemetry classes and create an instance of the Tracer within your application.</p>
<p>We use this to create a new downstream span, which will continue as a child of the span created in the upstream system, using the information we got from the upstream request. In our Elastiflix example, this will be the Node.js application.</p>
<pre><code class="language-java">@Override
protected void doFilterInternal(jakarta.servlet.http.HttpServletRequest request, jakarta.servlet.http.HttpServletResponse response, jakarta.servlet.FilterChain filterChain) throws jakarta.servlet.ServletException, IOException {
        Tracer tracer = GlobalOpenTelemetry.getTracer(SERVICE_NAME);

        Context extractedContext = GlobalOpenTelemetry.getPropagators()
                .getTextMapPropagator()
                .extract(Context.current(), request, getter);

        Span span = tracer.spanBuilder(request.getRequestURI())
                .setSpanKind(SpanKind.SERVER)
                .setParent(extractedContext)
                .startSpan();

        try (Scope scope = span.makeCurrent()) {
            filterChain.doFilter(request, response);
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
</code></pre>
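<p>Under the hood, what the propagator extracts from the request is the W3C trace context, carried in a <code>traceparent</code> header of the form <code>00-&lt;trace-id&gt;-&lt;span-id&gt;-&lt;flags&gt;</code>. As a rough, illustrative sketch of that wire format (this is not the OTel implementation, which you should always use in real code):</p>

```java
public class TraceparentDemo {
    // Split a W3C traceparent header (version-traceId-spanId-flags) into its
    // four fields; returns null when the header does not have the expected shape.
    static String[] parseTraceparent(String header) {
        if (header == null) {
            return null;
        }
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            return null;
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parseTraceparent(
                "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
        // prints the 32-char trace id and 16-char parent span id
        System.out.println("trace-id: " + parts[1]);
        System.out.println("span-id: " + parts[2]);
    }
}
```

The extract call in the filter above reads this header from the incoming request and reconstructs the parent span context, which is why the server span it starts joins the caller's trace instead of beginning a new one.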
<h3>Step 4. Instrument other interesting code with spans</h3>
<p>To instrument with spans and track specific regions of your code, you can use the Tracer's SpanBuilder to create spans. To accurately measure the duration of a specific operation, make sure to start and stop the spans at the appropriate locations in your code. Use the startSpan and endSpan methods provided by the Tracer to mark the beginning and end of the span. For example, you can create a span around a specific method or operation in your code, as shown here in the handleCanary method:</p>
<pre><code class="language-java">private void handleCanary() throws Exception {
    Span span = GlobalOpenTelemetry.getTracer(SERVICE_NAME).spanBuilder(&quot;handleCanary&quot;).startSpan();
    Scope scope = span.makeCurrent();

    // ...

    span.setStatus(StatusCode.OK);
    span.end();
    scope.close();
}
</code></pre>
<h3>Step 5. Add attributes and events to spans</h3>
<p>You can enhance the spans with additional attributes and events to provide more context and details about the operation being tracked. Attributes can be key-value pairs that describe the span, while events can be used to mark significant points in the span's lifecycle. This is also shown in the handleCanary method:</p>
<pre><code class="language-java">private void handleCanary() throws Exception {
    Span.current().setAttribute(&quot;canary&quot;, &quot;test-new-feature&quot;);
    Span.current().setAttribute(&quot;quiz_solution&quot;, &quot;correlations&quot;);

    Span.current().addEvent(&quot;a span event&quot;, Attributes
            .of(AttributeKey.longKey(&quot;someKey&quot;), Long.valueOf(93)));
}
</code></pre>
<h3>Step 6. Instrument backends</h3>
<p>Let's consider an example where we are instrumenting a Redis database call. We're using the Java OpenTelemetry SDK, and our goal is to create a trace that captures each &quot;Post User Favorites&quot; operation to the database.</p>
<p>Below is the Java method that performs the operation and collects telemetry data:</p>
<pre><code class="language-java">public void postUserFavorites(String user_id, String movieID) {
  ...
}
</code></pre>
<p>Let's go through it line by line:</p>
<p><strong>Initializing a span</strong><br />
The first important line of our method is where we initialize a span. A span represents a single operation within a trace, which could be a database call, a remote procedure call (RPC), or any segment of code that you want to measure.</p>
<pre><code class="language-java">Span span = GlobalOpenTelemetry.getTracer(SERVICE_NAME).spanBuilder(&quot;Redis.Post&quot;).setSpanKind(SpanKind.CLIENT).startSpan();
</code></pre>
<p><strong>Setting span attributes</strong><br />
Next, we add attributes to our span. Attributes are key-value pairs that provide additional information about the span. In order to get the backend call to appear correctly in the service map, it is critical that the attributes are set correctly for the backend call type. In this example, we set the db.system attribute to redis.</p>
<pre><code class="language-java">span.setAttribute(&quot;db.system&quot;, &quot;redis&quot;);
span.setAttribute(&quot;db.connection_string&quot;, redisHost);
span.setAttribute(
  &quot;db.statement&quot;,
  &quot;POST user_id &quot; + user_id + &quot; AND movie_id &quot; + movieID
);
</code></pre>
<p>This will ensure that calls to the Redis backend are tracked, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-3-flowchart.png" alt="flowchart" /></p>
<p><strong>Capturing the result of the operation</strong><br />
We then execute the operation we're interested in, within a try-catch block. If an exception occurs during the execution of the operation, we record it in the span.</p>
<pre><code class="language-java">try (Scope scope = span.makeCurrent()) {
    ...
} catch (Exception e) {
    span.setStatus(StatusCode.ERROR, &quot;Error while getting data from Redis&quot;);
    span.recordException(e);
}
</code></pre>
<p><strong>Closing resources</strong><br />
Finally, we close the Redis connection and end the span.</p>
<pre><code class="language-java">finally {
    jedis.close();
    span.end();
}
</code></pre>
<h3>Step 7. Configure logging</h3>
<p>Logging is an essential part of application monitoring and troubleshooting. OpenTelemetry allows you to integrate with existing logging frameworks, such as Logback or Log4j, to capture logs along with the telemetry data. Configure the logging framework of your choice to capture logs related to the instrumented spans. In our example application, check out the logback configuration, which shows how to export logs directly to Elastic.</p>
<pre><code class="language-xml">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;configuration debug=&quot;true&quot;&gt;

    &lt;appender name=&quot;otel-otlp&quot;
        class=&quot;io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender&quot;&gt;
        &lt;captureExperimentalAttributes&gt;false&lt;/captureExperimentalAttributes&gt;
        &lt;captureCodeAttributes&gt;true&lt;/captureCodeAttributes&gt;
        &lt;captureKeyValuePairAttributes&gt;true&lt;/captureKeyValuePairAttributes&gt;
    &lt;/appender&gt;

    &lt;appender name=&quot;STDOUT&quot; class=&quot;ch.qos.logback.core.ConsoleAppender&quot;&gt;
        &lt;encoder&gt;
            &lt;pattern&gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&lt;/pattern&gt;
        &lt;/encoder&gt;
    &lt;/appender&gt;

    &lt;root level=&quot;DEBUG&quot;&gt;
     &lt;appender-ref ref=&quot;otel-otlp&quot; /&gt;
        &lt;appender-ref ref=&quot;STDOUT&quot; /&gt;

    &lt;/root&gt;
&lt;/configuration&gt;
</code></pre>
<h3>Step 8. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OTEL Java documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variable:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
</code></pre>
<p>As well as the token from:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the Docker image</strong></p>
<pre><code class="language-bash">docker build -t java-otel-manual-image .
</code></pre>
<p><strong>Run the Docker image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&quot; \
       -e ELASTIC_APM_SECRET_TOKEN=&quot;REPLACE WITH TOKEN&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;java-favorite-otel-manual&quot; \
       -p 5000:5000 \
       java-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 9. Explore traces and logs in Elastic APM</h3>
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /favorites), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-5-services.png" alt="services" /></p>
<p>It begins by tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-6-java-fave-otel.png" alt="java favorite otel graph" /></p>
<p>And look at specific transactions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-7-graph1.png" alt="graph2" /></p>
<p>Click on <strong>Logs</strong>, and we see that logs are also brought over. The OTel agent will automatically bring in logs and correlate them with traces for you:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/elastic-blog-8-graph2.png" alt="graph3" /></p>
<p>This gives you complete visibility across logs, metrics, and traces!</p>
<h2>Wrapping up</h2>
<p>Manually instrumenting your Java applications with OpenTelemetry gives you greater control over what to track and monitor. By following the steps outlined in this blog post, you can effectively monitor the performance of your Java applications, identify issues, and gain insights into the overall health of your application.</p>
<p>Remember, OpenTelemetry is a powerful tool, and proper instrumentation requires careful consideration of what metrics, traces, and logs are essential for your specific use case. Experiment with different configurations, leverage the OpenTelemetry SDK for Java documentation, and continuously iterate to achieve the observability goals of your application.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Java with OpenTelemetry</li>
<li>How to properly initialize and instrument spans</li>
<li>How to easily set <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> from Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provided an easy-to-understand walk-through of instrumenting Java with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-java-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-java-apps-opentelemetry/observability-launch-series-3-java-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation of .NET applications with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-net-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-net-apps-opentelemetry</guid>
            <pubDate>Fri, 01 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we will look at how to manually instrument your .NET applications using OpenTelemetry, which provides a set of APIs, libraries, and agents to capture distributed traces and metrics from your application. You can analyze them in Elastic.]]></description>
            <content:encoded><![CDATA[<p>In the fast-paced universe of software development, especially in the cloud-native realm, DevOps and SRE teams are increasingly emerging as essential partners in application stability and growth.</p>
<p>DevOps engineers continuously optimize software delivery, while SRE teams act as the stewards of application reliability, scalability, and top-tier performance. The challenge? These teams require a cutting-edge observability solution, one that encompasses full-stack insights, empowering them to rapidly manage, monitor, and rectify potential disruptions before they culminate into operational challenges.</p>
<p>Observability in our modern distributed software ecosystem goes beyond mere monitoring — it demands limitless data collection, precision in processing, and the correlation of this data into actionable insights. However, the road to achieving this holistic view is paved with obstacles, from navigating version incompatibilities to wrestling with restrictive proprietary code.</p>
<p>Enter <a href="https://opentelemetry.io/">OpenTelemetry (OTel)</a>, with the following benefits for those who adopt it:</p>
<ul>
<li>Escape vendor constraints with OTel, freeing yourself from vendor lock-in and ensuring top-notch observability.</li>
<li>See the harmony of unified logs, metrics, and traces come together to provide a complete system view.</li>
<li>Improve your application oversight through richer and enhanced instrumentations.</li>
<li>Embrace the benefits of backward compatibility to protect your prior instrumentation investments.</li>
<li>Embark on the OpenTelemetry journey with an easy learning curve, simplifying onboarding and scalability.</li>
<li>Rely on a proven, future-ready standard to boost your confidence in every investment.</li>
<li>Explore manual instrumentation, enabling customized data collection to fit your unique needs.</li>
<li>Ensure monitoring consistency across layers with a standardized observability data framework.</li>
<li>Decouple development from operations, driving peak efficiency for both.</li>
</ul>
<p>In this post, we will dive into the methodology to instrument a .NET application manually using Docker.</p>
<h2>What's covered?</h2>
<ul>
<li>Instrumenting the .NET application manually</li>
<li>Creating a Docker image for a .NET application with the OpenTelemetry instrumentation baked in</li>
<li>Building and running the instrumented application in Docker, and sending traces to Elastic Observability</li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>An understanding of Docker and .NET</li>
<li>Elastic Cloud</li>
<li>Docker installed on your machine (we recommend Docker Desktop)</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/dotnet-login">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<h2>Step-by-step guide</h2>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/elastic-blog-2-free-trial.png" alt="" /></p>
<h2>Step 1. Getting started</h2>
<p>In our demonstration, we will manually instrument a .NET Core application - Login. This application simulates a simple user login service. In this example, we are only looking at Tracing since the OpenTelemetry logging instrumentation is currently at mixed maturity, as mentioned <a href="https://opentelemetry.io/docs/instrumentation/">here</a>.</p>
<p>The application has the following files:</p>
<ol>
<li>
<p>Program.cs</p>
</li>
<li>
<p>Startup.cs</p>
</li>
<li>
<p>Telemetry.cs</p>
</li>
<li>
<p>LoginController.cs</p>
</li>
</ol>
<h2>Step 2. Instrumenting the application</h2>
<p>When it comes to OpenTelemetry, the .NET ecosystem presents some unique aspects. While OpenTelemetry offers its own API, .NET implements OpenTelemetry's Tracing API through its native <strong>System.Diagnostics</strong> API. Pre-existing constructs such as <strong>ActivitySource</strong> and <strong>Activity</strong> are aptly repurposed to comply with OpenTelemetry.</p>
<p>That said, understanding the OpenTelemetry API and its terminology remains crucial for .NET developers. It's pivotal to gaining full command over instrumenting your applications, and as we've seen, it also extends to understanding elements of the <strong>System.Diagnostics</strong> API.</p>
<p>For those who might lean toward using the original OpenTelemetry APIs over the <strong>System.Diagnostics</strong> ones, there is also a way: OpenTelemetry provides an API shim for tracing. It enables developers to use OpenTelemetry APIs directly, and you can find more details in the OpenTelemetry API Shim documentation.</p>
<p>By integrating such practices into your .NET application, you can take full advantage of the powerful features OpenTelemetry provides, irrespective of whether you're using OpenTelemetry's API or the <strong>System.Diagnostics</strong> API.</p>
<p>In this blog, we stick to the default method and use the Activity convention that the <strong>System.Diagnostics</strong> API dictates.</p>
<p>To manually instrument a .NET application, you need to make changes in each of these files. Let's take a look at these changes one by one.</p>
<h3>Program.cs</h3>
<p>This is the entry point for our application. Here, we create an instance of IHostBuilder with default configurations. Notice how we set up a console logger with Serilog.</p>
<pre><code class="language-csharp">public static void Main(string[] args)
{
    Log.Logger = new LoggerConfiguration().WriteTo.Console().CreateLogger();
    CreateHostBuilder(args).Build().Run();
}
</code></pre>
<h3>Startup.cs</h3>
<p>In the <strong>Startup</strong>.cs file, we use the <strong>ConfigureServices</strong> method to add the OpenTelemetry Tracing.</p>
<pre><code class="language-csharp">public void ConfigureServices(IServiceCollection services)
{
    services.AddOpenTelemetry().WithTracing(builder =&gt; builder
        .AddSource(&quot;Login&quot;)
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter()
        .ConfigureResource(resource =&gt;
            resource.AddService(
                serviceName: &quot;Login&quot;))
    );
    services.AddControllers();
}
</code></pre>
<p>The WithTracing method enables tracing in OpenTelemetry. We add the OTLP (OpenTelemetry Protocol) exporter, a general-purpose telemetry data delivery protocol, and AddAspNetCoreInstrumentation, which automatically collects traces from our application. This is a critically important step that is not mentioned in the OpenTelemetry docs: without this call, instrumentation did not work for the Login application.</p>
<h3>Telemetry.cs</h3>
<p>This file contains the definition of our ActivitySource. The ActivitySource represents the source of the telemetry activities. It is named after the service name for your application, and this name can come from a configuration file, constants file, etc. We can use this ActivitySource to start activities.</p>
<pre><code class="language-csharp">using System.Diagnostics;

public static class Telemetry
{
    //...

    // Name it after the service name for your app.
    // It can come from a config file, constants file, etc.
    public static readonly ActivitySource LoginActivitySource = new(&quot;Login&quot;);

    //...
}
</code></pre>
<p>In our case, we've created an <strong>ActivitySource</strong> named <strong>Login</strong>. In our <strong>LoginController</strong>.cs, we use this <strong>LoginActivitySource</strong> to start a new activity when we begin our operations.</p>
<pre><code class="language-csharp">using (Activity activity = Telemetry.LoginActivitySource.StartActivity(&quot;SomeWork&quot;))
{
    // Perform operations here
}
</code></pre>
<p>This piece of code starts a new activity named <strong>SomeWork</strong>, performs some operations (in this case, generating a random user and logging them in), and then ends the activity. These activities are traced and can be analyzed later to understand the performance of the operations.</p>
<p>This <strong>ActivitySource</strong> is fundamental to OpenTelemetry's manual instrumentation. It represents the source of the activities and provides a way to start and stop activities.</p>
<h3>LoginController.cs</h3>
<p>In the <strong>LoginController</strong>.cs file, we are tracing the operations performed by the GET and POST methods. We start a new activity, <strong>SomeWork</strong>, before we begin our operations and dispose of it once we're done.</p>
<pre><code class="language-csharp">using (Activity activity = Telemetry.LoginActivitySource.StartActivity(&quot;SomeWork&quot;))
{
    var user = GenerateRandomUserResponse();
    Log.Information(&quot;User logged in: {UserName}&quot;, user);
    return user;
}
</code></pre>
<p>This will track the time taken by these operations and send this data to any configured telemetry backend via the OTLP exporter.</p>
<h2>Step 3. Base image setup</h2>
<p>Now that we have our application source code created and instrumented, it’s time to create a Dockerfile to build and run our .NET Login service.</p>
<p>Start with the .NET runtime image for the base layer of our Dockerfile:</p>
<pre><code class="language-dockerfile">FROM ${ARCH}mcr.microsoft.com/dotnet/aspnet:7.0 AS base
WORKDIR /app
EXPOSE 8000
</code></pre>
<p>Here, we're setting up the application's runtime environment.</p>
<h2>Step 4. Building the .NET application</h2>
<p>This is one of Docker's most useful features. Here, we compile our .NET application using the SDK image in a second build stage. Previously, teams would build on a different platform and then copy the compiled code into the Docker container; by building inside Docker all the way through, we can be much more confident the build will replicate from a developer's desktop into production.</p>
<pre><code class="language-dockerfile">FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0-preview AS build
ARG TARGETPLATFORM

WORKDIR /src
COPY [&quot;login.csproj&quot;, &quot;./&quot;]
RUN dotnet restore &quot;./login.csproj&quot;
COPY . .
WORKDIR &quot;/src/.&quot;
RUN dotnet build &quot;login.csproj&quot; -c Release -o /app/build
</code></pre>
<p>This section ensures that our .NET code is properly restored and compiled.</p>
<h2>Step 5. Publishing the application</h2>
<p>Once built, we'll publish the app:</p>
<pre><code class="language-dockerfile">FROM build AS publish
RUN dotnet publish &quot;login.csproj&quot; -c Release -o /app/publish
</code></pre>
<h2>Step 6. Preparing the final image</h2>
<p>Now, let's set up the final runtime image:</p>
<pre><code class="language-dockerfile">FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
</code></pre>
<h2>Step 7. Entry point setup</h2>
<p>Lastly, set the Docker image's entry point to start our .NET application:</p>
<pre><code class="language-dockerfile">ENTRYPOINT [&quot;/bin/bash&quot;, &quot;-c&quot;, &quot;dotnet login.dll&quot;]
</code></pre>
<h2>Step 8. Running the Docker image with environment variables</h2>
<p>To build and run the Docker image, you'd typically follow these steps:</p>
<h3>Build the Docker image</h3>
<p>First, you'd want to build the Docker image from your Dockerfile. Let's assume the Dockerfile is in the current directory, and you'd like to name/tag your image dotnet-login-otel-image.</p>
<pre><code class="language-bash">docker build -t dotnet-login-otel-image .
</code></pre>
<h3>Run the Docker image</h3>
<p>After building the image, you'd run it with the specified environment variables. For this, the docker <strong>run</strong> command is used with the -e flag for each environment variable.</p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot; \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;${ELASTIC_APM_SERVER_URL}&quot; \
       -e OTEL_METRICS_EXPORTER=&quot;otlp&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production&quot; \
       -e OTEL_SERVICE_NAME=&quot;dotnet-login-otel-manual&quot; \
       -e OTEL_TRACES_EXPORTER=&quot;otlp&quot; \
       dotnet-login-otel-image
</code></pre>
<p>Make sure that <code>${ELASTIC_APM_SECRET_TOKEN}</code> and <code>${ELASTIC_APM_SERVER_URL}</code> are set in your shell environment, replace them with their actual values from the cloud as shown below.</p>
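<p>A quick way to avoid a silent misconfiguration is to fail fast when these variables are unset. A minimal sketch (a hypothetical check, shown in JavaScript for brevity):</p>
<pre><code class="language-javascript">// Return the names of required environment variables that are unset
// or empty (hypothetical helper, for illustration only).
function missingEnvVars(env, required) {
  return required.filter(function (name) {
    return !env[name] || env[name].length === 0;
  });
}

var missing = missingEnvVars(process.env, [
  'ELASTIC_APM_SECRET_TOKEN',
  'ELASTIC_APM_SERVER_URL',
]);
if (missing.length !== 0) {
  console.error('Missing required variables: ' + missing.join(', '));
}
</code></pre>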
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/elastic-blog-3-apm-agents.png" alt="apm agents" /></p>
<p>You can also use an environment file with docker run --env-file to make the command less verbose if you have multiple environment variables.</p>
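<p>For example, a hypothetical env file (placeholder values, not real credentials) might look like this:</p>
<pre><code class="language-bash"># otel.env: replace the placeholder values with your own
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-apm-server.example.com:443
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer your-secret-token
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0,deployment.environment=production
OTEL_SERVICE_NAME=dotnet-login-otel-manual
</code></pre>
<p>You would then start the container with <code>docker run --env-file otel.env dotnet-login-otel-image</code>.</p>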
<p>Once you have this up and running, you can ping the endpoint for your instrumented service (in our case, this is /login), and you should see the app appear in Elastic APM, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/services-2.png" alt="services" /></p>
<p>It begins by tracking throughput and latency, critical metrics for SREs to pay attention to.</p>
<p>Digging in, we can see an overview of all our Transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/manual-net-login.png" alt="login" /></p>
<p>And look at specific transactions, including the “SomeWork” activity/span we created in the code above:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/latency_distribution_graph.png" alt="latency distribution graph" /></p>
<p>There is clearly an outlier here, where one transaction took over 20ms. This is likely to be due to the CLR warming up.</p>
<h2>Wrapping up</h2>
<p>With the code here instrumented and the Dockerfile bootstrapping the application, you've transformed your simple .NET application into one that's instrumented with OpenTelemetry. This will aid greatly in understanding application performance, tracing errors, and gaining insights into how users interact with your software.</p>
<p>Remember, observability is a crucial aspect of modern application development, especially in distributed systems. With tools like OpenTelemetry, understanding complex systems becomes a tad bit easier.</p>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument .NET with OpenTelemetry.</li>
<li>How to build and start the instrumented application using standard commands in a Dockerfile.</li>
<li>Using OpenTelemetry and its support for multiple languages, DevOps and SRE teams can instrument their applications with ease, gaining immediate insights into the health of the entire application stack and reducing mean time to resolution (MTTR).</li>
</ul>
<p>Since Elastic supports a mix of ingestion methods, whether auto-instrumentation with open-source OpenTelemetry or manual instrumentation with its native APM agents, you can plan your migration to OTel by focusing on a few applications first and then using OpenTelemetry across your applications later on, in a manner that best fits your business needs.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-net-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-net-apps-opentelemetry/observability-launch-series-4-net-manual.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation with OpenTelemetry for Node.js applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-nodejs-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-nodejs-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Node.js applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Node.js libraries and in particular work on instrumenting tracing in a Node.js application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and have a large support community reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic®, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/java/manual/">manual instrumentation for OpenTelemetry</a> with the Node.js service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">auto-instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several micro-services written in .NET, NodeJS, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-1-config.png" alt="Configuration" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to speed up analysis, and alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Node.js application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Node.js</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server-otel-manual">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple index.js file that can receive a POST request. See the full code <a href="https://github.com/elastic/observability-examples/blob/main/Elastiflix/node-server-otel-manual/index.js">here</a>.</p>
<pre><code class="language-javascript">const pino = require(&quot;pino&quot;);
const ecsFormat = require(&quot;@elastic/ecs-pino-format&quot;);
const log = pino({ ...ecsFormat({ convertReqRes: true }) });
const expressPino = require(&quot;express-pino-logger&quot;)({ logger: log });

var API_ENDPOINT_FAVORITES =
  process.env.API_ENDPOINT_FAVORITES || &quot;127.0.0.1:5000&quot;;
API_ENDPOINT_FAVORITES = API_ENDPOINT_FAVORITES.split(&quot;,&quot;);

const express = require(&quot;express&quot;);
const cors = require(&quot;cors&quot;)({ origin: true });
const cookieParser = require(&quot;cookie-parser&quot;);
const { json } = require(&quot;body-parser&quot;);

const PORT = process.env.PORT || 3001;

const app = express().use(cookieParser(), cors, json(), expressPino);

const axios = require(&quot;axios&quot;);

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  res.status(500).json({ error: err.message, code: err.code });
});

var favorites = {};

app.post(&quot;/api/favorites&quot;, (req, res) =&gt; {
  var randomIndex = Math.floor(Math.random() * API_ENDPOINT_FAVORITES.length);
  if (process.env.THROW_NOT_A_FUNCTION_ERROR == &quot;true&quot; &amp;&amp; Math.random() &lt; 0.5) {
    // randomly choose one of the endpoints
    axios
      .post(
        &quot;http://&quot; +
          API_ENDPOINT_FAVORITES[randomIndex] +
          &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        // quiz solution: &quot;42&quot;
        res.jsonn({ favorites: favorites });
      })
      .catch(function (error) {
        res.json({ error: error, favorites: [] });
      });
  } else {
    axios
      .post(
        &quot;http://&quot; +
          API_ENDPOINT_FAVORITES[randomIndex] +
          &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        res.json({ favorites: favorites });
      })
      .catch(function (error) {
        res.json({ error: error, favorites: [] });
      });
  }
});

app.listen(PORT, () =&gt; {
  console.log(`Server listening on ${PORT}`);
});
</code></pre>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional modules to our application.</p>
<pre><code class="language-javascript">const opentelemetry = require(&quot;@opentelemetry/api&quot;);
const { NodeTracerProvider } = require(&quot;@opentelemetry/sdk-trace-node&quot;);
const { BatchSpanProcessor } = require(&quot;@opentelemetry/sdk-trace-base&quot;);
const {
  OTLPTraceExporter,
} = require(&quot;@opentelemetry/exporter-trace-otlp-grpc&quot;);
const { Resource } = require(&quot;@opentelemetry/resources&quot;);
const {
  SemanticResourceAttributes,
} = require(&quot;@opentelemetry/semantic-conventions&quot;);

const { registerInstrumentations } = require(&quot;@opentelemetry/instrumentation&quot;);
const { HttpInstrumentation } = require(&quot;@opentelemetry/instrumentation-http&quot;);
const {
  ExpressInstrumentation,
} = require(&quot;@opentelemetry/instrumentation-express&quot;);
</code></pre>
<p>We start by creating a collectorOptions object with parameters such as the url and headers for connecting to the Elastic APM Server or OpenTelemetry collector.</p>
<pre><code class="language-javascript">const collectorOptions = {
  url: OTEL_EXPORTER_OTLP_ENDPOINT,
  headers: OTEL_EXPORTER_OTLP_HEADERS,
};
</code></pre>
<p>In order to pass additional parameters to OpenTelemetry, we will read the OTEL_RESOURCE_ATTRIBUTES variable and convert it into an object.</p>
<pre><code class="language-javascript">const envAttributes = process.env.OTEL_RESOURCE_ATTRIBUTES || &quot;&quot;;

// Parse the environment variable string into an object
const attributes = envAttributes.split(&quot;,&quot;).reduce((acc, curr) =&gt; {
  const [key, value] = curr.split(&quot;=&quot;);
  if (key &amp;&amp; value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});
</code></pre>
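<p>As a quick sanity check of this parsing logic — the attribute string below is illustrative, standing in for the environment variable — the reducer turns the comma-separated pairs into a plain object:</p>

```javascript
// Illustrative value; in the application this comes from the
// OTEL_RESOURCE_ATTRIBUTES environment variable.
const envAttributes = "service.version=1.0,deployment.environment=production";

// Split into "key=value" pairs and accumulate them into an object,
// trimming whitespace and skipping malformed entries.
const attributes = envAttributes.split(",").reduce((acc, curr) => {
  const [key, value] = curr.split("=");
  if (key && value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});

console.log(attributes);
// { 'service.version': '1.0', 'deployment.environment': 'production' }
```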
<p>Next, we use these parameters to populate the resource configuration.</p>
<pre><code class="language-javascript">const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    attributes[&quot;service.name&quot;] || &quot;node-server-otel-manual&quot;,
  [SemanticResourceAttributes.SERVICE_VERSION]:
    attributes[&quot;service.version&quot;] || &quot;1.0.0&quot;,
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
    attributes[&quot;deployment.environment&quot;] || &quot;production&quot;,
});
</code></pre>
<p>We then set up the trace provider using the previously created resource, followed by the exporter which takes the collectorOptions from before. The trace provider will allow us to create spans later.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that provides hooks for span start and end method invocations.</p>
<p>OpenTelemetry offers several span processors. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See the OpenTelemetry documentation</a>.</p>
<p>We also use the resource module, which allows us to specify attributes such as service.name, version, and more. See the <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OpenTelemetry semantic conventions documentation</a> for more details.</p>
<pre><code class="language-javascript">const tracerProvider = new NodeTracerProvider({
  resource: resource,
});

const exporter = new OTLPTraceExporter(collectorOptions);
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));
tracerProvider.register();
</code></pre>
<p>Next, we are going to register some instrumentations. This will automatically instrument Express and HTTP for us. While it’s possible to do this step fully manually, it would be complex and error-prone. This way we can ensure that any incoming and outgoing request is captured properly and that functionality such as distributed tracing works without any additional work.</p>
<pre><code class="language-javascript">registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
  tracerProvider: tracerProvider,
});
</code></pre>
<p>As a last step, we will now get an instance of the tracer that we can use to create custom spans.</p>
<pre><code class="language-javascript">const tracer = opentelemetry.trace.getTracer(&quot;node-server-otel-manual&quot;);
</code></pre>
<h3>Step 2. Adding custom spans</h3>
<p>Now that we have the modules added and initialized, we can add custom spans.</p>
<p>Our sample application has a POST request which calls a downstream service. If we want to have additional instrumentation for this part of our app, we simply wrap the function code with:</p>
<pre><code class="language-javascript">tracer.startActiveSpan('favorites',   tracer.startActiveSpan('favorites', (span) =&gt; {...
</code></pre>
<p>The wrapped code is as follows:</p>
<pre><code class="language-javascript">app.post(&quot;/api/favorites&quot;, (req, res, next) =&gt; {
  tracer.startActiveSpan(&quot;favorites&quot;, (span) =&gt; {
    axios
      .post(
        &quot;http://&quot; + API_ENDPOINT_FAVORITES + &quot;/favorites?user_id=1&quot;,
        req.body
      )
      .then(function (response) {
        favorites = response.data;
        span.end();
        res.json({ favorites: favorites });
      })
      .catch(next);
  });
});
</code></pre>
<p><strong>Automatic error handling</strong><br />
For automatic error handling, we are adding a function that we use in Express which captures the exception for any error that happens during runtime.</p>
<pre><code class="language-javascript">app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  const span = opentelemetry.trace.getActiveSpan();
  if (span) {
    span.recordException(err);
    span.end();
  }
  res.status(500).json({ error: err.message, code: err.code });
});
</code></pre>
<p><strong>Additional code</strong><br />
In addition to modules and span instrumentation, the sample application also checks some environment variables at startup. When sending data to Elastic without an OTel collector, the OTEL_EXPORTER_OTLP_HEADERS variable is required, as it contains the authentication. The same is true for OTEL_EXPORTER_OTLP_ENDPOINT, the host where we’ll send the telemetry data.</p>
<pre><code class="language-javascript">const OTEL_EXPORTER_OTLP_HEADERS = process.env.OTEL_EXPORTER_OTLP_HEADERS;
// error if secret token is not set
if (!OTEL_EXPORTER_OTLP_HEADERS) {
  throw new Error(&quot;OTEL_EXPORTER_OTLP_HEADERS environment variable is not set&quot;);
}

const OTEL_EXPORTER_OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
// error if server url is not set
if (!OTEL_EXPORTER_OTLP_ENDPOINT) {
  throw new Error(
    &quot;OTEL_EXPORTER_OTLP_ENDPOINT environment variable is not set&quot;
  );
}
</code></pre>
<p><strong>Final code</strong><br />
For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/node-server-otel-manual">GitHub</a>.</p>
<pre><code class="language-javascript">const pino = require(&quot;pino&quot;);
const ecsFormat = require(&quot;@elastic/ecs-pino-format&quot;);
const log = pino({ ...ecsFormat({ convertReqRes: true }) });
const expressPino = require(&quot;express-pino-logger&quot;)({ logger: log });

// Add OpenTelemetry packages
const opentelemetry = require(&quot;@opentelemetry/api&quot;);
const { NodeTracerProvider } = require(&quot;@opentelemetry/sdk-trace-node&quot;);
const { BatchSpanProcessor } = require(&quot;@opentelemetry/sdk-trace-base&quot;);
const {
  OTLPTraceExporter,
} = require(&quot;@opentelemetry/exporter-trace-otlp-grpc&quot;);
const { Resource } = require(&quot;@opentelemetry/resources&quot;);
const {
  SemanticResourceAttributes,
} = require(&quot;@opentelemetry/semantic-conventions&quot;);

const { registerInstrumentations } = require(&quot;@opentelemetry/instrumentation&quot;);

// Import OpenTelemetry instrumentations
const { HttpInstrumentation } = require(&quot;@opentelemetry/instrumentation-http&quot;);
const {
  ExpressInstrumentation,
} = require(&quot;@opentelemetry/instrumentation-express&quot;);

var API_ENDPOINT_FAVORITES =
  process.env.API_ENDPOINT_FAVORITES || &quot;127.0.0.1:5000&quot;;
API_ENDPOINT_FAVORITES = API_ENDPOINT_FAVORITES.split(&quot;,&quot;);

const OTEL_EXPORTER_OTLP_HEADERS = process.env.OTEL_EXPORTER_OTLP_HEADERS;
// error if secret token is not set
if (!OTEL_EXPORTER_OTLP_HEADERS) {
  throw new Error(&quot;OTEL_EXPORTER_OTLP_HEADERS environment variable is not set&quot;);
}

const OTEL_EXPORTER_OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
// error if server url is not set
if (!OTEL_EXPORTER_OTLP_ENDPOINT) {
  throw new Error(
    &quot;OTEL_EXPORTER_OTLP_ENDPOINT environment variable is not set&quot;
  );
}

const collectorOptions = {
  // url is optional and can be omitted - default is http://localhost:4317
  // Unix domain sockets are also supported: 'unix:///path/to/socket.sock'
  url: OTEL_EXPORTER_OTLP_ENDPOINT,
  headers: OTEL_EXPORTER_OTLP_HEADERS,
};

const envAttributes = process.env.OTEL_RESOURCE_ATTRIBUTES || &quot;&quot;;

// Parse the environment variable string into an object
const attributes = envAttributes.split(&quot;,&quot;).reduce((acc, curr) =&gt; {
  const [key, value] = curr.split(&quot;=&quot;);
  if (key &amp;&amp; value) {
    acc[key.trim()] = value.trim();
  }
  return acc;
}, {});

// Create and configure the resource object
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    attributes[&quot;service.name&quot;] || &quot;node-server-otel-manual&quot;,
  [SemanticResourceAttributes.SERVICE_VERSION]:
    attributes[&quot;service.version&quot;] || &quot;1.0.0&quot;,
  [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
    attributes[&quot;deployment.environment&quot;] || &quot;production&quot;,
});

// Create and configure the tracer provider
const tracerProvider = new NodeTracerProvider({
  resource: resource,
});
const exporter = new OTLPTraceExporter(collectorOptions);
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));
tracerProvider.register();

//Register instrumentations
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
  tracerProvider: tracerProvider,
});

const express = require(&quot;express&quot;);
const cors = require(&quot;cors&quot;)({ origin: true });
const cookieParser = require(&quot;cookie-parser&quot;);
const { json } = require(&quot;body-parser&quot;);

const PORT = process.env.PORT || 3001;

const app = express().use(cookieParser(), cors, json(), expressPino);

const axios = require(&quot;axios&quot;);

app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use((err, req, res, next) =&gt; {
  log.error(err.stack);
  const span = opentelemetry.trace.getActiveSpan();
  if (span) {
    span.recordException(err);
    span.end();
  }
  res.status(500).json({ error: err.message, code: err.code });
});

const tracer = opentelemetry.trace.getTracer(&quot;node-server-otel-manual&quot;);

var favorites = {};

app.post(&quot;/api/favorites&quot;, (req, res, next) =&gt; {
  tracer.startActiveSpan(&quot;favorites&quot;, (span) =&gt; {
    var randomIndex = Math.floor(Math.random() * API_ENDPOINT_FAVORITES.length);

    if (
      process.env.THROW_NOT_A_FUNCTION_ERROR == &quot;true&quot; &amp;&amp;
      Math.random() &lt; 0.5
    ) {
      // randomly choose one of the endpoints
      axios
        .post(
          &quot;http://&quot; +
            API_ENDPOINT_FAVORITES[randomIndex] +
            &quot;/favorites?user_id=1&quot;,
          req.body
        )
        .then(function (response) {
          favorites = response.data;
          // quiz solution: &quot;42&quot;
          span.end();
          res.jsonn({ favorites: favorites });
        })
        .catch(next);
    } else {
      axios
        .post(
          &quot;http://&quot; +
            API_ENDPOINT_FAVORITES[randomIndex] +
            &quot;/favorites?user_id=1&quot;,
          req.body
        )
        .then(function (response) {
          favorites = response.data;
          span.end();
          res.json({ favorites: favorites });
        })
        .catch(next);
    }
  });
});

app.listen(PORT, () =&gt; {
  log.info(`Server listening on ${PORT}`);
});
</code></pre>
<h3>Step 3. Running the Docker image with environment variables</h3>
<p>We will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the endpoint and authentication details that the OTel exporter uses to send the data, along with some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoint and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t node-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=node-server-otel-manual&quot; \
       -p 3001:3001 \
       node-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on some downstream services that you may not have running on your machine.</p>
<pre><code class="language-bash">curl localhost:3001/api/login
curl localhost:3001/api/favorites

# or alternatively issue a request every second

while true; do curl &quot;localhost:3001/api/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 4. Explore in Elastic APM</h3>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Node.js service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-4-graphs.png" alt="graphs" /></p>
<p>Notice how this mirrors the auto-instrumented version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/elastic-blog-4-graphs.png" alt="graphs-2" /></p>
<h2>Is it worth it?</h2>
<p>This is the million-dollar question. Depending on the level of detail you need, manual instrumentation may be necessary. Manual instrumentation lets you add custom spans, custom labels, and metrics where you want or need them. It allows you to get a level of detail that otherwise would not be possible and is often important for tracking business-specific KPIs.</p>
<p>Your operations, and whether you need to troubleshoot or analyze the performance of specific parts of the code, will dictate when and what to instrument. But it’s helpful to know that you have the option to manually instrument.</p>
<p>You may have noticed that we didn’t instrument metrics yet; that topic deserves a blog of its own. We discussed logs in a <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">previous blog</a>.</p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Node.js with OpenTelemetry</li>
<li>The different modules needed when using Express</li>
<li>How to properly initialize and instrument spans</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS from Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Node.js with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-nodejs-apps-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the auto-instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-nodejs-apps-opentelemetry/observability-launch-series-1-node-js-manual_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Manual instrumentation with OpenTelemetry for Python applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/manual-instrumentation-python-apps-opentelemetry</link>
            <guid isPermaLink="false">manual-instrumentation-python-apps-opentelemetry</guid>
            <pubDate>Thu, 31 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will show you how to manually instrument Python applications using OpenTelemetry. We will explore how to use the proper OpenTelemetry Python libraries and in particular work on instrumenting tracing in a Python application.]]></description>
            <content:encoded><![CDATA[<p>DevOps and SRE teams are transforming the process of software development. While DevOps engineers focus on efficient software applications and service delivery, SRE teams are key to ensuring reliability, scalability, and performance. These teams must rely on a full-stack observability solution that allows them to manage and monitor systems and ensure issues are resolved before they impact the business.</p>
<p>Observability across the entire stack of modern distributed applications requires data collection, processing, and correlation often in the form of dashboards. Ingesting all system data requires installing agents across stacks, frameworks, and providers — a process that can be challenging and time-consuming for teams who have to deal with version changes, compatibility issues, and proprietary code that doesn't scale as systems change.</p>
<p>Thanks to <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel), DevOps and SRE teams now have a standard way to collect and send data that doesn't rely on proprietary code and has a large support community, reducing vendor lock-in.</p>
<p>In a <a href="https://www.elastic.co/blog/opentelemetry-observability">previous blog</a>, we also reviewed how to use the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a> and connect it to Elastic<sup>®</sup>, as well as some of Elastic’s capabilities with OpenTelemetry and Kubernetes.</p>
<p>In this blog, we will show how to use <a href="https://opentelemetry.io/docs/instrumentation/python/manual/">manual instrumentation for OpenTelemetry</a> with the Python service of our <a href="https://github.com/elastic/observability-examples">application called Elastiflix</a>. This approach is slightly more complex than using <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">automatic instrumentation</a>.</p>
<p>The beauty of this is that there is <strong>no need for the otel-collector</strong>! This setup enables you to slowly and easily migrate an application to OTel with Elastic according to a timeline that best fits your business.</p>
<h2>Application, prerequisites, and config</h2>
<p>The application that we use for this blog is called <a href="https://github.com/elastic/observability-examples">Elastiflix</a>, a movie streaming application. It consists of several microservices written in .NET, Node.js, Go, and Python.</p>
<p>Before we instrument our sample application, we will first need to understand how Elastic can receive the telemetry data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-1-config.png" alt="configuration" /></p>
<p>All of Elastic Observability’s APM capabilities are available with OTel data. Some of these include:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services, distributed tracing</li>
<li>Transactions (traces)</li>
<li>Machine learning (ML) correlations</li>
<li>Log correlation</li>
</ul>
<p>In addition to Elastic’s APM and a unified view of the telemetry data, you will also be able to use Elastic’s powerful machine learning capabilities to reduce analysis effort, as well as alerting to help reduce MTTR.</p>
<h2>Prerequisites</h2>
<ul>
<li>An Elastic Cloud account — <a href="https://cloud.elastic.co/">sign up now</a></li>
<li>A clone of the <a href="https://github.com/elastic/observability-examples">Elastiflix demo application</a>, or your own Python application</li>
<li>Basic understanding of Docker — potentially install <a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
<li>Basic understanding of Python</li>
</ul>
<h2>View the example source code</h2>
<p>The full source code, including the Dockerfile used in this blog, can be found on <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-auto">GitHub</a>. The repository also contains the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite">same application without instrumentation</a>. This allows you to compare each file and see the differences.</p>
<p>The following steps will show you how to instrument this application and run it on the command line or in Docker. If you are interested in a more complete OTel example, take a look at the docker-compose file <a href="https://github.com/elastic/observability-examples/tree/main#start-the-app">here</a>, which will bring up the full project.</p>
<p>Before we begin, let’s look at the non-instrumented code first.</p>
<p>This is our simple Python Flask application that can receive a GET request. (This is a portion of the full <a href="https://github.com/elastic/observability-examples/blob/main/Elastiflix/python-favorite/main.py">main.py</a> file.)</p>
<pre><code class="language-python">from flask import Flask, request
import sys

import logging
import redis
import os
import ecs_logging
import datetime
import random
import time

redis_host = os.environ.get('REDIS_HOST') or 'localhost'
redis_port = os.environ.get('REDIS_PORT') or 6379

application_port = os.environ.get('APPLICATION_PORT') or 5000

app = Flask(__name__)

# Get the Logger
logger = logging.getLogger(&quot;app&quot;)
logger.setLevel(logging.DEBUG)

# Add an ECS formatter to the Handler
handler = logging.StreamHandler()
handler.setFormatter(ecs_logging.StdlibFormatter())
logger.addHandler(handler)
logging.getLogger('werkzeug').setLevel(logging.ERROR)
logging.getLogger('werkzeug').addHandler(handler)

r = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    user_id = str(request.args.get('user_id'))

    logger.info('Getting favorites for user ' + user_id, extra={
        &quot;event.dataset&quot;: &quot;favorite.log&quot;,
        &quot;user.id&quot;: request.args.get('user_id')
    })

    favorites = r.smembers(user_id)

    # convert to list
    favorites = list(favorites)
    logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
        &quot;event.dataset&quot;: &quot;favorite.log&quot;,
        &quot;user.id&quot;: user_id
    })
    return { &quot;favorites&quot;: favorites}

logger.info('App startup')
app.run(host='0.0.0.0', port=application_port)
logger.info('App Stopped')
</code></pre>
<h2>Step-by-step guide</h2>
<h3>Step 0. Log in to your Elastic Cloud account</h3>
<p>This blog assumes you have an Elastic Cloud account — if not, follow the <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">instructions to get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-2-trial.png" alt="trial" /></p>
<h3>Step 1. Install and initialize OpenTelemetry</h3>
<p>As a first step, we’ll need to add some additional libraries to our application.</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
</code></pre>
<p>This code imports necessary OpenTelemetry libraries, including those for tracing, exporting, and instrumenting specific libraries like Flask, Requests, and Redis.</p>
<p>Next we read the variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_HEADERS
OTEL_EXPORTER_OTLP_ENDPOINT
</code></pre>
<p>And then initialize the exporter.</p>
<pre><code class="language-python">otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')

otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')

exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)
</code></pre>
<p>In order to pass additional parameters to OpenTelemetry, we will read the OTEL_RESOURCE_ATTRIBUTES variable and convert it into a dictionary.</p>
<pre><code class="language-python">resource_attributes = os.environ.get('OTEL_RESOURCE_ATTRIBUTES') or 'service.version=1.0,deployment.environment=production'
key_value_pairs = resource_attributes.split(',')
result_dict = {}

for pair in key_value_pairs:
    key, value = pair.split('=', 1)
    result_dict[key] = value
</code></pre>
<p>Next, we use these parameters to populate the resource configuration.</p>
<pre><code class="language-python">resourceAttributes = {
     &quot;service.name&quot;: otel_service_name,
     &quot;service.version&quot;: result_dict['service.version'],
     &quot;deployment.environment&quot;: result_dict['deployment.environment']
}

resource = Resource.create(resourceAttributes)
</code></pre>
<p>We then set up the trace provider using the previously created resource. The trace provider will allow us to create spans later after getting a tracer instance from it.</p>
<p>Additionally, we specify the use of BatchSpanProcessor. The span processor is an interface that allows hooks for span start and end method invocations.</p>
<p>OpenTelemetry offers different span processors. The BatchSpanProcessor batches spans and sends them in bulk. Multiple span processors can be configured to be active at the same time using the MultiSpanProcessor. <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#span-processor">See OpenTelemetry documentation</a>.</p>
<p>Additionally, we added the resource module. This allows us to specify attributes such as service.name, version, and more. See <a href="https://opentelemetry.io/docs/specs/otel/resource/semantic_conventions/#semantic-attributes-with-sdk-provided-default-value">OpenTelemetry semantic conventions documentation</a> for more details.</p>
<pre><code class="language-python">provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(otel_service_name)
</code></pre>
<p>Finally, because we are using Flask and Redis, we also add the following, which allows us to automatically instrument both Flask and Redis.</p>
<p>Technically, you could consider this “cheating.” We are using some parts of the Python auto-instrumentation. However, it’s generally a good idea to use the auto-instrumentation modules where they exist. This saves you a lot of time, and in addition, it ensures that functionality like distributed tracing will work automatically for any requests you receive or send.</p>
<pre><code class="language-python">FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
RedisInstrumentor().instrument()
</code></pre>
<h3>Step 2. Adding custom spans</h3>
<p>Now that we have everything added and initialized, we can add custom spans.</p>
<p>If we want additional instrumentation for a part of our app, we simply wrap the code of the GET /favorites handler with:</p>
<pre><code class="language-python">with tracer.start_as_current_span(&quot;add_favorite_movies&quot;, set_status_on_exception=True) as span:
        ...
</code></pre>
<p>The wrapped code is as follows:</p>
<pre><code class="language-python">@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    # add artificial delay if enabled
    if delay_time &gt; 0:
        time.sleep(max(0, random.gauss(delay_time/1000, delay_time/1000/10)))

    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
</code></pre>
<p><strong>Additional code</strong></p>
<p>In addition to modules and span instrumentation, the sample application also checks some environment variables at startup. When sending data to Elastic without an OTel collector, the OTEL_EXPORTER_OTLP_HEADERS variable is required as it contains the authentication. The same is true for OTEL_EXPORTER_OTLP_ENDPOINT, the host where we’ll send the telemetry data.</p>
<pre><code class="language-python">otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
# fail if secret token not set
if otel_exporter_otlp_headers is None:
    raise Exception('OTEL_EXPORTER_OTLP_HEADERS environment variable not set')


otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
# fail if server url not set
if otel_exporter_otlp_endpoint is None:
    raise Exception('OTEL_EXPORTER_OTLP_ENDPOINT environment variable not set')
else:
    exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)
</code></pre>
<p><strong>Final code</strong><br />
For comparison, this is the instrumented code of our sample application. You can find the full source code in <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/python-favorite-otel-manual">GitHub</a>.</p>
<pre><code class="language-python">from flask import Flask, request
import sys

import logging
import redis
import os
import ecs_logging
import datetime
import random
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Using the gRPC exporter, since per the OTel docs this is needed for any endpoint receiving OTLP.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
#from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware
from opentelemetry.sdk.resources import Resource

redis_host = os.environ.get('REDIS_HOST') or 'localhost'
redis_port = os.environ.get('REDIS_PORT') or 6379
otel_traces_exporter = os.environ.get('OTEL_TRACES_EXPORTER') or 'otlp'
otel_metrics_exporter = os.environ.get('OTEL_METRICS_EXPORTER') or 'otlp'
environment = os.environ.get('ENVIRONMENT') or 'dev'
otel_service_version = os.environ.get('OTEL_SERVICE_VERSION') or '1.0.0'
resource_attributes = os.environ.get('OTEL_RESOURCE_ATTRIBUTES') or 'service.version=1.0,deployment.environment=production'

otel_exporter_otlp_headers = os.environ.get('OTEL_EXPORTER_OTLP_HEADERS')
# fail if secret token not set
if otel_exporter_otlp_headers is None:
    raise Exception('OTEL_EXPORTER_OTLP_HEADERS environment variable not set')
#else:
#    otel_exporter_otlp_fheaders= f&quot;Authorization=Bearer%20{secret_token}&quot;

otel_exporter_otlp_endpoint = os.environ.get('OTEL_EXPORTER_OTLP_ENDPOINT')
# fail if server url not set
if otel_exporter_otlp_endpoint is None:
    raise Exception('OTEL_EXPORTER_OTLP_ENDPOINT environment variable not set')
else:
    exporter = OTLPSpanExporter(endpoint=otel_exporter_otlp_endpoint, headers=otel_exporter_otlp_headers)


key_value_pairs = resource_attributes.split(',')
result_dict = {}

for pair in key_value_pairs:
    key, value = pair.split('=', 1)
    result_dict[key] = value

resourceAttributes = {
     &quot;service.name&quot;: result_dict['service.name'],
     &quot;service.version&quot;: result_dict['service.version'],
     &quot;deployment.environment&quot;: result_dict['deployment.environment']
#     # Add more attributes as needed
}

resource = Resource.create(resourceAttributes)


provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(&quot;favorite&quot;)


application_port = os.environ.get('APPLICATION_PORT') or 5000

app = Flask(__name__)


FlaskInstrumentor().instrument_app(app)
#OpenTelemetryMiddleware().instrument()
RequestsInstrumentor().instrument()
RedisInstrumentor().instrument()

#app.wsgi_app = OpenTelemetryMiddleware(app.wsgi_app)

# Get the Logger
logger = logging.getLogger(&quot;app&quot;)
logger.setLevel(logging.DEBUG)

# Add an ECS formatter to the Handler
handler = logging.StreamHandler()
handler.setFormatter(ecs_logging.StdlibFormatter())
logger.addHandler(handler)
logging.getLogger('werkzeug').setLevel(logging.ERROR)
logging.getLogger('werkzeug').addHandler(handler)

r = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

@app.route('/favorites', methods=['GET'])
def get_favorite_movies():
    with tracer.start_as_current_span(&quot;get_favorite_movies&quot;) as span:
        user_id = str(request.args.get('user_id'))

        logger.info('Getting favorites for user ' + user_id, extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: request.args.get('user_id')
        })

        favorites = r.smembers(user_id)

        # convert to list
        favorites = list(favorites)
        logger.info('User ' + user_id + ' has favorites: ' + str(favorites), extra={
            &quot;event.dataset&quot;: &quot;favorite.log&quot;,
            &quot;user.id&quot;: user_id
        })
        return { &quot;favorites&quot;: favorites}

logger.info('App startup')
app.run(host='0.0.0.0', port=application_port)
logger.info('App Stopped')
</code></pre>
<h3>Step 3. Running the Docker image with environment variables</h3>
<p>As specified in the <a href="https://opentelemetry.io/docs/instrumentation/python/automatic/#configuring-the-agent">OTEL documentation</a>, we will use environment variables and pass in the configuration values to enable it to connect with <a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Elastic Observability’s APM server</a>.</p>
<p>Because Elastic accepts OTLP natively, we just need to provide the Endpoint and authentication where the OTEL Exporter needs to send the data, as well as some other environment variables.</p>
<p><strong>Getting Elastic Cloud variables</strong><br />
You can copy the endpoints and token from Kibana<sup>®</sup> under the path <code>/app/home#/tutorial/apm</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-3-apm.png" alt="apm agents" /></p>
<p>You will need to copy the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS
</code></pre>
<p><strong>Build the image</strong></p>
<pre><code class="language-bash">docker build -t  python-otel-manual-image .
</code></pre>
<p><strong>Run the image</strong></p>
<pre><code class="language-bash">docker run \
       -e OTEL_EXPORTER_OTLP_ENDPOINT=&quot;&lt;REPLACE WITH OTEL_EXPORTER_OTLP_ENDPOINT&gt;&quot; \
       -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;REPLACE WITH TOKEN&gt;&quot; \
       -e OTEL_RESOURCE_ATTRIBUTES=&quot;service.version=1.0,deployment.environment=production,service.name=python-favorite-otel-manual&quot; \
       -p 3001:3001 \
       python-otel-manual-image
</code></pre>
<p>You can now issue a few requests in order to generate trace data. Note that these requests are expected to return an error, as this service relies on a connection to Redis that you don’t currently have running. As mentioned before, you can find a more complete example using docker-compose <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">here</a>.</p>
<pre><code class="language-bash">curl localhost:5000/favorites
# or alternatively issue a request every second

while true; do curl &quot;localhost:5000/favorites&quot;; sleep 1; done;
</code></pre>
<h3>Step 4. Explore traces, metrics, and logs in Elastic APM</h3>
<p>Now that the service is instrumented, you should see the following output in Elastic APM when looking at the transactions section of your Python service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-4-graph1.png" alt="graph-1" /></p>
<p>Notice how this is slightly different from the auto-instrumented version, as we now also have our custom span in this view.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/elastic-blog-5-graph2.png" alt="graph-2" /></p>
<h2>Is it worth it?</h2>
<p>This is the million-dollar question. Depending on what level of detail you need, it's potentially necessary to manually instrument. Manual instrumentation lets you add custom spans, custom labels, and metrics where you want or need them. It allows you to get a level of detail that otherwise would not be possible and is oftentimes important for tracking business-specific KPIs.</p>
<p>Your operations, and whether you need to troubleshoot or analyze the performance of specific parts of the code, will dictate when and what to instrument. But it’s helpful to know that you have the option to manually instrument.</p>
<p>You may have noticed that we didn’t yet instrument metrics; that is a topic for another blog. We discussed logs in a <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">previous blog</a>.</p>
<h2>Conclusion</h2>
<p>In this blog, we discussed the following:</p>
<ul>
<li>How to manually instrument Python with OpenTelemetry</li>
<li>How to properly initialize OpenTelemetry and add a custom span</li>
<li>How to easily set the OTLP ENDPOINT and OTLP HEADERS with Elastic without the need for a collector</li>
</ul>
<p>Hopefully, this provides an easy-to-understand walk-through of instrumenting Python with OpenTelemetry and how easy it is to send traces into Elastic.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-python-apps-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for instrumenting OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/manual-instrumentation-python-apps-opentelemetry/observability-launch-series-2-python-manual_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Automating User Journeys for Synthetic Monitoring with MCP in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/mcp-elastic-synthetics</link>
            <guid isPermaLink="false">mcp-elastic-synthetics</guid>
            <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This post explores how you can automatically create user journeys with Synthetic Monitoring in Elastic Observability, TypeScript, and FastMCP, and walks through the app and its workflow.]]></description>
            <content:encoded><![CDATA[<p><a href="https://www.elastic.co/docs/solutions/observability/synthetics">Synthetic Monitoring in Elastic Observability</a> enables you to track user pathways using a global testing infrastructure, emulating the full user path to measure the impact of web applications. It also provides comprehensive insight into your website's performance, functionality, and availability from development to production, allowing you to identify and resolve issues before they affect your customers.</p>
<p>One of the main components of Elastic's Synthetic Monitoring is the ability to create user journeys, which can be done with or without code. There is a <a href="https://github.com/elastic/synthetics">Synthetics agent</a>, a CLI tool that guides you through the process of creating both heartbeat monitors and user journeys and deploying your code to Elastic Observability. If you are using code to create user journeys, you are using <a href="https://playwright.dev/">Playwright</a> under the hood with some additional configuration to make it easier to work with Elastic Observability.</p>
<p>To automatically create user journeys using TypeScript, you can create Playwright tests based on a prompt using <a href="https://www.warp.dev">Warp</a>, an AI-assisted terminal, <a href="https://deepmind.google/models/gemini/pro/">Gemini 2.5 Pro</a>, and <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a>. This application was built using Python and <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, which wraps the synthetic agent to deploy browser tests to Elastic automatically. This blog post will guide you through how the application works, how to use it, and its development process. You can find the complete code on <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics">GitHub</a>.</p>
<h2>Solution overview</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/01-diagram.jpg" alt="diagram" /></p>
<p>Currently, this solution is set up to run inside Warp as an <a href="https://docs.warp.dev/knowledge-and-collaboration/mcp">MCP server</a>; however, you can also use another client, such as <a href="https://claude.ai/download">Claude Desktop</a> or <a href="https://cursormcp.com/en">Cursor</a>. You create a Python script using <a href="https://gofastmcp.com/getting-started/welcome">FastMCP</a>, which allows you to define functions that are callable by an LLM. Within Warp, you create a JSON configuration file that points to your Python script and passes in all the environment variables you are working with. From there, toggle agent mode and select an LLM; there are many options to choose from, so be sure to check out <a href="https://docs.warp.dev/agents/using-agents">Warp's documentation</a> to learn more.</p>
<p>After that, ask a question about creating a synthetic test or call the MCP function you need directly. The following three functions can be used:</p>
<ul>
<li>
<p><code>diagnose_warp_mcp_config</code>
Used for debugging environment variable issues that may arise. This function likely won't be needed unless there is an issue with your configuration.</p>
</li>
<li>
<p><code>create_and_deploy_browser_test</code>
Will automatically create Playwright tests if given the test name, the URL you want to test, and a schedule. This approach uses a template-based method, rather than a machine learning-based method, and all the tests it outputs will appear similar.</p>
</li>
<li>
<p><code>llm_create_and_deploy_test_from_prompt</code>
Similar to <code>create_and_deploy_browser_test</code>, but the main difference is that it uses an LLM to create tests based on a prompt you give it. The tests should reflect the prompt you provided. To run this function you'll provide a test name, URL, prompt, and schedule.</p>
</li>
</ul>
<h2>Why create this solution as an MCP server?</h2>
<p>The reason this was developed as an MCP server, as opposed to a standalone script or a standard CLI, is that it can be interacted with in a more conversational manner. It enables an LLM to generate dynamic Playwright tests while maintaining consistent arguments, environment variables, and responses to ensure accuracy and reliability. Thus, it becomes a reliable workflow that other agents or developers can compose with additional tools. In other words, the MCP layer turns your LLM-based test authoring into a standardized, reusable capability instead of a one-off script. To learn more about the direction of MCP, be sure to check out our article on the <a href="https://www.elastic.co/search-labs/blog/mcp-current-state">topic</a>.</p>
<h2>Implementation considerations</h2>
<p>When creating a solution like this one, be mindful of your use of tokens. An early version of this solution took approximately twenty minutes to create synthetic tests and ultimately led to severe rate-limiting.</p>
<p>Another issue faced during the building process was striking a balance between a template that makes Playwright scripts easy to generate and an LLM that creates Playwright scripts from prompts without feeling cookie-cutter. With a more LLM-driven approach, the scripts often didn't work or referenced parameters that didn't exist, while a more templated approach was reliable but repetitive. The final version of this solution attempts to balance the two by reusing elements of the template while adjusting the LLM's temperature parameter, which controls the randomness or creativity of a large language model's output.</p>
<p>While testing this solution, a failing test also emerged that required navigating past a pop-up. In more complex cases, this may serve as a building block that requires additional domain knowledge to create a complete passing Playwright test.</p>
<h2>How to get started</h2>
<h3>Prerequisites</h3>
<ul>
<li>This application was built with Python 3.12.1, but you can use any Python version higher than 3.10.</li>
<li>This application uses Elastic Observability version 9.1.2, but you can use any version of Elastic Observability higher than 8.10. You can also use <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>.</li>
<li>You will also need an OpenAI API key to use the LLM capabilities of this application. Configure an environment variable for your OpenAI API key, which you can find on the API keys page in <a href="https://platform.openai.com/api-keys">OpenAI's developer portal</a>.</li>
</ul>
<h3>Step 1: Install the packages and clone the repository</h3>
<p>In order for this MCP server to run locally, you will need to install the following packages:</p>
<pre><code class="language-shell">pip install fastmcp openai
npm install -g playwright @elastic/synthetics
</code></pre>
<p>You will use <a href="https://gofastmcp.com/getting-started/welcome">FastMCP 2.0</a> to create the MCP server, and <a href="https://github.com/openai/openai-python">OpenAI</a> to generate tests based on prompts that you provide. Additionally, you will want to clone the repository to obtain a local copy of the server.</p>
<h3>Step 2: Set up a configuration file in Warp</h3>
<p>Inside of Warp, go to the side panel, select MCP servers, and click “add”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/02-add-mcp.jpg" alt="Add MCP Server" /></p>
<p>After that, you will be prompted to add a JSON configuration file that should resemble the following. Be sure to add your own Kibana URL, update the correct path, and include your own keys and tokens.</p>
<pre><code class="language-json">{
 &quot;elastic-synthetics&quot;: {
   &quot;command&quot;: &quot;python&quot;,
   &quot;args&quot;: [&quot;elastic_synthetics_server.py&quot;],
   &quot;env&quot;: {
     &quot;PYTHONPATH&quot;: &quot;.&quot;,
     &quot;ELASTIC_KIBANA_URL&quot;: &quot;https://your-kibana-url.elastic-cloud.com&quot;,
     &quot;ELASTIC_API_KEY&quot;: &quot;your-api-key-here&quot;,
     &quot;ELASTIC_PROJECT_ID&quot;: &quot;mcp-synthetics-demo&quot;,
     &quot;ELASTIC_SPACE&quot;: &quot;default&quot;,
     &quot;ELASTIC_AUTO_PUSH&quot;: &quot;true&quot;,
     &quot;ELASTIC_USE_JAVASCRIPT&quot;: &quot;false&quot;,
     &quot;ELASTIC_INSTALL_DEPENDENCIES&quot;: &quot;true&quot;,
     &quot;OPENAI_API_KEY&quot;: &quot;sk-your-openai-key&quot;,
     &quot;LLM_MODEL&quot;: &quot;gpt-4o&quot;
   },
   &quot;working_directory&quot;: &quot;/path/to/your/file&quot;,
   &quot;start_on_launch&quot;: true 
   }
}
</code></pre>
<h3>Step 3: Ask a question or call the tools directly</h3>
<p>Now that you've set everything up locally, toggle agent mode and select the LLM you wish to use. Gemini 2.5 Pro was chosen for this blog post because it provided a straightforward answer, while other LLMs tested returned very lengthy responses.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/03-agent-mode.jpg" alt="Agent mode" /></p>
<p>To start using the MCP tools, you can ask your MCP server a question that contains the test name, URL, prompt, and schedule.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/04-full-question-answer.jpg" alt="Full question and answer" /></p>
<p>You can also call the tool directly by typing <code>llm_create_and_deploy_test_from_prompt()</code>, and the program will prompt you for the relevant details:<br />
<img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/05-call-mcp-tool.jpg" alt="Call MCP Tool" /></p>
<p>Inside Kibana, you should see your monitor listed if you go to Applications and select Monitors under Synthetics. You can also find a link to your monitor in the response of your MCP tool.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/06-kibana-monitors.jpg" alt="Monitors in Kibana" /></p>
<h2>What's going on inside</h2>
<p>This code sample consists of three primary functions, each an MCP tool that you can call from your MCP client: <code>diagnose_warp_mcp_config</code>, <code>create_and_deploy_browser_test</code>, and <code>llm_create_and_deploy_test_from_prompt</code>.</p>
<h3>Debugging environment issues</h3>
<p>Various issues around environment variable loading came up while creating this application, so there was a need for an MCP tool that could be called to diagnose whatever errors may be present.</p>
<p>The tool <code>diagnose_warp_mcp_config</code> starts with the decorator <code>@mcp.tool()</code>, which allows it to be called and listed among the available tools. This tool is designed to help debug issues with Elastic-specific environment variables for troubleshooting purposes. First, it loads the environment variables and looks for the Elastic-specific ones. It then applies security masking so that sensitive information like API keys is hidden in the output, showing only the first eight characters followed by &quot;...&quot;. Finally, it determines whether the minimum required credentials (Kibana URL and API key) are present to proceed with deployment and provides a report letting you know about any issues that may exist.</p>
<pre><code class="language-py">@mcp.tool()
def diagnose_warp_mcp_config() -&gt; Dict[str, Any]:
   &quot;&quot;&quot;Diagnose Warp MCP environment configuration for Elastic Synthetics&quot;&quot;&quot;
   try:
       env_vars = load_env_from_warp_mcp()
      
       # Check for required variables
       kibana_url = env_vars.get('ELASTIC_KIBANA_URL') or env_vars.get('KIBANA_URL')
       api_key = env_vars.get('ELASTIC_API_KEY') or env_vars.get('API_KEY')
       project_id = env_vars.get('ELASTIC_PROJECT_ID') or env_vars.get('PROJECT_ID')
       space = env_vars.get('ELASTIC_SPACE') or env_vars.get('SPACE', 'default')
      
       # Mask sensitive values for display
       masked_vars = {}
       for key, value in env_vars.items():
           if 'API_KEY' in key or 'TOKEN' in key:
               masked_vars[key] = f&quot;{value[:8]}...&quot; if value and len(value) &gt; 8 else &quot;***&quot;
           else:
               masked_vars[key] = value
      
       deployment_ready = bool(kibana_url and api_key)
      
       return safe_json_response({
           &quot;status&quot;: &quot;success&quot;,
           &quot;environment_variables&quot;: masked_vars,
           &quot;required_check&quot;: {
               &quot;kibana_url&quot;: bool(kibana_url),
               &quot;api_key&quot;: bool(api_key),
               &quot;project_id&quot;: bool(project_id),
               &quot;space&quot;: bool(space)
           },
           &quot;deployment_ready&quot;: deployment_ready,
           &quot;recommendations&quot;: [
               &quot;Environment variables detected&quot; if env_vars else &quot;No environment variables found&quot;,
               &quot;Kibana URL configured&quot; if kibana_url else &quot;Missing ELASTIC_KIBANA_URL or KIBANA_URL&quot;,
               &quot;API Key configured&quot; if api_key else &quot;Missing ELASTIC_API_KEY or API_KEY&quot;,
               &quot;Ready for deployment&quot; if deployment_ready else &quot;Missing required credentials&quot;
           ]
       })
      
   except Exception as e:
       return safe_json_response({
           &quot;status&quot;: &quot;error&quot;,
           &quot;error&quot;: str(e),
           &quot;error_type&quot;: type(e).__name__
       })
</code></pre>
<h3>Creating synthetic tests based on a template</h3>
<p>While developing this solution to generate tests based on a prompt, the process wasn't always smooth. Early versions struggled with accuracy, hallucinations, and getting stuck in loops. To make progress, a logical next step was a version built on a test template, which made it possible to verify the mechanics of the solution, such as whether a test could pass and be deployed to Elastic correctly.</p>
<p>This solution automates the entire process of creating a synthetic browser test that will regularly check if a website is working correctly, then deploys it to Elastic Observability Synthetics. Similar to <code>diagnose_warp_mcp_config</code>, the MCP tool <code>create_and_deploy_browser_test</code> starts with the decorator <code>@mcp.tool()</code> and checks to make sure that the proper environment variables are loaded.</p>
<p>From there, it creates a TypeScript test file from templates, generating dynamic test steps based on the target website's characteristics: navigating to the website, verifying that the page title exists, checking page load performance, taking a screenshot, and verifying that the page content is visible. Finally, it saves the test file in a <code>synthetic_tests</code> directory.</p>
<p>Finally, it wraps Elastic's CLI tool <code>@elastic/synthetics</code> to push the test to Kibana, allowing you to set which geographic locations to run tests from, how often to run the test, and the project and workspace settings.</p>
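<p>As a rough illustration of that wrapper step, the sketch below shells out to the CLI from Python. The helper and flag names are assumptions for illustration, not the repo's actual code, and push flags can vary across <code>@elastic/synthetics</code> versions:</p>

```python
import subprocess
from typing import List, Optional


def build_push_command(kibana_url: str, api_key: str,
                       project_id: Optional[str] = None) -> List[str]:
    # Assemble the CLI invocation; treat the flag names as illustrative
    # and check them against your @elastic/synthetics version.
    cmd = ["npx", "@elastic/synthetics", "push", "--url", kibana_url, "--auth", api_key]
    if project_id:
        cmd += ["--id", project_id]
    return cmd


def push_tests(test_dir: str, kibana_url: str, api_key: str) -> bool:
    # Run the push from the directory holding the generated test files.
    result = subprocess.run(build_push_command(kibana_url, api_key),
                            cwd=test_dir, capture_output=True, text=True)
    return result.returncode == 0
```

<p>Checking the return code, rather than parsing stdout, keeps a wrapper like this robust against changes in the CLI's output format.</p>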
<p>You can check out the full code for this MCP tool <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics/blob/main/elastic_synthetics_server.py#L943">here</a>.</p>
<h3>Creating synthetic tests based on a prompt</h3>
<p>While creating browser tests from a template is a good starting point, the results felt generic and cookie-cutter. It did, however, provide a helpful structure to build an LLM-based function on top of.</p>
<p>The MCP tool <code>llm_create_and_deploy_test_from_prompt</code> begins by validating basic parameters, including locations, schedule, and directories. It then gathers information about the target website to inform the AI and initializes the OpenAI client with the GPT-4o model.</p>
<p>After setting up the LLM, it converts natural language requests into actual Playwright test code, then cleans and validates the AI-generated code to prevent issues like injection attacks or malformed syntax. It draws inspiration from the templated approach, wrapping AI-generated steps within a proven, reliable test framework template. Finally, it deploys the test to Elastic in a similar manner to the previous tool.</p>
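<p>The cleaning step can be pictured with a small sketch like the one below. The function names, fence-stripping regex, and blocklist are simplified stand-ins for what the actual tool does, not code from the repo:</p>

```python
import re

# Matches a fenced code block in the LLM reply; the fence string is built
# dynamically to avoid literal triple backticks inside this example.
_FENCE = "`" * 3
_BLOCK = re.compile(_FENCE + r"(?:\w+)?\n(.*?)" + _FENCE, re.DOTALL)


def extract_code(llm_output: str) -> str:
    # Pull the code body out of a reply that may wrap it in Markdown fences.
    match = _BLOCK.search(llm_output)
    return (match.group(1) if match else llm_output).strip()


# A very rough screen for patterns we never expect in a browser test step.
FORBIDDEN = ("require('child_process')", "eval(", "process.exit")


def looks_safe(code: str) -> bool:
    return not any(token in code for token in FORBIDDEN)
```

<p>Wrapping the validated steps inside a known-good template, as the tool does, then limits how much damage a malformed or malicious generation can do.</p>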
<p>You can find the code for this tool <a href="https://github.com/JessicaGarson/MCP-Elastic-Synthetics/blob/main/elastic_synthetics_server.py#L1559">here</a>.</p>
<h2>Conclusion and next steps</h2>
<p>Synthetic monitoring in Elastic Observability makes it easy to test complete user journeys and keep your site reliable, with simple setup and a Playwright integration. A tool like this can provide a starting point for tests that you can iterate on afterward.</p>
<p>A solution like this is just the start of an MCP implementation that automatically generates Playwright tests for you and can be expanded in the future to include heartbeat monitors, utilize the <a href="https://github.com/microsoft/playwright-mcp">Playwright MCP server</a>, or consider experimenting with <a href="https://www.anthropic.com/news/claude-for-chrome">Claude for Chrome</a> to create synthetic testing.</p>
<p>Check out more articles on <a href="https://www.elastic.co/observability-labs/blog/tag/synthetics">Observability Labs on Synthetic Monitoring</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/mcp-elastic-synthetics/retro.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Explore and Analyze Metrics with Ease in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/metrics-explore-analyze-with-esql-discover</link>
            <guid isPermaLink="false">metrics-explore-analyze-with-esql-discover</guid>
            <pubDate>Thu, 23 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The latest enhancements to ES|QL and Discover based metrics exploration unleash a potent set of tools for quick and effective metrics analytics.]]></description>
            <content:encoded><![CDATA[<h2>Metrics are critical in identifying the “what”</h2>
<p>As a core pillar of Observability, metrics offer a highly structured, quantitative view of system performance and health. They provide a crucial symptomatic perspective—revealing <em>what</em> is happening, such as high application latency, increasing service errors, or spiking container CPU utilization, which is essential for initiating alerting and triaging efforts. This capability for effective monitoring, alerting, and triaging is paramount to ensuring robust service delivery and achieving successful business outcomes.</p>
<p>Elastic Observability provides a comprehensive, end-to-end experience for metrics data. Elastic ensures that metrics data can be collected from numerous sources, enriched as needed and shipped to the Elastic Stack. Elastic efficiently stores this time series data, including high-cardinality metrics, utilizing the <a href="https://www.elastic.co/observability-labs/blog/time-series-data-streams-observability-metrics">TSDS index mode</a> (Time Series Data Stream), introduced in <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0#efficient-storage-of-metrics-with-tsdb,-now-generally-available">prior versions</a> and used across Elastic time series <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">integrations</a>. This foundation ensures comprehensive observability through out-of-the-box dashboards, alerts, SLOs, and streamlined data management.</p>
<p>Elastic Observability 9.2 provides enhancements to metrics exploration and analysis through powerful query language extensions and expanded UI capabilities. These enhancements focus on making analysis on TSDS data via counter rates and common aggregations over time easier and faster than ever before.</p>
<p>The main metrics enhancements center on these key features, offered as Tech Preview:</p>
<ol>
<li>Metrics analytics with TSDS and ES|QL</li>
<li>Interactive metrics exploration in Discover</li>
<li>OTLP endpoint for metrics</li>
</ol>
<h2>Metrics analytics with TSDS and ES|QL</h2>
<p>The introduction of the new <a href="https://www.elastic.co/docs/reference/query-languages/esql/commands/ts"><code>TS</code> source command</a> in <a href="https://www.elastic.co/docs/reference/query-languages/esql">ES|QL</a> (Elasticsearch Query Language) on TSDS metrics dramatically simplifies time series analysis.</p>
<p>The <code>TS</code> command is specifically designed to target only time series indices, differentiating it from the general <code>FROM</code> command. Its core power lies in enabling a dedicated suite of time series aggregation functions within the <code>STATS</code> command.</p>
<p>This mechanism utilizes a dual aggregation paradigm, which is standard for time series querying. These queries involve two aggregation functions:</p>
<ul>
<li>
<p><strong>Inner (Time Series) function:</strong> Applied implicitly per time series, often over bucketed time intervals.</p>
</li>
<li>
<p><strong>Outer (Regular) function:</strong> Used to aggregate the results of the inner function across groups. For instance, if you use <code>STATS SUM(RATE(search_requests)) BY TBUCKET(1 hour), host</code>, the <code>RATE()</code> function is the inner function applied per time series in hourly buckets, and <code>SUM()</code> is the outer function, summing these rates for each host and hourly bucket.</p>
</li>
</ul>
<p>If an ES|QL query using the <code>TS</code> command is missing an inner (time series) aggregation function, <code>LAST_OVER_TIME()</code> is implicitly assumed and used. For example, <code>TS metrics | STATS AVG(memory_usage)</code> is equivalent to <code>TS metrics | STATS AVG(LAST_OVER_TIME(memory_usage))</code>.</p>
<h3>Key time series aggregation functions available in ES|QL via <code>TS</code> command</h3>
<p>These functions allow for powerful analysis on time-series data:</p>
<table>
<thead>
<tr>
<th align="center">Function</th>
<th align="center">Description</th>
<th align="center">Example Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><code>RATE()</code> <strong>/</strong> <code>IRATE()</code></td>
<td align="center">Calculates the per-second average rate of increase of a counter (<code>RATE</code>), accounting for non-monotonic breaks like counter resets, making it the most appropriate function for counters, or the per-second rate of increase between the last two data points (<code>IRATE</code>), ignoring all but the last two points for high responsiveness.</td>
<td align="center">Calculating request per second (RPS) or throughput.</td>
</tr>
<tr>
<td align="center"><code>AVG_OVER_TIME()</code></td>
<td align="center">Calculates the average of a numeric field over the defined time range.</td>
<td align="center">Determining average resource usage over an hour.</td>
</tr>
<tr>
<td align="center"><code>SUM_OVER_TIME()</code></td>
<td align="center">Calculates the sum of a field over the time range.</td>
<td align="center">Total errors over a specific time window.</td>
</tr>
<tr>
<td align="center"><code>MAX_OVER_TIME()</code> <strong>/</strong> <code>MIN_OVER_TIME()</code></td>
<td align="center">Calculates the maximum or minimum value of a field over time.</td>
<td align="center">Identifying peak resource consumption.</td>
</tr>
<tr>
<td align="center"><code>DELTA()</code> <strong>/</strong> <code>IDELTA()</code></td>
<td align="center">Calculates the absolute change of a gauge field over a time window (<code>DELTA</code>) or specifically between the last two data points (<code>IDELTA</code>), making <code>IDELTA</code> more responsive to recent changes.</td>
<td align="center">Tracking changes in system gauge metrics (e.g., buffer size).</td>
</tr>
<tr>
<td align="center"><code>INCREASE()</code></td>
<td align="center">Calculates the absolute increase of a counter (<code>INCREASE</code>).</td>
<td align="center">Analyzing immediate rate changes in fast-moving counters.</td>
</tr>
<tr>
<td align="center"><code>FIRST_OVER_TIME()</code> <strong>/</strong> <code>LAST_OVER_TIME()</code></td>
<td align="center">Calculates the earliest or latest recorded value of a field, determined by the <code>@timestamp</code> field.</td>
<td align="center">Inspecting initial and final metric states within a bucket.</td>
</tr>
<tr>
<td align="center"><code>ABSENT_OVER_TIME()</code> <strong>/</strong> <code>PRESENT_OVER_TIME()</code></td>
<td align="center">Calculates the absence or presence of a field in the result over the time range.</td>
<td align="center">Identifying monitoring coverage gaps.</td>
</tr>
<tr>
<td align="center"><code>COUNT_OVER_TIME()</code> <strong>/</strong> <code>COUNT_DISTINCT_OVER_TIME()</code></td>
<td align="center">Calculates the total count or the count of distinct values of a field over time.</td>
<td align="center">Measuring frequency or cardinality changes.</td>
</tr>
</tbody>
</table>
<p>These functions, available with the <code>TS</code> command, allow SREs and Ops teams to easily perform rate calculations and other common aggregations, making efficient metrics analysis a routine part of observability workflows. And it’s much faster, too: internal performance testing shows that <code>TS</code> queries consistently outperform other ways of querying metrics data, often by an order of magnitude or more.</p>
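<p>To make the dual aggregation concrete, here is a small Python sketch that composes the hourly requests-per-second query from the example above as a plain string. The index pattern and the <code>search_requests</code> field are carried over from that example, and the commented-out client usage assumes the official <code>elasticsearch</code> Python package:</p>

```python
def rps_per_host_query(index: str = "metrics-*") -> str:
    # Hourly requests-per-second per host: RATE() is the inner
    # (per-time-series) function, SUM() the outer aggregation across series.
    return (
        f"TS {index} "
        "| STATS rps = SUM(RATE(search_requests)) BY TBUCKET(1 hour), host"
    )


# Running it with the Python client (connection details are placeholders):
# from elasticsearch import Elasticsearch
# client = Elasticsearch("https://localhost:9200", api_key="...")
# for row in client.esql.query(query=rps_per_host_query())["values"]:
#     print(row)
```

<p>Keeping the ES|QL text in a helper like this makes it easy to tweak the bucket size or swap <code>RATE()</code> for <code>IRATE()</code> when experimenting.</p>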
<h2>Interactive metrics exploration in Discover</h2>
<p>The 9.2 release introduces the capability to explore and analyze metrics directly and interactively within the Discover interface. In addition to exploring and analyzing logs and raw events, Discover now provides a dedicated environment for metrics exploration:</p>
<ul>
<li>
<p><strong>Easy start:</strong> Begin exploration simply by querying ingested metrics with <code>TS metrics-*</code>.</p>
</li>
<li>
<p><strong>Grid view and pre-applied aggregations:</strong> This command displays all metrics in a grid format at a glance, immediately applying the appropriate aggregations based on the metric type, such as <code>rate</code> versus <code>avg</code>.</p>
</li>
<li>
<p><strong>Search and group-by:</strong> Quickly search for specific metrics by name. Also easily group and analyze metrics by dimensions (labels) and specific values. This allows narrowing down to metrics and dimensions of choice for targeted analysis.</p>
</li>
<li>
<p><strong>Quick access to details:</strong> The interface also provides access to crucial information for each metric, including query and response details, the underlying ES|QL commands, the metric field type, and applicable dimensions.</p>
</li>
<li>
<p><strong>Easy tweaking and dashboarding:</strong> The system automatically populates ES|QL queries, aiding in making easy tweaks, slicing, and dicing the data. Once analyzed, metrics and resulting analyses can be added to new or existing dashboards with ease.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/metrics-explore-analyze-with-esql-discover/metrics-discover-ts-command.png" alt="Interactive metrics exploration in Discover" /></p>
<h2>OTLP endpoint for metrics</h2>
<p>We are also introducing a native OpenTelemetry Protocol (OTLP) endpoint specifically for metrics ingest directly into Elasticsearch. The endpoint especially benefits self-managed customers, and will be integrated into our <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Cloud Managed OTLP Endpoint</a> for Elastic-managed offerings. The native endpoint and related updates improve ingest performance and scalability of OTel metrics, providing up to 60% higher throughput via <code>_otlp</code>, and up to 25% higher throughput when using classic <code>_bulk</code> methods. </p>
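<p>For self-managed clusters, pointing an OpenTelemetry SDK or Collector at the native endpoint mostly comes down to an endpoint URL and an API-key header. The sketch below only builds that configuration; the <code>/_otlp/v1/metrics</code> path and header shape are assumptions inferred from the endpoint name above, so check the Elastic documentation for your deployment's exact ingest URL:</p>

```python
def otlp_exporter_config(es_url: str, api_key: str) -> dict:
    # NOTE: the path below is an assumption inferred from the `_otlp`
    # endpoint name; verify it against the docs for your Elastic version.
    return {
        "endpoint": f"{es_url.rstrip('/')}/_otlp/v1/metrics",
        "headers": {"Authorization": f"ApiKey {api_key}"},
    }


# With the OpenTelemetry Python SDK this would plug into the HTTP exporter:
# from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
# exporter = OTLPMetricExporter(**otlp_exporter_config("https://es.example.com", "<api-key>"))
```
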
<h2>In Conclusion</h2>
<p>By merging the power of ES|QL's new time series aggregations with the familiar interactive experience of Discover, Elastic 9.2 enables a potent set of metrics analytics tools. The tools significantly boost the exploration and analysis phase of any observability workflow. And we’re just getting started on unleashing the full power of metrics in Elastic Observability!</p>
<p>We welcome you to <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">try the new features</a> today!</p>
<p>Also, learn more about how we provide metrics analytics for AWS, Azure, GCP, Kubernetes, and LLMs on <a href="https://www.elastic.co/observability-labs">Observability Labs</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/metrics-explore-analyze-with-esql-discover/metrics-blog-image-ts-discover.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Migrating 1 billion log lines from OpenSearch to Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch</link>
            <guid isPermaLink="false">migrating-billion-log-lines-opensearch-elasticsearch</guid>
            <pubDate>Wed, 11 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to migrate 1 billion log lines from OpenSearch to Elasticsearch for improved performance and reduced disk usage. Discover the migration strategies, data transfer methods, and optimization techniques used in this guide.]]></description>
            <content:encoded><![CDATA[<p>What are the current options to migrate from OpenSearch to Elasticsearch&lt;sup&gt;®&lt;/sup&gt;?</p>
<p>OpenSearch is a fork of Elasticsearch 7.10 that has diverged quite a bit from Elasticsearch since then, resulting in a different set of features and also different performance, as <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">this benchmark</a> shows (hint: it’s currently much slower than Elasticsearch).</p>
<p>Given the differences between the two solutions, restoring a snapshot from OpenSearch is not possible, nor is reindex-from-remote, so our only option is then using something in between that will read from OpenSearch and write to Elasticsearch.</p>
<p>This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/blog-elastic-348gb-disk-space-logs.jpg" alt="1 - arrows" /></p>
<h2>1 billion log lines</h2>
<p>We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans one week (January 1–7, 2023).</p>
<p>We have in total 1,009,165,775 documents that take <strong>453.5GB</strong> of space in OpenSearch, including the replicas. That’s <strong>241.2KB per document</strong>. This is going to be important later when we enable a couple optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!</p>
<p>This billion log line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. We have primary shards of about 25GB in size, following the best practices for optimal shard sizing. A <code>GET _cat/indices</code> shows us the indices we are dealing with:</p>
<pre><code class="language-bash">index                              docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049  102519334   1   1         22.1gb     44.2gb
.ds-logs-myapplication-prod-000048  114273539   1   1         26.1gb     52.3gb
.ds-logs-myapplication-prod-000044  111093596   1   1         25.4gb     50.8gb
.ds-logs-myapplication-prod-000043  113821016   1   1         25.7gb     51.5gb
.ds-logs-myapplication-prod-000042  113859174   1   1         24.8gb     49.7gb
.ds-logs-myapplication-prod-000041  112400019   1   1         25.7gb     51.4gb
.ds-logs-myapplication-prod-000040  113362823   1   1         25.9gb     51.9gb
.ds-logs-myapplication-prod-000038  110994116   1   1         25.3gb     50.7gb
.ds-logs-myapplication-prod-000037  116842158   1   1         25.4gb     50.8gb
</code></pre>
<p>Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">benchmark</a>, the clusters are running in Kubernetes.</p>
<h2>Moving data from A to B</h2>
<p>Typically, moving data from one Elasticsearch cluster to another is as easy as a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html">snapshot and restore</a> if the clusters are compatible versions, or a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#reindex-from-remote">reindex from remote</a> if you need real-time synchronization and minimized downtime. These methods do not apply when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged since the 7.10 fork. However, there is one method that will work: scrolling.</p>
<h3>Scrolling</h3>
<p>Scrolling involves using an external tool, such as Logstash<sup>®</sup>, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration process if needed. Here are a couple of advantages of using Logstash:</p>
<ul>
<li><strong>Easy parallelization:</strong> It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.</li>
<li><strong>Queuing:</strong> Logstash automatically queues documents before sending.</li>
<li><strong>Automatic retries:</strong> In the event of a failure or an error during data transmission, Logstash will automatically attempt to resend the data; moreover, it will stop querying the source cluster as often, until the connection is re-established, all without manual intervention.</li>
</ul>
<p>Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left, similar to how a “cursor” works in relational databases.</p>
<p>A <a href="https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html">scrolled search</a> takes a snapshot in time by freezing the segments that make up the index as of the time the request is made, preventing those segments from merging. As a result, the scroll doesn’t see any changes that are made to the index after the initial search request has been made.</p>
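<p>The scroll loop itself can be sketched in a few lines of Python. The client here is duck-typed: both opensearch-py and elasticsearch-py expose <code>search</code> and <code>scroll</code> methods, though argument shapes vary slightly across client versions, so treat this as a sketch rather than version-exact code:</p>

```python
def scroll_documents(client, index: str, page_size: int = 500, keep_alive: str = "5m"):
    # Yield batches of hits from `index` using the scroll API. `client` can be
    # any object exposing search/scroll (OpenSearch and Elasticsearch clients
    # both do, with minor signature differences between versions).
    resp = client.search(index=index, scroll=keep_alive, size=page_size,
                         body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield hits
        resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        scroll_id = resp["_scroll_id"]
```

<p>Running this against the OpenSearch source and feeding each batch to a bulk indexer on the Elasticsearch side reproduces the basic pipeline, minus the queuing and retry behavior Logstash provides for free.</p>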
<h3>Migration strategies</h3>
<p>Reading from A and writing to B can be slow without optimization because it involves paginating through the results, transferring each batch over the network to Logstash, which assembles the documents into another batch and then transfers those batches over the network again to Elasticsearch, where the documents will be indexed. So when it comes to such large data sets, we must be very efficient and extract every bit of performance where we can.</p>
<p>Let’s start with the facts — what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/monitoring-overview.html">monitoring</a> functionality in Elastic<sup>®</sup> and then navigating to the index you want to inspect.</p>
<p><strong>Scrolling in the deep</strong><br />
The simplest approach for transferring the log lines would be to scroll over the entire data set in one pass and check on it later when it finishes. Here we will introduce our first two variables: <code>PAGE_SIZE</code> and <code>BATCH_SIZE</code>. The former is how many records we bring from the source every time we query it, and the latter is how many documents Logstash assembles together before writing them to the destination index.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-2-scrolling-in-the-deep.jpg" alt="Deep scrolling" /></p>
<p>With such a large data set, the scroll slows down as the pagination gets deeper: the indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-3-index-rate.png" alt="Indexing rate for a deep scroll" /></p>
<p><strong>Slice me nice</strong><br />
We can optimize scrolling by using an approach called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#slice-scroll">Sliced scroll</a>, where we split the index in different slices to consume them independently.</p>
<p>Here we will introduce our last two variables: <code>SLICES</code> and <code>WORKERS</code>. The number of slices cannot be too small, as the performance decreases drastically over time, and it cannot be too big, as the overhead of maintaining the scrolls would counter the benefits of smaller searches.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-4-slice-me-nice.jpg" alt="Sliced scroll" /></p>
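<p>The post's actual pipeline is built on Logstash, but the slicing mechanics can be sketched in plain Python: each worker attaches a <code>slice</code> clause to an otherwise identical search body, and a thread pool fans the workers out. The function names here are illustrative:</p>

```python
from concurrent.futures import ThreadPoolExecutor


def sliced_query(slice_id: int, max_slices: int) -> dict:
    # Each worker consumes one independent slice of the same index.
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }


def migrate_index(transfer_fn, slices: int = 4) -> None:
    # `transfer_fn(body)` is expected to scroll through its slice and
    # bulk-write the documents to the destination (sketch only).
    with ThreadPoolExecutor(max_workers=slices) as pool:
        list(pool.map(transfer_fn, [sliced_query(i, slices) for i in range(slices)]))
```
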
<p>Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.</p>
<table>
<thead>
<tr>
<th>SLICES</th>
<th>PAGE_SIZE</th>
<th>WORKERS</th>
<th>BATCH_SIZE</th>
<th>Average Indexing Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>500</td>
<td>3</td>
<td>500</td>
<td>13,319 docs/sec</td>
</tr>
<tr>
<td>3</td>
<td>1,000</td>
<td>3</td>
<td>1,000</td>
<td>13,048 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>250</td>
<td>4</td>
<td>250</td>
<td>10,199 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>500</td>
<td>4</td>
<td>500</td>
<td>12,692 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>1,000</td>
<td>4</td>
<td>1,000</td>
<td>10,900 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>500</td>
<td>5</td>
<td>500</td>
<td>12,647 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>1,000</td>
<td>5</td>
<td>1,000</td>
<td>10,334 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>2,000</td>
<td>5</td>
<td>2,000</td>
<td>10,405 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>10</td>
<td>250</td>
<td>14,083 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>4</td>
<td>1,000</td>
<td>12,014 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>500</td>
<td>4</td>
<td>1,000</td>
<td>10,956 docs/sec</td>
</tr>
</tbody>
</table>
<p>It looks like we have a good set of candidates for maximizing the throughput of a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling: even though search operations are single-threaded and each slice reads its data through sequential search requests, nothing prevents us from reading several indices in parallel.</p>
<p>By default, the maximum number of open scrolls is 500. This limit can be updated with the <code>search.max_open_scroll_context</code> cluster setting, but the default value is enough for this particular migration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-5-index-rate-volatile.png" alt="5 - indexing rate" /></p>
<h2>Let’s migrate</h2>
<h3>Preparing our destination indices</h3>
<p>We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index template</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-index-lifecycle.html">index lifecycle management</a> configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.</p>
<p><strong>Index lifecycle management policy</strong><br />
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:</p>
<pre><code class="language-bash">PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;actions&quot;: {
          &quot;rollover&quot;: {
            &quot;max_primary_shard_size&quot;: &quot;25gb&quot;
          }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;0d&quot;,
        &quot;actions&quot;: {
          &quot;forcemerge&quot;: {
            &quot;max_num_segments&quot;: 1
          }
        }
      }
    }
  }
}
</code></pre>
<p><strong>Index template (and saving 23% in disk space)</strong><br />
Since we are here, we’re going to go ahead and enable <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">Synthetic Source</a>, a clever feature that allows us to store and discard the original JSON document while still reconstructing it when needed from the stored fields.</p>
<p>For our example, enabling Synthetic Source resulted in a remarkable <strong>23.4% improvement in storage efficiency</strong> , reducing the size required to store a single document from 241.2KB in OpenSearch to just <strong>185KB</strong> in Elasticsearch.</p>
<p>Our full index template is therefore:</p>
<pre><code class="language-bash">PUT _index_template/logs-myapplication-reindex
{
  &quot;index_patterns&quot;: [
    &quot;logs-myapplication-reindex&quot;
  ],
  &quot;priority&quot;: 500,
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index&quot;: {
        &quot;lifecycle.name&quot;: &quot;logs-myapplication-lifecycle-policy&quot;,
        &quot;codec&quot;: &quot;best_compression&quot;,
        &quot;number_of_shards&quot;: &quot;1&quot;,
        &quot;number_of_replicas&quot;: &quot;1&quot;,
        &quot;query&quot;: {
          &quot;default_field&quot;: [
            &quot;message&quot;
          ]
        }
      }
    },
    &quot;mappings&quot;: {
      &quot;_source&quot;: {
        &quot;mode&quot;: &quot;synthetic&quot;
      },
      &quot;_data_stream_timestamp&quot;: {
        &quot;enabled&quot;: true
      },
      &quot;date_detection&quot;: false,
      &quot;properties&quot;: {
        &quot;@timestamp&quot;: {
          &quot;type&quot;: &quot;date&quot;
        },
        &quot;agent&quot;: {
          &quot;properties&quot;: {
            &quot;ephemeral_id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;aws&quot;: {
          &quot;properties&quot;: {
            &quot;cloudwatch&quot;: {
              &quot;properties&quot;: {
                &quot;ingestion_time&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_group&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_stream&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;cloud&quot;: {
          &quot;properties&quot;: {
            &quot;region&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;data_stream&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;namespace&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;ecs&quot;: {
          &quot;properties&quot;: {
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;event&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;ingested&quot;: {
              &quot;type&quot;: &quot;date&quot;
            }
          }
        },
        &quot;host&quot;: {
          &quot;type&quot;: &quot;object&quot;
        },
        &quot;input&quot;: {
          &quot;properties&quot;: {
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;log&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;properties&quot;: {
                &quot;path&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;message&quot;: {
          &quot;type&quot;: &quot;match_only_text&quot;
        },
        &quot;meta&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;metrics&quot;: {
          &quot;properties&quot;: {
            &quot;size&quot;: {
              &quot;type&quot;: &quot;long&quot;
            },
            &quot;tmin&quot;: {
              &quot;type&quot;: &quot;long&quot;
            }
          }
        },
        &quot;process&quot;: {
          &quot;properties&quot;: {
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;tags&quot;: {
          &quot;type&quot;: &quot;keyword&quot;,
          &quot;ignore_above&quot;: 1024
        }
      }
    }
  }
}
</code></pre>
<h3>Building a custom Logstash image</h3>
<p>We are going to use a containerized Logstash for this migration: both clusters sit on Kubernetes infrastructure, so it's easier to spin up a Pod that can communicate with both clusters.</p>
<p>Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:9.3.1 and just install the plugin:</p>
<pre><code class="language-dockerfile">FROM docker.elastic.co/logstash/logstash:9.3.1

USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch
</code></pre>
<h3>Writing a Logstash pipeline</h3>
<p>Now that we have our Logstash Docker image, we need to write a pipeline that reads from OpenSearch and writes to Elasticsearch.</p>
<p><strong>The</strong> <strong>input</strong></p>
<pre><code class="language-ruby">input {
    opensearch {
        hosts =&gt; [&quot;os-cluster:9200&quot;]
        ssl =&gt; true
        ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
        user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
        password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
        index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
        slices =&gt; &quot;${SOURCE_SLICES}&quot;
        size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
        scroll =&gt; &quot;5m&quot;
        docinfo =&gt; true
        docinfo_target =&gt; &quot;[@metadata][doc]&quot;
    }
}
</code></pre>
<p>Let’s break down the most important input parameters. The values are all represented as environment variables here:</p>
<ul>
<li><strong>hosts:</strong> Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.</li>
<li><strong>index:</strong> Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod,” a data stream whose actual backing indices look like .ds-logs-myapplication-prod-000049.</li>
<li><strong>size:</strong> Specifies the maximum number of logs to retrieve in each request.</li>
<li><strong>scroll:</strong> Defines how long a search context will be kept open on the OpenSearch server. In this case, it’s set to “5m,” which means each scroll request must be completed, and the next page requested, within five minutes.</li>
<li><strong>docinfo</strong> and <strong>docinfo_target:</strong> These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.</li>
</ul>
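<p>To build intuition for how those slices parallelize the export, here is a minimal sketch in plain Python (not the plugin's actual implementation): each document is deterministically assigned to one slice, so several workers can scroll the same index concurrently without overlap. The <code>slice_id</code> hash below is a stand-in for the server-side routing hash.</p>

```python
# Sketch (not the plugin's actual code) of how sliced scrolling splits one
# index read across N parallel workers: each document is assigned to exactly
# one slice, so together the slices cover the index with no overlap.

def slice_id(doc_id: str, max_slices: int) -> int:
    # Stand-in for the server-side routing hash.
    return hash(doc_id) % max_slices

def partition(doc_ids, max_slices):
    # Group documents the way N concurrent sliced scroll requests would.
    slices = {i: [] for i in range(max_slices)}
    for doc_id in doc_ids:
        slices[slice_id(doc_id, max_slices)].append(doc_id)
    return slices

docs = [f"doc-{n}" for n in range(1000)]
slices = partition(docs, max_slices=10)

# Every document lands in exactly one slice.
assert sum(len(s) for s in slices.values()) == len(docs)
assert len({d for s in slices.values() for d in s}) == len(docs)
```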
<p>The ssl and ca_file options are highly recommended if you are migrating between clusters in different infrastructures (e.g., separate cloud providers). You don’t need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS offering whose endpoint is reachable over the internet; then ssl =&gt; true alone suffices. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.</p>
<p><strong>The (optional)</strong> <strong>filter</strong><br />
We could use this to drop or alter the documents to be written to Elasticsearch if we wanted, but we are not going to, as we want to migrate the documents as is. We are only removing extra metadata fields that Logstash includes in all documents, such as &quot;@version&quot; and &quot;host&quot;. We are also removing the original &quot;data_stream&quot; as it contains the source data stream name, which might not be the same in the destination.</p>
<pre><code class="language-ruby">filter {
    mutate {
        remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
    }
}
</code></pre>
<p><strong>The</strong> <strong>output</strong><br />
The output is straightforward: we name our data stream logs-myapplication-reindex and use the original documents’ _id as document_id to ensure there are no duplicates. In Elasticsearch, data stream names follow the convention &lt;type&gt;-&lt;dataset&gt;-&lt;namespace&gt;, so our logs-myapplication-reindex data stream has “myapplication” as its dataset and “reindex” as its namespace.</p>
<pre><code class="language-ruby">elasticsearch {
    hosts =&gt; &quot;${ELASTICSEARCH_HOST}&quot;

    user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
    password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

    document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

    data_stream =&gt; &quot;true&quot;
    data_stream_type =&gt; &quot;logs&quot;
    data_stream_dataset =&gt; &quot;myapplication&quot;
    data_stream_namespace =&gt; &quot;reindex&quot;
}
</code></pre>
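<p>As a quick illustration of that naming convention, here is a tiny hypothetical helper (not a Logstash or Elasticsearch API) that builds and parses such names:</p>

```python
# Illustrative helper for the <type>-<dataset>-<namespace> data stream
# naming convention; purely a sketch, not part of any Elastic API.

def data_stream_name(ds_type: str, dataset: str, namespace: str) -> str:
    return f"{ds_type}-{dataset}-{namespace}"

def parse_data_stream(name: str) -> dict:
    # Naive split: assumes the dataset itself contains no "-".
    ds_type, dataset, namespace = name.split("-", 2)
    return {"type": ds_type, "dataset": dataset, "namespace": namespace}

name = data_stream_name("logs", "myapplication", "reindex")
assert name == "logs-myapplication-reindex"
assert parse_data_stream(name)["namespace"] == "reindex"
```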
<h3>Deploying Logstash</h3>
<p>We have a few options to deploy Logstash: it can be deployed <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash-command-line.html">locally from the command line</a>, as a <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash.html">systemd service</a>, via <a href="https://www.elastic.co/guide/en/logstash/current/docker.html">docker</a>, or on <a href="https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-logstash.html">Kubernetes</a>.</p>
<p>Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a <strong>Pod</strong> referencing our Docker image created earlier. Let’s put our pipeline inside a <strong>ConfigMap</strong> along with some configuration files (pipelines.yml and config.yml).</p>
<p>In the configuration below, SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE are conveniently exposed as environment variables, so you just need to fill them in.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: logstash-1
spec:
  containers:
    - name: logstash
      image: ugosan/logstash-opensearch-input:9.3.1
      imagePullPolicy: Always
      env:
        - name: SOURCE_INDEX_NAME
          value: &quot;.ds-logs-myapplication-prod-000037&quot;
        - name: SOURCE_SLICES
          value: &quot;10&quot;
        - name: SOURCE_PAGE_SIZE
          value: &quot;500&quot;
        - name: LOGSTASH_WORKERS
          value: &quot;4&quot;
        - name: LOGSTASH_BATCH_SIZE
          value: &quot;1000&quot;
        - name: OPENSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: username
        - name: OPENSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: password
        - name: ELASTICSEARCH_USERNAME
          value: &quot;elastic&quot;
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: es-cluster-es-elastic-user
              key: elastic
      resources:
        limits:
          memory: &quot;4Gi&quot;
          cpu: &quot;2500m&quot;
        requests:
          memory: &quot;1Gi&quot;
          cpu: &quot;300m&quot;
      volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: etc
          mountPath: /etc/logstash
          readOnly: true
  volumes:
    - name: config-volume
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipelines.yml
                  path: pipelines.yml
                - key: logstash.yml
                  path: logstash.yml
    - name: etc
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipeline.conf
                  path: pipelines/pipeline.conf
          - secret:
              name: os-cluster-http-cert
              items:
                - key: ca.crt
                  path: certificates/opensearch-ca.crt
          - secret:
              name: es-cluster-es-http-ca-internal
              items:
                - key: tls.crt
                  path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  pipelines.yml: |
    - pipeline.id: reindex-os-es
      path.config: &quot;/etc/logstash/pipelines/pipeline.conf&quot;
      pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
      pipeline.workers: ${LOGSTASH_WORKERS}
  logstash.yml: |
    log.level: info
    pipeline.unsafe_shutdown: true
    pipeline.ordered: false
  pipeline.conf: |
    input {
        opensearch {
          hosts =&gt; [&quot;os-cluster:9200&quot;]
          ssl =&gt; true
          ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
          user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
          password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
          index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
          slices =&gt; &quot;${SOURCE_SLICES}&quot;
          size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
          scroll =&gt; &quot;5m&quot;
          docinfo =&gt; true
          docinfo_target =&gt; &quot;[@metadata][doc]&quot;
        }
    }

    filter {
        mutate {
            remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
        }
    }

    output {
        elasticsearch {
            hosts =&gt; &quot;https://es-cluster-es-http:9200&quot;
            ssl =&gt; true
            ssl_certificate_authorities =&gt; [&quot;/etc/logstash/certificates/elasticsearch-ca.crt&quot;]
            ssl_verification_mode =&gt; &quot;full&quot;

            user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
            password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

            document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

            data_stream =&gt; &quot;true&quot;
            data_stream_type =&gt; &quot;logs&quot;
            data_stream_dataset =&gt; &quot;myapplication&quot;
            data_stream_namespace =&gt; &quot;reindex&quot;
        }
    }
</code></pre>
<h2>That’s it.</h2>
<p>After a couple of hours, we successfully migrated one billion documents from OpenSearch to Elasticsearch and saved around 23% on disk storage in the process. Now that the logs are in Elasticsearch, how about extracting actual business value from them? Logs contain so much valuable information: we can not only use AIOps features to <a href="https://www.elastic.co/guide/en/observability/current/categorize-logs.html#analyze-log-categories">automatically categorize</a> those logs, but also extract <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1122s">business metrics</a> and <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1906s">detect anomalies</a> in them. Give it a try.</p>
<table>
<thead>
<tr>
<th colspan="3">OpenSearch</th>
<th colspan="3">Elasticsearch</th>
<th></th>
</tr>
<tr>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Diff.</th>
</tr>
</thead>
<tbody>
<tr>
<td>.ds-logs-myapplication-prod-000037</td>
<td>116842158</td>
<td>27285520870</td>
<td>logs-myapplication-reindex-000037</td>
<td>116842158</td>
<td>21998435329</td>
<td>21.46%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000038</td>
<td>110994116</td>
<td>27263291740</td>
<td>logs-myapplication-reindex-000038</td>
<td>110994116</td>
<td>21540011082</td>
<td>23.45%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000040</td>
<td>113362823</td>
<td>27872438186</td>
<td>logs-myapplication-reindex-000040</td>
<td>113362823</td>
<td>22234641932</td>
<td>22.50%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000041</td>
<td>112400019</td>
<td>27618801653</td>
<td>logs-myapplication-reindex-000041</td>
<td>112400019</td>
<td>22059453868</td>
<td>22.38%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000042</td>
<td>113859174</td>
<td>26686723701</td>
<td>logs-myapplication-reindex-000042</td>
<td>113859174</td>
<td>21093766108</td>
<td>23.41%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000043</td>
<td>113821016</td>
<td>27657006598</td>
<td>logs-myapplication-reindex-000043</td>
<td>113821016</td>
<td>22059454752</td>
<td>22.52%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000044</td>
<td>111093596</td>
<td>27281936915</td>
<td>logs-myapplication-reindex-000044</td>
<td>111093596</td>
<td>21559513422</td>
<td>23.43%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000048</td>
<td>114273539</td>
<td>28111420495</td>
<td>logs-myapplication-reindex-000048</td>
<td>114273539</td>
<td>22264398939</td>
<td>23.21%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000049</td>
<td>102519334</td>
<td>23731274338</td>
<td>logs-myapplication-reindex-000049</td>
<td>102519334</td>
<td>19307250001</td>
<td>20.56%</td>
</tr>
</tbody>
</table>
<p>Interested in trying Elasticsearch? <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">Start our 14-day free trial</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-header-1-billion-log-lines.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[ML and AI Ops Observability with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ml-ai-ops-observability-opentelemetry-elastic</link>
            <guid isPermaLink="false">ml-ai-ops-observability-opentelemetry-elastic</guid>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to instrument ML and AI pipelines with OpenTelemetry and Elastic to correlate traces, logs, and metrics from notebooks to production inference services.]]></description>
            <content:encoded><![CDATA[<p>While isolated execution logs might work for local experiments, they are no longer enough for the new era of complex, production-ready Machine Learning (ML) pipelines and Artificial Intelligence (AI) agents. Modern ML and AI systems present three unique challenges:</p>
<ul>
<li><strong>Distributed components</strong>: A single request might hit an API gateway, retrieve data from a feature store, evaluate a predictive model in a Python inference service, query a vector database, and call an external LLM.</li>
<li><strong>Non-determinism</strong>: AI agents make autonomous decisions and tool calls. If an agent fails, you need a full trace to understand its reasoning loop and what external tools it tried to invoke.</li>
<li><strong>Context dependence</strong>: You don't just care <em>that</em> an error happened; you need to know <em>what model version</em> was running, <em>what hyperparameters</em> were used, <em>what the input data looked like</em>, and <em>what commit</em> introduced the change. Many of these attributes are custom to your app, and you need an Observability environment flexible enough to create new parameters on the fly and use them to find and fix issues.</li>
</ul>
<p>On top of that, with the increased use of AI agents to generate code and make autonomous decisions, Observability becomes key to understanding what is working and what is not. It creates a critical feedback loop to quickly fix problems. More than ever, ML and AI applications need to adopt the best practices of mature software engineering systems to succeed.</p>
<p>This guide shows how to use OpenTelemetry and Elastic to correlate traces, logs, and metrics to track runs, compare model behavior, and trace requests across Python and Go services with one shared context.</p>
<h2>Problem context: why AI systems are harder to debug</h2>
<p>Traditional services already have distributed failure modes, but ML and AI systems add more moving parts:</p>
<ul>
<li>notebook experiments and ad hoc jobs</li>
<li>batch training and evaluation pipelines</li>
<li>online inference services</li>
<li>external API calls, including LLM providers</li>
<li>changing model versions and hyperparameters</li>
</ul>
<p>When one prediction path gets slower or starts failing, plain isolated logs do not answer enough questions. You need to correlate:</p>
<ul>
<li><strong>what ran</strong> (run ID, model version, parameters)</li>
<li><strong>where time was spent</strong> (pipeline stage latencies)</li>
<li><strong>what was the result</strong> (model stats, predictions, API calls, compare with other runs)</li>
<li><strong>what changed</strong> (code, data, dependencies)</li>
</ul>
<p>In a future blog post, we'll show you how to set up automatic RCA and remediations with <a href="https://github.com/elastic/workflows/">Elastic Workflows</a> and our AI integrations. But as a first step, ML and AI pipelines need a robust Observability framework, which is very easy to set up with OpenTelemetry and Elastic.</p>
<h2>Solution overview</h2>
<p>OpenTelemetry gives you a standard way to emit traces, metrics, and logs. Elastic provides full OpenTelemetry ingestion, giving you a single place to store and query that telemetry. Kibana's UI is fully integrated with OpenTelemetry, allowing you to explore your services, service dependencies, service latencies, spans, and metrics out-of-the-box.</p>
<p>You can start with two deployment options:</p>
<ul>
<li><strong>Cloud</strong>: send OpenTelemetry data directly to Elastic Cloud Managed OTLP Endpoint (<a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP docs</a>), without the overhead of managing collectors</li>
<li><strong>Local</strong>: run Elastic and the EDOT Collector with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">start-local</a>; the EDOT Collector will automatically listen for OTLP data on <code>localhost:4317</code></li>
</ul>
<p>Both options let you keep your application code unchanged for the initial implementation.</p>
<h2>Step 1: zero-code baseline for Python services</h2>
<p>Start by installing the Elastic Distribution of OpenTelemetry Python (<a href="https://github.com/elastic/elastic-otel-python">EDOT Python</a>) package and running your script with the <code>opentelemetry-instrument</code> wrapper. Without modifying your application code, your Python services begin emitting standard telemetry right away: any logs exported via <code>logging</code>, plus metrics and traces for auto-instrumented libraries. This data can be routed directly to Elastic's managed OTLP endpoint or a local EDOT Collector.</p>
<pre><code class="language-bash">pip install elastic-opentelemetry
edot-bootstrap --action=install
</code></pre>
<p>Export the OpenTelemetry environment variables, then run <code>opentelemetry-instrument</code> on your script to enable auto-instrumentation.</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://&lt;motlp-endpoint&gt;&quot; # No need when using start-local with EDOT
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey &lt;key&gt;&quot; # No need when using start-local with EDOT
export OTEL_RESOURCE_ATTRIBUTES=&quot;deployment.environment=prod,service.version=1.0.0&quot; # Set the environment and version for your app
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000 # Choose the interval for your application metrics

opentelemetry-instrument --service_name=&lt;pipeline-name&gt; python3 &lt;your_python_script&gt;.py # Set your chosen name for your service
</code></pre>
<p>With this baseline, you quickly get:</p>
<ul>
<li>Centralized logs with trace context: any logs exported via <code>logging</code> become full-text searchable in Elastic and Kibana</li>
<li>Alerting on log errors</li>
<li>Process and system metrics, exported to Elastic automatically: visualize them to analyze memory usage (leaks, OOM errors), CPU utilization (bottlenecks, spikes), thread counts, and disk or network I/O saturation</li>
<li>Alerting on metrics</li>
<li>Spans for auto-instrumented libraries</li>
<li>Service latency baselines and error trends</li>
<li>Manual or anomaly-detection alerting on error rates, latencies, and throughput</li>
<li>Logs, metrics, and traces correlated in a single shared context, using OpenTelemetry for instrumentation and Elastic for analysis, so you can quickly find the root cause of issues</li>
</ul>
<p>Once ingested, Kibana immediately populates out-of-the-box dashboards. You can explore full-text searchable logs, monitor system and process metrics, investigate auto-instrumented trace waterfalls, map out your ML dependencies with service maps, and easily set up alerts for latency spikes, memory or CPU usage, or log errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-logs.png" alt="Logs in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-log-errors.png" alt="Log errors in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-log-errors.png" alt="Alerts on log errors" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-metrics.png" alt="System and process metrics" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-auto-instrumented-traces.png" alt="Auto instrumented traces" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-service-map.png" alt="Service map in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-1-alerts-on-latencies.png" alt="Alerts on latencies" /></p>
<p>For LLM-specific observability, OpenTelemetry provides official <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">Semantic Conventions for Generative AI</a> to standardize how you track token usage, model names, and prompts. These semantic conventions are still in development and not yet stable. Instrumentations for the most widely used libraries in this space are being developed in the <a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai">OpenTelemetry Python Contrib repository</a>. Alternatively, you can implement these conventions manually in your custom spans. LLM-related OpenTelemetry logs, metrics, and traces sent to Elastic will be in context and automatically correlated with the rest of your application or application stack.</p>
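<p>To make this concrete, here is a minimal sketch of building such attributes by hand. The attribute names follow the current draft of the GenAI semantic conventions and may still change, so treat them as illustrative:</p>

```python
# Sketch of GenAI span attributes. The attribute names follow the current
# draft of the OpenTelemetry GenAI semantic conventions and may still
# change; always check the spec before relying on them.

def genai_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

# With a real tracer you would attach these to the span around the LLM call:
#   with tracer.start_as_current_span("llm.chat") as span:
#       for key, value in genai_attributes("gpt-4o", 820, 310).items():
#           span.set_attribute(key, value)
attrs = genai_attributes("gpt-4o", 820, 310)
assert attrs["gen_ai.request.model"] == "gpt-4o"
```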
<h2>Step 2: add ML-specific context with custom spans and log fields</h2>
<p>Auto-instrumentation is a starting point. For ML and AI Ops, add explicit spans around business stages and attach run metadata. Elastic's schema flexibility and dynamic mappings make it a perfect fit for custom attributes or metrics that are exclusive to your pipelines or specific experiments: there is no need to know what the data will look like before writing it. You can create new parameters on the fly, Elastic maps them automatically, and you can track them instantly.</p>
<p>Add custom fields and metric-like values as structured log fields so you can chart and alert on them later:</p>
<pre><code class="language-python">logger.info(&quot;training metrics&quot;, extra={
    &quot;ml.run_id&quot;: run_id,
    &quot;ml.training_accuracy&quot;: train_accuracy,
    &quot;ml.validation_accuracy&quot;: val_accuracy,
    &quot;ml.drift_detected&quot;: drift_detected,
})
</code></pre>
<p>Because Elastic handles dynamic mapping, any custom metrics or attributes you log, like model IDs, training accuracy, or drift detection, are instantly indexed and available to search in Discover or visualize in Dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-log-attributes.png" alt="Custom log attributes" /></p>
<p>This makes dashboards and rules practical:</p>
<ul>
<li>alert when <code>ml.validation_accuracy &lt; 0.8</code></li>
<li>alert when <code>ml.drift_detected == true</code></li>
<li>compare stage latency by <code>ml.model_version</code></li>
</ul>
<p>You can use these custom attributes to build targeted visualizations, and trigger alerts when ML-specific metrics like validation accuracy drop below a critical threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-charts-from-custom-log-attributes.png" alt="Charts from custom log attributes" /></p>
<p>Adding custom spans lets you break down the specific stages of your ML pipeline, such as data loading and model training, into their own measurable execution blocks, so you can analyze average latency or error rates for each stage.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans.png" alt="Custom spans in code" /></p>
<pre><code class="language-python">from opentelemetry import trace

tracer = trace.get_tracer(&quot;ml.pipeline&quot;)

with tracer.start_as_current_span(&quot;load_data&quot;) as span:
    span.set_attribute(&quot;ml.run_id&quot;, run_id)
    span.set_attribute(&quot;ml.dataset&quot;, dataset_source)
    load_data()

with tracer.start_as_current_span(&quot;train_model&quot;) as span:
    span.set_attribute(&quot;ml.model_version&quot;, model_version)
    span.set_attribute(&quot;ml.learning_rate&quot;, learning_rate)
    train_model()
</code></pre>
<p>Custom spans appear in the APM UI alongside your traces, so you can explore their latency, their impact on total execution time, stack traces, and error rates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-custom-spans-ui-in-elastic.png" alt="Custom spans UI in Elastic" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-analysing-spans.png" alt="Analysing spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-latency-and-avg-latency-of-spans.png" alt="Latency and avg latency of spans" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-2-alerts-on-custom-log-metrics.png" alt="Alerts on custom log metrics" /></p>
<h2>Step 3: trace across Python and Go in production</h2>
<p>Real inference paths often cross service boundaries. In a production environment, for example, a user request might pass through a Go-based API before hitting your Python ML inference service. OpenTelemetry ensures tracing context is preserved seamlessly across these boundaries.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-service-map-with-multiple-services.png" alt="Service map with multiple services" /></p>
<p>In our example, we have a simple Go HTTP service that acts as the entry point and demonstrates OpenTelemetry instrumentation in Go. This REST API service stores and retrieves ML predictions by querying Elasticsearch based on data IDs from the source dataset. All of its endpoints are natively instrumented with OTel spans.</p>
<p>The full request lifecycle looks like this:</p>
<ol>
<li>The Go API receives the client request.</li>
<li>It searches Elasticsearch for an existing prediction or calls the Python model service to run inference.</li>
<li>The Python service loads features, runs the model, and returns predictions.</li>
</ol>
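<p>A minimal Python sketch of this lookup-or-infer flow (the function names and data shapes here are illustrative stand-ins, not the example services' actual code):</p>
<pre><code class="language-python"># Sketch of the request lifecycle: check for a cached prediction,
# fall back to running inference. All names below are hypothetical.

def search_cached_prediction(store: dict, data_id: str):
    """Stand-in for the Elasticsearch lookup by data ID."""
    return store.get(data_id)

def run_inference(data_id: str) -> dict:
    """Stand-in for calling the Python model service."""
    return {"data_id": data_id, "prediction": 0.87, "model_version": "v1"}

def handle_request(store: dict, data_id: str) -> dict:
    # 1. The API receives the client request (a data ID).
    # 2. Search the store for an existing prediction.
    cached = search_cached_prediction(store, data_id)
    if cached is not None:
        return cached
    # 3. Otherwise call the model service, then cache the result.
    result = run_inference(data_id)
    store[data_id] = result
    return result
</code></pre>
<p>With OpenTelemetry auto-instrumentation in place, each of these steps would show up as its own span in the trace.</p>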
<p>When both services use OpenTelemetry, trace context is propagated automatically through headers. In Elastic, you can inspect one end-to-end trace and locate latency or errors by service and span.</p>
<p>The resulting distributed trace in Elastic pieces the entire journey together. You can see the exact breakdown of time spent in the Go API versus the Python model, and correlate logs from both services in a single unified view.</p>
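<p>The context crossing that Go-to-Python boundary travels in the W3C <code>traceparent</code> HTTP header, which OpenTelemetry injects and extracts for you. As a sketch of what that header actually carries (pure Python, independent of any SDK; the IDs are the W3C spec's examples):</p>
<pre><code class="language-python"># W3C Trace Context: traceparent = version-traceid-parentid-flags
# e.g. 00-{32 hex trace id}-{16 hex span id}-01  (01 = sampled)

def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # shared by every span in the trace
        "span_id": span_id,     # the caller's span, i.e. the parent
        "sampled": flags == "01",
    }

header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ctx = parse_traceparent(header)
</code></pre>
<p>The SDKs do this automatically on every outgoing and incoming request; the sketch only shows what is being propagated.</p>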
<p><img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-multiple-services.png" alt="Multiple services request flow" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-spans-per-service.png" alt="Spans per service" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-service-logs.png" alt="Go service logs" />
<img src="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/step-3-go-traces-in-discover.png" alt="Go traces in discover" /></p>
<h2>Validation checklist</h2>
<p>After instrumentation, validate with a short runbook:</p>
<ol>
<li>Confirm logs, metrics, and traces arrive for each service.</li>
<li>Verify your custom attributes (e.g. <code>run_id</code>, <code>model_version</code>, <code>llm_ground_truth_score</code>) are present in traces and logs.</li>
<li>Compare p95 latency per stage (<code>load_data</code>, <code>train_model</code>, <code>predict</code>).</li>
<li>Trigger a controlled failure and confirm error traces include stack context.</li>
<li>Test one rule for errors, one rule for latency spikes, and one rule for model-quality fields. Set up a connector and attach it to each rule so alerts reach you in Slack or email, or trigger an auto-remediation workflow.</li>
</ol>
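<p>Parts of this runbook can be automated. A toy sketch of items 2 and 3 over exported span data (the span dict shape here is hypothetical, not Elastic's document format):</p>
<pre><code class="language-python">import math

# Custom attributes from the article that every span should carry.
REQUIRED_ATTRS = {"run_id", "model_version"}

def missing_attributes(spans):
    """Names of spans lacking any required custom attribute."""
    return [s["name"] for s in spans
            if not REQUIRED_ATTRS.issubset(s.get("attributes", {}))]

def p95_latency_ms(spans, stage):
    """Nearest-rank p95 of duration_ms for spans of one stage."""
    durations = sorted(s["duration_ms"] for s in spans if s["name"] == stage)
    if not durations:
        raise ValueError(f"no spans for stage {stage!r}")
    rank = math.ceil(0.95 * len(durations)) - 1
    return durations[rank]
</code></pre>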
<h2>Conclusion and next steps</h2>
<p>OpenTelemetry gives ML and AI teams a unified telemetry layer, while Elastic makes that data instantly queryable and actionable across your entire lifecycle—from notebook experiments to production inference. By starting with zero-code instrumentation and incrementally adding ML-specific attributes and cross-language tracing, your team can easily adopt the Observability best practices of mature software engineering systems and succeed in the new era of complex AI operations.</p>
<p>Try this setup in <a href="https://cloud.elastic.co/registration">Elastic Cloud</a>, and use <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">mOTLP</a> for a managed ingest path. If you want a local sandbox first, start with <a href="https://github.com/elastic/start-local?tab=readme-ov-file#install-the-elastic-distribution-of-opentelemetry-edot-collector">Elastic start-local + EDOT Collector</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ml-ai-ops-observability-opentelemetry-elastic/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[AIOps with Elastic Observability: Modern AIOps & Log Intelligence]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-aiops-elastic-observability</link>
            <guid isPermaLink="false">modern-aiops-elastic-observability</guid>
            <pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring modern AIOps capabilities, including anomaly detection, log intelligence, and log analysis & categorization with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>AIOps Blog Refresher: Unlocking Intelligence from Your Logs with Elastic</h1>
<p>Elastic has been leading the charge with AIOps, especially in the recent 9.2 update of Elastic Observability with Streams. The conversation around AIOps has shifted dramatically as we move through the year. DevOps and SRE teams aren't asking whether they need AIOps; they're asking how to leverage it more effectively to stay ahead of exponentially growing complexity.</p>
<p>The current challenge of AIOps is that modern cloud-native environments generate massive volumes of telemetry data, orders of magnitude more than past environments. But here's what many teams overlook: logs are the richest source of operational intelligence you have. Logs tell you exactly what happened and why, while metrics only tell you that something is wrong, and traces only tell you where. The problem is that most organizations are drowning in logs. Microservices (user authentication, inventory, and the like), serverless functions, and Kubernetes generate millions of log entries daily. Without AI and machine learning, finding meaningful patterns in this data takes too much time and energy.</p>
<h2>Log Intelligence Improvement: What's New in 2025</h2>
<p>Historically, unlocking log intelligence meant long manual effort: not only parsing through logs, but also structuring them. Elastic Observability has drastically changed how teams extract value from logs. Observability is no longer simple signal analysis; modern tools need to support proactive, log-driven investigations. At Elastic, that capability is Streams.</p>
<p>Streams, a new release from Elastic, is a collection of AI-driven tools that parse raw logs, enrich them with meaningful fields, and identify significant events. With Streams, SREs can maximize the value of their data, their logs, and their systems. With system reliability as the goal, Streams reduces pipeline management overhead and accelerates observability analysis. And it takes nearly no time to set up!</p>
<p>Here is how Streams powers the Elastic Observability capabilities available now.</p>
<h3>Advanced Log Rate Analysis</h3>
<p>Log rate analysis goes far beyond simply detecting spikes. Elastic's machine learning automatically identifies when log volumes deviate from expected baselines, then contextualizes these changes within your broader system performance. When your application suddenly generates more error logs, Elastic’s AIOps doesn't just alert you; it also determines whether it's a critical issue requiring immediate attention or just a temporary anomaly.</p>
<p>This matters to your analysis because not all log spikes are equal. A 10x increase in DEBUG logs might indicate verbose logging accidentally enabled in production. A 2x increase in ERROR logs could signal a cascading failure. Log rate analysis distinguishes between these scenarios automatically, giving your team the context needed to respond appropriately.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/log-analysis.png" alt="Log Analysis" /></p>
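<p>To see why a 2x ERROR increase can outrank a 10x DEBUG flood, here is a toy severity-weighted spike score (an illustration of the idea only, not Elastic's ML model):</p>
<pre><code class="language-python"># Toy spike score: ratio to baseline, weighted by log level severity.
SEVERITY_WEIGHT = {"DEBUG": 0.1, "INFO": 0.5, "WARN": 2.0, "ERROR": 5.0}

def spike_score(level: str, baseline_per_min: float, observed_per_min: float) -> float:
    ratio = observed_per_min / max(baseline_per_min, 1e-9)
    # Only count growth above the baseline, scaled by severity.
    return SEVERITY_WEIGHT[level] * max(ratio - 1.0, 0.0)

# A 10x DEBUG flood scores lower than a 2x ERROR increase:
debug = spike_score("DEBUG", 1000, 10000)  # 0.1 * 9 = 0.9
error = spike_score("ERROR", 50, 100)      # 5.0 * 1 = 5.0
</code></pre>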
<h3>Intelligent Log Categorization with Streams</h3>
<p>This is where AIOps shines with log data. Streams uses machine learning to automatically classify and group similar log patterns, dramatically reducing noise. Instead of manually parsing millions of entries, the system identifies common structures, groups related events, and surfaces the categories that matter most.</p>
<p>Logs are unstructured by nature, making them difficult to analyze at scale. Streams corrals chaotic log streams into organized, queryable patterns. Instantly, you can see that 80% of your errors fall into three categories, helping you prioritize where to focus remediation efforts. This approach helps you reduce noise and accelerate analysis, allowing teams to act on insights faster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/categories.png" alt="Log Categorizations" /></p>
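<p>The intuition behind grouping similar log patterns can be sketched with a simple token-masking approach (a toy illustration, not the algorithm Streams actually uses):</p>
<pre><code class="language-python">import re
from collections import Counter

def log_pattern(line: str) -> str:
    """Mask variable tokens so similar messages share one pattern."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "HEX", line)   # ids / hashes
    line = re.sub(r"\b\d+(\.\d+)*\b", "NUM", line)    # counts, durations, IPs
    return line

def categorize(lines):
    """Count how many raw lines collapse into each pattern."""
    return Counter(log_pattern(l) for l in lines)
</code></pre>
<p>Two timeout messages that differ only in their millisecond values collapse into one category, which is exactly the kind of noise reduction that makes millions of entries reviewable.</p>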
<h3>Multi-Dimensional Anomaly Detection</h3>
<p><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection">Anomaly detection</a> now simultaneously examines relationships between logs, metrics, and traces. A slight increase in response time might not trigger an alert by itself, but when correlated with unusual log patterns and memory consumption changes, the system recognizes it as an early warning sign.</p>
<p>Logs contain contextual information that metrics and traces can't capture: stack traces, user IDs, transaction details, error messages, and more. By correlating log anomalies with other signals, you get the full picture of what's happening in your system. This holistic view enables teams to catch issues earlier and understand their full impact across the stack.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/anomalies.png" alt="Anomaly Detection" /></p>
<h3>Enhanced Root Cause Analysis Powered by Significant Events</h3>
<p>When an issue occurs, Elastic's Streams accelerates root cause analysis through AI-assisted log parsing and <a href="https://www.elastic.co/docs/solutions/observability/streams/management/significant-events">“Significant events.”</a> Significant event queries can be defined by AI or manually, depending on whether you know which logs you are looking for. Elastic’s AIOps then traces the problem through your entire stack using these events, together with enriched log data and distributed tracing. The system correlates failed transactions with specific log entries, deployment events, and infrastructure changes, helping you understand not just what broke, but why and when.</p>
<p>Streams makes the analysis of your logs quick and automatic by going across your entire distributed system within seconds, grabbing relevant log entries such as stack traces, state information, error messages, and more. What used to require hours of manual investigation and deduction now happens automatically, freeing you and your team from tedious detective work and enabling faster resolution. </p>
<h2>Logs in Action: Real-World Impact</h2>
<p>Let's look at how these capabilities work together in practice. Imagine your payment processing service is experiencing intermittent failures - only 0.5% of transactions, but enough to concern your team. Traditional monitoring shows everything is mostly okay, but customers are still complaining.</p>
<p>Without Streams, an SRE might initially run some broad queries, manually sift through thousands of logs, struggle to connect all the dots, and ultimately not understand the correlation between the errors and recent system changes. </p>
<p>With Elastic Streams and AIOps, many of these potential problems are instantly mitigated:</p>
<ul>
<li>
<p>Streams automatically parses the payment service's logs, adding connection timeouts to a new category of significant events</p>
</li>
<li>
<p>Log rate analysis with Streams reveals that this significant event category has been slowly growing over the past month, with timeouts climbing from a handful of occurrences to a much larger number</p>
</li>
<li>
<p>Elastic’s built-in anomaly detection correlates these significant events with deployment data and identifies that they started appearing after a recent load balancer configuration change</p>
</li>
<li>
<p>Root cause analysis pinpoints the exact database connection pool setting that is too restrictive for peak load by tracing affected transactions through the previously enriched logs</p>
</li>
</ul>
<p>What usually takes 4-8 hours of manual log analysis is resolved in minutes, with Elastic automatically highlighting the relevant log entries that tell the complete story. This is the power of AIOps and Streams as applied to log intelligence.</p>
<h2>The Power of Unified Log Intelligence</h2>
<p>What sets Elastic apart is treating logs as a priority in your observability strategy. Elastic provides comprehensive log ingestion that centralizes petabytes of logs from across your infrastructure with flexible parsing and enrichment. The platform uses purpose-built machine learning models that understand log patterns, not generic algorithms retrofitted for log analysis.</p>
<p>Logs don't exist in isolation, which is why Elastic correlates log data with metrics, traces, and business events to provide complete context. And because log volumes can be massive, Elastic's tiered storage approach means you can retain years of logs for compliance and historical analysis without breaking the budget.</p>
<h2>Why Logs Matter More Than Ever</h2>
<p>Logs have become the cornerstone of effective AIOps for three critical reasons.</p>
<p>First off, logs capture what metrics can't. A metric tells you the CPU is at 80%, but a log tells you which process is consuming resources and why. This level of detail is essential for understanding not just that something is wrong, but what specifically is causing the problem.</p>
<p>Second, logs provide business context. Error messages contain user IDs, transaction details, and business logic failures that help you understand customer impact. When you're troubleshooting an issue, knowing which customers are affected and what they were trying to do is invaluable for prioritizing your response.</p>
<p>Third, logs enable true root cause analysis. Stack traces, error messages, and application state captured in logs are essential for understanding the why behind every incident. Without this information, teams are left guessing at root causes rather than definitively identifying and fixing them.</p>
<p>The teams winning with AIOps in 2025 aren't just monitoring metrics; they're extracting intelligence from their logs at scale, turning operational data into actionable insights.</p>
<h2>Transform Your Log Strategy Today</h2>
<p>Every hour your team spends manually searching through logs is an hour they're not spending on innovation. Every incident that could have been prevented through intelligent log analysis represents both technical debt and business risk.</p>
<p>Elastic Observability provides the foundation you need to unlock the intelligence hidden in your logs. With automatic categorization, anomaly detection, and ML-powered analysis, you can start seeing value immediately. Check out this recent <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">article</a> to get started with Elastic Streams and Observability today!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The observability gap: Why your monitoring strategy isn't ready for what's coming next]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-observability-opentelemetry-correlation-ai</link>
            <guid isPermaLink="false">modern-observability-opentelemetry-correlation-ai</guid>
            <pubDate>Mon, 25 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The increasing complexity of distributed applications and the observability data they generate creates challenges for SREs and IT Operations teams. Take a look at how you can close this observability gap with OpenTelemetry and the right strategy.]]></description>
            <content:encoded><![CDATA[<p>Anyone that’s been to London knows the announcements at the Tube to “Mind the gap” but what about the gap that’s developing in our monitoring and observability strategies? I’ve been through this toil before, and have run a distributed system that was humming along perfectly. My alerts were manageable, my dashboards made sense, and when things broke, I could usually track down the issue in a reasonable amount of time.</p>
<p>Fast forward 3-5 years and things have changed: we added Kubernetes, embraced microservices, and maybe even sprinkled in some AI-powered features. Suddenly, you're drowning in telemetry data, alert fatigue is real, and correlating issues across your distributed architecture feels stressful.</p>
<p>You're experiencing what I call the &quot;observability gap&quot;, where system complexity rockets ahead while our monitoring maturity crawls behind. Today, we're going to explore why this gap exists, what's driving it wider, and most importantly, how to close it using modern observability practices.</p>
<h2>The complexity rocket ship has left the station</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image2.jpg" alt="Observability Gap" /></p>
<p>Let's be honest about what we're dealing with. The scale and complexity of our infrastructure isn't growing linearly, it's exponential. We've gone from monolithic applications running on physical servers to container orchestration platforms managing hundreds of microservices, with AI algorithms now starting to make scaling decisions autonomously.</p>
<p>This trajectory shows no signs of slowing down. With AI-assisted coding accelerating development cycles and intelligent orchestration systems like Kubernetes evolving toward predictive scaling, we're looking at infrastructure that's not just complex, but dynamically complex.</p>
<p>Meanwhile, our observability tooling? It's stuck in the past, designed for a world where you knew exactly how many servers you had and could manually correlate logs with metrics by cross-referencing timestamps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image3.jpg" alt="Observability Gap part 2" /></p>
<h2>The telemetry data explosion (and why sampling isn't the answer)</h2>
<p>One of the first things teams notice as they scale is their observability bill climbing faster than their infrastructure costs. The knee-jerk reaction is often to start sampling data: downsampling metrics, head-sampling traces, deduplicating logs. While these techniques have their place, they're fundamentally at odds with where we're heading.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image4.jpg" alt="Data Management: Reduce fidelity of data" /></p>
<p>Here's the thing: ML and AI systems thrive on rich, contextual data. When you sample away the &quot;noise,&quot; you're often discarding the very signals that could help you understand system behavior patterns or predict failures. Instead of asking &quot;how can we collect less data?&quot;, the better question is &quot;how can we store and process all this data cost-effectively?&quot;</p>
<p>Modern storage architectures, particularly those leveraging object storage and advanced compression techniques like ZStandard, can achieve remarkable cost-to-value ratios. The secret is organizing related data together and moving it to cheaper storage tiers quickly. This approach lets you have your cake and eat it too: full-fidelity data retention without breaking the bank.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image5.jpg" alt="Data Management: Make Storage Cheaper" /></p>
<p>Of course, there is a balance here, and not all your applications are equal. As a first step, look at your most critical flows and applications and ensure they have the richest telemetry. Don't take a sledgehammer approach and sample all your data just to reduce bills when a scalpel is best.</p>
<h2>OpenTelemetry (OTel): the foundation everything else builds on</h2>
<p>If I had to pick the single most transformative change in observability during my career, it would be OpenTelemetry. Not because it's flashy or revolutionary in concept, but because it solves fundamental problems that have plagued us for years.</p>
<p>Before OTel, instrumenting applications meant vendor lock-in. Want to switch from vendor A to vendor B? Good luck re-instrumenting your entire codebase. Want to send the same telemetry to multiple backends? Hope you enjoy maintaining multiple agent configurations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image6.jpg" alt="What is OpenTelemetry" /></p>
<p>OpenTelemetry changes things completely. Here are the three main reasons why.</p>
<p><strong>Vendor Neutrality:</strong> Your instrumentation code becomes portable. The same OTEL SDK can send data to any compliant backend.</p>
<p><strong>OpenTelemetry Semantic Conventions:</strong> All your telemetry (logs, metrics, traces, profiles, wide-events) shares common metadata like service names, resource attributes, and trace context.</p>
<p><strong>Auto-Instrumentation:</strong> For most popular languages and frameworks, you get rich telemetry with zero code changes.</p>
<p>OTel also makes manual instrumentation incredibly valuable with minimal effort. Adding a single line like this:</p>
<p><code>baggage.set_baggage(&quot;customer.id&quot;, &quot;alice123&quot;)</code></p>
<p>in your authentication service means that customer ID automatically flows through every downstream service call, every database query, every log message. Suddenly, you can search all your telemetry data by customer ID across your entire distributed system.</p>
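<p>On the wire, baggage travels between services in the W3C <code>baggage</code> HTTP header as percent-encoded key=value pairs. A pure-Python sketch of that encoding (the OpenTelemetry SDK handles this for you; this is only illustrative):</p>
<pre><code class="language-python"># W3C Baggage header: comma-separated key=value members, percent-encoded.
from urllib.parse import quote, unquote

def encode_baggage(entries: dict) -> str:
    return ",".join(f"{quote(k)}={quote(str(v))}" for k, v in entries.items())

def decode_baggage(header: str) -> dict:
    out = {}
    for member in header.split(","):
        key, _, value = member.strip().partition("=")
        out[unquote(key)] = unquote(value)
    return out

# The single set_baggage call above produces a header like this downstream:
header = encode_baggage({"customer.id": "alice123"})
</code></pre>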
<p>The trajectory is clear: within a few years, OTel will be as ubiquitous and invisible as Kubernetes is becoming today. Runtimes will include it by default, cloud providers will offer OTel collectors at the edge, and frameworks will come pre-instrumented.</p>
<h2>Correlation: the secret sauce that makes everything click</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image7.jpg" alt="Why do we need correlation?" /></p>
<p>You get an alert about high latency. You check your metrics dashboard: yep, the 95th percentile is spiking. You switch to your tracing system and see some slow requests. You hop over to your logging system and find some error messages around the same time. Now comes the fun part: figuring out which logs correspond to which traces and whether they're related to the metric that alerted you.</p>
<p>This context-switching nightmare is exactly what proper correlation eliminates. When your telemetry data shares common identifiers (for example, trace IDs in logs, consistent service names, synchronized timestamps, or even customer IDs), you can seamlessly pivot between different signal types without losing context.</p>
<p>But correlation goes beyond just technical convenience. When you can search all your logs by customer.id and immediately see the traces and metrics for that customer's journey through your system, you transform how you approach support and debugging. When you can filter your entire observability stack by deployment version and instantly understand the impact of a release, you change how you think about deployments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image8.jpg" alt="How does this work?" /></p>
<p>Metrics? Yes, even metrics can be correlated by using OpenTelemetry exemplars. For example, in Python you would turn on exemplars as follows.</p>
<pre><code class="language-python"># Set up metrics with exemplars enabled
exemplar_filter = ExemplarFilter(trace_based=True)

exemplar_reservoir = ExemplarReservoir(
    exemplar_filter=exemplar_filter,
    max_exemplars=5
)
</code></pre>
<p>This associates a metric sample with the trace that is active when the sample is recorded, so your metrics are correlated with your traces.</p>
<h2>Then again, why correlate at all?</h2>
<p>So you may be thinking: this is great, and I can see it being a useful strategy. It is especially useful when you have metrics, logs, and traces in separate systems. However, pretty soon you realize it's a lot of effort when you could just combine all this data into a single data structure and avoid the need to correlate at all. The observability industry agrees and has recently been espousing the benefits of a new signal type called wide-events.</p>
<p>Wide-events are really just structured logs: the idea is to put metric data, trace data, and log data into the same wide data structure, which makes analysis much easier. Think about it: with a single data structure you can very quickly run queries and aggregations without having to join any data, which can get pretty expensive.</p>
<p>Additionally, you increase the information density per log record, which is particularly valuable for AI applications. AI gets a context-rich dataset to analyze with minimal latency: a single record with enough descriptive capability to quickly find the root cause of your issue without digging around in other data stores and reverse-engineering whatever schemas those stores use.</p>
<p>LLMs especially LOVE context and if you can give them all the context they need without having them try to find it, your investigation time will significantly reduce.</p>
<p>This isn't just about making SRE life easier (though it does that). It's about creating the rich, interconnected dataset that AI and ML systems need to understand your infrastructure's behavior patterns.</p>
<h2>AI-driven investigations</h2>
<p>Observability tools today have gotten pretty good at solving the alerting fatigue and dashboarding problems; things have become quite mature there. Alert correlation and other techniques drastically reduce the noise in these domains, not to mention a focus on being alerted by SLOs instead of pure technical metrics. Life has gotten better over the past few years for SREs here.</p>
<p>Alerts are one piece of the puzzle, but the latest AI techniques using LLMs and agentic AI can unlock time savings in a different spot: during investigations. Investigations are typically what drag on when you have an outage; the cognitive overload while the pressure is on is very real and stressful for SREs.</p>
<p>The good news is that once we get our data in good shape (correlated, enriched, and shaped as wide-events) and store it in full fidelity, we have the tools to drive faster investigations.</p>
<p>LLMs can take all that rich data and do some very powerful analysis that can cut down your investigation time. Let's walk through an example.</p>
<p>Imagine we have the following basic log. We only have a limited amount of data for an LLM to reason about. All it can tell is that a database failed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image9.jpg" alt="What is a basic log" /></p>
<p>Let's see what this looks like when we use a wide-event. Notice that we already see some significant benefits: we only had to visit the log from a single node, the node that serviced the request. We didn’t have to dig into downstream logs. This already makes life easier for the LLM; it doesn't have to figure out how to correlate multiple log lines, traces, and metrics, though we still have correlation IDs if we need to look in downstream systems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image10.jpg" alt="App Log" /></p>
<p>Next we have all this additional rich data that an LLM can use to reason about what happened. LLMs work best with context and if you can feed them as much context as possible they will work more effectively to reduce your investigation time.</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>How an LLM uses it</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>trace_id</code>, <code>parent_span_id</code></td>
<td>Thread every hop together without parsing free-text</td>
</tr>
<tr>
<td><code>status.code</code>, <code>error.*</code></td>
<td>Precise failure class; no NLP guess-work</td>
</tr>
<tr>
<td><code>db.*</code></td>
<td>Root-cause surface (&quot;postgres isn't provisioned&quot;)</td>
</tr>
<tr>
<td><code>user.id</code>, <code>cloud.region</code></td>
<td>Instant blast-radius queries</td>
</tr>
<tr>
<td><code>deployment.version</code></td>
<td>Correlation with new releases</td>
</tr>
</tbody>
</table>
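<p>Put together, a wide-event carrying the fields above might look like the following record (a hypothetical shape for illustration, not a prescribed schema):</p>
<pre><code class="language-python">import json

# Hypothetical wide-event: one record carrying trace, error, database,
# user, infrastructure, and deployment context together.
wide_event = {
    "@timestamp": "2025-08-25T12:00:00Z",
    "message": "checkout failed: could not connect to postgres",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "parent_span_id": "00f067aa0ba902b7",
    "status": {"code": "ERROR"},
    "error": {"type": "ConnectionError"},
    "db": {"system": "postgresql", "operation": "INSERT"},
    "user": {"id": "alice123"},
    "cloud": {"region": "us-east-1"},
    "deployment": {"version": "2025.08.1"},
}

# One record answers "what failed, for whom, where, and on which release"
# without joining three data stores.
line = json.dumps(wide_event)
</code></pre>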
<p>Notice that we didn’t get rid of the unstructured error message, this is still useful context! LLMs are great at processing unstructured text so this textual description helps it understand the problem even further.</p>
<p>Large language models shine when they’re handed complete, context-rich evidence, exactly what wide-event logging supplies. Invest once in richer logs, and every downstream AI workflow (summaries, anomaly detection, natural-language queries) becomes simpler, cheaper, and far more reliable.</p>
<h2>Building toward the future</h2>
<p>As I look ahead, three trends seem inevitable:</p>
<ol>
<li>
<p><strong>OpenTelemetry semantic conventions power wide-events:</strong> OTel semantic conventions will become as standard for creating wide-events as logging is today. Cloud providers, runtimes, and frameworks will use them by default.</p>
</li>
<li>
<p><strong>Making sense of logs with LLMs:</strong> Both improving the richness of your data and having LLMs automatically improve the richness of your existing logs will become essential for shortening investigation times.</p>
</li>
<li>
<p><strong>AI will be essential</strong>: As system complexity outpaces human cognitive ability to understand it, AI assistance will become necessary for maintaining reasonable investigation times.</p>
</li>
</ol>
<p>The organizations that start building toward this future now, adopting OpenTelemetry, investing in richer observability, and beginning to experiment with AI-assisted debugging will have a significant advantage as these trends accelerate.</p>
<h2>Your next steps</h2>
<p>If you're dealing with the observability gap in your own environment, here's where I'd start:</p>
<ol>
<li>
<p><strong>Evaluate your logs:</strong> Do your logs have the richness of data you need to shorten investigation times? Can LLMs help provide additional context?</p>
</li>
<li>
<p><strong>Start experimenting with OpenTelemetry:</strong> Even if you can't migrate everything immediately, instrumenting new services with OTel and using semantic conventions to produce wide-events gives you experience with the technology and starts building your enriched dataset.</p>
</li>
<li>
<p><strong>Add high-value context:</strong> Customer IDs, session IDs, deployment versions. Even small amounts of contextual metadata can dramatically improve your debugging capabilities.</p>
</li>
<li>
<p><strong>Think beyond storage costs:</strong> Instead of sampling data away, investigate modern storage architectures that let you keep everything at a reasonable cost for your most critical services.</p>
</li>
</ol>
<p>The complexity rocket ship has left the station, and it's not slowing down. The question isn't whether your observability strategy needs to evolve; it's whether you'll evolve it proactively or reactively. I know which approach leads to better sleep at night.</p>
<h2>Additional resources</h2>
<ul>
<li><a href="https://www.elastic.co/virtual-events/getting-started-logging">Getting started with logging on the ELK Stack webinar</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai">The next evolution of observability: unifying data with OpenTelemetry and generative AI blog</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Pivoting Elastic's Data Ingestion to OpenTelemetry blog</a></li>
</ul>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image1.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Monitor dbt pipelines with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability</link>
            <guid isPermaLink="false">monitor-dbt-pipelines-with-elastic-observability</guid>
            <pubDate>Fri, 26 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to set up a dbt monitoring system with Elastic that proactively alerts on data processing cost spikes, anomalies in rows per table, and data quality test failures]]></description>
            <content:encoded><![CDATA[<p>In the Data Analytics team within the Observability organization in Elastic, we use <a href="https://www.getdbt.com/product/what-is-dbt">dbt (dbt™, data build tool)</a> to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use <a href="https://docs.getdbt.com/docs/core/installation-overview">dbt core</a>, the <a href="https://github.com/dbt-labs/dbt-core">open-source project</a>, where you can develop from the command line and run your dbt project.</p>
<p>Our data transformation pipelines run daily and process the data that feeds our internal dashboards, reports, analyses, and Machine Learning (ML) models.</p>
<p>In the past, there have been incidents where the pipelines failed, the source tables contained wrong data, or a change we introduced into our SQL code caused data quality issues, and we only realized it once a weekly report showed an anomalous number of records. That’s why we built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and helps us, with visualizations and analyses, understand their root cause, saving us several hours or days of manual investigation.</p>
<p>We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.</p>
<p>The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTel and Elastic. Stay tuned.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/architecture.png" alt="1 - architecture" /></p>
<h2>Why monitor dbt pipelines with Elastic?</h2>
<p>With every invocation, dbt generates and saves one or more JSON files called <a href="https://docs.getdbt.com/reference/artifacts/dbt-artifacts">artifacts</a> containing log data on the invocation results. <code>dbt run</code> and <code>dbt test</code> invocation logs are <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">stored in the file <code>run_results.json</code></a>, as per the dbt documentation:</p>
<blockquote>
<p>This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many <code>run_results.json</code> can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.</p>
</blockquote>
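<p>For orientation, here is an abridged, hypothetical illustration of the shape of <code>run_results.json</code>, limited to the fields this post relies on (names and numbers are made up; see the dbt artifact docs for the full schema):</p>
<pre><code class="language-json">{
  &quot;metadata&quot;: { &quot;generated_at&quot;: &quot;2024-07-01T06:00:00Z&quot; },
  &quot;elapsed_time&quot;: 512.3,
  &quot;args&quot;: { &quot;which&quot;: &quot;run&quot; },
  &quot;results&quot;: [
    {
      &quot;unique_id&quot;: &quot;model.my_project.dim_users&quot;,
      &quot;status&quot;: &quot;success&quot;,
      &quot;execution_time&quot;: 42.7,
      &quot;adapter_response&quot;: {
        &quot;bytes_processed&quot;: 1073741824,
        &quot;bytes_billed&quot;: 1073741824,
        &quot;slot_ms&quot;: 93000,
        &quot;rows_affected&quot;: 250000
      }
    }
  ]
}
</code></pre>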
<p>Monitoring <code>dbt run</code> invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the <code>dbt run</code> logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.</p>
<p>Monitoring <code>dbt test</code> invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the <code>dbt test</code> logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the <code>dbt run</code> logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.</p>
<p>Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.</p>
<p>We can also correlate this information with other events ingested into Elastic, for example using the <a href="https://www.elastic.co/guide/en/enterprise-search/current/connectors-github.html">Elastic Github connector</a>, we can correlate data quality test failures or other anomalies with code changes to find the root cause of the commit or PR that caused the issues. By ingesting application logs into Elastic, we can also analyze if these issues in our pipelines have affected downstream applications, increasing latency, throughput or error rates using APM. Ingesting billing, revenue data or web traffic, we could also see the impact in business metrics.</p>
<h2>How to export dbt invocation logs to Elasticsearch</h2>
<p>We use the <a href="https://elasticsearch-py.readthedocs.io/en">Python Elasticsearch client</a> to send the dbt invocation logs to Elastic after we run our <code>dbt run</code> and <code>dbt test</code> processes daily in production. The setup just requires you to install the <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#installation">Elasticsearch Python client</a> and obtain your Elastic Cloud ID (go to <a href="https://cloud.elastic.co/deployments/">https://cloud.elastic.co/deployments/</a>, select your deployment and find the <code>Cloud ID</code>) and your Elastic Cloud API Key <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#connecting">(following this guide)</a>.</p>
<p>This Python helper function will index the results from your <code>run_results.json</code> file into the specified index. You just need to export these variables to the environment:</p>
<ul>
<li><code>RESULTS_FILE</code>: path to your <code>run_results.json</code> file</li>
<li><code>DBT_RUN_LOGS_INDEX</code>: the name you want to give to dbt run logs index in Elastic, e.g. <code>dbt_run_logs</code></li>
<li><code>DBT_TEST_LOGS_INDEX</code>: the name you want to give to the dbt test logs index in Elastic, e.g. <code>dbt_test_logs</code></li>
<li><code>ES_CLUSTER_CLOUD_ID</code></li>
<li><code>ES_CLUSTER_API_KEY</code></li>
</ul>
<p>Then call the function <code>log_dbt_es</code> from your python code or save this code as a python script and run it after executing your <code>dbt run</code> or <code>dbt test</code> commands:</p>
<pre><code>from elasticsearch import Elasticsearch, helpers
import os
import sys
import json

def log_dbt_es():
   RESULTS_FILE = os.environ[&quot;RESULTS_FILE&quot;]
   DBT_RUN_LOGS_INDEX = os.environ[&quot;DBT_RUN_LOGS_INDEX&quot;]
   DBT_TEST_LOGS_INDEX = os.environ[&quot;DBT_TEST_LOGS_INDEX&quot;]
   es_cluster_cloud_id = os.environ[&quot;ES_CLUSTER_CLOUD_ID&quot;]
   es_cluster_api_key = os.environ[&quot;ES_CLUSTER_API_KEY&quot;]


   es_client = Elasticsearch(
       cloud_id=es_cluster_cloud_id,
       api_key=es_cluster_api_key,
       request_timeout=120,
   )


   if not os.path.exists(RESULTS_FILE):
       print(f&quot;ERROR: {RESULTS_FILE} No dbt run results found.&quot;)
       sys.exit(1)


   with open(RESULTS_FILE, &quot;r&quot;) as json_file:
       results = json.load(json_file)
       timestamp = results[&quot;metadata&quot;][&quot;generated_at&quot;]
       metadata = results[&quot;metadata&quot;]
       elapsed_time = results[&quot;elapsed_time&quot;]
       args = results[&quot;args&quot;]
       docs = []
       for result in results[&quot;results&quot;]:
           if result[&quot;unique_id&quot;].split(&quot;.&quot;)[0] == &quot;test&quot;:
               result[&quot;_index&quot;] = DBT_TEST_LOGS_INDEX
           else:
               result[&quot;_index&quot;] = DBT_RUN_LOGS_INDEX
           result[&quot;@timestamp&quot;] = timestamp
           result[&quot;metadata&quot;] = metadata
           result[&quot;elapsed_time&quot;] = elapsed_time
           result[&quot;args&quot;] = args
           docs.append(result)
       _ = helpers.bulk(es_client, docs)
   return &quot;Done&quot;

# Call the function
log_dbt_es()
</code></pre>
<p>If you want to add/remove any other fields from <code>run_results.json</code>, you can modify the above function to do it.</p>
<p>Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.</p>
<p>Go to Discover, click on the data view selector on the top left and “Create a data view”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-create-dataview.png" alt="2 - discover create a data view" /></p>
<p>Now you can create a data view with your preferred name. Do this for both dbt run (<code>DBT_RUN_LOGS_INDEX</code> in your code) and dbt test (<code>DBT_TEST_LOGS_INDEX</code> in your code) indices:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/create-dataview.png" alt="3 - create a data view" /></p>
<p>Going back to Discover, you’ll be able to select the Data Views and explore the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-logs-explorer.png" alt="4 - discover logs explorer" /></p>
<h2>dbt run alerts, dashboards and ML jobs</h2>
<p>The invocation of <a href="https://docs.getdbt.com/reference/commands/run"><code>dbt run</code></a> executes compiled SQL model files against the current database. <code>dbt run</code> invocation logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique model identifier</li>
<li><code>execution_time</code>: Total time spent executing this model run</li>
</ul>
<p>The logs also contain the following metrics about the job execution from the adapter:</p>
<ul>
<li><code>adapter_response.bytes_processed</code></li>
<li><code>adapter_response.bytes_billed</code></li>
<li><code>adapter_response.slot_ms</code></li>
<li><code>adapter_response.rows_affected</code></li>
</ul>
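<p>To make the per-table rollup concrete, here is a small pure-Python sketch of the aggregation the dashboards and anomaly detection jobs perform at scale: summing <code>rows_affected</code> and <code>slot_ms</code> per <code>unique_id</code>. The documents and values are hypothetical stand-ins for the indexed run logs:</p>

```python
from collections import defaultdict

# Hypothetical documents mirroring the indexed dbt run logs
docs = [
    {"unique_id": "model.proj.table_a",
     "adapter_response": {"rows_affected": 1000, "slot_ms": 4000}},
    {"unique_id": "model.proj.table_a",
     "adapter_response": {"rows_affected": 1200, "slot_ms": 4600}},
    {"unique_id": "model.proj.table_b",
     "adapter_response": {"rows_affected": 50, "slot_ms": 300}},
]

# Sum each metric per table, as the Kibana visualizations do for us
totals = defaultdict(lambda: {"rows_affected": 0, "slot_ms": 0})
for doc in docs:
    for metric in ("rows_affected", "slot_ms"):
        totals[doc["unique_id"]][metric] += doc["adapter_response"][metric]
```

A time series of exactly these per-table sums is what the anomaly detection jobs model, so a sudden drop in daily rows or a spike in slot time for one table stands out immediately.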
<p>We have used Kibana to set up <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection jobs</a> on the above-mentioned metrics. You can configure a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">multi-metric job</a> split by <code>unique_id</code> to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use <a href="https://www.elastic.co/guide/en/machine-learning/8.14/ml-jobs-from-lens.html">this shortcut</a> to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-view-results.html">view the jobs</a> and add them to a dashboard using the three dots button in the anomaly timeline:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-add-to-dashboard.png" alt="5 - add ML job to dashboard" /></p>
<p>We have used the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-alerts.html">ML job to set up alerts</a> that send us emails/slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning &gt; Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-create-alert.png" alt="6 - create alert from ML job" /></p>
<p>We also use <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">Kibana dashboards</a> to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-dashboard.png" alt="7 - ML job in dashboard" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-slot-time.png" alt="8 - dashboard slot time chart" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-aggregated-metrics.png" alt="9 - dashboard aggregated metrics" /></p>
<h2>dbt test alerts and dashboards</h2>
<p>You may already be familiar with <a href="https://docs.getdbt.com/docs/build/data-tests">tests in dbt</a>, but if you’re not, dbt data tests are assertions you make about your models. Using the command <a href="https://docs.getdbt.com/reference/commands/test"><code>dbt test</code></a>, dbt will tell you if each test in your project passes or fails. <a href="https://docs.getdbt.com/docs/build/data-tests#example">Here is an example of how to set them up</a>. In our team, we use out-of-the-box dbt tests (<code>unique</code>, <code>not_null</code>, <code>accepted_values</code>, and <code>relationships</code>) and the packages <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">dbt_utils</a> and <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/">dbt_expectations</a> for some extra tests. When the command <code>dbt test</code> is run, it generates logs that are stored in <code>run_results.json</code>.</p>
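<p>As a quick illustration, a minimal <code>schema.yml</code> wiring up some of the out-of-the-box tests mentioned above might look like this (the model and column names are hypothetical):</p>
<pre><code class="language-yaml">version: 2

models:
  - name: users            # hypothetical model
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: plan
        tests:
          - accepted_values:
              values: [&quot;free&quot;, &quot;gold&quot;, &quot;platinum&quot;]
</code></pre>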
<p>dbt test logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique test identifier; tests contain the “test” prefix in their unique identifier</li>
<li><code>status</code>: Result of the test, <code>pass</code> or <code>fail</code></li>
<li><code>execution_time</code>: Total time spent executing this test</li>
<li><code>failures</code>: 0 if the test passes and 1 if it fails</li>
<li><code>message</code>: If the test fails, the reason why it failed</li>
</ul>
<p>The logs also contain the metrics about the job execution from the adapter.</p>
<p>We have set up alerts on document count (see <a href="https://www.elastic.co/guide/en/observability/8.14/custom-threshold-alert.html">guide</a>) that send us an email or Slack message whenever any test fails. The rule is defined on the dbt test Data View we created earlier, with a query filtering on <code>status:fail</code> to select the logs of failed tests, and a condition of document count greater than 0.
Whenever a test fails in production, we get an alert with links to the alert details and dashboards so we can troubleshoot it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/email-alert.png" alt="10 - alert" /></p>
<p>We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-tests.png" alt="11 - dashboard dbt tests" /></p>
<h2>Finding root causes with the AI Assistant</h2>
<p>The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot the incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and the increase of the slot time vs. the baseline. Then, we asked for the root cause, and the AI Assistant was able to find and provide us a link to a PR from our Github changelog that matched the start of the incident and was the most probable cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ai-assistant.png" alt="12 - ai assistant troubleshoot" /></p>
<h2>Conclusion</h2>
<p>As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.</p>
<p>dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/monitoring-dbt-with-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor OpenAI API and GPT models with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-openai-api-gpt-models-opentelemetry</link>
            <guid isPermaLink="false">monitor-openai-api-gpt-models-opentelemetry</guid>
            <pubDate>Tue, 04 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Get ready to be blown away by this game-changing approach to monitoring cutting-edge ChatGPT applications! As the ChatGPT phenomenon takes the world by storm, it's time to supercharge your monitoring game with OpenTelemetry and Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>ChatGPT is so hot right now, it broke the internet. As an avid user of ChatGPT and a developer of ChatGPT applications, I am incredibly excited by the possibilities of this technology. What I see happening is that there will be exponential growth of ChatGPT-based solutions, and people are going to need to monitor those solutions.</p>
<p>Since this is a pretty new technology, we wouldn’t want to burden our shiny new code with proprietary technology, would we? No, we would not, and that is why we are going to use OpenTelemetry to monitor our ChatGPT code in this blog. This is particularly relevant for me as I recently created a service to generate meeting notes from Zoom calls. If I am to release this into the wild, how much is it going to cost me and how do I make sure it is available?</p>
<h2>OpenAI APIs to the rescue</h2>
<p>The OpenAI API is pretty awesome, no doubt. It also gives us the information shown below in each API response, which can help us understand what we are being charged. By using the token counts, the model, and the pricing that OpenAI has put up on its website, we can calculate the cost. The question is, how do we get this information into our monitoring tools?</p>
<pre><code class="language-json">{
  &quot;choices&quot;: [
    {
      &quot;finish_reason&quot;: &quot;length&quot;,
      &quot;index&quot;: 0,
      &quot;logprobs&quot;: null,
      &quot;text&quot;: &quot;\n\nElastic is an amazing observability tool because it provides a comprehensive set of features for monitoring&quot;
    }
  ],
  &quot;created&quot;: 1680281710,
  &quot;id&quot;: &quot;cmpl-70CJq07gibupTcSM8xOWekOTV5FRF&quot;,
  &quot;model&quot;: &quot;text-davinci-003&quot;,
  &quot;object&quot;: &quot;text_completion&quot;,
  &quot;usage&quot;: {
    &quot;completion_tokens&quot;: 20,
    &quot;prompt_tokens&quot;: 9,
    &quot;total_tokens&quot;: 29
  }
}
</code></pre>
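<p>To make the cost arithmetic concrete: text-davinci-003 was billed per total token, and at the $0.02 per 1K-token rate used later in this post, the 29-token response above works out as follows (a worked example at that historical rate, not current OpenAI pricing):</p>

```python
# usage.total_tokens from the sample response above
total_tokens = 29

# davinci rate used in this post: $0.02 per 1,000 tokens
cost_usd = total_tokens * 0.02 / 1000  # about $0.00058 per request
```

Fractions of a cent per request add up quickly at scale, which is exactly why we want this number attached to every span.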
<h2>OpenTelemetry to the rescue</h2>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">OpenTelemetry</a> is truly a fantastic piece of work. It has had so much adoption and work committed to it over the years, and it seems to really be getting to the point where we can call it the Linux of Observability. We can use it to record logs, metrics, and traces and get those in a vendor neutral way into our favorite observability tool — in this case, Elastic Observability.</p>
<p>With the latest and greatest OTel libraries in Python, we can auto-instrument external calls, which will help us understand how OpenAI calls are performing. Let's take a sneak peek at our sample Python application, which implements Flask and the ChatGPT API and also has OpenTelemetry. If you want to try this yourself, take a look at the GitHub link at the end of this blog and follow these steps.</p>
<h3>Set up an Elastic Cloud account (if you don’t already have one)</h3>
<ol>
<li>Sign up for a two-week free trial at <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">https://www.elastic.co/cloud/elasticsearch-service/signup</a>.</li>
<li>Create a deployment.</li>
</ol>
<p>Once you are logged in, click <strong>Add integrations</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-cloud-deployment-add-integrations.png" alt="elastic cloud deployment add integrations" /></p>
<p>Click on <strong>APM Integration</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-apm-integration.png" alt="elastic apm integration" /></p>
<p>Then scroll down to get the details you need for this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-opentelemetry-download.png" alt="elastic opentelemetry download" /></p>
<p>Be sure to set the following environment variables, replacing the values with the details you got from Elastic above and your OpenAI API key from <a href="https://platform.openai.com/account/api-keys">here</a>, then run these export commands on the command line.</p>
<pre><code class="language-bash">export OPEN_AI_KEY=sk-abcdefgh5ijk2l173mnop3qrstuvwxyzab2cde47fP2g9jij
export OTEL_EXPORTER_OTLP_AUTH_HEADER=abc9ldeofghij3klmn
export OTEL_EXPORTER_OTLP_ENDPOINT=https://123456abcdef.apm.us-west2.gcp.elastic-cloud.com:443
</code></pre>
<p>And install the following Python libraries:</p>
<pre><code class="language-bash">pip3 install opentelemetry-api
pip3 install opentelemetry-sdk
pip3 install opentelemetry-exporter-otlp
pip3 install opentelemetry-instrumentation
pip3 install opentelemetry-instrumentation-requests
pip3 install openai
pip3 install flask
</code></pre>
<p>Here is a look at the code we are using for the example application. In the real world, this would be your own code. All this does is call OpenAI APIs with the following message: “Why is Elastic an amazing observability tool?”</p>
<pre><code class="language-python">import openai
from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# OpenTelemetry setup code here, feel free to replace the “your-service-name” attribute here.
resource = Resource(attributes={
    SERVICE_NAME: &quot;your-service-name&quot;
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers=&quot;Authorization=Bearer%20&quot;+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()



# Initialize Flask app and instrument it

app = Flask(__name__)
# Set OpenAI API key
openai.api_key = os.getenv('OPEN_AI_KEY')


@app.route(&quot;/completion&quot;)
@tracer.start_as_current_span(&quot;do_work&quot;)
def completion():
    response = openai.Completion.create(
        model=&quot;text-davinci-003&quot;,
        prompt=&quot;Why is Elastic an amazing observability tool?&quot;,
        max_tokens=20,
        temperature=0
    )
    return response.choices[0].text.strip()

if __name__ == &quot;__main__&quot;:
    app.run()
</code></pre>
<p>This code should be fairly familiar to anyone who has implemented OpenTelemetry with Python; there is no specific magic here. The magic happens inside the “monitor” code, which you can use freely to instrument your own OpenAI applications.</p>
<h2>Monkeying around</h2>
<p>Inside the monitor.py code, you will see we do something called “Monkey Patching.” Monkey patching is a technique in Python where you dynamically modify the behavior of a class or module at runtime by modifying its attributes or methods. Monkey patching allows you to change the functionality of a class or module without having to modify its source code. It can be useful in situations where you need to modify the behavior of an existing class or module that you don't have control over or cannot modify directly.</p>
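<p>Before looking at the OpenAI-specific patch, here is a tiny self-contained sketch of the technique itself; the class and function names are made up purely for illustration:</p>

```python
from functools import wraps

# A made-up class standing in for library code we can't edit
class Greeter:
    def greet(self, name):
        return f"hello {name}"

calls = {"count": 0}

def counted(func):
    # Wrap an existing callable to observe each invocation and its result
    @wraps(func)
    def wrapper(*args, **kwargs):
        calls["count"] += 1
        return func(*args, **kwargs)
    return wrapper

# Monkey-patch: swap the method at runtime without touching the class body
Greeter.greet = counted(Greeter.greet)

result = Greeter().greet("otel")
```

The caller still invokes `greet` as before; the wrapper sees every call and its return value, which is precisely the hook we need to steal OpenAI's usage metrics.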
<p>What we want to do here is modify the behavior of the “Completion” call so we can steal the response metrics and add them to our OpenTelemetry spans. You can see how we do that below:</p>
<pre><code class="language-python">from functools import wraps
import json

from opentelemetry import trace

# Running total of completion calls, shared across requests
counters = {'completion_count': 0}

def count_completion_requests_and_tokens(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)
        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)
        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, counters['completion_count'])
            span.set_attribute(&quot;token_count&quot;, token_count)
            span.set_attribute(&quot;prompt_tokens&quot;, prompt_tokens)
            span.set_attribute(&quot;completion_tokens&quot;, completion_tokens)
            span.set_attribute(&quot;model&quot;, response.model)
            span.set_attribute(&quot;cost&quot;, cost)
            span.set_attribute(&quot;response&quot;, strResponse)
        return response
    return wrapper
# Monkey-patch the openai.Completion.create function
openai.Completion.create = count_completion_requests_and_tokens(openai.Completion.create)
</code></pre>
<p>By adding all this data to our Span, we can actually send it to our OpenTelemetry OTLP endpoint (in this case it will be Elastic). The benefit of doing this is that you can easily use the data for search or to build dashboards and visualizations. In the final step, we also want to calculate the cost. We do this by implementing the following function, which will calculate the cost of a single request to the OpenAI APIs.</p>
<pre><code class="language-python">def calculate_cost(response):
    if response.model in ['gpt-4', 'gpt-4-0314']:
        cost = (response.usage.prompt_tokens * 0.03 + response.usage.completion_tokens * 0.06) / 1000
    elif response.model in ['gpt-4-32k', 'gpt-4-32k-0314']:
        cost = (response.usage.prompt_tokens * 0.06 + response.usage.completion_tokens * 0.12) / 1000
    elif 'gpt-3.5-turbo' in response.model:
        cost = response.usage.total_tokens * 0.002 / 1000
    elif 'davinci' in response.model:
        cost = response.usage.total_tokens * 0.02 / 1000
    elif 'curie' in response.model:
        cost = response.usage.total_tokens * 0.002 / 1000
    elif 'babbage' in response.model:
        cost = response.usage.total_tokens * 0.0005 / 1000
    elif 'ada' in response.model:
        cost = response.usage.total_tokens * 0.0004 / 1000
    else:
        cost = 0
    return cost
</code></pre>
<h2>Elastic to the rescue</h2>
<p>Once we are capturing all this data, it’s time to have some fun with it in Elastic. In Discover, we can see all the data points we sent over using the OpenTelemetry library:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-discover-apm.png" alt="elastic discover apm" /></p>
<p>With these labels in place, it is very easy to build a dashboard. Take a look at this one I built earlier (<a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/chatGPTDashboard.ndjson">which is also checked into my GitHub Repository</a>):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-labels-dashboard.png" alt="elastic labels dashboard" /></p>
<p>We can also see Transactions, Latency of the OpenAI service, and all the spans related to our ChatGPT service calls.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-observability-service-name.png" alt="observability service name" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-your-service-name.png" alt="elastic your service name" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-api-openai.png" alt="elastic api openai" /></p>
<p>In the transaction view, we can also see how long specific OpenAI calls have taken:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/blog-elastic-latency-distribution.png" alt="elastic latency distribution" /></p>
<p>Some requests to OpenAI here have taken over 3 seconds. ChatGPT can be very slow, so it’s important for us to understand how slow this is and if users are becoming frustrated.</p>
<h2>Summary</h2>
<p>We looked at monitoring ChatGPT with OpenTelemetry and Elastic. ChatGPT is a worldwide phenomenon whose adoption will no doubt keep growing. Because it can be slow to return responses, it is critical to understand the performance of any code that uses this service.</p>
<p>There is also the issue of cost, since it’s incredibly important to understand if this service is eating into your margins and if what you are asking for is profitable for your business. With the current economic environment, we have to keep an eye on profitability.</p>
<p>Take a look at the code for this solution <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel">here</a>. And please feel free to use the “monitor” library to instrument your own OpenAI code.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-openai-api-gpt-models-opentelemetry/opentelemetry-graphic-ad-2-1920x1080.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor your Python data pipelines with OTEL]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel</link>
            <guid isPermaLink="false">monitor-your-python-data-pipelines-with-otel</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure OTEL for your data pipelines, detect any anomalies, analyze performance, and set up corresponding alerts with Elastic.]]></description>
<content:encoded><![CDATA[<p>This article delves into how to implement observability practices, particularly using <a href="https://opentelemetry.io/">OpenTelemetry (OTEL)</a> in Python, to enhance the monitoring and quality control of data pipelines with Elastic. While the examples in this article focus on ETL (Extract, Transform, Load) processes, where accurate and reliable data pipelines are crucial for Business Intelligence (BI), the strategies and tools discussed apply equally to Python processes used for Machine Learning (ML) models or other data processing tasks.</p>
<h2>Introduction</h2>
<p>Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.</p>
<p>In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into <a href="https://cloud.google.com/bigquery">Google BigQuery (BQ)</a>. This processed data then feeds into <a href="https://www.getdbt.com">DBT (Data Build Tool)</a> models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic see <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.</p>
<p>The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used, as long as there is a corresponding agent that supports OTEL instrumentation.</p>
<h2>Motivation</h2>
<p>Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:</p>
<ol>
<li>Data Quality Control:</li>
</ol>
<ul>
<li>Detecting anomalies in the data, such as unexpected drops in record counts.</li>
<li>Verifying that data transformations are applied correctly and consistently.</li>
<li>Ensuring the integrity and accuracy of the data loaded into the data warehouse.</li>
</ul>
<ol start="2">
<li>Performance Monitoring:</li>
</ol>
<ul>
<li>Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.</li>
<li>Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.</li>
</ul>
<ol start="3">
<li>Real-time Alerting:</li>
</ol>
<ul>
<li>Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.</li>
<li>Identifying the root cause of such incidents.</li>
<li>Proactively addressing incidents to minimize downtime and impact on business operations.</li>
</ul>
<p>Issues such as failed ETL jobs can even point to larger infrastructure problems or quality issues in the source data.</p>
<h2>Steps for Instrumentation</h2>
<p>Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.</p>
<h3>Step 1: Import Required Libraries</h3>
<p>We first need to install the following libraries.</p>
<pre><code class="language-sh">pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]
</code></pre>
<p>You can also add them to your project's <code>requirements.txt</code> file and install them with <code>pip install -r requirements.txt</code>.</p>
<h4>Explanation of Dependencies</h4>
<ol>
<li>
<p><strong>elastic-opentelemetry</strong>: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:</p>
<ul>
<li>
<p><strong>opentelemetry-distro</strong>: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.</p>
</li>
<li>
<p><strong>opentelemetry-exporter-otlp</strong>: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.</p>
</li>
<li>
<p><strong>opentelemetry-instrumentation-system-metrics</strong>: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.</p>
</li>
</ul>
</li>
<li>
<p><strong>google-cloud-bigquery[opentelemetry]</strong>: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.</p>
</li>
</ol>
<h3>Step 2: Export OTEL Variables</h3>
<p>Set the necessary OpenTelemetry (OTEL) variables by getting the OTEL configuration from Elastic's APM interface.</p>
<p>Go to APM -&gt; Services -&gt; Add data (top left corner).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-1.png" alt="1 - Get OTEL variables step 1" /></p>
<p>In this section you will find the steps to configure various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-2.png" alt="2 - Get OTEL variables step 2" /></p>
<p><strong>Find OTLP Endpoint</strong>:</p>
<ul>
<li>Look for the section related to OpenTelemetry or OTLP configuration.</li>
<li>The <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like <code>https://&lt;your-apm-server&gt;/otlp</code>.</li>
</ul>
<p><strong>Obtain OTLP Headers</strong>:</p>
<ul>
<li>In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.</li>
<li>Copy the necessary headers provided by the interface. They might look like <code>Authorization: Bearer &lt;your-token&gt;</code>.</li>
</ul>
<p>Note: you need to replace the whitespace between <code>Bearer</code> and your token with <code>%20</code> in the <code>OTEL_EXPORTER_OTLP_HEADERS</code> variable when using Python.</p>
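<p>If you prefer not to do the replacement by hand, the standard library can percent-encode the header value for you; the token below is a placeholder, not a real credential:</p>

```python
from urllib.parse import quote

token = "your-token"  # placeholder for the real secret token
# quote() percent-encodes the space between "Bearer" and the token as %20
header = "Authorization=" + quote(f"Bearer {token}")
print(header)  # Authorization=Bearer%20your-token
```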
<p>Alternatively you can use a different approach for authentication using API keys (see <a href="https://github.com/elastic/elastic-otel-python?tab=readme-ov-file#authentication">instructions</a>). If you are using our <a href="https://www.elastic.co/docs/current/serverless/general/what-is-serverless-elastic">serverless offering</a> you will need to use this approach instead.</p>
<p><strong>Set up the variables</strong>:</p>
<ul>
<li>Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command <code>source env.sh</code>.</li>
</ul>
<p>Below is a script to set these variables:</p>
<pre><code class="language-sh">#!/bin/bash
echo &quot;--- :otel: Setting OTEL variables&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server/otlp:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER=&quot;otlp,console&quot;
</code></pre>
<p>With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.</p>
<h4>Explanation of Variables</h4>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong>: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace <code>placeholder</code> with your actual OTLP endpoint.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong>: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace <code>placeholder</code> with your actual OTLP headers.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED</strong>: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOG_CORRELATION</strong>: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.</p>
</li>
<li>
<p><strong>OTEL_METRIC_EXPORT_INTERVAL</strong>: This variable specifies the metric export interval in milliseconds, in this case 5s.</p>
</li>
<li>
<p><strong>OTEL_LOGS_EXPORTER</strong>: This variable specifies the exporter to use for logs. Setting it to &quot;otlp&quot; means that logs will be exported using the OTLP protocol. Adding &quot;console&quot; exports logs to both the OTLP endpoint and the console. In our case, for better visibility on the infrastructure side, we choose to export to the console as well.</p>
</li>
<li>
<p><strong>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</strong>: This variable must be set to true when using the Elastic distribution, as it defaults to false.</p>
</li>
</ul>
<p>Note: <strong>OTEL_METRICS_EXPORTER</strong> and <strong>OTEL_TRACES_EXPORTER</strong>: These variables specify the exporters to use for metrics and traces. They are set to &quot;otlp&quot; by default, which means that metrics and traces will be exported using the OTLP protocol.</p>
<h3>Running Python ETLs</h3>
<p>We run Python ETLs with the following command:</p>
<pre><code class="language-sh">OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=x-ETL,service.version=1.0,deployment.environment=production&quot; opentelemetry-instrument python3 X_ETL.py
</code></pre>
<h4>Explanation of the Command</h4>
<ul>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong>: This variable specifies additional resource attributes, such as the <a href="https://www.elastic.co/guide/en/observability/current/apm.html">service name</a>, service version, and deployment environment, that will be included in all telemetry data. You can customize these values as needed and use a different service name for each script.</p>
</li>
<li>
<p><strong>opentelemetry-instrument</strong>: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.</p>
</li>
<li>
<p><strong>python3 X_ETL.py</strong>: This runs the specified Python script (<code>X_ETL.py</code>).</p>
</li>
</ul>
<h3>Tracing</h3>
<p>We export the traces via the default OTLP protocol.</p>
<p>Tracing is a key aspect of monitoring and understanding the performance of applications. <a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-spans.html">Spans</a> form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.</p>
<p>Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.</p>
<p>With the default instrumentation, the whole Python script would be a single span. In our case we have decided to manually add specific spans per the different phases of the Python process, to be able to measure their latency, throughput, error rate, etc individually. This is how we define spans manually:</p>
<pre><code class="language-python">from opentelemetry import trace

if __name__ == &quot;__main__&quot;:

    tracer = trace.get_tracer(&quot;main&quot;)
    with tracer.start_as_current_span(&quot;initialization&quot;) as span:
        # Init code
        …
    with tracer.start_as_current_span(&quot;search&quot;) as span:
        # Step 1 - Search code
        …
    with tracer.start_as_current_span(&quot;transform&quot;) as span:
        # Step 2 - Transform code
        …
    with tracer.start_as_current_span(&quot;load&quot;) as span:
        # Step 3 - Load code
        …
<p>You can explore traces in the APM interface as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/Traces-APM-Observability-Elastic.png" alt="3 - APM Traces view" /></p>
<h3>Metrics</h3>
<p>We export metrics via the default OTLP protocol as well, such as CPU usage and memory. No extra code needs to be added in the script itself.</p>
<p>Note: Remember to set <code>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</code> to true.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-metrics-apm-view.png" alt="4 - APM Metrics view" /></p>
<h3>Logging</h3>
<p>We export logs via the default OTLP protocol as well.</p>
<p>For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:</p>
<pre><code class="language-python">        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # &quot;slot_time_ms&quot;: job_details.slot_ms,
            &quot;job_id&quot;: job_details.job_id,
            &quot;job_type&quot;: job_details.job_type,
            &quot;state&quot;: job_details.state,
            &quot;path&quot;: job_details.path,
            &quot;job_created&quot;: job_details.created.isoformat(),
            &quot;job_ended&quot;: job_details.ended.isoformat(),
            &quot;execution_time_ms&quot;: (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            &quot;bytes_processed&quot;: job_details.output_bytes,
            &quot;rows_affected&quot;: job_details.output_rows,
            &quot;destination_table&quot;: job_details.destination.table_id,
            &quot;event&quot;: &quot;BigQuery Load Job&quot;, # Custom event type
            &quot;status&quot;: &quot;success&quot;, # Status of the step (success/error)
            &quot;category&quot;: category # ETL category tag 
        }

        logging.info(&quot;BigQuery load operation successful&quot;, extra=bq_fields)
</code></pre>
<p>This code shows how to extract BQ job stats, among them the execution time, bytes processed, rows affected, and destination table. You can also attach other metadata, as we do with a custom event type, status, and category.</p>
<p>Any logging call at or above the configured threshold (in this case INFO, set via <code>logging.getLogger().setLevel(logging.INFO)</code>) will create a log that is exported to Elastic. This means that Python scripts that already use <code>logging</code> need no changes to export logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-logs-apm-view.png" alt="5 - APM Logs view" /></p>
<p>For each of the log messages, you can go into the details view (click on the <code>…</code> when you hover over the log line and go into <code>View details</code>) to examine the metadata attached to the log message. You can also explore the logs in <a href="https://www.elastic.co/guide/en/kibana/8.14/discover.html">Discover</a>.</p>
<h4>Explanation of Logging Modification</h4>
<ul>
<li>
<p><strong>logging.info</strong>: This logs an informational message. The message &quot;BigQuery load operation successful&quot; is logged.</p>
</li>
<li>
<p><strong>extra=bq_fields</strong>: This adds additional context to the log entry using the <code>bq_fields</code> dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.</p>
</li>
</ul>
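<p>The mechanism behind <code>extra</code> can be shown with a small self-contained sketch: fields passed via <code>extra</code> become attributes on the <code>LogRecord</code>, which is the structured metadata the log exporter forwards. The capture handler and field values below are illustrative, not part of the pipeline described above.</p>

```python
import logging

class CaptureHandler(logging.Handler):
    """Collects records so we can inspect the attached extra fields."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
handler = CaptureHandler()
logger.addHandler(handler)

# Fields in `extra` are attached to the LogRecord as attributes
logger.info("BigQuery load operation successful",
            extra={"status": "success", "rows_affected": 42})

record = handler.records[0]
print(record.status, record.rows_affected)  # success 42
```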
<h2>Monitoring in Elastic's APM</h2>
<p>As shown, we can examine traces, metrics, and logs in the APM interface. To make the most of this data, we use nearly the whole suite of Elastic Observability features alongside Elastic’s ML capabilities.</p>
<h3>Rules and Alerts</h3>
<p>We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.</p>
<p>The <a href="https://www.elastic.co/guide/en/kibana/current/apm-alerts.html#apm-create-error-alert"><code>error count threshold</code> rule</a> is used to create a trigger when the number of errors in a service exceeds a defined threshold.</p>
<p>To create the rule go to Alerts and Insights -&gt; Rules -&gt; Create Rule -&gt; Error count threshold, set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), how often to run the check, and choose a connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/error-count-threshold.png" alt="6 - ETL Status Error Rule" /></p>
<p>Next, we create a rule of type <code>custom threshold</code> on a given ETL logs <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">data view</a> (create one for your index) filtering on &quot;labels.status: error&quot; to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count &gt; 0. In our case, in the last section of the rule config, we also set up Slack <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">alerts</a> every time the rule is activated. You can pick from a long list of <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">connectors</a> Elastic supports.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/etl-fail-status-rule.png" alt="7 - ETL Status Error Rule" /></p>
<p>Then we can set up alerts for failures. We add status to the logs metadata as shown in the code sample below for each of the steps in the ETLs. It then becomes available in ES via <code>labels.status</code>.</p>
<pre><code class="language-python">logging.info(
            &quot;Elasticsearch search operation successful&quot;,
            extra={
                &quot;event&quot;: &quot;Elasticsearch Search&quot;,
                &quot;status&quot;: &quot;success&quot;,
                &quot;category&quot;: category,
                &quot;index&quot;: index,
            },
        )
</code></pre>
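<p>The error path is the mirror image of the success log above. The following is a hedged sketch (the function name, simulated failure, and field values are illustrative) of how a failing step could be tagged so that the <code>labels.status: error</code> filter in the rule matches it:</p>

```python
import logging

logger = logging.getLogger("etl")
logging.basicConfig(level=logging.INFO)

def search_step(category, index):
    """Runs the search step and logs a status field used for alerting."""
    try:
        # Stand-in for the real Elasticsearch call, forced to fail here
        raise RuntimeError("simulated failure")
    except Exception:
        logger.error(
            "Elasticsearch search operation failed",
            extra={
                "event": "Elasticsearch Search",
                "status": "error",   # matched by the labels.status: error rule
                "category": category,
                "index": index,
            },
        )
        return "error"
    return "success"

print(search_step("daily", "logs-*"))  # error
```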
<h3>More Rules</h3>
<p>We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -&gt; Alerts and rules -&gt; Custom threshold rule -&gt; Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency.png" alt="8 - APM Custom Threshold - Latency" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency_2.png" alt="9 - APM Custom Threshold - Config" /></p>
<p>Alternatively, for finer-grained control, you can go with Alerts and rules -&gt; Anomaly rule, set up an anomaly job, and pick a threshold severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_anomaly_rule_config.png" alt="10 - APM Anomaly Rule - Config" /></p>
<h3>Anomaly detection job</h3>
<p>In this example, we set up an <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">anomaly detection job</a> on the number of documents before the transform, using a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">single metric job</a> to detect any anomalies in the incoming data source.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/single-metrics.png" alt="11 - Single Metrics" /></p>
<p>In the last step, you can set up alerting similar to what we did before by choosing a severity level threshold: every detected anomaly is assigned an anomaly score, which maps to a severity level, and you will receive an alert whenever an anomaly reaches the chosen severity.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-1.png" alt="12 - Anomaly detection Alerting - Severity" /></p>
<p>Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-connectors.png" alt="13 - Anomaly detection Alerting - Connectors" /></p>
<p>You can add the job's results to a custom dashboard via Add Panel -&gt; ML -&gt; Anomaly Swim Lane -&gt; Pick your job.</p>
<p>We also add jobs for the number of documents after the transform, and a multi-metric job on <code>execution_time_ms</code>, <code>bytes_processed</code>, and <code>rows_affected</code>, as done in <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>.</p>
<h2>Custom Dashboard</h2>
<p>Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on <code>labels.event</code> (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/custom_dashboard.png" alt="14 - Custom Dashboard" /></p>
<h2>Conclusion</h2>
<p>Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:</p>
<ul>
<li>Export existing logs (no need to add custom logging) alongside their execution context</li>
<li>Monitor the runtime behavior of our models</li>
<li>Track data quality issues</li>
<li>Identify and troubleshoot real-time incidents</li>
<li>Optimize performance bottlenecks and resource usage</li>
<li>Identify dependencies on other services and their latency</li>
<li>Optimize data transformation processes</li>
<li>Set up alerts on latency, data quality issues, transaction error rates, or CPU usage</li>
</ul>
<p>With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to more robust and accurate BI systems and reporting.</p>
<p>In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/main_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitoring Android applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitoring-android-applications-apm</link>
            <guid isPermaLink="false">monitoring-android-applications-apm</guid>
            <pubDate>Tue, 21 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has launched its APM agent for Android applications, allowing developers to track key aspects of applications to help troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interactions.]]></description>
            <content:encoded><![CDATA[<blockquote>
<p><strong>WARNING</strong>: This article shows information about the Android agent that is no longer accurate for versions <code>1.x</code>. Please refer to <a href="https://www.elastic.co/docs/reference/apm/agents/android">its documentation</a> to learn about its new APIs.</p>
</blockquote>
<p>People are handling more and more matters on their smartphones through mobile apps both privately and professionally. With thousands or even millions of users, ensuring great <a href="https://www.elastic.co/observability/application-performance-monitoring">application performance</a> and reliability is a key challenge for providers and operators of mobile apps and related backend services. Understanding the behavior of mobile apps, the occurrences and types of crashes, the <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">root causes of slow response times</a>, and the real user impact of backend issues is key to managing the performance of mobile apps and associated backend services.</p>
<p>Elastic has launched its application performance monitoring (<a href="https://www.elastic.co/observability/application-performance-monitoring">APM</a>) agent for Android applications, allowing developers to keep track of key aspects of their applications, from crashes and HTTP requests to screen rendering times and end-to-end distributed tracing. All of this helps troubleshoot issues and performance flaws with mobile applications, corresponding backend services, and their interaction. The Elastic APM Android Agent automatically instruments your application and its dependencies so that you can simply “plug-and-play” the agent into your application without having to worry about changing your codebase much.</p>
<p>The Elastic APM Android Agent has been developed from scratch on top of OpenTelemetry, an open standard and framework for observability. Developers will be able to take full advantage of its capabilities, as well as the support provided by a huge and active community. If you’re familiar with OpenTelemetry and your application is already instrumented with OpenTelemetry, then you can simply reuse it all when switching to the Elastic APM Android Agent. But no worries if that’s not the case — the agent is configured to handle common traceable scenarios automatically without having to deep dive into the specifics of the OpenTelemetry API.</p>
<p>[Related article: <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a>]</p>
<h2>How it works</h2>
<p>The Elastic APM Android Agent is a combination of an SDK plus a Gradle plugin. The SDK contains utilities that will let you initialize and configure the agent’s behavior, as well as prepare and initialize the OpenTelemetry SDK. You can use the SDK for programmatic configuration and initialization of the agent, in particular for advanced and special use cases.</p>
<p>In most cases, a programmatic configuration and initialization won’t be necessary. Instead, you can use the provided Gradle plugin to configure the agent and automatically instrument your app. The Gradle plugin uses Byte Buddy and the official Android Gradle plugin API under the hood to automatically inject instrumentation code into your app through compile-time transformation of your application’s and its dependencies’ classes.</p>
<p>Compiling your app with the Elastic Android APM Agent Gradle Plugin configured and enabled will make your Android app report tracing data, metrics, and different events and logs at runtime.</p>
<h2>Using the Elastic APM Agent in an Android app</h2>
<p>Using a <a href="https://github.com/elastic/sample-app-android-apm">simple demo application</a>, we’ll walk through the steps in the “<a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">Set up the Agent</a>” guide to set up the Elastic Android APM Agent.</p>
<h3>Prerequisites</h3>
<p>For this example, you will need the following:</p>
<ul>
<li>An Elastic Stack with APM enabled (We recommend using Elastic’s Cloud offering. <a href="https://www.elastic.co/cloud/elasticsearch-service/signup?baymax=docs-body&amp;elektra=docs">Try it for free</a>.)</li>
<li>Java 11+</li>
<li><a href="https://developer.android.com/studio?gclid=Cj0KCQiAic6eBhCoARIsANlox87QsDnyjpKObQSivZz6DHMLTiL76CmqZGXTEqf4L7h3jQO7ljm8B14aAo4xEALw_wcB&amp;gclsrc=aw.ds">Android Studio</a></li>
<li><a href="https://developer.android.com/studio/run/emulator">Android Emulator, AVD device</a></li>
</ul>
<p>You’ll also need a way to push the app’s <a href="https://opentelemetry.io/docs/concepts/signals/">signals</a> into Elastic. For that, you will need Elastic APM’s <a href="https://www.elastic.co/guide/en/apm/guide/current/secret-token.html#create-secret-token">secret token</a>, which you’ll configure in the sample app later.</p>
<h3>Test project for our example</h3>
<p>To showcase an end-to-end scenario including distributed tracing, in this example, we’ll instrument a <a href="https://github.com/elastic/sample-app-android-apm">simple weather application</a> that comprises two Android UI fragments and a simple local backend service based on Spring Boot.</p>
<p>The first fragment will have a dropdown list with some city names and also a button that takes you to the second one, where you’ll see the selected city’s current temperature. If you pick a non-European city on the first screen, you’ll get an error from the (local) backend when you head to the second screen. This is to demonstrate how network and backend errors are captured and correlated in Elastic APM.</p>
<h3>Applying the Elastic APM Agent plugin</h3>
<p>In the following, we will explain <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">all the steps required to set up the Elastic APM Android Agent</a> from scratch for an Android application. In case you want to skip these instructions and see the agent in action right away, use the main branch of that repo and apply only Step (3.b) before continuing with the next Section (“Setting up the local backend service”).</p>
<ol>
<li>Clone the <a href="https://github.com/elastic/sample-app-android-apm">sample app</a> repo and open it in Android Studio.</li>
<li>Switch to the uninstrumented repo branch to start from a blank, uninstrumented Android application. You can run this command to switch to the uninstrumented branch:</li>
</ol>
<pre><code class="language-bash">git checkout uninstrumented
</code></pre>
<ol start="3">
<li>Follow the Elastic APM Android Agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/setup.html">setup guide</a>:</li>
</ol>
<p>Add the <code>co.elastic.apm.android</code> plugin to the app/build.gradle file (please make sure to use the latest available version of the plugin, which you can find <a href="https://plugins.gradle.org/plugin/co.elastic.apm.android">here</a>).</p>
<p>Configure the agent’s connection to the Elastic APM backend by providing the <code>serverUrl</code> and <code>secretToken</code> in the <code>elasticApm</code> section of the app/build.gradle file.</p>
<pre><code class="language-java">// Android app's build.gradle file
plugins {
    //...
    id &quot;co.elastic.apm.android&quot; version &quot;[latest_version]&quot;
}

//...

elasticApm {
    // Minimal configuration
    serverUrl = &quot;https://your.elastic.apm.endpoint&quot;

    // Optional
    serviceName = &quot;weather-sample-app&quot;
    serviceVersion = &quot;0.0.1&quot;
    secretToken = &quot;your Elastic APM secret token&quot;
}
</code></pre>
<ol start="4">
<li>The only actual code change required is a one-liner to initialize the Elastic APM Android Agent in the Application.onCreate method. The application class for this sample app is located at app/src/main/java/co/elastic/apm/android/sample/MyApp.kt.</li>
</ol>
<pre><code class="language-kotlin">package co.elastic.apm.android.sample

import android.app.Application
import co.elastic.apm.android.sdk.ElasticApmAgent

class MyApp : Application() {

    override fun onCreate() {
        super.onCreate()
        ElasticApmAgent.initialize(this)
    }
}
</code></pre>
<p>Bear in mind that for this example, we’re not changing the agent’s default configuration — if you want more information about how to do so, take a look at the agent’s <a href="https://www.elastic.co/guide/en/apm/agent/android/current/configuration.html#_runtime_configuration">runtime configuration guide</a>.</p>
<p>Before launching our Android Weather App, we need to configure and start the local weather-backend service as described in the next section.</p>
<h3>Setting up the local backend service</h3>
<p>One of the key features the agent provides is distributed tracing, which allows you to see the full end-to-end story of an HTTP transaction, starting from our mobile app and traversing instrumented backend services used by the app. Elastic APM will show you the full picture as one distributed trace, which comes in very handy for troubleshooting issues, especially the ones related to high latency and backend errors.</p>
<p>As part of our sample app, we’re going to launch a simple local backend service that will handle our app’s HTTP requests. The backend service is instrumented with the <a href="https://www.elastic.co/guide/en/apm/agent/java/current/index.html">Elastic APM Java agent</a> to collect and send its own APM data over to Elastic APM, allowing it to correlate the mobile interactions with the processing of the backend requests.</p>
<p>In order to configure the local server, we need to set our Elastic APM endpoint and secret token (the same used for our Android app in the previous step) into the backend/src/main/resources/elasticapm.properties file:</p>
<pre><code class="language-bash">service_name=weather-backend
application_packages=co.elastic.apm.android.sample
server_url=YOUR_ELASTIC_APM_URL
secret_token=YOUR_ELASTIC_APM_SECRET_TOKEN
</code></pre>
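<p>As a general reference (not specific to this sample, which is already wired to pick up the properties file when you run it via Gradle), the Elastic APM Java agent is typically attached at JVM startup with the <code>-javaagent</code> flag, and the same settings can be passed as system properties. The paths and values below are placeholders:</p>
<pre><code class="language-bash"># Attach the Elastic APM Java agent at JVM startup (paths and values are placeholders)
java -javaagent:/path/to/elastic-apm-agent.jar \
     -Delastic.apm.service_name=weather-backend \
     -Delastic.apm.server_url=YOUR_ELASTIC_APM_URL \
     -Delastic.apm.secret_token=YOUR_ELASTIC_APM_SECRET_TOKEN \
     -jar your-backend.jar
</code></pre>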
<h3>Launching the demo</h3>
<p>Our sample app will get automatic instrumentation for the agent’s currently <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>, which means that we’ll get to see screen rendering spans as well as OkHttp requests out of the box. For frameworks not currently supported, you could apply manual instrumentation to enrich your APM data (see “Manual Instrumentation” below).</p>
<p>We are ready to launch the demo. (It is meant to be run in a local environment, using an Android emulator.) To do so, we need to:</p>
<ol>
<li>Launch the backend service by running <code>./gradlew bootRun</code> (or <code>gradlew.bat bootRun</code> on Windows) in a terminal in the root directory of the sample project. Alternatively, you can start the backend service from Android Studio.</li>
<li>Launch the weather sample app in an Android emulator (from Android Studio).</li>
</ol>
<p>Once everything is running, we need to navigate around in the app to generate some load that we can observe in Elastic APM. So, select a city, click <strong>Next</strong>, and repeat multiple times. Also make sure to select <strong>New York</strong> at least once; you will see that the weather forecast won’t work for that city. Below, we will use Elastic APM to find out what’s going wrong when New York is selected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-android-apm-city-selection.png" alt="apm android city selection" /></p>
<h2>First glance at the APM results</h2>
<p>Let’s open Kibana and navigate to the Observability solution.</p>
<p>Under the Services navigation item, you should see a list of two services: our Android app <strong>weather-sample-app</strong> and the corresponding backend service <strong>weather-backend</strong>. Click on the <strong>Service map</strong> tab to see a visualization of the dependencies between those services and any external services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-services.png" alt="apm android services" /></p>
<p>Click on <strong>weather-sample-app</strong> to dive into the dashboard for the Android app. The service view for mobile applications is in technical preview as of the publishing of this blog post, but you can already see insightful information about the app on that screen: the number of active sessions in the selected time frame, the number of HTTP requests emitted by the weather-sample-app, the geographical distribution of requests, and breakdowns by device model, OS version, network connection type, and app version. (Information on crashes and app load times is under development.)</p>
<p>For the purpose of demonstration, we kept this demo simple, so the data is less diversified and rather limited. However, this kind of data is particularly useful when you are monitoring a mobile app with higher usage numbers and more diversity in device models, OS versions, etc. Troubleshooting problems and performance issues becomes much easier when you can use these properties to filter and group your APM data. You can use the quick filters at the top to do so and see how the metrics adapt depending on your selection.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-sample-app.png" alt="apm android weather sample app" /></p>
<p>Now, let’s see how individual user interactions are processed, including downstream calls into the backend service. Under the Transactions tab (at the top), we see the different end-to-end transaction groups, including the two transactions for the FirstFragment and the SecondFragment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-latency-distribution.png" alt="apm android latency distribution" /></p>
<p>Let’s deep dive into the SecondFragment - View appearing transaction, to see the metrics (e.g., latency, throughput) for this transaction group and also the invocation waterfall view for the individual user interactions. As we can see in the following screenshot, after view creation, the fragment performs an HTTP GET request to 10.0.2.2, which takes ~130 milliseconds. In the same waterfall, we see that the HTTP call is processed by the weather-backend service, which itself conducts an HTTP call to api.open-meteo.com.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-trace-samples.png" alt="apm android trace samples" /></p>
<p>Now, when looking at the waterfall view for a request where New York was selected as the city, we see an error happening on the backend service that explains why the forecast didn’t work for New York. By clicking on the red <strong>View related error</strong> badge, you will get details on the error and the actual root cause of the problem.</p>
<p>The exception message on the weather-backend states that “This service can only retrieve geo locations for European cities!” That’s the problem with selecting New York as the city.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/blog-elastic-apm-android-weather-backend.png" alt="apm android weather backend" /></p>
<h2>Manual instrumentation</h2>
<p>As previously mentioned, the Elastic APM Android Agent performs a good deal of automatic instrumentation on your behalf for the <a href="https://www.elastic.co/guide/en/apm/agent/android/current/supported-technologies.html">supported frameworks</a>; however, in some cases you might want extra instrumentation for your app’s specific use cases. For those cases, the OpenTelemetry API, which the Elastic APM Android Agent is built on, has you covered. The OpenTelemetry Java SDK contains tools to create custom spans, metrics, and logs. Since it is the base of the Elastic APM Android Agent, it is available for you to use without adding any extra dependencies to your project, and without any configuration to connect your custom signals to your Elastic environment, as the agent does that for you.</p>
<p>The way to start would be by getting OpenTelemetry’s instance like so:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
</code></pre>
<p>And then you can follow the instructions from the <a href="https://opentelemetry.io/docs/instrumentation/java/manual/#acquiring-a-tracer">OpenTelemetry Java documentation</a> in order to create your custom signals. See the following example for the creation of a custom span:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
Tracer tracer = openTelemetry.getTracer(&quot;instrumentation-library-name&quot;, &quot;1.0.0&quot;);
Span span = tracer.spanBuilder(&quot;my span&quot;).startSpan();

// Make the span the current span
try (Scope ss = span.makeCurrent()) {
  // In this scope, the span is the current/active span
} finally {
    span.end();
}
</code></pre>
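<p>Custom metrics work the same way through the OpenTelemetry Metrics API. As a sketch (the meter and counter names below are illustrative, not part of the sample app), a counter can be created and incremented like this:</p>
<pre><code class="language-java">OpenTelemetry openTelemetry = GlobalOpenTelemetry.get();
Meter meter = openTelemetry.getMeter("instrumentation-library-name");

// A monotonic counter, e.g. for counting button taps in the app
LongCounter taps = meter.counterBuilder("ui.button.taps")
    .setDescription("Number of button taps")
    .setUnit("1")
    .build();

taps.add(1);
</code></pre>
<p>As with custom spans, the agent takes care of shipping these signals to your Elastic environment.</p>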
<h2>Conclusion</h2>
<p>In this blog post, we demonstrated how you can use the Elastic APM Android Agent to achieve end-to-end observability into your Android-based mobile applications. Setting up the agent is a matter of a few minutes and the provided insights allow you to analyze your app’s performance and its dependencies on backend services. With the Elastic APM Android Agent in place, you can leverage Elastic’s rich APM feature as well as the various possibilities to customize your analysis workflows through custom instrumentation and custom dashboards.</p>
<p>Are you curious? Then try it yourself. Sign up for a <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">free trial on the Elastic Cloud</a>, enrich your Android app with the Elastic APM Android agent as described in this blog, and explore the data in <a href="https://www.elastic.co/observability">Elastic’s Observability solution</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitoring-android-applications-apm/illustration-indusrty-technology-social-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitoring Proxmox VE deployments with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitoring-proxmox-ve-with-elastic</link>
            <guid isPermaLink="false">monitoring-proxmox-ve-with-elastic</guid>
            <pubDate>Wed, 23 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Monitoring Proxmox VE deployments, VMs, and Linux Containers with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>In this blog post, you will learn how to leverage Elastic Observability to monitor Proxmox VE and the software running on top of it, both in the form of Linux Containers (LXCs) and Virtual Machines (VMs).</p>
<h2>Why use Elastic Observability with Proxmox?</h2>
<p>Here at Elastic, we are passionate about efficiently managing and monitoring infrastructure and applications. Many of us have fun playing with home labs, oftentimes running Proxmox VE, a powerful open-source virtualization platform used to run virtual machines and Linux Containers (LXCs) with ease. While Proxmox provides robust tools for managing virtualized resources, gaining deep insights into the performance and health of your LXCs, VMs, and hosts requires a comprehensive monitoring solution. This blog post will guide you through leveraging the power of Elastic Observability, in conjunction with Elastic Agent, to effectively monitor your Proxmox VE deployment, ensuring optimal performance and proactive issue resolution thanks to Kibana Alerts.</p>
<h2>The homelab setup</h2>
<p>Our homelab setup centers around an Intel N100 mini PC, serving as the host for Proxmox VE. This setup is simple and minimal, yet effective for showcasing a few interesting capabilities. On top of this mini PC, we run several Linux Containers (LXCs) for various services, along with a dedicated virtual machine for Home Assistant.</p>
<h2>Elastic Agent installation and configuration</h2>
<p>Before beginning, it is worth noting that there are numerous ways to install and configure the Elastic Agent. For the sake of simplicity, we will showcase a setup in which only one instance of the Elastic Agent is running on the host machine. The Elastic Agent reports to an Elastic Cloud Observability deployment and is managed via Fleet, which makes it tremendously easy to upgrade and re-configure it whenever needed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/fleet-prox.jpg" alt="The Elastic Integrations enabled for our Proxmox host" /></p>
<h2>Diving into the host</h2>
<p>Kibana offers various panes that make it easy to assess a system's health at a glance.</p>
<p>As a first step, let's take a look at the <code>Infrastructure &gt; Hosts</code> page in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/kibana-infrastructure-hosts-proxmox.jpg" alt="The Infrastructure &gt; Hosts Kibana page for our Proxmox host" /></p>
<p>Here we can see various information about our Proxmox VE host (i.e. the mini PC). The top processes running on it are presented, including processes running in LXCs such as <code>pia-daemon</code>. We can also see a <code>kvm</code> process, specifically running a Home Assistant virtual machine, and a Proxmox <code>pve-firewall</code> process.</p>
<p>Let's now take a look at <code>Universal Profiling &gt; Flamegraph</code>. This graph shows how much CPU time is consumed by different stack traces from processes running on the host system. You can drill down into specific processes using the search bar at the top. For instance, you can filter by <code>kvm</code> to only see information regarding this specific process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/universal-profiling-flamegraph-kvm.jpg" alt="The Universal Profiling &gt; Flamegraph Kibana page for our Proxmox host" /></p>
<h2>The Observability AI Assistant</h2>
<p>All the Kibana panes we visited so far have proved to be highly interesting, but they struggle to answer urgent questions such as:</p>
<ul>
<li>did anything happen in our mini PC recently?</li>
<li>was there any significant change in functionality?</li>
<li>is there any precious information hidden among the thousands of data points collected?</li>
</ul>
<p>The Elastic Observability AI Assistant helps us by answering these questions in natural language. By default, on Elastic Cloud, it uses the Elastic-managed LLM connector, which means users do not need to configure anything to get started with it. It just works!</p>
<p>Let's go to the <code>Observability &gt; AI Assistant</code> pane in Kibana and let's try to ask a generic prompt such as: &quot;please give me an overview of the health of my <code>prox</code> host&quot;.</p>
<p>Let's then wait a minute so that it can dig into the data... et voilà, a wealth of relevant information arrives in the form of graphs and natural-language explanations. The Observability AI Assistant understood our question, went through all the data for our Proxmox host, ran analytics on it, and reported back in a matter of seconds!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/observability-ai-assistant-1.jpg" alt="The Observability AI Assistant's first reply" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/observability-ai-assistant-2.jpg" alt="The Observability AI Assistant's second reply" /></p>
<h2>Alerting upon disruption with Kibana Alerts</h2>
<p>As a final step, let's try to define a Kibana Alert to help us understand whether our host is overloaded. Let's head to <code>Observability &gt; Alerts &gt; Rules</code> and create a new rule. We will create a Custom Threshold rule that will fire if CPU usage for the host is higher than 80% on average for the last 15 minutes. Kibana will send us an email in case the rule fires. The rule is also configured to fire if no data appears for the last 15 minutes, which is extremely helpful as it would imply the presence of some issues to be debugged: broken network or no electricity in the house, a faulty Agent deployment, or even a hardware issue with the mini PC.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/rule-cpu-over-80.jpg" alt="The Kibana Alerting Rule for CPU being over 80 percent" /></p>
<h2>Conclusion</h2>
<p>In this blog post we showcased how to effectively use the Elastic Stack to monitor Proxmox VE deployments. If you would like to try out such a setup first-hand, you are more than welcome to enjoy <a href="https://www.elastic.co/cloud/cloud-trial-overview">Elastic Cloud's 14-day free trial</a>.</p>
<p>In future blog posts, we will investigate how to dig deeper into LXCs and VMs to gather even more information from our home lab and create more tailored alerts. Stay tuned!</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitoring-proxmox-ve-with-elastic/article-image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Native OpenTelemetry support in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/native-opentelemetry-support-in-elastic-observability</link>
            <guid isPermaLink="false">native-opentelemetry-support-in-elastic-observability</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic offers native support for OpenTelemetry by allowing for direct ingest of OpenTelemetry traces, metrics, and logs without conversion, and applying any Elastic feature against OTel data without degradation in capabilities.]]></description>
            <content:encoded><![CDATA[<p>NOTE: Since writing this blog, new OTel data ingest configurations are now available in Elastic. See recent <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a></p>
<p>OpenTelemetry is more than just becoming the open ingestion standard for observability. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data providing a de-facto standard for observability.</p>
<p>Elastic<sup>®</sup> is strategically standardizing on OpenTelemetry as the main data collection architecture for observability and security. Additionally, Elastic is committed to helping OpenTelemetry become the best de facto data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OpenTelemetry beyond the recent contribution of Elastic Common Schema (ECS) to OpenTelemetry (OTel).</p>
<p>Elastic has supported OpenTelemetry natively since Elastic 7.14, directly ingesting OpenTelemetry protocol (OTLP) traces, metrics, and logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-1-otel-config-options.png" alt="otel configuration options" /></p>
<p>In this blog, we’ll review the current OpenTelemetry support provided by Elastic, which includes the following:</p>
<ul>
<li><a href="#ingesting-opentelemetry-into-elastic"><strong>Easy ingest of distributed tracing and metrics</strong></a> for applications configured with OpenTelemetry agents for Python, NodeJS, Java, Go, and .NET</li>
<li><a href="#opentelemetry-logs-in-elastic"><strong>OpenTelemetry logs instrumentation and ingest</strong></a> using various configurations</li>
<li><a href="#opentelemetry-is-elastics-preferred-schema"><strong>Open semantic conventions</strong></a> for logs and more through ECS, which is not part of OpenTelemetry</li>
<li><a href="#elastic-observability-apm-and-machine-learning-capabilities"><strong>Machine learning based AIOps capabilities</strong></a>, such as latency correlations, failure correlations, anomaly detection, log spike analysis, predictive pattern analysis, Elastic AI Assistant support, and more, all apply to native OTLP telemetry.</li>
<li><a href="#elastic-allows-you-to-migrate-to-otel-on-your-schedule"><strong>Migrate applications to OpenTelemetry at your own speed</strong></a>. Elastic’s APM capabilities all work seamlessly even with a mix of services using OpenTelemetry and/or Elastic APM agents. You can even combine OpenTelemetry instrumentation with Elastic Agent.</li>
<li><a href="#integrated-kubernetes-and-opentelemetry-views-in-elastic"><strong>Integrated views and analysis with Kubernetes clusters</strong></a>, which most OpenTelemetry applications are running on. Elastic can highlight specific pods and containers related to each service when analyzing issues for applications based on OpenTelemetry.</li>
</ul>
<h2>Ingesting OpenTelemetry into Elastic</h2>
<p>If you’re interested in seeing how simple it is to ingest OpenTelemetry traces and metrics into Elastic, follow the steps outlined in this blog.</p>
<p>Let’s outline what Elastic provides for ingesting OpenTelemetry data. Here are all your options:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-2-flowchart.png" alt="flowchart" /></p>
<h3>Using the OpenTelemetry Collector</h3>
<p>When using the OpenTelemetry Collector, which is the most common configuration option, you simply have to add two key variables.</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file specified in the elastic/opentelemetry-demo repo configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<p>OTEL_EXPORTER_OTLP_ENDPOINT is Elastic’s APM Server<br />
OTEL_EXPORTER_OTLP_HEADERS Elastic Authorization</p>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic Cloud.</p>
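<p>In collector terms, these two values typically translate into an OTLP exporter pointing at the APM Server. A minimal sketch of that exporter section (the endpoint and token are placeholders) looks like this:</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    # Maps to OTEL_EXPORTER_OTLP_ENDPOINT
    endpoint: "https://your-apm-server:8200"
    headers:
      # Maps to OTEL_EXPORTER_OTLP_HEADERS
      Authorization: "Bearer your-secret-token"
</code></pre>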
<h3>Native OpenTelemetry agents embedded in code</h3>
<p>If you are thinking of using OpenTelemetry libraries in your code, you can simply point the service to Elastic’s APM server, because it supports the native OTLP protocol. No special Elastic conversion is needed.</p>
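<p>For example, an OTLP-emitting service can usually be pointed at Elastic purely through the standard OpenTelemetry environment variables (the endpoint, token, and service name below are placeholders):</p>
<pre><code class="language-bash"># Point OTLP export at Elastic's APM Server (placeholder values)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-apm-server:8200"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer your-secret-token"
export OTEL_RESOURCE_ATTRIBUTES="service.name=my-service,deployment.environment=dev"
</code></pre>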
<p>To demonstrate this effectively and provide some education on how to use OpenTelemetry, we have two applications you can use to learn from:</p>
<ul>
<li><a href="https://github.com/elastic/opentelemetry-demo">Elastic’s version of OpenTelemetry demo</a>: As with all the other observability vendors, we have our own forked version of the OpenTelemetry demo.</li>
<li><a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix:</a> This demo application is an example to help you learn how to instrument on various languages and telemetry signals.</li>
</ul>
<p>Check out our blogs on using the Elastiflix application and instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<p>We have created YouTube videos on these topics as well:</p>
<ul>
<li><a href="https://youtu.be/wMXMRsjFg-8?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 1)</a></li>
<li><a href="https://youtu.be/PX7s6RRLGaU?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 2)</a></li>
<li><a href="https://youtu.be/hXTlV_RnELc?feature=shared">Custom Java Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/E8g9u_uOFO4?feature=shared">Elastic APM - Automatic .NET Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/7J9M2JsHwRE?feature=shared">How to Manually Instrument .NET Applications with OpenTelemetry</a></li>
</ul>
<p>Given Elastic and OpenTelemetry’s vast user base, these provide a rich source of education for anyone trying to learn the intricacies of instrumenting with OpenTelemetry.</p>
<h3>Elastic Agents supporting OpenTelemetry</h3>
<p>If you’ve already deployed Elastic APM agents, you can still use them with OpenTelemetry. <a href="https://www.elastic.co/blog/opentelemetry-instrumentation-elastic-apm-agent-features">Elastic APM agents today are able to ship OpenTelemetry</a> spans as part of a trace. This means that if any component in your application emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM agent captures.</p>
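<p>For illustration, a component emitting a span through the vendor-neutral OpenTelemetry API might look like the sketch below (the tracer name and attribute are hypothetical); with an Elastic APM agent or an OTel SDK active, this span becomes part of the captured trace:</p>
<pre><code class="language-python"># Sketch: emitting a span via the OpenTelemetry API.
# 'checkout' and 'order.id' are illustrative names only.
from opentelemetry import trace

tracer = trace.get_tracer('checkout')

with tracer.start_as_current_span('charge-card') as span:
    span.set_attribute('order.id', '12345')
    # ... business logic runs inside the span ...
</code></pre>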
<h2>OpenTelemetry logs in Elastic</h2>
<p>If you look at the OpenTelemetry documentation, you will see that many language libraries are still experimental or not yet implemented; Java’s logging support is stable, per the documentation. Depending on your service’s language and your appetite for adventure, several options exist for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>In a previous blog, we discussed <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 different configurations to properly get logging data into Elastic for Java</a>. The blog explores the current state of the art of OpenTelemetry logging and provides guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for slf4j key-value pairs (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via OTel baggage</li>
<li>Use of an Elastic Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<p>Three models, which are covered in the blog, currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ul>
<li>Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards them to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards them to Elastic via an Elastic-defined protocol</li>
</ul>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
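<p>As an illustration of model (2), a minimal OpenTelemetry Collector configuration might look like the following sketch — the log path, endpoint, and authorization header are placeholders, not a definitive setup:</p>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [ /var/log/app/*.log ]   # placeholder path to your service logs

exporters:
  otlp:
    endpoint: &quot;https://your-elastic-endpoint:443&quot;   # placeholder
    headers:
      Authorization: &quot;Bearer your-token&quot;            # placeholder credential

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
</code></pre>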
<h2>OpenTelemetry is Elastic’s preferred schema</h2>
<p>Elastic recently contributed the <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">Elastic Common Schema (ECS) to the OpenTelemetry (OTel)</a> project, enabling a unified data specification for security and observability data within the OTel Semantic Conventions framework.</p>
<p>ECS, an open source specification, was developed with support from the Elastic user community to define a common set of fields to be used when storing event data in Elasticsearch<sup>®</sup>. ECS helps reduce management and storage costs stemming from data duplication, improving operational efficiency.</p>
<p>Similarly, OTel’s Semantic Conventions (SemConv) also specify common names for various kinds of operations and data. The benefit of using OTel SemConv is in following a common naming scheme that can be standardized across a codebase, libraries, and platforms for OTel users.</p>
<p>The merging of ECS and OTel SemConv will help advance OTel’s adoption and the continued evolution and convergence of observability and security domains.</p>
<h2>Elastic Observability APM and machine learning capabilities</h2>
<p>All of Elastic Observability’s APM capabilities are available with OTel data (read more on this in our blog, <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry</a>):</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-3-services.png" alt="services" /></p>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to reduce analysis and alerting effort and help lower MTTR. Here are some of the ML-based AIOps capabilities we have:</p>
<ul>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Anomaly detection:</strong></a> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your OpenTelemetry data — learning trends, periodicity, and more.</li>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Log categorization:</strong></a> Elastic also identifies patterns in your OpenTelemetry log events, so that you can take action more quickly.</li>
<li><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log spike detector</strong></a> helps identify reasons for increases in OpenTelemetry log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log pattern analysis</strong></a> helps you find patterns in unstructured log messages and makes it easier to examine your data.</li>
</ul>
<h2>Elastic allows you to migrate to OTel on your schedule</h2>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">statuses of its major functional components</a> — metrics, traces, and logs — vary by language. Thus, applications written in Java, Python, and JavaScript are good candidates to migrate first, as their metrics, traces, and logs (for Java) are stable.</p>
<p>For languages that are not yet fully supported, you can instrument services with Elastic agents, running your <a href="https://www.elastic.co/observability">full stack observability platform</a> in mixed mode (Elastic agents alongside OpenTelemetry agents).</p>
<p>Here is a simple example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-4-services2.png" alt="services 2" /></p>
<p>The above shows a simple variation of our standard Elastic Agent application with one service flipped to OTel — the newsletter-otel service. Each of the remaining services can be converted to OTel as development resources allow.</p>
<p>Hence you can migrate to OpenTelemetry with Elastic on your own schedule, converting services as the relevant language support reaches a stable state.</p>
<h2>Integrated Kubernetes and OpenTelemetry views in Elastic</h2>
<p>Elastic monitors your Kubernetes cluster using the Elastic Agent, which you can deploy on the same Kubernetes cluster where your OpenTelemetry application is running. You can therefore use OpenTelemetry for your application while Elastic also monitors the corresponding Kubernetes cluster.</p>
<p>There are two configurations for Kubernetes:</p>
<p><strong>1. Simply deploying the Elastic Agent DaemonSet on the Kubernetes cluster.</strong> We outline this in the article <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Managing your Kubernetes cluster with Elastic Observability</a>. This configuration pushes just the Kubernetes metrics and logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-5-cloud-nodes.png" alt="elastic cloud nodes" /></p>
<p><strong>2. Deploying the Elastic Agent with not only the Kubernetes DaemonSet, but also Elastic’s APM integration, the Defend (Security) integration, and the Network Packet Capture integration</strong> to provide more comprehensive Kubernetes cluster observability. We outline this configuration in the article <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-6-flowhcart.png" alt="flowchart" /></p>
<p>Both <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> examples use the OpenTelemetry demo, and in Elastic, we tie the Kubernetes information to the application so you can see Kubernetes details from your traces in APM. This provides a more integrated troubleshooting experience.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-7-pod-deets.png" alt="pod details" /></p>
<h2>Summary</h2>
<p>In essence, Elastic's commitment goes beyond mere support for OpenTelemetry. We are dedicated to ensuring our customers not only adopt OpenTelemetry but thrive with it. Through our solutions, expertise, and resources, we aim to elevate the observability journey for every business, turning data into actionable insights that drive growth and innovation.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/ecs-otel-announcement-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Network monitoring with Elastic: Unifying network observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/network-monitoring-with-elastic-unifying-network-observability</link>
            <guid isPermaLink="false">network-monitoring-with-elastic-unifying-network-observability</guid>
            <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to unify network monitoring using Elastic observability and AI. We'll showcase how to correlate network data, identify root causes and fix issues.]]></description>
            <content:encoded><![CDATA[<h2>Introduction: The Network Monitoring Fragmentation Problem</h2>
<p>In five years working with Enterprise accounts at Elastic, I have heard the same challenge again and again:</p>
<p><strong>&quot;We have several network monitoring tools, and we would love to correlate all of them into one platform.&quot;</strong></p>
<p>For many organizations, the barrier to true correlation isn't a lack of data, but where that data lives. Frequently, we see SNMP metrics, flow data, and logs isolated in purpose-built silos or dashboards. Without a unified data store and a proper correlation engine, piecing together the full narrative — from a topology change to a performance degradation — becomes a manual, time-consuming puzzle.</p>
<p>When an incident happens, engineers become <strong>human correlation engines</strong> — manually jumping between systems, copying timestamps, cross-referencing device names, and trying to piece together what actually happened. A simple question like &quot;Did this interface failure impact application performance?&quot; requires querying multiple tools and mentally correlating the results.</p>
<p>The real cost isn't the tool licenses — it's the time lost during critical incidents.</p>
<p>This lab is my answer to a fundamental question: <strong>Can Elastic become the unified foundation that actually correlates network data?</strong></p>
<p>More importantly, it demonstrates that Elastic is fully ready for network operations — capable of ingesting diverse telemetry and using AI to correlate relationships, identify root causes, and resolve issues in seconds instead of hours.</p>
<h2>The Problem: Network Observability is Broken</h2>
<p>Let me paint a typical scenario I encounter with enterprise network teams:</p>
<p><strong>The Fragmented Reality:</strong></p>
<ul>
<li>No single source of truth</li>
<li>Manual correlation during incidents (15-30 minutes per event)</li>
<li>Fragmented teams (network vs. platform engineers)</li>
<li>Limited automation capabilities</li>
<li>No AI-powered analysis</li>
</ul>
<p><strong>When a link goes down at 2 AM:</strong></p>
<ul>
<li>Notice the alert - 2 minutes</li>
<li>Log into monitoring tool to see the metric - 3 minutes</li>
<li>Switch to traffic analyzer to check impact - 5 minutes</li>
<li>Open log management to search for related messages - 10 minutes</li>
<li>Manually correlate timestamps across systems - 8 minutes</li>
<li>Create a ticket and copy context from multiple tools - 8 minutes</li>
</ul>
<p><strong>Time to initial diagnosis: 36 minutes</strong></p>
<p>This workflow is expensive, error-prone, and doesn't scale.</p>
<h2>The Vision: Elastic as a Unified Network Observability Platform</h2>
<p>What if you could:</p>
<ul>
<li>Collect SNMP metrics, NetFlow, traps, and topology data in <strong>one platform</strong></li>
<li>Correlate network events with application performance <strong>automatically</strong></li>
<li>Generate executive dashboards without separate BI tools</li>
<li>Use <strong>AI to analyze incidents in seconds</strong>, not hours</li>
<li>Trigger alerting from network events</li>
</ul>
<p>This is what this lab aims to demonstrate.</p>
<h2>What I Built: A Production-Grade Network Simulation</h2>
<p>To demonstrate how Elastic unifies network data, I needed a realistic environment that generates real-world telemetry. Enter <strong>Containerlab</strong> — a Docker-based tool that lets us build a realistic network simulation.</p>
<h3>Lab Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/network-monitoring-with-elastic-unifying-network-observability/lab-topology.jpg" alt="Lab Topology" /></p>
<p>I simulated a Service Provider core network with:</p>
<ul>
<li><strong>7 FRR routers</strong> forming an OSPF Area 0 mesh</li>
<li><strong>2 Ubuntu hosts</strong> for additional use cases</li>
<li><strong>2 Layer 2 switches</strong> for access layer segmentation</li>
<li><strong>3 telemetry collectors</strong> feeding Elastic Cloud</li>
</ul>
<p><strong>Total containers:</strong> 14</p>
<p><strong>Deployment time:</strong> 12-15 minutes (fully automated)</p>
<p><strong>Full deployment instructions and topology details are available in the <a href="https://github.com/DeBaker1974/Containerlab-OSPF">GitHub repository README</a>.</strong></p>
<h2>The Three Telemetry Pipelines: Proving Multi-Source Correlation</h2>
<p>What makes this lab production-ready is its <strong>hybrid observability approach</strong> — proving that Elastic can unify disparate network data sources.</p>
<table>
<thead>
<tr>
<th align="left">Pipeline</th>
<th align="left">Data Type</th>
<th align="left">Collection Method</th>
<th align="left">Collector</th>
<th align="left">Use Case</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><strong>SNMP Metrics</strong></td>
<td align="left">Interface stats, system health, LLDP topology</td>
<td align="left">Active polling</td>
<td align="left">OTEL Collector</td>
<td align="left">Capacity planning, trend analysis</td>
</tr>
<tr>
<td align="left"><strong>NetFlow</strong></td>
<td align="left">Traffic flows</td>
<td align="left">Push-based export</td>
<td align="left">Elastic Agent</td>
<td align="left">Top talkers, security investigation</td>
</tr>
<tr>
<td align="left"><strong>SNMP Traps</strong></td>
<td align="left">Interface up/down events</td>
<td align="left">Event-driven</td>
<td align="left">Logstash</td>
<td align="left">Real-time incident detection</td>
</tr>
</tbody>
</table>
<p>This unified architecture proves Elastic can replace multiple specialized network monitoring tools with a single platform.</p>
<h2>The Power of Correlation: One Platform, One Query</h2>
<p>When a network incident occurs, you need to answer questions like:</p>
<ul>
<li>Which interface failed? <em>(SNMP metrics)</em></li>
<li>What traffic was affected? <em>(NetFlow)</em></li>
<li>What was the sequence of events? <em>(SNMP traps)</em></li>
<li>Which devices are downstream? <em>(LLDP topology)</em></li>
</ul>
<p><strong>The Problem:</strong> modern tools offer separate modules glued together, forcing users to navigate different spaces for different sets of data.</p>
<p><strong>The Reality:</strong> You still have to pivot. You see a spike in the Metrics module, but to see why, you have to open the Logs module and manually align the time picker. The data lives in different tables or backends, making true correlation impossible without human intervention.</p>
<p><strong>The Elastic Difference:</strong> One Store, One Language, One AI</p>
<p>Elastic makes it simple. Whether it's an SNMP counter (metric), a NetFlow record (flow), or a Syslog message (log), it is all stored in a unified datastore powered by the Elasticsearch engine. This allows users to easily search across multiple datasets in a single query.</p>
<pre><code class="language-bash">FROM logs-*
| WHERE host.name == &quot;csr23&quot; AND interface.name == &quot;eth1&quot;
</code></pre>
<p><strong>Time required: 3 seconds</strong></p>
<p>Furthermore, as you will see later, the exact location of the data becomes agnostic to the user when leveraging the AI Assistant.</p>
<h2>Data Transformation: From Cryptic OIDs to Actionable Intelligence</h2>
<p>Raw SNMP traps are notoriously difficult to interpret at a glance. In our current lab setup, the data arrives looking like this:</p>
<pre><code class="language-bash">OID: 1.3.6.1.6.3.1.1.5.3
ifIndex: 2
ifDescr: eth1
</code></pre>
<p>While traditional Network Management Platforms (NMPs) handle OID translation natively, bringing that clarity into Elastic requires a specific configuration.</p>
<p>In this initial lab, we are intentionally working with this raw data to demonstrate how AI assistants can interpret these events even without pre-existing context.</p>
<p>However, the strategy for the next phase of this project is to implement Elasticsearch Ingest Pipelines. This will allow us to map raw OIDs to human-readable names. This step is crucial for bridging the gap between Network tools and Application Observability platforms, allowing network events to be instantly correlated with application errors and infrastructure logs.</p>
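<p>As a preview, such a mapping can start with a simple <code>set</code> processor keyed on the trap OID. The sketch below is illustrative only (the pipeline name is hypothetical); the full pipeline will be covered in the next post:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/snmp-trap-enrich
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.snmp?.trap_oid == '1.3.6.1.6.3.1.1.5.3'&quot;,
        &quot;field&quot;: &quot;event.action&quot;,
        &quot;value&quot;: &quot;interface-down&quot;
      }
    }
  ]
}
</code></pre>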
<p><strong>The Target State</strong></p>
<p>Once the pipeline is implemented in the next lab, we will transform that raw trap into searchable, meaningful data:</p>
<pre><code class="language-bash">{
  &quot;event.action&quot;: &quot;interface-down&quot;,
  &quot;host.name&quot;: &quot;csr23&quot;,
  &quot;interface.name&quot;: &quot;eth1&quot;,
  &quot;interface.oper_status_text&quot;: &quot;Link Down&quot;
}
</code></pre>
<p><strong>The result:</strong></p>
<ul>
<li>Human-readable fields</li>
<li>Searchable dimensions for filtering</li>
<li>Context for automation rules and dashboards</li>
<li>Correlation keys for joining with metrics and flows</li>
</ul>
<p>In our next blog post, we will walk through building the ingest pipeline that performs this transformation — step by step.</p>
<h2>Intelligent Alerting: From Noise to Actionable Intelligence</h2>
<p>Traditional network monitoring relies on simple threshold alerts — &quot;interface down,&quot; &quot;high CPU.&quot; These alerts flood your inbox but provide <strong>zero context</strong> about root cause, impact, or remediation.</p>
<h3>The Lab's Approach: ES|QL + AI Assistant</h3>
<p><strong>1. Semantic Detection with ES|QL</strong></p>
<p>Instead of generic threshold alerts, the lab uses ES|QL to detect specific event patterns:</p>
<pre><code class="language-bash">FROM logs-snmp.trap-prod
| WHERE snmp.trap_oid == &quot;1.3.6.1.6.3.1.1.5.3&quot;
| KEEP @timestamp, host.name, interface.name, message
</code></pre>
<p><strong>2. Automatic AI-Powered Investigation</strong></p>
<p>When the alert triggers, it invokes the <strong>Observability AI Assistant</strong> with a structured investigation prompt that:</p>
<ul>
<li>Performs immediate triage (which device, which interface, when)</li>
<li>Assesses OSPF impact and traffic rerouting</li>
<li>Correlates with other recent failures</li>
<li>Generates severity assessment and recommended actions</li>
</ul>
<h3>The Transformation</h3>
<table>
<thead>
<tr>
<th align="center">Traditional Alerting</th>
<th align="center">Intelligent Alerting (Elastic)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><strong>Email: &quot;Interface down on csr23&quot;</strong></td>
<td align="center">Structured analysis with device context</td>
</tr>
<tr>
<td align="center"><strong>Manual investigation: 20-30 min</strong></td>
<td align="center">AI-automated investigation: 90 seconds</td>
</tr>
<tr>
<td align="center"><strong>Engineer correlates across tools</strong></td>
<td align="center">Automatic cross-source correlation</td>
</tr>
<tr>
<td align="center"><strong>No business impact assessment</strong></td>
<td align="center">Severity + recommended actions included</td>
</tr>
</tbody>
</table>
<h2>Accelerating Incident Response with the Elastic AI Assistant</h2>
<p>This is where the Elastic AI Assistant demonstrates its operational value — moving beyond passive data collection to actively interpret and explain network events in real time.</p>
<p>When an engineer views a trap document in Discover and asks:</p>
<p><em><strong>&quot;Explain this log message&quot;</strong></em></p>
<p>The AI Assistant provides comprehensive analysis including:</p>
<ul>
<li><strong>What happened:</strong> Plain-language explanation of the SNMP trap</li>
<li><strong>Device context:</strong> Router role, interface purpose, network position</li>
<li><strong>Impact analysis:</strong> OSPF neighbor status, traffic rerouting assessment</li>
<li><strong>Root cause possibilities:</strong> Physical layer, link layer, administrative causes</li>
<li><strong>Recommended actions:</strong> Immediate steps, investigation queries, validation checks</li>
<li><strong>Severity assessment:</strong> Business and technical impact rating</li>
</ul>
<h3>Manual Triage vs. AI-Assisted Investigation</h3>
<table>
<thead>
<tr>
<th align="left">Before</th>
<th align="left">After (Elastic AI)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><strong>Google the OID → 5 min</strong></td>
<td align="left">Click &quot;Explain this log&quot; → 20 seconds</td>
</tr>
<tr>
<td align="left"><strong>Open network diagram → 3 min</strong></td>
<td align="left">Topology context auto-provided</td>
</tr>
<tr>
<td align="left"><strong>Query multiple tools → 10 min</strong></td>
<td align="left">Cross-source correlation instant</td>
</tr>
<tr>
<td align="left"><strong>Assess business impact → 5 min</strong></td>
<td align="left">Impact analysis auto-generated</td>
</tr>
<tr>
<td align="left"><strong>Total: ~28 minutes</strong></td>
<td align="left"><strong>Total: ~20 seconds</strong></td>
</tr>
</tbody>
</table>
<h2>The Value Proposition: One Platform, One Data Model, One AI</h2>
<h3>What This Lab Demonstrates</h3>
<p>Elastic provides:</p>
<ul>
<li><strong>One unified platform</strong> for metrics, logs, flows</li>
<li><strong>One data model</strong> (SemConv) for consistent correlation</li>
<li><strong>One search interface</strong> (Kibana) for all network data</li>
<li><strong>One AI assistant</strong> that understands all your network telemetry</li>
<li><strong>AI-powered alerting</strong> with automated investigation</li>
</ul>
<h3>Business Impact</h3>
<p><strong>Efficiency Gains:</strong></p>
<ul>
<li><strong>85% reduction in MTTR</strong> (36 min → 5 min for initial diagnosis)</li>
<li><strong>90% reduction</strong> in manual correlation time</li>
<li>Junior engineers gain access to <strong>AI-powered expert analysis</strong></li>
</ul>
<p><strong>Operational Benefits:</strong></p>
<ul>
<li>Network engineers focus on <strong>strategy, not tool-switching</strong></li>
<li><strong>Cross-functional collaboration</strong> in one platform</li>
<li><strong>Reduced tool sprawl</strong> and management overhead</li>
</ul>
<h2>Lessons Learned</h2>
<p>After building this lab, several key insights emerged regarding how network data fits into the broader observability ecosystem:</p>
<p><strong>1. Extending Observability to the Network</strong></p>
<p>Elastic is already the gold standard for high-volume logs and application traces. This lab demonstrates that the same engine seamlessly handles network telemetry without needing a separate, siloed tool.</p>
<ul>
<li>Scale: The same architecture that ingests petabytes of application logs easily handles millions of interface counters.</li>
<li>Structure: Native support for complex nested documents allows for rich SNMP trap data (variable bindings) without flattening or losing context.</li>
<li>Speed: Real-time search applies equally to network events, enabling sub-second troubleshooting.</li>
</ul>
<p><strong>2. OpenTelemetry Semantic Conventions (SemConv) as the Universal Translator</strong></p>
<p>The power isn't just in storing the data, but in standardizing it. By mapping SNMP and NetFlow to the <strong>OpenTelemetry Semantic Conventions (SemConv)</strong>, network data finally speaks the same language as the rest of the stack.</p>
<ul>
<li><strong>Unified Search:</strong> Query across firewall logs, server metrics, and switch telemetry in a single search bar.</li>
<li><strong>Instant Visualization:</strong> Pre-built dashboards work immediately because the field names are standardized.</li>
<li><strong>Cross-Domain Correlation</strong>: Easily correlates a spike in application latency with a specific interface saturation event.</li>
</ul>
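<p>Because the field names are standardized, a single ES|QL query can pull related events from several data streams at once. A sketch (the index patterns, host name, and time window are illustrative):</p>
<pre><code class="language-bash">FROM logs-*,metrics-*
| WHERE @timestamp &gt; NOW() - 15 minutes AND host.name == &quot;csr23&quot;
| SORT @timestamp
| KEEP @timestamp, event.dataset, message
</code></pre>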
<p><strong>3. AI Assistants Thrive on Context</strong></p>
<p>While the AI in this lab was powerful on its own, the experiment highlighted a critical realization: an AI Assistant becomes exponentially more effective when coupled with a specific Knowledge Base.</p>
<p><strong>Context is King:</strong> The AI delivers better root cause analysis when provided with rich metadata, such as device roles and topology maps. Without it, the advice remains generic.</p>
<p><strong>Pro Tip (and What’s Next):</strong></p>
<p>To get organization-specific advice rather than generic suggestions, you need to feed the AI your documentation.</p>
<ul>
<li><strong>The Goal:</strong> Create a Knowledge Base containing device roles, network topology diagrams, and troubleshooting procedures.</li>
<li><strong>The Next Step:</strong> In my next blog post, I will demonstrate exactly how to do this — connecting a Knowledge Base to the AI Assistant to enable fully context-aware troubleshooting.</li>
</ul>
<h2>Conclusion: Completing the Observability Picture</h2>
<p>Elastic is already widely recognized as the standard for Application and Security observability. The goal of this lab wasn't to ask if Elastic can handle networking, but to demonstrate the immense value of bringing network data into that existing ecosystem.</p>
<p>The verdict is clear: Elastic acts as that unified foundation. It effectively breaks down the silo between Network Engineering and the rest of IT.</p>
<p>This isn't just about consolidating dashboards or replacing legacy tools. It is about establishing the Elasticsearch AI Platform as the single source of truth where network telemetry sits right alongside application and infrastructure data.</p>
<p>By treating network data as a first-class citizen in the observability stack, we unlock automated correlation, AI-assisted investigation, and the speed required to resolve incidents before they impact the business. The capabilities are in place, and the foundation is solid — Elastic is ready to unify your network with the rest of your digital business.</p>
<h2>Ready to Try It Yourself?</h2>
<p>Check out <a href="https://github.com/DeBaker1974/Containerlab-OSPF">github.com/DeBaker1974/Containerlab-OSPF</a></p>
<p>The repository includes:</p>
<ul>
<li>Complete deployment scripts (12-15 minute automated setup)</li>
<li>Pre-configured telemetry pipelines</li>
<li>Kibana dashboards</li>
<li>Alert rules with AI Assistant integration</li>
<li>Detailed README</li>
</ul>
<p><strong>Not ready to build? Try Elastic Serverless:</strong> <a href="https://cloud.elastic.co/registration">Start a free 14-day trial</a> and explore AI-powered observability with your own data.</p>
<p><strong>Special thanks to the Containerlab and FRRouting communities for their incredible open-source tools, and to Sheriff Lawal (CCIE, CISSP), Sr. Manager, Solutions Architecture at Elastic, for mentoring on this project.</strong></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/network-monitoring-with-elastic-unifying-network-observability/article-image.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Introducing Elastic Observability's new Synthetic Monitoring: Designed for seamless GitOps management and SRE-focused workflows]]></title>
            <link>https://www.elastic.co/observability-labs/blog/new-synthetic-monitoring-observability</link>
            <guid isPermaLink="false">new-synthetic-monitoring-observability</guid>
            <pubDate>Thu, 20 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability introduces Synthetic Monitoring, a GitOps management and SRE-focused workflows game-changer. This tool provides visibility into critical flows and third-party dependencies, enhancing application performance and user experience.]]></description>
            <content:encoded><![CDATA[<p>We are excited to announce the general availability of Elastic Observability's all-new Synthetic Monitoring. This powerful tool, designed for streamlined GitOps management and Site Reliability Engineers (SRE) workflows, elevates your monitoring capabilities and empowers you to transform your application's performance.</p>
<p>As you read through the next few sections, you can also look at these additional resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/improve-business-outcomes-and-observability-with-synthetic-monitoring">On-demand webinar: Getting started with synthetic monitoring on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/uniting-testing-and-monitoring-with-synthetic-monitoring">How to create a CI/CD pipeline with GitHub actions and Elastic synthetic monitoring tests</a></li>
<li><a href="https://www.elastic.co/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">Creating end-to-end synthetics monitoring tests</a></li>
<li><a href="https://playwright.dev/">Playwright (what Elastic uses for synthetic monitoring tests)</a></li>
<li><a href="https://www.npmjs.com/package/@elastic/synthetics">Elastic’s NPM library for synthetics monitoring test development</a></li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-monitors.png" alt="observability monitors" /></p>
<h2>Synthetic Monitoring: The missing piece in your observability puzzle</h2>
<p>Synthetic Monitoring plays a vital role in complementing traditional logs- and traces-driven observability, offering a unique lens through which SREs can analyze their critical flows. In the dynamic world of digital applications, ensuring these flows are available and functioning as expected for end-users becomes critical. This is where Synthetic Monitoring shines, offering the only surefire method to gain visibility into these crucial aspects.</p>
<p>Moreover, with the rise in the use of third-party dependencies in modern web applications, Synthetic Monitoring becomes indispensable. These third-party elements, while often improving functionality and user experience, can become weak links leading to failures or downtime. Synthetic Monitoring can provide exclusive visibility into these dependencies, enabling teams to identify and address potential issues proactively.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-observability-network-requests.png" alt="observability network requests" /></p>
<p>By integrating Synthetic Monitoring into your Observability strategy, you can proactively identify and mitigate potential problems, preventing costly downtime and ensuring an optimal user experience. Our Synthetic Monitoring solution fits perfectly within this framework, providing a comprehensive tool to safeguard your applications' performance and reliability.</p>
<h2>SRE-focused solution</h2>
<p>Elevate your SRE workflows with our Synthetic Monitoring product, built with an SRE's needs in mind. Enjoy access to dedicated error detail pages that serve up all crucial information at a glance, allowing you to effortlessly triage and diagnose issues. Our comparison feature offers a side-by-side view of the last successful test run and the failed one, further simplifying issue resolution. With additional features such as performance trend analysis, proactive alerts, and seamless <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=ticketing">integration with incident management tools</a>, (such as <a href="https://www.elastic.co/blog/elastic-integrations-with-servicenow-itsm-sir-itom">ServiceNow</a>) our Synthetic Monitoring solution is the quintessential tool for maintaining smooth and reliable end-user experiences.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-observability-service-unavailable.png" alt="observability service unavailable" /></p>
<h2>A leap forward in GitOps management</h2>
<p>Experience an industry first in synthetic monitoring with our groundbreaking product, uniquely built on top of the powerful browser testing framework Playwright. This innovation enables you to manage monitors as code, allowing you to write and verify tests in pre-production before effortlessly pushing the test scripts into synthetic monitoring for ongoing testing in production.</p>
<p>For developers wishing to run tests locally, our solution integrates seamlessly with the <a href="https://www.npmjs.com/package/@elastic/synthetics">NPM library</a>. This flexibility ensures that our product not only eliminates the lag between code releases and testing updates, but also simplifies the management of large volumes of monitors and scripts.</p>
<p>Moreover, keeping scripts in source control further provides advantages such as version control, Role-Based Access Control (RBAC), and the opportunity to centralize your test code alongside your application code. In essence, our Playwright-based solution revolutionizes synthetic monitoring by streamlining the entire testing process, ensuring seamless and efficient monitoring in all environments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/blog-elastic-open-editions.png" alt="observability open editions" /></p>
<h2>Managed testing infrastructure for comprehensive coverage without the hassle</h2>
<p>Our Synthetic Monitoring solution introduces an Elastic-first managed testing service, offering a global network of testing locations. At launch there are ten locations around the globe and we will be continuously growing our footprint. Eliminate the headaches of hardware management, capacity planning, scaling, updating, and security patching. Conduct both lightweight and full browser tests with ease and take advantage of features such as automatic scaling, built-in security, and seamless integration with Elastic Observability. For those use cases requiring a testing agent deployed within your own infrastructure, we offer support via Private Testing Locations. This enables your teams to focus on what matters most — delivering outstanding user experiences.</p>
<h2>Pricing and promotional period</h2>
<p>To celebrate the launch, we're providing a free promotional period for the managed testing service. From now until September 1, 2023, all test execution will be free of charge. After that, the browser test runs will be charged at a minimal $0.014 per test run. We will also have a unique flat rate for ping test execution set at $35/month/region for virtually unlimited lightweight test execution. We will not charge for test execution for private locations. <a href="https://www.elastic.co/pricing/">View our Pricing page</a> for more information.</p>
<h2>Try it out</h2>
<p>Don't miss out on this opportunity to experience our unique approach to Synthetic Monitoring. <a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-8-0">Upgrade your existing Elastic Stack to 8.8.0</a> to take advantage of our free promotional period.</p>
<p>Read about these capabilities and more in the Elastic Observability 8.8.0 <a href="https://www.elastic.co/guide/en/welcome-to-elastic/current/new.html">release notes</a>.</p>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p><em>Originally published October 25, 2022; updated May 23, 2023.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/new-synthetic-monitoring-observability/the-end-of-databases-A_(1).jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[NGINX log analytics with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-log-analytics-with-genai-elastic</link>
            <guid isPermaLink="false">nginx-log-analytics-with-genai-elastic</guid>
            <pubDate>Fri, 05 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from NGINX easier.]]></description>
<content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. NGINX, widely used for web serving, load balancing, HTTP caching, and reverse proxying, is key to many applications and outputs a large volume of logs. NGINX’s access logs, which detail all requests made to the NGINX server, and its error logs, which record server-related issues, are key to managing and analyzing NGINX problems and to understanding what is happening in your application.</p>
<p>For managing NGINX, Elastic provides several capabilities:</p>
<ol>
<li>
<p>Easy ingest, parsing, and out-of-the-box dashboards. Check out the simple how-to in our <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>. Based on logs, these dashboards show several items over time, response codes, errors, top pages, data volume, browsers used, active connections, drop rates, and much more.</p>
</li>
<li>
<p>Out-of-the-box ML-based anomaly detection jobs for your NGINX logs. These jobs help pinpoint anomalies against request rates, IP address request rates, URL access, status codes, and visitor rate anomalies.</p>
</li>
<li>
<p>ES|QL, which helps you work through logs and build out charts during analysis.</p>
</li>
<li>
<p>Elastic’s GenAI Assistant provides a simple natural language interface that helps analyze all the logs and can pull out issues from ML jobs and even create dashboards. The Elastic AI Assistant also automatically uses ES|QL.</p>
</li>
<li>
<p>NGINX SLOs - Finally, Elastic provides the ability to define and monitor SLOs for your NGINX logs. While most SLOs are metrics-based, Elastic also allows you to create log-based SLOs. We detailed this in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>.</p>
</li>
</ol>
<p>NGINX logs are another example of why logs are great. Logging is an important part of observability, alongside the metrics and tracing we usually think of first. However, the volume of logs that an application and its underlying infrastructure output can be daunting, and NGINX is usually the starting point for most analyses.</p>
<p>In today’s blog, we’ll cover how the out-of-the-box ML-based anomaly detection jobs can help RCA, and how Elastic’s GenAI Assistant helps easily work through logs to pinpoint issues in minutes. </p>
<h2>Prerequisites and config<a id="prerequisites-and-config"></a></h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Bring up an <a href="https://docs.nginx.com/nginx/admin-guide/web-server/">NGINX server</a> on a host, or run an application with NGINX as a front end and drive traffic to it.</p>
</li>
<li>
<p>Install the NGINX integration and assets and review the dashboards as noted in the <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack.</p>
</li>
<li>
<p>To use the AI Assistant, you will need a trial license or an upgrade to Platinum.</p>
</li>
</ul>
<p>In our scenario, we use three months of data from our Elastic environment to highlight the features. Hence, you may need to run your application with traffic for some time to follow along.</p>
<h2>Analyzing the issues with AI Assistant<a id="analyzing-the-issues-with-ai-assistant"></a></h2>
<p>As detailed in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>, you can get alerted on issues via SLO monitoring against NGINX logs. Let’s assume you have an SLO based on status codes, as we outlined in that <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>. You can immediately analyze the issue via the AI Assistant. Because it’s a chat interface, we simply open the AI Assistant and work through some simple analysis (see the walkthrough below for a demo):</p>
<h3>AI Assistant analysis:<a id="ai-assistant-analysis"></a></h3>
<ul>
<li>
<p><strong><em>Using lens graph all http response status codes &lt; 400 and &gt;= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> - We wanted to understand the number of requests resulting in status code &gt;= 400 and graph the results. We see that 15% of the requests were not successful, hence the SLO alert being triggered.</p>
</li>
<li>
<p><strong><em>Which ip address (field source.address) has the highest number of http.response.status.code &gt;= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> - We were curious whether a specific IP address was behind the unsuccessful requests. 72.57.0.53, with a count of 25,227 occurrences, is notably high, but it does not account for all of the failed requests.</p>
</li>
<li>
<p><strong><em>What country (source.geo.country_iso_code) is source.address=72.57.0.53 coming from. Use filebeat-nginx-elasticco-anon-2017.</em></strong> - Again, we were curious whether it came from a specific country. The IP address 72.57.0.53 comes from the country with ISO code IN, which corresponds to India. Nothing out of the ordinary.</p>
</li>
<li>
<p><strong><em>Did source.address=72.57.0.53 have any (http.response.status.code &lt; 400) from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer -</em></strong> Oddly, the IP address in question had only 4,000+ successful responses. This means it’s not malicious and points to something else.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400), from source.address=72.57.0.53. Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code -</em></strong> We were curious whether there were any 502s; there were none, and most of the failures were 404s.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400). Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code</em></strong> - Regardless of a specific address, which status code above 400 occurs most often? This also points to 404.</p>
</li>
<li>
<p><strong><em>What does a high 404 count from a specific IP address mean from NGINX logs?</em></strong> - With this question, we want to understand the potential causes of this in our application. From the answers, we can rule out security probing and web scraping, since we validated that the specific address 72.57.0.53 also makes successful requests. It also rules out user error. Hence this potentially points to broken links or missing resources.</p>
</li>
</ul>
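<p>Under the hood, prompts like these are translated into ES|QL. As a rough sketch, a query equivalent to the second prompt could also be issued directly against the ES|QL <code>_query</code> endpoint (the index and field names are taken from the prompts above; <code>TO_INTEGER</code> is used because <code>http.response.status.code</code> is not mapped as an integer):</p>
<pre><code class="language-json">POST /_query
{
  &quot;query&quot;: &quot;FROM filebeat-nginx-elasticco-anon-2017 | WHERE TO_INTEGER(http.response.status.code) &gt;= 400 | STATS failures = COUNT(*) BY source.address | SORT failures DESC | LIMIT 10&quot;
}
</code></pre>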
<h3>Watch the flow:<a id="watch-the-flow"></a></h3>
<h3>Potential issue:</h3>
<p>It seems we potentially have an issue with the backend serving specific content, or with resources (database issues or broken links). This is causing the higher-than-normal rate of non-successful status codes (&gt;= 400).</p>
<h3>Key highlights from AI Assistant:</h3>
<p>As you watch the video, you will notice a few things:</p>
<ol>
<li>
<p>We analyzed millions of logs in a matter of minutes using a set of simple natural language queries. </p>
</li>
<li>
<p>We didn’t need to know any special query language. The AI Assistant used Elastic’s ES|QL, but it can use KQL as well.</p>
</li>
<li>
<p>The AI Assistant easily builds out graphs.</p>
</li>
<li>
<p>The AI Assistant accesses and uses internal information stored in Elastic’s indices, unlike a generic, search-engine-style AI Assistant. This is enabled through RAG, and the AI Assistant can also surface known issues from GitHub, runbooks, and other useful internal information.</p>
</li>
</ol>
<p>Check out the following <a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">blog</a> on how the AI Assistant uses RAG to retrieve internal information, specifically using GitHub issues and runbooks.</p>
<h2>Locating anomalies with ML</h2>
<p>While the AI Assistant is great for analyzing information, another important aspect of NGINX log management is handling log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs that analyze one or more metrics to look for anomalies. When using NGINX, there are several <a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">out-of-the-box anomaly detection jobs</a>. These work specifically on NGINX access logs.</p>
<ul>
<li>
<p>Low_request_rate_nginx - Detect low request rates</p>
</li>
<li>
<p>Source_ip_request_rate_nginx - Detect unusual source IPs - high request rates</p>
</li>
<li>
<p>Source_ip_url_count_nginx - Detect unusual source IPs - high distinct count of URLs</p>
</li>
<li>
<p>Status_code_rate_nginx - Detect unusual status code rates</p>
</li>
<li>
<p>Visitor_rate_nginx - Detect unusual visitor rates</p>
</li>
</ul>
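<p>Once a job is running, its anomaly records can also be pulled directly from the ML API. An illustrative sketch for the status code job (the job id is assumed to be <code>status_code_rate_nginx</code>; the installed id may differ), returning only high-scoring records:</p>
<pre><code class="language-json">GET _ml/anomaly_detectors/status_code_rate_nginx/results/records
{
  &quot;sort&quot;: &quot;record_score&quot;,
  &quot;desc&quot;: true,
  &quot;record_score&quot;: 75
}
</code></pre>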
<p>Since these come right out of the box, let’s look at the job Status_code_rate_nginx, which relates to our previous analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/nginx-ml-log-analytics.png" alt="NGINX ML Log Analytics" /></p>
<p>With a few simple clicks, we immediately get an analysis showing a specific IP address, 72.57.0.53, with a higher-than-normal number of non-successful requests. Notably, we found the same result using the AI Assistant.</p>
<p>We can take this further in conversation with the AI Assistant, look at the logs, or explore the other ML anomaly jobs.</p>
<h2>Conclusion:<a id="conclusion"></a></h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze NGINX logs without needing to know query syntax, where the data lives, or even the field names. Additionally, you’ve seen how SLOs can alert you to a potential issue or degradation in service.</p>
<p>Check out other resources on NGINX logs:</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">Out-of-the-box anomaly detection jobs for NGINX</a></p>
<p><a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">Using the NGINX integration to ingest and analyze NGINX Logs</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">NGINX Logs based SLOs in Elastic</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Using GitHub issues, runbooks, and other internal information for RCAs with Elastic’s RAG based AI Assistant</a></p>
<h2>Try it out<a id="try-it-out"></a></h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/blog-thumb-observability-pattern-color.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Exploring Nginx metrics with Elastic time series data streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-metrics-elastic-time-series-data-streams</link>
            <guid isPermaLink="false">nginx-metrics-elastic-time-series-data-streams</guid>
            <pubDate>Mon, 10 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elasticsearch recently released time series metrics as GA. In this blog, we dive into details of what a time series metric document is and the mapping used for enabling time series by using an existing OOTB Nginx integration.]]></description>
<content:encoded><![CDATA[<p>Elasticsearch<sup>®</sup> recently released time series data streams for metrics. This not only provides better metrics support in Elastic Observability, but it also helps reduce <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">storage costs</a>. We discussed this in a <a href="https://www.elastic.co/blog/elasticsearch-time-series-data-streams-observability-metrics">previous blog</a>.</p>
<p>In this blog, we dive into how to enable and use time series data streams by reviewing what a time series metrics <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html">document</a> is and the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mapping</a> used for enabling time series. In particular, we will showcase this by using Elastic Observability’s Nginx integration. As Elastic<sup>®</sup> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/tsds.html">time series data stream (TSDS)</a> metrics capabilities evolve, some of the scenarios below will change.</p>
<p>Elastic TSDS stores metrics in indices optimized for a time series database (<a href="https://en.wikipedia.org/wiki/Time_series_database">TSDB</a>), which is used to store time series metrics. <a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">Elastic’s TSDB also got a significant optimization in 8.7</a> by reducing storage costs by upward of 70%.</p>
<h2>What is an Elastic time series data stream?</h2>
<p>A time series data stream (TSDS) models timestamped metrics data as one or more time series. In a TSDS, each Elasticsearch document represents an observation or data point in a specific time series. Although a TSDS can contain multiple time series, a document can only belong to one time series. A time series can’t span multiple data streams.</p>
<p>A regular <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data stream</a> can serve different uses, including logs. For metrics, however, a time series data stream is recommended. A time series data stream differs from a regular data stream in <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#differences-from-regular-data-stream">multiple ways</a>; notably, a TSDS defines one or more dimensions along with its metric fields.</p>
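<p>Concretely, what makes a data stream a TSDS is the <code>index.mode: time_series</code> setting, together with a routing path listing the dimension fields, in its index template. A simplified, illustrative sketch (the template name, index pattern, and routing path are reduced for brevity):</p>
<pre><code class="language-json">PUT _index_template/metrics-nginx-tsdb-example
{
  &quot;index_patterns&quot;: [&quot;metrics-nginx.stubstatus-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;time_series&quot;,
      &quot;index.routing_path&quot;: [&quot;nginx.stubstatus.hostname&quot;]
    }
  }
}
</code></pre>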
<h2>Nginx metrics as an example</h2>
<p><a href="https://www.elastic.co/integrations/data-integrations?solution=observability">Integrations</a> provide an easy way to ingest observability metrics for a large number of services and systems. We use the <a href="https://docs.elastic.co/en/integrations/nginx">Nginx</a> integration <a href="https://docs.elastic.co/en/integrations/nginx#metrics-reference">metrics</a> data set as an example here. This is one of the integrations on which time series has recently been enabled.</p>
<h2>Process of enabling TSDS on a package</h2>
<p>Time series is <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-mode">enabled</a> on a metrics data stream of an <a href="https://www.elastic.co/integrations/">integration</a> package after the relevant time series <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-metric">metrics</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension">dimension</a> mappings are added. Existing integrations with metrics data streams will come with time series metrics enabled, so users can use them as-is without any additional configuration.</p>
<p>The image below captures a high-level summary of a time series data stream, the corresponding index template, the time series indices and a single document. We will shortly dive into the details of each of the fields in the document.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-1-time-series-data-stream-2.png" alt="time series data stream" /></p>
<h2>TSDS metric document</h2>
<p>Below we provide a snippet of an ingested Elastic document with time series metrics and dimension together.</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2023-06-29T03:58:12.772Z&quot;,

  &quot;nginx&quot;: {
    &quot;stubstatus&quot;: {
      &quot;accepts&quot;: 202,
      &quot;active&quot;: 2,
      &quot;current&quot;: 3,
      &quot;dropped&quot;: 0,
      &quot;handled&quot;: 202,
      &quot;hostname&quot;: &quot;host.docker.internal:80&quot;,
      &quot;reading&quot;: 0,
      &quot;requests&quot;: 10217,
      &quot;waiting&quot;: 1,
      &quot;writing&quot;: 1
    }
  }
}
</code></pre>
<p><strong>Multiple metrics per document:</strong><br />
An ingested <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html">document</a> has a collection of fields, including metrics fields. Multiple related metrics fields can be part of a single document. A document is part of a single <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html">data stream</a>, and typically all the metrics it contains are related. All the metrics in a document are part of the same time series.</p>
<p><strong>Metric type and dimensions as mapping:</strong><br />
While the document contains the metrics details, the metric types and dimension details are defined as part of the field <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mapping</a>. All the time-series-relevant field mappings are defined collectively for a given data stream as part of package development. Integrations released with a time series data stream contain all the relevant time series field mappings as part of the package release. There are two additional mappings needed in particular: the <strong>time_series_metric</strong> mapping and the <strong>time_series_dimension</strong> mapping.</p>
<h2>Metrics types fields</h2>
<p>A document contains the metric type fields (as shown above). The mappings for the metric type fields are defined using the <strong>time_series_metric</strong> mapping in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index templates</a>, as given below:</p>
<pre><code class="language-json">&quot;nginx&quot;: {
    &quot;properties&quot;: {
       &quot;stubstatus&quot;: {
           &quot;properties&quot;: {
                &quot;accepts&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;active&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;current&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;dropped&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;handled&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;reading&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;requests&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;counter&quot;
                },
                &quot;waiting&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                },
                &quot;writing&quot;: {
                  &quot;type&quot;: &quot;long&quot;,
                  &quot;time_series_metric&quot;: &quot;gauge&quot;
                }
           }
       }
    }
}
</code></pre>
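<p>These metric type mappings only take effect when the backing indices are created in time series mode. As a hedged sketch (the template name, index pattern, and routing path below are illustrative, not the exact ones the Nginx package ships), the relevant index template settings in Kibana Dev Tools could look like:</p>

```json
PUT _index_template/metrics-nginx.stubstatus
{
  "index_patterns": ["metrics-nginx.stubstatus-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["agent.id", "nginx.stubstatus.hostname"]
    }
  }
}
```

<p>With <code>index.mode: time_series</code> set, Elasticsearch routes documents by the fields listed in <code>index.routing_path</code> and enables time-series-specific storage optimizations.</p>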
<h2>Dimension fields</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension">Dimensions</a> are field names and values that, in combination, identify a document’s time series.</p>
<p>In Elastic time series, there are some additional considerations for dimensions:</p>
<ul>
<li>Dimension fields need to be defined for each time series; a time series cannot have zero dimension fields.</li>
<li>Keyword (or similar) type fields can be defined as dimensions.</li>
<li>There is currently a limit on the number of dimensions that can be defined in a data stream, though this restriction will likely be relaxed going forward.</li>
</ul>
<p>Dimensions are common to all the metrics in a single document within a data stream. Each time series data stream of a package (for example, Nginx) comes with a predefined set of dimension fields, as shown below.</p>
<p>A document typically contains more than one dimension field. In the case of Nginx, <em>agent.id</em> and <em>nginx.stubstatus.hostname</em> are two of the dimension fields. The mapping for the dimension fields is done using the <strong>time_series_dimension</strong> mapping as below:</p>
<pre><code class="language-json">&quot;agent&quot;: {
   &quot;properties&quot;: {
      &quot;id&quot;: {
         &quot;type&quot;: &quot;keyword&quot;,
         &quot;time_series_dimension&quot;: true
       }
    }
 },

&quot;nginx&quot;: {
   &quot;properties&quot;: {
       &quot;stubstatus&quot;: {
            &quot;properties&quot;: {
                &quot;hostname&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;time_series_dimension&quot;: true
                 }
            }
       }
    }
}
</code></pre>
<h2>Meta fields</h2>
<p>Documents ingested also have additional meta fields apart from the <em>metric</em> and <em>dimension</em> fields explained above. These additional fields provide richer query capabilities for the metrics.</p>
<p><strong>Example Elastic meta fields</strong></p>
<pre><code class="language-json">&quot;data_stream&quot;: {
      &quot;dataset&quot;: &quot;nginx.stubstatus&quot;,
      &quot;namespace&quot;: &quot;default&quot;,
      &quot;type&quot;: &quot;metrics&quot;
 }
</code></pre>
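<p>Together, the dimension, metric, and meta fields make it straightforward to scope queries to a single data stream. As a sketch (the field names are taken from the document above; the data stream name assumes the default namespace), an aggregation in Kibana Dev Tools that groups a gauge metric by a dimension could look like:</p>

```json
GET metrics-nginx.stubstatus-default/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": {
    "per_host": {
      "terms": { "field": "nginx.stubstatus.hostname" },
      "aggs": {
        "avg_active": { "avg": { "field": "nginx.stubstatus.active" } }
      }
    }
  }
}
```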
<h2>Discover and visualization in Kibana</h2>
<p>Elastic provides comprehensive search and visualization for time series metrics. Time series metrics can be searched as-is in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>. In the search below, counter and gauge metrics are marked with <em>different icons</em>. Below we also provide examples of visualizing the time series metrics using <a href="https://www.elastic.co/kibana/kibana-lens">Lens</a> and the out-of-the-box (OOTB) dashboard included in the Nginx integration package.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-2-discover-search-tsds.png" alt="Discover search for TSDS metrics" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-3-lens.png" alt="Maximum of counter field nginx.stubstatus.accepts visualized using Lens" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-4-median-gauge.png" alt="Median of gauge field nginx.stubstatus.active visualized using Lens" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/elastic-blog-5-multiple-line-graphs.png" alt="OOTB Nginx dashboard with the TSDS metrics visualizations " /></p>
<h2>Try it out!</h2>
<p>We have provided a detailed example of a time series document ingested by the Elastic Nginx integration. We walked through how time series metrics are modeled in Elastic and the additional time series mappings, with examples. We also covered the dimension requirements for Elastic time series, as well as brief examples of searching, visualizing, and dashboarding TSDS metrics in Kibana<sup>®</sup>.</p>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the time series data stream capabilities discussed above. We would be interested in your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/blog/elasticsearch-time-series-data-streams-observability-metrics">How to use Elasticsearch and Time Series Data Streams for observability metrics</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Time Series Data Stream in Elastic documentation</a> </li>
<li><a href="https://www.elastic.co/blog/whats-new-elasticsearch-8-7-0">Efficient storage with Elastic Time Series Database</a></li>
<li><a href="https://www.elastic.co/integrations/">Elastic integrations catalog</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-metrics-elastic-time-series-data-streams/time-series-data-streams-blog-720x420-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[A Practical Guide to end-to-end distributed tracing for Nginx with OpenTelemetry in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-opentelemetry-end-to-end-tracing</link>
            <guid isPermaLink="false">nginx-opentelemetry-end-to-end-tracing</guid>
            <pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Instrument Nginx with the OpenTelemetry tracing module and export spans to Elastic Observability's APM for full end-to-end distributed tracing.]]></description>
            <content:encoded><![CDATA[<h1>End-to-End Observability with Nginx and OpenTelemetry</h1>
<p>Nginx sits at the very front of most modern architectures: handling SSL, routing, load balancing, authentication, and more. Yet, despite its central role, it is often absent from distributed traces.<br />
That gap creates blind spots that impact performance debugging, user experience analysis, and system reliability.</p>
<p>This article explains <strong>why Nginx tracing is important</strong> in an application context, and provides a <strong>practical guide</strong> to enable the Nginx <a href="https://nginx.org/en/docs/ngx_otel_module.html">Otel</a> tracing module exporting spans directly to <a href="https://www.elastic.co/docs/solutions/observability/apm">Elastic APM</a>.</p>
<h2>Why Nginx Tracing Matters for Modern Observability</h2>
<p>Instrumenting only backend services gives you only half the picture.<br />
Nginx sees:</p>
<ul>
<li>every incoming request</li>
<li>client trace context</li>
<li>TLS negotiation</li>
<li>upstream errors (502, 504)</li>
<li>edge-layer latency</li>
<li>routing decisions</li>
</ul>
<p>If Nginx is not in your traces, your distributed trace is incomplete.</p>
<p>By adding OpenTelemetry tracing at this ingress layer, you unlock:</p>
<p><em>1. Full trace continuity</em> : From browser → Nginx → backend → database.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_1.png" alt="Nginx Trace Continuity" /></p>
<p><em>2. Accurate latency attribution</em> : Edge delays vs. backend delays are clearly separated, which unlocks Elastic <a href="https://www.elastic.co/docs/solutions/observability/apm/machine-learning">APM latency</a> anomaly detection for proactive alerting.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_2.png" alt="Nginx Latency Detection" /></p>
<p><em>3. Error root-cause clarity</em> : Nginx errors appear as spans instead of backend “mystery gaps”.</p>
<p><em>4. Complete service topology</em> : Your APM service map finally shows the real architecture.
<img src="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_4.png" alt="Nginx APM Service Map" /></p>
<h2>Integrating Nginx with OpenTelemetry on Debian</h2>
<p>This guide explains how to install and configure the Nginx OpenTelemetry module on a Debian-based system. The configuration examples are tailored to send telemetry data directly to an Elastic APM endpoint, whether an <a href="https://www.elastic.co/docs/reference/opentelemetry">EDOT</a> Collector or <a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">mOtel</a> in the case of Elastic Serverless, enabling end-to-end distributed tracing.</p>
<h2>Installation on Debian</h2>
<p>The Nginx OTel module is not included in the standard Nginx packages; it must be installed separately on top of a working Nginx installation.</p>
<h3>Prerequisites</h3>
<p>First, update your package lists and install the Nginx OpenTelemetry module package.</p>
<pre><code class="language-bash">sudo apt update
sudo apt install -y nginx-module-otel
</code></pre>
<h3>Load the Module in Nginx</h3>
<p>Edit your main <code>/etc/nginx/nginx.conf</code> file to load the new module. This directive must be at the top level, before the <code>http</code> block.</p>
<pre><code class="language-nginx"># /etc/nginx/nginx.conf

load_module modules/ngx_otel_module.so;

events {
    # ...
}

http {
    # ...
}
</code></pre>
<p>Now, test your configuration and restart Nginx.</p>
<pre><code class="language-bash">sudo nginx -t
sudo systemctl restart nginx
</code></pre>
<h2>Configuration</h2>
<p>Configuration is split between the main <code>nginx.conf</code> file (for global settings) and your site-specific server block files.</p>
<h3>Global Configuration (<code>/etc/nginx/nginx.conf</code>)</h3>
<p>This configuration sets up the destination for your telemetry data and defines global variables used for CORS and tracing. These settings are placed inside the <code>http</code> block.</p>
<pre><code class="language-nginx">http {
    ...

    # --- OpenTelemetry Exporter Configuration ---
    # Defines where Nginx will send its telemetry data directly to Elastic APM or EDOT.
    otel_exporter {
        endpoint https://&lt;ELASTIC_URL&gt;:443;
        header Authorization &quot;Bearer &lt;TOKEN&gt;&quot;;
    }

    # --- OpenTelemetry Service Metadata ---
    # These attributes identify Nginx as a unique service in the APM UI.
    otel_service_name nginx;
    otel_resource_attr service.version 1.28.0;
    otel_resource_attr deployment.environment production;
    otel_trace_context propagate; # Needed to propagate the RUM traces to the backend

    # --- Helper Variables for Tracing and CORS ---
    # Creates the $trace_flags variable needed to build the outgoing traceparent header.
    map $otel_parent_sampled $trace_flags {
        default &quot;00&quot;; # Not sampled
        &quot;1&quot;     &quot;01&quot;; # Sampled
    }

    # Creates the $cors_origin variable for secure, multi-origin CORS handling.
    map $http_origin $cors_origin {
        default &quot;&quot;;
        &quot;http://&lt;URL_ORIGIN_1&gt;&quot; $http_origin; # Add your Origin here to allow CORS (no trailing slash)
        &quot;https://&lt;URL_ORIGIN_2&gt;&quot; $http_origin; # Add your other Origins here to allow CORS
    }
...
}
</code></pre>
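<p>The <code>map $http_origin $cors_origin</code> block above is an allow-list lookup: it echoes the request’s Origin header back when the origin is on the list and yields an empty string otherwise. A minimal Python sketch of the equivalent logic (the origins below are hypothetical placeholders, not values from this guide):</p>

```python
# Mirrors nginx's `map $http_origin $cors_origin` allow-list.
# The origins below are hypothetical placeholders.
ALLOWED_ORIGINS = {"http://app.example.com", "https://shop.example.com"}

def cors_origin(http_origin: str) -> str:
    # The `default ""` arm: unknown origins get an empty header value,
    # so the browser refuses the cross-origin response.
    return http_origin if http_origin in ALLOWED_ORIGINS else ""
```

<p>Returning the origin itself (rather than <code>*</code>) keeps the allow-list explicit and works with credentialed requests.</p>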
<h3>Server Block Configuration (<code>/etc/nginx/conf.d/site.conf</code>)</h3>
<p>This configuration enables tracing for a specific site, handles CORS preflight requests, and propagates the trace context to the backend service.</p>
<pre><code class="language-nginx">server {
    listen 443 ssl;
    server_name &lt;WEBSITE_URL&gt;;

    # --- OpenTelemetry Module Activation ---
    # Enable tracing for this server block.
    otel_trace on;
    otel_trace_context propagate;

    location / {
        # --- CORS Preflight (OPTIONS) Handling ---
        # Intercepts preflight requests and returns the correct CORS headers,
        # allowing the browser to proceed with the actual request.
        if ($request_method = 'OPTIONS') {
            add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
            add_header 'Access-Control-Allow-Headers' 'Content-Type, traceparent, tracestate' always;
            add_header 'Access-Control-Max-Age' 86400;
            add_header 'Access-Control-Allow-Origin' &quot;$cors_origin&quot; always;
            return 204;
        }

        # --- OpenTelemetry Trace Context Propagation ---
        # Manually constructs the W3C traceparent header and passes the tracestate
        # header to the backend, linking this trace to the upstream service.
        proxy_set_header traceparent      &quot;00-$otel_trace_id-$otel_span_id-$trace_flags&quot;;
        proxy_set_header tracestate       $http_tracestate;

        # --- Standard Proxy Headers ---
        proxy_set_header Host             $host;
        proxy_set_header X-Real-IP        $remote_addr;
        proxy_set_header X-Forwarded-For  $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # --- Forward to Backend ---
        # Passes the request to the actual application (e.g., localhost in this example).
        proxy_pass http://&lt;BACKEND_URL&gt;:8080;
    }
}
</code></pre>
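<p>The <code>proxy_set_header traceparent</code> line above assembles a W3C Trace Context header from <code>$otel_trace_id</code>, <code>$otel_span_id</code>, and the mapped <code>$trace_flags</code>. To make the format explicit, here is the same assembly sketched in Python with hypothetical IDs:</p>

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Assemble a W3C traceparent header: <version>-<trace-id>-<span-id>-<flags>.

    Mirrors nginx's "00-$otel_trace_id-$otel_span_id-$trace_flags", with the
    sampled bit mapped to "01"/"00" as in the `map $otel_parent_sampled` block.
    """
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

# Hypothetical IDs: 32 hex chars for the trace ID, 16 for the span ID.
print(build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True))
# → 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```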
<p>Test your configuration and restart Nginx.</p>
<pre><code class="language-bash">sudo nginx -t
sudo systemctl restart nginx
</code></pre>
<h2>Conclusion: Turning Nginx into a First-Class Observability Signal</h2>
<p>By enabling OpenTelemetry tracing directly in Nginx and exporting spans to Elastic APM (via EDOT or Elastic’s managed OTLP endpoint), you bring your ingress layer into the same observability model as the rest of your stack. The result is:</p>
<ul>
<li>true end-to-end trace continuity from the browser to backend services</li>
<li>clear separation between edge latency and application latency</li>
<li>immediate visibility into gateway-level failures and retries</li>
<li>accurate service maps that reflect real production traffic</li>
</ul>
<p>Most importantly, this approach aligns Nginx with modern observability standards. It avoids proprietary instrumentation, fits naturally into OpenTelemetry-based architectures, and scales consistently across hybrid and cloud-native environments.</p>
<h2>Try it out!</h2>
<p>Once Nginx tracing is in place, several natural extensions can further improve your observability posture:</p>
<ul>
<li>correlate Nginx traces with application <a href="https://www.elastic.co/docs/reference/apm/agents/go/log-correlation">logs and metrics using</a> Elastic’s unified observability</li>
<li>add Real User Monitoring (<a href="https://www.elastic.co/docs/solutions/observability/apm/apm-agents/real-user-monitoring-rum">RUM</a>) to close the loop from frontend to backend</li>
<li>introduce <a href="https://www.elastic.co/docs/solutions/observability/apm/transaction-sampling">sampling and tail-based</a> decisions at the collector level for cost control</li>
<li>use Elastic <a href="https://www.elastic.co/docs/solutions/observability/apm/service-map">APM service maps</a> and <a href="https://www.elastic.co/docs/reference/machine-learning/ootb-ml-jobs-apm">anomaly detection</a> to proactively detect edge-related issues</li>
</ul>
<p>Instrumenting Nginx is often the missing link in distributed tracing strategies. With OpenTelemetry and Elastic, that gap can now be closed in a clean, standards-based, and production-ready way.</p>
<p>If you want to experiment with this setup quickly, Elastic Serverless provides the fastest way to get started.
Sign up and try it out in just a few minutes using our trial environment available at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a> .</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-opentelemetry-end-to-end-tracing/document_elastic_nginx_otel_instrumentation_4.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's AIOps Labs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-logs-machine-learning-aiops</link>
            <guid isPermaLink="false">observability-logs-machine-learning-aiops</guid>
            <pubDate>Thu, 27 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Our machine learning-based AIOps capabilities help you analyze the root cause of issues allowing you to focus on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>In the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">previous blog</a> in our root cause analysis with logs series, we explored how to analyze logs in Elastic Observability with Elastic’s anomaly detection and log categorization capabilities. Elastic’s platform enables you to get started on machine learning (ML) quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured <a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">machine learning models</a> for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To get you started, there are several key features built into Elastic Observability to aid in analysis, bypassing the need to run specific ML models. These features help minimize the time and analysis of logs.</p>
<p>Let’s review the set of machine learning-based observability features in Elastic:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action more quickly.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. Read <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a> for an overview of this capability.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p>As we showed in the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">last blog</a>, using machine learning-based features helps minimize the extremely tedious and time-consuming process of analyzing data using traditional methods, such as alerting and simple pattern matching (visual or simple searching, etc.). Trying to find the needle in the haystack requires the use of some level of artificial intelligence due to the increasing amounts of telemetry data (logs, metrics, and traces) being collected across ever-growing applications.</p>
<p>In this blog post, we’ll cover two capabilities found in Elastic’s AIOps Labs: log spike detector and log pattern analysis. We’ll use the same data from the previous blog and analyze it using these two capabilities.</p>
<p><strong><em>We will cover log spike detector and log pattern analysis against the popular Hipster Shop app developed by Google, and modified recently by OpenTelemetry.</em></strong></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>Below, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Utilize a version of the popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and Security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map.png" alt="observability service map" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features. You might have a different set of issues depending on how you load the application and/or introduce specific feature flags.</p>
<p>As part of the walk-through, we’ll assume we are DevOps or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer-related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Log spike analysis</li>
<li>Log pattern analysis</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<p>Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map-service-details.png" alt="observability service map service details" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. Rather than jump into anomaly detection (see previous <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">blog</a>), let’s look at some of the potential issues by reviewing the service details in APM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-overview.png" alt="observability product catalog service overview" /></p>
<p>What we see for the productCatalogService is that there are latency issues, failed transactions, a large number of issues, and a dependency to PostgreSQL. When we look at the errors in more detail and drill down, we see they are all coming from <a href="https://pkg.go.dev/github.com/lib/pq">PQ - which is a PostgreSQL driver in Go</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-errors.png" alt="observability product catalog service errors" /></p>
<p>As we drill further, we still can’t tell why the productCatalogService is not able to pull information from the PostgreSQL database.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-error-group.png" alt="observability product catalog service error group" /></p>
<p>We see that there is a spike in errors, so let's see if we can glean further insight using one of our two options:</p>
<ul>
<li>Log rate spikes</li>
<li>Log pattern analysis</li>
</ul>
<h3>Log rate spikes</h3>
<p>Let’s start with the <strong>log rate spikes</strong> detector, found in the AIOps Labs section of Elastic’s machine learning features. We also pre-select analyzing the spike against a baseline history.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-postgres.png" alt="explain log rate spikes postgres" /></p>
<p>The log rate spikes detector has looked at all the logs from the spike and compared them to the baseline, and it's seeing higher-than-normal counts in specific log messages. From a visual inspection, we see that PostgreSQL log messages are high. We further filter this with postgres.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-pgbench.png" alt="explain log rates spikes pgbench" /></p>
<p>We immediately notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes a heavy load on the database host, likely causing higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
<h3>Log pattern analysis</h3>
<p>Instead of log rate spikes, let’s use log pattern analysis to investigate the spike in errors we saw in productCatalogService. In AIOps Labs, we simply select Log Pattern Analysis, use Logs data, filter the results with postgres (since we know it's related to PostgreSQL), and look at information from the message field of the logs we are processing. We see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-pattern-analysis.png" alt="observability explain log pattern analysis" /></p>
<p>Almost immediately we see the biggest pattern it finds is a log message where pgbench is updating the database. We can further directly drill into this log message from log pattern analysis into Discover and review the details and further analyze the messages.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-expanded-document.png" alt="expanded document" /></p>
<p>As we mentioned in the previous section, while it may or may not be the root cause, it quickly gives us a place to start and a potential root cause. A developer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
<h2>Conclusion</h2>
<p>Between the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">first blog</a> and this one, we’ve shown how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned in this blog.</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find the root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities (found in AIOps Labs in Elastic) in this blog:</p>
<ol>
<li><strong>Log rate spikes</strong> detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change, potentially across millions of log events spanning multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ol>
</li>
<li>
<p>You learned how simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand the machine learning that drives these features or perform any lengthy setup.</p>
</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities outlined above.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>Elastic and Elasticsearch are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitoring service performance: An overview of SLA calculation for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-sla-calculations-transforms</link>
            <guid isPermaLink="false">observability-sla-calculations-transforms</guid>
            <pubDate>Mon, 24 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Stack provides many valuable insights for different users, such as reports on service performance and if the service level agreement (SLA) is met. In this post, we’ll provide an overview of calculating an SLA for Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.</p>
<p><em>Since version 8.8, we have a built in functionality to calculate SLOs —</em> <a href="https://www.elastic.co/guide/en/observability/current/slo.html"><em>check out our guide</em></a><em>!</em></p>
<h2>Foundations of calculating an SLA</h2>
<p>There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different ways. Some examples include:</p>
<ul>
<li>Count of HTTP 2xx must be above 98% of all HTTP status</li>
<li>Response time of successful HTTP 2xx requests must be below x milliseconds</li>
<li>Synthetic monitor must be up at least 99%</li>
<li>95% of all batch transactions from the billing service need to complete within 4 seconds</li>
</ul>
<p>Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, letting you simply define an alert when availability is below 98% for the last hour.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-overview-monitor-details.png" alt="overview monitor details" /></p>
<p>I personally recommend using <a href="https://www.elastic.co/blog/new-synthetic-monitoring-observability">Elastic Synthetic Monitoring</a> whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.</p>
<p>Sometimes this is impossible, for example when you want to calculate the uptime of a specific Windows service that does not offer any TCP port or HTTP interaction. Here a caveat applies: just because the service is running does not necessarily mean that the service is working fine.</p>
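<p>To make targets like these concrete, it helps to translate an availability percentage into an allowed downtime budget. The following sketch (illustrative Python, not part of any Elastic tooling; the function name is our own) computes how much downtime a given target permits over a period:</p>
<pre><code class="language-python"># Convert an availability target into an allowed downtime budget.
# Illustrative helper only.

def downtime_budget_minutes(sla_percent, period_hours):
    # minutes of allowed downtime for the SLA over the period
    total_minutes = period_hours * 60
    return total_minutes * (1 - sla_percent / 100)

# A 98% target over a 30-day month allows 864 minutes (14.4 hours):
print(round(downtime_budget_minutes(98, 30 * 24)))
# A 99% target over the same month allows 432 minutes (7.2 hours):
print(round(downtime_budget_minutes(99, 30 * 24)))
</code></pre>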
<h2>Transforms to the rescue</h2>
<p>We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.</p>
<h3>Lens formula</h3>
<p>You can use Lens and a formula (for a deep dive into formulas, <a href="https://www.elastic.co/blog/how-tough-was-your-workout-take-a-closer-look-at-strava-data-through-kibana-lens">check out this blog</a>). Use the Search bar to filter down to the data you want, then use the formula option in Lens. We count all records with Running as the state and divide by the overall count of records. This is a nice solution when you need a quick, on-the-fly calculation.</p>
<pre><code class="language-sql">count(kql='windows.service.state: &quot;Running&quot; ')/count()
</code></pre>
<p>Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, indicating that a reboot happened and the service was therefore down for a moment. Lastly, we add a reference line set to our defined threshold of 98%. This ensures that a quick look at the visualization lets our eyes gauge whether we are above or below the threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-visualization.png" alt="visualization" /></p>
<h3>Transform</h3>
<p>What if you are not interested in just one service, but multiple services are needed for your SLA? That is where Transforms can solve the problem. A second issue is that the data above is only available inside the Lens visualization, so we cannot create any alerts on it.</p>
<p>Go to Transforms and create a pivot transform.</p>
<ol>
<li>
<p>Add the following filter to narrow it to only services data sets: data_stream.dataset: &quot;windows.service&quot;. If you are interested in a specific service, you can always add it to the search bar if you want to know if a specific remote management service is up in your entire fleet!</p>
</li>
<li>
<p>Select date_histogram(@timestamp) and set it to your chosen unit. By default, the Elastic Agent only collects service states every 60 seconds. I am going with 1 hour.</p>
</li>
<li>
<p>Select agent.name and windows.service.name as well.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration.png" alt="transform configuration" /></p>
<ol start="4">
<li>Now we need to define an aggregation type. We will use a value_count of windows.service.state. That just counts how many records have this value.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations.png" alt="aggregations" /></p>
<ol start="5">
<li>
<p>Rename the value_count to total_count.</p>
</li>
<li>
<p>Add value_count for windows.service.state a second time and use the pencil icon to edit it to terms, which aggregates for running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations-apply.png" alt="aggregations apply" /></p>
<ol start="7">
<li>
<p>This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.</p>
</li>
<li>
<p>Now, the preview shows us the count of records with any states and the count of running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-next.png" alt="transform configuration" /></p>
<ol start="9">
<li>
<p>Here comes the tricky part. We need to write some custom aggregations to calculate the percentage of uptime. Click on the copy icon next to the edit JSON config.</p>
</li>
<li>
<p>In a new tab, go to Dev Tools. Paste what you have in the clipboard.</p>
</li>
<li>
<p>Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter and run it. This will create a preview of what the data looks like. It should give you the same information as in the table preview.</p>
</li>
<li>
<p>Now, we need to calculate the percentage of up, which means adding a bucket script where we divide running.values by total_count, just like we did in the Lens visualization. If you name the columns differently or use more than a single value, you will need to adapt accordingly.</p>
</li>
</ol>
<pre><code class="language-json">&quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
</code></pre>
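<p>The bucket script is a plain ratio, the same calculation as the Lens formula earlier. As a quick sanity check of the arithmetic (a standalone Python sketch, not Painless):</p>
<pre><code class="language-python"># Reproduce the availability bucket script in plain Python.
# With 1-hour buckets and a 60-second collection interval, a fully
# healthy service yields 60 samples per bucket, all in state Running.

def availability(running_values, total_count):
    # mirrors the Painless script: params.up / params.total
    return running_values / total_count

print(availability(60, 60))  # every sample Running: 1.0
print(round(availability(59, 60), 4))  # one sample missed
</code></pre>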
<ol start="13">
<li>This is the entire transform for me:</li>
</ol>
<pre><code class="language-bash">POST _transform/_preview
{
  &quot;source&quot;: {
    &quot;index&quot;: [
      &quot;metrics-*&quot;
    ]
  },
  &quot;pivot&quot;: {
    &quot;group_by&quot;: {
      &quot;@timestamp&quot;: {
        &quot;date_histogram&quot;: {
          &quot;field&quot;: &quot;@timestamp&quot;,
          &quot;calendar_interval&quot;: &quot;1h&quot;
        }
      },
      &quot;agent.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;agent.name&quot;
        }
      },
      &quot;windows.service.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;windows.service.name&quot;
        }
      }
    },
    &quot;aggregations&quot;: {
      &quot;total_count&quot;: {
        &quot;value_count&quot;: {
          &quot;field&quot;: &quot;windows.service.state&quot;
        }
      },
      &quot;running&quot;: {
        &quot;filter&quot;: {
          &quot;term&quot;: {
            &quot;windows.service.state&quot;: &quot;Running&quot;
          }
        },
        &quot;aggs&quot;: {
          &quot;values&quot;: {
            &quot;value_count&quot;: {
              &quot;field&quot;: &quot;windows.service.state&quot;
            }
          }
        }
      },
      &quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
    }
  }
}
</code></pre>
<ol start="14">
<li>The preview in Dev Tools should work and be complete. If not, you must debug any errors. Most of the time, the problem is the bucket script and the path to the values; you might have named it up instead of running. This is what the preview looks like for me.</li>
</ol>
<pre><code class="language-json">{
  &quot;running&quot;: {
    &quot;values&quot;: 1
  },
  &quot;agent&quot;: {
    &quot;name&quot;: &quot;AnnalenasMac&quot;
  },
  &quot;@timestamp&quot;: &quot;2021-12-07T19:00:00.000Z&quot;,
  &quot;total_count&quot;: 1,
  &quot;availability&quot;: 1,
  &quot;windows&quot;: {
    &quot;service&quot;: {
      &quot;name&quot;: &quot;InstallService&quot;
    }
  }
},
</code></pre>
<ol start="15">
<li>Now we only paste the bucket script into the transform creation UI after selecting Edit JSON. It looks like this:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-pivot-configuration-object.png" alt="transform configuration pivot configuration object" /></p>
<ol start="16">
<li>Give your transform a name, set the destination index, and run it continuously. When selecting this, please also make sure not to use @timestamp. Instead, opt for event.ingested. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-checkpoints.html">Our documentation explains this in detail</a>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-details.png" alt="transform details" /></p>
<ol start="17">
<li>Click Next, then Create and start. This can take a bit, so don’t worry.</li>
</ol>
<p>To summarize, we have now created a pivot transform that uses a bucket script aggregation to calculate the percentage of time a service is running. One caveat: by default, Elastic Agent collects the service state only every 60 seconds. A service can be up at the exact moment of collection and down a few seconds later. If the metric is that important and no other monitoring options, such as <a href="https://www.elastic.co/blog/what-can-elastic-synthetics-tell-us-about-kibana-dashboards">Elastic Synthetics</a>, are possible, you might want to reduce the collection interval on the Agent side to retrieve the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. A super important server might collect the service state every 10 seconds because you need maximum granularity and confidence in the correctness of the metric, while for normal workstations where you just want to know whether your remote access solution is up the majority of the time, a single metric every 60 seconds is fine.</p>
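<p>To see why the collection interval matters, consider how far a single down sample moves the availability number within one bucket. A rough sketch (assuming 1-hour buckets and evenly spaced samples):</p>
<pre><code class="language-python"># Worst-case availability impact of one down sample in a 1-hour bucket,
# for different Elastic Agent collection intervals (illustrative only).

BUCKET_SECONDS = 3600

for interval_seconds in (60, 30, 10):
    samples_per_bucket = BUCKET_SECONDS // interval_seconds
    impact_percent = 100 / samples_per_bucket
    # e.g. at a 60-second interval, one down sample costs about 1.67%
    print(interval_seconds, samples_per_bucket, round(impact_percent, 2))
</code></pre>
<p>Shorter intervals shrink the swing each sample causes, at the cost of more collected data.</p>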
<p>After you have created the transform, an additional benefit is that the data is stored in an index in Elasticsearch. With the visualization alone, the metric is calculated for that visualization only and is not available anywhere else. Since the result is now indexed data, you can create a threshold alert that notifies your favorite connector (Slack, Teams, ServiceNow, email, and so <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">many more to choose from</a>).</p>
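<p>As a sketch of what such an alert could key on, a simple range query against the destination index surfaces any bucket below the target (here we assume the destination index is named windows-service, matching the data view mentioned below):</p>
<pre><code class="language-bash">GET windows-service/_search
{
  &quot;query&quot;: {
    &quot;range&quot;: {
      &quot;availability&quot;: {
        &quot;lt&quot;: 0.98
      }
    }
  }
}
</code></pre>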
<h2>Visualizing the transformed data</h2>
<p>The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This tells Lens that the field should automatically be formatted as a percentage, so you don’t need to select the format manually or do any calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn’t that cool? The same is possible for durations, like event.duration stored as nanoseconds: no more calculating on the fly and wondering whether you need to divide by 1,000 or 1,000,000.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-edit-field-availability.png" alt="edit field availability" /></p>
<p>We get this view by using a simple Lens visualization with @timestamp on the horizontal axis (minimum interval of 1 day) and the average of availability on the vertical axis. Don’t worry — the other data will be populated once the transform finishes. We add a reference line at 0.98 because our target is 98% uptime for the service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-line.png" alt="line" /></p>
<h2>Summary</h2>
<p>This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. This calculation method opens the door to many interesting use cases: change the bucket script and start calculating the number of sales or the average basket size. Interested in learning more about Elastic Synthetics? Read <a href="https://www.elastic.co/guide/en/observability/current/monitor-uptime-synthetics.html">our documentation</a> or check out our free <a href="https://www.elastic.co/training/synthetics-quick-start">Synthetic Monitoring Quick Start training</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/illustration-analytics-report-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[A train ride away from a million events per second with EDOT Cloud Forwarder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/one-million-events-per-second-with-edot-cloud-forwarder</link>
            <guid isPermaLink="false">one-million-events-per-second-with-edot-cloud-forwarder</guid>
            <pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[EDOT Cloud Forwarder for AWS from Elastic Observability is now Generally Available. Deploy EDOT Cloud Forwarder and reliably handle one million events per second with zero intervention, zero data loss, and zero idle cost.]]></description>
            <content:encoded><![CDATA[<p>Infrastructure observability is critical for maintaining uptime, optimizing cloud environments, and securing the cloud perimeter. Cloud environments generate observability data at massive scale. VPC Flow Logs, ELB Access Logs, CloudTrail and CloudWatch logs can easily reach hundreds of thousands of events per second. Dealing with scale like this is a complex problem all by itself.</p>
<p>Today, we introduce <strong>EDOT Cloud Forwarder</strong>. Built on the OTel Collector, it is the simplest, fastest, and possibly most boring way to connect your cloud environment to Elastic Observability, and it is <strong>now Generally Available on AWS</strong>. With EDOT Cloud Forwarder you can get started in seconds, gain observability across your entire cloud estate, and easily handle telemetry at any volume.</p>
<h2>Deploying Cloud Forwarder from a District Line train</h2>
<p>So, I got to work deploying EDOT Cloud Forwarder in my AWS account. I was doing it on the commute, using nothing more than a decent 4G signal and hoping that my 27% battery would be enough.</p>
<p>I hit deploy on the Terraform template and waited. Soon I started seeing events flowing into Elastic Observability.</p>
<p>As the train pulled into Putney Bridge, the flow of logs peaked, and one million events per second scrolled across my screen.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/putney_bridge_1MEPS.png" alt="Putney_Bridge" /></p>
<p>Once deployed, I got the best three zeros I could expect:</p>
<ul>
<li><strong>Zero Intervention:</strong> I watched as traffic ramped up to a full 1M EPS. Lambda functions automatically scaled out from a few instances to the 60-65 concurrent executions needed. There were <strong>zero manual adjustments required</strong>. The scaling was instant and hands-free.</li>
<li><strong>Zero Data Loss:</strong> It achieved a consistent processing rate, with every single event indexed in Elasticsearch.</li>
<li><strong>Zero Idle Cost:</strong> When there are no events, Cloud Forwarder scales to zero - it has no fixed infrastructure cost. You only pay for the moment data is being processed, not for permanently over-provisioned servers sitting idle.</li>
</ul>
<p>Right before the train came to a standstill, I looked at the total cost for running Cloud Forwarder for the two minutes between Parsons Green and Putney Bridge - we forwarded about 120GB of telemetry and the total cost was below £0.10. Well, unless you count the £2.50 train ticket!</p>
<h2>Making Observability easy at any scale</h2>
<p>Getting started observing your infrastructure is hard, and once it is observable, deriving actionable value from it requires sifting and winnowing through massive volumes of telemetry data, sometimes millions of events per second.</p>
<p>The new EDOT Cloud Forwarder for AWS (also available in Preview for GCP and Azure) is easy to deploy, with just a Terraform template. To make sure that it was easy to get started with, we designed it to be as close to a <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws#quick-deployment-direct-link">single-click deployment</a> as possible:</p>
<p>Just click the link below to launch the CloudFormation stack in your AWS account:</p>
<p><a href="https://console.aws.amazon.com/cloudformation/home?%23/stacks/new?templateURL=https%3A%2F%2Fedot-cloud-forwarder.s3.amazonaws.com%2Fv1%2Flatest%2Fcloudformation%2Fs3_logs-cloudformation.yaml"><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/cloudformation-launch-stack.png" alt="Launch_stack" /></a></p>
<p>The best part? The fastest way to get started with Elastic Observability scales to any size workload! With EDOT Cloud Forwarder you have one solution which automatically scales down to zero and up to millions of events per second.</p>
<h2>So, what is EDOT Cloud Forwarder?</h2>
<p>EDOT Cloud Forwarder is a serverless OpenTelemetry Collector that, on AWS, runs as a Lambda function. In AWS, it is triggered by events and processes logs and metrics from services such as VPC Flow Logs, ELB Access Logs, CloudTrail, CloudWatch Logs and CloudWatch Metrics.</p>
<p>It has the following core capabilities:</p>
<ul>
<li>Collects observability and security data from Cloud Service Providers</li>
<li>Parses data into native OpenTelemetry format</li>
<li>Forwards data over OTLP</li>
<li>Scales up and down based on traffic</li>
</ul>
<p>ECF for AWS is a pure serverless solution: no VMs, containers, or Kubernetes control planes to manage.</p>
<h2>Off the train, a more controlled scenario</h2>
<p>For a more controlled testing scenario, we used synthetically generated VPC Flow Log data to show how easily EDOT Cloud Forwarder can sustain a million events per second, reliably and without data loss.</p>
<p>For the configuration, we left all EDOT Cloud Forwarder settings at their <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws#optional-settings">defaults</a>. In AWS, the Lambda max concurrency defaults to 5, which could be expected to handle up to around 50k EPS. For our scenario, we bumped this to 100 to ensure we'd have plenty of headroom for our test.</p>
<p>We ran the scenario in 10 minute stages, with each stage resulting in a larger data volume. We held ingest flat during the stage to provide short-term steady state windows for us to grab metrics.</p>
<p>We experienced no errors across all stages of the scenario, no retries outside expected behavior, and no data loss across 5.4 billion ingested events.</p>
<h2>Stats for Observability Nerds</h2>
<p>Because we know you love them:</p>
<h3>Incremental Load Stages</h3>
<p>We tested EDOT Cloud Forwarder using incremental load stages, gradually increasing traffic from approximately 300,000 events per second to over 1 million events per second.</p>
<p>The graph below shows the Elasticsearch ingestion rate throughout the entire test duration. You can see the clear progression as we ramped up through six distinct stages, with each plateau representing a 10 minute stabilization period.</p>
<p>The system handled each traffic increase smoothly, culminating in sustained 1 million documents per second ingestion with no bottlenecks or data loss.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/es-ingestion-rate.png" alt="ES Ingestion Rate" /></p>
<h3>Lambda</h3>
<p>As seen in CloudWatch metrics (no manual adjustments required), no errors and no throttles occurred during the full duration of the test.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-errors-throttles.png" alt="Lambda errors and throttles" /></p>
<h4>Concurrent executions</h4>
<p>60 to 65 instances running at the same time.
<img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-instances.png" alt="Lambda concurrent executions" /></p>
<h4>Average execution time per Lambda</h4>
<p>Each execution is taking about 5 seconds.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-execution-time.png" alt="Lambda average execution time" /></p>
<h4>Memory Usage</h4>
<p>Memory use stabilized at around 450 MB, below the default limit of 512 MB.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/lambda-memory-used.png" alt="Lambda memory usage" /></p>
<h3>Elasticsearch Indexing</h3>
<p>Elasticsearch indexed one million documents per second, with events visible in Discover within seconds and no indexing delays or bottlenecks.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/es-1m-eps.png" alt="1M EPS" /></p>
<h2>Efficient by design: Lambda at 1M EPS from S3</h2>
<p>Running ECF for AWS at 1M EPS costs about <strong>$3.87 per hour</strong>. Around 66 percent ($2.57 per hour) is data transfer (same region), 34 percent ($1.32 per hour) is Lambda compute, and less than 1 percent is S3 requests.</p>
<p>This is fully serverless with no idle cost. You only pay while events are forwarded. With S3, data arrives pre-batched in large objects, which keeps Lambda invocations low and compute costs tightly controlled. At sustained throughput, Lambda costs are comparable to EKS, but without cluster management or idle capacity.</p>
<h3>Other options: OTel Collector on EKS at 1M events per second</h3>
<p>An OTel Collector on EKS sized for 1M EPS has a baseline cost of about <strong>$3.69 per hour</strong>. Roughly $0.33 per hour of this is compute related: EC2 nodes, EKS control plane, and EBS. The rest comes from data transfer and SQS, which scale with traffic and do not change with utilization.</p>
<h4>Idle compute impact on EKS real costs</h4>
<p>Considering EKS is typically provisioned for peak load, the real cost of the compute portion is affected by idle capacity. At <strong>100 percent utilization</strong>, total cost is <strong>$3.69 per hour</strong>. At <strong>50 percent utilization</strong>, a common baseline to absorb burstiness, total cost rises to about <strong>$4.02 per hour</strong>. At <strong>30 percent utilization</strong>, it increases further to about <strong>$4.46 per hour</strong>.</p>
<h4>Pay for Work vs Pay for Capacity</h4>
<p>ECF for AWS delivers 1M EPS at a cost comparable to EKS at peak utilization, with no idle compute or capacity planning required. EKS can reach the same peak throughput, but total cost increases further as average utilization drops because compute capacity must be provisioned in advance.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/otel-collector-lambda.png" alt="Collector-lambda" /></p>
<h2>Conclusion</h2>
<p>The boring truth: to the EDOT Cloud Forwarder, a million events per second is no different from any other workload.</p>
<p>With no infrastructure to deploy, no idle cost, and no manual scaling, I guess the best thing to do is to stop overthinking and start forwarding! It's like stepping onto the fastest train on the line: you just get on and you're instantly en route to your destination, effortlessly handling any distance, or in this case, any volume.</p>
<p>So, we're shipping it. ECF for AWS is now Generally Available.</p>
<h2>Get Started</h2>
<ol>
<li>Deploy EDOT Cloud Forwarder via CloudFormation (below) or using the AWS Serverless Application Repository</li>
</ol>
<p><a href="https://console.aws.amazon.com/cloudformation/home?%23/stacks/new?templateURL=https%3A%2F%2Fedot-cloud-forwarder.s3.amazonaws.com%2Fv1%2Flatest%2Fcloudformation%2Fs3_logs-cloudformation.yaml"><img src="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/cloudformation-launch-stack.png" alt="Launch_stack" /></a></p>
<ol start="2">
<li>Create an Observability project using an <a href="https://cloud.elastic.co/login?redirectTo=%2Fhome">Elastic Cloud</a> free trial or deploy locally with start-local if you don't already have an Elastic project or deployment.</li>
</ol>
<p>Visit EDOT Cloud Forwarder for AWS <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-cloud-forwarder/aws">Documentation</a> for more details.</p>
<p>Check out these other resources on OpenTelemetry at Elastic:</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Discover how Elastic is evolving data ingestion with OpenTelemetry</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-sdk-central-configuration-opamp">Learn how OpAMP enables centralized configuration of OpenTelemetry SDKs</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/one-million-events-per-second-with-edot-cloud-forwarder/ecf-for-aws.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing, logs, and metrics for a RAG based Chatbot with Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry</link>
            <guid isPermaLink="false">openai-tracing-elastic-opentelemetry</guid>
            <pubDate>Fri, 24 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe an OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes and Docker.]]></description>
            <content:encoded><![CDATA[<p>As discussed in the following post, <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Elastic added instrumentation for OpenAI based applications in EDOT</a>. The application most commonly using LLMs is the chatbot. These chatbots not only use large language models (LLMs), but also use frameworks such as LangChain, plus search to improve contextual information during a conversation (Retrieval Augmented Generation, or RAG). Elastic's sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to properly pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch.</p>
<p>This app is also now instrumented with EDOT, and you can visualize the Chatbot's traces to OpenAI, as well as relevant logs and metrics from the application. By running the app with Docker as instructed in the GitHub repo, you can see these traces on a local stack. But how about running it against Serverless, Elastic Cloud, or even on Kubernetes?</p>
<p>In this blog, we will walk through how to set up Elastic's RAG based Chatbot application with Elastic Cloud and Kubernetes.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account — sign up now, and become familiar with Elastic's OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use version 8.17 or later.</p>
</li>
<li>
<p>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with the application and how to bring it up using Docker.</p>
</li>
<li>
<p>An account on OpenAI with API keys</p>
</li>
<li>
<p>Kubernetes cluster to run the RAG based Chatbot app</p>
</li>
<li>
<p>The instructions in this blog can also be found in <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples</a> on GitHub.</p>
</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the Chatbot app. Once it's up, you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will see a response based on the index that was created in Elasticsearch when the app initialized. Additionally, queries will be made to the LLM.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the application running on your K8s cluster or with Docker, and Elastic Cloud up and running, you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and can analyze the application logs and any specific log patterns, which saves you time in analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs-patterns.png" alt="Chatbot-log-patterns" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to Elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also drill into individual traces and look at the logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-trace.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs and traces, any instrumented metrics will also be ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up with Docker</h1>
<p>In order to properly set up the Chatbot-app on Docker with telemetry sent over to Elastic, a few things must be set up:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app</p>
</li>
<li>
<p>Modify the env file as noted in the GitHub README, with the following exception:</p>
</li>
</ol>
<p>Use your Elastic Cloud's <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> values instead.</p>
<p>You can find these in Elastic Cloud under <code>integrations-&gt;APM</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/otel-credentials.png" alt="OTel credentials" /></p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>Notice the <code>%20</code> in the headers. This is needed to account for the space in the credentials.</p>
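<p>For example, you can build the encoded header value from a raw token in the shell (the token here is a placeholder; the real value comes from your Elastic Cloud APM settings):</p>

```bash
# Hypothetical raw token value.
RAW_TOKEN="xxxxx"
# The OTLP header value is "Bearer <token>"; the space is percent-encoded as %20
# so it survives being read from an env file.
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20${RAW_TOKEN}"
echo "$OTEL_EXPORTER_OTLP_HEADERS"
# prints Authorization=Bearer%20xxxxx
```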
<ol start="3">
<li>
<p>Set the following to false - <code>OTEL_SDK_DISABLED=false</code></p>
</li>
<li>
<p>Set the envs for LLMs</p>
</li>
</ol>
<p>In this example we're using OpenAI, hence only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<ol start="5">
<li>Run the docker container as noted</li>
</ol>
<pre><code class="language-bash">docker compose up --build --force-recreate
</code></pre>
<ol start="6">
<li>
<p>Play with the app at <code>localhost:4000</code></p>
</li>
<li>
<p>Then log into Elastic Cloud and see the output as shown previously.</p>
</li>
</ol>
<h1>Run chatbot-rag-app on Kubernetes</h1>
<p>To set this up, you can follow the observability-examples repo, which has the Kubernetes yaml files being used. These also point to Elastic Cloud.</p>
<ol>
<li>
<p>Set up the Kubernetes Cluster (we're using EKS)</p>
</li>
<li>
<p>Get the appropriate ENV variables:</p>
</li>
</ol>
<ul>
<li>
<p>Find the <code>OTEL_EXPORTER_OTLP_ENDPOINT/HEADERS</code> variables as noted in the previous section for Docker.</p>
</li>
<li>
<p>Get your OpenAI Key</p>
</li>
<li>
<p>Your Elasticsearch URL, username, and password.</p>
</li>
</ul>
<ol start="3">
<li>Follow the instructions in the <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples GitHub repo</a> to run two Kubernetes yaml files.</li>
</ol>
<p>Essentially you need only replace the secret variables in k8s-deployment.yaml, and run</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
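<p>One way to replace those secret values without hand-editing is a quick sed substitution pass. This sketch operates on a two-line stand-in for the manifest; in practice the input is your copy of k8s-deployment.yaml, and the replacement endpoint and key below are placeholders:</p>

```bash
# Stand-in for k8s-deployment.yaml; only two of the secret stringData lines shown.
cat > deployment-sample.yaml <<'EOF'
  ELASTICSEARCH_URL: "https://yourelasticcloud.es.us-west-2.aws.found.io"
  OPENAI_API_KEY: "YYYYYYYY"
EOF
# Substitute the placeholders (the replacement values here are fake).
sed -i.bak \
  -e 's|yourelasticcloud.es.us-west-2.aws.found.io|my-deployment.es.us-west-2.aws.found.io|' \
  -e 's|YYYYYYYY|sk-replace-me|' \
  deployment-sample.yaml
cat deployment-sample.yaml
```

<p>After substituting into the real file, apply it with <code>kubectl create -f k8s-deployment.yaml</code> as shown above.</p>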
<p>The app needs to be running first, then we use the app to initialize Elasticsearch with indices for the app.</p>
<p><strong><em>init-index-job.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4
</code></pre>
<p><strong><em>k8s-deployment.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: chatbot-regular-secrets
type: Opaque
stringData:
  ELASTICSEARCH_URL: &quot;https://yourelasticcloud.es.us-west-2.aws.found.io&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://12345.apm.us-west-2.aws.cloud.es.io:443&quot;
  OPENAI_API_KEY: &quot;YYYYYYYY&quot;

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-regular
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatbot-regular
  template:
    metadata:
      labels:
        app: chatbot-regular
    spec:
      containers:
      - name: chatbot-regular
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=chatbot-regular,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
          value: &quot;true&quot;
        - name: OTEL_EXPERIMENTAL_RESOURCE_DETECTORS
          value: &quot;process_runtime,os,otel,telemetry_distro&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        - name: OTEL_METRIC_EXPORT_INTERVAL
          value: &quot;3000&quot;
        - name: OTEL_BSP_SCHEDULE_DELAY
          value: &quot;3000&quot;
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: chatbot-regular-service
spec:
  selector:
    app: chatbot-regular
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer
</code></pre>
<p><strong>Open App with LoadBalancer URL</strong></p>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP                                                               PORT(S)                                                                     AGE
chatbot-regular-service            LoadBalancer   10.100.130.44    xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP                                                                6d23h
</code></pre>
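<p>The EXTERNAL-IP column holds the app URL. As a small sketch, you can pull it out of that output with awk (the sample line below mirrors the output above):</p>

```bash
# The fourth whitespace-separated field of the service line is the load
# balancer hostname.
line='chatbot-regular-service            LoadBalancer   10.100.130.44    xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h'
echo "$line" | awk '{print $4}'
# prints xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com
```

<p>Against a live cluster, <code>kubectl get service chatbot-regular-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'</code> returns the same value directly.</p>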
<ol start="4">
<li>
<p>Play with app and review telemetry in Elastic</p>
</li>
<li>
<p>Once you go to the URL, you should see all the screens we described earlier in this blog.</p>
</li>
</ol>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven, RAG-based chat application. However, you still need to understand how well it performs, whether it's working properly, etc. Using OTel and Elastic’s EDOT gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.
Here are the other tracing blogs:</p>
<p>App Observability with LLMs (Tracing):</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/edot-openai-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing a RAG based Chatbot with Elastic Distributions of OpenTelemetry and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-langtrace-elastic</link>
            <guid isPermaLink="false">openai-tracing-langtrace-elastic</guid>
            <pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe an OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes with Langtrace.]]></description>
            <content:encoded><![CDATA[<p>Most AI-driven applications currently focus on increasing the value that an end user, such as an SRE, gets from AI. The main use case is the creation of various chatbots. These chatbots not only use large language models (LLMs), but also use frameworks such as LangChain, plus search to improve contextual information during a conversation (Retrieval Augmented Generation). Elastic’s sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to properly pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch. However, what about monitoring the application?</p>
<p>Elastic provides the ability to ingest OpenTelemetry data with native OTel SDKs, the off-the-shelf OTel Collector, or Elastic’s Distributions of OpenTelemetry (EDOT). EDOT enables you to bring in logs, metrics, and traces for your GenAI application and for K8s. However, you will also generally need libraries to help trace specific components of your application. For tracing GenAI applications, you can pick from a large set of libraries:</p>
<ul>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2">OpenTelemetry OpenAI Instrumentation-v2</a> - allows tracing LLM requests and logging of messages made by the OpenAI Python API library. (Note: the v2 library is built by OpenTelemetry; the non-v2 version is from a specific vendor, not OpenTelemetry.)</p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-vertexai">OpenTelemetry VertexAI Instrumentation</a> - allows tracing LLM requests and logging of messages made by the VertexAI Python API library</p>
</li>
<li>
<p><a href="https://docs.langtrace.ai/introduction">Langtrace</a> - a commercially available library that supports multiple LLMs in one library, with all traces being OTel native.</p>
</li>
<li>
<p>Elastic’s EDOT - which recently added tracing. See <a href="https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry">blog</a>.</p>
</li>
</ul>
<p>As you can see, OpenTelemetry is becoming the de facto mechanism for collecting and ingesting this telemetry. OpenTelemetry is growing its support here, but it is still early days.</p>
<p>In this blog, we will walk through how to, with minimal code, observe a RAG based chatbot application with tracing using Langtrace. We previously covered Langtrace in a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">blog</a> to highlight tracing Langchain.</p>
<p>In this blog we use Langtrace, which supports OpenAI, Amazon Bedrock, Cohere, and others in one library.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>An Elastic Cloud account — sign up now, and become familiar with Elastic’s OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use version 8.17 or later.</li>
</ul>
<ul>
<li>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with how to bring it up.</li>
</ul>
<ul>
<li>An account on your favorite LLM (OpenAI, Azure OpenAI, etc.), with API keys</li>
</ul>
<ul>
<li>Be familiar with EDOT to understand how we bring in logs, metrics, and traces from the application through the OTel Collector</li>
</ul>
<ul>
<li>Kubernetes cluster - I’ll be using Amazon EKS</li>
</ul>
<ul>
<li>Review the <a href="https://docs.langtrace.ai/introduction">Langtrace</a> documentation as well.</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the Chatbot app. Once it's up, you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will see a response based on the index that was created in Elasticsearch when the app initialized. Additionally, queries will be made to the LLM.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the OTel Collector with the EDOT configuration on your K8s cluster, and Elastic Cloud up and running, you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app, and can analyze the application logs, spot specific log patterns (which saves you time in analysis), and view logs from K8s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-log-patterns.png" alt="Chatbot-log-patterns" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs-detailed.png" alt="Chatbot-log-details" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to Elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also drill into individual traces and look at the logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-service-traces.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs and traces, any instrumented metrics will also be ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up</h1>
<p>In order to properly set up the Chatbot-app on K8s with telemetry sent over to Elastic, a few things must be set up:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app, and modify one of the python files.</p>
</li>
<li>
<p>Next, create a Docker container that can be used in Kubernetes. The Docker build <a href="https://github.com/elastic/elasticsearch-labs/blob/main/example-apps/chatbot-rag-app/Dockerfile">here</a> in the Chatbot-app is good to use.</p>
</li>
<li>
<p>Collect all the needed env variables. In this example we are using OpenAI, but the files can be modified for any of the LLMs; you will have to get a few environment variables loaded into the cluster. In the GitHub repo there is an env.example for Docker. You can pick and choose what is or isn't needed and adjust appropriately in the K8s file below.</p>
</li>
<li>
<p>Set up your K8s cluster, and then install the OpenTelemetry Collector with the appropriate yaml file and credentials. This will also help collect K8s cluster logs and metrics.</p>
</li>
<li>
<p>Utilize the two yaml files listed below to ensure you can run it on Kubernetes.</p>
</li>
</ol>
<ul>
<li>
<p>init-index-job.yaml - initializes the index in Elasticsearch with the local corporate information</p>
</li>
<li>
<p>k8s-deployment-chatbot-rag-app.yaml - initializes the application frontend and backend.</p>
</li>
</ul>
<ol start="6">
<li>
<p>Open the app on the load balancer URL against the chatbot-app service in K8s</p>
</li>
<li>
<p>Go to Elasticsearch and look at Discover for the logs, go to APM and review the traces for your chatbot-app, and finally check the metrics.</p>
</li>
</ol>
<h2>Modify the code for tracing with Langtrace</h2>
<p>Curl the app and untar it, then go to the chatbot-rag-app directory:</p>
<pre><code class="language-bash">curl https://codeload.github.com/elastic/elasticsearch-labs/tar.gz/main | 
tar -xz --strip=2 elasticsearch-labs-main/example-apps/chatbot-rag-app
cd elasticsearch-labs-main/example-apps/chatbot-rag-app
</code></pre>
<p>Next open the <code>app.py</code> file in the <code>api</code> directory and add the following</p>
<pre><code class="language-python">from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

FlaskInstrumentor().instrument_app(app)
</code></pre>
<p>into the code:</p>
<pre><code class="language-python">import os
import sys
from uuid import uuid4

from chat import ask_question
from flask import Flask, Response, jsonify, request
from flask_cors import CORS

from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

app = Flask(__name__, static_folder=&quot;../frontend/build&quot;, static_url_path=&quot;/&quot;)
CORS(app)

FlaskInstrumentor().instrument_app(app)

@app.route(&quot;/&quot;)
</code></pre>
<p>The added lines pull in the Langtrace library and the OpenTelemetry Flask instrumentation. This combination will provide an end-to-end trace from the HTTP call all the way down to the calls to Elasticsearch and to OpenAI (or other LLMs).</p>
<h2>Create the docker container</h2>
<p>Use the Dockerfile that is in the chatbot-rag-app directory, adding the following line:</p>
<p><code>RUN pip3 install --no-cache-dir langtrace-python-sdk</code></p>
<p>into the Dockerfile:</p>
<pre><code class="language-bash">COPY requirements.txt ./requirements.txt
RUN pip3 install -r ./requirements.txt
RUN pip3 install --no-cache-dir langtrace-python-sdk
COPY api ./api
COPY data ./data

EXPOSE 4000
</code></pre>
<p>This installs the <code>langtrace-python-sdk</code> into the Docker container so the Langtrace libraries can be used properly.</p>
<h2>Collecting the proper env variables:</h2>
<p>First collect the env variables from Elastic:</p>
<p>Envs for index initialization in Elastic:</p>
<pre><code class="language-bash">
ELASTICSEARCH_URL=https://aws.us-west-2.aws.found.io
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=elastic

# The name of the Elasticsearch indexes
ES_INDEX=workplace-app-docs
ES_INDEX_CHAT_HISTORY=workplace-app-docs-chat-history

</code></pre>
<p>The <code>ELASTICSEARCH_URL</code> can be found in cloud.elastic.co when you bring up your instance.
You will need to set up the user and password in Elastic.</p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer xxxxx&quot;
</code></pre>
<p>These credentials are found in Elastic under the APM integration, under OpenTelemetry.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/otel-credentials.png" alt="OTel credentials" /></p>
<p>Envs for LLMs</p>
<p>In this example we’re using OpenAI, hence only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<p>All of these variables will be needed in the Kubernetes yamls in the next step.</p>
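<p>As an alternative sketch, these values can also be loaded into a Kubernetes Secret straight from an env file with <code>kubectl create secret generic --from-env-file</code>, avoiding pasting them into YAML; the filename and values below are placeholders:</p>

```bash
# Placeholder env file; real values come from the steps above.
cat > chatbot.env <<'EOF'
LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
EOF
# Against a live cluster you would run:
#   kubectl create secret generic genai-chatbot-langtrace-secrets --from-env-file=chatbot.env
# Here we just list the key=value pairs that command would load:
grep -v '^#' chatbot.env
```

<p>If you go this route, skip the Secret stanza in the manifest below and keep only the Deployment and Service.</p>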
<h2>Setup K8s cluster and load up OTel Collector with EDOT</h2>
<p>This step is outlined in the following <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a>. It’s a simple three-step process.</p>
<p>This step will bring in all the K8s cluster logs and metrics and set up the OTel Collector.</p>
<h2>Setup secrets, initialize indices, and start the app</h2>
<p>Now that the cluster is up and you have your environment variables, you will need to</p>
<ol>
<li>
<p>Install and run the <code>k8s-deployment.yaml</code> with the variables</p>
</li>
<li>
<p>Initialize the index</p>
</li>
</ol>
<p>Essentially run the following:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>Here are the two yamls you should use. They can also be found <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">here</a>.</p>
<p>k8s-deployment.yaml</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: genai-chatbot-langtrace-secrets
type: Opaque
stringData:
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://1234567.apm.us-west-2.aws.cloud.es.io:443&quot;
  ELASTICSEARCH_URL: &quot;YOUR_ELASTIC_SEARCH_URL&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OPENAI_API_KEY: &quot;XXXXXXX&quot;  

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-chatbot-langtrace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: genai-chatbot-langtrace
  template:
    metadata:
      labels:
        app: genai-chatbot-langtrace
    spec:
      containers:
      - name: genai-chatbot-langtrace
        image: 65765.amazonaws.com/genai-chatbot-langtrace2:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=genai-chatbot-langtrace,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: genai-chatbot-langtrace-service
spec:
  selector:
    app: genai-chatbot-langtrace
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer

</code></pre>
<p>init-index-job.yaml</p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
#update your image location for chatbot rag app
        image: your-image-location:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4

</code></pre>
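<p>If you apply this job manually, you can confirm that the index initialization completed before opening the app. As a rough sketch (the file name is an assumption based on the manifest above):</p>
<pre><code class="language-bash"># Create the one-off index initialization job
kubectl apply -f init-index-job.yaml

# Wait for it to finish (give up after 2 minutes)
kubectl wait --for=condition=complete job/init-elasticsearch-index-test --timeout=120s

# Inspect the job output if anything looks off
kubectl logs job/init-elasticsearch-index-test
</code></pre>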
<h2>Open App with LoadBalancer URL</h2>
<p>Run the kubectl get services command and note the external URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                        TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
chatbot-langtrace-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h

</code></pre>
<p>Play with the app and review the telemetry in Elastic.</p>
<p>Once you go to the URL, you should see all the screens we described at the beginning of this blog.</p>
<h1>Conclusion</h1>
<p>With Elastic's chatbot-rag-app, you have an example of how to build an OpenAI-driven, RAG-based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel, Elastic’s EDOT, and Langtrace gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully, this blog provides an outline of how to achieve that.</p>
<p>Here are other related blogs:</p>
<p>App Observability with LLMs (Tracing):</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/edot-openai-tracing.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator</link>
            <guid isPermaLink="false">openshift-container-logs-red-hat-logging-operator</guid>
            <pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to optimize OpenShift logs collected with Red Hat OpenShift Logging Operator, as well as format and route them efficiently in Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a>) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.</p>
<h2>Why use OpenShift Logging Operator?</h2>
<p>A lot of enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:</p>
<ul>
<li>
<p>It is developed and supported by Red Hat</p>
</li>
<li>
<p>It can automatically update the OpenShift cluster along with the operating system to make sure they remain compatible</p>
</li>
<li>
<p>It can speed up development life cycles with features like source-to-image</p>
</li>
<li>
<p>It provides enhanced security</p>
</li>
</ul>
<p>In our consulting experience, this latter aspect poses challenges and friction with OpenShift administrators when we try to install Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the host's file system to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html#_red_hat_openshift_configuration">official Elasticsearch® Documentation</a>). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.</p>
<h2>Which logs are we going to collect?</h2>
<p>In OpenShift Container Platform, we distinguish <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging.html#logging-architecture-overview_cluster-logging">three broad categories of logs</a>: audit, application, and infrastructure logs:</p>
<ul>
<li>
<p><strong>Audit logs</strong> describe the list of activities that affected the system by users, administrators, and other components.</p>
</li>
<li>
<p><strong>Application logs</strong> are composed of the container logs of the pods running in non-reserved namespaces.</p>
</li>
<li>
<p><strong>Infrastructure logs</strong> are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.</p>
</li>
</ul>
<p>For the sake of simplicity, we will consider only audit and application logs. We will describe how to format both in the format expected by the Kubernetes integration so you can get the most out of Elastic Observability.</p>
<h2>Getting started</h2>
<p>To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.</p>
<h3>Inside Elasticsearch</h3>
<p>We first <a href="https://www.elastic.co/guide/en/fleet/8.11/install-uninstall-integration-assets.html#install-integration-assets">install the Kubernetes integration assets</a>. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.</p>
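<p>Before wiring up the pipelines, you can sanity-check that the integration assets are in place by querying the index templates and ingest pipelines (the integration pipeline names are versioned, so a wildcard is used here):</p>
<pre><code class="language-bash">GET _index_template/logs-kubernetes.container_logs
GET _index_template/logs-kubernetes.audit_logs

# Integration pipeline names include the version, e.g. logs-kubernetes.container_logs-X.Y.Z
GET _ingest/pipeline/logs-kubernetes.container_logs*
</code></pre>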
<p>To format the logs received from the ClusterLogForwarder in <a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging-exported-fields.html">Exported fields | Logging | OpenShift Container Platform 4.14</a>. To get a list of exported fields of the Kubernetes integration, you can refer to <a href="https://www.elastic.co/guide/en/beats/filebeat/current/exported-fields-kubernetes-processor.html">Kubernetes fields | Filebeat Reference [8.11] | Elastic</a> and <a href="https://www.elastic.co/guide/en/observability/current/logs-app-fields.html">Logs app fields | Elastic Observability [8.11]</a>. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_ip&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.ip&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace_uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_id&quot;,
        &quot;target_field&quot;: &quot;container.id&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;container.id&quot;,
        &quot;pattern&quot;: &quot;%{container.runtime}://%{container.id}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_image&quot;,
        &quot;target_field&quot;: &quot;container.image.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.container.image&quot;,
        &quot;copy_from&quot;: &quot;container.image.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;kubernetes.container_name&quot;,
        &quot;field&quot;: &quot;container.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.container.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;level&quot;,
        &quot;target_field&quot;: &quot;log.level&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;file&quot;,
        &quot;target_field&quot;: &quot;log.file.path&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_owner&quot;,
        &quot;pattern&quot;: &quot;%{_tmp.parent_type}/%{_tmp.parent_name}&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;lowercase&quot;: {
        &quot;field&quot;: &quot;_tmp.parent_type&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod.{{_tmp.parent_type}}.name&quot;,
        &quot;value&quot;: &quot;{{_tmp.parent_name}}&quot;,
        &quot;if&quot;: &quot;ctx?._tmp?.parent_type != null&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_tmp&quot;,
          &quot;kubernetes.pod_owner&quot;
          ],
          &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes annotations&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes namespace_labels&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.namespace_labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith(&quot;app_kubernetes_io_component_&quot;)) {
            def sanitizedKey = k.replace(&quot;app_kubernetes_io_component_&quot;, &quot;app_kubernetes_io_component/&quot;);
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    }
    ]
}
</code></pre>
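<p>You can try the pipeline with the _simulate API before wiring it up. The sample document below is purely illustrative, with field names taken from the OpenShift exported fields:</p>
<pre><code class="language-bash">POST _ingest/pipeline/openshift-2-ecs/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;hostname&quot;: &quot;worker-0&quot;,
        &quot;level&quot;: &quot;info&quot;,
        &quot;kubernetes&quot;: {
          &quot;pod_name&quot;: &quot;my-app-5d9f&quot;,
          &quot;namespace_name&quot;: &quot;my-namespace&quot;,
          &quot;container_name&quot;: &quot;my-app&quot;,
          &quot;container_id&quot;: &quot;cri-o://0123abcd&quot;,
          &quot;pod_owner&quot;: &quot;ReplicaSet/my-app-5d9f&quot;
        }
      }
    }
  ]
}
</code></pre>
<p>The response should show the renamed ECS fields, for example kubernetes.pod.name, kubernetes.namespace, container.id (with container.runtime set to cri-o), and kubernetes.pod.replicaset.name.</p>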
<p>Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-audit-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot;
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 &amp;&amp; !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=[&quot;audit&quot;:audit];
        &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Move all top-level audit fields under the 'kubernetes.audit' object&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.kubernetes?.audit?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(&quot;.&quot;) &gt;= 0) {
              def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Normalize kubernetes audit annotations field as expected by the Integration&quot;
      }
    }
  ]
}
</code></pre>
<p>The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.</p>
<p>We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in <a href="https://www.elastic.co/guide/en/fleet/8.11/data-streams-pipeline-tutorial.html#data-streams-pipeline-one">Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11]</a>.</p>
<p>The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog <a href="https://www.elastic.co/blog/simplifying-log-data-management-flexible-routing-elastic">Simplifying log data management: Harness the power of flexible routing with Elastic</a> and our documentation <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">Reroute processor | Elasticsearch Guide [8.11] | Elastic</a>.</p>
<p>In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/app-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.container_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.container_logs-openshift&quot;
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-audit-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.audit_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.audit_logs-openshift&quot;
      }
    }
  ]
}
</code></pre>
<p>Note that because app-write and audit-write do not follow the data stream naming convention, we must set the destination field explicitly in the reroute processor. The reroute processor will also fill in the <a href="https://www.elastic.co/guide/en/ecs/8.11/ecs-data_stream.html">data_stream fields</a> for us. Elastic Agent performs this step automatically at the source.</p>
<p>Further, we create the app-write and audit-write indices with the default pipelines defined above so that incoming logs are rerouted according to our needs.</p>
<pre><code class="language-bash">PUT app-write
{
  &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;app-write-reroute-pipeline&quot;
   }
}


PUT audit-write
{
  &quot;settings&quot;: {
    &quot;index.default_pipeline&quot;: &quot;audit-write-reroute-pipeline&quot;
  }
}
</code></pre>
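<p>As a quick smoke test (assuming the integration assets and the pipelines above are installed), you can index a minimal document into app-write and check that it was rerouted to the data stream:</p>
<pre><code class="language-bash">POST app-write/_doc
{
  &quot;@timestamp&quot;: &quot;2024-01-16T10:00:00.000Z&quot;,
  &quot;message&quot;: &quot;reroute smoke test&quot;,
  &quot;hostname&quot;: &quot;worker-0&quot;
}

# The document should land in the rerouted data stream, not in app-write
GET logs-kubernetes.container_logs-openshift/_search?q=message:reroute
</code></pre>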
<p>Basically, what we did can be summarized in this picture:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog.png" alt="openshift-summary-blog" /></p>
<p>Take the container logs as an example. When the operator attempts to write to the app-write index, it invokes the default_pipeline “app-write-reroute-pipeline”, which formats the logs into ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which invokes, if it exists, the logs-kubernetes.container_logs@custom pipeline. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another dataset and namespace using the elastic.co/dataset and elastic.co/namespace annotations as described in the Kubernetes <a href="https://docs.elastic.co/integrations/kubernetes/container-logs#rerouting-based-on-pod-annotations">integration documentation</a>, which in turn can lead to the execution of another integration pipeline.</p>
<h3>Create a user for sending the logs</h3>
<p>We are going to use basic authentication because, at the time of writing, it is the only authentication method supported for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write and read the app-write and audit-write indices (required by the OpenShift collector) and grants auto_configure access to logs-*-* to allow custom Kubernetes rerouting:</p>
<pre><code class="language-bash">PUT _security/role/YOURROLE
{
    &quot;cluster&quot;: [
      &quot;monitor&quot;
    ],
    &quot;indices&quot;: [
      {
        &quot;names&quot;: [
          &quot;logs-*-*&quot;
        ],
        &quot;privileges&quot;: [
          &quot;auto_configure&quot;,
          &quot;create_doc&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      },
      {
        &quot;names&quot;: [
          &quot;app-write&quot;,
          &quot;audit-write&quot;
        ],
        &quot;privileges&quot;: [
          &quot;create_doc&quot;,
          &quot;read&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      }
    ],
    &quot;applications&quot;: [],
    &quot;run_as&quot;: [],
    &quot;metadata&quot;: {},
    &quot;transient_metadata&quot;: {
      &quot;enabled&quot;: true
    }

}



PUT _security/user/YOUR_USERNAME
{
  &quot;password&quot;: &quot;YOUR_PASSWORD&quot;,
  &quot;roles&quot;: [&quot;YOURROLE&quot;]
}
</code></pre>
<h3>On OpenShift</h3>
<p>On the OpenShift Cluster, we need to follow the <a href="https://docs.openshift.com/container-platform/4.14/logging/log_collection_forwarding/log-forwarding.html">official documentation</a> of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.</p>
<p>We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:</p>
<pre><code class="language-yaml">apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}
</code></pre>
<p>The Cluster Log Forwarder is the resource responsible for defining a daemon set that will forward the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials for the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
</code></pre>
<p>Finally, we create the ClusterLogForwarder resource:</p>
<pre><code class="language-yaml">kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: &quot;https://YOUR_ELASTICSEARCH_URL:443&quot;
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch
</code></pre>
<p>Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder would send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.</p>
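<p>After applying both resources, you can verify that the collector pods are running and shipping logs. A sketch (the manifest file name is an assumption, and collector pod labels may vary across logging versions):</p>
<pre><code class="language-bash">oc apply -f cluster-log-forwarder.yaml

# The operator deploys the collector as a daemon set in openshift-logging
oc get pods -n openshift-logging

# Tail the collector pods to check for connection errors to Elasticsearch
oc logs -n openshift-logging -l component=collector --tail=50
</code></pre>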
<h2>Result</h2>
<p>Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as host and cloud metadata, which do not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog-graphs.png" alt="openshift-summary-blog-graphs" /></p>
<p>In this post, we described how you can use the OpenShift Logging Operator to collect container logs and audit logs. We still recommend leveraging Elastic Agent to collect all your logs: it offers the best user experience, with no need to maintain or transform the logs to ECS format yourself. Additionally, Elastic Agent uses API keys as the authentication method and collects metadata like cloud information that allows you in the long run to do <a href="https://www.elastic.co/blog/optimize-cloud-resources-cost-apm-metadata-elastic-observability">more</a>.</p>
<p><a href="https://www.elastic.co/observability/log-monitoring">Learn more about log monitoring with the Elastic Stack</a>.</p>
<p><em>Have feedback on this blog?</em> <a href="https://github.com/herrBez/elastic-blog-openshift-logging/issues"><em>Share it here</em></a><em>.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/139687_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry for PHP: EDOT PHP joins the OpenTelemetry project]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot</link>
            <guid isPermaLink="false">opentelemetry-accepts-elastics-donation-of-edot</guid>
            <pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore Elastic’s donation of its EDOT PHP to the OpenTelemetry community and discover how it makes OpenTelemetry for PHP simpler and more accessible.]]></description>
            <content:encoded><![CDATA[<p>The OpenTelemetry community has officially accepted Elastic's proposal to contribute the <strong>Elastic Distribution of OpenTelemetry for PHP (EDOT PHP)</strong> — marking an important milestone in bringing first-class observability to one of the web's most widely used languages.</p>
<p>For decades, PHP has powered everything from small business websites to large-scale SaaS platforms. Yet observability in PHP has often required manual setup, compilers, custom extensions, or changes to application code — challenges that limited adoption in production environments.
This upcoming donation aims to change that, by making OpenTelemetry for PHP <strong>as easy to deploy as any other runtime</strong>.</p>
<h2>What's coming</h2>
<p>Once the contribution process is complete, EDOT PHP will become part of the OpenTelemetry project — providing a <strong>complete, production-ready distribution</strong> that's optimized for performance, simplicity, and scalability.</p>
<p>EDOT PHP introduces a new approach to PHP observability:</p>
<ul>
<li><strong>Simple installation</strong> - installing OpenTelemetry for PHP will be as straightforward as installing a standard system package. From that point, the agent automatically detects and instruments PHP applications — no code changes, no manual setup.</li>
<li><strong>Automatic agent loading</strong> - works transparently in cloud and container environments without modifying application deployments.</li>
<li><strong>Zero configuration</strong> - ships as a single, self-contained binary; no need to install or compile any external extensions.</li>
<li><strong>Native C++ performance</strong> - a built-in serializer written in C++ reduces telemetry overhead by up to <strong>5×</strong>.</li>
<li><strong>Automatic instrumentation</strong> - instruments popular frameworks and libraries out of the box.</li>
<li><strong>Inferred spans</strong> - reveals the behavior of even uninstrumented code paths, providing full trace coverage.</li>
<li><strong>Automatic root spans</strong> - ensures complete traces, even in legacy or partially instrumented applications.</li>
<li><strong>OpAMP readiness</strong> - while the OpenTelemetry community continues to standardize configuration schemas and management workflows, the implementation in EDOT PHP is fully prepared to support these upcoming specifications — ensuring seamless adoption once the OpAMP ecosystem matures.</li>
<li><strong>Asynchronous backend communication</strong> - telemetry data is exported to the OpenTelemetry Collector or backend <strong>asynchronously</strong>, without blocking the instrumented application.
This ensures that span and metric exports do not add latency to user requests or impact response times, even under heavy load.</li>
</ul>
<p>Together, these features make EDOT PHP the first truly <strong>zero-effort observability solution for PHP</strong> — from local testing to cloud-scale production systems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-accepts-elastics-donation-of-edot/performance.png" alt="Performance comparison" /></p>
<blockquote>
<p>The native C++ serializer and asynchronous export pipeline in EDOT PHP reduce average request time from <strong>49 ms</strong> to <strong>23 ms</strong>, more than <strong>2× faster</strong> than the pure PHP implementation.</p>
</blockquote>
<h2>Building on the existing foundation</h2>
<p>EDOT PHP doesn't replace the existing OpenTelemetry PHP SDK — it <strong>extends and strengthens it</strong>.
It packages the SDK, automatic instrumentation, and native extension into a single, unified agent package that works seamlessly with existing OpenTelemetry specifications and APIs.</p>
<p>By contributing this work, Elastic helps the OpenTelemetry community accelerate PHP adoption, align implementations across languages, and make distributed tracing truly universal.</p>
<blockquote>
<p>“This isn't a hand-off — it's a collaboration.
We're contributing years of development to help OpenTelemetry for PHP evolve faster, run more efficiently, and reach more users in every environment.”</p>
<ul>
<li><em>Elastic Observability team</em></li>
</ul>
</blockquote>
<h2>Ongoing improvements</h2>
<p>Elastic continues to invest in advancing EDOT PHP ahead of its integration into OpenTelemetry.
The team is currently focused on <strong>reducing resource usage and memory footprint</strong>, particularly in <strong>multi-worker server environments</strong> such as PHP-FPM or Apache prefork.
These optimizations aim to make the agent more predictable and efficient under heavy load — ensuring that telemetry remains lightweight even in large-scale production deployments.</p>
<p>Beyond that, we're exploring further improvements that can enhance both performance and interoperability.
Areas under investigation include smarter coordination in high-concurrency scenarios, better sharing of telemetry resources across workers, and future alignment with additional OpenTelemetry signals such as metrics and logs.</p>
<p>Together, these efforts will help make EDOT PHP not only faster, but also more adaptable and seamlessly integrated into diverse runtime architectures.</p>
<h2>Why it matters</h2>
<p>This contribution is about more than performance — it's about <strong>removing barriers</strong>.
By making OpenTelemetry for PHP installable as a simple system package and automatically loaded into running applications, the project opens observability to every PHP developer, operator, and platform provider.</p>
<p>For the OpenTelemetry ecosystem, it fills one of the last major language gaps, extending visibility to a vast portion of the internet — all under open governance and community collaboration.</p>
<h2>Looking ahead</h2>
<p>In the months ahead, Elastic and the OpenTelemetry PHP SIG will work closely on the technical integration, documentation, and community onboarding process.
Once the transition is complete, developers will gain a fully open, community-driven, and production-ready OpenTelemetry agent that “just works” — without friction, configuration, or code changes.</p>
<p>Together, we're building a future where <strong>observability just works — for every language, every framework, and every environment</strong>.</p>
<p>For more information:</p>
<p><a href="https://www.elastic.co/docs/reference/opentelemetry">EDOT documentation</a><br />
<a href="https://www.elastic.co/observability-labs/blog/elastic-managed-otlp-endpoint-for-opentelemetry">Learn about the Elastic managed OTLP endpoint</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-accepts-elastics-donation-of-edot/otel-php.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Composing OpenTelemetry Reference Architectures]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-collector-reference-architectures</link>
            <guid isPermaLink="false">opentelemetry-collector-reference-architectures</guid>
            <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A conceptual framework for reasoning about OpenTelemetry Collection architectures — edge, processing, and resilience layers that compose into the right pipeline for your environment.]]></description>
            <content:encoded><![CDATA[<p>Most OpenTelemetry tutorials end at the same place: an application instrumented with the SDK, exporting traces to a single collector, forwarding to a backend. It works. Then production happens.</p>
<p>Traffic grows. Teams want metrics derived from traces. The backend goes down for maintenance and you lose an hour of telemetry. A compliance requirement means PII must be stripped before data leaves the cluster. Suddenly, that single collector isn't enough — and the question becomes: what should the architecture actually look like?</p>
<p>The OpenTelemetry Collector is designed to be composed. It can run in multiple deployment modes, be chained into pipelines, and scaled independently at each stage. But the documentation describes individual components, not how to think about assembling them. That thinking is what this article is about.</p>
<p>What follows is a conceptual framework for reasoning about collector architectures — not a set of rigid templates. The building blocks described here are reference points. In practice, they combine, overlap, and adapt to your constraints. A tail sampling tier might also need Kafka-backed resilience. A gateway might absorb the role of a sampling tier at low volumes. The goal is to understand the concepts well enough to compose the right architecture for your situation, not to pick a pre-built one off a shelf.</p>
<h2>Three conceptual layers</h2>
<p>It helps to think about collector architectures in three layers: <strong>edge</strong>, <strong>processing</strong>, and <strong>resilience</strong>. These aren't physical tiers that must exist as separate deployments — they're categories of concern. A single collector can address multiple layers. A complex deployment might have several components within one layer. The layers are a thinking tool, not a deployment diagram.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/three-layers.png" alt="The three conceptual layers: Edge, Processing, and Resilience" /></p>
<h3>Edge: how telemetry enters the pipeline</h3>
<p>The edge layer is about the first hop — how telemetry gets from your applications and infrastructure into the pipeline. At this stage, the collector gathers data in two fundamentally different ways. <strong>Pull-based receivers</strong> like <code>filelog</code> and <code>hostmetrics</code> actively reach out to collect data — tailing log files on disk or scraping system-level metrics from the host. <strong>Push-based receivers</strong> like <code>otlp</code> listen for data sent to them — applications instrumented with OpenTelemetry SDKs export traces, metrics, and logs directly to the collector's OTLP endpoint. A single edge collector typically runs both: pull receivers for infrastructure telemetry the application doesn't know about, and push receivers for application telemetry the SDK produces. There are several common deployment patterns, and the right one depends on your environment and what you need to collect.</p>
<p><strong>DaemonSet Agent</strong> — One OpenTelemetry Collector per Kubernetes node, deployed as a DaemonSet. Applications export to the agent running on the same node (typically via status.hostIP:4317 using the Kubernetes Downward API). The agent also tails container log files from disk via the filelog receiver and scrapes host-level metrics via the hostmetrics receiver. This is the most common Kubernetes pattern because it handles both application and infrastructure telemetry with a single deployment, and applications only need to know about localhost.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/daemonset-agent.png" alt="DaemonSet Agent pattern: Application with OTel SDK exporting over OTLP to a per-node DaemonSet collector" /></p>
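<p>As a rough sketch, a DaemonSet agent combining push and pull receivers might be configured like this (the gateway endpoint and log path are placeholders for your environment; the component names are standard collector components):</p>
<pre><code>receivers:
  otlp:                      # push: application telemetry from SDKs
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  filelog:                   # pull: container logs from the node's disk
    include: [/var/log/pods/*/*/*.log]
  hostmetrics:               # pull: node-level system metrics
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:

exporters:
  otlp:
    endpoint: gateway.observability.svc:4317   # placeholder next hop

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp, hostmetrics]
      exporters: [otlp]
    logs:
      receivers: [otlp, filelog]
      exporters: [otlp]
</code></pre>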
<p><strong>Sidecar Agent</strong> — One OpenTelemetry Collector per pod, deployed as a sidecar container. Each service gets its own collector with a custom configuration. This is required on managed container platforms like AWS Fargate or Azure Container Apps where DaemonSets aren't available, and it's useful when services have different processing requirements. When running alongside a DaemonSet, the sidecar handles application telemetry while the DaemonSet independently collects node-level telemetry — applications don't send to both.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/sidecar-agent.png" alt="Sidecar Agent pattern: Application with OTel SDK exporting over OTLP to a per-pod sidecar collector" /></p>
<p><strong>Host Agent</strong> — A standalone OpenTelemetry Collector running as a systemd service on bare-metal or VM hosts. It serves the same role as the DaemonSet agent but outside Kubernetes: collecting host metrics, tailing log files, and receiving OTLP from local applications.</p>
<p><strong>Direct SDK Export</strong> — Applications export directly to the next stage (gateway or backend) with no local collector. This is the simplest option but only works when you don't need infrastructure collection. For log collection, the recommended pattern is still to write to stdout and use a collector with the <code>filelog</code> receiver — even if the SDK is exporting traces and metrics directly.</p>
<p>These patterns aren't mutually exclusive. A Kubernetes cluster might run DaemonSet agents for infrastructure collection alongside sidecars for services that need custom processing. A VM environment might use host agents for some services and direct SDK export for others. The edge layer is about matching the collection pattern to the workload, not picking one pattern for everything.</p>
<h3>Processing: central policy, sampling, and transformation</h3>
<p>Not every architecture needs a processing layer. If your edge collectors can export directly to your backend and you don't need centralized policy, you can skip it to favour simplicity. But several scenarios push you toward central processing — and the way you address them can range from a single gateway to a multi-stage pipeline.</p>
<p><strong>Centralized policy (Gateway)</strong> — A pool of OpenTelemetry Collectors that sits between edge collectors and the backend. This is where you enforce consistent filtering, transformation, and PII redaction across all services. It's also where you manage backend credentials — edge collectors export to the gateway over OTLP, and only the gateway holds the API keys. Credential isolation is often the primary reason teams add a gateway.</p>
<p>Replica count scales with data volume. At low volumes (under 1K events/sec), 2 replicas co-located with workloads is sufficient. At medium volumes, 3–5 replicas on a dedicated node pool. At high volumes, 5–20+ replicas, potentially in a separate cluster. This is a general rule of thumb, and you should adapt it to your specific needs as loads might vary significantly between payload types.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/gateway-pool.png" alt="Gateway pattern: Load Balancer distributing traffic to a Gateway Pool of OTel Collectors" /></p>
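<p>A minimal gateway configuration sketch might look like the following. The redacted attribute and backend endpoint are illustrative placeholders; the point is that only this tier holds the backend credentials, read here from an environment variable:</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  attributes/pii:
    actions:
      - key: user.email          # hypothetical attribute to redact
        action: delete
  batch: {}

exporters:
  otlp:
    endpoint: https://backend.example.com:443   # placeholder backend
    headers:
      Authorization: ApiKey ${env:BACKEND_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/pii, batch]
      exporters: [otlp]
</code></pre>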
<p><strong>Tail-based sampling</strong> — Sampling decisions that consider the complete trace (e.g., &quot;keep all traces with errors, sample 10% of successful traces&quot;) require that all spans of a trace reach the same collector instance. This is achieved with the <code>loadbalancingexporter</code> using <code>routing_key: traceID</code>, which consistently routes spans from the same trace to the same downstream collector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/tail-sampling.png" alt="Tail sampling pattern: LB Exporter routing to Sampling Collectors with tail_sampling" /></p>
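<p>The first-stage side of this pattern is just an exporter configuration. A sketch, assuming a DNS-resolvable service in front of the sampling collectors (the hostname is a placeholder):</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID       # all spans of a trace go to one instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: sampling-collectors.observability.svc

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
</code></pre>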
<p>There's a critical subtlety here: if you're deriving span metrics (RED metrics) from traces using the <code>spanmetrics</code> connector, the derivation must happen <strong>before</strong> sampling. Otherwise, your metrics only reflect the sampled subset, not the true traffic. The correct pattern is a two-step pipeline within the sampling stage:</p>
<ol>
<li>Receive traces, derive spanmetrics from 100% of traffic, forward via a <code>forward</code> connector.</li>
<li>Apply <code>tail_sampling</code> to the forwarded traces, export only kept traces.</li>
<li>A separate metrics pipeline exports the derived RED metrics.</li>
</ol>
<p>This ensures accurate metrics regardless of your sampling rate.</p>
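<p>A sketch of this two-step pipeline on the sampling tier, using the <code>forward</code> and <code>spanmetrics</code> connectors (policy names and the backend endpoint are illustrative):</p>
<pre><code>receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics: {}              # derives RED metrics from spans
  forward: {}                  # hands traces from step 1 to step 2

processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder

service:
  pipelines:
    traces/derive:             # step 1: sees 100% of traffic
      receivers: [otlp]
      exporters: [spanmetrics, forward]
    traces/sample:             # step 2: samples the forwarded copy
      receivers: [forward]
      processors: [tail_sampling]
      exporters: [otlp]
    metrics:                   # step 3: exports derived RED metrics
      receivers: [spanmetrics]
      exporters: [otlp]
</code></pre>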
<p><strong>The key point about processing</strong> is that these capabilities — gateway policy, tail sampling, span metrics derivation — are not separate products or fixed modules. They're configurations of the same OpenTelemetry Collector. At low volumes, a single gateway deployment might handle policy enforcement, sampling, and metrics derivation all at once. At high volumes, you might split them into dedicated stages for independent scaling. The architecture adapts to your scale, not the other way around.</p>
<h3>Resilience: what happens when the backend is down</h3>
<p>The resilience layer determines how much data you're willing to lose during backend outages or collector restarts. This isn't a separate tier you bolt on — it's a property you apply to any stage of the pipeline.</p>
<p><strong>In-Memory Queues</strong> — The default. The collector's <code>sending_queue</code> retries failed exports with exponential backoff. If the collector process crashes or restarts, queued data is lost. This is acceptable for development and for workloads where some data loss during incidents is tolerable.</p>
<p><strong>Persistent Queues (WAL)</strong> — The <code>file_storage</code> extension writes queued data to disk before export. If the collector crashes, it resumes from where it left off after restart. In Kubernetes, this requires a PersistentVolumeClaim. This is the right choice for most production workloads — it survives collector restarts and brief backend outages without the operational complexity of an external message bus.</p>
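<p>Enabling this is a small configuration change on whichever stage needs it. A sketch, with the storage path standing in for a PVC mount in Kubernetes:</p>
<pre><code>extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # placeholder; a PVC mount in k8s

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder
    sending_queue:
      storage: file_storage             # back the queue with on-disk storage

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
</code></pre>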
<p><strong>Kafka Buffer</strong> — An external Kafka cluster sits between collectors and the backend. Producer collectors write to Kafka topics; consumer collectors read from Kafka and export to the backend. This provides the strongest durability guarantee — Kafka can buffer hours of telemetry during extended outages and enables replay. But it adds significant operational complexity.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/kafka-buffer.png" alt="Kafka buffer pattern: Collector Pool producing to Kafka, consumed by another Collector Pool" /></p>
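<p>Both sides of the buffer are ordinary collectors. A sketch of the relevant fragments for each side (broker address and topic name are placeholders):</p>
<pre><code># Producer collectors: write traces to Kafka instead of the backend
exporters:
  kafka:
    brokers: [kafka-0:9092]
    topic: otlp-traces
    encoding: otlp_proto

# Consumer collectors: read the same topic and export to the backend
receivers:
  kafka:
    brokers: [kafka-0:9092]
    topic: otlp-traces
    encoding: otlp_proto
</code></pre>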
<p>The important thing to understand is that resilience is orthogonal to the other layers. You can add persistent queues to an edge agent, a gateway, or a sampling tier. You can put Kafka in front of a gateway, in front of a sampling tier, or in front of the backend. A tail sampling deployment that needs to survive extended outages might use Kafka-backed ingestion — combining what might look like two separate &quot;modules&quot; into a single stage. The building blocks compose freely based on what you need to protect against.</p>
<h2>Where to start with your architecture</h2>
<p>The Agent + Gateway two-tier pattern is the de facto production standard, used by the vast majority of organizations running OpenTelemetry at scale. DaemonSet agents on every node handle local collection — pulling infrastructure telemetry via <code>filelog</code> and <code>hostmetrics</code>, receiving application telemetry via OTLP — while a centralized gateway pool enforces policy, manages credentials, and exports to the backend. Persistent queues (WAL) on the gateway protect against backend outages without external dependencies.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/where-to-start.png" alt="A Kubernetes architecture with DaemonSet agents, a processing tier with tail sampling and gateway pool, exporting over OTLP to an observability backend" /></p>
<p>Every other configuration either simplifies this pattern or extends it. Smaller environments might drop the gateway and export directly from agents. Larger ones might add a tail sampling tier with traceID-based load balancing, a Kafka buffer for extended resilience, or span metrics derivation before sampling. The building blocks described in the previous sections — edge, processing, resilience — are the modules you add or remove from this foundation.</p>
<p>The key is to start with the two-tier pattern and evolve incrementally:</p>
<ul>
<li>Need credential isolation or centralized PII redaction? You already have the gateway.</li>
<li>Need tail-based sampling? Add a load-balancing exporter and a sampling tier between agents and gateway.</li>
<li>Need hours of buffer during extended outages? Insert Kafka between agents and the processing tier.</li>
<li>Running on Fargate or Azure Container Apps? Swap DaemonSet agents for sidecars — the rest of the pipeline stays the same.</li>
</ul>
<p>Start here. Add modules as your needs grow. The architecture adapts to your scale, not the other way around.</p>
<h2>Decision points that shape your architecture</h2>
<p>When designing a collector architecture, these are the questions that determine which patterns you need:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Impact</th>
</tr>
</thead>
<tbody>
<tr>
<td>Do I need infrastructure telemetry (host metrics, disk logs)?</td>
<td>Determines whether you need a local collector or can use direct SDK export</td>
</tr>
<tr>
<td>Am I on a managed container platform (Fargate, ACA)?</td>
<td>Forces sidecar pattern instead of DaemonSet</td>
</tr>
<tr>
<td>Do I need centralized filtering, PII redaction, or credential isolation?</td>
<td>Adds a gateway stage</td>
</tr>
<tr>
<td>Do I need tail-based sampling?</td>
<td>Adds a sampling stage with load-balancing exporter and traceID routing</td>
</tr>
<tr>
<td>Do I want span-derived metrics (RED metrics)?</td>
<td>Requires spanmetrics before sampling in a two-step pipeline</td>
</tr>
<tr>
<td>How much data loss is acceptable during outages?</td>
<td>Determines in-memory queues vs. persistent queues vs. Kafka — applied to whichever stage needs protection</td>
</tr>
<tr>
<td>What is my expected data volume?</td>
<td>Determines whether capabilities can be co-located in a single deployment or need dedicated stages</td>
</tr>
</tbody>
</table>
<p>The answers to these questions don't map to a single &quot;correct&quot; architecture. They constrain the design space, and within those constraints, you make trade-offs between simplicity and capability.</p>
<h2>Exploring these patterns interactively</h2>
<p>If you'd rather explore how these building blocks compose than assemble them by hand, <a href="https://mlunadia.github.io/otel-blueprints/">OpenTelemetry Blueprints</a> is an open-source tool that generates reference architectures from your requirements.</p>
<p>Toggle your environment, signals, volume, resilience, and processing needs — and get a composed diagram with animated data flow, interactive tooltips, and reference collector configurations you can open directly in <a href="https://www.otelbin.io">OTelBin</a> for validation.</p>
<p><a href="https://mlunadia.github.io/otel-blueprints/"><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/architecture.png" alt="Screenshot of a composed architecture diagram showing a Kubernetes cluster with DaemonSet agent, processing tier, and observability backend" /></a></p>
<p>The generated configurations export via OTLP, so they work with any OTLP-compatible backend — including <a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic Observability</a>, which natively accepts and stores OTLP traces, metrics, and logs.</p>
<p>The architectures Blueprints generates are reference compositions — starting points for understanding how the building blocks fit together, not turnkey deployments. Every architecture should be adapted to your organisation's scale, security, networking, and compliance requirements. The patterns might combine or overlap differently in your environment than in anyone else's, and that's the point.</p>
<h2>Get started</h2>
<p>The architectures described here export over OTLP, so they work with any compatible backend. If you don't have one yet, the fastest way to see your telemetry flowing end-to-end is with Elastic Observability — it natively ingests OTLP traces, metrics, and logs with no additional configuration.</p>
<ol>
<li><a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Start a free trial</a> on Elastic Cloud Serverless — no credit card required.</li>
<li>Point your collector's OTLP exporter at the managed OTLP endpoint.</li>
<li>Explore your traces, metrics, and logs in Kibana within minutes.</li>
</ol>
<p>Check out these resources to go further:</p>
<ul>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/motlp">Elastic's managed OTLP endpoint documentation</a></li>
<li><a href="https://www.elastic.co/docs/reference/opentelemetry/edot-collector">EDOT Collector — Elastic's distribution of the OpenTelemetry Collector</a></li>
<li><a href="https://mlunadia.github.io/otel-blueprints/">OpenTelemetry Blueprints — generate reference architectures interactively</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-collector-reference-architectures/opentelemetry-collector-reference-architectures.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitor your C++ Applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-cpp-elastic</link>
            <guid isPermaLink="false">opentelemetry-cpp-elastic</guid>
            <pubDate>Tue, 11 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article we use the OpenTelemetry C++ client to monitor C++ applications with Elastic APM]]></description>
            <content:encoded><![CDATA[
<h1>Introduction</h1>
<p>One of the main challenges that developers, SREs, and DevOps professionals face is the absence of comprehensive tooling that gives them visibility into their application stack. Many APM solutions on the market provide ways to monitor applications built on popular languages and frameworks (e.g., .NET, Java, Python) but fall short when it comes to C++ applications.</p>
<p>Luckily, Elastic has been one of the leading solutions in the observability space and a contributor to the OpenTelemetry project. Elastic’s unique position and extensive observability capabilities allow end users to monitor applications built with a wide range of languages and frameworks.</p>
<p>In this blog we will explore using Elastic APM to investigate C++ traces with the OpenTelemetry client, providing a comprehensive guide to instrumenting a C++ application with the OpenTelemetry client and connecting it to Elastic APM. While this blog reviews the upstream OTel C++ library, Elastic also maintains its own Elastic Distributions of OpenTelemetry (EDOT), which provide commercial support and are regularly upstreamed.</p>
<p>Here are some resources to help get you started:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Use OpenTelemetry with APM</a></p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-cpp">The OpenTelemetry C++ Client</a></p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/languages/cpp/">OpenTelemetry C++ Docs</a></p>
</li>
</ul>
<h1>Step by Step Guide</h1>
<h2>Prerequisites</h2>
<h3>Environment</h3>
<p>Choosing an environment is quite important, as support for the OTel client is limited. We have experimented with multiple operating systems; here are our suggestions:</p>
<ul>
<li>
<p>Ubuntu 22.04</p>
</li>
<li>
<p>Debian 11 Bullseye</p>
</li>
<li>
<p>For this guide we are focusing on Ubuntu 22.04.</p>
<ul>
<li>
<p>Machine: 2 vCPU, 4GB is sufficient.</p>
</li>
<li>
<p>Image: Ubuntu 22.04 LTS (x86_64).</p>
</li>
<li>
<p>Disk: ~30 GB is enough.</p>
</li>
</ul>
</li>
</ul>
<h2>Implementation method</h2>
<p>We experimented with multiple methods and found that the most suitable approach is to use a package manager. Building from source with tools such as CMake or Bazel is a viable solution, but in our testing we spent most of our time and effort fixing OS compatibility and dependency issues rather than focusing on sending data to our APM. Hence we decided to use a package manager instead.</p>
<p>The main issues we kept running into as we tested were:</p>
<ul>
<li>
<p>Compatibility of packages.</p>
</li>
<li>
<p>Availability of packages.</p>
</li>
<li>
<p>Dependencies of libraries and packages.</p>
</li>
</ul>
<p>In this guide we will use vcpkg, since it brings in all the dependencies required to run the OpenTelemetry C++ client.</p>
<h2>Installing required OS tools</h2>
<h3>Update package lists</h3>
<pre><code>    sudo apt-get update
</code></pre>
<h3>Install build essentials, cmake, git, and the SQLite dev library</h3>
<pre><code>    sudo apt-get install -y build-essential cmake git curl zip unzip sqlite3 libsqlite3-dev
</code></pre>
<p>sqlite3 and libsqlite3-dev allow us to build/run SQLite queries in our C++ code.</p>
<h3>Set Up vcpkg</h3>
<p>vcpkg is the C++ package manager that we’ll use to install the opentelemetry-cpp client.</p>
<pre><code>    # Clone vcpkg
    cd ~
    git clone https://github.com/microsoft/vcpkg.git
</code></pre>
<pre><code>    # Bootstrap
    cd ~/vcpkg
    ./bootstrap-vcpkg.sh
</code></pre>
<h3>Install OpenTelemetry C++ with OTLP gRPC</h3>
<p>In this guide we focus on trace export to Elastic. At the time of writing, vcpkg’s opentelemetry-cpp version 1.18.0 fully supports traces but has limited direct metrics exporting.</p>
<h3>Install the package</h3>
<pre><code>    cd ~/vcpkg
    ./vcpkg install opentelemetry-cpp[otlp-grpc]:x64-linux
</code></pre>
<h4>Note</h4>
<p>Sometimes when installing opentelemetry-cpp on Linux, vcpkg doesn’t install all the required packages. If you run into this, try running the install again with the <code>--allow-unsupported</code> flag:</p>
<pre><code>    ./vcpkg install opentelemetry-cpp[*]:x64-linux --allow-unsupported
</code></pre>
<h3>Verify</h3>
<pre><code>    ./vcpkg list | grep opentelemetry-cpp
</code></pre>
<p>The output should be something like this:</p>
<pre><code>opentelemetry-cpp:x64-linux 1.18.0
</code></pre>
<h2>Create the C++ Project with Database Spans</h2>
<p>We’ll build a sample in ~/otel-app that:</p>
<ul>
<li>
<p>Uses SQLite to do basic CREATE/INSERT/SELECT queries. This is helpful to showcase capturing transactions for apps that use databases on Elastic APM.</p>
</li>
<li>
<p>Generates random traces to showcase how they are captured in Elastic APM.</p>
</li>
</ul>
<p>This app generates random work: some transactions contain database queries and some are plain application traces. Each query runs in its own child span, so queries appear in APM as separate database spans.</p>
<pre><code># Below is the structure of our project
    otel-app/
    ├── main.cpp
    └── CMakeLists.txt
</code></pre>
<h3>Create App Project</h3>
<pre><code>    cd ~
    mkdir otel-app
    cd otel-app
</code></pre>
<p>Inside this project we will create two files</p>
<ul>
<li>
<p>main.cpp</p>
</li>
<li>
<p>CMakeLists.txt</p>
</li>
</ul>
<p>Keep in mind that main.cpp is where you configure the OTel exporters that send data to the Elastic cluster. For your own tech stack, this would be your application’s source code.</p>
<h4>Sample application code</h4>
<pre><code>    // main.cpp
    // Below we declare the required libraries that we will use to ship
    // traces to Elastic APM
    #include &lt;opentelemetry/exporters/otlp/otlp_grpc_exporter.h&gt;
    #include &lt;opentelemetry/sdk/trace/tracer_provider.h&gt;
    #include &lt;opentelemetry/sdk/trace/simple_processor.h&gt;
    #include &lt;opentelemetry/trace/provider.h&gt;

    #include &lt;sqlite3.h&gt;
    #include &lt;chrono&gt;
    #include &lt;iostream&gt;
    #include &lt;thread&gt;
    #include &lt;cstdlib&gt;  // for rand(), srand()
    #include &lt;ctime&gt;    // for time()

    // Namespace aliases
    namespace trace_api = opentelemetry::trace;
    namespace sdktrace  = opentelemetry::sdk::trace;
    namespace otlp      = opentelemetry::exporter::otlp;

    // Below we are using a helper function to run SQLITE statement inside 
    // child span
    bool ExecuteSql(sqlite3 *db, const std::string &amp;sql,
                    trace_api::Tracer &amp;tracer,
                    const std::string &amp;span_name)
    {
      // Starting the child span
      auto db_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(db_span);

        // Here we mark Database attributes for clarity in APM
        db_span-&gt;SetAttribute(&quot;db.system&quot;, &quot;sqlite&quot;);
        db_span-&gt;SetAttribute(&quot;db.statement&quot;, sql);

        char *errMsg = nullptr;
        int rc = sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &amp;errMsg);
        if (rc != SQLITE_OK)
        {
          db_span-&gt;AddEvent(&quot;SQLite error: &quot; + std::string(errMsg ? errMsg : &quot;unknown&quot;));
          sqlite3_free(errMsg);
          db_span-&gt;End();
          return false;
        }
        db_span-&gt;AddEvent(&quot;Query OK&quot;);
      }
      db_span-&gt;End();
      return true;
    }

    /**
     * DoNonDbWork - Simulate some other operation
     */
    void DoNonDbWork(trace_api::Tracer &amp;tracer, const std::string &amp;span_name)
    {
      auto child_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(child_span);
        // Just sleep or do some &quot;fake&quot; work
        std::cout &lt;&lt; &quot;[TRACE] Doing non-DB work for &quot; &lt;&lt; span_name &lt;&lt; &quot;...\n&quot;;
        std::this_thread::sleep_for(std::chrono::milliseconds(200 + rand() % 300));
        child_span-&gt;AddEvent(&quot;Finished non-DB work&quot;);
      }
      child_span-&gt;End();
    }

    int main()
    {
      // Seed random generator for example
      srand(static_cast&lt;unsigned&gt;(time(nullptr)));

      // 1) Create OTLP exporter for traces
      otlp::OtlpGrpcExporterOptions opts;
      auto exporter = std::make_unique&lt;otlp::OtlpGrpcExporter&gt;(opts);

      // 2) Simple Span Processor
      auto processor = std::make_unique&lt;sdktrace::SimpleSpanProcessor&gt;(std::move(exporter));

      // 3) Tracer Provider
      auto sdk_tracer_provider = std::make_shared&lt;sdktrace::TracerProvider&gt;(std::move(processor));
      auto tracer = sdk_tracer_provider-&gt;GetTracer(&quot;my-cpp-multi-app&quot;);

      // Prepare an in-memory SQLite DB (for random DB usage)
      sqlite3 *db = nullptr;
      int rc = sqlite3_open(&quot;:memory:&quot;, &amp;db);
      if (rc == SQLITE_OK)
      {
        // Create a table so we can do inserts/reads
        ExecuteSql(db, &quot;CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, info TEXT);&quot;,
                   *tracer.get(), &quot;db_create_table&quot;);
      }

      // Create the following loop to generate multiple transactions
      int num_transactions = 5;  // Change this variable to the desired number of transaction
      for (int i = 1; i &lt;= num_transactions; i++)
      {
        // Each iteration is a top-level transaction
        std::string transaction_name = &quot;transaction_&quot; + std::to_string(i);
        auto parent_span = tracer-&gt;StartSpan(transaction_name);
        {
          auto scope = tracer-&gt;WithActiveSpan(parent_span);

          std::cout &lt;&lt; &quot;\n=== Starting &quot; &lt;&lt; transaction_name &lt;&lt; &quot; ===\n&quot;;

          // Randomly select whether a transaction will interact with the database or not.
          bool doDb = (rand() % 2 == 0); // 50% chance

          if (doDb &amp;&amp; db)
          {
            // Insert random data
            std::string insert_sql = &quot;INSERT INTO items (info) VALUES ('Item &quot; + std::to_string(i) + &quot;');&quot;;
            ExecuteSql(db, insert_sql, *tracer.get(), &quot;db_insert_item&quot;);

            // Select from DB
            ExecuteSql(db, &quot;SELECT * FROM items;&quot;, *tracer.get(), &quot;db_select_items&quot;);
          }
          else
          {
            // Do some random non-DB tasks
            DoNonDbWork(*tracer.get(), &quot;non_db_task_1&quot;);
            DoNonDbWork(*tracer.get(), &quot;non_db_task_2&quot;);
          }

          // Sleep a little to simulate transaction time
          std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
        parent_span-&gt;End();
      }

      // Close DB
      sqlite3_close(db);

      // Extra sleep to ensure final flush
      std::cout &lt;&lt; &quot;\n[INFO] Sleeping 5 seconds to allow flush...\n&quot;;
      std::this_thread::sleep_for(std::chrono::seconds(5));
      std::cout &lt;&lt; &quot;[INFO] Exiting.\n&quot;;
      return 0;
    }
</code></pre>
<h5>What does the code do?</h5>
<p>We create five top-level “transaction_i” spans.</p>
<p>For each transaction, we randomly choose to do DB or non-DB work:</p>
<ul>
<li>If DB: insert a row, then select. Each operation is a child span.</li>
<li>If non-DB: we perform two “fake” tasks (child spans).</li>
</ul>
<p>Once we finish, we close the database connection and wait 5 seconds for data flush.</p>
<h4>Sample instruction file</h4>
<p>CMakeLists.txt: this file describes the project’s source files and build targets.</p>
<pre><code>    cmake_minimum_required(VERSION 3.10)

    # The vcpkg toolchain file must be set before project() to take effect;
    # alternatively, pass -DCMAKE_TOOLCHAIN_FILE on the cmake command line.
    set(CMAKE_TOOLCHAIN_FILE &quot;PATH-TO/vcpkg.cmake&quot; CACHE STRING &quot;Vcpkg toolchain file&quot;)

    project(OtelApp VERSION 1.0)

    # opentelemetry-cpp 1.18.0 requires at least C++14
    set(CMAKE_CXX_STANDARD 14)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)

    find_package(opentelemetry-cpp CONFIG REQUIRED)

    add_executable(otel_app main.cpp)

    # Link the OTLP gRPC exporter, trace library, and sqlite3
    target_link_libraries(otel_app PRIVATE
        opentelemetry-cpp::otlp_grpc_exporter
        opentelemetry-cpp::trace
        sqlite3
    )
</code></pre>
<h4>Declare Environment Variables</h4>
<p>Here we export our Elastic Cloud endpoint details as environment variables.</p>
<p>You can get that information as follows:</p>
<ol>
<li>
<p>Log in to your Elastic Cloud account.</p>
</li>
<li>
<p>Go to your deployment.</p>
</li>
<li>
<p>On the left-hand side, click the hamburger menu and scroll down to “Integrations.”</p>
</li>
<li>
<p>In the search bar inside Integrations, type “APM.”</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/APM-Search.png" alt="" /></p>
<ol start="5">
<li>
<p>Click on the APM integration.</p>
</li>
<li>
<p>Scroll down and click the OpenTelemetry option on the far left side.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted.png" alt="" /></p>
<ol start="7">
<li>You should see values similar to the screenshot below. Once you have copied the values to export, click Launch APM.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted2.png" alt="" /></p>
<p>Once you have copied the required values, export them:</p>
<pre><code>    export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;APM-ENDPOINT&quot;
    export OTEL_EXPORTER_OTLP_HEADERS=&quot;KEY&quot;
    export OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=my-app,service.version=1.0.0,deployment.environment=dev&quot;
</code></pre>
<p>Note that the Elastic OTEL_EXPORTER_OTLP_HEADERS value usually starts with “Authorization=Bearer”. Make sure you change the uppercase “A” in “Authorization” to a lowercase “a”, because the OTLP exporter expects a lowercase header key.</p>
<h3>Build and Run</h3>
<p>Once we have created the two files, we can build the application.</p>
<pre><code>cd ~/otel-app
mkdir -p build
cd build

cmake -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake \
      -DCMAKE_PREFIX_PATH=~/vcpkg/installed/x64-linux/share \
      ..
make
</code></pre>
<p>Once make succeeds, run the application:</p>
<pre><code>./otel_app
</code></pre>
<p>You should see console output similar to the following:</p>
<pre><code>
    === Starting transaction_1 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_2 ===
    [TRACE] Doing DB work for doDb_task_1...
    [TRACE] Doing DB work for doDb_task_2...

    === Starting transaction_3 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_4 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_5 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    [INFO] Sleeping 5 seconds to allow flush...
    [INFO] Exiting.
</code></pre>
<p>Once the script executes, you should be able to observe the traces in Elastic APM, similar to the screenshots below.</p>
<h3>Observe in Elastic APM</h3>
<p>Go to Elastic Cloud, open your deployment, and navigate to Observability &gt; APM.</p>
<p>Look for the app name in the service list (as defined by OTEL_RESOURCE_ATTRIBUTES).</p>
<p>Inside that service’s Traces tab, you’ll find multiple transactions such as “transaction_1”, “transaction_2”, and so on.</p>
<p>Expanding each transaction shows child spans:</p>
<ul>
<li>db_insert_item and db_select_items, if the random DB path was taken.</li>
<li>Otherwise, non_db_task_1 and non_db_task_2.</li>
</ul>
<p>You can see how some transactions make DB calls and some do not, each with different spans.</p>
<p>This variety demonstrates how a real application might produce multiple different “routes” or “operations.”</p>
<h4>Service Map</h4>
<p>If everything runs correctly, you should be able to view your services and see service maps for your application.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Service-Map.png" alt="" /></p>
<h4>Services</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Services.png" alt="" /></p>
<h4>My Elastic App</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Overview-transactions.png" alt="" /></p>
<h4>App Transactions</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Transactions2.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Trace-db.png" alt="" /></p>
<h4>Dependencies</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Dependecies.png" alt="" /></p>
<h4>Logs</h4>
<p>Navigate to your Logs window or Discover to see the incoming application logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs2.png" alt="" /></p>
<h4>Patterns</h4>
<p>Log pattern analysis helps you to find patterns in unstructured log messages and makes it easier to examine your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/patt2.png" alt="" /></p>
<h2>Final Recap</h2>
<p>Here is a quick summary of what we did:</p>
<ul>
<li>
<p>Provisioned an Ubuntu 22.04 machine.</p>
</li>
<li>
<p>Installed build tools, SQLite development libraries, and vcpkg.</p>
</li>
<li>
<p>Installed the opentelemetry-cpp client via vcpkg.</p>
</li>
<li>
<p>Created a minimal C++ project that emits application traces and captures database operations.</p>
</li>
<li>
<p>Linked the sqlite3 library in CMakeLists.txt.</p>
</li>
<li>
<p>Exported the Elastic OTLP endpoint &amp; token as environment variables (with a lowercase authorization=Bearer key!).</p>
</li>
<li>
<p>Ran the application and observed DB interactions and app traces in Elastic APM.</p>
</li>
<li>
<p>Observed application logs and patterns on Elastic logs and Discover.</p>
</li>
</ul>
<h2>FAQ &amp; Common Issues</h2>
<ul>
<li>Getting “Could not find package configuration file provided by opentelemetry-cpp”?</li>
</ul>
<p>Make sure you pass</p>
<pre><code>-DCMAKE_TOOLCHAIN_FILE=... and -DCMAKE_PREFIX_PATH=... 
</code></pre>
<p>to cmake, or embed them in CMakeLists.txt.</p>
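<p>If you prefer embedding, a minimal sketch of the equivalent settings in CMakeLists.txt looks like this (the paths are placeholders for your own vcpkg checkout):</p>

```cmake
# Must appear before project() so the toolchain takes effect.
set(CMAKE_TOOLCHAIN_FILE "$ENV{HOME}/vcpkg/scripts/buildsystems/vcpkg.cmake"
    CACHE STRING "Vcpkg toolchain file")
# Helps find_package() locate the opentelemetry-cpp config files.
list(APPEND CMAKE_PREFIX_PATH "$ENV{HOME}/vcpkg/installed/x64-linux/share")
```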
<ul>
<li>Crash: “validate_metadata: INTERNAL:Illegal header key”?</li>
</ul>
<p>Use an all-lowercase header key in OTEL_EXPORTER_OTLP_HEADERS, e.g.:</p>
<pre><code>authorization=Bearer &lt;token&gt;
</code></pre>
<ul>
<li>Missing otlp_grpc_metrics_exporter.h?</li>
</ul>
<p>Your vcpkg version of opentelemetry-cpp (1.18.0) lacks a direct metrics exporter for OTLP. For metrics, either upgrade the library or consider an OpenTelemetry Collector approach.</p>
<ul>
<li>No data in Elastic APM?</li>
</ul>
<p>Double-check your endpoint URL, Bearer token, firewall rules, and the service name shown in the APM app.</p>
<h2>Additional Resources:</h2>
<ul>
<li><a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud free trial</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/tag/opentelemetry">More Elastic OpenTelemetry Topics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Introducing Elastic Distributions of OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Introducing Elastic Distribution of OpenTelemetry Collector</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Instrumenting your OpenAI-powered Python, Node.js, and Java Applications with EDOT</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/blog-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to combine OpenTelemetry instrumentation with Elastic APM Agent features]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-instrumentation-apm-agent-features</link>
            <guid isPermaLink="false">opentelemetry-instrumentation-apm-agent-features</guid>
            <pubDate>Thu, 13 Jul 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This post shows you how you can combine the OpenTelemetry tracing APIs with Elastic APM Agents. You'll learn how OpenTelemetry spans became part of a trace that Elastic APM Agents report.]]></description>
            <content:encoded><![CDATA[<p>Elastic APM supports OpenTelemetry on multiple levels. One easy-to-understand scenario, which <a href="https://www.elastic.co/blog/opentelemetry-observability">we previously blogged about</a>, is the direct OpenTelemetry Protocol (OTLP) support in APM Server. This means that you can connect any OpenTelemetry agent to an Elastic APM Server, the APM Server will happily take that data and ingest it into Elasticsearch<sup>®</sup>, and you can view that OpenTelemetry data in the APM app in Kibana<sup>®</sup>.</p>
<p>This blog post will showcase a different use-case: within Elastic APM, we have <a href="https://www.elastic.co/guide/en/apm/agent/index.html">our own APM Agents</a>. Some of these have download numbers in the tens of millions, and some of them predate OpenTelemetry. Of course we realize OpenTelemetry is very important and it’s here to stay, so we wanted to make these agents OpenTelemetry compatible and illustrate them using <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualizations</a> in this blog.</p>
<p>Most of our Elastic APM Agents today are able to ship OpenTelemetry spans as part of a trace. This means that if you have any component in your application that emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM Agent captures. This can be a library you use that is already instrumented by the OpenTelemetry API, or it can be any other OpenTelemetry span that an application developer added into the application’s code for manual instrumentation.</p>
<p>This feature of the Elastic APM Agents not only reports those spans but also properly maintains parent-child relationships between all spans, making OpenTelemetry a first-class citizen for these agents. If, for example, an Elastic APM Agent starts a span for a specific action by auto-instrumentation and then within that span the OpenTelemetry API starts another span, then the OpenTelemetry span will be the child of the outer span created by the agent. This is reflected in the parent.id field of the spans. It’s the same the other way around as well: if a span is created by the OpenTelemetry API and within that span an Elastic APM agent captures another span, then the span created by the Elastic APM Agent will be the child of the other span created by the OpenTelemetry API.</p>
<p>This feature is present in the following agents:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/apm/agent/java/current/opentelemetry-bridge.html">Java</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/dotnet/master/opentelemetry-bridge.html">.NET</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/python/current/opentelemetry-bridge.html">Python</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/nodejs/current/opentelemetry-bridge.html">Node.js</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/agent/go/current/opentelemetry.html">Go</a></li>
</ul>
<h2>Capturing OpenTelemetry spans in the Elastic .NET APM Agent</h2>
<p>As a first example, let’s take an ASP.NET Core application. We’ll put the .NET Elastic APM Agent into this application and turn on the feature that automatically bridges OpenTelemetry spans, so the Elastic APM Agent will make those spans part of the trace it reports.</p>
<p>The following code snippet shows a controller:</p>
<pre><code class="language-csharp">namespace SampleAspNetCoreApp.Controllers
{
	public class HomeController : Controller
	{
		private readonly SampleDataContext _sampleDataContext;
		private ActivitySource _activitySource = new ActivitySource(&quot;HomeController&quot;);
		public HomeController(SampleDataContext sampleDataContext) =&gt; _sampleDataContext = sampleDataContext;
		public async Task&lt;IActionResult&gt; Index()
		{
			await ReadGitHubStars();
			return View();
		}
		public async Task ReadGitHubStars()
		{
			using var activity = _activitySource.StartActivity();
			var httpClient = new HttpClient();
			httpClient.DefaultRequestHeaders.Add(&quot;User-Agent&quot;, &quot;APM-Sample-App&quot;);
			var responseMsg = await httpClient.GetAsync(&quot;https://api.github.com/repos/elastic/apm-agent-dotnet&quot;);
			var responseStr = await responseMsg.Content.ReadAsStringAsync();
			// …use responseStr
		}
	}
}
</code></pre>
<p>The Index method calls the ReadGitHubStars method, and after that we simply return the corresponding view.</p>
<p>The incoming HTTP call and the outgoing HTTP call made by the HttpClient are automatically captured by the Elastic APM Agent — this is part of the auto instrumentation we have offered for a very long time.</p>
<p>ReadGitHubStars is the method where we use the OpenTelemetry API. OpenTelemetry in .NET uses the ActivitySource and Activity APIs. The _activitySource.StartActivity() call creates an OpenTelemetry span that automatically takes the name of the method via the <a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.callermembernameattribute?view=net-7.0">CallerMemberNameAttribute</a> C# language feature, and this span ends when the method runs to completion.</p>
<p>Additionally, within this span we call the GitHub API with the HttpClient type. For this type, the .NET Elastic APM Agent again offers auto instrumentation, so the HTTP call will be also captured as a span by the agent automatically.</p>
<p>And here is how the waterfall chart for this transaction looks in Kibana:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-1-trace-sample.png" alt="trace sample kibana" /></p>
<p>As you can see, the agent was able to capture the OpenTelemetry span as part of the trace.</p>
<h2>Bridging OpenTelemetry spans in Python by using the Python Elastic APM Agent</h2>
<p>Let’s see how this works in the case of Python. The idea is the same, so all the concepts introduced previously apply to this example as well.</p>
<p>We take a very simple Django example:</p>
<pre><code class="language-python">from django.http import HttpResponse
from elasticapm.contrib.opentelemetry import Tracer
import requests


def index(request):
   tracer = Tracer(__name__)
   with tracer.start_as_current_span(&quot;ReadGitHubStars&quot;):
       url = &quot;https://api.github.com/repos/elastic/apm-agent-python&quot;
       response = requests.get(url)
       return HttpResponse(response)
</code></pre>
<p>The first step to turn on capturing OpenTelemetry spans in Python is to import the Tracer implementation from elasticapm.contrib.opentelemetry.</p>
<p>And then on this Tracer you can start a new span — in this case, we manually name the span ReadGitHubStars.</p>
<p>Similarly to the previous example, the call to <a href="http://127.0.0.1:8000/otelsample/">http://127.0.0.1:8000/otelsample/</a> is captured by the Elastic APM Python Agent, and then the next span is created by the OpenTelemetry API, which, as you can see, is captured by the agent automatically, and then finally the HTTP call to the GitHub API is captured again by the auto instrumentation of the agent.</p>
<p>Here is how it looks in the waterfall chart:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-2-trace-sample-2.png" alt="water-flow chart" /></p>
<p>As already mentioned, the agent maintains the parent-child relationship for all the OTel spans. Let’s take a look at the parent.id of the GET api.github.com call:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-3-span-details.png" alt="OTel span details" /></p>
<p>As you can see, the id of this span is c98401c94d40b87a.</p>
<p>If we look at the span.id of the ReadGitHubStars OpenTelemetry span, then we can see that the id of this span is exactly c98401c94d40b87a — so the APM Agent internally maintains parent-child relationships across OpenTelemetry and non-OpenTelemetry spans, which makes OpenTelemetry spans first-class citizens in Elastic APM Agents.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/elastic-blog-4-span-details-2.png" alt="OpenTelemetry spans first-class citizens in Elastic APM Agents" /></p>
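<p>The parent-child bookkeeping described above can be sketched conceptually: every span, whichever API created it, records the id of the span that was active when it started. The following is a toy illustration in plain Python — not the agent’s actual implementation, and all names here are made up for the sketch:</p>

```python
import secrets

class Span:
    """Toy span: stores its own id plus its parent's id (like parent.id in APM)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.id = secrets.token_hex(8)  # 16 hex chars, e.g. "c98401c94d40b87a"
        self.parent_id = parent.id if parent else None

# Agent transaction -> OTel span -> agent auto-instrumented HTTP span:
transaction = Span("GET /otelsample/")              # captured by the agent
otel = Span("ReadGitHubStars", parent=transaction)  # created via the OTel API
http = Span("GET api.github.com", parent=otel)      # agent auto-instrumentation

# The HTTP span's parent.id equals the OTel span's id, as in the screenshots.
assert http.parent_id == otel.id
assert otel.parent_id == transaction.id
```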
<h2>Other languages</h2>
<p>At this point, I'll stop replicating the exact same sample code in further languages — you get the point: in each language listed above, our Elastic APM Agents are able to bridge OpenTelemetry traces and show them in Kibana as native spans. We also <a href="https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin">blogged about using the same API in Java</a>, and you can see examples for the rest of the languages in the corresponding agent documentation (linked above).</p>
<h2>When to use this feature and when to use pure OpenTelemetry SDKs</h2>
<p>This is really up to you. If you want to only have pure OpenTelemetry usage in your applications and you really want to avoid any vendor-related software, then feel free to use OpenTelemetry SDKs directly — that is a use case we clearly support. If you go that route, this feature is not so relevant to you.</p>
<p>However, our Elastic APM Agents already have a very big user base and they offer features that are not present in OpenTelemetry. Some of these features are <a href="https://www.elastic.co/guide/en/apm/guide/current/span-compression.html">span compression</a>, <a href="https://www.elastic.co/guide/en/kibana/current/agent-configuration.html">central configuration</a>, <a href="https://www.elastic.co/guide/en/apm/agent/java/current/method-sampling-based.html">inferred spans</a>, distributed <a href="https://www.elastic.co/guide/en/apm/guide/current/configure-tail-based-sampling.html">tail based sampling</a> with multiple APM Servers, and many more.</p>
<p>If you are one of the many existing Elastic APM Agent users, or you plan to use an Elastic APM Agent because of the features mentioned above, then bridging OpenTelemetry spans enables you to still use the OpenTelemetry API and not rely on any vendor-related API. That way your developer teams can instrument your application with OpenTelemetry, you can use any third-party library already instrumented with OpenTelemetry, and Elastic APM Agents will happily report those spans as part of the traces they report. With this, you can combine the vendor-independent nature of OpenTelemetry with the feature-rich Elastic APM Agents.</p>
<p>The OpenTelemetry bridge feature is also a good tool to use if you wish to change your telemetry library from an Elastic APM Agent to OpenTelemetry (and vice-versa), as it allows you to use both libraries together and switch them using atomic changes.</p>
<h2>Next steps</h2>
<p>In this blog post, we discussed how you can bridge OpenTelemetry spans with Elastic APM Agents. Of course OpenTelemetry is more than just traces. We know that, and we plan to cover further areas: currently we are working on bridging OpenTelemetry metrics in our Elastic APM Agents in a very similar fashion. You can watch the progress <a href="https://github.com/elastic/apm/issues/691">here</a>.</p>
<p><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Learn more about adding Elastic APM as part of your Elastic Observability deployment</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-instrumentation-apm-agent-features/opentelemetry_apm-blog-720x420.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Automatic cloud resource attributes with OpenTelemetry Java]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-java-automatic-cloud-resource-attributes</link>
            <guid isPermaLink="false">opentelemetry-java-automatic-cloud-resource-attributes</guid>
            <pubDate>Thu, 27 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Capturing cloud resource attributes allows you to describe application cloud deployment details. In this article, we describe three distinct ways to enable them for Java applications using OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p>With OpenTelemetry, the observed entities (application, services, processes, …) are described through resource attributes. The definitions and the values of those attributes are defined in the <a href="https://opentelemetry.io/docs/concepts/semantic-conventions/">semantic conventions</a>.<br />
In practice, for a typical Java application running in a cloud environment like Google Cloud Platform (GCP), Amazon Web Services (AWS), or Azure, it means capturing the name of the cloud provider, the cloud service name, or the availability zone, in addition to per-provider attributes. Those attributes are then used to describe and qualify the observability signals (logs, traces, metrics); they are defined by semantic conventions in the <a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/">cloud resource attributes</a> section.</p>
<p>When using the <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a> or the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry instrumentation agent</a>, those attributes are not automatically captured by default. In this article we will show you first how to enable them with the SDK, then using the instrumentation agent and then we will show you how using the <a href="https://github.com/elastic/elastic-otel-java/">Elastic OpenTelemetry Distribution</a> makes it even easier.</p>
<h2>OpenTelemetry Java SDK</h2>
<p>The OpenTelemetry Java SDK does not capture any cloud resource attributes by default; however, it provides a pluggable service provider interface for registering resource attribute providers, and application developers have to supply the implementations.</p>
<p>Implementations for <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources">GCP</a> and <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources">AWS</a> are already included in the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">OpenTelemetry Java Contrib</a> repo, so if you are using one of those cloud providers then it's mostly a matter of adding those providers to your application dependencies. Thanks to autoconfiguration those should be automatically included and enabled once they are added to the application classpath. The <a href="https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure#resource-provider-spi">SDK documentation</a> provides all the details to add and configure those in your application.</p>
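<p>With Gradle, for example, this amounts to adding the contrib artifacts to your dependencies. The coordinates below are the ones published from the contrib repo; VERSION is a placeholder for a release compatible with your SDK:</p>

```kotlin
// build.gradle.kts — a sketch, not a complete build file
dependencies {
    runtimeOnly("io.opentelemetry.contrib:opentelemetry-aws-resources:VERSION")
    runtimeOnly("io.opentelemetry.contrib:opentelemetry-gcp-resources:VERSION")
}
```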
<p>If you are using a cloud provider for which no such implementation is available, then you still have the option to provide your own which is a straightforward implementation of the <a href="https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md#resource-provider-spi">ResourceProvider</a> SPI (Service Provider Interface). In order to keep things consistent, you will have to rely on the existing <a href="https://opentelemetry.io/docs/specs/semconv/resource/cloud/">cloud semantic conventions</a>.</p>
<p>Here is an example of a simple cloud resource attribute provider for a fictitious cloud provider named &quot;potatoes&quot;.</p>
<pre><code>package potatoes;

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;
import io.opentelemetry.sdk.autoconfigure.spi.ResourceProvider;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.semconv.incubating.CloudIncubatingAttributes;

public class PotatoesResourceProvider implements ResourceProvider {

  @Override
  public Resource createResource(ConfigProperties configProperties) {
    return Resource.create(Attributes.of(
        CloudIncubatingAttributes.CLOUD_PROVIDER, &quot;potatoes&quot;,
        CloudIncubatingAttributes.CLOUD_PLATFORM, &quot;french-fries&quot;,
        CloudIncubatingAttributes.CLOUD_REGION, &quot;garden&quot;));
  }
}
</code></pre>
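<p>For the SDK's autoconfiguration to pick this provider up, it also needs to be registered through the standard Java ServiceLoader mechanism, i.e. a service file on the classpath naming the implementation class:</p>

```
# src/main/resources/META-INF/services/io.opentelemetry.sdk.autoconfigure.spi.ResourceProvider
potatoes.PotatoesResourceProvider
```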
<h2>OpenTelemetry Java instrumentation</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation">OpenTelemetry Java Instrumentation</a> provides a java agent that instruments the application at runtime automatically for an extensive set of frameworks and libraries (see <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md">supported technologies</a>).</p>
<p>Using instrumentation means that the application bytecode and the embedded libraries are modified automatically to make them behave as if explicit modifications were made in their source code to call the OpenTelemetry SDK in order to create traces, spans and metrics.</p>
<p>When an application is deployed with the OpenTelemetry instrumentation agent, the cloud resource attribute providers for GCP and AWS are included but, since version 2.2.0, not enabled by default. You can enable them <a href="https://opentelemetry.io/docs/languages/java/automatic/configuration/#enable-resource-providers-that-are-disabled-by-default">through configuration</a> by setting the following properties:</p>
<ul>
<li>
<p>For AWS: <code>otel.resource.providers.aws.enabled=true</code></p>
</li>
<li>
<p>For GCP: <code>otel.resource.providers.gcp.enabled=true</code></p>
</li>
</ul>
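<p>For example, when attaching the instrumentation agent, the AWS provider can be enabled as a JVM system property (the agent path and application jar below are placeholders):</p>

```
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.resource.providers.aws.enabled=true \
     -jar my-app.jar
```

<p>The usual autoconfiguration mapping also applies, so the equivalent environment variable is OTEL_RESOURCE_PROVIDERS_AWS_ENABLED=true.</p>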
<h2>Elastic OpenTelemetry Java Distribution</h2>
<p>The Elastic OpenTelemetry Java distribution relies on the OpenTelemetry Java instrumentation, which we often refer to as the vanilla OpenTelemetry agent, and it thus inherits all of its features.</p>
<p>One major difference though is that the resource attributes providers for GCP and AWS are included and enabled by default to provide a better onboarding experience without extra configuration.</p>
<p>The minor cost is that application startup might be slightly slower, since the providers have to call an HTTP(S) endpoint. This overhead is usually negligible compared to overall application startup but can become significant for some setups.</p>
<p>In order to reduce the startup overhead, or when the cloud provider is known in advance, you can selectively disable unused provider implementations through configuration:</p>
<ul>
<li>
<p>For AWS: <code>otel.resource.providers.aws.enabled=false</code></p>
</li>
<li>
<p>For GCP: <code>otel.resource.providers.gcp.enabled=false</code></p>
</li>
</ul>
<h2>Conclusion</h2>
<p>In this blog post, we introduced what OpenTelemetry cloud resource attributes are and how they can be used and configured in application deployments using either the OpenTelemetry SDK/API or the instrumentation agent.</p>
<p>When using the Elastic OpenTelemetry Java distribution, those resource providers are automatically provided and enabled for an easy and simple onboarding experience.</p>
<p>Another very interesting aspect of the cloud resource attribute providers available in the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib">opentelemetry-java-contrib</a> repository is that they are maintained by their respective vendors (Google and Amazon). For the end user, this means those implementations should be well tested and robust to changes in the underlying infrastructure. For solution vendors like Elastic, it means we don't have to re-implement and reverse-engineer the infrastructure details of every cloud provider, which shows that investing in those common components is a net win for the broader OpenTelemetry community.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-java-automatic-cloud-resource-attributes/flexible-implementation-1680X980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql</link>
            <guid isPermaLink="false">opentelemetry-kubernetes-esql</guid>
            <pubDate>Wed, 01 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL enhances operational efficiency, data analysis, and issue resolution for SREs. This blog covers the advantages of ES|QL in Elastic Observability and how it can apply to managing issues instrumented with OpenTelemetry and running on Kubernetes.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.</p>
<p>As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.</p>
<p>In Elastic 8.11, a technical preview is now available of <a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">Elastic’s new piped query language, ES|QL (Elasticsearch Query Language)</a>, which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.</p>
<h2>Advantages of ES|QL for SREs</h2>
<p>SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:</p>
<ul>
<li><strong>Improved operational efficiency:</strong> By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.</li>
<li><strong>Enhanced analysis with insights:</strong> ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.</li>
<li><strong>Reduced mean time to resolution:</strong> ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.</li>
</ul>
<p>ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.</p>
<p>In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.</p>
<p>You can also check out our <a href="https://www.youtube.com/watch?v=vm0pBWI2l9c">Elastic Observability ES|QL Demo</a>, which walks through ES|QL functionality for Observability.</p>
<h2>ES|QL with AI Assistant</h2>
<p>As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-1-services.png" alt="1 - services" /></p>
<p>Using Elastic AI Assistant, you can easily ask for analysis, and in particular, we check on what the overall latency is across the application services.</p>
<pre><code class="language-plaintext">My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS
</code></pre>
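<p>For reference, an assistant-generated query for this prompt typically has roughly the following shape. The exact query the AI Assistant produces may differ; the field names below assume ECS mappings:</p>

```sql
FROM traces-apm*
| WHERE @timestamp >= NOW() - 1 hour
| STATS avg_latency = AVG(transaction.duration.us) BY service.name
| SORT avg_latency DESC
```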
<p>The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:</p>
<ul>
<li>load generator</li>
<li>front-end proxy</li>
<li>frontendservice</li>
<li>checkoutservice</li>
</ul>
<p>From a simple natural-language prompt, the AI Assistant generated a single ES|QL query that listed the latencies across the services.</p>
<p>Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through <strong>Elastic APM failure correlation</strong>, it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-2-failed-transaction.png" alt="2 - failed transaction" /></p>
<h2>ES|QL insightful and contextual analysis in Discover</h2>
<p>Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.</p>
<p>We use the following query in ES|QL in Elastic Discover:</p>
<pre><code class="language-sql">from metrics-*
| where kubernetes.container.status.last_terminated_reason != &quot;&quot; and kubernetes.namespace == &quot;default&quot;
| stats reason_count = count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count &gt; 0
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-3-two-horizontal-bar-graphs.png" alt="3 - horizontal graph" /></p>
<p>ES|QL analyzes thousands of metric events from Kubernetes and highlights two services that are restarting because they were OOMKilled.</p>
<p>The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-4-understanding-oomkilled.png" alt="4 - understanding oomkilled" /></p>
<p>We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-5-split-bar-graphs.png" alt="5 - split bar graphs" /></p>
<p>The query shows that average memory usage is fairly high for both services.</p>
<p>We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.</p>
<h2>Actionable alerts with ES|QL</h2>
<p>Suspecting that this issue might recur, we create an alert based on the ES|QL query we just ran, tracking any service that exceeds 50% memory utilization.</p>
<p>We modify the last query to find any service with high memory usage:</p>
<pre><code class="language-sql">FROM metrics*
| WHERE @timestamp &gt;= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name | where avg_memory_usage &gt; .5
</code></pre>
<p>With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect it to PagerDuty here, but we could choose from multiple connectors such as ServiceNow, Opsgenie, or email.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-6-create-rule.png" alt="6 - create rule" /></p>
<p>With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.</p>
<h2>Make the most of your data with ES|QL</h2>
<p>In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a> now in technical preview.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/demo-gallery/observability">Elastic Observability Tour</a></li>
<li><a href="https://www.elastic.co/blog/log-management-observability-operations">The power of effective log management</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Transforming Observability with the AI Assistant</a></li>
<li><a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">ES|QL announcement blog</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/ES_QL_blog-720x420-05.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Independence with OpenTelemetry on Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-observability</link>
            <guid isPermaLink="false">opentelemetry-observability</guid>
            <pubDate>Tue, 15 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry has become a key component for observability given its open standards and developer-friendly tools. See how easily Elastic Observability integrates with OTel to provide a platform that minimizes vendor lock-in and maximizes flexibility.]]></description>
            <content:encoded><![CDATA[<p>The drive for faster, more scalable services is on the rise. Our day-to-day lives depend on apps, from a food delivery app to have your favorite meal delivered, to your banking app to manage your accounts, to even apps to schedule doctor’s appointments. These apps need to be able to grow from not only a features standpoint but also in terms of user capacity. The scale and need for global reach drives increasing complexity for these high-demand cloud applications.</p>
<p>In order to keep pace with demand, most of these online apps and services (for example, mobile applications, web pages, SaaS) are moving to a distributed microservice-based architecture and Kubernetes. Once you’ve migrated your app to the cloud, how do you manage and monitor production, scale, and availability of the service? <a href="https://opentelemetry.io/">OpenTelemetry</a> is quickly becoming the de facto standard for instrumentation and collecting application telemetry data for Kubernetes applications.</p>
<p><a href="https://www.elastic.co/what-is/opentelemetry">OpenTelemetry (OTel)</a> is an open source project providing a collection of tools, APIs, and SDKs that can be used to generate, collect, and export telemetry data (metrics, logs, and traces) to understand software performance and behavior. OpenTelemetry recently became a CNCF incubating project and has a significant amount of growing community and vendor support.</p>
<p>While OTel provides a standard way to instrument applications and a standard telemetry format, it doesn’t provide any backend or analytics components. Hence, using OTel libraries for application, infrastructure, and user experience monitoring gives you the flexibility to choose the <a href="https://www.elastic.co/observability">observability tool</a> that fits your needs. There is no longer any vendor lock-in for application performance monitoring (APM).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-1.png" alt="" /></p>
<p>Elastic Observability natively supports OpenTelemetry and its OpenTelemetry protocol (OTLP) to ingest traces, metrics, and logs. All of Elastic Observability’s APM capabilities are available with OTel data. Hence the following capabilities (and more) are available for OTel data:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you will now be able to use Elastic’s powerful machine learning capabilities to accelerate analysis, and its alerting to help reduce MTTR.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-2.png" alt="" /></p>
<p>Given its open source heritage, Elastic also supports other CNCF based projects, such as Prometheus, Fluentd, Fluent Bit, Istio, Kubernetes (K8S), and many more.</p>
<p>This blog will show:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a> through a few easy steps</li>
<li>Highlight some of the Elastic APM capabilities and features around OTel data and what you can do with this data once it’s in Elastic</li>
</ul>
<p>In follow-up blogs, we will detail how to use Elastic’s machine learning with OTel telemetry data, how to instrument OTel application metrics for specific languages, how we can support Prometheus ingest through the OTel collector, and more. Stay tuned!</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.</li>
<li>Additionally, we are using an OTel manually instrumented version of the application. No OTel automatic instrumentation was used in this blog configuration.</li>
<li>Location of our clusters. While we used Google Kubernetes Engine (GKE), you can use any Kubernetes platform of your choice.</li>
<li>While Elastic can ingest telemetry directly from OTel instrumented services, we will focus on the more traditional deployment, which uses the OpenTelemetry Collector.</li>
<li>Prometheus and Fluentd/Fluent Bit, traditionally used to pull Kubernetes data, are not used in this configuration. Follow-up blogs will showcase this.</li>
</ul>
<p>Here is the configuration we will get set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-3.png" alt="Configuration to ingest OpenTelemetry data used in this blog" /></p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through setting up an <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a>:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up a GKE cluster</li>
<li>Bringing up the application</li>
<li>Configuring Kubernetes OTel Collector configmap to point to Elastic Cloud</li>
<li>Using Elastic Observability APM with OTel data for improved visibility</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-4.png" alt="" /></p>
<h3>Step 1: Bring up a K8S cluster</h3>
<p>We used Google Kubernetes Engine (GKE), but you can use any Kubernetes platform of your choice.</p>
<p>There are no special requirements for Elastic to collect OpenTelemetry data from a Kubernetes cluster. Any normal Kubernetes cluster on GKE, EKS, AKS, or Kubernetes compliant cluster (self-deployed and managed) works.</p>
<h3>Step 2: Load the OpenTelemetry demo application on the cluster</h3>
<p>Get your application on a Kubernetes cluster in your cloud service of choice or local Kubernetes platform. The application I am using is available <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a>.</p>
<p>First clone the directory locally:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/opentelemetry-demo.git
</code></pre>
<p>(Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.)</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file in elastic/opentelemetry-demo configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: the Elastic APM Server endpoint</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-apm-agents.png" alt="elastic apm agents" /></p>
<p>Once you obtain this, the first step is to create a secret key on the cluster with your Elastic APM server endpoint, and your APM Secret Token with the following instruction:</p>
<pre><code class="language-bash">kubectl create secret generic elastic-secret \
  --from-literal=elastic_apm_endpoint='YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX' \
  --from-literal=elastic_apm_secret_token='YOUR_APM_SECRET_TOKEN'
</code></pre>
<p>Don't forget to replace:</p>
<ul>
<li>YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX: your Elastic APM endpoint ( <strong>without https:// prefix</strong> ) with OTEL_EXPORTER_OTLP_ENDPOINT</li>
<li>YOUR_APM_SECRET_TOKEN: your Elastic APM secret token OTEL_EXPORTER_OTLP_HEADERS</li>
</ul>
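<p>The Helm chart then wires these secret values into the collector's OTLP exporter configuration. As a rough sketch of what that exporter section looks like (the exporter key and environment variable names below are assumptions, not the exact demo config):</p>

```yaml
# Sketch of an OpenTelemetry Collector exporter pointing at Elastic APM.
# ELASTIC_APM_ENDPOINT / ELASTIC_APM_SECRET_TOKEN are assumed to be injected
# from the elastic-secret created above.
exporters:
  otlp/elastic:
    endpoint: "https://${env:ELASTIC_APM_ENDPOINT}"
    headers:
      Authorization: "Bearer ${env:ELASTIC_APM_SECRET_TOKEN}"
```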
<p>Now execute the following commands:</p>
<pre><code class="language-bash"># switch to the kubernetes/elastic-helm directory
cd kubernetes/elastic-helm

# add the open-telemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# deploy the demo through helm install
helm install -f values.yaml my-otel-demo open-telemetry/opentelemetry-demo
</code></pre>
<p>Once your application is up on Kubernetes, you will have the following pods (or some variant) running on the <strong>default</strong> namespace.</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Output should be similar to the following:</p>
<pre><code class="language-bash">NAME                                                  READY   STATUS    RESTARTS      AGE
my-otel-demo-accountingservice-5c77754b4f-vwph6       1/1     Running   0             5d4h
my-otel-demo-adservice-6b8b7c7dc5-mb7j5               1/1     Running   0             5d4h
my-otel-demo-cartservice-76d94b7dcd-2g4lf             1/1     Running   0             5d4h
my-otel-demo-checkoutservice-988bbdb88-hmkrp          1/1     Running   0             5d4h
my-otel-demo-currencyservice-6cf4b5f9f6-vz9t2         1/1     Running   0             5d4h
my-otel-demo-emailservice-868c98fd4b-lpr7n            1/1     Running   6 (18h ago)   5d4h
my-otel-demo-featureflagservice-8446ff9c94-lzd4w      1/1     Running   0             5d4h
my-otel-demo-ffspostgres-867945d9cf-zzwd7             1/1     Running   0             5d4h
my-otel-demo-frauddetectionservice-5c97c589b9-z8fhz   1/1     Running   0             5d4h
my-otel-demo-frontend-d85ccf677-zg9fp                 1/1     Running   0             5d4h
my-otel-demo-frontendproxy-6c5c4fccf6-qmldp           1/1     Running   0             5d4h
my-otel-demo-kafka-68bcc66794-dsbr6                   1/1     Running   0             5d4h
my-otel-demo-loadgenerator-64c545b974-xfccq           1/1     Running   1 (36h ago)   5d4h
my-otel-demo-otelcol-fdfd9c7cf-6lr2w                  1/1     Running   0             5d4h
my-otel-demo-paymentservice-7955c68859-ff7zg          1/1     Running   0             5d4h
my-otel-demo-productcatalogservice-67c879657b-wn2wj   1/1     Running   0             5d4h
my-otel-demo-quoteservice-748d754ffc-qcwm4            1/1     Running   0             5d4h
my-otel-demo-recommendationservice-df78894c7-lwm5v    1/1     Running   0             5d4h
my-otel-demo-redis-7d48567546-h4p4t                   1/1     Running   0             5d4h
my-otel-demo-shippingservice-f6fc76ddd-2v7qv          1/1     Running   0             5d4h
</code></pre>
<h3>Step 3: Open Kibana and use the APM Service Map to view your OTel instrumented Services</h3>
<p>In the Elastic Observability UI under APM, select Service Map to see your services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-APM.png" alt="elastic observability APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-OTEL-service-map.png" alt="elastic observability OTEL service map" /></p>
<p>If you are seeing this, then the OpenTelemetry Collector is sending data into Elastic:</p>
<p><em>Congratulations, you've deployed the OpenTelemetry demo application and successfully ingested its telemetry data into Elastic!</em></p>
<h3>Step 4: What can Elastic show me?</h3>
<p>Now that the OpenTelemetry data is ingested into Elastic, what can you do?</p>
<p>First, you can view the APM service map (as shown in the previous step) — this will give you a full view of all the services and the transaction flows between services.</p>
<p>Next, you can now check out individual services and the transactions being collected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-overview.png" alt="elastic observability frontend overview" /></p>
<p>As you can see, the frontend details are listed. Everything from:</p>
<ul>
<li>Average service latency</li>
<li>Throughput</li>
<li>Main transactions</li>
<li>Failed transaction rate</li>
<li>Errors</li>
<li>Dependencies</li>
</ul>
<p>Let’s get to the trace. In the Transactions tab, you can review all the types of transactions related to the frontend service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-transactions.png" alt="elastic observability frontend transactions" /></p>
<p>Selecting the HTTP POST transaction, we can see the full trace with all the spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-HTTP-POST.png" alt="Average latency for this transaction, throughput, any failures, and of course the trace!" /></p>
<p>Not only can you review the trace, but you can also analyze what is contributing to higher-than-normal latency for HTTP POST.</p>
<p>Elastic uses machine learning to help identify any potential latency issues across the services from the trace. It’s as simple as selecting the Latency Correlations tab and running the correlation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-correlations.png" alt="elastic observability latency correlations" /></p>
<p>This shows that the high latency transactions are occurring in checkout service with a medium correlation.</p>
<p>You can then drill down into logs directly from the trace view and review the logs associated with the trace to help identify and pinpoint potential issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-distribution.png" alt="elastic observability latency distribution" /></p>
<h3>Analyze your data with Elastic machine learning (ML)</h3>
<p>Once OpenTelemetry metrics are in Elastic, start analyzing your data through Elastic’s ML capabilities.</p>
<p>A great review of these features can be found here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM telemetry to determine root causes in transactions</a>. And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>. We’ll follow up with additional blogs on leveraging Elastic’s machine learning capabilities for OpenTelemetry data.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you ingest and analyze OpenTelemetry data with Elastic’s APM capabilities.</p>
<p>A quick recap of what we covered:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a>, through a few easy steps</li>
<li>Highlight some of the Elastic APM capabilities and features around OTel data and what you can do with this once it’s in Elastic</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/illustration-scalability-gear-1680x980_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing cloud resources and cost with APM metadata in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/optimize-cloud-resources-apm-observability</link>
            <guid isPermaLink="false">optimize-cloud-resources-apm-observability</guid>
            <pubDate>Wed, 16 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Optimize cloud costs with Elastic APM. Learn how to leverage cloud metadata, calculate pricing, and make smarter decisions for better performance.]]></description>
            <content:encoded><![CDATA[<p>Application performance monitoring (APM) is much more than capturing and tracking errors and stack traces. Today’s cloud-based businesses deploy applications across various regions and even cloud providers. So, harnessing the power of metadata provided by the Elastic APM agents becomes more critical. Leveraging the metadata, including crucial information like cloud region, provider, and machine type, allows us to track costs across the application stack. In this blog post, we look at how we can use cloud metadata to empower businesses to make smarter and cost-effective decisions, all while improving resource utilization and the user experience.</p>
<p>First, we need an example application that allows us to monitor infrastructure changes effectively. We use a Python Flask application with the Elastic Python APM agent. The application is a simple calculator taking the numbers as a REST request. We utilize Locust — a simple load-testing tool to evaluate performance under varying workloads.</p>
<p>The next step includes obtaining the pricing information associated with the cloud services. Every cloud provider is different. Most of them offer an option to retrieve pricing through an API. But today, we will focus on Google Cloud and will leverage their pricing calculator to retrieve relevant cost information.</p>
<h2>The calculator and Google Cloud pricing</h2>
<p>To perform a cost analysis, we need to know the cost of the machines in use. Google provides a billing <a href="https://cloud.google.com/billing/v1/how-tos/catalog-api">API</a> and <a href="https://cloud.google.com/billing/docs/reference/libraries#client-libraries-install-python">Client Library</a> to fetch the necessary data programmatically. In this blog, we are not covering the API approach. Instead, the <a href="https://cloud.google.com/products/calculator">Google Cloud Pricing Calculator</a> is enough. Select the machine type and region in the calculator and set the count to 1 instance. It will then report the total estimated cost for this machine. Doing this for an e2-standard-4 machine type results in 107.7071784 US$ for a runtime of 730 hours.</p>
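<p>The per-hour and per-minute prices used later in this post follow directly from that monthly estimate, since the calculator assumes a 730-hour month:</p>

```python
# Derive hourly and per-minute prices from the calculator's monthly estimate
# (730 hours is the month length the Google pricing calculator assumes).
monthly = 107.7071784           # US$ per month, e2-standard-4 in europe-west4
hourly = monthly / 730          # US$ 0.14754408 per hour (rounded)
per_minute = hourly / 60        # US$ 0.002459068 per minute (rounded)

print(round(hourly, 8), round(per_minute, 9))
```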
<p>Now, let’s go to our Kibana® where we will create a new index inside Dev Tools. Since we don’t want to analyze text, we tell Elasticsearch® to treat every string field as a keyword. The index name is cloud-billing. If I later want to add Azure and AWS pricing, I can append it to the same index.</p>
<pre><code class="language-bash">PUT cloud-billing
{
  &quot;mappings&quot;: {
    &quot;dynamic_templates&quot;: [
      {
        &quot;stringsaskeywords&quot;: {
          &quot;match&quot;: &quot;*&quot;,
          &quot;match_mapping_type&quot;: &quot;string&quot;,
          &quot;mapping&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          }
        }
      }
    ]
  }
}
</code></pre>
<p>Next up is crafting our billing document:</p>
<pre><code class="language-bash">POST cloud-billing/_doc/e2-standard-4_europe-west4
{
  &quot;machine&quot;: {
    &quot;enrichment&quot;: &quot;e2-standard-4_europe-west4&quot;
  },
  &quot;cloud&quot;: {
    &quot;machine&quot;: {
       &quot;type&quot;: &quot;e2-standard-4&quot;
    },
    &quot;region&quot;: &quot;europe-west4&quot;,
    &quot;provider&quot;: &quot;google&quot;
  },
  &quot;stats&quot;: {
    &quot;cpu&quot;: 4,
    &quot;memory&quot;: 8
  },
  &quot;price&quot;: {
    &quot;minute&quot;: 0.002459068,
    &quot;hour&quot;: 0.14754408,
    &quot;month&quot;: 107.7071784
  }
}
</code></pre>
<p>We create a document and set a custom ID. This ID combines the instance name and the region, since a machine's cost may differ per region. Automatic IDs could be problematic because I might want to update a machine’s cost regularly. I could use a timestamped index for that and only ever use the latest matching document, but this way I can simply overwrite the document and not worry about it. I calculated the price down to minute and hour rates as well. The most important thing is the machine.enrichment field, which is the same as the ID. The same instance type can exist in multiple regions, but our enrich processor supports only match or range, so we create a name that can match explicitly, as in e2-standard-4_europe-west4. It’s up to you to decide whether you want the cloud provider in there and make it google_e2-standard-4_europe-west4.</p>
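<p>The enrichment-key convention is trivial to generate programmatically. A minimal helper (hypothetical, just illustrating the naming scheme described above) could look like this:</p>

```python
def enrichment_key(machine_type, region, provider=None):
    # Build the machine.enrichment value, also used as the document _id.
    # Including the provider prefix is optional, as discussed above.
    parts = [provider, machine_type, region] if provider else [machine_type, region]
    return '_'.join(parts)

print(enrichment_key('e2-standard-4', 'europe-west4'))
# e2-standard-4_europe-west4
print(enrichment_key('e2-standard-4', 'europe-west4', provider='google'))
# google_e2-standard-4_europe-west4
```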
<h2>Calculating the cost</h2>
<p>There are multiple ways of achieving this in the Elastic Stack. In this case, we will use an enrich policy, ingest pipeline, and transform.</p>
<p>The enrich policy is rather easy to set up:</p>
<pre><code class="language-bash">PUT _enrich/policy/cloud-billing
{
  &quot;match&quot;: {
    &quot;indices&quot;: &quot;cloud-billing&quot;,
    &quot;match_field&quot;: &quot;machine.enrichment&quot;,
    &quot;enrich_fields&quot;: [&quot;price.minute&quot;, &quot;price.hour&quot;, &quot;price.month&quot;]
  }
}

POST _enrich/policy/cloud-billing/_execute
</code></pre>
<p>Don’t forget to run the _execute call at the end. This is necessary to create the internal index that the enrich processor in the ingest pipeline reads from. The ingest pipeline is rather minimalistic — it calls the enrichment and renames a field. This is where our machine.enrichment field comes in. One caveat around enrichment is that whenever you add new documents to the cloud-billing index, you need to rerun the _execute statement. The final script processor calculates the total cost from the count of unique machines seen.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/cloud-billing
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;_temp.machine_type&quot;,
        &quot;value&quot;: &quot;{{cloud.machine.type}}_{{cloud.region}}&quot;
      }
    },
    {
      &quot;enrich&quot;: {
        &quot;policy_name&quot;: &quot;cloud-billing&quot;,
        &quot;field&quot;: &quot;_temp.machine_type&quot;,
        &quot;target_field&quot;: &quot;enrichment&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;enrichment.price&quot;,
        &quot;target_field&quot;: &quot;price&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_temp&quot;,
          &quot;enrichment&quot;
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;ctx.total_price=ctx.count_machines*ctx.price.hour&quot;
      }
    }
  ]
}
</code></pre>
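<p>Before wiring the pipeline into a transform, you can sanity-check it with the simulate API. This is a sketch: the enrich policy must have been executed first, and the sample document carries only the fields the pipeline needs:</p>

```bash
POST _ingest/pipeline/cloud-billing/_simulate
{
  "docs": [
    {
      "_source": {
        "cloud": {
          "machine": { "type": "e2-standard-4" },
          "region": "europe-west4",
          "provider": "google"
        },
        "count_machines": 2
      }
    }
  ]
}
```

<p>The response should show the enriched price fields and a total_price of twice the hourly rate.</p>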
<p>Since this is all configured now, we are ready for our Transform. For this, we need a data view that matches the APM data streams: traces-apm*, metrics-apm.*, and logs-apm.*. For the Transform, go to the Transform UI in Kibana and configure it in the following way:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-1-transform-configuration.png" alt="transform configuration" /></p>
<p>We are doing an hourly breakdown; therefore, we get a document per service, per hour, per machine type. The interesting bit is the aggregations: I want to see the average CPU usage as well as the 75th, 95th, and 99th percentiles, which lets me see how CPU usage is distributed within each hour. At the bottom, give the transform a name, select cloud-costs as the destination index, and select the cloud-billing ingest pipeline.</p>
<p>Here is the entire transform as a JSON document:</p>
<pre><code class="language-bash">PUT _transform/cloud-billing
{
  &quot;source&quot;: {
    &quot;index&quot;: [
      &quot;traces-apm*&quot;,
      &quot;metrics-apm.*&quot;,
      &quot;logs-apm.*&quot;
    ],
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;bool&quot;: {
              &quot;should&quot;: [
                {
                  &quot;exists&quot;: {
                    &quot;field&quot;: &quot;cloud.provider&quot;
                  }
                }
              ],
              &quot;minimum_should_match&quot;: 1
            }
          }
        ]
      }
    }
  },
  &quot;pivot&quot;: {
    &quot;group_by&quot;: {
      &quot;@timestamp&quot;: {
        &quot;date_histogram&quot;: {
          &quot;field&quot;: &quot;@timestamp&quot;,
          &quot;calendar_interval&quot;: &quot;1h&quot;
        }
      },
      &quot;cloud.provider&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.provider&quot;
        }
      },
      &quot;cloud.region&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.region&quot;
        }
      },
      &quot;cloud.machine.type&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;cloud.machine.type&quot;
        }
      },
      &quot;service.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;service.name&quot;
        }
      }
    },
    &quot;aggregations&quot;: {
      &quot;avg_cpu&quot;: {
        &quot;avg&quot;: {
          &quot;field&quot;: &quot;system.cpu.total.norm.pct&quot;
        }
      },
      &quot;percentiles_cpu&quot;: {
        &quot;percentiles&quot;: {
          &quot;field&quot;: &quot;system.cpu.total.norm.pct&quot;,
          &quot;percents&quot;: [
            75,
            95,
            99
          ]
        }
      },
      &quot;avg_transaction_duration&quot;: {
        &quot;avg&quot;: {
          &quot;field&quot;: &quot;transaction.duration.us&quot;
        }
      },
      &quot;percentiles_transaction_duration&quot;: {
        &quot;percentiles&quot;: {
          &quot;field&quot;: &quot;transaction.duration.us&quot;,
          &quot;percents&quot;: [
            75,
            95,
            99
          ]
        }
      },
      &quot;count_machines&quot;: {
        &quot;cardinality&quot;: {
          &quot;field&quot;: &quot;cloud.instance.id&quot;
        }
      }
    }
  },
  &quot;dest&quot;: {
    &quot;index&quot;: &quot;cloud-costs&quot;,
    &quot;pipeline&quot;: &quot;cloud-billing&quot;
  },
  &quot;sync&quot;: {
    &quot;time&quot;: {
      &quot;delay&quot;: &quot;120s&quot;,
      &quot;field&quot;: &quot;@timestamp&quot;
    }
  },
  &quot;settings&quot;: {
    &quot;max_page_search_size&quot;: 1000
  }
}
</code></pre>
<p>Once the transform is created and running, we need a Kibana data view for the index cloud-costs. For the transaction duration fields, use the custom field formatter inside Kibana and set the format to “Duration” with “microseconds” as the input unit.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-2-cloud-costs.png" alt="cloud costs" /></p>
<p>With that, everything is arranged and ready to go.</p>
<h2>Observing infrastructure changes</h2>
<p>Below I created a dashboard that allows us to identify:</p>
<ul>
<li>How much cost a certain service generates</li>
<li>CPU usage</li>
<li>Memory usage</li>
<li>Transaction duration</li>
<li>Cost-saving potential</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/elastic-blog-3-graphs.png" alt="graphs" /></p>
<p>From left to right, we want to focus on the very first chart. We have the bars representing the CPU as average in green and 95th percentile in blue on top. It goes from 0 to 100% and is normalized, meaning that even with 8 CPU cores, it will still read 100% usage and not 800%. The line graph represents the transaction duration, the average being in red, and the 95th percentile in purple. Last, we have the orange area at the bottom, which is the average memory usage on that host.</p>
<p>We immediately realize that our calculator does not need a lot of memory. Hovering over the graph reveals 2.89% memory usage. The e2-standard-8 machine that we are using has 32 GB of memory. We occasionally spike to 100% CPU in the 95th percentile. When this happens, the average transaction duration spikes to 2.5 milliseconds. Meanwhile, this machine costs us roughly 30 cents every hour. Using this information, we can now downsize to a better fit. The average CPU usage is around 11-13%, and the 95th percentile is not that far away.</p>
<p>Because we are using 8 CPUs, one could now say that 12.5% represents a full core, but that is just an assumption on paper. Nonetheless, we know there is a lot of headroom, and we can downscale quite a bit. In this case, I decided to go to 2 CPUs and 2 GB of RAM, known as e2-highcpu-2. This should fit my calculator application better. We barely touched the RAM: 2.89% of 32 GB is roughly 1 GB in use. After the change and reboot of the calculator machine, I started the same Locust test to identify my CPU usage and, more importantly, whether my transactions get slower, and if so, by how much. Ultimately, I want to decide whether 1 millisecond more latency is worth 10 more cents per hour. I added the change as an annotation in Lens.</p>
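<p>The memory claim is simple arithmetic: 2.89% of the 32 GB machine is well under a single gigabyte actually in use:</p>

```python
# Convert the observed memory-usage percentage into absolute gigabytes.
total_gb = 32
used_pct = 2.89
used_gb = total_gb * used_pct / 100
print(f'{used_gb:.2f} GB in use')  # 0.92 GB in use
```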
<p>After letting it run for a bit, we can now identify the smaller host's impact. In this case, we can see that the average did not change. However, the 95th percentile — as in 95% of all transactions are below this value — did spike up. It looks bad at first, but checking closely, it went from ~1.5 milliseconds to ~2.10 milliseconds, a ~0.6 millisecond increase. Now, you can decide whether avoiding that 0.6 millisecond increase is worth paying ~$180 more per month, or if the new latency is good enough.</p>
<h2>Conclusion</h2>
<p>Observability is more than just collecting logs, metrics, and traces. Linking user experience to cloud costs allows your business to identify areas where you can save money. Having the right tools at your disposal will help you generate those insights quickly. Making informed decisions about how to optimize your cloud cost and ultimately improve the user experience is the bottom-line goal.</p>
<p>The dashboard and data view can be found in my <a href="https://github.com/philippkahr/blogs/tree/main/apm-cost-optimisation">GitHub repository</a>. You can download the .ndjson file and import it using the Saved Objects inside Stack Management in Kibana.</p>
<h2>Caveats</h2>
<p>Pricing covers only the base machine: it excludes disks, static public IP addresses, and other additional costs such as operating system licenses. Furthermore, it excludes spot pricing, discounts, and free credits. Data transfer costs between services are also not included. We calculate cost purely from the rate of the service running; we are not checking billing intervals from Google Cloud, so in our case we effectively bill per minute regardless of what Google Cloud does. Counting unique instance.ids works as intended. However, a machine that is only running for one minute is still calculated at the hourly rate, so a machine running for one minute costs the same as one running for 50 minutes — at least in the way we calculate it. The transform uses calendar hour intervals; therefore, it's 8 am-9 am, 9 am-10 am, and so on.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/optimize-cloud-resources-apm-observability/illustration-out-of-box-data-vis-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Find answers quickly, correlate OpenTelemetry traces with existing ECS logs in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-ecs-unification-elastic</link>
            <guid isPermaLink="false">otel-ecs-unification-elastic</guid>
            <pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog we will discuss how EDOT enables you to collect existing ECS logs while ensuring a seamless and transparent move to OTel semantic conventions. The key benefit is that applications can continue sending logs as they do today, which minimizes the effort and impact on application developers.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is the undisputed standard for vendor-neutral instrumentation. However, most established organizations don't start from a blank slate. You likely have a mature ecosystem of applications already logging in Elastic Common Schema (ECS), supported by years of refined dashboards and alerting rules.</p>
<p><strong>The challenge is clear:</strong> How do you adopt OTel’s unified observability without abandoning your proven ECS-based logging?</p>
<p>In this guide, we’ll demonstrate how to bridge this gap using the <strong>Elastic Distribution of OpenTelemetry (EDOT)</strong>. We will first show you how to leverage the EDOT Collector to ingest your logs into Elasticsearch, ensuring a seamless transition that unlocks the full power of OTel’s distributed tracing without breaking your current workflows.</p>
<p>Once the data is flowing, we will explore how Elasticsearch's underlying mapping architecture ensures that your existing filters and visualizations remain fully functional through two key features:</p>
<ul>
<li>
<p><strong>Field Aliases:</strong> We’ll explain how Elastic uses aliases to ensure that legacy dashboards looking for <code>log.level</code> (ECS) still work perfectly, even as your new telemetry arrives as <code>severity_text</code> (OTel).</p>
</li>
<li>
<p><strong>Passthrough Fields:</strong> We’ll show how Elastic’s native OTel mapping structures use passthrough fields to handle OTel attributes. This ensures your data remains searchable and performant without the need for complex, manual schema migrations.</p>
</li>
</ul>
<p>By combining EDOT for ingestion with these intelligent mapping structures, you can maintain your existing Java ECS logging while evolving toward a unified, OTel-native future.</p>
<h2>The ECS Foundation</h2>
<p>We begin with a Java application using <strong>Log4j2</strong> and the <strong>ecs-java-plugin</strong>. This setup generates structured JSON logs in the <a href="https://www.elastic.co/docs/reference/ecs">Elastic Common Schema (ECS)</a>, which Elastic handles natively by leveraging the ECS logging plugins that integrate easily with common logging libraries across various programming languages.</p>
<p>The following <strong>Log4j2 Configuration Extract</strong> assumes that the Log4j2 dependencies have already been configured to include the required ECS plugin libraries:</p>
<pre><code class="language-xml">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;Configuration status=&quot;DEBUG&quot;&gt;
    &lt;Appenders&gt;
        &lt;Console name=&quot;LogToConsole&quot; target=&quot;SYSTEM_OUT&quot;&gt;
            &lt;EcsLayout serviceName=&quot;logger-app&quot; serviceVersion=&quot;v1.0.0&quot;/&gt;
        &lt;/Console&gt;
    &lt;/Appenders&gt;
    &lt;Loggers&gt;
        &lt;Root level=&quot;info&quot;&gt;
            &lt;AppenderRef ref=&quot;LogToConsole&quot;/&gt;
        &lt;/Root&gt;
    &lt;/Loggers&gt;
&lt;/Configuration&gt;
</code></pre>
<p><strong>Note:</strong> We will come back to the <code>&lt;EcsLayout serviceName=&quot;logger-app&quot; serviceVersion=&quot;v1.0.0&quot;/&gt;</code> setting later in this article. With Kubernetes deployments, these values can be automatically populated by the EDOT Collector, and the setting can be simplified to <code>&lt;EcsLayout/&gt;</code>.</p>
<h2>Introducing the Elastic Distribution of OpenTelemetry (EDOT)</h2>
<p>The <a href="https://www.elastic.co/docs/reference/opentelemetry">Elastic Distribution of OpenTelemetry (EDOT)</a> is more than just a repackaging; it is a curated set of OTel components (Collector and SDKs) optimized for Elastic Observability. Released in v8.15, it allows you to collect traces, metrics, and logs using standard OTel receivers while benefiting from Elastic-contributed enhancements like powerful log parsing and Kubernetes metadata enrichment.</p>
<p>EDOT's Primary Benefits:</p>
<p><strong>Deliver Enhanced Features Earlier:</strong> Provides features not yet available in &quot;vanilla&quot; OTel components, which Elastic continuously contributes upstream.</p>
<p><strong>Enhanced OTel Support:</strong> Offers enterprise-grade support and maintenance for fixes outside of standard OTel release cycles.</p>
<p>The question then becomes: How can users transition their ingestion architecture to an OTel-native approach while maintaining the ability to collect logs in ECS format?</p>
<p>This involves replacing classic collection and instrumentation components (like Elastic Agent and the Elastic APM Java Agent). Let us show you how this can be done step by step with the full suite of components provided by EDOT. A comprehensive view of the EDOT architecture components in Kubernetes is shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/architecture.png" alt="EDOT reference architecture in K8s" /></p>
<p>In a Kubernetes environment, EDOT components are typically installed via an OTel Operator and HELM chart. The main components are:</p>
<ul>
<li><strong>EDOT Collector Cluster:</strong> deployment used to collect cluster-wide metrics.</li>
<li><strong>EDOT Collector Daemon:</strong> daemonset used to collect node metrics, logs, and application telemetry data.</li>
<li><strong>EDOT Collector Gateway:</strong> performs pre-processing, aggregation, and ingestion of data into Elastic.</li>
</ul>
<p>Elastic provides a curated configuration file for all the EDOT components, available as part of the OpenTelemetry Operator using the <code>opentelemetry-kube-stack</code> Helm chart and downloadable from <a href="https://github.com/elastic/elastic-agent/blob/main/deploy/helm/edot-collector/kube-stack/values.yml">here</a>.</p>
<h2>Achieving Correlation: SDK + Logging Context</h2>
<p>To link a log line to a specific trace, the <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java">EDOT Java SDK</a> performs a &quot;handshake&quot; with your logging library.
When a trace is active, the SDK extracts the <code>trace_id</code> and <code>span_id</code> and injects them into the <strong>Mapped Diagnostic Context (MDC)</strong> of Log4j2. Even though your logs are in ECS format, they now carry the OTel DNA required for correlation.
While the EDOT SDK can collect logs directly, a generally more resilient approach is to stick to file collection. This is important because if the OTel Collector is down, logs written to a file are buffered locally on the disk, preventing the data loss that can occur if the SDK's in-memory queue reaches its limit and starts discarding new logs. For an in-depth discussion on this topic we refer to the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#log-instrumentation">OpenTelemetry Documentation</a>.</p>
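<p>For illustration (this is a hand-written sample, not captured output), an ECS log line emitted while a trace is active carries the correlation IDs alongside the usual ECS fields:</p>

```json
{
  "@timestamp": "2025-12-04T10:15:30.123Z",
  "log.level": "INFO",
  "message": "order processed",
  "ecs.version": "1.2.0",
  "service.name": "logger-app",
  "service.version": "v1.0.0",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331"
}
```

<p>The <code>trace_id</code> and <code>span_id</code> values shown are the standard W3C Trace Context example IDs, used here purely as placeholders.</p>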
<h2>Zero-Code Instrumentation</h2>
<p>The EDOT Java SDK is a customized version of the OpenTelemetry Java Agent. In Kubernetes, zero-code Java autoinstrumentation is supported by adding an <a href="https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/setup/k8s">annotation</a> in the pod template configuration in the deployment manifest:</p>
<pre><code class="language-yml">apiVersion: apps/v1
kind: Deployment
...
spec:
  ..
  template:
    metadata:
      # Auto-Instrumentation
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;opentelemetry-operator-system/elastic-instrumentation&quot;
</code></pre>
<h2>Collecting and Processing Logs with the EDOT Collector</h2>
<p>This is the most critical step. Our logs are now JSON, they are in the console output, and they contain trace IDs. Now, we need the EDOT Collector to pick them up and map them to the <strong>OpenTelemetry Log Data Model</strong>.</p>
<h3>EDOT Collector Configuration: Dynamic Workload Discovery and filelog receiver</h3>
<p>Applications running on containers become moving targets for monitoring systems. To handle this, we rely on <a href="https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector">Dynamic workload discovery on Kubernetes</a>. This allows the EDOT Collector to track pod lifecycles and dynamically attach log collection configurations based on specific annotations relying on the <code>k8s_observer</code> and the <code>receiver_creator</code> component.</p>
<p>In our example, we have a Deployment with a Pod consisting of one container. We use Kubernetes annotations to:</p>
<ol>
<li>
<p>Enable auto-instrumentation (Java).</p>
</li>
<li>
<p>Enable log collection for this pod.</p>
</li>
<li>
<p>Instruct the collector to parse the output as JSON immediately (json-parser configuration).</p>
</li>
<li>
<p>Add custom resource attributes (e.g., identify the application's source language).</p>
</li>
</ol>
<h4>Deployment Manifest Example</h4>
<pre><code class="language-yml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: logger-app-deployment
  labels:
    app: logger-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: logger-app
  template:
    metadata:
      annotations:
        # 1. Turn on Auto-Instrumentation
        instrumentation.opentelemetry.io/inject-java: &quot;opentelemetry-operator-system/elastic-instrumentation&quot;
        # 2. Enable Log Collection for this pod
        io.opentelemetry.discovery.logs/enabled: &quot;true&quot;
        # 3. Provide the parsing &quot;hint&quot; (Treat logs as JSON)
        io.opentelemetry.discovery.logs.ecs-log-producer/config: |
            operators:
            - type: container
              id: container-parser
            - type: json_parser
              id: json-parser
        # 4. Identify this application as Java (to allow for user interface rendering in Kibana)
        resource.opentelemetry.io/telemetry.sdk.language: &quot;java&quot;
      ...
</code></pre>
<p>This setup provides a bare-minimum configuration for ingesting ECS library logs.
Crucially, it decouples log collection from application logic. Developers simply need to provide a hint via annotations that their logs are in JSON format (structurally guaranteed by the ECS libraries). We then define the standardized enrichment and processing rules centrally at the <a href="https://www.elastic.co/docs/reference/edot-collector/components">processor</a> level in the (Daemon) EDOT Collector.</p>
<p>This centralization ensures consistency across the platform: if we need to update our standard formatting or enrichment strategies later, we apply the change once in the collector, and it automatically propagates to all services without developers needing to touch their manifests.</p>
<h4>(Daemon) EDOT Collector Configuration</h4>
<p>To enable this, we configure a Receiver Creator in the Daemon Collector. This component uses the <code>k8s_observer</code> extension to monitor the Kubernetes environment and automatically discover the target pods based on the annotations above.</p>
<pre><code class="language-yml">daemon:
  ...
  config:
    ...
    extensions:
      k8s_observer:
        auth_type: serviceAccount
        node: ${env:K8S_NODE_NAME}
        observe_nodes: true
        observe_pods: true
        observe_services: true
        ...
    receivers:
        receiver_creator/logs:
          watch_observers: [k8s_observer]
          discovery:
            enabled: true
    ...
...
</code></pre>
<p>Finally, we reference the <code>receiver_creator</code> in the pipeline instead of a static filelog receiver and we make sure to include the <code>k8s_observer</code> extension:</p>
<pre><code class="language-yml">daemon:
  ...
  config:
    ...
    service:
      extensions:
      - k8s_observer
      pipelines:
        # Pipeline for node-level logs
        logs/node:
          receivers:
            # - filelog             # We disable direct filelog receiver
            - receiver_creator/logs # Using the configured receiver_creator instead of filelog
          processors:
            - batch
            - k8sattributes
            - resourcedetection/system
          exporters:
            - otlp/gateway # Forward to the Gateway Collector for ingestion
</code></pre>
<h3>The Transformation Layer</h3>
<p>While the logs are structured, OTel sees them as generic attributes. To finalize the pipeline, we use the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/transformprocessor/README.md">transform processor</a>, which allows us to modify and restructure telemetry signals using the OpenTelemetry Transformation Language (OTTL).</p>
<p>We use the processor to promote specific ECS fields into the top-level OpenTelemetry fields and to rename attributes according to the OpenTelemetry Semantic Conventions:</p>
<ul>
<li>Promote the <code>message</code> attribute to the top-level <code>Body</code> field.</li>
<li>Promote the <code>log.level</code> attribute to the OTel <code>SeverityText</code> field.</li>
<li>Move the <code>@timestamp</code> attribute to the OTel <code>Time</code> field.</li>
<li>Map <code>trace_id</code> and <code>span_id</code> to the right log context.</li>
</ul>
<p>The following provides a sample <code>transform</code> configuration:</p>
<pre><code class="language-yml"> processors:
    transform/ecs_handler:
      log_statements:
      - context: log
        conditions:
          - log.attributes[&quot;ecs.version&quot;] != nil
        statements:
          # Map ECS fields to OTel Log Model
          - set(log.body, log.attributes[&quot;message&quot;])
          - set(log.time, Time(log.attributes[&quot;@timestamp&quot;], &quot;%Y-%m-%dT%H:%M:%SZ&quot;))
          - set(log.trace_id.string, log.attributes[&quot;trace_id&quot;])
          - set(log.span_id.string, log.attributes[&quot;span_id&quot;])
          - set(log.severity_text, log.attributes[&quot;log.level&quot;])
          # Cleanup original keys to save space
          - delete_key(log.attributes, &quot;message&quot;)
          - delete_key(log.attributes, &quot;trace_id&quot;)
          - delete_key(log.attributes, &quot;span_id&quot;)

          # Add here additional transformations as needed...
</code></pre>
<p><strong>Note:</strong> When working with EDOT Collector and the OpenTelemetry Kube-Stack Helm Chart, resource attributes such as <code>service.name</code> and <code>service.version</code> are automatically populated based on a set of <a href="https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/">well-defined</a>
rules by the <code>k8sattributes</code> processor. Thus, on Kubernetes we do not need to extract those fields from the log content itself.</p>
<p>Make sure to use the newly created processor in the logs pipeline for the Daemon Collector:</p>
<pre><code class="language-yml">service:
  pipelines:
    logs/node:
      receivers:
        - receiver_creator/logs
      processors:
        - batch
        - k8sattributes
        - resourcedetection/system
        - transform/ecs_handler          # Newly created transform processor
      exporters:
        - otlp/gateway
</code></pre>
<h2>The Compatibility Layer: Bridging ECS and OTel</h2>
<p>To bridge the gap between the Elastic Common Schema (ECS) and OpenTelemetry (OTel), Elastic provides a &quot;compatibility layer&quot; built directly into its Observability solution relying on existing index templates and mappings. This architecture allows you to send OTel-native data while still using your legacy ECS-based dashboards, saved searches, and other associated objects.</p>
<p>This &quot;bridge&quot; relies on two key features:</p>
<ul>
<li>
<p><strong>Bridging ECS and OTel with Passthrough:</strong> OpenTelemetry (OTel) data often uses deeply nested structures (e.g., <code>resource.attributes.*</code>). Elasticsearch uses the <strong><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/passthrough">Passthrough</a></strong> object type to &quot;promote&quot; these nested attributes to the top level when performing a search query. Any new metadata added by the OTel Collector is automatically searchable without the user needing to know the full JSON path. This creates a &quot;virtual flattening&quot; layer and ensures that all fields that match in name are automatically compatible, even though they're stored in different namespaces (<code>attributes</code>/<code>resource.attributes</code> for OTel vs. top-level for ECS). To learn more about field and attribute alignment between ECS and the OTel Semantic Conventions, refer to this <a href="https://www.elastic.co/docs/reference/ecs/ecs-otel-alignment-details">page</a>.</p>
</li>
<li>
<p><strong>Bridging with Field Aliases</strong>: Elastic relies on OTel mapping templates that include <code>Field Aliases</code>. These aliases link OTel semantic names back to their equivalent ECS fields at query time, handling fields whose names do not align with the OTel naming conventions.</p>
</li>
</ul>
<p><em>The Benefit:</em> If you have an existing dashboard looking for <code>message</code> (ECS), but your data is now indexed as <code>body.text</code> (OTel), an alias allows the dashboard to aggregate and visualize data from both sources simultaneously. This ensures that your existing filters and KQL queries work flawlessly whether the data originated from a Filebeat agent or a modern OTel SDK agent.</p>
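<p>As a sketch (the <code>logs-*</code> index pattern is an assumption), a legacy-style query against the ECS field name needs no changes; the alias resolves it to the OTel-native field at query time:</p>

```bash
# Equivalent to the KQL filter:  message : "order processed"
# "message" resolves via the field alias to "body.text" in OTel-native indices
GET logs-*/_search
{
  "query": {
    "match": {
      "message": "order processed"
    }
  }
}
```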
<p>Some more details about field aliases and pass-through objects can be found <a href="https://www.elastic.co/docs/reference/opentelemetry/compatibility/data-streams#query-compatibility-with-classic-apm-data-streams">here</a>.</p>
<p>Here is an example of the provided mapping template:</p>
<pre><code class="language-yml">{
  &quot;mappings&quot;: {
    ...
    &quot;properties&quot;: {
      &quot;log&quot;: {
          &quot;properties&quot;: {
            &quot;level&quot;: {
              &quot;type&quot;: &quot;alias&quot;,
              &quot;path&quot;: &quot;severity_text&quot;
            }
          }
        },
      &quot;message&quot;: {
        &quot;type&quot;: &quot;alias&quot;,
        &quot;path&quot;: &quot;body.text&quot;
      }
    ...
    }
  }
 }
</code></pre>
<p>This architectural approach provides three major advantages for teams in transition:</p>
<ul>
<li>
<p><strong>Zero Reindexing:</strong> You don't have to rewrite or migrate old data. Aliases resolve at query time, meaning your old indices and new indices can coexist in the same visualization.</p>
</li>
<li>
<p><strong>Future-Proofing:</strong> As OTel becomes the primary standard (following the donation of ECS to the OTel project), Elastic is shifting its native UI to look for OTel fields first. These mappings ensure that your legacy ECS-native data still appears in OTel-native views.</p>
</li>
<li>
<p><strong>Unified Observability:</strong> It enables &quot;Correlation by Default.&quot; Because the aliases link <code>trace_id</code> (OTel) and <code>trace.id</code> (ECS), you can jump from a legacy log to a modern OTel trace without losing context or breaking the drill-down path.</p>
</li>
</ul>
<h2>Sending data to Elasticsearch</h2>
<p>If you are running Elastic Serverless or the latest Elastic Cloud Hosted (ECH) v9.2+, you now have access to a managed OTLP endpoint. This native functionality allows you to route telemetry directly from your Collector Gateway to Elasticsearch using the OTLP protocol.</p>
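<p>In a collector configuration, the managed endpoint is addressed with a standard OTLP exporter. The endpoint URL and API key below are placeholders; use the OTLP endpoint details shown for your own Serverless or ECH deployment:</p>
<pre><code class="language-yml">exporters:
  otlp:
    # Placeholder values; copy the managed OTLP endpoint and an API key
    # from your deployment's connection details.
    endpoint: &quot;https://my-deployment.ingest.elastic.cloud:443&quot;
    headers:
      Authorization: &quot;ApiKey my-api-key&quot;
</code></pre>
<p>Reference this exporter from the <code>exporters</code> list of the logs, metrics, and traces pipelines in your Collector Gateway.</p>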
<p>Because we mapped our ECS fields to the OTel model in the collector, Elasticsearch recognizes the correlation immediately. You get the best of both worlds:</p>
<ul>
<li>
<p><em><strong>Legacy Compatibility:</strong></em> Your old ECS-based dashboards still work (with minor tweaks).</p>
</li>
<li>
<p><em><strong>Modern Power:</strong></em> You can now click &quot;View Trace&quot; directly from a log entry in Kibana's Observability UI.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/discovery.jpg" alt="Discover" /></p>
<h2>Conclusion</h2>
<p>Transitioning to OpenTelemetry doesn't have to be a &quot;big bang&quot; migration. By using the EDOT SDK and Collector, you can:</p>
<ul>
<li>
<p><em><strong>Protect your investment</strong></em> in ECS-based logging libraries.</p>
</li>
<li>
<p><em><strong>Centralize complexity</strong></em> by handling schema translation in the collector rather than the application.</p>
</li>
<li>
<p><em><strong>Enable full correlation</strong></em> between traces and logs with zero code changes.</p>
</li>
</ul>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-ecs-unification-elastic/blog-image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Data Quality Insights with the Instrumentation Score and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-instrumentation-score</link>
            <guid isPermaLink="false">otel-instrumentation-score</guid>
            <pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This post explores the Instrumentation Score for OpenTelemetry data quality, sharing practical insights, key learnings, and a hands-on look at implementing this approach with Elastic's powerful observability features.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry adoption is rapidly increasing and more companies rely on OpenTelemetry to collect observability data.
While OpenTelemetry offers clear specifications and semantic conventions to guide telemetry data collection, it also introduces significant flexibility.
With high flexibility comes high responsibility — many things can go wrong with OTel-based data collection, easily resulting in mediocre or low-quality telemetry.
Poor data quality can hinder backend analysis, confuse users, and degrade system performance.
To unlock actionable insights from OpenTelemetry data, maintaining high data quality is essential.
The <a href="https://instrumentation-score.com/">Instrumentation Score</a> initiative addresses this challenge by providing a standardized way to measure OpenTelemetry data quality.
Although the specification and tooling are still evolving, the underlying concepts are already compelling.
In this blog post, I’ll share my experience experimenting with the Instrumentation Score concept and demonstrate how to use the Elastic Stack — utilizing ES|QL, Kibana Task Manager, and Dashboards — to build a POC for data quality analysis based on this approach within Elastic Observability.</p>
<h2>Instrumentation Score - The Power of Rule-based Data Quality Analysis</h2>
<p>When you first hear the term &quot;Instrumentation Score&quot;, your initial reaction might be: &quot;OK, there's a <em>single</em>, percentage-like metric that tells me my instrumentation (i.e. OTel data) has a score of 60 out of 100.
So what? How does it help me?&quot;</p>
<p>However, the Instrumentation Score is much more than just a single number.
Its power lies in the individual rules from which the score is calculated.
The rule definitions' <code>rationale</code>, <code>impact level</code>, and <code>criteria</code> provide an evaluation framework that enables you to drill down into data quality issues and identify specific areas for improvement.
Also, the Instrumentation Score specification does not mandate specific tools and implementation details for calculating the score and rule evaluations.</p>
<p>As I explored the Instrumentation Score concepts, I developed the following mental model for deriving actionable insights.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/inst-score-drill-down.png" alt="Instrumentation Score Drill-down" /></p>
<h5>The Score</h5>
<p>The score itself is an indicator of the quality of your telemetry data. The lower the number, the more room for improvement with your data quality.
In general, if a score falls below 75, you should consider fixing your instrumentation and data collection.</p>
<h5>Breakdown by Instrumentation Score Rules</h5>
<p>Exploring the evaluation results of individual Instrumentation Score <em>rules</em> will give you insights into <em>what</em> is wrong with your data quality.
In addition, the rules' rationales explain <em>why</em> the violation of a rule is problematic.</p>
<p>As an example, let's take the <a href="https://github.com/instrumentation-score/spec/blob/main/rules/SPA-002.md"><code>SPA-002 rule</code></a>:</p>
<blockquote>
<p><strong>Description</strong>:</p>
<p>Traces do not contain orphan spans.</p>
<p><strong>Rationale</strong>:</p>
<p>Orphaned spans indicate potential issues in tracing instrumentation or data integrity. This can lead to incomplete or misleading trace data, hindering effective troubleshooting and performance analysis.</p>
</blockquote>
<p>If your data violates the <code>SPA-002</code> rule, you know <em>what</em> is wrong (i.e. you have broken traces), and the rationale explains why that is an issue (i.e. degraded analysis capabilities).</p>
<h5>Breakdown by Services</h5>
<p>When you have a large system with hundreds or maybe even thousands of entities (such as services, Kubernetes pods, etc.), a binary signal on all of the data — such as &quot;has a certain rule been passed or not&quot; — is not really actionable.
Is the data from all services violating a certain rule, or just a small subset of services?</p>
<p>Breaking down rule evaluation by services (and potentially other entity types) may help you to identify <em>where</em> there are issues with data quality.
For example, let's assume only one service — the <code>cart-service</code> — (out of your fifty services) is affected by the violation of rule <code>SPA-002</code>.
With that information, you can focus on fixing the instrumentation for the <code>cart-service</code> instead of having to check all fifty services.</p>
<p>Once you know which services (or other entities) violate which Instrumentation Score rules, you're very close to actionable insights.
However, there are two more things that I found to be extremely useful for data quality analysis when I was experimenting with the Instrumentation Score evaluation: (1) a quantitative indication of the extent, and (2) concrete examples of rule violation occurrences in your data.</p>
<h5>Quantifying the Rule Violation Extent</h5>
<p>The Instrumentation Score spec already defines an impact level (e.g. <code>NORMAL</code>, <code>IMPORTANT</code>, <code>CRITICAL</code>) per rule.
However, this only covers the &quot;importance&quot; of the rule itself, not the extent of a rule violation.
For example, if a single trace (out of a million traces) on your service has an orphan span, technically speaking the rule <code>SPA-002</code> is violated.
But is it really a relevant issue if only one out of a million traces is affected? Probably not. It definitely would be if half of your traces were broken.</p>
<p>Hence, having a quantitative indication of the extent of a rule violation per service — e.g. &quot;40% of your traces violate <code>SPA-002</code>&quot; — would provide additional information on how severe a rule violation actually is.</p>
<h5>Tangible Examples</h5>
<p>Finally, nothing is as meaningful and self-explanatory as tangible, concrete examples from your own data.
If the telemetry data of your <code>cart-service</code> violates <code>SPA-002</code> (i.e., has traces with orphan spans), wouldn't you want to see a concrete trace from that service that demonstrates the rule violation?
Analyzing concrete examples may give you hints about the root cause of broken traces — or, more generally, why your data violates Instrumentation Score rules.</p>
<h2>Instrumentation Score with Elastic</h2>
<p>The Instrumentation Score spec does not prescribe tool usage or implementation details for the calculation of the score and evaluation of the rules.
This allows for integrating the Instrumentation Score concept with whatever backend your OpenTelemetry data is being sent to.</p>
<p>With the goal of building a POC for an end-to-end integration of the Instrumentation Score with Elastic Observability, I combined the powerful capabilities of ES|QL with Kibana's task manager and dashboarding features.</p>
<p>Each Instrumentation Score rule can be formulated as an ES|QL query that covers the steps described above:</p>
<ul>
<li>rule passed or not</li>
<li>breakdown by services</li>
<li>calculation of the extent</li>
<li>sampling of an example occurrence</li>
</ul>
<p>Here is an example query for the <code>LOG-002</code> rule that checks the validity of the <code>severity_number</code> field:</p>
<pre><code class="language-esql">FROM logs-*.otel-* METADATA _id
| WHERE data_stream.type == &quot;logs&quot;
    AND @timestamp &gt; NOW() - 1h
| EVAL no_sev = severity_number IS NULL OR severity_number == 0
| STATS 
    logs_wo_severity = COUNT(*) WHERE no_sev,
    example = SAMPLE(_id, 1) WHERE no_sev,
    total = COUNT(*)
      BY service.name
| EVAL rule_passed = (logs_wo_severity == 0),
    extent = CASE(total != 0, logs_wo_severity / total, 0.0)
| KEEP rule_passed, service.name, example, extent
</code></pre>
<p>These rule evaluation queries are wrapped in a Kibana <code>instrumentation-score</code> plugin that utilizes the task manager for regular execution.
The <code>instrumentation-score</code> plugin then takes the results from all the evaluation queries for the different rules and calculates the final instrumentation score value (overall and broken down by service) following the <a href="https://github.com/instrumentation-score/spec/blob/main/specification.md#score-calculation-formula">Instrumentation Score spec's calculation formula</a>.
The resulting instrumentation score values, as well as the rule evaluation results (with the examples and extent) are then stored in separate Elasticsearch indices for consumption.</p>
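<p>The aggregation itself is easy to sketch. The following minimal Python sketch mimics an impact-weighted calculation in the spirit of the spec's formula; the weight values are illustrative assumptions, not the normative ones from the specification:</p>
<pre><code class="language-python"># Illustrative impact weights -- assumptions for this sketch,
# not the normative values from the Instrumentation Score spec.
IMPACT_WEIGHTS = {'NORMAL': 1, 'IMPORTANT': 2, 'CRITICAL': 4}

def instrumentation_score(rule_results):
    # rule_results: one entry per applicable rule, each carrying
    # the rule's impact level and whether the rule passed.
    total = sum(IMPACT_WEIGHTS[r['impact']] for r in rule_results)
    if total == 0:
        return None  # no applicable rules, so the score is undefined
    achieved = sum(IMPACT_WEIGHTS[r['impact']] for r in rule_results if r['passed'])
    return round(100 * achieved / total)

results = [
    {'impact': 'CRITICAL', 'passed': True},
    {'impact': 'IMPORTANT', 'passed': False},
    {'impact': 'NORMAL', 'passed': True},
]
print(instrumentation_score(results))  # 5 of 7 weight points, rounded to 71
</code></pre>
<p>Running the same aggregation per service, in addition to the overall value, is what enables the per-service breakdown.</p>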
<p>With the results stored in dedicated Elasticsearch indices, we can build Dashboards to visualize the Instrumentation Score insights and allow users to troubleshoot their data quality issues.</p>
<p>In this POC, I implemented a subset of the Instrumentation Score rules to prove out the approach.</p>
<p>The Instrumentation Score concept accommodates extension with your own custom rules.
In my POC, I did that as well, testing some quality rules that are not yet formalized in the Instrumentation Score spec
but that are important for Elastic Observability to extract the maximum value from OTel data.</p>
<h2>Applying the Instrumentation Score on the OpenTelemetry Demo</h2>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo</a> is the most-used environment to play around with and showcase OpenTelemetry capabilities.
Initially, I thought the demo would be the worst environment to test my Instrumentation Score implementation.
After all, it's the showcase environment for OpenTelemetry, and I expected it to have an Instrumentation Score close to 100.
Surprisingly, that wasn't the case.</p>
<p>Let's start with the overview.</p>
<h3>The Overview</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-overview.png" alt="Dashboard Overview" /></p>
<p>This dashboard shows an overview of the Instrumentation Score results for the OpenTelemetry Demo environment.
The first thing you might notice is the very low overall score <code>35</code> (top-left corner).
The table in the bottom-left corner shows a breakdown of the score by services.
Somewhat surprisingly, all the service scores are higher than the overall score.
How is that possible?</p>
<p>The main reason is that Instrumentation Score rules have, by definition, a binary result — passed or not.
So it can happen that each service fails just a single, but distinct, rule: each individual service score is then not perfect, but also not too bad.
From the overall perspective, however, many rules have failed (each by a different service), leading to a very low overall score.</p>
<p>In the table on the right, we see the results for the individual rules with their description, impact level, and example occurrences.
We see that 7 out of 11 implemented rules have failed. Let's pick our favorite example from earlier — <code>SPA-002</code> (in row 5), the orphan spans rule.</p>
<p>With the dashboard indicating that the rule <code>SPA-002</code> has failed, we know that there are orphan spans somewhere in our OTel traces. But where exactly?</p>
<p>For further analysis, we have two ways to drill down: (1) into a specific rule to see which services violate a specific rule, or (2) into a specific service to see which rules are violated by that service.</p>
<h3>Rule Drilldown</h3>
<p>The following dashboard shows a detailed view into the rule evaluation results for individual rules.
In this case we selected rule <code>SPA-002</code> at the top.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-rule-spa-002.png" alt="Dashboard Overview" /></p>
<p>In addition to the rule's meta information, such as its description, rationale, and criteria, we see some statistics on the right.
For example, we see that 2 services have failed that rule, 16 passed, and for 19 services this rule is not applicable (e.g., because they don't have tracing data).
In the table below, we see which two services are impacted by this rule violation: the <code>frontend</code> and <code>frontend-proxy</code> services.
For each service, we also see the <em>extent</em>. In the case of the <code>frontend</code> service, around 20% of traces have orphan spans.
This information is crucial as it gives an indication of how severe the rule violation actually is.
If it had been under 1%, this problem might have been negligible, but with one trace out of five being broken, it definitely needs to be fixed.
Also, for each of the services, we have an example <code>span.id</code> that is referenced as the <code>parent.id</code> of other spans but for which no span could be found.
This allows us to perform further analyses (e.g., by investigating the referring spans in Kibana's Discover) on concrete example cases.</p>
<p>With that view, we now know that the <code>frontend</code> service has a good amount of broken traces.
But is that service also violating other rules? And, if yes, which?</p>
<h3>Service Drilldown</h3>
<p>To answer the above question we can switch to the <code>Per Service</code> Dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/dashboard-service-frontend.png" alt="Dashboard Overview" /></p>
<p>In this dashboard, we see similar information to the overview dashboard, but filtered to a single selected service (the <code>frontend</code> service in this example).
In the table, we see that the <code>frontend</code> service violates three rules. We already know about <code>SPA-002</code> from the previous section.
In addition, the violation of the custom rule <code>SPA-C-001</code> shows that around 99% of transaction span names have high cardinality.
In Elastic Observability, <code>transactions</code> refer to service-local root spans (i.e., entry points into services).
In the example value, we see directly why the <code>span.name</code>s (here referred to as <code>transaction.name</code>s) have high cardinality.
The span name contains unique identifiers (here the session ID) as part of the URL that the span name is constructed from in the instrumentation.
As the <a href="https://www.elastic.co/docs/reference/edot-collector">EDOT Collector</a> derives metrics for transaction-type spans, we also can observe a violation of the <code>MET-001</code> which requires bound cardinality on metric dimensions.</p>
<p>As you can see, with the Instrumentation Score concept and a few different breakdown views, we were able to pinpoint data quality issues and identify which services and instrumentations need improvement to fix the issues.</p>
<h2>Learnings and Observations</h2>
<p>My experimentation with the Instrumentation Score was very insightful and showed me the power of this concept — though it's still in its early phase.
It is particularly insightful if the implementation and calculation include breakdowns by meaningful entities, such as services, K8s pods, hosts, etc.
With such a breakdown, you can narrow down data quality issues to a manageable scope, instead of having to sift through huge amounts of data and entities.</p>
<p>Furthermore, I realized that having some notion of problem extent (per rule and service), as well as concrete examples, helps make the problem more tangible.</p>
<p>Thinking further about the idea of rule violation <code>extent</code>, there might even be a way to incorporate that into the score formula itself.
In my humble opinion, this would make the score significantly more comparable and indicative of the actual impact.
I <a href="https://github.com/instrumentation-score/spec/issues/43">proposed this idea in an issue</a> on the Instrumentation Score project.</p>
<h2>Conclusion</h2>
<p>The Instrumentation Score is a powerful approach to ensuring a high level of data quality with OpenTelemetry.</p>
<p>Thank you to the maintainers — Antoine Toulme, Daniel Gomez Blanco, Juraci Paixão Kröhling, and Michele Mancioppi — for bringing this great project to life, and to all the contributors for their participation!</p>
<p>With proper implementation of the rules and score calculation, users can easily get actionable insights into what they need to fix in their instrumentation and data collection.
The Instrumentation Score rules are in an early stage and are steadily improved and extended.
I'm looking forward to what the community will build in the scope of this project in the future, and I hope to intensify my contributions as well.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-instrumentation-score/otel-instrumentation-grade.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Why Elastic donated its OpenTelemetry PHP distribution]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-php-distro-donation</link>
            <guid isPermaLink="false">otel-php-distro-donation</guid>
            <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn what the OpenTelemetry PHP distro donation changes for package-managed PHP environments, how it compares to existing options, and what contributors can do next.]]></description>
            <content:encoded><![CDATA[<p>We <a href="https://www.elastic.co/observability-labs/blog/opentelemetry-accepts-elastics-donation-of-edot">donated EDOT PHP</a> to make OpenTelemetry for PHP as easy to deploy as any other runtime.
PHP powers a significant number of websites and SaaS platforms, and we hope our contribution helps more teams adopt OpenTelemetry.
In many production environments, runtimes are locked down and building native extensions during deploy is not possible, so we focused on an OS-package-first path (<code>deb</code>, <code>rpm</code>, <code>apk</code>) for zero-code instrumentation.</p>
<p>Since the announcement of the donation, we have been actively working on the project and are about to release a first beta version.</p>
<p>In this post, we will walk through what was donated, why it matters for production PHP systems, how it relates to existing OpenTelemetry PHP projects, and the current status of the project.</p>
<h2>Why PHP observability can still be hard</h2>
<p>OpenTelemetry gives us a common standard, but deployment reality still matters. In many PHP environments, the blocker is not instrumentation APIs. The blocker is operations.</p>
<p>At Elastic, we are committed to open standards and to OpenTelemetry as the industry standard for observability data collection. To help the community with these operational constraints, we donated our EDOT PHP distribution to OpenTelemetry.</p>
<p>Common constraints include:</p>
<ul>
<li>Shared hosting or hardened servers without build toolchains</li>
<li>Production images that cannot be rebuilt frequently</li>
<li>Package-managed PHP runtimes where OS-native install flows are required</li>
<li>Teams that need adoption without app code changes</li>
</ul>
<p>This is where an OS-package distribution helps. If you can install a package and restart PHP, you can usually start collecting telemetry.</p>
<h2>What the OpenTelemetry PHP distro provides</h2>
<p>The project we donated combines native and PHP runtime components into one production path.
The key features we announced in our original donation proposal are implemented and we are close to a first beta release.
This includes:</p>
<ul>
<li>Native extension and loader artifacts so teams can install prebuilt components instead of compiling in restricted environments.</li>
<li>Runtime/bootstrap logic for auto-instrumentation so applications can emit telemetry with little or no code changes.</li>
<li>Packaging support for <code>deb</code>, <code>rpm</code>, and <code>apk</code> so rollout fits existing Linux package management and operations workflows.</li>
<li>Background telemetry sending and automatic root span behavior so trace data is captured consistently without custom bootstrapping logic or blocking of the main flow.</li>
<li>OTLP protobuf serialization works out of the box, with no need for the <code>ext-protobuf</code> extension. This means teams don't have to install extra dependencies, which is especially important in PHP environments where adding new extensions is difficult or restricted.</li>
<li>Inferred spans so users get added visibility into work that is not explicitly instrumented in application code.</li>
<li>URL grouping for transaction/root spans so high-cardinality route data is easier to aggregate and analyze.</li>
<li>Built-in OpAMP support utilizing an OpAMP client already present in the agent.</li>
</ul>
<p>For teams running PHP <code>8.1</code> through <code>8.4</code>, this gives a practical onboarding path that fits existing OS package operations.</p>
<h2>How installation looks in practice</h2>
<p>A typical flow is simple:</p>
<ol>
<li>Install distro package for your Linux platform.</li>
<li>Set exporter endpoint and auth headers.</li>
<li>Restart PHP processes.</li>
<li>Verify traces in your collector or backend.</li>
</ol>
<p>Example environment variables:</p>
<pre><code class="language-bash">export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://your-collector.example:4318&quot;
export OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer &lt;token&gt;&quot;
</code></pre>
<p>The key idea is predictable rollout through standard operational controls rather than custom build steps in each application pipeline.</p>
<p>For a detailed guide on how to set up the distro, see the <a href="https://github.com/open-telemetry/opentelemetry-php-distro/blob/main/docs/getting-started/setup.md">setup documentation</a>.</p>
<h2>Relationship to existing OpenTelemetry PHP instrumentation</h2>
<p>The current message from maintainers is coexistence with clear differentiation:</p>
<ul>
<li><strong>Distro path</strong>: package-managed, production-first, zero-code onboarding</li>
<li><strong>Composer-centric path</strong>: more manual control and portability where that is needed</li>
</ul>
<p>This distinction matters for users choosing a starting point. If you control application packaging tightly and can build extensions as part of app install, Composer-oriented paths can still be a fit. If you need an operations-first rollout through OS packages, the distro can reduce adoption friction.</p>
<p>The donation discussion also raised an important usability concern: too many overlapping options can confuse users. That is why compatibility and long-term alignment across projects is a key follow-up topic. The original proposal details are available in the <a href="https://github.com/open-telemetry/community/issues/2846">OpenTelemetry community donation issue</a>.</p>
<h2>What to validate before broad production rollout</h2>
<p>If you want to test this path in your own environment, validate these points early:</p>
<ul>
<li><strong>Runtime coverage</strong>: verify your PHP version and SAPI mode (PHP-FPM, Apache <code>mod_php</code>, CLI)</li>
<li><strong>Packaging fit</strong>: confirm your distro package format and architecture support</li>
<li><strong>Telemetry behavior</strong>: check span completeness, service naming, and exporter reliability</li>
<li><strong>Operational safety</strong>: verify restart procedures, rollback steps, and version pinning policy</li>
</ul>
<p>A lightweight validation matrix can save rework later, especially when multiple runtime profiles exist in the same organization.</p>
<h2>The current status and what's next</h2>
<p>With the completion of the initially announced features described above, we reached a significant milestone and are about to release a first beta version.</p>
<p>However, the work does not stop here. More enhancements and features are planned, including:</p>
<h3>Class Shadowing</h3>
<p>In some situations it is possible that the PHP distro loads classes that are already loaded by the application itself.
This can lead to collisions and unexpected behavior.
We are working on implementing shadowing of classes and namespaces of the PHP distro dependencies to avoid collisions.
This new feature will increase the stability and reliability of the PHP distro across a wider range of applications.</p>
<h3>Declarative Configuration Support</h3>
<p>Just recently the OpenTelemetry community announced the <a href="https://opentelemetry.io/blog/2026/stable-declarative-config/">stability of the declarative configuration specification</a>.
Including native support for the declarative configuration specification in the PHP distro is a feature that is on our roadmap and one of the next major features we will be working on.
With declarative configuration support, teams will be able to configure the PHP distro in a more flexible way without having to modify their application code.</p>
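<p>As a preview, a declarative configuration file following the OpenTelemetry file configuration schema looks roughly like the following sketch; the format version, key names, and endpoint are illustrative, so check the specification for the exact schema:</p>
<pre><code class="language-yml"># Illustrative sketch; verify the format version and supported keys
# against the declarative configuration specification.
file_format: &quot;1.0&quot;
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            protocol: http/protobuf
            endpoint: &quot;https://your-collector.example:4318&quot;
</code></pre>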
<p>We are working on implementing declarative configuration support in the distro to allow for more fine-grained configuration of the agent.
This new feature will increase the flexibility and usability of the PHP distro.</p>
<h3>PHP 8.5 Support</h3>
<p>In November 2025, PHP 8.5 was released with major changes to the language and runtime.
We are working on supporting it in the PHP distro which will expand the scope of compatibility and usability of the PHP distro across a wider range of applications and environments.</p>
<h3>Central Configuration Support</h3>
<p>Both declarative configuration and the <a href="https://github.com/open-telemetry/opentelemetry-specification/pull/4738">new proposal around telemetry policies</a> are promising enablers for dynamic, central configuration of OTel SDKs and distros.
Once the discussion around telemetry policies is resolved, we will be able to implement central configuration support in the PHP distro.
This will combine the concepts around declarative configuration, telemetry policies and the OpenTelemetry Agent Management Protocol (OpAMP) to provide a more flexible and powerful way to configure the PHP distro centrally.</p>
<h3>Base EDOT on the upstream distribution</h3>
<p>With the new OpenTelemetry PHP distro reaching a first beta release, we are working on basing Elastic's OTel PHP distribution (EDOT PHP) on the upstream distribution to increase the compatibility and avoid feature drift.
This will ensure that the PHP distro is always up to date with the latest features and bug fixes from the OpenTelemetry community.</p>
<h2>Conclusion</h2>
<p>Our OpenTelemetry PHP distro donation is mainly about operational accessibility. It gives PHP teams a package-native way to adopt OpenTelemetry where build-time instrumentation is difficult.
As community alignment progresses, we expect this to become a clearer and lower-friction option for production PHP observability. Try out the <a href="https://github.com/open-telemetry/opentelemetry-php-distro">OpenTelemetry PHP distro repository</a>, document gaps, and feed findings back to the maintainers.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-php-distro-donation/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry Profiles Signal Enters Alpha: Elastic’s Continuous Commitment to Profiling]]></title>
            <link>https://www.elastic.co/observability-labs/blog/otel-profiling-alpha</link>
            <guid isPermaLink="false">otel-profiling-alpha</guid>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry Profiles has officially reached Alpha, entrenching profiling as the fourth observability signal. Elastic's core contribution of its eBPF profiling agent, continued OpenTelemetry Profiles signal work and commitment to a vendor-agnostic ecosystem are driving this industry-wide standard forward.]]></description>
            <content:encoded><![CDATA[<p>Following intensive collaboration between Elastic and the OpenTelemetry community, we are thrilled to announce that the OpenTelemetry Profiles signal has officially entered public Alpha.
This milestone is a testament to the community's dedication and marks a significant step towards establishing profiling as the fourth key observability signal in OpenTelemetry, alongside logs, metrics and traces.</p>
<p>As a core contributor, Elastic is proud to have accelerated this effort by previously donating its Universal Profiling™ eBPF-based continuous profiling agent to OpenTelemetry.
This production-grade agent enables whole-system visibility across all applications, covering a multitude of programming languages and runtimes including third-party libraries and kernel operations with minimal overhead.
It allows SREs and developers to quickly identify performance bottlenecks, maximize resource utilization, and optimize cloud spend.</p>
<p>Additionally, over the last two years, Elastic has been heavily contributing to the OpenTelemetry Collector, Semantic Conventions and Profiling Special Interest Groups (SIGs) to lay the technical foundation for the promotion of Profiles to Alpha.</p>
<p>This Alpha milestone not only boosts the standardization of continuous profiling but also accelerates the practical adoption of profiling as the fourth key signal in observability.
Customers now have a vendor-agnostic way of collecting profiling data and enabling correlation with existing signals, like logs, metrics and traces, unveiling new potential for observability insights and a more efficient troubleshooting experience.</p>
<h2>What is continuous profiling?</h2>
<p>Profiling is a technique used to understand the behavior of a software application by collecting information about its execution.
This includes tracking the duration of function calls, memory usage, CPU usage, and other system resources.</p>
<p>However, traditional profiling solutions have significant drawbacks limiting adoption in production environments:</p>
<ul>
<li>Significant cost and performance overhead due to code instrumentation</li>
<li>Disruptive service restarts</li>
<li>Inability to get visibility into third-party libraries</li>
</ul>
<p>Unlike traditional profiling, which is often done only in a specific development phase or under controlled test conditions, continuous profiling runs in the background with minimal overhead, eliminating the need for service restarts or manual intervention.
This provides real-time, actionable insights without replicating issues in separate environments.
SREs, DevOps, and developers can see how code affects performance and cost, making code and infrastructure improvements easier.</p>
<h2>Elastic's contribution: Powering the Alpha</h2>
<p>The Elastic-donated profiler now forms the reference eBPF-based profiler implementation within OpenTelemetry: <a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler/">opentelemetry-ebpf-profiler</a>.
With the Alpha release, the eBPF profiler operates as an OpenTelemetry Collector receiver and contains numerous improvements such as automatic Go symbolization and support for new language runtimes.
Operating as an OpenTelemetry Collector receiver enables the profiler to seamlessly leverage existing OpenTelemetry processing and filtering pipelines.</p>
<p>For example, the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor">k8sattributesprocessor</a> can use the <code>container.id</code> resource attribute to automatically enrich every profile with its corresponding Kubernetes context.
This means you don't just see a raw stack trace; you see exactly which namespace, pod, and deployment produced it.</p>
<pre><code class="language-yaml">receivers:
  # Profiling receiver
  profiling: {}

processors:
  k8sattributes:
    passthrough: false 
    pod_association:
      - sources:
          - from: resource_attribute
            name: container.id
    extract:
      metadata:
        - &quot;k8s.namespace.name&quot;
        - &quot;k8s.deployment.name&quot;
        - &quot;k8s.replicaset.name&quot;
        - &quot;k8s.statefulset.name&quot;
        - &quot;k8s.daemonset.name&quot;
        - &quot;k8s.node.name&quot;
        - &quot;k8s.pod.name&quot;
        - &quot;k8s.pod.ip&quot;
        - &quot;k8s.pod.uid&quot;
</code></pre>
<p>Besides improvements to the eBPF profiler, Elastic has made significant contributions to:</p>
<ul>
<li>Correlating profiles with the information produced by OpenTelemetry eBPF instrumentation (<a href="https://opentelemetry.io/docs/zero-code/obi/">OBI</a>), a powerful auto-instrumentation tool that can enable distributed tracing.</li>
<li><a href="https://github.com/open-telemetry/opentelemetry-specification/pull/4719">Process Context Sharing OTEP</a> which is designed to bridge the gap between application SDKs and the profiler. This mechanism will allow OpenTelemetry SDKs to &quot;publish&quot; their resource attributes (like <code>service.name</code>) into a small, standardized memory region. Because this data is stored in the process's own memory map, the eBPF Profiler can instantly discover and associate it with its corresponding Profile.</li>
<li>Semantic conventions and integration of OpenTelemetry Profiles with Google's pprof format (transparent conversion)</li>
<li>OpenTelemetry Collector processing pipelines, allowing them to integrate better with the profiling receiver</li>
</ul>
<h2>Elastic's Next-Generation Profiling Development</h2>
<p>Elastic remains deeply committed to OpenTelemetry's vision and is pushing the boundaries of what is possible with profiling data.
We are dedicating a team of profiling domain experts to co-maintain and advance profiling capabilities within OpenTelemetry, while simultaneously working on groundbreaking features built on this new open standard.</p>
<p>Exciting areas of internal profiling-specific development include:</p>
<ul>
<li>OpenTelemetry Profiles derived Metrics: We are developing innovative ways to automatically generate actionable performance metrics directly from the raw OTel Profiles data, providing a new dimension for infrastructure modeling and alerting.</li>
<li>Rapid Integration with the Elastic Stack: We are making swift progress on first-class support for OTLP Profiles within the Elastic Stack, ensuring seamless ingestion (the ebpf-profiler receiver is already integrated with the <a href="https://github.com/elastic/elastic-agent/tree/main/internal/edot#components">Elastic Distributions of OpenTelemetry (EDOT) collector</a>), storage, and visualization of this new signal alongside your existing logs, metrics and traces.</li>
<li>AI-Powered Workflows: We are leveraging the deep insights provided by continuous profiling data to power new AI-driven workflows, enabling automatic root-cause analysis, anomaly detection, and intelligent optimization suggestions for both code and infrastructure.</li>
</ul>
<p>While the Alpha release marks a significant milestone, it is just the beginning.
We encourage the community to start testing early preview versions of the OTel Profiles integration and contribute to the ongoing profiling work.
To get started with an actual, local deployment, you can use the <a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler">OpenTelemetry eBPF profiler</a> in combination with a self-hosted <a href="https://www.elastic.co/docs/solutions/observability">Elastic Observability Stack</a> or <a href="https://github.com/elastic/devfiler">devfiler</a>, a standalone desktop application that acts as an OpenTelemetry Profiles compliant backend aimed at experimentation and development.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/otel-profiling-alpha/header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-1</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect and assess PII in your logs using Elasticsearch and NLP]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.</p>
<p>In <strong>Part 1</strong> of this blog, we will cover the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Here is the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h2>Tools and Techniques</h2>
<p>There are four general capabilities that we will use for this exercise.</p>
<ul>
<li>Named Entity Recognition Detection (NER)</li>
<li>Pattern Matching Detection</li>
<li>Log Sampling</li>
<li>Ingest Pipelines as Composable Processing</li>
</ul>
<h4>Named Entity Recognition (NER) Detection</h4>
<p>NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:</p>
<ul>
<li>Person: Names of individuals, including celebrities, politicians, and historical figures.</li>
<li>Organization: Names of companies, institutions, and organizations.</li>
<li>Location: Geographic locations, including cities, countries, and landmarks.</li>
<li>Event: Names of events, including conferences, meetings, and festivals.</li>
</ul>
<p>For our PII use case, we will choose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a>, which can be downloaded from <a href="https://huggingface.co">Hugging Face</a> and loaded into Elasticsearch as a trained model.</p>
<p><strong>Important Note:</strong>  NER / NLP Models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full logs volume through the NER Model. We will discuss the performance and scaling of the NER model in part 2 of the blog.</p>
<h4>Pattern Matching Detection</h4>
<p>In addition to NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">redact</a> processor is built for this use case.</p>
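<p>The redact processor works with Grok patterns, but its core mechanic is plain pattern substitution. As a rough sketch (the regexes below are hand-rolled illustrations, not the processor's built-in patterns):</p>

```python
import re

# Illustrative patterns only -- a real deployment would rely on the redact
# processor's Grok patterns rather than these hand-rolled regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(message: str) -> str:
    """Replace every match of each pattern with a labeled placeholder,
    similar in spirit to what the redact processor does with Grok."""
    for label, pattern in PII_PATTERNS.items():
        message = pattern.sub(f"<{label}>", message)
    return message
```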
<h4>Log Sampling</h4>
<p>Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.</p>
<h4>Ingest Pipelines as Composable Processing</h4>
<p>We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.</p>
<h2>Building the Processing Flow</h2>
<h4>Logs Sampling + Composable Ingest Pipelines</h4>
<p>The first thing we will do is set up a sampler for our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows sampling rates as low as ~0.01%, and marks the sampled logs with <code>sample.sampled: true</code>. Further processing of the logs is driven by the value of <code>sample.sampled</code>. The <code>sample.sample_rate</code> can be set here or &quot;passed in&quot; from the orchestration pipeline.</p>
<p>These commands should be run from Kibana -&gt; Dev Tools.</p>
<p>The code for the following three sections can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-1.json">here</a>.</p>
&lt;details open&gt;
  &lt;summary&gt;logs-sampler pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;if&quot;: &quot;ctx.sample.sample_rate == null&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 10000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Determine if keeping unsampled docs&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == null&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot; Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); &quot;&quot;&quot;,
        &quot;params&quot;: {
          &quot;max&quot;: 10000
        }
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.sample.random &lt;= ctx.sample.sample_rate&quot;,
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
         &quot;description&quot;: &quot;Drop unsampled document if applicable&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == false &amp;&amp; ctx.sample.sampled == false&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
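<p>Stripped of the ingest-pipeline plumbing, the sampling decision in the painless script above reduces to a single comparison. Here it is mirrored in Python for clarity (a sketch, using the same 0-10000 scale):</p>

```python
import random

def is_sampled(sample_rate: int, max_rate: int = 10000) -> bool:
    """Mirror of the logs-sampler script: draw a random integer in
    [0, max_rate) and keep the document when the draw is at or below
    sample_rate -- a rate of 1000 keeps roughly 10% of documents."""
    return random.randrange(max_rate) <= sample_rate
```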
<p>Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the <code>logs@custom</code> ingest pipeline that will be automatically called using the logs <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html#data-streams-pipelines">data stream framework</a> for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.</p>
<p>Next, we will create the <code>process-pii</code> pipeline. This is the core processing pipeline where we will orchestrate the PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Finally, we create the <code>logs@custom</code> pipeline, which will simply call our <code>process-pii</code> pipeline when the <code>data_stream.dataset</code> matches.</p>
&lt;details open&gt;
  &lt;summary&gt;logs@custom pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevel&quot;,
        &quot;value&quot;: &quot;logs@custom&quot;
      }
    },
        {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevelinfo&quot;,
        &quot;value&quot;: &quot;{{{data_stream.dataset}}}&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, 
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Now, let's test to see the sampling at work.</p>
<p>Load the data as described in the <a href="#data-loading-appendix">Data Loading Appendix</a>. We will use the sample data first and discuss how to test against your incoming or historical logs at the end of this blog.</p>
<p>If you look at Observability -&gt; Logs -&gt; Logs Explorer with the KQL filter <code>data_stream.dataset : pii</code> and break down by <code>sample.sampled</code>, you should see a split of approximately 10%.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-1-part-1.png" alt="PII Discover 1" /></p>
<p>At this point we have a composable ingest pipeline that is &quot;sampling&quot; logs. As a bonus, you can use this logs sampler for any other use cases you have as well.</p>
<h4>Loading, Configuration, and Execution of the NER Pipeline</h4>
<h5>Loading the NER Model</h5>
<p>You will need a Machine Learning node to run the NER model on. In this exercise, we are using an <a href="https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html">Elastic Cloud Hosted Deployment</a> on AWS with the <a href="https://www.elastic.co/guide/en/cloud/current/ec_selecting_the_right_configuration_for_you.html">CPU Optimized (ARM)</a> architecture. The NER inference will run on a Machine Learning AWS c5d node. GPU options will be available in the future, but today we will stick with the CPU architecture.</p>
<p>This exercise will use a single c5d node with 8 GB RAM and 4.2 vCPU, burstable up to 8.4 vCPU.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ml-node-part-1.png" alt="ML Node" /></p>
<p>Please refer to the official documentation on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-import-model.html">how to import an NLP-trained model into Elasticsearch</a> for complete instructions on uploading, configuring, and deploying the model.</p>
<p>The quickest way to get the model is using the Eland Docker method.</p>
<p>The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.</p>
<pre><code class="language-bash">docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

</code></pre>
<h5>Deploy and Start the NER Model</h5>
<p>In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>To deploy and start the NER model, we will use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.15/start-trained-model-deployment.html">Start trained model deployment API</a>.</p>
<p>We will configure the following:</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 byte cache, as we expect a low cache hit rate</li>
<li>8192 Queue</li>
</ul>
<pre><code># Start the model with 4 allocations x 1 thread, no cache, and a queue of 8192
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&amp;number_of_allocations=4&amp;threads_per_allocation=1&amp;queue_capacity=8192

</code></pre>
<p>You should get a response that looks something like this.</p>
<pre><code class="language-bash">{
  &quot;assignment&quot;: {
    &quot;task_parameters&quot;: {
      &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;deployment_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;model_bytes&quot;: 430974836,
      &quot;threads_per_allocation&quot;: 1,
      &quot;number_of_allocations&quot;: 4,
      &quot;queue_capacity&quot;: 8192,
      &quot;cache_size&quot;: &quot;0&quot;,
      &quot;priority&quot;: &quot;normal&quot;,
      &quot;per_deployment_memory_bytes&quot;: 430914596,
      &quot;per_allocation_memory_bytes&quot;: 629366952
    },
...
    &quot;assignment_state&quot;: &quot;started&quot;,
    &quot;start_time&quot;: &quot;2024-09-23T21:39:18.476066615Z&quot;,
    &quot;max_assigned_allocations&quot;: 4
  }
}
</code></pre>
<p>The NER model has been deployed and started and is ready to be used.</p>
<p>The following ingest pipeline implements the NER model via the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html">inference</a> processor.</p>
<p>There is a significant amount of code here, but only two items are of interest right now. The rest is conditional logic that drives additional behavior we will look at more closely later.</p>
<ol>
<li>
<p>The inference processor calls the NER model by ID (the model we loaded previously) and maps the <code>message</code> field to the model's <code>text_field</code> input, which is the text we want analyzed for PII.</p>
</li>
<li>
<p>The script processor loops through the message field and uses the data generated by the NER model to replace the identified PII with redacted placeholders. This looks more complex than it really is: it simply loops through the array of ML predictions, replaces each one in the message string with a labeled constant, and stores the result in a new field, <code>redact.message</code>. We will look at this a little closer in the following steps.</p>
</li>
</ol>
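<p>For readability, the replacement loop described in item 2 can be mirrored in Python (a sketch; the dictionary keys follow the <code>ml.inference</code> fields the pipeline produces):</p>

```python
def redact_entities(message, entities, skip_entity="NONE", minimum_score=0.0):
    """Mirror of the pipeline's painless script: swap each entity the NER
    model found for a labeled placeholder, honoring the redact.ner.*
    skip-entity and minimum-score settings."""
    for item in entities:
        if (item["class_name"] != skip_entity
                and item["class_probability"] >= minimum_score):
            message = message.replace(
                item["entity"], f"<REDACTNER-{item['class_name']}_NER>")
    return message
```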
<p>The code for the following three sections can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-2.json">here</a>.</p>
<p>The NER PII Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;logs-ner-pii-processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to keep ml results for debugging&quot;,
        &quot;field&quot;: &quot;redact.ner.keep_result&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to PER, LOC, ORG to skip, or NONE to not drop any replacement&quot;,
        &quot;field&quot;: &quot;redact.ner.skip_entity&quot;,
        &quot;value&quot;: &quot;NONE&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set the minimum class_probability required to redact; 0 redacts all matches&quot;,
        &quot;field&quot;: &quot;redact.ner.minimum_score&quot;,
        &quot;value&quot;: 0
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_NER_FAILED&quot;
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.ner.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.failure != 'REDACT_NER_FAILED'&quot;,
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;&quot;&quot;String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) &amp;&amp; 
          	  (item['class_probability'] &gt;= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '&lt;' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER&gt;')
          	}
          }
          ctx.redact.message = msg&quot;&quot;&quot;,
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_REPLACEMENT_SCRIPT_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.ml?.inference?.entities.size() &gt; 0&quot;, 
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.ner?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx.redact.ner.keep_result != true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>The updated PII Processor Pipeline, which now calls the NER Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    }
  ]
}

</code></pre>
&lt;/details&gt;
<p>Now reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<h3>Results</h3>
<p>Let's take a look at the results with the NER processing in place. In the Logs Explorer KQL query bar, execute the following query:
<code>data_stream.dataset : pii and ml.inference.entities.class_name : (&quot;PER&quot; and &quot;LOC&quot; and &quot;ORG&quot; )</code></p>
<p>Logs Explorer should look something like this; open the top message to see the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-2-part-1.png" alt="PII Discover 2" /></p>
<h4>NER Model Results</h4>
<p>Let's take a closer look at what these fields mean.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_name</code><br />
<strong>Sample Value:</strong> <code>[PER, PER, LOC, ORG, ORG]</code><br />
<strong>Description:</strong> An array of the named entity classes that the NER model has identified.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_probability</code><br />
<strong>Sample Value:</strong> <code>[0.999, 0.972, 0.896, 0.506, 0.595]</code><br />
<strong>Description:</strong> The class_probability is a value between 0 and 1 indicating how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. <strong>This is important, as in the next blog we will decide on a threshold to use for alerting and redaction.</strong>
You can see that in this example it identified a <code>LOC</code> as an <code>ORG</code>; we can find and filter these out by setting a threshold.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.entity</code><br />
<strong>Sample Value:</strong> <code>[Paul Buck, Steven Glens, South Amyborough, ME, Costco]</code><br />
<strong>Description:</strong> The array of entities identified that align positionally with the <code>class_name</code> and <code>class_probability</code>.</p>
<p><strong>Field:</strong> <code>ml.inference.predicted_value</code><br />
<strong>Sample Value:</strong> <code>[2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&amp;Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&amp;Steven+Glens), [South Amyborough](LOC&amp;South+Amyborough), [ME](ORG&amp;ME) 93580, Ordered from: [Costco](ORG&amp;Costco)</code><br />
<strong>Description:</strong> The predicted value of the model.</p>
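<p>Using the sample values above, a small sketch shows how a threshold could be applied to the parallel <code>ml.inference</code> arrays to surface low-confidence classifications (the 0.75 threshold is an arbitrary illustration):</p>

```python
def low_confidence_entities(class_names, probabilities, entities, threshold=0.75):
    """Pair the parallel ml.inference arrays and return the entities whose
    class_probability falls below the chosen threshold."""
    return [
        (entity, name, prob)
        for name, prob, entity in zip(class_names, probabilities, entities)
        if prob < threshold
    ]
```

<p>Run against the sample values above, this flags the two <code>ORG</code> classifications, including the <code>LOC</code> that was misidentified as an <code>ORG</code>.</p>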
<h4>PII Assessment Dashboard</h4>
<p>Let's take a quick look at a dashboard built to assess the PII in the data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-1.ndjson</code> file that can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson</a></p>
<p>More complete instructions on Kibana Saved Objects can be found <a href="https://www.elastic.co/guide/en/kibana/current/managing-saved-objects.html">here</a>.</p>
<p>After loading the dashboard, navigate to it, select an appropriate time range, and you should see something like the image below. It shows metrics such as sample rate, percent of logs with NER detections, NER score trends, etc. We will examine the assessment and actions in Part 2 of this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-dashboard-1-part-1.png" alt="PII Dashboard 1" /></p>
<h2>Summary and Next Steps</h2>
<p>In this first part of the blog, we have accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In the upcoming <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10,000 random logs in a file named <code>pii.log</code> with a mix of logs that do and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following:</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note:</strong> To reload the logs, you can simply re-run the command below. You can run it multiple times during this exercise and the logs will be loaded again. The new logs will not collide with previous runs, as each run gets a unique <code>run.id</code>, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
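<p>The collision-free reloads come from tagging every document in a load with the same unique <code>run.id</code>. A hypothetical sketch of that idea (the actual <code>load_logs.py</code> implementation may differ):</p>

```python
# Sketch: tag every document in one load with a shared, unique run.id
# so repeated loads never collide (hypothetical; load_logs.py may differ).
import uuid

def tag_docs_with_run_id(docs):
    """Attach the same unique run.id to every document in one load."""
    run_id = str(uuid.uuid4())
    for doc in docs:
        doc["run"] = {"id": run_id}
    print(f"Loaded {len(docs)} docs with run.id: {run_id}")
    return run_id

run_id = tag_docs_with_run_id([{"message": "log line 1"}, {"message": "log line 2"}])
```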
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ner-regex-assess-redact-part-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-2</guid>
            <pubDate>Tue, 22 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect, assess, and redact PII in your logs using Elasticsearch, NLP and Pattern Matching]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs being ingested into Elasticsearch.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we covered the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we will cover the following:</p>
<ul>
<li>Apply the <code>redact</code> regex pattern processor and assess the results</li>
<li>Create Alerts using ESQL</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Reminder of the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h3>Part 1 Prerequisites</h3>
<p>This blog picks up where <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a> left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.</p>
<ul>
<li>Loaded and configured NER Model</li>
<li>Installed all the composable ingest pipelines from Part 1 of the blog</li>
<li>Installed dashboard</li>
</ul>
<p>You can access the <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-blog-1-complete.json">complete solution for Blog 1 here</a>. Don't forget to load the dashboard, found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<h3>Applying the Redact Processor</h3>
<p>Next, we will apply the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html"><code>redact</code> processor</a>. It is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces any matches with literal placeholders. The <code>redact</code> processor is reasonably performant and can run at scale; we will discuss this in detail in the <a href="#production-scaling">production scaling</a> section at the end.</p>
<p>Elasticsearch comes packaged with a number of useful predefined <a href="https://github.com/elastic/elasticsearch/blob/8.15/libs/grok/src/main/resources/patterns/ecs-v1">patterns</a> that can be conveniently referenced by the <code>redact</code> processor. If one does not suit your needs, you can create a new pattern with a custom definition. The <code>redact</code> processor replaces every occurrence of a match; if there are multiple matches, they will all be replaced with the pattern name.</p>
<p>In the code below, we leverage some of the predefined patterns and construct several custom ones.</p>
<pre><code class="language-bash">        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,      &lt;&lt; Predefined
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,           &lt;&lt; Predefined
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;, &lt;&lt; Custom
          &quot;%{SSN:SSN_REGEX}&quot;,                 &lt;&lt; Custom
          &quot;%{PHONE:PHONE_REGEX}&quot;              &lt;&lt; Custom
        ]
</code></pre>
<p>We also replaced the PII with easily identifiable patterns we can use for assessment.</p>
<p>In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many &quot;secrets&quot; patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.</p>
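<p>To make the behavior concrete, here is a minimal Python sketch of the find-and-replace the <code>redact</code> processor performs, using the same custom pattern definitions as the pipeline below (the email pattern is a simplified stand-in for the predefined Grok <code>EMAILADDRESS</code> pattern, and the prefix/suffix mirror the pipeline's <code>REDACTPROC</code> markers):</p>

```python
import re

# The last three patterns mirror the pipeline's custom pattern_definitions;
# EMAIL_REGEX is a simplified stand-in for the Grok EMAILADDRESS pattern.
PATTERNS = {
    "EMAIL_REGEX": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "CREDIT_CARD_REGEX": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN_REGEX": r"\d{3}-\d{2}-\d{4}",
    "PHONE_REGEX": r"(?:\+\d{1,2}\s?)?1?-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}

def redact(message, prefix="<REDACTPROC-", suffix=">"):
    """Replace every occurrence of each pattern with its pattern name."""
    for name, pattern in PATTERNS.items():
        message = re.sub(pattern, f"{prefix}{name}{suffix}", message)
    return message

msg = ("Payment ok for jane@example.com, card 4444-5555-6666-7777, "
       "SSN 123-45-6789, phone (555) 123-4567")
print(redact(msg))
```

The same find-and-replace structure works for any secrets pattern you add to the dictionary.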
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-1.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;redact processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;prefix&quot;: &quot;&lt;REDACTPROC-&quot;,
        &quot;suffix&quot;: &quot;&gt;&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;,
          &quot;%{SSN:SSN_REGEX}&quot;,
          &quot;%{PHONE:PHONE_REGEX}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;&quot;&quot;\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}&quot;&quot;&quot;,
          &quot;SSN&quot;: &quot;&quot;&quot;\d{3}-\d{2}-\d{4}&quot;&quot;&quot;,
          &quot;PHONE&quot;: &quot;&quot;&quot;(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}&quot;&quot;&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_PROCESSOR_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.proc.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message.contains('REDACTPROC')&quot;,
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.proc?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>And now, we will add the <code>logs-pii-redact-processor</code> pipeline to the overall <code>process-pii</code> pipeline.</p>
&lt;details open&gt;
  &lt;summary&gt;updated process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated process-pii pipeline that now calls the NER and redact processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in the <a href="#reloading-the-logs">Reloading the logs</a> section. If you have not generated the logs yet, follow the instructions in the <a href="#data-loading-appendix">Data Loading Appendix</a>.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.message: REDACTPROC</code>. Then add the <code>redact.message</code> field to the table, and you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-1-part-2.png" alt="PII Discover Blog 2 Part 1" /></p>
<p>If you did not already load the dashboard from Blog Part 1, load it now via Kibana -&gt; Stack Management -&gt; Saved Objects -&gt; Import; it can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<p>It should look something like this now. Note that the REGEX portions of the dashboard are now active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-1-part-2.png" alt="PII Dashboards Blog 2 Part 1" /></p>
<h2>Checkpoint</h2>
<p>At this point, we have the following capabilities:</p>
<ul>
<li>Ability to sample incoming logs and apply this PII redaction</li>
<li>Detect and Assess PII with the NER/NLP and Pattern Matching</li>
<li>Assess the amount, type and quality of the PII detections</li>
</ul>
<p>This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.</p>
<ul>
<li>Clean up the working and unredacted data</li>
<li>Update the Dashboard to work with the cleaned-up data</li>
<li>Apply Role Based Access Control to protect the raw unredacted data</li>
<li>Create Alerts</li>
<li>Production and Scaling Considerations</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Applying to Production Systems</h2>
<h3>Cleanup working data and update the dashboard</h3>
<p>And now we will add the cleanup code to the overall <code>process-pii</code> pipeline.</p>
<p>In short, we set a flag <code>redact.enable: true</code> that directs the pipeline to move the unredacted <code>message</code> field to <code>raw.message</code> and then move the redacted message field <code>redact.message</code> to the <code>message</code> field. We will &quot;protect&quot; the <code>raw.message</code> field in the following section.</p>
<p><strong>NOTE:</strong> Of course, you can change this behavior if you want to completely delete the unredacted data. In this exercise, we will keep and protect it.</p>
<p>In addition we set <code>redact.cleanup: true</code> to clean up the NLP working data.</p>
<p>These fields allow a lot of control over what data you decide to keep and analyze.</p>
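<p>The rename-and-cleanup behavior can be sketched in a few lines of Python (a simplified model of the ingest document, not the actual processor implementation):</p>

```python
# Simplified model of the rename/remove processors added to process-pii:
# when PII was found and redaction is enabled, the unredacted message is
# preserved in raw.message and the redacted text becomes the message.

def finalize_doc(doc, redact_enable=True, cleanup=True):
    redact_meta = doc.get("redact", {})
    if redact_meta.get("pii", {}).get("found") and redact_enable:
        doc["raw"] = {"message": doc.pop("message")}   # keep original, protected later by RBAC
        doc["message"] = redact_meta.pop("message")    # redacted text becomes the message
    if cleanup:
        doc.pop("ml", None)                            # drop NLP working data
    return doc

doc = {
    "message": "SSN 123-45-6789",
    "redact": {"message": "SSN <REDACTPROC-SSN_REGEX>", "pii": {"found": True}},
    "ml": {"inference": {"entities": []}},
}
finalize_doc(doc)
print(doc["message"])         # redacted text
print(doc["raw"]["message"])  # original, to be protected with field-level security
```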
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-2.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;updated process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated process-pii pipeline that now calls the NER and redact processor pipelines and cleans up working data
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;raw.message&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;target_field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to clean up working data&quot;,
        &quot;field&quot;: &quot;redact.cleanup&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.cleanup == true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in the <a href="#reloading-the-logs">Reloading the logs</a> section.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.pii.found: true</code>. Then add the following fields to the table:</p>
<p><code>message</code>,<code>raw.message</code>,<code>redact.ner.found</code>,<code>redact.proc.found</code>,<code>redact.pii.found</code></p>
<p>You should see something like this
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-2-part-2.png" alt="PII Discover Part 2 Blog 2" /></p>
<p>We have everything we need to move forward with protecting the PII and Alerting on it.</p>
<p>Load the new dashboard that works on the cleaned-up data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-2.ndjson</code> file that can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-dashboard-part-2.ndjson">here</a>.</p>
<p>The new dashboard should look like this. Note: it uses different fields under the covers, since we have cleaned up the underlying data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-2-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Apply Role Based Access Control to protect the raw unredacted data</h3>
<p>Elasticsearch natively supports role-based access control, including field- and document-level access control; this dramatically reduces the operational and maintenance complexity required to secure our application.</p>
<p>We will create a Role that does not allow access to the <code>raw.message</code> field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the <code>message</code> field, but will not be able to access the protected <code>raw.message</code> field.</p>
<p><strong>NOTE:</strong> Since we only sampled 10% of the data in this exercise, the non-sampled <code>message</code> fields are not moved to <code>raw.message</code>, so they are still viewable; but this shows the capability you can apply in a production system.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-rbac.json">The code can be found here</a> for the following section of code.</p>
&lt;details open&gt;
  &lt;summary&gt;RBAC protect-pii role and user code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 &quot;cluster&quot;: [],
 &quot;indices&quot;: [
   {
     &quot;names&quot;: [
       &quot;logs-*&quot;
     ],
     &quot;privileges&quot;: [
       &quot;read&quot;,
       &quot;view_index_metadata&quot;
     ],
     &quot;field_security&quot;: {
       &quot;grant&quot;: [
         &quot;*&quot;
       ],
       &quot;except&quot;: [
         &quot;raw.message&quot;
       ]
     },
     &quot;allow_restricted_indices&quot;: false
   }
 ],
 &quot;applications&quot;: [
   {
     &quot;application&quot;: &quot;kibana-.kibana&quot;,
     &quot;privileges&quot;: [
       &quot;all&quot;
     ],
     &quot;resources&quot;: [
       &quot;*&quot;
     ]
   }
 ],
 &quot;run_as&quot;: [],
 &quot;metadata&quot;: {},
 &quot;transient_metadata&quot;: {
   &quot;enabled&quot;: true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 &quot;password&quot; : &quot;mypassword&quot;,
 &quot;roles&quot; : [ &quot;protect-pii&quot; ],
 &quot;full_name&quot; : &quot;Stephen Brown&quot;
}

</code></pre>
 &lt;/details&gt;
<p>Now log in from a separate window as the new user <code>stephen</code> with the <code>protect-pii</code> role. Go to Discover, put <code>redact.pii.found : true</code> in the KQL bar, and add the <code>message</code> field to the table. Notice that <code>raw.message</code> is not available.</p>
<p>You should see something like this
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-3-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Create an Alert when PII Detected</h3>
<p>Now, with the pipeline processing in place, creating an alert when PII is detected is easy. Review <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Alerting in Kibana</a> in detail if needed.</p>
<p>NOTE: <a href="#reloading-the-logs">Reload</a> the data if needed to have recent data.</p>
<p>First, we will create a simple <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL query</a> in Discover.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-esql-alert-blog-2.txt">The code can be found here.</a></p>
<pre><code>FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count &gt; 0
</code></pre>
<p>When you run this you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-esql-1-part-2.png" alt="PII ESQL Part 1 Blog 2" /></p>
<p>Now click the Alerts menu and select <code>Create search threshold rule</code>; we will create an alert that notifies us when PII is found.</p>
<p><strong>Select a time field: @timestamp
Set the time window: 5 minutes</strong></p>
<p>Assuming you loaded the data recently, when you run <strong>Test</strong> you should see something like:</p>
<p>pii_count: <code>343</code><br />
Alerts generated: <code>query matched</code></p>
<p>Add an action when the alert is Active.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Query matched</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>Add an Action for when the Alert is Recovered.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Recovered</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>When everything is set up, it should look like this; then click <code>Save</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-1-part2.png" alt="Alert Setup" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-2-part2.png" alt="Action Alert" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-3-part2.png" alt="Action Alert" /></p>
<p>You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.</p>
<pre><code>Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<p>And then if you wait you will get a Recovered alert that looks like this.</p>
<pre><code>Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<h3>Production Scaling</h3>
<h4>NER Scaling</h4>
<p>As we mentioned in <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#named-entity-recognition-ner-detection">Part 1 of this blog</a>, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.</p>
<p>Please review <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#loading-configuration-and-execution-of-the-ner-pipeline">the setup and configuration of the NER</a> model from Part 1 of the blog.</p>
<p>We chose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a> for our PII case.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>The metrics below are related to the model and configuration from Part 1 of the blog.</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate.
<strong>Note:</strong> If there are many repeated logs, the cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process</li>
<li>8192 Queue</li>
</ul>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           &quot;node&quot;: {
              &quot;0m4tq7tMRC2H5p5eeZoQig&quot;: {
.....
                &quot;attributes&quot;: {
                  &quot;xpack.installed&quot;: &quot;true&quot;,
                  &quot;region&quot;: &quot;us-west-1&quot;,
                  &quot;ml.allocated_processors&quot;: &quot;5&quot;, &lt;&lt; HERE 
.....
            },
            &quot;inference_count&quot;: 5040,
            &quot;average_inference_time_ms&quot;: 138.44285714285715, &lt;&lt; HERE 
            &quot;average_inference_time_ms_excluding_cache_hits&quot;: 138.44285714285715,
            &quot;inference_cache_hit_count&quot;: 0,
.....
            &quot;threads_per_allocation&quot;: 1,
            &quot;number_of_allocations&quot;: 4,  &lt;&lt;&lt; HERE
            &quot;peak_throughput_per_minute&quot;: 1550,
            &quot;throughput_last_minute&quot;: 1373,
            &quot;average_inference_time_ms_last_minute&quot;: 137.55280407865988,
            &quot;inference_cache_hit_count_last_minute&quot;: 0
          }
        ]
      }
    }
</code></pre>
<p>There are 3 key pieces of information above:</p>
<ul>
<li>
<p><code>&quot;ml.allocated_processors&quot;: &quot;5&quot;</code>
The number of physical cores / processors available</p>
</li>
<li>
<p><code>&quot;number_of_allocations&quot;: 4</code>
The number of allocations, which is at most 1 per physical core. <strong>Note</strong>: we could have used 5 allocations, but we only allocated 4 for this exercise</p>
</li>
<li>
<p><code>&quot;average_inference_time_ms&quot;: 138.44285714285715</code>
The average inference time per document.</p>
</li>
</ul>
<p>The throughput math for Inferences per Minute (IPM) per allocation (1 allocation per physical core) is straightforward, since an inference uses a single core and a single thread.</p>
<p>Then the Inferences per Min per Allocation is simply:</p>
<p><code>IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435</code></p>
<p>Which then lines up with the Total Inferences per Minute:</p>
<p><code>Total IPM = 435 IPM / allocation * 4 Allocations = ~1740</code></p>
<p>Suppose we want to do 10,000 IPMs, how many allocations (cores) would I need?</p>
<p><code>Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)</code></p>
<p>Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.</p>
<p><code>IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled</code></p>
<p>Then</p>
<p><code>Allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)</code></p>
<p><strong>Want it faster?</strong> It turns out there is a more lightweight NER model, <a href="https://huggingface.co/dslim/distilbert-NER">distilbert-NER</a>, that is faster; the tradeoff is slightly lower accuracy.</p>
<p>Running the logs through this model results in an inference time nearly twice as fast!</p>
<p><code>&quot;average_inference_time_ms&quot;: 66.0263959390863</code></p>
<p>Here is some quick math:
<code>IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909</code></p>
<p>Suppose we want to do 25,000 IPMs, how many allocations (cores) would I need?</p>
<p><code>Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)</code></p>
<p><strong>Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.</strong></p>
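<p>The sizing arithmetic above is easy to script. Here is a minimal Python sketch (the function names are ours, not part of any Elastic tooling) that turns an incoming EPS rate, a sampling fraction, and the observed <code>average_inference_time_ms</code> into a required allocation count:</p>

```python
import math

def ipm_per_allocation(avg_inference_ms: float) -> float:
    # Inferences per minute that one allocation (one core, one thread) can perform
    return 60_000 / avg_inference_ms

def required_allocations(eps: float, sampling: float, avg_inference_ms: float) -> int:
    # eps: incoming events per second
    # sampling: fraction of events routed to the NER model (0.01 == 1%)
    # avg_inference_ms: average_inference_time_ms from the trained model _stats
    sampled_ipm = eps * 60 * sampling
    return math.ceil(sampled_ipm / ipm_per_allocation(avg_inference_ms))

# 5000 EPS sampled at 1% with the ~138ms bert-base-NER inference time
print(required_allocations(5000, 0.01, 138))  # 7
```

<p>The same function reproduces the earlier answer as well: 10,000 IPM of unsampled traffic at 138ms per inference works out to 23 allocations.</p>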
<h4>Redact Processor Scaling</h4>
<p>In short, the <code>redact</code> processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.</p>
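<p>To build intuition for why well-constructed patterns matter, the sketch below mimics in plain Python what the <code>redact</code> processor does with Grok patterns inside Elasticsearch: precompile the expressions once, then substitute a named placeholder for each match. This is purely illustrative; the patterns and placeholders are ours:</p>

```python
import re

# Precompiled patterns are key to redact performance: compile once, reuse per event.
# These are illustrative stand-ins for Grok patterns such as
# %{EMAILADDRESS:EMAIL} and %{IP:IP_ADDRESS}.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP_ADDRESS>"),
]

def redact(message: str) -> str:
    # Replace each match with a named placeholder, as the redact processor does
    for pattern, placeholder in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message

print(redact("user bob@example.com logged in from 10.1.2.3"))
# user <EMAIL> logged in from <IP_ADDRESS>
```

<p>Tight, specific patterns like these stay cheap per event; broad, backtracking-heavy expressions are what typically cause regex-based redaction to fall behind at production volumes.</p>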
<h3>Assessing incoming logs</h3>
<p>If you want to assess incoming log data in a data stream, all you need to do is change the conditional in the <code>logs@custom</code> pipeline so that the <code>process-pii</code> pipeline is applied to the dataset you want. You can use any conditional that fits your use case.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<pre><code class="language-bash">    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, &lt;&lt;&lt; HERE
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
</code></pre>
<p>So, if for example your logs are coming into <code>logs-mycustomapp-default</code>, you would just change the conditional to:</p>
<pre><code>        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'mycustomapp'&quot;,
</code></pre>
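<p>Before pointing real traffic at the pipeline, you can exercise the conditional with Elasticsearch's <code>_simulate</code> endpoint. The helper below (ours, purely for illustration) builds a simulate request body with <code>data_stream.dataset</code> set, which you could POST to <code>_ingest/pipeline/logs@custom/_simulate</code>:</p>

```python
def simulate_body(dataset: str, message: str) -> dict:
    # Build a body for POST _ingest/pipeline/logs@custom/_simulate so that the
    # dataset conditional can be exercised before deploying a change
    return {
        "docs": [
            {
                "_source": {
                    "message": message,
                    "data_stream": {
                        "dataset": dataset,
                        "namespace": "default",
                    },
                }
            }
        ]
    }

body = simulate_body("mycustomapp", "2024-01-01 login from 10.1.2.3")
```

<p>If the conditional matches, the simulate response shows the document after the <code>process-pii</code> pipeline has run; if not, the document comes back untouched.</p>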
<h3>Assessing historical data</h3>
<p>If you have a historical (already ingested) data stream or index, you can run the assessment over it using the <code>_reindex</code> API.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<p>There are a couple of extra steps:
<a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-historical-data-blog-2.json">The code can be found here.</a></p>
<ol>
<li>First, we can set the parameters to keep ONLY the sampled data, as there is no reason to make a copy of all the unsampled data. In the <code>process-pii</code> pipeline, there is a setting, <code>sample.keep_unsampled</code>; setting it to <code>false</code> keeps only the sampled data.</li>
</ol>
<pre><code class="language-bash">    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: false &lt;&lt;&lt; SET TO false
      }
    },
</code></pre>
<ol start="2">
<li>Second, we will create a pipeline that will reroute the data to the correct data stream to run through all the PII assessment/detection pipelines. It also sets the correct <code>dataset</code> and <code>namespace</code></li>
</ol>
<pre><code class="language-bash">DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.dataset&quot;,
        &quot;value&quot;: &quot;pii&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.namespace&quot;,
        &quot;value&quot;: &quot;default&quot;
      }
    },
    {
      &quot;reroute&quot; : 
      {
        &quot;dataset&quot; : &quot;{{data_stream.dataset}}&quot;,
        &quot;namespace&quot;: &quot;{{data_stream.namespace}}&quot;
      }
    }
  ]
}
</code></pre>
<ol start="3">
<li>Finally, we can run a <code>_reindex</code> to select the data we want to assess. It is recommended to review the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html">_reindex</a> documentation before trying this. First, select the source data stream you want to assess; in this example, it is the <code>logs-generic-default</code> logs data stream. Note: I also added a <code>range</code> filter to select a specific time range. There is a bit of a &quot;trick&quot; we need to use, since we are rerouting the data to the data stream <code>logs-pii-default</code>: we set <code>&quot;index&quot;: &quot;logs-tmp-default&quot;</code> in the <code>_reindex</code>, and the correct data stream is then set by the pipeline. We must do that because <code>reroute</code> is a <code>noop</code> if it is called from/to the same data stream.</li>
</ol>
<pre><code class="language-bash">POST _reindex?wait_for_completion=false
{
  &quot;source&quot;: {
    &quot;index&quot;: &quot;logs-generic-default&quot;,
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;range&quot;: {
              &quot;@timestamp&quot;: {
                &quot;gte&quot;: &quot;now-1h/h&quot;,
                &quot;lt&quot;: &quot;now&quot;
              }
            }
          }
        ]
      }
    }
  },
  &quot;dest&quot;: {
    &quot;op_type&quot;: &quot;create&quot;,
    &quot;index&quot;: &quot;logs-tmp-default&quot;,
    &quot;pipeline&quot;: &quot;sendtopii&quot;
  }
}
</code></pre>
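<p>Since the data loading appendix already installs the <code>elasticsearch</code> Python client, you could also build the same <code>_reindex</code> request from a script. Here is a sketch; the helper name and defaults are ours, not an Elastic API:</p>

```python
def reindex_body(source_index: str, hours: int = 1,
                 dest_index: str = "logs-tmp-default",
                 pipeline: str = "sendtopii") -> dict:
    # Mirrors the _reindex call above: pull a recent time window from the
    # source and push each doc through the sendtopii pipeline, which then
    # reroutes it to logs-pii-default
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"@timestamp": {"gte": f"now-{hours}h/h", "lt": "now"}}}
                    ]
                }
            },
        },
        "dest": {"op_type": "create", "index": dest_index, "pipeline": pipeline},
    }

body = reindex_body("logs-generic-default")
# Submit this body via the Reindex API with wait_for_completion=false,
# then poll the returned task with the Tasks API
```
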
<h2>Summary</h2>
<p>At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.</p>
<p><a href="https://github.com/bvader/elastic-pii/tree/main/elastic/blog-complete-end-solution">The end-state solution can be found here</a>.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we covered the following:</p>
<ul>
<li>Redact PII using the NER model and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p><em><strong>So get to work and reduce risk in your logs!</strong></em></p>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10,000 random logs in a file named <code>pii.log</code>, with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note</strong>: To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise, and the logs will be loaded again. The new logs will not collide with previous runs, as there is a unique <code>run.id</code> for each run, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-ner-regex-assess-redact-part-2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Pruning incoming log volumes with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes</link>
            <guid isPermaLink="false">pruning-incoming-log-volumes</guid>
            <pubDate>Fri, 23 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[To drop or not to drop (events) is the question, not only in deciding what events and fields to remove from your logs but also in the various tools used. Learn about using Beats, Logstash, Elastic Agent, Ingest Pipelines, and OTel Collectors.]]></description>
            <content:encoded><![CDATA[<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/log/*.log
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_event:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              url.path: /profile
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_fields:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              http.response.status_code: 200
      fields: [&quot;event.message&quot;]
      ignore_missing: false
</code></pre>
<pre><code class="language-ruby">input {
  file {
    id =&gt; &quot;my-logging-app&quot;
    path =&gt; [ &quot;/var/tmp/other.log&quot;, &quot;/var/log/*.log&quot; ]
  }
}
filter {
  if [url][scheme] == &quot;http&quot; &amp;&amp; [url][path] == &quot;/profile&quot; {
    drop {
      percentage =&gt; 80
    }
  }
}
output {
  elasticsearch {
        hosts =&gt; &quot;https://my-elasticsearch:9200&quot;
        data_stream =&gt; &quot;true&quot;
    }
}
</code></pre>
<pre><code class="language-ruby"># Input configuration omitted
filter {
  if [url][scheme] == &quot;http&quot; &amp;&amp; [http][response][status_code] == 200 {
    drop {
      percentage =&gt; 80
    }
    mutate {
      remove_field =&gt; [ &quot;[event][message]&quot; ]
    }
  }
}
# Output configuration omitted
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ]
}
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app with failures&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set 'ingest.failure.message'&quot;,
        &quot;field&quot;: &quot;ingest.failure.message&quot;,
        &quot;value&quot;: &quot;Ingestion issue&quot;
        }
      }
  ]
}
</code></pre>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [/var/tmp/other.log, /var/log/*.log]
processors:
  filter/denylist:
    error_mode: ignore
    logs:
      log_record:
        - 'attributes[&quot;url.scheme&quot;] == &quot;http&quot;'
        - 'attributes[&quot;url.path&quot;] == &quot;/profile&quot;'
        - 'attributes[&quot;http.response.status_code&quot;] == 200'
  attributes/errors:
    actions:
      - key: error.message
        action: delete
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
exporters:
  # Exporters configuration omitted
service:
  pipelines:
    # Pipelines configuration omitted
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pruning-incoming-log-volumes/blog-thumb-elastic-on-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reduce-mttd-ml-machine-learning-observability</link>
            <guid isPermaLink="false">reduce-mttd-ml-machine-learning-observability</guid>
            <pubDate>Tue, 07 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Elastic’s machine learning capabilities help analyze the root cause of issues, allowing you to focus your time on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual or simple searching etc) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.</p>
<p>In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.</p>
<p>Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce time in finding that “needle” in the haystack.</p>
<p>Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To help get you started, there are several key features built into Elastic Observability to aid in analysis, helping bypass the need to run specific ML models. These features help minimize the time and analysis for logs.</p>
<p>Let’s review some of these built-in ML features:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
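<p>Elastic's models are far more sophisticated than this, but the core idea of “learn normal behavior, flag deviations” can be illustrated with a toy Python sketch (ours, for intuition only; it is not Elastic's algorithm):</p>

```python
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    # Flag points deviating more than `threshold` standard deviations from the
    # trailing window's mean -- a crude stand-in for modeling "normal behavior"
    # from recent history
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

latency_ms = [270, 280, 275, 272, 278, 274, 276, 273, 277, 275, 1100]  # spike at the end
print(anomalies(latency_ms))  # [10]
```

<p>Real anomaly detection also has to learn trends and periodicity and assign calibrated scores, which is exactly the heavy lifting Elastic does for you.</p>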
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action quicker.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p><strong>In this blog, we will cover anomaly detection and log categorization against the popular “Hipster Shop app” developed by Google, and modified recently by OpenTelemetry.</strong></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>In this blog, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Utilize a version of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">Anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map.png" alt="" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.</p>
<p>As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Anomaly detection</li>
<li>Log categorization</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<h3>Machine learning for anomaly detection</h3>
<p>Elastic will detect anomalies based on historical patterns and identify a probability of these issues.</p>
<p>Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map-anomaly-detection.png" alt="" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-single-metric-viewer.png" alt="" /></p>
<p>What you will see for the productCatalogService is that there is a severe spike in average transaction latency time, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). It’s likely that customers are potentially responding to the slowness of the site and that the company is losing potential transactions.</p>
<p>One step to take next is to review all the other potential anomalies that we saw in the service map in a larger picture. Use an anomaly explorer to view all the anomalies that have been identified.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer.png" alt="" /></p>
<p>Elastic is identifying numerous services with anomalies. productCatalogService has the highest score, and a good number of others (frontend, checkoutService, advertService, and more) also have high scores. However, this analysis is looking at just one metric.</p>
<p>Elastic can help detect anomalies across all types of data, such as Kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we will see a more comprehensive view of what is potentially causing this latency issue.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-job-selection.png" alt="" /></p>
<p>Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-timeline.png" alt="" /></p>
<p>In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see the category examples column). As we scroll through the results, we notice a potential postgreSQL issue from the categorization, which also has a high score of 94. Machine learning has identified a “rare mlcategory,” meaning that it has rarely occurred, pointing to a potential cause of the issue customers are seeing.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-machine-learning-service-name.png" alt="" /></p>
<p>We also notice that this issue is potentially caused by pgbench, a popular postgreSQL tool that helps benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment, as it causes heavy load on the database host, likely causing the higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, not the production environment.</p>
<h3>Machine learning for log categorization</h3>
<p>Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map versus initially exploring the anomaly. When we explore the service details for productCatalogService, we see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-product-catalog-service.png" alt="" /></p>
<p>The service details are identifying several things:</p>
<ol>
<li>There is abnormally high latency compared to the expected bounds of the service. We see that latency was recently higher than normal (upwards of 1s) compared to the average of 275ms.</li>
<li>There is also a high failure rate for the same time frame as the high latency (lower left chart “ <strong>Failed transaction rate</strong> ”).</li>
<li>Additionally, we can see the transactions, and one in particular, /ListProduct, has an abnormally high latency in addition to a high failure rate.</li>
<li>We see productCatalogService has a dependency on postgreSQL.</li>
<li>We also see errors all related to postgreSQL.</li>
</ol>
<p>We have the option of digging through the logs and analyzing them in Elastic, or we can use log categorization to identify the relevant logs more easily.</p>
<p>If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify postgresql logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized the postgresql logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-categories.png" alt="" /></p>
<p>We notice two additional items:</p>
<ul>
<li>There is a high count category (a message count of 23,797, with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence we search further for all pgbench-related logs in Categories.</li>
<li>We see an odd issue regarding terminating the connection (with a low count).</li>
</ul>
<p>While investigating the second error, which is severe, we can see logs from Categories before and after the error.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-timestamp.png" alt="" /></p>
<p>This troubleshooting shows postgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we have an idea that someone was running pgbench and this potentially overloaded the database, causing the latency issue that customers are seeing.</p>
<p>The next steps here could be to investigate anomaly detection and/or work with the developers to review the code and identify pgbench as part of the deployed configuration.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you further identify and get closer to pinpointing root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of lessons and what you learned:</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:</p>
<ol>
<li><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</li>
<li><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker.</li>
</ol>
</li>
<li>
<p>You learned how simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand the machine learning that drives these features or perform any lengthy setup.
Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above.</p>
</li>
</ul>
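<p>For readers curious about what the UI configures under the hood, a categorization job like the one driving these views can be sketched with the Elasticsearch ML anomaly detection job API. The job name, bucket span, and description below are illustrative assumptions, not the exact configuration Elastic creates for you:</p>

```console
PUT _ml/anomaly_detectors/log-categorization-demo
{
  "description": "Count log events per message category (illustrative job)",
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "detector_description": "count by mlcategory"
      }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```

<p>Counting by the special <code>mlcategory</code> field is what groups similar messages together and flags categories whose counts are anomalous.</p>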
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Live logs and prosper: fixing a fundamental flaw in observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams</link>
            <guid isPermaLink="false">reimagine-observability-elastic-streams</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Stop chasing symptoms. Learn how Streams in Elastic Observability fixes the fundamental flaw in observability, using AI to proactively find the 'why' in your logs for faster resolution.]]></description>
            <content:encoded><![CDATA[<p>SREs are often overwhelmed by dashboards and alerts that show what and where things are broken, but fail to reveal why. This industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial &quot;why&quot; is buried in information-rich logs, but their massive volume and unstructured nature have led the industry to throw them aside or treat them as second-class citizens. As a result, SREs are forced to turn every investigation into a high-stress, time-consuming hunt for clues. We can solve this problem with logs, but unlocking their potential requires us to reimagine how we work with them and improve the overall investigation journey.</p>
<h2>Observability, the broken promise</h2>
<p>To see why the current model fails, let’s look at the all-too-familiar challenge every SRE dreads: knowing a problem exists but needing to spend valuable time just trying to find where to even start the investigation.</p>
<p>Imagine you get a Slack message from the support team: &quot;a few high-value customers are reporting their payments are failing.&quot; You have no shortage of alerts, but most are just flagging symptoms. You don’t know where to start. You decide to check the logs to see if there is anything obvious, starting with the systems that have the high CPU alert.</p>
<p>You spend a few minutes searching and <code>grep</code>-ing through terabytes of logs for affected customer IDs, trying to piece together the problem. Nothing. You worry that you aren’t getting all the logs to reveal the problem, so you turn on more logging in the application. Now you’re knee-deep in data, desperately trying to find patterns, errors, or other &quot;hints&quot; that will give you a clue as to the <em>why</em>.</p>
<p>Finally, one of the broader log queries hits on an error code associated with an impacted customer ID. This is the first real clue. You pivot your search to this new error code and after an hour of digging, you finally uncover the error message. You've finally found the <em>why</em>, but it was a stressful, manual hunt that took far too much time and impacted dozens more customers.</p>
<p>This incident perfectly illustrates the broken promise of modern observability: The complete failure of the investigation process. Investigations are a manual, reactive process that SREs are forced into every day. At Elastic, we believe metrics, traces, and logs are all essential, but their roles, and the workflow between them, must be fundamentally re-imagined for effective investigations.</p>
<p>Observability is about having the clearest understanding possible of the <em>what</em>, <em>where</em>, and <em>why</em>. Metrics are essential for understanding the <em>what</em>. They are the heartbeat of your system, powering the dashboards and alerts that tell you when a threshold has been breached, like high CPU utilization or error rates. But they are aggregates; they show the symptom, rarely the root cause. Traces are good at identifying the <em>where</em>. They map the journey of a request through a distributed system, pinpointing the specific microservice or function where latency spikes or an error originates. Yet, their effectiveness hinges on complete and consistent code instrumentation, a constant dependency on development teams that can leave you with critical visibility gaps. Logs tell you the <em>why</em>. They contain all the rich, contextual, and unfiltered truth of an event. If we can more proactively and efficiently extract information from logs, we can greatly improve our overall understanding of our environments.</p>
<h2>Challenges of logs in modern environments</h2>
<p>While logs are in the standard toolbox, they have been neglected. SREs using today’s solutions deal with several major problems:</p>
<ul>
<li>First, due to their unstructured nature, it’s very difficult to parse and manage logs so that they’re useful. As a result, many SRE teams spend a lot of time building and maintaining complex pipelines to help manage this process.</li>
<li>Second, logs can get expensive at high volume, which leads teams to drop them on the floor to control costs, throwing away valuable information in the process. Consequently, when an incident occurs, you waste precious time hunting for the right logs and manually correlating across services.</li>
<li>Finally, nobody has built a log solution that proactively works to find the important signals in logs and to surface those critical <em>whys</em> to you when you need them. As a result, log-based investigations are too painful and slow.</li>
</ul>
<p>Why are we here? As applications became more complex, log volume became unmanageable. Instead of solving this with automation, the industry took a shortcut: it gave up on getting the most out of logs and prioritized more manageable but less informative signals.</p>
<p>This decision is the origin of the broken, reactive model. It forced observability into a manual loop of 'observing' alerts, rather than building automation that could help us truly understand our systems to improve how we root cause and resolve issues. This has transformed SREs from investigators into full-time data wranglers, wrestling with Grok patterns and fragile ETL scripts instead of solving outages. </p>
<h2>Introducing Streams to rethink how you use logs for investigations</h2>
<p>Streams is an agentic AI solution that simplifies working with logs to help SRE teams rapidly understand the <em>why</em> behind an issue for faster resolution. The combination of Elasticsearch and AI is turning manual management of noisy logs into automated workflows that identify patterns, context, and meaning, marking a fundamental shift in observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-01.png" alt="Streams" /></p>
<h4>Log everything in any format</h4>
<p>By applying the Elasticsearch platform for context engineering, bringing together retrieval and AI-driven parsing that keeps up with schema changes, we are reimagining the entire log pipeline.</p>
<p>Streams ingests raw logs from all your sources to a single destination. It then uses AI to partition incoming logs into their logical components and parses them to extract relevant fields for an SRE to validate, approve, or modify. Imagine a world where you simply point your logs to a single endpoint and everything just works: less wrestling with Grok patterns, configuring processors, and hunting for the right plugin, all of which significantly reduces complexity. Streams is a big step toward realizing that vision.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-02.png" alt="Streams" /></p>
<p>As a result, SREs are freed from managing complex ingestion pipelines, allowing them to spend less time on data wrangling and more time preventing service disruptions.</p>
<h4>Solve incidents faster with Significant Events </h4>
<p>Significant Events, a capability within Streams, uses AI to automatically surface major errors and anomalies, enabling you to be proactive in your investigations. So, instead of just combing through endless noise, you can focus on the events that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other significant signals of change. These events act as actionable markers, giving SREs early warning and clear focus to begin an investigation before service impact.</p>
<p>With this new foundation, logs will become your primary signal for investigation. The panicked, manual search for a needle in a digital haystack is about to be over. Significant Events acts like a smart metal detector that sifts through the chaos and only beeps when it finds issues, helping you to easily ignore all that hay and find the &quot;needle&quot; faster. </p>
<p>Now imagine the same scenario we started with. Instead of starting a frantic, time-consuming grep through terabytes of logs. Streams has already done the heavy lifting. Its AI-driven analysis has detected a new, anomalous pattern that began before your support team even knew about it and automatically surfaced it as a significant event. Rather than you hunting for a clue, the clue finds you. </p>
<p>With a single click, you have the <em>why</em>: a Java out-of-memory error in a specific service component. This is your starting point. You find the root cause in under two minutes and begin remediation. The customer impact is stopped, the dev team gets the specific error, and the problem is contained before it can escalate. In this case, metrics and traces were unhelpful in finding the <em>why</em>. The answer was waiting in the logs all along.</p>
<p>This ideal outcome is possible because you can both afford to keep every log and instantly find the signal within them. Elastic's cost-efficient architecture with powerful compression, searchable snapshots, and data tiering makes full retention a reality. From there, Streams automatically surfaces the significant event, ensuring that the answer is never lost in the noise.</p>
<p>Elastic is the only company that provides an AI-driven log-first approach to elevate your observability signals and make it dramatically faster and easier to get to <em>why</em>. This is built on our decades of leadership in search, relevance, and powerful analytics that provides the foundation for understanding logs at a deep, semantic level.</p>
<h2>The vision for Streams </h2>
<p>The partitioning, parsing, and Significant Events you see today is just the starting point. The next step in our vision is to use the Significant Events to automatically generate critical SRE artifacts. Imagine Streams creating intelligent alerts, on-the-fly investigation dashboards, and even data-driven SLOs based <em>only</em> on the events that actually impact service health. From there, the goal is to use AI to drive automated Root Cause Analysis (RCA) directly from log patterns and generate remediation runbooks, turning a multi-hour hunt into an instant resolution recommendation.</p>
<p>Once this AI-driven log foundation is in place, our vision for Streams expands to become a unified intelligence layer that operates across all your telemetry data. It’s not just about making each signal better in isolation, but about understanding the context and relationships between them to solve complex problems.</p>
<p>For metrics, Streams won’t just alert you to a single metric spike; it will detect correlated anomalies across multiple, seemingly unrelated metrics, e.g., p99 latency for a specific service, a rise in garbage collection time, and a drop in transaction success rate.</p>
<p>Similarly, for traces, it identifies when a new, unexpected service call (e.g., a new database or an external API) appears in a critical transaction path after a deployment, or when a specific span is suddenly responsible for a majority of errors across all traces, even if the overall error rate hasn't breached a threshold.</p>
<p>The goal is not to have separate streams for logs, metrics, and traces, but to weave them into a single narrative that automatically correlates all three signals. Ultimately, Streams is about fundamentally changing the goal from a human-led data-gathering exercise to proactive, AI-driven resolution.</p>
<p><em>For more on Streams:</em></p>
<p><em>Read the</em> <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations"><em>Streams launch blog</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics</link>
            <guid isPermaLink="false">service-level-objectives-slos-logs-metrics</guid>
            <pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This blog reviews this feature and how you can use it with Elastic's AI Assistant to meet SLOs.]]></description>
            <content:encoded><![CDATA[<p>In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.</p>
<p>Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.</p>
<p>Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.</p>
<p>To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation or define your own</a>. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.</p>
<p>Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.</p>
<p>In this blog, we will outline the following:</p>
<ul>
<li>
<p>What are SLOs? A Google SRE perspective</p>
</li>
<li>
<p>Several scenarios of defining and managing SLOs</p>
</li>
</ul>
<h2>Service Level Objective overview</h2>
<p>Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in <a href="https://sre.google/sre-book/table-of-contents/">Google's SRE Handbook</a>. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:</p>
<ul>
<li>
<p><strong>Service Level Indicators (SLIs):</strong> These are carefully selected metrics, such as uptime, latency, throughput, or error rates, that represent the aspects of a service that matter from an operations or business perspective. An SLI measures the level of service provided (latency, uptime, etc.) and is defined as a ratio of good events over total events, ranging between 0% and 100%.</p>
</li>
<li>
<p><strong>Service Level Objective (SLO):</strong> An SLO is the target value for a service level, measured as a percentage by an SLI. Above the threshold, the service is compliant. For example, if the SLI is service availability with a target of 99.9% successful responses, then any time failed responses exceed 0.1%, the SLO is out of compliance.</p>
</li>
<li>
<p><strong>Error budget:</strong> This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO target: the quantity of errors that can be tolerated.</p>
</li>
<li>
<p><strong>Burn rate:</strong> This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.</p>
</li>
</ul>
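<p>The relationships between these four concepts reduce to simple arithmetic. Here is a minimal sketch in Python, using illustrative numbers that are not taken from the handbook:</p>

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI: ratio of good events over total events (0.0 to 1.0)."""
    return good_events / total_events

def error_budget(slo_target: float) -> float:
    """Error budget: 100% minus the SLO target."""
    return 1.0 - slo_target

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Burn rate: how fast the budget is consumed; 1.0 is exactly on budget."""
    return observed_error_rate / error_budget(slo_target)

# 998,500 good responses out of 1,000,000, against a 99.9% availability SLO:
availability = sli(998_500, 1_000_000)     # 0.9985
budget = error_budget(0.999)               # 0.001
rate = burn_rate(1 - availability, 0.999)  # ~1.5: burning budget 1.5x too fast
```

<p>A sustained burn rate above 1.0 means the error budget will be exhausted before the time window ends, which is exactly the condition burn-rate alerts fire on.</p>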
<p>Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to <a href="https://sre.google/workbook/slo-document/">Google's SRE Handbook</a>.</p>
<p>One main thing to remember is that SLO monitoring is <em>not</em> incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.</p>
<p>In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.</p>
<p>Elastic®’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are used as described there. Hence users can perform the following on SLOs in Elastic:</p>
<ul>
<li>
<p>Define an SLO on an SLI such as KQL (log based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.</p>
</li>
<li>
<p>Utilize occurrence- versus timeslice-based budgeting. Occurrence budgeting computes the number of good events over the number of total events for the SLO. Timeslice budgeting breaks the overall time window into smaller slices of a defined duration and computes the number of good slices over the total slices. Timeslice targets are more accurate and are useful when calculating a service’s SLO against agreed-upon customer targets.</p>
</li>
<li>
<p>Manage all the SLOs in a singular location.</p>
</li>
<li>
<p>Trigger alerts from the defined SLO, whether the SLI falls below target, the error budget is exhausted, or the burn rate exceeds a defined threshold.</p>
</li>
<li>
<p>Create unique service level dashboards with SLO information for a more comprehensive view of the service.</p>
</li>
</ul>
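<p>The difference between the two budgeting methods is easiest to see with numbers. A small sketch with invented data, where one slice out of ten is very bad:</p>

```python
# Each tuple is (good_events, total_events) for one time slice.
slices = [(999, 1000)] * 9 + [(500, 1000)]  # one very bad slice out of ten

def occurrences_sli(slices):
    """Occurrences: good events over total events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total

def timeslice_sli(slices, slice_target=0.99):
    """Timeslices: slices meeting the per-slice target over total slices."""
    good_slices = sum(1 for g, t in slices if g / t >= slice_target)
    return good_slices / len(slices)

occurrences_sli(slices)  # 0.9491 -- the bad slice is diluted by the good ones
timeslice_sli(slices)    # 0.9    -- the bad slice counts fully against you
```

<p>This is why timeslice budgeting better reflects the customer experience of a short, sharp outage.</p>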
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/1-slo-blog.png" alt="Create alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/2-slo-blog.png" alt="Create dashboards" /></p>
<p>SREs need to be able to manage business metrics.</p>
<h2>SLOs based on logs: NGINX availability</h2>
<p>Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.</p>
<p>Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (nginx), a processing layer, and a database layer.</p>
<p>Let’s say that your processing layer is managing a significant number of requests. You want to ensure that the service stays up. The best way is to ensure that all http.response.status_code values are less than 500. Anything below 500 means the service is up, and any errors (like 404) are client errors rather than server errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/3-slo-blog.png" alt="expanded document" /></p>
<p>If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/4-slo-blog.png" alt="17k" /></p>
<p>Additionally, the number of messages with http.response.status_code &gt; 500 is minimal: roughly 17K.</p>
<p>Rather than creating an alert, we can create an SLO with this query:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/5-slo-blog.png" alt="edit SLO" /></p>
<p>We chose to use occurrences as the budgeting method to keep things simple.</p>
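<p>For reference, an SLO like this can also be created programmatically by POSTing to Kibana’s SLO API (<code>/api/observability/slos</code> as of 8.12). The request body below is a sketch only; the index pattern, queries, and exact field names are assumptions that may differ across versions:</p>

```json
{
  "name": "NGINX availability (sketch)",
  "indicator": {
    "type": "sli.kql.custom",
    "params": {
      "index": "logs-nginx*",
      "good": "http.response.status_code < 500",
      "total": "http.response.status_code : *",
      "timestampField": "@timestamp"
    }
  },
  "timeWindow": { "duration": "7d", "type": "rolling" },
  "budgetingMethod": "occurrences",
  "objective": { "target": 0.999 }
}
```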
<p>Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, the error budget, and any specific alerts against the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/6-slo-blog.png" alt="SLOs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/7-slo-blog.png" alt="nginx server availability " /></p>
<p>Not only do we get information about the violation, but we also get:</p>
<ul>
<li>
<p>Historical SLI (7 days)</p>
</li>
<li>
<p>Error budget burn down</p>
</li>
<li>
<p>Good vs. bad events (24 hours)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/8-slo-blog.png" alt="Percentages" /></p>
<p>We can see how we’ve easily burned through our error budget.</p>
<p>Hence something must be going on with nginx. To investigate, all we need to do is utilize the <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">AI Assistant</a>, and use its natural language interface to ask questions to help analyze the situation.</p>
<p>Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/9-slo-blog.png" alt="count of http response status code" /></p>
<p>As we can see, the number of 502s is minimal compared to the number of overall messages, but it is affecting our SLO.</p>
<p>However, it seems like nginx is having an issue. To remediate it, we also ask the AI Assistant how to address this error. Specifically, we ask if there is an internal runbook the SRE team has created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/10-slo-blog.png" alt="ai assistant thread" /></p>
<p>The AI Assistant retrieves a runbook the team has added to its knowledge base. We can now analyze and try to resolve or mitigate the issue with nginx.</p>
<p>While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:</p>
<ul>
<li>
<p>99% of requests occur under 200ms</p>
</li>
<li>
<p>99% of log messages are not errors</p>
</li>
</ul>
<h2>Application SLOs: OpenTelemetry demo cartservice</h2>
<p>A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>.</p>
<p>This demo has <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flags</a> to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.</p>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic supports OpenTelemetry by ingesting OTLP directly, with no need for an Elastic-specific agent</a>. You can send OpenTelemetry data directly from the application (through OTel libraries) or through the collector.</p>
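<p>For instance, pointing an OpenTelemetry Collector at Elastic is a standard OTLP exporter configuration. The endpoint and token below are placeholders; substitute your own deployment’s values and reference the exporter from your pipeline:</p>

```yaml
exporters:
  otlp/elastic:
    # Placeholder endpoint and secret token for an Elastic deployment.
    endpoint: "https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_SECRET_TOKEN}"
```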
<p>We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/11-slo-blog.png" alt="SLOs" /></p>
<p>We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/12-slo-blog.png" alt="cartservice-otel" /></p>
<p>As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/13-slo-blog.png" alt="apm" /></p>
<p>We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.</p>
<h2>Conclusion</h2>
<p>SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:</p>
<ul>
<li>
<p>SLOs can be based on logs. In Elastic, you can use KQL to essentially find and filter on specific logs and log fields to monitor and trigger SLOs.</p>
</li>
<li>
<p>AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.</p>
</li>
<li>
<p>APM service-based SLOs are easy to create and manage through integration with Elastic APM. We also use OTel telemetry to help monitor SLOs.</p>
</li>
</ul>
<p>For more information on SLOs in Elastic, check out <a href="https://www.elastic.co/guide/en/observability/current/slo.html">Elastic documentation</a> and the following resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">What’s new in Elastic Observability 8.12</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Introducing the Elastic AI Assistant</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic OpenTelemetry support</a></p>
</li>
</ul>
<p>Ready to get started? Sign up for <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/139686_-_Elastic_-_Headers_-_V1_3.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Simplifying log data management: Harness the power of flexible routing with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-log-data-management-flexible-routing</link>
            <guid isPermaLink="false">simplifying-log-data-management-flexible-routing</guid>
            <pubDate>Tue, 13 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The reroute processor, available as of Elasticsearch 8.8, allows customizable rules for routing documents, such as logs, into data streams for better control of processing, retention, and permissions with examples that you can try on your own.]]></description>
            <content:encoded><![CDATA[<p>In Elasticsearch 8.8, we’re introducing the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> in technical preview that makes it possible to send documents, such as logs, to different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana dev tools</a>.</p>
<p>Elastic Observability offers a wide range of <a href="https://www.elastic.co/integrations/data-integrations?solution=observability">integrations</a> that help you to monitor your applications and infrastructure. These integrations are added as policies to <a href="https://www.elastic.co/guide/en/fleet/current/elastic-agent-installation.html">Elastic agents</a>, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as <a href="https://www.elastic.co/guide/en/kinesis/current/aws-firehose-setup-guide.html">Amazon Kinesis Data Firehose</a>, <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes container logs</a>, and <a href="https://docs.elastic.co/integrations/tcp">syslog</a>. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the <a href="https://docs.elastic.co/en/integrations/nginx">Nginx</a> and <a href="https://docs.elastic.co/en/integrations/apache">Apache</a> integrations. That’s because in Elasticsearch, in combination with the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>, the processing and the schema are both encapsulated in a data stream.</p>
<p>The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You can use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use its prebuilt dashboards, or create custom ones with the fields, such as the URL, status code, and response time, that the Nginx pipeline has parsed out of the Nginx log message. You can also use the reroute processor to separate regular Nginx access logs from Nginx error logs, giving you a further level of separation and categorization.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/blog-elastic-routing-pipeline.png" alt="routing pipeline" /></p>
<h2>Example use case</h2>
<p>To use the reroute processor, first:</p>
<ol>
<li>
<p>Ensure you are on Elasticsearch 8.8</p>
</li>
<li>
<p>Ensure you have permissions to manage indices and data streams</p>
</li>
<li>
<p>If you don’t already have an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a>, sign up for one</p>
</li>
</ol>
<p>Next, you’ll need to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/set-up-a-data-stream.html">set up a data stream</a> and create a custom Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html">ingest pipeline</a> that is called as the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#set-default-pipeline">default pipeline</a>. Below we go through this step by step for the “mydata” data set that we’ll simulate ingesting container logs into. We start with a basic example and extend it from there.</p>
<p>Run the following steps in the Kibana console, which is found at <strong>Management -&gt; Dev tools -&gt; Console</strong>. First, we need an ingest pipeline and a template for the data stream:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
      }
    }
  ]
}
</code></pre>
<p>This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:</p>
<pre><code class="language-bash">PUT _index_template/logs-mydata
{
  &quot;index_patterns&quot;: [
    &quot;logs-mydata-*&quot;
  ],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 200,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-mydata&quot;
    },
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;container.name&quot;: {
          &quot;type&quot;: &quot;keyword&quot;
        }
      }
    }
  }
}
</code></pre>
<p>The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
<p>We can check that it was ingested with the command below, which will show 1 result.</p>
<pre><code class="language-bash">GET logs-mydata-default/_search
</code></pre>
<p>Even without further configuration, the reroute processor already allows us to route documents. As soon as the reroute processor is specified, it will look for the data_stream.dataset and data_stream.namespace fields by default and will send documents to the corresponding data stream, according to the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a> logs-&lt;dataset&gt;-&lt;namespace&gt;. Let’s try this out:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-03-30T12:27:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  },
  &quot;data_stream&quot;: {
    &quot;dataset&quot;: &quot;myotherdata&quot;
  }
}
</code></pre>
<p>As can be seen with the GET logs-myotherdata-default/_search command, this document ended up in the logs-myotherdata-default data stream. But instead of using the default rules, we want to create our own rules for the field container.name. If container.name == foo, we want to send the document to logs-foo-default. For this, we modify our routing pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;foo&quot;,
        &quot;if&quot; : &quot;ctx.container?.name == 'foo'&quot;,
        &quot;dataset&quot;: &quot;foo&quot;
      }
    }
  ]
}
</code></pre>
<p>Let's test this with a document:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
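<p>Assuming the condition matched, this document should now have been rerouted into the logs-foo-default data stream rather than logs-mydata-default. You can verify this by searching the target data stream:</p>
<pre><code class="language-bash">GET logs-foo-default/_search
</code></pre>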
<p>While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;mydata&quot;,
        &quot;dataset&quot;: [
          &quot;{{container.name}}&quot;,
          &quot;mydata&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, the document is routed to a data stream named after that field’s value; otherwise, it falls back to mydata. This can be tested with:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo1&quot;
  }
}

POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo2&quot;
  }
}
</code></pre>
<p>This creates the data streams logs-foo1-default and logs-foo2-default.</p>
<p><em>NOTE: There is currently a limitation in the processor that requires the fields specified in a <code>{{field.reference}}</code> to be in nested object notation; a dotted field name does not currently work. You’ll also get errors when the document contains dotted field names for any data_stream.* field. This limitation will be <a href="https://github.com/elastic/elasticsearch/pull/96243">fixed</a> in 8.8.2 and 8.9.0.</em></p>
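<p>To illustrate the limitation, here is a sketch of the two document shapes. The first, nested form works with a <code>{{container.name}}</code> field reference; the second, dotted form does not route correctly until the fix lands:</p>
<pre><code class="language-bash"># Nested object notation: the field reference resolves
POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: { &quot;name&quot;: &quot;foo&quot; }
}

# Dotted field name: the field reference does not resolve
POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container.name&quot;: &quot;foo&quot;
}
</code></pre>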
<h2>API keys</h2>
<p>When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for <code>logs-*-*</code> as data could end up in any of these indices (see example further down).</p>
<p>We’re currently <a href="https://github.com/elastic/integrations/issues/5989">working</a> <a href="https://github.com/elastic/integrations/issues/6255">on</a> extending the API key permissions for our <a href="https://www.elastic.co/integrations/data-integrations">integrations</a> so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.</p>
<p>If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:</p>
<pre><code class="language-bash">POST /_security/api_key
{
  &quot;name&quot;: &quot;ingest_logs&quot;,
  &quot;role_descriptors&quot;: {
    &quot;ingest_logs&quot;: {
      &quot;cluster&quot;: [
        &quot;monitor&quot;
      ],
      &quot;indices&quot;: [
        {
          &quot;names&quot;: [
            &quot;logs-*-*&quot;
          ],
          &quot;privileges&quot;: [
            &quot;auto_configure&quot;,
            &quot;create_doc&quot;
          ]
        }
      ]
    }
  }
}
</code></pre>
<h2>Future plans</h2>
<p>In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt it in our data-sink integrations, such as syslog and Kubernetes. Elastic will provide default routing rules that work out of the box, and it will also be possible for users to add their own rules. If you are using our integrations, follow <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#pipelines-for-fleet-elastic-agent">this guide</a> on how to add a custom ingest pipeline.</p>
<h2>Try it out!</h2>
<p>This blog post has shown some sample use cases for document-based routing. Try it out on your own data by adjusting the commands for index templates and ingest pipelines, and get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> through a 7-day free trial. Let us know via <a href="https://ela.st/reroute-feedback">this feedback form</a> how you’re planning to use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> and whether you have suggestions for improvement.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/observability-digital-transformation-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How Streams in Elastic Observability Simplifies Retention Management]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams</link>
            <guid isPermaLink="false">simplifying-retention-management-with-streams</guid>
            <pubDate>Thu, 30 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams simplifies retention management in Elasticsearch with a unified view to monitor, visualize, and control data lifecycles using DSL or ILM.]]></description>
            <content:encoded><![CDATA[<p>Managing retention in Elasticsearch can get complicated fast. Between <a href="https://www.elastic.co/docs/manage-data/lifecycle/data-stream">Data stream lifecycle (DSL)</a>, <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">Index lifecycle management (ILM)</a>, templates, and individual index settings, keeping policies consistent across data streams often takes more effort than it should.</p>
<p><strong>Streams</strong> changes that. It introduces a clear, unified way to manage how long your data lives, whether you’re using DSL or ILM. From a single view, you can visualize ingestion, understand where data sits across tiers, and adjust retention with confidence, applying updates to a single stream without worrying about unintended changes elsewhere.</p>
<h3>Walkthrough: Exploring the Retention Tab</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/retention_view.png" alt="Retention view of a stream" /></p>
<p>Retention management lives in the <strong>Retention</strong> tab of each stream. This is your control panel for understanding how much data you’re storing, how quickly it’s growing, and how your lifecycle policies are applied. It’s also where you can monitor and configure the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">Failure store</a>, which tracks and retains documents that failed to be ingested.</p>
<h4>Metrics at a glance</h4>
<p>At the top of the view, you’ll find an overview of key metrics:</p>
<ul>
<li>Storage size: the total data volume currently held by the stream.</li>
<li>Ingestion averages: Streams extrapolates both daily and monthly averages from the selected time range to give you a sense of long-term trends.</li>
</ul>
<p>This combination of near-real-time and projected values helps you quickly spot when ingestion is ramping up and whether your retention policy aligns with it.</p>
<h4>Ingestion over time</h4>
<p>Below the metrics, a graph shows ingestion volume over time. This information is approximated based on the number of documents over time, multiplied by the average document size in the backing index. </p>
<h4>Visualizing lifecycle phases</h4>
<p>When an ILM policy is effective, the retention view becomes more visual. Streams displays a phase breakdown (hot, warm, cold, frozen) showing the data volume stored in each phase. This gives you a clear sense of how your data is distributed across the storage tiers and whether your lifecycle is doing what you expect.</p>
<h4>Failure store</h4>
<p>A failure store is a secondary set of indices inside a data stream, dedicated to storing documents that failed to be ingested. Within the Retention tab, you can toggle the Failure store on or off and configure its own retention period. We’ll cover the Failure store and Data quality in more detail in <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">this article</a>.</p>
<h3>Updating Retention</h3>
<p>Beyond visualizing your retention, Streams makes it easy to change how it’s managed.</p>
<h4>Switching between DSL and ILM</h4>
<p>You can freely switch a stream between DSL and ILM management, or update a DSL retention period, with just a few clicks. Streams takes care of updating the lifecycle settings at the data stream level, ensuring consistent retention across all existing backing indices, not just new ones.</p>
<p>Whether you prefer the simplicity of DSL or the fine-grained tiering of ILM, you can move between the two seamlessly.</p>
<p><em>Clicking “Edit data retention” opens a modal that allows you to update the stream’s configuration. From there you can update the ILM policy or set a custom retention period via DSL.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_ilm.png" alt="Modal view to set a lifecycle policy" /></p>
<p><em>You can set a custom period, or pick an Indefinite retention for your data.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_dsl.png" alt="Modal view to set a custom retention period" /></p>
<p><em>You can also update streams’ lifecycle via the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name">Upsert stream</a> or the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name-ingest">Update ingest stream settings</a> Kibana APIs.</em></p>
<h4>Inherit or defer: different strategies for different stream types</h4>
<p><strong>Classic streams</strong></p>
<p>For classic streams, you can default to the existing index template’s retention. Retention isn’t managed by Streams in this case; it follows the lifecycle configuration defined in the template just as it normally would.</p>
<p>This option is useful if you’re onboarding existing data streams and want to keep their lifecycle behavior intact while still benefiting from Streams’ visibility and monitoring features.</p>
<p><strong>Wired streams</strong></p>
<p>Wired streams live in a tree structure, and that hierarchy allows an inheritance model.</p>
<p>A child stream can inherit the lifecycle of its nearest ancestor that has a concrete policy (ILM or DSL). This keeps your configuration lean and consistent since you can set a single lifecycle at a higher level in the tree and let Streams automatically apply it to all relevant descendants.</p>
<p>If that ancestor’s lifecycle is later updated, Streams cascades the change down to all children that inherit it, so everything stays in sync.</p>
<p><em>In the figure below, we set a different retention for</em> <strong><em>logs.prod</em></strong> <em>and</em> <strong><em>logs.staging</em></strong> <em>environments. The child partitions of these environments automatically inherit the configuration.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/streams_tree.png" alt="A streams tree that shows inheritance" /></p>
<h4>How it works under the hood</h4>
<p>When you apply or update a lifecycle, <strong>Streams</strong> calls Elasticsearch’s <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-settings">/_data_stream/_settings</a> endpoint, a new API introduced in 8.19 / 9.1 for this purpose.</p>
<p>This API is key to keeping retention consistent:</p>
<ol>
<li>It applies the lifecycle directly at the data stream level, overriding any configuration from cluster settings or index templates.</li>
<li>It propagates the retention update to all existing backing indices, not just new ones, so retention remains uniform across your historical and future data.</li>
</ol>
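<p>As a rough sketch of what Streams does for you behind the scenes (the data stream and policy names here are hypothetical), applying an ILM policy through this API looks like:</p>
<pre><code class="language-bash">PUT _data_stream/logs-myapp-default/_settings
{
  &quot;index.lifecycle.name&quot;: &quot;my-retention-policy&quot;
}
</code></pre>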
<p>By centralizing lifecycle management at the data stream level and applying a consistent configuration across the backing indices, we remove the ambiguity that used to exist between template-level and index-level configurations. You always know which retention policy is actually in effect, and you can see it directly in the UI.</p>
<h3>Wrapping Up</h3>
<p>With Streams, retention management becomes clear and consistent. You can visualize ingestion, switch between DSL and ILM, or inherit policies across streams, all without diving into templates or manual index settings.</p>
<p>By unifying retention into a single view, Streams turns lifecycle management into something simple, predictable, and transparent.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic’s Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Smarter log analytics in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/smarter-log-analytics-in-elastic-observability</link>
            <guid isPermaLink="false">smarter-log-analytics-in-elastic-observability</guid>
            <pubDate>Mon, 10 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover smarter log handling with Kibana's latest features! The new Data Source Selector lets you easily filter logs by integrations like System Logs and Nginx. Smart Fields enhance log analysis by presenting data more intuitively. Simplify your workflow and uncover deeper insights today!]]></description>
            <content:encoded><![CDATA[<p>Discover a smarter way to handle your logs with Kibana's latest features! Our new Data Source selector makes it effortless to zero in on the logs you need, whether they're from System Logs or Application Logs by selecting your integrations or data views. Plus, with the introduction of Smart Fields, your log analysis is now more intuitive and insightful. Get ready to simplify your workflow and uncover deeper insights with these game-changing updates. Dive in and see how easy log exploration can be!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/smart-fields.png" alt="Smart fields" /></p>
<h2>Find the logs you’re looking for</h2>
<h3>Focus on logs from specific integrations or data views</h3>
<p>We've added the Data Source selector, a handy new feature for viewing specific logs. Now, you can easily filter your logs based on your integrations, like System Logs, Nginx, or Elastic APM, or switch between different data views, like logs or metrics. This new selector is all about making your data easier to find and helping you focus on what matters most in your analysis.</p>
<h2>Dive into your logs</h2>
<h3>Analyze logs with Smart Fields in Kibana</h3>
<p>Logs in Kibana have undergone a significant transformation, particularly in the way log data is presented. The once-basic table view has evolved with the introduction of Smart Fields, providing users with a more insightful and dynamic log analysis experience.</p>
<h4>Resource Smart Field - centralizing log source information</h4>
<p>The resource column further elevates the Logs Explorer page by providing users with a single column for exploring the resource that created the log event. This column groups various resource-indicating fields together, streamlining the investigation process. Currently, the following <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS</a> fields are grouped under this single column and we recommend including them in your logs:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-service.html#field-service-name">service.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-container.html#field-container-name">container.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-orchestrator.html#field-orchestrator-namespace">orchestrator.namespace</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-host.html#field-host-name">host.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-cloud.html#field-cloud-instance-id">cloud.instance.id</a></li>
</ul>
<p>We know this does not cover all use cases, and we would like your feedback on other fields that are important to you, to help us provide a tailored and user-centric log analysis experience.</p>
<h4>Content Smart Field - a deeper dive into log data</h4>
<p>The content column revolutionizes log analysis by seamlessly rendering <strong>log.level</strong> and <strong>message</strong> fields. Notably, it automatically handles fallbacks, ensuring a smooth transition when the actual message field is not available. This enhancement simplifies the log exploration process, offering users a more comprehensive understanding of their data.</p>
<h4>Actions column - unleashing additional columns</h4>
<p>As part of our commitment to empowering users, we are introducing the actions column, adding a layer of functionality to the document table. This column includes two powerful actions:</p>
<ul>
<li><strong>Degraded document indicator</strong>: This indicator provides insights into the quality of your data by showing which fields were ignored when the document was indexed and ended up in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ignored-field.html">_ignored</a> property of the document. To help analyze what caused the document to degrade, we suggest reading this blog: <a href="https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed">The antidote for index mapping exceptions: ignore_malformed</a>.</li>
<li><strong>Stacktrace indicator</strong>: This indicator informs users of the presence of stack traces in the document, making it easy to navigate through log documents and know whether they contain additional information.</li>
</ul>
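<p>The same degraded documents can also be found with a regular search, since the <code>_ignored</code> metadata field is queryable. For example, adjusting the index pattern to your own data:</p>
<pre><code class="language-bash">GET logs-*/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>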
<h3>Investigate individual logs by expanding log details</h3>
<p>Now, when you click the expand icon in the actions column, it opens up the <strong>Log details</strong> flyout for any log entry. This new feature gives you a detailed overview of the entry right at your fingertips. Inside the flyout, the <strong>Overview</strong> tab is neatly organized into four sections—Content breakdown, Service &amp; Infrastructure, Cloud, and Others—each offering a snapshot of the most crucial information. Plus, you'll find the same handy controls you're used to in the main table, like filtering in or out, adding or removing columns, and copying data, making it easier than ever to manage your logs directly from the flyout.</p>
<p>The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant</a> is fully integrated into this view providing contextual insights about the log event and helping to find similar messages.</p>
<h2>Experience a streamlined approach to log exploration</h2>
<p>These enhancements simplify the process of finding and focusing on specific logs and offer a more intuitive and insightful data presentation. Dive into your logs with these new tools, streamline your workflow, and uncover deeper insights with ease. Try it now and transform your log analysis!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/log-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks</link>
            <guid isPermaLink="false">sre-troubleshooting-ai-assistant-observability-runbooks</guid>
            <pubDate>Wed, 08 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Empower your SRE team with this guide to enriching Elastic's AI Assistant Knowledge Base with your organization's internal observability information for enhanced alert remediation and incident management.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Observability AI Assistant</a> helps users explore and analyze observability data using a natural language interface, by leveraging automatic function calling to request, analyze, and visualize your data to transform it into actionable observability. The Assistant can also set up a Knowledge Base, powered by <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elastic Learned Sparse EncodeR</a> (ELSER) to provide additional context and recommendations from private data, alongside the large language models (LLMs) using RAG (Retrieval Augmented Generation). Elastic’s Stack — as a vector database with out-of-the-box semantic search and connectors to LLM integrations and the Observability solution — is the perfect toolkit to extract the maximum value of combining your company's unique observability knowledge with generative AI.</p>
<h2>Enhanced troubleshooting for SREs</h2>
<p>Site reliability engineers (SREs) in large organizations often face challenges in locating the information needed for troubleshooting alerts, monitoring systems, or deriving insights, because resources are scattered and potentially outdated. This issue is particularly significant for less experienced SREs, who may require assistance even when a runbook exists. Recurring incidents pose another problem, as the on-call individual may lack knowledge about previous resolutions and subsequent steps. Mature SRE teams often invest considerable time in system improvements to minimize &quot;fire-fighting,&quot; utilizing extensive automation and documentation to support on-call personnel.</p>
<p>Elastic® addresses these challenges by combining generative AI models with relevant search results from your internal data using RAG. The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant's internal Knowledge Base</a>, powered by our semantic search retrieval model <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, can recall information at any point during a conversation, providing RAG responses based on internal knowledge.</p>
<p>This Knowledge Base can be enriched with your organization's information, such as runbooks, GitHub issues, internal documentation, and Slack messages, allowing the AI Assistant to provide specific assistance. The Assistant can also document and store specific information from an ongoing conversation with an SRE while troubleshooting issues, effectively creating runbooks for future reference. Furthermore, the Assistant can generate summaries of incidents, system status, runbooks, post-mortems, or public announcements.</p>
<p>This ability to retrieve, summarize, and present contextually relevant information is a game-changer for SRE teams, transforming the work from chasing documents and data to an intuitive, contextually sensitive user experience. The Knowledge Base (see <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html#obs-ai-requirements">requirements</a>) serves as a central repository of Observability knowledge, breaking documentation silos and integrating tribal knowledge, making this information accessible to SREs enhanced with the power of LLMs.</p>
<p>Your LLM provider may collect query telemetry when using the AI Assistant. If your data is confidential or has sensitive details, we recommend you verify the data treatment policy of the LLM connector you provided to the AI Assistant.</p>
<p>In this blog post, we will cover different ways to enrich your Knowledge Base (KB) with internal information. We will focus on a specific alert, indicating that there was an increase in logs with “502 Bad Gateway” errors that has surpassed the alert’s threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-1.png" alt="1 - threshold breached" /></p>
<h2>How to troubleshoot an alert with the Knowledge Base</h2>
<p>Before the KB has been enriched with internal information, when the SRE asks the AI Assistant about how to troubleshoot an alert, the response from the LLM will be based on the data it learned during training; however, the LLM is not able to answer questions related to private, recent, or emerging knowledge. In this case, when asking for the steps to troubleshoot the alert, the response will be based on generic information.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-2.png" alt="2 - troubleshooting steps" /></p>
<p>However, once the KB has been enriched with your runbooks, when your team receives a new alert on “502 Bad Gateway” Errors, they can use AI Assistant to access the internal knowledge to troubleshoot it, using semantic search to find the appropriate runbook in the Knowledge Base.</p>
<p>In this blog, we will cover different ways to add internal information on how to troubleshoot an alert to the Knowledge Base:</p>
<ol>
<li>
<p>Ask the assistant to remember the content of an existing runbook.</p>
</li>
<li>
<p>Ask the Assistant to summarize and store in the Knowledge Base the steps taken during a conversation and store it as a runbook.</p>
</li>
<li>
<p>Import your runbooks from GitHub or another external source to the Knowledge Base using our Connector and APIs.</p>
</li>
</ol>
<p>After the runbooks have been added to the KB, the AI Assistant is now able to recall the internal and specific information in the runbooks. By leveraging the retrieved information, the LLM could provide more accurate and relevant recommendations for troubleshooting the alert. This could include suggesting potential causes for the alert, steps to resolve the issue, preventative measures for future incidents, or asking the assistant to help execute the steps mentioned in the runbook using functions. With more accurate and relevant information at hand, the SRE could potentially resolve the alert more quickly, reducing downtime and improving service reliability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-10_at_9.52.38_AM.png" alt="3 - troubleshooting 502 Bad gateway" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-4.png" alt="4 - (5) test the backend directly" /></p>
<p>Your Knowledge Base documents will be stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. Keep in mind that LLMs have a restriction on the amount of information the model can read and write at once, called the token limit. Imagine you're reading a book, but you can only remember a certain number of words at a time. Once you've reached that limit, you start to forget the earlier words you've read. That's similar to how a token limit works in an LLM.</p>
<p>To keep runbooks within the token limit for Retrieval Augmented Generation (RAG) models, ensure the information is concise and relevant. Use bullet points for clarity, avoid repetition, and use links for additional information. Regularly review and update the runbooks to remove outdated or irrelevant information. The goal is to provide clear, concise, and effective troubleshooting information without compromising the quality due to token limit constraints. LLMs are great for summarization, so you could ask the AI Assistant to help you make the runbooks more concise.</p>
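<p>As a quick sanity check before adding a runbook, you can estimate its token count. The snippet below is an illustrative sketch using the common rule of thumb of roughly four characters per token for English text; the <code>estimate_tokens</code> and <code>fits_budget</code> helpers and the 4,000-token budget are assumptions for this example, not Elastic APIs, and the real limit depends on your LLM connector's tokenizer.</p>

```python
# Rough token-count estimate for a runbook before storing it in the
# Knowledge Base. The ~4-characters-per-token ratio is a common rule of
# thumb for English text, not an exact tokenizer; real limits depend on
# the LLM behind your connector.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_budget(text: str, budget: int = 4000) -> bool:
    """Flag runbooks that are likely to exceed a given token budget."""
    return estimate_tokens(text) <= budget

runbook = (
    "1. Check upstream health.\n"
    "2. Inspect nginx error logs.\n"
    "3. Restart the backend if unhealthy."
)
print(estimate_tokens(runbook), fits_budget(runbook))
```

<p>Runbooks that fail the check are good candidates for asking the AI Assistant to summarize them into a more concise version.</p>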
<h2>Ask the assistant to remember the content of an existing runbook</h2>
<p>The easiest way to store a runbook into the Knowledge Base is to just ask the AI Assistant to do it! Open a new conversation and ask “Can you store this runbook in the KB for future reference?” followed by pasting the content of the runbook in plain text.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-5.png" alt="5 - new conversation - let's work on this together" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-6.png" alt="6 - new conversation" /></p>
<p>The AI Assistant will then store it in the Knowledge Base for you automatically, as simple as that.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-7.png" alt="7 - storing a runbook" /></p>
<h2>Ask the Assistant to summarize and store the steps taken during a conversation in the Knowledge Base</h2>
<p>You can also ask the AI Assistant to remember something while having a conversation — for example, after troubleshooting an alert with the AI Assistant, you could ask it to &quot;remember how to troubleshoot this alert for next time.&quot; The AI Assistant will create a summary of the steps taken to troubleshoot the alert and add it to the Knowledge Base, effectively creating runbooks for future reference. Next time you are faced with a similar situation, the AI Assistant will recall this information and use it to assist you.</p>
<p>In the following demo, the user asks the Assistant to remember the steps that have been followed to troubleshoot the root cause of an alert, and also to ping the Slack channel when this happens again. In a later conversation with the Assistant, the user asks what can be done about a similar problem, and the AI Assistant is able to remember the steps and also reminds the user to ping the Slack channel.</p>
<p>After receiving the alert, you can open the AI Assistant chat and troubleshoot it. Once you have investigated the alert, ask the AI Assistant to summarize the analysis and the steps taken to find the root cause, so they can be remembered the next time a similar alert fires, and add extra instructions, such as warning the Slack channel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-8.png" alt="8. -teal box" /></p>
<p>The Assistant will use the built-in functions to summarize the steps and store them into your Knowledge Base, so they can be recalled in future conversations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_11.34.08_AM.png" alt="9 - Elastic assistant chat (CROP)" /></p>
<p>Open a new conversation, and ask what steps to take when troubleshooting an alert similar to the one we just investigated. The Assistant will be able to recall the information stored in the KB that is related to the specific alert, using semantic search based on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, and provide a summary of the steps taken to troubleshoot it, including the last indication of informing the Slack channel.</p>
&lt;Video vidyardUuid=&quot;p14Ss8soJDkW8YoCtKPrQF&quot; loop={true} /&gt;
<h2>Import your runbooks stored in GitHub to the Knowledge Base using APIs or our GitHub Connector</h2>
<p>You can also add proprietary data into the Knowledge Base programmatically by ingesting it (e.g., GitHub Issues, Markdown files, Jira tickets, text files) into Elastic.</p>
<p>If your organization has created runbooks that are stored in Markdown documents in GitHub, follow the steps in the next section of this blog post to index the runbook documents into your Knowledge Base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-10.png" alt="10 - github handling 502" /></p>
<p>The steps to ingest documents into the Knowledge Base are the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-11.png" alt="11 - using internal knowledge" /></p>
<h3>Ingest your organization’s knowledge into Elasticsearch</h3>
<p><strong>Option 1:</strong> <strong>Use the</strong> <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html"><strong>Elastic web crawler</strong></a> <strong>.</strong> Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">Elasticsearch® index</a> is created to hold and sync webpage content.</p>
<p><strong>Option 2: Use Elasticsearch's</strong> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html"><strong>Index API</strong></a> <strong>.</strong> <a href="https://www.elastic.co/guide/en/cloud/current/ec-ingest-guides.html">Watch tutorials</a> that demonstrate how you can use the Elasticsearch language clients to ingest data from an application.</p>
<p><strong>Option 3: Build your own connector.</strong> Follow the steps described in this blog: <a href="https://www.elastic.co/search-labs/how-to-create-customized-connectors-for-elasticsearch">How to create customized connectors for Elasticsearch</a>.</p>
<p><strong>Option 4: Use Elasticsearch</strong> <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-content-sources.html"><strong>Workplace Search connectors</strong></a> <strong>.</strong> For example, the <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html">GitHub connector</a> can automatically capture, sync, and index issues, Markdown files, pull requests, and repos.</p>
<ul>
<li>Follow the steps to <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html#github-configuration">configure the GitHub Connector in GitHub</a> to create an OAuth App from the GitHub platform.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-12.png" alt="12 - elastic workplace search" /></p>
<ul>
<li>Now you can connect a GitHub instance to your organization. Head to your organization’s <strong>Search &gt; Workplace Search</strong> administrative dashboard, and locate the Sources tab.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_10.19.19_AM.png" alt="13 - screenshot" /></p>
<ul>
<li>Select <strong>GitHub</strong> (or GitHub Enterprise) in the Configured Sources list, and follow the GitHub authentication flow as presented. Upon the successful authentication flow, you will be redirected to Workplace Search and will be prompted to select the Organization you would like to synchronize.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-14.png" alt="14 - configure and connect" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-15.png" alt="15 - how to add github" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-16.png" alt="16 - github" /></p>
<ul>
<li>After configuring the connector and selecting the organization, the content should be synchronized and you will be able to see it in Sources. If you don’t need to index all the available content, you can specify the indexing rules via the API. This will help shorten indexing times and limit the size of the index. See <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-customizing-indexing-rules.html">Customizing indexing</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-17.png" alt="17 - source overview" /></p>
<ul>
<li>The source has created an index in Elastic with the content (Issues, Markdown Files…) from your organization. You can find the index name by navigating to <strong>Stack Management &gt; Index Management</strong> , activating the <strong>Include hidden Indices</strong> button on the right, and searching for “GitHub.”</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-18.png" alt="18 - index mgmt" /></p>
<ul>
<li>You can explore the documents you have indexed by creating a Data View and exploring it in Discover. Go to <strong>Stack Management &gt; Kibana &gt; Data Views &gt; Create data view</strong> and introduce the data view Name, Index pattern (make sure you activate “Allow hidden and system indices” in advanced options), and Timestamp field:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-19.png" alt="19 - create data view" /></p>
<ul>
<li>You can now explore the documents in Discover using the data view:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-20.png" alt="20 - data view" /></p>
<h3>Reindex your internal runbooks into the AI Assistant’s Knowledge Base index, using its semantic search pipeline</h3>
<p>Your Knowledge Base documents are stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. To add your internal runbooks imported from GitHub to the KB, you just need to reindex the documents from the index you created in the previous step to the KB’s index. To add the semantic search capabilities to the documents in the KB, the reindex should also use the ELSER pipeline preconfigured for the KB, <em>.kibana-observability-ai-assistant-kb-ingest-pipeline</em>.</p>
<p>By creating a Data View with the KB index, you can explore the content in Discover.</p>
<p>Execute the query below in <strong>Management &gt; Dev Tools</strong>, making sure to replace the following placeholders, both in “_source” and “inline”:</p>
<ul>
<li>InternalDocsIndex : name of the index where your internal docs are stored</li>
<li>text_field : name of the field with the text of your internal docs</li>
<li>timestamp : name of the field of the timestamp in your internal docs</li>
<li>public : (true or false) if true, makes a document available to all users in the defined <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> (if one is defined) or in all spaces (if none is defined); if false, the document will be restricted to the user indicated in user.name</li>
<li>(optional) space : if defined, restricts the internal document to be available in a specific <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a></li>
<li>(optional) user.name : if defined, restricts the internal document to be available for a specific user</li>
<li>(optional) &quot;query&quot; filter to index only certain docs (see below)</li>
</ul>
<pre><code class="language-bash">POST _reindex
{
    &quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ]
    },
    &quot;dest&quot;: {
        &quot;index&quot;: &quot;.kibana-observability-ai-assistant-kb-000001&quot;,
        &quot;pipeline&quot;: &quot;.kibana-observability-ai-assistant-kb-ingest-pipeline&quot;
    },
    &quot;script&quot;: {
        &quot;inline&quot;: &quot;ctx._source.text=ctx._source.remove(\&quot;&lt;text_field&gt;\&quot;);ctx._source.namespace=\&quot;&lt;space&gt;\&quot;;ctx._source.is_correction=false;ctx._source.public=&lt;public&gt;;ctx._source.confidence=\&quot;high\&quot;;ctx._source['@timestamp']=ctx._source.remove(\&quot;&lt;timestamp&gt;\&quot;);ctx._source['user.name'] = \&quot;&lt;user.name&gt;\&quot;&quot;
    }
}
</code></pre>
<p>You may want to restrict which documents you reindex into the KB — for example, you may only want to reindex Markdown documents (like runbooks). You can do this by adding a “query” filter on the documents in the source. In the case of GitHub, runbooks are identified by the “type” field containing the string “file,” which you can add to the reindex query as shown below. To also include GitHub issues, add the string “issues” to the “type” terms in the query:</p>
<pre><code class="language-json">&quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ],
    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;]
      }
    }
</code></pre>
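<p>If you prefer to drive the reindex from a script rather than Dev Tools, the same request body can be assembled programmatically. The helper below is a hypothetical sketch: <code>build_kb_reindex_body</code> and the placeholder names (<code>my-github-index</code>, <code>body_content</code>, <code>last_updated</code>) are illustrative, not Elastic APIs. It mirrors the Dev Tools query above, so verify the field names against your own source index before running it.</p>

```python
# Sketch of building the _reindex request body for the Knowledge Base,
# mirroring the Dev Tools query above. All field and index names passed
# in are placeholders for your own source index.
def build_kb_reindex_body(source_index, text_field, timestamp_field,
                          public=True, space=None, user_name=None,
                          doc_types=("file",)):
    # Painless script that renames/sets the fields the KB expects.
    script = (
        f"ctx._source.text=ctx._source.remove('{text_field}');"
        "ctx._source.is_correction=false;"
        f"ctx._source.public={'true' if public else 'false'};"
        "ctx._source.confidence='high';"
        f"ctx._source['@timestamp']=ctx._source.remove('{timestamp_field}');"
    )
    if space:
        script += f"ctx._source.namespace='{space}';"
    if user_name:
        script += f"ctx._source['user.name']='{user_name}';"
    return {
        "source": {
            "index": source_index,
            "_source": [text_field, timestamp_field, "namespace",
                        "is_correction", "public", "confidence"],
            "query": {"terms": {"type": list(doc_types)}},
        },
        "dest": {
            "index": ".kibana-observability-ai-assistant-kb-000001",
            "pipeline": ".kibana-observability-ai-assistant-kb-ingest-pipeline",
        },
        "script": {"source": script},
    }

body = build_kb_reindex_body("my-github-index", "body_content", "last_updated")
# With an elasticsearch.Elasticsearch client, this could then be sent as:
# es.reindex(**body)
```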
<p>Great! Now that the data is stored in your Knowledge Base, you can ask the Observability AI Assistant any questions about it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-21.png" alt="21 - new conversation" /></p>
&lt;Video vidyardUuid=&quot;zRxsp1EYjmR4FW4yRtSxcr&quot; loop={true} /&gt;
&lt;Video vidyardUuid=&quot;vV5md3mVtY8KxUVjSvtT7V&quot; loop={true} /&gt;
<h2>Conclusion</h2>
<p>In conclusion, leveraging internal Observability knowledge and adding it to the Elastic Knowledge Base can greatly enhance the capabilities of the AI Assistant. By manually inputting information or programmatically ingesting documents, SREs can create a central repository of knowledge accessible through the power of Elastic and LLMs. The AI Assistant can recall this information, assist with incidents, and provide tailored observability to specific contexts using Retrieval Augmented Generation. By following the steps outlined in this article, organizations can unlock the full potential of their Elastic AI Assistant.</p>
<p><a href="https://www.elastic.co/generative-ai/ai-assistant">Start enriching your Knowledge Base with the Elastic AI Assistant today</a> and empower your SRE team with the tools they need to excel. Follow the steps outlined in this article and take your incident management and alert remediation processes to the next level. Your journey toward a more efficient and effective SRE operation begins now.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/11-hand.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Better RCAs with multi-agent AI Architecture]]></title>
            <link>https://www.elastic.co/observability-labs/blog/super-agent-architecture</link>
            <guid isPermaLink="false">super-agent-architecture</guid>
            <pubDate>Fri, 31 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how specialized LLM agents collaborate to tackle complex tasks with unparalleled efficiency]]></description>
            <content:encoded><![CDATA[<h2>What’s a multi agent architecture?</h2>
<p>You might have heard the term Agent pop up recently in different open source projects or from vendors focusing their go-to-market on GenAI. Indeed, while most GenAI applications today are focused on RAG, there is an increasing interest in isolating tasks that could be achieved with a more specialized model into what is called an Agent.</p>
<p>To be clear, an agent will be given a task, which could be a prompt, and execute the task by leveraging other models, data sources, and a knowledge base. Depending on the field of application, the results should ultimately look like generated text, pictures, charts, or sounds.</p>
<p>Now, a multi-agent architecture is the process of leveraging multiple agents around a given task by:</p>
<ul>
<li>Orchestrating complex system oversight with multiple agents</li>
<li>Analyzing and strategizing in real-time with strategic reasoning</li>
<li>Specializing agents: decomposing tasks into smaller, focused, expert-handled elements</li>
<li>Sharing insights for cohesive action plans, creating collaborative dynamics</li>
</ul>
<p>In a nutshell, multi-agent architecture's superpower is tackling intricate challenges beyond human speed and solving complex problems. It enables a couple of things:</p>
<ul>
<li>Scale the intelligence as the data and complexity grow. The tasks are decomposed into smaller work units, and the expert network grows accordingly.</li>
<li>Coordinate simultaneous actions across systems, scale collaboration</li>
<li>Evolving with data allows continuous adaptation with new data for cutting-edge decision-making.</li>
<li>Scalability, high performance, and resilience</li>
</ul>
<h2>Single Agent Vs Multi-Agent Architecture</h2>
<p>Before double-clicking on the multi-agent architecture, let’s talk about the single-agent architecture. The single-agent architecture is designed for straightforward tasks and a late feedback loop from the end user. There are multiple single-agent frameworks such as ReAct (Reason+Act), RAISE (ReAct+ Short/Long term memory), Reflexion, AutoGPT+P, and LATS (Language Agent Tree Search). The general process these architectures enable is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/single.png" alt="alt_text" /></p>
<p>The Agent takes an action, observes, executes, and decides on its own whether the result looks complete: it ends the process if finished, or resubmits the new results as an input action, and the process keeps going.</p>
<p>While this type of agent handles simple tasks well, such as a RAG application where a user asks a question and the agent returns an answer based on the LLM and a knowledge base, it has a couple of limitations:</p>
<ul>
<li>Endless execution loop: the agent is never satisfied with the output and reiterates.</li>
<li>Hallucinations</li>
<li>Lack of feedback loop or enough data to build a feedback loop</li>
<li>Lack of planning</li>
</ul>
<p>For these reasons, the need for a better self-evaluation loop, an externalized observation phase, and a division of labor is growing, motivating the multi-agent architecture.</p>
<p>Multi-agent architecture relies on taking a complex task, breaking it down into multiple smaller tasks, planning the resolution of these tasks, executing, evaluating, sharing insights, and delivering an outcome. For this, there is more than one agent; in fact, the minimum value for the network size N is N=2 with:</p>
<ul>
<li>A Manager</li>
<li>An Expert</li>
</ul>
<p>When N=2, the source task is simple enough to need only one expert agent, as the task cannot be broken down into multiple subtasks. Now, when the task is more complex, this is what the architecture can look like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/multi-vertical.png" alt="alt_text" /></p>
<p>With the help of an LLM, the Manager decomposes the tasks and delegates the resolutions to multiple agents. The above architecture is called vertical, since the agents directly send their results to the Manager. In a horizontal architecture, agents work and share insights together as groups, completing tasks through a volunteer-based system; they do not need a leader, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/multi-horizontal.png" alt="alt_text" /></p>
<p>A very good paper covering these two architectures with more insights can be found here: <a href="https://arxiv.org/abs/2404.11584">https://arxiv.org/abs/2404.11584</a></p>
<h2>Applying Vertical Multi-Agent Architecture to Observability</h2>
<p>Vertical Multi-Agent Architecture can have a manager, experts, and a communicator. This is particularly important when these architectures expose the task's result to an end user.</p>
<p>In the case of Observability, what we envision in this blog post is the scenario of an SRE running through a Root Cause Analysis (RCA) process. The high-level logic will look like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/maar-observability.png" alt="alt_text" /></p>
<ul>
<li>Communicator:
<ul>
<li>Read the initial command from the Human</li>
<li>Pass command to Manager</li>
<li>Provide status updates to Human</li>
<li>Provide a recommended resolution plan to the Human</li>
<li>Relay follow-up commands from Human to Manager</li>
</ul>
</li>
<li>Manager:
<ul>
<li>Read the initial command from the Communicator </li>
<li>Create working group </li>
<li>Assign Experts to group </li>
<li>Evaluate signals and recommendations from Experts </li>
<li>Generate recommended resolution plan </li>
<li>Execute plan (optional)</li>
</ul>
</li>
<li>Expert:
<ul>
<li>Each expert is tasked with singular expertise tied to an Elastic integration </li>
<li>Use o11y AI Assistant to triage and troubleshoot data related to their expertise </li>
<li>Work with other Experts as needed to correlate issues </li>
<li>Provide recommended root cause analysis for their expertise (if applicable) </li>
<li>Provide recommended resolution plan for their expertise (if applicable)</li>
</ul>
</li>
</ul>
<p>We believe that breaking down the experts by integration provides enough granularity in the case of observability and allows them to focus on a specific data source. Doing this also gives the manager a breakdown key when receiving a complex incident involving multiple data layers (application, network, datastores, infrastructures).</p>
<p>For example, a complex task initiated by an alert in an e-commerce application could be “Revenue dropped by 30% in the last hour.” This task would be submitted to the manager, who will look at all services, applications, datastores, network components, and infrastructure involved and decompose these into investigation tasks. Each expert would investigate within their specific scope and provide observations to the manager. The manager will be responsible for correlating and providing observations on what caused the problem.</p>
<h3>Core Architecture</h3>
<p>In the above example, we have decided to deploy the architecture on the below software architecture:</p>
<ul>
<li>The agent manager and expert agent are deployed on GCP or your favorite cloud provider</li>
<li>Most of the components are written in Python</li>
<li>A task management layer is necessary to queue the task to the expert</li>
<li>Expert agents are specifically deployed by integration/data source and converse with the Elastic AI Assistant deployed in Kibana.</li>
<li>The AI Assistant can access a real-time context to help the expert resolve their task.</li>
<li>Elasticsearch is used as the AI Assistant context and as the expert memory to build its experience.</li>
<li>The backend LLM here is GPT-4, now GPT-4o, running on Azure.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/core-architecture.png" alt="alt_text" /></p>
<h3>Agent Experience</h3>
<p>Agent experience is built from previous events stored in Elasticsearch, which the expert can search semantically for similar events. When the expert finds one, it retrieves the execution path stored in memory and executes it.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/agent-experience.png" alt="alt_text" /></p>
<p>The beauty of using the Elasticsearch Vector Database for this is the semantic query the agent can execute against the memory, and how the memory itself can be managed. Indeed, there is a notion of short- and long-term memory that could be very interesting in the case of observability: some events happen often and are probably worth storing in short-term memory because they are queried more often, while less queried but important events can be stored in longer-term memory on more cost-effective hardware.</p>
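<p>The recall flow can be illustrated with a small, self-contained sketch. This is a conceptual toy, not the actual implementation: the bag-of-words "embedding" and cosine similarity stand in for Elasticsearch's vector search and ELSER, and the <code>AgentMemory</code> class and its similarity threshold are assumptions made for the example.</p>

```python
# Toy sketch of agent "experience": past events are stored with an
# embedding, and a new event recalls the most similar past execution
# path. Elasticsearch vector search plays this role in the real
# architecture; the bag-of-words embedding here is purely illustrative.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class AgentMemory:
    def __init__(self):
        self.events = []  # (embedding, description, execution_path)

    def remember(self, description, execution_path):
        self.events.append((embed(description), description, execution_path))

    def recall(self, description, threshold=0.5):
        query = embed(description)
        scored = [(cosine(query, emb), path) for emb, _, path in self.events]
        best_score, best_path = max(scored, default=(0.0, None))
        return best_path if best_score >= threshold else None

memory = AgentMemory()
memory.remember("502 Bad Gateway spike on checkout service",
                ["check upstream health", "inspect nginx logs", "restart backend"])
path = memory.recall("502 Bad Gateway errors on checkout")
```

<p>A tiered version of this would simply keep two such stores, promoting frequently recalled events to the short-term tier.</p>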
<p>The other aspect of the Agent Experience is the semantic <a href="https://www.elastic.co/search-labs/blog/semantic-reranking-with-retrievers">reranking</a> feature with Elasticsearch. When the agent executes a task, reranking is used to surface the best outcome compared to past experience:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/agent-experience-build.png" alt="alt_text" /></p>
<p>If you are looking for a working example of the above, <a href="https://www.elastic.co/observability-labs/blog/elastic-ai-assistant-observability-escapes-kibana">check this blog post</a> where two agents work together with the Elastic Observability AI Assistant on a root cause analysis (RCA):</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/ops-burger.png" alt="alt_text" /></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/super-agent-architecture/githubcopilot-aiassistant.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Supercharge Your vSphere Monitoring with Enhanced vSphere Integration]]></title>
            <link>https://www.elastic.co/observability-labs/blog/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration</link>
            <guid isPermaLink="false">supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration</guid>
            <pubDate>Wed, 11 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Supercharge Your vSphere Monitoring with Enhanced vSphere Integration]]></description>
            <content:encoded><![CDATA[<p><a href="https://www.vmware.com/products/cloud-infrastructure/vsphere">vSphere</a> is VMware's cloud computing virtualization platform that provides a powerful suite for managing virtualized resources. It allows organizations to create, manage, and optimize virtual environments, providing advanced capabilities such as high availability, load balancing, and simplified resource allocation. vSphere enables efficient utilization of hardware resources, reducing costs while increasing the flexibility and scalability of IT infrastructure.</p>
<p>With the release of an upgraded <a href="https://www.elastic.co/docs/current/integrations/vsphere">vSphere integration</a>, we now support an enhanced set of metrics and datastreams. Package version 1.15.0 onwards introduces new datastreams that significantly improve the collection of performance metrics, providing deeper insights into your vSphere environment.</p>
<p>We have expanded the performance metrics to encompass a broader range of insights across all datastreams, while also introducing new datastreams for clusters, resource pools, and networks. This enhanced version includes a total of seven datastreams, featuring critical new metrics such as disk performance, memory utilization, and network status. These datastreams also offer detailed visibility into associated resources like hosts, clusters, and resource pools.</p>
<p>Each datastream also includes detailed alarm information, such as the alarm name, description, status (e.g., critical or warning), and the affected entity's name. To make the most of these insights, we’ve also introduced prebuilt dashboards, helping teams monitor and troubleshoot their vSphere environments with ease and precision.</p>
<h2>Overview of the Datastreams</h2>
<ul>
<li><strong>Host Datastream:</strong> This datastream monitors the disk performance of the host, including metrics such as disk latency, average read/write bytes, uptime, and status. It also captures network metrics, such as packet information, network bandwidth, and utilization, as well as CPU and memory usage of the host. Additionally, it lists associated datastores, virtual machines, and networks within vSphere.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/hosts.png" alt="Host Datastream" /></p>
<ul>
<li><strong>Virtual Machine Datastream:</strong> This datastream tracks the used and available CPU and memory resources of virtual machines, along with the uptime and status of each VM. It includes information about the host on which the VM is running, as well as detailed snapshot metrics like the number of snapshots, creation dates, and descriptions. Additionally, it provides insights into associated hosts and datastores.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/virtualmachine.png" alt="Virtual Machine Datastream" /></p>
<ul>
<li>
<p><strong>Datastore Datastream:</strong> This datastream provides information on the total, used, and available capacity of datastores, along with their overall status. It also captures metrics such as the average read/write rate and lists the hosts and virtual machines connected to each datastore.</p>
</li>
<li>
<p><strong>Datastore Cluster:</strong> A datastore cluster in vSphere is a collection of datastores grouped together for efficient storage management. This datastream provides details on the total capacity and free space in the storage pod, along with the list of datastores within the cluster.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/datastore.png" alt="Datastore Datastream" /></p>
<ul>
<li>
<p><strong>Resource Pool:</strong> Resource pools in vSphere serve as logical abstractions that allow flexible allocation of CPU and memory resources. This datastream captures memory metrics, including swapped, ballooned, and shared memory, as well as CPU metrics like distributed and static CPU entitlement. It also lists the virtual machines associated with each resource pool.</p>
</li>
<li>
<p><strong>Network Datastream:</strong> This datastream captures the overall configuration and status of the network, including network types (e.g., vSS, vDS). It also lists the hosts and virtual machines connected to each network.</p>
</li>
<li>
<p><strong>Cluster Datastream:</strong> A Cluster in vSphere is a collection of ESXi hosts and their associated virtual machines that function as a unified resource pool. Clustering in vSphere allows administrators to manage multiple hosts and resources centrally, providing high availability, load balancing, and scalability to the virtual environment. This datastream includes metrics indicating whether HA or admission control is enabled and lists the hosts, networks, and datastores associated with the cluster.</p>
</li>
</ul>
<h2>Alarms support in vSphere Integration</h2>
<p>Alarms are a vital part of the vSphere integration, providing real-time insights into critical events across your virtual environment. In Elastic’s updated vSphere integration, alarms are now reported for all entities. They include detailed information such as the alarm name, description, severity (e.g., critical or warning), affected entity, and triggered time. These alarms are seamlessly integrated into the datastreams, helping administrators and SREs quickly identify and resolve issues like resource shortages or performance bottlenecks.</p>
<h4>Example Alarm</h4>
<pre><code class="language-yaml">&quot;triggered_alarms&quot;: [
  {
    &quot;description&quot;: &quot;Default alarm to monitor host memory usage&quot;,
    &quot;entity_name&quot;: &quot;host_us&quot;,
    &quot;id&quot;: &quot;alarm-4.host-12&quot;,
    &quot;name&quot;: &quot;Host memory usage&quot;,
    &quot;status&quot;: &quot;red&quot;,
    &quot;triggered_time&quot;: &quot;2024-08-28T10:31:26.621Z&quot;
  }
]
</code></pre>
<p>This example highlights a triggered alarm for monitoring host memory usage, indicating a critical status (red) for the host &quot;host_us.&quot; Such alarms empower teams to act swiftly and maintain the stability of their vSphere environment.</p>
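<p>Because the alarms arrive as structured fields, downstream tooling can filter on them directly. A minimal Python sketch using the field names from the example document (the second alarm is invented purely for illustration):</p>

```python
# Filter the triggered_alarms field for critical (red) alarms.
# The first entry mirrors the example document above; the second is made up.
triggered_alarms = [
    {"name": "Host memory usage", "entity_name": "host_us",
     "status": "red", "triggered_time": "2024-08-28T10:31:26.621Z"},
    {"name": "Host CPU usage", "entity_name": "host_eu",
     "status": "yellow", "triggered_time": "2024-08-28T10:35:00.000Z"},
]

critical = [a for a in triggered_alarms if a["status"] == "red"]
for alarm in critical:
    print(f"{alarm['entity_name']}: {alarm['name']} ({alarm['triggered_time']})")
# host_us: Host memory usage (2024-08-28T10:31:26.621Z)
```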
<h2>Let's Try It Out!</h2>
<p>The new <a href="https://www.elastic.co/docs/current/integrations/vsphere">vSphere integration</a> in Elastic Cloud is more than just a monitoring tool; it’s a comprehensive solution that empowers you to manage and optimize your virtual environments effectively. With deeper insights and enhanced data granularity, you can ensure high availability, improved load balancing, and smarter resource allocation. Spin up an Elastic Cloud deployment and start monitoring your vSphere infrastructure.</p>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/supercharge-your-vsphere-monitoring-with-enhanced-vsphere-integration/title.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Tailoring span names and enriching spans without changing code with OpenTelemetry - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry</link>
            <guid isPermaLink="false">tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry</guid>
            <pubDate>Mon, 26 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[The OpenTelemetry Collector offers powerful capabilities to enrich and refine telemetry data before it reaches your observability tools. In this blog post, we'll explore how to leverage the Collector to create more meaningful transaction names in Elastic Observability, significantly enhancing the value of your monitoring data.]]></description>
            <content:encoded><![CDATA[<p>The OpenTelemetry Collector offers powerful capabilities to enrich and refine telemetry data before it reaches your observability tools. In this blog post, we'll explore how to leverage the Collector to create more meaningful transaction names in Elastic Observability, significantly enhancing the value of your monitoring data.</p>
<p>Consider this scenario: You have a transaction labeled simply as &quot;HTTP GET&quot; with an average response time of 5ms. However, this generic label masks a variety of distinct operations – payment processing, user logins, and adding items to a cart. Does that 5ms average truly represent the performance of these diverse actions? Clearly not.</p>
<p>The other problem is that span traces become mixed together: login spans and image-serving spans all land in the same bucket, which makes analyses like latency correlation difficult in Elastic.</p>
<p>We'll focus on a specific technique using the collector's attributes and transform processors to extract meaningful information from HTTP URLs and use it to create more descriptive span names. This approach not only improves the accuracy of your metrics but also enhances your ability to quickly identify and troubleshoot performance issues across your microservices architecture.</p>
<p>By using these processors in combination, we can quickly address the issue of overly generic transaction names, creating more granular and informative identifiers that provide accurate visibility into your services' performance.</p>
<p>However, it's crucial to approach this technique with caution. While more detailed transaction names can significantly improve observability, they can also lead to an unexpected challenge: cardinality explosion. As we dive into the implementation details, we'll also discuss how to strike the right balance between granularity and manageability, ensuring that our solution enhances rather than overwhelms our observability stack.</p>
<p>In the following sections, we'll walk through the configuration step-by-step, explaining how each processor contributes to our goal, and highlighting best practices to avoid potential pitfalls like cardinality issues. Whether you're new to OpenTelemetry or looking to optimize your existing setup, this guide will help you unlock more meaningful insights from your telemetry data.</p>
<h2>Prerequisites and configuration</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on Elastic Cloud and a deployed stack (see instructions <a href="https://www.elastic.co/cloud/">here</a>).</li>
<li>I am also using the OpenTelemetry demo in my environment; this is important to follow along with, as the demo exhibits the specific issue I want to address. You should clone the repository and follow the instructions <a href="https://github.com/elastic/opentelemetry-demo">here</a> to get it up and running. I recommend using Kubernetes; I will be doing this in my AWS EKS (Elastic Kubernetes Service) environment.</li>
</ul>
<h3>The OpenTelemetry Demo</h3>
<p>The OpenTelemetry Demo is a comprehensive, microservices-based application designed to showcase the capabilities and best practices of OpenTelemetry instrumentation. It simulates an e-commerce platform, incorporating various services such as frontend, cart, checkout, and payment processing. This demo serves as an excellent learning tool and reference implementation for developers and organizations looking to adopt OpenTelemetry.</p>
<p>The demo application generates traces, metrics, and logs across its interconnected services, demonstrating how OpenTelemetry can provide deep visibility into complex, distributed systems. It's particularly useful for experimenting with different collection, processing, and visualization techniques, making it an ideal playground for exploring observability concepts and tools like the OpenTelemetry Collector.</p>
<p>By using real-world scenarios and common architectural patterns, the OpenTelemetry Demo helps users understand how to effectively implement observability in their own applications and how to leverage the data for performance optimization and troubleshooting.</p>
<p>Once you have an Elastic Cloud instance and you fire up the OpenTelemetry demo, you should see something like this on the Elastic Service Map page:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image3.png" alt="" /></p>
<p>Navigating to the traces page will give you the following setup.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image1.png" alt="" /></p>
<p>As you can see, there are some very broad transaction names here, like HTTP GET, and the averages will not be very accurate for specific business functions within your services, as shown.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image6.png" alt="" /></p>
<p>So let's fix that with the OpenTelemetry Collector.</p>
<h2>The OpenTelemetry Collector</h2>
<p>The OpenTelemetry Collector is a vital component in the OpenTelemetry ecosystem, serving as a vendor-agnostic way to receive, process, and export telemetry data. It acts as a centralized observability pipeline that can collect traces, metrics, and logs from various sources, then transform and route this data to multiple backend systems.</p>
<p>The collector's flexible architecture allows for easy configuration and extension through a wide range of receivers, processors, and exporters which you can explore over <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">here</a>. I have personally found navigating the 'contrib' archive incredibly useful for finding techniques that I didn't know existed. This makes the OpenTelemetry Collector an invaluable tool for organizations looking to standardize their observability data pipeline, reduce overhead, and seamlessly integrate with different monitoring and analysis platforms.</p>
<p>Let's go back to our problem: how do we change the transaction names that Elastic is using to something more useful, so that our HTTP GET translates to something like payment-service/login? The first thing we do is take the full HTTP URL and consider which parts of it relate to our transaction. Looking at the span details, we see a URL:</p>
<pre><code>my-otel-demo-frontendproxy:8080/api/recommendations?productIds=&amp;sessionId=45a9f3a4-39d8-47ed-bf16-01e6e81c80bc&amp;currencyCode=
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image4.png" alt="" /></p>
<p>Now obviously we wouldn't want to create transaction names that map to every single session ID; that would lead to the cardinality explosion we talked about earlier. However, something like the first two parts of the URL, 'api/recommendations', looks like exactly the kind of thing we need.</p>
<h3>The attributes processor</h3>
<p>The OpenTelemetry Collector gives us a useful tool <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/attributesprocessor">here</a>: the attributes processor can help us extract parts of the URL to use later in our observability pipeline. Doing this is very simple; we build a regex like the one below. I should mention that I did not write this regex myself but had an LLM generate it for me; never fear regex again!</p>
<pre><code class="language-yaml">attributes:
  actions:
    - key: http.url
      action: extract
      pattern: '^(?P&lt;short_url&gt;https?://[^/]+(?:/[^/]+)*)(?:/(?P&lt;url_truncated_path&gt;[^/?]+/[^/?]+))(?:\?|/?$)'
</code></pre>
<p>This configuration is doing some heavy lifting for us, so let's break it down:</p>
<ul>
<li>We're using the attributes processor, which is perfect for manipulating span attributes.</li>
<li>We're targeting the http.url attribute of incoming spans.</li>
<li>The extract action tells the processor to pull out specific parts of the URL using our regex pattern.</li>
</ul>
<p>Now, about that regex - it's designed to extract two key pieces of information:</p>
<ol>
<li><code>short_url</code>: This captures the protocol, domain, and optionally the first path segment. For example, in &quot;<a href="https://example.com/api/users/profile">https://example.com/api/users/profile</a>&quot;, it would grab &quot;<a href="https://example.com/api">https://example.com/api</a>&quot;.</li>
<li><code>url_truncated_path</code>: This snags the next two path segments (if they exist). In our example, it would extract &quot;users/profile&quot;.</li>
</ol>
<p>Why is this useful? Well, it allows us to create more specific transaction names based on the URL structure, without including overly specific details that could lead to cardinality explosion. For instance, we avoid capturing unique IDs or query parameters that would create a new transaction name for every single request.</p>
<p>So, if we have a URL like &quot;<a href="https://example.com/api/users/profile?id=123">https://example.com/api/users/profile?id=123</a>&quot;, our extracted <code>url_truncated_path</code> would be &quot;users/profile&quot;. This gives us a nice balance - it's more specific than just &quot;HTTP GET&quot;, but not so specific that we end up with thousands of unique transaction names.</p>
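<p>You can sanity-check the pattern quickly in Python before deploying it (the collector uses Go's regexp engine, which should yield the same groups for URLs like these):</p>

```python
import re

# The extraction pattern from the attributes processor config above.
pattern = re.compile(
    r"^(?P<short_url>https?://[^/]+(?:/[^/]+)*)"
    r"(?:/(?P<url_truncated_path>[^/?]+/[^/?]+))(?:\?|/?$)"
)

m = pattern.match("https://example.com/api/users/profile?id=123")
print(m.group("short_url"))           # https://example.com/api
print(m.group("url_truncated_path"))  # users/profile
```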
<p>Now it's worth mentioning here that if you don't have an attribute you want to use for naming your transactions, it is worth looking at the options for your SDK or agent. As an example, the Java automatic instrumentation OTel agent has the <a href="https://opentelemetry.io/docs/zero-code/java/agent/instrumentation/http/#capturing-http-request-and-response-headers">following options</a> for capturing request and response headers. You can then use this data to name your transactions if the URL is insufficient!</p>
<p>In the next steps, we'll see how to use this extracted information to create more meaningful span names, providing better granularity in our observability data without overwhelming our system. Remember, the goal is to enhance our visibility, not to drown in a sea of overly specific metrics!</p>
<h3>The transform processor</h3>
<p>Now that we've extracted the relevant parts of our URLs, it's time to put that information to good use. Enter the transform processor - our next powerful tool in the OpenTelemetry Collector pipeline.</p>
<p>The <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform processor</a> allows us to modify various aspects of our telemetry data, including span names. Here's the configuration we'll use:</p>
<pre><code class="language-yaml">transform:
  trace_statements:
    - context: span
      statements:
        - set(name, attributes[&quot;url_truncated_path&quot;])
</code></pre>
<p>Let's break this down:</p>
<ul>
<li>We're using the transform processor, which gives us fine-grained control over our spans.</li>
<li>We're focusing on <code>trace_statements</code>, as we want to modify our trace spans.</li>
<li>The <code>context: span</code> tells the processor to apply these changes to each individual span.</li>
<li>Our statement is where the magic happens: we're setting the span's name to the value of the <code>url_truncated_path</code> attribute we extracted earlier.</li>
</ul>
<p>What does this mean in practice? Remember our previous example URL &quot;<a href="https://example.com/api/users/profile?id=123">https://example.com/api/users/profile?id=123</a>&quot;? Instead of a generic span name like &quot;HTTP GET&quot;, we'll now have a much more informative name: &quot;users/profile&quot;.</p>
<p>This transformation brings several benefits:</p>
<ol>
<li>Improved Readability: At a glance, you can now see what part of your application is being accessed.</li>
<li>Better Aggregation: You can easily group and analyze similar requests, like all operations on user profiles.</li>
<li>Balanced Cardinality: We're specific enough to be useful, but not so specific that we create a new span name for every unique URL.</li>
</ol>
<p>By combining the attribute extraction we did earlier with this transformation, we've created a powerful system for generating meaningful span names. This approach gives us deep insight into our application's behavior without the risk of cardinality explosion.</p>
<h2>Putting it All Together</h2>
<p>The resulting config for the OpenTelemetry Collector is below. Remember, this goes into <code>opentelemetry-demo/kubernetes/elastic-helm/configmap-deployment.yaml</code> and is applied with <code>kubectl apply -f configmap-deployment.yaml</code>.</p>
<pre><code class="language-yaml">---
apiVersion: v1
kind: ConfigMap
metadata:
  name: elastic-otelcol-agent
  namespace: default
  labels:
    app.kubernetes.io/name: otelcol

data:
  relay: |
    connectors:
      spanmetrics: {}
    exporters:
      debug: {}
      otlp/elastic:
        endpoint: ${env:ELASTIC_APM_ENDPOINT}
        compression: none
        headers:
          Authorization: Bearer ${ELASTIC_APM_SECRET_TOKEN}
    extensions:
    processors:
      batch: {}
      resource:
        attributes:
          - key: deployment.environment
            value: &quot;opentelemetry-demo&quot;
            action: upsert
      attributes:
        actions:
          - key: http.url
            action: extract
            pattern: '^(?P&lt;short_url&gt;https?://[^/]+(?:/[^/]+)*)(?:/(?P&lt;url_truncated_path&gt;[^/?]+/[^/?]+))(?:\?|/?$)'
      transform:
        trace_statements:
          - context: span
            statements:
              - set(name, attributes[&quot;url_truncated_path&quot;])
    receivers:
      httpcheck/frontendproxy:
        targets:
        - endpoint: http://example-frontendproxy:8080
      otlp:
        protocols:
          grpc:
            endpoint: ${env:MY_POD_IP}:4317
          http:
            cors:
              allowed_origins:
              - http://*
              - https://*
            endpoint: ${env:MY_POD_IP}:4318
    service:
      extensions:
      pipelines:
        logs:
          exporters:
          - debug
          - otlp/elastic
          processors:
          - batch
          - resource
          - attributes
          - transform
          receivers:
          - otlp
        metrics:
          exporters:
          - otlp/elastic
          - debug
          processors:
          - batch
          - resource
          receivers:
          - httpcheck/frontendproxy
          - otlp
          - spanmetrics
        traces:
          exporters:
          - otlp/elastic
          - debug
          - spanmetrics
          processors:
          - batch
          - resource
          - attributes
          - transform
          receivers:
          - otlp
      telemetry:
        metrics:
          address: ${env:MY_POD_IP}:8888
</code></pre>
<p>You'll notice that we tie everything together by adding our enrichment and transformations to the traces section in pipelines at the bottom of the collector config. This is the definition of our observability pipeline, bringing together all the pieces we've discussed to create more meaningful and actionable telemetry data.</p>
<p>By implementing this configuration, you're taking a significant step towards more insightful observability. You're not just collecting data; you're refining it to provide clear, actionable insights into your application's performance. Check out the final result below!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/image2.png" alt="" /></p>
<h2>Ready to Take Your Observability to the Next Level?</h2>
<p>Implementing OpenTelemetry with Elastic Observability opens up a world of possibilities for understanding and optimizing your applications. But this is just the beginning! To further enhance your observability journey, check out these valuable resources:</p>
<ol>
<li><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/tag/opentelemetry">Explore More OpenTelemetry Content</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Using the OTel Operator for Injecting Java Agents</a></li>
<li><a href="https://www.elastic.co/what-is/opentelemetry">What is OpenTelemetry?</a></li>
</ol>
<p>We encourage you to dive deeper, experiment with these configurations, and see how they can transform your observability data. Remember, the key is to find the right balance between detail and manageability.</p>
<p>Have you implemented similar strategies in your observability pipeline? We'd love to hear about your experiences and insights. Share your thoughts in the comments below or reach out to us on our community forums.</p>
<p>Stay tuned for Part 2 of this series, where we will look at an advanced technique for getting even more granular: collecting span names, baggage, and data for metrics using a Java plugin, all without code changes.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/tailoring-span-names-and-enriching-spans-without-changing-code-with-opentelemetry/tailor.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[The next evolution of observability: unifying data with OpenTelemetry and generative AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</link>
            <guid isPermaLink="false">the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai</guid>
            <pubDate>Wed, 11 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Generative AI and machine learning are revolutionizing observability, but siloed data hinders their true potential. This article explores how to break down data silos by unifying logs, metrics, and traces with OpenTelemetry, unlocking the full power of GenAI for natural language investigations, automated root cause analysis, and proactive issue resolution.]]></description>
            <content:encoded><![CDATA[<p>The Observability industry today stands at a critical juncture. While our applications generate more telemetry data than ever before, this wealth of information typically exists in siloed tools, separate systems for logs, metrics, and traces. Meanwhile, Generative AI is hurtling toward us like an asteroid about to make a tremendous impact on our industry.</p>
<p>As SREs, we've grown accustomed to jumping between dashboards, log aggregators, and trace visualizers when troubleshooting issues. But what if there was a better way? What if AI could analyze all your observability data holistically, answering complex questions in natural language, and identifying root causes automatically?</p>
<p>This is the next evolution of observability. But to harness this power, we need to rethink how we collect, store, and analyze our telemetry data.</p>
<h2>The problem: siloed data limits AI effectiveness</h2>
<p>Traditional observability setups separate data into distinct types:</p>
<ul>
<li>Metrics: Numeric measurements over time (CPU, memory, request rates)</li>
<li>Logs: Detailed event records with timestamps and context</li>
<li>Traces: Request journeys through distributed systems</li>
<li>Profiles: Code-level execution patterns showing resource consumption and performance bottlenecks at the function/line level</li>
</ul>
<p>This separation made sense historically due to the way the industry evolved. Different data types have traditionally had different cardinality, structure, access patterns and volume characteristics. However, this approach creates significant challenges for AI-powered analysis:</p>
<pre><code class="language-text">Metrics (Prometheus) → &quot;CPU spiked at 09:17:00&quot;
Logs (ELK) → &quot;Exception in checkout service at 09:17:32&quot; 
Traces (Jaeger) → &quot;Slow DB queries in order-service at 09:17:28&quot;
Profiles (Pyroscope) → &quot;calculate_discount() is taking 75% of CPU time&quot;
</code></pre>
<p>When these data sources live in separate systems, AI tools must either:</p>
<ol>
<li>Work with an incomplete picture (seeing only metrics but not the related logs)</li>
<li>Rely on complex, brittle integrations that often introduce timing skew</li>
<li>Force developers to manually correlate information across tools</li>
</ol>
<p>Imagine asking an AI, &quot;Why did checkout latency spike at 09:17?&quot; To answer comprehensively, it needs access to logs (to see the stack trace), traces (to understand the service path), and metrics (to identify resource strain). With siloed tools, the AI either sees only fragments of the story or requires complex ETL jobs that are slower than the incident itself.</p>
<h2>Why traditional machine learning (ML) falls short</h2>
<p>Traditional machine learning for observability typically focuses on anomaly detection within a single data dimension. It can tell you when metrics deviate from normal patterns, but struggles to provide context or root cause.</p>
<p>ML models trained on metrics alone might flag a latency spike, but can't connect it to a recent deployment (found in logs) or identify that it only affects requests to a specific database endpoint (found in traces). They behave like humans with extreme tunnel vision: they see only a fraction of the relevant information, and only the slice a specific vendor has chosen to give them an opinionated view into.</p>
<p>This limitation becomes particularly problematic in modern microservice architectures where problems frequently cascade across services. Without a unified view, traditional ML can detect symptoms but struggles to identify the underlying cause.</p>
<h2>The solution: unified data with enriched logs</h2>
<p>The solution is conceptually simple but transformative: unify metrics, logs, and traces into a single data store, ideally as enriched logs that carry every signal about a request in a single JSON document. Across the industry, we're about to see this merging of signals.</p>
<p>Think of traditional logs as simple text lines:</p>
<pre><code class="language-text">[2025-05-19 09:17:32] ERROR OrderService - Failed to process checkout for user 12345
</code></pre>
<p>Now imagine an enriched log that contains not just the error message, but also:</p>
<ul>
<li>The complete distributed trace context</li>
<li>Related metrics at that moment</li>
<li>System environment details</li>
<li>Business context (user ID, cart value, etc.)</li>
</ul>
<p>This approach creates a holistic view where every signal about the same event sits side-by-side, perfect for AI analysis.</p>
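<p>As a concrete sketch, such an enriched log could be stored as a single JSON document along these lines (the field names are illustrative, loosely following OpenTelemetry and ECS conventions):</p>
<pre><code class="language-json">{
  "@timestamp": "2025-05-19T09:17:32.412Z",
  "log.level": "ERROR",
  "message": "Failed to process checkout for user 12345",
  "service.name": "order-service",
  "trace.id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span.id": "00f067aa0ba902b7",
  "host.name": "checkout-pod-7f9c",
  "system.cpu.load": 3.7,
  "user.id": "12345",
  "cart.value": 149.99
}
</code></pre>
<p>Because the trace context, a metric sample, the environment, and the business context all live in one document, a single query (or a single prompt) can reach every signal at once.</p>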
<h2>How generative AI changes things</h2>
<p>Generative AI differs fundamentally from traditional ML in its ability to:</p>
<ol>
<li>Process unstructured data: Understanding free-form log messages and error text</li>
<li>Maintain context: Connecting related events across time and services</li>
<li>Answer natural language queries: Translating human questions into complex data analysis</li>
<li>Generate explanations: Providing reasoning alongside conclusions</li>
<li>Surface hidden patterns: Discovering correlations and anomalies in log data that would be impractical to find through manual analysis or traditional querying</li>
</ol>
<p>With access to unified observability data, GenAI can analyze complete system behavior patterns and correlate across previously disconnected signals.</p>
<p>For example, when asked &quot;Why is our checkout service slow?&quot; a GenAI model with access to unified data can:</p>
<ul>
<li>Analyze unified enriched logs to identify which specific operations are slow and to find errors or warnings in those components</li>
<li>Check attached metrics to understand resource utilization</li>
<li>Correlate all these signals with deployment events or configuration changes</li>
<li>Present a coherent explanation in natural language with supporting graphs and visualizations</li>
</ul>
<h2>Implementing unified observability with OpenTelemetry</h2>
<p>OpenTelemetry provides the perfect foundation for unified observability with its consistent schema across metrics, logs, and traces. Here's how to implement enriched logs in a Java application:</p>
<pre><code class="language-java">import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OrderProcessor {
    private static final Logger logger = LoggerFactory.getLogger(OrderProcessor.class);
    private final Tracer tracer;
    private final DoubleHistogram cpuUsageHistogram;
    private final OperatingSystemMXBean osBean;

    public OrderProcessor(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer(&quot;order-processor&quot;);
        Meter meter = openTelemetry.getMeter(&quot;order-processor&quot;);
        this.cpuUsageHistogram = meter.histogramBuilder(&quot;system.cpu.load&quot;)
                                      .setDescription(&quot;System CPU load&quot;)
                                      .setUnit(&quot;1&quot;)
                                      .build();
        this.osBean = ManagementFactory.getOperatingSystemMXBean();
    }

    public void processOrder(String orderId, double amount, String userId) {
        Span span = tracer.spanBuilder(&quot;processOrder&quot;).startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Add attributes to the span
            span.setAttribute(&quot;order.id&quot;, orderId);
            span.setAttribute(&quot;order.amount&quot;, amount);
            span.setAttribute(&quot;user.id&quot;, userId);
            // Populate MDC for structured logging
            MDC.put(&quot;trace_id&quot;, span.getSpanContext().getTraceId());
            MDC.put(&quot;span_id&quot;, span.getSpanContext().getSpanId());
            MDC.put(&quot;order_id&quot;, orderId);
            MDC.put(&quot;order_amount&quot;, String.valueOf(amount));
            MDC.put(&quot;user_id&quot;, userId);
            // Record CPU usage metric associated with the current trace context
            double cpuLoad = osBean.getSystemLoadAverage();
            if (cpuLoad &gt;= 0) {
                cpuUsageHistogram.record(cpuLoad);
                MDC.put(&quot;cpu_load&quot;, String.valueOf(cpuLoad));
            }
            // Log a structured message
            logger.info(&quot;Processing order&quot;);
            // Simulate business logic
            // ...
            span.setAttribute(&quot;order.status&quot;, &quot;completed&quot;);
            logger.info(&quot;Order processed successfully&quot;);
        } catch (Exception e) {
            span.recordException(e);
            span.setAttribute(&quot;order.status&quot;, &quot;failed&quot;);
            logger.error(&quot;Order processing failed&quot;, e);
        } finally {
            MDC.clear();
            span.end();
        }
    }
}
</code></pre>
<p>This code demonstrates how to:</p>
<ol>
<li>Create a span for the operation</li>
<li>Add business attributes</li>
<li>Add current CPU usage</li>
<li>Link everything with consistent IDs</li>
<li>Record exceptions and outcomes on the span so they reach the backend system</li>
</ol>
<p>When configured with an appropriate exporter, this creates enriched logs that contain both application events and their complete context.</p>
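<p>For instance, with a JSON layout wired into SLF4J, the &quot;Processing order&quot; line emitted above could come out roughly like this (the exact shape depends on your logging configuration; the MDC values are strings because of the String.valueOf calls):</p>
<pre><code class="language-json">{
  "@timestamp": "2025-05-19T09:17:31.812Z",
  "log.level": "INFO",
  "message": "Processing order",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "order_id": "ORD-1001",
  "order_amount": "149.99",
  "user_id": "12345",
  "cpu_load": "3.7"
}
</code></pre>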
<h2>Powerful queries across previously separate data</h2>
<p>Even when your data has not yet been enriched, there is still hope. First, with GenAI-powered ingestion it is possible to extract key fields that help correlate data, such as session IDs. This enriches your logs with the structure they need to behave like other signals. Below we can see Elastic's Auto Import mechanism, which automatically generates ingest pipelines and pulls unstructured information from logs into a structured format that is ideal for analytics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image4.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image2.png" alt="" /></p>
<p>Once you have this data in the same data store, you can perform powerful join queries that were previously impossible. For example, finding slow database queries that affected specific API endpoints:</p>
<pre><code class="language-sql">FROM logs-nginx.access-default 
| LOOKUP JOIN .ds-logs-mysql.slowlog-default-2025.05.01-000002 ON request_id 
| KEEP request_id, mysql.slowlog.query, url.query 
| WHERE mysql.slowlog.query IS NOT NULL
</code></pre>
<p>This query joins web server logs with database slow query logs, allowing you to directly correlate user-facing performance with database operations.</p>
<p>For GenAI interfaces, these complex queries can be generated automatically from natural language questions:</p>
<p>&quot;Show me all checkout failures that coincided with slow database queries&quot;</p>
<p>The AI translates this into appropriate queries across your unified data store, correlating application errors with database performance.</p>
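<p>Under the hood, that question might become an ES|QL query along these lines (the index and field names here are hypothetical and depend on your own data streams):</p>
<pre><code class="language-sql">FROM logs-checkout-default
| WHERE log.level == &quot;ERROR&quot;
| LOOKUP JOIN logs-mysql.slowlog-default ON request_id
| WHERE mysql.slowlog.query IS NOT NULL
| KEEP @timestamp, request_id, message, mysql.slowlog.query
</code></pre>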
<h2>Real-world applications and use cases</h2>
<h3>Natural language investigation</h3>
<p>Imagine asking your observability system:</p>
<p>&quot;Why did checkout latency spike at 09:17 yesterday?&quot;</p>
<p>A GenAI-powered system with unified data could respond:</p>
<p>&quot;Checkout latency increased by 230% at 09:17:32 following deployment v2.4.1 at 09:15. The root cause appears to be increased MySQL query times in the inventory-service. Specifically, queries to the 'product_availability' table are taking an average of 2300ms compared to the normal 95ms. This coincides with a CPU spike on database host db-03 and 24 'Lock wait timeout' errors in the inventory service logs.&quot;</p>
<p>Here's an example of Claude Desktop connected to <a href="https://github.com/elastic/mcp-server-elasticsearch">Elastic's MCP (Model Context Protocol) Server</a>, which demonstrates how powerful natural language investigations can be. We ask Claude to &quot;analyze my web traffic patterns,&quot; and as you can see, it correctly identifies what is happening in our demo environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image3.png" alt="" /></p>
<h3>Unknown problem detection</h3>
<p>GenAI can identify subtle patterns by correlating signals that would be missed in siloed systems. For example, it might notice that a specific customer ID appears in error logs only when a particular network path is taken through your microservices—indicating a data corruption issue affecting only certain user flows.</p>
<h3>Predictive maintenance</h3>
<p>By analyzing the unified historical patterns leading up to previous incidents, GenAI can identify emerging problems before they cause outages:</p>
<p>&quot;Warning: Current load pattern on authentication-service combined with increasing error rates in user-profile-service matches 87% of the signature that preceded the April 3rd outage. Recommend scaling user-profile-service pods immediately.&quot;</p>
<h2>The future: agentic AI for observability</h2>
<p>The next frontier is agentic AI: systems that not only analyze but also take action automatically.</p>
<p>These AI agents could:</p>
<ol>
<li>Continuously monitor all observability signals</li>
<li>Autonomously investigate anomalies</li>
<li>Implement fixes for known patterns</li>
<li>Learn from the effectiveness of previous interventions</li>
</ol>
<p>For example, an observability agent might:</p>
<ul>
<li>Detect increased error rates in a service</li>
<li>Analyze logs and traces to identify a memory leak</li>
<li>Correlate with recent code changes</li>
<li>Increase the memory limit temporarily</li>
<li>Create a detailed ticket with the root cause analysis</li>
<li>Monitor the fix effectiveness</li>
</ul>
<p>This is about creating systems that understand your application's behavior patterns deeply enough to maintain them proactively. You can see how this works in Elastic Observability: in the screenshot below, at the end of the root cause analysis (RCA) we send an email summary, but this step could trigger any action.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/image1.png" alt="" /></p>
<h2>Business outcomes</h2>
<p>Unifying observability data for GenAI analysis delivers concrete benefits:</p>
<ul>
<li>Faster resolution times: Problems that previously required hours of manual correlation can be diagnosed in seconds</li>
<li>Fewer escalations: Junior engineers can leverage AI to investigate complex issues before involving specialists</li>
<li>Improved system reliability: Earlier detection and resolution of emerging issues</li>
<li>Better developer experience: Less time spent context-switching between tools</li>
<li>Enhanced capacity planning: More accurate prediction of resource needs</li>
</ul>
<h2>Implementation steps</h2>
<p>Ready to start your observability transformation? Here's a practical roadmap:</p>
<ol>
<li>Adopt OpenTelemetry: Standardize on OpenTelemetry for all telemetry data collection and use it to generate enriched logs.</li>
<li>Choose a unified storage solution: Select a platform that can efficiently store and query metrics, logs, traces and enriched logs together</li>
<li>Enrich your telemetry: Update application instrumentation to include relevant context</li>
<li>Create correlation IDs: Ensure every request carries consistent identifiers (such as trace and request IDs) across services</li>
<li>Implement semantic conventions: Follow consistent naming patterns across your telemetry data</li>
<li>Start with focused use cases: Begin with high-value scenarios like checkout flows or critical APIs</li>
<li>Leverage GenAI tools: Integrate tools that can analyze your unified data and respond to natural language queries</li>
</ol>
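<p>To make step 4 concrete, here is a minimal Node.js sketch of correlation ID propagation (the withCorrelationId helper and its field names are hypothetical; in an OpenTelemetry setup, the trace ID typically plays this role, as in the Java example above):</p>
<pre><code class="language-javascript">const crypto = require('crypto');

// Hypothetical helper: wraps a request handler so that every structured log
// line it emits carries the same correlation ID, letting logs, traces, and
// metrics for one request be joined later in a unified data store.
function withCorrelationId(handler) {
  return function (req) {
    // Reuse an ID propagated by an upstream service via the x-request-id
    // header; otherwise mint a new one at the edge of the system.
    const headers = req.headers || {};
    const requestId = headers['x-request-id'] || crypto.randomUUID();
    function log(level, message, fields) {
      return JSON.stringify(Object.assign({
        '@timestamp': new Date().toISOString(),
        'log.level': level,
        message: message,
        request_id: requestId,
      }, fields || {}));
    }
    return handler(req, { requestId: requestId, log: log });
  };
}
</code></pre>
<p>Every line a wrapped handler logs then shares one request_id, giving join queries like the ES|QL example earlier a key to pivot on.</p>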
<p>Remember, AI can only be as smart as the data you feed it. The quality and completeness of your telemetry data will determine the effectiveness of your AI-powered observability.</p>
<h2>Generative AI: an evolutionary catalyst for observability</h2>
<p>The unification of observability data for GenAI analysis represents an evolutionary leap forward comparable to the transition from Internet 1.0 to 2.0. Early adopters will gain a significant competitive advantage through faster problem resolution, improved system reliability, and more efficient operations. GenAI is a huge step toward increasing observability maturity and moving your team to a more proactive stance.</p>
<p>Think of traditional observability as a doctor trying to diagnose a patient while only able to see their heart rate. Unified observability with GenAI is like giving that doctor the complete health picture: vital signs, lab results, medical history, and genetic data, all accessible through natural conversation.</p>
<p>As SREs, we stand at the threshold of a new era in system observability. The asteroid of GenAI isn't a threat to be feared; it's an opportunity to evolve our practices and tools to build more reliable, understandable systems. The question isn't whether this transformation will happen, but who will lead it.</p>
<p>Will you?</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai/title.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Trace your Azure Function application with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/trace-azure-function-application-observability</link>
            <guid isPermaLink="false">trace-azure-function-application-observability</guid>
            <pubDate>Tue, 16 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Serverless applications deployed on Azure Functions are growing in usage. This blog shows how to deploy a serverless application on Azure functions with Elastic Agent and use Elastic's APM capability to manage and troubleshoot issues.]]></description>
            <content:encoded><![CDATA[<p>Adoption of Azure Functions in cloud-native applications on Microsoft Azure has been increasing exponentially over the last few years. Serverless functions, such as the Azure Functions, provide a high level of abstraction from the underlying infrastructure and orchestration, given these tasks are managed by the cloud provider. Software development teams can then focus on the implementation of business and application logic. Some additional benefits include billing for serverless functions based on the actual compute and memory resources consumed, along with automatic on-demand scaling.</p>
<p>While the benefits of using serverless functions are manifold, it is also necessary to make them observable in the wider end-to-end microservices architecture context.</p>
<h2>Elastic Observability (APM) for Azure Functions: The architecture</h2>
<p><a href="https://www.elastic.co/blog/whats-new-elastic-observability-8-7-0">Elastic Observability 8.7</a> introduced distributed tracing for Microsoft Azure Functions — available for the Elastic APM Agents for .NET, Node.js, and Python. Auto-instrumentation of HTTP requests is supported out-of-the-box, enabling the detection of performance bottlenecks and sources of errors.</p>
<p>The key components of the solution for observing Azure Functions are:</p>
<ol>
<li>The Elastic APM Agent for the relevant language</li>
<li>Elastic Observability</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-function.png" alt="azure function" /></p>
<p>The APM server validates and processes incoming events from individual APM Agents and transforms them into Elasticsearch documents. The APM Agent provides auto-instrumentation capabilities for the application being observed. The Node.js APM Agent can trace function invocations in an Azure Functions app.</p>
<h2>Setting up Elastic APM for Azure Functions</h2>
<p>To demonstrate the setup and usage of Elastic APM, we will use a <a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app">sample Node.js application</a>.</p>
<h3>Application overview</h3>
<p>The Node.js application has two <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook">HTTP-triggered</a> functions named &quot;<a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Hello/index.js">Hello</a>&quot; and &quot;<a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Goodbye/index.js">Goodbye</a>.&quot; Once deployed, they can be called as follows, and tracing data will be sent to the configured Elastic Observability deployment.</p>
<pre><code class="language-bash">curl -i https://&lt;APP_NAME&gt;.azurewebsites.net/api/hello
curl -i https://&lt;APP_NAME&gt;.azurewebsites.net/api/goodbye
</code></pre>
<h3>Setup</h3>
<p><strong>Step 0. Prerequisites</strong></p>
<p>To run the sample application, you will need:</p>
<ul>
<li>
<p>An installation of <a href="https://nodejs.org/">Node.js</a> (v14 or later)</p>
</li>
<li>
<p>Access to an Azure subscription with an appropriate role to create resources</p>
</li>
<li>
<p>The <a href="https://learn.microsoft.com/en-us/cli/azure/install-azure-cli">Azure CLI (az)</a> logged into an Azure subscription</p>
<ol>
<li>Use az login to login</li>
<li>See the output of az account show</li>
</ol>
</li>
<li>
<p>The <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-run-local?tabs=v4%2Cwindows%2Ccsharp%2Cportal%2Cbash#install-the-azure-functions-core-tools">Azure Functions Core Tools (func)</a> (func --version should show a 4.x version)</p>
</li>
<li>
<p>An Elastic Observability deployment to which monitoring data will be sent</p>
<ol>
<li>The simplest way to get started with Elastic APM Microsoft Azure is through Elastic Cloud. <a href="https://www.elastic.co/guide/en/elastic-stack-deploy/current/azure-marketplace-getting-started.html">Get started with Elastic Cloud on Azure Marketplace</a> or <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">sign up for a trial on Elastic Cloud</a>.</li>
</ol>
</li>
<li>
<p>The APM server URL (serverUrl) and secret token (secretToken) from your Elastic stack deployment for configuration below</p>
<ol>
<li><a href="https://www.elastic.co/guide/en/apm/guide/8.7/install-and-run.html">How to get the serverUrl and secretToken documentation</a></li>
</ol>
</li>
</ul>
<p><strong>Step 1. Clone the sample application repo and install dependencies</strong></p>
<pre><code class="language-bash">git clone https://github.com/elastic/azure-functions-apm-nodejs-sample-app.git
cd azure-functions-apm-nodejs-sample-app
npm install
</code></pre>
<p><strong>Step 2. Deploy the Azure Function App</strong><br />
Caution: Deploying a function app to Azure can incur <a href="https://azure.microsoft.com/en-us/pricing/details/functions/">costs</a>. The following setup uses the free tier of Azure Functions. Step 5 covers the clean-up of resources.</p>
<p><strong>Step 2.1</strong><br />
To avoid name collisions with others that have independently run this demo, we need a short unique identifier for some resource names that need to be globally unique. We'll call it the DEMO_ID. You can run the following to generate one and save it to DEMO_ID and the &quot;demo-id&quot; file.</p>
<pre><code class="language-bash">if [[ ! -f demo-id ]]; then node -e 'console.log(crypto.randomBytes(3).toString(&quot;hex&quot;))' &gt;demo-id; fi
export DEMO_ID=$(cat demo-id)
echo $DEMO_ID
</code></pre>
<p><strong>Step 2.2</strong><br />
Before you can deploy to Azure, you will need to create some Azure resources: a Resource Group, Storage Account, and the Function App. For this demo, you can use the following commands. (See <a href="https://learn.microsoft.com/en-us/azure/azure-functions/create-first-function-cli-node#create-supporting-azure-resources-for-your-function">this Azure docs section</a> for more details.)</p>
<pre><code class="language-bash">REGION=westus2   # Or use another region listed in 'az account list-locations'.
az group create --name &quot;AzureFnElasticApmNodeSample-rg&quot; --location &quot;$REGION&quot;
az storage account create --name &quot;eapmdemostor${DEMO_ID}&quot; --location &quot;$REGION&quot; \
    --resource-group &quot;AzureFnElasticApmNodeSample-rg&quot; --sku Standard_LRS
az functionapp create --name &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
    --resource-group &quot;AzureFnElasticApmNodeSample-rg&quot; \
    --consumption-plan-location &quot;$REGION&quot; --runtime node --runtime-version 18 \
    --functions-version 4 --storage-account &quot;eapmdemostor${DEMO_ID}&quot;
</code></pre>
<p><strong>Step 2.3</strong><br />
Next, configure your Function App with the APM server URL and secret token for your Elastic deployment. This can be done in the <a href="https://portal.azure.com/">Azure Portal</a> or with the az CLI.</p>
<p>In the Azure portal, browse to your Function App, then its Application Settings (<a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-how-to-use-azure-function-app-settings?tabs=portal#settings">Azure user guide</a>). You'll need to add two settings:</p>
<p>First, set your APM server URL and secret token in your shell:</p>
<pre><code class="language-bash">export ELASTIC_APM_SERVER_URL=&quot;&lt;your serverUrl&gt;&quot;
export ELASTIC_APM_SECRET_TOKEN=&quot;&lt;your secretToken&gt;&quot;
</code></pre>
<p>Or you can use the az functionapp config appsettings set ... CLI command as follows:</p>
<pre><code class="language-bash">az functionapp config appsettings set \
  -g &quot;AzureFnElasticApmNodeSample-rg&quot; -n &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
  --settings &quot;ELASTIC_APM_SERVER_URL=${ELASTIC_APM_SERVER_URL}&quot;
az functionapp config appsettings set \
  -g &quot;AzureFnElasticApmNodeSample-rg&quot; -n &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot; \
  --settings &quot;ELASTIC_APM_SECRET_TOKEN=${ELASTIC_APM_SECRET_TOKEN}&quot;
</code></pre>
<p>The ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN values are stored in the Function App’s application settings and picked up automatically by the Elastic APM Agent. The agent is started by the initapm.js file with:</p>
<pre><code class="language-javascript">require(&quot;elastic-apm-node&quot;).start();
</code></pre>
<p>When you log in to Azure and look at the function’s configuration, you will see them set:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-functions-application-settings.png" alt="azure functions application settings" /></p>
<p><strong>Step 2.4</strong><br />
Now you can publish your app. (Re-run this command every time you make a code change.)</p>
<pre><code class="language-bash">func azure functionapp publish &quot;azure-functions-apm-nodejs-sample-app-${DEMO_ID}&quot;
</code></pre>
<p>You should log in to Azure to see the function running.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-azure-function-app.png" alt="azure function app" /></p>
<p><strong>Step 3. Try it out</strong></p>
<pre><code class="language-bash">% curl https://azure-functions-apm-nodejs-sample-app-${DEMO_ID}.azurewebsites.net/api/Hello
{&quot;message&quot;:&quot;Hello.&quot;}
% curl https://azure-functions-apm-nodejs-sample-app-${DEMO_ID}.azurewebsites.net/api/Goodbye
{&quot;message&quot;:&quot;Goodbye.&quot;}
</code></pre>
<p>In a few moments, the APM app in your Elastic deployment will show tracing data for your Azure Function app.</p>
<p><strong>Step 4. Apply some load to your app</strong><br />
To get some more interesting data, you can run the following to generate some load on your deployed function app:</p>
<pre><code class="language-bash">npm run loadgen
</code></pre>
<p>This uses the <a href="https://github.com/mcollina/autocannon">autocannon</a> node package to generate some light load (2 concurrent users, each calling at 5 requests/s for 60s) on the &quot;Goodbye&quot; function.</p>
<p><strong>Step 5. Clean up resources</strong><br />
If you deployed to Azure, you should make sure to delete any resources so you don't incur any costs.</p>
<pre><code class="language-bash">az group delete --name &quot;AzureFnElasticApmNodeSample-rg&quot;
</code></pre>
<h2>Analyzing Azure Function APM data in Elastic</h2>
<p>Once you have successfully set up the sample application and started generating load, you should see APM data appearing in the Elastic Observability APM Services capability.</p>
<h2>Service map</h2>
<p>With the default setup, you will see two services in the APM Service map.</p>
<p>The main function: azure-functions-apm-nodejs-sample-app</p>
<p>And the endpoint where your function is accessible: azure-functions-apm-nodejs-sample-app-ec7d4c.azurewebsites.net</p>
<p>You will see that there is a connection between the two as your application is taking requests and answering through the endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<p>From the <a href="https://www.elastic.co/observability/application-performance-monitoring">APM Service</a> map you can further investigate the function, analyze traces, look at logs, and more.</p>
<h3>Service details</h3>
<p>When we dive into the details, we can see several items.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-azure-functions-apm.png" alt="observability azure functions apm" /></p>
<ul>
<li>Latency for the recent load we ran against the application</li>
<li>Transactions (Goodbye and Hello)</li>
<li>Average throughput</li>
<li>And more</li>
</ul>
<h3>Transaction details</h3>
<p>We can see transaction details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-observability-get-api-goodbye.png" alt="observability get api goodbye" /></p>
<p>An individual trace shows us that the &quot;Goodbye&quot; function <a href="https://github.com/elastic/azure-functions-apm-nodejs-sample-app/blob/main/Goodbye/index.js#L6-L10">calls the &quot;Hello&quot; function</a> in the same function app before returning:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-latency-distribution-trace-sample.png" alt="latency distribution trace sample" /></p>
<h3>Machine learning based latency correlation</h3>
<p>As we’ve mentioned in other blogs, we can also correlate issues such as higher-than-normal latency. Since we see a spike at 1s, we run the embedded latency correlation, which uses machine learning to help identify the components most likely contributing to the issue by analyzing logs, metrics, and traces.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-latency-distribution-correlations.png" alt="latency distribution correlations" /></p>
<p>The correlation indicated a potential cause (25% correlation) related to the host sending the load (my machine).</p>
<h3>Cold start detection</h3>
<p>Also, we can see the impact a <a href="https://azure.microsoft.com/en-ca/blog/understanding-serverless-cold-start/">cold start</a> can have on the latency of a request:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/blog-elastic-trace-sample.png" alt="trace sample" /></p>
<h2>Summary</h2>
<p>Elastic Observability provides real-time monitoring of Azure Functions in your production environment for a broad range of use cases. Curated dashboards assist DevOps teams in performing root cause analysis for performance bottlenecks and errors. SRE teams can quickly view upstream and downstream dependencies, as well as perform analyses in the context of distributed microservices architecture.</p>
<h2>Learn more</h2>
<p>To learn how to add the Elastic APM Agent to an existing Node.js Azure Function app, read <a href="https://www.elastic.co/guide/en/apm/agent/nodejs/master/azure-functions.html">Monitoring Node.js Azure Functions</a>. Additional resources include:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-with-the-azure-integration-enhancement">How to deploy and manage Elastic Observability on Microsoft Azure</a></li>
<li><a href="https://www.elastic.co/guide/en/apm/guide/current/apm-quick-start.html">Elastic APM Quickstart</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/trace-azure-function-application-observability/09-road.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Trace-based testing with Elastic APM and Tracetest]]></title>
            <link>https://www.elastic.co/observability-labs/blog/trace-based-testing-apm-tracetest</link>
            <guid isPermaLink="false">trace-based-testing-apm-tracetest</guid>
            <pubDate>Wed, 15 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Want to run trace-based tests with Elastic APM? We're happy to announce that Tracetest now integrates with Elastic Observability APM. Check out this hands-on example of how Tracetest works with Elastic Observability APM and OpenTelemetry.]]></description>
            <content:encoded><![CDATA[<p><em>This post was originally published on the</em> <a href="https://tracetest.io/blog/tracetest-integration-elastic-trace-based-testing-application-performance-monitoring"><em>Tracetest blog</em></a><em>.</em></p>
<p>Want to run trace-based tests with Elastic APM? Today is your lucky day. We're happy to announce that Tracetest now integrates with Elastic Observability APM.</p>
<p>Check out this <a href="https://github.com/kubeshop/tracetest/tree/main/examples/tracetest-elasticapm-with-elastic-agent">hands-on example</a> of how Tracetest works with Elastic Observability APM and OpenTelemetry!</p>
<p><a href="https://tracetest.io/">Tracetest</a> is a <a href="https://www.cncf.io/">CNCF</a> project aiming to provide a solution for deep integration and system testing by leveraging the rich data in distributed system traces. In this blog, we intend to provide an introduction to Tracetest and its capabilities, including how it can be integrated with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic Application Performance Monitoring</a> and <a href="https://opentelemetry.io/">OpenTelemetry</a> to enhance the testing process.</p>
<h2>Your good friend distributed tracing</h2>
<p>Distributed tracing is a way to understand how a distributed system works by tracking the flow of requests through the system. It can be used for a variety of purposes, such as identifying and fixing performance issues, figuring out what went wrong when an error occurs, and making sure that the system is running smoothly. Here are a few examples of how distributed tracing can be used:</p>
<ul>
<li><strong>Monitoring performance:</strong> Distributed tracing can help you keep an eye on how your distributed system is performing by showing you what's happening in real time. This can help you spot and fix problems like bottlenecks or slow response times that can make the system less reliable.</li>
<li><strong>Finding the source of problems:</strong> When something goes wrong, distributed tracing can help you figure out what happened by showing you the sequence of events that led up to the problem. This can help you pinpoint the specific service or component that's causing the issue and fix it.</li>
<li><strong>Debugging:</strong> Distributed tracing can help you find and fix bugs by giving you detailed information about what's happening in the system. This can help you understand why certain requests are behaving in unexpected ways and how to fix them.</li>
<li><strong>Security:</strong> Distributed tracing can help you keep an eye on security by showing you who is making requests to the system, where they are coming from, and what services are being accessed.</li>
<li><strong>Optimization:</strong> Distributed tracing can help you optimize the performance of the system by providing insight into how requests are flowing through it, which can help you identify areas that can be made more efficient and reduce the number of requests that need to be handled.</li>
</ul>
<h2>Distributed tracing — Now also for testing</h2>
<p>Observability, previously only used in operations, is now being applied in other areas of development, such as testing. This shift has led to the emergence of <a href="https://www.infoq.com/articles/observability-driven-development/">&quot;Observability-driven development&quot;</a> and &quot;trace-based testing&quot; as new methods for using distributed tracing to test distributed applications.</p>
<p>Instead of just checking that certain parts of the code are working, trace-driven testing follows the path that a request takes as it goes through the system. This way, you can make sure that the entire system is working properly and that the right output is produced for a given input. By using distributed tracing, developers can record what happens during the test and then use that information to check that everything is working as it should.</p>
<p>This method of testing can help to find problems that may be hard to detect with other types of testing and can better validate that the new code is working as expected. Additionally, distributed tracing provides information about what is happening during the test, such as how long it takes for a request to be processed and which services are being used, which can help developers understand how the code behaves in a real-world scenario.</p>
<h2>Enter Tracetest</h2>
<p><a href="https://tracetest.io/">Tracetest</a> is a CNCF project that runs tests by verifying new traces against assertions previously created from traces captured on the real system. Here's how you can use Tracetest:</p>
<ul>
<li>Capture a known-good baseline trace. This becomes the golden standard against which you write your tests and assertions. Testing against a whole trace is a better way to verify how different parts of the system work together: it lets developers validate the entire request flow from start to finish and gives a more complete view of how the system is functioning than a set of disjointed assertions on the request execution.</li>
<li>Now you can start validating your code changes against the known-good behavior captured previously.</li>
<li>Tracetest can validate the resulting traces from the test and see if the system is working as it should. This can help you find problems that traditional testing methods might not catch.</li>
<li>Create reports: Tracetest can also create reports that summarize the results of the test so that you can share the information with your team.</li>
<li>Help you validate in production that the new requests follow the known path and run the predefined assertions against them.</li>
</ul>
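<p>To make the workflow above concrete, here is a minimal, hypothetical sketch in plain Python of the core idea: select spans from a captured trace and run assertions against them. This is purely illustrative of the concept; Tracetest's actual selector language and assertion engine are far richer, and all function and field names here are made up.</p>

```python
# Illustrative sketch only: not Tracetest's implementation.
# Span fields and helper names are hypothetical.

def select(spans, **attrs):
    """Return spans whose attributes match all given key/value pairs."""
    return [s for s in spans if all(s.get(k) == v for k, v in attrs.items())]

def run_assertions(spans, checks):
    """Apply each (description, predicate) check to every selected span."""
    results = []
    for span in spans:
        for desc, predicate in checks:
            results.append((desc, predicate(span)))
    return results

# A known-good baseline trace captured from the running system.
trace = [
    {"type": "http", "name": "GET /", "http.status_code": 200, "duration_ms": 120},
    {"type": "db", "name": "GET redis", "duration_ms": 4},
]

checks = [
    ("status is 200", lambda s: s["http.status_code"] == 200),
    ("completes in under 500ms", lambda s: s["duration_ms"] < 500),
]

results = run_assertions(select(trace, type="http", name="GET /"), checks)
print(results)  # every check passes for the baseline trace
```

<p>New traces produced by a code change can then be run through the same checks, surfacing regressions in status codes, latency, or trace shape.</p>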
<p>The APM tool in Kibana, which is a familiar UI for many developers, can provide extra information when used with Tracetest. The APM tool can show you how the system is performing during the test and help you find issues using the familiar user interface. For example, the APM tool can show you how requests are moving through the system, how long requests take to be processed, and which parts of the system are being used. This information can help you identify and fix problems during testing.</p>
<p>Furthermore, the APM tool can be set to show you all the data in real-time, which allows you to monitor the system's behavior during the test or even in production and helps you make sense of what Tracetest is showing.</p>
<h2>How Tracetest works with Elastic APM to test the application</h2>
<p>The components work together to provide a complete solution for testing distributed systems. The telemetry captured by the OpenTelemetry agent is sent to the Elastic APM Server, which processes and formats the data for indexing in Elasticsearch. The data can then be queried and analyzed using Kibana APM UI, and Tracetest can be used to conduct deep integration and system tests by utilizing the rich data contained in the distributed system trace.</p>
<p>For more details on Elastic's support for OpenTelemetry, check out <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-distributed-system-trace.png" alt="" /></p>
<ol>
<li>Tracetest initiates the test by sending a request to the application under test.</li>
<li>The application processes the request, and the built-in OpenTelemetry agent captures the telemetry data of the request. This data includes information such as request and response payloads, request and response headers, and any errors that occurred during the request processing. The agent then sends the captured telemetry data to the Elastic APM Server.</li>
<li>Elastic APM server consumes OpenTelemetry or Elastic APM spans and sends the data to be stored and indexed in Elasticsearch.</li>
<li>Tracetest polls Elasticsearch to retrieve the captured trace data. It makes use of Elasticsearch query to fetch the trace data. Tracetest compares the received trace data with the expected trace data and runs the assertions. This step is used to check whether the data received from the application matches the expected data and to check for any errors or issues that may have occurred during the request processing. Based on the results of the comparison, Tracetest will report any errors or issues found and will provide detailed information about the root cause of the problem. If the test passes, Tracetest will report that the test passed, and the test execution process will be completed.</li>
<li>The trace data is visible and can be analyzed in Kibana APM UI as well.</li>
</ol>
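<p>Step 4 above is essentially a poll-until-ready loop: the trace takes a moment to be ingested and indexed, so the test runner has to retry its Elasticsearch query until spans appear or a timeout expires. A hedged sketch of that pattern (the function names, timings, and stand-in data are illustrative, not Tracetest's code):</p>

```python
import time

def poll_for_trace(fetch_spans, timeout_s=30.0, interval_s=0.5):
    """Call fetch_spans() repeatedly until it returns a non-empty span list
    or the timeout expires. Returns the spans (empty list on timeout)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        spans = fetch_spans()
        if spans:
            return spans
        time.sleep(interval_s)
    return []

# Stand-in for the real Elasticsearch query by trace.id: the first two
# polls find nothing (not yet indexed), the third returns the spans.
attempts = iter([[], [], [{"name": "GET /", "trace.id": "abc123"}]])
spans = poll_for_trace(lambda: next(attempts), timeout_s=5.0, interval_s=0.01)
print(spans[0]["name"])  # GET /
```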
<h2>Running your first Tracetest environment with Elastic APM and Docker compose</h2>
<p>In your existing observability setup, you have the <a href="https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/">OpenTelemetry Nodejs agent</a> configured in your code and <a href="https://www.elastic.co/blog/opentelemetry-observability">sending OpenTelemetry traces to the Elastic APM server that then stores</a> them in Elasticsearch. Adding Tracetest to the infrastructure lets you write detailed trace-based tests based on the existing tracing infrastructure. Tracetest runs tests against endpoints and uses trace data to run assertions.</p>
<p>The example that we are going to run is from the Tracetest GitHub repository. It contains a docker-compose setup, which is a convenient way to run multiple services together in a defined environment. The example includes a sample application that has been instrumented with an OpenTelemetry agent. The example also includes the Tracetest server with its Postgres database, which is responsible for invoking the test, polling Elasticsearch to retrieve the captured trace data, comparing the received trace data with the expected trace data, and running the assertions. Finally, the example includes Elasticsearch, Kibana, and the Elastic APM server from the Elastic Stack.</p>
<p>To quickly access the example, you can run the following:</p>
<pre><code class="language-bash">git clone https://github.com/kubeshop/tracetest.git
cd tracetest/examples/tracetest-elasticapm-with-otel
docker-compose up -d
</code></pre>
<p>Once you have Tracetest set up, open <a href="http://localhost:11633">http://localhost:11633</a> in your browser to check out the Web UI.</p>
<p>Navigate to the Settings menu and ensure the connection to Elasticsearch is working by pressing Test Connection:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-configure-data-store.png" alt="" /></p>
<p>To create a test, click the Create dropdown and choose Create New Test. Select the HTTP Request and give it a name and description.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-create-new-test.png" alt="" /></p>
<p>For this simple example, GET the Node.js app, which runs at <a href="http://app:8080">http://app:8080</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-trace-request-details.png" alt="" /></p>
<p>With the test created, you can click the Trace tab to see the distributed trace. It’s simple, but you can start to see how it delivers immediate visibility into every transaction your HTTP request generates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-trigger.png" alt="" /></p>
<p>From here, you can continue by adding assertions.</p>
<p>To make an assertion based on the GET / span of our trace, select that span in the graph view and click <strong>Current span</strong> in the Test Spec modal. Or, copy this span selector directly, using the <a href="https://docs.tracetest.io/concepts/selectors/">Tracetest Selector Language</a>:</p>
<pre><code class="language-javascript">span[tracetest.span.type=&quot;http&quot; name=&quot;GET /&quot; http.target=&quot;/&quot; http.method=&quot;GET&quot;]
</code></pre>
<p>Below, add the attr:http.status_code attribute and the expected value, which is 200. You can add more complex assertions as well, like testing whether the span executes in less than 500ms. Add a new assertion for attr:tracetest.span.duration, choose &lt;, and add 500ms as the expected value.</p>
<p>You can check against other properties, return statuses, timing, and much more, but we’ll keep it simple for now.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-tracetest-edit-test-spec.png" alt="" /></p>
<p>Then click <strong>Save Test Spec</strong>, followed by <strong>Publish</strong>, and you’ve created your first assertion. If you open the APM app in Kibana at <a href="https://localhost:5601">https://localhost:5601</a> (find the username and password in the examples/tracetest-elasticapm-with-otel/.env file), you will be able to navigate to the transaction generated by the test, representing the overall application call with three underlying spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/blog-elastic-latency-distribution.png" alt="" /></p>
<h2>Summary</h2>
<p>Elastic APM and Tracetest are tools that can help make testing distributed applications easier by providing a more comprehensive view of the system's behavior and allowing developers to identify and diagnose performance issues more efficiently. Tracetest allows you to test the entire process from start to finish, making sure that everything is working as it should, by following the path that a request takes.</p>
<p>Elastic APM provides detailed information about the performance of a system, including how requests are flowing through the system, how long requests take to be processed, and which services are being called. Together, these tools can help developers to identify and fix issues more quickly, improve collaboration and communication among the team, and ultimately improve the overall quality of the system.</p>
<blockquote>
<ul>
<li>Elastic APM documentation: <a href="https://www.elastic.co/guide/en/apm/guide/current/index.html">https://www.elastic.co/guide/en/apm/guide/current/index.html</a></li>
<li>Tracetest documentation: <a href="https://tracetest.io/docs/">https://tracetest.io/docs/</a> </li>
<li>Tracetest Github page: <a href="https://github.com/kubeshop/tracetest">https://github.com/kubeshop/tracetest</a> </li>
<li>Elastic blog: <a href="https://www.elastic.co/blog/category/technical-topics">https://www.elastic.co/blog/category/technical-topics</a> </li>
<li>Elastic APM community forum: <a href="https://discuss.elastic.co/c/apm">https://discuss.elastic.co/c/apm</a> </li>
<li>Tracetest support: <a href="https://discord.com/channels/884464549347074049/963470167327772703">Discord channel</a></li>
</ul>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/trace-based-testing-apm-tracetest/telescope-search-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Revealing unknowns in your tracing data with inferred spans in OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry</link>
            <guid isPermaLink="false">tracing-data-inferred-spans-opentelemetry</guid>
            <pubDate>Mon, 22 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Distributed tracing is essential in understanding complex systems, but it can miss latency issue details. By combining profiling techniques with distributed tracing, Elastic provides the inferred spans feature as an extension for the OTel Java SDK.]]></description>
            <content:encoded><![CDATA[<p>In the complex world of microservices and distributed systems, achieving transparency and understanding the intricacies and inefficiencies of service interactions and request flows has become a paramount challenge. Distributed tracing is essential in understanding distributed systems. But distributed tracing, whether manually applied or auto-instrumented, is usually rather coarse-grained. Hence, distributed tracing covers only a limited fraction of the system and can easily miss parts of the system that are the most useful to trace.</p>
<p>Addressing this gap, Elastic developed the concept of inferred spans as a powerful enhancement to traditional instrumentation-based tracing as an extension for the OpenTelemetry Java SDK/Agent. We are in the process of contributing this back to OpenTelemetry; until then, our <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">extension</a> can be seamlessly used with the existing OpenTelemetry Java SDK (as described below).</p>
<p>Inferred spans are designed to augment the visibility provided by instrumentation-based traces, shedding light on latency sources within the application or libraries that were previously uninstrumented. This feature significantly expands the utility of distributed tracing, allowing for a more comprehensive understanding of system behavior and facilitating a deeper dive into performance optimization.</p>
<h2>What is inferred spans?</h2>
<p>Inferred spans is an observability technique that combines distributed tracing with profiling techniques to illuminate the darker, unobserved corners of your application — areas where standard instrumentation techniques fall short. The inferred spans feature interweaves information derived from profiling stacktraces with instrumentation-based tracing data, allowing for the generation of new spans based on the insights drawn from profiling data.</p>
<p>This feature proves invaluable when dealing with custom code or third-party libraries that significantly contribute to the request latency but lack built-in or external instrumentation support. Often, identifying or crafting specific instrumentation for these segments can range from challenging to outright unfeasible. Moreover, certain scenarios exist where implementing instrumentation is impractical due to the potential for substantial performance overhead. For instance, instrumenting application locking mechanisms, despite their critical role, is not viable because of their ubiquitous nature and the significant latency overhead the instrumentation can introduce to application requests. Still, ideally, such latency issues would be visible within your distributed traces.</p>
<p>Inferred spans ensures a deeper visibility into your application’s performance dynamics including the above-mentioned scenarios.</p>
<h2>Inferred spans in action</h2>
<p>To demonstrate the inferred spans feature, we will use the Java implementation of the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix/java-favorite">Elastiflix demo application</a>. Elastiflix has an endpoint called favorites that makes some Redis calls and also includes an artificial delay. First, we use the plain OpenTelemetry Java Agent to instrument our application:</p>
<pre><code class="language-bash">java -javaagent:/path/to/otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://&lt;our-elastic-apm-endpoint&gt; \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-jar my-service-name.jar
</code></pre>
<p>With the OpenTelemetry Java Agent we get out-of-the-box instrumentation for HTTP entry points and calls to Redis for our Elastiflix application. The resulting traces contain spans for the POST /favorites entrypoint, as well as a few short spans for the calls to Redis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image2.png" alt="POST /favorites entrypoint" /></p>
<p>As you can see in the trace above, it’s not clear where most of the time is spent within the POST /favorites request.</p>
<p>Let’s see how inferred spans can shed light into these areas. You can use the inferred spans feature either manually with your OpenTelemetry SDK (see section below), package it as a drop-in extension for the upstream OpenTelemetry Java agent, or just use <a href="https://github.com/elastic/elastic-otel-java/tree/main">Elastic’s distribution of the OpenTelemetry Java agent</a> that comes with the inferred spans feature.</p>
<p>For convenience, we just download the <a href="https://mvnrepository.com/artifact/co.elastic.otel/elastic-otel-javaagent/0.0.1">agent jar</a> of the Elastic distribution and extend the configuration to enable the inferred spans feature:</p>
<pre><code class="language-bash">java -javaagent:/path/to/elastic-otel-javaagent-&lt;version&gt;.jar \
-Dotel.service.name=my-service-name \
-Dotel.exporter.otlp.endpoint=https://XX.apm.europe-west3.gcp.cloud.es.io:443 \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer SECRETTOKENHERE&quot; \
-Delastic.otel.inferred.spans.enabled=true \
-jar my-service-name.jar
</code></pre>
<p>The only non-standard option here is elastic.otel.inferred.spans.enabled: the inferred spans feature is currently opt-in and therefore needs to be enabled explicitly. Running the same application with the inferred spans feature enabled yields more comprehensive traces:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/image1.png" alt="more comprehensive traces" /></p>
<p>The inferred-spans (colored blue in the above screenshot) follow the naming pattern Class#method. With that, the inferred spans feature helps us pinpoint the exact methods that contribute the most to the overall latency of the request. Note that the parent-child relationship between the HTTP entry span, the Redis spans, and the inferred spans is reconstructed correctly, resulting in a fully functional trace structure.</p>
<p>Examining the handleDelay method within the Elastiflix application reveals the use of a straightforward sleep statement. Although the sleep method is not CPU-bound, the full duration of this delay is captured as inferred spans. This stems from employing the async-profiler's wall clock time profiling, as opposed to solely relying on CPU profiling. The ability of the inferred spans feature to reflect actual latency, including for I/O operations and other non-CPU-bound tasks, represents a significant advancement. It allows for diagnosing and resolving performance issues that extend beyond CPU limitations, offering a more nuanced view of system behavior.</p>
<h2>Using inferred spans with your own OpenTelemetry SDK</h2>
<p>OpenTelemetry is a highly extensible framework: Elastic embraces this extensibility by also publishing most extensions shipped with our OpenTelemetry Java Distro as standalone-extensions to the <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a>.</p>
<p>As a result, if you do not want to use our distro (e.g., because you don’t need or want bytecode instrumentation in your project), you can still use our extensions, such as the extension for the inferred spans feature. All you need to do is set up the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#initialize-the-sdk">OpenTelemetry SDK in your code</a> and add the inferred spans extension as a dependency:</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.otel&lt;/groupId&gt;
    &lt;artifactId&gt;inferred-spans&lt;/artifactId&gt;
    &lt;version&gt;{latest version}&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>During your SDK setup, you’ll have to initialize and register the extension:</p>
<pre><code class="language-java">InferredSpansProcessor inferredSpans = InferredSpansProcessor.builder()
  .samplingInterval(Duration.ofMillis(10)) //the builder offers all config options
  .build();
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
  .addSpanProcessor(inferredSpans)
  .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
    .setEndpoint(&quot;https://&lt;your-elastic-apm-endpoint&gt;&quot;)
    .addHeader(&quot;Authorization&quot;, &quot;Bearer &lt;secrettoken&gt;&quot;)
    .build()).build())
  .build();
inferredSpans.setTracerProvider(tracerProvider);
</code></pre>
<p>The inferred spans extension seamlessly integrates with the <a href="https://opentelemetry.io/docs/languages/java/instrumentation/#automatic-configuration">OpenTelemetry SDK Autoconfiguration mechanism</a>. By incorporating the OpenTelemetry SDK and its extensions as dependencies within your application code — rather than through an external agent — you gain the flexibility to configure them using the same environment variables or JVM properties. Once the inferred spans extension is included in your classpath, activating it for autoconfigured SDKs becomes straightforward. Simply enable it using the elastic.otel.inferred.spans.enabled property, as previously described, to leverage the full capabilities of this feature with minimal setup.</p>
<h2>How does inferred spans work?</h2>
<p>The inferred spans feature leverages the wall-clock profiling capabilities of the widely used <a href="https://github.com/async-profiler/async-profiler">async-profiler</a>, a low-overhead, popular production-time profiler in the Java ecosystem. It then transforms the profiling data into actionable spans as part of the distributed traces. But what mechanism allows for this transformation?</p>
<p>Essentially, the inferred spans extension engages with the lifecycle of span events, specifically when a span is either activated or deactivated across any thread via the <a href="https://opentelemetry.io/docs/specs/otel/context/">OpenTelemetry context</a>. Upon the activation of the initial span within a transaction, the extension commences a session of wall-clock profiling via the async-profiler, set to a predetermined duration. Concurrently, it logs the details of all span activations and deactivations, capturing their respective timestamps and the threads on which they occurred.</p>
<p>Following the completion of the profiling session, the extension processes the profiling data alongside the log of span events. By correlating the data, it reconstructs the inferred spans. It's important to note that, in certain complex scenarios, the correlation may assign an incorrect name to a span. To mitigate this and aid in accurate identification, the extension enriches the inferred spans with stacktrace segments under the code.stacktrace attribute, offering users clarity and insight into the precise methods implicated.</p>
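<p>The correlation step can be pictured with a toy model: given wall-clock stack samples (timestamp plus top frame) taken on one thread, consecutive samples that share the same frame inside a span's active window become one inferred child span, named Class#method after that frame. The sketch below is a deliberate simplification with made-up names; the real extension handles full stacktraces, nesting, multiple threads, and the attribution caveats described above.</p>

```python
def infer_spans(samples, span_window):
    """samples: list of (timestamp_ms, frame) taken on one thread.
    span_window: (start_ms, end_ms) during which a parent span was active.
    Groups consecutive samples sharing the same frame inside the window
    into inferred spans named after that frame."""
    start, end = span_window
    inferred = []
    current = None  # (frame, first_ts, last_ts)
    for ts, frame in samples:
        if not (start <= ts <= end):
            continue  # sample falls outside the parent span's lifetime
        if current and current[0] == frame:
            current = (frame, current[1], ts)  # extend the current run
        else:
            if current:
                inferred.append({"name": current[0], "start": current[1], "end": current[2]})
            current = (frame, ts, ts)
    if current:
        inferred.append({"name": current[0], "start": current[1], "end": current[2]})
    return inferred

# 10ms sampling while a parent span is active from t=0 to t=100:
samples = [(t, "FavoritesService#handleDelay") for t in range(0, 60, 10)] + \
          [(t, "RedisClient#get") for t in range(60, 90, 10)]
print(infer_spans(samples, (0, 100)))
```

<p>Note how the sleep inside a method like handleDelay shows up at its full duration, because wall-clock samples keep arriving while the thread is blocked.</p>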
<h2>Inferred spans vs. correlation of traces with profiling data</h2>
<p>In the wake of OpenTelemetry's recent <a href="https://opentelemetry.io/blog/2024/profiling/">announcement of the profiling signal</a>, coupled with <a href="https://www.elastic.co/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">Elastic's commitment to donating the Universal Profiling Agent</a> to OpenTelemetry, you might be wondering about how the inferred spans feature differentiates from merely correlating profiling data with distributed traces using span IDs and trace IDs. Rather than viewing these as competing functionalities, it's more accurate to consider them complementary.</p>
<p>The inferred spans feature and the correlation of tracing with profiling data both employ similar methodologies — melding tracing information with profiling data. However, they each shine in distinct areas. Inferred spans excels at identifying long-running methods that could escape notice with traditional CPU profiling, which is more adept at pinpointing CPU bottlenecks. A unique advantage of inferred spans is its ability to account for I/O time, capturing delays caused by operations like disk access that wouldn't typically be visible in CPU profiling flamegraphs.</p>
<p>However, the inferred spans feature has its limitations, notably in detecting latency issues arising from &quot;death by a thousand cuts&quot; — where a method, although not time-consuming per invocation, significantly impacts total latency due to being called numerous times across a request. While individual calls might not be captured as inferred spans due to their brevity, CPU-bound methods contributing to latency are unveiled through CPU profiling, as flamegraphs display the aggregate CPU time consumed by these methods.</p>
<p>An additional strength of the inferred spans feature lies in its data structure, offering a simplified tracing model that outlines typical parent-child relationships, execution order, and good latency estimates. This structure is achieved by integrating tracing data with span activation/deactivation events and profiling data, facilitating straightforward navigation and troubleshooting of latency issues within individual traces.</p>
<p>Correlating distributed tracing data with profiling data comes with a different set of advantages. Learn more about it in our related blog post, <a href="https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation">Beyond the trace: Pinpointing performance culprits with continuous profiling and distributed tracing correlation</a>.</p>
<h2>What about the performance overhead?</h2>
<p>As mentioned before, the inferred spans functionality is based on the widely used async-profiler, known for its minimal impact on performance. However, the efficiency of profiling operations is not without its caveats, largely influenced by the specific configurations employed. A pivotal factor in this balancing act is the sampling interval — the longer the interval between samples, the lower the incurred overhead, albeit at the expense of potentially overlooking shorter methods that could be critical to the inferred spans feature discovery process.</p>
<p>Adjusting the probability-based trace sampling presents another way for optimization, directly influencing the overhead. For instance, setting trace sampling to 50% effectively halves the profiling load, making the inferred spans feature even more resource-efficient on average per request. This nuanced approach to tuning ensures that the inferred spans feature can be leveraged in real-world, production environments with a manageable performance footprint. When properly configured, this feature offers a potent, low-overhead solution for enhancing observability and diagnostic capabilities within production applications.</p>
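<p>As a back-of-the-envelope illustration of that relationship (all numbers here are invented, and this is a simplified cost model, not a measured benchmark): the expected number of profiler stack samples per second scales with the trace sample rate and inversely with the sampling interval.</p>

```python
def samples_per_second(requests_per_s, avg_request_ms, trace_sample_rate, sampling_interval_ms):
    """Rough expected profiler samples/second: only sampled traces are
    profiled, and each profiled request yields duration/interval samples."""
    profiled_requests = requests_per_s * trace_sample_rate
    return profiled_requests * (avg_request_ms / sampling_interval_ms)

full = samples_per_second(100, 50, 1.0, 10)  # profile every trace
half = samples_per_second(100, 50, 0.5, 10)  # 50% trace sampling
print(full, half)  # halving the sample rate halves the profiling load
```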
<h2>What’s next for inferred spans and OpenTelemetry?</h2>
<p>This blog post outlined and introduced the inferred spans feature available as an extension for the OpenTelemetry Java SDK and built into the newly introduced Elastic OpenTelemetry Java Distro. Inferred spans allows users to troubleshoot latency issues in areas of code that are not explicitly instrumented while utilizing traditional tracing data.</p>
<p>The feature is currently merely a port of the existing feature from the proprietary Elastic APM Agent. With Elastic embracing OpenTelemetry, we plan on contributing this extension to the upstream OpenTelemetry project. For that, we also plan on migrating the extension to the latest async-profiler 3.x release. <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">Try out inferred spans for yourself</a> and see how it can help you diagnose performance problems in your applications.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/tracing-data-inferred-spans-opentelemetry/148360-Blog-header-image--Revealing-Unknowns-in-your-Tracing-Data-with-Inferred-Spans-in-OpenTelemetry_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Transforming Industries and the Critical Role of LLM Observability: How to use Elastic's LLM integrations in real-world scenarios]]></title>
            <link>https://www.elastic.co/observability-labs/blog/transforming-industries-and-the-critical-role-of-llm-observability</link>
            <guid isPermaLink="false">transforming-industries-and-the-critical-role-of-llm-observability</guid>
            <pubDate>Thu, 08 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog explores four industry specific use cases that use Large Language Models (LLMs) and highlights how Elastic's LLM observability integrations provide insights into the cost, performance, reliability and the prompts and response exchange with the LLM.]]></description>
            <content:encoded><![CDATA[<p>In today's tech-centric world, Large Language Models (LLMs) are transforming sectors from finance and healthcare to research. LLMs are starting to underpin products and services across the spectrum. Take for example recent <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#advanced-coding">advanced coding</a> developments in Google's Gemini 2.5 which enable it to use its reasoning capabilities to create a video game by producing the executable code from a short prompt.  Or <a href="https://www.aboutamazon.com/news/devices/new-alexa-generative-artificial-intelligence">new ways</a> to interact with Amazon's Alexa - for example, you could send a picture of a live music schedule, and have Alexa add the details to your calendar. And let's not forget Microsoft's <a href="https://blogs.microsoft.com/blog/2025/04/04/your-ai-companion/">personalization of Copilot</a> which remembers what you talk about, so it learns your likes and dislikes and details about your life; the name of your dog, that tricky project at work, what keeps you motivated to stick to your new workout routine.</p>
<p>Despite the widespread utility of LLMs, deploying these sophisticated tools in real-world scenarios poses distinct challenges, especially in managing their complex behaviors. For users such as Site Reliability Engineers (SREs), DevOps teams, and AI/ML engineers, ensuring the reliability, performance, and compliance of these models introduces an additional layer of complexity. This is where the concept of LLM Observability becomes essential. It offers crucial insights into the performance of these models, ensuring that these advanced AI systems operate both effectively and ethically.</p>
<h3>Why LLM Observability Matters and How Elastic Makes It Easy</h3>
<p>LLMs are not just another piece of software; they are sophisticated systems capable of human-like capabilities such as text generation, comprehension, and even coding. But with great power comes a greater need for oversight. The opaque nature of these models can obscure how decisions are made and content is generated. This makes it even more critical to implement robust observability to monitor and troubleshoot issues such as hallucinations, inappropriate content, cost overruns, errors, and performance degradation. By monitoring these models closely, we can safeguard against unexpected outcomes and maintain user trust.</p>
<h3>Real-World Scenarios</h3>
<p>Let's explore real-world scenarios where companies leverage LLM-powered applications to enhance productivity and user experience, and how Elastic's LLM observability solutions monitor critical aspects of these models.</p>
<h4>1. Generative AI for Customer Support</h4>
<p>Companies are increasingly leveraging LLMs and generative AI to enhance customer support, using platforms like Google Vertex AI to host these models efficiently. With the introduction of advanced AI models such as Google's Gemini, which is integrated into Vertex AI, businesses can deploy sophisticated chatbots that manage customer inquiries, from basic questions to complex issues, in real time. These AI systems understand and respond with natural language, offering instant support for issues such as product troubleshooting or managing orders, thus reducing wait times. They also learn from each interaction to continuously improve accuracy. This boosts customer satisfaction and allows human agents to focus on complex tasks, enhancing overall efficiency. AI tools can further empower customer care agents with real-time analytics, sentiment detection, and conversation summarization.</p>
<p>To support use cases like the AI-powered customer support described above, Elastic recently launched LLM observability integrations including support for <a href="https://www.elastic.co/guide/en/integrations/current/gcp_vertexai.html">LLMs hosted on GCP Vertex AI</a>. Customers who wish to monitor foundation models such as Gemini and Imagen hosted on Google Vertex AI can benefit from Elastic’s Vertex AI integration to get a deeper understanding of model behavior and performance, and ensure that their AI-driven tools are not only effective but also reliable. Customers get an out-of-the-box experience ingesting a curated set of metrics from Vertex AI, as well as a pre-configured dashboard.</p>
<p>By continuously tracking these metrics, customers can proactively manage their AI resources, optimize operations, and ultimately enhance the overall customer experience.</p>
<p>Let's look at some of the metrics you get from the Google Vertex AI integration which are helpful in the context of using generative AI for customer support.</p>
<ol>
<li><strong>Prediction Latency</strong>: Measures the time taken to complete predictions, critical for real-time customer interactions.</li>
<li><strong>Error Rate</strong>: Tracks errors in predictions, which is vital for maintaining the accuracy and reliability of AI-driven customer support.</li>
<li><strong>Prediction Count</strong>: Counts the number of predictions made, helping assess the scale of AI usage in customer interactions.</li>
<li><strong>Model Usage</strong>: Tracks how frequently the AI models are accessed by both virtual assistants and customer support tools.</li>
<li><strong>Total Invocations</strong>: Measures the total number of times the AI services are used, providing insights into user engagement and dependency on these tools.</li>
<li><strong>CPU and Memory Utilization</strong>: By observing CPU and memory usage, users can optimize resource allocation, ensuring that the AI tools are running efficiently without overloading the system.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/vertex-overview.png" alt="Vertex Overview" /></p>
<p>To learn more about how Elastic's Google Vertex AI integration can augment your LLM observability, have a quick read of this <a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">blog</a>.</p>
<h4>2. Transforming Healthcare with Generative AI</h4>
<p>The healthcare industry is embracing generative AI to enhance patient interactions and streamline operational workflows. By leveraging platforms like Amazon Bedrock, healthcare organizations deploy advanced large language models (LLMs) to power tools that convert doctor-patient conversations into structured medical notes, reducing administrative overhead and allowing clinicians to prioritize diagnosis and treatment. These AI-driven solutions provide real-time insights, enabling informed decision-making and improving patient outcomes. Additionally, patient-facing applications powered by LLMs offer secure access to health records, empowering individuals to manage their care proactively.</p>
<p>Robust observability is essential to maintain the reliability and performance of these generative AI applications in healthcare. Elastic’s <a href="https://www.elastic.co/guide/en/integrations/current/aws_bedrock.html">Amazon Bedrock integration</a> equips providers with tools to monitor LLM behavior, capturing critical metrics like invocation latency, error rates, token usage and guardrail invocation. Pre-configured dashboards provide visibility into prompt and completion text, enabling teams to verify the accuracy of AI-generated outputs, such as medical notes, and detect issues like hallucinations.</p>
<p>Additionally, customers who configure Guardrails for Amazon Bedrock to filter harmful content like hate speech, personal insults, and other inappropriate topics, can use the Bedrock Integration to observe the prompts and responses that caused the guardrail to filter them out. This helps application developers take proactive actions to maintain a safe and positive user experience.</p>
<p>Some of the logs and metrics that can be helpful for customers using LLMs hosted on Amazon Bedrock include the following:</p>
<ol>
<li><strong>Invocation Details</strong>: This integration records invocation latency, counts, and throttles. These metrics are critical for ensuring that generative AI models respond quickly and accurately to patient queries or appointment scheduling tasks, maintaining a seamless user experience.</li>
<li><strong>Error Rates</strong>: Tracking error rates ensures that AI tools, such as patient query assistants or appointment systems, consistently deliver accurate and reliable results. By identifying and addressing issues early, healthcare providers can maintain trust in AI systems and prevent disruptions in critical patient interactions.</li>
<li><strong>Token Usage</strong>: In healthcare, tracking token usage helps identify resource-intensive queries, such as detailed patient record summaries or complex symptom analyses, ensuring efficient model operation. By monitoring token usage, healthcare providers can optimize costs for AI-powered tools while maintaining scalability to handle growing patient interactions.</li>
<li><strong>Prompt and Completion Text</strong>: Capturing prompt and completion text allows healthcare providers to analyze how AI models respond to specific patient queries or administrative tasks, ensuring meaningful and contextually accurate interactions. This insight helps refine prompts to improve the AI's understanding and ensures that generated responses, such as appointment details or treatment explanations, meet the quality standards expected in healthcare.</li>
<li><strong>Prompt and response where guardrails intervened</strong>: Being able to track requests and responses that were deemed inappropriate by guardrails helps healthcare providers monitor what information patients are asking for. With this information users can make continuous adjustments to the LLMs to ensure appropriate responses, balancing flexibility and rich communication on the one hand, and on the other, privacy protection, hallucination prevention, and harmful content filtering.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/aws-bedrock-overview.png" alt="Bedrock Overview" /></p>
<p>The Amazon Bedrock Guardrails OOTB dashboard:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/amazon-bedrock-gaurdrails.png" alt="Bedrock Guardrails Overview" /></p>
<p>To learn about the Amazon Bedrock Integration, read this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">blog</a>. To dive deeper into how the integration can help with observability of Guardrails for Amazon Bedrock, take a look at this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-amazon-bedrock-guardrails">blog</a>.</p>
<h4>3.  Enhancing Telco Efficiency with GenAI</h4>
<p>The telecommunication industry can leverage services like Azure OpenAI to transform customer interactions, optimize operations, and enhance service delivery. By integrating advanced generative AI models, telcos can offer highly personalized and responsive customer experiences across multiple channels. AI-powered virtual assistants streamline customer support by automating routine queries and providing accurate, context-aware responses, reducing the workload on human agents and enabling them to focus on complex issues while improving efficiency and satisfaction. Additionally, AI-driven insights help telcos understand customer preferences, anticipate needs, and deliver tailored offerings that boost customer loyalty. Operationally, LLMs such as Azure OpenAI enhance internal processes by enabling smarter knowledge management and faster access to critical information.</p>
<p>Elastic's LLM observability integrations like the <a href="https://www.elastic.co/guide/en/integrations/current/azure_openai.html">Azure OpenAI integration</a> can provide visibility into AI performance and costs, empowering telecom providers to make data-driven decisions and enhance customer engagement. It can help optimize resource allocation by analyzing call patterns, predicting service demands, and identifying trends, enabling telcos to scale their AI operations efficiently while maintaining high service quality.</p>
<p>Some of the key metrics and logs from Azure OpenAI that can provide insights are:</p>
<ol>
<li><strong>Error Counts</strong>: It provides critical insights into failed requests and incomplete transactions, enabling telecom providers to proactively identify and resolve issues in AI-powered applications.</li>
<li><strong>Prompt Input and Completion Text</strong>: This captures the input queries provided to AI systems and the corresponding AI-generated outputs. These fields allow telecom providers to analyze customer queries, monitor response quality, and refine AI training datasets to improve relevance and accuracy.</li>
<li><strong>Response Latency</strong>: It measures the time taken by AI models to generate responses, ensuring that virtual assistants and automated systems deliver quick and efficient replies to customer queries.</li>
<li><strong>Token Usage</strong>: It tracks the number of input and output tokens processed by the AI model, offering insights into resource consumption and cost efficiency. This data helps telecom providers monitor AI usage patterns, optimize configurations, and scale resources effectively.</li>
<li><strong>Content Filter Results</strong>: In Azure OpenAI, this plays a crucial role in handling sensitive inputs provided by customers, ensuring compliance, safety, and responsible AI usage. This feature identifies and flags potentially inappropriate or harmful queries and responses in real time, enabling telecom providers to address sensitive topics with care and accuracy.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/azure-openai-overview.png" alt="Azureopenai Overview" /></p>
<p>The Azure OpenAI content filtering OOTB dashboard:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/azure-openai-contentfiltering.png" alt="Azureopenai Overview1" /></p>
<p>You can learn more about Elastic's Azure OpenAI integration from these two blogs - <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">Part 1</a> and <a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">Part 2</a>.</p>
<h4>4. OpenAI Integration for Generative AI Applications</h4>
<p>As AI-powered solutions become integral to modern workflows, OpenAI's sophisticated models, including language models like GPT-4o and GPT-3.5 Turbo, image generation models like DALL·E, and audio processing models like Whisper, drive innovation across applications such as virtual assistants, content creation, and speech-to-text systems. With growing complexity and scale, ensuring these models perform reliably, remain cost-efficient, and adhere to ethical guidelines is paramount. Elastic's <a href="https://www.elastic.co/docs/reference/integrations/openai">OpenAI integration</a> provides a robust solution, offering deep visibility into model behaviour to support seamless and responsible AI deployments.</p>
<p>By tapping into the OpenAI Usage API, Elastic's integration delivers actionable insights through intuitive, pre-configured dashboards, enabling Site Reliability Engineers (SREs) and DevOps teams to monitor performance and optimize resource usage across OpenAI's diverse model portfolio. This unified observability approach empowers organizations to track critical metrics, identify inefficiencies, and maintain high-quality AI-driven experiences. The following key metrics from Elastic's OpenAI integration help organizations achieve effective oversight:</p>
<ol>
<li><strong>Request Latency</strong>: Measures the time taken for OpenAI models to process requests, ensuring responsive performance for real-time applications like chatbots or transcription services.</li>
<li><strong>Invocation Rates</strong>: Tracks the frequency of API calls across models, providing insights into usage patterns and helping identify high-demand workloads.</li>
<li><strong>Token Usage</strong>: Monitors input and output tokens (e.g., prompt, completion, cached tokens) to optimize costs and fine-tune prompts for efficient resource consumption.</li>
<li><strong>Error Counts</strong>: Captures failed requests or incomplete transactions, enabling proactive issue resolution to maintain application reliability.</li>
<li><strong>Image Generation Metrics</strong>: Tracks invocation rates and output dimensions for models like DALL·E, helping assess costs and usage trends in image-based applications.</li>
<li><strong>Audio Transcription Metrics</strong>: Monitors invocation rates and transcribed seconds for audio models like Whisper, supporting cost optimization in speech-to-text workflows.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/openai-overview.png" alt="Openai Overview" /></p>
<p>To learn more about Elastic's OpenAI integration, read this <a href="https://www.elastic.co/observability-labs/blog/llm-observability-openai">blog</a>.</p>
<h4>Actionable LLM Observability</h4>
<p>Elastic's LLM observability integrations empower users to take proactive control of their AI operations through actionable insights and real-time alerts. For instance, by setting a predefined threshold for token count, Elastic can trigger automated alerts when usage exceeds this limit, notifying Site Reliability Engineers (SREs) or DevOps teams via email, Slack, or other preferred channels. This ensures prompt awareness of potential cost overruns or resource-intensive queries, enabling teams to adjust model configurations or scale resources swiftly to maintain operational efficiency.</p>
<p>In the example below, the rule is set to alert the user if token_count crosses a threshold of 500.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/slo-1.png" alt="SLO Overview" /></p>
<p>The alert is triggered when the token count exceeds the threshold, as seen below:
<img src="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/slo-2.png" alt="SLO Overview1" /></p>
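<p>The threshold logic behind such a rule is simple. Here is a minimal, illustrative Python sketch of the check the rule performs; the 500-token threshold comes from the example above, while the function name and the sample token counts are hypothetical (in practice you would configure this in Elastic's alerting rules rather than hand-roll it):</p>

```python
# Minimal sketch of the threshold check an alert rule performs.
# The 500-token threshold mirrors the example rule above; the sample
# data and function are hypothetical.

def breached(token_counts, threshold=500):
    """Return the token counts that exceed the alert threshold."""
    return [count for count in token_counts if count > threshold]

# Hypothetical token counts from recent LLM invocations.
recent = [120, 480, 510, 950, 60]

over_limit = breached(recent)
if over_limit:
    # In Elastic, this is the point where the rule would notify the team
    # via email, Slack, or another configured channel.
    print(f"ALERT: {len(over_limit)} invocation(s) exceeded 500 tokens: {over_limit}")
```

In a real deployment the notification side is handled by the alert rule's connectors; the sketch only shows which invocations would cross the configured threshold.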
<p>Another example is tracking invocation spikes, such as when the number of predictions or API calls surpasses a defined Service Level Objective (SLO). For example, if a Bedrock AI-hosted model experiences a sudden surge in invocations due to increased customer interactions, Elastic can alert teams to investigate potential anomalies or scale infrastructure accordingly. These proactive measures help maintain the reliability and cost-effectiveness of LLM-powered applications.</p>
<p>By providing pre-configured dashboards and customizable alerts, Elastic ensures that organizations can respond to critical events in real time, keeping their AI systems aligned with cost and performance goals as well as standards for content safety and reliability.</p>
<h4>Conclusion</h4>
<p>LLMs are transforming industries, but their complexity requires effective observability to ensure their reliability and safe use. Elastic's LLM observability integrations provide a comprehensive solution, empowering businesses to monitor performance, manage resources, and address challenges like hallucinations and content safety. As LLMs become increasingly integral to various sectors, robust observability tools like those offered by Elastic ensure that these AI-driven innovations remain dependable, cost-effective, and aligned with ethical and safety standards.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/transforming-industries-and-the-critical-role-of-llm-observability/llmobs2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</link>
            <guid isPermaLink="false">troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</guid>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to immediately troubleshoot Kubernetes pod restarts and OOMKilled events with Elastic Agent Builder. We’ll show how to detect, analyze, and remediate failures.]]></description>
            <content:encoded><![CDATA[<h2>Initial Summary</h2>
<ul>
<li>Detect Kubernetes pod restarts and OOMKill events using Elastic Agent Builder</li>
<li>Analyze CPU and memory pressure using ES|QL over Kubernetes metrics</li>
<li>Generate troubleshooting summaries and remediation guidance</li>
</ul>
<p>This article explains how to use <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> to automatically detect, analyze, and remediate Kubernetes pod failures caused by resource pressure (CPU and memory), with a focus on pods experiencing frequent restarts and OOMKilled events. Elastic Agent Builder lets you quickly create precise agents that utilize all your data with powerful tools (such as ES|QL queries), chat interfaces, and custom agents.</p>
<h2>Introduction: What is the Elastic Agent Builder?</h2>
<p>Elastic has an AI Agent embedded that you can use to get more insights from all of the logs, metrics and traces that you’ve ingested. While that’s great, you can take it one step further and streamline the process by creating tools that the agent can use.</p>
<p>Giving the agent tools means it spends less time ‘thinking’ and quickly gets to assessing what’s important to you. For example, if I have a Kubernetes environment that needs monitoring, and I want to keep an eye on pod restarts and memory and CPU usage without hanging out at the terminal, I can have Elastic alert me if something goes wrong. </p>
<p>Having an alert is great, but how do I get the bigger picture, faster? You need to know what service is having (or creating) the issues, why, and how to fix it.</p>
<h2>Assumptions</h2>
<p>This guide assumes:</p>
<ul>
<li>A running Kubernetes cluster</li>
<li>An Elastic Observability deployment</li>
<li>Kubernetes metrics indexed in Elastic</li>
</ul>
<h2>Step 1: Create a New Elastic Agent</h2>
<p>In Elastic Observability, use the top search bar to search for Agents. Create a new agent.</p>
<p>This agent is going to be the Kubernetes Pod Troubleshooter agent, designed to help users troubleshoot pod restarts, OOMKill terminations and evaluate CPU or memory pressure. </p>
<p>The Kubernetes Pod Troubleshooter agent will:</p>
<ol>
<li>Identify pods that have restarted more than once</li>
<li>Filter for pods that are not in a running state</li>
<li>Retrieve the container termination reason (e.g., OOMKilled)</li>
<li>Analyze CPU and memory utilization for affected services</li>
<li>Flag resource utilization above 60% (warning) and 80% (critical)</li>
<li>Provide remediation recommendations</li>
</ol>
<p>The agent requires instructions to guide how the agent behaves when interacting with tools or responding to queries. This description can set tone, priorities or special behaviours. The instructions below tell the agent to execute the steps outlined above. </p>
<pre><code>You will help users troubleshoot problematic pods by searching the metrics for pods that have restarted more than once and the status is not running. Pods that have the highest number of restarts will be returned to the user.
Once the containers that are not running and have restarted multiple times are found, you will use their container ID or image name to look up the container status reason and reason for the last termination. You will return that reason to the user.
You will also begin basic troubleshooting steps, such as checking for insufficient cluster resources (CPU or memory) from the metrics and tools available.
Any CPU or memory utilization percentages over 60%, and definitely over 80%, should be flagged to the user with remediation steps.
</code></pre>
<p>Getting answers quickly is critical when troubleshooting high-value systems and environments. Using Tools ensures that the workflow is repeatable and that you can trust the results. You also get complete oversight of the process, as the Elastic Agent outlines every step and query that it took and you can explore the results in Discover.</p>
<p>You will create custom tools that the agent will run to complete the Kubernetes troubleshooting tasks that the custom instructions reference, such as: <code>look up the container status reason and reason for the last termination</code> and <code>checking for insufficient cluster resources (CPU or memory)</code>.</p>
<h2>Step 2: Create Tools - Pod Restarts</h2>
<p>The first tool takes the Kubernetes metrics and checks whether a pod has restarted and has a last-terminated reason; if so, the agent presents that information to the user.</p>
<p>This <code>pod-restarts</code> tool uses a custom ES|QL query that interrogates the Kubernetes metrics data coming from OTel.</p>
<p>The ES|QL query:</p>
<ol>
<li>Filters for containers that have restarted and have a reason for termination; then</li>
<li>Calculates the number of restarts; then</li>
<li>Returns the number of restarts and termination reason per service.</li>
</ol>
<pre><code>FROM metrics-k8sclusterreceiver.otel-default
| WHERE metrics.k8s.container.restarts &gt; 0
| WHERE resource.attributes.k8s.container.status.last_terminated_reason IS NOT NULL
| STATS total_restarts = SUM(metrics.k8s.container.restarts),
        reasons = VALUES(resource.attributes.k8s.container.status.last_terminated_reason) 
  BY resource.attributes.service.name
| SORT total_restarts DESC
</code></pre>
<h2>Step 3: Create Tools - Service Memory</h2>
<p>The custom tools can take input variables, which increases the speed and accuracy of the results.</p>
<p>A common reason for pods failing to schedule, or restarting often, is the cluster or nodes being under-resourced. The <code>pod-restarts</code> tool returns services that have many restarts and OOMKill termination reasons, which indicate memory pressure.</p>
<p>The <code>eval-pod-memory</code> tool uses a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Converts memory usage, requests, limits and utilization into megabytes; then</li>
<li>Calculates the average of each of those metrics; then</li>
<li>Groups them into 1 minute groupings and sorts them.</li>
</ol>
<pre><code>FROM metrics-*
| WHERE resource.attributes.service.name == ?servicename
| WHERE @timestamp &gt;= NOW() - 12 hours
| EVAL
   memory_usage_mb = metrics.container.memory.usage / 1024 / 1024,
   memory_request_mb = metrics.k8s.container.memory_request / 1024 / 1024,
   memory_limit_mb = metrics.k8s.container.memory_limit / 1024 / 1024,
   memory_utilization_pct = metrics.k8s.container.memory_limit_utilization * 100
| STATS
   avg_memory_usage = AVG(memory_usage_mb),
   avg_memory_request = AVG(memory_request_mb),
   avg_memory_limit = AVG(memory_limit_mb),
   avg_memory_utilization = AVG(memory_utilization_pct)
   BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
</code></pre>
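<p>For intuition, the bucketed-averages step (<code>STATS ... BY BUCKET(@timestamp, 1 minute)</code>) can be sketched in plain Python. This is only an illustration of the grouping logic; the function name and sample readings are hypothetical:</p>

```python
# Illustrative re-implementation of 1-minute bucketed averages, mirroring
# the STATS ... BY BUCKET(@timestamp, 1 minute) step above. The sample
# readings are hypothetical.
from collections import defaultdict
from datetime import datetime

def minute_buckets(samples):
    """Average memory usage (MB) per 1-minute bucket, sorted by bucket."""
    groups = defaultdict(list)
    for ts, usage_bytes in samples:
        bucket = ts.replace(second=0, microsecond=0)      # truncate to the minute
        groups[bucket].append(usage_bytes / 1024 / 1024)  # bytes -> MB
    return {b: sum(v) / len(v) for b, v in sorted(groups.items())}

samples = [
    (datetime(2026, 2, 25, 10, 0, 5), 52_428_800),   # 50 MB
    (datetime(2026, 2, 25, 10, 0, 35), 62_914_560),  # 60 MB
    (datetime(2026, 2, 25, 10, 1, 10), 57_671_680),  # 55 MB
]
for bucket, avg_mb in minute_buckets(samples).items():
    print(bucket.isoformat(), f"{avg_mb:.1f} MB")  # two buckets, 55.0 MB each
```

The ES|QL engine does this grouping server-side over the metrics index; the sketch only shows what the bucketing and averaging produce.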
<h2>Step 4: Create Tools - Service CPU</h2>
<p>As CPU usage is another common reason for pods to fail scheduling or be stuck in endless restart loops, the next tool will evaluate CPU usage, requests and limits.</p>
<p>The <code>eval-pod-cpu</code> tool uses a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Calculates the average for CPU usage, CPU request utilization and CPU limit utilization.</li>
</ol>
<pre><code>FROM metrics-kubeletstatsreceiver.otel-default
| WHERE k8s.container.name == ?servicename OR resource.attributes.k8s.container.name == ?servicename
| STATS
  avg_cpu_usage = AVG(container.cpu.usage),
  avg_cpu_request_utilization = AVG(k8s.container.cpu_request_utilization) * 100,
  avg_cpu_limit_utilization = AVG(k8s.container.cpu_limit_utilization) * 100
| LIMIT 100
</code></pre>
<h2>Step 5: Assign Tools to Kubernetes Pod Troubleshooter Agent</h2>
<p>Once all of the tools are built you need to assign them to the agent.</p>
<p>This image shows the Kubernetes Pod Troubleshooter agent with the three tools: <code>pod-restarts</code>, <code>eval-pod-cpu</code> and <code>eval-pod-memory</code> assigned to it and active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/kubernetes-pod-troubleshooter.png" alt="kubernetes-pod-troubleshooter" /></p>
<h2>Step 6: Test the Kubernetes Pod Troubleshooter Agent</h2>
<p>To simulate memory pressure, the OpenTelemetry demo is running inside the cluster. Artificially lowering the memory requests and limits and increasing the service load will cause pods to restart.</p>
<p>To do this with the OpenTelemetry demo in your cluster, follow these steps.</p>
<p>Reduce the cart service to one replica by scaling the deployment. Once that is complete, change the resources on the deployment by lowering the memory requests and limits as shown in this command:</p>
<pre><code>kubectl -n otel-demo scale deploy/cart --replicas=1
kubectl -n otel-demo set resources deploy/cart -c cart --requests=memory=50Mi --limits=memory=60Mi
</code></pre>
<p>The OpenTelemetry demo application comes with a load generator. Increase the simulated load on the demo site by modifying the users and spawn rate in the load-generator deployment, as shown in this command:</p>
<pre><code>kubectl -n otel-demo set env deploy/load-generator LOCUST_USERS=800 LOCUST_SPAWN_RATE=200 LOCUST_BROWSER_TRAFFIC_ENABLED=false
</code></pre>
<p>If you list all of your pods in the cluster or namespace, you should begin to see restarts.</p>
<p>You can now chat with the Kubernetes Pod Troubleshooter agent and ask “Are any of my Kubernetes pods having issues?”.</p>
<p>The screenshot shows the final response from the Kubernetes Pod Troubleshooter agent. It provides a problem summary of its findings from each tool, showing which services were experiencing the most restarts and memory and CPU utilization. </p>
<p>The threshold interpretations were described in the initial agent instructions, where &gt;60% utilization is a warning (sustained pressure) and &gt;80% utilization is critical (high likelihood of restarts or throttling). This aligns with findings presented by the Kubernetes Pod Troubleshooter agent, where the services that had the highest restarts were all above 90% memory utilization. The agent needs clearly defined threshold values to correctly assess the returned memory and CPU utilization values. </p>
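<p>As a minimal illustration, the threshold interpretation given to the agent can be expressed as a simple classification; the 60% and 80% cutoffs come from the agent instructions above, while the function name and the per-service sample values are hypothetical:</p>

```python
# Illustrative classification of utilization values using the thresholds
# from the agent instructions above (>60% warning, >80% critical).
# The function name and sample data are hypothetical.

def classify_utilization(pct):
    """Map a utilization percentage to the severity the agent reports."""
    if pct > 80:
        return "critical"   # high likelihood of restarts or throttling
    if pct > 60:
        return "warning"    # sustained pressure
    return "ok"

# Hypothetical per-service memory utilization percentages.
services = {"cart": 93.0, "checkout": 72.5, "frontend": 41.0}
for name, pct in services.items():
    print(f"{name}: {pct:.1f}% -> {classify_utilization(pct)}")
```

The agent applies this interpretation to the utilization values returned by the <code>eval-pod-memory</code> and <code>eval-pod-cpu</code> tools, which is why stating the thresholds explicitly in the instructions matters.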
<p>Problem summary returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/problem-summary-by-Kubernetes.png" alt="problem summary by Kubernetes" /></p>
<h2>Conclusion and Final Thoughts</h2>
<p>Elastic Agent Builder enables fast, repeatable Kubernetes troubleshooting by combining ES|QL-driven analysis with constrained AI reasoning.</p>
<p>Creating custom tools with specific ES|QL queries, chained so that downstream queries take input variables from the output of previous tools, eliminates or reduces error propagation and hallucinations. With generic AI troubleshooting that lacks purpose-built tools, you run the risk of the agent analyzing too many services that aren’t relevant to the issue at hand. This slows down the thinking process and generates longer responses, increasing the likelihood of error propagation and hallucinations.</p>
<p>With the Elastic Agent Builder, you can inspect the output of every tool to explore and verify the results.</p>
<p>Having a succinct problem summary is a game-changer, bringing your attention straight to the most affected services.</p>
<p>Reasoning returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/return-pod-troubleshooter-agent.png" alt="summary-returned-kubernetes-pod-troubleshooter" /></p>
<p>Not only that, but the agent can go one step further and offer recommendations for remediation based on what outputs the tools delivered.</p>
<p>Remediation recommendation returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/remediation-recommendation-kubernetes-pod-troubleshooter.png" alt="remediation-recommendation-kubernetes-pod-troubleshooter" /></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your Kubernetes clusters.</p>
<h2>Frequently Asked Questions</h2>
<p><strong>1. When should I use Elastic Agent Builder for troubleshooting?</strong></p>
<p>Elastic Agent Builder works best for troubleshooting if:</p>
<ul>
<li>
<p>You need repeatable, auditable troubleshooting workflows</p>
</li>
<li>
<p>You want deterministic analysis instead of free-form AI responses</p>
</li>
<li>
<p>You’re investigating something that is reported in the logs or metrics (e.g., pod restarts, OOMKills, or resource pressure)</p>
</li>
<li>
<p>You want to reduce mean time to resolution (MTTR)</p>
</li>
</ul>
<p><strong>2. Do I need OpenTelemetry to use Elastic Agent Builder for Kubernetes troubleshooting?</strong> </p>
<p>No, you don’t need to use OpenTelemetry. You have two options:</p>
<ul>
<li>
<p>You can collect logs and metrics from Kubernetes using the Elastic Agent; or </p>
</li>
<li>
<p>You can collect logs, traces and metrics with the Elastic Distro for OTel (EDOT) Collector</p>
</li>
</ul>
<p>Your choice of collector changes the field names used in the tools above. For example, Elastic Agent uses <code>kubernetes.container.memory.usage.bytes</code>, while the EDOT Collector uses <code>metrics.container.memory.usage</code>.</p>
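<p>For illustration, the same kind of filter can be phrased against either schema. The following is a sketch rather than a query from this guide: the <code>metrics-*</code> index pattern is an assumption, and only the two field names come from the example above, so verify both against your own index mappings before adapting a tool.</p>
<pre><code class="language-esql">// Elastic Agent (ECS-style field)
FROM metrics-*
| WHERE kubernetes.container.memory.usage.bytes IS NOT NULL
| LIMIT 10

// EDOT Collector (OTel-style field)
FROM metrics-*
| WHERE metrics.container.memory.usage IS NOT NULL
| LIMIT 10
</code></pre>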
<p><strong>3. Can this agent be adapted for node-level failures?</strong> </p>
<p>Yes, Elastic has hundreds of <a href="https://www.elastic.co/docs/reference/fleet#integrations">integrations</a>, including AWS (for EKS), Azure (for AKS), Google Cloud (for GKE), as well as host operating system monitoring.</p>
<p>The queries shown above would be modified to use the correct field.</p>
<p><strong>4. Can these tools be reused in automation workflows?</strong> </p>
<p>Yes, <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> can reuse the same scripted automations and AI agents you build in Elastic. An agent can handle the initial analysis and investigation (reducing manual effort), and the workflow can continue with structured steps, such as running Elasticsearch queries, transforming data, branching on conditions and calling external APIs or tools like Slack, Jira and PagerDuty. Workflows can also be exposed to Agent Builder as reusable tools, just like the tool created in this guide.</p>
<p>For more advanced automation from a similar scenario as described in this guide, learn how to <a href="https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server">integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability</a>.</p>
<p><strong>5. Can these tools be triggered by alerts?</strong> </p>
<p>Yes, alerts can trigger <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> and pass the alert context to the workflow. This workflow can be integrated with an Agent Builder agent, as described above.</p>
<p>Additionally, Elastic Alerts let you publish investigation guides alongside alerts so that an SRE has all of the information they need to begin investigating. Any troubleshooting or investigative agents can be linked from the investigation guide, so instead of following the manual processes the guide outlines, the SRE can let the agent handle the repetitive investigative work.</p>
<p><strong>6. How can I get started with Agent Builder?</strong></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a new fully managed, stateless architecture that auto-scales to meet your data, usage, and performance needs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Two sides of the same coin: Uniting testing and monitoring with Synthetic Monitoring]]></title>
            <link>https://www.elastic.co/observability-labs/blog/testing-monitoring-synthetic-monitoring</link>
            <guid isPermaLink="false">testing-monitoring-synthetic-monitoring</guid>
            <pubDate>Mon, 06 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[DevOps aims to establish complementary practices across development and operations. See how Playwright, @elastic/synthetics, GitHub Actions, and Elastic Synthetics can unite development and SRE teams in validating and monitoring the user experience.]]></description>
            <content:encoded><![CDATA[<p>Historically, software development and SRE have worked in silos with different cultural perspectives and priorities. The goal of DevOps is to establish common and complementary practices across software development and operations. However, for some organizations true collaboration is rare, and we still have a way to go to build effective DevOps partnerships.</p>
<p>Outside of cultural challenges, one of the most common reasons for this disconnect is using different tools to achieve similar goals — case in point, end-to-end (e2e) testing versus <a href="https://www.elastic.co/observability/synthetic-monitoring">synthetic monitoring</a>.</p>
<p>This blog shares an overview of these techniques. Using the example repository <a href="https://github.com/carlyrichmond/synthetics-replicator">carlyrichmond/synthetics-replicator</a>, we’ll also show how Playwright, @elastic/synthetics, and GitHub Actions can combine forces with Elastic Synthetics and the recorder to unite development and SRE teams in validating and monitoring the user experience for a simple web application hosted on a provider such as <a href="https://www.netlify.com/">Netlify</a>.</p>
<p>Elastic recently <a href="https://www.elastic.co/blog/new-synthetic-monitoring-observability">introduced synthetics monitoring</a>, and <a href="https://www.elastic.co/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">as highlighted in our prior blog</a>, it can replace e2e tests altogether. Uniting around a single tool to validate the user workflow early provides a common language to recreate user issues to validate fixes against.</p>
<h2>Synthetics Monitoring versus e2e tests</h2>
<p>If development and operations tools are at war, it’s difficult to unify their different cultures. Considering the definitions of these approaches shows that they in fact aim to achieve the same objective.</p>
<p>e2e tests are a suite of tests that recreate the user path, including clicks, user text entry, and navigations. Although many argue it’s about testing the integration of the layers of a software application, it’s the user workflow that e2e tests emulate. Meanwhile, Synthetic Monitoring, specifically a subset known as browser monitoring, is an application performance monitoring practice that emulates the user path through an application.</p>
<p>Both these techniques emulate the user path. If we use tooling that crosses the developer and operational divide, we can work together to build tests that can also provide production monitoring in our web applications.</p>
<h2>Creating user journeys</h2>
<p>When a new user workflow, or set of features that accomplish a key goal, is under development in our application, developers can use @elastic/synthetics to create user journeys. The initial project scaffolding can be generated using the init utility once installed, as in the below example. Note that Node.js must be installed prior to using this utility.</p>
<pre><code class="language-bash">npm install -g @elastic/synthetics
npx @elastic/synthetics init synthetics-replicator-tests
</code></pre>
<p>Before commencing the wizard, make sure you have your Elastic cluster information to hand and the Elastic Synthetics integration set up on your cluster. You will need:</p>
<ol>
<li>Monitor Management enabled within the Elastic Synthetics app, as per the prerequisites in the <a href="https://www.elastic.co/guide/en/observability/8.8/synthetics-get-started-project.html#_prerequisites">getting started documentation</a>.</li>
<li>The Cloud ID of your Elastic Cloud cluster if using Elastic Cloud. Alternatively, if you are hosting on-prem, your Kibana endpoint.</li>
<li>An API key generated from your cluster. There is a shortcut in the Synthetics application Settings to generate this key under the Project API Keys tab, as shown <a href="https://www.elastic.co/guide/en/observability/current/synthetics-get-started-project.html#synthetics-get-started-project-init">in the documentation</a>.</li>
</ol>
<p>This wizard will take you through and generate a sample project containing configuration and example monitor journeys, with a structure similar to the below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-replicator-tests.png" alt="synthetics replicator tests" /></p>
<p>For web developers, most of the elements, such as the README, package.json, and lock files, will be familiar. The main configuration for your monitors is available in <code>synthetics.config.ts</code>, as shown below. This configuration can be amended to include production- and development-specific settings, which is essential for combining forces: any journey can be reused as both an e2e test and a production monitor. Although not in this example, details of <a href="https://www.elastic.co/guide/en/observability/current/synthetics-private-location.html">private locations</a> can be included if you would prefer to monitor from your own dedicated Elastic instance rather than from Elastic infrastructure.</p>
<pre><code class="language-javascript">import type { SyntheticsConfig } from &quot;@elastic/synthetics&quot;;

export default (env) =&gt; {
  const config: SyntheticsConfig = {
    params: {
      url: &quot;http://localhost:5173&quot;,
    },
    playwrightOptions: {
      ignoreHTTPSErrors: false,
    },
    /**
     * Configure global monitor settings
     */
    monitor: {
      schedule: 10,
      locations: [&quot;united_kingdom&quot;],
      privateLocations: [],
    },
    /**
     * Project monitors settings
     */
    project: {
      id: &quot;synthetics-replicator-tests&quot;,
      url: &quot;https://elastic-deployment:port&quot;,
      space: &quot;default&quot;,
    },
  };
  if (env === &quot;production&quot;) {
    config.params = { url: &quot;https://synthetics-replicator.netlify.app/&quot; };
  }
  return config;
};
</code></pre>
<h2>Writing your first journey</h2>
<p>Although the above configuration applies to all monitors in the project, it can be overridden for a given test.</p>
<pre><code class="language-javascript">import { journey, step, monitor, expect, before } from &quot;@elastic/synthetics&quot;;

journey(&quot;Replicator Order Journey&quot;, ({ page, params }) =&gt; {
  // Only relevant for the push command to create
  // monitors in Kibana
  monitor.use({
    id: &quot;synthetics-replicator-monitor&quot;,
    schedule: 10,
  });

  // journey steps go here
});
</code></pre>
<p>The @elastic/synthetics wrapper exposes many <a href="https://www.elastic.co/guide/en/observability/current/synthetics-create-test.html#synthetics-syntax">standard test methods</a>, such as the before and after constructs that allow for setup and teardown of typical properties in the tests, as well as support for many common assertion helper methods. A full list of supported expect methods is available in the <a href="https://www.elastic.co/guide/en/observability/current/synthetics-create-test.html#synthetics-assertions-methods">documentation</a>. The Playwright page object is also exposed, which enables us to perform <a href="https://playwright.dev/docs/api/class-page">all the expected activities provided in the API</a>, such as locating page elements and simulating user events like the clicks depicted in the below example.</p>
<pre><code class="language-javascript">import { journey, step, monitor, expect, before } from &quot;@elastic/synthetics&quot;;

journey(&quot;Replicator Order Journey&quot;, ({ page, params }) =&gt; {
  // monitor configuration goes here

  before(async () =&gt; {
    await page.goto(params.url);
  });

  step(&quot;assert home page loads&quot;, async () =&gt; {
    const header = await page.locator(&quot;h1&quot;);
    expect(await header.textContent()).toBe(&quot;Replicatr&quot;);
  });

  step(&quot;assert move to order page&quot;, async () =&gt; {
    const orderButton = await page.locator(&quot;data-testid=order-button&quot;);
    await orderButton.click();

    const url = page.url();
    expect(url).toContain(&quot;/order&quot;);

    const menuTiles = await page.locator(&quot;data-testid=menu-item-card&quot;);
    expect(await menuTiles.count()).toBeGreaterThan(2);
  });

  // other steps go here
});
</code></pre>
<p>As you can see in the above example, it also exposes the journey and step constructs. These constructs mirror the behavior-driven development (BDD) practice of showing the user journey through the application in tests.</p>
<p>Developers are able to execute the tests against a locally running application as part of their feature development to see successful and failed steps in the user workflow. In the below example, the local server startup command is outlined in blue at the top. The monitor execution command is presented in red further down.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-replicator-npm-start.png" alt="" /></p>
<p>As you can see from the green ticks next to each journey step, each of our tests pass. Woo!</p>
<h2>Gating your CI pipelines</h2>
<p>It’s important to use the execution of the monitors within your CI pipeline as a gate for merging code changes and uploading the new version of your monitors. Each of the jobs in our <a href="https://github.com/carlyrichmond/synthetics-replicator/blob/main/.github/workflows/push-build-test-synthetics-replicator.yml">GitHub Actions workflow</a> will be discussed in this and the subsequent section.</p>
<p>The test job spins up a test instance and runs our user journeys to validate our changes, as illustrated below. This step should run for pull requests to validate developer changes, as well as on push.</p>
<pre><code class="language-yaml">jobs:
  test:
    env:
      NODE_ENV: development
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      - run: npm install
      - run: npm start &amp;
      - run: &quot;npm install @elastic/synthetics &amp;&amp; SYNTHETICS_JUNIT_FILE='junit-synthetics.xml' npx @elastic/synthetics . --reporter=junit&quot;
        working-directory: ./apps/synthetics-replicator-tests/journeys
      - name: Publish Unit Test Results
        uses: EnricoMi/publish-unit-test-result-action@v2
        if: always()
        with:
          junit_files: &quot;**/junit-*.xml&quot;
          check_name: Elastic Synthetics Tests
</code></pre>
<p>Note that, unlike the journey execution on our local machine, we make use of the --reporter=junit option when executing npx @elastic/synthetics to provide visibility of our passing, or sadly sometimes failing, journeys to the CI job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-synthetics-tests.png" alt="" /></p>
<h2>Automatically upload monitors</h2>
<p>To ensure the latest monitors are available in Elastic Uptime, it’s advisable to push the monitors programmatically as part of the CI workflow, as the example below does. Our workflow has a second job, push, shown below, which uploads your monitors to your cluster and depends on the successful execution of our test job. Note that this job is configured in our workflow to run on push, to ensure changes have been validated rather than just raised within a pull request.</p>
<pre><code class="language-yaml">jobs:
  test: …
  push:
    env:
      NODE_ENV: production
      SYNTHETICS_API_KEY: ${{ secrets.SYNTHETICS_API_KEY }}
    needs: test
    defaults:
      run:
        working-directory: ./apps/synthetics-replicator-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      - run: npm install
      - run: npm run push
</code></pre>
<p>The @elastic/synthetics init wizard generates a push command for you when you create your project, which can be triggered from the project folder. This is shown below through the steps and <code>working-directory</code> configuration. The push command requires the API key from your Elastic cluster, which should be stored as a secret within a trusted vault and referenced via a workflow environment variable. It is also vital that monitors pass ahead of pushing the updated monitor configuration to your Elastic Synthetics instance, to avoid breaking your production monitoring. Unlike e2e tests running against a testing environment, broken monitors impact SRE activities, so any changes need to be validated. For that reason, applying a dependency on your test job via the <code>needs</code> option is recommended.</p>
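<p>For reference, the scaffolded project wires this up as an npm script similar to the following (a sketch; the exact script the wizard generates may differ between versions):</p>
<pre><code class="language-json">{
  &quot;scripts&quot;: {
    &quot;push&quot;: &quot;npx @elastic/synthetics push&quot;
  }
}
</code></pre>
<p>In CI, <code>npm run push</code> then reads the project id and Kibana URL from <code>synthetics.config.ts</code> and authenticates using the <code>SYNTHETICS_API_KEY</code> environment variable set by the workflow.</p>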
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-push-build-test-synthetics-replicator.png" alt="" /></p>
<h2>Monitoring using Elastic Synthetics</h2>
<p>Once monitors have been uploaded, they give a regular checkpoint to SRE teams as to whether the user workflow is functioning as intended — not just because they will run on a regular schedule as configured for the project and individual tests as shown previously, but also due to the ability to check the state of all monitor runs and execute them on demand.</p>
<p>The Monitors Overview tab gives us an immediate view of the status of all configured monitors, as well as the ability to run the monitor manually via the card ellipsis menu.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-monitors.png" alt="elastic observability monitors" /></p>
<p>From the Monitor screen, we can also navigate to an overview of an individual monitor execution to investigate failures.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-test-run-details.png" alt="test run details" /></p>
<p>The other monitoring superpower SREs now have is the integration of these monitors with familiar tools SREs already use to scrutinize the performance and availability of applications, such as APM, metrics, and logs. The aptly named <strong>Investigate</strong> menu allows easy navigation while SREs are investigating potential failures or bottlenecks.</p>
<p>There is also a balance between finding issues and being notified of potential problems automatically. SREs already familiar with setting rules and thresholds for notification of issues will be happy to know that this is also possible for browser monitors. The editing of an example rule is shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/blog-elastic-rules.png" alt="elastic observability rules" /></p>
<p>The status of browser monitors can be configured not only to consider if any individual or collective monitors have been down several times, such as in the status check above, but also to gauge the overall availability by looking at the percentage of passed checks within a given time period. SREs are not only interested in reacting to issues in a traditional production management way — they want to improve the availability of applications, too.</p>
<h2>Recording user workflows</h2>
<p>The limitation of generating e2e tests through the development lifecycle is that sometimes teams miss things, and the prior toolset is geared toward development teams. Despite the best intentions to design an intuitive product using multi-discipline teams, users may use applications in unintended ways. Furthermore, the monitors written by developers will only cover those expected workflows and raise the alarm either when these monitors fail in production or when they start to behave differently if anomaly detection is applied to them.</p>
<p>When user issues arise, it’s useful to recreate that problem in the same format as our monitors. It’s also important to leverage the experience of SREs in generating user journeys, as they will consider failure cases intuitively where developers may struggle and focus on happy cases. However, not all SREs will have the experience or confidence to write these journeys using Playwright and @elastic/synthetics.</p>
&lt;Video vidyardUuid=&quot;NnJFuY5mpCdUNfLJSMAma3&quot; /&gt;
<p>Enter the Elastic Synthetics Recorder! The above video gives a walkthrough of how it can be used to record the steps in a user journey and export them to a JavaScript file for inclusion in your monitor project. This is useful for feeding back into the development phase and for testing the fixes developed to solve the problem. None of this works unless we all combine forces and use these monitors together.</p>
<h2>Try it out!</h2>
<p>As of 8.8, @elastic/synthetics and the Elastic Synthetics app are generally available, and the trusty recorder is in beta. Share your experiences of bridging the developer and operations divide with Synthetic Monitoring via the <a href="https://discuss.elastic.co/c/observability/uptime/75">Uptime category</a> in the Community Discuss forums or via <a href="https://ela.st/slack">Slack</a>.</p>
<p>Happy monitoring!</p>
<p><em>Originally published February 6, 2023; updated May 23, 2023.</em></p>
<blockquote>
<ol>
<li><a href="https://www.elastic.co/observability-labs/blog/why-and-how-replace-end-to-end-tests-synthetic-monitors">Why and how to replace end-to-end tests with synthetic monitors</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/monitor-uptime-synthetics.html#monitor-uptime-synthetics">Uptime and Synthetic Monitoring</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/synthetics-journeys.html">Scripting browser monitors</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/synthetics-recorder.html">Use the Synthetics Recorder</a></li>
<li><a href="https://playwright.dev/">Playwright</a></li>
<li><a href="https://docs.github.com/en/actions">GitHub Actions</a></li>
</ol>
</blockquote>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/testing-monitoring-synthetic-monitoring/digital-experience-monitoring.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Universal Profiling: Detecting CO2 and energy efficiency]]></title>
            <link>https://www.elastic.co/observability-labs/blog/universal-profiling-detecting-co2-energy-efficiency</link>
            <guid isPermaLink="false">universal-profiling-detecting-co2-energy-efficiency</guid>
            <pubDate>Mon, 05 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Universal Profiling introduces the possibility to capture environmental impact. In this post, we compare Python and Go implementations and showcase the substantial CO2 savings achieved through code optimization.]]></description>
            <content:encoded><![CDATA[<p>A while ago, we posted a <a href="https://www.elastic.co/blog/importing-chess-games-elasticsearch-universal-profiling">blog</a> that detailed how we imported over 4 billion chess games with speed using Python and optimized the code leveraging our Universal Profiling&lt;sup&gt;TM&lt;/sup&gt;. This was based on Elastic Stack running on version 8.9. We are now on <a href="https://www.elastic.co/blog/whats-new-elastic-8-12-0">8.12</a>, and it is time to do a second part that shows how easy it is to observe compiled languages and how Elastic®’s Universal Profiling can help you determine the benefit of a rewrite, both from a cost and environmental friendliness angle.</p>
<h2>Why efficiency matters — for you and the environment</h2>
<p>Data centers are estimated to consume ~3% of global electricity, and their usage is expected to double by 2030.* The cost of a digital service is a close proxy to its computing efficiency, and thus, being more efficient is a win-win: less energy consumed, smaller bill.</p>
<p>At the same time, companies want the ability to scale to more users while spending less per user, and they are actively looking for ways to reduce their energy consumption.</p>
<p>In this spirit, <a href="https://www.elastic.co/observability/universal-profiling">Universal Profiling</a> comes equipped with data and visualizations to help determine where efficiency improvement efforts are worth the most.</p>
<p><a href="https://www.elastic.co/blog/continuous-profiling-efficient-cost-effective-applications">Energy efficiency</a> measures how much a digital service consumes to produce an output given an input. It can be measured in multiple ways, and we at Elastic Observability chose CO<sub>2</sub> emissions and annualized CO<sub>2</sub> emissions (more details on them later).</p>
<p>Let’s take the example of an e-commerce website: the energy efficiency of the “search inventory” process could be calculated as the average CPU time needed to serve a user request. Once the baseline for this value is determined, changes to the software delivering the search process may result in more or less CPU time consumed for the same feature, resulting in less or more efficient code.</p>
<h2>How to set up and configure wattage and CO2</h2>
<p>You can find a “Settings” button in the top-right corner of the Universal Profiling views. From there, you can customize the coefficients used to calculate the CO<sub>2</sub> emissions tied to profiling data.</p>
<p>The values set here will be used only when the profiles gathered from host agents are not already associated with publicly known data certified by cloud providers. For example, suppose you have a hybrid cloud deployment with a portion of your workload running on-premise and a portion running in GCP. In that case, the values set here will only be used to calculate the CO<sub>2</sub> emissions for the on-premise machines; we already use the coefficients declared by GCP to calculate the emissions of those machines.</p>
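<p>To give a sense of how such a coefficient-based estimate fits together, the general shape of the calculation looks roughly like this (an illustration with placeholder names, not Elastic’s exact formula):</p>
<pre><code>annualized_co2_kg ≈ (cpu_core_seconds_per_year / 3600)  // CPU core-hours
                    * (per_core_watts / 1000)           // converts to kilowatt-hours
                    * datacenter_pue                    // power usage effectiveness
                    * kg_co2_per_kwh                    // regional carbon intensity
</code></pre>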
<h2>Python vs. Go</h2>
<p>Our first <a href="https://www.elastic.co/blog/importing-chess-games-elasticsearch-universal-profiling">blog post</a> implemented a solution in Python to read PGN chess games, a text-based game format. It showed how Universal Profiler can be leveraged to identify slow functions and help you rewrite your code to be faster and more efficient. At the end of it, we were happy with the Python version. It is still used today to grab the monthly updates from the <a href="https://database.lichess.org/">Lichess database</a> and ingest them into Elasticsearch®. I always wanted a reason to work more with Go, so we rewrote the Python implementation in Go. We leveraged goroutines and channels to send data through message passing. You can see more about it in our <a href="https://github.com/philippkahr/blogs/tree/main/universal-profiling">GitHub repository</a>.</p>
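<p>The fan-out pattern behind that rewrite can be sketched as follows. This is a minimal, self-contained illustration of goroutines and channels, not the actual code from the repository; <code>countMoves</code> is a hypothetical stand-in for real PGN parsing and indexing.</p>
<pre><code class="language-go">package main

import (
  &quot;fmt&quot;
  &quot;strings&quot;
  &quot;sync&quot;
)

// countMoves is a stand-in for real PGN parsing: it counts move tokens
// in the movetext, skipping tag-pair lines such as [Event &quot;...&quot;].
func countMoves(game string) int {
  n := 0
  for _, line := range strings.Split(game, &quot;\n&quot;) {
    if line == &quot;&quot; || strings.HasPrefix(line, &quot;[&quot;) {
      continue
    }
    n += len(strings.Fields(line))
  }
  return n
}

func main() {
  games := []string{
    &quot;[Event \&quot;Casual\&quot;]\n1. e4 e5 2. Nf3&quot;,
    &quot;[Event \&quot;Rated\&quot;]\n1. d4 d5&quot;,
  }

  in := make(chan string)
  out := make(chan int)

  // Fan out: a small pool of worker goroutines parses games concurrently.
  var wg sync.WaitGroup
  for w := 0; w &lt; 4; w++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      for g := range in {
        out &lt;- countMoves(g)
      }
    }()
  }

  // Close the output channel once every worker has finished.
  go func() {
    wg.Wait()
    close(out)
  }()

  // Producer: stream games into the pipeline via message passing.
  go func() {
    for _, g := range games {
      in &lt;- g
    }
    close(in)
  }()

  total := 0
  for n := range out {
    total += n
  }
  fmt.Println(&quot;parsed move tokens:&quot;, total) // prints &quot;parsed move tokens: 8&quot;
}
</code></pre>
<p>The real importer would replace <code>countMoves</code> with PGN parsing and a bulk-indexing step; the point here is only the shape of the pipeline.</p>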
<p>Rewriting in Go also means switching from an interpreted language to a compiled one. As with everything in IT, this has benefits as well as disadvantages. One disadvantage is that we must ship debug symbols for the compiled binary. When we build the binary, we can use the symbtool program to ship the debug symbols. Without debug symbols, we see uninterpretable information as frames will be labeled with hexadecimal addresses in the flame graph rather than source code annotations.</p>
<p>First, make sure that your executable includes debug symbols. By default, Go builds with debug symbols. You can check this by running <code>file yourbinary</code>; the important part is that the output says it is not stripped.</p>
<pre><code class="language-bash">file lichess
lichess: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=gufIkqA61WnCh8haeW-2/lfn3ne3U_y8MGoFD4AvT/QJEykzbacbYEmEQpXH6U/MqVbk-402n1k3B8yPB6I, with debug_info, not stripped
</code></pre>
<p>Now we need to push the symbols using symbtool. You must create an Elasticsearch API key as the authentication method. In the Universal Profiler UI in Kibana®, an <strong>Add Data</strong> button in the top right corner will tell you exactly what to do. The command looks like the following; the <code>-e</code> flag is where you pass the path of your executable file, in our case <code>lichess</code>, as above.</p>
<pre><code class="language-bash">symbtool push-symbols executable -t &quot;ApiKey&quot; -u &quot;elasticsearch-url&quot; -e &quot;lichess&quot;
</code></pre>
<p>Now that debug symbols are available inside the cluster, we can run both implementations with the same file simultaneously and see what Universal Profiler can tell us about it.</p>
<h2>Identifying CO2 and energy efficiency savings</h2>
<p>Python is more frequently scheduled on the CPU. Thus, it runs more often on the hardware and contributes more to the machines’ resource usage.</p>
<p>We use the differential flame graph to identify and automatically calculate the difference in the following comparison. Filter on <code>process.thread.name: "python3.11"</code> for the baseline, and filter on <code>lichess</code> for the comparison.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/1-elastic-blog-uni-profiling.png" alt="1 - universal profiling" /></p>
<p>Looking at the impact of annualized CO<sub>2</sub> emissions, we see a decrease from 65.32kg of CO<sub>2</sub> for the Python solution to 16.78kg. That is a difference of 48.54kg of CO<sub>2</sub> saved over a year.</p>
<p>If we take a step back, we’ll want to figure out why Python produces so many more emissions. In the flame graph view, we filter down to just Python and can click on the first frame, called python3.11. A little popup tells us that it caused 32.95kg of emissions. That is nearly 50% of all emissions, caused by the runtime alone. Our program itself caused the other ~32kg of CO<sub>2</sub>. We immediately cut ~32kg of annual emissions by replacing the Python interpreter with Go.</p>
<p>We can lock that popup in place with a right click and then click <strong>Show more information</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/2-elastic-blog-uni-profiling.png" alt="2 - universal profiling graphs blue-orange" /></p>
<p>The <strong>Show more information</strong> link displays detailed information about the frame, like sample count, total CPU, core seconds, and dollar costs. We won’t go into more detail in this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/3-elastic-blog-uni-profiling.png" alt="3 impact estimates" /></p>
<h2>Reduce your carbon footprint today with Universal Profiling</h2>
<p>This blog post demonstrates that rewriting your code base can reduce your carbon footprint immensely. Using Universal Profiling, you could run a quick PoC to show how much carbon could be saved.</p>
<p>Learn how you can <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">get started</a> with Elastic Universal Profiling today.</p>
<blockquote>
<ul>
<li>Cluster for storing the data where three nodes, each 64GB RAM and 32 CPU cores, are running GCP on Elastic Cloud.</li>
<li>The machine for sending the data is a GCP e2-standard-32, thus 128GB RAM and 32 CPU cores with a 500GB balanced disk to read the games from.</li>
<li>The file used for the games is this <a href="https://database.lichess.org/standard/lichess_db_standard_rated_2023-12.pgn.zst">Lichess database</a> containing 96,909,211 games. The extracted file size is 211GB.</li>
</ul>
</blockquote>
<p><strong>Source:</strong></p>
<p><a href="https://media.ccc.de/v/camp2023-57070-energy_consumption_of_data_centers">https://media.ccc.de/v/camp2023-57070-energy_consumption_of_data_centers</a></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/universal-profiling-detecting-co2-energy-efficiency/141935_-_Blog_header_image-_Op1_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Combining Elastic Universal Profiling with Java APM Services and Traces]]></title>
            <link>https://www.elastic.co/observability-labs/blog/universal-profiling-with-java-apm-services-traces</link>
            <guid isPermaLink="false">universal-profiling-with-java-apm-services-traces</guid>
            <pubDate>Thu, 20 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to combine the power of Elastic universal profiling with APM data from Java services to easily pinpoint CPU bottlenecks. Compatible with both OpenTelemetry and the classic Elastic APM Agent.]]></description>
            <content:encoded><![CDATA[<p>In <a href="https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation">a previous blog post</a>, we introduced the technical details of how we managed to correlate eBPF profiling data with APM traces.
This time, we'll show you how to get this feature up and running to pinpoint CPU bottlenecks in your Java services! The correlation is supported for both OpenTelemetry and the classic Elastic APM Agent. We'll show you how to enable it for both.</p>
<h2>Demo Application</h2>
<p>For this blog post, we’ll be using the <a href="https://github.com/JonasKunz/cpu-burner">cpu-burner demo application</a> to showcase the correlation capabilities of APM, tracing, and profiling in Elastic. This application was built to continuously execute several CPU-intensive tasks:</p>
<ul>
<li>It computes Fibonacci numbers using the naive, recursive algorithm.</li>
<li>It hashes random data with the SHA-2 and SHA-3 hashing algorithms.</li>
<li>It performs numerous large background allocations to stress the garbage collector.</li>
</ul>
<p>The computations of the Fibonacci numbers and the hashing will each be visible as transactions in Elastic: They have been manually instrumented using the OpenTelemetry API.</p>
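<p>To give a concrete idea of what this looks like, here is a minimal, dependency-free sketch of the Fibonacci task (the actual code in the cpu-burner repository may differ; the OpenTelemetry span calls are indicated in comments so the sketch stays self-contained):</p>

```java
public class Fib {
    // The naive, recursive algorithm: deliberately exponential in n,
    // which makes it a good CPU burner for profiling demos.
    static long fib(int n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // In the demo, a call like this is wrapped in a manually created span,
        // roughly: Span span = tracer.spanBuilder("fibonacci").startSpan();
        //          try { fib(n); } finally { span.end(); }
        System.out.println(fib(30)); // 832040
    }
}
```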
<h2>Setting up Profiling and APM</h2>
<p>First, we’ll need to set up the universal profiling host agent on the host where the demo application will run. Starting from version 8.14.0, correlation with APM data is supported and enabled out of the box for the profiler. There is no special configuration needed; we can just follow the <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">standard setup guide</a>.
Note that at the time of writing, universal profiling only supports Linux.
On Windows, you'll have to use a VM to try the demo.
On macOS, you can use <a href="https://github.com/abiosoft/colima">colima</a> as docker engine and run the profiling host agent and the demo app in container images.</p>
<p>In addition, we’ll need to instrument our demo application with an APM agent. We can either use the <a href="https://github.com/elastic/apm-agent-java">classic Elastic APM agent</a> or the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Distribution</a>.</p>
<h3>Using the Classic Elastic APM Agent</h3>
<p>Starting with version 1.50.0, the classic Elastic APM agent ships with the capability to correlate the traces it captures with the profiling data from universal profiling. We’ll just need to enable it explicitly via the <strong>universal_profiling_integration_enabled</strong> config option. Here is the standard command line for running the demo application with the setting enabled:</p>
<pre><code class="language-shell">curl -o 'elastic-apm-agent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&amp;g=co.elastic.apm&amp;a=elastic-apm-agent&amp;v=LATEST'
java -javaagent:elastic-apm-agent.jar \
-Delastic.apm.service_name=cpu-burner-elastic \
-Delastic.apm.secret_token=XXXXX \
-Delastic.apm.server_url=&lt;elastic-apm-server-endpoint&gt; \
-Delastic.apm.application_packages=co.elastic.demo \
-Delastic.apm.universal_profiling_integration_enabled=true \
-jar ./target/cpu-burner.jar
</code></pre>
<h3>Using OpenTelemetry</h3>
<p>The feature is also available as an OpenTelemetry SDK extension.
This means you can use it as a plugin for the vanilla OpenTelemetry agent or add it to your OpenTelemetry SDK if you are not using an agent.
In addition, the feature ships by default with the Elastic OpenTelemetry Distribution for Java and can be used via any of the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">possible usage methods</a>.
While the extension is currently Elastic-specific, we are already working with the various OpenTelemetry SIGs on standardizing the correlation mechanism, especially now after the <a href="https://www.elastic.co/observability-labs/blog/elastic-profiling-agent-acceptance-opentelemetry">eBPF profiling agent has been contributed</a>.</p>
<p>For this demo, we’ll be using the Elastic OpenTelemetry Distro Java agent to run the extension:</p>
<pre><code class="language-shell">curl -o 'elastic-otel-javaagent.jar' -L 'https://oss.sonatype.org/service/local/artifact/maven/redirect?r=releases&amp;g=co.elastic.otel&amp;a=elastic-otel-javaagent&amp;v=LATEST'
java -javaagent:./elastic-otel-javaagent.jar \
-Dotel.exporter.otlp.endpoint=&lt;elastic-cloud-OTLP-endpoint&gt; \
&quot;-Dotel.exporter.otlp.headers=Authorization=Bearer XXXX&quot; \
-Dotel.service.name=cpu-burner-otel \
-Delastic.otel.universal.profiling.integration.enabled=true \
-jar ./target/cpu-burner.jar
</code></pre>
<p>Here, we explicitly enabled the profiling integration feature via the <strong>elastic.otel.universal.profiling.integration.enabled</strong> property. Note that with an upcoming release of the universal profiling feature, this won’t be necessary anymore! The OpenTelemetry extension will then automatically detect the presence of the profiler and enable the correlation feature based on that.</p>
<p>The demo repository also comes with a Dockerfile, so you can alternatively build and run the app in docker:</p>
<pre><code class="language-shell">docker build -t cpu-burner .
docker run --rm -e OTEL_EXPORTER_OTLP_ENDPOINT=&lt;elastic-cloud-OTLP-endpoint&gt; -e OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer XXXX&quot; cpu-burner
</code></pre>
<p>And that’s it for setup; we are now ready to inspect the correlated profiling data!</p>
<h2>Analyzing Service CPU Usage</h2>
<p>The first thing we can do now is head to the “Flamegraph” view in Universal Profiling and inspect flamegraphs filtered on APM services. Without the APM correlation, universal profiling is limited to filtering on infrastructure concepts, such as hosts, containers, and processes.
Below is a screencast showing a flamegraph filtered on the service name of our demo application:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/service-profiling.gif" alt="Universal Profiling Flamegraph filtered on the service name of our demo application" /></p>
<p>With this filter applied, we get a flamegraph aggregated over all instances of our service. If that is not desired, we could narrow down the filter, e.g. based on the host or container names. Note that the same service-level flamegraph view is also available on the “Universal Profiling” tab in the APM service UI.</p>
<p>The flamegraphs show exactly how the demo application is spending its CPU time, regardless of whether it is covered by instrumentation. From left to right, we can first see the time spent in application tasks: We can identify the background allocations not covered by APM transactions as well as the SHA-computation and Fibonacci transactions.
Interestingly, this application logic only covers roughly 60% of the total CPU time! The remaining time is spent mostly in the G1 garbage collector due to the high allocation rate of our application. The flamegraph shows all G1-related activities and the timing of the individual phases of concurrent tasks. We can easily identify those based on the native function names. This is made possible by universal profiling being capable of profiling and symbolizing the JVM’s C++ code in addition to the Java code.</p>
<h2>Pinpointing Transaction Bottlenecks</h2>
<p>While the service-level flamegraph already gives good insights on where our transactions consume the most CPU, this is mainly due to the simplicity of the demo application. In real-world applications, it can be much harder to pinpoint that certain stack frames come mostly from certain transactions. For this reason, the APM agent also correlates CPU profiling data from universal profiling on the transaction level.</p>
<p>We can navigate to the “Universal Profiling” tab on the transaction details page to get per-transaction flamegraphs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/navigate-to-transaction-profiles.gif" alt="Navigation to per-transaction profiling flamegraphs" /></p>
<p>For example, let’s have a look at the flamegraph of our transaction computing SHA-2 and SHA-3 hashes of randomly generated data:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-unfiltered.png" alt="Flamegraph for the hashing transaction" /></p>
<p>Interestingly, the flamegraph uncovers some unexpected results: The transactions spend more time computing the random bytes to be hashed rather than on the hashing itself! So if this were a real-world application, a possible optimization could be to use a more performant random number generator.</p>
<p>In addition, we can see that the MessageDigest.update call for computing the hash values fans out into two different code paths: One is a call into the <a href="https://www.bouncycastle.org/">BouncyCastle cryptography library</a>, the other one is a JVM stub routine, meaning that the JIT compiler has inserted special assembly code for a function.</p>
<p>The flamegraph shown in the screenshot displays the aggregated data for all “shaShenanigans” transactions in the given time filter. We can further filter this down using the transaction filter bar at the top. To make the best use of this, the demo application annotates the transactions with the hashing algorithm used via OpenTelemetry attributes:</p>
<pre><code class="language-java">public static void shaShenanigans(MessageDigest digest) {
    Span span = tracer.spanBuilder(&quot;shaShenanigans&quot;)
        .setAttribute(&quot;algorithm&quot;, digest.getAlgorithm())
        .startSpan();
    ...
    span.end();
}
</code></pre>
<p>So, let’s filter our flamegraph based on the used hashing algorithm:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-filter-bar.png" alt="Transaction Filter Bar" /></p>
<p>Note that “SHA-256” is the name of the JVM built-in SHA-2 256-bit implementation. This now gives the following flamegraph:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-sha-256.png" alt="Transaction Filter Bar" /></p>
<p>We can see that the BouncyCastle stack frames are gone and MessageDigest.update spends all its time in the JVM stub routines. Therefore, the stub routine is likely hand-crafted assembly from the JVM maintainers for the SHA2 algorithm.</p>
<p>If we instead filter on “SHA3-256”, we get the following result:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/tx-sha3.png" alt="Transaction Filter Bar" /></p>
<p>Now, as expected, MessageDigest.update spends all its time in the BouncyCastle library for the SHA3 implementation. Note that the hashing here takes up more time in relation to the random data generation, showing that the SHA2 JVM stub routine is significantly faster than the BouncyCastle Java SHA3 implementation.</p>
<p>This filtering is not limited to custom attributes like those shown in this demo. You can filter on any transaction attribute, including latency, HTTP headers, and so on. For typical HTTP applications, for example, this allows analyzing the efficiency of the JSON serializer in use based on the payload size.
Note that while it is possible to filter on individual transaction instances (e.g. based on trace.id), this is not recommended: To allow continuous profiling in production systems, the profiler by default runs with a low sampling rate of 20 Hz. This means that for typical real-world applications, a single transaction execution will not yield enough data. Instead, we gain insights by monitoring multiple executions of a group of transactions over time and aggregating their samples, for example in a flamegraph.</p>
<h2>Summary</h2>
<p>A common reason for applications to degrade is overly high CPU usage. In this blog post, we showed how to combine universal profiling with APM to find the actual root cause in such cases: We explained how to analyze the CPU time using profiling flamegraphs on service and transaction levels.
In addition, we further drilled down into data using custom filters.
We used a simple demo application for this purpose, so go ahead and try it yourself with your own real-world applications to experience the full power of the feature!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/universal-profiling-with-java-apm-services-traces/blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[From Uptime to Synthetics in Elastic: Your migration Playbook]]></title>
            <link>https://www.elastic.co/observability-labs/blog/uptime-to-synthetics-guide</link>
            <guid isPermaLink="false">uptime-to-synthetics-guide</guid>
            <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Effortlessly migrate your existing Uptime TCP, ICMP, and HTTP monitors to Elastic Synthetics with this comprehensive guide, leveraging Private Locations and Synthetics Projects for efficient, future-proof monitoring.]]></description>
            <content:encoded><![CDATA[<p>Have you seen the warning that Uptime is deprecated and want to know how to easily migrate to Synthetics? Then you are in the right place.
Starting with version 8.15.0, uptime checks have been deprecated in favor of synthetic monitoring.</p>
<p>Many users may have a large number of TCP, ICMP, and HTTP monitors and need to migrate them to Synthetics. In this guide, we will explain how to perform this migration easily while ensuring that the result is future-proof and leaves room to develop more advanced checks such as <a href="https://www.elastic.co/docs/solutions/observability/synthetics/#monitoring-synthetics">Browser monitors</a>.</p>
<p>First, we must consider the number of monitors to migrate; if the number is small, the easiest way would be to do it manually through the <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-ui">Synthetics UI</a>. However, in this guide we will assume that we have dozens or hundreds of monitors to migrate, and doing it manually in the Synthetics UI is not an option.</p>
<h1>Private Location</h1>
<p>Traditionally, uptime monitors required a <a href="https://www.elastic.co/docs/reference/beats/heartbeat/">Heartbeat</a> to be deployed in your infrastructure, which indirectly allowed you to monitor endpoints or hosts on your private network. If this is still a requirement, you will need to either configure <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks">Private Location</a> or allow Elastic’s global managed infrastructure to <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks#monitor-via-access-control">access your private endpoints</a> (only on <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/cloud-hosted">ECH</a> &amp; <a href="https://www.elastic.co/docs/deploy-manage/deploy/elastic-cloud/serverless">Serverless</a>).</p>
<p>In this guide, we will use Private Locations, which will allow you to monitor both internal and external resources. More details can be found here: <a href="https://www.elastic.co/docs/solutions/observability/synthetics/monitor-resources-on-private-networks#monitor-via-private-agent">Monitor resources on private networks</a></p>
<h2>Step 1: Set up Fleet Server and Elastic Agent</h2>
<p>Private Locations are simply Elastic Agents enrolled in Fleet and managed through an agent policy. </p>
<p>If you don't have a Fleet Server yet, start <a href="https://www.elastic.co/docs/reference/fleet/fleet-server">setting up a Fleet Server</a>. This step is not necessary if you use ECH, as it comes by default.</p>
<p>Next, you will need to create an Agent Policy. Go to <strong>Observability → Monitors (Synthetics) → Settings (top right) → Private Location → + Create Location</strong></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/create-private-location.png" alt="Create Private Location" /></p>
<p>Fill in the fields and create a new policy for this Private Location. It is important to know that a Private Location must be set up against an agent policy that runs on a <strong>single</strong> Elastic Agent.</p>
<h2>Step 2: Deploy the Elastic Agent</h2>
<p>Now we need to deploy the Elastic Agent that will be responsible for running all the monitors. We can use the same host we were using for Heartbeat. There is only one requirement: the host must be able to run Docker containers, since to take advantage of all the features of Synthetics, we must use the <code>elastic-agent-complete</code> Docker image.</p>
<ol>
<li>
<p>Go to <strong>Fleet –&gt; Enrollment tokens</strong> and note the enrollment token relevant to the policy you just created for the Private Location. Now go to <strong>Settings</strong> and note the default Fleet server host URL.</p>
</li>
<li>
<p>On the host, run the following commands. For more information on running Elastic Agent with Docker, refer to Run Elastic Agent in a container.</p>
</li>
</ol>
<pre><code class="language-sh">docker run \
  --env FLEET_ENROLL=1 \
  --env FLEET_URL={fleet_server_host_url} \
  --env FLEET_ENROLLMENT_TOKEN={enrollment_token} \
  --cap-add=NET_RAW \
  --cap-add=SETUID \
  --rm docker.elastic.co/elastic-agent/elastic-agent-complete:9.3.1
</code></pre>
<h1>Synthetic Project</h1>
<p>At this point, we already have the location from which our Synthetic monitors will run. Now we need to load our Uptime monitors as Synthetics.</p>
<p>As we mentioned earlier, there are two ways to do this: either manually through the Synthetics UI or through a Synthetics Project.
In our case, since we have so many monitors to migrate and don't want to do it manually, we will use <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-with-projects">Synthetics Projects</a>. </p>
<p>The great thing about Synthetics Projects is that they have some backward compatibility with the monitor definitions in <code>heartbeat.yml</code>, and we will be leveraging it.</p>
<h2>What's a Synthetics project?</h2>
<p>A Synthetics project is the most powerful and flexible way to manage synthetic monitors in Elastic, based on the Infrastructure as Code principle and compatible with GitOps flows. Instead of configuring monitors from the interface, you define them as code: .yml files for lightweight monitors and JavaScript or TypeScript scripts for browser-type monitors (journeys).</p>
<p>This approach allows you to structure your monitors in a repository, version them with Git, validate them, and deploy them automatically using CI/CD flows, providing traceability, reviews, and consistent deployments.</p>
<h2>Step 3: Initialize your Synthetics project</h2>
<p>You will no longer need to connect to the hosts where you deployed the Elastic Agent, as the remaining steps can be performed locally as long as you have connectivity to Kibana!</p>
<p>Since Synthetics Projects is based on Node.js, make sure you have it <a href="https://nodejs.org/en/download">installed</a>. </p>
<ol>
<li>Install the package:</li>
</ol>
<pre><code class="language-sh">npm install -g @elastic/synthetics
</code></pre>
<ol start="2">
<li>Confirm your system is set up correctly:</li>
</ol>
<pre><code class="language-sh">npx @elastic/synthetics -h
</code></pre>
<ol start="3">
<li>Start by creating your first Synthetics project. Run the command below to create a new Synthetics project named <code>synthetic-project-test</code> in the current directory.</li>
</ol>
<pre><code class="language-sh">npx @elastic/synthetics init synthetic-project-test
</code></pre>
<ol start="4">
<li>
<p>Follow the prompt instructions to configure the default variables for your Synthetics project. Make sure to at least <strong>select your Private Location.</strong> Once that’s done, set the <code>SYNTHETICS_API_KEY</code> environment variable in your terminal, which allows the project to authenticate with Kibana.</p>
<ol>
<li>
<p>To generate an API key, go to Synthetics in Kibana.</p>
</li>
<li>
<p>Click <strong>Settings</strong>.</p>
</li>
<li>
<p>Switch to the <strong>Project API Keys</strong> tab.</p>
</li>
<li>
<p>Click <strong>Generate Project API key</strong>.</p>
</li>
</ol>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/generate-api-key.png" alt="Generate API Key" /></p>
<p>More details for all the steps can be found here: <a href="https://www.elastic.co/docs/solutions/observability/synthetics/create-monitors-with-projects#synthetics-get-started-project-create-a-synthetics-project">Create monitors with a Synthetics project</a></p>
<h2>Step 4: Add your <code>heartbeat.yml</code> files</h2>
<p>Once the project is initialized, access the folder it has created and take a look at the project structure:</p>
<ul>
<li>
<p><code>journeys</code> is where you’ll add .ts and .js files defining your browser monitors. It currently contains files defining sample monitors.</p>
</li>
<li>
<p><code>lightweight</code> is where we’ll add our heartbeat.yml files defining our lightweight monitors. It currently contains a file defining sample monitors.</p>
</li>
</ul>
<p>Therefore, all we have to do is copy our <code>heartbeat.yml</code> files to this lightweight folder. Before copying <code>heartbeat.yml</code>, keep in mind that we don't need all of its content; we are only interested in the <code>heartbeat.monitors</code> part. <br />
We recommend considering splitting the file into logical groups. Instead of maintaining a single large YAML file, you could create multiple smaller YAML files, with each file representing either a single check or a group of related checks. This approach may simplify management and improve compatibility with GitOps workflows.<br />
Each YAML file should look like this:</p>
<pre><code>carles@synthetics-migration:synthetic-project-test/lightweight# cat heartbeat.yml

heartbeat.monitors:
- type: icmp
  schedule: '@every 10s'
  hosts: [&quot;localhost&quot;]
  id: my-icmp-service-synth
  name: My ICMP Service - Synthetic
- type: tcp
  schedule: '@every 10s'
  hosts: [&quot;myremotehost:8123&quot;]
  mode: any
  id: my-tcp-service-synth
  name: My TCP Service Synthetic
- type: http
  schedule: '@every 10s'
  urls: [&quot;http://elastic.co&quot;]
  id: my-http-service-synth
  name: My HTTP Service Synthetic
</code></pre>
<p>What we just did is define different ICMP, TCP, and HTTP checks as code.</p>
<p>Now we need to ask Synthetics project to create the monitors in Kibana based on what we have defined in our YAML files:</p>
<pre><code class="language-sh">npx @elastic/synthetics push --auth $SYNTHETICS_API_KEY --url &lt;kibana-url&gt;
</code></pre>
<p>Unfortunately, we do not support a 1-to-1 mapping of the Heartbeat schema to the lightweight schema, so you may encounter some errors during the execution of this command. One example is the definition of <code>schedule</code>: Heartbeat supports crontab expressions, but Synthetics projects require the <code>@every</code> syntax.</p>
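<p>If you have many monitors using crontab schedules, a small helper can mechanically rewrite the simple cases. The sketch below is illustrative only (not an official migration tool) and handles just the common minute-step pattern; anything more complex should be converted by hand:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScheduleConverter {
    // Rewrites a minute-step crontab expression such as "*/5 * * * *"
    // into the "@every" syntax expected by Synthetics projects.
    static String crontabToEvery(String cron) {
        Matcher m = Pattern.compile("^\\*/(\\d+)\\s+\\*\\s+\\*\\s+\\*\\s+\\*$")
                .matcher(cron.trim());
        if (m.matches()) {
            return "@every " + m.group(1) + "m";
        }
        throw new IllegalArgumentException("convert by hand: " + cron);
    }

    public static void main(String[] args) {
        System.out.println(crontabToEvery("*/5 * * * *"));  // @every 5m
        System.out.println(crontabToEvery("*/30 * * * *")); // @every 30m
    }
}
```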
<p>If no syntax errors were found, the command output will show that the monitors have been successfully created in Kibana!</p>
<p>Then, go to <strong>Synthetics</strong> in Kibana. You should see your newly pushed monitors running. You can also go to the Management tab to see the monitors' configuration settings.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/uptime-to-synthetics-guide/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using a custom agent with the OpenTelemetry Operator for Kubernetes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-elastic-agents</guid>
            <pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[<p>This is the second part of a two part series. The first part is available at <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>. In that first part I walk through setting up and installing the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configuring that for auto-instrumentation of a Java application using the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>.</p>
<p>In this second part, I show how to install <em>any</em> Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
<h2>Installation and configuration recap</h2>
<p>Part 1 of this series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>, details the installation and configuration of the OpenTelemetry operator and an Instrumentation resource. Here is an outline of the steps as a reminder:</p>
<ol>
<li>Install cert-manager, eg <code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml</code></li>
<li>Install the operator, eg <code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml</code></li>
<li>Create an Instrumentation resource</li>
<li>Add an annotation to either the deployment or the namespace</li>
<li>Deploy the application as normal</li>
</ol>
<p>In that first part, steps 3, 4 &amp; 5 were implemented for the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>. In this blog I’ll implement them for other agents, using the Elastic APM agents as examples. I assume that steps 1 &amp; 2 outlined above have already been done, ie that the operator is now installed. I will continue using the <code>banana</code> namespace for the examples, so ensure that namespace exists (<code>kubectl create namespace banana</code>). As per part 1, if you use any of the example instrumentation definitions below, you’ll need to substitute <code>my.apm.server.url</code> and <code>my-apm-secret-token</code> with the values appropriate for your collector.</p>
<h2>Using the Elastic Distribution for OpenTelemetry Java</h2>
<p>From version 0.4.0, the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution for OpenTelemetry Java</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define, and as it’s a distribution of the OpenTelemetry Java agent, all the standard OpenTelemetry environment variables apply:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-otel
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: docker.elastic.co/observability/elastic-otel-javaagent:1.9.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION
        value: &quot;50&quot;
</code></pre>
<p>I’ve included environment variables for switching on several features in the agent, including:</p>
<ul>
<li>ELASTIC_OTEL_INFERRED_SPANS_ENABLED to switch on the inferred spans feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION, above which span stack traces are automatically captured (the default is 5ms)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-otel&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans and stack traces</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-stack-trace.png" alt="Elastic APM UI showing methodB traced with stack traces and inferred spans" /></p>
<p>The additions from the features mentioned above are circled in red - inferred spans (for methodC and methodD) bottom left, and the stack trace top right. (Note that the pod included the <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE</code> environment variable set to <code>&quot;test.Testing[methodB]&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using the Elastic APM Java agent</h2>
<p>From version 1.50.0, the <a href="https://github.com/elastic/apm-agent-java">Elastic APM Java agent</a> includes the agent jar at the path /javaagent.jar in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-apm
  namespace: banana
spec:
  java:
    image: docker.elastic.co/observability/apm-agent-java:1.55.4
    env:
      - name: ELASTIC_APM_SERVER_URL
        value: &quot;https://my.apm.server.url&quot;
      - name: ELASTIC_APM_SECRET_TOKEN
        value: &quot;my-apm-secret-token&quot;
      - name: ELASTIC_APM_LOG_LEVEL
        value: &quot;INFO&quot;
      - name: ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_APM_LOG_SENDING
        value: &quot;true&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ul>
<li>ELASTIC_APM_LOG_LEVEL, set here to the default value (INFO), which could easily be switched to DEBUG</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED to switch on the inferred spans implementation, equivalent to the feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_APM_LOG_SENDING, which switches on sending logs to the APM UI; the logs are automatically correlated with transactions (for all common logging frameworks)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-apm&quot;
</code></pre>
<p>... to the pod yaml gets the application traced and displayed in the Elastic APM UI, including the inferred child spans.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-inferred-spans.png" alt="Elastic APM UI showing methodB traced with inferred spans" /></p>
<p>(Note that the pod included the <code>ELASTIC_APM_TRACE_METHODS</code> environment variable set to <code>&quot;test.Testing#methodB&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using an extension with the OpenTelemetry Java agent</h2>
<p>Setting up an Instrumentation resource for the OpenTelemetry Java agent is straightforward and was done in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a> of this two-part series - as you can see from the above examples, it’s just a matter of deciding on the docker image URL you want to use. However, if you want to include an <em>extension</em> in your deployment, this is a little more complex, but also supported by the operator. Basically, the extensions you want to include with the agent need to be in docker images - or you have to build an image which includes any extensions that are not already in images. Then, in the Instrumentation resource, you declare the images and the directories the extensions are in. As an example, I’ll show an Instrumentation which uses version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> together with the <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">inferred spans extension</a> from the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Java distribution</a>. The distro image includes the extension at path <code>/extensions/elastic-otel-agentextension.jar</code>. The Instrumentation resource allows either directories or file paths to be specified; here I’ll list the directory:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-plus-extension-instrumentation
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
</code></pre>
<p>Note that you can have multiple <code>image … dir</code> pairs, i.e., you can include multiple extensions from different images. Note also, if you are testing this specific configuration, that the inferred spans extension included here will be contributed to the OpenTelemetry contrib repo at some point after this blog is published, after which the extension may no longer be present in a later version of the referred image (since it will be available from the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">contrib repo</a> instead).</p>
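<p>For instance, a hypothetical <code>spec.java.extensions</code> list pulling extensions from two different images might look like the following sketch (the second image and its path are placeholders, not real artifacts):</p>
<pre><code>  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
      # hypothetical second extension image
      - image: &quot;example.registry.io/my-extensions:1.0.0&quot;
        dir: &quot;/my-extensions&quot;
</code></pre>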
<h2>Next steps</h2>
<p>Here I’ve shown how to use any agent with the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> and configure it for your system. In particular, the examples have showcased how to use the Elastic Java agents to auto-instrument Java applications running in your Kubernetes clusters, and how to enable their features, using Instrumentation resources. You can set this up either with zero configuration for deployments, or with just one annotation, which is generally the more flexible mechanism (you can have multiple Instrumentation resource definitions, and each deployment can select the appropriate one for its application).</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/blog-header-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-java-agents</guid>
            <pubDate>Thu, 11 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Walking through how to install and enable the OpenTelemetry Operator for Kubernetes to auto-instrument Java applications, with no configuration changes needed for deployments]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> has a number of <a href="https://opentelemetry.io/docs/languages/java/automatic/#setup">ways to install</a> the agent into a Java application. If you are running your Java applications in Kubernetes pods, there is a separate mechanism (which under the hood uses JAVA_TOOL_OPTIONS and other environment variables) to auto-instrument Java applications. This auto-instrumentation can be achieved with zero configuration of the applications and pods!</p>
<p>The mechanism to achieve zero-config auto-instrumentation of Java applications in Kubernetes is via the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>. This operator has many capabilities and the full documentation (and of course source) is available in the project itself. In this blog, I'll walk through installing, setting up and running zero-config auto-instrumentation of Java applications in Kubernetes using the OpenTelemetry Operator.</p>
<h2>Installing the OpenTelemetry Operator<a id="installing-the-opentelemetry-operator"></a></h2>
<p>At the time of writing this blog, the OpenTelemetry Operator requires <code>cert-manager</code> to be installed first, after which the operator itself can be installed. Installing from the web is straightforward. First install <code>cert-manager</code> (the version to install is specified in the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation):</p>
<pre><code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
</code></pre>
<p>Then, when the cert-manager pods are ready (<code>kubectl get pods -n cert-manager</code>) ...</p>
<pre><code>NAMESPACE      NAME                                         READY
cert-manager   cert-manager-67c98b89c8-rnr5s                1/1
cert-manager   cert-manager-cainjector-5c5695d979-q9hxz     1/1
cert-manager   cert-manager-webhook-7f9f8648b9-8gxgs        1/1
</code></pre>
<p>... you can install the OpenTelemetry Operator:</p>
<pre><code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
</code></pre>
<p>You can, of course, use a specific version of the operator instead of <code>latest</code>; here I’ve used <code>latest</code>.</p>
<h2>An Instrumentation resource<a id="an-instrumentation-resource"></a></h2>
<p>Now you need to add just one further Kubernetes resource to enable auto-instrumentation: an <code>Instrumentation</code> resource. I am going to use the <code>banana</code> namespace for my examples, so I have first created that namespace (<code>kubectl create namespace banana</code>). The auto-instrumentation is specified and configured by these Instrumentation resources. Here is a basic one which will allow every Java pod in the <code>banana</code> namespace to be auto-instrumented with version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: banana-instr
  namespace: banana
spec:
  exporter:
    endpoint: &quot;https://my.endpoint&quot;
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer MyAuth&quot;
</code></pre>
<p>Creating this resource (eg with <code>kubectl apply -f banana-instr.yaml</code>, assuming the above yaml was saved in file <code>banana-instr.yaml</code>) makes the <code>banana-instr</code> Instrumentation resource available for use. (Note you will need to change <code>my.endpoint</code> and <code>MyAuth</code> to values appropriate for your collector.) You can use this instrumentation immediately by adding an annotation to any deployment in the <code>banana</code> namespace:</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
</code></pre>
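<p>Note that for a <code>Deployment</code> (rather than a bare pod), the annotation belongs on the <em>pod template</em> metadata, not on the Deployment’s own metadata. A minimal sketch (the app name and image are placeholders):</p>
<pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: banana-deployment
  namespace: banana
spec:
  selector:
    matchLabels:
      app: banana-app
  template:
    metadata:
      labels:
        app: banana-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
    spec:
      containers:
        - name: banana-app
          image: example.registry.io/banana-app:latest
</code></pre>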
<p>The <code>banana-instr</code> Instrumentation resource is not yet set to be applied by <em>default</em> to all pods in the banana namespace. Currently it's zero-config as far as the <em>application</em> is concerned, but it requires an annotation added to a <em>pod or deployment</em>. To make it fully zero-config for <em>all pods</em> in the <code>banana</code> namespace, we need to add that annotation to the namespace itself, i.e., edit the namespace (<code>kubectl edit namespace banana</code>) so it has contents similar to:</p>
<pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: banana
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;banana-instr&quot;
...
</code></pre>
<p>Now we have a namespace that is going to auto-instrument <em>every</em> Java application deployed in the <code>banana</code> namespace with the 2.5.0 <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>!</p>
<h2>Trying it<a id="trying-it"></a></h2>
<p>There is a simple example Java application at <a href="http://docker.elastic.co/demos/apm/k8s-webhook-test">docker.elastic.co/demos/apm/k8s-webhook-test</a> which just repeatedly calls the chain <code>main-&gt;methodA-&gt;methodB-&gt;methodC-&gt;methodD</code> with some sleeps in the calls. Running this (<code>kubectl apply -f banana-app.yaml</code>) using a very basic pod definition:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: banana-app
  namespace: banana
  labels:
    app: banana-app
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: banana-app
      env: 
      - name: OTEL_INSTRUMENTATION_METHODS_INCLUDE
        value: &quot;test.Testing[methodB]&quot;
</code></pre>
<p>results in the app being auto-instrumented with no configuration changes! The resulting app shows up in any APM UI, such as Elastic APM</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/elastic-apm-ui-transaction.png" alt="Elastic APM UI showing methodB traced" /></p>
<p>As you can see, for this example I also added this env var to the pod yaml, <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE=&quot;test.Testing[methodB]&quot;</code> so that there were traces showing from methodB.</p>
<h2>The technology behind the auto-instrumentation<a id="the-technology-behind-the-auto-instrumentation"></a></h2>
<p>To use the auto-instrumentation there is no specific need to understand the underlying mechanisms, but for those of you interested, here’s a quick outline.</p>
<ol>
<li>The <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> installs a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, a standard Kubernetes component.</li>
<li>When deploying, Kubernetes first sends all definitions to the mutating webhook.</li>
<li>If the mutating webhook sees that the conditions for auto-instrumentation are met (i.e.,
<ol>
<li>there is an Instrumentation resource for that namespace and</li>
<li>the correct annotation for that Instrumentation is applied to the definition in some way, either from the definition itself or from the namespace),</li>
</ol>
</li>
<li>then the mutating webhook “mutates” the definition to include the environment defined by the Instrumentation resource.</li>
<li>The environment includes the explicit values defined in the env, as well as some implicit OpenTelemetry values (see the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation for full details).</li>
<li>And most importantly, the operator
<ol>
<li>pulls the image defined in the Instrumentation resource,</li>
<li>extracts the file at the path <code>/javaagent.jar</code> from that image (using shell command <code>cp</code>)</li>
<li>inserts it into the pod at path <code>/otel-auto-instrumentation-java/javaagent.jar</code></li>
<li>and adds the environment variable <code>JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation-java/javaagent.jar</code>.</li>
</ol>
</li>
<li>The JVM automatically picks up that JAVA_TOOL_OPTIONS environment variable on startup and applies it to the JVM command-line.</li>
</ol>
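<p>Put together, the mutated pod spec ends up looking roughly like the following sketch (names abbreviated; the exact init container name, volume name, and defaults the operator generates may differ):</p>
<pre><code>spec:
  initContainers:
    # added by the operator: copies the agent jar into a shared volume
    - name: opentelemetry-auto-instrumentation-java
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
      command: [&quot;cp&quot;, &quot;/javaagent.jar&quot;, &quot;/otel-auto-instrumentation-java/javaagent.jar&quot;]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  containers:
    - name: banana-app
      env:
        # the JVM picks this up on startup
        - name: JAVA_TOOL_OPTIONS
          value: &quot;-javaagent:/otel-auto-instrumentation-java/javaagent.jar&quot;
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  volumes:
    - name: opentelemetry-auto-instrumentation-java
      emptyDir: {}
</code></pre>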
<h2>Next steps<a id="next-steps"></a></h2>
<p>This walkthrough can be repeated in any Kubernetes cluster to demonstrate and experiment with auto-instrumentation (you will need to create the banana namespace first). In part 2 of this two part series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents">Using a custom agent with the OpenTelemetry Operator for Kubernetes</a>, I show how to install any Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Easily analyze AWS VPC Flow Logs with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability</link>
            <guid isPermaLink="false">vpc-flow-logs-monitoring-analytics-observability</guid>
            <pubDate>Mon, 23 Jan 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability can ingest and help analyze AWS VPC Flow Logs from your application’s VPC. Learn how to ingest AWS VPC Flow Logs through a step-by-step method into Elastic, then analyze it and apply OOTB machine learning for insights.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a full-stack observability solution, by supporting metrics, traces, and logs for applications and infrastructure. In <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">a previous blog</a>, I showed you an <a href="https://www.elastic.co/observability/aws-monitoring">AWS monitoring</a> infrastructure running a three-tier application. Specifically we reviewed metrics ingest and analysis on Elastic Observability for EC2, VPC, ELB, and RDS. In this blog, we will cover how to ingest logs from AWS, and more specifically, we will review how to get VPC Flow Logs into Elastic and what you can do with this data.</p>
<p>Logging is an important part of observability, even though metrics and tracing generally come to mind first. However, the volume of logs an application or the underlying infrastructure outputs can be daunting.</p>
<p>With Elastic Observability, there are three main mechanisms to ingest logs:</p>
<ul>
<li>The new Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53, etc.). We reviewed Elastic Agent metrics configuration for EC2, RDS (Aurora), ELB, and NAT metrics in this <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</li>
<li>Using <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s Serverless Forwarder (runs on Lambda and available in AWS SAR)</a> to send logs from Firehose, S3, CloudWatch, and other AWS services into Elastic.</li>
<li>Beta feature (contact your Elastic account team): Using AWS Firehose to directly insert logs from AWS into Elastic — specifically if you are running the Elastic stack on AWS infrastructure.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/Elastic-Observability-VPC-Flow-Logs.jpg" alt="" /></p>
<p>In this blog we will provide an overview of the second option, Elastic’s serverless forwarder collecting VPC Flow Logs from an application deployed on EC2 instances. Here’s what we'll cover:</p>
<ul>
<li>A walk-through on how to analyze VPC Flow Log info with Elastic’s Discover, dashboard, and ML analysis.</li>
<li>A detailed step-by-step overview and setup of the Elastic serverless forwarder on AWS as a pipeline for VPC Flow Logs into <a href="http://cloud.elastic.co">Elastic Cloud</a>.</li>
</ul>
<h2>Elastic’s serverless forwarder on AWS Lambda</h2>
<p>AWS users can quickly ingest logs stored in Amazon S3, CloudWatch, or Kinesis with the Elastic serverless forwarder, an AWS Lambda application, and view them in the Elastic Stack alongside other logs and metrics for centralized analytics. Once the serverless forwarder is configured and deployed from the AWS Serverless Application Repository (SAR), logs will be ingested and available in Elastic for analysis. See the following links for further configuration guidance:</p>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s serverless forwarder (runs on Lambda and available in AWS SAR)</a></li>
<li><a href="https://github.com/elastic/elastic-serverless-forwarder/blob/main/docs/README-AWS.md#s3_config_file">Serverless forwarder GitHub repo</a></li>
</ul>
<p>In our configuration we will ingest VPC Flow Logs into Elastic for the three-tier app deployed in the previous <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</p>
<p>There are three different configurations with the Elastic serverless forwarder:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-3-configurations.png" alt="" /></p>
<p>Logs can be directly ingested from:</p>
<ul>
<li><strong>Amazon CloudWatch:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from an Amazon CloudWatch log group, which is a commonly used endpoint to store VPC Flow Logs in AWS.</li>
<li><strong>Amazon Kinesis:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from Kinesis, which is another location to <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-firehose.html">publish VPC Flow Logs</a>.</li>
<li><strong>Amazon S3:</strong> Elastic serverless forwarder can pull VPC Flow Logs from Amazon S3 via SQS event notifications, which is a common endpoint to publish VPC Flow Logs in AWS.</li>
</ul>
<p>We will review how to utilize a common configuration, which is to send VPC Flow Logs to Amazon S3 and into Elastic Cloud in the second half of this blog.</p>
<p>But first let's review how to analyze VPC Flow Logs on Elastic.</p>
<h2>Analyzing VPC Flow Logs in Elastic</h2>
<p>Now that you have VPC Flow Logs in Elastic Cloud, how can you analyze them?</p>
<p>There are several analyses you can perform on the VPC Flow Log data:</p>
<ol>
<li>Use Elastic’s Analytics Discover capabilities to manually analyze the data.</li>
<li>Use Elastic Observability’s anomaly feature to identify anomalies in the logs.</li>
<li>Use an out-of-the-box (OOTB) dashboard to further analyze data.</li>
</ol>
<h3>Using Elastic Discover</h3>
<p>In Elastic analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard. With Discover, you can:</p>
<ul>
<li>View logs in bulk, within specific time frames</li>
<li>Look at individual details of each entry (document)</li>
<li>Filter for specific values</li>
<li>Analyze fields</li>
<li>Create and save searches</li>
<li>Build visualizations</li>
</ul>
<p>For a complete understanding of Discover and all of Elastic’s analytics capabilities, look at <a href="https://www.elastic.co/guide/en/kibana/current/discover.html#">Elastic documentation</a>.</p>
<p>For VPC Flow Logs, some important stats to understand are:</p>
<ul>
<li>How many logs were accepted/rejected</li>
<li>Where potential security violations occur (for example, source IPs from outside the VPC)</li>
<li>What port is generally being queried</li>
</ul>
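<p>Questions like these can be expressed directly as KQL queries in Discover. For example, the following filter isolates rejected flows aimed at telnet and SMB ports (field names here assume the AWS VPC Flow Logs integration's ECS mapping; adjust them to match your actual index):</p>
<pre><code>aws.vpcflow.action : &quot;REJECT&quot; and destination.port : (23 or 445)
</code></pre>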
<p>I’ve filtered the logs on the following:</p>
<ul>
<li>Amazon S3: bshettisartest</li>
<li>VPC Flow Log action: REJECT</li>
<li>VPC Network Interface: Webserver 1</li>
</ul>
<p>We want to see what IP addresses are trying to hit our web servers.</p>
<p>From that, we want to understand which IP addresses we are getting the most REJECTs from, so we simply find the <strong>source</strong>.ip field. Then, we can quickly get a breakdown showing that 185.242.53.156 has been the most-rejected source for the 3+ hours since we turned on VPC Flow Logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-100-hits.png" alt="" /></p>
<p>Additionally, I can see a visualization by selecting the “Visualize” button. We get the following, which we can add to a dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-add-to-a-dashboard.png" alt="" /></p>
<p>In addition to IP addresses, we also want to see which ports are being hit on our web servers.<br />
We select the destination port field, and the quick pop-up shows us a list of ports being targeted. We can see that port 23 is being targeted (generally used for telnet), port 445 (used for Microsoft Active Directory), and port 443 (used for HTTPS/SSL). We also see that these are all REJECTs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-reject.png" alt="" /></p>
<h3>Anomaly detection in Elastic Observability logs</h3>
<p>In addition to Discover, Elastic Observability provides the ability to detect anomalies in logs. In Elastic Observability -&gt; Logs -&gt; Anomalies, you can turn on machine learning for:</p>
<ul>
<li>Log rate: automatically detects anomalous log entry rates</li>
<li>Categorization: automatically categorizes log messages</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-detection-with-machine-learning.png" alt="" /></p>
<p>For our VPC Flow Log, we turned both on. And when we look at what has been detected for anomalous log entry rates, we see:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomalies.png" alt="" /></p>
<p>Elastic immediately detected a spike in logs when we turned on VPC Flow Logs for our application. The rate change is detected because we had also been ingesting VPC Flow Logs from another application for a couple of days prior to adding the application in this blog.</p>
<p>We can further drill down into this anomaly with machine learning and analyze further.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-explorer.png" alt="" /></p>
<p>There is more machine learning analysis you can utilize with your logs — check out <a href="https://www.elastic.co/guide/en/kibana/8.5/xpack-ml.html">Elastic machine learning documentation</a>.</p>
<p>Since we know that a spike exists, we can also use the Explain Log Rate Spikes capability in Machine Learning’s AIOps Labs. Additionally, we’ve grouped the results to see what is causing some of the spikes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-explain-log-rate-spikes.png" alt="" /></p>
<p>As we can see, a specific network interface is sending more VPC flow logs than others. We can drill down into this further in Discover.</p>
<h3>VPC Flow Log dashboard on Elastic Observability</h3>
<p>Finally, Elastic also provides an OOTB dashboard showing the top IP addresses hitting your VPC, where they are coming from geographically, the time series of the flows, and a summary of VPC Flow Log rejects within the time frame.</p>
<p>This is a baseline dashboard that can be enhanced with visualizations you find in Discover, as we reviewed in option 1 (Using Elastic’s Analytics Discover capabilities) above.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-action-geolocation.png" alt="" /></p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of configuring the Elastic Serverless Forwarder and Elastic Observability to ingest data.</p>
<h3>Prerequisites and config</h3>
<p>If you plan on following these steps, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Ensure you have an AWS account with permissions to pull the necessary data from AWS. Specifically, ensure you can configure the agent to pull data from AWS as needed. <a href="https://docs.elastic.co/integrations/aws#requirements">Please look at the documentation for details</a>.</li>
<li>We used <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three-tier app</a> and installed it as instructed in GitHub. (<a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">See blog on ingesting metrics from the AWS services supporting this app</a>.)</li>
<li>Configure and install Elastic’s Serverless Forwarder.</li>
<li>Ensure you turn on VPC Flow Logs for the VPC where the application is deployed and send the logs to Amazon S3.</li>
</ul>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-start-cloud-trial.png" alt="" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>Once logged in to Elastic Cloud, create a deployment on AWS. It’s important to ensure that the deployment is on AWS, because the Elastic Serverless Forwarder connects to an Elasticsearch endpoint that needs to be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-a-deployment.png" alt="" /></p>
<p>Once your deployment is created, make sure you copy the Elasticsearch endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-logs.png" alt="" /></p>
<p>The endpoint should be an AWS endpoint, such as:</p>
<pre><code class="language-bash">https://aws-logs.es.us-east-1.aws.found.io
</code></pre>
<h3>Step 2: Turn on Elastic’s AWS Integrations on AWS</h3>
<p>In your deployment’s Elastic Integration section, go to the AWS integration and select Install AWS assets.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-settings.png" alt="" /></p>
<h3>Step 3: Deploy your application</h3>
<p>Follow the instructions listed in the <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS three-tier app</a> repository on GitHub. The workshop itself is listed <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/85cd2bb2-7f79-4e96-bdee-8078e469752a/en-US">here</a>.</p>
<p>Once you’ve installed the app, get credentials from AWS. This will be needed for Elastic’s AWS integration.</p>
<p>There are several options for credentials:</p>
<ul>
<li>Use access keys directly</li>
<li>Use temporary security credentials</li>
<li>Use a shared credentials file</li>
<li>Use an IAM role Amazon Resource Name (ARN)</li>
</ul>
<p>View more details on specifics around necessary <a href="https://docs.elastic.co/en/integrations/aws#aws-credentials">credentials</a> and <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">permissions</a>.</p>
<h3>Step 4: Send VPC Flow Logs to Amazon S3 and set up Amazon SQS</h3>
<p>In the VPC for the application deployed in Step 3, configure VPC Flow Logs and point them to an Amazon S3 bucket, keeping the log format as the AWS default.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-flow-log.png" alt="" /></p>
<p>Create the VPC Flow log.</p>
<p>Next:</p>
<ul>
<li><a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-getting-started.html">Set up an Amazon SQS queue</a></li>
<li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html">Configure Amazon S3 event notifications</a></li>
</ul>
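<p>For reference, Step 4 can also be scripted with the AWS CLI. The sketch below is illustrative only: the VPC ID, queue name, region, and account ID are placeholders to substitute with your own values, and the SQS queue’s access policy must additionally allow S3 to send messages to it.</p>
<pre><code class="language-bash"># Create a flow log for the VPC, delivered to S3 in the AWS default format
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::bshettisartest

# Create the SQS queue that will receive S3 event notifications
aws sqs create-queue --queue-name vpc-flow-logs-queue

# Notify the queue whenever a new flow log object lands in the bucket
aws s3api put-bucket-notification-configuration \
  --bucket bshettisartest \
  --notification-configuration '{
    "QueueConfigurations": [{
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'
</code></pre>
<p>The console steps linked above achieve the same result; the CLI form is simply easier to repeat across environments.</p>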
<h3>Step 5: Set up Elastic Serverless Forwarder on AWS</h3>
<p>Follow instructions listed in <a href="https://www.elastic.co/guide/en/observability/8.5/aws-deploy-elastic-serverless-forwarder.html">Elastic’s documentation</a> and refer to the <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">previous blog</a> providing an overview. The important bits during the configuration in Lambda’s application repository are to ensure you:</p>
<ul>
<li>Specify the S3 Bucket in ElasticServerlessForwarderS3Buckets where the VPC Flow Logs are being sent. The value is the ARN of the S3 Bucket you created in Step 4.</li>
<li>Specify the configuration file path in ElasticServerlessForwarderS3ConfigFile. The value is the S3 URL in the format &quot;s3://bucket-name/config-file-name&quot; pointing to the configuration file (sarconfig.yaml).</li>
<li>Specify the S3 SQS Notifications queue used as the trigger of the Lambda function in ElasticServerlessForwarderS3SQSEvents. The value is the ARN of the SQS Queue you set up in Step 4.</li>
</ul>
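<p>As an illustration of the configuration file referenced above, a minimal <code>sarconfig.yaml</code> might look like the sketch below. The field names follow the Elastic Serverless Forwarder documentation; the queue ARN, Elasticsearch endpoint, API key, and data stream name are placeholder values.</p>
<pre><code class="language-bash"># Write a minimal forwarder configuration locally (placeholder values throughout)
cat > sarconfig.yaml <<'EOF'
inputs:
  - type: s3-sqs
    id: arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: https://aws-logs.es.us-east-1.aws.found.io
          api_key: <your-api-key>
          es_datastream_name: logs-aws.vpcflow-default
EOF

# Sanity-check the file before uploading
grep 'type:' sarconfig.yaml
</code></pre>
<p>You would then upload the file to the bucket referenced by <code>ElasticServerlessForwarderS3ConfigFile</code>, for example with <code>aws s3 cp sarconfig.yaml s3://bucket-name/sarconfig.yaml</code>.</p>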
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-application-settings.png" alt="" /></p>
<p>Once AWS CloudFormation finishes setting up the Elastic Serverless Forwarder, you should see two AWS Lambda functions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-functions.png" alt="" /></p>
<p>To check whether logs are coming in, go to the function with <strong>ApplicationElasticServer</strong> in its name, open the <strong>Monitor</strong> tab, and look at the <strong>logs</strong>. You should see the logs being pulled from S3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-function-overview.png" alt="" /></p>
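<p>If you prefer the command line, the same check can be done by tailing the forwarder’s log group. The log group name below is an assumption, so substitute the full name of your function containing <strong>ApplicationElasticServer</strong>.</p>
<pre><code class="language-bash"># Follow the forwarder's logs live (substitute your actual function name)
aws logs tail /aws/lambda/your-ApplicationElasticServer-function-name --follow
</code></pre>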
<h3>Step 6: Check and ensure you have logs in Elastic</h3>
<p>Now that steps 1–5 are complete, you can go to Elastic’s Discover capability, where you should see VPC Flow Logs coming in. In the image below, we’ve filtered by the Amazon S3 bucket <strong>bshettisartest</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-log-dashboard-filter.png" alt="" /></p>
<h2>Conclusion: Elastic Observability easily integrates with VPC Flow Logs for analytics, alerting, and insights</h2>
<p>I hope you’ve gained an appreciation for how Elastic Observability can help you manage AWS VPC Flow Logs. Here’s a quick recap of what you learned:</p>
<ul>
<li>A walk-through of how Elastic Observability provides enhanced analysis for VPC Flow Logs:
<ul>
<li>Using Elastic’s Analytics Discover capabilities to manually analyze the data</li>
<li>Leveraging Elastic Observability’s anomaly features to:
<ul>
<li>Identify anomalies in the VPC flow logs</li>
<li>Detect anomalous log entry rates</li>
<li>Automatically categorize log messages</li>
</ul>
</li>
<li>Using an OOTB dashboard to further analyze data</li>
</ul>
</li>
<li>A more detailed walk-through of how to set up the Elastic Serverless Forwarder</li>
</ul>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Web Frontend Instrumentation and Monitoring with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/web-frontend-instrumentation-with-opentelemetry</link>
            <guid isPermaLink="false">web-frontend-instrumentation-with-opentelemetry</guid>
            <pubDate>Mon, 04 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how frontend instrumentation differs from backend, and the current state of client web instrumentation in OpenTelemetry]]></description>
            <content:encoded><![CDATA[<p>DevOps, SRE and software engineering teams all require telemetry data to understand what's going on across their infrastructure and full-stack applications. Indeed, we have covered instrumentation of backend services in several language ecosystems using <a href="http://opentelemetry.io">OpenTelemetry</a> (OTel) in the past. Yet for frontends, teams often still rely on RUM agents, or sadly on no instrumentation at all, due to the subtle differences in the metrics needed to understand what's going on.</p>
<p>In this blog, we will discuss the current state of client instrumentation for the browser, along with an example showing how to instrument a simple JavaScript frontend using <a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">the OpenTelemetry browser instrumentation</a>. Furthermore, we'll also share how the baggage propagators help us build a full picture of what is going on across the entire application by connecting backend traces with frontend signals. If you want to dive straight into the code, check out the repo <a href="https://github.com/carlyrichmond/otel-record-store">here</a>.</p>
<h2>Application Overview</h2>
<p>The application that we use for this blog is called <a href="https://github.com/carlyrichmond/otel-record-store">OTel Record Store</a>: a simple web application written with Svelte and JavaScript (although our implementation is compatible with other web frameworks), communicating with a Java backend. Both send telemetry signals to an Elastic backend.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/1-otel-frontend-sample-architecture.png" alt="Architecture" /></p>
<p>Eagle-eyed readers will notice that signals from our frontend pass through a proxy and collector. The proxy is required to ensure that the appropriate Cross-Origin headers are populated to allow the signals to pass into Elastic, as well as for traditional reasons such as security, privacy and access control:</p>
<pre><code class="language-nginx">events {}

http {

  server {

    listen 8123; 

    # Traces endpoint exposed as example, others available in code repo
    location /v1/traces {
      proxy_pass http://host.docker.internal:4318;
      # Apply CORS headers to ALL responses, including POST
      add_header 'Access-Control-Allow-Origin' 'http://localhost:4173' always;
      add_header 'Access-Control-Allow-Methods' 'POST, OPTIONS' always;
      add_header 'Access-Control-Allow-Headers' 'Content-Type' always;
      add_header 'Access-Control-Allow-Credentials' 'true' always;

      # Preflight requests receive a 204 No Content response
      if ($request_method = OPTIONS) {
        return 204;
      }
    }
  }
}
</code></pre>
<p>While collectors can also be used to add headers, in this example we have left the collector to perform traditional tasks such as routing and processing.</p>
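<p>To sanity-check the proxy's CORS behaviour before wiring up the browser, you can replay a preflight request by hand with <code>curl</code> while the proxy is running; the origin and port below match the nginx configuration above.</p>
<pre><code class="language-bash"># Simulate a browser preflight; per the config above, the proxy should answer
# 204 No Content with the Access-Control-Allow-* headers attached
curl -i -X OPTIONS http://localhost:8123/v1/traces \
  -H 'Origin: http://localhost:4173' \
  -H 'Access-Control-Request-Method: POST' \
  -H 'Access-Control-Request-Headers: Content-Type'
</code></pre>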
<h2>Prerequisites</h2>
<p>This example requires an Elastic cluster, run either locally via <a href="https://github.com/elastic/start-local">start-local</a>, in Elastic Cloud, or on Elastic Serverless. Here we use the Managed OTLP endpoint in Elastic Serverless. Any mechanism requires you to specify several key environment variables, listed in the <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/.env-example">.env-example file</a>:</p>
<pre><code class="language-zsh">ELASTIC_ENDPOINT=https://my-elastic-endpoint:443
ELASTIC_API_KEY=my-api-key
</code></pre>
<h3>Running the application</h3>
<p>To run our example, follow the steps in the <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/README.md">project README</a>, summarized below:</p>
<pre><code class="language-zsh"># Terminal 1: backend service, proxy and collector
docker-compose build
docker-compose up

# Terminal 2: frontend and sample telemetry data
cd records-ui
npm install
npm run generate
</code></pre>
<h2>Java Backend Instrumentation</h2>
<p>We will not cover the specifics of instrumentation of Java services with EDOT as there is already a great guide to get started <a href="https://github.com/elastic/elastic-otel-java">in the <code>elastic-otel-java</code> README</a>. The example is here purely for showcasing propagation that is important for investigating UI issues. All you need to know is that we make use of automatic instrumentation, sending logs, metrics and traces via <a href="https://opentelemetry.io/docs/specs/otel/protocol/">OpenTelemetry Protocol, or OTLP</a> using the below environment variables:</p>
<pre><code class="language-zsh">OTEL_RESOURCE_ATTRIBUTES=service.version=1,deployment.environment=dev
OTEL_SERVICE_NAME=record-store-server-java
OTEL_EXPORTER_OTLP_ENDPOINT=$ELASTIC_ENDPOINT
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=ApiKey ${ELASTIC_API_KEY}&quot;
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
</code></pre>
<p>The instrumentation is then initialized using the <code>-javaagent</code> option:</p>
<pre><code class="language-dockerfile">ENV JAVA_TOOL_OPTIONS=&quot;-javaagent:./elastic-otel-javaagent-1.2.1.jar&quot;
</code></pre>
<h2>Client Instrumentation</h2>
<p>Now that we have established our prerequisites, let's dive into the instrumentation code for our simple web application. Although we'll cover the implementation in sections, the full solution is available <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/frontend.tracer.ts">here in <code>frontend.tracer.ts</code></a>.</p>
<h3>State of OTel Client Instrumentation</h3>
<p>At time of writing, the <a href="https://opentelemetry.io/docs/languages/js/">OpenTelemetry JavaScript SDK</a> has stable support for metrics and traces, with logs currently under development and therefore subject to breaking changes <a href="https://opentelemetry.io/docs/languages/js/">as listed in their documentation</a>:</p>
<table>
<thead>
<tr>
<th>Traces</th>
<th>Metrics</th>
<th>Logs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Stable</td>
<td>Stable</td>
<td>Development</td>
</tr>
</tbody>
</table>
<p>What differs from many other SDKs is the note warning that client instrumentation for the browser is experimental and mostly unspecified. It is subject to breaking changes, and many pieces such as plugin support for measuring Google Core Web Vitals are in progress, as reflected in the <a href="https://github.com/orgs/open-telemetry/projects/19/views/1">Client Instrumentation SIG project board</a>. In subsequent sections we'll show examples of signal capture, along with browser-specific instrumentations including document load, user interaction and Core Web Vitals capture.</p>
<h3>Resource Definition</h3>
<p>When instrumenting web UIs, we need to establish our UI as an OpenTelemetry <a href="https://opentelemetry.io/docs/languages/js/resources/">Resource</a>. By definition, resources are entities that produce telemetry information. We want to see our UI as an entity in our system that interacts with other entities, which can be specified using the following code:</p>
<pre><code class="language-ts">// Defines a Resource to include metadata like service.name, required by Elastic
import { resourceFromAttributes, detectResources } from '@opentelemetry/resources';

// Experimental detector for browser environment
import { browserDetector } from '@opentelemetry/opentelemetry-browser-detector';

// Provides standard semantic keys for attributes, like service.name
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const detectedResources = detectResources({ detectors: [browserDetector] });
let resource = resourceFromAttributes({
	[ATTR_SERVICE_NAME]: 'records-ui-web',
	'service.version': 1,
	'deployment.environment': 'dev'
});
resource = resource.merge(detectedResources);
</code></pre>
<p>A unique identifier for the service is required, and is common to all SDKs. What differs from other implementations is the inclusion of the <a href="https://www.npmjs.com/package/@opentelemetry/opentelemetry-browser-detector"><code>browserDetector</code></a> which, when merged with our defined resource attributes, adds browser attributes such as platform, brands (e.g. Chrome versus Edge) and whether a mobile browser is being used:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/2-otel-browser-attributes.png" alt="Sample Span JSON with Resource and Browser Attributes" /></p>
<p>Having this information on spans and errors is useful when diagnosing application and dependency compatibility issues with certain browsers (such as Internet Explorer from my time as an engineer 🤦).</p>
<h3>Logs</h3>
<p>Traditionally, frontend engineers rely on the DevTools console of their favourite browser to examine logs. Because UI log messages are only accessible within the browser, rather than forwarded to a file somewhere as is the common pattern with backend services, we lose visibility of this signal when triaging user issues.</p>
<p>OpenTelemetry defines the concept of an <a href="https://opentelemetry.io/docs/concepts/signals/logs/#log-record-exporter">exporter</a>, which allows us to send signals such as logs to a particular destination.</p>
<pre><code class="language-ts">// Get logger and severity constant imports
import { logs, SeverityNumber } from '@opentelemetry/api-logs';

// Provider and batch processor for sending logs
import { BatchLogRecordProcessor, LoggerProvider } from '@opentelemetry/sdk-logs';

// Export logs via OTLP
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';

// Configure logging to send to the collector via nginx
const logExporter = new OTLPLogExporter({
	url: 'http://localhost:8123/v1/logs' // nginx proxy
});

const loggerProvider = new LoggerProvider({
	resource: resource, // see resource initialisation above
	processors: [new BatchLogRecordProcessor(logExporter)]
});

logs.setGlobalLoggerProvider(loggerProvider);
</code></pre>
<p>Once the provider has been initialized, we need to get a hold of the logger to send our log records to Elastic rather than using good ol' <code>console.log('Help!')</code>:</p>
<pre><code class="language-ts">// Example gets logger and sends a message to Elastic
const logger = logs.getLogger('default', '1.0.0');
logger.emit({
	severityNumber: SeverityNumber.INFO,
	severityText: 'INFO',
	body: 'Logger initialized'
});
</code></pre>
<p>Log records will now be visible in the Discover and Logs views, allowing us to search for relevant outages as part of investigations and incidents:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/3-otel-log-discover.png" alt="Sample Logs in Discover" /></p>
<h3>Traces</h3>
<p>The power of traces in diagnosing issues in the UI lies in the visibility of not just what is going on within the web application, but also the connections to, and time taken by, calls to the labyrinth of services behind it. To instrument a web-based application, we need to make use of the <code>WebTracerProvider</code> with the <code>OTLPTraceExporter</code>, in a similar way to how exporters work for logs and metrics:</p>
<pre><code class="language-ts">/* Packages for exporting traces */

// Import the WebTracerProvider, which is the core provider for browser-based tracing
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';

// BatchSpanProcessor forwards spans to the exporter in batches to prevent flooding
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// Import the OTLP HTTP exporter for sending traces to the collector over HTTP
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Configure the OTLP exporter to talk to the collector via nginx
const exporter = new OTLPTraceExporter({
	url: 'http://localhost:8123/v1/traces' // nginx proxy
});

// Instantiate the trace provider and inject the resource
const provider = new WebTracerProvider({
	resource: resource,
	spanProcessors: [
		// Send each completed span through the OTLP exporter
		new BatchSpanProcessor(exporter)
	]
});
</code></pre>
<p>Next we need to register our provider. One thing that's slightly different in the web world is how we configure propagation. <a href="https://opentelemetry.io/docs/concepts/context-propagation/">Context propagation</a> in OpenTelemetry refers to the concept of moving context between services and processes which, in our case, allows us to correlate the web signals with those of backend services. Often this is done automatically. As you will see from the below snippet, there are three concepts that help us with propagation:</p>
<pre><code class="language-ts">// This context manager ensures span context is maintained across async boundaries in the browser
import { ZoneContextManager } from '@opentelemetry/context-zone';

// Context Propagation across signals
import {
	CompositePropagator,
	W3CBaggagePropagator,
	W3CTraceContextPropagator
} from '@opentelemetry/core';

// Provider instantiation code omitted

// Register the provider with propagation and set up the async context manager for spans
provider.register({
	contextManager: new ZoneContextManager(),
	propagator: new CompositePropagator({
		propagators: [new W3CBaggagePropagator(), new W3CTraceContextPropagator()]
	})
});
</code></pre>
<p>The first is the <code>ZoneContextManager</code>, which propagates context such as spans and traces across asynchronous operations. Web developers will be familiar with <a href="https://www.npmjs.com/package/zone.js?activeTab=readme">zone.js</a>, the library used by many JS frameworks to provide an execution context that persists across async tasks.</p>
<p>Additionally, we have combined the <code>W3CBaggagePropagator</code> and <code>W3CTraceContextPropagator</code> using the <code>CompositePropagator</code> to ensure key value pair attributes are passed between signals as per the <a href="https://w3c.github.io/baggage/">W3C specification defined here</a>. In the case of the <code>W3CTraceContextPropagator</code>, it allows the propagation of the <code>traceparent</code> and <code>tracestate</code> HTTP headers as per the <a href="https://www.w3.org/TR/trace-context-2/">specification located here</a>.</p>
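<p>To make the propagation concrete, a <code>traceparent</code> header is simply four dash-separated fields, which the sketch below pulls apart with standard shell tools. The header value is made up for illustration, as real values are generated by the propagator.</p>
<pre><code class="language-bash"># An illustrative traceparent header: version, trace-id, parent-id, trace-flags
traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# Split the header into its four fields
echo "$traceparent" | awk -F- '{ printf "version: %s\ntrace-id: %s\nparent-id: %s\nflags: %s\n", $1, $2, $3, $4 }'
</code></pre>
<p>Every service that honours the W3C specification forwards the same trace-id, which is what lets Elastic stitch the frontend spans and the Java backend spans into one trace.</p>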
<h4>Auto Instrumentation</h4>
<p>The simplest way to start instrumenting a web application is to register the web auto-instrumentations. At time of writing <a href="https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/packages/auto-instrumentations-web#readme">the documentation</a> states that the following instrumentations can be configured via this approach:</p>
<ol>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-document-load">@opentelemetry/instrumentation-document-load</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-fetch">@opentelemetry/instrumentation-fetch</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-user-interaction">@opentelemetry/instrumentation-user-interaction</a></li>
<li><a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-xml-http-request">@opentelemetry/instrumentation-xml-http-request</a></li>
</ol>
<p>Configuration for each instrumentation can be passed to <code>getWebAutoInstrumentations</code>, as shown in the below example configuring the fetch and XMLHttpRequest instrumentations:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Import the auto-instrumentations for web, which includes common libraries, frameworks and document load
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';

// Enable automatic instrumentation for fetch and XMLHttpRequest calls
registerInstrumentations({
  instrumentations: [
    getWebAutoInstrumentations({
      '@opentelemetry/instrumentation-fetch': {
        propagateTraceHeaderCorsUrls: /.*/,
        clearTimingResources: true
      },
      '@opentelemetry/instrumentation-xml-http-request': {
        propagateTraceHeaderCorsUrls: /.*/
      }
    })
  ]
});
</code></pre>
<p>Taking the <code>@opentelemetry/instrumentation-fetch</code> instrumentation as an example, we are able to see traces for HTTP requests, and the propagators also ensure that the spans connect with our Java backend services to give a full picture of the time taken to process the request at each stage:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/4-otel-http-get-sample-trace.png" alt="Sample HTTP GET Trace" /></p>
<p>While auto-instrumentation is a great way to register common instrumentations, we can also instantiate instrumentations directly, as we'll see in the remainder of this article.</p>
<h4>Document Load Instrumentation</h4>
<p>Another consideration unique to web frontends is the time taken to load assets such as images, JavaScript files and even stylesheets. Assets that take considerable time to load can impact metrics such as <a href="https://web.dev/articles/fcp">First Contentful Paint</a>, and therefore the user experience. The <a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-document-load">OTel Document Load instrumentation</a> automatically instruments the time taken to load assets when using the <a href="https://www.npmjs.com/package/@opentelemetry/sdk-trace-web">@opentelemetry/sdk-trace-web</a> package.</p>
<p>It is simply a case of adding the instrumentation to the <code>instrumentations</code> array passed to <code>registerInstrumentations</code>:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations like page load and user interaction
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Document Load Instrumentation automatically creates spans for document load events
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';

// Configuration discussed above omitted

// Enable automatic span generation for document load and user click interactions
registerInstrumentations({
  instrumentations: [
    // Automatically tracks when the document loads
    new DocumentLoadInstrumentation({
      ignoreNetworkEvents: false,
      ignorePerformancePaintEvents: false
    }),
    // Other instrumentations omitted
  ]
});
</code></pre>
<p>This configuration will create a new trace conveniently named <code>documentLoad</code>, which will show us the time taken to load resources within the document, similar to the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/5-otel-document-load-example-trace.png" alt="Sample documentLoad Trace" /></p>
<p>Each span will have metadata attached to help us identify which resources are taking considerable time to load, such as this image example, where the resource takes <strong>837ms</strong> to load:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/6-otel-document-load-http-url-metadata.png" alt="documentLoad Trace Metadata" /></p>
<h4>Click Events</h4>
<p>You may wonder why we want to capture user interactions with web applications for diagnostic purposes. Being able to see the trigger points for errors can be useful in incidents to establish a timeline of what happened, and to determine whether users are indeed being impacted, as is the case with Real User Monitoring tools. But if we also consider the field of Digital Experience Monitoring, or DEM, software teams need details on the usage of application features to understand the user journey and how it could be improved in a data-driven way. Capturing user events is required for both.</p>
<p>The <a href="https://www.npmjs.com/package/@opentelemetry/instrumentation-user-interaction">OTel UserInteraction instrumentation for web</a> is how we capture these events. Like the document load instrumentation, it depends on the <a href="https://www.npmjs.com/package/@opentelemetry/sdk-trace-web">@opentelemetry/sdk-trace-web</a> package, and when used with <code>zone.js</code> and the <code>ZoneContextManager</code> it also supports async operations.</p>
<p>Like other instrumentations it is added via <code>registerInstrumentations</code>:</p>
<pre><code class="language-ts">// Used to auto-register built-in instrumentations like page load and user interaction
import { registerInstrumentations } from '@opentelemetry/instrumentation';

// Automatically creates spans for user interactions like clicks
import { UserInteractionInstrumentation } from '@opentelemetry/instrumentation-user-interaction';

// Configuration discussed above omitted

// Enable automatic span generation for document load and user click interactions
registerInstrumentations({
  instrumentations: [
    // User events
    new UserInteractionInstrumentation({
      eventNames: ['click', 'input'] // instrument click and input events only
    }),
    // Other instrumentations omitted
  ]
});
</code></pre>
<p>It will capture and label spans for the user events we configure and, leveraging the propagators configured previously, can connect spans from other resources to the user event, as in the below example where we see the service call to get records when the user adds a search term to the <code>input</code> box:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/7-otel-user-interaction-input-sample-trace.png" alt="User Interaction input Sample Trace" /></p>
<h3>Metrics</h3>
<p>There are numerous measurements that serve as useful indicators of the availability and performance of web applications, such as latency, throughput or the number of 404 errors. <a href="https://developers.google.com/search/docs/appearance/core-web-vitals">Google Core Web Vitals</a> are a set of standard metrics used by web developers to measure the real-world user experience of web sites, including loading performance, reactivity to user input and visual stability. Given that, at the time of writing, <a href="https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1461">the Core Web Vitals plugin for OTel Browser is on the backlog</a>, let's try building our own custom instrumentation using the <a href="https://www.npmjs.com/package/web-vitals">web-vitals JS library</a> to capture these as <a href="https://opentelemetry.io/docs/concepts/signals/metrics/">OTel metrics</a>.</p>
<p>In OpenTelemetry you can create your own custom instrumentation by extending the <code>InstrumentationBase</code>, overriding the <code>constructor</code> to create the <code>MeterProvider</code>, <code>Meter</code> and <code>OTLPMetricExporter</code> that will allow us to send our Core Web Vital measurements to Elastic via our proxy, as presented in <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/web-vitals.instrumentation.ts"><code>web-vitals.instrumentation.ts</code></a>. Note that below we show only the LCP meter for succinctness, but the full example <a href="https://github.com/carlyrichmond/otel-record-store/blob/main/records-ui/src/lib/telemetry/web-vitals.instrumentation.ts">here</a> measures all web vitals.</p>
<pre><code class="language-ts">/* OpenTelemetry JS packages */
// Instrumentation base to create a custom Instrumentation for our provider
import {
	InstrumentationBase,
	type InstrumentationConfig,
	type InstrumentationModuleDefinition
} from '@opentelemetry/instrumentation';

// Metrics API
import {
	metrics,
	type ObservableGauge,
	type Meter,
	type Attributes,
	type ObservableResult
} from '@opentelemetry/api';

// Metrics SDK provider and reader, plus the OTLP metric exporter (imports added)
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

// Resource type used in the constructor
import type { Resource } from '@opentelemetry/resources';

export class WebVitalsInstrumentation extends InstrumentationBase {

  // Meter captures measurements at runtime
	private cwvMeter: Meter;

	/* Core Web Vitals Measures, LCP provided, others omitted */
	private lcp: ObservableGauge;

	constructor(config: InstrumentationConfig, resource: Resource) {
		super('WebVitalsInstrumentation', '1.0', config);

    // Create metric reader to process metrics and export using OTLP
		const metricReader = new PeriodicExportingMetricReader({
			exporter: new OTLPMetricExporter({
				url: 'http://localhost:8123/v1/metrics' // nginx proxy
			}),
			// Default is 60000ms (60 seconds).
			// Set to 10 seconds for demo purposes only.
			exportIntervalMillis: 10000
		});

    // Creating Meter Provider factory to send metrics
		const myServiceMeterProvider = new MeterProvider({
			resource: resource,
			readers: [metricReader]
		});
		metrics.setGlobalMeterProvider(myServiceMeterProvider);

    // Create web vitals meter
		this.cwvMeter = metrics.getMeter('core-web-vitals', '1.0.0');

		// Initialising CWV metric gauge instruments (LCP given as example, others omitted here)
		this.lcp = this.cwvMeter.createObservableGauge('lcp', { unit: 'ms', description: 'Largest Contentful Paint' });
	}

	protected init(): InstrumentationModuleDefinition | InstrumentationModuleDefinition[] | void {}

  // Other steps discussed later
}
</code></pre>
<p>You'll notice in our LCP example we have created an <code>ObservableGauge</code> to capture the value at the time it is read via a callback function. This can be set up when we <code>enable</code> our custom instrumentation, specifying that when the LCP event is triggered, the value is sent via <code>result.observe</code>:</p>
<pre><code class="language-ts">/* Web Vitals Frontend package, LCP shown as example*/
import { onLCP, type LCPMetric, type CLSMetric, type INPMetric, type TTFBMetric, type FCPMetric } from 'web-vitals';

/* OpenTelemetry JS packages */
// Instrumentation base to create a custom Instrumentation for our provider
import {
	InstrumentationBase,
	type InstrumentationConfig,
	type InstrumentationModuleDefinition
} from '@opentelemetry/instrumentation';

// Metrics API
import {
	metrics,
	type ObservableGauge,
	type Meter,
	type Attributes,
	type ObservableResult
} from '@opentelemetry/api';
 
// Other OTel Metrics imports omitted

// Time calculator via performance component
import { hrTime } from '@opentelemetry/core';

type CWVMetric = LCPMetric | CLSMetric | INPMetric | TTFBMetric | FCPMetric;

export class WebVitalsInstrumentation extends InstrumentationBase {

	/* Core Web Vitals Measures */
	private lcp: ObservableGauge;

	// Constructor and Initialization omitted

	enable() {
		// Capture Largest Contentful Paint, other vitals omitted
		onLCP(
			(metric) =&gt; {
				this.lcp.addCallback((result) =&gt; {
					this.sendMetric(metric, result);
				});
			},
			{ reportAllChanges: true }
		);
	}

  // Callback utility to add attributes and send captured metric
	private sendMetric(metric: CWVMetric, result: ObservableResult&lt;Attributes&gt;): void {
		const now = hrTime();

		const attributes = {
			startTime: now,
			'web_vital.name': metric.name,
			'web_vital.id': metric.id,
			'web_vital.navigationType': metric.navigationType,
			'web_vital.delta': metric.delta,
			'web_vital.value': metric.value,
			'web_vital.rating': metric.rating,
			// metric specific attributes
			'web_vital.entries': JSON.stringify(metric.entries)
		};

		result.observe(metric.value, attributes);
	}
}
</code></pre>
<p>To use our custom instrumentation, we register it just as we did in <code>frontend.tracer.ts</code> for the available web instrumentations that capture document-load and user-event telemetry:</p>
<pre><code class="language-ts">registerInstrumentations({
  instrumentations: [
    // Other web instrumentations omitted
    // Custom Web Vitals instrumentation
    new WebVitalsInstrumentation({}, resource)
    ]
});
</code></pre>
<p>The <code>lcp</code> metric, along with the attributes we specified in our <code>sendMetric</code> function, will be sent to our Elastic cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/8-otel-metric-elastic-discover-view.png" alt="LCP Metric in Discover" /></p>
<p>These metrics will not feed into the <a href="https://www.elastic.co/docs/solutions/observability/applications/user-experience">User Experience dashboard</a>, which is populated by the Elastic RUM agent, but we can create a dashboard leveraging the values to show the trends of each of our vitals:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/9-otel-core-web-vitals-dashboard.png" alt="Sample Core Web Vitals Dashboard" /></p>
<h2>Summary</h2>
<p>In this blog, we presented the current state of client instrumentation for the browser, along with an example showing how to instrument a simple JavaScript frontend using <a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">the OpenTelemetry browser instrumentation</a>. To reflect back on the code, check out the repo <a href="https://github.com/carlyrichmond/otel-record-store">here</a>. If you have any questions or want to learn from other developers, connect with the <a href="https://www.elastic.co/community">Elastic Community</a>.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://github.com/carlyrichmond/otel-record-store">OTel Record Store Application</a></li>
<li><a href="https://opentelemetry.io/docs/languages/js/getting-started/browser/">JavaScript Browser Instrumentation</a></li>
</ul>
</blockquote>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/web-frontend-instrumentation-with-opentelemetry/web-blog-header.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Unlocking whole-system visibility with Elastic Universal Profiling™]]></title>
            <link>https://www.elastic.co/observability-labs/blog/whole-system-visibility-elastic-universal-profiling</link>
            <guid isPermaLink="false">whole-system-visibility-elastic-universal-profiling</guid>
            <pubDate>Mon, 25 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Visual profiling data can be overwhelming. This blog post aims to demystify continuous profiling and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling™.]]></description>
            <content:encoded><![CDATA[<h2>Identify, optimize, measure, repeat!</h2>
<p>SREs and developers who want to maintain robust, efficient systems need effective tools to measure and improve code performance. Profilers are invaluable for these tasks, as they can help you boost your app's throughput, ensure consistent system reliability, and gain a deeper understanding of your code's behavior at runtime. However, traditional profilers can be cumbersome to use, as they often require code recompilation and are limited to specific languages. They also tend to have a high overhead that negatively affects performance and makes them less suitable for quick, real-time debugging in production environments.</p>
<p>To address the limitations of traditional profilers, Elastic<sup>®</sup> recently <a href="https://www.elastic.co/blog/continuous-profiling-is-generally-available">announced the general availability of Elastic Universal Profiling</a>, a <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> product that is refreshingly straightforward to use, eliminating the need for instrumentation, recompilations, or restarts. Moreover, Elastic Universal Profiling does not require on-host debug symbols and is language-agnostic, allowing you to profile any process running on your machines — from your application's code to third-party libraries and even kernel functions.</p>
<p>However, even the most advanced tools require a certain level of expertise to interpret the data effectively. The wealth of visual profiling data — flamegraphs, stacktraces, or functions — can initially seem overwhelming. This blog post aims to demystify <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> and guide you through its unique visualizations. We will equip you with the knowledge to derive quick, actionable insights from Universal Profiling.</p>
<p>Let’s begin.</p>
<h2>Stacktraces: The cornerstone for profiling</h2>
<h3>It all begins with a stacktrace — a snapshot capturing the cascade of function calls.</h3>
<p>A stacktrace is a snapshot of the call stack of an application at a specific point in time. It captures the sequence of function calls that the program has made up to that point. In this way, a stacktrace serves as a historical record of the call stack, allowing you to trace back the steps that led to a particular state in your application.</p>
<p>Further, stacktraces are the foundational data structure that profilers rely on to determine what an application is executing at any given moment. This is particularly useful when, for instance, your infrastructure monitoring indicates that your application servers are consuming 95% of CPU resources. While utilities such as 'top -H' can show the top processes that are consuming CPU, they lack the granularity needed to identify the specific lines of code (in the top process) responsible for the high usage.</p>
<p>In the case of Elastic Universal Profiling, <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF is used</a> to perform sampling of every process that is keeping a CPU core busy. Unlike most instrumentation profilers that focus solely on your application code, Elastic Universal Profiling provides whole-system visibility — it profiles not just your code, but also code you don't own, including third-party libraries and even kernel operations.</p>
<p>The diagram below shows how the Universal Profiling agent works at a very high level. Step 5 indicates the ingestion of the stacktraces into the profiling collector, a new part of the Elastic Stack.</p>
<p>_ <strong>Just</strong> _ <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">_ <strong>deploy the profiling host agent</strong> _</a> <em><strong>and receive profiling data (in Kibana</strong></em>&lt;sup&gt;&lt;em&gt;®&lt;/em&gt;&lt;/sup&gt;<em><strong>) a few minutes later.</strong></em> <a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">_ <strong>Get started now</strong> _</a>_ <strong>.</strong> _</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-1-flowchart-linux.png" alt="High-level depiction of how the profiling agent works" /></p>
<ol>
<li>
<p>Unwinder eBPF programs (bytecode) are sent to the kernel.</p>
</li>
<li>
<p>The kernel verifies that the BPF program is safe. If accepted, the program is attached to the probes and executed when the event occurs.</p>
</li>
<li>
<p>The eBPF programs pass the collected data to userspace via maps.</p>
</li>
<li>
<p>The agent reads the collected data from maps. The data transferred from the agent to the maps are process-specific and interpreter-specific meta-information that help the eBPF unwinder programs perform unwinding.</p>
</li>
<li>
<p>Stacktraces, metrics, and metadata are pushed to the Elastic Stack.</p>
</li>
<li>
<p>Visualize data as flamegraphs, stacktraces, and functions via Kibana.</p>
</li>
</ol>
<p>While stacktraces are the key ingredient for most profiling tools, interpreting them can be tricky. Let's take a look at a simple example to make things a bit easier. The table below shows a group of stacktraces from a Java application and assigns each a percentage to indicate its share of CPU time consumption.</p>
<p><strong>Table 1: Grouped Stacktraces with CPU Time Percentage</strong></p>
<table>
<thead>
<tr>
<th>Percentage</th>
<th>Function Calls</th>
</tr>
</thead>
<tbody>
<tr>
<td>60%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction</td>
</tr>
<tr>
<td>20%</td>
<td>startApp -&gt; loadAccountDetails -&gt; fetchRecentTransactions</td>
</tr>
<tr>
<td>10%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; verifyFunds</td>
</tr>
<tr>
<td>2%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so</td>
</tr>
<tr>
<td>1%</td>
<td>startApp -&gt; authenticateUser -&gt; processTransaction -&gt; libjvm.so -&gt; vmlinux: asm_common_interrupt -&gt; vmlinux: asm_sysvec_apic_timer_interrupt</td>
</tr>
</tbody>
</table>
<p>The percentages above represent the relative frequency of each specific stacktrace compared to the total number of stacktraces collected over the observation period, not actual CPU usage percentages. Also, the libjvm.so and kernel frames (vmlinux:*) in the example are commonly observed with whole-system profilers like Elastic Universal Profiling.</p>
<p>From the table, we can see that <strong>60%</strong> of the time is spent in the sequence startApp; authenticateUser; processTransaction. An additional <strong>10%</strong> of the processing time is allocated to verifyFunds, a function invoked by processTransaction. Given these observations, it becomes evident that optimization initiatives would yield the most impact if centered on the processTransaction function, as it is one of the most expensive functions. However, real-world stacktraces can be far more intricate than this example. So how do we make sense of them quickly? The answer to this problem led to the creation of flamegraphs.</p>
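<p>The grouping behind Table 1 can be sketched in a few lines of TypeScript. This is an illustrative model only, not the profiler's implementation; the sample counts and function names are chosen to mirror the table:</p>

```typescript
// Illustrative sketch: how raw stacktrace samples are grouped into the
// relative-frequency table above. Counts roughly mirror Table 1.
type Stacktrace = string[];

const samples: Stacktrace[] = [
	...Array(6).fill(['startApp', 'authenticateUser', 'processTransaction']),
	...Array(2).fill(['startApp', 'loadAccountDetails', 'fetchRecentTransactions']),
	['startApp', 'authenticateUser', 'processTransaction', 'verifyFunds'],
	['startApp', 'authenticateUser', 'processTransaction', 'libjvm.so']
];

// Group identical stacks, then express each group as a share of all samples
function aggregate(traces: Stacktrace[]): Map<string, number> {
	const counts = new Map<string, number>();
	for (const trace of traces) {
		const key = trace.join(' -> ');
		counts.set(key, (counts.get(key) ?? 0) + 1);
	}
	const shares = new Map<string, number>();
	for (const [key, count] of counts) {
		shares.set(key, (count / traces.length) * 100);
	}
	return shares;
}

console.log(aggregate(samples).get('startApp -> authenticateUser -> processTransaction')); // 60
```

Each stacktrace's percentage is simply its frequency relative to the total number of samples, which is why the figures indicate relative, not absolute, CPU usage.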
<h2>Flamegraphs: A visualization of stacktraces</h2>
<p>While the above example may appear straightforward, it scarcely reflects the complexities encountered when aggregating multiple stacktraces across a fleet of machines on a continuous basis. The depth of the stack traces and the numerous branching paths can make it increasingly difficult to pinpoint where code is consuming resources. This is where flamegraphs, a concept popularized by <a href="https://www.brendangregg.com/flamegraphs.html">Brendan Gregg</a>, come into play.</p>
<p>A flamegraph is a visual interpretation of stacktraces, designed to quickly and accurately identify the functions that are consuming the most resources. Each function is represented by a rectangle, where the width of the rectangle represents the amount of time spent in the function, and the number of stacked rectangles represents the stack depth. The stack depth is the number of functions that were called to reach the current function.</p>
<p>Elastic Universal Profiling uses icicle graphs, an inverted variant of the standard flamegraph. In an icicle graph, the root function is at the top, and its child functions are shown below their parents, making it easier to see the hierarchy of functions and how they relate to each other.</p>
<p>In most flamegraphs, the y-axis represents stack depth, but there is no standardization for the x-axis. Some profiling tools use the x-axis to indicate the passage of time; in these instances, the graph is more accurately termed a flame chart. Others sort the x-axis alphabetically. Universal Profiling sorts functions on the x-axis based on relative CPU percentage utilization, starting with the function that consumes the most CPU time on the left, as shown in the example icicle graph below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-2-cpu-time.png" alt="Example icicle graph: The percentage represents relative CPU time, not the real CPU usage time. " /></p>
<h2>Debugging and optimizing performance issues: Stacktraces, TopN functions, flamegraphs</h2>
<p>SREs and SWEs can use Universal Profiling for troubleshooting, debugging, and performance optimization. It builds stacktraces that go from the kernel, through userspace native code, all the way into code running in higher-level runtimes, enabling you to <strong>identify performance regressions</strong>, <strong>reduce wasteful computations</strong>, and <strong>debug complex issues faster</strong>.</p>
<p>To this end, Universal Profiling offers three main visualizations: Stacktraces, TopN Functions, and flamegraphs.</p>
<h3>Stacktrace view</h3>
<p>The stacktraces view shows grouped stacktrace graphs by threads, hosts, Kubernetes deployments, and containers. It can be used to detect unexpected CPU spikes across threads and drill down into a smaller time range to investigate further with a flamegraph. Refer to the <a href="https://www.elastic.co/guide/en/observability/current/universal-profiling.html#profiling-stacktraces-intro">documentation</a> for details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-3-wave-patterns.png" alt="Notice the wave pattern in the stacktrace view, enabling you to drill down into a CPU spike " /></p>
<h3>TopN functions view</h3>
<p>Universal Profiling's topN functions view shows the most frequently sampled functions, broken down by CPU time, annualized CO<sub>2</sub>, and annualized cost estimates. You can use this view to identify the most expensive functions across your entire fleet, and then apply filters to focus on specific components for a more detailed analysis. Clicking on a function name will redirect you to the flamegraph, enabling you to examine the call hierarchy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-4-topN-functions-page.png" alt="TopN functions page" /></p>
<h3>Flamegraphs view</h3>
<p>The flamegraph page is where you will likely spend the most time, especially when debugging and optimizing. We recommend using the guide below to identify performance bottlenecks and optimization opportunities with flamegraphs. The three key properties to look for are <strong>width</strong>, <strong>hierarchy</strong>, and <strong>height</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-5-icivle-flamegraph.png" alt="Icicle flamegraph: We use the colors to determine different types of code (e.g., native, interpreted, kernel)." /></p>
<p><strong>Width matters:</strong> In icicle graphs, wider rectangles signify functions taking up more CPU time. Always read the graph from left to right and note the widest rectangles, as these are the prime hot spots.</p>
<p><strong>Hierarchy matters:</strong> Navigate the graph's stack to understand function relationships. This vertical examination will help you identify whether one or multiple functions are responsible for performance bottlenecks. This could also uncover opportunities for code improvements, such as swapping an inefficient library or avoiding unnecessary I/O operations.</p>
<p><strong>Height matters:</strong> Elevated or tall stacks in the graph usually point to deep call hierarchies. These can be an indicator of complex and less efficient code structures that may require attention.</p>
<p>Also, when navigating a flamegraph, you may want to search for specific function names to validate your assumptions about their presence. In the Universal Profiling flamegraphs view, there is a “Search” bar at the bottom left corner. You can input a regex, and matches will be highlighted in the flamegraph; by clicking the left and right arrows next to the Search bar, you can move across the occurrences and spot the callers and callees of the matched function.</p>
<p>In summary,</p>
<ul>
<li><strong>Scan</strong> horizontally from left to right, focusing on width for CPU-intensive functions.</li>
<li><strong>Examine</strong> vertically to examine the stack and spot bottlenecks.</li>
<li><strong>Look</strong> for <strong>towering stacks</strong> to identify potential complexities in the code.</li>
</ul>
<p>To recap, use topN functions to generate optimization hypotheses and validate them with stacktraces and/or flamegraphs. Use stacktraces to monitor CPU utilization trends and to delve into the finer details. Use flamegraphs to quickly debug and optimize your code, using width, hierarchy, and height as guides.</p>
<p><strong><em>Identify. Optimize. Measure. Repeat!</em></strong></p>
<h2>Measure the impact of your change</h2>
<h3>For the very first time in history, developers can now measure the performance (gained or lost), cloud cost, and carbon footprint impact of every deployed change.</h3>
<p>Once you have identified a performance issue and applied fixes or optimizations to your code, it is essential to measure the impact of your changes. The differential topN functions and differential flamegraph pages are invaluable for this, as they can help you identify regressions and measure your change impact not only in terms of performance but also in terms of carbon emissions and cost savings.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-6-uni-profiling.png" alt="A differential function view, showing the performance, CO2, and cost impact of a change" /></p>
<p>The Diff column indicates a change in the function’s rank.</p>
<p>You may need to use tags or other metadata, such as container and deployment name, in combination with time ranges to differentiate between the optimized and non-optimized changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/elastic-blog-7-differential-flamegraph.png" alt="A differential flamegraph showing regression in A/B testing" /></p>
<h2>Universal Profiling: The key to optimizing application resources</h2>
<p>Computational efficiency is no longer just a nice-to-have, but a must-have from both a financial and environmental sustainability perspective. Elastic Universal Profiling provides unprecedented visibility into the runtime behavior of all your applications, so you can identify and optimize the most resource-intensive areas of your code. The result is not merely better-performing software but also reduced resource consumption, lower cloud costs, and a reduction in carbon footprint. Optimizing your code with Universal Profiling is not only the right thing to do for your business, it’s the right thing to do for our world.</p>
<p><a href="https://www.elastic.co/guide/en/observability/current/profiling-get-started.html">Get started</a> with Elastic Universal Profiling today.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/whole-system-visibility-elastic-universal-profiling/universal-profiling-blog-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Windows Event Log Monitoring with OpenTelemetry & Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/windows-event-monitoring-with-opentelemetry-and-elastic-streams</link>
            <guid isPermaLink="false">windows-event-monitoring-with-opentelemetry-and-elastic-streams</guid>
            <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to enhance Windows Event Log monitoring with OpenTelemetry for standardized ingestion and Elastic Streams for smart partitioning and analysis.]]></description>
            <content:encoded><![CDATA[<p>For system administrators and SREs, Windows Event Logs are both a goldmine and a graveyard. They contain the critical data needed to diagnose the root cause of a server crash or a security breach, but they are often buried under gigabytes of noise. Traditionally, extracting value from these logs required brittle regex parsers, manual rule creation, and a significant amount of human intuition.</p>
<p>However, the landscape of log management is shifting. By combining the industry-standard ingestion of OpenTelemetry (OTel) with the AI-driven capabilities of Elastic Streams, we can change how we monitor Windows infrastructure. This approach isn't just about moving data; it also uses Large Language Models (LLMs) to understand it.</p>
<h2>The Challenge with Traditional Windows Logging</h2>
<p>Windows generates a massive variety of logs: System, Security, Application, Setup, and Forwarded Events. Within those categories, you have thousands of Event IDs. Historically, getting this data into an observability platform involved installing proprietary agents and configuring complex pipelines to strip out the XML headers and format the messages.</p>
<p>Once the data was ingested, you still had to figure out what &quot;bad&quot; looked like. You had to know in advance that Event ID 7031 indicated a service crash, and then write a specific alert for it. If you missed a specific Event ID, or if the format changed, your monitoring went dark.</p>
<h2>Step 1: Ingestion via OpenTelemetry</h2>
<p>The first step in modernizing this workflow is adopting OpenTelemetry. The OTel collector has matured significantly and now offers robust support for Windows environments. By installing the collector directly on Windows servers, you can configure receivers to tap into the event log subsystems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/otel-config.png" alt="OTel collector configuration for Windows Event Logs" /></p>
<p>The beauty of this approach is standardization. You aren't locked into a vendor-specific shipping agent. The OTel collector acts as a universal router, grabbing the logs and sending them to your observability backend; in this case, the Elastic <code>logs</code> index, which is designed to handle high-throughput streams.</p>
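<p>As a starting point, a minimal collector configuration for this setup might look like the following. This is a sketch only, assuming the otelcol-contrib distribution (which bundles the <code>windowseventlog</code> receiver); the channels, endpoint, and API key environment variable are placeholders you would replace with your own:</p>

```yaml
# Sketch only: assumes the otelcol-contrib distribution, which includes
# the windowseventlog receiver. Endpoint and API key are placeholders.
receivers:
  windowseventlog/system:
    channel: system
  windowseventlog/application:
    channel: application

exporters:
  otlphttp:
    endpoint: "https://my-deployment.ingest.elastic.cloud:443" # placeholder
    headers:
      Authorization: "ApiKey ${env:ELASTIC_API_KEY}"

service:
  pipelines:
    logs:
      receivers: [windowseventlog/system, windowseventlog/application]
      exporters: [otlphttp]
```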
<p>The key thing to pay attention to in this configuration is how we add this transform statement:</p>
<pre><code class="language-yaml">transform/logs-streams:
  log_statements:
    - context: resource
      statements:
        - set(attributes[&quot;elasticsearch.index&quot;], &quot;logs&quot;)
</code></pre>
<p>This works with the vanilla OpenTelemetry Collector. When the data arrives in Elastic, this attribute tells Elastic to use the new wired streams feature, which enables all the downstream AI capabilities we discuss in the later steps.</p>
<p>Check out my example configuration <a href="https://github.com/davidgeorgehope/otel-collector-windows/blob/main/config.yaml">here</a>.</p>
<h2>Step 2: AI-Driven Partitioning</h2>
<p>Once the data arrives, the next challenge is organization. Dumping all Windows logs into a single <code>logs-*</code> index is a recipe for slow queries and confusion. In the past, we split indices based on hardcoded fields. Now, we can use AI to &quot;fingerprint&quot; the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/ai-partitioning.png" alt="AI-driven partitioning of Windows logs" /></p>
<p>This process involves analyzing the incoming stream to identify patterns. The system looks at the structure and content of the logs to determine their origin. For example, it can distinguish between a <code>Windows Security Audit</code> log and a <code>Service Control Manager</code> log purely based on the data shape.</p>
<p>The result is automatic partitioning. The system creates separate, optimized &quot;buckets&quot; or streams for each data type. You get a clean separation of concerns: Security logs go to one stream and File Manager logs to another, without writing a single conditional routing rule. This partitioning is crucial for performance and for the next phase of the process: analysis.</p>
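<p>To make the idea concrete, the routing outcome can be modeled with a simple grouping by event provider. This is an illustrative sketch only; Streams fingerprints logs with AI rather than a hardcoded key, and the event shape and stream names below are assumptions, not the product schema:</p>

```typescript
// Illustrative sketch: models the *outcome* of partitioning, not the AI
// fingerprinting itself. Event shape and stream names are assumptions.
interface WindowsEvent {
	provider: string; // e.g. 'Service Control Manager'
	eventId: number;
	message: string;
}

function partitionByProvider(events: WindowsEvent[]): Map<string, WindowsEvent[]> {
	const streams = new Map<string, WindowsEvent[]>();
	for (const event of events) {
		// Derive a child stream name from the event's origin
		const key = `logs.${event.provider.toLowerCase().replace(/\s+/g, '-')}`;
		if (!streams.has(key)) streams.set(key, []);
		streams.get(key)!.push(event);
	}
	return streams;
}

const partitioned = partitionByProvider([
	{ provider: 'Service Control Manager', eventId: 7031, message: 'The service terminated unexpectedly.' },
	{ provider: 'Microsoft-Windows-Security-Auditing', eventId: 4625, message: 'An account failed to log on.' }
]);
console.log([...partitioned.keys()]);
// [ 'logs.service-control-manager', 'logs.microsoft-windows-security-auditing' ]
```

The real feature needs no such rule: it infers the grouping key from the shape and content of the data, which is what makes the routing automatic.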
<h2>Step 3: Significant Events and LLM Analysis</h2>
<p>Once your data is partitioned (e.g., into a dedicated <code>Service Control Manager</code> stream), you can apply GenAI models to analyze the semantic meaning of that stream.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/llm-analysis.png" alt="LLM analysis of log streams" /></p>
<p>In a traditional setup, the system sees text strings. In an AI-driven setup, the system understands context. When an LLM analyzes the <code>Service Control Manager</code> stream, it identifies what that system is responsible for. It knows that this specific component manages the starting and stopping of system services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/significant-events-suggestions.png" alt="Significant events suggestions from AI" /></p>
<p>Because the model understands the <em>purpose</em> of the log stream, it can generate suggestions for what constitutes a &quot;Significant Event.&quot; It doesn't need you to tell it to look for crashes; it knows that for a Service Manager, a crash is a critical failure.</p>
<h3>From Passive Storage to Proactive Suggestions</h3>
<p>The workflow effectively automates the creation of detection rules. The LLM scans the logs and generates a list of potential problems relevant to that specific dataset, such as:</p>
<ul>
<li><strong>Service Crashes:</strong> High severity anomalies where background processes terminate unexpectedly.</li>
<li><strong>Startup/Boot Failures:</strong> Critical errors preventing the OS from reaching a stable state.</li>
<li><strong>Permission Denials:</strong> Security-relevant events regarding service interactions.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/significant-events-list.png" alt="List of significant events detected" /></p>
<p>It bubbles these up as suggested observations. You can review a list of potential issues, see the severity the AI has assigned to them (e.g., Critical, Warning), and with a single click, generate the query required to find those logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/query-generation.png" alt="Auto-generated query for significant events" /></p>
<h2>Conclusion</h2>
<p>The combination of OpenTelemetry for standardized ingestion and AI-driven Streams for analysis turns the chaotic flood of Windows logs into a structured, actionable intelligence source. We are moving away from the era of &quot;log everything, look at nothing&quot; to an era where our tools understand our infrastructure as well as we do.</p>
<p>The barrier to effective monitoring is no longer technical complexity. Whether you are tracking security audits or debugging boot loops, leveraging LLMs to partition and analyze your streams is the new standard for observability.</p>
<p><a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Try Streams today</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/windows-event-monitoring-with-opentelemetry-and-elastic-streams/ai-partitioning.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>