[AWS] Introduce initial alert rule templates #15346

gpop63 · 2025-09-16T14:09:25Z

Overview

This PR introduces the first set of alert rule templates for key AWS data streams. For each stream, we selected the two most critical metrics to monitor.

`ec2_metrics`

High CPU Utilization
- This alarm is used to detect high CPU utilization.
Status Check failed
- This alarm is used to detect the underlying problems with instances, including both system status check failures and instance status check failures.

`lambda`

High Number of Throttles
- The alarm helps detect a high number of throttled invocation requests for a Lambda function.
High Number of Errors
- The alarm helps detect high error counts in function invocations.

`sqs`

Oldest Message Age is Too High
- This alarm is used to detect whether the age of the oldest message in the QueueName queue is too high. The threshold depends on the situation.
High Number of Visible Messages
- This alarm is used to detect whether the message count of the active queue is too high, which means consumers are slow to process the messages or there are not enough consumers to process them. The threshold depends on the situation.

`sns`

Any Message Delivery Fails
- This alarm helps you proactively find issues with the delivery of notifications and take appropriate actions to address them. The threshold depends on the situation.
Number of Notifications Filtered Out - Invalid Attributes
- The alarm is used to detect if the published messages are not valid or if inappropriate filters have been applied to a subscriber.

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.
I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

[ ]

How to test this PR locally

Related issues

Closes elastic/obs-integration-team/issues/536

Screenshots

cla-checker-service · 2025-09-16T14:09:30Z

💚 CLA has been signed

packages/aws/changelog.yml

ishleenk17 · 2025-09-17T04:50:19Z

@gpop63: Will the template be usable only from 9.2 onwards?
If yes, let's mark the PR as DON'T MERGE.

Can you please share a screenshot of what a particular alert looks like? Also, are we not adding any information about alert support to the READMEs?

muthu-mps · 2025-09-17T06:16:28Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

+  "type": "alerting_rule_template",
+  "attributes": {
+    "name": "EC2 High CPU Utilization",
+    "tags": [],


Can we add tags?

What tags were you thinking of, Muthu?

The tags can include the service name and the alert metric name, similar to what I added for Azure AI Foundry.
e.g., [AWS EC2, AWS EC2 CPU Utilization].

muthu-mps · 2025-09-17T06:18:15Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

@@ -0,0 +1,37 @@
+{
+  "id": "b6513de4-6c36-499a-8f0a-98431cd4dbee",


Should the id match the file name of the rule_template?
Error: defines non-matching ID
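
A mismatch like this can be caught locally before CI. The following is a minimal sketch, assuming the expected `id` is the file name without its `.json` extension (the exact rule is not spelled out in this thread). `find_mismatched_ids` is a hypothetical helper, not part of any Elastic tooling.

```python
import json
from pathlib import Path


def find_mismatched_ids(template_dir):
    """Return (file name, id) pairs where the id differs from the file stem.

    Assumption: the expected id is the file name without ".json".
    """
    mismatches = []
    for path in sorted(Path(template_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            template = json.load(handle)
        if template.get("id") != path.stem:
            mismatches.append((path.name, template.get("id")))
    return mismatches
```

Pointing it at `packages/aws/kibana/alerting_rule_template` would list every template whose `id` still needs to be aligned with its file name.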

gpop63 · 2025-09-17T10:51:56Z

@ishleenk17 Right now the support is not fully there; we only see them under assets and in saved objects.

muthu-mps · 2025-09-18T10:08:02Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

+      "groupBy": "all",
+      "termSize": 5,
+      "sourceFields": [],
+      "timeField": "event.ingested",


Can the time field be @timestamp? Is there a reason for choosing event.ingested instead of @timestamp?

I tried using the @timestamp field but it wasn't generating alerts. For some AWS data streams, @timestamp is the time the metric was actually recorded in AWS rather than the time it was ingested.
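
To illustrate why the choice of time field matters, here is a small sketch of a lookback-window filter. It is not Kibana's evaluation logic, and the timestamps and the 8-minute delay are made up for the example.

```python
from datetime import datetime, timedelta


def docs_in_window(docs, time_field, now, window):
    """Return the documents whose `time_field` falls in (now - window, now]."""
    start = now - window
    return [doc for doc in docs if start < doc[time_field] <= now]


# Hypothetical document: the metric describes 10:00, but it was only
# collected and indexed at 10:08 because of reporting and polling delay.
doc = {
    "@timestamp": datetime(2025, 9, 18, 10, 0),
    "event.ingested": datetime(2025, 9, 18, 10, 8),
}

now = datetime(2025, 9, 18, 10, 10)
window = timedelta(minutes=5)

# Filtering on @timestamp misses the document; event.ingested catches it.
matched_by_timestamp = docs_in_window([doc], "@timestamp", now, window)
matched_by_ingested = docs_in_window([doc], "event.ingested", now, window)
```

With a 5-minute window evaluated at 10:10, the document falls outside the window on `@timestamp` but inside it on `event.ingested`, which would be consistent with the missing alerts described above.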

muthu-mps · 2025-09-18T10:08:30Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

+        "esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"
+      },
+      "aggType": "count",
+      "groupBy": "all",


Is this groupBy not applicable when using an ES|QL query?

The grouping of the actual data happens in the ES|QL query itself; this groupBy is a separate property that the alert has to have.
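
For readers less familiar with ES|QL, the grouping and filtering done by the query can be mimicked in plain Python. This is only an illustration of what `STATS ... BY` followed by `WHERE` computes. The row keys and sample values are invented, and `breaching_groups` is a hypothetical helper.

```python
from collections import defaultdict


def breaching_groups(rows, threshold=80.0):
    """Average CPU per (account, region, instance) and keep groups >= threshold."""
    samples = defaultdict(list)
    for row in rows:
        key = (row["account"], row["region"], row["instance"])
        samples[key].append(row["cpu"])
    # Equivalent of: STATS cpuutilization = avg(...) BY account, region, instance
    averages = {key: sum(values) / len(values) for key, values in samples.items()}
    # Equivalent of: WHERE cpuutilization >= 80
    return {key: avg for key, avg in averages.items() if avg >= threshold}


rows = [
    {"account": "111", "region": "us-east-1", "instance": "i-a", "cpu": 90.0},
    {"account": "111", "region": "us-east-1", "instance": "i-a", "cpu": 80.0},
    {"account": "111", "region": "us-east-1", "instance": "i-b", "cpu": 20.0},
]
result = breaching_groups(rows)
```

Only the `i-a` group survives, with an average of 85.0, so each remaining row identifies one instance that breaches the threshold.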

muthu-mps · 2025-09-18T10:09:49Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

+      "thresholdComparator": ">",
+      "size": 100,
+      "esqlQuery": {
+        "esql": "FROM metrics-aws.ec2_metrics-default\n| STATS cpuutilization=avg(aws.ec2.metrics.CPUUtilization.avg) by cloud.account.id, cloud.region, aws.dimensions.InstanceId\n| WHERE cpuutilization >= 80"


Applying a dataset filter would help fetch only the data relevant to the alerting metric. WDYT?

How do we do that? Also, this ES|QL query already targets documents from a specific data stream/index (metrics-aws.ec2_metrics-default).

We can ignore this, as the query directly targets a specific data stream.

muthu-mps · 2025-09-18T10:29:25Z

packages/aws/kibana/alerting_rule_template/ec2-high-cpu-utilization.json

+      "searchType": "esqlQuery",
+      "timeWindowSize": 5,
+      "timeWindowUnit": "m",
+      "threshold": [


Similar to groupBy: check whether the threshold value is applied directly from the ES|QL query and not from here.

The threshold is set in the ES|QL query; this is a different property of the alert.

Co-authored-by: Dan Kortschak <dan.kortschak@elastic.co>

elastic-vault-github-plugin-prod · 2025-09-22T10:50:29Z

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

elastic-sonarqube · 2025-09-25T14:53:47Z

Quality Gate failed

Failed conditions
0.7% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube

daniela-elastic · 2025-09-30T16:21:21Z

packages/aws/kibana/alerting_rule_template/aws-ec2-high-cpu-utilization.json

Where do we declare which service (entity) this alert template applies to? Something like resource : aws.ec2

I have included the service name in the name of the alert rule template. I suppose Kibana should allow us to filter by tags or by partial matches on the title of the alert rule template.


agithomas · 2025-11-05T09:50:33Z

packages/aws/kibana/alerting_rule_template/aws-sqs-messages-visible.json

+    "schedule": {
+      "interval": "1m"
+    },


This applies to all the configurations.

Should we run this so frequently? I suggest setting it equal to the default period value for metrics ingestion. That helps avoid no-data alerts (if the user decides to extend the configuration).

Should we set timeWindowSize to match the integration period? That way, for example, every 5 minutes we’d check for alerts in documents from the past 5 minutes.

I think that's a reasonable thing to do. The impact, I assume, is that instead of an alert being notified at the period + 1m interval, it will be notified at the 2 x period interval. Here the period is 5m for most AWS services.

@tommyers-elastic, what would be your recommendation?

i don't think we have any way to couple configs in agent policy templates with these rule configurations, so whatever we choose will have to be always added by hand.

my only thinking here is that it doesn't make sense to run a rule more frequently than the integration collection period. matching the rule frequency with the collection period seems sensible to me.

it's a shame there's no way to put hints in the form such that we could have something that shows up and says "should match the integration collection period" or something. if we think it's worthwhile we could suggest this as a feature.
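
Since the values have to be kept in sync by hand, one way to reason about them is to derive both from a single period. This is a sketch only: `rule_timing` is a hypothetical helper, and where these keys sit inside the template JSON is not shown in this thread.

```python
def rule_timing(period_minutes):
    """Derive the rule's check interval and lookback window from one period."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return {
        # How often the rule runs.
        "schedule": {"interval": f"{period_minutes}m"},
        # How far back each run looks.
        "timeWindowSize": period_minutes,
        "timeWindowUnit": "m",
    }


# 5 minutes is the collection period mentioned above for most AWS services.
timing = rule_timing(5)
```

With a 5-minute period, the rule would run every 5 minutes over the previous 5 minutes of documents, which matches the suggestion in this thread.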


muthu-mps · 2025-11-06T09:23:49Z

packages/aws/manifest.yml

    subscription: basic
  kibana:
-    version: "^8.19.0 || ^9.1.0"
+    version: "^9.2.1"


@elastic/security-service-integrations team, this feature is supported starting from the 9.2.1 release, which raises the minimum stack version to 9.2.1. Since the AWS integrations are co-owned, could you confirm whether the stack version upgrade is fine for the integrations managed by the security team?

As discussed elsewhere, I think this version constraint should be left unchanged. The alerting rule template files will be built into the package, and they will be installed and used on stack versions that support them.
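
To make the impact of the two constraints concrete, here is a minimal sketch of caret-range matching. It assumes the usual semver caret semantics for major versions above zero (`^X.Y.Z` means `>=X.Y.Z` and `<(X+1).0.0`) and ignores pre-release tags. `parse` and `satisfies` are hypothetical helpers, not the library Kibana or Fleet actually uses.

```python
def parse(version):
    """Split "9.2.1" into the tuple (9, 2, 1)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies(version, constraint):
    """Check a version against caret ranges joined by "||"."""
    candidate = parse(version)
    for alternative in constraint.split("||"):
        lower = parse(alternative.strip().lstrip("^"))
        # Assumption: major > 0, so the caret allows anything below the next major.
        upper = (lower[0] + 1, 0, 0)
        if lower <= candidate < upper:
            return True
    return False


OLD_CONSTRAINT = "^8.19.0 || ^9.1.0"
NEW_CONSTRAINT = "^9.2.1"
```

Under these assumptions, `satisfies("8.19.3", OLD_CONSTRAINT)` holds while `satisfies("8.19.3", NEW_CONSTRAINT)` does not, which is the reason for asking before raising the minimum.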

gpop63 · 2025-11-07T10:04:03Z

/test

muthu-mps · 2025-11-11T04:25:27Z

/test

agithomas · 2025-11-11T04:50:29Z

Sharing a suggestion here:

Could we follow a more structured comment style, helping users easily identify the purpose, default value, condition, and group-by information, and making suggestions easier? If followed, these descriptions could be combined with other platform capabilities, including the AI assistant (in future, if and when needed).

Examples:

1. 
// Alert triggers when the maximum number of visible messages in an SQS queue
// reaches or exceeds the defined threshold (default: 1000) within the lookback window.
//
// The alert is grouped by cloud account, region, and queue name to identify
// which specific queue is experiencing backlog.
//
// To adjust sensitivity, change the `msgsvisible` threshold value in the WHERE clause.

2.
// Alert triggers when SNS notifications are being filtered out by subscription filter policies.
// A non-zero value usually indicates a mismatch between published messages and subscriber
// filter rules, which may result in messages not reaching intended consumers.
//
// The alert is grouped by cloud account, region, and topic name to identify the affected topic.
//
// To adjust sensitivity, change the `notificationsfilteredout` threshold in the WHERE clause.
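
As a sketch of how the first example could sit on top of a query and be embedded in a template, the snippet below builds the commented ES|QL string and serializes it as a JSON fragment. The data stream name, the metric field, and the queue-name dimension are assumptions for illustration and are not taken from the templates in this PR. It is also worth confirming that the target stack accepts leading `//` comments in the rule's query.

```python
import json

# Structured header following the purpose / grouping / tuning layout.
HEADER = "\n".join([
    "// Alert triggers when the maximum number of visible messages in an SQS queue",
    "// reaches or exceeds the defined threshold (default: 1000) within the lookback window.",
    "//",
    "// The alert is grouped by cloud account, region, and queue name to identify",
    "// which specific queue is experiencing backlog.",
    "//",
    "// To adjust sensitivity, change the `msgsvisible` threshold value in the WHERE clause.",
])

# Assumed names, not taken from the templates in this PR:
# the data stream, the metric field, and the queue-name dimension.
QUERY_BODY = "\n".join([
    "FROM metrics-aws.sqs-default",
    "| STATS msgsvisible = max(aws.sqs.messages.visible) "
    "BY cloud.account.id, cloud.region, aws.dimensions.QueueName",
    "| WHERE msgsvisible >= 1000",
])


def esql_fragment():
    """Return the JSON fragment that would hold the commented query."""
    return json.dumps({"esql": HEADER + "\n" + QUERY_BODY}, indent=2)
```

Serializing through `json.dumps` takes care of escaping the newlines, so the comment block stays readable in source form and valid inside the template.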

gpop63 · 2025-11-11T09:56:30Z

/test

muthu-mps · 2025-11-12T10:01:06Z

@agithomas - Apart from version dependency, Can you help with the review and approval if everything looks good?

agithomas

@agithomas - Apart from version dependency, Can you help with the review and approval if everything looks good?

LGTM from the alerts configuration side. Looking forward to reaching a common agreement on the version dependency before proceeding.

muthu-mps · 2025-11-21T04:52:47Z

/test

muthu-mps · 2025-11-24T09:30:05Z

/test


elasticmachine · 2025-12-23T16:56:13Z

💚 Build Succeeded

Buildkite Build
Commit: be02468

History

💚 Build #35809 succeeded adb5f1f
💚 Build #35502 succeeded 28d6878
💔 Build #34545 failed edb21dc
💔 Build #34448 failed d4820d0
💔 Build #34289 failed d4820d0
💔 Build #34228 failed d721a61

cc @gpop63

elastic-vault-github-plugin-prod · 2025-12-26T14:42:13Z

Package aws - 5.4.0 containing this change is available at https://epr.elastic.co/package/aws/5.4.0/

gpop63 requested review from a team as code owners September 16, 2025 14:09

gpop63 added 2 commits September 16, 2025 17:12

add alert rule templates

db8282f

bump package version

bbb5db6

gpop63 force-pushed the add_aws_alert_rule_templates branch from ef16f46 to bbb5db6 Compare September 16, 2025 14:12

gpop63 self-assigned this Sep 16, 2025

gpop63 added Integration:aws AWS enhancement New feature or request labels Sep 16, 2025

andrewkroh added the Team:obs-ds-hosted-services Observability Hosted Services team [elastic/obs-ds-hosted-services] label Sep 16, 2025

efd6 reviewed Sep 16, 2025

muthu-mps changed the title ~~[AWS] Introduce initial alert rule templates~~ [AWS] Introduce initial alert rule templates - DO NOT MERGE Sep 17, 2025

muthu-mps reviewed Sep 17, 2025

fix ids

94ffdb6

muthu-mps requested a review from agithomas September 18, 2025 05:52

muthu-mps reviewed Sep 18, 2025

gpop63 and others added 2 commits September 22, 2025 12:17

Apply suggestion from @efd6

6ccc000

Co-authored-by: Dan Kortschak <dan.kortschak@elastic.co>

Merge branch 'main' into add_aws_alert_rule_templates

14419f7

Merge branch 'main' into add_aws_alert_rule_templates

74fd7dc

muthu-mps changed the title ~~[AWS] Introduce initial alert rule templates - DO NOT MERGE~~ [AWS] Introduce initial alert rule templates Sep 26, 2025

daniela-elastic reviewed Sep 30, 2025

gpop63 added 2 commits October 3, 2025 12:50

add tags

665711c

Merge remote-tracking branch 'upstream/main' into add_aws_alert_rule_…

0b5ec5a

…templates

agithomas reviewed Nov 5, 2025

Merge remote-tracking branch 'upstream/main' into add_aws_alert_rule_…

6f3758b

…templates

muthu-mps marked this pull request as ready for review November 5, 2025 10:38

gpop63 added 3 commits November 6, 2025 11:03

bump kibana version

84f98b5

improve time window and interval

5a38d36

bump kibana version

9fefffa

muthu-mps reviewed Nov 6, 2025

Merge branch 'main' into add_aws_alert_rule_templates

ca349a8

muthu-mps requested a review from efd6 November 11, 2025 06:10

gpop63 added 2 commits November 12, 2025 11:46

improve comments

20c3d90

improve comments

25c3b79

agithomas approved these changes Nov 12, 2025

muthu-mps and others added 3 commits November 13, 2025 14:49

Merge branch 'main' into add_aws_alert_rule_templates

de094d9

update stack and format version

d721a61

fix ids

d4820d0

muthu-mps and others added 5 commits November 24, 2025 15:10

Merge branch 'main' into add_aws_alert_rule_templates

adec810

bump kibana version

edb21dc

Merge branch 'main' into add_aws_alert_rule_templates

28d6878

Merge remote-tracking branch 'upstream/main' into add_aws_alert_rule_…

adb5f1f

…templates

Merge remote-tracking branch 'upstream/main' into add_aws_alert_rule_…

be02468

…templates

gpop63 merged commit 5246e51 into elastic:main Dec 26, 2025
8 checks passed
