7.1 Tasks: Setup custom alerting rules

Task 7.1.1: Add alerting rules

The Prometheus Operator allows you to configure Alerting Rules (PrometheusRules). This enables OpenShift users to configure and maintain alerting rules for their projects. Furthermore, a PrometheusRule can be treated like any other Kubernetes resource, which lets you manage alerting rules with tools such as Helm or Kustomize. A PrometheusRule has the following form:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <resource-name>
spec:
  <rule definition>

See the Prometheus alerting rules documentation for the format of <rule definition>

Example:

To add an alerting rule, create a PrometheusRule resource named training_testrules.yaml in the monitoring folder of your CAASI Team Config Repository.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: testrules
spec:
  groups:
    - name: testrulesgroup
      rules:
        - alert: kubePodCrashLooping # this will be the mail subject, enter whichever text you want
          expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace="<team>-monitoring"}[5m]) * 60 * 5 > 0
          for: 15m
          annotations:
            message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
          labels:
            severity: info
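Normally your config repository pipeline applies this resource for you. If you want to apply it directly for testing, you can do so with oc (the monitoring/ path and the <team>-monitoring namespace are assumptions; adjust them to your setup):

```shell
# Apply the rule directly for testing (usually done by the config repo pipeline)
oc -n <team>-monitoring apply -f monitoring/training_testrules.yaml
```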

This fires an alert every time the following query has been matching for at least 15 minutes (the for: duration):

rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace="<team>-monitoring"}[5m]) * 60 * 5 > 0

You can build and verify your query in the Thanos Querier UI. As soon as you apply the PrometheusRule resource, you should be able to see the alert in your Thanos Ruler instance.
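To check that the rule was picked up, you can list the PrometheusRule resources in your namespace and inspect the applied definition (the testrules name matches the example above; <team> is a placeholder):

```shell
# List PrometheusRule resources in the team namespace
oc -n <team>-monitoring get prometheusrules

# Inspect the rule definition that was applied
oc -n <team>-monitoring get prometheusrule testrules -o yaml
```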

Task 7.1.2: Send a test alert

In this task you can use the amtool command to send a test alert.

To send a test alert with the labels alertname=Up and node=bar, execute the following commands.

team=<team>
oc -n $team-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool alert add --alertmanager.url=http://localhost:9093 alertname=Up node=bar

Check in the Alertmanager web UI if you see the test alert with the correct labels set.
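You can also verify the alert from the CLI: amtool's alert query subcommand lists active alerts, optionally filtered by matchers. For example, inside the Alertmanager pod:

```shell
# List active alerts matching the test alert's name
amtool alert query alertname=Up --alertmanager.url=http://localhost:9093
```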

Task 7.1.3: Show the routing tree

Show routing tree:

team=<team>
oc -n $team-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool config routes --config.file /etc/alertmanager/config/alertmanager.yaml

Depending on the configured receivers your output might vary.

The routing tree of the monitoring stack in the config-caasi01-monitoring namespace is more complex than the one of the examples-monitoring namespace:

Namespace config-caasi01-monitoring:

$ oc -n config-caasi01-monitoring exec -it sts/alertmanager-alertmanager -- amtool config routes --config.file /etc/alertmanager/config/alertmanager.yaml
Routing tree:
.
└── default-route  receiver: default
    ├── {severity=~"^(?:critical|warning)$"}  continue: true  receiver: mail-critical
    ├── {alertname=~"^(?:DeadMansSwitch)$"}  receiver: deadmanswitch
    ├── {env="prod",severity="critical"}  receiver: teams-critical-prod
    ├── {env="prod",severity="warning"}  receiver: teams-warning-prod
    ├── {env="prod"}  receiver: teams-info-prod
    ├── {env!="prod",severity="critical"}  receiver: teams-critical-nonprod
    ├── {env!="prod",severity="warning"}  receiver: teams-warning-nonprod
    ├── {env!="prod",severity="info"}  receiver: teams-info-nonprod
    └── {env!="prod"}  receiver: teams-warning-prod

Namespace examples-monitoring:

$ oc -n examples-monitoring exec -it sts/alertmanager-alertmanager -- amtool config routes --config.file /etc/alertmanager/config/alertmanager.yaml
Routing tree:
.
└── default-route  receiver: default

Task 7.1.4: Silencing alerts

Sometimes the sheer number of alerts can be overwhelming, or you’re working on fixing an issue that triggers an alert. Or you’re simply testing something that fires alerts.

In such cases alert silencing can be very helpful.

Let’s now silence our test alert.

Open the Alertmanager web UI and search for the test alert.

You can either silence the specific alert by clicking the Silence button next to the alert, or create a new silence by clicking the New Silence button in the top menu on the right. Either way, you’ll end up on the same form. The button next to the alert conveniently fills out the matchers, so that the alert will be affected by the new silence.

  • Click the Silence button next to the test alert.
  • Make sure the matchers contain the two labels (alertname="Up", node="bar") of the test alert.
  • Set the duration to 1h.
  • Add your username to the creator form field.
  • Fill out the description with the reason you’re creating a silence.

You can then use the Preview Alerts button to check your matchers and create the silence by clicking Create.

Alert Silencing

All alerts that match the defined matchers will be silenced for the configured duration.

Go back to the Alerts page: the silenced alert has disappeared and is only visible when the Silenced checkbox is checked.

The top menu entry Silences shows a list of the created silences. Silences can also be created programmatically using the API or amtool (amtool silence --help).

The following commands do exactly what you just did via the web UI:

team=<team>
oc -n $team-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool silence add alertname=Up node=bar --author="<username>" --comment="I'm testing the silences" --alertmanager.url=http://localhost:9093
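Silences created this way can be listed and removed with amtool as well. The silence ID printed by silence add (or shown by silence query) is needed to expire it:

```shell
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence by its ID (replace <silence-id> with an ID from the output above)
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093
```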

Task 7.1.5: Test your alert receivers

Add a test alert and check if your defined target mailbox receives the mail. This can take up to 5 minutes, as alerts are grouped together based on the group_interval setting.

Alerting Mail

team=<team>
oc -n $team-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool alert add --alertmanager.url=http://localhost:9093 env=dev severity=critical

Example:

oc -n examples-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool alert add --alertmanager.url=http://localhost:9093 alert=test severity=critical

It is also advisable to validate the routing configuration against a test dataset to avoid unintended changes. With the --verify.receivers option the expected receiver can be specified:

team=<team>
oc -n $team-monitoring  exec -it sts/alertmanager-alertmanager -- sh
amtool config routes test --config.file /etc/alertmanager/config/alertmanager.yaml --verify.receivers=mail-critical env=dev severity=info
default
WARNING: Expected receivers did not match resolved receivers.
team=<team>
oc -n $team-monitoring exec -it sts/alertmanager-alertmanager -- sh
amtool config routes test --config.file /etc/alertmanager/config/alertmanager.yaml --verify.receivers=mail-critical env=prod severity=critical