2.3 Tasks: PromQL

In this lab you are going to learn a bit more about PromQL (Prometheus Query Language).

PromQL is the query language that allows you to select, aggregate, and filter the time series data collected by Prometheus in real time.

Task 2.3.1: Explore Examples

In this first task you are going to explore some querying examples.

Get all time series with the metric prometheus_http_requests_total.

prometheus_http_requests_total

The result represents the time series for the HTTP requests sent to your Prometheus server, returned as an instant vector.

Get all time series with the metric prometheus_http_requests_total and the given code and handler labels.

Additionally select your monitoring namespace using the namespace label.

prometheus_http_requests_total{code="200", handler="/api/v1/targets", namespace="<team>-monitoring"}

The result will show you the time series for the HTTP requests sent to the targets endpoint of your Prometheus server which were successful (HTTP status code 200).

Get a whole range of time (5 minutes) for the same vector, making it a range vector:

prometheus_http_requests_total{code="200", handler="/api/v1/targets", namespace="<team>-monitoring"}[5m]

A range vector cannot be graphed directly in the Prometheus UI; use the table view to display the result.

With regular expressions you can filter time series only for handlers whose name matches a certain pattern, in this case all handlers starting with /api:

prometheus_http_requests_total{handler=~"/api.*", namespace="<team>-monitoring"}

All regular expressions in Prometheus use the RE2 syntax. To select all HTTP status codes except 2xx, you would execute:

prometheus_http_requests_total{code!~"2..", namespace="<team>-monitoring"}

Task 2.3.2: Sum Aggregation Operator

The Prometheus aggregation operators help us aggregate time series in PromQL.

There is a Prometheus metric that represents all samples scraped by Prometheus. Let’s sum up the values it returns.

Hints

The metric scrape_samples_scraped represents the number of scraped samples per job and instance. To get the total number of scraped samples, we use the Prometheus aggregation operator sum to add up the values.

Additionally select your Prometheus instance using the prometheus label. Replace <team>-monitoring/prometheus with the monitoring name you defined earlier in lab 01.

sum(scrape_samples_scraped{prometheus="<team>-monitoring/prometheus"})
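A minimal Python sketch of what sum() does here. The series keys and values are invented for illustration, not real scrape output:

```python
# Hypothetical per-series values of scrape_samples_scraped,
# keyed by (job, instance) -- the numbers are made up.
scrape_samples_scraped = {
    ("prometheus", "prometheus-0:9090"): 512,
    ("node-exporter", "node-a:9100"): 931,
    ("node-exporter", "node-b:9100"): 874,
}

# sum() without a "by" clause collapses all series into a single value.
total = sum(scrape_samples_scraped.values())
print(total)  # 2317
```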

Task 2.3.3: Rate Function

Use the rate() function to display the current CPU idle usage per core of the server in %, based on the data of the last 5 minutes.

Hints

The CPU metrics are collected and exposed by the node_exporter; therefore, the metric we’re looking for is under the node namespace.

node_cpu_seconds_total

To get the idle CPU seconds, we add the label filter {mode="idle"}.

Since the rate function calculates the per-second average increase of the time series in a range vector, we have to pass a range vector to the function.

To get the idle usage in %, we therefore have to multiply the result by 100.

rate(
  node_cpu_seconds_total{mode="idle",instance="prometheus-training.balgroupit.com:9100"}[5m]
  )
* 100
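The idea behind rate() can be sketched in Python. This is a simplification with invented counter samples; the real function also extrapolates to the edges of the window:

```python
# Simplified sketch of rate(): per-second average increase of a counter
# over a time window, accounting for counter resets (value drops).
def rate(samples, window_seconds):
    """samples: (timestamp, value) pairs inside the range window."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a counter reset the value drops towards 0; the new value
        # itself is then counted as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    return increase / window_seconds

# Invented idle-CPU counter values, scraped every 60s over 5 minutes:
samples = [(0, 100.0), (60, 159.0), (120, 218.0),
           (180, 277.0), (240, 336.0), (300, 395.0)]
print(rate(samples, 300) * 100)  # ~98.3 -> the core was ~98% idle
```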

Task 2.3.4: Arithmetic Binary Operator

In the previous task, we created a query that returns the CPU idle usage. Now let’s reuse that query to build one that returns the current CPU usage per core of the server in %. The usage is the total (100%) minus the idle usage.

Hints

To get the CPU usage we can simply subtract the idle CPU fraction from 1 (100%) and then multiply the result by 100 to get a percentage.

(
  1 -
  rate(
      node_cpu_seconds_total{mode="idle",instance="prometheus-training.balgroupit.com:9100"}[5m]
      )
)
* 100

Task 2.3.5: How much free memory

Arithmetic binary operators cannot only be used with constant values such as 1; they can also operate on other instant vectors.

Write a query that returns how much of the memory is free in %.

The node exporter exposes these two metrics:

  • node_memory_MemTotal_bytes
  • node_memory_MemAvailable_bytes
Hints

We can simply divide the available memory metric by the total memory of the node and multiply it by 100 to get a percentage.

sum by(instance) (node_memory_MemAvailable_bytes{instance="prometheus-training.balgroupit.com:9100"})
/
sum by(instance) (node_memory_MemTotal_bytes{instance="prometheus-training.balgroupit.com:9100"})
* 100
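What happens in the division above can be sketched in Python with invented values: the operator matches series whose remaining label sets are identical (here just instance, thanks to sum by(instance)) and divides them pairwise:

```python
# Invented byte values per instance; after sum by(instance) both vectors
# carry only the "instance" label, so they match one-to-one.
mem_available = {"node-a:9100": 4.0e9, "node-b:9100": 1.0e9}
mem_total = {"node-a:9100": 16.0e9, "node-b:9100": 8.0e9}

free_pct = {inst: mem_available[inst] / mem_total[inst] * 100
            for inst in mem_available}
print(free_pct)  # {'node-a:9100': 25.0, 'node-b:9100': 12.5}
```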

Task 2.3.6: Comparison Binary Operators

In addition to the arithmetic binary operators, PromQL also provides a set of comparison binary operators:

  • == (equal)
  • != (not-equal)
  • > (greater-than)
  • < (less-than)
  • >= (greater-or-equal)
  • <= (less-or-equal)

Check if the server has more than 20% memory available using a comparison binary operator.

Hints

We can simply use the greater-than operator to compare the instant vector from the query with 20 (in our case, this corresponds to 20% available memory).

sum by(instance) (node_memory_MemAvailable_bytes{instance="prometheus-training.balgroupit.com:9100"})
/
sum by(instance) (node_memory_MemTotal_bytes{instance="prometheus-training.balgroupit.com:9100"})
* 100
> 20

The query only returns a result when more than 20% of the memory is available.

Change the value from 20 to 90 or more to see what happens when the comparison doesn’t match.
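The filtering behaviour of comparison operators can be sketched like this (values invented; note that without the bool modifier, samples for which the comparison is false are simply dropped):

```python
# Invented free-memory percentages per instance.
free_pct = {"node-a:9100": 38.5, "node-b:9100": 12.0}

# "> 20" keeps only the samples for which the comparison holds.
result = {inst: v for inst, v in free_pct.items() if v > 20}
print(result)  # {'node-a:9100': 38.5}
```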

Task 2.3.7: Histogram (optional)

So far we’ve been using gauge and counter metric types in our queries.

Read the documentation about the histogram metric type.

There exists a histogram for the http request durations to the Prometheus server. It basically counts requests that took a certain amount of time and puts them into matching buckets (le label).

We want to write a query that returns

  • the total numbers of requests
  • to the Prometheus server
  • on /metrics
  • below 0.1 seconds
Hints

A metric name has an application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. As seen in previous labs, the http metrics for the Prometheus server are available in the prometheus_ namespace.

By filtering the le label to 0.1 we get the number of requests that completed within 0.1 seconds.

prometheus_http_request_duration_seconds_bucket{handler="/metrics",le="0.1",namespace="<team>-monitoring"}
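How the le buckets count requests can be sketched with invented durations. Each bucket is cumulative: it counts every observation less than or equal to its upper bound:

```python
import math

# Invented request durations in seconds.
durations = [0.02, 0.05, 0.08, 0.3, 0.7, 2.5]
bounds = [0.1, 0.5, 1.0, math.inf]  # bucket upper bounds (the le label)

# Each bucket counts ALL observations <= its bound (cumulative).
buckets = {le: sum(1 for d in durations if d <= le) for le in bounds}
print(buckets)  # {0.1: 3, 0.5: 4, 1.0: 5, inf: 6}
# The le="+Inf" bucket always equals the histogram's _count.
```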

Tip: Analyze the query in PromLens

Advanced: You can calculate the percentage of requests that were below 0.1 seconds by aggregating the metric above. See the Prometheus documentation for more information about the Apdex score.

Example

sum(
  rate(
    prometheus_http_request_duration_seconds_bucket{handler="/metrics",le="0.1",namespace="<team>-monitoring"}[5m]
  )
) by (job, handler)
/
sum(
  rate(
    prometheus_http_request_duration_seconds_count{handler="/metrics",namespace="<team>-monitoring"}[5m]
  )
) by (job, handler)
* 100

Task 2.3.8: Quantile (optional)

We can use the histogram_quantile function to calculate the request duration quantile of the requests to the Prometheus server from a histogram metric. To achieve this we can use the metric prometheus_http_request_duration_seconds_bucket, which the Prometheus server exposes by default.

Write a query that returns the 0.9 quantile of the request durations under the /metrics handler, based on the per-second rate of the metric mentioned above.

Hints

Expression

histogram_quantile(
  0.9,
  rate(
    prometheus_http_request_duration_seconds_bucket{handler="/metrics",namespace="<team>-monitoring"}[5m]
  )
)

Explanation: histogram_quantile will calculate the 0.9 quantile based on the distribution of the samples in our buckets, assuming a linear distribution within each bucket.
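The interpolation can be sketched in Python. This is a simplified assumption of the mechanism: bucket counts are invented and the special handling of the highest bucket is omitted:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total  # position of the desired observation
    lower, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            # Assume observations are spread linearly inside this bucket.
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count

# Invented cumulative counts: 60 requests took <= 0.1s, 90 <= 0.5s, 100 <= 1s.
print(histogram_quantile(0.9, [(0.1, 60), (0.5, 90), (1.0, 100)]))  # ~0.5
```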

Task 2.3.9: predict_linear function (optional)

We could simply alert on static thresholds, for example notify when the file system is more than 90% full. But sometimes 90% disk usage is a desired state, for example when the volume is very large (10% of 10 TB would still be 1 TB free; who wants to waste that space?). So it is better to write queries based on predictions: say, a query that tells me that my disk will be full within the next 24 hours if it keeps growing at the same rate as over the last 6 hours.

Let’s write a query that makes exactly such a prediction:

  • Find a metric that shows you the available disk space on the filesystem mounted on /
  • Use a function that allows you to predict whether the filesystem will be full in 4 hours
  • Predict the usage linearly based on the growth over the last 1 hour
Hints

Expression

predict_linear(node_filesystem_avail_bytes{mountpoint="/",instance="prometheus-training.balgroupit.com:9100"}[1h], 3600 * 4) < 0

Explanation: based on the data of the last 1h, the prediction is that fewer than 0 bytes will be available in 3600 * 4 seconds. The query returns no data because the filesystem will not be full within the next 4 hours. You can check how much disk space will be available in 4 hours by removing the < 0 part.

predict_linear(node_filesystem_avail_bytes{mountpoint="/",instance="prometheus-training.balgroupit.com:9100"}[1h], 3600 * 4)
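Conceptually, predict_linear fits a straight line through the samples in the range (via simple least-squares regression) and extrapolates it. A sketch with invented disk values:

```python
def predict_linear(samples, t_ahead):
    """Least-squares line through (timestamp, value) samples,
    evaluated t_ahead seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + t_ahead)

# Invented data: free space shrinking by 1 GiB per hour, 10 GiB left now.
gib = 2 ** 30
samples = [(0, 11 * gib), (1800, 10.5 * gib), (3600, 10 * gib)]
print(predict_linear(samples, 4 * 3600) / gib)  # ~6.0 GiB free in 4 hours
```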

Task 2.3.10: Many-to-one vector matches (optional)

Prometheus provides built-in functions and metrics whose values can be correlated with the metrics exposed by your exporters. One such function is time(). Prometheus also allows you to add labels from a different metric if you can correlate the two metrics by their labels. See Many-to-one and one-to-many vector matches for more examples.

Write a query that answers the following questions:

  • What is the uptime of the server in minutes?
  • Which kernel is currently active?
Hints

Expression

(
  (
    time() - node_boot_time_seconds{instance="prometheus-training.balgroupit.com:9100"}
  ) / 60
)
* on(instance) group_left(release) node_uname_info
  • time(): Use the current UNIX Epoch time
  • node_boot_time_seconds: Returns the UNIX epoch time at which the VM was started
  • on(instance) group_left(release) node_uname_info: Group your metrics result with the metric node_uname_info which contains information about your kernel in the release label.

An alternative solution with group_right instead of group_left would be:

node_uname_info{instance="prometheus-training.balgroupit.com:9100"} * on(instance) group_right(release)
(
  (
    time() - node_boot_time_seconds
  ) / 60
)
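The label join performed by group_left/group_right can be sketched in Python (series and values invented): each sample on the "many" side is matched to the single series on the "one" side carrying the same instance label, and the listed label (release) is copied onto the result:

```python
# Invented uptime-in-minutes samples (the "many" side of the match).
uptime_minutes = [
    {"labels": {"instance": "node-a:9100"}, "value": 5230.0},
    {"labels": {"instance": "node-b:9100"}, "value": 117.0},
]
# Invented node_uname_info-style series (the value is always 1).
uname_info = [
    {"labels": {"instance": "node-a:9100", "release": "5.14.0"}, "value": 1.0},
    {"labels": {"instance": "node-b:9100", "release": "6.1.0"}, "value": 1.0},
]

# on(instance): match series by the instance label only.
by_instance = {s["labels"]["instance"]: s for s in uname_info}

# group_left(release): copy the release label from the matched series.
result = []
for s in uptime_minutes:
    match = by_instance[s["labels"]["instance"]]
    result.append({
        "labels": {**s["labels"], "release": match["labels"]["release"]},
        "value": s["value"] * match["value"],  # * 1 keeps the uptime value
    })
print(result)
```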