Commit 559c589a authored by Brian Brazil

Document without aggregation modifier

parent 1733b1bf
...@@ -17,8 +17,8 @@ convention. ...@@ -17,8 +17,8 @@ convention.
Recording rules should be of the general form `level:metric:operations`. Recording rules should be of the general form `level:metric:operations`.
`level` represents the aggregation level and labels of the rule output. `level` represents the aggregation level and labels of the rule output.
`metric` is the metric name and should be unchanged other than stripping `metric` is the metric name and should be unchanged other than stripping
`_total` off counters when using `rate()`. `operations` is a list of operations `_total` off counters when using `rate()` or `irate()`. `operations` is a list
that were applied to the metric, newest operation first. of operations that were applied to the metric, newest operation first.
Keeping the metric name unchanged makes it easy to know what a metric is and Keeping the metric name unchanged makes it easy to know what a metric is and
easy to find in the codebase. easy to find in the codebase.
...@@ -41,69 +41,78 @@ Instead keep the metric name without the `_count` or `_sum` suffix and replace ...@@ -41,69 +41,78 @@ Instead keep the metric name without the `_count` or `_sum` suffix and replace
the `rate` in the operation with `mean`. This represents the average the `rate` in the operation with `mean`. This represents the average
observation size over that time period. observation size over that time period.
Always specify a `by` clause with at least the `job` label when aggregating. Always specify a `without` clause with the labels you are aggregating away.
This is to prevent recording rules without a `job` label, which may cause This is to preserve all the other labels such as `job`, which will avoid
conflicts across different jobs. conflicts and give you more useful metrics and alerts.
## Examples ## Examples
Aggregating up requests per second: Aggregating up requests per second that have a `path` label:
``` ```
instance:requests:rate5m = instance_path:requests:rate5m =
rate(requests_total{job="myjob"}[5m]) rate(requests_total{job="myjob"}[5m])
job:requests:rate5m = path:requests:rate5m =
sum by (job)(instance:requests:rate5m{job="myjob"}) sum without (instance)(instance_path:requests:rate5m{job="myjob"})
``` ```
Calculating a request failure ratio and aggregating up to the job-level failure ratio: Calculating a request failure ratio and aggregating up to the job-level failure ratio:
``` ```
instance:request_failures:rate5m = instance_path:request_failures:rate5m =
rate(request_failures_total{job="myjob"}[5m]) rate(request_failures_total{job="myjob"}[5m])
instance:request_failures_per_requests:ratio_rate5m = instance_path:request_failures_per_requests:ratio_rate5m =
instance:request_failures:rate5m{job="myjob"} instance_path:request_failures:rate5m{job="myjob"}
/ /
instance:requests:rate5m{job="myjob"} instance_path:requests:rate5m{job="myjob"}
// Aggregate up numbeator and denominator, then divide. // Aggregate up numerator and denominator, then divide to get path-level ratio.
path:request_failures_per_requests:ratio_rate5m =
sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
/
sum without (instance)(instance_path:requests:rate5m{job="myjob"})
// No labels left from instrumentation or distinguishing instances,
// so we use 'job' as the level.
job:request_failures_per_requests:ratio_rate5m = job:request_failures_per_requests:ratio_rate5m =
sum by (job)(instance:request_failures:rate5m{job="myjob"}) sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
/ /
sum by (job)(instance:requests:rate5m{job="myjob"}) sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
``` ```
Calculating average latency over a time period from a Summary: Calculating average latency over a time period from a Summary:
``` ```
instance:request_latency_seconds_count:rate5m = instance_path:request_latency_seconds_count:rate5m =
rate(request_latency_seconds_count{job="myjob"}[5m]) rate(request_latency_seconds_count{job="myjob"}[5m])
instance:request_latency_seconds_sum:rate5m = instance_path:request_latency_seconds_sum:rate5m =
rate(request_latency_seconds_sum{job="myjob"}[5m]) rate(request_latency_seconds_sum{job="myjob"}[5m])
instance:request_latency_seconds:mean5m = instance_path:request_latency_seconds:mean5m =
instance:request_latency_seconds_sum:rate5m{job="myjob"} instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
/ /
instance:request_latency_seconds_count:rate5m{job="myjob"} instance_path:request_latency_seconds_count:rate5m{job="myjob"}
// Aggregate up numerator and denominator, then divide. // Aggregate up numerator and denominator, then divide.
job:request_latency_seconds:mean5m = path:request_latency_seconds:mean5m =
sum by (job)(instance:request_latency_seconds_sum:rate5m{job="myjob"}) sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
/ /
sum by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"}) sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
``` ```
Calculating the average query rate across instances is done using the `avg()` function: Calculating the average query rate across instances and paths is done using the
`avg()` function:
``` ```
job:request_latency_seconds_count:avg_rate5m = job:request_latency_seconds_count:avg_rate5m =
avg by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"}) avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
``` ```
Notice that when aggregating the labels in the `by` clause always match up with Notice that when aggregating, the labels in the `without` clause are removed
the level of the output recording rule. When there is no aggregation, the from the level of the output metric name compared to the input metric names.
levels always match. If this is not the case a mistake has likely been made in the rules. When there is no aggregation, the levels always match. If this is not the case
a mistake has likely been made in the rules.
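The level-naming rule described above can be checked mechanically. The following is a minimal Python sketch, not part of Prometheus: the helper names `output_level` and `aggregated_rule_name` are hypothetical, and the sketch assumes the level is a `_`-joined list of label names (so label names that themselves contain `_` are not handled). It only rewrites the level part of the name, as in the `sum` examples above.

```python
def output_level(input_level, removed_labels):
    """Return the level of an aggregated rule, given the input rule's level.

    Assumption: the level is a '_'-joined list of label names.
    """
    remaining = [label for label in input_level.split("_")
                 if label not in removed_labels]
    # With no labels left, the examples above fall back to 'job' as the level.
    return "_".join(remaining) if remaining else "job"


def aggregated_rule_name(input_rule, removed_labels):
    """Rewrite only the level part of a level:metric:operations rule name."""
    level, metric, operations = input_rule.split(":")
    return ":".join([output_level(level, removed_labels), metric, operations])


# Labels listed in the `without` clause disappear from the level.
path_rule = aggregated_rule_name("instance_path:requests:rate5m", {"instance"})
job_rule = aggregated_rule_name("instance_path:requests:rate5m",
                                {"instance", "path"})
print(path_rule)
print(job_rule)
```

If the level computed this way differs from the level written in a rule's name, that is a hint the `without` clause and the rule name have drifted apart.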
...@@ -175,15 +175,16 @@ vector of fewer elements with aggregated values: ...@@ -175,15 +175,16 @@ vector of fewer elements with aggregated values:
* `count` (count number of elements in the vector) * `count` (count number of elements in the vector)
These operators can either be used to aggregate over **all** label dimensions These operators can either be used to aggregate over **all** label dimensions
or preserve distinct dimensions by including a `by`-clause. or preserve distinct dimensions by including a `without` or `by` clause.
<aggr-op>(<vector expression>) [by (<label list>)] [keep_common] <aggr-op>(<vector expression>) [without|by (<label list>)] [keep_common]
By default, labels that are not listed in the `by` clause will be dropped from `without` removes the listed labels from the result vector, while all other
the result vector, even if their label values are identical between all labels are preserved in the output. `by` does the opposite and drops labels that
elements of the vector. The `keep_common` clause allows to keep those extra are not listed in the `by` clause, even if their label values are identical
labels (labels that are identical between elements, but not in the `by` between all elements of the vector. The `keep_common` clause allows to keep
clause). those extra labels (labels that are identical between elements, but not in the
`by` clause).
Until Prometheus 0.14.0, the `keep_common` keyword was called `keeping_extra`. Until Prometheus 0.14.0, the `keep_common` keyword was called `keeping_extra`.
The latter is still supported, but is deprecated and will be removed at some The latter is still supported, but is deprecated and will be removed at some
...@@ -195,7 +196,7 @@ If the metric `http_requests_total` had time series that fan out by ...@@ -195,7 +196,7 @@ If the metric `http_requests_total` had time series that fan out by
`application`, `instance`, and `group` labels, we could calculate the total `application`, `instance`, and `group` labels, we could calculate the total
number of seen HTTP requests per application and group over all instances via: number of seen HTTP requests per application and group over all instances via:
sum(http_requests_total) by (application, group) sum(http_requests_total) without (instance)
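To make the difference between the two modifiers concrete, here is a minimal Python sketch that models the grouping step. It is an illustration under stated assumptions, not Prometheus's implementation: the function `aggregate_sum`, the sample values, and the extra `job` label are all hypothetical, and the metric name is ignored.

```python
from collections import defaultdict


def aggregate_sum(samples, labels, mode):
    """Sum sample values, grouping `by` the listed labels or `without` them."""
    groups = defaultdict(float)
    for sample_labels, value in samples:
        if mode == "by":
            # Keep only the listed labels; everything else is dropped.
            kept = {k: v for k, v in sample_labels.items() if k in labels}
        elif mode == "without":
            # Drop only the listed labels; everything else is preserved.
            kept = {k: v for k, v in sample_labels.items() if k not in labels}
        else:
            raise ValueError("mode must be 'by' or 'without'")
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


# Hypothetical series for http_requests_total, with an additional `job` label.
samples = [
    ({"job": "api", "application": "web", "group": "canary", "instance": "a"}, 3.0),
    ({"job": "api", "application": "web", "group": "canary", "instance": "b"}, 4.0),
    ({"job": "api", "application": "web", "group": "prod", "instance": "c"}, 5.0),
]

by_result = aggregate_sum(samples, {"application", "group"}, "by")
without_result = aggregate_sum(samples, {"instance"}, "without")
```

Both results contain the same sums, but only the `without` result still carries the `job` label. The two forms give the same label sets only when `application`, `group`, and `instance` are the only labels on the series.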
If we are just interested in the total of HTTP requests we have seen in **all** If we are just interested in the total of HTTP requests we have seen in **all**
applications, we could simply write: applications, we could simply write:
......