Commit bf647326 authored by beorn7

Improvements after review.

Most importantly, created a completely new section for histograms and
summaries and updated all the references.

Also add other minor improvements.
parent 38a24dc3
......@@ -63,13 +63,14 @@ during a scrape:
* the **count** of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
Use the [`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()) to calculate quantiles
from histograms or even aggregations of histograms. A histogram is also suitable
to calculate an [Apdex score](http://en.wikipedia.org/wiki/Apdex). See [summary
vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
details of histogram usage and differences to [summaries](#summary).
function](/docs/querying/functions/#histogram_quantile()) to calculate
quantiles from histograms or even aggregations of histograms. A
histogram is also suitable to calculate an [Apdex
score](http://en.wikipedia.org/wiki/Apdex). See [histograms and
summaries](/docs/practices/histograms) for details of histogram usage
and differences to [summaries](#summary).
Client library usage documentation for summaries:
Client library usage documentation for histograms:
* [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Histogram)
* [Java](https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Histogram.java) (histograms are only supported by the simple client but not by the legacy client)
......@@ -79,7 +80,7 @@ Client library usage documentation for summaries:
Similar to a _histogram_, a _summary_ samples observations (usually things like
request durations and response sizes). While it also provides a total count of
observation and a sum of all observed values, it calculates configurable
observations and a sum of all observed values, it calculates configurable
quantiles over a sliding time window.
A summary with a base metric name of `<basename>` exposes multiple time series
......@@ -89,9 +90,9 @@ during a scrape:
* the **total sum** of all observed values, exposed as `<basename>_sum`
* the **count** of events that have been observed, exposed as `<basename>_count`
See [summary
vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
details of summary usage and differences to [histograms](#histogram).
See [histograms and summaries](/docs/practices/histograms) for
detailed explanations of φ-quantiles, summary usage, and differences
to [histograms](#histogram).
Client library usage documentation for summaries:
......
---
title: Alerting
sort_rank: 4
sort_rank: 5
---
# Alerting
......
......@@ -3,7 +3,7 @@ title: Consoles and dashboards
sort_rank: 3
---
## Consoles and dashboards
# Consoles and dashboards
It can be tempting to display as much data as possible on a dashboard, especially
when a system like Prometheus offers the ability to have such rich
......
---
title: Histograms and summaries
sort_rank: 4
---
# Histograms and summaries
Histograms and summaries are more complex metric types. Not only
does a single histogram or summary create a multitude of time series,
it is also more difficult to use these metric types correctly. This
section helps you to pick and configure the appropriate metric type
for your use case.
## Library support
First of all, check the library support for
[histograms](/docs/concepts/metric_types/#histogram) and
[summaries](/docs/concepts/metric_types/#summary). Full support for
both currently only exists in the Go client library. Many libraries
support only one of the two types, or they support summaries only in a
limited fashion (lacking [quantile
calculation](#quantiles)). [Contributions are welcome](/community/),
of course. In general, we expect histograms to be more urgently needed
than summaries. Histograms are also easier to implement in a client
library, so we recommend implementing histograms first, if in
doubt. The reason why some libraries offer summaries but not
histograms (Ruby, the legacy Java client) is that histograms are a
more recent feature of Prometheus.
## Count and sum of observations
Histograms and summaries both sample observations, typically request
durations or response sizes. They track the number of observations
*and* the sum of the observed values, allowing you to calculate the
*average* of the observed values. Note that the number of observations
(showing up in Prometheus as a time series with a `_count` suffix) is
inherently a counter (as described above, it only goes up). The sum of
observations (showing up as a time series with a `_sum` suffix)
behaves like a counter, too, as long as all observations are
positive. Obviously, request durations or response sizes are always
positive. In principle, however, you can use summaries and histograms
to observe negative values (e.g. temperatures in centigrade). In that
case, the sum of observations can go down, so you cannot apply
`rate()` to it anymore.
To calculate the average request duration during the last 5 minutes
from a histogram or summary called `http_request_duration_seconds`, use
the following expression:
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
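The arithmetic behind this ratio can be illustrated outside of Prometheus. The following Python sketch assumes two scrapes of the `_sum` and `_count` series taken five minutes apart and ignores counter resets; the function name and the sample numbers are made up for illustration.

```python
import math


def average_observation(sum_start, sum_end, count_start, count_end):
    """Average observed value between two scrapes of _sum and _count.

    Hypothetical helper: both rate() calls divide by the same window
    length, so the window cancels out and only the increases matter.
    Counter resets are not handled in this sketch.
    """
    observations = count_end - count_start
    if observations == 0:
        return math.nan  # no observations in the window
    return (sum_end - sum_start) / observations


# 100 requests that took 30 seconds in total during the window.
print(average_observation(120.0, 150.0, 400, 500))
```

Because the window length cancels, the result is an average per observation, not a per-second value.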
## Apdex score
A straightforward use of histograms (but not summaries) is to count
observations falling into particular buckets of observation
values.
You might have an SLA to serve 95% of requests within 300ms. In that
case, configure a histogram to have a bucket with an upper limit of
0.3 seconds. You can then directly express the relative amount of
requests served within 300ms and easily alert if the value drops below
0.95. The following expression calculates it by job for the requests
served in the last 5 minutes. The request durations were collected with
a histogram called `http_request_duration_seconds`.
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job)
You can calculate the well-known [Apdex
score](http://en.wikipedia.org/wiki/Apdex) in a similar way. Configure
a bucket with the target request duration as upper bound and another
bucket with the tolerated request duration (usually 4 times the target
request duration) as upper bound. Example: The target request duration
is 300ms. The tolerable request duration is 1.2s. The following
expression yields the Apdex score over the last 5 minutes:
(
rate(http_request_duration_seconds_bucket{le="0.3"}[5m])
+
rate(http_request_duration_seconds_bucket{le="1.2"}[5m])
) / 2 / rate(http_request_duration_seconds_count[5m])
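The expression works because histogram buckets are cumulative: the `{le="1.2"}` bucket already contains every request counted in the `{le="0.3"}` bucket. The following Python sketch, with hypothetical function names and made-up counts, shows that averaging the two bucket counts matches the usual Apdex definition of satisfied requests plus half of the tolerating requests.

```python
def apdex_from_buckets(within_target, within_tolerated, total):
    """Apdex from two cumulative bucket counts (hypothetical helper).

    within_tolerated includes within_target, as the buckets are cumulative.
    """
    if total == 0:
        return float("nan")
    return (within_target + within_tolerated) / 2 / total


def apdex_textbook(satisfied, tolerating, total):
    """Apdex as usually defined: satisfied plus half of tolerating."""
    if total == 0:
        return float("nan")
    return (satisfied + tolerating / 2) / total


# Made-up numbers: 800 requests within 0.3s, 950 within 1.2s, 1000 overall.
print(apdex_from_buckets(800, 950, 1000))
print(apdex_textbook(800, 950 - 800, 1000))
```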
## Quantiles
You can use both summaries and histograms to calculate so-called φ-quantiles,
where 0 ≤ φ ≤ 1. The φ-quantile is the observation value that ranks at number
φ*N among the N observations. Examples for φ-quantiles: The 0.5-quantile is
known as the median. The 0.95-quantile is the 95th percentile.
The essential difference between summaries and histograms is that summaries
calculate streaming φ-quantiles on the client side and expose them directly,
while histograms expose bucketed observation counts and the calculation of
quantiles from the buckets of a histogram happens on the server side using the
[`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
The two approaches have a number of different implications:
| | Histogram | Summary
|---|-----------|---------
| Required configuration | Pick buckets suitable for the expected range of observed values. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later.
| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/querying/rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost.
| Number of time series (in addition to the `_sum` and `_count` series) | One time series per configured bucket. | One time series per configured quantile.
| Quantile error (see below for details) | Error is limited in the dimension of observed values by the width of the relevant bucket. | Error is limited in the dimension of φ by a configurable value.
| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | Preconfigured by the client.
| Aggregation | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
Note the importance of the last item in the table. Let us return to
the SLA of serving 95% of requests within 300ms. This time, you do not
want to display the percentage of requests served within 300ms, but
instead the 95th percentile, i.e. the request duration within which
you have served 95% of requests. To do that, you can either configure
a summary with a 0.95-quantile and (for example) a 5-minute decay
time-window, or you configure a histogram with a few buckets around
the 300ms mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
`{le="0.45"}`. If your service runs replicated with a number of
instances, you will collect request durations from every single one of
them, and then you want to aggregate everything into an overall 95th
percentile. However, aggregating the precomputed quantiles from a
summary rarely makes sense. In this particular case, averaging the
quantiles yields statistically nonsensical values.
avg(http_request_duration_seconds{quantile="0.95"}) // BAD!
Using histograms, the aggregation is perfectly possible with the
[`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) // GOOD.
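A small numeric example may help to show why averaging precomputed quantiles misleads. The Python sketch below uses made-up data for two instances with very different traffic and a simple nearest-rank percentile; the helper is hypothetical and is not the algorithm used by any client library.

```python
import math


def percentile(values, phi):
    """Nearest-rank phi-quantile of a non-empty list (illustrative only)."""
    ordered = sorted(values)
    index = max(math.ceil(phi * len(ordered)) - 1, 0)
    return ordered[index]


# Instance A serves 1000 fast requests, instance B serves 10 slow ones.
instance_a = [0.1] * 1000
instance_b = [1.0] * 10

per_instance = [percentile(instance_a, 0.95), percentile(instance_b, 0.95)]
averaged = sum(per_instance) / len(per_instance)
overall = percentile(instance_a + instance_b, 0.95)

print("average of per-instance 95th percentiles:", averaged)
print("95th percentile over all observations:", overall)
```

The averaged value lies far from the 95th percentile of the combined observations because it ignores how many requests each instance served. Summing bucket counts by `le` keeps that information.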
Furthermore, should your SLA change and you now want to plot the 90th
percentile, or you want to take into account the last 10 minutes
instead of the last 5 minutes, you only have to adjust the expression
above and you do not need to reconfigure the clients.
## Errors of quantile estimation
Quantiles, whether calculated client-side or server-side, are
estimated. It is important to understand the errors of that
estimation.
Continuing the histogram example from above, imagine your usual
request durations are almost all very close to 220ms, or in other
words, if you could plot the "true" histogram, you would see a very
sharp spike at 220ms. In the Prometheus histogram metric as configured
above, almost all observations, and therefore also the 95th percentile,
will fall into the bucket labeled `{le="0.3"}`, i.e. the bucket from
200ms to 300ms. The histogram implementation guarantees that the true
95th percentile is somewhere between 200ms and 300ms. To return a
single value (rather than an interval), it applies linear
interpolation, which yields 295ms in this case. The calculated
quantile gives you the impression that you are close to breaking the
SLA, but in reality, the 95th percentile is a tiny bit above 220ms,
a quite comfortable distance to your SLA.
Next step in our *Gedankenexperiment*: A change in backend routing
adds a fixed amount of 100ms to all request durations. Now the request
duration has its sharp spike at 320ms and almost all observations will
fall into the bucket from 300ms to 450ms. The 95th percentile is
calculated to be 442.5ms, although the correct value is close to
320ms. While you are only a tiny bit outside of your SLA, the
calculated 95th percentile looks much worse.
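The two calculated values above can be reproduced with a simplified model of the interpolation. The Python sketch below is not the Prometheus implementation; it follows the behaviour described for `histogram_quantile()` and assumes, for simplicity, that all 1000 observations (a made-up number) fall into a single bucket.

```python
import math


def histogram_quantile(phi, buckets):
    """Estimate the phi-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the last upper bound.
    """
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # assumption of this sketch: no data, no estimate
    rank = phi * total
    # Index of the first bucket whose cumulative count reaches the rank.
    i = next(idx for idx, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[i]
    if i == len(buckets) - 1:
        return buckets[-2][0]  # quantile lies in the +Inf bucket
    if i == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0  # lowest bucket is assumed to start at 0
    else:
        lower, below = buckets[i - 1]
    if count == below:
        return lower  # empty bucket, only reachable for phi == 0
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Sharp spike at 220ms: everything lands in the bucket from 200ms to 300ms.
spike_220 = [(0.1, 0), (0.2, 0), (0.3, 1000), (0.45, 1000), (math.inf, 1000)]
# Sharp spike at 320ms: everything lands in the bucket from 300ms to 450ms.
spike_320 = [(0.1, 0), (0.2, 0), (0.3, 0), (0.45, 1000), (math.inf, 1000)]

print(histogram_quantile(0.95, spike_220))
print(histogram_quantile(0.95, spike_320))
```

The interpolation only sees bucket counts, so within a bucket the estimate depends on φ alone and not on where the spike really sits.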
A summary would have had no problem calculating the correct percentile
value in both cases, at least if it uses an appropriate algorithm on
the client side (like the [one used by the Go
client](http://www.cs.rutgers.edu/~muthu/bquant.pdf)). Unfortunately,
you cannot use a summary if you need to aggregate the observations
from a number of instances.
Luckily, due to your appropriate choice of bucket boundaries, even in
this contrived example of very sharp spikes in the distribution of
observed values, the histogram was able to identify correctly if you
were within or outside of your SLA. Also, the closer the actual value
of the quantile is to our SLA (or in other words, the value we are
actually most interested in), the more accurate the calculated value
becomes.
Let us now modify the experiment once more. In the new setup, the
distribution of request durations has a spike at 150ms, but it is not
quite as sharp as before and only comprises 90% of the
observations. 10% of the observations are evenly spread out in a long
tail between 150ms and 450ms. With that distribution, the 95th
percentile happens to be exactly at our SLA of 300ms. With the
histogram, the calculated value is accurate, as the value of the 95th
percentile happens to coincide with one of the bucket boundaries. Even
slightly different values would still be accurate as the (contrived)
even distribution within the relevant buckets is exactly what the
linear interpolation within a bucket assumes.
The error of the quantile reported by a summary gets more interesting
now. The error of the quantile in a summary is configured in the
dimension of φ. In our case we might have configured 0.95±0.01,
i.e. the calculated value will be between the 94th and 96th
percentile. The 94th percentile with the distribution described above is
270ms, the 96th percentile is 330ms. The calculated value of the 95th
percentile reported by the summary can be anywhere in the interval
between 270ms and 330ms, which unfortunately is all the difference
between clearly within the SLA vs. clearly outside the SLA.
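The 270ms and 330ms bounds follow directly from the shape of the distribution. The Python sketch below encodes the quantile function of the example distribution (90% of observations at 150ms, 10% spread evenly between 150ms and 450ms); the function name is hypothetical.

```python
def example_quantile_ms(phi):
    """phi-quantile in milliseconds of the example distribution."""
    if not 0 <= phi <= 1:
        raise ValueError("phi must be between 0 and 1")
    if phi <= 0.9:
        return 150.0  # the spike holds the lowest 90% of observations
    # The remaining 10% are spread evenly over the 300ms-wide tail.
    return 150.0 + (phi - 0.9) / 0.1 * 300.0


for phi in (0.94, 0.95, 0.96):
    print(phi, round(example_quantile_ms(phi), 1))
```

In the tail, a change of 0.01 in φ moves the quantile by 30ms, which is why an error bound in φ translates into a wide interval of observed values here.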
The bottom line is: If you use a summary, you control the error in the
dimension of φ. If you use a histogram, you control the error in the
dimension of the observed value (via choosing the appropriate bucket
layout). With a broad distribution, small changes in φ result in
large deviations in the observed value. With a sharp distribution, a
small interval of observed values covers a large interval of φ.
Two rules of thumb:
1. If you need to aggregate, choose histograms.
2. Otherwise, choose a histogram if you need accuracy in the
dimension of the observed values and you have an idea of the
ranges of observed values you are interested in. Choose a summary
if you need accuracy in the dimension of φ, no matter in which
ranges of observed values the quantile will end up.
......@@ -191,10 +191,14 @@ processing system.
If you are unsure, start with no labels and add more
labels over time as concrete use cases arise.
### Counter vs. gauge
### Counter vs. gauge, summary vs. histogram
It is important to know which of the four main metric types to use for
a given metric.
To pick between counter and gauge, there is a simple rule of thumb: if
the value can go down, it's a gauge.
To pick between counter and gauge, there is a simple rule of
thumb: if the value can go down, it is a gauge.
Counters can only go up (and reset, such as when a process restarts). They are
useful for accumulating the number of events, or the amount of something at
......@@ -206,55 +210,8 @@ Gauges can be set, go up, and go down. They are useful for snapshots of state,
such as in-progress requests, free/total memory, or temperature. You should
never take a `rate()` of a gauge.
### Summary vs. histogram
Summaries and histograms are more complex metric types. They both sample
observations. They track the number of observations *and* the sum of the
observed values, allowing you to calculate the average observed value (useful
for latency, for example). Note that the number of observations (showing up in
Prometheus as a time series with a `_count` suffix) is inherently a counter (as
described above, it only goes up), while the sum of observations (showing up as
a time series with a `_sum` suffix) is inherently a gauge (if a negative value
is observed, it goes down).
The essential difference is that summaries calculate streaming φ-quantiles on
the client side and expose them, while histograms count observations in buckets
and expose those counts. Calculation of quantiles from histograms happens on the
server side using the [`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
Both approaches have specific advantages and disadvantages:
| | Histogram | Summary
|---|-----------|---------
| Configuration | Need to configure buckets suitable for the expected range of observed values. | Need to configure φ-quantiles and sliding window, other φ-quantiles and sliding windows cannot be calculated later.
| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
| Server performance | Calculating quantiles is expensive, consider [recording rules](/docs/querying/rules/#recording-rules) as a remedy. | Very low resource needs.
| Number of time series | Low for Apdex score (see below), very high for accurate quantiles. Each bucket creates a time series. | Low, one time series per configured quantile.
| Accuracy | Depends on number and layout of buckets. Higher accuracy requires more time series. | Configurable. Higher accuracy requires more client resources but is relatively cheap.
| Specification of φ-quantile and sliding time window | Ad-hoc in Prometheus expressions. | Preconfigured by the client.
| Aggregation | Ad-hoc aggregation with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
Note the importance of the last item in the table. Let's say you run a service
with an SLA to respond to 95% of requests in under 200ms. In that case, you will
probably collect request durations from every single instance in your fleet, and
then you want to aggregate everything into an overall 95th percentile. You can
only do that with histograms, but not with summaries. Aggregating the
precomputed quantiles from a summary rarely makes sense.
A histogram is suitable to calculate the [Apdex
score](http://en.wikipedia.org/wiki/Apdex). Configure a bucket with the target
request duration as upper bound and another bucket with 4 times the request
duration as upper bound. Example: The target request duration is 250ms. The
tolerable request duration is 1s. The request duration are collected with a
histogram called `http_request_duration_seconds`. The following expression
yields the Apdex score:
```
(
http_request_duration_seconds_bucket{le="0.25"} + http_request_duration_seconds_bucket{le="1"}
) / 2 / http_request_duration_seconds_count
```
Summaries and histograms are more complex metric types discussed in
[their own section](/docs/practices/histograms/).
### Timestamps, not time since
......@@ -288,9 +245,11 @@ benchmarks are the best way to determine the impact of any given change.
### Avoid missing metrics
Time series that are not present until something happens are difficult to deal with,
as the usual simple operations are no longer sufficient to correctly handle
them. To avoid this, export a `0` for any time series you know may exist in advance.
Time series that are not present until something happens are difficult
to deal with, as the usual simple operations are no longer sufficient
to correctly handle them. To avoid this, export `0` (or `NaN`, if `0`
would be misleading) for any time series you know may exist in
advance.
Most Prometheus client libraries (including Go and Java Simpleclient) will
automatically export a `0` for you for metrics with no labels.
---
title: Recording rules
sort_rank: 5
sort_rank: 6
---
# Recording rules
......
......@@ -112,24 +112,25 @@ a `job` label set to `prometheus`:
The `offset` modifier allows changing the time offset for individual
instant and range vectors in a query.
For example, the following expression returns the value of `foo` 5
minutes in the past relative to the current query evaluation time:
For example, the following expression returns the value of
`http_requests_total` 5 minutes in the past relative to the current
query evaluation time:
foo offset 5m
http_requests_total offset 5m
Note that the `offset` modifier always needs to follow the selector
immediately, i.e. the following would be correct:
sum(foo offset 5m) // GOOD.
sum(http_requests_total{method="GET"} offset 5m) // GOOD.
While the following would be *incorrect*:
sum(foo) offset 5m // INVALID.
sum(http_requests_total{method="GET"}) offset 5m // INVALID.
The same works for range vectors. This returns the 5-minute rate that
`foo` had a week ago:
`http_requests_total` had a week ago:
rate(foo[5m] offset 1w)
rate(http_requests_total[5m] offset 1w)
## Operators
......
......@@ -30,7 +30,7 @@ exist for a given metric name and label combination.
## `bottomk()`
`bottomk(k integer, v instant-vector` returns the `k` smallest elements of `v`
`bottomk(k integer, v instant-vector)` returns the `k` smallest elements of `v`
by sample value.
......@@ -88,13 +88,17 @@ to the nearest integer.
## `histogram_quantile()`
`histogram_quantile(φ float, b instant-vector)` calculates the φ-quantile (0 ≤ φ
≤ 1) from the buckets `b` of a histogram. The samples in `b` are the counts of
observations in each bucket. Each value must have a label `le` where the label
value denotes the inclusive upper bound of the bucket. (Samples without such a
label are ignored.) The [histogram metric
type](/docs/concepts/metric_types/#histogram) automatically
provides time series with the `_bucket` suffix and the appropriate labels.
`histogram_quantile(φ float, b instant-vector)` calculates the
φ-quantile (0 ≤ φ ≤ 1) from the buckets `b` of a
[histogram](/docs/concepts/metric_types/#histogram). (See [histograms
and summaries](/docs/practices/histograms) for a detailed explanation
of φ-quantiles and the usage of the histogram metric type in general.)
The samples in `b` are the counts of observations in each bucket. Each
sample must have a label `le` where the label value denotes the
inclusive upper bound of the bucket. (Samples without such a label are
silently ignored.) The [histogram metric
type](/docs/concepts/metric_types/#histogram) automatically provides
time series with the `_bucket` suffix and the appropriate labels.
Use the `rate()` function to specify the time window for the quantile
calculation.
......@@ -103,34 +107,29 @@ Example: A histogram metric is called `http_request_duration_seconds`. To
calculate the 90th percentile of request durations over the last 10m, use the
following expression:
```
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
```
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
The quantile is calculated for each label combination in
`http_request_duration_seconds`. To aggregate, use the `sum()` aggregator
outside of the `rate()` function. Since the `le` label is required by
around the `rate()` function. Since the `le` label is required by
`histogram_quantile()`, it has to be included in the `by` clause. The following
expression aggregates quantiles by `job`:
expression aggregates the 90th percentile by `job`:
```
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
```
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
To aggregate everything, specify only the `le` label:
```
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
```
The `histogram_quantile()` interpolates quantile values by assuming a linear
distribution within a bucket. The highest bucket must have an upper bound of
`+Inf`. (Otherwise, `NaN` is returned.) If a quantile is located in the highest
bucket, the upper bound of the second highest bucket is returned. A lower limit
of the lowest bucket is assumed to be 0 if the upper bound of that bucket is
greater than 0. In that case, linar interpolation is applied within that bucket
as usual. Otherwise, the upper bound of the lowest bucket is returned for
quantiles located in the lowest bucket.
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
The `histogram_quantile()` function interpolates quantile values by
assuming a linear distribution within a bucket. The highest bucket
must have an upper bound of `+Inf`. (Otherwise, `NaN` is returned.) If
a quantile is located in the highest bucket, the upper bound of the
second highest bucket is returned. A lower limit of the lowest bucket
is assumed to be 0 if the upper bound of that bucket is greater than
0. In that case, the usual linear interpolation is applied within that
bucket. Otherwise, the upper bound of the lowest bucket is returned
for quantiles located in the lowest bucket.
If `b` contains fewer than two buckets, `NaN` is returned. For φ < 0, `-Inf` is
returned. For φ > 1, `+Inf` is returned.
......