Commit bf647326 authored by beorn7

Improvements after review.

Most importantly, created a completely new section for histograms and
summaries and updated all the references.

Also added other minor improvements.
parent 38a24dc3
@@ -63,13 +63,14 @@ during a scrape:
* the **count** of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
Use the [`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()) to calculate
quantiles from histograms or even aggregations of histograms. A
histogram is also suitable to calculate an [Apdex
score](http://en.wikipedia.org/wiki/Apdex). See [histograms and
summaries](/docs/practices/histograms) for details of histogram usage
and differences to [summaries](#summary).

Client library usage documentation for histograms:
* [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Histogram)
* [Java](https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Histogram.java) (histograms are only supported by the simple client but not by the legacy client)
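Independent of the client library used, the cumulative-bucket bookkeeping behind the exposed series can be illustrated with a minimal pure-Python sketch (a toy model with invented names, not a real client library):

```python
import math

class SketchHistogram:
    """Toy model of a Prometheus histogram, not a real client library."""

    def __init__(self, upper_bounds):
        # The +Inf bucket is always present and counts every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)  # cumulative counts
        self.sum = 0.0    # exposed as <basename>_sum
        self.count = 0    # exposed as <basename>_count

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                # Buckets are cumulative: an observation is counted in
                # every bucket whose upper bound it does not exceed.
                self.bucket_counts[i] += 1

h = SketchHistogram([0.25, 1.0])
for v in (0.1, 0.5, 2.0):
    h.observe(v)
# h.bucket_counts is now [1, 2, 3] for le="0.25", le="1.0", le="+Inf"
```

Note how the `le="+Inf"` count always equals `<basename>_count`, which is why the two series are identical.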
@@ -79,7 +80,7 @@ Client library usage documentation for summaries:
Similar to a _histogram_, a _summary_ samples observations (usually things like
request durations and response sizes). While it also provides a total count of
observations and a sum of all observed values, it calculates configurable
quantiles over a sliding time window.
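The sliding-window behavior can be sketched as follows (a simplified nearest-rank computation over raw samples, with invented names; real client libraries use streaming quantile estimation instead of keeping all observations):

```python
import math

def sliding_quantile(observations, phi, window, now):
    """phi-quantile over (timestamp, value) pairs no older than `window`."""
    recent = sorted(v for t, v in observations if now - t <= window)
    if not recent:
        return math.nan
    # Nearest-rank quantile for simplicity; a real summary implementation
    # computes streaming quantiles without storing every raw sample.
    rank = max(0, math.ceil(phi * len(recent)) - 1)
    return recent[rank]

# Hypothetical request durations as (timestamp, seconds) pairs.
durations = [(0, 0.1), (5, 0.4), (8, 0.2), (9, 0.9)]
q = sliding_quantile(durations, 0.5, window=10, now=10)
# All four samples fall inside the window; the 0.5-quantile is 0.2.
```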
A summary with a base metric name of `<basename>` exposes multiple time series
@@ -89,9 +90,9 @@ during a scrape:
* the **total sum** of all observed values, exposed as `<basename>_sum`
* the **count** of events that have been observed, exposed as `<basename>_count`
See [histograms and summaries](/docs/practices/histograms) for
detailed explanations of φ-quantiles, summary usage, and differences
to [histograms](#histogram).
Client library usage documentation for summaries:
...
---
title: Alerting
sort_rank: 5
---

# Alerting
...
@@ -3,7 +3,7 @@ title: Consoles and dashboards
sort_rank: 3
---

# Consoles and dashboards

It can be tempting to display as much data as possible on a dashboard, especially
when a system like Prometheus offers the ability to have such rich
...
@@ -191,10 +191,14 @@ processing system.
If you are unsure, start with no labels and add more
labels over time as concrete use cases arise.
### Counter vs. gauge, summary vs. histogram

It is important to know which of the four main metric types to use for
a given metric.

To pick between counter and gauge, there is a simple rule of thumb: if
the value can go down, it is a gauge.
Counters can only go up (and reset, such as when a process restarts). They are
useful for accumulating the number of events, or the amount of something at
@@ -206,55 +210,8 @@ Gauges can be set, go up, and go down. They are useful for snapshots of state,
such as in-progress requests, free/total memory, or temperature. You should
never take a `rate()` of a gauge.
Summaries and histograms are more complex metric types discussed in
[their own section](/docs/practices/histograms/).

### Summary vs. histogram
Summaries and histograms are more complex metric types. They both sample
observations. They track the number of observations *and* the sum of the
observed values, allowing you to calculate the average observed value (useful
for latency, for example). Note that the number of observations (showing up in
Prometheus as a time series with a `_count` suffix) is inherently a counter (as
described above, it only goes up), while the sum of observations (showing up as
a time series with a `_sum` suffix) is inherently a gauge (if a negative value
is observed, it goes down).
The essential difference is that summaries calculate streaming φ-quantiles on
the client side and expose them, while histograms count observations in buckets
and expose those counts. Calculation of quantiles from histograms happens on the
server side using the [`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
Both approaches have specific advantages and disadvantages:
| | Histogram | Summary
|---|-----------|---------
| Configuration | Need to configure buckets suitable for the expected range of observed values. | Need to configure φ-quantiles and sliding window, other φ-quantiles and sliding windows cannot be calculated later.
| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
| Server performance | Calculating quantiles is expensive, consider [recording rules](/docs/querying/rules/#recording-rules) as a remedy. | Very low resource needs.
| Number of time series | Low for Apdex score (see below), very high for accurate quantiles. Each bucket creates a time series. | Low, one time series per configured quantile.
| Accuracy | Depends on number and layout of buckets. Higher accuracy requires more time series. | Configurable. Higher accuracy requires more client resources but is relatively cheap.
| Specification of φ-quantile and sliding time window | Ad-hoc in Prometheus expressions. | Preconfigured by the client.
| Aggregation | Ad-hoc aggregation with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
Note the importance of the last item in the table. Let's say you run a service
with an SLA to respond to 95% of requests in under 200ms. In that case, you will
probably collect request durations from every single instance in your fleet, and
then you want to aggregate everything into an overall 95th percentile. You can
only do that with histograms, but not with summaries. Aggregating the
precomputed quantiles from a summary rarely makes sense.
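A small numerical example (invented latencies, nearest-rank percentiles for illustration) shows how badly averaging precomputed quantiles can go wrong:

```python
import math

def p95(samples):
    # Nearest-rank 95th percentile, for illustration only.
    s = sorted(samples)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

# Two hypothetical instances with ten request durations each (seconds).
instance_a = [0.1] * 9 + [5.0]   # one very slow request
instance_b = [0.1] * 10

avg_of_p95s = (p95(instance_a) + p95(instance_b)) / 2   # 2.55
fleet_p95 = p95(instance_a + instance_b)                # 0.1
# Averaging the per-instance quantiles (2.55s) wildly misstates the true
# fleet-wide 95th percentile (0.1s) computed from all observations.
```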
A histogram is suitable to calculate the [Apdex
score](http://en.wikipedia.org/wiki/Apdex). Configure a bucket with the target
request duration as upper bound and another bucket with 4 times the request
duration as upper bound. Example: The target request duration is 250ms. The
tolerable request duration is 1s. The request durations are collected with a
histogram called `http_request_duration_seconds`. The following expression
yields the Apdex score:
```
(
http_request_duration_seconds_bucket{le="0.25"} + http_request_duration_seconds_bucket{le="1"}
) / 2 / http_request_duration_seconds_count
```
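Plugging invented numbers into the expression above shows the arithmetic; because histogram buckets are cumulative, it matches the classic Apdex definition (satisfied + tolerating / 2) / total:

```python
# Hypothetical cumulative bucket counts from one scrape of
# http_request_duration_seconds (all numbers invented).
le_0_25 = 600   # observations <= 0.25s (the "satisfied" threshold)
le_1 = 800      # observations <= 1s; cumulative, so includes the 600 above
total = 1000    # http_request_duration_seconds_count

apdex = (le_0_25 + le_1) / 2 / total

# Equivalent classic form: (satisfied + tolerating / 2) / total
classic = (le_0_25 + (le_1 - le_0_25) / 2) / total
assert apdex == classic  # both are 0.7
```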
### Timestamps, not time since
@@ -288,9 +245,11 @@ benchmarks are the best way to determine the impact of any given change.
### Avoid missing metrics
Time series that are not present until something happens are difficult
to deal with, as the usual simple operations are no longer sufficient
to correctly handle them. To avoid this, export `0` (or `NaN`, if `0`
would be misleading) for any time series you know may exist in
advance.
Most Prometheus client libraries (including Go and Java Simpleclient) will
automatically export a `0` for you for metrics with no labels.
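For labeled metrics, you can pre-initialize the label values you know about so each series exists from the first scrape onward; a sketch using a plain dict and invented metric/label names:

```python
# Hypothetical labeled counter, tracked as a plain dict for illustration.
# Pre-initializing known label values exports 0 from the start, so queries
# see every series before the first error ever occurs.
KNOWN_CODES = ("400", "404", "500")
http_errors_total = {code: 0 for code in KNOWN_CODES}

def record_error(code):
    http_errors_total[code] = http_errors_total.get(code, 0) + 1

record_error("500")
# http_errors_total: {"400": 0, "404": 0, "500": 1}
```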
---
title: Recording rules
sort_rank: 6
---

# Recording rules
...
@@ -112,24 +112,25 @@ a `job` label set to `prometheus`:
The `offset` modifier allows changing the time offset for individual
instant and range vectors in a query.
For example, the following expression returns the value of
`http_requests_total` 5 minutes in the past relative to the current
query evaluation time:

    http_requests_total offset 5m
Note that the `offset` modifier always needs to follow the selector
immediately, i.e. the following would be correct:

    sum(http_requests_total{method="GET"} offset 5m) // GOOD.
While the following would be *incorrect*:

    sum(http_requests_total{method="GET"}) offset 5m // INVALID.
The same works for range vectors. This returns the 5-minute rate that
`http_requests_total` had a week ago:

    rate(http_requests_total[5m] offset 1w)
## Operators
...
@@ -30,7 +30,7 @@ exist for a given metric name and label combination.
## `bottomk()`
`bottomk(k integer, v instant-vector)` returns the `k` smallest elements of `v`
by sample value.
@@ -88,13 +88,17 @@ to the nearest integer.
## `histogram_quantile()`
`histogram_quantile(φ float, b instant-vector)` calculates the
φ-quantile (0 ≤ φ ≤ 1) from the buckets `b` of a
[histogram](/docs/concepts/metric_types/#histogram). (See [histograms
and summaries](/docs/practices/histograms) for a detailed explanation
of φ-quantiles and the usage of the histogram metric type in general.)
The samples in `b` are the counts of observations in each bucket. Each
sample must have a label `le` where the label value denotes the
inclusive upper bound of the bucket. (Samples without such a label are
silently ignored.) The [histogram metric
type](/docs/concepts/metric_types/#histogram) automatically provides
time series with the `_bucket` suffix and the appropriate labels.
Use the `rate()` function to specify the time window for the quantile
calculation.
@@ -103,34 +107,29 @@ Example: A histogram metric is called `http_request_duration_seconds`. To
calculate the 90th percentile of request durations over the last 10m, use the
following expression:
    histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
The quantile is calculated for each label combination in
`http_request_duration_seconds`. To aggregate, use the `sum()` aggregator
around the `rate()` function. Since the `le` label is required by
`histogram_quantile()`, it has to be included in the `by` clause. The following
expression aggregates the 90th percentile by `job`:
    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
To aggregate everything, specify only the `le` label:
    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))

The `histogram_quantile()` function interpolates quantile values by
assuming a linear distribution within a bucket. The highest bucket
must have an upper bound of `+Inf`. (Otherwise, `NaN` is returned.) If
a quantile is located in the highest bucket, the upper bound of the
second highest bucket is returned. A lower limit of the lowest bucket
is assumed to be 0 if the upper bound of that bucket is greater than
0. In that case, the usual linear interpolation is applied within that
bucket. Otherwise, the upper bound of the lowest bucket is returned
for quantiles located in the lowest bucket.
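The interpolation rules above can be sketched in Python (a simplified re-implementation for illustration only, operating on `(upper_bound, cumulative_count)` pairs rather than real time series):

```python
import math

def histogram_quantile(phi, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs; last bound +Inf."""
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    total = buckets[-1][1]
    rank = phi * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(j for j, (_, count) in enumerate(buckets) if count >= rank)
    if i == len(buckets) - 1:
        return buckets[-2][0]  # quantile in +Inf bucket: second-highest bound
    if i == 0:
        if buckets[0][0] <= 0:
            return buckets[0][0]  # cannot assume a lower bound of 0
        lower, prev_count = 0.0, 0  # lowest bucket: lower bound assumed 0
    else:
        lower, prev_count = buckets[i - 1]
    upper, count = buckets[i]
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Invented cumulative counts: 50 observations <= 0.25, 90 <= 1.0, 100 total.
buckets = [(0.25, 50), (1.0, 90), (math.inf, 100)]
q90 = histogram_quantile(0.9, buckets)  # rank 90 falls exactly at le=1.0
```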
If `b` contains fewer than two buckets, `NaN` is returned. For φ < 0, `-Inf` is
returned. For φ > 1, `+Inf` is returned.
...