* [Java](https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Histogram.java) (histograms are only supported by the simple client but not by the legacy client)

The two approaches have a number of different implications:

| | Histogram | Summary
|---|-----------|---------
| Required configuration | Pick buckets suitable for the expected range of observed values. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later.
| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/querying/rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost.
| Number of time series (in addition to the `_sum` and `_count` series) | One time series per configured bucket. | One time series per configured quantile.
| Quantile error (see below for details) | Error is limited in the dimension of observed values by the width of the relevant bucket. | Error is limited in the dimension of φ by a configurable value.
| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | Preconfigured by the client.
| Aggregation | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
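
To make the "Required configuration" row above concrete, here is a minimal sketch of what the two configurations might look like with the Java simple client linked above. The metric names, bucket boundaries, and quantile settings are illustrative examples, not prescriptions:

```java
import io.prometheus.client.Histogram;
import io.prometheus.client.Summary;

public class RequestMetrics {
    // Histogram: only the buckets have to be chosen up front. Quantiles and
    // aggregation are computed later on the server with histogram_quantile().
    static final Histogram requestDuration = Histogram.build()
        .name("http_request_duration_seconds")
        .help("Request duration in seconds.")
        .buckets(0.1, 0.2, 0.3, 0.45) // example buckets around a 300ms target
        .register();

    // Summary: the φ-quantiles (each with a tolerated error) and the sliding
    // time window must be preconfigured. Other quantiles or windows cannot be
    // derived from the exposed data later.
    static final Summary requestDurationSummary = Summary.build()
        .name("http_request_duration_seconds_summary")
        .help("Request duration in seconds.")
        .quantile(0.95, 0.01) // 0.95-quantile with 1% tolerated error
        .maxAgeSeconds(300)   // 5-minute sliding window...
        .ageBuckets(5)        // ...rotated over 5 age buckets
        .register();

    void handleRequest() {
        Histogram.Timer timer = requestDuration.startTimer();
        try {
            // ... handle the request ...
        } finally {
            timer.observeDuration();
        }
    }
}
```

With the histogram, the quantile is then calculated ad hoc on the server (e.g. via `histogram_quantile()`), while the summary directly exposes its preconfigured 0.95-quantile.
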
Note the importance of the last item in the table. Let us return to
the SLA of serving 95% of requests within 300ms. This time, you do not
want to display the percentage of requests served within 300ms, but
instead the 95th percentile, i.e. the request duration within which
you have served 95% of requests. To do that, you can either configure
a summary with a 0.95-quantile and (for example) a 5-minute decay
time, or configure a histogram with a few buckets around the 300ms
mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
`{le="0.45"}`. If your service runs replicated with a number of
instances, you will collect request durations from every single one of
them, and then you want to aggregate everything into an overall 95th
percentile. However, aggregating the precomputed quantiles from a
summary rarely makes sense. In this particular case, averaging the