Improvements after review.

Most importantly, created a completely new section for histograms and summaries and updated all the references. Also add other minor improvements.

Commit bf647326 authored Feb 24, 2015 by beorn7
8 changed files
--- a/content/docs/concepts/metric_types.md
+++ b/content/docs/concepts/metric_types.md
@@ -63,13 +63,14 @@ during a scrape:
  * the **count** of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)

 Use the [`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile()) to calculate quantiles
-from histograms or even aggregations of histograms. A histogram is also suitable
-to calculate an [Apdex score](http://en.wikipedia.org/wiki/Apdex). See [summary
-vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
-details of histogram usage and differences to [summaries](#summary).
+function](/docs/querying/functions/#histogram_quantile()) to calculate
+quantiles from histograms or even aggregations of histograms. A
+histogram is also suitable to calculate an [Apdex
+score](http://en.wikipedia.org/wiki/Apdex). See [histograms and
+summaries](/docs/practices/histograms) for details of histogram usage
+and differences to [summaries](#summary).

-Client library usage documentation for summaries:
+Client library usage documentation for histograms:

   * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Histogram)
   * [Java](https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Histogram.java) (histograms are only supported by the simple client but not by the legacy client)
@@ -79,7 +80,7 @@ Client library usage documentation for summaries:

 Similar to a _histogram_, a _summary_ samples observations (usually things like
 request durations and response sizes). While it also provides a total count of
-observation and a sum of all observed values, it calculates configurable
+observations and a sum of all observed values, it calculates configurable
 quantiles over a sliding time window.

 A summary with a base metric name of `<basename>` exposes multiple time series
@@ -89,9 +90,9 @@ during a scrape:
  * the **total sum** of all observed values, exposed as `<basename>_sum`
  * the **count** of events that have been observed, exposed as `<basename>_count`

-See [summary
-vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
-details of summary usage and differences to [histograms](#histogram).
+See [histograms and summaries](/docs/practices/histograms) for
+detailed explanations of φ-quantiles, summary usage, and differences
+to [histograms](#histogram).

 Client library usage documentation for summaries:


--- a/content/docs/practices/alerting.md
+++ b/content/docs/practices/alerting.md
 ---
 title: Alerting
-sort_rank: 4
+sort_rank: 5
 ---

 # Alerting

--- a/content/docs/practices/consoles.md
+++ b/content/docs/practices/consoles.md
@@ -3,7 +3,7 @@ title: Consoles and dashboards
 sort_rank: 3
 ---

-## Consoles and dashboards
+# Consoles and dashboards

 It can be tempting to display as much data as possible on a dashboard, especially
 when a system like Prometheus offers the ability to have such rich

--- a/content/docs/practices/histograms.md
+++ b/content/docs/practices/histograms.md
+---
+title: Histograms and summaries
+sort_rank: 4
+---
+
+# Histograms and summaries
+
+Histograms and summaries are more complex metric types. Not only does
+a single histogram or summary create a multitude of time series, it
+is also more difficult to use these metric types correctly. This
+section helps you pick and configure the appropriate metric type for
+your use case.
+
+## Library support
+
+First of all, check the library support for
+[histograms](/docs/concepts/metric_types/#histogram) and
+[summaries](/docs/concepts/metric_types/#summary). Full support for
+both currently only exists in the Go client library. Many libraries
+support only one of the two types, or they support summaries only in a
+limited fashion (lacking [quantile
+calculation](#quantiles)). [Contributions are welcome](/community/),
+of course. In general, we expect histograms to be more urgently needed
+than summaries. Histograms are also easier to implement in a client
+library, so we recommend implementing histograms first, if in
+doubt. The reason why some libraries offer summaries but not
+histograms (Ruby, the legacy Java client) is that histograms are a
+more recent feature of Prometheus.
+
+## Count and sum of observations
+
+Histograms and summaries both sample observations, typically request
+durations or response sizes. They track the number of observations
+*and* the sum of the observed values, allowing you to calculate the
+*average* of the observed values. Note that the number of observations
+(showing up in Prometheus as a time series with a `_count` suffix) is
+inherently a counter (as described above, it only goes up). The sum of
+observations (showing up as a time series with a `_sum` suffix)
+behaves like a counter, too, as long as all observations are
+positive. Obviously, request durations or response sizes are always
+positive. In principle, however, you can use summaries and histograms
+to observe negative values (e.g. temperatures in centigrade). In that
+case, the sum of observations can go down, so you cannot apply
+`rate()` to it anymore.
+
+To calculate the average request duration during the last 5 minutes
+from a histogram or summary called `http_request_duration_seconds`, use
+the following expression:
+
+    rate(http_request_duration_seconds_sum[5m])
+      /
+    rate(http_request_duration_seconds_count[5m])
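As a rough illustration of what this expression computes (not part of the commit), the Python sketch below derives the same average from two hypothetical scrapes of the `_sum` and `_count` series. The function name and the sample values are made up for the example, and counter resets are ignored for simplicity. Because both rates cover the same window, the window length cancels out of the ratio.

```python
def average_over_window(sum_start, sum_end, count_start, count_end):
    """Average observed value between two scrapes of the _sum and _count series."""
    observations = count_end - count_start
    if observations == 0:
        # No observations in the window, so the average is undefined.
        return float("nan")
    return (sum_end - sum_start) / observations


# Hypothetical samples taken 5 minutes apart: 100 requests took 30 seconds in total.
avg = average_over_window(sum_start=120.0, sum_end=150.0,
                          count_start=400, count_end=500)
print(avg)  # about 0.3 seconds per request
```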
+
+## Apdex score
+
+A straightforward use of histograms (but not summaries) is to count
+observations falling into particular buckets of observation
+values.
+
+You might have an SLA to serve 95% of requests within 300ms. In that
+case, configure a histogram to have a bucket with an upper limit of
+0.3 seconds. You can then directly express the relative amount of
+requests served within 300ms and easily alert if the value drops below
+0.95. The following expression calculates it by job for the requests
+served in the last 5 minutes. The request durations were collected with
+a histogram called `http_request_duration_seconds`.
+
+    sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
+      /
+    sum(rate(http_request_duration_seconds_count[5m])) by (job)
+
+
+You can calculate the well-known [Apdex
+score](http://en.wikipedia.org/wiki/Apdex) in a similar way. Configure
+a bucket with the target request duration as upper bound and another
+bucket with the tolerated request duration (usually 4 times the target
+request duration) as upper bound. Example: The target request duration
+is 300ms. The tolerable request duration is 1.2s. The following
+expression yields the Apdex score for each job over the last 5 minutes:
+
+    (
+      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
+        +
+      sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)
+    ) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job)
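The Apdex arithmetic above relies on histogram buckets being cumulative: the tolerated bucket already contains the satisfied requests, so adding both and halving counts satisfied requests fully and tolerated ones at half weight. A minimal Python sketch with made-up numbers (the function and argument names are illustrative assumptions, not a client library API):

```python
def apdex(satisfied_cumulative, tolerated_cumulative, total):
    """Apdex score from cumulative bucket increases over some window.

    satisfied_cumulative: increase of the bucket at the target duration
    tolerated_cumulative: increase of the bucket at the tolerated duration
                          (cumulative, so it includes the satisfied requests)
    total:                increase of the _count series
    """
    if total == 0:
        return float("nan")  # no requests, score undefined
    return (satisfied_cumulative + tolerated_cumulative) / 2 / total


# Hypothetical window: 800 requests within 0.3s, 950 within 1.2s, 1000 overall.
score = apdex(satisfied_cumulative=800, tolerated_cumulative=950, total=1000)

# Textbook definition: satisfied count plus half of the merely tolerated ones.
textbook = (800 + (950 - 800) / 2) / 1000
print(score, textbook)
```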
+
+## Quantiles
+
+You can use both summaries and histograms to calculate so-called φ-quantiles,
+where 0 ≤ φ ≤ 1. The φ-quantile is the observation value that ranks at number
+φ*N among the N observations. Examples for φ-quantiles: The 0.5-quantile is
+known as the median. The 0.95-quantile is the 95th percentile.
+
+The essential difference between summaries and histograms is that summaries
+calculate streaming φ-quantiles on the client side and expose them directly,
+while histograms expose bucketed observation counts and the calculation of
+quantiles from the buckets of a histogram happens on the server side using the
+[`histogram_quantile()`
+function](/docs/querying/functions/#histogram_quantile()).
+
+The two approaches have a number of different implications:
+
+|   | Histogram | Summary
+|---|-----------|---------
+| Required configuration | Pick buckets suitable for the expected range of observed values. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later.
+| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
+| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/querying/rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost.
+| Number of time series (in addition to the `_sum` and `_count` series) | One time series per configured bucket. | One time series per configured quantile.
+| Quantile error (see below for details) | Error is limited in the dimension of observed values by the width of the relevant bucket. | Error is limited in the dimension of φ by a configurable value.
+| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | Preconfigured by the client.
+| Aggregation | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
+
+Note the importance of the last item in the table. Let us return to
+the SLA of serving 95% of requests within 300ms. This time, you do not
+want to display the percentage of requests served within 300ms, but
+instead the 95th percentile, i.e. the request duration within which
+you have served 95% of requests. To do that, you can either configure
+a summary with a 0.95-quantile and (for example) a 5-minute decay
+time-window, or you configure a histogram with a few buckets around
+the 300ms mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
+`{le="0.45"}`. If your service runs replicated with a number of
+instances, you will collect request durations from every single one of
+them, and then you want to aggregate everything into an overall 95th
+percentile. However, aggregating the precomputed quantiles from a
+summary rarely makes sense. In this particular case, averaging the
+quantiles yields statistically nonsensical values.
+
+    avg(http_request_duration_seconds{quantile="0.95"}) // BAD!
+
+Using histograms, the aggregation is perfectly possible with the
+[`histogram_quantile()`
+function](/docs/querying/functions/#histogram_quantile()).
+
+    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) // GOOD.
+
+Furthermore, should your SLA change and you now want to plot the 90th
+percentile, or you want to take into account the last 10 minutes
+instead of the last 5 minutes, you only have to adjust the expression
+above and you do not need to reconfigure the clients.
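To make the aggregation problem concrete, here is a small Python sketch with invented data: one busy instance serving fast requests and one nearly idle instance serving slow ones. The `percentile` helper uses the simple nearest-rank definition, which is an assumption for illustration and not the algorithm any client library uses.

```python
import math


def percentile(values, phi):
    """Nearest-rank phi-quantile of a list of observations."""
    ordered = sorted(values)
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request durations in seconds from two instances.
busy_instance = [0.1] * 1000   # many fast requests
idle_instance = [1.0] * 10     # few slow requests

# Averaging the per-instance quantiles, as the BAD expression does.
averaged = (percentile(busy_instance, 0.95) + percentile(idle_instance, 0.95)) / 2

# Quantile over all observations, which is what you actually want.
overall = percentile(busy_instance + idle_instance, 0.95)

print(averaged)  # roughly 0.55, dominated by the nearly idle instance
print(overall)   # 0.1, the 95th percentile of all 1010 requests
```

The averaged value gives the ten slow requests the same weight as the thousand fast ones, which is why it says little about the service as a whole.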
+
+## Errors of quantile estimation
+
+Quantiles, whether calculated client-side or server-side, are
+estimated. It is important to understand the errors of that
+estimation.
+
+Continuing the histogram example from above, imagine your usual
+request durations are almost all very close to 220ms, or in other
+words, if you could plot the "true" histogram, you would see a very
+sharp spike at 220ms. In the Prometheus histogram metric as configured
+above, almost all observations, and therefore also the 95th percentile,
+will fall into the bucket labeled `{le="0.3"}`, i.e. the bucket from
+200ms to 300ms. The histogram implementation guarantees that the true
+95th percentile is somewhere between 200ms and 300ms. To return a
+single value (rather than an interval), it applies linear
+interpolation, which yields 295ms in this case. The calculated
+quantile gives you the impression that you are close to breaking the
+SLA, but in reality, the 95th percentile is a tiny bit above 220ms,
+a quite comfortable distance to your SLA.
+
+Next step in our *Gedankenexperiment*: A change in backend routing
+adds a fixed amount of 100ms to all request durations. Now the request
+duration has its sharp spike at 320ms and almost all observations will
+fall into the bucket from 300ms to 450ms. The 95th percentile is
+calculated to be 442.5ms, although the correct value is close to
+320ms. While you are only a tiny bit outside of your SLA, the
+calculated 95th percentile looks much worse.
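The two estimates can be reproduced with a few lines of Python. The sketch idealizes "almost all" to "all" observations falling into a single bucket, and the helper name is hypothetical; it only mirrors the linear interpolation described in the text.

```python
def interpolate_in_bucket(lower, upper, count_below, count_in_bucket, rank):
    """Linear interpolation of a rank inside one bucket (bounds in seconds)."""
    return lower + (upper - lower) * (rank - count_below) / count_in_bucket


n = 1000              # hypothetical number of observations
rank = 0.95 * n       # rank of the 95th percentile

# Spike at 220ms: everything lands in the bucket from 200ms to 300ms.
before = interpolate_in_bucket(0.2, 0.3, 0, n, rank)

# Spike at 320ms: everything lands in the bucket from 300ms to 450ms.
after = interpolate_in_bucket(0.3, 0.45, 0, n, rank)

print(before, after)  # about 0.295 and 0.4425 seconds
```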
+
+A summary would have had no problem calculating the correct percentile
+value in both cases, at least if it uses an appropriate algorithm on
+the client side (like the [one used by the Go
+client](http://www.cs.rutgers.edu/~muthu/bquant.pdf)). Unfortunately,
+you cannot use a summary if you need to aggregate the observations
+from a number of instances.
+
+Luckily, due to your appropriate choice of bucket boundaries, even in
+this contrived example of very sharp spikes in the distribution of
+observed values, the histogram was able to identify correctly if you
+were within or outside of your SLA. Also, the closer the actual value
+of the quantile is to our SLA (or in other words, the value we are
+actually most interested in), the more accurate the calculated value
+becomes.
+
+Let us now modify the experiment once more. In the new setup, the
+distribution of request durations has a spike at 150ms, but it is not
+quite as sharp as before and only comprises 90% of the
+observations. 10% of the observations are evenly spread out in a long
+tail between 150ms and 450ms. With that distribution, the 95th
+percentile happens to be exactly at our SLA of 300ms. With the
+histogram, the calculated value is accurate, as the value of the 95th
+percentile happens to coincide with one of the bucket boundaries. Even
+slightly different values would still be accurate as the (contrived)
+even distribution within the relevant buckets is exactly what the
+linear interpolation within a bucket assumes.
+
+The error of the quantile reported by a summary gets more interesting
+now. The error of the quantile in a summary is configured in the
+dimension of φ. In our case we might have configured 0.95±0.01,
+i.e. the calculated value will be between the 94th and 96th
+percentile. The 94th percentile with the distribution described above is
+270ms, the 96th percentile is 330ms. The calculated value of the 95th
+percentile reported by the summary can be anywhere in the interval
+between 270ms and 330ms, which unfortunately is all the difference
+between clearly within the SLA vs. clearly outside the SLA.
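The interval follows directly from the distribution described above (90% of observations at 150ms, 10% spread evenly between 150ms and 450ms). A short Python sketch of its exact quantile function, with a made-up function name, shows how an error of ±0.01 in φ becomes ±30ms in the observed value:

```python
def true_quantile_ms(phi):
    """Exact phi-quantile (in ms) of the contrived distribution in the text."""
    if phi <= 0.9:
        # 90% of all observations sit in the spike at 150ms.
        return 150.0
    # The remaining 10% are spread evenly over the 300ms between 150ms and 450ms.
    return 150.0 + (phi - 0.9) / 0.1 * 300.0


low, mid, high = (true_quantile_ms(phi) for phi in (0.94, 0.95, 0.96))
print(low, mid, high)  # about 270, 300, and 330 ms
```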
+
+The bottom line is: If you use a summary, you control the error in the
+dimension of φ. If you use a histogram, you control the error in the
+dimension of the observed value (via choosing the appropriate bucket
+layout). With a broad distribution, small changes in φ result in
+large deviations in the observed value. With a sharp distribution, a
+small interval of observed values covers a large interval of φ.
+
+Two rules of thumb:
+
+  1. If you need to aggregate, choose histograms.
+
+  2. Otherwise, choose a histogram if you need accuracy in the
+     dimension of the observed values and you have an idea of the
+     range of observed values you are interested in. Choose a summary
+     if you need accuracy in the dimension of φ, no matter in which
+     range of observed values the quantile will end up.
--- a/content/docs/practices/instrumentation.md
+++ b/content/docs/practices/instrumentation.md
@@ -191,10 +191,14 @@ processing system.
 If you are unsure, start with no labels and add more
 labels over time as concrete use cases arise.

-### Counter vs. gauge
+### Counter vs. gauge, summary vs. histogram
+
+It is important to know which of the four main metric types to use for
+a given metric.

 To pick between counter and gauge, there is a simple rule of thumb: if
-the value can go down, it's a gauge.
+the value can go down, it is a gauge.

 Counters can only go up (and reset, such as when a process restarts). They are
 useful for accumulating the number of events, or the amount of something at
@@ -206,55 +210,8 @@ Gauges can be set, go up, and go down. They are useful for snapshots of state,
 such as in-progress requests, free/total memory, or temperature. You should
 never take a `rate()` of a gauge.

-### Summary vs. histogram
-
-Summaries and histograms are more complex metric types. They both sample
-observations. They track the number of observations *and* the sum of the
-observed values, allowing you to calculate the average observed value (useful
-for latency, for example). Note that the number of observations (showing up in
-Prometheus as a time series with a `_count` suffix) is inherently a counter (as
-described above, it only goes up), while the sum of observations (showing up as
-a time series with a `_sum` suffix) is inherently a gauge (if a negative value
-is observed, it goes down).
-
-The essential difference is that summaries calculate streaming φ-quantiles on
-the client side and expose them, while histograms count observations in buckets
-and expose those counts. Calculation of quantiles from histograms happens on the
-server side using the [`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile()).
-
-Both approaches have specific advantages and disadvantages:
-
-|   | Histogram | Summary
-|---|-----------|---------
-| Configuration | Need to configure buckets suitable for the expected range of observed values. | Need to configure φ-quantiles and sliding window, other φ-quantiles and sliding windows cannot be calculated later.
-| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
-| Server performance | Calculating quantiles is expensive, consider [recording rules](/docs/querying/rules/#recording-rules) as a remedy. | Very low resource needs.
-| Number of time series | Low for Apdex score (see below), very high for accurate quantiles. Each bucket creates a time series. | Low, one time series per configured quantile.
-| Accuracy | Depends on number and layout of buckets. Higher accuracy requires more time series. | Configurable. Higher accuracy requires more client resources but is relatively cheap.
-| Specification of φ-quantile and sliding time window | Ad-hoc in Prometheus expressions. | Preconfigured by the client.
-| Aggregation | Ad-hoc aggregation with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
-
-Note the importance of the last item in the table. Let's say you run a service
-with an SLA to respond to 95% of requests in under 200ms. In that case, you will
-probably collect request durations from every single instance in your fleet, and
-then you want to aggregate everything into an overall 95th percentile. You can
-only do that with histograms, but not with summaries. Aggregating the
-precomputed quantiles from a summary rarely makes sense.
-
-A histogram is suitable to calculate the [Apdex
-score](http://en.wikipedia.org/wiki/Apdex). Configure a bucket with the target
-request duration as upper bound and another bucket with 4 times the request
-duration as upper bound. Example: The target request duration is 250ms. The
-tolerable request duration is 1s. The request duration are collected with a
-histogram called `http_request_duration_seconds`. The following expression
-yields the Apdex score:
-
-```
-(
-  http_request_duration_seconds_bucket{le="0.25"} + http_request_duration_seconds_bucket{le="1"}
-) / 2 / http_request_duration_seconds_count
-```
+Summaries and histograms are more complex metric types discussed in
+[their own section](/docs/practices/histograms/).

 ### Timestamps, not time since

@@ -288,9 +245,11 @@ benchmarks are the best way to determine the impact of any given change.

 ### Avoid missing metrics

-Time series that are not present until something happens are difficult to deal with,
-as the usual simple operations are no longer sufficient to correctly handle
-them. To avoid this, export a `0` for any time series you know may exist in advance.
+Time series that are not present until something happens are difficult
+to deal with, as the usual simple operations are no longer sufficient
+to correctly handle them. To avoid this, export `0` (or `NaN`, if `0`
+would be misleading) for any time series you know may exist in
+advance.

 Most Prometheus client libraries (including Go and Java Simpleclient) will
 automatically export a `0` for you for metrics with no labels.
--- a/content/docs/practices/rules.md
+++ b/content/docs/practices/rules.md
 ---
 title: Recording rules
-sort_rank: 5
+sort_rank: 6
 ---

 # Recording rules

--- a/content/docs/querying/basics.md
+++ b/content/docs/querying/basics.md
@@ -112,24 +112,25 @@ a `job` label set to `prometheus`:
 The `offset` modifier allows changing the time offset for individual
 instant and range vectors in a query.

-For example, the following expression returns the value of `foo` 5
-minutes in the past relative to the current query evaluation time:
+For example, the following expression returns the value of
+`http_requests_total` 5 minutes in the past relative to the current
+query evaluation time:

-    foo offset 5m
+    http_requests_total offset 5m

 Note that the `offset` modifier always needs to follow the selector
 immediately, i.e. the following would be correct:

-    sum(foo offset 5m) // GOOD.
+    sum(http_requests_total{method="GET"} offset 5m) // GOOD.

 While the following would be *incorrect*:

-    sum(foo) offset 5m // INVALID.
+    sum(http_requests_total{method="GET"}) offset 5m // INVALID.

 The same works for range vectors. This returns the 5-minutes rate that
-`foo` had a week ago:
+`http_requests_total` had a week ago:

-    rate(foo[5m] offset 1w)
+    rate(http_requests_total[5m] offset 1w)
    
 ## Operators


--- a/content/docs/querying/functions.md
+++ b/content/docs/querying/functions.md
@@ -30,7 +30,7 @@ exist for a given metric name and label combination.

 ## `bottomk()`

-`bottomk(k integer, v instant-vector` returns the `k` smallest elements of `v`
+`bottomk(k integer, v instant-vector)` returns the `k` smallest elements of `v`
 by sample value.


@@ -88,13 +88,17 @@ to the nearest integer.

 ## `histogram_quantile()`

-`histogram_quantile(φ float, b instant-vector)` calculates the φ-quantile (0 ≤ φ
-≤ 1) from the buckets `b` of a histogram. The samples in `b` are the counts of
-observations in each bucket. Each value must have a label `le` where the label
-value denotes the inclusive upper bound of the bucket. (Samples without such a
-label are ignored.) The [histogram metric
-type](/docs/concepts/metric_types/#histogram) automatically
-provides time series with the `_bucket` suffix and the appropriate labels.
+`histogram_quantile(φ float, b instant-vector)` calculates the
+φ-quantile (0 ≤ φ ≤ 1) from the buckets `b` of a
+[histogram](/docs/concepts/metric_types/#histogram). (See [histograms
+and summaries](/docs/practices/histograms) for a detailed explanation
+of φ-quantiles and the usage of the histogram metric type in general.)
+The samples in `b` are the counts of observations in each bucket. Each
+sample must have a label `le` where the label value denotes the
+inclusive upper bound of the bucket. (Samples without such a label are
+silently ignored.) The [histogram metric
+type](/docs/concepts/metric_types/#histogram) automatically provides
+time series with the `_bucket` suffix and the appropriate labels.

 Use the `rate()` function to specify the time window for the quantile
 calculation.
@@ -103,34 +107,29 @@ Example: A histogram metric is called `http_request_duration_seconds`. To
 calculate the 90th percentile of request durations over the last 10m, use the
 following expression:

-```
-histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
-```
+    histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))

 The quantile is calculated for each label combination in
 `http_request_duration_seconds`. To aggregate, use the `sum()` aggregator
-outside of the `rate()` function. Since the `le` label is required by
+around the `rate()` function. Since the `le` label is required by
 `histogram_quantile()`, it has to be included in the `by` clause. The following
-expression aggregates quantiles by `job`:
+expression aggregates the 90th percentile by `job`:

-```
-histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
-```
+    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))

 To aggregate everything, specify only the `le` label:

-```
-histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
-```
-
-The `histogram_quantile()` interpolates quantile values by assuming a linear
-distribution within a bucket. The highest bucket must have an upper bound of
-`+Inf`. (Otherwise, `NaN` is returned.) If a quantile is located in the highest
-bucket, the upper bound of the second highest bucket is returned. A lower limit
-of the lowest bucket is assumed to be 0 if the upper bound of that bucket is
-greater than 0. In that case, linar interpolation is applied within that bucket
-as usual. Otherwise, the upper bound of the lowest bucket is returned for
-quantiles located in the lowest bucket.
+    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
+
+The `histogram_quantile()` function interpolates quantile values by
+assuming a linear distribution within a bucket. The highest bucket
+must have an upper bound of `+Inf`. (Otherwise, `NaN` is returned.) If
+a quantile is located in the highest bucket, the upper bound of the
+second highest bucket is returned. A lower limit of the lowest bucket
+is assumed to be 0 if the upper bound of that bucket is greater than
+0. In that case, the usual linear interpolation is applied within that
+bucket. Otherwise, the upper bound of the lowest bucket is returned
+for quantiles located in the lowest bucket.

 If `b` contains fewer than two buckets, `NaN` is returned. For φ < 0, `-Inf` is
 returned. For φ > 1, `+Inf` is returned.
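
The rules above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the Prometheus source: buckets are passed as `(upper_bound, cumulative_count)` pairs, and the handling of histograms without any observations is a guess that the text does not cover.

```python
import math


def histogram_quantile(phi, buckets):
    """Estimate the phi-quantile from (upper_bound, cumulative_count) pairs."""
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    buckets = sorted(buckets)
    # Need at least two buckets, and the highest one must be +Inf.
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # assumption: nothing observed, nothing to estimate
    rank = phi * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Quantile lies in the +Inf bucket: return the second highest bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    if idx == 0:
        if upper <= 0:
            return upper  # no sensible lower limit to interpolate from
        lower, below = 0.0, 0
    else:
        lower, below = buckets[idx - 1]
    if count == below:
        return lower  # assumption: empty bucket, avoid dividing by zero
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts: all 1000 observations between 0.2s and 0.3s.
spike = [(0.1, 0), (0.2, 0), (0.3, 1000), (0.45, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, spike)
print(p95)  # about 0.295
```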