Improvements after review.

Most importantly, created a completely new section for histograms and summaries and updated all the references. Also added other minor improvements.

Commit bf647326 authored Feb 24, 2015 by beorn7
8 changed files
--- a/content/docs/concepts/metric_types.md
+++ b/content/docs/concepts/metric_types.md
@@ -63,13 +63,14 @@ during a scrape:
  * the **count** of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
 Use the [`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile()) to calculate quantiles
+function](/docs/querying/functions/#histogram_quantile()) to calculate
-from histograms or even aggregations of histograms. A histogram is also suitable
+quantiles from histograms or even aggregations of histograms. A
-to calculate an [Apdex score](http://en.wikipedia.org/wiki/Apdex). See [summary
+histogram is also suitable to calculate an [Apdex
-vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
+score](http://en.wikipedia.org/wiki/Apdex). See [histograms and
-details of histogram usage and differences to [summaries](#summary).
+summaries](/docs/practices/histograms) for details of histogram usage
+and differences to [summaries](#summary).
-Client library usage documentation for summaries:
+Client library usage documentation for histograms:
   * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Histogram)
   * [Java](https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Histogram.java) (histograms are only supported by the simple client but not by the legacy client)
@@ -79,7 +80,7 @@ Client library usage documentation for summaries:
 Similar to a _histogram_, a _summary_ samples observations (usually things like
 request durations and response sizes). While it also provides a total count of
-observation and a sum of all observed values, it calculates configurable
+observations and a sum of all observed values, it calculates configurable
 quantiles over a sliding time window.
 A summary with a base metric name of `<basename>` exposes multiple time series
@@ -89,9 +90,9 @@ during a scrape:
  * the **total sum** of all observed values, exposed as `<basename>_sum`
  * the **count** of events that have been observed, exposed as `<basename>_count`
-See [summary
+See [histograms and summaries](/docs/practices/histograms) for
-vs. histogram](/docs/practices/instrumentation/#summary-vs.-histogram) for
+detailed explanations of φ-quantiles, summary usage, and differences
-details of summary usage and differences to [histograms](#histogram).
+to [histograms](#histogram).
 Client library usage documentation for summaries:
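
Review note on the histogram exposition described in this file: the relationship between the `<basename>_bucket{le="..."}`, `<basename>_sum`, and `<basename>_count` series can be illustrated with a toy model. This is only a sketch, not a real client library. The class name `SketchHistogram` and its methods are hypothetical, and it assumes that buckets are cumulative, which is what makes `<basename>_count` identical to `<basename>_bucket{le="+Inf"}`.

```python
import math


class SketchHistogram:
    """Toy model of a histogram's exposed series (hypothetical, for illustration)."""

    def __init__(self, basename, upper_bounds):
        self.basename = basename
        # The +Inf bucket is always appended so that it counts every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total_sum = 0.0
        self.count = 0

    def observe(self, value):
        # `le` is an inclusive upper bound and buckets are cumulative (assumption),
        # so every bucket whose bound is >= value is incremented.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.total_sum += value
        self.count += 1

    def expose(self):
        # Build a mapping from series name to current value.
        series = {}
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else "%g" % bound
            series['%s_bucket{le="%s"}' % (self.basename, le)] = n
        series[self.basename + "_sum"] = self.total_sum
        series[self.basename + "_count"] = self.count
        return series
```

Observing 0.1, 0.25, 0.7, and 3.0 with bounds 0.25 and 1 yields bucket counts of 2, 3, and 4 (for `+Inf`), and the `+Inf` bucket matches `_count`.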

--- a/content/docs/practices/alerting.md
+++ b/content/docs/practices/alerting.md
 ---
 title: Alerting
-sort_rank: 4
+sort_rank: 5
 ---
 # Alerting

--- a/content/docs/practices/consoles.md
+++ b/content/docs/practices/consoles.md
@@ -3,7 +3,7 @@ title: Consoles and dashboards
 sort_rank: 3
 ---
-## Consoles and dashboards
+# Consoles and dashboards
 It can be tempting to display as much data as possible on a dashboard, especially
 when a system like Prometheus offers the ability to have such rich

--- a/content/docs/practices/histograms.md
+++ b/content/docs/practices/histograms.md
--- a/content/docs/practices/instrumentation.md
+++ b/content/docs/practices/instrumentation.md
@@ -191,10 +191,14 @@ processing system.
 If you are unsure, start with no labels and add more
 labels over time as concrete use cases arise.
-### Counter vs. gauge
+### Counter vs. gauge, summary vs. histogram
+It is important to know which of the four main metric types to use for
+a given metric.
 To pick between counter and gauge, there is a simple rule of thumb: if
-the value can go down, it's a gauge.
+To pick between counter and gauge, there is a simple rule of
+thumb: if the value can go down, it is a gauge.
 Counters can only go up (and reset, such as when a process restarts). They are
 useful for accumulating the number of events, or the amount of something at
@@ -206,55 +210,8 @@ Gauges can be set, go up, and go down. They are useful for snapshots of state,
 such as in-progress requests, free/total memory, or temperature. You should
 never take a `rate()` of a gauge.
-### Summary vs. histogram
+Summaries and histograms are more complex metric types discussed in
+[their own section](/docs/practices/histograms/).
-Summaries and histograms are more complex metric types. They both sample
-observations. They track the number of observations *and* the sum of the
-observed values, allowing you to calculate the average observed value (useful
-for latency, for example). Note that the number of observations (showing up in
-Prometheus as a time series with a `_count` suffix) is inherently a counter (as
-described above, it only goes up), while the sum of observations (showing up as
-a time series with a `_sum` suffix) is inherently a gauge (if a negative value
-is observed, it goes down).
-The essential difference is that summaries calculate streaming φ-quantiles on
-the client side and expose them, while histograms count observations in buckets
-and expose those counts. Calculation of quantiles from histograms happens on the
-server side using the [`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile()).
-Both approaches have specific advantages and disadvantages:
-|   | Histogram | Summary
-|---|-----------|---------
-| Configuration | Need to configure buckets suitable for the expected range of observed values. | Need to configure φ-quantiles and sliding window, other φ-quantiles and sliding windows cannot be calculated later.
-| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
-| Server performance | Calculating quantiles is expensive, consider [recording rules](/docs/querying/rules/#recording-rules) as a remedy. | Very low resource needs.
-| Number of time series | Low for Apdex score (see below), very high for accurate quantiles. Each bucket creates a time series. | Low, one time series per configured quantile.
-| Accuracy | Depends on number and layout of buckets. Higher accuracy requires more time series. | Configurable. Higher accuracy requires more client resources but is relatively cheap.
-| Specification of φ-quantile and sliding time window | Ad-hoc in Prometheus expressions. | Preconfigured by the client.
-| Aggregation | Ad-hoc aggregation with [Prometheus expressions](/docs/querying/functions/#histogram_quantile()). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
-Note the importance of the last item in the table. Let's say you run a service
-with an SLA to respond to 95% of requests in under 200ms. In that case, you will
-probably collect request durations from every single instance in your fleet, and
-then you want to aggregate everything into an overall 95th percentile. You can
-only do that with histograms, but not with summaries. Aggregating the
-precomputed quantiles from a summary rarely makes sense.
-A histogram is suitable to calculate the [Apdex
-score](http://en.wikipedia.org/wiki/Apdex). Configure a bucket with the target
-request duration as upper bound and another bucket with 4 times the request
-duration as upper bound. Example: The target request duration is 250ms. The
-tolerable request duration is 1s. The request duration are collected with a
-histogram called `http_request_duration_seconds`. The following expression
-yields the Apdex score:
-```
-(
-  http_request_duration_seconds_bucket{le="0.25"} + http_request_duration_seconds_bucket{le="1"}
-) / 2 / http_request_duration_seconds_count
-```
 ### Timestamps, not time since
@@ -288,9 +245,11 @@ benchmarks are the best way to determine the impact of any given change.
 ### Avoid missing metrics
-Time series that are not present until something happens are difficult to deal with,
+Time series that are not present until something happens are difficult
-as the usual simple operations are no longer sufficient to correctly handle
+to deal with, as the usual simple operations are no longer sufficient
-them. To avoid this, export a `0` for any time series you know may exist in advance.
+to correctly handle them. To avoid this, export `0` (or `NaN`, if `0`
+would be misleading) for any time series you know may exist in
+advance.
 Most Prometheus client libraries (including Go and Java Simpleclient) will
 automatically export a `0` for you for metrics with no labels.
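
Review note on the Apdex expression that this diff removes from `instrumentation.md`: the reason the two bucket values are added and then halved may not be obvious. A short sketch, assuming cumulative buckets (the `le="1"` bucket already contains everything counted in `le="0.25"`), shows that the expression matches the usual Apdex definition of satisfied requests plus half of the tolerating requests. The function names and the plain-count inputs are hypothetical, chosen for illustration only.

```python
def apdex_from_buckets(le_target, le_tolerable, total):
    """Mirror of (bucket{le=target} + bucket{le=tolerable}) / 2 / count."""
    if total == 0:
        return float("nan")  # assumption: no observations means no score
    return (le_target + le_tolerable) / 2 / total


def apdex_from_classes(satisfied, tolerating, total):
    """Textbook form: (satisfied + tolerating / 2) / total."""
    if total == 0:
        return float("nan")
    return (satisfied + tolerating / 2) / total


# Hypothetical numbers: 100 requests, 60 within 0.25s, 90 within 1s.
# Cumulative buckets: le="0.25" -> 60, le="1" -> 90.
# Disjoint classes: satisfied = 60, tolerating = 90 - 60 = 30.
score_buckets = apdex_from_buckets(60, 90, 100)
score_classes = apdex_from_classes(60, 30, 100)
```

Both calls give 0.75, because (s + (s + t)) / 2 equals s + t / 2.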
--- a/content/docs/practices/rules.md
+++ b/content/docs/practices/rules.md
 ---
 title: Recording rules
-sort_rank: 5
+sort_rank: 6
 ---
 # Recording rules

--- a/content/docs/querying/basics.md
+++ b/content/docs/querying/basics.md
@@ -112,24 +112,25 @@ a `job` label set to `prometheus`:
 The `offset` modifier allows changing the time offset for individual
 instant and range vectors in a query.
-For example, the following expression returns the value of `foo` 5
+For example, the following expression returns the value of
-minutes in the past relative to the current query evaluation time:
+`http_requests_total` 5 minutes in the past relative to the current
+query evaluation time:
-    foo offset 5m
+    http_requests_total offset 5m
 Note that the `offset` modifier always needs to follow the selector
 immediately, i.e. the following would be correct:
-    sum(foo offset 5m) // GOOD.
+    sum(http_requests_total{method="GET"} offset 5m) // GOOD.
 While the following would be *incorrect*:
-    sum(foo) offset 5m // INVALID.
+    sum(http_requests_total{method="GET"}) offset 5m // INVALID.
 The same works for range vectors. This returns the 5-minutes rate that
-`foo` had a week ago:
+`http_requests_total` had a week ago:
-    rate(foo[5m] offset 1w)
+    rate(http_requests_total[5m] offset 1w)
 ## Operators

--- a/content/docs/querying/functions.md
+++ b/content/docs/querying/functions.md
@@ -30,7 +30,7 @@ exist for a given metric name and label combination.
 ## `bottomk()`
-`bottomk(k integer, v instant-vector` returns the `k` smallest elements of `v`
+`bottomk(k integer, v instant-vector)` returns the `k` smallest elements of `v`
 by sample value.
@@ -88,13 +88,17 @@ to the nearest integer.
 ## `histogram_quantile()`
-`histogram_quantile(φ float, b instant-vector)` calculates the φ-quantile (0 ≤ φ
+`histogram_quantile(φ float, b instant-vector)` calculates the
-≤ 1) from the buckets `b` of a histogram. The samples in `b` are the counts of
+φ-quantile (0 ≤ φ ≤ 1) from the buckets `b` of a
-observations in each bucket. Each value must have a label `le` where the label
+[histogram](/docs/concepts/metric_types/#histogram). (See [histograms
-value denotes the inclusive upper bound of the bucket. (Samples without such a
+and summaries](/docs/practices/histograms) for a detailed explanation
-label are ignored.) The [histogram metric
+of φ-quantiles and the usage of the histogram metric type in general.)
-type](/docs/concepts/metric_types/#histogram) automatically
+The samples in `b` are the counts of observations in each bucket. Each
-provides time series with the `_bucket` suffix and the appropriate labels.
+sample must have a label `le` where the label value denotes the
+inclusive upper bound of the bucket. (Samples without such a label are
+silently ignored.) The [histogram metric
+type](/docs/concepts/metric_types/#histogram) automatically provides
+time series with the `_bucket` suffix and the appropriate labels.
 Use the `rate()` function to specify the time window for the quantile
 calculation.
@@ -103,34 +107,29 @@ Example: A histogram metric is called `http_request_duration_seconds`. To
 calculate the 90th percentile of request durations over the last 10m, use the
 following expression:
-```
+    histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
-histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
-```
 The quantile is calculated for each label combination in
 `http_request_duration_seconds`. To aggregate, use the `sum()` aggregator
-outside of the `rate()` function. Since the `le` label is required by
+around the `rate()` function. Since the `le` label is required by
 `histogram_quantile()`, it has to be included in the `by` clause. The following
-expression aggregates quantiles by `job`:
+expression aggregates the 90th percentile by `job`:
-```
+    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
-histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
-```
 To aggregate everything, specify only the `le` label:
-```
+    histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
-histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
-```
+The `histogram_quantile()` function interpolates quantile values by
+assuming a linear distribution within a bucket. The highest bucket
-The `histogram_quantile()` interpolates quantile values by assuming a linear
+must have an upper bound of `+Inf`. (Otherwise, `NaN` is returned.) If
-distribution within a bucket. The highest bucket must have an upper bound of
+a quantile is located in the highest bucket, the upper bound of the
-`+Inf`. (Otherwise, `NaN` is returned.) If a quantile is located in the highest
+second highest bucket is returned. A lower limit of the lowest bucket
-bucket, the upper bound of the second highest bucket is returned. A lower limit
+is assumed to be 0 if the upper bound of that bucket is greater than
-of the lowest bucket is assumed to be 0 if the upper bound of that bucket is
+0. In that case, the usual linear interpolation is applied within that
-greater than 0. In that case, linar interpolation is applied within that bucket
+bucket. Otherwise, the upper bound of the lowest bucket is returned
-as usual. Otherwise, the upper bound of the lowest bucket is returned for
+for quantiles located in the lowest bucket.
-quantiles located in the lowest bucket.
 If `b` contains fewer than two buckets, `NaN` is returned. For φ < 0, `-Inf` is
 returned. For φ > 1, `+Inf` is returned.
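
Review note: the interpolation rules described in this hunk can be sketched in a few lines. This is a hedged illustration of the documented behavior, not the actual Prometheus implementation. The function name, the `(upper_bound, cumulative_count)` input format, the order of the checks, and the handling of empty histograms and empty buckets are all assumptions.

```python
import math


def histogram_quantile_sketch(phi, buckets):
    """Estimate the phi-quantile from (upper_bound, cumulative_count) pairs."""
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    buckets = sorted(buckets)
    if len(buckets) < 2:
        return math.nan
    if buckets[-1][0] != math.inf:
        return math.nan  # the highest bucket must be +Inf
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # assumption: nothing observed, nothing to estimate
    rank = phi * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, c) in enumerate(buckets) if c >= rank)
    if idx == len(buckets) - 1:
        # Quantile falls into the +Inf bucket: return the second highest bound.
        return buckets[-2][0]
    upper = buckets[idx][0]
    if idx == 0:
        if upper <= 0:
            return upper  # no lower limit can be assumed
        lower, below = 0.0, 0  # lowest bucket is assumed to start at 0
    else:
        lower, below = buckets[idx - 1]
    in_bucket = buckets[idx][1] - below
    # Assumption: an empty bucket (only reachable for rank 0) yields its lower limit.
    fraction = (rank - below) / in_bucket if in_bucket > 0 else 0.0
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * fraction
```

With the hypothetical buckets `[(0.25, 60), (1.0, 90), (math.inf, 100)]`, φ = 0.75 lands halfway through the second bucket and yields 0.625, while φ = 0.95 falls into the `+Inf` bucket and yields 1.0, the upper bound of the second highest bucket.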