Commit c532c7f2 authored by beorn7

2nd round of code reviews after major rework.

parent bf647326
@@ -5,10 +5,11 @@ sort_rank: 4
# Histograms and summaries

Histograms and summaries are more complex metric types. Not only does
a single histogram or summary create a multitude of time series, it is
also more difficult to use these metric types correctly. This section
helps you to pick and configure the appropriate metric type for your
use case.

## Library support

@@ -18,13 +19,7 @@ First of all, check the library support for
both currently only exists in the Go client library. Many libraries
support only one of the two types, or they support summaries only in a
limited fashion (lacking [quantile
calculation](#quantiles)).

## Count and sum of observations

@@ -35,16 +30,16 @@ durations or response sizes. They track the number of observations
(showing up in Prometheus as a time series with a `_count` suffix) is
inherently a counter (as described above, it only goes up). The sum of
observations (showing up as a time series with a `_sum` suffix)
behaves like a counter, too, as long as there are no negative
observations. Obviously, request durations or response sizes are
never negative. In principle, however, you can use summaries and
histograms to observe negative values (e.g. temperatures in
centigrade). In that case, the sum of observations can go down, so you
cannot apply `rate()` to it anymore.

To calculate the average request duration during the last 5 minutes
from a histogram or summary called `http_request_duration_seconds`,
use the following expression:

    rate(http_request_duration_seconds_sum[5m])
    /
    rate(http_request_duration_seconds_count[5m])
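
If the metric is collected from several instances, you will usually
want to aggregate the rates before dividing. A minimal sketch of such a
variant, assuming the series carry the usual `job` label (as in the
Apdex example below):

    sum(rate(http_request_duration_seconds_sum[5m])) by (job)
    /
    sum(rate(http_request_duration_seconds_count[5m])) by (job)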

@@ -71,17 +66,18 @@ a histogram called `http_request_duration_seconds`.

You can calculate the well-known [Apdex
score](http://en.wikipedia.org/wiki/Apdex) in a similar way. Configure
a bucket with the target request duration as the upper bound and
another bucket with the tolerated request duration (usually 4 times
the target request duration) as the upper bound. Example: The target
request duration is 300ms. The tolerable request duration is 1.2s. The
following expression yields the Apdex score for each job over the last
5 minutes:

    (
      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
    +
      sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)
    ) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job)

Note that the sum of both bucket rates is divided by 2. Since histogram
buckets are cumulative, the `le="1.2"` bucket already contains the
`le="0.3"` bucket, so this is equivalent to counting satisfied requests
in full and tolerated requests with half weight, as the Apdex score
requires.

## Quantiles

@@ -92,7 +88,7 @@ known as the median. The 0.95-quantile is the 95th percentile.

The essential difference between summaries and histograms is that summaries
calculate streaming φ-quantiles on the client side and expose them directly,
while histograms expose bucketed observation counts and the calculation of
quantiles from the buckets of a histogram happens on the server side using the
[`histogram_quantile()`
function](/docs/querying/functions/#histogram_quantile()).
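
For example, a minimal sketch of such a server-side calculation, using
the `http_request_duration_seconds` histogram from above to estimate
the 95th percentile of request durations over the last 5 minutes:

    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))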

@@ -115,8 +111,8 @@ want to display the percentage of requests served within 300ms, but
instead the 95th percentile, i.e. the request duration within which
you have served 95% of requests. To do that, you can either configure
a summary with a 0.95-quantile and (for example) a 5-minute decay
time, or you configure a histogram with a few buckets around the 300ms
mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and
`{le="0.45"}`. If your service runs replicated with a number of
instances, you will collect request durations from every single one of
them, and then you want to aggregate everything into an overall 95th

@@ -157,11 +153,11 @@ quantile gives you the impression that you are close to breaking the
SLA, but in reality, the 95th percentile is a tiny bit above 220ms,
a quite comfortable distance to your SLA.

Next step in our thought experiment: A change in backend routing
adds a fixed amount of 100ms to all request durations. Now the request
duration has its sharp spike at 320ms and almost all observations will
fall into the bucket from 300ms to 450ms. The 95th percentile is
calculated to be 442.5ms, although the correct value is close to
320ms. (`histogram_quantile()` assumes a linear distribution of
observations within a bucket, so with virtually all observations in the
300ms to 450ms bucket it reports 300ms + 0.95 × 150ms = 442.5ms.) While
you are only a tiny bit outside of your SLA, the calculated 95th
quantile looks much worse.

@@ -213,8 +209,18 @@ Two rules of thumb:
1. If you need to aggregate, choose histograms (see the aggregation
   example below).
2. Otherwise, choose a histogram if you have an idea of the range
   and distribution of values that will be observed. Choose a
   summary if you need an accurate quantile, no matter what the
   range and distribution of the values is.
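
To illustrate the first rule of thumb: histogram buckets are plain
counters, so the bucket rates from all instances of a job can be summed
up (preserving the `le` label) before the quantile is estimated, which
is not possible with the pre-computed quantiles of a summary. A minimal
sketch, assuming the `http_request_duration_seconds` histogram from the
earlier examples:

    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )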
## What can I do if my client library does not support the metric type I need?

Implement it! [Code contributions are welcome](/community/). In
general, we expect histograms to be more urgently needed than
summaries. Histograms are also easier to implement in a client
library, so if in doubt, we recommend implementing histograms
first. The reason why some libraries offer summaries but not
histograms (the Ruby client and the legacy Java client) is that
histograms are a more recent feature of Prometheus.