Merge pull request #340 from prometheus/next-release

Merge in changes for 0.17.0

Commit 191b7de7 authored Mar 03, 2016 by Fabian Reinartz

Showing with 181 additions and 83 deletions

storage.md content/docs/operating/storage.md +134 -46

rules.md content/docs/practices/rules.md +38 -29

operators.md content/docs/querying/operators.md +9 -8

--- a/content/docs/operating/storage.md
+++ b/content/docs/operating/storage.md
--- a/content/docs/practices/rules.md
+++ b/content/docs/practices/rules.md
@@ -17,8 +17,8 @@ convention.
 Recording rules should be of the general form `level:metric:operations`.
 `level` represents the aggregation level and labels of the rule output.
 `metric` is the metric name and should be unchanged other than stripping
-`_total` off counters when using `rate()`. `operations` is a list of operations
-that were applied to the metric, newest operation first. 
+`_total` off counters when using `rate()` or `irate()`. `operations` is a list
+of operations that were applied to the metric, newest operation first.

 Keeping the metric name unchanged makes it easy to know what a metric is and
 easy to find in the codebase. 
@@ -41,69 +41,78 @@ Instead keep the metric name without the `_count` or `_sum` suffix and replace
 the `rate` in the operation with `mean`. This represents the average
 observation size over that time period.

-Always specify a `by` clause with at least the `job` label when aggregating.
-This is to prevent recording rules without a `job` label, which may cause
-conflicts across different jobs.
+Always specify a `without` clause with the labels you are aggregating away.
+This is to preserve all the other labels such as `job`, which will avoid
+conflicts and give you more useful metrics and alerts.

 ## Examples

-Aggregating up requests per second:
+Aggregating up requests per second for a metric that has a `path` label:

 ```
-instance:requests:rate5m =
+instance_path:requests:rate5m =
  rate(requests_total{job="myjob"}[5m])

-job:requests:rate5m =
-  sum by (job)(instance:requests:rate5m{job="myjob"})
+path:requests:rate5m =
+  sum without (instance)(instance_path:requests:rate5m{job="myjob"})
 ```

 Calculating a request failure ratio and aggregating up to the job-level failure ratio:

 ```
-instance:request_failures:rate5m =
+instance_path:request_failures:rate5m =
  rate(request_failures_total{job="myjob"}[5m])

-instance:request_failures_per_requests:ratio_rate5m =
-    instance:request_failures:rate5m{job="myjob"}
+instance_path:request_failures_per_requests:ratio_rate5m =
+    instance_path:request_failures:rate5m{job="myjob"}
  /
-    instance:requests:rate5m{job="myjob"}
+    instance_path:requests:rate5m{job="myjob"}

-// Aggregate up numbeator and denominator, then divide.
+// Aggregate up numerator and denominator, then divide to get path-level ratio.
+path:request_failures_per_requests:ratio_rate5m =
+    sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
+  /
+    sum without (instance)(instance_path:requests:rate5m{job="myjob"})
+
+// No labels left from instrumentation or distinguishing instances,
+// so we use 'job' as the level.
 job:request_failures_per_requests:ratio_rate5m =
-    sum by (job)(instance:request_failures:rate5m{job="myjob"})
+    sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
  /
-    sum by (job)(instance:requests:rate5m{job="myjob"})
+    sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
 ```


 Calculating average latency over a time period from a Summary:

 ```
-instance:request_latency_seconds_count:rate5m =
+instance_path:request_latency_seconds_count:rate5m =
  rate(request_latency_seconds_count{job="myjob"}[5m])

-instance:request_latency_seconds_sum:rate5m =
+instance_path:request_latency_seconds_sum:rate5m =
  rate(request_latency_seconds_sum{job="myjob"}[5m])

-instance:request_latency_seconds:mean5m =
-    instance:request_latency_seconds_sum:rate5m{job="myjob"}
+instance_path:request_latency_seconds:mean5m =
+    instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
  /
-    instance:request_latency_seconds_count:rate5m{job="myjob"}
+    instance_path:request_latency_seconds_count:rate5m{job="myjob"}

 // Aggregate up numerator and denominator, then divide.
-job:request_latency_seconds:mean5m =
-    sum by (job)(instance:request_latency_seconds_sum:rate5m{job="myjob"})
+path:request_latency_seconds:mean5m =
+    sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
  /
-    sum by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+    sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
 ```

-Calculating the average query rate across instances is done using the `avg()` function:
+Calculating the average query rate across instances and paths is done using the
+`avg()` function:

 ```
 job:request_latency_seconds_count:avg_rate5m =
-  avg by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+  avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
 ```

-Notice that when aggregating the labels in the `by` clause always match up with
-the level of the output recording rule. When there is no aggregation, the
-levels always match. If this is not the case a mistake has likely been made in the rules.
+Notice that when aggregating, the labels in the `without` clause are removed
+from the level of the output metric name compared to the input metric names.
+When there is no aggregation, the levels always match. If this is not the case,
+a mistake has likely been made in the rules.
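
Note on the `by` to `without` change in the rules above: the following is a minimal Python sketch of the label handling only, not Prometheus's implementation. The `aggregate_sum` helper and the sample data are hypothetical and exist purely for illustration. It shows why `sum by (job)` loses the `path` label while `sum without (instance)` keeps both `job` and `path`.

```python
from collections import defaultdict


def aggregate_sum(samples, by=None, without=None):
    """Toy model of `sum by (...)` / `sum without (...)`.

    samples: list of (labels_dict, value) pairs.
    Exactly one of `by` or `without` must be given.
    Returns a dict mapping the retained label set (as a sorted tuple
    of (name, value) pairs) to the summed value.
    """
    if (by is None) == (without is None):
        raise ValueError("pass exactly one of by= or without=")
    groups = defaultdict(float)
    for labels, value in samples:
        if by is not None:
            # `by`: keep only the listed labels, drop everything else.
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            # `without`: drop only the listed labels, keep everything else.
            kept = {k: v for k, v in labels.items() if k not in without}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


# Hypothetical per-instance, per-path request rates.
samples = [
    ({"job": "myjob", "instance": "a", "path": "/x"}, 1.0),
    ({"job": "myjob", "instance": "b", "path": "/x"}, 2.0),
    ({"job": "myjob", "instance": "a", "path": "/y"}, 4.0),
]

by_job = aggregate_sum(samples, by={"job"})
without_instance = aggregate_sum(samples, without={"instance"})

print(by_job)            # one series, only the `job` label survives
print(without_instance)  # one series per path, `job` and `path` survive
```

In this model, the `without` result still carries `path`, which matches the `path:` level in the renamed rules, while the `by (job)` result corresponds to the `job:` level.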
--- a/content/docs/querying/operators.md
+++ b/content/docs/querying/operators.md
@@ -175,15 +175,16 @@ vector of fewer elements with aggregated values:
 * `count` (count number of elements in the vector)

 These operators can either be used to aggregate over **all** label dimensions
-or preserve distinct dimensions by including a `by`-clause.
+or preserve distinct dimensions by including a `without` or `by` clause.

-    <aggr-op>(<vector expression>) [by (<label list>)] [keep_common]
+    <aggr-op>(<vector expression>) [without|by (<label list>)] [keep_common]

-By default, labels that are not listed in the `by` clause will be dropped from
-the result vector, even if their label values are identical between all
-elements of the vector. The `keep_common` clause allows to keep those extra
-labels (labels that are identical between elements, but not in the `by`
-clause).
+`without` removes the listed labels from the result vector, while all other
+labels are preserved in the output. `by` does the opposite and drops labels that
+are not listed in the `by` clause, even if their label values are identical
+between all elements of the vector. The `keep_common` clause allows keeping
+those extra labels (labels that are identical between elements, but not in the
+`by` clause).

 Until Prometheus 0.14.0, the `keep_common` keyword was called `keeping_extra`.
 The latter is still supported, but is deprecated and will be removed at some
@@ -195,7 +196,7 @@ If the metric `http_requests_total` had time series that fan out by
 `application`, `instance`, and `group` labels, we could calculate the total
 number of seen HTTP requests per application and group over all instances via:

-    sum(http_requests_total) by (application, group)
+    sum(http_requests_total) without (instance)

 If we are just interested in the total of HTTP requests we have seen in **all**
 applications, we could simply write: