Merge pull request #328 from brian-brazil/without

Document without aggregation modifier

b48247cc · Brian Brazil · 1733b1bf · 559c589a · b48247cc · b48247cc
Commit b48247cc authored Feb 20, 2016 by Brian Brazil

Showing with 47 additions and 37 deletions

rules.md content/docs/practices/rules.md +38 -29

operators.md content/docs/querying/operators.md +9 -8

--- a/content/docs/practices/rules.md
+++ b/content/docs/practices/rules.md
@@ -17,8 +17,8 @@ convention.
 Recording rules should be of the general form `level:metric:operations`.
 `level` represents the aggregation level and labels of the rule output.
 `metric` is the metric name and should be unchanged other than stripping
-`_total` off counters when using `rate()`. `operations` is a list of operations
+`_total` off counters when using `rate()` or `irate()`. `operations` is a list
-that were applied to the metric, newest operation first. 
+of operations that were applied to the metric, newest operation first.
 Keeping the metric name unchanged makes it easy to know what a metric is and
 easy to find in the codebase. 
@@ -41,69 +41,78 @@ Instead keep the metric name without the `_count` or `_sum` suffix and replace
 the `rate` in the operation with `mean`. This represents the average
 observation size over that time period.
-Always specify a `by` clause with at least the `job` label when aggregating.
+Always specify a `without` clause with the labels you are aggregating away.
-This is to prevent recording rules without a `job` label, which may cause
+This is to preserve all the other labels such as `job`, which will avoid
-conflicts across different jobs.
+conflicts and give you more useful metrics and alerts.
 ## Examples
-Aggregating up requests per second:
+Aggregating up requests per second that have a `path` label:
 ```
-instance:requests:rate5m =
+instance_path:requests:rate5m =
  rate(requests_total{job="myjob"}[5m])
-job:requests:rate5m =
+path:requests:rate5m =
-  sum by (job)(instance:requests:rate5m{job="myjob"})
+  sum without (instance)(instance_path:requests:rate5m{job="myjob"})
 ```
 Calculating a request failure ratio and aggregating up to the job-level failure ratio:
 ```
-instance:request_failures:rate5m =
+instance_path:request_failures:rate5m =
  rate(request_failures_total{job="myjob"}[5m])
-instance:request_failures_per_requests:ratio_rate5m =
+instance_path:request_failures_per_requests:ratio_rate5m =
-    instance:request_failures:rate5m{job="myjob"}
+    instance_path:request_failures:rate5m{job="myjob"}
  /
-    instance:requests:rate5m{job="myjob"}
+    instance_path:requests:rate5m{job="myjob"}
-// Aggregate up numbeator and denominator, then divide.
+// Aggregate up numerator and denominator, then divide to get path-level ratio.
+path:request_failures_per_requests:ratio_rate5m =
+    sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
+  /
+    sum without (instance)(instance_path:requests:rate5m{job="myjob"})
+// No labels left from instrumentation or distinguishing instances,
+// so we use 'job' as the level.
 job:request_failures_per_requests:ratio_rate5m =
-    sum by (job)(instance:request_failures:rate5m{job="myjob"})
+    sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
  /
-    sum by (job)(instance:requests:rate5m{job="myjob"})
+    sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
 ```
 Calculating average latency over a time period from a Summary:
 ```
-instance:request_latency_seconds_count:rate5m =
+instance_path:request_latency_seconds_count:rate5m =
  rate(request_latency_seconds_count{job="myjob"}[5m])
-instance:request_latency_seconds_sum:rate5m =
+instance_path:request_latency_seconds_sum:rate5m =
  rate(request_latency_seconds_sum{job="myjob"}[5m])
-instance:request_latency_seconds:mean5m =
+instance_path:request_latency_seconds:mean5m =
-    instance:request_latency_seconds_sum:rate5m{job="myjob"}
+    instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
  /
-    instance:request_latency_seconds_count:rate5m{job="myjob"}
+    instance_path:request_latency_seconds_count:rate5m{job="myjob"}
 // Aggregate up numerator and denominator, then divide.
-job:request_latency_seconds:mean5m =
+path:request_latency_seconds:mean5m =
-    sum by (job)(instance:request_latency_seconds_sum:rate5m{job="myjob"})
+    sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
  /
-    sum by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+    sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
 ```
-Calculating the average query rate across instances is done using the `avg()` function:
+Calculating the average query rate across instances and paths is done using the
+`avg()` function:
 ```
 job:request_latency_seconds_count:avg_rate5m =
-  avg by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+  avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
 ```
-Notice that when aggregating the labels in the `by` clause always match up with
+Notice that when aggregating, the labels in the `without` clause are removed
-the level of the output recording rule. When there is no aggregation, the
+from the level of the output metric name compared to the input metric names.
-levels always match. If this is not the case a mistake has likely been made in the rules.
+When there is no aggregation, the levels always match. If this is not the case
+a mistake has likely been made in the rules.
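
The naming rule in the rules.md change above (labels listed in `without` disappear from the level of the output rule name) can be illustrated with a small Python model. This is only a sketch, not Prometheus code: `sum_without`, `sum_by`, and `output_level` are hypothetical helpers, series are modelled as plain (labels, value) pairs, the metric name is ignored, and the level is assumed to be label names joined by `_` (which would break for label names that themselves contain underscores).

```python
from collections import defaultdict


def sum_without(series, drop):
    """Toy model of `sum without (<drop>)`: remove the listed labels, keep the rest."""
    totals = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, v) for k, v in labels.items() if k not in drop)
        totals[key] += value
    return dict(totals)


def sum_by(series, keep):
    """Toy model of `sum by (<keep>)`: keep only the listed labels, drop the rest."""
    totals = defaultdict(float)
    for labels, value in series:
        key = frozenset((k, v) for k, v in labels.items() if k in keep)
        totals[key] += value
    return dict(totals)


def output_level(input_level, drop, fallback="job"):
    """Derive the output rule's level by removing the aggregated-away labels.

    Assumption: the level is label names joined by "_". When nothing is
    left, fall back to "job", as the diff's job-level example does.
    """
    remaining = [part for part in input_level.split("_") if part not in drop]
    return "_".join(remaining) if remaining else fallback


# Hypothetical samples for instance_path:requests:rate5m{job="myjob"}.
series = [
    ({"job": "myjob", "instance": "a", "path": "/x"}, 1.0),
    ({"job": "myjob", "instance": "b", "path": "/x"}, 2.0),
    ({"job": "myjob", "instance": "a", "path": "/y"}, 4.0),
]

per_path = sum_without(series, {"instance"})   # job and path survive
by_path_only = sum_by(series, {"path"})        # job is silently dropped
path_level = output_level("instance_path", {"instance"})
job_level = output_level("instance_path", {"instance", "path"})
```

In this model `sum_without` keeps `job` on every output series without naming it, while `sum_by` keeps it only if it is listed explicitly, which is the motivation the commit gives for preferring `without`.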
--- a/content/docs/querying/operators.md
+++ b/content/docs/querying/operators.md
@@ -175,15 +175,16 @@ vector of fewer elements with aggregated values:
 * `count` (count number of elements in the vector)
 These operators can either be used to aggregate over **all** label dimensions
-or preserve distinct dimensions by including a `by`-clause.
+or preserve distinct dimensions by including a `without` or `by` clause.
-    <aggr-op>(<vector expression>) [by (<label list>)] [keep_common]
+    <aggr-op>(<vector expression>) [without|by (<label list>)] [keep_common]
-By default, labels that are not listed in the `by` clause will be dropped from
+`without` removes the listed labels from the result vector, while all other
-the result vector, even if their label values are identical between all
+labels are preserved in the output. `by` does the opposite and drops labels that
-elements of the vector. The `keep_common` clause allows to keep those extra
+are not listed in the `by` clause, even if their label values are identical
-labels (labels that are identical between elements, but not in the `by`
+between all elements of the vector. The `keep_common` clause allows keeping
-clause).
+those extra labels (labels that are identical between elements, but not in the
+`by` clause).
 Until Prometheus 0.14.0, the `keep_common` keyword was called `keeping_extra`.
 The latter is still supported, but is deprecated and will be removed at some
@@ -195,7 +196,7 @@ If the metric `http_requests_total` had time series that fan out by
 `application`, `instance`, and `group` labels, we could calculate the total
 number of seen HTTP requests per application and group over all instances via:
-    sum(http_requests_total) by (application, group)
+    sum(http_requests_total) without (instance)
 If we are just interested in the total of HTTP requests we have seen in **all**
 applications, we could simply write: