Commit 191b7de7, authored Mar 03, 2016 by Fabian Reinartz
Merge pull request #340 from prometheus/next-release

Merge in changes for 0.17.0

Parents: 3eb54103, b48247cc
Showing 3 changed files with 181 additions and 83 deletions.
content/docs/operating/storage.md (+134, -46)
content/docs/practices/rules.md (+38, -29)
content/docs/querying/operators.md (+9, -8)

content/docs/operating/storage.md
...
@@ -32,51 +32,6 @@ metrics `prometheus_local_storage_memory_chunks` and
 will come in handy. As a rule of thumb, you should have at least three
 times more RAM available than needed by the memory chunks alone.
 
-LevelDB is essentially dealing with data on disk and relies on the
-disk caches of the operating system for optimal performance. However,
-it maintains in-memory caches, whose size you can configure for each
-index via the following flags:
-
-* `storage.local.index-cache-size.fingerprint-to-metric`
-* `storage.local.index-cache-size.fingerprint-to-timerange`
-* `storage.local.index-cache-size.label-name-to-label-values`
-* `storage.local.index-cache-size.label-pair-to-fingerprints`
-
-## Disk usage
-
-Prometheus stores its on-disk time series data under the directory
-specified by the flag `storage.local.path`. The default path is
-`./data`, which is good to try something out quickly but most
-likely not what you want for actual operations. The flag
-`storage.local.retention` allows you to configure the retention time
-for samples. Adjust it to your needs and your available disk space.
-
-## Settings for high numbers of time series
-
-Prometheus can handle millions of time series. However, you have to
-adjust the storage settings for that. Essentially, you want to allow a
-certain number of chunks for each time series to be kept in RAM. The
-default value for the `storage.local.memory-chunks` flag (discussed
-above) is 1048576. Up to about 300,000 series, you still have three
-chunks available per series on average. For more series, you should
-increase the `storage.local.memory-chunks` value. Three times the
-number of series is a good first approximation. But keep the
-implication for memory usage (see above) in mind.
-
-Even more important is raising the value for the
-`storage.local.max-chunks-to-persist` flag at the same time. As a rule
-of thumb, keep it somewhere between 50% and 100% of the
-`storage.local.memory-chunks` value. The main drawback of a high value
-is larger checkpoints. The consequences of a value too low are much
-more serious.
-
-Out of the metrics that Prometheus exposes about itself, the following are
-particularly useful for tuning the flags above:
-
-* `prometheus_local_storage_memory_series`: The current number of series held in memory.
-* `prometheus_local_storage_memory_chunks`: The current number of chunks held in memory.
-* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks that still need to be persisted to disk.
-
 PromQL queries that involve a high number of time series will make heavy use of
 the LevelDB backed indices. If you need to run queries of that kind, tweaking
 the index cache sizes might be required. The following flags are relevant:
...
@@ -92,7 +47,140 @@ the index cache sizes might be required. The following flags are relevant:
 completely.
 
 You have to experiment with the flag values to find out what helps. If a query
-touches 100,000+ time series, hundreds of MiB might be reasonable.
+touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
+plenty of free memory available, using more of it for LevelDB cannot harm.
+
+## Disk usage
+
+Prometheus stores its on-disk time series data under the directory specified by
+the flag `storage.local.path`. The default path is `./data` (relative to the
+working directory), which is good to try something out quickly but most likely
+not what you want for actual operations. The flag `storage.local.retention`
+allows you to configure the retention time for samples. Adjust it to your needs
+and your available disk space.
+
+## Settings for high numbers of time series
+
+Prometheus can handle millions of time series. However, you have to adjust the
+storage settings to handle much more than 100,000 active time
+series. Essentially, you want to allow a certain number of chunks for each time
+series to be kept in RAM. The default value for the
+`storage.local.memory-chunks` flag (discussed above) is 1048576. Up to about
+300,000 series, you still have three chunks available per series on
+average. For more series, you should increase the `storage.local.memory-chunks`
+value. Three times the number of series is a good first approximation. But keep
+the implication for memory usage (see above) in mind.
+
+If you have more active time series than configured memory chunks, Prometheus
+will inevitably run into a situation where it has to keep more chunks in memory
+than configured. If the number of chunks goes more than 10% above the
+configured limit, Prometheus will throttle ingestion of more samples (by
+skipping scrapes and rule evaluations) until the configured value is exceeded
+by less than 5%. _Throttled ingestion is really bad for various reasons. You
+really do not want to be in that situation._
+
+Equally important, especially if writing to a spinning disk, is raising the
+value for the `storage.local.max-chunks-to-persist` flag. As a rule of thumb,
+keep it around 50% of the `storage.local.memory-chunks` value.
+`storage.local.max-chunks-to-persist` controls how many chunks can be waiting
+to be written to your storage device, be it a spinning disk or an SSD (which
+contains neither a disk nor a drive motor, but we will refer to it as “disk”
+for the sake of simplicity). If that number of waiting chunks is exceeded,
+Prometheus will once more throttle sample ingestion until the number has
+dropped to 95% of the configured value. Before that happens, Prometheus will
+try to speed up persisting chunks. See the
+[section about persistence pressure](#persistence-pressure-and-rushed-mode)
+below.
+
+The more chunks you can keep in memory per time series, the more write
+operations can be batched, which is especially important for spinning
+disks. Note that each active time series will have an incomplete head chunk,
+which cannot be persisted yet. It is a chunk in memory, but not a “chunk to
+persist” yet. If you have 1M active time series, you need 3M
+`storage.local.memory-chunks` to have three chunks for each series
+available. Only 2M of those are persistable, so setting
+`storage.local.max-chunks-to-persist` to more than 2M can easily lead to more
+than 3M chunks in memory, despite the setting for
+`storage.local.memory-chunks`, which again will lead to the dreaded throttling
+of ingestion (but Prometheus will try its best to speed up persisting of chunks
+before it happens).
+
+The other drawback of a high number of chunks waiting for persistence is larger
+checkpoints.
+
+## Persistence pressure and “rushed mode”
+
+Naively, Prometheus would try to persist every completed chunk to disk as soon
+as possible. Such a strategy would lead to many tiny write operations, using up
+most of the I/O bandwidth and keeping the server quite busy. Spinning disks are
+more sensitive here, but even SSDs will not like it. Prometheus tries instead
+to batch up write operations as much as possible, which works better if it is
+allowed to use more memory. Setting the flags described above to values that
+lead to full utilization of the available memory is therefore crucial for high
+performance.
+
+Prometheus will also sync series files after each write (with
+`storage.local.series-sync-strategy=adaptive`, which is the default) and use
+the disk bandwidth for more frequent checkpoints (based on the count of “dirty
+series”, see [below](#crash-recovery)), both attempting to minimize data loss
+in case of a crash.
+
+But what to do if the number of chunks waiting for persistence grows too much?
+Prometheus calculates a score for the urgency to persist chunks, which depends
+on the number of chunks waiting for persistence in relation to the
+`storage.local.max-chunks-to-persist` value and on how much the number of
+chunks in memory exceeds the `storage.local.memory-chunks` value (if at all,
+and only if there is a minimum number of chunks waiting for persistence so that
+faster persisting of chunks can help at all). The score is between 0 and 1,
+where 1 corresponds to the highest urgency. Depending on the score, Prometheus
+will write to disk more frequently. Should the score ever pass the threshold
+of 0.8, Prometheus enters “rushed mode” (which you can see in the logs). In
+rushed mode, the following strategies are applied to speed up persisting chunks:
+
+* Series files are not synced after write operations anymore (making better use
+  of the OS's page cache at the price of an increased risk of losing data in
+  case of a server crash – this behavior can be overridden with the flag
+  `storage.local.series-sync-strategy`).
+* Checkpoints are only created as often as configured via the
+  `storage.local.checkpoint-interval` flag (freeing more disk bandwidth for
+  persisting chunks at the price of more data loss in case of a crash and an
+  increased time to run the subsequent crash recovery).
+* Write operations to persist chunks are not throttled anymore and are
+  performed as fast as possible.
+
+Prometheus leaves rushed mode once the score has dropped below 0.7.
+
+## Settings for very long retention time
+
+If you have set a very long retention time via the `storage.local.retention`
+flag (more than a month), you might want to increase the flag value
+`storage.local.series-file-shrink-ratio`.
+
+Whenever Prometheus needs to cut off some chunks from the beginning of a series
+file, it will simply rewrite the whole file. (Some file systems support “head
+truncation”, which Prometheus currently does not use for several reasons.) To
+not rewrite a very large series file to get rid of very few chunks, the rewrite
+only happens if at least 10% of the chunks in the series file are removed. This
+value can be changed via the mentioned `storage.local.series-file-shrink-ratio`
+flag. If you have a lot of disk space but want to minimize rewrites (at the
+cost of wasted disk space), increase the flag value, e.g. to 0.3 for 30% of
+required chunk removal.
+
+## Helpful metrics
+
+Out of the metrics that Prometheus exposes about itself, the following are
+particularly useful for tuning the flags above:
+
+* `prometheus_local_storage_memory_series`: The current number of series held
+  in memory.
+* `prometheus_local_storage_memory_chunks`: The current number of chunks held
+  in memory.
+* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
+  that still need to be persisted to disk.
+* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
+  discussed [above](#persistence-pressure-and-rushed-mode).
+* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
+  mode”, 0 otherwise.
 
 ## Crash recovery
...
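As a quick way to sanity-check the sizing guidance added above against a running server, the self-metrics listed in the new "Helpful metrics" section can be compared directly in PromQL. A minimal sketch (metric names and thresholds are taken from the section above; the exact queries are illustrative):

```
# Average number of in-memory chunks per in-memory series;
# the section above suggests keeping roughly three chunks per series.
prometheus_local_storage_memory_chunks
  / prometheus_local_storage_memory_series

# Chunks currently queued for persistence; compare this against the
# value you configured for storage.local.max-chunks-to-persist.
prometheus_local_storage_chunks_to_persist

# Persistence pressure: 1 is the highest urgency, and "rushed mode"
# is entered once the score passes 0.8.
prometheus_local_storage_persistence_urgency_score
```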
content/docs/practices/rules.md
...
@@ -17,8 +17,8 @@ convention.
 Recording rules should be of the general form `level:metric:operations`.
 `level` represents the aggregation level and labels of the rule output.
 `metric` is the metric name and should be unchanged other than stripping
-`_total` off counters when using `rate()`. `operations` is a list of operations
-that were applied to the metric, newest operation first.
+`_total` off counters when using `rate()` or `irate()`. `operations` is a list
+of operations that were applied to the metric, newest operation first.
 
 Keeping the metric name unchanged makes it easy to know what a metric is and
 easy to find in the codebase.
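To make the updated convention concrete, here is a minimal sketch of a recording rule in the `level:metric:operations` form, using `irate()` as mentioned above. The input metric `requests_total` and the `instance_path` level are taken from the examples in the next hunk; the `irate5m` operation suffix is illustrative and not part of this commit:

```
// level:      instance_path (the labels the output keeps)
// metric:     requests (the _total suffix is stripped because irate() is used)
// operations: irate5m (newest operation first)
instance_path:requests:irate5m =
  irate(requests_total{job="myjob"}[5m])
```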
...
@@ -41,69 +41,78 @@ Instead keep the metric name without the `_count` or `_sum` suffix and replace
 the `rate` in the operation with `mean`. This represents the average
 observation size over that time period.
 
-Always specify a `by` clause with at least the `job` label when aggregating.
-This is to prevent recording rules without a `job` label, which may cause
-conflicts across different jobs.
+Always specify a `without` clause with the labels you are aggregating away.
+This is to preserve all the other labels such as `job`, which will avoid
+conflicts and give you more useful metrics and alerts.
 
 ## Examples
 
-Aggregating up requests per second:
+Aggregating up requests per second for a metric that has a `path` label:
 
 ```
-instance:requests:rate5m =
+instance_path:requests:rate5m =
   rate(requests_total{job="myjob"}[5m])
 
-job:requests:rate5m =
-  sum by (job)(instance:requests:rate5m{job="myjob"})
+path:requests:rate5m =
+  sum without (instance)(instance_path:requests:rate5m{job="myjob"})
 ```
 
 Calculating a request failure ratio and aggregating up to the job-level failure ratio:
 
 ```
-instance:request_failures:rate5m =
+instance_path:request_failures:rate5m =
   rate(request_failures_total{job="myjob"}[5m])
 
-instance:request_failures_per_requests:ratio_rate5m =
-  instance:request_failures:rate5m{job="myjob"}
+instance_path:request_failures_per_requests:ratio_rate5m =
+  instance_path:request_failures:rate5m{job="myjob"}
 /
-  instance:requests:rate5m{job="myjob"}
+  instance_path:requests:rate5m{job="myjob"}
 
-// Aggregate up numbeator and denominator, then divide.
+// Aggregate up numerator and denominator, then divide to get path-level ratio.
+path:request_failures_per_requests:ratio_rate5m =
+  sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
+/
+  sum without (instance)(instance_path:requests:rate5m{job="myjob"})
+
+// No labels left from instrumentation or distinguishing instances,
+// so we use 'job' as the level.
 job:request_failures_per_requests:ratio_rate5m =
-  sum by (job)(instance:request_failures:rate5m{job="myjob"})
+  sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
 /
-  sum by (job)(instance:requests:rate5m{job="myjob"})
+  sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})
 ```
 
 Calculating average latency over a time period from a Summary:
 
 ```
-instance:request_latency_seconds_count:rate5m =
+instance_path:request_latency_seconds_count:rate5m =
   rate(request_latency_seconds_count{job="myjob"}[5m])
 
-instance:request_latency_seconds_sum:rate5m =
+instance_path:request_latency_seconds_sum:rate5m =
   rate(request_latency_seconds_sum{job="myjob"}[5m])
 
-instance:request_latency_seconds:mean5m =
-  instance:request_latency_seconds_sum:rate5m{job="myjob"}
+instance_path:request_latency_seconds:mean5m =
+  instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
 /
-  instance:request_latency_seconds_count:rate5m{job="myjob"}
+  instance_path:request_latency_seconds_count:rate5m{job="myjob"}
 
 // Aggregate up numerator and denominator, then divide.
-job:request_latency_seconds:mean5m =
-  sum by (job)(instance:request_latency_seconds_sum:rate5m{job="myjob"})
+path:request_latency_seconds:mean5m =
+  sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
 /
-  sum by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+  sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})
 ```
 
-Calculating the average query rate across instances is done using the `avg()` function:
+Calculating the average query rate across instances and paths is done using the
+`avg()` function:
 
 ```
 job:request_latency_seconds_count:avg_rate5m =
-  avg by (job)(instance:request_latency_seconds_count:rate5m{job="myjob"})
+  avg without (instance, path)(instance:request_latency_seconds_count:rate5m{job="myjob"})
 ```
 
-Notice that when aggregating the labels in the `by` clause always match up with
-the level of the output recording rule. When there is no aggregation, the
-levels always match. If this is not the case a mistake has likely been made in the rules.
+Notice that when aggregating, the labels in the `without` clause are removed
+from the level of the output metric name compared to the input metric names.
+When there is no aggregation, the levels always match. If this is not the case
+a mistake has likely been made in the rules.
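A quick way to apply the closing advice from this hunk is to check that the labels named in `without` are exactly what disappears from the level of the output name. An annotated restatement of one of the rules above (the comments are editorial, not part of the commit):

```
// input level:  instance_path  (carries the instance and path labels)
// aggregation:  sum without (instance)
// output level: path           (instance removed, path preserved)
path:requests:rate5m =
  sum without (instance)(instance_path:requests:rate5m{job="myjob"})
```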
content/docs/querying/operators.md
...
@@ -175,15 +175,16 @@ vector of fewer elements with aggregated values:
 * `count` (count number of elements in the vector)
 
 These operators can either be used to aggregate over **all** label dimensions
-or preserve distinct dimensions by including a `by`-clause.
+or preserve distinct dimensions by including a `without` or `by` clause.
 
-    <aggr-op>(<vector expression>) [by (<label list>)] [keep_common]
+    <aggr-op>(<vector expression>) [without|by (<label list>)] [keep_common]
 
-By default, labels that are not listed in the `by` clause will be dropped from
-the result vector, even if their label values are identical between all
-elements of the vector. The `keep_common` clause allows to keep those extra
-labels (labels that are identical between elements, but not in the `by`
-clause).
+`without` removes the listed labels from the result vector, while all other
+labels are preserved in the output. `by` does the opposite and drops labels that
+are not listed in the `by` clause, even if their label values are identical
+between all elements of the vector. The `keep_common` clause allows you to keep
+those extra labels (labels that are identical between elements, but not in the
+`by` clause).
 
 Until Prometheus 0.14.0, the `keep_common` keyword was called `keeping_extra`.
 The latter is still supported, but is deprecated and will be removed at some
...
@@ -195,7 +196,7 @@ If the metric `http_requests_total` had time series that fan out by
 `application`, `instance`, and `group` labels, we could calculate the total
 number of seen HTTP requests per application and group over all instances via:
 
-    sum(http_requests_total) by (application, group)
+    sum(http_requests_total) without (instance)
 
 If we are just interested in the total of HTTP requests we have seen in **all**
 applications, we could simply write:
...
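For reference, the forms allowed by the updated aggregation grammar look like this when applied to the `http_requests_total` example above. This is an illustrative sketch; the `keep_common` variant simply fills the `[keep_common]` slot from the syntax line in the first hunk and keeps labels whose values are identical across all aggregated elements:

```
# Drop only the instance label; all other labels are preserved.
sum(http_requests_total) without (instance)

# Keep only application and group; all other labels are dropped.
sum(http_requests_total) by (application, group)

# As above, but additionally keep labels whose values are identical
# across all elements of the aggregated vector.
sum(http_requests_total) by (application, group) keep_common
```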