Commit 4e15503b authored by Björn Rabenstein, committed by GitHub

Merge pull request #706 from prometheus/beorn7/storage

Update storage documentation for the Prometheus 1.6 release
parents c2ef91ea d670cb17
......@@ -442,8 +442,6 @@ files:
Where `<filename_pattern>` may be a path ending in `.json`, `.yml` or `.yaml`. The last path segment
may contain a single `*` that matches any character sequence, e.g. `my/path/tg_*.json`.
NOTE: Prior to v0.20, `names:` was used instead of `files:`.
### `<gce_sd_config>`
CAUTION: GCE SD is in beta: breaking changes to configuration are still
......@@ -1063,4 +1061,4 @@ tls_config:
# Optional proxy URL.
[ proxy_url: <string> ]
```
\ No newline at end of file
```
......@@ -12,43 +12,47 @@ sample data, it has its own custom storage layer, which organizes
sample data in chunks of constant size (1024 bytes payload). These
chunks are then stored on disk in one file per time series.
## Memory usage
Prometheus keeps all the currently used chunks in memory. In addition,
it keeps the most recently used chunks in memory up to a threshold
configurable via the `storage.local.memory-chunks` flag. If you have a
lot of RAM available, you might want to increase it above the default
value of 1048576 (and vice versa, if you run into RAM problems, you
can try to decrease it). Note that the actual RAM usage of your server
will be much higher than what you would expect from multiplying
`storage.local.memory-chunks` by 1024 bytes. There is inevitable
overhead for managing the sample data in the storage layer. Also, your
server is doing many more things than just storing samples. The actual
overhead depends on your usage pattern. In extreme cases, Prometheus
has to keep more chunks in memory than configured because all those
chunks are in use at the same time. You have to experiment a bit. The
metrics `prometheus_local_storage_memory_chunks` and
`process_resident_memory_bytes`, exported by the Prometheus server,
will come in handy. As a rule of thumb, you should have at least three
times more RAM available than needed by the memory chunks alone.
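As a rough sketch, raising the limit on a host with plenty of RAM could look
like this (the number is a placeholder to experiment with, not a
recommendation):
```
# Hypothetical example: allow roughly 3M chunks (about 3GiB of raw chunk
# payload, plus considerable overhead) to be kept in memory.
prometheus -storage.local.memory-chunks=3145728
```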
This section deals with the various configuration settings and issues you
might run into. To dive deeper into the topic, check out the following talks:
* [The Prometheus Time Series Database](https://www.youtube.com/watch?v=HbnGSNEjhUc).
* [Configuring Prometheus for High Performance](https://www.youtube.com/watch?v=zop94VbeFeU).
PromQL queries that involve a high number of time series will make heavy use of
the LevelDB-backed indexes. If you need to run queries of that kind, tweaking
the index cache sizes might be required. The following flags are relevant:
## Memory usage
* `-storage.local.index-cache-size.label-name-to-label-values`: For regular
expression matching.
* `-storage.local.index-cache-size.label-pair-to-fingerprints`: Increase the
size if a large number of time series share the same label pair or name.
* `-storage.local.index-cache-size.fingerprint-to-metric` and
`-storage.local.index-cache-size.fingerprint-to-timerange`: Increase the size
if you have a large number of archived time series, i.e. series that have not
received samples in a while but are still not old enough to be purged
completely.
You have to experiment with the flag values to find out what helps. If a query
touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
plenty of free memory available, using more of it for LevelDB cannot harm.
Prometheus keeps all the currently used chunks in memory. In addition, it keeps
as many most recently used chunks in memory as possible. You have to tell
Prometheus how much memory it may use for this caching. The flag
`storage.local.target-heap-size` allows you to set the heap size (in bytes)
Prometheus aims not to exceed. Note that the amount of physical memory the
Prometheus server will use is the result of complex interactions between the Go
runtime and the operating system and is therefore very hard to predict
precisely. As a rule
of thumb, you should have at least 50% headroom in physical memory over the
configured heap size. (Or, in other words, set `storage.local.target-heap-size`
to a value of two thirds of the physical memory limit Prometheus should not
exceed.)
The default value of `storage.local.target-heap-size` is 2GiB and thus tailored
to 3GiB of physical memory usage. If you have less physical memory available,
you have to lower the flag value. If you have more memory available, you should
raise the value accordingly. Otherwise, Prometheus will not make use of the
memory and thus will perform much worse than it could.
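For illustration only (the numbers are assumptions, not recommendations): on a
host where Prometheus may use up to 12GiB of physical memory, two thirds of
that is 8GiB, so the invocation could look like this:
```
# Hypothetical sizing: target heap size of 8GiB (in bytes) for a physical
# memory budget of 12GiB.
prometheus -storage.local.target-heap-size=8589934592
```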
Because Prometheus uses most of its heap for long-lived allocations of memory
chunks, the
[garbage collection target percentage](https://golang.org/pkg/runtime/debug/#SetGCPercent)
is set to 40 by default. You can still override this setting via the `GOGC`
environment variable as usual.
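A sketch of such an override, with an arbitrary example value:
```
# Hypothetical: override the built-in GC target percentage of 40; a higher
# value means fewer GC runs at the cost of a larger heap.
GOGC=80 prometheus -storage.local.target-heap-size=8589934592
```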
For high-performance set-ups, you might need to adjust more flags. Please read
through the sections below for details.
NOTE: Prior to v1.6, there was no flag `storage.local.target-heap-size`.
Instead, the number of chunks kept in memory had to be configured using the
flags `storage.local.memory-chunks` and `storage.local.max-chunks-to-persist`.
These flags still exist for compatibility reasons. However,
`storage.local.max-chunks-to-persist` has no effect anymore, and if
`storage.local.memory-chunks` is set to a non-zero value _x_, it is used to
override the value for `storage.local.target-heap-size` to 3072*_x_.
## Disk usage
......@@ -101,55 +105,81 @@ experimenting with your own use case is encouraged. Take into account, however,
that only newly created chunks will use the newly selected chunk encoding, so
it will take a while until you see the effects.
For more details about the trade-off between the chunk encodings, see
[this blog post](/blog/2016/05/08/when-to-use-varbit-chunks/).
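As a sketch, switching new chunks to varbit encoding could look like the
following. Note that the flag name is not shown in the excerpt above and is
quoted here from memory, so verify it against `prometheus -h` for your version:
```
# Hypothetical: use varbit encoding (type 2) for all newly created chunks.
# Chunks already on disk keep their current encoding.
prometheus -storage.local.chunk-encoding-version=2
```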
## Settings for high numbers of time series
Prometheus can handle millions of time series. However, you have to adjust the
storage settings to handle much more than 100,000 active time
series. Essentially, you want to allow a certain number of chunks for each time
series to be kept in RAM. The default value for the
`storage.local.memory-chunks` flag (discussed above) is 1048576. Up to about
300,000 series, you still have three chunks available per series on
average. For more series, you should increase the `storage.local.memory-chunks`
value. Three times the number of series is a good first approximation. But keep
the implication for memory usage (see above) in mind.
If you have more active time series than configured memory chunks, Prometheus
will inevitably run into a situation where it has to keep more chunks in memory
than configured. If the number of chunks goes more than 10% above the
configured limit, Prometheus will throttle ingestion of more samples (by
skipping scrapes and rule evaluations) until the configured value is exceeded
by less than 5%. _Throttled ingestion is really bad for various reasons. You
really do not want to be in that situation._
Equally important, especially if writing to a spinning disk, is raising the
value for the `storage.local.max-chunks-to-persist` flag. As a rule of thumb,
keep it around 50% of the `storage.local.memory-chunks`
value. `storage.local.max-chunks-to-persist` controls how many chunks can be
waiting to be written to your storage device, be it a spinning disk or an SSD
(which contains neither a disk nor a drive motor, but we will refer to it as
“disk” for the sake of simplicity). If that number of waiting chunks is
exceeded, Prometheus will once more throttle sample ingestion until the number
has dropped to 95% of the configured value. Before that happens, Prometheus
will try to speed up persisting chunks. See the
[section about persistence pressure](#persistence-pressure-and-rushed-mode)
below.
The more chunks you can keep in memory per time series, the more write
operations can be batched, which is especially important for spinning
disks. Note that each active time series will have an incomplete head chunk,
which cannot be persisted yet. It is a chunk in memory, but not a “chunk to
persist” yet. If you have 1M active time series, you need 3M
`storage.local.memory-chunks` to have three chunks for each series
available. Only 2M of those can be persistable, so setting
`storage.local.max-chunks-to-persist` to more than 2M can easily lead to more
than 3M chunks in memory, despite the setting for
`storage.local.memory-chunks`, which again will lead to the dreaded throttling
of ingestion (but Prometheus will try its best to speed up persisting of chunks
before it happens).
The other drawback of a high value of chunks waiting for persistence is larger
checkpoints.
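Following the rules of thumb above for an assumed 1M active series (all
numbers are illustrative):
```
# Hypothetical pre-1.6 sizing for ~1M active series: three memory chunks per
# series, about half of which may wait for persistence.
prometheus -storage.local.memory-chunks=3000000 \
  -storage.local.max-chunks-to-persist=1500000
```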
Prometheus can handle millions of time series. However, with the
above-mentioned default setting for `storage.local.target-heap-size`, you will
be limited to about 200,000 time series simultaneously present in memory. For more
series, you need more memory, and you need to configure Prometheus to make use
of it as described above.
Each of the aforementioned chunks contains samples of a single time series. A
time series is thus represented as a series of chunks, which ultimately end up
in a time series file (one file per time series) on disk.
A series that has recently received new samples will have an open incomplete
_head chunk_. Once that chunk is completely filled, or the series hasn't
received samples in a while, the head chunk is closed and becomes a chunk
waiting to be appended to its corresponding series file, i.e. it is _waiting
for persistence_. After the chunk has been persisted to disk, it becomes
_evictable_, provided it is not currently used by a query. Prometheus will
evict evictable chunks from memory to satisfy the configured target heap
size. A series with an open head chunk is called an _active series_. This is
different from a _memory series_, which also includes series that have no open
head chunk but still have other chunks in memory (whether waiting for
persistence, used in a query, or evictable). A series without any chunks in
memory may be _archived_, upon which it ceases to have any mandatory memory
footprint.
The number of chunks Prometheus can keep in memory depends on the flag value
for `storage.local.target-heap-size` and on the amount of memory used by
everything else. If not enough chunks are evictable to satisfy the target
heap size, Prometheus will throttle ingestion of more samples (by skipping
scrapes and rule evaluations) until the heap has shrunk enough. _Throttled
ingestion is really bad for various reasons. You really do not want to be in
that situation._
Open head chunks, chunks still waiting for persistence, and chunks being used
in a query are not evictable. Thus, the reasons for the inability to evict
enough chunks include the following:
1. Queries that use too many chunks.
2. Chunks are piling up waiting for persistence because the storage layer
cannot keep up writing chunks.
3. There are too many active time series, which results in too many open head
chunks.
Currently, Prometheus has no defence against case (1). Abusive queries will
essentially OOM the server.
To defend against case (2), there is a concept of persistence urgency explained
in the next section.
Case (3) depends on the targets you monitor. To mitigate an unplanned explosion
of the number of series, you can limit the number of samples per individual
scrape (see `sample_limit` in the
[scrape config](https://prometheus.io/docs/operating/configuration/#scrape_config)).
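A minimal sketch of such a limit in the scrape configuration (job name, target,
and limit value are placeholders):
```
scrape_configs:
  - job_name: 'node'      # hypothetical job
    sample_limit: 5000    # the whole scrape fails if it returns more samples
    static_configs:
      - targets: ['localhost:9100']   # hypothetical target
```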
If the number of active time series exceeds the number of memory chunks the
Prometheus server can afford, the server will quickly throttle ingestion as
described above. The only way out of this is to give Prometheus more RAM or to
reduce the number of time series to ingest.
In fact, you want many more memory chunks than you have series in
memory. Prometheus tries to batch up disk writes as much as possible as it
helps for both HDD (write as much as possible after each seek) and SSD (tiny
writes create write amplification, which limits the effective throughput and
burns much more quickly through the lifetime of the device). The more
Prometheus can batch up writes, the more efficient the process of persisting
chunks to disk becomes, which helps case (2).
In conclusion, to keep the Prometheus server healthy, make sure it has plenty
of headroom of memory chunks available for the number of memory series. A
factor of three is a good starting point. Refer to the
[section about helpful metrics](#helpful-metrics) to find out what to look
for. A very broad rule of thumb for an upper limit of memory series is the
total available physical memory divided by 10,000, e.g. about 6M memory series
on a 64GiB server.
If you combine a high number of time series with very fast and/or large
scrapes, the number of pre-allocated mutexes for series locking might not be
......@@ -158,16 +188,35 @@ or processing expensive queries, try increasing the value of the
`storage.local.num-fingerprint-mutexes` flag. Sometimes tens of thousands or
even more are required.
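For example (the value is an arbitrary illustration):
```
# Hypothetical: pre-allocate 65536 mutexes for series locking.
prometheus -storage.local.num-fingerprint-mutexes=65536
```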
## Persistence pressure and “rushed mode”
PromQL queries that involve a high number of time series will make heavy use of
the LevelDB-backed indexes. If you need to run queries of that kind, tweaking
the index cache sizes might be required. The following flags are relevant:
* `-storage.local.index-cache-size.label-name-to-label-values`: For regular
expression matching.
* `-storage.local.index-cache-size.label-pair-to-fingerprints`: Increase the
size if a large number of time series share the same label pair or name.
* `-storage.local.index-cache-size.fingerprint-to-metric` and
`-storage.local.index-cache-size.fingerprint-to-timerange`: Increase the size
if you have a large number of archived time series, i.e. series that have not
received samples in a while but are still not old enough to be purged
completely.
You have to experiment with the flag values to find out what helps. If a query
touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
plenty of memory available, using more of it for LevelDB cannot harm. More
memory for LevelDB will effectively reduce the number of memory chunks
Prometheus can afford.
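A sketch of raising two of the cache sizes (the byte values are placeholders to
experiment with):
```
# Hypothetical: give two of the index caches 64MiB and 128MiB, respectively.
# All sizes are given in bytes.
prometheus \
  -storage.local.index-cache-size.label-name-to-label-values=67108864 \
  -storage.local.index-cache-size.label-pair-to-fingerprints=134217728
```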
## Persistence urgency and “rushed mode”
Naively, Prometheus would try to persist every completed chunk to disk as soon
as possible. Such a strategy would lead to many tiny write operations, using up
most of the I/O bandwidth and keeping the server quite busy. Spinning
disks are more sensitive here, but even SSDs will not like it. Prometheus tries
instead to batch up write operations as much as possible, which works better if
it is allowed to use more memory. Setting the flags described above to values
that lead to full utilization of the available memory is therefore crucial for
high performance.
disks will appear to be very slow because of the many slow seeks required, and
SSDs will suffer from write amplification. Prometheus tries instead to batch up
write operations as much as possible, which works better if it is allowed to
use more memory.
Prometheus will also sync series files after each write (with
`storage.local.series-sync-strategy=adaptive`, which is the default) and use
......@@ -176,16 +225,12 @@ series”, see [below](#crash-recovery)), both attempting to minimize data loss
in case of a crash.
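If syncing after each write turns out to be the bottleneck, it can be relaxed
at the price of crash safety; the value `never` below is quoted from memory, so
check it against the flag help of your version:
```
# Hypothetical: never sync series files after writes, accepting a higher risk
# of data loss in a crash in exchange for more I/O throughput.
prometheus -storage.local.series-sync-strategy=never
```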
But what to do if the number of chunks waiting for persistence grows too much?
Prometheus calculates a score for urgency to persist chunks, which depends on
the number of chunks waiting for persistence in relation to the
`storage.local.max-chunks-to-persist` value and on how much the number of
chunks in memory exceeds the `storage.local.memory-chunks` value (if at all,
and only if there is a minimum number of chunks waiting for persistence so that
faster persisting of chunks can help at all). The score is between 0 and 1,
where 1 corresponds to the highest urgency. Depending on the score, Prometheus
will write to disk more frequently. Should the score ever pass the threshold
of 0.8, Prometheus enters “rushed mode” (which you can see in the logs). In
rushed mode, the following strategies are applied to speed up persisting chunks:
Prometheus calculates a score for urgency to persist chunks. The score is
between 0 and 1, where 1 corresponds to the highest urgency. Depending on the
score, Prometheus will write to disk more frequently. Should the score ever
pass the threshold of 0.8, Prometheus enters “rushed mode” (which you can see
in the logs). In rushed mode, the following strategies are applied to speed up
persisting chunks:
* Series files are not synced after write operations anymore (making better use
of the OS's page cache at the price of an increased risk of losing data in
......@@ -200,6 +245,13 @@ rushed mode, the following strategies are applied to speed up persisting chunks:
Prometheus leaves rushed mode once the score has dropped below 0.7.
Throttling of ingestion happens if the urgency score reaches 1. Thus, the
rushed mode is not _per se_ something to be avoided. It is, on the contrary, a
measure the Prometheus server takes to avoid the really bad situation of
throttled ingestion. Occasionally entering rushed mode is OK, if it helps and
ultimately leads to leaving rushed mode again. _If rushed mode is entered but
the urgency score still goes up, the server has a real problem._
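To get warned before the server reaches the point of throttling, one could
watch the urgency score; a sketch in the Prometheus 1.x rule syntax, with
placeholder threshold and duration:
```
# Hypothetical alerting rule: an urgency score that stays very high for a
# prolonged time suggests the server is heading towards throttled ingestion.
ALERT PersistenceUrgencyHigh
  IF prometheus_local_storage_persistence_urgency_score > 0.9
  FOR 15m
  LABELS { severity = "warning" }
```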
## Settings for very long retention time
If you have set a very long retention time via the `storage.local.retention`
......@@ -216,33 +268,26 @@ flag. If you have a lot of disk space but want to minimize rewrites (at the
cost of wasted disk space), increase the flag value to higher values, e.g. 0.3
for 30% of required chunk removal.
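For reference, a long retention period is set like this (the duration is an
arbitrary example):
```
# Hypothetical: keep samples for roughly one year instead of the default.
prometheus -storage.local.retention=8760h
```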
## Helpful metrics
Out of the metrics that Prometheus exposes about itself, the following are
particularly useful for tuning the flags above:
* `prometheus_local_storage_memory_series`: The current number of series held
in memory.
* `prometheus_local_storage_memory_chunks`: The current number of chunks held
in memory.
* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
that still need to be persisted to disk.
* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
discussed [above](#persistence-pressure-and-rushed-mode).
* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
mode”, 0 otherwise.
## Crash recovery
Prometheus saves chunks to disk as soon as possible after they are
complete. Incomplete chunks are saved to disk during regular
checkpoints. You can configure the checkpoint interval with the flag
`storage.local.checkpoint-interval`. Prometheus creates checkpoints
more frequently than that if too many time series are in a "dirty"
more frequently than that if too many time series are in a “dirty”
state, i.e. their current incomplete head chunk is not the one that is
contained in the most recent checkpoint. This limit is configurable
via the `storage.local.checkpoint-dirty-series-limit` flag.
More active time series to cycle through generally lead to more chunks waiting
for persistence, which in turn leads to larger checkpoints and ultimately more
time needed for checkpointing. There is a clear trade-off between limiting the
loss of data in case of a crash and the ability to scale to a high number of
active time series. To not spend the majority of the disk throughput for
checkpointing, you have to increase the checkpoint interval. Prometheus itself
limits the time spent in checkpointing to 50% by waiting after each
checkpoint's completion for at least as long as the previous checkpoint took.
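A sketch of relaxing the checkpoint settings for a large installation (both
values are illustrative):
```
# Hypothetical: checkpoint at most every 15 minutes, unless more than 250000
# series have become dirty since the last checkpoint.
prometheus -storage.local.checkpoint-interval=15m \
  -storage.local.checkpoint-dirty-series-limit=250000
```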
Nevertheless, should your server crash, you might still lose data, and
your storage might be left in an inconsistent state. Therefore,
Prometheus performs a crash recovery after an unclean shutdown,
......@@ -253,7 +298,9 @@ that cannot be recovered is moved to a directory called `orphaned`
you do not need it anymore.
The crash recovery usually takes less than a minute. Should it take much
longer, consult the log to find out what has gone wrong.
longer, consult the log to find out what is going on. With an increasing number
of time series in the storage (archived or not), the re-indexing tends to
dominate the recovery time and can take tens of minutes in extreme cases.
## Data corruption
......@@ -268,3 +315,42 @@ storage directory:
1. Stop Prometheus.
1. `rm -r <storage path>/*`
1. Start Prometheus.
## Helpful metrics
Out of the metrics that Prometheus exposes about itself, the following are
particularly useful for tweaking the flags above and finding out about the
required resources. They also help with creating alerts that detect in time
whether a Prometheus server has problems or is running out of capacity.
* `prometheus_local_storage_memory_series`: The current number of series held
in memory.
* `prometheus_local_storage_open_head_chunks`: The number of open head chunks.
* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
that still need to be persisted to disk.
* `prometheus_local_storage_memory_chunks`: The current number of chunks held
  in memory. If you subtract the previous two metrics from this one, you get
  the number of persisted chunks (which are evictable if not currently in use
  by a query).
* `prometheus_local_storage_series_chunks_persisted`: A histogram of the number
of chunks persisted per batch.
* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
  discussed [above](#persistence-urgency-and-rushed-mode).
* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
  mode”, 0 otherwise. Can be used to calculate the percentage of time
  Prometheus is in rushed mode (see the query sketch after this list).
* `prometheus_local_storage_checkpoint_last_duration_seconds`: How long the
last checkpoint took.
* `prometheus_local_storage_checkpoint_last_size_bytes`: Size of the last
checkpoint in bytes.
* `prometheus_local_storage_checkpointing` is 1 while Prometheus is
checkpointing, 0 otherwise. Can be used to calculate the percentage of time
Prometheus is checkpointing.
* `prometheus_local_storage_inconsistencies_total`: Counter for storage
inconsistencies found. If this is greater than 0, restart the server for
recovery.
* `prometheus_local_storage_persist_errors_total`: Counter for persist errors.
* `prometheus_local_storage_memory_dirty_series`: Current number of dirty series.
* `process_resident_memory_bytes`: Broadly speaking, the physical memory
  occupied by the Prometheus process.
* `go_memstats_alloc_bytes`: Go heap size (allocated objects in use plus allocated
objects not in use anymore but not yet garbage-collected).
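Since `prometheus_local_storage_rushed_mode` and
`prometheus_local_storage_checkpointing` are 0/1 gauges, the fraction of time
spent in those states can be sketched with a query like the following (the
range is a placeholder):
```
# Hypothetical query: fraction of the last 24 hours spent in rushed mode.
avg_over_time(prometheus_local_storage_rushed_mode[24h])
```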