Commit 14df4384 authored by beorn7

Update FAQ

This is fallout from updating the documentation for 1.6.

However, none of the updates here are 1.6-specific, so I want to merge
them into master directly. (Which simplifies things as there are some
FAQ updates in master already that are not in the next-release branch
yet.)
parent cb083c4b
@@ -25,7 +25,10 @@ The main Prometheus server runs standalone and has no external dependencies.
Yes, run identical Prometheus servers on two or more separate machines.
Identical alerts will be deduplicated by the [Alertmanager](https://github.com/prometheus/alertmanager).
For [high availability of the Alertmanager](https://github.com/prometheus/alertmanager#high-availability),
you can run multiple instances in a
[Mesh cluster](https://github.com/weaveworks/mesh) and configure the Prometheus
servers to send notifications to each of them.
### I was told Prometheus “doesn't scale”.
@@ -40,14 +43,17 @@ Python, and Ruby.
### How stable are Prometheus features, storage formats, and APIs?
All repositories in the Prometheus GitHub organization that have reached
version 1.0.0 broadly follow
[semantic versioning](http://semver.org/). Breaking changes are indicated by
increments of the major version. Exceptions are possible for experimental
components, which are clearly marked as such in announcements.

Even repositories that have not yet reached version 1.0.0 are in general quite
stable. We aim for a proper release process and an eventual 1.0.0 release for
each repository. In any case, breaking changes will be pointed out in release
notes (marked by `[CHANGE]`) or communicated clearly for components that do not
have formal releases yet.
### Why do you pull rather than push?
@@ -93,31 +99,33 @@ Prometheus is released under the
### What is the plural of Prometheus?
After [extensive research](https://youtu.be/B_CDeYrqxjQ), it has been determined
that the correct plural of 'Prometheus' is 'Prometheis'.
### Can I reload Prometheus's configuration?
Yes, sending SIGHUP to the Prometheus process or an HTTP POST request to the
`/-/reload` endpoint will reload and apply the configuration file. The
various components attempt to handle failing changes gracefully.
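
As a minimal illustration (not part of the original FAQ), the following Go
sketch triggers a reload via that HTTP endpoint. It assumes Prometheus listens
on `localhost:9090` and that the reload endpoint is reachable from where the
program runs:

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed address; adjust to wherever your Prometheus server listens.
	const reloadURL = "http://localhost:9090/-/reload"

	// An empty POST body is enough: Prometheus re-reads the configuration
	// file it was started with and applies it if it is valid.
	resp, err := http.Post(reloadURL, "text/plain", strings.NewReader(""))
	if err != nil {
		log.Fatalf("reload request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("reload was rejected: %s", resp.Status)
	}
	log.Println("configuration reload triggered")
}
```

Sending `SIGHUP` to the process (for example with `kill -HUP <pid>`) achieves
the same effect without going through HTTP.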
### Can I send alerts?
Yes, with the [Alertmanager](https://github.com/prometheus/alertmanager).
Currently, the following external systems are supported:
* Email
* Generic Webhooks
* [HipChat](https://www.hipchat.com/)
* [OpsGenie](https://www.opsgenie.com/)
* [PagerDuty](http://www.pagerduty.com/)
* [Pushover](https://pushover.net/)
* [Slack](https://slack.com/)
### Can I create dashboards?
Yes, we recommend [Grafana](/docs/visualization/grafana/) for production
usage. There are also [Console templates](/docs/visualization/consoles/).
### Can I change the timezone? Why is everything in UTC?
@@ -160,7 +168,7 @@ jobs.
### What applications can Prometheus monitor out of the box?
See [the list of exporters and integrations](/docs/instrumenting/exporters/).
### Can I monitor JVM applications via JMX?
@@ -178,19 +186,26 @@ latency-critical code.
## Troubleshooting
### My Prometheus server takes a long time to start up and spams the log with copious information about crash recovery.
You are suffering from an unclean shutdown. Prometheus has to shut down cleanly
after a `SIGTERM`, which might take a while for heavily used servers. If the
server crashes or is killed hard (e.g. OOM kill by the kernel or your runlevel
system got impatient while waiting for Prometheus to shut down), a crash
recovery has to be performed, which should take less than a minute under normal
circumstances, but can take quite a while in certain situations. See
[crash recovery](/docs/operating/storage/#crash-recovery) for details.

### My Prometheus server runs out of memory.

See [the section about memory usage](https://prometheus.io/docs/operating/storage/#memory-usage)
to configure Prometheus for the amount of memory you have available.
### My Prometheus server reports to be in “rushed mode” or that “storage needs throttling”.

Your storage is under heavy load. Read
[the section about configuring the local storage](https://prometheus.io/docs/operating/storage/)
to find out how you can tweak settings for better performance.
## Implementation
@@ -211,37 +226,35 @@ after over 285 years.
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
Initially, Prometheus ran completely on LevelDB, but to achieve better
performance, we had to change the storage for bulk sample data. We evaluated
many storage backends that were available at the time, without getting
satisfactory results. So we implemented exactly the parts we needed, while
keeping LevelDB for indexes and making heavy use of file system
capabilities. Obviously, we could not evaluate every single storage backend out
there, and storage backends have evolved meanwhile. However, the performance of
the solution implemented now is satisfactory for most use-cases. Our most
important requirements are an acceptable query speed for common queries and a
sustainable ingestion rate of hundreds of thousands of samples per second. The
latter depends on many parameters, like the compressibility of the sample data,
the number of time series the samples belong to, the retention policy, and even
more subtle aspects like how full your SSD is. If you want to know all the
details, read
[this document with detailed benchmark results](https://docs.google.com/document/d/1lRKBaz9oXI5nwFZfvSbPhpwzUbUr3-9qryQGG1C6ULk/edit?usp=sharing). The
highlights:

* On a typical bare-metal server with 64GiB RAM, 32 CPU cores, and SSD,
  Prometheus sustained an ingestion rate of 900k samples per second, belonging
  to 1M time series, scraped from 720 targets.

* On a server with HDD and 128GiB RAM, Prometheus sustained an ingestion rate
  of 250k samples per second, belonging to 1M time series, scraped from 720
  targets.

Running out of inodes is unlikely in a usual set-up. However, if you have a lot
of short-lived time series, or you have configured your file system with an
unusually low number of inodes, you might run into inode depletion. Also, if you
want to delete Prometheus's storage directory, you will notice that some file
systems are very slow when deleting a large number of files.
### Why don't the Prometheus server components support TLS or authentication? Can I add those?