Commit 14df4384 authored by beorn7

Update FAQ

This is fallout from updating the documentation for 1.6.

However, none of the updates here are 1.6-specific, so I want to merge
them into master directly. (This simplifies things, as there are already
some FAQ updates in master that are not in the next-release branch
yet.)
parent cb083c4b
@@ -25,7 +25,10 @@ The main Prometheus server runs standalone and has no external dependencies.
Yes, run identical Prometheus servers on two or more separate machines.
Identical alerts will be deduplicated by the [Alertmanager](https://github.com/prometheus/alertmanager).
For [high availability of the Alertmanager](https://github.com/prometheus/alertmanager#high-availability),
you can run multiple instances in a
[Mesh cluster](https://github.com/weaveworks/mesh) and configure the Prometheus
servers to send notifications to each of them.
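
To make this concrete, here is a minimal sketch in Go of what "send
notifications to each of them" amounts to: the same alert is posted to every
Alertmanager instance, and the mesh deduplicates the resulting notifications.
The addresses, the alert labels, and the v1 endpoint `/api/v1/alerts` are
illustrative assumptions, not details from this FAQ; in practice, Prometheus
performs this fan-out itself once its Alertmanager URLs are configured.

```go
// Sketch only: post one alert to every Alertmanager instance in the mesh.
// Endpoint and addresses are assumptions; adjust for your deployment.
package main

import (
	"bytes"
	"log"
	"net/http"
)

func main() {
	// Hypothetical addresses of the mesh-clustered Alertmanager instances.
	alertmanagers := []string{
		"http://alertmanager-1:9093",
		"http://alertmanager-2:9093",
	}

	// One alert in the JSON list format accepted by the Alertmanager v1 API.
	alert := []byte(`[{"labels":{"alertname":"ExampleAlert","severity":"page"}}]`)

	// Send the identical alert to every instance; the mesh is responsible
	// for making sure the notification goes out only once.
	for _, am := range alertmanagers {
		resp, err := http.Post(am+"/api/v1/alerts", "application/json", bytes.NewReader(alert))
		if err != nil {
			log.Printf("sending to %s failed: %v", am, err)
			continue
		}
		resp.Body.Close()
		log.Printf("sent to %s: %s", am, resp.Status)
	}
}
```
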
### I was told Prometheus “doesn't scale”.
@@ -40,14 +43,17 @@ Python, and Ruby.
### How stable are Prometheus features, storage formats, and APIs?
All repositories in the Prometheus GitHub organization that have reached
version 1.0.0 broadly follow
[semantic versioning](http://semver.org/). Breaking changes are indicated by
increments of the major version. Exceptions are possible for experimental
components, which are clearly marked as such in announcements.

Even repositories that have not yet reached version 1.0.0 are in general quite
stable. We aim for a proper release process and an eventual 1.0.0 release for
each repository. In any case, breaking changes will be pointed out in release
notes (marked by `[CHANGE]`) or communicated clearly for components that do not
have formal releases yet.
### Why do you pull rather than push?
@@ -93,31 +99,33 @@ Prometheus is released under the
### What is the plural of Prometheus?
After [extensive research](https://youtu.be/B_CDeYrqxjQ), it has been determined
that the correct plural of 'Prometheus' is 'Prometheis'.
### Can I reload Prometheus's configuration?
Yes, sending SIGHUP to the Prometheus process or an HTTP POST request to the
`/-/reload` endpoint will reload and apply the configuration file. The
various components attempt to handle failing changes gracefully.
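
For illustration, here is a minimal Go sketch of both reload mechanisms. The
listen address `localhost:9090` and the PID are placeholders, not values taken
from this FAQ:

```go
// Sketch only: trigger a configuration reload on a running Prometheus
// server, first via the /-/reload HTTP endpoint, then via SIGHUP.
package main

import (
	"log"
	"net/http"
	"os"
	"syscall"
)

func main() {
	// Option 1: HTTP POST to the /-/reload endpoint of a server assumed
	// to listen on localhost:9090 (adjust for your setup).
	resp, err := http.Post("http://localhost:9090/-/reload", "", nil)
	if err != nil {
		log.Fatalf("reload via HTTP failed: %v", err)
	}
	resp.Body.Close()
	log.Println("HTTP reload returned:", resp.Status)

	// Option 2: send SIGHUP to the Prometheus process. The PID is a
	// placeholder; on the command line, `kill -HUP <pid>` does the same.
	const prometheusPID = 12345
	proc, err := os.FindProcess(prometheusPID)
	if err != nil {
		log.Fatalf("finding process: %v", err)
	}
	if err := proc.Signal(syscall.SIGHUP); err != nil {
		log.Printf("sending SIGHUP failed: %v", err)
	}
}
```
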
### Can I send alerts?
Yes, with the [Alertmanager](https://github.com/prometheus/alertmanager).

Currently, the following external systems are supported:
* Email
* Generic Webhooks
* [HipChat](https://www.hipchat.com/)
* [OpsGenie](https://www.opsgenie.com/)
* [PagerDuty](http://www.pagerduty.com/)
* [Pushover](https://pushover.net/)
* [Slack](https://slack.com/)
### Can I create dashboards?
Yes, we recommend [Grafana](/docs/visualization/grafana/) for production
usage. There are also [Console templates](/docs/visualization/consoles/).
### Can I change the timezone? Why is everything in UTC?
@@ -160,7 +168,7 @@ jobs.
### What applications can Prometheus monitor out of the box?
See [the list of exporters and integrations](/docs/instrumenting/exporters/).
### Can I monitor JVM applications via JMX?
@@ -178,19 +186,26 @@ latency-critical code.
## Troubleshooting
### My Prometheus server takes a long time to start up and spams the log with copious information about crash recovery.

You are suffering from an unclean shutdown. Prometheus has to shut
down cleanly after a `SIGTERM`, which might take a while for heavily
used servers. If the server crashes or is killed hard (e.g. an OOM kill
by the kernel, or because your runlevel system got impatient while
waiting for Prometheus to shut down), a crash recovery has to be
performed, which should take less than a minute under normal
circumstances. See
[crash recovery](/docs/operating/storage/#crash-recovery) for details.

### My Prometheus server runs out of memory.

See [the section about memory usage](https://prometheus.io/docs/operating/storage/#memory-usage)
to configure Prometheus for the amount of memory you have available.
### My Prometheus server reports to be in “rushed mode” or that “storage needs throttling”.
Your storage is under heavy load. Read
[the section about configuring the local storage](https://prometheus.io/docs/operating/storage/)
to find out how you can tweak settings for better performance.
## Implementation
@@ -211,37 +226,35 @@ after over 285 years.
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
Initially, Prometheus ran completely on LevelDB, but to achieve better
performance, we had to change the storage for bulk sample data. We evaluated
many storage backends that were available at the time, without getting
satisfactory results. So we implemented exactly the parts we needed, while
keeping LevelDB for indexes and making heavy use of file system
capabilities. Obviously, we could not evaluate every single storage backend
out there, and storage backends have evolved in the meantime. However, the
performance of the solution implemented now is satisfactory for most use
cases. Our most important requirements are an acceptable query speed for
common queries and a sustainable ingestion rate of hundreds of thousands of
samples per second. The latter depends on many parameters, like the
compressibility of the sample data, the number of time series the samples
belong to, the retention policy, and even more subtle aspects like how full
your SSD is. If you want to know all the details, read
[this document with detailed benchmark results](https://docs.google.com/document/d/1lRKBaz9oXI5nwFZfvSbPhpwzUbUr3-9qryQGG1C6ULk/edit?usp=sharing).
The highlights:

* On a typical bare-metal server with 64GiB RAM, 32 CPU cores, and SSD,
  Prometheus sustained an ingestion rate of 900k samples per second, belonging
  to 1M time series, scraped from 720 targets.
* On a server with HDD and 128GiB RAM, Prometheus sustained an ingestion rate
  of 250k samples per second, belonging to 1M time series, scraped from 720
  targets.

Running out of inodes is unlikely in a usual setup. However, if you have a lot
of short-lived time series, or you have configured your file system with an
unusually low number of inodes, you might run into inode depletion. Also, if
you want to delete Prometheus's storage directory, you will notice that some
file systems are very slow when deleting a large number of files.
### Why don't the Prometheus server components support TLS or authentication? Can I add those?