Commit 14df4384 authored by beorn7

Update FAQ

This is fallout from updating the documentation for 1.6.

However, none of the updates here are 1.6-specific, so I want to merge
them into master directly. (Which simplifies things as there are some
FAQ updates in master already that are not in the next-release branch
yet.)
parent cb083c4b
@@ -25,7 +25,10 @@ The main Prometheus server runs standalone and has no external dependencies.
Yes, run identical Prometheus servers on two or more separate machines.
Identical alerts will be deduplicated by the [Alertmanager](https://github.com/prometheus/alertmanager).
For [high availability of the Alertmanager](https://github.com/prometheus/alertmanager#high-availability),
you can run multiple instances in a
[Mesh cluster](https://github.com/weaveworks/mesh) and configure the Prometheus
servers to send notifications to each of them.
### I was told Prometheus “doesn't scale”.
@@ -40,14 +43,17 @@ Python, and Ruby.
### How stable are Prometheus features, storage formats, and APIs?
All repositories in the Prometheus GitHub organization that have reached
version 1.0.0 broadly follow
[semantic versioning](http://semver.org/). Breaking changes are indicated by
increments of the major version. Exceptions are possible for experimental
components, which are clearly marked as such in announcements.

Even repositories that have not yet reached version 1.0.0 are in general quite
stable. We aim for a proper release process and an eventual 1.0.0 release for
each repository. In any case, breaking changes will be pointed out in release
notes (marked by `[CHANGE]`) or communicated clearly for components that do not
have formal releases yet.
### Why do you pull rather than push?
@@ -93,31 +99,33 @@ Prometheus is released under the
### What is the plural of Prometheus?
After [extensive research](https://youtu.be/B_CDeYrqxjQ), it has been determined
that the correct plural of 'Prometheus' is 'Prometheis'.
### Can I reload Prometheus's configuration?
Yes, sending SIGHUP to the Prometheus process or an HTTP POST request to the
`/-/reload` endpoint will reload and apply the configuration file. The
various components attempt to handle failing changes gracefully.
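
As a minimal illustration (not part of the original FAQ), the following Go
sketch triggers a reload via that HTTP endpoint. It assumes Prometheus listens
on `localhost:9090` and that the reload endpoint is reachable from where the
program runs:

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed address; adjust to wherever your Prometheus server listens.
	const reloadURL = "http://localhost:9090/-/reload"

	// An empty POST body is enough: Prometheus re-reads the configuration
	// file it was started with and applies it if it is valid.
	resp, err := http.Post(reloadURL, "text/plain", strings.NewReader(""))
	if err != nil {
		log.Fatalf("reload request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("reload was rejected: %s", resp.Status)
	}
	log.Println("configuration reload triggered")
}
```

Sending `SIGHUP` to the process (for example with `kill -HUP <pid>`) achieves
the same effect without going through HTTP.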
### Can I send alerts?
Yes, with the [Alertmanager](https://github.com/prometheus/alertmanager).
Currently, the following external systems are supported:
* Email
* Generic Webhooks
* [HipChat](https://www.hipchat.com/)
* [OpsGenie](https://www.opsgenie.com/)
* [PagerDuty](http://www.pagerduty.com/)
* [Pushover](https://pushover.net/)
* [Slack](https://slack.com/)
### Can I create dashboards?
Yes, we recommend [Grafana](/docs/visualization/grafana/) for production
usage. There are also [Console templates](/docs/visualization/consoles/).
### Can I change the timezone? Why is everything in UTC?
@@ -160,7 +168,7 @@ jobs.
### What applications can Prometheus monitor out of the box?
See [the list of exporters and integrations](/docs/instrumenting/exporters/).
### Can I monitor JVM applications via JMX?
@@ -178,19 +186,26 @@ latency-critical code.
## Troubleshooting
### My Prometheus server takes a long time to start up and spams the log with copious information about crash recovery.
You are suffering from an unclean shutdown. Prometheus has to shut down cleanly
after a `SIGTERM`, which might take a while for heavily used servers. If the
server crashes or is killed hard (e.g. OOM kill by the kernel or your runlevel
system got impatient while waiting for Prometheus to shut down), a crash
recovery has to be performed, which should take less than a minute under normal
circumstances, but can take quite a while in certain situations. See
[crash recovery](/docs/operating/storage/#crash-recovery) for details.

### My Prometheus server runs out of memory.

See [the section about memory usage](https://prometheus.io/docs/operating/storage/#memory-usage)
to configure Prometheus for the amount of memory you have available.
### My Prometheus server reports to be in “rushed mode” or that “storage needs throttling”.

Your storage is under heavy load. Read
[the section about configuring the local storage](https://prometheus.io/docs/operating/storage/)
to find out how you can tweak settings for better performance.
## Implementation
@@ -211,37 +226,35 @@ after over 285 years.
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
Initially, Prometheus ran completely on LevelDB, but to achieve better
performance, we had to change the storage for bulk sample data. We evaluated
many storage backends that were available at the time, without getting
satisfactory results. So we implemented exactly the parts we needed, while
keeping LevelDB for indexes and making heavy use of file system
capabilities. Obviously, we could not evaluate every single storage backend out
there, and storage backends have evolved meanwhile. However, the performance of
the solution implemented now is satisfactory for most use-cases. Our most
important requirements are an acceptable query speed for common queries and a
sustainable ingestion rate of hundreds of thousands of samples per second. The
latter depends on many parameters, like the compressibility of the sample data,
the number of time series the samples belong to, the retention policy, and even
more subtle aspects like how full your SSD is. If you want to know all the
details, read
[this document with detailed benchmark results](https://docs.google.com/document/d/1lRKBaz9oXI5nwFZfvSbPhpwzUbUr3-9qryQGG1C6ULk/edit?usp=sharing). The
highlights:

* On a typical bare-metal server with 64GiB RAM, 32 CPU cores, and SSD,
  Prometheus sustained an ingestion rate of 900k samples per second, belonging
  to 1M time series, scraped from 720 targets.

* On a server with HDD and 128GiB RAM, Prometheus sustained an ingestion rate
  of 250k samples per second, belonging to 1M time series, scraped from 720
  targets.

Running out of inodes is unlikely in a usual set-up. However, if you have a lot
of short-lived time series, or you have configured your file system with an
unusually low number of inodes, you might run into inode depletion. Also, if you
want to delete Prometheus's storage directory, you will notice that some file
systems are very slow when deleting a large number of files.
### Why don't the Prometheus server components support TLS or authentication? Can I add those?