The main Prometheus server runs standalone and has no external dependencies.
Yes, run identical Prometheus servers on two or more separate machines.
Identical alerts will be deduplicated by the [Alertmanager](https://github.com/prometheus/alertmanager).
For [high availability of the Alertmanager](https://github.com/prometheus/alertmanager#high-availability),
you can run multiple instances in a
[Mesh cluster](https://github.com/weaveworks/mesh) and configure the Prometheus
servers to send notifications to each of them.
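Concretely, each Prometheus server would point its alerting configuration at every
Alertmanager instance. The fragment below is only a minimal sketch of what that
might look like in `prometheus.yml`; it assumes a Prometheus version that supports
the `alerting` configuration section, and the host names are hypothetical:

```yaml
# Identical on every Prometheus server: send notifications to all
# Alertmanager instances in the mesh (hypothetical host names).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1.example.org:9093'
            - 'alertmanager-2.example.org:9093'
```

The Alertmanager instances themselves are clustered via the mesh options described
in the linked high-availability documentation, so that notifications sent to all of
them are deduplicated.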
### I was told Prometheus “doesn't scale”.
...
...
### How stable are Prometheus features, storage formats, and APIs?
All repositories in the Prometheus GitHub organization that have reached
version 1.0.0 broadly follow
[semantic versioning](http://semver.org/). Breaking changes are indicated by
increments of the major version. Exceptions are possible for experimental
components, which are clearly marked as such in announcements.

Even repositories that have not yet reached version 1.0.0 are in general quite
stable. We aim for a proper release process and an eventual 1.0.0 release for
each repository. In any case, breaking changes will be pointed out in release
notes (marked by `[CHANGE]`) or communicated clearly for components that do not
have formal releases yet.
### Why do you pull rather than push?
...
...
### What is the plural of Prometheus?
After [extensive research](https://youtu.be/B_CDeYrqxjQ), it has been
determined that the correct plural of 'Prometheus' is 'Prometheis'.
### Can I reload Prometheus's configuration?
Yes, sending SIGHUP to the Prometheus process or an HTTP POST request to the
`/-/reload` endpoint will reload and apply the configuration file. The
various components attempt to handle failing changes gracefully.
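As an illustration, suppose you add a new scrape job to `prometheus.yml` (a
minimal sketch follows; the job name and target are hypothetical, and the
snippet assumes a configuration format that uses `static_configs`). Sending
SIGHUP or POSTing to `/-/reload` then makes the running server pick up the
change without a restart:

```yaml
scrape_configs:
  - job_name: 'my-new-service'        # hypothetical job added to an existing config
    static_configs:
      - targets: ['localhost:8080']   # hypothetical target to scrape
```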
### Can I send alerts?
Yes, with the [Alertmanager](https://github.com/prometheus/alertmanager).
Currently, the following external systems are supported:
* Email
* Generic Webhooks
* [OpsGenie](https://www.opsgenie.com/)
* [PagerDuty](http://www.pagerduty.com/)
* [Pushover](https://pushover.net/)
* [Slack](https://slack.com/)
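Which of these systems actually receives notifications is configured in the
Alertmanager itself, not in Prometheus. As a rough illustration, here is a
minimal Alertmanager configuration sketch that routes every alert to a generic
webhook; the receiver name and URL are hypothetical:

```yaml
# alertmanager.yml -- minimal sketch, not a production setup.
route:
  receiver: 'example-webhook'                 # send all alerts to the receiver below
receivers:
  - name: 'example-webhook'
    webhook_configs:
      - url: 'http://example.org/alert-hook'  # hypothetical webhook endpoint
```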
### Can I create dashboards?
Yes, we recommend [Grafana](/docs/visualization/grafana/) for production
usage. There are also [Console templates](/docs/visualization/consoles/).
### Can I change the timezone? Why is everything in UTC?
...
...
### What applications can Prometheus monitor out of the box?
See [the list of exporters and integrations](/docs/instrumenting/exporters/).
### Can I monitor JVM applications via JMX?
...
...
## Troubleshooting
### My Prometheus server takes a long time to start up and spams the log with copious information about crash recovery.
You are suffering from an unclean shutdown. Prometheus has to shut down cleanly
after a `SIGTERM`, which might take a while for heavily used servers. If the
server crashes or is killed hard (e.g. an OOM kill by the kernel, or your
runlevel system got impatient while waiting for Prometheus to shut down), a
crash recovery has to be performed, which should take less than a minute under
normal circumstances. See
[crash recovery](/docs/operating/storage/#crash-recovery) for details.

### My Prometheus server runs out of memory.

See [the section about memory usage](https://prometheus.io/docs/operating/storage/#memory-usage)
to configure Prometheus for the amount of memory you have available.
### My Prometheus server reports that it is in “rushed mode” or that “storage needs throttling”.

Your storage is under heavy load. Read
[the section about configuring the local storage](https://prometheus.io/docs/operating/storage/)
to find out how you can tweak settings for better performance.

### I am using ZFS on Linux, and the unit test `TestPersistLoadDropChunks` fails. If I run Prometheus despite the failing test, the weirdest things happen.

You have run into a bug of ZFS on Linux. See
[issue #484](https://github.com/prometheus/prometheus/issues/484) for details.
Upgrading to ZFS on Linux v0.6.4 should fix the issue.
## Implementation
...
...
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
Initially, Prometheus ran completely on LevelDB, but to achieve better
performance, we had to change the storage for bulk sample data. We evaluated
many storage backends that were available at the time, without getting
satisfactory results. So we implemented exactly the parts we needed, while
keeping LevelDB for indexes and making heavy use of file system
capabilities. Obviously, we could not evaluate every single storage backend out
there, and storage backends have evolved meanwhile. However, the performance of
the solution implemented now is satisfactory for most use cases. Our most
important requirements are an acceptable query speed for common queries and a
sustainable ingestion rate of hundreds of thousands of samples per second. The
latter depends on many parameters, like the compressibility of the sample data,
the number of time series the samples belong to, the retention policy, and even
more subtle aspects like how full your SSD is. If you want to know all the
details, read
[this document with detailed benchmark results](https://docs.google.com/document/d/1lRKBaz9oXI5nwFZfvSbPhpwzUbUr3-9qryQGG1C6ULk/edit?usp=sharing).
The highlights:

* On a typical bare-metal server with 64GiB RAM, 32 CPU cores, and SSD,
  Prometheus sustained an ingestion rate of 900k samples per second, belonging
  to 1M time series, scraped from 720 targets.
* On a server with HDD and 128GiB RAM, Prometheus sustained an ingestion rate
  of 250k samples per second, belonging to 1M time series, scraped from 720
  targets.

Running out of inodes is unlikely in a usual set-up. However, if you have a lot
of short-lived time series, or you have configured your file system with an
unusually low number of inodes, you might run into inode depletion. Also, if
you want to delete Prometheus's storage directory, you will notice that some
file systems are very slow when deleting a large number of files.
### Why don't the Prometheus server components support TLS or authentication? Can I add those?