Commit 7ea0ccf4 authored by James Turnbull's avatar James Turnbull Committed by Brian Brazil

Some updates to the introduction documents (#861)

* Some updates to the introduction documents

1. Some spelling and grammar updates, settled on US English as that
seemed consistent? Happy to be wrong...

2. Removed a dead link from the FAQ.

3. Tidied up some awkward sentences.

4. Fixed some missing or extra words.

5. Added a couple of links to external tools, etc.

6. Added NOTE formatting when things are an actual note.
parent c984fe6a
......@@ -5,9 +5,8 @@ sort_rank: 3
# Jobs and instances
In Prometheus terms, an endpoint you can scrape is called an _instance_,
usually corresponding to a single process. A collection of instances with the
same purpose, for example a process replicated for scalability or reliability,
is called a _job_.
For example, an API server job with four replicated instances:
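As an illustrative sketch only (the addresses and port here are assumptions, not taken from the page), such a job might look like:
```
job: api-server
    instance 1: 10.0.0.1:5670
    instance 2: 10.0.0.2:5670
    instance 3: 10.0.0.3:5670
    instance 4: 10.0.0.4:5670
```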
......
......@@ -24,17 +24,17 @@ to find faults.
Graphite stores numeric samples for named time series, much like Prometheus
does. However, Prometheus's metadata model is richer: while Graphite metric
names consist of dot-separated components which implicitly encode dimensions,
Prometheus encodes dimensions explicitly as key-value pairs, called labels,
attached to a metric name. This allows easy filtering, grouping, and matching
by these labels via the query language.
Further, especially when Graphite is used in combination with
[StatsD](https://github.com/etsy/statsd/), it is common to store only
aggregated data over all monitored instances, rather than preserving the
instance as a dimension and being able to drill down into individual
problematic instances.
For example, storing the number of HTTP requests to API servers with the
response code `500` and the method `POST` to the `/tracks` endpoint would
commonly be encoded like this in Graphite/StatsD:
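As a purely illustrative sketch (the metric and label names below are assumptions, not taken from the surrounding text), the two encodings might look like:
```
# Graphite/StatsD: dimensions packed into a dot-separated metric name
stats.api-server.tracks.post.500 -> 93

# Prometheus: one metric name, with dimensions attached as labels
api_server_http_requests_total{method="POST", handler="/tracks", status="500"} 93
```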
......@@ -102,20 +102,18 @@ silencing functionality.
### Data model / storage
Like Prometheus, the InfluxDB data model has key-value pairs as labels, which
are called tags. In addition, InfluxDB has a second level of labels called
fields, which are more limited in use. InfluxDB supports timestamps with up to
nanosecond resolution, and float64, int64, bool, and string data types.
Prometheus, by contrast, supports the float64 data type with limited support for
strings, and millisecond resolution timestamps.
InfluxDB uses a variant of a [log-structured merge tree for storage with a write ahead log](https://docs.influxdata.com/influxdb/v1.2/concepts/storage_engine/),
sharded by time. This is much more suitable to event logging than Prometheus's
append-only file per time series approach.
[Logs and Metrics and Graphs, Oh My!](https://blog.raintank.io/logs-and-metrics-and-graphs-oh-my/)
describes the differences between event logging and metrics recording.
### Architecture
......@@ -123,7 +121,7 @@ Prometheus servers run independently of each other and only rely on their local
storage for their core functionality: scraping, rule processing, and alerting.
The open source version of InfluxDB is similar.
The commercial InfluxDB offering is, by design, a distributed storage cluster
with storage and queries being handled by many nodes at once.
This means that the commercial InfluxDB will be easier to scale horizontally,
......@@ -136,7 +134,7 @@ you better reliability and failure isolation.
Kapacitor currently has no [built-in distributed/redundant
options](https://github.com/influxdata/kapacitor/issues/277) for rules,
alerting, or notifications. Prometheus and the Alertmanager, by contrast, offer a
redundant option via running redundant replicas of Prometheus and using the
Alertmanager's [High
Availability](https://github.com/prometheus/alertmanager#high-availability)
......@@ -149,7 +147,7 @@ There are many similarities between the systems. Both have labels (called tags
in InfluxDB) to efficiently support multi-dimensional metrics. Both use
basically the same data compression algorithms. Both have extensive
integrations, including with each other. Both have hooks allowing you to extend
them further, such as analyzing data in statistical tools or performing
automated actions.
Where InfluxDB is better:
......@@ -183,12 +181,10 @@ The same scope differences as in the case of
### Data model
OpenTSDB's data model is almost identical to Prometheus's: time series are
identified by a set of arbitrary key-value pairs (OpenTSDB "tags" are
Prometheus "labels"). All data for a metric is [stored
together](http://opentsdb.net/docs/build/html/user_guide/writing/index.html#time-series-cardinality),
limiting the cardinality of metrics. There are minor differences though,
such as that Prometheus allows arbitrary characters in label values, while
OpenTSDB is more restrictive. OpenTSDB is also lacking a full query language,
identified by a set of arbitrary key-value pairs (OpenTSDB tags are
Prometheus labels). All data for a metric is [stored together](http://opentsdb.net/docs/build/html/user_guide/writing/index.html#time-series-cardinality),
limiting the cardinality of metrics. There are minor differences though: Prometheus allows arbitrary characters in label values, while
OpenTSDB is more restrictive. OpenTSDB also lacks a full query language,
only allowing simple aggregation and math via its API.
### Storage
......@@ -204,15 +200,14 @@ once the capacity of a single node is exceeded.
### Summary
Prometheus offers a much richer query language, can handle higher cardinality
metrics, and forms part of a complete monitoring system. If you're already
running Hadoop and value long-term storage over these benefits, OpenTSDB is a
good choice.
## Prometheus vs. Nagios
[Nagios](https://www.nagios.org/) is a monitoring system that originated in the
1990s as NetSaint.
### Scope
......@@ -220,14 +215,11 @@ Nagios is primarily about alerting based on the exit codes of scripts. These are
There is silencing of individual alerts; however, there is no grouping, routing, or deduplication.
There are a variety of plugins. For example, the few kilobytes of perfData
that plugins are allowed to return can be piped [to a time series database such as Graphite](https://github.com/shawn-sterling/graphios),
or NRPE can be used to [run checks on remote machines](https://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details).
### Data model
Nagios is host-based. Each host can have one or more services and each service can perform one check.
There is no notion of labels or a query language.
......@@ -246,7 +238,7 @@ Nagios is suitable for basic monitoring of small and/or static systems where
blackbox probing is sufficient.
If you want to do whitebox monitoring, or have a dynamic or cloud-based
environment, then Prometheus is a good choice.
## Prometheus vs. Sensu
......@@ -257,8 +249,7 @@ environment then Prometheus is a good choice.
The same general scope differences as in the case of
[Nagios](/docs/introduction/comparison/#prometheus-vs-nagios) apply here.
The primary difference is that Sensu clients [register themselves](https://sensuapp.org/docs/0.27/reference/clients.html#what-is-a-sensu-client),
and can determine the checks to run either from central or local configuration.
Sensu does not have a limit on the amount of perfData.
......@@ -275,9 +266,8 @@ silences. It also stores all the clients that have registered with it.
### Architecture
Sensu has a [number of components](https://sensuapp.org/docs/0.27/overview/architecture.html). It uses
RabbitMQ as a transport, Redis for current state, and a separate server for
processing.
Both RabbitMQ and Redis can be clustered. Multiple copies of the server can be
......@@ -285,8 +275,7 @@ run for scaling and redundancy.
### Summary
If you have an existing Nagios setup that you wish to scale as-is, or want to
take advantage of the registration feature of Sensu, then Sensu is a good
choice.
If you want to do whitebox monitoring, or have a very dynamic or cloud-based
environment, then Prometheus is a good choice.
......@@ -9,6 +9,7 @@ toc: full-width
## General
### What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit
with an active ecosystem. See the [overview](/docs/introduction/overview/).
......@@ -49,7 +50,7 @@ version 1.0.0 broadly follow
increments of the major version. Exceptions are possible for experimental
components, which are clearly marked as such in announcements.
Even repositories that have not yet reached version 1.0.0 are, in general, quite
stable. We aim for a proper release process and an eventual 1.0.0 release for
each repository. In any case, breaking changes will be pointed out in release
notes (marked by `[CHANGE]`) or communicated clearly for components that do not
......@@ -63,17 +64,14 @@ Pulling over HTTP offers a number of advantages:
* You can more easily tell if a target is down.
* You can manually go to a target and inspect its health with a web browser.
Overall, we believe that pulling is slightly better than pushing, but it should
not be considered a major point when considering a monitoring system.
The [Push vs Pull for Monitoring](http://www.boxever.com/push-vs-pull-for-monitoring)
blog post by Brian Brazil goes into more detail.
For cases where you must push, we offer the [Pushgateway](/docs/instrumenting/pushing/).
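As a minimal sketch (the host name, job name, and metric below are illustrative assumptions; the Pushgateway listens on port 9091 by default), a batch job could push a single sample like this:
```language-bash
# Push one sample for the job "some_job" to a Pushgateway.
echo "some_metric 3.14" | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job
```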
### How to feed logs into Prometheus?
Short answer: Don't! Use something like the [ELK stack](https://www.elastic.co/products) instead.
Longer answer: Prometheus is a system to collect and process metrics, not an
event logging system. The Raintank blog post
......@@ -104,7 +102,7 @@ that the correct plural of 'Prometheus' is 'Prometheis'.
### Can I reload Prometheus's configuration?
Yes, sending `SIGHUP` to the Prometheus process or an HTTP POST request to the
`/-/reload` endpoint will reload and apply the configuration file. The
various components attempt to handle failing changes gracefully.
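For example, assuming Prometheus is running locally on its default port on a Linux host where `pidof` is available, either of the following triggers a reload:
```language-bash
# Option 1: send SIGHUP to the running Prometheus process.
kill -HUP "$(pidof prometheus)"

# Option 2: POST to the reload endpoint.
curl -X POST http://localhost:9090/-/reload
```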
......@@ -152,7 +150,7 @@ the [exposition formats](/docs/instrumenting/exposition_formats/).
Yes, the [Node Exporter](https://github.com/prometheus/node_exporter) exposes
an extensive set of machine-level metrics on Linux and other Unix systems such
as CPU usage, memory, disk utilization, filesystem fullness, and network
bandwidth.
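As a quick sketch (assuming the Node Exporter's usual default port of 9100), you can run it and inspect the metrics it exposes:
```language-bash
# Run the Node Exporter and fetch a few of its machine-level metrics.
./node_exporter &
curl -s http://localhost:9100/metrics | grep '^node_' | head
```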
### Can I monitor network devices?
......@@ -172,8 +170,7 @@ See [the list of exporters and integrations](/docs/instrumenting/exporters/).
### Can I monitor JVM applications via JMX?
Yes, for applications that you cannot instrument directly with the Java
client, you can use the [JMX Exporter](https://github.com/prometheus/jmx_exporter)
either standalone or as a Java Agent.
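As a rough sketch of the Java agent mode (the jar file name, port, and config file name are assumptions; check the JMX Exporter README for the exact invocation for your version):
```language-bash
# Attach the JMX Exporter as a Java agent; metrics are then served on port 8080.
java -javaagent:./jmx_prometheus_javaagent.jar=8080:jmx_config.yaml -jar your_app.jar
```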
### What is the performance impact of instrumentation?
......@@ -219,9 +216,8 @@ native 64 bit integers would (only) help if you need integer precision
above 2<sup>53</sup> but below 2<sup>63</sup>. In principle, support
for different sample value types (including some kind of big integer,
supporting even more than 64 bit) could be implemented, but it is not
a priority right now. A counter, even if incremented one million times per
second, will only run into precision issues after over 285 years, since
2<sup>53</sup> increments at 10<sup>6</sup> per second take roughly
9 × 10<sup>9</sup> seconds.
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
......@@ -239,8 +235,7 @@ latter depends on many parameters, like the compressibility of the sample data,
the number of time series the samples belong to, the retention policy, and even
more subtle aspects like how full your SSD is. If you want to know all the
details, read
[this document with detailed benchmark results](https://docs.google.com/document/d/1lRKBaz9oXI5nwFZfvSbPhpwzUbUr3-9qryQGG1C6ULk/edit?usp=sharing). The highlights:
* On a typical bare-metal server with 64GiB RAM, 32 CPU cores, and SSD,
Prometheus sustained an ingestion rate of 900k samples per second, belonging
......@@ -266,10 +261,9 @@ monitoring system possible rather than supporting fully generic TLS and
authentication solutions in every server component.
If you need TLS or authentication, we recommend putting a reverse proxy in
front of Prometheus. See, for example, [Adding Basic Auth to Prometheus with
Nginx](https://www.robustperception.io/adding-basic-auth-to-prometheus-with-nginx/).
This applies only to inbound connections. Prometheus does support
[scraping TLS- and auth-enabled targets](/docs/operating/configuration/#%3Cscrape_config%3E), and other
Prometheus components that create outbound connections have similar support.
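As a minimal sketch of such a scrape configuration (the job name, target address, file path, and credentials are placeholders; see the `<scrape_config>` documentation for the full option set):
```language-yaml
scrape_configs:
  - job_name: 'secure-app'          # hypothetical job name
    scheme: https                   # scrape the target over TLS
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    basic_auth:
      username: 'prometheus'
      password: 'secret'
    static_configs:
      - targets: ['app.example.com:443']
```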
......@@ -16,7 +16,7 @@ series data.
[Download the latest release](/download) of Prometheus for your platform, then
extract and run it:
```language-bash
tar xvfz prometheus-*.tar.gz
cd prometheus-*
```
......@@ -33,7 +33,7 @@ While a Prometheus server that collects only data about itself is not very
useful in practice, it is a good starting example. Save the following basic
Prometheus configuration as a file named `prometheus.yml`:
```language-yaml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
......@@ -58,11 +58,9 @@ scrape_configs:
For a complete specification of configuration options, see the
[configuration documentation](/docs/operating/configuration).
## Starting Prometheus
To start Prometheus with your newly created configuration file, change to the
directory containing the Prometheus binary and run:
```language-bash
# Start Prometheus.
......@@ -70,9 +68,7 @@ Prometheus build directory and run:
./prometheus -config.file=prometheus.yml
```
Prometheus should start up. You should also be able to browse to a status page
about itself at http://localhost:9090. Give it a couple of seconds to collect
data about itself from its own HTTP metrics endpoint.
You can also verify that Prometheus is serving metrics about itself by
navigating to its metrics endpoint: http://localhost:9090/metrics
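Or, from the command line (a simple sketch assuming the default port):
```language-bash
# Fetch the raw metrics Prometheus exposes about itself.
curl -s http://localhost:9090/metrics | head
```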
......@@ -81,11 +77,9 @@ The number of OS threads executed by Prometheus is controlled by the
`GOMAXPROCS` environment variable. As of Go 1.5 the default value is
the number of cores available.
Blindly setting `GOMAXPROCS` to a high value can be counterproductive. See the
relevant [Go FAQs](http://golang.org/doc/faq#Why_no_multi_CPU).
Prometheus by default uses around 3GB in memory. If you have a
smaller machine, you can tune Prometheus to use less memory. For details,
see the [memory usage documentation](/docs/operating/storage/#memory-usage).
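For instance, on Prometheus 1.x you can lower the number of sample chunks kept in memory (the value below is purely illustrative; consult the storage documentation for what suits your machine):
```language-bash
# Keep fewer sample chunks in memory than the default.
./prometheus -config.file=prometheus.yml -storage.local.memory-chunks=500000
```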
......@@ -105,7 +99,7 @@ target scrapes). Go ahead and enter this into the expression console:
prometheus_target_interval_length_seconds
```
This should return a number of different time series (along with the latest value
recorded for each), all with the metric name
`prometheus_target_interval_length_seconds`, but with different labels. These
labels designate different latency percentiles and target group intervals.
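For example, to narrow this down to a single percentile you can filter by label (a sketch; the `quantile` label shown here is the usual one on this summary metric):
```
prometheus_target_interval_length_seconds{quantile="0.99"}
```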
......@@ -155,7 +149,7 @@ correct `GOPATH`) set up.
Download the Go client library for Prometheus and run three of these example
processes:
```language-bash
# Fetch the client library code and compile example.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
......@@ -231,10 +225,10 @@ job_service:rpc_durations_seconds_count:avg_rate5m = avg(rate(rpc_durations_seco
```
To make Prometheus pick up this new rule, add a `rule_files` statement to the
`global` configuration section in your `prometheus.yml`. The config should now
look like this:
```language-yaml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
......
......@@ -20,8 +20,7 @@ notifications to email, Pagerduty, Slack etc.
### Bridge
A bridge is a component that takes samples from a client library and
exposes them to a non-Prometheus monitoring system. For example, the Python,
Go, and Java clients can export metrics to Graphite.
### Client library
......@@ -32,28 +31,37 @@ pull metrics from other systems and expose the metrics to Prometheus.
### Collector
A collector is a part of an exporter that represents a set of metrics. It may be
a single metric if it is part of direct instrumentation, or many metrics if it
is pulling metrics from another system.
### Direct instrumentation
Direct instrumentation is instrumentation added inline as part of the source
code of a program.
### Endpoint
A source of metrics that can be scraped, usually corresponding to a single process.
### Exporter
An exporter is a binary that exposes Prometheus metrics, commonly by converting
metrics that are exposed in a non-Prometheus format into a format Prometheus supports.
### Instance
An instance is a label that uniquely identifies a target in a job.
### Job
A collection of targets with the same purpose, for example monitoring a group of like processes replicated for scalability or reliability, is called a job.
### Notification
A notification represents a group of one or more alerts, and is sent by the Alertmanager to email, Pagerduty, Slack, etc.
### Promdash
Promdash was a native dashboard builder for Prometheus. It has been deprecated
and replaced by [Grafana](../../visualization/grafana/).
### Prometheus
......@@ -102,9 +110,9 @@ A remote write endpoint is what Prometheus talks to when doing a remote write.
### Silence
A silence in the Alertmanager prevents alerts with labels matching the silence from
being included in notifications.
### Target
A target is the definition of an object to scrape, for example what labels to apply, any authentication required to connect, and other information that defines how the scrape will occur.
......@@ -40,7 +40,7 @@ two examples.
### Volumes & bind-mount
Bind-mount your `prometheus.yml` from the host by running:
```
docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
......@@ -62,7 +62,7 @@ configuration itself is rather static and the same across all
environments.
For this, create a new directory with a Prometheus configuration and a
`Dockerfile` like this:
```
FROM prom/prometheus
......@@ -76,7 +76,7 @@ docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus
```
A more advanced option is to render the configuration dynamically on start
with some tooling or even have a daemon update it periodically.
## Using configuration management systems
......@@ -84,19 +84,19 @@ with some tooling or even have a daemon update it periodically.
If you prefer using configuration management systems you might be interested in
the following third-party contributions:
### Ansible
* [griggheo/ansible-prometheus](https://github.com/griggheo/ansible-prometheus)
* [William-Yeh/ansible-prometheus](https://github.com/William-Yeh/ansible-prometheus)
### Chef
* [rayrod2030/chef-prometheus](https://github.com/rayrod2030/chef-prometheus)
### Puppet
* [puppet/prometheus](https://forge.puppet.com/puppet/prometheus)
### SaltStack
* [bechtoldt/saltstack-prometheus-formula](https://github.com/bechtoldt/saltstack-prometheus-formula)
......@@ -12,19 +12,19 @@ monitoring and alerting toolkit originally built at
[SoundCloud](http://soundcloud.com). Since its inception in 2012, many
companies and organizations have adopted Prometheus, and the project has a very
active developer and user [community](/community). It is now a standalone open source project
and maintained independently of any company. To emphasize this, and to clarify
the project's governance structure, Prometheus joined the
[Cloud Native Computing Foundation](https://cncf.io/) in 2016
as the second hosted project, after [Kubernetes](http://kubernetes.io/).
For more elaborate overviews of Prometheus, see the resources linked from the
[media](/docs/introduction/media/) section.
### Features
Prometheus's main features are:
* a multi-dimensional [data model](/docs/concepts/data_model/) with time series data identified by metric name and key/value pairs
* a [flexible query language](/docs/querying/basics/)
to leverage this dimensionality
* no reliance on distributed storage; single server nodes are autonomous
......@@ -41,8 +41,8 @@ optional:
* the main [Prometheus server](https://github.com/prometheus/prometheus) which scrapes and stores time series data
* [client libraries](/docs/instrumenting/clientlibs/) for instrumenting application code
* a [push gateway](https://github.com/prometheus/pushgateway) for supporting short-lived jobs
* special-purpose [exporters](/docs/instrumenting/exporters/) for services like HAProxy, StatsD, Graphite, etc.
* an [alertmanager](https://github.com/prometheus/alertmanager) to handle alerts
* various support tools
Most Prometheus components are written in [Go](https://golang.org/), making
......@@ -50,16 +50,14 @@ them easy to build and deploy as static binaries.
### Architecture
This diagram illustrates the architecture of Prometheus and some of
its ecosystem components:
![Prometheus architecture](/assets/architecture.svg)
Prometheus scrapes metrics from instrumented jobs, either directly or via an
intermediary push gateway for short-lived jobs. It stores all scraped samples
locally and runs rules over this data to either aggregate and record new time
series from existing data or generate alerts. [Grafana](https://grafana.com/)
or other API consumers can be used to visualize the collected data.
## When does it fit?
......@@ -72,7 +70,7 @@ Prometheus is designed for reliability, to be the system you go to
during an outage to allow you to quickly diagnose problems. Each Prometheus
server is standalone, not depending on network storage or other remote services.
You can rely on it when other parts of your infrastructure are broken, and
you do not need to set up extensive infrastructure to use it.
## When does it not fit?
......@@ -80,5 +78,5 @@ Prometheus values reliability. You can always view what statistics are
available about your system, even under failure conditions. If you need 100%
accuracy, such as for per-request billing, Prometheus is not a good choice as
the collected data will likely not be detailed and complete enough. In such a
case you would be best off using some other system to collect and analyze the
data for billing, and Prometheus for the rest of your monitoring.