Commit 8db8d49b authored by beorn7

Merge branch 'master' into next-release

parents b0977c38 f7f2bfe5
...@@ -35,7 +35,7 @@ labels set by an earlier stage: ...@@ -35,7 +35,7 @@ labels set by an earlier stage:
1. Global labels, which are assigned to every target scraped by the Prometheus instance. 1. Global labels, which are assigned to every target scraped by the Prometheus instance.
2. The `job` label, which is configured as a default value for each scrape configuration. 2. The `job` label, which is configured as a default value for each scrape configuration.
3. Labels that are set per target group within a scrape configuration. 3. Labels that are set per target group within a scrape configuration.
4. Advanced label manipulation via [_relabeling_](/docs/operating/configuration/#relabeling-relabel_config). 4. Advanced label manipulation via [_relabeling_](/docs/operating/configuration/#target-relabeling-relabel_config).
Each stage overwrites any colliding labels from the earlier stages. Eventually, we have a flat Each stage overwrites any colliding labels from the earlier stages. Eventually, we have a flat
set of labels that describe a single target. Those labels are then attached to every time series that set of labels that describe a single target. Those labels are then attached to every time series that
...@@ -76,7 +76,7 @@ scrape_configs: ...@@ -76,7 +76,7 @@ scrape_configs:
job: 'job2' job: 'job2'
``` ```
Through a mechanism named [_relabeling_](http://prometheus.io/docs/operating/configuration/#relabeling-relabel_config), Through a mechanism named [_relabeling_](http://prometheus.io/docs/operating/configuration/#target-relabeling-relabel_config),
any label can be removed, created, or modified on a per-target level. This any label can be removed, created, or modified on a per-target level. This
enables fine-grained labeling that can also take into account metadata coming enables fine-grained labeling that can also take into account metadata coming
from the service discovery. Relabeling is the last stage of label assignment from the service discovery. Relabeling is the last stage of label assignment
...@@ -124,7 +124,7 @@ This rule transforms a target with the label set: ...@@ -124,7 +124,7 @@ This rule transforms a target with the label set:
You could then also remove the source labels in an additional relabeling step. You could then also remove the source labels in an additional relabeling step.
You can read more about relabeling and how you can use it to filter targets in the You can read more about relabeling and how you can use it to filter targets in the
[configuration documentation](/docs/operating/configuration#relabeling-relabel_config). [configuration documentation](/docs/operating/configuration#target-relabeling-relabel_config).
Over the next sections, we will see how you can leverage relabeling when using service discovery. Over the next sections, we will see how you can leverage relabeling when using service discovery.
...@@ -219,7 +219,7 @@ has the `production` or `canary` Consul tag, a respective `group` label is assig ...@@ -219,7 +219,7 @@ has the `production` or `canary` Consul tag, a respective `group` label is assig
Each target's `instance` label is set to the node name provided by Consul. Each target's `instance` label is set to the node name provided by Consul.
A full documentation of all configuration parameters for service discovery via Consul A full documentation of all configuration parameters for service discovery via Consul
can be found on the [Prometheus website](/docs/operating/configuration##relabeling-relabel_config). can be found on the [Prometheus website](/docs/operating/configuration#target-relabeling-relabel_config).
## Custom service discovery ## Custom service discovery
......
---
title: Practical Anomaly Detection
created_at: 2015-06-18
kind: article
author_name: Brian Brazil
---
In his *[Open Letter To Monitoring/Metrics/Alerting Companies](http://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts/)*,
John Allspaw asserts that attempting "to detect anomalies perfectly, at the right time, is not possible".
I have seen several attempts by talented engineers to build systems to
automatically detect and diagnose problems based on time series data. While it
is certainly possible to get a demonstration working, the data always turned
out to be too noisy to make this approach work for anything but the simplest of
real-world systems.
All hope is not lost though. There are many common anomalies which you can
detect and handle with custom-built rules. The Prometheus [query
language](../../../../../docs/querying/basics/) gives you the tools to discover
these anomalies while avoiding false positives.
## Building a query
A common problem within a service is when a small number of servers are not
performing as well as the rest, such as responding with increased latency.
Let us say that we have a metric `instance:latency_seconds:mean5m` representing the
average query latency for each instance of a service, calculated via a
[recording rule](/docs/querying/rules/) from a
[Summary](/docs/concepts/metric_types/#summary) metric.
A simple way to start would be to look for instances with a latency
more than two standard deviations above the mean:
```
instance:latency_seconds:mean5m
> on (job) group_left(instance)
(
avg by (job)(instance:latency_seconds:mean5m)
+ on (job)
2 * stddev by (job)(instance:latency_seconds:mean5m)
)
```
You try this out and discover that there are false positives when
the latencies are very tightly clustered. So you add a requirement
that the instance latency also has to be 20% above the average:
```
(
instance:latency_seconds:mean5m
> on (job) group_left(instance)
(
avg by (job)(instance:latency_seconds:mean5m)
+ on (job)
2 * stddev by (job)(instance:latency_seconds:mean5m)
)
)
> on (job) group_left(instance)
1.2 * avg by (job)(instance:latency_seconds:mean5m)
```
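To see how the two conditions interact, the same logic can be sketched outside of Prometheus. The Python below is only an illustration under stated assumptions: `latency_outliers` is a hypothetical helper, the instance names and latency values are made up, and the population standard deviation (`statistics.pstdev`) is used as the counterpart of `stddev`.

```python
from statistics import mean, pstdev


def latency_outliers(latencies, num_stddevs=2.0, min_ratio=1.2):
    """Return the instances whose latency exceeds both thresholds.

    latencies maps an instance name to its mean latency in seconds,
    for the instances of a single job.
    """
    avg = mean(latencies.values())
    # Population standard deviation across the job's instances.
    spread = pstdev(latencies.values())
    stddev_threshold = avg + num_stddevs * spread
    ratio_threshold = min_ratio * avg
    return {
        instance: value
        for instance, value in latencies.items()
        if value > stddev_threshold and value > ratio_threshold
    }


# Made-up data: nine instances at 100ms plus one slightly or clearly slower one.
tight = {f"i{n}": 0.100 for n in range(9)}
tight["i9"] = 0.102
slow = {f"i{n}": 0.100 for n in range(9)}
slow["i9"] = 0.300

print(latency_outliers(tight, min_ratio=0.0))  # standard deviation rule only
print(latency_outliers(tight))                 # both rules
print(latency_outliers(slow))                  # both rules
```

In the tightly clustered case the standard deviation rule alone flags an instance that is only 2ms slower than its peers, while the additional 20% requirement suppresses it.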
Finally, you find that false positives tend to happen at low traffic levels.
You add a requirement that there is enough traffic for each instance to
receive, on average, more than 1 query per second. You create an alert definition for all of this:
```
ALERT InstanceLatencyOutlier
IF
(
instance:latency_seconds:mean5m
> on (job) group_left(instance)
(
avg by (job)(instance:latency_seconds:mean5m)
+ on (job)
2 * stddev by (job)(instance:latency_seconds:mean5m)
)
)
> on (job) group_left(instance)
1.2 * avg by (job)(instance:latency_seconds:mean5m)
and on (job)
avg by (job)(instance:latency_seconds_count:rate5m)
>
1
FOR 30m
SUMMARY "{{$labels.instance}} in {{$labels.job}} is a latency outlier"
DESCRIPTION "{{$labels.instance}} has latency of {{humanizeDuration $value}}"
```
## Automatic actions
The above alert can feed into the
[Alertmanager](/docs/alerting/alertmanager/), and from there to
your chat, ticketing, or paging systems. After a while you might discover that the
usual cause of the alert has no proper fix, but that an automated action such as
a restart, reboot, or machine replacement resolves the issue.
Rather than having humans handle this repetitive task, one option is to
get the Alertmanager to send the alert to a web service that will perform
the action with appropriate throttling and safety features.
The [generic webhook](/docs/alerting/alertmanager/#generic-webhook)
sends alert notifications to an HTTP endpoint of your choice. A simple Alertmanager
configuration that uses it could look like this:
```
# A simple notification configuration which only sends alert notifications to
# an external webhook.
notification_config {
name: "restart_webhook"
webhook_config {
url: "http://example.org/my/hook"
}
}
# An aggregation rule which matches all alerts with the label
# alertname="InstanceLatencyOutlier" and sends them using the "restart_webhook"
# notification configuration.
aggregation_rule {
filter {
name_re: "alertname"
value_re: "InstanceLatencyOutlier"
}
notification_config_name: "restart_webhook"
}
```
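The receiving web service could look roughly like the following Python sketch. It is a minimal illustration, not a reference implementation: the notification payload is treated as opaque JSON because its format is not described here, and `restart_instance`, the port `8080`, and the one-hour throttle window are all hypothetical choices.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class Throttle:
    """Allow at most one action per key within min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_run = {}

    def allow(self, key):
        now = self.clock()
        last = self.last_run.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_run[key] = now
        return True


# Hypothetical safety setting: at most one restart per hour.
throttle = Throttle(min_interval=3600)


def restart_instance(payload):
    # Placeholder: a real service would call your own tooling here.
    print("would restart based on alert payload:", payload)


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            # Reject bodies that are not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        # One global key keeps the sketch simple; a real service would
        # more likely throttle per affected instance.
        if throttle.allow("restart"):
            restart_instance(payload)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the webhook receiver; call serve() to run it."""
    HTTPServer(("", port), HookHandler).serve_forever()
```

Passing the clock into `Throttle` keeps the safety logic testable without waiting for real time to pass.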
## Summary
The Prometheus query language allows for rich processing of your monitoring
data. This lets you create alerts with good signal-to-noise ratios, and the
Alertmanager's generic webhook support can trigger automatic remediations.
This all combines to enable on-call engineers to focus on problems where they can
have the most impact.
When defining alerts for your services, see also our [alerting best practices](http://prometheus.io/docs/practices/alerting/).
<%= atom_feed :title => 'Prometheus Blog', :author_name => '© Prometheus Authors 2015', <%= atom_feed :title => 'Prometheus Blog', :author_name => '© Prometheus Authors 2015',
:author_uri => 'http://prometheus.io/blog/', :limit => 10 %> :author_uri => 'http://prometheus.io/blog/', :limit => 10,
:logo => 'http://prometheus.io/assets/prometheus_logo.png',
:icon => 'http://prometheus.io/assets/favicons/favicon.ico' %>
...@@ -5,11 +5,12 @@ sort_rank: 2 ...@@ -5,11 +5,12 @@ sort_rank: 2
# Metric types # Metric types
The Prometheus client libraries offer three core metric types: The Prometheus client libraries offer four core metric types:
* Counters * Counter
* Gauges * Gauge
* Summaries * Histogram
* Summary
These metric types are currently only differentiated in the client libraries These metric types are currently only differentiated in the client libraries
(to enable APIs tailored to the usage of the specific types) and in the wire (to enable APIs tailored to the usage of the specific types) and in the wire
......
...@@ -15,14 +15,15 @@ HTTP endpoint on your application’s instance: ...@@ -15,14 +15,15 @@ HTTP endpoint on your application’s instance:
* [Go](https://github.com/prometheus/client_golang) * [Go](https://github.com/prometheus/client_golang)
* [Java or Scala](https://github.com/prometheus/client_java) * [Java or Scala](https://github.com/prometheus/client_java)
* [Ruby](https://github.com/prometheus/client_ruby)
* [Python](https://github.com/prometheus/client_python) * [Python](https://github.com/prometheus/client_python)
* [Ruby](https://github.com/prometheus/client_ruby)
Unofficial third-party client libraries: Unofficial third-party client libraries:
* [Bash](https://github.com/aecolley/client_bash)
* [Haskell](https://github.com/fimad/prometheus-haskell)
* [Node.js](https://github.com/StreamMachine/prometheus_client_nodejs) * [Node.js](https://github.com/StreamMachine/prometheus_client_nodejs)
* [.NET / C#](https://github.com/andrasm/prometheus-net) * [.NET / C#](https://github.com/andrasm/prometheus-net)
* [Bash](https://github.com/aecolley/client_bash)
When Prometheus scrapes your instance's HTTP endpoint, the client library When Prometheus scrapes your instance's HTTP endpoint, the client library
sends the current state of all tracked metrics to the server. sends the current state of all tracked metrics to the server.
......
...@@ -16,38 +16,42 @@ These exporters are maintained as part of the official ...@@ -16,38 +16,42 @@ These exporters are maintained as part of the official
[Prometheus GitHub organization](https://github.com/prometheus): [Prometheus GitHub organization](https://github.com/prometheus):
* [Node/system metrics exporter](https://github.com/prometheus/node_exporter) * [Node/system metrics exporter](https://github.com/prometheus/node_exporter)
* [Graphite exporter](https://github.com/prometheus/graphite_exporter) * [AWS CloudWatch exporter](https://github.com/prometheus/cloudwatch_exporter)
* [Collectd exporter](https://github.com/prometheus/collectd_exporter) * [Collectd exporter](https://github.com/prometheus/collectd_exporter)
* [JMX exporter](https://github.com/prometheus/jmx_exporter) * [Consul exporter](https://github.com/prometheus/consul_exporter)
* [Graphite exporter](https://github.com/prometheus/graphite_exporter)
* [HAProxy exporter](https://github.com/prometheus/haproxy_exporter) * [HAProxy exporter](https://github.com/prometheus/haproxy_exporter)
* [StatsD bridge](https://github.com/prometheus/statsd_bridge)
* [AWS CloudWatch exporter](https://github.com/prometheus/cloudwatch_exporter)
* [Hystrix metrics publisher](https://github.com/prometheus/hystrix) * [Hystrix metrics publisher](https://github.com/prometheus/hystrix)
* [JMX exporter](https://github.com/prometheus/jmx_exporter)
* [Mesos task exporter](https://github.com/prometheus/mesos_exporter) * [Mesos task exporter](https://github.com/prometheus/mesos_exporter)
* [Consul exporter](https://github.com/prometheus/consul_exporter)
* [MySQL server exporter](https://github.com/prometheus/mysqld_exporter) * [MySQL server exporter](https://github.com/prometheus/mysqld_exporter)
* [StatsD bridge](https://github.com/prometheus/statsd_bridge)
The [JMX exporter](https://github.com/prometheus/jmx_exporter) can export from a The [JMX exporter](https://github.com/prometheus/jmx_exporter) can export from a
wide variety of JVM-based applications, for example [Kafka](http://kafka.apache.org/) and wide variety of JVM-based applications, for example [Kafka](http://kafka.apache.org/) and
[Cassandra](http://cassandra.apache.org/). [Cassandra](http://cassandra.apache.org/).
## Unofficial third-party exporters ## Independently maintained third-party exporters
There are also a number of exporters which are externally contributed and There are also a number of exporters which are externally contributed
maintained. Note that these may have not been vetted for best practices by the and maintained. We encourage the creation of more exporters but cannot
Prometheus core team yet: vet all of them for best practices. Commonly, those exporters are
hosted outside of the Prometheus GitHub organization.
* [RethinkDB exporter](https://github.com/oliver006/rethinkdb_exporter)
* [Redis exporter](https://github.com/oliver006/redis_exporter)
* [scollector exporter](https://github.com/tgulacsi/prometheus_scollector)
* [MongoDB exporter](https://github.com/dcu/mongodb_exporter)
* [CouchDB exporter](https://github.com/gesellix/couchdb-exporter) * [CouchDB exporter](https://github.com/gesellix/couchdb-exporter)
* [Django exporter](https://github.com/korfuri/django-prometheus) * [Django exporter](https://github.com/korfuri/django-prometheus)
* [Google's mtail log data extractor](https://github.com/google/mtail) * [Google's mtail log data extractor](https://github.com/google/mtail)
* [Minecraft exporter module](https://github.com/Baughn/PrometheusIntegration)
* [Meteor JS web framework exporter](https://atmospherejs.com/sevki/prometheus-exporter)
* [Memcached exporter](https://github.com/Snapbug/memcache_exporter) * [Memcached exporter](https://github.com/Snapbug/memcache_exporter)
* [Meteor JS web framework exporter](https://atmospherejs.com/sevki/prometheus-exporter)
* [Minecraft exporter module](https://github.com/Baughn/PrometheusIntegration)
* [MongoDB exporter](https://github.com/dcu/mongodb_exporter)
* [Munin exporter](https://github.com/pvdh/munin_exporter)
* [New Relic exporter](https://github.com/jfindley/newrelic_exporter) * [New Relic exporter](https://github.com/jfindley/newrelic_exporter)
* [RabbitMQ exporter](https://github.com/kbudde/rabbitmq_exporter)
* [Redis exporter](https://github.com/oliver006/redis_exporter)
* [RethinkDB exporter](https://github.com/oliver006/rethinkdb_exporter)
* [Rsyslog exporter](https://github.com/digitalocean/rsyslog_exporter)
* [scollector exporter](https://github.com/tgulacsi/prometheus_scollector)
## Directly instrumented software ## Directly instrumented software
...@@ -55,9 +59,9 @@ Some third-party software already exposes Prometheus metrics natively, so no ...@@ -55,9 +59,9 @@ Some third-party software already exposes Prometheus metrics natively, so no
separate exporters are needed: separate exporters are needed:
* [cAdvisor](https://github.com/google/cadvisor) * [cAdvisor](https://github.com/google/cadvisor)
* [Kubernetes](https://github.com/GoogleCloudPlatform/kubernetes)
* [Kubernetes-Mesos](https://github.com/mesosphere/kubernetes-mesos)
* [Etcd](https://github.com/coreos/etcd) * [Etcd](https://github.com/coreos/etcd)
* [gokit](https://github.com/peterbourgon/gokit)
* [go-metrics instrumentation library](https://github.com/armon/go-metrics) * [go-metrics instrumentation library](https://github.com/armon/go-metrics)
* [gokit](https://github.com/peterbourgon/gokit)
* [Kubernetes-Mesos](https://github.com/mesosphere/kubernetes-mesos)
* [Kubernetes](https://github.com/GoogleCloudPlatform/kubernetes)
* [RobustIRC](http://robustirc.net/) * [RobustIRC](http://robustirc.net/)
...@@ -32,12 +32,12 @@ If using Maven, add the following to `pom.xml`: ...@@ -32,12 +32,12 @@ If using Maven, add the following to `pom.xml`:
<dependency> <dependency>
<groupId>io.prometheus</groupId> <groupId>io.prometheus</groupId>
<artifactId>simpleclient</artifactId> <artifactId>simpleclient</artifactId>
<version>0.0.6</version> <version>0.0.10</version>
</dependency> </dependency>
<dependency> <dependency>
<groupId>io.prometheus</groupId> <groupId>io.prometheus</groupId>
<artifactId>simpleclient_pushgateway</artifactId> <artifactId>simpleclient_pushgateway</artifactId>
<version>0.0.6</version> <version>0.0.10</version>
</dependency> </dependency>
``` ```
...@@ -69,7 +69,7 @@ void executeBatchJob() throws Exception { ...@@ -69,7 +69,7 @@ void executeBatchJob() throws Exception {
} finally { } finally {
durationTimer.setDuration(); durationTimer.setDuration();
PushGateway pg = new PushGateway("127.0.0.1:9091"); PushGateway pg = new PushGateway("127.0.0.1:9091");
pg.pushAdd(registry, "my_batch_job", "my_batch_job"); pg.pushAdd(registry, "my_batch_job");
} }
} }
``` ```
......
...@@ -117,11 +117,9 @@ for the current state of this effort. ...@@ -117,11 +117,9 @@ for the current state of this effort.
### Which languages have instrumentation libraries? ### Which languages have instrumentation libraries?
Currently, there are client libraries for: There are a number of client libraries for instrumenting your services with
Prometheus metrics. See the [client libraries](/docs/instrumenting/clientlibs/)
* [Go](https://github.com/prometheus/client_golang) documentation for details.
* [Java or Scala](https://github.com/prometheus/client_java)
* [Ruby](https://github.com/prometheus/client_ruby)
If you are interested in contributing a client library for a new language, see If you are interested in contributing a client library for a new language, see
the [exposition formats](/docs/instrumenting/exposition_formats/). the [exposition formats](/docs/instrumenting/exposition_formats/).
...@@ -211,12 +209,13 @@ second. The latter depends on the compressibility of the sample data ...@@ -211,12 +209,13 @@ second. The latter depends on the compressibility of the sample data
and on the number of time series the samples belong to, but to give and on the number of time series the samples belong to, but to give
you an idea, here are some results from benchmarks: you an idea, here are some results from benchmarks:
* On an older 8-core machine with Intel Core i7 CPUs and two spinning * On an older 8-core machine with Intel Core i7 CPUs, 8GiB RAM, and
disks (Samsung HD753LJ) in a RAID-1 setup, Prometheus sustained an two spinning disks (Samsung HD753LJ) in a RAID-1 setup, Prometheus
ingestion rate of 34k samples per second, belonging to 170k time sustained an ingestion rate of 34k samples per second, belonging to
series, scraped from 600 targets. 170k time series, scraped from 600 targets.
* On a modern server with SSD, Prometheus sustained an ingestion rate
of 340k samples per second, belonging to 2M time * On a modern server with 64GiB RAM and SSD, Prometheus sustained an
ingestion rate of 340k samples per second, belonging to 2M time
series, scraped from 1800 targets. series, scraped from 1800 targets.
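As a rough sanity check on these figures, the ratio of time series to samples per second gives the average interval between samples of one series. The Python below is a back-of-the-envelope sketch that assumes a steady state in which every series receives exactly one sample per scrape; that assumption and the helper name `implied_scrape_interval` are not part of the benchmark description.

```python
def implied_scrape_interval(series, samples_per_second):
    """Seconds between samples of one series, assuming one sample per scrape."""
    return series / samples_per_second


# Figures taken from the two benchmark results above.
older = implied_scrape_interval(170_000, 34_000)
modern = implied_scrape_interval(2_000_000, 340_000)

print(f"older machine:  {older:.1f}s between samples, "
      f"{170_000 / 600:.0f} series per target")
print(f"modern server:  {modern:.1f}s between samples, "
      f"{2_000_000 / 1800:.0f} series per target")
```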
In both cases, there were no obvious bottlenecks. Various stages of the In both cases, there were no obvious bottlenecks. Various stages of the
......
...@@ -21,28 +21,23 @@ git clone https://github.com/prometheus/prometheus.git ...@@ -21,28 +21,23 @@ git clone https://github.com/prometheus/prometheus.git
## Building Prometheus ## Building Prometheus
Building Prometheus currently still requires a `make` step, as some parts of [Download the latest release](https://github.com/prometheus/prometheus/releases)
the source are autogenerated (web assets). of Prometheus for your platform, then extract and run it:
```language-bash ```
cd prometheus tar xvfz prometheus-*.tar.gz
make build ./prometheus
``` ```
Note that building requires a fair amount of memory. You should have It should fail to start, complaining about the absence of a configuration file.
at least 2GiB of RAM available.
If you encounter problems building Prometheus, see [the more detailed build
instructions](https://github.com/prometheus/prometheus#use-make) in the
README.md.
## Configuring Prometheus to monitor itself ## Configuring Prometheus to monitor itself
Prometheus collects metrics from monitored targets by scraping metrics HTTP Prometheus collects metrics from monitored targets by scraping metrics HTTP
endpoints on these targets. Since Prometheus also exposes data in the same endpoints on these targets. Since Prometheus also exposes data in the same
manner about itself, it may also be used to scrape and monitor its own health. manner about itself, it can also scrape and monitor its own health.
While a Prometheus server which collects only data about itself is not very While a Prometheus server that collects only data about itself is not very
useful in practice, it is a good starting example. Save the following basic useful in practice, it is a good starting example. Save the following basic
Prometheus configuration as a file named `prometheus.yml`: Prometheus configuration as a file named `prometheus.yml`:
...@@ -86,11 +81,11 @@ Prometheus build directory and run: ...@@ -86,11 +81,11 @@ Prometheus build directory and run:
``` ```
Prometheus should start up and it should show a status page about itself at Prometheus should start up and it should show a status page about itself at
http://localhost:9090. Give it a couple of seconds to start collecting data http://localhost:9090. Give it a couple of seconds to collect data about itself
about itself from its own HTTP metrics endpoint. from its own HTTP metrics endpoint.
You can also verify that Prometheus is serving metrics about itself by You can also verify that Prometheus is serving metrics about itself by
navigating to its metrics exposure endpoint: http://localhost:9090/metrics navigating to its metrics endpoint: http://localhost:9090/metrics
By default, Prometheus will only execute at most one OS thread at a By default, Prometheus will only execute at most one OS thread at a
time. In production scenarios on multi-CPU machines, you will most time. In production scenarios on multi-CPU machines, you will most
...@@ -140,7 +135,7 @@ To count the number of returned time series, you could write: ...@@ -140,7 +135,7 @@ To count the number of returned time series, you could write:
count(prometheus_target_interval_length_seconds) count(prometheus_target_interval_length_seconds)
``` ```
For further details about the expression language, see the For more about the expression language, see the
[expression language documentation](/docs/querying/basics/). [expression language documentation](/docs/querying/basics/).
## Using the graphing interface ## Using the graphing interface
...@@ -193,8 +188,8 @@ endpoints to a single job, adding extra labels to each group of targets. In ...@@ -193,8 +188,8 @@ endpoints to a single job, adding extra labels to each group of targets. In
this example, we will add the `group="production"` label to the first group of this example, we will add the `group="production"` label to the first group of
targets, while adding `group="canary"` to the second. targets, while adding `group="canary"` to the second.
To achieve this, add the following job definition to your `prometheus.yml` and To achieve this, add the following job definition to the `scrape_configs`
restart your Prometheus instance: section in your `prometheus.yml` and restart your Prometheus instance:
``` ```
scrape_configs: scrape_configs:
...@@ -262,6 +257,15 @@ rule_files: ...@@ -262,6 +257,15 @@ rule_files:
- 'prometheus.rules' - 'prometheus.rules'
scrape_configs: scrape_configs:
- job_name: 'prometheus'
# Override the global default and scrape targets from this job every 5 seconds.
scrape_interval: 5s
scrape_timeout: 10s
target_groups:
- targets: ['localhost:9090']
- job_name: 'example-random' - job_name: 'example-random'
scrape_interval: 5s scrape_interval: 5s
......
---
title: Media
sort_rank: 7
---
# Media
Resources on the Internet helpful to get started with Prometheus.
## Blogs
* This site has its own [blog](http://prometheus.io/blog/).
* [SoundCloud's blog post announcing Prometheus](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud) – a more elaborate overview than the one given on this site.
* The [monitoring series](http://www.boxever.com/tag/monitoring) on Boxever's tech blog.
## Recorded talks
* [Prometheus: A Next-Generation Monitoring System](https://www.usenix.org/conference/srecon15europe/program/presentation/rabenstein) – Julius Volz and Björn Rabenstein at SREcon15 Europe, Dublin.
* [What is your application doing right now?](http://youtu.be/Z0LlilNpX1U) – Matthias Gruter, Transmode, at DevOps Stockholm Meetup.
* In German: [Monitoring mit Prometheus](https://entropia.de/GPN15:Monitoring_mit_Prometheus) – Michael Stapelberg at Gulaschprogrammiernacht 15.
* [Prometheus workshop](https://vimeo.com/131581353) – Jamie Wilkinson at Monitorama PDX 2015 ([Slides](https://docs.google.com/presentation/d/1X1rKozAUuF2MVc1YXElFWq9wkcWv3Axdldl8LOH9Vik/edit)).
## Presentation slides
* [Systems Monitoring with Prometheus](http://www.slideshare.net/brianbrazil/devops-ireland-systems-monitoring-with-prometheus) – Brian Brazil at Devops Ireland Meetup, Dublin.
* [Monitoring your Python with Prometheus](http://www.slideshare.net/brianbrazil/python-ireland-monitoring-your-python-with-prometheus) – Brian Brazil at Python Ireland Meetup, Dublin.
...@@ -12,8 +12,8 @@ monitoring and alerting toolkit built at [SoundCloud](http://soundcloud.com). ...@@ -12,8 +12,8 @@ monitoring and alerting toolkit built at [SoundCloud](http://soundcloud.com).
Since its inception in 2012, it has become the standard for instrumenting new Since its inception in 2012, it has become the standard for instrumenting new
services at SoundCloud and is seeing growing external usage and contributions. services at SoundCloud and is seeing growing external usage and contributions.
For a more elaborate overview, see also [SoundCloud's blog post which announces For a more elaborate overview, see the resources linked from the
Prometheus](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud). [media](/docs/introduction/media/) section.
### Features ### Features
......
...@@ -110,6 +110,10 @@ dns_sd_configs: ...@@ -110,6 +110,10 @@ dns_sd_configs:
consul_sd_configs: consul_sd_configs:
[ - <consul_sd_config> ... ] [ - <consul_sd_config> ... ]
# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
[ - <serverset_sd_config> ... ]
# List of file service discovery configurations. # List of file service discovery configurations.
file_sd_configs: file_sd_configs:
[ - <file_sd_config> ... ] [ - <file_sd_config> ... ]
...@@ -118,9 +122,13 @@ file_sd_configs: ...@@ -118,9 +122,13 @@ file_sd_configs:
target_groups: target_groups:
[ - <target_group> ... ] [ - <target_group> ... ]
# List of relabel configurations. # List of target relabel configurations.
relabel_configs: relabel_configs:
[ - <relabel_config> ... ] [ - <relabel_config> ... ]
# List of metric relabel configurations.
metric_relabel_configs:
[ - <relabel_config> ... ]
``` ```
Where `<scheme>` may be `http` or `https` and `<path>` is a valid URL path. Where `<scheme>` may be `http` or `https` and `<path>` is a valid URL path.
...@@ -153,9 +161,9 @@ A DNS-SD configuration allows specifying a set of DNS SRV record names which ...@@ -153,9 +161,9 @@ A DNS-SD configuration allows specifying a set of DNS SRV record names which
are periodically queried to discover a list of targets (host-port pairs). The are periodically queried to discover a list of targets (host-port pairs). The
DNS servers to be contacted are read from `/etc/resolv.conf`. DNS servers to be contacted are read from `/etc/resolv.conf`.
During the [relabeling phase](#relabeling-relabel_config), the meta label `__meta_dns_srv_name` is During the [relabeling phase](#target-relabeling-relabel_config), the meta
available on each target and is set to the SRV record name that produced the label `__meta_dns_srv_name` is available on each target and is set to the SRV
discovered target. record name that produced the discovered target.
``` ```
# A list of DNS SRV record names to be queried. # A list of DNS SRV record names to be queried.
...@@ -198,6 +206,34 @@ services: ...@@ -198,6 +206,34 @@ services:
[ tag_separator: <string> | default = , ] [ tag_separator: <string> | default = , ]
``` ```
### Zookeeper Serverset SD configurations `<serverset_sd_config>`
Serverset SD configurations allow retrieving scrape targets from
[Serversets](https://github.com/twitter/finagle/tree/master/finagle-serversets) which are
stored in [Zookeeper](https://zookeeper.apache.org/). Serversets are commonly
used by [Finagle](https://twitter.github.io/finagle/) and
[Aurora](http://aurora.apache.org/).
The following meta labels are available on targets during relabeling:
* `__meta_serverset_path`: the full path to the serverset member node in Zookeeper
* `__meta_serverset_endpoint_host`: the host of the default endpoint
* `__meta_serverset_endpoint_port`: the port of the default endpoint
* `__meta_serverset_endpoint_host_<endpoint>`: the host of the given endpoint
* `__meta_serverset_endpoint_port_<endpoint>`: the port of the given endpoint
* `__meta_serverset_status`: the status of the member
```
# The Zookeeper servers.
servers:
- <host>
# Paths can point to a single serverset, or the root of a tree of serversets.
paths:
- <string>
[ timeout: <duration> | default = 10s ]
```
Serverset data must be in the JSON format; the Thrift format is not currently supported.
### File-based SD configurations `<file_sd_config>` ### File-based SD configurations `<file_sd_config>`
...@@ -223,8 +259,9 @@ The JSON version of a target group has the following format: ...@@ -223,8 +259,9 @@ The JSON version of a target group has the following format:
As a fallback, the file contents are also re-read periodically at the specified As a fallback, the file contents are also re-read periodically at the specified
refresh interval. refresh interval.
Each target has a meta label `__meta_filepath` during the [relabeling phase](#relabeling-relabel_config). Each target has a meta label `__meta_filepath` during the
Its value is set to the filepath from which the target was extracted. [relabeling phase](#target-relabeling-relabel_config). Its value is set to the
filepath from which the target was extracted.
``` ```
# Patterns for files from which target groups are extracted. # Patterns for files from which target groups are extracted.
...@@ -239,7 +276,7 @@ Where `<filename_pattern>` may be a path ending in `.json`, `.yml` or `.yaml`. T ...@@ -239,7 +276,7 @@ Where `<filename_pattern>` may be a path ending in `.json`, `.yml` or `.yaml`. T
may contain a single `*` that matches any character sequence, e.g. `my/path/tg_*.json`. may contain a single `*` that matches any character sequence, e.g. `my/path/tg_*.json`.
### Relabeling `<relabel_config>` ### Target relabeling `<relabel_config>`
Relabeling is a powerful tool to dynamically rewrite the label set of a target before Relabeling is a powerful tool to dynamically rewrite the label set of a target before
it gets scraped. Multiple relabeling steps can be configured per scrape configuration. it gets scraped. Multiple relabeling steps can be configured per scrape configuration.
...@@ -271,7 +308,10 @@ source_labels: '[' <labelname> [, ...] ']' ...@@ -271,7 +308,10 @@ source_labels: '[' <labelname> [, ...] ']'
[ target_label: <labelname> ] [ target_label: <labelname> ]
# Regular expression against which the extracted value is matched. # Regular expression against which the extracted value is matched.
regex: <regex> [ regex: <regex> ]
# Modulus to take of the hash of the source label values.
[ modulus: <uint64> ]
# Replacement value against which a regex replace is performed if the # Replacement value against which a regex replace is performed if the
# regular expression matches. # regular expression matches.
...@@ -281,7 +321,9 @@ regex: <regex> ...@@ -281,7 +321,9 @@ regex: <regex>
[ action: <relabel_action> | default = replace ] [ action: <relabel_action> | default = replace ]
``` ```
`<regex>` is any valid [RE2 regular expression](https://github.com/google/re2/wiki/Syntax). `<regex>` is any valid [RE2 regular
expression](https://github.com/google/re2/wiki/Syntax). It is required for
the `replace`, `keep`, and `drop` actions.
`<relabel_action>` determines the relabeling action to take: `<relabel_action>` determines the relabeling action to take:
...@@ -290,3 +332,12 @@ regex: <regex> ...@@ -290,3 +332,12 @@ regex: <regex>
(`${1}`, `${2}`, ...) in `replacement` substituted by their value. (`${1}`, `${2}`, ...) in `replacement` substituted by their value.
* `keep`: Drop targets for which `regex` does not match the concatenated `source_labels`. * `keep`: Drop targets for which `regex` does not match the concatenated `source_labels`.
* `drop`: Drop targets for which `regex` matches the concatenated `source_labels`. * `drop`: Drop targets for which `regex` matches the concatenated `source_labels`.
* `hashmod`: Set `target_label` to the `modulus` of a hash of the concatenated `source_labels`.
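As a rough illustration of the `hashmod` action, the sketch below joins the source label values, hashes the result, and reduces the hash by `modulus`. The text above only says "a hash", so the use of MD5, the choice of the last 8 digest bytes, and the `;` separator are all assumptions made for this example, not a statement of what Prometheus does internally.

```python
import hashlib


def hashmod(source_values, modulus, separator=";"):
    """Return hash(concatenated source values) % modulus.

    Assumptions for illustration only: MD5 as the hash function, the last
    8 bytes of the digest read as an unsigned 64-bit integer, and ";" as
    the separator between source label values.
    """
    joined = separator.join(source_values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[8:], "big")
    return hash_value % modulus
```

Because the result is always in the range `0` to `modulus - 1`, writing it to a `target_label` and following up with a `keep` action on one specific value is one way to split targets across several servers.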
### Metric relabeling `<metric_relabel_configs>`
Metric relabeling is applied to samples as the last step before ingestion. It
has the same configuration format and actions as target relabeling. Metric
relabeling does not apply to automatically generated time series such as `up`.
One use for this is to blacklist time series that are too expensive to ingest.
...@@ -69,6 +69,13 @@ of thumb, keep it somewhere between 50% and 100% of the ...@@ -69,6 +69,13 @@ of thumb, keep it somewhere between 50% and 100% of the
is larger checkpoints. The consequences of a value too low are much is larger checkpoints. The consequences of a value too low are much
more serious. more serious.
Out of the metrics that Prometheus exposes about itself, the following are
particularly useful for tuning the flags above:
* `prometheus_local_storage_memory_series`: The current number of series held in memory.
* `prometheus_local_storage_memory_chunks`: The current number of chunks held in memory.
* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks that still need to be persisted to disk.
## Crash recovery ## Crash recovery
Prometheus saves chunks to disk as soon as possible after they are Prometheus saves chunks to disk as soon as possible after they are
......
---
title: HTTP API
sort_rank: 7
---
# HTTP API
The current stable HTTP API is reachable under `/api/v1` on a Prometheus
server. Any non-breaking additions will be added under that endpoint.
## Format overview
The API response format is JSON. Every successful API request returns a `2xx`
status code.
Invalid requests that reach the API handlers return a JSON error object
and the HTTP response code `422 Unprocessable Entity`
([RFC4918](http://tools.ietf.org/html/rfc4918#page-78)). Other non-`2xx` codes
may be returned for errors occurring before the API endpoint is reached.
The JSON response envelope format is as follows:
```
{
"status": "success" | "error",
"data": <data>,
// Only set if status is "error". The data field may still hold
// additional data.
"errorType": "<string>",
"error": "<string>"
}
```
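A client might unwrap this envelope along the following lines. This is a minimal sketch: the `APIError` class and the `unwrap` function are hypothetical names chosen for this example, and only the fields shown in the envelope above are read.

```python
import json


class APIError(Exception):
    """Raised when the envelope reports status "error" (hypothetical helper)."""


def unwrap(body):
    """Parse a response body and return its "data" field, or raise APIError."""
    envelope = json.loads(body)
    if envelope.get("status") != "success":
        # "errorType" and "error" are only set when status is "error".
        raise APIError(f'{envelope.get("errorType")}: {envelope.get("error")}')
    return envelope.get("data")
```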
Input timestamps may be provided either in
[RFC3339](https://www.ietf.org/rfc/rfc3339.txt) format or as a Unix timestamp
in seconds, with optional decimal places for sub-second precision. Output
timestamps are always represented as Unix timestamps in seconds.
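To compare an RFC3339 input with the Unix timestamps that come back, a client can normalize both forms to seconds. The helper below is a sketch; the name `to_unix_seconds` is made up for this example, and treating a timestamp without an offset as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_unix_seconds(value):
    """Accept an RFC3339 string or a Unix timestamp and return float seconds."""
    try:
        return float(value)  # already a Unix timestamp such as "1435781451.781"
    except ValueError:
        pass
    # Older versions of datetime.fromisoformat reject a trailing "Z".
    text = value[:-1] + "+00:00" if value.endswith(("Z", "z")) else value
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: treat timestamps without an offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()
```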
Names of query parameters that may be repeated end with `[]`.
`<series_selector>` placeholders refer to Prometheus [time series
selectors](/docs/querying/basics/#time-series-selectors) like
`http_requests_total` or `http_requests_total{method=~"^GET|POST$"}` and need
to be URL-encoded.
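The URL-encoding step can be done with any standard library. As a sketch in Python (the function name `encode_selector` is made up for this example):

```python
from urllib.parse import quote, unquote


def encode_selector(selector):
    """Percent-encode a series selector for use as a query parameter value."""
    # safe="" also encodes "/" so the whole selector stays inside one value.
    return quote(selector, safe="")


example = 'http_requests_total{method=~"^GET|POST$"}'
encoded = encode_selector(example)
# Decoding the encoded form yields the original selector again.
roundtrip = unquote(encoded)
```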
`<duration>` placeholders refer to Prometheus duration strings of the form
`[0-9]+[smhdwy]`. For example, `5m` refers to a duration of 5 minutes.
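The following sketch converts such a duration string to seconds. The pattern comes from the text above; treating a week as 7 days and a year as 365 days is an assumption of this example, and `duration_to_seconds` is a hypothetical name.

```python
import re

# Assumption: "w" is 7 days and "y" is 365 days.
_UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}
_DURATION_RE = re.compile(r"([0-9]+)([smhdwy])")


def duration_to_seconds(text):
    """Convert a duration string such as "5m" to a number of seconds."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```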
## Expression queries
Query language expressions may be evaluated at a single instant or over a range
of time. The sections below describe the API endpoints for each type of
expression query.
### Instant queries
The following endpoint evaluates an instant query at a single point in time:
```
GET /api/v1/query
```
URL query parameters:
- `query=<string>`: Prometheus expression query string.
- `time=<rfc3339 | unix_timestamp>`: Evaluation timestamp.
The `data` section of the query result has the following format:
```
{
"resultType": "matrix" | "vector" | "scalar" | "string",
"result": <value>
}
```
`<value>` refers to the query result data, which has varying formats
depending on the `resultType`. See the [expression query result
formats](#expression-query-result-formats).
The following example evaluates the expression `up` at the time
`2015-07-01T20:10:51.781Z`:
```
$ curl 'http://localhost:9090/api/v1/query?query=up&time=2015-07-01T20:10:51.781Z'
{
"status" : "success",
"data" : {
"resultType" : "vector",
"result" : [
{
"metric" : {
"__name__" : "up",
"job" : "prometheus",
"instance" : "localhost:9090"
},
"value": [ 1435781451.781, "1" ]
},
{
"metric" : {
"__name__" : "up",
"job" : "node",
"instance" : "localhost:9100"
},
"value" : [ 1435781451.781, "0" ]
}
]
}
}
```
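A consumer of this response might flatten the vector into a lookup table. The sketch below assumes the already decoded JSON envelope is passed in as a Python dictionary; `vector_to_dict` and the choice of `(job, instance)` as the key are illustrative only.

```python
def vector_to_dict(envelope):
    """Map (job, instance) to the float sample value of a "vector" result."""
    data = envelope["data"]
    if data["resultType"] != "vector":
        raise ValueError(f'expected a vector, got {data["resultType"]}')
    samples = {}
    for sample in data["result"]:
        labels = sample["metric"]
        _timestamp, raw_value = sample["value"]
        # Sample values arrive as quoted strings and need converting.
        samples[(labels.get("job"), labels.get("instance"))] = float(raw_value)
    return samples
```

Applied to the response above, this would yield one entry per target, with `1.0` for the `prometheus` job and `0.0` for the `node` job.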
### Range queries
The following endpoint evaluates an expression query over a range of time:
```
GET /api/v1/query_range
```
URL query parameters:
- `query=<string>`: Prometheus expression query string.
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.
- `step=<duration>`: Query resolution step width.
The `data` section of the query result has the following format:
```
{
"resultType": "matrix",
"result": <value>
}
```
For the format of the `<value>` placeholder, see the [range-vector result
format](#range-vectors).
The following example evaluates the expression `up` over a 30-second range with
a query resolution of 15 seconds.
```
$ curl 'http://localhost:9090/api/v1/query_range?query=up&start=2015-07-01T20:10:30.781Z&end=2015-07-01T20:11:00.781Z&step=15s'
{
"status" : "success",
"data" : {
"resultType" : "matrix",
"result" : [
{
"metric" : {
"__name__" : "up",
"job" : "prometheus",
"instance" : "localhost:9090"
},
"values" : [
[ 1435781430.781, "1" ],
[ 1435781445.781, "1" ],
[ 1435781460.781, "1" ]
]
},
{
"metric" : {
"__name__" : "up",
"job" : "node",
"instance" : "localhost:9091"
},
"values" : [
[ 1435781430.781, "0" ],
[ 1435781445.781, "0" ],
[ 1435781460.781, "1" ]
]
}
]
}
}
```
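The three samples per series in this example are consistent with one evaluation at `start` and one more every `step` seconds up to and including `end`. The sketch below spells out that reading; it is an assumption drawn from the example output, not a specification of the server's behavior.

```python
def evaluation_timestamps(start, end, step):
    """List start, start + step, ... for all values that do not exceed end."""
    if step <= 0:
        raise ValueError("step must be positive")
    count = int((end - start) // step) + 1
    return [start + i * step for i in range(count)]
```

With a 30-second range and a 15-second step, this gives three timestamps, matching the three entries in each `values` list above.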
## Querying metadata
### Finding series by label matchers
The following endpoint returns the list of time series that match a certain label set.
```
GET /api/v1/series
```
URL query parameters:
- `match[]=<series_selector>`: Repeated series selector argument that selects the
series to return. At least one `match[]` argument must be provided.
The `data` section of the query result consists of a list of objects that
contain the label name/value pairs which identify each series.
The following example returns all series that match either of the selectors
`up` or `process_start_time_seconds{job="prometheus"}`:
```
$ curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
{
"status" : "success",
"data" : [
{
"__name__" : "up",
"job" : "prometheus",
"instance" : "localhost:9090"
},
{
"__name__" : "up",
"job" : "node",
"instance" : "localhost:9091"
},
{
"__name__" : "process_start_time_seconds",
"job" : "prometheus",
"instance" : "localhost:9090"
}
]
}
```
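When building such a request in code rather than with `curl`, the repeated `match[]` parameter can be produced from a list of selectors. This is a sketch; `series_url` and `selectors_from_url` are hypothetical helper names, and the base URL is the one used in the examples above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def series_url(base, selectors):
    """Build a /api/v1/series URL with one match[] parameter per selector."""
    if not selectors:
        raise ValueError("at least one match[] argument must be provided")
    # urlencode percent-encodes both the "match[]" key and each selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base}/api/v1/series?{query}"


def selectors_from_url(url):
    """Decode the match[] parameters again (useful for checking the URL)."""
    return parse_qs(urlsplit(url).query)["match[]"]
```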
### Querying label values
The following endpoint returns a list of label values for a provided label name:
```
GET /api/v1/label/<label_name>/values
```
The `data` section of the JSON response is a list of string label values.
This example queries for all label values for the `job` label:
```
$ curl http://localhost:9090/api/v1/label/job/values
{
"status" : "success",
"data" : [
"node",
"prometheus"
]
}
```
## Deleting series
The following endpoint deletes matched series entirely from a Prometheus server:
```
DELETE /api/v1/series
```
URL query parameters:
- `match[]=<series_selector>`: Repeated series selector argument that selects the
series to delete. At least one `match[]` argument must be provided.
The `data` section of the JSON response has the following format:
```
{
"numDeleted": <number of deleted series>
}
```
The following example deletes all series that match either of the selectors
`up` or `process_start_time_seconds{job="prometheus"}`:
```
$ curl -XDELETE -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
{
"status" : "success",
"data" : {
"numDeleted" : 3
}
}
```
## Expression query result formats
Expression queries may return the following response values in the `result`
property of the `data` section. `<sample_value>` placeholders are numeric
sample values. JSON does not support special float values such as `NaN`, `Inf`,
and `-Inf`, so sample values are transferred as quoted JSON strings rather than
raw numbers.
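On the client side, the quoted sample values have to be converted back to numbers. In Python, `float` already understands the special values named above, so a sketch of the conversion is short (`parse_sample_value` is a hypothetical name):

```python
import math


def parse_sample_value(raw):
    """Convert a quoted sample value such as "1", "NaN" or "-Inf" to a float."""
    return float(raw)


def is_special(value):
    """Report whether a parsed value is NaN or an infinity."""
    return math.isnan(value) or math.isinf(value)
```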
### Range vectors
Range vectors are returned as result type `matrix`. The corresponding
`result` property has the following format:
```
[
{
"metric": { "<label_name>": "<label_value>", ... },
"values": [ [ <unix_time>, "<sample_value>" ], ... ]
},
...
]
```
### Instant vectors
Instant vectors are returned as result type `vector`. The corresponding
`result` property has the following format:
```
[
{
"metric": { "<label_name>": "<label_value>", ... },
"value": [ <unix_time>, "<sample_value>" ]
},
...
]
```
### Scalars
Scalar results are returned as result type `scalar`. The corresponding
`result` property has the following format:
```
[ <unix_time>, "<scalar_value>" ]
```
### Strings
String results are returned as result type `string`. The corresponding
`result` property has the following format:
```
[ <unix_time>, "<string_value>" ]
```
...@@ -9,7 +9,7 @@ sort_rank: 1 ...@@ -9,7 +9,7 @@ sort_rank: 1
Prometheus provides a functional expression language that lets the user select Prometheus provides a functional expression language that lets the user select
and aggregate time series data in real time. The result of an expression can and aggregate time series data in real time. The result of an expression can
either be shown as a graph, viewed as tabular data in Prometheus's expression either be shown as a graph, viewed as tabular data in Prometheus's expression
browser, or consumed by external systems via the HTTP API. browser, or consumed by external systems via the [HTTP API](/docs/querying/api/).
## Examples ## Examples
...@@ -84,6 +84,28 @@ For example, this selects all `http_requests_total` time series for `staging`, ...@@ -84,6 +84,28 @@ For example, this selects all `http_requests_total` time series for `staging`,
http_requests_total{environment=~"staging|testing|development",method!="GET"} http_requests_total{environment=~"staging|testing|development",method!="GET"}
Label matchers that match empty label values also select all time series that do
not have the specific label set at all.
Vector selectors must either specify a name or at least one label matcher
that does not match the empty string. The following expression is illegal:
{job=~".*"} # Bad!
In contrast, these expressions are valid as they both have a selector that does not
match empty label values.
{job=~".+"} # Good!
{job=~".*",method="get"} # Good!
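The rule can be expressed as a small check. The sketch below uses Python's `re` module as a stand-in for the real regular expression engine and assumes that regex matchers must match the whole label value; both are assumptions of this example, and the function names are made up.

```python
import re


def matches_empty(operator, value):
    """Report whether a single label matcher matches the empty string."""
    if operator == "=":
        return value == ""
    if operator == "!=":
        return value != ""
    if operator == "=~":
        return re.fullmatch(value, "") is not None
    if operator == "!~":
        return re.fullmatch(value, "") is None
    raise ValueError(f"unknown operator: {operator}")


def selector_is_valid(name, matchers):
    """A selector needs a name or a matcher that does not match "".

    matchers is a list of (label, operator, value) tuples.
    """
    if name:
        return True
    return any(not matches_empty(op, value) for _label, op, value in matchers)
```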
Label matchers can also be applied to metric names by matching against the internal
`__name__` label. For example, the expression `http_requests_total` is equivalent to
`{__name__="http_requests_total"}`. Matchers other than `=` (`!=`, `=~`, `!~`) may also be used.
The following expression selects all metrics that have a name starting with `job:`:
{__name__=~"^job:.*"}
### Range Vector Selectors ### Range Vector Selectors
Range vector literals work like instant vector literals, except that they Range vector literals work like instant vector literals, except that they
......
--- ---
title: Query language title: Querying
sort_rank: 3 sort_rank: 3
nav_icon: search nav_icon: search
--- ---
...@@ -193,3 +193,17 @@ If we are just interested in the total of HTTP requests we have seen in **all** ...@@ -193,3 +193,17 @@ If we are just interested in the total of HTTP requests we have seen in **all**
applications, we could simply write: applications, we could simply write:
sum(http_requests_total) sum(http_requests_total)
## Binary operator precedence
The following list shows the precedence of binary operators in Prometheus, from
lowest to highest.
1. `OR`
2. `AND`
3. `==`, `!=`, `<=`, `<`, `>=`, `>`
4. `+`, `-`
5. `*`, `/`, `%`
Operators on the same precedence level are left-associative. For example,
`2 * 3 % 2` is equivalent to `(2 * 3) % 2`.
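The effect of left-associativity in this example can be checked with ordinary arithmetic. Python happens to treat `*` and `%` as one left-associative precedence level as well, so it serves here as a stand-in calculator, not as an evaluator of Prometheus expressions.

```python
# Grouping from the left, as described above.
left_grouped = (2 * 3) % 2

# The alternative grouping gives a different result,
# which is why the associativity rule matters.
right_grouped = 2 * (3 % 2)

# Without parentheses, evaluation follows the left grouping.
unparenthesized = 2 * 3 % 2
```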
...@@ -17,14 +17,12 @@ process. The changes are only applied if all rule files are well-formatted. ...@@ -17,14 +17,12 @@ process. The changes are only applied if all rule files are well-formatted.
## Syntax-checking rules ## Syntax-checking rules
To quickly check whether a rule file is syntactically correct without starting To quickly check whether a rule file is syntactically correct without starting
a Prometheus server, install and run Prometheus's `rule_checker` tool: a Prometheus server, install and run Prometheus's `promtool` command-line
utility:
```bash ```bash
# If $GOPATH/github.com/prometheus/prometheus already exists, update it first: go get github.com/prometheus/prometheus/cmd/promtool
go get -u github.com/prometheus/prometheus promtool check-rules /path/to/example.rules
go install github.com/prometheus/prometheus/tools/rule_checker
rule_checker /path/to/example.rules
``` ```
When the file is syntactically valid, the checker prints a textual When the file is syntactically valid, the checker prints a textual
...@@ -54,7 +52,7 @@ Some examples: ...@@ -54,7 +52,7 @@ Some examples:
job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job) job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job)
# Drop or rewrite labels in the result time series: # Drop or rewrite labels in the result time series:
new_time series{label_to_change="new_value",label_to_drop=""} = old_time series new_time_series{label_to_change="new_value",label_to_drop=""} = old_time_series
Recording rules are evaluated at the interval specified by the Recording rules are evaluated at the interval specified by the
`evaluation_interval` field in the Prometheus configuration. During each `evaluation_interval` field in the Prometheus configuration. During each
......
...@@ -129,6 +129,9 @@ interpolated version of the given format string. To reference specific label ...@@ -129,6 +129,9 @@ interpolated version of the given format string. To reference specific label
values in the format string, use double curly braces: `{{label-name}}`. For values in the format string, use double curly braces: `{{label-name}}`. For
example: `{{host}} - cluster {{cluster}}`. example: `{{host}} - cluster {{cluster}}`.
Format strings support filters. See the Filters section below for a list of
currently available filters, expected inputs, and outputs.
### Link to graph ### Link to graph
The "Link to this graph" menu tab allows you to generate a link to a specific The "Link to this graph" menu tab allows you to generate a link to a specific
graph. This link will show the graph in a single-widget fullscreen view as it graph. This link will show the graph in a single-widget fullscreen view as it
...@@ -189,6 +192,24 @@ In the example of the host dashboard, the URL could look like this: ...@@ -189,6 +192,24 @@ In the example of the host dashboard, the URL could look like this:
http://promdash.somedomain.int/hoststats#!?var.host=myhost http://promdash.somedomain.int/hoststats#!?var.host=myhost
Template variables support filters. See the Filters section below for a list of
currently available filters, expected inputs, and outputs.
## Filters
Filters can be used in all places where variable interpolation is supported,
e.g. in legend format strings or template variables. The format is `{{variable
| filter}}` and the following filters are currently available:
- `toPercent`: Input: `0.5`; Output: `50%`
- `toPercentile`: Input: `0.5`; Output: `50th`
- `hostnameFqdn`: Input: `http://your-prometheus-endpoint.net:1111/`; Output: `your-prometheus-endpoint.net:1111`
- `hostname`: Input: `http://your-prometheus-endpoint.net:1111/`; Output: `your-prometheus-endpoint`
- `regex`: If `job` == `prometheus`, `{{job | regex:"pro":"faux"}}` => `fauxmetheus`
Filters are chainable, so `{{label | filter1 | filter2}}` will apply `filter1`
to `label`, and then apply `filter2` to that result.
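Chaining amounts to feeding each filter's output into the next one. The sketch below illustrates that idea in Python; it is not PromDash's implementation, and the function names as well as the exact behavior of the two example filters are assumptions based on the inputs and outputs listed above.

```python
import re


def to_percent(value):
    """Rough equivalent of `toPercent`: 0.5 becomes "50%"."""
    return f"{float(value) * 100:g}%"


def regex_filter(pattern, replacement):
    """Build a filter in the spirit of `regex:"pattern":"replacement"`."""
    return lambda value: re.sub(pattern, replacement, str(value))


def apply_filters(value, filters):
    """Apply each filter in order, passing the result on to the next."""
    for current_filter in filters:
        value = current_filter(value)
    return value
```

For instance, `apply_filters("prometheus", [regex_filter("pro", "faux"), str.upper])` first rewrites the value and then upper-cases the rewritten string.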
## Annotations ## Annotations
PromDash allows you to load timestamped annotations from an external service PromDash allows you to load timestamped annotations from an external service
......
...@@ -54,6 +54,7 @@ If functions are used in a pipeline, the pipeline value is passed as the last ar ...@@ -54,6 +54,7 @@ If functions are used in a pipeline, the pipeline value is passed as the last ar
| humanize | number | string | Converts a number to a more readable format, using [metric prefixes](http://en.wikipedia.org/wiki/Metric_prefix). | humanize | number | string | Converts a number to a more readable format, using [metric prefixes](http://en.wikipedia.org/wiki/Metric_prefix).
| humanize1024 | number | string | Like `humanize`, but uses 1024 as the base rather than 1000. | | humanize1024 | number | string | Like `humanize`, but uses 1024 as the base rather than 1000. |
| humanizeDuration | number | string | Converts a duration in seconds to a more readable format. | | humanizeDuration | number | string | Converts a duration in seconds to a more readable format. |
| humanizeTimestamp | number | string | Converts a Unix timestamp in seconds to a more readable format. |
Humanizing functions are intended to produce reasonable output for consumption Humanizing functions are intended to produce reasonable output for consumption
by humans, and are not guaranteed to return the same results between Prometheus by humans, and are not guaranteed to return the same results between Prometheus
......
...@@ -38,7 +38,7 @@ layout: jumbotron ...@@ -38,7 +38,7 @@ layout: jumbotron
<div class="col-md-3"> <div class="col-md-3">
<h2><i class="fa fa-warning"></i> Alerting</h2> <h2><i class="fa fa-warning"></i> Alerting</h2>
<p class="desc">Alerts are defined based on Prometheus's flexible query language and maintain dimensional information. An alertmanager handles notifications and silencing.</p> <p class="desc">Alerts are defined based on Prometheus's flexible query language and maintain dimensional information. An alertmanager handles notifications and silencing.</p>
<p><a class="btn btn-default" href="/docs/querying/rules/#alerting-rules" role="button">View details &raquo;</a></p> <p><a class="btn btn-default" href="/docs/alerting/rules/" role="button">View details &raquo;</a></p>
</div> </div>
<div class="col-md-3"> <div class="col-md-3">
<h2><i class="fa fa-cloud-upload"></i> Exporters</h2> <h2><i class="fa fa-cloud-upload"></i> Exporters</h2>
......