Commit c02fd856 authored by Brian Brazil

Merge pull request #104 from brian-brazil/alertmanager

Add initial documentation for the alertmanager.
parents 5c55f50a 067ef4a4
---
title: Alertmanager
sort_rank: 2
nav_icon: sliders
---
# Alertmanager
The Alertmanager receives alerts from one or more Prometheus servers.
It manages those alerts, including silencing, inhibition, aggregation and
sending out notifications via methods such as email, PagerDuty and HipChat.
**WARNING: The Alertmanager is still considered to be very experimental.**
## Configuration
The Alertmanager is configured via command-line flags and a configuration file.
The configuration file is an ASCII protocol buffer. To specify which
configuration file to load, use the `-config.file` flag.
```
./alertmanager -config.file alertmanager.conf
```
To send all alerts to email, set the `-notification.smtp.smarthost` flag to
an SMTP smarthost (such as a [Postfix null client](http://www.postfix.org/STANDARD_CONFIGURATION_README.html#null_client))
and use the following configuration:
```
notification_config {
  name: "alertmanager_test"
  email_config {
    email: "test@example.org"
  }
}

aggregation_rule {
  notification_config_name: "alertmanager_test"
}
```
### Filtering
An aggregation rule can be limited to a subset of alerts by using filters.
For example, to apply a rule only to alerts that have a `severity` label with the value `page`:
```
aggregation_rule {
  filter {
    name_re: "severity"
    value_re: "page"
  }
  notification_config_name: "alertmanager_test"
}
```
Multiple filters can be provided.
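
As an illustrative sketch, two filters could be combined as below. The `service` label name and its value are assumptions for this example, and presumably all filters have to match for the rule to apply:

```
aggregation_rule {
  filter {
    name_re: "severity"
    value_re: "page"
  }
  # hypothetical second filter; alerts presumably must match all filters
  filter {
    name_re: "service"
    value_re: "api-server"
  }
  notification_config_name: "alertmanager_test"
}
```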
### Repeat Rate
By default, an aggregation rule repeats notifications every 2 hours. This can be changed using `repeat_rate_seconds`.
```
aggregation_rule {
  repeat_rate_seconds: 3600
  notification_config_name: "alertmanager_test"
}
```
### Notifications
The Alertmanager has support for a growing number of notification methods.
Multiple notification methods of one or more types can be used in the same
notification config.

The `send_resolved` field can be used with all notification methods to enable or disable
sending a notification when an alert stops firing.
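
For example, a minimal sketch that enables resolved notifications for an email destination, assuming `send_resolved` is set inside the method-specific config block:

```
notification_config {
  name: "alertmanager_email_resolved"
  email_config {
    # assumed placement of send_resolved; consult the configuration schema
    send_resolved: true
    email: "test@example.org"
  }
}
```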
#### Email
The `-notification.smtp.smarthost` flag must be set to an SMTP smarthost.
The `-notification.smtp.sender` flag may be set to change the default From address.
```
notification_config {
  name: "alertmanager_email"
  email_config {
    email: "test@example.org"
  }
  email_config {
    email: "foo@example.org"
  }
}
```
Plain and CRAM-MD5 SMTP authentication methods are supported.
The `SMTP_AUTH_USERNAME`, `SMTP_AUTH_SECRET`, `SMTP_AUTH_PASSWORD` and
`SMTP_AUTH_IDENTITY` environment variables are used to configure them.
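
For example, the variables might be exported in the shell before starting the Alertmanager (the credentials shown are placeholders):

```
# placeholder credentials for plain SMTP authentication
export SMTP_AUTH_USERNAME="alertmanager"
export SMTP_AUTH_PASSWORD="password"
./alertmanager -config.file alertmanager.conf
```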
#### PagerDuty
The Alertmanager integrates as a [Generic API
Service](https://support.pagerduty.com/hc/en-us/articles/202830340-Creating-a-Generic-API-Service)
with PagerDuty.
```
notification_config {
  name: "alertmanager_pagerduty"
  pagerduty_config {
    service_key: "supersecretapikey"
  }
}
```
#### Pushover
```
notification_config {
  name: "alertmanager_pushover"
  pushover_config {
    token: "mypushovertoken"
    user_key: "mypushoverkey"
  }
}
```
#### HipChat
```
notification_config {
  name: "alertmanager_hipchat"
  hipchat_config {
    auth_token: "hipchatauthtoken"
    room_id: 123456
  }
}
```
#### Slack
```
notification_config {
  name: "alertmanager_slack"
  slack_config {
    webhook_url: "webhookurl"
    channel: "channelname"
  }
}
```
#### Flowdock
```
notification_config {
  name: "alertmanager_flowdock"
  flowdock_config {
    api_token: "4c7234902348234902384234234cdb59"
    from_address: "aliaswithgravatar@example.com"
    tag: "monitoring"
  }
}
```
#### Generic Webhook
The Alertmanager supports sending notifications as JSON to arbitrary
URLs. This could be used to perform automated actions when an
alert fires or integrate with a system that the Alertmanager does not support.
```
notification_config {
  name: "alertmanager_webhook"
  webhook_config {
    url: "http://example.org/my/hook"
  }
}
```
An example of the JSON message it sends is shown below.
```json
{
  "version": "1",
  "status": "firing",
  "alert": [
    {
      "summary": "summary",
      "description": "description",
      "labels": {
        "alertname": "TestAlert"
      },
      "payload": {
        "activeSince": "2015-06-01T12:55:47.356+01:00",
        "alertingRule": "ALERT TestAlert IF absent(metric_name) FOR 0y WITH ",
        "generatorURL": "http://localhost:9090/graph#%5B%7B%22expr%22%3A%22absent%28metric_name%29%22%2C%22tab%22%3A0%7D%5D",
        "value": "1"
      }
    }
  ]
}
```
This format is subject to change.
---
title: Alerting
sort_rank: 7
nav_icon: bell-o
---
---
title: Alerting Overview
sort_rank: 1
nav_icon: sliders
---
# Alerting Overview
Alerting with Prometheus is separated into two parts. Alerting rules in
Prometheus servers send alerts to an Alertmanager. The Alertmanager then
manages those alerts, including silencing, inhibition, aggregation and sending
out notifications via methods such as email, PagerDuty and HipChat.
**WARNING: The Alertmanager is still considered to be very experimental.**
The main steps to setting up alerting and notifications are:
* Set up and configure the Alertmanager
* Configure Prometheus to talk to the Alertmanager with the `-alertmanager.url` flag
* Create alerting rules in Prometheus
---
title: Alerting rules
sort_rank: 3
---
# Alerting rules
Alerting rules allow you to define alert conditions based on Prometheus
expression language expressions and to send notifications about firing alerts
to an external service. Whenever the alert expression results in one or more
vector elements at a given point in time, the alert counts as active for these
elements' label sets.
Alerting rules are configured in Prometheus in the same way as [recording
rules](../../querying/rules).
### Defining alerting rules
Alerting rules are defined in the following syntax:

```
ALERT <alert name>
  IF <expression>
  [FOR <duration>]
  WITH <label set>
  SUMMARY "<summary template>"
  DESCRIPTION "<description template>"
```
The optional `FOR` clause causes Prometheus to wait for a certain duration
between first encountering a new expression output vector element (like an
instance with a high HTTP error rate) and counting an alert as firing for this
element. Elements that are active, but not firing yet, are in pending state.
The `WITH` clause allows specifying a set of additional labels to be attached
to the alert. Any existing conflicting labels will be overwritten.
The `SUMMARY` should be a short, human-readable summary of the alert (suitable
for e.g. an email subject line), while the `DESCRIPTION` clause should provide
a longer description. Both string fields allow the inclusion of template
variables derived from the firing vector elements of the alert:

```
# To insert a firing element's label values:
{{$labels.<labelname>}}
# To insert the numeric expression value of the firing element:
{{$value}}
```
Examples:

```
# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
  }
  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."

# Alert for any instance that has a median request latency >1s.
ALERT ApiHighRequestLatency
  IF api_http_request_latencies_ms{quantile="0.5"} > 1000
  FOR 1m
  WITH {}
  SUMMARY "High request latency on {{$labels.instance}}"
  DESCRIPTION "{{$labels.instance}} has a median request latency above 1s (current value: {{$value}})"
```
### Inspecting alerts during runtime
To manually inspect which alerts are active (pending or firing), navigate to
the "Alerts" tab of your Prometheus instance. This will show you the exact
label sets for which each defined alert is currently active.
For pending and firing alerts, Prometheus also stores synthetic time series of
the form `ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}`.
The sample value is set to `1` as long as the alert is in the indicated active
(pending or firing) state, and a single `0` value gets written out when an alert
transitions from active to inactive state. Once inactive, the time series does
not get further updates.
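
For example, the following expression (using the `InstanceDown` alert from the examples above) selects the synthetic series for that alert while it is firing:

```
ALERTS{alertname="InstanceDown", alertstate="firing"}
```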
### Sending alert notifications
Prometheus's alerting rules are good at figuring out what is broken *right now*,
but they are not a fully-fledged notification solution. Another layer is needed
to add summarization, notification rate limiting, silencing and alert
dependencies on top of the simple alert definitions. In Prometheus's ecosystem,
the [Alertmanager](../alertmanager) takes on this
role. Thus, Prometheus may be configured to periodically send information about
alert states to an Alertmanager instance, which then takes care of dispatching
the right notifications. The Alertmanager instance may be configured via the
`-alertmanager.url` command line flag.
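
For example, a Prometheus server might be started with a flag such as the following (the Alertmanager address and port are assumptions for illustration):

```
# the Alertmanager address below is only an example
./prometheus -alertmanager.url http://localhost:9093
```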
...@@ -38,7 +38,7 @@ optional:
* a [push gateway](https://github.com/prometheus/pushgateway) for supporting short-lived jobs
* a [GUI-based dashboard builder](/docs/visualization/promdash/) based on Rails/SQL
* special-purpose [exporters](/docs/instrumenting/exporters/) (for HAProxy, StatsD, Ganglia, etc.)
* an (experimental) [alertmanager](https://github.com/prometheus/alertmanager)
* a [command-line querying tool](https://github.com/prometheus/prometheus_cli)
* various support tools
...
---
title: Best practices
sort_rank: 8
nav_icon: thumbs-o-up
---
---
title: Recording rules
sort_rank: 6
---
# Defining recording rules
## Configuring rules
Prometheus supports two types of rules which may be configured and then
evaluated at regular intervals: recording rules and [alerting
rules](../../alerting/rules). To include rules in Prometheus, create a file
containing the necessary rule statements and have Prometheus load the file via
the `rule_files` field in the [Prometheus configuration](/docs/operating/configuration).
The rule files can be reloaded at runtime by sending `SIGHUP` to the Prometheus
process. The changes are only applied if all rule files are well-formatted.
...@@ -62,81 +62,3 @@ evaluation cycle, the right-hand-side expression of the rule statement is
evaluated at the current instant in time and the resulting sample vector is
stored as a new set of time series with the current timestamp and a new metric
name (and perhaps an overridden set of labels).