Commit cb0080d4 authored by Fabian Reinartz

Document and overview for the new AM

parent a3986a81
@@ -6,196 +6,101 @@ nav_icon: sliders
# Alertmanager
The Alertmanager receives alerts from one or more Prometheus servers.
It manages those alerts, including silencing, inhibition, aggregation and
sending out notifications via methods such as email, PagerDuty and HipChat.
The Alertmanager handles alerts sent by client applications such as the
Prometheus server. It takes care of deduplicating, grouping, and routing
them to the correct receiver integration such as email, PagerDuty, or OpsGenie.
It also takes care of silencing and inhibition of alerts.
**WARNING: The Alertmanager is still considered to be very experimental.**
The following describes the core concepts the Alertmanager implements. Consult
the [configuration documentation](../configuration) to learn how to use them
in more detail.
## Configuration
## Grouping
The Alertmanager is configured via command-line flags and a configuration file.
Grouping categorizes alerts of similar nature into a single notification. This
is especially useful during larger outages when many systems fail at once and
hundreds to thousands of alerts may be firing simultaneously.
The configuration file is an ASCII protocol buffer. To specify which
configuration file to load, use the `-config.file` flag.
**Example:** Dozens or hundreds of instances of a service are running in your
cluster when a network partition occurs. Half of your service instances
can no longer reach the database.
Alerting rules in Prometheus were configured to send an alert for each service
instance if it cannot communicate with the database. As a result, hundreds of
alerts are sent to Alertmanager.
```
./alertmanager -config.file alertmanager.conf
```
To send all alerts to email, set the `-notification.smtp.smarthost` flag to
an SMTP smarthost (such as a [Postfix null client](http://www.postfix.org/STANDARD_CONFIGURATION_README.html#null_client))
and use the following configuration:
```
notification_config {
  name: "alertmanager_test"
  email_config {
    email: "test@example.org"
  }
}

aggregation_rule {
  notification_config_name: "alertmanager_test"
}
```
### Filtering
An aggregation rule can be made to apply to only some alerts using a filter.
For example, to apply a rule only to alerts with a `severity` label with the value `page`:
```
aggregation_rule {
  filter {
    name_re: "severity"
    value_re: "page"
  }
  notification_config_name: "alertmanager_test"
}
```
Multiple filters can be provided.
### Repeat Rate
By default an aggregation rule will repeat notifications every 2 hours. This can be changed using `repeat_rate_seconds`.
```
aggregation_rule {
  repeat_rate_seconds: 3600
  notification_config_name: "alertmanager_test"
}
```
### Notifications
The Alertmanager has support for a growing number of notification methods.
Multiple notification methods of one or more types can be used in the same
notification config.
The `send_resolved` field can be used with all notification methods to enable or disable
sending notifications that an alert has stopped firing.
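For example, a minimal sketch of enabling resolved notifications for an email
destination could look like the following; the placement of `send_resolved`
inside the method's config block is assumed:
```
notification_config {
  name: "alertmanager_test"
  email_config {
    email: "test@example.org"
    # Also send a notification once the alert has resolved (assumed placement).
    send_resolved: true
  }
}
```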
#### Email
As a user, one only wants to get a single page while still being able to see
exactly which service instances were affected. Thus, Alertmanager can be
configured to group alerts by their cluster and alertname so that it sends a
single compact notification.
The `-notification.smtp.smarthost` flag must be set to an SMTP smarthost.
The `-notification.smtp.sender` flag may be set to change the default From address.
Grouping of alerts, timing for the grouped notifications, and the receivers
of those notifications are configured by a routing tree in the configuration
file.
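As a rough sketch, a routing tree that groups by cluster and alert name could
look like the following in the new YAML-based configuration. The field names
(`group_by`, `receiver`, `email_configs`, `to`) and the receiver name are
assumptions based on the current configuration format and may change:
```
route:
  # Collapse all alerts that share the same cluster and alertname
  # into one grouped notification.
  group_by: ['cluster', 'alertname']
  receiver: 'team-email'

receivers:
- name: 'team-email'
  email_configs:
  - to: 'team@example.org'
```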
```
notification_config {
  name: "alertmanager_email"
  email_config {
    email: "test@example.org"
  }
  email_config {
    email: "foo@example.org"
  }
}
```
## Inhibition
Plain and CRAM-MD5 SMTP authentication methods are supported.
The `SMTP_AUTH_USERNAME`, `SMTP_AUTH_SECRET`, `SMTP_AUTH_PASSWORD` and
`SMTP_AUTH_IDENTITY` environment variables are used to configure them.
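For instance, a hypothetical invocation that supplies plain-auth credentials
via the environment could look like this:
```
SMTP_AUTH_USERNAME="alertmanager" \
SMTP_AUTH_PASSWORD="secretpassword" \
./alertmanager -config.file alertmanager.conf
```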
Inhibition is a concept of suppressing notifications for certain alerts if
certain other alerts are already firing.
#### PagerDuty
**Example:** An alert is firing that informs us that an entire cluster is not
reachable. Alertmanager can be configured to mute all other alerts concerning
this cluster if that particular alert is firing.
This prevents notifications for hundreds or thousands of alerts that are
unrelated to the actual issue.
The Alertmanager integrates as a [Generic API
Service](https://support.pagerduty.com/hc/en-us/articles/202830340-Creating-a-Generic-API-Service)
with PagerDuty.
Inhibitions are configured through the Alertmanager's configuration file.
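As a sketch, an inhibit rule for the cluster example above could look roughly
like the following, assuming the new YAML configuration with `source_match`,
`target_match`, and `equal` fields; the alert name and severity values are
hypothetical:
```
inhibit_rules:
- source_match:
    alertname: 'ClusterUnreachable'
  target_match:
    severity: 'page'
  # Only mute alerts that carry the same cluster label as the firing source alert.
  equal: ['cluster']
```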
```
notification_config {
  name: "alertmanager_pagerduty"
  pagerduty_config {
    service_key: "supersecretapikey"
  }
}
```
## Silences
#### Pushover
```
notification_config {
  name: "alertmanager_pushover"
  pushover_config {
    token: "mypushovertoken"
    user_key: "mypushoverkey"
  }
}
```
Silences are a straightforward way to simply mute alerts for a given time.
A silence is configured based on matchers, just like the routing tree. Incoming
alerts are checked whether they match all the equality or regular expression
matchers of an active silence.
If they do, no notifications will be sent out for that alert.
#### HipChat
```
notification_config {
  name: "alertmanager_hipchat"
  hipchat_config {
    auth_token: "hipchatauthtoken"
    room_id: 123456
  }
}
```
Silences are configured in the web interface of the Alertmanager.
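Conceptually, a silence is a set of matchers plus a time window. A rough JSON
sketch is shown below; the concrete field names are assumptions, not a
documented API:
```json
{
  "matchers": [
    {"name": "cluster", "value": "us-east-1", "isRegex": false},
    {"name": "alertname", "value": "Latency.*", "isRegex": true}
  ],
  "startsAt": "2015-10-01T12:00:00Z",
  "endsAt": "2015-10-01T14:00:00Z",
  "comment": "Planned maintenance of the us-east-1 database"
}
```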
#### Slack
```
notification_config {
  name: "alertmanager_slack"
  slack_config {
    webhook_url: "webhookurl"
    channel: "channelname"
  }
}
```
#### Flowdock
## Sending alerts
```
notification_config {
  name: "alertmanager_flowdock"
  flowdock_config {
    api_token: "4c7234902348234902384234234cdb59"
    from_address: "aliaswithgravatar@example.com"
    tag: "monitoring"
  }
}
```
#### Generic Webhook
__Prometheus automatically takes care of sending alerts generated by its
configured [alerting rules](../rules). The following is general documentation for clients.__
The Alertmanager supports sending notifications as JSON to arbitrary
URLs. This could be used to perform automated actions when an
alert fires or integrate with a system that the Alertmanager does not support.
The Alertmanager listens for alerts on an API endpoint at `/api/v1/alerts`.
Clients are expected to continuously re-send alerts as long as they are still
active (usually on the order of 30 seconds to 3 minutes).
Clients can push a list of alerts to that endpoint via a POST request in the
format shown further below.
```
notification_config {
  name: "alertmanager_webhook"
  webhook_config {
    url: "http://example.org/my/hook"
  }
}
```
An example of the JSON message it sends is shown below.
```json
{
  "version": "1",
  "status": "firing",
  "alert": [
    {
      "summary": "summary",
      "description": "description",
      "labels": {
        "alertname": "TestAlert"
      },
      "payload": {
        "activeSince": "2015-06-01T12:55:47.356+01:00",
        "alertingRule": "ALERT TestAlert IF absent(metric_name) FOR 0y WITH ",
        "generatorURL": "http://localhost:9090/graph#%5B%7B%22expr%22%3A%22absent%28metric_name%29%22%2C%22tab%22%3A0%7D%5D",
        "value": "1"
      }
    }
  ]
}
```

The list of alerts pushed to the `/api/v1/alerts` endpoint has the following format:

```json
[
  {
    "labels": {
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>",
      ...
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
```
This format is subject to change.
The labels are used to identify identical instances of an alert and to perform
deduplication. The annotations are always set to those received most recently
and do not identify an alert.
Both timestamps are optional. If `startsAt` is omitted, the current time
is assigned by the Alertmanager. `endsAt` is only set if the end time of an
alert is known. Otherwise it is set to a configurable timeout period after
the time the alert was last received.
The `generatorURL` field is a unique back-link which identifies the causing
entity of this alert in the client.
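As a sketch, alerts could be pushed with any HTTP client, for example curl.
The port 9093 and all label and annotation values below are placeholders and
assumptions:
```
curl -XPOST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "page"
    },
    "annotations": {
      "summary": "This is only a test"
    },
    "generatorURL": "http://localhost:9090/graph"
  }
]'
```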
Alertmanager also supports a legacy endpoint on `/api/alerts` which is
compatible with Prometheus versions 0.16 and lower.