Document and overview for the new AM

cb0080d4 · Fabian Reinartz · a3986a81 · cb0080d4
Commit cb0080d4 authored Dec 23, 2015 by Fabian Reinartz

Showing with 98 additions and 193 deletions

alertmanager.md content/docs/alerting/alertmanager.md +98 -193

--- a/content/docs/alerting/alertmanager.md
+++ b/content/docs/alerting/alertmanager.md
@@ -6,196 +6,101 @@ nav_icon: sliders

 # Alertmanager

-The Alertmanager receives alerts from one or more Prometheus servers.
-It manages those alerts, including silencing, inhibition, aggregation and
-sending out notifications via methods such as email, PagerDuty and HipChat.
-
-**WARNING: The Alertmanager is still considered to be very experimental.**
-
-## Configuration
-
-The Alertmanager is configured via command-line flags and a configuration file.
-
-The configuration file is an ASCII protocol buffer. To specify which
-configuration file to load, use the `-config.file` flag.
-
-```
-./alertmanager -config.file alertmanager.conf
-```
-
-To send all alerts to email, set the `-notification.smtp.smarthost` flag to
-an SMTP smarthost (such as a [Postfix null client](http://www.postfix.org/STANDARD_CONFIGURATION_README.html#null_client)) 
-and use the following configuration:
-
-```
-notification_config {
-  name: "alertmanager_test"
-  email_config {
-    email: "test@example.org"
-  }
-}
-
-aggregation_rule {
-  notification_config_name: "alertmanager_test"
-}
-```
-
-### Filtering
-
-An aggregation rule can be made to apply to only some alerts using a filter.
-
-For example, to apply a rule only to alerts with a `severity` label with the value `page`:
-
-```
-aggregation_rule {
-  filter {
-    name_re: "severity"
-    value_re: "page"
-  }
-  notification_config_name: "alertmanager_test"
-}
-```
-
-Multiple filters can be provided.
-
-### Repeat Rate
-By default an aggregation rule will repeat notifications every 2 hours. This can be changed using `repeat_rate_seconds`.
-
-```
-aggregation_rule {
-  repeat_rate_seconds: 3600
-  notification_config_name: "alertmanager_test"
-}
-```
-
-### Notifications
-
-The Alertmanager has support for a growing number of notification methods.
-Multiple notifications methods of one or more types can be used in the same
-notification config.
-
-The `send_resolved` field can be used with all notification methods to enable or disable
-sending notifications that an alert has stopped firing.
-
-#### Email
-
-The `-notification.smtp.smarthost` flag must be set to an SMTP smarthost.
-The `-notification.smtp.sender` flag may be set to change the default From address.
-
-```
-notification_config {
-  name: "alertmanager_email"
-  email_config {
-    email: "test@example.org"
-  }
-  email_config {
-    email: "foo@example.org"
-  }
-}
-```
-
-Plain and CRAM-MD5 SMTP authentication methods are supported.
-The `SMTP_AUTH_USERNAME`, `SMTP_AUTH_SECRET`, `SMTP_AUTH_PASSWORD` and
-`SMTP_AUTH_IDENTITY` environment variables are used to configure them.
-
-#### PagerDuty
-
-The Alertmanager integrates as a [Generic API
-Service](https://support.pagerduty.com/hc/en-us/articles/202830340-Creating-a-Generic-API-Service)
-with PagerDuty.
-
-```
-notification_config {
-  name: "alertmanager_pagerduty"
-  pagerduty_config {
-    service_key: "supersecretapikey"
-  }
-}
-```
-
-#### Pushover
-```
-notification_config {
-  name: "alertmanager_pushover"
-  pushover_config {
-    token: "mypushovertoken"
-    user_key: "mypushoverkey"
-  }
-}
-```
-
-#### HipChat
-```
-notification_config {
-  name: "alertmanager_hipchat"
-  hipchat_config {
-    auth_token: "hipchatauthtoken"
-    room_id: 123456
-  }
-}
-```
-
-#### Slack
-```
-notification_config {
-  name: "alertmanager_slack"
-  slack_config {
-    webhook_url: "webhookurl"
-    channel: "channelname"
-  }
-}
-```
-
-#### Flowdock
-
-```
-notification_config {
-  name: "alertmanager_flowdock"
-  flowdock_config {
-    api_token: "4c7234902348234902384234234cdb59"
-    from_address: "aliaswithgravatar@example.com"
-    tag: "monitoring"
-  }
-}
-```
-
-#### Generic Webhook
-
-The Alertmanager supports sending notifications as JSON to arbitrary
-URLs. This could be used to perform automated actions when an
-alert fires or integrate with a system that the Alertmanager does not support.
-
-```
-notification_config {
-  name: "alertmanager_webhook"
-  webhook_config {
-    url: "http://example.org/my/hook"
-  }
-}
-```
-
-An example of JSON message it sends is below.
-
-```json
-{
-   "version": "1",
-   "status": "firing",
-   "alert": [
-      {
-         "summary": "summary",
-         "description": "description",
-         "labels": {
-            "alertname": "TestAlert"
-         },
-         "payload": {
-            "activeSince": "2015-06-01T12:55:47.356+01:00",
-            "alertingRule": "ALERT TestAlert IF absent(metric_name) FOR 0y WITH ",
-            "generatorURL": "http://localhost:9090/graph#%5B%7B%22expr%22%3A%22absent%28metric_name%29%22%2C%22tab%22%3A0%7D%5D",
-            "value": "1"
-         }
-      }
-   ]
-}
-```
-
-This format is subject to change.
+The Alertmanager handles alerts sent by client applications such as the 
+Prometheus server. It takes care of deduplicating, grouping, and routing
+them to the correct receiver integration such as email, PagerDuty, or OpsGenie.
+It also takes care of silencing and inhibition of alerts.
+
+The following describes the core concepts the Alertmanager implements. Consult
+the [configuration documentation](../configuration) to learn how to use them
+in more detail.
+
+## Grouping
+
+Grouping categorizes alerts of similar nature into a single notification. This
+is especially useful during larger outages when many systems fail at once and
+hundreds to thousands of alerts may be firing simultaneously.
+
+**Example:** Dozens or hundreds of instances of a service are running in your
+cluster when a network partition occurs. Half of your service instances
+can no longer reach the database.
+Alerting rules in Prometheus were configured to send an alert for each service
+instance if it cannot communicate with the database. As a result hundreds of
+alerts are sent to Alertmanager.
+
+As a user one only wants to get a single page while still being able to see
+exactly which service instances were affected. Thus one can configure
+Alertmanager to group alerts by their cluster and alertname so it sends a
+single compact notification.
+
+Grouping of alerts, timing for the grouped notifications, and the receivers
+of those notifications are configured by a routing tree in the configuration
+file.
+ 
+## Inhibition
+
+Inhibition is the concept of suppressing notifications for certain alerts if
+certain other alerts are already firing.
+
+**Example:** An alert is firing that informs that an entire cluster is not
+reachable. Alertmanager can be configured to mute all other alerts concerning
+this cluster if that particular alert is firing.
+This prevents notifications for hundreds or thousands of firing alerts that
+are unrelated to the actual issue.
+
+Inhibitions are configured through the Alertmanager's configuration file.
+
+## Silences
+
+Silences are a straightforward way to mute alerts for a given time.
+A silence is configured based on matchers, just like the routing tree. Incoming
+alerts are checked against all the equality or regular expression
+matchers of an active silence.
+If they match, no notifications will be sent out for that alert.
+
+Silences are configured in the web interface of the Alertmanager.
+
+
+## Sending alerts
+
+__Prometheus automatically takes care of sending alerts generated by its
+configured [alerting rules](../rules). The following is general documentation for clients.__
+
+The Alertmanager listens for alerts on an API endpoint at `/api/v1/alerts`.
+Clients are expected to continuously re-send alerts as long as they are still
+active (usually on the order of 30 seconds to 3 minutes).
+Clients can push a list of alerts to that endpoint via a POST request of
+the following format:
+
+```
+[
+  {
+    "labels": {
+      "<labelname>": "<labelvalue>",
+      ...
+    },
+    "annotations": {
+      "<labelname>": "<labelvalue>"
+    },
+    "startsAt": "<rfc3339>",
+    "endsAt": "<rfc3339>",
+    "generatorURL": "<generator_url>"
+  },
+  ...
+]
+```
+
+The labels are used to identify identical instances of an alert and to perform
+deduplication. The annotations are always set to those received most recently
+and do not identify an alert.
+
+Both timestamps are optional. If `startsAt` is omitted, the current time
+is assigned by the Alertmanager. `endsAt` is only set if the end time of an
+alert is known. Otherwise it will be set to a configurable timeout period
+after the time the alert was last received.
+
+The `generatorURL` field is a unique back-link which identifies the causing
+entity of this alert in the client. 
+
+Alertmanager also supports a legacy endpoint on `/api/alerts` which is
+compatible with Prometheus versions 0.16 and lower.
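
For readers of this change, the POST format documented in the new "Sending alerts" section can be sketched from a client's point of view as follows. This is a minimal illustration, not part of the commit: the helper names `build_alert` and `build_request`, the `Content-Type` header, and the `http://localhost:9093` address are assumptions made for the example. It only constructs the request and leaves the actual send (and the periodic re-send the section asks for) to the caller.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_alert(labels, annotations=None, starts_at=None, ends_at=None,
                generator_url=None):
    """Return one alert object in the documented format (hypothetical helper)."""
    alert = {"labels": dict(labels), "annotations": dict(annotations or {})}
    # Both timestamps are optional, so they are only included when known.
    # Pass timezone-aware datetimes so isoformat() emits a UTC offset.
    if starts_at is not None:
        alert["startsAt"] = starts_at.isoformat()
    if ends_at is not None:
        alert["endsAt"] = ends_at.isoformat()
    if generator_url is not None:
        alert["generatorURL"] = generator_url
    return alert


def build_request(base_url, alerts):
    """Build a POST request carrying a list of alerts to /api/v1/alerts."""
    body = json.dumps(alerts).encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/api/v1/alerts",
        data=body,
        # Assumption: the documentation does not state a required content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = build_alert(
    {"alertname": "DBUnreachable", "cluster": "eu-1"},
    {"summary": "instance cannot reach the database"},
    starts_at=datetime(2015, 12, 23, 12, 0, tzinfo=timezone.utc),
)
# Assumed address; pass the request to urllib.request.urlopen to send it.
request = build_request("http://localhost:9093", [alert])
```

Because the labels identify an alert, re-sending the same labels with updated annotations refreshes the existing alert rather than creating a new one.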