---
title: Practical Anomaly Detection
created_at: 2015-06-18
kind: article
author_name: Brian Brazil
---
In his *[Open Letter To Monitoring/Metrics/Alerting Companies](http://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts/)*,
John Allspaw asserts that attempting "to detect anomalies perfectly, at the right time, is not possible".

I have seen several attempts by talented engineers to build systems to
automatically detect and diagnose problems based on time series data. While it
is certainly possible to get a demonstration working, the data always turned
out to be too noisy to make this approach work for anything but the simplest of
real-world systems.

All hope is not lost though. There are many common anomalies which you can
detect and handle with custom-built rules. The Prometheus [query
language](../../../../../docs/querying/basics/) gives you the tools to discover
these anomalies while avoiding false positives.

## Building a query
A common problem within a service is a small number of servers not performing
as well as the rest, such as responding with increased latency.

Let us say that we have a metric `instance:latency_seconds:mean5m` representing the
average query latency for each instance of a service, calculated via a
[recording rule](/docs/querying/rules/) from a
[Summary](/docs/concepts/metric_types/#summary) metric.
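For example, assuming the service exposes a Summary metric called
`latency_seconds` (an illustrative name, not one from any particular service),
such a recording rule might look roughly like this:
```
instance:latency_seconds:mean5m =
  rate(latency_seconds_sum[5m]) / rate(latency_seconds_count[5m])
```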

A simple way to start would be to look for instances with a latency
more than two standard deviations above the mean:
```
  instance:latency_seconds:mean5m
> on (job) group_left(instance)
  (
    avg by (job)(instance:latency_seconds:mean5m)
  + on (job)
    2 * stddev by (job)(instance:latency_seconds:mean5m)
  )
```
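Here `on (job)` makes the comparison match series on the `job` label alone, and
`group_left(instance)` permits the resulting many-to-one match, so each
instance's latency is compared against its job's average plus two standard
deviations and is returned as a separate result per instance.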

You try this out and discover that there are false positives when
the latencies are very tightly clustered. So you add a requirement
that the instance latency must also be more than 20% above the average:
```
  (
      instance:latency_seconds:mean5m
    > on (job) group_left(instance)
      (
        avg by (job)(instance:latency_seconds:mean5m)
      + on (job)
        2 * stddev by (job)(instance:latency_seconds:mean5m)
      )
  )
> on (job) group_left(instance)
  1.2 * avg by (job)(instance:latency_seconds:mean5m)
```
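Comparisons in the query language act as filters, returning the left-hand
series that satisfy them with their original values, so chaining the two
comparisons means an instance is only returned when both conditions hold.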

Finally, you find that false positives tend to happen at low traffic levels.
You add a requirement that there is enough traffic for more than one query per
second, on average, to be going to each instance. You create an alert
definition for all of this:
```
ALERT InstanceLatencyOutlier
  IF
    (
        instance:latency_seconds:mean5m
      > on (job) group_left(instance)
        (
          avg by (job)(instance:latency_seconds:mean5m)
        + on (job)
          2 * stddev by (job)(instance:latency_seconds:mean5m)
        )
    )
  > on (job) group_left(instance)
    1.2 * avg by (job)(instance:latency_seconds:mean5m)
  and on (job)
    avg by (job)(instance:latency_seconds_count:rate5m)
    >
    1
  FOR 30m
  SUMMARY "{{$labels.instance}} in {{$labels.job}} is a latency outlier"
  DESCRIPTION "{{$labels.instance}} has latency of {{humanizeDuration $value}}"
```
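The traffic condition relies on a second recording rule,
`instance:latency_seconds_count:rate5m`. Assuming the same illustrative
`latency_seconds` Summary as above, it could be defined along these lines:
```
instance:latency_seconds_count:rate5m =
  rate(latency_seconds_count[5m])
```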

## Automatic actions
The above alert can feed into the
[Alertmanager](/docs/alerting/alertmanager/), and from there to
your chat, ticketing, or paging systems. After a while you might discover that the
usual cause of the alert is something for which there is no proper fix, but which an
automated action such as a restart, reboot, or machine replacement resolves.

Rather than having humans handle this repetitive task, one option is to
get the Alertmanager to send the alert to a web service that will perform
the action with appropriate throttling and safety features.
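For example, such a service might refuse to act on more than one instance of a
job at a time, or on more than a few per hour: if many instances look like
outliers simultaneously, the problem is unlikely to lie with the individual
machines.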

The [generic webhook](/docs/alerting/alertmanager/#generic-webhook)
sends alert notifications to an HTTP endpoint of your choice. A simple Alertmanager
configuration that uses it could look like this:
```
# A simple notification configuration which only sends alert notifications to
# an external webhook.
notification_config {
  name: "restart_webhook"
  webhook_config {
    url: "http://example.org/my/hook"
  }
}

# An aggregation rule which matches all alerts with the label
# alertname="InstanceLatencyOutlier" and sends them using the "restart_webhook"
# notification configuration.
aggregation_rule {
  filter {
    name_re: "alertname"
    value_re: "InstanceLatencyOutlier"
  }
  notification_config_name: "restart_webhook"
}
```

## Summary
The Prometheus query language allows for rich processing of your monitoring
data. This lets you create alerts with good signal-to-noise ratios, and the
Alertmanager's generic webhook support can trigger automatic remediations.
This all combines to enable on-call engineers to focus on problems where they
can have the most impact.

When defining alerts for your services, see also our [alerting best practices](http://prometheus.io/docs/practices/alerting/).