Commit 1b400e9b, authored Jun 17, 2015 by Brian Brazil (parent c4b75048)
Add a blog post on detecting and dealing with outliers.
1 changed file with 138 additions and 0 deletions: content/blog/2015-06-17-practical-anomaly-detection.md (new file)
---
title: Practical Anomaly Detection
created_at: 2015-06-18
kind: article
author_name: Brian Brazil
---
In his *[Open Letter To Monitoring/Metrics/Alerting Companies](http://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts/)*, John Allspaw asserts that attempting "to detect anomalies perfectly, at the right time, is not possible".
I have seen several attempts by talented engineers to build systems to
automatically detect and diagnose problems based on time series data. While it
is certainly possible to get a demonstration working, the data always turned
out to be too noisy to make this approach work for anything but the simplest of
real-world systems.
All hope is not lost though. There are many common anomalies which you can
detect and handle with custom-built rules. The Prometheus
[query language](../../../../../docs/querying/basics/) gives you the tools to discover
these anomalies while avoiding false positives.
## Building a query
A common problem within a service is when a small number of servers are not
performing as well as the rest, such as responding with increased latency.
Let us say that we have a metric `instance:latency_seconds:mean5m` representing the
average query latency for each instance of a service, calculated via a
[recording rule](/docs/querying/rules/) from a
[Summary](/docs/concepts/metric_types/#summary) metric.
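For concreteness, such recording rules might look something like the following sketch, written in the rule syntax in use at the time of this post. The underlying Summary is assumed here to be called `latency_seconds` (a Summary exposes `latency_seconds_sum` and `latency_seconds_count` series); the second rule precomputes the per-instance request rate that the alert further down relies on:

```
instance:latency_seconds:mean5m = rate(latency_seconds_sum[5m]) / rate(latency_seconds_count[5m])
instance:latency_seconds_count:rate5m = rate(latency_seconds_count[5m])
```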
A simple way to start would be to look for instances with a latency
more than two standard deviations above the mean:
```
instance:latency_seconds:mean5m
  > on (job) group_left(instance)
    (
      avg by (job)(instance:latency_seconds:mean5m)
      + on (job)
        2 * stddev by (job)(instance:latency_seconds:mean5m)
    )
```
You try this out and discover that there are false positives when
the latencies are very tightly clustered. So you add a requirement
that the instance latency also has to be 20% above the average:
```
(
  instance:latency_seconds:mean5m
    > on (job) group_left(instance)
      (
        avg by (job)(instance:latency_seconds:mean5m)
        + on (job)
          2 * stddev by (job)(instance:latency_seconds:mean5m)
      )
)
  > on (job) group_left(instance)
    1.2 * avg by (job)(instance:latency_seconds:mean5m)
```
Finally, you find that false positives tend to happen at low traffic levels.
You add a requirement for there to be enough traffic for 1 query per second to
be going to each instance. You create an alert definition for all of this:
```
ALERT InstanceLatencyOutlier
  IF
    (
      instance:latency_seconds:mean5m
        > on (job) group_left(instance)
          (
            avg by (job)(instance:latency_seconds:mean5m)
            + on (job)
              2 * stddev by (job)(instance:latency_seconds:mean5m)
          )
    )
      > on (job) group_left(instance)
        1.2 * avg by (job)(instance:latency_seconds:mean5m)
    and on (job)
      avg by (job)(instance:latency_seconds_count:rate5m) > 1
  FOR 30m
  SUMMARY "{{$labels.instance}} in {{$labels.job}} is a latency outlier"
  DESCRIPTION "{{$labels.instance}} has latency of {{humanizeDuration $value}}"
```
## Automatic actions
The above alert can feed into the
[Alertmanager](/docs/alerting/alertmanager/), and from there to
your chat, ticketing, or paging systems. After a while you might discover that the
usual cause of the alert is something for which there is no proper fix, but there is an
automated action such as a restart, reboot, or machine replacement that resolves
the issue.
Rather than having humans handle this repetitive task, one option is to
get the Alertmanager to send the alert to a web service that will perform
the action with appropriate throttling and safety features.
The [generic webhook](/docs/alerting/alertmanager/#generic-webhook)
sends alert notifications to an HTTP endpoint of your choice. A simple Alertmanager
configuration that uses it could look like this:
```
# A simple notification configuration which only sends alert notifications to
# an external webhook.
notification_config {
  name: "restart_webhook"
  webhook_config {
    url: "http://example.org/my/hook"
  }
}

# An aggregation rule which matches all alerts with the label
# alertname="InstanceLatencyOutlier" and sends them using the "restart_webhook"
# notification configuration.
aggregation_rule {
  filter {
    name_re: "alertname"
    value_re: "InstanceLatencyOutlier"
  }
  notification_config_name: "restart_webhook"
}
```
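On the receiving end, the web service behind `http://example.org/my/hook` could be as small as the following sketch (a hypothetical illustration, not taken from the post): it accepts the webhook's JSON notifications, pulls out each alert's `instance` label, and triggers a restart action while throttling repeated restarts of the same instance. The notification payload layout varies between Alertmanager versions, so the parsing here is deliberately defensive:

```
# Hypothetical webhook receiver sketch: restart an instance when a
# notification arrives, at most once per hour per instance.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

THROTTLE_SECONDS = 3600   # safety: at most one restart per instance per hour
last_action = {}          # instance -> timestamp of the last restart


def restart_instance(instance):
    # Placeholder for the real remediation (deployment API call, reboot, ...).
    print("would restart", instance)


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Assumed payload shape: a list of alerts, each carrying labels that
        # include "instance". Adjust to your Alertmanager's actual format.
        for alert in body.get("alert", []):
            instance = alert.get("labels", {}).get("instance")
            now = time.time()
            if instance and now - last_action.get(instance, 0) > THROTTLE_SECONDS:
                last_action[instance] = now
                restart_instance(instance)
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8000), HookHandler).serve_forever()
```

A real implementation would also want authentication, audit logging, and a global limit on how many instances may be restarted at once, in addition to the per-instance throttle.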
## Summary
The Prometheus query language allows for rich processing of your monitoring
data. This lets you create alerts with good signal-to-noise ratios, and the
Alertmanager's generic webhook support can trigger automatic remediations.
This all combines to enable oncall engineers to focus on problems where they can
have the most impact.
When defining alerts for your services, see also our
[alerting best practices](http://prometheus.io/docs/practices/alerting/).