We recommend that you read [My Philosophy on Alerting](https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit)
based on Rob Ewaschuk's observations at Google.

To summarize, keep alerting simple, alert on symptoms, have good consoles
to allow pinpointing causes and avoid having pages where there is nothing to
do.

...

Aim to have as few alerts as possible, by alerting on symptoms that are
associated with end-user pain rather than trying to catch every possible way
that pain could be caused. Alerts should link to relevant consoles
and make it easy to figure out which component is at fault.

Allow for slack in alerting to accommodate small blips.
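
One way to picture slack for small blips is to require a breach to persist across several consecutive samples before paging. The sketch below is illustrative only: the function name `should_page`, the 5% threshold, and the three-sample window are assumptions made for this example, not values from this guide.

```python
from typing import Sequence


def should_page(error_ratios: Sequence[float], threshold: float,
                min_consecutive: int) -> bool:
    """Return True only if the latest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(error_ratios) < min_consecutive:
        # Not enough history yet to call the breach sustained.
        return False
    recent = error_ratios[-min_consecutive:]
    return all(sample > threshold for sample in recent)


# Hypothetical error-ratio samples, oldest first.
blip = [0.001, 0.002, 0.200, 0.001, 0.002]       # one short spike
sustained = [0.001, 0.060, 0.070, 0.080, 0.090]  # breach that persists

print(should_page(blip, threshold=0.05, min_consecutive=3))       # False
print(should_page(sustained, threshold=0.05, min_consecutive=3))  # True
```

The trade-off is detection delay: a longer window suppresses more blips but pages later for a real outage.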

### Online serving systems

Typically alert on high latency and error rates as high up in the stack as possible.

Only page on latency at one point in a stack. If a lower-level component is
slower than it should be, but the overall user latency is fine, then there is
no need to page.

For error rates, page on user-visible errors. If there are errors further down
the stack that will cause such a failure, there is no need to page on them
separately. However, if some failures are not user-visible, but are otherwise
severe enough to require human involvement (for example, you're losing a lot of
money), add pages to be sent on those.

You may need alerts for different types of request if they have different
characteristics, or if problems in a low-traffic type of request would be drowned
out by high-traffic requests.
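
To illustrate how a low-traffic request type can be drowned out, the following sketch compares an aggregate error ratio with per-type ratios. The request types (`search`, `checkout`), the counts, and the 1% threshold are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical (requests, errors) counts per request type over one window.
Counts = Dict[str, Tuple[int, int]]


def error_ratio(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def aggregate_over_threshold(counts: Counts, threshold: float) -> bool:
    """Check the error ratio across all request types combined."""
    total_requests = sum(requests for requests, _ in counts.values())
    total_errors = sum(errors for _, errors in counts.values())
    return error_ratio(total_requests, total_errors) > threshold


def types_over_threshold(counts: Counts, threshold: float) -> List[str]:
    """List the request types whose own error ratio exceeds the threshold."""
    return sorted(
        name
        for name, (requests, errors) in counts.items()
        if error_ratio(requests, errors) > threshold
    )


counts = {"search": (100_000, 10), "checkout": (100, 50)}

print(aggregate_over_threshold(counts, 0.01))  # False: 60 / 100100 is about 0.06%
print(types_over_threshold(counts, 0.01))      # ['checkout']
```

Half of the `checkout` requests fail here, yet the combined ratio stays far below the threshold, which is the case where a separate alert per request type helps.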

### Offline processing

For offline processing systems, the key metric is how long data takes to get
through the system, so page if that gets high enough to cause user impact.

### Batch jobs

...
recently enough, and this will cause user-visible problems.

This should generally be at least enough time for 2 full runs of the batch job.
For a job that runs every 4 hours and takes an hour, 10 hours would be a
reasonable threshold. If you cannot withstand a single run failing, run the
job more frequently, as a single failure should not require human intervention.
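
The 10-hour figure can be sanity-checked with a little arithmetic. The sketch below uses an assumed rule of thumb (two scheduling intervals plus one run time as a lower bound), which is one possible reading of "enough time for 2 full runs" and not a formula given in this guide. The function names are hypothetical.

```python
def min_staleness_threshold(interval_hours: float, runtime_hours: float,
                            runs: int = 2) -> float:
    """Assumed lower bound: `runs` scheduling intervals plus one run time."""
    if interval_hours <= 0 or runtime_hours < 0 or runs < 1:
        raise ValueError("interval must be > 0, runtime >= 0, runs >= 1")
    return runs * interval_hours + runtime_hours


def is_stale(hours_since_last_success: float, threshold_hours: float) -> bool:
    """Page only when the last success is older than the threshold."""
    return hours_since_last_success > threshold_hours


floor = min_staleness_threshold(interval_hours=4, runtime_hours=1)
print(floor)             # 9
print(10 >= floor)       # True: 10 hours leaves an hour of slack
print(is_stale(8, 10))   # False: one failed run leaves a gap of about 8 hours
print(is_stale(12, 10))  # True: two failed runs in a row exceed the threshold
```

With runs starting every 4 hours and taking 1 hour, a single failed run stretches the gap between successes to about 8 hours, so a 10-hour threshold tolerates one failure and pages on the second.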

### Capacity

...
often requires human intervention to avoid an outage in the near future.

### Metamonitoring

It is important to have confidence that monitoring is working. Accordingly, have
alerts to ensure that Prometheus servers, Alertmanagers, PushGateways, and
other monitoring infrastructure are available and running correctly.

As always, if it is possible to alert on symptoms rather than causes, this helps
to reduce noise. For example, a blackbox test that alerts are getting from
PushGateway to Prometheus to Alertmanager to email is better than individual
alerts on each.

Supplementing the whitebox monitoring of Prometheus with external blackbox
monitoring can catch problems that are otherwise invisible, and also serves as
a fallback in case internal systems completely fail.

The expression browser is available at `/graph` on the Prometheus server, allowing you to enter any expression and see its result either in a table or graphed over time.
This is primarily useful for ad-hoc queries and debugging. For consoles, use [PromDash](../promdash/) or [Console templates](../consoles/).