Commit 895fe7b2 authored by juliusv

Merge pull request #4 from prometheus/doc-fixups

Followup fixups for PR #2.
parents 11fed144 475c876f
......@@ -50,13 +50,13 @@ Prometheus is designed for reliability, to be the system you go to
during an outage to allow you to quickly diagnose problems. Each Prometheus
server is standalone, not depending on network storage or other remote services.
You can rely on it when other parts of your infrastructure are broken, and
you don't have to setup complex infrastructure to use it.
you don't have to set up complex infrastructure to use it.
## When doesn't it fit?
Prometheus values reliability. You can always view what statistics are
available about your system, even under failure conditions. If you need 100%
accuracy, such as for per-request billing, Prometheus is not a good choice as
we keep things simple and easy to understand. In such a case you would be best
using some other system to collect and analyse the data for billing, and
Prometheus for the rest of your monitoring.
the collected data will likely not be detailed and complete enough. In such a
case you would be best off using some other system to collect and analyse the
data for billing, and Prometheus for the rest of your monitoring.
......@@ -5,7 +5,8 @@ sort_rank: 4
# Alerting
We recommend that you read [My Philosophy on Alerting](https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit) based on Rob Ewaschuk's observations at Google.
We recommend that you read [My Philosophy on Alerting](https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit)
based on Rob Ewaschuk's observations at Google.
To summarize, keep alerting simple, alert on symptoms, have good consoles
to allow pinpointing causes and avoid having pages where there is nothing to
......@@ -15,24 +16,24 @@ do.
Aim to have as few alerts as possible, by alerting on symptoms that are
associated with end-user pain rather than trying to catch every possible way
that pain could be caused. Alerts should link to relevant consoles,
that pain could be caused. Alerts should link to relevant consoles
and make it easy to figure out which component is at fault.
Allow slack in alerting to accommodate small blips.
Allow for slack in alerting to accommodate small blips.
### Online serving systems
Typically alert on high latency and error rates as high up in the stack as possible.
Only page on latency at one point in a stack, if a lower component is slower
than it should be but the overall user latency is fine then there is no need to
page.
Only page on latency at one point in a stack. If a lower-level component is
slower than it should be, but the overall user latency is fine, then there is
no need to page.
For error rates, page on errors to the user. If there are errors further down
For error rates, page on user-visible errors. If there are errors further down
the stack that will cause such a failure, there is no need to page on them
separately. However if some failures do not cause a to the user-visible
failure but are otherwise severe enough to require human involvment (for
example, you're losing a lot of money), add pages to be sent on those.
separately. However, if some failures are not user-visible, but are otherwise
severe enough to require human involvement (for example, you're losing a lot of
money), add pages to be sent on those.
You may need alerts for different types of request if they have different
characteristics, or problems in a low-traffic type of request would be drowned
......@@ -40,7 +41,7 @@ out by high-traffic requests.
### Offline processing
For offline processing systems the key metric is how long data takes to get
For offline processing systems, the key metric is how long data takes to get
through the system, so page if that gets high enough to cause user impact.
### Batch jobs
......@@ -51,7 +52,7 @@ recently enough, and this will cause user-visible problems.
This should generally be at least enough time for 2 full runs of the batch job.
For a job that runs every 4 hours and takes an hour, 10 hours would be a
reasonable threshold. If you cannot withstand a single run failing, run the
job more often as a single failure should not require human intervention.
job more frequently, as a single failure should not require human intervention.
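The 10-hour figure above can be reproduced with a small calculation. This is only a sketch of one way to arrive at it: the formula (two intervals, plus one runtime as headroom for a slow run, plus some slack) and the function name `staleness_threshold_hours` are assumptions, not something the guideline prescribes.

```python
def staleness_threshold_hours(interval_hours: float,
                              runtime_hours: float,
                              slack_hours: float = 1.0) -> float:
    """Hypothetical alert threshold on the time since the last success.

    Two intervals let a single failed run pass without paging; the extra
    runtime and slack leave room for a slow second run.
    """
    return 2 * interval_hours + runtime_hours + slack_hours


# The job from the text: runs every 4 hours and takes 1 hour.
threshold = staleness_threshold_hours(interval_hours=4, runtime_hours=1)
```

With these inputs the result is 10 hours, which matches the example in the text.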
### Capacity
......@@ -61,14 +62,14 @@ often requires human intervention to avoid an outage in the near future.
### Metamonitoring
It is important to have confidence that monitoring is working. Accordingly, have
alerts to ensure Prometheus servers, Alertmanagers, PushGateways and
alerts to ensure that Prometheus servers, Alertmanagers, PushGateways, and
other monitoring infrastructure are available and running correctly.
As always, if it is possible to alert on symptoms rather than causes,this helps
As always, if it is possible to alert on symptoms rather than causes, this helps
to reduce noise. For example, a blackbox test that alerts are getting from
PushGateway to Prometheus to Alertmanager to email is better than individual
alerts on each.
Supplementing the whitebox monitoring of Prometheus with external blackbox
monitoring can catch problems that are otherwise invisible, and also serves as
a fallback in-case internal systems completely fail.
a fallback in case internal systems completely fail.
......@@ -9,32 +9,31 @@ It can be tempting to display as much data as possible on a dashboard, especiall
when a system like Prometheus offers the ability to have such rich
instrumentation of your applications. This can lead to consoles that are
impenetrable due to having too much information, that even an expert in the
system would have difficulty drawing meaning from. Hundreds of graphs on a
single page isn't unheard of, nor is a hundred plots on a single graph
system would have difficulty drawing meaning from.
Instead of trying to represent every piece of data you have, for operational
consoles think of what are the most likely failure modes and how you'd use the
consoles think of what are the most likely failure modes and how you would use the
consoles to differentiate them. Take advantage of the structure of your
services. For example if you've a big tree of services in an online serving
system, latency in some lower service is a typical problem. You could have one
big page with every service's information, a better approach is one page per
service that includes the latency and errors it sees for each service it talks
to. You can then start at the top and work your way down to the problem
services. For example, if you have a big tree of services in an online-serving
system, latency in some lower service is a typical problem. Rather than showing
every service's information on a single large dashboard, build separate dashboards
for each service that include the latency and errors for each service they talk
to. You can then start at the top and work your way down to the problematic
service.
We've found the following guidelines very effective:
We have found the following guidelines very effective:
* Have no more than 5 graphs on a console.
* Have no more than 5 plots (lines) on each graph. You can get away with more if it's a stacked/area graph.
* If using console templates, try to avoid more than 20-30 entries on the table on the right
* When using the provided console template examples, avoid more than 20-30 entries in the right-hand-side table.
If you find yourself exceeding these then you should demote the visibility of
If you find yourself exceeding these, it could make sense to demote the visibility of
less important information, possibly splitting out some subsystems to a new console.
For example you could graph aggregated rather than broken-down data, move
things to the right hand table or even remove it completely if it's rarely
For example, you could graph aggregated rather than broken-down data, move
it to the right-hand-side table, or even remove data completely if it is rarely
useful - you can always look at it in the [expression browser](../../visualization/browser/)!
Finally, it is difficult for a set of consoles to serve more than one master.
What you want to know when oncall (what's broken?) tends to be very different
from what you want when developing features (how many people hit corner
case X?). In such cases, two seperate sets of consoles can be useful.
case X?). In such cases, two separate sets of consoles can be useful.
---
title: Instrumention
title: Instrumentation
sort_rank: 3
---
# Instrumentation
This page gives guidelines for when you're adding instrumentation to your code.
This page provides an opinionated set of guidelines for instrumenting your code.
## How to instrument
The short answer is to instrument everything. Every library, subsystem and
service should have at least a few metrics to give you a rough idea of how it's
service should have at least a few metrics to give you a rough idea of how it is
performing.
Instrumentation should be an integral part of your code, instantiate the metric
Instrumentation should be an integral part of your code. Instantiate the metric
classes in the same file you use them. This makes going from alert to console to code
easy when you're chasing an error.
easy when you are chasing an error.
### The three types of services
For monitoring purposes services can generally be broken down into three types,
online serving, offline processing and batch jobs. There is overlap between
For monitoring purposes, services can generally be broken down into three types:
online-serving, offline-processing, and batch jobs. There is overlap between
them, but every service tends to fit well into one of these categories.
#### Online serving systems
#### Online-serving systems
An online serving system is one where someone is waiting on a response, for
example most database and http requests fall into this category.
An online-serving system is one where a human or another system is expecting an
immediate response. For example, most database and HTTP requests fall into
this category.
The key metrics are queries performed, errors and latency. The number of
inprogress requests can also be useful.
The key metrics in such a system are the number of performed queries, errors,
and latency. The number of in-progress requests can also be useful.
Online serving systems should be monitored on both the client and server side,
as if the two sides see different things that's very useful information for debugging.
If a service has many clients, it's also not practical for it to track them
individally so they have to rely on their own stats.
Online-serving systems should be monitored on both the client and server side.
If the two sides see different behaviors, that is very useful information for debugging.
If a service has many clients, it is also not practical for the service to track them
individually, so the clients have to rely on their own stats.
Be consistent in whether you count queries when they start or when they end.
When they end is suggested, as it'll line up with the error and latency stats,
When they end is suggested, as it will line up with the error and latency stats,
and tends to be easier to code.
#### Offline processing
For offline processing, noone is actively waiting for a response and batching
is common. There may also be multiple stages of processing.
For offline processing, no one is actively waiting for a response, and batching
of work is common. There may also be multiple stages of processing.
For each stage track the items coming in, how many are in progress, the last
time you processed something, and how many items went out. If batching, you
For each stage, track the items coming in, how many are in progress, the last
time you processed something, and how many items were sent out. If batching, you
should also track batches going in and out.
The last time you processed something is useful to detect if you've stalled,
but it's very localised information. A better approach is to send a heartbeat
though the system, that is some dummy item that gets passed all the way through
Knowing the last time that a system processed something is useful for detecting if it has stalled,
but it is very localised information. A better approach is to send a heartbeat
through the system: some dummy item that gets passed all the way through
and includes the timestamp when it was inserted. Each stage can export the most
recent heartbeat timestamp it has seen, letting you know how long items are
taking to propogate through the system. For systems that don't have quiet
taking to propagate through the system. For systems that do not have quiet
periods where no processing occurs, an explicit heartbeat may not be needed.
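The heartbeat idea can be sketched without any client library. In this illustration the `Stage` class, the item layout, and the stage names are all hypothetical; in a real pipeline each stage would export `last_heartbeat_timestamp` as a gauge.

```python
class Stage:
    """One processing stage that remembers the newest heartbeat it has seen."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Value the stage would export; 0.0 means no heartbeat seen yet.
        self.last_heartbeat_timestamp = 0.0

    def process(self, item: dict) -> dict:
        if item.get("heartbeat"):
            self.last_heartbeat_timestamp = item["inserted_at"]
        return item


def heartbeat_lag_seconds(stage: Stage, now: float) -> float:
    """How far behind the stage is, measured from heartbeat insertion time."""
    return now - stage.last_heartbeat_timestamp


stages = [Stage("ingest"), Stage("transform"), Stage("load")]
heartbeat = {"heartbeat": True, "inserted_at": 1000.0}

# Simulate a stall: the heartbeat only gets through the first two stages.
for stage in stages[:2]:
    stage.process(heartbeat)

lags = {stage.name: heartbeat_lag_seconds(stage, now=1030.0) for stage in stages}
```

Here the first two stages report a lag of 30 seconds, while the stalled `load` stage stands out because it has never seen the heartbeat.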
#### Batch jobs
There's a very fuzzy line between offline processing and batch jobs, as offline
There is a fuzzy line between offline-processing and batch jobs, as offline
processing may be done in batch jobs. Batch jobs are distinguished by the
fact that they don't run continuously, which makes scraping them difficult.
fact that they do not run continuously, which makes scraping them difficult.
The key metric of a batch job is the last time it succeeded. It's also useful to track
The key metric of a batch job is the last time it succeeded. It is also useful to track
how long each major stage of the job took, the overall runtime and the last
time the job completed (successful or failed). These are all Gauges, and should
be pushed to a PushGateway. There are generally also some overall job-specific
statistics that it'd be useful to track, such as total number of records
processed.
time the job completed (successful or failed). These are all gauges, and should
be [pushed to a PushGateway](/docs/instrumenting/pushing/).
There are generally also some overall job-specific statistics that would be
useful to track, such as the total number of records processed.
For batch jobs that take more than a few minutes to run, it is useful to also
scrape them in the usual pull way. This lets you see the same metrics over time
as for other types of job such as resource usage and latency talking to other
scrape them using pull-based monitoring. This lets you track the same metrics over time
as for other types of jobs, such as resource usage and latency when talking to other
systems. This can aid debugging if the job starts to get slow.
For batch jobs that run very often (say more often than every 15 minutes), you should
consider converting them into daemons and handling them as offline processing jobs.
For batch jobs that run very often (say, more often than every 15 minutes), you should
consider converting them into daemons and handling them as offline-processing jobs.
### Subsystems
In addition the the three main types of services, systems have sub-parts that
it's also good to monitor.
In addition to the three main types of services, systems have sub-parts that
should also be monitored.
#### Libraries
Libraries should provide aim instrumentation with no additional configuration
Libraries should provide instrumentation with no additional configuration
required by users.
Where the library is to access some resource outside of the process (e.g.
network, disk, IPC), that's an online serving system and you should track
overall query count, errors (if errors are possible) and latency at a minimum.
If it is a library used to access some resource outside of the process (for example,
network, disk, or IPC), track the overall query count, errors (if errors are possible)
and latency at a minimum.
Depending on how heavy the library is, you should track internal errors and
Depending on how heavy the library is, track internal errors and
latency within the library itself, and any general statistics you think may be
useful.
A library may be used by multiple independent parts of an application against
different resources, so take care to distinguish uses with labels where
appropriate. For example a database connection pool should distinguish based
on what database it's talking to, whereas there's no need to differentiate
appropriate. For example, a database connection pool should distinguish the databases
it is talking to, whereas there is no need to differentiate
between users of a DNS client library.
#### Logging
As a general rule, for every line of logging code you should also have a
counter that is incremented. If you find an interesting log message, you want
counter that is incremented. If you find an interesting log message, you want to
be able to see how often it has been happening and for how long.
If there's multiple closely related log messages in the same function (for example
If there are multiple closely-related log messages in the same function (for example
different branches of an if or switch statement), it can sometimes make sense
to have them all increment the same one counter.
to increment a single counter for all of them.
It's also generally useful to export the total number of info/error/warning
It is also generally useful to export the total number of info/error/warning
lines that were logged by the application as a whole, and check for significant
differences as part of your release process.
#### Failure
#### Failures
Failure should be handled similarly to logging, every time there's a failure a
Failures should be handled similarly to logging. Every time there is a failure, a
counter should be incremented. Unlike logging, the error may also bubble up to a
more general error counter depending on how your code is structured.
When reporting failure, you should generally have some other metric
representing total attempts. This makes the failure ratio easy to calculate.
When reporting failures, you should generally have some other metric
representing the total number of attempts. This makes the failure ratio easy to calculate.
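As a minimal sketch of why the second metric helps, the following counts attempts and failures side by side and derives the ratio from them. The plain integers stand in for counters, and `call_backend` is a hypothetical operation invented for illustration.

```python
attempts_total = 0
failures_total = 0


def call_backend(should_fail: bool) -> None:
    """Hypothetical operation that counts every attempt and every failure."""
    global attempts_total, failures_total
    attempts_total += 1
    try:
        if should_fail:
            raise RuntimeError("backend unavailable")
    except RuntimeError:
        failures_total += 1


def failure_ratio(failures: int, attempts: int) -> float:
    """Fraction of attempts that failed; 0.0 when nothing was attempted."""
    return failures / attempts if attempts else 0.0


for should_fail in [False, True, False, False]:
    call_backend(should_fail)

ratio = failure_ratio(failures_total, attempts_total)
```

One failure out of four attempts gives a ratio of 0.25, with no need to reconstruct the attempt count from other data.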
#### Threadpools
For any sort of threadpool, the key metrics are the number of queued requests, the number of
threads in use, the total number of threads, the number of tasks processed and how long they took.
It's also useful to track how long things were waiting in the queue.
threads in use, the total number of threads, the number of tasks processed, and how long they took.
It is also useful to track how long things were waiting in the queue.
#### Caches
The key metrics for a cache are total queries, hits, overall latency and then
the query count, errors and latency of whatever online serving system the cache is in front of.
the query count, errors and latency of whatever online-serving system the cache is in front of.
#### Collectors
When implementing a non-trivial custom Collector, it's advised to export a
Gauge for how long the collection took in seconds and another for the number of
When implementing a non-trivial custom metrics collector, it is advised to export a
gauge for how long the collection took in seconds and another for the number of
errors encountered.
This is one of the two cases when it's okay to export a duration as a Gauge
rather than a Summary, the other being batch job durations. This is as both
This is one of the two cases when it is okay to export a duration as a gauge
rather than a summary, the other being batch job durations. This is because both
represent information about that particular push/scrape, rather than
tracking multiple durations over time.
## Things to watch out for
There's some things to be aware of when doing monitoring generally, and also
with Prometheus-style monitoring in particular.
There are some general things to be aware of when doing monitoring, and also
Prometheus-specific ones in particular.
### Use labels
Very few monitoring systems have the notion of labels and a rules language to
Few monitoring systems have the notion of labels and an expression language to
take advantage of them, so it takes a bit of getting used to.
When you have multiple metrics that you want to add/average/sum, they should
usually be one metric with labels rather than multiple metrics.
For example rather `http_responses_500_total` and `http_resonses_403_total`
you should have one metric called `http_responses_total` with a `code` label
For example, rather than `http_responses_500_total` and `http_responses_403_total`,
create a single metric called `http_responses_total` with a `code` label
for the HTTP response code. You can then process the entire metric as one in
rules and graphs.
As a rule of thumb no part of a metric name should ever be procedurally
generated, you should use labels instead. The one exception is when proxying
As a rule of thumb, no part of a metric name should ever be procedurally
generated (use labels instead). The one exception is when proxying metrics
from another monitoring/instrumentation system.
See also the [naming](../naming) section.
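To illustrate why a single labelled metric is easier to work with, here is a small in-memory sketch. It uses a standard-library `Counter` keyed by the value of the `code` label as a stand-in; it is not a Prometheus client object, and `observe_response` is a hypothetical helper.

```python
from collections import Counter

# One metric, keyed by the value of its `code` label.
http_responses_total = Counter()


def observe_response(code: str) -> None:
    """Count one HTTP response under its status code label."""
    http_responses_total[code] += 1


for code in ["200", "200", "500", "403", "200"]:
    observe_response(code)

# Because all codes live in one metric, aggregation is a plain sum.
all_responses = sum(http_responses_total.values())
server_errors = sum(
    n for code, n in http_responses_total.items() if code.startswith("5")
)
```

With separate `..._500_total` and `..._403_total` metrics, each new status code would need a new name and every aggregation would have to list them all.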
### Don't overuse labels
### Do not overuse labels
Each labelset is an additional timeseries that has RAM, CPU, disk and network
costs. Usually this is negligable in the grand scheme of things, however if you
have lots of metrics with hundreds of labelsets across hundreds of servers this
can add up quickly.
Each labelset is an additional timeseries that has RAM, CPU, disk, and network
costs. Usually the overhead is negligible, but in scenarios with lots of
metrics and hundreds of labelsets across hundreds of servers, this can add up
quickly.
As a general guideline try to keep the cardinality of your metrics below 10,
As a general guideline, try to keep the cardinality of your metrics below 10,
and for metrics that exceed that, aim to limit them to a handful across your
whole system. The vast majority of your metrics should have no labels.
If you have a metric that has a cardinality over 100 or the potential to grow
that large, investigate alternate solutions such as reducing the number of
dimensions or moving the analysis away from monitoring and to a general purpose
dimensions or moving the analysis away from monitoring and to a general-purpose
processing system.
If you're unsure, start with no labels and add more
If you are unsure, start with no labels and add more
labels over time as concrete use cases arise.
### Counter vs Gauge vs Summary
### Counter vs. gauge vs. summary
It's important to know which of the three main metric types to use for a given
metric. There's a simple rule of thumb, if it can go down it's a Gauge.
It is important to know which of the three main metric types to use for a given
metric. There is a simple rule of thumb: if the value can go down, it's a gauge.
Counters can only go up (and reset, such as when a process restarts). They're
Counters can only go up (and reset, such as when a process restarts). They are
useful for accumulating the number of events, or the amount of something at
each event. For example the total number of HTTP requests, or the total amount
of bytes send in HTTP requests. Raw counters are rarely useful, use the
`rate()` function to get the rate at which they're incresing per second.
each event. For example, the total number of HTTP requests, or the total number
of bytes sent in HTTP requests. Raw counters are rarely useful. Use the
`rate()` function to get the per-second rate at which they are increasing.
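The idea behind a per-second rate over a counter that may reset can be sketched as follows. This is a simplified illustration, not the actual `rate()` implementation, which handles more cases (for example, extrapolation to the window boundaries); the function name and sample data are made up.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase over (timestamp, value) samples, oldest first."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so it restarted from zero.
        increase += current if current < previous else current - previous
    return increase / elapsed


# The counter resets between t=20 and t=40 (for example, a process restart).
samples = [(0.0, 100.0), (20.0, 130.0), (40.0, 10.0), (60.0, 30.0)]
rate = per_second_rate(samples)
```

The increases are 30, 10 (after the reset), and 20, so the sketch yields 60 events over 60 seconds, or 1.0 per second.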
Gauges can be set, go up and go down. They're useful for snapshots of state,
such as in-progress requests, free/total memory or temperature. You should
never take a `rate()` of a Gauge.
Gauges can be set, go up, and go down. They are useful for snapshots of state,
such as in-progress requests, free/total memory, or temperature. You should
never take a `rate()` of a gauge.
Summaries are similar to having two Counters, they track the number of events
Summaries are similar to having two counters. They track the number of events
*and* the amount of something for each event, allowing you to calculate the
average amount per event (useful for latency, for example). In addition you can
also get quantiles of the amounts, but note that this isn't aggregatable.
average amount per event (useful for latency, for example). In addition,
summaries can also export quantiles of the amounts, but note that quantiles are not
aggregatable.
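The "two counters" view of a summary can be sketched like this. `SummarySketch` is a hypothetical class for illustration only and deliberately leaves out quantiles; it shows that the count and sum can be added across servers before dividing, which is what makes the average aggregatable.

```python
class SummarySketch:
    """Tracks the number of events and the total amount across them."""

    def __init__(self) -> None:
        self.count = 0
        self.sum = 0.0

    def observe(self, amount: float) -> None:
        self.count += 1
        self.sum += amount


def average(summaries: list[SummarySketch]) -> float:
    """Average amount per event across any number of summaries."""
    total_count = sum(s.count for s in summaries)
    total_sum = sum(s.sum for s in summaries)
    return total_sum / total_count if total_count else 0.0


server_a, server_b = SummarySketch(), SummarySketch()
for latency_seconds in (0.25, 0.75):
    server_a.observe(latency_seconds)
server_b.observe(0.5)

overall_average = average([server_a, server_b])
```

The combined average here is 0.5 seconds. No comparable recipe exists for combining per-server quantiles, which is the limitation noted above.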
### Timestamps, not time since
If you want to track the amount of time since something happened export the
unix timestamp at which it happened - not the time since it happened.
If you want to track the amount of time since something happened, export the
Unix timestamp at which it happened - not the time since it happened.
With the timestamp exported you can use `time() - my_timestamp_metric` to
With the timestamp exported, you can use `time() - my_timestamp_metric` to
calculate the time since the event, removing the need for update logic and
protecting you against the update logic getting stuck.
### Inner loops
In general the additional resource cost of instrumentation is far outweighed by
In general, the additional resource cost of instrumentation is far outweighed by
the benefits it brings to operations and development.
For code which is performance critical or called more than 100k times a second
For code which is performance-critical or called more than 100k times a second
inside a given process, you may wish to take some care as to how many metrics
you update.
A Java Simpleclient counter takes
[12-17ns](https://github.com/prometheus/client_java/blob/master/benchmark/README.md)
to increment depending on contention, other languages will have similar
to increment depending on contention. Other languages will have similar
performance. If that amount of time is significant for your inner loop, limit
the number of metrics you increment in the inner loop and avoid labels (or
cache the result of the label lookup, for example the return value of `With()`
cache the result of the label lookup, for example, the return value of `With()`
in Go or `labels()` in Java) where possible.
Beware also of metrics updates involving time or durations, as getting the time
may involve a syscall. As with all matters involving performance critical code,
Beware also of metric updates involving time or durations, as getting the time
may involve a syscall. As with all matters involving performance-critical code,
benchmarks are the best way to determine the impact of any given change.
### Avoid missing metrics
Time series that aren't present until something happens are difficult to deal with,
Time series that are not present until something happens are difficult to deal with,
as the usual simple operations are no longer sufficient to correctly handle
them. To avoid this, export a 0 for any time series you know may exist in advance.
them. To avoid this, export a `0` for any time series you know may exist in advance.
Most Prometheus client libraries (including Go and Java Simpleclient) will
automatically export a 0 for you for metrics with no labels.
automatically export a `0` for you for metrics with no labels.
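For metrics that do have labels, one option is to initialise every label value you know in advance. The sketch below assumes a hypothetical `outcome` label with two known values and uses a plain dictionary as a stand-in for the exported series.

```python
OUTCOMES = ("success", "failure")  # label values known in advance (assumed)

# Export a 0 for every known outcome at startup, before any task has run.
tasks_total = {outcome: 0 for outcome in OUTCOMES}


def record_task(outcome: str) -> None:
    """Count one finished task under its outcome."""
    tasks_total[outcome] += 1


record_task("success")
record_task("success")

# The failure series exists and reads 0, so this ratio needs no special casing.
failure_fraction = tasks_total["failure"] / sum(tasks_total.values())
```

Without the initialisation step, the `failure` series would not exist until the first failure, and a ratio computed from it would be missing rather than zero.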
......@@ -7,7 +7,7 @@ sort_rank: 1
The metric and label conventions presented in this document are not required
for using Prometheus, but can serve as both a style-guide and collection of
best practices. Individual organizations might want to approach e.g. naming
best practices. Individual organizations may want to approach e.g. naming
conventions differently.
## Metric names
......@@ -18,7 +18,7 @@ A metric name:
* <code><b>prometheus</b>\_notifications\_total</code>
* <code><b>indexer</b>\_requests\_latencies\_milliseconds</code>
* <code><b>processor</b>\_requests\_total</code>
* must have a single unit (i.e. don't mix seconds with milliseconds)
* must have a single unit (i.e. do not mix seconds with milliseconds)
* should have a units suffix
* <code>api\_http\_request\_latency\_<b>milliseconds</b></code>
* <code>node\_memory\_usage\_<b>bytes</b></code>
......@@ -29,7 +29,7 @@ A metric name:
* instantaneous resource usage as a percentage
As a rule of thumb, either the `sum()` or the `avg()` over all dimensions of a
given metric should be meaningful (though not necessarily useful). If it isn't
given metric should be meaningful (though not necessarily useful). If it is not
meaningful, split the data up into multiple metrics. For example, having the
capacity of various queues in one metric is good, while mixing the capacity of a
queue with the current number of elements in the queue is not.
......@@ -41,11 +41,11 @@ Use labels to differentiate the characteristics of the thing that is being measu
* `api_http_requests_total` - differentiate request types: `type="create|update|delete"`
* `api_request_duration_nanoseconds` - differentiate request stages: `stage="extract|transform|load"`
Don't put the label names in the metric name, as that's redundant and
will cause confusion if it's aggregated away.
Do not put the label names in the metric name, as this introduces redundancy
and will cause confusion if the respective labels are aggregated away.
CAUTION: <b>CAUTION:</b> Remember that every unique key-value label pair
represents a new time series, which can dramatically increase the amount of
data stored. Don't use labels to store dimensions with high cardinality (many
data stored. Do not use labels to store dimensions with high cardinality (many
different label values), such as user IDs, email addresses, or other unbounded
sets of values.
......@@ -5,7 +5,6 @@ sort_rank: 1
# Expression browser
The expression browser is available at `/graph` on the Prometheus server, allowing you to enter any expression and see its result either in a table or graphed over time.
The expression browser is available at `/graph` on the Prometheus server, allowing you to enter any expression and see it's result either in a table or graphed over time.
This is primarily useful for ad-hoc queries and debugging, for consoles you should use [PromDash](../promdash/) or [Console templates](../consoles/).
This is primarily useful for ad-hoc queries and debugging. For consoles, use [PromDash](../promdash/) or [Console templates](../consoles/).
......@@ -5,7 +5,7 @@ sort_rank: 2
# PromDash
PromDash is a simple, easy and quick way to create consoles from your browser.
PromDash is a simple, easy, and quick way to create consoles from your browser.
See the [documentation](https://github.com/prometheus/promdash/blob/master/README.md) for more information.