Merge pull request #2 from brian-brazil/master

Add lots of docs, mostly best practices.

Commit bf082f68 authored Jan 02, 2015 by juliusv
11 changed files
--- a/content/docs/introduction/getting_started.md
+++ b/content/docs/introduction/getting_started.md
@@ -228,11 +228,11 @@ minutes. We could write this as:
 avg(rate(rpc_calls_total[5m]))
 ```

-To record this expression as a new time series called `rpc_calls_rate`, create a
+To record this expression as a new time series called `job:rpc_calls:avg_rate5m`, create a
 file with the following recording rule and save it as `prometheus.rules`:

 ```
-rpc_calls_rate_mean = avg(rate(rpc_calls_total[5m]))
+job:rpc_calls:avg_rate5m = avg(rate(rpc_calls_total[5m]))
 ```

 To make Prometheus pick up this new rule, add a `rule_files` statement to the
@@ -258,5 +258,5 @@ global: {
 ```

 Restart Prometheus with the new configuration and verify that a new time series
-with the metric name `rpc_calls_rate_mean` is now available by querying it
+with the metric name `job:rpc_calls:avg_rate5m` is now available by querying it
 through the expression browser or graphing it.
--- a/content/docs/introduction/overview.md
+++ b/content/docs/introduction/overview.md
@@ -21,7 +21,6 @@ Prometheus's main distinguishing features are:
 - **pushing time series** is supported via an intermediary gateway
 - targets are discovered via **service discovery** or **static configuration**
 - multiple modes of **graphing and dashboarding support**
- **federation support** coming soon

 The Prometheus ecosystem consists of multiple components, many of which are
 optional:
@@ -47,4 +46,17 @@ of highly dynamic service-oriented architectures. In a world of microservices,
 its support for multi-dimensional data collection and querying is a particular
 strength.

-TODO: highlight advantage of not depending on distributed storage.
+Prometheus is designed for reliability, to be the system you go to
+during an outage to allow you to quickly diagnose problems. Each Prometheus
+server is standalone, not depending on network storage or other remote services.
+You can rely on it when other parts of your infrastructure are broken, and
+you don't have to set up complex infrastructure to use it.
+
+## When doesn't it fit?
+
+Prometheus values reliability. You can always view what statistics are
+available about your system, even under failure conditions. If you need 100%
+accuracy, such as for per-request billing, Prometheus is not a good choice as
+we keep things simple and easy to understand. In such a case you would be best
+off using some other system to collect and analyse the data for billing, and
+Prometheus for the rest of your monitoring.
--- a/content/docs/operating/rules.md
+++ b/content/docs/operating/rules.md
@@ -45,8 +45,8 @@ file:

 Some examples:

-    // Saving the per-job HTTP request count as a new set of time series:
-    job:api_http_requests_total:sum = sum(api_http_requests_total) by (job)
+    // Saving the per-job HTTP in-progress request count as a new set of time series:
+    job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job)

    // Drop or rewrite labels in the result time series:
    new_timeseries{label_to_change="new_value",label_to_drop=""} = old_timeseries

--- a/content/docs/practices/alerting.md
+++ b/content/docs/practices/alerting.md
+---
+title: Alerting
+sort_rank: 4
+---
+
+# Alerting
+
+We recommend that you read [My Philosophy on Alerting](https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit) based on Rob Ewaschuk's observations at Google.
+
+To summarize, keep alerting simple, alert on symptoms, have good consoles
+to allow pinpointing causes and avoid having pages where there is nothing to
+do.
+
+## What to alert on
+
+Aim to have as few alerts as possible, by alerting on symptoms that are
+associated with end-user pain rather than trying to catch every possible way
+that pain could be caused. Alerts should link to relevant consoles,
+and make it easy to figure out which component is at fault.
+
+Allow slack in alerting to accommodate small blips.
+
+### Online serving systems
+
+Typically alert on high latency and error rates as high up in the stack as possible.
+
+Only page on latency at one point in a stack. If a lower component is slower
+than it should be but the overall user latency is fine, then there is no need to
+page.
+
+For error rates, page on errors to the user. If there are errors further down
+the stack that will cause such a failure, there is no need to page on them
+separately. However, if some failures are not user-visible but are otherwise
+severe enough to require human involvement (for example, you're losing a lot
+of money), add pages to be sent on those.
+
+You may need alerts for different types of request if they have different
+characteristics, or problems in a low-traffic type of request would be drowned
+out by high-traffic requests.
+
+### Offline processing
+
+For offline processing systems the key metric is how long data takes to get
+through the system, so page if that gets high enough to cause user impact.
+
+### Batch jobs
+
+For batch jobs it makes sense to page if the batch job has not succeeded
+recently enough, and this will cause user-visible problems.
+
+This should generally be at least enough time for 2 full runs of the batch job.
+For a job that runs every 4 hours and takes an hour, 10 hours would be a
+reasonable threshold. If you cannot withstand a single run failing, run the
+job more often as a single failure should not require human intervention.
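
The threshold arithmetic above can be sketched as follows. This is a minimal illustration, assuming that one "full run" means the scheduling interval plus the runtime; the function name `staleness_threshold_hours` is hypothetical and not part of any Prometheus tooling.

```python
def staleness_threshold_hours(interval_hours, runtime_hours):
    """Return a paging threshold that leaves room for two full runs.

    Assumption: a full run is the wait for the next scheduled start
    plus the time the job itself takes.
    """
    full_run = interval_hours + runtime_hours
    return 2 * full_run


# A job that runs every 4 hours and takes an hour to complete.
print(staleness_threshold_hours(4, 1))  # prints 10
```

Treat the result as a lower bound: the text asks for at least this much slack, so a larger threshold is also reasonable.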
+
+### Capacity
+
+While not a problem causing immediate user impact, being close to capacity
+often requires human intervention to avoid an outage in the near future.
+
+### Metamonitoring
+
+It is important to have confidence that monitoring is working. Accordingly, have
+alerts to ensure Prometheus servers, Alertmanagers, PushGateways and
+other monitoring infrastructure are available and running correctly.
+
+As always, if it is possible to alert on symptoms rather than causes, this helps
+to reduce noise. For example, a blackbox test that alerts are getting from
+PushGateway to Prometheus to Alertmanager to email is better than individual
+alerts on each.
+
+Supplementing the whitebox monitoring of Prometheus with external blackbox
+monitoring can catch problems that are otherwise invisible, and also serves as
+a fallback in case internal systems completely fail.
--- a/content/docs/practices/consoles.md
+++ b/content/docs/practices/consoles.md
+---
+title: Consoles and dashboards
+sort_rank: 3
+---
+
+## Consoles and dashboards
+
+It can be tempting to display as much data as possible on a dashboard, especially
+when a system like Prometheus offers the ability to have such rich
+instrumentation of your applications. This can lead to consoles that are
+impenetrable due to having too much information, from which even an expert in
+the system would have difficulty drawing meaning. Hundreds of graphs on a
+single page aren't unheard of, nor are a hundred plots on a single graph.
+
+Instead of trying to represent every piece of data you have, for operational
+consoles think of what are the most likely failure modes and how you'd use the
+consoles to differentiate them. Take advantage of the structure of your
+services. For example, if you have a big tree of services in an online serving
+system, latency in some lower service is a typical problem. Rather than one
+big page with every service's information, a better approach is one page per
+service that includes the latency and errors it sees for each service it talks
+to. You can then start at the top and work your way down to the problem
+service.
+
+We've found the following guidelines very effective:
+
+* Have no more than 5 graphs on a console.
+* Have no more than 5 plots (lines) on each graph. You can get away with more if it's a stacked/area graph.
+* If using console templates, try to avoid more than 20-30 entries on the table on the right.
+
+If you find yourself exceeding these then you should demote the visibility of
+less important information, possibly splitting out some subsystems to a new console.
+For example, you could graph aggregated rather than broken-down data, move
+things to the right-hand table, or even remove them completely if they're rarely
+useful - you can always look at them in the [expression browser](../../visualization/browser/)!
+
+Finally, it is difficult for a set of consoles to serve more than one master.
+What you want to know when oncall (what's broken?) tends to be very different
+from what you want when developing features (how many people hit corner
+case X?). In such cases, two separate sets of consoles can be useful.
--- a/content/docs/practices/instrumentation.md
+++ b/content/docs/practices/instrumentation.md
+---
+title: Instrumentation
+sort_rank: 3
+---
+
+# Instrumentation
+
+This page gives guidelines for when you're adding instrumentation to your code.
+
+## How to instrument
+
+The short answer is to instrument everything. Every library, subsystem and
+service should have at least a few metrics to give you a rough idea of how it's
+performing.
+
+Instrumentation should be an integral part of your code. Instantiate the metric
+classes in the same file you use them. This makes going from alert to console to code
+easy when you're chasing an error.
+
+### The three types of services
+
+For monitoring purposes, services can generally be broken down into three types:
+online serving, offline processing and batch jobs. There is overlap between
+them, but every service tends to fit well into one of these categories.
+
+#### Online serving systems
+
+An online serving system is one where someone is waiting on a response. For
+example, most database and HTTP requests fall into this category.
+
+The key metrics are queries performed, errors and latency. The number of
+in-progress requests can also be useful.
+
+Online serving systems should be monitored on both the client and server side.
+If the two sides see different things, that's very useful information for debugging.
+If a service has many clients, it's also not practical for it to track them
+individually, so the clients have to rely on their own stats.
+
+Be consistent in whether you count queries when they start or when they end.
+When they end is suggested, as it'll line up with the error and latency stats,
+and tends to be easier to code.
+
+#### Offline processing
+
+For offline processing, no one is actively waiting for a response and batching
+is common. There may also be multiple stages of processing.
+
+For each stage track the items coming in, how many are in progress, the last
+time you processed something, and how many items went out. If batching, you
+should also track batches going in and out.
+
+The last time you processed something is useful to detect if you've stalled,
+but it's very localised information. A better approach is to send a heartbeat
+through the system, that is, some dummy item that gets passed all the way through
+and includes the timestamp when it was inserted. Each stage can export the most
+recent heartbeat timestamp it has seen, letting you know how long items are
+taking to propagate through the system. For systems that don't have quiet
+periods where no processing occurs, an explicit heartbeat may not be needed.
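
The heartbeat idea can be sketched roughly as below. The `Stage` class, the item layout (`heartbeat`, `inserted_at`) and the stage names are all hypothetical, chosen only for illustration; in a real system the stored timestamp would be exported as a Gauge through a client library.

```python
import time


class Stage:
    """A hypothetical pipeline stage that remembers the newest heartbeat seen."""

    def __init__(self, name):
        self.name = name
        # In a real system this value would be exported as a Gauge.
        self.last_heartbeat_timestamp = 0.0

    def process(self, item):
        # Heartbeats are dummy items carrying the time they entered the system.
        if item.get("heartbeat"):
            self.last_heartbeat_timestamp = max(
                self.last_heartbeat_timestamp, item["inserted_at"]
            )
        return item


def propagation_delay(stage, now):
    """Seconds between heartbeat insertion and `now`, as seen by this stage."""
    return now - stage.last_heartbeat_timestamp


# Pass one heartbeat through every stage of an assumed three-stage pipeline.
stages = [Stage("extract"), Stage("transform"), Stage("load")]
item = {"heartbeat": True, "inserted_at": time.time()}
for stage in stages:
    item = stage.process(item)
```

Comparing `propagation_delay` for the first and last stages shows where in the pipeline items are getting held up.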
+
+#### Batch jobs
+
+There's a very fuzzy line between offline processing and batch jobs, as offline
+processing may be done in batch jobs. Batch jobs are distinguished by the
+fact that they don't run continuously, which makes scraping them difficult.
+
+The key metric of a batch job is the last time it succeeded. It's also useful to track
+how long each major stage of the job took, the overall runtime and the last
+time the job completed (successful or failed). These are all Gauges, and should
+be pushed to a PushGateway. There are generally also some overall job-specific
+statistics that it'd be useful to track, such as total number of records
+processed.
+
+For batch jobs that take more than a few minutes to run, it is useful to also
+scrape them in the usual pull way. This lets you see the same metrics over time
+as for other types of job such as resource usage and latency talking to other
+systems. This can aid debugging if the job starts to get slow.
+
+For batch jobs that run very often (say more often than every 15 minutes), you should
+consider converting them into daemons and handling them as offline processing jobs.
+
+### Subsystems
+
+In addition to the three main types of services, systems have sub-parts that
+it's also good to monitor.
+
+#### Libraries
+
+Libraries should provide instrumentation with no additional configuration
+required by users.
+
+Where the library is used to access some resource outside of the process (e.g.
+network, disk, IPC), that's an online serving system and you should track
+overall query count, errors (if errors are possible) and latency at a minimum.
+
+Depending on how heavy the library is, you should track internal errors and
+latency within the library itself, and any general statistics you think may be
+useful.
+
+A library may be used by multiple independent parts of an application against
+different resources, so take care to distinguish uses with labels where
+appropriate. For example a database connection pool should distinguish based
+on what database it's talking to, whereas there's no need to differentiate
+between users of a DNS client library.
+
+#### Logging
+
+As a general rule, for every line of logging code you should also have a
+counter that is incremented. If you find an interesting log message, you want
+to be able to see how often it has been happening and for how long.
+
+If there are multiple closely related log messages in the same function (for example
+different branches of an if or switch statement), it can sometimes make sense
+to have them all increment the same counter.
+
+It's also generally useful to export the total number of info/error/warning
+lines that were logged by the application as a whole, and check for significant
+differences as part of your release process.
+
+#### Failure
+
+Failure should be handled similarly to logging. Every time there's a failure, a
+counter should be incremented. Unlike logging, the error may also bubble up to a
+more general error counter depending on how your code is structured.
+
+When reporting failure, you should generally have some other metric
+representing total attempts. This makes the failure ratio easy to calculate.
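
As a small sketch of why the attempts metric matters, the ratio below is computed from two counter values. The function name and the choice to return 0.0 when nothing has been attempted are assumptions made for this example.

```python
def failure_ratio(failures_total, attempts_total):
    """Fraction of attempts that failed, given two counter values."""
    if attempts_total == 0:
        # Assumption for this sketch: no attempts means no failures to report.
        return 0.0
    return failures_total / attempts_total


print(failure_ratio(5, 200))  # prints 0.025
```

With only a failure counter, the same 5 failures could mean 2.5% or 100% of traffic; the attempts counter is what tells them apart.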
+
+#### Threadpools
+
+For any sort of threadpool, the key metrics are the number of queued requests, the number of
+threads in use, the total number of threads, the number of tasks processed and how long they took.
+It's also useful to track how long things were waiting in the queue.
+
+#### Caches
+
+The key metrics for a cache are total queries, hits, overall latency and then
+the query count, errors and latency of whatever online serving system the cache is in front of.
+
+#### Collectors
+
+When implementing a non-trivial custom Collector, it's advised to export a
+Gauge for how long the collection took in seconds and another for the number of
+errors encountered.
+
+This is one of the two cases when it's okay to export a duration as a Gauge
+rather than a Summary, the other being batch job durations. This is because both
+represent information about that particular push/scrape, rather than
+tracking multiple durations over time.
+
+## Things to watch out for
+
+There are some things to be aware of when doing monitoring generally, and also
+with Prometheus-style monitoring in particular.
+
+### Use labels
+
+Very few monitoring systems have the notion of labels and a rules language to
+take advantage of them, so it takes a bit of getting used to.
+
+When you have multiple metrics that you want to add/average/sum, they should
+usually be one metric with labels rather than multiple metrics.
+
+For example, rather than `http_responses_500_total` and `http_responses_403_total`,
+you should have one metric called `http_responses_total` with a `code` label
+for the HTTP response code. You can then process the entire metric as one in
+rules and graphs.
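
To illustrate the difference, here is a minimal sketch that models a single labelled metric as a dictionary keyed by the `code` label value. It is a stand-in for a real client library, and `observe_response` is a hypothetical helper.

```python
from collections import defaultdict

# One metric, http_responses_total, with one child per `code` label value.
http_responses_total = defaultdict(int)


def observe_response(code):
    """Count one HTTP response under its status code label."""
    http_responses_total[code] += 1


for code in ["200", "200", "500", "403", "200"]:
    observe_response(code)

# Because it is one metric, aggregating across codes needs no list of names.
total = sum(http_responses_total.values())
server_errors = sum(
    count for code, count in http_responses_total.items() if code.startswith("5")
)
print(total, server_errors)  # prints 5 1
```

With separate metrics per status code, both aggregations would need to be updated every time a new code appeared.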
+
+As a rule of thumb, no part of a metric name should ever be procedurally
+generated; use labels instead. The one exception is when proxying
+from another monitoring/instrumentation system.
+
+See also the [naming](../naming) section.
+
+### Don't overuse labels
+
+Each labelset is an additional timeseries that has RAM, CPU, disk and network
+costs. Usually this is negligible in the grand scheme of things, however if you
+have lots of metrics with hundreds of labelsets across hundreds of servers this
+can add up quickly.
+
+As a general guideline try to keep the cardinality of your metrics below 10,
+and for metrics that exceed that, aim to limit them to a handful across your
+whole system. The vast majority of your metrics should have no labels.
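
A rough way to see how quickly labelsets add up is to multiply the number of values each label can take, then multiply by the number of servers exporting the metric. This sketch assumes every combination of label values actually occurs, which makes it an upper bound; the label names and counts are made up.

```python
from math import prod


def series_count(label_value_counts, servers=1):
    """Upper bound on time series for one metric across all servers."""
    # With no labels the product is 1: one series per server.
    return prod(label_value_counts.values()) * servers


# Hypothetical metric with 5 response codes and 4 HTTP methods on 100 servers.
print(series_count({"code": 5, "method": 4}, servers=100))  # prints 2000
```

Here the per-metric cardinality is 20, already over the guideline of 10, and each additional label multiplies it again.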
+
+If you have a metric that has a cardinality over 100 or the potential to grow
+that large, investigate alternate solutions such as reducing the number of
+dimensions or moving the analysis away from monitoring and to a general purpose
+processing system.
+
+If you're unsure, start with no labels and add more
+labels over time as concrete use cases arise.
+
+### Counter vs Gauge vs Summary
+
+It's important to know which of the three main metric types to use for a given
+metric. There's a simple rule of thumb: if it can go down, it's a Gauge.
+
+Counters can only go up (and reset, such as when a process restarts). They're
+useful for accumulating the number of events, or the amount of something at
+each event. For example the total number of HTTP requests, or the total amount
+of bytes sent in HTTP requests. Raw counters are rarely useful; use the
+`rate()` function to get the rate at which they're increasing per second.
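
The sketch below shows the general idea of turning raw counter samples into a per-second rate while tolerating a reset. It is a simplified illustration and not Prometheus's actual `rate()` implementation; the function names and the sample layout are assumptions.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples sorted by time."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The value went down, so the counter reset (e.g. a restart).
            # Assume it restarted from 0 and count everything since then.
            total += current
    return total


def per_second_rate(samples):
    """Average per-second increase across the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return increase(samples) / elapsed


# The process restarts between t=60 and t=120, resetting the counter.
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(increase(samples))  # prints 140.0
```

A naive `last - first` on the same samples would give -20, which is why raw counter values are rarely useful on their own.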
+
+Gauges can be set, go up and go down. They're useful for snapshots of state,
+such as in-progress requests, free/total memory or temperature. You should
+never take a `rate()` of a Gauge.
+
+Summaries are similar to having two Counters, they track the number of events
+*and* the amount of something for each event, allowing you to calculate the
+average amount per event (useful for latency, for example). In addition you can
+also get quantiles of the amounts, but note that this isn't aggregatable.
+
+### Timestamps, not time since
+
+If you want to track the amount of time since something happened, export the
+unix timestamp at which it happened - not the time since it happened.
+
+With the timestamp exported you can use `time() - my_timestamp_metric` to
+calculate the time since the event, removing the need for update logic and
+protecting you against the update logic getting stuck.
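
A minimal sketch of the pattern, with a hypothetical `LastSuccess` class standing in for a Gauge from a client library:

```python
import time


class LastSuccess:
    """Holds the Unix timestamp of the most recent success."""

    def __init__(self):
        self.timestamp_seconds = 0.0

    def mark(self, now=None):
        # Called once per event; nothing has to keep running afterwards.
        self.timestamp_seconds = time.time() if now is None else now


def seconds_since(metric, now):
    """Mirrors `time() - my_timestamp_metric`, evaluated at query time."""
    return now - metric.timestamp_seconds


last_success = LastSuccess()
last_success.mark(now=1000.0)
print(seconds_since(last_success, 1090.0))  # prints 90.0
```

If the process hangs after `mark()`, the computed age keeps growing at query time. An exported "time since" value would instead freeze at its last update.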
+
+### Inner loops
+
+In general the additional resource cost of instrumentation is far outweighed by
+the benefits it brings to operations and development.
+
+For code which is performance critical or called more than 100k times a second
+inside a given process, you may wish to take some care as to how many metrics
+you update.
+
+A Java Simpleclient counter takes
+[12-17ns](https://github.com/prometheus/client_java/blob/master/benchmark/README.md)
+to increment depending on contention; other languages will have similar
+performance. If that amount of time is significant for your inner loop, limit
+the number of metrics you increment in the inner loop and avoid labels (or
+cache the result of the label lookup, for example the return value of `With()`
+in Go or `labels()` in Java) where possible.
+
+Beware also of metrics updates involving time or durations, as getting the time
+may involve a syscall. As with all matters involving performance critical code,
+benchmarks are the best way to determine the impact of any given change.
+
+### Avoid missing metrics
+
+Time series that aren't present until something happens are difficult to deal with,
+as the usual simple operations are no longer sufficient to correctly handle
+them. To avoid this, export a 0 for any time series you know may exist in advance.
+
+Most Prometheus client libraries (including Go and Java Simpleclient) will
+automatically export a 0 for you for metrics with no labels.
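
For labelled metrics, one way to follow this advice is to create every label value you know about up front. The sketch below uses a plain dictionary in place of a client library, and `new_response_counter` and the list of codes are hypothetical.

```python
def new_response_counter(known_codes):
    """Create a labelled counter with a 0 exported for each known code."""
    return {code: 0 for code in known_codes}


responses = new_response_counter(["200", "403", "500"])
responses["200"] += 1

# The "500" series exists with value 0 before any error has happened,
# so sums and ratios over it work from the first scrape.
print(responses["500"])  # prints 0
```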
--- a/content/docs/practices/naming.md
+++ b/content/docs/practices/naming.md
@@ -41,6 +41,9 @@ Use labels to differentiate the characteristics of the thing that is being measu
 * `api_http_requests_total` - differentiate request types: `type="create|update|delete"`
 * `api_request_duration_nanoseconds` - differentiate request stages: `stage="extract|transform|load"`

+Don't put the label names in the metric name, as that's redundant and
+will cause confusion if it's aggregated away.
+
 CAUTION: <b>CAUTION:</b> Remember that every unique key-value label pair
 represents a new time series, which can dramatically increase the amount of
 data stored. Don't use labels to store dimensions with high cardinality (many

--- a/content/docs/visualization/browser.md
+++ b/content/docs/visualization/browser.md
@@ -5,4 +5,7 @@ sort_rank: 1

 # Expression browser

-TODO: Add content.
+
+The expression browser is available at `/graph` on the Prometheus server, allowing you to enter any expression and see its result either in a table or graphed over time.
+
+This is primarily useful for ad-hoc queries and debugging. For consoles, you should use [PromDash](../promdash/) or [Console templates](../consoles/).
--- a/content/docs/visualization/consoles.md
+++ b/content/docs/visualization/consoles.md
@@ -5,4 +5,35 @@ sort_rank: 3

 # Console templates

-TODO: Add content.
+Console templates allow for the creation of arbitrary consoles using the [Go
+templating language](http://golang.org/pkg/text/template/). These are served
+from the Prometheus server.
+
+Console templates are the most powerful way to create templates that can be easily managed in source control. There is a learning curve though, so
+users new to this style of monitoring should try out [PromDash](../promdash/) first.
+
+## Getting started
+
+Prometheus comes with an example set of consoles to get you going. These can be found at `/consoles/index.html.example` on a running Prometheus
+and will let you see Node Exporter consoles if your node exporters have a `job="node"` label.
+
+Consoles have 5 parts:
+
+1. A navigation bar on top
+1. A menu on the left
+1. Time controls on the bottom
+1. The main content in the center, usually graphs
+1. A table on the right
+
+The navigation bar is for links to other systems, such as other Prometheuses, documentation and whatever else makes sense to you.
+The menu is for navigation inside the same Prometheus server. It's very useful to be able to quickly open a console in another tab to correlate information.
+Both are configured in `console_libraries/menu.lib`.
+
+The time controls allow changing of the duration and range of the graphs. Console URLs can be shared with others, and they'll see the same graphs.
+
+The main content is usually graphs. There's a configurable JavaScript graphing
+library provided that'll handle requesting data from Prometheus, and rendering
+it via [Rickshaw](http://code.shutterstock.com/rickshaw/).
+
+Finally, the table on the right can be used to display statistics in a more
+compact form than graphs.
--- a/content/docs/visualization/promdash.md
+++ b/content/docs/visualization/promdash.md
@@ -3,6 +3,10 @@ title: PromDash
 sort_rank: 2
 ---

-# Console templates
+# PromDash
+
+PromDash is a simple, easy and quick way to create consoles from your browser.
+
+See the [documentation](https://github.com/prometheus/promdash/blob/master/README.md) for more information.

 TODO: Add content.
--- a/static/docs.css
+++ b/static/docs.css
@@ -101,6 +101,14 @@ footer p {
  background: none;
 }

+.doc-content > h3 {
+  font-size: 25px;
+}
+
+.doc-content > h4 {
+  font-size: 21px;
+}
+
 pre {
  font-family: "Courier New", Monaco, Menlo, Consolas, monospace;
  background-color: #444;