Add lots of docs, mostly best practices.

Start out on the visualisation docs. Expand overview, remove federation as a feature since we don't have that yet. Switch example variable to follow naming scheme.

af897df4 · Brian Brazil · b3e57b67 · af897df4 · af897df4 · af897df4
Commit af897df4 authored Dec 26, 2014 by Brian Brazil
11 changed files
--- a/content/docs/introduction/getting_started.md
+++ b/content/docs/introduction/getting_started.md
@@ -228,11 +228,11 @@ minutes. We could write this as:
 avg(rate(rpc_calls_total[5m]))
 ```

-To record this expression as a new time series called `rpc_calls_rate`, create a
+To record this expression as a new time series called `job:rpc_calls:avg_rate5m`, create a
 file with the following recording rule and save it as `prometheus.rules`:

 ```
-rpc_calls_rate_mean = avg(rate(rpc_calls_total[5m]))
+job:rpc_calls:avg_rate5m = avg(rate(rpc_calls_total[5m]))
 ```

 To make Prometheus pick up this new rule, add a `rule_files` statement to the
@@ -258,5 +258,5 @@ global: {
 ```

 Restart Prometheus with the new configuration and verify that a new time series
-with the metric name `rpc_calls_rate_mean` is now available by querying it
+with the metric name `job:rpc_calls:avg_rate5m` is now available by querying it
 through the expression browser or graphing it.
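
To make the recorded expression concrete, here is a simplified Python sketch of what `avg(rate(rpc_calls_total[5m]))` works out to. The sample data, instance names and helper functions are hypothetical, and the real `rate()` is more careful than this (for example, it handles counter resets), so treat it as an illustration rather than the actual implementation.

```python
from statistics import mean


def per_second_rate(samples):
    """Simplified rate: counter increase divided by elapsed seconds.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value)
    pairs that fall inside the 5-minute window.
    Assumption: the counter never resets inside the window.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def avg_rate(series):
    """Average the per-series rates, like avg(rate(...)) without a `by` clause."""
    return mean(per_second_rate(samples) for samples in series.values())


# Hypothetical rpc_calls_total samples for two instances of one job.
series = {
    "instance-a": [(0, 100), (150, 250), (300, 400)],  # 300 calls in 300s
    "instance-b": [(0, 0), (150, 300), (300, 900)],    # 900 calls in 300s
}

print(avg_rate(series))
```

The value printed here is the kind of number that would be stored under `job:rpc_calls:avg_rate5m` each time the rule is evaluated.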
--- a/content/docs/introduction/overview.md
+++ b/content/docs/introduction/overview.md
@@ -21,7 +21,6 @@ Prometheus's main distinguishing features are:
 - **pushing time series** is supported via an intermediary gateway
 - targets are discovered via **service discovery** or **static configuration**
 - multiple modes of **graphing and dashboarding support**
- **federation support** coming soon

 The Prometheus ecosystem consists of multiple components, many of which are
 optional:
@@ -47,4 +46,17 @@ of highly dynamic service-oriented architectures. In a world of microservices,
 its support for multi-dimensional data collection and querying is a particular
 strength.

-TODO: highlight advantage of not depending on distributed storage.
+Prometheus is designed for reliability, to be the system you go to
+during an outage to allow you to quickly diagnose problems. Each Prometheus
+server is standalone, not depending on network storage or other remote services.
+You can rely on it when other parts of your infrastructure are broken, and
+you don't have to set up complex infrastructure to use it.
+
+## When doesn't it fit?
+
+Prometheus values reliability. You can always view what statistics are
+available about your system, even under failure conditions. If you need 100%
+accuracy, such as for per-request billing, Prometheus is not a good choice as
+we keep things simple and easy to understand. In such a case you would be best
+using some other system to collect and analyse the data for billing, and
+Prometheus for the rest of your monitoring.
--- a/content/docs/operating/rules.md
+++ b/content/docs/operating/rules.md
@@ -45,8 +45,8 @@ file:

 Some examples:

-    // Saving the per-job HTTP request count as a new set of time series:
-    job:api_http_requests_total:sum = sum(api_http_requests_total) by (job)
+    // Saving the per-job HTTP in-progress request count as a new set of time series:
+    job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job)

    // Drop or rewrite labels in the result time series:
    new_time_series{label_to_change="new_value",label_to_drop=""} = old_time_series

--- a/content/docs/practices/alerting.md
+++ b/content/docs/practices/alerting.md
+---
+title: Alerting
+sort_rank: 4
+---
+
+# Alerting
+
+We recommend that you read [My Philosophy on Alerting](https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit) based on Rob Ewaschuk's observations at Google.
+
+To summarize: keep alerting simple, alert on symptoms, have good consoles
+to allow pinpointing causes, and avoid having pages where there is nothing to
+do.
+
+## What to alert on
+
+Aim to have as few alerts as possible, by alerting on symptoms that are
+associated with end-user pain rather than trying to catch every possible way
+that pain could be caused. Alerts should link to relevant consoles,
+and make it easy to figure out which component is at fault.
+
+Allow slack in alerting to accommodate small blips.
+
+### Online serving systems
+
+Typically alert on high latency and error rates as high up in the stack as possible.
+
+Only page on latency at one point in a stack. If a lower component is slower
+than it should be but the overall user latency is fine, then there is no need to
+page.
+
+For error rates, page on errors to the user. If there are errors further down
+the stack that will cause such a failure, there is no need to page on them
+separately. However, if some failures do not cause a user-visible
+failure but are otherwise severe enough to require human involvement (for
+example, you're losing a lot of money), add pages to be sent on those.
+
+You may need alerts for different types of request if they have different
+characteristics, or if problems in a low-traffic type of request would be drowned
+out by high-traffic requests.
+
+### Offline processing
+
+For offline processing systems the key metric is how long data takes to get
+through the system, so page if that gets high enough to cause user impact.
+
+### Batch jobs
+
+For batch jobs it makes sense to page if the batch job has not succeeded
+recently enough, and this will cause user-visible problems.
+
+The threshold should generally be at least enough time for 2 full runs of the batch job.
+For a job that runs every 4 hours and takes an hour, 10 hours would be a
+reasonable threshold. If you cannot withstand a single run failing, run the
+job more often as a single failure should not require human intervention.
+
+### Capacity
+
+While not a problem causing immediate user impact, being close to capacity
+often requires human intervention to avoid an outage in the near future.
+
+### Metamonitoring
+
+It is important to have confidence that monitoring is working. Accordingly, have
+alerts to ensure Prometheus servers, Alertmanagers, PushGateways and
+other monitoring infrastructure are available and running correctly.
+
+As always, if it is possible to alert on symptoms rather than causes, this helps
+to reduce noise. For example, a blackbox test that alerts are getting from
+PushGateway to Prometheus to Alertmanager to email is better than individual
+alerts on each.
+
+Supplementing the whitebox monitoring of Prometheus with external blackbox
+monitoring can catch problems that are otherwise invisible, and also serves as
+a fallback in case internal systems completely fail.
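
As a rough illustration of the batch job guidance above, the following Python sketch checks a staleness threshold. It assumes one possible reading of "enough time for 2 full runs" (two scheduling intervals plus one run's duration); the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def minimum_threshold(interval, duration, runs=2):
    """Lower bound for the paging threshold.

    Assumption: "enough time for `runs` full runs" is read as `runs`
    scheduling intervals plus the duration of a single run.
    """
    return runs * interval + duration


def should_page(last_success, now, threshold):
    """Page only when the last successful run is older than the threshold."""
    return now - last_success > threshold


# The example from the text: a job that runs every 4 hours and takes an hour.
floor = minimum_threshold(timedelta(hours=4), timedelta(hours=1))
threshold = timedelta(hours=10)  # the 10 hours suggested above, with some slack

now = datetime(2014, 12, 26, 12, 0)
print(threshold >= floor)
print(should_page(now - timedelta(hours=5), now, threshold))   # one missed run
print(should_page(now - timedelta(hours=11), now, threshold))  # two missed runs
```

With these numbers a single failed run stays below the threshold, while a second consecutive failure crosses it.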
--- a/content/docs/practices/consoles.md
+++ b/content/docs/practices/consoles.md
+---
+title: Consoles and dashboards
+sort_rank: 3
+---
+
+## Consoles and dashboards
+
+It can be tempting to display as much data as possible on a dashboard, especially
+when a system like Prometheus offers the ability to have such rich
+instrumentation of your applications. This can lead to consoles that are
+impenetrable due to having too much information, that even an expert in the
+system would have difficulty drawing meaning from. Hundreds of graphs on a
+single page isn't unheard of, nor is a hundred plots on a single graph.
+
+Instead of trying to represent every piece of data you have, for operational
+consoles think about what the most likely failure modes are and how you'd use the
+consoles to differentiate them. Take advantage of the structure of your
+services. For example, if you have a big tree of services in an online serving
+system, latency in some lower service is a typical problem. You could have one
+big page with every service's information, but a better approach is one page per
+service that includes the latency and errors it sees for each service it talks
+to. You can then start at the top and work your way down to the problem
+service.
+
+We've found the following guidelines very effective:
+
+* Have no more than 5 graphs on a console.
+* Have no more than 5 plots (lines) on each graph. You can get away with more if it's a stacked/area graph.
+* If using console templates, try to avoid more than 20-30 entries in the table on the right.
+
+If you find yourself exceeding these then you should demote the visibility of
+less important information, possibly splitting out some subsystems to a new console.
+For example, you could graph aggregated rather than broken-down data, move
+things to the right hand table or even remove them completely if they're rarely
+useful - you can always look at them in the [expression browser](../../visualization/browser/)!
+
+Finally, it is difficult for a set of consoles to serve more than one master.
+What you want to know when oncall (what's broken?) tends to be very different
+from what you want when developing features (how many people hit corner
+case X?). In such cases, two separate sets of consoles can be useful.
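
The console guidelines above could be checked mechanically. The Python sketch below is only an illustration: the dictionary layout used to describe a console is hypothetical and is not a Prometheus or PromDash format, and the limits are the ones quoted in the guidelines (using the upper end of the 20-30 range).

```python
MAX_GRAPHS = 5
MAX_PLOTS_PER_GRAPH = 5
MAX_TABLE_ENTRIES = 30  # upper end of the 20-30 range


def check_console(console):
    """Return a list of guideline violations for a console description.

    Assumption: `console` is a dict with a "graphs" list (each graph has a
    "plots" list and an optional "stacked" flag) and a "table" list.
    """
    problems = []
    graphs = console.get("graphs", [])
    if len(graphs) > MAX_GRAPHS:
        problems.append(f"{len(graphs)} graphs (max {MAX_GRAPHS})")
    for index, graph in enumerate(graphs):
        plots = len(graph.get("plots", []))
        # Stacked/area graphs can get away with more plots, so skip them.
        if plots > MAX_PLOTS_PER_GRAPH and not graph.get("stacked", False):
            problems.append(
                f"graph {index}: {plots} plots (max {MAX_PLOTS_PER_GRAPH})"
            )
    table = console.get("table", [])
    if len(table) > MAX_TABLE_ENTRIES:
        problems.append(f"{len(table)} table entries (max {MAX_TABLE_ENTRIES})")
    return problems


# Hypothetical console: one overloaded line graph and one stacked graph.
console = {
    "graphs": [
        {"plots": ["p%d" % i for i in range(8)]},
        {"plots": ["p%d" % i for i in range(8)], "stacked": True},
    ],
    "table": ["row%d" % i for i in range(12)],
}

for problem in check_console(console):
    print(problem)
```

A check like this only flags candidates for demotion or splitting; deciding what to move to the table or to another console remains a judgment call.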
--- a/content/docs/practices/instrumentation.md
+++ b/content/docs/practices/instrumentation.md
--- a/content/docs/practices/naming.md
+++ b/content/docs/practices/naming.md
@@ -41,6 +41,9 @@ Use labels to differentiate the characteristics of the thing that is being measu
 * `api_http_requests_total` - differentiate request types: `type="create|update|delete"`
 * `api_request_duration_nanoseconds` - differentiate request stages: `stage="extract|transform|load"`

+Don't put the label names in the metric name, as that's redundant and
+will cause confusion if the label is aggregated away.
+
 CAUTION: <b>CAUTION:</b> Remember that every unique key-value label pair
 represents a new time series, which can dramatically increase the amount of
 data stored. Don't use labels to store dimensions with high cardinality (many

--- a/content/docs/visualization/browser.md
+++ b/content/docs/visualization/browser.md
@@ -5,4 +5,7 @@ sort_rank: 1

 # Expression browser

-TODO: Add content.
+
+The expression browser is available at `/graph` on the Prometheus server, allowing you to enter any expression and see its result either in a table or graphed over time.
+
+This is primarily useful for ad-hoc queries and debugging. For consoles, you should use [PromDash](../promdash/) or [Console templates](../consoles/).
--- a/content/docs/visualization/consoles.md
+++ b/content/docs/visualization/consoles.md
@@ -5,4 +5,35 @@ sort_rank: 3

 # Console templates

-TODO: Add content.
+Console templates allow for creation of arbitrary consoles using the [Go
+templating language](http://golang.org/pkg/text/template/). These are served
+from the Prometheus server.
+
+Console templates are the most powerful way to create consoles that can be easily managed in source control. There is a learning curve though, so
+users new to this style of monitoring should try out [PromDash](../promdash/) first.
+
+## Getting started
+
+Prometheus comes with an example set of consoles to get you going. These can be found at `/consoles/index.html.example` on a running Prometheus
+and will let you see Node Exporter consoles if your node exporters have a `job="node"` label.
+
+Consoles have 5 parts:
+
+1. A navigation bar on top
+1. A menu on the left
+1. Time controls on the bottom
+1. The main content in the center, usually graphs
+1. A table on the right
+
+The navigation bar is for links to other systems, such as other Prometheuses, documentation and whatever else makes sense to you.
+The menu is for navigation within the same Prometheus server. It is very useful to be able to quickly open a console in another tab to correlate information.
+Both are configured in `console_libraries/menu.lib`.
+
+The time controls allow changing of the duration and range of the graphs. Console URLs can be shared with others, who will then see the same graphs.
+
+The main content is usually graphs. There's a configurable JavaScript graphing
+library provided that handles requesting data from Prometheus and rendering
+it via [Rickshaw](http://code.shutterstock.com/rickshaw/).
+
+Finally, the table on the right can be used to display statistics in a more
+compact form than graphs.
--- a/content/docs/visualization/promdash.md
+++ b/content/docs/visualization/promdash.md
@@ -3,6 +3,10 @@ title: PromDash
 sort_rank: 2
 ---

-# Console templates
+# PromDash
+
+PromDash is a simple, easy and quick way to create consoles from your browser.
+
+See the [documentation](https://github.com/prometheus/promdash/blob/master/README.md) for more information.

 TODO: Add content.
--- a/static/docs.css
+++ b/static/docs.css
@@ -101,6 +101,14 @@ footer p {
  background: none;
 }

+.doc-content > h3 {
+  font-size: 25px;
+}
+
+.doc-content > h4 {
+  font-size: 21px;
+}
+
 pre {
  font-family: "Courier New", Monaco, Menlo, Consolas, monospace;
  background-color: #444;