Commit cc9e98ac authored by Tobias Schmidt

Move querying documentation to prometheus/prometheus

parent 8d8e494e
@@ -16,7 +16,7 @@ real-world systems.
All hope is not lost though. There are many common anomalies which you can
detect and handle with custom-built rules. The Prometheus [query
-language](../../../../../docs/querying/basics/) gives you the tools to discover
+language](/docs/prometheus/latest/querying/basics/) gives you the tools to discover
these anomalies while avoiding false positives.
<!-- more -->
@@ -28,7 +28,7 @@ performing as well as the rest, such as responding with increased latency.
Let us say that we have a metric `instance:latency_seconds:mean5m` representing the
average query latency for each instance of a service, calculated via a
-[recording rule](/docs/querying/rules/) from a
+[recording rule](/docs/prometheus/latest/querying/rules/) from a
[Summary](/docs/concepts/metric_types/#summary) metric.
A simple way to start would be to look for instances with a latency
@@ -116,7 +116,6 @@ route:
receiver: restart_webhook
```
## Summary
The Prometheus query language allows for rich processing of your monitoring
......
@@ -142,7 +142,7 @@ to finish within a reasonable amount of time. This happened to us when we wanted
to graph the top 5 utilized links out of ~18,000 in total. While the query
worked, it would take roughly the amount of time we set our timeout limit to,
meaning it was both slow and flaky. We decided to use Prometheus' [recording
-rules](/docs/querying/rules/) for precomputing heavy queries.
+rules](/docs/prometheus/latest/querying/rules/) for precomputing heavy queries.
precomputed_link_utilization_percent = rate(ifHCOutOctets{layer!='access'}[10m])*8/1000/1000
/ on (device,interface,alias)
......
@@ -128,7 +128,7 @@ However, if your dashboard query doesn't only touch a single time series but
aggregates over thousands of time series, the number of chunks to access
multiplies accordingly, and the overhead of the sequential scan will become
dominant. (Such queries are frowned upon, and we usually recommend to use a
-[recording rule](https://prometheus.io/docs/querying/rules/#recording-rules)
+[recording rule](https://prometheus.io/docs/prometheus/latest/querying/rules/#recording-rules)
for queries of that kind that are used frequently, e.g. in a dashboard.) But
with the double-delta encoding, the query time might still have been
acceptable, let's say around one second. After the switch to varbit encoding,
@@ -147,7 +147,7 @@ encoding. Start your Prometheus server with
`-storage.local.chunk-encoding-version=2` and wait for a while until you have
enough new chunks with varbit encoding to vet the effects. If you see queries
that are becoming unacceptably slow, check if you can use
-[recording rules](https://prometheus.io/docs/querying/rules/#recording-rules)
+[recording rules](https://prometheus.io/docs/prometheus/latest/querying/rules/#recording-rules)
to speed them up. Most likely, those queries will gain a lot from that even
with the old double-delta encoding.
......
@@ -12,7 +12,7 @@ vector elements at a given point in time, the alert counts as active for these
elements' label sets.
Alerting rules are configured in Prometheus in the same way as [recording
-rules](../../querying/rules).
+rules](/docs/prometheus/latest/querying/rules).
### Defining alerting rules
@@ -42,7 +42,7 @@ can be templated.
#### Templating
-Label and annotation values can be templated using [console templates](../../visualization/consoles).
+Label and annotation values can be templated using [console templates](/docs/visualization/consoles).
The `$labels` variable holds the label key/value pairs of an alert instance
and `$value` holds the evaluated value of an alert instance.
@@ -91,7 +91,7 @@ Prometheus's alerting rules are good at figuring what is broken *right now*,
but they are not a fully-fledged notification solution. Another layer is needed
to add summarization, notification rate limiting, silencing and alert
dependencies on top of the simple alert definitions. In Prometheus's ecosystem,
-the [Alertmanager](../alertmanager) takes on this
+the [Alertmanager](/docs/alertmanager) takes on this
role. Thus, Prometheus may be configured to periodically send information about
alert states to an Alertmanager instance, which then takes care of dispatching
the right notifications. The Alertmanager instance may be configured via the
......
@@ -56,7 +56,7 @@ during a scrape:
* the **count** of events that have been observed, exposed as `<basename>_count` (identical to `<basename>_bucket{le="+Inf"}` above)
Use the
-[`histogram_quantile()` function](/docs/querying/functions/#histogram_quantile)
+[`histogram_quantile()` function](/docs/prometheus/latest/querying/functions/#histogram_quantile)
to calculate quantiles from histograms or even aggregations of histograms. A
histogram is also suitable to calculate an
[Apdex score](http://en.wikipedia.org/wiki/Apdex). When operating on buckets,
......
@@ -70,7 +70,7 @@ also refer to the Prometheus monitoring system as a whole.
### PromQL
-[PromQL](../../querying/basics/) is the Prometheus Query Language. It allows for
+[PromQL](/docs/prometheus/latest/querying/basics/) is the Prometheus Query Language. It allows for
a wide range of operations including aggregation, slicing and dicing, prediction and joins.
### Pushgateway
......
@@ -25,7 +25,7 @@ For more elaborate overviews of Prometheus, see the resources linked from the
Prometheus's main features are:
* a multi-dimensional [data model](/docs/concepts/data_model/) with time series data identified by metric name and key/value pairs
-* a [flexible query language](/docs/querying/basics/)
+* a [flexible query language](/docs/prometheus/latest/querying/basics/)
to leverage this dimensionality
* no reliance on distributed storage; single server nodes are autonomous
* time series collection happens via a pull model over HTTP
@@ -57,7 +57,9 @@ its ecosystem components:
Prometheus scrapes metrics from instrumented jobs, either directly or via an
intermediary push gateway for short-lived jobs. It stores all scraped samples
-locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. [Grafana](https://grafana.com/) or other API consumers can be used to visualize the collected data.
+locally and runs rules over this data to either aggregate and record new time
+series from existing data or generate alerts. [Grafana](https://grafana.com/) or
+other API consumers can be used to visualize the collected data.
## When does it fit?
......
@@ -99,7 +99,7 @@ calculate streaming φ-quantiles on the client side and expose them directly,
while histograms expose bucketed observation counts and the calculation of
quantiles from the buckets of a histogram happens on the server side using the
[`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile).
+function](/docs/prometheus/latest/querying/functions/#histogram_quantile).
The two approaches have a number of different implications:
@@ -107,11 +107,11 @@ The two approaches have a number of different implications:
|---|-----------|---------
| Required configuration | Pick buckets suitable for the expected range of observed values. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later.
| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation.
-| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/querying/rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost.
+| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/prometheus/latest/querying/rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost.
| Number of time series (in addition to the `_sum` and `_count` series) | One time series per configured bucket. | One time series per configured quantile.
| Quantile error (see below for details) | Error is limited in the dimension of observed values by the width of the relevant bucket. | Error is limited in the dimension of φ by a configurable value.
-| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile). | Preconfigured by the client.
-| Aggregation | Ad-hoc with [Prometheus expressions](/docs/querying/functions/#histogram_quantile). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
+| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/prometheus/latest/querying/functions/#histogram_quantile). | Preconfigured by the client.
+| Aggregation | Ad-hoc with [Prometheus expressions](/docs/prometheus/latest/querying/functions/#histogram_quantile). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html).
Note the importance of the last item in the table. Let us return to
the SLA of serving 95% of requests within 300ms. This time, you do not
@@ -132,7 +132,7 @@ quantiles yields statistically nonsensical values.
Using histograms, the aggregation is perfectly possible with the
[`histogram_quantile()`
-function](/docs/querying/functions/#histogram_quantile).
+function](/docs/prometheus/latest/querying/functions/#histogram_quantile).
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) // GOOD.
......
@@ -5,9 +5,9 @@ sort_rank: 6
# Recording rules
-A consistent naming scheme for [recording rules](/docs/querying/rules/) makes it
-easier to interpret the meaning of a rule at a glance. It also avoids mistakes by
-making incorrect or meaningless calculations stand out.
+A consistent naming scheme for [recording rules](/docs/prometheus/latest/querying/rules/)
+makes it easier to interpret the meaning of a rule at a glance. It also avoids
+mistakes by making incorrect or meaningless calculations stand out.
This page documents how to correctly do aggregation and suggests a naming
convention.
@@ -21,7 +21,7 @@ Recording rules should be of the general form `level:metric:operations`.
of operations that were applied to the metric, newest operation first.
Keeping the metric name unchanged makes it easy to know what a metric is and
easy to find in the codebase.
To keep the operations clean, `_sum` is omitted if there are other operations,
as `sum()`. Associative operations can be merged (for example `min_min` is the
@@ -29,7 +29,7 @@ same as `min`).
If there is no obvious operation to use, use `sum`. When taking a ratio by
doing division, separate the metrics using `_per_` and call the operation
`ratio`.
When aggregating up ratios, aggregate up the numerator and denominator
separately and then divide. Do not take the average of a ratio or average of an
......
---
title: Querying basics
nav_title: Basics
sort_rank: 1
---
# Querying Prometheus
Prometheus provides a functional expression language that lets the user select
and aggregate time series data in real time. The result of an expression can
either be shown as a graph, viewed as tabular data in Prometheus's expression
browser, or consumed by external systems via the [HTTP API](/docs/querying/api/).
## Examples
This document is meant as a reference. For learning, it might be easier to
start with a couple of [examples](/docs/querying/examples/).
## Expression language data types
In Prometheus's expression language, an expression or sub-expression can
evaluate to one of four types:
* **Instant vector** - a set of time series containing a single sample for each time series, all sharing the same timestamp
* **Range vector** - a set of time series containing a range of data points over time for each time series
* **Scalar** - a simple numeric floating point value
* **String** - a simple string value; currently unused
Depending on the use-case (e.g. when graphing vs. displaying the output of an
expression), only some of these types are legal as the result from a
user-specified expression. For example, an expression that returns an instant
vector is the only type that can be directly graphed.
## Literals
### String literals
Strings may be specified as literals in single quotes, double quotes or
backticks.
PromQL follows the same [escaping rules as
Go](https://golang.org/ref/spec#String_literals). In single or double quotes a
backslash begins an escape sequence, which may be followed by `a`, `b`, `f`,
`n`, `r`, `t`, `v` or `\`. Specific characters can be provided using octal
(`\nnn`) or hexadecimal (`\xnn`, `\unnnn` and `\Unnnnnnnn`).
No escaping is processed inside backticks. Unlike Go, Prometheus does not discard newlines inside backticks.
Example:
"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`
### Float literals
Scalar float values can be literally written as numbers of the form
`[-](digits)[.(digits)]`.
-2.43
## Time series selectors
### Instant vector selectors
Instant vector selectors allow the selection of a set of time series and a
single sample value for each at a given timestamp (instant): in the simplest
form, only a metric name is specified. This results in an instant vector
containing elements for all time series that have this metric name.
This example selects all time series that have the `http_requests_total` metric
name:
http_requests_total
It is possible to filter these time series further by appending a set of labels
to match in curly braces (`{}`).
This example selects only those time series with the `http_requests_total`
metric name that also have the `job` label set to `prometheus` and their
`group` label set to `canary`:
http_requests_total{job="prometheus",group="canary"}
It is also possible to negatively match a label value, or to match label values
against regular expressions. The following label matching operators exist:
* `=`: Select labels that are exactly equal to the provided string.
* `!=`: Select labels that are not equal to the provided string.
* `=~`: Select labels that regex-match the provided string (or substring).
* `!~`: Select labels that do not regex-match the provided string (or substring).
For example, this selects all `http_requests_total` time series for `staging`,
`testing`, and `development` environments and HTTP methods other than `GET`.
http_requests_total{environment=~"staging|testing|development",method!="GET"}
Label matchers that match empty label values also select all time series that do
not have the specific label set at all. Regex-matches are fully anchored.
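For example, reusing the `group` label from the example above, the following selector matches `http_requests_total` series whose `group` label is empty *or* absent entirely:

```
http_requests_total{group=""}
```

To match only series that carry a non-empty `group` label, use `{group!=""}` instead.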
Vector selectors must either specify a name or at least one label matcher
that does not match the empty string. The following expression is illegal:
{job=~".*"} # Bad!
In contrast, these expressions are valid as they both have a selector that does not
match empty label values.
{job=~".+"} # Good!
{job=~".*",method="get"} # Good!
Label matchers can also be applied to metric names by matching against the internal
`__name__` label. For example, the expression `http_requests_total` is equivalent to
`{__name__="http_requests_total"}`. Matchers other than `=` (`!=`, `=~`, `!~`) may also be used.
The following expression selects all metrics that have a name starting with `job:`:
{__name__=~"^job:.*"}
### Range vector selectors
Range vector literals work like instant vector literals, except that they
select a range of samples back from the current instant. Syntactically, a range
duration is appended in square brackets (`[]`) at the end of a vector selector
to specify how far back in time values should be fetched for each resulting
range vector element.
Time durations are specified as a number, followed immediately by one of the
following units:
* `s` - seconds
* `m` - minutes
* `h` - hours
* `d` - days
* `w` - weeks
* `y` - years
In this example, we select all the values we have recorded within the last 5
minutes for all time series that have the metric name `http_requests_total` and
a `job` label set to `prometheus`:
http_requests_total{job="prometheus"}[5m]
### Offset modifier
The `offset` modifier allows changing the time offset for individual
instant and range vectors in a query.
For example, the following expression returns the value of
`http_requests_total` 5 minutes in the past relative to the current
query evaluation time:
http_requests_total offset 5m
Note that the `offset` modifier always needs to follow the selector
immediately, i.e. the following would be correct:
sum(http_requests_total{method="GET"} offset 5m) // GOOD.
While the following would be *incorrect*:
sum(http_requests_total{method="GET"}) offset 5m // INVALID.
The same works for range vectors. This returns the 5-minute rate that
`http_requests_total` had a week ago:
rate(http_requests_total[5m] offset 1w)
## Operators
Prometheus supports many binary and aggregation operators. These are described
in detail in the [expression language operators](/docs/querying/operators/) page.
## Functions
Prometheus supports several functions to operate on data. These are described
in detail in the [expression language functions](/docs/querying/functions/) page.
## Gotchas
### Interpolation and staleness
When queries are run, timestamps at which to sample data are selected
independently of the actual present time series data. This is mainly to support
cases like aggregation (`sum`, `avg`, and so on), where multiple aggregated
time series do not exactly align in time. Because of their independence,
Prometheus needs to assign a value at those timestamps for each relevant time
series. It does so by simply taking the newest sample before this timestamp.
If no stored sample is found within (by default) 5 minutes before a sampling
timestamp, no value is assigned for this time series at this point in time. This
effectively means that time series "disappear" from graphs at times where their
latest collected sample is older than 5 minutes.
NOTE: Staleness and interpolation handling might change. See
https://github.com/prometheus/prometheus/issues/398 and
https://github.com/prometheus/prometheus/issues/581.
### Avoiding slow queries and overloads
If a query needs to operate on a very large amount of data, graphing it might
time out or overload the server or browser. Thus, when constructing queries
over unknown data, always start building the query in the tabular view of
Prometheus's expression browser until the result set seems reasonable
(hundreds, not thousands, of time series at most). Only when you have filtered
or aggregated your data sufficiently, switch to graph mode. If the expression
still takes too long to graph ad-hoc, pre-record it via a [recording
rule](/docs/querying/rules/#recording-rules).
This is especially relevant for Prometheus's query language, where a bare
metric name selector like `api_http_requests_total` could expand to thousands
of time series with different labels. Also keep in mind that expressions which
aggregate over many time series will generate load on the server even if the
output is only a small number of time series. This is similar to how it would
be slow to sum all values of a column in a relational database, even if the
output value is only a single number.
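As a sketch, such a heavy aggregation over the bare `api_http_requests_total` selector mentioned above could be pre-recorded as a recording rule (the rule name here is illustrative, following the `level:metric:operations` naming convention):

```
job:api_http_requests:rate5m = sum(rate(api_http_requests_total[5m])) by (job)
```

Dashboards can then graph the cheap `job:api_http_requests:rate5m` series directly instead of aggregating thousands of raw series on every refresh.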
---
title: Querying examples
nav_title: Examples
sort_rank: 4
---
# Query examples
## Simple time series selection
Return all time series with the metric `http_requests_total`:
http_requests_total
Return all time series with the metric `http_requests_total` and the given
`job` and `handler` labels:
http_requests_total{job="apiserver", handler="/api/comments"}
Return a whole range of time (in this case 5 minutes) for the same vector,
making it a range vector:
http_requests_total{job="apiserver", handler="/api/comments"}[5m]
Note that an expression resulting in a range vector cannot be graphed directly,
but viewed in the tabular ("Console") view of the expression browser.
Using regular expressions, you could select time series only for jobs whose
name matches a certain pattern, in this case, all jobs that end with `server`.
Note that regex matches are fully anchored:
http_requests_total{job=~".*server"}
To select all HTTP status codes except 4xx ones, you could run:
http_requests_total{status!~"4.."}
## Using functions, operators, etc.
Return the per-second rate for all time series with the `http_requests_total`
metric name, as measured over the last 5 minutes:
rate(http_requests_total[5m])
Assuming that the `http_requests_total` time series all have the labels `job`
(fanout by job name) and `instance` (fanout by instance of the job), we might
want to sum over the rate of all instances, so we get fewer output time series,
but still preserve the `job` dimension:
sum(rate(http_requests_total[5m])) by (job)
If we have two different metrics with the same dimensional labels, we can apply
binary operators to them and elements on both sides with the same label set
will get matched and propagated to the output. For example, this expression
returns the unused memory in MiB for every instance (on a fictional cluster
scheduler exposing these metrics about the instances it runs):
(instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024
The same expression, but summed by application, could be written like this:
sum(
instance_memory_limit_bytes - instance_memory_usage_bytes
) by (app, proc) / 1024 / 1024
If the same fictional cluster scheduler exposed CPU usage metrics like the
following for every instance:
instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="elephant", proc="worker", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="turtle", proc="api", rev="4d3a513", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="fox", proc="widget", rev="4d3a513", env="prod", job="cluster-manager"}
...
...we could get the top 3 CPU users grouped by application (`app`) and process
type (`proc`) like this:
topk(3, sum(rate(instance_cpu_time_ns[5m])) by (app, proc))
Assuming this metric contains one time series per running instance, you could
count the number of running instances per application like this:
count(instance_cpu_time_ns) by (app)
---
title: Querying
sort_rank: 3
nav_icon: search
---
---
title: Recording rules
sort_rank: 6
---
# Defining recording rules
## Configuring rules
Prometheus supports two types of rules which may be configured and then
evaluated at regular intervals: recording rules and [alerting
rules](../../alerting/rules). To include rules in Prometheus, create a file
containing the necessary rule statements and have Prometheus load the file via
the `rule_files` field in the [Prometheus configuration](/docs/operating/configuration).
The rule files can be reloaded at runtime by sending `SIGHUP` to the Prometheus
process. The changes are only applied if all rule files are well-formatted.
## Syntax-checking rules
To quickly check whether a rule file is syntactically correct without starting
a Prometheus server, install and run Prometheus's `promtool` command-line
utility:
```bash
go get github.com/prometheus/prometheus/cmd/promtool
promtool check-rules /path/to/example.rules
```
When the file is syntactically valid, the checker prints a textual
representation of the parsed rules to standard output and then exits with
a `0` return status.
If there are any syntax errors, it prints an error message to standard error
and exits with a `1` return status. On invalid input arguments the exit status
is `2`.
## Recording rules
Recording rules allow you to precompute frequently needed or computationally
expensive expressions and save their result as a new set of time series.
Querying the precomputed result will then often be much faster than executing
the original expression every time it is needed. This is especially useful for
dashboards, which need to query the same expression repeatedly every time they
refresh.
To add a new recording rule, add a line of the following syntax to your rule
file:
<new time series name>[{<label overrides>}] = <expression to record>
Some examples:
# Saving the per-job HTTP in-progress request count as a new set of time series:
job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job)
# Drop or rewrite labels in the result time series:
new_time_series{label_to_change="new_value",label_to_drop=""} = old_time_series
Recording rules are evaluated at the interval specified by the
`evaluation_interval` field in the Prometheus configuration. During each
evaluation cycle, the right-hand-side expression of the rule statement is
evaluated at the current instant in time and the resulting sample vector is
stored as a new set of time series with the current timestamp and a new metric
name (and perhaps an overridden set of labels).
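Tying these pieces together, a minimal configuration sketch might look like the following (the interval value and file path are illustrative assumptions, not defaults to rely on):

```yaml
global:
  evaluation_interval: 15s    # how often recording and alerting rules are evaluated

rule_files:
  - /path/to/example.rules    # reloaded on SIGHUP if all rule files are well-formatted
```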
@@ -6,7 +6,7 @@ layout: jumbotron
<h1>From metrics to insight</h1>
<p class="subtitle">Power your metrics and alerting with a leading<br>open-source monitoring solution.</p>
<p>
-<a class="btn btn-default btn-lg" href="/docs/prometheus/1.8/getting_started/" role="button">Get Started</a>
+<a class="btn btn-default btn-lg" href="/docs/prometheus/latest/getting_started/" role="button">Get Started</a>
<a class="btn btn-default btn-lg" href="/download" role="button">Download</a>
</p>
</div>
@@ -25,7 +25,7 @@ layout: jumbotron
</a>
</div>
<div class="col-md-3 col-sm-6 col-xs-12 feature-item">
-<a href="/docs/querying/basics/">
+<a href="/docs/prometheus/latest/querying/basics/">
<h2><i class="fa fa-search"></i> Powerful queries</h2>
<p>A flexible query language allows slicing and dicing of collected time series data in order to generate ad-hoc graphs, tables, and alerts.</p>
</a>
@@ -46,7 +46,7 @@ layout: jumbotron
<div class="row">
<div class="col-md-3 col-sm-6 col-xs-12 feature-item">
-<a href="/docs/prometheus/1.8/configuration/">
+<a href="/docs/prometheus/latest/configuration/">
<h2><i class="fa fa-cog"></i> Simple operation</h2>
<p>Each server is independent for reliability, relying only on local storage. Written in Go, all binaries are statically linked and easy to deploy.</p>
</a>
......