Commit 01d28167 authored by James Turnbull, committed by Brian Brazil

Quick pass on the writing exporters docs (#935)

Quick pass on the writing exporters docs to make them clearer and more consistent
parent 833665e7
@@ -5,67 +5,81 @@ sort_rank: 5

# Writing exporters

If you are instrumenting your own code, the [general rules of how to instrument code with a Prometheus client library](/docs/practices/instrumentation/) should be followed. When taking metrics from another monitoring or instrumentation system, things tend not to be so black and white.

This document contains things you should consider when writing an exporter or custom collector. The theory covered will also be of interest to those doing direct instrumentation.

If you are writing an exporter and are unclear on anything here, please contact us on IRC (#prometheus on Freenode) or the [mailing list](/community).
## Maintainability and purity

The main decision you need to make when writing an exporter is how much work you’re willing to put in to get perfect metrics out of it.

If the system in question has only a handful of metrics that rarely change, then getting everything perfect is an easy choice; a good example of this is the [HAProxy exporter](https://github.com/prometheus/haproxy_exporter).

On the other hand, if you try to get things perfect when the system has hundreds of metrics that change frequently with new versions, then you’ve signed yourself up for a lot of ongoing work. The [MySQL exporter](https://github.com/prometheus/mysqld_exporter) is on this end of the spectrum.

The [node exporter](https://github.com/prometheus/node_exporter) is a mix of these, with complexity varying by module. For example, the `mdadm` collector hand-parses a file and exposes metrics created specifically for that collector, so we may as well get the metrics right. For the `meminfo` collector the results vary across kernel versions, so we end up doing just enough of a transform to create valid metrics.
## Configuration

When working with applications, you should aim for an exporter that requires no custom configuration by the user beyond telling it where the application is. You may also need to offer the ability to filter out certain metrics if they may be too granular and expensive on large setups, for example the [HAProxy exporter](https://github.com/prometheus/haproxy_exporter) allows filtering of per-server stats. Similarly, there may be expensive metrics that are disabled by default.
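As an illustration, the sketch below shows the kind of minimal configuration surface this implies, using Go's standard `flag` package; the flag names, defaults and port are hypothetical, not those of any real exporter.

```go
package main

import (
	"flag"
	"log"
)

func main() {
	// Hypothetical flags: the only required knowledge is where the
	// application lives and where to expose metrics.
	scrapeURI := flag.String("app.scrape-uri", "http://localhost:8080/stats",
		"URI of the application to scrape.")
	listenAddress := flag.String("web.listen-address", ":9999",
		"Address on which to expose metrics.")
	// Optional switch for metrics that are too granular on large setups.
	perServerStats := flag.Bool("app.per-server-stats", true,
		"Export per-server statistics (may be expensive on large setups).")
	flag.Parse()

	log.Printf("scraping %s, listening on %s, per-server stats: %v",
		*scrapeURI, *listenAddress, *perServerStats)
	// ... start the exporter here ...
}
```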
When working with other monitoring systems, frameworks and protocols you will often need to provide additional configuration or customization to generate metrics suitable for Prometheus. In the best case scenario, a monitoring system has a similar enough data model to Prometheus that you can automatically determine how to transform metrics. This is the case for [Cloudwatch](https://github.com/prometheus/cloudwatch_exporter), [SNMP](https://github.com/prometheus/snmp_exporter) and [collectd](https://github.com/prometheus/collectd_exporter). At most, we need the ability to let the user select which metrics they want to pull out.

In other cases, metrics from the system are completely non-standard, depending on the usage of the system and the underlying application. In that case the user has to tell us how to transform the metrics. The [JMX exporter](https://github.com/prometheus/jmx_exporter) is the worst offender here, with the [Graphite](https://github.com/prometheus/graphite_exporter) and [StatsD](https://github.com/prometheus/statsd_exporter) exporters also requiring configuration to extract labels.

Ensuring the exporter works out of the box without configuration, and providing a selection of example configurations for transformation if required, is advised.

YAML is the standard Prometheus configuration format; all configuration should use YAML by default.
## Metrics

@@ -73,111 +87,119 @@
Follow the [best practices on metric naming](/docs/practices/naming).

Generally metric names should allow someone who is familiar with Prometheus but not a particular system to make a good guess as to what a metric means. A metric named `http_requests_total` is not extremely useful - are these being measured as they come in, in some filter or when they get to the user’s code? And `requests_total` is even worse, what type of requests?

With direct instrumentation, a given metric should exist within exactly one file. Accordingly, within exporters and collectors, a metric should apply to exactly one subsystem and be named accordingly.

Metric names should never be procedurally generated, except when writing a custom collector or exporter.

Metric names for applications should generally be prefixed by the exporter name, e.g. `haproxy_up`.

Metrics must use base units (e.g. seconds, bytes) and leave converting them to something more readable to graphing tools. No matter what units you end up using, the units in the metric name must match the units in use. Similarly, expose ratios, not percentages. Even better, specify a counter for each of the two components of the ratio.

Metric names should not include the labels that they’re exported with, e.g. `by_type`, as that won’t make sense if the label is aggregated away.

The one exception is when you’re exporting the same data with different labels via multiple metrics, in which case that’s usually the sanest way to distinguish them. For direct instrumentation, this should only come up when exporting a single metric with all the labels would have too high a cardinality.

Prometheus metrics and label names are written in `snake_case`. Converting `camelCase` to `snake_case` is desirable, though doing so automatically doesn’t always produce nice results for things like `myTCPExample` or `isNaN`, so sometimes it’s best to leave them as-is.

Exposed metrics should not contain colons, these are reserved for users to use when aggregating.

Only `[a-zA-Z0-9:_]` are valid in metric names, any other characters should be sanitized to an underscore.

The `_sum`, `_count`, `_bucket` and `_total` suffixes are used by Summaries, Histograms and Counters. Unless you’re producing one of those, avoid these suffixes.

`_total` is a convention for counters, you should use it if you’re using the COUNTER type.

The `process_` and `scrape_` prefixes are reserved. It’s okay to add your own prefix on to these if they follow the [matching semantics](https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit).

For example, Prometheus has `scrape_duration_seconds` for how long a scrape took, and it’s good practice to also have an exporter-centric metric, e.g. `jmx_scrape_duration_seconds`, saying how long the specific exporter took to do its thing. For process stats where you have access to the PID, both Go and Python offer collectors that’ll handle this for you. A good example of this is the [HAProxy exporter](https://github.com/prometheus/haproxy_exporter).
When you have a successful request count and a failed request count, the best way to expose this is as one metric for total requests and another metric for failed requests. This makes it easy to calculate the failure ratio. Do not use one metric with a failed or success label. Similarly, with hit or miss for caches, it’s better to have one metric for total and another for hits.
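As a sketch of this pattern in Go (the metric names here are hypothetical), the failure ratio is then simply `rate(myapp_requests_failed_total[5m]) / rate(myapp_requests_total[5m])` in PromQL:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// One counter for all requests and a second counter for failures only,
// rather than a single metric with a success/failed label.
var (
	requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myapp_requests_total", // hypothetical name
		Help: "Total requests handled.",
	})
	requestsFailedTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myapp_requests_failed_total",
		Help: "Requests that failed.",
	})
)

func handleRequest() {
	requestsTotal.Inc()
	if err := doWork(); err != nil {
		requestsFailedTotal.Inc()
	}
}

// doWork stands in for the actual request handling.
func doWork() error { return nil }

func main() {
	prometheus.MustRegister(requestsTotal, requestsFailedTotal)
	// ... expose /metrics as usual ...
}
```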
search for the metric name. If the names are very well established and unlikely
to be used outside of the realm of people used to those names (e.g. SNMP and Consider the likelihood that someone using monitoring will do a code or
network engineers) then leaving them as-is may be a good idea. This logic web search for the metric name. If the names are very well-established
doesn’t apply for e.g. MySQL as non-DBAs can be expected to be poking around and unlikely to be used outside of the realm of people used to those
the metrics. A `HELP` string with the original name can provide most of the names, for example SNMP and network engineers, then leaving them as-is
same benefits as using the original names. may be a good idea. This logic doesn’t apply for all exporters, for
example the MySQL exporter metric's may be used by a variety of people,
not just DBAs. A `HELP` string with the original name can provide most
of the same benefits as using the original names.
### Labels

Read the [general advice](/docs/practices/instrumentation/#things-to-watch-out-for) on labels.

Avoid `type` as a label name, it’s too generic and often meaningless. You should also try where possible to avoid names that are likely to clash with target labels, such as `region`, `zone`, `cluster`, `availability_zone`, `az`, `datacenter`, `dc`, `owner`, `customer`, `stage`, `service`, `environment` and `env`. If, however, that’s what the application calls some resource, it’s best not to cause confusion by renaming it.

Avoid the temptation to put things into one metric just because they share a prefix. Unless you’re sure something makes sense as one metric, multiple metrics is safer.

The label `le` has special meaning for Histograms, and `quantile` for Summaries. Avoid these labels generally.

Read/write and send/receive are best as separate metrics, rather than as a label. This is usually because you care about only one of them at a time, and it is easier to use them that way.

The rule of thumb is that one metric should make sense when summed or averaged. There is one other case that comes up with exporters, and that’s where the data is fundamentally tabular and doing otherwise would require users to do regexes on metric names to be usable. Consider the voltage sensors on your motherboard, while doing math across them is meaningless, it makes sense to have them in one metric rather than having one metric per sensor. All values within a metric should (almost) always have the same unit, for example consider if fan speeds were mixed in with the voltages, and you had no way to automatically separate them.
Don’t do this:

@@ -195,32 +217,35 @@ my_metric{label=b} 6
<b>my_metric{} 7</b>
</pre>
The former breaks for people who do a `sum()` over your metric, and the latter breaks sum and is quite difficult to work with. Some client libraries, for example Go, will actively try to stop you doing the latter in a custom collector, and all client libraries should stop you from doing the latter with direct instrumentation. Never do either of these, rely on Prometheus aggregation instead.

If your monitoring exposes a total like this, drop the total. If you have to keep it around for some reason, for example the total includes things not counted individually, use different metric names.
Instrumentation labels should be minimal, every extra label is one more that users need to consider when writing their PromQL. Accordingly, avoid having instrumentation labels which could be removed without affecting the uniqueness of the time series. Additional information around a metric can be added via an info metric, for an example see below how to handle version numbers.
However, there are cases where it is expected that virtually all users of a metric will want the additional information. If so, adding a non-unique label, rather than an info metric, is the right solution. For example the [mysqld_exporter](https://github.com/prometheus/mysqld_exporter)'s `mysqld_perf_schema_events_statements_total`'s `digest` label is a hash of the full query pattern and is sufficient for uniqueness. However, it is of little use without the human readable `digest_text` label, which for long queries will contain only the start of the query pattern and is thus not unique. Thus we end up with both the `digest_text` label for humans and the `digest` label for uniqueness.
### Target labels, not static scraped labels

@@ -229,256 +254,277 @@ metrics, stop.
There are generally two cases where this comes up.

The first is for some label it would be useful to have on the metrics, such as the version number of the software. Instead, use the approach described at [https://www.robustperception.io/how-to-have-labels-for-machine-roles/](http://www.robustperception.io/how-to-have-labels-for-machine-roles/).

The second case is when a label is really a target label. These are things like region, cluster names, and so on, that come from your infrastructure setup rather than the application itself. It’s not for an application to say where it fits in your label taxonomy, that’s for the person running the Prometheus server to configure, and different people monitoring the same application may give it different names.

Accordingly, these labels belong up in the scrape configs of Prometheus via whatever service discovery you’re using. It’s okay to apply the concept of machine roles here as well, as it’s likely useful information for at least some people scraping it.
### Types

You should try to match up the types of your metrics to Prometheus types. This usually means counters and gauges. The `_count` and `_sum` of summaries are also relatively common, and on occasion you’ll see quantiles. Histograms are rare, if you come across one remember that the exposition format exposes cumulative values.

Often it won’t be obvious what the type of a metric is, especially if you’re automatically processing a set of metrics. In general `UNTYPED` is a safe default.

Counters can’t go down, so if you have a counter type coming from another instrumentation system that can be decremented, for example Dropwizard metrics, then it’s not a counter, it’s a gauge. `UNTYPED` is probably the best type to use there, as `GAUGE` would be misleading if it were being used as a counter.
### Help strings

When you’re transforming metrics it’s useful for users to be able to track back to what the original was, and what rules were in play that caused that transformation. Putting in the name of the collector or exporter, the ID of any rule that was applied and the name and details of the original metric into the help string will greatly aid users.

Prometheus doesn’t like one metric having different help strings. If you’re making one metric from many others, choose one of them to put in the help string.

For examples of this, the SNMP exporter uses the OID and the JMX exporter puts in a sample mBean name. The [HAProxy exporter](https://github.com/prometheus/haproxy_exporter) has hand-written strings. The [node exporter](https://github.com/prometheus/node_exporter) also has a wide variety of examples.
### Drop less useful statistics

Some instrumentation systems expose 1m, 5m, 15m rates, average rates since application start (these are called `mean` in Dropwizard metrics for example) in addition to minimums, maximums and standard deviations.

These should all be dropped, as they’re not very useful and add clutter. Prometheus can calculate rates itself, and usually more accurately as the averages exposed are usually exponentially decaying. You don’t know what time the min or max were calculated over, and the standard deviation is statistically useless, you can always expose sum of squares, `_sum` and `_count` if you ever need to calculate it.

Quantiles have related issues, you may choose to drop them or put them in a Summary.
### Dotted strings

Many monitoring systems don’t have labels, instead doing things like `my.class.path.mymetric.labelvalue1.labelvalue2.labelvalue3`.

The [Graphite](https://github.com/prometheus/graphite_exporter) and [StatsD](https://github.com/prometheus/statsd_exporter) exporters share a way of transforming these with a small configuration language. Other exporters should implement the same. The transformation is currently implemented only in Go, and would benefit from being factored out into a separate library.
## Collectors

When implementing the collector for your exporter, you should never use the usual direct instrumentation approach and then update the metrics on each scrape.

Rather create new metrics each time. In Go this is done with [MustNewConstMetric](https://godoc.org/github.com/prometheus/client_golang/prometheus#MustNewConstMetric) in your `Update()` method. For Python see [https://github.com/prometheus/client_python#custom-collectors](https://github.com/prometheus/client_python#custom-collectors) and for Java generate a `List<MetricFamilySamples>` in your collect method, see [StandardExports.java](https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/StandardExports.java) for an example.
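For illustration, here is a minimal Go collector along these lines; it implements client_golang's `Collector` interface directly (`Describe`/`Collect`) rather than the node exporter's `Update()` method, and the metric name, labels, values and port are all hypothetical stand-ins for whatever a real exporter would fetch from its target at scrape time.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// myCollector creates fresh metrics on every scrape rather than mutating
// global state between scrapes.
type myCollector struct {
	queueLength *prometheus.Desc
}

func newMyCollector() *myCollector {
	return &myCollector{
		queueLength: prometheus.NewDesc(
			"myapp_queue_length",           // hypothetical metric name
			"Current length of the queue.", // help string
			[]string{"queue"},              // variable labels
			nil,                            // no constant labels
		),
	}
}

func (c *myCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.queueLength
}

func (c *myCollector) Collect(ch chan<- prometheus.Metric) {
	// A real exporter would query the application here, at scrape time.
	lengths := map[string]float64{"incoming": 42, "outgoing": 7}
	for queue, length := range lengths {
		ch <- prometheus.MustNewConstMetric(
			c.queueLength, prometheus.GaugeValue, length, queue)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(newMyCollector())
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9999", nil))
}
```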
The reason for this is two-fold. Firstly, two scrapes could happen at the same time, and direct instrumentation uses what are effectively file-level global variables, so you’ll get race conditions. Secondly, if a label value disappears, it’ll still be exported.

Instrumenting your exporter itself via direct instrumentation is fine, e.g. total bytes transferred or calls performed by the exporter across all scrapes. For exporters such as the [blackbox exporter](https://github.com/prometheus/blackbox_exporter) and [SNMP exporter](https://github.com/prometheus/snmp_exporter), which aren’t tied to a single target, these should only be exposed on a vanilla `/metrics` call, not on a scrape of a particular target.
### Metrics about the scrape itself

Sometimes you’d like to export metrics that are about the scrape, like how long it took or how many records you processed.

These should be exposed as gauges as they’re about an event, the scrape, and the metric name prefixed by the exporter name, for example `jmx_scrape_duration_seconds`. Usually the `_exporter` is excluded and if the exporter also makes sense to use as just a collector, then definitely exclude it.
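Extending the collector sketch from the previous section, such a metric can be emitted as one more const metric from `Collect`; the `myapp_` prefix is a hypothetical exporter name.

```go
// scrapeDurationDesc describes how long a scrape by this exporter took.
var scrapeDurationDesc = prometheus.NewDesc(
	"myapp_scrape_duration_seconds", // prefixed by the (hypothetical) exporter name
	"Time this scrape of myapp metrics took.",
	nil, nil,
)

func (c *myCollector) Collect(ch chan<- prometheus.Metric) {
	start := time.Now()

	// ... gather and emit the application's metrics here ...

	// Exposed as a gauge, since it describes a single event: this scrape.
	ch <- prometheus.MustNewConstMetric(
		scrapeDurationDesc, prometheus.GaugeValue, time.Since(start).Seconds())
}
```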
### Machine and process metrics

Many systems, for example Elasticsearch, expose machine metrics such as CPU, memory and filesystem information. As the [node exporter](https://github.com/prometheus/node_exporter) provides these in the Prometheus ecosystem, such metrics should be dropped.

In the Java world, many instrumentation frameworks expose process-level and JVM-level stats such as CPU and GC. The Java client and JMX exporter already include these in the preferred form via [DefaultExports.java](https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/DefaultExports.java), so these should also be dropped.

Similarly with other languages and frameworks.
## Deployment

Each exporter should monitor exactly one application instance, preferably sitting right beside it on the same machine. That means for every HAProxy you run, you run a `haproxy_exporter` process. For every machine with a Mesos worker, you run the [Mesos exporter](https://github.com/mesosphere/mesos_exporter) on it, and another one for the master, if a machine has both.

The theory behind this is that for direct instrumentation this is what you’d be doing, and we’re trying to get as close to that as we can in other layouts. This means that all service discovery is done in Prometheus, not in exporters. This also has the benefit that Prometheus has the target information it needs to allow users to probe your service with the [blackbox exporter](https://github.com/prometheus/blackbox_exporter).
There are two exceptions:

The first is where running beside the application you’re monitoring is completely nonsensical. The SNMP, blackbox and IPMI exporters are the main examples of this. This applies to the IPMI and SNMP exporters because the devices are often black boxes that it’s impossible to run code on (though if you could run a node exporter on them instead that’d be better), and to the blackbox exporter where you’re monitoring something like a DNS name, where there’s also nothing to run on. In this case, Prometheus should still do service discovery, and pass on the target to be scraped. See the blackbox and SNMP exporters for examples.

Note that it is only currently possible to write this type of exporter with the Go, Python and Java client libraries.
The second exception is where you’re pulling some stats out of a random instance of a system and don’t care which one you’re talking to. Consider a set of MySQL replicas that you want to run some business queries against and then export the resulting data. Having an exporter that uses your usual load balancing approach to talk to one replica is the sanest approach.

This doesn’t apply when you’re monitoring a system with master-election, in that case you should monitor each instance individually and deal with the "masterness" in Prometheus. This is because there isn’t always exactly one master, and changing what a target is underneath Prometheus’s feet will cause oddities.
### Scheduling

Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.

Accordingly, you should not set timestamps on the metrics you expose, let Prometheus take care of that. If you think you need timestamps, then you probably need the [Pushgateway](https://prometheus.io/docs/instrumenting/pushing/) instead.
If a metric is particularly expensive to retrieve, i.e. takes more than a minute, it is acceptable to cache it. This should be noted in the `HELP` string.
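A sketch of such caching in Go, assuming a hypothetical metric name, a one-minute refresh interval and an `expensiveQuery` placeholder; the value is still only served when Prometheus scrapes, it is just not recomputed on every scrape.

```go
package collector

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cachingCollector refreshes an expensive value at most once per minute,
// while still only serving it when Prometheus actually scrapes.
type cachingCollector struct {
	mu         sync.Mutex
	desc       *prometheus.Desc
	lastValue  float64
	lastUpdate time.Time
}

func newCachingCollector() *cachingCollector {
	return &cachingCollector{
		desc: prometheus.NewDesc(
			"myapp_expensive_stat", // hypothetical metric name
			"An expensive statistic. Cached, refreshed at most once a minute.",
			nil, nil),
	}
}

func (c *cachingCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *cachingCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) > time.Minute {
		c.lastValue = expensiveQuery()
		c.lastUpdate = time.Now()
	}
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, c.lastValue)
}

// expensiveQuery stands in for whatever slow call the exporter has to make.
func expensiveQuery() float64 { return 42 }
```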
The default scrape timeout for Prometheus is 10 seconds. If your exporter can be expected to exceed this, you should explicitly call this out in your user documentation.
### Pushes

Some applications and monitoring systems only push metrics, for example StatsD, Graphite and collectd.

There are two considerations here.

Firstly, when do you expire metrics? Collectd and things talking to Graphite both export regularly, and when they stop we want to stop exposing the metrics. Collectd includes an expiry time so we use that, Graphite doesn’t so it is a flag on the exporter.

StatsD is a bit different, as it is dealing with events rather than metrics. The best model is to run one exporter beside each application and restart them when the application restarts so that the state is cleared.

Secondly, these sorts of systems tend to allow your users to send either deltas or raw counters. You should rely on the raw counters as far as possible, as that’s the general Prometheus model.

For service-level metrics, e.g. service-level batch jobs, you should have your exporter push into the Pushgateway and exit after the event rather than handling the state yourself. For instance-level batch metrics, there is no clear pattern yet. The options are either to abuse the node exporter’s textfile collector, rely on in-memory state (probably best if you don’t need to persist over a reboot) or implement similar functionality to the textfile collector.
### Failed scrapes

There are currently two patterns for failed scrapes where the application you’re talking to doesn’t respond or has other problems.

The first is to return a 5xx error.

The second is to have a `myexporter_up`, e.g. `haproxy_up`, variable that has a value of 0 or 1 depending on whether the scrape worked.
The latter is better where there are still some useful metrics you can get even with a failed scrape, such as the HAProxy exporter providing process stats. The former is a tad easier for users to deal with, as `up` works in the usual way, although you can’t distinguish between the exporter being down and the application being down.
### Landing page

It’s nicer for users if visiting `http://yourexporter/` has a simple HTML page with the name of the exporter, and a link to the `/metrics` page.
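A minimal sketch in Go; the exporter name, HTML and port are placeholders.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	// A small landing page so that visiting the root URL is useful.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`<html>
<head><title>My Exporter</title></head>
<body>
<h1>My Exporter</h1>
<p><a href="/metrics">Metrics</a></p>
</body>
</html>`))
	})
	log.Fatal(http.ListenAndServe(":9999", nil))
}
```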
### Port numbers

A user may have many exporters and Prometheus components on the same machine, so to make that easier each has a unique port number.

[https://github.com/prometheus/prometheus/wiki/Default-port-allocations](https://github.com/prometheus/prometheus/wiki/Default-port-allocations) is where we track them, this is publicly editable.

Feel free to grab the next free port number when developing your exporter, preferably before publicly announcing it. If you’re not ready to release yet, putting your username and WIP is fine.

This is a registry to make our users’ lives a little easier, not a commitment to develop particular exporters. For exporters for internal applications we recommend using ports outside of the range of default port allocations.
## Announcing

Once you’re ready to announce your exporter to the world, email the mailing list and send a PR to add it to [the list of available exporters](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md).