Commit b287a44d authored by Julius Volz

Merge pull request #397 from prometheus/word-wrap

Word-wrap exporter and clientlib guideline docs.
parents dcc835e4 7c44b728
@@ -5,15 +5,23 @@ sort_rank: 2
# Client Library Guidelines
This document covers what functionality and API Prometheus client libraries should offer, with the aim of consistency across libraries, making the easy use cases easy and avoiding offering functionality that may lead users down the wrong path.
This document covers what functionality and API Prometheus client libraries
should offer, with the aim of consistency across libraries, making the easy use
cases easy and avoiding offering functionality that may lead users down the
wrong path.
There are [10 languages already supported](/docs/instrumenting/clientlibs) at the time of writing, so we’ve gotten a good sense by now of how to write a client. These guidelines aim to help authors of new client libraries produce good libraries.
There are [10 languages already supported](/docs/instrumenting/clientlibs) at
the time of writing, so we’ve gotten a good sense by now of how to write a
client. These guidelines aim to help authors of new client libraries produce
good libraries.
## Conventions
MUST/MUST NOT/SHOULD/SHOULD NOT/MAY have the meanings given in [https://www.ietf.org/rfc/rfc2119.txt](https://www.ietf.org/rfc/rfc2119.txt)
MUST/MUST NOT/SHOULD/SHOULD NOT/MAY have the meanings given in
[https://www.ietf.org/rfc/rfc2119.txt](https://www.ietf.org/rfc/rfc2119.txt)
In addition ENCOURAGED means that a feature is desirable for a library to have, but it’s okay if it’s not present. In other words, a nice to have.
In addition ENCOURAGED means that a feature is desirable for a library to have,
but it’s okay if it’s not present. In other words, a nice to have.
Things to keep in mind:
@@ -37,37 +45,72 @@ The common use cases are (in order):
## Overall structure
Clients MUST be written to be callback based internally. Clients SHOULD generally follow the structure described here.
Clients MUST be written to be callback based internally. Clients SHOULD
generally follow the structure described here.
The key class is the Collector. This has a method (typically called ‘collect’) that returns zero or more metrics and their samples. Collectors get registered with a CollectorRegistry. Data is exposed by passing a CollectorRegistry to a class/method/function "bridge", which returns the metrics in a format Prometheus supports. Every time the CollectorRegistry is scraped it must call back to each Collector’s collect method.
The key class is the Collector. This has a method (typically called ‘collect’)
that returns zero or more metrics and their samples. Collectors get registered
with a CollectorRegistry. Data is exposed by passing a CollectorRegistry to a
class/method/function "bridge", which returns the metrics in a format
Prometheus supports. Every time the CollectorRegistry is scraped it must
call back to each Collector’s collect method.
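The callback structure described above can be sketched roughly as follows. This is a minimal illustration, not the code of any real client library: `Sample`, `QueueDepthCollector` and the metric name `queue_depth` are hypothetical, and locking is left out for brevity even though a real client must be thread safe.

```python
from typing import Callable, Iterable, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: dict
    value: float


class Collector:
    """Anything with a collect() method returning zero or more samples."""

    def collect(self) -> Iterable[Sample]:
        raise NotImplementedError


class CollectorRegistry:
    def __init__(self) -> None:
        self._collectors: List[Collector] = []

    def register(self, collector: Collector) -> None:
        self._collectors.append(collector)

    def unregister(self, collector: Collector) -> None:
        self._collectors.remove(collector)

    def collect(self) -> List[Sample]:
        # Called on every scrape: call back into each registered Collector.
        samples: List[Sample] = []
        for collector in list(self._collectors):
            samples.extend(collector.collect())
        return samples


class QueueDepthCollector(Collector):
    """Hypothetical custom Collector that reads its value at scrape time."""

    def __init__(self, read_depth: Callable[[], int]) -> None:
        self._read_depth = read_depth

    def collect(self) -> Iterable[Sample]:
        yield Sample("queue_depth", {}, float(self._read_depth()))


registry = CollectorRegistry()
queue = [1, 2, 3]
registry.register(QueueDepthCollector(lambda: len(queue)))

first = registry.collect()   # value is read now, during the "scrape"
queue.append(4)
second = registry.collect()  # a later scrape sees the new depth
```

Because the value is produced inside `collect()`, nothing is cached between scrapes; a bridge only needs the registry to obtain current samples.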
The interfaces most users interact with are the Counter, Gauge, Summary, and Histogram Collectors. Each of these represents a single metric, and they should cover the vast majority of use cases where a user is instrumenting their own code.
The interfaces most users interact with are the Counter, Gauge, Summary, and
Histogram Collectors. Each of these represents a single metric, and they should
cover the vast majority of use cases where a user is instrumenting their own
code.
More advanced use cases (such as proxying from another monitoring/instrumentation system) require writing a custom Collector. Someone may also want to write a "bridge" that takes a CollectorRegistry and produces data in a format a different monitoring/instrumentation system understands, allowing users to only have to think about one instrumentation system.
More advanced use cases (such as proxying from another
monitoring/instrumentation system) require writing a custom Collector. Someone
may also want to write a "bridge" that takes a CollectorRegistry and produces
data in a format a different monitoring/instrumentation system understands,
allowing users to only have to think about one instrumentation system.
CollectorRegistry SHOULD offer `register()`/`unregister()` functions, and a Collector SHOULD be allowed to be registered to multiple CollectorRegistrys.
CollectorRegistry SHOULD offer `register()`/`unregister()` functions, and a
Collector SHOULD be allowed to be registered to multiple CollectorRegistrys.
Client libraries MUST be thread safe.
For non-OO languages such as C, client libraries should follow the spirit of this structure as much as is practical.
For non-OO languages such as C, client libraries should follow the spirit of
this structure as much as is practical.
### Naming
Client libraries SHOULD follow function/method/class names mentioned in this document, keeping in mind the naming conventions of the language they’re working in. For example, `set_to_current_time()` is a good method name in Python, but `SetToCurrentTime()` is better in Go and `setToCurrentTime()` is the convention in Java. Where names differ for technical reasons (e.g. not allowing function overloading), documentation/help strings SHOULD point users towards the other names.
Client libraries SHOULD follow function/method/class names mentioned in this
document, keeping in mind the naming conventions of the language they’re
working in. For example, `set_to_current_time()` is a good method name in
Python, but `SetToCurrentTime()` is better in Go and `setToCurrentTime()` is
the convention in Java. Where names differ for technical reasons (e.g. not
allowing function overloading), documentation/help strings SHOULD point users
towards the other names.
Libraries MUST NOT offer functions/methods/classes with the same or similar names to ones given here, but with different semantics.
Libraries MUST NOT offer functions/methods/classes with the same or similar
names to ones given here, but with different semantics.
## Metrics
The Counter, Gauge, Summary and Histogram [metric types](/docs/concepts/metric_types/) are the primary interface for users.
The Counter, Gauge, Summary and Histogram [metric
types](/docs/concepts/metric_types/) are the primary interface for users.
Counter and Gauge MUST be part of the client library. At least one of Summary and Histogram MUST be offered.
Counter and Gauge MUST be part of the client library. At least one of Summary
and Histogram MUST be offered.
These should be primarily used as file-static variables, that is, global variables defined in the same file as the code they’re instrumenting. The client library SHOULD enable this. The common use case is instrumenting a piece of code overall, not a piece of code in the context of one instance of an object. Users shouldn’t have to worry about plumbing their metrics throughout their code, the client library should do that for them (and if it doesn’t, users will write a wrapper around the library to make it "easier" - which rarely tends to go well).
These should be primarily used as file-static variables, that is, global
variables defined in the same file as the code they’re instrumenting. The
client library SHOULD enable this. The common use case is instrumenting a piece
of code overall, not a piece of code in the context of one instance of an
object. Users shouldn’t have to worry about plumbing their metrics throughout
their code, the client library should do that for them (and if it doesn’t,
users will write a wrapper around the library to make it "easier" - which
rarely tends to go well).
There MUST be a default CollectorRegistry, and the standard metrics MUST by default implicitly register into it with no special work required by the user. There MUST be a way to have metrics not register to the default CollectorRegistry, for use in batch jobs and unit tests. Custom collectors SHOULD also follow this.
There MUST be a default CollectorRegistry, and the standard metrics MUST by
default implicitly register into it with no special work required by the user.
There MUST be a way to have metrics not register to the default
CollectorRegistry, for use in batch jobs and unit tests. Custom collectors
SHOULD also follow this.
Exactly how the metrics should be created varies by language. For some (Java, Go) a builder approach is best, whereas for others (Python) function arguments are rich enough to do it in one call.
Exactly how the metrics should be created varies by language. For some (Java,
Go) a builder approach is best, whereas for others (Python) function arguments
are rich enough to do it in one call.
For example in the Java Simpleclient we have:
@@ -79,11 +122,16 @@ class YourClass {
}
```
This will register requests with the default CollectorRegistry. By calling `build()` rather than `register()` the metric won’t be registered (handy for unit tests). You can also pass in a CollectorRegistry to `register()` (handy for batch jobs).
This will register requests with the default CollectorRegistry. By calling
`build()` rather than `register()` the metric won’t be registered (handy for
unit tests). You can also pass in a CollectorRegistry to `register()` (handy
for batch jobs).
### Counter
[Counter](/docs/concepts/metric_types/#counter) is a monotonically increasing counter. It MUST NOT allow the value to decrease, however it MAY be reset to 0 (such as by server restart).
[Counter](/docs/concepts/metric_types/#counter) is a monotonically increasing
counter. It MUST NOT allow the value to decrease, however it MAY be reset to 0
(such as by server restart).
A counter MUST have the following methods:
@@ -92,13 +140,15 @@ A counter MUST have the following methods:
A counter is ENCOURAGED to have:
A way to count exceptions thrown/raised in a given piece of code, and optionally only certain types of exceptions. This is `count_exceptions` in Python.
A way to count exceptions thrown/raised in a given piece of code, and
optionally only certain types of exceptions. This is `count_exceptions` in
Python.
Counters MUST start at 0.
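As an illustration of the counter rules above (starts at 0, never decreases, optional exception counting), here is a minimal sketch. The required method list is elided from this excerpt, so the `inc()` and `get()` signatures below are assumptions; `count_exceptions` borrows the Python name mentioned above but is a simplified stand-in, not the real client's implementation.

```python
import threading
from contextlib import contextmanager


class Counter:
    def __init__(self) -> None:
        self._value = 0.0  # counters start at 0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        # Refuse negative increments so the value can never decrease.
        if amount < 0:
            raise ValueError("counters can only be incremented by amounts >= 0")
        with self._lock:
            self._value += amount

    def get(self) -> float:
        with self._lock:
            return self._value

    @contextmanager
    def count_exceptions(self, exception=Exception):
        # Count matching exceptions raised in the wrapped block, then re-raise.
        try:
            yield
        except exception:
            self.inc()
            raise


errors = Counter()
try:
    with errors.count_exceptions(KeyError):
        {}["missing"]  # raises KeyError, which is counted and re-raised
except KeyError:
    pass
```

A mutex is used here only because it is the simplest way to keep the sketch thread safe; it is not a recommendation about which concurrency primitive performs best.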
### Gauge
[Gauge](/docs/concepts/metric_types/#gauge) represents a value that can go up and down.
[Gauge](/docs/concepts/metric_types/#gauge) represents a value that can go up
and down.
A gauge MUST have the following methods:
@@ -108,7 +158,8 @@ A gauge MUST have the following methods:
- `dec(double v)`: Decrement the gauge by the given amount
- `set(double v)`: Set the gauge to the given value
Gauges MUST start at 0; you MAY offer a way for a given gauge to start at a different number.
Gauges MUST start at 0; you MAY offer a way for a given gauge to start at a
different number.
A gauge SHOULD have the following methods:
@@ -116,15 +167,25 @@ A gauge SHOULD have the following methods:
A gauge is ENCOURAGED to have:
A way to track in-progress requests in some piece of code/function. This is `track_inprogress` in Python.
A way to track in-progress requests in some piece of code/function. This is
`track_inprogress` in Python.
A way to time a piece of code and set the gauge to its duration in seconds. This is useful for batch jobs. This is `startTimer`/`setDuration` in Java and the `time()` decorator/context manager in Python. This SHOULD match the pattern in Summary/Histogram (though `set()` rather than `observe()`).
A way to time a piece of code and set the gauge to its duration in seconds.
This is useful for batch jobs. This is `startTimer`/`setDuration` in Java and
the `time()` decorator/context manager in Python. This SHOULD match the pattern
in Summary/Histogram (though `set()` rather than `observe()`).
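The gauge helpers described above might look something like this in Python. It is a sketch under assumptions: `inc`/`dec`/`set`, `track_inprogress` and `time` follow the names used in this document, `get()` is a hypothetical accessor added for the example, and locking is omitted to keep it short.

```python
import time
from contextlib import contextmanager


class Gauge:
    def __init__(self) -> None:
        self._value = 0.0  # gauges start at 0

    def inc(self, amount: float = 1.0) -> None:
        self._value += amount

    def dec(self, amount: float = 1.0) -> None:
        self._value -= amount

    def set(self, value: float) -> None:
        self._value = float(value)

    def set_to_current_time(self) -> None:
        self.set(time.time())  # Unix time in seconds

    def get(self) -> float:
        return self._value

    @contextmanager
    def track_inprogress(self):
        # Up on entry, down on exit, even if the block raises.
        self.inc()
        try:
            yield
        finally:
            self.dec()

    @contextmanager
    def time(self):
        # Set the gauge to the block's duration in seconds (set(), not observe()).
        start = time.perf_counter()
        try:
            yield
        finally:
            self.set(time.perf_counter() - start)


in_flight = Gauge()
with in_flight.track_inprogress():
    during = in_flight.get()
after = in_flight.get()

last_run_seconds = Gauge()
with last_run_seconds.time():
    sum(range(1000))  # stand-in for a batch job
```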
### Summary
A [summary](/docs/concepts/metric_types/#summary) samples observations (usually things like request durations) over sliding windows of time and provides instantaneous insight into their distributions, frequencies, and sums.
A [summary](/docs/concepts/metric_types/#summary) samples observations (usually
things like request durations) over sliding windows of time and provides
instantaneous insight into their distributions, frequencies, and sums.
A summary MUST NOT allow the user to set "quantile" as a label name, as this is used internally to designate summary quantiles. A summary is ENCOURAGED to offer quantiles as exports, though these can’t be aggregated and tend to be slow. A summary MUST allow not having quantiles, as just `_count`/`_sum` is quite useful and this MUST be the default.
A summary MUST NOT allow the user to set "quantile" as a label name, as this is
used internally to designate summary quantiles. A summary is ENCOURAGED to
offer quantiles as exports, though these can’t be aggregated and tend to be
slow. A summary MUST allow not having quantiles, as just `_count`/`_sum` is
quite useful and this MUST be the default.
A summary MUST have the following methods:
@@ -132,19 +193,28 @@ A summary MUST have the following methods:
A summary SHOULD have the following methods:
Some way to time code for users in seconds. In Python this is the `time()` decorator/context manager. In Java this is `startTimer`/`observeDuration`. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Histogram.
Some way to time code for users in seconds. In Python this is the `time()`
decorator/context manager. In Java this is `startTimer`/`observeDuration`.
Units other than seconds MUST NOT be offered (if a user wants something else,
they can do it by hand). This should follow the same pattern as Gauge/Histogram.
Summary `_count`/`_sum` MUST start at 0.
### Histogram
[Histograms](/docs/concepts/metric_types/#histogram) allow aggregatable distributions of events, such as request latencies. This is at its core a counter per bucket.
[Histograms](/docs/concepts/metric_types/#histogram) allow aggregatable
distributions of events, such as request latencies. This is at its core a
counter per bucket.
A histogram MUST NOT allow `le` as a user-set label, as `le` is used internally to designate buckets.
A histogram MUST NOT allow `le` as a user-set label, as `le` is used internally
to designate buckets.
A histogram MUST offer a way to manually choose the buckets. Ways to set buckets in a `linear(start, width, count)` and `exponential(start, factor, count)` fashion SHOULD be offered. Count MUST exclude the `+Inf` bucket.
A histogram MUST offer a way to manually choose the buckets. Ways to set
buckets in a `linear(start, width, count)` and `exponential(start, factor,
count)` fashion SHOULD be offered. Count MUST exclude the `+Inf` bucket.
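The `linear(start, width, count)` and `exponential(start, factor, count)` helpers could be sketched as below. The function names `linear_buckets` and `exponential_buckets` and the argument checks are assumptions for this example; the point it illustrates is that `count` covers only the finite upper bounds, with `+Inf` added separately.

```python
from typing import List


def linear_buckets(start: float, width: float, count: int) -> List[float]:
    """Return `count` upper bounds: start, start+width, start+2*width, ..."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [start + i * width for i in range(count)]


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if start <= 0:
        raise ValueError("start must be positive")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    return [start * factor ** i for i in range(count)]


finite = exponential_buckets(1.0, 2.0, 4)
# The histogram itself appends +Inf; it is not part of `count`.
all_bounds = finite + [float("inf")]
```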
A histogram SHOULD have the same default buckets as other client libraries. Buckets MUST NOT be changeable once the metric is created.
A histogram SHOULD have the same default buckets as other client libraries.
Buckets MUST NOT be changeable once the metric is created.
A histogram MUST have the following methods:
@@ -152,98 +222,172 @@ A histogram MUST have the following methods:
A histogram SHOULD have the following methods:
Some way to time code for users in seconds. In Python this is the `time()` decorator/context manager. In Java this is `startTimer`/`observeDuration`. Units other than seconds MUST NOT be offered (if a user wants something else, they can do it by hand). This should follow the same pattern as Gauge/Summary.
Some way to time code for users in seconds. In Python this is the `time()`
decorator/context manager. In Java this is `startTimer`/`observeDuration`.
Units other than seconds MUST NOT be offered (if a user wants something else,
they can do it by hand). This should follow the same pattern as Gauge/Summary.
Histogram `_count`/`_sum` and the buckets MUST start at 0.
**Further metrics considerations**
Providing additional functionality in metrics beyond what’s documented above is ENCOURAGED where it makes sense for a given language.
Providing additional functionality in metrics beyond what’s documented above is
ENCOURAGED where it makes sense for a given language.
If there’s a common use case you can make simpler then go for it, as long as it won’t encourage undesirable behaviours (such as suboptimal metric/label layouts, or doing computation in the client).
If there’s a common use case you can make simpler then go for it, as long as it
won’t encourage undesirable behaviours (such as suboptimal metric/label
layouts, or doing computation in the client).
### Labels
Labels are one of the [most powerful aspects](/docs/practices/instrumentation/#use-labels) of Prometheus, but [easily abused](/docs/practices/instrumentation/#do-not-overuse-labels). Accordingly client libraries must be very careful in how labels are offered to users.
Labels are one of the [most powerful
aspects](/docs/practices/instrumentation/#use-labels) of Prometheus, but
[easily abused](/docs/practices/instrumentation/#do-not-overuse-labels).
Accordingly client libraries must be very careful in how labels are offered to
users.
Client libraries MUST NOT under any circumstances allow users to have different label names for the same metric for Gauge/Counter/Summary/Histogram or any other Collector offered by the library.
Client libraries MUST NOT under any circumstances allow users to have different
label names for the same metric for Gauge/Counter/Summary/Histogram or any
other Collector offered by the library.
If your client library does validation of metrics at collect time, it MAY also verify this for custom Collectors.
If your client library does validation of metrics at collect time, it MAY also
verify this for custom Collectors.
While labels are powerful, the majority of metrics will not have labels. Accordingly the API should allow for labels but not dominate it.
While labels are powerful, the majority of metrics will not have labels.
Accordingly the API should allow for labels but not dominate it.
A client library MUST allow for optionally specifying a list of label names at Gauge/Counter/Summary/Histogram creation time. A client library SHOULD support any number of label names. A client library MUST validate that label names meet the [documented requirements](/docs/concepts/data_model/#metric-names-and-labels).
A client library MUST allow for optionally specifying a list of label names at
Gauge/Counter/Summary/Histogram creation time. A client library SHOULD support
any number of label names. A client library MUST validate that label names meet
the [documented
requirements](/docs/concepts/data_model/#metric-names-and-labels).
The general way to provide access to a labeled dimension of a metric is via a `labels()` method that takes either a list of the label values or a map from label name to label value and returns a "Child". The usual `.inc()`/`.dec()`/`.observe()` etc. methods can then be called on the Child.
The general way to provide access to a labeled dimension of a metric is via a
`labels()` method that takes either a list of the label values or a map from
label name to label value and returns a "Child". The usual
`.inc()`/`.dec()`/`.observe()` etc. methods can then be called on the Child.
The Child returned by `labels()` SHOULD be cacheable by the user, to avoid having to look it up again - this matters in latency-critical code.
The Child returned by `labels()` SHOULD be cacheable by the user, to avoid
having to look it up again - this matters in latency-critical code.
Metrics with labels SHOULD support a `remove()` method with the same signature as `labels()` that removes a Child from the metric so that it is no longer exported, and a `clear()` method that removes all Children from the metric. These invalidate caching of Children.
Metrics with labels SHOULD support a `remove()` method with the same signature
as `labels()` that removes a Child from the metric so that it is no longer
exported, and a `clear()` method that removes all Children from the metric.
These invalidate caching of Children.
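To make the `labels()`/`remove()`/`clear()` contract concrete, here is a small sketch for a labeled counter. `LabeledCounter`, `CounterChild` and `label_values()` are hypothetical names chosen for the example, and only the positional (list of values) form of `labels()` is shown.

```python
import threading
from typing import Dict, List, Sequence, Tuple


class CounterChild:
    def __init__(self) -> None:
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only be incremented by amounts >= 0")
        with self._lock:
            self._value += amount

    def get(self) -> float:
        with self._lock:
            return self._value


class LabeledCounter:
    def __init__(self, name: str, label_names: Sequence[str]) -> None:
        self.name = name
        self._label_names = tuple(label_names)  # fixed at creation time
        self._children: Dict[Tuple[str, ...], CounterChild] = {}
        self._lock = threading.Lock()

    def _key(self, values: Sequence[str]) -> Tuple[str, ...]:
        if len(values) != len(self._label_names):
            raise ValueError("expected %d label values" % len(self._label_names))
        return tuple(str(v) for v in values)

    def labels(self, *values: str) -> CounterChild:
        key = self._key(values)
        with self._lock:
            child = self._children.get(key)
            if child is None:
                # First use initializes the Child at its default value.
                child = self._children[key] = CounterChild()
            return child

    def remove(self, *values: str) -> None:
        key = self._key(values)
        with self._lock:
            self._children.pop(key, None)

    def clear(self) -> None:
        with self._lock:
            self._children.clear()

    def label_values(self) -> List[Tuple[str, ...]]:
        with self._lock:
            return sorted(self._children)


requests = LabeledCounter("http_requests_total", ["method", "code"])
ok = requests.labels("GET", "200")  # cache the Child in latency-critical code
ok.inc()
ok.inc()
requests.labels("POST", "500")      # initialize a Child without incrementing it
requests.remove("POST", "500")      # stop exporting it again
```

After `remove()` or `clear()`, a Child the caller cached earlier is detached from the metric, which is why the text says these calls invalidate caching.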
There SHOULD be a way to initialize a given Child with the default value, usually just calling `labels()`. Metrics without labels MUST always be initialized to avoid [problems with missing metrics](/docs/practices/instrumentation/#avoid-missing-metrics).
There SHOULD be a way to initialize a given Child with the default value,
usually just calling `labels()`. Metrics without labels MUST always be
initialized to avoid [problems with missing
metrics](/docs/practices/instrumentation/#avoid-missing-metrics).
### Metric names
Metric names must follow the [specification](/docs/concepts/data_model/#metric-names-and-labels). As with label names, this MUST be met for uses of Gauge/Counter/Summary/Histogram and in any other Collector offered with the library.
Metric names must follow the
[specification](/docs/concepts/data_model/#metric-names-and-labels). As with
label names, this MUST be met for uses of Gauge/Counter/Summary/Histogram and
in any other Collector offered with the library.
Many client libraries offer setting the name in three parts: `namespace_subsystem_name` of which only the `name` is mandatory.
Many client libraries offer setting the name in three parts:
`namespace_subsystem_name` of which only the `name` is mandatory.
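A possible shape for the three-part name, with validation, is sketched below. `build_full_name` is a hypothetical helper, and the regular expression assumes the classic metric name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*` from the data model documentation; check the linked specification for the rules that apply to your target version.

```python
import re

# Assumed pattern for valid metric names (see the data model documentation).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def build_full_name(name: str, namespace: str = "", subsystem: str = "") -> str:
    """Join namespace, subsystem and name with underscores; only name is mandatory."""
    if not name:
        raise ValueError("name is mandatory")
    full_name = "_".join(part for part in (namespace, subsystem, name) if part)
    if METRIC_NAME_RE.fullmatch(full_name) is None:
        raise ValueError("invalid metric name: %r" % full_name)
    return full_name


full = build_full_name("requests_total", namespace="myapp", subsystem="http")
short = build_full_name("up")
```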
Dynamic/generated metric names or subparts of metric names MUST be discouraged, except when a custom Collector is proxying from other instrumentation/monitoring systems. Generated/dynamic metric names are a sign that you should be using labels instead.
Dynamic/generated metric names or subparts of metric names MUST be discouraged,
except when a custom Collector is proxying from other
instrumentation/monitoring systems. Generated/dynamic metric names are a sign
that you should be using labels instead.
### Metric description/help
Gauge/Counter/Summary/Histogram MUST require metric descriptions/help to be provided.
Gauge/Counter/Summary/Histogram MUST require metric descriptions/help to be
provided.
Any custom Collectors provided with the client libraries MUST have descriptions/help on their metrics.
Any custom Collectors provided with the client libraries MUST have
descriptions/help on their metrics.
It is suggested to make it a mandatory argument, but not to check that it’s of a certain length as if someone really doesn’t want to write docs we’re not going to convince them otherwise. Collectors offered with the library (and indeed everywhere we can within the ecosystem) SHOULD have good metric descriptions, to lead by example.
It is suggested to make it a mandatory argument, but not to check that it’s of
a certain length as if someone really doesn’t want to write docs we’re not
going to convince them otherwise. Collectors offered with the library (and
indeed everywhere we can within the ecosystem) SHOULD have good metric
descriptions, to lead by example.
## Exposition
Clients MUST implement one of the documented [exposition formats](/docs/instrumenting/exposition_formats).
Clients MUST implement one of the documented [exposition
formats](/docs/instrumenting/exposition_formats).
Clients MAY implement more than one format. There SHOULD be a human readable format offered.
Clients MAY implement more than one format. There SHOULD be a human readable
format offered.
If in doubt, go for the text format. It doesn’t have a dependency (protobuf), tends to be easy to produce, is human readable and the performance benefits of protobuf are not that significant for most use cases.
If in doubt, go for the text format. It doesn’t have a dependency (protobuf),
tends to be easy to produce, is human readable and the performance benefits of
protobuf are not that significant for most use cases.
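To show why the text format tends to be easy to produce, here is a rough sketch of rendering one metric family. `render_text` and its argument layout are assumptions made for this example; the `# HELP`/`# TYPE` lines, the `name{label="value"} value` sample lines, and the escaping of backslashes, newlines and double quotes follow the text exposition format as I understand it, so verify against the linked format documentation before relying on it.

```python
from typing import Dict, Iterable, Tuple


def escape_help(text: str) -> str:
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def escape_label_value(value: str) -> str:
    return escape_help(value).replace('"', '\\"')


def render_text(
    name: str,
    help_text: str,
    metric_type: str,
    samples: Iterable[Tuple[Dict[str, str], float]],
) -> str:
    lines = [
        "# HELP %s %s" % (name, escape_help(help_text)),
        "# TYPE %s %s" % (name, metric_type),
    ]
    for labels, value in samples:
        if labels:
            # Sorting label names keeps the output order reproducible.
            pairs = ",".join(
                '%s="%s"' % (key, escape_label_value(val))
                for key, val in sorted(labels.items())
            )
            lines.append("%s{%s} %s" % (name, pairs, value))
        else:
            lines.append("%s %s" % (name, value))
    return "\n".join(lines) + "\n"


output = render_text(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 3.0)],
)
```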
Reproducible order of the exposed metrics is ENCOURAGED (especially for human readable formats) if it can be implemented without a significant resource cost.
Reproducible order of the exposed metrics is ENCOURAGED (especially for human
readable formats) if it can be implemented without a significant resource cost.
## Standard and runtime collectors
Client libraries SHOULD offer what they can of the Standard exports, documented at [https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit](https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit)
Client libraries SHOULD offer what they can of the Standard exports, documented
at
[https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit](https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit)
In addition, client libraries are ENCOURAGED to also offer whatever makes sense in terms of metrics for their language’s runtime (e.g. Garbage collection stats).
In addition, client libraries are ENCOURAGED to also offer whatever makes sense
in terms of metrics for their language’s runtime (e.g. Garbage collection
stats).
These SHOULD be implemented as custom Collectors, and registered by default on the default CollectorRegistry. There SHOULD be a way to disable these, as there are some very niche use cases where they get in the way.
These SHOULD be implemented as custom Collectors, and registered by default on
the default CollectorRegistry. There SHOULD be a way to disable these, as there
are some very niche use cases where they get in the way.
## Unit tests
Client libraries SHOULD have unit tests covering the core instrumentation library and exposition.
Client libraries SHOULD have unit tests covering the core instrumentation
library and exposition.
Client libraries are ENCOURAGED to offer ways that make it easy for users to unit-test their use of the instrumentation code. For example, the `CollectorRegistry.get_sample_value` in Python.
Client libraries are ENCOURAGED to offer ways that make it easy for users to
unit-test their use of the instrumentation code. For example, the
`CollectorRegistry.get_sample_value` in Python.
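A test helper in the spirit of `get_sample_value` could be sketched like this. The registry and `RequestCounter` below are toy stand-ins written for the example (they are not the Python client's classes), and the `(name, labels, value)` sample tuple is an assumption.

```python
from typing import Dict, Iterator, List, Optional, Tuple

SampleTuple = Tuple[str, Dict[str, str], float]


class CollectorRegistry:
    def __init__(self) -> None:
        self._collectors: List["RequestCounter"] = []

    def register(self, collector: "RequestCounter") -> None:
        self._collectors.append(collector)

    def collect(self) -> Iterator[SampleTuple]:
        for collector in self._collectors:
            yield from collector.collect()

    def get_sample_value(
        self, name: str, labels: Optional[Dict[str, str]] = None
    ) -> Optional[float]:
        # Return the value of the matching sample, or None if it is absent.
        wanted = labels or {}
        for sample_name, sample_labels, value in self.collect():
            if sample_name == name and sample_labels == wanted:
                return value
        return None


class RequestCounter:
    def __init__(self, registry: CollectorRegistry) -> None:
        self._value = 0.0
        registry.register(self)

    def inc(self) -> None:
        self._value += 1

    def collect(self) -> Iterator[SampleTuple]:
        yield ("requests_total", {}, self._value)


def handle_request(counter: RequestCounter) -> None:
    counter.inc()  # the instrumented code under test


registry = CollectorRegistry()  # a fresh registry per test, not the default one
counter = RequestCounter(registry)
before = registry.get_sample_value("requests_total")
handle_request(counter)
after = registry.get_sample_value("requests_total")
```

Comparing the value before and after the call keeps the test independent of whatever else has touched the metric.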
## Packaging and Dependencies
Ideally, a client library can be included in any application to add some instrumentation, without having to worry about it breaking the application.
Ideally, a client library can be included in any application to add some
instrumentation, without having to worry about it breaking the application.
Accordingly, caution is advised when adding dependencies to the client library. For example, if a user adds a library that uses a Prometheus client that requires version 1.4 of protobuf but the application uses 1.2 elsewhere, what will happen?
Accordingly, caution is advised when adding dependencies to the client library.
For example, if a user adds a library that uses a Prometheus client that
requires version 1.4 of protobuf but the application uses 1.2 elsewhere, what
will happen?
It is suggested that, where this may arise, the core instrumentation be separated from the bridges/exposition of metrics in a given format. For example, the Java simpleclient `simpleclient` module has no dependencies, and the `simpleclient_servlet` has the HTTP bits.
It is suggested that, where this may arise, the core instrumentation be
separated from the bridges/exposition of metrics in a given format. For
example, the Java simpleclient `simpleclient` module has no dependencies, and
the `simpleclient_servlet` has the HTTP bits.
## Performance Considerations
As client libraries must be thread-safe, some form of concurrency control is required and consideration must be given to performance on multi-core machines and applications.
As client libraries must be thread-safe, some form of concurrency control is
required and consideration must be given to performance on multi-core machines
and applications.
In our experience, mutexes are the least performant option.
Processor atomic instructions tend to be in the middle, and generally acceptable.
Processor atomic instructions tend to be in the middle, and generally
acceptable.
Approaches that avoid different CPUs mutating the same bit of RAM work best, such as the DoubleAdder in Java’s simpleclient. There is a memory cost though.
Approaches that avoid different CPUs mutating the same bit of RAM work best,
such as the DoubleAdder in Java’s simpleclient. There is a memory cost though.
As noted above, the result of `labels()` should be cacheable. The concurrent maps that tend to back metrics with labels tend to be relatively slow. Special-casing metrics without labels to avoid `labels()`-like lookups can help a lot.
As noted above, the result of `labels()` should be cacheable. The concurrent
maps that tend to back metrics with labels tend to be relatively slow.
Special-casing metrics without labels to avoid `labels()`-like lookups can help
a lot.
Metrics SHOULD avoid blocking when they are being incremented/decremented/set etc. as it’s undesirable for the whole application to be held up while a scrape is ongoing.
Metrics SHOULD avoid blocking when they are being incremented/decremented/set
etc. as it’s undesirable for the whole application to be held up while a scrape
is ongoing.
Having benchmarks of the main instrumentation operations, including labels, is ENCOURAGED.
Having benchmarks of the main instrumentation operations, including labels, is
ENCOURAGED.
Resource consumption, particularly RAM, should be kept in mind when performing exposition. Consider reducing the memory footprint by streaming results, and potentially having a limit on the number of concurrent scrapes.
Resource consumption, particularly RAM, should be kept in mind when performing
exposition. Consider reducing the memory footprint by streaming results, and
potentially having a limit on the number of concurrent scrapes.
@@ -5,33 +5,65 @@ sort_rank: 5
# Exporter Guidelines
When directly instrumenting your own code, the general rules of how to instrument code with a Prometheus client library can be followed quite directly. When taking metrics from another monitoring or instrumentation system, things tend not to be so black and white.
When directly instrumenting your own code, the general rules of how to
instrument code with a Prometheus client library can be followed quite
directly. When taking metrics from another monitoring or instrumentation
system, things tend not to be so black and white.
This document contains things you should consider when writing an exporter or custom collector. The theory covered will also be of interest to those doing direct instrumentation.
This document contains things you should consider when writing an exporter or
custom collector. The theory covered will also be of interest to those doing
direct instrumentation.
If you are writing an exporter and are unclear on anything here, contact us on IRC (#prometheus on Freenode) or the [mailing list](/community).
If you are writing an exporter and are unclear on anything here, contact us on
IRC (#prometheus on Freenode) or the [mailing list](/community).
## Maintainability and Purity
The main decision you need to make when writing an exporter is how much work you’re willing to put in to get perfect metrics out of it.
The main decision you need to make when writing an exporter is how much work
you’re willing to put in to get perfect metrics out of it.
If the system in question has only a handful of metrics that rarely change, then getting everything perfect is an easy choice (e.g. the [haproxy exporter](https://github.com/prometheus/haproxy_exporter)).
If the system in question has only a handful of metrics that rarely change,
then getting everything perfect is an easy choice (e.g. the [haproxy
exporter](https://github.com/prometheus/haproxy_exporter)).
If on the other hand the system has hundreds of metrics that change continuously with new versions, if you try to get things perfect then you’ve signed yourself up for a lot of ongoing work. The [mysql exporter](https://github.com/prometheus/mysqld_exporter) is on this end of the spectrum.
If on the other hand the system has hundreds of metrics that change
continuously with new versions, if you try to get things perfect then you’ve
signed yourself up for a lot of ongoing work. The [mysql
exporter](https://github.com/prometheus/mysqld_exporter) is on this end of the
spectrum.
The [node exporter](https://github.com/prometheus/node_exporter) is a mix, varying by module. For mdadm we have to hand-parse a file and come up with our own metrics, so we may as well get the metrics right while we’re at it. For meminfo on the other hand, the results vary across kernel versions so we end up doing just enough of a transform to create valid metrics.
The [node exporter](https://github.com/prometheus/node_exporter) is a mix,
varying by module. For mdadm we have to hand-parse a file and come up with our
own metrics, so we may as well get the metrics right while we’re at it. For
meminfo on the other hand, the results vary across kernel versions so we end up
doing just enough of a transform to create valid metrics.
## Configuration
When working with applications, you should aim for an exporter that requires no
custom configuration by the user beyond telling it where the application is.
You may also need to offer the ability to filter out certain metrics if they
may be too granular and expensive on large setups (e.g. the haproxy exporter
allows filtering of per-server stats). Similarly there may be expensive metrics
that are disabled by default.
When working with monitoring systems, frameworks and protocols things are not
so simple.
In the best case the system in question has a similar enough data model to
Prometheus that you can automatically determine how to transform metrics. This
is the case for Cloudwatch, SNMP and Collectd. At most we need the ability to
let the user select which metrics they want to pull out.
In the more common case metrics from the system are completely non-standard,
depending on how the user is using it and what the underlying application is.
In that case the user has to tell us how to transform the metrics. The JMX
exporter is the worst offender here, with the graphite and statsd exporters
also requiring configuration to extract labels.
Providing something that produces some output out of the box and a selection of
example configurations is advised. When writing configurations for such
exporters, this document should be kept in mind.
YAML is the standard Prometheus configuration format.
......@@ -41,49 +73,111 @@ YAML is the standard Prometheus configuration format.
Follow the [best practices on metric naming](/docs/practices/naming).
Generally metric names should allow someone who’s familiar with Prometheus but
not a particular system to make a good guess as to what a metric means. A
metric named `http_requests_total` is not extremely useful - are these being
measured as they come in, in some filter or when they get to the user’s code?
And `requests_total` is even worse, what type of requests?
To put it another way with direct instrumentation, a given metric should exist
within exactly one file. Accordingly within exporters and collectors, a metric
should apply to exactly one subsystem and be named accordingly.
Metric names should never be procedurally generated, except when writing a
custom collector or exporter.
Metric names for applications should generally be prefixed by the exporter
name, e.g. `haproxy_up`.
Metrics must use base units (e.g. seconds, bytes) and leave converting them to
something more readable to the graphing software. No matter what units you end
up using, the units in the metric name must match the units in use. Similarly
expose ratios, not percentages (though a counter for each of the two components
of the ratio is better).
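As a rough sketch of the conversion an exporter might apply at scrape time, assume a hypothetical source that reports a latency in milliseconds and cache usage as a percentage. The input keys and the `myapp_` metric names below are invented for illustration only:

```python
def to_base_units(raw):
    """Convert a hypothetical raw reading into base units.

    `raw` is assumed to hold `latency_ms` (milliseconds) and
    `cache_used_percent` (0-100); both keys are made up for this sketch.
    """
    return {
        # Milliseconds become seconds, and the metric name says so.
        "myapp_request_latency_seconds": raw["latency_ms"] / 1000.0,
        # Percentages become ratios in the range 0-1.
        "myapp_cache_used_ratio": raw["cache_used_percent"] / 100.0,
    }
```

The unit in each name matches the unit of the value, so the graphing software can do any conversion for display.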
Metric names should not include the labels that they’re exported with (e.g.
`by_type`) as that won’t make sense if the label is aggregated away.
The one exception is when you’re exporting the same data with different labels
via multiple metrics, in which case that’s usually the sanest way to
distinguish them. For direct instrumentation this should only come up when
exporting a single metric with all the labels would have too high a
cardinality.
Prometheus metrics and label names are written in `snake_case`. Converting
`camelCase` to `snake_case` is desirable, though doing so automatically
doesn’t always produce nice results for things like `myTCPExample` or `isNaN`
so sometimes it’s best to leave them as-is.
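To illustrate why automatic conversion can disappoint, here is a deliberately naive converter. It is only a sketch of one common regex approach, not something any exporter is known to use, and `naive_snake_case` is a made-up name:

```python
import re


def naive_snake_case(name):
    """Insert an underscore before every interior capital, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(naive_snake_case("requestCount"))   # request_count
print(naive_snake_case("myTCPExample"))   # my_t_c_p_example
print(naive_snake_case("isNaN"))          # is_na_n
```

The first result is fine, but the acronym and `NaN` cases show the kind of output that is arguably worse than the original name.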
Exposed metrics should not contain colons, as these are reserved for users to
use when aggregating.
Only `[a-zA-Z0-9:_]` are valid in metric names, any other characters should be
sanitized to an underscore.
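A minimal sketch of that sanitization, using the character class quoted above. The function name is hypothetical:

```python
import re

# Anything outside the valid metric-name character class.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9:_]")


def sanitize_metric_name(name):
    """Replace every invalid character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


print(sanitize_metric_name("jvm.memory-used"))    # jvm_memory_used
print(sanitize_metric_name("http requests/sec"))  # http_requests_sec
```

This sketch only handles the character class. A metric name also may not begin with a digit, so a real exporter would need an extra check for that case.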
The `_sum`, `_count`, `_bucket` and `_total` suffixes are used by Summaries,
Histograms and Counters. Unless you’re producing one of those, avoid these
suffixes.
`_total` is a convention for counters, you should use it if you’re using the
COUNTER type.
The `process_` and `scrape_` prefixes are reserved. It’s okay to add your own
prefix on to these if they follow the [matching
semantics](https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit).
For example, Prometheus has `scrape_duration_seconds` for how long a scrape
took, so it’s good practice to have e.g. `jmx_scrape_duration_seconds` saying
how long the JMX collector took to do its thing. For process stats where you have access to
the pid, both Go and Python offer collectors that’ll handle this for you (see
the [haproxy exporter](https://github.com/prometheus/haproxy_exporter) for an
example).
When you have a successful request count and a failed request count, the best
way to expose this is as one metric for total requests and another metric for
failed requests. This makes it easy to calculate the failure ratio. Do not use
one metric with a failed/success label. Similarly with hit/miss for caches,
it’s better to have one metric for total and another for hits.
Consider the likelihood that someone using monitoring will do a code or web
search for the metric name. If the names are very well established and unlikely
to be used outside of the realm of people used to those names (e.g. SNMP and
network engineers) then leaving them as-is may be a good idea. This logic
doesn’t apply for e.g. MySQL as non-DBAs can be expected to be poking around
the metrics. A `HELP` string with the original name can provide most of the
same benefits as using the original names.
### Labels
Read the [general
advice](/docs/practices/instrumentation/#things-to-watch-out-for) on labels.
Avoid `type` as a label name, it’s too generic and meaningless. You should also
try where possible to avoid names that are likely to clash with target labels,
such as `region`, `zone`, `cluster`, `availability_zone`, `az`, `datacenter`,
`dc`, `owner`, `customer`, `stage`, `environment` and `env` - though if that’s
what the application calls something it’s best not to cause confusion by
renaming it.
Avoid the temptation to put things into one metric just because they share a
prefix. Unless you’re sure something makes sense as one metric, multiple
metrics is safer.
The label `le` has special meaning for Histograms, and `quantile` for
Summaries. Avoid these labels generally.
Read/write and send/receive are best as separate metrics, rather than as a
label. This is usually because you care about only one of them at a time, and
it’s easier to use them that way.
The rule of thumb is that one metric should make sense when summed or averaged.
There is one other case that comes up with exporters, and that’s where the data
is fundamentally tabular and doing otherwise would require users to do regexes
on metric names to be usable. Consider the voltage sensors on your
motherboard: while doing math across them is meaningless, it makes sense to
have them in one metric rather than having one metric per sensor. All values
within a metric should (almost) always have the same unit (consider if fan
speeds were mixed in with the voltages, and you had no way to automatically
separate them).
Don’t do this:
......@@ -94,146 +188,281 @@ my_metric{label=b} 6
</pre>
or this:
<pre>
my_metric{label=a} 1
my_metric{label=b} 6
<b>my_metric{} 7</b>
</pre>
The former breaks people who do a `sum()` over your metric, and the latter
breaks sum and also is quite difficult to work with. Some client libraries
(e.g. Go) will actively try to stop you doing the latter in a custom collector,
and all client libraries should stop you from doing the former with direct
instrumentation. Never do either of these, rely on Prometheus aggregation
instead.
If your monitoring exposes a total like this, drop the total. If you have to
keep it around for some reason (e.g. the total includes things not counted
individually), use different metric names.
### Target labels, not static scraped labels
If you ever find yourself wanting to apply the same label to all of your
metrics, stop.
There are generally two cases where this comes up.
The first is some label that it’d be useful to have on the metrics, such as
the version number of the software. Use the approach described at
[http://www.robustperception.io/how-to-have-labels-for-machine-roles/](http://www.robustperception.io/how-to-have-labels-for-machine-roles/)
instead.
The other case is what are really target labels. These are things like region,
cluster names, and so on, that come from your infrastructure setup rather than
the application itself. It’s not for an application to say where it fits in
your label taxonomy, that’s for the person running the Prometheus server to
configure and different people monitoring the same application may give it
different names.
Accordingly these labels belong up in the scrape configs of Prometheus via
whatever service discovery you’re using. It’s okay to apply the concept of
machine roles here as well, as it’s likely useful information for at least some
of the people scraping it.
### Types
You should try to match up the types of your metrics to Prometheus types. This
usually means counters and gauges. The `_count` and `_sum` of summaries are
also relatively common, and on occasion you’ll see quantiles. Histograms are
rare, if you come across one remember that the exposition format exposes
cumulative values.
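If the source system reports per-bucket (non-cumulative) counts, they need to be summed up before exposing them. The following is a sketch under that assumption. The function name and input shape are hypothetical, and the final count is assumed to cover everything above the last bound:

```python
from itertools import accumulate


def cumulative_buckets(upper_bounds, counts):
    """Turn per-bucket counts into cumulative (le, count) pairs.

    `upper_bounds` must be sorted ascending. `counts` is assumed to have one
    extra entry at the end for observations above the last bound, which ends
    up in the `+Inf` bucket.
    """
    running_totals = list(accumulate(counts))
    le_values = [str(bound) for bound in upper_bounds] + ["+Inf"]
    return list(zip(le_values, running_totals))


print(cumulative_buckets([0.1, 0.5, 1.0], [3, 4, 2, 1]))
# [('0.1', 3), ('0.5', 7), ('1.0', 9), ('+Inf', 10)]
```

The `+Inf` bucket then equals the total number of observations.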
Often it won’t be obvious what the type of a metric is (especially if you’re
automatically processing a set of metrics), use `UNTYPED` in that case. In
general `UNTYPED` is a safe default.
Counters can’t go down, so if you’ve a counter type coming from another
instrumentation system that has a way to decrement it (e.g. Dropwizard metrics)
that’s not a counter - it’s a gauge. `UNTYPED` is probably the best type to use
there, as `GAUGE` would be misleading if it were being used as a counter.
### Help Strings
When you’re transforming metrics it’s useful for users to be able to track back
to what the original was, and what rules were in play that caused that
transform. Putting in the name of the collector/exporter, the id of any rule
that was applied and the name/details of the original metric into the help
string will greatly aid users.
Prometheus doesn’t like one metric having different help strings. If you’re
making one metric from many others, choose one of them to put in the help
string.
For examples of this, the SNMP exporter uses the OID and the JMX exporter puts
in a sample mBean name. The [haproxy
exporter](https://github.com/prometheus/haproxy_exporter) has hand-written
strings. The [node exporter](https://github.com/prometheus/node_exporter) has a
wide variety of examples.
### Drop less useful statistics
Some instrumentation systems expose 1m/5m/15m rates, average rates since
application start (called `mean` in dropwizard metrics for example), minimums,
maximums and standard deviations.
These should all be dropped, as they’re not very useful and add clutter.
Prometheus can calculate rates itself, and usually more accurately (these are
usually exponentially decaying averages). You don’t know what time the min/max
were calculated over, and the stddev is statistically useless (expose sum of
squares, `_sum` and `_count` if you ever need to calculate it).
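For reference, the standard deviation can be recovered from those three values. This sketch computes the population standard deviation. The function name is hypothetical, and in practice the inputs would usually be increases over some time window rather than lifetime totals:

```python
import math


def stddev_from_sums(count, total, sum_of_squares):
    """Population standard deviation from count, sum and sum of squares.

    Uses variance = E[x^2] - (E[x])^2.
    """
    if count == 0:
        return float("nan")
    mean = total / count
    variance = sum_of_squares / count - mean * mean
    # Guard against tiny negative values caused by floating point rounding.
    return math.sqrt(max(variance, 0.0))


# Observations 2, 4, 4, 4, 5, 5, 7, 9: count=8, sum=40, sum of squares=232.
print(stddev_from_sums(8, 40, 232))  # 2.0
```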
Quantiles have related issues, you may choose to drop them or put them in a
Summary.
### Dotted strings
Many monitoring systems don’t have labels, instead doing things like
`my.class.path.mymetric.labelvalue1.labelvalue2.labelvalue3`.
The graphite and statsd exporters share a way of doing this with a small
configuration language. Other exporters should implement the same. It’s
currently implemented only in Go, and would benefit from being factored out
into a separate library.
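To make the idea concrete, here is a small sketch of mapping a dotted string to a metric name plus labels. It is not the configuration language of the graphite or statsd exporters. The `*` wildcard pattern, the function name and the label names are all assumptions made for illustration:

```python
def map_dotted_name(dotted, pattern, metric_name, label_names):
    """Match `dotted` against `pattern` and extract wildcard parts as labels.

    Returns (metric_name, labels) on a match, or None otherwise. Each `*` in
    the pattern captures exactly one dot-separated component.
    """
    parts = dotted.split(".")
    pattern_parts = pattern.split(".")
    if len(parts) != len(pattern_parts):
        return None
    captured = []
    for part, expected in zip(parts, pattern_parts):
        if expected == "*":
            captured.append(part)
        elif part != expected:
            return None
    if len(captured) != len(label_names):
        return None
    return metric_name, dict(zip(label_names, captured))


print(map_dotted_name(
    "my.class.path.mymetric.labelvalue1.labelvalue2.labelvalue3",
    "my.class.path.mymetric.*.*.*",
    "my_class_path_mymetric",
    ["first", "second", "third"],
))
```

A name that does not match the pattern returns `None`, which an exporter could use to fall through to the next rule or to drop the sample.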
## Collectors
When implementing the collector for your exporter, you should never use the
usual direct instrumentation approach and then update the metrics on each
scrape.
Rather create new metrics each time. In Go this is done with
[MustNewConstMetric](https://godoc.org/github.com/prometheus/client_golang/prometheus#MustNewConstMetric)
in your `Update()` method. For Python see
[https://github.com/prometheus/client_python#custom-collectors](https://github.com/prometheus/client_python#custom-collectors)
and for Java generate a `List<MetricFamilySamples>` in your collect method -
see
[StandardExports.java](https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/StandardExports.java)
for an example.
The reason for this is firstly that two scrapes could happen at the same time,
and direct instrumentation uses what are effectively (file-level) global
variables so you’ll get race conditions. The second reason is that if a label
value disappears, it’ll still be exported.
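The stale label problem can be shown without any client library. The sketch below uses plain dictionaries keyed by label value. It is a library-free illustration of the two shapes, not the API of any Prometheus client, and all names in it are hypothetical:

```python
class StatefulCollector:
    """Anti-pattern: keeps samples between scrapes, like long-lived metrics."""

    def __init__(self):
        self._samples = {}

    def collect(self, current):
        # Only updates what is present now; nothing is ever removed.
        self._samples.update(current)
        return dict(self._samples)


def collect_fresh(current):
    """Preferred shape: build the sample set from scratch on every scrape."""
    return dict(current)


stateful = StatefulCollector()
stateful.collect({"server_a": 1, "server_b": 2})

# On the next scrape server_b has gone away in the monitored system.
print(stateful.collect({"server_a": 3}))  # {'server_a': 3, 'server_b': 2}
print(collect_fresh({"server_a": 3}))     # {'server_a': 3}
```

The stateful version keeps exporting `server_b` with its last known value, which is exactly the behaviour that creating new metrics on each scrape avoids.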
Instrumenting your exporter itself via direct instrumentation is fine, e.g.
total bytes transferred or calls performed by the exporter across all scrapes.
For exporters such as the blackbox exporter and snmp exporter which aren’t tied
to a single target, these should only be exposed on a vanilla `/metrics` call -
not on a scrape of a particular target.
### Metrics about the scrape itself
Sometimes you’d like to export metrics that are about the scrape, like how long
it took or how many records you processed.
These should be exposed as gauges (as they’re about an event, the scrape) and
the metric name prefixed by the exporter name e.g.
`jmx_scrape_duration_seconds`. Usually the `_exporter` is excluded (and if the
exporter also makes sense to use as just a collector, definitely exclude it).
### Machine and process metrics
Many systems (e.g. elasticsearch) expose machine metrics such as CPU, memory and
filesystem information. As the node exporter provides these in the Prometheus
ecosystem, such metrics should be dropped.
In the Java world, many instrumentation frameworks expose process-level and
JVM-level stats such as CPU and GC. The Java client and JMX exporter already
include these in the preferred form via
[DefaultExports.java](https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/DefaultExports.java),
so these should be dropped.
Similarly with other languages.
## Deployment
Each exporter should monitor exactly one application instance, preferably
sitting right beside it on the same machine. That means for every haproxy you
run, you run a `haproxy_exporter` process. For every machine with a mesos
slave, you run the mesos exporter on it (and another one for the master if a
machine has both).
The theory behind this is that for direct instrumentation this is what you’d be
doing, and we’re trying to get as close to that as we can in other layouts.
This means that all service discovery is done in Prometheus, not in exporters.
This also has the benefit that Prometheus has the target information it needs
to allow users to probe your service with the blackbox exporter.
There are two exceptions:
The first is where running beside the application you’re monitoring is
completely nonsensical. SNMP, blackbox and IPMI are the main examples of this.
For IPMI and SNMP, the devices are effectively black boxes that it’s impossible
to run code on (though if you could run a node exporter on them instead that’d
be better). For blackbox, if you’re monitoring something like a DNS name
there’s nothing to run on. In this case Prometheus should still do service
discovery, and pass on the target to be scraped. See the blackbox and SNMP
exporters for examples.
Note that it is only currently possible to write this type of exporter with the
Python and Java client libraries (the blackbox exporter which is written in Go
is doing the text format by hand, don’t do this).
The other is where you’re pulling some stats out of a random instance of a
system and don’t care which one you’re talking to. Consider a set of MySQL
slaves you wanted to run some business queries against the data to then export.
Having an exporter that uses your usual load balancing approach to talk to one
slave is the sanest approach.
This doesn’t apply when you’re monitoring a system with master-election. In
that case you should monitor each instance individually and deal with the
masterness in Prometheus. This is because there isn’t always exactly one
master, and changing what a target is underneath Prometheus’s feet will cause
oddities.
### Scheduling
Metrics should only be pulled from the application when Prometheus scrapes
them; exporters should not perform scrapes based on their own timers. That is,
all scrapes should be synchronous.
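
A minimal sketch of a synchronous scrape, assuming a hypothetical
`fetch_stats` callable that talks to the application. The class name is
illustrative and is not part of any client library; the point is only that
there is no background timer and the application is queried once per scrape.

```python
class OnDemandCollector:
    """Queries the application only when collect() is called.

    `fetch_stats` is a hypothetical callable returning a mapping of
    metric name to value.
    """

    def __init__(self, fetch_stats):
        self._fetch_stats = fetch_stats

    def collect(self):
        # Called from the /metrics handler, i.e. once per Prometheus scrape.
        # No timestamps are attached; Prometheus assigns them at scrape time.
        return dict(self._fetch_stats())
```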
Accordingly you should not set timestamps on the metrics you expose; let
Prometheus take care of that. If you think you need timestamps, then you
probably need the pushgateway (without timestamps) instead.
If a metric is particularly expensive to retrieve (i.e. takes more than a
minute), it is acceptable to cache it. This should be noted in the `HELP`
string.
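
One way to cache an expensive metric is a small time-to-live wrapper, sketched
below. The class name, the `compute` callable, the 60-second TTL and the
metric name in the `HELP` text are all assumptions made for illustration.

```python
import time


class CachedValue:
    """Recomputes an expensive value at most once per `ttl_seconds`."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value


# The HELP string should tell users that the value may be stale.
TABLE_ROWS_HELP = "Number of rows in the table (cached for up to 60s)."
```

The cached value is still served synchronously on each scrape; only the
expensive computation is skipped while the cache is fresh.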
The default scrape timeout for Prometheus is 10 seconds. If your exporter can
be expected to exceed this, you should explicitly call this out in your user
docs.
### Pushes
Some applications and monitoring systems only push metrics, e.g. statsd,
graphite and collectd.
There are two considerations here.
Firstly, when do you expire metrics? Collectd and things talking to Graphite
both export regularly, and when they stop we want to stop exposing the metrics.
Collectd includes an expiry time so we use that; Graphite doesn’t, so it’s a
flag on the exporter.
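
The expiry behaviour can be sketched as a store that drops samples once their
deadline passes. This is an illustration only: the class and method names are
hypothetical, `default_ttl_seconds` stands in for an exporter flag, and the
per-push `ttl_seconds` stands in for an expiry time supplied by the sender.

```python
import time


class ExpiringSamples:
    """Holds pushed samples and stops exposing them after they expire."""

    def __init__(self, default_ttl_seconds, clock=time.monotonic):
        self._default_ttl = default_ttl_seconds
        self._clock = clock
        self._samples = {}  # name -> (value, deadline)

    def push(self, name, value, ttl_seconds=None):
        # Use the sender's expiry if it provides one, else the exporter flag.
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        self._samples[name] = (value, self._clock() + ttl)

    def collect(self):
        # Drop anything whose deadline has passed, then expose the rest.
        now = self._clock()
        self._samples = {
            name: (value, deadline)
            for name, (value, deadline) in self._samples.items()
            if deadline > now
        }
        return {name: value for name, (value, _) in self._samples.items()}
```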
Statsd is a bit different, as it’s dealing with events rather than metrics. The
best model is to run one exporter beside each application and restart them when
the application restarts so that state is cleared.
The second is that these sorts of systems tend to allow your users to send
either deltas or raw counters. You should rely on the raw counters as far as
possible, as that’s the general Prometheus model.
For service-level metrics (e.g. service-level batch jobs) you should have your
exporter push into the push gateway and exit after the event rather than
handling the state yourself. For instance-level batch metrics, there is no
clear pattern yet - options are either to abuse the node exporter’s textfile
collector, rely on in-memory state (probably best if you don’t need to persist
over a reboot) or implement similar functionality to the textfile collector.
### Failed scrapes
There are currently two patterns for failed scrapes where the application
you’re talking to doesn’t respond or has other problems.
The first is to return a 5xx error.
The second is to have a `myexporter_up` (e.g. `haproxy_up`) variable that’s
0/1 depending on whether the scrape worked.
The latter is better where there are still some useful metrics you can get even
with a failed scrape, such as the haproxy exporter providing process stats. The
former is a tad easier for users to deal with, as `up` works in the usual way
(though you can’t distinguish between the exporter being down and the
application being down).
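
The `myexporter_up` pattern might look roughly like the sketch below. The
function signature, `fetch_app_stats` and `process_stats` are hypothetical;
samples are returned as a plain mapping here, and a real exporter would hand
them to a client library rather than format them itself.

```python
def collect(fetch_app_stats, process_stats):
    """Return samples for one scrape, including a 0/1 `myexporter_up`.

    `fetch_app_stats` is a hypothetical callable that raises if the
    application cannot be reached; `process_stats` holds metrics about the
    exporter itself, which remain useful even when the scrape fails.
    """
    samples = dict(process_stats)
    try:
        app_stats = fetch_app_stats()
    except Exception:
        # Application scrape failed: still return 200 with what we have.
        samples["myexporter_up"] = 0
    else:
        samples.update(app_stats)
        samples["myexporter_up"] = 1
    return samples
```

With the first pattern, the handler would instead respond with a 5xx status
and no metrics when `fetch_app_stats` fails.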
### Landing page
It’s nicer for users if visiting `http://yourexporter/` shows a simple HTML
page with the name of the exporter and a link to the `/metrics` page.
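
A minimal sketch of such a landing page using only the Python standard
library. The exporter name "My Exporter", the handler class and the port in
the usage comment are placeholders; serving `/metrics` itself is left to the
client library and is not shown.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer


def landing_page(exporter_name):
    """Build a small HTML page naming the exporter and linking to /metrics."""
    name = html.escape(exporter_name)
    return (
        "<html><head><title>{0}</title></head>"
        "<body><h1>{0}</h1>"
        '<p><a href="/metrics">Metrics</a></p>'
        "</body></html>"
    ).format(name)


class LandingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/":
            self.send_error(404)
            return
        body = landing_page("My Exporter").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# Usage (the port is a placeholder, not an allocated one):
# HTTPServer(("", 8000), LandingHandler).serve_forever()
```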
### Port numbers
A user may have many exporters and Prometheus components on the same machine,
so to make that easier each has a unique port number.
[https://github.com/prometheus/prometheus/wiki/Default-port-allocations](https://github.com/prometheus/prometheus/wiki/Default-port-allocations)
is where we track them; this is publicly editable.
Feel free to grab the next free port number when developing your exporter,
preferably before publicly announcing it. If you’re not ready to release yet,
putting your username and WIP is fine.
This is a registry to make our users’ lives a little easier, not a commitment
to develop particular exporters.
## Announcing
Once you’re ready to announce your exporter to the world, send an email to the
mailing list and send a PR to add it to [the list of available
exporters](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md).