Commit 4f3658e9 authored by beorn7

Various documentation improvements.

- FAQ about timezone.
- FAQs about implementation.
- Fix swapped issue URLs.
- Add section about possible hash collisions.
- Mention `GOMAXPROCS`, especially important for Go beginners.
- Mention possible storage optimization for histograms.
parent a7166d49
@@ -89,6 +89,16 @@ Yes, with the experimental [Alertmanager](https://github.com/prometheus/alertman
Yes, with [PromDash](/docs/visualization/promdash/) and [Console
templates](/docs/visualization/consoles/).
### Can I change the timezone? Why is everything in UTC?
To avoid any kind of timezone confusion, especially when the so-called
daylight saving time is involved, we decided to exclusively use Unix
time internally and UTC for display purposes in all components of
Prometheus. A carefully done timezone selection could be introduced
into the UI. Contributions are welcome. See
[issue #500](https://github.com/prometheus/prometheus/issues/500)
for the current state of this effort.
## Instrumentation
### Which languages have instrumentation libraries?
@@ -153,3 +163,53 @@ should take less than a minute under normal circumstances. See [crash recovery](
You have run into a bug of ZFS on Linux. See [issue #484](https://github.com/prometheus/prometheus/issues/484)
for details. Upgrading to ZFS on Linux v0.6.4 should fix the issue.
## Implementation
### Why are all sample values 64-bit floats? I want integers.
We restrained ourselves to 64-bit floats to simplify the design. The
[IEEE 754 double-precision binary floating-point
format](http://en.wikipedia.org/wiki/Double-precision_floating-point_format)
supports integer precision for values up to 2<sup>53</sup>. Supporting
native 64-bit integers would (only) help if you need integer precision
above 2<sup>53</sup> but below 2<sup>63</sup>. In principle, support
for different sample value types (including some kind of big integer
supporting even more than 64 bits) could be implemented, but it is not
a priority right now. Note that a counter, even if incremented
one million times per second, will only run into precision issues
after over 285 years.
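For the curious, here is a minimal standalone Go sketch (illustrative only, not part of Prometheus) that reproduces the 285-year figure from the 2<sup>53</sup> integer-precision limit of a float64:

```language-go
package main

import "fmt"

func main() {
	// Largest integer a float64 can represent exactly: 2^53.
	const maxExactInt = 1 << 53

	// Hypothetical counter incremented one million times per second.
	const incrementsPerSecond = 1e6

	seconds := float64(maxExactInt) / incrementsPerSecond
	years := seconds / (60 * 60 * 24 * 365)

	fmt.Printf("integer precision is lost after %.1f years\n", years) // ~285.6
}
```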
### Why does Prometheus use a custom storage backend rather than [some other storage method]? Isn't the "one file per time series" approach killing performance?
Initially, Prometheus ran completely on LevelDB, but to achieve better
performance, we had to change the storage for bulk sample data. We
evaluated many storage backends that were available at the time,
without getting satisfactory results. So we implemented exactly the
parts we needed, while keeping LevelDB for indexes and making heavy
use of file system capabilities. Obviously, we could not evaluate
every single storage backend out there, and storage backends have
evolved in the meantime. However, the performance of the current
implementation is satisfactory for most use cases. Our most important
requirements are an acceptable query speed for common queries and a
sustainable ingestion rate of many thousands of samples per
second. The latter depends on the compressibility of the sample data
and on the number of time series the samples belong to, but to give
you an idea, here are some results from benchmarks:
* On an older 8-core machine with Intel Core i7 CPUs and two spinning
disks (Samsung HD753LJ) in a RAID-1 setup, Prometheus sustained an
ingestion rate of 20k samples per second, belonging to 450k time
series, scraped from 1500 targets.
* On a modern server with SSD, Prometheus sustained an ingestion rate
of more than 100k samples per second, belonging to millions of time
series, scraped from thousands of targets.
In both cases, the bottleneck was identified as insufficiently
parallelized hash calculation, which happens before samples even hit
the storage backend.
Running out of inodes is highly unlikely in a usual setup. There is a
possible downside: if you want to delete Prometheus's storage
directory, you will notice that some file systems are very slow when
deleting files.
@@ -93,6 +93,20 @@ about itself from its own HTTP metrics endpoint.
You can also verify that Prometheus is serving metrics about itself by
navigating to its metrics exposure endpoint: http://localhost:9090/metrics
By default, Prometheus will execute at most one OS thread at a
time. In production scenarios on multi-CPU machines, you will most
likely achieve better performance by setting the `GOMAXPROCS`
environment variable to a value similar to the number of available CPU
cores:
```language-bash
GOMAXPROCS=8 ./prometheus -config.file=prometheus.conf
```
Blindly setting `GOMAXPROCS` to a high value can be
counterproductive. See the relevant [Go
FAQs](http://golang.org/doc/faq#Why_no_multi_CPU).
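If you are new to Go and curious what this variable controls, the following standalone Go sketch (unrelated to Prometheus itself) prints the effective setting and the number of logical CPUs visible to the Go runtime:

```language-go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	// Unless the GOMAXPROCS environment variable is set, Go defaults to
	// running at most one OS thread at a time, as described above.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// NumCPU reports the number of logical CPUs usable by this process.
	fmt.Println("CPU cores: ", runtime.NumCPU())
}
```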
## Using the expression browser
Let's try looking at some data that Prometheus has collected about itself. To
@@ -21,7 +21,7 @@ of global Prometheus servers which collect and store only aggregated data from
those local servers. This allows you to have an aggregate global view and
detailed local views.
GitHub issue: [#480](https://github.com/prometheus/prometheus/issues/480)
GitHub issue: [#9](https://github.com/prometheus/prometheus/issues/9)
**Aggregatable histograms**
@@ -32,7 +32,7 @@ incorrect](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-
to average over the 90th percentile latency of multiple monitored instances.
We plan to implement server-side histograms which will allow for this use case.
GitHub issue: [#9](https://github.com/prometheus/prometheus/issues/9)
GitHub issue: [#480](https://github.com/prometheus/prometheus/issues/480)
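To see why averaging percentiles goes wrong, here is a small standalone Go sketch (illustrative data and a naive nearest-rank percentile, not Prometheus code) comparing the average of two per-instance 90th percentiles with the 90th percentile of the combined samples:

```language-go
package main

import (
	"fmt"
	"sort"
)

// p90 returns the 90th-percentile value of a sample slice
// (nearest-rank method, purely for illustration).
func p90(samples []float64) float64 {
	sort.Float64s(samples)
	return samples[(len(samples)*90+99)/100-1]
}

func main() {
	// Two instances with very different latency distributions (seconds).
	a := []float64{0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2}
	b := []float64{0.1, 0.1, 0.1, 0.1, 0.1, 1.0, 2.0, 3.0, 4.0, 5.0}

	avgOfP90s := (p90(a) + p90(b)) / 2

	merged := append(append([]float64{}, a...), b...)
	trueP90 := p90(merged)

	fmt.Println("average of per-instance 90th percentiles:", avgOfP90s) // 2.05
	fmt.Println("90th percentile of all samples combined: ", trueP90)   // 3.0
}
```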
**More flexible label matching in binary operations**
@@ -86,12 +86,15 @@ GitHub issue: [#398](https://github.com/prometheus/prometheus/issues/398)
**Server-side metric metadata support**
At this time, metric types and other metadata are only used in the client
libraries and in the exposition format, but not persisted or utilized in the
Prometheus server. We plan on making use of this metadata in the future. For
example, we could suggest automatic rates over counters, warn users if they
take the rate of a gauge, or display metric documentation strings. The details
of this are still to be determined.
At this time, metric types and other metadata are only used in the
client libraries and in the exposition format, but not persisted or
utilized in the Prometheus server. We plan on making use of this
metadata in the future. For example, we could suggest automatic rates
over counters, warn users if they take the rate of a gauge, or display
metric documentation strings. Some metric types, like the upcoming
[server-side histograms](https://github.com/prometheus/prometheus/issues/480),
could also be stored and processed in a more efficient way. The
details of this are still to be determined.
**More client libraries and exporters**
@@ -86,3 +86,16 @@ storage directory:
1. Stop Prometheus.
1. `rm -r <storage path>/*`
1. Start Prometheus.
## Hash collisions
Prometheus currently uses 64-bit fingerprints to identify time
series. On a large server with several million time series, the chance
of a hash collision is about one in a million (assuming the FNV-1a
hash function works well). While that might appear safe enough, the
problem is that a hash collision will effectively lead to undetected
data corruption. Also, with more powerful hardware and future
improvements of the Prometheus code, much higher numbers of time
series might be handled by a single server, increasing the chance of a
collision. See [Prometheus issue
#509](https://github.com/prometheus/prometheus/issues/509) for efforts
to deal with the problem.
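To show roughly where that estimate comes from, here is a small Go sketch applying the standard birthday-problem approximation to 64-bit fingerprints, with an assumed, purely illustrative count of five million series:

```language-go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Birthday-problem approximation: the probability that at least two of
	// n items share one of 2^64 possible fingerprints is roughly
	// n*(n-1) / (2 * 2^64), assuming the hash behaves like a uniform
	// random function.
	n := 5e6 // illustrative: several million time series
	space := math.Pow(2, 64)

	p := n * (n - 1) / (2 * space)
	fmt.Printf("collision probability: %.1e (about 1 in %.0f)\n", p, 1/p)
}
```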