---
title: Custom service discovery with etcd
created_at: 2015-08-17
kind: article
author_name: Fabian Reinartz
---

In a [previous post](/blog/2015/06/01/advanced-service-discovery/) we
introduced numerous new ways of doing service discovery in Prometheus.
Since then a lot has happened. We improved the internal implementation and
received fantastic contributions from our community, adding support for
service discovery with Kubernetes and Marathon. They will become available
with the release of version 0.16.

We also touched on the topic of [custom service discovery](/blog/2015/06/01/advanced-service-discovery/#custom-service-discovery).

Not every type of service discovery is generic enough to be directly included
in Prometheus. Chances are your organisation has a proprietary
system in place and you just have to make it work with Prometheus.
This does not mean that you cannot enjoy the benefits of automatically
discovering new monitoring targets.

In this post we will implement a small utility program that connects a custom
service discovery approach based on [etcd](https://coreos.com/etcd/), the
highly consistent distributed key-value store, to Prometheus.

<!-- more -->

## Targets in etcd and Prometheus

Our fictional service discovery system stores services and their
instances under a well-defined key schema:

```
/services/<service_name>/<instance_id> = <instance_address>
```

Prometheus should now automatically add and remove targets for all existing
services as they come and go.
We can integrate with Prometheus's file-based service discovery, which
monitors a set of files that describe targets as lists of target groups in
JSON format.

A single target group consists of a list of addresses associated with a set of
labels. Those labels are attached to all time series retrieved from those
targets.
One example target group extracted from our service discovery in etcd could
look like this:

```
{
  "targets": ["10.0.33.1:54423", "10.0.34.12:32535"],
  "labels": {
    "job": "node_exporter"
  }
}
```
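Since a target file holds a list of such groups, it can help to check a generated file before pointing Prometheus at it. The following is a minimal sketch using only the Go standard library; the `targetGroup` type and the `parseTargetGroups` function are names chosen for this illustration and are not part of the tool built below.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors the JSON shape shown above.
// The type name is an assumption made for this sketch.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// parseTargetGroups decodes the content of a target file, which is
// expected to be a JSON list of target groups.
func parseTargetGroups(data []byte) ([]targetGroup, error) {
	var groups []targetGroup
	if err := json.Unmarshal(data, &groups); err != nil {
		return nil, err
	}
	return groups, nil
}

func main() {
	// The example group from above, wrapped in a list.
	data := []byte(`[
  {
    "targets": ["10.0.33.1:54423", "10.0.34.12:32535"],
    "labels": {"job": "node_exporter"}
  }
]`)

	groups, err := parseTargetGroups(data)
	if err != nil {
		fmt.Println("invalid target file:", err)
		return
	}
	for _, g := range groups {
		fmt.Println(g.Labels["job"], g.Targets)
	}
}
```

A file that contains a single group without the surrounding list fails to decode here, which is an easy mistake to make when writing such files by hand.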

## The program

What we need is a small program that connects to the etcd cluster and performs
a lookup of all services found in the `/services` path and writes them out into
a file of target groups.

Let's get started with some plumbing. Our tool has two flags: the etcd server
to connect to and the file to which the target groups are written. Internally,
the services are represented as a map from service names to instances.
Instances are a map from the instance identifier in the etcd path to its
address.

```
const servicesPrefix = "/services"

type (
  instances map[string]string
  services  map[string]instances
)

var (
  etcdServer = flag.String("server", "http://127.0.0.1:4001", "etcd server to connect to")
  targetFile = flag.String("target-file", "tgroups.json", "the file that contains the target groups")
)
```

Our `main` function parses the flags and initializes our object holding the
current services. We then connect to the etcd server and do a recursive read
of the `/services` path.
We receive the subtree for the given path as a result and call `srvs.handle`,
which recursively invokes the `srvs.update` method for each node in the
subtree. The `update` method modifies the state of our `srvs` object to be
aligned with the state of our subtree in etcd.
Finally, we call `srvs.persist` which transforms the `srvs` object into a list
of target groups and writes them out to the file specified by the
`-target-file` flag.

```
func main() {
  flag.Parse()

  var (
    client  = etcd.NewClient([]string{*etcdServer})
    srvs    = services{}
  )

  // Retrieve the subtree of the /services path.
  res, err := client.Get(servicesPrefix, false, true)
  if err != nil {
    log.Fatalf("Error on initial retrieval: %s", err)
  }
  srvs.handle(res.Node, srvs.update)
  srvs.persist()
}
```

Let's assume we have this as a working implementation. We could now run this
tool every 30 seconds to have a mostly accurate view of the current targets in
our service discovery.

But can we do better?

The answer is _yes_. etcd provides watches, which let us listen for updates on
any path and its sub-paths. With that, we are informed about changes as they
happen and can apply them right away. We also don't have to work through
the whole `/services` subtree again and again, which can become important for
a large number of services and instances.

We extend our `main` function as follows:

```
func main() {
  // ...

  updates := make(chan *etcd.Response)

  // Start recursively watching for updates.
  go func() {
    _, err := client.Watch(servicesPrefix, 0, true, updates, nil)
    if err != nil {
      log.Errorln(err)
    }
  }()

  // Apply updates sent on the channel.
  for res := range updates {
    log.Infoln(res.Action, res.Node.Key, res.Node.Value)

    handler := srvs.update
    if res.Action == "delete" {
      handler = srvs.delete
    }
    srvs.handle(res.Node, handler)
    srvs.persist()
  }
}
```

We start a goroutine that recursively watches for changes to entries in
`/services`. It blocks forever and sends all changes to the `updates` channel.
We then read the updates from the channel and apply them as before. If an
instance or an entire service disappears, however, we call `srvs.handle` with
the `srvs.delete` method instead.

We finish each update with another call to `srvs.persist` to write out the
changes to the file Prometheus is watching.

### Modification methods

So far so good – conceptually this works. What remains are the `update` and
`delete` handler methods as well as the `persist` method.

`update` and `delete` are invoked by the `handle` method which simply calls
them for each node in a subtree, given that the path is valid:

```
var pathPat = regexp.MustCompile(`/services/([^/]+)(?:/(\d+))?`)

func (srvs services) handle(node *etcd.Node, handler func(*etcd.Node)) {
  if pathPat.MatchString(node.Key) {
    handler(node)
  } else {
    log.Warnf("unhandled key %q", node.Key)
  }

  if node.Dir {
    for _, n := range node.Nodes {
      srvs.handle(n, handler)
    }
  }
}
```
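To see which keys `pathPat` accepts, the pattern can be tried against a few sample keys. The sketch below reuses the pattern from above; the `split` helper and the stricter `anchoredPat` variant are assumptions added for illustration and are not part of the original tool.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathPat is the pattern used by the handle method above.
var pathPat = regexp.MustCompile(`/services/([^/]+)(?:/(\d+))?`)

// anchoredPat is a hypothetical stricter variant: it only accepts keys
// that consist of the prefix, a service name, and an optional numeric
// instance ID, with nothing before or after.
var anchoredPat = regexp.MustCompile(`^/services/([^/]+)(?:/(\d+))?$`)

// split returns the service name and instance ID found in key.
// ok is false if the key does not match the pattern at all.
func split(pat *regexp.Regexp, key string) (srv, id string, ok bool) {
	m := pat.FindStringSubmatch(key)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	keys := []string{
		"/services",
		"/services/node_exporter",
		"/services/node_exporter/1",
		"/services/node_exporter/abc",
	}
	for _, k := range keys {
		srv, id, ok := split(pathPat, k)
		fmt.Printf("pathPat     %q -> srv=%q id=%q ok=%t\n", k, srv, id, ok)
		srv, id, ok = split(anchoredPat, k)
		fmt.Printf("anchoredPat %q -> srv=%q id=%q ok=%t\n", k, srv, id, ok)
	}
}
```

The bare `/services` key matches neither pattern, so the root of the subtree is logged as unhandled before its children are visited. With the unanchored pattern, a key whose last segment is not numeric still matches, but with an empty instance ID, so it is treated like the service directory itself. As long as instance IDs are always numeric this never happens; the anchored variant rejects such keys outright.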

#### `update`

The update method alters the state of our `services` object
based on the node which was updated in etcd.

```
func (srvs services) update(node *etcd.Node) {
  match := pathPat.FindStringSubmatch(node.Key)
  // Creating a new job directory does not require any action.
  if match[2] == "" {
    return
  }
  srv := match[1]
  instanceID := match[2]

  // We received an update for an instance.
  insts, ok := srvs[srv]
  if !ok {
    insts = instances{}
    srvs[srv] = insts
  }
  insts[instanceID] = node.Value
}
```

#### `delete`

The delete method removes instances or entire jobs from our `services`
object depending on which node was deleted from etcd.

```
func (srvs services) delete(node *etcd.Node) {
  match := pathPat.FindStringSubmatch(node.Key)
  srv := match[1]
  instanceID := match[2]

  // Deletion of an entire service.
  if instanceID == "" {
    delete(srvs, srv)
    return
  }

  // Delete a single instance from the service.
  delete(srvs[srv], instanceID)
}
```
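The effect of the two handlers is easiest to follow by replaying a few events against an empty `services` object. The sketch below does that without an etcd client: the `node` struct is a stand-in for the two fields of `*etcd.Node` that the handlers read, and the extra `nil` checks are additions for this example, since it calls the handlers directly rather than through `handle`.

```go
package main

import (
	"fmt"
	"regexp"
)

type (
	instances map[string]string
	services  map[string]instances
)

// node is a stand-in for *etcd.Node, reduced to the fields used here.
type node struct {
	Key   string
	Value string
}

var pathPat = regexp.MustCompile(`/services/([^/]+)(?:/(\d+))?`)

// update adds or overwrites the instance described by n.
func (srvs services) update(n node) {
	match := pathPat.FindStringSubmatch(n.Key)
	// Ignore non-matching keys and service directories.
	if match == nil || match[2] == "" {
		return
	}
	insts, ok := srvs[match[1]]
	if !ok {
		insts = instances{}
		srvs[match[1]] = insts
	}
	insts[match[2]] = n.Value
}

// delete removes a single instance or, for a service key, the whole service.
func (srvs services) delete(n node) {
	match := pathPat.FindStringSubmatch(n.Key)
	if match == nil {
		return
	}
	if match[2] == "" {
		delete(srvs, match[1])
		return
	}
	// Deleting from a missing (nil) instance map is a no-op in Go.
	delete(srvs[match[1]], match[2])
}

func main() {
	srvs := services{}

	srvs.update(node{Key: "/services/node_exporter/1", Value: "10.0.33.1:54423"})
	srvs.update(node{Key: "/services/node_exporter/2", Value: "10.0.34.12:32535"})
	fmt.Println("after two updates:", srvs)

	srvs.delete(node{Key: "/services/node_exporter/1"})
	fmt.Println("after instance delete:", srvs)

	srvs.delete(node{Key: "/services/node_exporter"})
	fmt.Println("after service delete:", srvs)
}
```

One detail worth noting: removing the last instance of a service leaves an empty entry for that service in the map until the service key itself is deleted.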

#### `persist`

The persist method transforms the state of our `services` object into a list of `TargetGroup`s. It then writes this list into the `-target-file` in JSON
format.

```
type TargetGroup struct {
  Targets []string          `json:"targets,omitempty"`
  Labels  map[string]string `json:"labels,omitempty"`
}

func (srvs services) persist() {
  var tgroups []*TargetGroup
  // Build one target group per service.
  for job, instances := range srvs {
    var targets []string
    for _, addr := range instances {
      targets = append(targets, addr)
    }

    tgroups = append(tgroups, &TargetGroup{
      Targets: targets,
      Labels:  map[string]string{"job": job},
    })
  }

  content, err := json.Marshal(tgroups)
  if err != nil {
    log.Errorln(err)
    return
  }

  f, err := os.Create(*targetFile)
  if err != nil {
    log.Errorln(err)
    return
  }
  defer f.Close()

  if _, err := f.Write(content); err != nil {
    log.Errorln(err)
  }
}
```

## Taking it live

All done, so how do we run this?

We simply start our tool with a configured output file:

```
./etcd_sd -target-file /etc/prometheus/tgroups.json
```

Then we configure Prometheus with file-based service discovery
using the same file. The simplest possible configuration looks like this:

```
scrape_configs:
- job_name: 'default' # Will be overwritten by job label of target groups.
  file_sd_configs:
  - names: ['/etc/prometheus/tgroups.json']
```

And that's it. Now our Prometheus stays in sync with services and their
instances entering and leaving our service discovery with etcd.

## Conclusion

If Prometheus does not ship with native support for the service discovery of
your organisation, don't despair. Using a small utility program you can easily
bridge the gap and profit from seamless updates to the monitored targets.
Thus, you can remove changes to the monitoring configuration from your
deployment equation.

A big thanks to our contributors [Jimmy Dyson](https://twitter.com/jimmidyson)
and [Robert Jacob](https://twitter.com/xperimental) for adding native support
for [Kubernetes](http://kubernetes.io/) and [Marathon](https://mesosphere.github.io/marathon/).
Also check out [Keegan C Smith's](https://twitter.com/keegan_csmith) take on [EC2 service discovery](https://github.com/keegancsmith/prometheus-ec2-discovery) based on files.

You can find the [full source of this blog post on GitHub](https://github.com/fabxc/prom_sd_example/tree/master/etcd_simple).