etcd 2 to 3: new APIs and new possibilities


The move from version 2 to version 3 of the distributed etcd database brings massive changes in how the database works. To help you understand the what and why of those changes, read on...

At Compose our engineering teams have been getting deep into etcd version 3.x, the follow-up to etcd 2.x that is currently deployable on Compose. Etcd has become an essential tool behind the scenes of many cloud computing projects and products as it offers a simple, reliable, consistent, key-value database that can be used as the source of truth for huge clusters of cloud-deployed applications and their configuration.

A jump in major numbers always means that a lot of things change in any product, usually in response to the requirements of customers and users of the preceding version. In etcd 3.x, this is doubly so as fundamental concepts have been reworked to suit the demands of scale and efficiency and that means there's a new learning curve.

From HTTP to gRPC

Let's start with a change that touches every point of the system: how applications communicate with etcd. The etcd 2.x API was built on JSON delivered over HTTP endpoints. This was very accessible; all you needed was curl or similar and you could work with it. This is what is now called the etcd API version 2. It worked for etcd's original scale, but the developers were looking to handle "tens of thousands of clients and millions of keys in a single cluster".

For that, they have moved over to gRPC, which is built on top of Protocol Buffers. It's inspired by HTTP/REST but runs over HTTP/2, uses static routes rather than routes with parameters embedded in them, and sends back API-centric results rather than HTTP status codes. It also builds in support for full-duplex streaming for long-running connections. This is the etcd API version 3.

An etcd 2.x server only understands the version 2 API. An etcd 3.x server understands both the version 2 and version 3 APIs but, and it's a huge but, anything you create with clients using one API version will be invisible to clients using the other. That's because, at the back end, each API routes to a separate data store; the two are so different that they are isolated from each other inside the server.

All change in etcdctl

That split goes all the way up to the command line, often your first port of call when working with etcd. Etcdctl, the command-line tool for etcd, is one binary but it now behaves like one of two programs depending on the ETCDCTL_API environment variable. Set it to 2 and it behaves like the etcdctl application from etcd 2.x, using HTTP/JSON communications and the familiar set of commands. Set it to 3 and pretty much every command is different as the application works in terms of the newer API. To give you an idea, here's a screenshot of both versions of the command side by side.

etcdctl help screens

From this point on, when we say etcd2, we're referring to the API version 2 and etcd3 refers to the API version 3.

Goodbye hierarchy, hello flat keyspace

One of the interesting attributes of keys in etcd2 is that they can also be directories holding more keys with values, or more directories. This lets you create hierarchical, file-system-like structures for holding your data, like "/clusters/node00/activity/xyz". You could perform various operations with reference to this hierarchy too: etcd2 allowed clients to wait for activity on a key or a directory (and any of its children), so, for example, you could monitor "/clusters/node00" for changes.

Well, that's all gone. There's now a simple flat namespace for keys. The switch to flat namespaces makes things much easier to manage in terms of consistency and efficiency in clustered systems which is why most people want something like etcd in the first place.

You can create a key called "/clusters/node00/activity/xyz", but it's handled as a single string; no directories are implied or created. That said, you can build your own hierarchy through how you name things, and etcd3 provides a prefix option to match anything that starts with a particular key value. So you can emulate directory structures; for example, given the key above, we could look for changes to anything under "node00" with this command:

ETCDCTL_API=3 etcdctl watch --prefix "/clusters/node00/"  

And get a similar effect. Prefixes mitigate the loss of directory structures in etcd3, trading them for the more predictable flat namespace. If you make extensive use of directory structures in etcd2, this is the first thing you'll want to allow for in your migration to etcd3.
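
Under the hood, a prefix operation on the flat keyspace is just a range query: the client computes an exclusive end key by incrementing the last byte of the prefix. Here's a small Go sketch of that computation (our own illustration of the idea; the function name is ours, not part of any etcd library):

```go
package main

import "fmt"

// prefixRangeEnd computes the exclusive upper bound for a prefix scan:
// every key k with prefix p satisfies p <= k < prefixRangeEnd(p).
// It increments the last byte that isn't 0xff; if every byte is 0xff,
// the range runs to the end of the keyspace.
func prefixRangeEnd(prefix string) string {
	end := []byte(prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return string(end[:i+1])
		}
	}
	return "\x00" // whole keyspace from the prefix onward
}

func main() {
	fmt.Println(prefixRangeEnd("foo"))                // fop
	fmt.Println(prefixRangeEnd("/clusters/node00/")) // /clusters/node000
}
```

So a watch on the prefix "foo" is really a watch on the range ["foo", "fop"), which is why every key beginning with "foo" matches.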

Compare and swap out, Transactions in

In etcd2, much is made of the atomicity of particular operations, such as compare-and-swap, which ensure that no two clients interfere with each other and leave the data inconsistent. The problem with atomic actions, though, is that as things get more complex, more data needs to be modified consistently, and an atomic action is, by definition, limited in scope to the single action it protects.

Etcd3 still has atomic operations, but they are now joined by the more interesting transactions. These aren't transactions in the traditional "giant lock" sense, but a compact, guarded "if ... then ... else" operation. Here's a small sample of Go code using the clientv3 library to run a transaction:

    tx := cli.Txn(context.TODO())

    txresp, err := tx.If(
        clientv3.Compare(clientv3.Value("foo"), "=", "bar"),
    ).Then(
        clientv3.OpPut("foo", "sanfoo"), clientv3.OpPut("newfoo", "newbar"),
    ).Else(
        clientv3.OpPut("foo", "bar"), clientv3.OpDelete("newfoo"),
    ).Commit()

In the If() section, a comparison is defined (checking the key foo to see if it's equal to bar). You can have multiple comparisons here; the If is true only if all of them are true. If it is, the operations in the Then() section are run; if not, the Else() section's operations are run. Either way, you can perform multiple operations and all the changes will be handled as a single index increment in etcd's database.

It's quite a powerful primitive, and it's what you'll use to replace the compare-and-swap and compare-and-delete operations in etcd2 code.
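
The semantics are easy to model. Here's a toy, in-memory sketch of the guarded transaction shape (our own illustration, not the clientv3 API): all comparisons are evaluated, then either the Then or the Else operations are applied as one atomic step:

```go
package main

import "fmt"

// A toy key-value store with an etcd3-style guarded transaction.
type Store struct {
	data map[string]string
	rev  int64 // one revision increment per committed transaction
}

type Op func(s *Store)

func Put(k, v string) Op { return func(s *Store) { s.data[k] = v } }
func Delete(k string) Op { return func(s *Store) { delete(s.data, k) } }
func ValueIs(k, v string) func(*Store) bool {
	return func(s *Store) bool { return s.data[k] == v }
}

// Txn applies thenOps if every comparison holds, elseOps otherwise.
// Either way the whole batch counts as a single revision bump.
func (s *Store) Txn(cmps []func(*Store) bool, thenOps, elseOps []Op) bool {
	ok := true
	for _, c := range cmps {
		ok = ok && c(s)
	}
	ops := elseOps
	if ok {
		ops = thenOps
	}
	for _, op := range ops {
		op(s)
	}
	s.rev++
	return ok
}

func main() {
	s := &Store{data: map[string]string{"foo": "bar"}}
	succeeded := s.Txn(
		[]func(*Store) bool{ValueIs("foo", "bar")},
		[]Op{Put("foo", "sanfoo"), Put("newfoo", "newbar")},
		[]Op{Put("foo", "bar"), Delete("newfoo")},
	)
	fmt.Println(succeeded, s.data["foo"], s.data["newfoo"], s.rev) // true sanfoo newbar 1
}
```

A classic compare-and-swap is just the special case of one comparison, one Put in the Then branch and nothing in the Else.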

TTLs expired, Leases obtained

The change with TTLs in etcd3 sees the per-key TTLs of etcd2 turn into a more general Lease. Leases can be created and have keys attached to them. The Lease itself has a time to live and, when that expires, all the keys attached to the Lease expire with it. You can keep the Lease alive with a KeepAlive request or make it go away with a Revoke request. What this gives you, practically, is much better-synchronized behavior: a server could create a set of property values with all their keys attached to one Lease. If it is the server's responsibility to send KeepAlive requests for the Lease, then when it stops doing that, all the related properties neatly disappear together. Working with it is simple enough too:

    // Get a lease with a 10-second TTL
    lease, err := cli.Grant(context.TODO(), 10)
    // Attach a key to it
    _, err = cli.Put(context.TODO(), "foo", "bar", clientv3.WithLease(lease.ID))
    // Prod it to keep it alive once...
    _, err = cli.KeepAliveOnce(context.TODO(), lease.ID)
    // Sleep, then read the remaining time to live
    status, err := cli.TimeToLive(context.TODO(), lease.ID)
    fmt.Printf("Status: %v\n", status.TTL)
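
To make the grouping behavior concrete, here's a toy in-memory model of leases (our own sketch using a logical clock, not the clientv3 API): keys attach to a lease, and when the lease expires or is revoked, every attached key vanishes at once:

```go
package main

import "fmt"

// A toy model of etcd3 leases on a logical clock: every key attached
// to a lease vanishes when the lease's TTL runs out or it is revoked.
type LeaseStore struct {
	now      int64 // logical clock, in "seconds"
	data     map[string]string
	keyLease map[string]int64 // key -> lease ID
	expiry   map[int64]int64  // lease ID -> expiry time
	ttl      map[int64]int64  // lease ID -> granted TTL
	nextID   int64
}

func NewLeaseStore() *LeaseStore {
	return &LeaseStore{
		data:     map[string]string{},
		keyLease: map[string]int64{},
		expiry:   map[int64]int64{},
		ttl:      map[int64]int64{},
	}
}

// Grant creates a lease with the given TTL and returns its ID.
func (s *LeaseStore) Grant(ttl int64) int64 {
	s.nextID++
	s.ttl[s.nextID] = ttl
	s.expiry[s.nextID] = s.now + ttl
	return s.nextID
}

// PutWithLease stores a key and ties its lifetime to a lease.
func (s *LeaseStore) PutWithLease(k, v string, id int64) {
	s.data[k] = v
	s.keyLease[k] = id
}

// KeepAlive pushes the lease's expiry out by its TTL again.
func (s *LeaseStore) KeepAlive(id int64) { s.expiry[id] = s.now + s.ttl[id] }

// Revoke kills the lease immediately, taking its keys with it.
func (s *LeaseStore) Revoke(id int64) { delete(s.expiry, id); s.sweep() }

// Tick advances the clock and expires any overdue leases and their keys.
func (s *LeaseStore) Tick(seconds int64) { s.now += seconds; s.sweep() }

func (s *LeaseStore) sweep() {
	for k, id := range s.keyLease {
		if exp, live := s.expiry[id]; !live || exp <= s.now {
			delete(s.data, k)
			delete(s.keyLease, k)
		}
	}
}

func main() {
	s := NewLeaseStore()
	id := s.Grant(10)
	s.PutWithLease("foo", "bar", id)
	s.PutWithLease("foo2", "baz", id)
	s.Tick(8)
	s.KeepAlive(id)          // refreshed: expiry pushed out again
	s.Tick(8)                // still within the refreshed TTL
	fmt.Println(len(s.data)) // 2
	s.Tick(8)                // lease expires: both keys go together
	fmt.Println(len(s.data)) // 0
}
```

The point of the model is the last two lines: stop refreshing, and every key under the lease disappears in one go, which is exactly the cleanup behavior the article describes.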

Watching rather than waiting

Watching in etcd2 meant waiting for changes: opening an HTTP connection for each key you wanted to watch and waiting for it to return changes. For etcd3, in keeping with making everything scale better, watching is now handled by watcher RPCs. Create a watcher RPC, request watches on keys or ranges of keys from it, and it returns a stream of changes to those keys. You can ask for previous revisions too, back to when the server last compacted its data, and play back from there.

In the Go client for etcd3, the Watcher RPC is managed for you and all you need to do is request a Watch which returns you a Go channel down which the changes arrive. That looks something like this:

    rch := cli.Watch(context.Background(), "foo", clientv3.WithPrefix())

    go func(chn clientv3.WatchChan) {
        for wresp := range chn {
            for _, ev := range wresp.Events {
                fmt.Printf("%s %q : %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
            }
        }
    }(rch)

This snippet launches a goroutine which prints incoming change events. It uses the prefix option mentioned earlier, with the key value "foo" as the prefix to match, so I get changes for "foo", "foo2", "foonicular", "foo/bar/ftang/ftang" and any other key that starts with "foo".
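
The "play back from a revision" idea is worth a closer look. Here's a toy model of the revision history behind a watch (our own sketch, not an etcd API): each change gets a revision number, a watcher can replay from any revision that hasn't been compacted, and asking for compacted history is an error:

```go
package main

import "fmt"

// A toy model of etcd3 watch history: every change gets a revision, and a
// watcher can start from a past revision as long as it isn't compacted away.
type Event struct {
	Rev   int64
	Type  string // "PUT" or "DELETE"
	Key   string
	Value string
}

type History struct {
	events     []Event
	rev        int64
	compactRev int64
}

// Put records a change and bumps the revision counter.
func (h *History) Put(k, v string) {
	h.rev++
	h.events = append(h.events, Event{h.rev, "PUT", k, v})
}

// Compact forgets events at or below rev, as etcd's compaction does.
func (h *History) Compact(rev int64) {
	kept := h.events[:0]
	for _, e := range h.events {
		if e.Rev > rev {
			kept = append(kept, e)
		}
	}
	h.events = kept
	h.compactRev = rev
}

// WatchFrom replays events from startRev onward; it fails if that part
// of history has already been compacted.
func (h *History) WatchFrom(startRev int64) ([]Event, error) {
	if startRev <= h.compactRev {
		return nil, fmt.Errorf("required revision %d has been compacted", startRev)
	}
	var out []Event
	for _, e := range h.events {
		if e.Rev >= startRev {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	h := &History{}
	h.Put("foo", "bar")
	h.Put("foo", "baz")
	h.Put("foo2", "qux")
	evs, _ := h.WatchFrom(2) // play back from revision 2
	fmt.Println(len(evs))    // 2
	h.Compact(2)
	_, err := h.WatchFrom(1) // that history is gone
	fmt.Println(err != nil)  // true
}
```

A real watcher then keeps streaming new events after the replay; the model only shows the replay half, which is the part that changes how you design clients.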

Previous values or not

Many etcd2 operations could return the previous value associated with a key, so you could see what you'd deleted or replaced. By default, etcd3 doesn't do this. There is a WithPrevKV() option you can add to operations, but don't assume it will always return something: to keep etcd3 efficient, the server compacts its data regularly, and if the previous value has been compacted away, there's nothing for WithPrevKV() to return. If you can, stop relying on this behavior. If you can't, an option is to create a transaction which reads the current value and returns it before changing it. It's fiddly, but it's atomic and reliable.
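
The read-then-change trick works because operations inside a transaction run in order against a consistent view. Here's a toy model of the idea (our own sketch, not the clientv3 API): a Get placed before a Put in the same atomic batch captures the value being replaced:

```go
package main

import "fmt"

// A toy store whose transactions apply operations in order and collect
// read results, so a Get placed before a Put returns the old value.
type KV struct{ data map[string]string }

type TxnOp struct {
	kind string // "GET", "PUT" or "DELETE"
	key  string
	val  string
}

func Get(k string) TxnOp    { return TxnOp{"GET", k, ""} }
func Put(k, v string) TxnOp { return TxnOp{"PUT", k, v} }
func Del(k string) TxnOp    { return TxnOp{"DELETE", k, ""} }

// Txn runs all ops as one atomic batch; each read sees the store as it
// stands when it executes, so Get-then-Put captures the previous value.
func (s *KV) Txn(ops ...TxnOp) []string {
	var reads []string
	for _, op := range ops {
		switch op.kind {
		case "GET":
			reads = append(reads, s.data[op.key])
		case "PUT":
			s.data[op.key] = op.val
		case "DELETE":
			delete(s.data, op.key)
		}
	}
	return reads
}

func main() {
	s := &KV{data: map[string]string{"foo": "bar"}}
	// Read the old value and replace it in one atomic step.
	reads := s.Txn(Get("foo"), Put("foo", "baz"))
	fmt.Println(reads[0]) // bar: the value we just replaced
}
```

Because the whole batch commits as one unit, no other client can change "foo" between the read and the write, which is what makes this a reliable substitute for WithPrevKV().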

So etcd3?

Given all these changes, it's worth treating etcd 3.x and its version 3 API as, pragmatically, a new database when developing your clients and creating your ops workflows. It is, though, built to scale workloads up efficiently, and with its leases and watchers it avoids the dangers of simple operations in complex environments.

There's no simple migration path for applications and, currently, there are not as many client drivers for various languages as there are for etcd2. That said, gRPC is widely available and you can consider developing your own driver.

If you want an enterprise-scaled, consistent, observable source of truth, then etcd 3.x and the etcd version 3 API are the way to go. We've only skimmed over the changes here and not touched any of the new features that have appeared; we'll have more on that when it gets closer to etcd3 being made available on Compose.

If you have any feedback about this or any other Compose article, drop the Compose Articles team a line. We're happy to hear from you.


Dj Walker-Morgan
Dj Walker-Morgan was Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.
