How to keep your etcd lean and mean

How big is your database? Is it the size of all the records you keep in it? Or is it, as with etcd, a reflection of your workload as well? Read on to find out how revisions, compaction and defragmentation affect the size of your etcd database.

Inside any database, there is always pressure on resources like memory. With etcd, one of those pressures is the retention of revisions.

A quick look at revisions

etcd keeps a single revision counter which starts at 0. Each change made to the keys and their values in the database creates a new revision: the counter is incremented, a new version of the key is written, and that version is tagged with the incremented counter's value.

The revision count is database-wide, not per key. Say you create key/values in a new database for keys a, b and c: the revision count will then be at 3. Key a will have a revision of 1, b of 2 and c of 3. If you set b to a new value, the revision count goes to 4, and a new version of b with the new value is created and tagged with that revision. The old version of b is retained. When nodes are syncing up, it's this sequence of revisions that provides the reference for ensuring the integrity of the nodes.

To counter the build-up of old versions, there's a process called compaction which removes revisions up to a specified revision number. Say your database has had 1000 updates: the revision count will be at 1000, and the most recently updated record will carry revision 1000 too. If you ask etcd to compact up to revision 900, revisions 1 to 900 will vanish from the database unless they hold the latest version of a key.
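The a/b/c example and the compaction rule can be sketched in a few lines of Python. This is a toy model of the idea, not etcd's actual implementation: one database-wide counter, every put appends a new tagged version, and compaction drops superseded revisions at or below the target.

```python
# Toy sketch of etcd's MVCC revision model (illustration only, not the real code).
class ToyMVCC:
    def __init__(self):
        self.revision = 0          # database-wide counter, starts at 0
        self.history = []          # list of (revision, key, value) tuples

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        return self.revision

    def compact(self, up_to):
        # Keep a revision if it is newer than `up_to`, or if it is the
        # latest version of its key (compaction never drops live data).
        latest = {key: rev for rev, key, _ in self.history}
        self.history = [
            (rev, key, val) for rev, key, val in self.history
            if rev > up_to or latest[key] == rev
        ]

db = ToyMVCC()
db.put("a", "1")   # revision 1
db.put("b", "1")   # revision 2
db.put("c", "1")   # revision 3
db.put("b", "2")   # revision 4; revision 2 of "b" is retained as an old version

db.compact(3)      # revision 2 ("b" -> "1") is superseded, so it vanishes;
                   # revisions 1, 3 and 4 survive as latest versions
```

Note how revision 1 and 3 survive compaction up to 3: they are old numbers, but they still hold the latest versions of a and c.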

Getting your stats

The first thing we need to do is find out how much memory we are using. That's where the "endpoint status" command comes in. For these examples, we're keeping our endpoints in the environment variable $ETCDENDPOINTS and our etcd username and password in $ETCDCREDS.

❯ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS endpoint status -w table
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                            ENDPOINT                             |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://portal918-11.democompact.3268603687.composedb.com:28414 |  14029f60c465804 |   3.3.3 |  573 kB |     false |         4 |      16751 |
| https://portal1305-6.democompact.3268603687.composedb.com:28414 | 67ea4eb47c529c75 |   3.3.3 |  573 kB |      true |         4 |      16752 |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

Here we can see two nodes reporting that the database size on both of them is 573KB (scroll the table sideways if you can't see all the columns).

About proxies and nodes

The sharp-eyed among you may be asking "Hey, aren't there supposed to be three nodes in this etcd setup?"... There are, but they sit behind two HAProxy nodes which control traffic and pass it on to the database nodes on a round-robin basis. So there are two endpoints which are HAProxy nodes and, behind them, three etcd database nodes.

Now, this means you need to pay attention to the node ID in the list to see which actual database node it is referring to. The nodes can be listed out too:

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS member list -w table
+------------------+---------+------------------------------------------+--------------------------+--------------------------+
|        ID        | STATUS  |                   NAME                   |        PEER ADDRS        |       CLIENT ADDRS       |
+------------------+---------+------------------------------------------+--------------------------+--------------------------+
|  14029f60c465804 | started | etcd258.sl-eu-lon-2-memory.9.dblayer.com | http://10.161.125.4:2380 | http://10.161.125.4:2379 |
| 67ea4eb47c529c75 | started | etcd306.sl-eu-lon-2-memory.7.dblayer.com | http://10.161.125.2:2380 | http://10.161.125.2:2379 |
| e79e78f6e5fa3b99 | started | etcd277.sl-eu-lon-2-memory.8.dblayer.com | http://10.161.125.3:2380 | http://10.161.125.3:2379 |
+------------------+---------+------------------------------------------+--------------------------+--------------------------+

It also means that your commands can land at any one of the three nodes. For etcd operations in general, that's not a problem; the cluster will handle it. But when it's an admin operation on a particular node, things get a little trickier, as we'll see...

Finding revision counts

The first thing is to know what your current revision count is. This is harder than it should be: the value comes back in the response header, so you can query a non-existent key to read it without any other output, like so:

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS get revisiontestkey -w json
{"header":{"cluster_id":16448426242120634882,"member_id":16689910270897306521,"revision":8028,"raft_term":4}}
$
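If you're scripting this, the revision can be pulled straight out of that JSON header. A minimal sketch, using the sample response above as a literal string (in practice you'd capture the etcdctl output instead):

```python
import json

# Sample `get ... -w json` response from the article; the header carries
# the cluster-wide revision even though the key doesn't exist.
response = '{"header":{"cluster_id":16448426242120634882,"member_id":16689910270897306521,"revision":8028,"raft_term":4}}'

revision = json.loads(response)["header"]["revision"]
print(revision)  # -> 8028
```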

The "revision" value tells us where we are: 8028 changes have happened to this database since it was created, and each one of those changes has been retained. We can peer into the history with the watch command, which takes a revision count as a starting point...

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS watch --rev=1 key1      
PUT  
key1  
value340  
PUT  
key1  
value52  
PUT  
key1  
value331  
PUT  
key1  
value401  
PUT  
key1  
value333  
PUT  
key1  
value753  
PUT  
key1  
value603  
PUT  
key1  
value318  
PUT  
key1  
value357  
PUT  
key1  
value21  
CTRL-C  
$

We are randomly generating keys and values on this database to create activity and, as you can see, there have been plenty of updates to "key1".

Auto-compaction and you

We've been generating short random values for around 1000 keys and updating them regularly. That, as you can see above, comes to 573KB of data in the database. Now, if we leave this database cluster passive for 24 hours and come back to it... [Insert "Doing databases" montage here] ...then look at the stats:

❯ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS endpoint status -w table
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                            ENDPOINT                             |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://portal918-11.democompact.3268603687.composedb.com:28414 |  14029f60c465804 |   3.3.3 |   94 kB |     false |         4 |      27676 |
| https://portal1305-6.democompact.3268603687.composedb.com:28414 | e79e78f6e5fa3b99 |   3.3.3 |   94 kB |     false |         4 |      27677 |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

94KB? So what's happened to make the memory use go down like that? Well, half the story is compaction. On Compose, our etcd deployments are set to compact revisions every hour. Compacting is a background process which doesn't block the database, but it can result in more memory being used as the keys and revisions are reorganized.

That fragmented memory is tidied up when a node is asked to defragment. Defragmentation reclaims all the memory from the keys and values that were compacted away. Each node in a Compose etcd cluster runs a defrag job once every 24 hours, each node running at a different hour. That's because defrag, unlike compaction, blocks the server.

So at any time, the three nodes in a Compose etcd cluster could be showing different db sizes. Defragged at different times and with different numbers of revisions retained and compacted, it's pretty unlikely the numbers would line up. The rule of thumb is that the smallest value is closest to the actual amount of data in your etcd, and the largest is closer to how much memory your workload needs.

Let's manually compact!

We'll push some more data into our database and see what the status is...

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS get revisiontestkey -w json
{"header":{"cluster_id":16448426242120634882,"member_id":7487883867543739509,"revision":9741,"raft_term":4}}
$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS endpoint status -w table
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                            ENDPOINT                             |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://portal918-11.democompact.3268603687.composedb.com:28414 |  14029f60c465804 |   3.3.3 |  324 kB |     false |         4 |      31778 |
| https://portal1305-6.democompact.3268603687.composedb.com:28414 | e79e78f6e5fa3b99 |   3.3.3 |  324 kB |     false |         4 |      31779 |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
$

OK, 324KB of data and 9741 revisions. What we want to do is clear out all that revision history. The compact command takes a revision number to clear out up to. So, let's say 9740...

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS compact 9740            
compacted revision 9740
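Here we picked 9740 by hand. In a script you'd more likely compute the target from a retention window instead. A sketch of that arithmetic; the helper name and the 1000-revision window are illustrative assumptions, not anything etcd mandates:

```python
# Hypothetical helper: pick a compaction target that retains the most
# recent `keep` revisions. The 1000-revision window is an assumption
# for illustration; choose a window that suits your workload.
def compaction_target(current_revision: int, keep: int = 1000) -> int:
    """Revision number to pass to `etcdctl compact`, never below 1."""
    return max(current_revision - keep, 1)

print(compaction_target(9741, keep=1000))  # -> 8741
```

The resulting number is what you'd pass to `etcdctl compact`.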

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS endpoint status -w table
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                            ENDPOINT                             |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://portal918-11.democompact.3268603687.composedb.com:28414 | 67ea4eb47c529c75 |   3.3.3 |  598 kB |      true |         4 |      31813 |
| https://portal1305-6.democompact.3268603687.composedb.com:28414 | e79e78f6e5fa3b99 |   3.3.3 |  598 kB |     false |         4 |      31814 |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

And... wait a minute, 598KB? The memory used has gone up: it appears etcd makes a clean copy of the latest keys and fragments the memory a bit more. If we waited 24 hours, the node defrag jobs would kick in and tidy things up. Or we can run the defrag command to reclaim the memory on each node. As we can't explicitly address the nodes through the proxies, we'll have to run it a couple of times, which should hit all the nodes:

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS defrag                  
Finished defragmenting etcd member[https://portal918-11.democompact.3268603687.composedb.com:28414]  
Finished defragmenting etcd member[https://portal1305-6.democompact.3268603687.composedb.com:28414]

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS defrag
Finished defragmenting etcd member[https://portal918-11.democompact.3268603687.composedb.com:28414]  
Finished defragmenting etcd member[https://portal1305-6.democompact.3268603687.composedb.com:28414]

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS endpoint status -w table
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                            ENDPOINT                             |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://portal918-11.democompact.3268603687.composedb.com:28414 |  14029f60c465804 |   3.3.3 |  213 kB |     false |         4 |      31871 |
| https://portal1305-6.democompact.3268603687.composedb.com:28414 | e79e78f6e5fa3b99 |   3.3.3 |  213 kB |     false |         4 |      31872 |
+-----------------------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

There, post-defrag, we're down to 213KB. If we go back to where we listed out the changes in "key1" earlier in this article, we find...

$ ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS watch --rev=1 key1 
watch was canceled (etcdserver: mvcc: required revision has been compacted)  
Error: watch is canceled by the server

$

... that there's no such revision any more. We can, of course, start watching from "now" by not specifying --rev.

ETCDCTL_API=3 etcdctl --endpoints=$ETCDENDPOINTS --user=$ETCDCREDS watch key1  
PUT  
key1  
value910  
...

In short

etcd gets its solidity from the revision system managing multiple versions of keys, and a heavy workload will be reflected in that revision system. If you update 30 keys with 1MB values every minute, you'll consume 1800MB in an hour. Compacting will free up space in the database, but it won't be fully reclaimed until a defrag runs. If you're concerned about the responsiveness of your etcd, figure out how big your un-compacted workload appears in your database and scale your deployment accordingly.
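The sizing arithmetic above is worth making explicit, since it's the number you'd scale a deployment against:

```python
# Back-of-envelope sizing for the un-compacted workload described above:
# 30 keys with 1MB values, updated every minute, retained for an hour.
keys_per_minute = 30
value_size_mb = 1
minutes = 60

uncompacted_mb = keys_per_minute * value_size_mb * minutes
print(uncompacted_mb)  # -> 1800
```

Swap in your own key counts, value sizes and update rates to estimate how much headroom your cluster needs between defrags.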


Read more articles about Compose databases - use our Curated Collections Guide for articles on each database type. If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

Image by Maarten van den Heuvel via Unsplash

Dj Walker-Morgan
Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page to keep reading.
