Optimizing SDNs for Compose Elasticsearch
As you may have read, we've taken the beta label off Compose Elasticsearch. We spent months refining our high-performance Elasticsearch offering and making sure it was as exceptionally good as our other databases. During that time we hit a few obstacles that had to be overcome before we could push forward to release. Now that Compose Elasticsearch is out of beta, we thought we'd talk about one in particular: how we optimized our networking for Compose Elasticsearch.
When developing the Compose platform, we were able to maximize our CPU and storage I/O performance on the host whilst retaining flexibility in resource allocation. The modern database is the sum of its distributed components and the connections between them, and the advent of powerful software-defined networking means we can use many of the same approaches for our network connections as we use elsewhere.
That has, in turn, allowed for a tighter, but also more flexible, fit of our networking topology to the task at hand: running your database deployments. To illustrate how this works in practice, we'll show you how we initially deployed our SDN and what we had to change to maximize its potential.
How we connect
Each node in any user's deployment is run in an LXC-based container. We're big believers in the containerized ecosystem and have been using the technology for several years. We had to create our own way of packaging the applications and filesystems for these containers, and we call this combination a "capsule". Capsules are brought to life within an LXC container and are the unit of compute power across Compose. They are used throughout Compose's internal infrastructure, providing both user-facing and internal applications on demand on a range of hardware.
When we create a database deployment, we create capsules for the database and access portals on large, heavily provisioned, shared systems. Each of these host systems runs Open vSwitch, and we initially connected each host to every other host over a mesh of GRE tunnels. This is not ideal unless you only have two hosts, because with three or more hosts a full mesh contains loops. Enabling Open vSwitch's Spanning Tree Protocol (STP) support deals with that: after a short interval, the network settles into a stable, connected state in which the loops are blocked.
Each host's Open vSwitch was configured to regulate the flow of packets between the hosts. When a deployment began, capsules on the various hosts would be started up and connect to Open vSwitch automatically. Tunnels would be generated and connections made.
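To see why a full mesh loops once there are more than two hosts, it helps to count the tunnels. The short Python sketch below is illustrative only (the function names are ours, not part of Open vSwitch): a full mesh of n hosts needs n(n-1)/2 tunnels, a loop-free spanning tree keeps only n-1 of them, and the remainder are the redundant links that STP has to block.

```python
def mesh_tunnels(hosts: int) -> int:
    """Number of point-to-point tunnels in a full mesh of `hosts` hosts."""
    return hosts * (hosts - 1) // 2


def spanning_tree_links(hosts: int) -> int:
    """Links kept by any loop-free tree that still connects every host."""
    return max(hosts - 1, 0)


def blocked_links(hosts: int) -> int:
    """Redundant tunnels a spanning tree protocol must block to break loops."""
    return mesh_tunnels(hosts) - spanning_tree_links(hosts)


if __name__ == "__main__":
    # Print hosts, mesh tunnels, tree links and blocked links for a few sizes.
    for n in (2, 3, 10, 50):
        print(n, mesh_tunnels(n), spanning_tree_links(n), blocked_links(n))
```

With two hosts nothing needs blocking; with three there is already one redundant tunnel, and with ten hosts 36 of the 45 tunnels are redundant.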
The emerging problem
This arrangement worked well for us, but as our Elasticsearch beta went on, we started to notice a problem. As we scaled up with more users deploying Elasticsearch, network performance was not what we had been hoping for, and at particular times we were seeing packet loss on the network. Even though the losses were marginal, this was unacceptable to us, so we set about finding the source of the problem.
It turned out to be linked to changes in the configuration of the network. Whenever we added or removed a deployment, we would experience the degraded performance. As more users made use of Compose's platform for database services, more deployments were being added or removed, and it was this that was creating the performance drop. Digging deeper still, we found it was the use of the Spanning Tree Protocol in Open vSwitch that was the source of the trouble. STP is great when the network is relatively stable, but whenever a change is made to the topology it has to work out the loop-free paths all over again, even if the change was relatively minor. STP can take thirty to fifty seconds to respond to those topology changes.
Stopping the spanning
The first stop should have been RSTP, the Rapid Spanning Tree Protocol, a 2001 extension to STP which directly addresses the topology change problem and pulls the response time down to around six seconds by default, faster still in the case of a physical link failure. There are only two problems: it's still in development for Open vSwitch (but not that far off) and it doesn't completely address the problem.
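The response times quoted for STP and RSTP follow from the protocols' default timers. As a rough worked example, using the IEEE 802.1D defaults (hello time 2 s, forward delay 15 s, max age 20 s; the function names are illustrative):

```python
HELLO_TIME = 2      # seconds between BPDUs (802.1D default)
FORWARD_DELAY = 15  # seconds spent in each of the listening and learning states
MAX_AGE = 20        # seconds before stored root information expires


def stp_best_case() -> int:
    """A port passes through listening and learning before it forwards."""
    return 2 * FORWARD_DELAY


def stp_worst_case() -> int:
    """An indirect failure first waits for max age, then listening and learning."""
    return MAX_AGE + 2 * FORWARD_DELAY


def rstp_detection() -> int:
    """RSTP treats a neighbour as lost after three missed hellos."""
    return 3 * HELLO_TIME


if __name__ == "__main__":
    print(stp_best_case(), stp_worst_case(), rstp_detection())
```

That gives 30 and 50 seconds for classic STP and 6 seconds for RSTP, matching the ranges mentioned above.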
We already used the OpenFlow implementation in Open vSwitch (OVS) to control the flow of traffic between hosts, and we wondered if it would be possible to extend that control to the flow of traffic between the capsules running on the hosts. The answer is yes, but with one other change: we switched the tunnelling between hosts to VXLAN. This gave us more control over the configuration.
Now, rather than letting the STP in Open vSwitch work out the network topology and the associated routing, when we add or remove a node we also add or remove rules in OVS's OpenFlow configuration. That configuration defines the incoming and outgoing flow of packets from the capsule on one host to capsules on other hosts. We automate the process, of course. By statically defining these rules, though, we remove the topology discovery of STP and let Open vSwitch start routing packets as soon as the capsules are up.
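As a hedged illustration of what a VXLAN tunnel between two hosts can look like in Open vSwitch, the sketch below builds (but does not run) an ovs-vsctl command. The bridge name, port name and IP address are placeholders rather than Compose's actual configuration, and options:key=flow is just one possible choice, which leaves the tunnel ID to be set by OpenFlow rules.

```python
import shlex
from typing import List


def vxlan_port_command(bridge: str, port: str, remote_ip: str) -> List[str]:
    """Build the ovs-vsctl arguments that add a VXLAN tunnel port to a bridge."""
    return [
        "ovs-vsctl", "add-port", bridge, port,
        "--", "set", "interface", port,
        "type=vxlan",
        f"options:remote_ip={remote_ip}",
        "options:key=flow",  # assumption: tunnel ID chosen per flow by OpenFlow
    ]


if __name__ == "__main__":
    # Hypothetical bridge, port and peer address, for illustration only.
    print(shlex.join(vxlan_port_command("br-int", "vx-host2", "192.0.2.2")))
```

Returning the argument list rather than executing it keeps the sketch safe to run anywhere; a real tool would pass the list to something like subprocess.run on the host.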
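A minimal sketch of that idea, not Compose's actual tooling: keep a static table of which port each capsule is reached through, and render add-flow and del-flows commands for ovs-ofctl whenever a capsule is added or removed. The bridge name, MAC addresses, port numbers and the choice to match on destination MAC are all assumptions made for illustration.

```python
from typing import Dict, List


class StaticFlowTable:
    """Tracks which OpenFlow port reaches each capsule, keyed by capsule MAC."""

    def __init__(self, bridge: str) -> None:
        self.bridge = bridge
        self.ports: Dict[str, int] = {}

    def add_capsule(self, mac: str, out_port: int) -> List[str]:
        """Record the capsule and return the command that installs its rule."""
        self.ports[mac] = out_port
        flow = f"priority=100,dl_dst={mac},actions=output:{out_port}"
        return ["ovs-ofctl", "add-flow", self.bridge, flow]

    def remove_capsule(self, mac: str) -> List[str]:
        """Forget the capsule (KeyError if unknown) and return the delete command."""
        self.ports.pop(mac)
        return ["ovs-ofctl", "del-flows", self.bridge, f"dl_dst={mac}"]


if __name__ == "__main__":
    table = StaticFlowTable("br-int")  # hypothetical bridge name
    # Port 10 stands in for a VXLAN tunnel port leading to another host.
    print(" ".join(table.add_capsule("02:00:00:00:00:02", 10)))
    print(" ".join(table.remove_capsule("02:00:00:00:00:02")))
```

Because each rule is installed explicitly, there is no discovery phase to wait for: traffic can flow as soon as the rule is in place.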
When the database instances in the capsules on the various hosts connect, we know the tunnelling is being routed correctly. We then monitor the capsules and VXLAN connections to detect any problems.
Some weeks ago, we switched the Elasticsearch deployments over to this new configuration. Because of the change in tunnelling, we had to schedule a brief period of downtime. Once that was done, though, we had a network configuration that no longer suffers packet loss or slows down during reconfiguration. The ability to roll this change out without physical hardware changes and map it precisely to our use case also shows us how powerful software-defined networking can be.