Building a high-availability cluster without using magic

28 Feb 2021 5 minutes

There are incredibly powerful tools available to build and manage high-availability clusters these days, the most popular probably being Kubernetes. These tools can do amazing things, and if you’ve ever seen them in action you’ll know they look like magic. When I was setting up a high-availability cluster for an HPC environment in a recent project, I wanted to opt for a simpler solution. Note that I don’t mean simpler in terms of how easy it is to use, but rather in terms of how much the tool does (insert some generic systemd comment here). Let’s break down what we need from a high-availability cluster and then investigate how we can solve each problem on its own.

Note that if you’re planning on managing millions of containers this approach probably won’t scale well, but if you have a few dozen services and just want a simple solution to keep them running this might be an option.

High availability: what do we need?

In its simplest form, we need our infrastructure to be robust against one or more nodes failing or becoming unavailable. To that end, we will require the following:

Something to monitor, manage and move around resources
A distributed storage solution on which we can run resources and store their data

Resources here can be anything: VMs, containers or even just an Apache installation. You’ll soon see that you can plug just about anything into this setup. To make this happen we’ll be using Pacemaker, a high-availability resource manager that is incredibly flexible and easy to adapt to your exact needs.

As for the distributed storage solution I’ve opted to use Ceph, a personal favourite of mine (and also CERN, but who cares about that). It provides excellent reliability and is simple to set up and use. Once set up you’ll be left with one or more block devices in /dev/rbd that you can use just like normal storage devices. You can however use any other distributed storage solution that you’d like by just adapting the resource scripts we’ll be setting up in the next sections.

Pacemaker: the heartbeat of the cluster

Pacemaker is a cluster resource manager - think of it like a distributed, highly-available init system. It starts, stops and monitors services on various nodes and makes sure that everything keeps on running. The cool thing is that resources are essentially just scripts that you can set up to fit your exact needs. Pacemaker then takes care of managing your resource via your provided script - if your script says the resource is down, Pacemaker can automatically start it up on another node. It’s that simple™. Let’s take a look at an (incomplete) example Pacemaker resource script (the proper name is a “resource agent”) that is meant to make sure that Apache is running:

# NOTE this is not a valid resource agent but this is essentially how they work
# A valid resource agent will have some extra boilerplate and needs to return
# specific values from the various functions

start() {
    # here we can start some service, for example apache
    systemctl start apache2
}

stop() {
    # here we need to stop whatever we started
    systemctl stop apache2
}

monitor() {
    # here we need to check if the resource is still running and inform
    # pacemaker if there are any issues; the exit status of the last
    # command (0 only while apache2 is active) becomes the return value
    systemctl is-active --quiet apache2
}
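To give an idea of the "extra boilerplate" and "specific values" mentioned in the comments above, here is a minimal sketch of how an agent can dispatch the action Pacemaker passes in and answer with OCF return codes. The numeric codes are the standard OCF ones; everything named demo_* and the state file standing in for a real service are made up for illustration. A real agent would normally get the codes by sourcing the ocf-shellfuncs helper library and must also implement actions such as meta-data, which are left out here.

```shell
# Sketch only: a fake "service" that counts as running while a state file exists.
# Standard OCF return codes (a real agent gets these from ocf-shellfuncs).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical state file standing in for a real service
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/demo-resource.state}"

demo_start() {
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

demo_stop() {
    # stopping a resource that is already stopped has to succeed too
    rm -f "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

demo_monitor() {
    # a cleanly stopped resource is reported as "not running", not as an error
    if [ -e "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

demo_dispatch() {
    # Pacemaker calls the agent with the action name as its first argument
    case "$1" in
        start)   demo_start ;;
        stop)    demo_stop ;;
        monitor) demo_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

In a standalone agent script the last line would be something like `demo_dispatch "$1"; exit $?`, so that the return code ends up as the exit status Pacemaker sees.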

Resource agents are incredibly flexible - have a look at the ClusterLabs resource agents repo, which contains a bunch of examples to get you started. You can also read this short introduction, and when you’re ready here is detailed documentation for resource agent development.

Pacemaker allows you to configure how you want to orchestrate your resources in a myriad of ways (e.g. node affinity, resource constraints and load balancing just to name a few). Pacemaker even supports cool things like STONITH (Shoot The Other Node In The Head) where it can trigger hardware power switches to power off misbehaving nodes. Be sure to read through the official documentation to set up Pacemaker and check out what it allows you to do.
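As a rough illustration of what node affinity looks like in practice, the sketch below creates a resource and attaches two location constraints. It assumes the cluster is managed with the pcs command-line tool (crmsh offers equivalents), and the resource name web as well as the node names node1 and node3 are made up. Nothing runs until the function is called.

```shell
# Sketch only: resource and node names are hypothetical, pcs is assumed.
configure_web_resource() {
    # create an Apache resource from the stock ocf:heartbeat:apache agent
    # and have Pacemaker run its monitor action every 30 seconds
    pcs resource create web ocf:heartbeat:apache op monitor interval=30s || return 1

    # node affinity: prefer node1 with a score of 100, other nodes stay allowed
    pcs constraint location web prefers node1=100 || return 1

    # never place the resource on node3
    pcs constraint location web avoids node3 || return 1
}
```

Calling `configure_web_resource` on any cluster node applies the configuration cluster-wide. The preference is deliberately a finite score rather than a hard rule, so the resource can still fail over when node1 goes down.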

High-availability storage

Once you have Pacemaker up and running you will need some way to share data between your nodes: this is where Ceph comes in. Ceph is a distributed, highly scalable and reliable storage solution that allows you to combine multiple nodes into one big clustered storage pool. You can then access your data from any of these nodes as if they had a native hard-drive plugged in (assuming you’re using Ceph’s RBD images). Ceph also provides data redundancy, automatic fail-over and data scrubbing, all while being easy to set up and use. Ceph has fantastic documentation - be sure to check out the official documentation.
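To make the "native hard-drive" idea a bit more concrete, here is a small sketch of attaching an RBD image to a node and detaching it again. It assumes the image already exists and carries a filesystem; the function names and the pool, image and mount point in the usage note are made up for illustration.

```shell
# Sketch only: assumes the RBD image already exists and is formatted.
rbd_attach() {
    local pool="$1" image="$2" mountpoint="$3" device
    # "rbd map" prints the block device it created, e.g. /dev/rbd0
    device="$(rbd map "$pool/$image")" || return 1
    mkdir -p "$mountpoint" || return 1
    mount "$device" "$mountpoint" || return 1
    # hand the device path back to the caller for the later unmap
    echo "$device"
}

rbd_detach() {
    local mountpoint="$1" device="$2"
    umount "$mountpoint" || return 1
    rbd unmap "$device"
}
```

A call could look like `dev="$(rbd_attach rbd demo /mnt/demo)"` followed later by `rbd_detach /mnt/demo "$dev"`. A plain EXT4 filesystem on an RBD image must only be mounted on one node at a time, which is exactly the kind of thing the resource manager enforces by running the resource in one place.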

Ceph is by no means the only distributed storage solution: there are also Gluster, OpenAFS or even just plain old NFS. As long as you have some way for your various nodes to access the resource data (databases, VM hard-drives etc.) you will be able to adapt your resource agents to use them.

Putting it all together

Now we have a way to orchestrate our resources and, thanks to the distributed storage, we can access our resource data from various nodes. It is now simply a matter of choosing what your resources should be and deploying them. Some examples of possible deployments (each of them using some distributed filesystem for data storage):

VMs managed with libvirt
MariaDB server
Nginx load balancer

As a concrete example, here is how the resource agent for an LXC + Ceph deployment, which I’m currently using in an HPC project, works. The resource agent essentially does the following:

Each LXC container is stored on its own RBD image, which is formatted using EXT4
To start a container, the RBD image is mapped on one node and then mounted at /var/lib/lxc/<container-name> on that node
Once the image is mounted, the container is started using lxc-start
To stop a container, lxc-stop is run, after which the container image is unmounted and the Ceph RBD image is unmapped
The container is monitored using lxc-ls
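The steps above can be sketched roughly as follows. This is not the agent from my deployment, just a simplified illustration: it assumes a Ceph pool called lxc, one RBD image per container named after the container, and it leaves out the OCF boilerplate (meta-data, parameter handling, logging). The container_* function names are made up.

```shell
# Sketch only: pool name, image naming and function names are assumptions.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

POOL="${POOL:-lxc}"                    # hypothetical Ceph pool name
LXC_ROOT="${LXC_ROOT:-/var/lib/lxc}"   # where the container images get mounted

container_monitor() {
    # lxc-ls prints the running containers one per line; look for an exact match
    if lxc-ls -1 --running | grep -qx "$1"; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

container_start() {
    local name="$1" device
    # starting a container that is already running is a no-op
    container_monitor "$name" && return $OCF_SUCCESS
    # map the RBD image on this node; rbd prints the device, e.g. /dev/rbd0
    device="$(rbd map "$POOL/$name")" || return $OCF_ERR_GENERIC
    mkdir -p "$LXC_ROOT/$name" || return $OCF_ERR_GENERIC
    mount "$device" "$LXC_ROOT/$name" || return $OCF_ERR_GENERIC
    lxc-start -n "$name" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

container_stop() {
    local name="$1"
    if container_monitor "$name"; then
        lxc-stop -n "$name" || return $OCF_ERR_GENERIC
    fi
    # release the storage only if it is still mounted, so that stopping
    # an already stopped container succeeds as well
    if mountpoint -q "$LXC_ROOT/$name"; then
        umount "$LXC_ROOT/$name" || return $OCF_ERR_GENERIC
        rbd unmap "$POOL/$name" || return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

Unmapping the image on stop is what allows Pacemaker to start the same container on a different node afterwards without two nodes ever mounting the same EXT4 filesystem at once.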

This deployment has been serving me well for over two years and it currently manages around 70 LXC containers with around 5.5 TB of data without any problems.