Aaron Brady 🍀 🍁

Migrating from QCOW2 on RAID to DRBD on LVM on HBA

~ And other acronyms ~

This is a very niche topic but I needed to know about it and now I do and so can you.

Colocataires had a single server hosting virtual machines which experienced some hardware issues. It seems like those are resolved by making sure both power-supplies are plugged in, but regardless, it made me skittish about having "one" of anything. (Two is one; one is none).

Ahead of taking paying customers for virtual machines I looked into backup solutions for QCOW files and none of them made me very happy. Regardless, I don't really want historical backups — our customers can manage those themselves — I want the ability to lose a machine and start up the VM somewhere else.

I bought a second server, nearly the same specification as the first. Now we need to make the whole thing highly available.

DRBD is the distributed replicated block device. As a project it's managed by the Austrian company LINBIT, but the previous major version (8.4) has been merged into the Linux kernel and generally it's been a stable piece of kernel infrastructure for a long time. The 9.x series has many interesting features, but not interesting enough for me to build a kernel for.

Conceptually, DRBD operates like a RAID1 with a network pipe in between -- give it two block devices on two different machines and it will keep them in sync. Ordinarily it forces one to be "secondary" and one to be "primary", but crucially it also supports "dual-primary" mode, where both sides can write. This is absolutely a dangerous mode to operate in, but it's also necessary to cleanly support live migration of machines from one host to another.

Now's a good time to say: if you're here looking for things to copy and paste: I get it. But please, please, read the whole DRBD manual before attempting to follow anything from this page. Storage is not the place to fuck around and find out.

The original configuration

One HP DL360 server (hip1.coloc.systems) with four (of eight) drive bays occupied:

The OS is installed on the SSD RAID, and QCOW2 files for each VM's disk are stored on the filesystem. These are copy-on-write files and they can dynamically grow, allowing you to over-commit storage rather than allocating everything up front (with some risk and some performance overhead).
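
For example (a sketch with a made-up path), a 100GiB virtual disk starts life as a file of a few hundred KiB and only grows as the guest writes:

  qemu-img create -f qcow2 /var/lib/libvirt/images/example.qcow2 100G
  qemu-img info /var/lib/libvirt/images/example.qcow2
  # "virtual size" reports 100 GiB; "disk size" starts at a few hundred KiB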

libvirtd with QEMU is the hypervisor layer. The configuration is XML but basically human readable and very scriptable and flexible.

The flaws

Mixing the OS drive with the VM storage was a mistake. Annoyingly, it's one I made before, back when I worked at iWeb. Back then, we fixed it by booting the OS off of USB thumb-drives sticking out of the front of the machines. But times have changed: there's now a USB port inside the server for that task.

Taking snapshots of QCOW2 files looks to be harder and more error-prone than relying on storage level things that I'm familiar with, like ZFS zvols or LVM lvs. Ideally I would not rely on communicating with the guests to quiesce their disks: they're not “my” guest VMs; I can't control what they do.
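
For comparison, the storage-level version is a one-liner (a sketch using the LVM names that show up later; the snapshot is merely crash-consistent, which is the trade-off I'm accepting by not quiescing guests):

  lvcreate --snapshot --size 1G --name svc1-snap slot1/svc1   # instant copy-on-write snapshot
  # back it up from /dev/slot1/svc1-snap, then:
  lvremove slot1/svc1-snap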

There's no provision to make things reliable beyond one machine. Dual PSUs and RAID and ECC RAM are all nice, but totally losing a machine is always a possibility. In fact, I've lost way more machines to crashes and total hardware write-off than I have lost power-supplies in my 20+ year IT career.

The new hotness

I wanted to fix my mistakes: don't use the hot-swap storage for the OS. Move redundancy to the machine level, not the disk level. But also, if I'm going to be duplicating VMs across machines, I don't want to pay 2 × storage costs to duplicate them within the machine, too. So let's drop RAID1 and manage each disk slice separately.
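
Concretely, that means each hot-swap disk gets its own LVM volume group (a sketch; the device names are whatever the HBA hands to Linux, and only slot1 actually appears later):

  pvcreate /dev/sdb
  vgcreate slot1 /dev/sdb   # one VG per physical disk, no RAID underneath
  pvcreate /dev/sdc
  vgcreate slot2 /dev/sdc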

A sidebar

Internet Archive Petabox machines used to have mount points for each drive. /1 through /24 for each disk on the front of a data node. They've moved to ZFS-based "solo" nodes which I think just have a /0 but I loved the brutal simplicity of the old scheme for attaching a lot of storage to (somewhat) disposable machines, where duplication is handled at the layer above the operating system.

I installed a new Debian trixie onto a 32GiB USB pen installed in the internal slot on the second machine, and named the machine “pollux”. It has the same bridge configuration that the old machine (“hip1”) had, with unique IPs. I cloned the disk (foreshadowing) to insert it into hip1 later on.

So we add one HP DL380 server (pollux.coloc.systems) with three (of eight) drive bays occupied:

This machine is 2U (vs. the 1U DL360) but it's basically the same spec and still only has 8 bays. What it does have is a lot of PCIe slots (seven). I don't need this, but I like it!

I used ssacli (a.k.a. hpssacli) to enable HBA mode, turning off all of the RAID features: hpssacli controller slot=1 modify hbamode=on. It actually rescanned and shuffled the block devices available to Linux live, without requiring a reboot, which was extra nice.
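
If you want to double-check before going further, something like this should show the controller sitting in HBA mode and the raw drives behind it (hedged: the exact subcommand spellings and output wording vary a little between controller generations):

  ssacli controller slot=1 show                     # controller status, including the HBA/RAID mode
  ssacli controller slot=1 physicaldrive all show   # the drives Linux will now see directly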

The install

My son and I went to the datacentre and racked up the machine, made sure that it could ping and be pinged, and then I could do the rest of everything from the comfort of my home, where there isn't 96dB(A) of fan noise.

three servers in the rack
(pictured: pollux, patliputra and castor)

I installed the pre-requisites:

  apt install libvirt-clients-qemu libvirt-daemon-driver-qemu \
    libvirt-daemon-system lvm2 drbd-utils kpartx

Moving some VMs (the plan)

My plan was to take the existing disks and make them part of a DRBD pair. Then I could replicate that to the new machine and do a live migration and suffer (almost) no downtime. The usual way to deploy DRBD, if you don't have existing data, is to give it a whole device and it'll use some of the end of it for its own metadata and then give you the rest of the device for your filesystem. But I already have a filesystem, so I chose to use a separate device for this metadata. (In the future, that will also make it easier to grow these volumes online).

Also: we don't have a device to give it. I've been giving libvirt QCOW2 disk images. Luckily, qemu-nbd exists — this lets you export any format that QEMU can read as a network block device, and it has a shortcut syntax for connecting to that block device locally. It's handy if you need to access a VM's filesystem for maintenance from outside of the VM environment – but it's also perfect for my needs.
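
As an aside, that maintenance trick looks roughly like this (a sketch with a made-up image path; the VM has to be powered off first, and this is why kpartx is in the install list above):

  modprobe nbd
  qemu-nbd --connect=/dev/nbd0 /var/lib/libvirt/images/example.qcow2
  kpartx -a /dev/nbd0                # map the partitions inside the image
  mount /dev/mapper/nbd0p1 /mnt      # poke around, then unwind it all:
  umount /mnt
  kpartx -d /dev/nbd0
  qemu-nbd --disconnect /dev/nbd0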

Also: I don't actually have any disk devices that I can use for metadata, so I'll create some empty files and use loopback devices with losetup. There's a literal formula in the DRBD docs for estimating this size, but I just decided to standardize on 100MiB because a) it's going to be enough for any reasonable size disk and b) storage is cheap and mistakes are expensive.
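
If you do want to run the numbers: as I read the 8.4 formula, external metadata works out to roughly 32KiB per GiB of data plus a small constant, so 100MiB comfortably covers a volume of about 3TiB:

  # back-of-envelope: ~32KiB of DRBD metadata per GiB of data
  echo $(( 3 * 1024 * 32 / 1024 ))MiB   # a 3TiB volume needs ~96MiB of metadata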

I can't export a disk as an NBD while it's in use, so I'll need to stop the VM to do that, but I don't want to take excessive downtime, so I want to be able to bring the VM back up as soon as possible. I can rewrite the VM's config to use the DRBD block device, even when the initial sync is not completed.

Once they're in sync, I want to move everything from hip1 to pollux, with no downtime. We'll do this by allowing both machines to be primary (i.e. both will accept writes) and performing a QEMU live migration (essentially: pausing the CPU and copying the RAM over the network, then resuming everything on the other side).

Moving some VMs (the execution)

Create disks on pollux and hip1

  root@pollux:/home/colocataires# lvcreate -L10G -n svc1 slot1
    Logical volume "svc1" created.
  root@pollux:/home/colocataires# lvcreate -L100M -n svc1-metadata slot1
    Logical volume "svc1-metadata" created.
  root@hip1:/var/lib/libvirt/images# truncate -s 100M svc1-metadata
  root@hip1:/var/lib/libvirt/images# losetup /dev/loop0 svc1-metadata

Configure DRBD's config file (/etc/drbd.d/svc1.res)

resource svc1 {
  on hip1 {
    device    /dev/drbd_svc1 minor 1; # this syntax lets us
       # have nice symlinks instead of /dev/drbd1
    disk      /dev/nbd0;
    meta-disk /dev/loop0;
    address   192.168.32.6:20004; # unique port numbers for each resource
  }
  on pollux {
    device    /dev/drbd_svc1 minor 1;
    disk      /dev/slot1/svc1;
    meta-disk /dev/slot1/svc1-metadata;
    address   192.168.32.11:20004;
  }
  net {
    protocol C; # synchronous writes
    max-buffers 36k;
    sndbuf-size 1024k; # make network go fast
    rcvbuf-size 2048k;
  }
  disk {
    c-plan-ahead 15; # based on RTT between machines
    c-fill-target 24M;
    c-min-rate 80M; # straight up copied from StackOverflow
    c-max-rate 720M;
  }
}
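
The same file goes onto both hosts (DRBD matches the "on hip1" / "on pollux" sections against each machine's hostname), and drbdadm dump is a cheap way to confirm the parser agrees with you before touching any block devices:

  drbdadm dump svc1   # parse and print the resource; complains loudly about typos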

Pre-configure the VM to use its new disk device

  root@hip1:/# virsh edit svc1
... look for the file device being defined: ...
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' discard='unmap'/>
        <source file='/var/lib/libvirt/images/svc1.qcow2'/>
        <target dev='vda' bus='virtio'/>
... and change it to a block device: ...
      <disk type='block' device='disk'>
        <driver name='qemu' type='raw' discard='unmap'/>
        <source dev='/dev/drbd_svc1'/>

Saving the config does not change the running VM -- it just saves us from having to make this edit during the window while the VM is powered off.
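
You can confirm the two views have diverged: the live domain still points at the QCOW2 file, while the saved definition points at the DRBD device (a quick check, not from my transcript):

  virsh dumpxml svc1 | grep -A2 '<disk'              # running config: still the qcow2 file
  virsh dumpxml --inactive svc1 | grep -A2 '<disk'   # saved config: /dev/drbd_svc1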

Bring the machine down

svc1# poweroff

With the VM off, create the NBD device and DRBD device on hip1

  root@hip1:/var/lib/libvirt/images# modprobe nbd
  root@hip1:/var/lib/libvirt/images# qemu-nbd -c /dev/nbd0 svc1.qcow2
  root@hip1:/var/lib/libvirt/images# drbdadm create-md svc1
  initializing activity log
  initializing bitmap (4 KB) to all zero
  Writing meta data...
  New drbd meta data block successfully created.
  root@hip1:/var/lib/libvirt/images# drbdadm up svc1
  root@hip1:/var/lib/libvirt/images# drbdadm primary --force svc1

Force is required because a freshly created resource starts out Inconsistent: DRBD has no way to know which side holds the good data, so it fails safe and refuses to become primary until you tell it.

We can start the VM again at this point. We've only been down for seconds.

root@hip1:/# virsh start svc1

The service is running and that's the last downtime it'll need to take for the rest of the process.

Let's bring up DRBD on pollux and tell both hosts that it's okay for there to be two primaries

  root@pollux:/# drbdadm create-md svc1
  root@pollux:/# drbdadm up svc1
  # it comes up secondary and starts to sync
  root@hip1:/# cat /proc/drbd
  version: 8.4.11 (api:1/proto:86-101)
  srcversion: 96ED19D4C144624490A9AB1

   1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent A r-----
      ns:20784 nr:0 dw:0 dr:20784 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:10464976
          [>....................] sync'ed:  0.3% (10216/10240)M
          finish: 0:58:41 speed: 2,968 (2,968) K/sec

  # the final speed ended up around 110Mbyte/s
  # ... once we're caught up:

  root@hip1:/# drbdadm net-options --protocol=C --allow-two-primaries svc1
  root@pollux:/# drbdadm net-options --protocol=C --allow-two-primaries svc1
  root@pollux:/# drbdadm primary svc1
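
Before migrating, it's worth confirming that both sides really do report Primary/Primary and UpToDate/UpToDate (a check, not a transcript):

  cat /proc/drbd   # the resource line should now read ro:Primary/Primary ds:UpToDate/UpToDate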

Now that they are both writable we can perform the live migration:

  root@hip1:/# virsh migrate --live --domain svc1 --undefinesource \
    --persistent --desturi qemu+ssh://192.168.32.11/system

I was logged into the machine while this happened and it's still magic to see a machine move from one host to another without dropping your SSH session.

Also let's demote the old co-primary and tell DRBD not to allow co-primaries again until we're preparing for more migrations at some future point:

  root@hip1:/# drbdadm secondary svc1
  root@hip1:/# drbdadm adjust svc1 # reload config from disk
    # which doesn't include "allow-two-primaries"
  root@pollux:/# drbdadm adjust svc1

You mentioned foreshadowing?

After a couple of hours I'd moved every machine off hip1 and it was ready to start its new life as castor. I made sure I had a remote console session via the integrated “Lights Out” controller and rebooted, knowing it would prefer the USB drive I installed when I was at the datacentre. I'd pre-configured the new hostname and network configuration onto the pen and, indeed, the machine came straight up as castor.

Except something was really wrong with its networking: some hosts could reach it, but others could not. I suspected the proxy_arp settings, because these machines perform ARP on behalf of their VMs, but then I got a hint in dmesg:

[111973.183911] br1: received packet on eno1 with own address as source address (addr:5a:03:90:d2:ba:97, vlan:0)
[111973.315769] br1001: received packet on eno1.1001 with own address as source address (addr:6a:a5:62:be:31:66, vlan:0)
[111973.695777] br1003: received packet on eno1.1003 with own address as source address (addr:2e:01:3c:88:51:22, vlan:0)

What? I was lucky to find this bug report against systemd (closed as intended behaviour: classic!). Even though I don't use systemd-networkd to configure my network, systemd's default link policy still assigns each bridge a MAC address derived from /etc/machine-id. I had neglected to change this file when I cloned the USB boot drives. A quick random change to the ID, a reboot, and everything was just fine.
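
For the record, the tidy way to re-randomize a cloned image is something like this (a sketch; at the time I just hand-edited the file):

  rm /etc/machine-id
  systemd-machine-id-setup   # writes a fresh random ID
  reboot                     # bridges come back with MAC addresses derived from the new ID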

Wrapping up

I won't go through the steps, but I had to recreate these resources on the new castor persona of this machine, as I had converted the existing disks to HBA mode and (intentionally) lost all of their data. They are configured through LVM, just like on pollux, and because the DRBD device names are stable between the two machines I can live migrate to my heart's content with zero downtime. The dream!

And if one of these machines should unexpectedly die, I can take a manual action to bring its VMs up on the other host. It's still costing us 2 × the storage, but as I'm not using RAID1 I am saving the equivalent amount there. Yes: RAID is for uptime rather than data safety, so without it these VMs are exposed to a greater chance of an outage when a single disk dies, but if that happens, recovery is simple.
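
That manual action is short. Something like this on the surviving host (a sketch, assuming the dead peer is confirmed down and that a copy of the domain XML has been stashed somewhere the survivor can read; the path is hypothetical):

  drbdadm primary svc1                          # promote the surviving replica
  virsh define /root/domain-backups/svc1.xml    # hypothetical stashed copy of the domain XML
  virsh start svc1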

For an enthusiast-oriented hosting company I think it's better to take a (rare) downtime because of a disk failure than to suffer the loss of your VM because of a machine failure.

Thank you for reading this far.
