High-Availability KVM Virtualization on Ubuntu 12.04, 12.10, 13.04

It took me quite some time to set this up, since the documentation on the subject is rather poor, so I decided to write this how-to. It is not meant to be a beginner's guide but a guide to setting things up on Ubuntu 12.04, since I found some misleading web pages out there. You need to know what the basic commands do and how to set up a network bridge, but you can look that up online or in the man pages.

We will be setting up a 2-node "failover" cluster which is pretty basic once you connect the dots. The cluster packages we shall use are pacemaker, cman, corosync, drbd and ocfs2. We assume there's no SAN available, so DRBD takes care of disk synchronization between the nodes.

1. Install Ubuntu server 12.04 LTS, 12.10 or 13.04

Do it on both nodes as you normally would, except keep your root partition small (up to about 10GB), as you will need space to set up DRBD volumes. I suggest you use LVM partitioning since that's really simple to manage later. Enable only the "OpenSSH server" when asked - we will install the necessary packages in the next step.

2. Install required packages

For clustering, network monitoring and virtualization:

    sudo apt-get install lsscsi sysstat iftop iptraf iputils-arping bridge-utils ifenslave drbd8-utils drbdlinks qemu-kvm ubuntu-vm-builder libvirt-bin virt-viewer ntp pacemaker pacemaker-mgmt corosync openais cman resource-agents fence-agents

In case you also need a shared read-write volume between the two nodes, you will also need a cluster file system. Here we use ocfs2, but you might go for gfs or something else. If you want to set up ocfs2, install these packages:

    sudo apt-get install ocfs2-tools ocfs2-tools-cman ocfs2-tools-pacemaker

3. Prepare your two servers for clustering

Some basic steps are needed:
- edit your /etc/hosts file so both hosts are set up correctly and can ping each other over the network.
- if you have multiple connections available between the two nodes, you can optionally split the traffic into three groups: services, clustering and storage. Services is where the bridge(s) for the virtual machines reside, clustering is where the cluster nodes communicate, and storage is where the drbd sync runs. If you do that, please note that cman starts cluster communication on the network that matches the nodes' hostnames. Node names in /etc/cluster/cluster.conf and /etc/drbd.d/*.res must match the system hostnames, which must also be listed in /etc/hosts. For example:
 ------------/etc/hosts example:

192.168.10.11     nodeA
192.168.10.12     nodeB
192.168.11.11     nodeA-clus
192.168.11.12     nodeB-clus
192.168.12.11     nodeA-stor
192.168.12.12     nodeB-stor

--------------------------------------
  If the hostnames of your nodes are nodeA and nodeB, then your /etc/cluster/cluster.conf also uses these names, so cluster communication goes over the 192.168.10.0 network.
  You can use "alias" parameters within /etc/cluster/cluster.conf, or simply name your nodes nodeA-clus and nodeB-clus, to make the cluster communicate over the 192.168.11.0 network. Remember that other machines reach your nodes by the names you put in your DNS; /etc/hosts is local to each node and overrides DNS there.

- set up your interfaces accordingly (don't forget to set up the network bridge for virtualization),
- create and exchange root ssh keys between the servers using sudo, ssh-keygen and ssh-copy-id
- create lvm volumes to be used for drbd and to run virtual machines on - the volumes must be of the same size on both nodes.
- create a symlink if you will use ocfs2 :
    sudo ln -s /usr/sbin/ocfs2_controld.cman /sbin/ocfs2_controld.cman
- set some defaults in cman:
    echo "CMAN_QUORUM_TIMEOUT=0" | sudo tee -a /etc/default/cman
- set some defaults in o2cb:
    echo "O2CB_STACK=cman" | sudo tee -a /etc/default/o2cb
- edit /etc/default/o2cb accordingly to your setup:
    O2CB_ENABLED=false
    O2CB_BOOTCLUSTER=<ocfs2_cluster_name>
:: we can disable o2cb,ocfs,drbd services at bootup since the cluster starts them (use update-rc.d -f <service> remove) ::
:: the <ocfs2_cluster_name> is what you will set up in /etc/ocfs2/cluster.conf - it's not used anywhere else in the cluster as far as I know. ::
- set up /etc/ntp.conf as it fits to your environment.
- fix /etc/init.d/o2cb by replacing the line ``killproc "$DAEMON" $SIGNAL`` with ``kill -HUP $PID || sleep 5 && kill -HUP $PID || kill -9 $PID``
- lengthen the time that virtual machine has to shutdown safely inside /etc/init/libvirt-bin.conf
- time to reboot your two servers and double-check that everything is set up.
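
Since the names you choose decide which network the cluster talks over, it is worth checking what each name maps to on both nodes. The following is only a minimal sketch: hosts_ip is a helper name I made up (it is not part of any package), and it reads /etc/hosts directly, so it does not consult DNS.

```shell
# hosts_ip NAME [FILE]: print the first address that FILE maps to NAME.
# hosts_ip is a hypothetical helper, not part of any package.
# FILE defaults to /etc/hosts; DNS is deliberately not consulted.
hosts_ip() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                      # drop comments
        { for (i = 2; i <= NF; i++)             # field 1 is the address
              if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}
```

For example, with the /etc/hosts file shown above:

    for n in nodeA nodeB nodeA-clus nodeB-clus nodeA-stor nodeB-stor; do
        echo "$n -> $(hosts_ip "$n")"
    done

The address printed for the name returned by "uname -n" should be on the network you want cman to use.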

4. Set up the DRBD volumes

Some configuration first (on both nodes):
/etc/drbd.d/global_common.conf :

global {
    usage-count no; # you may leave this as default "yes" if you want to.
}

common {
    protocol C;

    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    startup {
        wfc-timeout 40;
        degr-wfc-timeout 240;
    }

    disk {
        # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
        # no-disk-drain no-md-flushes max-bio-bvecs
    }

    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        data-integrity-alg sha1;
    }

    syncer {
        verify-alg sha1;
        rate 100M;
    }
}

Let's first set-up a volume for ocfs2 on both nodes and I'll just call it "shared" like this:
    sudo lvcreate -n shared -L 20G vg_nodeA
    sudo ssh nodeB "lvcreate -n shared -L 20G vg_nodeB"


Now for the resource definition file (on both nodes):
/etc/drbd.d/shared.res :

resource shared {
    meta-disk internal;
    device /dev/drbd0;
    disk {
        fencing resource-and-stonith;
    }
    handlers {
#        outdate-peer " "; # kill the other node!
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    net {
        allow-two-primaries;
    }
    startup {
#        become-primary-on both;   #uncomment this line after the sync has completed and ocfs2 was created!
    }
    on nodeA {
        disk /dev/vg_nodeA/shared;
        address    192.168.12.11:7790;
    }
    on nodeB {
        disk /dev/vg_nodeB/shared;
        address 192.168.12.12:7790;
    }
}


Time to initialize the DRBD resource called "shared" on the two nodes:
    sudo service drbd start
    sudo drbdadm create-md shared
    sudo drbdadm up shared



The next command has to be run on only one of the nodes:
    sudo drbdadm -- --overwrite-data-of-peer primary shared
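
The initial sync takes a while, and DRBD 8.x reports its progress in /proc/drbd. As a minimal sketch (drbd_field and wait_uptodate are names I made up for this post), the state of a device can be read like this. It assumes the one-line "N: cs:... ro:... ds:..." layout that DRBD 8.3 prints, so adjust it if your version differs.

```shell
# drbd_field MINOR KEY [FILE]: print one field (cs, ro or ds) of a DRBD device.
# Hypothetical helper; FILE defaults to /proc/drbd.
drbd_field() {
    awk -v minor="$1" -v key="$2" '
        $1 == (minor ":") {                     # e.g. " 0: cs:Connected ro:..."
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }
    ' "${3:-/proc/drbd}"
}

# wait_uptodate MINOR: block until both sides of the device are UpToDate.
wait_uptodate() {
    until [ "$(drbd_field "$1" ds)" = "UpToDate/UpToDate" ]; do
        sleep 10
    done
}
```

Running "wait_uptodate 0" returns once the "shared" resource (/dev/drbd0) is fully synced on both nodes.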

Next we create a volume for the virtual machine - we start by creating an lv for it:
    sudo lvcreate -n vm1 -L 10G vg_nodeA
    sudo ssh nodeB "lvcreate -n vm1 -L 10G vg_nodeB"


Now for the resource definition file (on both nodes):
/etc/drbd.d/vm1.res :

resource vm1 {
    meta-disk internal;
    device /dev/drbd1;
    on nodeA {
        address 192.168.12.11:7791;
        disk /dev/vg_nodeA/vm1;
    }
    on nodeB {
        address 192.168.12.12:7791;
        disk /dev/vg_nodeB/vm1;
    }
}
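
Every additional virtual machine needs its own logical volume, its own resource file, a free /dev/drbdN device and a free port. A small generator can keep those files consistent. This is only a sketch: drbd_vm_resource is a hypothetical helper, and the node names, volume groups and storage addresses are the ones used in this post, so change them to match your setup.

```shell
# drbd_vm_resource NAME MINOR PORT: print a resource definition like vm1.res.
# Hypothetical helper; node names, volume groups and addresses follow this post.
drbd_vm_resource() {
    cat <<EOF
resource $1 {
    meta-disk internal;
    device /dev/drbd$2;
    on nodeA {
        address 192.168.12.11:$3;
        disk /dev/vg_nodeA/$1;
    }
    on nodeB {
        address 192.168.12.12:$3;
        disk /dev/vg_nodeB/$1;
    }
}
EOF
}
```

For a second machine you could then run the following on both nodes:

    drbd_vm_resource vm2 2 7792 | sudo tee /etc/drbd.d/vm2.res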


We initialize the volume on both nodes:
    sudo drbdadm create-md vm1
    sudo drbdadm up vm1


The next command has to be run on only one of the nodes (we should run virtual machine installer on that exact host):
    sudo drbdadm -- --overwrite-data-of-peer primary vm1


5. Initialize the cluster


First for some configuration files:
/etc/cluster/cluster.conf (put on both nodes!):

<?xml version="1.0"?>
<cluster config_version="10" name="<pacemaker_cluster_name>">
    <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="-1"/>
    <clusternodes>
        <clusternode name="nodeA" nodeid="1" votes="1">
            <fence>
                <method name="pcmk-redirect">
                    <device name="pcmk" port="nodeA"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="nodeB" nodeid="2" votes="1">
            <fence>
                <method name="pcmk-redirect">
                    <device name="pcmk" port="nodeB"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk"/>
    </fencedevices>
    <cman two_node="1" expected_votes="1">
        <multicast addr="239.10.10.1"/>
    </cman>
    <uidgid uid="admin" gid="admin" />
</cluster>
       

Note: <pacemaker_cluster_name> can be the name of your choice. It may differ from <ocfs2_cluster_name>.
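
Since the node names in cluster.conf must match the system hostnames, it is worth checking them before starting cman. Here is a minimal sketch of such a check. The function names are my own, and the sed pattern assumes one <clusternode ...> tag per line, as in the file above.

```shell
# cluster_node_names [FILE]: list the clusternode names found in cluster.conf.
# Hypothetical helper; assumes one <clusternode ...> tag per line.
cluster_node_names() {
    sed -n 's/.*<clusternode[[:space:]][^>]*name="\([^"]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}"
}

# node_in_cluster_conf NAME [FILE]: succeed if NAME is one of those nodes.
node_in_cluster_conf() {
    cluster_node_names "$2" | grep -qxF -- "$1"
}
```

On each node you could run:

    node_in_cluster_conf "$(uname -n)" || echo "hostname not found in cluster.conf"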


Start cluster on both nodes:
    sudo service cman start && sudo service pacemaker start

Check the status with:
    sudo cman_tool status
    sudo crm_mon

If everything works properly, you can import an initial configuration for the ocfs2 cluster file system.
    sudo crm configure edit
:: paste the following lines into the editor (below the node lines!) - modify if needed, save and exit ::
primitive DLM ocf:pacemaker:controld \
    params daemon="dlm_controld" \
    op monitor interval="120s" \
    meta is-managed="true"
primitive DRBD_shared ocf:linbit:drbd \
    params drbd_resource="shared" \
    operations $id="DRBD_shared-operations" \
    op monitor interval="20" role="Master" timeout="20" \
    op monitor interval="30" role="Slave" timeout="20" \
    meta is-managed="true"
primitive DRBD_vm1 ocf:linbit:drbd \
    params drbd_resource="vm1" \
    operations $id="DRBD_vm1-operations" \
    op monitor interval="20" role="Master" timeout="20" \
    op monitor interval="30" role="Slave" timeout="20" \
    meta is-managed="true"
primitive O2CB ocf:pacemaker:o2cb \
    params stack="cman" \
    op monitor interval="120s" \
    meta is-managed="true"
ms ms_DRBD_vm1 DRBD_vm1 \
    meta master-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"

clone ClusterDLM DLM \
    meta globally-unique="false" interleave="true" target-role="Started"
clone ClusterO2CB O2CB \
    meta globally-unique="false" interleave="true" target-role="Started"
property default-resource-stickiness="100" \
    symmetric-cluster="true" \
    stonith-enabled="false" \
    stonith-action="reboot" \
    expected-quorum-votes="1" \
    no-quorum-policy="ignore"


Note: DLM and O2CB are needed to set up OCFS2 (hence ClusterDLM and ClusterO2CB, since they need to run on both nodes). There are no location or order constraints in place since we don't have the ocfs2 file-system ready.  Properties are set up the way I use them at initialization. Once you have fencing in place you can enable it by changing the value of "stonith-enabled" to "true".

Check if the ClusterDLM and ClusterO2CB are running on both nodes with:
    sudo crm resource show

6. Set up OCFS2

Create a mount point for the ocfs2:
    sudo mkdir -p /mnt/shared

Create the ocfs2 config file (on both nodes):
/etc/ocfs2/cluster.conf :

node:
    ip_port = 7777
    ip_address = 192.168.10.11
    number = 0
    name = nodeA
    cluster = <ocfs2_cluster_name>
node:
    ip_port = 7777
    ip_address = 192.168.10.12
    number = 1
    name = nodeB
    cluster = <ocfs2_cluster_name>
cluster:
    node_count = 2
    name = <ocfs2_cluster_name>


Time to format the "shared" volume using ocfs2:
    sudo mkfs.ocfs2 -N 2 -L ocfs2_drbd0 /dev/drbd0

Note: this is when you can uncomment the commented line in /etc/drbd.d/shared.res and then reload the drbd service by running "sudo service drbd reload". The resource can now become primary on both nodes.

Next, add the resource definitions of ocfs2 to the pacemaker configuration:
    sudo crm configure edit
:: paste the following lines into the editor - modify if needed, save and exit ::
primitive OCFS2_shared ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/shared" directory="/mnt/shared" fstype="ocfs2" options="rw" \
    meta is-managed="true"
ms ms_DRBD_shared DRBD_shared \
    meta master-max="2" clone-max="2" notify="true" target-role="Master"
clone ClusterOCFS2_shared OCFS2_shared \
    meta interleave="true" ordered="true" target-role="Started"
colocation col_DLM_DRBD_shared inf: ClusterDLM ms_DRBD_shared:Master
colocation col_O2CB_DLM inf: ClusterO2CB ClusterDLM
colocation col_OCFS2_O2CB inf: ClusterOCFS2_shared ClusterO2CB
order o_DLM_O2CB inf: ClusterDLM ClusterO2CB
order o_DRBD_shared_DLM inf: ms_DRBD_shared:promote ClusterDLM
order o_DRBD_shared_OCFS2 inf: ClusterO2CB ClusterOCFS2_shared


In case you need to reconfigure the ocfs2 to use cman stack:
    sudo crm resource stop ClusterOCFS2_shared
    sudo fsck.ocfs2 /dev/drbd0
fsck.ocfs2 1.6.4
[RECOVER_CLUSTER_INFO] The running cluster is using the classic o2cb stack, but the filesystem is configured for the cman stack with the cluster name quantal.  Thus, fsck.ocfs2 cannot determine whether the filesystem is in use.  fsck.ocfs2 can reconfigure the filesystem to use the currently running cluster configuration.  DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.  Recover cluster configuration information the running cluster? <n> y

Re-check /etc/default/o2cb if it includes "O2CB_STACK=cman". Reboot the two nodes.

7. Install a virtual machine

Remember the drbd volume "vm1"? Here's where it comes into play!

    export VM_NAME=vm1
    export DISK=/dev/drbd/by-res/$VM_NAME
    export RAM=2048
    export CPU=1
    export CDROM=/path_to/installation/file.iso

    export OS_TYPE=linux # can be windows - check man page for available os types
    export OS_VARIANT=rhel5.4  # check man page for variations
    export NET0=vbr0 # name of the bridge interface
    export VNC_PORT=5901    # increment for each new virtual machine

    sudo virt-install --connect qemu:///system \
    --name $VM_NAME \
    --ram $RAM \
    --arch x86_64 \
    --vcpus $CPU \
    --cdrom $CDROM \
    --os-type=$OS_TYPE \
    --os-variant=$OS_VARIANT \
    --disk=$DISK \
    --graphics vnc,port=$VNC_PORT \
    --network bridge=$NET0


Finish the installation as you normally would. Copy the VM config file /etc/libvirt/qemu/vm1.xml to the other node!

Add the resources into pacemaker configuration:
    sudo crm configure edit
:: paste the following lines into the editor - modify if needed, save and exit ::
primitive VM_vm1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm1.xml" \
    hypervisor="qemu:///system" \
    op start timeout="120s" \
    op stop timeout="120s" \
    op monitor depth="0" timeout="30" interval="10" \
    meta allow-migrate="false"
colocation col_DRBD_vm1 inf: VM_vm1 ms_DRBD_vm1:Master
order ord_DRBD_vm1 inf: ms_DRBD_vm1:promote VM_vm1:start


Note: if you use a dual-primary configuration for the drbd resource "vm1", you can allow live migration by changing the "allow-migrate" value to "true". This gains you nothing HA-wise when hardware fails, but you can live-migrate machines from one node to the other when you need to (hardware or software maintenance). It can produce more problems than benefits in split-brain scenarios.

8. Check to see if everything works

    sudo crm_mon

9. Final words


I hope this post helps you with setting up a pacemaker cluster on Ubuntu 12.04, 12.10 or 13.04 server. You really should set up stonith and enable fencing afterwards, since it is crucial to the well-being of the ocfs2 file system in case of a communication breakdown or other hardware failure. Also make sure you use redundant network connections and protect your drbd volumes with RAID.
There are plenty of how-to-s out there that help you set-up kvm/xen/other virtualization, and even more on how to properly set up networking on ubuntu. That is the sole reason why I didn't write about their configuration.
For more in-depth knowledge (recommended!) go read the manuals: DRBD 8.3 and Pacemaker