TL;DR: Poor man’s private cloud network without proper IPAM or DNS integration, and with a severed cluster firewall on Proxmox VE. As simple as it gets.
I had a dream of trying out redundant storage on a private cloud (as in VPS hosting). It took a year and a half to comprehend the vocabulary, another year and a half to try KVM out on a Fedora workstation, and yet another year and a half of building the cluster, but at last I have arrived at a PoC.
The only things I would complain about are:
- Fast-dying X540-AT2 PCIe network cards
- Lack of PCIe lanes for NVMe storage, but who doesn't lack those!
- Oh, and CEPH crawls with 2.5-inch SATA SSDs on a mere 5 nodes.
- Not having a use for it (🤡).
Mind Map
Let me explain briefly; putting it in a paragraph would only complicate it further:
- I want to have a private VPS hosting for fun. And frankly, this was fun.
- I want High Availability to be able to brag, as I cannot yet afford a big truck.
- I want “Hyper-Converged Infrastructure” (HCI) to have a smaller number of computers and thus less expense.
- Handing off a virtual machine (VM) (see: 2) from a non-operational host (e.g. a hypervisor with power and/or cluster network issues) to another requires not having strong ties to the aforementioned host. This includes virtual storage and network devices, as well as passed-through PCIe and/or USB devices on the previous host.
- I need shared storage (see: 4), or frequent enough snapshot synchronization of the current VM state, to have something failure resilient. I don’t want to lose the tiny bits of data within that unrealized last synchronization cycle, so shared storage it is.
- The alternative might prove faster, but is much more complex anyway. “If storage gets stuck, freeze the cluster!” is consistent enough for me.
- Shared storage must be HA itself, or it is a single point of failure.
- It could be totally external or hosted beside the compute, I have gone for the latter (see: 3).
- I do not differentiate between block and file storage, but am aware block storage should be faster for VM disks than qcow2 on a filesystem on the same hardware.
- CEPH on Proxmox VE is officially supported and checks all the boxes, bells and whistles included. And it is cool, I mean it! Its several exporters, like S3, iSCSI, NFS, etc., are not my focus for now.
- I don’t want to use FRR for CEPH network, as it was not stable in my case. Connections were dropping and then coming back, however, I do not have an actual need for self-healing networking for distributed storage, especially when it may drive performance further down.
- Post Catalogus: XOSTOR is XenOrchestra’s official solution based on Linstor’s DRBD, but that is a total overhaul of my current ecosystem and is for another year.
- Begin gossip. These two European virtualization technologies are kind of converging, with Proxmox’s Datacenter Manager and Vates’ XOLite. Both already have SDN, and with Veeam supposedly supporting both of them, it’s a tie. End gossip.
- I need networking to be shared as well (see: 4). SDN is the way to do it on PVE. I am not yet aware of XO’s equivalent; it is probably tidier.
- FRR is used in the official docs, so be it.
- I had set up Simple SDN, and with proper firewall rules, DHCP and DNS work out of the box.
- On FRR, I couldn’t get DHCP to work, although static IP addresses with the gateway at the first usable IP do work.
- Using a virtual router can circumvent this inconvenience; however, it undermines the granularity of the SDN firewall in the meantime. Firewall management becomes a per-subnet endeavour with separate web panels. (*Insert “Community - I will allow it” meme here*.)
- pfSense and OPNsense are the usual suspects, but they are BSDs (not in a derogatory sense [bu-dum-tss!]). Their support for the VirtIO network device is not as terrific as Linux’s, constraining them to a single gigabit AFAIAA.
- Unless one is dedicated enough to set up link aggregation on virtual NICs. But come on, I am not that bored!
- OpenWRT is Linux based, supports VirtIO network devices out of the box, and has the qemu-ga OPKG package for the guest agent, which does indeed show an IP address on the web panel of the respective VM (install sketch after this list).
- My 10GBase-T 5-node mesh had previously sustained multi-gigabit inter-VM iperf3 tests without jumbo packets. Also, 2.5G equipment is cheap enough to be used for the VM network, with a 2.5G switch for uplink.
- So, OpenWRT it is.
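For the record, getting the guest agent onto the OpenWRT template is about as simple as it sounds; a minimal sketch:

# on the OpenWRT VM: install the QEMU guest agent so Proxmox can report the IP
opkg update
opkg install qemu-ga
# also enable the "QEMU Guest Agent" option on the VM's Options tab in Proxmox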
Some Crumbs
This has taken me some time, so much so that I would spend a lot of time finding all the sources and re-doing it for a step-by-step tutorial. So, unlike the usual, I will leave some crumbs for the eager to follow. This is a huge disservice to my documentation purposes, but a service is better than no service. So this shall be the exception that proves the rule.
SDN via FRR
I followed the official SDN docs, but what got me to a working FRR setup was Bennet Gallein’s blog post. Note that, after changing the EVPN settings on the Proxmox web UI, the file’s contents get overwritten; for instance, the IP addresses of the VTEP neighbours. I do not have a solid understanding of FRR, and it seems to require a ton of networking knowledge that I currently lack. I will try to avoid this rabbit hole for some more time.
I have also come across a question on the Proxmox Forum by cheabred which was similar enough to my target. He has shared his solution there; however, I haven’t tested it.
Static Routing
Having the networks reachable only from within the cluster is fine. Yet, setting a static route on my homelab router (e.g. 10.50.0.0/16 via 192.168.60.41) is better; this way I can access the VMs on the cluster. However, this has a catch: I now have a single point of failure for access into the SDN.
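On a plain Linux box the equivalent route looks like this; a quick sketch using the same next-hop as above:

# send the whole SDN overlay range through the Proxmox node
ip route add 10.50.0.0/16 via 192.168.60.41
# to make it persistent with Debian-style ifupdown, a post-up line does the job:
#   post-up ip route add 10.50.0.0/16 via 192.168.60.41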
The reason for this suboptimal setup is that my simple TP-Link router does not support BGP or OSPF. Someone with enough know-how could add a router/switch to the FRR network of the hypervisors. I had indeed set up FRR on fresh installs of non-clustered, separate nodes, and could still run iperf3 between the separate nodes’ respective local VMs.
For this static routing into the overlay network to work, an Exit Node should be specified under the SDN EVPN settings. This is not necessary for the ideal OSPF or BGP setup, which I currently cannot achieve out of a personal shortcoming.
BTW, yes, I am using IPv4, not my self-claimed IPv6 range. Apalrd has a fine tutorial on FRR/OSPF/IPv6.
OpenWRT
I have found a thorough ComputingForGeeks tutorial for setting up OpenWRT on Proxmox.
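The gist of it, as a rough sketch; the VMID, the image release, and the storage name ceph-vm are placeholders of mine, not from the tutorial:

# grab an x86_64 combined image (adjust the release number)
wget https://downloads.openwrt.org/releases/23.05.5/targets/x86/64/openwrt-23.05.5-x86-64-generic-ext4-combined.img.gz
gunzip openwrt-23.05.5-x86-64-generic-ext4-combined.img.gz
# template VM with one NIC on the shared vnet and one on the per-router vnet
qm create 9000 --name openwrt-template --memory 256 --cores 1 \
  --net0 virtio,bridge=vxnet001 --net1 virtio,bridge=vxnet254
# import the image and boot from it (use the volume ID that importdisk prints)
qm importdisk 9000 openwrt-23.05.5-x86-64-generic-ext4-combined.img ceph-vm
qm set 9000 --scsihw virtio-scsi-pci --scsi0 ceph-vm:vm-9000-disk-0 --boot order=scsi0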
I have used two EVPN networks: one shared, with a gateway specified, used only by the OpenWRT instances and other static servers (i.e. vxnet001); and a second one which is specific to each virtual router (i.e. vxnet254). This is the configuration for the template VM, and it shall be updated for each new clone, with its respective EVPN network getting created (see the sketch after this list).
- vxnet001:
  - CIDR: 10.50.1.0/24
  - GW: 10.50.1.1
  - OpenWRT IP: 10.50.1.254
- vxnet254:
  - CIDR: 10.50.254.0/24
  - GW: 10.50.254.1
  - OpenWRT IP: 10.50.254.1
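Creating the per-clone vnet and its subnet can be clicked together in the web UI, or scripted roughly as below. This is a sketch from memory of the SDN API; the zone name evpnzone and the VNI tag are assumptions of mine, so double-check the parameter names with pvesh usage before trusting it:

# new vnet in the EVPN zone for the next clone
pvesh create /cluster/sdn/vnets --vnet vxnet002 --zone evpnzone --tag 2
# its subnet, with the gateway at the first usable IP
pvesh create /cluster/sdn/vnets/vxnet002/subnets --subnet 10.50.2.0/24 --type subnet --gateway 10.50.2.1
# apply the pending SDN configuration
pvesh set /cluster/sdn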
Then, to set the IP addresses of the OpenWRT template VM, I followed Network Configuration on the OpenWRT docs. I also added DNS:
option dns '1.1.1.1 8.8.8.8'
option peerdns '0'
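The same can be done from the shell with UCI instead of editing the file; a quick sketch against the wan interface named above:

# drop peer-provided DNS on wan and pin public resolvers
uci set network.wan.peerdns='0'
uci add_list network.wan.dns='1.1.1.1'
uci add_list network.wan.dns='8.8.8.8'
uci commit network
service network restart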
However, having set up static routing to the whole /16 SDN subnet, I can directly access the WAN network of the virtual routers (e.g. 10.50.1.254), which does not have access enabled on OpenWRT by default. Since I do not want to use a desktop VM to configure my subnets all the time, I have enabled it by following jwmullally’s instructions on an OpenWRT Forum topic.
Below is his answer:
uci add firewall rule
uci set firewall.@rule[-1].name='Allow-Admin'
uci set firewall.@rule[-1].enabled='true'
uci set firewall.@rule[-1].src='wan'
uci set firewall.@rule[-1].proto='tcp'
uci set firewall.@rule[-1].dest_port='22 80 443'
uci set firewall.@rule[-1].target='ACCEPT'
uci commit firewall
service firewall restart
CEPH
I have set up CEPH with several configurations. Here are my findings:
- Using NVMe is the fastest (duh!), it just is.
- Using NVMe as a Write-Ahead Log (WAL) for the SATA SSDs does not affect reads that much, but random writes are more than 10 times faster. It does, however, bind all OSDs on a host to a single cache disk (see the sketch after this list).
- Using NVMe as DB for the SATA SSDs was indistinguishable for me, and it also binds all OSDs on a host to a single cache disk.
- Either way, sequential read and write speeds of a single disk are cut to 1/3 or less on CEPH.
- Reads can be stupidly fast, but I ignore that, as I was not hitting the cache as much when testing with 4 VMs running KDiskMark.
- The single-VM KDiskMark test did not differ from the 4-VM one. Performance is not at its peak, but it seems consistent enough for my case.
- I have not tried to saturate the disks/network by pushing CEPH to its limits. This was not my priority; seeing maximum disk performance butchered to HDD levels was a good enough finding for my taste.
- CEPH will freeze silently if the underlying network breaks silently. My mesh had bad connections; this should be much more visible on a star topology with a switch. VMs would sometimes not boot, disks could not be moved to other storage, etc. I think there were warnings on the CEPH panel of the Proxmox web UI, but those warnings made me no wiser. I ordered some more 10G cards to repair the 4th and 5th nodes. With non-faulty NICs CEPH is solid; I can’t blame it just because I am cheap.
- I have ~100MB/s sequential write speed on ~450MB/s SSD disks, and ~340MB/s sequential write speed on ~1500MB/s NVMe disks.
- I have ~80MB/s sequential writes on the same setup with Erasure Coding of k=3, m=2. Although this is slower than replicating across 3 nodes, it has higher usable storage (60% instead of 33.3%), and with a failure domain of host it should ensure all 5 nodes hold a bit of the data. I believe this is worth the trade-off.
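For reference, pinning an OSD’s WAL and/or DB onto the NVMe happens at OSD creation time. A minimal sketch with hypothetical device names; I recall pveceph accepting these options, but double-check against pveceph help osd create:

# SATA SSD OSD with its write-ahead log on the NVMe
pveceph osd create /dev/sda -wal_dev /dev/nvme0n1
# or put the whole RocksDB (which contains the WAL) there instead
pveceph osd create /dev/sdb -db_dev /dev/nvme0n1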
To configure Erasure Coding, as stated under the Proxmox docs on Erasure Coded Pools, one needs to use the CLI. Below is an example which also sets an OSD device-class constraint on the rules:
pveceph pool create ec32_nvme --application rbd --erasure-coding k=3,m=2,device-class=nvme,failure-domain=host --pg_autoscale_mode on --pg_num 128
pveceph pool create ec32_ssd --application rbd --erasure-coding k=3,m=2,device-class=ssd,failure-domain=host --pg_autoscale_mode on --pg_num 128
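To check what actually got created (pveceph pairs each erasure-coded data pool with a replicated metadata pool for RBD), something like:

# list pools as Proxmox and CEPH see them
pveceph pool ls
ceph osd pool ls detail
# confirm the device-class and host failure domain in the crush rules
ceph osd crush rule dump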
This way, I was able to use the NVMe class separately, at triple the speed. Using the NVMe drives as WAL would instead increase the SSD class’s performance considerably. I have not yet decided which to use.
HA and Migration
I haven’t yet gotten to this stage on this cluster. However, setting up migration groups and running the OpenWRT virtual routers alongside the VMs in an HA configuration should allow for seamless failure resiliency. For anything performance hungry, I may end up re-using one of the nodes as plain storage on the mesh. A Gen4 SSD is capable of ~7GB/s of throughput, which 4x 10Gbps links cannot realistically saturate. If I am crazy enough, I could virtualize that storage machine to host further VMs on the cluster, but that would probably butcher CEPH latency and drive its performance further down.
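For the HA part, a rough sketch of what I have in mind; the group name, VM IDs and node priorities below are made up, and ha-manager is just the CLI behind the Datacenter HA panel:

# prefer two nodes for the virtual routers, but allow failover elsewhere
ha-manager groupadd routers --nodes "n1:2,n2:2,n3:1"
# enroll an OpenWRT router and a regular workload VM into HA
ha-manager add vm:101 --group routers --state started --max_relocate 1
ha-manager add vm:201 --state started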
So it is still possible to mix redundant and performant setups on a mere 5 nodes. But CEPH definitely does not shine at this. If I needed performance, I would look into what TrueNAS Scale offers for a real-time active-passive setup on ZFS.
Some Config Files
- Proxmox node n1.lab.bug.tr (by the way, I got a new domain):
  - Directory /etc/systemd/network/ has the previously blogged systemd-network renaming:
    00-eni0.link 01-eni1.link 10-enx0.link 11-enx1.link 12-enx2.link 13-enx3.link
  - /etc/network/interfaces:
  - /etc/frr/frr.conf:
frr version 8.5.2
frr defaults datacenter
hostname n1
log syslog informational
service integrated-vtysh-config
!
!
vrf vrf_evpn
 vni 10000
exit-vrf
!
router bgp 65000
 bgp router-id 192.168.60.41
 no bgp hard-administrative-reset
 no bgp default ipv4-unicast
 coalesce-time 1000
 no bgp graceful-restart notification
 neighbor VTEP peer-group
 neighbor VTEP remote-as 65000
 neighbor VTEP bfd
 neighbor 192.168.60.42 peer-group VTEP
 neighbor 192.168.60.43 peer-group VTEP
 neighbor 192.168.60.44 peer-group VTEP
 neighbor 192.168.60.45 peer-group VTEP
 !
 address-family ipv4 unicast
  import vrf vrf_evpn
 exit-address-family
 !
 address-family ipv6 unicast
  import vrf vrf_evpn
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor VTEP activate
  neighbor VTEP route-map MAP_VTEP_IN in
  neighbor VTEP route-map MAP_VTEP_OUT out
  advertise-all-vni
 exit-address-family
exit
!
router bgp 65000 vrf vrf_evpn
 bgp router-id 192.168.60.41
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  default-originate ipv4
  default-originate ipv6
 exit-address-family
exit
!
ip prefix-list only_default seq 1 permit 0.0.0.0/0
!
ipv6 prefix-list only_default_v6 seq 1 permit ::/0
!
route-map MAP_VTEP_IN deny 1
 match ip address prefix-list only_default
exit
!
route-map MAP_VTEP_IN deny 2
 match ipv6 address prefix-list only_default_v6
exit
!
route-map MAP_VTEP_IN permit 3
exit
!
route-map MAP_VTEP_OUT permit 1
exit
!
line vty
- OpenWRT VM for a clone with vxnet002:
  - /etc/config/network:
config interface 'loopback'
    option device 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'

config globals 'globals'
    option ula_prefix 'fd98:05b5:fd38::/48'

config device
    option name 'br-wan'
    option type 'bridge'
    list ports 'eth0'

config interface 'wan'
    option device 'br-wan'
    option proto 'static'
    option ipaddr '10.50.1.2'
    option gateway '10.50.1.1'
    option netmask '255.255.255.0'
    option ip6assign '60'
    option peerdns '0'
    list dns '1.1.1.1'
    list dns '8.8.8.8'

config device
    option name 'br-lan'
    option type 'bridge'
    list ports 'eth1'

config interface 'lan'
    option device 'br-lan'
    option proto 'static'
    option ipaddr '10.50.2.1'
    option netmask '255.255.255.0'
    option ip6assign '60'
Bibliography
- Proxmox VE Docs: SDN
- Bennet Gallein: Setting Up EVPN on Proxmox SDN: A Comprehensive Guide
- Proxmox Forum: Mesh Ceph Network with FRR Openfabric - Route question by cheabred
- Apalrd: Fully Routed Networks in Proxmox! Point-to-Point and Weird Cluster Configs Made Easy
- GitHub: Compiling iPXE binaries trusting your SSL certificate Failed
- OpenWRT Docs: Network Configuration
- OpenWRT Forum: jwmullally’s answer to Access web interface on WAN
- Proxmox VE Docs: Erasure Coded Pools