tl;dr I got bit by an interface naming change (bug?) https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change. Network didn’t come back up after reboot and I spent a long time figuring it out.
Here’s the longer version about the outage on August 24, 2021:
After finishing the package upgrades on my Proxmox hosts for the new release (Proxmox 7.0, corresponding to Debian 11/bullseye), I typed reboot and pressed enter, crossing my fingers that it would come back up as expected.
It didn’t.
Luckily I had done one last round of VM-level backups before starting the upgrade! I started restoring the backups to one of my other servers, but my authoritative DNS is hosted on the same server as tilde.team, so DNS had to come back first.
I got ns1 set up on my Proxmox node at Hetzner, but my ns2 secondary zones had been hosted at OVH. Time to move those to he.net to get it going again (and move away from a provider-dependent solution).
While shuffling VMs around, I ended up starting a restore of the tilde.team VM on my infra-2 server at OVH. It’s a large VM with two 300 GB disks, so it would take a while.
I started working to update the DNS records for tilde.team to live on OVH instead of my soyoustart box, but shortly after, I received a mail (in my non-tilde inbox, luckily) from the OVH monitoring team saying that my server had been rebooted into rescue mode after being unpingable for so long.
I was able to log in with the temporary SSH password and update /etc/network/interfaces to use the currently working MAC address that the rescue system was using.
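For anyone hitting the same thing, here is a rough sketch of what pinning a bridge MAC in /etc/network/interfaces can look like. It is not the actual config from this server: the interface names (eno1, vmbr0), the addresses (from the documentation range 203.0.113.0/24), and the MAC are all placeholders. The idea is to set hwaddress on the bridge to the MAC the provider's network expects, which in a rescue session can be read from the output of `ip link`.

```
# /etc/network/interfaces (excerpt)
# All names, addresses, and the MAC below are placeholders.

auto lo
iface lo inet loopback

# Physical NIC, enslaved to the bridge below
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 203.0.113.10/24
        gateway 203.0.113.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        # Pin the bridge MAC so it does not change after the upgrade
        hwaddress 00:00:5e:00:53:01
```

Setting hwaddress explicitly keeps the bridge from picking up a different generated MAC after the upgrade, which matters on hosts where the provider only routes traffic for a known MAC.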
Once I figured out how to disable the netboot rescue mode in the control panel, I hit reboot once more. We’re back up and running on the server that it was on at the start of the day!
ejabberd wasn’t happy with MySQL for some reason, but everything else seems to have come back up now.
Like usual, holler if you see anything amiss!
Cheers, ~ben