
Update Adventures

tl;dr I got bitten by the Linux bridge MAC-address change (bug?) described at https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Linux_Bridge_MAC-Address_Change. The network didn’t come back up after the reboot and I spent a long time figuring it out.


Here’s the longer version about the outage on August 24, 2021:

After finishing the package upgrades on my Proxmox hosts for the new release (Proxmox 7.0, corresponding to Debian 11/bullseye), I typed reboot and pressed enter, crossing my fingers that it would come back up as expected.

It didn’t.

Luckily I had done one last round of VM-level backups before starting the upgrade! I started restoring the backups to one of my other servers, but my authoritative DNS is hosted on the same server as tilde.team, so that needed to happen first.

I got ns1 set up on my Proxmox node at Hetzner, but my ns2 secondary zones had been hosted at OVH. Time to move those to he.net to get secondary DNS going again (and to move away from a provider-dependent solution).

While shuffling VMs around, I ended up starting a restore of the tilde.team VM on my infra-2 server at OVH. It’s a large VM with two 300gb disks so it would take a while.

I started working to update the DNS records for tilde.team to point at OVH instead of my soyoustart box, but shortly after, I received a mail (in my non-tilde inbox, luckily) from the OVH monitoring team: my server had been rebooted into rescue mode after being unpingable for too long.

I was able to log in with the temporary SSH password and update /etc/network/interfaces to use the MAC address that the rescue system was currently using.
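
The fix from that wiki page boils down to pinning the bridge's MAC with hwaddress in /etc/network/interfaces. Here's a minimal sketch of what the stanza ends up looking like (the bridge name, NIC name, addresses, and MAC below are all placeholders, not my real config):

    auto vmbr0
    iface vmbr0 inet static
        address 203.0.113.10/24
        gateway 203.0.113.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0
        # pin the bridge to the MAC the provider expects (placeholder value),
        # so the newly auto-generated bridge MAC doesn't get filtered upstream
        hwaddress aa:bb:cc:dd:ee:ff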

Once I figured out how to disable the netboot rescue mode in the control panel, I hit reboot once more. We’re back up and running on the same server we started the day on!

ejabberd wasn’t happy with mysql for some reason but everything else seems to have come back up now.

Like usual, holler if you see anything amiss!

Cheers, ~ben


Mastodon PostgreSQL upgrade fun

Howdy friends!

If you’re a mastodon user on tilde.zone (the tildeverse mastodon instance), you might’ve noticed some downtime recently.

Here’s a quick recap of what went down during the upgrade process.

We run the current stable version of PostgreSQL from the postgres apt repos. PostgreSQL 13 was released recently and the apt upgrades automatically created a new cluster running 13.

The database for mastodon has gotten quite large (about 16gb), which complicates this upgrade a bit. This was my initial plan:

  1. drop the 13 cluster created by the apt package upgrades
  2. upgrade the 12-main cluster to 13
  3. drop the 12 cluster
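
In terms of Debian's postgresql-common wrappers, that plan roughly maps to the following (a sketch of the intent, not a transcript of my exact session):

    pg_lsclusters                      # shows both 12/main and the freshly created 13/main
    pg_dropcluster 13 main --stop      # drop the empty 13 cluster the apt upgrade created
    pg_upgradecluster 12 main          # upgrade 12/main into a new 13/main cluster
    pg_dropcluster 12 main --stop      # remove the old 12 cluster once 13 checks out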

These steps appeared to work fine, but closer inspection afterwards led me to discover that the new cluster had ended up with SQL_ASCII encoding somehow. This is not a situation we want to be in. Time to fix it.

Here’s the new plan:

  1. stop mastodon:
    for i in streaming sidekiq web; do systemctl stop mastodon-$i; done
  2. dump current database state:

    pg_dump mastodon_production > db.dump

  3. drop and recreate cluster with utf8 encoding:
    pg_dropcluster 13 main --stop
    pg_createcluster --locale=en_US.UTF8 13 main --start
  4. restore backup:
    sudo -u postgres psql -c "create user mastodon createdb;"
    sudo -u mastodon createdb -E utf8 mastodon_production
    sudo -u mastodon psql mastodon_production < db.dump

I’m still not 100% sure how the new cluster ended up with SQL_ASCII encoding, but it seems that the locale was not set correctly while the apt upgrades were running…
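
If you want to check whether you've been bitten by the same thing, the encoding is easy to inspect from psql (nothing Mastodon-specific about this):

    sudo -u postgres psql -l                          # lists databases with their encodings
    sudo -u postgres psql -c 'SHOW server_encoding;'  # should say UTF8, not SQL_ASCII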

If this happens to you, hopefully this helps you wade out while keeping all your data 🙂


Networking nonsense

I’ve recently been working on setting up Drone CI on the tilde.team machine. However, there’s been something strange going on with the networking on there.

Starting up drone with docker-compose didn’t seem to be working: netstat -tulpn showed the port bound properly to 127.0.0.1:8888, but I was completely unable to get anything out of it (either with curl or through the nginx proxy that was to front it).

I ended up scrapping docker on the ~team box itself and moving it into an LXD container (pronounced “lex-dee”) with nesting enabled.
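
For anyone curious, enabling nesting on an LXD container is a one-liner (the container name and image here are hypothetical):

    lxc launch ubuntu:20.04 drone                 # hypothetical container name/image
    lxc config set drone security.nesting true    # lets Docker run inside the container
    lxc restart drone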

This got us into another problem that I’d seen before when using nginx to proxy to apps running in other containers: requests were dropped intermittently, sometimes hanging for upwards of 30 seconds.

Getting frustrated with this error, I tried to reproduce it on another host. Both the docker-proxy and nginx->LXD proxies worked on the first try, yielding no clues as to where things were going wrong.

In a half-awake stupor last Saturday evening, I decided to try to rule out IPv6 by disabling it system-wide. As is to be expected of sleepy work, it didn’t fix the problem and created more in the process.

Feeling satisfied that the problem didn’t lie with IPv6, I re-enabled it, only to find that I was unable to bind nginx to my allocated /64. I may or may not have ranted a bit about this on IRC but I was able to get it back up and running by restarting systemd-networkd.
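For the record, the "restart" was nothing more exotic than this, plus a sanity check that the /64 addresses were back on the interface (the interface name is a guess):

    systemctl restart systemd-networkd
    ip -6 addr show dev eth0        # check that the addresses from the /64 came back
    systemctl restart nginx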

That one step forward broke something else, and now we were back where we started: the original problem of intermittent hangups to the LXD container.

Seeing my troubles on IRC, jchelpau offered to help dig into the problem with a fresh set of eyes. He noted right away that pings to the containers worked fine over IPv6, but not over IPv4.

We ended up looking at the firewall configuration, only to find that one of the subnets I’d blocked after November’s nmap incident included lxdbr0’s subnet (the bridge device used by LXD).

Now that I made the exception for lxdbr0, everything is working as expected!
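
The exception itself is nothing fancy. I don't remember the exact rule set, but it amounts to accepting traffic on the bridge ahead of the blanket private-range block; with plain iptables that's roughly:

    # accept anything entering or leaving the LXD bridge before the 10.0.0.0/8 drop rules
    iptables -I FORWARD -i lxdbr0 -j ACCEPT
    iptables -I FORWARD -o lxdbr0 -j ACCEPT
    iptables -I INPUT   -i lxdbr0 -j ACCEPT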

Thanks to fosslinux and jchelpau for their debugging help!


RAID nonsense

Last week, I did some maintenance on the tilde.team box. Probably should have written about it sooner but I didn’t make time for it until now.

The gist of the problem was that the images provided by Hetzner default to RAID1 across the available disks. Our box has two 240gb SSDs, which resulted in about 200gb of usable space for /. The installer also defaulted to a huge swap partition, which I deem unnecessary for a box with 64gb of RAM.

The only feasible solution that I’ve found involved using the rescue system and the installimage software to reconfigure the disk partitions.
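
For reference, the relevant knobs live in the installimage config you edit from the rescue system; it ends up looking something like this (values illustrative, not my exact layout):

    SWRAID 1                  # keep software raid enabled...
    SWRAIDLEVEL 0             # ...but as RAID0 instead of the default RAID1
    PART swap  swap   2G      # a small swap instead of the giant default
    PART /boot ext3   1G
    PART /     ext4   all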

deepend recently upgraded to a beefier dedi (more threads and more disk space) and had a bit of contract time left on the old one. He offered to let me use it as a staging box in the meantime while I reinstalled and reconfigured my RAID settings.

I’ve migrated tilde.team twice before (Linode > Woothosting > Hetzner, and now back to Hetzner on the same box) using a slick little rsync invocation that I’ve put together.

rsync -auHxv --numeric-ids \
    --exclude=/etc/fstab \
    --exclude=/etc/network/* \
    --exclude=/proc/* \
    --exclude=/tmp/* \
    --exclude=/sys/* \
    --exclude=/dev/* \
    --exclude=/mnt/* \
    --exclude=/boot/* \
    --exclude=/root/* \
    root@oldbox:/* /

As long as the destination and source boxen are running the same distro/version, you should be good to go after rebooting the destination box!

The only thing to watch out for is running databases. It happened to me this time with mysql: there were three pending transactions left open during the rsync backup, and mysqld kept failing to start after I got the box back up, along with all the other services that depend on it.

Eventually I was able to get mysqld back up and running in recovery mode (basically read-only) and got a mysqldump of all databases. I then purged all existing mysql data, reinstalled mariadb-server, and restored the mysqldump. Everything came up as expected and we were good to go!
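
For posterity, the recovery dance looked roughly like this; the exact innodb_force_recovery level is from memory, so treat the value as an assumption:

    # in the [mysqld] section of the MariaDB config (e.g. 50-server.cnf):
    #     innodb_force_recovery = 4    # start InnoDB in a degraded, effectively read-only mode
    systemctl start mariadb
    mysqldump --all-databases > all-databases.sql
    # then remove innodb_force_recovery, purge /var/lib/mysql, reinstall
    # mariadb-server, and restore:
    mysql < all-databases.sql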

The array is now in a RAID0 config, leaving us with 468gb (not GiB) of available space. Thanks for tuning in to this episode of sysadmin adventures!


November 13 post mortem

We had something of an outage on November 13, 2018 on tilde.team.

I awoke, not suspecting anything to be amiss. As soon as I logged in to check my email and IRC mentions, it became clear.

tilde.team was, at the least, inaccessible and, at worst, down completely. According to the message in my inbox, there had been an attempted “attack” from my IP.

We have indications that there was an attack from your server. Please take all necessary measures to avoid this in the future and to solve the issue.

At this point, I have no idea what could have happened overnight while I was sleeping. The timestamp shows that the mail arrived only 30 minutes after I’d turned in for the night.

When I finally log on in the morning to check mails and IRC mentions, I find that I’m unable to connect to tilde.team… strange, but ok; time to troubleshoot. I refresh the webmail to see what I’m missing, and it fails to find the server entirely. Even stranger! I’d better grab the mails on my phone if they went to my @tilde.team address!

Here, I launch into full debugging mode: what command was it? Who ran it?

Searching each user’s ~/.bash_history was not very successful: nothing I could find was related to “net” or “map”. I had checked sudo grep nmap /home/*/.bash_history and plenty of variations on it.

At this point, I had connected with other ~teammates across other IRC nets (#!, ~town, etc.). Among suggestions to check /var/log/syslog, /var/log/kern.log, and dmesg, I finally decided to check ps. ps -ef | grep nmap yielded nmap running under an obscured uid and gid, which I shortly established to belong to an LXD container I had provisioned for ~fosslinux.
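
For posterity, mapping a mystery uid back to an LXD container looks roughly like this (the PID is a placeholder):

    ps -ef | grep nmap            # note the PID and the large shifted uid
    cat /proc/<PID>/cgroup        # the cgroup path names the LXD container
    grep root /etc/subuid         # shows the uid range handed out to unprivileged containers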

I’m now considering methods of policing access to any site over port 80 and port 443. This is crazy. How do you police nmap when it isn’t scanning on every port?

After a bit of shit-talking and reassurance from other sysadmins, I reexamined things and realized that ~fosslinux had only run nmap against addresses in the 10.0.0.0/8 space. The 10/8 address space is not meant to be routable outside the local network. How could Hetzner have found out about a probe of a private network!?

Finally, after speaking with more people than I expected to in one day, I ended up sending three different emails to Hetzner support, which at last resulted in them unlocking the IP.

It’s definitely time to research redundancy options!


DNS shenanigans post-mortem

Let’s start by saying I probably should have done a bit more research before diving head-first into this endeavor.

I’ve been thinking about transferring my domains off google domains for some time now, as part of my personal goal to self host and limit my dependence on google and other large third-party monstrosities. Along that line, I asked for registrar recommendations. ~tomasino responded with namesilo. I found that they had $3.99 registrations for .team and .zone domains, which is 1/10th the cost of the $40 registration on google domains.

I started out by getting the list of domains from the google console. Two or three of them had been registered within the last 60 days, so I wasn’t able to transfer those just yet. I grabbed all the domain unlock codes and dropped them into namesilo. I failed to realize that the DNS panel on google domains would disappear as soon as each transfer went through, and more importantly that the nameservers would be left pointing to the old, now-defunct google domains ones.

As soon as I realized this error, I updated the nameservers from the namesilo panel. Some of the domains propagated quickly. Others, not so much. tilde.team was still in a state of flux between the old and new nameservers.

In a rush to get the DNS problem fixed, and under recommendation from several people on IRC, I decided to switch the nameservers for tilde.team and tilde.zone to cloudflare, leaving another layer of flux for the DNS to be stuck in…

Of the five domains that I moved to cloudflare, three came back with a DNSSEC error, claiming that I needed to remove the DS record from that zone. D’oh!

I removed the DS records from the affected domains, so we should be good to go as soon as everything propagates through the fickle beast that is DNS.
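
If you ever need to watch this yourself, dig makes it easy to see both the delegation at the parent and what a public resolver currently believes (tilde.team here is just the example at hand):

    dig +trace NS tilde.team           # walks the delegation down from the root
    dig +short NS tilde.team @1.1.1.1  # what a public resolver is serving right now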