I haven’t posted an update for almost a month, but that doesn’t mean nothing’s been happening. On October 12th, a significant change occurred: I migrated Visualizer to a new server provider. If you haven’t muted #extensions on Discord or if you’re subscribed to the status page, you probably saw the update.
The Story So Far
Visualizer started way back in 2020 and, like most hobby Ruby projects at the time, was deployed to Heroku. It was a great fit back then, and with a simple git push I could see my changes reflected almost immediately. However, Heroku’s focus shifted towards large companies / enterprises, and their pricing reflected that. Additionally, as Visualizer grew, I became increasingly frustrated by the performance I got there.
In November 2022, I moved Visualizer to Fly.io. It was like a glass of ice-cold water in hell. Fly was this cool upstart that promised the same developer ergonomics, but with better performance at a reasonable price. I tried it out, and it worked really well. I was very happy with it and loved the Fly Community. Sure, the support left something to be desired, but that was to be expected from a small company. Once I switched to the Launch Plan, that improved.
The Fly Experience
But for the past couple of months I had a feeling that performance was getting worse. I upgraded the web and database servers, which improved things slightly but also made everything more expensive. At the beginning of October, I started seeing some weird errors in the logs about the database malfunctioning. I contacted support, but didn’t hear back. Then the first random downtime occurred. For a minute. And a day later, another one. 3 minutes. So I contacted them again and selected a higher priority.
Would you mind letting us know what app you’re seeing this on? This might help us better pinpoint the issue.
Are you kidding me? There’s only one app!
Taking a look at the recent health checks, it looks like the issue has gone away? Do you have any sense regarding how long the original issue lasted?
None of the issues went away. All the errors were still there and happening every 5 to 10 minutes.
I just took a look at your logs and haven’t seen the error you shared in quite some time. Di (sic) you manage to fix the issue you were running into by chance?
Again, all the errors were still there. Did I mention there were days between these (non-)responses?
The Kamal Experience
Still impressed by the Rails World 2024 opening keynote I mentioned in the previous post, I decided to try moving my other project, the Business European Coffee Trip from Fly to Hetzner with Kamal 2.
A couple of hours later, it was up and running. I was blown away by the speed. It’s a small project so I only need a single server. I got a roughly 2-3× performance boost on CAX21 (ARM64, 4 cores, 8GB @ €5.99/month) compared to shared-cpu-2x on Fly.io (x86, 2 cores, 4GB @ $24.69/month). And…it…just works? 🤔
A couple of days later, while working, I get an SMS from my uptime monitor. Visualizer is down. Ugh, again? What’s going on now? I log into Fly and I see both database instances are down. No warning, no emails from Fly, no nothing. Just down. I run a restart command and one instance comes back up, the other one forever dead. 11 minutes of downtime and Visualizer is hobbling along. Several hours pass before everything is back up and running somewhat normally.
Then I see their newest post on Community: We’re making pricing simpler! Basically, they’re removing the Launch Plan, which includes the service and support, and are splitting support into a separate paid add-on: $29/month for Standard, $199/month for Premium, or Enterprise starting at $2500. It’s clear I’m no longer their target audience. And by that point I was convinced: I will be moving Visualizer to Hetzner too.
Friend’s Experience
As if to confirm my decision, my friend attempted to deploy a simple app using the fly launch command, and I’ll let the chat speak for itself:
WTF is going on at Fly?!? “To launch an app just run fly launch!” I do, it runs, suggest FRA region and then tells me I can’t use FRA region on my plan. The links it provides give no information on this, nor on the price of other plans nor on how to switch, nor what my plan is, nor what regions I can use (all except Frankfurt and Mumbai for some reason), and there’s no way to sort regions by price.
I explained that Frankfurt was previously only available on Launch plans and above, and that now they’re removing those plans.
Wait, what? You can’t deploy to FRA without a Launch plan but you can’t get a Launch plan any more? I can’t even report this bug to support because I’d have to pay to be able to contact support. 😂
The Hetzner Experience
I knew moving Visualizer was going to be a bit more difficult than ECT. For starters, its database is ~35 GB while European Coffee Trip’s is ~150 MB. With ECT it was as simple as stopping Fly, pg_dump, rsync to the new server, pg_restore, switch DNS records, done. But Visualizer needed a bit more thought and planning.
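For reference, the simple dump, copy and restore route that worked for ECT might look roughly like the sketch below. It only builds and prints the commands rather than running them. The connection string, host, database name and file name are hypothetical placeholders, and the flags are just one reasonable choice, not necessarily what was actually used.

```python
import shlex

# Everything below is a made-up placeholder, not the real setup.
SOURCE_DB_URL = "postgres://user:pass@localhost:5432/ect_production"
TARGET_HOST = "deploy@new-server.example"
TARGET_DB = "ect_production"
DUMP_FILE = "ect.dump"


def dump_and_restore_commands():
    """Return the commands for a stop-the-world copy, in execution order."""
    return [
        # Custom-format dump, which pg_restore can read.
        ["pg_dump", "--format=custom", "--file", DUMP_FILE, SOURCE_DB_URL],
        # Copy the dump file to the new server.
        ["rsync", "-avz", DUMP_FILE, f"{TARGET_HOST}:{DUMP_FILE}"],
        # Restore on the new server into an existing, empty database.
        ["ssh", TARGET_HOST,
         f"pg_restore --no-owner --dbname={TARGET_DB} {DUMP_FILE}"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in dump_and_restore_commands():
        print(shlex.join(command))
```

The whole database is unavailable for writes while this runs, which is fine at ~150 MB but not at ~35 GB.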
I looked into and trialed many different options: pg_auto_failover, pglogical, pg_easy_replicate, but they all had their own issues, mainly various limitations caused by the way Fly runs Postgres.
Then I came across pgsync. It’s a fairly simple tool that basically utilizes pg_dump and pg_restore under the hood, but provides a much better developer experience. Namely, it allowed me to easily select which tables to sync. This allowed me to make the following plan:
- Do a sync of ShotInformation (at ~30 GB by far the biggest table that holds, well, shot information)
- Stop Fly web server
- Sync all tables except ShotInformation
- Start web server on Hetzner
- Switch DNS records
- Connect to Fly database from Hetzner and sync the missing rows in ShotInformation
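As a rough illustration, the plan above could be written down as a dry-run checklist like the one below. This is a sketch under assumptions, not the actual script: the table name shot_information, the created_at cutoff used to find missing rows, and the cutoff value are all hypothetical, and source and destination are assumed to be set in pgsync's config file. The --exclude and --preserve flags are used as I understand them from the pgsync README, so check pgsync --help before relying on them.

```python
import shlex

# Hypothetical names: the real table name and cutoff column may differ.
BIG_TABLE = "shot_information"
CUTOFF = "2024-10-12 00:00:00"  # made-up time of the initial bulk sync


def migration_steps(big_table=BIG_TABLE, cutoff=CUTOFF):
    """Return (description, command or None) pairs in execution order.

    A command of None marks a step that is done by hand.
    """
    where = f"where created_at > '{cutoff}'"
    return [
        ("Bulk-sync the big table while the old app is still live",
         ["pgsync", big_table]),
        ("Stop the web server on Fly", None),
        ("Sync every other table",
         ["pgsync", "--exclude", big_table]),
        ("Start the web server on Hetzner", None),
        ("Switch DNS records", None),
        # Assumes rows are only ever added, and that created_at marks them.
        ("Sync rows written since the bulk sync, keeping existing rows",
         ["pgsync", big_table, where, "--preserve"]),
    ]


if __name__ == "__main__":
    # Dry run: print each step instead of executing it.
    for number, (description, command) in enumerate(migration_steps(), start=1):
        action = shlex.join(command) if command else "(manual step)"
        print(f"{number}. {description}: {action}")
```

The point of this ordering is that the slow part, copying the big table, happens before any downtime starts, so only the small tables have to be copied while the site is offline.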
With the plan in place I went through the entire thing except stopping the server and switching DNS records. The whole thing took about 2 hours. But the crucial bit: without step 1 (which I could do before stopping the Fly server) the entire process took just over 5 minutes.
I announced the migration on the status page and Discord with the planned downtime of ~15 minutes and started the migration on a Saturday morning. It went as smoothly as the test run, and we were back up and running on Hetzner in 6 minutes. As expected, there were some minor hiccups, but in a span of about half an hour everything was back to normal and better than ever.
The Story Now
Similar to ECT, I’m seeing 2-3× performance improvements across the board. Together with speedups last month, the progress really feels incredible. The database calls are faster, the page rendering is faster, and shot parsing is much faster. Most importantly: so far there have been 0 reliability issues, server errors, or any downtime whatsoever.
There’s still a lot of work to be done. I want to improve database resilience and backup strategy, I want to improve server monitoring, I want to add some other bells and whistles. But already I feel much more confident in this setup than I did with Fly for the last couple of months.
Thanks for reading, and I hope you now know a bit more about the recent instability and downtimes and what I did about it. Feel free to email me or ping me on Discord if you have any questions.
And, as always, you can check the entire diff since the last update, which again includes some breadcrumbs about future features and some smaller features like adding support for roasting date formats.
Have a great day! ☕