This began as a tongue-in-cheek internal document I wrote on a plane for work.
Background
I maintain a blog at jacobgoldman.dev “monetized” via GPT (check out the ad featuring my dog below). It offers a quick and simple way to verify production GPT without visiting cluttered publisher sites that slow devtools to a crawl (cough CNN cough). It also has some writing on it to supplement the ads.
The website runs on a server located in my apartment. This is a single point of failure. A distributed and globally available system has been planned, built on tactically placed Raspberry Pis dispersed across the world, but funding has not yet been secured. Cloudflare runs in front of the server.
This server also hosts two mission-critical Minecraft servers for external customers.
Cause
On August 9th 2022 at approximately lunch time, there was a brief power surge in the Camberville metro. Later that evening, the oncall engineer was paged by his little brother with an outage report.
The circuit breaker providing electricity to (1) a tableside lamp and (2) the server was found to be tripped. After Googling how to fix this, the engineer flipped the switch to the “on” position and restored power. He then manually rebooted the server and restarted the disk array, the nginx server hosting the blog, and the Minecraft servers.
Service was restored by dinner time, just in time for a delicious steak dinner at Grill 23.
Impact
jacobgoldman.dev was unreachable for a few hours. Cloudflare reported a 522 connection timeout error, indicating failures to communicate with the host.
Blog: 5 to 10 potential impressions were lost, with a revenue impact in the fractions of cents. Sundar will be notified for this significant revenue loss.
Minecraft: at least 20 potential diamonds were lost. Reputational damage is immeasurable but severe.
Lessons Learned
- Hosting things in your apartment can be annoying.
- Continuous monitoring should probably be implemented, perhaps via probes. The engineering team will install PagerDuty.
- Power surges are not uncommon during the summer.
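For the curious, a monitoring probe could look something like the sketch below. This is an illustration rather than anything actually deployed: the function names `classify` and `probe` and the verdict strings are made up for this example, and it only assumes that Cloudflare answers with a 522 when it cannot reach the server in my apartment.

```python
import urllib.error
import urllib.request

# Cloudflare returns 522 when it cannot connect to the origin server.
CLOUDFLARE_ORIGIN_TIMEOUT = 522


def classify(status):
    """Map an HTTP status code (or None for no response) to a verdict.

    The verdict strings are hypothetical labels chosen for this sketch.
    """
    if status is None:
        return "unreachable"
    if status == CLOUDFLARE_ORIGIN_TIMEOUT:
        return "origin down"
    if 200 <= status < 400:
        return "ok"
    return "degraded"


def probe(url, timeout=10, fetch=urllib.request.urlopen):
    """Fetch the URL once and return a verdict string.

    `fetch` is injectable so the probe can be exercised without a network.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses, including 522.
        return classify(err.code)
    except OSError:
        # URLError, DNS failures, and socket timeouts are all OSErrors.
        return classify(None)
```

Calling `probe("https://jacobgoldman.dev")` every minute from a scheduler and paging someone on any verdict other than "ok" would have caught this outage before the little brother did.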
The Good
- No saved progress was lost in the Minecraft servers.
- Restarting everything was pretty easy. The server fortunately did not get fried by the surge.
The Bad
- I didn’t know how to flip a breaker and needed to Google a tutorial.
- There was a single point of failure requiring physical intervention.
The Lucky
- Google wasn’t down, only jacobgoldman.dev.
- Early alerting from the little brother.
Call to Action
- Invest in an uninterruptible power supply.
- Migrate jacobgoldman.dev to the cloud, or, better yet, build the next-generation cloud platform purely for hosting Minecraft servers and my blog. Trust me, it’s going to be bigger than GCP!