OnlineOrNot Diaries 3
Max Rozen (@RozenMD) / March 10, 2023
It's another Friday evening here in Toulouse, and I've poured myself a Delirium Tremens (I shit you not, that's what the beer is called). Let's go into how OnlineOrNot went this week.
Jake recently replied to one of the OnlineOrNot diaries, asking:
Have you ever detailed what a 'marketing week' is?
To which I replied (paraphrasing here):
My marketing weeks are normally based on vibes, but I typically:
- update the changelog
- write blog posts/screencasts/guides that are missing
- tweet more
- participate more in forums (mainly Hacker News and Reddit)
- talk to existing and potential customers about how they monitor their URLs, alert their customers to incidents, and generally how they manage their incidents
- work on landing pages
This week was supposed to be a marketing week, and then over the weekend before this week started, fly.io had reliability issues.
At the same time, I was rolling out a change that would make OnlineOrNot check uptime globally by default (rather than checking from a single data center, then verifying downtime globally). I ended up mistakenly sending 2500% more emails than usual on Saturday and Sunday, thanks to a bad deployment in Singapore, followed by a bad deployment in Sydney.
The idea was to allow significantly faster checks, and use several failing checks in a row as a signal that your URL is indeed globally unavailable, but I had to roll it back and start from scratch.
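To make the "several failing checks in a row" idea concrete, here's a minimal sketch. It is not OnlineOrNot's actual code: the class name, the `record` method, and the threshold of 3 are all assumptions for illustration.

```python
from collections import defaultdict


class ConsecutiveFailureDetector:
    """Treats a URL as down only after `threshold` failed checks in a row.

    Hypothetical helper for illustration; the threshold value is an assumption.
    """

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._streaks = defaultdict(int)  # url -> current run of failed checks

    def record(self, url: str, ok: bool) -> bool:
        """Record one check result and return True if the URL counts as down."""
        if ok:
            # Any successful check resets the streak.
            self._streaks[url] = 0
        else:
            self._streaks[url] += 1
        return self._streaks[url] >= self.threshold
```

A single failed check (say, one flaky region) never triggers an alert on its own here; only an unbroken run of failures does, and one success resets the count.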
After fixing all the issues, I managed to step back, breathe, and write about it in "monitoring our monitoring" (that counts as marketing, right?).
In short, I now have graphs that tell me if what I just merged caused false positives (or caused the checks to stop entirely).
Side note: the spikiness in the graphs comes from previously only running checks at the start of each minute. As of this week, checks run every 10 seconds, which will smooth out the graph eventually.
This week I:
- started running uptime checks every 10 seconds
- started logging each check in ClickHouse, and started monitoring our monitoring with Grafana
- quadrupled how many uptime checks individual VMs perform (after verifying it had no impact on false positives)
- tripled the default timeout of our uptime checks from 10 seconds to 30 seconds, and made it possible to choose how long is considered a timeout (some folks are okay with their users waiting 10-30 seconds for the page to respond, and don't consider that an outage - I'm not one to judge)
- tripled the number of retries each VM performs before checking on another hosting provider
- wrote about the reliability dramas
- pitched OnlineOrNot to some Atlassian alumni
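The timeout and retry changes in the list above can be sketched roughly like this. It's an illustration under assumptions, not the real implementation: `check_with_fallback`, the `probe` callable, and the retry count of 3 are hypothetical. Only the 30 second default timeout comes from the list.

```python
from typing import Callable, Sequence

# A probe takes (provider, timeout in seconds) and returns True if the URL responded.
Probe = Callable[[str, float], bool]


def check_with_fallback(
    probe: Probe,
    providers: Sequence[str],
    retries: int = 3,          # assumed value; the post only says retries were tripled
    timeout_s: float = 30.0,   # default timeout mentioned in the list above
) -> bool:
    """Retry on each provider in turn; return True as soon as one check passes."""
    for provider in providers:
        for _ in range(retries):
            try:
                if probe(provider, timeout_s):
                    return True
            except (TimeoutError, OSError):
                # Treat timeouts and network errors as a failed attempt.
                continue
    # Every retry on every provider failed: treat the URL as down.
    return False
```

Passing the probe in as a function keeps the retry logic separate from however the HTTP request is actually made, which also makes it easy to test with a fake.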
So in short, not much marketing, but in the end I think OnlineOrNot is better for it. Maybe I'll move towards a three-week cycle of Coding Week -> Marketing Week -> Reliability Week in the future, who knows.
Got something you're curious about? Feel free to tweet at me, or subscribe to the mailing list below to get these every Friday.