OnlineOrNot Diaries 3

Another Friday evening here in Toulouse. I've poured myself a Delirium Tremens (I shit you not, that's what the beer is called), so let's go into how OnlineOrNot went this week.

What even is a marketing week?

Jake recently replied to one of the OnlineOrNot diaries, asking:

Have you ever detailed what a 'marketing week' is?

To which I replied (paraphrasing here):

My marketing weeks are normally based on vibes, but I typically:

update the changelog
write blog posts/screencasts/guides that are missing
tweet more
participate more in forums (mainly Hacker News and Reddit)
talk to existing and potential customers about how they monitor their URLs, alert their customers to incidents, and generally how they manage their incidents
work on landing pages

Actually, time for a reliability week

This week was supposed to be a marketing week, but over the preceding weekend, fly.io had reliability issues.

At the same time, I was rolling out a change that would make OnlineOrNot check uptime globally by default (rather than checking from a single data center and then verifying downtime globally). Thanks to a bad deployment in Singapore, followed by a bad deployment in Sydney, I ended up mistakenly sending 2500% more emails than usual on Saturday and Sunday.

The idea was to allow significantly faster checks, and use several failing checks in a row as a signal that your URL is indeed globally unavailable, but I had to roll it back and start from scratch.
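
To make the "several failing checks in a row" idea concrete, here is a minimal sketch in Python. It is not OnlineOrNot's actual code: the class name, the `record` method, and the threshold of 3 are all assumptions made for illustration.

```python
from collections import defaultdict


class ConsecutiveFailureTracker:
    """Hypothetical helper: flags a URL as down after N failed checks in a row."""

    def __init__(self, threshold: int = 3):
        # threshold is an assumed value, not a number from the post
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, url: str, ok: bool) -> bool:
        """Record one check result.

        Returns True only on the check that makes the failure streak reach
        the threshold, so one outage produces one alert.
        """
        if ok:
            self._streaks[url] = 0  # any passing check resets the streak
            return False
        self._streaks[url] += 1
        return self._streaks[url] == self.threshold
```

Returning True only when the streak first hits the threshold (rather than on every failure after it) is a deliberate choice in this sketch: a long outage then triggers a single alert instead of one email per check.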

After fixing all the issues, I managed to step back, breathe, and write about it in "monitoring our monitoring" (that counts as marketing, right?).

In short, I now have graphs that tell me if what I just merged caused false positives (or caused the checks to stop entirely):

[Graph: Passing uptime checks]

Side note: the spikiness in the graphs comes from previously only running checks at the start of each minute. As of this week, checks run every 10 seconds, which will smooth out the graph eventually.

[Graph: Failing uptime checks]

This week I:

started running uptime checks every 10 seconds
started logging each check in ClickHouse, and started monitoring our monitoring with Grafana
quadrupled how many uptime checks individual VMs perform (after verifying it had no impact on false positives)
tripled the default timeout of our uptime checks from 10 seconds to 30 seconds, and made the timeout configurable (some folks are okay with their users waiting 10-30 seconds for the page to respond, and don't consider that an outage; I'm not one to judge)
tripled the number of retries each VM performs before checking on another hosting provider
wrote about the reliability dramas
pitched OnlineOrNot to some ex-Atlassian alumni
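
For the curious, the timeout and retry changes above can be sketched roughly like this in Python. This is an illustration under stated assumptions, not the real implementation: `check_with_retries`, `http_probe`, the `"up"`/`"escalate"` return values, and the default of 3 retries are all made up for the example. Only the 30 second default timeout comes from this post.

```python
import urllib.request
from typing import Callable

# A probe takes (url, timeout in seconds) and returns True if the check passed.
Probe = Callable[[str, float], bool]


def http_probe(url: str, timeout_s: float) -> bool:
    """Hypothetical probe: passes if the URL answers 2xx/3xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # Covers timeouts, connection errors, and HTTP 4xx/5xx (HTTPError)
        return False


def check_with_retries(
    url: str,
    probe: Probe = http_probe,
    timeout_s: float = 30.0,  # default timeout mentioned in the post
    retries: int = 3,         # assumed value, the post gives no number
) -> str:
    """Try the probe once plus `retries` more times on the same VM.

    Returns "up" as soon as one attempt passes, or "escalate" when every
    attempt failed and another hosting provider should verify the result.
    """
    for _ in range(1 + retries):
        if probe(url, timeout_s):
            return "up"
    return "escalate"
```

Passing the probe in as a parameter keeps the retry logic testable without touching the network, e.g. `check_with_retries("https://example.com", probe=lambda u, t: False)` returns `"escalate"`.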

So in short, not much marketing, but in the end I think OnlineOrNot is better for it. Maybe I'll move towards a three-week program of Coding Week -> Marketing Week -> Reliability Week in the future, who knows.

Got something you're curious about? Feel free to tweet at me, or subscribe to the mailing list below to get these every Friday.
