ref: 87e3aa45c4eddffd41649c23bd177b485a99cfc0 blog/content/posts/rate_limiting.md -rw-r--r-- 3.9 KiB
87e3aa45Dominic Ricottone Portability bug fixes 1 year, 2 months ago
                                                                                
---
title: Rate Limiting
date: "2023-03-29T23:55:56-05:00"
draft: false
---

In October, my public MoinMoin wiki started getting hammered with rapid
requests. It wasn't a major concern of mine at first.
 1. It was nowhere near the scale of a DoS attack.
 2. MoinMoin has built-in surge protection that always allows logged-in users
    access.
 3. As a principle, I'm in favor of public documents remaining truly *public*.
    Even to potentially bad actors.

But then it started impacting my uptime percentage.
*Oh boy* do I love seeing that perfect, bright-green 100%.
And it turns out, that's all the motivation I need to start walking back a
principle.

----

I've been using [upptime](https://github.com/upptime/upptime) for quite a while
as my uptime monitor/tracker.
It's an exceedingly simple system that can be hosted freely on GitHub Pages.
Well, as long as you are willing and able to tweak the CI/CD limits.
It ships with some defaults that are a touch too much for an unpaid account.

The major advantage to upptime is that my monitor is 100% isolated from my true
production servers.
The CI/CD does all maintenance automatically.
Certificate expirations, host uptime, and end user error are all neatly taken
out of the equation.

The extra advantage is GitHub Issues as a web admin ticketing system.
For the most part, I host services by and for myself.
It would never be worthwhile to spin up a ticketing system just for server
administration.
But there are still *lessons learned* and *personal notes* that I would ideally
staple to incident reports, even just for myself.
GitHub Issues is a clunky beast, but it's more than enough for this use case.

This is all to say, you can find a timeline of events and a conversation I had
with myself at [#52](https://github.com/dricottone/upptime/issues/52).

----

I did *briefly* wonder if upptime was in error, and reporting a false positive.
I certainly could not replicate the issues.
And it wouldn't be the first time I needed to tweak the CI/CD due to GitHub's
race to the bottom on unpaid account limits.
As I now know, my inability to replicate was a result of MoinMoin's clever
surge protection.

I did a trial run with Uptime Robot but quickly realized that it reported the
same issues.
And if an uglier, paywalled monitor (*that has no ticketing system!*) was going
to give me the same outcomes, why would I give it another thought?
Haven't logged in since.

I then did some exploration into the realm of building my own monitor on
another cloud platform (*GCP?*).
That didn't make it even as far as a trial run.

----

While this all began in October, it took me several months to take this
seriously as a threat (to my uptime streak).

My first approach was to try to understand the spam.
I added more and more thorough connection logging, and even some HTTP headers.
I monitored logs in the hope that I could predict incidents--even intervene.
I tried to parse metrics in the aftermath of incidents to better predict the
next one.
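
As a rough illustration of that kind of log triage, here is a minimal Python
sketch that counts requests per client inside a time window.
The log format (`<ISO timestamp> <client IP> <path>`), the sample lines, and
the function names are all assumptions made up for the example; real HAProxy
logs look different and would need their own parser.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical, simplified log lines: "<ISO timestamp> <client IP> <path>".
# The addresses come from the reserved documentation ranges.
SAMPLE_LOG = """\
2023-01-10T12:00:00 203.0.113.7 /FrontPage
2023-01-10T12:00:01 203.0.113.7 /RecentChanges
2023-01-10T12:00:01 198.51.100.4 /FrontPage
2023-01-10T12:00:02 203.0.113.7 /FindPage
2023-01-10T12:00:45 203.0.113.7 /TitleIndex
"""


def parse(line):
    """Split one log line into (timestamp, client, path)."""
    stamp, client, path = line.split(maxsplit=2)
    return datetime.fromisoformat(stamp), client, path


def busiest_clients(log_text, start, window):
    """Count requests per client in [start, start + window), busiest first."""
    end = start + window
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        when, client, _ = parse(line)
        if start <= when < end:
            counts[client] += 1
    return counts.most_common()


if __name__ == "__main__":
    incident_start = datetime(2023, 1, 10, 12, 0, 0)
    for client, hits in busiest_clients(SAMPLE_LOG, incident_start,
                                        timedelta(seconds=10)):
        print(client, hits)
```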

This didn't really amount to anything.
As far as I can tell, I've randomly become the latest client (*victim?*) of a
particularly zealous spider.

My second approach was sticky rate limiting.
I have always sat my services behind an instance of HAProxy, because I'm a huge
fan of premature optimization and building for scale that never actualizes.
This made the introduction of rate limiting (and the aforementioned logging)
incredibly simple.
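
To make the idea concrete, here is a minimal Python sketch of per-client rate
limiting over a sliding window.
It is not HAProxy configuration and not the setup described in this post; the
class name, the limit, and the period are hypothetical values chosen only for
illustration.

```python
from collections import defaultdict, deque


class StickyRateLimiter:
    """Remember recent request times per client; reject clients over a limit."""

    def __init__(self, limit, period):
        self.limit = limit    # max requests per period (made-up value)
        self.period = period  # sliding window length, in seconds
        self.hits = defaultdict(deque)

    def allow(self, client, now):
        """Record a request from `client` at time `now`; return True if allowed."""
        recent = self.hits[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.period:
            recent.popleft()
        # Rejected requests are recorded too, so a client that keeps
        # hammering stays blocked until it actually slows down.
        recent.append(now)
        return len(recent) <= self.limit


if __name__ == "__main__":
    limiter = StickyRateLimiter(limit=3, period=10)
    for t in (0, 1, 2, 3, 20):
        print(t, limiter.allow("203.0.113.7", t))
```

Counting rejected requests is a design choice in this sketch: it is what makes
the limit "stick" to a misbehaving client rather than letting every Nth request
through.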

I started with limits that mirrored MoinMoin's own surge protection, figuring
that someone more intelligent than I probably came up with those defaults.
Those turned out to be a bit too loose, so I spent the next few months playing
a back-and-forth tuning game.

In the end, I've landed on a configuration that has staved off spam and
maintained uptime since mid-January.
(*Leaving aside a brief certificate expiration... oops.*)
I'm extremely pleased with this result, and my uptime streak is so beautiful.