So, late on the 3rd/early on the 4th depending on your timezone, you probably noticed the board was offline for an extended period. After doing a post mortem on the logs, it looks like the issue was caused by the server running out of memory, both physical and swap, so no new processes could be started and the Out Of Memory (OOM) killer just couldn’t kill off processes fast enough.
So, the short-term solution: you’ll see some brief downtime each night around 3am when MySQL and Apache restart themselves to free up memory. Long term, I’m going to have to upgrade the VPS, which will double the current cost, and the current level of monthly donations will not cover that.
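For the curious, the nightly restart doesn’t need anything fancy; a couple of cron entries will do it. This is a minimal sketch assuming a Debian-style box with `service` — the service names and exact times are placeholders, not necessarily what’s actually running here:

```shell
# Root crontab entries (crontab -e as root).
# Restart MySQL first, then Apache a couple of minutes later,
# so the web server comes back up against a fresh database.
0 3 * * * /usr/sbin/service mysql restart
2 3 * * * /usr/sbin/service apache2 restart
```

Staggering the two restarts avoids Apache child processes piling up connection attempts while MySQL is still coming back.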
The only other alternative is moving the database onto the shared database server I run for my sites and my friends’ sites. That’s pretty much how things were when everything was on the dedicated server, so not a big deal, but it’s also not a solution I’m fond of because I prefer to keep this site self-contained if possible.
^that up there was the really short and dumbed-down version of what happened. The part about restarting Apache as well isn’t just there for the hell of it. I really do not want to go into intimate detail of what happened, as it would give out way more information than is good about some of the internals. Plus there’s the whole needing 7 hours to recover the info necessary to post mortem the situation and analyze it. Really not in the mood to go through it again on less than 2 hours of sleep in the last 3 days.
Well, fuck. After getting some sleep and having a lot more time to go over the logs and crash dumps a bit more thoroughly I now have the distinct displeasure of saying the situation was a bit nastier than my original analysis hinted at.
Apparently the OOM situation was caused by runaway MySQL and Apache processes, but not for the reason it originally appeared. At first it looked like just a huge spike in traffic that they couldn’t handle due to there only being a small swap file. Actually, even a 12GB swap file wouldn’t have helped the situation much. After taking a closer look at the logs, it was a distributed attack on the board to try and force it offline. Some info was lost because the logs were damaged when I had to force the VPS to hard reset, but at one point Apache was trying to serve 300 copies of the RSS feed that shows the most recent public posts…
Now, while the feed itself isn’t cached, MySQL does keep a query cache… the problem is that this particular query couldn’t stay cached. Each Apache child process needed a good chunk of physical memory to serve the feed, and the PHP process behind each one needed huge chunks of memory to generate the feed from the query result, so the query kept needing more physical memory than the server could allocate at the time.
Long story short: no way in hell was the server going to be able to keep up. More RAM would have helped, and so would more swap, but in this case either one would have just delayed the inevitable.
So, the short-term solution: I added a good-sized chunk of swap file space (which will help some), and I have outright blocked the Russian, Ukrainian, and Chinese providers the bulk of the requests came from at the firewall. That will be a bit more useful. And of course the nightly restarts are still happening to help flush process memory until I can implement a proper long-term solution.
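For anyone running their own box, both mitigations are standard admin work. Here’s a rough sketch; the swap size and the CIDR range are placeholders (the range below is a documentation block, not anyone I actually banned), and all of this needs root:

```shell
# Add a 4G swap file (size is an assumption; pick what fits your VPS).
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Persist across reboots:
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# Block an offending provider's netblocks. ipset keeps large lists of
# ranges efficient instead of one iptables rule per range.
ipset create blocked hash:net
ipset add blocked 203.0.113.0/24   # placeholder range
iptables -I INPUT -m set --match-set blocked src -j DROP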
Long term, though, I still need to upgrade the VPS, which, as I mentioned previously, the current level of monthly donations will not support. I’m currently looking into alternatives to help offset the cost, but so far nothing useful has come of it.
Targeted DDoS, yes. But no, the problem in this case is that literally no amount of memory would have prevented the problem completely; it would just have delayed it longer and given me (or hell, the monitoring system) a chance to see the problem and mitigate it before it all crashed and burned. If I’d had enough time to see it, I could have banned the IPs before it got to the point it did.
I’m seriously thinking of adding one more mitigation though. I’ve been watching the logs and apparently there are only a few legitimate users of the feeds, so I’m just going to kill that function completely.
Uh… there’s literally no way for me to find out who they are. At least one is coming through that piece of shit Apple News app on iOS, and all but two of the rest are through Feedly. As for the only one that’s not: the person using it has never posted, so I have no way to determine who they are either.
I have to admit I’m curious as to why people would use the feed when they could just create an account and subscribe to get notified by email, and thus only get the stuff they want instead of every random post in the public forums. On the plus side, though, I just realized that at least they’ll see the feed is going away and why, because these posts are part of the feed.
For those who use the RSS feeds, you can ignore the part about them being disabled. I actually found a workaround that will keep them working for you but also not allow them to be used in another DDoS. The server itself is now the only thing allowed to call the generating URL, and it will refresh a cached copy roughly once every hour OR when a new post is inserted into the DB (so it basically appears to function normally).
HOWEVER, when you request it, it will just send back the cached copy directly and bypass the database completely, so whether one person or one hundred request it at the same time, the impact will be negligible.
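The refresh side of that can be as simple as a small script plus a cron entry. This is just a sketch of the idea, not the actual implementation here — the URL, paths, and filenames are all made up for illustration:

```shell
#!/bin/sh
# /usr/local/bin/refresh-feed.sh (hypothetical path)
# Fetch the real feed generator over loopback (only localhost is allowed
# to hit it), write to a temp file, then atomically swap it into place so
# readers never see a half-written feed.
curl -fsS 'http://127.0.0.1/feed.php' -o /var/cache/site/feed.xml.tmp \
  && mv /var/cache/site/feed.xml.tmp /var/cache/site/feed.xml

# Cron entry to refresh hourly (the new-post trigger would run the same
# script from the application side):
# 0 * * * * /usr/local/bin/refresh-feed.sh
```

Apache then serves the cached file as plain static content (e.g. via an `Alias` to the feed URL), so a request never touches PHP or the database.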
I will implement the code change when I upgrade the VPS, now that a generous member has agreed to cover the difference between what the previous donations were covering and what the upgraded VPS will cost.