
[CRITICAL] Server outage report #2 (April 19th-20th, 2018)


Nekone


We sincerely apologize for the prolonged outage the other day. After monitoring the server for a few days, I am ready to post a detailed report on what happened this time.

 

What happened?!

At approximately 4:00 PM CDT (GMT-5) on April 19th, the entire set of Kametsu websites became completely unresponsive. Active troubleshooting began at 8 PM CDT; I was unable to respond to the incident immediately as I had been asleep from 3 PM to 8 PM CDT. Initial investigation revealed that the CPU load average on the Kametsu server had climbed to nearly 7 times what the available CPU cores could handle, which in turn caused the web server to effectively freeze, unable to accept or process requests under the abnormally high load.
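For anyone curious what a load average that high actually means, here's a minimal sketch of that kind of check, assuming a Linux box with Python on hand (the actual diagnosis was done with ordinary system tools, not this script):

```python
import os

# 1-, 5-, and 15-minute load averages (Unix/Linux only).
one_min, five_min, fifteen_min = os.getloadavg()
cores = os.cpu_count() or 1

# A load average well above the core count means processes are queuing
# for CPU time; during this outage the ratio was close to 7.
ratio = one_min / cores
print(f"1-min load {one_min:.2f} over {cores} cores (ratio {ratio:.1f}x)")
if ratio > 1.5:
    print("Overloaded - requests will start stalling.")
```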

 

A check of the process list - after a few failed attempts to obtain it because resources were so badly strained - showed that the web server processes were consuming almost all of that CPU time. The web server processes were forcibly killed to ease the load on the system and give it time to return to idle. Once it had, the main web server process was restarted. Within minutes the CPU load began to spike again, and a quick check of the process list once more showed one of the web server processes consistently consuming well over 100% of a CPU without abating. Troubleshooting immediately shifted to the web server configuration and its various modules and libraries, tracing back each and every one to determine whether it was causing the problem. Unfortunately, this took a lot of time, as there are many of them to go through, and the effect wasn't always immediate; sometimes it took a while to reappear after each process restart.
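To illustrate what that process-list check looks like, here's a rough Python equivalent, assuming the third-party psutil package is available; the real troubleshooting used the usual command-line tools, and the 100% threshold simply mirrors the behaviour described above:

```python
import time
import psutil  # third-party; an assumption for this sketch

# Prime the per-process CPU counters, then sample again after a short delay.
procs = list(psutil.process_iter(attrs=["pid", "name"]))
for p in procs:
    try:
        p.cpu_percent(interval=None)
    except psutil.NoSuchProcess:
        pass

time.sleep(2)

# Report anything pinning a full core or more - on the Kametsu box this
# would have surfaced the runaway web server worker processes.
for p in procs:
    try:
        usage = p.cpu_percent(interval=None)
    except psutil.NoSuchProcess:
        continue
    if usage >= 100:
        print(p.pid, p.info["name"], f"{usage:.0f}% CPU")
```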

 

Eventually I determined it was not any of the loaded modules or libraries. Given that the server had thrown errors the day before (see the other RFO for details on that), I opted to reboot the entire server to see if that would fix anything. While this provided a cleaner environment, it unfortunately did not fix the issue. Troubleshooting then turned to the server configuration and the server logs themselves. The log files hinted at a possible recursion problem, but did not indicate where it might be. I began looking at the redirect rules we use in various places across the server to see if one of them was responsible. After nearly 2 hours of testing at this stage, I found the problem: an old redirect rule that sent forums.kametsu.com (which is no longer valid) to the base domain of kametsu.com, mistakenly left in place when it should have been removed. This rule was causing a recursion error for search bots still relying on the old URL, because the supporting rules that assisted in that redirect had already been removed. Given the frequency with which search bots crawl our site, the resulting flood of recursing requests was completely overtaxing the web server. I had to observe this for about an hour to confirm it was the cause, and once confirmed, the old redirect rule was disabled, the web server restarted, and everything monitored for a day or two to ensure there were no recurrences. Thankfully, there were none, and the server returned to normal soon after.
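To give a sense of how a redirect loop like this can be spotted from the outside, here's a hedged sketch using Python's requests library; the helper function and the exact URL are illustrative only, not the tooling or rule configuration actually involved:

```python
import requests  # assumed available for this sketch

def find_redirect_loop(url, max_hops=10):
    """Follow redirects by hand and return the chain if a URL repeats."""
    seen = []
    while len(seen) < max_hops:
        if url in seen:
            return seen + [url]          # loop detected
        seen.append(url)
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return None                  # chain ended normally
        # Resolve relative Location headers against the current URL.
        url = requests.compat.urljoin(url, resp.headers.get("Location", ""))
    return seen                          # too many hops - effectively a loop

chain = find_redirect_loop("http://forums.kametsu.com/")
if chain:
    print(" -> ".join(chain))
```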

 

What was done to resolve this?

As stated above, once the cause was nailed down, the old redirect rule was fully disabled as it should have been in the first place. Once it was disabled, the recursion errors went away and the server went back to normal load. Normal service was confirmed restored at approximately 4:25 AM CDT on April 20th.

 

Last words

Once again, I sincerely apologize for the trouble this caused everyone, and especially for the extensive downtime. Had I been able to respond more quickly, it probably wouldn't have taken as long. Going forward, we'll have a better system in place for responding to incidents like this. I'll also make sure to be thorough with my work on the server from now on - and it's probably a good idea not to touch things if I'm drunk or drowsy. :P

 

Many thanks again to our wonderful staff for keeping everyone updated as I worked through troubleshooting. I spent nearly 12 hours straight troubleshooting the server, even sacrificing my dinner, to get this community back up and running. It was not easy, but I love this community so much that I'm always willing to go that extra mile.
