[CRITICAL] Unexpected server outage 4/18/2018 - 4/19/2018


Nekone

We apologize for that unexpected prolonged outage. Here is a brief failure report.

 

What happened?!

During the overnight hours of April 18th-19th (US time), we experienced a database crash. The database was checked and restarted, which appeared to fix the problem at first. Unfortunately, unbeknownst to me at the time, the root cause had not been addressed. Minutes later the entire web infrastructure became unusable, and I was forced to redirect the site to my emergency backup server in order to troubleshoot. Upon inspecting the server, I discovered that the web server's log files had grown so large that they had consumed all remaining disk space on the server's root partition. Our usual log rotation processes were not running correctly, which is why the log files had grown so large. This was also the cause of the earlier database crash, since the database's data files are stored on the same root partition.
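For readers who want to check for this kind of failure on their own machines, here is a minimal sketch of one way to do it. It is not the tooling used on this server. It reports how full a partition is and lists the largest files under a directory. The log directory path and the 90% threshold are assumptions chosen for illustration.

```python
import os
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(directory, count=5):
    """Return up to `count` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:count]


if __name__ == "__main__":
    LOG_DIR = "/var/log"  # hypothetical location; adjust for your server
    THRESHOLD = 90.0      # hypothetical alert level, in percent

    used = percent_used("/")
    print(f"root partition: {used:.1f}% used")
    if used >= THRESHOLD:
        for size, path in largest_files(LOG_DIR):
            print(f"{size / 1e6:10.1f} MB  {path}")
```

Run on a schedule, a check like this would likely flag a runaway log file well before the partition fills completely.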

 

What was done to resolve this?

To resolve this, I first had to move the overgrown log files off the root partition. We retain a subset of these logs for diagnostic purposes in case of internal web server errors. Since the server processes for rotating old log files out of the way were not functioning properly, I had to quickly determine which log files to retain, then compress them and move them to the larger partition. This took time due to the sheer size of some of these files. Once that was done, the web server log directory was wiped clean and the disk space freed.
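The compress-and-move step can be sketched roughly as follows. This is an illustration only, not the exact procedure used here. The function name `archive_log` and the paths in the usage note are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def archive_log(log_path, archive_dir):
    """Gzip `log_path` into `archive_dir`, then delete the original.

    Returns the path of the compressed copy.
    """
    log_path = Path(log_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (log_path.name + ".gz")

    # Stream in chunks so a very large log never has to fit in memory.
    with open(log_path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

    log_path.unlink()  # remove the original only after the copy succeeded
    return target
```

For example, `archive_log("/var/log/httpd/error_log", "/data/log-archive")` (both paths assumed) writes the compressed copy directly to the other partition, so no temporary space is needed on the full one. Note that on Unix-like systems, deleting a file that a running process still holds open does not free its space until that process closes the file or is restarted.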

 

Since the database files were also on the root partition, I wanted to avoid any future problems with them. The data directory was moved to the larger partition, along with the existing data files. The original copies on the root partition were left untouched. From this point forward, we will run the database from the larger partition and use the root partition for backups. This should keep the database from being constrained by the limits of the root partition.
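One simple way to confirm that a relocated data directory really sits on a different partition is to compare device IDs. This is a small sketch for Unix-like systems. The helper name `same_filesystem` and the example path are hypothetical.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths live on the same mounted filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

After a move like the one described, `same_filesystem("/", "/data/db")` (path assumed) would be expected to return `False`.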

 

Once all these operations were complete, the primary infrastructure was restarted, and the site was pointed back to the primary server. All login sessions were reset as a result, so you will have to log in again, but everything APPEARS to be in order.

 

Last words

I sincerely apologize for the trouble this caused everyone. It certainly caught me off guard, when I should have been watching the disk space more closely than I had been. I feel wholly responsible for allowing this to occur in the first place. The good news is nothing appears to be damaged, and everything does appear to be working again. But if anyone encounters a forum error again, please let me know ASAP.

 

Many thanks to @Renzourin for catching and reporting the initial issue to me, and to @Koby for jumping on IRC and keeping everyone informed while I worked on the server.
