
[OH-Dev] OpenHatch is down

Asheesh Laroia asheesh at asheesh.org
Tue Dec 25 00:03:59 UTC 2012


Howdy all,

Yes -- sorry about that. I slept in a lot last night/today, and addressed 
it around 3 PM Eastern today.

I thought it'd be useful to explain some of these errors that you ran 
into, and then talk about the future.

On Mon, 24 Dec 2012, Jessica McKellar wrote:

> I for the life of me can't find a user that I have access to that lets
> me operate with the necessary privileges, or I would fix this:
>
> deploy at linode:~$ /etc/init.d/apache2 restart
> Warning: DocumentRoot [/var/web/inside.openhatch.org/] does not exist
> Syntax error on line 7 of /etc/apache2/sites-enabled/ssl:
> SSLCertificateKeyFile: file
> '/etc/ssl/private/openhatch.org.2011-05-22.key' does not exist or is
> empty
> Action 'configtest' failed.
> The Apache error log may have more information.
> failed!
>
> Which is at least 1 thing causing the site to be down.

From what I understand, this error is a red herring caused by running the 
Apache init script as a non-root user. (Only root can read that secret key 
file, so running the script as anyone else prints the SSL key error.)
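
For anyone who hits this again: running the config test as root shows the 
real result (command names here assume a stock Debian-style Apache install):

    # As root the key file is readable, so configtest reports the truth;
    # as a non-root user the same command misreports the key as missing/empty.
    sudo apache2ctl configtest
    sudo ls -l /etc/ssl/private/openhatch.org.2011-05-22.key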

> If someone on this list has root on linode and can do something about 
> this, please do:
>
> http://openhatch.readthedocs.org/en/latest/internals/emergency_operations.html

I can see that page doesn't really address the situation, except to show 
you how to log in and get root.

As for your authentication... I looked, and I don't see a user account for 
you, Jessica! I'd be happy to add one for you; I'll give it username 
'jesstess' unless you request something else; just email me an SSH key 
(probably off-list, although it's a public key, so it'd probably be okay 
on-list). I'll give sudo powers to that account as well.
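
For the record, creating that account will look roughly like this on my end 
(a sketch assuming our standard Debian setup; 'jesstess.pub' is just a 
placeholder for whatever key file you send):

    sudo adduser --disabled-password jesstess
    sudo mkdir -p /home/jesstess/.ssh
    sudo tee /home/jesstess/.ssh/authorized_keys < jesstess.pub
    sudo chown -R jesstess:jesstess /home/jesstess/.ssh
    sudo chmod 700 /home/jesstess/.ssh
    sudo chmod 600 /home/jesstess/.ssh/authorized_keys
    sudo adduser jesstess sudo   # assumes the 'sudo' group is enabled in sudoers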

> Note also the following errors on lish:
>
> linode.openhatch.org login: ERROR 144 (HY000) at line 1: Table
> './oh_milestone_a/sessionprofile_sessionprofile' is marked as crashed
> and last (automatic?) repair failed

Yeah.

So, here's what happened: the root cause was running out of disk space on 
the / partition. This caused me to get some error emails, and I logged in 
and found the low disk space condition. I freed up some space, then 
restarted MySQL in case it wanted a restart.
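
The diagnosis itself was just the usual sort of poking around (illustrative 
commands, not a transcript):

    # Confirm the full / partition, then see which directories are hogging it:
    df -h /
    sudo du -sh /var/lib/mysql /var/log /tmp | sort -h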

I also ran a trivial script that ran "repair table" on every table in the 
production MySQL DB. This filled up the disk again while it was repairing 
sessionprofile_sessionprofile. At this point my memory is slightly hazy. I 
believe I restarted MySQL again, and now MySQL had marked that table as 
crashed and the last repair as having failed, so it was extra unhappy.

I freed up more space, re-ran the repair on that table (which took a 
while), and then restarted MySQL once more for good measure; things now 
seem fine and stable.
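
For the record, the repair script was roughly this shape (the exact 
invocation is from memory, so treat it as illustrative; it assumes 
credentials that let the mysql client connect without prompting):

    # Run "repair table" on every table in the production DB:
    for t in $(mysql -N -e "SHOW TABLES" oh_milestone_a); do
        echo "Repairing $t"
        mysql -e "REPAIR TABLE \`$t\`" oh_milestone_a
    done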

Sorry about that.

> ** !!! PID file tmp/pids/mongrel.8100.pid already exists.  Mongrel
> could be running already.  Check your
> /var/log/mongrels/calagator.8100.log for errors.
> ** !!! Exiting with error.  You must stop mongrel and clear the .pid
> before I'll attempt a start.
> NOTE: Gem::SourceIndex.from_installed_gems is deprecated with no
> replacement. It will be removed on or after 2011-10-01.
> Gem::SourceIndex.from_installed_gems called from
> /usr/lib/ruby/1.8/gem_plugin.rb:109

Yeah... back in the day we dreamed of running Calagator, but we never 
quite got it working the way we wanted, so we have a half-set-up install 
on the machine. I've removed the init script that causes this error 
message to get printed (and also that causes Calagator to attempt to 
start).
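
Concretely, the removal amounted to something like this (the script name 
here is a guess from memory):

    sudo update-rc.d -f calagator remove
    sudo rm /etc/init.d/calagator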

More info about the space hogging: we're mostly using the InnoDB storage 
engine with MySQL, which in its default configuration grows forever and 
never returns unused space to the filesystem as data churn happens.

innodb_file_per_table is the my.cnf option that lets one separate the data 
out by table; to really get any space back, we'd need to drop and 
re-import (from a backup) the databases.
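
The reclaim procedure would look roughly like this. It's a sketch, not a 
tested runbook: the paths assume a stock Debian MySQL install, and it needs 
a scheduled maintenance window plus a verified backup before anyone tries it:

    # 1. Dump and drop the database so it can be rebuilt in per-table files.
    mysqldump oh_milestone_a > /root/oh_milestone_a.sql
    mysql -e "DROP DATABASE oh_milestone_a"
    /etc/init.d/mysql stop
    # 2. Add to /etc/mysql/my.cnf, under [mysqld]:
    #      innodb_file_per_table = 1
    # 3. The shared ibdata1 file never shrinks, so remove it (and the redo
    #    logs); InnoDB recreates them, much smaller, at the next start.
    rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*
    /etc/init.d/mysql start
    # 4. Re-import; the new tables land in their own .ibd files.
    mysql -e "CREATE DATABASE oh_milestone_a"
    mysql oh_milestone_a < /root/oh_milestone_a.sql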

So... the above is "what happened, and when", at least approximately.

Safeguards that would have helped:

* Nagios monitoring to tell us when we are low, rather than zero, on disk: 
well, we already have that, but we've been operating with that warning 
permanently firing. That's bad news. I'll get on fixing that right after 
sending this mail; see the example check below.
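
For reference, the check in question is the standard check_disk plugin; the 
thresholds below are illustrative, not our actual config:

    # Warn when / drops below 20% free, go critical below 10% free:
    /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /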

Things that slowed response:

* The unfortunate timing of my availability versus the events. I'm not sure 
how to address that with a safeguard, except that if we have more lead time 
on disk space problems, it won't be an issue.

* Having more people with root access to address these problems could 
help. Jessica, if you want that, see my offer above.

Things that would have made the system less prone to fail:

* Disk headroom, in general. I'm on that now.

* Reconfiguring our MySQL server to simply use less disk, or having 
someone else maintain it. I've looked into a few options as part of 
writing this email, and it seems like we'd experience pretty high latency 
if we used MySQL hosting outside the data center we're in at the moment, 
and services like Amazon RDS have their own administrative overhead.

* We could have a less jarring "Error 500" page. I wasn't really expecting 
many people to ever see it.

* Just having a bigger / filesystem would be the easiest fix. We only have 
16GB of space. Having just looked into this, I see that Linode.com actually 
gave us 8GB of extra space somewhat recently. I have to shut down the VM 
before they'll let us grow the filesystem, so we'll need to schedule an 
outage for that. That'll be by far the easiest way to get a larger 
proportion of free space.

That's my post mortem. Other thoughts and comments welcome.

-- Asheesh.

