[OH-Dev] OpenHatch is down
Asheesh Laroia
asheesh at asheesh.org
Tue Dec 25 00:03:59 UTC 2012
Howdy all,
Yes -- sorry about that. I slept in a lot last night/today, and addressed
it around 3 PM Eastern today.
I thought it'd be useful to explain some of these errors that you ran
into, and then talk about the future.
On Mon, 24 Dec 2012, Jessica McKellar wrote:
> I for the life of me can't find a user that I have access to that lets
> me operate with the necessary privileges, or I would fix this:
>
> deploy at linode:~$ /etc/init.d/apache2 restart
> Warning: DocumentRoot [/var/web/inside.openhatch.org/] does not exist
> Syntax error on line 7 of /etc/apache2/sites-enabled/ssl:
> SSLCertificateKeyFile: file
> '/etc/ssl/private/openhatch.org.2011-05-22.key' does not exist or is
> empty
> Action 'configtest' failed.
> The Apache error log may have more information.
> failed!
>
> Which is at least 1 thing causing the site to be down.
From what I understand, this error is a red herring caused by running the
Apache init script as a non-root user. (Only root can read that secret key
file, so when the script runs unprivileged it prints the SSL key error even
though the key is there and fine.)
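If you hit this again, running the restart as root (via sudo, assuming your
account has sudo rights on the box) should avoid that misleading error:

    sudo /etc/init.d/apache2 configtest
    sudo /etc/init.d/apache2 restart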
> If someone on this list has root on linode and can do something about
> this, please do:
>
> http://openhatch.readthedocs.org/en/latest/internals/emergency_operations.html
I can see that page doesn't really address the situation, except to show
you how to log in and get root.
As for your authentication... I looked, and I don't see a user account for
you, Jessica! I'd be happy to add one for you; I'll give it username
'jesstess' unless you request something else; just email me an SSH key
(probably off-list, although it's a public key, so it'd probably be okay
on-list). I'll give sudo powers to that account as well.
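For the record, what I'll run on the server is roughly the following sketch
(the exact group name may differ, and 'jesstess.pub' just stands in for
wherever the key you send me ends up):

    sudo adduser jesstess
    sudo mkdir -p ~jesstess/.ssh
    # put the public key in place:
    sudo tee ~jesstess/.ssh/authorized_keys < jesstess.pub
    sudo chown -R jesstess:jesstess ~jesstess/.ssh
    sudo chmod 700 ~jesstess/.ssh
    sudo chmod 600 ~jesstess/.ssh/authorized_keys
    sudo usermod -a -G sudo jesstess   # sudo powers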
> Note also the following errors on lish:
>
> linode.openhatch.org login: ERROR 144 (HY000) at line 1: Table
> './oh_milestone_a/sessionprofile_sessionprofile' is marked as crashed
> and last (automatic?) repair failed
Yeah.
So, here's what happened: the root cause was running out of disk space on
the / partition. That caused me to get some error emails, and I logged in
and found the low disk space condition. I freed up some space, then
restarted MySQL in case it wanted a restart.
I also ran a trivial script that ran "repair table" on every table in the
production MySQL DB. This filled the disk up again while it was repairing
sessionprofile_sessionprofile. At this point my memory is slightly hazy. I
believe I restarted MySQL again, and by then MySQL had marked that table as
crashed and the last repair as having failed, so it was extra unhappy.
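(For reference, the "trivial script" was roughly this sort of loop; this is
a sketch from memory, not the exact command, with the database name taken
from the error above:

    mysql -N -e 'SHOW TABLES' oh_milestone_a | while read t; do
        mysql -e "REPAIR TABLE \`$t\`" oh_milestone_a
    done

REPAIR TABLE only does anything for the MyISAM tables; the InnoDB ones just
report that their engine doesn't support it.)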
I freed up more space, re-ran the repair on the table, which took a while,
and then restarted MySQL once for good measure and things seem fine and
stable.
Sorry about that.
> ** !!! PID file tmp/pids/mongrel.8100.pid already exists. Mongrel
> could be running already. Check your
> /var/log/mongrels/calagator.8100.log for errors.
> ** !!! Exiting with error. You must stop mongrel and clear the .pid
> before I'll attempt a start.
> NOTE: Gem::SourceIndex.from_installed_gems is deprecated with no
> replacement. It will be removed on or after 2011-10-01.
> Gem::SourceIndex.from_installed_gems called from
> /usr/lib/ruby/1.8/gem_plugin.rb:109
Yeah... back in the day we dreamed of running Calagator, but we never
quite got it working the way we wanted, so there's a half-set-up install
on the machine. I've removed the init script that caused this error
message to get printed (and that caused Calagator to attempt to start).
More info about the space hogging: we're mostly using the InnoDB storage
engine with MySQL, which in its default configuration keeps everything in
one shared tablespace file that grows forever and never returns space to
the filesystem as data churns. innodb_file_per_table is the my.cnf option
that separates the data out by table, but even with it set, to really get
any space back we'd need to drop the databases and re-import them (from a
backup).
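Concretely, the recipe for that would be something like this (a sketch;
the backup path is made up):

    # 1. dump everything:
    mysqldump --all-databases > /root/all-databases.sql
    # 2. set innodb_file_per_table = 1 under [mysqld] in /etc/mysql/my.cnf
    # 3. drop the application databases, stop mysqld, remove the old
    #    ibdata1 / ib_logfile* files under /var/lib/mysql, start mysqld
    # 4. re-import:
    mysql < /root/all-databases.sql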
So... the above is "what happened, and when", at least approximately.
Safeguards that would have helped:
* Nagios monitoring to tell us when we are low, rather than at zero, on
disk: well, we already have that, but we've been operating with that
warning permanently on, so it no longer tells us anything. That's bad news.
I'll get on that now, after sending this mail (something like the check
sketched below).
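A first guess at thresholds, so that "warning" means "act soon" rather
than "ignore me" (the exact numbers are up for discussion):

    check_disk -w 20% -c 10% -p /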
Things that slowed response:
* The unfortunate timing of my availability versus the events: I'm not sure
how to address that with a safeguard, except that if we have more lead time
on disk space problems, it won't be an issue.
* More people having root access to address these problems could help.
Jessica, if you want that, see my offer above.
Things that would have made the system less prone to fail:
* Disk headroom, in general. I'm on that now.
* Reconfiguring our MySQL server to simply use less disk, or having
someone else maintain it. I've looked into a few options as part of
writing this email, and it seems like we'd experience pretty high latency
if we used MySQL hosting outside the data center we're in at the moment,
and services like Amazon RDS have their own administrative overhead.
* We could have a less jarring "Error 500" page. I wasn't really expecting
many people to ever see it. (There's a sketch of one easy piece of that
after this list.)
* Just having a bigger / filesystem would address this very easily. We only
have 16GB of space. Having just looked into this, I see that Linode.com
actually gave us 8GB of extra space somewhat recently. I have to shut down
the VM before they'll let us grow the filesystem, so we'll need to schedule
an outage for that. That'll be by far the easiest way to get a bigger
proportion of the disk free.
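On the "Error 500" page point: for the cases where Apache itself is what
answers (e.g. the application is down entirely), something like this in the
vhost config would at least serve a friendlier static page. The path here
is made up, and errors generated by the application itself would still need
their own template:

    ErrorDocument 500 /errors/500.html
    ErrorDocument 503 /errors/500.html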
That's my post mortem. Other thoughts and comments welcome.
-- Asheesh.