Journal/24 Thermidor CCXIV from Evan Prodromou

I said a few days ago that Ross Mayfield had one of the best blogs about wikis and the wiki business. But Eugene Eric Kim gives him a run for his money. EEK Speaks covers in detail a lot of the public and private aspects of wiki going on right now.

Of course, I wouldn't paint EEK and Ross as rivals; after all, they won the Web 1.0 pitch contest at Wikimania together.

tags:

Victory is mine (I think)

For the last few weeks, some of the more advanced features of this wiki have been blowing up with a 500 Internal Server Error, to my great chagrin. I had switched from using CGI to FastCGI for WiLiKi and I'd seen a real improvement in the responsiveness of the site, but these occasional errors were becoming more frequent.

I finally hunkered down to figure out the problem. Annoyingly, there was no error output in the Apache error logs, so I guessed that the problem was sooooo bad that the process didn't even get a chance to scream on stderr before dying. I had all kinds of theories about this. Most of the problems had to do with full-text searches of the database, so I guessed that the underlying cause had to do with I/O and concurrency.

Gauche has a nice logging facility, so I threw in a lot of logging printf statements (or their equivalent) and isolated the problem to a few calls into Gauche's dbm libraries. Aha! The problem had to be concurrent DB access. I layered on a big slathery scoop of mutexes and locking and other such good concurrency stuff and... the problem was still there. D'oh! So much for my diagnostic skills. I decided to take a step back.

After ascertaining for myself that any writes to stderr from within a FastCGI program should indeed show up in the error log -- by reading and by writing a couple of sample programs -- I turned to the debugging tool of last resort: strace. strace logs all system calls by a program, which can be a lot of information. I couldn't, of course, just start my FastCGI program, so I had to strace the entire Apache process, with lots of command-line flags to divvy up the output on a per-process level.

Once I had that going, I could kind of see what was going on. My full-text searches were starting up fine, but about halfway through the search they'd get a SIGTERM and terminate. Holy moly! That wasn't an error in my program -- someone was murdering my poor little FastCGI processes. At first I suspected fratricide -- that the processes were terminating each other when they needed the database resource -- but I realized right away how dumb that was. That only left one suspect -- the parent!

I dug a little bit into the fcgid documentation, and sure enough, if a process went too long without giving any output, fcgid would terminate it. I upped the time limit for this wait, and now all my search processes are working fine.

I'm going to need to start doing something smarter with these full-text searches. I think doing some indexing at write time is the best bet -- that should probably speed stuff up quite well. But for the time being the processes will run, if slowly. The only bummer is now I have to go dig out all the mutex-y lock-y stuff I wrote earlier.

tags: