Don Marti

Mon 14 Sep 2009 06:32:58 AM PDT

Spring cleaning for Perl scripts

Here are a couple of notes on things I've been doing to clean up some random Perl scripts. Using Perl to do stuff doesn't have to mean having a directory full of scripts that are hard to read and modify, that break on real input because you only tested them on a few files, and that run slowly. (Yes, I used to have a mess of Perl scripts like that, but I'm starting to see the light.)

This isn't about going from decent Perl to very high-quality Perl. It's about going from mess to tolerable.

use modules: This is the obvious one. Usually when I find really ugly or unreliable Perl code, there's a module to replace it. Instead of gross pattern matches that break on real-world CSV, HTML, or XML, there's a module to get the thing you need.
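CSV is the classic example: a naive split on commas falls over the first time a field contains a quoted comma. Here's a minimal sketch using Text::CSV from CPAN (the sample row is made up):

```perl
use strict;
use warnings;
use Text::CSV;   # handles quoted fields, embedded commas, and escapes

my $csv = Text::CSV->new({ binary => 1 });

# A naive split(/,/) would break this row into four fields:
my $line = 'widget,"Smith, Jones & Co.",19.95';
$csv->parse($line) or die $csv->error_diag;
my @fields = $csv->fields;
print "$fields[1]\n";   # Smith, Jones & Co.
```

Same idea for HTML and XML: reach for a parser module before reaching for a regex.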

take command-line options: Here are three lines that give a script a --verbose or -v option.

use Getopt::Long;
my $VERBOSE = 0;
GetOptions ("verbose|v" => \$VERBOSE);

Saves having to edit the script to change flags.

write perldoc: Instead of making the beginning of a script into a big block comment, add the POD markup. (example: sitemap-o-matic). Now you can just say perldoc [script] to remind yourself what it does and how to use it. Details at perldoc perldoc and perldoc perlpod.
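A skeleton looks like this (the script name and descriptions here are placeholders, not from sitemap-o-matic):

```perl
#!/usr/bin/perl

=head1 NAME

frobnicate - do the thing to some files

=head1 SYNOPSIS

  frobnicate [--verbose] file ...

=head1 DESCRIPTION

Longer explanation of what the script does and why.

=cut

use strict;
use warnings;
# ...rest of the script...
```

The =pod sections are invisible to perl itself, so the documentation rides along in the same file.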

use threads: This is potentially hairy, but if an often-run script is just doing the same thing to a bunch of files, and the result is one value, it's not too hard just to do a File::Find::find over the directory you want to look at, start a new thread per file, and then "join" all the threads. Probably not worth it for most scripts, but here's a podcast client that can kick off multiple downloads while other threads are still parsing XML. It was too slow when it parsed everything and then downloaded one at a time.
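The shape of that is roughly the following sketch. It assumes a threads-enabled perl build, and count_lines() is a stand-in for whatever the real per-file work is:

```perl
use strict;
use warnings;
use threads;
use File::Find;

# Stand-in for the expensive per-file work.
sub count_lines {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;
    my $n = 0;
    $n++ while <$fh>;
    return $n;
}

# One thread per file under $dir, then join them all and sum the results.
sub total_lines {
    my ($dir) = @_;
    my @threads;
    find(sub {
        push @threads, threads->create(\&count_lines, $File::Find::name)
            if -f;
    }, $dir);
    my $total = 0;
    $total += $_->join for @threads;
    return $total;
}

print total_lines($ARGV[0] // '.'), " lines\n";
```

A thread per file gets expensive fast on big trees, which is part of why this is probably not worth it for most scripts.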

use memcache: OK, kind of weird, but a lot of command line or cron job Perl scripts do something like this: (1) fetch or read in a bunch of data, (2) build some kind of data structure, and (3) spew out some kind of data. So take the result of step 2 and stick it in memcache, and the script will run much faster while you get step 3 right.

use Cache::Memcached;

my $memd = Cache::Memcached->new({
    'servers'   => [ "127.0.0.1:11211" ],
    'debug'     => 0,
    'namespace' => 'my_nifty_web_client_script'
});

...

# $ua is an LWP::UserAgent; $EXPIRE is the cache lifetime in seconds.
my $stuff;
unless ($stuff = $memd->get($url)) {
    my $res = $ua->get($url);
    if ($res->is_success()) {
        $stuff = expensive_operation($res->content);
        $memd->set($url, $stuff, $EXPIRE);
    }
}

This is good for using a site's Web API without getting the webmaster cheesed off at you.