Archive for the ‘Perl’ Category

Linux 100% Swap Screenshot

Saturday, March 6th, 2010

Nice screenshot of 100% swap space being used on a popular but ill Perl app running under ModPerl::PerlRun. :)

Tasks:  85 total,   2 running,  83 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 14.8%sy,  0.0%ni, 17.0%id, 68.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8174024k total,  8132492k used,    41532k free,      284k buffers
Swap:  2096472k total,  2096472k used,        0k free,     5648k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  308 root      10  -5     0    0    0 D 17.5  0.0   0:05.14 kswapd0
15985 apache    18   0 19.4g 7.7g   84 D 15.1 98.3   0:12.09 httpd
15996 root      16   0 12740  624  368 R  5.4  0.0   0:00.48 top
    1 root      16   0 10348  124   32 S  0.0  0.0   0:01.69 init

The test server is a Dell 1950 with 8 GB RAM running CentOS 5.4 x64 and Apache 2.x.

The above problem illustrates one of the many reasons that almost all hosting providers adopted PHP instead of mod_perl.

PHP gives you good performance without the headaches of mod_perl, which get magnified in a shared environment.

However, if you have a dedicated machine, mod_perl is a great way to accelerate a Perl application as long as the program is reasonably well-behaved.
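To catch this condition before top shows it, the swap figures can be read straight from /proc/meminfo. The following is a minimal sketch, not part of the original setup: the function name swap_used_pct and the 90% warning threshold are arbitrary choices made for illustration.

```shell
#!/bin/sh
# Percentage of swap in use, given total and free swap in kB.
swap_used_pct() {
    total_kb=$1
    free_kb=$2
    if [ "$total_kb" -eq 0 ]; then
        echo 0          # no swap configured
        return 0
    fi
    echo $(( (total_kb - free_kb) * 100 / total_kb ))
}

# On Linux, read the live figures from /proc/meminfo.
if [ -r /proc/meminfo ]; then
    total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    pct=$(swap_used_pct "${total:-0}" "${free:-0}")
    echo "swap used: ${pct}%"
    # The 90% threshold is an arbitrary example value.
    if [ "$pct" -ge 90 ]; then
        echo "WARNING: swap nearly exhausted"
    fi
fi
```

With the numbers from the screenshot (2096472k total, 0k free) the function prints 100.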

Linux CentOS Cluster Setup Tips

Sunday, February 14th, 2010

I built a Linux cluster using 16 dual Opteron 248 machines, gigabit Ethernet, and the CentOS 5.4 DVD with kickstart.

Nodes can be remotely rebuilt upon command in about 3 minutes each in parallel, with no manual intervention, as long as you’re careful to treat nodes like appliances and don’t save data on them.

Some tips to save time and effort are:

  • collect the MAC addresses of all nodes at one time using the most efficient possible way, either from a manifest, or simply power all the nodes on and type on one node:
    ping -b 10.0.0.255 or
    fping -A -q -c 1 -g 10.0.0.0/24 or
    nmap -sP 10.0.0.0/24
    and
    arp -n
    
  • on your main client test node, which you may do 50 reinstalls on, save boot time by disabling memory checking, boot splash screens, etc. and use small filesystems during initial testing
  • install one machine by hand from DVD first to generate the anaconda-ks.cfg file, which contains your preferred package list (the CentOS installer itself uses kickstart even for local installs)
  • I found that having kickstart fetch the distro files using HTTP was a lot easier to set up and troubleshoot than NFS, and easier to secure later.
  • it’s common to use a BIOS boot order of “PXE, CD, HD” on each machine to bootstrap the cluster if the hard drive is not blank, then switch to “CD, HD, PXE” after Linux is successfully installed and you’re able to log in remotely. Subsequent reboots will try the HD first unless you force a PXE boot, which can be done with a script I wrote called unboot that both deactivates the boot partition and erases the MBR:
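    #!/bin/bash
    # unboot: force a PXE boot on the next restart

    # clear the boot flag on the first partition
    parted /dev/hda set 1 boot off
    # zero the first sector, which holds the MBR boot code and partition table
    dd if=/dev/zero of=/dev/hda bs=512 count=1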
    
  • do a web search for several good sample kickstart files. I found that merging 3 or 4 good ones provided very nice results.
  • by default, kickstart configures your networking with DHCP if you are doing network installs, but you can overwrite that in your post-install section with multiple static IP addresses if desired.
  • test your tftpd setup from the server (or another node) with tftp localhost -v -c get pxelinux.0
  • do tail -f /var/log/messages on the DHCP server to monitor DHCP requests by client nodes.
  • Make sure “/var/lib/dhcp/dhcpd.leases” exists.
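
Once the MAC addresses are sitting in the ARP cache, they can be turned into DHCP host entries without retyping them. This is a rough sketch that assumes the usual column layout of arp -n from net-tools (address, hardware type, MAC); the function name and the node1, node2, ... host names are placeholders, not something from my actual setup.

```shell
#!/bin/sh
# Turn `arp -n` output (read from stdin) into dhcpd.conf host entries.
# Lines whose second column is not "ether" (the header, incomplete
# entries) are skipped. The "nodeN" names are placeholders.
arp_to_dhcpd_hosts() {
    awk '$2 == "ether" {
        n++
        printf "host node%d { hardware ethernet %s; fixed-address %s; }\n", n, $3, $1
    }'
}

# Typical use on the install server (review the output before
# pasting it into your dhcpd.conf):
#   arp -n | arp_to_dhcpd_hosts
```

Each node then gets the same address on every reinstall, which makes the appliance approach easier to live with.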

Likely I will move to Rocks Clusters later, which is also derived from CentOS.

The Rocks Clusters people handle PXE boot in a more sophisticated way, configuring PXE boot to read the kernel image from the local hard drive, sparing tftpd from being swamped on clusters of thousands of nodes. Their unboot utility is called cluster-kickstart-pxe.

hp.com: Setting up a Linux PXE server and integrating clients – Howto (c00257674.pdf)

RedHat Linux KickStart HOWTO
Remote Network Boot via PXE
communities.vmware.com: How to Pass Parameters to a Kickstart Script?
aboveaverageurl.com: PXE Booting
Howtoforge: Unattended Fedora 8 Installation With NFS And Kickstart
Yu Dong, NASA: Installing Linux over Network: PXE, DHCP, TFTP, NFS and Kickstart
Rocks Cluster 5.3: Forcing a Re-install at Next PXE Boot
[Rocks-Discuss] cluster-fork ‘/boot/kickstart/cluster-kickstart --start’ has no effect?
IEEE OUI and Company_id Assignments (MAC Address Database)
ftp://ftp.rocksclusters.org/pub/rocks
Reading Dell service tag number – dmidecode -s system-serial-number
Debian – setting hostname from DHCP result

Dell OpenManage and check_openmanage Update Problems on Linux

Tuesday, February 9th, 2010

Just before Christmas 2009, a new version of Dell OpenManage 6.2 for Linux was “released” – well, thrown over the wall, untested, resulting in this scary message on my Dell PE 2950’s:

# omreport storage controller
No controllers found

That sure got my attention …

There were at least 2 issues caused by this update:

  1. Although the individual packages were fine, the installer script had bugs that left the combination of packages not working correctly, regardless of whether you were updating an old system or doing a fresh CentOS installation. Even though disk volumes were still mountable, most omreport options did not work. Somebody posted a script on the Dell forum that usually fixes that, and I have added some modprobe commands that some people also recommended:
    #!/bin/bash
    
    # this script based on Dell Forums samples
    
    /sbin/modprobe ipmi_si
    /sbin/modprobe ipmi_devintf
    
    yum remove 'srvadmin*'
    yum install srvadmin-all
    yum install dell_ft_install
    cd /opt/dell/srvadmin/etc
    ./autoconf_cim_component.sh
    yum remove srvadmin-iws srvadmin-webserver srvadmin-jre
    srvadmin-services.sh start
    omreport storage controller # now works properly, or reboot first
    
    #Somebody really pooched the dependencies list in the OMSA 6.2 install !!!!
    
  2. omreport was installed in a new location, so the commonly-used check_openmanage monitoring perl script failed to find it. A simple edit fixes that (the line marked + is the addition):
/usr/lib64/nagios/plugins/contrib/check_openmanage:

#
# Locate the omreport binary
#
sub find_omreport {
    # Possible full paths for omreport

    my @omreport_paths
      = (
         '/usr/bin/omreport',                            # default on Linux
         '/opt/dell/srvadmin/oma/bin/omreport.sh',       # alternate on Linux
         '/opt/dell/srvadmin/oma/bin/omreport',          # alternate on Linux
+         '/opt/dell/srvadmin/bin/omreport',               # alternate on Linux
         'c:\progra~1\dell\sysmgt\oma\bin\omreport.exe', # default on Windows
         'c:\progra~2\dell\sysmgt\oma\bin\omreport.exe', # default on Windows x64
        );
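
Before editing the plugin, it helps to confirm where omreport actually landed on a given box. The sketch below does the same first-match search from the shell; the helper name first_executable is made up for this example, and the Linux paths are simply the ones from the list above.

```shell
#!/bin/sh
# Print the first executable file among the arguments; fail if none exists.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate locations copied from the check_openmanage list.
if omreport=$(first_executable \
        /usr/bin/omreport \
        /opt/dell/srvadmin/oma/bin/omreport.sh \
        /opt/dell/srvadmin/oma/bin/omreport \
        /opt/dell/srvadmin/bin/omreport); then
    echo "omreport is at $omreport"
else
    echo "omreport not found in any known location" >&2
fi
```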

Dell power-edge list: OpenManage 6.2 Storage Controller not found fix
Dell Forums: OMSA daemons appear to crash a minute after startup

Installing munin-1.4-alpha on CentOS 4.5

Wednesday, November 18th, 2009

Munin is a data collection, graphing and limited monitoring system written in Perl that is built on top of RRDTool.

Typically, sysadmins use Munin for graphing system performance, alongside another dedicated monitoring system like Nagios.

If you follow along with the munin INSTALL document, you automatically get about 100 system and application graphs with no manual configuration except double-checking your hostname setup and http document root in munin.conf.

After a few days, I think most people find only a few graphs to be useful, including the memory usage and load graphs. At that time you can remove the symlinks made during the install process for items not worth monitoring.

Some minor tweaking of the plugin scripts, written in sh, helps to make the graphs more readable. For example, commenting out the Apache free slots code in apache_processes improves chart readability by just showing busy and idle processes, instead of scaling to show the maximum limit (which is 256 slots for Apache 1.3, or ServerLimit for Apache 2).


Munin apache_processes chart with proper scaling


Also, forcing the memory script to use total physical RAM * 1.1 provides a common y-axis value, improves readability in the chart area, and minimizes fluctuations due to any swap activity. See “Adjusting the Scale of Munin Graphs” for how to do that.
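
As a rough illustration of the arithmetic (not necessarily the exact steps from the linked article), the limit can be computed from MemTotal and passed to rrdtool through the plugin's graph_args line. The function name ram_upper_limit is invented for this sketch; --upper-limit and --rigid are standard rrdtool graph options.

```shell
#!/bin/sh
# Upper y-axis limit in bytes: physical RAM (given in kB) times 1.1,
# using integer arithmetic.
ram_upper_limit() {
    echo $(( $1 * 1024 * 11 / 10 ))
}

if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    # A plugin would print a line like this in its "config" output.
    echo "graph_args --base 1024 -l 0 --upper-limit $(ram_upper_limit "${mem_kb:-0}") --rigid"
fi
```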



Munin memory chart with forced RAM total on y-axis. (Getting rid of swap_cache (dark orange) in linux requires a reboot.)

I just finished installing munin 1.4 alpha from source on a CentOS 4.5 machine, since recent binary packages have dependency issues on this server.

However, there were a couple of issues to overcome:

  1. munin-graph could not find $libdir/VeraMono.ttf, causing the graph labels and legend to not be displayed. Changing the single-quotes to double-quotes everywhere VeraMono.ttf is mentioned in Munin/Master/GraphOld.pm allows $libdir to get string-interpolated properly.
  2. below is a munin-node startup script that I adapted from a Sun version:
#!/bin/sh

prog="munin-node"
path="/opt/munin/sbin"

mkdir -p /var/run/munin

case "$1" in
'restart')
        # stop the daemon, then start it again (case branches in sh do not fall through)
        /usr/bin/pkill -x $prog
        $path/$prog
        ;;
'start')
        $path/$prog
        ;;

'stop')
        /usr/bin/pkill -x $prog
        ;;

*)
        echo "Usage: $0 { start | stop | restart }"
        exit 1
        ;;
esac
exit 0

It was worth installing.

Shortly after installing Munin, it captured an abuse event that caused serious swapping (see above memory graph.) Using the graph timescale, I was able to narrow down which user caused it, and configured the application to not allow that again. :)

Here are the commands to create and install your own munin monitoring script:


cd /opt/munin/lib/plugins
cp uptime my_script
vi my_script
MUNIN_LIBDIR=/opt/munin/lib ./my_script
ln -s /opt/munin/lib/plugins/my_script /etc/opt/munin/plugins/my_script
service munin-node restart
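
For reference, a munin plugin only has to do two things: describe the graph when called with "config", and print current values otherwise. Here is a minimal sketch of what my_script could look like. The file-count metric, the WATCH_DIR variable and the field name "files" are all made up for illustration.

```shell
#!/bin/sh
# Minimal munin plugin sketch: graphs the number of files in a directory.
# WATCH_DIR is a hypothetical setting that defaults to /tmp.
my_script() {
    dir=${WATCH_DIR:-/tmp}
    if [ "$1" = "config" ]; then
        # Describe the graph and its single data field.
        echo "graph_title Files in $dir"
        echo "graph_vlabel files"
        echo "graph_category system"
        echo "files.label files"
        return 0
    fi
    # Default action: print the current value for the field.
    echo "files.value $(ls -A "$dir" | wc -l | tr -d ' ')"
}

my_script "$@"
```

Running the script by hand with and without the config argument, as in the MUNIN_LIBDIR line above, is the quickest way to check the output before linking it into the plugins directory.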

sysmonblog.co.uk: Adjusting the Scale of Munin Graphs
Cacti Plugin for Nagios

Perl CGI.pm Module and Random EXPORT_TAGS Processing of LABEL Tag

Friday, November 13th, 2009

I was tracking down missing HTML4 LABEL tags (semantic tags that improve usability by associating mouse clicks on text labels with form inputs) in CGI.pm scripts using the radio_group() function, and discovered something that merits a blog post in case other programmers run into the same problem …

Using the -no_xhtml directive to CGI.pm disables the LABEL tag, regardless of what EXPORT_TAGS you specify. So even if :standard, :all, or :html4 are specified, the LABEL tag is not emitted.

The :html4 EXPORT_TAG looks like this:

                ':html4'=>[qw/abbr acronym bdo col colgroup del fieldset iframe
                            ins label legend noframes noscript object optgroup Q
                            thead tbody tfoot/],

But the plot thickened as I continued testing, because even if you don’t use -no_xhtml, the following incorrect behavior occurs …

CGI.pm has several EXPORT_TAGS, such as :html2, :html3 and :html4 that should restrict the available HTML tags according to the HTML version you want. However, LABEL, which appeared in HTML4 and is correctly defined only in the :html4 tag, is always displayed around radio_groups, etc.

Then I noticed CGI.pm conditional code like this:

    return $XHTML ? CGI::label(qq{<input type="checkbox" name="$name" value="$value"
             $tabindex$checked$other/>$the_label})
                  : qq{<input type="checkbox" name="$name" value="$value"
                    $checked$other>$the_label};

and

        if ($XHTML) {
           push @elements,
              CGI::label(
                   qq(<input type="$box_type" name="$name" value="$_"
                   $checkit$other$tab$attribs$disable/>$label)).${break};
        } else {
            push(@elements,qq/<input type="$box_type" name="$name" value="$_"
            $checkit$other$tab$attribs$disable>${label}${break}/);
        }

I don’t see why XHTML mode is needed for an HTML4 tag like LABEL. And if :html2 or :html3 is specified, LABEL should not be emitted.

At this point, I would have to question all of the XHTML-related code that was added to CGI.pm since 2.67. See changelog.

Another bug is that there should be a space before $checked above, so that the literal ‘checked’ becomes ‘ checked’.

Test script:

use CGI qw / -no_xhtml :html4 :form /;
#use CGI qw / :html4 :form /;

print 'Perl version: ', $], ', ', '$CGI::VERSION: ', $CGI::VERSION, "\n";

print radio_group(
                  -name=>'how far',
                  -values=>['10 ft','1 mile','10 miles','real far'],
                  -default=>'1 mile',
);

print "\n",
      bdo("HTML4 BDO tag"),
      "\n";

The output incorrectly omits LABEL tags.

Tested with Perl 5.8.5 (CGI.pm 3.29) and 5.8.8.
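
To compare CGI.pm versions or pragma combinations quickly, the output of the test script can be piped through a small filter that counts opening LABEL tags. This is just a convenience sketch; the function name count_labels and the file name label_test.pl are hypothetical.

```shell
#!/bin/sh
# Count opening LABEL tags (any case) in the HTML read from stdin.
count_labels() {
    { grep -o -i '<label[ >]' || true; } | wc -l | tr -d ' '
}

# Usage, with the test script saved as label_test.pl:
#   perl label_test.pl | count_labels
```

A count of 0 with -no_xhtml and a non-zero count without it would match the behavior described above.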

ApacheCon 2009 Oakland

Friday, November 6th, 2009

I went to ApacheCon 2009 in Oakland. Why Oakland? The ASF was founded here 10 years ago.

Executive Summary

Most of the attendees that I talked to were primarily interested in search technologies, or were Apache project committers. The search users were either already using Lucene and Solr, or were using commercial software and evaluating Lucene and Solr.

There was also a lot of interest in Hadoop, Zookeeper and NoSQL projects.

I added a wikipedia NoSQL project features table after the NoSQL BoF.

The conference was very well-organized, with tutorials, BoFs, a BarCamp, and sessions. Meetup.com was used to generate the highest BoF turnout that I’ve ever seen – close to 100 at the Lucene and Hadoop BoFs. (O’Reilly Conferences can learn from that.)

The Oakland Convention Center was a good venue for this conference, though the attached Oakland Marriott hotel is $$$$ and fond of surcharges, like $33/day for parking, $5 draught beer and $3.75 for a bottle of water in-room.

The keynotes and one track per day were recorded and are available for $99 at Linux Pro Magazine Streaming.

StoneCircle Productions was the conference organizer.

Conference Notes

Monday

Although I live in San Jose, Oakland is far enough away that I’ve never been there. Oakland has a compact downtown full of historical-era buildings, and Alameda is also nice, but things get less pretty at night.

I went to the Lucene tutorial on Monday.

Lunch Conversations

- awesome views of Bay Area past Golden Gate bridge from 21st floor
- FAST pretty good indexing and search solution, but bought by Microsoft recently (going to continue linux support or not?)
- FAST has FQL (users pronounce it fecal) query language :)
- 150 FAST servers replaced by 40 lucene servers by 1 company
- FAST4 to FAST5 upgrade tough, similar to port to say lucene, forced upgrades for support
- linguistics is 60% of value of Fast according to Monster, 13 languages supported
- “bad stems” can be a nightmare
- SOLR gives you 90% of what you would need to program in java, built on top of Lucene
- Open Source search is not really about price, but about control and flexibility

Monday Afternoon – Lucene Tutorial

- user-assigned document id not mandatory, but great idea for many reasons, including after an index-rebuild
- lucene-assigned id only valid for that snapshot (life of score doc)
- parameter to keep or delete old index directory
- StringBuilder is more efficient than strcat
- populating title column is a good idea
- results boosting handy for ecommerce, specials, etc.
- LUKE – handy tool for index statistics, etc.
- Searcher class, snapshot in time, won’t see new merges
- contrib/ has more analyzers
- snowball stemmers
- use 1 tokenizer and 0 or more token filters
- precision-recall curve ??
- n-grams and shingles (“the president”, “United States”)
- pre-2.9 lucene, numbers and dates really strings
- 2.9 NumericField builds trie structure, helps optimize range queries
- SOLR analysis tool apache-solr
- relevance feedback with MoreLikeThis
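
For anyone unfamiliar with the shingles item in the notes above: a word shingle is just a run of N adjacent tokens indexed as a single term. The toy sketch below shows the idea with awk on whitespace-separated words; it is not how Lucene implements it, and the function name is made up.

```shell
#!/bin/sh
# Print every run of N consecutive words from each input line, one per line.
shingles() {
    awk -v n="$1" '{
        for (i = 1; i + n - 1 <= NF; i++) {
            s = $i
            for (j = 1; j < n; j++) s = s " " $(i + j)
            print s
        }
    }'
}

# Two-word shingles: "the president", "president of", "of the", ...
echo "the president of the united states" | shingles 2
```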

Monday BoFs

Couchdb

- “ground computing”
- “offline by default”
- now an ubuntu service
- mozilla raindrop to combine chat client msgs
- lockless
- append-only btree
- rsyncable since append-only, also replication
- checksums everywhere
- windows not first class yet, mozilla improving it

@mozilla

- browsercouch
- don’t like sql
- brasstacks test tool storage
- store now, index later
- replicate to handle large indexing load
- testbot ci

Marklogic

- commercial
- xml-centric
- great for articles, books
- transactional
- search-centric
- structure-aware
- schema-free
- xquery-driven
- extremely fast, largest 200 TB xml, 166 on hosts
- clustered
- database server
- 180 clients, 150 employees
- markmail.org demo contains 42 million email messages, very impressive performance with 5 views in almost realtime. Search is distributed across 160 nodes.

JCR in 15 minutes

- Bertrand Del
- Apache Jackrabbit is a fully conforming implementation of the Content Repository for Java Technology API (JCR). A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
- the ultimate content store
- content repo, union of database and filesystem, best of both worlds
- full-text search combined with structured search

Solr Flair

- information forage
- “resume-driven design”

Lucene Numerics

- available in 1.4
- tune by modifying precisionStep

HBASE

One bewildered attendee wished for a NoSQL product matrix, so I added that to the wikipedia NoSQL page.

Wednesday Sessions

Becoming a Pig Developer, Alan Gates

- Apache Pig is a sub-project of Apache Hadoop.
- this talk was really how to use PIG as an end-user, not to become a Pig project developer

Apache Hadoop in the Cloud, Tom White

- general comments on using EC2 with Hadoop mostly

Practical HBase, Michael Stack

- Apache HBase is the Apache Hadoop database, similar to BigTable.
- HBASE usage

mod_jk / mod_proxy and others, Jean-Frederic Clere and 2 others

- mod_jk, mod_proxy, mod_serf and mod_cluster original topics
- mostly focused on mod_jk, mod_proxy and isapi_redirect
- good talk by 3 long-term project contributors
- jk is kind of Java-centric, with support for Apache JServ Protocol (AJP) only available in Java back-end servers for now, like Tomcat
- isapi_redirect is primary way to do redirects on Windows IIS
- survey of audience showed several mod_proxy users, maybe one intentional mod_jk user

Thursday Sessions

“Apache Lucene and Apache Solr Performance Tuning with Mark Miller” was packed, so moving along to a different room …

Scalable Internet Architectures, Theo Schlossnagle

- amazing and thought-provoking talk, also one of the most popular
- think about performance from network packet level to application level
- carp, vrrp, whackamole
- alterdns, neustar
- dynact
- anycast (shared IP), geoip (but need actually accurate database)
- activemq, rabbitmq instead of Spread
- “memcached is the worst thing that ever happened to our industry – it solves a problem, just not the original problem”

- many apps today are so poorly designed that network issues never become scalability concerns – i.e. RoR applications :)
- max out at 500 requests per second across 40 boxes – RoR
- firebug and yslow have been fantastic at making front-end engineers aware of networking performance
- 10 gb nics suck
- instead of one big 20 Gbps load balancer, use anycast from core router to 5 cheaper 4 Gbps load balancers
- spiky load or DDoS – announce a /32 to separate load balancer, use symmetric return path

- jms, amqp, spread

durable message queues

- activemq (java)
- openamq (c) – hard to use
- rabbitmq (erlang) – nice except in durable mode because erlang disk io blows

- most common protocol Stomp is awful and slow (hard to read 100k messages per second) and not binary, but lots of clients exist.

- activemq and stomp is a good start.
- rabbitmq and native connectors are better, but no perl client.

- PCI compliance requires a stateful firewall. Hard to do 1.5 million packets per second traffic for most medium-sized data centers, need to use a CDN to distribute static requests and distribute the packets somewhere else
- leaving trailing / off causes 302, doubles traffic
- Slides
- read/write ratio is 1 … likely IM or email?
- went over some networking details with Paul L. afterwards

Recent Developments in SSL and Browsers, Rick Andrews, Thawte

- 1.6 billion OCSP requests per day, need good infrastructure to support that
- intermediate CA allows root CA to be offline – chained hierarchy – SSLCertificateChainFile needs intermediate certificates before cross-certificates, and some clients need them in proper order
- EV hierarchy more complex. wanted new EV root, but older browsers don’t know about it.
- browser ubiquity problem with any new feature, hash or crypto algorithm
- logotypes – trademark and copyright issues with using other companies’ logos in a product
- Verisign does not have apache httpd committers, but should
- 1 attendee wanted to sign JavaScript files, but what does it mean if most sites link to 10 advertising and tracking scripts? what do you tell the user if 1 JS is not signed?

Subversion Meetup

Organizers didn’t show up, so spent 10 minutes talking to a handful of end-users about subversion gripes and moved along to …

Hadoop Meetup

Zookeeper

- zk is persistent to disk
- can run on one node, but 3 is minimum non-toy
- zk is popular in academia now for some reason
- avoid split-brain partitioning between 2 data centers – bad
- very recent merge to fix -368, not ready for production yet
- people using it for a message queue, perhaps more reliable than many other Open Source ones
- need 1 zk node for testing, but 3 zk nodes for non-trivial implementation

Scribe

- github
- 4x to 5x compression with lzo. similar disk bw improvement

A local owner of a gelato store handed out 6 free samples from a portable gelato freezer. :)

Friday Sessions

Building Intelligent Search Applications with the Lucene Ecosystem, Ted Dunning

- some matrix math
- using his matrix math optimization, a perl program on 1 server was faster than Mahout running on a $250k cluster :)
- tdunning.blogspot.com

- the original LLR in NLP paper: “Accurate Methods for the Statistics of Surprise and Coincidence” (check on CiteSeer)
- Mahout project: tdunning [at] apache.org

Realtime Search, Jason Rutherglen

- many technical issues prevent Lucene from being able to do realtime search
- lots of patches done, lots to do
- audience member thanked author for great work so far

Closing Plenary: Brian Behlendorf on Open Source and Charity

Talked to Alex Karasulu a little after the final presentation. He’s a committer on the Apache Directory project. He suggested adding dbm to the NoSQL product matrix. Wants a MacBook Air with 8 GB RAM to run his Java apps. :)

Conference Schedule Grid

Mac OS X MacPorts and Fink software port systems

Wednesday, September 23rd, 2009

Like other Unix-based systems, the Mac also has packaging and network repository systems for installing Open Source software.

I have been using the MacPorts system, which is quite nice and has over 6300 packages in source form. Another is fink, which uses Debian tools like dpkg, dselect and apt-get to manage over 2500 packages, both source and binary forms.

In general, just type “sudo port install packagename” to have MacPorts install whatever Open Source programs you want, including end-user apps like R and octave.

The only wrinkles so far have been that packages are sources, so they have to be built on your machine, which is slow: ‘port upgrade outdated’ is glacial if you have a lot of stale packages installed, so ensure your AC adapter is plugged in. Also, some common package dependencies, like tiff, now require Apple Xcode 3.1 (a free download) or higher to be installed, or one gets the following fatal error message:

[...]
--->  Extracting tiff
On Mac OS X 10.5, tiff 3.9.1 requires Xcode 3.1 or later but you have Xcode 3.0.
Error: Target org.macports.extract returned: incompatible Xcode version
[...]
Error: Status 1 encountered during processing.

Xcode (and MacPorts) are not updated from Mac OS X Software Update, so you must do that manually. Obviously that is a potential security problem.

Some handy port commands

# see what ports are available

port list
port list all

# see what ports are already locally installed

port installed | grep -v xorg

# commonly-used packages for developers

sudo port install vim lynx links wget aquaterm htop

# easy way to install X11 and most common package dependencies in about 2 hours, so use an AC adapter

sudo port install octave r

# for Internet engineers (yes, PHP 5.3.0+ with Apache2 is installed …)

sudo port install apache2 php5 mysql5 squid lighttpd nginx pound varnish webalizer wget wireshark

To activate PHP5:

cd /opt/local/apache2/modules
/opt/local/apache2/bin/apxs -a -e -n "php5" libphp5.so

To configure daemons like apache2 to start at boot time and also start immediately, first edit the respective configuration file (on notebooks I usually restrict listening to 127.0.0.1), then:

sudo port load apache2

# update MacPorts system

sudo port selfupdate
sudo port upgrade outdated
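
Since a big upgrade can drain a notebook battery, the update can be wrapped in a check for AC power. This is a sketch that assumes pmset -g batt reports the power source with the words "AC Power", as it does on the Macs I have seen; the function name on_ac_power is made up.

```shell
#!/bin/sh
# Succeeds if the `pmset -g batt` output on stdin reports AC power.
on_ac_power() {
    grep -q "AC Power"
}

# Only attempt the upgrade where both pmset and MacPorts are installed.
if command -v pmset >/dev/null 2>&1 && command -v port >/dev/null 2>&1; then
    if pmset -g batt | on_ac_power; then
        sudo port selfupdate && sudo port upgrade outdated
    else
        echo "On battery; skipping port upgrade" >&2
    fi
fi
```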

If you want to use vector graphics in an AquaTerm with gnuplot or octave, you may need to start AquaTerm first, or set the following environment variables in your .profile startup script:

export GNUTERM=aqua
export GNUTERMAPP=/opt/local/var/macports/software/aquaterm/1.0.1_5/Applications/MacPorts/AquaTerm.app

MacPorts FAQ
Ryan Schmidt’s comments on tiff and the dependency on Xcode 3.1

Fink can be tricky to install with a securely configured Mac, but installation can be done from the command line easily:

hdiutil attach Fink-0.9.0-Intel-Installer.dmg
sudo installer -pkg "Fink 0.9.0-Intel Installer.pkg" -target /
/sw/bin/pathsetup.sh
fink selfupdate
fink --version
fink list

macosx.com: How to Install a .dmg File At a Command Line?
Install Apache 2 and PHP 5 with MacPorts

Install Tsoft for WINDOWS on your Mac with WINE via MacPorts Project

ocf.berkeley.edu: Security Issues concerning X

sf.pm.org: Oops! I i18n’d your app

Wednesday, August 26th, 2009

Jeff Goff (DrForr) gave a sf.pm.org talk on internationalizing web apps at Six Apart in San Francisco.

(It was a long trip from San Jose on the Caltrain. I knew that I had arrived in San Francisco when I could smell the stench of urine upon leaving the station.)

Jeff mentioned working on ticketmaster.com before, and used S5 slides to illustrate a variety of localization issues with languages like Chinese, Japanese and Malaysian.

Some of his tips for identifying and preventing translation string corruption were:

  • check for double-encoding of UTF-8 strings, perhaps with Test::utf8::is_sane_utf8()
  • check complete toolchain for UTF-8 cleanliness
  • can use Unicode script and block properties to identify language when possible, as documented in perldoc perlunicode
  • use RCS pre-commit hook feature to inspect checkins, though can be slow with large input files.
  • important to decide how much cleanup the translator is responsible for vs. internal.
  • JavaScript string localization will likely require careful escaping of quotes.
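
The double-encoding check in the first tip can also be approximated outside Perl. The sketch below is a heuristic using iconv, not a port of Test::utf8: it strips one layer of encoding by converting to Latin-1 and then asks whether the result is still valid, non-ASCII UTF-8. The function name is invented, and text that legitimately contains sequences like "Ã©" would be flagged as a false positive.

```shell
#!/bin/sh
# Heuristic check for double-encoded UTF-8 on stdin.
# Returns 0 when the input looks double-encoded, 1 otherwise.
looks_double_encoded() {
    # Step 1: undo one layer of encoding. This fails if the input is not
    # valid UTF-8 or contains characters outside Latin-1.
    decoded=$(iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) || return 1
    # Step 2: the result must itself be valid UTF-8 ...
    printf '%s' "$decoded" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    # Step 3: ... and contain at least one non-ASCII byte.
    high=$(printf '%s' "$decoded" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' ')
    [ "$high" -gt 0 ]
}

# Usage, with a hypothetical translation file:
#   looks_double_encoded < strings_fr.txt && echo "possible double encoding"
```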

Audience members also suggested:

  • enable online web editing of translations as well as batch export
  • consider locking columns if translators use excel worksheets.

As always, my comment is that it’s more important to focus on locale definition than charsets in i18n projects.

Several members were looking for perl jobs, so post your offers on the mailing list.

Thanks to Six Apart for hosting the meeting.

Juerd’s Perl Unicode Advice
Unicode.org
wikipedia: UTF-8
Jeff’s CPAN