I attended an excellent class on Sun Grid Engine (SGE) Cluster Administration at the Santa Clara Hyatt. The instructor was Chris Dagdigian, from BioTeam, and the sponsor was UnivaUD.
This was a one-day version of his two-day class, so things moved pretty fast.
Chris is very familiar with SGE use cases applied across a number of different industries, and how SGE differs from LSF.
Since a few attendees worked in EDA, Chris also provided useful information specific to EDA, such as SGE resource configuration and retrying license checkout requests in epilog scripts for FlexLM.
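As a rough illustration of the retry idea (not Chris's actual script), here is a minimal Python sketch. It assumes a hypothetical callable, `try_checkout`, that wraps whatever command requests the FlexLM license and returns True on success. The attempt count and delay are made-up defaults.

```python
import time


def retry_checkout(try_checkout, attempts=5, delay=30.0, sleep=time.sleep):
    """Call try_checkout() until it succeeds or the attempts run out.

    try_checkout is a hypothetical callable (not part of SGE or FlexLM)
    that returns True when the license was obtained.
    Returns the 1-based attempt number that succeeded, or None.
    """
    for attempt in range(1, attempts + 1):
        if try_checkout():
            return attempt
        # Wait between attempts, but not after the final failure.
        if attempt < attempts:
            sleep(delay)
    return None
```

A script using this could exit with a non-zero status when `retry_checkout` returns None, so the scheduler sees that the license never became available.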
LSF has had a mature public programming API for a long time, while SGE offers only DRMAA, a limited API for job submission and limited administration, so you’re stuck writing wrappers for the command-line utilities.
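To give a flavor of what such a wrapper looks like, here is a minimal Python sketch that runs `qstat` with XML output and turns it into a list of dictionaries. The element names (`job_list`, `JB_job_number`, `JB_name`, `JB_owner`, `state`) are what I would expect from `qstat -xml`, but treat them as assumptions and check them against the output of your own SGE version. The function names are my own.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_qstat_xml(xml_text):
    """Return one dict per <job_list> element found in qstat XML output."""
    root = ET.fromstring(xml_text)
    jobs = []
    for job in root.iter("job_list"):
        jobs.append({
            "id": job.findtext("JB_job_number"),
            "name": job.findtext("JB_name"),
            "owner": job.findtext("JB_owner"),
            "state": job.findtext("state"),
        })
    return jobs


def list_jobs(user="*"):
    """Run qstat for the given user pattern and parse the result.

    Needs the SGE command-line tools on PATH; "*" asks for all users.
    """
    result = subprocess.run(
        ["qstat", "-u", user, "-xml"],
        capture_output=True, text=True, check=True,
    )
    return parse_qstat_xml(result.stdout)
```

Keeping the parsing separate from the `subprocess` call means the parser can be tested against saved output without a live cluster.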
LSF is also queue-oriented, while SGE is more policy-oriented, so an LSF configuration can have 10x as many queues as an SGE one (not better, just different).
The SGE “lab” was available on 10 Amazon EC2 Extra Large instances, one for each person (BYON = Bring Your Own Netbook).
I found it useful to quickly try the command-line tools as Chris talked about them. In a full 2-day class, you would also install, configure and do reporting.
(He said I was one of the few qmon fans. Most people just use the qmon Motif/X11 GUI for cluster monitoring, but I also do most administration with it.)
Chris said that SGE releases now alternate between feature and performance enhancement versions.
SGE performance has been improved on large systems by tuning for the Ranger cluster at the Texas Advanced Computing Center (TACC) at the University of Texas, which has 580 teraflops and 63,000 cores. (It had a Top500 supercomputer ranking of #4.)
Although scheduling improvements have resulted, the default behavior of some of the command-line tools has been neutered to reduce load, so you will need to add more options to get the same result now.
It’s still early days to see how batch computing and cloud computing (Amazon EC2, Hadoop, etc.) will coexist. With on-demand scheduling, SGE could possibly be used to farm out Internet web request jobs to Amazon EC2, but the job submission overhead would have to be measured.
Chris is also a storage geek, so he offered some advice on cluster storage. He insists on RAID6/double parity for storage hardware.
He mentioned the Nexsan SATAbeast devices, which support RAID6 and use 40% less power by spinning down drives (called AutoMAID).
The meeting room was nice, cozy enough for about 10 people, with reliable WiFi and a gourmet Mexican/American lunch.
Thanks to BioTeam and UnivaUD for organizing and hosting this event.
If you’re using SGE and need training, contact Chris and tell him what industry you’re in to get a class tailored to your requirements.
Wikipedia: Platform LSF
Jonathan Schwartz’s blog entry on Ranger


