Sun Grid Engine (SGE)

Sun Grid Engine (SGE) is a free and Open Source batch-queuing software system for scheduling and managing computer jobs on a cluster of computers. The jobs themselves can be single or parallel processing.

I’ve had the opportunity to administer a small cluster (a dozen multi-core Linux nodes dedicated to running EDA applications) for a while and can say that SGE is pretty amazing:

  • free
  • lots of scheduling and resource options
  • several interfaces: command line, a Motif GUI (qmon), a web UI (xml-qstat), and Perl modules
  • fairly good documentation
  • trivial to set up on a new client node (one startup script)
  • a "consumable resources" concept, for example to ration a limited pool of application licenses
  • likely the standard for departmental or single-campus job scheduling
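
To make the consumable idea concrete, here is a toy Python model of it. This is only a sketch of the bookkeeping, not SGE's code or configuration: the class, the method names and the "sim_license" resource are all made up for illustration. The point is that a job is only started when enough units of the resource are left, and the units come back when the job finishes.

```python
class Consumable:
    """Toy model of a consumable resource such as a pool of licenses."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.in_use = {}  # job id -> units currently held

    def available(self):
        """Units that are not held by any running job."""
        return self.capacity - sum(self.in_use.values())

    def try_start(self, job_id, amount=1):
        """Reserve `amount` units for a job; return False if it has to wait."""
        if amount > self.available():
            return False
        self.in_use[job_id] = amount
        return True

    def finish(self, job_id):
        """Release whatever the job was holding."""
        self.in_use.pop(job_id, None)


# Hypothetical pool of two licenses and three jobs that each want one.
licenses = Consumable("sim_license", capacity=2)
started = [licenses.try_start(job) for job in ("job1", "job2", "job3")]

# job3 had to wait; once job1 finishes, it can start.
licenses.finish("job1")
retry = licenses.try_start("job3")
```

In a real grid the scheduler does this accounting for you; users just request the resource at submit time.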

How does SGE differ from other similar software? SGE doesn’t focus on process migration like Mosix did, since SGE is not kernel-level. It does support kernel or application checkpointing.

Any engineer or scientist can read the SGE User Manual, install it, set up a single queue, and go live in an hour using the defaults without any training. As long as the cluster networking is reliable and jobs are submitted correctly, SGE should run with nearly zero administration.

But I would recommend that anybody with “Grid Administrator” in their job title interview their users for grid requirements, then take a class. SGE has a lot of options, and a class will end up taking less time than googling around endlessly or testing changes on a live production grid.

Some of the questions to ask would be:

  • how many compute nodes will be available?
  • how many queues do you foresee? purpose? priority?
  • how much run time, memory, CPU and how many cores does each type of job need?
  • does the user care about calendar execution time of jobs? (in EDA, engineers do)
  • is there a need for a failover master (known as a shadow master in SGE terminology)?
  • what access control will be needed?
  • what job accounting will be needed? (SGE has a fairly sophisticated backend for reporting)
  • will interactive jobs be allowed?
  • will the cluster be dedicated server nodes, or also include user desktops?
  • what commercially-licensed software applications will run on the grid? are there enough licenses? (SGE supports some licensing servers, such as FlexLM)
  • what local software needs to be modified or written to support the grid? a submit wrapper?
  • what end-user training and documentation will be available?
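
As an example of the "submit wrapper" question above, here is a minimal Python sketch that builds a qsub command line from a few site defaults. The qsub options used (-N, -cwd, -q, -l, -pe) are standard ones, but the queue name "eda.q", the parallel environment "smp", the "sim_license" resource and the script name are assumptions for illustration; substitute whatever your grid defines. The function only builds the argument list and does not submit anything.

```python
import shlex


def build_qsub_command(script, name, queue=None, resources=None, pe=None, slots=1):
    """Return the argv list for a qsub call; nothing is executed here."""
    argv = ["qsub", "-N", name, "-cwd"]
    if queue:
        argv += ["-q", queue]
    if resources:
        # qsub accepts resource requests as a comma-separated -l list
        request = ",".join(f"{key}={value}" for key, value in sorted(resources.items()))
        argv += ["-l", request]
    if pe:
        # parallel environment name followed by the number of slots
        argv += ["-pe", pe, str(slots)]
    argv.append(script)
    return argv


# Hypothetical job: 4 slots, 4G of memory per slot, one simulator license.
cmd = build_qsub_command(
    "run_sim.sh",
    name="nightly_sim",
    queue="eda.q",
    resources={"h_vmem": "4G", "sim_license": 1},
    pe="smp",
    slots=4,
)
print(shlex.join(cmd))
```

A wrapper like this is a handy place to enforce local policy (default queue, mandatory memory request) before handing the list to subprocess.run.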

SGE can also be used for “cycle-stealing” (running jobs in the background on end-user workstations) by installing a sensor program to check the clients and suspend or resume jobs based on interactivity or load. Some issues would be swapping out user jobs under memory pressure, and reboots by end-users.
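
The suspend/resume decision can be sketched in a few lines of Python. This is not SGE's sensor interface, just the logic such a setup boils down to, and the function name and both thresholds (10 minutes idle, load below 0.5) are invented for the example.

```python
def next_action(running, idle_seconds, load_avg, min_idle=600, max_load=0.5):
    """Decide what to do with a background job on a user's desktop.

    running      -- True if the grid job is currently running (not suspended)
    idle_seconds -- time since the last keyboard/mouse activity
    load_avg     -- load on the machine that is not caused by the grid job
    """
    user_away = idle_seconds >= min_idle and load_avg <= max_load
    if running and not user_away:
        return "suspend"  # the owner is back, get out of the way
    if not running and user_away:
        return "resume"   # the desktop is idle again
    return "keep"         # nothing to change


# The user touched the mouse five seconds ago while a job was running.
action = next_action(running=True, idle_seconds=5, load_avg=0.1)
```

Requiring both a long idle time and a low load keeps a job from resuming while the user has merely stepped away from a running build.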

What’s interesting is that some grids run multiple job schedulers, usually a primary one plus an SGE “backfill” one.

wikipedia: Sun Grid Engine
Condor HPC Project
MarkMail: Gridengine Forums
