Running TestDisk on Linux with Dell Perc Controllers

February 9th, 2010

While I was testing the Dell OpenManage 6.2 update recently, the CentOS 5.4 installer damaged the main ext3 filesystem superblock on a GPT partition.

I did not ask the CentOS installer to touch the non-system partitions in any way, but it happened.

Fortunately, mke2fs writes superblock backups to each filesystem in case something bad happens.

e2fsck -b could be used to recover a superblock from a copy, but I found a friendlier tool …

I used an Open Source tool by Christophe GRENIER called TestDisk to scan for a backup superblock, and overwrote the bad superblock in about 30 seconds. Then I added the original mount label and mounted the filesystem:

# testdisk_static (or testdisk_static /log /dev/sdb)
# parted /dev/sdb name 1 /data (works on gpt partition types)
# mount -a
# ls -l /data
# tune2fs -l /dev/sdb1

TestDisk worked perfectly, even on a complex system with Perc 6i and Perc 5e RAID controllers with 4 TB partitions, but you must carefully read and navigate TestDisk’s menus, and actually write the new superblock to disk for each filesystem that was lost. TestDisk can also be used to recover files and preventively to save superblocks before an issue occurs.

There are versions of TestDisk for several operating systems, including Windows, Linux 2.4, Linux 2.6 and FreeBSD.

Note that parted also has a rescue mode for partitions:

(parted) help rescue
  rescue START END      # rescue a lost partition near START and END

Other tools to look at when fixing Linux filesystems include tune2fs and partprobe.

For deeper insight into ext2 and ext3 recovery, search for the excellent articles by Ted Ts’o.
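
For anyone taking the e2fsck -b route instead of TestDisk, the usual backup superblock locations can be worked out by hand. The following is only a sketch: it assumes the filesystem was created with the sparse_super feature, 4 KiB blocks and 32768 blocks per group, which are common mke2fs defaults but not guaranteed. The function names are made up for this example. Running "mke2fs -n" or dumpe2fs against the real device is the authoritative way to get the list.

```shell
#!/bin/bash
# Sketch: list likely backup superblock locations for ext2/ext3.
# Assumes sparse_super, 4 KiB blocks, 32768 blocks per group (hypothetical defaults).

# is_power_of BASE N: succeeds if N equals BASE^k for some k >= 0
is_power_of() {
    local base=$1 n=$2
    while (( n > 1 && n % base == 0 )); do
        (( n /= base ))
    done
    (( n == 1 ))
}

# backup_superblocks NGROUPS [BLOCKS_PER_GROUP]
# With sparse_super, backups live in group 1 and in groups that are
# powers of 3, 5 and 7. Prints the starting block of each such group.
backup_superblocks() {
    local ngroups=$1 bpg=${2:-32768} g
    for (( g = 1; g < ngroups; g++ )); do
        if is_power_of 3 "$g" || is_power_of 5 "$g" || is_power_of 7 "$g"; then
            echo $(( g * bpg ))
        fi
    done
}

# Example (not run here): try the first backup on a damaged filesystem
#   e2fsck -b 32768 /dev/sdb1
```

If the first backup is also damaged, the next values in the list can be tried in turn.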

Dell OpenManage and check_openmanage Update Problems on Linux

February 9th, 2010

Just before Christmas 2009, a new version of Dell OpenManage 6.2 for Linux was “released” – well, thrown over the wall, untested, resulting in this scary message on my Dell PE 2950s:

# omreport storage controller
No controllers found

That sure got my attention …

There were at least 2 issues caused by this update:

  1. Although the individual packages were fine, the installer script had bugs that left the combination of packages not working correctly, whether you were updating an old system or doing a fresh CentOS installation. Even though disk volumes were still mountable, most omreport options did not work. Somebody posted a script on the Dell forum that usually fixes that, and I have added some modprobe commands that some people also recommended:
    #!/bin/bash
    
    # this script based on Dell Forums samples
    
    /sbin/modprobe ipmi_si
    /sbin/modprobe ipmi_devintf
    
    # quote the glob so the shell passes it to yum unexpanded
    yum remove 'srvadmin*'
    yum install srvadmin-all
    yum install dell_ft_install
    cd /opt/dell/srvadmin/etc
    ./autoconf_cim_component.sh
    yum remove srvadmin-iws srvadmin-webserver srvadmin-jre
    srvadmin-services.sh start
    omreport storage controller # now works properly, or reboot first
    
    #Somebody really pooched the dependencies list in the OMSA 6.2 install !!!!
    
  2. omreport was installed in a new location, so the commonly-used check_openmanage monitoring perl script failed to find it. A simple edit fixes that:
/usr/lib64/nagios/plugins/contrib/check_openmanage:

#
# Locate the omreport binary
#
sub find_omreport {
    # Possible full paths for omreport

    my @omreport_paths
      = (
         '/usr/bin/omreport',                            # default on Linux
         '/opt/dell/srvadmin/oma/bin/omreport.sh',       # alternate on Linux
         '/opt/dell/srvadmin/oma/bin/omreport',          # alternate on Linux
+         '/opt/dell/srvadmin/bin/omreport',               # alternate on Linux
         'c:\progra~1\dell\sysmgt\oma\bin\omreport.exe', # default on Windows
         'c:\progra~2\dell\sysmgt\oma\bin\omreport.exe', # default on Windows x64
        );
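
The same path search can be done from a shell script, for example in a cron job that warns when the "No controllers found" symptom comes back. This is a rough sketch rather than anything shipped by Dell or check_openmanage: the function names are invented here, and the candidate paths are simply the Linux entries from the list above.

```shell
#!/bin/bash
# Sketch: locate omreport and flag the "No controllers found" symptom.
# Function names are hypothetical; paths come from the check_openmanage list.

OMREPORT_PATHS=(
    /usr/bin/omreport
    /opt/dell/srvadmin/oma/bin/omreport.sh
    /opt/dell/srvadmin/oma/bin/omreport
    /opt/dell/srvadmin/bin/omreport
)

# find_omreport [PATH...]: print the first executable candidate, or fail.
# With no arguments, the default list above is searched.
find_omreport() {
    local p
    (( $# )) || set -- "${OMREPORT_PATHS[@]}"
    for p in "$@"; do
        if [ -x "$p" ]; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

# controllers_ok: reads `omreport storage controller` output on stdin.
# Fails if the output is empty or contains "No controllers found".
controllers_ok() {
    local out
    out=$(cat)
    [ -n "$out" ] && ! grep -q 'No controllers found' <<<"$out"
}

# Example (not run here):
#   om=$(find_omreport) && "$om" storage controller | controllers_ok \
#       || echo "OMSA problem: omreport missing or no controllers found"
```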

Dell power-edge list: OpenManage 6.2 Storage Controller not found fix
Dell Forums: OMSA daemons appear to crash a minute after startup

Bali Trip Notes for January 2010

February 9th, 2010

I just spent a month in Bali, mostly the Tuban-Kuta area.

In the past, Bali was regarded as an inexpensive place for young Australians and others to vacation.

For the first time, however, I would have to say that is becoming a thing of the past.

There are still occasional local hotels available for under $25/nite, but none of the newer hotels are in that range; they are aiming for $100 to $200/nite.

If you can afford it, the new Holiday Inn Baruna Bali in Tuban at $120 to $200/nite is awesome – opening right onto Tuban/Wanasegara Beach. The style is more modern than Balinese, but you can visit the Risata Hotel Bali down the street and see lush Balinese gardens and stonework.

Taxis have greatly increased in price recently. The fare used to be an afterthought, typically less than $1 within a city.

There are 2 classes of taxis now:

  1. Bluebird – great service and fair prices – old (cheaper) argo meter settings, worth calling in
  2. other companies – average service and high prices – new (higher) argo settings, or even 40 ribu minimum pickup fare from Galeria Mall. Indonesian visitors are scared of these prices.

To save money, use an ojek (motorcycle taxi), or try carpooling and scheduling multiple stops on the same trip.

Or pick a hotel within easy walking distance of sites that’s also near a major travel artery. In Kuta, that would be at the exit of Jl. Legian near Jl. Pantai Kuta (easy walk to the Legian nightclub scene, memorial and Kuta Beach as well as near taxis to Tuban or Denpasar.) In Tuban, that would be on Jl. Wana segara or Jl. Kartika Pl. (easy walk to Tuban Beach or Discovery Mall/Mal Centro.)

I talked to some merchants in Tuban, and asking rents for storefronts have doubled in the past 12 months.

All of the computer stores selling PCs in Kuta, Tuban and Sanur have closed, likely due to high rents, low margins and lack of capital. There are a few Mac stores, such as PC Max and one in Carrefour. Otherwise, you must go to the large Rimo Computer Mall in Denpasar. Rimo is pretty good for basic parts, with new releases lagging Jakarta by 2 to 3 weeks.

The most comprehensive selection of DSLR batteries and accessories in Kuta-Tuban is in the Zoom Digital Kiosk in Discovery Mall, Tuban.

If you’re a computer or business person and need to stay in touch online, visit Internet Sartika at Jalan Wana segara No. 29, Tuban. It has dual broadband connections (1 Mbps DSL and 1 Mbps fiber optic) and new 3 GHz Intel Core 2 Duo computers, for the quickest Internet connections.

Several tourists asked me what’s worthwhile to see in Bali.

One of my favorite places is still GWK Cultural Park, which has massive stone monuments, great views overlooking Kuta and local dances starting at twilite. It’s a photographer’s paradise. GWK is only 30 minutes from Kuta or Tuban by taxi and can take anywhere from 2 hours to a day to appreciate.

James’ NoSQL Feature Matrix Link

December 22nd, 2009

A Wikipedia editor (SamJohnston) has reverted the useful NoSQL Feature Comparison Matrix that I added at ApacheCon Oakland.

Here is the last NoSQL article version that has the detailed feature matrix.

Thanks to several contributors for improving it before it was suppressed.

I would be in favor of moving it to a wiki outside of Wikipedia for long-term maintenance.

One nice improvement would be sort-by-column.

Jimmy Wales’ goal for Wikipedia “to be the sum of all human knowledge” is not possible with a rigid NPOV and without primary sources. Instead, Wikipedia will remain “the entertaining cartoon of human knowledge.”

I’ll have to think twice before contributing my time, content or money to Wikipedia in the future.

rackspacecloud.com: NoSQL Ecosystem

Apple Genius Bar Advice on Notebook Battery Life

December 13th, 2009

I talked to an Apple Genius recently about improving notebook battery life.

His recommendations to improve notebook battery life were:

  • kill any runaway programs
  • reduce screen brightness to 50% or less
  • move any Desktop files you don’t need on the Desktop to another folder. This reduces the amount of screen redraw work.
  • update to the latest SMC firmware
  • once a month, unplug the power adapter and run your notebook until it sleeps automatically. Then plug in the adapter and allow it to charge for 8 hours.
  • if there’s still a problem, drop by an Apple store and he’ll run the battery diagnostics program from their bootable service iPod nano. Bring along your receipt in case there’s a problem still covered under warranty.

Reducing the screen brightness from max to 50% immediately improved battery life on my old notebook by more than 50%, from about 2:15 to 3:30.

Also, remove the plastic packaging from new batteries to prevent them from permanently sticking to the plastic battery casing. The plastic is sticky on one side and can be cut into several cell phone display protectors. :)

apple.com: Apple Notebook Battery Care
apple.com: Lithium-Ion Battery Care
support.apple.com: Apple Portables: Tips for maximizing your battery charge
gizmodo.com: How To Maximize Your iPhone 3G’s Questionably Adequate Battery Life
theappleblog.com: What’s the Ideal Strategy to Maximize Notebook Battery Lifespan?

Replacing Mac Powerbook G4 12″ Keyboard

December 5th, 2009

The Powerbook G4 12″ that I bought from Craigslist was sweet overall (1.5 GHz, 1.5 GB RAM, 250 GB hard drive), but the keyboard looked a little ratty.

So I bought a new keyboard from a Hong Kong seller on eBay for $28.00 (including shipping) and installed it today according to the relevant faqintosh. Very shiny!

My only issue was getting the keyboard connector mated securely enough for all keys to work. And not knowing what a black stick is. :)

Update: The Apple Genius Bar will sell you replacement keyboards for $40, installation included.

faqintosh.com: Remove keyboard on a PowerBook 12”
command-tab.com: Apple’s “Black Stick”
tuaw.com: Tracking the mysterious ‘black stick’

MD1000 and MD3000 Redundancy Matrix

December 4th, 2009

I manage several Dell PowerVault MD1000 storage arrays. Their limited redundancy features need to be understood carefully.

I find that it helps to remember this: the MD1000 is a JBOD. Period. Any data redundancy actually depends on your RAID controller.

Although most components are modular, that only helps with field repair, not availability:

Device | Component | Feature | Note
MD1000 | Disks | Hotswappable | Yes, but depends on your RAID controller. You should offline the disk first. Note that rebuilds involving large volumes will take weeks to complete, so use 73GB disks if that’s a problem.
 | | Redundant | Yes, depends on your RAID controller.
 | Cables | Hotswappable | No, requires host and array power down first.
 | | Redundant | No, MD1000 does not support multipath like the MD3000.
 | Power Supply | Hotswappable | Yes
 | | Redundant | Only for 5 minutes since 3/4 fans are required to cool.
 | Fans | Hotswappable | Yes, as part of Power Supply module.
 | | Redundant | Yes, 1 spare with both power supplies working.
 | EMM | Hotswappable | Not really for EMM0, since removing EMM0 loses disk communication, and failback is not automatic. EMM1 can be hot swappable if enclosure is in unified mode, thus not used. Also, the host must be rebooted.
 | | Redundant | No. There can be 2 EMMs installed, but if EMM0 fails, then some disk communications will be lost, whether unified or split mode. Also, no automatic failback. EMMs provide redundancy for enclosure control functions only.
 | Front Panel | Hotswappable | No, requires power down and enclosure disassembly to replace.
 | | Redundant | No, only 1.
 | Firmware Updates without Reboot | Server | Maybe with latest firmware, but a reboot is recommended (common sense), especially with multiple arrays.
 | | MD1000 | Maybe with latest firmware and Perc 6/E. See R216024. A reboot is recommended (common sense), especially with multiple arrays.
 | Clustering | | MD1000: No. MD3000: Yes, should work if you’re lucky.

Notes:

  • A backup is recommended and needed before any configuration changes are made in hardware or software.
  • Adding an MD1000 to an existing daisychain requires the latest firmware to be loaded first. Mixing firmware versions is undefined. Also, a reboot has been required in the past.
  • Adding an MD1000 to an MD3000 daisychain requires reformatting of the MD1000 disks.
  • Adding an MD1000 to an MD1000 daisychain renumbers the enclosure IDs.
  • All MD1000s must finish startup before MD3000s, and before host servers, or disks will be “missing” or “foreign”. So configure servers to “off” after power failure.
  • In fact, Perc controllers can “forget” their volume configuration after reboot, so put everything on a UPS and never reboot.
  • MD3000 with MD1000s daisy-chained from 2 HBAs supports redundant cables.
  • Do not read the log while a rebuild is in progress.
  • Snapshots can be done with linux and any filesystem that supports it, such as ext3 with LVM.
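
As a rough illustration of the LVM snapshot route in the last note, the sequence below creates a snapshot, backs it up, and removes it. It is a sketch only: the volume group, logical volume, mount point, archive path and the 1G snapshot size are hypothetical, and the helper functions are invented for this example. It defaults to a dry run that just prints the commands.

```shell
#!/bin/bash
# Sketch: snapshot an LVM logical volume, archive it, drop the snapshot.
# All names passed in (vg0, data, /mnt/snap, ...) are hypothetical examples.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = run them

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# snapshot_backup VG LV MOUNTPOINT ARCHIVE
snapshot_backup() {
    local vg=$1 lv=$2 mnt=$3 dest=$4
    # The snapshot needs enough space to hold changes made during the backup.
    run lvcreate --snapshot --size 1G --name "${lv}_snap" "/dev/$vg/$lv"
    run mount -o ro "/dev/$vg/${lv}_snap" "$mnt"
    run tar -czf "$dest" -C "$mnt" .
    run umount "$mnt"
    run lvremove -f "/dev/$vg/${lv}_snap"
}

# Example: snapshot_backup vg0 data /mnt/snap /backup/data.tgz
```

Setting DRY_RUN=0 runs the commands for real, so it is worth reviewing the printed sequence first.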

So what are you paying for? Basically, a well-manufactured, well-tested, non-HA, field-serviceable DAS device with good Linux and Windows support and reasonable performance without licensing encumbrances (well, Dell has started to restrict disk replacements.) Most people I’ve talked to who buy other no-name JBOD devices never get something that works right, though SuperMicro multi-disk servers seem to be popular.

Choosing the iSCSI versions could avoid a lot of the cabling and reboot order issues.

Please add a comment with your suggestion or tip about managing the MD1000 or MD3000, or recommendations for affordable, more-available arrays.

Dell Powervault MD1000 Manuals
MD3000 and MD3000i – Generation 2 Firmware Update (2008-12-21)
bladewatch.com: Dell Server Firmware
dell.com: PERC 5/E Fault Tolerance Features
ftp.dell.com: Dell PERC 6/i Integrated Firmware Update 6.2.0-0013 – R216024
INetU Labs takes on the Dell MD3000i: Is it an Enterprise-capable workgroup SAN?
IDC numbers show Dell server storage booming

theregister.co.uk: Dell servers block un-Dell HDDs
dell.com: Third-party drives not permitted on Gen 11 servers
dell.com: Why Customers Should Insist on DELL™ Hard Drives for Enterprise Systems

Nagios check_http.c Patch to Display Result Page Snippet

December 1st, 2009

Here’s a minor change to the Nagios plugin check_http.c that shows the first 128 bytes of the page body if the -s match option is used:

  /* check elapsed time */
  if (strlen(string_expect) && strlen(page)) {
#define MAX_BUFFER_PAGE_SAMPLE 128
     char s[MAX_BUFFER_PAGE_SAMPLE];
     strncpy(s, page, MAX_BUFFER_PAGE_SAMPLE);
     s[MAX_BUFFER_PAGE_SAMPLE-1] = 0;
     /* Need to strip JavaScript here to prevent XSS */
     strip_xss(s);
     strip(s);

     asprintf (&msg,
            _("%s - %d bytes in %.3f second response time %s%s|%s %s"),
            msg, page_len, elapsed_time,
            (display_html ? "</A>" : ""), s,
            perfd_time (elapsed_time), perfd_size (page_len));
  }
  else {
     asprintf (&msg,
            _("%s - %d bytes in %.3f second response time %s|%s %s"),
            msg, page_len, elapsed_time,
            (display_html ? "</A>" : ""),
            perfd_time (elapsed_time), perfd_size (page_len));
  }

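/* Keep only a whitelist of harmless characters, compacting the
   string in place. Needs <ctype.h> for isalnum(). */
void
strip_xss (char *t)
{
   char *s;

   for (s = t; *t; t++) {
       if (*t == ' ' ||
           *t == '.' ||
           *t == '-' ||
           *t == '\'' ||
           *t == ':' ||
           *t == ',' ||
           isalnum((unsigned char) *t)) {
          *s++ = *t;
       }
   }
   *s = 0;
}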

Then you can use the usual Nagios check_http command with the -s option:

# 'check_http_str' command definition with options:
# -u URI without leading scheme and hostname
# -s "string" to match
# -M seconds is max acceptable age of page
# -L $HOSTADDRESS$ makes it a hyperlink to source page

define command{
        command_name    check_http_str
        command_line    $USER1$/check_http -I $HOSTADDRESS$ -u "$ARG1$" -s "$ARG2$" -M $ARG3$
        }

The result looks like:

HTTP OK: HTTP/1.1 200 OK - 377 bytes in 0.163 second response time
OK - System: 'PowerEdge 2950', SN: 'N1234', hardware working fine,
11 logical drives, 51 physical drives

The improved status display allows me to more easily use HTTP for remote nagios monitoring, instead of NRPE. Almost worth dropping into C for. :)
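
To preview what the whitelist in strip_xss would leave of a given page without rebuilding the plugin, roughly the same filtering can be approximated in the shell. This is only a sketch of the patch's behavior, not part of it: page_snippet is a made-up name, and the sed step stands in loosely for the plugin's strip() call by trimming trailing spaces.

```shell
#!/bin/bash
# Sketch: approximate the patched check_http snippet filter in the shell.
# page_snippet is a hypothetical helper, not part of Nagios.

# Reads a page body on stdin and prints the sanitized snippet:
#   1. keep the first 127 bytes (the C buffer is 128 bytes including the NUL)
#   2. delete everything except alphanumerics, space and . ' : , -
#   3. trim trailing spaces
page_snippet() {
    head -c 127 | LC_ALL=C tr -cd "[:alnum:] .':,-" | sed 's/ *$//'
}

# Example:
#   printf '%s' "<script>alert('x')</script> OK - fine" | page_snippet
```

Angle brackets, slashes and parentheses are dropped, so markup in the page body cannot reach the Nagios web interface intact.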