
Knowing your PERC 6/i BBU

February 5th, 2010 Posted in Nagios, Performance, Uncategorized

I’ve recently become supremely disappointed in the availability of Nagios checks for RAID cards. Too often, I see administrators rely on chance (or their hosting provider) to discover failed drives, a dying BBU, or degrading capacity on their RAID cards. So I began work on check_raid (part of check_mysql_all) to provide a suite of checks. One of the first cards I wanted to support was the PERC 6/i, so I scoured the documentation and forums and picked the brains of my friends before finally getting on a marathon 4-hour call with Dell support. I’ll now share the interesting things that I’ve learned.

%> MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
Voltage: 4052 mV
Current: 0 mA
Temperature: 24 C
Firmware Status: 00000000

Battery state: 

GasGuageStatus:
  Fully Discharged        : No
  Fully Charged           : Yes
  Discharging             : Yes
  Initialized             : Yes
  Remaining Time Alarm    : No
  Remaining Capacity Alarm: No
  Discharge Terminated    : No
  Over Temperature        : No
  Charging Terminated     : Yes
  Over Charged            : No

Relative State of Charge: 98 %
Charger Status: Complete
Remaining Capacity: 1572 mAh
Full Charge Capacity: 1605 mAh
isSOHGood: Yes

Exit Code: 0x00

The first thing you’ll note is that I’m using LSI’s MegaCli64 rather than the Dell-supported OpenManage. OpenManage is a GUI-based program, and I almost NEVER work on systems with a windowing environment. I imagine this is a very common case.

Most existing Nagios checks for the PERC 6/i simply check as follows:

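use strict;
use warnings;

# Standard Nagios plugin exit codes
my %ERRORS = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# Pull the value of the isSOHGood line ("Yes" or "No").
# Inside backticks the line continuation and awk's $2 must be escaped.
my $status =
    `MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL \\
     | grep '^isSOHGood' | awk '{print \$2}'` // '';
chomp $status;

# Strings are compared with eq; == would compare them numerically
if ($status eq 'Yes') {
    print "OK\n";
    exit $ERRORS{'OK'};
} else {
    print "CRITICAL: BBU is broken\n";
    exit $ERRORS{'CRITICAL'};
}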

Sure, it’s an oversimplification of what’s out there, but not by much. There is a wealth of information available to us from this command. Much of it is actionable, and the rest should be graphed over time (using Cacti, Munin, or your favorite RRD-like tool). Let’s walk through, step by step:

Voltage: 4052 mV

In principle this can change over time, but I have never seen it do so. I’m curious whether anybody else has.

Current: 0 mA

Although there’s nothing actionable here, it’s interesting to see over time. Significant deviations from the normal patterns may warrant further investigation.
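To get readings like these into a graphing tool, a check can append them to its output as Nagios performance data. Here is a minimal sketch, assuming the output format shown above. The sub name and the perfdata labels (voltage_mv and so on) are my own illustrative choices, not anything MegaCli defines.

```perl
use strict;
use warnings;

# Build a Nagios perfdata string from the numeric readings in
# `MegaCli64 -AdpBbuCmd -GetBbuStatus` output. The label names
# (voltage_mv, ...) are illustrative, not something MegaCli defines.
sub bbu_perfdata {
    my ($text) = @_;
    my @pairs;
    for my $name (qw(Voltage Current Temperature)) {
        # Matches lines such as "Voltage: 4052 mV"
        if (my ($value, $unit) =
                $text =~ /^[ \t]*\Q$name\E:[ \t]*(-?\d+)[ \t]*(\S+)/m) {
            push @pairs, lc($name) . '_' . lc($unit) . "=$value";
        }
    }
    return join ' ', @pairs;
}

# Sample lines copied from the output shown earlier in this post
my $sample = "Voltage: 4052 mV\nCurrent: 0 mA\nTemperature: 24 C\n";
print 'OK: BBU readings | ', bbu_perfdata($sample), "\n";
```

Everything after the pipe character is treated by Nagios as performance data, so the human-readable status text stays short while the numbers still reach the grapher.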

Temperature: 24 C

As of this writing, I am unable to determine the acceptable operating temperatures of the PERC 6/i. With OpenManage, one can configure warning and critical temperatures, but not with MegaCli[64]. Does anybody know the default max temperatures that will trigger “Over Temperature” to Yes? If Dell doesn’t get back to me with this number, I’m going to have to go pick up a hair dryer and find out for myself….

When I asked the Dell rep what the minimum safe operating temperature is, he said there wasn’t one. If my hair dryer experiment doesn’t ruin the PERC 6/i, I’ll pick up a can of compressed air and test it :)

UPDATE: The PERC 6/i BBU’s maximum operating temperature is 140 degrees Fahrenheit. However, the operating temperature range of the RAID card itself is whatever is specified for the host hardware (in my case, I’m working on an R900, which has an operating range of 50-95 degrees Fahrenheit).

  Over Temperature        : No

As seen in the “Temperature” readout above, this boolean field will be tripped to Yes once an as-yet-unknown critical temperature has been reached. If “Over Temperature” is true, the isSOHGood will be marked as No, the BBU disabled, and the Cache Policy will be set to WriteThrough (WriteThrough caching is used in all conditions in which the battery is missing, in a low charge state, or otherwise not behaving as expected).

This is a self-correcting problem and should recover on its own after a time. If the state remains for a significant time or flaps back and forth, it’s time to call Dell.

  Over Charged            : No

In the event that the battery becomes over charged, this will be marked as Yes, the isSOHGood will be marked as No, the BBU disabled, and the Cache Policy will be set to WriteThrough. This is a self-correcting problem and should recover within minutes. If the state remains for a significant time or flaps back-and-forth, it’s time to call Dell.
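A check can look at these two flags directly rather than waiting for isSOHGood to flip. The following is a rough sketch, assuming the flags appear as indented "Name : Yes/No" lines as in the output above. The sub names are hypothetical, and the return codes are the standard Nagios OK (0) and CRITICAL (2).

```perl
use strict;
use warnings;

# Collect the indented "Name : Yes/No" flags printed under GasGuageStatus
# (spelled as MegaCli prints it) into a hash of name => 1 or 0.
sub gas_gauge_flags {
    my ($text) = @_;
    my %flag;
    while ($text =~ /^[ \t]+([A-Za-z ]+?)[ \t]*:[ \t]*(Yes|No)[ \t]*$/mg) {
        $flag{$1} = $2 eq 'Yes' ? 1 : 0;
    }
    return %flag;
}

# Return (exit code, message): CRITICAL if either protective flag is set.
sub check_flags {
    my (%flag) = @_;
    my @tripped = grep { $flag{$_} } ('Over Temperature', 'Over Charged');
    return @tripped
        ? (2, 'CRITICAL: ' . join(', ', @tripped))
        : (0, 'OK: no over-temperature or over-charge condition');
}

# Sample lines copied from the output shown earlier in this post
my $sample = "GasGuageStatus:\n"
           . "  Over Temperature        : No\n"
           . "  Over Charged            : No\n";
my ($code, $message) = check_flags(gas_gauge_flags($sample));
print "$message\n";
```

Since both conditions are described as self-correcting, it may make sense to let Nagios retry a few times before notifying, so that only a persistent or flapping state pages anyone.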

Relative State of Charge: 98 %

In order to preserve the life of the battery, it is not kept fully charged. Rather, this value will fluctuate as the battery partially discharges and then recharges. Empirical evidence suggests that the common range here is 95-100% and there is no official range quoted by Dell. It is possible that this may deteriorate over time as the full charge capacity (below) lessens.

Remaining Capacity: 1572 mAh
Full Charge Capacity: 1605 mAh

The ratio here will match the “Relative State of Charge” above (1572/1605 = 0.979). As such, it doesn’t necessarily provide immediately actionable information, but would be tremendously useful to graph over time in relation to one another. This is because the Full Charge Capacity may deteriorate over time.
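The arithmetic is simple enough to do inside a check. This small sketch (the sub name is my own) recomputes the percentage from the two capacities so it can be compared against the reported “Relative State of Charge”:

```perl
use strict;
use warnings;

# Percentage of the full charge capacity that is currently available.
sub charge_ratio_pct {
    my ($remaining_mah, $full_mah) = @_;
    die "full charge capacity must be positive\n" unless $full_mah > 0;
    return 100 * $remaining_mah / $full_mah;
}

# Values from the output above: 1572 mAh remaining of 1605 mAh
printf "%.1f%%\n", charge_ratio_pct(1572, 1605);   # prints 97.9%
```

Rounded to a whole number this gives the 98 % that MegaCli reports.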

Dell guarantees 24 hours of battery life, but claims 72 hours for cards with 256M and 48 hours for cards with 512M. Extrapolating (this is a new 256M card I’m looking at, so 1605 mAh corresponds to roughly 72 hours), we can expect that once the Full Charge Capacity drops below 535 mAh (1605 × 24/72), the 24-hour guarantee is no longer attainable.

The Full Charge Capacity is calculated at the end of each “learn cycle” that the PERC 6/i goes through. The timeframe for completion of a learn cycle is a function of the battery charge capacity and the discharge/charge current used. For example, on a PERC 6/i, the expected timeframe for completion of the learn cycle is approximately 7 hours. A learn cycle is performed approximately every 3 months. The learn cycle shortens as the battery degrades over time.
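That extrapolation can be written down as a pair of one-liners. This is only a sketch: it assumes hold-up time scales linearly with Full Charge Capacity and that a new 256M card reads about 1605 mAh for the claimed 72 hours. Both are my assumptions, not Dell figures, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# Estimated hold-up hours for a measured Full Charge Capacity, given the
# capacity and claimed hours of a new battery (linear scaling assumed).
sub estimated_hours {
    my ($full_mah, $new_mah, $new_hours) = @_;
    return $new_hours * $full_mah / $new_mah;
}

# Smallest Full Charge Capacity that still meets the guaranteed hours.
sub min_capacity_mah {
    my ($guaranteed_hours, $new_mah, $new_hours) = @_;
    return $new_mah * $guaranteed_hours / $new_hours;
}

# 256M card: 1605 mAh when new, 72 hours claimed, 24 hours guaranteed
printf "%.0f mAh\n", min_capacity_mah(24, 1605, 72);    # prints 535 mAh
printf "%.0f hours\n", estimated_hours(1605, 1605, 72); # prints 72 hours
```

A check could warn once estimated_hours falls toward 24, well before isSOHGood reports a problem.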

During the learn cycle discharge phase, the battery charger is disabled and will remain disabled until the battery is discharged. After the battery is discharged, the charger is re-enabled and the controller measures the time taken for battery recharge. During this time, the BBU effectively becomes disabled and the Cache Policy will be set to WriteThrough.

The learn cycle can only be disabled through OpenManage (there is also a somewhat limited facility for scheduling the next learn cycle). You may be considering this if your performance is highly dependent on the WriteBack (WB) setting on the RAID card. However, I recommend against it, because you may then never be alerted if your battery’s maximum capacity has diminished beyond the 24-hour mark.

isSOHGood: Yes

If you just want an “Is the BBU healthy?” check, this is the only thing that you need to look at. A “No” value here will indicate that the battery is unable to hold a charge, but will give no indication why. I hope I’ve pointed to some other items that can (and should) be paid attention to in order to provide more comprehensive insight into your systems.

check_raid is still in development, waiting for some additional information. I’ve also opened this feature request for some ready-made Cacti graphs.

If you have any additional information on the above, I welcome it.


One Response to “Knowing your PERC 6/i BBU”

  1. Nils Says:

    It’s really a shame how bad the CLI tools for those controllers are compared to 3ware or Areca, yet Dell/LSI refuse to fix the problem…

