Monday, January 12, 2009

AACRAID based controllers timing out / aborting / SCSI hang

We’ve been lately starting to use more Adaptec RAID controllers rather than 3ware RAID controllers. 3ware has been nothing but trouble for us, dropping hard drives, even RAID5 arrays are running slower than a regular hard drive with no RAID. Our latest issue was a server just simply having a Kernel Panic when using high IO, our experience with 3ware RAID controllers & Linux is terrible.

On this other side, Adaptec has been great. We’ve been using them for a while now and see no problems at all, however there is just a small catch, Linux usually has a SCSI subsystem timeout of less than 30 seconds which results in a small difference between the controller timeout (at 35 seconds) versus the Linux timeout (at 30 seconds). This usually brings a server to a halt for a couple of seconds (and minutes in cases) till the server recovers, errors like this are thrown in the console:

aacraid: Host adapter abort request (0,1,3,0)
aacraid: Host adapter abort request (0,1,1,0)
aacraid: Host adapter abort request (0,1,2,0)
aacraid: Host adapter abort request (0,1,1,0)
aacraid: Host adapter abort request (0,1,2,0)
aacraid: Host adapter reset request. SCSI hang ?

The best method that usually works best is to increase the timeout higher than 45 to ensure that the Linux timeout does not occur before the RAID controller timeout, this is done per device / array.

echo '45' > /sys/block/sda/device/timeout
echo '45' > /sys/block/sdb/device/timeout
echo '45' > /sys/block/sdc/device/timeout

This should be done to every device, 45 is a good number however you can use what you’d like as long as it’s over 35. If you’re experiencing issues with loads going sky-high with no apparent reason, this might very well be the reason, to check if this is a possible cause, you can run the following

dmesg | grep aacraid

If you see errors like the ones that I have up there, then I suggest using that small workaround, if even after using the workaround, you’re still facing these problems, here are the suggestions/checklist that Adaptec suggests:

  • Check for any updated firmware for the motherboard, controller, targets and enclosure on the respective manufacturer’s web sites.
  • Check per-device queue depth in SYSFS to make sure it is reasonable.
  • Engage disk drive manufacturer’s technical support department to check through compatibility or drive class issues.
  • Engage enclosure manufacturer’s technical support department to check through compatibility issues.

Anyhow, just like with every Linux issue, your mileage may vary, so if you know of any other fixes or figured out a way how to fix this, feel free to post it as a comment to help others.

No comments: