Total Failure

Date: 16th Dec
Server: SANMAN
State: All disks unresponsive.

SANMAN is my point-to-point Fibre Channel SAN storage server, hosting 20TB of disks for the server Omashu, which in turn exposes that storage over a Samba share. On Saturday evening, the 16th of December, I was setting up my new Linux desktop to talk to the disks in SANMAN. It was at this point I realized something was wrong.

No shares were responsive. Omashu had dropped the Fibre Channel link and was attempting a link reset. The system could have been in this state for nearly a month; I got my new computer in November and had neglected to service my servers since. It was time to grab the cupboard screen and head for the hallway. I discovered there was no VGA output when I plugged in the screen. SANMAN had completely locked up.
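For reference, the link state can be read straight out of sysfs on the Fibre Channel host; a minimal sketch, with the host number being just an example:

    # Check the FC port state on the initiator (host3 is an example)
    cat /sys/class/fc_host/host3/port_state
    cat /sys/class/fc_host/host3/speed
    # Issue a LIP (link reset) if the port looks wedged
    echo 1 | sudo tee /sys/class/fc_host/host3/issue_lip

In this case no amount of poking from the initiator side was going to help, since the target box itself had locked up.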

Rebooting showed more failures: the live backup drop disk was degraded.

Upon rebooting, things started to show errors:
1) The BIOS was taking longer to boot, so it was failing because all of the USB drives ended up attached during POST.
2) One of the RAID arrays was degraded.

The BIOS refusing to boot with too many disks attached was nothing new. What was new was that the automatic USB power relay timing was getting scrambled. The relays are designed to switch power on to the two USB hubs only after the BIOS has POSTed; attaching the hubs too soon causes the BIOS to panic as it runs out of resources trying to initialize 24 USB disks.

The reason for the slow boot showed up in the additional BIOS POST codes coming out of the debug port. These codes pointed at the system halting in the 3ware RAID card's BIOS, which had stopped because one of its arrays had become degraded.

The disks reported no SMART errors and all filesystem checks came back fine. After marking the array to rebuild, I “hand started” the server by manually attaching the USB hubs at the GRUB bootloader. Everything came back up like nothing had ever gone wrong… The next thing I attempted was some updates. I had not run updates on the server in close to a year, since it sits off the network minding its own business.
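For reference, the checks and the rebuild can all be driven from the OS side on a 3ware card; a rough sketch, where the controller, unit and port numbers are assumptions for illustration:

    # Check SMART on a drive behind the 3ware card (port 5, /dev/twa0 are examples)
    sudo smartctl -a -d 3ware,5 /dev/twa0
    # Show controller and unit status
    sudo tw_cli /c0 show
    sudo tw_cli /c0/u0 show
    # Kick the degraded unit into a rebuild (verify the syntax against your firmware's CLI guide)
    sudo tw_cli /c0/u0 start rebuild disk=5

In my case the array was marked to rebuild from the card's BIOS before the OS ever came up.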

interestING failING

Running “sudo apt-get upgrade” had some interest”ING” results… “Unpacking” turned into just “ing ing ing ing”. This could only mean one thing…

THE USB-RAID HAS FAILED

Removing the USB RAID and attaching it to my computer showed fsck.f2fs had a few errors.

After removing the drives and running a filesystem check, the disks seemed to repair okay, albeit with lots of missing files. Who knows what was lost, because the filesystem had been failing to save anything. The main thing was that it was readable, it booted, and it started the Fibre Channel driver and target application.
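The check itself was nothing exotic; a minimal sketch, assuming the OS array shows up on the desktop as /dev/sdx1 (the device name is made up):

    # Quick pass to see how bad things are
    sudo fsck.f2fs /dev/sdx1
    # Force a full consistency check and repair
    sudo fsck.f2fs -f /dev/sdx1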

After some short tests I ran the updates again, this time without error, and left it be.

~~ONE MONTH LATER~~

Me being me, I wanted to see how long this thing would go before completely dying. Deep down I was hoping it would offer some clues as to what had failed. This time I was luckier. The filesystem failed from a USB disk reset. The sudden removal of the media caused the filesystem to fault semi-gracefully and drop into read-only.

One of the USB disks triggered a reset.

Because the disk ‘sds’ came back as ‘sdt’, the filesystem was able to fall back to read-only. It ran like this from the 18th of December until the 16th of January without me noticing; I only discovered it while trying to understand why the RAID card was acting strange. My best guess at this point is that the card's odd behaviour was caused by the read-only filesystem: the results of the self-checks the card runs could not be saved to disk, which likely caused the card's BBU to get marked as failed. Having the BBU disabled more than halves the throughput of the card, as all caching is turned off. This might also have been why that 1TB array got marked as degraded the first time SANMAN locked up.
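For what it's worth, a few quick checks would have shown this state much earlier; a sketch, assuming the 3ware CLI is installed (controller number and device names are illustrative):

    # Look for the USB reset and the device renaming (sds -> sdt)
    dmesg | grep -iE 'usb .*reset|sds|sdt'
    # Confirm whether the root filesystem has dropped to read-only
    findmnt -o TARGET,FSTYPE,OPTIONS /
    # Check the 3ware BBU status
    sudo tw_cli /c0/bbu show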

I connected to SANMAN, tailing the live logs over SSH, in the hope I would catch more interesting faults. Today, on the 21st of January at 6:25am, the server ungracefully rebooted without logging anything. Stuck in a boot loop, I powered off SANMAN.

In hindsight I probably should have been attached to the serial console rather than tailing the on-disk logs; a filesystem that has failed to read-only does not save its logs. I removed the USB RAID array from the server and attempted to read the drives.
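For next time, a minimal sketch of pushing the kernel console out the serial port so messages survive a read-only root; this assumes ttyS0 at 115200 baud and a GRUB-based boot:

    # /etc/default/grub (then run update-grub and reboot)
    GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
    GRUB_TERMINAL="console serial"
    GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"

    # Watch the console from another machine over a USB serial adapter
    sudo screen /dev/ttyUSB0 115200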

Checking the disks with fsck showed the filesystem was badly damaged, more so than last time. I’m not sure where to go from here.

0xfd36 errors, that’s quite a few more than 0xc7.

I guess it’s time to restore backups.

I may do some more testing to try and find what’s happening. I suspect one of the USB drives is silently discarding bad blocks, causing silent bit rot (a quick way to check is sketched below). Either way, I have two problems:
1) The RAID card failing its tests and disabling the cache.
2) The OS filesystem becoming corrupted.
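A rough sketch of how I might test the bit-rot theory on one of the USB drives, assuming it shows up as /dev/sdx (device name is made up; read-only scan, nothing destructive):

    # Look for reallocated / pending sectors that would hint at silent remapping
    sudo smartctl -a /dev/sdx | grep -Ei 'reallocated|pending|uncorrect'
    # Non-destructive read scan for unreadable blocks
    sudo badblocks -sv /dev/sdx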

