Total Failure

Date: 16th Dec
Server: SANMAN
State: All disks unresponsive.

SANMAN is my P2P FC SAN storage server hosting 20TB of disks for the server Omashu. On the evening of Saturday the 16th of December I was attempting to configure my new Linux desktop to talk to the disks in SANMAN. Omashu uses a Samba share to host the disks stored in SANMAN. It was at this point I realized something was wrong.

No shares were responsive. Omashu had dropped the Fibre Channel link by attempting a link reset. The system could have been in this state for nearly a month: I had got my new computer in November and had neglected to service my servers. It was time to grab the cupboard screen and head for the hallway. When I plugged in the screen I discovered there was no VGA output. SANMAN had completely locked up.

Rebooting showed more failures. Live backup drop disk was degraded.

Upon rebooting, things started to show errors.
1) The BIOS was taking longer to boot, so it was failing with all of the USB drives attached.
2) One of the RAID arrays was degraded.

The BIOS refusing to boot due to too many disks was nothing new. The new thing was that the automatic USB power relay was getting scrambled. The relays are designed to switch on power to the two USB hubs after the BIOS has POSTed. Attaching the hubs too soon causes the BIOS to panic from running out of resources trying to initialize 24 USB disks.

The cause was the additional BIOS POST codes coming out of the debug port. These codes came from the system halting in the BIOS of the 3ware RAID card, which had stopped because one array had become degraded.

The disks reported no SMART errors and all filesystem checks came back fine. After marking the array to rebuild I “hand started” the server by manually attaching the USB hubs at the GRUB bootloader. Everything came back up like nothing had ever gone wrong… The next thing I attempted was some updates. I had not run updates on the server in close to a year, since it sits off the network minding its own business.

interestING failING

Running “sudo apt-get upgrade” had some interest”ING” results… “unpacking” turned into just “ing ing ing ing”. This could only mean one thing…

THE USB-RAID HAS FAILED

Removing the USB RAID and attaching it to my computer showed fsck.f2fs reported a few errors.

After removing the drives and running a file system check, the disks seemed to repair okay, with lots of missing files. Who knows what went missing, because it failed to save anything. The main thing was it was readable, it booted, and it started the Fibre Channel driver and target application.

After some short tests I ran the updates a second time without error and left it be.

~~ONE MONTH LATER~~

Me being me, I wanted to see how long this thing would go before completely dying. Deep down I was hoping it would offer some clues as to what failed. This time I was luckier. The filesystem failed from a USB disk reset. This sudden removal of media caused the file system to fault semi-gracefully and fail into read-only.

One of the USB disks triggered a reset.

As the disk ‘sds’ came back again as ‘sdt’, the file system was able to fail over to read-only. It ran read-only from the 18th of December until the 16th of January without me noticing. I discovered the issue when I was trying to understand why the RAID card was acting strange. My best guess at this time as to why the RAID card was acting strange is that the file system was read-only. I think the results of the self checks the card runs could not be saved to disk. This likely caused the RAID card’s BBU to get marked as failed. Having the BBU disabled more than halves the throughput of the card, as all caching is disabled. This might also have been why that 1TB array got marked as degraded the first time SANMAN locked up.
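A month of running read-only without noticing is the sort of thing a tiny check could have caught. Below is a minimal sketch, not what SANMAN actually runs: it parses text in the format of Linux’s /proc/mounts and lists anything mounted with the `ro` option. The function name, the sample text and the device names in it are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "ro" in options.split(","):
            hits.append((device, mount_point))
    return hits


# Hypothetical sample: these device names are assumptions, not SANMAN's.
SAMPLE = """\
/dev/md0 / f2fs ro,relatime 0 0
/dev/sda1 /mnt/backup ext4 rw,relatime 0 0
"""

for device, mount_point in read_only_mounts(SAMPLE):
    print(f"READ-ONLY: {device} on {mount_point}")
```

On a real box the text would come from reading /proc/mounts, and the result would have to be sent somewhere off the machine, since a read-only root cannot log its own problem to disk.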

I connected to SANMAN, tailing the live logs over SSH, in the hope I would catch more interesting faults. Today, the 21st of January, at 6:25am the server ungracefully rebooted without logging anything. With it stuck in a boot loop, I powered off SANMAN.

In hindsight I probably should have been attached to the serial console rather than tailing the disk logs. A disk which has failed to read-only does not save its logs. I removed the USB RAID array from the server and attempted to read the drives.

Checking the disks with fsck showed the file system was badly damaged, more so than last time. I’m not sure where to go from here.

0xfd36, that’s quite a few more than 0xc7

I guess it’s time to restore backups.

I may do some more testing to try and find what’s happening. I suspect one of the USB drives is silently discarding bad blocks, causing silent bit rot. I have two problems either way:
1) The RAID card failing tests and disabling the cache.
2) The OS file system becoming corrupted.


Portable Routing – For those who can’t get enough

RouterOS+mAPlite

One thing which has always been on my mind when I go out and about is this: when I initially connect to a network, exactly how much data is leaked during the setup time of the VPN on Windows?

At the time I was using a script to dial an SSTP VPN connection to my home Windows server from my netbook. This script would trigger each time I connected to my university’s WiFi.

There were two problems with this. The first problem was the WiFi was open to all student devices across the whole campus. The WiFi was using a 10.32.0.0/16 subnet. The network easily exceeded 500 devices when I checked it with an IP scan of the subnet. The next issue was the WiFi was not secured. Although there was a secured staff WiFi and special education logins, these were not open to every device and every student.
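To put that subnet in perspective, here is a quick sketch using Python’s standard `ipaddress` module to work out how many hosts a /16 can hold. The example address checked at the end is made up.

```python
import ipaddress

# The campus WiFi subnet mentioned above.
campus = ipaddress.ip_network("10.32.0.0/16")

# Subtract the network and broadcast addresses to get usable hosts.
usable_hosts = campus.num_addresses - 2
print(f"{campus} has room for {usable_hosts} hosts")

# Hypothetical address, just to show a membership check.
neighbour = ipaddress.ip_address("10.32.200.17")
print(neighbour in campus)
```

So the 500 or so devices I saw were a small slice of what could share that one flat network with my netbook.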

My current solution of monitoring Windows events for an “on WiFi connect” left me exposed in at least three different situations. The first and most annoying one was Windows not correctly logging events. Sometimes, because of how Windows handles high latency on a WiFi connection, the script would run early and fail to connect on its first try. Windows would then attempt to connect again 30 seconds later.

The second situation is Windows split tunnel VPN routing. Due to how strict VPNs are implemented in Windows, there can be situations where Windows, for whatever reason, passes packets down the wrong pipe. Leaking information due to improper forwarding is more common than you would think.

Windows Firewall being subpar was the third situation I had to deal with. Due to how large and open the wireless network was, I was heavily dependent on Windows Firewall.

My solution was presented to me in the form of a portable USB-powered WiFi router made by MikroTik. The mAP lite is a dual-chain 802.11 b/g/n WiFi AP/router running the Linux-based RouterOS. It can be powered either off 750mA USB (two sockets’ worth of non-negotiable power) or Power over Ethernet. This is the smallest device they make which runs their full routing operating system. One of RouterOS’s many features is the highly customisable firewall and NAT table. This allowed me to block all of my laptop’s traffic before it reached the WiFi network. Another feature of RouterOS is the support of SSTP VPNs. Although not perfect, it functions as expected with minor differences.

Solo mAP lite in action

In my application I was powering the mAP lite from two of my three USB ports. My laptop’s ethernet would connect to the mAP lite and get an IP address. The mAP lite would then operate in station mode to connect to the campus WiFi network, only forwarding my data over the WiFi as if it was connected to the VPN running on my home server.

The mAP lite powered from the mAP PoE port providing local network through a VPN overseas

There is only one downside to the mAP lite: the antennas are smaller than your average laptop’s antenna, so occasionally the AP has to be moved off the table to hang, or be stood up on the table, to get the best signal strength.

In my next post I’ll show how you can pair the mAP lite to a mAP to get a WiFi hotspot with 3G support. This is now my favourite setup.

The Great USB RAID

This never was supposed to be serious but when you are having fun, serious and joking around can mix and match.

This is the summary write-up and details of my USB RAID project to date.

———-
What started out as a “fairly” simple storage server with 8TB of storage turned into the altogether ridiculous USB-RAID hosting a 20TB array via Fibre Channel.

Why did I start out?

I started out making a SAN storage server back in 2014 because I wanted a storage solution for my old slow server. I originally wanted to save files for distribution over my university’s residential local area network.

My first attempt was to try using a PCI-X card with four 1TB WD Blues and two second-hand disks. I had a limitation of 2TB logical volumes on the card, so I made three and used Windows software RAID to stitch them together.

How this all worked was less than satisfactory, as read and write speeds were 5MB/s. I thought I needed something better. I had picked up some 4Gbit Fibre Channel cards second hand and wanted to see if I could beat 1Gbit. I quickly managed to source another desktop PC to test my cards, which seemed to function after boot, but nothing was configurable without a target config. After learning that OpenFiler supported Fibre Channel target configs, I started out playing with a configuration which shared a 200GB SATA disk.

SANMAN

OpenFiler functioned okay. More or less it did the job, but configuration was patchy and not all settings were always applied.

I was happy enough so I started buying 4TB hard drives. This is where the USB idea popped up.

I wanted **_all_** my SATA ports. The boot OS was so simple it fitted on less than 8GB. The idea was that if I could boot a live Ubuntu distro off my flash drive, I could boot OpenFiler from one too. I waited about a month for a 64GB USB3 drive to ship from the US, and by then I had started filling drives and knew I would need it very soon.

All went without a hitch: I transferred my OpenFiler OS to the flash drive and it booted no problem. I was unhappy that the drive had to hang out the back of the case, and I had one situation where I came very close to breaking the USB drive off the back while I temporarily had it set up in my sister’s flat.

To mount it inside the case, I went about cutting an old bracket and screwing it inside the case as shown above.

This went without problems running 24/7 for about 3 months until I finally ran out of disk space AND SATA ports.

I went off and ordered a RAID card and a PCIe riser card.

Already regretting my choice of motherboard, I installed both cards, and after several reboots and reconfigurations I realised there was a problem, and a very big one.

Backup backup backup backup! OMG JUST BACK IT UP!

My hours and hours of work were lost in a simple reboot. The flash drive’s partition table was gone, and several hours of trying to recover it were wasted.
No important data was lost from my storage, but for about a month I couldn’t access it due to OpenFiler having a poor implementation of Fibre Channel.

My flatmate, being in IT, was able to mount the file system locally until I could repair the OS to a usable state.
OpenFiler had been abandoned not only by me at this point but by their developers 😂. I made the decision to change to stock Debian and run SCST, the same tool OpenFiler had used, only this time I had all the configuration in my hands.
My choice of OS drive was a single crappy 8GB USB drive given to me by uni. This proved to be far too slow: simple IO response queries processed by the OS tended to time out the storage.
At this point I had spent another few months pissing around trying to get the silly thing to work.

I joked to my flatmate “why don’t I just put the stupid thing onto USB RAID” and thus, out of pure silliness and having plenty of drives…

The first USB RAID was born.
I had shared the logical disk over to my desktop to run tests, and my experience using one USB bus was not ideal.

It was clear I was maxing out the bus of the USB2 root hub. Performance was far better than a single disk, but I knew I could do better.
By then I had migrated the OS to the USB RAID and my storage was functioning without issues once again, after about a full year of maintenance.

I added an extra hub to a different bus, specifically the one which served the USB3 ports. With this I was able to achieve, over two buses, sequential read/write speeds of 63.5 MB/s and 27.3 MB/s: http://pastebin.com/ni4ksCXt
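For a rough sense of scale, here is a back-of-envelope sketch comparing those numbers to the raw USB 2.0 signalling rate of 480 Mbit/s per bus. It assumes both hubs were running at USB 2.0 high speed, which I have not verified for the hub on the USB3 bus, and it ignores protocol, RAID and Fibre Channel overhead. The function name is made up.

```python
# Back-of-envelope check, not a benchmark.
USB2_HIGH_SPEED_MBIT = 480  # USB 2.0 high-speed signalling rate per bus


def raw_ceiling_mb_s(buses, mbit_per_bus=USB2_HIGH_SPEED_MBIT):
    """Raw signalling ceiling in MB/s, before any protocol overhead."""
    return buses * mbit_per_bus / 8


measured_read_mb_s = 63.5   # sequential read quoted above
measured_write_mb_s = 27.3  # sequential write quoted above

one_bus = raw_ceiling_mb_s(1)
two_buses = raw_ceiling_mb_s(2)

print(f"one bus ceiling: {one_bus:.0f} MB/s")
print(f"two bus ceiling: {two_buses:.0f} MB/s")
print(f"read is {measured_read_mb_s / two_buses:.0%} of the two-bus ceiling")
```

The interesting part is that 63.5 MB/s is more than a single USB 2.0 bus can carry even on paper, so the second bus was clearly pulling its weight.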

I was happy at this point so I started working out how to put the thing inside the case.

I was frustrated with the case of the white USB hub and I thought, screw it, I’m modding this. Out came the trusty soldering iron and tools, and I started rotating each port

Things were getting bigger and bigger

and now, with 11 drives and 22GB raw, I was in hysterics whenever I was working on it. I knew at this point I had created something completely crazy. Each time there was an issue I just made something completely over the top to work around it.

Too much?

Night mode enabled

Interestingly enough, most systems don’t POST with more than 10 USB drives attached, so after powering it up for the first time to take the iconic photos I realised I had to manually plug in the hubs after the BIOS had POSTed.

Over-engineering craziness to the rescue! I’ll just use some relays to power it on once the BIOS has finished POSTing.
How this works is the microcontroller on this custom board listens to the motherboard debug header, which emits BIOS POST codes. When the BIOS has finished, the micro switches the power through the relays, powering on the USB-RAID just in time for the GRUB bootloader to chain-load across to it.
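The logic on the micro boils down to a latch. The sketch below is a Python model of that idea rather than the real firmware, which isn’t shown in this post: the class name and the `0xAA` “finished” code are made up, and the actual code the board waits for depends on the motherboard’s BIOS.

```python
class UsbPowerSequencer:
    """Model of a relay latch driven by BIOS POST codes.

    The relays stay off until the chosen hand-off code is seen,
    then stay on for the rest of the power cycle.
    """

    def __init__(self, handoff_code):
        self.handoff_code = handoff_code  # assumption: one code marks "POST done"
        self.relays_on = False

    def feed(self, post_code):
        """Process one POST code and return the relay state afterwards."""
        if post_code == self.handoff_code:
            self.relays_on = True  # latch: never switches back off here
        return self.relays_on


# Hypothetical code stream; 0xAA as the hand-off code is an assumption.
sequencer = UsbPowerSequencer(handoff_code=0xAA)
for code in [0x01, 0x2B, 0xAA, 0x00]:
    state = "ON" if sequencer.feed(code) else "off"
    print(f"POST 0x{code:02X} -> relays {state}")
```

A latch this simple trusts the code stream completely, so any extra codes on the debug port (from an option ROM, say) that happen to match the hand-off value would switch the hubs on too early.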

My last tweak to the USB RAID was switching to F2FS, which allowed for better file management and apparently less wear and tear on flash-based drives.

My system has been running since June 2016 without any errors.

My current project is to make a USB-RAID card which piggybacks onto a StarTech 4-port PCIe USB3 card with 4 dedicated USB ports (each with their own bus). I have purchased USB hub controller chips online to give me 16 ports total. The only thing left to do is the PCB, which has been slow due to current funds and work hours.

I will post more updates as my card develops.

love the blinkies (potato shutter only shows the drives with slow flashing activity LEDs)

Original USB RAID Reddit Post