Wednesday of last week, I came home to find my three new 1TB hard disks waiting for me, destined to upgrade our ReadyNAS NV+.
Being a hot-plug-online-upgradable-all-singing-all-dancing sort of
widget, I followed the recommended upgrade procedure and popped out one
of the current 500GB drives, waited a few seconds, slotted in one of the
new 1TB replacements, waited until it started resynchronizing the
volume, and went down to make dinner.
And spent the next several days picking up the pieces…
One critical bit of background – the NAS had three disks in a single RAID-5
volume. RAID-5 can tolerate one disk failure without data loss, but if
two disks fail (regardless of the number of disks in the volume), kiss
your data goodbye.
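For the curious, that single-failure tolerance comes from parity: each stripe stores the XOR of the data blocks on the other disks, so any one missing block can be rebuilt from the survivors. Here's a toy Python sketch of the idea (illustrative block values only, not the NAS's actual on-disk layout):

```python
# Toy illustration of RAID-5 parity: the parity block is the XOR of the
# data blocks, so any single missing block can be rebuilt from the others.
# Two missing blocks leave nothing to XOR against, and the data is gone.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

disk1 = b"photos, "
disk2 = b"music,  "
parity = xor_blocks([disk1, disk2])   # what the third disk would hold

# One disk fails: reconstruct its contents from the survivors.
rebuilt_disk1 = xor_blocks([disk2, parity])
assert rebuilt_disk1 == disk1

# Two disks fail: only one block survives, and a single block can't
# recover either of the missing ones -- the volume is lost.
```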
When I went back upstairs after dinner to check on progress I
discovered that the NAS had locked up, and completely dropped off the
network. It wouldn’t answer its web management UI, and wasn’t responding
to pings.
Hesitantly, I power-cycled it. It started booting, and hung about a quarter of the way through checking the volume.
After several reboot attempts all locking up at the same place, I
applied a bit of coercion and convinced the box to boot. I checked the
system logs and found nothing telling, removed and re-seated the new 1TB
drive, and watched it start the resync again.
A couple hours later, sync still proceeding, I went to bed.
And woke the next morning to find the unit had again fallen off the network.
Buried in the log messages – which I’d left scrolling past overnight – was a warning that disk 2 was reporting SMART warnings about having to relocate failing sectors.
In other words, one disk of the three was being rebuilt while another one was busy dying.
At this point it became a race – would the rebuild complete (leaving
me with two good disks, and intact data) before the failing one died
completely?
To try to buy some insurance, I shut down the NAS,
transplanted the failing drive into a spare PC, and started a
disk-to-disk copy of its data onto the working 500GB disk I had removed
at the start of this mounting disaster.
Despite valiant attempts by both dd_rescue and myrescue,
the disk was dying faster than data could be retrieved, and after a day
and a half of effort, I had to face the fact that I wasn’t going to be
able to save it.
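For reference, both tools take the same broad approach: copy the disk block by block, and skip over regions that won’t read instead of aborting the way a plain dd would. A rough Python sketch of that idea follows; the device paths and block size are placeholders, and the real tools are far smarter about retries, trimming around bad areas, and logging what they couldn’t get.

```python
# Rough sketch of a rescue copy: read the dying disk in chunks, and when a
# chunk won't read, write zeros in its place and move on instead of giving
# up. Paths below are illustrative; real tools (dd_rescue, myrescue, GNU
# ddrescue) also keep a log of bad regions and retry them later.

BLOCK = 1024 * 1024  # copy 1 MiB at a time

def rescue_copy(src_path, dst_path, size):
    with open(src_path, "rb", buffering=0) as src, \
         open(dst_path, "r+b", buffering=0) as dst:
        offset = 0
        while offset < size:
            length = min(BLOCK, size - offset)
            src.seek(offset)
            dst.seek(offset)
            try:
                dst.write(src.read(length))
            except OSError:
                # Unreadable region: pad with zeros and keep going.
                dst.write(b"\x00" * length)
            offset += length

# rescue_copy("/dev/sdb", "/dev/sdc", 500 * 10**9)  # hypothetical devices
```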
Fortunately, I had set up off-site backups using CrashPlan, so I had Vince bring my backup drive to work, and retrieved it from him on Friday.
Saturday was spent restoring our photos, music, and email (more later) from the backup.
Unfortunately, despite CrashPlan claiming to have been backing up Dawnise’s
inbox, it was nowhere to be found in the backup set, and the
most recent “hand-made” backup I found was almost exactly a year old
(from her PC to Mac conversion). Losing a year of email is better than
losing everything, but that seems like meager consolation under the
circumstances.
By Saturday night I had things mostly back to rights, and had a chance to reflect on what had gone wrong.
The highlights:
1. SMART, as Google discovered (and published),
is a terrible predictor of failure. The drive that failed (and is
being RMA’d under warranty, for all the good it’ll do me) had never
issued a SMART error before catastrophically failing.
2. In retrospect, I should have rebooted the NAS and done a full
volume scan before starting the upgrade. That might have put enough
load on the failing drive to make it show itself before I had made the
critical and irreversible decision to remove a drive from the array.
3. By failing to provide disk scrubbing (a process whereby the system
periodically touches every bit of every hard disk), the ReadyNAS fails
to detect failing drives early. There’s a sketch of the idea after this list.
4. While I had done test restores during my evaluation of CrashPlan, I
had never actually done a test restore to Dawnise’s Mac. Had I done
so, I might have discovered the missing files and been able to avoid
losing data.
I have a support ticket open with the CrashPlan folks, as it seems
there’s a bug of some kind here. At the very least, I would have
expected a warning from CrashPlan that it was unable to back up all the
files in its backup set.
5. In my effort to be frugal, I bought a 500GB external drive to use
as my remote backup destination – the sweet spot in the capacity/cost
curve at the time.
Since I had more than 500GB of data, that meant I had to pick and
choose what data I did and didn’t back up. My choices were OK, but not
perfect. Some data that should have been in the backup set wasn’t,
due to space limitations, and is now lost.
6. CrashPlan worked well – but not flawlessly – and without it, I’d
have been in a world of hurt. Having an off-site backup means that I
didn’t lose my 20GB worth of digital photos, or several hundred GB of
ripped music.
Aside from digital purchases, the bulk of the music would have been
recoverable from the source CDs, but at great time expense. The photos
would have just been lost.
7. In this case, the off-site aspect of CrashPlan wasn’t critical,
but it’s easy to imagine a scenario where it would have been.
8. The belief that RAID improves your chances of retaining data is
built largely on what I’m going to refer to henceforth as “The RAID
fallacy” – the assumption that failure modes of the drives in the array are completely
independent events. The reality is that many (most?) RAID arrays are
populated with near-identical drives: same manufacturer, same capacity
(and model), and often the same or very similar vintage. So the drives
age together under similar workloads, and any inherent defect (like,
say, a firmware bug that causes the drives not to POST reliably) is likely to affect multiple drives, which spells disaster for the volume.
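To put rough numbers on that independence assumption, here’s a back-of-the-envelope Python sketch; the failure rate and rebuild time are invented purely for illustration.

```python
# Back-of-the-envelope look at "the RAID fallacy". The numbers below are
# made up for illustration -- real rates depend on the drives and workload.

annual_failure_rate = 0.03   # assume ~3% chance a given drive dies in a year
rebuild_days = 2             # assume the resync takes roughly two days
surviving_drives = 2         # a 3-disk RAID-5 with one drive already pulled

# If the surviving drives failed independently, the chance of losing one
# of them during the rebuild window would be tiny:
p_one = annual_failure_rate * (rebuild_days / 365)
p_any = 1 - (1 - p_one) ** surviving_drives
print(f"Second failure during rebuild (if independent): {p_any:.4%}")

# The fallacy: same-model, same-vintage drives under the same workload
# don't fail independently, so the real risk during a rebuild is much
# higher than this number suggests.
```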
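And as for the scrubbing mentioned in point 3: the idea is simply to force a read of every sector on a schedule, so that latent bad sectors surface while the array is still healthy rather than in the middle of a rebuild. A bare-bones sketch is below; the device path is a placeholder, and a real scrub belongs in the RAID layer (e.g. writing `check` to a Linux md array’s `sync_action`), not a userland loop like this.

```python
# Bare-bones idea of a disk scrub: periodically read every block on the
# disk so that sectors which have silently gone bad get noticed (and
# reported or remapped) while the rest of the array is still healthy.
import os

BLOCK = 4 * 1024 * 1024  # read 4 MiB at a time

def scrub(device_path):
    """Read every block of the device, recording offsets that fail to read."""
    bad_regions = []
    fd = os.open(device_path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            os.lseek(fd, offset, os.SEEK_SET)
            try:
                os.read(fd, min(BLOCK, size - offset))
            except OSError:
                bad_regions.append(offset)
            offset += BLOCK
    finally:
        os.close(fd)
    return bad_regions

# scrub("/dev/sdb")  # illustrative path; needs root to read the raw device
```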