Tuesday, October 16, 2012

Hot-replacing a failing disk that is a part of Linux Software RAID and ZFS pools

Disks break: not "if", "when".

Yes, that's what they do. I run a 4-disk setup that holds one Linux Software RAID6 array and two ZFS RAIDZ2 pools.

Clouds in the sky

A few days ago, one of the disks started to fail, which showed up in syslog entries like these:

[1318523.293294] ata2.00: failed command: READ FPDMA QUEUED
[1318523.304015] ata2.00: cmd 60/01:00:8f:da:14/00:00:4d:00:00/40 tag 0 ncq 512 in
[1318523.304021]          res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[1318523.346321] ata2.00: status: { DRDY ERR }
[1318523.356810] ata2.00: error: { UNC }
[1318523.367279] ata2.00: failed command: READ FPDMA QUEUED
[1318523.377664] ata2.00: cmd 60/3f:08:60:ad:14/00:00:4d:00:00/40 tag 1 ncq 32256 in
[1318523.377670]          res 41/40:00:98:ad:14/00:00:4d:00:00/40 Emask 0x409 (media error)
[1318523.419883] ata2.00: status: { DRDY ERR }
[1318523.430424] ata2.00: error: { UNC }
[1318523.440904] ata2.00: failed command: READ FPDMA QUEUED
[1318523.451164] ata2.00: cmd 60/01:10:95:29:00/00:00:4e:00:00/40 tag 2 ncq 512 in
[1318523.451169]          res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[1318523.492656] ata2.00: status: { DRDY ERR }
[1318523.503246] ata2.00: error: { UNC }

As I did not have a spare disk on hand (tsk, tsk, tsk, yes, I know...), I immediately ordered one, even before sending the old disk in for RMA. Initially, when I ran a zpool scrub on the pools, these kernel messages were the only symptom; ZFS itself did not report any trouble.
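
To keep an eye on how fast such errors pile up, the kernel log can be summarized per ATA device. The following is only a sketch: the function name count_media_errors is made up for this post, and it assumes the log lines look like the excerpt above (a "failed command" line followed by a "res" line ending in "(media error)").

```shell
# count_media_errors: read kernel log text on stdin and print, per ATA
# device, how many "(media error)" results were logged, e.g. "ata2.00 3".
count_media_errors() {
    awk '
        # Remember which device the most recent "failed command" line named.
        /ata[0-9]+\.[0-9]+: failed command/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/) { dev = $i; sub(/:$/, "", dev) }
        }
        # The "res ..." line that follows carries the error class.
        /\(media error\)/ && dev != "" { count[dev]++ }
        END { for (d in count) print d, count[d] }
    '
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | count_media_errors
```

A count that keeps growing between scrubs is a good hint that the disk is on its way out.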

Thunderstorms in the sky

As of yesterday, errors started making it to the zpool layer:

$ sudo zpool status
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
 scan: scrub repaired 356K in 3h25m with 0 errors on Sun Oct  7 15:16:16 2012
config:

NAME                                                 STATE     READ WRITE CKSUM
data                                                 ONLINE       0     0     0
 raidz2-0                                           ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0  422K
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0

errors: No known data errors

  pool: ttank
 state: ONLINE
 scan: scrub repaired 0 in 0h56m with 0 errors on Fri Oct  5 11:53:31 2012
config:

NAME                                                 STATE     READ WRITE CKSUM
ttank                                                ONLINE       0     0     0
 raidz2-0                                           ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0

By now, the drive not only had read errors; it had even started to return faulty data (while claiming that said data was ok). ZFS is built from the ground up to never trust hardware, so its checksumming mechanism caught the faulty data. Clearly, it was now time to replace that disk. Fortunately, the spare drive had just come in by mail.
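
Spotting the one device with a non-zero counter in a long status listing is easy to automate. This is a rough sketch, not an official tool: zpool_error_devices is a name I made up, and it assumes the column layout shown above (NAME, STATE, READ, WRITE, CKSUM).

```shell
# zpool_error_devices: read "zpool status" output on stdin and print every
# config row whose READ, WRITE or CKSUM counter is not zero.
zpool_error_devices() {
    awk '
        # Counters look like "0", "356" or "422K".
        function iscount(s) { return s ~ /^[0-9]+(\.[0-9]+)?[KMGTPE]?$/ }
        NF >= 5 && iscount($3) && iscount($4) && iscount($5) &&
        ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Typical use:
#   zpool status | zpool_error_devices
```

An empty result means all counters are zero; it says nothing about errors the pool has not noticed yet, so a scrub is still the real test.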

Taking the old disk offline

I run my disks in an IcyBox Hotplug backplane, so I wish to replace the disk without even so much as rebooting the server. One first needs to know which disk this is, of course. Since I use the disk-ID links, just looking at the symlinks in /dev/disk/by-id tells me that the disk in question is /dev/sdb.
To be safe, I read a gigabyte of data off the disk, to physically inspect which drive light switches on as I do so:

# dd if=/dev/sdb of=/dev/null bs=1048576 count=1024

Visual inspection tells me that this is the top drive in the IcyBox. Good.
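
The lookup from a kernel name back to its by-id links can also be scripted instead of eyeballing the symlinks. A minimal sketch, assuming GNU readlink with -f is available; the function name by_id_links_for and its optional directory argument are my own invention:

```shell
# by_id_links_for DEVICE [DIR]: print the names of all symlinks in DIR
# (default /dev/disk/by-id) that resolve to the same node as DEVICE.
by_id_links_for() {
    dev=$1
    dir=${2:-/dev/disk/by-id}
    target=$(readlink -f "$dev") || return 1
    for link in "$dir"/*; do
        # Skip anything that is not a symlink.
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            basename "$link"
        fi
    done
}

# Typical use:
#   by_id_links_for /dev/sdb
```

If the printed names carry the model and serial of the disk that ZFS reports as faulty, you are looking at the right device.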

As for ZFS, there is nothing special that one needs to do. For Linux Software RAID, one needs to tell the system to fail the disk, and subsequently remove it from the array:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdb2[0] sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

Fail the disk:

# mdadm /dev/md0 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md0

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdb2[0](F) sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

Remove the disk:

# mdadm /dev/md0 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md0

# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

At this point, one could yank the disk out, but it's better to tell Linux that you are going to do so. Switching off the disk and detaching it from the system is done as follows:

# echo 1 > /sys/block/sdb/device/delete

The syslog will tell you that the device indeed went offline:

[1734127.293861] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[1734127.331629] sd 1:0:0:0: [sdb] Stopping disk
[1734127.768141] ata2.00: disabled

At this point, the tray can be taken from the Hotplug backplane, and the old disk can be replaced by the new one.
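
A small safety check that fits the steps above: before writing to the delete node, confirm that no md array still lists a partition of the disk. This is a sketch under assumptions: md_uses_disk is a hypothetical helper, and it only understands sdX-style names where the partition number directly follows the disk name.

```shell
# md_uses_disk DISK [MDSTAT_FILE]: exit 0 if any md array in the mdstat text
# still lists a partition of DISK (so "sdb" matches "sdb2[0]" and
# "sdb2[0](F)"), exit 1 otherwise.
md_uses_disk() {
    disk=$1
    mdstat=${2:-/proc/mdstat}
    awk -v disk="$disk" '
        # Array lines look like: md0 : active raid6 sde2[5] sdb2[0] ...
        $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ ("^" disk "[0-9]*\\[")) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$mdstat"
}

# Typical use (as root), refusing to detach a disk that md still uses:
#   md_uses_disk sdb || echo 1 > /sys/block/sdb/device/delete
```

A member that is failed but not yet removed still counts as "in use" here, which is the cautious choice.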

Bringing the new disk online

I took out the tray, swapped the old disk for the new one, and slid the tray back in. The kernel detects the new disk, which now shows up as sdf rather than sdb:


[1743181.511929] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[1743181.512460] ata2: irq_stat 0x00000040, connection status changed
[1743181.512883] ata2: SError: { CommWake DevExch }
[1743181.513215] ata2: hard resetting link
[1743187.276049] ata2: link is slow to respond, please be patient (ready=0)
[1743190.860073] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1743190.998197] ata2.00: ATA-9: WDC WD20EFRX-[serial], max UDMA/133
[1743190.998206] ata2.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[1743190.998836] ata2.00: configured for UDMA/133
[1743190.998855] ata2: EH complete
[1743190.999097] scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EFRX-[serial]
[1743190.999679] sd 1:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[1743190.999691] sd 1:0:0:0: [sdf] 4096-byte physical blocks
[1743190.999705] sd 1:0:0:0: Attached scsi generic sg1 type 0
[1743191.000185] sd 1:0:0:0: [sdf] Write Protect is off
[1743191.000197] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[1743191.000328] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1743191.014153]  sdf: unknown partition table
[1743191.014902] sd 1:0:0:0: [sdf] Attached SCSI disk
[1743415.817135]  sdf: unknown partition table

Obviously, there are no partitions on the disk yet. In order to create them, I simply copy them off one of the other drives:

# sfdisk -d /dev/sdc | sfdisk /dev/sdf

This is readily picked up by the kernel:

[1743415.817135]  sdf: unknown partition table
[1743416.227972]  sdf: sdf1 sdf2 sdf3
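
Before putting data on the new partitions, it does no harm to verify that both disks really have the same layout. The sketch below assumes the dump format printed by sfdisk -d, where each partition line reads "/dev/sdc1 : start= ..., size= ..."; the helper name partition_layout is hypothetical.

```shell
# partition_layout: read an "sfdisk -d" dump on stdin and print one line per
# partition as "NUMBER START SIZE", dropping the device name so that dumps
# of two different disks can be compared directly.
partition_layout() {
    sed -n 's|^/dev/[^ ]*[^0-9]\([0-9][0-9]*\) *: *start= *\([0-9][0-9]*\), *size= *\([0-9][0-9]*\).*|\1 \2 \3|p'
}

# Typical use (as root):
#   [ "$(sfdisk -d /dev/sdc | partition_layout)" = \
#     "$(sfdisk -d /dev/sdf | partition_layout)" ] && echo "layouts match"
```

This compares start sectors and sizes only; partition types and labels are deliberately ignored.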

Resilvering the arrays

The first array I decide to resilver is the most important one, the primary data pool:

# zpool replace data /dev/disk/by-id/ata-WDC_WD2002FYPS-[serial]-part1 /dev/disk/by-id/ata-WDC_WD20EFRX-[serial]-part1

This is going to take a long time: more when this is done.
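
While waiting, the progress can be polled rather than checked by hand. This is a sketch that assumes the scan line of zpool status reads "resilver in progress" while the rebuild is running, which may differ between ZFS versions; resilver_in_progress is a name made up for this post.

```shell
# resilver_in_progress: read "zpool status" output on stdin and exit 0 while
# it reports a running resilver, exit 1 otherwise.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Hypothetical wait loop for the pool named "data" in this post:
#   while zpool status data | resilver_in_progress; do sleep 300; done
#   zpool status data
```

Once the loop ends, the final zpool status shows whether the resilver finished without errors.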