Tuesday, September 2, 2014

New server hardware: SuperMicro X10SAE and Xeon E3 1265Lv3

As my previous server mainboard died, I decided to upgrade to a SuperMicro X10SAE and a Xeon E3 1265Lv3.

Just a quick post for those interested in running this combination under Linux.

$ lspci -tv

-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller
           +-01.0-[01]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
           +-02.0  Intel Corporation Xeon E3-1200 v3 Processor Integrated Graphics Controller
           +-03.0  Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller
           +-14.0  Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI
           +-16.0  Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1
           +-16.3  Intel Corporation 8 Series/C220 Series Chipset Family KT Controller
           +-19.0  Intel Corporation Ethernet Connection I217-LM
           +-1a.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2
           +-1b.0  Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller
           +-1c.0-[02]--
           +-1c.3-[03]----00.0  Intel Corporation I210 Gigabit Network Connection
           +-1c.5-[04-05]----00.0-[05]----03.0  Texas Instruments TSB43AB22A IEEE-1394a-2000 Controller (PHY/Link) [iOHCI-Lynx]
           +-1c.6-[06]----00.0  Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
           +-1c.7-[07]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
           +-1d.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1
           +-1f.0  Intel Corporation C226 Series Chipset Family Server Advanced SKU LPC Controller
           +-1f.2  Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
           +-1f.3  Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller
           \-1f.6  Intel Corporation 8 Series Chipset Family Thermal Management Controller

Note that the SAS2008 is a plug-in PCIe x8 SAS HBA, so that device will not show up in lspci on a vanilla mainboard.

The two network interfaces work out of the box on a Linux 3.13 kernel: the I217-LM uses the e1000e kernel module, and the I210 uses igb.

The first CPU (in /proc/cpuinfo) looks like:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz
stepping : 3
microcode : 0x17
cpu MHz : 2500.056
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ida arat epb xsaveopt pln pts dtherm fsgsbase bmi1 hle avx2 bmi2 erms rtm
bogomips : 5000.11
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Note that the OS runs as a Xen Dom0, hence the hypervisor flag.


Thursday, May 15, 2014

Serious and pressure-aware memory overcommitment in Xen: Transcendent Memory

Overcommitting VM memory for fun and profit

What is the issue?

If, like me, you run virtual machines on Xen, you are probably aware of the fact that you can overcommit memory for your guests; if you have, say, 8 GB of free memory in addition to what your Domain-0 uses, it is perfectly fine to run 6 VMs with 2 GB of RAM each, as long as those VMs are Xen aware (and practically all modern Linux kernels for all distributions are).

What happens is that Xen uses ballooning: each VM has a "balloon" device that does nothing, but logically uses RAM (from the VM's point of view), while that RAM is not actually mapped to any physical RAM on the host. So if, in the above example, you have already started 4 VMs (thus filling up your physical RAM) and you start a 5th, Xen will first inflate the balloons in the 4 running guests. To the guest VMs, this looks like the balloon requiring more and more memory, so they will have less memory available to the system. The VMs may also need to evict read-cache pages to accommodate this. At some point, there is enough RAM to start the 5th VM, and once it has started, Xen will inflate its balloon too, until things reach an equilibrium where all VMs use the same amount of memory. Start the 6th VM, and the same thing happens all over again.

Now, this is all very nice, but it comes with one major shortcoming: the mechanism does not respond to memory pressure within a given VM! This means that if you crammed six 2 GB VMs into 8 GB of physical RAM, each VM will have 1.33 GB of available RAM. A VM that temporarily needs more will run out of memory (up to the point where its kernel starts killing processes to free up RAM), even if none of the other VMs currently need their full 1.33 GB!

It doesn't have to be this way. Enter Transcendent Memory.

Transcendent Memory? What is this, Zen class?

No, it's Xen class: Transcendent Memory (tmem) is memory that exists outside the VM kernel, and over which the VM kernel has no direct control. Practically speaking, this is RAM that is managed by the hypervisor, and which the VM can access indirectly through a special API. This tmem is of unknown (to the VM) size, and its size may (and will indeed) change over time. Also, the VM may not always be allowed to write to tmem (if the hypervisor has no more free RAM to manage, for example).

Transcendent Memory comes in "pools"; a VM typically requests two of these: a (shared) ephemeral pool and a persistent pool. An ephemeral pool is one to which a VM may be able to write, but with no guarantee whatsoever that a page it just wrote can be read back later. In a persistent pool, on the other hand, a page that was successfully written is guaranteed to be readable later.

Linux VMs access Transcendent Memory through the tmem kernel module. Internally, this enables three things:
  • selfballooning/selfshrinking: The VM will continually use a balloon to create artificial memory pressure, in order to get rid of RAM that it does not currently need. Of course, the hypervisor itself may also balloon the VM due to external reasons, which further reduces available RAM.
  • cleancache: At some point, the VM's RAM becomes so small that it has to evict pages from its read cache (also called the "clean cache", hence the name). Rather than just evicting a page, the kernel will first try to write it into the ephemeral pool, and will then evict the page. Conversely, if a process in the VM issues a block-device/file-system read request, the VM will first ask the ephemeral pool whether it has that page. If so, the VM just saved one disk read.
  • frontswap: With selfballooning/selfshrinking at work, the VM will be under constant memory pressure; it will be left with whatever it actually needs, plus a very small margin. Of course, if you start a large process under these conditions, there will not be enough RAM to start it. The selfballooning mechanism will respond to the memory pressure, but with a certain delay. Therefore, before the balloon has had a chance to deflate, the kernel will need to swap out pages. This would of course be slow, and that is where frontswap comes in: Before swapping out a page, the kernel will first try to write it to its persistent tmem pool. If successful, it does not need to actually write the page to a block device. If not successful, the kernel will write the page to the block device. In the majority of cases, tmem will be able to absorb the initial memory shock, thus actual swaps occur rarely. In addition, there exists a mechanism that will slowly swap pages back in from tmem and the swap device, so that neither is clogged up with useless pages.
So that is it: tmem allows you to share your read cache between VMs, while keeping the RAM that each VM claims at any one time as small as possible.

Good. How do I use it?

The first step is to enable tmem in Xen: In your Domain-0, edit /etc/default/grub (this is assuming Debian or Ubuntu), and ensure that the GRUB_CMDLINE_XEN_DEFAULT string contains tmem. You'll have to run update-grub and reboot your physical host for this to take effect.

Second, you'll have to set up your guests to actually use tmem. For this, inside the VM, you edit /etc/default/grub such that GRUB_CMDLINE_LINUX contains tmem. Also, add tmem to /etc/modules. You'll have to run update-grub and reboot the guest for this to take effect.
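Concretely, the relevant lines end up looking roughly like this (an illustration; any options already present in those variables stay in place, with tmem appended inside the quotes):

```shell
# Dom0, /etc/default/grub: pass "tmem" to the Xen hypervisor.
GRUB_CMDLINE_XEN_DEFAULT="tmem"

# Guest, /etc/default/grub: pass "tmem" to the guest kernel.
GRUB_CMDLINE_LINUX="tmem"
```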

NOTE: It is critical to have a swap device configured in your guest, even if only a small one, for otherwise frontswap will NOT work! Without a swap device in your guest, you will continually run out of RAM when starting processes.
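If a guest has no swap at all, a small swap file is enough. A sketch (run as root inside the guest; the size and the /swapfile path are my choice):

```shell
# Create a small swap file; frontswap will sit in front of it,
# so it should rarely see real disk I/O.
dd if=/dev/zero of=/swapfile bs=1M count=256
chmod 600 /swapfile
mkswap /swapfile
# Make it persistent across reboots:
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

Activate it with swapon -a, or simply at the next reboot.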

And that's it. Here's what xl top tells me:

xentop - 17:04:20   Xen 4.3.0
5 domains: 1 running, 4 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 16651980k total, 12513168k used, 4138812k free, 38912k freeable, CPUs: 4 @ 2500MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  (redacted) --b---      12324    3.7    1191920    7.2    2098176      12.6     1    1 1475615857 47146601    1        0   292522    47176   29285682    4827432    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   463995   Succ pers gets:   463807
  (redacted) --b---       1476    0.2     226068    1.4    2098176      12.6     1    1    74948    30323    1      544  8277450    20808  239717602    1317408    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:   119478   Succ pers gets:   119364
  Domain-0 -----r      27858    5.9   10397272   62.4   10485760      63.0     2    0        0        0    0        0        0        0          0          0    0
  (redacted) --b---        172    0.1     225820    1.4    2098176      12.6     2    1     7837      655    1        2    36238     9679    2126866     673864    0
Tmem:  Curr eph pages:       75   Succ eph gets:       75   Succ pers puts:     2359   Succ pers gets:     2359
  (redacted) --b---         59    0.1     207624    1.2    1741824      10.5     2    1     1377     2512    1        1    39733     1242    3007082      64040    0
Tmem:  Curr eph pages:        0   Succ eph gets:        0   Succ pers puts:      623   Succ pers gets:      623

Note that each VM is configured to have 1.7 to 2 GB of RAM, but most use only about 250 MB.


However... In Debian, the standard kernel does not have all of the required features enabled, either built-in or as modules


This means that you'll have to build your own kernel package. To do this, you issue the following commands (modulo your kernel version):

# apt-get build-dep linux-image-3.13-1-amd64
# apt-get source linux-image-3.13-1-amd64

You cd into the linux-3.13.10 directory, and copy over the config from the live system:

# cp /boot/config-3.13-1-amd64 ./.config

Apply the following changes to .config:

482,483c482,483
< # CONFIG_CLEANCACHE is not set
< # CONFIG_FRONTSWAP is not set
---
> CONFIG_CLEANCACHE=y
> CONFIG_FRONTSWAP=y
485a486
> # CONFIG_ZSWAP is not set
5137a5139
> CONFIG_XEN_SELFBALLOONING=y
5148a5151
> CONFIG_XEN_TMEM=m

Build the packages:

# make deb-pkg LOCALVERSION=-tmem KDEB_PKGVERSION=1

This will create four packages in the parent directory. Normally, you need to install just the -tmem image:

# dpkg -i ../linux-image-3.13.10-tmem_1_amd64.deb

Enjoy!



Thursday, April 10, 2014

More slight downloading inconveniences in the digital age

The Netherlands forbids downloading of IP-protected material

Today, the news broke that downloading IP-protected material has become illegal in The Netherlands, effective immediately. Uploading has been illegal for some time, but now an EU court has ruled that downloading is illegal, too.

How ever so slightly inconvenient.

How utterly useless.

Let's fix it right away.

What is the goal here?

Although I live in Switzerland, where IP laws are a lot saner, and where downloading of music and movies for personal use is allowed, this news annoyed me, so I decided to fix the issue once and for all.
In this post, I show how to route a complete subnet from your home network through a VPN provider, so that its traffic surfaces in a country with poor IP protection. See, e.g., the 2013 IP ranking list (the bottom part, that is) for some suitable countries; there are plenty to choose from.

Step 1: Choose a VPN provider, preferably OpenVPN

There are plenty of alternatives here. I chose HideMyAss for this experiment, mainly because they have servers in many countries, and they accept payment in Bitcoin.

Step 2: Edit the OpenVPN config file, to prevent all of your home-network data from being routed through the VPN

Although you're of course welcome to route all of your data through the VPN, this is not recommended for two reasons:

  1. It will generally be slower than your "open" connection.
  2. It is bad privacy practice to run both your identifiable data (email, etc.) and your file-sharing traffic through the same end point.
I blogged earlier in a lot of detail on how to do this. In my case, I used a default HideMyAss OpenVPN config file, and commented out the "route-metric 1" line. Then, I added a "route-noexec" line to prevent the VPN server from pushing routes on me. I also replaced the "auth-user-pass" line with "auth-user-pass hma.pass", where hma.pass contains my login and password (each on a separate line) to automate the login process. Finally, I replaced the "dev tun" line with "dev tun-hma", so that I have a stable TUN-device name whenever I connect to this VPN.
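Summarized, the deltas against the stock HMA config file look like this (illustrative; everything else stays as the provider shipped it):

```shell
# Stable TUN-device name (was: dev tun):
dev tun-hma
# Do not install routes pushed by the server:
route-noexec
# File containing login and password, each on its own line:
auth-user-pass hma.pass
# The provider's route-metric line, commented out:
#route-metric 1
```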

On my Debian gateway server, I copied this file to /etc/openvpn, so that it starts at boot. If you want to start it right away, issue an /etc/init.d/openvpn restart.

Step 3: Route your outgoing traffic from your file-sharing subnet through the VPN.

In my case, I added a new VLAN to my home network; VLAN #6, which runs network 192.168.6.0/24. I will use Linux Source Routing to route only VLAN6 traffic, and then only outgoing traffic, through device tun-hma.  I added the following rules to my /etc/rc.local file to achieve this:


# Route traffic from the file-sharing network to our own
# network via normal tables.
/sbin/ip rule add from 192.168.6.0/24 to 192.168.0.0/16 lookup main prio 200
# Route all other traffic from the file-sharing network
# through table "3", which, in turn, gets routed through HMA.
/sbin/ip rule add from 192.168.6.0/24 lookup 3 prio 201
/sbin/ip route add default dev tun-hma table 3
# Traffic via HMA must be NAT'ed.
/sbin/iptables --table nat -A POSTROUTING -o tun-hma -j MASQUERADE

And that's it. The first "prio 200" rule is actually fairly critical: it ensures that your file-sharing network can reach other machines on your home network through the default routing tables. That is convenient, since you will probably want to save downloaded files to NFS, or elsewhere on the local network. The "prio 201" rule (where 201 means a lower priority than 200) then routes traffic that does not match the "file-sharing net to other local nets" rule through the VPN. The table number 3 is arbitrary.
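One caveat: routes through tun-hma vanish whenever OpenVPN restarts, so table 3 loses its route. A small OpenVPN "up" hook can re-add it automatically (the file name is my invention; OpenVPN passes the device name as the script's first argument):

```shell
# A minimal "up" hook. Save it as e.g. /etc/openvpn/hma-up.sh and
# reference it from the OpenVPN config with:
#   script-security 2
#   up /etc/openvpn/hma-up.sh
cat > hma-up.sh <<'EOF'
#!/bin/sh
# $1 is the name of the tun device OpenVPN just brought up (tun-hma).
exec /sbin/ip route replace default dev "$1" table 3
EOF
chmod +x hma-up.sh
```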

Step 4: (optional) Add some more plumbing

In my case, the new VLAN is also sent to my Xen host machine, using 802.1q tagging. Inside the Xen host, I run a software bridge that bridges the VLAN to my main up/downloading machine. And that's it: I now upload and download in another country.

Friday, July 12, 2013

Local-disk encryption - LUKS and OPAL - to protect against casual privacy loss

This is a detailed continuation of one of my previous posts.

Scope

In my case, I use local-disk encryption to guard against privacy loss when RMA'ing broken disks, or if (not "when", hopefully) my equipment gets stolen. The scope is not to guard against overly intrusive governments; if a government wants my data, and a judge forces me to turn over my encryption keys, there is little one can do.

The setup

My home server is an Ivy-Bridge based Core-i7 system that supports VT-d. The actual hardware runs Xen 4.2, and there is a small Dom0 Linux system that handles a single SSD to store its own system and the guest disk images. A nice property of the CPU in question is that it supports AES-NI, so that it can do in-CPU AES encryption at some 500 Mbyte/s/core.


The DomU file server

The DomU file server runs Ubuntu 13.10 server on which I run native ZFS-on-Linux to manage the data.


The disks

The file-server disks all hang off a SAS controller that I pass through to the DomU file server using VT-d (I disable that controller in Dom0, so that Dom0 never sees more than its own single SSD, which hangs off the on-board SATA3 controller). There are five disks:
  • Four 2-Tbyte old-school hard drives forming a single RAIDZ2 vdev in a zpool.
  • A single SSD that I use as an L2ARC cache device in the same zpool.
The "old-school" hard drives support no native encryption options at all, so we will use LUKS to encrypt those. The SSD (an Intel 520) is a native, OPAL-compliant Self-Encrypting Drive (SED) that can be encrypted simply by setting an ATA password (and that can be utterly wiped in one second by resetting that password).


The boot context

The file-server's root "disk" is an image on the Dom0 system, on which I use encrypted LVM (which, internally, uses LUKS as well). This allows me to put a "clear-text" automatic-booting key on the server's root filesystem (since that is encrypted). This allows me to start the file server from Dom0, attach to its console, and enter a single password (to unlock the server's system LVM), after which I use the key stored in "clear text" within that LVM to automatically unlock all other drives. Since LUKS allows one to define more than one key for a volume, I do not put the key in clear-text on the Dom0 (that would defeat the purpose). Rather, I will add a key that exists only in my mind.


Setting up the encryption

I will assume that a DomU has already been installed on encrypted LVM; the Ubuntu installer image makes this very easy, so I am not going to elaborate on that. I will also assume that the kernel-module PPA from ZFS-on-Linux has been installed already.


Defining an encryption key


You can use any method here, but I am using a 32-byte hex key (and no more than 32 bytes, since the ATA password cannot be any longer). I put that key in /root/key.txt.
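For the record, one way to generate such a key (an illustration; any method works, as long as the result is at most 32 characters):

```shell
# 16 random bytes, hex-encoded: 32 characters, the ATA-password maximum.
head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' > /root/key.txt
chmod 600 /root/key.txt
```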

Setting up the LUKS devices

WARNING: The disks in question will be wiped. Your old data will be gone. Forever. Make backups beforehand.

NOTE: I always refer to disks in /dev/disk/by-id, rather than just /dev/sd?, since at each boot the disks may appear in a different order in /dev, whereas the names in /dev/disk/by-id are always the same.

I start by becoming root and cd-ing to /dev/disk/by-id. I then set up encryption for the four data drives using cryptsetup:

# cryptsetup luksFormat -d /root/key.txt ata-WDC_WD2002FYPS-1

# cryptsetup luksFormat -d /root/key.txt ata-WDC_WD2002FYPS-2

# cryptsetup luksFormat -d /root/key.txt ata-WDC_WD2002FYPS-3

# cryptsetup luksFormat -d /root/key.txt ata-WDC_WD2002FYPS-4

Also, I add my "in-mind-only" key to each of them:

# cryptsetup luksAddKey -d /root/key.txt ata-WDC_WD2002FYPS-1

# cryptsetup luksAddKey -d /root/key.txt ata-WDC_WD2002FYPS-2

# cryptsetup luksAddKey -d /root/key.txt ata-WDC_WD2002FYPS-3

# cryptsetup luksAddKey -d /root/key.txt ata-WDC_WD2002FYPS-4

Cryptsetup will prompt you for that key in each case. Now, we unlock each of the drives. To prevent confusion, the encrypted-device name I choose is the same as the actual device name:

# cryptsetup luksOpen -d /root/key.txt ata-WDC_WD2002FYPS-1 ata-WDC_WD2002FYPS-1

# cryptsetup luksOpen -d /root/key.txt ata-WDC_WD2002FYPS-2 ata-WDC_WD2002FYPS-2

# cryptsetup luksOpen -d /root/key.txt ata-WDC_WD2002FYPS-3 ata-WDC_WD2002FYPS-3

# cryptsetup luksOpen -d /root/key.txt ata-WDC_WD2002FYPS-4 ata-WDC_WD2002FYPS-4

Now, we have the same names available in /dev/mapper.

Setting up the OPAL SED device

Although I am not sure whether this step is needed, I perform an initial secure erase anyway. Again, I am going to assume that you are root in /dev/disk/by-id. We start by setting the key:

# hdparm --security-set-pass `cat /root/key.txt` ata-INTEL_SSD1

Then, we issue the enhanced secure erase. Since this only changes the encryption key, this takes under a second:

# hdparm --security-erase-enhanced `cat /root/key.txt` ata-INTEL_SSD1

All data on the SSD is now irrevocably gone. You did take my advice on making backups, eh? Good. We now set up the encryption again:

# hdparm --security-set-pass `cat /root/key.txt` ata-INTEL_SSD1

And that's it.

Setting up the zpool

With the encrypted devices now in place, we can set up the pool. Data first:

# cd /dev/mapper
# zpool create tank raidz2 ata-WDC_WD2002FYPS-1 ata-WDC_WD2002FYPS-2 ata-WDC_WD2002FYPS-3 ata-WDC_WD2002FYPS-4

And then the L2ARC cache device:

# cd /dev/disk/by-id
# zpool add tank cache ata-INTEL_SSD1

We now have our pool up and running. However, we will see none of these devices after a reboot. Therefore, the next section is on getting things unlocked at the right point during boot-up.

Setting up automatic unlocking

As I mentioned before, entering the file-server's root-LVM key will have to be done by hand (otherwise, anyone booting Dom0 would have access to my data anyway). However, we want all of the other disks to be set up automatically.

A slightly irritating fact in that regard is that the current ZFS-on-Linux implementation suffers from some race conditions in the Ubuntu Upstart phase (Ubuntu's parallel init-script execution). In particular, it is almost impossible to run the drive unlocking during Upstart in such a way that the devices are all available when the ZFS module loads.

Therefore, we do the unlocking in an earlier phase: In the initial ramdisk, when the actual root filesystem has already been mounted, but Upstart has not started yet. As it turns out, the so-called "local-bottom" phase of the initial-ramdisk execution is just the place where both of these conditions are satisfied.

Preparations for LUKS


Even though I will make sure that my devices are unlocked before the normal Upstart script that reads /etc/crypttab runs, I put my LUKS-device definitions there anyway; Upstart does not get confused by this (it just sees that the devices are already unlocked, and leaves it at that), and it provides for a clean shutdown of the system. In my case, I append my data drives there:

ata-WDC_WD2002FYPS-1 /dev/disk/by-id/ata-WDC_WD2002FYPS-1 /root/key.txt luks
ata-WDC_WD2002FYPS-2 /dev/disk/by-id/ata-WDC_WD2002FYPS-2 /root/key.txt luks
ata-WDC_WD2002FYPS-3 /dev/disk/by-id/ata-WDC_WD2002FYPS-3 /root/key.txt luks
ata-WDC_WD2002FYPS-4 /dev/disk/by-id/ata-WDC_WD2002FYPS-4 /root/key.txt luks

Preparations for OPAL

For OPAL, I created a file called /etc/crypt_opal_devices.txt that contains the name of my OPAL device:

ata-INTEL_SSD1

This is a manual solution, but it will have to do.


Writing the initial-ramdisk configuration for the LUKS devices


Since I use a LUKS-encrypted root filesystem, the cryptsetup binary is already in the initial ramdisk. The remainder of the configuration consists of two items:

  1. Ensuring that I have a list of to-be-unlocked LUKS devices in a text file in my ramdisk (a "hook").
  2. Ensuring that this text file is used to issue to correct cryptsetup commands (a "script").

For the first item, I create the executable script /etc/initramfs-tools/hooks/early_luks_devices , containing the following code to automatically scrape together the list of devices from /etc/crypttab.

#!/bin/sh

set -e

PREREQ=""

prereqs () {
        echo "${PREREQ}"
}

case "${1}" in
        prereqs)
                prereqs
                exit 0
                ;;
esac

. /usr/share/initramfs-tools/hook-functions

cat /etc/crypttab | grep ^ata- | tr '\t' ' ' | tr -s ' ' |  cut -d ' ' -f 1 > ${DESTDIR}/etc/cryptdevs

exit 0
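To see what that pipeline produces, here it is run against a local copy of the crypttab entries from above:

```shell
# Reproduce the crypttab lines and run the hook's pipeline on them:
cat > crypttab.sample <<'EOF'
ata-WDC_WD2002FYPS-1 /dev/disk/by-id/ata-WDC_WD2002FYPS-1 /root/key.txt luks
ata-WDC_WD2002FYPS-2 /dev/disk/by-id/ata-WDC_WD2002FYPS-2 /root/key.txt luks
ata-WDC_WD2002FYPS-3 /dev/disk/by-id/ata-WDC_WD2002FYPS-3 /root/key.txt luks
ata-WDC_WD2002FYPS-4 /dev/disk/by-id/ata-WDC_WD2002FYPS-4 /root/key.txt luks
EOF
cat crypttab.sample | grep ^ata- | tr '\t' ' ' | tr -s ' ' | cut -d ' ' -f 1
# Prints the four device names, one per line.
```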

For the second item, I create the executable script /etc/initramfs-tools/scripts/local-bottom/unlock_luks_partitions , containing:

#!/bin/sh

set -e

case "${1}" in
        prereqs)
                exit 0
                ;;
esac

. /scripts/functions

for i in `cat /etc/cryptdevs`; do
  log_success_msg Unlocking ${i} ...
  /sbin/cryptsetup luksOpen -d ${rootmnt}/root/key.txt /dev/disk/by-id/${i} ${i}
done;

exit 0

This is all that is needed for the LUKS devices.

Writing the initial-ramdisk configuration for the OPAL device

Here, we need the following three items:
  1. We need to include /sbin/hdparm in the ramdisk.
  2. We need a list of to-be-unlocked OPAL devices in the ramdisk (a "hook").
  3. We need to run the correct hdparm invocation to unlock the device (a "script").
For the first item, I create the executable script /etc/initramfs-tools/hooks/include_hdparm , containing:

#!/bin/sh

set -e

PREREQ=""

prereqs () {
        echo "${PREREQ}"
}

case "${1}" in
        prereqs)
                prereqs
                exit 0
                ;;
esac

. /usr/share/initramfs-tools/hook-functions

copy_exec /sbin/hdparm /sbin

exit 0

For the second item, I create the executable script /etc/initramfs-tools/hooks/early_opal_devices , containing:

#!/bin/sh

set -e

PREREQ=""

prereqs () {
        echo "${PREREQ}"
}

case "${1}" in
        prereqs)
                prereqs
                exit 0
                ;;
esac

. /usr/share/initramfs-tools/hook-functions

cp /etc/crypt_opal_devices.txt ${DESTDIR}/etc/crypt_opal_devices.txt 

exit 0

For the third item, I create the executable script /etc/initramfs-tools/scripts/local-bottom/unlock_opal_devices , containing:

#!/bin/sh

set -e

case "${1}" in
        prereqs)
                exit 0
                ;;
esac

. /scripts/functions

for i in `cat /etc/crypt_opal_devices.txt`; do
  log_success_msg Unlocking OPAL device ${i} ...
  /sbin/hdparm --security-unlock `cat ${rootmnt}/root/opal.txt` /dev/disk/by-id/${i}
done;

exit 0

And that's it.

Refreshing the actual initial ramdisk


NOTE: This is crucial!

You run the following simple command:

# update-initramfs -u

And you're all set. Enjoy!

Thursday, December 13, 2012

Locked yourself out of your WNDR3800 on OpenWRT? Here's how you recover

Oops...

This week, I made a mistake when editing my network-bridge configuration on OpenWRT's LuCi web interface. After I pressed "Save and Apply", it occurred to me that things were taking quite a bit longer than usual. Then, I had no more internet connectivity from my backend machines, and I realized that I had made a mistake.

This was no cause for panic, since routers and firmwares usually have a recovery option: I would just look up how to do that on the intern... oh, wait... :(

Fortunately, I could enable WiFi tethering on my Android phone (that has a mobile data package), so I could use a laptop to look up the solution on the internet.

Solving the problem

Actually, the solution is quite easy: OpenWRT has a built-in recovery mode that you can enable by pressing the correct button at the correct time during the boot procedure. To that end, configure a backend machine with the static IP 192.168.1.2, and start a tcpdump:

# tcpdump -Ani eth0 port 4919 and udp

Now switch the router off and back on. After some 10-15 seconds, your tcpdump will show a message saying (amidst a lot of dots): "Please press button now to enter failsafe". At that point, on the WNDR3800, press the lowermost button (the one normally used for WPS auto-setup). The power LED will then start blinking very rapidly. Now wait another ~30 seconds, and telnet into the router from your backend machine:

$ telnet 192.168.1.1

This will drop you straight into a root shell. On the OpenWRT device, you will want to remount the root filesystem read-write:

# mount -o remount,rw /

You can now fix the problem (in the files under /etc/config) and reboot.

Tuesday, November 6, 2012

Multi-site private IPv6 networking using ULA and IPSEC

Wait, what?! Why?

So here is the situation: I have a home network behind a Netgear WNDR3800 router running OpenWRT, and I rent a remote server on which I run Xen with several VMs on a virtual backend network. Both sites have full IPv6 connectivity; all backend systems have a global IPv6 address, and although they are free to communicate with the entire (IPv6) world, I do have basic firewalling in place that allows new inbound connections only to some internal IPv6 hosts running OpenSSH.

There also is the usual IPv4 NAT (NAT44) story on both backend networks, but this post is not about IPv4.

What I want is this: I want systems on both backend networks to be able to openly talk to each other over IPv6, yet in a secure way. In other words; to internal systems, I want a completely open and private IPv6 network.

Is that even possible?

Well, yes, it is! Here is how.

IPv6 Unique Local Addresses (ULA)

Even though IPv6 prefixes are usually stable, I do not want to depend on that when/if I switch providers. Fortunately, IPv6 was built from the ground up on the concept that an interface can have multiple IPv6 addresses; two of them are the normal link-local address (in fe80::/64) and the global address (in 2000::/3), but it is possible to add more.

One of the possibilities that IPv6 offers is Unique Local Addresses: addresses in fd00::/8 (and fc00::/8, though that range should not be used until there is a global registry) that one can use in the same way as the IPv4 private address space (10.0.0.0/8, 192.168.0.0/16, and friends). You can randomly generate a /48 in fd00::/8 by choosing 40 more bits, e.g. by running noise through sha256sum or something similar. Within this /48, you can create as many subdivisions as you want, though it is customary to create /64s, so that IPv6 autoconfiguration works on your clients.

The networks you create in fd00::/8 should not be routed outside your internal network, and nobody will willingly route these prefixes from the external world to you. However, internally, and between sites, you can route them any way you want.

In the remainder of this post, I will describe how to set up ULA on both sites, how to connect both sites, and how to then make the inter-site connection secure.

Choosing a ULA and setting it up on both sites

The relevant RFC (RFC 4193) suggests that you generate 40 random bits using any sufficiently random method, e.g. by running some data from /dev/urandom through sha256sum and copying the first 10 hex digits. Let us, for the sake of simplicity, assume the following ULA:

fd12:3456:789a::/48

This is of course very non-random, and you should not use it yourself, but it makes this post a bit easier to read.
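The generation itself can be scripted; a sketch of the suggested method (the exact commands are my choice):

```shell
# 40 random bits -> 10 hex digits -> fdXX:XXXX:XXXX::/48
bits=$(head -c 32 /dev/urandom | sha256sum | cut -c 1-10)
printf 'fd%s:%s:%s::/48\n' \
  "$(echo "$bits" | cut -c 1-2)" \
  "$(echo "$bits" | cut -c 3-6)" \
  "$(echo "$bits" | cut -c 7-10)"
```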

Now that we have our ULA, we can pause for a second and appreciate how unimaginably large the network we just created is. This is a /48 prefix, which means that we have 128 - 48 = 80 bits of address space all to ourselves! It is customary to divide the space into 65,536 /64 networks, each of which can hold ~2^64 unique addresses. Now that is a large number: 2^64 is about 2 * 10^19. That means that if we buy some 2 million 1-TByte hard drives, we could assign a unique IPv6 address to each bit on each drive! And within our ULA, we could have 65,536 stacks of 2 million 1-TByte drives :-).

Anyway, in practice we will have fewer devices. Let's say that we choose obvious yet simple networks for both sites:

  • Site 1: Network fd12:3456:789a:1::/64 .
  • Site 2: Network fd12:3456:789a:2::/64 .
We will also need a network to connect both sites, but I will get to that later.

Setting up ULA on both sites consists of setting a (preferably simple) address for the router, and announcing the network prefix to other machines on the site network.

Setting up ULA on site 1

In my case, site 1 has a router running OpenWRT 10.03. Since the internal interface already has a static address for the Globally routable network (also a /64), and the LuCi web interface on the router does not allow me to add multiple IPv6 addresses on the lan interface, I define an alias in /etc/config/network:

config 'alias' 'lanula'
        option 'interface' 'lan'
        option 'proto' 'static'
        option 'ip6addr' 'fd12:3456:789a:1::1/64'

This will give the router the first available (::1) address within site 1's network. I then need to tell radvd to start announcing the prefix. That is done by editing /etc/config/radvd:

config 'prefix'
        option 'interface' 'lan'
        option 'AdvOnLink' '1'
        option 'AdvAutonomous' '1'
        list 'prefix' '2***:****:****::/64 fd12:3456:789a:1::/64'
        option 'ignore' '0'

Here, the starred-out 2***:****:****::/64 is my actual global prefix; just add the ULA prefix on that same line. After rebooting the router, your clients will automatically obtain an address in both the global prefix and the ULA prefix. In fact, if you use IPv6 privacy extensions (Linux does, usually), you will even get a temporary IPv6 address in both networks.

At this point, it is a good idea to ensure that you can ping6 fd12:3456:789a:1::1 from a client.

Setting up ULA on Site 2

In my case, the "router" on Site 2 is the dom0 domain of a Xen box that runs the other backend machines as domU domains. It, too, already has full IPv6 connectivity: my server hoster routes a /64 to my dom0, which I then distribute to my domUs using radvd.

The dom0 in question is a vanilla Ubuntu Server release, so I can configure the interfaces in /etc/network/interfaces. However, since I can add only one IPv6 address (in addition to the link-local address) in there, I have to use the up/down logic to assign the ULA address.

iface ibr0 inet6 static
  address 2###:####:####:####::1
  netmask 64
  up /sbin/ifconfig ibr0 inet6 add fd12:3456:789a:2::1/64
  down /sbin/ifconfig ibr0 inet6 del fd12:3456:789a:2::1/64

Here, ibr0 is the internal backend bridge to which all my domUs are connected, and the hashed-out 2###:####:####:####::/64 is my Global address on the interface.

As on site 1, I have to configure radvd to advertise the prefix. To this end, I edit /etc/radvd.conf to include:

interface ibr0 { 
        AdvSendAdvert on;
        MinRtrAdvInterval 3; 
        MaxRtrAdvInterval 10;
        prefix 2###:####:####:####::/64 { 
                AdvOnLink on; 
                AdvAutonomous on; 
                AdvRouterAddr on; 
        };
        prefix fd12:3456:789a:2::/64 { 
                AdvOnLink on; 
                AdvAutonomous on; 
                AdvRouterAddr off; 
        };
};

At this point, both sites have a functioning network in the ULA range. Please do check that you can ping the router from client machines, as this is essential to getting the rest to work.

What does not yet work is the connection between the two sites; more on that later. First, there is something else that needs to be taken care of: on both sites, firewall rules should be set up to neither send nor receive traffic for ULA addresses on the external IPv6 interface; block the full fc00::/7 both coming in and going out. If we do not do this, a machine on site 1 trying to ping a machine on site 2 realizes that site 2 is outside its /64, and the router will try to route the packet onto the public IPv6 net.
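As a sketch, this is what the blocking looks like with plain ip6tables on a Linux router (eth1 standing in for the external interface is an assumption; on OpenWRT you would express the same thing in the firewall configuration):

```shell
# Drop all ULA traffic crossing the external interface, in both directions.
# fc00::/7 covers the entire ULA space (fc00::/8 and fd00::/8).
ip6tables -A FORWARD -o eth1 -s fc00::/7 -j DROP
ip6tables -A FORWARD -o eth1 -d fc00::/7 -j DROP
ip6tables -A FORWARD -i eth1 -s fc00::/7 -j DROP
ip6tables -A INPUT   -i eth1 -s fc00::/7 -j DROP
```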

Connecting both sites

In order to connect both sites, really any mechanism that allows for sending IPv6 will do: one could set up an OpenVPN tunnel (with tap devices, as you need to be able to set IPv6 addresses on the interfaces), an ipv6-in-ipv4 tunnel, etc. In this case, though, I will try to not touch IPv4 at all, and I will use what is already there: I will use an ipv6-in-ipv6 tunnel between the sites' external IPv6 addresses, where the traffic inside the tunnel runs in the ULA space.

Fortunately, Linux supports such a setup out-of-the-box using its ip6ip6 mechanism on a tun device.

In the remainder of this example, I will use 2111:1111:1111:1111::1 as site 1's external address, and 2222:2222:2222:2222::2 as site 2's external address.

Inside the tunnel, we will use the new fd12:3456:789a:3::/64 network inside our ULA space.

Setting up the tunnel portal on site 1

Site 1 runs the OpenWRT router, which is a bit tricky to configure. I did not find a good way to set up an ip6ip6 tunnel in the LuCi web interface, so I will include the command to do that in the additional startup script under System -> Startup. Before I do so, though, I will add a new interface called mytun, configure it as static, and set address fd12:3456:789a:3::1/64 on it. My (self-chosen) logic here is that the final part of the address is "1" since this is site 1's end of the tunnel.

Now go to Network -> Interfaces, and add a new "zone" called (e.g.) "tunnel", which includes the mytun interface. Set up firewall rules to allow all traffic inside our ULA in both directions, and also allow open routing between "tunnel" and your "lan" zone. Also go to Network -> Static Routes, and route both fd12:3456:789a:2::/64 and fd12:3456:789a:3::/64 onto the mytun device; we want to be able to reach both the other end of the tunnel and the network on the other side of the tunnel.

Finally, go to System -> Startup, and add the command that will set up the tunnel:

ip -6 tunnel add mytun mode ip6ip6 remote 2222:2222:2222:2222::2 local 2111:1111:1111:1111::1 dev eth1
ifconfig mytun mtu 1400

That final MTU setting requires some explanation: I do not really have native IPv6 on site 1; I have native (and dynamic) IPv4, and my IPv6 comes through an AICCU tunnel with SixXS. By default, SixXS will set an MTU of 1280 bytes for you. This is a safe bet, but it is also the very minimum that IPv6 allows (the corresponding IPv4 figure is 576 bytes). Now, if the SixXS tunnel has a 1280-byte MTU, our ip6ip6 tunnel cannot even have the minimum 1280-byte MTU, as some bytes are needed for the encapsulation.

In my case (and after reading the SixXS documentation), it seems that their IPv6-in-IPv4 scheme has 20 bytes of encapsulation, so that I can use an MTU of 1480 bytes inside the Ethernet IP MTU of 1500 bytes. In the case of SixXS, I had to log into my account on their site, and I had to change the tunnel MTU from 1280 to 1480. AICCU then required a restart to pick up the new value. NOTE: The fact that I get IPv6 through a tunnel also means that I needed to substitute eth1 with sixxs.0 in the above command.

The ip6ip6 tunnel inside the SixXS IPv6-in-IPv4 tunnel must have a smaller MTU than the SixXS tunnel. I do not know exactly how much smaller, but 80 bytes of encapsulation is a safe bet; I thus went for 1400 bytes.
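For the record, the bookkeeping behind these numbers looks like this (a sketch; the basic ip6ip6 overhead is one 40-byte IPv6 header, though tunnel options can add a few bytes more, hence the extra slack down to 1400):

```shell
# MTU arithmetic for the nested tunnels.
ETH_MTU=1500
IPV4_HDR=20      # SixXS IPv6-in-IPv4 (proto 41) adds one IPv4 header
IPV6_HDR=40      # ip6ip6 adds one IPv6 header
SIXXS_MTU=$(( ETH_MTU - IPV4_HDR ))      # 1480: the SixXS tunnel MTU
INNER_MAX=$(( SIXXS_MTU - IPV6_HDR ))    # 1440: ceiling for the ip6ip6 tunnel
echo "SixXS MTU: $SIXXS_MTU, ip6ip6 ceiling: $INNER_MAX (1400 leaves margin)"
```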


Setting up the tunnel portal on site 2

Site 2 is a vanilla Ubuntu Server running in dom0. IPv6 is offered natively on the external eth0 interface. As such, the interface can be configured in /etc/network/interfaces:

# ULA tunnel to Site 1.
auto mytun
iface mytun inet6 static
  address fd12:3456:789a:3::2
  netmask 64
  mtu 1400
  pre-up ip -6 tunnel add mytun mode ip6ip6 remote 2111:1111:1111:1111::1 local 2222:2222:2222:2222::2 dev eth0
  post-up ip -6 route add fd12:3456:789a:1::/64 dev mytun mtu 1300
  pre-down ip -6 route del fd12:3456:789a:1::/64 dev mytun mtu 1300
  post-down ip -6 tunnel del mytun mode ip6ip6 remote 2111:1111:1111:1111::1 local 2222:2222:2222:2222::2 dev eth0

This configures the tunnel.

Testing the tunnel

At this point, one should ensure that the tunnel itself works: log onto site 1's router and issue ping6 fd12:3456:789a:3::2, then log onto site 2's router and issue ping6 fd12:3456:789a:3::1.

If that works, try pinging across the tunnel: first ping a machine on site 1's backend network from the router on site 2, and a machine on site 2's backend network from the router on site 1. Finally, pinging from a machine on site 1's network directly to a machine on site 2's network should work, and vice versa!

Securing the tunnel

In this example, I will use IPSec to secure the tunnel. Whereas in IPv4, IPSec requires opening up some UDP ports on both routers, in IPv6 it is built right into the protocol itself. For our purposes, IPSec offers three options:
  1. AH (Authentication Header), which authenticates but does not encrypt traffic. This is hardly used in practice.
  2. ESP in transport mode, for host-to-host communication, where all hosts on both networks need to cooperate in the IPSec setup.
  3. ESP in tunnel mode, where network-to-network communication is encrypted on the tunnel only, without the machines on either network needing to know about it.
The easiest setup for my situation is ESP in tunnel mode: this way, I need to configure IPSec on the routers only, and the whole secured tunnel is transparent to all backend machines.

To this end, I install the ipsec-tools package on both routers; this package is available for both Ubuntu Server and OpenWRT. I will use a simple pre-shared key infrastructure to keep the whole setup as simple as possible.

For a unidirectional ruleset, IPSec needs two keys: an encryption key, and an authentication key. As communication in both directions is treated separately in IPSec, we also need two keys for the other direction. We thus need four keys. In this example, I will use the keys from this howto; do not use these, but generate your own random keys, just as I did!
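Generating your own keys is a one-liner per key; a sketch using /dev/urandom, with od for hex formatting (the lengths match the aes-cbc and hmac-md5 algorithms used below: 192 and 128 bits respectively):

```shell
# One encryption key (24 bytes = 192 bits) and one authentication key
# (16 bytes = 128 bits); run the same two lines again for the other direction.
enc=0x$(head -c 24 /dev/urandom | od -An -tx1 | tr -d ' \n')
auth=0x$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "encryption key:     $enc"
echo "authentication key: $auth"
```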

On site 1's router, create a file /etc/ipsec-tools.conf with the following content (replacing the keys with your own), and permissions 700.

#!/usr/sbin/setkey -f

## Flush the SAD and SPD
flush;
spdflush;

# Just simple static keys.
# ESP SAs using 192-bit encryption keys (aes-cbc) and 128-bit authentication keys (hmac-md5)
add fd12:3456:789a:3::2 fd12:3456:789a:3::1 esp 0x201 -m tunnel -E aes-cbc 0x7aeaca3f87d060a12f4a4487d5a5c3355920fae69a96c831 -A hmac-md5 0xc0291ff014dccdd03874d9e8e4cdf3e6;
add fd12:3456:789a:3::1 fd12:3456:789a:3::2 esp 0x301 -m tunnel -E aes-cbc 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df -A hmac-md5 0x96358c90783bbfa3d7b196ceabe0536b;

# Require encryption in between the networks over this tunnel.
spdadd fd12:3456:789a:2::/64 fd12:3456:789a:1::/64 any -P in ipsec
   esp/tunnel/fd12:3456:789a:3::2-fd12:3456:789a:3::1/require;
spdadd fd12:3456:789a:1::/64 fd12:3456:789a:2::/64 any -P out ipsec
  esp/tunnel/fd12:3456:789a:3::2-fd12:3456:789a:3::1/require;

On site 2's router, create the same file, with only one tiny difference: swap in and out on the last two spdadd rules, as shown below:

#!/usr/sbin/setkey -f

## Flush the SAD and SPD
flush;
spdflush;

# Just simple static keys.
# ESP SAs using 192-bit encryption keys (aes-cbc) and 128-bit authentication keys (hmac-md5)
add fd12:3456:789a:3::2 fd12:3456:789a:3::1 esp 0x201 -m tunnel -E aes-cbc 0x7aeaca3f87d060a12f4a4487d5a5c3355920fae69a96c831 -A hmac-md5 0xc0291ff014dccdd03874d9e8e4cdf3e6;
add fd12:3456:789a:3::1 fd12:3456:789a:3::2 esp 0x301 -m tunnel -E aes-cbc 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df -A hmac-md5 0x96358c90783bbfa3d7b196ceabe0536b;

# Require encryption in between the networks over this tunnel.
spdadd fd12:3456:789a:2::/64 fd12:3456:789a:1::/64 any -P out ipsec
   esp/tunnel/fd12:3456:789a:3::2-fd12:3456:789a:3::1/require;
spdadd fd12:3456:789a:1::/64 fd12:3456:789a:2::/64 any -P in ipsec
  esp/tunnel/fd12:3456:789a:3::2-fd12:3456:789a:3::1/require;

The add statements set up the keys and the IPSec type (ESP Tunnel), whereas the spdadd statements require the use of these encryption methods in both directions. The only difference between the two sites is which direction is "in", and which direction is "out".

On Ubuntu server, the init/upstart scripts will automatically use the information from the above file on startup. If you want to enable it now, without rebooting, simply run /etc/ipsec-tools.conf as root.

On OpenWRT, we need to add this script on the System -> Startup page. Simply add the command:

/etc/ipsec-tools.conf

And that is it; run it as root if you want to activate it now without rebooting.
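To check that the kernel actually picked up the keys and policies, setkey (from the same ipsec-tools package) can dump them back out:

```shell
setkey -D     # dump the Security Association Database: should show both "add" entries
setkey -DP    # dump the Security Policy Database: should show both "spdadd" policies
```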

Testing the secured tunnel

Initial tests can be done using the same methodology as before: ping router<->router, router<->backend, backend<->router, and backend<->backend. If that all works, we should ensure that the communication is indeed encrypted. 

To this end, I log on to router 2, and start listening for what the external interface sees when I communicate:

tcpdump -n -i eth0 src 2111:1111:1111:1111::1 or dst 2111:1111:1111:1111::1

Then, from a machine on site 1's backend network, I ping a machine on site 2's backend network. Ensure that you use the ULA address, since otherwise the traffic goes over the public net rather than through the tunnel! If all is well, not only do the pings work, but the tcpdump command will show something like:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
20:06:30.469386 IP6 2111:1111:1111:1111::1 > 2222:2222:2222:2222::2: IP6 fd12:3456:789a:3::1 > fd12:3456:789a:3::2: ESP(spi=0x00000301,seq=0xa7), length 88
20:06:30.469852 IP6 2222:2222:2222:2222::2 > 2111:1111:1111:1111::1: IP6 fd12:3456:789a:3::2 > fd12:3456:789a:3::1: ESP(spi=0x00000301,seq=0x917a), length 88

The first message is the ping request, and the second is the reply. Let's take a closer look at what we see here:
  • We see only the routers' external addresses (logically, since otherwise no communication would be possible).
  • We see only the routers' ULA tunnel addresses.
  • We can see that we are transporting an 88-byte ESP-encrypted payload.
Let's also mention what we do not see:
  • We do not see what addresses on both backend networks are communicating with each other.
  • We do not see what is being communicated.
And there you have it! Enjoy your secure networking.


Tuesday, October 16, 2012

Hot-replacing a failing disk that is a part of Linux Software RAID and ZFS pools

Disks break: not "if", "when".

Yes, that's what they do. I run a 4-disk setup that holds one Linux Software RAID6 array and two ZFS RAIDZ2 pools.

Clouds in the sky

As of a few days ago, one of the disks started to fail, as was apparent from syslog entries like these:

[1318523.293294] ata2.00: failed command: READ FPDMA QUEUED
[1318523.304015] ata2.00: cmd 60/01:00:8f:da:14/00:00:4d:00:00/40 tag 0 ncq 512 in
[1318523.304021]          res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[1318523.346321] ata2.00: status: { DRDY ERR }
[1318523.356810] ata2.00: error: { UNC }
[1318523.367279] ata2.00: failed command: READ FPDMA QUEUED
[1318523.377664] ata2.00: cmd 60/3f:08:60:ad:14/00:00:4d:00:00/40 tag 1 ncq 32256 in
[1318523.377670]          res 41/40:00:98:ad:14/00:00:4d:00:00/40 Emask 0x409 (media error)
[1318523.419883] ata2.00: status: { DRDY ERR }
[1318523.430424] ata2.00: error: { UNC }
[1318523.440904] ata2.00: failed command: READ FPDMA QUEUED
[1318523.451164] ata2.00: cmd 60/01:10:95:29:00/00:00:4e:00:00/40 tag 2 ncq 512 in
[1318523.451169]          res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[1318523.492656] ata2.00: status: { DRDY ERR }
[1318523.503246] ata2.00: error: { UNC }

As I did not have a spare disk on hand (tsk, tsk, tsk, yes, I know...) I immediately ordered one, even before sending the old disk in for RMA. Initially, as I ran a zpool scrub on the pools, there were only these syslog messages; the zpools themselves did not notice any trouble.

Thunderstorms in the sky

As of yesterday, errors started making it to the zpool layer:

$ sudo zpool status
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
 scan: scrub repaired 356K in 3h25m with 0 errors on Sun Oct  7 15:16:16 2012
config:

NAME                                                 STATE     READ WRITE CKSUM
data                                                 ONLINE       0     0     0
 raidz2-0                                           ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0  422K
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part1  ONLINE       0     0     0

errors: No known data errors

  pool: ttank
 state: ONLINE
 scan: scrub repaired 0 in 0h56m with 0 errors on Fri Oct  5 11:53:31 2012
config:

NAME                                                 STATE     READ WRITE CKSUM
ttank                                                ONLINE       0     0     0
 raidz2-0                                           ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0
   ata-WDC_WD2002FYPS-[serial]-part3  ONLINE       0     0     0

By now, the drive not only had read errors; it had even started to return faulty data (while claiming that the data was OK). Fortunately, ZFS is built from the ground up to never trust hardware, so its checksumming mechanism detected the faulty data. Clearly, it was time to replace that disk. Fortunately, the spare drive had just come in by mail.

Taking the old disk offline

I run my disks in an IcyBox hotplug backplane, so I want to replace the disk without so much as rebooting the server. First, one needs to know which disk it is, of course. Since I use the disk-ID links, looking at the symlinks in /dev/disk/by-id tells me that the disk in question is /dev/sdb.
To be safe, I read a gigabyte of data off the disk, to physically inspect which drive light switches on as I do so:

# dd if=/dev/sdb of=/dev/null bs=1048576 count=1024

Visual inspection tells me that this is the top drive in the IcyBox. Good.

As for ZFS, there is nothing special that one needs to do. For Linux Software RAID, one needs to tell the system to fail the disk, and subsequently remove it from the array:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdb2[0] sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices:

Fail the disk:

# mdadm /dev/md0 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md0

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdb2[0](F) sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices:

Remove the disk:

# mdadm /dev/md0 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md0

# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid6 sde2[5] sdc2[4] sdd2[2]
      409996800 blocks super 1.2 level 6, 256k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices:

At this point, one could yank the disk out, but it's better to tell Linux that you are going to do so. Switching off the disk and detaching it from the system is done as follows:

# echo 1 > /sys/block/sdb/device/delete

The syslog will tell you that the device indeed went offline:

[1734127.293861] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[1734127.331629] sd 1:0:0:0: [sdb] Stopping disk
[1734127.768141] ata2.00: disabled

At this point, the tray can be taken out of the hotplug backplane, and the old disk can be replaced with the new one.

Bringing the new disk online

After pulling the tray, I swapped the old disk for the new one and slid the tray back in. The kernel detects the disk:


[1743181.511929] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[1743181.512460] ata2: irq_stat 0x00000040, connection status changed
[1743181.512883] ata2: SError: { CommWake DevExch }
[1743181.513215] ata2: hard resetting link
[1743187.276049] ata2: link is slow to respond, please be patient (ready=0)
[1743190.860073] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[1743190.998197] ata2.00: ATA-9: WDC WD20EFRX-[serial], max UDMA/133
[1743190.998206] ata2.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[1743190.998836] ata2.00: configured for UDMA/133
[1743190.998855] ata2: EH complete
[1743190.999097] scsi 1:0:0:0: Direct-Access     ATA      WDC WD20EFRX-[serial]
[1743190.999679] sd 1:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[1743190.999691] sd 1:0:0:0: [sdf] 4096-byte physical blocks
[1743190.999705] sd 1:0:0:0: Attached scsi generic sg1 type 0
[1743191.000185] sd 1:0:0:0: [sdf] Write Protect is off
[1743191.000197] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[1743191.000328] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1743191.014153]  sdf: unknown partition table
[1743191.014902] sd 1:0:0:0: [sdf] Attached SCSI disk
[1743415.817135]  sdf: unknown partition table

Obviously, there are no partitions on the disk yet. In order to create them, I simply copy them off one of the other drives:

# sfdisk -d /dev/sdc | sfdisk /dev/sdf

This is readily picked up by the kernel:

[1743415.817135]  sdf: unknown partition table
[1743416.227972]  sdf: sdf1 sdf2 sdf3

Resilvering the arrays

The first array I decide to resilver is the most important one: the primary data pool:

# zpool replace data /dev/disk/by-id/ata-WDC_WD2002FYPS-[serial]-part1 /dev/disk/by-id/ata-WDC_WD20EFRX-[serial]-part1

This is going to take a long time: more when this is done.
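Meanwhile, the md0 array needs the same treatment; re-adding is the mirror image of the fail/remove steps earlier (a sketch, with the device names from this example):

```shell
# Add the new disk's second partition to the degraded RAID6 array; the
# kernel starts rebuilding immediately.
mdadm /dev/md0 --add /dev/sdf2
# Watch the rebuild progress.
cat /proc/mdstat
```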