Friday, December 6, 2013

S/W RAID6 Performance Influenced by stripe_cache_size

A while back, I set up a 4-disk S/W RAID6 array (using mdadm, of course) on one of my openSUSE 12.3 machines (all machines are now openSUSE 13.1). I used four 320GB SATA drives that I had been using for backup. The performance was OK. A few days ago, I ran across another 320GB drive I had forgotten about, and decided to add it to the array, reshaping it to use five (5) drives. With mdadm, this is so easy it's scandalous.

Each of the drives is set up with three (3) partitions:
  • 1GB for /boot (in MD RAID1)
  • 512MB for swap
  • the remainder as a component of MD RAID6.

First, I used sfdisk to copy the partitioning from one device to another:

sfdisk --dump /dev/sda | sfdisk /dev/sde

Ignoring the first two partitions of /dev/sde for now, turning an N-device raid6 into an (N+1)-device raid6 consists of just two steps:

1. add the new device to the array as a spare
2. grow the array

First, you add the new device as a spare:

mdadm --add /dev/md1 /dev/sde3
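Before growing, it's worth confirming the new disk actually landed as a spare. In /proc/mdstat a spare is marked with an "(S)" suffix; the line below is an invented sample of what that looks like for an array like this one:

```shell
# Invented sample of a /proc/mdstat line after the --add; on a real
# system you would just look at /proc/mdstat itself. The "(S)" marks
# the spare.
sample='md1 : active raid6 sde3[4](S) sdd3[3] sdc3[2] sdb3[1] sda3[0]'
echo "$sample" | grep -o '[a-z0-9]*\[[0-9]*\](S)'
# → sde3[4](S)
```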

Then you tell the raid to reshape itself. I was happy with the chunk size (512KB), so I left that alone. NOTE: the following command is wrong and I'll explain why in a moment.

mdadm --grow /dev/md1 --raid-devices=6 --backup-file=/boot/backup-md1

The command grumped something about too few devices, which I ignored because... I did.
So I re-issued the command with --force:

mdadm --grow /dev/md1 --raid-devices=6 --force --backup-file=/boot/backup-md1

at which point the kernel began the reshape.
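A reshape of this size runs for hours, and /proc/mdstat is the place to watch it. A rough sketch of pulling the completion percentage out of it (the sample line is invented; real output varies a little):

```shell
# Invented sample of the progress line md prints during a reshape.
sample='  [=>...................]  reshape =  8.5% (13300224/155946496) finish=312.5min'
echo "$sample" | grep -o 'reshape = *[0-9.]*%'
# → reshape =  8.5%

# Against the live array, the same grep works on /proc/mdstat, or just:
#   watch -n 60 cat /proc/mdstat
```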

Now, I did a bad thing there. I told the four (4) device raid to grow to six (6) devices, even though only five (5) are available. Which it did, over the course of several hours. Shortly after issuing the command, I realized my mistake. Fortunately, md is super awesome, so it dealt with it anyway.

When the reshape was complete, I issued a corrected command to reshape, this time to the proper number of devices (5).

mdadm --grow /dev/md1 --raid-devices=5 --backup-file=/boot/backup-md1

When the (second) reshape was complete, I resized (enlarged) the filesystem with resize2fs and ran some benchmarks. Performance was improved, but then I remembered about stripe_cache_size, which essentially buffers the read-modify-write cycle of parity-based raids. Initially, I just set it to the largest value that it would accept (32768) and left it at that, but then I read a posting to the linux-raid mailing list which suggested that such a strategy was non-optimal.
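One reason the biggest value isn't automatically the best: the stripe cache costs real memory. The usual rule of thumb is stripe_cache_size × page size × number of member disks; the sketch below assumes 4 KiB pages and the 5-disk array from this post.

```shell
# Approximate stripe cache memory use, assuming 4 KiB pages and 5 disks.
for sz in 512 1024 2048 4096 8192 16384 32768; do
  echo "${sz} -> $(( sz * 4096 * 5 / 1024 / 1024 )) MiB"
done
# 4096 works out to 80 MiB; 32768 to 640 MiB.
```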

Intrigued, I wrote a small script to test each of 512, 1024, 2048, 4096, 8192, 16384, and 32768, and then graphed the results. The results were a bit surprising: 4096 was clearly the best performing in terms of both read and write.

The whole test script (with paths edited, slightly) is below:

#! /bin/sh
set -ex
for STRIPE_CACHE_SIZE in 512 1024 2048 4096 8192 16384 32768; do
  echo ${STRIPE_CACHE_SIZE} > /sys/block/md1/md/stripe_cache_size

  # write and read back a 4G file
  dd if=/dev/zero of=/some/path/to/some/file bs=4096 count=1048576 conv=fdatasync
  dd if=/some/path/to/some/file of=/dev/null bs=4096

  # flush the page cache before the next pass
  echo 3 > /proc/sys/vm/drop_caches
done
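dd prints its throughput summary on stderr, which is what gets parsed for the chart. A hedged sketch of extracting the rate from such a line (the sample is invented; the exact wording varies between coreutils versions):

```shell
# Invented sample of dd's summary line (it goes to stderr on a real run).
sample='4294967296 bytes (4.3 GB) copied, 35.1614 s, 122 MB/s'
echo "$sample" | awk -F', ' '{print $NF}'
# → 122 MB/s
```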

I then parsed and charted the results (using the pretty awesome Veusz); the chart is below. Red is write and blue is read. I wish I knew how to use Veusz better, but I'm learning.
