Wednesday, December 19, 2007

Some RAID10 performance numbers

I've been using Linux software raid (md) for a very long time - more or less since the beginning - and it has quite honestly been great to me. I've always deployed in a raid5 configuration and never gave much thought to the other levels. Recently, raid10 became available. Not raid1+0, but what is considered (by some) to be a non-standard raid10 implementation, which allows an odd number of components to make up the (usual) raid1 portion. See the wikipedia article and 'man md' for more details.
Some notes about the hardware and setup:


  • the kernel is 2.6.22.12, openSUSE "default" kernel
  • the CPU is an AMD x86-64, x2 3600+ in power-saving mode (1000 MHz)
  • the motherboard is an EPoX MF570SLI which uses the nVidia MCP55 SATA controller (PCIe)
  • the drives are 3 different makes of SATA II, 7200 rpm
  • each drive is capable of not more than 75 MiB/s (at best - the outermost tracks) and closer to 70 MiB/s on the portions of the disk involved in these tests
  • each drive is partitioned, identically, into 4 partitions. This test involves the third partition, 4 GiB in size, which is 2 GiB from the start.
  • the system was largely idle but does other things
  • the system has 1 GiB of RAM
  • the raid was created with:
    mdadm --create /dev/md2 --level=${level} --raid-devices=3 --spare-devices=0 --layout=${format} --chunk=256 --assume-clean --metadata=1.0 ${DEVICES}
  • the deadline I/O scheduler was used on each component drive
  • the stripe cache sizes, queue sizes, and flusher parameters were left at their defaults
  • the caches were dropped before each invocation of 'dd':
    echo 3 > /proc/sys/vm/drop_caches
  • the 'write' portion consisted of this dd invocation:
    dd if=/dev/zero of=/dev/md2 bs=256K count=15000 conv=fdatasync
  • the 'read' portion consisted of this dd invocation:
    dd if=/dev/md2 of=/dev/null bs=256K count=15000
  • I did not test any chunk size other than 256K but probably will in the future.
  • I can supply the entire test script if necessary (I intend to do so in the future, after some additional refinement); a rough sketch of the test loop follows this list.
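
In the meantime, here is a minimal sketch of what such a test loop might look like. The device names and the level/layout list are placeholders, and the 'mdadm --fail' step is simply one way to force the array into degraded mode (not necessarily how it was done for the numbers below):

  #!/bin/bash
  # Minimal sketch of the test loop; placeholder device names, not the exact script.
  DEVICES="/dev/sda3 /dev/sdb3 /dev/sdc3"

  run_tests() {
      echo 3 > /proc/sys/vm/drop_caches
      dd if=/dev/zero of=/dev/md2 bs=256K count=15000 conv=fdatasync   # write test
      echo 3 > /proc/sys/vm/drop_caches
      dd if=/dev/md2 of=/dev/null bs=256K count=15000                  # read test
  }

  # level:layout pairs; extend with the other raid5 layouts and raid10 n2/o2.
  # raid0 would be tested the same way, minus the --fail step.
  for combo in raid5:left-symmetric raid10:f2; do
      level=${combo%%:*}
      format=${combo#*:}
      # --run is added here only to skip the "appears to contain an existing
      # array" prompt when re-creating /dev/md2 on each pass
      mdadm --create /dev/md2 --run --level=${level} --raid-devices=3 \
            --spare-devices=0 --layout=${format} --chunk=256 \
            --assume-clean --metadata=1.0 ${DEVICES}
      run_tests                          # non-degraded numbers
      mdadm /dev/md2 --fail /dev/sdc3    # one way to force degraded mode
      run_tests                          # degraded numbers
      mdadm --stop /dev/md2
  done
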
It is also worthwhile to note that these tests exercise the block layer, not the filesystem layer, and are not intended as a my-raid-is-faster-than-your-raid comparison but rather as a (brief) exploration of raid10 on Linux md. I chose to compare it against raid0 and raid5 as these are common and more likely to be well understood.
The Results:
The results are all in MiB/s.
NOTE: As of 2007-Dec-30 I have updated the table (but not the chart, yet) with more representative numbers. I removed some run-time noise and ran each test 3 times, taking the mean.
level   format            Writing  Reading  Writing (Degraded)  Reading (Degraded)
raid5   left-asymmetric        55      129                  46                 124
raid5   left-symmetric         54      123                  50                 122
raid5   right-asymmetric       54      124                  49                 124
raid5   right-symmetric        54      128                  49                 116
raid10  n2                    103       95                 103                 104
raid10  o2                    102       94                 100                 102
raid10  f2                     97      162                  97                  51
raid0   -                     205      186                 n/a                 n/a

An alternate view of the same data (thanks to suggestions received on the Linux-RAID mailing list), expressing each result as a multiple of 'x', where 'x' is 70 MiB/s - the speed of one component:

level   format            Writing  Reading  Writing (Degraded)  Reading (Degraded)
raid5   left-asymmetric       0.8      1.8                 0.7                 1.8
raid5   left-symmetric        0.8      1.8                 0.7                 1.7
raid5   right-asymmetric      0.8      1.8                 0.7                 1.8
raid5   right-symmetric       0.8      1.8                 0.7                 1.7
raid10  n2                    1.5      1.4                 1.4                 1.4
raid10  o2                    1.5      1.4                 1.4                 1.4
raid10  f2                    1.4      2.3                 1.4                 0.7
raid0   -                     3.0      2.8                 n/a                 n/a
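
A quick check of the arithmetic, using the raid10,f2 non-degraded read result as an example (any shell with bc will do):

  # each normalized entry is the measured MiB/s divided by the assumed
  # 70 MiB/s single-component speed
  echo "scale=1; 162 / 70" | bc    # prints 2.3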

Chart: (image omitted)
What do these numbers tell us?

(Of course, these observations only apply for 3 drives in this configuration. Caveat, handwaving.)
  1. Degraded read speed on raid5 is 90% or more of non-degraded. That's pretty good.
  2. Degraded writing on raid5 is virtually indistinguishable from non-degraded.
  3. raid10 near and offset performance, reading or writing, degraded or not, is very consistent.
  4. raid10 far layout has awesome read performance (non-degraded) - I'd ballpark it near raid0 performance. Degraded, however, shows much worse performance. Why?
  5. raid10 far layout has no discernible performance difference when writing in degraded mode.


Footnotes:
  1. Whenever possible, always use conv=fdatasync for tests like this. I'll explain why in a future installment; a quick illustration follows these footnotes.
  2. I did not use iflag=direct (which sets O_DIRECT).
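
The short version, ahead of that future installment: without conv=fdatasync, dd can exit - and report its transfer rate - while some of the data is still sitting in the page cache; with conv=fdatasync, dd flushes the output first, so that time is included in the reported rate. A quick way to see the difference on a scratch device (both commands overwrite /dev/md2):

  # the first dd may report an optimistic rate (data still in the page cache);
  # the second waits for the data to reach the array before reporting
  dd if=/dev/zero of=/dev/md2 bs=256K count=15000
  dd if=/dev/zero of=/dev/md2 bs=256K count=15000 conv=fdatasync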







3 comments:

natmaka said...

Suggestion: benchmark random I/O (64k here, 64 there...), not only contiguous read/writes done on contiguous blocks.

RAID10 reads each fraction of the needed blocks from a different spindle. When you have a single mirror and read B blocks, it reads B/2 blocks from one drive and the rest from the other. In degraded mode there is a missing spindle, hence the read performance loss.

Jon said...

That's true. Some recent benchmarks on the linux-raid mailing list have shown that raid10,f2 outperforms even raid0 when it comes to random I/O (writing), in some cases by quite a bit.

There is also a patch that is being explored that would change the raid10,f2 algorithm to *always* use the outer tracks instead of just the "nearest" tracks of a given disk, but since each disk is constantly switching back and forth between the inner and outer tracks (as each disk is a mirror for some /other/ disk) I wonder if it will help at all.

David James Spillett said...

"raid10 far layout has awesome read performance (non-degraded) ... Degraded, however, shows much worse performance. Why?"

Because of the way the blocks are spread over the remaining drives, sequential access requires much more head movement than normal operation.

In the worst case (a two drive RAID10 far set with one drive dead) sequential reads result in a half-drive head movement every block (block 1 is at the start of the drive, block 2 in the middle, block 3 is start+1, block 4 is middle+1, 5=start+2, 6=middle+2, ...).

This probably means that rebuild times will be much higher for "far" arrays than "near" ones too.

This won't affect SSDs in the same way as their non-sequential access time is as close to 0 as makes no odds, because there is no physical head to move and they don't have to wait for the disc to spin into the right position either.