Friday, December 6, 2013

S/W RAID6 Performance Influenced by stripe_cache_size

A while back, I set up a 4-disk S/W RAID6 array (using mdadm, of course) on one of my openSUSE 12.3 machines (all machines are now openSUSE 13.1). I used four 320GB SATA drives that I had been using for backup. The performance was OK. A few days ago, I ran across another 320GB drive I had forgotten about, and decided to add it to the array, reshaping it to use five (5) drives. With mdadm, this is so easy it's scandalous.

Each of the drives is set up with three (3) partitions:
  • 1GB for /boot (in MD RAID1)
  • 512MB for swap
  • the remainder as a component of MD RAID6.

First, I used sfdisk to copy the partitioning from one device to another:

sfdisk --dump /dev/sda | sfdisk /dev/sde

Ignoring the first two partitions of /dev/sde for now, turning an N-device raid6 into an (N+1)-device raid6 consists of just two steps:

1. add the new device to the array as a spare
2. grow the array

First, you add the new device as a spare:

mdadm --add /dev/md1 /dev/sde3

Then you tell the raid to reshape itself. I was happy with the chunk size (512KB), so I left that alone. NOTE: the following command is wrong and I'll explain why in a moment.

mdadm --grow /dev/md1 --raid-devices=6 --backup-file=/boot/backup-md1

The command grumped something about too few devices, which I ignored because... I did.
So I re-issued the command with --force:

mdadm --grow /dev/md1 --raid-devices=6 --force --backup-file=/boot/backup-md1

at which point the kernel began the reshape.

Now, I did a bad thing, there. I told the four (4) device raid to grow to six (6) devices, even though only five (5) are available. Which it did, over the course of several hours. Shortly after issuing the command, I realized my mistake. Fortunately, md is super awesome so it dealt with it anyway.

When the reshape was complete, I issued a corrected command to reshape, this time to the proper number of devices (5).

mdadm --grow /dev/md1 --raid-devices=5 --backup-file=/boot/backup-md1

When the (second) reshape was complete, I resized (enlarged) the filesystem with resize2fs and ran some benchmarks. Performance was improved, but then I remembered stripe_cache_size, which essentially buffers the read-modify-write cycle of parity-based RAIDs. Initially, I had just set it to the largest value it would accept (32768) and left it at that, but then I read a posting to the linux-raid mailing list which suggested that such a strategy was non-optimal.

Intrigued, I wrote a small script to test each value in 512, 1024, 2048, 4096, 8192, 16384, and 32768, and then graphed the results. They were a bit surprising: 4096 was clearly the best performer in terms of both read and write.
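One plausible explanation for 32768 losing to 4096 is memory pressure. The stripe cache is commonly described as costing roughly entries × page size × member devices of RAM; that rule of thumb is my assumption, not something I measured or took from the kernel source, but the rough arithmetic is suggestive:

```shell
# approximate stripe-cache RAM cost: entries * 4 KiB page * 5 member devices
# (the entries-times-page-times-devices formula is a rule of thumb, not
# something verified against the md driver)
echo "$((32768 * 4 * 5)) KiB at the maximum setting"   # 655360 KiB, ~640 MiB
echo "$((4096 * 4 * 5)) KiB at the best setting"       # 81920 KiB, 80 MiB
```

Hundreds of megabytes tied up in the stripe cache is memory that can't be used for the page cache, which may be why "crank it to the max" is non-optimal.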

The whole test script (with paths edited, slightly) is below:

#! /bin/sh
set -ex
for STRIPE_CACHE_SIZE in 512 1024 2048 4096 8192 16384 32768; do
  echo ${STRIPE_CACHE_SIZE} > /sys/block/md1/md/stripe_cache_size

  # write and read a 4G file
  dd if=/dev/zero of=/some/path/to/some/file bs=4096 count=1048576 conv=fdatasync
  dd if=/some/path/to/some/file of=/dev/null bs=4096
  echo 3 > /proc/sys/vm/drop_caches
done

I then parsed and charted the results (using the pretty awesome Veusz); the chart is below. Red is write and blue is read. I wish I knew how to use Veusz better, but I'm learning.
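For the curious, the parsing step amounts to scraping dd's summary lines. A minimal sketch, assuming the runs were captured from dd's stderr in GNU dd's usual summary format (the sample line below is illustrative, not output from my actual runs):

```shell
# pull the throughput figure out of a GNU dd summary line; the sample line
# here stands in for lines captured from dd's stderr during the benchmark
printf '%s\n' '4294967296 bytes (4.3 GB) copied, 37.0968 s, 116 MB/s' |
  awk -F', ' '/copied/ { print $NF }'
# prints: 116 MB/s
```

From there it's a short hop to a CSV of (stripe_cache_size, MB/s) pairs that Veusz can import directly.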

Sunday, February 20, 2011

Software for CUPS, Host headers, NAT'd access and workarounds

Here is what I promised earlier. It's got all sorts of caveats written all over it, but it seems to work.
Comments welcome!

This software is totally freeware and I'm not responsible for anything you do or don't do with it, any problems it causes or solutions it brings, etc...

#! /usr/bin/python
import sys
import os
import getopt
import BaseHTTPServer
import httplib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
  def do_GET(self):
    conn = httplib.HTTPConnection("localhost:631")
    conn._http_vsn_str = 'HTTP/1.0'
    conn.putrequest(self.command, self.path)
    for h in self.headers.keys():
      if h.lower() == 'connection':
        v = 'close'
      elif h.lower() == 'host':
        # strip the Host header; httplib supplies a local one for localhost:631
        continue
      else:
        v = self.headers[h]
      conn.putheader(h, v)
    conn.endheaders()
    # copy payload, only for POST
    if self.command == 'POST':
      content_length = int(self.headers['content-length'])
      remaining = content_length
      while remaining > 0:
        data = self.rfile.read(min(4096, remaining))
        if not data:
          break
        conn.send(data)
        remaining -= len(data)
    resp = conn.getresponse()
    if resp.version == 11:
      http_version = 'HTTP/1.1'
    elif resp.version == 10:
      http_version = 'HTTP/1.0'
    else:
      raise ValueError
    self.wfile.write( http_version + ' ' + str(resp.status) + ' ' + resp.reason + '\n' )
    self.wfile.write( str(resp.msg) )
    self.wfile.write( '\n' )
    # relay the response body back to the client
    datalen = 0
    while True:
      data = resp.read(4096)
      if not data:
        break
      self.wfile.write(data)
      datalen += len(data)
    if datalen:
      self.log_request(resp.status, datalen)

  def do_POST(self):
    return self.do_GET()

def run(server_class=BaseHTTPServer.HTTPServer, handler_class=Handler):
  (optlist, args) = getopt.getopt( sys.argv[1:], 'dp:', [ 'no-daemon', 'port=' ] )
  daemonize = True
  port = 6310
  for (o,arg) in optlist:
    if o in [ '-d', '--no-daemon' ]:
      daemonize = False
    elif o in [ '-p', '--port' ]:
      port = int(arg)

  # FIXME: add parsing for listen addresses and ports, etc....
  if daemonize:
    pid = os.fork()
    if pid != 0:
      os._exit(0)       # first parent exits
    os.setsid()
    pid2 = os.fork()
    if pid2 != 0:
      os._exit(0)       # first child exits; second child is the daemon

  server_address = ('', port )
  httpd = server_class(server_address, handler_class)
  httpd.serve_forever()

def main():
  run()

if __name__ == '__main__':
  main()

Sunday, December 26, 2010

CUPS, Host headers, NAT'd access and workarounds

A friend of mine has a Windows XP install which runs in a VirtualBox environment hosted by his openSUSE install. It works well for the two or three times a year it gets use. Recently, he needed to have the ability to print from the virtualized environment to his home printer. Since the printer was already set up on the host in CUPS, I thought that this would be a pretty trivial change to make. I was wrong.
The virtualized environment uses NAT for network access. This works really well. The environment is set up with an 'internal' DHCP server using VirtualBox's default NAT subnet (I might be wrong on the mask, but it's not relevant here). The host is reachable at a fixed address on that subnet, and the guest gets one of the other addresses. This way, the guest can access resources on the host using that address, and whatever the host has access to, the guest does too (I'm generalizing and hand-waving a bit here). Since this is really NAT'd, services on the host 'see' the access as coming from localhost, and most services work great.
I started by trying to reach the CUPS web interface on the host from the guest, to see if I could talk to the CUPS server at all. If I could do that, setting up a print driver from there should be trivial. No such luck. Checking the logs in CUPS is usually an exercise in futility, but it's good practice anyway. There wasn't much there except:
E [24/Dec/2010:07:59:01 -0600] Request from "localhost" using invalid Host: field ""
Since CUPS is HTTP-based, what is happening is this: the client makes an HTTP request and supplies a Host header naming the host's NAT address, which is expected. The request (due to NAT) arrives over localhost, and CUPS grumps because the arrival interface (localhost) and the Host header (the NAT address) don't match. Most HTTP servers allow you to configure aliases to which the server will still respond, and a quick google seems to suggest exactly that for CUPS. From what I read, it appeared that the problem could be solved by telling CUPS that the NAT address is OK, with a "ServerAlias" directive naming the address, the address and port, or even "ServerAlias *" (accept everything). However, none of these, alone or in combination, work. More googling suggests that others have run into *similar* issues.
None of those exactly addressed my issue, and the workarounds didn't work or were suboptimal for this situation (changing ServerName, for example).
I suppose I could have kept digging, but it seemed to me that the code was hard-wired to perform certain tests if the request came in over localhost, and I really didn't care to dig into it too much further.
So, what to do? I didn't want to change the networking mode of the virtualized environment -- for this usage scenario NAT is exactly correct, and any other mode has drawbacks. Since CUPS uses HTTP, and the problem is that the Host header isn't matching what CUPS expects, I decided to fix that by writing a bit of code which accepts HTTP requests, strips the Host (and Connection, for good measure) headers, and relays the request to localhost:631. Python makes this trivial, and 92 lines of code later I had a little CUPS proxy. It's dumb, it's got hard-coded bits in it, and it's not very tolerant, but it works and it works well. It listens on localhost:6310 and operates exactly as previously described. I became root and fired up python to give it a try. Then I set up a printer in the virtualized environment, pointing it at the proxy's address and port, and it went perfectly. Problem solved.
I put a bit more polish on the software, whipped up a spec file, an init script, and even a logrotate config file, and a few short seconds after 'rpmbuild -ba' I had an easily-installable RPM that I can install, manage the startup/shutdown of easily, and (most importantly) that solves an obnoxious problem.
Addendum: a quick view of the source confirms my suspicions. See scheduler/client.c, in the 'valid_host' function. If the request arrives over the localhost interface and the Host header is present, it *must* match "localhost" or one of several other localhost equivalents.

Thursday, September 3, 2009

Lost Print Jobs Issue Mysteriously Resolves Itself

Previously, I blogged about a friend's situation where print jobs appeared to go off into the ether. I inquired about the situation recently and he said that it mysteriously resolved itself.


Sunday, August 2, 2009

HP Quality is really slipping

Over the years, I've had a bunch of printers. My first printer (which I still have, and it still works) is an Apple ImageWriter II. A dot matrix printer. My second printer was an HP DeskJet 520. It worked (but was noisy) when I got rid of it a year or so back. I've had a slew of Epson inkjets (clogged heads doomed them all), a bunch of HP Inkjets (various issues), and some others.

For the longest time, HP was the go-to company for quality inkjets. Canon never used to have very good Linux support. Epson had reasonable (but not great) Linux support, but their printer design (permanent heads) doomed every one of them to mediocre or troublesome printing within a few years. HP, however, had excellent, early Linux support, and for some of their printers the print heads are built into the cartridges - this makes fixing a print head issue as easy as replacing the cartridge or cartridges. I'd love to give Kodak a try, but their refusal to provide any sort of Linux support - at all, not even basic drivers - puts them out of the running. I won't even talk about Lexmark's awful inkjet printers.

However, the overall quality of HP printers over the last few years has really left me wondering what happened to this once-great "engineer's" company. A failed LCD panel doomed one printer - rendering it totally worthless for anything but basic printing - and a google search revealed this to be a very common problem on that series. Another printer (DJ6540) I have occasionally loses its mind and needs to be hard reset. This officejet is a higher-end inkjet, certainly, and I expected no problems. By and large that has been the case. However, driver installation on Windows has proven to be a complete and total disaster. I had to reinstall Windows to get the drivers to install - sorta - and their firmware updater absolutely refuses to upgrade a printer unless it is configured in Windows as a printer. This, despite the fact that the firmware updater clearly tries to, and appears to, support upgrading over TCP/IP. Did anybody test this thing? How did this ever get released to consumers? A google search also reveals I'm not alone: driver problems, driver installer failures, and so on. I expected more out of you, HP, and I'm not only disappointed, I guarantee you it's going to influence the advice I give and the buying decisions I make in the future. I will be trying your competitors, but sadly I suspect that in general printing is a race to the bottom.

Failed Diagnosis of Lost Print Jobs

A friend of mine came to me recently and told me that, for some reason, his printer wasn't working any more. I set this friend up with an openSUSE home server for printing, DNS, DHCP, and internet access (via qtsmppd) and by and large it works great. The printing is managed by CUPS, and the printer is an HP OfficeJet Pro L7680. It's a nice printer with Vivera inks. It is fast and generally works great. I have it set up to print over port 9100.

I asked him to bring the machine to me and he did so. When I examined the logs, there was nothing amiss - according to the CUPS logs, jobs were queued, accepted by the printer, and removed when complete -- exactly as expected. I set an identical printer up in exactly the same way, but left it off. I queued a print job and, as expected, CUPS indicated that it could not contact the printer and simply kept retrying the job. When I turned the printer on, it began accepting jobs and the job printed out just fine. No matter what I did, I could not get CUPS to lose the jobs. How do I reconcile this with the earlier report of "I printed a bunch of stuff but nothing ever came out of the printer"? I may need to see how things work in the exact environment the server is in to find out.

Thursday, June 25, 2009

KDE 4.2 and KDE 4.3

It's been a while since I've commented on KDE. Partly, it's because I've been insanely busy with life and work, but I'm pleased to report that somewhere around 4.2.2 KDE started getting pretty usable. I ran 4.2.4 for a while before upgrading to one of the 4.3 betas, and other than one weird issue, it's actually been quite pleasant to use. By and large everything works; there are still some obnoxious issues (like the konsole issues with -e) and so on, but overall it's really coming together and is, dare I say it, pretty attractive. I'm not much for eye candy that doesn't actually make my day easier. It's not like I run fvwm or anything (although I like fvwm and have a nice config which only took me a WEEK to make), but while some advanced eye candy is actually pretty useful (it draws your eye here, or it helps you remember where something went, etc...), most of the eye candy is just that - it's candy. Fun once in a while, but not very good for you. And it rots your teeth.

So KDE 4.3 is faster, a bit leaner, a bunch noisier (have you ever tailed .xsession?), but overall pretty reliable. Few crashes. Bluetooth doesn't work. The network manager integration is just getting to the point where I use it instead of knetworkmanager (which always worked very well, thank you). Plasma does crash now and again, but now when it does it's more of a surprise than a regular occurrence, and so far it's not caused me to lose work. I do have one really weird issue, so weird that I can't even figure out how to file a bug report. When 4.3 RC1 becomes available over the next few days, I hope that fixes it. Am I looking forward to openSUSE 11.2? Oh yeah. I did grab the dvd .iso and play with it in qemu, and it boots fast, and since it'll have KDE 4.3 that's all the more reason to get excited...

hpcups broke my printing

I had a thoroughly impressive rant goin' on here but I just don't have the energy to type it all out. Suffice it to say that hplip-hpcups suffers from some fairly basic forgot-to-catch-the-errorcode problems, and it ate up an hour or more of my life. As a friend says, I'm entertaining when I have my indignation all worked up, but it's better in person.

Wednesday, March 4, 2009

Adobe, Taxes, and Government

My State Government and especially Adobe have annoyed me. For reasons that escape me, the Wisconsin e-File Form-1, Form-1A, and Form-WIZ, all located at the Wisconsin Department of Revenue e-file page here, all require Adobe PDF Reader version 9. I ask: Why?

For those that don't know, I run Linux and only Linux, and Adobe PDF Reader 9 is not available for Linux. Even if it were, I'm not sure I would use it. It's a horrible, bloated, insecure bit of software. It also comes with Adobe AIR "for free". I don't want Adobe AIR; I want to view PDFs. Do I have to install Adobe AIR to view PDFs, too? Adobe PDF Reader 9 and Adobe AIR are not Open Source, and so I am loath to give them any quarter.

Adobe has a decidedly anti-Linux stance, in my opinion - apparently they feel there is no market for a 64-bit version of Flash (for Linux), and the other solutions (like nspluginwrapper), no matter how well-intentioned or implemented, still can't cure the problem of trying to use a 64-bit browser and a 32-bit plugin. I use the 32-bit Firefox just so that flash doesn't crash literally every other time I use it. I still dislike the use of flash in any case, as I have no control whatsoever over how it operates, interacts with my system, and so on. It seems like once a month there is a security problem in Flash!

The PDF format itself is supposed to be reasonably open, but if that's true then why can't I use okular, kpdf, evince, or xpdf to view and use the PDFs as provided by my own state government?

There are several issues here. The first is that my state government is requiring the use of proprietary software to use e-File when I see no reason for that requirement. Open formats and USABLE documents should be a legal requirement for governments! The OpenDocument standard should be used, and when something like PDF appears necessary, at least use a version that works on more than the very latest install of proprietary software.

My guess is that there is a bit of javascript or whatever inside the PDF which tries to identify the viewer and simply bails if it's not Adobe PDF Reader 9. The actual text is:

To view the full contents of this document, you need a later version of the PDF viewer. You can upgrade to the latest version of Adobe Reader from

For further support, go to

In light of these recent Adobe PDF viewing issues, the recent (and seemingly rather serious) Adobe PDF security issues, the recent Adobe Flash issues, and so on, I'm wondering if people aren't going to start flocking to the reputedly vastly better-architected Silverlight by, shudder, Microsoft, a company with its own long and glorious history of supporting open standards. It would appear, however, that Silverlight (for now) is an exception, as there is a reasonably free and open implementation in Moonlight. Given what I've seen so far, I'm starting to hope that Moonlight puts the hurt on Flash, or that Adobe truly opens the Flash *format* so that truly open and free implementations can begin to flourish.

The most remarkable thing about free software like Linux, BSD, KDE, and others is that once people use it, they stay with it not necessarily because it works better or looks better or is more reliable - I think it comes down to choice. People *like* having choice, and more and more often they are recognizing, at some level, that vendor lock-in (Adobe, Microsoft) grates against them in fundamental if non-obvious ways. People *don't* like being told what to do, what to run, what they can and can't do.

UPDATE: at least part of the issue has to do with Okular (via poppler's) incomplete support for javascript. The Adobe spec for the javascript support is reported to be some 700 pages long. So at least part of the problem lies there. However, does it *really* require Adobe PDF Viewer 9 or would 8 suffice? Javascript appeared in version 7, if I recall.

Sunday, March 1, 2009

Getting suspend-to-RAM working on my Lenovo T61p

I spent a bit of time this weekend trying to get suspend-to-ram (S3) working on my work-issued laptop. It works great. However, there are (as usual) some issues.

The first issue is that I cannot use s2ram when I'm running X, because for whatever reason s2ram with the right options works great outside of X, but when the NVidia driver is in use there is a conflict of sorts (I believe it has to do with POSTing the card after returning from suspend). I have to use the NVidia driver because vesa is limited in geometry and the nv driver badly misdetects my display. I'll note that until openSUSE 11.1 the nv driver didn't work *at all* - just a hung laptop - so this is progress and I'm glad for it!

In any case, I had to force pm-utils not to use s2ram at all by removing /usr/sbin/s2ram. One package or another pulls in suspend (the package containing /usr/sbin/s2ram) as a dependency, and the pm-utils "functions" script checks to see if it exists and is executable, using it if possible. I don't want it to use s2ram, so I simply removed /usr/sbin/s2ram. I filed a bug for a more elegant solution: a config variable to simply specify "use the in-kernel method always".

As an aside, for those that might find this useful, using s2ram without the NVidia driver active works great with: --vbe_mode --vbe_post --acpi_sleep 2

With the NVidia driver active, using the "in-kernel" S3 mode works great:

echo -n "mem" > /sys/power/state

I ran into issues with the above failing if ksysguardd is running (probably related to ksysguardd checking the temperature of my CPUs) - suspending simply fails. I've worked around this by using a script in /etc/pm/sleep.d to SIGSTOP ksysguardd and SIGCONT it upon wakeup; this appears to work, but it hasn't received much testing.
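The hook itself is tiny. Here is a sketch of the idea, assuming pm-utils' convention of passing suspend/resume (or hibernate/thaw) as the hook's first argument; the file name and the use of pkill are my choices, not anything mandated by pm-utils:

```shell
#!/bin/sh
# /etc/pm/sleep.d/50-ksysguardd (hypothetical name): pause ksysguardd across
# suspend so its polling cannot interfere, and let it continue on wakeup
case "$1" in
  suspend|hibernate)
    # stop ksysguardd before the system sleeps
    pkill -STOP ksysguardd 2>/dev/null || true
    ;;
  resume|thaw)
    # let ksysguardd run again after wakeup
    pkill -CONT ksysguardd 2>/dev/null || true
    ;;
esac
```

The `|| true` keeps the hook from failing (and possibly aborting the suspend) when ksysguardd isn't running at all.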

One thing that complicated matters is the lack of documentation on where to configure things and what to put there. After I had determined that s2ram (with the aforementioned options) works fine when X is not running, I tried to make it work in X. pm-utils sometimes pulls its options from s2ram, sometimes from /etc/pm/config.d/*, and other times from HAL. HAL was configured suboptimally for my laptop (a Lenovo T61p), but ultimately I determined that using s2ram wasn't going to work no matter what options I sent along, provided I was using the NVidia driver. Thus, the HAL config for my laptop (/usr/share/hal/fdi/information/10freedesktop/20-video-quirk-pm-lenovo.fdi) didn't get used at all, as far as I can tell.

If you want the HAL config it looks like this, more-or-less replicating the above s2ram config. THIS DOES NOT WORK with NVidia drivers loaded.

<match key="system.hardware.product" prefix_outof="6460">
  <merge key="power_management.quirk.s3_bios" type="bool">false</merge>
  <merge key="power_management.quirk.vbe_post" type="bool">true</merge>
  <merge key="power_management.quirk.s3_mode" type="bool">true</merge>
  <merge key="power_management.quirk.vbemode_restore" type="bool">true</merge>
</match>

The best documentation I found was pm-utils' own, which details (if somewhat confusingly) where each decision and config value originates. Expanding and improving that document, and promoting it as the de-facto source of information, might have saved me a bunch of time.

Next up: getting my laptop's video mode switching to work.

Saturday, February 21, 2009

openSUSE 11.1 - 32bit -> 64bit upgrade not so smooth

I recently upgraded a 64 bit machine that was running a 32 bit version of openSUSE 10.3. I upgraded it to 11.1. Yes, I know this is a jump of two versions. The update didn't entirely hose the machine, but tons of stuff was left as i586 instead of x86_64, including a bunch of pam stuff. Thus, ssh did not work. LOTS of stuff didn't work. Like half of YaST2. Once I was able to get in, a 'zypper dup' sorted out *some* of the problems, but not all of them. As I write this I'm still in the process of un-hosing the machine... UPDATE: it's basically unhosed. kupdateapplet was missing a dependency on kdelibs4 (or perhaps one of its dependencies). Oh well. It's working great now!

KDE 4.2 coming along, still not there

As I mentioned earlier, I gave KDE 4.2 a try and it was a disaster. Since then, the team responsible has been producing new releases almost nightly, and these have improved things considerably. Specifically, konsole, while still not without its issues, is very usable now. The taskbar is usable, and the desktop area is usable. Assuming you turn desktop effects off, things generally work OK - I get 2 or 3 crashes a day. Some of these crashes are pretty annoying. krunner keeps getting hung in futex, which indicates some sort of threading problem. When that happens, alt-f2 (something I use quite frequently) goes belly up (as alt-f2 runs krunner), and for some reason all of dbus starts acting weird, negatively impacting many apps including gtk+ and most of KDE.

One of the big issues I have with konsole is the auto-saving of its window sizes. KDE 3.5 used to allow you to save the dimensions of the window (as well as other changes you've made to the session) in an on-demand fashion. However, KDE 4.2's konsole does this more or less every time you make a change. I open hundreds of konsoles a day, easily, as some of them are very short-lived. I really, really want konsole to open at the same size every single time; I resize many konsole instances to sizes that are totally unsuitable for other purposes. Ugh! It also took me a few days to find out how to disable the "start konsole in the same working directory" behavior when creating new tabs. That's also hugely annoying. I've found a BUNCH of bugs in konsole and won't detail them all here, but konsole is KDE 4's biggest step backwards yet.

The next issue is the seemingly more haphazard way in which the system settings are arranged, compared to KDE 3.5. I sometimes have to look in 2 or 3 different modules to find what I'm looking for, especially when it has to do with themes or visuals or what-not.
Overall, however, KDE 4.2 is coming along and I sincerely hope and believe that these issues will get resolved - I suddenly have hope for the future of KDE again. Good job, folks!

Saturday, January 31, 2009

KDE 4.2

I recently gave KDE 4.2 a try, thanks to the good folks at Novell. The install went fine, and it wasn't nearly as crashy as it used to be - I "only" got one crash of the window manager. By and large it works better, but it's still not nearly as usable as 3.X was for me. There is no way to highlight a program in the main menu and put it on the task bar. Moving things on the task bar as yet eludes me. The "no icons on your desktop" decision is not just dictatorial, it's painful for users, and I see no merit in it whatsoever. Konsole *still* doesn't display fonts worth a damn: when configuring fonts, the "example" display looks great but the *actual* display still looks terrible. At startup, there was a short-lived message about the size of my screen being undetectable, at which point it looked like it tried to start a program but failed, didn't retry, and didn't tell me anything about what went wrong. Nice. The visual effects are nice, but I couldn't care less about them when they cost me time, effort, and energy in getting my job done. fvwm2 is looking better every day. KDE 3.X has worked extremely well for me over the years - solid, consistent, predictable, and most of all *functional*. Maybe KDE 4.3 will start to get good enough for me to use, but I'm getting pretty sick of the dictatorial attitude that KDE has taken on lately - crashy software and a dictatorial attitude are among the biggest reasons I stopped using GNOME. Being "pretty" doesn't mean a damn thing if I can't use the software effectively.

Wednesday, January 7, 2009

openSUSE 11.0 -> 11.1 Upgrade Issues

I recently upgraded a number of machines to openSUSE 11.1 from openSUSE 11.0. By and large the upgrade went perfectly, however, a number of issues cropped up almost immediately. The first was that I could not log in.

I use an encrypted home directory via pam_mount. The pam options for various programs got good and hosed by the upgrade, and openSUSE's response to the bug I (and others) filed was "you should always check your .rpmsave files after an upgrade". Lame. I might be experienced enough to check /etc/pam.d/* for changes but the overwhelming majority of folks aren't. If I had used the (easy) system tools (YaST) to create an encrypted home directory in 10.3 or 11.0 and then had it stop working in 11.1, and been unable to fix it, I'd have been royally cranked. My various attempts at fixing the problem also resulted in a number of bugs filed.

The next problem, and a far more serious one, was repeated filesystem corruption on the same volume. Coincidence? I think not. I use ext3, typically with data=journal, and haven't had filesystem corruption of any kind since about 2000, and it was a rarity even then. Further investigation showed me that while my home directory was being mounted correctly, it wasn't being *un*mounted correctly. Furthermore, subsequent logins re-did the loopback and dm-crypt setup, so I had 3 or 4 or 5 mappings, only the last of which was any good. This resulted in some lost work for me, an hour or two above and beyond what I had lost as part of the investigation. I filed another bug on that and it has yet to receive any attention after almost two weeks. Look, I know the developers are busy, but releasing a new operating system around the Holidays, when lots of folks are going on, taking, or recovering from vacation, doesn't strike me as a great idea. Still, anybody with encrypted home directories mounted via pam_mount might be experiencing the same problems, and THAT doesn't make for good PR - "openSUSE 11.1 hosed my home directory" is not a headline I would like to read, if I worked for Novell.

So I set about to determine if a newer version of pam_mount could help solve my problem, since rolling back to the older version didn't. I built newer versions of libHX and pam_mount (which has a thoroughly scary changelog, although one must remember the great complexity it must deal with) using the openSUSE Build Service (which really is a truly awesome and wonderful thing!). That didn't work (no amount of futzing could cajole it to mount my home directory), so I went ahead with the latest versions of each and sat down for a longer debugging session. Armed with diff and lots of other stuff, I basically determined that the latest version will work just fine - it correctly avoids duplicate mounts and deals better with unmounting and so on and so forth - but the trick is two-fold:

  1. You have to basically remove almost everything from /etc/security/pam_mount.conf.xml except for the <volume> elements and, of course, the outer element.
  2. You also have to fill out options which didn't need to be specified with 0.35 (or 0.47) because the values were the same as the defaults, but the defaults have changed. Those options are:
    • fskeycipher - use aes-256-cbc here as it's the (former?) default
    • fskeyhash - use md5 here as it is the former default
Again, make *sure* to remove the other elements as the defaults seem to work and the ones that are present (again, from an older version) do not.
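Put together, a stripped-down /etc/security/pam_mount.conf.xml along those lines might look something like this. Treat it as a sketch: the volume attributes shown are placeholders for my real paths, and I haven't verified this exact file against every pam_mount version.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<pam_mount>
  <!-- everything except the volumes removed, per the advice above;
       user, path, mountpoint, and fskeypath are placeholder values -->
  <volume user="someuser" path="/home/someuser.img" mountpoint="/home/someuser"
          fskeycipher="aes-256-cbc" fskeyhash="md5"
          fskeypath="/home/someuser.key" />
</pam_mount>
```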

I hope this post serves three purposes:

  1. To help some other poor schmuck out of the near-disaster that was *my* upgrade experience
  2. Hopefully to cajole Novell into giving a *bit* more thought into how others might react to losing access to their $HOME
  3. To rant (this is the biggest one).

Some updates:

The bugs I filed did get some attention! The verdict: from a clean 11.0 upgraded to 11.1, the bugs do not appear. I went so far as to reinstall the stock pam_mount version, try stuff, and go back and forth. No such luck. However, since my workstation is now working again, at least I'm not so cranky. Someday, if I get some ambition, I'll try to find out what is going wrong here, but I got some help from Novell and that's more than I deserved!

Thursday, January 1, 2009

KDE4 > KDE3? Not so much.

A recent upgrade to openSUSE 11.1 afforded me the opportunity to give KDE4 a try again. I'm a long-time fan of KDE3 and was hoping for pleasant surprises. I got half my wish. KDE4 may be a bit prettier in places (but not the fonts), but it's neither as consistent nor as usable as KDE3 was. It's certainly not as stable or as fast (even with desktop effects turned off), and there have been a bunch of non-obvious changes. To me, it feels like a couple of steps backwards in experience and utility.

In particular, konsole is a BIG step backwards - if you open up a konsole with a given command (like ssh SOMEHOST), then new tabs will *also* execute that command (something KDE3's konsole did not do - quite logically). If you hit the "close tab" button (which I had to root around in preferences to find), konsole hangs and has to be killed. Sweet! Only a fraction of the preferences are available; some things I found quite useful are now gone.

Replacing konqueror with dolphin as the default file browser was not a step forward, either. The ability to right-click on a menu item (in an application menu) and add it to the bar was removed - so how do I add firefox, konsole, and the 2 or 3 other apps I open up 400 times a day? It's not clear to me, and if it's not clear to me then, IMO, it's a step backwards. How do I configure my desktops so I can easily mouse between them by dragging windows or even just moving my cursor across the edges? I don't see it anywhere!

So far, KDE4 has removed features I use, added ones I find annoying, hurt stability and speed, and impeded my workflow. I'm sure there are good reasons for KDE4, especially architectural ones, but so far my experience does not bear out any advantage, and some significant disadvantages.

You may say to yourself, "you just have to know where to look" but if I wanted that sort of opaqueness I'd run GNOME, or at least older versions of GNOME - maybe they've improved some. KDE3 was a very *usable*, *stable* desktop environment and so far KDE4 is worse in nearly every aspect I care about. It's too bad because I know a great deal of work has gone into it, but I'm really beginning to wonder if that work has been with the right goals and priority management in mind.

Saturday, December 13, 2008

NFSv4 on openSUSE 11.0

For the past decade or so I've been using NFS inside my home network. NFS is one of those things that has a low-enough barrier to entry that it's really easy to get going, but it's opaque and esoteric enough that when things go wrong it can be a profoundly unreal experience. NFS has many weaknesses, however, among them security (there isn't any), performance (it's not as good as it should be, even if it's not BAD), complexity (lots of ports, portmapper, the choice of UDP/TCP, and so on), and something called "cache coherency", which is essentially a tradeoff between performance and correct behavior. NFSv4 goes a long, long way toward fixing most of these problems. However, it brings some new behaviors and requirements which cause some headache.

First, the good: NFSv3 operates over what seems like a half-dozen or more ports. Those ports are frequently dynamically assigned, which means that you get to involve portmapper. Portmapper's job is basically to say "nfs is on X, lockd is on Y, and statd is on Z" and so on. It also complicates firewall management, and can cause all manner of headaches. NFSv4 operates over ONE PORT - 2049. Excepting gss (security: authentication and authorization), which has its own ports, and id mapping, which may operate over LDAP or NIS or whatever, the core NFSv4 protocol operates entirely over one port and uses TCP. That's awesome!

More good: NFSv4 has better cache coherency, locking, a better transport, and so on.

The bad: the exports file format has changed very slightly, and *how* filesystems are exported has changed a bunch. NFSv4 exports a single "root" filesystem under which all others may or may not show up. This root filesystem is identified in /etc/exports with "fsid=0" (or fsid=root). To save time, let's assume that you are going to place this root on your server at /exports. On your server, "mkdir /exports". Despite the documentation suggesting that symlinks could be used to expose filesystems that are not rooted in /exports, that doesn't actually work. On Linux, you really only have three choices:

  1. Move/Copy the contents to the exported directory
  2. Use bind mounts (mkdir -p /exports/pictures-of-cheese && mount -o bind /pictures-of-cheese /exports/pictures-of-cheese)
  3. Mount the filesystem directly under the export root: mount /dev/sdb1 /exports/the-hoff
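For option 2, the bind mount can be made persistent across reboots with an /etc/fstab entry along these lines (paths from the example above):

```
# bind /pictures-of-cheese into the NFSv4 pseudo-root at boot
/pictures-of-cheese  /exports/pictures-of-cheese  none  bind  0 0
```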

Once you've done this, you can fiddle your /etc/exports file. Mine looks like this:




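Something along these lines, anyway - the client network and the second export here are illustrative, not my exact file:

```
# the NFSv4 pseudo-root: fsid=0 marks it as the root export
/exports                     192.168.1.0/24(rw,sync,no_subtree_check,fsid=0)
# exports below the root; 'nohide' makes the nested mount visible
/exports/pictures-of-cheese  192.168.1.0/24(rw,sync,no_subtree_check,nohide)
```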
Then enable NFSv4 support by setting NFS4_SUPPORT="yes" in /etc/sysconfig/nfs, and restart portmapper and NFS.

Setting up your client to use NFSv4 is done the same way, by editing /etc/sysconfig/nfs. Do so and restart portmapper and nfs.

Mount the filesystem on the client:

mkdir /nfs
mount -t nfs4 server:/ /nfs

If that doesn't work, I can't help you. Note that unlike NFSv3 we are mounting the *root* filesystem. If you are used to mounting /the-hoff (hah), just use a symlink to point into /nfs/the-hoff. Make sure it shows up before you continue.

Now the fun begins. id mapping. NFSv3 using AUTH_SYS (what almost everybody was using) worked like this: ids were used to identify users and groups. That's basically it. A user "bob" with id 1000 on the client would map 100% to user "sally" on the server if sally had id 1000. Names were irrelevant.

With NFSv4 that's no longer the case. Names are important. And to make things awesome, you can't use the old behavior: if a name doesn't match, it's automatically mapped to the "guest" user or group, the value of which is set in /etc/idmapd.conf. Let me tell you, on machines which have slightly different /etc/passwd files, that's a fun one. It's an argument for LDAP or some other centralized directory, but it's still a pain.

So now I've got NFSv4 up and running. Why doesn't anything work? The default debug level of idmapd.conf is 0. I set it to 999 and got just a bit more noise in the logs, helping me to figure all of this out.
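For reference, the verbosity knob lives in /etc/idmapd.conf; a sketch, with illustrative Domain and guest mappings:

```
[General]
Verbosity = 999
Domain = localdomain

[Mapping]
Nobody-User = nobody
Nobody-Group = nobody
```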

Otherwise it seems to be working OK, but not great. Still, it's working better than CIFS: while testing rdiff-backup (to a CIFS mount) last night I got the kernel to oops 4 times, and 4 reboots in 15 minutes does not make a stable filesystem. I tried 2.6.27.something and it worked much better, but given the long-standing locking issues with CIFS I'm not about to switch to it. Don't believe me? Google for 'cifs' and 'sqlite'. Remember, everybody and their brother is now using sqlite, firefox and xbmc being two examples I can think of right away.

And now the post is done.

Monday, December 8, 2008

NBD performance enhancements

As I blogged about earlier, I've been making use of NBD lately as my block-device-over-network exporter of choice (previously AoE). I got pretty reasonable performance out of NBD, easily in the 10s of MB/s - basically the limit of my iffy network hardware, however I wanted to know if there was more to be found. And there is! nbd-server, for one, uses the following basic paradigm:
  1. read data from network
  2. process that data into instructions
  3. do the instructions say to read data?
    If so, lseek to the right place on disk, read data into a buffer, copy a reply (header) plus the data into another buffer, and send that whole buffer back to the client.
  4. do the instructions say to write data?
    If so, read more data from the network into a buffer, lseek to the right place on disk, and write it. Construct a reply and send that back.
  5. Other instructions processed here....
Right away I saw one place for improvement: replace lseek+read with pread, and lseek+write with pwrite. I did so and immediately saw problems. I eventually tracked them down to most of nbd-server using the 64-bit (largefile) version of the UNIX I/O API - except for pread+pwrite. The fix requires #defining _XOPEN_SOURCE 500 (or any of the others which define the same), so I made the requisite changes in the automake (ewwwww) system and things worked fine after that, and some 10-20% faster.

I considered using TCP_CORK + sendfile, but I'd have to see how the client might handle a partial reply - will it re-read until it gets a complete reply? There *is* a danger here in that the server might say "hey, cool - what follows is 128K worth of data" in response to a read for 128K, only to respond with *less* than 128K worth of data due to some error. That cannot happen in the current scenario, as the data has been successfully read by the time the response is even generated. Perhaps using writev might avoid some buffer copies, however.

One thing I'd like to see much more of is better use of udev/hal/dbus - I'd rather see the *option* for a dynamic /dev/nbdX (and /dev/mdX) device to be created when one is brought up, and to have that exposed via dbus. This way, I could write udev rules to handle things like "a new block device, /dev/nbd0, has appeared" or "a block device, /dev/nbd0, would like to go (or has gone) away" and so on.

Other than performance changes, I "fixed" NBD's largefile support and added a TCP keepalive option to the client (handy for detecting when a server has gone away in an unusual fashion). I suppose I ought to add the same TCP keepalive support to the server, too.

One thing I really like about NBD is its simplicity. The code is very simple and relatively clean, and it has very few dependencies (pretty much just glib 2.0 for config file parsing). It's also quite snappy and can take advantage of all of the great work that has gone into TCP congestion avoidance. Perhaps porting NBD to make use of SCTP might not be a bad idea.

Network Block Device + MD RAID1 = Fun

For the last few years I've been making use of drbd to provide a sort of semi-connected network raid1 as part of my overall backup and disaster recovery system. Recently, I've been experimenting with using nbd (network block device) and Linux MD raid1 (with bitmaps) to provide similar functionality, and have some interesting findings as a result.

Essentially, drbd takes some sort of storage (any seekable file) on this machine and mirrors it over the network to another machine. drbd is primarily found in HA (high-availability) environments. I've been using drbd like this: I took a pair of 80G drives and placed them in two machines: the first machine is my local fileserver, and the second machine is a workstation that is booted occasionally (about once a day). I configured drbd to use each of these drives as a mirror of the other, and set the server as primary. Then I formatted the newly-available block device (/dev/drbd0 in my case) with ext3, mounted it, and used it for rsync+hardlink-style backups (now I'm using rdiff-backup).

Whenever the workstation would come up, drbd would take note and only synchronize the blocks of the underlying storage device that had changed. This was very fast, easily saturating my 100mbit network, and under non-jumbo-framed (standard 1500 byte) gig-e would sustain north of 15-20 MiB/s. Not bad. However, I had a few problems with drbd:
  1. Performance: the overhead for keeping track of what has changed and what hasn't was not designed for this usage scenario and is, apparently, hugely expensive. I do not have any hard numbers (unlike me) but I'd eyeball it in the 20-30% range. That's quite a bit. Sometimes it felt much worse than that!
  2. Reliability: drbd is designed for use in an HA environment. However, I encountered numerous kernel crashes and other weirdnesses when it was put under heavy I/O. At one point, I had to *reboot* my file server 3 times in one day; normally the only time I reboot it is for a new kernel.
So I sat back and thought, "Why not use AoE or NBD (network block device) and combine it with raid1 and bitmaps?" For starters, drbd does much more than just mirroring. It has an idea as to which mirror is the master, it can switch back and forth, it has automatic reconnect, rebuild, and so on. More than a little bit of "glue" would have to be written for me to replicate even most of the functionality that drbd provides. But I tried anyway, and largely succeeded for my needs.

I use freedt, a GPL'd daemontools replacement, so I wrote a run file which uses nbd-client to check whether the server is up. If it is, it connects and enters a loop within which it performs a number of health checks (imperfectly); if it detects that the server has gone away, it shuts the block device back down. It also has hooks for up, pre_down and down, which I use to interface with Linux MD.

What I have is a largely autonomous system which, typically within 5 seconds, will note that the server is up, connect to it, and add the block device to the correct raid array (if any), taking it back out should the server disappear. The raid array is built using raid1 and internal bitmaps, which means it takes under a minute for me to synchronize 400-600MB worth of changes. As I refine the scripts, I may post them here if anybody has any interest. I'm not looking to replace drbd - quite honestly it worked great for me for a long time - but this works and it was fun.
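The basic plumbing can be sketched with a few commands - a rough sketch assuming the classic nbd-server/nbd-client command-line syntax; the device names, port, and hostname are illustrative, and the health-check loop and freedt hooks are left out:

```shell
# On the workstation: export the backup partition over NBD on port 2000
nbd-server 2000 /dev/sdb1

# On the fileserver, once: create the raid1 with an internal write-intent
# bitmap, with the second mirror absent ("missing") for now
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal /dev/sda5 missing

# When the workstation appears: connect and (re-)add the remote mirror;
# the bitmap limits the resync to blocks that changed while it was away
nbd-client workstation 2000 /dev/nbd0
mdadm /dev/md0 --re-add /dev/nbd0

# Before it disappears: drop the mirror and tear down the connection
mdadm /dev/md0 --fail /dev/nbd0 --remove /dev/nbd0
nbd-client -d /dev/nbd0
```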

Monday, October 6, 2008

More Notes on Acer 7720 Wireless

Today I helped some friends update their Acer 7720 laptop. It's running openSUSE 10.3 and one of the updates is a new kernel version, probably the riskiest of the periodic updates. Unfortunately, upon reboot, the wireless was shot. I spent an hour with their laptop until I hit upon just the right google search, which said something about the ipw drivers. Now, I had already removed all of the ipw* stuff, as I remembered the laptop works with the iwl series, not the ipw series. I *thought* I had removed all of it, but I had forgotten the most important bit: the firmware. Once that was removed and the machine rebooted, everything worked perfectly. Yay!


Lessons learned:

  1. never update the kernel when you have to be able to use the machine shortly afterwards, especially if you are using wireless or something equally finicky
  2. be thorough with the axe - if I had removed all of the right stuff, right away, the "issue" would have taken 5 minutes to resolve
  3. remove EVERYTHING ipw*


Thursday, September 4, 2008

Why Does Printing on UNIX Suck So Much?

It's 2008 and printing on Linux (UNIX) still sucks. What follows is a serious grump about printing on UNIX, CUPS in particular, and why photo printing at home is still insane.

CUPS has brought UNIX printing into the 21st Century but it's still cumbersome, opaque, and buggy. Only the very latest not-even-released-yet code has a USB driver that doesn't gack all over the floor most of the time. Have you ever straced the usb printing driver? My CPU was spun all the way up because the usb driver, apparently, doesn't know that when read(2) returns 0 (zero), it means (for streams) "don't bother trying anymore, there is no more data here. Ever." It's been doing that for years.

It seems as though, no matter what program I use - gwenview, gthumb, the gimp, or anything else - I can't quite get what I want. Try printing a 4x6 without tearing out all of your hair. I have yet to get a single 4x6 to print which wasn't resized to some much-smaller-than-4x6 size, or didn't have something else really wrong with it.

How about the fact that every time CUPS is released I have to, typically, completely blow away the old configuration and start over? Upgrading from each previous version of CUPS has been painful to say the least.

I've been doing this for almost 15 years and I've seen Linux (and BSD) grow from "usable by hardcore geeks" to "usable by regular folks", but I don't understand why it needs to be this hard. I don't even have the vocabulary to properly describe just how frustrating it is having to deal with printing on UNIX. Every time it's the same old problems and some new ones, too, just for variety. I can honestly say I've probably spent more time fighting printing than any other single task (other than "using" the computer). The level of opaqueness that CUPS in particular exhibits still baffles me. If I'm having problems and I strace various processes to see what they are doing, I'll see error messages and informational messages, but no level of logging ever delivers them to a log file. They're being /dev/null'd or whatever. Who thought that was a good idea?

I've never gotten firefox to print the header and footer correctly. Some part of some document everywhere is cut off. Same goes for Open Office. Manually fiddling with your margins is something I might have expected to do in 1995, not 2008. CUPS has all this great info on the printer, thanks to the PPD - what its actual printable area is, 4000 options, you name it. Why, then, don't some of the most basic things ever seem to work? One thing that really works well is the Printout Mode stuff - whoever thought that up was definitely smarter than your average bear.

I've given up entirely on Epson after no less than 3 different inkjets with clogged heads, never to print again, despite repeated attempts at repair. Canon, glorious though their printouts may be (no personal experience), aren't known for being exactly Linux friendly. Perhaps that's changed in the last year. Every time I look at a printer (for myself or much more likely for someone else) I check out the really awesome and, if it's not supported, I pass. Period. Kodak, despite all of the noise lately, appear to be completely useless for those of us that don't run Windows. HP, while Linux friendly, has come up with this hplip thing that mostly kinda-sorta works, but every now and again goes crazy and has to be killed off. I'll admit to having the best luck with HP printers: they last, and the text printing is really fantastic. I still have an old HP laser with about 1 jillion (metric) copies on it and it still works great. I can turn it off, leave it for a year, and come back and it still works. 10 points for laser! I have no personal experience with Canon, but friends and family do and they love them. They're all running Windows or own a Mac, too. I've begun to think the entire inkjet thing is a huge scam.

Why even bother printing from home? I've all but given up on printing photos from home when I can use Picasa (online or the wine-ified version) and order printouts from any of a dozen places, for pickup or even delivery, for cheap. Picasa is awesome. Google really gets "just make it easy".

As far as printing photos this way, what is the level of frustration? That depends on the provider - I like using the LifePics provider as they are fairly local and seem to work really well. I tried using Walgreens (they are even closer and have a 1 hour wait) but, so far, they've been down 2 out of 3 tries. Photoworks didn't do a thing for me (I thought the interface wasn't very good), but Kodak Easyshare and Shutterfly are my favorites from a UI perspective. Fast, easy, unobtrusive. Snapfish is another common one. I'll never use WalMart.

Surely some people reading this will go, "Well, you idiot, all you have to do is this and that and edit this and tweak that and don't forget to wave a dead chicken over it (thrice)." and that's fine. Maybe I'm getting old but I'm pretty tired of having to do stuff like that just to get things to work. Modern unices are pretty much plug and play these days - yeah, setting up dual monitors and such can be "fun" but most of the time it just works great. Plug in your new USB keyboard? Hey! A keyboard! And it lights up! Plug in just about any camera, USB storage, and so on and so forth and Linux recognizes it and off you go. Yay! Plug in your printer and be prepared to lose the next hour (or more) of your life. You'll never get it back.

You might also be saying, "Well, stop grumping and do something about it." You bet. I'll get right on that in my copious spare time. I have contributed (and will continue to contribute) to a great many free/open/libre software projects over the years. That doesn't mean I won't try to help in some way - I'll file the appropriate informative, useful bug reports and even supply patches when I'm able.

[1] megapickles, because it's funnier.

Saturday, July 19, 2008

File Recovery / Undelete on ext3

I had the unfortunate situation where I had deleted some files (not many, only 2 or 3 files) and really did not want to re-do the week's worth of work. Google to the rescue. I found ext3grep, printed out the instructions (huge, need condensing for common-case), and built the software. It required a number of fixes to compile, which I'll forward upstream when I get a chance. Did it work? More or less, yes. I got my files back. The time-range searching doesn't seem to work, as I could narrow down the window of deletion to about 15 minutes but the list of recovered files was HUGE, including files I had deleted weeks ago. However, the software did what it was supposed to do, and easily enough. It takes some time, but it did the job. I gave them 10 Euro (about 400 American Pesos).

Thursday, July 10, 2008

RAID5,6 and 10 Benchmarks on

Copyright: Copyright © 2008 Jon Nelson
Date: Jul 2008

This is an expansion of a previous post ( ).

Since that time, I have redeployed using RAID10,f2. The redeployment went very well, but I'm not quite getting the performance I desired. More on that in another post. In the meantime, I slightly enhanced one of my benchmark scripts and decided to give it a go again.

1   Hardware and Setup

  • the kernel is 2.6.25.5, the openSUSE 11.0 "default" kernel for x86-64
  • the CPU is an AMD x86-64 X2 3600+ in power-saving mode (1000 MHz)
  • the motherboard is an EPoX MF570SLI, which uses the nVidia MCP55 SATA controller (PCIe)
  • in contrast to an earlier test, this time there are 4 drives - 4 different makes of SATA II, 7200 rpm drives
  • each drive is capable of not much more than 80 MB/s (at best - the outermost tracks) and, on average, more like 70 MB/s for the portions of the disk involved in these tests
  • the raids are built from 4x 4GB partitions, all from the first 8G of each disk
  • the system was largely idle, but is a live system
  • the system has 1 GB of RAM
  • in contrast to the earlier test, the 'cfq' scheduler was used. I forgot to change it.
  • the stripe sizes, cache and queue sizes, and flusher parameters were left at their defaults

2   Important Notes

  • the caches were dropped before each invocation of 'dd':

    echo 3 > /proc/sys/vm/drop_caches
  • the 'write' portion of the test used conv=fdatasync

  • I did not test filesystem performance. This is just about the edge capabilities of linux RAID in various configurations.

  • I did not use iflag=direct (which sets O_DIRECT)

  • I ran each test 5 times, taking the mean.
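The per-run procedure above can be sketched as a small script. The target path is illustrative - the real runs wrote to the raid devices - and dropping the caches requires root, so it's skipped here when unprivileged:

```shell
#!/bin/sh
# Streaming write/read benchmark in the style described above.
TARGET=/tmp/raidtest.img       # stand-in for e.g. /dev/md0

# Drop the page cache so reads hit the disk (root only)
if [ "$(id -u)" -eq 0 ]; then
    echo 3 > /proc/sys/vm/drop_caches
fi

# Write test: conv=fdatasync makes dd flush before reporting its rate
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fdatasync 2>&1 | tail -n1

# Read test
dd if="$TARGET" of=/dev/null bs=1M 2>&1 | tail -n1
```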

3   Questions

Initially, I just wanted to run a bunch of tests and eyeball the results. It's easy to do that, and draw conclusions from the data. However, it is maybe more useful to ask, "What questions can be answered?" Here are a few questions I came up with, and my answers:

  1. What did you really test?

    Basically I tested streaming read and write performance to a series of raid levels and formats, using different chunk sizes for each.

    I did not want to use any filesystem - it only gets in the way for this kind of test. I wasn't testing the filesystem; I was testing to see how different raid formats, layouts, and chunk sizes make a difference.

    A future installment may include filesystem testing as well, which I find just as important, if not more so; however, it's so much more variable that I'm not really sure much sense can be found in the noise.

  2. Why didn't you include my-favorite-raid-level?

    I only wanted to include raid levels for which there is some redundancy. I could have included raid 1+0 but my test script is not sufficiently smart for that. Perhaps I'll include that in a future installment.

  3. Can I have the source to the test program?

    Sure. I'll try to make it available if somebody asks, but it's really nothing special. Furthermore, it's my intent to refine it a bit to support filesystem testing (via bonnie++ or iozone, preferred) and so on.

  4. When using raid5, does the format matter?

    If you squint your eyes a bit, write performance, regardless of format, was all pretty close. Read performance was more variable, but still did not vary all that much. Chunk size seemed to matter more. Left-symmetric did the best overall, however.

  5. How is the performance graphed versus predicted?

    Left to the reader to comment!

  6. Did you do the readahead settings for the test?

    No. I left them at their defaults.

  7. What are you using to generate this?

    I am using reStructuredText, combined with Pygments.

  8. Which tool do you use to make graphs?

    Google Charts (by way of pygooglechart), a bunch of shell and Python. I used flot previously.

  9. How do the individual drives perform?

The drives are:

<6>ata3.00: ATA-7: Hitachi HDT725032VLA360, V54OA52A, max UDMA/133
<6>ata4.00: ATA-8: SAMSUNG HD321KJ, CP100-10, max UDMA7
<6>ata5.00: ATA-7: ST3320620AS, 3.AAK, max UDMA/133
<6>ata6.00: ATA-8: WDC WD3200AAKS-75VYA0, 12.01B02, max UDMA/133

And their performance (hdparm -t):

Timing buffered disk reads:  218 MB in  3.01 seconds =  72.44 MB/sec

Timing buffered disk reads:  234 MB in  3.00 seconds =  77.92 MB/sec

Timing buffered disk reads:  228 MB in  3.02 seconds =  75.60 MB/sec

Timing buffered disk reads:  234 MB in  3.02 seconds =  77.57 MB/sec
  10. What difference does the scheduler make?

    As can clearly be seen on the RAID5 graphs, the IO scheduler can make a big difference. Using cfq or noop, reads start out almost a full point faster than the others, and writes are 1/2 point faster.

    On the other hand, for RAID6, the scheduler doesn't seem to make much difference at all. At least for streaming reads/writes, which is all I'm testing here.

    For RAID10,n2 and RAID10,o2 the story is the same as for RAID6, but there is some impact (up to 1.0 points!) for RAID10,f2.

  11. What revisions have you made to this document?

    I re-ran the tests to include 2048K chunk sizes, and removed 128K as it wasn't very interesting and it cluttered up the graphs.

    I also re-ran the entire set of tests for the other three schedulers, noop, anticipatory, and deadline.

    I re-did the graphs using the Google Charts API (by way of pygooglechart) instead of using flot. There was nothing wrong with flot, in fact I found the software really nice to use, but some people found the google charts "prettier" and it's somewhat easier for me to use.

4   Unanswered Questions

  1. While I don't have the data in this article, I did originally perform these tests on The results were rather noisier, and in most cases a bit worse.

  2. Why aren't raid10,f2 reads getting closer to 4.0x ?

  3. What's with the strange drop in performance at 512K chunk sizes for RAID10,f2 for the deadline and noop schedulers, only to rise again at 1024K (and then drop at 2048K)?

  4. Why are raid10,o2 reads so AWFUL?

    Neil Brown was kind enough to suggest re-running with a larger chunk size, which I did.

    The read performance did, indeed, improve - up to the 3.0 mark, in fact.

  5. Why do raid6 reads behave the way they do? I would have expected a more linear graph - the raid6 write graph is very smooth.

    From 64 to 256k chunk size, there is little change (in either direction, for reads or writes) but at 512K the reads really improve and continue to do so as the chunk size increases.

  6. What should the theoretical performance of the various raid levels and formats look like?

    For raid10,f2 I would suspect that 4.0 would be perfect (for reads), and for sustained writes something like 1.5.

    I get 1.5 like this:

    the avg. speed of writing a given chunk of data should look like this:

    avg of writing to outer track + writing to inner track -> (70 + 35) / 2.0, (assuming inner track is 1/2 the speed of outer tracks) and theoretically we could write to 2 devices at a time, so... (( 70 + 35 ) / 2.0) * 2.0 / 70.0 = 1.5x.

    In reality, we do a bit better than that, probably due to the fact that I'm not using the whole disk and therefore the speed of the inner tracks of the region I'm actually using is greater than would otherwise be true.
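A quick numeric check of that estimate (70 and 35 MB/s being the assumed outer- and inner-track speeds from above):

```shell
# average of outer+inner track speed, times two parallel writers,
# normalized to one drive's outer-track speed
awk 'BEGIN { printf "%.2f\n", ((70 + 35) / 2) * 2 / 70 }'
```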

5   Tables, Charts n Graphs

The following results are expressed relative to a single-drive baseline, with 1.0 being the speed of a single drive (about 70 MB/s).

scheduler level layout chunk writing reading
cfq raid10 f2 64 1.48 3.01
cfq raid10 f2 128 1.49 3.88
cfq raid10 f2 256 1.50 3.68
cfq raid10 f2 512 1.52 3.65
cfq raid10 f2 1024 1.47 3.76
cfq raid10 f2 2048 1.52 3.73
cfq raid10 n2 64 1.78 1.89
cfq raid10 n2 128 1.85 1.87
cfq raid10 n2 256 1.82 2.00
cfq raid10 n2 512 1.84 2.15
cfq raid10 n2 1024 1.83 2.42
cfq raid10 n2 2048 1.83 2.70
cfq raid10 o2 64 1.83 1.96
cfq raid10 o2 128 1.80 1.96
cfq raid10 o2 256 1.84 1.98
cfq raid10 o2 512 1.80 1.98
cfq raid10 o2 1024 1.83 2.49
cfq raid10 o2 2048 1.80 3.13
cfq raid5 left-asymmetric 64 1.72 2.51
cfq raid5 left-asymmetric 128 1.67 2.79
cfq raid5 left-asymmetric 256 1.52 2.92
cfq raid5 left-asymmetric 512 1.31 2.76
cfq raid5 left-asymmetric 1024 1.06 3.44
cfq raid5 left-asymmetric 2048 0.56 3.25
cfq raid5 left-symmetric 64 1.74 2.71
cfq raid5 left-symmetric 128 1.73 2.76
cfq raid5 left-symmetric 256 1.55 2.97
cfq raid5 left-symmetric 512 1.34 2.88
cfq raid5 left-symmetric 1024 1.08 3.44
cfq raid5 left-symmetric 2048 0.58 3.50
cfq raid5 right-asymmetric 64 1.75 2.70
cfq raid5 right-asymmetric 128 1.61 2.88
cfq raid5 right-asymmetric 256 1.58 2.88
cfq raid5 right-asymmetric 512 1.28 2.88
cfq raid5 right-asymmetric 1024 1.04 3.25
cfq raid5 right-asymmetric 2048 0.54 3.31
cfq raid5 right-symmetric 64 1.75 2.79
cfq raid5 right-symmetric 128 1.69 2.81
cfq raid5 right-symmetric 256 1.56 2.88
cfq raid5 right-symmetric 512 1.30 2.75
cfq raid5 right-symmetric 1024 1.01 3.02
cfq raid5 right-symmetric 2048 0.49 3.24
cfq raid6   64 1.30 1.76
cfq raid6   128 1.24 1.96
cfq raid6   256 1.17 1.91
cfq raid6   512 1.04 2.70
cfq raid6   1024 0.87 2.92
cfq raid6   2048 0.60 3.31
deadline raid10 f2 64 1.78 2.63
deadline raid10 f2 256 1.82 3.80
deadline raid10 f2 512 1.72 3.32
deadline raid10 f2 1024 1.75 3.61
deadline raid10 f2 2048 1.47 3.40
deadline raid10 n2 64 1.96 1.21
deadline raid10 n2 256 1.88 1.85
deadline raid10 n2 512 1.84 2.10
deadline raid10 n2 1024 1.89 2.41
deadline raid10 n2 2048 1.84 2.59
deadline raid10 o2 64 1.80 1.94
deadline raid10 o2 256 1.82 1.96
deadline raid10 o2 512 1.73 1.94
deadline raid10 o2 1024 1.87 2.63
deadline raid10 o2 2048 1.82 3.13
deadline raid5 left-asymmetric 64 1.67 2.55
deadline raid5 left-asymmetric 256 1.43 2.84
deadline raid5 left-asymmetric 512 1.22 2.76
deadline raid5 left-asymmetric 1024 1.04 3.27
deadline raid5 left-asymmetric 2048 0.52 3.31
deadline raid5 left-symmetric 64 1.61 2.32
deadline raid5 left-symmetric 256 1.42 2.89
deadline raid5 left-symmetric 512 1.26 2.89
deadline raid5 left-symmetric 1024 1.08 3.14
deadline raid5 left-symmetric 2048 0.55 3.31
deadline raid5 right-asymmetric 64 1.68 2.15
deadline raid5 right-asymmetric 256 1.50 2.88
deadline raid5 right-asymmetric 512 1.23 2.83
deadline raid5 right-asymmetric 1024 0.97 3.44
deadline raid5 right-asymmetric 2048 0.47 3.24
deadline raid5 right-symmetric 64 1.64 2.11
deadline raid5 right-symmetric 256 1.50 2.84
deadline raid5 right-symmetric 512 1.22 2.83
deadline raid5 right-symmetric 1024 1.00 3.02
deadline raid5 right-symmetric 2048 0.43 3.19
deadline raid6   64 1.22 1.73
deadline raid6   256 1.20 1.75
deadline raid6   512 1.04 2.45
deadline raid6   1024 0.89 3.19
deadline raid6   2048 0.57 3.32
anticipatory raid10 f2 64 1.62 2.59
anticipatory raid10 f2 128 1.59 3.50
anticipatory raid10 f2 256 1.61 3.46
anticipatory raid10 f2 512 1.65 3.73
anticipatory raid10 f2 1024 1.61 3.58
anticipatory raid10 f2 2048 1.47 3.80
anticipatory raid10 n2 64 1.87 1.21
anticipatory raid10 n2 128 1.83 1.45
anticipatory raid10 n2 256 1.83 1.90
anticipatory raid10 n2 512 1.83 2.20
anticipatory raid10 n2 1024 1.82 2.45
anticipatory raid10 n2 2048 1.82 2.70
anticipatory raid10 o2 64 1.82 1.91
anticipatory raid10 o2 128 1.85 1.94
anticipatory raid10 o2 256 1.86 2.05
anticipatory raid10 o2 512 1.80 1.96
anticipatory raid10 o2 1024 1.83 2.63
anticipatory raid10 o2 2048 1.78 3.19
anticipatory raid5 left-asymmetric 64 1.62 2.42
anticipatory raid5 left-asymmetric 128 1.59 2.63
anticipatory raid5 left-asymmetric 256 1.48 2.79
anticipatory raid5 left-asymmetric 512 1.32 2.88
anticipatory raid5 left-asymmetric 1024 1.10 3.37
anticipatory raid5 left-asymmetric 2048 0.54 3.25
anticipatory raid5 left-symmetric 64 1.67 2.49
anticipatory raid5 left-symmetric 128 1.62 2.76
anticipatory raid5 left-symmetric 256 1.52 2.83
anticipatory raid5 left-symmetric 512 1.32 2.76
anticipatory raid5 left-symmetric 1024 1.10 3.32
anticipatory raid5 left-symmetric 2048 0.58 3.25
anticipatory raid5 right-asymmetric 64 1.67 2.17
anticipatory raid5 right-asymmetric 128 1.55 2.63
anticipatory raid5 right-asymmetric 256 1.48 2.76
anticipatory raid5 right-asymmetric 512 1.30 2.92
anticipatory raid5 right-asymmetric 1024 1.09 3.37
anticipatory raid5 right-asymmetric 2048 0.52 3.37
anticipatory raid5 right-symmetric 64 1.72 2.19
anticipatory raid5 right-symmetric 128 1.67 2.63
anticipatory raid5 right-symmetric 256 1.47 2.88
anticipatory raid5 right-symmetric 512 1.32 2.88
anticipatory raid5 right-symmetric 1024 1.07 3.02
anticipatory raid5 right-symmetric 2048 0.47 3.20
anticipatory raid6 - 64 1.26 1.75
anticipatory raid6 - 128 1.22 1.67
anticipatory raid6 - 256 1.19 1.77
anticipatory raid6 - 512 1.03 2.59
anticipatory raid6 - 1024 0.91 3.08
anticipatory raid6 - 2048 0.58 3.24
noop raid10 f2 64 1.40 2.71
noop raid10 f2 256 1.42 3.80
noop raid10 f2 512 1.42 3.38
noop raid10 f2 1024 1.42 3.65
noop raid10 f2 2048 1.46 3.38
noop raid10 n2 64 1.84 1.21
noop raid10 n2 256 1.83 1.90
noop raid10 n2 512 1.85 2.18
noop raid10 n2 1024 1.85 2.45
noop raid10 n2 2048 1.83 2.55
noop raid10 o2 64 1.82 1.90
noop raid10 o2 256 1.85 1.92
noop raid10 o2 512 1.80 1.97
noop raid10 o2 1024 1.62 2.63
noop raid10 o2 2048 1.78 3.13
noop raid5 left-asymmetric 64 1.75 2.63
noop raid5 left-asymmetric 256 1.62 2.92
noop raid5 left-asymmetric 512 1.37 2.92
noop raid5 left-asymmetric 1024 1.09 3.32
noop raid5 left-asymmetric 2048 0.54 3.50
noop raid5 left-symmetric 64 1.78 2.20
noop raid5 left-symmetric 256 1.62 2.88
noop raid5 left-symmetric 512 1.37 2.88
noop raid5 left-symmetric 1024 1.12 3.25
noop raid5 left-symmetric 2048 0.58 3.37
noop raid5 right-asymmetric 64 1.78 2.23
noop raid5 right-asymmetric 256 1.61 2.97
noop raid5 right-asymmetric 512 1.38 2.89
noop raid5 right-asymmetric 1024 1.04 3.30
noop raid5 right-asymmetric 2048 0.52 3.25
noop raid5 right-symmetric 64 1.78 2.29
noop raid5 right-symmetric 256 1.65 2.84
noop raid5 right-symmetric 512 1.38 2.92
noop raid5 right-symmetric 1024 1.09 3.03
noop raid5 right-symmetric 2048 0.47 3.19
noop raid6 - 64 1.29 1.72
noop raid6 - 256 1.21 1.84
noop raid6 - 512 1.05 2.56
noop raid6 - 1024 0.88 3.08
noop raid6 - 2048 0.61 3.31
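A sweep over cache sizes like the one tabulated above can be scripted against the md sysfs knob, /sys/block/mdX/md/stripe_cache_size. The sketch below is a hypothetical helper, not the harness that produced these numbers: the device name (md1) is an assumption, and it only *emits* the sysfs writes so you can review them before piping to `sh` as root.

```shell
# Hypothetical sweep helper (md1 is an assumed device name).
# Emits one sysfs write per stripe_cache_size value; pipe the output
# to `sh` as root to apply, inserting your timed read/write workload
# between steps when benchmarking.
cmds=$(for size in 64 128 256 512 1024 2048; do
    printf 'echo %s > /sys/block/md1/md/stripe_cache_size\n' "$size"
done)
echo "$cmds"
```

Note that stripe_cache_size is counted in pages per member device, so the memory pinned by the cache is roughly page size × stripe_cache_size × number of devices; on a 4 KiB-page system, 2048 across five disks ties up about 40 MiB.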