Monday, December 8, 2008

NBD performance enhancements

As I blogged about earlier, I've been making use of NBD lately as my block-device-over-network exporter of choice (previously AoE). I got pretty reasonable performance out of NBD, easily in the 10s of MB/s - basically the limit of my iffy network hardware - but I wanted to know if there was more to be found. And there is! nbd-server, for one, uses the following basic paradigm (a sketch of steps 3 and 4 follows the list):
  1. read data from network
  2. process that data into instructions
  3. do the instructions say to read data?
    If so, lseek to the right place on disk, read data into a buffer, copy a reply (header) plus the data into another buffer, and send that whole buffer back to the client.
  4. do the instructions say to write data?
    If so, read more data from the network into a buffer, lseek to the right place on disk, and write it. Construct a reply and send that back.
  5. Other instructions are processed here...
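
Steps 3 and 4 boil down to the classic seek-then-do-I/O pattern. Here's a minimal sketch of that pattern (my own illustration, not nbd-server's actual source):

    #include <sys/types.h>
    #include <unistd.h>

    /* Step 3's disk half: position the file offset, then read the
     * requested range. Two syscalls per request. */
    ssize_t serve_read(int fd, void *buf, size_t len, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            return -1;
        return read(fd, buf, len);
    }

    /* Step 4's disk half: the same dance for writes. */
    ssize_t serve_write(int fd, const void *buf, size_t len, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            return -1;
        return write(fd, buf, len);
    }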
Right away I saw one place for improvement: replace lseek+read with pread, and lseek+write with pwrite. I did so and immediately saw problems. I eventually tracked them down to most of nbd-server using the 64-bit (largefile) version of the UNIX I/O API - except for pread+pwrite. The fix is to #define _XOPEN_SOURCE 500 (or any of the other feature-test macros which imply it), so I made the requisite changes in the automake (ewwwww) system, and things worked fine after that - and some 10-20% faster. A before/after sketch is at the end of this post.

I considered using TCP_CORK + sendfile, but I'd have to see how the client might handle a partial reply - will it re-read until it gets a complete reply? There *is* a danger here: the server might say "hey, cool - what follows is 128K worth of data" in response to a 128K read request, only to deliver *less* than 128K due to some error. That can't happen in the current scheme, since the data has already been read successfully by the time the reply is even generated. Perhaps writev could avoid some buffer copies, though.

One thing I'd like to see much more of is better use of udev/hal/dbus. I'd rather see the *option* for a dynamic /dev/nbdX (and /dev/mdX) device to be created when one is brought up, and to have that exposed via dbus. That way, I could write udev rules to handle things like "a new block device, /dev/nbd0, has appeared" or "a block device, /dev/nbd0, would like to go (or has gone) away" and so on.

Other than the performance changes, I "fixed" NBD's largefile support and added a TCP keepalive option to the client (handy for detecting when a server has gone away in an unusual fashion). I suppose I ought to add the same TCP keepalive support to the server, too.

One thing I really like about NBD is its simplicity. The code is very simple and relatively clean, and it has very few dependencies (pretty much just glib 2.0, for config file parsing). It's also quite snappy, and it can take advantage of all of the great work that has gone into TCP congestion avoidance. Perhaps porting NBD to make use of SCTP wouldn't be a bad idea.
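
To make the pread change concrete, here's what the two helpers from the earlier sketch look like afterwards - again my own illustration, not nbd-server's actual source. One syscall per request instead of two, and the define has to come before any header is included:

    #define _XOPEN_SOURCE 500   /* exposes the pread/pwrite prototypes */
    #include <sys/types.h>
    #include <unistd.h>

    /* Read LEN bytes at OFFSET without touching the shared file position. */
    ssize_t serve_read(int fd, void *buf, size_t len, off_t offset)
    {
        return pread(fd, buf, len, offset);
    }

    /* Likewise for writes. */
    ssize_t serve_write(int fd, const void *buf, size_t len, off_t offset)
    {
        return pwrite(fd, buf, len, offset);
    }

The writev idea from above would look something like this - the reply header and the data go to the kernel as two separate pieces instead of being copied into one buffer first. The struct here is a stand-in for illustration, not NBD's exact wire format:

    #include <sys/uio.h>
    #include <stdint.h>

    struct reply { uint32_t magic; uint32_t error; char handle[8]; };

    /* Send header + data in a single gathering write; no memcpy needed. */
    ssize_t send_read_reply(int net, struct reply *hdr, void *data, size_t len)
    {
        struct iovec iov[2] = {
            { .iov_base = hdr,  .iov_len = sizeof(*hdr) },
            { .iov_base = data, .iov_len = len          },
        };
        return writev(net, iov, 2);
    }

And the TCP keepalive option boils down to little more than a setsockopt on the connected socket (error reporting and tunable timers elided):

    #include <sys/socket.h>

    /* Ask the kernel to probe an idle connection so a dead peer is noticed. */
    int enable_keepalive(int sock)
    {
        int on = 1;
        return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    }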
