Sunday, June 22, 2008

DAR - Disk ARchiver

I first looked into dar a few years ago when I needed something to back up a friend's Windows machine at the filesystem level, and make it available to him (also on Windows) later. I more or less arbitrarily chose dar. Since then, I haven't really used it. While rdiff-backup is working well for me, I'm always looking at other options. One option I was looking at is archive-based (or file-based) backups which, instead of storing entire trees as trees of their own (copies), store the trees in a single file or a small set of files. Think "tarball" or "zip file" and you've got the idea. After some recent bad experiences with GNU tar, GNU cpio, and even pax, however, I was looking somewhat outside the usual circle of suspects. I remembered my experience with dar, so I thought I'd give it a more in-depth look.

First, a bit about dar. dar stands for Disk ARchiver. To quote from the home page at http://dar.linux.free.fr/:

dar is a shell command that backs up directory trees and files. It has been tested under Linux, Windows, Solaris, FreeBSD, NetBSD, MacOS X and several other systems, it is released under the GNU General Public License (GPL).

dar looks very professionally done and the project appears healthy. dar has a great record of "getting it right the first time" - a sign of quality code and management of process. dar's feature list is rather impressive (more on this later). On the surface, it looks like dar will make your bread and slice it too.

My enthusiasm diminished almost immediately, however. I must preface this bit by saying that not once did I encounter a situation where I doubted the robustness of the code, but rather felt that some suboptimal user interface design decisions had been made. Some were even borderline show stoppers for me.

The first issue I ran into is that dar doesn't write to the file I told it to - it creates a new filename based on what I told it to use. dar calls this the basename. For example, let's say I asked it to create an archive and write it to file "foo". It won't create a file "foo" and write to it; instead, it will create "foo.1.dar". I asked for "foo", so I should get "foo". I understand why this was done (the number identifies the slice when an archive is split) but I don't have to like it - I'd rather complicate the file naming with something like --serialnames and "foo.%serial", or even a secondary helper program. dar clearly has built-in support for splitting archives, but forcing the basename concept on the user doesn't win it any points from me.
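The naming rule is simple enough to sketch in shell. The helper below is hypothetical (slice_name is not part of dar), and it assumes that slices after the first follow the same basename.N.dar pattern as the "foo.1.dar" seen above:

```shell
# Hypothetical helper, not part of dar: build the on-disk name of slice N
# from the basename passed to --create. Only slice 1 is confirmed by the
# run above; later slice numbers are an assumption.
slice_name() {
    printf '%s.%d.dar\n' "$1" "$2"
}

slice_name foo 1
slice_name files-to-archive 1
```

A wrapper script could use something like this to find the file dar actually wrote, given only the name it was asked to use.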

Moving on, the second issue I ran into is the reporting of files that it archives: they are always shown as absolute paths, even when I am not specifying an absolute path. I'm used to tar and cpio and zip and just about everything else out there doing what I tell it to - if I say to process "foo" (implicitly "./foo") it processes "foo", not "/wherever/you/are/now/foo". I always use relative pathnames. Indeed, tar goes out of its way to always use relative pathnames, unless told to really, really use absolute pathnames. cpio is most commonly invoked using the find(1) utility and performs no relative-to-absolute path munging. dar insists on reporting the full path of the files it archives. I'll show you by way of example:

dar --create files-to-archive --fs-root=./files-to-archive/ -v

(Again, this would create files-to-archive.1.dar, not files-to-archive. Grr.)

Let's create and populate files-to-archive:

[username:~] rm -rf files-to-archive; mkdir files-to-archive && ( for i in $(seq 1 10); do touch files-to-archive/file-$i ; done )
[username:~] find files-to-archive/
files-to-archive/
files-to-archive/file-7
files-to-archive/file-5
files-to-archive/file-4
files-to-archive/file-3
files-to-archive/file-6
files-to-archive/file-2
files-to-archive/file-9
files-to-archive/file-1
files-to-archive/file-10
files-to-archive/file-8
[username:~]

Groovy. Now, let's archive them:

[username:~] dar --create files-to-archive --fs-root=./files-to-archive/ -v
Adding file to archive: /home/username/files-to-archive/file-7
Adding file to archive: /home/username/files-to-archive/file-5
Adding file to archive: /home/username/files-to-archive/file-4
Adding file to archive: /home/username/files-to-archive/file-3
Adding file to archive: /home/username/files-to-archive/file-6
Adding file to archive: /home/username/files-to-archive/file-2
Adding file to archive: /home/username/files-to-archive/file-9
Adding file to archive: /home/username/files-to-archive/file-1
Adding file to archive: /home/username/files-to-archive/file-10
Adding file to archive: /home/username/files-to-archive/file-8
Writing archive contents...


--------------------------------------------
10 inode(s) saved
with 0 hard link(s) recorded
0 inode(s) changed at the moment of the backup
0 inode(s) not saved (no inode/file change)
0 inode(s) failed to save (filesystem error)
0 inode(s) ignored (excluded by filters)
0 inode(s) recorded as deleted from reference backup
--------------------------------------------
Total number of inode considered: 10
--------------------------------------------
EA saved for 0 inode(s)
--------------------------------------------
[username:~]

But! But! But I didn't say process /home/username/files-to-archive/; I said to process ./files-to-archive. That's annoying. The output makes me question whether I invoked dar correctly and how the files are actually stored in the archive. Are they stored with absolute paths or relative? If they are stored with absolute paths (mirroring what is displayed), then that's not what I asked for. If they are stored with relative paths (which is what I asked for), then the display is confusing!

Let's see what's in the archive:

[username:~] dar --list files-to-archive
[data ][ EA  ][compr] | permission | user  | group | size  |          date                 |    filename
----------------------+------------+-------+-------+-------+-------------------------------+------------
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-7
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-5
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-4
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-3
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-6
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-2
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-9
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-1
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-10
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        file-8
[username:~]

Uh. That's not at all what I wanted. It's missing the leading 'files-to-archive/' part. So I read the manpage. Again. (read read read...) OK, let's try this. I think I understand what I did wrong. The fs-root option doesn't do what I thought it did. Since fs-root defaults to '.' I don't need that option any more.
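For anyone more used to tar, the distinction looks a lot like tar's -C option. This is only an analogy, not dar's implementation: the sketch below uses tar (not dar) on a throwaway directory, with three files instead of ten, to show how the choice of root decides whether the leading directory name ends up in the archive.

```shell
# Throwaway tree mirroring the example above (three files instead of ten).
tmp=$(mktemp -d)
mkdir "$tmp/files-to-archive"
for i in 1 2 3; do : > "$tmp/files-to-archive/file-$i"; done

# Root set to the directory itself: members are stored without its name.
tar -cf "$tmp/inside.tar" -C "$tmp/files-to-archive" file-1 file-2 file-3

# Root set to the parent, directory selected by name: members keep the prefix.
tar -cf "$tmp/outside.tar" -C "$tmp" files-to-archive

inside=$(tar -tf "$tmp/inside.tar")
outside=$(tar -tf "$tmp/outside.tar")
printf 'inside:\n%s\n\noutside:\n%s\n' "$inside" "$outside"
rm -rf "$tmp"
```

The first listing corresponds to what the --fs-root invocation produced; the second is the layout I was actually after.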

Let's see what happens when I invoke it "correctly":

[username:~] dar --create files-to-archive -g files-to-archive -v
Cannot read directory contents: /home/username/lost+found : Error opening directory: /home/username/lost+found : Permission denied
Adding file to archive: /home/username/files-to-archive
Adding file to archive: /home/username/files-to-archive/file-7
Adding file to archive: /home/username/files-to-archive/file-5
Adding file to archive: /home/username/files-to-archive/file-4
Adding file to archive: /home/username/files-to-archive/file-3
Adding file to archive: /home/username/files-to-archive/file-6
Adding file to archive: /home/username/files-to-archive/file-2
Adding file to archive: /home/username/files-to-archive/file-9
Adding file to archive: /home/username/files-to-archive/file-1
Adding file to archive: /home/username/files-to-archive/file-10
Adding file to archive: /home/username/files-to-archive/file-8
Writing archive contents...


--------------------------------------------
11 inode(s) saved
with 0 hard link(s) recorded
0 inode(s) changed at the moment of the backup
0 inode(s) not saved (no inode/file change)
0 inode(s) failed to save (filesystem error)
122 inode(s) ignored (excluded by filters)
0 inode(s) recorded as deleted from reference backup
--------------------------------------------
Total number of inode considered: 133
--------------------------------------------
EA saved for 0 inode(s)
--------------------------------------------
[username:~]

Pretty much the same as last time. Let's see what's in the archive:

[username:~] dar --list files-to-archive
[data ][ EA  ][compr] | permission | user  | group | size  |          date                 |    filename
----------------------+------------+-------+-------+-------+-------------------------------+------------
[Saved]       [-----]   drwxr-xr-x   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-7
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-5
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-4
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-3
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-6
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-2
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-9
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-1
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-10
[Saved]       [     ]   -rw-r--r--   username    users   0       Sun Jun 15 21:27:53 2008        files-to-archive/file-8
[username:~]

OK, that's better, but that invocation is by no means intuitive. Wait a minute, what's this bit about lost+found? Why does it care about lost+found, which is not part of the tree I asked dar to operate on? Maybe it has to do with how dar recurses. Let's try the cpio approach, wherein the archiver gets its list of filenames on stdin. Something like this:

find files-to-archive | dar --create files-to-archive --include-from-file -

(Omitting more-or-less duplicate output and one erroneous invocation...)

That doesn't work. I have to use /dev/stdin instead of '-'. Ugh. Fine. When I re-invoke, however, I still get the grump about lost+found. I don't understand why dar thinks it needs to stat any file that it's not going to archive. It knows it's not going to archive lost+found, by virtue of the fact that it's clearly not present in the list of input paths! I consider this to be a bug. Imagine how horrible the performance might be if you wanted to archive 30 files out of 100,000. dar would stat every one of them! Consider the ramifications of this design choice in the following situations:

  • a damaged filesystem (let's say I know I can access A and B but when I access C I encounter a bad block...)
  • NFS or something equally expensive
  • bad NFS mount points

Moving on again, let's skip to a pet-peeve, hard links. Handling hard links badly is a show-stopper for me, so let's see how dar handles this:

[username:~] rm -rf foo
[username:~] mkdir foo
[username:~] dd if=/dev/zero of=foo/a bs=100k count=1
[username:~] ln foo/a foo/b
[username:~] ln foo/a foo/c
[username:~] ln foo/a foo/d
[username:~] find foo -ls
3342147    4 drwxr-xr-x   2 username  users        4096 Jun 21 09:32 foo
3342148    0 -rw-r--r--   4 username  users      102400 Jun 21 09:32 foo/d
3342148    0 -rw-r--r--   4 username  users      102400 Jun 21 09:32 foo/c
3342148    0 -rw-r--r--   4 username  users      102400 Jun 21 09:32 foo/a
3342148    0 -rw-r--r--   4 username  users      102400 Jun 21 09:32 foo/b
[username:~] find foo | dar -c foo --include-from-file=/dev/stdin
Cannot read directory contents: /home/username/lost+found : Error opening directory: /home/username/lost+found : Permission denied


--------------------------------------------
2 inode(s) saved
with 3 hard link(s) recorded
0 inode(s) changed at the moment of the backup
0 inode(s) not saved (no inode/file change)
0 inode(s) failed to save (filesystem error)
5 inode(s) ignored (excluded by filters)
0 inode(s) recorded as deleted from reference backup
--------------------------------------------
Total number of inode considered: 7
--------------------------------------------
EA saved for 0 inode(s)
--------------------------------------------
[username:~]

Whoa. Wait a minute! There are 4 files and one directory, but dar says it saved 2 inodes with 3 links recorded. Hmmmm. OK, yeah. 2 inodes - 1 file and 1 directory - and 3 hard links to that file. Let's look at the listing:

[username:~] dar --list foo
[data ][ EA  ][compr] | permission | user  | group | size  |          date                 |    filename
----------------------+------------+-------+-------+-------+-------------------------------+------------
[Saved]       [-----]   drwxr-xr-x   username    users   0       Sat Jun 21 09:32:33 2008        foo
[Saved]       [     ]   hrw-r--r--   username    users   102400  Sat Jun 21 09:32:27 2008        foo/d
[Saved]       [     ]   hrw-r--r--   username    users   102400  Sat Jun 21 09:32:27 2008        foo/c
[Saved]       [     ]   hrw-r--r--   username    users   102400  Sat Jun 21 09:32:27 2008        foo/a
[Saved]       [     ]   hrw-r--r--   username    users   102400  Sat Jun 21 09:32:27 2008        foo/b
[username:~]

Alright. So far so good. The size of the archive (102547 bytes, barely more than the 102400 bytes of a single copy) shows it's only storing the data once. Good! dar handles hard links just fine.
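The arithmetic can be checked without dar. This sketch recreates the four names in a throwaway directory and counts distinct inode numbers with ls -i; reading the difference between names and inodes as dar's "hard link(s) recorded" figure is my interpretation of the summary above, not something taken from dar's documentation.

```shell
# Recreate the test case: one 100 KiB file with three extra hard links.
tmp=$(mktemp -d)
mkdir "$tmp/foo"
dd if=/dev/zero of="$tmp/foo/a" bs=1024 count=100 2>/dev/null
ln "$tmp/foo/a" "$tmp/foo/b"
ln "$tmp/foo/a" "$tmp/foo/c"
ln "$tmp/foo/a" "$tmp/foo/d"

# ls -i prints "inode name"; count the names and the distinct inode numbers.
names=$(ls "$tmp/foo" | wc -l | tr -d ' ')
distinct=$(ls -i "$tmp/foo" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
extra=$((names - distinct))
echo "$names names, $distinct file inode(s), $extra extra link(s)"
rm -rf "$tmp"
```

Add the directory itself and you get the 2 inodes in dar's summary.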

Taking a step back, I've been beating up on dar pretty badly. dar does some things I don't care for, and some of those things may even be show stoppers. Is it worth continuing the investigation given these annoyances? dar is pretty sophisticated. The archive format itself is well-developed (and appears at least as capable as any of the tar formats), including support for extended attributes (I couldn't care less), ACLs (ditto), huge filenames, etc. dar also supports (by no means an exhaustive list):

- splitting
- merging (cool!, with lots of file-selection smarts, too)
- encryption (both cruddy and good)
- storing the table-of-contents (aka, "catalog") only and using *that* to create the following:
- incremental/differential archives

It would appear that you can even merge differential/incremental archives with each other and with "full" archives to create new archives, all using a fairly powerful file selection mechanism. Furthermore, dar can record which files were removed between one archive and another, something that nothing else I've run across can do, and which also happens to be a big issue for me.
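To illustrate what "removed between one archive and another" means, here is a sketch that works on two plain file listings with comm. The listings and their contents are made up for the example, and this is not how dar tracks deletions; it only shows the information a catalog comparison has to produce.

```shell
# Two made-up file listings: one from the reference backup, one from today.
tmp=$(mktemp -d)
printf '%s\n' foo/a foo/b foo/c | LC_ALL=C sort > "$tmp/old.list"
printf '%s\n' foo/a foo/c foo/e | LC_ALL=C sort > "$tmp/new.list"

# comm needs sorted input: -23 keeps lines only in the first file (removed),
# -13 keeps lines only in the second file (added).
removed=$(LC_ALL=C comm -23 "$tmp/old.list" "$tmp/new.list")
added=$(LC_ALL=C comm -13 "$tmp/old.list" "$tmp/new.list")
echo "removed since reference: $removed"
echo "added since reference: $added"
rm -rf "$tmp"
```

An archiver that stores only changed files loses the "removed" half of this unless it records it explicitly, which is why the feature matters to me.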

dar looks enormously capable and does seem to be working properly. The issues I encountered:

  1. pathname reporting on archive (UI issue)
  2. stat(2)'ing files outside of the requested source tree
  3. the 'basename' archive filename imposition

won't dissuade me from investigating its use more thoroughly, but it's not going to happen with as much enthusiasm as I would have hoped.

1 comment:

Anonymous said...

I'd be interested to see a graph with the best performing raid5, raid6, raid10 stacked up against each other.

Thanks,

Leif