Category Archives: File Systems

Boot on BTRFS With Debian

A tutorial has been uploaded that shows how to boot from a BTRFS filesystem on Debian.

This tutorial explains how to boot from a BTRFS filesystem with kernel 2.6.31-rc4 and BTRFS 0.19. BTRFS is a new filesystem with some really interesting features, such as online defragmentation and snapshots. BTRFS is an experimental filesystem; use it at your own risk. The kernel used is also experimental.

This tutorial worked fine for me but I don’t guarantee that this will work for you, and decline all responsibility for any problem you might have.

btrfs debian guide

btrfs: a brief comparison with ZFS

This article gives an overview of the history of the btrfs filesystem (wikipedia link). It also touches on the differences and similarities between btrfs and ZFS, Sun’s Zettabyte File System, which is what I was most interested in:

“People often ask about the relationship between btrfs and ZFS. From one point of view, the two file systems are very similar: they are copy-on-write checksummed file systems with multi-device support and writable snapshots. From other points of view, they are wildly different: file system architecture, development model, maturity, license, and host operating system, among other things. Rather than answer individual questions, I’ll give a short history of ZFS development and compare and contrast btrfs and ZFS on a few key items.

When ZFS first got started, the outlook for file systems in Solaris was rather dim as well. Logging UFS was already nearing the end of its rope in terms of file system size and performance. UFS was so far behind that many Solaris customers paid substantial sums of money to Veritas to run VxFS instead. Solaris needed a new file system, and it needed it soon.

Jeff Bonwick decided to solve the problem and started the ZFS project inside Sun. His organizing metaphor was that of the virtual memory subsystem – why can’t disk be as easy to administer and use as memory? The central on-disk data structure was the slab – a chunk of disk divided up into the same size blocks, like that in the SLAB kernel memory allocator, which he also created. Instead of extents, ZFS would use one block pointer per block, but each object would use a different block size – e.g., 512 bytes, or 128KB – depending on the size of the object. Block addresses would be translated through a virtual-memory-like mechanism, so that blocks could be relocated without the knowledge of upper layers. All file system data and metadata would be kept in objects. And all changes to the file system would be described in terms of changes to objects, which would be written in a copy-on-write fashion.

In summary, btrfs organizes everything on disk into a btree of extents containing items and data. ZFS organizes everything on disk into a tree of block pointers, with different block sizes depending on the object size. btrfs checksums and reference-counts extents, ZFS checksums and reference-counts variable-sized blocks. Both file systems write out changes to disk using copy-on-write – extents or blocks in use are never overwritten in place, they are always copied somewhere else first.

So, while the feature list of the two file systems looks quite similar, the implementations are completely different. It’s a bit like convergent evolution between marsupials and placental mammals – a marsupial mouse and a placental mouse look nearly identical on the outside, but their internal implementations are quite a bit different!

In my opinion, the basic architecture of btrfs is more suitable to storage than that of ZFS. One of the major problems with the ZFS approach – “slabs” of blocks of a particular size – is fragmentation. Each object can contain blocks of only one size, and each slab can only contain blocks of one size. You can easily end up with, for example, a file of 64K blocks that needs to grow one more block, but no 64K blocks are available, even if the file system is full of nearly empty slabs of 512 byte blocks, 4K blocks, 128K blocks, etc. To solve this problem, we (the ZFS developers) invented ways to create big blocks out of little blocks (“gang blocks”) and other unpleasant workarounds. In our defense, at the time btrees and extents seemed fundamentally incompatible with copy-on-write, and the virtual memory metaphor served us well in many other respects.

In contrast, the items-in-a-btree approach is extremely space efficient and flexible. Defragmentation is an ongoing process – repacking the items efficiently is part of the normal code path preparing extents to be written to disk. Doing checksums, reference counting, and other assorted metadata busy-work on a per-extent basis reduces overhead and makes new features (such as fast reverse mapping from an extent to everything that references it) possible.

Now for some personal predictions (based purely on public information – I don’t have any insider knowledge). Btrfs will be the default file system on Linux within two years. Btrfs as a project won’t (and can’t, at this point) be canceled by Oracle. If all the intellectual property issues are worked out (a big if), ZFS will be ported to Linux, but it will have less than a few percent of the installed base of btrfs. Check back in two years and see if I got any of these predictions right!”
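To make the fragmentation complaint in the quote above more concrete, here is a toy sketch in Python. It is not ZFS's real allocator: the class name `ToySlabAllocator`, the 1 MB slab size, and the two-slab pool are all assumptions made up for illustration. The only rule it models is the one from the quote, that each slab can hold blocks of a single size, and it leaves out workarounds such as gang blocks.

```python
from collections import defaultdict


class ToySlabAllocator:
    """Toy model (hypothetical): every slab holds blocks of exactly one size."""

    def __init__(self, slab_size: int, slab_count: int):
        self.slab_size = slab_size
        self.unassigned_slabs = slab_count      # slabs not yet tied to a block size
        self.free_blocks = defaultdict(int)     # block size -> number of free blocks

    def allocate(self, block_size: int) -> bool:
        """Take one block of block_size; return False if none can be found."""
        if self.free_blocks[block_size] == 0:
            if self.unassigned_slabs == 0:
                # Free space may exist in slabs of other sizes, but it is unusable here.
                return False
            # Dedicate a fresh slab to this block size.
            self.unassigned_slabs -= 1
            self.free_blocks[block_size] += self.slab_size // block_size
        self.free_blocks[block_size] -= 1
        return True

    def free_bytes(self) -> int:
        """Total free space, regardless of which block size it is locked into."""
        return (self.unassigned_slabs * self.slab_size
                + sum(size * count for size, count in self.free_blocks.items()))


KB = 1024
alloc = ToySlabAllocator(slab_size=1024 * KB, slab_count=2)

# Fill the first slab completely with sixteen 64K blocks.
for _ in range(16):
    assert alloc.allocate(64 * KB)

# One 512-byte block claims the second slab and leaves it nearly empty.
assert alloc.allocate(512)

# The file of 64K blocks now wants to grow by one more block.
grew = alloc.allocate(64 * KB)
print(grew, alloc.free_bytes())  # False 1048064
```

Almost a full megabyte is free, yet the 64K allocation fails because all of that space sits in a slab dedicated to 512-byte blocks. That is the situation the quote describes.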

SUN ZFS Triple-Parity RAID-Z

RAID-6, which can recover from a double drive failure, is the standard RAID level in the storage industry. But it will not be good enough as disk capacities increase: larger disks take longer to rebuild, which lengthens the window in which a third disk can fail before recovery from a double drive failure is complete, causing unrecoverable data loss.

Adam Leventhal, from Sun Fishworks, says hard drive capacity roughly doubles every year while hard drive bandwidth stays fairly constant, which means it takes longer and longer to fill a drive with data.

“Double-parity RAID, of course, provides protection from up to two failures (data corruption or the whole drive) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder’s law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and 2TB and 3TB drives expected this year and next won’t move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we’re spending more and more time back at single-parity. From that it was obvious that double-parity will soon become insufficient (I’m working on an article that examines these phenomena quantitatively so stay tuned).”
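The arithmetic behind the quote can be sketched in a few lines of Python. The only figure taken from the quote is "1TB in about 4 hours". The decimal units, the function name `fill_time_hours`, and the assumption that throughput stays completely flat for 2TB and 3TB drives are mine, so treat the results as a rough illustration rather than measured numbers.

```python
def fill_time_hours(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to write an entire drive at a sustained throughput."""
    return capacity_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

# Throughput implied by the quote: a 1 TB drive filled in about 4 hours.
implied_mb_per_s = TB / (4 * 3600) / MB
print(f"implied throughput: {implied_mb_per_s:.1f} MB/s")

# Assumption: throughput does not improve at all as capacity grows.
for capacity_tb in (1, 2, 3):
    hours = fill_time_hours(capacity_tb * TB, implied_mb_per_s * MB)
    print(f"{capacity_tb} TB drive: {hours:.1f} h at peak throughput")
```

With flat throughput the fill time scales linearly with capacity, so each doubling of drive size doubles the time an array spends degraded while a replacement drive is populated.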

Leventhal has added triple-parity RAID to Sun’s ZFS filesystem, calling it RAID-Z3. He suggests that generic names such as RAID-7 or RAID-8 might be silly. RAID-6 is often known as RAID-DP though, so RAID-TP would seem logical. Leventhal says it too could be superseded if disk capacities keep on growing.

Read Adam’s post on his implementation of Triple-Parity RAID-Z

The Btrfs file system (ZFS vs btrfs)

H-online has an interesting post explaining Btrfs, the designated “next generation file system” for Linux:

Btrfs, the designated “next generation file system” for Linux, offers a range of features that are not available in other Linux file systems – and it’s nearly ready for production use.

If the numerous articles published about this topic in the past few months are to be believed, Btrfs is the file system of the future for Linux and the file system developers agree: Btrfs is to be the “next generation file system” for Linux. The general consensus is that Btrfs is the ZFS for Linux. While this may be disputable at present since the ZFS, designed by Sun Microsystems for the Solaris Operating System, is already in production use, while Btrfs is still highly experimental, the two file systems do have a lot in common. With its integrated volume management, checksums for data integrity, Copy on Write and snapshots, Btrfs offers a range of features unrivalled by any of the Linux file systems currently in production use.

Btrfs, which is called “ButterFS” by some people and “BetterFS” by others, is actually short for B-Tree File System, and is so named because the file system manages its data and metadata in tree structures. Masterminded by Oracle developer Chris Mason, the file system has been a part of the Linux kernel since Linux 2.6.29. However, this doesn’t mean that it is stable, let alone suitable for production use. The Btrfs page at clearly points out that not even the file system’s on disk data formats have so far been finalised.
