Archive for the 'File Systems' Category

High-availability storage with GlusterFS on Ubuntu

This tutorial shows how to set up high-availability storage with two storage servers (Ubuntu 9.10) that use GlusterFS.

Each storage server will be a mirror of the other storage server, and files will be replicated automatically across both storage servers. The client system (Ubuntu 9.10 as well) will be able to access the storage as if it were a local filesystem.

GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. Storage bricks can be made of any commodity hardware such as x86_64 servers with SATA-II RAID and Infiniband HBA.

About GlusterFS: Gluster Storage Platform is an open source clustered storage solution. The software is a powerful and flexible solution that simplifies the task of managing unstructured file data whether you have a few terabytes of storage or multiple petabytes. Gluster Storage Platform integrates the file system, an operating system layer, and a web-based management interface and installer.

Google to switch to EXT4

Google is apparently in the process of migrating its EXT2 file systems to the newer EXT4 file system (which Ubuntu 9.10 uses by default).

Phoronix reports:

This was brought up in a JFS benchmarking discussion. Google’s Michael Rubin shared that they chose EXT4 after benchmarking it as well as XFS and JFS (possibly with our Phoronix Test Suite carrying out some of the testing, which they have used in other areas). Their results showed EXT4 and XFS performing close to one another, but with it being easier to upgrade from EXT2 to EXT4 rather than EXT2 to XFS, they went with the easier path. Btrfs is still too experimental for Google to even consider that an option at this point.

They have also now hired the main developer behind EXT4, Ted Ts’o.

Will Google become even faster at processing large quantities of data and displaying them to the end user?

Gluster enhances open-source clustered NAS

Gluster adds data storage management, virtual server support to open-source clustered NAS

Gluster is joining a recent wave of emerging vendors adding enterprise storage management features to clustered network-attached storage (NAS) systems based on commodity hardware with this week’s release of the Gluster Storage Platform.

Gluster came out of stealth in 2007 with GlusterFS, a scale-out file system for clustered NAS based on open-source code but reengineered “from the ground up,” according to senior director of marketing Jack O’Brien. Version 2 of GlusterFS came out last May with striping, data replication and management tools.

The Gluster Storage Platform, which became available this week, continues to build on those management features with a new software delivery model, an updated Web-based management GUI, and new support for virtual servers, including the ability to self-heal data errors in virtual server environments.

With the Gluster Storage Management Platform, users can now get GlusterFS, the Linux operating system kernel layer and management tools in one package loaded on a thumb drive. Gluster calls this “clustered storage on a stick.” This package also includes a two-step installation process with the goal of making the open-source software accessible to customers who aren’t used to working with code. “We call this release rated ‘E’ for everyone,” O’Brien said.

Similarly, the Gluster Storage Platform uses a Web-based GUI that’s added support for more of Gluster’s management features, such as event logging, which used to require a command line interface.

Finally, though not officially certified with any major server virtualization vendor yet, Gluster is offering support for running virtual machines (VMs) on its clustered NAS. Customers who choose this option can use the cluster’s internal replication to provide high availability (HA) failover for VMs running on the cluster, which is set up using a checkbox at the time of installation. From there, the file system automatically handles the replication using the underlying object-based storage system.

Continues

New FreeBSD Foundation Project: HAST

The FreeBSD Foundation has announced that it is funding a new project: HAST

“Pawel Jakub Dawidek has been awarded a grant to implement storage replication software that will enable users to use the FreeBSD operating system for highly available configurations where data has to be shared across the cluster nodes. The project is partly being funded by OMCnet Internet Service and TransIP BV.

The software will allow for synchronous block-level replication of any storage media (GEOM providers, using FreeBSD nomenclature) over the TCP/IP network and for fast failure recovery. HAST will provide storage using GEOM infrastructure, which means it will be file system and application independent and could be combined with any existing GEOM class. In case of a master node failure, the cluster will be able to switch to the slave node, check and mount UFS file system or import ZFS pool and continue to work without missing a single bit of data.

‘High-availability is the number one requirement for any serious use of any operating system,’ said Pawel Jakub Dawidek, FreeBSD Developer. ‘Highly available storage is one of the key components in such environments. I strongly believe there are many FreeBSD users that have been waiting a long time for this functionality. I’ll do my best to deliver software that matches FreeBSD quality and that will satisfy the needs of our users.’

Pawel has been an active FreeBSD committer since 2003. During this period, he has touched almost every part of the kernel, but his main interest in FreeBSD is storage and security related topics. Pawel is the author of various GEOM classes (eli, mirror, gate, label, journal, shsec, etc.), the geom(8) utility, various opencrypto improvements, as well as the port of the ZFS file system from OpenSolaris to FreeBSD.

The project will complete by February 2010.”
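To illustrate what synchronous block-level replication means in the announcement above, here is a toy model. It is a sketch under stated assumptions, not HAST's design: `Node`, `ReplicatedDevice` and `fail_over` are hypothetical names, and in-memory dicts stand in for the disks and the network.

```python
class Node:
    """A storage node holding blocks in memory (stand-in for a GEOM provider)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_no, data):
        self.blocks[block_no] = data
        return True  # acknowledge the write


class ReplicatedDevice:
    """Writes succeed only after both master and slave have acknowledged them."""

    def __init__(self, master, slave):
        self.master = master
        self.slave = slave

    def write(self, block_no, data):
        ok_master = self.master.store(block_no, data)
        ok_slave = self.slave.store(block_no, data)
        if not (ok_master and ok_slave):
            raise IOError("write not acknowledged by both nodes")

    def read(self, block_no):
        return self.master.blocks[block_no]

    def fail_over(self):
        # Promote the slave; it already holds every acknowledged write.
        self.master, self.slave = self.slave, self.master


device = ReplicatedDevice(Node("node-a"), Node("node-b"))
device.write(0, b"superblock")
device.fail_over()
print(device.read(0))
```

Because a write is reported as complete only once both nodes hold it, the promoted node can serve every acknowledged block after a failover, which is the property the announcement describes as continuing "without missing a single bit of data".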

If you want, you can support this project too.

FreeNAS Bash script for ZFS scrubbing

Gimpe has put together a bash script that runs at predefined intervals and scrubs each ZFS pool. Please note that it will only run on FreeNAS 0.7, not on the 0.6x series, which doesn’t support Sun’s Zettabyte File System (ZFS).
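As a rough sketch of what such a script does (this is not Gimpe's script, and Python is used here only for illustration), the following lists the pools with `zpool list -H -o name` and starts `zpool scrub` on each one. The function names are hypothetical, and running it at intervals is assumed to be left to a scheduler such as cron.

```python
import subprocess


def parse_pool_names(zpool_list_output):
    """Return pool names from `zpool list -H -o name` output (one name per line)."""
    return [line.strip() for line in zpool_list_output.splitlines() if line.strip()]


def scrub_commands(pool_names):
    """Build one `zpool scrub <pool>` command per pool."""
    return [["zpool", "scrub", name] for name in pool_names]


def scrub_all_pools():
    """Start a scrub on every pool; needs the zpool CLI and sufficient privileges."""
    listing = subprocess.run(
        ["zpool", "list", "-H", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in scrub_commands(parse_pool_names(listing)):
        subprocess.run(command, check=True)
```

Calling `scrub_all_pools()` on a machine with ZFS starts the scrubs; `zpool scrub` returns immediately and the scrub itself continues in the background.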

View/Download the script from hypeothetic.com

Boot on BTRFS With Debian

howtoforge.com has published a tutorial on booting from a BTRFS filesystem on Debian:

This tutorial explains how to boot from a BTRFS filesystem with kernel 2.6.31-RC4 and BTRFS 0.19. BTRFS is a new filesystem with some really interesting features like online defragmenting and snapshots. BTRFS is an experimental filesystem; use it at your own risk. The kernel used is also experimental.

This tutorial worked fine for me but I don’t guarantee that this will work for you, and decline all responsibility for any problem you might have.

btrfs debian guide

btrfs: a brief comparison with ZFS

lwn.net has an overview of the history of the btrfs filesystem. The author also touches on the differences and similarities between btrfs and ZFS, Sun’s Zettabyte File System, which is what I was most interested in:

“People often ask about the relationship between btrfs and ZFS. From one point of view, the two file systems are very similar: they are copy-on-write checksummed file systems with multi-device support and writable snapshots. From other points of view, they are wildly different: file system architecture, development model, maturity, license, and host operating system, among other things. Rather than answer individual questions, I’ll give a short history of ZFS development and compare and contrast btrfs and ZFS on a few key items.

When ZFS first got started, the outlook for file systems in Solaris was rather dim as well. Logging UFS was already nearing the end of its rope in terms of file system size and performance. UFS was so far behind that many Solaris customers paid substantial sums of money to Veritas to run VxFS instead. Solaris needed a new file system, and it needed it soon.

Jeff Bonwick decided to solve the problem and started the ZFS project inside Sun. His organizing metaphor was that of the virtual memory subsystem – why can’t disk be as easy to administer and use as memory? The central on-disk data structure was the slab – a chunk of disk divided up into the same size blocks, like that in the SLAB kernel memory allocator, which he also created. Instead of extents, ZFS would use one block pointer per block, but each object would use a different block size – e.g., 512 bytes, or 128KB – depending on the size of the object. Block addresses would be translated through a virtual-memory-like mechanism, so that blocks could be relocated without the knowledge of upper layers. All file system data and metadata would be kept in objects. And all changes to the file system would be described in terms of changes to objects, which would be written in a copy-on-write fashion.

In summary, btrfs organizes everything on disk into a btree of extents containing items and data. ZFS organizes everything on disk into a tree of block pointers, with different block sizes depending on the object size. btrfs checksums and reference-counts extents, ZFS checksums and reference-counts variable-sized blocks. Both file systems write out changes to disk using copy-on-write – extents or blocks in use are never overwritten in place, they are always copied somewhere else first.

So, while the feature list of the two file systems looks quite similar, the implementations are completely different. It’s a bit like convergent evolution between marsupials and placental mammals – a marsupial mouse and a placental mouse look nearly identical on the outside, but their internal implementations are quite a bit different!

In my opinion, the basic architecture of btrfs is more suitable to storage than that of ZFS. One of the major problems with the ZFS approach – “slabs” of blocks of a particular size – is fragmentation. Each object can contain blocks of only one size, and each slab can only contain blocks of one size. You can easily end up with, for example, a file of 64K blocks that needs to grow one more block, but no 64K blocks are available, even if the file system is full of nearly empty slabs of 512 byte blocks, 4K blocks, 128K blocks, etc. To solve this problem, we (the ZFS developers) invented ways to create big blocks out of little blocks (“gang blocks”) and other unpleasant workarounds. In our defense, at the time btrees and extents seemed fundamentally incompatible with copy-on-write, and the virtual memory metaphor served us well in many other respects.

In contrast, the items-in-a-btree approach is extremely space efficient and flexible. Defragmentation is an ongoing process – repacking the items efficiently is part of the normal code path preparing extents to be written to disk. Doing checksums, reference counting, and other assorted metadata busy-work on a per-extent basis reduces overhead and makes new features (such as fast reverse mapping from an extent to everything that references it) possible.

Now for some personal predictions (based purely on public information – I don’t have any insider knowledge). Btrfs will be the default file system on Linux within two years. Btrfs as a project won’t (and can’t, at this point) be canceled by Oracle. If all the intellectual property issues are worked out (a big if), ZFS will be ported to Linux, but it will have less than a few percent of the installed base of btrfs. Check back in two years and see if I got any of these predictions right!”
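The copy-on-write behaviour that the quote attributes to both file systems can be illustrated with a toy model. This is a minimal sketch, not how either btrfs or ZFS is implemented: the class `CowStore` and its methods are hypothetical names, and a Python dict stands in for the on-disk tree.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = []   # append-only list of block contents
        self.root = {}     # current mapping: name -> index into self.blocks

    def write(self, name, data):
        # Put the new data in a fresh block, then publish a new root
        # that points at it. The old block and old root stay intact.
        self.blocks.append(data)
        new_root = dict(self.root)
        new_root[name] = len(self.blocks) - 1
        self.root = new_root

    def snapshot(self):
        # A snapshot is just a reference to the current root; no data is copied.
        return self.root

    def read(self, name, root=None):
        root = self.root if root is None else root
        return self.blocks[root[name]]


store = CowStore()
store.write("a.txt", b"version 1")
snap = store.snapshot()
store.write("a.txt", b"version 2")

print(store.read("a.txt"))        # current contents
print(store.read("a.txt", snap))  # contents as of the snapshot
```

Because a write never touches the old block or the old root, taking a snapshot costs nothing more than keeping a reference to the root that was current at the time.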

SUN ZFS Triple-Parity RAID-Z

The standard in the RAID industry for storage is RAID-6, with recovery from a double drive failure. But it’s not going to be good enough as disk capacities increase, prolonging failed disk rebuild times and so lengthening the window of unrecoverable failure if a third disk fails before the recovery from a double drive failure is complete.

Adam Leventhal, from Sun Fishworks, says hard drive capacity roughly doubles every year but hard drive bandwidth is pretty constant, which means it takes longer and longer to write data to fill up a drive.

“Double-parity RAID, of course, provides protection from up to two failures (data corruption or the whole drive) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder’s law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and 2TB and 3TB drives expected this year and next won’t move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we’re spending more and more time back at single-parity. From that it was obvious that double-parity will soon become insufficient (I’m working on an article that examines these phenomena quantitatively so stay tuned).”
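The arithmetic behind that figure can be checked with a short sketch. The throughput value below is an assumption chosen to match the quoted 4 hours for a 1TB drive; it does not come from Leventhal's post.

```python
def fill_time_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours needed to write a whole drive at a constant throughput."""
    return capacity_bytes / throughput_bytes_per_s / 3600


TB = 10**12
ASSUMED_THROUGHPUT = 70 * 10**6  # 70 MB/s, an assumed sustained rate

for capacity_tb in (1, 2, 3):
    hours = fill_time_hours(capacity_tb * TB, ASSUMED_THROUGHPUT)
    print(f"{capacity_tb} TB drive: about {hours:.1f} hours")
```

With throughput held constant, fill time grows linearly with capacity, so under this assumption 2TB and 3TB drives would need roughly 8 and 12 hours at peak rate, and longer in a real-world environment.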

Leventhal has added triple-parity RAID to Sun’s ZFS filesystem, calling it RAIDz3. He suggests that calling it RAID-7 or RAID-8 generically might be silly. RAID-6 is often known as RAID-DP, though, so RAID-TP would seem logical. Leventhal says it too could be superseded if disk capacities keep on growing.

Read Adam’s post on his implementation of Triple-Parity RAID-Z

The Btrfs file system (ZFS vs btrfs)

H-online has an interesting post explaining Btrfs, the designated “next generation file system” for Linux:

Btrfs, the designated “next generation file system” for Linux, offers a range of features that are not available in other Linux file systems – and it’s nearly ready for production use.

If the numerous articles published about this topic in the past few months are to be believed, Btrfs is the file system of the future for Linux and the file system developers agree: Btrfs is to be the “next generation file system” for Linux. The general consensus is that Btrfs is the ZFS for Linux. While this may be disputable at present since the ZFS, designed by Sun Microsystems for the Solaris Operating System, is already in production use, while Btrfs is still highly experimental, the two file systems do have a lot in common. With its integrated volume management, checksums for data integrity, Copy on Write and snapshots, Btrfs offers a range of features unrivalled by any of the Linux file systems currently in production use.

Btrfs, which is called “ButterFS” by some people and “BetterFS” by others, is actually short for B-Tree File System, and is so named because the file system manages its data and metadata in tree structures. Masterminded by Oracle developer Chris Mason, the file system has been a part of the Linux kernel since Linux 2.6.29. However, this doesn’t mean that it is stable, let alone suitable for production use. The Btrfs page at kernel.org clearly points out that not even the file system’s on disk data formats have so far been finalised.

Continue reading ‘The Btrfs file system (ZFS vs btrfs)’