Friday, November 21, 2008

BabuDB - efficient key-value store for Java

The XtreemFS MRC uses a simple key-value store for managing all metadata. We have spent a lot of time evaluating different storage backends: HSQLDB (and relational databases in general), BerkeleyDB for Java, and plain Java TreeMaps.

Embedded SQL databases like HSQLDB are simply too slow for key-value lookups since SQL parsing is expensive. Stand-alone SQL servers require IPC, which rules out that solution as well. We used BerkeleyDB for a while, but due to the lack of documentation and stability issues we had to drop it too. The Java TreeMaps did their job, but they have real limits in terms of size, and serializing them to disk is slow and interrupts operations.

Finally, we decided to implement our own key-value store called BabuDB. It is based on the LSM-Tree concept and is optimized for applications that need a non-transactional key-value store. If you would like to find out more, visit our Google Code project at http://code.google.com/p/babudb/. In the next release of XtreemFS, the MRC will be ported to the new database, which will result in much better performance and better utilization of multi-core processors.
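To make the LSM-Tree idea a bit more concrete, here is a minimal in-memory sketch. It is not BabuDB's API: the class name LsmSketch, its methods and the memtableLimit parameter are made up for illustration, and the "runs" live in memory here, whereas a real store would write them to disk as sorted files.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Hypothetical illustration of the LSM-tree concept; NOT BabuDB's actual API.
class LsmSketch {
    private final int memtableLimit;
    // Mutable in-memory tree that absorbs all writes.
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Immutable sorted runs, newest first. A real store would keep these on disk.
    private final Deque<TreeMap<String, String>> runs = new ArrayDeque<>();

    LsmSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            flush();
        }
    }

    // Freeze the current memtable as a sorted run and start a fresh one.
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        runs.addFirst(memtable);
        memtable = new TreeMap<>();
    }

    // Check the memtable first, then the runs from newest to oldest,
    // so the most recent value for a key always wins.
    String get(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (TreeMap<String, String> run : runs) {
            value = run.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int runCount() {
        return runs.size();
    }

    public static void main(String[] args) {
        LsmSketch db = new LsmSketch(2);
        db.put("a", "1");
        db.put("b", "2"); // reaches the limit, memtable becomes a run
        db.put("a", "3"); // newer value stays in the fresh memtable
        System.out.println(db.get("a")); // prints 3
    }
}
```

Writes only ever touch the small in-memory tree, and older data is never modified in place. That is the property that makes this layout attractive for a store that does not need transactions.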

Monday, October 27, 2008

XtreemFS 0.10.0 released: checksums and more

We have just released XtreemFS 0.10.0 on http://www.xtreemfs.org/. Apart from many quality improvements, it contains several exciting new features:
  • checksum support for file objects, including xtfs_scrub for verification
  • client: caching, improved performance, new platforms (win32, ARM, OS X)
  • UUIDs
  • OSD cleanup
  • plug-in interface
Made popular by ZFS, checksums allow you to verify the integrity of your file data. XtreemFS OSD can now compute, maintain and verify checksums for each file object. Our new scrubbing tool xtfs_scrub allows you to verify checksums on all OSDs in parallel.
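As a rough illustration of per-object checksums, the sketch below computes and verifies a checksum for one object's bytes. It assumes CRC32 from the Java standard library purely as an example; the post does not say which algorithm the OSD uses, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: CRC32 is an assumption, not necessarily what the OSD uses.
class ObjectChecksum {
    // Compute a checksum over the full contents of one file object.
    static long compute(byte[] objectData) {
        CRC32 crc = new CRC32();
        crc.update(objectData, 0, objectData.length);
        return crc.getValue();
    }

    // A scrubber would recompute the checksum and compare it to the stored one.
    static boolean verify(byte[] objectData, long storedChecksum) {
        return compute(objectData) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        long stored = compute(data);
        System.out.println(Long.toHexString(stored)); // prints cbf43926
        data[0] ^= 1; // simulate a flipped bit on disk
        System.out.println(verify(data, stored)); // prints false
    }
}
```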

Our client, an integral part of the XtreemFS architecture, is now much faster than previous versions, and you should be network or disk bound in most cases. We also added caching for file data and metadata, which improves performance considerably. The client now also runs on Linux on ARM devices, and we have added experimental ports for Windows and OS X.

XtreemFS now identifies all its services by UUIDs. This allows you to move services and their data to different hosts and use NATed network setups where only some of the IP addresses can be used.
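The benefit of addressing services by UUID can be sketched with a simple lookup table: clients hold on to the UUID, and only the mapping changes when a service moves. The class below is a hypothetical illustration and says nothing about how XtreemFS actually stores or resolves these mappings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of UUID-to-address resolution; not the XtreemFS implementation.
class ServiceDirectorySketch {
    private final Map<UUID, String> addressByUuid = new HashMap<>();

    // Registering the same UUID again simply replaces the old address.
    void register(UUID service, String address) {
        addressByUuid.put(service, address);
    }

    // Returns the current address, or null if the UUID is unknown.
    String resolve(UUID service) {
        return addressByUuid.get(service);
    }
}
```

After a service moves to a new host, it would call register() again with the same UUID, and every client that resolves the UUID picks up the new address without any change to stored references.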

Further, we now have a tool, xtfs_cleanup, for reclaiming storage space by erasing orphaned objects from OSDs (objects can be orphaned if the client crashes while deleting a file). We have also extended our plug-in interface in the MRC so that you can add your own policies (written in Java) to control its operation.
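Conceptually, finding orphaned objects is a set difference: an object is orphaned if the file it belongs to no longer exists in the metadata. The sketch below shows that idea only. The data layout (a map from object name to file ID) and all names are assumptions for illustration, not how xtfs_cleanup is implemented.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of orphan detection; not the xtfs_cleanup implementation.
class OrphanScanSketch {
    // objectToFileId: every object stored on an OSD, mapped to the file it belongs to.
    // liveFileIds: file IDs that the metadata server still knows about.
    static Set<String> findOrphans(Map<String, String> objectToFileId,
                                   Set<String> liveFileIds) {
        Set<String> orphans = new TreeSet<>();
        for (Map.Entry<String, String> entry : objectToFileId.entrySet()) {
            if (!liveFileIds.contains(entry.getValue())) {
                orphans.add(entry.getKey());
            }
        }
        return orphans;
    }
}
```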

Friday, October 24, 2008

Client News: Caching, Win32, OS X, ARM

The client of XtreemFS contains a considerable part of the logic of XtreemFS as it has to coordinate the metadata and object storage servers (OSDs) to provide the user with a POSIX-compliant file system abstraction. 

This setup is great from an architectural viewpoint because it scales well as it takes tasks away from the metadata and storage servers. But it also makes the client a complex beast.

In the last weeks we have put a lot of work into the client, and you will be able to use the new features soon as part of the 0.10.0 release.

The first area of work was caching. While the operating system already caches file data in the page cache, the granularity of the accesses is mostly 4k - too small if you have to fetch each of these over the network. With client-side caching, we can now fetch data with object granularity, which is usually a few hundred k or more. We will follow up with real measurements soon, but the performance improvements are very good.
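The effect of object granularity can be sketched as follows: a read at some byte offset is mapped to an object number, the whole object is fetched once, and subsequent 4k reads inside the same object are served from memory. Everything here is a simplified assumption, not the actual client code: the class name, the fetch callback standing in for a network request to an OSD, and the premise that every object is exactly objectSize bytes long.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical sketch of an object-granularity read cache; not the XtreemFS client.
class ObjectCacheSketch {
    private final int objectSize;
    // Stands in for a network fetch of one whole object from an OSD.
    private final LongFunction<byte[]> fetchObject;
    private final Map<Long, byte[]> cache = new HashMap<>();
    private int fetches = 0;

    ObjectCacheSketch(int objectSize, LongFunction<byte[]> fetchObject) {
        this.objectSize = objectSize;
        this.fetchObject = fetchObject;
    }

    // Serve a byte range, fetching each touched object at most once.
    // Assumes every object is exactly objectSize bytes long.
    byte[] read(long offset, int length) {
        byte[] out = new byte[length];
        int done = 0;
        while (done < length) {
            long pos = offset + done;
            long objectNo = pos / objectSize;
            int offsetInObject = (int) (pos % objectSize);
            byte[] object = cache.get(objectNo);
            if (object == null) {
                object = fetchObject.apply(objectNo);
                cache.put(objectNo, object);
                fetches++;
            }
            int n = Math.min(length - done, objectSize - offsetInObject);
            System.arraycopy(object, offsetInObject, out, done, n);
            done += n;
        }
        return out;
    }

    int fetchCount() {
        return fetches;
    }
}
```

With a 128k object size, 32 consecutive 4k reads end up as a single fetch instead of 32 separate network round trips.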

Client-side caching is also laying the foundation for prefetching, RAID, and other things that are on our long-term roadmap. 

We have also implemented a metadata cache that allows the client to retrieve the results of a readdir() together with all the stat()s in one RPC - important for long-latency networks like DSL or installations over the Internet. But it also feels snappier on the LAN.
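The idea behind the metadata cache can be sketched like this: one call returns the directory entries together with their attributes, the client stores the attributes, and the stat() calls that typically follow a readdir() are answered locally. The Stat type, the readdirPlus callback and all other names are hypothetical stand-ins, not the real client or MRC interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a readdir-with-stat metadata cache; not the real client code.
class MetadataCacheSketch {
    // Minimal stand-in for a stat() result.
    static final class Stat {
        final long size;

        Stat(long size) {
            this.size = size;
        }
    }

    // Stands in for ONE RPC that returns entry names together with their stats.
    private final Function<String, Map<String, Stat>> readdirPlus;
    private final Map<String, Stat> statCache = new HashMap<>();
    private int rpcs = 0;

    MetadataCacheSketch(Function<String, Map<String, Stat>> readdirPlus) {
        this.readdirPlus = readdirPlus;
    }

    // One RPC per directory listing; all returned stats are cached by full path.
    List<String> readdir(String dir) {
        Map<String, Stat> entries = readdirPlus.apply(dir);
        rpcs++;
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Stat> entry : entries.entrySet()) {
            statCache.put(dir + "/" + entry.getKey(), entry.getValue());
            names.add(entry.getKey());
        }
        return names;
    }

    // Served from the cache; returns null on a miss, where a real client
    // would fall back to a separate RPC.
    Stat stat(String path) {
        return statCache.get(path);
    }

    int rpcCount() {
        return rpcs;
    }
}
```

Listing a directory with N entries then costs one round trip instead of N + 1, which is where the gain on long-latency links comes from.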

We have also started porting the client to platforms other than Linux/FUSE. We have been running on OS X with Mac FUSE for quite a while now (available on our download page), and have now extended our coverage of platforms to Windows and ARM Linux. The Windows client is using the fantastic Dokan library with the fuse4win adapter. It will be available on the download page soon.

Wednesday, August 20, 2008

FUSE performance

The XtreemFS client is implemented as a FUSE driver. Therefore, the throughput of FUSE could also be a limiting factor for the overall performance of our file system. Matthias implemented a simple "emptyfs" FUSE driver which simply discards all data. I used the driver to measure bandwidth from an application through the VFS layer and FUSE to the user-level process. The machine I ran the test on has two CPUs with four cores each (Xeon E5420 @ 2.5GHz) with 16GB RAM. I used dd to transfer 2GB of data with block sizes from 4k to 64MB.

The results are plotted in two graphs. The first graph shows the throughput in MB/s as reported by dd. The second graph shows the CPU usage (sy = system, us = user) and the number of context switches.

graph 1 (write bandwidth in MB/s)

graph 2 (CPU usage, context switches)

With these results (2GB/s) for 128k or larger blocks, it is easy to see that FUSE is not the limiting factor for us. But this also shows that FUSE without the direct_io option has real performance problems, as all write requests are split into 4k writes. So, you have to choose between performance and the ability to execute files (mmap does not work when direct_io is enabled, see this FUSE mailing list entry).
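A quick back-of-the-envelope calculation shows why the split into 4k writes hurts: the number of requests that have to cross from the kernel to the user-level process grows by a factor of 32 compared to 128k blocks. The snippet assumes that "2GB" means 2 * 2^30 bytes and that each block results in exactly one request; the class name is made up.

```java
// Hypothetical helper for counting write requests; assumes 2GB = 2 * 2^30 bytes.
class WriteRequestCount {
    // Number of requests needed to transfer totalBytes in blocks of blockBytes
    // (rounded up for a partial last block).
    static long requests(long totalBytes, long blockBytes) {
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long total = 2L * 1024 * 1024 * 1024;
        System.out.println(requests(total, 4 * 1024));   // prints 524288
        System.out.println(requests(total, 128 * 1024)); // prints 16384
    }
}
```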

Thursday, August 14, 2008

XtreemFS 0.9.0 released

We have released XtreemFS 0.9.0, a distributed file system for federated IT infrastructures. XtreemFS is an integrated part of the XtreemOS Linux operating system for the Grid, but can also be run on various other Linux distributions.

XtreemFS is a full file system that features:
  • full POSIX compliance, incl. POSIX ACLs and extended attributes
  • parallel access to striped files, stripe width configurable per file
  • scalable installations by adding more storage and metadata servers
  • transparent integration into XtreemOS' user and VO management
  • integration into various X.509 authentication infrastructures
  • mountable on all systems that support FUSE, incl. OS X
XtreemFS is GPL-licensed and available via http://www.xtreemfs.com as source code and packaged for various Linux distributions, incl. XtreemOS.

... Felix for the XtreemFS development team