Friday, May 17, 2013

Processing an MRC metadata dump with XSLT


TL;DR We describe how to dump the metadata of an XtreemFS installation to an XML file. The XML dump is then filtered with XSLT for files located on a specific OSD. You can use this example for your own analyses of your file system's metadata.

At our institute we run an XtreemFS installation for scientific users. The installation spans 16 OSDs, which are hosted at our site, and it is regularly accessed by three other institutes throughout Germany. During recent maintenance work we lost all chunks of one OSD due to human error: I accidentally deleted all chunks of that OSD because I mistook the directory for a backup, when in fact it was the last remaining copy. Since the installation is meant for temporary scientific data, we decided against replication and backups at deployment to maximize the available capacity. (Single-disk failures are covered by the underlying RAID5 used on each OSD.)


Nonetheless, it was necessary to inform all users about their deleted files. Therefore, I had to find out which files were placed on the affected OSD. XtreemFS stores the list of replicas per file at the MRC (Metadata and Replica Catalog). The MRC can dump and restore its metadata in XML format. To find the affected files, I filtered the XML dump using XSLT. This blog post details the required steps. You can use the provided example to run your own analyses on your file system's metadata.

Create an MRC database dump
You can use the XtreemFS tool xtfs_mrcdbtool to dump or restore the MRC database. The MRC writes and reads the dump on its own machine, not on the machine where you run the tool. Therefore, the path you specify refers to the file system of the MRC host:
xtfs_mrcdbtool -mrc mrc-host.example.com dump /tmp/dump.xml
This command will tell the MRC to write the database dump to the file /tmp/dump.xml. Make sure that the MRC has write permission for the given path. If you configured an "admin_password" for the MRC, you have to set the option --admin_password as well.

Filter the XML database dump using XSLT 
The MRC database dump is in XML format. The XML tree in the dump contains the file system tree of each volume.

You can use XSLT (Extensible Stylesheet Language Transformations) to filter the dump and transform the output into a more human-readable form. I've added an example file to our code repository: filter_files.xslt. You have to use an XSLT processor to transform the original XML dump. For example, use xsltproc:
xsltproc -o filtered_files_output.txt filter_files.xslt /tmp/dump.xml
The resulting file filtered_files_output.txt will have the following output format:
volume name/path on volume|creation time|file size|file's owner name
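If you want to notify users about their affected files, it can help to group the output by owner. The following Python snippet is only a minimal sketch of that idea, not part of the XtreemFS tooling. It assumes that the file size is an integer number of bytes and that the owner name contains no "|" character. The sample lines are made up, and the creation time is kept as an opaque string because its exact format is not specified here.

```python
from collections import defaultdict


def parse_line(line):
    """Split one output line into (path, creation_time, size, owner)."""
    # Split from the right so that a '|' inside the path does not break parsing.
    path, ctime, size, owner = line.rstrip("\n").rsplit("|", 3)
    return path, ctime, int(size), owner  # assumption: size is an integer


def group_by_owner(lines):
    """Map each owner to a list of (path, size) tuples of their files."""
    by_owner = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        path, _ctime, size, owner = parse_line(line)
        by_owner[owner].append((path, size))
    return dict(by_owner)


# Made-up sample lines in the format described above (hypothetical values).
sample = [
    "vol1/data/a.dat|1368781200|1048576|alice",
    "vol1/data/odd|name.dat|1368781300|2048|bob",
    "vol1/results/c.dat|1368781400|512|alice",
]

report = group_by_owner(sample)
for owner, files in sorted(report.items()):
    total = sum(size for _path, size in files)
    print(f"{owner}: {len(files)} file(s), {total} bytes")
```

To process a real result, pass the lines of filtered_files_output.txt to group_by_owner instead of the sample list.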
Modify the filter_files.xslt file to include or exclude other file attributes. This example handles only files which are (at least partially) placed on an OSD with the UUID "zib.mosgrid.osd15". This is done by the following instruction in the XSLT file, which limits the set of selected "file" elements:
<xsl:template match="file[xlocList/xloc/osd/@location='zib.mosgrid.osd15']">
Write your own XPath expression to implement your own filters. If you want all files, just write match="file" without the predicate in square brackets.
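If you are not familiar with XPath predicates, it may help to see the same selection logic spelled out in a general-purpose language. The following Python snippet is only an illustrative sketch using the standard library's xml.etree.ElementTree, not a replacement for the XSLT file. Only the element path file/xlocList/xloc/osd and the "location" attribute are taken from the expression above. The surrounding elements and the "name" attributes in the sample are assumptions for demonstration and may differ from a real dump.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified dump. Only file/xlocList/xloc/osd/@location
# comes from the XPath expression in the post; everything else is assumed.
SAMPLE_DUMP = """
<filesystem>
  <volume name="vol1">
    <dir name="data">
      <file name="a.dat">
        <xlocList><xloc><osd location="zib.mosgrid.osd15"/></xloc></xlocList>
      </file>
      <file name="b.dat">
        <xlocList><xloc><osd location="zib.mosgrid.osd03"/></xloc></xlocList>
      </file>
      <file name="c.dat">
        <xlocList>
          <xloc>
            <osd location="zib.mosgrid.osd03"/>
            <osd location="zib.mosgrid.osd15"/>
          </xloc>
        </xlocList>
      </file>
    </dir>
  </volume>
</filesystem>
"""


def files_on_osd(root, osd_uuid):
    """Return all <file> elements that reference osd_uuid at least once."""
    matches = []
    for file_elem in root.iter("file"):
        osds = file_elem.findall("xlocList/xloc/osd")
        # Same condition as the XPath predicate: any osd/@location equals the UUID.
        if any(osd.get("location") == osd_uuid for osd in osds):
            matches.append(file_elem)
    return matches


root = ET.fromstring(SAMPLE_DUMP)
affected = [f.get("name") for f in files_on_osd(root, "zib.mosgrid.osd15")]
print(affected)
```

As with the XPath predicate, a file matches as soon as one of its OSD entries carries the given UUID, so files that are only partially placed on that OSD (like c.dat in the sample) are included.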