This is a rough implementation of a generic data store. It grew out of the need for a structured way to manage molecular data: data that comes in a variety of file formats and analyses, and where rapidly advancing technology ensures that new formats keep coming.
This is a darcs repository. If you have
darcs installed, you can get a copy by running
darcs get http://malde.org/~ketil/datastore
Each data set is a separate subdirectory, containing arbitrary files and subdirectories. A file named
meta.xml is mandatory, and contains the metadata describing the data set.
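As a rough illustration of the layout rule above, the following sketch lists subdirectories that lack the mandatory meta.xml. The function name missing_meta is hypothetical and not part of the repository, and it assumes that data sets are the immediate subdirectories of the store root, with META holding the scripts.

```shell
#!/bin/sh
# Sketch only: report data set directories without the mandatory meta.xml.
# missing_meta is a hypothetical helper, not one of the repository scripts.

missing_meta() {
    root=${1:-.}
    for dir in "$root"/*/; do
        # If there are no subdirectories the glob stays unexpanded; skip it.
        [ -d "$dir" ] || continue
        # Assumption: META contains the scripts and is not a data set.
        case "${dir%/}" in
            */META) continue ;;
        esac
        if [ ! -f "${dir}meta.xml" ]; then
            # Print the directory name without its trailing slash.
            printf '%s\n' "${dir%/}"
        fi
    done
}
```

Calling missing_meta with the store root as its argument prints one offending directory per line, and prints nothing when every data set has its metadata file.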
This directory contains the bulk of the implementation, mostly in the form of shell scripts.
The scripts rely heavily on
xmlstarlet to extract information from the metadata files. This is available through most Linux distributions, or the link above. Unfortunately,
xmlstarlet cannot read RNC schemas directly, so we need
trang to convert them to RNG.
Some of the services require special software, e.g. the viroblast service requires
viroblast, and the metadata search service uses
xapian and its
omega web interface.
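The RNC-to-RNG step mentioned above might look roughly like the sketch below. The function name validate_meta and the schema file names are assumptions; the repository's own xmlcheck.sh may work differently. It relies on trang inferring formats from file extensions and on the --relaxng option of xmlstarlet val.

```shell
#!/bin/sh
# Sketch only: convert a RELAX NG compact schema to XML syntax with trang,
# then validate a metadata file against it with xmlstarlet.
# validate_meta and the file names passed to it are hypothetical.

validate_meta() {
    rnc=$1                  # schema in RELAX NG compact syntax (.rnc)
    xml=$2                  # metadata file to check
    rng="${rnc%.rnc}.rng"   # converted schema, written next to the input

    # trang picks the input and output formats from the file extensions.
    trang "$rnc" "$rng" || return 1

    # -e prints detailed error messages; --relaxng selects the RNG schema.
    xmlstarlet val -e --relaxng "$rng" "$xml"
}
```

A call such as validate_meta schema.rnc DataSet/meta.xml would then report whether the file conforms, with schema.rnc standing in for whatever the schema file is actually called.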
To add a data set, make a new subdirectory, populate it with the files that constitute the data set, and run gen_meta.sh on it:
mkdir DataSet
mv [....] DataSet/
META/gen_meta.sh DataSet
You should now have a
meta.xml file in DataSet. If you run
META/xmlcheck.sh, you will likely get some warnings. Now edit the metadata file and fill in the details. Then check it again with
xmlcheck.sh, and when it passes, you are done as far as the system is concerned.
When gen_meta.sh is run, two metadata sections are generated with placeholder contents (just an ellipsis). The
<description> section should contain a text describing what the data set is, while the
<provenance> section should describe how the data set came about (usually corresponding to 'methods'). Plain text is fine, and can be used by many services, e.g. it will be indexed and searched by the xapian/omega service. But in order to be more directly useful, one might want to add more structure to the text. Currently, the following tags are defined to help with this:
A reference to a species can be wrapped with a
<species> tag. The content is just free text, but the tag has a required attribute,
tsn, and an optional one
sciname. Typically, it would look something like:
<species tsn="89113" sciname="Lepeophtheirus salmonis">Atlantic salmon louse</species>
Often it is useful to refer to other datasets. Again, the content is plain text, but a required attribute (id) must point to the dataset ID (i.e. its directory name), and an optional attribute (rel) describes the kind of relationship. For example:
...replaces the <dataset id="LSalSFF" rel="supersedes">454 libraries</dataset>...
The possible values for
These are currently plain text fields; this is likely to change as more structure is added in the future.
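To show how the <species> and <dataset> tags described above could be put to use, here is a minimal sketch built on xmlstarlet sel. The function names list_species and dangling_refs are hypothetical, and the XPath expressions assume the metadata file does not use an XML namespace.

```shell
#!/bin/sh
# Sketch only: query the structured tags in a meta.xml file.
# list_species and dangling_refs are hypothetical helpers.

# Print "tsn | sciname | text" for every <species> element in the file.
list_species() {
    xmlstarlet sel -t -m '//species' \
        -v '@tsn' -o ' | ' -v '@sciname' -o ' | ' \
        -v 'normalize-space(.)' -n "$1"
}

# Print the id of every <dataset> reference that has no matching
# directory under the store root (second argument, default ".").
dangling_refs() {
    meta=$1
    root=${2:-.}
    xmlstarlet sel -t -m '//dataset' -v '@id' -n "$meta" |
    while IFS= read -r id; do
        # Ignore empty lines, e.g. from a reference without an id.
        [ -n "$id" ] || continue
        [ -d "$root/$id" ] || printf '%s\n' "$id"
    done
}
```

Running list_species on a data set's meta.xml gives one line per species mention, and dangling_refs on the same file with the store root gives a quick check that references such as id="LSalSFF" point at directories that exist.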
A simple test dataset (aptly, if not imaginatively, named
Test) is also included.