Name | Last modified | Size
---|---|---
Parent Directory | | -
META/ | 2013-10-04 21:58 | -
Test/ | 2013-10-01 16:40 | -
_darcs/ | 2013-10-04 21:58 | -
README.md | 2013-09-17 15:51 | 4.3K
README.html | 2013-09-17 15:51 | 5.2K
This is a rough implementation of a generic data store. It grew out of the need for a structured way to manage molecular data: data that come in a variety of file formats and with a variety of analyses, in a field where rapidly advancing technology ensures that new formats keep coming.
This is a darcs repository. If you have darcs installed, you can get a copy by running
darcs get http://malde.org/~ketil/datastore
Each data set is a separate subdirectory, containing arbitrary files and subdirectories. A file named meta.xml is mandatory, and contains the metadata describing the data set.
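The layout rule can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name missing_meta is hypothetical, and it assumes that every subdirectory of the store root is a data set, except META (the scripts) and _darcs (darcs bookkeeping).

```shell
# Sketch: list data set directories that lack the mandatory meta.xml.
# Assumption: every subdirectory of the store root is a data set,
# except META and _darcs.
missing_meta() {
    root=${1:-.}
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        name=$(basename "$dir")
        case "$name" in
            META|_darcs) continue ;;   # not data sets
        esac
        # Print the directory name when meta.xml is absent.
        [ -f "$dir/meta.xml" ] || printf '%s\n' "$name"
    done
}
```

Called as `missing_meta /path/to/datastore`, it prints one directory name per line and prints nothing when every data set has its metadata file.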
The META directory contains the bulk of the implementation, mostly in the form of shell scripts.
The scripts rely heavily on xmlstarlet to extract information from the metadata files. It is available in most Linux distributions, or from the xmlstarlet project site. Unfortunately, xmlstarlet cannot read RNC schemas directly, so trang is needed to convert them to RNG.
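The two tools fit together roughly as follows. This is a sketch under assumptions, not the repository's own script: the function name validate_meta and the schema path passed to it are hypothetical, and the checks actually performed by the scripts in META may differ. It relies on trang choosing its input and output formats from the file extensions, and on the --relaxng option of xmlstarlet's val subcommand.

```shell
# Sketch: convert an RNC schema to RNG with trang, then validate a
# metadata file against the result with xmlstarlet.
# Usage: validate_meta SCHEMA.rnc DataSet/meta.xml
validate_meta() {
    rnc=$1
    xml=$2
    # Fail early if either tool is missing.
    command -v trang >/dev/null 2>&1 || { echo "trang not found" >&2; return 127; }
    command -v xmlstarlet >/dev/null 2>&1 || { echo "xmlstarlet not found" >&2; return 127; }

    # trang picks the output format from the .rng extension,
    # so write the converted schema into a temporary directory.
    tmpdir=$(mktemp -d) || return 1
    rng=$tmpdir/schema.rng

    trang "$rnc" "$rng" && xmlstarlet val --err --relaxng "$rng" "$xml"
    status=$?

    rm -rf "$tmpdir"
    return $status
}
```

The function returns zero only when the conversion succeeds and the file validates. On some systems the xmlstarlet binary is installed under the name xml, in which case the command name needs adjusting.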
Some of the services require special software, e.g. the viroblast service requires viroblast, and the metadata search service uses xapian and its omega web interface.
To add a data set, make a new subdirectory, and populate it with the files that constitute the data set. Then run gen_meta.sh.
mkdir DataSet
mv [....] DataSet/
META/gen_meta.sh DataSet
You should now have a meta.xml file in DataSet. If you run META/xmlcheck.sh, you will likely get some warnings. Now edit the metadata file and fill in the details. Then check it (again) with xmlcheck, and when it passes, you are done as far as the system is concerned.
When gen_meta.sh is run, two metadata sections are generated with empty (that is, just an ellipsis) contents. The <description> section should contain text describing what the data set is, while the <provenance> section should describe how the data set came about (usually corresponding to 'methods'). Plain text is fine, and can be used by many services, e.g. it will be indexed and searched by the xapian/omega service. But to be more directly useful, the text can be given more structure. Currently, the following tags are defined to help with this:
A reference to a species can be wrapped in a <species> tag. The content is just free text, but the tag has a required attribute, tsn, and an optional one, sciname. Typically, it would look something like:
<species tsn="89113" sciname="Lepeophtheirus salmonis">Atlantic salmon louse</species>
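Structured tags like this are straightforward to pull out again. A minimal sketch using xmlstarlet's sel subcommand follows. The function name list_species is hypothetical, and the XPath assumes that meta.xml does not declare an XML namespace, which is an assumption rather than something stated here.

```shell
# Sketch: print "tsn sciname" for every <species> element in a
# metadata file, one element per line.
# Assumption: the elements are not in an XML namespace.
list_species() {
    xmlstarlet sel -t -m '//species' -v '@tsn' -o ' ' -v '@sciname' -n "$1"
}
```

Run as `list_species DataSet/meta.xml`. For an element without the optional sciname attribute, only the tsn value (followed by a space) is printed.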
Often it is useful to refer to other data sets, which is done with a <dataset> tag. Again, the content is plain text, but a required attribute, id, must point to the data set ID (i.e. its directory name), and an optional attribute, rel, describes the kind of relationship. For example:
...replaces the <dataset id="LSalSFF" rel="supersedes">454 libraries</dataset>...
The possible values for rel are supersedes, subsumes, and uses.
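Both constraints on a data set reference (the id must name an existing directory, and rel, if given, must be one of the three values) are easy to check in plain shell. The sketch below is illustrative only: the function name check_dataset_ref is hypothetical, and it takes the attribute values as arguments rather than reading them from meta.xml.

```shell
# Sketch: check one <dataset> reference.
# $1 = store root, $2 = value of id, $3 = value of rel
# (rel may be empty or omitted, since the attribute is optional).
check_dataset_ref() {
    root=$1
    id=$2
    rel=${3:-}
    # rel must be absent or one of the defined relationship kinds.
    case "$rel" in
        ''|supersedes|subsumes|uses) ;;
        *) echo "unknown rel: $rel" >&2; return 1 ;;
    esac
    # id must be the directory name of an existing data set.
    if [ ! -d "$root/$id" ]; then
        echo "no such data set: $id" >&2
        return 1
    fi
}
```

For instance, `check_dataset_ref . LSalSFF supersedes` succeeds only if a directory named LSalSFF exists in the current store root.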
These are currently plain text fields, but this is likely to change as more structure is added in the future.
A simple test data set (aptly, if not imaginatively, named Test) is also included.