SYNOPSIS

xml2x - convert BLAST output in XML format, either to a (CSV) table
suitable for e.g. importing into Excel or OOCalc, or to HTML.
Optionally, the output can be annotated with GO terms and KEGG KOs.

INSTALLATION

The usual cabal routine; it should also be possible to compile via the
Makefile.

USAGE

    xml2x [options] xmlfile1 xmlfile2...

Use -v if you are on an interactive terminal to keep track of progress.

Output format is specified with -C (CSV) or -H (HTML), with -C being
the default. Note that only one output format can be used at a time.

CSV OUTPUT

For CSV output, the following modes are supported:

    --all     output all BLAST matches (HSPs), one per line
    --top     output only the top hit for each input sequence
    --region  output top hit for regions that overlap <50%

Use -o to specify an output file; the default is to write to standard
output.

HTML OUTPUT

For HTML output, a directory called "blast.d" is created (or re-used if
already present), and an index is constructed in a file named
"index.html" in the current directory. The index lists some information
about the highest scoring BLAST hit, and links to the file displaying
the alignment.

The directory contains one HTML file per input sequence, and uses an
HTML table to render the alignments. Color codes indicate level of
identity (not total match score or E-value!), so short, bright red
matches may have a lower score than long gray ones. Frame (for BLASTX)
or strand (for BLASTN) is indicated as text for each match.

The files are named consistently, so if you run BLAST in both
directions (i.e. swapping -i and -d), you should be able to go back and
forth by clicking on the sequence names.

ANNOTATIONS

Options include --annotations to specify the mapping between UniProt
accessions and GO terms. This file is usually called
"gene_association.goa_uniprot", and is available from the GO
consortium [1].
The file is several GB, so you may want to consider trimming it down by
filtering out the automatic (IEA) annotations. However, xml2x first
scans the BLAST output and extracts only the relevant GO annotations,
so keeping it all in memory is not necessary.

Additionally, you can use --ontology to specify the description of the
GO terms, and the output will then be somewhat more meaningful. The
file is usually called "gene_ontology.obo", similarly available [2].

You can also add KEGG annotations with the -k (or --kegg-organism)
option. This option takes a file prefix as a parameter, and for a
prefix $P, expects to find files $P_uniprot.list and $P_ko.list. These
files are read and used to map KEGG KOs to each UniProt hit. They are
available from [3].

BUGS

XML parsing is slow, but ndm said he'd look into it.

Must be compiled with -smp to avoid huge memory requirements, but the
plus side is that with -smp, we use a lot less RAM than AutoFact.

REFERENCES

[1] ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/
[2] http://www.geneontology.org/ontology/gene_ontology.obo
[3] ftp://ftp.genome.jp/pub/kegg/genes/organisms/
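
EXAMPLE: TRIMMING THE ANNOTATION FILE

The ANNOTATIONS section suggests filtering out the automatic (IEA)
annotations to shrink the GO association file. The following is a
minimal sketch of one way to do that. It is not part of xml2x. It
assumes the file is in the tab-separated GAF layout, with comment lines
starting with "!" and the evidence code in the seventh column. The
names drop_iea and trim_file are made up for illustration, and the
sample records are invented placeholders, not real annotations.

```python
from typing import Iterable, Iterator

# Assumption: GAF layout, evidence code in the 7th tab-separated column.
EVIDENCE_COLUMN = 6  # 0-based index


def drop_iea(lines: Iterable[str]) -> Iterator[str]:
    """Yield annotation lines, skipping records whose evidence code is IEA."""
    for line in lines:
        if line.startswith("!"):
            # Header/comment lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA":
            continue
        yield line


def trim_file(src: str, dst: str) -> None:
    """Stream src to dst line by line, so the file is never held in memory."""
    with open(src) as inp, open(dst, "w") as out:
        out.writelines(drop_iea(inp))


# Invented placeholder records, only to show the filter at work.
sample = [
    "!gaf-version: 2.0\n",
    "\t".join(["UniProtKB", "X00001", "GENE1", "", "GO:0000001",
               "REF:1", "IEA", "", "P"]) + "\n",
    "\t".join(["UniProtKB", "X00002", "GENE2", "", "GO:0000002",
               "REF:2", "IDA", "", "P"]) + "\n",
]
kept = list(drop_iea(sample))
```

To trim a real file, something like
trim_file("gene_association.goa_uniprot", "goa_trimmed") would write a
smaller copy (the output name here is arbitrary), which can then be
passed to --annotations.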