Introduction
============

The modules in `Bio.GFF3` are designed to fairly directly represent
the data present in a [GFF](http://song.sourceforge.net/gff3.shtml)
sequence annotation data file.  This data can include actual sequence
data as well as annotated sequence features.  Here I'll present
small, useful example programs that uses both kinds of data.

GFF annotation data is just a table of features with sequence
locations.  Each feature covers a contiguous span of sequence
positions, like a `Bio.Location.ContigLocation`.  Features in the
table are arranged into a hierarchy, with parent and child features.
For example, a gene feature could have a collection of child features
that represent the distinct exons of a spliced transcript.  The parent
and child feature relationships are managed by the `Bio.GFF3` modules.

> import Control.Monad
> import Control.Monad.Error
> import qualified Data.ByteString.Lazy.Char8 as LBS
> import Data.List
> import Data.Ord
> import System.IO
>
> import qualified Bio.GFF3.Feature as F
> import qualified Bio.GFF3.FeatureHier as FH
> import Bio.GFF3.GFF
> import qualified Bio.Location.ContigLocation as CLoc
> import Bio.Location.Display
> import qualified Bio.Location.Location as Loc
> import Bio.Location.OnSeq
> import qualified Bio.Location.SeqLocation as SeqLoc
> import Bio.Location.Strand
> import Bio.Sequence.SeqData
> import Bio.Sequence.SequenceReader

Sequence data
=============

The `SequenceReader` typeclass in the `Bio.Sequence.SequenceReader`
module is like a `MonadReader` specialized for a database of
sequences.  It represents a computation performed in the context of
database of named sequences.  It has a function, `askSequence`, to
look up a sequence by name.  It uses `MonadError` to report
errors--usually, the error is that a sequence name doesn't exist.

For these examples, I'll use the yeast genome.  The sequence is
available in an [18 MB GFF
file](http://downloads.yeastgenome.org/chromosomal_feature/saccharomyces_cerevisiae.gff)
from the [Saccharomyces Genome Database](http://www.yeastgenome.org).
The Fasta format sequence data starts after the `##FASTA` line.  I cut
out everything before that line and saved the rest at
`/data/Libraries/SGD/saccharomyces_cerevisiae.fa`.

> yeastGFF :: FilePath
> yeastGFF = "/data/Libraries/SGD/saccharomyces_cerevisiae.gff"
> yeastFasta :: FilePath
> yeastFasta = "/data/Libraries/SGD/saccharomyces_cerevisiae.fa"

This simple monadic action looks up the sequences of a couple of yeast
chromosomes and prints a message that tells me how long they are.

> someChromosomeLengths :: (MonadError String m, SequenceReader m) => m String
> someChromosomeLengths = liftM unlines . mapM displayChromoLength $ chromoNames
>     where chromoNames = map LBS.pack [ "chrIII", "chrIV" ]
>
> displayChromoLength :: (MonadError String m, SequenceReader m) => LBS.ByteString -> m String
> displayChromoLength name = do (Seq name sequ _) <- askSequence name
>                               return $ "Length of" ++ LBS.unpack name ++ " is " ++ (show . LBS.length) sequ

To actually run this action, we need an actual monad that acts as a
`SequenceReader`.  The module provides one concrete instance,
`MapSequenceReaderT`, which uses a simple in-memory `Data.Map.Map`
from sequence names to sequences.  There is a convenience function to
read in sequences from a Fasta format file and build a `Map` with
sequence name keys.  I use this function with a Fasta file of the
yeast genomic sequence.  I then need to use `runErrorT` to unwrap
errors from inside the `ErrorT String IO` monad into an `IO (Either
String a)` result in the `IO` monad.

> runYeastFasta :: MapSequenceReaderT (ErrorT String IO) String -> IO ()
> runYeastFasta m = (runErrorT $ runFromFasta m yeastFasta) >>= handleErrors
>     where handleErrors (Right res) = putStr res
>           handleErrors (Left err)  = hPutStrLn stderr err

Running the chromosome length action gives the following results:

	*Main> runYeastFasta someChromosomeLengths
	Length of chrIII is 316617
	Length of chrIV is 1531918

Next, I'll get the sequence of the yeast gene *COS7*, the left-most
gene on chromosome IV, which occupies nucleotides 1802 to 2953 in the
1-based coordinates that are frequently used in biology.  To do this,
I just retrieve the entire sequence, then take a substring.

> cos7Sequence :: (MonadError String m, SequenceReader m) => m String
> cos7Sequence = liftM extractCos7 $ askSequence chrIV
>     where chrIV = LBS.pack "chrIV"
>           extractCos7 (Seq _ chrseq _) 
>                           = LBS.unpack . LBS.take (1 + cos7End - cos7Start) . LBS.drop (cos7Start - 1) $ chrseq
>           (cos7Start, cos7End) = (1802, 2953)

Running this sequence retrieval action produces the following results

	*Main> runYeastFasta cos7Sequence 
	ATGAAAGAGAATGAAGTCAAAGATGAGAAAAGCGTAGATGTGTTATCCTTCAAACAGCTC

*...*

	TGGCCATATATTAAAGAAGCGCAATTGTCCCGCAATGAAGAGTCTTTAATGAAGAAATGA

Since this is supposed to be a protein-coding gene, it's good to see
that it starts with an ATG and ends with a stop codon, in this case
TGA.  Using `take` and `drop` to pull out the subsequence is awkward,
and it only becomes more awkward when dealing with spliced genes with
multiple exons.  The `Bio.Sequence.SeqLocation` module has monadic
actions that handle this for us.  I'll use the `SeqLoc.askSeqData`
action to pull out the sequence of the spliced yeast gene *DTD1*.

> dtd1SeqLoc :: SeqLoc.SeqLoc
> dtd1SeqLoc = (OnSeq (LBS.pack "chrIV") dtd1Loc)
>     where dtd1Loc = Loc.Loc [ CLoc.fromStartEnd 65242 65306
>                             , CLoc.fromStartEnd 65378 65765
>                             ]

> dtd1Sequence :: (MonadError String m, SequenceReader m) => m String
> dtd1Sequence = liftM LBS.unpack $ SeqLoc.askSeqData dtd1SeqLoc

Running this sequence retrieval action yields

	*Main> runYeastFasta dtd1Sequence
	ATGAAGATTGTCTTACAAAAAGTCAGCCAAGCATCTGTAGTCGTCGATTCAAAAGTTATT

*...*

	ATGAGTTGCTCTTTAACTAATGAAGGGCCCGTTACAATCATTCTTGACAGTGACCAATAA

Again, the sequence starts with an ATG and ends with a stop codon.  I
can also check that the retrieved sequence is the correct, spliced
length.

	*Main> runYeastFasta . liftM (show . LBS.length) . SeqLoc.askSeqData $ dtd1SeqLoc
	453
	*Main> 453 / 3
	151.0

The sequence we extract is the right length to encode a 151 amino acid
protein.  Building the sequence location to use with `askSeqData`
isn't as error-prone as directly taking and dropping parts of the
sequence, but it's still tedious.  The next step is getting these
locations from the annotation portion of a GFF file.

Annotated Features
==================

The `FeatureHierReader` typeclass is another `MonadReader`-like class
that provides a database of annotations.  I can ask for the full
feature table using `askFeatureHier`, but usually I'll use convenience
functions that access some particular part of the `FeatureHier`.  I'll
start by getting the features for *COS7* and *DTD1*.  I look these up
using their IDs, which will be their systematic yeast gene names,
YDL248W and YDL219W.  The `featureByID` is the key action here; the
rest of the function just serves to show the feature and to map this
feature-showing action over the list of test genes.

> testFeatures :: (FH.FeatureHierReader m, MonadError String m) => m String
> testFeatures = liftM unlines . mapM testFeature $ tests
>     where tests = map LBS.pack [ "YDL248W", "YDL219W" ]
>           testFeature = liftM show . FH.featureById

Running this action requires a source of feature annotations.  The
`Bio.GFF3.GFF` module has a function that reads in annotation data
from a GFF file and uses it to run a `GFF` action, which provides a
`FeatureHierReader`.  This uses the full
`saccharomyces_cerevisiae.gff` file I downloaded, above.

	*Main> runGFF yeastGFF testFeatures
	Right "Feature {seqid = Chunk \"chrIV\" Empty, source = Chunk \"SGD\" Empty, ftype = Chunk \"gene\" Empty, start = 1802, end = 2953

This pulls out a fairly long representation of the two features,
wrapped in an `Either String` for error handling.  I can see that the
first feature is on chromosome IV from 1802 to 2953, which is the
location of *COS7*.  To see the location more clearly, I'll write a
little function to print this information more clearly.  It requires
error handling because not every feature has a `name`.

> displayFeatureLoc :: (MonadError String m) => F.Feature -> m String
> displayFeatureLoc f = liftM displayStr $ F.ident f
>     where displayStr namelbs = LBS.unpack namelbs ++ " is at " ++ locstr
>           locstr = display . F.seqLoc $ f

I write another wrapper script to display the locations of the sample
gene annotations.  It's very similar to `testFeatures`, but because
displayFeatureLoc is a monadic action I combine it with `featureById`
using a monadic join rather than `liftM`.

> testFeatureLocs :: (FH.FeatureHierReader m, MonadError String m) => m String
> testFeatureLocs = liftM unlines . mapM testFeatureLoc $ tests
>     where tests = map LBS.pack [ "YDL248W", "YDL219W" ]
>           testFeatureLoc = (displayFeatureLoc =<<) . FH.featureById

I run these actions and bind the results to a simple IO action that
unwraps errors and prints the result.

	*Main> runGFF yeastGFF testFeatureLocs >>= either error putStr
	YDL248W is at chrIV@1801to2952
	YDL219W is at chrIV@65242to65765

The location of *COS7* is correct, when I take into account the fact
that these coordinates are 0-based rather than 1-based.  However, the
location of *DTD1* covers both the exons and the introns of the
protein-coding gene.  The individual introns and exons of *DTD1* are
annotated separately from the overall gene.

Feature Hierarchies
===================

The feature annotations in a GFF file are assembled into a hierarchy,
with some features being parents of other features.  The exact meaning
of this hierarchy varies between databases.  The canonical example in
the GFF3 specification has a `gene` annotation with children that are
`mRNA` annotations, which in turn have children that are `exon` and
`CDS` annotations.  However, SGD uses `CDS` annotations that are
direct children of the `gene` annotation and doesn't generally define
transcript boundaries.  

In order to explore the SGD annotations, I'll write some functions to
make more human-readable output and then map them over the parts of
the feature hierarchy.  First, I just display a feature's type, as not
all features have identifiers, and its location.

> displayFeature :: F.Feature -> String
> displayFeature f = typestr ++ " at " ++ locstr
>     where typestr = LBS.unpack . F.ftype $ f
>           locstr = display . F.seqLoc $ f

Now I write a monadic action to look up all children of a feature,
recursively display their sub-hierarchy, and then indent that
sub-hierarchy.  For SGD annotations, the feature hierarchy is only one
layer deep, so this full recursive descent is unnecessary, but it
could be useful in other contexts.

> displayFeatureFamily :: (FH.FeatureHierReader m) => F.Feature -> m String
> displayFeatureFamily = liftM unlines . displayFamilyLines
>     where displayFamilyLines f = FH.children f >>= liftM (displayFamily . concat) . mapM displayFamilyLines
>               where displayFamily chlines = displayFeature f : map ("  " ++) chlines

I can now perform command-line queries to look up our example genes by
their systematic name and bind the result to `displayFeatureFamily`.

> printYeastGFF :: GFF String -> IO ()
> printYeastGFF = (either error putStr =<<) . runGFF yeastGFF

	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL248W") >>= displayFeatureFamily
	gene at chrIV@1801to2952
	  CDS at chrIV@1801to2952
	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= displayFeatureFamily
	gene at chrIV@65242to65765
	  intron at chrIV@65307to65377
	  CDS at chrIV@65378to65765
	  CDS at chrIV@65242to65306

The results show that the *COS7* gene has a single annotated CDS whose
location is the same as the gene.  The *DTD1* gene has two CDS
sub-features and an intron between them.  It would be slightly easier
to read if the features were sorted by location.  However, to do this
correctly, I'd need to sort forward-strand genes in ascending order
and reverse-strand genes in descending order.

> sortedChildren :: (FH.FeatureHierReader m) => F.Feature -> m [F.Feature]
> sortedChildren f = liftM (sortBy childOrder) . FH.children $ f
>     where childOrder = case F.strand f of
>                          Just Fwd -> comparing F.start
>                          Just RevCompl -> comparing (negate . F.start)
>                          Nothing -> error "Cannot sort children of strandless feature"
>
> displayFeatureFamily2 :: (FH.FeatureHierReader m) => F.Feature -> m String
> displayFeatureFamily2 = liftM unlines . displayFamilyLines
>     where displayFamilyLines f = sortedChildren f >>= liftM (displayFamily . concat) . mapM displayFamilyLines
>               where displayFamily chlines = displayFeature f : map ("  " ++) chlines

I can now verify that the *DTD1* gene sub-features are sorted in
ascending order while the *HNT1* sub-features, on the reverse
complement strand, are sorted in descending order.

	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= displayFeatureFamily2
	gene at chrIV@65242to65765
	  CDS at chrIV@65242to65306
	  intron at chrIV@65307to65377
	  CDS at chrIV@65378to65765
	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL125C") >>= displayFeatureFamily2
	gene at chrIV@239605to239018
	  CDS at chrIV@239605to239509
	  intron at chrIV@239508to239398
	  CDS at chrIV@239397to239018

Finally, I'll put together a function to filter out the CDS
annotations from the sorted children and assemble a sequence location.

> cdsSeqLoc :: (MonadError String m, FH.FeatureHierReader m) => F.Feature -> m SeqLoc.SeqLoc
> cdsSeqLoc = (foldM1 concatSeqLocs . map F.seqLoc . filter isCDS =<<) . sortedChildren
>     where isCDS = ((== (LBS.pack "CDS")) . F.ftype)
>           concatSeqLocs (OnSeq name1 (Loc.Loc locs1)) (OnSeq name2 (Loc.Loc locs2))
>               | name1 /= name2 = throwError $ "CDS refname mismatch: " ++ show (LBS.unpack name1, LBS.unpack name2)
>               | otherwise = return $ OnSeq name1 (Loc.Loc $ locs1 ++ locs2)
>           foldM1 _ [] = error "foldM1: empty list"
>           foldM1 f (x0:xrest) = foldM f x0 xrest

I can directly display these CDS locations and they agree with the
feature "families" displayed above.

	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL248W") >>= liftM display . cdsSeqLoc
	chrIV@1801to2952
	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= liftM display . cdsSeqLoc
	chrIV@65242to65306;65378to65765
	*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL125C") >>= liftM display . cdsSeqLoc
	chrIV@239605to239509;239397to239018
