Introduction

The modules in Bio.GFF3 are designed to fairly directly represent the data present in a GFF sequence annotation data file. This data can include actual sequence data as well as annotated sequence features. Here I'll present small, useful example programs that uses both kinds of data.

GFF annotation data is just a table of features with sequence locations. Each feature covers a contiguous span of sequence positions, like a Bio.Location.ContigLocation. Features in the table are arranged into a hierarchy, with parent and child features. For example, a gene feature could have a collection of child features that represent the distinct exons of a spliced transcript. The parent and child feature relationships are managed by the Bio.GFF3 modules.

> import Control.Monad
> import Control.Monad.Error
> import qualified Data.ByteString.Lazy.Char8 as LBS
> import Data.List
> import Data.Ord
> import System.IO
>
> import qualified Bio.GFF3.Feature as F
> import qualified Bio.GFF3.FeatureHier as FH
> import Bio.GFF3.GFF
> import qualified Bio.Location.ContigLocation as CLoc
> import Bio.Location.Display
> import qualified Bio.Location.Location as Loc
> import Bio.Location.OnSeq
> import qualified Bio.Location.SeqLocation as SeqLoc
> import Bio.Location.Strand
> import Bio.Sequence.SeqData
> import Bio.Sequence.SequenceReader

Sequence data

The SequenceReader typeclass in the Bio.Sequence.SequenceReader module is like a MonadReader specialized for a database of sequences. It represents a computation performed in the context of database of named sequences. It has a function, askSequence, to look up a sequence by name. It uses MonadError to report errors--usually, the error is that a sequence name doesn't exist.

For these examples, I'll use the yeast genome. The sequence is available in an 18 MB GFF file from the Saccharomyces Genome Database. The Fasta format sequence data starts after the ##FASTA line. I cut out everything before that line and saved the rest at /data/Libraries/SGD/saccharomyces_cerevisiae.fa.

> yeastGFF :: FilePath
> yeastGFF = "/data/Libraries/SGD/saccharomyces_cerevisiae.gff"
> yeastFasta :: FilePath
> yeastFasta = "/data/Libraries/SGD/saccharomyces_cerevisiae.fa"

This simple monadic action looks up the sequences of a couple of yeast chromosomes and prints a message that tells me how long they are.

> someChromosomeLengths :: (MonadError String m, SequenceReader m) => m String
> someChromosomeLengths = liftM unlines . mapM displayChromoLength $ chromoNames
> where chromoNames = map LBS.pack [ "chrIII", "chrIV" ]
>
> displayChromoLength :: (MonadError String m, SequenceReader m) => LBS.ByteString -> m String
> displayChromoLength name = do (Seq name sequ _) <- askSequence name
> return $ "Length of" ++ LBS.unpack name ++ " is " ++ (show . LBS.length) sequ

To actually run this action, we need an actual monad that acts as a SequenceReader. The module provides one concrete instance, MapSequenceReaderT, which uses a simple in-memory Data.Map.Map from sequence names to sequences. There is a convenience function to read in sequences from a Fasta format file and build a Map with sequence name keys. I use this function with a Fasta file of the yeast genomic sequence. I then need to use runErrorT to unwrap errors from inside the ErrorT String IO monad into an IO (Either String a) result in the IO monad.

> runYeastFasta :: MapSequenceReaderT (ErrorT String IO) String -> IO ()
> runYeastFasta m = (runErrorT $ runFromFasta m yeastFasta) >>= handleErrors
> where handleErrors (Right res) = putStr res
> handleErrors (Left err) = hPutStrLn stderr err

Running the chromosome length action gives the following results:

*Main> runYeastFasta someChromosomeLengths
Length of chrIII is 316617
Length of chrIV is 1531918

Next, I'll get the sequence of the yeast gene COS7, the left-most gene on chromosome IV, which occupies nucleotides 1802 to 2953 in the 1-based coordinates that are frequently used in biology. To do this, I just retrieve the entire sequence, then take a substring.

> cos7Sequence :: (MonadError String m, SequenceReader m) => m String
> cos7Sequence = liftM extractCos7 $ askSequence chrIV
> where chrIV = LBS.pack "chrIV"
> extractCos7 (Seq _ chrseq _)
> = LBS.unpack . LBS.take (1 + cos7End - cos7Start) . LBS.drop (cos7Start - 1) $ chrseq
> (cos7Start, cos7End) = (1802, 2953)

Running this sequence retrieval action produces the following results

*Main> runYeastFasta cos7Sequence 
ATGAAAGAGAATGAAGTCAAAGATGAGAAAAGCGTAGATGTGTTATCCTTCAAACAGCTC

...

TGGCCATATATTAAAGAAGCGCAATTGTCCCGCAATGAAGAGTCTTTAATGAAGAAATGA

Since this is supposed to be a protein-coding gene, it's good to see that it starts with an ATG and ends with a stop codon, in this case TGA. Using take and drop to pull out the subsequence is awkward, and it only becomes more awkward when dealing with spliced genes with multiple exons. The Bio.Sequence.SeqLocation module has monadic actions that handle this for us. I'll use the SeqLoc.askSeqData action to pull out the sequence of the spliced yeast gene DTD1.

> dtd1SeqLoc :: SeqLoc.SeqLoc
> dtd1SeqLoc = (OnSeq (LBS.pack "chrIV") dtd1Loc)
> where dtd1Loc = Loc.Loc [ CLoc.fromStartEnd 65242 65306
> , CLoc.fromStartEnd 65378 65765
> ]
> dtd1Sequence :: (MonadError String m, SequenceReader m) => m String
> dtd1Sequence = liftM LBS.unpack $ SeqLoc.askSeqData dtd1SeqLoc

Running this sequence retrieval action yields

*Main> runYeastFasta dtd1Sequence
ATGAAGATTGTCTTACAAAAAGTCAGCCAAGCATCTGTAGTCGTCGATTCAAAAGTTATT

...

ATGAGTTGCTCTTTAACTAATGAAGGGCCCGTTACAATCATTCTTGACAGTGACCAATAA

Again, the sequence starts with an ATG and ends with a stop codon. I can also check that the retrieved sequence is the correct, spliced length.

*Main> runYeastFasta . liftM (show . LBS.length) . SeqLoc.askSeqData $ dtd1SeqLoc
453
*Main> 453 / 3
151.0

The sequence we extract is the right length to encode a 151 amino acid protein. Building the sequence location to use with askSeqData isn't as error-prone as directly taking and dropping parts of the sequence, but it's still tedious. The next step is getting these locations from the annotation portion of a GFF file.

Annotated Features

The FeatureHierReader typeclass is another MonadReader-like class that provides a database of annotations. I can ask for the full feature table using askFeatureHier, but usually I'll use convenience functions that access some particular part of the FeatureHier. I'll start by getting the features for COS7 and DTD1. I look these up using their IDs, which will be their systematic yeast gene names, YDL248W and YDL219W. The featureByID is the key action here; the rest of the function just serves to show the feature and to map this feature-showing action over the list of test genes.

> testFeatures :: (FH.FeatureHierReader m, MonadError String m) => m String
> testFeatures = liftM unlines . mapM testFeature $ tests
> where tests = map LBS.pack [ "YDL248W", "YDL219W" ]
> testFeature = liftM show . FH.featureById

Running this action requires a source of feature annotations. The Bio.GFF3.GFF module has a function that reads in annotation data from a GFF file and uses it to run a GFF action, which provides a FeatureHierReader. This uses the full saccharomyces_cerevisiae.gff file I downloaded, above.

*Main> runGFF yeastGFF testFeatures
Right "Feature {seqid = Chunk \"chrIV\" Empty, source = Chunk \"SGD\" Empty, ftype = Chunk \"gene\" Empty, start = 1802, end = 2953

This pulls out a fairly long representation of the two features, wrapped in an Either String for error handling. I can see that the first feature is on chromosome IV from 1802 to 2953, which is the location of COS7. To see the location more clearly, I'll write a little function to print this information more clearly. It requires error handling because not every feature has a name.

> displayFeatureLoc :: (MonadError String m) => F.Feature -> m String
> displayFeatureLoc f = liftM displayStr $ F.ident f
> where displayStr namelbs = LBS.unpack namelbs ++ " is at " ++ locstr
> locstr = display . F.seqLoc $ f

I write another wrapper script to display the locations of the sample gene annotations. It's very similar to testFeatures, but because displayFeatureLoc is a monadic action I combine it with featureById using a monadic join rather than liftM.

> testFeatureLocs :: (FH.FeatureHierReader m, MonadError String m) => m String
> testFeatureLocs = liftM unlines . mapM testFeatureLoc $ tests
> where tests = map LBS.pack [ "YDL248W", "YDL219W" ]
> testFeatureLoc = (displayFeatureLoc =<<) . FH.featureById

I run these actions and bind the results to a simple IO action that unwraps errors and prints the result.

*Main> runGFF yeastGFF testFeatureLocs >>= either error putStr
YDL248W is at chrIV@1801to2952
YDL219W is at chrIV@65242to65765

The location of COS7 is correct, when I take into account the fact that these coordinates are 0-based rather than 1-based. However, the location of DTD1 covers both the exons and the introns of the protein-coding gene. The individual introns and exons of DTD1 are annotated separately from the overall gene.

Feature Hierarchies

The feature annotations in a GFF file are assembled into a hierarchy, with some features being parents of other features. The exact meaning of this hierarchy varies between databases. The canonical example in the GFF3 specification has a gene annotation with children that are mRNA annotations, which in turn have children that are exon and CDS annotations. However, SGD uses CDS annotations that are direct children of the gene annotation and doesn't generally define transcript boundaries.

In order to explore the SGD annotations, I'll write some functions to make more human-readable output and then map them over the parts of the feature hierarchy. First, I just display a feature's type, as not all features have identifiers, and its location.

> displayFeature :: F.Feature -> String
> displayFeature f = typestr ++ " at " ++ locstr
> where typestr = LBS.unpack . F.ftype $ f
> locstr = display . F.seqLoc $ f

Now I write a monadic action to look up all children of a feature, recursively display their sub-hierarchy, and then indent that sub-hierarchy. For SGD annotations, the feature hierarchy is only one layer deep, so this full recursive descent is unnecessary, but it could be useful in other contexts.

> displayFeatureFamily :: (FH.FeatureHierReader m) => F.Feature -> m String
> displayFeatureFamily = liftM unlines . displayFamilyLines
> where displayFamilyLines f = FH.children f >>= liftM (displayFamily . concat) . mapM displayFamilyLines
> where displayFamily chlines = displayFeature f : map (" " ++) chlines

I can now perform command-line queries to look up our example genes by their systematic name and bind the result to displayFeatureFamily.

> printYeastGFF :: GFF String -> IO ()
> printYeastGFF = (either error putStr =<<) . runGFF yeastGFF
*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL248W") >>= displayFeatureFamily
gene at chrIV@1801to2952
  CDS at chrIV@1801to2952
*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= displayFeatureFamily
gene at chrIV@65242to65765
  intron at chrIV@65307to65377
  CDS at chrIV@65378to65765
  CDS at chrIV@65242to65306

The results show that the COS7 gene has a single annotated CDS whose location is the same as the gene. The DTD1 gene has two CDS sub-features and an intron between them. It would be slightly easier to read if the features were sorted by location. However, to do this correctly, I'd need to sort forward-strand genes in ascending order and reverse-strand genes in descending order.

> sortedChildren :: (FH.FeatureHierReader m) => F.Feature -> m [F.Feature]
> sortedChildren f = liftM (sortBy childOrder) . FH.children $ f
> where childOrder = case F.strand f of
> Just Fwd -> comparing F.start
> Just RevCompl -> comparing (negate . F.start)
> Nothing -> error "Cannot sort children of strandless feature"
>
> displayFeatureFamily2 :: (FH.FeatureHierReader m) => F.Feature -> m String
> displayFeatureFamily2 = liftM unlines . displayFamilyLines
> where displayFamilyLines f = sortedChildren f >>= liftM (displayFamily . concat) . mapM displayFamilyLines
> where displayFamily chlines = displayFeature f : map (" " ++) chlines

I can now verify that the DTD1 gene sub-features are sorted in ascending order while the HNT1 sub-features, on the reverse complement strand, are sorted in descending order.

*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= displayFeatureFamily2
gene at chrIV@65242to65765
  CDS at chrIV@65242to65306
  intron at chrIV@65307to65377
  CDS at chrIV@65378to65765
*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL125C") >>= displayFeatureFamily2
gene at chrIV@239605to239018
  CDS at chrIV@239605to239509
  intron at chrIV@239508to239398
  CDS at chrIV@239397to239018

Finally, I'll put together a function to filter out the CDS annotations from the sorted children and assemble a sequence location.

> cdsSeqLoc :: (MonadError String m, FH.FeatureHierReader m) => F.Feature -> m SeqLoc.SeqLoc
> cdsSeqLoc = (foldM1 concatSeqLocs . map F.seqLoc . filter isCDS =<<) . sortedChildren
> where isCDS = ((== (LBS.pack "CDS")) . F.ftype)
> concatSeqLocs (OnSeq name1 (Loc.Loc locs1)) (OnSeq name2 (Loc.Loc locs2))
> | name1 /= name2 = throwError $ "CDS refname mismatch: " ++ show (LBS.unpack name1, LBS.unpack name2)
> | otherwise = return $ OnSeq name1 (Loc.Loc $ locs1 ++ locs2)
> foldM1 _ [] = error "foldM1: empty list"
> foldM1 f (x0:xrest) = foldM f x0 xrest

I can directly display these CDS locations and they agree with the feature "families" displayed above.

*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL248W") >>= liftM display . cdsSeqLoc
chrIV@1801to2952
*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL219W") >>= liftM display . cdsSeqLoc
chrIV@65242to65306;65378to65765
*Main> printYeastGFF $ FH.featureById (LBS.pack "YDL125C") >>= liftM display . cdsSeqLoc
chrIV@239605to239509;239397to239018