Introduction

The modules in Bio.Location are designed to represent positions and locations (ranges of positions) on sequences, particularly nucleotide sequences. In this tutorial, I present examples based on quantifying gene expression from deep sequencing of mRNAs. This is a simplification of the actual use case that led me to develop the modules.

Deep sequencing technology produces millions of short sequencing reads. In some cases, the sequencing sample is produced from genomic DNA and the sequences can reveal changes in a specific individual genome compared to a reference or even provide novel sequence from an unknown organism. Deep sequencing of samples produced from mRNA transcripts rather than genomic DNA is another major application of this new technology. Here, the analysis involves simply counting the number of sequencing reads derived from each gene's mRNA in order to determine its relative abundance. When the mRNA for YFG1 is ten-fold more abundant than that of YFG2, then there will be roughly ten-fold more fragments of YFG1 than fragments of YFG2 in the sequencing sample. This will in turn produce ten times as many sequencing reads that align to the genome within YFG1.

My goal is to take a list of gene locations and a list of sequencing read alignment locations and to count how many read alignments fall in each gene. I start by importing the various modules I'll need, many of which are designed for qualified import:

> import qualified Data.ByteString.Lazy.Char8 as LBS
> import Data.List (intercalate)
> 
> import Bio.Location.OnSeq
> import qualified Bio.Location.ContigLocation as CLoc
> import qualified Bio.Location.Location as Loc
> import qualified Bio.Location.Position as Pos
> import qualified Bio.Location.SeqLocation as SeqLoc
> import qualified Bio.Location.SeqLocMap as SLM
> import Bio.Location.Strand
> import Bio.Sequence.SeqData

Sequence positions

The first step of analysis is to use an external program such as Bowtie to align each short deep sequencing read to the genome. I'll start with the genomic alignment position of a deep sequencing read. A few lines of Bowtie alignment are shown below:

HWI-EAS355_6_Nick:5:1:855:1808  -       chrXII  453779  GGTCTAGGCTTCGTCACTGACCTCCAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
HWI-EAS355_6_Nick:5:1:806:1496  -       chrXII  454255  GTTCGATTAGTCTTTCGCCCCTATACC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
HWI-EAS355_6_Nick:5:1:1657:1865 -       chrIV   209505  TAACAGTTCTTGCCGAAAAATCCACAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
HWI-EAS355_6_Nick:5:1:1469:1010 -       chrXIV  742572  CCAAAAGTAATATATTGAATTAATAAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
HWI-EAS355_6_Nick:5:1:1471:1449 +       chrIV   809813  GAAGGATCTAGGTACCATCGTTGAAGA     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0      26:G>A

Columns two through four describe the position of the alignment. The second column indicates whether the alignment is on the forward (+) or reverse complement (-) strand. The third gives the sequence name and the fourth gives the offset into the sequence. According to the Bowtie manual, the fourth column gives the "0-based offset into the reference sequence where leftmost character of the alignment occurs." The question of zero- versus one-based indexing is a common complication in bioinformatics. The Bio.Location modules use zero-based indexing, so the offsets can be used directly to index into SeqData.

The SeqPos data type in Bio.Location.SeqLocation module provides a direct representation of these three components. It is constructed from a Pos, specifying the offset and strand, nested within an OnSeq, giving the name of the sequence where the read aligned. The strand is given with the Strand data type. The one complication is that the offset in the fourth column gives the left-most character of the alignment, but when the alignment is on the reverse complement strand the starting position is actually the right-most character of the alignment.

> aligns :: [SeqLoc.SeqPos]
> aligns = map mkSeqPos [ ('-', "chrXII", 453779)
>                       , ('-', "chrXII", 454255)
>                       , ('-', "chrIV",  209505)
>                       , ('-', "chrXIV", 742572)
>                       , ('+', "chrIV",  809813)
>                       ]
>     where mkSeqPos ('+', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos offset Fwd)
>           mkSeqPos ('-', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos (offset + 1 - readLength) RevCompl)
>               where readLength = 27

Sequence Locations

Next, I need genomic annotations that indicate the location of genes. Below are sample GFF format annotations for the yeast genome:

chrIV   SGD     gene    1802    2953    .       +       .       ID=YDL248W;Name=YDL248W;gene=COS7
chrIV   SGD     CDS     1802    2953    .       +       0       Parent=YDL248W;Name=YDL248W;gene=COS7
chrIV   SGD     gene    3762    3836    .       +       .       ID=YDL247W-A;Name=YDL247W-A
chrIV   SGD     CDS     3762    3836    .       +       0       Parent=YDL247W-A;Name=YDL247W-A
chrIV   SGD     gene    5985    7814    .       +       .       ID=YDL247W;Name=YDL247W;gene=MPH2
chrIV   SGD     CDS     5985    7814    .       +       0       Parent=YDL247W;Name=YDL247W;gene=MPH2
chrIV   SGD     gene    8683    9756    .       -       .       ID=YDL246C;Name=YDL246C;gene=SOR2
chrIV   SGD     CDS     8683    9756    .       -       0       Parent=YDL246C;Name=YDL246C;gene=SOR2
chrIV   SGD     gene    11657   13360   .       -       .       ID=YDL245C;Name=YDL245C;gene=HXT15
chrIV   SGD     CDS     11657   13360   .       -       0       Parent=YDL245C;Name=YDL245C;gene=HXT15
chrIV   SGD     ARS     15493   15739   .       .       .       ID=ARS403;Name=ARS403;Alias=ARSIV-16
chrIV   SGD     gene    16204   17226   .       +       .       ID=YDL244W;Name=YDL244W;gene=THI13
chrIV   SGD     CDS     16204   17226   .       +       0       Parent=YDL244W;Name=YDL244W;gene=THI13

I start by using the ContigSeqLoc data type to represent a sequence location for the coding sequences (CDSes) in this annotation. Columns 4 and 5 give the bounds of the coding sequence and column 7 indicates the strand. The bounds are always given from lower to higher, even when the coding sequence is on the reverse complement strand. Conveniently, the ContigLoc data type represents sequence locations based on the left-most sequence position, along with the length and the strand. As above, this means that for reverse complement features the offset position corresponds to the end of the feature.

> cdsContigs :: [(String, SeqLoc.ContigSeqLoc)]
> cdsContigs = map mkContigSeqLoc [ ("YDL248W",   ("chrIV",  1802,  2953, '+'))
>                                 , ("YDL247W-A", ("chrIV",  3762,  3836, '+'))
>                                 , ("YDL247W",   ("chrIV",  5985,  7814, '+'))
>                                 , ("YDL246C",   ("chrIV",  8683,  9756, '-'))
>                                 , ("YDL245C",   ("chrIV", 11657, 13360, '-'))
>                                 , ("YDL244W",   ("chrIV", 16204, 17226, '+'))
>                                 ]
>     where mkContigSeqLoc (name, (chrName, leftOff, rightOff, strChr))
>               = (name, OnSeq (LBS.pack chrName) (CLoc.ContigLoc leftOff len str))
>                 where len = 1 + rightOff - leftOff
>                       str | strChr == '+' = Fwd
>                           | strChr == '-' = RevCompl
>                           | otherwise = error $ "Unknown strand " ++ show strChr

Discontinuous sequence locations

Even in the relatively simple yeast genome, however, I quickly run into a problem with the genome annotations:

chrIV   SGD     gene    65243   65766   .       +       .       ID=YDL219W;Name=YDL219W;gene=DTD1
chrIV   SGD     CDS     65243   65307   .       +       0       Parent=YDL219W;Name=YDL219W;gene=DTD1
chrIV   SGD     CDS     65379   65766   .       +       1       Parent=YDL219W;Name=YDL219W;gene=DTD1
chrIV   SGD     intron  65308   65378   .       +       .       Parent=YDL219W;Name=YDL219W;gene=DTD1

The DTD1 gene is spliced, meaning that the coding sequence is comprised of two non-contiguous regions called exons. I use the SeqLoc data type to represent these concatenated sub-locations. It is based on the Loc data type from the Bio.Location.Location package, which is a list of the ContigLoc locations I used above. It isn't apparent in the case of YDL219W, but when a gene is encoded on the reverse complement strand the individual exons are listed from left to right on the chromosome. As reverse strand genes run from right to left, the exons must be reversed.

> cdses :: [(String, SeqLoc.SeqLoc)]
> cdses = map mkSeqLoc [ ("YDL248W",   ("chrIV", [ (  1802,   2953, '+') ]))
>                      , ("YDL247W-A", ("chrIV", [ (  3762,   3836, '+') ]))
>                      , ("YDL247W",   ("chrIV", [ (  5985,   7814, '+') ]))
>                      , ("YDL246C",   ("chrIV", [ (  8683,   9756, '-') ]))
>                      , ("YDL245C",   ("chrIV", [ ( 11657,  13360, '-') ]))
>                      , ("YDL244W",   ("chrIV", [ ( 16204,  17226, '+') ]))
>                      , ("YDL219W",   ("chrIV", [ ( 65243,  65307, '+'), (65379, 65766, '+') ]))
>                      , ("YDL140C",   ("chrIV", [ (205361, 210562, '-') ]))
>                      , ("YDR172W",   ("chrIV", [ (808321, 810378, '+') ]))
>                      ]
>     where mkSeqLoc (name, (chrName, contigs))
>               = (name, OnSeq (LBS.pack chrName) (Loc.Loc clocs))
>                 where clocList@(cloc1:_) = map mkCLoc contigs
>                       clocs = case CLoc.strand cloc1 of
>                                 Fwd -> clocList
>                                 RevCompl -> reverse clocList
>           mkCLoc (leftOff, rightOff, strChr)
>               = CLoc.ContigLoc leftOff len str
>                 where len = 1 + rightOff - leftOff
>                       str | strChr == '+' = Fwd
>                           | strChr == '-' = RevCompl
>                           | otherwise = error $ "Unknown strand " ++ show strChr

I augmented the small subset of chromosome IV genes with a few more from further along the chromosome. These will allow me to actually find some hits for my sequencing read alignments.

Finding locations, naively

I now actually write a function to find where the sequencing reads lie in a gene. Budding yeast has a relatively small proportion of directly overlapping CDses, but I still want to allow for multiple hits, or no hits, for a sequencing read. I will use the isWithin function from the Bio.Location.SeqLocation module. This tests whether a position lies within a location--the position must be on the same sequence (identical sequence name) and the same strand as the location as well as lying within the bounds of one of the contiguous blocks making up the location.

> isHit :: SeqLoc.SeqPos -> (a, SeqLoc.SeqLoc) -> Bool
> isHit pos (_, loc) = pos `SeqLoc.isWithin` loc
> 
> posHitsNaive :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
> posHitsNaive targetLocs queryPos = filter (isHit queryPos) targetLocs

Finally, I want to determine where in the gene a read starts. It may be interesting if substantially more reads are found in one region of the gene than another. To do this, I'll use the posInto function. This function takes a location and a position on the same sequence and calculates a new position which corresponds to the original position, expressed relative to the coordinates of the location. That is, if the location runs from 100 to 200 and the position is 110, then the offset resulting from posInto will be 10. This function only computes a value when the position actually lies inside the location. I have already filtered for just the locations that do contain the position, so I will produce a runtime error if it fails. This requires a bit of plumbing to manage the errors. The posInto function has an inverse, the posOutof function.

> hitOffset :: SeqLoc.SeqPos -> SeqLoc.SeqLoc -> Offset
> hitOffset pos loc = either error Pos.offset into
>     where into = onSameSeq posIntoE pos loc
>           posIntoE p l = maybe outsideErr Right $ p `Loc.posInto` l
>               where outsideErr = Left "position outside hit location"

I put these together with a function to produce human-readable output describing the location of the sequencing read relative to protein-coding genes.

> posDescribeHits :: (SeqLoc.SeqPos -> [(String, SeqLoc.SeqLoc)]) -> SeqLoc.SeqPos -> String
> posDescribeHits lookupHits queryPos
>     = SeqLoc.displaySeqPos queryPos ++ ": " ++ describeHits
>     where describeHits = case lookupHits queryPos of
>                            [] -> "no CDS"
>                            hits -> intercalate "; " . map describeHit $ hits
>           describeHit (cdsName, cdsSeqLoc) = cdsName ++ " at " ++ show (hitOffset queryPos cdsSeqLoc)

Finally, I map the hit-describing function over each alignment.

> describeAlignsNaive = mapM_ (putStrLn . posDescribeHits (posHitsNaive cdses)) aligns

The describeAlignsNaive function can be run to produce the following results:

chrXII@453753(-): no CDS
chrXII@454229(-): no CDS
chrIV@209479(-): YDL140C at 1083
chrXIV@742546(-): no CDS
chrIV@809813(+): YDR172W at 1492

Finding locations

As suggested by its name, posHitsNaive is easy to write, but probably not the best way to look up which locations overlap a sequence position. The Bio.Location.LocMap and Bio.Location.SeqLocMap provide a much faster method of looking up positions. The LocMap data structure sorts locations into bins, with each location falling into one or more bins. A sequence position is then tested against only the set of locations overlapping its bin. The bin size is tuned to work well for yeast genes, with an average feature density of one per ~2kbp, and should work well for somewhat less dense annotation maps as well. The LocMap approach takes advantage of the fact that there is a great deal of structure in sequence location data, since each location is contained in a bounded range of sequence coordinates.

In order to use the SeqLocMap, I flip around the association list, making the locations into the key. I then construct the lookup map and apply it to sequence positions.

> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
> posHits targetLocs = map swap . flip SLM.lookupWithin targetSLM
>     where swap (a,b) = (b,a)
>           targetSLM = SLM.fromList . map swap $ targetLocs

I use the more sophisticated version of posHits with posDescribeHits and find the same result. There is an extensive set of QuickCheck properties ensuring that LocMap results match the naive approach, of course.

> describeAligns = mapM_ (putStrLn . posDescribeHits (posHits cdses)) aligns

chrXII@453753(-): no CDS
chrXII@454229(-): no CDS
chrIV@209479(-): YDL140C at 1083
chrXIV@742546(-): no CDS
chrIV@809813(+): YDR172W at 1492

This provides a demonstration of the position and location interfaces. For practical use, of course, I need to read annotations from a GFF format file as well as parse Bowtie output files directly.