[Tutorial for location and position stuff nick@ingolia.org**20090315065855] { adddir ./doc addfile ./doc/location-tutorial.lhs hunk ./doc/location-tutorial.lhs 1 + + + + +Bio.Location Tutorial + + + +

Introduction

+ +

The modules in Bio.Location are designed to represent +positions and locations (ranges of positions) on sequences, +particularly nucleotide sequences. In this tutorial, I present +examples based on quantifying gene expression from deep sequencing of +mRNAs. This is a simplification of the actual use case that led me to +develop the modules.

+ +

Deep sequencing technology produces millions of short sequencing +reads. In some cases, the sequencing sample is produced from genomic +DNA and the sequences can reveal changes in a specific individual +genome compared to a reference or even provide novel sequence from an +unknown organism. Deep sequencing of samples produced from mRNA +transcripts rather than genomic DNA is another major application of +this new technology. Here, the analysis involves simply counting the +number of sequencing reads derived from each gene's mRNA in order to +determine its relative abundance. When the mRNA for YFG1 is +ten-fold more abundant than that of YFG2, then there will be +roughly ten-fold more fragments of YFG1 than fragments of +YFG2 in the sequencing sample. This will in turn produce ten +times as many sequencing reads that align to the genome within +YFG1.

+ +

My goal is to take a list of gene locations and a list of +sequencing read alignment locations and to count how many read +alignments fall in each gene. I start by importing the various modules +I'll need, many of which are designed for qualified import:

+ +
+
+> import qualified Data.ByteString.Lazy.Char8 as LBS
+> import Data.List (intercalate)
+> 
+> import Bio.Location.OnSeq
+> import qualified Bio.Location.ContigLocation as CLoc
+> import qualified Bio.Location.Location as Loc
+> import qualified Bio.Location.Position as Pos
+> import qualified Bio.Location.SeqLocation as SeqLoc
+> import Bio.Location.Strand
+> import Bio.Sequence.SeqData
+
+
+ +

Positions and locations

+ +

The first step of analysis is to use an external program such as +Bowtie to align each short deep sequencing read to the genome. I'll +start with the genomic alignment position of a deep sequencing read. +A few lines of Bowtie +alignment are shown below:

+ +
+HWI-EAS355_6_Nick:5:1:855:1808  -       chrXII  453779  GGTCTAGGCTTCGTCACTGACCTCCAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
+HWI-EAS355_6_Nick:5:1:806:1496  -       chrXII  454255  GTTCGATTAGTCTTTCGCCCCTATACC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
+HWI-EAS355_6_Nick:5:1:1657:1865 -       chrIV   209505  TAACAGTTCTTGCCGAAAAATCCACAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
+HWI-EAS355_6_Nick:5:1:1469:1010 -       chrXIV  742572  CCAAAAGTAATATATTGAATTAATAAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
+HWI-EAS355_6_Nick:5:1:1471:1449	+	chrIV	809813	GAAGGATCTAGGTACCATCGTTGAAGA	hhhhhhhhhhhhhhhhhhhhhhhhhhh	0	26:G>A
+
+ +

Columns two through four describe the position of the alignment. +The second column indicates whether the alignment is on the forward +(+) or reverse complement (-) strand. The third +gives the sequence name and the fourth gives the offset into the +sequence. According to the Bowtie +manual, the fourth column gives the "0-based offset into the +reference sequence where leftmost character of the alignment occurs." +The question of zero- versus one-based indexing is a common +complication in bioinformatics. The Bio.Location modules use +zero-based indexing, so the offsets can be used directly to index into +SeqData.

+ +

The SeqPos data type in Bio.Location.SeqLocation +module provides a direct representation of these three components. It +is constructed from a Pos, specifying the offset and strand, +nested within an OnSeq, giving the name of the sequence where +the read aligned. The strand is given with the Strand data +type. The one complication is that the offset in the fourth column +gives the left-most character of the alignment, but when the alignment +is on the reverse complement strand the starting position is actually +the right-most character of the alignment.

+ +
+
+> aligns :: [SeqLoc.SeqPos]
+> aligns = map mkSeqPos [ ('-', "chrXII", 453779)
+>                       , ('-', "chrXII", 454255)
+>                       , ('-', "chrIV",  209505)
+>                       , ('-', "chrXIV", 742572)
+>                       , ('+', "chrIV",  809813)
+>                       ]
+>     where mkSeqPos ('+', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos offset Fwd)
+>           mkSeqPos ('-', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos (offset + 1 - readLength) RevCompl)
+>               where readLength = 27
+
+
+ +

Next, I need genomic annotations that indicate the location of +genes. Below are sample GFF format annotations for the +yeast genome:

+ +
+
+chrIV   SGD     gene    1802    2953    .       +       .       ID=YDL248W;Name=YDL248W;gene=COS7
+chrIV   SGD     CDS     1802    2953    .       +       0       Parent=YDL248W;Name=YDL248W;gene=COS7
+chrIV   SGD     gene    3762    3836    .       +       .       ID=YDL247W-A;Name=YDL247W-A
+chrIV   SGD     CDS     3762    3836    .       +       0       Parent=YDL247W-A;Name=YDL247W-A
+chrIV   SGD     gene    5985    7814    .       +       .       ID=YDL247W;Name=YDL247W;gene=MPH2
+chrIV   SGD     CDS     5985    7814    .       +       0       Parent=YDL247W;Name=YDL247W;gene=MPH2
+chrIV   SGD     gene    8683    9756    .       -       .       ID=YDL246C;Name=YDL246C;gene=SOR2
+chrIV   SGD     CDS     8683    9756    .       -       0       Parent=YDL246C;Name=YDL246C;gene=SOR2
+chrIV   SGD     gene    11657   13360   .       -       .       ID=YDL245C;Name=YDL245C;gene=HXT15
+chrIV   SGD     CDS     11657   13360   .       -       0       Parent=YDL245C;Name=YDL245C;gene=HXT15
+chrIV   SGD     ARS     15493   15739   .       .       .       ID=ARS403;Name=ARS403;Alias=ARSIV-16
+chrIV   SGD     gene    16204   17226   .       +       .       ID=YDL244W;Name=YDL244W;gene=THI13
+chrIV   SGD     CDS     16204   17226   .       +       0       Parent=YDL244W;Name=YDL244W;gene=THI13
+
+
+ +

I start by using the ContigSeqLoc data type to represent a +sequence location for the coding sequences (CDSes) in this annotation. +Columns 4 and 5 give the bounds of the coding sequence and column 7 +indicates the strand. The bounds are always given from lower to +higher, even when the coding sequence is on the reverse complement +strand. Conveniently, the ContigLoc data type represents +sequence locations based on the left-most sequence position, along +with the length and the strand. As above, this means that for reverse +complement features the offset position corresponds to the end of the +feature.

+ +
+
+> cdsContigs :: [(String, SeqLoc.ContigSeqLoc)]
+> cdsContigs = map mkContigSeqLoc [ ("YDL248W",   ("chrIV",  1802,  2953, '+'))
+>                                 , ("YDL247W-A", ("chrIV",  3762,  3836, '+'))
+>                                 , ("YDL247W",   ("chrIV",  5985,  7814, '+'))
+>                                 , ("YDL246C",   ("chrIV",  8683,  9756, '-'))
+>                                 , ("YDL245C",   ("chrIV", 11657, 13360, '-'))
+>                                 , ("YDL244W",   ("chrIV", 16204, 17226, '+'))
+>                                 ]
+>     where mkContigSeqLoc (name, (chrName, leftOff, rightOff, strChr))
+>               = (name, OnSeq (LBS.pack chrName) (CLoc.ContigLoc leftOff len str))
+>                 where len = 1 + rightOff - leftOff
+>                       str | strChr == '+' = Fwd
+>                           | strChr == '-' = RevCompl
+>                           | otherwise = error $ "Unknown strand " ++ show strChr
+
+
+ +

Even in the relatively simple yeast genome, however, I quickly run +into a problem with the genome annotations:

+ +
+
+chrIV   SGD     gene    65243   65766   .       +       .       ID=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     CDS     65243   65307   .       +       0       Parent=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     CDS     65379   65766   .       +       1       Parent=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     intron  65308   65378   .       +       .       Parent=YDL219W;Name=YDL219W;gene=DTD1
+
+
+ +

The DTD1 gene is spliced, meaning that the coding sequence +is comprised of two non-contiguous regions called exons. I use the +SeqLoc data type to represent these concatenated +sub-locations. It is based on the Loc data type from the +Bio.Location.Location package, which is a list of the +ContigLoc locations I used above. It isn't apparent in the +case of YDL219W, but when a gene is encoded on the reverse +complement strand the individual exons are listed from left to right +on the chromosome. As reverse strand genes run from right to left, +the exons must be reversed.

+ +
+
+> cdses :: [(String, SeqLoc.SeqLoc)]
+> cdses = map mkSeqLoc [ ("YDL248W",   ("chrIV", [ (  1802,   2953, '+') ]))
+>                      , ("YDL247W-A", ("chrIV", [ (  3762,   3836, '+') ]))
+>                      , ("YDL247W",   ("chrIV", [ (  5985,   7814, '+') ]))
+>                      , ("YDL246C",   ("chrIV", [ (  8683,   9756, '-') ]))
+>                      , ("YDL245C",   ("chrIV", [ ( 11657,  13360, '-') ]))
+>                      , ("YDL244W",   ("chrIV", [ ( 16204,  17226, '+') ]))
+>                      , ("YDL219W",   ("chrIV", [ ( 65243,  65307, '+'), (65379, 65766, '+') ]))
+>                      , ("YDL140C",   ("chrIV", [ (205361, 210562, '-') ]))
+>                      , ("YDR172W",   ("chrIV", [ (808321, 810378, '+') ]))
+>                      ]
+>     where mkSeqLoc (name, (chrName, contigs))
+>               = (name, OnSeq (LBS.pack chrName) (Loc.Loc clocs))
+>                 where clocList@(cloc1:_) = map mkCLoc contigs
+>                       clocs = case CLoc.strand cloc1 of
+>                                 Fwd -> clocList
+>                                 RevCompl -> reverse clocList
+>           mkCLoc (leftOff, rightOff, strChr)
+>               = CLoc.ContigLoc leftOff len str
+>                 where len = 1 + rightOff - leftOff
+>                       str | strChr == '+' = Fwd
+>                           | strChr == '-' = RevCompl
+>                           | otherwise = error $ "Unknown strand " ++ show strChr
+
+
+ +

I augmented the small subset of chromosome IV genes with a few more +from further along the chromosome. These will allow me to actually +find some hits for my sequencing read alignments.

+ +

I now actually write a function to find where the sequencing reads +lie in a gene. Budding yeast has a relatively small proportion of +directly overlapping CDses, but I still want to allow for multiple +hits, or no hits, for a sequencing read. I will use the +isWithin function from the Bio.Location.SeqLocation +module. This tests whether a position lies within a location--the +position must be on the same sequence (identical sequence name) and +the same strand as the location as well as lying within the bounds of +one of the contiguous blocks making up the location.

+ +
+
+> isHit :: SeqLoc.SeqPos -> (a, SeqLoc.SeqLoc) -> Bool
+> isHit pos (_, loc) = pos `SeqLoc.isWithin` loc
+
+> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
+> posHits targetLocs queryPos = filter (isHit queryPos) targetLocs
+
+
+ +

Finally, I want to determine where in the gene a read starts. It +may be interesting if substantially more reads are found in one region +of the gene than another. To do this, I'll use the posInto +function. This function takes a location and a position on the same +sequence and calculates a new position which corresponds to the +original position, expressed relative to the coordinates of the +location. That is, if the location runs from 100 to 200 and the +position is 110, then the offset resulting from posInto will +be 10. This function only computes a value when the position actually +lies inside the location. I have already filtered for just the +locations that do contain the position, so I will produce a runtime +error if it fails. This requires a bit of plumbing to manage the +errors. The posInto function has an inverse, the +posOutof function.

+ +
+
+> hitOffset :: SeqLoc.SeqPos -> SeqLoc.SeqLoc -> Offset
+> hitOffset pos loc = either error Pos.offset into
+>     where into = onSameSeq posIntoE pos loc
+>           posIntoE p l = maybe outsideErr Right $ p `Loc.posInto` l
+>               where outsideErr = Left "position outside hit location"
+
+
+ +

I put these together with a function to produce human-readable +output describing the location of the sequencing read relative to +protein-coding genes.

+ +
+
+> posDescribeHits :: [(String, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> String
+> posDescribeHits targetLocs queryPos
+>     = SeqLoc.displaySeqPos queryPos ++ ": " ++ describeHits
+>     where describeHits = case posHits targetLocs queryPos of
+>                            [] -> "no CDS"
+>                            hits -> intercalate "; " . map describeHit $ hits
+>           describeHit (cdsName, cdsSeqLoc) = cdsName ++ " at " ++ show (hitOffset queryPos cdsSeqLoc)
+
+
+ +

Finally, I map the hit-describing function over each alignment.

+ +
+
+> describeAligns = mapM_ (putStrLn . posDescribeHits cdses) aligns
+
+
+ +

The describeAligns function can be run to produce the +following results:

+ +
+
+chrXII@453753(-): no CDS
+chrXII@454229(-): no CDS
+chrIV@209479(-): YDL140C at 1083
+chrXIV@742546(-): no CDS
+chrIV@809813(+): YDR172W at 1492
+
+
+ +

This provides a demonstration of the position and location +interfaces. However, it leaves many things to be desired. To +actually use this tool, I need to read annotations from a GFF format +file as well as parse Bowtie output files directly. In addition, I +need to search for overlapping cdses more efficiently--the current +approach is quite simplistic. +

+ + + }