[Tutorial for location and position stuff nick@ingolia.org**20090315065855] { adddir ./doc addfile ./doc/location-tutorial.lhs hunk ./doc/location-tutorial.lhs 1 + + +
+ +The modules in Bio.Location are designed to represent +positions and locations (ranges of positions) on sequences, +particularly nucleotide sequences. In this tutorial, I present +examples based on quantifying gene expression from deep sequencing of +mRNAs. This is a simplification of the actual use case that led me to +develop the modules.
+ +Deep sequencing technology produces millions of short sequencing +reads. In some cases, the sequencing sample is produced from genomic +DNA and the sequences can reveal changes in a specific individual +genome compared to a reference or even provide novel sequence from an +unknown organism. Deep sequencing of samples produced from mRNA +transcripts rather than genomic DNA is another major application of +this new technology. Here, the analysis involves simply counting the +number of sequencing reads derived from each gene's mRNA in order to +determine its relative abundance. When the mRNA for YFG1 is +ten-fold more abundant than that of YFG2, then there will be +roughly ten-fold more fragments of YFG1 than fragments of +YFG2 in the sequencing sample. This will in turn produce ten +times as many sequencing reads that align to the genome within +YFG1.
+ +My goal is to take a list of gene locations and a list of +sequencing read alignment locations and to count how many read +alignments fall in each gene. I start by importing the various modules +I'll need, many of which are designed for qualified import:
+ ++ +> import qualified Data.ByteString.Lazy.Char8 as LBS +> import Data.List (intercalate) +> +> import Bio.Location.OnSeq +> import qualified Bio.Location.ContigLocation as CLoc +> import qualified Bio.Location.Location as Loc +> import qualified Bio.Location.Position as Pos +> import qualified Bio.Location.SeqLocation as SeqLoc +> import Bio.Location.Strand +> import Bio.Sequence.SeqData + ++ +
The first step of analysis is to use an external program such as +Bowtie to align each short deep sequencing read to the genome. I'll +start with the genomic alignment position of a deep sequencing read. +A few lines of Bowtie +alignment are shown below:
+ ++HWI-EAS355_6_Nick:5:1:855:1808 - chrXII 453779 GGTCTAGGCTTCGTCACTGACCTCCAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 +HWI-EAS355_6_Nick:5:1:806:1496 - chrXII 454255 GTTCGATTAGTCTTTCGCCCCTATACC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 +HWI-EAS355_6_Nick:5:1:1657:1865 - chrIV 209505 TAACAGTTCTTGCCGAAAAATCCACAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 +HWI-EAS355_6_Nick:5:1:1469:1010 - chrXIV 742572 CCAAAAGTAATATATTGAATTAATAAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 +HWI-EAS355_6_Nick:5:1:1471:1449 + chrIV 809813 GAAGGATCTAGGTACCATCGTTGAAGA hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 26:G>A ++ +
Columns two through four describe the position of the alignment. +The second column indicates whether the alignment is on the forward +(+) or reverse complement (-) strand. The third +gives the sequence name and the fourth gives the offset into the +sequence. According to the Bowtie +manual, the fourth column gives the "0-based offset into the +reference sequence where leftmost character of the alignment occurs." +The question of zero- versus one-based indexing is a common +complication in bioinformatics. The Bio.Location modules use +zero-based indexing, so the offsets can be used directly to index into +SeqData.
+ +The SeqPos data type in Bio.Location.SeqLocation +module provides a direct representation of these three components. It +is constructed from a Pos, specifying the offset and strand, +nested within an OnSeq, giving the name of the sequence where +the read aligned. The strand is given with the Strand data +type. The one complication is that the offset in the fourth column +gives the left-most character of the alignment, but when the alignment +is on the reverse complement strand the starting position is actually +the right-most character of the alignment.
+ ++ +> aligns :: [SeqLoc.SeqPos] +> aligns = map mkSeqPos [ ('-', "chrXII", 453779) +> , ('-', "chrXII", 454255) +> , ('-', "chrIV", 209505) +> , ('-', "chrXIV", 742572) +> , ('+', "chrIV", 809813) +> ] +> where mkSeqPos ('+', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos offset Fwd) +> mkSeqPos ('-', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos (offset + 1 - readLength) RevCompl) +> where readLength = 27 + ++ +
Next, I need genomic annotations that indicate the location of +genes. Below are sample GFF format annotations for the +yeast genome:
+ ++ +chrIV SGD gene 1802 2953 . + . ID=YDL248W;Name=YDL248W;gene=COS7 +chrIV SGD CDS 1802 2953 . + 0 Parent=YDL248W;Name=YDL248W;gene=COS7 +chrIV SGD gene 3762 3836 . + . ID=YDL247W-A;Name=YDL247W-A +chrIV SGD CDS 3762 3836 . + 0 Parent=YDL247W-A;Name=YDL247W-A +chrIV SGD gene 5985 7814 . + . ID=YDL247W;Name=YDL247W;gene=MPH2 +chrIV SGD CDS 5985 7814 . + 0 Parent=YDL247W;Name=YDL247W;gene=MPH2 +chrIV SGD gene 8683 9756 . - . ID=YDL246C;Name=YDL246C;gene=SOR2 +chrIV SGD CDS 8683 9756 . - 0 Parent=YDL246C;Name=YDL246C;gene=SOR2 +chrIV SGD gene 11657 13360 . - . ID=YDL245C;Name=YDL245C;gene=HXT15 +chrIV SGD CDS 11657 13360 . - 0 Parent=YDL245C;Name=YDL245C;gene=HXT15 +chrIV SGD ARS 15493 15739 . . . ID=ARS403;Name=ARS403;Alias=ARSIV-16 +chrIV SGD gene 16204 17226 . + . ID=YDL244W;Name=YDL244W;gene=THI13 +chrIV SGD CDS 16204 17226 . + 0 Parent=YDL244W;Name=YDL244W;gene=THI13 + ++ +
I start by using the ContigSeqLoc data type to represent a +sequence location for the coding sequences (CDSes) in this annotation. +Columns 4 and 5 give the bounds of the coding sequence and column 7 +indicates the strand. The bounds are always given from lower to +higher, even when the coding sequence is on the reverse complement +strand. Conveniently, the ContigLoc data type represents +sequence locations based on the left-most sequence position, along +with the length and the strand. As above, this means that for reverse +complement features the offset position corresponds to the end of the +feature.
+ ++ +> cdsContigs :: [(String, SeqLoc.ContigSeqLoc)] +> cdsContigs = map mkContigSeqLoc [ ("YDL248W", ("chrIV", 1802, 2953, '+')) +> , ("YDL247W-A", ("chrIV", 3762, 3836, '+')) +> , ("YDL247W", ("chrIV", 5985, 7814, '+')) +> , ("YDL246C", ("chrIV", 8683, 9756, '-')) +> , ("YDL245C", ("chrIV", 11657, 13360, '-')) +> , ("YDL244W", ("chrIV", 16204, 17226, '+')) +> ] +> where mkContigSeqLoc (name, (chrName, leftOff, rightOff, strChr)) +> = (name, OnSeq (LBS.pack chrName) (CLoc.ContigLoc leftOff len str)) +> where len = 1 + rightOff - leftOff +> str | strChr == '+' = Fwd +> | strChr == '-' = RevCompl +> | otherwise = error $ "Unknown strand " ++ show strChr + ++ +
Even in the relatively simple yeast genome, however, I quickly run +into a problem with the genome annotations:
+ ++ +chrIV SGD gene 65243 65766 . + . ID=YDL219W;Name=YDL219W;gene=DTD1 +chrIV SGD CDS 65243 65307 . + 0 Parent=YDL219W;Name=YDL219W;gene=DTD1 +chrIV SGD CDS 65379 65766 . + 1 Parent=YDL219W;Name=YDL219W;gene=DTD1 +chrIV SGD intron 65308 65378 . + . Parent=YDL219W;Name=YDL219W;gene=DTD1 + ++ +
The DTD1 gene is spliced, meaning that the coding sequence +is comprised of two non-contiguous regions called exons. I use the +SeqLoc data type to represent these concatenated +sub-locations. It is based on the Loc data type from the +Bio.Location.Location package, which is a list of the +ContigLoc locations I used above. It isn't apparent in the +case of YDL219W, but when a gene is encoded on the reverse +complement strand the individual exons are listed from left to right +on the chromosome. As reverse strand genes run from right to left, +the exons must be reversed.
+ ++ +> cdses :: [(String, SeqLoc.SeqLoc)] +> cdses = map mkSeqLoc [ ("YDL248W", ("chrIV", [ ( 1802, 2953, '+') ])) +> , ("YDL247W-A", ("chrIV", [ ( 3762, 3836, '+') ])) +> , ("YDL247W", ("chrIV", [ ( 5985, 7814, '+') ])) +> , ("YDL246C", ("chrIV", [ ( 8683, 9756, '-') ])) +> , ("YDL245C", ("chrIV", [ ( 11657, 13360, '-') ])) +> , ("YDL244W", ("chrIV", [ ( 16204, 17226, '+') ])) +> , ("YDL219W", ("chrIV", [ ( 65243, 65307, '+'), (65379, 65766, '+') ])) +> , ("YDL140C", ("chrIV", [ (205361, 210562, '-') ])) +> , ("YDR172W", ("chrIV", [ (808321, 810378, '+') ])) +> ] +> where mkSeqLoc (name, (chrName, contigs)) +> = (name, OnSeq (LBS.pack chrName) (Loc.Loc clocs)) +> where clocList@(cloc1:_) = map mkCLoc contigs +> clocs = case CLoc.strand cloc1 of +> Fwd -> clocList +> RevCompl -> reverse clocList +> mkCLoc (leftOff, rightOff, strChr) +> = CLoc.ContigLoc leftOff len str +> where len = 1 + rightOff - leftOff +> str | strChr == '+' = Fwd +> | strChr == '-' = RevCompl +> | otherwise = error $ "Unknown strand " ++ show strChr + ++ +
I augmented the small subset of chromosome IV genes with a few more +from further along the chromosome. These will allow me to actually +find some hits for my sequencing read alignments.
+ +I now actually write a function to find where the sequencing reads +lie in a gene. Budding yeast has a relatively small proportion of +directly overlapping CDses, but I still want to allow for multiple +hits, or no hits, for a sequencing read. I will use the +isWithin function from the Bio.Location.SeqLocation +module. This tests whether a position lies within a location--the +position must be on the same sequence (identical sequence name) and +the same strand as the location as well as lying within the bounds of +one of the contiguous blocks making up the location.
+ ++ +> isHit :: SeqLoc.SeqPos -> (a, SeqLoc.SeqLoc) -> Bool +> isHit pos (_, loc) = pos `SeqLoc.isWithin` loc + +> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)] +> posHits targetLocs queryPos = filter (isHit queryPos) targetLocs + ++ +
Finally, I want to determine where in the gene a read starts. It +may be interesting if substantially more reads are found in one region +of the gene than another. To do this, I'll use the posInto +function. This function takes a location and a position on the same +sequence and calculates a new position which corresponds to the +original position, expressed relative to the coordinates of the +location. That is, if the location runs from 100 to 200 and the +position is 110, then the offset resulting from posInto will +be 10. This function only computes a value when the position actually +lies inside the location. I have already filtered for just the +locations that do contain the position, so I will produce a runtime +error if it fails. This requires a bit of plumbing to manage the +errors. The posInto function has an inverse, the +posOutof function.
+ ++ +> hitOffset :: SeqLoc.SeqPos -> SeqLoc.SeqLoc -> Offset +> hitOffset pos loc = either error Pos.offset into +> where into = onSameSeq posIntoE pos loc +> posIntoE p l = maybe outsideErr Right $ p `Loc.posInto` l +> where outsideErr = Left "position outside hit location" + ++ +
I put these together with a function to produce human-readable +output describing the location of the sequencing read relative to +protein-coding genes.
+ ++ +> posDescribeHits :: [(String, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> String +> posDescribeHits targetLocs queryPos +> = SeqLoc.displaySeqPos queryPos ++ ": " ++ describeHits +> where describeHits = case posHits targetLocs queryPos of +> [] -> "no CDS" +> hits -> intercalate "; " . map describeHit $ hits +> describeHit (cdsName, cdsSeqLoc) = cdsName ++ " at " ++ show (hitOffset queryPos cdsSeqLoc) + ++ +
Finally, I map the hit-describing function over each alignment.
+ ++ +> describeAligns = mapM_ (putStrLn . posDescribeHits cdses) aligns + ++ +
The describeAligns function can be run to produce the +following results:
+ ++ +chrXII@453753(-): no CDS +chrXII@454229(-): no CDS +chrIV@209479(-): YDL140C at 1083 +chrXIV@742546(-): no CDS +chrIV@809813(+): YDR172W at 1492 + ++ +
This provides a demonstration of the position and location +interfaces. However, it leaves many things to be desired. To +actually use this tool, I need to read annotations from a GFF format +file as well as parse Bowtie output files directly. In addition, I +need to search for overlapping cdses more efficiently--the current +approach is quite simplistic. +
+ + + }