[Change tutorial to markdown and pandoc-compile html nick@ingolia.org**20090315190044] { move ./doc/location-tutorial.lhs ./doc/location-tutorial.txt addfile ./doc/location-tutorial.html hunk ./doc/location-tutorial.html 1 + +

Introduction

The modules in Bio.Location are designed to represent positions and locations (ranges of positions) on sequences, particularly nucleotide sequences. In this tutorial, I present examples based on quantifying gene expression from deep sequencing of mRNAs. This is a simplification of the actual use case that led me to develop the modules.

Deep sequencing technology produces millions of short sequencing reads. In some cases, the sequencing sample is produced from genomic DNA and the sequences can reveal changes in a specific individual genome compared to a reference or even provide novel sequence from an unknown organism. Deep sequencing of samples produced from mRNA transcripts rather than genomic DNA is another major application of this new technology. Here, the analysis involves simply counting the number of sequencing reads derived from each gene's mRNA in order to determine its relative abundance. When the mRNA for YFG1 is ten-fold more abundant than that of YFG2, then there will be roughly ten-fold more fragments of YFG1 than fragments of YFG2 in the sequencing sample. This will in turn produce ten times as many sequencing reads that align to the genome within YFG1.

My goal is to take a list of gene locations and a list of sequencing read alignment locations and to count how many read alignments fall in each gene. I start by importing the various modules I'll need, many of which are designed for qualified import:

> import qualified Data.ByteString.Lazy.Char8 as LBS
> import Data.List (intercalate)
>
> import Bio.Location.OnSeq
> import qualified Bio.Location.ContigLocation as CLoc
> import qualified Bio.Location.Location as Loc
> import qualified Bio.Location.Position as Pos
> import qualified Bio.Location.SeqLocation as SeqLoc
> import qualified Bio.Location.SeqLocMap as SLM
> import Bio.Location.Strand
> import Bio.Sequence.SeqData

Sequence positions

The first step of analysis is to use an external program such as Bowtie to align each short deep sequencing read to the genome. I'll start with the genomic alignment position of a deep sequencing read. A few lines of Bowtie alignment are shown below:

HWI-EAS355_6_Nick:5:1:855:1808  -       chrXII  453779  GGTCTAGGCTTCGTCACTGACCTCCAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
+HWI-EAS355_6_Nick:5:1:806:1496  -       chrXII  454255  GTTCGATTAGTCTTTCGCCCCTATACC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
+HWI-EAS355_6_Nick:5:1:1657:1865 -       chrIV   209505  TAACAGTTCTTGCCGAAAAATCCACAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
+HWI-EAS355_6_Nick:5:1:1469:1010 -       chrXIV  742572  CCAAAAGTAATATATTGAATTAATAAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
+HWI-EAS355_6_Nick:5:1:1471:1449 +       chrIV   809813  GAAGGATCTAGGTACCATCGTTGAAGA     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0      26:G>A
+

Columns two through four describe the position of the alignment. The second column indicates whether the alignment is on the forward (+) or reverse complement (-) strand. The third gives the sequence name and the fourth gives the offset into the sequence. According to the Bowtie manual, the fourth column gives the "0-based offset into the reference sequence where leftmost character of the alignment occurs." The question of zero- versus one-based indexing is a common complication in bioinformatics. The Bio.Location modules use zero-based indexing, so the offsets can be used directly to index into SeqData.

The SeqPos data type in Bio.Location.SeqLocation module provides a direct representation of these three components. It is constructed from a Pos, specifying the offset and strand, nested within an OnSeq, giving the name of the sequence where the read aligned. The strand is given with the Strand data type. The one complication is that the offset in the fourth column gives the left-most character of the alignment, but when the alignment is on the reverse complement strand the starting position is actually the right-most character of the alignment.

> aligns :: [SeqLoc.SeqPos]
> aligns = map mkSeqPos [ ('-', "chrXII", 453779)
> , ('-', "chrXII", 454255)
> , ('-', "chrIV", 209505)
> , ('-', "chrXIV", 742572)
> , ('+', "chrIV", 809813)
> ]
> where mkSeqPos ('+', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos offset Fwd)
> mkSeqPos ('-', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos (offset + 1 - readLength) RevCompl)
> where readLength = 27

Sequence Locations

Next, I need genomic annotations that indicate the location of genes. Below are sample GFF format annotations for the yeast genome:

chrIV   SGD     gene    1802    2953    .       +       .       ID=YDL248W;Name=YDL248W;gene=COS7
+chrIV   SGD     CDS     1802    2953    .       +       0       Parent=YDL248W;Name=YDL248W;gene=COS7
+chrIV   SGD     gene    3762    3836    .       +       .       ID=YDL247W-A;Name=YDL247W-A
+chrIV   SGD     CDS     3762    3836    .       +       0       Parent=YDL247W-A;Name=YDL247W-A
+chrIV   SGD     gene    5985    7814    .       +       .       ID=YDL247W;Name=YDL247W;gene=MPH2
+chrIV   SGD     CDS     5985    7814    .       +       0       Parent=YDL247W;Name=YDL247W;gene=MPH2
+chrIV   SGD     gene    8683    9756    .       -       .       ID=YDL246C;Name=YDL246C;gene=SOR2
+chrIV   SGD     CDS     8683    9756    .       -       0       Parent=YDL246C;Name=YDL246C;gene=SOR2
+chrIV   SGD     gene    11657   13360   .       -       .       ID=YDL245C;Name=YDL245C;gene=HXT15
+chrIV   SGD     CDS     11657   13360   .       -       0       Parent=YDL245C;Name=YDL245C;gene=HXT15
+chrIV   SGD     ARS     15493   15739   .       .       .       ID=ARS403;Name=ARS403;Alias=ARSIV-16
+chrIV   SGD     gene    16204   17226   .       +       .       ID=YDL244W;Name=YDL244W;gene=THI13
+chrIV   SGD     CDS     16204   17226   .       +       0       Parent=YDL244W;Name=YDL244W;gene=THI13
+

I start by using the ContigSeqLoc data type to represent a sequence location for the coding sequences (CDSes) in this annotation. Columns 4 and 5 give the bounds of the coding sequence and column 7 indicates the strand. The bounds are always given from lower to higher, even when the coding sequence is on the reverse complement strand. Conveniently, the ContigLoc data type represents sequence locations based on the left-most sequence position, along with the length and the strand. As above, this means that for reverse complement features the offset position corresponds to the end of the feature.

> cdsContigs :: [(String, SeqLoc.ContigSeqLoc)]
> cdsContigs = map mkContigSeqLoc [ ("YDL248W", ("chrIV", 1802, 2953, '+'))
> , ("YDL247W-A", ("chrIV", 3762, 3836, '+'))
> , ("YDL247W", ("chrIV", 5985, 7814, '+'))
> , ("YDL246C", ("chrIV", 8683, 9756, '-'))
> , ("YDL245C", ("chrIV", 11657, 13360, '-'))
> , ("YDL244W", ("chrIV", 16204, 17226, '+'))
> ]
> where mkContigSeqLoc (name, (chrName, leftOff, rightOff, strChr))
> = (name, OnSeq (LBS.pack chrName) (CLoc.ContigLoc leftOff len str))
> where len = 1 + rightOff - leftOff
> str | strChr == '+' = Fwd
> | strChr == '-' = RevCompl
> | otherwise = error $ "Unknown strand " ++ show strChr

Discontinuous sequence locations

Even in the relatively simple yeast genome, however, I quickly run into a problem with the genome annotations:

chrIV   SGD     gene    65243   65766   .       +       .       ID=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     CDS     65243   65307   .       +       0       Parent=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     CDS     65379   65766   .       +       1       Parent=YDL219W;Name=YDL219W;gene=DTD1
+chrIV   SGD     intron  65308   65378   .       +       .       Parent=YDL219W;Name=YDL219W;gene=DTD1
+

The DTD1 gene is spliced, meaning that the coding sequence is comprised of two non-contiguous regions called exons. I use the SeqLoc data type to represent these concatenated sub-locations. It is based on the Loc data type from the Bio.Location.Location package, which is a list of the ContigLoc locations I used above. It isn't apparent in the case of YDL219W, but when a gene is encoded on the reverse complement strand the individual exons are listed from left to right on the chromosome. As reverse strand genes run from right to left, the exons must be reversed.

> cdses :: [(String, SeqLoc.SeqLoc)]
> cdses = map mkSeqLoc [ ("YDL248W", ("chrIV", [ ( 1802, 2953, '+') ]))
> , ("YDL247W-A", ("chrIV", [ ( 3762, 3836, '+') ]))
> , ("YDL247W", ("chrIV", [ ( 5985, 7814, '+') ]))
> , ("YDL246C", ("chrIV", [ ( 8683, 9756, '-') ]))
> , ("YDL245C", ("chrIV", [ ( 11657, 13360, '-') ]))
> , ("YDL244W", ("chrIV", [ ( 16204, 17226, '+') ]))
> , ("YDL219W", ("chrIV", [ ( 65243, 65307, '+'), (65379, 65766, '+') ]))
> , ("YDL140C", ("chrIV", [ (205361, 210562, '-') ]))
> , ("YDR172W", ("chrIV", [ (808321, 810378, '+') ]))
> ]
> where mkSeqLoc (name, (chrName, contigs))
> = (name, OnSeq (LBS.pack chrName) (Loc.Loc clocs))
> where clocList@(cloc1:_) = map mkCLoc contigs
> clocs = case CLoc.strand cloc1 of
> Fwd -> clocList
> RevCompl -> reverse clocList
> mkCLoc (leftOff, rightOff, strChr)
> = CLoc.ContigLoc leftOff len str
> where len = 1 + rightOff - leftOff
> str | strChr == '+' = Fwd
> | strChr == '-' = RevCompl
> | otherwise = error $ "Unknown strand " ++ show strChr

I augmented the small subset of chromosome IV genes with a few more from further along the chromosome. These will allow me to actually find some hits for my sequencing read alignments.

Finding locations, naively

I now actually write a function to find where the sequencing reads lie in a gene. Budding yeast has a relatively small proportion of directly overlapping CDses, but I still want to allow for multiple hits, or no hits, for a sequencing read. I will use the isWithin function from the Bio.Location.SeqLocation module. This tests whether a position lies within a location--the position must be on the same sequence (identical sequence name) and the same strand as the location as well as lying within the bounds of one of the contiguous blocks making up the location.

> isHit :: SeqLoc.SeqPos -> (a, SeqLoc.SeqLoc) -> Bool
> isHit pos (_, loc) = pos `SeqLoc.isWithin` loc
>
> posHitsNaive :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
> posHitsNaive targetLocs queryPos = filter (isHit queryPos) targetLocs

Finally, I want to determine where in the gene a read starts. It may be interesting if substantially more reads are found in one region of the gene than another. To do this, I'll use the posInto function. This function takes a location and a position on the same sequence and calculates a new position which corresponds to the original position, expressed relative to the coordinates of the location. That is, if the location runs from 100 to 200 and the position is 110, then the offset resulting from posInto will be 10. This function only computes a value when the position actually lies inside the location. I have already filtered for just the locations that do contain the position, so I will produce a runtime error if it fails. This requires a bit of plumbing to manage the errors. The posInto function has an inverse, the posOutof function.

> hitOffset :: SeqLoc.SeqPos -> SeqLoc.SeqLoc -> Offset
> hitOffset pos loc = either error Pos.offset into
> where into = onSameSeq posIntoE pos loc
> posIntoE p l = maybe outsideErr Right $ p `Loc.posInto` l
> where outsideErr = Left "position outside hit location"

I put these together with a function to produce human-readable output describing the location of the sequencing read relative to protein-coding genes.

> posDescribeHits :: (SeqLoc.SeqPos -> [(String, SeqLoc.SeqLoc)]) -> SeqLoc.SeqPos -> String
> posDescribeHits lookupHits queryPos
> = SeqLoc.displaySeqPos queryPos ++ ": " ++ describeHits
> where describeHits = case lookupHits queryPos of
> [] -> "no CDS"
> hits -> intercalate "; " . map describeHit $ hits
> describeHit (cdsName, cdsSeqLoc) = cdsName ++ " at " ++ show (hitOffset queryPos cdsSeqLoc)

Finally, I map the hit-describing function over each alignment.

> describeAlignsNaive = mapM_ (putStrLn . posDescribeHits (posHitsNaive cdses)) aligns

The describeAlignsNaive function can be run to produce the following results:

chrXII@453753(-): no CDS
+chrXII@454229(-): no CDS
+chrIV@209479(-): YDL140C at 1083
+chrXIV@742546(-): no CDS
+chrIV@809813(+): YDR172W at 1492
+

Finding locations

As suggested by its name, posHitsNaive is easy to write, but probably not the best way to look up which locations overlap a sequence position. The Bio.Location.LocMap and Bio.Location.SeqLocMap provide a much faster method of looking up positions. The LocMap data structure sorts locations into bins, with each location falling into one or more bins. A sequence position is then tested against only the set of locations overlapping its bin. The bin size is tuned to work well for yeast genes, with an average feature density of one per ~2kbp, and should work well for somewhat less dense annotation maps as well. The LocMap approach takes advantage of the fact that there is a great deal of structure in sequence location data, since each location is contained in a bounded range of sequence coordinates.

In order to use the SeqLocMap, I flip around the association list, making the locations into the key. I then construct the lookup map and apply it to sequence positions.

> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
> posHits targetLocs = map swap . flip SLM.lookupWithin targetSLM
> where swap (a,b) = (b,a)
> targetSLM = SLM.fromList . map swap $ targetLocs

I use the more sophisticated version of posHits with posDescribeHits and find the same result. There is an extensive set of QuickCheck properties ensuring that LocMap results match the naive approach, of course.

> describeAligns = mapM_ (putStrLn . posDescribeHits (posHits cdses)) aligns
chrXII@453753(-): no CDS
+chrXII@454229(-): no CDS
+chrIV@209479(-): YDL140C at 1083
+chrXIV@742546(-): no CDS
+chrIV@809813(+): YDR172W at 1492
+

This provides a demonstration of the position and location interfaces. For practical use, of course, I need to read annotations from a GFF format file as well as parse Bowtie output files directly.

+ hunk ./doc/location-tutorial.txt 1 - - - - -Bio.Location Tutorial - - +Introduction +============ hunk ./doc/location-tutorial.txt 4 -

Introduction

+The modules in `Bio.Location` are designed to represent positions and +locations (ranges of positions) on sequences, particularly nucleotide +sequences. In this tutorial, I present examples based on quantifying +gene expression from deep sequencing of mRNAs. This is a +simplification of the actual use case that led me to develop the +modules. hunk ./doc/location-tutorial.txt 11 -

The modules in Bio.Location are designed to represent -positions and locations (ranges of positions) on sequences, -particularly nucleotide sequences. In this tutorial, I present -examples based on quantifying gene expression from deep sequencing of -mRNAs. This is a simplification of the actual use case that led me to -develop the modules.

- -

Deep sequencing technology produces millions of short sequencing +Deep sequencing technology produces millions of short sequencing hunk ./doc/location-tutorial.txt 19 -determine its relative abundance. When the mRNA for YFG1 is -ten-fold more abundant than that of YFG2, then there will be -roughly ten-fold more fragments of YFG1 than fragments of -YFG2 in the sequencing sample. This will in turn produce ten -times as many sequencing reads that align to the genome within -YFG1.

+determine its relative abundance. When the mRNA for *YFG1* is +ten-fold more abundant than that of *YFG2*, then there will be roughly +ten-fold more fragments of *YFG1* than fragments of *YFG2* in the +sequencing sample. This will in turn produce ten times as many +sequencing reads that align to the genome within *YFG1*. hunk ./doc/location-tutorial.txt 25 -

My goal is to take a list of gene locations and a list of +My goal is to take a list of gene locations and a list of hunk ./doc/location-tutorial.txt 28 -I'll need, many of which are designed for qualified import:

- -
+I'll need, many of which are designed for qualified import:
hunk ./doc/location-tutorial.txt 38
+> import qualified Bio.Location.SeqLocMap as SLM
hunk ./doc/location-tutorial.txt 42
-
- -

Positions and locations

+Sequence positions +================== hunk ./doc/location-tutorial.txt 45 -

The first step of analysis is to use an external program such as +The first step of analysis is to use an external program such as hunk ./doc/location-tutorial.txt 48 -A few lines of Bowtie -alignment are shown below:

+A few lines of [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) +alignment are shown below: hunk ./doc/location-tutorial.txt 51 -
-HWI-EAS355_6_Nick:5:1:855:1808  -       chrXII  453779  GGTCTAGGCTTCGTCACTGACCTCCAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
-HWI-EAS355_6_Nick:5:1:806:1496  -       chrXII  454255  GTTCGATTAGTCTTTCGCCCCTATACC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     1       
-HWI-EAS355_6_Nick:5:1:1657:1865 -       chrIV   209505  TAACAGTTCTTGCCGAAAAATCCACAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
-HWI-EAS355_6_Nick:5:1:1469:1010 -       chrXIV  742572  CCAAAAGTAATATATTGAATTAATAAC     hhhhhhhhhhhhhhhhhhhhhhhhhhh     0       
-HWI-EAS355_6_Nick:5:1:1471:1449	+	chrIV	809813	GAAGGATCTAGGTACCATCGTTGAAGA	hhhhhhhhhhhhhhhhhhhhhhhhhhh	0	26:G>A
-
+ HWI-EAS355_6_Nick:5:1:855:1808 - chrXII 453779 GGTCTAGGCTTCGTCACTGACCTCCAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 + HWI-EAS355_6_Nick:5:1:806:1496 - chrXII 454255 GTTCGATTAGTCTTTCGCCCCTATACC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 + HWI-EAS355_6_Nick:5:1:1657:1865 - chrIV 209505 TAACAGTTCTTGCCGAAAAATCCACAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 + HWI-EAS355_6_Nick:5:1:1469:1010 - chrXIV 742572 CCAAAAGTAATATATTGAATTAATAAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 + HWI-EAS355_6_Nick:5:1:1471:1449 + chrIV 809813 GAAGGATCTAGGTACCATCGTTGAAGA hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 26:G>A hunk ./doc/location-tutorial.txt 57 -

Columns two through four describe the position of the alignment. -The second column indicates whether the alignment is on the forward -(+) or reverse complement (-) strand. The third -gives the sequence name and the fourth gives the offset into the -sequence. According to the Bowtie -manual, the fourth column gives the "0-based offset into the -reference sequence where leftmost character of the alignment occurs." -The question of zero- versus one-based indexing is a common -complication in bioinformatics. The Bio.Location modules use -zero-based indexing, so the offsets can be used directly to index into -SeqData.

+Columns two through four describe the position of the alignment. The +second column indicates whether the alignment is on the forward (`+`) +or reverse complement (`-`) strand. The third gives the sequence name +and the fourth gives the offset into the sequence. According to the +[Bowtie manual](http://bowtie-bio.sourceforge.net/manual.shtml), the +fourth column gives the "0-based offset into the reference sequence +where leftmost character of the alignment occurs." The question of +zero- versus one-based indexing is a common complication in +bioinformatics. The `Bio.Location` modules use zero-based indexing, so +the offsets can be used directly to index into `SeqData`. hunk ./doc/location-tutorial.txt 68 -

The SeqPos data type in Bio.Location.SeqLocation +The `SeqPos` data type in `Bio.Location.SeqLocation` hunk ./doc/location-tutorial.txt 70 -is constructed from a Pos, specifying the offset and strand, -nested within an OnSeq, giving the name of the sequence where -the read aligned. The strand is given with the Strand data +is constructed from a `Pos`, specifying the offset and strand, +nested within an `OnSeq`, giving the name of the sequence where +the read aligned. The strand is given with the `Strand` data hunk ./doc/location-tutorial.txt 76 -the right-most character of the alignment.

- -
+the right-most character of the alignment.
hunk ./doc/location-tutorial.txt 89
-
- -

Next, I need genomic annotations that indicate the location of -genes. Below are sample GFF format annotations for the -yeast genome:

- -
+Sequence Locations
+==================
hunk ./doc/location-tutorial.txt 92
-chrIV   SGD     gene    1802    2953    .       +       .       ID=YDL248W;Name=YDL248W;gene=COS7
-chrIV   SGD     CDS     1802    2953    .       +       0       Parent=YDL248W;Name=YDL248W;gene=COS7
-chrIV   SGD     gene    3762    3836    .       +       .       ID=YDL247W-A;Name=YDL247W-A
-chrIV   SGD     CDS     3762    3836    .       +       0       Parent=YDL247W-A;Name=YDL247W-A
-chrIV   SGD     gene    5985    7814    .       +       .       ID=YDL247W;Name=YDL247W;gene=MPH2
-chrIV   SGD     CDS     5985    7814    .       +       0       Parent=YDL247W;Name=YDL247W;gene=MPH2
-chrIV   SGD     gene    8683    9756    .       -       .       ID=YDL246C;Name=YDL246C;gene=SOR2
-chrIV   SGD     CDS     8683    9756    .       -       0       Parent=YDL246C;Name=YDL246C;gene=SOR2
-chrIV   SGD     gene    11657   13360   .       -       .       ID=YDL245C;Name=YDL245C;gene=HXT15
-chrIV   SGD     CDS     11657   13360   .       -       0       Parent=YDL245C;Name=YDL245C;gene=HXT15
-chrIV   SGD     ARS     15493   15739   .       .       .       ID=ARS403;Name=ARS403;Alias=ARSIV-16
-chrIV   SGD     gene    16204   17226   .       +       .       ID=YDL244W;Name=YDL244W;gene=THI13
-chrIV   SGD     CDS     16204   17226   .       +       0       Parent=YDL244W;Name=YDL244W;gene=THI13
+Next, I need genomic annotations that indicate the location of genes.
+Below are sample [GFF format](http://song.sourceforge.net/gff3.shtml)
+annotations for the [yeast genome](http://www.yeastgenome.org/):
hunk ./doc/location-tutorial.txt 96
-
+ chrIV SGD gene 1802 2953 . + . ID=YDL248W;Name=YDL248W;gene=COS7 + chrIV SGD CDS 1802 2953 . + 0 Parent=YDL248W;Name=YDL248W;gene=COS7 + chrIV SGD gene 3762 3836 . + . ID=YDL247W-A;Name=YDL247W-A + chrIV SGD CDS 3762 3836 . + 0 Parent=YDL247W-A;Name=YDL247W-A + chrIV SGD gene 5985 7814 . + . ID=YDL247W;Name=YDL247W;gene=MPH2 + chrIV SGD CDS 5985 7814 . + 0 Parent=YDL247W;Name=YDL247W;gene=MPH2 + chrIV SGD gene 8683 9756 . - . ID=YDL246C;Name=YDL246C;gene=SOR2 + chrIV SGD CDS 8683 9756 . - 0 Parent=YDL246C;Name=YDL246C;gene=SOR2 + chrIV SGD gene 11657 13360 . - . ID=YDL245C;Name=YDL245C;gene=HXT15 + chrIV SGD CDS 11657 13360 . - 0 Parent=YDL245C;Name=YDL245C;gene=HXT15 + chrIV SGD ARS 15493 15739 . . . ID=ARS403;Name=ARS403;Alias=ARSIV-16 + chrIV SGD gene 16204 17226 . + . ID=YDL244W;Name=YDL244W;gene=THI13 + chrIV SGD CDS 16204 17226 . + 0 Parent=YDL244W;Name=YDL244W;gene=THI13 hunk ./doc/location-tutorial.txt 110 -

I start by using the ContigSeqLoc data type to represent a +I start by using the `ContigSeqLoc` data type to represent a hunk ./doc/location-tutorial.txt 115 -strand. Conveniently, the ContigLoc data type represents +strand. Conveniently, the `ContigLoc` data type represents hunk ./doc/location-tutorial.txt 119 -feature.

- -
+feature.
hunk ./doc/location-tutorial.txt 136
-
- -

Even in the relatively simple yeast genome, however, I quickly run -into a problem with the genome annotations:

- -
+Discontinuous sequence locations
+================================
hunk ./doc/location-tutorial.txt 139
-chrIV   SGD     gene    65243   65766   .       +       .       ID=YDL219W;Name=YDL219W;gene=DTD1
-chrIV   SGD     CDS     65243   65307   .       +       0       Parent=YDL219W;Name=YDL219W;gene=DTD1
-chrIV   SGD     CDS     65379   65766   .       +       1       Parent=YDL219W;Name=YDL219W;gene=DTD1
-chrIV   SGD     intron  65308   65378   .       +       .       Parent=YDL219W;Name=YDL219W;gene=DTD1
+Even in the relatively simple yeast genome, however, I quickly run
+into a problem with the genome annotations: 
hunk ./doc/location-tutorial.txt 142
-
+ chrIV SGD gene 65243 65766 . + . ID=YDL219W;Name=YDL219W;gene=DTD1 + chrIV SGD CDS 65243 65307 . + 0 Parent=YDL219W;Name=YDL219W;gene=DTD1 + chrIV SGD CDS 65379 65766 . + 1 Parent=YDL219W;Name=YDL219W;gene=DTD1 + chrIV SGD intron 65308 65378 . + . Parent=YDL219W;Name=YDL219W;gene=DTD1 hunk ./doc/location-tutorial.txt 147 -

The DTD1 gene is spliced, meaning that the coding sequence +The *DTD1* gene is spliced, meaning that the coding sequence hunk ./doc/location-tutorial.txt 149 -SeqLoc data type to represent these concatenated -sub-locations. It is based on the Loc data type from the -Bio.Location.Location package, which is a list of the -ContigLoc locations I used above. It isn't apparent in the -case of YDL219W, but when a gene is encoded on the reverse +`SeqLoc` data type to represent these concatenated +sub-locations. It is based on the `Loc` data type from the +`Bio.Location.Location` package, which is a list of the +`ContigLoc` locations I used above. It isn't apparent in the +case of *YDL219W*, but when a gene is encoded on the reverse hunk ./doc/location-tutorial.txt 156 -the exons must be reversed.

- -
+the exons must be reversed.
hunk ./doc/location-tutorial.txt 182
-
- -

I augmented the small subset of chromosome IV genes with a few more +I augmented the small subset of chromosome IV genes with a few more hunk ./doc/location-tutorial.txt 184 -find some hits for my sequencing read alignments.

+find some hits for my sequencing read alignments. hunk ./doc/location-tutorial.txt 186 -

I now actually write a function to find where the sequencing reads +Finding locations, naively +========================== + +I now actually write a function to find where the sequencing reads hunk ./doc/location-tutorial.txt 193 -isWithin function from the Bio.Location.SeqLocation +`isWithin` function from the `Bio.Location.SeqLocation` hunk ./doc/location-tutorial.txt 197 -one of the contiguous blocks making up the location.

- -
+one of the contiguous blocks making up the location.
hunk ./doc/location-tutorial.txt 201
+>
+> posHitsNaive :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
+> posHitsNaive targetLocs queryPos = filter (isHit queryPos) targetLocs
hunk ./doc/location-tutorial.txt 205
-> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)]
-> posHits targetLocs queryPos = filter (isHit queryPos) targetLocs
-
-
- -

Finally, I want to determine where in the gene a read starts. It +Finally, I want to determine where in the gene a read starts. It hunk ./doc/location-tutorial.txt 207 -of the gene than another. To do this, I'll use the posInto +of the gene than another. To do this, I'll use the `posInto` hunk ./doc/location-tutorial.txt 212 -position is 110, then the offset resulting from posInto will +position is 110, then the offset resulting from `posInto` will hunk ./doc/location-tutorial.txt 217 -errors. The posInto function has an inverse, the -posOutof function.

- -
+errors.  The `posInto` function has an inverse, the
+`posOutof` function.
hunk ./doc/location-tutorial.txt 226
-
- -

I put these together with a function to produce human-readable +I put these together with a function to produce human-readable hunk ./doc/location-tutorial.txt 228 -protein-coding genes.

- -
+protein-coding genes.
hunk ./doc/location-tutorial.txt 230
-> posDescribeHits :: [(String, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> String
-> posDescribeHits targetLocs queryPos
+> posDescribeHits :: (SeqLoc.SeqPos -> [(String, SeqLoc.SeqLoc)]) -> SeqLoc.SeqPos -> String
+> posDescribeHits lookupHits queryPos
hunk ./doc/location-tutorial.txt 233
->     where describeHits = case posHits targetLocs queryPos of
+>     where describeHits = case lookupHits queryPos of
hunk ./doc/location-tutorial.txt 238
-
+Finally, I map the hit-describing function over each alignment. + +> describeAlignsNaive = mapM_ (putStrLn . posDescribeHits (posHitsNaive cdses)) aligns hunk ./doc/location-tutorial.txt 242 -

Finally, I map the hit-describing function over each alignment.

+The `describeAlignsNaive` function can be run to produce the +following results: hunk ./doc/location-tutorial.txt 245 -
+	chrXII@453753(-): no CDS
+	chrXII@454229(-): no CDS
+	chrIV@209479(-): YDL140C at 1083
+	chrXIV@742546(-): no CDS
+	chrIV@809813(+): YDR172W at 1492
hunk ./doc/location-tutorial.txt 251
-> describeAligns = mapM_ (putStrLn . posDescribeHits cdses) aligns
+Finding locations
+=================
hunk ./doc/location-tutorial.txt 254
-
+As suggested by its name, `posHitsNaive` is easy to write, but +probably not the best way to look up which locations overlap a +sequence position. The `Bio.Location.LocMap` and +`Bio.Location.SeqLocMap` provide a much faster method of looking up +positions. The `LocMap` data structure sorts locations into bins, +with each location falling into one or more bins. A sequence position +is then tested against only the set of locations overlapping its bin. +The bin size is tuned to work well for yeast genes, with an average +feature density of one per ~2kbp, and should work well for somewhat +less dense annotation maps as well. The `LocMap` approach takes +advantage of the fact that there is a great deal of structure in +sequence location data, since each location is contained in a bounded +range of sequence coordinates. hunk ./doc/location-tutorial.txt 268 -

The describeAligns function can be run to produce the -following results:

+In order to use the `SeqLocMap`, I flip around the association list, +making the locations into the key. I then construct the lookup map +and apply it to sequence positions. + +> posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)] +> posHits targetLocs = map swap . flip SLM.lookupWithin targetSLM +> where swap (a,b) = (b,a) +> targetSLM = SLM.fromList . map swap $ targetLocs hunk ./doc/location-tutorial.txt 277 -
+I use the more sophisticated version of `posHits` with
+`posDescribeHits` and find the same result.  There is an extensive set
+of QuickCheck properties ensuring that `LocMap` results match the
+naive approach, of course.
hunk ./doc/location-tutorial.txt 282
-chrXII@453753(-): no CDS
-chrXII@454229(-): no CDS
-chrIV@209479(-): YDL140C at 1083
-chrXIV@742546(-): no CDS
-chrIV@809813(+): YDR172W at 1492
+> describeAligns = mapM_ (putStrLn . posDescribeHits (posHits cdses)) aligns
hunk ./doc/location-tutorial.txt 284
-
+ chrXII@453753(-): no CDS + chrXII@454229(-): no CDS + chrIV@209479(-): YDL140C at 1083 + chrXIV@742546(-): no CDS + chrIV@809813(+): YDR172W at 1492 hunk ./doc/location-tutorial.txt 290 -

This provides a demonstration of the position and location -interfaces. However, it leaves many things to be desired. To -actually use this tool, I need to read annotations from a GFF format -file as well as parse Bowtie output files directly. In addition, I -need to search for overlapping cdses more efficiently--the current -approach is quite simplistic. -

+This provides a demonstration of the position and location interfaces. +For practical use, of course, I need to read annotations from a GFF +format file as well as parse Bowtie output files directly. hunk ./doc/location-tutorial.txt 294 - - }