Introduction ============ The modules in `Bio.Location` are designed to represent positions and locations (ranges of positions) on sequences, particularly nucleotide sequences. In this tutorial, I present examples based on quantifying gene expression from deep sequencing of mRNAs. This is a simplification of the actual use case that led me to develop the modules. Deep sequencing technology produces millions of short sequencing reads. In some cases, the sequencing sample is produced from genomic DNA and the sequences can reveal changes in a specific individual genome compared to a reference or even provide novel sequence from an unknown organism. Deep sequencing of samples produced from mRNA transcripts rather than genomic DNA is another major application of this new technology. Here, the analysis involves simply counting the number of sequencing reads derived from each gene's mRNA in order to determine its relative abundance. When the mRNA for *YFG1* is ten-fold more abundant than that of *YFG2*, then there will be roughly ten-fold more fragments of *YFG1* than fragments of *YFG2* in the sequencing sample. This will in turn produce ten times as many sequencing reads that align to the genome within *YFG1*. My goal is to take a list of gene locations and a list of sequencing read alignment locations and to count how many read alignments fall in each gene. I start by importing the various modules I'll need, many of which are designed for qualified import: > import qualified Data.ByteString.Lazy.Char8 as LBS > import Data.List (intercalate) > > import Bio.Location.OnSeq > import qualified Bio.Location.ContigLocation as CLoc > import qualified Bio.Location.Location as Loc > import qualified Bio.Location.Position as Pos > import qualified Bio.Location.SeqLocation as SeqLoc > import qualified Bio.Location.SeqLocMap as SLM > import Bio.Location.Strand > import Bio.Sequence.SeqData Sequence positions ================== The first step of analysis is to use an external program such as Bowtie to align each short deep sequencing read to the genome. I'll start with the genomic alignment position of a deep sequencing read. A few lines of [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) alignment are shown below: HWI-EAS355_6_Nick:5:1:855:1808 - chrXII 453779 GGTCTAGGCTTCGTCACTGACCTCCAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 HWI-EAS355_6_Nick:5:1:806:1496 - chrXII 454255 GTTCGATTAGTCTTTCGCCCCTATACC hhhhhhhhhhhhhhhhhhhhhhhhhhh 1 HWI-EAS355_6_Nick:5:1:1657:1865 - chrIV 209505 TAACAGTTCTTGCCGAAAAATCCACAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 HWI-EAS355_6_Nick:5:1:1469:1010 - chrXIV 742572 CCAAAAGTAATATATTGAATTAATAAC hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 HWI-EAS355_6_Nick:5:1:1471:1449 + chrIV 809813 GAAGGATCTAGGTACCATCGTTGAAGA hhhhhhhhhhhhhhhhhhhhhhhhhhh 0 26:G>A Columns two through four describe the position of the alignment. The second column indicates whether the alignment is on the forward (`+`) or reverse complement (`-`) strand. The third gives the sequence name and the fourth gives the offset into the sequence. According to the [Bowtie manual](http://bowtie-bio.sourceforge.net/manual.shtml), the fourth column gives the "0-based offset into the reference sequence where leftmost character of the alignment occurs." The question of zero- versus one-based indexing is a common complication in bioinformatics. The `Bio.Location` modules use zero-based indexing, so the offsets can be used directly to index into `SeqData`. The `SeqPos` data type in `Bio.Location.SeqLocation` module provides a direct representation of these three components. It is constructed from a `Pos`, specifying the offset and strand, nested within an `OnSeq`, giving the name of the sequence where the read aligned. The strand is given with the `Strand` data type. The one complication is that the offset in the fourth column gives the left-most character of the alignment, but when the alignment is on the reverse complement strand the starting position is actually the right-most character of the alignment. > aligns :: [SeqLoc.SeqPos] > aligns = map mkSeqPos [ ('-', "chrXII", 453779) > , ('-', "chrXII", 454255) > , ('-', "chrIV", 209505) > , ('-', "chrXIV", 742572) > , ('+', "chrIV", 809813) > ] > where mkSeqPos ('+', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos offset Fwd) > mkSeqPos ('-', seqName, offset) = OnSeq (LBS.pack seqName) (Pos.Pos (offset + 1 - readLength) RevCompl) > where readLength = 27 Sequence Locations ================== Next, I need genomic annotations that indicate the location of genes. Below are sample [GFF format](http://song.sourceforge.net/gff3.shtml) annotations for the [yeast genome](http://www.yeastgenome.org/): chrIV SGD gene 1802 2953 . + . ID=YDL248W;Name=YDL248W;gene=COS7 chrIV SGD CDS 1802 2953 . + 0 Parent=YDL248W;Name=YDL248W;gene=COS7 chrIV SGD gene 3762 3836 . + . ID=YDL247W-A;Name=YDL247W-A chrIV SGD CDS 3762 3836 . + 0 Parent=YDL247W-A;Name=YDL247W-A chrIV SGD gene 5985 7814 . + . ID=YDL247W;Name=YDL247W;gene=MPH2 chrIV SGD CDS 5985 7814 . + 0 Parent=YDL247W;Name=YDL247W;gene=MPH2 chrIV SGD gene 8683 9756 . - . ID=YDL246C;Name=YDL246C;gene=SOR2 chrIV SGD CDS 8683 9756 . - 0 Parent=YDL246C;Name=YDL246C;gene=SOR2 chrIV SGD gene 11657 13360 . - . ID=YDL245C;Name=YDL245C;gene=HXT15 chrIV SGD CDS 11657 13360 . - 0 Parent=YDL245C;Name=YDL245C;gene=HXT15 chrIV SGD ARS 15493 15739 . . . ID=ARS403;Name=ARS403;Alias=ARSIV-16 chrIV SGD gene 16204 17226 . + . ID=YDL244W;Name=YDL244W;gene=THI13 chrIV SGD CDS 16204 17226 . + 0 Parent=YDL244W;Name=YDL244W;gene=THI13 I start by using the `ContigSeqLoc` data type to represent a sequence location for the coding sequences (CDSes) in this annotation. Columns 4 and 5 give the bounds of the coding sequence and column 7 indicates the strand. The bounds are always given from lower to higher, even when the coding sequence is on the reverse complement strand. Conveniently, the `ContigLoc` data type represents sequence locations based on the left-most sequence position, along with the length and the strand. As above, this means that for reverse complement features the offset position corresponds to the end of the feature. > cdsContigs :: [(String, SeqLoc.ContigSeqLoc)] > cdsContigs = map mkContigSeqLoc [ ("YDL248W", ("chrIV", 1802, 2953, '+')) > , ("YDL247W-A", ("chrIV", 3762, 3836, '+')) > , ("YDL247W", ("chrIV", 5985, 7814, '+')) > , ("YDL246C", ("chrIV", 8683, 9756, '-')) > , ("YDL245C", ("chrIV", 11657, 13360, '-')) > , ("YDL244W", ("chrIV", 16204, 17226, '+')) > ] > where mkContigSeqLoc (name, (chrName, leftOff, rightOff, strChr)) > = (name, OnSeq (LBS.pack chrName) (CLoc.ContigLoc leftOff len str)) > where len = 1 + rightOff - leftOff > str | strChr == '+' = Fwd > | strChr == '-' = RevCompl > | otherwise = error $ "Unknown strand " ++ show strChr Discontinuous sequence locations ================================ Even in the relatively simple yeast genome, however, I quickly run into a problem with the genome annotations: chrIV SGD gene 65243 65766 . + . ID=YDL219W;Name=YDL219W;gene=DTD1 chrIV SGD CDS 65243 65307 . + 0 Parent=YDL219W;Name=YDL219W;gene=DTD1 chrIV SGD CDS 65379 65766 . + 1 Parent=YDL219W;Name=YDL219W;gene=DTD1 chrIV SGD intron 65308 65378 . + . Parent=YDL219W;Name=YDL219W;gene=DTD1 The *DTD1* gene is spliced, meaning that the coding sequence is comprised of two non-contiguous regions called exons. I use the `SeqLoc` data type to represent these concatenated sub-locations. It is based on the `Loc` data type from the `Bio.Location.Location` package, which is a list of the `ContigLoc` locations I used above. It isn't apparent in the case of *YDL219W*, but when a gene is encoded on the reverse complement strand the individual exons are listed from left to right on the chromosome. As reverse strand genes run from right to left, the exons must be reversed. > cdses :: [(String, SeqLoc.SeqLoc)] > cdses = map mkSeqLoc [ ("YDL248W", ("chrIV", [ ( 1802, 2953, '+') ])) > , ("YDL247W-A", ("chrIV", [ ( 3762, 3836, '+') ])) > , ("YDL247W", ("chrIV", [ ( 5985, 7814, '+') ])) > , ("YDL246C", ("chrIV", [ ( 8683, 9756, '-') ])) > , ("YDL245C", ("chrIV", [ ( 11657, 13360, '-') ])) > , ("YDL244W", ("chrIV", [ ( 16204, 17226, '+') ])) > , ("YDL219W", ("chrIV", [ ( 65243, 65307, '+'), (65379, 65766, '+') ])) > , ("YDL140C", ("chrIV", [ (205361, 210562, '-') ])) > , ("YDR172W", ("chrIV", [ (808321, 810378, '+') ])) > ] > where mkSeqLoc (name, (chrName, contigs)) > = (name, OnSeq (LBS.pack chrName) (Loc.Loc clocs)) > where clocList@(cloc1:_) = map mkCLoc contigs > clocs = case CLoc.strand cloc1 of > Fwd -> clocList > RevCompl -> reverse clocList > mkCLoc (leftOff, rightOff, strChr) > = CLoc.ContigLoc leftOff len str > where len = 1 + rightOff - leftOff > str | strChr == '+' = Fwd > | strChr == '-' = RevCompl > | otherwise = error $ "Unknown strand " ++ show strChr I augmented the small subset of chromosome IV genes with a few more from further along the chromosome. These will allow me to actually find some hits for my sequencing read alignments. Finding locations, naively ========================== I now actually write a function to find where the sequencing reads lie in a gene. Budding yeast has a relatively small proportion of directly overlapping CDses, but I still want to allow for multiple hits, or no hits, for a sequencing read. I will use the `isWithin` function from the `Bio.Location.SeqLocation` module. This tests whether a position lies within a location--the position must be on the same sequence (identical sequence name) and the same strand as the location as well as lying within the bounds of one of the contiguous blocks making up the location. > isHit :: SeqLoc.SeqPos -> (a, SeqLoc.SeqLoc) -> Bool > isHit pos (_, loc) = pos `SeqLoc.isWithin` loc > > posHitsNaive :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)] > posHitsNaive targetLocs queryPos = filter (isHit queryPos) targetLocs Finally, I want to determine where in the gene a read starts. It may be interesting if substantially more reads are found in one region of the gene than another. To do this, I'll use the `posInto` function. This function takes a location and a position on the same sequence and calculates a new position which corresponds to the original position, expressed relative to the coordinates of the location. That is, if the location runs from 100 to 200 and the position is 110, then the offset resulting from `posInto` will be 10. This function only computes a value when the position actually lies inside the location. I have already filtered for just the locations that do contain the position, so I will produce a runtime error if it fails. This requires a bit of plumbing to manage the errors. The `posInto` function has an inverse, the `posOutof` function. > hitOffset :: SeqLoc.SeqPos -> SeqLoc.SeqLoc -> Offset > hitOffset pos loc = either error Pos.offset into > where into = onSameSeq posIntoE pos loc > posIntoE p l = maybe outsideErr Right $ p `Loc.posInto` l > where outsideErr = Left "position outside hit location" I put these together with a function to produce human-readable output describing the location of the sequencing read relative to protein-coding genes. > posDescribeHits :: (SeqLoc.SeqPos -> [(String, SeqLoc.SeqLoc)]) -> SeqLoc.SeqPos -> String > posDescribeHits lookupHits queryPos > = SeqLoc.displaySeqPos queryPos ++ ": " ++ describeHits > where describeHits = case lookupHits queryPos of > [] -> "no CDS" > hits -> intercalate "; " . map describeHit $ hits > describeHit (cdsName, cdsSeqLoc) = cdsName ++ " at " ++ show (hitOffset queryPos cdsSeqLoc) Finally, I map the hit-describing function over each alignment. > describeAlignsNaive = mapM_ (putStrLn . posDescribeHits (posHitsNaive cdses)) aligns The `describeAlignsNaive` function can be run to produce the following results: chrXII@453753(-): no CDS chrXII@454229(-): no CDS chrIV@209479(-): YDL140C at 1083 chrXIV@742546(-): no CDS chrIV@809813(+): YDR172W at 1492 Finding locations ================= As suggested by its name, `posHitsNaive` is easy to write, but probably not the best way to look up which locations overlap a sequence position. The `Bio.Location.LocMap` and `Bio.Location.SeqLocMap` provide a much faster method of looking up positions. The `LocMap` data structure sorts locations into bins, with each location falling into one or more bins. A sequence position is then tested against only the set of locations overlapping its bin. The bin size is tuned to work well for yeast genes, with an average feature density of one per ~2kbp, and should work well for somewhat less dense annotation maps as well. The `LocMap` approach takes advantage of the fact that there is a great deal of structure in sequence location data, since each location is contained in a bounded range of sequence coordinates. In order to use the `SeqLocMap`, I flip around the association list, making the locations into the key. I then construct the lookup map and apply it to sequence positions. > posHits :: [(a, SeqLoc.SeqLoc)] -> SeqLoc.SeqPos -> [(a, SeqLoc.SeqLoc)] > posHits targetLocs = map swap . flip SLM.lookupWithin targetSLM > where swap (a,b) = (b,a) > targetSLM = SLM.fromList . map swap $ targetLocs I use the more sophisticated version of `posHits` with `posDescribeHits` and find the same result. There is an extensive set of QuickCheck properties ensuring that `LocMap` results match the naive approach, of course. > describeAligns = mapM_ (putStrLn . posDescribeHits (posHits cdses)) aligns chrXII@453753(-): no CDS chrXII@454229(-): no CDS chrIV@209479(-): YDL140C at 1083 chrXIV@742546(-): no CDS chrIV@809813(+): YDR172W at 1492 This provides a demonstration of the position and location interfaces. For practical use, of course, I need to read annotations from a GFF format file as well as parse Bowtie output files directly.