Next-Generation Genome Sequencing

1 Next-Generation Genome Sequencing Jarkko Salojärvi, D.Sc. (Tech) Department of Biosciences, Division of Plant Biology Department of Veterinary Biosciences, Veterinary Microbiology and Epidemiology University of Helsinki

2 Topics Today: Quick recap of the relevant technologies, SOLiD and Solexa/Illumina. Sequencing in practice. Properties of data. What the raw data looks like. Re-sequencing. Sequencing de novo. Some examples throughout the presentation...

3 Where it all began: Sanger sequencing. Developed in 1977. Read length up to bases. Current platforms allow 96 concurrent reads. Still needed for closing gaps and sequencing long repeated fragments. Frederick Sanger: twice Nobel laureate.

4 High-throughput sequencing technologies - summary

5 Throughput of different technologies. Most recent versions.
Platform: SOLiD 4 | Solexa HiSeq | 454 Titanium
Runtime: 5-10 days | 8 days | 10 h
Read length (bp):
Raw sequence (Gb/run):
Accuracy: >99.94% | >98.5% | 99% (at 400 bp)

6 Sequencing in Practice

7 What do you get from a sequencing service? Huge text files, usually with two things being reported: 1. Base calls for each read. 2. Read qualities. In SOLiD these are in two separate files, .csfasta and .qv.qual. In Solexa, both are reported in the same .fastq file. At Biocenter Viikki: SOLiD 4 and 454 Titanium. At FIMM Meilahti: Solexa. Commercial services available, price thousand(s)/run.

8 Assembly task Given: A text file with lots of short reads, nucleotide sequences. Task: Align these, either with respect to each other (de novo) or a reference genome (re-sequencing). Essential: Coverage: Number of overlapping reads. Depth: Number of reads on a single nucleotide. Composition of reads (fragments/mate-pair) Is there a reference sequence of the organism?
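The depth definition above can be illustrated with a minimal sketch. It assumes the reads have already been placed on a toy reference as (start, length) pairs; the function name and the toy data are hypothetical and not part of the original slides.

```python
def per_base_depth(ref_length, reads):
    """Count how many reads cover each reference position.

    reads: iterable of (start, length) pairs with 0-based starts
    (a hypothetical toy input, not a real alignment format).
    """
    depth = [0] * ref_length
    for start, length in reads:
        # Clip each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    return depth


toy_reads = [(0, 4), (2, 4), (3, 4)]
depth = per_base_depth(10, toy_reads)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

Positions with depth 0 are exactly the gaps that a reference genome can bridge in re-sequencing but that break contigs in de novo assembly.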

9 Fragments vs. mate-pairs. 1. Individual fragments. 2. Paired reads. Mate-pair: genomic DNA is fragmented and size-selected inserts are circularized and linked by means of an internal adaptor. Paired end: fragmentation of genomic DNA into short segments, followed by sequencing of both ends of the segment (but not the part in between). End result is the same: you know reads from both ends, plus the average distance between reads. A large fraction of short reads are difficult to map uniquely to the genome, and the second read of a pair can be used to find the correct location.

10 Mate-pair reads [Korbel et al. 07]. Figure: workflow starting from human genomic DNA: i) shearing and size selection; ii) protection and adapter ligation; iii) circularization; iv) random cleavage; v) linker(+) read isolation; vi) sequencing of >30 million paired ends with 454 technology; vii) computational analysis and mapping of structural variants (SVs), based on the distribution of the span of paired ends (i.e. distance between mapped ends, in bp) with cutoff I and cutoff D. Result: a snapshot of the full genome every ~3 kbp. Can be used to align contigs from a standard sequencing run.

11 Example: paired end mapping to reveal structural variation (SV) in human genome [Korbel et al. 07]. Figure: each end from the individual (sample) sequence is placed at its best position in the human reference genome, and the span of the paired ends in the reference is compared with the cutoffs. No SV: normally mapped ends. Deletion (region deleted from sample genome): end distance > cutoff D. Insertion, simple (region inserted in sample genome): end distance < cutoff I. Inversion breakpoint (region inverted in sample genome): altered end orientation, i.e. an end maps in inverted orientation relative to the original (sample) locus. Insertion, mated or unmated: insertion of sequence from a distant locus.
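The span-based logic in this figure can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name, the boolean orientation flag, and the cutoff values in the usage lines are hypothetical, and the sketch ignores the distant-locus insertion cases.

```python
def classify_pair(span, expected_orientation, cutoff_i, cutoff_d):
    """Classify one mapped read pair by its span on the reference.

    span: distance between the mapped ends on the reference (bp).
    expected_orientation: False if one end maps in inverted orientation.
    cutoff_i / cutoff_d: lower and upper span cutoffs (hypothetical values).
    """
    if not expected_orientation:
        return "inversion breakpoint"
    if span > cutoff_d:
        # Ends are further apart on the reference than in the sample:
        # the sample lacks sequence that the reference has.
        return "deletion"
    if span < cutoff_i:
        # Ends are closer on the reference than in the sample:
        # the sample carries extra sequence.
        return "insertion"
    return "no SV"


# Hypothetical cutoffs around a ~3 kbp insert size.
calls = [
    classify_pair(3000, True, 2000, 4000),
    classify_pair(9000, True, 2000, 4000),
    classify_pair(800, True, 2000, 4000),
    classify_pair(3000, False, 2000, 4000),
]
print(calls)
```

In practice several independent pairs supporting the same event would be required before calling an SV; a single pair is shown here only to keep the sketch short.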

12 De novo vs. re-sequencing of genome. In de novo, reads are assembled into contigs: contiguous sequences of DNA created by assembling overlapping sequenced fragments of a chromosome. Reference assembly = re-sequencing. If the genome/template is known: reference assembly. A gap in sequence coverage is not fatal, because the reference genome tells that the sequences are from the same contig or genomic region, so this tolerates short read lengths. If the genome/template is unknown: de novo assembly. The same original contig (for example an mRNA) may be split into multiple too-short contigs (contig 1, contig 2); longer reads provide more overlap for connecting individual reads.

13 SOLiD raw data

14 SOLiD probes. Probes are designed for reading two nucleotides at a time. Four different colors. The resulting sequence is in color space... but also CG, GC and TA are red?!

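15 SOLiD sequence decoding. Key to decoding: the known last base of the adapter oligo sequence. Example with known base 0 = A: transition 0-1 (candidates CA AC GT TG) gives C, transition 1-2 (candidates CA AC GT TG) gives A, transition 2-3 (candidates AA CC GG TT) gives A. Decoded: A C A A. Petri Auvinen, DNA Sequencing and Genomi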

16 SOLiD raw data files: raw reads in .csfasta. SOLiD generally gives out two files: the reads in color space (.csfasta) and the read qualities (.qv.qual). <filename>.csfasta overall format: the last base of the adapter oligo sequence, followed by the color space presented in numbers (color code table: 1st nucleotide vs. 2nd nucleotide, each A C G T). Each record is a >TAG_ID line followed by a color space line. Example: >1_88_1830_R3 G >1_89_1562_R3 G
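A minimal sketch of decoding one color-space read, assuming the standard SOLiD two-base code in which a color is the XOR of the 2-bit codes A=0, C=1, G=2, T=3. The function name is hypothetical, and the sketch assumes the read contains only the digits 0-3 after the leading base (no missing calls).

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Decode a color-space read such as 'A110' into bases.

    The first character is the known last base of the adapter; every
    following digit is the color of one dinucleotide transition.
    The returned string still starts with the adapter base, so the
    actual read is result[1:].
    """
    bases = [read[0]]
    for color in read[1:]:
        previous = BASES.index(bases[-1])
        # The next base is the one whose code XORs with the previous
        # base's code to give this color.
        bases.append(BASES[previous ^ int(color)])
    return "".join(bases)


decoded = decode_colorspace("A110")
print(decoded)
```

The example input reproduces the decoding shown on the previous slide (known base A, colors 1, 1, 0).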

17 SOLiD raw data files - qualities in QV.qual. Quality values are in <filename>.qv.qual: a phred-like score for each read, q = -10*log10(p), where p is the error probability. Format: a >TAG_ID line followed by the quality values. Example: >97_2040_1850_F >97_2040_1898_F
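The score formula above can be checked with a short sketch; the two helper names are hypothetical.

```python
import math


def phred_from_error_prob(p):
    """Phred-like score q = -10 * log10(p) for error probability p."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse mapping: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_from_error_prob(p), 1))
```

So q = 10 means one call in ten is expected to be wrong, q = 20 one in a hundred, and q = 30 one in a thousand.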

18 Benefit: Complementation in color space. One benefit of color space is that it is self-complementing: a sequence and its complement have the same color space representation.
Color code (1st nucleotide in rows, 2nd nucleotide in columns):
    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0
Sequence, base space: G T A G C T C G T C G T G C A G
Complemented, base space: C A T C G A G C A G C A C G T C
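The self-complementing property can be verified with a short sketch, assuming the standard SOLiD code in which a color is the XOR of the 2-bit base codes. The helper name `to_colors` is hypothetical; the sequence is the one from the slide.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_colors(seq):
    """Encode a base sequence as the list of its dinucleotide colors."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


seq = "GTAGCTCGTCGTGCAG"
comp = seq.translate(COMPLEMENT)

print(comp)
print(to_colors(seq))
print(to_colors(comp))
```

Complementing a base flips both bits of its code, and flipping both bits on each side of an XOR leaves the result unchanged, which is why the two color rows are identical.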

19 Downside: one incorrect color call can screw up the rest of the read in decoding. Correct colors: 0-1 (CA AC GT TG), 1-2 (CA AC GT TG), 2-3 (AA CC GG TT) decode to A C A A. With an error at 1-2 (TA GC CG AT instead): A C G G. In color space there is still only one error -> alignment MUST be done in color space!
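The error propagation can be reproduced with a minimal sketch that decodes the slide's two color strings and counts differences in both spaces. It assumes the XOR form of the two-base code and Python 3.8 or later (for the `initial` argument of `itertools.accumulate`); the names are hypothetical.

```python
from itertools import accumulate

BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode(first_base, colors):
    """Decode a list of colors, starting from the known adapter base."""
    codes = accumulate(colors, lambda prev, c: prev ^ c,
                       initial=BASES.index(first_base))
    return "".join(BASES[code] for code in codes)


correct_colors = [1, 1, 0]
erroneous_colors = [1, 3, 0]  # one wrong color call at transition 1-2

correct = decode("A", correct_colors)
erroneous = decode("A", erroneous_colors)

color_errors = sum(a != b for a, b in zip(correct_colors, erroneous_colors))
base_errors = sum(a != b for a, b in zip(correct, erroneous))
print(correct, erroneous, color_errors, base_errors)
```

Every base after the wrong color is decoded relative to an already wrong predecessor, so in base space the error persists to the end of the read, while in color space it stays a single mismatch.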

20 Solexa raw data

21 Solexa pipeline

22 Solexa output in fastq file. Solexa raw data comes in one text file, with default naming by flowcell lane and read direction, for example: s_7_1_sequence.txt. Four lines per read: 1. @sequence identifier. 2. Raw sequence letters (A, T, C, G, N). 3. +same sequence identifier. 4. Read quality codes, Phred-like.

23 Fastq quality scores. Quality scores are reported as ASCII characters, one character per base, which saves disk space.
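A minimal sketch of reading the four-line records and turning the ASCII quality characters back into numbers. The ASCII offset is an assumption that depends on the pipeline version: 33 is the Sanger convention, while Illumina pipelines of this period commonly used 64, so check the documentation of your data. The function name and the toy record are hypothetical.

```python
import io


def parse_fastq(handle, offset=33):
    """Yield (identifier, sequence, qualities) from a four-line FASTQ stream.

    offset: ASCII offset of the quality encoding (assumed; 33 or 64).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line repeats the identifier; skipped
        quality_line = handle.readline().rstrip("\n")
        qualities = [ord(ch) - offset for ch in quality_line]
        yield header[1:], sequence, qualities


toy_file = io.StringIO("@read1\nACGTN\n+read1\nIIII#\n")
records = list(parse_fastq(toy_file))
print(records)
```

Storing each score as one character instead of a two-digit number plus separator is what saves the disk space mentioned on the slide.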

24 Re-sequencing

25 Requirements. All alignment programs are designed for unix/linux platforms. Windows too slow. Written in C, some parts in Perl. Need a lot of memory: for human-sized genomes, at least 8 GB of RAM. Need a lot of disk space: data files are now ~5 GB. Runs take from tens of minutes to hours to complete. An account at CSC or some other computation facility is recommended. No graphical user interfaces. Command line interface example:
>assemble.pl reads_in.csfasta read_qualities_f3_qv.qual ref_file TAIR9_chr.fas -ref_type nt -NO_CORRECTION

26 Re-sequencing pipeline Proceeds in a similar manner for all platforms: 1. Create an index to be used for searching the reference genome. 2. Using the index, align reads to reference. 3. Form a consensus sequence. 4. Identify SNPs etc.

27 Short read alignment. Because of the huge amount of data, BLAST is too slow, and faster alignment methods have been developed. Faster methods use shortcuts based on indexing, where you search only a small part of the sequences: hashing-based indexing and the Burrows-Wheeler transform. Progress is rapid: methods published 2 years ago are now old.

28 Hashing-based aligners. First generation of read aligners. Extend the idea of BLAST. Indexing: divide reads of length L into bins based on their first n nucleotides; n is roughly 20. Alignment: for each position p in the reference genome: 1. Reference sequence = pick the next L nucleotides. 2. Find the appropriate bin. 3. Match the remaining reference sequence to the reads in the bin. Software: no gaps: Eland, Maq. Gaps allowed: Elandv2, SOAP, GenomeMapper (part of SHORE).
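The indexing and alignment steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Eland or Maq are implemented: all reads have the same length, only the forward strand is searched, gaps are not allowed, and a mismatch inside the first n bases is simply missed. The function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def index_reads(reads, n):
    """Bin read indices by the first n nucleotides of each read."""
    bins = defaultdict(list)
    for read_id, seq in enumerate(reads):
        bins[seq[:n]].append(read_id)
    return bins


def align(reference, reads, n, max_mismatches=0):
    """Scan the reference and report (read_id, position, mismatches)."""
    bins = index_reads(reads, n)
    read_length = len(reads[0])  # assumes equal-length reads
    hits = []
    for p in range(len(reference) - read_length + 1):
        window = reference[p:p + read_length]
        # Only reads whose first n bases equal the window prefix are compared.
        for read_id in bins.get(window[:n], []):
            rest = reads[read_id][n:]
            mismatches = sum(a != b for a, b in zip(window[n:], rest))
            if mismatches <= max_mismatches:
                hits.append((read_id, p, mismatches))
    return hits


reference = "ACGTACGTTAGC"
reads = ["CGTACG", "GTTAGC", "TACGAT"]
hits = align(reference, reads, n=3, max_mismatches=1)
print(hits)
```

The speed-up comes from the bin lookup: at each reference position only the handful of reads sharing the n-base prefix are compared, instead of all reads.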

29 Burrows-Wheeler transform. Next generation of sequence aligners. Searches for sequence matches using a prefix trie. Results in a fast read alignment method. Requires less memory. Small gaps allowed. Software: Bowtie (no gaps), BWA, SOAP2, 2-way BWT. Example (figure): reference ^GOOGOL; task: find a match to LOL, given at most one mismatch.
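The GOOGOL example can be reproduced with a minimal sketch of a BWT index and backward search with a mismatch budget. This is a teaching sketch under stated assumptions, not the implementation used by Bowtie or BWA: the suffix array is built by naive sorting, a `$` terminator is appended (the slide's figure uses `^` as a start marker instead), and gaps are not handled. All names are hypothetical.

```python
def build_index(text):
    """Build suffix array, BWT, and the count tables for backward search."""
    text += "$"  # terminator, sorts before all letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, first, occ


def search(pattern, index, max_mismatches=0):
    """Return sorted start positions matching with at most max_mismatches."""
    sa, bwt, first, occ = index
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return  # no suffix matches this branch
        if i < 0:
            hits.update(sa[k] for k in range(lo, hi))
            return
        for c in first:
            if c == "$":
                continue
            cost = 0 if c == pattern[i] else 1
            if cost <= budget:
                extend(i - 1, first[c] + occ[c][lo],
                       first[c] + occ[c][hi], budget - cost)

    extend(len(pattern) - 1, 0, len(bwt), max_mismatches)
    return sorted(hits)


index = build_index("GOOGOL")
exact = search("LOL", index)
one_mismatch = search("LOL", index, max_mismatches=1)
print(index[1], exact, one_mismatch)
```

With no mismatches LOL does not occur in GOOGOL; with one mismatch it matches GOL at 0-based position 3.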

30 SHort Read Mapping Package = SHRiMP. k-mer hashing step + very efficient implementation of the Smith-Waterman algorithm. Can be used for letter space and color space reads. Slower than the others, but gives the optimal local alignment.

31 Performance comparison Homer N, Merriman B, Nelson SF (2009) BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLoS ONE 4(11): e7767. doi: /journal.pone

32 Re-sequencing in colorspace Colorspace has its own pros and cons, which can be taken into account in sequence alignment. Translation into nucleotides as the last step after alignment! Use software that supports colorspace. In practice, this is just an option you give to the alignment program.

33 Using read qualities in alignment? Not all programs use them!! (check the manuals) Most new methods use read qualities. Out of the old ones: Maq. Let's look at how the assembly goes with Maq... It has been used a lot in early papers and is a benchmark for new methods regarding speed and alignment.

34 Maq - workflow
maq fasta2bfa ref.fasta ref.bfa
  Convert the reference sequences to the binary fasta format.
maq fastq2bfq reads.fastq reads-1.bfq
  Convert the reads to the binary fastq format.
maq match reads-1.map ref.bfa reads-1.bfq
  Align the reads to the reference.
maq mapcheck ref.bfa reads-1.map >mapcheck.txt
  Statistics from the alignment.
maq assemble consensus.cns ref.bfa reads-1.map 2>assemble.log
  Build the mapping assembly.
maq cns2fq consensus.cns >cns.fq
  Extract consensus sequences and qualities.
maq cns2snp consensus.cns >cns.snp
  Extract list of SNPs.

35 Maq - workflow

36 Aligning reads - analysing results Software for visualization: Maqviewer, SHOREmap, IGV, Tablet Visualization is VERY important! See the real quality of the data, alignment, coverage etc. Helps to identify errors Helps to evaluate SNP calls, identify gaps etc. There are wings, a propeller, and a pilot - it must be...

37 Maqviewer Only basic functionality.

38 Tablet viewer Coded in JAVA Graphical interface May be slow

39 SNP calling. After aligning the short reads to the reference genome, identify nucleotides that differ from the reference. Make a consensus sequence of the reads. Simplest: choose the most common base. Better: use quality values in the voting. (Figure: reference, consensus, individual reads, SNP.)
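The two voting schemes can be compared on a single position with a minimal sketch. Summing phred scores per base is a simple heuristic chosen here for illustration; it is not the statistical model that Maq or SOAPsnp use. The function name and the toy pileup are hypothetical.

```python
from collections import defaultdict


def call_base(pileup, use_quality=True):
    """Pick the consensus base at one position.

    pileup: list of (base, phred_quality) pairs from the reads covering
    the position. With use_quality=False every read gets one vote;
    otherwise each vote is weighted by its quality score.
    """
    votes = defaultdict(float)
    for base, quality in pileup:
        votes[base] += quality if use_quality else 1
    return max(votes, key=votes.get)


reference_base = "A"
# Three low-quality A calls against two high-quality G calls.
pileup = [("A", 10), ("A", 10), ("A", 8), ("G", 40), ("G", 35)]

majority = call_base(pileup, use_quality=False)
weighted = call_base(pileup, use_quality=True)
is_snp = weighted != reference_base
print(majority, weighted, is_snp)
```

Here the plain majority agrees with the reference, while the quality-weighted vote calls a SNP, which is the kind of disagreement worth inspecting in a viewer.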

40 SNP calling. Software for SNP detection: SOAPsnp, Maq, probhd, SHOREmap, MUMmer. Maq computes a phred-type score for the SNPs. SNPs are hard to define; usually some thresholds are given based on depth and the number of differing nucleotides.

41 Example: Sequencing of A.thaliana genome, mutations induced by EMS Two different mutants sequenced with two platforms: SOLiD + Maq: SNPs Solexa + Maq: SNPs Roughly 1 SNP per every 10,000 bases.

42 Further analysis. Which SNP is responsible for the mutant phenotype? Usually some window of the genome is known. To identify the SNP, need to combine SNP locations and genome annotation: Is the SNP in a coding sequence, exon, intron, promoter, 3'/5' UTR, junk? Is the SNP disruptive? Does translation result in a stop codon or an altered amino acid sequence? Each splice variant can be different, and the variants are not known. Location in the protein? If done properly, would require protein structure prediction (very hard).

43 Sequencing de novo

44 How long do reads need to be for de novo genome assembly? Key problem: repeats in the genome that are longer than the read length. Theoretical analysis: E. coli: 30 bp read length, 75% of the genome is covered with contigs >10,000 bp. C. elegans: 50 bp read length, 51% of the genome is covered with contigs >10,000 bp. Human: 50 bp read length, ~15% is covered with contigs >10,000 bp (chromosome 1). Re-sequencing and de novo sequencing of the majority of a bacterial genome is theoretically possible with read lengths of bp. With longer genomes significant proportions are left uncovered.

45 Percentage of the E. coli genome covered by contigs greater than a threshold length, as a function of read length. Whiteford, N. et al. (2005) Nucl. Acids Res. 33:e171; doi: /nar/gni170

46 (Figure: the same analysis for C. elegans and human, as a function of read length l (nt).) Use paired-end mapping to connect the longer contigs.

47 Genome assembly using paired end reads Figure 1: An illustration of the Paired End assembly process. Paired End reads are used to order and orient the contigs derived from the Newbler assembly. The large blue lines represent contigs generated from the whole genome shotgun sequencing and assembly. The multiple blue and grey lines represent Paired End information. The blue segments represent the two 20 nucleotide regions that were sequenced while the dotted grey line represents the distance between those two sequenced regions. [454 Technical note 1]

48 Software for de novo genome assembly. Software: ALLPATHS, Edena, ABySS, Velvet and SOAPdenovo. Based on de Bruijn graphs. Fujimoto et al.: Illumina/Solexa sequencing of human genome, 200-bp insert libraries, 12 runs, 51-76 nt read length, 40x coverage. De novo assembly of unmapped reads, comparison of contigs: ABySS 6,535 (violet), SOAPdenovo 4,826 (yellow), Velvet 6,617 (green). Contigs that were aligned with more than 90% identity were considered shared contigs. Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42,

49 Velvet assembly - de Bruijn graphs. Split reads into k-mers. Align all k-mers in the reads (here 5-mers). de Bruijn graph: each node represents a series of overlapping k-mers. Final nucleotides make up the sequence of the node. The last k-mer of an arc's origin overlaps with the first of its destination. Reads are mapped as paths through the graph. D. R. Zerbino and E. Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
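A minimal sketch of the idea, using the textbook form of the de Bruijn graph in which nodes are (k-1)-mers and every k-mer in a read contributes one edge. This is a simplification and an assumption on my part: Velvet's nodes represent series of overlapping k-mers and the real algorithm also handles reverse complements, errors and repeats. The function names and toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Spell a contig by following edges while the next step is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:
            break  # stop on a cycle instead of looping forever
        contig += successor[-1]
        node = successor
        seen.add(successor)
    return contig


# Three overlapping toy reads from the sequence ATGGCGTGCA, split into 5-mers.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=5)
contig = walk(graph, "ATGG")
print(contig)
```

A repeat longer than k-1 would give some node two successors, and the walk would stop there, which is how repeats fragment an assembly into many short contigs.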

50 Example: de novo fragment assembly of one SOLiD run of A. thaliana. Data: 68,073,401 reads. Read length 50 nt. 12x coverage. Assembly: at least 3x coverage: 29 M reads. Number of contigs: 295,203. Median contig length: 276 nt. Longest contig: 3399 nt. Shortest contig: 82 nt. Sum of contig lengths: 98,457,405 nt, ~62.7% of the genome. (Figure: contig length (nt) by index.)

51 SNP analysis of de novo vs. re-sequencing. The sequenced genome was A. thaliana, Cvi ecotype. De novo, Velvet + MUMmer: 2,371,409 SNPs. Re-sequencing, TAIR9 + Maq: 183,811 SNPs. Published SNP list: 810,205 SNPs. Warning: indices of the published list of Cvi SNPs do not match the TAIR9 reference genome!! The consensus sequence of Cvi is not published ("released").

52 Example: Sequencing of the giant panda genome Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,

53 Genome assembly using paired end reads v2. Several libraries with different insert lengths. Can use the same sequencing technology for the whole assembly. Strategy: 1. Join reads with short insert lengths into contigs. 2. Make these into scaffolds by mapping unpaired ends to other contigs. Scaffold = set of contigs with spaces between. 3. Use longer insert libraries for arranging contigs.

54 Sequencing setup for Giant Panda. 37 paired-end insert libraries with insert sizes of 150 bp, 500 bp, 2 kb, 5 kb and 10 kb. Illumina Genome Analyser platform. 176 Gb of usable sequence, 73x coverage. Average read length of 52 bp.

55 Summary of Assembly Final contig size 2.24 Gb Estimated genome size 2.40 Gb.
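The panda figures on the last two slides are consistent with each other, as a few lines of arithmetic show. The sketch uses only the numbers quoted above; the variable names are hypothetical and the read count is a rough estimate derived from the average read length.

```python
usable_bases = 176e9      # 176 Gb of usable sequence
genome_size = 2.40e9      # estimated genome size
contig_size = 2.24e9      # final contig size
mean_read_length = 52     # average read length in bp

# Coverage = total sequenced bases / genome size.
coverage = usable_bases / genome_size
# Share of the estimated genome represented in contigs.
fraction_in_contigs = contig_size / genome_size
# Rough number of reads implied by the average read length.
approx_reads = usable_bases / mean_read_length

print(f"coverage ~{coverage:.1f}x")
print(f"contigs cover ~{fraction_in_contigs:.1%} of the genome")
print(f"~{approx_reads / 1e9:.1f} billion reads")
```

The computed coverage rounds to the 73x quoted on the slide.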

56 Genome annotation de novo. Gene finding: align known genes of model species against the new genome. Hidden Markov Model-based prediction of genes: Genscan, Augustus, HMMgene. Gene annotation: the function of the known genes that can be aligned to the new genome gives some clue. Gene orthologues: InParanoid, Multiparanoid.

57 Structure of the umami receptor T1R1 gene Heterodimer T1R1/T1R3 may be the sole receptor for umami taste. Umami: detection of the carboxylate anion of glutamic acid, a naturally occurring amino acid common in meats, cheese, broth, stock and other protein-heavy foods. In panda T1R1 is a pseudogene. Recent mutation, may explain the diet?

58 Example application: nucleosome positioning

59 Chromatin structure Chromatin=combination of DNA, RNA, and protein that makes up chromosomes. Functions: Package DNA, strengthen the DNA to allow mitosis and meiosis Serves as a mechanism to control expression and DNA replication. Changes in chromatin structure are affected by chemical modifications of histone proteins such as methylation (DNA and proteins) and acetylation (proteins), and by non-histone, DNA-binding proteins.

60 Predicting nucleosome positions Separate DNA into nucleosome vs. linker DNA parts. Sequence these with 454. Nucleosome ~146 bp, linker DNA ~ bp. Construct a model to predict nucleosome positions. [Field et al. 08]

61 Computational model. Nucleosomes: estimate a (position-specific) di-nucleotide model P_N over all nucleotide sequences. Linker DNA: estimate a 5-mer model P_L for linker DNA vs. nucleosome.
$$\mathrm{Score}(S) = \log\frac{P_N(S)}{P_L(S)} = \log\frac{P_{N,1}(S[1])\,\prod_{i=2}^{147} P_{N,i}(S[i]\mid S[i-1])}{P_L(S[1])\,\prod_{i=2}^{147} P_L(S[i]\mid S[\max(1,i-4)],\ldots,S[i-1])}$$
Estimate the score for the whole DNA, taking into account all legal configurations of nucleosome positioning. Normalize to get probabilities:
$$P(W_c[S]) = \frac{W_c[S]}{\sum_{c' \in C} W_{c'}[S]}$$

62 Result. Nucleosome localization can be predicted from DNA sequence. Two different types of regulation by chromatin in yeast promoters: Nucleosome-depleted areas: genes showing relatively low cell-to-cell expression variability, or transcriptional noise. Nucleosome-rich areas: transcription factors need to compete with nucleosomes for access to the DNA => variability in gene expression.

63 Further uses for high-throughput sequencing? Cataloging sequences and their variation: between individuals and species; SNPs, quantitative trait loci; copy number variations; mutations and genome rearrangements; metagenomics; evolution at an individual level; phylogeny. Epigenetics: DNA methylation (using ChIP-seq); chromatin structure. Transcriptome: digital gene expression; ChIP-seq; splice variants; microRNA; cell-specific gene expression.

64 What can high-throughput sequencing do for you? [Kahvejian et al. 08]

65 References
Li, H., Homer, N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5):
Fujimoto et al. (2010) Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nature Genetics 42,
Li et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463,
Magi, A. et al. (2010) Bioinformatics for Next Generation Sequencing Data. Genes 1:
Vera, J.C., Wheat, C.W., Fescemyer, H.W., Frilander, M.J., Crawford, D.L., Hanski, I., and Marden, J.H. (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17:
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., et al. (2008) Distinct Modes of Regulation by Chromatin Encoded through Nucleosome Positioning Signals. PLoS Comput Biol 4(11): e doi: / journal.pcbi
Kahvejian, A., Quackenbush, J., Thompson, J.F. (2008) What would you do if you could sequence everything? Nature Biotechnology 26(10):
Zerbino, D.R. and Birney, E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:
Korbel et al. (2007) Paired-end mapping reveals extensive structural variation in the human genome. Science 318:
Whole Genome Assembly using Paired End Reads in E. coli, B. licheniformis, and S. cerevisiae. 454 Application note 1,
Whiteford, N. et al. (2005) An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171.
