2nd (Next) Generation Sequencing 2/2/2018

1 2nd (Next) Generation Sequencing 2/2/2018

2 Why do we want to sequence a genome?
- To see the sequence (assembly)
- To validate an experiment (insert or knockout)
- To compare to another genome and find variations (cancer, populations)
The problem: we cannot sequence a genome from start to end in one pass. We need to shear the DNA into smaller fragments and sequence those smaller pieces. Sanger sequencing is slow and not high-throughput: it took 13 years for a human genome.

3 Next Generation Sequencing (NGS) produces millions to billions of short reads in just a few days, at lower cost. You can sequence your genome at 30X depth for USD. Platforms shown: Illumina HiSeq 2500, Roche 454, Ion Torrent.

4 Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)

5 Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona

6 Reads come from molecule fragments
- Read length is the same for an entire dataset (e.g. 101 bases long)
- Either single or paired-end reads
- Mate reads
- Physical coverage and depth
- Number of reads
- Duplicates (PCR or sequence)
- Dark matter (PCR cannot find repeats)
Figure labels: pair 1, fragment, pair 2
Lex Nederbragt, SeRC Nordic Assembly Workshop in Stockholm, Sweden, May 14th

7 Figure: chr22:11m-12m, with RepeatMasker and Gap tracks

8 Illumina sequencers can only sequence DNA fragments up to ~300 nt long
- DNA must be size-selected, usually by gel cut
- ~ nt band cut, purified, prepared for sequencing
- Fragment length follows a normal distribution around the target cut size

9 Each sequencing run generates a certain number of total reads
- Number of reads per sample ≈ total number of reads / number of samples
- Number of reads for one sample: library size
- You can choose a target library size for your instrument based on the desired depth and the desired coverage
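The per-sample arithmetic on this slide can be sketched in a few lines. This is only an illustration: it assumes the common approximation depth ≈ (number of reads × read length) / genome size, which the slide does not state, and the function names and example numbers (a 3.1 Gb genome, 150-base reads, 8 samples) are hypothetical.

```python
import math


def reads_per_sample(total_reads: int, n_samples: int) -> int:
    """Approximate library size when a run is split evenly across samples."""
    return total_reads // n_samples


def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth under the assumed model: sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_depth(target_depth: float, genome_size: int, read_length: int) -> int:
    """Smallest number of reads whose expected depth reaches target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example values, not taken from the slides.
    library_size = reads_per_sample(400_000_000, 8)
    print(library_size, expected_depth(library_size, 150, 3_100_000_000))
    print(reads_for_depth(30, 3_100_000_000, 150))
```

In practice some reads are duplicates or fail to map, so the usable depth is typically lower than this estimate.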

10 Figure: Single End vs. Paired End
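A minimal sketch of the difference between the two read layouts, assuming the usual Illumina paired-end orientation in which the second read is reported as the reverse complement of the far end of the fragment. The function names and the toy fragment are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def single_end_read(fragment: str, read_length: int) -> str:
    """Single end: only one end of the fragment is sequenced."""
    return fragment[:read_length]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Paired end: both ends are sequenced, each read pointing inward."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


if __name__ == "__main__":
    fragment = "AACCGGTTAC"  # toy fragment, far shorter than a real one
    print(single_end_read(fragment, 4))
    print(paired_end_reads(fragment, 4))
```

The unsequenced middle of the fragment is what makes physical coverage larger than read coverage for paired-end data.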

11 Question: given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur? E.g. chr3:2,358,092-2,358,193. More on this next lecture.
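The question can be made concrete with a naive exact-match search. This is a toy sketch only: real read mappers use an index of the reference and tolerate mismatches and indels, neither of which is attempted here. The function name is hypothetical, and positions are 0-based, unlike the 1-based coordinates in the example above.

```python
def find_exact_matches(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where read occurs exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping hits are also found.
        start = reference.find(read, start + 1)
    return positions


if __name__ == "__main__":
    reference = "ACGTACGTTTACG"
    print(find_exact_matches(reference, "ACG"))  # several hits: ambiguous read
    print(find_exact_matches(reference, "GGG"))  # no hit: unmapped read
```

A read with several hits illustrates why repeats are hard: the read alone cannot say which copy it came from.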

12 Figure labels: genome, locus, mapped (aligned) reads
Depth: number of sequenced bases that map to a given location
Coverage: fraction of the genomic locus covered by at least one read
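The two definitions can be sketched directly from a list of alignments. The sketch assumes each aligned read is given as a 0-based, half-open (start, end) interval on the locus; the function names and the example intervals are hypothetical.

```python
def per_base_depth(locus_length: int, alignments: list[tuple[int, int]]) -> list[int]:
    """Depth at each position: number of aligned reads overlapping it."""
    depth = [0] * locus_length
    for start, end in alignments:
        # Clip each alignment to the locus before counting.
        for pos in range(max(start, 0), min(end, locus_length)):
            depth[pos] += 1
    return depth


def coverage_fraction(depth: list[int]) -> float:
    """Coverage: fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


def mean_depth(depth: list[int]) -> float:
    """Average depth over the whole locus."""
    return sum(depth) / len(depth)


if __name__ == "__main__":
    depth = per_base_depth(10, [(0, 4), (2, 6), (8, 10)])
    print(depth, coverage_fraction(depth), mean_depth(depth))
```

The example shows that the two numbers are independent: a locus can have a mean depth of 1 and still leave positions uncovered.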

13 Illumina is now the most common sequencing platform. Its error is uniformly distributed (~0.1%) and consists only of substitutions (no indels). Older Illumina machines showed a drop in quality towards the end of the read.

14 Fragment (insert) size follows a truncated normal distribution. Sequencing depth is defined by the number of fragments covering a base pair of the DNA, not the number of reads; use "read depth" to refer to the latter. Physical coverage is the amount of the genome expected to be covered. However, "coverage" is usually used to mean depth! Coverage follows a Poisson (or, when overdispersed, negative binomial) distribution with lambda = physical depth. Read length is a fixed number for Illumina reads. Error is usually higher toward the ends of a read, hence trimming.
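Under the Poisson model named on this slide, the fraction of the genome covered at least once follows directly from lambda. The sketch below assumes the simple Poisson case only (no overdispersion), and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a position is covered exactly k times."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fraction_covered(lam: float, min_depth: int = 1) -> float:
    """Expected fraction of positions with depth >= min_depth."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(min_depth))


if __name__ == "__main__":
    for lam in (1, 5, 30):
        print(lam, fraction_covered(lam), fraction_covered(lam, min_depth=10))
```

With lambda = 1 the expected covered fraction is 1 - e^(-1), so more than a third of positions are missed; this is one reason target depths are set well above 1X.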

15 Figure: good coverage vs. bad coverage

16 - The machines output files containing short reads in FASTQ format. For each read there are 4 lines:
@read_header comment
Read_sequence
+[read_header]
Quality_string (in ASCII)
- Scores estimate the probability that a base is called incorrectly. Q30 means 99.9% accuracy.
- Reads are short, so we need a reference sequence to resolve where they come from (resequencing).
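The relation between a quality score and the error probability can be sketched as follows. The sketch uses the standard Phred definition Q = -10 log10(p) and assumes the Phred+33 ASCII encoding (offset 33); the slide only says "in ASCII", so the offset is an assumption, and older Illumina pipelines used a different one. Function names are hypothetical.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Convert an ASCII quality string to Phred scores (assumed offset 33)."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    print(phred_to_error_prob(30))  # Q30: 1 error in 1000, i.e. 99.9% accuracy
    for q in decode_quality("#<BF"):
        print(q, phred_to_error_prob(q))
```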

17 "@": start of a new read. The same two-read FASTQ example is shown on slides 17-24, with one field annotated per slide:

@SRR 1 length=125
NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCATTTGAACTCAGTTTGAACCTGTCTCTTATACACATCT
+SRR length=125
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR 2 length=125
NTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTATCTTTCTTCTACTCACTTGCTTCTCAATTGAAAGAGCGGAAA
+SRR length=125

18 Unique read ID

19 Comments, separated from the read ID by a space

20 Sequence of the read

21 "+": start of the quality section

22 Repeat of the read header and comment, not required

23 Quality sequence of the read, in ASCII

24 "@": start of the next read
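The field-by-field walk-through can be turned into a small parser. This is a minimal sketch that assumes strict four-line records (sequence and quality not wrapped over several lines); the `FastqRecord` type, the `parse_fastq` function and the toy records are hypothetical, and for real data an established parser is usually the safer choice.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # unique read ID, without the leading "@"
    comment: str   # everything after the first space in the header
    sequence: str
    quality: str   # ASCII-encoded quality string


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        read_id, _, comment = header[1:].partition(" ")
        yield FastqRecord(read_id, comment, sequence, quality)


if __name__ == "__main__":
    toy = "@r1 length=4\nACGT\n+\nFFFF\n@r2\nNTTG\n+r2\n#<<B\n"
    for record in parse_fastq(io.StringIO(toy)):
        print(record)
```

The repeated header after "+" is accepted but ignored, which matches its optional status in the format.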

25 Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC Genomics 13.1 (2012).

26 NCBI
Illumina BaseSpace
Google Genomics cloud
Genome In A Bottle (GIAB)
Repositive
GDC
Seven Bridges

32 ART: WGS simulator
WGSIM: WGS simulator
PBSIM: PacBio simulator
See more on OMICtools.
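To show what a read simulator does in principle, here is a toy single-end sketch. It is not how ART, WGSIM or PBSIM work internally; it only combines ideas from earlier slides under stated assumptions: fragment lengths drawn from a truncated normal distribution, a fixed read length, and uniform substitution errors with no indels. All names and parameter values are hypothetical.

```python
import random


def simulate_reads(genome: str, n_reads: int, read_length: int,
                   mean_fragment: float, sd_fragment: float,
                   error_rate: float = 0.001, seed: int = 0) -> list[tuple[int, str]]:
    """Return (fragment_start, read) pairs sampled from genome."""
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    while len(reads) < n_reads:
        frag_len = round(rng.gauss(mean_fragment, sd_fragment))
        # Truncate the normal distribution: redraw impossible fragment lengths.
        if frag_len < read_length or frag_len > len(genome):
            continue
        start = rng.randrange(len(genome) - frag_len + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitution error: replace with one of the other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    toy_genome = "ACGT" * 50
    for start, read in simulate_reads(toy_genome, 5, 10, 50, 5):
        print(start, read)
```

Because the true origin of every read is known, simulated data like this is useful for checking a mapper or a variant caller before running it on real reads.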