Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan

Size: px
Start display at page:

Download "Next- gen sequencing. STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan"

Transcription

1 Next- gen sequencing STAMPS 2015 Hilary G. Morrison Joe Vineis, Nora Downey, Be>e Hecox- Lea, Kim Finnegan

2 QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing done? What can I sequence on a next- gen plavorm? How do the data look?

3 QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing done? What can I sequence on a next- gen plavorm? How do the data look?

4 What is the difference between standard and next- gen sequencing? Sequence clone library or PCR products in well format Chromatograms enable quality review < 2Mb/day; $1/rxn

5 What is the difference between standard and next- gen sequencing? Clone- free Single molecule templates amplified for each reacion Difficult to evaluate raw data reagent cost

6 What is the difference between standard and next- gen sequencing?

7 QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing done? What can I sequence on a next- gen plavorm? How do the data look?

8 In general Template- guided extension of complementary sequencing primer All require isolaion of single library molecules This is followed by clonal amplificaion to generate sufficient signal for detecion Signal is captured during each cycle of sequencing (reagent flow) Signals are processed and converted to base calls

9 454 and Ion Torrent plavorms Sequencing cycles are 4 deoxynucleoides provided one at a Ime (e.g. flow cycle is T,A,C,G) 2 or more bases incorporated during a cycle if template contains a homopolymer and detected by increased signal strength light (454) or charge (Ion Torrent) means that read lengths are variable both plavorms prone to homopolymer errors

10 454 Sequencing Solid Phase Pyrosequencing ~800,000 reads ~500 bp in length 3 days prep Lme Overnight run Lme 4- mer 3- mer 2- mer 1- mer

11 Illumina sequencing by synthesis Sequencing cycles are 4 deoxynucleoides provided together in each cycle Only one base is incorporated in each cycle A or C = red signal; G or T = green signal Read lengths are constant and known homopolymer errors unlikely imaging errors common (Ns, loss of data)

12 Illumina SBS

13 Third Gen technologies Sequence in real Ime Sequence single molecules Long (>10kb) reads RelaIvely high error rate Useful for assembly scaffolding

14 PacBio Single molecule real- Ime (SMRT) sequencing up to 90 Mb per run Single molecule of DNA polymerase is anchored in each reacion well labeled dntps detected short, highly accurate read consensus from circularized SMRTbell templates or, long, less accurate reads, up to an average of 3000 bp

15 Oxford Nanopore Single strand sequencing Single molecule of DNA polymerase is translocated through arificial pore bound to membrane dntps disinguished by conducivity change

16 PacBio vs. Nanopore PacBio: Large iniial acquisiion cost, start up, training; purchase kits for each run Nanopore: PromethION benchtop system; 48 flow cells MinION and MinION Access Program; $1,000 fee Metrichor takes a biological sample and streams data to the cloud

17 Where is it going on? h>p://omicsmaps.com

18 QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing done? What can I sequence on a next- gen plasorm? How do the data look?

19 BPC/MBL projects Bacterial genomes, ~4-12 Mbp EukaryoIc (drat) genomes Environmental metagenomes Transcriptomes Metatranscriptomes Microbial community composiion (specific gene amplicons)

20 ConsideraIons for plavorm choice Input DNA requirement (mass, quality, MW) Library preparaion opions PlaVorm flexibility Read length, type Cost per base and per run Accuracy of data Availability of analysis tools (open source) Data storage needs Ability to reanalyze old data

21 TheoreIcal throughput Systems Specs: MiSeq MiSeq2 Next Seq H Next Seq M HiSeq 1000 Lanes / Flowcell GBase/Run: Gigabases/Flowcell Gigabases/Lane Single Reads/Run (M) ,500 Single Reads/Lane (M) Read Lengths 2 X X X X x 100 Run Time (Days) Gigabases/day

22 To make best use of available capacity, combine projects Separate samples using lanes or gaskets Trust that reads from each source will assemble into bins or conigs Map to a known reference when possible MulIplex samples with indices and/or in- line barcodes

23 QuesIons What is the difference between standard and next- gen sequencing? How is next- gen sequencing done? What can I sequence on a next- gen plavorm? How do the data look?

24 Files and quality informaion Raw data formats vary Most common formats are sff (binary; 454, IT) and fastq (text; Illumina). PacBio may provide consensus genome only. Fastq is versaile, many conversion tools available, qual scores embedded But core facility is unlikely to provide further quality metrics

25 Run quality assessment: clustering

26 Overclustering

27 Overclustering

28 Run quality scoring

29 Run quality scoring % in Q bin by cycle % >Q30 by cycle

30 Run quality assessment

31 Post- run quality assessment Reads returned to user have passed Illumina filters DemulIplexing may allow up to 2 index mismatches Ns are permi>ed Control these with bcl2fastq Use unix to do your own QC! (cat, wc, grep, cut) or write a script count reads count reads with Ns look for inline primer matches look at yield by flow cell posiion look at PhiX or other control data

32 NextSeq 500: quality scores * * Average library insert size < 300 bp

33 PhiX quality and accuracy Error rate: 0.7% NextSeq; 0.4% Hiseq

34 Off- instrument quality filtering Many methods and wrappers, including denoising, Galaxy, prinseq, etc. In house: 454 amplicon read filtering based on N, barcode/primer detecion, anchor sequences ( h>ps://github.com/meren/454- anchor- trimming) In house: Minoche filtering based on Evalua(on of genomic high- throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems, Minoche, Dohm and Himmelbauer, Genome Biology 2011, DOI: /gb implemented by A. Murat Eren (h>ps://github.com/meren/illumina- uils) In house: Read overlap merging, A Filtering Method to Generate High Quality Short Reads Using Illumina Paired- End Technology, Eren et al., 2013, DOI: /journal.pone Zero mismatches in perfect overlapping reads Mismatches lower than cutoff specified by user based on extent of overlap

35 Summary PlaVorms and capabiliies are evolving rapidly Sources of error are not always evident get as much informaion about the run as possible Design your experiment carefully insert size, coverage required, hybrid approaches Do your own post- run QC Take advantage of user forums, not just the company sites

36 Reality Check Be skepical of the hype. The key to ge ng the best results out of any sequencing technology is understanding how it works and and how and why it makes mistakes. - Joe Vineis