NextGen Sequencing and Target Enrichment

Similar documents
Transcription:

NextGen Sequencing and Target Enrichment Laurent FARINELLI 7 September 2010 Agilent 3rd Analytic Forum Basel, Switzerland

Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 2

DNA sequencing Purified DNA Chromatogram and DNA sequence 7 September 2010 Agilent Forum - Basel 3

HiSeq 2000 workflow Purified DNA Sample preparation Massively parallel sequencing Millions of DNA sequences Bioinformatics! 7 September 2010 Agilent Forum - Basel 4

Revolution in Throughput One lane, one sample.. Capillary Sequencer 1 gene 1 page Village 1 000 inhabitants 12 000 000x Illumina HiSEQ A genome at 4x coverage A library with 12 000 books of 1000 pages each 4x Earth s 7 000 000 000 inhabitants 7 September 2010 Agilent Forum - Basel 5

1996, early days at Glaxo.. Whole genome sequencing of Streptococcus pneumoniae 7 instruments ABI 377 glass plates 700 sequences per day, reads of 500 bases 4 bases per second combined throughput => almost 1 year to sequence 2.1 Mb 7 September 2010 Agilent Forum - Basel 6

1996, we need something else.. DNA Colonies.. 1996-1997: GlaxoWellcome s Geneva Biomedical Research Institute Mayer P., Farinelli L. and Kawashima, E., 1997, Patent application WO 98/44151.. now know as DNA Clusters...Massively parallel sequencing 7 September 2010 Agilent Forum - Basel 7

DNA Colonies: in situ Amplification In situ amplification 1996-1997: GlaxoWellcome s Geneva Biomedical Research Institute 1998-2000: Serono 2000-2003: Manteia Predictive Medicine 2004- : 7 September 2010 Agilent Forum - Basel 8

DNA Colonies: Random Arraying DNA Colonies 10 microns Up to 10 000 000 DNA colonies / cm 2 1996-1997: GlaxoWellcome s Geneva Biomedical Research Institute 1998-2000: Serono 2000-2003: Manteia Predictive Medicine 2004- : 7 September 2010 Agilent Forum - Basel 9

Base-by-base sequencing 1 1. Incorporation - Polymerase - The 4 nucleotides, labelled and blocked on the 3 -end Only 1 base is added on each DNA colony 7 September 2010 Agilent Forum - Basel 10

Base-by-base sequencing T 1 2. Image acquisition - The most time consuming step => Real-time base-calling T 7 September 2010 Agilent Forum - Basel 11

Base-by-base sequencing T 3. De-blocking - Remove label - de-block 3 -end => Ready for next incorporation T 7 September 2010 Agilent Forum - Basel 12

Base-by-base sequencing T G 1 2 Cycle 2, the second base is read One cycle takes ~1 hour (15 min. chemistry, 45 min. scanning the flow-cell) T T 7 September 2010 Agilent Forum - Basel 13

Base calling from raw data T G C T A C G A T 1 2 3 7 8 9 4 5 6 T T T T T T T G T 1. Convert images into intensity files (text files) 2. Base calling from each x, y position 3. Filtration of sequences 7 September 2010 Agilent Forum - Basel 14

2006: Next Generation Acquisition of the Solexa 1G system Q1 2007: 4 days to sequence a flow-cell One channel 500 000 reads of 26 bases 100 Mb per flow-cell 300 bases per second 7 September 2010 Agilent Forum - Basel 15

The Genome Analyzer revolution Sequencing speed Solexa 1G Bases / sec 350 300 250 200 150 100 50 0 300 7 instruments ABI 377 1 setup ABI 3730xl 4 10 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 Year 7 September 2010 Agilent Forum - Basel 16

Tremendous Speed Increase Sequencing speed 60000 GA IIx Pipeline 1.6 50000 40000 Bases / sec 30000 20000 7 instruments ABI 377 1 setup ABI 3730xl GA 1G GA II Pipeline 1.4 GA II P-E 10000 0 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Year 7 September 2010 Agilent Forum - Basel 17

Tremendous Speed Increase Sequencing speed HiSEQ-2000 June 2010 350000 300000 250000 Bases / sec 200000 150000 7 instruments ABI 377 GA 1G 100000 50000 1 setup ABI 3730xl GA II P-E GA IIx Pipeline 1.4 and 1.6 0 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Q2, 2010 Year 7 September 2010 Agilent Forum - Basel 18

One HiSEQ 2000 Paired-ends Channel (One 9-days run is 2 flow-cells of 8 channels each) 60-80 mio sequences of 2 x 108 bases 12 Gb 12 000 books of 1000 pages 2400x coverage of a 5 Mb bacterium 1000x coverage of a 12 Mb Yeast 93x coverage of 128 Mb Arabidopsis 7 September 2010 Agilent Forum - Basel 19

One HiSEQ 2000 Instrument (Validation run, July 2010) 140 GB per flow-cell 2 flow-cells with PhiX Corresponds to 40x coverage of 2 human genomes 7 September 2010 Agilent Forum - Basel 20

High number of reads per channel Good for large genomes Several channels or flow-cells are still needed for mammalian genomes 40x coverage in a single flow-cell 7 September 2010 Agilent Forum - Basel 21

High number of reads per channel Too many for small projects Too many reads for several applications => Need multiplexing 4, 8 or more samples per channel (ChIP-SEQ, smallrnas, bacteria, targeted re-sequencing, etc..) 7 September 2010 Agilent Forum - Basel 22

High number of reads per channel: Multiplexing Sample prep becomes cost driver Price in US$ 10000 7500 5000 2500 1 sample per channel 4 samples per channel 8 samples per channel 12 samples per channel 0 Prep Channel Price/sample Need for more efficient and cheaper sample prep => We are working on 96-wells preps 7 September 2010 Agilent Forum - Basel 23

Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 24

Here Come the Novel Applications Using small RNAs to assemble de novo viral genomes Sequencing by sirna: complete nucleotide sequence of SPFMV-Piu3 and ~50% of genome of three novel viruses Data kindly provided by Dr. Jan KREUZE Sample prep performed at Sequencing by sirna: a novel generic tool for virus discovery Kreuze et al. (2009) Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 388: 1-7 7 September 2010 Agilent Forum - Basel 25

Applications performed at Fasteris Binding sites ChIP-SEQ, 4C, etc.. Small RNAs Profiles, comparisons de novo assembly Transcriptomes Digital Gene Expression Profiling, mrna-seq Whole Genomes Re-sequencing Bacteria, yeast, plant, animals, viruses SNP detection Targeted re-sequencing PCR products, pooled samples SureSelect target enrichment De novo sequencing Bacteria, eukaryotes Custom applications For all applications: Bar-coding / Multiplexing 7 September 2010 Agilent Forum - Basel 26

Applications developed at Fasteris Genomic shotgun protocol (Feb. 2007) Genomic Research Laboratory University Hospital of Geneva Staphylococcus aureus Re-sequencing SNP detection De novo assembly Development of EDENA software 7 September 2010 Agilent Forum - Basel 27

Genomic Sample Preparation Nebulized DNA Ends repair T4 DNA pol, Klenow, PNK Purification Blunt DNA Addition of 3 A Klenow (exo-) One type of double-stranded adapter T T A A A Purification Ligation of genomic adapters Purification A Prevents self-ligation T A T A A T A T 7 September 2010 Agilent Forum - Basel 28

Genomic Sample Preparation Agarose Gel purification 150-250 bp 5 3 3 5 Example with a single template PCR amplification 3 5 5 3 5 3 3 5 5 3 3 5 2 different reads obtained on the Genome Analyzer 7 September 2010 Agilent Forum - Basel 29

Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 30

Analyzing subsets of genomes Transcriptome ChIP-SEQ PCR Genome-wide genotyping Splice leader protocol Target enrichment etc.. 7 September 2010 Agilent Forum - Basel 31

Applications developed at Fasteris Transcriptome shotgun (June 2007) Firmenich SA Geneva Find enzymes in un-sequenced plant Transcriptome shotgun from cdna De novo assembly of single-ends reads Identification of contigs with homology (BLASTx) RACE cloning of full lengths cdnas 7 September 2010 Agilent Forum - Basel 32

Counting transcripts DGEP vs. mrna-seq Theoretical example Sample with 100 000 transcripts => each transcript present at single copy will be represented by 10 reads Obtained 10^6 reads DGEP mrna-seq Interesting if a reference transcriptome is available All reads start at same point Reads are along the transcript 7 September 2010 Agilent Forum - Basel 33

Applications developed at Fasteris Dir-mRNA-SEQ in Bacteria (2008-2009) Genomic Research Laboratory University Hospital of Geneva Staphylococcus aureus Finding whole genome expressed regions Finding small RNAs as well 7 September 2010 Agilent Forum - Basel 34

Dir-mRNA-SEQ: Two transcripts in reverse orientation From a bacterial transcriptome Data kindly provided by Dr. Patrice François, Marie Beaume and David Hernandez, in the group of Prof. Jacques Schrenzel www.genomic.ch 7 September 2010 Agilent Forum - Basel 35

Applications developed at Fasteris ChIP-Seq protocol (2008) Genomic Research Laboratory University Hospital of Geneva Mouse with age-related memory loss Protocol from very low amounts of material 7 September 2010 Agilent Forum - Basel 36

Applications developed at Fasteris Metagenomics (2008-2009) Genomic Research Laboratory University Hospital of Geneva Bacteria present in oral samples Sequencing of 16S amplified region 7 September 2010 Agilent Forum - Basel 37

Applications developed at Fasteris Genome Wide Genotyping (2009) University of Oxford, UK Harvard medical School, USA Low density genotyping Sequencing next to restriction sites 7 September 2010 Agilent Forum - Basel 38

Genome-wide genotyping Complexity reduction trough restriction digest Sample A SNP Sample B Average distance between tags according to choice of restriction enzyme Can be applied to unsequenced genome 7 September 2010 Agilent Forum - Basel 39

Published August 2010 7 September 2010 Agilent Forum - Basel 40

Splice Leader Trapping in Spliced Leader Trapping (SLT) coverage and gene discovery rate Trypanosoma brucei Torsten Ochsenreiter Isabel Roditi 7 September 2010 Agilent Forum - Basel 41

Splice Leader Trapping in Spliced Leader Trapping (SLT) coverage and gene discovery rate Trypanosoma brucei Torsten Ochsenreiter Isabel Roditi 7 September 2010 Agilent Forum - Basel 42

Target enrichment Solid arrays (Agilent, NimbleGen) Liquid Arrays, Agilent SureSelect Whole exome Custom region 7 September 2010 Agilent Forum - Basel 43

Applications performed at Fasteris Targeted re-sequencing (2008) Medical Genetics, University Hospital of Geneva Human NimbleGen solid arrays SNP detection 7 September 2010 Agilent Forum - Basel 44

Applications performed at Fasteris Targeted re-sequencing (2008) Medical Genetics, University Hospital of Geneva 7 September 2010 Agilent Forum - Basel 45

Applications performed at Fasteris Targeted re-sequencing (2008) Università degli Studi di Milano, Italy Human Agilent solid arrays, capture performed at ImaGenes, Berlin Identification of the causative (?) mutation 7 September 2010 Agilent Forum - Basel 46

Applications performed at Fasteris Targeted re-sequencing (2008) Università degli Studi di Milano, Italy Human Agilent solid arrays, capture performed at ImaGenes, Berlin Identification of the causative (?) mutation 7 September 2010 Agilent Forum - Basel 47

Fasteris Certified by Agilent 7 September 2010 Agilent Forum - Basel 48

SureSelect human whole exome Reads (2x108bp) Bases Mapped Target size On target Target not covered On target within 100 bp Sample 1 22.3 mio 4.8 Gb 97.5% 37 Mb 60.4% 3.2% 68.0% Sample 2 22.4 mio 4.8 Gb 98.0% 37 Mb 60.5% 3.0% 66.9% Sample 3 22.2 mio 4.8 Gb 97.9% 37 Mb 59.3% 2.6% 67.4% Sample 4 21.1 mio 4.6 Gb 98.0% 37 Mb 60.5% 2.7% 67.8% Dec. 2009, Genome Analyzer GAIIx, one lane per captured library 7 September 2010 Agilent Forum - Basel 49

SureSelect custom target Short target region: 510 kb (must be repeat-masked) 100 samples Fasteris-developed protocol: Prepared in 96-wells plate Index added before capture Sequenced multiplexed 16 per lane (GAIIx) 7 September 2010 Agilent Forum - Basel 50

SureSelect custom target, 100 samples Min. Max. Enrichment 422x 1436x Average coverage 21x 155x New 96-wells protocol successful 4 655 filtered SNPs identified, 1 652 shared Coverage variations essentially due different number of reads: quantification of libraries before sequencing is always difficult 7 September 2010 Agilent Forum - Basel 51

Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 52

Challenges Reduce sample prep costs Will it become easier/cheaper to sequence the whole genome rather than selecting targets? Issue of repeats Bioinformatics to analyze hundreds or thousands of samples 7 September 2010 Agilent Forum - Basel 53

Coming up Simplified transcriptome protocol 96-well protocol to be adapted to other protocols: Transcriptome Automated sample prep by illumina New adapters with built-in index Robotic platform Increasing sequencing yield Software, sequencing kits, instruments? New bioinformatics tools 7 September 2010 Agilent Forum - Basel 54

Conclusion Of the various capture approaches tested at Fasteris, the SureSelect liquid arrays produced the highest proportion of reads on target Due to the fact that only 1% of the genome is sequenced, the cost saving of SureSelect largely compensates for the cost of the target enrichment kits For the years to come, target enrichment will remain an attractive protocol 7 September 2010 Agilent Forum - Basel 55

The Fasteris team at your service Laurent Magne Anne Christelle Sylvie Christine Cristel Elisabeth Cécile Marta Loïc 7 September 2010 Agilent Forum - Basel 56