NextGen Sequencing and Target Enrichment

Size: px
Start display at page:

Download "NextGen Sequencing and Target Enrichment"

Transcription

1 NextGen Sequencing and Target Enrichment Laurent FARINELLI 7 September 2010 Agilent 3rd Analytic Forum Basel, Switzerland

2 Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 2

3 DNA sequencing Purified DNA Chromatogram and DNA sequence 7 September 2010 Agilent Forum - Basel 3

4 HiSeq 2000 workflow Purified DNA Sample preparation Massively parallel sequencing Millions of DNA sequences Bioinformatics! 7 September 2010 Agilent Forum - Basel 4

5 Revolution in Throughput One lane, one sample.. Capillary Sequencer 1 gene 1 page Village inhabitants x Illumina HiSEQ A genome at 4x coverage A library with books of 1000 pages each 4x Earth s inhabitants 7 September 2010 Agilent Forum - Basel 5

6 1996, early days at Glaxo.. Whole genome sequencing of Streptococcus pneumoniae 7 instruments ABI 377 glass plates 700 sequences per day, reads of 500 bases 4 bases per second combined throughput => almost 1 year to sequence 2.1 Mb 7 September 2010 Agilent Forum - Basel 6

7 1996, we need something else.. DNA Colonies : GlaxoWellcome s Geneva Biomedical Research Institute Mayer P., Farinelli L. and Kawashima, E., 1997, Patent application WO 98/ now know as DNA Clusters...Massively parallel sequencing 7 September 2010 Agilent Forum - Basel 7

8 DNA Colonies: in situ Amplification In situ amplification : GlaxoWellcome s Geneva Biomedical Research Institute : Serono : Manteia Predictive Medicine : 7 September 2010 Agilent Forum - Basel 8

9 DNA Colonies: Random Arraying DNA Colonies 10 microns Up to DNA colonies / cm : GlaxoWellcome s Geneva Biomedical Research Institute : Serono : Manteia Predictive Medicine : 7 September 2010 Agilent Forum - Basel 9

10 Base-by-base sequencing 1 1. Incorporation - Polymerase - The 4 nucleotides, labelled and blocked on the 3 -end Only 1 base is added on each DNA colony 7 September 2010 Agilent Forum - Basel 10

11 Base-by-base sequencing T 1 2. Image acquisition - The most time consuming step => Real-time base-calling T 7 September 2010 Agilent Forum - Basel 11

12 Base-by-base sequencing T 3. De-blocking - Remove label - de-block 3 -end => Ready for next incorporation T 7 September 2010 Agilent Forum - Basel 12

13 Base-by-base sequencing T G 1 2 Cycle 2, the second base is read One cycle takes ~1 hour (15 min. chemistry, 45 min. scanning the flow-cell) T T 7 September 2010 Agilent Forum - Basel 13

14 Base calling from raw data T G C T A C G A T T T T T T T T G T 1. Convert images into intensity files (text files) 2. Base calling from each x, y position 3. Filtration of sequences 7 September 2010 Agilent Forum - Basel 14

15 2006: Next Generation Acquisition of the Solexa 1G system Q1 2007: 4 days to sequence a flow-cell One channel reads of 26 bases 100 Mb per flow-cell 300 bases per second 7 September 2010 Agilent Forum - Basel 15

16 The Genome Analyzer revolution Sequencing speed Solexa 1G Bases / sec instruments ABI setup ABI 3730xl Year 7 September 2010 Agilent Forum - Basel 16

17 Tremendous Speed Increase Sequencing speed GA IIx Pipeline Bases / sec instruments ABI setup ABI 3730xl GA 1G GA II Pipeline 1.4 GA II P-E Year 7 September 2010 Agilent Forum - Basel 17

18 Tremendous Speed Increase Sequencing speed HiSEQ-2000 June Bases / sec instruments ABI 377 GA 1G setup ABI 3730xl GA II P-E GA IIx Pipeline 1.4 and Q2, 2010 Year 7 September 2010 Agilent Forum - Basel 18

19 One HiSEQ 2000 Paired-ends Channel (One 9-days run is 2 flow-cells of 8 channels each) mio sequences of 2 x 108 bases 12 Gb books of 1000 pages 2400x coverage of a 5 Mb bacterium 1000x coverage of a 12 Mb Yeast 93x coverage of 128 Mb Arabidopsis 7 September 2010 Agilent Forum - Basel 19

20 One HiSEQ 2000 Instrument (Validation run, July 2010) 140 GB per flow-cell 2 flow-cells with PhiX Corresponds to 40x coverage of 2 human genomes 7 September 2010 Agilent Forum - Basel 20

21 High number of reads per channel Good for large genomes Several channels or flow-cells are still needed for mammalian genomes 40x coverage in a single flow-cell 7 September 2010 Agilent Forum - Basel 21

22 High number of reads per channel Too many for small projects Too many reads for several applications => Need multiplexing 4, 8 or more samples per channel (ChIP-SEQ, smallrnas, bacteria, targeted re-sequencing, etc..) 7 September 2010 Agilent Forum - Basel 22

23 High number of reads per channel: Multiplexing Sample prep becomes cost driver Price in US$ sample per channel 4 samples per channel 8 samples per channel 12 samples per channel 0 Prep Channel Price/sample Need for more efficient and cheaper sample prep => We are working on 96-wells preps 7 September 2010 Agilent Forum - Basel 23

24 Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 24

25 Here Come the Novel Applications Using small RNAs to assemble de novo viral genomes Sequencing by sirna: complete nucleotide sequence of SPFMV-Piu3 and ~50% of genome of three novel viruses Data kindly provided by Dr. Jan KREUZE Sample prep performed at Sequencing by sirna: a novel generic tool for virus discovery Kreuze et al. (2009) Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 388: September 2010 Agilent Forum - Basel 25

26 Applications performed at Fasteris Binding sites ChIP-SEQ, 4C, etc.. Small RNAs Profiles, comparisons de novo assembly Transcriptomes Digital Gene Expression Profiling, mrna-seq Whole Genomes Re-sequencing Bacteria, yeast, plant, animals, viruses SNP detection Targeted re-sequencing PCR products, pooled samples SureSelect target enrichment De novo sequencing Bacteria, eukaryotes Custom applications For all applications: Bar-coding / Multiplexing 7 September 2010 Agilent Forum - Basel 26

27 Applications developed at Fasteris Genomic shotgun protocol (Feb. 2007) Genomic Research Laboratory University Hospital of Geneva Staphylococcus aureus Re-sequencing SNP detection De novo assembly Development of EDENA software 7 September 2010 Agilent Forum - Basel 27

28 Genomic Sample Preparation Nebulized DNA Ends repair T4 DNA pol, Klenow, PNK Purification Blunt DNA Addition of 3 A Klenow (exo-) One type of double-stranded adapter T T A A A Purification Ligation of genomic adapters Purification A Prevents self-ligation T A T A A T A T 7 September 2010 Agilent Forum - Basel 28

29 Genomic Sample Preparation Agarose Gel purification bp Example with a single template PCR amplification different reads obtained on the Genome Analyzer 7 September 2010 Agilent Forum - Basel 29

30 Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 30

31 Analyzing subsets of genomes Transcriptome ChIP-SEQ PCR Genome-wide genotyping Splice leader protocol Target enrichment etc.. 7 September 2010 Agilent Forum - Basel 31

32 Applications developed at Fasteris Transcriptome shotgun (June 2007) Firmenich SA Geneva Find enzymes in un-sequenced plant Transcriptome shotgun from cdna De novo assembly of single-ends reads Identification of contigs with homology (BLASTx) RACE cloning of full lengths cdnas 7 September 2010 Agilent Forum - Basel 32

33 Counting transcripts DGEP vs. mrna-seq Theoretical example Sample with transcripts => each transcript present at single copy will be represented by 10 reads Obtained 10^6 reads DGEP mrna-seq Interesting if a reference transcriptome is available All reads start at same point Reads are along the transcript 7 September 2010 Agilent Forum - Basel 33

34 Applications developed at Fasteris Dir-mRNA-SEQ in Bacteria ( ) Genomic Research Laboratory University Hospital of Geneva Staphylococcus aureus Finding whole genome expressed regions Finding small RNAs as well 7 September 2010 Agilent Forum - Basel 34

35 Dir-mRNA-SEQ: Two transcripts in reverse orientation From a bacterial transcriptome Data kindly provided by Dr. Patrice François, Marie Beaume and David Hernandez, in the group of Prof. Jacques Schrenzel 7 September 2010 Agilent Forum - Basel 35

36 Applications developed at Fasteris ChIP-Seq protocol (2008) Genomic Research Laboratory University Hospital of Geneva Mouse with age-related memory loss Protocol from very low amounts of material 7 September 2010 Agilent Forum - Basel 36

37 Applications developed at Fasteris Metagenomics ( ) Genomic Research Laboratory University Hospital of Geneva Bacteria present in oral samples Sequencing of 16S amplified region 7 September 2010 Agilent Forum - Basel 37

38 Applications developed at Fasteris Genome Wide Genotyping (2009) University of Oxford, UK Harvard medical School, USA Low density genotyping Sequencing next to restriction sites 7 September 2010 Agilent Forum - Basel 38

39 Genome-wide genotyping Complexity reduction trough restriction digest Sample A SNP Sample B Average distance between tags according to choice of restriction enzyme Can be applied to unsequenced genome 7 September 2010 Agilent Forum - Basel 39

40 Published August September 2010 Agilent Forum - Basel 40

41 Splice Leader Trapping in Spliced Leader Trapping (SLT) coverage and gene discovery rate Trypanosoma brucei Torsten Ochsenreiter Isabel Roditi 7 September 2010 Agilent Forum - Basel 41

42 Splice Leader Trapping in Spliced Leader Trapping (SLT) coverage and gene discovery rate Trypanosoma brucei Torsten Ochsenreiter Isabel Roditi 7 September 2010 Agilent Forum - Basel 42

43 Target enrichment Solid arrays (Agilent, NimbleGen) Liquid Arrays, Agilent SureSelect Whole exome Custom region 7 September 2010 Agilent Forum - Basel 43

44 Applications performed at Fasteris Targeted re-sequencing (2008) Medical Genetics, University Hospital of Geneva Human NimbleGen solid arrays SNP detection 7 September 2010 Agilent Forum - Basel 44

45 Applications performed at Fasteris Targeted re-sequencing (2008) Medical Genetics, University Hospital of Geneva 7 September 2010 Agilent Forum - Basel 45

46 Applications performed at Fasteris Targeted re-sequencing (2008) Università degli Studi di Milano, Italy Human Agilent solid arrays, capture performed at ImaGenes, Berlin Identification of the causative (?) mutation 7 September 2010 Agilent Forum - Basel 46

47 Applications performed at Fasteris Targeted re-sequencing (2008) Università degli Studi di Milano, Italy Human Agilent solid arrays, capture performed at ImaGenes, Berlin Identification of the causative (?) mutation 7 September 2010 Agilent Forum - Basel 47

48 Fasteris Certified by Agilent 7 September 2010 Agilent Forum - Basel 48

49 SureSelect human whole exome Reads (2x108bp) Bases Mapped Target size On target Target not covered On target within 100 bp Sample mio 4.8 Gb 97.5% 37 Mb 60.4% 3.2% 68.0% Sample mio 4.8 Gb 98.0% 37 Mb 60.5% 3.0% 66.9% Sample mio 4.8 Gb 97.9% 37 Mb 59.3% 2.6% 67.4% Sample mio 4.6 Gb 98.0% 37 Mb 60.5% 2.7% 67.8% Dec. 2009, Genome Analyzer GAIIx, one lane per captured library 7 September 2010 Agilent Forum - Basel 49

50 SureSelect custom target Short target region: 510 kb (must be repeat-masked) 100 samples Fasteris-developed protocol: Prepared in 96-wells plate Index added before capture Sequenced multiplexed 16 per lane (GAIIx) 7 September 2010 Agilent Forum - Basel 50

51 SureSelect custom target, 100 samples Min. Max. Enrichment 422x 1436x Average coverage 21x 155x New 96-wells protocol successful filtered SNPs identified, shared Coverage variations essentially due different number of reads: quantification of libraries before sequencing is always difficult 7 September 2010 Agilent Forum - Basel 51

52 Outline The illumina HiSEQ 2000 system Applications Target enrichment Outlook 7 September 2010 Agilent Forum - Basel 52

53 Challenges Reduce sample prep costs Will it become easier/cheaper to sequence the whole genome rather than selecting targets? Issue of repeats Bioinformatics to analyze hundreds or thousands of samples 7 September 2010 Agilent Forum - Basel 53

54 Coming up Simplified transcriptome protocol 96-well protocol to be adapted to other protocols: Transcriptome Automated sample prep by illumina New adapters with built-in index Robotic platform Increasing sequencing yield Software, sequencing kits, instruments? New bioinformatics tools 7 September 2010 Agilent Forum - Basel 54

55 Conclusion Of the various capture approaches tested at Fasteris, the SureSelect liquid arrays produced the highest proportion of reads on target Due to the fact that only 1% of the genome is sequenced, the cost saving of SureSelect largely compensates for the cost of the target enrichment kits For the years to come, target enrichment will remain an attractive protocol 7 September 2010 Agilent Forum - Basel 55

56 The Fasteris team at your service Laurent Magne Anne Christelle Sylvie Christine Cristel Elisabeth Cécile Marta Loïc 7 September 2010 Agilent Forum - Basel 56