Wet-lab Considerations for Illumina data analysis


1 Wet-lab Considerations for Illumina data analysis. Based on a presentation by Henriette O'Geen. Lutz Froenicke, DNA Technologies and Expression Analysis Cores, UCD Genome Center

2 Complementary Approaches
Illumina: still-imaging of clusters (~1000 clonal molecules); short reads, 2x300 bp; repeats are mostly not analyzable; high output, up to 75 Gb per lane; high accuracy (error rate < 0.5 %); considerable base composition bias; very affordable; de novo assemblies of thousands of scaffolds.
PacBio: movie recordings of the fluorescence of single molecules; reads up to 30 kb, N50 18 kb; spans retro elements; up to 1 Gb per SMRT-cell; error rate 14 to 15 %; no base composition bias; costs 5 to 10 times higher; near perfect genome assemblies.
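The N50 read length quoted for PacBio can be computed from a list of read lengths. A minimal sketch follows; the function name and the example lengths are illustrative assumptions, not values from the presentation.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length


# Hypothetical read lengths in bp, for illustration only.
example_reads = [30000, 18000, 12000, 8000, 5000, 2000]
print(n50(example_reads))
```

Because the N50 weights reads by the bases they contribute, a few very long reads raise it much more than they raise the mean read length.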

3 High Throughput Short Read Sequencing: Illumina. Applications: De novo genome Sequencing; Genome Resequencing (SNPs, Indels, CNVs, Rearrangements); Metagenomics; RNA-seq (Gene Expression, Splice Isoform Abundance); Exome Sequencing; DNA Methylation; ChIP-SEQ; 3D Organization; Genotyping; Small RNA

4 Long Read Sequencing: PacBio. Applications: De novo genome Sequencing; Genome Resequencing (SNPs, Indels, CNVs, Rearrangements); Metagenomics; RNA-seq (Gene Expression, Splice Isoform Abundance); Exome Sequencing; DNA Methylation; ChIP-SEQ; 3D Organization; Genotyping; Small RNA

5 2014

6 Sequencing workflow Library Construction Cluster Formation Illumina Sequencing Data Analysis

7 ChIP-seq. Input control: verify fragment length ( bp)???? Very experiment specific

8 Standard RNA-Seq library protocol
- QC of total RNA to assess integrity
- Removal of rRNA (most common) by mRNA isolation or rRNA depletion
- Fragmentation of RNA
- Reverse transcription and second-strand cDNA synthesis
- Ligation of adapters
- PCR amplification
- Purify, QC and quantify

9 Recommended RNA input (library prep kit: starting material)
- mRNA (TruSeq): 100 ng to 4 μg total RNA
- Directional mRNA (TruSeq): 1 to 5 μg total RNA or 50 ng mRNA
- Apollo324 library robot (strand specific): 100 ng mRNA
- Small RNA (TruSeq): 1 μg total RNA
- Ribo depletion (Epicentre): 1 to 5 μg total RNA
- SMARTer Ultra Low RNA (Clontech): 100 pg to 10 ng
- Ovation RNA seq V2, Single Cell RNA seq (NuGen): 10 ng to 100 ng

10 18S (2500b), 28S (4000b)

11

12 RNA integrity <> reproducibility Chen et al. 2014

13 Considerations in choosing an RNA-Seq method
- Transcript type: mRNA, extent of degradation; small/micro RNA
- Strandedness: un-directional ds cDNA library; directional library
- Input RNA amount: ug original total RNA; linear amplification from ng RNA
- Complexity: original abundance; cDNA normalization for uniformity
- Boundary of transcripts: identify 5' and/or 3' ends; poly-adenylation sites; degradation, cleavage sites

14

15 Fragmentation
- Mechanical shearing (BioRuptor, Covaris): DNA, RNA
- Enzymatic (Fragmentase, RNase III): DNA, RNA
- Chemical (Mg2+, Zn2+): RNA

16 Here:

17 Illumina Sequencing Technology: Sequencing By Synthesis (SBS) Technology. Figure: DNA ( ug), library preparation, single molecule array, cluster generation, sequencing.

18 TruSeq Chemistry: Flow Cell 8 channels Surface of flow cell coated with a lawn of oligo pairs

19 Sequencing: 1.6 Billion Clusters Per Flow Cell (image scale bars: 20 Microns, 100 Microns)

20 Sequencing (image scale bar: 100 Microns)

21 What will go wrong? Cluster identification; bubbles; synthesis errors

22 What will go wrong? Synthesis errors: Phasing & Pre-Phasing problems
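Why phasing and pre-phasing degrade long reads can be illustrated with a deliberately simplified model: if a fixed fraction of molecules in a cluster falls behind (phasing) or jumps ahead (pre-phasing) at every cycle, the in-phase fraction decays geometrically. The rates below are hypothetical placeholders, not measured Illumina values, and the model ignores molecules that drift back into phase.

```python
def in_phase_fraction(cycles, phasing=0.002, prephasing=0.001):
    """Fraction of molecules in a cluster still in phase after `cycles`
    sequencing cycles, under a simple per-cycle loss model.

    phasing / prephasing are per-cycle probabilities (hypothetical defaults).
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    loss = phasing + prephasing
    if not 0.0 <= loss <= 1.0:
        raise ValueError("phasing + prephasing must lie in [0, 1]")
    return (1.0 - loss) ** cycles


# Compare the signal purity at several read lengths.
for n in (50, 100, 300):
    print(n, round(in_phase_fraction(n), 3))
```

Even small per-cycle rates compound over many cycles, which is consistent with base quality dropping toward the end of long reads.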

23 DNA library construction: Fragmented DNA; End Repair (blunt end fragments with 5' phosphate and 3' OH); A-Tailing (single A-overhang fragments); Adapter Ligation with T-overhang adapters (DNA fragments with adapter ends)

24 Enrichment of library fragments: PCR Amplification

25 If you can put adapters on it, we can sequence it!

26 Know your sample single-stranded Adapter Ligation

27 Optional: PCR-free libraries
- PCR-free library: library can be sequenced if concentration allows; reduces PCR bias against e.g. GC-rich or AT-rich regions, especially for metagenomic samples
OR
- Library enrichment by PCR: ideal combination is high input and low cycle number; low-bias polymerase

28 Quantitation & QC methods
- Intercalating dye methods (PicoGreen, Qubit, etc.): specific to dsDNA, accurate at low levels of DNA; great for pooling of indexed libraries to be sequenced in one lane; requires standard curve generation and many accurate pipetting steps
- Bioanalyzer: quantitation is good for a rough estimate; invaluable for library QC; high-sensitivity DNA chip allows quantitation of low DNA levels
- qPCR: most accurate quantitation method; more labor-intensive; must be compared to a control
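Pooling indexed libraries for one lane needs molar concentrations, not mass concentrations. The sketch below converts a dye-based reading (ng/µL) plus a mean fragment length (for example from a Bioanalyzer trace) into nM, using the common approximation of 660 g/mol per base pair of dsDNA, and then derives equimolar pooling volumes. Function names and example values are illustrative assumptions.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation)


def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and length > 0")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def equimolar_volumes_ul(molarities_nM, fmol_per_library):
    """Volume (uL) of each library that contributes `fmol_per_library`.

    Uses the identity 1 nM == 1 fmol/uL.
    """
    volumes = {}
    for name, nM in molarities_nM.items():
        if nM <= 0:
            raise ValueError(f"library {name!r} has no measurable DNA")
        volumes[name] = fmol_per_library / nM
    return volumes


# Hypothetical libraries: (ng/uL, mean fragment length in bp).
libraries = {"libA": (2.0, 400), "libB": (4.0, 400)}
molarities = {k: library_molarity_nM(c, bp) for k, (c, bp) in libraries.items()}
print(equimolar_volumes_ul(molarities, fmol_per_library=10.0))
```

The more concentrated library contributes proportionally less volume, so both end up with the same number of molecules in the pool.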

29 Library QC by Bioanalyzer: predominant species of appropriate MW; minimal primer dimers or adapter dimers; minimal higher MW material

30 Library QC by Bioanalyzer (example trace labels): Beautiful; ~125 bp, 100% Adapters; Beautiful

31 Library QC: examples for successful libraries; adapter contamination at ~125 bp
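Adapter contamination seen as a ~125 bp peak also shows up in the reads themselves: an adapter dimer has no insert, so its read starts directly in adapter sequence. A minimal sketch, assuming reads are plain strings and that the library carries TruSeq-style adapters beginning with AGATCGGAAGAGC (an assumption; other kits use other sequences):

```python
# Common prefix of TruSeq-style adapters; assumed here, check your kit.
TRUSEQ_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_dimer_fraction(reads, adapter=TRUSEQ_ADAPTER_PREFIX):
    """Fraction of reads that begin with the adapter sequence."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if read.upper().startswith(adapter))
    return hits / len(reads)


# Hypothetical reads for illustration.
sample_reads = ["AGATCGGAAGAGCACACGTC", "ACGTACGTACGTACGT", "TTGACCAGATCGGAAG"]
print(adapter_dimer_fraction(sample_reads))
```

This only counts reads that are adapter from the first base; reads with a short insert followed by adapter need a proper trimming tool.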

32 Is strand-specific information important? Standard library (non-directional). Figure labels: antisense, sense, Neu1

33 Strand-specific RNA-seq
- Figure labels: standard library (non-directional); antisense non-coding RNA; sense transcripts
- Informative for non-coding RNAs and antisense transcripts
- Essential when NOT using polyA selection (mRNA)
- No disadvantage to preserving strand specificity

34 RNA-seq for DGE: Differential Gene Expression (DGE)
- 50 bp single end reads
- 30 million reads per sample (eukaryotes): 10 million reads give > 80% of annotated genes; 30 million reads give > 90% of annotated genes
- 10 million reads per sample (bacteria)
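The per-sample read targets translate into a multiplexing plan. The sketch below combines them with the flow cell figures given on earlier slides (1.6 billion clusters per flow cell, 8 channels) to estimate how many samples fit in one lane. The pass-filter fraction is a hypothetical parameter, since not every cluster yields a usable read.

```python
def samples_per_lane(clusters_per_flow_cell, lanes, reads_per_sample,
                     pass_filter_fraction=1.0):
    """Estimate how many indexed samples fit in one lane.

    pass_filter_fraction is an assumed fraction of clusters that yield
    usable reads (hypothetical; depends on the run).
    """
    if lanes <= 0 or reads_per_sample <= 0:
        raise ValueError("lanes and reads_per_sample must be positive")
    if not 0.0 <= pass_filter_fraction <= 1.0:
        raise ValueError("pass_filter_fraction must lie in [0, 1]")
    usable_reads = int(clusters_per_flow_cell / lanes * pass_filter_fraction)
    return usable_reads // reads_per_sample


# Eukaryote (30 million reads) vs. bacterial (10 million reads) DGE samples.
print(samples_per_lane(1_600_000_000, 8, 30_000_000))
print(samples_per_lane(1_600_000_000, 8, 10_000_000))
```

Rounding down leaves some headroom, which helps when pooling is not perfectly even.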

35 Other RNA-seq
- Transcriptome assembly: 300 bp paired end plus 100 bp paired end
- Long non-coding RNA studies: 100 bp paired end, million reads
- Splice variant studies: 100 bp paired end, million reads

36 RNA-seq targeted sequencing
- Capture-seq (Mercer et al. 2014)
- Nimblegen and Illumina
- Low quality DNA (FFPE)
- Lower read numbers: 10 million reads
- Targeting lowly expressed genes

37 RNA-seq reproducibility. Two big multi-center studies (2014). High reproducibility of data given:
- same library prep kits, same protocols
- same RNA samples
- RNA isolation protocols have to be identical
- robotic library preps?

38 C1 Single cell capture

39 SMARTer cDNA conversion / SMART-seq (Picelli 2014)

40 Molecular indexing for precision counts

41 Molecular indexing for precision counts
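Molecular indexing tags each original molecule with a random barcode (often called a UMI) before amplification, so PCR duplicates can be collapsed during counting. A minimal sketch, assuming each aligned read has already been reduced to a (gene, barcode) pair; the input format and names are illustrative, and sequencing errors within barcodes are not corrected here.

```python
from collections import defaultdict


def count_molecules(reads):
    """Return {gene: (read_count, molecule_count)}.

    `reads` is an iterable of (gene, molecular_index) pairs. Reads that
    share gene and index are treated as copies of one original molecule.
    """
    indexes_per_gene = defaultdict(set)
    reads_per_gene = defaultdict(int)
    for gene, molecular_index in reads:
        indexes_per_gene[gene].add(molecular_index)
        reads_per_gene[gene] += 1
    return {gene: (reads_per_gene[gene], len(indexes_per_gene[gene]))
            for gene in indexes_per_gene}


# Hypothetical input: geneA was amplified unevenly.
example = [("geneA", "ACGT"), ("geneA", "ACGT"), ("geneA", "ACGT"),
           ("geneA", "TTAG"), ("geneB", "GGCA")]
print(count_molecules(example))
```

Here geneA has four reads but only two distinct molecules, which is the number that reflects the starting material.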

42 Synthetic Spike Ins: ERCC spike in mix
- QC of library prep
- Normalization (internal standards)
- Advanced normalization (transcript lengths)
- Sample identity verification

43 ERCC spike-ins. RNA-seq spike-in standards: 92 polyadenylated transcripts that mimic natural eukaryotic mRNAs, with a wide range of lengths (250 to 2,000 nucleotides) and GC-contents (5 to 51%)

44 ERCC spike-in. RNA-seq spike-in standards, used to test: dynamic range and lower limit of detection; fold-change response

45 ERCC spike-ins
- Relative but NOT absolute expression: dosage is a big problem
- To recover differences of 2:3, 90 million reads are needed
- Protocols are reproducible but not accurate (10x differences of ERCC transcripts between protocols)
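A common way to assess spike-in behaviour is a dose-response fit: regress log2 observed abundance on log2 known input concentration. A slope near 1 with a high R² indicates a linear response across the dynamic range. The sketch below uses a plain least-squares fit; the input values are hypothetical, and real analyses usually also handle undetected transcripts, which are simply skipped here.

```python
import math


def dose_response(expected, observed):
    """Least-squares fit of log2(observed) on log2(expected).

    Returns (slope, intercept, r_squared). Pairs with a non-positive
    value are skipped because their log is undefined.
    """
    pairs = [(math.log2(e), math.log2(o))
             for e, o in zip(expected, observed) if e > 0 and o > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two detected spike-ins")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("expected concentrations must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical concentrations and read counts for four spike-ins.
print(dose_response([1, 2, 4, 8], [12, 19, 44, 75]))
```

The intercept differs between protocols, which is one way the "reproducible but not accurate" behaviour shows up when the same spike-in mix is run through different library preps.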

46 THIRD GENERATION DNA SEQUENCING: Single Molecule Real Time (SMRT) sequencing
- Sequencing of a single DNA molecule by a single polymerase
- Very long reads: average reads over 8 kb, up to 30 kb
- High error rate (~13%)
- Complementary to the short, accurate reads of Illumina

47 70 nm aperture Zero Mode Waveguide

48 Damien Pelt

49 First Sequencing of CGG-repeat Alleles in Human Fragile X Syndrome using PacBio RS Sequencer. Paul Hagerman, Biochemistry and Molecular Medicine, SOM.
- Single-molecule sequencing of pure CGG array, a first for a disease-relevant allele. Loomis et al. (2012) Genome Research. Applicable to many other tandem repeat disorders.
- Direct genomic DNA sequencing of methyl groups: direct epigenetic sequencing (paper under review).
- Discovered 100% bias toward methylation of 20 CGG-repeat allele in female, first direct methylated DNA sequencing in human CGG disease.
- DoD STTR award with PacBio. Basis of R01 applications.
- Figure labels: 36 CGG 95; bases C, A, G, T by nucleotide position.

50 Iso-Seq (PacBio)
- Sequence full length transcripts, no assembly
- High accuracy (except very long transcripts)
- More than 95% of genes show alternate splicing
- On average more than 5 isoforms/gene
- Precise delineation of transcript isoforms (PCR artifacts? chimeras?)

51 Thank you!