Using New ThiNGS on Small Things. Shane Byrne

Next Generation Sequencing New Things Small Things

NGS: Next Generation Sequencing = 2nd generation of sequencing
454 GS FLX, SOLiD, GAIIx, HiSeq, MiSeq, Ion Torrent, Ion Proton

Inside the boxes
Fluorescence
Detecting H+

3rd Generation Sequencing
Oxford Nanopore GridION, Helicos, PacBio RS

Nanopore sequencing (3rd Generation Sequencing)

Economics of Sequencing

Sequencing Approaches
Shotgun: sequence the fragment library, then put the reads back together, either by assembling with the help of reference sequence(s) as a scaffold for aligning, or by de novo assembly.
Multiple templates: cDNA from total RNA, to sequence all transcripts present in a sample. Match to a database and subtract what is not required.
Amplicon/targeted sequencing: a PCR step first selects the target material for sequencing. Multiplex with barcode tags.
Starting samples may contain single or multiple organisms.

Sequencing Basics
Library preparation: preparation of fragments/amplicons/cDNA for sequencing.
Adaptors are added to each end of every fragment.
Adaptors can also carry barcodes (indices): a specific barcode for each sample, so that libraries can be recombined.
The library or libraries are applied to the flow cell.
Illumina: adaptors bind to the flow cell, forming clusters.
Other NGS systems: libraries are loaded onto beads/emulsion.
Improvements in cluster density increase sequencing output.
Bridge amplification and fluorescence reads occur in the clusters and are captured by optics.

Example of Applications

Coverage
NGS generates multiple sequencing reads over each genetic region; this is termed coverage.
Coverage gives NGS its read depth, which is leveraged to find minor sequence variation.
Deep sequencing means using a lot of coverage across an area of interest.
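
To make the idea of read depth concrete, here is a minimal Python sketch. It is not from the slides: the function names and the toy reads are hypothetical, and reads are assumed to be already aligned and given as (start, length) pairs.

```python
def depth_per_position(region_length, reads):
    """Return the read depth at each position of a region.

    reads: iterable of (start, length) pairs with 0-based start positions
    (an assumed, simplified representation of aligned reads).
    """
    diff = [0] * (region_length + 1)
    for start, length in reads:
        if start < 0 or start >= region_length:
            continue  # read starts outside the region, ignore it
        end = min(start + length, region_length)
        diff[start] += 1   # depth rises where the read begins
        diff[end] -= 1     # and falls just after it ends
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Toy example: three overlapping 5 bp reads on a 10 bp region.
toy_depth = depth_per_position(10, [(0, 5), (2, 5), (4, 5)])
```

In this toy case the depth peaks where all three reads overlap. The same average-coverage arithmetic applies to a whole run, for example one million 250 bp reads over a 5 Mb genome.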

Coverage vs Multiplexing
As an example: sequencing 1 organism in a run gives 1000x coverage, but only 50x coverage is needed for acceptable quality of the sequencing data.
Could multiplex 20 organisms into the run: now get 50x coverage each, and more results for your $.
For deep sequencing, need to determine the coverage required to detect minor variants.
NGS can discern mutants at 1% prevalence.
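
The trade-off above is simple division, sketched here in Python. The function names are hypothetical, and the sketch assumes the run output is shared evenly between samples of equal genome size, which real pooled runs only approximate.

```python
def coverage_per_sample(single_sample_coverage, n_samples):
    """Coverage each sample gets when a run is split evenly between samples."""
    return single_sample_coverage / n_samples


def max_samples(single_sample_coverage, target_coverage):
    """How many samples fit in one run at the target coverage."""
    return int(single_sample_coverage // target_coverage)


def expected_variant_reads(depth, variant_fraction):
    """Average number of reads expected to carry a minor variant."""
    return depth * variant_fraction


per_sample = coverage_per_sample(1000, 20)      # the slide's example
n_fit = max_samples(1000, 50)
deep = expected_variant_reads(1000, 0.01)       # 1% variant at 1000x
shallow = expected_variant_reads(50, 0.01)      # 1% variant at 50x
```

On average a 1% variant is carried by about 10 reads at 1000x but by less than one read at 50x, which is why deep sequencing and heavy multiplexing pull in opposite directions.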

Multiplexing (Indices/Barcodes)
Adaptors = adaptors with a specific index sequence included.
Computer processing pulls all related sequences back together.
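
The "pulling back together" step (demultiplexing) can be sketched as follows. This is a toy illustration, not any vendor's software: it assumes the barcode is simply the first few bases of each read and must match exactly, and all names and sequences are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group pooled reads by sample using an exact-match barcode prefix."""
    bins = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]  # sequence with the barcode removed
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Hypothetical barcodes and pooled reads.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
pooled = ["ACGTTTGACC", "TGCAGGATCA", "ACGTCCATGA", "GGGGAAAAAA"]
by_sample = demultiplex(pooled, barcodes, barcode_length=4)
```

Reads whose barcode is not recognised go into an "undetermined" bin rather than being assigned to a sample.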

Terms and Numbers
Sequence yield/run (Mb, Gb)
Reads/run (thousand, million, billion)
Sequence yield / reads = read length (bp)
Cost/run is usually fixed: MiSeq $1,184 for sequencing reagents for a run, plus $ sample prep costs for library generation.
Need to have enough sample volume to start a run, or it is costly.
Assembling bacterial genomes (3-5 Mb) is much easier than the human genome (3 Gb): 5 mins on a lab computer vs 24 hours on a high-end server.
Viral genomes: 2 kb-2 Mb
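
A small Python sketch of these relationships. The function names are hypothetical; the $1,184 reagent cost is the figure quoted above, while the yield and read count in the example are made-up round numbers, not instrument specifications.

```python
MISEQ_RUN_REAGENT_COST = 1184.0  # USD per run, as quoted on the slide


def mean_read_length_bp(yield_bp, n_reads):
    """Read length = sequence yield / number of reads."""
    return yield_bp / n_reads


def reagent_cost_per_sample(run_cost, n_samples):
    """Fixed run cost shared between samples (sample prep not included)."""
    return run_cost / n_samples


# Illustrative (assumed) run: 7.5 Gb of sequence from 25 million reads.
read_len = mean_read_length_bp(7_500_000_000, 25_000_000)

# The fixed run cost falls per sample as more samples share the run.
cost_1 = reagent_cost_per_sample(MISEQ_RUN_REAGENT_COST, 1)
cost_48 = reagent_cost_per_sample(MISEQ_RUN_REAGENT_COST, 48)
```

Because the run cost is fixed, a run with only a few samples is expensive per sample, which is the point of needing volume before starting one.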

Quality: Phred Q Score
Phred scores describe the quality of the sequencing output: Q10, Q20, Q30 etc.
Q30 quality (about 1 error in 1,000 bases) is considered almost perfect, with virtually no errors/ambiguities.
NGS: ~Q30 is the benchmark.
Sanger = Q20 max
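
The Phred scale is logarithmic: Q = -10 log10(P), where P is the probability that a base call is wrong. The conversion can be sketched in Python as below (the function names are hypothetical):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for a base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Error probabilities for the scores named on the slide.
for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.1f}%")
```

Each step of 10 on the Q scale is a tenfold drop in error rate, so Q20 means roughly 1 error in 100 bases and Q30 roughly 1 in 1,000.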

Time/Quality/Yield/Reads

Quality

Visual genome size comparison: 24 hrs on a 4000-core server vs 5 mins on a 6-core, 32 GB lab PC

Assembly

More coverage improves the contig sizes
The more coverage you generate, the easier it is to piece it all back together.
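
One way to see why more coverage helps is the idealized Lander-Waterman (Poisson) model, which is not from the slides and is used here only as a rough sketch. It assumes reads land uniformly at random and ignores repeats and sequencing bias; under it, the chance that a given base is missed by every read at average coverage c is e^(-c). The function names are hypothetical.

```python
import math


def expected_uncovered_fraction(coverage):
    """Fraction of bases expected to have zero reads (Poisson model)."""
    return math.exp(-coverage)


def expected_uncovered_bases(genome_size_bp, coverage):
    """Expected number of bases with no read covering them."""
    return genome_size_bp * expected_uncovered_fraction(coverage)


# Gaps in a 5 Mb genome shrink rapidly as coverage grows.
for c in (1, 5, 10, 50):
    gap = expected_uncovered_bases(5_000_000, c)
    print(f"{c}x coverage: ~{gap:.3g} uncovered bases expected")
```

Fewer uncovered stretches means fewer breaks between contigs, although in practice repeats and uneven coverage limit contig size well before this model does.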

Computing Overheads Increase as Coverage Increases
As coverage increases, the memory demand of the program assembling the data increases.

Virology and NGS
Detection of Unknown Viral Pathogens and Discovery of Novel Viruses
Detection of Tumour Viruses
Characterization of the Human Virome
Full-Length Viral Genome Sequencing
Investigation of Viral Genome Variability and Characterization of Viral Quasispecies
Monitoring Antiviral Drug Resistance (1% population level)
Epidemiology of Viral Infections and Viral Evolution
Quality Control of Live-Attenuated Viral Vaccines

Bacteriology and NGS
Bacterial whole genome sequencing
Epidemiology
Typing
Microbiome analysis / dysfunction
Treatment applications
Ability to detect all unknown bacteria in a sample with high sensitivity
Culture-independent methodology
Antimicrobial resistance mutants at low level
Resistome
Forensics
Microbiome signatures

Metagenomic Workflow 16S

Amplicon Sequencing
Multiple amplicons: targeted sequencing
Tumour somatic mutation genes
Bacterial virulence genes, specific identification targets
96 samples x 12 targets = $48/patient

Pathogen Transcript Detection

Subtraction Pathogen Transcript Detection

Costs for multiple bacterial genomes
MiSeq V3: 48 whole 5 Mb bacterial genomes at 50x coverage per run
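
The sequence yield this plan asks of a single run can be checked with a short Python sketch. The function name is hypothetical, and the calculation assumes every genome is exactly 5 Mb and that reads are shared evenly.

```python
def required_yield_bp(n_genomes, genome_size_bp, target_coverage):
    """Total bases a run must deliver to cover every genome at the target depth."""
    return n_genomes * genome_size_bp * target_coverage


needed_bp = required_yield_bp(48, 5_000_000, 50)
needed_gb = needed_bp / 1e9
print(f"Required yield: {needed_gb:.0f} Gb per run")
```

So the run has to deliver about 12 Gb of usable sequence, before allowing for low-quality reads or uneven pooling.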

Emerging from the surroundings

What exactly is in the surroundings?
Microbiome analysis using NGS is proving very interesting.
Moving from waiting to looking.
Viral metagenomic analyses of environmental samples suggest that the field of virology has explored less than 1% of the extant viral diversity. In the last decade, the culture-independent and sequence-independent metagenomic approach has permitted the discovery of many viruses in a wide range of samples. Phylogenetically, some of these viruses are distantly related to previously discovered viruses. In addition, 60-99% of the sequences generated in different viral metagenomic studies are not homologous to known viruses.

Water
The water looks good.
How many viruses/litre of sea water? = 100 billion
How many bacteria + protists/litre of sea water? = 10 billion
How many viruses on Earth? = 10^31

Soil
How many viruses/g of soil? = 1.2 x 10^9
How many bacteria/g of soil? = 100 million - 3 billion

Potential Pathogen Pyramid

Viral Discovery