Bioinformatics: A perspective

Dr. Matthew L. Settles
Genome Center, University of California, Davis
settles@ucdavis.edu

Outline The World we are presented with Advances in DNA Sequencing Bioinformatics as Data Science Viewport into bioinformatics Training Suggestions

Sequencing Costs
[Figure: two panels plotting cost in dollars (log scale) against year, 2002 to 2014. Left panel: cost per megabase of sequence, annotated $0.05/Mb. Right panel: cost per human-sized genome at 30x (Illumina), annotated $4211 per human-sized (30x) genome.]
Includes: labor, administration, management, utilities, reagents, consumables, instruments (amortized over 3 years), informatics related to sequence production, submission, indirect costs.
http://www.genome.gov/sequencingcosts/

Growth in Public Sequence Databases
[Figure: two panels plotting bases (log scale) against year. Left panel: GenBank and GenBank WGS, 1985 to 2015, with WGS annotated as > 1 trillion bp. Right panel: Sequence Read Archive, 2008 to 2014, annotated as > 1 quadrillion bp.]
http://www.ncbi.nlm.nih.gov/genbank/statistics
http://www.ncbi.nlm.nih.gov/traces/sra/

Increase in Genome Sequencing Projects
JGI Genomes Online Database (GOLD): 67,822 genome sequencing projects
Lists > 3700 unique genera

Sequencing Platforms
1986: Dye-terminator Sanger sequencing. The technology dominated until next generation sequencers arrived in 2005, peaking at about 900 kb/day.

Next Generation
2005: Next generation sequencing, also described as massively parallel sequencing, brought advances in both throughput and speed. The first platform was the Genome Sequencer (GS) instrument developed by 454 Life Sciences (later acquired by Roche). Pyrosequencing, 1.5 Gb/day. Discontinued.

Illumina
2006: The second next generation sequencing platform was Solexa (later acquired by Illumina). It is now the dominant platform, with 75% market share of sequencers, and an estimated >90% of all bases sequenced come from an Illumina machine. Sequencing by synthesis, > 200 Gb/day.

Complete Genomics
2006: Using DNA nanoball sequencing, Complete Genomics has been a leader in human genome resequencing, having sequenced over 20,000 genomes to date. It was purchased by BGI in 2013 and is now set to release its first commercial sequencer, the Revolocity. Throughput is on par with HiSeq. Human genomes/exomes only. 10,000 human genomes per year.

Bench top Sequencers
Roche 454 Junior
Life Technologies Ion Torrent, Ion Proton
Illumina MiSeq

The Next Next Generation
2009: Single Molecule Real Time (SMRT) sequencing by Pacific Biosciences, the most successful third generation sequencing platform, ~2 Gb/day.

Oxford Nanopore
2015: Another 3rd generation sequencer. The company was founded in 2005 and the sequencer is currently in beta testing. It uses nanopore technology developed in the 90s to sequence single molecules. Throughput is about 500 Mb per flowcell.
FYI: 4th generation sequencing is being described as in-situ sequencing.

Flexibility
Genomic reduction allows for greater coverage and multiplexing of samples. You can fine-tune your depth of coverage needs and sample size with the reduction technique.
[Figure: sequencing approaches arranged along a depth-of-coverage axis from 1X to 100000X, running from single samples to greater multiplexing. Labels: Whole Genome, Capture Techniques, Reduction Techniques, RADseq, Fluidigm Access Array, Amplicons, Few or Single Amplicons (1 kb).]
[Figure: library layout around the DNA sequence. Labels: Read 1 (50-300bp), Read 2 (50-300bp), Read 2 primer, Barcode (8bp), Barcode Read primer.]
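The trade-off between depth of coverage and multiplexing described on this slide can be sketched with the usual back-of-the-envelope relation: mean depth = (number of reads x read length) / target size. The snippet below is a minimal illustration, not part of the original slides. The function names and the example run size (400 million reads of 150 bp) are assumptions chosen for illustration, and the calculation assumes every read lands on target and that samples are pooled evenly, which real libraries never quite achieve.

```python
def mean_depth(total_reads, read_length_bp, target_size_bp, samples=1):
    """Expected mean depth of coverage per sample.

    Idealized: assumes all reads map to the target and samples are pooled evenly.
    """
    if target_size_bp <= 0 or samples <= 0:
        raise ValueError("target_size_bp and samples must be positive")
    reads_per_sample = total_reads / samples
    return reads_per_sample * read_length_bp / target_size_bp


def max_samples(total_reads, read_length_bp, target_size_bp, desired_depth):
    """Largest number of samples that can share a run at the desired mean depth."""
    if target_size_bp <= 0 or desired_depth <= 0:
        raise ValueError("target_size_bp and desired_depth must be positive")
    bases_available = total_reads * read_length_bp
    bases_per_sample = target_size_bp * desired_depth
    return int(bases_available // bases_per_sample)


# Hypothetical run: 400 million reads of 150 bp (illustrative numbers only).
RUN_READS = 400_000_000
READ_LEN = 150

# A single 3 Gb genome on the whole run.
whole_genome_depth = mean_depth(RUN_READS, READ_LEN, 3_000_000_000)

# A 1 kb amplicon sequenced to 10000X per sample.
amplicon_samples = max_samples(RUN_READS, READ_LEN, 1_000, 10_000)

print(f"Whole genome, 1 sample: {whole_genome_depth:.1f}X")
print(f"1 kb amplicon at 10000X: up to {amplicon_samples} samples")
```

In practice the number of samples is also capped by the number of distinct barcodes available, so the second figure is an upper bound on paper rather than a recommendation.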

Sequencing Libraries
DNA-seq, RNA-seq, Amplicons, ChIP-seq, MeDIP-seq, RAD-seq, ddRAD-seq, Pool-seq, EnD-seq, DNase-seq, ATAC-seq, MNase-seq, FAIRE-seq, Ribose-seq, smRNA-seq, mRNA-seq, Tn-seq, QTL-seq, tagRNA-seq, PAT-seq, Structure-seq, MPE-seq, STARR-seq, Mod-seq, BrAD-seq, SLAF-seq, G&T-seq

omicsmaps.com

The data deluge
Plucking the biology from the noise

Reality
It's much more difficult than we may first think

The real cost of sequencing
[Figure: relative share of total cost (0% to 100%) for sample collection and experimental design, sequencing, data reduction, data management, and downstream analyses at three time points: Pre-NGS (approximately 2000), Now (approximately 2010), Future (approximately 2020).]
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Bioinformatics is Data Science
[Figure: diagram relating Biology, Computational Biology, Biostatistics, Math, Statistics, Computer Science, and Bioinformatics.]
"The data scientist role has been described as part analyst, part artist." Anjul Bhambhri, vice president of big data products at IBM

Data Science
Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.
Five Fundamental Concepts of Data Science, statisticsviews.com, November 11, 2013, by Kirk Borne

7 Stages to Data Science
1. Define the question of interest
2. Get the data
3. Clean the data
4. Explore the data
5. Fit statistical models
6. Communicate the results
7. Make your analysis reproducible

1. Define the question of interest
Begin with the end in mind!
What is the question?
How are we to know we are successful?
What are our expectations?
The question dictates:
the data that should be collected
the features being analyzed
which algorithms should be used

2. Get the data
3. Clean the data
4. Explore the data
Know your data!
Know what the source was
Know the technical processing involved in producing the data (bias, artifacts, etc.)
Data profiling
Data are never perfect, but love your data anyway!
The collection of massive data sets often leads to unusual, surprising, unexpected and even outrageous.

5. Fit statistical models
Overfitting is a sin against data science!
Models should not be over-complicated.
If the data scientist has done their job correctly, the statistical models don't need to be incredibly complicated to identify important relationships.
In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer.

6. Communicate the results
7. Make your analysis reproducible
Remember that this is science!
We are experimenting with data selections, processing, algorithms, ensembles of algorithms, measurements, models. At some point these must all be tested for validity and applicability to the problem you are trying to solve.

Data science done well looks easy, and that's a big problem for data scientists
simplystatistics.org, March 3, 2015, by Jeff Leek

Data Science Bias
Data science (data analysis, bioinformatics) is most often taught through an apprentice model.
Different disciplines/regions develop their own subcultures, and decisions are based on cultural conventions rather than empirical evidence:
Programming languages
Statistical models (Bayesian vs frequentist)
Multiple testing correction
Application choice, etc.
These (and other) decisions matter a lot in data analysis.
"I saw it in a widely-cited paper in journal XX from my field"
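To make concrete why a choice such as multiple testing correction matters, the sketch below applies two widely used corrections, Bonferroni and Benjamini-Hochberg, to the same p-values and counts how many tests each calls significant. It is an illustration only and not from the original slides: the function names are hypothetical and the p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject a test when p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects the
    k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]

n_uncorrected = sum(p <= 0.05 for p in pvals)
n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))

print(f"uncorrected: {n_uncorrected}, Bonferroni: {n_bonferroni}, BH: {n_bh}")
```

With these inputs the three conventions report 5, 1, and 3 significant tests respectively, so the "standard practice" of a given field can change the conclusion drawn from identical data.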

The Data Science in Bioinformatics
Bioinformatics is not something you are taught, it's a way of life.
"The best bioinformaticians I know are problem solvers: they start the day not knowing something, and they enjoy finding out (themselves) how to do it. It's a great skill to have, but for most, it's not even a skill: it's a passion, it's a way of life, it's a thrill. It's what these people would do at the weekend (if their families let them)."
Mick Watson, Roslin Institute

Models
Workshops: often enrolled too late
Collaborations: more experienced persons
Apprenticeships: previous lab personnel to young personnel
Formal education: most programs are post-doc or graduate level, few undergraduate

Substrate Cloud Computing Cluster Computing BAS TM LINUX Laptop & Desktop

Environment Command Line and Programming Languages VS Bioinformatics Software Suite

Quality Assurance
In high-throughput biological work (sequencing, HT genotyping, etc.), what may seem like small technical artifacts introduced during experimentation, sample extraction and/or preparation can lead to large changes, or technical bias, in the data.
This is not to say that such artifacts don't occur in smaller-scale analyses such as Sanger sequencing or qRT-PCR, but in high-throughput work they become more apparent and may cause significant issues during analysis.

Sample Preparation
Prepare more samples than you are going to need, i.e. expect that some will be of poor quality or will fail.
Preparation stages should occur across all samples at the same time (or as close as possible) and by the same person.
Spend time practicing a new technique to produce the highest quality product you can, reliably.
Quality and quantity should be established using fragment analysis traces (pseudo-gel images).
DNA/RNA should not be degraded.
260/280 ratios for RNA should be approximately 2.0 and 260/230 should be between 2.0 and 2.2. Values over 1.8 are acceptable.
BE CONSISTENT ACROSS ALL SAMPLES!!!
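The absorbance-ratio guidance above is easy to apply consistently across a batch if it is written down as a rule. The following is a minimal sketch, not a validated QC procedure. The function name, the sample names and readings, and the tolerance of 0.1 used to interpret "approximately 2.0" are all assumptions made for illustration.

```python
def classify_rna_purity(ratio_260_280, ratio_260_230, tolerance=0.1):
    """Classify an RNA sample from its absorbance ratios.

    "ideal":      260/280 within `tolerance` of 2.0 and 260/230 between 2.0 and 2.2
    "acceptable": both ratios over 1.8
    "review":     anything else
    The tolerance is an assumed reading of "approximately 2.0".
    """
    is_ideal = (abs(ratio_260_280 - 2.0) <= tolerance
                and 2.0 <= ratio_260_230 <= 2.2)
    if is_ideal:
        return "ideal"
    if ratio_260_280 > 1.8 and ratio_260_230 > 1.8:
        return "acceptable"
    return "review"


# Hypothetical spectrophotometer readings: (260/280, 260/230).
samples = {
    "S1": (2.02, 2.10),
    "S2": (1.90, 1.85),
    "S3": (1.60, 2.10),
}

results = {name: classify_rna_purity(r280, r230)
           for name, (r280, r230) in samples.items()}

for name, status in results.items():
    print(name, status)
```

Applying one rule to every sample, rather than judging each by eye, is one small way to stay consistent across all samples.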

Bioinformatics
Know and understand the experiment: the question of interest.
Build a set of assumptions/expectations: a mix of technical and biological.
Spend your time testing your assumptions/expectations.
Don't spend your time finding the best software.
Don't underestimate the time bioinformatics may take.
Be prepared to accept failed experiments.

Bottom Line
Spend the time (and money) planning and producing good quality, accurate and sufficient data for your experiment.
Get to know your data; develop and test expectations.
As a result, you'll spend much less time (and less money) extracting biological significance and results during analysis.

Questions