1000 Genomes project: from mapping reads to de novo mutations


1 1000 Genomes project: from mapping reads to de novo mutations
Mark A. DePristo, Manager, Genome Sequencing and Analysis Group, Medical and Population Genetics Program, Broad Institute of Harvard and MIT
December 3, 2009

2 Acknowledgments
Quality score recalibration, local realignment, variation discovery, and de novo mutations: Anthony Philippakis, Andrew Kernytsky, Matt Hanna, Eric Banks, Andrey Sivachenko, Jared Maguire, Kiran Garimella, Manny Rivas, Michael Melgar; Matt Hurles and Philip Awadalla at the Sanger
Other contributors:
The entire genome sequencing and analysis group, especially the GSA software engineering team: Matt Hanna and Aaron McKenna
MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly
Carrie Sougnez, production teams and folks at 320 and 7CC
The SAM/BAM working group: Bob Handsaker, Tim Fennell, Heng Li, and Richard Durbin
The cancer genome analysis group: Gad Getz, Kristian Cibulskis, Andrey Sivachenko
The IGV team: Jim Robinson and Helga Thorvaldsdóttir
Production informatics: Tim Fennell and Alec Wysoker
The 1000 genomes project

3 Agenda: Introduction to the 1000 genomes project; Mapping and alignment; SAM/BAM format; Visualizing the data; The Genome Analysis Toolkit (the infrastructure supporting our tools for working with next generation sequencing data); Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

4 The 1000 genomes project is characterizing common genetic variation with MAF >1% in three populations
Pilot 1: ~150 individuals, whole genome sequenced to 4x depth. Applies a multi-sample generalization of the single-sample approach in Pilot 2 (method not discussed in detail). Data production and analysis: ~ 17M SNPs, ~ 2 10M short indels
Pilot 2: Two children and their parents, whole genome sequenced to ~70x. Data production and analysis: ~ 3 5M SNPs, ~ K short indels
Pilot 3: 1000 genes in ~400 individuals to ~50x depth. Applies the same SNP and indel calling methods as Pilot 2 (method not discussed in detail). Data production and analysis: ~ 10K SNPs, ~ 1000 short indels

5 Data for the project comes from many centers and several technologies
[Figure: participating centers and sequencing technologies, marked as "Added for production phase" or "For pilot phase only"]
Slide courtesy of Carrie Sougnez

6 The pilot phase alone has generated ~5 Tb of sequence
[Table: number of samples and amount of Illumina and SOLiD sequence, with totals, for Pilot 1, Pilot 2, and Pilot 3; values not preserved]
Slide courtesy of Carrie Sougnez

7 Agenda: Introduction to the 1000 genomes project; Mapping and alignment; SAM/BAM format; Visualizing the data; The Genome Analysis Toolkit (the infrastructure supporting our tools for working with next generation sequencing data); Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

8 From unmapped reads to true genetic variation in next generation sequencing data
Raw short reads (Solexa, SOLiD, 454): a single run of a sequencer generates ~50M ~75bp short reads for analysis
Mapping and alignment: the origin of each read in the human reference genome sequence is found
Quality calibration and annotation: the quality of each read is calibrated and additional information is annotated for downstream analyses
Identifying genetic variation: SNPs and indels relative to the reference are found where the reads collectively provide evidence of a variant

9 Finding the true origin of each read is a computationally demanding and important first step
An enormous pile of short reads from NGS goes into a mapping and alignment algorithm, which places each read on the reference genome and writes SAM/BAM files. The algorithm:
Detects the correct origin of reads and flags them with high certainty
Detects ambiguity in the origin of reads and flags them as uncertain
Aligners by technology:
Solexa: MAQ. Robust, accurate gold standard aligner for NGS, developed by Li and Durbin. Soon to be replaced by BWA, also by Li and Durbin
454: SSAHA. Hash-based aligner with high sensitivity and specificity with longer reads
SOLiD: Corona. ABI-designed tool for aligning in color space

10 The SAM file format
Data sharing was a major issue with the 1000 genomes: each center, technology and analysis tool used its own idiosyncratic file formats, so no one could exchange data
The Sequence Alignment/Map (SAM) file format was designed to capture all of the critical information about NGS data in a single indexed and compressed file
It is becoming a standard and is now used by production informatics, MPG, and cancer analysis groups at the Broad
It has enabled sharing of data across centers and the development of tools that work across platforms
More info at http://samtools.sourceforge.net/
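To make the format concrete, here is a minimal Python sketch that parses the eleven mandatory tab-separated columns of a SAM alignment line. The `SamRecord` class, the helper names, and the example read are illustrative assumptions, not part of samtools or the GATK; optional tags and header lines are ignored.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    """The 11 mandatory SAM columns, in file order."""
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # ASCII-encoded base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one alignment line (optional tags are dropped)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_reverse_strand(rec: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Made-up example line, for illustration only.
example = "read1\t16\tchr1\t559844\t60\t5M\t*\t0\t0\tCCCTG\t4;6@@"
rec = parse_sam_line(example)
```

Because every center writes the same columns, a tool built on a parser like this works regardless of which sequencer or aligner produced the reads.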

11 What does the data actually look like?
[Figure: a screenshot of IGV at chr5:112mb with 454, SLX, and SOLiD tracks showing coverage, non-reference bases, and individual reads]
All the 1000 genomes data can be viewed easily with IGV http://

12 Agenda: Introduction to the 1000 genomes project; Mapping and alignment; SAM/BAM format; Visualizing the data; The Genome Analysis Toolkit (the infrastructure supporting our tools for working with next generation sequencing data); Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

13 The GATK is a structured programming framework that aims to simplify writing analysis tools for resequencing data
The framework is designed to support most common paradigms of analysis algorithms
It provides structured access to reads in SAM format, reference context, and reference-associated metadata
General purpose: optimized for ease of use and completeness of functionality within scope
Efficient: engineering investment in the performance of critical data structures and manipulation routines
Convenient: a structured plug-in model makes developing against the framework relatively pain-free

14 The functional programming paradigm
The GATK follows a common functional programming paradigm called map and reduce:
    reduce( g, map( f, list ), init )  ## python
    Object result = init; for ( Object x : list ) result = g( result, f(x) );  // java
    (reduce g (map f list))  ;; scheme
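The one-liners above can be made runnable. The sketch below shows the Python form next to the equivalent explicit loop; `map_reduce`, `map_reduce_loop`, and the toy `reads` list are names chosen for illustration only.

```python
from functools import reduce


def map_reduce(f, g, items, init):
    """Apply f to every item, then fold the mapped values with g, starting from init."""
    return reduce(g, map(f, items), init)


def map_reduce_loop(f, g, items, init):
    """The same computation written as the explicit loop from the Java one-liner."""
    result = init
    for x in items:
        result = g(result, f(x))
    return result


# Toy example: total number of bases across three made-up reads.
reads = ["ACGT", "GGC", "TTAGA"]
total_bases = map_reduce(len, lambda acc, n: acc + n, reads, 0)
```

Here `f` is `len` (the map step, independent per read) and `g` is addition (the reduce step, which depends on all mapped values).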

15 The map / reduce framework
Map: function f is applied to each element of the list, X = f(x). Operations are independent of each other
Reduce: function r is recursively reduced over each f(x), R = r(A, R(B, ..., E)). The result depends on all sites
[Figure: data elements a b c d e are mapped to A B C D E and reduced to a single result R]

16 Many algorithms fit within the Map/Reduce framework
The idea behind Map/Reduce is to provide structured traversal and access to data
It separates the problem of accessing data from calculations on the elements in the data
Developers can provide powerful, intelligent, efficient traversal engines that implement the map operation
Analysts can easily write functions to analyze their data, and then map them across the data
Google popularized map/reduce; see Dean and Ghemawat, OSDI'04: Sixth Symposium on Operating Systems Design and Implementation
It is becoming so popular that there was a New York Times article about it on Tuesday, March 17th, 2009!

17 Map/Reduce over the genome
Fundamental data:
Reference: reference genome in fasta format
Reads: SAM format reads. Some traversal types may require reads to be aligned (by locus, for example)
Metadata: data associated with positions on the reference genome, e.g., dbSNP, exons

18 Map/Reduce by read
f(single read, covered reference seq, covered metadata)
Evaluated over each read, with reduce accumulating the results at every read
[Figure: reference metadata (dbSNP, exons), reference genome, and reads (maybe aligned), with f applied at each read]

19 Map/Reduce by loci
f(all reads covering locus, indices into reads yielding equivalent positions, covered reference seq, covered metadata)
Evaluated over each locus in the genome, with reduce accumulating the results at every locus
[Figure: reference metadata (dbSNP, exons), reference genome, and reads (maybe aligned), with f applied at loci i, j, k, l, m]
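A by-locus traversal can be sketched in a few lines of Python. This is a toy under stated assumptions, not the GATK engine: reads are `(start, bases)` tuples with 0-based starts and gapless alignments, metadata is omitted, and `traverse_by_locus` and `pileup` are hypothetical names.

```python
def traverse_by_locus(reference, reads, map_fn, reduce_fn, init):
    """Call map_fn at every reference position with the reads covering it,
    and fold the results with reduce_fn."""
    result = init
    for locus, ref_base in enumerate(reference):
        covering, offsets = [], []
        for start, bases in reads:
            if start <= locus < start + len(bases):
                covering.append((start, bases))
                offsets.append(locus - start)  # index into the read at this locus
        result = reduce_fn(result, map_fn(locus, ref_base, covering, offsets))
    return result


def pileup(locus, ref_base, reads, offsets):
    """Map function: the bases observed at this locus, one per covering read."""
    return (locus, ref_base, "".join(r[1][o] for r, o in zip(reads, offsets)))


reference = "AAGCGT"
reads = [(0, "AAGC"), (2, "GCGT"), (3, "TGT")]

# Reduce by collecting every pileup into a list.
piles = traverse_by_locus(reference, reads, pileup, lambda acc, x: acc + [x], [])

# A different analysis on the same engine: total depth of coverage.
depth = traverse_by_locus(reference, reads,
                          lambda locus, ref, rs, offs: len(rs),
                          lambda a, b: a + b, 0)
```

The traversal code is written once; each analysis only supplies a map function and a reduce function.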

20 The Genome Analysis Toolkit (GATK) enables rapid development of efficient and robust analysis tools
GATK infrastructure: the traversal engine is provided by the framework; the analysis tool is implemented by the user
Pipeline: Initial alignment (supports any BAM-compatible aligner) -> MSA realignment -> Q score recalibration -> Single sample genotyping -> SNP filtering
All of these tools have been developed in the GATK
They are memory and CPU efficient, cluster friendly, and easily parallelized
They are now publicly available and are being used at many sites around the world
More info: http://

21 The GATK engine already supports many advanced features

22 Pileup with dbSNP
Code: org/broadinstitute/sting/gatk/walkers/pileup.java (package, imports, etc. removed for presentation)
public class DepthOfCoverageWalker extends LociWalker<Integer, Integer> {
    public Integer map(List<ReferenceOrderedDatum> rodData, char ref, LocusContext context) {
        // Build bases and quals strings
        String bases = "";
        String quals = "";
        for ( int i = 0; i < context.getReads().size(); i++ ) {
            SAMRecord read = context.getReads().get(i);
            int offset = context.getOffsets().get(i);
            bases += read.getReadString().charAt(offset);
            quals += read.getBaseQualityString().charAt(offset);
        }
        // Build the dbSNP string
        String rodString = "";
        for ( ReferenceOrderedDatum datum : rodData ) {
            if ( datum != null && datum instanceof rodDbSNP ) {
                rodDbSNP dbsnp = (rodDbSNP)datum;
                rodString = "[ROD: " + dbsnp.toMediumString() + "]";
            }
        }
        System.out.printf("%s: %s %s %s %s%n", context.getLocation(), ref, bases, quals, rodString);
        return 1;
    }
}

23 Pileup with dbSNP II
CPU time: 10 secs. Max. memory: 1 GB
Command (-T gives the analysis name, -I the reads, -DBSNP the dbSNP track):
java -jar dist/genomeanalysistk.jar -T Pileup -I /broad/1kg/legacy_data/tcga-freeze3/tcga-freeze3-normal.bam -R /seq/references/homo_sapiens_assembly18/v0/homo_sapiens_assembly18.fasta -L chr1:559, ,848 -DBSNP /humgen/gsa-scr1/gatk_data/dbsnp_129_hg18.rod
Output:
Sort order is: coordinate
chr1:559844: C CCCCCCCCTGGCTCCCCCCCCCAGCCCTCCCCCCCACCCCCCCACCCCCCCCCCCCCCC 4;6@@2;?&'(8(-00=??6@31)@)<).@?6? 3/18?(=833.;(<?:@?9?>*95)>
chr1:559845: A AAAGACAAAAAAAAGAAAAAAAAAAAAAACAAAAAAAATAAAAAAAAAAAAAAA,>?&*(5(((8(??)@(>4@2<, 1>=9;8)30<)463((=,4?;??9>>*:5.>
chr1:559846: G AGAACAAAGAAAAAAACGAAAAGGCTAAGTAAAAAACGGGGGGGGGGGGG *&((5,((@?)@(5)?1;,.><>:.)50<#7/),(=/ 9?:<>8>=3/1(> [ROD: chr1: :rs :a/g:snp:hapmap:2hit]
chr1:559847: A AAAAAAAAAAAAAAAAAAAAAAAAACAAATAAAAAAAAAAAAAA 4:=@?)?(30@);).>>>:81>8<0#>09*>,4?>@>6>=7(3>
chr1:559848: A AAAAAAAAAAAAAAACAAAGAAATAAAAAAAAAAACAAAA )@()0@)=).9>1:7)>-<#4>)(>=/1??<>6>=659)>
[PROGRESS] Traversed 81 loci in 9.98 secs ( secs per 1M loci)
Traversal reduce result is 5
Ref chr1: is a heterozygous A/G site, consistent with hapmap

24 Tree reduce parallelism framework
[Figure: threads 1 to 4 each run single-thread work units of alternating MAP and REDUCE steps; the per-thread results are then combined by a tree reduce]
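The idea in the diagram can be sketched with Python's standard thread pool. This is an illustration of the pattern, not GATK code: `shard`, `tree_reduce`, and `parallel_map_reduce` are hypothetical names, and the sketch assumes `g` is associative and `init` is its identity element, since `init` is reused in every shard.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def shard(items, n):
    """Split items into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def tree_reduce(g, values):
    """Combine values pairwise, level by level, preserving their order."""
    values = list(values)
    while len(values) > 1:
        paired = [g(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element is carried to the next level
        values = paired
    return values[0]


def parallel_map_reduce(f, g, items, init, n_threads=4):
    """Each thread runs an ordinary map/reduce over its shard (the single-thread
    work unit); the partial results are then merged by tree_reduce."""
    def work_unit(chunk):
        return reduce(g, map(f, chunk), init)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(work_unit, shard(items, n_threads)))
    return tree_reduce(g, partials) if partials else init
```

For example, `parallel_map_reduce(lambda x: x * x, lambda a, b: a + b, list(range(10)), 0)` sums the squares of 0 through 9 across four shards.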

25 Automatic parallelization in the GATK
[Figure: execution time (walk time, s) versus number of parallel tasks for SMP on a single machine, distributed processing with 1 thread per node, and distributed processing with 4 threads per node]
Single sample genotyper on chr20, 30x SLX reads for NA12878 (1000 genomes)

26 Getting and using the GATK
Visit our wiki http://
It has developer documents describing how to build the system; read the hello reads tutorial
Download the binary jar as well as publicly available tools
Check out source from the SVN repository: https://svnrepos.broadinstitute.org/sting/

27 Core GATK development team
Mark DePristo, Matthew Hanna, Aaron McKenna
We are looking for feedback, bug reports, feature requests, brainstorming sessions, etc. to make the system as powerful and easy to use as possible
Please understand that the system is in active development: it's usable, but interfaces, functionality, etc., are continuously changing and improving

28 Agenda: Introduction to the 1000 genomes project; Mapping and alignment; SAM/BAM format; Visualizing the data; The Genome Analysis Toolkit (the infrastructure supporting our tools for working with next generation sequencing data); Tools developed in the GATK for calling SNPs and indels in the 1000 genomes pilot

29 Multiple sequence realignment
Read by read mapping introduces artifacts that can only be resolved by examining multiple reads within their local context
Inconsistent indels      before          after realignment
  Ref:                   AAGCGTCGAT      AAGCGTCGAT
  Read1:                 AAG---CGAT      AAG---CGAT
  Read2:                 GCGAT           G---CGAT
Cryptic indels           before          after realignment
  Ref:                   AAGCGTCGAT      AAGCGTCGAT
  Read1:                 AAGCGAT         AAG---CGAT
  Read2:                 GCGAT           G---CGAT
(Bases mismatching the reference are shown in red on the slide)
Pipeline: Initial alignment -> MSA realignment -> Q score recalibration -> Single sample genotyping -> SNP filtering

30 Local realignment identifies the most parsimonious alignment along all of the reads at a problematic locus
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)
2. The score for an alternate consensus is the total sum of the quality scores of mismatching bases
3. If the score of the best alternate consensus is sufficiently better than the original alignments (using a LOD score), then we accept the proposed realignment of the reads
[Figure: against Ref AAGCGTCG, a read pile can be read as three adjacent SNPs (consistent with the reference sequence) or as AAG---CG (consistent with a 3bp insertion); realigning determines which is better]
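Steps 2 and 3 can be illustrated with a small Python sketch. It is a simplification under loud assumptions, not the GATK realigner: each read is placed at its best gapless position on a candidate consensus, the decision uses the difference in summed mismatch qualities divided by 10 as a stand-in for the LOD score, and `min_improvement` is an arbitrary threshold. Reads are assumed to be no longer than the consensus.

```python
def mismatch_quality_sum(read, quals, consensus, start):
    """Sum of base qualities at positions where the read disagrees with the consensus."""
    window = consensus[start:start + len(read)]
    return sum(q for b, q, c in zip(read, quals, window) if b != c)


def best_placement_score(read, quals, consensus):
    """Lowest mismatch-quality sum over all gapless placements of the read."""
    starts = range(len(consensus) - len(read) + 1)
    return min(mismatch_quality_sum(read, quals, consensus, s) for s in starts)


def consensus_score(reads, consensus):
    """Score of a consensus: total mismatch quality over the whole read pile."""
    return sum(best_placement_score(r, q, consensus) for r, q in reads)


def should_realign(reads, reference, alt_consensus, min_improvement=5.0):
    """Accept the alternate consensus if it explains the pile sufficiently better.
    Dividing phred-scaled sums by 10 puts the difference in log10 units."""
    ref_score = consensus_score(reads, reference)
    alt_score = consensus_score(reads, alt_consensus)
    return (ref_score - alt_score) / 10.0 >= min_improvement


# Toy pile: two reads that both carry a 3bp deletion relative to the reference.
reference = "AAGCGTCGAT"
alt_consensus = "AAGCGAT"  # reference with the three bases "CGT" removed
pile = [("AAGCGAT", [30] * 7), ("GCGAT", [30] * 5)]
```

With these toy reads the reference scores 90 (three Q30 mismatches) and the alternate consensus scores 0, so the realignment is accepted.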

31 Local realignment uncovers the hidden indel in these reads and eliminates all the potential FP SNPs
[Figure: read pileup before and after local realignment]
Local realignment enabled us to find ~90% of short indels with ~70% specificity in a blind simulation assessment

32 Modeling the error process
An accurate error model is essential for reliable downstream analyses such as SNP calling: Pr{ observing base b | true genotype is G }
What is the probability that b (e.g., A) is actually some other base (e.g., C, G, or T)? This probability is encoded by the phred-scaled quality score
The quality scores reported by the Solexa, SOLiD, and 454 base callers are inaccurate
To correct them, we examine the aligned reads and use the reference mismatch rate at non-dbSNP sites to recalibrate the reported quality scores
We can also account for covariates of base errors, such as local sequence context and machine cycle, to identify subsets of higher quality bases
Pipeline: Initial alignment -> MSA realignment -> Q score recalibration -> Single sample genotyping -> SNP filtering
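A phred-scaled quality Q encodes an error probability p as Q = -10 log10(p). The Python sketch below shows the core of the recalibration idea, binning bases by reported Q and measuring the mismatch rate outside known variant sites. It is a minimal illustration, not the GATK recalibrator: the `(site, reported_q, is_mismatch)` observation tuples, the function names, and the +1/+2 smoothing are assumptions, and covariates such as sequence context and machine cycle are left out.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Phred-scaled quality for an error probability."""
    return -10.0 * math.log10(p_error)


def error_prob(q):
    """Error probability encoded by a phred-scaled quality."""
    return 10 ** (-q / 10.0)


def empirical_quality_table(observations, known_sites=frozenset()):
    """Map each reported Q to the empirical Q implied by the mismatch rate.

    observations: iterable of (site, reported_q, is_mismatch) tuples.
    Sites in known_sites (e.g. dbSNP positions) are skipped, because a
    mismatch there may be real variation rather than a machine error.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, bases]
    for site, q, mismatch in observations:
        if site in known_sites:
            continue
        counts[q][0] += int(mismatch)
        counts[q][1] += 1
    # +1 / +2 smoothing (an assumption of this sketch) avoids log10(0).
    return {q: phred((m + 1) / (n + 2)) for q, (m, n) in counts.items()}


# Toy data: 998 bases reported as Q30, 9 of which mismatch the reference,
# plus one mismatch at a known variant site that must be ignored.
observations = [(i, 30, i < 9) for i in range(998)] + [(5000, 30, True)]
table = empirical_quality_table(observations, known_sites={5000})
```

In this toy lane the bases reported as Q30 behave like Q20 bases, which is the kind of discrepancy recalibration corrects.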

33 Recalibration makes quality scores more accurate
[Figure: empirical Q score versus reported Q score (Q0 to Q40) for a 1000 genomes 454 lane, in two panels, initial and recalibrated; the recalibrated scores are a better fit and more informative]

34 Recalibration removes some error covariates
[Figure: difference between reported and empirical Q score (empirical minus reported quality, -10 to +10) by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG) for a 1000 genomes 454 lane, in two panels, initial and recalibrated; the covariates are corrected after recalibration]

35 Recalibration identifies high quality bases and improves SNP calls
1KG 454 lane                               Initial    Recalibration
No. bases in lanes                         80M        80M
Lane wide reported Q
Lane wide empirical Q
RMSE between Q reported and empirical      17,554     9,635
% of true Q25 bases                        89%        95%
% of true Q30 bases                        0%         53%
Identifies >50% of bases as true Q30
Results in ~10% more SNP calls at the same quality compared to unrecalibrated data

36 Bayesian SNP Caller for Pilot 2
Bayesian model: $L(G \mid D) = P(G)\,P(D \mid G)$, where $L(G \mid D)$ is the likelihood for the genotype, $P(G)$ is the prior for the genotype, and $P(D \mid G)$ is the likelihood of the data given the genotype
Prior genotype probabilities enforce variant expectation rates
Likelihood of data computed using the pileup of bases and associated quality scores at a given locus
$L(G \mid D)$ computed for all 10 genotypes
Confidence in call given by $\mathrm{lod} = \log_{10} \frac{L(G_{\text{best}} \mid D)}{L(G_{\text{ref}} \mid D)}$; a threshold of T = 5 is common
Pipeline: Initial alignment -> MSA realignment -> Q score recalibration -> Single sample genotyping -> SNP filtering
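The model can be sketched in Python as follows. This is an illustration, not the Pilot 2 caller: the per-base error model (an error is equally likely to produce any of the three other bases), the prior values in `genotype_priors`, and all function names are assumptions made for the example.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def base_likelihood(observed, allele, q):
    """P(observed base | true allele), with error probability e = 10^(-q/10)
    split evenly over the three other bases (an assumption)."""
    e = 10 ** (-q / 10.0)
    return 1.0 - e if observed == allele else e / 3.0


def log10_data_likelihood(pileup, genotype):
    """log10 P(D | G): each base is drawn from either allele with probability 1/2."""
    total = 0.0
    for base, q in pileup:
        p = 0.5 * base_likelihood(base, genotype[0], q) \
            + 0.5 * base_likelihood(base, genotype[1], q)
        total += math.log10(p)
    return total


def genotype_priors(ref, het=1e-3):
    """Illustrative priors (assumption): mass het on ref-containing hets,
    het/2 on the six remaining non-reference genotypes, the rest on hom-ref."""
    priors = {}
    for g in GENOTYPES:
        if g == ref * 2:
            priors[g] = 1.0 - 1.5 * het
        elif ref in g:
            priors[g] = het / 3.0
        else:
            priors[g] = het / 12.0
    return priors


def call_genotype(pileup, ref):
    """Return (best genotype, lod), where lod = log10 L(best|D) - log10 L(ref|D)."""
    priors = genotype_priors(ref)
    scores = {g: math.log10(priors[g]) + log10_data_likelihood(pileup, g)
              for g in GENOTYPES}
    best = max(scores, key=scores.get)
    return best, scores[best] - scores[ref * 2]


# Toy pileup: ten A and ten G bases, all Q30, at a site whose reference base is A.
het_call = call_genotype([("A", 30)] * 10 + [("G", 30)] * 10, "A")
```

Because base qualities enter the likelihood directly, inaccurate quality scores translate into miscalibrated lod values, which is why recalibration precedes genotyping in the pipeline.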

37 Filtering poor SNP calls in pilot 2
We use a battery of expectation tests to separate likely FP SNPs from our SNP calls
This is possible because erroneous SNP calls often result from recurring systematic errors
We flag a SNP as a likely FP if it exhibits unusual behavior, that is, if it:
Lies in a region of excessive depth of coverage
Occurs preferentially on a single strand
Has a skewed allelic balance
Lies in a region of poor read mapping
Occurs in very close proximity to other SNPs
Pipeline: Initial alignment -> MSA realignment -> Q score recalibration -> Single sample genotyping -> SNP filtering

38 Evaluating SNP call quality
Did I get the right number of calls?
The number of SNP calls should be close to the average human heterozygosity of 1 variant per 1000 bases
Only detects gross under/over calling
Concordance with hapmap chip results?
Often we have genotype chip data that indicates the hom ref, het, hom var status at millions of sites
Good SNP calls should be >99.5% consistent with these chip results, and >99% of the variable sites should be found
The chip sites are in the better parts of the genome, and so are not representative of the difficulties at novel sites
What fraction of my calls are already known?
dbSNP catalogs most common variation, so most of the true variants found will be in dbSNP
For single sample calls, ~90% of variants should be in dbSNP
Need to adjust expectation when considering calls across samples
Reasonable transition to transversion ratio (Ti/Tv)?
Transitions are twice as frequent as transversions (see Ebersberger, 2002)
Validated human SNP data suggests that the Ti/Tv should be ~2.1 genome wide and ~2.8 in exons
FP SNPs should have a Ti/Tv around 0.5
Ti/Tv is a good metric for assessing SNP call quality
[Figure: the bases A, C, G, T connected by transitions and transversions]
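The Ti/Tv metric is simple to compute, and a short Python sketch also shows where the 0.5 expectation for false positives comes from. The function names and the `(ref, alt)` tuple representation of a SNP are assumptions made for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T)."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ti = sum(1 for r, a in snps if is_transition(r, a))
    tv = sum(1 for r, a in snps if r != a and not is_transition(r, a))
    return ti / tv if tv else float("inf")


# Every possible substitution once: what uniformly random errors would look like.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
random_error_ratio = ti_tv_ratio(all_substitutions)
```

Each base has one possible transition and two possible transversions, so errors that pick the wrong base uniformly give 4 transitions to 8 transversions, a ratio of 0.5. A call set whose Ti/Tv falls well below ~2.1 therefore likely contains a substantial fraction of false positives.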

39 A quality score aware Bayesian SNP caller produces accurate SNP calls
Chromosome 1, NA12878 calls from Solexa only
All calls: 271K SNPs; dbSNP % 88%; Ti/Tv [value not preserved]; genotype chip concordance [value not preserved]% sensitivity / 99.9% specificity
We find 99.3% of the variable chip sites and call het / hom genotypes with 99.9% accuracy
The overall Ti/Tv is ~2.1, very close to expectation
1 / 884 variants per base, a bit higher than the 1 / 1000 expectation
The majority of our SNPs are at known sites, consistent with expectations
Novel calls: 30K calls; Ti/Tv = [value not preserved]
The Ti/Tv suggests a ~30% FP rate in this group
Calls from recalibrated, indel realigned Solexa NA12878 with LOD > 5

40 Consistency among SOLiD, 454, and Solexa reads enables an even more accurate set of calls
Chromosome 1, NA12878 calls requiring calls in Solexa and 454/SOLiD
All calls: 235K SNPs; dbSNP % 92%; Ti/Tv [value not preserved]; genotype chip concordance [value not preserved]% sensitivity / 99.9% specificity
We lose some sensitivity to find sites at hapmap
1 / 1052 variants, now very close to the 1/1000 expectation
Our dbSNP rate increased by 4%
Novel calls: 16K calls; Ti/Tv = 2.13
The novel calls are now as good as the SNPs at known sites
Calls from recalibrated, indel realigned NA12878 with LOD > 5

41 Using these concordant calls allows us to identify de novo mutations
Algorithm for identifying putative de novo mutations:
Dad: confident homozygous reference site
Mom: confident homozygous reference site
Daughter: novel SNP consistent in all three techs
De novo mutation calls from chr1 of NA12878 [Table: Broad and Sanger columns; only the entry "Putative de novo 156" is preserved]
This set includes 4 true de novo mutations!
Calls from recalibrated, indel realigned NA12878, NA12891, NA12892
Validation data courtesy of Matt Hurles and Philip Awadalla
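The filter described above reduces to a few boolean checks. The Python sketch below is a hypothetical rendering: the `Call` record, its `confident` flag (standing in for a passed LOD threshold), and the function names are assumptions, not the code used for the project.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Call:
    genotype: str    # unordered diploid genotype, e.g. "AA" or "AG"
    confident: bool  # True if the call passed the caller's confidence threshold


def is_hom_ref(call: Call, ref: str) -> bool:
    """Confident homozygous reference call."""
    return call.confident and call.genotype == ref * 2


def is_putative_de_novo(ref: str, dad: Call, mom: Call,
                        child_by_tech: Dict[str, Call], in_dbsnp: bool) -> bool:
    """Both parents confidently hom-ref, and the child carries the same
    confident, novel, non-reference genotype in every technology."""
    if in_dbsnp:
        return False  # only novel SNPs qualify
    if not (is_hom_ref(dad, ref) and is_hom_ref(mom, ref)):
        return False
    calls = list(child_by_tech.values())
    genotypes = {c.genotype for c in calls}
    return (all(c.confident for c in calls)
            and len(genotypes) == 1
            and genotypes != {ref * 2})


# Toy site: reference A, parents AA, child AG in all three technologies.
child = {tech: Call("AG", True) for tech in ("solexa", "454", "solid")}
candidate = is_putative_de_novo("A", Call("AA", True), Call("AA", True), child, in_dbsnp=False)
```

Requiring agreement across independent technologies is what keeps the candidate list short enough to validate, since true de novo mutations are far rarer than technology-specific errors.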

42 [Figure: reads for Mom, Dad, and Child (454, SLX, SOLiD) at a candidate site]
No evidence in parents
Consistent in all three technologies
Validated as a true de novo mutation

43 We apply a generalization of the single sample caller to pilot 1
[Figure: 4x reads on average for Individual 1, Individual 2, ..., Individual N feed single sample calls; expectation maximization over allele frequency and genotype frequencies yields SNPs]
This approach allows us to combine our poorly determined single sample calls (it's 4x after all) to make high quality population calls
We have been working with the Sanger (Durbin) and U. Michigan (Abecasis) to make project-wide Pilot 1 calls
Other approaches use LD to separate machine errors (which are inconsistent with LD) from true variants (which are consistent with it)
Very powerful but introduces an LD bias into the call set
The best combined approach is still an open question
Work of Jared Maguire and Mark Daly
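One textbook way to combine weak per-sample evidence is expectation maximization over the allele frequency. The Python sketch below is a generic version of that idea and is not claimed to be the Pilot 1 method: it assumes a biallelic site, Hardy-Weinberg genotype priors, per-individual likelihoods given as `[L(0 alt), L(1 alt), L(2 alt)]`, and a starting frequency strictly between 0 and 1.

```python
def hwe_priors(p):
    """Hardy-Weinberg genotype frequencies for 0, 1, and 2 alternate alleles."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def em_allele_frequency(likelihoods, p0=0.1, iterations=50):
    """Estimate the alternate allele frequency from genotype likelihoods.

    E-step: posterior over each individual's genotype given the current p.
    M-step: p becomes the expected number of alternate alleles divided by 2N.
    """
    p = p0
    for _ in range(iterations):
        expected_alt = 0.0
        for lik in likelihoods:
            weights = [l * prior for l, prior in zip(lik, hwe_priors(p))]
            total = sum(weights)
            expected_alt += (weights[1] + 2 * weights[2]) / total
        p = expected_alt / (2 * len(likelihoods))
    return p


# Toy cohort with unambiguous data: eight hom-ref individuals and two hets.
certain = [[1.0, 0.0, 0.0]] * 8 + [[0.0, 1.0, 0.0]] * 2
freq = em_allele_frequency(certain)
```

With 4x data the individual likelihoods are far from certain, but the estimated frequency acts as a shared prior that sharpens every individual's posterior genotype, which is the sense in which poorly determined single sample calls combine into good population calls.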

44 Available in preliminary form from 1000 genomes
Pilot 1: ~17M SNPs discovered in three populations with limited genotype certainty
Pilot 2: ~2.7B genotyped sites and ~3M SNPs per person in three trios to very high accuracy
Pilot 3: ~13K SNPs in 1000 genomes with MAF >1% to high accuracy
Preliminary calls have been made for all pilots 1, 2 and 3 by several centers and groups around the world
All three pilots are proceeding to validation in the next month
Final, high quality calls by November
Publication and public release in December

45 Help develop and apply methods in NGS to medical genetics projects
The Genome Sequencing and Analysis group in Medical and Population Genetics at the Broad Institute is hiring:
Computational Biologist: Ph.D. level research scientist focused on algorithmic R&D
Bioinformatic Analyst: B.A./M.A. level analyst focused on algorithmic R&D
Senior Software Engineer: B.A./M.A./Ph.D. in CS with 5+ years of experience to lead MPG software development projects
Software Engineer: B.A. in CS to develop software throughout MPG
Talk to me for more information or


More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet May 2013 Standard sequence library generation Illumina

More information

Targeted resequencing

Targeted resequencing Targeted resequencing Sarah Calvo, Ph.D. Computational Biologist Vamsi Mootha laboratory Snapshots of Genome Wide Analysis in Human Disease (MPG), 4/20/2010 Vamsi Mootha, PI How can I assess a small genomic

More information

Variant Simulation Tools

Variant Simulation Tools Variant Simulation Tools Bo Peng Sep 25, 2014 Genetic Simulations Why perform simulations? To get data that match these (unrealis+c) assump+ons of our methods Validate sta+s+cal methods using simulated

More information

NEXT GENERATION SEQUENCING. Farhat Habib

NEXT GENERATION SEQUENCING. Farhat Habib NEXT GENERATION SEQUENCING HISTORY HISTORY Sanger Dominant for last ~30 years 1000bp longest read Based on primers so not good for repetitive or SNPs sites HISTORY Sanger Dominant for last ~30 years 1000bp

More information

Exploring structural variation in the tomato genome with JBrowse

Exploring structural variation in the tomato genome with JBrowse Exploring structural variation in the tomato genome with JBrowse Richard Finkers, Wageningen UR Plant Breeding Richard.Finkers@wur.nl; @rfinkers Version 1.0, December 2013 This work is licensed under the

More information

Quan=fying genomic varia=on of gut microbiota across the human popula=on. Stephen Nayfach iseem2 Call February 9, 2015

Quan=fying genomic varia=on of gut microbiota across the human popula=on. Stephen Nayfach iseem2 Call February 9, 2015 Quan=fying genomic varia=on of gut microbiota across the human popula=on Stephen Nayfach iseem2 Call February 9, 2015 Biological Mo=va=on Evolu=onarily similar organisms oden differ in their gene content

More information

Germline variant calling and joint genotyping

Germline variant calling and joint genotyping talks Germline variant calling and joint genotyping Applying the joint discovery workflow with HaplotypeCaller + GenotypeGVCFs You are here in the GATK Best PracDces workflow for germline variant discovery

More information

RNA Seq: Methods and Applica6ons. Prat Thiru

RNA Seq: Methods and Applica6ons. Prat Thiru RNA Seq: Methods and Applica6ons Prat Thiru 1 Outline Intro to RNA Seq Biological Ques6ons Comparison with Other Methods RNA Seq Protocol RNA Seq Applica6ons Annota6on Quan6fica6on Other Applica6ons Expression

More information

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl Next Generation Sequencing Bioinformatics small variants Data Analysis Guidelines genomescan.nl GenomeScan s Guidelines for Small Variant Analysis on NGS Data Using our own proprietary data analysis pipelines

More information

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech GALAXY INITIATION A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech How does Next- Gen sequencing work? DNA fragmentation Size selection and clonal amplification Massive parallel sequencing ACCGTTTGCCG

More information

Read Mapping and Variant Calling. Johannes Starlinger

Read Mapping and Variant Calling. Johannes Starlinger Read Mapping and Variant Calling Johannes Starlinger Application Scenario: Personalized Cancer Therapy Different mutations require different therapy Collins, Meredith A., and Marina Pasca di Magliano.

More information

From reads to results: differen1al expression analysis with RNA seq. Alicia Oshlack Bioinforma1cs Division Walter and Eliza Hall Ins1tute

From reads to results: differen1al expression analysis with RNA seq. Alicia Oshlack Bioinforma1cs Division Walter and Eliza Hall Ins1tute From reads to results: differen1al expression analysis with RNA seq Alicia Oshlack Bioinforma1cs Division Walter and Eliza Hall Ins1tute Purported benefits and opportuni1es of RNA seq All transcripts are

More information

Best practices for Variant Calling with Pacific Biosciences data

Best practices for Variant Calling with Pacific Biosciences data Best practices for Variant Calling with Pacific Biosciences data Mauricio Carneiro, Ph.D. Mark DePristo, Ph.D. Genome Sequence and Analysis Medical and Population Genetics carneiro@broadinstitute.org 1

More information

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte BICF Variant Analysis Tools Using the BioHPC Workflow Launching Tool Astrocyte Prioritization of Variants SNP INDEL SV Astrocyte BioHPC Workflow Platform Allows groups to give easy-access to their analysis

More information

Data Analysis Report: Variant Analysis v1.2

Data Analysis Report: Variant Analysis v1.2 GATC Biotech AG, Jakob-Stadler-Platz 7, 78467 Konstanz Data Analysis Report: Variant Analysis v1.2 Project / Study: GATC-Demo Date: February 28, 2018 Table of Contents 1 Analysis workflow 1 2 Samples Analysed

More information

Gene Expression analysis with RNA-Seq data

Gene Expression analysis with RNA-Seq data Gene Expression analysis with RNA-Seq data C3BI Hands-on NGS course November 24th 2016 Frédéric Lemoine Plan 1. 2. Quality Control 3. Read Mapping 4. Gene Expression Analysis 5. Splicing/Transcript Analysis

More information

The effect of strand bias in Illumina short-read sequencing data

The effect of strand bias in Illumina short-read sequencing data Guo et al. BMC Genomics 2012, 13:666 RESEARCH ARTICLE Open Access The effect of strand bias in Illumina short-read sequencing data Yan Guo 1, Jiang Li 1, Chung-I Li 1, Jirong Long 2, David C Samuels 3

More information

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

Data Analysis with CASAVA v1.8 and the MiSeq Reporter Data Analysis with CASAVA v1.8 and the MiSeq Reporter Eric Smith, PhD Bioinformatics Scientist September 15 th, 2011 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense

More information

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Genome 373: Mapping Short Sequence Reads II. Doug Fowler Genome 373: Mapping Short Sequence Reads II Doug Fowler The final Will be in this room on June 6 th at 8:30a Will be focused on the second half of the course, but will include material from the first half

More information

Lecture 7. Next-generation sequencing technologies

Lecture 7. Next-generation sequencing technologies Lecture 7 Next-generation sequencing technologies Next-generation sequencing technologies General principles of short-read NGS Construct a library of fragments Generate clonal template populations Massively

More information

Introduc0on to Variant Analysis with NGS data

Introduc0on to Variant Analysis with NGS data Introduc0on to Variant Analysis with NGS data Lecture by: Date: Lecture series: Study program: Dr. Chris0an Rausch 3 November 2014 Tumor Biology and Clinical Behavior VUmc Master of Oncology About Chris0an

More information

Supplementary Figures and Data

Supplementary Figures and Data Supplementary Figures and Data Whole Exome Screening Identifies Novel and Recurrent WISP3 Mutations Causing Progressive Pseudorheumatoid Dysplasia in Jammu and Kashmir India Ekta Rai 1, Ankit Mahajan 2,

More information

Introduction to Next Generation Sequencing

Introduction to Next Generation Sequencing The Sequencing Revolution Introduction to Next Generation Sequencing Dena Leshkowitz,WIS 1 st BIOmics Workshop High throughput Short Read Sequencing Technologies Highly parallel reactions (millions to

More information

Variant prioritization in NGS studies: Annotation and Filtering "

Variant prioritization in NGS studies: Annotation and Filtering Variant prioritization in NGS studies: Annotation and Filtering Colleen J. Saunders (PhD) DST/NRF Innovation Postdoctoral Research Fellow, South African National Bioinformatics Institute/MRC Unit for Bioinformatics

More information

Supplementary Figures

Supplementary Figures 1 Supplementary Figures exm26442 2.40 2.20 2.00 1.80 Norm Intensity (B) 1.60 1.40 1.20 1 0.80 0.60 0.40 0.20 2 0-0.20 0 0.20 0.40 0.60 0.80 1 1.20 1.40 1.60 1.80 2.00 2.20 2.40 2.60 2.80 Norm Intensity

More information

Exploring genomic databases: Practical session "

Exploring genomic databases: Practical session Exploring genomic databases: Practical session Work through the following practical exercises on your own. The objective of these exercises is to become familiar with the information available in each

More information

Introduc)on to NGS Variant Calling

Introduc)on to NGS Variant Calling Introduc)on to NGS Variant Calling Bioinforma)cs analysis and annota)on of variants in NGS data workshop Cape Town, 4 th to 6 th April 2016 Sumir Panji, Amel Ghouila, Gerrit Botha Types of variants Learning

More information

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012 + Variant Detection in Next Generation Sequencing Data John Osborne Sept 14, 2012 + Overview My Bias Talk slanted towards analyzing whole genomes using Illumina paired end reads with open source tools

More information

CMSC423: Bioinformatic databases, algorithms and tools

CMSC423: Bioinformatic databases, algorithms and tools CMSC423: Bioinformatic databases, algorithms and tools Héctor Corrada Bravo Dept. of Computer Science Center for Bioinformatics and Computational Biology University of Maryland University of Maryland,

More information

Sanger vs Next-Gen Sequencing

Sanger vs Next-Gen Sequencing Tools and Algorithms in Bioinformatics GCBA815/MCGB815/BMI815, Fall 2017 Week-8: Next-Gen Sequencing RNA-seq Data Analysis Babu Guda, Ph.D. Professor, Genetics, Cell Biology & Anatomy Director, Bioinformatics

More information

H3A - Genome-Wide Association testing SOP

H3A - Genome-Wide Association testing SOP H3A - Genome-Wide Association testing SOP Introduction File format Strand errors Sample quality control Marker quality control Batch effects Population stratification Association testing Replication Meta

More information

CNV and variant detection for human genome resequencing data - for biomedical researchers (II)

CNV and variant detection for human genome resequencing data - for biomedical researchers (II) CNV and variant detection for human genome resequencing data - for biomedical researchers (II) Chuan-Kun Liu 劉傳崑 Senior Maneger National Center for Genome Medican bioit@ncgm.sinica.edu.tw Abstract Common

More information

Genome STRiP ASHG Workshop demo materials. Bob Handsaker October 19, 2014

Genome STRiP ASHG Workshop demo materials. Bob Handsaker October 19, 2014 Genome STRiP ASHG Workshop demo materials Bob Handsaker October 19, 2014 Running Genome STRiP directly on AWS Genome STRiP Structure in Populations Popula'on)aware-discovery-andgenotyping-of-structural-varia'onfrom-whole)genome-sequencing-

More information

Lecture: Genetic Basis of Complex Phenotypes Advanced Topics in Computa8onal Genomics

Lecture: Genetic Basis of Complex Phenotypes Advanced Topics in Computa8onal Genomics Lecture: Genetic Basis of Complex Phenotypes 02-715 Advanced Topics in Computa8onal Genomics Genome Polymorphisms A Human Genealogy TCGAGGTATTAAC The ancestral chromosome From SNPS TCGAGGTATTAAC TCTAGGTATTAAC

More information

VALIDATION OF HLA TYPING BY NGS

VALIDATION OF HLA TYPING BY NGS VALIDATION OF HLA TYPING BY NGS Eric T. Weimer, Ph.D., D(ABMLI) Assistant Professor, Pathology and Laboratory Medicine Associate Director, Clinical Flow Cytometry, HLA, and Immunology Laboratories CONFLICT

More information

Whole Genome Sequencing. Biostatistics 666

Whole Genome Sequencing. Biostatistics 666 Whole Genome Sequencing Biostatistics 666 Genomewide Association Studies Survey 500,000 SNPs in a large sample An effective way to skim the genome and find common variants associated with a trait of interest

More information

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service Dr. Ruth Burton Product Manager Today s agenda Introduction CytoSure arrays and analysis

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

Genomic DNA ASSEMBLY BY REMAPPING. Course overview ASSEMBLY BY REMAPPING Laurent Falquet, The Bioinformatics Unravelling Group, UNIFR & SIB MA/MER @ UniFr Group Leader @ SIB Course overview Genomic DNA PacBio Illumina methylation de novo remapping Annotation

More information

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science + UAB DNA-Seq Analysis Workshop John Osborne Research Associate Centers for Clinical and Translational Science ozborn@uab.,edu + Thanks in advance You are the Guinea pigs for this workshop! At this point

More information

Novel Variant Discovery Tutorial

Novel Variant Discovery Tutorial Novel Variant Discovery Tutorial Release 8.4.0 Golden Helix, Inc. August 12, 2015 Contents Requirements 2 Download Annotation Data Sources...................................... 2 1. Overview...................................................

More information

RNA Ribonucleic Acid. Week 14, Lecture 28. RNA- seq is a new, emerging field. Two major domains applica:on 12/4/ When the transcriptome is known

RNA Ribonucleic Acid. Week 14, Lecture 28. RNA- seq is a new, emerging field. Two major domains applica:on 12/4/ When the transcriptome is known 2014 - BMMB 852D: Applied Bioinforma:cs RNA Ribonucleic Acid Week 14, Lecture 28 István Albert Biochemistry and Molecular Biology and Bioinforma:cs Consul:ng Center Penn State Two major domains applica:on

More information

ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data

ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data Cabanski et al. BMC Bioinformatics 2012, 13:221 SOFTWARE Open Access ReQON: a Bioconductor package for recalibrating quality scores from next-generation sequencing data Christopher R Cabanski 1, Keary

More information

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4 WHITE PAPER Oncomine Comprehensive Assay Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4 Contents Scope and purpose of document...2 Content...2 How Torrent

More information

Dipping into Guacamole. Tim O Donnell & Ryan Williams NYC Big Data Genetics Meetup Aug 11, 2016

Dipping into Guacamole. Tim O Donnell & Ryan Williams NYC Big Data Genetics Meetup Aug 11, 2016 Dipping into uacamole Tim O Donnell & Ryan Williams NYC Big Data enetics Meetup ug 11, 2016 Who we are: Hammer Lab Computational lab in the department of enetics and enomic Sciences at Mount Sinai Principal

More information

FDA and the Regula/on of Next Genera/on Sequencing

FDA and the Regula/on of Next Genera/on Sequencing FDA and the Regula/on of Next Genera/on Sequencing David Litwack, Ph.D. Personalized Medicine Staff Office of In Vitro Diagnos@cs and Radiological Health, FDA In Vitro Diagnos/cs in the Age of Precision

More information

Processing Ion AmpliSeq Data using NextGENe Software v2.3.0

Processing Ion AmpliSeq Data using NextGENe Software v2.3.0 Processing Ion AmpliSeq Data using NextGENe Software v2.3.0 July 2012 John McGuigan, Megan Manion, Kevin LeVan, CS Jonathan Liu Introduction The Ion AmpliSeq Panels use highly multiplexed PCR in order

More information

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience Building Excellence in Genomics and Computa5onal Bioscience Resequencing approaches Sarah Ayling Crop Genomics and Diversity sarah.ayling@tgac.ac.uk Why re- sequence plants? To iden

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature26136 We reexamined the available whole data from different cave and surface populations (McGaugh et al, unpublished) to investigate whether insra exhibited any indication that it has

More information

Accelerate High Throughput Analysis for Genome Sequencing with GPU

Accelerate High Throughput Analysis for Genome Sequencing with GPU Accelerate High Throughput Analysis for Genome Sequencing with GPU ATIP - A*CRC Workshop on Accelerator Technologies in High Performance Computing May 7-10, 2012 Singapore BingQiang WANG, Head of Scalable

More information

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer Project XX Customer Detail Table of Contents. Bioinformatics analysis pipeline...3.. Read quality check. 3.2. Read alignment...3.3.

More information

Linkage Analysis Computa.onal Genomics Seyoung Kim

Linkage Analysis Computa.onal Genomics Seyoung Kim Linkage Analysis 02-710 Computa.onal Genomics Seyoung Kim Genome Polymorphisms Gene.c Varia.on Phenotypic Varia.on A Human Genealogy TCGAGGTATTAAC The ancestral chromosome SNPs and Human Genealogy A->G

More information

User Guide. MAGNET : MicroArray & RNAseq Gene expression Network Evalua=on Toolkit. Page 1

User Guide. MAGNET : MicroArray & RNAseq Gene expression Network Evalua=on Toolkit. Page 1 User Guide MAGNET : MicroArray & RNAseq Gene expression Network Evalua=on Toolkit Page 1 Case Western Reserve University February 2012 Page 2 Page 3 1 - Introduction This sec=on will introduce MAGNET:

More information

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es

SNP calling. Jose Blanca COMAV institute bioinf.comav.upv.es SNP calling Jose Blanca COMAV institute bioinf.comav.upv.es SNP calling Genotype matrix Genotype matrix: Samples x SNPs SNPs and errors A change in a read may due to: Sample contamination Cloning or PCR

More information

Accelerate precision medicine with Microsoft Genomics

Accelerate precision medicine with Microsoft Genomics Accelerate precision medicine with Microsoft Genomics Copyright 2018 Microsoft, Inc. All rights reserved. This content is for informational purposes only. Microsoft makes no warranties, express or implied,

More information

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang Supplementary Materials for: Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John

More information

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome See the Difference With a commitment to your peace of mind, Life Technologies provides a portfolio of robust and scalable

More information

Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University Why can GBS be complicated? Tools for filtering & error correction Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net Maize has more molecular diversity than humans and apes combined

More information

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC 1/SC 29/WG 11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June

More information

RNAseq / ChipSeq / Methylseq and personalized genomics

RNAseq / ChipSeq / Methylseq and personalized genomics RNAseq / ChipSeq / Methylseq and personalized genomics 7711 Lecture Subhajyo) De, PhD Division of Biomedical Informa)cs and Personalized Biomedicine, Department of Medicine University of Colorado School

More information

Assignment 9: Genetic Variation

Assignment 9: Genetic Variation Assignment 9: Genetic Variation Due Date: Friday, March 30 th, 2018, 10 am In this assignment, you will profile genome variation information and attempt to answer biologically relevant questions. The variant

More information

BroadE Workshop: Genome Assembly. March 20 th, 2013

BroadE Workshop: Genome Assembly. March 20 th, 2013 BroadE Workshop: Genome Assembly March 20 th, 2013 Introduc@on & Logis@cs De- Bruijn Graph Interac@ve Problem (45 minutes) Assembly Theory Lecture (45 minutes) Break (10-15 minutes) Assembly in Prac@ce

More information