Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl


Next Generation Sequencing
Bioinformatics small variants Data Analysis
Guidelines
genomescan.nl

GenomeScan's Guidelines for Small Variant Analysis on NGS Data
Using our own proprietary data analysis pipelines

Dear customer,

As of the beginning of 2015, ServiceXS became a trademark of GenomeScan B.V. GenomeScan focuses exclusively on Molecular Diagnostics, whereas our ServiceXS trademark is intended for your R&D projects.

GenomeScan is dedicated to helping you design and perform Next Generation Sequencing (NGS) experiments that generate high-quality results. This guide provides information on our data analysis services and on resources and tools for further analysis of your sequencing data.

NGS experiments produce vast amounts of data, so data analysis can be challenging. Our ability to assist in the analysis of your results can be the key factor leading to a successful project. Our experience over the past years is that even state-of-the-art NGS software cannot always fulfil the data analysis needs of our customers. To alleviate this problem, our experienced team of bioinformaticians and molecular biologists can provide standard or custom bioinformatics solutions to get the most out of your project.

GenomeScan provides a comprehensive package of bioinformatics services for our NGS customers, enabling them to utilise all the applications that are possible with billions of bases of sequence data per run. GenomeScan can advise and assist you in every step of the data analysis. Do not hesitate to contact us if you have any questions after reading this guideline!

On behalf of the Bioinformatics team,
Thomas Chin-A-Woeng
Project Manager

GenomeScan Guidelines

Document Outline

1 Introduction
2 Application Description
2.1 Quality Filtering and Trimming
2.2 Alignments
2.3 SNP Detection
2.4 SNP Filtering
2.5 Indel detection
2.6 Export Files
2.7 Consensus Sequence (optional)
2.8 SNP Effect Analysis
3 Analysis Results
3.1 Raw Sequencing Files
3.2 Alignment Files
3.3 Main SNP File
3.4 Human Readable SNP File
3.5 Genotype Summary
3.6 Assay Design File
3.7 Combined.tab
3.8 IUPAC and variant references
3.9 Visualisation
4 File Formats
4.1 Variant Analysis
4.2 Structural Variation
4.3 Reference Genomes
4.4 Assay Design

Changes to Previous Version (2.0)
- Lay-out changes

Chapter 1 Introduction

Individuals of the same species differ very little in their genomic sequence. The differences that do occur are referred to as allelic variation. A single nucleotide polymorphism (SNP) is a DNA sequence variation that occurs when a single nucleotide (A, T, C, or G) in the genome differs between members of a species, or between paired chromosomes in an individual. Each individual carries many SNPs that together create a DNA pattern unique to that individual. SNPs commonly observed in a population typically exhibit two alleles: a major allele, which is more prevalent, and a minor allele, which occurs relatively rarely. The study of SNPs is also important for genotyping in crop and livestock breeding.

SNPs may fall within coding sequences of genes, in non-coding regions, or in the intergenic regions between genes. A SNP can be highly deleterious: a change in a single nucleotide can cause a codon to be misread, so that an altered protein is produced. A SNP within a coding sequence does not necessarily change the amino acid sequence of the protein, owing to the degeneracy of the genetic code. A SNP in which both alleles lead to the same polypeptide sequence is termed synonymous; if a different polypeptide sequence is produced, the SNP is non-synonymous. SNPs outside protein-coding regions may still have consequences for gene splicing, transcription factor binding, or the sequence of non-coding RNA. SNPs located in regulatory regions (promoters, UTRs) may significantly influence the expression level of a gene.

Next-generation sequencing (NGS) allows SNP identification without prior target information. The high coverage achievable with NGS also facilitates the discovery of rare alleles in population studies.

SNP detection algorithms compare the nucleotides present in aligned reads against the reference at each position (Fig. 1). Based on the distribution of As, Ts, Gs, and Cs at that position and the likelihood of a sequencing error, a judgement is made as to whether a SNP exists. Further downstream in the analysis, the potential effects of the SNPs on the DNA sequence can be evaluated.

Fig. 1. Alignment against a reference sequence

This guideline describes the workflow for detection of small variants in a sample genome in comparison to a reference genome. The main steps are (1) quality filtering and adapter trimming, (2) alignment, (3) SNP detection, (4) filtering of significant SNPs, and (5) optionally, SNP effect analysis and clustering.
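The position-by-position comparison of aligned bases against the reference can be illustrated with a deliberately simple sketch. It only counts bases at one position and applies two cut-offs; it is not the Bayesian caller used by the pipeline. The function name call_site and the thresholds min_depth and min_fraction are hypothetical and chosen for illustration only.

```python
from collections import Counter


def call_site(ref_base, pileup_bases, min_depth=4, min_fraction=0.3):
    """Toy SNP test for a single reference position.

    ref_base      reference nucleotide at this position
    pileup_bases  string of bases observed in the reads covering the position
    min_depth, min_fraction are illustrative assumptions, not pipeline defaults.
    Returns (alt_base, alt_fraction) or None when no candidate SNP is found.
    """
    ref_base = ref_base.upper()
    # Count only unambiguous nucleotides
    counts = Counter(b for b in pileup_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return None
    # Keep the most frequent non-reference base, if any
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None
    alt_base, n_alt = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = n_alt / depth
    if fraction < min_fraction:
        return None
    return alt_base, fraction


if __name__ == "__main__":
    print(call_site("A", "AAAAGGGG"))    # half of the reads support G
    print(call_site("A", "AAAAAAAAAG"))  # a single G is treated as noise here
```

A real caller additionally weighs each base by its base quality and mapping quality, as described in Chapter 2.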

Chapter 2 Sequencing Applications

The following section describes the main steps of the SNP/indel analysis (Fig. 2). The most common preparatory step for SNP analysis is to filter the reads and retain only those with high mapping and base qualities. After SNPs have been called and appropriate filtering thresholds chosen, a VCF file is generated. From this VCF file, various export formats that can be interpreted by the customer are derived.

Fig. 2. SNP detection workflow.

2.1 Quality Filtering and Trimming

The SNP/indel pipeline starts with quality filtering and trimming of the sequence reads. Filtering uses a set of standard thresholds that are optimised for the SNP/indel analysis pipeline. The main parameter defaults are:

Table 1. Read filtering
Filter | Default | Description
Adapters | On | Illumina sequencing adapters are removed
Minimal Q-score | 22 | All bases in the read should have a Q-score of at least 22 (corresponding to a chance of one error in 160 bases); bases with lower qualities are trimmed off
Minimal read length | 36 | After trimming, reads should be at least 36 bp long to be kept in the data set
Treat paired-end | On | For paired-end reads, both reads are kept or removed together
5' or 3' trim | Off | The 5' and 3' ends of reads can optionally be trimmed for adapter sequences or other unwanted bases indicated by the customer

Presumed adapter sequences are removed from a read when its bases match a sequence in the adapter sequence set (Illumina TruSeq adapters) with two or fewer mismatches and an alignment quality of at least 12. To remove noise introduced by sequencing errors, reads are filtered and clipped by quality. By default, a phred score of Q22 is used as the minimum threshold. Bases with phred scores below this level are removed, and as a consequence reads are split.
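As an illustration of the defaults in Table 1, the sketch below converts a phred score into an error probability and splits a read at every base below the Q-score threshold, keeping only fragments of at least the minimal read length. How the pipeline itself splits reads is not specified in this guideline, so the splitting rule here is an assumption, and the function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a phred quality score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def split_on_low_quality(seq, quals, min_q=22, min_len=36):
    """Split a read at bases with quality below min_q.

    seq    read sequence
    quals  list of integer phred scores, one per base
    Returns the high-quality fragments that are at least min_len bases long.
    The splitting rule is an illustrative assumption.
    """
    fragments = []
    start = None  # start index of the current high-quality stretch
    for i, q in enumerate(quals):
        if q >= min_q:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                fragments.append(seq[start:i])
            start = None
    # Close a stretch that runs to the end of the read
    if start is not None and len(seq) - start >= min_len:
        fragments.append(seq[start:])
    return fragments


if __name__ == "__main__":
    p = phred_to_error_probability(22)
    print(f"Q22: error probability {p:.4f}, about one error in {1 / p:.0f} bases")
    read = "A" * 40 + "C" + "G" * 10
    quals = [30] * 40 + [10] + [30] * 10
    print(split_on_low_quality(read, quals))
```

With the default minimal read length, the 10-base fragment after the low-quality base is discarded and only the 40-base fragment is kept.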
If the resulting reads are shorter than the minimal read length (36 bp by default), they are removed altogether; when paired-end mode is enforced, both reads of a pair are removed. The filtered reads are written in FASTQ format, and filtering statistics are calculated and reported. The filtered reads are used for the next stage of the pipeline.

2.2 Alignments

The next step of the pipeline consists of aligning the filtered reads to the genome reference provided by the customer or generated by de novo assembly. The filtered reads are aligned to the reference sequence with a short-read aligner based on the Burrows-Wheeler Transform. A mismatch rate of 4% (4 mismatches in a read of 100 bases) is used by default. This step lays the foundation for finding the SNPs and other variations. The alignment files (BAM files, sorted and indexed by the

samtools v package) containing the mapped read information are provided on the hard disk in the Alignments folder.

2.3 SNP Detection

The pipeline performs SNP/indel identification using Bayesian statistics, similar to other commonly used software tools for SNP detection. It uses the nucleotide observed in each read covering a location, together with its associated base quality, to calculate a consensus genotype. Issues that a SNP caller has to consider are read quality, mapping quality, coverage, homopolymeric tracts, and ploidy. The caller takes the following factors into consideration:

- A sequencer outputs a sequence of nucleotides for each read and assigns each base a quality value based on the confidence with which that base is called. The base quality values add weight to the called nucleotides.
- Misaligned reads create false positive SNPs or incorrect frequencies. Most alignment algorithms assign a quality score to a mapping based on the alignment of the read with the reference. These mapping scores indicate the likelihood that a read originates from the suggested position on the reference. The mapping quality score takes into account the insertions, deletions, and substitutions necessary for alignment at a particular position.
- The number of reads at a genomic position also determines the confidence of a detected SNP. Greater sequencing depth leads to higher SNP calling accuracy.
- The ploidy of the sample determines the number of nucleotide inferences necessary to conclude the underlying genotype. For haploid samples, the algorithm does not consider heterozygous genotypes.
- Some sequencers represent homopolymers (e.g. AAAAAA) and their immediate neighbours inaccurately due to limitations in the technology. Such regions are also handled by the SNP detection algorithm.

The SNP/indel pipeline is capable of detecting three types of variants: substitutions (mutations), deletions, and insertions. Substitutions consist of one or more nucleotide substitutions occurring at certain genomic positions. Deletions are one or more nucleotide deletions occurring at a given location. A deletion event is represented as a change from one or more consecutive nucleotides to a gap (no bases). Insertions are one or more consecutive nucleotide insertions occurring at a given location.

The pipeline can process data in single- or multi-sample mode. In the default multi-sample mode, low-confidence calls that occur in multiple samples increase the confidence of the SNP call. A phred quality score is output along with the consensus genotype; this score represents the confidence in the variant call. Higher scores correspond to a lower probability of an erroneous call.

2.4 SNP Filtering

Genomic positions are reported as potential SNP sites if they satisfy a set of predefined criteria that may be set by the customer or the bioinformatician and that may depend on the experimental setup. These criteria include the minimal read depth, the minimal quality score, and the minimal variant frequency. For each criterion, the observed value must exceed the defined threshold. A VCF file is generated from the positions passing the filter. The results are reported in filtered.snps.vcf, in VCF v4.1 format, in the Variants directory. From this VCF file, various export formats that can be interpreted by the customer are derived. The following filters can be applied to the SNP list:

- Read depth: The deeper the sequencing, the more reliably the SNP detection can determine whether a candidate is a true SNP. A minimal threshold can be set to guarantee a minimal coverage before a SNP is reported.
- Quality score: SNPs are filtered based on their quality score. All SNPs with quality scores below the defined threshold are filtered out. This ensures that low-quality SNPs are discarded; if these should also be included, the threshold can be lowered.
- Variant frequency: The variant frequency threshold is set according to prior expectations about the data set, including the ploidy and whether a pool of samples was analysed.

2.5 Indel Detection

Small insertions and deletions (indels) of up to 30 bases are detected by the indel caller using in-read information (in contrast to mate or pair information). Aligners typically introduce gaps into reads for better mapping; these gaps may represent deletions. Like a base, a gap (deletion) is significant when the missing base(s) meet the filter criteria. Since deletions do not have an associated quality score, the surrounding base qualities are used to compute a confidence score. The indels are provided in VCF format and in tab-delimited format.

2.6 Human-readable files

The SNP list is stored in snps.tab in tab-delimited format (in the Export folder). This file can be opened directly in a spreadsheet application such as MS Excel or LibreOffice, provided the number of rows does not exceed the limits of the application. From this file, the genotype columns are extracted into the summary.tab file. A SNP assay design file is generated for the SNPs reported in snps.tab and written to design.tab. This file contains the contig sequence 75 bases upstream and downstream of the identified SNP position. Optionally, the combined information of snps.tab, summary.tab, and design.tab is provided in the file combined.tab. This file also includes additional columns with the distance to the closest previous and next SNP and the average sequence depth over all samples. Small indels are output in the indels.tab file in the Export folder.

2.7 Consensus Sequence (optional)

Based on the consensus call and the reference sequence, a new reference sequence may be derived that includes the detected SNPs and genotypes. The resulting file is in FastA format and may be coded in different ways.

2.8 SNP Effect Analysis (optional)

SNP Effect Analysis processes the list of SNPs and reports the effect that these SNPs have on the genes in a given context. The SNPs are classified using the genome feature information. The following classifications are detected and reported.

Table 2. SNP effect classifications
Classification | Description
Intergenic | A variant that does not fall within the neighbourhood of any gene in the annotation
Synonymous | Variant in an exon; the mutation has no effect on the final amino acid sequence
Non-synonymous | Variant in an exon; the mutation changes the final amino acid sequence
Stop gain | Results in a STOP codon
Stop loss | STOP codon lost
Intronic | A mutation occurring in an intronic region
Upstream | A variant occurring upstream of the transcript
Downstream | A variant occurring downstream of the transcript
Essential splice site | Mutations in the donor and acceptor sites of the intron
Splice site | Mutations at the locations of splicing signals (i.e. 3-8 bases into the intron from either side, 1-3 bases into the neighbouring exon)
5' UTR | A variant in the 5' UTR region
3' UTR | A variant in the 3' UTR region
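The four exonic classes in Table 2 can be told apart by translating the reference codon and the mutated codon. The sketch below shows this for a single substitution, assuming the standard genetic code and a codon already extracted in coding-strand orientation. The function name classify_coding_snp is hypothetical, and the order in which the classes are tested is an assumption; the pipeline's own annotation step is not described at this level of detail.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a substitution inside a codon.

    ref_codon  three-letter reference codon on the coding strand
    offset     0-based position of the SNP within the codon
    alt_base   variant nucleotide
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Stop gain"
    if ref_aa == "*":
        return "Stop loss"
    return "Non-synonymous"


if __name__ == "__main__":
    print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
    print(classify_coding_snp("GAA", 0, "A"))  # GAA -> AAA, glutamate to lysine
    print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, new stop codon
    print(classify_coding_snp("TAA", 0, "C"))  # TAA -> CAA, stop codon lost
```

The remaining classes in Table 2 depend on the position of the variant relative to annotated gene features rather than on the codon, so they are not covered by this sketch.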

Chapter 3 Analysis Results

3.1 Raw Sequence Files

The raw sequence files output by the Illumina pipeline are used as input for the SNP detection. These sequence files are provided to the customer in FASTQ format in the 'Raw data' directory. The quality-filtered reads produced in the first step of the pipeline are optionally provided to the customer.

3.2 Alignment files

The alignment files are provided in sorted BAM format with an accompanying index file. See our Next-generation data analysis guideline for a full description of BAM files.

3.3 Main SNP file (snps.vcf)

The main output of the SNP/indel pipeline is a text file formatted according to the VCF 4.1 specification. VCF stands for Variant Call Format and was originally used by the 1000 Genomes project to encode structural genetic variants. A short overview is given in Chapter 4.

3.4 Human readable SNP file (snps.tab)

This text file contains information in tab-delimited format. It is both human- and machine-readable.

Fig. 3. Layout of snps.tab and summary.tab files.

The format specification of this file is given in Chapter 4. Columns 1 to 4 are general columns applicable to all samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8 to 10 contain genotype information. Columns 11 to 15 provide raw statistics on coverage and base composition. The layout of the columns is described in Chapter 4 (Table 4, Fig. 3).

3.5 Genotype summary (summary.tab)

This tab-delimited file contains the consensus columns of snps.tab. It is both human- and machine-readable (Fig. 3 inset).

3.6 Assay design file (design.tab)

This tab-delimited file shows the flanking sequences of each position in the SNP file. It is both human- and machine-readable. Neighbouring SNPs, which may be important for the design of a follow-up assay, are indicated in the flanking regions.

3.7 Combined.tab (optional)

This tab-delimited file combines the information from snps.tab, summary.tab, and design.tab and includes additional information about the distances to neighbouring SNPs and the total coverage over all samples.

3.8 IUPAC and variant references

The construction of the IUPAC reference is depicted in Fig. 4.

Fig. 4. Generation of an IUPAC or variant reference

A new IUPAC reference, or a reference with variant alleles, is generated using the original reference, read information, and variant tables. After alignment of the reads onto the reference sequence, each base position is evaluated for its variants, coverage, and quality. Regions or bases with no coverage are flagged in the new references with 'n'. Regions with coverage below a preset read depth (default <=2) or with doubtful alignment quality are written in lowercase bases to indicate low quality. Variant alleles are represented by their IUPAC codes in the IUPAC reference, or by the variant allele in the variant reference. The IUPAC reference in FASTQ format has the additional advantage that the genotype call score is encoded as a quality score, similar to the Sanger phred score encoding. An offset of 33 is used when translating the ASCII encoding to the numerical score. The genotype call score is calculated as $Q = -10 \log_{10} P$, where P represents the probability that a polymorphism exists at the given location.
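The flagging rules and the quality encoding described above can be sketched as follows. The example derives one base of an IUPAC reference from a diploid genotype and its read depth, and converts a score to and from its ASCII character with an offset of 33. The function names are hypothetical, the genotype is assumed to be given as a pair of nucleotides, and capping the score at 93 (the highest value that maps to a printable ASCII character) is an assumption rather than documented pipeline behaviour.

```python
# IUPAC codes for the six possible heterozygous two-allele combinations
IUPAC_CODES = {"AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}


def iupac_code(allele1, allele2):
    """Return the IUPAC code for a pair of nucleotides (e.g. A and G give R)."""
    key = "".join(sorted({allele1.upper(), allele2.upper()}))
    return key if len(key) == 1 else IUPAC_CODES[key]


def iupac_reference_base(genotype, depth, low_depth=2):
    """Derive one base of the IUPAC reference.

    genotype   pair of nucleotides called at this position
    depth      read depth at this position
    low_depth  depth at or below which the base is written in lowercase
    """
    if depth == 0:
        return "n"  # no coverage
    base = iupac_code(*genotype)
    return base.lower() if depth <= low_depth else base


def score_to_char(score):
    """Encode a numerical score as an ASCII character with offset 33."""
    return chr(min(score, 93) + 33)  # cap is an assumption, keeps output printable


def char_to_score(char):
    """Decode an ASCII quality character back to the numerical score."""
    return ord(char) - 33


if __name__ == "__main__":
    print(iupac_reference_base(("A", "G"), depth=30))  # heterozygous A/G
    print(iupac_reference_base(("C", "C"), depth=2))   # low coverage
    print(iupac_reference_base(("A", "T"), depth=0))   # no coverage
    print(score_to_char(40), char_to_score("I"))
```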
Whether or not a variant allele is reported in the derived reference depends on a set of key threshold values, including variant frequency (default 30% for heterozygous diploid organisms or 80% for haploid genomes), coverage or read depth (default 20), and mapping quality.

3.9 Visualisation

Aligned reads, pileups, and SNPs can be viewed in numerous software packages for NGS. Using the reference file and the alignment files, this can easily be done in the IGV browser. See our Next-generation data analysis guideline for how this can be performed.

Chapter 4 File Formats

This chapter describes the file formats specifically used for SNP and indel analysis. For other common formats such as sequence and alignment files, please refer to our NGS data analysis guideline.

4.1 Variant Analysis

The FASTQ sequence files output by the Illumina sequencers are saved compressed in the commonly used GNU zip format. This is indicated by the .gz file extension. Most downstream data analysis tools automatically decompress the files when they are used as input, and most decompression software packages can inflate this format.

VCF files

The Variant Call Format (VCF) is a flexible format used to store any type of DNA polymorphism data, such as SNPs, insertions, deletions, and structural variants, together with rich annotations, by listing both the reference haplotype (the REF column) and the alternate haplotypes (the ALT column). The format was developed for the 1000 Genomes Project and has been widely adopted by scientists and software tools. VCF is a text file format that contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The specification for the format is available online and has been published (Danecek et al., The variant call format and VCFtools, Bioinformatics 27). The full VCF specification also includes a set of recommended practices for describing complex variants.

The header contains an arbitrary number of meta-information lines, each starting with the characters ##, and a tab-delimited field definition line starting with a single # character. The meta-information header lines provide a standardised description of the tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can also be used to provide information about the means of file creation, the date of creation, the version of the reference sequence, the software used, and any other information relevant to the history of the file.

The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma-separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER), and a semicolon-separated list of additional, user-extensible annotations (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column defines the information contained within each subsequent genotype column, which consists of a colon-separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Fig. 5 indicates that the subsequent entries contain information regarding the genotype, genotype quality, and read depth for each sample. All data lines are tab-delimited, and the number of fields in each data line must match the number of fields in the header line.
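The column layout described above can be made concrete with a small parser for a single VCF data line. This is a minimal sketch using only the Python standard library; it does not validate the line against the VCF specification. The function name parse_vcf_line, the sample names, and the example data line are made up for illustration.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary.

    sample_names  sample IDs in the order given in the field definition line
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # comma-separated alternate alleles

    # INFO: semicolon-separated entries, either KEY=VALUE or a bare flag
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True
    record["INFO"] = info

    # FORMAT names the colon-separated fields of every genotype column
    samples = {}
    if len(fields) > 8:
        keys = fields[8].split(":")
        for name, entry in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(keys, entry.split(":")))
    record["samples"] = samples
    return record


if __name__ == "__main__":
    # Made-up data line with two samples
    line = "chr1\t1234\t.\tA\tG\t50\tPASS\tDP=30;DB\tGT:GQ:DP\t0/1:48:14\t1/1:43:16"
    record = parse_vcf_line(line, ["sample1", "sample2"])
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
    print(record["INFO"])
    print(record["samples"]["sample1"])
```

For production use, an established VCF library is preferable to a hand-written parser, since the specification contains many cases this sketch ignores.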

Fig. 5. VCF file.

The VCF specification includes several common keywords with standardised meaning. The following table gives some examples of the reserved tags.

Table 3. Examples of reserved VCF tags
Abbreviation | Description
Genotype columns
GT | Genotype; encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in the ALT column, 2 for the second allele listed in ALT, and so on. The number of alleles suggests the ploidy of the sample, and the separator indicates whether the alleles are phased (|) or unphased (/) with respect to other data lines.
PS | Phase set; indicates that the alleles of genotypes with the same PS value are listed in the same order.
DP | Read depth at this position.
GL | Genotype likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
GQ | Genotype quality; the probability that the genotype call is wrong, given that the site is variant. Note that the QUAL column gives an overall quality score for the assertion made in ALT that the site is variant or not variant.
INFO column
DB | dbSNP membership.
H3 | Membership in HapMap3.
VALIDATED | Validated by follow-up experiment.
AN | Total number of alleles in called genotypes.
AC | Allele count in genotypes, for each ALT allele, in the same order as listed.
SVTYPE | Type of structural variant (DEL for deletion, DUP for duplication, INV for inversion, etc., as described in the specification).
END | End position of the variant.
IMPRECISE | Indicates that the position of the variant is not known accurately.
CIPOS/CIEND | Confidence interval around the POS and END positions for imprecise variants.

Missing values are represented with a dot. For practical reasons, the VCF specification requires that the data lines appear in their chromosomal order. VCF files can be stored in compressed form using bgzip, a program that utilises the zlib-compatible BGZF library (Li et al., 2009). Files compressed by bgzip can be decompressed by the standard gunzip and zcat utilities. Fast random-access retrieval of variants from a range of positions on the reference genome can be achieved by indexing the genomic positions using tabix, a

generic indexer for tab-delimited files. Both programs, bgzip and tabix, are part of the samtools software package and can be downloaded from the SAMtools web site.

BCF (Binary Call Format)

Binary format used by samtools/bcftools for efficient storage and parsing of genotype likelihoods. A description is available online.

SNP/Genotype

The snps.tab file is a proprietary human-readable file with all information regarding SNPs and genotypes. The file is also machine-readable. Columns 1 to 4 are general columns applicable to all samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8 to 10 contain genotype information. Columns 11 to 15 provide raw statistics on coverage and base composition. The layout of the columns is as follows (Table 4, Fig. 3):

Table 4. SNP/genotype file
Column | Format | Description
1 | Text | Chromosome or contig
2 | Numerical | 1-based genomic position within chromosome or contig
3 | Nucleotide base | Reference allele
4 | Nucleotide base | Detected alleles
5 | Nucleotide base | Detected SNP; empty if no SNP or below significance
6 | Numerical | Variant frequency (%)
7 | Numerical | Quality score for SNP
8 | IUPAC base | Genotype
9 | Numerical | Genotype quality score
10 | Numerical | Depth used to calculate genotype or SNP
11 | Numerical | Fraction of A
12 | Numerical | Fraction of C
13 | Numerical | Fraction of T
14 | Numerical | Fraction of G
15 | Numerical | Total depth (alignment)
Columns 5-15 are repeated for sample 2, etc.

Summary file

The summary.tab file contains only the consensus genotypes of the samples.

Table 5. Summary file
Column | Format | Description
1 | Text | Chromosome or contig
2 | Numerical | 1-based genomic position within chromosome or contig
3 | Nucleotide base | Reference allele
4..n | IUPAC call | Consensus genotype for sample

Insertions/deletions file

The indels.tab file contains short indel information in the following format:

Table 6. Indel file format
Column | Name | Format | Description
1 | chr1 | Text | Chromosome or contig
2 | pos1 | Numerical | 1-based genomic position within chromosome or contig chr1
3 | reference | Nucleotide base | Reference allele
4 | sequence | Nucleotide base(s) | Detected variation
5 | chr2 | Text | Not used
6 | pos2 | Numerical | Not used
7 | type | Text | Variation class: INS (insertion) or DEL (deletion)
8 | size | Numerical | Size of the variation
9 | varfreq | Numerical | Frequency at which the variation is observed (%)
10 | score | Numerical | Quality score
11 | depth | Numerical | Coverage at the indicated position
12..n | | | Same as 8-11 for sample 2, etc.

4.2 Structural Variation

SV file

The filtered.sv.tab file contains structural variation data (large insertions, deletions, duplications, and interchromosomal and intrachromosomal translocations).

Table 7. Structural variation file format
Column | Name | Format | Description
1 | chr1 | Text | Chromosome or contig
2 | pos1 | Numerical | 1-based genomic position within chromosome or contig chr1
3 | reference | Nucleotide base | Reference allele
4 | sequence | Nucleotide base(s) | Detected variation
5 | chr2 | Text | Chromosome or contig
6 | pos2 | Numerical | Second position in genome; 1-based genomic position within chromosome or contig chr2
7 | type | Text | Variation class: INS (insertion), DEL (deletion), CTX (interchromosomal translocation), ITX (intrachromosomal translocation)
8 | size | Numerical | Size of the variation
9 | varfreq | Numerical | Frequency at which the variation is observed (%)
10 | score | Numerical | Quality score
11 | depth | Numerical | Coverage at the indicated position
12 | sample | Text | Optional sample id

4.3 Reference genomes

IUPAC references

An IUPAC reference describes a heterozygous genome in which the alleles are indicated using the standard IUPAC codes for DNA. The file may be in a sequence file format such as FastA or FASTQ.

4.4 Assay Design

Assay Design File

The design.csv file contains all information required to design follow-up assays such as qPCR and genotyping assays.

Table 8. Assay design file
Column | Name | Format | Description
1 | contig | Text | Chromosome or contig
2 | position | Numerical | 1-based genomic position within chromosome or contig
3 | reference | Nucleotide base | Reference allele
4 | sequence | Sequence | DNA sequences left and right flanking the variant position. Any neighbouring SNVs are encoded in IUPAC. The actual SNV position is indicated using bracket notation ([A/T]).
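The sequence column of the assay design file can be illustrated with a short sketch that extracts the flanking sequence around a variant, writes the variant in bracket notation, and replaces neighbouring variant positions with their IUPAC codes. The flank of 75 bases follows the description of design.tab in Section 2.6; the function name, the argument layout, and the way neighbouring SNVs are passed in are assumptions made for this example.

```python
def design_sequence(reference, position, alt_base, flank=75, neighbours=None):
    """Build an assay design sequence around one SNV.

    reference   contig sequence as a string
    position    1-based position of the SNV on the contig
    alt_base    variant allele
    flank       number of bases to include on each side
    neighbours  optional dict mapping 1-based positions of neighbouring
                SNVs to their IUPAC codes
    """
    neighbours = neighbours or {}
    index = position - 1  # convert to 0-based string index
    start = max(0, index - flank)
    end = min(len(reference), index + flank + 1)
    parts = []
    for i in range(start, end):
        if i == index:
            # The SNV itself in bracket notation, reference allele first
            parts.append(f"[{reference[i]}/{alt_base}]")
        else:
            # Neighbouring SNVs are shown as IUPAC codes
            parts.append(neighbours.get(i + 1, reference[i]))
    return "".join(parts)


if __name__ == "__main__":
    contig = "ACGTACGTAC"
    # SNV A>T at position 5, neighbouring A/G variant (R) at position 7
    print(design_sequence(contig, 5, "T", flank=3, neighbours={7: "R"}))
```

Near the ends of a contig the flank is simply truncated in this sketch; whether the pipeline pads or truncates in that case is not stated in this guideline.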


More information

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow Technical Overview Import VCF Introduction Next-generation sequencing (NGS) studies have created unanticipated challenges with

More information

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer Project XX Customer Detail Table of Contents. Bioinformatics analysis pipeline...3.. Read quality check. 3.2. Read alignment...3.3.

More information

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Genome 373: Mapping Short Sequence Reads II. Doug Fowler Genome 373: Mapping Short Sequence Reads II Doug Fowler The final Will be in this room on June 6 th at 8:30a Will be focused on the second half of the course, but will include material from the first half

More information

QIAseq Targeted Panel Analysis Plugin USER MANUAL

QIAseq Targeted Panel Analysis Plugin USER MANUAL QIAseq Targeted Panel Analysis Plugin USER MANUAL User manual for QIAseq Targeted Panel Analysis 1.1 Windows, macos and Linux June 18, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej

More information

Variant Callers. J Fass 24 August 2017

Variant Callers. J Fass 24 August 2017 Variant Callers J Fass 24 August 2017 Variant Types Caller Consistency Pabinger (2014) Briefings Bioinformatics 15:256 Freebayes Bayesian haplotype caller that can call SNPs, short CNVs / duplications,

More information

Novel Variant Discovery Tutorial

Novel Variant Discovery Tutorial Novel Variant Discovery Tutorial Release 8.4.0 Golden Helix, Inc. August 12, 2015 Contents Requirements 2 Download Annotation Data Sources...................................... 2 1. Overview...................................................

More information

Comparing a few SNP calling algorithms using low-coverage sequencing data

Comparing a few SNP calling algorithms using low-coverage sequencing data Yu and Sun BMC Bioinformatics 2013, 14:274 RESEARCH ARTICLE Open Access Comparing a few SNP calling algorithms using low-coverage sequencing data Xiaoqing Yu 1 and Shuying Sun 1,2* Abstract Background:

More information

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 Alignment J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

Strand NGS Variant Caller

Strand NGS Variant Caller STRAND LIFE SCIENCES WHITE PAPER Strand NGS Variant Caller A Benchmarking Study Rohit Gupta, Pallavi Gupta, Aishwarya Narayanan, Somak Aditya, Shanmukh Katragadda, Vamsi Veeramachaneni, and Ramesh Hariharan

More information

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte

BICF Variant Analysis Tools. Using the BioHPC Workflow Launching Tool Astrocyte BICF Variant Analysis Tools Using the BioHPC Workflow Launching Tool Astrocyte Prioritization of Variants SNP INDEL SV Astrocyte BioHPC Workflow Platform Allows groups to give easy-access to their analysis

More information

Introduction to human genomics and genome informatics

Introduction to human genomics and genome informatics Introduction to human genomics and genome informatics Session 1 Prince of Wales Clinical School Dr Jason Wong ARC Future Fellow Head, Bioinformatics & Integrative Genomics Adult Cancer Program, Lowy Cancer

More information

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014 Alignment & Variant Discovery J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC 1/SC 29/WG 11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June

More information

Analytics Behind Genomic Testing

Analytics Behind Genomic Testing A Quick Guide to the Analytics Behind Genomic Testing Elaine Gee, PhD Director, Bioinformatics ARUP Laboratories 1 Learning Objectives Catalogue various types of bioinformatics analyses that support clinical

More information

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data.

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data. RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data. Adam Gudyś and Tomasz Stokowy October 30, 2017 Introduction The search for causative genetic variants in rare diseases

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Software Version 2.4 Related Documents... 1 Changes to Version 2.4... 2 Changes to Version 2.2... 4 Changes to Version 2.0... 5 Changes

More information

Oncomine cfdna Assays Part III: Variant Analysis

Oncomine cfdna Assays Part III: Variant Analysis Oncomine cfdna Assays Part III: Variant Analysis USER GUIDE for use with: Oncomine Lung cfdna Assay Oncomine Colon cfdna Assay Oncomine Breast cfdna Assay Catalog Numbers A31149, A31182, A31183 Publication

More information

Read Mapping and Variant Calling. Johannes Starlinger

Read Mapping and Variant Calling. Johannes Starlinger Read Mapping and Variant Calling Johannes Starlinger Application Scenario: Personalized Cancer Therapy Different mutations require different therapy Collins, Meredith A., and Marina Pasca di Magliano.

More information

Introduction to Next Generation Sequencing

Introduction to Next Generation Sequencing The Sequencing Revolution Introduction to Next Generation Sequencing Dena Leshkowitz,WIS 1 st BIOmics Workshop High throughput Short Read Sequencing Technologies Highly parallel reactions (millions to

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Version 2.0 Related Documents... 1 Changes to Version 2.0... 2 Changes to Version 1.12.0... 10 Changes to Version 1.11.0... 12 Changes

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Lecture 7. Next-generation sequencing technologies

Lecture 7. Next-generation sequencing technologies Lecture 7 Next-generation sequencing technologies Next-generation sequencing technologies General principles of short-read NGS Construct a library of fragments Generate clonal template populations Massively

More information

MPG NGS workshop I: SNP calling

MPG NGS workshop I: SNP calling MPG NGS workshop I: SNP calling Mark DePristo Manager, Medical and Popula

More information

Genomics: Human variation

Genomics: Human variation Genomics: Human variation Lecture 1 Introduction to Human Variation Dr Colleen J. Saunders, PhD South African National Bioinformatics Institute/MRC Unit for Bioinformatics Capacity Development, University

More information

Reference genomes and common file formats

Reference genomes and common file formats Reference genomes and common file formats Dóra Bihary MRC Cancer Unit, University of Cambridge CRUK Functional Genomics Workshop September 2017 Overview Reference genomes and GRC Fasta and FastQ (unaligned

More information

DATA FORMATS AND QUALITY CONTROL

DATA FORMATS AND QUALITY CONTROL HTS Summer School 12-16th September 2016 DATA FORMATS AND QUALITY CONTROL Romina Petersen, University of Cambridge (rp520@medschl.cam.ac.uk) Luigi Grassi, University of Cambridge (lg490@medschl.cam.ac.uk)

More information

Reference genomes and common file formats

Reference genomes and common file formats Reference genomes and common file formats Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Software Version 2.2 Related Documents... 1 Changes to Version 2.2... 2 Changes to Version 2.0... 3 Changes to Version 1.12.0... 11

More information

Release Notes for Genomes Processed Using Complete Genomics Software

Release Notes for Genomes Processed Using Complete Genomics Software Release Notes for Genomes Processed Using Complete Genomics Software Version 1.11.0 Related Documents... 1 Changes to Version 1.11.0... 2 Changes to Version 1.10.0... 6 Changes to Version 1.9.0... 10 Changes

More information

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang Supplementary Materials for: Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John

More information

Gene Expression analysis with RNA-Seq data

Gene Expression analysis with RNA-Seq data Gene Expression analysis with RNA-Seq data C3BI Hands-on NGS course November 24th 2016 Frédéric Lemoine Plan 1. 2. Quality Control 3. Read Mapping 4. Gene Expression Analysis 5. Splicing/Transcript Analysis

More information

About Strand NGS. Strand Genomics, Inc All rights reserved.

About Strand NGS. Strand Genomics, Inc All rights reserved. About Strand NGS Strand NGS-formerly known as Avadis NGS, is an integrated platform that provides analysis, management and visualization tools for next-generation sequencing data. It supports extensive

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet May 2013 Standard sequence library generation Illumina

More information

CITATION FILE CONTENT / FORMAT

CITATION FILE CONTENT / FORMAT CITATION 1) For any resultant publications using single samples please cite: Matthew A. Field, Vicky Cho, T. Daniel Andrews, and Chris C. Goodnow (2015). "Reliably detecting clinically important variants

More information

Variant Analysis. CB2-201 Computational Biology and Bioinformatics! February 27, Emidio Capriotti!

Variant Analysis. CB2-201 Computational Biology and Bioinformatics! February 27, Emidio Capriotti! Variant Analysis CB2-201 Computational Biology and Bioinformatics February 27, 2015 Emidio Capriotti http://biofold.org/emidio Division of Informatics Department of Pathology Variant Call Format The final

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature13127 Factors to consider in assessing candidate pathogenic mutations in presumed monogenic conditions The questions itemized below expand upon the definitions in Table 1 and are provided

More information

Custom TaqMan Assays DESIGN AND ORDERING GUIDE. For SNP Genotyping and Gene Expression Assays. Publication Number Revision G

Custom TaqMan Assays DESIGN AND ORDERING GUIDE. For SNP Genotyping and Gene Expression Assays. Publication Number Revision G Custom TaqMan Assays DESIGN AND ORDERING GUIDE For SNP Genotyping and Gene Expression Assays Publication Number 4367671 Revision G For Research Use Only. Not for use in diagnostic procedures. Manufacturer:

More information

Annotating your variants: Ensembl Variant Effect Predictor (VEP) Helen Sparrow Ensembl EMBL-EBI 2nd November 2016

Annotating your variants: Ensembl Variant Effect Predictor (VEP) Helen Sparrow Ensembl EMBL-EBI 2nd November 2016 Training materials Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/ If you wish to re-use these materials, please credit Ensembl for their creation

More information

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010 Mapping Next Generation Sequence Reads Bingbing Yuan Dec. 2, 2010 1 What happen if reads are not mapped properly? Some data won t be used, thus fewer reads would be aligned. Reads are mapped to the wrong

More information

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing: Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing: Patented, Anti-Correlation Technology Provides 99.5% Accuracy & Sensitivity to 5% Variant Knowledge Base and External Annotation

More information

NEXT GENERATION SEQUENCING. Farhat Habib

NEXT GENERATION SEQUENCING. Farhat Habib NEXT GENERATION SEQUENCING HISTORY HISTORY Sanger Dominant for last ~30 years 1000bp longest read Based on primers so not good for repetitive or SNPs sites HISTORY Sanger Dominant for last ~30 years 1000bp

More information

MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES.

MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES. MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES. Table of Contents Examples 1 Sample Analyses 5 Examples: Introduction to Examples While these examples can be followed

More information

Introduc)on to Genomics

Introduc)on to Genomics Introduc)on to Genomics Libor Mořkovský, Václav Janoušek, Anastassiya Zidkova, Anna Přistoupilová, Filip Sedlák h1p://ngs-course.readthedocs.org/en/praha-january-2017/ Genome The genome is the gene,c material

More information

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University

Bulked Segregant Analysis For Fine Mapping Of Genes. Cheng Zou, Qi Sun Bioinformatics Facility Cornell University Bulked Segregant Analysis For Fine Mapping Of enes heng Zou, Qi Sun Bioinformatics Facility ornell University Outline What is BSA? Keys for a successful BSA study Pipeline of BSA extended reading ompare

More information

Mapping errors require re- alignment

Mapping errors require re- alignment RE- ALIGNMENT Mapping errors require re- alignment Source: Heng Li, presenta8on at GSA workshop 2011 Alignment Key component of alignment algorithm is the scoring nega8ve contribu8on to score opening a

More information

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq. Reads to Discovery RNA-Seq Small DNA-Seq ChIP-Seq Methyl-Seq RNA-Seq MeDIP-Seq www.strand-ngs.com Analyze Visualize Annotate Discover Data Import Alignment Vendor Platforms: Illumina Ion Torrent Roche

More information

Bionano Access : Assembly Report Guidelines

Bionano Access : Assembly Report Guidelines Bionano Access : Assembly Report Guidelines Document Number: 30255 Document Revision: A For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics Inc. All Rights Reserved

More information

Homework 4. Due in class, Wednesday, November 10, 2004

Homework 4. Due in class, Wednesday, November 10, 2004 1 GCB 535 / CIS 535 Fall 2004 Homework 4 Due in class, Wednesday, November 10, 2004 Comparative genomics 1. (6 pts) In Loots s paper (http://www.seas.upenn.edu/~cis535/lab/sciences-loots.pdf), the authors

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature26136 We reexamined the available whole data from different cave and surface populations (McGaugh et al, unpublished) to investigate whether insra exhibited any indication that it has

More information

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data BST227 Introduction to Statistical Genetics Lecture 8: Variant calling from high-throughput sequencing data 1 PC recap typical genome Differs from the reference genome at 4-5 million sites ~85% SNPs ~15%

More information

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1 BST 226 Statistical Methods for Bioinformatics David M. Rocke March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1 NGS Technologies Illumina Sequencing HiSeq 2500 & MiSeq PacBio Sequencing PacBio

More information

Proteogenomics. Kelly Ruggles, Ph.D. Proteomics Informatics Week 9

Proteogenomics. Kelly Ruggles, Ph.D. Proteomics Informatics Week 9 Proteogenomics Kelly Ruggles, Ph.D. Proteomics Informatics Week 9 Proteogenomics: Intersection of proteomics and genomics As the cost of high-throughput genome sequencing goes down whole genome, exome

More information

Compatible with: Ion Torrent Platforms Roche Sequencing Platforms Illumina Sequencing Platforms Life Technologies SOLiD System

Compatible with: Ion Torrent Platforms Roche Sequencing Platforms Illumina Sequencing Platforms Life Technologies SOLiD System Application Modules for: SNP/Indel/Structural Variant Analysis CNV Analysis Somatic Mutation Mining Large Genome Alignment and Variant Discovery Exome Analysis and Variant Discovery RNA-Seq/Transcriptome

More information

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory. Ensembl Tools EBI is an Outstation of the European Molecular Biology Laboratory. Questions? We ve muted all the mics Ask questions in the Chat box in the webinar interface I will check the Chat box periodically

More information

Sanger vs Next-Gen Sequencing

Sanger vs Next-Gen Sequencing Tools and Algorithms in Bioinformatics GCBA815/MCGB815/BMI815, Fall 2017 Week-8: Next-Gen Sequencing RNA-seq Data Analysis Babu Guda, Ph.D. Professor, Genetics, Cell Biology & Anatomy Director, Bioinformatics

More information

Bionano Solve Theory of Operation: Variant Annotation Pipeline

Bionano Solve Theory of Operation: Variant Annotation Pipeline Bionano Solve Theory of Operation: Variant Annotation Pipeline Document Number: 30190 Document Revision: B For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics,

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

Introduction to RNA-Seq in GeneSpring NGS Software

Introduction to RNA-Seq in GeneSpring NGS Software Introduction to RNA-Seq in GeneSpring NGS Software Dipa Roy Choudhury, Ph.D. Strand Scientific Intelligence and Agilent Technologies Learn more at www.genespring.com Introduction to RNA-Seq In a few years,

More information

Whole Genome Sequencing. Biostatistics 666

Whole Genome Sequencing. Biostatistics 666 Whole Genome Sequencing Biostatistics 666 Genomewide Association Studies Survey 500,000 SNPs in a large sample An effective way to skim the genome and find common variants associated with a trait of interest

More information

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager

Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service. Dr. Ruth Burton Product Manager Introducing combined CGH and SNP arrays for cancer characterisation and a unique next-generation sequencing service Dr. Ruth Burton Product Manager Today s agenda Introduction CytoSure arrays and analysis

More information

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech GALAXY INITIATION A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech How does Next- Gen sequencing work? DNA fragmentation Size selection and clonal amplification Massive parallel sequencing ACCGTTTGCCG

More information

Normal-Tumor Comparison using Next-Generation Sequencing Data

Normal-Tumor Comparison using Next-Generation Sequencing Data Normal-Tumor Comparison using Next-Generation Sequencing Data Chun Li Vanderbilt University Taichung, March 16, 2011 Next-Generation Sequencing First-generation (Sanger sequencing): 115 kb per day per

More information

Create a Planned Run. Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin USER BULLETIN. Publication Number MAN Revision A.

Create a Planned Run. Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin USER BULLETIN. Publication Number MAN Revision A. USER BULLETIN Create a Planned Run Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin Publication Number MAN0013730 Revision A.0 For Research Use Only. Not for use in diagnostic procedures.

More information

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012 + Variant Detection in Next Generation Sequencing Data John Osborne Sept 14, 2012 + Overview My Bias Talk slanted towards analyzing whole genomes using Illumina paired end reads with open source tools

More information

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight 1 Workflow Import NGS raw data QC on reads De novo assembly Trim reads Finding Genes BLAST Sample to Insight Case Study Pseudomonas aeruginosa

More information

De Novo Assembly of High-throughput Short Read Sequences

De Novo Assembly of High-throughput Short Read Sequences De Novo Assembly of High-throughput Short Read Sequences Chuming Chen Center for Bioinformatics and Computational Biology (CBCB) University of Delaware NECC Third Skate Genome Annotation Workshop May 23,

More information

The Diploid Genome Sequence of an Individual Human

The Diploid Genome Sequence of an Individual Human The Diploid Genome Sequence of an Individual Human Maido Remm Journal Club 12.02.2008 Outline Background (history, assembling strategies) Who was sequenced in previous projects Genome variations in J.

More information

Oral Cleft Targeted Sequencing Project

Oral Cleft Targeted Sequencing Project Oral Cleft Targeted Sequencing Project Oral Cleft Group January, 2013 Contents I Quality Control 3 1 Summary of Multi-Family vcf File, Jan. 11, 2013 3 2 Analysis Group Quality Control (Proposed Protocol)

More information

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Abundance RNA is... Diverse Dynamic Central DNA rrna Epigenetics trna RNA mrna Time Protein Abundance

More information

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science + UAB DNA-Seq Analysis Workshop John Osborne Research Associate Centers for Clinical and Translational Science ozborn@uab.,edu + Thanks in advance You are the Guinea pigs for this workshop! At this point

More information

SMAP File Format Specification Sheet

SMAP File Format Specification Sheet SMAP File Format Specification Sheet Document Number: 30041 Document Revision: E For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics Inc. All Rights Reserved Table

More information

Supplementary information ATLAS

Supplementary information ATLAS Supplementary information ATLAS Vivian Link, Athanasios Kousathanas, Krishna Veeramah, Christian Sell, Amelie Scheu and Daniel Wegmann Section 1: Complete list of functionalities Sequence data processing

More information

Training materials.

Training materials. Training materials Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/ If you wish to re-use these materials, please credit Ensembl for their creation

More information

Selecting TILLING mutants

Selecting TILLING mutants Selecting TILLING mutants The following document will explain how to select TILLING mutants for your gene(s) of interest. To begin, you will need the IWGSC gene model identifier for your gene(s), the IWGSC

More information

GBS Usage Cases: Non-model Organisms. Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University

GBS Usage Cases: Non-model Organisms. Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University GBS Usage Cases: Non-model Organisms Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University Q: How many SNPs will I get? A: 42. What question do you really want to ask?

More information

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR Human Genetic Variation Ricardo Lebrón rlebron@ugr.es Dpto. Genética UGR What is Genetic Variation? Origins of Genetic Variation Genetic Variation is the difference in DNA sequences between individuals.

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Transcriptomics analysis with RNA seq: an overview Frederik Coppens Transcriptomics analysis with RNA seq: an overview Frederik Coppens Platforms Applications Analysis Quantification RNA content Platforms Platforms Short (few hundred bases) Long reads (multiple kilobases)

More information

SVMerge Output File Format Specification Sheet

SVMerge Output File Format Specification Sheet SVMerge Output File Format Specification Sheet Document Number: 30165 Document Revision: C For Research Use Only. Not for use in diagnostic procedures. Copyright 2017 Bionano Genomics, Inc. All Rights

More information

Lecture 2: Biology Basics Continued

Lecture 2: Biology Basics Continued Lecture 2: Biology Basics Continued Central Dogma DNA: The Code of Life The structure and the four genomic letters code for all living organisms Adenine, Guanine, Thymine, and Cytosine which pair A-T and

More information

Ion AmpliSeq Designer: Getting Started

Ion AmpliSeq Designer: Getting Started Ion AmpliSeq Designer: Getting Started USER GUIDE Publication Number MAN0010907 Revision E.0 For Research Use Only. Not for use in diagnostic procedures. Manufacturer: Life Technologies Corporation Carlsbad,

More information

Design and Ordering Guide. Custom TaqMan Assays. For New SNP Genotyping and Gene Expression Assays

Design and Ordering Guide. Custom TaqMan Assays. For New SNP Genotyping and Gene Expression Assays Design and Ordering Guide Custom TaqMan Assays For New SNP Genotyping and Gene Expression Assays For Research Use Only. Not for use in diagnostic procedures. Information in this document is subject to

More information

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer. DNA Preparation and QC Extraction DNA was extracted from whole blood or flash frozen post-mortem tissue using a DNA mini kit (QIAmp #51104 and QIAmp#51404, respectively) following the manufacturer s recommendations.

More information

Introduction to RNA sequencing

Introduction to RNA sequencing Introduction to RNA sequencing Bioinformatics perspective Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden November 2017 Olga (NBIS) RNA-seq November 2017 1 / 49 Outline Why sequence

More information