Reference genomes and common file formats

Similar documents
Reference genomes and common file formats

Bioinformatics in next generation sequencing projects

ISO/IEC JTC 1/SC 29/WG 11 N15527 Warsaw, CH June Introduction

DATA FORMATS AND QUALITY CONTROL

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

Sanger vs Next-Gen Sequencing

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

Gene Expression analysis with RNA-Seq data

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Quantifying gene expression

Analysis of RNA-seq Data. Feb 8, 2017 Peikai CHEN (PHD)

Introduction to Next Generation Sequencing

NGS, Cancer and Bioinformatics. 5/3/2015 Yannick Boursin

10/06/2014. RNA-Seq analysis. With reference assembly. Cormier Alexandre, PhD student UMR8227, Algal Genetics Group

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

How to deal with your RNA-seq data?

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

NGS in Pathology Webinar

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

Galaxy for Next Generation Sequencing 初探次世代序列分析平台 蘇聖堯 2013/9/12

RNA-seq Data Analysis

DNASeq: Analysis pipeline and file formats Sumir Panji, Gerrit Boha and Amel Ghouila

Introduction to NGS analyses

Introduction to bioinformatics (NGS data analysis)

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Bioinformatics for NGS projects. Guidelines. genomescan.nl

Introduc)on to Bioinforma)cs of next- genera)on sequencing. Sequence acquisi)on and processing; genome mapping and alignment manipula)on

BME 110 Midterm Examination

user s guide Question 1

Lecture 7. Next-generation sequencing technologies

Basic Bioinformatics: Homology, Sequence Alignment,

Investigating Inherited Diseases

Small Exon Finder User Guide

Introduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012

Bionano Access v1.1 Release Notes

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Guided tour to Ensembl

ATAC-seq workshop. Katarzyna Kedzierska Center for Public Health Genomics University of Virginia. Jachranka, Poland September 2017

QIAseq Targeted Panel Analysis Plugin USER MANUAL

RNA-Seq Workshop AChemS Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

RNA-Seq Module 2 From QC to differential gene expression.

Data Retrieval from GenBank

CrusView is a tool for karyotype/genome visualization and comparison of crucifer species. It also provides functions to import new genomes.

Building a platinum human genome assembly from single haplotype human genomes. Karyn Meltz Steinberg PacBio UGM December,

Ensembl Tools. EBI is an Outstation of the European Molecular Biology Laboratory.

De Novo Assembly of High-throughput Short Read Sequences

Galaxy Platform For NGS Data Analyses

Training materials.

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Alignment & Variant Discovery. J Fass UCD Genome Center Bioinformatics Core Tuesday June 17, 2014

SNP calling and VCF format

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

VM origin. Okeanos: Image Trinity_U16 (upgrade to Ubuntu16.04, thanks to Alexandros Dimopoulos) X2go: LXDE

BST 226 Statistical Methods for Bioinformatics David M. Rocke. March 10, 2014 BST 226 Statistical Methods for Bioinformatics 1

Introduc)on to Genomics

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017


An Introduction to the package geno2proteo

Using the Genome Browser: A Practical Guide. Travis Saari

TECH NOTE Ligation-Free ChIP-Seq Library Preparation

Chapter 2: Access to Information

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES.

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

NEXT GENERATION SEQUENCING. Farhat Habib

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

SMAP File Format Specification Sheet

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

MODULE 5: TRANSLATION

COMPUTER RESOURCES II:

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

European Union Reference Laboratory for Genetically Modified Food and Feed (EURL GMFF)

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Read Mapping and Variant Calling. Johannes Starlinger

From reads to results. Dr Torsten Seemann

Next-Generation Sequencing. Technologies

Hands-On Four Investigating Inherited Diseases

RNA-Seq Software, Tools, and Workflows

Illumina Sequencing Error Profiles and Quality Control

RNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford

axe Documentation Release g6d4d1b6-dirty Kevin Murray

Supplementary Information Supplementary Figures

Package geno2proteo. December 12, 2017

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Introduction to human genomics and genome informatics

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Francisco García Quality Control for NGS Raw Data

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

Biol4230 Thurs, March 22, 2018 Bill Pearson Pinn 6-057

L3: Short Read Alignment to a Reference Genome

Introduction to DNA-Sequencing

Analytics Behind Genomic Testing

ab initio and Evidence-Based Gene Finding

Bionano Access 1.1 Software User Guide

Transcription:

Reference genomes and common file formats

Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores)

Why do we need to know about reference genomes? Allows for genes and genomic features to be evaluated in their genomic context. Gene A is close to gene B Gene A and gene B are within feature C Can be used to align shallow targeted high-throughput sequencing to a pre-built map of an organism

Genome Reference Consortium (GRC) Most model organism reference genomes are being regularly updated Reference genomes consist of a mixture of known chromosomes and unplaced contigs called as Genome Reference Assembly Genome Reference Consortium: A collaboration of institutes which curate and maintain the reference genomes of 4 model organisms: Human - GRCh38.p9 (26 Sept 2016) Mouse - GRCm38.p5 (29 June 2016) Zebrafish - GRCz10 (12 Sept 2014) Chicken - Gallus_gallus-5.0 (16 Dec 2015) Latest human assembly is GRCh38, patches add information to the assembly without disrupting the chromosome coordinates Other model organisms are maintained separately, like: Drosophila - Berkeley Drosophila Genome Project

Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores)

The reference genome A reference genome is a collection of contigs A contig is a stretch of DNA sequence encoded as A, G, C, T or N Typically comes in FASTA format: ">" line contains information on contig Following lines contain contig sequences

Unaligned sequences - FastQ Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence FastQ (unaligned sequences) SAM (aligned sequences) FastQ: FASTA with quality "@" followed by identifier Sequence information "+" Quality scores encodes as ASCI

Unaligned sequences - FastQ header Header for each read can contain additional information HS2000-887_89 - Machine name 5 - Flowcell lane /1 - Read 1 or 2 of pair

Unaligned sequences - FastQ qualities Qualities come after the "+" line -log10 probability of sequence base being wrong Encoded in ASCII to save space Used in quality assessment and downstream analysis

Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores)

Aligned sequences - SAM format SAM - Sequence Alignment Map Standard format for sequence data Recognised by majority of software and browsers SAM header SAM header contains information on alignment and contigs used @HD - Version number and sorting information @SQ - Contig/Chromosome name and length of sequence

Aligned sequences - SAM format SAM aligned reads Contains read and alignment information and location Read name Sequence of read Encoded sequence quality

Aligned sequences - SAM format SAM aligned reads Chromosome to which the read aligns Position in chromosome to which 5' end of the read aligns Alignment information - "Cigar string" 100M - Continuous match of 100 bases 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match

Aligned sequences - SAM format SAM aligned reads Bit flag - TRUE/FALSE for pre-defined read criteria, like: is it paired? duplicate? https://broadinstitute.github.io/picard/explain-flags.html Paired read position and insert size User defined flags Li H et al.,the Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.

Overview Reference genomes and GRC Fasta and FastQ (unaligned sequences) SAM/BAM (aligned sequences) Summarized genomic features BED (genomic intervals) GFF/GTF (gene annotation) Wiggle files, BEDgraphs, BigWigs (genomic scores)

Summarised genomic features formats After alignment, sequence reads are typically summarised into scores over/within genomic intervals BED - genomic intervals with additional information Wiggle files, BEDgraphs, BigWigs - genomic intervals with scores GFF/GTF - genomic annotation with information and scores

BED format - genomic intervals BED3-3 tab separated columns Chromosome Start End Simplest format BED6-6 tab separated columns Chromosome, start, end Identifier Score Strand ("." stands for strandless)

Wiggle format - genomic scores Variable step Wiggle format Fixed step Wiggle format Information line Chromosome Step size (Span - default=1, to describe contiguous positions with same value) Each line contains: Start position of the step Score Information line Chromosome Start position of first step Step size (Span - default=1, to describe contiguous positions with same value) Each line contains: Score

bedgraph format - genomic scores BED-like format Starts as a 3 column BED file (chromosome, start, end) 4th column: score value

GFF - genomic annotation Stores position, feature (exon) and meta-feature (transcript/gene) information Columns: Chromosome Source Feature type Start position End position Score Strand Frame - 0, 1 or 2 indicating which base of the feature is the first base of the codon Semicolon separated attribute: ID (feature name);parent (meta-feature name)

Saving time and space - compressed file formats Many programs and browsers deal better with compressed, indexed versions of genomic files SAM -> BAM (.bam and index file of.bai) BED -> bigbed (.bb) Wiggle and bedgraph -> bigwig (.bw/.bigwig) BED and GFF -> (.gz and index file of.tbi)

Getting help and more information UCSC file formats https://genome.ucsc.edu/faq/faqformat.html IGV file formats http://software.broadinstitute.org/software/igv/fileformats Sanger file formats http://gmod.org/wiki/gff3

Acknowledgement Tom Carroll http://mrccsc.github.io/genomic_formats/genomicfileformats.html#/