Introduction to the UCSC genome browser

Similar documents
Introduction to human genomics and genome informatics

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

NGS Approaches to Epigenomics

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Reads to Discovery. Visualize Annotate Discover. Small DNA-Seq ChIP-Seq Methyl-Seq. MeDIP-Seq. RNA-Seq. RNA-Seq.

Molecular Genetics of Disease and the Human Genome Project

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

RNA-Sequencing analysis

Control of Eukaryotic Gene Expression (Learning Objectives)

Transcriptomics. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

About Strand NGS. Strand Genomics, Inc All rights reserved.

Complete draft sequence 2001

Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail

What Can the Epigenome Teach Us About Cellular States and Diseases?

RNA-SEQUENCING ANALYSIS

Genomes: What we know and what we don t know

2/19/13. Contents. Applications of HMMs in Epigenomics

Applications of HMMs in Epigenomics

The University of California, Santa Cruz (UCSC) Genome Browser

Satellite Education Workshop (SW4): Epigenomics: Design, Implementation and Analysis for RNA-seq and Methyl-seq Experiments

Multi-omics in biology: integration of omics techniques

Introduction to Genomic Medicine Exeter Expert Series: Cardiology

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Chapter 5. Structural Genomics

Deep Sequencing technologies

Introduction to Genome Biology

AP Biology. The BIG Questions. Chapter 19. Prokaryote vs. eukaryote genome. Prokaryote vs. eukaryote genome. Why turn genes on & off?

Analysis of data from high-throughput molecular biology experiments Lecture 6 (F6, RNA-seq ),

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Unit 1 Human cells. 1. Division and differentiation in human cells

Chapter 19. Control of Eukaryotic Genome. AP Biology

Introduction to NGS analyses

Chapter 18: Regulation of Gene Expression. 1. Gene Regulation in Bacteria 2. Gene Regulation in Eukaryotes 3. Gene Regulation & Cancer

Control of Eukaryotic Genes. AP Biology

Overview of Human Genetics

Lecture 2: Biology Basics Continued. Fall 2018 August 23, 2018

Unit IX Problem 3 Genetics: Basic Concepts in Molecular Biology

Gene Regulation Biology

Enzyme that uses RNA as a template to synthesize a complementary DNA

Introduction to Bioinformatics

2/10/17. Contents. Applications of HMMs in Epigenomics

What we ll do today. Types of stem cells. Do engineered ips and ES cells have. What genes are special in stem cells?

Lecture 2: Biology Basics Continued

ChIP-seq and RNA-seq. Farhat Habib

Do engineered ips and ES cells have similar molecular signatures?

02 Agenda Item 03 Agenda Item

Analytics Behind Genomic Testing

Human genetic variation

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

DNA and Biotechnology Form of DNA Form of DNA Form of DNA Form of DNA Replication of DNA Replication of DNA

CS273B: Deep learning for Genomics and Biomedicine

Targeted RNA sequencing reveals the deep complexity of the human transcriptome.

Transcription in Eukaryotes

Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes

Chapter 17 Lecture. Concepts of Genetics. Tenth Edition. Regulation of Gene Expression in Eukaryotes

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

DNA METHYLATION RESEARCH TOOLS

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

Genomic resources. for non-model systems

Unit 6: Molecular Genetics & DNA Technology Guided Reading Questions (100 pts total)

Gene Regulation 10/19/05

The Human Genome and its upcoming Dynamics

Introduction to BIOINFORMATICS

Wednesday, November 22, 17. Exons and Introns

DNA. bioinformatics. epigenetics methylation structural variation. custom. assembly. gene. tumor-normal. mendelian. BS-seq. prediction.

ChIP-seq and RNA-seq

Regulation of Gene Expression

Jumping Into Your Gene Pool: Understanding Genetic Test Results

CELL BIOLOGY - CLUTCH CH. 7 - GENE EXPRESSION.

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Chapter 11: Regulation of Gene Expression

Chapter 12 Packet DNA 1. What did Griffith conclude from his experiment? 2. Describe the process of transformation.

Chapter 2. An Introduction to Genes and Genomes

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Before starting, write your name on the top of each page Make sure you have all pages

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

In silico variant analysis: Challenges and Pitfalls

Bi 8 Lecture 5. Ellen Rothenberg 19 January 2016

Introduction to Next Generation Sequencing (NGS) Andrew Parrish Exeter, 2 nd November 2017

Galaxy Platform For NGS Data Analyses

user s guide Question 1

Therapeutic & Prevention Application of Nucleic Acids

Gene-centered resources at NCBI

STAT 536: Genetic Statistics

Genetics Biology 331 Exam 3B Spring 2015

The Human Genome Project has always been something of a misnomer, implying the existence of a single human genome

Section C: The Control of Gene Expression

The Nature of Genes. The Nature of Genes. Genes and How They Work. Chapter 15/16

Investigating Inherited Diseases

Genomics and Gene Recognition Genes and Blue Genes

From DNA to Protein: Genotype to Phenotype

Finding Enrichments of Functional Annotations for Disease- Associated Single-Nucleotide Polymorphisms

STSs and ESTs. Sequence-Tagged Site: short, unique sequence Expressed Sequence Tag: short, unique sequence from a coding region

Chapter 15 Gene Technologies and Human Applications

Regulation of Gene Expression

Authors: Vivek Sharma and Ram Kunwar

Transcription:

Introduction to the UCSC genome browser Dominik Beck NHMRC Peter Doherty and CINSW ECR Fellow, Senior Lecturer Lowy Cancer Research Centre, UNSW and Centre for Health Technology, UTS SYDNEY NSW AUSTRALIA

What we will cover Structure of the human genome Genomic information Data acquisition UCSC Genome Browser

Structure of human genome A C C T G G Annunziato A. 2008. DNA packaging: Nucleosomes and chromatin. Nature Education 1(1).

Structure of human genome Total of 23 pairs of chromosomes. Each chromosome is diploid. Each individual chromosome made up of double stranded DNA. ~3 billion bps (2m) compacted in a cell (15 μm) Annunziato A. 2008. DNA packaging: Nucleosomes and chromatin. Nature Education 1(1).

Information in the genome Genes: ~1.2% coding ~2% non-coding Regulatory regions: ~2% Repetitive elements comprise another ~50% of the human genome

Information in the genome Encyclopedia of DNA Elements: ENCODE 147 cell types / 1,640 data sets 80.4% of the human genome participates in at least one biochemical event 95% within 8 kb of a biochemical events 99% within 1.7 kb of a biochemical events Clark et all 2015 Capture sequencing / 24 cell types 22046 novel exons 10136 novel splice junctions Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247. Nat Methods. 2015 Apr;12(4):339-42. doi: 10.1038/nmeth.3321.

Reference human genome Human genomes vary significantly between individuals (~0.1%) Important things to note about the reference genome: Is a composite sequence (i.e. does not correspond to anyone s genome) Is haploid (i.e. only 1 sequence) Computationally, a reference genome is used.

Reference human genome Genomic data is most common represented in two ways: 1. Sequence data fasta format (.fa or.fasta) >chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt... 2. Location data bed format (.bed) chr1 934343 935552 HES4 0 - chr1 948846 949919 ISG15 0 +... chromosome start end name score strand All about genomic formats here - http://genome.ucsc.edu/faq/faqformat.html

What we will cover Structure of the human genome Genomic information DNA (Sequence variation) RNA (Genes & gene expression) Regulation\Epigenetics DNA methylation Histone modification Transcription factor binding

DNA: Sequence variation

Variations in DNA sequence Cytological level: Entire chromosome (e.g. chromosome numbers) Partial chromosome (e.g. segmental duplications, rearrangements, and deletions) Sub-chromosomal level: Transposable elements Short Deletions/Insertions, Tandem repeats Sequence level: Single Nucleotide Polymorphisms (SNPs) Small Nucleotide Insertions and Deletions (Indels; <=100bps)

Sequence variation Single nucleotide polymorphisms (SNPs) DNA sequence variations that exist with members of a species. They are inherited at birth and therefore present in all cells. Somatic mutations Are somatic i.e. only present in some cells. Mutations are often observed in cancer cells.

Types of SNPs/Mutations TSS Non-coding Intergenic region TSS Coding Synonymous Non-Synonymous Most SNPs and mutations fall in intergenic regions. Within genes, they can either fall in the non-coding or coding regions. Within coding regions, they can either not-change (synonymous) or change (non-synonymous) amino acids.

Effects of sequence variation Non-synonymous variants: Missense (change protein structure) Nonsense (truncates protein) Synonymous or non-coding variants: Alter transcriptional/translational efficiency Alter mrna stability Alter gene regulation (i.e. alter TF binding) Alter RNA-regulation (i.e. affect mirna binding) Majority of sequence variation are neutral (<1% phenotype)

RNA: Genes and gene expression

Types of genes A gene is a functional unit of DNA that is transcribed into RNA. Total genes in the human genome 57,445 mrna mirna lncrna Source: GENCODE (version 18)

Protein coding genes ~ 20,000 in the human genome. Due to splicing one gene can make many proteins. Traditionally considered to be the most important functional unit of genomes. Source: http://www.news-medical.net

MicroRNA (mirna) mirna gene pri-mirna pre-mirna mirna/mirna* mirisk w selected mirna arm Nucleus Cytoplasm Discovered in 1993. Plays a role in posttranscriptional regulation. Acts by either causing RNA degradation or inhibition of translation. Implicated in many aspects of health and disease including: Development Cancer Heart disease

Long non-coding RNA (lncrna) Recently described class of RNAs which often transcribed by PolII promoters and often spliced. Unlike coding and mirnas, lncrna are less conserve. Non-coding transcripts > 200 nt in length. Many functions. Commonly recruitment of histone modifiers

RNA expression Measuring the level of RNA in the sample. Generally microarray-, sequencing- or high-throughput PCR- based. Computation analysis and normalisation of expression data can be complicated.

RNA expression applications Relatively cheap and fast readout of the functional state of a cell HSC MEP Megakaryocyte Association with clinical features - sequence variations - response to therapy - patient survival Differential expression - between samples, or - between genes BCells TCells

RNA expression applications Differential expression of individual genes not necessarily informative. Genes are often grouped in gene-sets based on ontology or biological pathways.

Gene Regulation Epigenetics

Epigenetics Mechanisms that alter cellular function independent to any changes in DNA sequence Mechanisms include: Transcriptional regulation: Transcription Factors Genome methylation Histone modification / Nucleosome positioning Non-coding RNA

Transcriptional regulation Transcription factors are proteins that bind DNA to co-regulate gene expression. Typically binds at gene promoters or enhancers.

DNA methylation DNA is methylated on cytosine's in CpG dinucleotides

Nucleosomes & Histones Acetylation Methylation Phosphorylation Ubiquitination

What we will cover Structure of the human genome Genomic information DNA (Sequence variation) RNA (Genes & gene expression) Epigenetics DNA methylation Histone modification Transcription factor binding Data acquisition Microarrays Sequencing Chromatin IP

Array Technology Labeling Processing Relies on fluorescence-based on hybridisation of DNA against complementary probe on array. Known molecule that can be converted to cdna. Expression array (probe for exonic DNA regions) SNP array (probe for two alleles) Methylation array (probe for bisulfide converted DNA) Limited by probes present on the array. https://www.dkfz.de/gpcf/affymetrix_genechips.html

Array Technology Images Processing Quantification Pre-processing Backgrd. Subs., Norm. Statistics & Data Analytics e.g. DiffExp, Clinical Assoc Post-processing Batch and Outlier removal Systems Biology e.g. Pathway analysis https://www.dkfz.de/gpcf/affymetrix_genechips.html

Next-generation sequencing

Next-generation sequencing (Illumina) Library preparation cdna Synthesis Sequencing

RNA-seq (vs mrna Array) Alignment human reference genome Quantification mrna/mirna/lncrna Statistics / Bioinformatics

Chromatin Immunoprecipitation Sequencing (ChIP-seq) ChIP-seq of the seven transcription factors FLI1, ERG, GATA2, RUNX1, SCL, LYL1 and LMO ERG locus High- throughput sequencing Bioinformatics

Pros/cons of each technology NGS Greater dynamic range (only limited by depth of sequencing) Coverage of genome does not need to be limited. Many more applications from sequencing data. Data analysis and management can be challenging. Microarrays Microarrays are still significantly cheaper. Largest public datasets are likely to be microarray based. Data analysis pipelines are well standardised.

What we will cover Structure of the human genome Genomic information Data acquisition UCSC Genome Browser Background Genome Assemblies Annotation Tracks Associated Tools Practical Exercise

Genome Browser http://genome.ucsc.edu/

Background

Background Visualization of genomic data Graphical viewpoint on the very large amount of genomic sequence produced by the Human Genome Project. Human Genome: 3,156,105,057 bp Focus turned from accumulating and assembling sequences to identifying and mapping functional landmarks Genetic markers Genes SNPs Points of regulation Visualization of Next-generation-sequencing data

Background Client-side Integrative Genomics Viewer* Client-server UCSC Genome Browser Application (Java) on the user s machine Application on a web-server; access via web browser Often difficult to install No installation Does not have the extensive thirdparty data of the other browsers Access to a very large database of information in a uniform interface Much faster than web-based browsers Often difficult to import datasets http://www.broadinstitute.org/igv/

Background Intronerator was developed by J. Kent to map the exon intron structure of C. elegans RNAs mapped against genomic coordinates Jim Kent

Background Draft human genome sequence became available at the UCSC in 2000 Intronerator was used as the graphics engine 3' UTR exon < < < exon < exon < < < < ex 5' UTR

UCSC Genome Browser

Genome Browser http://genome.ucsc.edu/

Genome Assemblies Regular updates to genome assemblies to close gaps in genomic sequence, troubleshoot assembly problems and otherwise improve the genome assemblies Shifting coordinates for known sequences and a potential for confusion and error among researchers, particularly when reading literature based on older versions. Frequently used assemblies hg18/hg19 New assemblies increase genomic coverage 6-fold and have been deposited in GenBank. 127 genome assemblies have been released on 58 organisms (April 2012)

Annotation tracks

Annotation tracks The database may contain any data that can be mapped to genomic coordinates and therefore can be displayed in the Genome Browser Overview of tracks: http://genome.ucsc.edu/cgi-bin/hgtracks Three different categories: computed at UCSC computed elsewhere and displayed at UCSC computed and hosted entirely elsewhere

Annotation tracks computed at UCSC Comparative genomic annotations as well as Convert and liftover capabilities mrnas and ESTs in GenBank are aligned to the reference assembly in separate tracks (75 million GenBank RNAs and ESTs, ~3 billion bases of the human reference assembly 2 CPU-years of computing time) The Conservation composite track displays the results of the multiz algorithm that aligns the results from up to 46 pairwise Blastz alignments to the reference assembly (e.g. hg19 human assembly consumed 10 CPUyears)

Annotation tracks computed elsewhere and displayed at UCSC Annotations that are not post-processed by the UCSC Probe sets for commercially available microarrays, copy-number variation from t Database of Genomic Variants or expression data from the GNF Expression Atlas Data Coordination Center for the ENCODE project allowing access to a large num of functional annotations in regards to gene regulation Annotations that are post-processed by the UCSC dbsnp (Common SNPs, Flagged SNPs, Mult. SNPs) OMIM (OMIM Allelic Variant SNPs, OMIM Genes, OMIM Phenotypes)

Annotation tracks computed and hosted elsewhere Data tracks are hosted remotely (no data are stored at UCSC) and publicly available, e.g. Epigenomics Roadmap project http://epigenome.wustl.edu/

Tracks from the Epigenome project

Associated Tools Tools other than the main graphic image account for 42% of traffic on the UCSC server

Sessions

Custom track

Table Browser