Machine Learning. HMM applications in computational biology

Similar documents
Computational Genomics. Reconstructing signaling and dynamic regulatory networks

Announcement Structure Analysis

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Biology 644: Bioinformatics

Data Mining for Biological Data Analysis

Introduction to BIOINFORMATICS

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

CS273B: Deep learning for Genomics and Biomedicine

Mapping strategies for sequence reads

Hidden Markov Models. Some applications in bioinformatics

Lecture 10. Ab initio gene finding

Discovering gene regulatory control using ChIP-chip and ChIP-seq. An introduction to gene regulatory control, concepts and methodologies

Discovering gene regulatory control using ChIP-chip and ChIP-seq. Part 1. An introduction to gene regulatory control, concepts and methodologies

Measuring transcriptomes with RNA-Seq

ChIP-seq and RNA-seq

NEXT GENERATION SEQUENCING. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib

Computational Genomics

Algorithms in Nature. (brief) introduction to biology

Lecture 5: Regulation

Measuring transcriptomes with RNA-Seq. BMI/CS 776 Spring 2016 Anthony Gitter

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

RNA-Sequencing analysis

Computational gene finding

Applications of HMMs in Epigenomics

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

2/19/13. Contents. Applications of HMMs in Epigenomics

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Computational gene finding

Sequence Assembly and Alignment. Jim Noonan Department of Genetics


Motivation From Protein to Gene

Functional microrna targets in protein coding sequences. Merve Çakır

Chapter 1. from genomics to proteomics Ⅱ

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

RNA-SEQUENCING ANALYSIS

About Strand NGS. Strand Genomics, Inc All rights reserved.

Homework 4. Due in class, Wednesday, November 10, 2004

MATH 5610, Computational Biology

Introduction to 'Omics and Bioinformatics

Genomic resources. for non-model systems

Motif Search CMSC 423

Algorithms in Computational Biology (236522) Winter 2012 Intro Lecture

Computational Investigation of Gene Regulatory Elements. Ryan Weddle Computational Biosciences Internship Presentation 12/15/2004

EECS730: Introduction to Bioinformatics

Motifs. BCH339N - Systems Biology / Bioinformatics Edward Marcotte, Univ of Texas at Austin

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

Computational Methods for Analyzing and Modeling Gene Regulation Dynamics

2/10/17. Contents. Applications of HMMs in Epigenomics

ChIP-seq analysis 2/28/2018

VALLIAMMAI ENGINEERING COLLEGE

Methods and tools for exploring functional genomics data

Videos. Lesson Overview. Fermentation

MATH 5610, Computational Biology

GREG GIBSON SPENCER V. MUSE

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Analysis of RNA-seq Data

Outline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Videos. Bozeman Transcription and Translation: Drawing transcription and translation:

The first order of genome annotation is to identify what elements a genome contains in the first place.

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

ECS 234: Genomic Data Integration ECS 234

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

The first thing you will see is the opening page. SeqMonk scans your copy and make sure everything is in order, indicated by the green check marks.

Microarrays & Gene Expression Analysis

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Gene Identification in silico

Analysis of Biological Sequences SPH

Motif Finding: Summary of Approaches. ECS 234, Filkov

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

De novo assembly in RNA-seq analysis.

Introduction to Algorithms in Computational Biology Lecture 1

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

ECS 234: Introduction to Computational Functional Genomics ECS 234

Random matrix analysis for gene co-expression experiments in cancer cells

Finding Genes with Genomics Technologies

Introduction to Genome Biology

RNA-Seq data analysis course September 7-9, 2015

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Year III Pharm.D Dr. V. Chitra

Introduction to Bioinformatics: Chapter 11: Measuring Expression of Genome Information

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Genome 373: Hidden Markov Models III. Doug Fowler

Baum-Welch and HMM applications. November 16, 2017

10/20/2009 Comp 590/Comp Fall

BSC 4934: Q BIC Capstone Workshop. Giri Narasimhan. ECS 254A; Phone: x3748

Introduction to Microarray Analysis

Bioinformatics Programming and Analysis CSC Dr. Garrett Dancik

Gene Expression Technology

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Transcription:

10-601 Machine Learning HMM applications in computational biology

Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2

Biological data is rapidly accumulating Transcription factors Next generation sequencing DNA transcription RNA translation Proteins

DNA Biological data is rapidly Transcription factors transcription accumulating Array / sequencing technology RNA translation Proteins

Biological data is rapidly accumulating Transcription factors Protein interactions DNA transcription RNA translation Proteins 38,000 identified interactions Hundreds of thousands of predictions

6

FDA Approves Gene-Based Breast Cancer MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site. Test* *Washington Post, 2/06/2007

8

Active Learning 9

Sequencing DNA First human genome draft in 2001 Due to accumulated errors, we could only reliably read at most 100-200 nucleotides.

Shotgun Sequencing Wikipedia

Caveats Errors in reading Non-trivial assembly task: repeats in the genome MacCallum et al., GB 2009

Error Correction in DNA sequencing The fragmentation happens at random locations of the molecules. We expect all positions in the genome to have the same # number of reads K-mers = substrings of length K of the reads. Errors create error k-mers. Kellly et al., GB 2010

Transcriptome Shotgun Sequencing (RNA-Seq) @Friedrich Miescher Laboratory Sequencing RNA molecule transcripts. Reminder: (mrna) Transcripts are expression products of genes. Different genes having different expression levels so some transcripts are more or less abundant than others.

Challenges Large datasets: 10-100 millions reads of 75-150 bps. Memory efficiency: Too time consuming to perform outmemory processing of data. DNA Sequencing + others : alternative splicing, RNA editing, post-transcription modification.

Errors are non uniformly distributed Some transcripts are more prone to errors Errors are harder to correct in reads from lowly expressed transcripts

SEECER Error Correction + Consensus sequence estimation for RNA-Seq data

Key idea: HMM model Salmela et al., Bioinformatics 2011 The way sequencers work: Read letter by letter sequentially Possible errors: Insertion, Deletion or Misread of a nucleotide

Building (Learning) the HMMs and Making Corrections (Inference) Learning = Expectation-Maximization Inference = Viterbi algorithm Seeding: Guessing possible reads using k-mer overlaps. Constructing the HMM from these reads. Speed up: The k-mer overlaps yield approximate multiple alignments of reads. We can learn HMM parameters from this directly.

Clustering to improve seeding Real biological differences should be supported by a set of reads with similar mismatches to the consensus

1. Clustering positions with mismatches to identify clusters of correlated positions. 2. Build a similarity matrix between these positions. 3. Use Spectral clustering to find clusters of correlated positions. 4. Filter reads have mismatches in these clusters.

Comparison to other methods

Using the corrected reads, the assembler can recover more transcripts

Analysis of sea cucumber data B

Data integration in biology

Key problem: Most high-throughput data is static Time-series measurements Static data sources Sequencing motif CHIP-chip microarray PPI Time

DREM: Dynamic Regulatory Events Miner

a Expression Level Time Series Expression Data b TF A Static TF-DNA Binding Data time TF B TF C TF D c Expression Level Model Structure d IOHMM Model 0.1? time 1 0.95? 0.9 0.05 1

Things are a bit more complicated: Real data

A Hidden Markov Model T t t t n i T t t t i H i H p i H i O p O H L 2 1 1 1 )) ( ) ( ( )) ( ) ( ( ) ;, ( Hidden States Observed outputs (expression levels) t=0 t=1 t=2 t=3 H 0 H 1 H 2 H 3 O 0 O 1 O 2 O 3 Schliep et al Bioinformatics 2003 1

Input Output Hidden Markov Model Input (Static TF-gene interactions) I g Hidden States (transitions between states form a tree structure) H 0 H 1 H 2 H 3 Emissions (Distribution of expression values) Log Likelihood O 0 O 1 O 2 O 3 t=0 t=1 t=2 t=3 Sum over all genes Sum over all paths Q Product over all Gaussian emission density values on path Product over all transition probabilities on path

E. coli. response Stem cells differentiation 2 1 6 7 3 4 5 8 PLoS Comp. Bio. 2008 9 Nature MSB 2011 Mouse Immune response Fly development Science 2010 IRF7 Genome Research 2010, PLoS ONE 2011

Things that work Approximate learning to speed up on large datasets. In real world, one technique is not enough. A solution involves using many techniques. Precision and Recall are trade-offs.