Lecture 1 Introduc-on and Data

Size: px
Start display at page:

Download "Lecture 1 Introduc-on and Data"

Transcription

1 STATS M254 / BIOINFO M271 / BIOMATH M271 Sta$s$cal Methods in Computa$onal Biology Spring 2015 Lecture 1 Introduc-on and Data Instructor: Jingyi Jessica Li 1

2 Outline Introduc-on to molecular biology DNA, gene, RNA, protein, central dogma Typical data Gene expression RNA- seq Regulatory sequences ChIP- chip/seq Why is sta-s-cs important? 2

3 DNA DNA (Deoxyribonucleic acid) is a molecule to store gene-c informa-on of a living organism. DNA consists of two polymers made from four types of nucleo-des: adenine (A), guanine (G), cytosine (C) and thymine (T). Purines: A, G; Pyrimidines: C, T Two polymers are complementary to each other and from a double- helix structure 3

4 Chromosome 4

5 Chromosome 5

6 Gene 6

7 Central Dogma 7

8 Data type 1: Gene expression data 8

9 Principle of gene expression microarray 9

10 Source: Affymetrix Inc. 10

11 Source: Affymetrix Inc. 11

12 12

13 Gene expression data matrix 13

14 RNA- Seq data Wang, Gerstein, & Snyder (2009) Nature Reviews Gene$cs 10,

15 RNA- Seq data 1) Long RNAs are first converted into a library of cdna fragments through either RNA fragmenta-on or DNA fragmenta-on. 2) Sequencing adaptors (blue) are subsequently added to each cdna fragment and a short sequence is obtained from each cdna using high- throughput sequencing technology. 3) The resul-ng sequence reads are aligned with the reference genome or transcriptome, and classified as three types: exonic reads, junc-on reads and poly(a) end- reads. 4) These three types are used to generate a base- resolu-on expression profile for each gene, as illustrated at the bocom; a yeast ORF with one intron is shown. 15

16 Data type 2: Regulatory sequences 16

17 Transcrip-on factor binding sites & mo-fs 17

18 Mo-fs are regulatory codes 18

19 Finding mo-fs from co- regulated genes 19

20 Mo-f discovery is difficult in mammalian genomes Advanced methods in regulatory sequence analysis: 1) combinatorial binding pacern 2) mul-ple species conserva-on 3) heterogeneity in background 4) predic-ve modeling 20

21 Data type 3: ChIP- chip and ChIP- seq ChIP: Chroma-n ImmunoPrecipita-on chip: DNA micorarray seq: massive sequencing 21

22 Array vs. Sequencing 22

23 Gene regulatory network 23

24 Acknowledgments For sharing slides on the internet: Dr. Qing Zhou, UCLA Dr. Hongkai Ji, Johns Hopkins University Dr. Cheng Li, Peking University 24

25 Why is sta-s-cs important? Data analysis owchart Not a data analysis Descriptive Exploratory Are you trying to predict measurement(s) for individuals? No Yes No No Did you summarize the data? Yes Did you report the summaries without interpretation? No Did you quantify whether your discoveries are likely to hold in a new sample? Yes Are you trying to gure out how changing the average of one measurement a ects another? No Inferential Yes Predictive Yes Is the e ect you are looking for an average e ect or a deterministic e ect? Average Deterministic Leek and Peng, Science, Causal Mechanistic 25

26 Why is sta-s-cs important? Common mistakes REAL QUESTION TYPE PERCEIVED QUESTION TYPE PHRASE DESCRIBING ERROR Inferential Causal Correlation does not imply causation Exploratory Inferential Data dredging Exploratory Predictive Over tting Descriptive Inferential n of 1 analysis Leek and Peng, Science,