Bioinformatics. Outline of lecture


1 Bioinformatics
Uma Chandran, MSIS, PhD
Department of Biomedical Informatics, University of Pittsburgh
/08/2014

Outline of lecture
  What is Bioinformatics?
  Examples of bioinformatics
  Past to present
  Molecular questions
  Pre molecular techniques research
  High throughput research
  Translational Medicine
  Personalized Medicine
  Bioinformatics and Personalized Medicine

2 What is Bioinformatics?
  Application of information technology to molecular biology
  Databases
  Algorithms
  Statistical techniques

Bioinformatics examples
  Sequence analysis
  Genome annotation
  Evolutionary biology
  Literature analysis
  Analysis of gene expression
  Analysis of regulation
  Analysis of protein expression
  Analysis of mutations in cancer
  Comparative genomics
  Systems biology
  Image analysis
  Protein structure prediction
From Wikipedia

3 Early Bioinformatics
  Robert Ledley and Margaret Dayhoff: the first bioinformaticians
  Using an IBM 7090 and punch cards, analyzed the amino acid structure of proteins
  Created an amino acid scoring matrix
  Protein evolution
  Protein sequence alignment

Sequence analysis
  Databases to store sequence information
  Phage ΦX174 sequenced in 1977
  GenBank: 30,000 organisms, 143 billion base pairs
  BLAST program for sequence searching
  Algorithms, databases, software tools
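To make "scoring matrix" and "sequence alignment" concrete, here is a minimal sketch of a global alignment score computed with the Needleman-Wunsch dynamic programming recurrence. It is not Dayhoff's matrix: the flat match/mismatch/gap scores and the function name are assumptions chosen for illustration, where a real protein alignment would look up each residue pair in a substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not a published matrix.
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b. Row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Only the previous row of the table is kept, so memory grows with the shorter dimension rather than with the product of both lengths. Recovering the alignment itself, not just its score, would require keeping the full table for traceback.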

4 Evolutionary biology
  Compare relationships between organisms by comparing DNA sequences
  Now whole genomes
  Can even find single base changes, duplications, insertions, deletions
  Uses advanced algorithms, programs and computational resources

Literature mining
  Millions of articles in the literature
  How to find meaningful information?
  Natural language processing techniques
  Example: typing p53 or PTEN into PubMed retrieves 1000s of publications
  How to summarize all the information for a particular gene: function, disease, mutations, drugs
  The iHOP database creates a network between genes and proteins

5 Genome annotation
  Marking genes and other features in DNA
  Algorithms, software

Bioinformatics: an interdisciplinary discipline
  Gene/proteins/function: Biologist
  In cancer: Physician/Scientist/Biologist
  Algorithms, for example BLAST: Math/CS
  Separate signal from noise, differential gene expression, correlation with disease: Statistician
  Tools, software, databases: Software developers, programmers
  Aim: to make sense of biological data

6 Translational bioinformatics
  Translational = benchside to bedside
  Bringing discoveries made at the benchside to clinical use
  Definition: "the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients."
  Atul Butte, JAMIA 2008;15: doi:

Central dogma
  DNA is transcribed to RNA
  RNA is translated to protein
  Many regulatory processes control these steps
  1953: structure of DNA
  Central dogma first stated in 1958 and restated in 1970
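The two steps of the central dogma can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the input is the coding strand read 5' to 3', translation starts at the first base, and regulation, splicing and alternative start sites are ignored. The function names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA from its first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```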

7 Genome
  DNA is not static
  Modifications such as histone modifications and methylation can repress or activate genes
  Undergoes programmed changes during development
  Undergoes changes in disease
  Methylation, histone modifications, copy number changes, polymorphisms

Genome: epigenetics
  Methylation and histone modification
  Epigenetics: changes other than to the DNA sequence itself

8 Genome: copy number variation
  Normal CNVs
  Disease CNVs
  12% or more of the genome
  From a few kb to entire chromosomes, as duplications or deletions

Genome: single nucleotide polymorphism
  SNPs can be part of the normal variation in the population
  SNPs may also be associated with disease
  Can be in a coding region or a non-coding region
  SNP v mutation: the definition is based on population frequencies
  Can be germline, and therefore inherited
  Or somatic, for example in cancer
  Many projects, such as the 1000 Genomes Project, attempt to map these
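The frequency-based distinction between a SNP and a mutation can be sketched as follows. The 1% cut-off is a commonly used convention and is an assumption here, since the slide gives no threshold. The function names and the genotype encoding (one two-letter string per diploid individual, at a biallelic site) are also hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele at a biallelic site.

    genotypes: one two-letter string per diploid individual, e.g. "AG".
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0  # only one allele observed in this sample
    return min(counts.values()) / sum(counts.values())


def classify_variant(genotypes, threshold=0.01):
    """Label a site by minor allele frequency (threshold is an assumption)."""
    maf = minor_allele_frequency(genotypes)
    return "polymorphism" if maf >= threshold else "rare variant"


if __name__ == "__main__":
    sample = ["AA", "AG", "GG", "AA"]
    print(minor_allele_frequency(sample), classify_variant(sample))
```

A frequency estimated from a small sample is unreliable, which is one reason large population projects are needed to map these variants.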

9 Transcriptome
What is a transcriptome?
  The set of ALL RNAs (not just a single RNA): mRNA, rRNA, tRNA, ncRNA, siRNA
  Amount and concentration of each RNA molecule
  Can vary with environmental conditions
  Varies in different cell types
  The transcriptome is dynamic: genes are actively transcribed or repressed, and transcripts are degraded
  How is this different from the exome?
Transcriptome 2005. From Brendean Frey

10 RNA regulation
  miRNA
  tRNA
    Earliest studies were done in bacteria, where levels change under different conditions such as starvation, resulting in changes in translational efficiency
    Balance in tRNAMet can transform a normal cell into a malignant one
  rRNA
    Also studied in bacteria; levels change under different conditions
  mRNA
    Transcript levels are very dynamic and can change rapidly
    Regulated at many levels, such as initiation, degradation, rate of transcription, splicing, etc.

11 Central dogma expanded: transcriptome diversity
Transcriptome
  How many genes are in the human genome?

12 RNAs
  Studies show that each gene may have numerous alternate spliced forms
  [...]s of pseudogenes
  2000 miRNA
  lncRNA: the FANTOM project identified more than 35K non-coding transcripts
  piwi RNA, siRNA (silencing)
  Genes in other species
Genome Biology, 2010, 11:206

13 Transcriptomics
  Methods to study the varying concentrations of all of the RNAs, including splicing intermediates, under different conditions such as environmental changes, tissue types, cell types, disease states, etc.
Central dogma expanded: regulation

14 Biological questions
DNA
  Are there any mutations?
    Sickle cell anemia
    Cystic fibrosis
    Hemophilia
    Other diseases such as diabetes, cancer?
  Polymorphisms
    Variation in the population
    Mutation
  DNA amplification
    Are there regions of amplification or deletion that correlate with disease?
    If so, what genes are present in these regions?
    HER2 amplification in breast cancer
    EGFR mutations in lung cancer

15 Molecular techniques
Pre molecular techniques
  Causes of disease were difficult to study
  Sickle cell anemia
    Inherited, predominantly in certain populations; something wrong with the oxygen carrying capacity of blood
    Linus Pauling showed that Hb was defective
    We now know that there is a mutation in the globin gene
  Glucocorticoids (steroid)
    Extract from the adrenal gland helped with arthritis
    With molecular techniques, we now know that they have broad ranging effects on many tissues

16 Technology
  How are these changes measured?
  Low throughput approaches
  Many methods
  Example: Northern blot (measures RNA)

Low throughput: Northern
Workflow of a Northern blot, key points
  mRNA is run on a gel, separated by size, transferred to a membrane and immobilized
  Start from a hypothesis, for example studying the RNA level of BRCA in normal and cancer tissue
  A probe for only that mRNA or transcript is prepared and labeled (tagged) with radioactivity
  The probe is hybridized to the membrane and the signal is detected on X-ray film
  Only that mRNA is detected and quantitated

17 Hypothesis driven research
Example: Glucocorticoid Receptor, GR (NR3C1)
  GR: nuclear receptor
  What does it do?
    Stress hormones
    Has physiological effects on many tissues
  How does it function?
    Hormone activates the receptor, which binds to GREs in DNA and activates or represses transcription
  What genes does it affect?
    Many genes
  How was this studied?
    RNA, run-on assays, SAGE, EST tags, microarray
GR regulated genes: Lutzner et al. PLOS

18 Base pairing
  Microarrays and Northern/Southern blots exploit the ability of nucleotides to hybridize to each other
  Base pairing of complementary bases: A:T (U), G:C

High throughput: microarray
  Probes on a surface: glass, beads, chips, slides
  Arrays can detect mRNA, microRNA, methylation, SNPs
  High throughput: 10000s of specific probes
  Measure global gene expression, SNP calls, LOH, amplification, methylation, etc.
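Hybridization by base pairing can be illustrated with a small sketch. A probe binds a target only where the target contains the probe's reverse complement. The function names are hypothetical, and the model is deliberately crude: it requires a perfect match and ignores the temperature and salt conditions that govern real hybridization.

```python
# Complement of each base; U is treated like T so RNA input also works.
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence as DNA."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_hybridizes(probe, target_rna):
    """True if the DNA probe is perfectly complementary to part of the RNA."""
    target_as_dna = target_rna.upper().replace("U", "T")
    return reverse_complement(probe) in target_as_dna


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(probe_hybridizes("GCAT", "AAAUGCAA"))
```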

19 Microarrays
  Solid surface
  Many different technologies: Affymetrix (Affy), Illumina, Agilent

Affymetrix
  Probes are synthesized on the solid surface using proprietary technology
  Probes are selected using proprietary algorithms
  RNA (or DNA) is in solution
  RNA is labeled or tagged
  Hybridized to the chip
  Tagged RNA is quantitated
  Compare between conditions

20 Biomarker discovery approaches
Single marker (PSA)
  Antigens in prostate (1970)
  Purification of tissue specific antigen from prostate (1979)
  PSA measured quantitatively (1980)
  FDA approval (1980s)
Genome wide
  Examine all available genes in the human genome for differences in breast cancer (2004)
  Studied [...] genes
  400 genes found to differ with tumor behavior
  21 genes could predict recurrence
  FDA approval (2007)

Clinical questions
DNA level
  Are there mutations or polymorphisms that differ between cancer patient groups?
    Good outcome v bad outcome
    Early stage v late stage
    Therapy responders v non-responders
  Examples: renal cell, prostate cancer, etc.
RNA
  Are there specific transcripts (mRNA, microRNA) that are up or down and are a signature for outcome, disease and response?
  1000s of studies
Consortia projects
  TCGA: The Cancer Genome Atlas
  Profile 500 samples of each cancer for DNA and RNA changes


22 Need for computational methods
Data management
  Each file for a chip experiment is large: 100 MB x 10 = 1 GB
  Generates gigabytes of data
Data preprocessing
  Convert the raw image into signal values
Data analysis
  1000s of genes (or SNPs) and few samples
  How to find differences between samples?
  What statistical methods to use?
  Like finding a needle in a haystack

Data analysis
  Class discovery
    Are there novel subclasses within the data?
  Class comparison
    How are tumor and normal different in expression?
    Which SNPs are different?
  Class prediction
    Predict the class of a new sample
  Advanced pathway analysis
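Class comparison can be sketched as a per-gene two-sample test. The example below ranks genes by the absolute value of Welch's t statistic between tumor and normal samples. The gene names and expression values are made up for illustration, and the choice of Welch's t is an assumption, since the slide does not name a method.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def rank_genes(tumor, normal):
    """Return gene names ordered from most to least different."""
    scores = {gene: welch_t(tumor[gene], normal[gene]) for gene in tumor}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


if __name__ == "__main__":
    # Hypothetical log-scale expression values, four samples per class.
    tumor = {"GENE_A": [8.1, 8.3, 7.9, 8.2], "GENE_B": [5.0, 5.2, 4.9, 5.1]}
    normal = {"GENE_A": [5.0, 5.1, 4.8, 5.2], "GENE_B": [5.1, 4.9, 5.0, 5.2]}
    print(rank_genes(tumor, normal))
```

With thousands of genes and few samples, some large statistics arise by chance alone, so a real analysis would also correct for multiple testing.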

23 SNPs to detect copy number changes
[Figure: regions labeled amplification, diploid and deletion]
Hagenkord et al; Modern Pathology, 21:599

24 Integrative analysis
Technology
  From low throughput approaches a decade ago, to high throughput, to very high throughput
  Produces massive amounts of data
Integrative bioinformatics
  Analyze each dataset
  Integrate datasets
  Interpret
  Visualize

What is personalized medicine?
  Personalized medicine is the tailoring of medical treatment to the individual characteristics of each patient.
  Based on scientific breakthroughs in understanding of how a person's unique molecular and genetic profile makes them susceptible to certain diseases.
  Ability to predict which medical treatments will be safe and effective for each patient, and which ones will not be.
From ageofpersonalizedmedicine.org

25 Personalized Medicine
[Figure from ageofpersonalizedmedicine.org]

Examples of personalized medicine
Breast cancer
  30% of patients overexpress HER2
  Treated with Herceptin
  Oncotype Dx: gene expression predicting recurrence
Cardiovascular
  Patient response to Warfarin, the blood thinner
  Response determined by polymorphisms in CYP genes

26 Personalized Medicine
  Examples of personalized medicine resulted from studies that generate lots of data
  Rely on bioinformatics methods to discover these associations
  Oncotype Dx: gene expression studies of a large number of patients
  CYP polymorphisms: discover single nucleotide polymorphisms in patient populations and their association with response
    Initial studies done with PCR methods

Bioinformatics challenges in personalized medicine
  Processing large scale robust genomic data
  Interpreting the functional impact of variants
  Integrating data to relate complex interactions with phenotypes
  Translating into medical practice
Fernald et al; Bioinformatics: 13:

27 Era of Personalized medicine
  Shift from microarrays to Next Gen Sequencing

Next Gen Sequencing
  Directly sequence DNA to determine
    SNPs
    Copy number (CN)
    Expression: mRNA, microRNA
    Protein binding sites
    Methylation
  Initial steps do not depend on hybridization to predefined probes, but still rely on base pairing (complementarity) and DNA synthesis
  Bioinformatics is extremely challenging

28 Next Gen Sequencing
NGS in personalized medicine
  Whole genome sequencing
    Sequence genomes and find variants (1000 Genomes Project)
    Find variants associated with disease phenotype
  Sequence exomes only
    Find coding region variants associated with phenotypes
  RNA-seq
    RNA sequence signatures associated with phenotype

29 Microarrays v NGS
Microarrays
  Restricted to probes on chips
  Only transcripts with probes
  File sizes in MBs to GB
  Algorithms, methods
  Typically done on PCs
  Storage on hard drives
RNA-seq
  No predetermined probes
  Can detect everything that is sequenced
  More applications than microarray
  Very large file sizes
  Computationally very intensive
  Clusters, supercomputers
  Large scale storage solutions

Microarrays v RNA-seq: expression analysis
Microarrays
  Dynamic range is low
  Statistics to determine expression are based on signal
  Many methods developed in the last 10 years
RNA-seq
  Dynamic range is high
  Based on reads
  Statistics are based on counts
  Affected by read length, total number of transcripts, lack of replicates
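Because RNA-seq statistics are based on counts, raw numbers are not comparable across samples with different sequencing depth, or across genes of different length. A minimal sketch of two common normalizations, counts per million (CPM) and reads per kilobase per million (RPKM), is shown below. The slide does not name a normalization, so both are offered as illustrations; the function names and the sample counts are assumptions.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def rpkm(count, gene_length_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1_000_000_000 / (total_reads * gene_length_bp)


if __name__ == "__main__":
    # Hypothetical read counts for three genes in one sample.
    sample = {"GENE_A": 10, "GENE_B": 30, "GENE_C": 60}
    print(counts_per_million(sample))
    print(rpkm(count=10, gene_length_bp=2000, total_reads=100))
```

Neither normalization replaces biological replicates. Count-based statistical models still need replicates to estimate variability.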

30 Read mapping
Alignment
  De novo assembly
  Mapping to a reference genome
    Based on complementarity of a given 35-nucleotide read to the entire genome
Computationally intensive
  Millions of 35 bp reads must be searched against the reference and aligned specifically to a given region
Large file sizes
  Sequence files in the TB
  Aligned files (BAM files): several hundred GB
  Reference genome

Bioinformatics challenges
Data
  Which technology to use?
    Each technology has different error rates: Ion Torrent (higher error rate), SOLiD, Illumina
  Speed of data generation
    Ion Torrent is faster
  Application
    Whole genome, exome or targeted exome
Analysis
  Algorithms, speed, accuracy
    BLAST is not good for WGS
    Other new algorithms
  Speed of analysis
    Alignment can take days
  Alignment relies on matches between the sequence and the reference genome
    How many mismatches to tolerate?
    Is a mismatch a sequencing error or a true mismatch, that is, a SNP?
    Quality of the reference genome
Large amounts of data
  Each whole genome sequencing experiment can generate TB of data
  Where to store it? Patient privacy
  Servers, locations, networking
Sample sizes
  How many samples to sequence to discover the association with disease?
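The mismatch tolerance question can be made concrete with a naive read mapper that slides a read along the reference and reports every position within a given number of mismatches. This is a sketch only: the function names and the toy sequences are assumptions, indels are not handled, and the brute-force scan is far too slow for millions of reads against a whole genome, which is why practical aligners rely on indexed data structures.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for every acceptable placement."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"  # toy reference sequence
    print(map_read("CGTT", reference, max_mismatches=1))
    print(map_read("CGTT", reference, max_mismatches=0))
```

Allowing one mismatch makes the short read map to two positions instead of one. That illustrates the trade-off on the slide: tolerating mismatches recovers reads carrying true SNPs or sequencing errors, but increases ambiguous placements.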

31 [Figures] From Mark Boguski's presentation at the IOM, July 19, 2011

32 Next Gen Sequencing
[Figures] From Mark Boguski's presentation at the IOM, July 19, 2011