DNA Based Disease Prediction using pathway Analysis


2017 IEEE 7th International Advance Computing Conference

Syeeda Farah, Dr. Asha T, Cauvery B, Sushma M S and Shivanand K S
Department of Computer Science and Engineering, Bangalore Institute of Technology, Bangalore, Karnataka
syeeda.farah.93@gmail.com, asha.masti@gmail.com, cauverysoni@gmail.com, sushmasuresh93@gmail.com, k.s.shivanand23@gmail.com

ABSTRACT

Most diseases are not triggered by a single gene but by a combination of genes acting together. Sequences occurring more frequently in the diseased samples than in the healthy samples indicate the genetic factors of the disease. DNA has become an extremely useful tool for predicting disease. When medical professionals can identify genes in DNA that are markers for diseases, a person can make appropriate lifestyle or similar modifications to help lower the risk of disease. We propose a system that provides this knowledge by determining the probability of a disease occurring if the causal gene or its associated genes are mutated.

Index Terms: Data Mining, Bayesian Network, Pathway Analysis, Disease Prevention.

I. INTRODUCTION

DNA is a molecule that encodes the genetic instructions used in the development and functioning of all known living organisms and many viruses. DNA has become an extremely useful source for identifying and predicting diseases. The human genome includes approximately genes. With the exception of identical twins, no two humans have the same genome. The genetic information in a genome is held within genes, and the complete set of this information in an organism is called its genotype. A gene is a unit of heredity and is a region of DNA that influences a particular characteristic in an organism.
A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring commonly within a population, in which a single nucleotide (A, T, C or G) in the genome differs between members of a biological species or between paired chromosomes.

Where there is life there is always disorder or disease. Diseases can be broadly classified into those caused by viruses and bacteria and those caused by the malfunction of organs, i.e., ailments. Mutations in DNA cause diseases of the second type, which often remain undetected until the late stages, when very little can be done to prevent them. Mutations in the disease-causal gene, together with mutations in other supporting genes, make the organ vulnerable and hence lead to disease. A mutation in the causal gene alone may or may not produce the disease, and a mutation in the causal gene may itself have been induced by a mutation in a supporting gene. We consider all these factors. Our proposed system takes into consideration only those genes that have a substantial impact on the cause of a disease and discards the negligible data. This is done by applying the concepts of Bayesian networks, specifically pathway analysis, which provides a pathway of associated causal and supporting genes.

This paper starts by describing the background needed to understand the work. It then presents the methodologies used in the proposed work, followed by the architecture, where we explain the algorithm and the implementation of the concepts introduced earlier. We then present the results of case studies on diseases such as Type 1 Diabetes and Crohn's disease, followed by the conclusion and references.

II. BACKGROUND

Prior biological knowledge was needed to understand the genetics behind various diseases, genetic diseases in particular.
We surveyed every chromosome and the disease-causing genes on each. Further, we studied gene interaction patterns in diseases such as Crohn's disease, Alzheimer's disease and a few types of cancer. We found [7] that a disease can be caused by a mutation in a single gene or in a collection of genes. We also found [3] that a mutation of one gene can be responsible for two or more diseases (for example, Alzheimer's and Parkinson's diseases). Another finding is that a typical microarray data set [3] may have only a small number of records, while the number of fields, corresponding to the number of genes, is in the thousands. Having so many fields relative to so few samples creates a high likelihood of finding false positives. We decided to use pathway analysis because it has become the first choice for extracting and explaining the underlying biology of high-throughput molecular measurements [6]. Today, virtually every bioinformatics study looks for statistically significant pathways as either biological interpretation or validation of computationally derived results.

Pathway analysis therefore helps us reduce these false positives and obtain more accurate results.

III. MODULES

Fig. 1. Architecture diagram

A. Data collection

The genes are extracted from the NCBI datasets in the GenBank format, and the required fields are extracted for the statistical calculations. The GenBank format of genetic data includes details such as the gene name, the NCBI gene ID, the location of that gene on the chromosome along with the base pair ordering, the species, etc. Along with these details, it contains the respective gene sequence. We have also integrated other databases to obtain the correlation values of the diseasomes.

B. Data-Preprocessing

GenBank data was processed using a Perl program to extract the gene ID and gene name. Results were tabulated in a database table, which is used later to visualize pathways. Correlation is the proportion of variance that two traits share due to genetic causes. From the correlation, the p-values were calculated and fed into the Bayesian network to build pathways (using JDBC). The correlation coefficient for a pair of genes is calculated using Pearson's correlation coefficient as:

$$ r_{xy} = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}} \tag{1} $$

where x_i and y_i are the values observed for the two genes in sample i, \bar{x} and \bar{y} are their means, and m is the number of samples.

C. Building Bayesian Network

Bayesian networks (BNs) have a number of features that make them viable for combining prior knowledge and data: BNs can deal with uncertainty, avoid over-fitting a model to training data and learn from incomplete datasets. BNs handle stochastic events in a probabilistic framework. Because exact structure learning is computationally infeasible, many BN structure learning algorithms are based on heuristic search techniques with likelihood approximation. BNs are graphical representations of statistical interdependencies amongst sets of nodes. Specifically, a BN for a set of variables X = {X1, X2, ..., Xn} consists of (1) a network structure S that encodes a set of conditional independence assertions about variables in X, and (2) a set P of conditional probability distributions associated with each variable (Heckerman, 2008). Together, these components denote the joint probability distribution for X. The BN structure S is a directed acyclic graph, meaning that the network is hierarchical, has both top-level and terminal nodes, and contains no directed path that eventually returns to its starting node. Given structure S, the joint probability distribution for X is given by

$$ p(x) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i) \tag{2} $$

where pa_i denotes the parents of X_i in S.

Score-based methods consider a number of possible BN structures and assign a score to each that measures how well it explains the observed set of data. The algorithms then typically return the single structure that maximizes the score.

$$ \mathrm{AIC} = \log p(D \mid \hat{\theta}, G) - d \tag{3} $$

where D is the observed data, G is the candidate structure, \hat{\theta} is the maximum-likelihood estimate of its parameters and d is the number of free parameters.

Basic outline of Bayesian network:

Input: Observational data
Output: Bayesian network

1. Input a cut-off value to restrict the number of records/variables provided as input to the algorithm.
2. Use the p-values from the database table obtained as the output of data preprocessing.
3. Generate the initial BN, evaluate it and set it as the current BN (using an appropriate scoring function).
4. Evaluate the neighbors of the current BN.
5. If the score of the best neighbor is better than the score of the current BN, set that neighbor as the current BN and return to step 4. Otherwise stop the learning process.
6. Visualize the pathway using a JGraph-like package.

D. Visualizing the Pathway

Pathways are visualized using Java APIs such as Swing, taking the network developed in the previous module as input.
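The Pearson coefficient of equation (1) in the data-preprocessing module can be sketched in Java as follows. This is a minimal illustration, not the authors' code: the class name `PearsonSketch`, the method name `correlation` and the sample arrays are hypothetical, and the input is assumed to be two equal-length arrays of per-sample values for a pair of genes.

```java
final class PearsonSketch {

    /** Pearson's r for two equal-length samples, following equation (1). */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        // Sample means of the two genes.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        // Sums of cross-products and squared deviations.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy); // NaN if either sample is constant
    }

    public static void main(String[] args) {
        // Hypothetical values for two genes over four samples.
        double[] geneA = {1.0, 2.0, 3.0, 4.0};
        double[] geneB = {2.0, 4.0, 6.0, 8.0};
        System.out.println(correlation(geneA, geneB));
    }
}
```

Converting r to a p-value additionally requires a t-distribution (or similar) tail probability, which the Java standard library does not provide, so that step is left out of the sketch.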

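The score-based search outlined in module C (generate an initial network, evaluate its neighbors, move to the best neighbor while the score improves) can be sketched as a greedy hill climb. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `HillClimbSketch`, the adjacency-matrix representation, the single-edge toggle as the neighbor move and the pluggable `score` function (higher is better) are all hypothetical choices.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.ToDoubleFunction;

final class HillClimbSketch {

    /** True if node `to` can be reached from node `from` along directed edges. */
    static boolean reachable(boolean[][] adj, int from, int to) {
        boolean[] seen = new boolean[adj.length];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(from);
        while (!stack.isEmpty()) {
            int u = stack.pop();
            if (u == to) return true;
            if (seen[u]) continue;
            seen[u] = true;
            for (int v = 0; v < adj.length; v++) {
                if (adj[u][v]) stack.push(v);
            }
        }
        return false;
    }

    /** Greedy structure search over n nodes, starting from the empty graph. */
    static boolean[][] learn(int n, ToDoubleFunction<boolean[][]> score) {
        boolean[][] current = new boolean[n][n];          // initial BN: no edges
        double currentScore = score.applyAsDouble(current);
        while (true) {
            int bestI = -1, bestJ = -1;
            double bestScore = currentScore;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    // Adding i -> j would create a cycle if j already reaches i.
                    if (!current[i][j] && reachable(current, j, i)) continue;
                    current[i][j] = !current[i][j];       // try the neighbor
                    double s = score.applyAsDouble(current);
                    current[i][j] = !current[i][j];       // undo the move
                    if (s > bestScore) {
                        bestScore = s;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0) return current;                // no neighbor improves: stop
            current[bestI][bestJ] = !current[bestI][bestJ];
            currentScore = bestScore;
        }
    }
}
```

Because each accepted move strictly increases the score and the number of acyclic graphs is finite, the loop terminates; like any greedy search it may stop at a local optimum rather than the best-scoring structure.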
IV. ARCHITECTURE

Fig. 2. Flow Diagram

A. Data Preprocessing

The gene sequence in GenBank format is obtained from the NCBI website. The relevant gene ID and gene name are obtained for the gene sequence in question. The code for this is in the Perl module. The Perl module also fetches the correlation values for the genes associated with the causal gene. These are obtained from the co-expressed gene files, which are named with gene IDs. The correlation values obtained are used to calculate the p-values. This information is stored in the database in the form of tables. The first table created is the parent gene table. Further tables are created recursively for every entry in the parent gene table.

Fig. 3. Detailed flow diagram for data pre-processing

B. Build Bayesian Network

Fig. 4. Detailed flow diagram to build Bayesian network

The Bayesian network is built using Java. This module uses a linked hash map. The Java code has two methods, calculateaic() and conditionalp(). calculateaic() takes in a p-value and calculates the AIC (Akaike Information Criterion) in the form 2k - 2 ln(p). AIC is the scoring function used to decide whether a gene is added to the network.

    // n is assumed to be a static field of the enclosing class
    // holding the number of parameters k.
    public static int calculateaic(double p1) {
        int k = n;
        double x = 2 * k - 2 * Math.log(p1);
        return (int) x; // truncated to an integer, as in the original code
    }

conditionalp() takes in the p-values of the parent node and the child node. Here we calculate the probability of occurrence of the child node when the parent node has already occurred. This method returns the calculated p-value.

    public static double conditionalp(double parentpval, double childpval) {
        // Product of the parent and child p-values.
        return parentpval * childpval;
    }

C. Visualizing the network

The linked hash map is accessed to visualize the pathway. This is done using the JGraphX package in NetBeans. This module is included in the Java code of the Bayesian network.
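How a linked hash map can hold the pathway described above may be easier to see in a small sketch. The following is an illustration under assumptions, not the authors' code: the class `PathwaySketch`, the nested-map layout (parent gene to child gene to p-value) and the p-values are hypothetical, the gene names are taken from the case-study tables, and `aic` takes k as an explicit parameter and returns a double rather than truncating.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PathwaySketch {

    // Parent gene -> (child gene -> p-value). LinkedHashMap preserves
    // insertion order, so genes are visited in the order they were added.
    private final Map<String, Map<String, Double>> edges = new LinkedHashMap<>();

    void addEdge(String parent, String child, double pValue) {
        edges.computeIfAbsent(parent, key -> new LinkedHashMap<>()).put(child, pValue);
    }

    Map<String, Map<String, Double>> edges() {
        return edges;
    }

    /** Product of parent and child p-values, mirroring conditionalp(). */
    static double conditionalp(double parentpval, double childpval) {
        return parentpval * childpval;
    }

    /** AIC in the form 2k - 2 ln(p), kept as a double. */
    static double aic(int k, double p) {
        return 2.0 * k - 2.0 * Math.log(p);
    }

    public static void main(String[] args) {
        PathwaySketch pathway = new PathwaySketch();
        // Hypothetical p-values for illustration only.
        pathway.addEdge("HLA-DQB1", "HLA-DQA1", 0.04);
        pathway.addEdge("HLA-DQB1", "HLA-DRB1", 0.03);
        pathway.addEdge("HLA-DQA1", "HLA-DPA1", 0.05);
        pathway.edges().forEach((parent, children) ->
                children.forEach((child, p) ->
                        System.out.println(parent + " -> " + child + " (p = " + p + ")")));
    }
}
```

A plain HashMap would store the same edges but give no ordering guarantee, which matters if the visualization is expected to lay out genes in the order they were fetched from the database.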

V. CASE STUDY

From the implementation of pathway analysis we have visualized the pathways between genes, which provides knowledge about which genes can collectively affect a disease. This is illustrated using Type 1 Diabetes.

1) Causal Gene: From our research we found that HLA-DQB1 is the causal gene for Type 1 Diabetes.
2) Support Genes: Through data mining techniques we obtained the support genes that can cause Type 1 Diabetes even if the causal gene is not mutated.

We use the NetBeans IDE to run all the modules mentioned in the previous section. The Perl program fetches the desired records from the large amount of data based on the user input (gene name and number of records to be fetched). The subsequent modules work on the fetched tables to produce the output. The output is a network, or pathway, of the causal gene and its associated genes. The causal gene is colored white, and the genes closely associated with the causal gene are colored pink. The genes known to have an effect on the closely associated genes are colored cyan. We specify the causal gene name and the number of records to be evaluated as shown in Fig. 5.

Fig. 5. Input screen and number of records to be evaluated

Once the input is given, the program runs and the pathway is displayed. The pathway of the causal gene and its associated genes is shown in Fig. 6 for 10 records and in Fig. 7 for 25 records.

Fig. 6. Disease pathway of HLA-DQB1 which causes Type 1 Diabetes (for 10 records)

Fig. 7. Disease pathway of HLA-DQB1 which causes Type 1 Diabetes (for 25 records)

The following tables are created when we run the program:

TABLE I. Database table created for input gene, i.e., HLA-DQB

TABLE II. Database table created for HLA-DQA1

TABLE III. Database table created for HLA-DRB1

TABLE IV. Database table created for HLA-DPA1

TABLE V. Database table created for HLA-DRB6

Figure 8 shows the output for sickle cell anemia, which is caused by a single gene, HBB. Since no associated genes are involved in the onset of the disease, only a single node is displayed.

Fig. 8. Disease pathway of HBB which causes Sickle cell anemia
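The coloring used in the case-study figures (causal gene white, closely associated genes pink, genes affecting those cyan) amounts to labeling nodes by their distance from the causal gene. A minimal sketch, assuming the associations are available as a map from each gene to its associated genes; the class `PathwayColorSketch`, the method `colorByDepth` and the decision to ignore genes more than two steps away are hypothetical, not taken from the paper.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PathwayColorSketch {

    /** Colors every gene within two steps of the causal gene by its depth. */
    static Map<String, String> colorByDepth(Map<String, List<String>> associations,
                                            String causalGene) {
        String[] palette = {"white", "pink", "cyan"};    // depth 0, 1, 2
        Map<String, Integer> depth = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(causalGene, 0);
        queue.add(causalGene);

        // Breadth-first traversal so each gene gets its shortest distance.
        while (!queue.isEmpty()) {
            String gene = queue.remove();
            int d = depth.get(gene);
            if (d == palette.length - 1) continue;       // do not expand past cyan
            List<String> next = associations.getOrDefault(gene, Collections.emptyList());
            for (String neighbor : next) {
                if (!depth.containsKey(neighbor)) {
                    depth.put(neighbor, d + 1);
                    queue.add(neighbor);
                }
            }
        }

        Map<String, String> colors = new LinkedHashMap<>();
        depth.forEach((gene, level) -> colors.put(gene, palette[level]));
        return colors;
    }
}
```

For a disease with no associated genes, such as the HBB example, the map contains only the causal gene colored white, which matches the single-node output described for Fig. 8.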

VI. CONCLUSION

A single gene may or may not cause a disease. Gene-to-gene interactions should also be considered because they can affect the functionality of the genes. To capture this, we have visualized pathways for each gene responsible for a disease and predicted the probability of occurrence of the disease. The statistics were obtained by eliminating redundant data and taking into account only those genes in close correlation.

REFERENCES

[1] Li Ding, Michael C. Wendl, Daniel C. Koboldt and Elaine R. Mardis, "Analysis of next-generation genomic data in cancer: accomplishments and challenges," Human Molecular Genetics, R1-R9.
[2] Sebastian Okser, Tapio Pahikkala and Tero Aittokallio, "Genetic variants and their interactions in disease risk prediction: machine learning and network perspectives," BioData Mining, 6:5, 1-16.
[3] W. B. Langdon and B. F. Buxton, "Genetic Programming for Mining DNA Chip Data from Cancer Patients," Genetic Programming and Evolvable Machines, 5.
[4] Davnah Urbach and Jason H. Moore, "Mining the diseasome," BioData Mining, 4:25, 1-2.
[5] Vijay K. Ramanan, Li Shen, Jason H. Moore and Andrew J. Saykin, "Pathway analysis of genomic data: concepts, methods, and prospects for future development," National Institute of Health, 28(7).
[6] Purvesh Khatri, Marina Sirota and Atul J. Butte, "Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges," PLoS Computational Biology, 8:2, 1-11.
[7] Qingrun Zhang, Quan Long and Jurg Ott, "AprioriGWAS, a New Pattern Mining Strategy for Detecting Genetic Variants Associated with Disease through Interaction Effects," PLoS Computational Biology, 10:6, 1-11, 2014.
[8] Wei Zhang, "Data mining for biological data: learning algorithm and application," dissertation submitted to the Graduate School of the University of Notre Dame.

Exome Sequencing Exome sequencing is a technique that is used to examine all of the protein-coding regions of the genome.

Exome Sequencing Exome sequencing is a technique that is used to examine all of the protein-coding regions of the genome. Glossary of Terms Genetics is a term that refers to the study of genes and their role in inheritance the way certain traits are passed down from one generation to another. Genomics is the study of all

More information

2. Materials and Methods

2. Materials and Methods Identification of cancer-relevant Variations in a Novel Human Genome Sequence Robert Bruggner, Amir Ghazvinian 1, & Lekan Wang 1 CS229 Final Report, Fall 2009 1. Introduction Cancer affects people of all

More information

Crash-course in genomics

Crash-course in genomics Crash-course in genomics Molecular biology : How does the genome code for function? Genetics: How is the genome passed on from parent to child? Genetic variation: How does the genome change when it is

More information

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome Lesson Overview 14.3 Studying the Human Genome THINK ABOUT IT Just a few decades ago, computers were gigantic machines found only in laboratories and universities. Today, many of us carry small, powerful

More information

Lesson Overview. Studying the Human Genome. Lesson Overview Studying the Human Genome

Lesson Overview. Studying the Human Genome. Lesson Overview Studying the Human Genome Lesson Overview 14.3 Studying the Human Genome THINK ABOUT IT Just a few decades ago, computers were gigantic machines found only in laboratories and universities. Today, many of us carry small, powerful

More information

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes Coalescence Scribe: Alex Wells 2/18/16 Whenever you observe two sequences that are similar, there is actually a single individual

More information

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Justin S. Hogg Department of Computational Biology University of Pittsburgh Pittsburgh, PA 15213 jsh32@pitt.edu Abstract

More information

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics A very coarse introduction to bioinformatics In this exercise, you will get a quick primer on how DNA is used to manufacture proteins. You will learn a little bit about how the building blocks of these

More information

Data Mining for Biological Data Analysis

Data Mining for Biological Data Analysis Data Mining for Biological Data Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Data Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Jiawei Han

More information

Introduction to BIOINFORMATICS

Introduction to BIOINFORMATICS COURSE OF BIOINFORMATICS a.a. 2016-2017 Introduction to BIOINFORMATICS What is Bioinformatics? (I) The sinergy between biology and informatics What is Bioinformatics? (II) From: http://www.bioteach.ubc.ca/bioinfo2010/

More information

Multi-SNP Models for Fine-Mapping Studies: Application to an. Kallikrein Region and Prostate Cancer

Multi-SNP Models for Fine-Mapping Studies: Application to an. Kallikrein Region and Prostate Cancer Multi-SNP Models for Fine-Mapping Studies: Application to an association study of the Kallikrein Region and Prostate Cancer November 11, 2014 Contents Background 1 Background 2 3 4 5 6 Study Motivation

More information

Genetics and Bioinformatics

Genetics and Bioinformatics Genetics and Bioinformatics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be Lecture 1: Setting the pace 1 Bioinformatics what s

More information

COS 597c: Topics in Computational Molecular Biology. DNA arrays. Background

COS 597c: Topics in Computational Molecular Biology. DNA arrays. Background COS 597c: Topics in Computational Molecular Biology Lecture 19a: December 1, 1999 Lecturer: Robert Phillips Scribe: Robert Osada DNA arrays Before exploring the details of DNA chips, let s take a step

More information

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA advanced analysis of gene expression microarray data aidong zhang State University of New York at Buffalo, USA World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI Contents

More information

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology. G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY Methods or systems for genetic

More information

Feature Selection of Gene Expression Data for Cancer Classification: A Review

Feature Selection of Gene Expression Data for Cancer Classification: A Review Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 50 (2015 ) 52 57 2nd International Symposium on Big Data and Cloud Computing (ISBCC 15) Feature Selection of Gene Expression

More information

Genomics and Biotechnology

Genomics and Biotechnology Genomics and Biotechnology Expansion of the Central Dogma DNA-Directed-DNA-Polymerase RNA-Directed- DNA-Polymerase DNA-Directed-RNA-Polymerase RNA-Directed-RNA-Polymerase RETROVIRUSES Cell Free Protein

More information

Evolutionary Computation. Lecture 1 January, 2007 Ivan Garibay

Evolutionary Computation. Lecture 1 January, 2007 Ivan Garibay Evolutionary Computation Lecture 1 January, 2007 Ivan Garibay igaribay@cs.ucf.edu Lecture 1 What is Evolutionary Computation? Evolution, Genetics, DNA Historical Perspective Genetic Algorithm Components

More information

mrna for protein translation

mrna for protein translation Biology 1B Evolution Lecture 5 (March 5, 2010), Genetic Drift and Migration Mutation What is mutation? Changes in the coding sequence Changes in gene regulation, or how the genes are expressed as amino

More information

Statistical Inference and Reconstruction of Gene Regulatory Network from Observational Expression Profile

Statistical Inference and Reconstruction of Gene Regulatory Network from Observational Expression Profile Statistical Inference and Reconstruction of Gene Regulatory Network from Observational Expression Profile Prof. Shanthi Mahesh 1, Kavya Sabu 2, Dr. Neha Mangla 3, Jyothi G V 4, Suhas A Bhyratae 5, Keerthana

More information

03-511/711 Computational Genomics and Molecular Biology, Fall

03-511/711 Computational Genomics and Molecular Biology, Fall 03-511/711 Computational Genomics and Molecular Biology, Fall 2011 1 Problem Set 0 Due Tuesday, September 6th This homework is intended to be a self-administered placement quiz, to help you (and me) determine

More information

FUNCTIONAL BIOINFORMATICS

FUNCTIONAL BIOINFORMATICS Molecular Biology-2018 1 FUNCTIONAL BIOINFORMATICS PREDICTING THE FUNCTION OF AN UNKNOWN PROTEIN Suppose you have found the amino acid sequence of an unknown protein and wish to find its potential function.

More information

Concepts of Genetics, 10e (Klug/Cummings/Spencer/Palladino) Chapter 1 Introduction to Genetics

Concepts of Genetics, 10e (Klug/Cummings/Spencer/Palladino) Chapter 1 Introduction to Genetics 1 Concepts of Genetics, 10e (Klug/Cummings/Spencer/Palladino) Chapter 1 Introduction to Genetics 1) What is the name of the company or institution that has access to the health, genealogical, and genetic

More information

From genome-wide association studies to disease relationships. Liqing Zhang Department of Computer Science Virginia Tech

From genome-wide association studies to disease relationships. Liqing Zhang Department of Computer Science Virginia Tech From genome-wide association studies to disease relationships Liqing Zhang Department of Computer Science Virginia Tech Types of variation in the human genome ( polymorphisms SNPs (single nucleotide Insertions

More information

Engineering Genetic Circuits

Engineering Genetic Circuits Engineering Genetic Circuits I use the book and slides of Chris J. Myers Lecture 0: Preface Chris J. Myers (Lecture 0: Preface) Engineering Genetic Circuits 1 / 19 Samuel Florman Engineering is the art

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics If the 19 th century was the century of chemistry and 20 th century was the century of physic, the 21 st century promises to be the century of biology...professor Dr. Satoru

More information

Instructions for Confirmation of HLA Typing (Form 2005 Revision 5)

Instructions for Confirmation of HLA Typing (Form 2005 Revision 5) (Form 2005 Revision 5) This section of the CIBMTR Forms Instruction Manual is intended to be a resource for completing the Confirmation of HLA Typing Form. E-mail comments regarding the content of the

More information

The University of California, Santa Cruz (UCSC) Genome Browser

The University of California, Santa Cruz (UCSC) Genome Browser The University of California, Santa Cruz (UCSC) Genome Browser There are hundreds of available userselected tracks in categories such as mapping and sequencing, phenotype and disease associations, genes,

More information

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016 CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016 Topics Genetic variation Population structure Linkage disequilibrium Natural disease variants Genome Wide Association Studies Gene

More information

Introduction to Quantitative Genomics / Genetics

Introduction to Quantitative Genomics / Genetics Introduction to Quantitative Genomics / Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics September 10, 2008 Jason G. Mezey Outline History and Intuition. Statistical Framework. Current

More information

Statistical Methods for Network Analysis of Biological Data

Statistical Methods for Network Analysis of Biological Data The Protein Interaction Workshop, 8 12 June 2015, IMS Statistical Methods for Network Analysis of Biological Data Minghua Deng, dengmh@pku.edu.cn School of Mathematical Sciences Center for Quantitative

More information

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE CHAPTER1 ROAD TO STATISTICAL BIOINFORMATICS Jae K. Lee Department of Public Health Science, University of Virginia, Charlottesville, Virginia, USA There has been a great explosion of biological data and

More information

Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine

Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine R. Sathya Assistant professor, Department of Computer Science & Engineering Annamalai University

More information

Microarrays & Gene Expression Analysis

Microarrays & Gene Expression Analysis Microarrays & Gene Expression Analysis Contents DNA microarray technique Why measure gene expression Clustering algorithms Relation to Cancer SAGE SBH Sequencing By Hybridization DNA Microarrays 1. Developed

More information

Heredity and DNA Assignment 1

Heredity and DNA Assignment 1 Heredity and DNA Assignment 1 Name 1. Which sequence best represents the relationship between DNA and the traits of an organism? A B C D 2. In some people, the lack of a particular causes a disease. Scientists

More information

Genomes contain all of the information needed for an organism to grow and survive.

Genomes contain all of the information needed for an organism to grow and survive. Section 3: Genomes contain all of the information needed for an organism to grow and survive. K What I Know W What I Want to Find Out L What I Learned Essential Questions What are the components of the

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Richard Corbett Canada s Michael Smith Genome Sciences Centre Vancouver, British Columbia June 28, 2017 Our mandate is to advance knowledge about cancer and other diseases

More information

Chapter 15 THE HUMAN GENOME PROJECT AND GENOMICS

Chapter 15 THE HUMAN GENOME PROJECT AND GENOMICS Chapter 15 THE HUMAN GENOME PROJECT AND GENOMICS Chapter Summary Mapping of human genes means identifying the chromosome and the position on that chromosome where a particular gene is located. Initially

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu Spring 2015, Thurs.,12:20-1:10

More information

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology Lecture 2: Microarray analysis Genome wide measurement of gene transcription using DNA microarray Bruce Alberts, et al., Molecular Biology

More information

Bioinformatics : Gene Expression Data Analysis

Bioinformatics : Gene Expression Data Analysis 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering What is Bioinformatics Broad Definition The study of how information technologies are used

More information

BIOINFORMATICS THE MACHINE LEARNING APPROACH
