Bioinformatics


Antal, Péter; Hullám, Gábor; Millinghoffer, András; Hajós, Gergely; Marx, Péter; Arany, Ádám; Bolgár, Bence; Gézsi, András; Sárközy, Péter; Poppe, László

Written by Antal, Péter; Hullám, Gábor; Millinghoffer, András; Hajós, Gergely; Marx, Péter; Arany, Ádám; Bolgár, Bence; Gézsi, András; Sárközy, Péter; and Poppe, László. Publication date: 2014. Copyright 2014 Antal Péter, Hullám Gábor, Millinghoffer András, Hajós Gergely, Marx Péter, Arany Ádám, Bolgár Bence, Gézsi András, Sárközy Péter, Poppe László.

Contents

1. DNA recombinant measurement technology, noise and error models
2. The post-processing, haplotype reconstruction, and imputation of genetic measurements
3. Comparative protein modeling and molecular docking
4. Methods of determining structure of proteins and protein structure databases
5. Quantitative models of the functional effects of genetic variants
6. Mathematical models of gene regulatory networks
7. Standard analysis of genetic association studies
8. Analyzing gene expression studies
9. Biomarker analysis
10. Network biology
11. Dynamic modeling in cell biology
12. Causal inference in biomedicine
13. Text mining methods in bioinformatics
14. Experimental design: from the basics to knowledge-rich and active learning extensions
15. Big data in biomedicine
16. Analysis of heterogeneous biomedical data through information fusion
17. The Bayesian Encyclopedia
18. Bioinformatical workflow systems - case study
19. Computational aspects of pharmaceutical research
20. Metagenomics


Typotex Kiadó, Creative Commons NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0). With attribution of the authors, this work may be freely copied, distributed, published and performed for non-commercial purposes, but it may not be modified.

DNA recombinant measurement technology, noise and error models

Historic overview

The Human Genome Project was initiated in 1990 and was finished in 2003. It resulted in a complete sequence of the entire human genome. At the time of its commissioning it was said that knowledge of the human genome would be as important to advancing medicine as knowledge of anatomy was to the then-current state of medicine. The final sequence was obtained by using increasingly parallel runs of Sanger sequencing. In the early 2000s, sequencing an entire human genome was still not feasible, so the HapMap project was commissioned to catalog the genomic variations among different populations on Earth, stemming from the initial hypothesis that common diseases - especially ones that occur after childbearing age - are quite common and are most likely linked to common variant sites in the DNA. For sequencing, human chromosomes were broken into smaller pieces of between approximately 10,000 and 50,000 base pairs in length. These long fragments were then cloned inside bacteria so the DNA could be copied by the bacterial DNA replication machinery. These fragments were isolated from the bacterial DNA and further

broken into smaller pieces on the order of a couple of hundred base pairs in length, then sequenced using Sanger sequencing, and finally assembled. This method is known as the hierarchical shotgun method. The computing capacity of computers was initially insufficient to assemble large genomes, especially the 3 billion base pair human genome, from random shotgun reads, so new methods had to be developed to assemble short reads. The 1000 Genomes Project jump-started the commercialization of next generation sequencing (NGS), and helped researchers understand the measurement characteristics of these new platforms. Researchers sequenced the entire DNA of 1000 humans in the scope of this project, and this has made it possible to sequence an entire human genome for a couple of thousand dollars.

Clinical aspects of genome sequencing

Like most new and emerging research fields, the initial hopes for human genome sequencing were extremely high. Researchers believed that the causes of most common diseases could be easily identified, and new drugs could be efficiently developed based on their findings. The results of genome sequencing unfortunately turned out to be much more difficult to interpret. Currently, sequencing has become a frequently used method for diagnosing certain diseases, as well as for providing decision support for treatment selection.

Partial Genetic Association Studies

In partial genetic association studies (PGAS), researchers select a subset of the human genome where they suspect that DNA variations associated with a specific disease are located. These variations are then determined for large sets of case and control samples, and researchers use statistical methods to analyze which variant has the largest effect on the phenotype, and to determine causal variants. The study design of PGAS is elaborated in a subsequent chapter.

Genome Wide Association Studies

After the completion of the HapMap project, researchers found themselves with a map of human recombination hotspots as well as a map of variations in the human genome. This has proven to be extremely useful in reducing the number of locations where variations can reside. Multiple single nucleotide polymorphisms (SNPs) that are known from the HapMap project are selected to maximize the number of linked SNPs, and they are determined on DNA chips in large populations, on the order of thousands of individuals. Most often a specific disease is targeted in a genome wide association study: thousands of cases and controls are analyzed with DNA chips carrying up to 1 million SNPs per individual, and the SNPs are then statistically analyzed to determine which of them correlate with the disease in question.

First generation automated Sanger sequencing

Sanger sequencing is a method of DNA sequencing based on the selective incorporation of nucleotides by a DNA polymerase during in vitro replication. The method was developed in 1977 and was the most widely used sequencing method for almost 25 years. Sanger sequencing produces long contiguous reads of up to 800 nucleotides in length and is now most often used in small-scale studies. Each of the four nucleotides is marked with a differently colored fluorescent dye terminator, the fragments are then run on gel electrophoresis and an image is recorded. The fluorescent bands are decoded into the final sequence. The Sanger method is quite slow and relatively expensive compared to newer methods, although its reliability and error characteristics are now well understood.

Next generation sequencing technologies

All next generation sequencing technologies perform massively parallel sequencing of short sequences. It is possible to separate multiple samples in the same sequencing run with the help of the spatial separation of samples. The increased parallelism did not significantly increase the quality of each read, but the redundant sequencing of overlapping segments increased the coverage of the region, which in turn allows for greater reliability. In practice, coverage can range from about 30 to over one thousand. The currently leading technology based on

cost per base sequenced is Illumina's HiSeq, but its average read length is relatively low, so it is termed a short read technology.

Pyrosequencing and pH based sequencing

In 2005, 454 Life Sciences, now a subsidiary of Roche, developed a method of sequencing by synthesis, which involves taking the single-stranded DNA to be sequenced and then synthesizing its complementary strand enzymatically. The pyrosequencing method is based on detecting the activity of the DNA-synthesizing enzyme, DNA polymerase, with another chemiluminescent enzyme (luciferase). Sequencing is achieved by adding one of the four deoxynucleoside triphosphates (dNTPs); if that nucleotide is complementary to the next base, its incorporation into the complementary strand triggers the luciferase enzyme to emit light. The relative intensity of the light emission is proportional to the number of identical bases incorporated into the strand. After this, the unincorporated nucleotides are washed away, a different dNTP is added, and the light emissions are recorded. These light emissions can be used to determine the nucleotide sequence of the strand in question, as sketched in the example below.
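As an illustration of how a sequence can be reconstructed from flow-by-flow light intensities, the following sketch decodes an idealized, noise-free flowgram. It is a minimal toy example: the flow order, the example intensities and the helper name decode_flowgram are illustrative assumptions, not taken from any vendor's software.

# Minimal sketch: decoding an idealized pyrosequencing/pH-based flowgram.
# Assumption: each flow intensity is (noiselessly) proportional to the
# number of identical bases incorporated during that flow.

FLOW_ORDER = "TACG"  # hypothetical cyclic order in which dNTPs are flowed

def decode_flowgram(intensities, flow_order=FLOW_ORDER):
    """Convert per-flow signal intensities into a base sequence."""
    sequence = []
    for i, signal in enumerate(intensities):
        base = flow_order[i % len(flow_order)]
        # Round the signal to the nearest whole number of incorporated bases;
        # with real, noisy data this rounding is where homopolymer errors arise.
        n_bases = int(round(signal))
        sequence.append(base * n_bases)
    return "".join(sequence)

if __name__ == "__main__":
    # Example flow signals (arbitrary illustrative values).
    flows = [1.02, 0.00, 2.10, 0.95, 0.05, 1.08, 0.00, 2.96]
    print(decode_flowgram(flows))  # -> "TCCGAGGG"

With noisy signals the rounding step becomes a statistical call, which is exactly where the homopolymer errors discussed later in this chapter originate.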

In pH based sequencing, the protons released by the incorporation of nucleotides cause a change in the pH of the sequencing reaction well, and this pH change is detected on a CMOS substrate. The level of pH change is proportional to the number of identical bases incorporated into the strand. The characteristics of pyrosequencing and pH based sequencing are therefore highly similar.

Reversible terminator based sequencing

Reversible terminator based sequencing was developed in 2006 by Illumina/Solexa. DNA strands are immobilized onto plates, then amplified in situ with bridge amplification, and then the following cycles are repeated. First, the four different nucleotides, each labeled with a different fluorescent dye, are incorporated into each strand; then the excess nucleotides are washed away and an image of the plate is recorded; after imaging, the terminating groups are cleaved from the strands and washed away. The difference between Illumina's and Helicos BioSciences' solutions is that Helicos incorporates only a single dye-labeled nucleotide per cycle, while Illumina marks all four nucleotides with different dyes and incorporates them simultaneously. After the images are acquired, clonal clusters are located on the plates, and the nucleotide sequences can be determined from the colors and intensities in each cycle.

Nanopore based sequencing

Nanopore based sequencing has been under development for many years. No commercial solutions have been released as of yet, but the technology promises long and accurate reads. Nanopores are constructed out of proteins and are immobilized on a plate. A single strand of DNA is passed through the pore and the ionic current flowing from one side of the plate to the other is recorded. The nanopore is extremely small, on the order of 1 nm in internal diameter, and as each nucleotide passes through the hole, it modulates the current in a way that is representative of that nucleotide.

Error characteristics of Next Generation Sequencing

Oftentimes a next generation sequencing measurement does not output the exact amount or type of data that was intended to be obtained. Library preparation is a highly complex process involving tens of hours of hands-on lab time, and it is very easy to introduce errors into the preparation. The amount of starting DNA is on the order of nanograms per picoliter. There are mainly three types of biases that lead to errors in NGS data - systematic bias, coverage bias, and batch effects - depending on the sequencing platform, genome content, and experimental variability.

Carry forward/incomplete extension

The carry forward/incomplete extension error often happens in sequencing by synthesis, when a subset of clonally identical fragments is not synthesized synchronously. For example, if some of the strands do not incorporate the exact number of nucleotides (because an insufficient number of nucleotides were flowed over the plate), or if residual nucleotides remain inside the wells (because they were insufficiently washed out by the washing cycle between flows), then the strands are not sequenced completely in sync. This results in signal level degradation and reduces the quality of the reads. Filtering algorithms are designed to detect (and in some cases even correct) this error and to discard the affected reads.

Homopolymer errors

In pyrosequencing and pH based sequencing, multiple identical nucleotides are incorporated in one flow if there is a stretch of identical nucleotides upcoming in the target sequence. The amount of light emitted in pyrosequencing and the pH change in pH based sequencing are proportional to the number of bases incorporated. With longer homopolymer stretches, the amount of noise and variation increases, thus making it difficult to call the exact number of identical bases in the sequence. Higher coverage of the target sequence can enable inference of the exact length of the homopolymer stretch.

Capture technologies

In most studies, DNA sequencing is not done in a completely shotgun approach; there are usually target regions of interest inside the genome that we want to sequence. The first step of library creation is capturing our target sequence and then amplifying it. Multiple methods exist to perform this step.

PCR capture

When using a polymerase chain reaction to capture a specific target, we must design a primer that specifically binds only to the leading or trailing sequence of our targeted region. This primer must be unique in the sense that it should bind to only a single spot on the DNA of the organism being sequenced. Great care must also be taken to select primer target sites that are free from known mutations, and that are as short as possible while still remaining unique (on the order of 20 to 25 nucleotides). The sample DNA is heated so that the double-stranded DNA separates into two strands. Next, the primer oligonucleotide is added, along with enzymes and free nucleotides, and the mixture is cooled. This results in the primer annealing to the template strand, and the PCR reaction can start. The complementary strand is synthesized along the template strand. This heating-annealing-extension cycle is performed multiple times, and during each cycle the amount of target DNA is doubled.

Uniplex PCR

In uniplex PCR only one target sequence is amplified in each reaction volume. Thus it is not necessary to take into account the different melting and annealing temperatures of different primers. Uniplex PCR is compatible with all next generation sequencing platforms, is straightforward, and is performed routinely. The maximum length of each target sequence is about 10 kb, because longer amplicons lose robustness due to early termination of the PCR. Longer sequences can be captured using multiple overlapping PCR reactions. When pooling the results

of multiple uniplex PCR pools, it is imperative that the amount of DNA in each solution is normalized, so that even coverage across the targets can be achieved.

Multiplex PCR

In multiplex PCR reactions, multiple primers are simultaneously added to the same reaction volume, and the target sequences are amplified all at once. This is only practical up to about 10 amplicons, because primer differences result in greater coverage non-uniformity when too many different primers are used simultaneously. The interaction of multiple primers can cause nonspecific amplification, uneven amplification, and can even cause an amplicon to fail to amplify completely. The advantage of this method is its lower specific cost. Amplicon lengths must also be matched to ensure uniformity.

Microarray capture

A microarray contains millions of probes immobilized on a glass plate, which can hybridize to specific targets of interest in a genomic DNA library. In microarray capture, large genomic DNA sequences are first amplified with PCR, and then hybridized to the microarray. The sequences that do not bind to the probes are washed away, and the remaining immobilized strands of interest are eluted from the plate. The eluted DNA is then amplified and loaded into a sequencer.

Microfluidic capture

Microfluidic capture mechanisms employ tiny droplets of water suspended in oil. These droplets are filled with the reagents required for the specific reaction to be performed inside these micro-vessels. The droplets can be sorted based on their visual characteristics with the help of electromagnetic forces. These methods allow millions of reactions to take place simultaneously in individual droplets, as illustrated by the sketch below.
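Both droplet-based methods and the emulsion PCR described in the next section rely on diluting the template so that most compartments receive at most one molecule. The sketch below is a minimal illustration of this idea under the common simplifying assumption that template molecules are distributed across compartments according to a Poisson distribution; the numbers and the function name occupancy_probabilities are illustrative, not taken from any protocol.

import math

def occupancy_probabilities(mean_templates_per_compartment, max_k=3):
    """Poisson probabilities that a droplet/bead compartment holds k templates.

    Assumes template molecules land in compartments independently, so the
    count per compartment is Poisson with the given mean (lambda).
    """
    lam = mean_templates_per_compartment
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(max_k + 1)]

if __name__ == "__main__":
    # At a mean loading of 0.1 templates per compartment, most compartments
    # are empty, ~9% hold exactly one template, and fewer than 0.5% hold two
    # or more, which is why dilute loading keeps amplified clusters clonal.
    for k, p in enumerate(occupancy_probabilities(0.1)):
        print(f"P(k = {k}) = {p:.4f}")
    p0, p1 = occupancy_probabilities(0.1, max_k=1)
    print(f"P(k >= 2) = {1 - p0 - p1:.4f}")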

Emulsion PCR

Adapters that are complementary to the ones on magnetic beads are ligated to the molecules of a single-stranded DNA library. Magnetic beads are then added to a dilute solution of the single-stranded DNA molecules and PCR reagents. The ratio of magnetic beads to DNA strands must be greater than one in order to ensure that there is only one copy of single-stranded DNA per bead. This solution is then emulsified so that each droplet contains at most one bead and one strand of DNA, along with a high amount of PCR reagents. If a droplet that contains a bead has more than one strand of DNA, then the copies produced by the PCR will not be unique to that bead. After the emulsification, multiple cycles of annealing, extension, and denaturation are performed, producing a bead that has thousands of copies of the same strand attached to it. Next the emulsion is broken, and the individual beads are separated from the remainder of the solution using a magnetic separator. These beads can then be loaded onto plates which contain individual wells that are exactly the size of a single bead (so that there can only be one bead per well), and the sequencing process can begin. Emulsion PCR is the capture method used by pyrosequencing and pH based sequencing.

Bridge amplification

Two different adapters that are complementary to the two adapters bound to a large glass plate are ligated to the two ends of each molecule in a single-stranded DNA library. Then the DNA solution is annealed to the glass plate. The individual molecules bind to random spots on the plate. The molecules form bridges between two

different adapters on the plate, and then the target region is extended. When the extension is complete, the molecules are denatured, and the annealing, bridging and extension process can begin once again with an increased number of strands per cluster. Hundreds of millions of molecular clusters are formed on a plate. If the distance between two clusters is sufficient, then each localized cluster will contain only a single sequence of DNA. This plate is then loaded into a sequencer that performs reversible terminator based sequencing.

Targeted resequencing

Next generation sequencing is most often used to resequence regions that have already been sequenced, in multiple individuals, comparing the individual mutations and variations in each sample region and relating those to phenotype parameters. There are multiple ways of selecting and amplifying target regions of the genome of an organism. Targeted resequencing simplifies the mapping problem, because we already know which target regions the reads should map to; since the reads do not come with any positional information, this greatly simplifies the task. Any sequencing technology produces errors that look similar to real variation, so we need to isolate the sources of these low-level sequencing errors in our reads and separate them from real variations later.

De-novo sequencing

De-novo sequencing is most often performed when the sequence of the target organism is unknown, or when we wish to infer large-scale rearrangements and mutations relative to a reference organism, for example when sequencing highly mutated cancer cells. When performing de novo sequencing, the target DNA is amplified in its entirety. The amplified DNA is then fragmented into lengths that are suitable for the chosen sequencing technology. These fragments are then loaded into the sequencer using the standard library preparation methods. Assembly of the sequence fragments is an extremely difficult task: the human genome, for example, contains 3 billion base pairs, and with a common read length of only a few hundred bases, at least 100 million fragments have to be sequenced to be able to assemble the reads into contiguous chromosomes. Because coverage is not uniform across the target DNA, increased coverage depths are required to perform de novo sequencing. Organisms with smaller genomes (e.g. viruses and bacteria, whose genomes range from thousands to millions of base pairs) can often be sequenced in a single run of a sequencer. Multiple software tools exist to perform the assembly of read fragments into contiguous sequences; a back-of-the-envelope read-count calculation is sketched below.
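The following sketch makes the read-count arithmetic from the paragraph above explicit: it estimates how many reads of a given length are needed to reach a desired average coverage of a genome. The function name reads_needed and the example numbers are illustrative assumptions.

def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Estimate the number of reads required for a given average coverage.

    Uses the simple identity
        coverage = (number_of_reads * read_length) / genome_size,
    which ignores non-uniform coverage, so real experiments need a safety margin.
    """
    return int(round(target_coverage * genome_size_bp / read_length_bp))

if __name__ == "__main__":
    human_genome = 3_000_000_000   # ~3 billion base pairs
    read_length = 300              # "a few hundred bases"
    # Even modest average coverage requires tens to hundreds of millions of reads.
    for cov in (1, 10, 30):
        print(f"{cov:>2}x coverage: {reads_needed(human_genome, read_length, cov):,} reads")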

Next generation sequencing workflows

Next generation sequencing is now not only used for research purposes but is also often used to diagnose diseases, and thus there are clear guidelines on how to perform the analysis of the results obtained from a sequencing run in order to provide diagnostic utility.

Filtering

Quality scores are assigned by the sequencing platform to each base in each read. The most often used scoring metric is the Phred score Q, which is defined from the probability P of the base call being incorrect as Q = -10 log10(P). Reads must be filtered for quality control purposes; reads that are too short or that do not reach minimum quality thresholds must be discarded, and erroneous and non-unique reads must also be identified and filtered.

Mapping

Mapping, otherwise known as alignment, is essentially a step in resequencing. It is the process of mapping each individual read to the target genome. There are multiple alignment algorithms available that attempt to find the most likely target position for each read.

Assembly

Assembly is the term used for assembling short reads into a contiguous sequence without a reference sequence. The task of assembly can be compared to shredding multiple copies of the same book and then trying to reassemble the original book from the fragments. The most often used approach is a greedy algorithm whose objective is to find the shortest common sequence covered by the reads: the pairwise alignment of each pair of fragments is calculated, and the two fragments with the largest overlap are merged into one fragment. This is repeated until only one fragment remains (a minimal sketch of this greedy merging is given at the end of this section).

Variant calling

Variant calling is the process of taking the multiple reads that map to the same location on the reference sequence and identifying whether there is a difference from the expected reference. There are multiple types of variants, including single nucleotide polymorphisms (SNPs), as well as insertions, deletions, copy number variations and large structural rearrangements.

Paired end sequencing

Oftentimes short reads do not provide us with enough information to resolve large-scale copy number variations and rearrangements, and aligning reads to a de novo assembled large genome is extremely difficult. In paired end sequencing the fragments are much longer, on the order of thousands of bases in length. Both ends of each fragment are sequenced and the sequence data are aligned as a pair, taking into account the minimum and maximum distance between the ends.
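The sketch below illustrates the greedy overlap-merging idea described under Assembly. It is a toy implementation assuming exact, error-free overlaps between read ends; the helper names (overlap, greedy_assemble) and the example reads are illustrative, not a production assembler.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_pair, best_olen = None, 0
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_olen:
                        best_pair, best_olen = (i, j), olen
        if best_pair is None:
            break  # no remaining overlaps; contigs stay separate
        i, j = best_pair
        merged = reads[i] + reads[j][best_olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

if __name__ == "__main__":
    fragments = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(fragments))  # expected: ['ATTAGACCTGCCGGAATAC']

With sequencing errors and repeats, exact-overlap greedy merging breaks down, which is why practical assemblers use approximate alignment and graph-based formulations instead.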

Multiplexing samples

When sequencing a single genomic region from multiple samples in a single sequencing run, it is imperative that we are able to determine which short read belongs to which sample. Each next generation sequencing technology deals with this issue in a similar manner, by ligating short identification sequences to each fragment. This barcode is usually around 10 nucleotides in length and often contains redundancy to allow for error correction. Another method of multiplexing samples is dividing each plate into lanes which are loaded separately with libraries from different samples. A toy demultiplexing example is sketched below.
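As a small illustration of barcode-based demultiplexing, the sketch below assigns each read to the sample whose barcode matches the read's first bases, tolerating one mismatch as a crude stand-in for the error-correcting redundancy mentioned above. The barcode length, the sample names and the function demultiplex are illustrative assumptions.

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign reads to samples by their leading barcode; unmatched reads go to None."""
    bins = {sample: [] for sample in barcodes}
    bins[None] = []  # reads whose barcode cannot be resolved unambiguously
    for read in reads:
        hits = [s for s, bc in barcodes.items()
                if hamming(read[:len(bc)], bc) <= max_mismatches]
        if len(hits) == 1:
            sample = hits[0]
            bins[sample].append(read[len(barcodes[sample]):])  # strip the barcode
        else:
            bins[None].append(read)
    return bins

if __name__ == "__main__":
    barcodes = {"sample_A": "ACGTAC", "sample_B": "TTGCAA"}  # toy 6-nt barcodes
    reads = ["ACGTACGGGTTT", "TTGCAACCCAAA", "ACGAACGGGTTT", "GGGGGGTTTTTT"]
    print(demultiplex(reads, barcodes))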

The post-processing, haplotype reconstruction, and imputation of genetic measurements

Genome

The genome contains the entire hereditary information of living organisms, which in most cases is coded by DNA. The expression was created by merging the words gene and chromosome. DNA is a nucleic acid with the shape of a double spiral (helix) formed by the pairing of four nucleotide bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). The backbone of the DNA strand is made from alternating phosphate and sugar residues. The sugar is 2-deoxyribose, which is a pentose (five-carbon) sugar. The spiral is held together by the complementary bases on each strand: adenine and thymine are connected by two hydrogen bonds, while cytosine and guanine are connected by three hydrogen bonds. DNA strands in the genome are organized into chromosomes, which are present in the human body in the form of two homologous chromosomes forming a chromosome pair. In human cells there are 23 pairs of chromosomes; each member of a pair originates from one of the parents. The variants found at corresponding loci of the homologous chromosome pairs are called alleles, thus human loci are usually biallelic. If both alleles are the same, the organism is homozygous for the trait; if the alleles differ, the organism is heterozygous for that trait. The phenotype is any observable characteristic or trait of an organism, such as its morphology, biochemical or physical properties, or behavior.

Genotype

The genotype is the genetic trait that cannot be directly observed. It is the specific genetic sequence of a cell, an organism, or an individual, i.e., the specific allele make-up of the individual. Genotyping is the process of determining the genotype of an individual by the use of biological assays. It provides a measurement of the genetic variation between members of a species.

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNP, pronounced 'snip') are the most common type of genetic variation. SNPs may be base-pair changes or small insertions or deletions at a specific locus, usually consisting of two alleles (where the rarer allele frequency is at least 1%). SNPs are often found to be biomarkers of many human diseases and are becoming of particular interest in pharmacogenetics. An SNP is a DNA sequence variation occurring when a single nucleotide - A, T, C, or G - in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles. Within a

population, SNPs can be assigned a minor allele frequency - the lowest allele frequency at a locus that is observed in a particular population. This is simply the lesser of the two allele frequencies for single-nucleotide polymorphisms. There are variations between human populations, so an SNP allele that is common in one geographical or ethnic group may be much rarer in another. Variations in the DNA sequences of humans can affect how humans develop diseases. SNPs within a coding sequence can change the translated amino acid sequence, which can affect the structure and the function of the produced protein (Fig. 16). Over half of all known disease mutations come from these replacement polymorphisms. SNPs are also thought to be key enablers in realizing the concept of personalized medicine. However, their greatest importance in biomedical research is for comparing regions of the genome between cohorts (such as matched cohorts with and without a disease) in genome-wide association studies.

Types of point mutation

The change of a single nucleotide in the DNA molecule can be a base change, or an insertion or deletion of a single nucleotide. The effect of a single change is greatly influenced by the importance of the locus in the transmission of information. In a silent mutation, the base-pair change does not result in a change in the amino acid sequence of the protein coded by the DNA, because there is an inherent redundancy in the coding of nucleotide triplets to amino acids. In a nonsense mutation, a triplet encoding an amino acid is changed to a stop codon. This halts protein translation early, and will likely result in a protein with reduced or no functionality. The most drastic mutations are insertions and deletions, because they cause a frame-shift in the decoding of the nucleotide triplets, and will result in a protein that bears little resemblance to its intended form.

Haplotypes and recombination

Recombination involves the breakage and rejoining of DNA strands corresponding to the same chromosome at the same site. Recombination occurs in meiosis by the pairing of the homologous chromosomes. In human meiosis, recombination sites are not completely random; they most often occur around recombination hotspots. These hotspots can be up to millions of base pairs distant from each other. Recombination produces new combinations of alleles, encoding a novel set of genetic information. Since recombination occurs at random distances, orders of magnitude larger than the length of the average gene, alleles at two or more loci in close proximity are often inherited together. This is called linkage disequilibrium. In other words, linkage disequilibrium is the occurrence of some combinations of alleles in a population more often or less often than would be expected from their individual allele frequencies. A combination of alleles (genetic markers) at adjacent locations on the same chromosome that are inherited together is called a haplotype; it may consist of a couple of loci or extend up to entire chromosomes, depending on the number of recombination events that have occurred between the set of loci.

Linkage Disequilibrium

The deviation of the observed frequency of a haplotype from the expected one is called the linkage disequilibrium, and is commonly denoted by a capital D. This measure does not deal well with rare alleles, thus

there is another measure of LD, an alternative to D: the correlation coefficient between pairs of markers, denoted by r².
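For reference, the two LD measures mentioned above can be written down explicitly. The formulas below follow the standard textbook definitions for two biallelic loci with alleles A/a and B/b; the notation (p_A, p_B, p_AB) is the conventional one and not specific to this book.

% Standard definitions of linkage disequilibrium for two biallelic loci.
% p_A, p_B : frequencies of alleles A and B
% p_{AB}   : frequency of the haplotype carrying both A and B
\begin{align}
  D   &= p_{AB} - p_A \, p_B, \\
  D'  &= \frac{D}{D_{\max}}, \quad
  D_{\max} =
  \begin{cases}
    \min\bigl(p_A (1-p_B),\; (1-p_A) p_B\bigr) & \text{if } D > 0, \\
    \min\bigl(p_A p_B,\; (1-p_A)(1-p_B)\bigr)  & \text{if } D < 0,
  \end{cases} \\
  r^2 &= \frac{D^2}{p_A (1-p_A)\, p_B (1-p_B)}.
\end{align}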

Linkage disequilibrium is most often visualized with the help of a rotated autocorrelation matrix. The diagonal represents the markers on a strand of DNA, and the linkage disequilibrium between two loci can be found by tracing the rows and columns of the matrix. Each cell in the matrix is colored according to the level of LD between the loci. Haplotype blocks are most often contiguous runs of loci in high LD.

Haplotype reconstruction

Most SNP measurement technologies do not provide the ability to identify haplotypes, as they only measure single points in the DNA and do not measure the entire corresponding strand. Identifying the haplotypes is an important task: for example, if SNPs A and B only cause a phenotype to be expressed if both mutations are present in the same copy of a gene, and a person's DNA measurement confirmed that he has the Aa and Bb genotypes, then we must identify whether both of these heterozygous SNPs carry their mutations on the same strand of DNA. The International HapMap Consortium was commissioned to identify the haplotype blocks and recombination hotspots in the human genome. An individual's haplotypes can be inferred from the SNP alleles of the individual and the HapMap data. The most accurate and widely used methods for haplotype estimation use some form of hidden Markov model to carry out the inference. The most often used and most accurate method is PHASE. This method uses Gibbs sampling, in which each individual's haplotypes are updated conditional upon the current estimates of the haplotypes of all other samples; approximations to the distribution of a haplotype conditional upon a set of other haplotypes are used as the conditional distributions of the Gibbs sampler. Most haplotype reconstruction algorithms offer some way of dealing with missing data, but only in the form of a binary missing/called assignment. The output haplotypes can then provide a distribution of possible genotypes for the missing samples.
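The combinatorial core of the phasing problem can be shown with a few lines of code: for a multi-site genotype, every heterozygous site doubles the number of haplotype pairs consistent with the observed (unphased) data, which is why statistical methods such as PHASE are needed to pick the most probable configuration. The encoding (0/1/2 copies of the minor allele) and the function name enumerate_phases are illustrative assumptions.

from itertools import product

def enumerate_phases(genotype):
    """List all haplotype pairs consistent with an unphased genotype.

    genotype: list of per-site minor-allele counts (0, 1 or 2).
    Returns a list of unordered (haplotype1, haplotype2) pairs of 0/1 tuples.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    phases = set()
    # Each heterozygous site can place its minor allele on either chromosome.
    for choice in product((0, 1), repeat=len(het_sites)):
        h1, h2 = [], []
        pick = dict(zip(het_sites, choice))
        for i, g in enumerate(genotype):
            if g == 0:
                h1.append(0); h2.append(0)
            elif g == 2:
                h1.append(1); h2.append(1)
            else:
                h1.append(pick[i]); h2.append(1 - pick[i])
        phases.add(tuple(sorted((tuple(h1), tuple(h2)))))  # unordered pair
    return sorted(phases)

if __name__ == "__main__":
    # Genotype heterozygous at sites 0 and 2: two distinct phase configurations.
    for h1, h2 in enumerate_phases([1, 2, 1]):
        print(h1, "/", h2)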

Imputation

Not only can SNP data be used to infer the haplotypes of an individual through the use of linkage disequilibrium, but this linkage also allows us to infer the SNPs of an individual in the case of erroneous or missing measurements. This is a common task in genetic association studies, because the high throughput nature of genotyping does not lend itself well to re-analyzing small sets of data that are missing at random from the entire study data. The rate of missing data can range from 1% up to 20%. This poses a challenge in the downstream analysis of the genotype data, thus maximizing the amount of available data is essential. Imputation can utilize external data sources, or it can be configured to use only the measured data. An important aspect of imputation is that it can utilize uncertain data. Most genotyping platforms only produce discrete calls in the form of either a genotype assignment or a missing indicator. A more recent advancement has been the development of statistical genotyping tools that output a probability distribution of possible genotypes for each measured sample. Incorporating this uncertainty information into the imputation process allows us to provide more accurate and complete genotype calls.

Genotyping platforms

There are multiple commercially available platforms to determine the genotypes of an individual. These platforms can be compared based on the number of polymorphisms that are identifiable in one run and based on the number of samples that can be genotyped in parallel. The quality of these methods also differs widely depending on their throughput, and the cost of a measurement often correlates with its accuracy. This is extremely important in identifying monogenic diseases, where there is very little tolerance for false positives or false negatives. Allele specific oligonucleotides are short pieces of synthetic DNA, typically 15 to 21 nucleotide bases long, that are complementary to the sequence of the variable target DNA. They are most often used in genetic testing and molecular biology research. This method is relatively low throughput, but can be customized for any target sequence. For example, the human disease sickle cell anemia is most often tested with allele specific oligonucleotides. The oligonucleotide acts as a probe for the presence of the target in a Southern blot assay.

Sample preparation

In order to perform any kind of genetic measurement on DNA originating from the cells of an organism, the DNA must be extracted from the cells and purified to be free of any contaminating DNA and RNA. The cell walls are broken open, most often using chemical and physical methods such as blending, grinding or sonicating the cells. Membrane lipids must be removed by adding detergents or surfactants, which also promote cell lysis. Proteins must be removed or broken down with protease enzymes. RNA must be removed by adding RNase. The DNA purification can be done with ethanol precipitation, phenol-chloroform extraction or mini-column purification.

Regions of interest

A primer is a strand of nucleic acid that serves as a starting point for DNA synthesis. It is required for DNA replication because the enzymes that catalyze this process, DNA polymerases, can only add new nucleotides to an existing strand of DNA. The polymerase starts replication at the 3'-end of the primer, and copies the opposite strand. In most cases of natural DNA replication, the primer for DNA synthesis and replication is a short strand of RNA (which can be made de novo).

Primer Design

Pairs of primers should have similar melting temperatures, since annealing in a PCR occurs for both simultaneously. A primer with a Tm significantly higher than the reaction's annealing temperature may mishybridize and extend at an incorrect location along the DNA sequence, while a Tm significantly lower than the annealing temperature may cause the primer to fail to anneal and extend at all. Primer sequences need to be chosen to uniquely select a region of DNA, avoiding the possibility of mishybridization to a similar sequence nearby. A commonly used method is a BLAST search, whereby all the possible regions to which a primer may bind can be seen. Both the nucleotide sequence and the primer itself can be BLAST searched. The free NCBI tool Primer-BLAST integrates a primer design tool and BLAST search into one application [4], as do commercial software products such as ePrime and Beacon Designer. Computer simulations of theoretical PCR results (Electronic PCR) may be performed to assist in primer design [5]. Mononucleotide repeats should be avoided, as loop formation can occur and contribute to mishybridization. Primers should not easily anneal with other primers in the mixture (either other copies of the same primer or the reverse direction primer); this phenomenon can lead to the production of 'primer dimer' products contaminating the mixture. Primers should also not anneal strongly to themselves, as internal hairpins and loops could hinder annealing with the template DNA. (A minimal primer-checking sketch is given after the list of PCR steps below.)

PCR

The polymerase chain reaction (PCR) is used to amplify a specific region of a DNA strand (the DNA target). Most PCR methods typically amplify DNA fragments of between 0.1 and 10 kilobase pairs (kb), although some techniques allow for the amplification of fragments up to 40 kb in size. The amount of amplified product is determined by the available substrates in the reaction, which become limiting as the reaction progresses. Typically, PCR consists of a series of repeated temperature changes, called cycles, with each cycle commonly consisting of 2-3 discrete temperature steps, usually three.

Initialization step: This step consists of heating the reaction to a temperature of 94-96 °C, which is held for 1-9 minutes.

Denaturation step: This step is the first regular cycling event and consists of heating the reaction to 94-98 °C for 20-30 seconds. It causes DNA melting of the DNA template by disrupting the hydrogen bonds between complementary bases, yielding single-stranded DNA molecules.

Annealing step: The reaction temperature is lowered to 50-65 °C for 20-40 seconds, allowing annealing of the primers to the single-stranded DNA template. Stable DNA-DNA hydrogen bonds are only formed when the primer sequence very closely matches the template sequence. The polymerase binds to the primer-template hybrid and begins DNA formation.

Extension/elongation step: The DNA polymerase synthesizes a new DNA strand complementary to the DNA template strand by adding dNTPs that are complementary to the template in the 5' to 3' direction, condensing the 5'-phosphate group of the dNTPs with the 3'-hydroxyl group at the end of the nascent (extending) DNA strand. The DNA polymerase will polymerize about a thousand bases per minute. Under optimum conditions, i.e., if there are no limitations due to limiting substrates or reagents, at each extension step the amount of DNA target is doubled, leading to exponential (geometric) amplification of the specific DNA fragment (roughly 2^n copies of each template after n cycles).

Final elongation: This single step is occasionally performed at a temperature of 70-74 °C for 5-15 minutes after the last PCR cycle to ensure that any remaining single-stranded DNA is fully extended.
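The primer design rules of thumb above (matched melting temperatures, moderate GC content, no strong self-complementarity) can be captured by a few heuristic checks. The sketch below uses the simple Wallace rule, Tm ≈ 2·(A+T) + 4·(G+C) in °C, which is a rough approximation valid only for short oligonucleotides; the thresholds and the function name check_primer are illustrative assumptions, not a replacement for tools such as Primer-BLAST.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def gc_content(primer):
    return (primer.count("G") + primer.count("C")) / len(primer)

def self_complementary_3prime(primer, k=4):
    """Heuristic hairpin/self-dimer flag: can the 3' end (last k bases)
    base-pair with some region of the same primer?"""
    tail = primer[-k:]
    revcomp = primer.translate(COMPLEMENT)[::-1]
    return tail in revcomp

def check_primer(primer):
    """Return a dict of heuristic quality indicators for a candidate primer."""
    return {
        "length_ok": 18 <= len(primer) <= 25,
        "tm_wallace_C": wallace_tm(primer),
        "gc_ok": 0.40 <= gc_content(primer) <= 0.60,
        "self_complementary_3prime": self_complementary_3prime(primer),
    }

if __name__ == "__main__":
    print(check_primer("AGCGGATAACAATTTCACACAGGA"))  # an example 24-nt primer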

Probe-tag based genotyping

This method uses fluorescent dye labeled probes bound to a glass plate. There are multiple wells on a plate, each with different probes. In the case of biallelic SNPs two different fluorescent dyes are used: one for the wild-type allele and one for the mutant allele. PCR amplified target regions are hybridized onto the plate, and the level of fluorescence of each dye is measured. If signals from both dyes are recorded, then the sample is called as a heterozygous SNP; if only one dye shows fluorescence, then the corresponding wild type or mutant type homozygous genotype is assigned to the sample.

Sanger sequencing

Chain termination sequencing, otherwise known as Sanger sequencing, can also be used to identify specific SNPs in the DNA of an organism. More about this method can be read in the chapter about next generation sequencing methods.

Real-time quantitative polymerase chain reaction

This method enables both the detection and the quantification of SNPs in DNA. The procedure follows the general principle of the polymerase chain reaction; its key feature is that the amplified DNA is detected as the reaction progresses, in "real time". This is a new approach compared to standard PCR, where the product of the reaction is detected at its end. Two common methods for the detection of products in quantitative PCR are: non-specific fluorescent dyes that intercalate with any double-stranded DNA, and sequence-specific DNA probes consisting of oligonucleotides that are labeled with a fluorescent reporter, which permits detection only after hybridization of the probe with its complementary sequence, to quantify the DNA levels. This method is low throughput but highly accurate, because of the real-time nature of the PCR reaction.

SNP arrays

SNP arrays are used to detect large numbers of SNPs in one individual. Modern SNP arrays can detect up to 3 million SNPs on a single chip. The array contains immobilized nucleic acid target sequences and one or more labeled allele-specific oligonucleotide (ASO) probes. It is read by a detection system that records and interprets the hybridization signal, and assigns genotype calls based on the signal intensities, as in the sketch below.
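Genotype calling from two-color hybridization signals, as used in probe-tag genotyping and SNP arrays, can be illustrated with a simple threshold rule: each channel is declared present or absent, and the combination determines the call. Real platforms use statistical clustering of the intensities instead; the thresholds, channel names and the function call_genotype below are illustrative assumptions.

def call_genotype(intensity_wild, intensity_mutant, threshold=0.3):
    """Toy two-channel genotype caller.

    intensity_wild / intensity_mutant: normalized fluorescence in [0, 1].
    Returns 'AA' (homozygous wild type), 'aa' (homozygous mutant),
    'Aa' (heterozygous) or 'no call' when neither channel exceeds the threshold.
    """
    wild = intensity_wild >= threshold
    mutant = intensity_mutant >= threshold
    if wild and mutant:
        return "Aa"
    if wild:
        return "AA"
    if mutant:
        return "aa"
    return "no call"

if __name__ == "__main__":
    samples = [(0.92, 0.05), (0.48, 0.51), (0.04, 0.88), (0.08, 0.11)]
    for w, m in samples:
        print(f"wild={w:.2f} mutant={m:.2f} -> {call_genotype(w, m)}")

The fraction of samples that end up as "no call" is exactly what the call rate discussed in the next section measures.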

Genotyping vs. gene expression

Genotyping examines the organization of an organism's DNA and looks for variability and polymorphisms. In order for genes to be expressed, they are first transcribed into RNA. Only a small percentage of a cell's DNA actually codes for genes. Gene expression microarrays look at the relative amounts of all the RNA transcripts in a sample. Gene expression is a quantitative measurement: we know that a specific RNA sequence is present in the cell, and we want to measure the relative number of these RNA strands in relation to the total number of RNA strands in the cell. Genotyping, in contrast, must assign a single, discrete result to each measurement, denoting exactly which variant is present at a site in the DNA.

Call rate and accuracy

The call rate is the ratio of the number of SNPs that actually received a genotype assignment in the measurement to the number of SNPs queried. Accuracy is the ratio of the number of correct genotype calls to the number of SNPs that were assigned a genotype. As a general guideline it can be said that higher throughput methods often result in lower accuracy and lower call rate. Diagnostic tools most often use high precision methods instead of high throughput methods, because the cost of rerunning a high throughput measurement is orders of magnitude higher than rerunning a single high precision measurement. Diagnostic tests usually measure only a few SNPs in the genome, while modern genome wide association studies measure on the order of millions of SNPs. There are multiple types of errors in high throughput genotyping. If a DNA sample is of insufficient quantity or quality, then most of the SNP measurements for that individual can fail. If a primer corresponding to the SNP site is not unique with respect to the entire DNA, or is close to another variation, then the call rate for the SNP measurement in all

samples will most likely suffer. There is also a missing-at-random effect, where a single SNP for a single sample can fail because of a wide variety of factors.

Comparative protein modeling and molecular docking

Introduction

Protein structure determination has become an important area of research in molecular biology and structural genomics. With the tertiary structure of proteins in hand, researchers can explore and analyze protein function and active sites, thus facilitating important proteomics tasks such as protein engineering and structure-based drug design. Structures solved by experimental methods are deposited in the Protein Data Bank (PDB) [195] and form the primary basis of structure based proteomics studies. However, determination of protein structures with the aid of various experimental methods (such as X-ray crystallography or NMR spectroscopy, see Chapter "Methods of determining structure of proteins") remains a difficult and costly process. The human proteome has approximately 30,000 annotated human proteins (in the Human Protein Reference Database [196]), but only about 5,000 human proteins or domains can be found in the PDB. Therefore, methods enabling access to three-dimensional atomic-level structures based on sequence data are needed. Computational methods for protein structure prediction from primary structure information (i.e. sequence data) have thus evolved to achieve this task [197 and 198]. Since the first modeled protein structure [199], numerous modeling studies have been published. In this perspective, this chapter overviews protein modeling techniques and the accuracy of the models. Modeling methods are required even when an X-ray or NMR structure is available, because the structures may require local correction or modification (e.g. during structure-based drug design the number of possible ligand-receptor combinations is extremely high, and thus solving their structures experimentally is not practical).

The protein structure gap

The genome sequencing efforts are providing us with complete genetic data for thousands of organisms, including humans (see the Genome database). Mankind is now faced with characterizing, understanding and modifying the functions of the proteins encoded by these genomes. This task is primarily facilitated by protein three-dimensional structures, which are best determined by experimental methods (such as X-ray crystallography or NMR spectroscopy, see Chapter "Methods of determining structure of proteins"). Despite significant advances in these techniques, the experimental structure determination of many proteins is still missing for several reasons.

Over the recent decades, the number of sequences in the comprehensive public sequence databases, such as UniProt (SwissProt/TrEMBL) [200] or NCBI Gene [201], has increased tremendously; these databases now contain almost 50 million sequences. In contrast, despite structural genomics, the number of experimentally determined structures deposited in the Protein Data Bank (PDB) has increased more slowly, and the PDB now (at the end of 2013) contains only around 95,000 structures. Thus, the gap between the numbers of known sequences and structures continues to grow (Fig. 23) [202]. Protein structure prediction methods attempt to bridge this gap [198 and 203].

Methods of protein modeling

The methods applicable for modeling atomic level structures of proteins from their sequence data depend on the size of the target protein and on the degree of sequence identity between the query protein and a homologous, experimentally determined template structure (Fig. 24). The first class of methods, called ab initio (or de novo) protein modeling, predicts the structure from sequence alone, without relying on any similarity at the fold level between known structures and the modeled sequence [204]. These methods seek to build 3D protein models "from scratch", i.e., based on physical principles, without previously solved structures. The de novo methods assume that the native structure corresponds to the global free-energy minimum of the protein and attempt to find this minimum by probing many feasible protein conformations. The two key elements of de novo methods are the procedure for efficient conformational search and the goodness of the free-energy function used for evaluating possible conformations. These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins.

The second class of protein structure prediction methods, called comparative protein modeling (or homology modeling), predicts the 3D structure of a protein based on its amino acid sequence and a homologous, experimentally determined template structure; it includes threading and comparative modeling. The method relies on the observation that the 3D structure of a protein is better conserved than its sequence, and therefore two proteins that are only partially identical at the sequence level may still share the same fold [198 and 205]. Construction of an atomic-resolution model of the "target" protein from its amino acid sequence by using an experimental three-dimensional structure of a related homologous "template" protein is feasible only if the degree of sequence identity between the "target" and "template" proteins is sufficiently high (roughly above 25-30%). Sequences falling below this level of sequence identity can have very different structures, and thus homology modeling is possible only if a compatible fold for the given sequence can be recognized. If there is such a fold, then a homology model can be constructed.

Comparative protein modeling

Despite progress in ab initio protein structure prediction [204], comparative protein modeling remains the most reliable method to predict the atomic level three dimensional structure of a protein. In many cases homology modeling results in structures with an accuracy comparable to a low-resolution, experimentally determined structure. Thus, comparative protein modeling is nowadays the primary tool of atomic level structure prediction of proteins [198 and 205].

Steps of homology modeling

The main steps of comparative protein modeling are template selection, alignment, backbone, loop and side-chain prediction, structure optimization and evaluation (Fig. 25). A small sketch of the sequence identity calculation underlying the feasibility threshold mentioned above is given below.
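The percent sequence identity used throughout this chapter is computed from a pairwise alignment as the fraction of aligned positions with identical residues. The sketch below assumes the alignment has already been produced (e.g. by BLAST) and is given as two equal-length strings with '-' for gaps; the function name percent_identity and the 25-30% rule-of-thumb check are illustrative assumptions.

def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned, equal-length sequences with '-' as gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_cols = [(x, y) for x, y in zip(aligned_a, aligned_b)
                    if x != "-" and y != "-"]
    if not aligned_cols:
        return 0.0
    identical = sum(x == y for x, y in aligned_cols)
    return 100.0 * identical / len(aligned_cols)

if __name__ == "__main__":
    # Toy aligned fragment of a target and a candidate template (invented data).
    target   = "MKT-AYIAKQRQISFVKSHFSRQ"
    template = "MKTLAYLAKQRHISFVK-HFSKQ"
    pid = percent_identity(target, template)
    print(f"sequence identity: {pid:.1f}%")
    # Rule of thumb from the text: homology modeling is usually attempted
    # only above roughly 25-30% identity.
    print("homology modeling feasible?", pid >= 30.0)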

Selection of proper template(s) is a crucial step, since an error in "template" selection will result in an incorrect model. Therefore special care is needed to identify a "template" structure that has sufficient homology with the "target" sequence. In special instances the correct fold can be recognized even in the case of low sequence homology between the "template" and "target" sequences. The "target" sequence is then aligned to the "template" sequence and the alignment is adjusted to ensure an optimal match between the homologous regions. Once an optimal alignment is achieved, the backbone atoms of the "target" are modeled onto the three-dimensional "template" structure, and the non-matching loop regions and the non-conserved side-chain orientations are predicted. Next, optimization of the model in a proper force field removes steric clashes and improves structurally important interactions such as the hydrogen-bonding network between atoms. The final model is then evaluated to identify erroneous or missing regions of the model (e.g. non-conserved loops, which may need to be modeled independently of the conserved regions). The results of the evaluation can lead to iterative refinements of the model until a final stage is reached.

Template selection and initial alignment

The initial step of comparative homology modeling is the selection of proper template(s). Decades ago, sequence database searches were efficiently automated through the development of basic local alignment search tool (BLAST) methods [207]. The template selection methods include such searches against structural databases like the PDB - even in the case of fold recognition - prior to the further modeling steps. The final outcome of conventional homology modeling, as well as of protein modeling after fold recognition, heavily relies on the search output. As a first approach, the hit with the highest sequence identity may be chosen as template. Note that even X-ray protein structures are imperfect (due to partial degradation during crystallization, due to a low resolution electron density map, or simply due to human errors; see also Chapter "Methods of determining structure of proteins") [208]. If more than one structure is available, another obvious solution is to choose the structure containing fewer errors (e.g. by using PDBREPORT) as template. In addition, other viewpoints (a protein may have active and inactive structures, the presence of cofactors / ligands in the structure may be important, etc.) should also be considered during template selection. Nowadays, the available computational power enables the use of multiple templates and choosing the best result for further refinements. Combining multiple templates into one structure used as an average template for modeling is also feasible.

Cases when a "template" structure with more than 25% sequence identity to the "target" sequence is found represent the level above which a homology modeling exercise might be attempted. This is demonstrated by the modeling of basic fibroblast growth factor (bFGF) depicted in Fig. 26. A homology model of bFGF was constructed using the rat keratinocyte growth factor (PDB code 1QQK, chain B) with 41% sequence identity as a template. The protein backbones of the model structure (red ribbon) and of the later determined experimental structure (blue ribbon, PDB code 1BFC) are seen to be very similar, with the exception of two less well matching regions.

Sequence alignment correction

After having a selected template and an initial alignment, a number of control tools may be used for the selection and refinement of structural alignments between "model" and "template", including three-dimensional structure visualization. Nowadays, only a few tools tackle the problem of automatic refinement of sequence alignments, but promising approaches have been published [209]. The goodness of a particular alignment may be checked by adding novel sequences of sufficient similarity to either the "template" or the "target", or by other experimental structures superposing well onto the template. In the case of distantly related proteins it is also important to check the agreement of secondary structure predictions (for the query sequence) with the secondary structure assignment (for the template) [210]. These data on the structural alignment may be visualized using the JOY format [17]. However, at lower levels of sequence conservation, the structural alignment should also be evaluated more precisely at the three-dimensional level (Fig. 27). Manual editing of the alignment is the most tedious and critical part of comparative protein modeling. Misalignment of only one residue will result in an error of several ångströms in the final structure, because the current homology-modeling algorithms generally cannot recover from errors in the alignment [18].
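The secondary-structure consistency check mentioned above can be scripted in a few lines. The sketch below compares a predicted secondary-structure string for the target with the assigned string for the template over the aligned columns; the three-state H/E/C encoding and the example strings are illustrative and do not come from any particular predictor.

```python
# Minimal consistency check between predicted (target) and assigned (template)
# secondary-structure strings over the columns of an alignment.
def ss_agreement(target_ss: str, template_ss: str) -> float:
    """Fraction of aligned, non-gap columns with identical secondary-structure state."""
    pairs = [(a, b) for a, b in zip(target_ss, template_ss) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# 'H' = helix, 'E' = strand, 'C' = coil; '-' marks gap columns of the alignment.
predicted = "CCHHHHHHCC-EEEECC"
assigned  = "CCHHHHHCCCCEEEEC-"
print(f"secondary-structure agreement: {ss_agreement(predicted, assigned):.2f}")
```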

Backbone model generation

After finishing the sequence alignment, backbone model generation should be performed. Construction of the backbone for the model is quite trivial and usually means simply copying the coordinates of the aligned template's amino acid backbone atoms into the model. If, at a certain position of the alignment between the template and model, the amino acids are different, only the backbone N, Cα, C and O atomic coordinates (and in some cases Cβ as well) can be copied. If the amino acids are identical at a position of the structural alignment, in many instances even the atomic coordinates of the side chain can be copied into the model.

Loop model generation

The alignment of the "target" model and "template" structures may contain insertions and deletions. In the case of deletions, the excess parts of the template are simply omitted and the ends of the resulting gap are joined. In the case of insertions, the continuous backbone of the template is cleaved and the extra loop is inserted. It is obvious that the backbone conformation changes in both cases. When insertions or deletions appear in the template/target alignment, the accuracy of modeling the missing parts varies significantly between various parts of the protein. Within well-defined secondary structure elements (α-helices and β-strands), where a rigid backbone approximation is usually acceptable, modeling is more accurate, while less accuracy is expected within the less structured loops, which tend to be more mobile. Many homology modeling methods can generate the loops with acceptable covalent geometry, typically by loop-database search. However, finding a near-native loop conformation has proven difficult, and loops without a proper template are consistently the most inaccurate parts of homology models [19].

Side-chain modeling

The difficulty of side-chain prediction, among other factors, strongly depends on the degree of sequence similarity between the target and template and also on the quality of the template structure. It was found that the side-chain torsion angles around the Cα-Cβ bond are often similar in similar proteins. Moreover, homologous (>40% sequence identity) proteins often (ca. 75% of the cases) have these side-chain torsions in a similar orientation.

Consequently, in the case of high sequence identity (>40%) the conserved amino acids can often be fully copied from the template into the model structure. In many instances, this approach is more accurate than methods that copy only the template backbone atoms and model the side-chain atoms by ab initio prediction. However, if the sequence identity is low (<35%), the side chains of the models and the templates differ in 45% of the cases. In such cases, modeling of the side-chain orientation is required. Most of the available tools for side-chain prediction rely on knowledge-based libraries. In several instances "fixed" libraries are applied, in which all possible orientations of a given side chain are stored. Other methods apply "position-specific" libraries, in which the side-chain orientation is chosen according to the structure / conformation of the backbone. In the simple versions, side-chain orientations are classified according to the secondary structure (helix or sheet) of the backbone, whereas the more sophisticated methods select conformations according to similar short backbone stretches (5-9 residues) of different curvatures found in related high-resolution structures. Side-chain conformation prediction is more accurate for buried hydrophobic parts than for surface side chains. This is due to the fact that side chains in flexible loops - which are present mostly at the surface - can adopt multiple conformations.

Model optimization

After modeling the backbone with insertions and deletions compared to the template and predicting the side-chain conformations, further steps are necessary to normalize the geometry of the modeled structure, especially in the vicinity of insertions and deletions (see Section 3.2.1). By using proper force fields, molecular mechanics-based energy minimization may eliminate severe van der Waals clashes and improve bond length and valence angle values as well. However, it will not bring atoms closer to their actual positions. Energy minimizations easily get stuck in local minima due to the irregularity of the energy landscape. Thus, energy-minimized structures often show a slightly increased global structural deviation compared to the un-minimized models or the starting templates. In addition to energy minimization, trajectory simulation (molecular dynamics, MD) may also be performed using similar force fields. MD methods are useful to explore the conformational space. Various snapshots along the MD trajectory may result in further models as good as the starting ones [20]. MD analysis may also be useful to indicate the precision (or error) of the models.

Model validation

Validation of three-dimensional structures may require different levels of accuracy. At a high level of sequence identity (>50%), only small deviations from the real coordinates are expected, and thus tools designed to evaluate experimental structures may be used (e.g. WHAT-CHECK [208]). At a lower level of sequence identity (25-50%), the overall quality of the model may not correlate with the deviation from standard stereochemistry (especially after energy minimization; see Section 3.2.1). Non-bonding interatomic interactions may be evaluated more suitably by using atomic statistical potentials such as ERRAT [21] or ANOLEA. Further useful tools to assess modeling results are ProSA [22] and Verify3D [23]. Below 25% sequence identity, evaluation of the model should rather be performed at the residue level. Precise and local analysis may be required in particular cases.
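One of the most basic checks behind the optimization and validation steps described above is the detection of steric clashes. The sketch below flags atom pairs that come unrealistically close; the coordinates and the single 2.0 Å cutoff are illustrative assumptions, whereas real validation tools use per-element van der Waals radii and exclude covalently bonded neighbours.

```python
# Illustrative clash detector: report atom pairs closer than a fixed cutoff.
import numpy as np

def steric_clashes(coords: np.ndarray, cutoff: float = 2.0):
    """Return index pairs (i, j), i < j, whose distance is below `cutoff` angstroms."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(coords), k=1)
    close = dist[i, j] < cutoff
    return [(int(a), int(b)) for a, b in zip(i[close], j[close])]

coords = np.array([[0.0, 0.0, 0.0],
                   [1.2, 0.0, 0.0],    # clashes with atom 0
                   [5.0, 1.0, 0.0]])
print(steric_clashes(coords))          # -> [(0, 1)]
```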
Simultaneous visualization of the score and the three-dimensional structure may be useful. Specific features may be observed in the active sites (or binding sites) or for residues contacting ions (especially those involved in metal coordination) and/or deeply buried ligands (especially co-factors), because in such cases the residues have a non-classical environment. Similarly, particularities may be observed in thermostable proteins, which may be stabilized by buried salt bridges. When such particular features are observed, evaluation of the model quality may be extended to the assessment of the template structure as well.

Tools for homology modeling

Nowadays many possibilities exist to perform comparative modeling tasks. Homology modeling tools are available as standalone programs (both commercial and freely available) as well as automated, Web-based services that make these technologies accessible to an audience of non-experts in bioinformatics.

Web-based homology modeling tools

Almost two decades ago, the first automated modeling server - SWISS-MODEL - was made available on the Internet [24]. The major aim of automating the key steps of homology modeling - template selection, target-template alignment, model building and model quality evaluation (Fig. 25) - is the need to make these technologies publicly available to a wider audience. Since then, numerous further services have been developed to provide tools for automated homology modeling of proteins [25]. The next part of this section provides a selected list of the available on-line comparative modeling tools.

SWISS-MODEL: Fully automated homology-modeling server (accessible from the ExPASy web site or from the program DeepView - Swiss-PdbViewer).
ModWeb: A server for protein structure modeling (based on the MODELLER program; a license key is required).
Robetta: Web server using the Rosetta homology modeling software (ab initio fragment assembly with Ginzu domain prediction).
HHpred: Assessed as one of the best servers in template-based structure prediction (ranked as the No. 1 server in CASP9).
I-TASSER: Web server for protein structure and function predictions. Models are built based on multiple-threading alignments by LOMETS and iterative TASSER simulations (ranked as the No. 1 server in CASP8 and CASP10).
Phyre: Protein Homology/analogY Recognition Engine.
M4T: A comparative modeling server using a combination of multiple templates and iterative optimization of alternative alignments.
3D-JIGSAW: A server that builds 3D models for proteins based on homologues of known structure by using fragment assembly.
RaptorX structure prediction: Web service for prediction of secondary structure, solvent accessibility, disordered regions and tertiary structure for a sequence (especially designed for predicting 3D structures of protein sequences without close homologs; also available as the standalone RaptorX package).
QUARK: On-line service especially designed for modeling structures with no suitable template (uses ab initio protein folding and structure prediction; ranked as the No. 1 server in free modeling (FM) in CASP9 and CASP10).
GeneSilico Metaserver: Gateway to various methods for protein structure prediction, providing predictions for primary structure, secondary structure, transmembrane helices, disordered regions, disulfide bonds, nucleic acid binding residues in proteins, and tertiary structure.

Protein model databases

This section provides a selected list of the publicly available databases collecting models generated by comparative protein modeling methods.

SWISS-MODEL Repository: Database of annotated protein structure models generated by the fully automated comparative-modeling service SWISS-MODEL.
ModBase: Database of annotated comparative protein structure models calculated by the modeling pipeline ModPipe (using the programs PSI-BLAST and MODELLER). Information on fold assignments, putative ligand binding sites, SNP annotation and protein-protein interactions is included.
Protein Model Portal (PMP): Gives access to various models computed by comparative modeling methods provided by different partner sites, and provides access to various interactive services for model building and quality assessment.

Homology modeling software

MODELLER: Software to generate homology models of proteins by satisfaction of spatial restraints. Free for academic use. A commercial version with a graphical user interface is available from Accelrys.
ProModel: Software for homology modeling from either a selected template or a user-defined template. Modeling in manual mode (mutation, excision, deletion, insertion of residues or insertion of a loop) or in automated mode. Analysis of the target protein structure, active site and channels is enabled. Available from VLife.
Prime: Fully integrated protein structure prediction program, providing a graphical interface, sequence alignment, secondary structure prediction, homology modeling, protein refinement, loop prediction and side-chain prediction. Developed by Schrödinger.
DeepView - Swiss-PdbViewer: A standalone program suite which is integrated with the SWISS-MODEL automated homology-modeling server at the ExPASy web site.
TASSER-Lite: Protein structure comparative modeling tool, limited to protein target-template pairs with sequence identity >25%. Optimized to model single-domain proteins within a limited length range. Free for not-for-profit use.
Rosetta@home: Standalone implementation of the Rosetta algorithm (ab initio fragment assembly with Ginzu domain prediction). Only for non-commercial use.
Rosetta CM: Rosetta is the premier software suite for modeling macromolecular structures. As a flexible, multi-purpose application, it includes tools for structure prediction, design, and remodeling of proteins and nucleic acids. Free for non-commercial use.
Molide: An open-source, cross-platform graphical environment for homology modeling. It implements the most frequently used steps involved in modeling. Free for non-commercial use.

Molecular docking

Once a three-dimensional structure of a protein at atomic-level resolution is available, various features of it, such as its shape, surface properties and the presence of cavities, can be investigated.

Besides investigating the properties of the given protein itself, information on the interaction of a target protein with other molecules, such as various small-sized ligands or other biological macromolecules (proteins or nucleic acids), is of high interest. Among the molecular modeling tools, molecular docking is a method which predicts the preferred orientation of a molecule (usually a ligand or even a biological macromolecule) with respect to a second one (usually a biological macromolecule) when bound to each other to form a stable complex. Knowledge of the preferred orientation may be used to predict the strength of association or binding affinity between the two molecules. Such data are applicable, for example, in function prediction, enzyme mechanistic studies, in-silico drug design or systems biology. Two classes of docking methods can be considered [26]: i) the faster but empirical evaluations, or ii) the more expensive free energy calculations. The first approach uses a matching technique that describes the target protein and the docking molecule as complementary surfaces. The second approach simulates the actual docking process, in which the pair-wise interaction energies between the target protein and the docking molecule are calculated. The success of a docking program depends on two components: the search algorithm and the scoring function [26]. A variety of search strategies may be applied to the ligand and target molecule, such as i) systematic or stochastic torsional searches about rotatable bonds; ii) molecular dynamics simulations; or iii) genetic algorithms to "evolve" new low-energy conformations. According to the nature of the docking molecule, docking methods can be classified as i) protein / small ligand; ii) protein / peptide; iii) protein / protein; or iv) protein / nucleotide docking.
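To make the "scoring function" component tangible, the deliberately simplified sketch below sums a pairwise 12-6 (Lennard-Jones-like) term over ligand-protein atom pairs; a search algorithm would then try to minimize such a score over candidate poses. The parameters and coordinates are toy assumptions and do not reproduce the calibrated scoring terms of real programs such as AutoDock or GOLD.

```python
# Toy pairwise scoring function for illustration only.
import numpy as np

def lj_score(ligand_xyz: np.ndarray, protein_xyz: np.ndarray,
             epsilon: float = 0.2, r_min: float = 3.5) -> float:
    """Lower (more negative) scores indicate more favourable, clash-free contacts."""
    d = np.linalg.norm(ligand_xyz[:, None, :] - protein_xyz[None, :, :], axis=-1)
    d = np.clip(d, 0.8, None)                # avoid numerical blow-up at tiny distances
    ratio = r_min / d
    return float(np.sum(epsilon * (ratio ** 12 - 2.0 * ratio ** 6)))

ligand  = np.array([[0.0, 0.0, 0.0]])
protein = np.array([[3.5, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 5.0]])
print(f"toy interaction score: {lj_score(ligand, protein):.3f}")
```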

Protein-ligand interaction predictions

Molecular recognition plays a key role in promoting fundamental biomolecular events such as enzyme-substrate, drug-protein and drug-nucleic acid interactions. Protein-ligand docking is a suitable molecular modeling tool to study such interactions [27]. Figure 28 indicates that docking methods may be successfully applied even if no experimental protein structure is available. The docking methods may differ according to the flexibility of the ligand and the target protein [26]-[28]. Most docking methods allow flexibility of the ligand and take several of its conformations into account. In contrast, the majority of the currently used docking methods treat the protein target as fixed in one given conformation. This approach is generally applied due to speed and simplicity considerations, avoiding the massive computational cost required to accurately handle the flexibility of the binding site. However, there are successful efforts to incorporate protein flexibility, which may help overcome some inaccuracies (e.g. enhance docking into receptor models). Several docking tools with multiple capabilities, ranging from simple docking of small ligands into rigid proteins to flexible ligand / flexible binding site docking and even protein-protein interactions, are available, such as AutoDock, DOCK, GOLD, FlexX, VLifeDock and ArgusLab.

Protein-biomacromolecule interaction predictions

Interaction of proteins with other biomacromolecules can also be approached by docking methods. Although protein-protein [29] as well as protein-nucleic acid [30] docking are feasible, the most successful approaches use additional experimental data [31] from NMR or electron microscopy (see Chapter "Methods of determining structure of proteins"). Current biomacromolecular docking methods evaluate a vast number of docked conformations by simple functions that measure surface complementarity. However, in addition to near-native states, these methods produce many false positives, i.e., structures with good surface complementarity but high root mean square deviations (RMSD). Substantial efforts have been devoted to the development of methods to eliminate the false positives. Although these procedures improve the discrimination such that conformations with low RMSD are generally found within the top ten to hundred structures, for most complexes the highest ranked structures are still far from the native one [32]. In addition to the docking tools designed to address mostly small molecule-protein interactions (Section 3.3.1), tools like HADDOCK, ClusPro, RosettaDock, ZDOCK, GRAMM-X or Hex are available for docking biomacromolecules (mostly proteins) to target proteins.

4. References

[195] Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucl Acids Res. 35(suppl 1): D301-D303.
[196] Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A (2009) Human Protein Reference Database update. Nucleic Acids Res. 37(Database issue): D767-D772.
[197] (a) Kopp J, Schwede T (2004) Automated protein structure homology modeling: a progress report. Pharmacogenomics 5(4); (b) Jaroszewski L (2009) Protein structure prediction based on sequence similarity. Meth Mol Biol. 569.
[198] Orry AJ, Abagyan R (Eds.) (2012) Homology Modeling: Methods and Protocols (Meth Mol Biol. 857), Humana Press, Totowa.
[199] Browne WJ, North AC, Phillips DC, Brew K, Vanaman TC, Hill RL (1969) A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol. 42.
[200] (a) Magrane M, UniProt Consortium (2011) UniProt Knowledgebase: a hub of integrated protein data. Database. bar009; (b) UniProt Consortium (2013) Update on activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 41(Database issue): D43-D47.

[201] Maglott D, Ostell J, Pruitt KD, Tatusova T (2011) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 39(Database issue): D52-D57.
[202] Schwede T (2013) Protein Modeling: What Happened to the "Protein Structure Gap"? Structure 21.
[203] Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294(5540).
[204] (a) Baker D (2000) A surprising simplicity to protein folding. Nature 405: 39-42; (b) Bonneau R, Baker D (2001) Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct. 30.
[205] Marti-Renom MA, Stuart A, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 29.
[206] Fiser A, Sanchez R, Melo F, Sali A (2001) Comparative protein structure modeling. In: Watanabe M, Roux B, MacKerell AD Jr, Becker O, eds. Computational Biochemistry and Biophysics. New York: Marcel Dekker.
[207] (a) Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol. 215; (b) Altschul SF, Madden TL, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25.
[208] Hooft RWW, Vriend G, Sander C, Abola EE (1996) Errors in protein structures. Nature 381.
[209] (a) Deane CM, Blundell TL (2001) Improved protein loop prediction from sequence alone. Protein Eng 14; (b) Deane CM, Kaas Q, Blundell TL (2001) SCORE: predicting the core of protein models. Bioinformatics 17; (c) Pei J, Sadreyev R, Grishin NV (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19.
[210] Errami M, Geourjon C, Deleage G (2003) Detection of unrelated proteins in sequences multiple alignments by using predicted secondary structures. Bioinformatics 19.
[17] Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics. 14.
[18] Fiser A, Sali A (2003) Comparative protein structure modeling. In: Chasman D, ed. Protein Structure - Determination, Analysis, and Applications for Drug Discovery. New York: Marcel Dekker.
[19] Moult J, James MN (1986) An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins 1.
[20] Flohil JA, Vriend G, Berendsen HJC (2002) Completion and refinement of 3-D homology models with restricted molecular dynamics: Application to targets 47, 58, and 111 in the CASP modeling competition and posterior analysis. Proteins 48.
[21] Colovos C, Yeates TO (1993) Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci. 2(9).
[22] (a) Sippl MJ (1993) Recognition of Errors in Three-Dimensional Structures of Proteins. Proteins 17; (b) Wiederstein M, Sippl MJ (2007) ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Research 35: W407-W410.
[23] Eisenberg D, Luthy R, Bowie JU (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Meth Enzymol. 277.
[24] Guex N, Peitsch MC, Schwede T (2009) Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: a historical perspective. Electrophoresis 30(Suppl 1): S162-S173.
[25] (a) Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T (2007) Automated server predictions in CASP7. Proteins 69: 68-82; (b) Brazas MD, Yamada JT, Ouellette BF (2010) Providing web servers and training in Bioinformatics: 2010 update on the Bioinformatics Links Directory. Nucleic Acids Res. 38(Suppl): W3-W6.

[26] Halperin I, Ma BY, Wolfson H, Nussinov R (2002) Principles of docking: An overview of search algorithms and a guide to scoring functions. Prot Struct Func Genetics 47.
[27] (a) Mohan V, Gibbs AC, Cummings MD, Jaeger EP, DesJarlais RL (2005) Docking: Successes and Challenges. Current Pharmaceutical Design 11; (b) Huang SY, Zou X (2010) Advances and challenges in protein-ligand docking. Int J Mol Sci. 11; (c) Yuriev E, Agostino M, Ramsland PA (2011) Challenges and advances in computational docking: 2009 in review. J Mol Recogn. 24.
[28] Forster MJ (2002) Molecular modelling in structural biology. Micron 33.
[29] (a) Pons C, Grosdidier S, Solernou A, Perez-Cano L, Fernandez-Recio J (2010) Present and future challenges and limitations in protein-protein docking. Proteins 78; (b) Li B, Kihara D (2012) Protein docking prediction using predicted protein-protein interface. BMC Bioinform 13: 7.
[30] Roberts VA, Pique ME, Ten Eyck LF, Li S (2013) Predicting protein-DNA interactions by full search computational docking. Prot Struct Funct Bioinf.
[31] Melquiond ASJ, Bonvin AMJJ (2010) Data-driven docking: using external information to spark the biomolecular rendez-vous. In: Protein-protein complexes: analysis, modelling and drug design. Ed.: Zacharias M, Imperial College Press, London.
[32] Zacharias M (2010) Accounting for conformational changes during protein-protein docking. Curr Opin Struct Biol 20(2).

Methods of determining structure of proteins and protein structure databases

Introduction

The most important aim of bioinformatics is to assign structure and/or function to novel sequences of unknown structure and/or function by searching for similar sequences among proteins with known structure and/or function. To achieve this goal, efficient and reliable methods are required to provide structural data for proteins. This chapter focuses on the existing experimental methods available for secondary structure characterization and three-dimensional structure determination at the atomic level. To support the various bioinformatics processes, protein sequences can be identified and analyzed at different levels.

Protein identification tools

Identification of proteins is an important issue in proteomics research. The methods available to identify proteins range from low-resolution techniques (e.g. identification by isoelectric point, molecular weight and/or amino acid composition), through more precise identification and characterization with peptide mass fingerprinting data, to high-resolution techniques such as tandem mass spectrometry experiments. A number of web-based protein identification services are available on the ExPASy proteomics server to perform low-resolution protein identification, such as AACompIdent (identification of a protein by its amino acid composition), AACompSim (comparing the amino acid composition of a UniProtKB/Swiss-Prot entry with all other entries), TagIdent or MultiIdent (protein identification by isoelectric point, pI; molecular weight, MW; sequence tag; or mass fingerprinting data, by listing proteins close to a given pI and MW). Numerous services are available for peptide identification by mass fingerprinting (analysis and identification of peptides that result from unspecific cleavage of proteins by their experimental masses) such as Mascot, PepMAPPER, FindMod, ProFound, FindPept or ProteinProspector.

These services are usually able to take into account or predict potential protein post-translational modifications and potential single amino acid substitutions or protease autolytic cleavage in peptides. Experimentally measured peptide masses are compared with the theoretical peptides calculated from a specified database entry or from a user-entered sequence, and mass differences are used to better characterize the protein of interest. More sophisticated protein identifications/analyses are enabled by the use of tandem mass spectrometry (MS/MS) experiments. A number of web-based protein and peptide identification/characterization services using MS/MS data are available on the ExPASy proteomics server, such as QuickMod, Phenyx, Mascot, OMSSA, PepFrag and ProteinProspector. These services can usually perform MS/MS peptide spectrum identification by searching libraries of mass spectra related to known protein sequences.

Simple protein analyses

In addition to protein identification, tools are available for the statistical analysis of protein sequences (e.g. amino acid and atomic compositions), for simple prediction of the physico-chemical parameters of a protein from its sequence (pI, hydrophobicity, extinction coefficient, etc.), or to detect repetitive protein sequences and to predict domains and regions (e.g. zinc-finger or peptide binding regions). Several web-based services are available on the ExPASy proteomics server to perform such simple protein analyses, such as ProtParam (calculates physico-chemical parameters of a protein sequence: amino acid and atomic compositions, pI, extinction coefficient, etc.); Compute pI/MW (computes the theoretical pI and MW from a SWISS-PROT or TrEMBL entry or for a user sequence); or ProtScale (amino acid scale representation: hydrophobicity and other conformational parameters, etc.).

Levels and problems of protein structure predictions

The general goal of protein structure prediction is to determine the conformation of a protein (sequence) that belongs to the global minimum of free enthalpy. Using small models, it could be proven that the problem is NP-hard. Because the time required for an exact solution increases faster than polynomially with protein size, above a certain size the problem cannot be solved exactly. In the case of real proteins, however, the problem can be handled, because the sequences of real proteins are rather specific (selected by evolution) and therefore the already known structures can be applied as a knowledge base for predictions. The levels of structure prediction for proteins may vary from 1D predictions via 2D structural data to 3D structures at the atomic level. In the case of one-dimensional predictions, the features can be assigned to individual amino acids and the result can be described as a 1D string. Such cases are predictions of secondary structure, solvent accessibility, hydrophobicity, transmembrane helices or disordered regions. A large number of web-based services for various one-dimensional predictions are available on the ExPASy proteomics server to predict protein secondary structure (APSSP, CFSSP, GOR, Porter, SOPMA), protein surface accessibility (NetSurfP), β-turns (NetTurnP) or helical transmembrane regions (HTMSRAP) in protein sequences. There are servers providing multiple kinds of predictions as well as consensus predictions (Jpred, PredictProtein, PSIpred, Scratch Protein Predictor) (Fig. 29).
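As a flavour of the composition-based analyses mentioned above (ProtParam, AACompIdent and similar tools), the sketch below counts the amino-acid composition and computes a crude average hydropathy for a made-up sequence. It is an illustrative calculation, not the actual ExPASy implementation; the Kyte-Doolittle values are transcribed from the published scale and should be checked against the original table before serious use.

```python
# Illustrative composition and average-hydropathy calculation.
from collections import Counter

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def composition(seq: str) -> dict:
    """Fraction of each amino acid in the sequence."""
    counts = Counter(seq.upper())
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}

def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy (GRAVY-like index) over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq.upper()) / len(seq)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # arbitrary example sequence
print(composition(seq))
print(f"mean hydropathy: {mean_hydropathy(seq):.2f}")
```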
Protein 2D predictions require the prediction of distances and contacts between pairs of amino acids. If all side-chain interactions are taken into account, the 3D structure can be built (see the protein NMR methods later).
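The object such 2D methods aim to predict is a residue-residue contact map. The sketch below derives one from known coordinates (one representative point per residue), which is the ground truth that sequence-based predictors try to reproduce; the coordinates and the 8 Å cutoff are illustrative choices, not a fixed standard.

```python
# Illustrative contact map from per-residue coordinates.
import numpy as np

def contact_map(residue_xyz: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Boolean matrix: True where two residues are within `cutoff` angstroms."""
    d = np.linalg.norm(residue_xyz[:, None, :] - residue_xyz[None, :, :], axis=-1)
    contacts = d < cutoff
    np.fill_diagonal(contacts, False)       # a residue is not its own contact
    return contacts

residues = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [20.0, 5.0, 0.0]])
print(contact_map(residues).astype(int))
```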

To estimate side-chain interactions, the following data can be taken into account: correlated mutations of amino acids that are far from each other in the sequence; statistical data; and mean-force potentials. Approaches to protein 2D prediction often use neural networks. However, efforts towards efficient 2D prediction have not been entirely successful so far.

Experimental methods to determine the secondary structure of proteins

Circular dichroism (CD) is a spectroscopic technique widely used for the evaluation of the conformation and stability of proteins under several environmental conditions, such as temperature, ionic strength, and the presence of solutes or small molecules [1, 2]. CD spectroscopy is nondestructive, relatively easy to operate, fast, and requires only a small amount of sample and little data collection. Synchrotron radiation circular dichroism (SRCD) spectroscopy (in which the high flux of a synchrotron enables the collection of data at lower wavelengths) extends the utility and applications of conventional CD spectroscopy (using laboratory-based instruments) [3].

Protein circular dichroism (CD)

CD spectroscopy is based on the differential absorption of left and right circularly polarized radiation by chromophores which either possess intrinsic chirality or are placed in chiral environments. A number of chromophores are present in proteins, resulting in CD signals. In the far-UV region, corresponding to the absorption of peptide bonds, the CD spectrum can provide information on regular secondary structural features such as α-helix and β-sheet (Fig. 30).

1 Kelly SM, Price NC (2000) The Use of Circular Dichroism in the Investigation of Protein Structure and Function. Curr Prot Peptide Sci. 1.
2 Kelly SM, Jess TJ, Price NC (2005) How to study proteins by circular dichroism. Biochim Biophys Acta Prot Proteom. 1751.
3 (a) Miles AJ, Wallace BA (2006) Synchrotron radiation circular dichroism spectroscopy of proteins and applications in structural and functional genomics. Chem Soc Rev. 35: 39-51; (b) Wallace BA, Janes RW (2010) Synchrotron radiation circular dichroism (SRCD) spectroscopy: an enhanced method for examining protein conformations and protein interactions. Biochem Soc Trans. 38(4).

The near-UV region of the CD spectrum reflects the environments of the aromatic amino acid side chains and thus provides information about the tertiary structure of the protein. CD signals can also arise from other, non-protein chromophores such as flavin and haem moieties, and thus the full spectrum depends on the precise environment of all the chromophores concerned. Because of its relative ease of use, CD has also been used to provide information about protein structure, the extent and rate of structural changes, and ligand binding. CD methods can be used to study the structural stability and folding phenomena of proteins or designed protein fragments. CD has proved to be an extremely useful technique to evaluate the structural integrity of membrane proteins. It is therefore evident that CD is a versatile technique in structural biology, with an increasingly wide range of applications.

Synchrotron radiation circular dichroism (SRCD)

In addition to CD spectroscopy (using laboratory-based instruments) as a well-established technique in structural biology, synchrotron radiation circular dichroism (SRCD) spectroscopy extends the utility and applications of conventional CD spectroscopy, because the high flux of a synchrotron enables the collection of data at lower wavelengths (resulting in higher information content), the detection of spectra with higher signal-to-noise levels, and measurements in the presence of absorbing components (buffers, salts, lipids and detergents). Thus SRCD spectroscopy can provide important static and dynamic structural information on proteins in solution, including the study of protein interactions, such as protein-protein complex formation by either induced-fit or rigid-body mechanisms, or protein-lipid complexes. As a publicly available web-based bioinformatics resource, the Protein Circular Dichroism Data Bank (PCDDB) has been created, which enables archiving, access and analysis of CD and SRCD spectra and supporting metadata [4].

4 Whitmore L, Woollett B, Miles AJ, Janes RW, Wallace BA (2010) The Protein Circular Dichroism Data Bank, a Web-based site for access to circular dichroism spectroscopic data. Structure 18(10).

Experimental methods for determining atomic structures of proteins

Several methods are currently used to determine the structure of a protein at the atomic level, including X-ray crystallography, neutron diffraction, electron microscopy and electron diffraction methods (providing protein structures in the crystalline state) and NMR spectroscopy (providing structures in solution and also in the solid state). It should be kept in mind that each method has advantages and disadvantages. In all cases, scientists put many pieces of information together to create the final model at atomic precision. As a start, scientists gather some kind of experimental data about the structure of the molecule. In the case of NMR spectroscopy, distances between atoms that are close to one another provide information on the local conformation. The starting data in X-ray crystallography is the X-ray diffraction pattern. In electron microscopy, the starting point is an image of the overall shape of the molecule. In almost all cases, this initial set of experimental information alone is not sufficient to obtain a structure at atomic precision. To finalize the structure determination, further information on the molecule should be used. The already known amino acid sequence of a protein often serves as additional information, as does the typical geometry of atoms in proteins (e.g. bond lengths and bond angles). With such additional information in hand, scientists can create models which are consistent with both the set of experimental data and the known sequence and expected geometry of the protein. Consequently, experimental macromolecular structures are all models composed of experimental data and computational prediction in various ratios. Typically, in very high-resolution crystal structures the atomic coordinates of heavy atoms are overdetermined by the diffraction data [5], whereas methods with fewer experimental observables increasingly rely on computational tools to construct structural models for the spatial interpretation of the data (e.g., nuclear magnetic resonance [NMR], electron microscopy [EM], small-angle X-ray scattering [SAXS], fluorescence resonance energy transfer [FRET]) [6]. It is not surprising that even a relatively good quality experimental X-ray structure contained errors to be corrected (Fig. 31). When drawing conclusions based on experimental structures, therefore, it is always good to be a bit critical. Keep in mind that the structures in the worldwide PDB database [7] are determined using a mixture of experimental data and knowledge-based modeling. It is always advisable to confirm that the experimental evidence for a particular structure supports the model as given and that the scientific conclusions are based on a proper model.
(He JJ, Quiocho FA (1993) Dominant role of local dipoles in stabilizing uncompensated charges on a sulfate sequestered in a periplasmic active transport protein. Protein Sci. 2.)
The existence of an experimental structure enables us to perform analysis of the structure. Based on atomic-level structures, analyses of the protein structure quality, charges, surfaces, cavities or secondary structure become feasible.

5 Read RJ, Adams PD, Arendall WB, Brunger AT, Emsley P, Joosten RP, Kleywegt GJ, Krissinel EB, Luetteke T, Otwinowski Z, Perrakis A, Richardson JS, Sheffler WH, Smith JL, Tickle IJ, Vriend G, Zwart PH (2011) A new generation of crystallographic validation tools for the protein data bank. Structure 19.
6 Schwede T (2013) Protein Modeling: What Happened to the "Protein Structure Gap"? Structure 21.
7 Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucl Acids Res. 35(suppl 1): D301-D303.

Moreover, identification of structural motifs or of interactions with ligands or other biomolecules can also be performed.

Protein X-ray crystallography

The majority of the structures deposited in the PDB database were determined with the aid of X-ray crystallography [8], via the steps depicted in Fig. 32. For structure determination by X-ray crystallographic methods, proteins must first be produced, purified and crystallized. Once a proper crystal has formed, it is subjected to intense beams of X-rays in multiple orientations, and the diffraction patterns are captured with electronic detectors. Because crystals contain three-dimensionally periodic molecules, the diffraction pattern comprises a series of spots rather than a continuous function. The spots are then analyzed to determine the distribution of electrons in the protein. An image of the atomic contents of the unit cell of the crystal is derived by applying a mathematical lens (i.e. an inverse Fourier transform) to the diffracted X-rays. The image reconstruction process is complicated because only the intensities of the diffracted X-rays are measurable, but not the relative phase shifts between the families of diffracted waves. This missing information represents the crystallographic phase problem. The missing phases can be obtained using various experimental / computational methods such as isomorphous replacement, heavy-atom anomalous scattering or the application of partially known structures. Because X-ray diffraction is caused by the interaction of electrons with the X-rays in a crystallographic experiment, the resulting image is the electron density distribution in the unit cell of the crystal. Interactive and iterative computation is then used to determine the positions of the atoms which fit best into the experimental electron density map, resulting in the final atomic model. For the crystal structures determined in this way, the PDB archive contains two types of data. The PDB files of X-ray structures include the atomic coordinates of the final model, and the data files also include the structure factors (the intensity and phase of the X-ray spots in the diffraction pattern) from the structure determination. With the aid of these data, an image of the electron density map can be created using tools like the Astex viewer. Crystals of biological molecules may be quite different: some form perfect, well-ordered crystals whereas others form only poor crystals. Consequently, the accuracy of the atomic structure which can be determined depends on the quality of the crystals. The accuracy of a crystallographic structure can be characterized by two important measures: its resolution (determining the amount of detail that may be visualized from the experimental data), and the R-value (measuring how well the atomic model is supported by the experimental data found in the structure factor file). Fig. 33 illustrates the importance of the resolution.

8 Lattman EE, Loll PJ (2008) Protein Crystallography: A Concise Guide. The Johns Hopkins University Press, Baltimore, Maryland, 152 pp.
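The "mathematical lens" and the phase problem described above can be illustrated numerically. The toy 1D sketch below (not a crystallographic program) builds a synthetic density, Fourier transforms it into complex structure factors, and shows that inverting the full complex data recovers the density while inverting the measured amplitudes combined with random phases does not; the grid size and Gaussian "atoms" are arbitrary assumptions.

```python
# Toy 1D illustration of Fourier synthesis and the phase problem.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 256, endpoint=False)

# toy electron density: three Gaussian "atoms" in a 1D unit cell
density = sum(np.exp(-((x - c) ** 2) / (2 * 0.01 ** 2)) for c in (0.2, 0.5, 0.7))

structure_factors = np.fft.fft(density)          # complex: amplitudes AND phases
recovered = np.fft.ifft(structure_factors).real  # essentially identical to `density`

# keep the measured amplitudes but replace the (unmeasurable) phases with noise
random_phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, structure_factors.shape))
phaseless = np.fft.ifft(np.abs(structure_factors) * random_phases).real

print("error with true phases:  ", np.max(np.abs(recovered - density)))
print("error with random phases:", np.max(np.abs(phaseless - density)))
```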

It is visible that a high-resolution (1.0 Å) structure can provide accurate atomic positions, whereas a resolution lower than 3 Å can show only the basic contour of the protein and the individual atomic positions are inaccurate. X-ray crystallography can provide structures with very detailed atomic information, showing every heavy atom in a protein or nucleic acid along with details of the presence and arrangement of ligands, inhibitors, ions, and other molecules incorporated into the crystal. However, the crystallization process is difficult and can limit the types of proteins that may be studied by this method. For example, the structures of rigid proteins forming nice, well-ordered crystals can be ideally determined by X-ray crystallography, whereas it is far more difficult to study flexible proteins by this method, because crystallography relies on having a large number of molecules aligned in exactly the same orientation. Flexible portions of a protein will often be invisible by X-ray crystallography, since their electron density is smeared over a large space. This can result in structures with apparently missing coordinates.

Protein NMR spectroscopy

Nuclear magnetic resonance (NMR) spectroscopy is the technique which provides information on proteins in solution [9], in contrast to those methods which require protein crystals or proteins bound to a microscope grid. Consequently, NMR spectroscopy is the method of choice for studying the atomic structures of flexible proteins. NMR spectroscopy may be used to determine the structure of proteins as shown in Fig. 34. For NMR structural studies, the protein in question is required in purified form as a solution. Because, of the most abundant natural isotopes, only the ¹H nuclei (but not ¹²C or ¹⁴N) give readily usable NMR signals, structural investigations of larger polypeptides or proteins require ²H-, ¹³C- or ¹⁵N-isotope-labeled protein samples. Efficient molecular biological techniques for the incorporation of the stable, NMR-active ¹³C and ¹⁵N isotopes into overexpressed proteins have resulted in dramatic advances in the design and implementation of multidimensional heteronuclear NMR spectroscopic techniques [10].

9 Markwick PR, Malliavin T, Nilges M (2008) Structural biology by NMR: structure, dynamics, and interactions. PLoS Comp Biol. 4.

Consequently, the maximum protein size amenable to complete structural investigation has increased from about 10 kDa using ¹H homonuclear NMR spectroscopy to about 30 kDa using ¹³C and ¹⁵N heteronuclear NMR spectroscopy, and to even larger proteins using ¹³C and ¹⁵N heteronuclear NMR spectroscopy combined with fractional ²H enrichment. The technique is currently limited to proteins of such sizes, since large proteins present problems with overlapping peaks in the NMR spectra. During an NMR experiment, the protein solution sample is placed in a strong magnetic field and then probed with radio waves. Manual or automated analysis of the NMR spectra, resulting in a detailed assignment of resonances to atomic nuclei, followed by further special NMR experiments (e.g. ones using the nuclear Overhauser effect, NOE), gives a set of atomic nuclei that are close to one another (Fig. 35).
(Forster MJ (2002) Molecular modelling in structural biology. Micron 33.)
Such data on various distances, bond angles and torsion angles characterize the local conformation of atoms that are bonded together. This list of restraints is then used to build a model of the protein which best satisfies the restraints and shows the location of each atom. A typical NMR structure is not a single unique protein structure but includes an ensemble of protein structures, all of which are consistent, to some extent, with the observed list of experimental restraints. The NMR structures in such an ensemble contain some regions which are very similar to each other due to strong restraints, and also contain less constrained portions in which they are very different. These areas with fewer restraints are the flexible parts of the molecule, which do not give strong signals in the NMR experiments.

10 Cavanagh J, Fairbrother WJ, Palmer AG, Rance M, Skelton NJ (2007) Protein NMR Spectroscopy (2nd Edition), Academic Press, Burlington.

Therefore an NMR structure may reflect, to some extent, the dynamic behavior of the protein. In the PDB archive, typically two types of coordinate entries for NMR structures may be found. In one case the NMR structure includes a set of separate model structures, all satisfying the restraints of the structural determination. In the other case, the PDB entry is a minimized average structure which attempts to reflect the average properties of the protein. The PDB entries contain a list of restraints (e.g. hydrogen bonds and disulfide linkages, distances between hydrogen atoms that are close to one another, and restraints on the local conformation and stereochemistry of the chain) that were determined by the NMR experiment.

Protein electron microscopy, electron diffraction and electron crystallography

Electron microscopy (EM) is applicable to determine the structures of large macromolecular complexes. In EM, images of the molecular object are obtained directly with beams of electrons, using various methods. If the proteins can be coaxed into forming small crystals, or if they pack symmetrically in a membrane, electron diffraction (ED) can be used to generate a 3D density map, using methods similar to X-ray diffraction. If the molecule is very symmetrical, such as a virus capsid, many separate images may be taken, providing a number of different views. These views are then aligned and averaged to extract 3D information. Electron tomography, on the other hand, obtains many views by rotating a single specimen and taking several electron micrographs. These views are then processed to give the 3D information. Typically, EM experiments do not allow the determination of atomic-level structure but provide an overall 3D shape of the molecule. In the case of a few particularly well-behaving systems, such as several membrane proteins, ED measurements can produce atomic-level data [11]. To determine atomic details, EM studies are often combined with information from X-ray crystallography or NMR spectroscopy, and atomic structures from X-ray or NMR experiments are docked into the electron density map from ED to yield a model of the complex. This combined approach has proved successful for various multi-biomolecular assemblies. Experimental data obtained by these techniques can be found in the Electron Microscopy Data Bank (EMDB), which is a public repository for electron microscopy density maps of macromolecular complexes and subcellular structures. It covers a variety of techniques, including single-particle analysis, electron tomography, and electron (2D) crystallography. A number of atomic-resolution structures of membrane proteins (better than 3 Å resolution) have been determined recently by electron crystallography (EC) [12]. While this technique was established more than 40 years ago, it is still in its infancy with regard to two-dimensional (2D) crystallization, data collection, data analysis, and protein structure determination. In terms of data collection, electron crystallography encompasses both image acquisition and electron diffraction data collection. EC can complement X-ray crystallography for studies of small crystals of proteins (<0.1 micrometers), such as membrane proteins, that cannot easily form the large 3D crystals required for X-ray methods.
By EC, protein structures can be determined from either two-dimensional crystals (sheets or helices), polyhedra (such as viral capsids), or dispersed individual proteins. Electrons can be used in these situations, whereas X-rays cannot, because electrons interact more strongly with atoms than X-rays do. In contrast to X-ray crystallography, where the phase problem persists because there is no X-ray lens, electron microscopes contain electron lenses, and so the crystallographic structure factor phase information can be experimentally determined in EC.

Protein neutron crystallography

Neutron protein crystallography (NC) can provide a powerful complement to X-ray crystallography by enabling key hydrogen atoms to be located in biological structures that cannot be seen by X-ray analysis alone. The availability of fully deuterated protein by using bacterial expression systems eliminates the hydrogen incoherent scattering contribution to the background.

11 Fujiyoshi Y (2011) Electron crystallography for structural and functional studies of membrane proteins. J Electron Micr. 60(Suppl 1): S149.
12 Gonen T (2013) The collection of high-resolution electron diffraction data. Methods Mol Biol 955.

Typically, X-ray structures of proteins do not provide exact positions of the hydrogen atoms. Although in high-resolution X-ray crystal structures some of the hydrogen atoms can be observed, functionally important hydrogen atoms are often not visible. Joint X-ray and neutron diffraction studies have indicated the importance of NC in determining the accurate atomic positions of functionally important hydrogens (i.e. the protonation/deprotonation state of unique residues) in protein structures [13]. A major hurdle to protein NC is that unusually large crystals are required to compensate for the weak flux of available neutron beams.

13 (a) Yamaguchi S, Kamikubo H, Shimizu N, Yamazaki Y, Imamoto Y, Kataoka M (2007) Preparation of large crystals of photoactive yellow protein for neutron diffraction and high resolution crystal structure analysis. Photochem Photobiol. 83(2); (b) Howard EI, Blakeley MP, Haertlein M, Petit-Haertlein I, Mitschler A, Fisher SJ, Cousido-Siah A, Salvay AG, Popov A, Muller-Dieckmann C, Petrova T, Podjarny A (2011) Neutron structure of type-III antifreeze protein allows the reconstruction of AFP-ice interface. J Mol Recognit. 24(4).

Quantitative models of the functional effects of genetic variants

Introduction

Gene expression determines a cell's identity, operation and abilities. The RNA, which is encoded in the DNA in the nucleus, and the proteins are kept in a balance between translation and degradation, which is regulated at multiple levels by regulatory loops. The DNA contains the instructions of a living organism, and variations in the nucleic acid chain can influence gene expression in many ways, which can appear in the phenotype. Many researchers focus on the regulation of gene expression, but it takes place at different levels and can only be understood by looking at the complete picture: how an amino acid chain is obtained from the DNA, and how the momentary amount of the expressed protein, or a change in expression, affects the phenotype. In this chapter we describe the different levels and types of genetic regulation and explore the possible functional effects of a variant. We concentrate on the rapidly developing field of microRNAs and transcription factors, but we also give a brief introduction to other regulatory mechanisms (e.g. epigenetics). Although this part of the book deals only with the effects of variants, methods for regulatory network analysis are presented in a later chapter.

Variants

In order to understand the models presented in later sections, it is important to describe the possible functional effects of genetic variants. In this section we give a brief introduction to genetic variants and the functional effects of these mutations.

SNP, indel

A Single Nucleotide Polymorphism (SNP) is a one-point mutation and is the best described type of genetic variation: in the DNA sequence one base is replaced by another base. Based on the position of the replacement we can distinguish:
- coding SNPs: synonymous coding SNPs, and non-synonymous coding SNPs (missense or nonsense);
- noncoding SNPs: untranslated region (UTR), intronic and intergenic SNPs.

The non-coding SNPs do not affect the amino acid sequence but can affect gene expression, especially of nearby genes. Synonymous coding SNPs do not modify the amino acid and only rarely have a direct effect on the protein structure. Non-synonymous SNPs cause an amino acid change, which can have two different effects. Nonsense mutations change the amino acid to a stop codon, which stops protein translation; usually this type of mutation significantly shortens the amino acid chain. Missense mutations change the amino acid to another amino acid (not a stop codon). Both can damage the final protein and have a huge effect on the expression of the gene. UTR SNPs can also influence gene expression, as we will show in a later section; untranslated regions are the primary binding sites of microRNAs, which downregulate gene expression. Intronic SNPs can have similar effects to other noncoding SNPs. The above-mentioned mutations affect only one base in the DNA. Besides these, insertions and deletions (indels) can affect several consecutive bases in the DNA. In the case of insertions, one or more bases are inserted at a given position in the DNA sequence, whereas deletions cut out one or more bases. These mutations can have a similar impact on the phenotype. If the mutation falls into a coding region, we can distinguish two types: frameshift and non-frameshift mutations. Frameshift indels insert or delete a number of bases that is not a multiple of three, which shifts the reading frame of the amino acid translation.

Alternative splicing

During transcription an mRNA sequence is formed from the DNA. mRNA maturation starts during transcription: introns are cut out and the mRNA contains only the exons of the gene. This process is called splicing. If a gene contains several exons, there are multiple possible combinations of the exons, and some exons may even be left out of the mRNA. This leads to different forms of mRNA, which result in different proteins. This process is often cell- or tissue-specific, and different proteins are translated from the mRNA.

Levels of regulation

The process that starts with DNA transcription and translation and ends with an amino acid chain is very complex and is regulated by networks of regulatory elements. The regulatory elements can be divided according to whether they affect the DNA-to-mRNA transcription (transcriptional or co-transcriptional regulation, e.g. transcription factors), bind to the mature mRNA (post-transcriptional regulation, e.g. miRNAs), or act on the protein (post-translational regulation, e.g. phosphorylation). Between these levels there are many forward and backward regulatory loops: a miRNA can repress the translation of a TF, and a TF can block the expression of a miRNA. These regulatory elements are the building blocks of gene regulatory networks.

Different regulatory elements

microRNA

A microRNA (miRNA) is usually an approximately 22 base long single-stranded RNA molecule, which can bind to an mRNA and repress its translation. miRNAs were first observed in Caenorhabditis elegans. The role of miRNAs in biological processes has been demonstrated by many experiments in eukaryotes: genes involved in cell division, apoptosis, signalling pathway regulation, and cardio- and neurogenesis are often controlled by miRNAs. One miRNA molecule can bind to several hundred mRNA targets. The seed region (nucleotides 2-8) at the 5' end of the miRNA recognizes the target mRNA's binding site by Watson-Crick base pairing.
Alternative splicing

During transcription an mRNA sequence is formed from the DNA. mRNA maturation starts already during transcription: introns are cut out and the mature mRNA contains only the exons of the gene. This process is called splicing. If a gene contains several exons, there are multiple possible orderings of the exons, or some exons may even be left out of the mRNA. This leads to different forms of mRNA, which result in different proteins. This process is often cell- or tissue-specific, and accordingly different proteins are translated from the mRNA.

Levels of regulation

The process which starts with DNA transcription and translation and ends with an amino acid chain is very complex and is regulated by networks of regulatory elements. The regulatory elements can be divided based on whether they affect the DNA-to-mRNA transcription (transcriptional or co-transcriptional regulation, e.g. transcription factors), bind to the mature mRNA (post-transcriptional regulation, e.g. miRNA), or connect to the protein (post-translational regulation, e.g. phosphorylation). Between these levels there are many forward and backward regulatory loops: a miRNA can repress the translation of a TF, and a TF can block the expression of a miRNA. These regulatory elements are the building blocks of gene regulatory networks.

Different regulatory elements

microRNA

A microRNA (miRNA) is a usually 22 nucleotide long single-stranded RNA molecule which can bind to an mRNA and repress its translation. miRNAs were first observed in Caenorhabditis elegans, and the role of miRNAs in biological processes has been demonstrated by many experiments in eukaryotes: genes involved in cell division, apoptosis, signalling pathway regulation or cardio- and neurogenesis are often controlled by miRNAs. One miRNA molecule can bind to several hundred mRNA targets. The seed region, spanning nucleotides 2-8 at the 5' end of the miRNA, recognizes the binding site of the target mRNA by Watson-Crick base-pairing. Usually the miRNA binds to the 3' end of the mRNA, but it can also bind to the 5' end or even to coding regions of the mRNA. It has been shown that the intensity of downregulation differs depending on the position of the binding site. The basic mechanisms of miRNA regulation are as follows:
inhibition of translation
mRNA deadenylation
mRNA sequestration
A miRNA always inhibits mRNA translation. Figure 36 shows the different mechanisms of miRNA regulation.

miRNA development

miRNA development is different in plants and animals; in this chapter we describe only the animal (and human) development of miRNAs [37]. The development of a miRNA starts in the nucleus, where the pri-miRNA is transcribed by the RNA polymerase II enzyme. The pri-miRNA can be several hundred nucleotides long and can contain more than one miRNA. The Drosha enzyme cleaves the hairpin precursor miRNA (pre-miRNA) from the pri-miRNA. The pre-miRNA leaves the nucleus and is cleaved by the Dicer enzyme into a mature single-stranded miRNA [37]. The mature miRNA connects to the miRISC (miRNA induced silencing complex) and guides the miRISC to those mRNAs to which it can connect by base-pairing.

miRNA regulatory methods

Inhibition of translation

miRNAs often inhibit the initiation of translation, but they can also repress translation after initiation in several ways: based on experiments, sometimes they cause premature ribosome drop-off, in other cases they slow down or completely stop the elongation of the amino acid chain. In these cases less protein is produced but the amount of mRNA remains the same.

mRNA deadenylation

During mRNA deadenylation, depending on the miRNA, the amount of mRNA can decrease because the miRNA destabilizes the mRNA molecule. Deadenylation can be followed by the removal of the cap at the 5' end of the mRNA and by mRNA degradation. Although degradation depends on deadenylation, the mRNA molecule is not destabilized in all cases: in one experiment researchers found still stable and partially destabilized mRNAs even though all of them were fully deadenylated. Despite the fact that the mRNA was still stable, its expression was strongly repressed by the miRNA.

mRNA sequestration

An indirect form of miRNA regulation is mRNA sequestration. In this case the target mRNA is driven to the P-bodies of the cytoplasm: the miRNA binds to the mRNA in the same way and brings the mRNA into a P-body, where the mRNA can be deadenylated, degraded or simply stored. There are no ribosomes, which are mandatory for translation, in the P-bodies.
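As a small illustration of the seed matching described above, the sketch below scans a 3' UTR (given as a DNA string) for exact matches of the reverse complement of a miRNA seed, here taken as nucleotides 2-8. The sequences and the exact 7-mer definition are simplifying assumptions; real target prediction tools also consider site context and conservation.

def seed_match_positions(mirna, utr_dna):
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    seed = mirna[1:8]                                    # nucleotides 2-8 of the miRNA (7-mer seed)
    site_rna = "".join(comp[b] for b in reversed(seed))  # reverse complement = sequence of the mRNA site
    site_dna = site_rna.replace("U", "T")                # the UTR is given in DNA letters
    w = len(site_dna)
    return [i for i in range(len(utr_dna) - w + 1) if utr_dna[i:i + w] == site_dna]

# Illustrative example with hypothetical sequences:
mir = "UAGCUUAUCAGACUGAUGUUGA"   # a 22-nt miRNA written in RNA letters
print(seed_match_positions(mir, "AAATAAGCTAGCTATAAGCTATTT"))   # prints the 0-based match positions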

Transcription factors

Transcription factors (TFs) affect gene expression at the level of DNA-to-mRNA transcription. There are many TF proteins which can initiate and regulate transcription. They have a DNA-binding domain, through which they can bind to the promoter, silencer and enhancer regions of the DNA. Transcription factors can not only repress but also activate gene expression. Transcription factor binding sites can be found almost anywhere near a gene: in promoter regions, in introns, in UTRs, and even more than a thousand bases upstream or downstream of the gene. Binding sites usually form clusters where several TFs can bind at the same time. The parts of the DNA in genes, or near genes, which are responsible for proper expression are called cis-regulatory elements. These elements lie on the DNA sequence close to the genes, as opposed to trans-regulators, which affect distant parts of the DNA (e.g. genes on other chromosomes). Several TFs can bind to a gene (combinatorial regulation), and TFs can bind to a gene in different combinations.

Epigenetics

At the beginning of the 21st century epigenetics became more and more popular, although the term 'epigenetics' comes from the first half of the 20th century. Epigenetics refers to molecular mechanisms which cause heritable states without changing the DNA sequence. They implement a cell- and tissue-specific regulation of gene expression; besides this, these mechanisms help the short-term adaptation of cells to changes in the environment. The two main mechanisms described in this chapter are histone modifications and methylation.

Methylation

During methylation, methyl-transferase enzymes attach a methyl group (CH3) to a cytosine base. This reversible process has a larger effect at CpG islands. Methylation does not turn off the transcription of a gene.

Histone modifications

In the nucleus the DNA is wound around histone proteins; this complex forms the chromatin. Histone modifications such as deacetylation remove the acetyl group from the histone proteins and make the chromatin more compact, so that RNA polymerases cannot transcribe the genes. This process is reversible, just like methylation, via histone acetyltransferases.

7. References

[37] K. Chen and N. Rajewsky, The evolution of gene regulation by transcription factors and microRNAs. Nat Rev Genet, 8(2):93-103, 2007.
[38] L. Cerulo, C. Elkan, and M. Ceccarelli, Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics, 11(1):228, 2010.
[39] C. Elkan and K. Noto, Learning Classifiers from Only Positive and Unlabeled Data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, New York, NY, USA, ACM, 2008.
[40] T. D. Le, L. Liu, B. Liu, A. Tsykin, G. J. Goodall, K. Satou, and J. Li, Inferring microRNA and transcription factor regulatory networks in heterogeneous data. BMC Bioinformatics, 14:92, 2013.

8. Mathematical models of gene regulatory networks

Introduction

In the previous chapter we learned about the functional effects of a variant and about the individual modules of gene expression regulation. In this part of the book we extend the previously described distinct modules into a full network: how do the different parts of gene regulation interact, and how can different genes interact with each other?

Learning networks

There are many learning methods, as can be seen in Figure 37. The two main classes of learning (supervised and unsupervised) both have their pros and cons. In case of unsupervised learning there is no labeled data, so there is no feedback and it is harder to evaluate the results. In case of supervised learning the researcher faces the problem of having only positive examples, since in biology there are only positive examples of regulatory interactions and no proven negative examples. This introduces a bias into the learning system.

Representation

Machine learning techniques for gene regulatory networks usually represent these networks as directed graphs. The nodes are the elements of the regulatory network, e.g. genes, proteins and metabolites, and the edges are the interactions between them.

Types of network learning algorithms

Unsupervised methods

Four main types of unsupervised model building algorithms exist:
Information theory models
Boolean network models
Differential and difference equation models
Bayesian models
Information theory models such as ARACNE and CLR use expression levels for finding interactions between the elements of the network: if the correlation coefficient of the expression levels of two genes is above a threshold, those genes are considered to be in interaction by these methods. Boolean network models like REVEAL use a binary variable for the state of gene activity, and the edges of the directed graph are Boolean functions representing the interactions. Differential and difference equation models define functions for the expression change of a gene: the expression change of a gene depends on the expression levels of other genes, and these methods try to find a solution for a set of ordinary differential equations. Bayesian methods consider gene expression levels as random variables and apply Bayes' rule; their biggest advantage is that prior knowledge can easily be incorporated into the learning algorithm.

Supervised methods

Supervised methods need not only the expression profiles as input but also information on known regulatory interactions. There are multiple databases for these types of regulatory connections; here we list some of the best known examples:
TRANSFAC for transcription factors and transcription factor binding sites
miRNA databases for predicted and experimentally supported miRNA-target pairs: miRTarBase, miRanda, TarBase
STRING for protein-protein interaction networks
The intuitive approach here is the following: if an element with a given expression profile regulates another element with another expression profile, then it is probable that other element pairs with similar expression profiles also have a regulatory interaction. However, since the databases above provide information on known interactions, the collected data will contain positive examples only, and most classifiers cannot handle such a bias. There are a couple of possible solutions, although some of them are context dependent. The simplest method is to choose negative examples randomly and use those as training points; if false negatives are included in the training set, the performance of the applied method usually deteriorates. In order to pick more reliable negative examples, a text mining approach can be used: in the first step, negative training points are chosen from the unlabeled set based on term frequency-inverse document frequency; after the completion of the training set, several classifiers are applied for learning the regulatory network, and the result of the best performing method is accepted.
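The simplest strategy mentioned above, sampling random unlabeled gene pairs as provisional negatives, can be sketched as follows. The gene identifiers and the handling of pair orientation are illustrative assumptions.

import random

def build_training_pairs(known_interactions, genes, n_negative, seed=0):
    rng = random.Random(seed)
    positives = {tuple(p) for p in known_interactions}
    negatives = set()
    while len(negatives) < n_negative:
        a, b = rng.sample(genes, 2)
        # skip pairs reported in either orientation; the remaining pairs are only
        # *assumed* negatives, so some false negatives may slip into the training set
        if (a, b) not in positives and (b, a) not in positives:
            negatives.add((a, b))
    return sorted(positives), sorted(negatives)

# Illustrative example with hypothetical regulator-target pairs:
pos = [("TF1", "GENE3"), ("MIR21", "GENE7")]
print(build_training_pairs(pos, ["TF1", "MIR21", "GENE3", "GENE7", "GENE9"], 3))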

Another approach is to train a standard classifier on the positive training set and use it to estimate the probability that an example is positive. The PosOnly method described in [38] uses this concept; here we give a brief description of the PosOnly algorithm, and for a detailed description and a performance evaluation see [39] and [38]. Characterize the data as usual with a feature vector x and a class label y. Besides this, introduce a new binary label s as follows: s = 1 if the example is labeled (a known positive), and s = 0 otherwise. The goal is to learn p(y = 1 | x). It has been shown that in this case p(s = 1 | x) = c * p(y = 1 | x), where c = p(s = 1 | y = 1) is a constant factor which can be estimated using a validation set. This implies that the resulting conditional probabilities differ only by a constant factor from those learned on a fully characterized training set. The authors of [38] give a possible estimator for c: c can be approximated as the average of g(x) over P, i.e. c = (1/|P|) * sum over x in P of g(x), where P is the labeled subset of the validation set and g(x) is the trained estimate of p(s = 1 | x).
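A minimal sketch of this calibration step, assuming scikit-learn is available and that the feature matrices are NumPy arrays; the logistic regression base classifier is an arbitrary choice and the variable names are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def posonly_fit(X_train, s_train, X_val, s_val):
    # g(x) estimates p(s = 1 | x), the probability of being a *labeled* example
    g = LogisticRegression(max_iter=1000).fit(X_train, s_train)
    labeled_val = X_val[s_val == 1]                 # P: labeled subset of the validation set
    c = g.predict_proba(labeled_val)[:, 1].mean()   # c = p(s = 1 | y = 1), the mean of g(x) over P
    return g, c

def posonly_predict(g, c, X):
    # p(y = 1 | x) = p(s = 1 | x) / c, clipped into [0, 1]
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)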

TF, miRNA, mRNA regulatory networks

Genetic regulation forms highly complex networks, as regulatory elements like miRNAs and transcription factors (TFs) affect not just the transcription and translation of the regulated gene (mRNA): miRNAs can also affect the expression of transcription factor proteins, whereas TFs can activate and repress miRNA development. In this section we show a method described in [40] to learn a gene regulatory network from expression data in three steps:
1. Data preparation
2. Network learning and integration
3. Network inference
In the first step the expression data is normalized and the significantly differentially expressed genes are classified into the three types of network elements. The prior network structure is based on sequence based target prediction tools: TRANSFAC and miRBase contain data on experimentally supported and in silico predicted TF-target and miRNA-target pairs. For forming a prior network structure these databases are a good starting point; on the other hand, this introduces a bias, as in silico prediction based on sequence information produces false positives and false negatives. To learn the network structure, the expression data is first split into different conditions based on the phenotype. The initial prior structure is built with the following restriction to avoid an NP-hard search in the space of graphs: only bipartite graphs are used with the pairs miRNA-TF, miRNA-mRNA, TF-TF, TF-miRNA, TF-mRNA. The initial structure is defined by the data collected from the databases. During the learning process each interaction is evaluated by a Bayesian score, and the high-confidence interactions are used in the bootstrapping and integrating phase. In this step all interactions are validated with bootstrapping to increase statistical power, and the interactions with sufficiently high confidence are integrated into the global network. Finally, network inference is applied with motif search: those motifs become the building blocks of the resulting network which are present with a significantly higher frequency than in randomized networks.

9. References

[37] K. Chen and N. Rajewsky, The evolution of gene regulation by transcription factors and microRNAs. Nat Rev Genet, 8(2):93-103, 2007.
[38] L. Cerulo, C. Elkan, and M. Ceccarelli, Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics, 11(1):228, 2010.
[39] C. Elkan and K. Noto, Learning Classifiers from Only Positive and Unlabeled Data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, New York, NY, USA, ACM, 2008.
[40] T. D. Le, L. Liu, B. Liu, A. Tsykin, G. J. Goodall, K. Satou, and J. Li, Inferring microRNA and transcription factor regulatory networks in heterogeneous data. BMC Bioinformatics, 14:92, 2013.

10. Standard analysis of genetic association studies

Introduction

The aim of genetic association studies (GAS) is to detect statistical dependencies between the investigated phenotypes and the frequencies of genotypes, measured by various techniques. The most common type of GAS is the case-control study, in which the statistical dependency between single nucleotide polymorphisms (SNPs) and a binary disease state descriptor is analyzed. If the distribution of the possible genotypes of a given SNP significantly differs in cases from controls, this indicates that the SNP plays a role in the mechanism of the studied disease. The rapid development of measurement techniques led to a remarkable change in the implementation of genetic studies, and also in the analysis of their results. Previously, tens or hundreds of SNPs were measured jointly; such studies are now called candidate gene association studies (CGAS). These were partially replaced by genome-wide association studies (GWAS) measuring thousands or even tens of thousands of SNPs at once. However, in many cases GWAS failed to fulfill its promise, as the genetic background of several multifactorial diseases (e.g. asthma, obesity) is yet to be fully uncovered. A possible reason for this failure is the inadequate measurement, or the complete lack of investigation, of environmental factors and phenotypes. Another possible reason is that the available statistical methods have considerable limitations, such as the correction for multiple hypothesis testing. Due to all these factors, CGAS that investigate statistical dependency relationships using detailed environmental descriptors and phenotypes came back into view. In this chapter we introduce statistical methods and tools that are frequently used for the analysis of genetic association studies.

Genetic data transformation

The prerequisite of a proper analysis is a well prepared data set. In case of genetic data this statement cannot be overemphasized: as there are many possible error sources (e.g. measurement error, inadequate sample, data processing errors, etc.), we should always carry out a rigorous investigation of the data set.

Filtering

Assuming a data set that has already been post-processed by a genotyping instrument (i.e. measurement errors are marked in the data set), we start processing the data set by investigating errors. The aim of filtering is to discard improper data cells, either by discarding samples or by excluding variables. In order to do this, we need to define two thresholds for the ratio of missing data. The first threshold should define the acceptable missing data rate per variable (MRV), i.e. per column of the data set, e.g. per SNP, and the other threshold should define the missing data rate per sample (MRS). After discarding the (nearly) totally missing SNPs (MRV close to 100%), the samples should be filtered depending on the data set size and sample quality. In case of a large data set with high quality samples a strict MRS threshold can be selected. In more practical scenarios we may select a more permissive threshold, and in case of a low sample size and moderate sample quality the acceptable MRS may be even higher. Note that samples with missing values for key target variables should be discarded regardless of their MRS. After removing the samples with an MRS above the selected threshold, variables with a high MRV should be discarded. This threshold may also depend on the quality of the data set, and strict, average or permissive values can be chosen accordingly.
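A compact sketch of this filtering step using pandas; the encoding of missing genotype calls as NaN and the threshold values are illustrative assumptions, since the original text does not fix concrete numbers.

import pandas as pd

def qc_filter(genotypes, mrs=0.05, mrv=0.02):
    # genotypes: samples x SNPs DataFrame, missing calls encoded as NaN (assumption)
    g = genotypes.loc[:, genotypes.isna().mean(axis=0) < 0.95]   # drop (nearly) totally missing SNPs
    g = g.loc[g.isna().mean(axis=1) <= mrs]                      # drop samples above the MRS threshold
    g = g.loc[:, g.isna().mean(axis=0) <= mrv]                   # then drop SNPs above the MRV threshold
    g = g.loc[:, g.nunique(dropna=True) > 1]                     # remove monomorphic SNPs
    return g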

Furthermore, the variability of the variables should also be taken into account. Variables with a single possible value, e.g. monomorphic SNPs, must be discarded. More generally, a variable with very low variability (i.e. if a value of the variable appears in only a small fraction of the data or in less than 10 cases) should be removed. There are several methods that are capable of handling the missing values of genotypes (i.e. imputation); one of the simplest approaches is to impute values acquired by random sampling from the distribution of the given genotype.

Standard test for Hardy-Weinberg equilibrium

The next step after data set filtering is the testing of the Hardy-Weinberg equilibrium (HWE) for all SNPs. HWE states that allele and genotype frequencies do not change from generation to generation if no evolutionary influences are present, such as mutation, genetic drift, or non-random mating. Given a trait with two alleles A and a with frequencies p and q respectively, the expected genotype frequencies are p^2 for the common homozygous case (AA), 2pq for the heterozygous case (Aa) and q^2 for the rare homozygous case (aa). These frequencies are called the Hardy-Weinberg proportions and they satisfy the equation p^2 + 2pq + q^2 = 1. The genotype proportions in case of a bi-allelic locus can be represented by a de Finetti diagram (Figure 38); a special curve called the Hardy-Weinberg parabola defines the proportions which are in HWE. The deviation from HWE can be detected by a Pearson's chi-square test (see the section on association tests for details) using observed values based on the data set and expected values based on HWE [J. E. Wigginton et al. 2005]. In case of a significant result the null hypothesis of HWE has to be rejected. SNPs that have a significant p-value for the HWE test in controls should be discarded, as this generally indicates a measurement error.
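The HWE check described above can be sketched as a one-degree-of-freedom chi-square test on the three genotype counts of a SNP (computed in the control group). SciPy is assumed to be available and the genotype counts in the example are made up.

from scipy.stats import chi2

def hwe_test(n_AA, n_Aa, n_aa):
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2.0 * n)          # estimated frequency of allele A
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # 3 genotype classes - 1 - 1 estimated parameter (allele frequency) = 1 degree of freedom
    return stat, chi2.sf(stat, df=1)

print(hwe_test(360, 480, 160))   # counts consistent with HWE give a high p-value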

Phenotype data transformation

Depending on the available phenotype descriptors, clinical and environmental variables, further processing of the data may be required. Contrary to genetic data, phenotypes and other descriptors cannot be imputed. Therefore, the pre-processing of these variables can be crucial for a successful analysis.

Transformation

In case of multiple quantitative phenotype descriptors that may serve as "targets" of interest, there is a conventional approach of merging and transforming these variables into a complex phenotype. Otherwise one must carry out as many separate analyses (in a frequentist framework) as the number of chosen descriptors. This may result in overly stringent p-value thresholds (i.e. correction) for the association tests, which can be avoided by an appropriate selection and transformation of the target variables. A possible solution is to use principal component analysis (PCA) to identify relevant phenotype components and form a composite phenotype to be used in subsequent association analyses [Zhang et al. 2012]. Note that within the Bayesian framework such merging of targets is not necessary, since multiple targets can be analyzed at once.

Discretization

Some statistical methods can only be applied to categorical (discrete) variables, in which case all quantitative phenotypes, environmental and clinical factors have to be discretized (i.e. binned). There are several available methods to perform binning, such as the equal bin width method; other, more sophisticated methods are available, for example, in the R framework.

Univariate analysis methods

Univariate methods assume that the investigated factors are independent from each other, and thus analyze the relationship between the outcome (target) variable and each factor separately. Although the assumption of the independence of factors is highly improbable, in many cases this approach is acceptable, because generally the goal is to identify the most significant factors, which may lead to efficient biomarkers (i.e. indicators of the presence or the severity of a disease); the discovery of interactions, dependency patterns and other features is secondary from this aspect. The application of univariate methods can also be justified by their relative simplicity and efficiency compared to the more complex, computationally intensive multivariate methods. There are numerous association measures that can be applied for the analysis of GAS results, ranging from basic association tests to various effect size measures such as odds ratios [Balding 2006].

Standard association tests

In the conventional frequentist framework, statistical methods rely on hypothesis testing. The so-called null hypothesis stands for no association, whereas the alternative hypothesis is either a general model or, in case of GAS, a specific genetic model (i.e. additive, dominant, recessive) describing the association. The essential element of association tests is the computed statistic upon which the hypothesis evaluation depends. Generally, the null hypothesis is rejected if the significance level corresponding to the computed statistic is lower than an arbitrary threshold; in other words, if the computed statistic is higher than the critical value corresponding to the chosen threshold. Typically a significance level of 0.05 is considered to be the threshold. A frequently used statistic for case/control GAS is Pearson's chi-square, which provides a straightforward way of assessing the dependence between categorical variables, e.g. between disease state descriptors and genetic factors such as genotypes. In order to aid the calculation, a contingency table of appropriate size can be constructed corresponding to the cardinality of the variables [Agresti 2002]. For example, in case of two binary variables X (a specific allele) and Y (phenotype), a 2 x 2 table can be created.
The chi-square statistic is computed based on the observed frequencies of the X : Y variable value pairs and the expected frequencies related to the null hypothesis of independence as follows:

chi^2 = sum over i and j of (O_ij - E_ij)^2 / E_ij,

where O_ij and E_ij denote the observed and the expected value of the cell in the i-th row and the j-th column, respectively. Expected values are computed using the row (R_i) and column (C_j) subtotals of the observed values as E_ij = (R_i * C_j) / N, where N is the total number of samples. This test statistic asymptotically approaches the chi^2 distribution with (r - 1)(c - 1) degrees of freedom, where r and c denote the numbers of rows and columns. If the computed Pearson's chi-square statistic is higher than the critical value of the chi^2 distribution corresponding to the chosen significance level, then the null hypothesis of independence is rejected.

Consider the example 2 x 2 contingency table displayed in Table 2, in which all the observed values and the row and column totals are shown. The task is to investigate whether there is an association between variable X (genetic factor) and Y (target). The null hypothesis is that X and Y are independent from each other, whereas the alternative hypothesis is that X and Y are dependent. The first step is to compute the expected values based on the null hypothesis for all observed values; for example, the expected value of a given cell is computed from its row total, its column total and the grand total as E = (row total x column total) / N. The second step is to calculate the Pearson's chi-square statistic using the expected and observed values. The third step is to determine the degrees of freedom according to (r - 1)(c - 1); since both variables are binary, the number of rows and columns are both 2, so the total degrees of freedom is 1. The last step is to compare the computed chi-square statistic with the chi^2 distribution with one degree of freedom and identify the corresponding p-value. In this example the p-value is below the typically applied threshold of 0.05, so we can reject the null hypothesis of independence and state that the association between X and Y is significant. Note that this significance threshold means that the probability of wrongly rejecting the null hypothesis is less than 5%.

In case of subsequent association tests on the same data set, the chance of getting false positive results (type I errors) increases. For example, in a study of 1000 SNPs in which each factor is investigated by a separate association test, about 50 SNPs may be found significant just by chance; in other words, the number of false positive results would be unacceptably high. This phenomenon is known as the multiple (hypothesis) testing problem, and various methods were created to resolve it. The most widely accepted approach is to apply some sort of correction on the p-values, such as the Bonferroni correction [Dunn 1961] or the Benjamini-Hochberg method [Benjamini and Hochberg 1995], which aims to control the false discovery rate. Another accepted method applies permutation tests to assess the validity of the results. In case of GAS these corrections are typically too conservative, thus hindering a meaningful analysis of results. This prompted researchers to devise new statistical methods for the analysis of GAS results. The application of Bayesian methods became increasingly popular in this field, as they handle the multiple testing problem in a normative way, i.e. they have a "built in correction" for multiple testing.
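A short sketch of running such tests across many SNPs and then applying the Benjamini-Hochberg correction, assuming SciPy and statsmodels are installed; the 2 x 2 tables in the example contain made-up illustrative counts.

import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

def association_scan(tables, alpha=0.05):
    # tables: one 2x2 array per SNP, rows = allele/carrier status, columns = case/control
    pvals = [chi2_contingency(np.asarray(t), correction=False)[1] for t in tables]
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.asarray(pvals), p_adj, reject

tables = [[[30, 10], [20, 40]], [[25, 25], [26, 24]]]   # two illustrative SNPs
print(association_scan(tables))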

Cochran-Armitage test for trend

The Cochran-Armitage test for trend is a special case of Pearson's chi-squared test, which allows the analysis of the dependency relationship between a binary and a multi-valued categorical variable [Cochran 1954, Armitage 1955]. The main assumption of the test is that the values of the multi-valued variable can be ordered according to a predefined ordering (i.e. there is a trend). Therefore, e.g. the labels 0, 1, 2 for the categories of the variable can be interpreted as low, medium, and high respectively. In case-control genetic association studies the disease state descriptor, which defines each sample as a case or as a control, can serve as the binary (target) variable for the Cochran-Armitage trend test. Consequently, the multi-valued variable corresponds to an analyzed SNP with values 0, 1, 2, which can be interpreted as common homozygote, heterozygote, and rare homozygote respectively (assuming three possible genotypes). Based on the example shown in Table 3, the statistic of the Cochran-Armitage test for trend (CATT) can be computed as a weighted sum, over the genotype categories, of the differences between the case and control counts, where the weights w_i are used to tune the test for the detection of specific association types. For genetic association studies the weights should be selected according to the assumed mode of inheritance, for example: the rare allele a is dominant with respect to allele A: w = (0, 1, 1); the rare allele a is recessive with respect to allele A: w = (0, 0, 1); the alleles A and a are additive (co-dominant): w = (0, 1, 2). The quotient of the CATT statistic and its standard deviation asymptotically approaches the standard normal distribution; therefore, the Cochran-Armitage test for trend can be implemented as a z-test with z = T / sqrt(Var(T)). If the expected trend (dominant, recessive, or additive) is found to be significant, then the statistical power of this test will be greater than that of a general chi-squared test. However, it will not be able to detect a trend of a different type than the expected one. In case of genetic association studies, particularly in GWAS, this test is used for the detection of linear (additive) trends in most cases [Purcell et al. 2007].

Odds ratios

While association tests show qualitatively whether there is a significant association between two variables, effect size measures define the strength of the association in a quantitative way. The odds ratio is one of the most popular effect size measures, which describes, in the context of a given disease or condition, how a certain trait influences the ratio of the healthy and the affected population [Balding 2006]. In other words, it quantifies whether a trait has a protective (OR < 1), risk increasing (OR > 1), or neutral (OR = 1) role with respect to a given disease. Standard odds ratios do not take multivariate relationships into account, just the corresponding ratios of samples.

Let X denote a discrete variable that encodes the SNP states 0, 1, 2, which refer to the common homozygote, heterozygote and rare homozygote genotypes respectively; X = i then denotes the SNP in state i. In case of a disease indicator Y, i.e. Y = 0: non-affected (control), Y = 1: affected (case), an odds is defined as

odds(X = i) = P(Y = 1 | X = i) / P(Y = 0 | X = i).

Consequently, an odds ratio, e.g. heterozygous (1) versus common homozygous (0), is given as

OR = [P(Y = 1 | X = 1) / P(Y = 0 | X = 1)] / [P(Y = 1 | X = 0) / P(Y = 0 | X = 0)].

Therefore, the log OR has the following form:

log OR = log[P(Y = 1 | X = 1) / P(Y = 0 | X = 1)] - log[P(Y = 1 | X = 0) / P(Y = 0 | X = 0)].

An odds ratio (based on the observed data) can be viewed as an estimate of the effect size of a given trait for the whole population. In this sense it is useful to investigate the reliability of this estimate. A confidence interval of an odds ratio is an interval estimate that represents a range of values in which the odds ratio lies if the study is repeated with different samples. The confidence level associated with a confidence interval describes the frequency with which the interval contains the odds ratio given repeated experiments. The most frequent choice is the 95% confidence interval, which means that the odds ratio is within that interval 95 times in case of 100 experiments. The confidence interval of odds ratios can be computed using the standard error of the log odds ratio, as its distribution is approximately normal with mean log(OR); the standard error is SE = sqrt(1/n_11 + 1/n_10 + 1/n_01 + 1/n_00), where n_xy denotes the number of samples in which X = x and Y = y. Therefore, the 95% confidence interval of the log odds ratio can be given as log(OR) +/- 1.96 * SE. This means the 95% confidence interval for the odds ratio is [OR * exp(-1.96 * SE), OR * exp(1.96 * SE)]. Using the example data shown in Table 2, the odds ratio and its confidence interval can be calculated following the same steps. In that example the trait has a protective effect with respect to the disease (OR < 1), and since the 95% confidence interval does not intersect the neutral odds ratio of 1, the odds ratio can be considered significant.
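These formulas translate directly into a few lines of Python; the cell counts in the example are illustrative and are not the (unavailable) values of Table 2.

import math

def odds_ratio_ci(n11, n10, n01, n00, z=1.96):
    # n11: X=1 cases, n10: X=1 controls, n01: X=0 cases, n00: X=0 controls
    or_ = (n11 * n00) / (n10 * n01)
    se = math.sqrt(1.0 / n11 + 1.0 / n10 + 1.0 / n01 + 1.0 / n00)
    lo = math.exp(math.log(or_) - z * se)   # lower bound of the 95% CI
    hi = math.exp(math.log(or_) + z * se)   # upper bound of the 95% CI
    return or_, (lo, hi)

print(odds_ratio_ci(20, 80, 40, 60))        # OR < 1 here, i.e. an apparently protective trait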

Univariate Bayesian methods

The basic paradigm of Bayesian methods is that using a prior probability distribution P(A) and a likelihood P(B|A), the posterior probability P(A|B) can be computed according to Bayes' theorem. The prior probability can be used to incorporate a priori knowledge or additional assumptions into the calculations; the likelihood, on the other hand, is a scoring function relying on the data.

Priors with a normal distribution or a mixture of normal distributions are frequently used in univariate Bayesian methods. An alternative choice is to use normal exponential gamma (NEG) priors [Stephens and Balding 2009]. Priors can be defined in terms of effect size (log-odds ratio) such that the proportion of SNPs having a non-zero effect size is fixed in advance [Stephens and Balding 2009]. The log Bayes factor is a univariate Bayesian measure which is increasingly applied in GAS; there are multiple implementations, such as the one in SNPtest [Marchini et al. 2007]. The Bayes factor is a ratio of marginal likelihoods between two models. If the investigated models (containing variables X and Y) are the null model of independence (M0) and an alternative model describing association (M1), then this allows the assessment of the association between X and Y based on model selection. The difference between the models is quantified based on the observed data D, the assumptions of the models M0 and M1, and their parameterizations theta_0 and theta_1 as

BF = P(D | M1) / P(D | M0) = [integral of P(D | theta_1, M1) P(theta_1 | M1) d theta_1] / [integral of P(D | theta_0, M0) P(theta_0 | M0) d theta_0],

which can be approximated by using Laplace's method [Marchini et al. 2007]. Note that these methods consider SNPs as independent entities, which is unrealistic, and valuable information contained in interactions and complex dependency patterns is neglected.

Multivariate analysis methods

On one hand, multivariate methods allow the investigation of complex dependency relationships; on the other hand, they are also computationally more complex. In case of categorical phenotypes, logistic regression is a frequently applied method which can be used both as a multivariate and as a univariate analysis tool.

Logistic regression

Logistic regression is a regression analysis method used for binary outcome (target) variables [Agresti 2002]. Based on the values of the explanatory variables, a logistic regression model can be learned, which allows predicting the odds of a sample being a "case", i.e. the odds that the outcome variable has a certain value, e.g. "1". The essence of logistic regression is the logistic function pi(z) = 1 / (1 + exp(-z)), which takes values between 0 and 1, where z stands for the linear combination of the explanatory variables, z = b_0 + b_1 x_1 + ... + b_k x_k, and pi is the probability that the outcome variable is a "case". b_0 is called the intercept and the other b_i are the regression coefficients. Using pi, the log odds can be written as logit(pi) = ln(pi / (1 - pi)) = b_0 + b_1 x_1 + ... + b_k x_k, which is called the logit function (left-hand side) and is equivalent to a linear regression expression (right-hand side). This transformation allows the application of linear regression to the log odds. Usually maximum likelihood estimation is used to estimate the regression coefficients. An iterative process has to be used for this purpose, as there is no closed form for the coefficient values maximizing the likelihood function. This step-wise process tries to improve an initial solution, and it either ends in a state of convergence, i.e. there is no more room for improvement, or it turns out that convergence is not possible. Explanatory variables with non-zero regression coefficients are considered to be part of the resulting logistic regression model. Although this is a multivariate model, the contribution of individual variables can be assessed by performing a likelihood ratio test or a Wald test. Since the distribution of the Wald statistic can be approximated with a chi^2 distribution, significance can be determined in an analogous way.
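A minimal sketch of fitting such a model with statsmodels (an assumed dependency); the simulated genotype data and effect size are hypothetical and only serve to make the example runnable.

import numpy as np
import statsmodels.api as sm

def fit_logistic(X, y):
    X = sm.add_constant(np.asarray(X, dtype=float))   # adds the intercept term b_0
    model = sm.Logit(np.asarray(y), X).fit(disp=0)    # iterative maximum likelihood estimation
    # coefficients, the corresponding odds ratios, and Wald p-values
    return model.params, np.exp(model.params), model.pvalues

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 1))                   # one SNP coded as 0/1/2
risk = 1.0 / (1.0 + np.exp(-(-1.0 + 0.6 * geno[:, 0])))    # simulated additive effect
y = (rng.random(200) < risk).astype(int)
print(fit_logistic(geno, y))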

Haplotype association

Haplotype association analysis is a straightforward choice for the joint statistical analysis of SNPs. As each of the SNPs forming the haplotype has its own allele variants (e.g. A/G), the possible haplotype variants arise from their combinations (e.g. ACG, ACA, ATA, ATG, GTG, GTA, ...). Several methods were developed to analyze the dependency relationship between a multi-valued haplotype variable and a binary target. All such methods have to deal with two essential tasks: (1) the lack of haplotype phase information and (2) the cardinality of the value set of the haplotype variable [Liu et al. 2008]. The phase information provides details on whether an allele is present on the maternal or the paternal chromosome; without it, all possible combinations of alleles must be taken into consideration. Some of the haplotype association analysis methods assume that phase information is available (either by measurements or by prediction), while others follow an integrative approach: these latter methods jointly predict the haplotype and perform the association analysis. The cardinality of the value set of the haplotype may pose a serious challenge, as the sample size of the data set is typically not large enough to represent the rare haplotypes sufficiently (in statistical terms). For example, in case of a haplotype which consists of 4 biallelic SNPs (e.g. A/G, where each SNP entails 3 possible genotypes: AA, AG, GG), there are already 2^4 = 16 possible haplotype variants. In order to have a sufficient number of samples for all possible haplotypes, a correspondingly large sample would be required even assuming a uniform frequency of haplotype values. In fact, it is inappropriate to assume uniform frequencies; rather, there are typically a couple of frequent haplotypes, while the majority of haplotypes are rare. Various methods were proposed to handle rare haplotypes by forming haplotype groups out of similar haplotypes. Hierarchical clustering can be used to form such groups [Durrant et al. 2004], as well as probabilistic clustering based on the evolutionary tree concept [Tzeng 2005]. A further applicable method uses a weighted log-likelihood based approach [Souverein et al. 2006].

Haplotype association test

Basic haplotype association tests (goodness-of-fit tests) are used to investigate whether there is a significant difference between cases and controls with respect to the distribution of haplotypes. For this purpose a likelihood ratio statistic is composed in the general form LR = 2 (log L_1 - log L_0), where L_1 and L_0 denote the likelihoods under the alternative and the null hypothesis; in case of a true null hypothesis this statistic asymptotically follows a chi^2 distribution, with degrees of freedom determined by the number of possible haplotypes. The drawback of this approach is that in case of a haplotype variable with high cardinality the power of this statistic will be too low to detect an association. Furthermore, it is possible that the sample size is so low that even if the null hypothesis is true the distribution of the statistic would not follow a chi^2 distribution. A possible solution is to apply non-linear transformations on the distribution of haplotypes in order to amplify the difference between case and control haplotypes; consequently, the power of the applied test may increase [Zhao et al. 2006]. Since in a typical GAS multiple loci are analyzed in parallel with haplotype association tests, the problem of multiple testing cannot be disregarded. Therefore, an appropriate correction has to be applied. A popular choice is to apply permutation tests to provide the necessary correction for the significance values. The frequently used haplotype association analysis tool Haploview also implements such tests [Barrett et al. 2005].

Haplotype sharing

69 Methods investigating haplotype sharing focus on the similarity between alleles forming the haplotypes within case and control groups. Given a locus, a similarity measure, control haplotypes, and case haplotypes four basic metrics measuring haplotype sharing can be constructed [Nolte et al.\ 2007]. Haplotype sharing between controls: Haplotype sharing between cases: Haplotype sharing between case and control groups: Total haplotype sharing: Based on these metrics various statistics can be composed, such as the HSS test and the CROSS test [Nolte et al.\ 2007]. The HSS test relies on the comparison of case and control haplotypes with the assumption that the sharing between cases is larger than between controls. The reason behind this notion is that haplotypes increasing a risk of a particular disease tend to be similar, whereas haplotypes related to controls are more diverse. where denotes the standard deviation of the corresponding haplotype distribution. In case of a sufficiently large sample size and follow a normal distribution, and the significance of the difference between these distributions can be tested by a with degrees of freedom. In contrast, the CROSS test is based on the notion that the sharing between case and control haplotypes is lower than the sharing between two randomly selected haplotypes. where denotes standard deviation. The follows an approximately normal distribution, except in extreme cases of, in which a transformation can be applied that allows the approximation of the distribution with a distribution [Nolte et al.\ 2007]. Relying on the before mentioned metrics, further statistics can be created, most of which can be formulated according to the following quadratic form 61

70 where és denote haplotype distributions for cases and controls respectively. stands for a symmetric matrix, defined by the symmetric kernel function, which describes the similarities between all and haplotype pairs. In addition, denotes the standard deviation of. If and are free from singularities, then approximately follows a standard normal distribution [Tzeng et al.\ 2003] Haplotype association analysis with regression models The advantage of regression models is that they allow the prediction of a haplotype (in case of missing phase information) and the analysis of their effect at the same time. Regression methods can be divided into two groups according to their approach to likelihood computation: prospective likelihood methods and retrospective likelihood methods. Let denote the observed genotype information, and denote a possible haplotype (maternal and paternal haplotype pair) in case of the sample. Let be the a priori probability of haplotype, and let stand for environmental risk factors (e.g. age, gender, smoking) that increase disease susceptibility. Furthermore let denote the presence of the disease, and let denote the set of those haplotypes that are consistent with the observed genotype at the sample. Using these components, the prospective likelihood based on the analyzed data can be formulated as [Schaid 2004] where is the vector of regression coefficients, and is the total sample size. This prospective regression model can be fitted either using maximum-likelihood methods [Lake et al.\ 2003] or EM based methods [Zhao et al.\ 2003]. The essence of prospective approach is that based on the data, more specifically on the genotype ( haplotype ( ) information and also on the environmental factors, the probability of the presence of the disease can be assessed. In contrast, a retrospective approach relies on the actual disease state, and estimates the probabilities of haplotypes. Accordingly, the retrospective likelihood can be formulated as [Epstein and Satten 2003] ) and where and are the number of control and case samples respectively having genotype. The advantage of retrospective likelihood is that its statistical power is at least as high as or higher than that of the prospective likelihood. However the retrospective likelihood is also less robust in case of deviations from the Hardy- Weinberg equilibrium [Satten and Epstein 2004]. A further option is to use the generalization of regression models, the generalized linear model (GLM) as a statistical framework. The basic assumption of GLM is that the distribution of the dependent variable (in our case a disease descriptor) can be given by a distribution from the exponential family such that its mean value depends on the independent variables (e.g. genotype, environmental factors). The variables form a linear predictor ( ) as a linear combination of their corresponding parameters, so that. The dependence between the linear predictor and the mean of the distribution of is given by a link function such that. Therefore, the main equation of a GLM takes the following form 62

71 where E(.) denotes expectation computation. Note that the variance of can also be expressed as a function of the mean. GLM, as a statistical framework can be used to create statistics for the testing of haplotype associations as [Schaid 2004] where is the value of the disease descriptor at the sample, is a fitted value from a GLM using only environmental factors as covariates, and is a normalization factor according to the distribution used in the GLM. is the conditional expectation given the genotype defined by the data computed over the distribution of haplotypes. The statistic in fact measures the covariance between the expected value of haplotypes and the residuals (the error of predictions compared to the real values of ) of a GLM using environmental factors as covariates [Schaid 2004] Analysis of statistical power Statistical power ( ) expresses the probability that a test rejects a null hypothesis ( ) when it is indeed false ( ), that is. In other words, power is the opposite of type II. error, the false negative rate ( ). Statistical power is influenced by three main factors: 1. Sample size. The available sample size is a crucial factor, since the more samples are used for the analysis, the less the sampling error gets (with respect to the whole population). In other words, we may draw more reliable conclusions based on larger samples. 2. Effect size. The effect size of the analyzed genetic or environmental factors is a relevant aspect as we need more samples for the analysis of a factor with relatively small effect size than in case of a factor with a relatively large effect size. 3. Significance level. A threshold used for the evaluation of statistical tests, which defines the probability that the test rejects the null-hypothesis although it is true (type I error, false positive rate). The most common choice for this threshold is. Several other factors may influence statistical power, however these factors typically have a smaller effect and they depend on various properties of the study. Power analysis can be performed a priori or post-hoc, i.e. before or after the analysis (sample collection) respectively. In the previous case the aim of the power analysis is to estimate the number of required samples for a chosen level of power given a predefined effect size and significance level. In the latter case, the power analysis can be used to estimate the power given the actual sample size, effect size and significance level. The a priori usage of power analysis is widely accepted, whereas the post-hoc application is strongly debated, since the statistical power will depend on the achieved p-value from the test. Misleading results may arise in such a case when the sample size would not be enough to detect an effect of a certain size that was actually detected, and for which the power analysis was performed. A possible way to perform power calculations is to fit a regression model using the before mentioned parameters with a maximum-likelihood method. Both Quanto [Gauderman and Morrison 2006] and the online accessible Genetic Power Calculator [Purcell et al.\ 2003] implement such an approach. In addition, there are other available software tools that are capable of performing power analysis. 11. References [Agresti 2002] A. Agresti, Categorical Data Analysis. Wiley-Interscience, New York,

72 [Armitage 1955] P. Armitage, Tests for linear trends in proportions and frequencies. Biometrics, 11(3): , [Balding 2006] D. J. Balding,A tutorial on statistical methods for population association studies. Nat. Rev. Genet., 7(10): , [Barrett et al. 2005] J. C. Barrett, B. Fry, J. Maller, and M. J. Daly, Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2): , [Benjamini and Hochberg 1995] Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., 57(1): , [Cochran 1954] W. G. Cochran, Some methods for strengthening the common chi-squared tests. Biometrics, 10(4): , [Dunn 1961] O. J. Dunn, Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52-64, [Durrant et al. 2004] C. Durrant, K. T. Zondervan, L. R. Cardon, S. Hunt, P. Deloukas, and A. P. Morris, Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am. J. Hum. Genet., 75(1):35-43, [Epstein and Satten 2003] M. P. Epstein and G. A. Satten, Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet., 73(6): , [Gauderman and Morrison 2006] W. J. Gauderman and J. Morrison, QUANTO 1.1: A computer program for power and sample size calculations for genetic-epidemiology studies. 1-48, [J. E. Wigginton et al.\ 2005] J. E. Wigginton, D. J. Cutler, and G. R. Abecasis, A note on exact tests of Hardy-Weinberg equilibrium, Am J Hum Genet, 76: , [Lake et al.\ 2003] S. L. Lake, H. Lyon, K. Tantisira, E. K. Silverman, S. T. Weiss, N. M. Laird, and D. J. Schaid, Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum. Hered., 55(1):56-65, [Liu et al.\ 2008] N. Liu, K. Zhang, and H. Zhao, Haplotype-association analysis. Adv Genet., 60: , [Marchini et al.\ 2007] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly, A new multipoint method for genome-wide association studies via imputation of genotypes, Nature Genetics, 39: , [Nolte et al.\ 2007] I. M. Nolte, A. R. devries, G. T. Spijker, R. C. Jansen, D. Brinza, A. Zelikovsky, and G. J. temeerman, Association testing by haplotype-sharing methods applicable to whole-genome analysis. BMC Proc., 1(Supp 1):S129, [Purcell et al.\ 2003] S. Purcell, S. S. Cherny, and P. C. Sham, Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics, 19(1): , [Purcell et al.\ 2007] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. debakker, M. J. Daly, and P. C. Sham, PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81(3): , [Satten and Epstein 2004] G. A. Satten and M. P. Epstein, Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet. Epidemiol., 27(3): , [Schaid 2004] D. J. Schaid, Evaluating associations of haplotypes with traits. Genet. Epidemiol., 27(4): , [Souverein et al.\ 2006] O. W. Souverein, A. H. Zwinderman, and M. W. T. Tanck, Estimating haplotype effects on dichotomous outcome for unphased genotype data using a weighted penalized log-likelihood approach. Hum. Hered., 61(2): ,

73 [Stephens and Balding 2009] M. Stephens and D.J. Balding, Bayesian statistical methods for genetic association studies. Nature Review Genetics, 10(10): , [Tzeng et al.\ 2003] J. Y. Tzeng, B. Devlin, L. Wasserman, and K. Roeder, On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet., 72(4): , [Tzeng 2005] J. Y. Tzeng, Evolutionary-based grouping of haplotypes in association analysis. Genet. Epidemiol., 28(3): , [Zhang et al.\ 2012] F. Zhang, X. Guo, S. Wu, J. Han, and Y. M. Liu, Genome-wide pathway association studies of multiple correlated quantitative phenotypes using principle component analyses. PLoS ONE, 7(12):e53320, [Zhao et al.\ 2003] J. Zhao, S. S. Li, and N. L. Khalid, A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet., 72(5): , [Zhao et al.\ 2006] J. Zhao, L. Jin, and M. Xiong, Nonlinear tests for genomewide association studies. Genetics, 174(3): , Analyzing gene expression studies Introduction DNA is a molecule that forms a double helix. The strands of the helix are each other's perfect complement: every adenine is flanked by a thymine, and every guanine is flanked by a cytosine on the other side. Hybridization is the process when two complementary strands of DNA (or RNA) bond to each other. Microarray technology takes advantage of this hybridization, and uses many single strand of a gene sequence segment (known as probes) attached to their surface to measure the quantity of RNA that are present in a sample. RNA delivers DNA's genetic message (the template of genes) to the cytoplasm where proteins are made by translating these templates to sequences of amino acids. Microarrays measure the expression levels (i.e. the amount of message in the form of RNA) of tens of thousands of genes in a single experiment. Color-labeled RNA is applied to the microarray, and if the RNA finds its complementary sequence on the array, then it hybridizes to it. The amount of color emitted by the array then tells how much RNA was produced for each gene. This enables scientists to compare the transcriptional profiles of biologic systems, processes and states of diseases in a hypothesis-free manner [67]. Microarrays have very diverse use cases: to classify diseases, to identify the effects of a specific treatment in vivo or in vitro, to find genes that may play a role in a specific disease or a specific biologic process [68]. In this chapter, we aim to provide a basic view of how to analyze microarray experiments. Instead of a comprehensive review of available computational tools, we focus only on the commonly used methods and approaches. First of all, there is a long way from measuring the raw intensities of the probes to obtaining genomic-level expression values of the genes and transcripts. In practice, various sources of variation step in which have to be accounted for, and a lot of manipulation has to be done to obtain accurate results. This process is referred to a pre-processing which is briefly described in Section 8.2. In Section 8.3, we focus on the relationship between the data and the biological question of interest: for example which genes are important in what situations? Which genes are expressed differentially between two (or more) states? Which biological processes are relevant in a certain situation? 
Note that throughout the chapter we only focus on single-channel microarrays in which RNA from a single sample is applied to the array (contrary to two-channel arrays in which differentially color-labeled RNA from two samples are hybridized to the same array and the intensity ratio of the two colors provide information about the differential expression of the two genes in the two samples) Pre-procession 65

Pre-processing consists of five tasks: (1) image analysis, in which we convert the pixel intensities of the scanned images into probe-level data; (2) background correction, in which we correct the measured probe intensities for noise and non-specific hybridization; (3) normalization, in which we correct for the many sources of variation in order to be able to compare measurements from different array hybridizations; (4) summarization, in which we summarize the background-adjusted and normalized intensities of the probes representing the same transcript into one quantity that estimates the amount of RNA transcript; and (5) quality assessment, to detect outlier measurements that go beyond the acceptable level of random fluctuations [69].

Background correction

After the image analysis, the first step of pre-processing is to eliminate the effect of background noise. This is important because the background noise distorts our estimates of differential expression. Consider the case when X1 and X2 are the true expression levels of a specific gene in two samples, and B is an approximately equal positive background intensity around the spots. In this case the true ratio would be X1/X2, but the observed ratio (X1 + B)/(X2 + B) is closer to 1 than the true ratio, and the more so the smaller the true expression levels are compared to the background intensity. Many background adjustment methods exist, for example the background correction part of the RMA algorithm developed by Irizarry et al. [70] or the MicroArray Suite (MAS) 5.0 algorithm of Affymetrix [71].

Normalization

The main goal of normalization is to make adjustments to the background corrected intensity data in order to make the measurements from different arrays comparable. Generally, the normalization methods fall into one of a few categories: (1) scaling, which assumes that each array should have a similar mean or median signal; (2) quantile normalization, which assumes that each array's signal intensity values should have the same distribution; (3) local regression (loess) normalization, which assumes that the technical biases are intensity dependent and fits loess curves to remove those biases; and (4) model-based normalization, which explicitly accounts for specific technical sources of variation in the signals by fitting specific models [72].

Scaling. Choose a baseline array, and scale all other arrays to have the same mean or median signal intensity as the chosen one. See Figure 39 for an example.

Quantile normalization. First, sort the signal intensity values separately in each array. Next, compute the average signal for each rank. Finally, assign the normalized signal of each probe set to be the average signal for its rank within the array. See Figure 40 for an example.
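A compact NumPy sketch of the quantile normalization recipe just described; it assumes a probes x arrays matrix and handles ties only through argsort order, which real implementations (e.g. in Bioconductor) treat more carefully.

import numpy as np

def quantile_normalize(intensities):
    # intensities: probes x arrays matrix of background-corrected signals
    order = np.argsort(intensities, axis=0)                   # sort each array (column) separately
    ranks = np.argsort(order, axis=0)                         # rank of every probe within its own array
    rank_means = np.sort(intensities, axis=0).mean(axis=1)    # average signal for each rank across arrays
    return rank_means[ranks]                                  # give every probe the mean signal of its rank

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.5, 6.0],
              [4.0, 2.0, 5.0]])
print(quantile_normalize(X))   # every column now has exactly the same distribution of values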

75 Summarization As each gene on the array is represented by a number of probes, summarization of these technical replicates (i.e. the probe sets) is needed to produce a single expression value for a gene. This can be done in several ways, for example averaging on the log transformed expression values, the log transform of the average of the original expression values, median on the log scale, log of the median on the natural scale, or with more sophisticated model-based methods [69] Filtering After normalization, it is a common practice to filter out some probe sets before subsequent analysis steps. Appling filtering has several reasons. First, the technical aspects of processing the array may introduce potential bias or variation resulting in outlier probe sets or unreliable expression values. Second, it is expected that depending of the experiment many genes are unexpressed across all experimental conditions. By filtering, we aim to identify and exclude the unreliable, invariant or unexpressed probe sets in order to get more accurate results of the following statistical analysis steps [72]. For example Kaminski and Friedman [68] proposed the following filtering steps: First, they define a set of "legal genes" that pass a certain threshold of expression in at least one of the arrays. They settle this threshold by hybridizing the same sample on two microarrays, and by comparing the expression values. As the consistency of the expression values is intensity dependent (i.e. the higher the intensity is, the higher the agreement between the two arrays), often there is a threshold above which the consistency of the two arrays is impressive. This usually reduces the number of genes by a third to a half. Next, they define a set of "active genes", in which something has happened to their expression level. They filter out those genes that didn't change at least 1.5-fold in any direction in at least of the experiments. This process usually greatly reduces the number of genes for the subsequent analysis steps Data Analysis Clustering Clustering is mainly a discovery tool in analysing microarrays. It is a collection of methods based rather on intuition than on formal theory. The main intuition behind it is to discover those groups of samples or genes that are in a way isolated from each other but meanwhile have an internal cohesion. These clusters may arise 67

naturally because of the object of our study. The number of different clustering methods is overwhelming. In this section, we briefly introduce the most frequently used ones and the theory behind them.

Clustering samples

The reasons why we cluster samples depend on the type of experiment we are conducting. In time-course experiments we sample an organism at different developmental stages. Here, with clustering, we can discover the (dis)similarities of these stages. For example, if we sample asthmatic individuals before, during and after asthma exacerbations, we can find out how much time it takes for the cells to return to their previous state. In comparative experiments we sample different individuals under different conditions to discover the effect of the latter on the gene expression. In these experiments we usually sample more individuals and technical replicates of the same condition. Here, clustering may aid in quality control, because if a sample does not cluster with its replicates (unlike the others), then this can reveal normalization or hybridization problems of the specific sample. In clinical experiments we sample individuals with the same phenotype (e.g. with breast cancer), with the a priori knowledge that the individuals may differ genetically. In this case, clustering samples can be very important, because it may reveal distinct groups of individuals with similar genotypes (i.e. with similar gene expression profiles). Before clustering we have to define two things: (1) What do we mean by the "internal cohesion" of a cluster? and (2) What do we mean by the "isolation" of distinct clusters?

Distance between samples

First of all, we have to define the distance between two data points. When the goal is to cluster samples, we can consider them as points represented by gene expression values in the high-dimensional space of genes. Then we can define the distance between the samples using geometric measures ($L_p$ norms): $d_p(a, b) = \left(\sum_i |x_{ia} - x_{ib}|^p\right)^{1/p}$, where $x_{ia}$ and $x_{ib}$ are the expression values of the $i$-th gene in samples $a$ and $b$, respectively. The larger the value of $p$, the more sensitive the measure is to outliers. The most robust is the Manhattan distance ($p = 1$). This is the sum of the absolute distances between equivalent genes on two different arrays. The Euclidean distance ($p = 2$) is more sensitive to outliers, and therefore is commonly used for quality assessment purposes, where detecting outlier arrays is more important.

Distance between clusters

Next, we have to define the distance between clusters of observations. What does "close" mean when we do not compare individual data points, but instead clusters of data points? This depends on how we condense each group of data points into a single representative point. The most commonly used linkage methods are: average linkage (the distance between two groups is the average of all pairwise distances), median linkage (the median of all pairwise distances), centroid linkage (the distance between the centroids of the two groups), single linkage (the smallest of all pairwise distances) and complete linkage (the largest of all pairwise distances).

Agglomerative hierarchical clustering

Agglomerative hierarchical clustering is one of the most commonly used clustering algorithms in microarray experiments. It has several advantages, e.g. its visualization (the well-known dendrogram) is easy to understand and may provide important insights that otherwise would remain hidden. It is especially useful in cases where the samples have a hierarchical nature.
For example, in a cancer tissue microarray experiment the different cancer

types are grouped in distinct clusters. Within these clusters several distinct genotypic clusters may exist, and on the final level, the technical replicates of each individual cluster together. In the process of agglomerative hierarchical clustering, we start by calculating all pairwise distances between all samples. Then the nearest data points are grouped together to form a cluster. After a new cluster is formed by agglomerating two clusters, we compute its distance from all other clusters. Then we search for the nearest pair of clusters to agglomerate, and so on. This results in an agglomerative process, in which single-member clusters are fused together to form bigger clusters. The resulting hierarchy can be visualized in a dendrogram (see Figure 41).

Principal Component Analysis, PCA

PCA is a well-known dimension reduction method that can be used to visualize high-dimensional data in two or three (or more) dimensions. PCA creates a new set of orthogonal axes that are linear combinations of the original ones (i.e. the original dimensions represented by the gene expression values). The first axis (i.e. the first principal component) will have the greatest variation in the data associated with it. The second component will be the axis orthogonal to the first one that has the greatest variation in the data associated with it. The third axis will be orthogonal to both, with the greatest variation along it; and so on. If there is correlation among the genes, then the first few axes can explain the majority of the variation in the data; therefore plotting the samples along the first few axes can reveal the (dis)similarities between them (see Figure 42).
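Both steps just described can be sketched in a few lines of Python; the example below assumes SciPy, scikit-learn and matplotlib, and the toy two-group expression matrix (rows = samples, columns = genes) is our own assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(5, 500)),    # 5 control samples
               rng.normal(1, 1, size=(5, 500))])   # 5 treated samples with a shifted mean
labels = ["ctrl"] * 5 + ["treat"] * 5

# Agglomerative hierarchical clustering with average linkage on Euclidean distances
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z, labels=labels)
plt.title("Hierarchical clustering of samples")
plt.show()

# PCA: project the samples onto the first two principal components
pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=["b"] * 5 + ["r"] * 5)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```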

Clustering genes

Besides clustering samples, identifying groups of similarly expressed genes (i.e. clustering genes) can be interesting as well. The motivation behind it is that coordinated expression (co-expression) of genes gives a measure of co-regulation, i.e. genes that behave similarly under different conditions share common characteristics, such as common regulatory mechanisms or common functions. Therefore, in the case of genes, similarity and dissimilarity measures are typically different than in the case of samples. The most frequently used dissimilarity measure is based on co-expression: $d_{ij} = 1 - r_{ij}$, where $r_{ij}$ is Pearson's correlation coefficient, given by $r_{ij} = \mathrm{cov}(x_i, x_j) / (\sigma_i \sigma_j)$, where $\mathrm{cov}(x_i, x_j)$ is the covariance and $\sigma_i$ and $\sigma_j$ are the standard deviations of $x_i$ and $x_j$, respectively. Several clustering methods exist for clustering genes besides the hierarchical clustering that we described before, including k-means [73], self-organizing maps [74], or graph theoretical approaches [75]. Among these we briefly introduce k-means clustering.

k-means clustering

In the iterative process of k-means clustering we first choose how many distinct clusters we expect. Then the algorithm randomly selects cluster centers of the genes in the space of samples, and assigns each gene to the cluster it is closest to. Next, the process adjusts the center of each cluster to minimize the sum of distances of genes in each cluster to their corresponding center. This results in new cluster centers (based on the centroid of genes in the cluster), and the method reassigns all genes to the nearest cluster, and so on, until convergence. This method generates a predefined number of clusters. One of its disadvantages is the lack of visualization of the results [68].
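A minimal sketch of the two gene-clustering ingredients above, assuming NumPy and scikit-learn; the genes-by-samples matrix and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
expr = rng.normal(size=(300, 12))        # 300 genes measured in 12 samples

# Dissimilarity d_ij = 1 - Pearson correlation between the profiles of genes i and j
corr = np.corrcoef(expr)                 # 300 x 300 gene-gene correlation matrix
dissimilarity = 1.0 - corr

# k-means groups genes by their expression profiles across the samples
k = 4                                    # assumed number of clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(expr)
print(np.bincount(km.labels_))           # number of genes assigned to each cluster
```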

Differential Expression

Differential expression measures the change in the expression level of genes among different conditions. For example, if the transcription level of some genes differs between healthy and diseased individuals, then these genes might be responsible for the specific disease.

Classical Hypothesis Testing

The most commonly used statistical technique to identify differentially expressed genes is the classical hypothesis testing approach [67]. For each gene, this tests the hypothesis that the particular gene is not differentially expressed, which is called the null hypothesis, $H_0$. Unless there is enough evidence against it, we cannot reject it in favor of the alternative hypothesis, $H_1$, which states that the specific gene is differentially expressed. Hypothesis testing is a method to summarize the evidence in the data (by calculating the so-called test statistic) in order to decide between the two hypotheses. Calculating the test statistic results in a probability (the so-called p-value) that expresses the absurdity of the null hypothesis. In other words, a p-value close to zero indicates that the null hypothesis is absurd and should be rejected in favor of the alternative hypothesis. The process of hypothesis testing can be illustrated as in Figure 43. The most popular statistic for testing the difference between two means (the means of the expression values observed in two different conditions) is the t-statistic. The test statistic is the standardized mean difference between the two conditions for gene $g$: $t_g = (\bar{x}_{gA} - \bar{x}_{gB}) / \sqrt{s_{gA}^2/n_A + s_{gB}^2/n_B}$, where $\bar{x}_{gA}$ and $\bar{x}_{gB}$ are the means of the expression values of gene $g$ for conditions $A$ and $B$, respectively; $s_{gA}^2$ and $s_{gB}^2$ are the variances; and $n_A$ and $n_B$ are the sizes of the samples in the two conditions. Under the null hypothesis, it can be seen [76] that the t-statistic is approximately t-distributed, and therefore the p-value can be calculated by comparing it to the Student t-distribution with the appropriate degrees of freedom. Many variants of the standard t-test have been introduced and used in microarray experiments. These either use bootstrap, permutation or variance-pooling approaches to relax the stringent assumptions of the original t-test. The most frequently used ones are limma [77] and Significance Analysis of Microarrays (SAM) [78].

Multiple Testing Problem

Statistical analysis of microarrays faces a serious problem that arises when more than one hypothesis is tested simultaneously, the so-called "multiple testing problem" [67]. No matter what statistical method we use, the larger the number of hypotheses, the more likely we are to observe extreme test statistics by chance, and therefore it becomes more and more likely that falsely rejected null hypotheses (i.e. false positives, type I errors) will creep in. There are many approaches to handle this problem, which differ in which error rate they control and how conservative they are.
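Before turning to the correction procedures, the sketch below shows the per-gene test that produces the p-values in the first place. It uses SciPy's ordinary Welch t-test rather than the moderated limma/SAM statistics, and the simulated two-condition data set is an assumption for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=7.0, scale=1.0, size=(2000, 6))   # 2000 genes x 6 arrays, condition A
group_b = rng.normal(loc=7.0, scale=1.0, size=(2000, 6))   # condition B
group_b[:50] += 2.0                                        # 50 truly differential genes

# Welch (unequal-variance) t-test computed gene by gene (axis=1 = per row)
t_stat, p_value = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
print((p_value < 0.05).sum())   # naive count of "significant" genes before any correction
```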

The most conservative method is the Bonferroni procedure, which controls the familywise error rate (FWER): the probability that among all genes that are not differentially expressed, at least one is incorrectly classified as differentially expressed. The Bonferroni procedure simply divides $\alpha$ (the desired FWER significance threshold) by the number of hypotheses. For example, to ensure that the familywise error rate is below $\alpha$ when we conduct $m$ statistical tests, we need to set the per-test acceptance threshold to $\alpha/m$. However, as microarray experiments are exploratory rather than confirmatory tools, controlling the false discovery rate (FDR) might be wiser. The FDR is the expected proportion of not differentially expressed genes among those that are declared differentially expressed. In other words, if our aim is to retrieve a set of hypotheses such that most of them are not spurious, then we should control the FDR. Benjamini and Hochberg proposed [79] a step-up procedure to control the FDR. The genes are ranked by their p-values, and are tested against an increasingly growing threshold. This results in a less conservative correction procedure that is frequently used in microarray analyses.

Biological Interpretation of Results

Statistical analyses result in (often long) lists of differentially expressed genes, some of which will be familiar to the experimenter, some of which will not. However, it is not straightforward to put them into a meaningful biological context by eye. In this section, we briefly introduce concepts that aid loading the results with biological meaning.

Gene Ontology Analysis

An elementary question could be: "What do the under- or overexpressed genes do in a cell?" or "What kind of biological function do they participate in?". In answering these questions Gene Ontology can come to our aid. Gene Ontology (GO) [80] is a standardized and structured vocabulary (i.e. an ontology) of biological terms, describing molecular functions, biological processes and cellular components, and the relations between these terms [81]. Besides, each gene is associated with the terms that describe its functionality. Therefore, if the previous statistical analysis steps yielded a list of under- or overexpressed genes between two conditions, then we can use a hypergeometric test to conclude which Gene Ontology terms are under- or overrepresented in it. Let us see the case when we want to compute the probability that a specific biological function is overrepresented in an interesting gene list (e.g. genes that are overexpressed in one condition compared to another). Consider an urn containing a ball for each gene ($N$ genes on the microarray) and imagine that those that are associated with the given biological function are colored white ($K$ genes), and all others that are not associated with it are colored black ($N - K$ genes). Next, we draw $n$ balls from the urn; these are the genes that are interesting to us (overexpressed according to our experiment). Among these, we see $k$ white balls; these are the interesting genes that are associated with the given biological function. Then the probability of drawing exactly $k$ such balls is given by the hypergeometric distribution: $P(X = k) = \binom{K}{k}\binom{N-K}{n-k} / \binom{N}{n}$. Therefore, under the assumption of no association between the biological function and the interesting gene list, the number of interesting genes with the function should follow a hypergeometric distribution. Based on the observed values, a p-value can be computed, and the null hypothesis can be rejected if the p-value is close to zero.
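A minimal sketch of both ideas, assuming SciPy and statsmodels: Benjamini-Hochberg FDR control over a vector of per-gene p-values, and the hypergeometric over-representation test with illustrative counts (the numbers are assumptions, not real annotation data).

```python
import numpy as np
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

# Toy p-values: most uniform (null), a few strongly significant genes
p_values = np.random.default_rng(3).uniform(size=2000)
p_values[:50] /= 1000

# FDR control with the Benjamini-Hochberg step-up procedure
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(rejected.sum(), "genes pass the 5% FDR threshold")

# Over-representation of a GO term in an interesting gene list:
N, K = 20000, 300      # genes on the array, genes annotated with the term
n, k = 500, 25         # interesting genes, interesting genes carrying the term
p_enrich = hypergeom.sf(k - 1, N, K, n)   # P(X >= k) under no association
print(p_enrich)
```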
If we test multiple hypotheses, correction for multiple testing is needed, as we discussed before. This analysis is available in many software tools, e.g. the BiNGO plugin [82] in Cytoscape [83].

Gene Set Enrichment Analysis

Gene Set Enrichment Analysis [84] is an important complementary method for loading gene expression data with biological meaning. It determines whether an a priori defined set of genes (e.g. genes with a specific biological function) shows statistically significant, concordant differences between two conditions (i.e. two biological states) [85]. The most important difference from the hypergeometric test (described before) is that GSEA does not require the discrimination between interesting and uninteresting genes. Rather, it uses a ranking

of all genes, where the ordering of genes is based on a continuous-valued score (e.g. the value of the t-statistic). It calculates an enrichment score (ES), which reflects the degree to which a predefined gene set is overrepresented at the top or bottom of the ranked list of genes. A positive ES indicates that the predefined gene set is enriched at the top of the ranked list (see Figure 44); a negative ES indicates that the predefined gene set is enriched at the bottom of the ranked list. The basic intuition behind GSEA is that an increase of 20% in all genes participating in a metabolic pathway may dramatically alter the flux through the pathway and may be more important than a 20-fold increase in a single gene [84]. The GSEA method is freely available as a software package [85], together with the MSigDB database of thousands of predefined gene sets (as of version v3.1).

13. References

[67] Ernst Wit and John McClure, Statistics for Microarrays: Design, Analysis and Inference. Wiley, 1st ed., July
[68] Naftali Kaminski and Nir Friedman, Practical approaches to analyzing results of microarray experiments. American journal of respiratory cell and molecular biology, 27(2): , August PMID:
[69] Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
[70] Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England), 4(2): , April PMID:
[71] Affymetrix Web Site.
[72] S. B. Pounds, C. Cheng, and A. Onar, Statistical Inference for Microarray Studies. In: D. J. Balding, M. Bishop, and C. Cannings, editors, Handbook of Statistical Genetics, pages John Wiley and Sons, Ltd,
[73] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak,

Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(6795): , August PMID:
[74] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America, 96(6): , March
[75] R. Sharan and R. Shamir, CLICK: a clustering algorithm with applications to gene expression analysis. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology, 8: , PMID:
[76] F. E. Satterthwaite, An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6): , December
[77] Gordon K. Smyth, Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology, vol. 3, issue 1, PMID:
[78] V. G. Tusher, R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98(9): , April PMID:
[79] Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1): , January
[80] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25(1):25-29, May PMID:
[81] Louis du Plessis, Nives Skunca, and Christophe Dessimoz, The what, where, how and why of gene ontology - a primer for bioinformaticians. Briefings in bioinformatics, 12(6): November PMID:
[82] Steven Maere, Karel Heymans, and Martin Kuiper, BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics (Oxford, England), 21(16): , August PMID:
[83] Michael E. Smoot, Keiichiro Ono, Johannes Ruscheinski, Peng-Liang Wang, and Trey Ideker, Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics (Oxford, England), 27(3): , February PMID:
[84] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov, Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43): , October
[85] GSEA.

Biomarker analysis

First we categorize recent challenges in biomarker discovery. Second, we summarize the conditional probabilistic approach to relevance and we introduce related structural properties (i.e., features) of Bayesian networks. Next, we discuss a scalable Bayesian relevance analysis based on Bayesian networks.

15. Notation

15.1. List of symbols


15.2. Acronyms

Introduction

The recent technological developments in life sciences, enabling the sequencing of genomes and high-throughput genomic, proteomic and metabolic techniques, have redefined biology and medicine and opened the genomic and post-genomic era. The grand promises of the post-genomic era are the "personal genome", the understanding of "normal" and disease-related genotype-phenotype relations, along with personalized prevention, diagnosis, drugs, and treatments. However, from a clinical perspective these "translational" promises are still to be fulfilled, and have gradually shifted further and further into the future. From a data analytic point of view the discovery of explanatory, diagnostic biomarkers has failed to satisfy the expectations, as exemplified by infamous problems and papers, such as the "missing heritability" [259] and [93], referring to the relatively modest explanatory power of identified genetic factors and the debated clinical validity and utility of identified biomarkers. The relatively modest performance of current methods with respect to the number of valid biomarkers is surprising if we consider the richness and volume of the information sources related to drugs, genes and diseases accumulated mainly in the last two decades. This paradoxical situation of a growing amount of data and knowledge coupled with lagging performance has initiated a wide range of transformations in statistical, biomedical, and pharmaceutical research; for example, new biomarker discovery methods were proposed, which are more knowledge-intensive, systems biology oriented, and amenable to interpretation and full-fledged decision support. This chapter summarizes the use of Bayesian networks to characterize four different aspects of statistically associated biomarkers:

1. Directness.
2. Causality.
3. Effect strength.
4. Interactions.

Background

In the predictive approach, the Feature Subset Selection (FSS) problem, i.e., the concept of relevance, can be defined specific to the applied model class used as a predictor, the optimization algorithm, the data set, and the loss function; its generalization leads to the wrapper approach [99]. In the filter approach, conceptualizations and methods for FSS rely on the following model-based, probabilistic definition of relevance for a set of variables [161].

Definition 1 (Markov boundary). A set of variables $X' \subseteq X$ is called a Markov blanket set of $Y$ w.r.t. the distribution $P$, if $(Y \perp X \setminus X' \mid X')_P$, where $\perp$ denotes conditional independence. A minimal Markov blanket is called a Markov boundary; its indicator function is denoted by $\mathrm{MBS}(X', Y)$.

A conditional probabilistic version of relevance types, free of model class, optimization, dataset or loss function, is defined as follows:

Definition 2. A feature (input variable) $X_i$ is strongly relevant to $Y$, if there exist values $x_i$, $y$ and an assignment $s_i$ of $S_i = X \setminus \{X_i\}$, such that $P(x_i, s_i) > 0$ and $P(y \mid x_i, s_i) \neq P(y \mid s_i)$. A feature $X_i$ is weakly relevant, if it is not strongly relevant, and there exists a subset of features $S'_i \subseteq S_i$ for which there exist some $x_i$, $y$ and $s'_i$ such that $P(x_i, s'_i) > 0$ and $P(y \mid x_i, s'_i) \neq P(y \mid s'_i)$ [99]. A feature is relevant, if it is either weakly or strongly relevant; otherwise it is irrelevant.

Bayesian networks (BNs) and their properties offer a wide range of options for representing relevance [161]. The following theorem gives a sufficient condition for the unambiguous BN representation of the relevant features.

Theorem 1. For a distribution $P$ defined by a Bayesian network $(G, \theta)$, the variables $\mathrm{bd}(Y, G)$ form a Markov blanket of $Y$, where $\mathrm{bd}(Y, G)$ denotes the set of parents, children and the children's other parents of $Y$ in $G$ [161]. If the distribution $P$ is stable w.r.t. the DAG $G$, then $\mathrm{bd}(Y, G)$ forms a unique and minimal Markov blanket of $Y$, and $X_i \in \mathrm{bd}(Y, G)$ iff $X_i$ is strongly relevant [108].

We also refer to $\mathrm{bd}(Y, G)$ as the Markov blanket set of $Y$ in $G$, using the notation $\mathrm{MBS}(Y, G)$, under the implicit assumption that $P$ is Markov compatible with $G$ (footnote 14). The induced (symmetric) pairwise relation between $X_i$ and $Y$, i.e. $X_i \in \mathrm{MBS}(Y, G)$, is called Markov blanket membership, $\mathrm{MBM}(X_i, Y)$. The sufficiency and necessity of the predictors in the Markov boundary for the prediction or diagnosis of a given target variable provides a unique, central position for the concept of Markov boundary. Fig. 45 shows a real-world Markov boundary for the prediction and diagnosis of ovarian cancer [231].

Footnote 14: Note that in typical Bayesian scenarios (e.g., in case of the Dirichlet distributions applied in the paper to specify the parameters), the parameterizations encoding independencies beyond the graph-theoretic neighborhood have measure 0 [160], i.e., $\mathrm{bd}(Y, G)$ is the unique Markov boundary with probability 1.
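The graph-theoretic reading of Theorem 1 is easy to make concrete. The sketch below (Python with networkx; the toy DAG and node names are assumptions) collects the parents, the children and the children's other parents of a target node.

```python
import networkx as nx

def markov_blanket(G, y):
    """Markov blanket of node y in DAG G: parents, children and co-parents (Theorem 1)."""
    parents = set(G.predecessors(y))
    children = set(G.successors(y))
    spouses = {p for c in children for p in G.predecessors(c)} - {y}
    return parents | children | spouses

# Toy structure: X4 -> X1 -> Y -> X2 <- X3, X2 -> X5
G = nx.DiGraph([("X4", "X1"), ("X1", "Y"), ("Y", "X2"), ("X3", "X2"), ("X2", "X5")])
print(markov_blanket(G, "Y"))   # {'X1', 'X2', 'X3'}: parent, child, child's other parent
```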

Bayesian multilevel analysis of relevance

Earlier works on using Bayesian network properties in relevance analysis include the Markov Blanket Approximating Algorithm [101], its recent extensions [110], and the IAMB algorithm and its variants [87, 108 and 109]. Besides these deterministic, maximum likelihood, or maximum a posteriori (MAP) identification methods, stochastic and Bayesian approaches were proposed as well (for an ad hoc randomized approach, see [107]). In the computationally more demanding Bayesian approach, we are interested in the posteriors of various model properties expressing relevance for a given target variable. In earlier works the goal was the overall characterization of the domain using edge and MBM posteriors [97, 100 and 102]. The FSS problem can be extended to include the identification of the interaction structure of the relevant variables, i.e., the use of the Markov Blanket Graph (MBG) feature (property), a.k.a. the classification subgraph [86 and 91].

Definition 3 (Markov Blanket Graph). A subgraph of the Bayesian network structure $G$ is called the Markov Blanket Graph or Mechanism Boundary Graph $\mathrm{MBG}(Y, G)$ of variable $Y$ if it includes the nodes in the Markov blanket $\mathrm{bd}(Y, G)$ and the incoming edges into $Y$ and into its children.

For a probabilistic and causal interpretation of MBGs, a representation of observation equivalent MBGs, bounds for their cardinality and their use in prediction, see [86 and 91]. An important property of the MBG is that it is sufficient for relevance analysis in case of complete data (which is a direct consequence of Th. 1). Unfortunately, the MBG posterior is not computationally tractable, but it is easy to show that the ordering-conditional posterior can be computed in polynomial time, which can be exploited in ordering-MCMC methods [91]. Note that the MBM and the MBS or MBG concepts reflect two different approaches to Bayesian network properties. The first approach provides an overall characterization as a fragmentary representation, and the number of features and feature values is tractable (e.g. linear or quadratic in the number of variables). Such features are pairwise edges, compelled edges, and Markov blanket relations. At the other extreme of feature learning we find the identification of arbitrary subgraphs with statistical significance [106]. The use of subgraphs restricted to Markov Blanket Graphs is advantageous, because it is a focused representation from a single, but complex point of view (i.e., from the point of view of the FSS problem). The Bayesian Multilevel Analysis of relevance (BMLA) goes one step further and yields a comprehensive view of multiple levels. It allows for the calculation and cross-linking of the posteriors corresponding to features, sets of features, and (sub)graph models of features and a target variable. Following our assumption about the underlying BN representation, this implies the calculation of the posteriors for the Markov Blanket Memberships, Markov Blanket sets, and Markov Blanket graphs. Further levels would also be possible, either using domain-specific knowledge for defining groups of variables w.r.t. their types, or collapsing the MBG

space to the space of class-focused restricted partially directed acyclic graphs (C-RPDAGs) [86]. Note that the MBM, MBS, and MBG features form a hierarchy of increasing complexity.

Multivariate scalability: k-MBS and k-MBG features

The multiple levels in BMLA offer a wide range of analyses at multiple abstraction levels (i.e., with varying complexity). However, the MBG and MBS features are much more expressive than the edge and MBM features; e.g. their cardinalities are superexponential, exponential, and linear for a given target, respectively. Consequently, the MBG and MBS posteriors are often too "flat" (i.e., there are hundreds of MBS or MBG feature values with moderately high posteriors), even when the MBM posteriors are peaked (for further details see [91]). Typically - even in the "flat" posterior case - the most probable MBS and MBG feature values often share a significant common part. To handle this we define concepts between MBMs and MBSs, and between edges and MBGs, which are focused on target variables and have intermediate, scalable complexities.

Definition 4 (k-MBS). For a distribution $P$, if all the variables of a set $\mathbf{s}$ with $|\mathbf{s}| = k$ are members of a Markov Boundary set of $Y$, then $\mathbf{s}$ is called a k-ary Markov Boundary subset (footnote 15). The graph-theoretic characterization of the concept is as follows.

Proposition 1. For a stable distribution $P$ defined by a Bayesian network $(G, \theta)$, $\mathbf{s}$ is a k-ary Markov Boundary subset of $Y$ iff $\mathbf{s} \subseteq \mathrm{bd}(Y, G)$ and $|\mathbf{s}| = k$ (otherwise $\mathrm{bd}(Y, G)$ may not be minimal).

The concept offers scalable features for the analysis of relevance, as their cardinalities are polynomial. In practice this means that we can analyze the most probable feature values in a relatively large range of $k$ values dictated by the power of the data (i.e., where the posteriors are peaked). The posteriors for k-MBS and k-MBG can be derived off-line from the estimates of the MBS and MBG posteriors. The maximum value of $k$ at which model properties (feature values) with high probability are present is problem-dependent. Reasonable limits can be found either by a bottom-up or a top-down approach, starting from $k = 1$ or from the full Markov boundary, respectively (note that for intermediate values of $k$ the number of feature values is computationally not tractable).

A knowledge-rich aggregation of input features

An attractive property of the Bayesian approach to relevance is that the model posterior can be transformed and interpreted without theoretical restrictions. In our case, using the space of Bayesian network structures, it means that the posterior can be aggregated by any partitioning over model structures, where each partitioning offers a potentially different interpretation. However, only a few partitions have a general or domain-specific meaning. Besides noninformative model aggregation, prior domain knowledge can be used as well to define interesting partitions. As with the noninformative aggregation, such an aggregation can (1) provide a more general description of relevance relations in the domain, and (2) yield more confident numerical results. E.g., a straightforward way to augment the space of genetic variants is to introduce the level of genes. On the level of genes, we have calculated the aggregated versions of the Markov blanket membership and Markov blanket set relations. The corresponding equations are derived from their counterparts belonging to the more specific SNP level, e.g.
$P(\mathrm{MBM}(G_j, Y) \mid D) = P\big(\bigvee_{S_i \in G_j} \mathrm{MBM}(S_i, Y) \mid D\big)$

(where $Y$, $G_j$, and $S_i$ respectively denote the target, gene, and SNP variables).

Footnote 15: Because $P$ is stable with probability 1 in case of the Dirichlet distributions applied in the paper to specify the parameters [160], we also use the indicator function assuming that $P$ is compatible with $G$. However, in regard to the possible non-stable cases with potential non-minimality of $\mathrm{bd}(Y, G)$, we call these sets in general k-ary Markov Blanket subsets.
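In practice such gene-level posteriors can be estimated from the same Monte Carlo run that provides the SNP-level features. The following sketch (plain Python; the posterior samples, SNP names and SNP-to-gene map are toy assumptions, not output of any specific BMLA implementation) counts, for each gene, the fraction of sampled Markov blanket sets that contain at least one of its SNPs.

```python
# Each entry is the Markov blanket set of the target in one posterior (MCMC) sample.
mbs_samples = [
    {"snp1", "snp3"}, {"snp1"}, {"snp2", "snp3"}, {"snp1", "snp4"},
]
gene_map = {"GENE_A": {"snp1", "snp2"}, "GENE_B": {"snp3", "snp4"}}

def gene_level_mbm_posterior(samples, snps):
    """P(at least one SNP of the gene is in the Markov blanket of the target)."""
    hits = sum(1 for mbs in samples if mbs & snps)
    return hits / len(samples)

for gene, snps in gene_map.items():
    print(gene, gene_level_mbm_posterior(mbs_samples, snps))
# GENE_A: 1.0 (present in all 4 samples), GENE_B: 0.75 (present in 3 of 4 samples)
```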

Interaction, redundancy based on posterior decomposition

In relevance analysis we typically focus on high-scoring subfeatures, although low probabilities may also indicate important relations, because composite measures representing high-level semantic properties can be constructed. Such a score for the discovery of interaction and redundancy can be constructed using the exact posterior and its approximation as the product of the Markov Blanket Membership probabilities of each member variable in the given subset $\mathbf{s}$, as if their occurrences were independent:

$\hat{P}(\mathrm{MBS}(\mathbf{s}, Y) \mid D) = \prod_{X_i \in \mathbf{s}} P(\mathrm{MBM}(X_i, Y) \mid D)$

These approximations, related to the decomposability of the structure posterior, enable a direct Bayesian approach to the concepts of redundancy and interaction. If the higher-order posterior is larger than the approximation based on lower-order posteriors, it may indicate that the subset contains interacting features. In the opposite case, it may indicate the redundancy of features. This is formalized in the following definition, which can be generalized to multiple variables and to orders higher than one (i.e., not only to MBMs, which are 1-MBSs).

Definition 5 (Interaction and redundancy). The features in $\mathbf{s}$ (with $|\mathbf{s}| = k$) are 1,k-product interacting (redundant), if the posterior $P(\mathrm{MBS}(\mathbf{s}, Y) \mid D)$ is larger (smaller) than $\prod_{X_i \in \mathbf{s}} P(\mathrm{MBM}(X_i, Y) \mid D)$.

The task of finding redundant subfeatures can be regarded as the complement of finding stable subfeatures; e.g. in the first case we are looking for those elements which often supplement the stable parts of features.

Relevance for multiple targets

If there are multiple possible target variables which have to be examined together and the relations among them are irrelevant, one may ask for the variables relevant to the target set. Note that this is similar to the aggregation of input features in Section 9.5, but in this case the target variables are "aggregated". Fortunately, the basic concepts of relevance discussed earlier can easily be extended to use target sets instead of a single target node.

Definition 6 (Multi-target relevance). A feature (stochastic variable) $X_i$ is strongly (weakly) relevant to a target set $\mathbf{Y}$, if it is strongly (weakly) relevant to any $Y \in \mathbf{Y}$.

It is easy to see that the union of the MBSs of the targets, except the elements of the target set itself, is a Markov Blanket set for the target set.

Proposition 2. If $\mathbf{X}' = \bigcup_{Y \in \mathbf{Y}} \mathrm{MBS}(Y) \setminus \mathbf{Y}$, then $\mathbf{X}'$ is a Markov blanket of $\mathbf{Y}$ w.r.t. the distribution $P$.

An equivalent proposition can be stated for Markov boundaries, although the effects of logical dependencies should be handled appropriately. Note that the posterior for a given target set cannot be calculated from the posteriors corresponding to the members of any partitioning of the target set, because of the dependencies. However, posteriors corresponding to subsets of the target set can be used for an approximation, e.g. in the case of MBMs and single target variables.

Still, in the case of MBMs, if the posteriors are available for all subsets of the target set, then for any variable $X_i$ we can compute, inductively, the posterior probability that $X_i$ is a Markov blanket member for each target in the set.

The extension of the MBG concept to multiple targets is similarly straightforward, which again defines the necessary and sufficient dependency structure and parameters for predicting the targets under general conditions.

Definition 7. A subgraph of the Bayesian network structure $G$ is called the Markov Blanket Graph of the set of variables $\mathbf{Y}$ if it includes the nodes in the Markov blanket defined by $\mathbf{Y}$ and the incoming edges into $\mathbf{Y}$ and into its children.

Conditional and contextual relevance

The fundamental definitions of relevance in Def. 1 are based on the general concept of conditional independence. However, as conditional independence can be made more specific by introducing contextual independence, we can introduce the concept of contextual relevance to support a more refined analysis. Recall that contextual independence is a specialized form of conditional independence, i.e., when conditional independence is valid only for a certain value of another disjoint set (for its use in the context of Bayesian networks, see e.g. [92]). Let us denote the contextual independence of $X$ and $Y$ given $Z$ and context $C = c$ by $I_c(X; Y \mid Z)$, that is, $P(X \mid Y, Z, C = c) = P(X \mid Z, C = c)$.

An analogous extension for relevance is as follows.

Definition 8 (Contextual Irrelevance). Assume that $X$ is relevant for $Y$ given a conditioning set $Z$. We say that $X$ is contextually irrelevant if there exists some value $c$ of a context set $C$ for which $I_c(X; Y \mid Z)$ holds.

For completeness, recall the definition of conditional relevance.

Definition 9 (Conditional Relevance). Assume that $Z$ is relevant for $Y$. We say that $X$ is conditionally relevant for $Y$ given $Z$ if $X$ and $Y$ are independent, but they are dependent conditioned on $Z$.

This definition applies to both weak and strong relevance. Note that conditional relevance and contextual irrelevance are independent, although typically somewhat opposite concepts. In the case of conditional relevance, we have to know the value of a relevant feature to ensure the relevance of an otherwise irrelevant feature. Whereas in the case of contextual irrelevance there should be a value whose knowledge makes an otherwise relevant feature irrelevant. The BMLA method based on standard BNs allows for a model-based Bayesian inference about conditional relevance. However, to handle contextual relevances, a Bayesian network representing contextual dependencies is necessary, e.g., using decision trees as local dependency models [92].

Posteriors for the predictive power of input features

Since the wrapper approach in practice is based on predictive power and the filter approach is based on some model-based relevance, their relation is an open issue and their joint usage needs caution, just as using filter approaches to support predictive model construction does (e.g., see [96]). The Bayesian analysis of relevance based on Bayesian networks in general corresponds to the model-based approaches, but it is specialized as much as possible towards the predictive approach by collapsing the structure posterior into a simpler space of complex structural features representing exactly the predictive aspects (e.g. the

MBG feature is a sufficient and necessary feature for prediction under broad conditions [91 and 88]). Although the relation of the model-based and predictive approaches is outside the scope of the paper, we briefly summarize a parallel Bayesian view of quantitative, prediction-oriented model properties. The use of Bayesian networks as conditional predictors allows for the definition of quantitative model properties (features) for a given input-output relation, expressing the predictive power of the input features on a given data set. Such features (assuming a binary target variable) are the following: the Misclassification Rate, the Odds Ratio, and the Area Under the (ROC) Curve [90]. These random variables are defined by the Bayesian network for a given input-output relation and on a given external data set (the posterior is typically defined by a different training data set). Note that by having a fully specified Bayesian network (i.e., the input distribution as well) we can define and use these random variables exclusively based on data.

Algorithmic aspects and applications

The Bayesian inference over structural properties of Bayesian networks was proposed in [146 and 150]. In [102], Madigan et al. proposed a Markov Chain Monte Carlo (MCMC) scheme to approximate such Bayesian inference. The MCMC method over the DAG space was improved by Castelo et al. [98]. In [97], Friedman et al. reported an MCMC scheme over the space of orderings. In [100], Koivisto et al. reported a method to perform exact full Bayesian inference over modular features. To estimate the posteriors, a DAG-based or an ordering-based Metropolis Coupled Markov Chain Monte Carlo (MC3) method can be applied [91, 97 and 98]. This methodology was applied in a series of studies investigating the genetic background of allergy, asthma, autoimmune diseases, leukemia, heroin dependence, impulsivity and depression.

Summary

The Bayesian network-based Bayesian multilevel analysis of relevance extends the repertoire of statistical biomarker analysis by providing a comprehensive, hierarchical overview of the types of associations, relevances and interactions, specifically in the case of complex phenotypes. It is capable of incorporating a wide range of prior knowledge, thus it is especially applicable to the analysis of data with small sample size. The exact modeling of interactions by the MBG features using Bayesian networks and the Bayesian approach to the feature subset "selection" problem offered a principled solution for quantifying the uncertainty in inferring relevant features and their joint interactions. The joint usage of different feature levels has multiple advantages: we can better understand the types of the relevance relations, and the necessity and possibility of the multivariate, and the multivariate-interactionist analysis. The Bayesian network properties for relevance analysis have scalable polynomial complexity. Finally, the approach can be extended to the case of multiple targets and it can be used to quantify interactions and redundancies.

16. References

[86] S. Acid, L. M. de Campos, and J. G. Castellano. Learning Bayesian network classifiers: searching in a space of partially directed acyclic graphs. Machine Learning, 59: ,
[87] C.F. Aliferis, I. Tsamardinos, and A. Statnikov. Large-scale feature selection using Markov blanket induction for the prediction of protein-drug binding,
[88] P. Antal. Integrative Analysis of Data, Literature, and Expert Knowledge. Ph.D. dissertation, K.U.Leuven, D/2007/7515/99,

[231] P. Antal, G. Fannes, Y. Moreau, D. Timmerman, and B. De Moor. Using literature and data to learn Bayesian networks as clinical models of ovarian tumors. Artificial Intelligence in Medicine, 30: ,
[90] P. Antal, G. Fannes, D. Timmerman, Y. Moreau, and B. De Moor. Bayesian applications of belief networks and multilayer perceptrons for ovarian tumor classification with rejection. Artificial Intelligence in Medicine, 29:39-60,
[91] P. Antal, G. Hullám, A. Gézsi, and A. Millinghoffer. Learning complex Bayesian network features for classification. In Proc. of third European Workshop on Probabilistic Graphical Models, pages 9-16,
[92] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Eric Horvitz and Finn V. Jensen, editors, Proc. of the 20th Conf. on Uncertainty in Artificial Intelligence (UAI-1996), pages Morgan Kaufmann,
[93] L. Buchen. Missing the mark. Nature, 471(7339): ,
[146] W. L. Buntine. Theory refinement of Bayesian networks. In Proc. of the 7th Conf. on Uncertainty in Artificial Intelligence (UAI-1991), pages Morgan Kaufmann,
[150] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9: ,
[96] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29: ,
[97] N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50:95-125,
[98] P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50: ,
[99] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97: ,
[100] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5: ,
[101] D. Koller and M. Sahami. Toward optimal feature selection. In International Conference on Machine Learning, pages ,
[102] D. Madigan, S. A. Andersson, M. Perlman, and C. T. Volinsky. Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods, 25: ,
[259] B. Maher. Personal genomes: The case of the missing heritability. Nature, 456(7218):18-21,
[160] C. Meek. Causal inference and causal explanation with background knowledge. In Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence (UAI-1995), pages Morgan Kaufmann,
[161] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA,
[106] D. Pe'er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, Proceedings of ISMB 2001, 17(Suppl. 1): ,
[107] J.M. Pena, R. Nilsson, J. Bjorkegren, and J. Tegnér. Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45: ,
[108] I. Tsamardinos and C. Aliferis. Towards principled feature selection: Relevancy, filters, and wrappers. In Proc. of the Artificial Intelligence and Statistics, pages ,

[109] I. Tsamardinos, C.F. Aliferis, and A. Statnikov. Algorithms for large-scale local causal discovery and feature selection in the presence of limited sample or large causal neighbourhoods. In The 16th International FLAIRS Conference,
[110] Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5: ,

Network biology

Introduction

In the first decade of the 21st century, a new era has dawned upon the biomedical research community. Most commonly known as the post-genomic era, this new age has brought a holistic, system-wide view of cellular components and processes, considering complex relations and interaction patterns instead of individual entities, e.g. genes or proteins. Advancements in computer and measurement technologies have led to a huge increase in the amount of heterogeneous biological data, stemming from multiple 'omic levels and posing new challenges for modern-day scientists. Systems biology seeks to provide insights and tools to cope with this challenge by employing comprehensive methodologies operating on multiple levels of cellular biology. When it comes to mathematical frameworks for systems biology, a natural choice is the application of network theory. Embracing the fact that the whole is more than the sum of its parts, network theory is a large sub-field of graph theory concerning itself with relations between discrete entities, relational patterns and emergent properties of networks. However, the collective term "network" is somewhat sloppy and can refer to several concepts, which have varying degrees of usefulness in systems biology. Turning to the details, one can set up roughly four conceptual levels:

1. Similarity networks, e.g. sequence similarity networks, are numerous and can be easily generated from arbitrary similarity matrices. Although they can be very useful in several applications, they are not quite at the level of the quantitative models described in levels 3 and 4.

2. Descriptive graphs, e.g. protein interaction networks, represent the mainstream of network biology; many people consider this level to be THE level of systems biology.

3. Independence maps and causal diagrams, e.g. Bayesian networks, are very popular in bioinformatics, although traditionally they are considered to be more like a statistical approach instead of a part of network theory and network biology.

4. Quantitative regulatory networks are sophisticated mathematical models of various cellular processes or functions, often utilizing ordinary and partial differential equations to model biochemical reactions.

In this chapter, we give an introduction to the basic concepts in descriptive network theory, which fall mostly in the first two categories and do not necessarily have generative qualities. Some might argue that it is exactly the latter two categories where "real" systems biology lives; however, there is no consensus on what scientists mean when they refer to systems (network) biology. Finally, it should be noted that an exhaustive review is far beyond the scope of this book and therefore we refer the interested reader to [111 and 112] for a comprehensive discussion on this subject.

Biological networks

Biological systems, from simple cells to whole ecosystems, are characterized by complex interactions among their constituents.
Biological networks aiming to describe these interactions come in many flavors, the most common types being:

Sequence/structural similarity networks can be defined whenever one can design a similarity measure for any pair of entities. Entities most commonly include genes, proteins, small molecules (e.g. drugs), or, leaving behind structures and sequences, more exotic objects (e.g. diseases or gene expression profiles). This kind of

network is very popular and can be used for a wide range of purposes, including the prediction of function and interactions [113 and 114], or drug discovery [115].

Protein-protein interaction networks (PPI, PINs) are built from physical protein binding data, usually measured by high-throughput experiments. They are primarily used to predict protein functions by analyzing their interactions with other molecules. Publicly available databases include DIP [116], MINT [117] and several other resources.

Metabolic networks are used to investigate metabolic pathways in living organisms. They are built upon enzymes, their substrates and products (metabolites), and catalyzed reactions to model various biochemical processes. Examples of the most widespread, public databases are KEGG [118] and BioCyc [119].

Signal transduction networks focus on signal transmission, relevant molecular pathways, and cross-talk mechanisms. Examples of the former include MiST [120] and TRANSPATH [121]; for a database focusing on cross-talk mechanisms, see the SignaLink resource [122].

Gene regulatory networks (GRNs) address the regulation of gene expression, including the regulatory regions of genes, transcription factors, RNA interference, post-translational modifications and interactions with other factors. For public databases, see e.g. JASPAR [123] or TRANSFAC [124].

Other, integrated networks seek to combine multiple heterogeneous resources into a single network and provide a unified view of the entities. Examples include multi-layered regulatory networks, drug-disease-gene networks and many others; see e.g. the Connectivity Map, which integrates diseases, small molecules and gene expression data [125].

Basics of graph theory

In this Section, we provide some definitions from the field of graph theory. A graph is a collection of vertices and edges, denoted by the ordered pair $G = (V, E)$, where $V$ is a set of vertices (or nodes) and $E$ is a set of edges (or links). Each edge can be represented as a pair of vertices from $V$, i.e. an edge always connects two vertices (which become adjacent vertices), although these two might be the same. It turns out that, in many scenarios, edges have to be directed - just think of a family tree, which is an example of a directed graph. In that case, edges are ordered pairs of vertices; in other cases, where the relations are symmetric, unordered pairs suffice, and therefore we talk about an undirected graph. A special case of directed graphs is that of directed acyclic graphs (DAGs), which, as the name suggests, do not contain cycles, and are very important in many applications. In some cases, it might also be useful to assign numerical values to the edges; these are called weighted edges, and the graph is a weighted graph. The number of edges which connect to a vertex is called the degree of the vertex. A regular graph is a graph in which all vertices have the same degree. A complete graph is a special case of regular graphs, where every two vertices are connected by an edge. Obviously, not all graphs are complete; in fact, they do not even need to be connected. A graph is connected if there exists a path between any two vertices; otherwise it is disconnected. A subgraph of a graph consists of some vertices and some edges of the graph, where each selected edge connects two of these selected vertices. The maximal (largest possible) connected subgraphs of a graph are called components, i.e. a disconnected graph has multiple components, while a connected one has exactly one.
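A few of these definitions in executable form, assuming the networkx library and a small toy graph of our own choosing:

```python
import networkx as nx

# Two components: the triangle {1, 2, 3} and the edge {4, 5}
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5)])

print(dict(G.degree()))                               # degree of each vertex
print(nx.is_connected(G))                             # False: the graph is disconnected
print([sorted(c) for c in nx.connected_components(G)])  # its two components
```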
Complete subgraphs of a graph are called cliques; a clique that cannot be extended by adding another vertex is a maximal clique. There is a special class of graphs called bipartite graphs, whose vertices can be divided into two disjoint sets such that no two vertices within the same set are connected - imagine a checkerboard, in which every black field is adjacent only to white ones, and vice versa. Finally, a cluster is a subset of vertices which are strongly interconnected, i.e. they are much more connected to each other than to those outside the cluster. To measure the degree of clustering in a graph, one can employ various definitions of the clustering coefficient. Other important measures include the shortest path and mean path length, network centralization, node centralities (e.g. degree, betweenness, closeness, eigenvector, etc. centralities) and graph density; these are beyond the scope of this book and we refer to [111 and 112] for the formulas and more details.

Network analysis

Network analysis concerns itself with the qualitative and quantitative properties of a given network, including its underlying design principles, functional organization, local patterns, emergent properties and dynamic behavior. Being a highly interdisciplinary field, its applicability is not restricted to network biology; similar tools are used in telecommunication, social network analysis and a vast number of other domains.

Network topology

Network topology refers to the arrangement of nodes and links in a network, i.e. it describes how the nodes in the network are connected and how they communicate. As seen in Section 10.3, graphs often have well-defined structural elements (e.g. cliques, clusters); in this Section, we introduce more of these structures frequently investigated in network analysis, which influence the behavior of the whole network.

A node with a much higher number of connections than the average is called a hub. Hubs tend to be key elements of the network, in the sense that deleting such nodes usually causes the network to deteriorate quickly, decomposing into isolated clusters of nodes. This phenomenon is called the centrality-lethality rule in PPI networks, as hubs often correspond to essential proteins. Other local topological structures are motifs (significantly over-represented directed subgraphs) and graphlets (undirected ones).

Using the nomenclature of network biology, modules are roughly the same as clusters. They often correspond to functional subsystems, e.g. certain cellular processes or functions. In complex systems, modules usually interact with each other via overlaps or bridges (nodes that connect two or more modules). If a bridge happens to be the only connector between two modules, it is called a bottleneck. Modules can also exhibit a hierarchical behavior, i.e. a collection of interacting smaller modules gives rise to larger, less cohesive ones. Network clustering, an intensively researched field, refers to the act of identifying modules using a broad array of techniques (including graph theoretic, statistical and machine learning approaches).

Node centrality, in general, refers to the presence of highly influential nodes, i.e. which, if any, of the nodes act as a global "coordinator" of the network (and therefore possess a high centrality). Centrality measures were mentioned in the previous Section. A related concept is network centralization, which takes the distribution of node centralities into account and therefore refers to the whole network; highly centralized networks exhibit a star-like topology, while at the other end of the spectrum node centralities are more evenly distributed. A subnetwork consisting of high-centrality nodes is called a network skeleton.

Finally, one of the most amazing features of real-world networks is the surprisingly small average path length, despite the huge size of the network - a property often referred to as small-worldness. The term stems from social sciences and the experiments of Stanley Milgram, with the concept first popularized by the Hungarian novelist Frigyes Karinthy, who suggested that any person in the world can be reached through links of personal acquaintance in at most five steps (a concept later refined as "six degrees of separation").

Network models and dynamics

Many real-world networks, particularly the ones modeling biological systems, are constantly changing and evolving over time; network dynamics is a rapidly developing field investigating these temporal aspects.
In order to understand the various properties of complex networks, a reasonable approach is to take a closer look at their formation and evolution, and discover the underlying principles. Models of complex networks are essentially "prototypes" of their real-world counterparts, and seek to provide insights into how these properties emerge from a simple set of design rules. Several such models were developed in the last fifty years, the most famous ones being the Erdős-Rényi model [126], the Watts-Strogatz model [127] and the Barabási-Albert model [128]. The Erdős-Rényi model is one of the simplest models introduced to describe random graphs. The construction scheme involves taking $n$ nodes and selecting $m$ of the $\binom{n}{2}$ possible edges in a random manner. Instances of this model show small-worldness; however, they have only small variance in the node degrees, i.e. they fail to explain the clustering behavior of real-world networks (e.g. the formation of hubs).

The Watts-Strogatz model reproduces both the small-world property and local clustering. All nodes are arranged on a ring and each node is connected to its $k$ nearest neighbors. Then each edge is rewired with a small probability $p$, i.e. one end is connected to a randomly selected node, which introduces small-worldness to the model. If $p$ is sufficiently small, but not extremely small, the network will still have a fair amount of local clustering; for $p = 1$, one essentially recovers the Erdős-Rényi model. The Barabási-Albert model exhibits not only the properties above, but a scale-free degree distribution as well (which is frequently observed in real-world networks, such as biological ones or the Internet - see the next section). The key idea of this model is the application of growth and preferential attachment. New nodes are repeatedly added to the network (growth) and connected to others with a probability depending on their current degree, i.e. the new node prefers to attach to those already well-connected (a concept often expressed as "the rich get richer"). In fact, the preferential attachment process reflects the formation rules of many empirical networks (e.g. social ones); there are also fairly good explanations why cellular networks tend to follow this scheme and exhibit a scale-free topology [129].
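All three models are available in networkx; the sketch below (with illustrative, assumed parameters) generates one instance of each and compares mean degree, maximum degree and average clustering, which already hints at the presence of hubs in the Barabási-Albert graph.

```python
import networkx as nx
import numpy as np

n = 1000
er = nx.erdos_renyi_graph(n, p=0.01)              # Erdős-Rényi random graph
ws = nx.watts_strogatz_graph(n, k=10, p=0.05)     # ring lattice with rewiring: small world
ba = nx.barabasi_albert_graph(n, m=5)             # growth + preferential attachment

for name, g in [("ER", er), ("WS", ws), ("BA", ba)]:
    degrees = np.array([d for _, d in g.degree()])
    print(name,
          "mean degree:", round(degrees.mean(), 2),
          "max degree:", degrees.max(),                     # hubs appear in the BA graph
          "clustering:", round(nx.average_clustering(g), 3))
```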

Assortativity, degree distribution and scale-free networks

Network assortativity describes the preferential attachment of nodes to others which are, in some sense, similar; this usually means having a similar degree. In assortative networks, well-connected nodes tend to connect to other well-connected nodes; biological systems are typically disassortative, i.e. high-degree nodes connect to a large number of low-degree ones [130].

Another key property of biological networks is the power-law degree distribution, giving rise to so-called scale-free networks. The degree distribution, denoted by P(k), gives the probability that a randomly selected node has degree k. In the Erdős-Rényi model, the degree distribution is binomial, which can be approximated with a Poisson distribution in large networks, i.e. it is strongly peaked at the average degree (meaning that nodes with a degree far from the average are extremely rare). Scale-free networks have a degree distribution of the form P(k) ~ k^(-gamma), implying that there exist a few highly connected nodes (or hubs) and a large number of nodes with low degree (Figure 49). The degree exponent gamma influences the behavior of the network in a fundamental way. The higher its value, the steeper the slope of the distribution; it follows that for gamma > 3, large hubs are infrequent and play no important role in the network, while lower values correspond to a strong presence of hubs. Most biological networks have a degree exponent between 2 and 3. It turns out that these networks are also "ultra-small", in the sense that the average path length is significantly shorter than that in random networks. For more details on scale-free networks, see the works of Barabási et al. [128 and 129].
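The degree exponent can be estimated from data; the sketch below fits gamma with a simple least-squares line on the log-log empirical degree distribution of a Barabási-Albert graph. This crude approach is used here only for illustration - maximum-likelihood estimators, such as those in the powerlaw package, are preferable in practice.

```python
# Crude estimate of the degree exponent gamma from the empirical degree
# distribution P(k) ~ k^(-gamma); a least-squares fit on the log-log
# histogram is used purely for illustration.
from collections import Counter
import numpy as np
import networkx as nx

G = nx.barabasi_albert_graph(10000, 3, seed=42)
degree_counts = Counter(d for _, d in G.degree())

k = np.array(sorted(degree_counts))
pk = np.array([degree_counts[d] for d in k], dtype=float)
pk /= pk.sum()                       # normalize to an empirical P(k)

log_k, log_pk = np.log(k), np.log(pk)
slope, intercept = np.polyfit(log_k, log_pk, 1)
print("estimated degree exponent gamma ~ %.2f" % (-slope))
```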

Tasks and challenges

In practice, our knowledge about biological systems is never perfect. There are a number of reasons for that - theoretical ignorance, practical limitations, inherent uncertainties, errors and laziness, just to name a few. This implies that, however good the methodology and the execution were, one always ends up with incomplete models. Although reaching perfection is unfeasible in practice, one can significantly improve the model by exploiting "hidden" structures and connections embedded in the data, uncovering previously unknown information. In the context of network biology, this is achieved by solving network analysis problems, which come in many varieties:

Node and edge prediction are among the most natural problems. Nodes and edges can be predicted using a wide range of techniques including similarities, topological features, temporal properties, network comparison, etc. [171].

Cluster analysis can be used to detect functional modules in biological systems and analyze the details of their interactions.

Classification, regression and ranking are rather general terms, originating from the field of machine learning. This family of techniques can be applied in a wide range of network analysis problems, e.g. to discover roles or properties of nodes and edges, node and edge prediction, etc.

Centrality analysis, path finding and robustness analysis can be used to understand the organization of the network and how the nodes "communicate" with each other. A natural application is drug target selection, i.e. deciding which nodes or edges should be attacked to counteract the effects of a disease with the least possible side-effects, or how to disrupt the network completely (antibiotics, anticancer drugs).

Graph isomorphism and network alignment are more recent approaches strongly connected to network integration. For an application, see e.g. [132], where PPI networks were aligned across multiple species to predict functional orthology.

Graph motif search is somewhat similar to the former and has been applied e.g. to metabolic networks to gain a deeper understanding of their structure and building blocks [133].

Network inference or reverse engineering refers to the act of determining the structure of the network from the collected data. It is important to note that the structure of the inferred network is highly dependent on the method applied; therefore, focus is shifting towards the integration and joint utilization of multiple inferred networks.

Network integration aims to combine and use multiple networks in a coherent way, representing a sub-field of information fusion.

Network visualization is one of the most basic, yet very important problems to address. Cytoscape is probably the most popular tool for biological network visualization, and it is also incredibly useful in a wide range of network analysis applications.

An application to drug discovery

Traditionally, drug discovery and development was all about designing molecules which bind to a single target, or at most a handful of targets, with maximum selectivity. Although it was well known that many successful drugs act on several targets simultaneously, it was only in the last few years that drug discovery and network biology crossed paths, giving rise to network pharmacology; this marriage holds the promise of developing drugs with higher efficacy and lower toxicity. The network approach is also attractive from the viewpoint of drug repositioning - since the number of new molecular entities produced by the pharmaceutical industry is decreasing almost every year, re-using existing drugs seems a reasonable strategy, also supported by the new paradigm.

Despite the relatively young age of the field, there have been quite a few publications investigating network analysis methods for drug discovery and repositioning. Many of these methods aim to identify attractive drug targets using techniques similar to those described in the previous section; others follow the similarity-based approach to create multiple levels of information (e.g. drug-drug and disease-disease similarity networks) and apply ad hoc fusion methods to combine these levels. For example, the Connectivity Map designed by Lamb et al. utilizes the language of gene expression changes to connect drugs, diseases and genes [125]. Changes in gene expression profiles were determined experimentally for several drugs and diseases; a pair of these is connected if the effect of the drug counteracts the effect of the disease. The PREDICT system [134] defines a large number of similarities between drugs (based on chemical descriptions, side-effects, sequence, closeness in a PPI network and functional annotation) and between diseases (based on e.g. phenotype and genetic signatures), then uses a machine learning approach to find valid drug-disease pairs by comparing them to previously known associations. Features for each unknown drug-disease pair are computed using a score function which essentially computes, for each combination of similarity measures, the similarity to the closest known drug-disease pair.
Using these as features, unknown pairs are then classified using logistic regression, which also performs the weighting of the features to yield a classification score.

18. References

[111] G. A. Pavlopoulos, M. Secrier, C. N. Moschopoulos, T. G. Soldatos, S. Kossida, J. Aerts, R. Schneider, and P. G. Bagos, Using graph theory to analyze biological networks. BioData Min, 4:10,

100 [112] Björn H. Junker and Falk Schreiber, Analysis of Biological Networks. Wiley Series in Bioinformatics, Wiley-Interscience, [113] T. Phuong and N. Nhung, Predicting gene function using similarity learning. BMC Genomics, 14 Suppl 4:S4, Oct [114] Q. Chen, W. Lan, and J. Wang, Mining featured patterns of MiRNA interaction based on sequence and structure similarity. IEEE/ACM Trans Comput Biol Bioinform, 10(2): , [115] P. Csermely, T. Korcsmaros, H. J. Kiss, G. London, and R. Nussinov, Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol. Ther., 138(3): , June [116] I. Xenarios, D. W. Rice, L. Salwinski, M. K. Baron, E. M. Marcotte, and D. Eisenberg, DIP: the database of interacting proteins. Nucleic Acids Res., 28(1): , Jan [117] A. Chatr-aryamontri, A. Ceol, L. M. Palazzi, G. Nardelli, M. V. Schneider, L. Castagnoli, and G. Cesareni, MINT: the Molecular INTeraction database. Nucleic Acids Res., 35 (Database issue):d , Jan [118] M. Kanehisa and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28(1):27-30, Jan [119] R. Caspi, T. Altman, R. Billington, K. Dreher, H. Foerster, C. A. Fulcher, T. A. Holland, I. M. Keseler, A. Kothari, A. Kubo, M. Krummenacker, M. Latendresse, L. A. Mueller, Q Ong, S. Paley, P. Subhraveti, D. S. Weaver, D. Weerasinghe, P. Zhang, and P. D. Karp, The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res., 42(1):D , Jan [120] L. E. Ulrich and I. B. Zhulin, MiST: a microbial signal transduction database. Nucleic Acids Res., 35 (Database issue):d , Jan [121] F. Schacherer, C. Choi, U. Gotze, M. Krull, S. Pistor, and E. Wingender, The TRANSPATH signal transduction database: a knowledge base on signal transduction networks. Bioinformatics, 17(11): , Nov [122] D. Fazekas, M. Koltai, D Turei, D. Modos, M. Palfy, Z. Dul, L. Zsakai, M. Szalay-Bekő, K. Lenti, I. J. Farkas, T. Vellai, P. Csermely, and T. Korcsmaros, SignaLink 2 - a signaling pathway resource with multilayered regulatory networks. BMC Syst Biol, 7:7, [123] A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B. Lenhard, JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32 (Database issue):d91-94, Jan [124] E. Wingender, X. Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Pruss, I. Reuter, and F. Schacherer, TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res., 28(1): , Jan [125] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C. Blat, M. J. Wrobel, J. Lerner, J. P. Brunet, A. Subramanian, K. N. Ross, M. Reich, H. Hieronymus, G. Wei, S. A. Armstrong, S. J. Haggarty, P. A. Clemons, R. Wei, S. A. Carr, E. S. Lander, and T. R. Golub, The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795): , Sep [126] P. Erdős and A. Rényi, On the evolution of random graphs. In: Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17-61, [127] M. E. Newman, S. H. Strogatz, and D. J. Watts, Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2):026118, Aug [128] A. L. Barabasi and R. Albert, Emergence of scaling in random networks. Science, 286(5439): , Oct

[129] A. L. Barabasi and Z. N. Oltvai, Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5(2): , Feb
[130] M. E. Newman, Assortative mixing in networks. Phys. Rev. Lett., 89(20):208701, Nov
[171] Linyuan Lü and Tao Zhou, Link prediction in complex networks: A survey. Physica A, 390(6): ,
[132] R. Singh, J. Xu, and B. Berger, Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. U.S.A., 105(35): , Sep
[133] V. Lacroix, C. G. Fernandes, and M. F. Sagot, Motif search in graphs: application to metabolic networks. IEEE/ACM Trans Comput Biol Bioinform, 3(4): ,
[134] A. Gottlieb, G. Y. Stein, E. Ruppin, and R. Sharan, PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol. Syst. Biol. 7:496,

19. Dynamic modeling in cell biology

The high-throughput methods of experimental biology provide access to a huge amount of data. As data acquisition becomes easier, understanding the data becomes more and more challenging. Modeling is a tool for organizing knowledge into a formal specification and using it in an iterative process to build up biological knowledge. Based on measurements, theorists can specify more accurate models, and simulation methods make the expected behavior of these systems predictable. These simulations are virtual measurements, and their results can be compared with experimental data; as a result, the suspected model is either confirmed or falsified. A more direct approach is the model-based design of biological experiments to maximize the obtainable information. We can look at the models as a kind of common language between experimentalists and theorists, which allows a direct link between biological data and theory [135].

As a first step, a formal model is constructed based on the biological knowledge. This model specifies the hypothesis about the biological system exactly, and contains only biological assumptions. This level is ideal for knowledge exchange in the scientific community, and also between software packages based on different theories. For simulation we must refine the model to be more specific, with computational-framework-dependent assumptions. In some cases this refinement step can be automated, but the acceptance of the induced assumptions is always a modeling decision. For example, treating chemical concentrations as continuous gives correct results when simulating a reaction in a beaker, but it is incorrect in the case of an extremely small compartment like a mitochondrion, where the discrete nature of the reacting particles becomes significant.

Biochemical concepts and their computational representations

The building blocks of biochemical kinetic models are reactions. A reaction can be specified by its substrates, products, stoichiometric coefficients and a rate constant, e.g.

a A + b B -> c C   (with rate constant k).

The stoichiometric constants (a, b, c) specify the relative quantities of the reactants and products, and so define the structure of the reaction. The rate constant k expresses the frequency with which the reactant molecules - a molecules of A and b molecules of B - collide with sufficient energy to form the products. The actual rate of the reaction - the flux - is proportional to the product of the concentrations of the reactants raised to the corresponding stoichiometric coefficients: v = k [A]^a [B]^b, where [A] denotes the concentration of species A, usually in units of mol/l. Strictly speaking, all reactions are reversible, and a reversible reaction can be written as two irreversible reaction equations:

a A + b B -> c C   (rate constant k1)  and  c C -> a A + b B   (rate constant k2),

or, in a simpler form, as a A + b B <-> c C. When the two fluxes are equal, k1 [A]^a [B]^b = k2 [C]^c, the system is in equilibrium, and the equilibrium concentrations can be related by rearrangement of the above algebraic equation: [C]^c / ([A]^a [B]^b) = k1 / k2. If k2 = 0 - the reaction is irreversible - the equilibrium is reached when the reactants run out.

At the microscopic level, particle numbers are used instead of molarities, and the rate of the j-th reaction is expressed as a hazard h_j(x, c_j), where x denotes the state of the system, i.e. the vector of particle numbers, and c_j is the stochastic rate constant. The probability that the j-th reaction occurs in a small interval (t, t + dt] is h_j(x, c_j) dt. If the concentration of species S_i is [S_i] in a compartment of volume V, its particle number is x_i = [S_i] N_A V, where N_A is Avogadro's constant. If the j-th reaction follows first-order kinetics and the i-th species is the substrate of the reaction, the hazard has the form h_j(x, c_j) = c_j x_i. In the case of a bimolecular reaction with substrates S_i and S_k, the hazard function has the form h_j(x, c_j) = c_j x_i x_k. It is easy to see that the conversion between the macroscopic rate constant k_j and the stochastic rate constant c_j depends on the specific rate law [136 and 137]. We can, for example, specify a constant flux of a species into the system naturally at the microscopic level (a zero-order reaction with constant hazard c_j), but the corresponding concentration change depends on the volume of the compartment, so the rate constant of the continuous model is volume dependent: k_j = c_j / (N_A V). In the case of a first-order reaction, c_j and k_j are always equal, because c_j expresses the relative proportion of the substrate transformed per unit time, which does not depend on the volume. In the case of a higher-order reaction, c_j is inversely proportional to V, because the probability of an intermolecular collision is concentration dependent. For example, for a second-order reaction the rate constants are related as c_j = k_j / (N_A V) if the two substrates are different, and as c_j = 2 k_j / (N_A V) if there is only one substrate with stoichiometric constant 2 (in which case the hazard is h_j(x, c_j) = c_j x_i (x_i - 1) / 2).
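The following sketch illustrates, with arbitrary example values, the conversion between macroscopic and stochastic rate constants and the corresponding hazard functions for zero-, first- and second-order reactions.

```python
# Conversion between macroscopic rate constants (k, concentration-based)
# and stochastic rate constants (c, particle-number-based) together with
# the corresponding hazard functions; the numbers are arbitrary examples.
N_A = 6.02214076e23     # Avogadro's constant [1/mol]
V = 1e-15               # compartment volume [l], roughly a bacterial cell

def c_zero_order(k):    # constant influx: k in mol/(l*s)
    return k * N_A * V

def c_first_order(k):   # S -> ...: k in 1/s
    return k

def c_second_order(k, same_substrate=False):   # S1 + S2 -> ... or 2S -> ...
    return (2.0 if same_substrate else 1.0) * k / (N_A * V)

# Hazards for a state given by particle numbers.
def hazard_first_order(c, x_i):
    return c * x_i

def hazard_bimolecular(c, x_i, x_k):
    return c * x_i * x_k

def hazard_dimerization(c, x_i):               # 2S -> ...
    return c * x_i * (x_i - 1) / 2.0

print("c for k = 1e6 l/(mol*s):", c_second_order(1e6))
print("hazard of a dimerization with 100 molecules:",
      hazard_dimerization(c_second_order(1e6, same_substrate=True), 100))
```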

It is worth mentioning that, due to the above, reaction systems whose rate laws differ only in this respect - a bimolecular reaction of two distinct species versus a second-order reaction of a single species - are not equivalent in a kinetic sense.

A system of coupled biochemical reactions usually has a complex network structure, and a natural choice is to represent it as a graph. There is no limit on the number of reactions in which a compound can participate, so compounds must be formalized as nodes. But a chemical reaction can have more than one substrate and product, so the network has hyperedges. Alternatively, we can also formalize reactions as nodes and define a labeled directed bipartite graph, where a directed edge exists from a substance node s to a reaction node r if and only if s is a substrate of r, and a directed edge exists from r to s if and only if s is a product of r. For all edges, a labeling defines the stoichiometric constants of the given reaction. This graph formalizes the qualitative structure of the system. A further labeling, defined over the substance nodes and called the marking, defines the particle numbers of the substances. This type of bipartite graph, called a Petri net, has a detailed theory. The terminology of the Petri net field calls the set of substances the place set (denoted by P) and the set of reactions the transition set (denoted by T).

Now we can define the stoichiometry matrix S, where S_ij is the change in the particle number of species i when reaction j occurs; the elements of the matrix are thus the signed stoichiometric constants of the reactions: if species i is a reactant of reaction j the sign is negative, if it is a product, the sign is positive. The occurrence of a reaction is called transition firing in the field of Petri nets. Let M_0 be the initial marking and r be a vector counting the occurred reactions; then the new state of the system is M = M_0 + S r.

The examination of the S matrix can provide interesting information about the structure of the system. Examine the null space of S, the space of vectors x that are solutions of the equation S x = 0. More intuitively, we search for all the reaction combinations that return the system to its original state. If x is a solution of the above equation, it is called a T-invariant of the Petri net, or an elementary mode of the biochemical pathway. Now examine the null space of S^T, the transpose of S: S^T y = 0. The solutions of this equation are called P-invariants or conservation laws of the system.
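As a small illustration of these structural analyses, the sketch below builds the stoichiometry matrix of a toy three-reaction enzyme mechanism (chosen arbitrarily) and computes bases of the two null spaces, corresponding to T-invariants and P-invariants.

```python
# Structural analysis of a toy enzyme mechanism via the stoichiometry matrix:
#   R1: E + S -> ES,  R2: ES -> E + S,  R3: ES -> E + P
# Rows are species (E, S, ES, P), columns are reactions (R1, R2, R3).
import numpy as np
from scipy.linalg import null_space

S = np.array([
    [-1,  1,  1],   # E
    [-1,  1,  0],   # S
    [ 1, -1, -1],   # ES
    [ 0,  0,  1],   # P
])

# T-invariants: reaction-count vectors x with S x = 0
# (firing combinations that return the system to its original state).
t_invariants = null_space(S)

# P-invariants: species weightings y with S^T y = 0 (conservation laws,
# e.g. the total amount of enzyme E + ES is constant).
p_invariants = null_space(S.T)

print("T-invariant basis:\n", np.round(t_invariants, 3))
print("P-invariant basis:\n", np.round(p_invariants, 3))
```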

Modeling with ordinary differential equations

The change of the concentration of species S_i over time is d[S_i]/dt = sum_j S_ij v_j, so an ordinary differential equation (ODE) can be stated for the concentration of each chemical species (the state variables). The above ODE system can be solved to determine the dynamic behavior of the system. For the equilibrium, we can solve an algebraic equation system in which all derivatives are zero, d[S_i]/dt = 0, which states flux equality, as discussed during the derivation of the equilibrium constant. The above ODE system can in general be written in a vector form: dx/dt = f(x), where x is the vector of state variables, in this case concentrations, and f is the rate law. Equivalently, in a form which contains the fluxes through the reaction channels: dx/dt = S v(x), where S is the stoichiometric matrix and v(x) is the vector of reaction fluxes. The implicit assumption behind this method is that the concentrations can be treated as continuous variables.

Stochastic modeling

At the level of cellular processes there can be very small amounts of the substances - for example fewer than a few hundred molecules in the system - where the inherent discreteness of the chemical reactions is relevant. In this case the system can be simulated using integer particle numbers instead of concentrations. A reaction is defined as a probabilistic event, where the probability of a molecular collision is proportional to the product of the particle numbers of the reactants. This type of model can be simulated using Monte Carlo methods. The most obvious way is to simulate the system in discrete timesteps and, based on a generated random number, decide whether a collision occurred or not; if a reaction occurred, the state is modified according to the stoichiometric coefficients. This method is computationally intensive, and is only an approximation of the underlying continuous-time Markov chain. If we would like to be exact and select a timestep small enough that at most one reaction occurs in each step, then the algorithm becomes wasteful, because it simulates many timesteps in which nothing happens. It can be shown that the number of reactions occurring in an interval of time follows a Poisson distribution, and the distribution of the time between two events can also be calculated analytically: the time difference follows an exponential distribution. This is the main idea of the Gillespie algorithm: instead of calculating the state of the system in many discrete timesteps, we can calculate the time of the next reaction and simulate it directly [136].

1. Initialize: t = 0; x = x_0.
2. Calculate the hazards h_j(x) for all reactions and their sum h_0(x) = sum_j h_j(x).
3. Generate random numbers u_1, u_2 uniformly on (0, 1).
4. Calculate the time to the next reaction: tau = -ln(u_1) / h_0(x).
5. Determine the index j for which sum_{i<j} h_i(x) < u_2 h_0(x) <= sum_{i<=j} h_i(x).
6. Apply reaction rule j: x = x + S_{.,j} (the j-th column of the stoichiometry matrix); t = t + tau.
7. If t < T_max: go to step 2.
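A minimal implementation of the direct method listed above is sketched below for the toy enzyme mechanism used earlier; rate constants and initial particle numbers are arbitrary, and numpy's exponential and categorical sampling are used in place of the explicit inverse-transform of steps 3-5. The prose explanation of the individual steps follows.

```python
# Minimal Gillespie direct-method simulator for the toy mechanism
#   R1: E + S -> ES,  R2: ES -> E + S,  R3: ES -> E + P
# Stochastic rate constants and initial particle numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

# Columns: reactions R1..R3; rows: species E, S, ES, P (state-change vectors).
S = np.array([[-1,  1,  1],
              [-1,  1,  0],
              [ 1, -1, -1],
              [ 0,  0,  1]])
c = np.array([0.01, 0.5, 0.3])            # stochastic rate constants

def hazards(x):
    e, s, es, p = x
    return np.array([c[0] * e * s,        # R1: binding
                     c[1] * es,           # R2: unbinding
                     c[2] * es])          # R3: product formation

def gillespie(x0, t_max):
    t, x = 0.0, np.array(x0, dtype=int)
    trajectory = [(t, x.copy())]
    while t < t_max:
        h = hazards(x)
        h0 = h.sum()
        if h0 == 0:                       # no reaction can occur any more
            break
        tau = rng.exponential(1.0 / h0)   # time to the next reaction
        j = rng.choice(len(h), p=h / h0)  # which reaction fires
        x += S[:, j]
        t += tau
        trajectory.append((t, x.copy()))
    return trajectory

traj = gillespie(x0=[100, 300, 0, 0], t_max=50.0)
print("final state (E, S, ES, P):", traj[-1][1])
```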

After initialization, the hazards are calculated based on the current state of the system. Then the next reaction time and the type of the reaction are sampled using the inverse distribution method (steps 3-5). In step 6 the reaction rule is applied: the appropriate number of reactant molecules is removed and product molecules are added to the system.

Instead of sampling only the time of the next reaction, we can also sample the next occurrence time of every reaction given the actual state of the system, and then select the earliest one. At first glance this is less efficient, because we need to generate an independent random number for each reaction in each step. In practice there are two opportunities for acceleration. If the hazard of a reaction did not change during the previous step, the absolute time of its next occurrence is still valid. If the hazard changed from h_old to h_new, the time remaining until the previously sampled occurrence can be rescaled by the factor h_old / h_new. This is the underlying idea of the Gibson-Bruck algorithm, which is an efficient alternative to Gillespie's method.

Hybrid methods

There are several intermediate methods to deal with the problem of method selection. In a system with a small number of reactants, small compartments, etc., the stochastic behavior of the reactions must be treated in the simulation. But stochastic simulation, even with a sophisticated algorithm, is a lot more demanding than solving a differential equation. There is a trade-off between accuracy and the maximal complexity of the manageable model, and an intermediate or hybrid solution can offer a good compromise. Classical methods widely used in mathematics, physics and finance can be used to approximate the problem in a stochastic but continuous manner. For an intuitive derivation, we can use the fact that a Poisson-distributed reaction count with a large mean is well approximated by a normal distribution with the same mean and variance; so instead of the discrete stochastic simulation, we can solve a continuous approximation in the form of a stochastic differential equation (SDE), the Langevin equation of the process, in which a noise term is added to the deterministic rate equation, and the equation can be solved with standard techniques. In its general vectorial form, an SDE can be written as dX_t = mu(X_t) dt + sigma(X_t) dW_t, where W_t stands for a Wiener process, or standard Brownian motion, defined by W_0 = 0 and by increments W_(t+s) - W_t that are normally distributed with zero mean and variance s, such that all non-overlapping increments are independent random variables. The simplest method for the numerical solution of SDEs is the generalization of the Euler method, called the Euler-Maruyama approximation: X_(t+dt) = X_t + mu(X_t) dt + sigma(X_t) dW, where dW is a normally distributed random variable with zero mean and variance dt.
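The sketch below applies the Euler-Maruyama scheme to the Langevin (continuous, noisy) approximation of a simple birth-death process, i.e. constant production and first-order degradation of a single species; all parameter values are arbitrary illustrations.

```python
# Euler-Maruyama simulation of the Langevin approximation of a birth-death
# process:  0 -> X (hazard k1),  X -> 0 (hazard k2 * X).
# Drift k1 - k2*x, noise amplitude sqrt(k1 + k2*x); parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(1)

k1, k2 = 10.0, 0.1          # production and degradation hazard constants
dt, t_max = 0.01, 100.0
n_steps = int(t_max / dt)

x = 0.0                     # initial (continuous) particle number
trajectory = np.empty(n_steps + 1)
trajectory[0] = x

for i in range(n_steps):
    drift = k1 - k2 * x                         # deterministic part
    diffusion = np.sqrt(max(k1 + k2 * x, 0.0))  # noise amplitude
    dW = rng.normal(0.0, np.sqrt(dt))           # Wiener increment ~ N(0, dt)
    x = x + drift * dt + diffusion * dW
    trajectory[i + 1] = x

print("mean of the second half of the trajectory: %.1f "
      "(deterministic steady state: %.1f)"
      % (trajectory[n_steps // 2:].mean(), k1 / k2))
```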

We can also compute the time-dependent evolution of the probability density by deriving the Kolmogorov forward equation of the above Langevin equation: dp(x, t)/dt = -d/dx [mu(x) p(x, t)] + (1/2) d^2/dx^2 [sigma^2(x) p(x, t)]. This is called the Fokker-Planck equation.

Another possibility for deriving a hybrid method is to treat a subset of the system variables as discrete and the remaining part as continuous. In this case we must treat the continuous change of the system state between two simulation steps using inhomogeneous Poisson processes.

Reaction-diffusion systems

All approaches discussed so far assume that the system under examination is well mixed: the concentrations and collision probabilities of the molecules are equal in all parts of the system. If these assumptions are at least approximately valid, all reactions can be treated as if they occurred at the same point of space. In a cell, however, the reactions are well localized, and this localization is indispensable for the operation of complex control mechanisms. In this case, besides time, the spatial location must be introduced as a variable. The formalization of transport processes in space then becomes essential, and the simplest transport mechanism is diffusion.

Diffusion is a spontaneous process which is statistical in nature. The Brownian motion of the particles causes a continuous mixing in the system. At the level of individual molecules, a particle performs a random walk in space. The expected distance taken by a particle from its original position is lambda * sqrt(n), where n is the number of collisions that occurred between the two observations and lambda is the mean free path length. At the level of populations, a small i-th piece of space contains N_i particles. During a small time slice a particle has some probability of crossing the barrier between two pieces, so if the local concentration is greater in the i-th piece than in its neighbors, the expected number of exiting particles is higher than the expected number of entering ones. In the linear one-dimensional case, the probability of a particle crossing a given barrier is 0.5, so the expected net flow between two neighboring pieces is proportional to the difference of their particle numbers. It can be shown that, taking the limit of the piece size to zero, we get the following differential equation, called the diffusion equation: dc/dt = D d^2c/dx^2, where D is the diffusion constant [138]. The molecular flux is proportional to the concentration gradient: J = -D dc/dx. Putting this equation and the conservation of particles (the continuity equation dc/dt = -dJ/dx) together, we get the macroscopic Fick equation: dc/dt = D d^2c/dx^2.
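As a numerical illustration of the diffusion equation just derived, the sketch below evolves a concentration spike with an explicit finite-difference scheme on a one-dimensional grid; the grid size, diffusion constant and time step are arbitrary illustrative values.

```python
# Explicit finite-difference solution of the 1-D diffusion equation
#   dc/dt = D * d^2c/dx^2
# starting from a concentration spike in the middle of the domain.
# Grid size, D and time step are arbitrary illustrative values.
import numpy as np

D = 1e-9             # diffusion constant [m^2/s]
L = 1e-4             # domain length [m]
nx = 101
dx = L / (nx - 1)
dt = 0.2 * dx**2 / D     # obeys the explicit stability limit dt <= dx^2/(2D)

c = np.zeros(nx)
c[nx // 2] = 1.0         # initial spike (arbitrary units)

for _ in range(2000):
    # Discrete Laplacian (c[i-1] - 2c[i] + c[i+1]) / dx^2
    lap = (np.roll(c, 1) - 2 * c + np.roll(c, -1)) / dx**2
    lap[0] = lap[-1] = 0.0   # boundary cells kept fixed (simple boundary)
    c = c + dt * D * lap

print("total amount: %.3f (approximately conserved)" % c.sum())
print("the peak has spread: max concentration = %.3f" % c.max())
```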

The diffusion and flux equations above are written in one dimension, but it is straightforward to derive their three-dimensional versions. Combining them with the ODE system of the reactions, we get a reaction-diffusion system in the form of a partial differential equation system: dc_i/dt = f_i(c) + D_i * laplacian(c_i) for each species i. When solving these equations, boundary conditions such as the spatial geometry of the cell have a significant effect on the solution. The interplay between reaction kinetics and diffusion can generate quite complex patterns when the timescales of the two processes are similar. These are often called Turing patterns after the article "The Chemical Basis of Morphogenesis" by Alan Turing, which discusses the phenomenon [139]. In his famous article he applied the reaction-diffusion equations to model systems and examined the properties of the solutions. In the living world there are several examples of motifs which strongly resemble Turing patterns; they can be seen, for example, on the coat of animals, like the stripes of a tabby cat or the spots of a leopard.

Model fitting

An essential connection between the model and the experiments is established by the data. The model parameters can be fitted to the data using machine learning methods. In the case of the differential equation approach, the function f in the vectorial equation dx/dt = f(x) must be fitted; for this purpose an arbitrary regression method can be used. In the case of stochastic simulation, model fitting is a more difficult task and is actually an active area of research. The assumption that we observe all the reaction occurrence times is unrealistic, so in the stochastic model learning context we must deal with incomplete data. Markov chain Monte Carlo methods can be used for the Bayesian parameter inference of stochastic models [140]. A sampling scheme with data imputation can be used to derive the posterior distribution of the model parameters given the incomplete observations. An alternative approach is to perform the parametric inference on the continuous normal approximation of the stochastic system. This model also requires imputation, because we cannot obtain samples from a sufficiently dense temporal subdivision to directly apply the Euler-Maruyama approximation of the stochastic differential equation [141].

Whole-cell simulation

Understanding a complex biological system like a whole cell has several aspects. Even if the whole genome of an organism is sequenced, it is clear that most of the mysteries are still unsolved. When all genes are annotated, gene products are identified and the structures are resolved, there is still an incredible number of open questions. The next level of knowledge is the function of the gene products, the complex interactions between them, and also between the gene products and the chromatin structure itself. The interactions can be direct, or indirectly organized into a pathway map by common metabolites. If we can draw this map and know the whole metabolome of the organism, there is still a further level of knowledge: the dynamic behavior of the cell [142]. This level can be considered the highest-level phenotype of the organism, if we ignore the environment. The only feasible tool for studying the dynamic behavior of a cell is in silico simulation. Our expectation from the model is some fundamentally new prediction. Two different aspects of these predictions are aptly called the "Physicist's perspective" and the "Engineer's perspective" by Freddolino et al. [143].
The first one is a widely applicable organizing principle that can help scientific thinking about the system; the second one is a more practical quantitative prediction that can be useful in engineering tasks like compound screening. The pathogenic microbe Mycoplasma genitalium has the smallest genome among known free-living organisms; it has 525 identified genes in its 580 kb long genome. It is therefore not surprising that the first attempts at building a whole-cell simulator used M. genitalium as a model organism. Because it still has a large number of genes, and knock-out studies showed that not all genes are essential for the survival of the microorganism, a minimal

set of necessary genes - a minimal genome - can be selected. The artificial cell containing this genome is called the minimal self-surviving cell (SSC). The E-CELL model (127 genes, 495 reaction rules) consumes glucose from its environment and produces lactate as a waste product [144]. This trivial behavior could be predicted without in silico simulation, but even this simple model can predict some interesting phenomena. If the environmental glucose level reaches zero, the cell begins to starve. Paradoxically, the model predicts that in the very first phase of starvation the ATP level temporarily rises, and only then falls until the ATP resource is depleted (Figure 50) [142 and 143]. This type of simulation can be used efficiently for modeling pathological conditions or individual differences, e.g. for selecting personalized interventions. Modeling a full-featured human cell is not feasible yet, but human erythrocyte models already exist; these models allow us to study certain types of hereditary anemias [142].

Overview

In this chapter we have shown the importance of dynamic modeling and reviewed some computational tools for it. The tools differ mainly in the basic assumptions they impose on the system under investigation. For a grouping of the discussed frameworks, see Table 4. The possibilities for the stochastic treatment of reaction-diffusion systems are not discussed here.

109 20. References [135] J. M. Bower and H. Bolouri, Computational Modeling of Genetic and Biochemical Networks. Bradford Books, MIT Press, [136] D. T. Gillespie, Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25): , [137] D. J. Wilkinson, Stochastic modelling for systems biology, Chapter Chemical and biochemical kinetics. Chapman and Hall/CRC mathematical and computational biology series, [145], Chapman and Hall/CRC, Boca Raton, Fla., [138] G. Bormann, F. Brosens, and E. De Schutter, Computational Modeling of Genetic and Biochemical Networks, Chapter Diffusion. Bradford Books, MIT Press, [135], [139] A. M. Turing, The Chemical Basis of Morphogenesis. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 237(641):37-72, Aug [140] R. J. Boys, D. J. Wilkinson, and T. B. L. Kirkwood, Bayesian inference for a discretely observed stochastic kinetic model. Statistics and Computing, 18(2): , [141] Andrew Golightly and Darren J. Wilkinson, Bayesian sequential inference for stochastic kinetic biochemical network models. Journal of Computational Biology, 13(3): , [142] M. Tomita, Whole-cell simulation: a grand challenge of the 21st century. TRENDS in Biotechnology, 19(6): , [143] P. L. Freddolino and S. Tavazoie, The dawn of virtual cell biology. Cell, 150(2): , July [144] M. Tomita, K. Hashimoto, K. Takahashi, T. S. Shimizu, Y. Matsuzaki, F. Miyoshi, K. Saito, S. Tanida, K. Yugi, J. C. Venter, and C. A. Hutchison, E-CELL: software environment for whole-cell simulation. Bioinformatics, 15(1):72-84, [145] D. J. Wilkinson, Stochastic modelling for systems biology. Chapman and Hall/CRC mathematical and computational biology series, Chapman and Hall/CRC, Boca Raton, Fla., Causal inference in biomedicine In this chapter we summarize a Bayesian approach to characterize, from a structural point of view, the causal relevance of biomarkers for a given target and the causal relations in the overall domain. We discuss its relation to other causality research programs, especially the biomedical motivations for this approach. We overview its assumptions and typical limitations, and potential applications. 22. Notation List of symbols 101


22.2. Acronyms

Introduction

The availability of omic measurement techniques opened a new era of hypothesis-free biomedical data analysis. Whereas high dimensionality and relatively modest sample sizes call for a shallow statistical analysis, the increasing computational power allows more and more complex analyses, such as the use of complex causal models in the Bayesian statistical framework. Causality research has undergone rapid development in the last 25 years, ranging from the tabula rasa learning of causal relations and causal models, through the identification of causal effects given a causal model, to coping with counterfactual inference in a functional Bayesian network. The omic approach, i.e. the measurement of all the variables at a given abstraction level, fits ideally the assumption of learning overall causal models, but the relatively small sample size with respect to this high dimensionality suggests the application of the Bayesian framework. First, we summarize the basic assumptions of learning causal models from observational data. Next, we illustrate constraint-based approaches to inferring causal relations. Third, we discuss the existence of structure posteriors combining causal priors and interventional data. Finally, we discuss the application of this posterior for the Bayesian inference of structural and parametric causal features, i.e. causally relevant model properties.

Representing independence and causal relations by Bayesian networks

The explicit representation of the independencies in a distribution is the key to the efficient representation, inference and learning of complete probabilistic models. This suggests the following definition.

Definition 10. The independence model of a distribution P contains the independence statements valid in P.

To give a probabilistic definition of Bayesian networks we need the concept of d-separation.

Definition 11. A distribution P obeys the global Markov condition w.r.t. a DAG G, if for all disjoint sets of nodes X, Y and Z, d-separation of X and Y by Z in G implies the conditional independence of X and Y given Z in P. Here X and Y are d-separated by Z if every path between a node in X and a node in Y is blocked by Z as follows:
1. either the path contains a node w in Z with non-converging arrows (i.e., -> w -> or <- w ->),
2. or the path contains a node w not in Z with converging arrows (i.e., -> w <-) and none of the descendants of w is in Z.

Whereas there are multiple equivalent definitions, the following d-separation based definition directly shows the deep connection between graphs and distributions.

Definition 12. A directed acyclic graph (DAG) G is a Bayesian network of a distribution P, if the variables are represented by the nodes of G, P satisfies the global Markov condition w.r.t. G, and G is minimal (i.e., no edge can be omitted without violating this condition).

To exclude numerically encoded independencies, we introduce the concept of stable distributions.

Definition 13. The distribution P is stable (16) (or faithful), if there exists a DAG, called a perfect map, exactly representing its (in)dependencies (i.e., a conditional independence holds in P if and only if the corresponding d-separation holds in the DAG). The distribution P is stable w.r.t. a DAG G, if G exactly represents its (in)dependencies.

The assumption of stability and strict positivity of a distribution does not exclude the possibility that different DAGs imply the same independence model, thus we can define an equivalence relation between DAGs [161, 168 and 160].

Definition 14. Two DAGs are observationally equivalent, if they imply the same set of independence relations.

The characterization of the DAGs within the same equivalence class relies on two observations. First, the undirected skeletons of observationally equivalent DAGs are the same, because an edge in a DAG denotes a direct dependency, which has to appear in any Markov compatible DAG [161]. Second, a pair of direct dependencies X-Z and Y-Z without a direct dependence between X and Y, and without a conditioning set containing Z that renders X and Y independent, has to be expressed with a unique converging orientation X -> Z <- Y, creating a so-called v-structure, according to the global semantics. The theorem characterizing the DAGs within the same observational (and distributional) equivalence class is as follows.

Theorem 2 ([161 and 147]). Two DAGs are observationally equivalent, iff they have the same skeleton (i.e., the same edges without directions) and the same set of v-structures (i.e., two converging arrows without an arrow between their tails) [161].

If in two Bayesian networks the variables are discrete and the local conditional probabilistic models are multinomial distributions, then the observational equivalence of their structures implies equal dimensionality and a bijective relation between their parameterizations, called distributional equivalence [147].

The limitation of DAGs in uniquely representing a given (in)dependency model poses a problem for the interpretation of the direction of the edges. It also poses the question of how to represent the identically oriented edges of observationally equivalent DAGs. As the definition of the observational equivalence class suggests, the common v-structures identify the initial common edge orientations, and further identical orientations are the consequences of the constraint that no new v-structures can be created. This leads to the following definition (for an efficient, sound, and complete algorithm, see [160]).

(16) For a different interpretation of this term in probability theory, see [164].

114 Definition 15. The essential graph representing DAGs in a given observational equivalence class is a partially oriented DAG (PDAG) that represents the edges that are identically oriented among all DAGs from the equivalence class (called compelled edges) in such a way that exactly the compelled edges are directed in the common skeleton, the others are undirected representing inconclusiveness. The classical problem of "from (observational) correlation to causation", that is the question of determining causal status of a passively observed dependency between and can be decomposed using the concepts introduced earlier to the question about the DAG-based representation of independencies (i.e., probabilistic Bayesian network), the existence of exact representation (i.e., stability) and the existence of unambiguous representation (i.e., essential graph). First, we have to consider whether all direct dependencies among the constructed domain variables are causal. Second, we have to consider stability that would guarantee that a corresponding Bayesian network exactly represents the independencies. Third, we have to adopt the "Boolean" Ockham principle, namely that only the minimal, consistent models are relevant. The essential graph resulting from the joint analysis of the observational conditional independencies (i.e., "correlations ) indicates causal relations under these conditions. In short, under the condition of stability the essential graph represents the direct causal dependencies and the orientations that are dictated by (in)dependencies in the domain through the minimal models (DAGs) compatible with them. Furthermore, the direction of the edges corresponds to the intuitive expectation as the intransitive dependency triplets are represented as v-structures. Correspondingly, we can define a causal model as a Bayesian network using the causal interpretation that edges denote direct influences. Definition 16. A DAG is called a causal structure over a set of variables, if each node represents a variable and edges direct influences. A causal model is a causal structure extended with local probabilistic models for each node w.r.t. the structure describing the causal stochastic dependency of variable on its parents. As the conditionals are frequently from a certain parametric family, the conditional for is parameterized by, and denotes all the parameters, so a causal model consists of a structure and parameters. With further assumption of stability, the essential graph shows exactly the independency relations and exhaustively the identifiable causal relations, which suggests that whereas the question of causation is underconstrained for a pair of variables (restricted to "no dependency - no causation ), the joint analysis of the system of independencies allows partial identification. The following condition ensures the validity and sufficiency of a causal structure. Definition 17. A causal structure and distribution satisfies the Causal Markov Condition, if obeys the local Markov condition w.r.t.. The Causal Markov condition relies on Reichenbach's "common cause principle" that dependency between events and occurs either because causes, or causes or there is a common cause of and (it is possibly an aggregate of multiple events) [163 és 154]. Consequently, the precondition of the Causal Markov condition for is that the set of variables is causally sufficient for, that is all the common causes for the pairs are inside. 
Note that hidden variables are allowed fitting to the usually high level of abstraction of the model, only variables that influence two or more variables in are necessary for causal sufficiency. The causal Markov condition links the causal relations to dependencies and states sufficiency to model the observed probabilistic dependencies. On the other hand, the condition of stability of w.r.t. a causal structure states the necessity of. These two assumptions guarantee that observational (in)dependence (1) is exactly represented by the DAGbased relation (Def. 11) in a Markov compatible graph and that causal (in)dependence (Def. 3) is exactly represented by standard separation in the causal structure [153]. To complete the probabilistic approach to causation, we introduce its central concept of interventional distributions corresponding to the operation (18) according to the "Manipulation theorem" ([167]) or "graph surgery" ([163]). It is performed simply by deleting the incoming edges for the intervened variables in the operator and omitting these factors from the factorization in Eq. 4 (for details and for the deterministic definition of causal irrelevance, see [163 és 153]). 106

115 Definition 18. Let denote the intervention of setting variable(s) to value and the corresponding interventional distribution [162]. Let denote the appropriate interventional distributions over and are disjoint subsets. Then denote the probabilistic causal independence of and given with, that is Now there is a connection again between properties of a distribution and a corresponding DAG. Theorem 3. For a stable distribution defined by a Bayesian network, path interception exactly represents causal irrelevance, i.e.,, ), where denotes that intercepts all directed paths from to, that is if every path between a node in and a node in contains a node in Constraint based inference of causal relations and models A large family of methods for finding complete models best fitting the observations are the constraint-based algorithms. These construct a network by performing independence tests with certain prespecified significance level, which is an NP-hard task [148]. Assuming no hidden variables, a stable distribution and correct hypothesis tests, the Inductive Causation (IC) algorithm correctly identifies a Bayesian network that exactly represents the independencies (see [163, 154 és 167]). 1. Skeleton: Construct an undirected graph (skeleton), such that variables are connected with an edge iff, where. 2. v-structures: Orient iff are nonadjacent, is a common neighbor and that, where and. 3. propagation: Orient undirected edges without creating new v-structures and directed cycle. It can be shown that the following four rules are necessary and sufficient. 1. if, then ; 2. if, then ; 3. if, then ; 4. if, then. However, there is no generally recommendable prespecified significance level and final significance level for the identified model. Furthermore, because of the frequentist approach, there is no principled way to incorporate uncertain prior information. On the other hand, efficient constraint-based algorithms exist that work in the presence of hidden variables [163], which is currently not tractable with Bayesian methods. Interestingly, certain causal dependencies can still be identified even in the presence of potential hidden common causes (confounders), that is if the Causal Markov Condition is violated (for local causal discovery see [149, 165 és 159]. Example 1. The Causal Markov Condition (i.e., the assumption of no hidden common causes) guarantees that from the observation of no more than three variables we can infer causal relation as follows. The direct dependencies between and without direct dependence between and without conditional independence such that converging orientation from Def. 11) resulting in a v-structure. (i.e., with conditional dependence) should be expressed with a unique according to the global semantics (i.e., DAG-based relation 107

116 Example 2. If potential confounders are not excluded a priori, we have to observe at least four variables to possibly exclude that direct dependency is caused by a confounder. Continuing the Example 1, assume furthermore that we observe a forth variable with the direct dependence and conditional independence (because of stability depends on and ). As induces independence the global semantics dictates an (note the earlier v-structure) and it cannot be mediated by a confounder ( as an effect would not block) Learning complete causal domain models Another large family of methods for finding complete models best fitting the observations are the score-based methods. The score-based learning of Bayesian networks best fitting to the data consists of the definition of a score function the posterior for the causal structure [150]) and a search method in the space of DAGs. A natural choice is, for which an efficiently computable closed form can be derived (see termed Bayesian Dirichlet metric [155]. If the initial hyperparameters ensure indistinguishability within an equivalence class, then it is denoted as. If the initial hyperparameters are constant then it is denoted by [150]. If the initial hyperparameters are the converse of the number of parameters corresponding to the local, overall multinomial models of the variables then it is denoted by [146 és 155]. The corresponding score functions are defined as. For interventional data, the product is truncated according to the graph surgery semantics of the do operator [154]. However, the high complexity of Bayesian network model class implies high sample complexity [152], which in turn frequently results in flat posterior and the lack of dominant maximum a posterior model. This leads to the expectation that in case of small amount of data, at least certain properties with high significance of a complex model can be inferred. So the goal is the automated learning of what is learnable with high confidence in the considered model space given the data and to support the interpretation of statistical inference by indicating confidence measures for such properties. Furthermore, the model properties with high significance can be used heuristically as "hard" constraints or "soft" bias to support the inference of the complete model, either by influencing it through priors in learning from heterogeneous sources or in the case of using the same data set by influencing the optimization process itself. Note the similarity of this approach to the frequentist constraintbased Bayesian network learning methods, which perform hypothesis tests on local model properties (on features) and integrate them into a consistent domain model. In a potential Bayesian analog the hypothesis tests are replaced by the model-based feature posteriors instead of the significance levels and p-values of hypothesis tests, enhancing their integration in subsequent phases of learning a complete domain model Bayesian inference of causal features The increasing complexity of the models, the incorporated prior knowledge and the queries leads to the issue of Bayesian inference over general properties of Bayesian networks (i.e., to estimation of the expectation of binary random variables). Note that the expectation of functions over the space of DAGs w.r.t. 
a posterior appears in a wide range of problems, such as in the posterior of a feature (i.e., structural model property), in the expected loss of the selection of a given model and in the full-scale Bayesian inference over domain values: 108

117 There is a large variety of features (i.e., model properties) to provide an overall or specialized characterization of the underlying model, such as the undirected edges or compelled edges (as direct relations or direct causal relations under CMA), pairwise or partial ancestral ordering (related to causal ordering), the parental sets, the pairwise relevance relations or the subset relevance relations (Markov blankets) Edges: direct pairwise dependencies The first family of Bayesian network features are the "direct" (unconditional) causal pairwise relations. If the hypotheses are the DAGs as causal models, then this feature corresponds to the edges. If the hypotheses are the observational equivalence classes as independence models, then such relations are exactly identified by the compelled edges assuming no hidden variables, the causal Markov condition and stability. The corresponding posteriors in the Bayesian context are the following In the presence of possible hidden variables there are more advanced constraint-based algorithms for identifying relations with various causal interpretations, though not in the Bayesian framework (see [163 és 154], [149 és 165]). The starting point for these algorithms shown in Example 2 can be used autonomously for the identification of "direct" causal pairwise relations requiring only limited background knowledge (exogenous variables) and four local independency tests [149]. Despite its incompleteness, its low computational complexity and asymptotic correctness makes this method attractive, particularly for large data sets such as in case of textmining, [166, 158 és 159]. The data-mining application of a related local algorithm for identifying potential v- structures is reported [165] Pairwise causal relations The causal interpretation of Bayesian networks allows the definition of the following pairwise relations in a causal structure (recall that in stable causal models the dependency relations always represent exactly the probabilistic dependency relations). Table 7 shows graphical model based definition of types of associations, relevances and causal relations. In case of multiple target variable, more refined relations can be defined, such as in Table

118 The posterior of pairwise relation is defined as follows: MBG subnetworks The feature subset selection problem does not include explicitly the issue of dependencies between the features, though the interaction between the selected features is important for their interpretation. A generalization of the FSS problem includes the construction of a (sub)model containing the variables relevant to a target variable and their observational dependency and causal dependency relations w.r.t.. Definition 19. A subgraph of is called the Markov Blanket (sub)graph or Mechanism Boundary (sub)graph of variable if it includes the nodes in the Markov blanket defined by and the incoming edges into and into its children. The first interpretation of the MBGs is related to the fact that in case of complete data the prediction of fully determined by the Markov blanket spanning subgraph and its parameters (the local models for and its children). Another interpretation of the MBG feature is that it encompasses all the causal mechanisms directly related to a given variable. Because of the generality of the MBG feature, we call such a model a Markov Blanket Graph or Mechanism Boundary Graph (a.k.a. classification subgraph, feature subgraph). In the Bayesian framework using Bayesian networks, the corresponding score for the MBG feature is the posterior is Ordering of the variables The pairwise transitive causal relevance ( ) can be extended to complete orderings (permutations) of the variables. Although the identification of the ordering of the variables rarely appears as a direct target, indirectly it is usually present in BN learning. In the acausal approach the identification of an acausal Bayesian network heavily influenced by the identification of a good ordering of the variables, because the learning of an acausal Bayesian network structure for a given ordering is computationally efficiently doable. In the causal approach when the hypotheses are the DAGs, the causal structures directly define causal orderings as ancestral orderings. Consequently a score for a Bayesian network can be interpreted as an approximate score for the underlying partial orderings. In fact, any structure learning can be interpreted as an indirect learning of orderings, but certain algorithms explicitly use orderings as a central representation. For example, the use of 110

119 genetic algorithms has been reported to find the best ordering for the learning of Bayesian network structures [156]. A remarkable fact used in many efficient Bayesian Monte Carlo inference is that the corresponding (unnormalized) posterior over the complete orderings can be computed with polynomial time complexity [151] Effect modifiers Despite the centrality of "association" in genetic association studies (GAS), in gene-environment (GE) studies, and in personalized medicine (PM) and in pharmacogenomic (PhG) studies, the underlying exact type of the association is rarely examined explicitly. Such association types are shown in Fig. 51, e.g. the dotted circle indicates the associated variables for, the dashed path from to indicates the variables potentially influenced by "associated" or relevant variables for the symmetric dependency relation, and the dotted paths from to indicate the variables potentially influenced by "associated" or relevant variables for the causal relation. To illustrate the power and easiness of the causal Bayesian network-based (graphical model-based) approach to causation, consider the following biomedical questions. Let denote the predefined set of variables, the outcome variables, and the set of "associated" variables, such as predictors, factors, and covariates. 1. GAS-Informational relevance What is the (minimal) set of directly associated (relevant) variables for the outcome variables from the predefined set of variables, i.e. that? 2. GE-Strength modifier of informational relevance What are the (minimal) sets of directly associated (relevant) variables influencing the dependence between environmental (external) variables and the outcome variables, i.e. that? 3. PM/PhG-Strength modifier of causal relevance What are the (minimal) sets of directly associated (relevant) variables influencing the effect of an intervention on the outcome variables, i.e. that? Whereas the first question is answered by the Markov boundary set, the second and third questions do not have any standard solutions yet. Consider the following. 111

120 Definition 20. In a DAG the set of nodes is a causal Markov Modifier set, iff it contains all the parents for all, where contains all the nodes on the directed paths from to. This definition could be used to support the discovery of genetic variants (or any other biomarker) relevant to causal mechanisms in pharmacodynamics and pharmacokinetics of a drug, which is a challenging task as the targets and mechanisms themselves are typically unknown, and the identification of the affected pathways is not feasible. Thus the discovery of causal biomarkers modifying these cascades of mechanisms was confined to analyze their distant, derived effects on final outcomes, e.g. on clinical end points. A natural response is to increase the set of response variables, e.g. by including more and more biological variable known to be related to the causal mechanisms such as expression data, and by including surrogate end points [157]. Whereas the idea of aggregating over more and more effects seems to be promising, the dependencies between the response variables quickly invalidates any naive approach uniformly joining and treating the response variables. Although the identification of the relevant pathways is typically not feasible, partly because of impartial observations and partly because of insufficient sample size, Bayesian averaging over the dependency networks or causal networks including the intervention (e.g. treatment/drug) and the effects (e.g. outcomes, adverse effects) could be applied, and it collects statistics to estimate the a posteriori probability that a biomarker directly modifies the strength of the relation. 23. References [146] W. L. Buntine, Theory refinement of Bayesian networks. In: Proc. of the 7th Conf. on Uncertainty in Artificial Intelligence (UAI-1991), pages Morgan Kaufmann, [147] D. M. Chickering, A transformational characterization of equivalent Bayesian network structures. In: Proc. of 11th Conference on Uncertainty in Artificial Intelligence (UAI-1995), pages Morgan Kaufmann, [148] D. M. Chickering, D. Geiger, and D. Heckerman, Learning Bayesian networks: Search methods and experimental results. In: Proceedings of Fifth Conference on Artificial Intelligence and Statistics, pages , [149] G. Cooper, A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2: , [150] G. F. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9: , [151] N. Friedman and D. Koller, Being Bayesian about network structure. In: Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence (UAI-2000), pages Morgan Kaufmann, [152] N. Friedman and Z. Yakhini, On the sample complexity of learning Bayesian networks. In: Proc. of the 12th Conf. on Uncertainty in Artifician Intelligence (UAI-1996), pages Morgan Kaufmann, [153] D. Galles and J. Pearl, Axioms of causal relevance. Artificial Intelligence, 97(1-2):9-43, [154] C. Glymour and G. F. Cooper, Computation, Causation, and Discovery. AAAI Press, [155] D. Heckerman, D. Geiger, and D. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20: , [156] P. Larranaga, C. M. H. Kuijpers, R. H. Murga, and Y. Yurramendi, Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 26(4): , [157] C. D. Lathia, D. Amakye, W. Dai, C. Girman, S. Madani, J. Mayne, P. 
23. References

[146] W. L. Buntine, Theory refinement of Bayesian networks. In: Proc. of the 7th Conf. on Uncertainty in Artificial Intelligence (UAI-1991). Morgan Kaufmann.
[147] D. M. Chickering, A transformational characterization of equivalent Bayesian network structures. In: Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence (UAI-1995). Morgan Kaufmann.
[148] D. M. Chickering, D. Geiger, and D. Heckerman, Learning Bayesian networks: Search methods and experimental results. In: Proceedings of the Fifth Conference on Artificial Intelligence and Statistics.
[149] G. Cooper, A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2.
[150] G. F. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9.
[151] N. Friedman and D. Koller, Being Bayesian about network structure. In: Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence (UAI-2000). Morgan Kaufmann.
[152] N. Friedman and Z. Yakhini, On the sample complexity of learning Bayesian networks. In: Proc. of the 12th Conf. on Uncertainty in Artificial Intelligence (UAI-1996). Morgan Kaufmann.
[153] D. Galles and J. Pearl, Axioms of causal relevance. Artificial Intelligence, 97(1-2):9-43.
[154] C. Glymour and G. F. Cooper, Computation, Causation, and Discovery. AAAI Press.
[155] D. Heckerman, D. Geiger, and D. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20.
[156] P. Larranaga, C. M. H. Kuijpers, R. H. Murga, and Y. Yurramendi, Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 26(4).
[157] C. D. Lathia, D. Amakye, W. Dai, C. Girman, S. Madani, J. Mayne, P. MacCarthy, P. Pertel, L. Seman, A. Stoch, P. Tarantino, C. Webster, S. Williams, and J. A. Wagner, The value, qualification, and regulatory use of surrogate end points in drug development. Clinical Pharmacology and Therapeutics, 86(1):32-43.

[158] S. Mani and G. F. Cooper, Causal discovery from medical textual data. In: AMIA Annual Symposium, pages 542-6.
[159] S. Mani and G. F. Cooper, A simulation study of three related causal data mining algorithms. In: International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann, San Francisco, CA.
[160] C. Meek, Causal inference and causal explanation with background knowledge. In: Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence (UAI-1995). Morgan Kaufmann.
[161] J. Pearl, Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA.
[162] J. Pearl, Causal diagrams for empirical research. Biometrika, 82(4).
[163] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge University Press.
[164] A. Rényi, Probability Theory. Akadémiai Kiadó, Budapest.
[165] C. Silverstein, S. Brin, R. Motwani, and J. D. Ullman, Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2/3).
[166] P. Spirtes and G. Cooper, An experiment in causal discovery using a pneumonia database.
[167] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search. MIT Press.
[168] T. Verma and J. Pearl, Equivalence and synthesis of causal models. Vol. 6, Elsevier.

24. Text mining methods in bioinformatics

Introduction

Ever since the beginning of the digital age, humanity has been using computers to develop, store and share knowledge. Nowadays, as millions of publications are produced every year, it is hopeless for modern-day researchers to keep up with the increasing amount of collective knowledge, even in their own field. Text mining, a rapidly evolving discipline, seeks to alleviate this burden; more precisely, the goal of text mining is to uncover hidden knowledge by processing large amounts of textual data. In biomedical research, this usually involves the analysis of tens of thousands, or even millions, of publications to identify previously unknown relations or generate novel hypotheses. First applied in the 1980s, but making its way into the mainstream only at the end of the 20th century, this scientific domain can be viewed as an offspring of data mining. Due to advances in computer technology and a number of related fields (including data mining, machine learning, statistics, computational linguistics and biology), biomedical text mining has undergone enormous progress since then. This chapter gives a rough overview of the basic concepts and frequently applied techniques.

Biomedical text mining

Biomedical text mining encounters knowledge usually, but not always, in the form of scientific publications (other sources might include reports, patents, drug package inserts, blog posts, etc.). This collection of documents, also called a corpus, serves as the input of the text mining process, often along with a controlled vocabulary of terms of interest and various sources of background knowledge. As output, structured data are produced which have to be stored, managed and possibly integrated into larger knowledge bases, just like any other database encountered throughout the research process. A general workflow can be described as follows:

1. Task specification, method selection. The first steps involve the definition of the problem domain and the specification of the task at hand (i.e. what is the goal, what one hopes to achieve by applying text mining). It is also vitally important to choose the appropriate tools for these purposes.

2. Corpus construction. The corpus is a collection of documents which serves as the input of the text mining process. Construction involves the downloading, filtering and merging of large amounts of textual data; moreover, one might want to use multiple corpora, each designed specifically for a particular subtask.

3. Corpus processing. Processing steps cast the data into a more manageable format, making it ready for further operations. Other important transformations might also be performed in this phase (see the corpus construction and processing discussion below for details).

4. Vocabulary construction (optional). Many methods require a controlled list of terms as an additional input. As these vocabularies can get very complicated, their construction can be very cumbersome and time-consuming.

5. Feature extraction (optional). Machine learning algorithms typically consider the data in terms of extracted features, which can be thought of as a compact, more substantial representation of the data. The goal is to provide these algorithms with appropriate features which carry large amounts of information and can be handled efficiently.

6. Analysis. Many methods have been devised, ranging from simple occurrence analyses to natural language processing (NLP), machine learning techniques and other sophisticated methods; this is the main concern of this chapter and many techniques will be discussed later.

7. Data management, integration, further steps. The resulting structured data can be integrated with data from other sources into a more comprehensive knowledge base which can be utilized in many ways (e.g. searching, inference, question answering, etc.).

Constructing the corpus

The entirety of biomedical texts, sometimes referred to as the bibliome, can be thought of as the input of the corpus construction process. Traditionally, biomedical text mining has focused on a subset of the bibliome, namely scientific abstracts (the main reasons being compactness and public availability). Nowadays, however, the focus has shifted towards other types of documents, including patents and full-text articles, which are increasingly available due to the growing acceptance of the open access philosophy. A common property of these documents is that they exist primarily in the form of unstructured data. Unstructured data, as opposed to structured data, lack the pre-defined structure and data model which are present in a proper database. Examples include videos, images, or more importantly, free text. A small subset of the bibliome consists of semi-structured documents, e.g. XML files, which represent a transition between databases and unstructured data.

The construction typically involves querying the bibliome using publicly available tools, e.g. PubMed, Google, or other search engines. Depending on the task at hand, one might filter the results to build a more domain-specific corpus and eliminate confounding factors. This can be done in several ways, including filtering by publication date, type, keywords, MeSH terms, journals, or many other aspects. When the collection is complete, the data are processed and stored in an appropriate format. Corpus processing is a broad term and can involve many operations, such as stemming, lemmatization (reduction to dictionary form), stop word removal (removing words that are unwanted or disturb the results, e.g. function words) or tokenization (segmenting the documents into smaller units, e.g. sentences or words). A special kind of corpus processing is corpus annotation, the process of attaching non-textual information to various elements of the corpus. In the biomedical domain, this usually means semantic annotation, that is, assigning categories to certain elements (e.g. gene or protein symbols), based on a pre-defined ontology. An example of an annotated corpus is the GENIA corpus [169].
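The basic processing steps above can be illustrated with a minimal Python sketch; the tiny stop word list and the naive suffix-stripping stemmer are deliberate stand-ins for real tools (e.g. Porter or Snowball stemmers), and the example abstract is invented.

import re

STOP_WORDS = {"the", "of", "in", "and", "a", "is", "are", "to"}   # tiny illustrative list

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """A deliberately simple suffix-stripping stemmer (real pipelines use Porter/Snowball)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    tokens = tokenize(document)
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [naive_stem(t) for t in tokens]

abstract = "The inhibition of Ras signalling is studied in tumour cells."
print(preprocess(abstract))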
Constructing the vocabulary

A controlled vocabulary is a list of terms of interest required by many text mining methods, also called dictionary-based methods. Dictionary-based techniques usually work by searching for these terms in the corpus, performing entity recognition, co-occurrence analysis, semantic annotation, classification, etc. Vocabularies come in many flavors:

Controlled vocabularies in a general sense can be constructed from various knowledge sources, the most important ones being expert knowledge and online databases. Extracting and filtering terms from online resources can be done in an automated or semi-automated manner, and many databases offer such capabilities (UMLS, HUGO, OMIM, etc.). Terms can also be extracted from textual data (thus representing another subfield of text mining, see e.g. [170] on ontology creation).

Taxonomies are controlled vocabularies endowed with a hierarchical structure, traditionally referring to a systematic classification of organisms. Notable modern-day examples include the International Classification of Diseases (ICD), the Anatomical Therapeutic Chemical Classification of drugs (ATC) and many other domain-specific taxonomies.

Thesauri differ from taxonomies in allowing (not necessarily hierarchical) relations between terms beside the hierarchical structure. An example is the UMLS Metathesaurus, which contains millions of biomedical and health-related terms, their synonyms and relationships.

Ontologies, in a strict sense, are controlled vocabularies represented in a formal, computer-readable ontology representation language; in practice, however, the term "ontology" is used to refer to all of the above. The Open Biological and Biomedical Ontologies (OBO) Foundry maintains several such ontologies in a very wide range of domains.

Text mining tasks

Even considered only in terms of biomedical research, text mining has a wide application field. Common tasks include:

Information retrieval is the process of returning relevant entities according to certain, user-specified criteria (a query). Systems performing information retrieval are often called search engines (see e.g. [171] on PubMed and other widely used engines).

Named entity recognition (NER) refers to the identification and annotation of words in the corpus which represent unique things - gene or protein symbols, diseases, and other "named" entities. Normalization refers to the act of linking these findings to external database identifiers. Methods for NER are discussed later in this chapter.

Relation extraction methods seek to identify relations between named entities, and can be the next step after NER. Although NER is generally considered to be a mostly solved problem, relation extraction is much more complicated and remains unsolved despite considerable effort. Methods for relation extraction are also discussed later in this chapter.

Hypothesis generation. By analyzing the structure of extracted relations and statistical associations between entities, hidden information can be uncovered and novel hypotheses can be generated.

Classification and clustering refer to the "grouping" of entities into previously known or unknown classes, respectively. Entities can be named entities or higher-level objects, such as topics or documents. In the field of machine learning, classification and clustering are extensively researched problems, and there are many good books on the subject.

Summarization methods create a compact summary from the original document while maintaining high information content. This usually involves the scoring of individual sentences according to multiple measures (e.g. position, number of keywords) and extracting the most informative ones (summary extraction). An alternative method is abstraction, which uses a semantic representation of the document to generate a natural language summary. However, natural language generation is a field still in its infancy.

Ontology creation was briefly mentioned in the previous section. For further details, see e.g. [170].

Question answering systems (QAS) can be interpreted as special information retrieval systems with a natural language interface. QASs apply syntactic and semantic analyses to re-formulate a natural language question. In the next stage, several approaches can be used to extract, filter and score informative text fragments (e.g. inference, machine learning, and information retrieval techniques).

Basic techniques

In this section, we present some simple and a few slightly more advanced methods frequently applied in biomedical text mining, without going into details. For more information on biomedical text mining or text mining in general, we refer to the textbooks [172] and [173].

Pattern matching

Pattern matching seeks to locate pre-defined "patterns" in the text and is one of the most basic techniques in text mining. Patterns can be defined as strings or regular expressions (a special kind of expression representing a set of rules which can match multiple strings). In the second half of the 20th century, many algorithms were designed for both purposes, see e.g. the Boyer-Moore algorithm for string matching [174] or the review of Cox on finite state machines and regular expressions [175]. Fuzzy pattern matching is a more complicated matter where approximate matches are also allowed, according to some distance metric. These methods are used not only in text mining, but also in sequence alignment. Common metrics include:

Hamming distance: the number of positions in equally sized strings where the corresponding characters differ.

Levenshtein distance: computed from the (possibly weighted) number of insertions, deletions and substitutions needed to transform one string into the other.

Manhattan distance: the sum of absolute differences of the coordinates in a vector space representation.

Biology-inspired distance metrics: Needleman-Wunsch and Smith-Waterman distances, originally developed for sequence alignment.
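As a minimal illustration of such metrics, the following Python sketch implements the Hamming and Levenshtein distances; the example strings are arbitrary.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions in equally sized strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

print(levenshtein("BRCA1", "BRCA2"))   # 1
print(hamming("GATTACA", "GACTATA"))   # 2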

Document representation

In order to analyze free text using computers, documents have to be represented in a well-defined, machine-readable way - in other words, as structured data. The conversion can be done in a number of ways, depending on the task at hand. The most commonly used representations are the vector space model and probabilistic approaches. Let t_1, ..., t_m denote the terms and d_1, ..., d_n the documents, and let A be an m x n matrix, also called the term-document matrix, for which A_ij = 1 if document d_j contains term t_i, and A_ij = 0 otherwise. Thus, the terms are represented by the rows, each row forming a vector in an n-dimensional vector space, hence the name of the model. Similarly, the columns represent documents and form vectors in an m-dimensional vector space. This family of models does not take the particular order of terms into account; therefore, they are often called "bag of words" models. More sophisticated models use tf_ij, the frequency of term t_i in document d_j, instead of binary occurrences, or employ more complex weighting schemes. One of the most commonly used weighting schemes is tf-idf (term frequency-inverse document frequency), which can be computed as follows:

w_ij = tf_ij * idf_i,   where   idf_i = log(n / df_i),

tf_ij denotes the relative frequency of term t_i in document d_j, df_i denotes the number of documents in which the term occurs, and idf_i denotes the inverse document frequency of the term (taking the logarithm is a convention). A considerable advantage of the vector space model is that it makes the computation of document-document or term-term similarities particularly easy, which comes in handy in many tasks (e.g. classification, clustering). There exists a large number of applicable similarity measures, ranging from cosine similarity to very complex and sophisticated methods. It is obvious that vector space representations can be very high-dimensional and very sparse. For real-world text mining problems, many algorithms were developed to reduce the dimensionality of the data, including (a list far from complete):

Linguistic approaches: stemming, lemmatization, stop word filtering.

Matrix decompositions: singular value decomposition (SVD, in this context also called latent semantic indexing), CUR decomposition, other low-rank approximations.

Machine learning methods: feature selection/extraction, principal component analysis (PCA), multidimensional scaling (MDS), self-organizing maps (SOM).
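To make the vector space model and the tf-idf weighting above concrete, here is a short, self-contained Python sketch; the three toy "documents" are invented for the example, and real systems would of course rely on dedicated libraries.

import math
from collections import Counter

documents = {
    "doc1": "cAMP inhibits Ras signalling in tumour cells".lower().split(),
    "doc2": "Ras mutations drive tumour growth".lower().split(),
    "doc3": "stop words are removed before indexing".lower().split(),
}

n_docs = len(documents)
df = Counter()                         # document frequency of each term
for tokens in documents.values():
    df.update(set(tokens))

def tfidf_vector(tokens):
    """Relative term frequency weighted by the inverse document frequency."""
    tf = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}

vectors = {name: tfidf_vector(toks) for name, toks in documents.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(vectors["doc1"], vectors["doc2"]))   # shares "ras" and "tumour"
print(cosine(vectors["doc1"], vectors["doc3"]))   # no overlap -> 0.0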

Another approach is the application of probability theory and probabilistic models. Traditionally applied in information retrieval systems and spam filtering, probabilistic approaches have become essential in text mining as they often outperform other models and are extremely well-suited for many biomedical text mining tasks as well. Unfortunately, a full discussion of this field exceeds the scope of this book. Here we present only a list of popular techniques, and refer the interested reader to [176] for more information on probabilistic models:

Markov Random Fields, Conditional Random Fields

Hidden Markov Models

Bayesian Models

Bayesian Networks

Probabilistic Context-Free Grammars (PCFG, LPCFG)

Methods for named entity recognition

Named entity recognition (NER) refers to the localization and annotation of elements in the text which represent unique, "named" entities. There are four mainstream approaches to NER:

Dictionary-based methods use exact and approximate pattern matching to identify named entities in free text.

Rule-based methods employ various rules to find named entities. It is well known that even a few intuitive rules can achieve fairly good performance, considering e.g. capital letters, contextual features (quotes, parentheses), position in the text or in the headline, frequencies, domain-specific features, etc. Rules can also be learned automatically using machine learning techniques.

Machine learning techniques have also been applied with good performance. Classification-based approaches apply classification algorithms, such as SVMs, from the very wide repertoire of machine learning; these require training on annotated corpora in order to recognize named entities in further documents. Sequence-based algorithms, such as some of the probabilistic approaches listed above, are parameterized using tagged corpora and predict the most likely tags for the words in the text.

Hybrid methods can employ any combination of these strategies.

For more details and a description of public tools see e.g. [177]. The next step following NER is usually entity normalization, i.e. the linking of recognized entities to database identifiers - easily done if dictionary-based methods were used, somewhat more tedious with the other approaches.
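A dictionary-based recognizer can be as simple as a regular expression compiled from the vocabulary. The sketch below is a minimal illustration: the three-entry dictionary and the placeholder identifiers are hypothetical, not real database accessions.

import re

# Hypothetical mini-dictionary mapping surface forms to placeholder identifiers
entity_dictionary = {
    "camp":  "ID:0001",
    "ras":   "ID:0002",
    "brca1": "ID:0003",
}

pattern = re.compile(r"\b(" + "|".join(map(re.escape, entity_dictionary)) + r")\b",
                     re.IGNORECASE)

def dictionary_ner(sentence):
    """Return (surface form, start, end, normalized id) for every dictionary hit."""
    hits = []
    for m in pattern.finditer(sentence):
        surface = m.group(0)
        hits.append((surface, m.start(), m.end(), entity_dictionary[surface.lower()]))
    return hits

print(dictionary_ner("cAMP inhibits Ras in BRCA1-deficient cells."))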
Methods for relation extraction

Relation extraction aims to detect various types of relationships between named entities. If used correctly, this is a very powerful tool for hypothesis generation, as it can uncover previously unknown relationships embedded in the scientific literature. That being said, relation extraction is a much less trivial matter than named entity recognition, as the words defining the relationship tend to be scattered throughout sentences or paragraphs. The approaches described in the previous section can be utilized in relation extraction as well, i.e. there are dictionary-based, rule-based, and machine learning-based systems. The extracted relations can belong to the following types:

Statistical relations between terms are the easiest to detect. Dictionary-based NER methods are well-suited for counting occurrences, which can be used to compute co-occurrence statistics between entities. The simplest of these are binary co-occurrences and frequency-based models; a more sophisticated measure is e.g. mutual information. A serious pitfall of this approach is that it does not consider the context: speculation, hypotheses and direct negation are still identified as valid relations. (A small co-occurrence sketch is given after this list.)

Semantic relations are usually extracted using natural language processing (NLP, see the section on lexicalized probabilistic context-free grammars below). These systems identify relations by building the parse tree of the sentences (i.e. a tree which represents the syntactic structure of the sentence) and identifying structures in them, e.g. the subject-predicate-object structures used by the RDF data model. An example is the phrase "cAMP inhibits Ras", which corresponds to such a structure.

Syntactic relations are the focus of more recent research, strongly connected to kernel methods for relational learning. The main idea is to consider relations in the form of syntactic structures (parse trees or dependency graphs). Using known relations as training data and a machine learning approach to classify further relations extracted from sentences, valid interactions can be identified with fairly good accuracy [178].
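The statistical approach mentioned in the list above can be sketched in a few lines of Python: assuming each document has already been reduced to the set of entities recognized in it (the sets below are invented), the script counts pairwise co-occurrences and computes pointwise mutual information, a simple instance of the mutual-information idea.

import math
from collections import Counter
from itertools import combinations

# Each "document" is reduced to the set of entities recognized in it (e.g. by NER)
doc_entities = [
    {"cAMP", "Ras"},
    {"Ras", "MAPK"},
    {"cAMP", "Ras", "MAPK"},
    {"BRCA1"},
]

n_docs = len(doc_entities)
single = Counter()
pair = Counter()
for ents in doc_entities:
    single.update(ents)
    pair.update(frozenset(p) for p in combinations(sorted(ents), 2))

def pmi(a, b):
    """Pointwise mutual information of two entities based on document co-occurrence."""
    p_ab = pair[frozenset((a, b))] / n_docs
    p_a, p_b = single[a] / n_docs, single[b] / n_docs
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

print(sorted(pair.items(), key=lambda kv: -kv[1]))   # raw co-occurrence counts
print(round(pmi("cAMP", "Ras"), 3))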

As mentioned before, relation extraction is a very useful tool in hypothesis generation. The earliest model using this approach was introduced by Swanson in 1986 [179]. Swanson's famous ABC model of discovery assumes two separate subsets of the literature which do not interact (that is, there are no articles in common, they do not cite each other and they are not cited together). The model states that if a relation between entities A and B is described in the first group, and a relation between entities B and C in the other, then a previously unknown relation between A and C can be hypothesized. The Arrowsmith tool used this approach while utilizing co-occurrence statistics to induce relationships between terms. For more Literature-Based Discovery (LBD) systems and their evaluation see [180].

Lexicalized probabilistic context-free grammars

Formal language theory is a well-established field of research at the intersection of mathematical logic, computational linguistics and computer science. Despite having been around for more than a hundred years, new applications are still being discovered. LPCFGs, also called Stochastic Lexicalized Context-Free Grammars (SLCFGs), are a particularly powerful tool for NLP, implemented by many state-of-the-art parsers, such as the Stanford Parser [181]. In biomedical text mining, this toolset can be utilized to build parse trees for the sentences of scientific publications and transcend traditional co-occurrence and rule-based models. A context-free grammar (CFG) can be defined as a 4-tuple (N, T, R, S) where:

N is a finite set of non-terminal symbols, e.g. S (sentence), VP (verb phrase), NP (noun phrase), NN (noun), Vi/Vt (intransitive/transitive verb).

T is a finite set of terminal symbols, e.g. cAMP, Ras, inhibit.

R is a finite set of production rules, which take the form A -> β, where A is a single non-terminal symbol and β can be any sequence of symbols; e.g. S -> NP VP, NN -> cAMP.

S is the start symbol, which will be the root of the parse tree (S).

Using the production rules, one or more parse trees can be constructed for every grammatically correct sentence. Probabilistic CFGs are a straightforward extension of the above: in order to resolve ambiguity, each production rule A -> β is assigned a probability q(A -> β), with the probabilities summing to one for each left-hand-side symbol. By multiplying the probabilities of the rules used in a given parse tree, the most likely parse tree can be selected (Figure 54). Lexicalized PCFGs go one step further and condition the rule probabilities on particular (head) words appearing in the production rules.
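The "multiply the rule probabilities" idea can be demonstrated with a toy grammar in Python; the grammar, its probabilities and the parse tree below are invented solely for illustration, loosely following the "cAMP inhibits Ras" example, and are not the output of a real parser.

import math

# A toy probabilistic context-free grammar; rule probabilities are illustrative only
# and must sum to one for each left-hand-side non-terminal.
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("NN",)):      1.0,
    ("VP", ("Vt", "NP")): 0.7,
    ("VP", ("Vi",)):      0.3,
    ("NN", ("cAMP",)):    0.5,
    ("NN", ("Ras",)):     0.5,
    ("Vt", ("inhibits",)): 1.0,
}

# A parse tree for "cAMP inhibits Ras", written as nested tuples (symbol, children...)
tree = ("S",
        ("NP", ("NN", "cAMP")),
        ("VP", ("Vt", "inhibits"),
               ("NP", ("NN", "Ras"))))

def tree_probability(node):
    """Multiply the probabilities of all production rules used in the parse tree."""
    symbol, *children = node
    if len(children) == 1 and isinstance(children[0], str):      # pre-terminal rule
        return pcfg[(symbol, (children[0],))]
    rhs = tuple(child[0] for child in children)
    p = pcfg[(symbol, rhs)]
    for child in children:
        p *= tree_probability(child)
    return p

print(tree_probability(tree))            # product of the rule probabilities
print(math.log(tree_probability(tree)))  # log-probabilities avoid underflow in practice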

Difficulties in biomedical text mining

Despite the considerable effort that has gone into text mining research, there are many inherent pitfalls which have proved quite difficult to overcome.

Synonymy primarily affects dictionary-based NER systems. In order to detect and normalize entities with satisfying accuracy, taking synonyms into account is inevitable, which introduces an enormous increase in the number of terms and a considerable performance overhead.

Homonymy refers to terms with identical spelling but completely different meanings, which also affects the performance of NER systems.

Anaphora, according to the Merriam-Webster dictionary, is the "use of a grammatical substitute (as a pronoun or a pro-verb) to refer to the denotation of a preceding word or group of words". Anaphora resolution is an intensively researched sub-field of text mining [182].

Morphological variants are frequent within the biomedical literature, usually handled by adding them as synonyms or using approximate string matching techniques.

Spelling mistakes are inevitable when analyzing large amounts of free text. Approximate string matching techniques may be used to detect misspelled entities.

Acronyms are incredibly common in the biomedical literature, posing serious challenges to named entity recognition. Acronyms are also subject to homonymy (e.g. a gene symbol can refer to multiple unrelated genes), which makes entity normalization difficult. Moreover, acronyms can have the same spelling as other terms, including conjunctions, which affects purely dictionary-based NER systems (however, rule-based extensions can circumvent this problem).

Entity boundaries are generally not trivial to determine as they may be overlapping or context-dependent. Many systems resort to rule-based approaches or syntactic parsing.

Vocabularies become obsolete fairly quickly as science progresses, and have to be maintained, which requires considerable effort.

Reference databases for entity normalization are usually incomplete, and linking to them (or defining mappings between them) can prove a challenging task.

Biomedical text mining is also subject to a number of biases:

Publication bias refers to the phenomenon that "positive" results are much more likely to be published than "negative" ones. To circumvent this problem, many authorities and medical journals require a study to be registered prior to its launch. However, as of 2009, less than half of the registered clinical trials had published their results [183].

Selection bias occurs due to the fact that not every publication is available as full text; large-scale text mining studies usually resort to abstracts, which contain only partial information. Growing acceptance of the Open Access principle may alleviate this problem.

Sampling bias is also observable, as there is a preference towards more frequently studied entities, which might distort the conclusions.

Text mining and knowledge management

So far, we have talked about the analysis of unstructured (free) text and its conversion to structured data. This intermediate representation comes in many forms, including bag-of-words models, probabilistic models, parse trees, dependency graphs, conceptual graphs, etc. A subtle difference between representations is the amount of semantics they preserve: while bag-of-words models reduce the text to data vectors which only consider weighted term frequencies, NLP approaches provide representations which preserve much of the original rich semantics. Many text mining algorithms employ inductive inference on the intermediate structured data, i.e. they identify general rules based on the particular observations present in the model. Although this kind of inference is perfectly suited for data mining, and works rather well in text mining, it does not exploit the rich expressive power of natural language. A more "natural" approach would be the application of abductive or deductive inference to uncover new knowledge from an appropriate representation of the semantic content of the text. This approach can be further augmented by following the principles of semantic publishing. Semantic publishing refers to the enrichment of scientific publications with semantic information, essentially creating a layer of formal knowledge representation which could support information retrieval and knowledge discovery, and provide a unified view of the entire scientific literature. Although many guidelines, semantic languages and concepts (e.g. Structured Digital Abstracts) have been developed, this new era of scientific publishing is yet to come.

25. References

[169] J. D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii, GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl 1).
[170] P. Cimiano, Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[171] Z. Lu, PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford), 2011:baq036.
[172] M. S. Simpson and D. Demner-Fushman, Biomedical Text Mining: A Survey of Recent Progress. In: C. C. Aggarwal and C. Zhai, editors, Mining Text Data. Springer.
[173] S. M. Weiss, N. Indurkhya, and T. Zhang, Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, Berlin, 1st ed.
[174] R. S. Boyer and J. S. Moore, A Fast String Searching Algorithm. Commun. ACM, 20(10).
[175] R. Cox, Regular expression matching can be simple and fast.
[176] Y. Sun, H. Deng, and J. Han, Probabilistic Models for Text Mining. In: C. C. Aggarwal and C. Zhai, editors, Mining Text Data. Springer.
[177] U. Leser and J. Hakenberg, What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform, 6(4).
[178] C. M. Cumby and D. Roth, On Kernel Methods for Relational Learning. In: T. Fawcett and N. Mishra, editors, Proceedings of the 20th International Conference on Machine Learning (ICML 2003), Washington, DC, USA. AAAI Press.
[179] D. R. Swanson, Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect. Biol. Med., 30(1):7-18.
[180] M. Yetisgen-Yildiz and W. Pratt, Evaluation of Literature-Based Discovery Systems. In: Literature-Based Discovery.
[181] D. Klein and C. D. Manning, Accurate Unlexicalized Parsing.
In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA.

[182] J. D'Souza and V. Ng, Anaphora Resolution in Biomedical Literature: A Hybrid Approach. In: Proceedings of the 3rd ACM Conference on Bioinformatics, Computational Biology and Biomedicine.
[183] S. Mathieu, I. Boutron, D. Moher, D. G. Altman, and P. Ravaud, Comparison of registered and published primary outcomes in randomized controlled trials. JAMA, 302(9).

26. Experimental design: from the basics to knowledge-rich and active learning extensions

Introduction

Experimentation is one of the most powerful ways for humanity to discover and understand the world we live in. Scientific, or even philosophical, progress would be inconceivable without carefully designed experiments. No wonder that, according to most developmental psychologists, experimentation plays a central role in human cognitive development: Jean Piaget describes even months-old children as "young scientists" exploring the world by designing and conducting experiments. However, it was not until the 20th century that mathematicians started to pay attention to this subject. Since Ronald Fisher, one of the greatest statisticians (also an evolutionary biologist and geneticist), published his work The Design of Experiments (1935), experimental design has become a large sub-field of mathematical statistics. In this chapter, we review the process of experimental design from both the biologists' and the statisticians' point of view.

The elements of experimental design

The purpose of experimental design (Design of Experiments, DOE) is to ensure that the experiment is in some sense optimal. Usually, this means the maximization of the information gained with the minimization of bias, error and, of course, time and costs. It is also vital to ask valid questions and to allow the conductor of the experiment to draw valid conclusions; unreasonable questions and misinterpreted results can ruin the whole research, however good the samples and measurements were. Other important concepts are repeatability and reproducibility. Biomedical DOE also involves the planning of more practical tasks such as sample collection and storage, equipment usage, personnel management, etc. It is also worth noting that biomedical DOE relies heavily on basic concepts from epidemiologic study design. These will not be covered in the current chapter; for further reading, see [184].

Phases of biomedical DOE

The workflow of biomedical experimental design can be broken down into the following main steps:

1. Modelling the field of interest. This usually involves the thorough investigation of the scientific literature, most often done by the scientists themselves with various amounts of support from bioinformatics. One end of the spectrum can be a scientist using a search engine (e.g. PubMed) and reading publications, whereas the other can be a fully integrated data and text mining system which performs the extraction, modelling and visualization of the knowledge with little or no human intervention required.

2. Determining the goals. This step is closely related to the development of hypotheses. On one hand, experiments are usually designed to help the scientist decide between competing explanations. On the other hand, pre-established hypotheses are no longer required in the field of biology: a number of high-throughput measurement technologies were developed in the post-genomic era, which are in fact often utilized to generate hypotheses.

3. Determining the sample size and target variables.
Target variables are essentially the outputs of the experiment, i.e. in an experiment we set various input parameters or factors and aim to discover their effect on the outputs or target variables.

A good question is often the key to a successful experiment; therefore it is crucial to define an appropriate sample size and target variable set.

4. Refining the technical details. In this step, a number of technical aspects are investigated, such as the data/sample collection protocol, storage, handling of missing data, data/sample preprocessing, choice of technology and equipment (and related activities, e.g. assay design), ethical and legal issues, etc. Many of these rely heavily on bioinformatics support.

Types of biological experiments

Experiments can be categorized on the basis of many aspects. For example, according to the mathematical-statistical nature of the experiment, we can think of:

Detecting associations. Association means a significantly higher occurrence of a certain entity (e.g. a gene variant) in patients suffering from a given disease. However, it should be noted that association does not necessarily imply any causal or etiological relationship.

Classification. Classification refers to assigning samples to pre-defined classes. Consider e.g. screening tests, where these classes can be defined straightforwardly as "healthy" and "not healthy".

Clustering. Clustering differs from the former in not having any pre-defined classes. Clustering is frequently used in analyzing gene expression data (e.g. bi-clustering of microarray data).

Regression. Regression refers to predicting numerical values for unknown samples and determining the most influential factors with respect to the target variables. A possible application is predicting the outcome of a given disease.

Comparison. Comparison is one of the most simple and powerful ways to establish novel hypotheses.

Modelling/hypothesis generation. When performing modelling, one maps the complex relationships present in the real world to a simpler mathematical construction. This process is also called abstraction; it essentially distinguishes between "relevant" and "irrelevant" properties. Extracting and representing the relevant content in an effective way can uncover hidden information, support decision making and establish new hypotheses (or even generate them in a systematic manner).

A decision theoretic approach to DoE

Expected value of an experiment

To get a grasp on the statistical framework for experimental design, first we have to apply the concepts of utility theory. Let us imagine a workflow of experiments, all of which take some input data and input parameters and can produce various output data. Based on those data, we might perform some action which can trigger events to occur. We might conduct further experiments given that particular event. Such a system can be modelled as a probabilistic graph, as seen in Figure 56. While conducting a series of experiments, the scientist moves through the tree along its edges. Each experiment holds a certain value to us, which is the utility of the experiment. A reasonable strategy is to always conduct the experiment which maximizes the expected utility. More details of this approach can be found in the original work of Bernardo and Smith [185]. Let e denote the experiments, D the data obtained, a the actions, θ the events and u the utility function. Considering the probabilistic nature of the transitions, the expected utility of an action can be derived by integrating out the events θ:

  EU(a | e, D) = ∫ u(a, θ) p(θ | e, D) dθ.

By taking the action which maximizes the expected utility, we can choose an optimal decision from the available actions for every experiment e and data set D:

  EU*(e, D) = max_a EU(a | e, D).

We can now step back again and write the expected utility of an experiment by integrating out the data D:

  EU(e) = ∫ EU*(e, D) p(D | e) dD,

where the last term denotes the likelihood of the data given the experiment. However, a new problem emerges here. When should we stop collecting results and be satisfied with the knowledge we have gained so far? For example, a basic principle of medical ethics states that one should only perform examinations that have an impact on the treatment of the patient. If we had a measure of the impact of the data we would obtain in the future, the issue would be solved. Fortunately, there is a way to compute the expected value of the data and the expected value of the experiment which addresses exactly this problem. The expected utility of performing no experiment, denoted by e0, is straightforward:

  EU(e0) = max_a ∫ u(a, θ) p(θ) dθ.

Now we can express the value of the data to be gained with the experiment by taking the difference between the utilities of performing and not performing it:

  EVD(e, D) = EU*(e, D) - EU(e0).

This quantity is called the expected value of the data (EVD). By integrating out the data we get the expected value of the experiment (EVE):

  EVE(e) = ∫ EVD(e, D) p(D | e) dD = EU(e) - EU(e0).
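A small numeric sketch shows how EVD and EVE can be computed by enumerating the possible data sets; the Bernoulli experiment, the grid over the unknown parameter and the utility values are all invented for illustration.

import numpy as np
from math import comb

# Grid over the unknown event/parameter theta (e.g. the response rate to a drug)
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / theta.size          # flat prior

# Two actions with a simple (hypothetical) utility u(a, theta)
def utility(action, th):
    return {"treat": 10 * th - 4, "no_treat": 0.0}[action]

actions = ["treat", "no_treat"]

def expected_utility(belief):
    """max_a  E_belief[ u(a, theta) ] -- the optimal decision given a belief."""
    return max(np.sum(utility(a, theta) * belief) for a in actions)

eu_no_experiment = expected_utility(prior)        # EU(e0)

# Experiment e: observe k successes out of n Bernoulli trials
n = 10
eve = 0.0
for k in range(n + 1):
    lik = comb(n, k) * theta**k * (1 - theta)**(n - k)    # p(D | theta)
    evidence = np.sum(lik * prior)                         # p(D | e)
    posterior = lik * prior / evidence
    evd = expected_utility(posterior) - eu_no_experiment   # EVD(e, D)
    eve += evidence * evd                                  # integrate out the data

print(f"EU without experiment: {eu_no_experiment:.3f}")
print(f"Expected value of the experiment (EVE): {eve:.3f}")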
Adaptive designs and budgeted learning

In real-world scenarios, there are always constraints on the research process, such as money, time, equipment, etc. In most cases, the goal is to obtain the highest possible amount of information until the budget is exhausted. Budgeted learning and adaptive study design are two closely related concepts and have a long history in the field of pharmacology and clinical trials, traditionally focusing on the adaptive selection of sample size. Since the late 1970s, a great deal of effort has gone into finding a viable alternative to fixed-sample designs. While being the modus operandi for a long time, these designs suffer from a central flaw: one always has to use the same number of samples and cannot look at the data in the meantime. Beyond the economic drawbacks (e.g. the costs of an unnecessarily big sample size), there are also ethical and administrative issues with this method. A number of different approaches were devised to alleviate these problems (see [186] for further methods):

1. Group sequential methods allow the experimenter to take multiple looks at the accumulating data at fixed intervals. If at some point the experiment proves to be successful (by reaching a certain significance level), the process is terminated and no more samples are required. However, it can be shown that "being significant" in at least one group leads to a much higher overall type I error rate. Therefore, the nominal significance levels for each look must be adjusted accordingly. An overview of the popular adjusting schemes can be found in [187].

2. Alpha-spending approaches can be thought of as an extension of the previous method by allowing the "looks" to be irregular (i.e. the group sizes can be different). This is achieved by requiring a pre-determined overall type I error rate and keeping track of the cumulative type I error rate (mathematically speaking, by defining an error spending function α(t) with α(0) = 0 and α(1) equal to the overall error rate, where t denotes the information fraction). Every time the experimenter looks at the data, the nominal significance level is calculated according to the spending curve.

3. Whitehead's triangular design, also called the boundary approach, differs from the above two in requiring continuous data monitoring. At every look at the data, two statistics are calculated, one denoting the difference between the active and control groups, and the other denoting the variance of the difference statistic; both of them are represented on the axes of a 2D coordinate system on which the accumulating data are plotted. Furthermore, two theoretical boundaries are calculated which can be represented as two lines on the plot. When the upper one is crossed, the experiment is considered successful, and the opposite goes for the lower one. As long as the data "stay" between these two, the experiment is continued (hence the names "continuation region" and "triangular design": the shape of this region is triangular).

4. Stochastic curtailment takes yet another approach by approximating the likely outcome of the experiment. If the desired level of significance would be reached regardless of the future samples, or, on the contrary, reaching it seems possible but unlikely, the experiment is terminated.

All of these methods share some common advantages. Besides being "closer" to real-world experiments (e.g. regular progress monitoring) and fairly convenient to conduct, they also allow the early stopping of the experiments, thereby leading to lower sample sizes and shorter studies.

A Bayesian treatment of sequential decision processes

Bayesian statistics and Bayesian networks are very well-suited for modelling sequential decision processes. Recently, a number of publications aimed to augment the attractive properties of the Bayesian framework even further by utilizing informative priors and utility functions, parallel computations, or, more importantly, by the incorporation of various, previously unrelated methods (e.g. gene prioritization). In this section, we present an adaptive technique to design a sequence of experiments by selecting only the most promising variables (e.g. SNPs) in each step and thereby ensuring a relatively large sample size within a fixed budget. In order to achieve this, we will use the concepts from the previous sections and a Bayesian approach. This method was first applied in the investigation of the genetic background of asthma using PGAS [188]. The central idea is to apply relevance analyses (i.e. determining the variables which are closely related to the object of our interest, e.g. a phenotype) followed by variable pruning in an iterative manner. The general workflow is shown in Figure 57. First, an initial set of candidate variables is selected from measurement data and expert knowledge (possibly with the support of other tools including search engines, prioritizers, text miners, etc.). Then the candidate variables enter a cycle of subsequent experiments, relevance analyses and pruning, in which the algorithm always keeps only the variables with the largest expected utility. After each iteration, a decision is made whether to continue or stop the experiments, where the latter involves the optimal reporting of the relevant sets. This discussion follows [188]. Consider a set of structural features and a posterior over them given our current knowledge at a given step. The optimal reported feature can be determined by maximizing the expected utility of reporting a feature under this posterior.

In each step, a decision must be made whether to stop or continue the experiments. In the case of stopping, the utility of the steps so far equals the utility of the optimal report; otherwise, it is defined as the expected utility of the data still to be collected. It is worth noting that the latter can be approximated with the utility of the report. That being said, the only thing missing is the utility function itself. Note the recursive nature of this procedure, which implies that at some point a direct scoring function is required. Let the structural features be variable sets. The direct scoring function of a reported collection of variable sets combines the MBM-scores of the variables in a set, the MBS-score of the set, and the MBG-score of the set (see [189] on Markov Blanket sets and Bayesian multilevel analysis).

Approaches to target variable selection

Gene Prioritization

Gene prioritization is a ranking problem where the goal is to find the entities most relevant to a given query. It can be thought of as a "biomedical Google" where the query can consist of diseases, disease-related genes, terms describing diseases, etc. The output of the prioritization system is a list of genes ordered by their relevance to the query. As the integration and coherent use of multiple information sources (information fusion) became more and more prominent in the scientific literature, gene prioritization developed strong ties to fusion methods. Although most prioritization systems are based on pairwise similarities and graphical representations, there are other methods such as order statistics [190] and Bayesian networks [191]. Many systems are described in detail in the literature [192]. For an easier understanding, we demonstrate SVM-based prioritization through a simple practical example. Suppose we have the task of finding genes with a role in the cell cycle. For that we have gene expression data from microarray studies and some well-known proto-oncogenes (the query). We assume that genes with "similar" expression profiles will have more or less the same function. We now have to define the concept of "similarity", i.e. we have to choose a similarity measure - as the number of possibilities is very large, this also serves as a way to incorporate our expert knowledge. Using a mathematical space defined by these similarities, the one-class Support Vector Machine (SVM) computes a surface which separates our query from the other genes with the highest possible margin. In the next step, genes are prioritized on the basis of their distance to this surface; the smaller the distance, the higher the probability that the gene has a role in the cell cycle (Figure 58).

For more details on the one-class and ν-SVMs, see [193]. The primal of the one-class SVM can be written as

  min over w, ξ, ρ:  (1/2) ||w||^2 + 1/(ν n) Σ_i ξ_i − ρ
  subject to  ⟨w, φ(x_i)⟩ ≥ ρ − ξ_i,  ξ_i ≥ 0,

where the first part of the objective function ensures the smoothness of the model, ν controls the model complexity, ρ denotes the margin, and the ξ_i are the slack variables necessary for the soft-margin formulation. φ provides the mapping to the Reproducing Kernel Hilbert Space, i.e. k(x, x') = ⟨φ(x), φ(x')⟩. The dual is

  min over α:  (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)
  subject to  0 ≤ α_i ≤ 1/(ν n),  Σ_i α_i = 1.

In the prioritization framework, the distance from the origin can be computed as

  d(x) = Σ_i α_i k(x_i, x) / sqrt( Σ_{i,j} α_i α_j k(x_i, x_j) ),

where the denominator stands for the normalization and the constant parameter ρ is omitted.
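As a rough illustration of this prioritization scheme, the following Python sketch trains scikit-learn's OneClassSVM on a small simulated query set and ranks every gene by its decision-function value; the expression data, the query and all parameter settings are invented.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical expression profiles (rows: genes, columns: microarray conditions)
n_genes, n_conditions = 200, 30
expression = rng.normal(size=(n_genes, n_conditions))

# Suppose the first 10 genes are the known query genes (e.g. proto-oncogenes);
# cell-cycle-like genes are simulated to share a common expression component.
signal = rng.normal(size=n_conditions)
expression[:40] += 1.5 * signal
query_idx = np.arange(10)

# Train the one-class SVM on the query genes only
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(expression[query_idx])

# decision_function is large for genes "inside" the learned region;
# ranking all genes by it gives the prioritization.
scores = model.decision_function(expression)
ranking = np.argsort(-scores)
print("Top 20 prioritized gene indices:", ranking[:20])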

Active learning

Let us now consider the above scenario with a little modification. Suppose we have the genes and the expression profiles but we do not know anything about the function of the genes. In fact, a separate experiment is required to "reveal" whether a gene has the function of our interest. Our goal is to discover the genes with the function with reasonable accuracy and with a relatively small number of experiments. This problem resembles the drug discovery process, where the aim is to discover active compounds in a huge molecular library. In 2003, Warmuth suggested an elegant framework for such problems, using the concept of active learning [194]. Active learning is an iterative process described by the following steps:

1. Build a model based on an initial sample set (the corresponding number of experiments is required).

2. Select previously unlabelled samples according to some criterion and reveal their labels (again, experiments are required).

3. Refine the model.

4. Repeat steps 2-3 until some stopping criterion is met.

In our case, two reasonable selection strategies could be the selection of the genes closest to the hyperplane of the SVM, or, the other way around, the genes farthest from the hyperplane ("inside", i.e. on the "positive" side). The former choice is the basis of the Minimal Marginal Hyperplane methods, which essentially select the samples about which the model is the most uncertain, and improve the model by examining these borderline cases. The latter strategy (Maximum Marginal Hyperplane) is based on reviewing samples which have been deemed to be certain. For other selection strategies and their behaviour, see [194]. The "active" term refers to the active exploration of the data, as opposed to the previous algorithm which used a static training set with known labels. Also note the sequential nature of the algorithm, which suggests a relation to other concepts such as Sequential Experimental Design and Adaptive Experimental Design.
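A minimal active learning loop with the Minimal Marginal Hyperplane selection strategy might look as follows; the pool of "genes", the oracle standing in for a wet-lab experiment, the batch size and the number of rounds are all invented, and a standard two-class SVM is used as the ranking model.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical pool of genes with feature vectors; the oracle stands for an experiment.
X = rng.normal(size=(300, 20))
true_labels = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # hidden ground truth

def oracle(idx):
    return int(true_labels[idx])       # in reality: run the experiment for gene idx

# Initial sample set: make sure both classes are present, then add a few random picks
labelled = [int(np.argmax(true_labels == 1)), int(np.argmax(true_labels == 0))]
labelled += [int(i) for i in rng.choice(len(X), size=8, replace=False) if i not in labelled]
y = {i: oracle(i) for i in labelled}

for _ in range(5):
    clf = SVC(kernel="linear").fit(X[labelled], [y[i] for i in labelled])
    # Minimal Marginal Hyperplane strategy: query the samples the model is most
    # uncertain about, i.e. those closest to the separating hyperplane.
    margins = np.abs(clf.decision_function(X))
    candidates = [int(i) for i in np.argsort(margins) if int(i) not in y]
    for i in candidates[:10]:          # a batch of 10 new "experiments" per round
        labelled.append(i)
        y[i] = oracle(i)

print("Number of labelled genes after 5 rounds:", len(labelled))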

Other practical tasks relying on bioinformatics

Modern experimental design would be inconceivable without the support of bioinformatics. The most important steps of the workflow which rely on bioinformatics are:

Literature investigations. Processing the scientific literature and extracting relevant information requires a fair amount of bioinformatics support. Many widely used search engines offer extensive capabilities such as filtering and organization of publications, citation tools, APIs, etc. Moreover, automated text mining systems are also at a researcher's disposal.

Sample and data collection. Preparing, distributing, collecting and processing questionnaires (or providing an electronic interface), as well as the identification and transportation of samples, all require strong informatics support.

Storage issues. Physical sample storage is usually integrated with electronic inventory management systems. Similarly, standardized data storage (e.g. of input and measurement data) is achieved by utilizing modern database systems.

Security. Data security is crucial from both legal and ethical aspects. A closely related concept is shared access, which plays an essential role in the synchronization of people performing different tasks during the experiment. Quality assurance is also deeply related to both security and informatics.

27. References

[184] W. Ahrens and I. Pigeot, Handbook of Epidemiology. Springer.
[185] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. Wiley Series in Probability and Statistics, John Wiley and Sons Canada, Ltd.
[186] S. Senn, Statistical issues in drug development. Wiley-Interscience.
[187] C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC Interdisciplinary Statistics, Taylor and Francis.
[188] P. Antal, G. Hajós, A. Millinghoffer, G. Hullám, Cs. Szalai, and A. Falus, Variable pruning in Bayesian sequential study design. Machine Learning in Systems Biology, page 141.
[189] P. Antal, A. Gézsi, G. Hullám, and A. Millinghoffer, Learning complex Bayesian network features for classification. In: Proc. of the Third European Workshop on Probabilistic Graphical Models, pages 9-16.
[190] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L. C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau, Gene prioritization through genomic data fusion. Nat. Biotechnol., 24.
[191] A. Parikh, E. Huang, C. Dinh, B. Zupan, A. Kuspa, D. Subramanian, and G. Shaulsky, New components of the Dictyostelium PKA pathway revealed by Bayesian analysis of expression data. BMC Bioinformatics, 11:163.
[192] L. C. Tranchevent, F. B. Capdevila, D. Nitsch, B. De Moor, P. De Causmaecker, and Y. Moreau, A guide to web tools to prioritize candidate genes. Brief. Bioinformatics, 12:22-32.
[193] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, Estimating the support of a high-dimensional distribution. Neural Comput., 13.
[194] M. K. Warmuth, J. Liao, G. Ratsch, M. Mathieson, S. Putta, and C. Lemmen, Active learning with support vector machines in the drug discovery process. J Chem Inf Comput Sci, 43(2).

28. Big data in biomedicine

The accumulation of data is overwhelming in many fields of science and has led to the concept of the fourth paradigm of scientific discovery: discovery driven by big data. We discuss the standard definitions of big data and their appearance in biomedicine, and argue that biomedicine has a unique position with respect to big data: first, because biomedicine has remained a knowledge-rich science, closer to the concept of e-science; second, because biomedicine integrates common as well as causal big data. In fact, we argue that this third wave of big data in biomedicine is crucial to its progress towards efficient personalized medicine.

Introduction

Moore's law about the size and density of transistors, formulated in the 1960s, has remained valid for over forty years and is applicable to electronic data storage as well. The accumulation of data became a general rule in many scientific fields, such as nuclear physics, astronomy, climate research and biology. This shift from simulation to data was particularly evident in physics, exemplified by the search for the Higgs boson using unprecedented data collection and data analysis. The pace of similar large-scale data accumulation accelerated from the 1990s in biology, first with sequence data from the Human Genome Project and then with post-genomic data. However, there are marked differences in biomedicine with respect to physics. The first such difference is the presence of multiple autonomous, weakly connected omic levels with voluminous knowledge. Despite the reductionist approach, these levels have their own specific mixture of data, knowledge and computational models, and, both separately and jointly, the role of knowledge is still definitive. Consequently, the concept of e-science, based on a more balanced view of data, knowledge and computational models, is still more adequate in biomedicine.

Another difference compared to physics is the infiltration of common big data about health issues, lifestyle and environment into mainstream biomedical research as phenotypic data. Indeed, this can be seen as the third wave of biomedical big data.

Static biological data. The first wave of biomedical big data contained mostly sequence and structural information (the inherited part of the variome can be classified here).

Cell-level phenotype. The second wave of biomedical big data contained expression data from multiple omic levels (the somatic part of the variome can be classified here).

Individual phenotype. The third wave of biomedical big data contains individual-level phenotypic data on health, lifestyle and environment.

Further characterizations are that the first wave mostly corresponds to passive observation, whereas the second and third contain interventions at the cellular or individual (patient) level. Another difference between the first and second waves is that the second is more oriented towards interactions and generative, causal models, such as gene regulation networks, protein-protein interactions, genotype-phenotype associations and target-ligand interactions.

This third wave of information is best exemplified by a very important new trend: the comprehensive utilization of the currently approved drugs, the "drugome". Recent developments in approval and insurance regulatory policy, aiming at detailed follow-up of efficiency and side effects in Phase IV, are creating a previously nonexistent abundance of heterogeneous information sources, including national and international registries of efficiency, side effects, off-label uses, results of orphan drug research, and even patient forum and blog posts. In this chapter we summarize the first wave of biomedical big data, then overview the standard technical definitions of big data and the current trends and challenges in using these three waves of big data in an integrated way.

The first wave of biomedical big data

The Human Genome Project initiated a dramatic development of sequencing technologies: according to Moore's law, or the later Carlson's law, the amount of sequencing data doubled biannually and the cost of sequencing dropped exponentially [195]. Besides the unprecedented amount of biological data, this also led to the concept of omics, which means the comprehensive set of entities at a given abstraction level or from a given point of view. The concept of omics led to the concept of hypothesis-free research and to the proliferation of further omic levels, such as the methylome, transcriptome, proteome, lipidome, metabolome, interactome, drugome, microbiome, or even the diseasome, phenome, environmentome and bibliome.

Post-genomic big data: the second wave

The hypothesis-free omic approach generated a huge amount of information at multiple levels, but paradoxically the high dimensionality of the data sets also defines limits for induction through the multiple hypothesis testing problem. Uninformative correction methods require unrealistically large sample sizes, thus systems-based multivariate data analysis methods became a popular choice to treat the hypotheses jointly and perform correction at the level of complete models. The systems-based biological approach has given further impetus to generating and analyzing heterogeneous omic data sets, especially as the underlying mathematical foundations of systems-based causal analysis have undergone rapid development in the last decades [196]. The second wave of biomedical big data sets can be linked to this causal, systems-based approach to explore autonomous mechanisms and regulatory networks. The primary example of this systems-based approach is the ENCODE project, which systematically maps transcription factor binding sites in various tissues and collects data at multiple omic levels. Other flagship projects of this interventional era, especially of the joint investigation of drugs, genes and diseases/genetics, are the Connectivity Map (CMAP) and the Genomics of Drug Sensitivity in Cancer [197 and 198]. They screen drug-gene-disease/genetics triplets by systematically applying drugs/compounds to different cell lines and disease models and measuring various expression data.

Partly based on this second wave of data, new large-scale dynamical models have become available, such as cell models and organ models.

The common big data

Large data sets have accumulated not only in scientific fields, but also in everyday life, such as in the following areas.

1. Financial and commercial transaction data, e.g. fraud detection.
2. Phone call data, e.g. for targeted advertising.
3. Software usage, e.g. click stream data for user modeling.
4. Internet search data.
5. Traffic data, e.g. to resolve traffic jams.
6. Electricity consumption data, e.g. prediction.
7. Satellite and sensor data in agriculture, e.g. for precision farming.

After finance, commerce, communication and the internet, big data also appeared in other aspects of everyday life through the internet of things, such as in mobile health (e.g., health-oriented functionality of mobile phones), in health monitoring (e.g., wearable electronics), in ambient assisted living and in intelligent homes (sensor networks).

Indeed, this health-related casual or common big data already forms a substantial share of all big data, particularly if traditional health data, such as electronic patient records, and health-related data from the internet are included (such as all forms of internet communication, e.g. data from social networks and computer games). Because this set of data includes high-resolution physiological, cognitive, and even emotional data, it can be regarded as ultimate phenotype data at the level of the individual. In fact, because of the crucial role of the "deep" phenotype in modern biomedical research and drug discovery, this data can be conceived of as the third wave of the biomedical data flood. Before discussing the applications of such data in biomedicine, it is worth discussing the adequacy and the specific features of the "big data" terminology in biomedicine, e.g. to consider the transferability of big data solutions. The "big data" expression first appeared in 1997, simply denoting an excessive amount of data with respect to the information infrastructure of that time [199]. After 2000, the three V's became a standard definition: volume, variety, and velocity, where velocity denotes the fast turn-around time from the generation of the data to a decision based on it. A definition familiar in the biomedical context is the following: "[big data]... represents the totality or the universe of observations. That is what qualifies as big data. You do not have to have a hypothesis in advance before you collect your data. You have collected all there is - all the data there is about a phenomenon. (E. Dumbill: Making sense of big data, Big Data, vol. 1, no. 1, 2013)", which is essentially the standard, well-known definition of omic data. A crucial aspect of the standard definition of big data is temporality, both with respect to the lifetime and validity of the data and to the time available to utilize it. Nonetheless, the number of new properties proposed to define the concept of big data is endless, such as the "Vast, Volumes of Vigorously, Verified, Vexingly Variable Verbose yet Valuable Visualized high Velocity Data", thus we use the term in an informal sense.

The health-related common big data in biomedicine

Each big data wave in biomedicine has exceeded the contemporary IT infrastructure, computational resources and methodologies; this holds for the omic data and the systems-based data alike. The third wave of deep phenotypic data similarly poses many challenges, because of its highly unstructured, heterogeneous, still weakly linked multilevel nature, and because of its sheer volume. Note that whereas omic data was in the range of terabytes (10^12 bytes) and interactional, interventional systems-based data is in the range of petabytes (10^15 bytes), the volume of deep phenotypic data can quickly exceed the exabyte range (10^18 bytes), because of the myriads of individuals and gadgets connected to the internet and collecting such data. This deep phenotype data fits perfectly the recent trends in genetic association research, in personalized medicine and in drug discovery. The grand promises of the post-genomic era are the "personal genome", the understanding of "normal" and disease-related genotype-phenotype relations, along with personalized prevention, diagnosis, drugs, and treatments. However, from a clinical perspective these "translational" promises are still to be fulfilled, and have been shifted gradually further and further into the future.

From a data analytic point of view, both the discovery of explanatory, diagnostic biomarkers and the discovery of new drug targets have failed to satisfy the expectations, as exemplified by infamous problems and papers, such as the "missing heritability" [200], "missing the mark" [201], and the "production gap" in pharmacy, referring respectively to the relatively modest explanatory power of identified genetic factors, the clinical validity and utility of identified biomarkers, and the diverging trends of pharmaceutical R&D expenditures and the yearly number of newly approved drugs. The relatively modest performance of current methods with respect to the number of valid biomarkers and approved new drugs has highlighted the role of more detailed descriptions of the disease and better clinical endpoints, which could be delivered by the deep phenotype data. For example, large-scale cohort studies were initiated to track the life style and health status of the participants. The pharma industry, national insurance agencies and insurance companies initiated detailed follow-ups of efficiency and side-effects in Phase IV, which created a previously non-existent abundance of heterogeneous information sources, including national and international registries of efficiency, side-effects, off-label uses, results of orphan drug research, and even patient forum and blog posts. Similar initiatives involving the collection of large-scale health-related phenotypic information can be expected in food safety and chemical safety. However, the most mundane and profound reason behind the collection and utilization of everyday health information is the self-inquiry and health preservation of the individuals themselves. Whereas various wearable electronic gadgets and sensor networks for ambient assisted living have been on the verge of mass production since the turn of the millennium, their widespread use can be expected to happen in this decade. Such devices and information sources are the following:

1. Physiological data (smart phones, smart watches, wrist bands, necklaces and ear-worn devices), recording basic physiological data and unexpected movements or events, such as falling, immobility, coughing or sneezing.

2. Sensor networks for ambient assisted living, monitoring the level of regular daily activities and diagnosing rare events, such as accidents.

3. Intelligent homes, e.g. electronic tracking of energy consumption and of the regular usage and position of household devices.

4. Drug consumption, efficacy and side-effect tracking with dedicated systems, e.g. using smart glasses.
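To put the terabyte-to-exabyte scale mentioned above into perspective, the following back-of-the-envelope sketch estimates the raw data volume produced by a large population of wearable physiological sensors. All parameters (number of users, channels, sampling rate, bytes per sample) are illustrative assumptions rather than measured figures.

# Back-of-the-envelope estimate of raw data volume from wearable sensors.
# All parameters below are illustrative assumptions.
USERS = 100_000_000          # individuals wearing a device
CHANNELS = 5                 # e.g. heart rate, 3-axis acceleration, skin temperature
SAMPLES_PER_SEC = 50         # assumed sampling rate per channel
BYTES_PER_SAMPLE = 2         # 16-bit raw samples

bytes_per_day = USERS * CHANNELS * SAMPLES_PER_SEC * BYTES_PER_SAMPLE * 86_400

for unit, power in [("TB", 12), ("PB", 15), ("EB", 18)]:
    print(f"{bytes_per_day / 10**power:,.1f} {unit}/day")

Under these assumptions the daily raw volume is already in the petabyte range, and the exabyte range is reached within a year, which is consistent with the qualitative claim above.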

Bioinformatic challenges of common big data

The use of health-related common big data in biomedical research can be separated into research-oriented and health-care-oriented applications. The application of health-related common big data in research is challenging because of its unstructured, heterogeneous and temporally dependent, or even ephemeral nature. However, its fusion with omic and systems-based data and knowledge fits the standard academic framework. In contrast, the integrated use of the three waves of data and knowledge in health-care systems, e.g. in prevention, requires methodological and infrastructural changes. Such challenges are the support for finding similar patients, both by medical professionals and by patients, the support for online diagnosis, and home-care systems with various monitoring and alerting activities. Each of these activities could exploit the integration of codified biomedical background knowledge and actual, deep phenotypic information about the patient. Activities in a home-care system are shown in Fig.
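As a toy illustration of the monitoring and alerting activities mentioned above, the sketch below flags heart-rate readings that deviate strongly from a rolling baseline in a home-monitoring stream. The data, window size and threshold are invented for illustration; a real system would rely on clinically validated rules or learned models.

from collections import deque

def alert_on_heart_rate(stream, window=60, z_thresh=3.0):
    """Yield (timestamp, value) pairs deviating strongly from a rolling baseline."""
    history = deque(maxlen=window)
    for ts, hr in stream:
        if len(history) == window:
            mean = sum(history) / window
            std = (sum((x - mean) ** 2 for x in history) / window) ** 0.5 or 1.0
            if abs(hr - mean) / std > z_thresh:
                yield ts, hr                      # candidate alert for follow-up
        history.append(hr)

# Synthetic readings: a stable baseline around 70 bpm followed by a single spike.
readings = [(t, 70 + (t % 3)) for t in range(120)] + [(120, 140)]
print(list(alert_on_heart_rate(readings)))        # -> [(120, 140)]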

The Bayesian statistical framework and decision networks provide a sound and principled theoretical background for data analysis, integration and decision support. However, the big data sets from daily life raise the question of the necessity of "common sense" to analyze such common big data sets (for the attempts to formalize common sense, see e.g. [202]-[206]). This question is outside the scope of this chapter, but paradoxically exactly such multilevel big data sets from everyday life could allow the automated construction of common sense knowledge bases and allow the learning of concepts as clinical endpoints. Health-related common big data sets, complemented with various disease-specific diaries, will soon find their way into mainstream drug discovery and biomedical research, and also into recommendation systems and social networks to support finding similar patients, into online diagnosis systems, and into home-care systems utilizing the common big data as the ultimate whole-body phenotype (cf. with gene expression as "ultimate" cellular phenotype [207]-[210]).

148 29. References [195] Carlson R: The Pace and Proliferation of Biological Technologies. Biosecurity and Bioterrorism: Biodefense Strategy, Practice, and Science 2004, 1(3). [196] Pearl J: Causality : models, reasoning, and inference. Cambridge, U.K. ; New York: Cambridge University Press; [197] Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al: The connectivity map: Using gene-expression signatures to connect small molecules, genes, and disease. Science 2006, 313(5795): [198] Lamb J: Innovation - The Connectivity Map: a new tool for biomedical research. Nat Rev Cancer 2007, 7(1): [199] Bryson S, Kenwright D, Cox M, Ellsworth D, Haimes A: Visually exploring gigabyte data sets in real time. Communications of the Acm 1999, 42(8): [200] Maher B: Personal genomes: The case of the missing heritability. Nature 2008, 456(7218): [201] Gewin V: Missing the mark. Nature 2007, 449(7164): [202] LENAT D, GUHA R, PITTMAN K, PRATT D, SHEPHERD M: CYC - TOWARD PROGRAMS WITH COMMON-SENSE. Communications of the Acm 1990, 33(8): [203] ELKAN C, GREINER R: BUILDING LARGE KNOWLEDGE-BASED SYSTEMS - REPRESENTATION AND INFERENCE IN THE CYC PROJECT - LENAT,DB, GUHA,RV. Artificial Intelligence 1993, 61(1): [204] LENAT D: CYC - A LARGE-SCALE INVESTMENT IN KNOWLEDGE INFRASTRUCTURE. Communications of the Acm 1995, 38(11): [205] LENAT D, MILLER G, YOKOI T: CYC, WORDNET, AND EDR - CRITIQUES AND RESPONSES - DISCUSSION. Communications of the Acm 1995, 38(11): [206] Panton K, Matuszek C, Lenat D, Schneider D, Witbrock M, Siegel N, Shepard B, Cai Y, Abascal J: Common sense reasoning from Cyc to intelligent assistant. Ambient Intelligence in Everday Life 2006, 3864:1-31. [207] Dermitzakis E: From gene expression to disease risk. Nature Genetics 2008, 40(5): [208] Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, Zhu J, Carlson S, Helgason A, Walters G, Gunnarsdottir S et al: Genetics of gene expression and its effect on disease. Nature 2008, 452(7186):423- U422. [209] Schadt E, Monks S, Drake T, Lusis A, Che N, Colinayo V, Ruff T, Milligan S, Lamb J, Cavet G et al: Genetics of gene expression surveyed in maize, mouse and man. Nature 2003, 422(6929): [210] Schadt E, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts S, Monks S, Reitman M, Zhang C et al: An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 2005, 37(7): Analysis of heterogeneous biomedical data through information fusion Introduction 140

Modern biological and bioinformatics research is greatly affected by the technological revolution that started in the second half of the 20th century (also known as the "century of physics"). Analogously to Moore's Law, which predicted the exponential nature of the increase of computational capacity, measurement technologies have evolved at a similar pace (see Carlson's Laws [211]). In the 21st century, also known as the "century of biology", a plethora of new high-throughput methodologies were developed and an abundance of heterogeneous data was created which cannot be synthesized and analyzed by the human mind alone. Recent advances in the fields of biology and computer science, together with the reduction in the cost of measurements and computations, have led to new research paradigms. This includes the hypothesis-free paradigm ("gene fishing") as well as the joint analysis of multiple omic levels (e.g. genomics, proteomics, etc.). Since the beginning of the new millennium, modern biological research has moved from the entity-based viewpoint to system-level analyses (systems biology). Along with the increasing amount of data, the number of biomedical databases has also grown. These can be grouped into the following categories (without attempting to be comprehensive):

Sequence: GenBank, EMBL, ExProt, SWISS-PROT/TrEMBL, PIR
Pathway: KEGG, Reactome
Regulation: RNAiDB, TRANSFAC, TRANSPATH
Epigenetics: PubMeth
Protein motif: Blocks, InterPro, Pfam, PRINTS, SUPFAM, PROSITE
Protein structure: PDB, MMDB
Gene-disease associations: HuGENet, PharmGKB, GenAtlas
Pharmacology, pharmacogenomics: DrugBank, SIDER, PharmGKB, PubChem
Gene expression: GEO, YMGV
Molecular interactions: BIND, DIP, BRENDA, BioGRID
Metabolic networks: EcoCyc, MetaCyc, GeneNet
Mutations, variations: OMIM, dbSNP, HGMD
Ontologies, thesauri: GO, UMLS, MeSH, Galen
Publications: PubMed
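Many of these resources can also be queried programmatically. As a minimal sketch (assuming the Biopython package is installed, network access is available, and a contact e-mail is set as NCBI's usage policy requires), the snippet below searches PubMed through the NCBI E-utilities and fetches a few abstracts.

# Minimal sketch of programmatic access to PubMed via the NCBI E-utilities (Biopython).
from Bio import Entrez

Entrez.email = "your.name@example.org"      # placeholder; NCBI asks for a real address

# Search for a handful of records on a topic of interest.
handle = Entrez.esearch(db="pubmed", term="asthma gene-environment interaction", retmax=5)
record = Entrez.read(handle)
handle.close()
pmids = record["IdList"]

# Fetch the corresponding abstracts as plain text.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="abstract", retmode="text")
print(handle.read())
handle.close()

The same E-utilities interface covers several of the NCBI resources listed above (e.g. GenBank nucleotide records, dbSNP, GEO metadata); only the db argument changes.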

Information fusion and data fusion

The fusion of heterogeneous biological knowledge can be divided into two categories: information fusion and data fusion, where the latter is a subset of the former. Information fusion seeks to support research through the joint, coherent use of multiple information sources, while data fusion comes down to combining raw biological data (e.g. sequences, expression), often in a numerical way. One of the main goals of the fusion paradigm is the combination of measurement data and background knowledge, which is therefore a transition between the fields of information fusion and data fusion. All of these approaches aim to support data analysis and interpretation, study design and decision support. According to Synnergren, fusion systems can be grouped into the following categories [212]:

Knowledge extraction systems
Knowledge integration systems
Knowledge fusion systems

Knowledge extraction means the automated retrieval of information regarding the query from various biological knowledge bases, usually by applying data and text mining techniques. These systems also help to visualize, arrange and browse the extracted information. Most data mining systems fall into this category (DAVID [213], WebGestalt [214]). The goal of knowledge integration systems is the representation and visualization of the knowledge on a consistent interface (STRING [215]), therefore providing a unified view [212]. They often contain data extraction and complex querying subsystems (e.g. Natural Language Processing) and links to relevant publications and analyses. An early example of knowledge base integration is TAMBIS [216], which has a sophisticated querying system that translates the query for various databases, knowledge bases and services, then collects the answers, and integrates and visualizes them on a consistent interface.

With the above approaches, the actual fusion is done by the researcher, who uses the information together with his or her expert knowledge. Knowledge fusion systems perform this fusion in an automated way by transforming the heterogeneous data to a level which provides a unified representation. One of the earliest approaches was semantic integration, which aimed to introduce a common language. Beyond the ontologies, translators and dictionaries standardizing the concepts, another notable example is standardization at the level of relations (e.g. Gene Ontology). A newer approach was to use the drug- or disease-induced changes in gene expression levels as the common language, where the interactions between entities could be estimated based on the correlations between gene expression change profiles. Modern techniques include graphical models (probabilistic graphical models, e.g. MAGIC [217]), formal logic languages and stochastic inductive logic programming, similarity-based fusion (kernel methods, e.g. Endeavour [218]), workflow systems and various programming environments (Bioclipse [219], Cytoscape [220]), which provide many ways of knowledge representation and many algorithms, and are usually modularly designed (plugin-based). Detailed descriptions of various systems and further information can be found in the literature [212].

Types of data fusion

Heterogeneous data fusion became one of the central questions of the new research paradigms. One can rightfully expect the following from such techniques:

considering multiple aspects should lead to better results
should support the integration of expert knowledge
should be automated
should be easy to use, user-friendly
should be usable with various input data formats (e.g. non-vectorial data)
should have a stable mathematical foundation
should be computationally efficient
should scale well with the number and size of data sources
should handle missing data, should be noise tolerant

Fusion methods can be divided into three categories (Figure 69) [221]:

Early/low-level fusion
Intermediate/intermediate-level fusion
Late/high-level fusion

Early fusion

Early fusion combines the descriptions of entities at the data level (data integration). The simplest and most common method is to concatenate the vectorial data (VSI, Vector Space Integration) and analyze the result. Beyond its simplicity and computational efficiency, another advantage is that the analyzing algorithm receives all information from all of the sources, therefore it profits directly from the correlations between the descriptions of entities, independently of the sources. Disadvantages include the relative inflexibility compared to the other methods, the difficulties of representation (e.g. non-vectorial data) and the problem of integrating field-specific expert knowledge.
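A minimal sketch of early fusion (Vector Space Integration): feature matrices from two hypothetical sources describing the same entities are simply concatenated and a single classifier is trained on the joint representation. The data and labels are random placeholders; NumPy and scikit-learn are assumed to be available.

# Early fusion / Vector Space Integration: concatenate per-entity feature vectors
# from several sources and train one model on the joint representation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_entities = 200

# Two hypothetical sources describing the same 200 entities
# (e.g. expression-derived and sequence-derived features).
X_expr = rng.normal(size=(n_entities, 50))
X_seq = rng.normal(size=(n_entities, 30))
y = rng.integers(0, 2, size=n_entities)          # placeholder labels

X_fused = np.hstack([X_expr, X_seq])             # the actual "fusion" step

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X_fused, y, cv=5).mean())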

Intermediate fusion

Intermediate methods perform the fusion based on an intermediate representation of the data. The two most prevalent techniques are the family of kernel methods (e.g. Support Vector Machines, Gaussian Processes) and graph-based approaches (especially probabilistic graphical models). The former uses entity-entity similarity matrices (kernels) as the intermediate representation, whereas the latter most commonly uses a Bayesian network. Intermediate fusion combines the efficiency of early methods with the flexibility of late methods and has become exceptionally prevalent. Kernel methods have a stable mathematical foundation, can be applied to arbitrary input data formats (as long as we can compute similarities between entities), are computationally very efficient, and the free choice of similarity measures and the design of kernels provide a way to integrate prior knowledge as well. On the other hand, finding the optimal parameterization of the kernels and algorithms can prove quite difficult. Bayesian networks represent the background knowledge in the form of distributions over prior model classes, which can be used to compute posterior distributions by integrating the data. Advantages are the normative combination of background knowledge, the handling of uncertainty and missing data, and the efficient inference system which provides easily interpretable probabilistic statements. However, the transformation of prior knowledge to the formal logical (qualitative) and quantitative level is often difficult; furthermore, inference in Bayesian networks has a rather large computational complexity.

Late fusion

Late fusion (decision-level fusion) involves the separate analysis of each information source and the combination of the resulting decisions. One of the main advantages is the immense flexibility: virtually any kind of data can be combined and the analyzing algorithms can be different for each source, where the choice of algorithms provides a way to incorporate expert knowledge. Since the outputs usually have the same format, the fusion can be carried out fairly easily. Disadvantages are the high computational complexity (separate analyses for each source and the combination of the results) and a significant dimension reduction at the decision level, which makes late methods relatively insensitive to correlations between entity descriptions compared to the early techniques. One simple method is the algebraic combination of the outputs (e.g. sum, weighted mean, median, etc.), whereas more sophisticated techniques include ensemble methods (Mixture of Experts, bagging, boosting, stacking) and order/rank fusion methods (order statistics, Borda ranking, parallel selection, Pareto ranking, etc.). Many methods, their detailed descriptions and comparisons can be found in Svensson's publication [222]. Several commonly used rank fusion rules are the following (a small sketch of two of them is given after this list):

Sum rank: the ranks of an entity in all lists are summed up; the final list is computed using the combined ranks.

Sum score: the score of an entity is divided by the highest score in the corresponding list, then the resulting values are summed up. The final list is computed using these relative scores.

Pareto ranking: the final rank of an entity depends on the number of entities with a higher score in all lists. The sum rank method is applied to break ties.

Rank vote: each list votes for its top-ranked entities. The final list is computed using the sum of the votes. The sum score method is applied to break ties.

Parallel selection: the best entity is selected in each list; if an entity has already been selected, the next one is chosen instead, and the process is repeated.
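A minimal, illustrative implementation of two of the rules above (sum rank and sum score) on toy score lists; the entity names and scores are invented placeholders.

# Toy late-fusion example: combine ranked lists from two sources
# using the "sum rank" and "sum score" rules described above.
scores_a = {"GENE1": 0.90, "GENE2": 0.40, "GENE3": 0.75}   # placeholder scores, source A
scores_b = {"GENE1": 0.80, "GENE2": 0.95, "GENE3": 0.10}   # placeholder scores, source B
sources = [scores_a, scores_b]

def sum_rank(sources):
    total = {}
    for scores in sources:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, entity in enumerate(ranked, start=1):
            total[entity] = total.get(entity, 0) + rank
    return sorted(total, key=total.get)                    # smaller summed rank = better

def sum_score(sources):
    total = {}
    for scores in sources:
        best = max(scores.values())
        for entity, s in scores.items():
            total[entity] = total.get(entity, 0.0) + s / best
    return sorted(total, key=total.get, reverse=True)      # larger summed score = better

print("sum rank :", sum_rank(sources))    # -> ['GENE1', 'GENE2', 'GENE3']
print("sum score:", sum_score(sources))   # -> ['GENE1', 'GENE2', 'GENE3']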
Similarity-based data fusion

Entity-entity similarity matrix-based fusion first appeared in the context of clustering gene expression data, but became prevalent only after Lanckriet's publication [223]. Instead of the sum, this novel approach used the weighted mean of the similarity matrices (kernels), where the weights were computed from the solution of an optimization problem. This formulation was based on Support Vector Machines (SVMs), which have many advantages, including good accuracy, generalization performance and efficient control of model complexity. They also have several beneficial mathematical properties which can be exploited to gain very good computational performance (they lead to a sparse solution and scale well to high-dimensional data) and which also allow the automatic weighting of the information sources.

Every symmetric positive semidefinite similarity matrix (kernel) defines a Hilbert space called a Reproducing Kernel Hilbert Space (RKHS). Let $k(\mathbf{x}, \mathbf{x}')$ be a kernel function (similarity measure), for example

$k(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right),$

i.e. the kernel matrix contains the values $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Corresponding to this, there exists a Hilbert space $\mathcal{H}$ and a mapping $\phi$ with

$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}},$

where $\phi$ maps the data vectors into the RKHS. The SVM computes the separating hyperplane in this space. The kernel function defined above is called the Gaussian Radial Basis Function (RBF) kernel; it can be shown that in this case the RKHS is infinite dimensional. To integrate more information sources, kernel fusion methods can be applied (Multiple Kernel Learning, MKL). At first, the sum or a weighted average of the kernels was used as the combined kernel [221],

$K = \sum_{j=1}^{m} \mu_j K_j, \qquad \mu_j \ge 0.$

Here we can exploit the fact that the optimal weighting can be computed if the weights are incorporated into the optimization problem of the SVM, for which a number of different formulations have been suggested. At this point, the regularization of the weights becomes a relevant problem; non-sparse (e.g. $\ell_2$-type) normalization of the weights was found to perform better than sparse ($\ell_1$) methods. Many formulations were suggested to incorporate the kernel weights into the optimization problem [224, 225 and 226]. A recent formulation regularizes the kernel weights with an $\ell_p$-norm penalty, whose exponent controls the sparsity of the weights, and leads to a differentiable dual objective function which allows the application of the very efficient SMO algorithm [227] (the exact primal and dual problems are given in [227]).
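To make the kernel-combination idea concrete, the sketch below builds two RBF kernels from two hypothetical feature sets, combines them with fixed, hand-chosen weights (a genuine MKL method such as those in [224]-[227] would learn these weights from the data), and ranks candidate entities with a one-class SVM trained on a small query set, in the spirit of the prioritization framework discussed below. All data are random placeholders; NumPy and scikit-learn are assumed to be available.

# Kernel combination and one-class SVM based prioritization (illustrative sketch).
# A real MKL method would learn the kernel weights; here they are fixed by hand.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_source1 = rng.normal(size=(100, 40))       # e.g. expression-based features
X_source2 = rng.normal(size=(100, 20))       # e.g. annotation-based features

K1 = rbf_kernel(X_source1, gamma=0.02)
K2 = rbf_kernel(X_source2, gamma=0.05)
weights = [0.6, 0.4]                          # fixed, illustrative kernel weights
K = weights[0] * K1 + weights[1] * K2         # weighted combination is still a valid kernel

query = np.arange(10)                         # indices of the query (training) entities
candidates = np.arange(10, 100)               # entities to be prioritized

model = OneClassSVM(kernel="precomputed", nu=0.5)
model.fit(K[np.ix_(query, query)])

# Decision values of the candidates w.r.t. the hyperplane learned from the query set;
# larger values mean "more similar to the query", i.e. ranked higher.
scores = model.decision_function(K[np.ix_(candidates, query)])
print(candidates[np.argsort(-scores)][:10])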

In the prioritization framework, candidate entities can be ranked by their (perpendicular) distance from the separating hyperplane learned from the query set (in the one-class setting, by their distance from the origin), suitably normalized and with constant terms omitted. We have seen that the kernel fusion framework is thus also capable of solving ranking problems. An example of this is the Endeavour system developed at the Catholic University of Leuven [218] and its improved version, ProDiGe [228]. This approach surpasses the conventional, global similarity-based techniques in a number of respects. With the automatic weighting of the sources the method becomes context-sensitive, i.e. the fusion is carried out on the basis of the information content of the query. Another advantage is that the method can detect an (even previously unknown) coherence within the query, e.g. if the query contains genes which lie on the same biological pathway, the pathway-based source (if present) will gain a higher weight. One of the classic applications of the one-class SVM is outlier detection: if the query is inhomogeneous and contains far-off elements, the algorithm can detect them. On the other hand, the ranking can become meaningless in this case; moreover, in extreme cases the query can be pushed back to the end of the ranking. Another disadvantage is the relative sensitivity to noisy kernels, therefore the wise choice of information sources can be of critical importance.

31. References

[211] R. Carlson, The pace and proliferation of biological technologies. Biosecur Bioterror, 1: ,
[212] J. Synnergren, B. Olsson, and J. Gamalielsson, Classification of information fusion methods in systems biology. In Silico Biol. (Gedrukt), 9:65-76,
[213] D. W. Huang, B. T. Sherman, Q. Tan, J. Kir, D. Liu, D. Bryant, Y. Guo, R. Stephens, M. W. Baseler, H. C. Lane, and R. A. Lempicki, DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res., 35:W , July
[214] B. Zhang, S. Kirov, and J. Snoddy, WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res., 33:W , July
[215] C. von Mering, L. J. Jensen, M. Kuhn, S. Chaffron, T. Doerks, B. Kruger, B. Snel, and P. Bork, STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res., 35:D , Jan
[216] P. G. Baker, A. Brass, S. Bechhofer, C. Goble, N. Paton, and R. Stevens, TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. An Overview. In: Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB'98), pages 25-34, Menlo Park, California, June 28-July AAAI Press.

156 [217] O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. U.S.A., 100: , July [218] T. De Bie, L. C. Tranchevent, L. M. van Oeffelen, and Y. Moreau, Kernel-based data fusion for gene prioritization. Bioinformatics, 23:i , July [219] O. Spjuth, T. Helmus, E. L. Willighagen, S. Kuhn, M. Eklund, J. Wagener, P. Murray-Rust, C. Steinbeck, and J. E. Wikberg, Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics, 8:59, [220] M. E. Smoot, K. Ono, J. Ruscheinski, P. L. Wang, and T. Ideker, Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27: , Feb [221] P. Pavlidis, J. Weston, J. Cai, and W. S. Noble, Learning gene functional classifications from multiple data types. J. Comput. Biol., 9: , [222] F. Svensson, A. Karlen, and C. Skold, Virtual screening data fusion using both structure- and ligandbased methods. J Chem Inf Model, 52(1): , Jan [223] G. R. G. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble, Kernel-based data fusion and its application to protein function prediction in yeast. In: Proceedings of the Pacific Symposium on Biocomputing, [224] Alain Rakotomamonjy, Francis R. Bach, Stephane Canu, and Yves Grandvalet, SimpleMKL. Journal of Machine Learning Research, 9: , November [225] Marius Kloft, Ulf Brefeld, Soeren Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien, Efficient and Accurate Lp-Norm Multiple Kernel Learning. In: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages , [226] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 6-, ACM, New York, NY, USA, [227] S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma, Multiple Kernel Learning and the SMO Algorithm. In: Advances in Neural Information Processing Systems, December [228] F. Mordelet and J. P. Vert, ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics, 12:389, The Bayesian Encyclopedia In which we discuss the use of semantic technologies for the unificiation biomedical data, literature and models. First we overview the accumulation of biomedical data, literature and models. Second, we illustrate the detrimental effect of fragmentation in the field of genetic and pharmacogenomics and overview a complete "from bench to bedside"' workflow from the discovery of a clinically relevant genetic variants to its use in clinical practice. Third, we discuss current approaches for sharing and integration of data, literature and models, such as standardizations and enrichment by ontologies. We overview such methods for data repositories, particularly for genetic data and rich phenotypic data; for literature, especially the use of semantic publishing; and for models, especially open approaches. Finally, we highlight trends and discuss the possibility of the emergence of a Bayesian Encyclopedia Introduction The new molecular biological measurement techniques allowed the creation of unprecedented amount of information related to drugs-genes-diseases, accumulated mainly in the last two decades. 
This includes pharmaceutical information such as drug taxonomies, chemical fingerprints, target proteins, systematic gene

expression profiles of drugs in an in vitro compendium, gene expression profiles for diseases, side-effects, indications, interactions, combinations, and off-label use of drugs. Furthermore, there is a growing amount of information about the underlying molecular biological aspects of the diseases, such as pathway information, gene regulatory mechanisms, protein-protein networks, gene-disease networks, and the effects of genetic and epigenetic variations. The appearance of this voluminous data (and knowledge) in multiple scientific fields led to the concept of a new, data-driven paradigm of scientific research, the "fourth paradigm" [235, 254 and 253]. Nonetheless, high-throughput methods still have not changed the fact that biology and biomedicine are knowledge-rich sciences. The spectrum of knowledge, starting from raw observations, incorporates automatically generated results of data analysis, unintentionally formed patterns of co-occurrences of concepts on the web and in the literature, hypothesized models in peer-reviewed papers, manually curated knowledge bases, and codified textbook knowledge. The increasing weight of medical and clinical aspects with their multiple abstraction levels and the problems of pure, data-based inductive methods prompted new terms such as clinical genomics and translational research, which indicate the "interpretational" bottleneck that follows the model-free, high-throughput methods. This bottleneck is one of the driving forces behind the increasing appreciation of electronic background knowledge in the post-genomic era. The rapidly accumulating scientific knowledge and data, even the narrowly interpreted "domain knowledge", increasingly exceeds the limits of individual cognition. Consequently, biomedical knowledge is becoming more and more "external" (i.e., distributed, collectively shared and maintained in knowledge bases, databases and electronically accessible repositories of natural language publications). This suggests that the further development of the life sciences depends as much on the efficient externalization and fusion of knowledge as on further technological breakthroughs. Indeed, there is a paradoxical situation in which the growing amount of data and knowledge is coupled with lagging performance at the clinical level, which has initiated a wide range of transformations in statistical, biomedical, and pharmaceutical research, as well as in public healthcare. Whereas the appreciation that the further development of the life sciences depends on the fusion of data and knowledge is widely shared, it has also become widely accepted that the heterogeneity of this information pool poses a formidable challenge, namely how to use it in an integrated fashion to support various aspects of personalized medicine and drug discovery. From a statistical, knowledge engineering, and informatics point of view it became apparent that new methodologies are needed besides the information integration technologies traditionally developed for the financial field, i.e. database-level integration.
Such new methodologies were the ontologies, standardizing the vocabulary and relations within and between scientific fields; text-mining technologies, to support the automated construction of knowledge bases from free text; probabilistic graphical models, to cope with high-dimensional distributions; causal inference, to support the formalization of prior knowledge and the management of confounders; kernel approaches, to minimize the statistical and computational burden; the Bayesian statistical framework, to tackle the relative scarcity of the data; and large-scale data fusion techniques. Overall terms for this integrative approach, such as e-science, were also suggested [268]. An important and inherent feature of this new voluminous information pool is uncertainty. Various forms of uncertainty may arise because of the multilevel and multiple approaches in biomedicine, besides incompleteness and inherent uncertainty, but many of these can be managed within the single framework of probability theory using a subjectivist interpretation. The corresponding Bayesian framework offers a normative method for representing knowledge, learning from observations and, with utility theory, reaching optimal decisions. In short, the Bayesian approach provides a normative and unified framework for knowledge engineering, statistical machine learning and decision support. Its ability to incorporate consistently the voluminous and heterogeneous prior knowledge into statistical learning connects statistics and knowledge engineering, leading to the concept of adaptive knowledge bases or "knowledge intensive" statistics. The Bayesian framework also offers a computational framework for learning and using complex probabilistic models, mainly by various stochastic simulations to perform Bayesian inference, leading to computationally intensive statistics. However, the vast biomedical domain knowledge is a mixture of human expertise, knowledge bases, databases and literature repositories, which has posed many practical challenges for applied Bayesian data analysis: how to use heterogeneous domain knowledge and data efficiently in knowledge engineering, machine learning and decision support. This challenge is particularly acute in the complex and rapidly changing fields of medicine and genomics, where the proper interpretation of the results of data analysis became an important bottleneck. That is, besides the measurement technology and the statistical aspects of data analysis, the support for understanding and revealing the biomedical relevance of the results became essential.
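To recall the normative core of the framework referred to above: if $M$ denotes a model (e.g. a candidate dependency structure) with prior $p(M)$ encoding background knowledge, and $D$ denotes the observed data, then learning, prediction and decision making follow directly from the rules of probability. This is a textbook formulation, not a description of any specific system discussed here.

\begin{align*}
  p(M \mid D) &= \frac{p(D \mid M)\, p(M)}{\sum_{M'} p(D \mid M')\, p(M')}
    && \text{(posterior over models)} \\
  p(x \mid D) &= \sum_{M} p(x \mid M, D)\, p(M \mid D)
    && \text{(model-averaged prediction)} \\
  a^{*} &= \arg\max_{a} \sum_{M} U(a, M)\, p(M \mid D)
    && \text{(optimal decision with utility } U\text{)}
\end{align*}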

In this chapter we discuss the use of a probabilistic semantics over data, raw results of data analysis, codified knowledge and model fragments, a view that can be seen as a modern positivist approach to the unified representation of biomedical science, a Bayesian positivism. The ingredients and the concept of the Bayesian Encyclopedia are shown in Fig.

The three worlds of data, knowledge and computation

From the 1990s, the rapid accumulation of biomedical data and results brought up multiple interrelated questions in the context of semantic technologies and e-science. The issue of data repositories and data standardization was one of the most important and urgent ones, because it is critical for guaranteeing the verification of published results, for reusing the raw data in different postprocessing and analysis steps, and for supporting meta-analysis. The direct results of this movement in the case of microarray data are the Microarray Gene Expression Data (MGED) standard, the Minimum Information About a Microarray Experiment (MIAME) standard, and repositories such as the Gene Expression Omnibus (GEO) [242 and 241]. Analogously, in the case of genetic polymorphism data, standards such as the Minimum Information about a Genotyping Experiment (MIGEN) were proposed, and repositories [255] such as the European Genome-phenome Archive were created to deposit genetic polymorphism data, especially from genome-wide association studies (GWASs). In parallel with the standardization of data repositories, the issue of the dissemination of the results of data analysis emerged. In line with the prevailing academic routine this led to the well-known dual structure of free-text scientific publications and manually curated databases. Whereas biomedical, bio- and chemoinformatic publications are covered by PubMed and MedChem, experimental and in silico results and predictions are dispersed throughout myriads of databases. For the unification and standardization of the results, various ontologies were proposed, such as the Gene Ontology (GO) and the Unified Medical Language System (UMLS) collection of ontologies. A very special class of databases contains only, or nearly exclusively, pieces of information which are extracted by experts from the literature, using a range of information retrieval, text-mining and automated text summarization methods. Besides dedicated, central public databases, such as the NCBI's collection, there are multiple private databases with public academic access, such as the Online Mendelian Inheritance in Man (OMIM), GeneCards and PharmGKB, but it has also become a whole industry to construct such databases which are only commercially available, both for academic research and clinical practice. Examples of such databases are IPA, Ariadne, Alamut, GODisease and Knome.
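As a small illustration of how such ontologies and identifier schemes can make extracted statements machine-readable, the sketch below encodes a single, invented gene-disease association as RDF triples with the rdflib package. The predicate and most identifiers are illustrative placeholders rather than terms taken from any specific curated resource.

# Encoding a curated statement as machine-readable triples (illustrative sketch).
# The predicate and identifiers below are placeholders, not terms of a real ontology.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/biokb/")

g = Graph()
gene = URIRef("http://identifiers.org/hgnc.symbol/IL13")     # example identifier style
disease = EX["asthma"]

g.add((gene, EX["associatedWith"], disease))                  # the asserted relation
g.add((gene, EX["evidencePMID"], Literal("00000000")))        # placeholder evidence link

print(g.serialize(format="turtle"))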

Finally, the dissemination of quantitative models has also emerged as an issue, although it has only recently received more attention. Examples of such standardization are the Strengthening the Reporting of Genetic Risk Prediction Studies (GRIPS) statement and the Predictive Model Markup Language (PMML). For representing Bayesian networks, the XBN language and, more recently, the OpenBEL approach have emerged. From a somewhat surprising direction, the field of synthetic biology, representations are also emerging for such purposes (see BioBricks [264, 239, 244 and 271]). However, these three worlds, that is, the world of data, the world of (logic-based or free-text) knowledge and the world of quantitative models (computation), are quite separate, despite their common ground and the available informatic technologies for their integration. Next, we discuss the detrimental effect of their separation; for an early realization of the "blurring boundaries" between the data world and the literature (free-text) world, see [243, 249, 248, 266, 250, 267, 238, 269 and 270].

From fragmentation problems to a workflow for unification

In the currently prevailing practice, the data world, the knowledge world, and the quantitative model world are densely hyperlinked and are partially converted into dedicated databases. However, this expert-oriented integration is fundamentally different from machine-oriented semantic integration. To illustrate this difference, examples will be discussed along the complete "from bench to bedside" workflow from the discovery of a clinically relevant genetic variant to its use in clinical practice in the field of genetics and pharmacogenomics (see Fig. 73).

Symptomatic problems related to the missing semantic integration between objects in the three worlds are as follows (the examples are taken from the genetic association field).

1. Study design. Whereas human expertise and creativity are an essential part of study design, the utilization of earlier data sets and the literature is still ad hoc and fragmentary, despite the availability of literature-based, NLP-oriented gene and variant prioritization methods and data-based prioritization methods to support or even automate the design of a confirmatory, hypothesis-driven study.

2. Separated databanking/biobanking. The design of the databank/biobank to collect the rich phenotype information is separated from the data preprocessing, data analysis and interpretation of the results, e.g. constraints and dependencies between the phenotypic variables are redundantly modeled or even reinvented in later phases.

3. Prior generation. The generation of priors from the literature is very rudimentary, especially the extraction and conversion of parameter priors.

4. Interpretation of the data analytic results. Because of the typical interdisciplinary cooperation of biomedical domain experts and statisticians, the interpretation of the data analytic results in the context of the literature remains unsolved. The current practice of separate information retrieval screening of individual results is in sharp contrast with the ideal use of the entirety of the results with appropriately interpreted statistical confidence or credibility.

5. Internal sharing of the growing results. The internal sharing of the results and their candidate interpretations is unresolved even within a small group of domain experts. This is especially problematic in the case of multiple versions of data postprocessing, which is typical in targeted next-generation sequencing studies.

6. Dissemination of weakly significant results. Weakly significant results are abundant, particularly in the multivariate context, such as in the case of gene-gene and gene-environment interactions. The current publication practice does not support the formal dissemination of such results, although these information fragments, if properly aggregated, could be used in meta-analyses.

7. Dissemination of quantitative models. The publication of quantitative models, e.g. risk prediction models, is similarly separated from the underlying data collection protocol, study design, actual data and interpretative publications.

8. Literature-based enrichment for report generation. The enrichment of rare genetic variants with the most up-to-date information could be vital for the proper evaluation of the actual case, e.g. in the interpretation of genetic anomalies in children. Additionally, it could also be used to build up a mixture of expert- and literature-based annotation for a collection of patients, and this information pool could be an even more powerful tool.

9. Literature-based explanation generation for diagnostic and therapeutic decisions. Decision support models could be linked to source data sets (if any) and could be automatically anchored in the literature to provide constantly updated explanation generation for the use of the model.

The root of these problems can be traced back to two seminal concepts: the concept of semantic publishing and the concept of data-analytic knowledge bases. In fact, these concepts are the chronological successors of the standardization of the data world, applied to the knowledge (literature) world and to the (quantitative) model world.

Data repositories with semantic technologies

As summarized above, biomedical data standardization started with sequencing data and gene expression data, resulting in ontologies, such as the Gene Ontology; standards, such as the Microarray Gene Expression Data (MGED) standard and the Minimum Information About a Microarray Experiment (MIAME) standard; and repositories, such as the Gene Expression Omnibus (GEO). These were followed by other omic fields, e.g. in the case of genetic polymorphism data by standards such as the Minimum Information about a Genotyping Experiment (MIGEN), and by repositories such as the European Genome-phenome Archive (EGA) or the more focused, disease- or gene/region-specific Locus-Specific DataBases (LSDBs) and Leiden Open-source Variation Databases (LOVDs). However, the extension of this standardization using vocabularies and ontologies to the phenotypic levels is still an open issue, despite the crucial role of phenotypic data in current genetic association research (for its potential role in missing heritability, see [259]; for the concept of "deep phenotyping", see [257]). The range of phenotypic data, in a strict sense, starts at the expression levels, such as gene expression, proteomics and metabolomics at the cellular or tissue level, but even focusing on the clinical levels, phenotypic data still include a wide range of information.
This can consist of detailed pathological information, such as the pathological description of a tumor; rich clinical information, such as quantitative information about the efficiency of the applied medication and its side-effects; and even lifestyle data, such as food diaries and physical exercise. Unfortunately, the standard clinical code systems, such as the currently prevailing ICD-10 and ICD-11, are too coarse-grained for academic purposes, as their primary goal is accounting in medical finance. The Unified Medical Language System (UMLS), which is a loose collection of ontologies, has a highly variable quality. Recent approaches, such as the Human Phenotype Ontology (HPO) [263], have similarly had only limited success as general ontologies. A notable exception is the Medical Dictionary for Regulatory Activities (MedDRA) for the domain of side-effects, which is of central importance in recent programmes for the detailed tracking of the efficiency and side-effects of drugs. Note that in certain cases, such as allergy, environmental data, including pollen and meteorological data, are similarly essential, together with the emerging microbial data from metagenomics.
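A toy sketch of what such phenotype standardization looks like in practice: free-text clinical phrases are mapped to controlled vocabulary codes through a small lookup table. The codes below are placeholders; a real pipeline would query the full HPO or MedDRA terminologies and use proper concept recognition rather than substring matching.

# Toy mapping of free-text phenotype descriptions to controlled ontology codes.
# The codes are placeholders; real systems use HPO ("HP:...") or MedDRA identifiers.
PHENOTYPE_LEXICON = {
    "wheezing":            "EX:0000001",
    "shortness of breath": "EX:0000002",
    "skin rash":           "EX:0000003",
}

def annotate(clinical_note: str) -> list[str]:
    """Return the codes whose trigger phrases occur in the note."""
    text = clinical_note.lower()
    return sorted({code for phrase, code in PHENOTYPE_LEXICON.items() if phrase in text})

note = "Patient reports wheezing and shortness of breath after exercise."
print(annotate(note))          # -> ['EX:0000001', 'EX:0000002']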

Semantic publishing for the literature world

Besides the automated text-mining methods and the commercial literature-based, "bibliomic" databases, a promising candidate for the semantic integration of the literature/knowledge world is semantic publishing. Semantic publishing (SP), the enrichment of scientific publications with formal knowledge representations, lies at the crossroads of multiple scientific disciplines, technological developments, economic rationales and philosophical traditions. These different roads reflect the numerous aspects of knowledge representation, such as being a surrogate, an ontological commitment, a theory of reasoning, and a medium for computation and communication (DAVIS et al, 1993). Furthermore, these roles are complicated by the various roles of scientific research itself, ranging from the quest for objective knowledge to social life. One of the oldest roots of semantic publishing can be traced back to the encyclopedist traditions; more recent anchors are logical positivism, the Vienna Circle, H.G. Wells's "World Brain", and E. Garfield's "Informatorium". A modern appearance of this line of thought is Wikipedia, which exemplifies human-oriented knowledge representation, despite its electronic and richly linked, optionally tagged form, and despite the existence of multiple semantic extensions (Brohee et al, 2010). Other predecessors of semantic publishing are the large-scale knowledge bases, e.g. the Cyc project and its successors. The Cyc project is at the other end of the spectrum, because it is completely oriented towards automated reasoning and not towards human communication, i.e. towards formal knowledge representation and not towards free text. An important precursor of semantic publishing is the development of ontologies, which flourished in the last decade, particularly in biomedicine (Ashburner et al, 2000). A key element in the emergence of semantic publishing was the development of the semantic web, which provided the technological background [272, 236, 237, 240, 268 and 260]. Semantic publishing also received support from the dual goal of data sharing behind publications, which aimed to increase validity and repeatability, and also to boost the possibility of meta-analysis. The emerging concept of semantic publishing, and particularly the related concept of structured digital abstracts, can be summarized as a revolutionary step to create a layer of formal knowledge representation over scientific publications. However, despite the sporadic presence of semantic publication, such as in structural chemistry, the development of semantic languages in many fields a decade ago, and the available guidelines from leading publishers, semantic publishing is still in its infancy. A brief list of illustrative milestones is as follows:

1. Development and routine application of semantic publication in structural chemistry [272 and 265].

2. A series of papers about the "blurring boundaries" between the data and knowledge worlds, or between databases and semi-structured/free-text information [243, 249, 248, 266, 238 and 269].

3. An exemplar paper using semantic publishing technologies [270].

4. The Structured Digital Abstract proposal (Seringhaus/Gerstein, 2008), which suggests adding a 'structured XML-readable summary of pertinent facts' [250].

5. FEBS's proposal and guide for digital abstracts [267].

6. Cell's proposal and guide for digital abstracts (Article of the Future, Cell, 2009 onwards: tabbed and hyperlinked presentation of the article; Graphical Abstract and Highlights on the landing page).
7. Elsevier's initiatives in bioinformatics and semantic enrichment (Biological Research Information Enrichment Framework, BRIEF).

8. Investigation of text-mining methods to support or even substitute for semantic publishing [250 and 267].

There are multiple explanations for the slow acceptance of semantic publishing by authors. The lack of ontologies, particularly for the phenotypic, clinical levels, is probably an important factor, but this is improving with the appearance of general ontologies. Another reason is probably the lack of incentives for the authors, which could be changed (1) by the emergence of useful research tools, (2) by a gradual shift in publishers' policies towards requiring such enrichment, or (3) by the introduction of new credit systems that acknowledge citations at a more detailed level. Finally, a mundane reason is the lack of basic research to build a

163 specialized, but fully functional system, which could be used to study the entire system of semantic publishing: from enriching and entering publications into the system to the use of this compendium of enriched publications in information retrieval or in automated inference. However, there are special subfields, particularly in biomedicine, in which the application of semantic publishing would be vital and publication policy of the community could enforce its rapid and widespread adoption. For example, such an area is the molecular diagnosis and pharmacogenomic treatment of tumors, for which the aforementioned guidelines and standards are also relevant. In this case, the standardized reporting of clinically valid and efficient results could be accessed and processed accurately and automatically, without the need for costly and rapidly obsoleted manually curated knowledge bases. A currently not utilized synergy between study guides and semantic publishing is the availability of general study guides in many fields, which also include guidelines and standards for scientific publishing, such as the guides related to the publication of genomic results. Trivially, these guides could define the contents in semantic publishing systems. 1. STREGA: STrengthening the REporting of Genetic Associations [258], 2. STROBE: STrengthening the Reporting of OBservational studies in Epidemiology [273], 3. STROBE-ME: STrengthening the Reporting of OBservational studies in Epidemiology: Molecular Epidemiology [245], 4. GRIPS: Strengthening the reporting of genetic risk prediction studies: the GRIPS statement [256]. Semantic publishing could support information retrieval and the tracking of factual statements to their empirical evidences, and it could also provide a new, synthetic form to interrogate and overview the entirety of publications. Furthermore, it could support statistical meta-analysis by ensuring access to and interpretation of data corresponding to publications. Finally, it could even support automated logical inference over the hypothetical knowledge base composed of the formal representations of the publications Causal Bayesian network-based data analytic knowledge bases Semantic publishing cannot cope with the challenge of the dissemination of weakly significant results from data analysis, which are more and more accepted as important assets, "gold dust", in meta-analysis [262]. Such results are particularly frequent in the multivariate context; in conditional settings, such as in the analysis of subpopulations; and in versioning, e.g. analysing early or differently preprocessed data sets. Graphical models, specifically causal Bayesian networks offer a rich language for the detailed representation of types of relevance, including causal, acausal, and multi-target aspects (Chapters Biomarker and Causal inference). They allow the decomposition and refinement of the overloaded concept of association. Bayesian networks in the Bayesian framework allow the inference of posteriors over multivariate strong relevance, interactions, global dependency and causal relations, optionally with various specialization for multiple targets. Furthermore, the Bayesian network-based Bayesian MultiLevel Analysis (BN-BMLA) of relevance in GAS allows scalable intermediate levels between univariate strong relevance and full multivariate relevance to interpret the results at partial multivariate levels. 
The advantage of the direct probabilistic semantics of the Bayesian statistical approach allows a mathematically direct and biomedically interpretable way to postprocess the results. In short, it is an ideal candidate for creating probabilistic knowledge bases to support the fusion of background knowledge and to support the fusion of weakly significant results from multiple data analyses to increase the reliability of the fused result ("off-line" meta-analysis). The coherent characterization of the uncertainties over the detailed types of relevances offers the opportunity to interpret the results of a Bayesian GAS analysis as a "Bayesian data analytic knowledge base" (see the related openbel [271]). The concept of data analytic knowledge base is relevant both in the literature world and in the quantitative model world, and connects them, because submodels, e.g. mechanisms, can also be formally represented, which can be seen as part of a quantitative model repository or as part of a semantic publishing level. Furthermore, data-analytic probabilistic knowledge-bases can integrate codified, logical knowledge, performing a powerful knowledge integration. Accordingly, the following levels can be identified. 155

164 1. Probabilistic database level. The use of probabilistic relational databases to represent efficiently the multivariate results of a biomedical data analysis. 2. Semantic publishing and WWW. The use of semantic publishing and internet technologies to wrap the dataanalytic probabilistic knowledge-bases. 3. Probabilistic logic. The use of probablistic inference methods to integrate data-analytic probabilistic knowledge-bases with logical knowledge. Fig. 74 shows the concept of a full scale Bayesian knowledge base integrating the results of multiple multivariate Bayesian data analyses and background knowledge Examples for links between worlds The integration of the literature world and the mechanism world was investigated in the context of model-based information retrieval, in which the constructed Bayesian network could be used to complement the query [233 és 234]. Another early attempt tried to annotate the decision support model, which could be used in explanation generation for decision support [234]. An early attempt to integrate data collection and model construction suggested to use a common formally represented data model, which could be used along all the phases of data collection, data analysis and decision support [229]. For the connection between the data world and literature world, examples are the generation of priors from the literature to learn Bayesian networks [232 és 231], the joint use of literature data and gene expression data [230, 251 és 252] Prospects for the Bayesian Encyclopedia Whereas the use of semantic technologies for biomedical is widely accepted, semantic publishing and the use of structured digital abstracts are still in its infancy. The demand for this methodology is the greatest in the biomedical research community, specifically in genetics and pharmacogenomics. The methodologies developed to date are seriously limited in many of the following features; they do not utilize the full potential of the semantic description of the text, they lack the ability to connect the information to other formal knowledge and data bases, they do not follow established terminologies and do not directly link to the empirical data and analyses supplied in the publications. The pharmacogenomic application of semantic publishing will allow the automated annotation and interpretation of molecular diagnostic tests, particularly genetic tests, and even of complex medical decisions. The probabilistic extension of semantic publication will allow new possibilities in off-line meta-analysis, e.g. in the automated construction and parameterizations of decision networks. These decision models specially linked to concrete statements in publications will open up new possibilities in evidence-based medicine, foster the development of translational science, and support to achieve the goals of personalized medicine. To achieve these goals, the following steps can be expected. 156

1. Specification and development of controlled natural languages for reporting results in the field of pharmacogenomics.
2. Development of methodologies applying existing text-mining methods to extend existing publications into semantic publications, and to annotate new publications.
3. Development of new statistical biomarker analysis methods whose results can be represented in relational data-analytic knowledge bases in order to cite them efficiently, thus allowing easy integration into semantic publishing.
4. Quantitative and probabilistic extension of semantic publishing, which provides the possibility of citing uncertain terms from probabilistic data-analysis databases.
5. Methodologies to integrate scientific publications described by this new controlled natural language into a large novel probabilistic knowledge base.
6. In pharmacogenomics, unification of currently available semantic publishing methodologies (e.g. BRIEF) and the currently available structured digital support tools (e.g. FEBS), as well as workflows containing text extraction and mining tools. In particular, these should be extended to comply with the STREGA, STROBE, STROBE-ME, GRIPS and MIGEN standards and applications, mutation-describing standards and databases (e.g. LOVD), and relevant molecular biological and clinical ontologies.
7. Consistency check of a large probabilistic knowledge base from semantic publishing spanning multiple levels, such as (1) free text, (2) controlled language, (3) data appendices, source code, major results, standardized model descriptions, and (4) new "data-analysis" results.
8. Efficient methodologies for the parameter estimation of decision networks by exploiting the improved entity identification in semantic publishing and by using the partial statistics in the probabilistic knowledge base. Methodologies for determining the structure of decision networks automatically, and for aiding experts in the construction process.
9. Methodology for the description of the pharmacogenomic aspects of cancer, and especially for the description of clinically important treatment selection information.
10. Development of multiple annotated decision networks and explanation generation methods for translating the complex probabilistic inference in annotated decision networks into expert reasoning, supported by citing relevant statements in publications and by using existing medical protocols.

The ever-growing amount of genetic association and clinically validated pharmacogenomic results makes the replacement or extension of expert-built, annotated, manually maintained, and text-mined knowledge bases an attractive option. This option is fortified by the availability of ontologies and the increasing adoption of standards which govern the reporting of measurement methods, data and predictive models. The semantic unification of the three worlds would make possible the quantitative extension of uncertain knowledge term representation through semantic publications, especially that of large volumes of weakly significant results. This novel methodology would enable the direct connection of clinical questions and basic research results, automating translational efforts. This function is essential both for the interpretation of diagnostic results and for quantitative medical decision support.
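As a toy illustration of the quantitative, probabilistic extension of such knowledge bases, the sketch below (in Python, with entirely hypothetical identifiers and posterior values) shows how a single uncertain relevance statement from a Bayesian analysis might be stored together with its provenance so that it can later be cited and fused with other results:

```python
# Minimal sketch of a posterior-annotated statement in a data-analytic
# probabilistic knowledge base. All names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class UncertainStatement:
    relation: str    # e.g. a structural relevance type such as "MBM"
    subject: str     # entity identifier, e.g. a SNP
    target: str      # entity identifier, e.g. a phenotype
    posterior: float # posterior probability from the Bayesian analysis
    source: str      # provenance: publication or analysis identifier

kb = [
    UncertainStatement("MBM", "rs0000001", "PhenotypeX", 0.82, "analysis:demo-01"),
    UncertainStatement("MBM", "rs0000002", "PhenotypeX", 0.35, "analysis:demo-01"),
]

# A query over the knowledge base: statements relevant to PhenotypeX
# with posterior above a chosen threshold.
relevant = [s for s in kb if s.target == "PhenotypeX" and s.posterior > 0.5]
for s in relevant:
    print(f"{s.relation}({s.subject}, {s.target}) p={s.posterior} [{s.source}]")
```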
It is important to note that these techniques can enable and support the work of researchers, genetics consultants and medical specialists with proper biomedical background knowledge, and not substitute human experts and medical professionals by giving direct advice to patients, e.g. to support the interpretation of direct-to-consumer genetic tests. In short, these techniques can be seen as a step towards the idealistic, positivist representation of knowledge dreamed of by thinkers and philosophers such as H. G. Wells ("World Brain"), E. Garfield ("Informatorium") [246 and 247], L. Wittgenstein (universal language) and K. R. Popper [261].

33. References

[229] S. Aerts, P. Antal, B. De Moor, and Y. Moreau, Web-based data collection for ovarian cancer: a case study. In: Proc. of the 15th IEEE Symp. on Computer-Based Medical Sys. (CBMS-2002).

[230] P. Antal, G. Fannes, T. Meszaros, P. Glenisson, B. De Moor, J. Grootens, T. Boonefaes, P. Rottiers, and Y. Moreau, Towards an integrated usage of expression data and domain literature in gene clustering: representations and methods.
[231] P. Antal, G. Fannes, Y. Moreau, D. Timmerman, and B. De Moor, Using literature and data to learn Bayesian networks as clinical models of ovarian tumors. Artificial Intelligence in Medicine, 30.
[232] P. Antal, P. Glenisson, G. Fannes, J. Mathijs, Y. Moreau, and B. De Moor, On the potential of domain literature for clustering and Bayesian network learning. In: Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (ACM-KDD-2002).
[233] P. Antal, T. Meszaros, B. De Moor, and T. Dobrowiecki, Annotated Bayesian networks: a tool to integrate textual and probabilistic medical knowledge. In: Proc. of the 13th IEEE Symp. on Comp.-Based Med. Sys. (CBMS-2001).
[234] P. Antal, T. Meszaros, B. De Moor, and T. Dobrowiecki, Domain knowledge based information retrieval language: an application of annotated Bayesian networks in ovarian cancer domain. In: Proc. of the 15th IEEE Symp. on Computer-Based Medical Sys. (CBMS-2002).
[235] A. Szalay, G. Bell, and T. Hey, Beyond the data deluge. Science, 323(5919).
[236] T. Berners-Lee and J. Hendler, Publishing on the semantic web. Nature, 410.
[237] T. Berners-Lee, J. Hendler, and O. Lassila, The semantic web. Scientific American, May:29-37.
[238] P. Bourne, Will a biological database be different from a biological journal? PLoS Computational Biology, 1(3).
[239] Y. Cai, M. L. Wilson, and J. Peccoud, GenoCAD for iGEM: a grammatical approach to the design of standard-compliant constructs. Nucleic Acids Res., 38(8).
[240] S. Decker, P. Mitra, and Sergey Melnik, Framework for the semantic web: an RDF tutorial. IEEE Internet Computing, 4(6):68-73, Nov-Dec 2000.
[241] Ron Edgar, Michael Domrachev, and Alex E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1).
[242] A. Brazma et al., Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nature Genetics, 29.
[243] R. J. Roberts et al., Building a 'GenBank' of the published literature. Science, 291.
[244] P. Fu, A perspective of synthetic biology: assembling building blocks for novel functions. Biotechnol. J., 1(6):690-9.
[245] V. Gallo et al., Strengthening the reporting of observational studies in epidemiology - molecular epidemiology (STROBE-ME): an extension of the STROBE statement. Preventive Medicine, 53(6).
[246] E. Garfield, Essays of an Information Scientist, chapter Towards the World Brain. ISI Press, Cambridge, MA.
[247] Eugene Garfield, From the world brain to the informatorium. Information Services and Use, 19:99-105.
[248] M. Gerstein, E-publishing on the web: Promises, pitfalls, and payoffs for bioinformatics. Bioinformatics, 15(6).
[249] M. Gerstein and J. Junker, Blurring the boundaries between scientific 'papers' and biological databases. Nature (web debate, on-line 7 May 2001).
[250] M. Gerstein, M. Seringhaus, and S. Fields, Structured digital abstract makes text mining easy. Nature, 447(7141).

[251] P. Glenisson, P. Antal, J. Mathys, Y. Moreau, and B. De Moor, Evaluation of the vector space representation in text-based gene clustering. In: Proc. of the Pacific Symposium on Biocomputing (PSB03).
[252] P. Glenisson, B. Coessens, S. Van Vooren, J. Mathijs, Y. Moreau, and B. De Moor, TXTGate: profiling gene groups with text-based information. Genome Biology, 5(6).
[253] David Heckerman, The Fourth Paradigm in Practice. Creative Commons.
[254] Tony Hey, The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.
[255] J. Huang et al., Minimum information about a genotyping experiment (MIGEN). Standards in Genomic Sciences, 5(2).
[256] A. Janssens et al., Strengthening the reporting of genetic risk prediction studies: the GRIPS statement. Genetics in Medicine, 13(5).
[257] R. Joober, The 1000 genomes project: deep genomic sequencing waiting for deep psychiatric phenotyping. J Psychiatry Neurosci, 36(3):147-9.
[258] J. Little et al., Strengthening the reporting of genetic association studies (STREGA): an extension of the STROBE statement. Human Genetics, 125(9).
[259] B. Maher, Personal genomes: The case of the missing heritability. Nature, 456(7218):18-21.
[260] H. Pearson, The future of the electronic scientific literature. Nature, 413:1-3.
[261] K. R. Popper, Objective Knowledge: An Evolutionary Approach. Oxford University Press, London.
[262] M. A. Province and I. B. Borecki, Gathering the gold dust: methods for assessing the aggregate impact of small effect genes in genomic scans. In: Proc. of the Pacific Symposium on Biocomputing (PSB08), volume 13.
[263] P. N. Robinson and S. Mundlos, The Human Phenotype Ontology. Clin Genet, 77.
[264] G. Rokke, E. Korvald, J. Pahr, O. Oyas, and R. Lale, BioBrick assembly standards and techniques and associated software tools. Methods Mol Biol., 1116:1-24.
[265] H. Rzepa and P. Murray-Rust, A new publishing paradigm: STM articles as part of the semantic web. Learned Publishing, 14(3).
[266] M. Seringhaus and M. Gerstein, Publishing perishing? Towards tomorrow's information architecture. BMC Bioinformatics, 8.
[267] M. Seringhaus and M. Gerstein, Manually structured digital abstracts: A scaffold for automatic text mining. FEBS Letters, 582(8).
[268] N. Shadbolt, What does the science in e-science. IEEE Intelligent Systems, 17(May/June):2-3.
[269] D. Shotton, Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2):85-94.
[270] D. Shotton, K. Portwin, G. Klyne, and A. Miles, Adventures in semantic publishing: Exemplar semantic enhancements of a research article. PLoS Computational Biology, 5(4).
[271] T. Slater, Recent advances in modeling languages for pathway maps and computable biological networks. Drug Discov Today, 19(2).
[272] Vanessa Speding, XML to take science by storm. Scientific Computing World, Supplement (Autumn):15-18.

[273] J. Vandenbroucke et al., Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. PLoS Medicine, 4(10).

Bioinformatical workflow systems - case study

Bioinformatics as an interdisciplinary science could come into existence because of the growth of the amount and accessibility of computational capacity. The appearance of supercomputers and distributed computer systems made tractable several methods which previously were infeasible. The increased computational capacity, however, created not only new opportunities but new tasks as well: the efficient utilization of a supercomputer or a distributed system poses a considerable challenge in informatics. In this chapter, we discuss the case study of such a system, through which we can get an illustration of typical problems and their possible solutions related to the implementation of such systems. The rest of the chapter is organized as follows: Section 18.1 gives a general overview of the introduced system, Section 18.2 discusses the applied data model, Section 18.3 lists higher-order use cases and introduces the architecture of the system, and Section 18.4 gives an overview of server-side implementation details. Postprocessing, the final phase of the workflow, is treated in Section 18.5.

Overview of tasks

The workflow system in question is based on BMLA analyses, the primary goal of which is to examine the relationships of a given domain through statistics collected by MCMC simulations about structural features of Bayesian networks. Since these MCMC simulations are computationally rather demanding, and a single BMLA analysis requires several MCMC runs, the workflow system to be implemented needs the following properties:

It has to be able to handle the MCMC runs corresponding to a given BMLA analysis and their input and output data.

The system has to keep track of analyses started by users without a permanent connection between the system and the client.

It has to be able to handle available computational resources automatically.

The above requirements suggest a multilevel client-server architecture, in which the client (the user) can assemble and initiate BMLA analyses on the server, and can later query their state and results from the server.

Data model and representation

Bayesian network models and observation data serve as the fundamentals of BMLA analyses. The program BayesCube provides a full-scale set of tools for editing and handling them, hence these input data can be regarded as given from the point of view of the examined workflow system. (Since the corresponding functionalities of BayesCube are discussed in detail in another chapter, their introduction is omitted here.) Besides the observation data and the corresponding model, the parametrization of the MCMC simulations to be carried out has to be specified. This information consists of the following:

These are the variables about which MCMC statistics will be collected. Restricting the set of targets (i.e. omitting an exploratory analysis covering the entire set of variables) is necessary because of the dimensions (in some cases exceeding the magnitude of gigabytes) of the collected statistics.

If multiple target variables are present, some BMLA features (e.g. MBS) can be calculated for the entire set of targets or separately for each member. A third option is to create an auxiliary model for each target with the rest of the targets removed.

Most typical are the MBM, MBS, and MBG properties, and the so-called causal relation property describing the structural relation (e.g. child-parent, descendant-ancestor, descendants with a common ancestor) of pairs of nodes.

It is possible to perform examinations and tests above the level of individual MCMC simulations: such are permutation tests and bootstrap methods. Repeated execution can be useful for the assessment of convergence and confidence as well.

The program executing the MCMC simulations itself has several parameters. The values and combinations of values of these also have to be specified here.

Since the BayesCube tool also supports the editing of the above BMLA configurations, the whole input data set can be assembled on the client side. In light of this, we can overview what functionalities the workflow system has to provide for the client, and devise an architecture capable of providing this functionality.

Use cases and architecture

After the description of the data model we can overview the most important use cases, according to which the architecture of the workflow system can be designed. The list of fundamental use cases of the workflow system is as follows:

This step, carried out by BayesCube, can be regarded as a preliminary step: the user assembles the set of observation data and the corresponding model, and determines the set of MCMC runs to execute by specifying the configuration file described in the previous section. There is no actual interaction with the workflow system in this stage yet.

The user uploads the configuration file of the previous point to the BMLA server (along with the data and model files), where on the one hand these basic data are archived, and on the other hand the necessary programs are executed.

Since the calculations belonging to a single BMLA analysis can run for multiple days, and furthermore they can be delayed by the execution of other analyses, it is important that the user can monitor the state of his analyses.

The ultimate step is the fetching of the results of the finished analysis from the server to the local client-side computer, where further postprocessing can be performed, primarily with the aid of BayesCube.

The architecture of the system implementing the above functionalities consists of the following modules:

Functionalities responsible for the initiation of the above use cases are implemented embedded in a function library for the sake of modularity and reusability. Each use case is a separate function call in the interface provided by the library, which hence can easily be integrated into any software tool handling BMLA analyses (cf. the program BayesCube). The primary goal of this module is the encapsulation and abstraction of the inner details of the BMLA workflow implementation.
This is the server-side counterpart of the client-side function library: it assigns a server call to each BMLA use case, hence, together with the previous module, it can be regarded as part of the abstraction layer hiding the web connection between the implementation and the user interface. Functions implemented in this module directly access the further parts of the architecture, performing the necessary operations through them.

This module performs administrative functionalities: besides storing user accounts, it stores the basic data of each uploaded BMLA analysis (observation data and model, the configuration file, and the date and time of submission). It also stores the results of the last query about the state of each analysis.

These are the tools called directly by the central web server application, which perform the following basic operations: (1) assembly of the set of calculations to execute, (2) starting these calculations, (3) querying the state of the calculations, optionally terminating them, and (4) integration of the results (and providing them for access by the client).

In the BMLA system, several individual program executions have to be coordinated, since multiple BMLA analyses may be present simultaneously, and furthermore a single BMLA analysis consists of multiple program runs. On the other hand, multiple computers may be available for the execution of the calculations. These two factors both require the application of a full-scale job management system, capable of coordinating the parallel execution of programs in a possibly distributed system. In the BMLA framework the HTCondor system is applied for this task, i.e. a separate HTCondor job is created for each separate program run, which is then executed by the HTCondor system. As suggested by the above, the HTCondor system creates another abstraction layer, which hides the details of the execution hardware from the BMLA tools. Hence the BMLA system does not have to deal with the computational nodes directly; they only have to meet the following assumptions: (1) the tools required for assigning them to an HTCondor system have to be installed on them, and (2) they have to be capable of executing the programs implementing the MCMC simulations.

Implementation details of the server

In this section we overview the server-side applications, which implement the actual workflow based on the coordination of the main server application.

HTCondor

As we have seen in the previous section, the task of the HTCondor general task management system is to hide the details of the computer pool from the BMLA processes. The HTCondor system possesses the following properties, which are of importance from our point of view:

A task to be executed can be described by means of a job, which, besides the executable program, specifies its command-line arguments and the list of input files. Each job might have a detailed description of its resource requirements as well; however, in the BMLA system jobs are not distinguished by this.

The computers performing the calculations (so-called nodes) are interpreted as resources. The HTCondor system continuously monitors the set of available resources, and assigns yet unexecuted jobs to them (by default in a first-come-first-served manner).

Besides the continuous monitoring of the states of jobs, the system also ensures that the outputs of finished jobs are transferred back to the original directory on the server.

It is also possible to specify a precedence order among jobs, through which it can be ensured that tasks requiring the output of other jobs (e.g. the ones aggregating the results of separate MCMC runs) are only run when all the files required by them are present.

soapbmla.cmd.generatecondorjobs.class

This tool is responsible for creating the list of MCMC runs to be executed based on the BMLA configuration file. As we have seen previously, the list of parameters contained in the configuration file can be divided into two separate groups: (1) those that are passed directly to the MCMC run, and (2) the higher-order ones, such as the one specifying the number of repetitions, or the one specifying permutation tests. Accordingly, the assembly of the list of submit files to be passed to the HTCondor system is carried out according to the following steps:

1. Most higher-order tests and procedures require the modification of the model and/or data in some manner (e.g. a permutation test requires the randomization of the observations of the target variables, whereas bootstrap methods require the resampling of the original data).
If such requirements are present, the corresponding auxiliary data and model files are created first.
2. According to the above and the specified MCMC parameter combinations, the list of all different parametrizations is created (a minimal sketch of this step is given below).
3. If necessary (i.e. the parameter number-of-runs is present), the whole set of submit files is multiplied.
4. The above set of jobs is augmented by a further one, responsible for the integration of their results (carried out by the program mergeresults.exe).
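The following Python sketch illustrates steps 2-4 under simplifying assumptions: the configuration is a plain dictionary, the parameter names, command-line flags and submit-file fields are hypothetical, and the real generatecondorjobs tool may differ in its details.

```python
# Hypothetical sketch of assembling HTCondor submit descriptions from a
# BMLA-style configuration; parameter names and file layout are illustrative.
import itertools

config = {
    "mcmc_params": {            # values passed directly to bn-mcmc.exe
        "chain-length": [100000, 1000000],
        "burn-in": [10000],
    },
    "number-of-runs": 3,        # higher-order parameter: repetitions
}

def parametrizations(mcmc_params):
    """Enumerate every combination of the MCMC parameter values."""
    names = sorted(mcmc_params)
    for values in itertools.product(*(mcmc_params[n] for n in names)):
        yield dict(zip(names, values))

jobs = []
for params in parametrizations(config["mcmc_params"]):
    for run in range(config["number-of-runs"]):          # step 3: multiply
        args = " ".join(f"--{k}={v}" for k, v in params.items())
        jobs.append(
            "executable = bn-mcmc.exe\n"
            f"arguments  = {args} --seed={run}\n"
            "queue\n"
        )

# Step 4: a final job that merges the raw results of all MCMC runs.
jobs.append("executable = mergeresults.exe\nqueue\n")

for i, submit in enumerate(jobs):
    print(f"# job_{i}.submit\n{submit}")
```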

All the above jobs are joined by an HTCondor DAGMan descriptor (a tool that can be used to specify precedence amongst jobs), through which the execution of the entire set can be started by the submission of a single job.

bn-mcmc.exe

This program performs the MCMC runs. Its inputs are the data and model files and the set of MCMC parameters passed as command-line arguments; its outputs are the files containing the statistics collected during the simulations. bn-mcmc.exe executions are carried out by the HTCondor system according to the corresponding submit files.

mergeresults.exe

This program performs the integration of the raw results provided by bn-mcmc.exe. It is automatically executed after the MCMC runs, so that the fetching of the result files is more efficient (in some cases it merges hundreds of files into a couple of much more concise ones); however, it can be run "manually" as well (this is treated in Section 18.5 in detail).

Postprocessing steps

After the successful completion of the calculations, the results are transferred to the client side for processing and interpretation by experts. The program BayesCube provides tools for these tasks; however, these cannot be regarded strictly as parts of the BMLA workflow. The other tool applicable for postprocessing is mergeresults.exe, which is responsible for the merging and aggregation of raw MCMC results. Since a typical BMLA analysis consists of several separate MCMC runs, such integration can be useful because of both practical (storage space requirements, readability) and theoretical (calculation of basic statistics, simpler convergence and confidence scores) considerations. The program mergeresults.exe proceeds according to the following steps:

Its inputs are the raw results of the MCMC runs and the corresponding log files containing the MCMC parametrizations.

Results of runs with equivalent parametrizations are merged.

Basic statistics (e.g. average, standard deviation, minimum, maximum) are calculated for the merged results.

The output contains the above merged results and statistics, optionally grouped into separate files according to the values of parameters specified by the user.

An important question in the above steps is which MCMC parametrizations can be considered equivalent. By default only those are treated as equivalent which have equal values for each of their parameters; however, it is possible to "aggregate out" given parameters. "Aggregating out" one (or more) parameters means that those MCMC runs whose parametrizations differ only in the specified parameter(s) are considered equivalent, and the calculated statistics are evaluated over these sets of equivalent runs.

The BMLA workflow is concluded with the postprocessing step detailed above, after which the interpretation of the examination results or another BMLA analysis, reconfigured according to previous experiences, may follow.
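As an illustration of the aggregation performed by mergeresults.exe, the sketch below groups hypothetical run results by their parametrization, optionally "aggregating out" selected parameters, and computes basic statistics per group; the record layout is invented for the example and is not the actual file format of the tool.

```python
# Illustrative sketch of merging MCMC results with equivalent parametrizations;
# the record structure is hypothetical, not the actual mergeresults.exe format.
from collections import defaultdict
from statistics import mean, stdev

runs = [
    {"params": {"chain-length": 100000, "seed": 0}, "edge_posterior": 0.61},
    {"params": {"chain-length": 100000, "seed": 1}, "edge_posterior": 0.58},
    {"params": {"chain-length": 1000000, "seed": 0}, "edge_posterior": 0.64},
]

def merge(runs, aggregate_out=()):
    """Group runs whose parametrizations differ only in 'aggregate_out'."""
    groups = defaultdict(list)
    for run in runs:
        key = tuple(sorted((k, v) for k, v in run["params"].items()
                           if k not in aggregate_out))
        groups[key].append(run["edge_posterior"])
    stats = {}
    for key, values in groups.items():
        stats[key] = {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return stats

# Treat runs differing only in their random seed as equivalent.
for key, s in merge(runs, aggregate_out=("seed",)).items():
    print(key, s)
```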

Computational aspects of pharmaceutical research

Overview of the process

This chapter aims to give a short introduction to the modern techniques of small-molecule drug design, especially at the frontier of informatics, mathematics and organic chemistry, and also to serve as a starting point for the interested reader. The topics covered in this chapter are addressed by several books and an ever-increasing number of scientific publications.

An essential element of a pharmaceutical development plan is the definition of a target, which can be an achievable effect or a well-defined molecular target. A molecular target is usually a macromolecule in the organism which can be modulated by a drug. A drug can be a small molecule or a macromolecule as well (for example antibodies or peptides), but in this chapter we deal with the design of small-molecule agents. The molecular target can be identified and selected based on the biological or medical knowledge of the disease, or on the known mechanism of existing drugs. When the target is specified, a set of promising molecules can be selected by in-silico screening or high-throughput in vitro screening. As a first step, a huge number of compounds - a library - is screened for hits. A library can be a real collection of compounds or just a virtual one. Then the best hits are selected based on different properties to form a smaller set of leads. Leads and other analogues are then optimized and tested in preclinical experiments. The preclinical phase has a dual role: the in-vitro and animal tests minimize the risk of toxicity before the clinical tests on human subjects, and reduce the risk of an unsuccessful clinical trial, which is extremely expensive. The activity data collected by testing a set of analogues is furthermore used for building a structure-activity model of the chemical space around our leads. After the preclinical evaluation, clinical trials are carried out with the participation of volunteers, to determine the drug's safety profile and effectiveness. The clinical trial process is usually divided into three main phases (Phase I, II and III) and an additional post-marketing phase (Phase IV). During the clinical trials, the safe dose range is determined in humans (Phase I), and the effectiveness in the given medical condition is tested in a placebo-controlled setting with increasing sample size in multiple steps (Phase II and III). The collection of adverse events is continuous from Phase I to the post-marketing phase, when the drug is already on the market. The clinical trial process is continuously monitored statistically - called interim analysis - and can be terminated for ethical reasons or to save time.

Chemoinformatical background

To find a new pharmaceutically active compound with convenient properties, sometimes more than a million compounds must be analyzed. This huge set cannot be synthesized economically, so in the first steps the filtering is often carried out on a virtual library: a database of a huge set of commercially available or at least probably synthesizable compounds, which can contain never-synthesized ones. The database contains the chemical structure and possibly several calculated properties of the compounds. Usually a chemical structure can be defined by a labeled connectivity matrix of the atoms (graph representation) and some extra information about the spatial orientation of some substructures; a toy sketch of such a graph representation is given below.
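As a minimal, purely illustrative sketch of the labeled-graph view of a chemical structure, the snippet below encodes ethanol (CH3-CH2-OH, hydrogens omitted) as atom labels plus a bond list with bond orders; real chemoinformatics toolkits use far richer representations.

```python
# Toy labeled-graph representation of ethanol's heavy atoms (C-C-O).
# Atom indices label the nodes; bonds are edges annotated with bond order.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]   # (atom_i, atom_j, bond order)

# The same information as a labeled adjacency (connectivity) matrix.
n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
for i, j, order in bonds:
    adjacency[i][j] = adjacency[j][i] = order

for i in range(n):
    print(atoms[i], adjacency[i])
```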
A given atom-atom connectivity network can represent several three-dimensional structures. If a set of three-dimensional structures can be converted into each other at room temperature by thermal fluctuation, then the structures can be considered identical compounds, and are called conformers. In this case the energy barrier between two conformers is so small that they cannot be isolated in practice; all of the conformers can be found in the same sample with probabilities defined by the Boltzmann distribution. If there is a relatively large energetic barrier between two sets of 3D structures, the two sets represent two different compounds, called isomers. More specifically: if the connectivity structure of the molecule is the same and only the three-dimensional structure differs between two compounds, they are called stereoisomers. The concept is also referred to as chirality (from the Greek word for hand, meaning "handedness"). The main property of a chiral object is that it is not identical to its mirror image. To encode the difference between two stereoisomers, we must complete the molecular graph with additional information. For example, in the case of a quaternary carbon with four different substituents we can distinguish two different connection orders. This is the most common case in organic compounds and is called central chirality. A conventional rule called the Cahn-Ingold-Prelog priority rule (or CIP convention) is usually applied to label this type of atom - called chirality centres - and other chiral elements with the labels S (Sinister, Latin for left) or R (Rectus, Latin for right).

The main idea of the CIP convention is to label all substituents of the central atom with numbers based on the ordering of the atomic numbers of the directly connected atoms, applied iteratively; the molecule is then positioned so that the substituent labeled with the smallest number lies below the plane of the image. The numbering of the three other substituents can then run either clockwise or counter-clockwise. For the precise rules see an organic chemistry textbook, or the relevant IUPAC recommendation [274 and 275]. There are more special cases of chirality, like axial chirality (see Figs. 75 and 76), where the chiral element is an axis instead of a single atom. A series of compounds called helicenes, built from connected aromatic rings, can form a three-dimensional spiral. Helicenes lack chiral centres, yet they still exist in two forms: the clockwise and the counter-clockwise one.

In a biological system different stereoisomers can have strongly different effects, because the geometric matching of the molecular target (usually a protein) and the active compound is essential. The minimal number of matching features needed to induce a chiroselective system is three. These features must have nearly equal contributions to the binding energy, otherwise fewer than three elements dominate the binding and the affinity difference between the two isomers will be small. For example, the (S) stereoisomer of the sedative drug thalidomide is teratogenic. This drug was developed to treat morning sickness in pregnant women, and was originally marketed under the trade name Contergan. Thalidomide is a good example of another phenomenon called racemization: some molecules can change their isomeric state with the aid of an enzyme present in the biological system, in this case in the human body. Therefore the pure (R)-thalidomide also shows teratogenic properties. As we will see later in this chapter, this dangerous compound can also be used in medicine in some new indications, where pregnancy can be excluded.

The binding affinity of a molecule with respect to a target can be defined with the dissociation constant, usually denoted $K_d$, of the reaction

$$\mathrm{T} + \mathrm{L} \rightleftharpoons \mathrm{TL}$$

where T is the ligand-free target, L is the free ligand and TL is the complex. $K_d$ has the dimension of molar concentration, and is defined as

$$K_d = \frac{[\mathrm{T}][\mathrm{L}]}{[\mathrm{TL}]}$$

where square brackets indicate equilibrium molar concentrations [276]. The smaller the $K_d$, the more active the compound is. A 1 μM affinity means that half of the target molecules are occupied in a solution containing the modulator at a free concentration of 1 μmol/l, because if $[\mathrm{L}] = K_d$, then $[\mathrm{T}]/[\mathrm{TL}] = 1$, so $[\mathrm{TL}] = [\mathrm{T}]$. The strength of the interaction can also be expressed as Gibbs free energy. The connection between the two quantities is

$$\Delta G = R T \ln K_d$$

where $T$ is the temperature of the system and $R$ is the ideal gas constant.

Screening criteria

The pharmacological properties of a compound can be divided into two parts: pharmacodynamic (PD) and pharmacokinetic (PK) ones. Pharmacodynamics usually describes "How does the drug act on the biological system?": what is the target, how potent is our drug, how promiscuous is the binding, and so on. Pharmacokinetics asks: "How does the biological system act on our drug?": how the molecule is transported, distributed and transformed in the body. In a drug development process, the expected biological activity is only one of several criteria which must be met. Other very important criteria are referred to by the term ADMET: Absorption, Distribution, Metabolism, Excretion and Toxicity. The simplest way to handle the kinetics is to describe the molecules with physicochemical properties like solubility, polar surface area, lipophilicity, molecular mass, etc., which are computationally predictable with low average error. A classical attempt to filter out non-drug-like compounds is the application of Lipinski's Rule of Five. For orally active drugs, the rule constrains the maximal number of hydrogen bond donors to 5 and of acceptors to 10, and limits the molecular mass to 500 and the octanol-water partition coefficient (see the box below) to 5 [277]. It is worth mentioning that there are exceptions to these rules. Another similar rule is the stricter "Rule of Three" used in fragment design (not the same as Jorgensen's rule of three), which constrains the maximal number of hydrogen bond donors to 3 and of acceptors to 3, and limits the molecular mass to 300 and the octanol-water partition coefficient to 3 [278]. These properties are not just predictable, but can also be adjusted relatively easily by chemical modification of the lead.

Octanol-water partition coefficient (LogP). The partition coefficient is defined as the ratio of the concentrations of a compound in two immiscible solvents in equilibrium:

$$\log P = \log \frac{[\mathrm{L}]_{\mathrm{octanol}}}{[\mathrm{L}]_{\mathrm{water}}}$$

where L is the compound in un-ionized form. LogP is a measure of lipophilicity: if logP is low, the compound is called hydrophilic; if it is high, then lipophilic.

A conceptually different pharmacokinetic subject is metabolism, which is more difficult to predict. The possible metabolic reactions are usually predictable by matching reaction patterns to our compound, but the binding profile of several promiscuous enzymes must be taken into account to predict the truly relevant metabolic pathways. The goal of metabolism is to make the exogenous compound more water-soluble in order to facilitate excretion. The process has two main steps: phase I is dominated by oxidative processes, while in phase II conjugation with endogenous compounds takes place. For example, a broad class of oxidases called the cytochrome P450 family, usually abbreviated as CYPs, has a prominent role in the hepatic metabolism of several drugs. Metabolism is also one of the earliest areas of pharmacogenomics, and several polymorphisms of these enzymes have been identified in relation to personal differences in drug action. In some cases, like that of warfarin and some CYP2C9 polymorphisms, the association is also indicated on the package leaflet of the drug, and genotyping is used in clinical practice to aid dose adjustment [279]. There are several other specific interactions behind the pharmacokinetic properties of drugs, like transporters and tissue-specific enzymes, so the simple physicochemical treatment of the PK problem is limited.

Prediction of pharmacodynamic properties is more complex in nature. It is usually assumed that the drug effect is mediated by one or more specific binding interactions between the small molecule and molecular targets. But the number of targets can be large in the case of promiscuous compounds, or the interaction can be more aspecific or even controversial, like the interaction of ethanol with lipid membranes.

After some good hits with desirable effects are selected, the next step is optimization. In this step several analogues of the given hit are synthesized and screened to select better candidates. The selection criterion in this phase is not only the activity but also the other properties just mentioned. A model called QSAR (Quantitative Structure-Activity Relationship) can be fitted to the results of the analogue screen to design possibly better compounds in an iterative process. In this process the molecular mass and lipophilicity of the candidates typically increase. The growing size can be problematic for the ADME properties (see e.g. Lipinski's rule), therefore a balance must be found. A quantity called ligand efficiency is widely used to take this conflicting relationship between size and activity into account:

$$\mathrm{LE} = \frac{-\Delta G}{N_{\mathrm{heavy}}}$$

where $N_{\mathrm{heavy}}$ is the number of non-hydrogen atoms. At constant temperature, $\Delta G$ and $pK_d$ are interchangeable (since $\Delta G = RT \ln K_d$). We use $\Delta G$ (equivalently $pK_d$) to define these metrics, but several other affinity- or activity-like quantities can be used in practice, like $K_i$ or $IC_{50}$ ($IC_{50}$ is the half-inhibitory concentration of an enzyme inhibitor: at that concentration of the inhibitor the activity of the enzyme is half of its native activity [276]). A modified version of the measure, called SILE (size-independent ligand efficiency), has been proposed as a correction for the nonlinearity of the relationship between molecular size and mean activity:

$$\mathrm{SILE} = \frac{pK_d}{N_{\mathrm{heavy}}^{0.3}}$$

The functional form of the measure can be explained by energy contributions proportional to the molecular volume and the solvent accessible surface area [280].
Another efficiency measure, called LLE (lipophilic ligand efficiency), addresses the conflicting relationship between low lipophilicity and high affinity:

$$\mathrm{LLE} = pK_d - \log P$$

A general measure for both is the LELP (ligand-efficiency-dependent lipophilicity):

$$\mathrm{LELP} = \frac{\log P}{\mathrm{LE}}$$

This measure should be minimized, in contrast with the others. In the words of its creators, it expresses the price that should be paid in lipophilicity for a unit of ligand efficiency [281].
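A small numerical sketch of these efficiency metrics, using the definitions reconstructed above and made-up values for a hypothetical compound (all constants and inputs are illustrative only):

```python
# Illustrative calculation of ligand-efficiency metrics for a hypothetical
# compound; K_d, logP and the atom count are invented for the example.
import math

R = 8.314          # J/(mol*K), ideal gas constant
T = 298.15         # K
kd = 1e-7          # dissociation constant in mol/l (100 nM)
logp = 3.0         # octanol-water partition coefficient (log10)
n_heavy = 25       # number of non-hydrogen atoms

delta_g = R * T * math.log(kd)          # Gibbs free energy of binding, J/mol
pkd = -math.log10(kd)

le = -delta_g / n_heavy                 # ligand efficiency, J/mol per heavy atom
sile = pkd / n_heavy**0.3               # size-independent ligand efficiency
lle = pkd - logp                        # lipophilic ligand efficiency
# For LELP, LE is expressed here in pKd-per-atom units; the absolute scale of
# LELP depends on the units chosen for LE.
lelp = logp / (pkd / n_heavy)

print(f"dG = {delta_g/1000:.1f} kJ/mol, LE = {le/1000:.2f} kJ/mol/atom")
print(f"SILE = {sile:.2f}, LLE = {lle:.2f}, LELP = {lelp:.2f}")
```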

From a deeper theoretical point of view, the increase of molecular size and lipophilicity can be attributed to an entropy-driven optimization strategy. To get an overview of entropy-driven versus enthalpy-driven optimization, consider the definition of the Gibbs free energy:

$$\Delta G = \Delta H - T\Delta S$$

where $\Delta H$ is the net enthalpy change and $\Delta S$ is the net entropy change during the binding process. The Gibbs free energy can be optimized either by minimizing $\Delta H$ - the enthalpy-driven strategy - or by maximizing $\Delta S$ - the entropy-driven strategy. In practice it is very difficult to optimize one of the two terms without significant compensation occurring in the other. For example, if we introduce a strong interaction between the target and the ligand, this will limit the conformational flexibility of the complex and cause an entropic penalty [282]. The main components of the enthalpy term are the polar interactions between the target and the ligand, for example hydrogen bonds (favorable) and the interactions between water and the polar groups of the ligand/receptor (unfavorable). The components of the entropy term are the solvation entropy and the conformational entropy. The solvation entropy change is a favorable component; it represents the repulsive interaction between the lipophilic groups of the ligand and water, but it is a clearly non-selective component of the binding. The conformational entropy change is an unfavorable one, caused by the restriction of the conformational space during the binding process. From the above it is obvious that a large and lipophilic compound can have a high affinity, but as we know, affinity is not the only parameter we would like to optimize.

Method

If a molecular target is given, the search for active modulators can be carried out using the information about this structure, and possibly the known interactions with known modulators, endogenous or exogenous. The methods which assume that a target structure is available are called structure-based methods. Another class of methods, called ligand-based methods, uses only the structural information of known active compounds, and tries to fit models to identify common structural features or structure-activity relationships.

The simplest model of the target-ligand interaction is the lock-and-key model. We assume that the target has some specific region with a relatively rigid surface geometry, called the binding site, and that some conformation of the ligand is complementary to it. Besides the geometry, there are other properties that must match, like charge, hydrogen bonding, and hydrophobicity (see Fig. 77). A more complex model of the interaction is the induced-fit model. In this case not only the ligand is considered flexible, but the target too: as the ligand comes close to the binding site, mutual forces occur and induce conformational changes in both parties.

An example of the structure-based methods is molecular docking, which is a geometry-based method used for predicting the structure of the bound complex of molecules and the strength of the interaction. Docking is a state-space search algorithm developed to solve an optimization problem: find an optimal pose (relative orientation of the ligand and the target) and evaluate the fitness of the result with a scoring function. Docking can be performed using rigid bodies, or intermediate cases like a rigid receptor and a flexible ligand. A more computationally intensive version of docking can calculate induced-fit effects. The optimality criterion in docking is either an empirical scoring function, or the approximated potential energy of the complex, which is determined by a heuristic function and parameter set called a force field. The general form of the energy is a sum of terms, like

$$E = E_{\mathrm{bond}} + E_{\mathrm{angle}} + E_{\mathrm{torsion}} + E_{\mathrm{van\,der\,Waals}} + E_{\mathrm{electrostatic}}$$

Depending on the force field used, the functional forms of the contributions and the parameters differ. The parameters are tuned empirically using experimental results and high-level quantum chemistry calculations. For example, the bond length potential can be a simple harmonic potential, or a Morse potential:

$$V(r) = D_e \left(1 - e^{-a(r - r_e)}\right)^2$$

where $D_e$ is the dissociation energy, $r_e$ is the equilibrium length and $a$ is a width parameter. The van der Waals potential can be approximated by the Lennard-Jones potential:

$$V(r) = 4\varepsilon \left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]$$

where $\varepsilon$ is the depth of the potential well and $\sigma$ is the distance at which the potential is zero. There are several other functional forms used beyond the above-mentioned examples. In docking studies the modeled process takes place in the presence of water, so an additional term modeling solvation implicitly is frequently introduced.

The ligand-based QSAR and QSPR (Quantitative Structure-Property Relationship) are widely accepted and popular methods in the field of drug design. These terms are used collectively for any statistical model that describes the relationship between some property (like the activity on a target in the case of QSAR, or some physicochemical property in the case of QSPR) and the chemical structure. These models are usually valid only in a restricted part of the chemical space: in a set of analogues. Several statistical methods can be used for QSAR model building, e.g. regression models (usually with dimension reduction, like PLS), neural networks, SVMs, and so on.

In the case of an unknown molecular target, several similarity-based searching methods can be used. These searching methods have many common features with QSAR modeling. The first step in both cases is to transform the representation of the compound into a semantically interpretable format. One possible solution is fingerprinting. In this case the structure is converted into sequential data: usually a binary string or a sequence of numbers. Each number corresponds to an elementary property, like the occurrence of a structural element; a toy example is sketched below.
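As a purely illustrative sketch of such structural-key fingerprints and their use in similarity search, the snippet below hashes made-up substructure keys into a fixed-length bit vector and compares two molecules with the Tanimoto coefficient; real fingerprints are generated by chemoinformatics toolkits from the molecular graph.

```python
# Toy hashed structural-key fingerprint and Tanimoto similarity.
# The substructure keys below are invented placeholders, not real patterns.
import hashlib

N_BITS = 64  # real fingerprints are typically 1024-4096 bits long

def fingerprint(keys, n_bits=N_BITS):
    """Map each structural key to a bit via a hash with low collision probability."""
    bits = 0
    for key in keys:
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % n_bits
        bits |= 1 << h
    return bits

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two bit-vector fingerprints."""
    inter = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return inter / union if union else 0.0

mol1 = fingerprint(["benzene-ring", "hydroxyl", "amide"])
mol2 = fingerprint(["benzene-ring", "hydroxyl", "ester"])
print(f"Tanimoto similarity: {tanimoto(mol1, mol2):.2f}")
```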
These structural keys can be evaluated on the graph representation or on the 3D structure. Special cases of the 3D fingerprints are the pharmacophore-based fingerprints. Pharmacophore means carrier (phoros) of drug (pharmacon) properties, and it is a set of structural features and their relative orientation which is recognized by a target. Normally there exist far more distinguishable features than bits available for a single molecule representation, therefore a hash function with low collision probability is used to compress the fingerprint.

The above-mentioned ligand-based methods are in clear accordance with the similarity property: if two molecules are highly similar, their properties are probably similar as well.

The main disadvantage of the classical methods is that they search for new molecular entities in a narrow neighboring subset of the chemical space. A molecule with similar pharmacological properties but a different scaffold can be useful in some cases, for example when the original compound has very weak ADME properties, or because of intellectual property issues. This need is in apparent conflict with the similarity property. The resolution of this conflict is scaffold hopping or core hopping. Instead of modifying side chains, the scaffold of the molecule is transformed systematically or completely replaced so that the relevant elements of the structure remain unchanged. There is a more or less continuous spectrum from single-atom replacement methods to pharmacophore-based design of new scaffolds. A good example of the intermediate methods is ring manipulation. In a pharmacodynamic sense, a rigid molecule with a high level of connectedness is usually preferred, because a rigid structure has fewer conformers, and the binding to the target is energetically more favorable: the entropy loss of the system is reduced. If we have a flexible molecule and the active conformation is known, we can lock the molecule in this conformation by introducing a ring-closing bond. Another favorable property of a rigid system is the smaller probability of promiscuous binding. As always, the world is not black and white: a rigid system with several rings is usually less soluble, and its ADME properties are worse. Sometimes we must open rings to create a system with more favorable ADME properties, or to intentionally reduce the potency of our compound on a given target. Diethylstilbestrol, a synthetic estrogen widely used from the early 1940s to the 1970s, is very similar to a ring-opened analogue of estradiol (see Fig. 78).

Fragment-based design

A promising complementary approach to the classical high-throughput methods is the fragment-based approach. In this approach a significantly smaller number of compounds are screened against the molecular target. This library contains small molecules, and the objective of the screen is to detect small interactions which can be utilized to build a high-affinity lead from fragments. This high sensitivity requirement encourages chemists to use highly informative experimental methods like NMR spectroscopy instead of in-silico methods, and can cause experiment dependency, but recently more and more attempts have been made to identify fragments with computational techniques. The affinity determination method - either in-silico or experimental - can provide structural information about the weak interactions, allowing a ligand to be built from non-overlapping fragments that bind to proximal binding sites of the target. A suitable in-silico method for this is docking [283]. If non-overlapping fragments are identified against the target, suitable linkers can be designed between them. In the case of overlapping fragments a merging strategy can be used. This divide-and-conquer approach can search a large chemical space with an exponential saving of resources. Screening with a representative set of all possible drug-like molecules is impossible because of the size of this space, but with small fragments it can be a realistic goal. A molecular target can also be characterized by a fragment screen, so that the druggability of the target can be estimated.
The fragment-based approach can also aid the lead optimization phase, because the fragments can be selected by some criterion based on ligand efficiency, so the molecular mass and lipophilicity can be kept under better control.

Drug repositioning

Drug repositioning is a term for the reuse of approved substances in a new therapeutic indication. This concept is popular because of its cost-effectiveness: the safety and toxicology studies have already been carried out, and the results, or at least some parts of them, can be reused. In the repositioning context, richer information sources are available, like side effects, known indications, already known molecular targets and so on.

There have been several serendipitous repositionings in the history of drug design. A well-known example is the case of sildenafil, which was originally developed as a cardiac medication (against angina pectoris and hypertension) and was later marketed under the trade name Viagra as an erectile dysfunction drug. The common feature of the two indications is that both are targeted by the vasodilator property of the drug, mediated by its inhibitory effect on a phosphodiesterase enzyme subtype (PDE5). Drug repositioning is also an efficient tool to develop drugs for orphan diseases. Orphan disease and orphan drug are legal categories in several countries, but they can be defined intuitively as a disease (and a drug which can be used as a treatment of the given disease) which is so rare that the classical approach of drug development is difficult and highly unprofitable. For example, the previously mentioned teratogenic drug thalidomide has been repositioned against some types of leprosy and cancer, and as an immune suppressant. There is no sharp border between orphan drugs and "true" personalized medicine, because many orphan diseases are caused by rare genetic mutations, and in extreme cases the management of the disease must be highly patient-specific.

In the context of drug repositioning, data fusion techniques (discussed in the chapter "Analysis of heterogeneous biomedical data through information fusion") can be particularly useful [284]. We have several types of information sources, like chemical structure, side effects, genetic factors, molecular targets, biochemical pathways, etc. The similarity-based approach can be extended to these types of data sources. Rich data, such as phenotypic information, can be obtained from already conducted trials and from post-marketing information. The phenotype in the traditional sense of the concept is some static property, an observable characteristic of an organism. In the case of the effect of a pharmaceutical agent, we investigate some property of the "chemically excited" biological system, such as biochemical changes, effects, side effects and adverse events. A side effect-based similarity metric was proposed, for example, by Campillos et al. in 2008 [285]. The hypothesis was that if two drugs share several side effects, they probably have a common molecular target, or at least have targets which are part of the same pathway. Because of the richness of the available information, the field of drug repositioning can be an ideal frontier between big data research, pharmaceutical chemistry and biology.

36. References

[274] Lajos Novák and József Nyitrai, Szerves kémia (Organic Chemistry).
[275] International Union of Pure and Applied Chemistry, Commission on the Nomenclature of Organic Chemistry, R. Panico, W. H. Powell, and J. C. Richer, A Guide to IUPAC Nomenclature of Organic Compounds: Recommendations. IUPAC Chemical Data Series. Blackwell Scientific Publications.
[276] Kenneth A. Krohn and Jeanne M. Link, Interpreting enzyme and receptor kinetics: keeping it simple, but not too simple. Nuclear Medicine and Biology, 30(8). Workshop on Receptor-Binding Radiotracers.
[277] Christopher A. Lipinski, Franco Lombardo, Beryl W. Dominy, and Paul J. Feeney, Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 23(1-3):3-25.
[278] Miles Congreve, Robin Carr, Chris Murray, and Harren Jhoti, A 'Rule of Three' for fragment-based lead discovery? Drug Discovery Today, 8(19).
[279] Guruprasad P. Aithal, Christopher P. Day, Patrick J. L. Kesteven, and Ann K. Daly, Association of polymorphisms in the cytochrome P450 CYP2C9 with warfarin dose requirement and risk of bleeding complications. The Lancet, 353(9154).
[280] J. Willem M. Nissink, Simple size-independent measure of ligand efficiency. Journal of Chemical Information and Modeling, 49(6).
[281] György G. Ferenczy and György M. Keserű, Thermodynamics guided lead discovery and optimization. Drug Discovery Today, 15(21-22).
[282] Adam J. Ruben, Yoshiaki Kiso, and Ernesto Freire, Overcoming roadblocks in lead optimization: A thermodynamic perspective. Chemical Biology and Drug Design, 67(1):2-4.

[283] Huameng Li and Chenglong Li, Multiple ligand simultaneous docking: Orchestrated dancing of ligands in binding sites of protein. Journal of Computational Chemistry, 31(10).
[284] A. Arany, B. Bolgar, B. Balogh, P. Antal, and P. Matyus, Multi-aspect candidates for repositioning: Data fusion methods using heterogeneous information sources. Current Medicinal Chemistry, 20(1):95-107.
[285] Monica Campillos, Michael Kuhn, Anne-Claude Gavin, Lars Juhl Jensen, and Peer Bork, Drug target identification using side-effect similarity. Science, 321(5886).

Metagenomics

Introduction

Microbes are everywhere. The bacterial and archaeal cells (together: prokaryotes) on Earth are estimated to constitute the largest reservoir of the basic nutrients (carbon, nitrogen, phosphorus) and, according to some estimates, dominate the biomass of Earth [286]. There are many extreme environments on Earth where only prokaryotes can survive, be they extremely hot, cold, acidic, or salty places. Microbes remediate naturally produced toxins in the environment as well as toxins that are by-products of human activities, such as oil and chemical spills. Although they usually cannot be seen, microbes are essential for all life forms on Earth, including every part of human life [287]. Microbes convert dead material into forms accessible to all other living things. Almost all multicellular eukaryotic organisms have closely related symbiotic microbial communities that make necessary nutrients and vitamins available to their hosts. The microbes living in our intestine and mouth enable us to extract energy from food that otherwise would be indigestible. The complex community of microbes living inside and outside us participates in the protection against disease-causing agents. In fact, the human body can be thought of as a superorganism, containing human cells and approximately ten times more bacterial cells [286 and 287].

Since the first bacterial genome project in 1995 [288], more than a thousand bacterial genomes have been sequenced. These studies and the massive amount of data and knowledge produced by them have greatly stimulated the fields of comparative genomics and systems biology. Despite the huge amount of data and knowledge gathered so far, studying single organisms has inherent limitations. First, in order to sequence the entire genome of a microbe, current technological limitations require that the organism be clonally cultured first, and this is rarely possible: only a very small percentage of the microbes in nature can be cultured. Second, microbes usually live in complex communities, where species interact both with each other and with their environment. Therefore, studying a clonal culture does not give a true picture of organism interactions, of the functional capabilities, or of the genomic variance of the population.

Next generation sequencing technologies have greatly facilitated microbial studies by overcoming the limitations mentioned above. Environmental sampling makes it possible to obtain genomic information directly from microbial communities in their natural habitats. Instead of looking at a few species individually, we are able to study their community as a whole. A new research field has emerged: metagenomics, the study of sequence data taken directly from the environment (which is called the metagenome). However, environmental sequencing has its own limitations as well. In a genome project of a single organism one can get a nearly complete picture of the microbe's genome.
The assembly of the genome is feasible, the sequences can be annotated, and the locations of genes and operons can be inferred. In contrast, environmental sampling is not so simple. Each sequence fragment may originate from a different species, and there could be many species in the sample. Therefore, full genome assembly is possible only in special environments, for example when a given species dominates the sample. Even in that case, only the genome of the dominant species can be assembled. In most natural environments, where there are many species, genome assembly is not feasible. Short sequence reads can be assembled into contigs usually not larger than 5,000 bp. Consequently, the annotation of the sequences can only be partly done; we only gain a schematic view of the community. In this chapter we discuss the main approaches to metagenome analysis and then follow the workflow of a typical metagenome project.

Metagenome analysis

In this section we briefly discuss the main approaches to metagenome analysis.

Community profiling

One may be interested only in the question of community composition, namely which species constitute the microbial community ("Who is there?"). In this case, marker genes can be sequenced using universal primers instead of whole genome shotgun sequencing. This is a relatively rapid and cost-effective method for assessing bacterial diversity. Besides, this method is often used as a preliminary step in larger metagenomic studies to get an initial view of the community in question [289], and it is also used for monitoring changes in community composition over time and space [290]. The most frequently used marker gene is 16S rRNA or 18S rRNA for prokaryotic and eukaryotic samples, respectively. Ribosomal RNAs (rRNAs) are essential components of the ribosomes on which proteins are made. The rRNA gene is highly conserved through evolution, yet variable enough to be used as a marker of evolutionary distance. Its widespread use is justified by the availability of enormous databases of rDNA sequences [291, 292]. One drawback of the 16S rRNA gene is that its copy number differs among bacterial species, which, of course, can strongly influence estimates of community composition. To overcome this limitation, single-copy genes (such as rpoB) have been applied for the same purpose, because they are thought to provide more accurate estimates of community composition than markers such as 16S rRNA genes with a variable copy number [293]. However, existing bacterial databases contain much less sequence information on these genes. Another limitation concerning marker genes is that one has to choose a primer sequence for capturing the sequence of the marker gene. Conserved as these genes are, there is always a possibility that the primers used will not match the rDNA in the sample, which would result in many species not being identified. Profiling viral communities is more problematic, because no universally conserved marker genes exist for viruses. In this case, environmental shotgun sequencing is the only option.

Functional metagenomics

Besides the question of community composition, one may be interested in the functional capabilities of a given metagenome ("What are they capable of doing?"). We are not necessarily interested in which gene came from which organism; the product of the same gene in two different species provides the same (or at least a very similar) function, irrespective of the gene's origin. So we concentrate on the genes of the community as a whole instead of those of distinct species. In this case, a large amount of DNA is sampled from the environment and then sequenced using traditional Sanger sequencing or next generation sequencing platforms. The sequences are then assembled as far as possible, putative open reading frames (ORFs, the parts of genes that encode proteins) are inferred, and biological functions are assigned to the putative ORFs. This is called functional annotation. The assigned functions and genes can be identified in biological networks, for example in metabolic pathways. The over- or underrepresented biological functions or pathways can be regarded as a clue to the functional capabilities of the community from which the sample originates (a minimal sketch of such an over-representation test appears below). Of course, this method has several limitations. In most cases, the community is too complex to enable complete or nearly complete genome assembly, and only partial ORFs can be identified.
The ORFs can be searched against existing databases for homologs in an attempt to identify the function of the predicted translated protein, but this is necessarily limited by the data available in those databases. The ORFs can also be searched for motifs or other sequence signatures that indicate possible functionality ("What is the predicted protein capable of?"), but several errors may creep in due to the incompleteness of the ORFs, the limitations of motif-finding algorithms, and our limited knowledge [286].
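To make the ORF-inference step above concrete, the following is a minimal sketch of a naive ORF scanner, assuming the standard genetic code, the forward strand only, and an ATG-to-stop definition of an ORF; real metagenomic gene finders additionally use statistical coding models, handle both strands and partial ORFs, and tolerate sequencing errors. The function name and the minimum-length cutoff are illustrative choices, not part of any standard tool.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_length=90):
    """Return (start, end, sequence) of forward-strand ATG...stop ORFs of at least min_length nucleotides."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":           # candidate start codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if j + 3 - i >= min_length:
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j + 3               # resume scanning after the stop codon
                        break
                else:
                    break                       # no in-frame stop left: only a partial ORF remains
            else:
                i += 3
    return orfs

# Toy example on a hypothetical fragment; a real pipeline would run on assembled contigs.
print(find_orfs("CCATGAAATTTGGGTAACG", min_length=9))

Such a scanner only proposes candidate coding regions; their functional annotation still relies on the homology and motif searches discussed above.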

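Returning to the idea that over- or underrepresented functions hint at the community's capabilities, the following is a minimal sketch of how the over-representation of a single functional category might be tested with Fisher's exact test, assuming the ORFs have already been annotated (for example with pathway or COG categories) and that gene counts for a reference background are available. All numbers and the function name are hypothetical; in practice many categories are tested at once and the p-values must be corrected for multiple testing.

from scipy.stats import fisher_exact

def category_overrepresentation(hits_in_sample, sample_size, hits_in_background, background_size):
    """One-sided Fisher's exact test on a 2x2 table: in/outside the category vs. sample/background."""
    table = [
        [hits_in_sample, sample_size - hits_in_sample],
        [hits_in_background, background_size - hits_in_background],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical counts: 120 of 2,000 annotated ORFs hit a pathway that covers
# 600 of the 40,000 genes in the reference background.
odds, p = category_overrepresentation(120, 2000, 600, 40000)
print(f"odds ratio = {odds:.2f}, one-sided p = {p:.3g}")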
Besides indicating the functional capabilities of a community, random shotgun sequencing may reveal more information on community diversity than marker gene based methods, because it is not limited by the use of a primer sequence. Therefore, with random shotgun methods we are able to find bacteriophages and other viruses, prokaryotes and eukaryotes, and even novel species that a not-so-"universal" primer would miss.

Metagenomics step by step

In this chapter, we demonstrate the workflow of a typical random shotgun-based metagenome project.

Sampling

Sample size considerations in the light of species diversity

A metagenome project starts with sampling the environment. The main problem concerning sampling is: how do we know that we have collected enough sample material if we cannot see the organisms we are trying to collect? Besides, how many sequences are enough? This depends on the community structure (i.e. the biodiversity of the community) and the objectives of the study. The structure of the community depends on the number of different species (also denoted as richness) and their relative abundances (also denoted as evenness). In most natural environments the relative abundances of the species are not even. The simplest way to characterize this unevenness is by plotting the rank-abundance curve, in which each taxonomic unit is represented by a vertical bar proportional to its abundance (see Figure 79). A rank-abundance curve would be flat in the case of an even community.

What does this mean for sequencing? If a sequencing platform were able to sequence the whole genome of a single cell with high accuracy, then one read per cell would be enough to get a good picture of an individual organism of a given species. However, current technical limitations allow the sequencing of only relatively short reads. The short fragments have to be assembled with the help of the overlapping parts of the reads. Therefore, a single base pair should be covered by many reads. Coverage is the mean number of times a nucleotide is sequenced. Suppose that the approximate genome size of a dominant species in an environment is 3 Mbp (e.g. the approximate genome size of S. pneumoniae is 2.2 Mbp), the relative frequency of this species in the environment is 10%, and 700 Mbp of sequence is obtained (e.g. the typical throughput of a Roche GS FLX Titanium XL+ system per run). In this case, the dominant species is represented by approximately 70 Mbp, resulting in approximately 23.3X coverage.
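The back-of-the-envelope coverage calculation above is easy to reproduce; the short sketch below uses the same numbers (a 3 Mbp dominant genome at 10% relative abundance and 700 Mbp of sequence per run) and then, as a purely hypothetical illustration of the sample-size problem, shows how coverage collapses for a rarer member of the same community. The function is ours, not part of any sequencing software.

def expected_coverage(genome_size_bp, relative_abundance, total_sequenced_bp):
    """Mean coverage = sequence attributable to the species / its genome size."""
    return (total_sequenced_bp * relative_abundance) / genome_size_bp

# Worked example from the text: 3 Mbp genome, 10% abundance, 700 Mbp throughput.
print(expected_coverage(3e6, 0.10, 700e6))    # about 23.3x

# Hypothetical rarer species at 0.1% abundance: only about 0.23x, far too little for assembly.
print(expected_coverage(3e6, 0.001, 700e6))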