Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy


Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux
Laboratory for Communications and Applications, EPFL, Lausanne, Switzerland

ABSTRACT
The rapid progress in human-genome sequencing is leading to a high availability of genomic data. This data is notoriously sensitive and stable over time. It is also highly correlated among relatives. A growing number of genomes are becoming accessible online (e.g., because of leakage, or after their posting on genome-sharing websites). What, then, are the implications for kin genomic privacy? We formalize the problem and detail an efficient reconstruction attack based on graphical models and belief propagation. With this approach, an attacker can infer the genomes of the relatives of an individual whose genome is observed, relying notably on Mendel's Laws and statistical relationships between the nucleotides (on the DNA sequence). Then, to quantify the level of genomic privacy as a result of the proposed inference attack, we discuss possible definitions of genomic privacy metrics. Genomic data reveals Mendelian diseases and the likelihood of developing degenerative diseases such as Alzheimer's. We also introduce the quantification of health privacy, specifically the measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the extent of the threat by combining data gathered from a genome-sharing website and from an online social network.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General - Security and protection; J.3 [Life and Medical Sciences]: Biology and genetics; K.4.1 [Computers and Society]: Public Policy Issues - Privacy

Keywords
Genomic Privacy; Inference Algorithms; Metrics; Kinship

The family of Henrietta Lacks (August 1, 1920 - October 4, 1951), whose DNA was sequenced and published online without the consent of her family.
Amalio Telenti
Institute of Microbiology, University Hospital of Lausanne, Lausanne, Switzerland
amalio.telenti@chuv.ch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CCS'13, November 4-8, 2013, Berlin, Germany. Copyright 2013 ACM.

1. INTRODUCTION
With the help of rapidly developing technology, DNA sequencing is becoming less expensive. As a consequence, research in genomics has gained speed in paving the way to personalized (genomic) medicine, and geneticists need large collections of human genomes to further increase this speed. Furthermore, individuals are using their genomes to learn about their (genetic) predispositions to diseases, their ancestries, and even their (genetic) compatibilities with potential partners. This trend has also led to the launch of health-related websites and online social networks (OSNs), in which individuals share their genomic data (e.g., OpenSNP [1] or 23andMe [2]). Thus, already today, thousands of genomes are available online. Even though most of the genomes on the Internet are anonymized, it is possible to find genomes with the identifiers of their owners (e.g., OpenSNP [1]). Furthermore, it has been shown that anonymization is not sufficient for protecting the real identities of the genome donors [29, 47]. Once the owner of a genome is identified, he is faced with the risk of discrimination (e.g., by employers or insurance companies) [9].
Some believe that they have nothing to hide about their genetic structure, hence they might decide to give full consent for the publication of their genomes on the Internet to help genomic research. However, our DNA sequences are highly correlated with our relatives' sequences. The DNA sequences of two random human beings are 99.9% similar, and this value is even higher for closely related people. Consequently, somebody revealing his genome not only damages his own genomic privacy, but also puts his relatives' privacy at risk [46]. Moreover, currently, a person does not need consent from his relatives to share his genome online. This is precisely where the interesting part of the story begins: kin genomic privacy.

A recent New York Times article [3] reports the controversy about sequencing and publishing, without the permission of her family, the genome of Henrietta Lacks (who died in 1951). On the one hand, the family members think that her genome is private family information and that it should not be published without the consent of the family. On the other hand, some scientists argued that the genomes of current family members have changed so much over time (due to gene mixing during reproduction) that nothing accurate could be told about the genomes of current family members by using Henrietta Lacks's genome. As we will also show in this work, they are wrong. Minutes after Henrietta Lacks's genome was uploaded to a public website called SNPedia, researchers produced a report full of personal information about Henrietta Lacks. Later, the genome was taken offline, but it had already been downloaded by several people, hence both her genomic privacy and (partially) that of the Lacks family were already lost.

Unfortunately, the Lacks family, even though possibly the most publicized family facing this problem, is not the only one facing this threat. As we mentioned before, the genomes of thousands of individuals are available online. Once the identity of a genome donor is known, an attacker can learn about his relatives (or his family tree) by using an auxiliary side channel, such as an OSN, and infer significant information about the DNA sequences of the donor's relatives. We will show the feasibility of such an attack and evaluate the privacy risks by using publicly available data on the Web.

Although the researchers took Henrietta Lacks's genome offline from SNPedia, other databases continue to publish portions of her genomic data. Publishing only portions of a genome does not, however, completely hide the unpublished portions; even if a person reveals only a part of his genome, other parts can be inferred using the statistical relationships between the nucleotides in his DNA. For example, James Watson, co-discoverer of the structure of DNA, made his whole DNA sequence publicly available, with the exception of one gene known as Apolipoprotein E (ApoE), one of the strongest predictors for the development of Alzheimer's disease. However, it was later shown that the correlation (called linkage disequilibrium by geneticists) between one or multiple polymorphisms and ApoE can be used to predict the ApoE status [4]. Thus, an attacker can also use these statistical relationships (which are publicly available) to infer the DNA sequences of a donor's family members, even if the donor shares only part of his genome.
It is important to note that these privacy threats not only jeopardize kin genomic privacy but, if not properly addressed, could also hamper genomic research due to untimely fear of potential misuse of genomic information.

In this work, we evaluate the genomic privacy of an individual threatened by his relatives revealing their genomes. Focusing on the most common genetic variant in the human population, the single nucleotide polymorphism (SNP), and considering the statistical relationships between the SNPs on the DNA sequence, we quantify the loss in genomic privacy of individuals when one or more of their family members' genomes are (either partially or fully) revealed. To achieve this goal, we first design a reconstruction attack based on a well-known statistical inference technique. The computational complexity of the traditional ways of realizing such inference grows exponentially with the number of SNPs (which is on the order of tens of millions) and relatives. Therefore, in order to infer the values of the unknown SNPs in linear complexity, we represent the SNPs, the family relationships, and the statistical relationships between SNPs on a factor graph and use the belief propagation algorithm [37, 41] for inference. Then, using various metrics, we quantify the genomic privacy of individuals and show the decrease in their level of genomic privacy caused by the published genomes of their family members. We also quantify the health privacy of the individuals by considering their (genetic) predisposition to certain serious diseases. We evaluate the proposed inference attack and show its efficiency and accuracy by using real genomic data of a pedigree. More importantly, by using genomic data and pedigree information we collected from a public genome-sharing website and an OSN, we show that the proposed inference attack threatens not only the Lacks family, but also many other families.

The rest of the paper is organized as follows.
In Section 2, we give a brief background on genomics and belief propagation. In Section 3, we present the proposed framework in detail. In Section 4, we evaluate the performance of the proposed inference attack using different metrics. In Section 5, we show how the proposed inference attack threatens the genomic and health privacy of several families gathered from OSNs. In Section 6, we summarize the related work on genetic inference and genomic-privacy protection. Finally, we conclude the paper in Section 7.

2. BACKGROUND
In this section, we briefly introduce the relevant genetic principles, as well as the concept of belief propagation.

2.1 Genomics
DNA is a double-helix structure that consists of two complementary polymer chains. Genetic information is encoded on the DNA as a sequence of nucleotides (A, T, G, C), and human DNA includes around 3 billion nucleotide pairs. With the decreasing cost of DNA sequencing, genomic data is currently being used mainly in the following two areas: (i) clinical diagnostics, for personalized genomic medicine and genetic research (e.g., genome-wide association studies^1), and (ii) direct-to-consumer genomics, for genetic risk estimation of various diseases or for recreational activities such as ancestry search. In the following, we briefly introduce some concepts about the human genome and reproduction that we use throughout this paper.

2.1.1 Single Nucleotide Polymorphism
As already mentioned, human beings have 99.9% of their DNA in common. Thus, there is no need to focus on the whole DNA but rather on the most important variants. The single nucleotide polymorphism (SNP) is the most common DNA variation in the human population. A SNP occurs when a nucleotide (at a specific position on the DNA) varies between individuals of a given population (as illustrated in Fig. 1). There are approximately 50 million SNP positions in the human population [4].
Recent discoveries show that the susceptibility of an individual to several diseases can be computed from his SNPs [5, 33]. For example, it has been reported that two particular SNPs (rs7412 and rs429358) on the Apolipoprotein E (ApoE) gene indicate an (increased) risk for Alzheimer's disease. SNPs carry privacy-sensitive information about individuals' health, hence we will quantify health privacy focusing on individuals' published (or inferred) SNPs and the diseases they reveal.

In general, two different nucleotides (called alleles) are observed at a given SNP position: (i) the major allele is the most frequently observed nucleotide, and (ii) the minor allele is the rare nucleotide.^2 From here on, we represent the major allele as B for a SNP position, and the minor allele as b (where both B and b are in {A, T, G, C}). Furthermore, each SNP position contains two nucleotides (one inherited from the mother and one from the father, as we will discuss next). Thus, the content of a SNP position can be in one of the following states: (i) BB (homozygous-major genotype), if an individual receives the same major allele from both parents; (ii) Bb (heterozygous genotype), if he receives a different allele from each parent (one minor and one major); or (iii) bb (homozygous-minor genotype), if he inherits the same minor allele from both parents. We represent the content of a SNP position as $x_j^i$ for SNP $j$ at individual $i$, where $x_j^i \in \{BB, Bb, bb\}$. For simplicity of presentation, in the rest of the paper, we denote BB as 0, Bb as 1, and bb as 2 (i.e., $x_j^i \in \{0, 1, 2\}$). Finally, each SNP $i$ is assigned a minor allele frequency (MAF), $p_i^b$, which represents the frequency at which the minor allele (b) of the corresponding SNP occurs in a given population (typically, $0 < p_i^b < 0.5$).

^1 Examination of many genetic variants in different individuals to determine if any variant is associated with a trait.
^2 The two alleles for the SNP position in Fig. 1 are C and T.

... A T T G C C G A C ...
... A T T G T C G A C ...
Figure 1: Single nucleotide polymorphism (SNP) with alleles C and T illustrated on a single string of two different individuals' DNAs.

Table 1: Mendelian inheritance probabilities for a SNP $j$, given different genotypes for the parents. The probabilities of the child's genotype are represented in parentheses. Each table entry represents $(\Pr(x_j^C = BB \mid x_j^M, x_j^F),\ \Pr(x_j^C = Bb \mid x_j^M, x_j^F),\ \Pr(x_j^C = bb \mid x_j^M, x_j^F))$.

                 Mother (M)
Father (F)   BB               Bb                  bb
BB           (1, 0, 0)        (0.5, 0.5, 0)       (0, 1, 0)
Bb           (0.5, 0.5, 0)    (0.25, 0.5, 0.25)   (0, 0.5, 0.5)
bb           (0, 1, 0)        (0, 0.5, 0.5)       (0, 0, 1)

2.1.2 Reproduction
Mendel's First Law states that alleles are passed independently from parents to children for different meioses (the process of cell division necessary for reproduction). For each SNP position, a child inherits one allele from his mother and one from his father. Each allele of a parent is inherited by a child with equal probability of 0.5. Let $F_R(x_j^M, x_j^F, x_j^C)$ be the function modeling the Mendelian inheritance for a SNP $j$, where (M, F, C) represent mother, father, and child, respectively. We illustrate the Mendelian inheritance probabilities for a SNP $j$ in Table 1.
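The entries of Table 1 follow directly from the rule that each parental allele is transmitted with probability 0.5. The following is a minimal sketch of that computation, assuming the 0/1/2 genotype encoding introduced above; the function name `mendel` is hypothetical and not part of the paper.

```python
# Genotypes are encoded as the number of minor alleles: BB = 0, Bb = 1, bb = 2.
GENOTYPES = ("BB", "Bb", "bb")

def mendel(mother: int, father: int) -> tuple:
    """Return (Pr(child=BB), Pr(child=Bb), Pr(child=bb)) given the parents' genotypes.

    A parent with g minor alleles transmits the minor allele b with probability g / 2.
    """
    pm = mother / 2.0  # probability that the mother transmits b
    pf = father / 2.0  # probability that the father transmits b
    return (
        (1 - pm) * (1 - pf),            # both parents transmit B
        pm * (1 - pf) + (1 - pm) * pf,  # exactly one parent transmits b
        pm * pf,                        # both parents transmit b
    )

# Print the full table, one row per father genotype.
for f, f_name in enumerate(GENOTYPES):
    row = [mendel(m, f) for m in range(3)]
    print(f_name, row)
```

Because the rule is symmetric in the two parents, swapping the roles of mother and father leaves the table unchanged.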
Based on $F_R(x_j^M, x_j^F, x_j^C)$, we can say that, given both parents' genomes, a child's genome is conditionally independent of all other ancestors' genomes.

2.1.3 Linkage Disequilibrium
As we discussed before, DNA sequences are highly correlated, leading to interdependent privacy risks. Linkage disequilibrium (LD) [24] is a correlation that appears between any pair of SNP positions in the whole genome due to the population's genetic history. Because of LD, the content of a SNP position can be inferred from the contents of other SNP positions. The strength of the LD between two SNP positions is usually represented by $r^2$ (or $D'$), where $r^2 = 1$ represents the strongest LD relationship.

2.2 Belief Propagation
Belief propagation [37, 41] is a message-passing algorithm for performing inference on graphical models (Bayesian networks, Markov random fields). It is typically used to compute marginal distributions of unobserved variables conditioned on observed ones. Computing marginal distributions is hard in general, as it might require summing over an exponentially large number of terms. The belief propagation algorithm can be described in terms of operations on a factor graph, a graphical model that is represented as a bipartite graph. One of the two disjoint sets of the factor graph's vertices represents the (random) variables of interest, and the second set represents the functions that factor the joint probability distribution (or global function) based on the dependences between variables. An edge connects a variable node to a factor node if and only if the variable is an argument of the function corresponding to the factor node. The marginal distribution of an unobserved variable can be exactly computed by using the belief propagation algorithm if the factor graph has no cycles. However, the algorithm is still well-defined and often gives good approximate results for factor graphs with cycles.
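To illustrate why message passing is exact on a cycle-free factor graph, the sketch below compares brute-force marginalization with two forwarded messages on a toy chain of three binary variables. The model (the `prior` and `pair` tables) is entirely hypothetical and unrelated to genomic data; it only shows the mechanics.

```python
from itertools import product

# Hypothetical toy model: a chain x0 - x1 - x2 of binary variables.
prior = [0.7, 0.3]               # unary factor on x0
pair = [[0.9, 0.1], [0.2, 0.8]]  # pairwise factor, reused for (x0, x1) and (x1, x2)

def brute_force_marginal_x2():
    """Sum the product of all factors over every joint assignment (2^3 terms)."""
    totals = [0.0, 0.0]
    for x0, x1, x2 in product(range(2), repeat=3):
        totals[x2] += prior[x0] * pair[x0][x1] * pair[x1][x2]
    z = sum(totals)
    return [t / z for t in totals]

def message_passing_marginal_x2():
    """Forward one message per edge; the cost grows linearly with the chain length."""
    m01 = [sum(prior[x0] * pair[x0][x1] for x0 in range(2)) for x1 in range(2)]
    m12 = [sum(m01[x1] * pair[x1][x2] for x1 in range(2)) for x2 in range(2)]
    z = sum(m12)
    return [v / z for v in m12]

print(brute_force_marginal_x2())
print(message_passing_marginal_x2())
```

Both functions return the same distribution up to floating-point rounding; only the number of terms that must be summed differs.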
Belief propagation is commonly used in artificial intelligence and information theory. It has demonstrated empirical success in numerous applications including LDPC codes [42], reputation management [11, 12], and recommender systems [1].

3. THE PROPOSED FRAMEWORK
In this section, we formalize our approach and present the different components that will allow us to quantify kin genomic privacy. Fig. 2 gives an overview of the framework. In a nutshell, the goal of the adversary is to infer some targeted SNPs of a member (or multiple members) of a targeted family. We define $F$ to be the set of family members in the targeted family (whose family tree, showing the familial connections between the members, is denoted as $G_F$) and $S$ to be the set of SNP IDs (i.e., positions on the DNA sequence), where $|F| = n$ and $|S| = m$. Note that the SNP IDs are the same for all the members of the family. We also let $x_j^i$ be the value of SNP $j$ ($j \in S$) for individual $i$ ($i \in F$), where $x_j^i \in \{0, 1, 2\}$ (as introduced in Section 2.1). Furthermore, $X^i = \{x_j^i : j \in S, i \in F\}$ represents the set of SNPs for individual $i$. We let $X$ be the $n \times m$ matrix that stores the values of the SNPs of all family members. Some entries of $X$ might be known by the adversary (the observed genomic data of one or more family members) and others might be unknown. We denote the set of SNPs from $X$ whose values are unknown as $X_U$, and the set of SNPs from $X$ whose values are known (by the adversary) as $X_K$.

$F_R(x_j^M, x_j^F, x_j^C)$ is the function representing the Mendelian inheritance probabilities (in Table 1), where (M, F, C) represent mother, father, and child, respectively. The $m \times m$ matrix $L$ represents the pairwise linkage disequilibrium (LD) between the SNPs in $S$, which can be expressed by $r^2$ and $D'$; $L_{i,j}$ refers to the matrix entry at row $i$ and column $j$. $L_{i,j} > 0$ if $i$ and $j$ are in LD, and $L_{i,j} = 0$ if these two SNPs are independent (i.e., there is no LD between them). $P = \{p_i^b : i \in S\}$ represents the set of minor allele probabilities (or MAFs) of the SNPs in $S$. Finally, note that a joint probability $p(x_i, x_j)$ can be derived from $L_{i,j}$, $p_i^b$, and $p_j^b$.

The adversary carries out a reconstruction attack to infer $X_U$ by relying on his background knowledge, $F_R(x_j^M, x_j^F, x_j^C)$, $L$, $P$, and on his observation $X_K$. Once the targeted SNPs are inferred by the adversary, we evaluate the genomic and health privacy of the family members based on the adversary's success and his certainty about the targeted SNPs and the diseases they reveal. Finally, we discuss some ideas to preserve the individuals' genomic and health privacy.

[Figure 2 (diagram not reproduced). Blocks shown: familial relationships gathered from social networks or genealogy websites; actual genomic sequences ($m$ SNPs per individual); GPPM; observed genomic sequences; the adversary's background knowledge (rules of meiosis, linkage disequilibrium values as a matrix of pairwise joint probabilities, minor allele frequencies); reconstruction attack (inference); genomic privacy quantification; health privacy quantification; decision.]
Figure 2: Overview of the proposed framework to quantify kin genomic privacy. Each vector $X^i$ ($i \in \{1, \ldots, n\}$) includes the set of SNPs for an individual in the targeted family. Furthermore, each letter pair in $X^i$ represents a SNP $x_j^i$; for simplicity, each SNP $x_j^i$ can be represented using {BB, Bb, bb} (or {0, 1, 2}), as discussed in Section 2.1. Once the health privacy is quantified, the family should ideally decide whether to reveal less or more of their genomic information through the genomic-privacy preserving mechanism (GPPM).

3.1 Adversary Model
An adversary is defined by his objective(s), attack(s), and knowledge.
The objective of the adversary is to compute the values of the targeted SNPs for one or more members of a targeted family by using (i) the available genomic data of one or more family members, (ii) the familial relationships between the family members, (iii) the rules of reproduction (in Section 2.1.2), (iv) the minor allele frequencies (MAFs) of the nucleotides, and (v) the population LD values between the SNPs. We note that (i) and (ii) can be gathered online from genome-sharing websites and OSNs, and that (iii), (iv), and (v) are publicly known information. Note that, in the future, the increasing possibility to accurately sequence and impute the actual haplotypes carried by an individual in each of the copies of the diploid genome will allow a more accurate inference of relatives' genotypes than relying on population LD patterns only.

Various attacks can be launched, depending on the adversary's interest. The adversary might want to infer one particular SNP of a specific individual (targeted-SNP-targeted-relative attack) or one particular SNP of multiple relatives in the targeted family (targeted-SNP-multiple-relatives attack) by observing one or more other relatives' SNPs at the same position. Furthermore, the adversary might also want to infer multiple SNPs of the same individual (multiple-SNP-targeted-relative attack) or multiple SNPs of multiple family members (multiple-SNP-multiple-relatives attack) by observing SNPs at various positions of different relatives. In this paper, we propose an algorithm that implements the latter attack, from which any of the other attacks can be carried out. We formulate this attack as a statistical inference problem.

3.2 Inference Attack
We formulate the reconstruction attack (on determining the values of the targeted SNPs) as finding the marginal probability distributions of the unknown variables $X_U$, given the known values in $X_K$, the familial relationships, and the publicly available statistical information.
We represent the marginal distribution of a SNP $j$ for an individual $i$ as $p(x_j^i \mid X_K)$. These marginal probability distributions could traditionally be extracted from $p(X_U \mid X_K, F_R(x_j^M, x_j^F, x_j^C), L, G_F, P)$, which is the joint probability distribution function of the variables in $X_U$, given the available side information and the observed SNPs. Then, clearly, each marginal probability distribution could be obtained as follows:

$$p(x_j^i \mid X_K) = \sum_{X_U \setminus \{x_j^i\}} p(X_U \mid X_K, F_R(x_j^M, x_j^F, x_j^C), L, G_F, P), \tag{1}$$

where the notation $X_U \setminus \{x_j^i\}$ implies all variables in $X_U$ except $x_j^i$. However, the number of terms in (1) grows exponentially with the number of variables, making the computation infeasible considering the scale of the human genome (which includes tens of millions of SNPs). In the worst case, the computation of the marginal probabilities has a complexity of $O(3^{nm})$. Thus, we propose to factorize the joint probability distribution function into products of simpler local functions, each of which depends on a subset of variables. These local functions represent the conditional dependences (due to LD and reproduction) between the different variables in $X$. Then, by running the belief propagation algorithm on a factor graph, we can compute the marginal probability distributions in linear complexity (with respect to $nm$).

A factor graph is a bipartite graph containing two sets of nodes (corresponding to variables and factors) and edges connecting these two sets. Following [37], we form a factor graph by setting a variable node for each SNP $x_j^i$ ($j \in S$ and $i \in F$). We use two types of factor nodes: (i) familial factor nodes, representing the familial relationships and reproduction, and (ii) LD factor nodes, representing the LD relationships between the SNPs. We summarize the connections between the variable and factor nodes below (Fig. 3):

• Each variable node $x_j^i$ has its familial factor node $f_j^i$, and they are connected. Furthermore, $x_j^k$ ($k \neq i$) is also connected to $f_j^i$ if $k$ is the mother or father of $i$ (in $G_F$). Thus, the maximum degree of a familial factor node is 3.
• Variable nodes $x_j^i$ and $x_m^i$ are connected to an LD factor node $g_{j,m}^i$ if SNP $j$ is in LD with SNP $m$. Since the LD relationships are pairwise between the SNPs, the degree of an LD factor node is always 2.
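The two connection rules above can be sketched as a small graph-construction routine. This is an illustrative sketch only: the function name `build_factor_graph`, the dictionary-based pedigree encoding, and the assumption that all three SNP pairs of the trio example are in LD are choices made here, not details taken from the paper.

```python
def build_factor_graph(parents, snps, ld_pairs):
    """Return (variables, familial_factors, ld_factors) for a pedigree.

    parents  : dict mapping each individual to (mother, father), or None for founders
    snps     : list of SNP IDs shared by all family members
    ld_pairs : list of (j, m) SNP pairs assumed to be in LD
    """
    variables = [(i, j) for i in parents for j in snps]

    # One familial factor f_j^i per variable, also linked to the parents' variables.
    familial = {}
    for i, mother_father in parents.items():
        for j in snps:
            neighbors = [(i, j)]
            if mother_father is not None:
                neighbors += [(p, j) for p in mother_father]
            familial[("f", i, j)] = neighbors

    # One LD factor g_{j,m}^i per individual and per SNP pair in LD.
    ld = {("g", i, j, m): [(i, j), (i, m)] for i in parents for (j, m) in ld_pairs}
    return variables, familial, ld

# Trio: mother = 1, father = 2, child = 3; SNPs 1..3, all pairs assumed in LD.
trio = {1: None, 2: None, 3: (1, 2)}
variables, familial, ld = build_factor_graph(trio, [1, 2, 3], [(1, 2), (1, 3), (2, 3)])
print(len(variables), len(familial), len(ld))
print(max(len(nb) for nb in familial.values()), {len(nb) for nb in ld.values()})
```

Founders (individuals without parents in the pedigree) get a familial factor of degree 1, while a child's familial factor has degree 3, matching the degree bound stated in the first rule.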
Given the conditional dependences induced by reproduction and LD, the global distribution $p(X_U \mid X_K, F_R(x_j^M, x_j^F, x_j^C), L, G_F, P)$ can be factorized into products of several local functions, each having a subset of variables from $X$ as arguments:

$$p(X_U \mid X_K, F_R(x_j^M, x_j^F, x_j^C), L, G_F, P) = \frac{1}{Z} \Bigg[ \prod_{i \in F} \prod_{j \in S} f_j^i\big(x_j^i, \Theta(x_j^i), F_R(x_j^M, x_j^F, x_j^C), P\big) \Bigg] \times \Bigg[ \prod_{i \in F} \prod_{(j,m)\ \text{s.t.}\ L_{j,m} \neq 0} g_{j,m}^i\big(x_j^i, x_m^i, L_{j,m}\big) \Bigg], \tag{2}$$

where $Z$ is the normalization constant, and $\Theta(x_j^i)$ is the set of values of SNP $j$ for the mother and father of $i$ (in $G_F$).

Next, we introduce the messages between the factor and the variable nodes to compute the marginal probability distributions using belief propagation. We denote the messages from the variable nodes to the factor nodes as $\mu$. We also denote the messages from familial factor nodes to variable nodes as $\lambda$, and from LD factor nodes to variable nodes as $\beta$. Let $X^{(\nu)} = \{x_j^{i(\nu)} : j \in S, i \in F\}$ be the collection of variables representing the values of the variable nodes at iteration $\nu$ of the algorithm. The message $\mu_{i \to k}^{(\nu)}(x_j^{i(\nu)})$ denotes the probability of $x_j^{i(\nu)} = \ell$ ($\ell \in \{0, 1, 2\}$) at the $\nu$-th iteration. Furthermore, $\lambda_{k \to i}^{(\nu)}(x_j^{i(\nu)})$ denotes the probability that $x_j^{i(\nu)} = \ell$, for $\ell \in \{0, 1, 2\}$, at the $\nu$-th iteration given $\Theta(x_j^i)$, $F_R(x_j^M, x_j^F, x_j^C)$, and $P$. Finally, $\beta_{k \to i}^{(\nu)}(x_j^{i(\nu)})$ denotes the probability that $x_j^{i(\nu)} = \ell$, for $\ell \in \{0, 1, 2\}$, at the $\nu$-th iteration given the LD relationships between the SNPs.

[Figure 3 (diagram not reproduced). Legend of panel (b): $F$ is the set of family members, Mother (M), Father (F), and Child (C), with M represented as 1, F as 2, and C as 3; $S$ is the set of SNP IDs; $x_j^i$ is the variable node representing the value of SNP $j$ for individual $i$; $f_j^i$ is the familial factor node, representing the familial relationships and reproduction; $g_{j,m}^i$ is the LD factor node, representing the LD relationship between the SNPs.]
Figure 3: The factor graph representation of a trio (mother, father, child) using 3 SNPs. (a) $G_F$, showing the familial connections among the trio. (b) Descriptions of the notations in the factor graph. (c) Factor graph representation of the trio using SNPs in $S = \{1, 2, 3\}$. The message passing is described on the nodes ($x_1^1$, $f_1^3$, and $g_{1,2}^1$) highlighted in the graph.

For clarity of presentation, we choose a simple family tree consisting of a trio (i.e., mother, father, and child) in Fig. 3(a), and 3 SNPs (i.e., $|F| = 3$ and $|S| = 3$). In Fig. 3(c), we show how the trio and the SNPs are represented on a factor graph, where $i = 1$ represents the mother, $i = 2$ represents the father, and $i = 3$ represents the child. Furthermore, the 3 SNPs are represented as $j = 1$, $j = 2$, and $j = 3$, respectively. We describe the message exchange between the variable node representing the first SNP of the mother ($x_1^1$), the familial factor node of the child ($f_1^3$), and the LD factor node $g_{1,2}^1$. The belief propagation algorithm iteratively exchanges messages between the factor and the variable nodes in Fig. 3(c), updating the beliefs on the values of the targeted SNPs (in $X_U$) at each iteration, until convergence. We denote the variable and factor nodes $x_1^1$, $f_1^3$, and $g_{1,2}^1$ with the letters $i$, $k$, and $z$, respectively.

The variable nodes generate their messages ($\mu$) and send them to their neighbors. Variable node $i$ forms $\mu_{i \to k}^{(\nu)}(x_1^{1(\nu)})$ by multiplying all information it receives from its neighbors, excluding the familial factor node $k$.^3 Hence, the message from variable node $i$ to the familial factor node $k$ at the $\nu$-th iteration is given by

$$\mu_{i \to k}^{(\nu)}(x_1^{1(\nu)}) = \frac{1}{Z} \prod_{w \in (\sim k)} \lambda_{w \to i}^{(\nu-1)}(x_1^{1(\nu-1)}) \prod_{y \in \{z,\, g_{1,3}^1\}} \beta_{y \to i}^{(\nu-1)}(x_1^{1(\nu-1)}), \tag{3}$$

where $Z$ is a normalization constant, and the notation $(\sim k)$ means all familial factor node neighbors of the variable node $i$, except $k$. This computation is repeated for every neighbor of each variable node.

^3 The message $\mu_{i \to z}^{(\nu)}(x_1^{1(\nu)})$ from the variable node $i$ to the LD factor node $z$ is constructed similarly.

It is important to note that the message in (3) is valid if the value of $x_1^1$ is unknown to the adversary (i.e., $x_1^1 \in X_U$). However, the value of $x_1^1$ can also be observed by the adversary (i.e., $x_1^1 \in X_K$). Thus, if $x_1^1 \in X_K$ and $x_1^1 = \rho$ ($\rho \in \{0, 1, 2\}$), then $\mu_{i \to k}^{(\nu)}(x_1^{1(\nu)} = \rho) = 1$ and $\mu_{i \to k}^{(\nu)}(x_1^{1(\nu)}) = 0$ for the other potential values of $x_1^1$ (regardless of the values of the messages received by the variable node $i$ from its neighbors).

Next, the factor nodes generate their messages. The message from the familial factor node $k$ to the variable node $i$ at the $\nu$-th iteration is formed using the principles of belief propagation as

$$\lambda_{k \to i}^{(\nu)}(x_1^{1(\nu)}) = \sum_{\{x_1^2,\, x_1^3\}} f_1^3\big(x_1^1, \Theta(x_1^1), F_R(x_j^M, x_j^F, x_j^C), P\big) \prod_{y \in \{x_1^2,\, x_1^3\}} \mu_{y \to k}^{(\nu)}(x_1^{1(\nu)}). \tag{4}$$

Note that $f_1^3(x_1^1, \Theta(x_1^1), F_R(x_j^M, x_j^F, x_j^C), P) \propto p(x_1^1 \mid \Theta(x_1^1), F_R(x_j^M, x_j^F, x_j^C), P)$, and this probability is computed using Table 1. Furthermore, if the degree of the familial factor node is 1 for a particular SNP, then the local function corresponding to the familial factor node only depends on the MAF of the corresponding SNP. For example, the degree of $f_1^1$ (in Fig. 3(c)) is 1, hence $f_1^1(x_1^1, \Theta(x_1^1), F_R(x_j^M, x_j^F, x_j^C), P) \propto p(x_1^1 \mid p_1^b)$. The above computation must be performed for every neighbor of each familial factor node.

Similarly, the message from the LD factor node $z$ to the variable node $i$ at the $\nu$-th iteration is formed as

$$\beta_{z \to i}^{(\nu)}(x_1^{1(\nu)}) = \sum_{x_2^1} g_{1,2}^1\big(x_1^1, x_2^1, L_{1,2}\big) \prod_{y \in \{x_2^1\}} \mu_{y \to z}^{(\nu)}(x_1^{1(\nu)}). \tag{5}$$

As before, this computation is performed for every neighbor of each LD factor node. We further note that $g_{1,2}^1(x_1^1, x_2^1, L_{1,2}) \propto p(x_1^1, x_2^1)$, which is derived from $L_{1,2}$, $p_1^b$, and $p_2^b$.
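As a worked instance of the familial-factor message in (4), the sketch below infers the mother's genotype at one SNP when only the child's genotype is observed. It rests on assumptions that are not stated in the text: LD factors are ignored (a single SNP), and the degree-1 familial factor $p(x \mid p^b)$ is taken to be the Hardy-Weinberg distribution. The function names are hypothetical.

```python
def mendel(mother, father):
    """Table 1: (Pr(BB), Pr(Bb), Pr(bb)) for the child, genotypes encoded as 0, 1, 2."""
    pm, pf = mother / 2.0, father / 2.0
    return ((1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf)

def founder_prior(maf):
    """ASSUMPTION: Hardy-Weinberg proportions stand in for p(x | p^b)."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def familial_message_to_mother(mu_father, mu_child):
    """Sum-product message from the child's familial factor to the mother's variable."""
    message = []
    for xm in range(3):
        total = 0.0
        for xf in range(3):
            child_probs = mendel(xm, xf)
            for xc in range(3):
                total += child_probs[xc] * mu_father[xf] * mu_child[xc]
        message.append(total)
    return normalize(message)

def mother_marginal(maf, child_observed):
    prior = founder_prior(maf)      # message from the mother's own degree-1 factor
    mu_father = founder_prior(maf)  # unobserved father forwards his own prior
    mu_child = [1.0 if x == child_observed else 0.0 for x in range(3)]  # observed SNP
    lam = familial_message_to_mother(mu_father, mu_child)
    # Marginal = product of all incoming messages at the mother's variable node.
    return normalize([p * l for p, l in zip(prior, lam)])

print(mother_marginal(maf=0.2, child_observed=2))
```

With a single SNP the trio's factor graph has no cycles, so one pass of messages already yields the exact posterior; with several SNPs in LD the same message would be recomputed at every iteration.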
The algorithm proceeds to the next iteration in the same way as the $\nu$-th iteration. The algorithm starts at the variable nodes. Thus, at the first iteration of the algorithm (i.e., $\nu = 1$), the variable node $i$ sends messages to its neighboring factor nodes based on the following rules: (i) if the value of $x_1^1$ is unknown to the adversary ($x_1^1 \in X_U$), $\mu_{i \to k}^{(1)}(x_1^{1(1)}) = 1$ for all potential values of $x_1^1$, and (ii) if the value of $x_1^1$ is known to the adversary ($x_1^1 \in X_K$) and $x_1^1 = \rho$ ($\rho \in \{0, 1, 2\}$), $\mu_{i \to k}^{(1)}(x_1^{1(1)} = \rho) = 1$ and $\mu_{i \to k}^{(1)}(x_1^{1(1)}) = 0$ for the other potential values of $x_1^1$. The iterations stop when all variables in $X_U$ have converged. The marginal probability of each variable in $X_U$ is given by multiplying all the incoming messages at each variable node.

3.3 Computational Complexity
The computational complexity of the proposed inference attack is proportional to the number of factor nodes. In our setting, there are $nm$ familial factor nodes and a maximum of $nm(m-1)/2$ LD factor nodes. Hence, the worst-case computational complexity per iteration is $O(nm^2)$. However, as each SNP is in LD with a limited number of other SNPs, the matrix $L$ is sparse and the number of LD factor nodes grows with $m$ rather than with $m(m-1)/2$, especially if we focus on SNPs in strong LD only. Thus, the average computational complexity per iteration is $O(nm)$. Based on our experiments, we can state that the number of iterations before convergence is a small constant, between 10 and 15. Note finally that this complexity can be further reduced by using techniques similar to those developed for message-passing decoding of LDPC codes (e.g., working in the log-domain [2]).

3.4 Privacy Metrics
A crucial step towards protecting kin genomic privacy is to quantify the privacy loss induced by the release of genomic information.
Through the inference attack, the adversary infers the targeted SNPs (in $X_U$) belonging to the members of a targeted family by using his background knowledge and the observed genomic data (of the family members). The inferred information can be expressed as the posterior distribution $p(X_U \mid X_K, F_R(x_j^M, x_j^F, x_j^C), L, G_F, P)$. Moreover, each posterior marginal probability distribution is represented as $p(x_j^i \mid X_K)$, for all $i \in F$, $j \in S$. We propose to quantify kin genomic privacy using the following metrics: expected estimation error (incorrectness) and uncertainty.^4

Correctness was already proposed in the context of location privacy [45]. In our scenario, correctness quantifies the adversary's success in inferring the targeted SNPs. That is, it quantifies the expected distance between the adversary's estimate of the value of a SNP, $x_j^i$ ($x_j^i \in X_U$), and the true value of the corresponding SNP, $\hat{x}_j^i$. This distance can be expressed as the expected estimation error as follows:

$$E_j^i = \sum_{x_j^i \in \{0,1,2\}} p(x_j^i \mid X_K)\, \big\| x_j^i - \hat{x}_j^i \big\|. \tag{6}$$

Privacy can also be represented as the adversary's uncertainty [22, 43], that is, the ambiguity of $p(x_j^i \mid X_K)$. This uncertainty is generally considered to be maximum if the posterior distribution is uniform. This definition of uncertainty can be quantified as the (normalized) entropy of $p(x_j^i \mid X_K)$ as follows:

$$H_j^i = \frac{-\sum_{x_j^i \in \{0,1,2\}} p(x_j^i \mid X_K) \log p(x_j^i \mid X_K)}{\log(3)}. \tag{7}$$

The higher the entropy, the higher the uncertainty. Finally, we propose another entropy-based metric that quantifies the mutual dependence between the hidden genomic data that the adversary is trying to reconstruct and the observed data. This is quantified by the mutual information $I(x_j^i; X_K) = H(x_j^i) - H(x_j^i \mid X_K)$ [8]. As privacy decreases with mutual information, we propose the following (normalized) privacy metric:

$$I_j^i = 1 - \frac{H(x_j^i) - H(x_j^i \mid X_K)}{H(x_j^i)} = \frac{H(x_j^i \mid X_K)}{H(x_j^i)}. \tag{8}$$
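The three genomic-privacy metrics can be evaluated directly from a posterior marginal over {0, 1, 2}. The sketch below is illustrative only: the function names are hypothetical, and for the third metric the entropy of the posterior for one particular observation is used as a stand-in for the conditional entropy $H(x_j^i \mid X_K)$, which in general is an expectation over the observed data.

```python
import math

def entropy(dist):
    """Shannon entropy (natural log); zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def expected_error(posterior, true_value):
    """Expected estimation error: expected distance to the true genotype in {0, 1, 2}."""
    return sum(p * abs(x - true_value) for x, p in enumerate(posterior))

def normalized_entropy(posterior):
    """Uncertainty metric: entropy of the posterior divided by log(3)."""
    return entropy(posterior) / math.log(3)

def mi_privacy(prior, posterior):
    """Mutual-information-based metric, with the ASSUMPTION described in the lead-in."""
    return entropy(posterior) / entropy(prior)

prior = [0.64, 0.32, 0.04]    # hypothetical distribution before any observation
posterior = [0.0, 0.8, 0.2]   # hypothetical posterior produced by the attack
print(expected_error(posterior, true_value=1))
print(normalized_entropy(posterior))
print(mi_privacy(prior, posterior))
```

All three values shrink as the posterior concentrates on the true genotype, which corresponds to a loss of privacy.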
(8) The aforementioned metrics are useful for quantifying the genomic privacy of individuals. In order to quantify a more tangible privacy, we must convert these genomic-privacy metrics into health-privacy metrics. To quantify an individual s health privacy, we focus on his predisposition to different diseases. Let S d be the set of IDs of the SNPs that are associated with a disease d. Then, a metrics quantifying the 4 These metrics are not specific to the proposed inference attack; they can be used to quantify genomic privacy in general.

7 health privacy for an individual i regarding the disease d can be defined as follows: Dd i 1 = c k G i k, (9) k S d c k k S d where G i k is the genomic privacy of a SNP k for individual i, computed using (6), (7), or (8), and c k is the contribution of SNP k to disease d. 5 Other health-privacy metrics based on non-linear combinations of genotypes or combinations of alleles will be defined in future work. Note that healthprivacy metrics are valid at a given time, and cannot be used to evaluate future privacy provision, as genome research can change knowledge on the contribution of SNPs to diseases. 3.5 Genomic-Privacy Preserving Mechanism Individuals willing to share genomic data for research or recreational purposes might be unwilling to share all their DNA sequence, and thus need to properly obfuscate the sensitive part(s) before releasing their genomic data. To do so, their DNA will go through an obfuscation process, that we call genomic-privacy preserving mechanism (GPPM). GPPM can be implemented using one of the following techniques: (i) hiding the SNPs, or (ii) reducing the precision or the quantity of the revealed SNPs. Hiding all or specific SNPs can be achieved either by not releasing them or by encrypting them. Obviously, not releasing any of the SNPs would hinder genetic research, thus it is not a preferred way to protect the genomic privacy of individuals. Instead of not releasing the SNPs, the use of cryptographic algorithms to encrypt the genome is proposed. For example, Kantarcioglu et al. propose using homomorphic encryption on the SNPs of the individuals to perform genetic research on the encrypted SNPs [35]. However, the security of an individual s genome should be guaranteed for at least 7-1 years (i.e., during the typical lifetime of a human). As we show in this paper, even lifelong protection is not enough, considering kin privacy implications (e.g., for offsprings). 
It is known that even the best of the cryptographic algorithms we use today could be broken in around 3 years. Therefore, the appropriateness of cryptographic techniques for storing and processing genomic data has been questioned, given the long-term security requirements of such data. As an alternative to the cryptographic techniques, utility (i.e., precision and quantity of the revealed SNPs) can be traded for privacy. The precision of the revealed SNPs can be reduced, for example, by revealing only one of the two alleles of a SNP. Similarly, family members' SNPs can be selectively revealed by also considering the previously revealed SNPs from the corresponding family (to keep the genomic privacy of other family members above a desired threshold). We evaluate the privacy provided by this technique in Section 4 by assessing the inference power of the adversary for different fractions of observed data from a targeted family. Ultimately, using one of the above techniques, the GPPM takes $X$ as input and outputs $X_K$ as the set of revealed SNPs. We note that a detailed implementation of the GPPM using one of the aforementioned techniques is out of the scope of this work. We plan to study it in the future.

⁵ These contributions are determined as a result of medical studies. Some SNPs might increase (or decrease) the risk for a disease more than others.

Figure 4: Family tree of CEPH/Utah Pedigree 1463 consisting of the 11 family members that were considered: grandparents GP1, GP2, GP3, and GP4; parents P5 and P6; and children C7, C8, C9, C10, and C11. Male and female family members are drawn with different symbols.

4. EVALUATION

In this section, we first evaluate the performance of the proposed inference attack, then compare the performance of the inference with and without considering the linkage disequilibrium (LD) between SNPs, and finally evaluate the entropy-based metrics with respect to the expected estimation error in quantifying the genomic privacy.
For this evaluation, we use the CEPH/Utah Pedigree 1463, which contains the partial DNA sequences of 17 family members (4 grandparents, 2 parents, and 11 children) [23]. As shown in Fig. 4, we only use 5 (out of 11) children for our evaluation because (i) 11 is far above the average number of children per family, (ii) we observe that the strength of the adversary's inference does not increase further (due to the children's revealed genomes) when more than 5 children's genomes are revealed, and (iii) the belief propagation algorithm (in Section 3.2) might have convergence issues due to the number of loops in the factor graph, and this number increases with the number of children. As the SNPs related to important diseases, like Alzheimer's, are not included in this dataset, we quantify health privacy in Section 5 by using the data collected from a genome-sharing website.

To quantify the genomic privacy of the individuals in the CEPH family, we focus on their SNPs on chromosome 1 (which is the largest chromosome). We rely on the three metrics introduced in Section 3.4. That is, we compute the genomic privacy of each family member using the expected estimation error in (6), the (normalized) entropy in (7), and the (normalized) mutual information in (8) on the targeted SNPs, and we average the result over the number of targeted SNPs for each individual. We rely on the $L_1$ norm to measure the distance between two SNP values in (6).

First, we assume that the adversary targets one family member and tries to infer his/her SNPs by using the published SNPs of other family members, without considering the LD between the SNPs. We select an individual from the CEPH family and denote him/her as the target individual. We construct $S$, the set of SNP IDs that we consider for evaluation, from 8k SNPs on chromosome 1. Thus, the set of targeted SNPs ($X_U$) includes 8k SNPs of the target individual.
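The metrics of Section 3.4 and their averaging over targeted SNPs can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: posterior and prior marginals are passed in as 3-tuples over the values {0, 1, 2}, the conditional entropy in (8) is taken to be the entropy of the posterior marginal, and all function names and numeric inputs are hypothetical.

```python
import math

SNP_VALUES = (0, 1, 2)  # possible values of a SNP, as in Eqs. (6) and (7)


def entropy(dist):
    """Shannon entropy of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def estimation_error(posterior, true_value):
    """Eq. (6): expected L1 distance between the adversary's estimate and the true value."""
    return sum(p * abs(x - true_value) for x, p in zip(SNP_VALUES, posterior))


def normalized_entropy(posterior):
    """Eq. (7): entropy of the posterior marginal divided by log(3)."""
    return entropy(posterior) / math.log(len(SNP_VALUES))


def mi_privacy(prior, posterior):
    """Eq. (8): H(x | X_K) / H(x).

    Assumption: H(x | X_K) is the entropy of the posterior marginal and H(x) the
    entropy of the prior marginal, which must be non-degenerate (H(x) > 0).
    """
    return entropy(posterior) / entropy(prior)


def health_privacy(contributions, snp_privacy):
    """Eq. (9): contribution-weighted average of per-SNP privacy values G_k."""
    total = sum(contributions.values())
    return sum(c * snp_privacy[k] for k, c in contributions.items()) / total


def average(values):
    """Average a per-SNP metric over the targeted SNPs of one individual."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical posteriors and true values for two targeted SNPs of one individual.
posteriors = [(0.1, 0.7, 0.2), (1 / 3, 1 / 3, 1 / 3)]
true_values = [1, 2]
avg_error = average(estimation_error(p, t) for p, t in zip(posteriors, true_values))
avg_entropy = average(normalized_entropy(p) for p in posteriors)
```

A uniform posterior gives a normalized entropy of 1 (maximum uncertainty), while a posterior concentrated on one value gives 0, regardless of whether that value is the true one; this is why the estimation error requires ground truth and the entropy-based metrics do not.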
Furthermore, we gradually fill the set of observed SNPs ($X_K$) with the set of 8k SNPs of other family members. That is, we sequentially reveal 8k SNPs (whose IDs are in $S$) of all family members (excluding the target individual), beginning with the most distant family members from the target individual (in terms of number of hops in Fig. 4), and we keep revealing relatives until we reach his/her closest family members.⁶

Figure 5: Evolution of the genomic privacy of the (a) grandparent (GP1), (b) parent (P5), and (c) child (C7). We reveal all the 8k SNPs on chromosome 1 of other family members starting from the most distant family members of the target individual (in terms of number of hops to the target individual in Fig. 4); the x-axis represents the disclosure sequence. We note that x = 0 represents the prior distribution, when no genomic data is observed by the adversary. Each panel plots the privacy level (estimation error, normalized entropy, and 1 − mutual information) against the revealed relatives, in the order (a) GP3, GP4, P6, C7, C8, C9, C10, C11, GP2, P5; (b) GP3, GP4, P6, C7, C8, C9, C10, C11, GP1, GP2; (c) GP1, GP2, GP3, GP4, C8, C9, C10, C11, P5, P6.

In Fig. 5 we show the evolution of the genomic privacy of three target individuals from the CEPH family (in Fig. 4): (i) grandparent (GP1), (ii) parent (P5), and (iii) child (C7). We note that all entropy-based metrics for each target individual start from the same values. We also observe that the parent's and the child's genomic privacy decreases considerably more than the grandparent's (the adversary's error for the grandparent's genome does not go below ). Moreover, the observation of GP3, GP4, and P6's genomes has no effect on GP1's and P5's privacy, as their genomes are independent (if no other relatives' genomes are observed). We observe in Fig. 5(a) that the grandparent's genomic privacy is mostly affected by the SNPs of the first revealed children (C7, C8), and also by those of his spouse and his child (P5). We also observe (in Fig.
5(b)) that, by revealing all family members' SNPs (except P5), the adversary can almost reach an estimation error of 0. The target parent's genomic privacy significantly decreases only with the observation of his children's and spouse's SNPs. Finally, we observe in Fig. 5(c) that C7's genomic privacy decreases smoothly with the observation of the grandparents' SNPs, and then of the siblings' SNPs. We also observe a slight decrease of privacy once the parents' SNPs (P5 and P6) are also revealed, but the observation of the parents (after the other children) does not have a significant effect on the adversary's error. Note that the importance of a family member for the inference power of the adversary also depends on the order in which the family members' SNPs are revealed in Fig. 5. For example, in Fig. 5(c), if the SNPs of the parents (P5 and P6) of the target child (C7) were revealed before those of the siblings (C8-C11), then the observation of the parents would reduce the genomic privacy of the target child more than the observation of the siblings (but the final genomic privacy would not change).

⁶ The exact sequence of the family members (whose SNPs are revealed) is indicated for each evaluation.

Next, we include the LD relationships and observe the change in the inference power of the adversary using the LD values. We construct $S$ from 1 SNPs on chromosome 1. Among these 1 SNPs, each SNP is in LD with 5 other SNPs on average. Furthermore, the strength of the LD ($r^2$ value in Section 2.1.3) uniformly varies between 0 and 1 (where $r^2 = 1$ represents the strongest LD relationship, as discussed before). We note that we only use 1 SNPs for this study as the LD values are not yet completely defined over all SNPs, and the definition of such values is still ongoing research. As before, we define a target individual from the CEPH family, construct the set $X_U$ from his/her SNPs, and sequentially reveal other family members' SNPs to observe the decrease in the genomic privacy of the target individual.
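The disclosure order used in these experiments (most distant relatives first, measured in hops in the family tree) can be sketched as a breadth-first search. The sketch below rests on assumptions: the parent links are inferred from the text (GP1 and GP2 as parents of P5, GP3 and GP4 as parents of P6), a hop is taken to be one parent-child link, and the function names are hypothetical. The paper does not state how ties between equally distant relatives are ordered, so they are broken here by label, an arbitrary choice.

```python
from collections import deque

# Parent links as inferred from the text; Fig. 4 itself is not reproduced here.
CHILDREN = ("C7", "C8", "C9", "C10", "C11")
PARENTS = {"P5": ("GP1", "GP2"), "P6": ("GP3", "GP4")}
PARENTS.update({child: ("P5", "P6") for child in CHILDREN})


def hop_distances(target):
    """Breadth-first search over parent-child links, starting at the target."""
    neighbors = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbors.setdefault(child, []).append(parent)
            neighbors.setdefault(parent, []).append(child)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def disclosure_order(target):
    """Relatives sorted from most distant to closest; ties are broken by label."""
    dist = hop_distances(target)
    relatives = [member for member in dist if member != target]
    return sorted(relatives, key=lambda member: (-dist[member], member))
```

For GP1, this yields GP3 and GP4 first (four hops), then P6, then the children and GP2 (all two hops away), and finally P5, which agrees with the sequence on the x-axis of Fig. 5(a) up to the order within ties.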
We observe that individuals sometimes reveal different parts of their genomes (e.g., different sets of SNPs) on the Internet. Thus, we assume that for each family member (except for the target individual), the adversary observes 5 random SNPs from $S$ only (instead of all the SNPs in $S$), and these sets of observed SNPs are different for each family member. In Fig. 6, we show the evolution of the genomic privacy of three target individuals when the adversary also uses the LD values. We observe that LD decreases genomic privacy, especially when few individuals' genomes are revealed. As more family members' genomes are observed, LD has less impact on the genomic privacy.

We also evaluate the inference power of the adversary to infer multiple SNPs among all family members, given a subset of SNPs belonging to some family members, and also considering the LD between SNPs. That is, we evaluate the inference power of the adversary for different fractions of observed data for the family members. Using the same set of 1 SNPs, we construct $X_U$ from $\kappa \cdot |S| \cdot n$ SNPs, randomly selected from all family members, where $|S|$ is the number of SNPs in $S$, $n$ is the number of family members in the family tree ($n = 11$ for this scenario), and $\kappa \le 1$. We assume that the SNPs that are not in $X_U$ are observed by the adversary (i.e., they are in $X_K$), and we observe the inference power of the adversary for the SNPs in $X_U$, for different values of $\kappa$. In Fig. 7, we observe an exponential decrease in the global genomic privacy (privacy of all family members), showing that the observation of a small portion of the family's SNPs can have a huge impact on genomic privacy. The estimation error is decreased by around 3 by observing only the first 1% of the SNPs.
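The split of the family's SNPs into a hidden set and an observed set for a given fraction can be sketched as follows. This is a minimal illustration, assuming that SNPs are identified by (family member, SNP ID) pairs and that the hidden pairs are drawn uniformly at random; the function name, the seed parameter, and the number of SNPs used in the usage line are hypothetical and not taken from the paper.

```python
import random


def partition_snps(members, snp_ids, kappa, seed=0):
    """Split all (member, SNP) pairs into hidden (X_U) and observed (X_K) sets.

    A fraction kappa of the pairs is hidden from the adversary; the rest is observed.
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    pairs = [(member, snp) for member in members for snp in snp_ids]
    rng = random.Random(seed)  # seeded so that a given split can be reproduced
    hidden = set(rng.sample(pairs, round(kappa * len(pairs))))
    observed = set(pairs) - hidden
    return hidden, observed


FAMILY = ["GP1", "GP2", "GP3", "GP4", "P5", "P6", "C7", "C8", "C9", "C10", "C11"]
x_u, x_k = partition_snps(FAMILY, range(20), kappa=0.5)
```

Sweeping kappa from 1 down to 0 and running the inference on each split would trace out a curve of the kind summarized in Fig. 7.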

Figure 6: Evolution of the genomic privacy of the (a) grandparent (GP1), (b) parent (P5), and (c) child (C7), with and without considering LD. For each family member, we reveal 5 randomly picked SNPs (among 1 SNPs in S), starting from the most distant family members, and the x-axis represents the exact sequence of this disclosure. Note that x = 0 represents the prior distribution, when no genomic data is revealed. Each panel plots the privacy level (estimation error, normalized entropy, and 1 − mutual information, each with and without LD) against the revealed relatives, in the order (a) GP3, GP4, P6, C7, C8, C9, C10, C11, GP2, P5; (b) GP3, GP4, P6, C7, C8, C9, C10, C11, GP1, GP2; (c) GP1, GP2, GP3, GP4, C8, C9, C10, C11, P5, P6.

Figure 7: Evolution of the global privacy for the whole family by gradually revealing 1% of SNPs. The plot shows the global privacy level (estimation error, normalized entropy, and 1 − mutual information) against the percentage of SNPs revealed.

5. EXPLOITING GENOME-SHARING WEBSITES AND ONLINE SOCIAL NETWORKS

In order to show that the proposed inference attack threatens not only the Lacks family, but potentially all families, we collected publicly available data from a genome-sharing website and familial relationships from an online social network (OSN), and evaluated the decrease in genomic and health privacy of people due to the observation of their relatives' genomic data. We gathered individuals' genomic data from OpenSNP [1], a website on which people can publicly share sets of SNPs.
Then, we identified the owners of some gathered genomic profiles by using their names and sometimes profile pictures. Among these identified individuals, we managed to find family relationships of 6 of them (who publicly reveal the names of some of their relatives) on Facebook.⁷ We expect this number to increase in the future, as more health-related OSNs (which let people share their genomic profiles, such as 23andMe [2]) emerge. Furthermore, we anticipate that the currently widely used health-related OSNs (e.g., PatientsLikeMe [6]) will let users upload and share their genomic data. We identified 29 target individuals from 6 different families, whose genomic data can be inferred using the observed SNPs of the identified individuals.

We focus on 2 individuals, $I_1$ and $I_2$, out of these 6 identified individuals and evaluate the genomic and health privacy for their family members. We observed that both $I_1$ and $I_2$ publicly disclosed around 1 million of their SNPs. Furthermore, we identified on Facebook the names of (i) 1 mother, 2 sons, 2 daughters, 1 grandchild, 1 aunt, 2 nieces, and 1 nephew of $I_1$, and (ii) 1 sibling, 1 aunt, 1 uncle, and 6 cousins of $I_2$. We compute the genomic and health privacy of these target individuals using the (normalized) entropy in (7) on the targeted SNPs, and normalize the result based on the number of targeted SNPs for each individual. We do not use the expected estimation error in (6), as we do not have the ground truth for the genomes of the target individuals. Thus, privacy is quantified as the uncertainty of the adversary in this section.

To quantify the genomic privacy of the target individuals (i.e., family members of $I_1$ and $I_2$), we first construct $S$ from all SNPs on chromosome 1 (from the observed genomes of $I_1$ and $I_2$). The set of observed SNPs ($X_K$) includes the observed SNPs of $I_1$ (respectively $I_2$) for the inference of family members of $I_1$ (respectively $I_2$).
The set of targeted SNPs ($X_U$) includes 77k SNPs for $I_1$'s family and 79k for $I_2$'s family (from $S$) for each evaluation. In Fig. 8, we show the decrease in the genomic privacy for different family members of $I_1$ (aunt, niece/nephew, grandchild, mother, child) and $I_2$ (cousin, aunt/uncle, sibling) as a result of our proposed inference attack, first without considering the LD dependencies (as in the previous section). We observe that, as expected, the decrease in the genomic privacy of close family members is significantly higher than that of more distant family members. However, as we have seen in Section 4, the observation of one (or more) additional family member(s) often has much more impact on the target's privacy than the observation of only one relative.

⁷ According to [28], around 12% of Facebook users publicly share at least one family member on their profiles.
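Why the genome of a single observed individual constrains a close relative more than a distant one can be illustrated on one SNP. The sketch below is not the belief-propagation attack of Section 3; it is a closed-form calculation under explicit assumptions: Mendel's first law for the observed lineage, an unobserved other parent who transmits a minor allele with probability equal to a hypothetical minor allele frequency `maf` (random mating), and no LD. All names and numbers are illustrative.

```python
import math


def child_distribution(parent_dist, maf):
    """Genotype distribution of a child given a distribution over one parent's genotype.

    Mendel's first law: a parent with g minor alleles transmits one with probability g / 2.
    Assumption: the other parent is unobserved and transmits a minor allele with
    probability maf.
    """
    a = sum(p * g / 2 for g, p in enumerate(parent_dist))
    return ((1 - a) * (1 - maf), a * (1 - maf) + (1 - a) * maf, a * maf)


def normalized_entropy(dist):
    """Entropy of a distribution over {0, 1, 2}, divided by log(3), as in Eq. (7)."""
    return -sum(p * math.log(p) for p in dist if p > 0) / math.log(3)


maf = 0.2                    # hypothetical minor allele frequency
observed = (0.0, 0.0, 1.0)   # the observed individual carries two minor alleles
child = child_distribution(observed, maf)
grandchild = child_distribution(child, maf)
```

In this example the child cannot have genotype 0, so the adversary's posterior for the child is more concentrated (lower normalized entropy) than the posterior for the grandchild, whose genotype is separated from the observation by one more transmission.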


More information

Statistical Methods for Quantitative Trait Loci (QTL) Mapping

Statistical Methods for Quantitative Trait Loci (QTL) Mapping Statistical Methods for Quantitative Trait Loci (QTL) Mapping Lectures 4 Oct 10, 011 CSE 57 Computational Biology, Fall 011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 1:00-1:0 Johnson

More information

PopGen1: Introduction to population genetics

PopGen1: Introduction to population genetics PopGen1: Introduction to population genetics Introduction MICROEVOLUTION is the term used to describe the dynamics of evolutionary change in populations and species over time. The discipline devoted to

More information

Basic Concepts of Human Genetics

Basic Concepts of Human Genetics Basic Concepts of Human Genetics The genetic information of an individual is contained in 23 pairs of chromosomes. Every human cell contains the 23 pair of chromosomes. One pair is called sex chromosomes

More information

High-density SNP Genotyping Analysis of Broiler Breeding Lines

High-density SNP Genotyping Analysis of Broiler Breeding Lines Animal Industry Report AS 653 ASL R2219 2007 High-density SNP Genotyping Analysis of Broiler Breeding Lines Abebe T. Hassen Jack C.M. Dekkers Susan J. Lamont Rohan L. Fernando Santiago Avendano Aviagen

More information

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011

EPIB 668 Genetic association studies. Aurélie LABBE - Winter 2011 EPIB 668 Genetic association studies Aurélie LABBE - Winter 2011 1 / 71 OUTLINE Linkage vs association Linkage disequilibrium Case control studies Family-based association 2 / 71 RECAP ON GENETIC VARIANTS

More information

An introduction to genetics and molecular biology

An introduction to genetics and molecular biology An introduction to genetics and molecular biology Cavan Reilly September 5, 2017 Table of contents Introduction to biology Some molecular biology Gene expression Mendelian genetics Some more molecular

More information

What is Genetics? Genetics The study of how heredity information is passed from parents to offspring. The Modern Theory of Evolution =

What is Genetics? Genetics The study of how heredity information is passed from parents to offspring. The Modern Theory of Evolution = What is Genetics? Genetics The study of how heredity information is passed from parents to offspring The Modern Theory of Evolution = Genetics + Darwin s Theory of Natural Selection Gregor Mendel Father

More information

Lab 1: A review of linear models

Lab 1: A review of linear models Lab 1: A review of linear models The purpose of this lab is to help you review basic statistical methods in linear models and understanding the implementation of these methods in R. In general, we need

More information

Genetics of dairy production

Genetics of dairy production Genetics of dairy production E-learning course from ESA Charlotte DEZETTER ZBO101R11550 Table of contents I - Genetics of dairy production 3 1. Learning objectives... 3 2. Review of Mendelian genetics...

More information

Genetics. Chapter 10/12-ish

Genetics. Chapter 10/12-ish Genetics Chapter 10/12-ish Learning Goals For Biweekly Quiz #7 You will be able to explain how offspring receive genes from their parents You will be able to calculate probabilities of simple Mendelian

More information

Haplotype Inference by Pure Parsimony via Genetic Algorithm

Haplotype Inference by Pure Parsimony via Genetic Algorithm Haplotype Inference by Pure Parsimony via Genetic Algorithm Rui-Sheng Wang 1, Xiang-Sun Zhang 1 Li Sheng 2 1 Academy of Mathematics & Systems Science Chinese Academy of Sciences, Beijing 100080, China

More information

Human Chromosomes Section 14.1

Human Chromosomes Section 14.1 Human Chromosomes Section 14.1 In Today s class. We will look at Human chromosome and karyotypes Autosomal and Sex chromosomes How human traits are transmitted How traits can be traced through entire families

More information

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016 CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016 Topics Genetic variation Population structure Linkage disequilibrium Natural disease variants Genome Wide Association Studies Gene

More information

Constrained Hidden Markov Models for Population-based Haplotyping

Constrained Hidden Markov Models for Population-based Haplotyping Constrained Hidden Markov Models for Population-based Haplotyping Extended abstract Niels Landwehr, Taneli Mielikäinen 2, Lauri Eronen 2, Hannu Toivonen,2, and Heikki Mannila 2 Machine Learning Lab, Dept.

More information

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE : GENETIC DATA UPDATE April 30, 2014 Biomarker Network Meeting PAA Jessica Faul, Ph.D., M.P.H. Health and Retirement Study Survey Research Center Institute for Social Research University of Michigan HRS

More information

Genetics of Beef Cattle: Moving to the genomics era Matt Spangler, Assistant Professor, Animal Science, University of Nebraska-Lincoln

Genetics of Beef Cattle: Moving to the genomics era Matt Spangler, Assistant Professor, Animal Science, University of Nebraska-Lincoln Genetics of Beef Cattle: Moving to the genomics era Matt Spangler, Assistant Professor, Animal Science, University of Nebraska-Lincoln Several companies offer DNA marker tests for a wide range of traits

More information

Exome Sequencing Exome sequencing is a technique that is used to examine all of the protein-coding regions of the genome.

Exome Sequencing Exome sequencing is a technique that is used to examine all of the protein-coding regions of the genome. Glossary of Terms Genetics is a term that refers to the study of genes and their role in inheritance the way certain traits are passed down from one generation to another. Genomics is the study of all

More information

Supplementary Note: Detecting population structure in rare variant data

Supplementary Note: Detecting population structure in rare variant data Supplementary Note: Detecting population structure in rare variant data Inferring ancestry from genetic data is a common problem in both population and medical genetic studies, and many methods exist to

More information

Haplotypes versus genotypes on pedigrees

Haplotypes versus genotypes on pedigrees RESEARCH Open Access Haplotypes versus genotypes on pedigrees Bonnie B Kirkpatrick 1,2 Abstract Background: Genome sequencing will soon produce haplotype data for individuals. For pedigrees of related

More information

Content Objectives Write these down!

Content Objectives Write these down! Content Objectives Write these down! I will be able to identify: Key terms associated with Mendelian Genetics The patterns of heredity explained by Mendel The law of segregation The relationship between

More information

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Phasing of 2-SNP Genotypes based on Non-Random Mating Model Phasing of 2-SNP Genotypes based on Non-Random Mating Model Dumitru Brinza and Alexander Zelikovsky Department of Computer Science, Georgia State University, Atlanta, GA 30303 {dima,alexz}@cs.gsu.edu Abstract.

More information

Basic Concepts of Human Genetics

Basic Concepts of Human Genetics Basic oncepts of Human enetics The genetic information of an individual is contained in 23 pairs of chromosomes. Every human cell contains the 23 pair of chromosomes. ne pair is called sex chromosomes

More information

Exam 1 Answers Biology 210 Sept. 20, 2006

Exam 1 Answers Biology 210 Sept. 20, 2006 Exam Answers Biology 20 Sept. 20, 2006 Name: Section:. (5 points) Circle the answer that gives the maximum number of different alleles that might exist for any one locus in a normal mammalian cell. A.

More information

What is genetic variation?

What is genetic variation? enetic Variation Applied Computational enomics, Lecture 05 https://github.com/quinlan-lab/applied-computational-genomics Aaron Quinlan Departments of Human enetics and Biomedical Informatics USTAR Center

More information

MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE

MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE MONTE CARLO PEDIGREE DISEQUILIBRIUM TEST WITH MISSING DATA AND POPULATION STRUCTURE DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information

Human linkage analysis. fundamental concepts

Human linkage analysis. fundamental concepts Human linkage analysis fundamental concepts Genes and chromosomes Alelles of genes located on different chromosomes show independent assortment (Mendel s 2nd law) For 2 genes: 4 gamete classes with equal

More information

Axiom mydesign Custom Array design guide for human genotyping applications

Axiom mydesign Custom Array design guide for human genotyping applications TECHNICAL NOTE Axiom mydesign Custom Genotyping Arrays Axiom mydesign Custom Array design guide for human genotyping applications Overview In the past, custom genotyping arrays were expensive, required

More information

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014 Single Nucleotide Variant Analysis H3ABioNet May 14, 2014 Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools GATK VCF File Format Let s call variants! Single Nucleotide

More information

Novel Variant Discovery Tutorial

Novel Variant Discovery Tutorial Novel Variant Discovery Tutorial Release 8.4.0 Golden Helix, Inc. August 12, 2015 Contents Requirements 2 Download Annotation Data Sources...................................... 2 1. Overview...................................................

More information

Linkage Analysis Computa.onal Genomics Seyoung Kim

Linkage Analysis Computa.onal Genomics Seyoung Kim Linkage Analysis 02-710 Computa.onal Genomics Seyoung Kim Genome Polymorphisms Gene.c Varia.on Phenotypic Varia.on A Human Genealogy TCGAGGTATTAAC The ancestral chromosome SNPs and Human Genealogy A->G

More information

Niemann-Pick Type C Disease Gene Variation Database ( )

Niemann-Pick Type C Disease Gene Variation Database (   ) NPC-db (vs. 1.1) User Manual An introduction to the Niemann-Pick Type C Disease Gene Variation Database ( http://npc.fzk.de ) curated 2007/2008 by Dirk Dolle and Heiko Runz, Institute of Human Genetics,

More information

GENETICS FOR GENEALOGY

GENETICS FOR GENEALOGY GENETICS FOR GENEALOGY Getting to Know Your DNA Relatives Bryant McAllister, PhD Associate Professor of Biology University of Iowa bryant-mcallister@uiowa.edu Des Moines, IA August 8, 2016 COURSE GOALS

More information

Human linkage analysis. fundamental concepts

Human linkage analysis. fundamental concepts Human linkage analysis fundamental concepts Genes and chromosomes Alelles of genes located on different chromosomes show independent assortment (Mendel s 2nd law) For 2 genes: 4 gamete classes with equal

More information

Polymorphisms in Population

Polymorphisms in Population Computational Biology Lecture #5: Haplotypes Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Oct 17 2005 L4-1 Polymorphisms in Population Why do we care about variations? Underlie

More information

QTL Mapping, MAS, and Genomic Selection

QTL Mapping, MAS, and Genomic Selection QTL Mapping, MAS, and Genomic Selection Dr. Ben Hayes Department of Primary Industries Victoria, Australia A short-course organized by Animal Breeding & Genetics Department of Animal Science Iowa State

More information

Algorithms for Computational Genetics Epidemiology

Algorithms for Computational Genetics Epidemiology Georgia State University ScholarWorks @ Georgia State University Computer Science Dissertations Department of Computer Science 9-11-2006 Algorithms for Computational Genetics Epidemiology Jingwu He Follow

More information

Introduction to Pharmacogenetics Competency

Introduction to Pharmacogenetics Competency Introduction to Pharmacogenetics Competency Updated on 6/2015 Pre-test Question # 1 Pharmacogenetics is the study of how genetic variations affect drug response a) True b) False Pre-test Question # 2 Pharmacogenetic

More information

On the Power to Detect SNP/Phenotype Association in Candidate Quantitative Trait Loci Genomic Regions: A Simulation Study

On the Power to Detect SNP/Phenotype Association in Candidate Quantitative Trait Loci Genomic Regions: A Simulation Study On the Power to Detect SNP/Phenotype Association in Candidate Quantitative Trait Loci Genomic Regions: A Simulation Study J.M. Comeron, M. Kreitman, F.M. De La Vega Pacific Symposium on Biocomputing 8:478-489(23)

More information

Complex Inheritance and Human Heredity

Complex Inheritance and Human Heredity Complex Inheritance and Human Heredity Before You Read Use the What I Know column to list the things you know about human heredity and genetics. Then list the questions you have about these topics in the

More information

Personal Use of the Genomic Data: Privacy vs. Storage Cost

Personal Use of the Genomic Data: Privacy vs. Storage Cost Personal Use of the Genomic Data: Privacy vs. Storage Cost Erman Ayday School of Comp. and Comm. Sciences EPFL, Lausanne, Switzerland Email: erman.ayday@epfl.ch Jean Louis Raisaro School of Comp. and Comm.

More information

Statistical Tools for Predicting Ancestry from Genetic Data

Statistical Tools for Predicting Ancestry from Genetic Data Statistical Tools for Predicting Ancestry from Genetic Data Timothy Thornton Department of Biostatistics University of Washington March 1, 2015 1 / 33 Basic Genetic Terminology A gene is the most fundamental

More information

Questions/Comments/Concerns/Complaints

Questions/Comments/Concerns/Complaints Reminder Exam #1 on Friday Jan 29 Lectures 1-6, QS 1-3 Office Hours: Course web-site Josh Thur, Hitchcock 3:00-4:00 (?) Bring a calculator Questions/Comments/Concerns/Complaints Practice Question: Product

More information

Evolutionary Computation. Lecture 1 January, 2007 Ivan Garibay

Evolutionary Computation. Lecture 1 January, 2007 Ivan Garibay Evolutionary Computation Lecture 1 January, 2007 Ivan Garibay igaribay@cs.ucf.edu Lecture 1 What is Evolutionary Computation? Evolution, Genetics, DNA Historical Perspective Genetic Algorithm Components

More information

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Justin S. Hogg Department of Computational Biology University of Pittsburgh Pittsburgh, PA 15213 jsh32@pitt.edu Abstract

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu January 29, 2015 Why you re here

More information

Name: Class: Biology Weekly Packet January th, 2013 Tuesday January 22, 2013

Name: Class: Biology Weekly Packet January th, 2013 Tuesday January 22, 2013 Name: Class: Biology Weekly Packet January 22-25 th, 2013 Tuesday January 22, 2013 Graphs The x- axis is horizontal and is the dependent variable. The y- axis is vertical and is the independent variable.

More information

Haplotype phasing in large cohorts: Modeling, search, or both?

Haplotype phasing in large cohorts: Modeling, search, or both? Haplotype phasing in large cohorts: Modeling, search, or both? Po-Ru Loh Harvard T.H. Chan School of Public Health Department of Epidemiology Broad MIA Seminar, 3/9/16 Overview Background: Haplotype phasing

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study Rui Wang, Yong Li, XiaoFeng Wang, Haixu Tang, Xiaoyong Zhou Indiana University Bloomington Bloomington,

More information

Topic 2: Population Models and Genetics

Topic 2: Population Models and Genetics SCIE1000 Project, Semester 1 2011 Topic 2: Population Models and Genetics 1 Required background: 1.1 Science To complete this project you will require some background information on three topics: geometric

More information

Inheritance (IGCSE Biology Syllabus )

Inheritance (IGCSE Biology Syllabus ) Inheritance (IGCSE Biology Syllabus 2016-2018) Key definitions Chromosome Allele Gene Haploid nucleus Diploid nucleus Genotype Phenotype Homozygous Heterozygous Dominant Recessive A thread of DNA, made

More information