Algorithms for Genetics: Introduction, and sources of variation


Scribe: David Dean
Instructor: Vineet Bafna

1 Terms

Genotype: the genetic makeup of an individual. For example, at a particular gene an individual may have a heterozygous genotype Aa, or a homozygous genotype AA or aa.

Phenotype: a measurable trait of an organism, often due at least in part to genetic variation. A phenotype may refer to a common trait such as height, the presence of a particular disease, or some other measurable biological characteristic.

Gene: a region of an organism's genome that codes for inherited biological traits. Some genes play a critical role in the development of disease; for example, the ApoE4 variant of the ApoE gene is associated with Alzheimer's disease.

Allele: a specific genetic variant at a location. For example, the locus for the ApoE gene has 3 major variants, or alleles: ApoE2, ApoE3, and ApoE4.

Locus: the location of an allele; can refer to a nucleotide position, a genetic marker, a gene, or a chromosomal segment. For example, 19q13.2 refers to a particular location on chromosome 19.

Ploidy: the number of copies of each chromosome contained in the somatic (non-gamete) cells of a species. In humans and most other animal species, the somatic cells are usually diploid, meaning they have 2 copies of each chromosome, whereas the gamete cells are haploid and have a single copy of each chromosome. Some plant and animal species have more than 2 copies of each chromosome, which is called polyploidy.

Haplotype: a particular combination of alleles in an individual that are located on a single chromosome. For example, an individual may carry the sequence of alleles DEf on one chromosome and a different sequence, DeF, on the homologous chromosome.

2 Sources of Variation

A number of mechanisms introduce variation into a population. The main sources are described below.
Point Mutations: small-scale mutational events.

* The mutation rate in humans is low, estimated to be about $10^{-8}$ per base pair per generation. Point mutations arise from errors in DNA replication and from exposure to mutagens, such as UV or ionizing radiation.

* The infinite sites assumption states that each site undergoes at most one mutation over the course of human evolution. Perhaps the biggest implication of this assumption is that it enables a phylogenetic record of evolutionary history to be constructed. For example, SNPs from mitochondrial DNA, which is inherited only from the mother and does not recombine, can be analyzed to construct the maternal ancestry of an individual in a population.

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA

* A single nucleotide polymorphism, or SNP, is a point mutation for which both the wild-type allele and the mutated allele are known. Essentially, a SNP is a single nucleotide position that is known to have mutated from one base to another at some point during evolution. Given the infinite sites assumption, a convenient way to view and analyze SNPs from population data is a binary matrix, with individuals as rows and variant sites as columns. Each individual is represented by 1 or 2 rows, which hold the alleles carried on one or both of their homologous chromosomes. By convention, 0 represents the ancestral allele and 1 the mutated allele, when it is known which allele is ancestral.

* Short tandem repeats (STRs) are regions of DNA where a short DNA sequence is repeated a variable number of times. At a given locus, different individuals can have different numbers of repeats.

  [Figure: example alleles with a variable number of repeats of the sequence ATC]

  To create a DNA fingerprint, a set of STR loci is chosen such that the measured repeat lengths together form a unique identifier for an individual. Enough STR loci are chosen to make it extremely unlikely that two people have the same DNA fingerprint. A system used by the FBI, the Combined DNA Index System (CODIS), performs a form of DNA fingerprinting that examines 13 core loci with variable numbers of STRs. The alleles at these loci are generally inherited independently, so the probability of a particular combination of STR values can be found by multiplying the probabilities of the STR value observed at each locus. This procedure creates a DNA fingerprint so specific that the probability of two individuals having the same fingerprint is less than [value missing]. One of the few cases where this breaks down is identical twins.
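The product rule described above for STR fingerprints can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `match_probability` and the per-locus genotype frequency of 0.1 are hypothetical and are not taken from CODIS or any real population data.

```python
from math import prod


def match_probability(genotype_freqs):
    """Probability that a random individual matches a profile at every locus.

    genotype_freqs holds, for each locus, the population frequency of the
    genotype observed at that locus. The loci are assumed to be inherited
    independently, so the per-locus frequencies are simply multiplied.
    """
    for f in genotype_freqs:
        if not 0.0 <= f <= 1.0:
            raise ValueError("each frequency must lie in [0, 1]")
    return prod(genotype_freqs)


# Hypothetical profile: 13 loci, each observed genotype has frequency 0.1.
profile = [0.1] * 13
p_match = match_probability(profile)
print(f"random match probability: {p_match:.1e}")
```

Even with a fairly common genotype at every locus, the product over 13 independent loci becomes very small, which is why a modest number of loci suffices for identification.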

Structural Variation: large-scale mutation events.

* A structural variation is a large section of DNA that is inserted, deleted, or inverted in a genome. These large-scale genetic changes can cause disease, including certain cancers.

* To detect structural variations experimentally, we would compare regions of an individual's genome to the corresponding regions of a reference human genome. The haplotype map provided by the International HapMap Consortium is an example of a human genome resource that could be used. A karyotype of whole chromosomes can also reveal large structural changes to a chromosome.

  [Figure: karyotype; the chromosomes are ordered from largest to smallest]

Recombination: variation due to crossover.

* Recombination events are caused by crossing-over between homologous chromosomes during meiosis (the cell division that produces gametes). This mixes genetic material between the two chromosomes. DNA recombination can also refer to artificial recombination of DNA performed in a biology lab, for example to insert a gene into an E. coli bacterium.

* The typical recombination rate for humans is similar to the mutation rate, estimated to be about $10^{-8}$ per base pair per generation.

* Not all human DNA recombines. In particular, mitochondrial DNA (inherited from the mother only) and the Y chromosome (inherited from the father only) do not recombine.

Gene conversion: variation due to one-directional transfer between homologous chromosomes.

* During gene conversion, a gene on one chromosome is copied over the homologous gene on the other chromosome, leaving the first chromosome unchanged. This is similar to recombination in that genetic material moves from one chromosome to another. However, in recombination DNA is exchanged between the two chromosomes, whereas in gene conversion only one of the chromosomes is changed.

3 Equilibria

Population geneticists study the entirety of variations (genotypes) and their consequences for phenotypes. As variations arise and disappear within a population, they give rise to many equilibria under neutral conditions. An important goal in population genetics is to find regions that are not in equilibrium and to investigate the cause of the departure.

Hardy-Weinberg equilibrium: an equilibrium of allele frequencies.

* The Hardy-Weinberg (HW) equilibrium is defined as follows. Suppose a set of assumptions is met (including large population size, random mating, no natural selection, etc.). Then for a locus with two alleles, A and a, with frequencies $p$ and $q$ (where $p + q = 1$), the frequencies of the 3 possible genotypes are $p^2$ (for AA), $2pq$ (for Aa), and $q^2$ (for aa).

* The HW equilibrium extends to multiple alleles with frequencies $p_i$, $i = 1, 2, \ldots, k$. The frequency of a homozygous genotype carrying two copies of allele $i$ is $p_i^2$, and the frequency of a heterozygous genotype carrying alleles $i$ and $j$ ($i \neq j$) is $2 p_i p_j$.

* The HW equilibrium can also be extended to multiple loci. If the alleles at different loci are not linked (e.g. the loci are on different chromosomes), then the frequency of a combined genotype is simply the product of the frequency of the genotype at one locus and the frequency of the genotype at the other locus, because the two are independent. If the loci are linked, one needs to know the probability that each combination of alleles is inherited together.

  For example, consider 2 loci with 2 alleles each, and label the 4 alleles A, a, B, and b. If we know the probability of these alleles being inherited together (i.e. P(AB), P(Ab), P(aB), and P(ab)), then the four combinations can be treated as multiple alleles at a single locus, and the HW equilibrium for multiple alleles at a single locus applies as described above.

* If we assume a population of infinite size with random mating, the allele frequency does not change from generation to generation. It remains constant over time because the inheritance of alleles follows the laws of probability. Going back to our simple example of two alleles at a single locus, A and a, with frequencies $p$ and $q$, the population has genotype frequencies $p^2$ (for AA), $2pq$ (for Aa), and $q^2$ (for aa). An AA individual always transmits A, an Aa individual transmits A half of the time, and an aa individual never does, so the frequency of A expected in the next generation is:

  $p_{\text{next}} = (1)\,p^2 + (0.5)\,2pq + (0)\,q^2$
  $p_{\text{next}} = p^2 + pq$
  $p_{\text{next}} = p^2 + p(1 - p)$
  $p_{\text{next}} = p^2 + p - p^2$
  $p_{\text{next}} = p$

  The infinite population size lets any deviations from the expected frequency average out. In finite populations, and especially in small ones, allele frequencies can vary randomly from generation to generation due to a sampling effect. This effect is called genetic drift.

* Example: Phenylketonuria

  Phenylketonuria (PKU) is a disease caused by an autosomal recessive allele, with an observed frequency of 1 in 10,000 Caucasians. With this information, we can use the HW equilibrium to calculate the frequency of the allele in the population and the percentage of the population who are carriers of the allele. If we define $q$ to be the frequency of the recessive allele (a), then the disease genotype (aa) should occur with frequency $q^2$. Setting $q^2 = 1/10000$ gives $q = 1/100$. So the frequency of the allele in the population is 1/100.
  To calculate the percentage of carriers, we need the frequency of the genotype Aa, which is $2pq = 2(99/100)(1/100) = 0.0198$. So about 2% of the population are carriers.

* Example: Red-green colorblindness

  Males are 100 times more likely than females to have the red type of color blindness. Males are much more likely to have this form of colorblindness because the genetic mutation occurs on the X chromosome. The disease mutation is recessive, so female carriers with one copy of the mutation do not develop the phenotype. Males have only a single copy of the X chromosome, so they develop the phenotype whenever the mutation is present. With this information, we can use the HW equilibrium to calculate the frequency of the disease allele in the population. Let $q$ be the frequency of the recessive allele. For men, the frequency of the disease phenotype is simply the frequency of the recessive allele, $q$. Women, who have 2 X chromosomes, develop the disease phenotype with frequency $q^2$. Knowing that men are 100 times more likely to have red-green colorblindness allows us to calculate the frequency of this allele:

  $q = 100\,q^2$, so $q = 1/100$.

Linkage (dis-)equilibrium (LD): describes the correlation of allelic values across multiple loci.

* Linkage disequilibrium (LD) is a measure of correlation, or departure from independence, in the inheritance of alleles at different loci. With high recombination rates, or when examining loci on different chromosomes, the probability of two alleles being inherited together is simply the product of the probabilities of each allele being inherited independently. This is referred to as linkage equilibrium. With low recombination rates, or when the loci are very close to each other on a chromosome, the probability of two alleles being inherited together differs from the product of the probabilities of each allele being inherited independently. This difference is the measure of linkage disequilibrium.

* Measures of LD: $D$, $D'$, $\rho$. Write $P_{xy}$ for the frequency of haplotypes carrying allele $x$ at the first locus and allele $y$ at the second, $P_{x\cdot}$ for the frequency of allele $x$ at the first locus, and $P_{\cdot y}$ for the frequency of allele $y$ at the second locus. Then

  $D = P_{00} - P_{0\cdot} P_{\cdot 0}$

  $D_{\max} = \min\{P_{0\cdot} P_{\cdot 1},\; P_{1\cdot} P_{\cdot 0}\}$ if $D > 0$, and $D_{\max} = \min\{P_{0\cdot} P_{\cdot 0},\; P_{1\cdot} P_{\cdot 1}\}$ if $D < 0$

  $D' = \dfrac{P_{00} - P_{0\cdot} P_{\cdot 0}}{D_{\max}}$

  $\rho = \dfrac{D}{\sqrt{P_{0\cdot} P_{1\cdot} P_{\cdot 0} P_{\cdot 1}}}$

* Extra Credit: Compare LD with other measures of correlation between loci, such as the correlation coefficient and Hamming distance.

* LD is known to vary with the distance between the loci: LD decays exponentially as the distance between the two loci increases. As a first approximation, the recombination rate can be assumed to increase linearly with distance; however, recombination rates are known to vary from region to region.

* Similarly, there is a relationship between LD and time: LD decays exponentially as time increases. As time moves forward, recombination events erode LD until it disappears completely (i.e. linkage equilibrium is reached).

* LD can be used for gene mapping by exploiting the fact that LD varies with the distance between two loci. Instead of measuring the LD between two loci, we replace one of the loci with a vector of disease diagnoses (recording the presence or absence of a disease in each individual). Then, by measuring the LD between this diagnosis vector and SNPs throughout the genome, the location of a possible disease gene can be inferred from its high LD values.

HapMap consortium

* Extra Credit: Describe the goals of the HapMap project. Read through the paper and describe a few of the conclusions.
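The Hardy-Weinberg calculations in Section 3 (genotype frequencies for several alleles, the constancy of allele frequency across generations, and the PKU carrier frequency) can be checked numerically. The following is a minimal sketch; the function names `hw_genotype_frequencies` and `next_generation_allele_frequency` are made up for illustration and do not come from the notes.

```python
from itertools import combinations_with_replacement
from math import isclose, sqrt


def hw_genotype_frequencies(allele_freqs):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    allele_freqs maps allele name -> frequency; the frequencies must sum to 1.
    Returns a dict mapping an (allele, allele) pair to its expected frequency:
    p_i^2 for homozygotes and 2 * p_i * p_j for heterozygotes.
    """
    if not isclose(sum(allele_freqs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    genotypes = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        pa, pb = allele_freqs[a], allele_freqs[b]
        genotypes[(a, b)] = pa * pa if a == b else 2 * pa * pb
    return genotypes


def next_generation_allele_frequency(genotype_freqs, allele):
    """Allele frequency among transmitted gametes under random mating.

    A genotype passes on the allele with probability (copies carried) / 2.
    """
    return sum(f * pair.count(allele) / 2 for pair, f in genotype_freqs.items())


# PKU example: disease genotype aa has frequency q^2 = 1/10000.
q = sqrt(1 / 10000)
genotypes = hw_genotype_frequencies({"A": 1 - q, "a": q})
carriers = genotypes[("A", "a")]
q_next = next_generation_allele_frequency(genotypes, "a")
print(f"q = {q:.4f}, carriers = {carriers:.4f}, q next generation = {q_next:.4f}")
```

The same function handles more than two alleles, for example the three ApoE alleles, by passing a dictionary with three entries.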
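The LD measures $D$, $D'$, and $\rho$ can be computed directly from a list of two-locus haplotypes coded with 0 and 1. The sketch below is an illustration, not the course's reference implementation: the function name `ld_measures` and the toy data are hypothetical, and $D'$ uses the standard Lewontin normalization, in which $D_{\max}$ is the largest magnitude $D$ can take given the allele frequencies and the sign of $D$.

```python
from math import sqrt


def ld_measures(haplotypes):
    """Return (D, D_prime, rho) for two biallelic loci.

    haplotypes is a list of (x, y) pairs, where x and y are the 0/1 alleles
    carried at the first and second locus on one chromosome.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p00 = sum(1 for x, y in haplotypes if x == 0 and y == 0) / n
    p0_first = sum(1 for x, _ in haplotypes if x == 0) / n   # P_{0.}
    p0_second = sum(1 for _, y in haplotypes if y == 0) / n  # P_{.0}
    p1_first = 1 - p0_first                                  # P_{1.}
    p1_second = 1 - p0_second                                # P_{.1}

    variance_product = p0_first * p1_first * p0_second * p1_second
    if variance_product == 0:
        raise ValueError("both loci must be polymorphic")

    d = p00 - p0_first * p0_second
    if d >= 0:
        d_max = min(p0_first * p1_second, p1_first * p0_second)
    else:
        d_max = min(p0_first * p0_second, p1_first * p1_second)
    d_prime = d / d_max
    rho = d / sqrt(variance_product)
    return d, d_prime, rho


# Toy data: alleles at the two loci always travel together (complete LD).
linked = [(0, 0)] * 5 + [(1, 1)] * 5
# Toy data: all four haplotypes equally common (linkage equilibrium).
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_measures(linked))
print(ld_measures(unlinked))
```

For gene mapping as described above, the second coordinate of each pair could hold a 0/1 disease diagnosis instead of an allele, with one call per SNP across the genome.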