Rationale of Genetic Studies

Some goals of genetic studies include:

• identifying the genetic causes of phenotypic variation
• developing genetic tests
    o benefits to individuals and to society are still uncertain
• drug development
    o finding genes responsible for a disease, or even a sub-type of a disease, provides valuable insight into how pathways could be targeted for drug development
    o identifying genetic profiles associated with adverse drug reactions

Data Explosion!

The amount of data available for use in genetic studies has exploded in the last decade. In the past few years we have seen the release of the first drafts of the 3-billion-base-pair human genome and the genomes of model organisms. In a recent build of the human genome, annotation data are available for approximately 32,000 genes, of which around 18,000 are confirmed. The typical confirmed human gene has 12 exons with an average length of 236 base pairs each, separated by introns with an average length of 5,478 base pairs. In addition, data are being generated daily on sequence variation between populations. More and more data are becoming available that quantify the expression of these genes at the mRNA and the protein level for a variety of tissues. As the genomes of more and more organisms are sequenced, we have unprecedented homology information between organisms.

The Need for Experimental Design and Statistics

With so much data and so many options, there is a pressing need for well-designed studies that incorporate genetic variation, along with accurate and efficient statistical methods for analyzing them.
Our goal for the quarter will be to study potential designs that incorporate genetic data and to learn the corresponding methods for analyzing data from these designs. Our goals in these tasks will be to:

    o understand the basic idea of each type of study
    o know the assumptions each type of analysis depends on for validity
    o understand the limitations of different types of studies
    o learn how to correctly interpret study results
Some Basic Terminology

I recommend Chapter 1 of the Sham text for a quick introduction to these fundamental concepts.

Biologists distinguish two types of cells, eukaryotic cells and prokaryotic cells. Eukaryotic cells differ from prokaryotic cells in that they contain many organelles, small membrane-bound structures inside the cell that carry out specialized functions. In particular, eukaryotic cells have a nucleus. Human beings and probably any animal that you might think of are eukaryotes. Bacteria are prokaryotes.

The nucleus in a eukaryotic cell contains most of the genetic material of the cell (and therefore the organism); the genetic material is encoded in DNA, which is packaged into chromosomes.

The centromere is the attachment site for the spindle fiber that moves the chromosome during cell division. The centromere defines two arms of the chromosome, the short arm p and the long arm q. Chromosomes can be telocentric (centromere at the end), acrocentric (centromere near one end), or metacentric (centromere near the middle).

Chromosomes come in pairs. Chromosomes within a pair carry the same set of genes and are called homologous. Chromosomes that carry different sets of genes are called nonhomologous. In humans, the pair that determines an individual's sex is called the sex chromosomes. All other chromosomes are referred to as autosomes.

Every species has its own characteristic number n of different chromosomes. Humans have 23 pairs of chromosomes: 22 pairs of autosomes and one pair of sex chromosomes. Therefore, there are 46 chromosomes in a human somatic cell. The autosomes are numbered 1-22 from largest to smallest (except that #22 is actually slightly larger than #21).

In humans, there are two sex chromosomes, X and Y. Females have two X chromosomes and males have one X and one Y. The mechanism of sex determination differs between species.
Mitosis is cell division that yields two identical diploid cells, which have two of each chromosome. Meiosis is a special type of cell division that happens in reproductive tissue, yielding haploid cells (which have one of each chromosome) called gametes. In females, the gametes are the egg cells and in males the gametes are the sperm cells.

Genetically, a chromosome is just a long string of DNA. DNA is a biochemical molecule, but quantitative scientists think of it more as information in some sense. We think of DNA as a long string of letters that come from a four-letter alphabet: A, T, G, C (Adenine, Thymine, Guanine, Cytosine).

DNA is a double-stranded molecule, with each strand made up of A's, T's, G's, and C's. A very important property of DNA is complementary base pairing between the two strands (see the figure on the next-to-last page): A and T always pair and G and C always pair. Complementary base pairing means that each single strand of DNA contains all the information for recreating the full double-stranded molecule.

[Figure: cell, nucleus, chromosome, DNA molecule, gene, nucleotides]

Some sub-strings of DNA encode a recipe. These substrings are genes. Specifically, a gene is a sequence of DNA that is transcribed into mRNA (messenger RNA), which, in turn, is translated into protein. Proteins are strings of amino acids. There are twenty different amino acids.
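The claim that one strand carries all the information needed to recreate the duplex can be illustrated with a minimal Python sketch. The function name `complement_strand` is hypothetical, and the sequence is the example duplex used in these notes; the sketch assumes the input contains only the letters A, T, G, and C.

```python
# Watson-Crick pairing rules: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the partner strand, base by base.

    The result is aligned position-for-position with the input, so if the
    input is written 5' to 3', the output reads 3' to 5' left to right.
    """
    return "".join(PAIR[base] for base in strand)


coding = "TGCATGCATGGTTGCA"            # written 5' to 3'
template = complement_strand(coding)   # reads 3' to 5' left to right
print(template)

# Applying the pairing rules twice recovers the original strand.
print(complement_strand(template) == coding)
```

Because the pairing rules are their own inverse, either strand can serve as the starting point for rebuilding the other.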
The genetic code is the codebook that gives the correspondence between DNA and protein. Every triplet of DNA bases (a codon) corresponds to a specific amino acid, or else signals START or STOP. The genetic code is almost universal across species.

[Figure: a gene on DNA with a promoter, exons I-III, and intervening introns is transcribed; splicing yields an mRNA containing exons I-III, which is translated into protein.]

Double-stranded DNA:
    5'...TGCATGCATGGTTGCA...3'    coding or sense strand
    3'...ACGTACGTACCAACGT...5'    template or anti-sense strand

Transcription reads the template strand from 3' to 5' to produce mRNA:
    5'...UGCAUGCAUGGUUGCA...3'    mRNA

Translation reads the mRNA from 5' to 3' to produce polypeptides:
    N-terminal...Cys Met His Gly Cys...C-terminal

1. Note that the coding strand is the one that is not used in transcribing the mRNA molecule.
2. In transcription, the template strand is read in the 3' to 5' direction to produce mRNA.
3. In translation, the mRNA is read from 5' to 3' to produce proteins.

A specific location on a chromosome, for instance the location of a gene, a SNP (single-nucleotide polymorphism), or another genetic marker, is a locus (plural: loci). There can be more than one form of a locus. These forms are called alleles. When there is more than one allele at a locus, the locus is said to be polymorphic.
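The transcription step described above can be sketched in a few lines of Python. The name `transcribe` is hypothetical, and the sketch ignores promoters, introns, and splicing; it only pairs each template base with its RNA partner, using the example duplex from these notes.

```python
# RNA base that pairs with each DNA template base (RNA uses U in place of T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Build the mRNA (written 5' to 3') from a template strand written 3' to 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


coding = "TGCATGCATGGTTGCA"     # 5'...3' coding (sense) strand
template = "ACGTACGTACCAACGT"   # 3'...5' template (anti-sense) strand

mrna = transcribe(template)
print(mrna)

# The mRNA matches the coding strand with every T replaced by U.
print(mrna == coding.replace("T", "U"))
```

This is why the coding strand can be read directly as the message: apart from U standing in for T, it has the same sequence as the mRNA.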
When two haploid gametes unite, the complete diploid number of chromosomes is reinstated. Each pair of homologous chromosomes in an individual therefore consists of one chromosome of maternal origin and one chromosome of paternal origin. Thus, for a given locus, an individual will have one allele of maternal origin and one allele of paternal origin. These define an individual's genotype. If an individual has two copies of the same allele, then that individual is homozygous at that locus. If an individual has two different alleles at a locus, then s/he is heterozygous.

Mendel's First Law states that the two members of a gene pair segregate (separate) from each other into the gametes, so that one-half of the gametes carry one member of the pair and the other one-half of the gametes carry the other member of the gene pair.

Gregor Mendel conducted pioneering work in genetics by performing breeding experiments in plants. It is useful to consider some experiments similar to Mendel's to become proficient in the basic concepts of genetics.

Here are some basic exercises that should help you master these background concepts.

1. It is known that about 22 percent of the double-stranded DNA of an organism consists of thymine. Can the other base percentages be determined? If so, what are they?

If T is 22% then A must also be 22% due to complementary base pairing. This accounts for 44% of the composition. C and G must then account for the remaining 56%, and since they must also be equal, each accounts for 28%.

2. Double-stranded DNA with 300 nucleotide pairs has a base composition of A=0.32, G=0.18, C=0.18, and T=0.32. Assume that a single strand of RNA is transcribed from this gene. Can you determine, from the information given, the base composition of the RNA? If so, what is it?

This cannot be determined from the information given, because the composition of the duplex does not fix the composition of either strand. The coding strand of DNA could, for example, consist entirely of A and G and the template strand entirely of C and T, or vice versa. These are extreme cases, but they show that the base composition of the RNA could vary wildly.

3. A certain DNA virus has a base ratio of (A+G)/(C+T)=0.85. Is this single- or double-stranded DNA? Explain.

It must be single-stranded. In double-stranded DNA, A=T and G=C, so the ratio would be 1.
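The arithmetic in exercises 1 and 3 can be checked with a short Python sketch. The function names `duplex_composition` and `purine_pyrimidine_ratio` are hypothetical, and the sketch assumes ideal double-stranded DNA in which A = T and G = C exactly.

```python
def duplex_composition(t_fraction):
    """Base fractions of double-stranded DNA given the thymine fraction.

    Complementary pairing forces A = T and G = C, so whatever is left
    after A and T is split evenly between G and C.
    """
    gc_each = (1.0 - 2.0 * t_fraction) / 2.0
    return {"A": t_fraction, "T": t_fraction, "G": gc_each, "C": gc_each}


def purine_pyrimidine_ratio(comp):
    """(A+G)/(C+T) for a composition given as a dict of base fractions."""
    return (comp["A"] + comp["G"]) / (comp["C"] + comp["T"])


# Exercise 1: 22% thymine in double-stranded DNA.
comp = duplex_composition(0.22)
print({base: round(frac, 2) for base, frac in comp.items()})

# Exercise 3: any double-stranded composition gives a ratio of 1,
# so an observed ratio of 0.85 points to single-stranded DNA.
print(round(purine_pyrimidine_ratio(comp), 2))
```

For a thymine fraction of 0.22 this reproduces the 28% figure for G and C worked out in exercise 1.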
4. Consider a DNA triplet pair:

    3' GTC 5'
    5' CAG 3'

where the top strand is the template strand from which mRNA is transcribed. Which amino acid does the triplet code for?

We read the coding strand from 5' to 3' to see that the codon is CAG, which codes for glutamine.

5.
    5'...TCGTTTAAGGGCTTGTGCGCCACGGAT...3'    coding strand
    3'...AGCAAATTCCCGAACACGCGGTGCCTA...5'    template strand

(Positions 1, 2, and 3 are marked in that order along the sequence, from 5' to 3'.)

(a) What are the first three amino acids in the sequence?

Ser Phe Lys

(b) A base is added as the result of exposure to acridine dye (this is called a frameshift mutation). At which position (2 or 3) would it likely have the most damaging effect on the gene product? Explain.

Since translation happens in the 5' to 3' direction, an added base at position 2 is likely more damaging, since it would shift the reading frame of more codons.

(c) The base guanine is added at position 1. What effect would it have on the gene product?

The new sequence would be TCG TGT TAA, or in mRNA form UCG UGU UAA, which would code for Ser Cys STOP. Therefore, the second amino acid is Cys instead of Phe and translation stops prematurely.
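The answers to exercises 4 and 5 can be checked with a minimal Python sketch of translation. The names `CODON_TABLE` and `translate` are hypothetical. The sketch assumes the standard genetic code, reads codons starting at the first base of the coding strand (as the exercise does, rather than searching for a START codon), and reports amino acids by their one-letter codes. The insertion index used for position 1 is inferred from the worked answer to part (c).

```python
# Standard genetic code, with codons enumerated in U, C, A, G order.
# "*" marks a STOP codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(coding_5to3):
    """Translate a DNA coding strand (5' to 3') into one-letter amino acids.

    The mRNA equals the coding strand with T replaced by U. Reading starts
    at the first base and halts at the first STOP codon.
    """
    mrna = coding_5to3.replace("T", "U")
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Exercise 4: the coding-strand triplet CAG.
print(translate("CAG"))            # Q = glutamine

# Exercise 5(a): the first three residues are S, F, K (Ser Phe Lys).
coding = "TCGTTTAAGGGCTTGTGCGCCACGGAT"
print(translate(coding)[:3])

# Exercise 5(c): insert G after the fourth base (assumed location of
# position 1), giving TCG TGT TAA, i.e. Ser Cys STOP.
mutated = coding[:4] + "G" + coding[4:]
print(translate(mutated))
```

The single inserted base changes every downstream codon, which is why an insertion nearer the 5' end, as in part (b), tends to be more damaging.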
The RNA Codons

First                     Second nucleotide                          Third
nucleotide   U           C           A            G                  nucleotide

U            UUU Phe     UCU Ser     UAU Tyr      UGU Cys            U
             UUC Phe     UCC Ser     UAC Tyr      UGC Cys            C
             UUA Leu     UCA Ser     UAA STOP     UGA STOP           A
             UUG Leu     UCG Ser     UAG STOP     UGG Trp            G

C            CUU Leu     CCU Pro     CAU His      CGU Arg            U
             CUC Leu     CCC Pro     CAC His      CGC Arg            C
             CUA Leu     CCA Pro     CAA Gln      CGA Arg            A
             CUG Leu     CCG Pro     CAG Gln      CGG Arg            G

A            AUU Ile     ACU Thr     AAU Asn      AGU Ser            U
             AUC Ile     ACC Thr     AAC Asn      AGC Ser            C
             AUA Ile     ACA Thr     AAA Lys      AGA Arg            A
             AUG Met     ACG Thr     AAG Lys      AGG Arg            G
             (or START)

G            GUU Val     GCU Ala     GAU Asp      GGU Gly            U
             GUC Val     GCC Ala     GAC Asp      GGC Gly            C
             GUA Val     GCA Ala     GAA Glu      GGA Gly            A
             GUG Val     GCG Ala     GAG Glu      GGG Gly            G

Abbreviations: Ala = Alanine, Arg = Arginine, Asn = Asparagine, Asp = Aspartic acid, Cys = Cysteine, Gln = Glutamine, Glu = Glutamic acid, Gly = Glycine, His = Histidine, Ile = Isoleucine, Leu = Leucine, Lys = Lysine, Met = Methionine, Phe = Phenylalanine, Pro = Proline, Ser = Serine, Thr = Threonine, Trp = Tryptophan, Tyr = Tyrosine, Val = Valine.