Forensics and DNA Statistics
Harry R Erwin, PhD
CIS308
Faculty of Applied Sciences
University of Sunderland

References
Goodwin, Linacre, and Hadi (2007) An Introduction to Forensic Genetics, Wiley.
Butler (2005) Forensic DNA Typing, 2nd edition, Elsevier.

Statistics and DNA
According to Butler, statistical genetic information is often more difficult for DNA analysts to grasp than the technology and biology issues because of its heavy use of mathematics, particularly algebra. The concepts of probabilities can be challenging to forensic scientists schooled in biology rather than mathematics.
The implication is that you may need to provide the necessary expertise. 8(

Lecture Plan
Review
STR population database analyses
Profile frequency estimates, likelihood ratios, and source attribution
Approaches to statistical analysis of mixtures and degraded DNA
Kinship and parentage testing

Review: What to Remember
Probability
  Laws of probability
  Likelihood ratios
  Bayesian statistics
Statistics
  Hypothesis testing
  Chi-square test
  Confidence intervals
  Randomization tests

Introduction
There are three possible outcomes of a DNA test:
1. No match
2. Inconclusive
3. Match
Only a match requires statistics to provide meaning. Which statistics to apply is debatable.

Laws of probability
Probability: the number of times an event occurs divided by the number of opportunities for it to occur.
Three laws of probability to remember:
1. Probabilities range between 0.0 and 1.0.
2. If two events are mutually exclusive, the probability of either taking place is the sum of their probabilities.
3. If two events are independent, the probability of both occurring is the product of their individual probabilities.

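As a rough illustration of the second and third laws, the sketch below uses made-up allele frequencies at a single locus. The allele names and values are assumptions for the example, not data from any population database.

```python
# Hypothetical allele frequencies at one locus (illustrative values only).
allele_freq = {"12": 0.30, "13": 0.25, "14": 0.45}

# Law 1: every probability lies between 0.0 and 1.0.
assert all(0.0 <= p <= 1.0 for p in allele_freq.values())

# Law 2 (mutually exclusive events): one randomly drawn allele is a 12 OR a 13.
p_12_or_13 = allele_freq["12"] + allele_freq["13"]

# Law 3 (independent events): two independent draws are BOTH allele 14.
p_14_and_14 = allele_freq["14"] * allele_freq["14"]

print(round(p_12_or_13, 4), round(p_14_and_14, 4))
```

The product in the last step is only valid if the two draws really are independent, which is the assumption later slides spend some time questioning.
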
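Likelihood ratios
A Likelihood Ratio (LR) is the comparison of the probabilities of the evidence, E, under two alternative (mutually exclusive) hypotheses:
The Null Hypothesis, and
The Alternative Hypothesis.
In casework these are usually framed as a prosecution hypothesis, $H_p$, and a defence hypothesis, $H_d$.
These hypotheses should cover all cases.
$LR = \Pr(E \mid H_p) / \Pr(E \mid H_d)$
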
Bayesian statistics
Posterior odds = (Likelihood ratio) × (prior odds)
$\Pr(H_p \mid E) / \Pr(H_d \mid E) = LR \times \Pr(H_p) / \Pr(H_d)$

Verbal terminology for likelihood ratios

Likelihood Ratio     Verbal Equivalent
1 to 10              Limited support for the prosecution hypothesis
10 to 100            Moderate support for the prosecution hypothesis
100 to 1000          Moderately strong support for the prosecution hypothesis
1000 to 10000        Strong support for the prosecution hypothesis
10000 to 100000      Very strong support for the prosecution hypothesis

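The odds update and the verbal scale can be sketched in a few lines. This is a minimal illustration only: the prior odds and the LR in the example are invented, and treating each band of the verbal table as a half-open range is an assumption made here so that no LR falls into two bands.

```python
# Verbal scale from the slide; the half-open [low, high) bands are an
# assumption of this sketch, not something the slide specifies.
VERBAL_SCALE = [
    (1, 10, "Limited support for the prosecution hypothesis"),
    (10, 100, "Moderate support for the prosecution hypothesis"),
    (100, 1000, "Moderately strong support for the prosecution hypothesis"),
    (1000, 10000, "Strong support for the prosecution hypothesis"),
    (10000, 100000, "Very strong support for the prosecution hypothesis"),
]


def posterior_odds(likelihood_ratio, prior_odds):
    """Posterior odds = likelihood ratio * prior odds."""
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


def verbal_equivalent(likelihood_ratio):
    """Return the verbal label for an LR, or None outside the tabulated range."""
    for low, high, label in VERBAL_SCALE:
        if low <= likelihood_ratio < high:
            return label
    return None


prior = 1 / 1000   # hypothetical prior odds
lr = 5000          # hypothetical likelihood ratio
post = posterior_odds(lr, prior)
print(post, odds_to_probability(post), verbal_equivalent(lr))
```

Note that the same LR gives very different posterior odds under different priors, which is why the LR and the prior are kept separate.
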
Fallacies to avoid
Prosecutor's fallacy
Defendant's fallacy

Statistics
Statistics measures uncertainty and reliability.
A population is the set of objects of interest.
A sample is an observable subset of a population.
A statistic is some observable property of a sample.

Hypothesis testing
1. Choose two alternative hypotheses, $H_0$ and $H_1$.
2. Select an appropriate statistical model.
3. Specify the level of significance and its critical value, C.
4. Collect data and calculate the statistic.
5. Check the region of rejection for the statistic.
6. Reject? No: accept $H_0$. Yes: accept $H_1$.

Chi-square test
A goodness-of-fit test.
Answers: how close do the observations come to the expected results?
The $\chi^2$ statistic is parameterised by degrees of freedom, df, and large values indicate there's a significant deviation from theory.

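A minimal sketch of the goodness-of-fit calculation follows. The observed counts and the 1:2:1 expectation are invented for illustration. The critical value 5.991 is the standard tabulated value for df = 2 at the 5% level, and df = 2 assumes no parameters were estimated from the data.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in three categories, compared with a 1:2:1 expectation.
observed = [30, 50, 20]
expected = [25, 50, 25]

chi2 = chi_square_statistic(observed, expected)

# Three categories, nothing estimated from the data: df = 3 - 1 = 2.
CRITICAL_5_PERCENT_DF2 = 5.991
significant = chi2 > CRITICAL_5_PERCENT_DF2

print(chi2, significant)
```

If expected counts are derived from frequencies estimated on the same sample, one degree of freedom is lost per estimated parameter and a different critical value applies.
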
Confidence intervals
Usually the sample mean plus and minus two standard deviations. For roughly normal data about 95% of observations fall inside that interval, so an observation outside it is expected only about 5% of the time.
Other confidence intervals can be defined.
These are used to help visualise measurements against a population.

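The "mean plus and minus two standard deviations" rule can be sketched with the standard library. The measurements are invented, and the rule is only a reasonable guide when the data are roughly normal.

```python
import statistics

# Hypothetical repeated measurements (illustrative values only).
measurements = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]

mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)  # sample standard deviation (n - 1)

lower, upper = mean - 2 * sd, mean + 2 * sd


def is_outside(value):
    """True if a new observation falls outside mean +/- 2 standard deviations."""
    return value < lower or value > upper


print(round(lower, 3), round(upper, 3), is_outside(11.0), is_outside(10.1))
```
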
Randomization tests
These explore whether collecting the data differently would affect the results.
Usually starts by treating the collected data as representative of the population, and permuting it, leaving samples out, or randomly resampling it multiple times to see the range of descriptive statistics.
Get a computational statistician involved if these questions come up.
The tools are available in R to do these kinds of analyses.
Keywords: resampling, bootstrap, jackknife

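A bootstrap, one of the resampling approaches named above, might look roughly like this. The sketch is in Python rather than R, the data are invented, and the percentile interval is just one of several ways to summarise the resampled statistics.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Resample the data with replacement and record the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Crude percentile interval taken from the sorted resampled statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical measurements (illustrative values only).
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]

means = bootstrap_means(data)
low, high = percentile_interval(means)
print(round(low, 3), round(high, 3))
```

A jackknife would instead leave out one observation at a time, and a permutation test would shuffle group labels. All three treat the collected sample as a stand-in for the population.
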
Principles of Population Genetics
Laws of genetics
Number of alleles and number of possible genotypes

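The relationship between the number of alleles and the number of possible genotypes at a locus is simple counting: with n alleles there are n homozygotes and n(n - 1)/2 heterozygotes. A small sketch, with a function name chosen here for illustration:

```python
def genotype_counts(n_alleles):
    """Return (homozygotes, heterozygotes, total genotypes) for one locus."""
    homozygotes = n_alleles
    heterozygotes = n_alleles * (n_alleles - 1) // 2
    return homozygotes, heterozygotes, homozygotes + heterozygotes


print(genotype_counts(10))
```
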
Populations
What is a population? A group of people sharing common ancestry. Usually defined broadly.
Hardy-Weinberg Equilibrium (HWE)
Within a randomly mating population, the genotype frequencies at any single genetic locus will remain constant. This allows genotype frequencies to be predicted from allele frequencies. (See Punnett Square.)
All human populations deviate (mildly) from HWE and your statistics will require (mild) corrections.

Punnett Square

                 Father: A (p)    Father: a (q)
Mother: A (p)    AA ($p^2$)       Aa ($pq$)
Mother: a (q)    Aa ($qp$)        aa ($q^2$)

Resulting genotype frequencies: AA $p^2$, Aa $2pq$, aa $q^2$
Note the following:
$p + q = 1.0$
The fitness of the alleles (A and a) must be equal in the population. This usually is the result of hybrid vigor, where the heterozygote has an advantage over both homozygotes.

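The Punnett square translates directly into a calculation of expected genotype frequencies from an allele frequency. The sketch below assumes a locus with just two alleles and an invented value of p.

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies at a two-allele locus under HWE."""
    q = 1.0 - p  # p + q = 1.0
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hwe_genotype_frequencies(0.7)  # hypothetical frequency of allele A
print({genotype: round(f, 4) for genotype, f in freqs.items()})
```

The three frequencies always sum to one, since $p^2 + 2pq + q^2 = (p + q)^2$.
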
Deviations from HWE in Human Populations
Finite populations produce random genetic drift; not an issue for populations larger than a small town.
Non-random mating is not believed to affect the STR loci.
Migration effects disappear over a period of several generations.
Natural selection is not believed to affect the STR loci.
Mutation rates at ~0.2%/generation are not likely to affect allelic frequencies.

STR Population Database Analyses
Population DNA databases
Statistical tests on DNA databases
Practical considerations

Creating a Population DNA Database
Not for amateurs
Need >100 samples per local population group
Often uses anonymous samples from a blood bank (watch for sampling effects)
Analysis: use appropriate STR kits
Determine allele frequencies at each locus (note sampling bias issues)
Check HWE
Note the potential existence of non-interbreeding populations

Statistical tests on DNA databases
There are a number of computer programmes available to evaluate the usefulness of a DNA database.
Consider using DNATYPE first of all.
Need to test for independence of alleles at each genetic locus and between loci.
Unfortunately, independence testing does not validate the product rule 8(
Compare to other population data sets.
Watch for population substructure.

Programmes Available
DNATYPE
PowerStats
GDA
GENEPOP
DNA VIEW
ARLEQUIN
PowerMarker
PopStats
TFPGA

Practical considerations
Watch these journals for population data:
"For the Record" articles in Journal of Forensic Sciences
"Announcements of Population Data" in Forensic Science International
Understand the numbers reported.
Understand why the markers in use have been chosen.
Understand what the most common and rarest genotypes are for the DNA markers in use.

Frequency Estimates, Likelihood Ratios, and Source Attribution
Frequency estimate calculations
Likelihood ratio
Source attribution
Other topics

Frequency Estimate Calculations
Work through a frequency estimate calculation. Take a DNA profile and use the allele frequencies in a population database.
A random match probability is not the probability that someone is guilty or that someone else left the biological material.
Understand how rare alleles and tri-allelic patterns are handled.
Understand the product rule.
Understand the differences between population databases.
Understand the impact of population structure.
Understand the impact of relatives.

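A frequency estimate under the product rule can be sketched as follows. The locus names are real STR markers, but the profile and every allele frequency are invented for the example. The sketch assumes HWE within loci and independence between loci, and applies none of the corrections for rare alleles, population structure, or relatives.

```python
# Hypothetical allele frequencies (NOT taken from any real database).
ALLELE_FREQ = {
    "D3S1358": {"15": 0.25, "16": 0.25},
    "vWA": {"17": 0.30},
    "FGA": {"21": 0.20, "24": 0.10},
}

# Hypothetical profile: one genotype (pair of alleles) per locus.
PROFILE = {
    "D3S1358": ("15", "16"),
    "vWA": ("17", "17"),
    "FGA": ("21", "24"),
}


def genotype_frequency(freqs, allele_a, allele_b):
    """HWE genotype frequency: p^2 for a homozygote, 2pq for a heterozygote."""
    if allele_a == allele_b:
        return freqs[allele_a] ** 2
    return 2 * freqs[allele_a] * freqs[allele_b]


def random_match_probability(profile, allele_freq):
    """Product rule: multiply the genotype frequencies across loci."""
    rmp = 1.0
    for locus, (allele_a, allele_b) in profile.items():
        rmp *= genotype_frequency(allele_freq[locus], allele_a, allele_b)
    return rmp


rmp = random_match_probability(PROFILE, ALLELE_FREQ)
print(rmp, round(1 / rmp))
```

Running the same profile against a different database changes every factor in the product, which is one way to see how much the choice of population database matters.
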
Likelihood ratio
Practice quantifying the evidentiary value of a match between a reference sample, K, and a questioned sample, Q.
Explore likelihood ratios.

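For a single-source match, the simplest pair of hypotheses gives a one-line result. The worked equation below is a sketch under stated assumptions: $H_p$ says K and Q come from the same person, so the match is certain; $H_d$ says Q comes from an unrelated random member of the population, so the match has the random match probability, written $p_x$ here. Laboratory error and relatives are ignored, and the value $10^{-6}$ is invented for the example.

```latex
\[
LR = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)} = \frac{1}{p_x},
\qquad \text{e.g. } p_x = 10^{-6} \;\Rightarrow\; LR = 10^{6}.
\]
```
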
Source attribution
When $p_x$ is the random match probability for a profile X, $(1 - p_x)^N$ is the probability of not observing the particular profile in a sample of N unrelated individuals.
When this probability is greater than or equal to a confidence level $1 - \alpha$, then
$(1 - p_x)^N \ge 1 - \alpha$ or $p_x \le 1 - (1 - \alpha)^{1/N}$
In the American population, a random match probability (RMP) of $3.35 \times 10^{-11}$ will confer a 99% confidence that the profile is unique in the population. For the UK, the RMP is $2.01 \times 10^{-10}$.

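The threshold $p_x \le 1 - (1 - \alpha)^{1/N}$ is easy to evaluate numerically. In the sketch below the population sizes (about 300 million for the US and 50 million for the UK) are assumptions, chosen because they are consistent with the figures quoted on the slide. The slide itself does not state N.

```python
import math


def uniqueness_threshold(population_size, confidence=0.99):
    """Largest p_x for which (1 - p_x)**N >= confidence."""
    # Same as 1 - confidence**(1/N); expm1 keeps precision for tiny results.
    return -math.expm1(math.log(confidence) / population_size)


US_POPULATION = 300_000_000  # assumption, not stated on the slide
UK_POPULATION = 50_000_000   # assumption, not stated on the slide

us_threshold = uniqueness_threshold(US_POPULATION)
uk_threshold = uniqueness_threshold(UK_POPULATION)
print(f"{us_threshold:.3g} {uk_threshold:.3g}")
```

With these assumed population sizes the thresholds agree with the slide's values to three significant figures.
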
Other topics
DNA database searches: multiply the RMP by the number of persons in the database to adjust for the possibility of matching that many people.
For lineage markers: use the count of the profile in the database as an estimate of its underlying probability in the population and do a frequency estimate with a confidence interval based on that.

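Both adjustments can be sketched briefly. The numbers are invented. Capping the database-adjusted value at 1.0 and using a normal-approximation interval for the counting estimate are assumptions of this sketch, and other interval choices are possible, particularly for haplotypes never seen in the database.

```python
import math


def database_match_probability(rmp, database_size):
    """RMP multiplied by the number of profiles searched (capped at 1.0 here)."""
    return min(1.0, rmp * database_size)


def counting_estimate(count, database_size, z=1.96):
    """Counting-method frequency with a normal-approximation 95% interval."""
    p = count / database_size
    half_width = z * math.sqrt(p * (1.0 - p) / database_size)
    return max(0.0, p - half_width), p, min(1.0, p + half_width)


# Hypothetical autosomal search: RMP of 1e-9 against a million profiles.
adjusted = database_match_probability(1e-9, 1_000_000)

# Hypothetical lineage marker: haplotype seen 5 times in 1000 profiles.
lower, estimate, upper = counting_estimate(5, 1000)
print(adjusted, round(lower, 5), estimate, round(upper, 5))
```
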
Statistical Analysis of Mixtures and Degraded DNA
Mixture interpretation
Partial DNA profiles

Mixture interpretation
This is nasty, but any truth is better than indefinite doubt.
The most conservative approach is to judge whether the suspect might be represented by the mixture found in the sample.
Sometimes you can pull apart the alleles, one known person at a time. Duplicate alleles among the persons in the mixture are then a problem.
When contributions of donors are about equal, you have a serious problem.

Exclusion Probabilities
Use the combined probability of exclusion. This is an estimate of the proportion of the population that has at least one allele not observed in the mixture.
The combined probability of exclusion assumes independence: it multiplies the included population proportion at each locus and subtracts the product from one.
Vulnerable to non-detection of alleles
Provides a conservative estimate

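A minimal sketch of the combined probability of exclusion, assuming the usual formulation: at each locus the probability of inclusion is the squared sum of the frequencies of the alleles observed in the mixture, and these are multiplied across loci. The allele frequencies are invented.

```python
def probability_of_inclusion(observed_allele_freqs):
    """Chance a random person carries only alleles observed at this locus."""
    return sum(observed_allele_freqs) ** 2


def combined_probability_of_exclusion(loci):
    """1 minus the product of the per-locus inclusion probabilities."""
    combined_inclusion = 1.0
    for observed_allele_freqs in loci:
        combined_inclusion *= probability_of_inclusion(observed_allele_freqs)
    return 1.0 - combined_inclusion


# Hypothetical frequencies of the alleles seen in a mixture at two loci.
mixture = [
    [0.10, 0.20, 0.20],
    [0.10, 0.10, 0.20],
]

cpe = combined_probability_of_exclusion(mixture)
print(round(cpe, 4))
```

If an allele drops out and goes undetected, the calculation may wrongly exclude a true contributor, which is the vulnerability noted on the slide.
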
Likelihood Ratio
Set up two competing hypotheses.
The problem is that defining the hypotheses is not straightforward.
Uses the evidence better than the exclusion method.

Mixtures
Complicated to interpret.
Basic approach is to identify the alleles from known contributors.
Any detected alleles outside that set had to come from unknowns (one or more).
When the mixture results are affected by low copy number stochastic limits, degradation, or PCR inhibition, so that alleles are missing, all bets are off.

Partial DNA profiles
Only loci with results can be interpreted.
Degraded samples or low copy number samples will cause PCR to fail.
Interpret only the detected alleles.
Any data are better than none at all.

Kinship and Parentage Testing
When DNA samples being compared are from related individuals, the assumption of independence is violated, and different statistical equations must be applied.
Parentage testing
Statistical calculations
Impact of mutational events
Reference samples
Reverse parentage testing
Data from both parents is often not available

Conclusions
Unfortunately, you're likely to be the expert.
If you have the opportunity, study this on your own or do a forensics qualification (postgraduate or subject area).
You know where to find help:
Michael Oakes
Peter Dunne
Malcolm Farrow
Be honest about your level of skill.
More statistics won't hurt you.