Methods for the analysis of nuclear DNA data

JUHA KANTANEN (1) and MIIKA TAPIO (2)
(1) BIOTECHNOLOGY AND FOOD RESEARCH, MTT AGRIFOOD RESEARCH FINLAND, JOKIOINEN, FINLAND, juha.kantanen@mtt.fi
(2) INTERNATIONAL LIVESTOCK RESEARCH INSTITUTE (ILRI), NAIROBI, KENYA, m.tapio@cgiar.org

Breed characterization
o production traits (quantitative genetic variation)
o environment and management conditions where a breed is raised
o molecular genetic characterization (breed characterization using molecular genetic markers)

Genetic diversity is the variety of alleles and genotypes.

Molecular genetic variation
Two components of genetic variation: genetic variation within breeds and between breeds.
1. Study genetic diversity within breeds
2. Study genetic relationships of breeds, breed origins and molecular ancestries
3. Infer factors which shaped genetic variation
4. Use this molecular information for breed management, utilization and conservation

Study design: how many individuals, how many loci

Sampling design: rules of thumb by assumed population structure
o Well-defined uniform populations: traditional population study. 25-50 individuals per breed (sampling 25 individuals gives 2N = 50 allele draws per locus); sexes in a 1:1 ratio.
o No clear populations: biogeographic study. Geographically representative sampling (population density, factors affecting movement); 2 individuals per location (male and female).
o Some clear populations: a combination of the two above, often closer to the first.

Sampling design
o If possible, include a cosmopolitan breed or breeds. This helps in interpretation and in data matching across studies.
o Check whether you can get data from collaborators. Reference samples are needed!
o Which individuals? Approximately a single generation, unless there is a specific reason to sample across generations. No closely related individuals. In the absence of pedigree records, 2-3 individuals per owner (random sampling).

Information about the sample
o Record consistently for all samples: ID, date, sex, population name, owner
o Name of the place and coordinates (using a GPS receiver)
o Is the sampling site descriptive?
o Standard photograph
o Phenotypic observations, animal origin, pedigree information, major diseases (or lack of them) and other background information

Microsatellites: the most typical markers (so far) in diversity analysis
o Microsatellite DNA loci, or short tandem repeats (STR), are segments of repeated DNA with a short repeat length, usually 1-6 nucleotides
o Typically no specific function
o The repeat unit typically occurs 10-30 times

allele 1  -------CACACACACACA---------------  (6 repeats)
          -------GTGTGTGTGTGT---------------
allele 2  -------CACACACACACACACA-------------  (8 repeats)
          -------GTGTGTGTGTGTGTGT-------------

Simple microsatellites: (CA)32
Interrupted microsatellites: (CA)9 TA (CA)11
Complex microsatellites: (CAG)6 (CAA)7

Number of loci
o For breed differentiation and within-breed diversity analyses: 20-30 microsatellites
o FAO recommendation lists for different farm animal species: http://dad.fao.org/en/refer/library/guidelin/marker.pdf
o In estimating relatedness among individuals, the number and polymorphism of markers, as well as population structure, affect the robustness of the different methods; even hundreds of microsatellites can be needed.
o Highly polymorphic markers are recommended in order to minimise identity-by-state of alleles; the rule of thumb is at least 4 different alleles at a locus.

Overestimation of diversity?
o Hanslik et al. 2000. Animal Genetics 31: 31-38 studied Holstein-Friesian populations: gene diversity estimates were 0.43-0.48 for populations and 0.01-0.81 for the 39 microsatellites.
o With FAO cattle markers, gene diversity is typically 0.6-0.7 for breeds.

OTHER MARKERS
SNP analysis
Allele 0  ..GAATTTACT..
Allele 1  ..GAATTCACT..
o Genome-wide genotyping is done mainly by commercial service providers.
o Zenger et al. 2007. Animal Genetics 38: 7-14. 845 SNP markers in 431 Holstein-Friesian bulls.

META-ANALYSIS
o An analysis of previous analyses: it combines the results of several studies that address a set of related research hypotheses
o Global meta-analysis of previous microsatellite studies by pooling different data sets
o Reference animals are needed so that allele scores can be adjusted
o Checking allele frequency distributions can reveal discrepancies

ALLELE FREQUENCIES
Two examples; please calculate the allele frequencies.
1. Codominantly inherited alleles. The following genotypes were obtained:
   AD DD AA AA AD AA AA AD DD DD
2. Two alleles, a dominant one (L) and a recessive one (l). The following phenotypes were obtained (- = homozygous recessive):
   L L - - L L - - - L

CODOMINANCE

Genotype   Frequency
AA         P
AD         H
DD         Q
P + H + Q = 1 (100%)

The frequency of allele A: p = P + ½H
The frequency of allele D: q = Q + ½H, or q = 1 - p

Genotype   Count   Frequency       Allele   Frequency
AA         4       4/10 = 0.4      A        0.4 + 0.3/2 = 0.55
AD         3       3/10 = 0.3      D        0.3 + 0.3/2 = 0.45
DD         3       3/10 = 0.3

OR, by direct counting: the A allele occurs 11 times (20 chromosomes typed), 11/20 = 0.55; the D allele occurs 9 times, 9/20 = 0.45 (or 1 - 0.55 = 0.45).

DOMINANCE
We cannot simply count the alleles from the observed phenotypes: for example, individuals with the L phenotype can have genotype LL or Ll. Assuming Hardy-Weinberg proportions, the frequency of homozygous recessive individuals equals $q^2$, where q is the frequency of the recessive allele.
1. Calculate the frequency of homozygous recessive individuals. In our case, 5 individuals: 5/10 = 0.5
2. The square root of the frequency of homozygous recessive individuals is the frequency of the recessive allele, q. In our case, $\sqrt{0.5} \approx 0.7$
3. The frequency of the dominant allele is p = 1 - q. In our case, 0.3
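The two worked examples above can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration; the dominant case assumes Hardy-Weinberg proportions, as the slide does.

```python
from collections import Counter
from math import sqrt


def codominant_allele_freqs(genotypes):
    """Count alleles directly from two-letter codominant genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # 2N chromosomes typed
    return {allele: n / total for allele, n in counts.items()}


def dominant_allele_freqs(phenotypes, recessive="-"):
    """Estimate allele frequencies from dominant/recessive phenotypes.

    Assumes Hardy-Weinberg proportions, so the frequency of recessive
    homozygotes is q squared.
    """
    q = sqrt(phenotypes.count(recessive) / len(phenotypes))
    return {"dominant": 1 - q, "recessive": q}


# Data from the two exercises on the slide.
genotypes = ["AD", "DD", "AA", "AA", "AD", "AA", "AA", "AD", "DD", "DD"]
phenotypes = ["L", "L", "-", "-", "L", "L", "-", "-", "-", "L"]

codominant = codominant_allele_freqs(genotypes)
dominant = dominant_allele_freqs(phenotypes)
print(codominant)  # A = 11/20, D = 9/20
print(dominant)    # recessive = sqrt(0.5), about 0.71 (0.7 on the slide)
```

Direct counting is only possible in the codominant case; in the dominant case the estimate depends entirely on the Hardy-Weinberg assumption.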

TESTING ASSUMPTIONS PRIOR TO ANALYSIS
1. The presence of null alleles at microsatellite loci
2. The selective neutrality of each locus
3. The independent assortment of the loci

1. The presence of null alleles at microsatellite loci
o Non-amplifying microsatellite alleles ("null alleles"): mutation at the primer binding site -> no primer binding -> no PCR product
o Microsatellite null alleles are problematic because they create false homozygotes (an excess of homozygotes), may inflate levels of genetic differentiation, and affect population genetic analyses that rely on Hardy-Weinberg expectations.
o Null alleles and genetic differentiation: genetic differentiation between populations is overestimated (e.g. $F_{ST}$, chord distance, Nei's standard genetic distance). Chord distance is less affected by null alleles than Nei's standard distance.
o Null alleles and assignment tests: a slight reduction in the power to assign individuals correctly, but the overall outcome of assignment testing is not altered.
o Loci prone to null alleles should be used with caution, as they lower the power of assignment tests and alter the accuracy of $F_{ST}$; loci less prone to null alleles should always be preferred.

2. The selective neutrality of a locus
o Most analyses of population genetic markers assume neutrality, i.e. no effects of selection, and are based on the interaction of genetic drift (i.e. random change of allele frequency over generations), mutation and/or migration.
o Over time, genetic drift and mutation lead to divergence of allele frequencies among subpopulations, whereas migration leads to homogenisation of allele frequencies.
o Strong selection may overcome these processes:
  o Selection at a locus can stabilise allele frequencies across all subpopulations (e.g. via overdominance = heterozygote advantage, as in sickle-cell anemia): this leads to an underestimation of population substructure and genetic distance.
  o Differences in selective pressure, e.g. in different regions, may cause fixation of different alleles in different subpopulations: this leads to an overestimation of population substructure and genetic distance.
o Linkage of microsatellites to selected loci

3. The independent assortment of the loci
Random association between the two alleles of each of two genes: expected gametic frequencies when the alleles are in linkage equilibrium.

                        Alleles of A gene
Alleles of B gene       A1 (freq p1)     A2 (freq p2)
B1 (freq q1)            A1B1: p1q1       A2B1: p2q1
B2 (freq q2)            A1B2: p1q2       A2B2: p2q2
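As a minimal sketch of the table above, the expected gametic frequencies under linkage equilibrium are products of the allele frequencies. The disequilibrium coefficient D (observed A1B1 frequency minus p1q1) is a standard measure that does not appear on the slide; the function names and the observed gamete frequencies are hypothetical illustrations.

```python
def expected_gametes(p1, q1):
    """Expected gamete frequencies under linkage equilibrium."""
    p2, q2 = 1 - p1, 1 - q1
    return {
        "A1B1": p1 * q1,
        "A1B2": p1 * q2,
        "A2B1": p2 * q1,
        "A2B2": p2 * q2,
    }


def disequilibrium(observed):
    """D = observed freq(A1B1) - p1 * q1; zero under random association."""
    p1 = observed["A1B1"] + observed["A1B2"]  # frequency of allele A1
    q1 = observed["A1B1"] + observed["A2B1"]  # frequency of allele B1
    return observed["A1B1"] - p1 * q1


# Hypothetical observed gamete frequencies (not from the slides).
observed = {"A1B1": 0.4, "A1B2": 0.1, "A2B1": 0.1, "A2B2": 0.4}

expected = expected_gametes(0.5, 0.5)
d = disequilibrium(observed)
print(expected)
print(d)  # positive: A1 and B1 occur together more often than expected
```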

Gametic disequilibrium = the nonrandom association of alleles at different loci into gametes. It can arise for a variety of reasons:
1. Physical linkage ("linkage disequilibrium")
2. Epistatic selection (epistasis takes place when the action of one gene is modified by one or several other genes)
3. Genetic hitchhiking (a statistical association of alleles at a neutral locus with another locus undergoing selection; the neutral allele is carried along because of the selective advantage of the associated non-neutral allele)
4. Random drift in a small population
5. Migration and admixture

One of the very first analyses to be done: deviations from Hardy-Weinberg equilibrium

Random mating population + no selection + no mutation + no migration -> allele and genotype frequencies are constant from generation to generation.

There is a relationship between the allele frequencies and the genotype frequencies: if the frequencies of two alleles among the parents are p and q, then the genotype frequencies among the progeny are $p^2$, $2pq$ and $q^2$. This means that there is no association between the pair of alleles that an individual receives from its parents.

Deviations from Hardy-Weinberg equilibrium:
1. Nonrandom mating: inbreeding, or assortative mating = mating with individuals that are alike in some respect (positive assortative mating) or dissimilar (negative assortative mating)
2. Population subdivision (Wahlund effect)
3. Selection (overdominance)
4. Migration
5. Sex-specific differences in allele frequencies
6. Chronological sampling factors (sampling in different years)
7. Presence of null alleles

In Hardy-Weinberg equilibrium testing, the observed genotype frequencies are compared to those expected from the Hardy-Weinberg predictions.

1) The chi-square test
The sample allele frequencies are $p_A = 0.75$, $p_B = 0.25$ (the expected genotype counts are calculated from these frequencies).

             AA       AB       BB       Total
Observed     6        3        1        10
Expected     5.625    3.750    0.625    10
Obs - Exp    0.375    -0.750   0.375    0.00

The chi-square test statistic is
$\chi^2 = (0.375)^2/5.625 + (-0.750)^2/3.750 + (0.375)^2/0.625 = 0.40$ (df = 1)
The hypothesis of Hardy-Weinberg frequencies is not rejected. However, the chi-square test should not be applied when expected numbers are less than 1, as is the case for BB here.
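The chi-square calculation above can be reproduced with a short sketch. The function name is hypothetical; the p-value line is an addition not shown on the slide and uses the closed form for a chi-square distribution with 1 degree of freedom.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for two alleles."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # sample frequency of allele A
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with df = 1.
    p_value = erfc(sqrt(chi2 / 2))
    return p, expected, chi2, p_value


p_a, expected, chi2, p_value = hwe_chi_square(6, 3, 1)
print(p_a)       # 0.75
print(expected)  # [5.625, 3.75, 0.625]
print(round(chi2, 2), round(p_value, 2))
```

The smallest expected count (0.625 for BB) is below 1, which is why the slide warns against relying on the chi-square test for this sample.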

2) Exact test
Recommended for small sample sizes.
Idea of the test: look at all possible sets of genotypic frequencies for the observed set of allele frequencies, and reject the hypothesis of HWE if the observed genotypic frequencies turn out to be very unusual under HWE.
o The probabilities of all possible sets of genotypes are computed
o The genotype sets are ordered according to their probabilities
o The probability of the observed genotype set and the probabilities of all less probable sets are summed
o The hypothesis is rejected if this total probability is less than alpha ($\alpha$)

The sample allele frequencies are $p_A = 0.75$, $p_B = 0.25$.

             AA    AB    BB    Total
Observed     6     3     1     10

There are three possible genotypic arrays with the same allelic counts:

AA    AB    BB    Probability
5     5     0     0.5201
6     3     1     0.4334
7     1     2     0.0464

The probability of the observed data, or of a less likely data set, when Hardy-Weinberg holds is therefore 0.4334 + 0.0464 = 0.4798, and the hypothesis is not rejected.
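The enumeration above can be sketched as follows. The slide does not give the probability formula; this sketch assumes the standard conditional probability of a genotype array given the allele counts, which reproduces the three probabilities in the table. The function name is hypothetical.

```python
from math import comb, factorial


def hwe_exact_test(n_aa, n_ab, n_bb):
    """Exact test of HWE for two alleles, conditional on allele counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def prob(aa, ab, bb):
        # Number of ways to arrange the genotypes, times 2 per heterozygote,
        # divided by the number of ways to place the A alleles.
        arrangements = factorial(n) // (
            factorial(aa) * factorial(ab) * factorial(bb)
        )
        return arrangements * 2 ** ab / comb(2 * n, n_a)

    # The heterozygote count must have the same parity as the allele counts.
    table = {}
    for ab in range(n_a % 2, min(n_a, n_b) + 1, 2):
        aa, bb = (n_a - ab) // 2, (n_b - ab) // 2
        table[(aa, ab, bb)] = prob(aa, ab, bb)

    p_observed = table[(n_aa, n_ab, n_bb)]
    # Sum the observed array and every array that is no more probable.
    p_value = sum(p for p in table.values() if p <= p_observed + 1e-12)
    return table, p_value


table, p_value = hwe_exact_test(6, 3, 1)
for array, p in sorted(table.items()):
    print(array, round(p, 4))
print(round(p_value, 4))
```

Summing unrounded probabilities gives about 0.4799; the slide's 0.4798 comes from adding the already rounded values.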

One example: the INRA035 microsatellite typed in six cattle breeds (Suksun, Istoben, Yaroslavl, Kholmogor, Pechora and Ukrainian Grey, from Russia and Ukraine)

Heterozygosity
Breed        Observed    Expected
Suksun       0.225       0.534
Istoben      0.063       0.138
Yaroslavl    0.256       0.578
etc.

o An excess of the genotype INRA035 104/104 was detected.
o Chi-square test for Hardy-Weinberg equilibrium: HWE rejected.
o Obvious reason: null allele(s) segregating at INRA035 (chromosome 16).

Programs
o MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. The program aids identification of genotyping errors due to e.g. null alleles. Van Oosterhout et al. state that the program can discriminate between inbreeding and Wahlund effects on the one hand and Hardy-Weinberg deviations caused by null alleles on the other.
o GENEPOP 007: the null allele option allows maximum-likelihood estimation of allele frequencies when a null allele is present.

Selective neutrality
o The fixation of a beneficial mutation in a population also affects sites linked to the target of selection ("hitchhiking").
o Screening a large number of markers is expected to identify those linked to a selected site, which therefore deviate from neutral expectations.
o The neutral model of evolution assumes that all the alleles observed at a given locus are functionally equivalent.
o In diversity analyses, few microsatellites have deviated from neutral expectations. In cattle studies, for example, BoLA-DRBP1 (chr. 23) and CSSM66 (chr. 14) have done so. The BoLA microsatellite is located within the highly polymorphic bovine major histocompatibility complex (MHC). A close genetic linkage between CSSM66 and a QTL affecting milk yield, milk fat and protein composition has been detected in Holstein cattle.
o Neutrality tests are rather seldom conducted.

The Ewens-Watterson test determines whether the Hardy-Weinberg homozygosity in a sample of size 2N with n different alleles is consistent with the homozygosity expected at mutation-drift equilibrium.

Example:
Marker     Observed homozygosity    95% CI for the expected homozygosity
INRA005    0.3866                   [0.2931, 0.9484]
CSSM66     0.1350                   [0.1660, 0.6996]
HEL1       0.2688                   [0.2344, 0.8673]
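Reading the example table amounts to checking whether each observed homozygosity lies inside its 95% interval. A minimal sketch of that check, using the values from the table (the variable and function names are hypothetical; the intervals themselves come from the Ewens-Watterson procedure and are not recomputed here):

```python
# (observed homozygosity, lower 95% limit, upper 95% limit) from the slide.
results = {
    "INRA005": (0.3866, 0.2931, 0.9484),
    "CSSM66": (0.1350, 0.1660, 0.6996),
    "HEL1": (0.2688, 0.2344, 0.8673),
}


def outside_interval(observed, lower, upper):
    """True if the observed value falls outside the neutral expectation."""
    return observed < lower or observed > upper


deviating = [
    marker
    for marker, (observed, lower, upper) in results.items()
    if outside_interval(observed, lower, upper)
]
print(deviating)  # ['CSSM66']
```

Only CSSM66 falls outside its interval: its observed homozygosity is below the lower limit.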

Modified $F_{ST}$-based method by Beaumont & Balding (2004)
o FDIST2/LOSITAN is a program to detect loci whose genetic diversity within populations (heterozygosity) and between populations ($F_{ST}$) does not conform to the predictions of an infinite-island or finite-island model obtained by coalescent simulations.
  [coalescent = a theory that describes the structure of the genealogy of a sample of genes from the present time back to their most recent common ancestor]
  [finite-island model = a conceptual model for gene flow under which a finite number of demes exchange migrants with each other]
o The modified $F_{ST}$ method should be more suitable when some populations exhibit lower variability or less immigration than others.
o E.g. balancing selection: high heterozygosity, and a tendency to homogenise allele frequencies between populations.

[Figure: plot of $F_{ST}$ against heterozygosity for the 30 microsatellites analysed. Two outliers are detected. The yellow, blue and red lines denote the upper and lower 95% confidence limits and the median, respectively, of 100,000 independent simulated loci.]

Linkage equilibrium
o Percentage of locus pairs demonstrating linkage disequilibrium with p values < 0.05
o Fisher's combined probability test: $\chi^2_{TOT} = -2 \sum_j \ln P_j$. This statistic follows a chi-square distribution with 2k degrees of freedom, where k is the number of p values combined.

Clustering methods
o Traditional population genetic analyses (F-statistics and genetic distances) have been common approaches for characterizing population differentiation.
o These approaches rely on an a priori definition of populations (breeds).
o Disadvantages: within-population diversity is typically ignored; patterns of breed clustering may not be very robust (low level of differentiation, admixture, tree topology not robust); the recent demographic history of a population affects interbreed genetic distances.

CLUSTERING METHODS
o Development and use of these methods have become possible in recent years (more computing power and more data).
o Several software programs are available, with the aid of which you can answer questions such as:
  - how many genetic populations are there in your metapopulation?
  - to which population does this individual belong?
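Fisher's combined probability test can be sketched in a few lines. The function name and the example p values are hypothetical; the p-value of the combined statistic uses the closed form of the chi-square survival function for an even number of degrees of freedom, so no external library is assumed.

```python
from math import exp, log


def fisher_combined(p_values):
    """Combine independent p values: statistic = -2 * sum(ln P_j)."""
    k = len(p_values)
    statistic = -2 * sum(log(p) for p in p_values)
    # Chi-square survival function with df = 2k (even df has a closed form):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, 2 * k, exp(-half) * total


# Hypothetical p values from three tests (not from the slides).
statistic, df, combined_p = fisher_combined([0.04, 0.20, 0.10])
print(round(statistic, 2), df, round(combined_p, 4))
```

With a single p value the combined p-value equals the input, which is a quick sanity check on the formula.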

o Traditional estimators in population genetics could be calculated using simple analytical calculations; modern population genetic analyses rely heavily on computer power.
o Most of the recent advances in clustering methodology have been made in a Bayesian statistical framework.
o Simultaneous estimation of many interdependent parameters in complex models.
o In Bayesian inference, the posterior probability of a parameter depends explicitly on its prior probability, which reflects some previous belief about this parameter.
o Parameter = an unobservable quantity of interest, e.g. a population parameter such as an allele frequency.
o Parameter space = the set of all possible values of the quantity of interest (parameter); the parameter space for a population allele frequency includes all values between zero and one.
o Posterior distribution = the conditional probability distribution of the unobserved quantities of interest (parameters) given the observed data.
o Markov chain Monte Carlo (MCMC) techniques are often used to estimate the joint posterior distribution of a set of parameters without having to explore the whole parameter space.
o Joint posterior distribution = when a model is defined by more than one parameter, the posterior distribution of all possible combinations of parameter values.
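To make the terms above concrete, here is a minimal Metropolis sampler for a single parameter, a population allele frequency. Everything in it is an illustrative assumption rather than the method of any program named in these slides: a uniform prior on (0, 1), a binomial likelihood for 15 copies of an allele among 20 sampled chromosomes, a uniform random-walk proposal, and hypothetical function names.

```python
import random
from math import exp, log


def log_posterior(p, n_allele, n_total):
    """Unnormalised log posterior: binomial likelihood, uniform prior."""
    if not 0.0 < p < 1.0:
        return float("-inf")  # outside the parameter space
    return n_allele * log(p) + (n_total - n_allele) * log(1 - p)


def metropolis(n_allele, n_total, n_iter=20000, burn_in=2000,
               step=0.2, seed=1):
    """Sample the posterior of an allele frequency by random-walk MCMC."""
    rng = random.Random(seed)
    p = 0.5  # arbitrary starting point
    samples = []
    for i in range(n_iter):
        proposal = p + rng.uniform(-step, step)
        log_ratio = (log_posterior(proposal, n_allele, n_total)
                     - log_posterior(p, n_allele, n_total))
        # Accept with probability min(1, posterior ratio).
        if rng.random() < exp(min(0.0, log_ratio)):
            p = proposal
        if i >= burn_in:  # discard the burn-in iterations
            samples.append(p)
    return samples


samples = metropolis(n_allele=15, n_total=20)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))
```

For this simple model the posterior is known analytically (a Beta(16, 6) distribution with mean 16/22), so the sampled mean can be checked against it; MCMC becomes necessary when many interdependent parameters are estimated jointly.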

HOW MANY POPULATIONS?
o Identification of discrete populations in the metapopulation without an a priori definition of populations.
o The idea is to divide the total sample of genotypes into an unknown number of subpopulations = clusters of individuals.
o Individuals are assigned to groups based on their multilocus genotypes and on the assumption that the markers should be in Hardy-Weinberg and linkage equilibrium.
o Programs: STRUCTURE, PARTITION, BAPS

WHICH POPULATION?
o The use of allele frequency data from known populations to determine the most likely source of an individual with a given multilocus genotype when the actual source of the individual is unknown.
o Programs: GENECLASS, STRUCTURE (STRUCTURE assumes that all potential source populations have been sampled)

STRUCTURE (Pritchard et al. 2000; Falush et al. 2003)
1. The Bayesian clustering method takes a sample of genotypes and uses the assumption of Hardy-Weinberg and linkage equilibrium within subpopulations
2. Finds 1) the number of subpopulations k that best fits the data and 2) the individual assignments that minimize Hardy-Weinberg and linkage disequilibrium in those subpopulations
3. Derives the posterior probability distribution of k from separate MCMC chains, each with a different fixed value of k
4. Also exploits data on linked markers

PARTITION (Dawson and Belkhir 2001)
1. Uses Bayesian inference; estimates k by employing an MCMC method to generate an estimate of the posterior distribution of the sample partition
2. Identifies metapopulation subdivision and assigns individuals to populations on the basis of their genotypes
3. Uses the assumption of Hardy-Weinberg and linkage equilibrium within subpopulations
4. Assumes that all individuals are of pure ancestry
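The "which population?" question can be illustrated with a minimal frequency-based sketch: compute the likelihood of a multilocus genotype in each candidate population, assuming Hardy-Weinberg proportions within populations and independent loci, and pick the population with the highest likelihood. This is not the algorithm of any program named above; the breed names, loci, alleles and frequencies are hypothetical, and every allele frequency is assumed to be nonzero.

```python
from math import log


def genotype_log_likelihood(genotype, freqs):
    """Log likelihood of a multilocus genotype under HWE, unlinked loci."""
    total = 0.0
    for locus, (a, b) in genotype.items():
        pa, pb = freqs[locus][a], freqs[locus][b]
        # Homozygote: p^2; heterozygote: 2pq.
        total += log(pa * pa if a == b else 2 * pa * pb)
    return total


def assign(genotype, populations):
    """Return the most likely source population and all log likelihoods."""
    scores = {
        name: genotype_log_likelihood(genotype, freqs)
        for name, freqs in populations.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical allele frequencies for two breeds at two loci.
populations = {
    "breed_X": {"L1": {104: 0.8, 106: 0.2}, "L2": {150: 0.5, 152: 0.5}},
    "breed_Y": {"L1": {104: 0.2, 106: 0.8}, "L2": {150: 0.9, 152: 0.1}},
}
genotype = {"L1": (104, 104), "L2": (150, 152)}

best, scores = assign(genotype, populations)
print(best)  # breed_X
```

An allele absent from a reference sample would give a zero frequency and an undefined logarithm, so practical implementations need some rule for handling such alleles.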

BAPS (Corander et al. 2003, Corander et al. 2004)
1. Estimates the number of genetic clusters
2. Provides the proportion of the genome of each individual that can be assigned to the inferred clusters (admixture analysis)
3. Assumes HWE within clusters and unlinked markers; may use information on the sampling origin of individuals
4. Can consider each population as a unit and determine which of the populations have different allele frequencies
5. Versions 1 and 2 used an MCMC method; the newest version uses a simulated annealing method (quicker)

Comparison of Bayesian clustering software:
o Latch et al. (2006) studied the relative performance of STRUCTURE, BAPS and PARTITION at low levels of population differentiation ($F_{ST}$ = 0.01-0.10).
o PARTITION was unable to identify the number of subpopulations correctly until $F_{ST}$ reached around 0.09.
o STRUCTURE and BAPS performed better at low levels of population differentiation and were able to identify the number of subpopulations correctly at $F_{ST}$ around 0.03.
o $F_{ST}$ should be > 0.05 to reach an assignment accuracy (assignment of individuals to the correct subpopulation) of greater than 97%.

EXAMPLE I
Li et al. (2007) Molecular Ecology 16: 3839-3853. 21 Eurasian cattle breeds, 30 autosomal microsatellites. STRUCTURE was used.
1) To estimate the number of clusters (k), a Monte Carlo Markov chain was run for all models of k with a burn-in period of 20 000 and a run length of 10 000 iterations. (Burn-in: the chain is run for some time before sampling begins, to remove the influence of the starting point.)
2) Log-likelihood estimates were calculated for k = 1-21, with 100 independent runs for each fixed number of clusters (k).
3) All runs used the independent allele frequency and the admixture models.
4) The coefficient C, quantifying the similarity of results for an ordered pair of STRUCTURE runs with the same number of assumed clusters k, was calculated according to Rosenberg et al. (2002).
5) The graphical display of the STRUCTURE results was generated using distruct (Rosenberg 2004).

How did they choose the k value?
1) ln P(D) increased from K = 1 to K = 20, after which it began to decline.
2) This indicated that the most significant differentiation occurred at the level of populations.
3) From K = 2 to K = 3, the increase in ln P(D) is large.
4) The 100 STRUCTURE runs at K = 2 or K = 3 have a pairwise similarity coefficient C > 0.90 ($C_{K=2} = 0.96$, $C_{K=3} = 0.92$), indicating that nearly all individuals have similar membership coefficients across all pairs of runs.
5) For K > 3, variation between runs increased, as suggested by the dramatic drop in the similarity coefficient ($C_{K=4} = 0.43$), which reflects quite a few different possible clustering solutions and suggests a lack of additional high-level substructure in the populations.

[Figure: (A) posterior probability, ln P(D), against the maximum number of populations (K); (B) the increase of ln P(D) for a given K, calculated as $\ln P(D)_K - \ln P(D)_{K-1}$, for the combined analysis.]
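The quantity plotted in panel (B), the increase of ln P(D) from K-1 to K, is a simple difference. A minimal sketch follows; the function name is hypothetical and the ln P(D) values are invented for illustration, since the paper's values are not given on the slide.

```python
def ln_pd_increase(ln_pd_by_k):
    """Return ln P(D)_K - ln P(D)_{K-1} for every K with a predecessor."""
    return {
        k: ln_pd_by_k[k] - ln_pd_by_k[k - 1]
        for k in sorted(ln_pd_by_k)
        if k - 1 in ln_pd_by_k
    }


# Hypothetical mean ln P(D) values per K (not from Li et al. 2007).
ln_pd = {1: -52000.0, 2: -50500.0, 3: -49600.0, 4: -49400.0}

increase = ln_pd_increase(ln_pd)
print(increase)  # {2: 1500.0, 3: 900.0, 4: 200.0}
```

A large increase followed by much smaller ones is the pattern the authors looked for, alongside the similarity coefficient C between runs.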

EXAMPLE II
Canon et al. 2006. Animal Genetics 37: 327-334. 1426 goats, 45 breeds from 15 countries.

EXAMPLE III
o 15 well-defined animal populations
o About 250 animals whose origin is not known
o The population structure (the assignment of individuals) was resolved using trained clustering followed by genetic mixture analysis, as implemented in BAPS v4.13 (Corander et al. 2006).
o In trained clustering, the 15 populations were treated as reference populations with predetermined assignment to their source population, and the rest of the animals (undefined) were treated as animals of unknown origin.
o Clustering was performed using 15-50 as the maximum number of populations, 10 times each, and the clustering with the largest likelihood was processed further in the admixture analysis.

In the optimal clustering (natural logarithm of the marginal likelihood of the data: -5732.049), the unknown animals grouped into 20 clusters.