Genetics and Psychiatric Disorders Lecture 1: Introduction

Amanda J. Myers
Laboratory of Functional Neurogenomics

All slides are available at http://labs.med.miami.edu/myers (click on Courses; first two links).
Passwords: Lecture 1 = furbies1, Lecture 2 = furbies2

Epidemiology, Genetics, and Computational Biology
These are not the same thing. Take care when choosing an analyst for your study.

Epidemiology (not what we will discuss)
Study of psychiatry in relation to populations and environments
Not genetics: some tests overlap, but many of the tests in epidemiology are ONLY appropriate for samples collected as populations
Some tests: risk ratios, survival curves
Some studies: Religious Orders Study, Honolulu-Asia Aging Study, Baltimore Longitudinal Study of Aging, Genetics of Healthy Aging, Kungsholmen Project, SardiNIA
Programs employed: SAS/SPSS

Genetics
Study of psychiatry in relation to families or groups
Some tests: odds ratios, chi-square tests, LOD scores, transmission disequilibrium test, t-tests, ANOVA, generalized linear models
Some studies: next lecture gives a few examples
Programs employed: SAS/SPSS, GENEHUNTER/GENEFINDER, MERLIN, ALLEGRO, ARLEQUIN, SOLAR, CLUMP, CYRILLIC, EH, EH-PLUS, FASTLINK, FBAT, and on and on
Rockefeller keeps a good list: http://linkage.rockefeller.edu/soft/list1.html#d

Computational/Bioinformatics
Genetics on steroids: usually these are case/control studies on a grandiose scale (e.g. full genome * 10K samples)
Some tests: similar to conventional genetics, but dealing with this level of data requires a new set of programming skills
  Non-independence of tests + corrections
Some studies: GWAS, Next Generation Sequencing
Programs usually employed: PLINK, Birdseed, Affymetrix (Affymetrix Power Tools) and Illumina (GenomeStudio)

CONCEPTS

Hypothesis Testing (The Basics)
A lot of human genetics relies on this
Always think about what the null hypothesis is for the study/test statistic of interest
  The null is not always what you think
P-values
  Conclusiveness index: NOT that your data is right, more a measure of how not wrong it is
  Formally: the probability of obtaining a result equal to or more extreme than your test statistic, given that the null hypothesis is true
  P-VALUES HAVE NOTHING TO DO WITH EFFECT SIZE/BIOLOGICAL IMPORTANCE

Alpha + Beta (The Basics)
For every study, these two statistics are the framework around which the study is designed
Alpha: TYPE I error
  Acceptable false positive error rate, i.e. reject the null hypothesis when it is true
  Not the same as the p-value: alpha is set prior to the study and is inherent to the study design, typically set to 0.05
Beta: TYPE II error
  False negative error rate, i.e. retain the null hypothesis when it is false

Alpha + Beta (The Basics)

                                      REALITY
OUR DECISION IN THE STUDY    Null hypothesis is true         Test hypothesis is true
Accept null hypothesis       Correct decision (1 - alpha)    TYPE II error (beta)
Reject null hypothesis       TYPE I error (alpha)            Correct decision (1 - beta, equivalent to power)

Power (The Basics)
For every study, power is used to predict the appropriate sample size
However, this calculation relies on input that is inherently unknown
Power = 1 - beta
Input to determine power is inherent to the study design, e.g. for a case-control screen you need:
  alpha
  your predicted model for how the gene is inherited (recessive or dominant)
  disease prevalence
  relative risk of the disease allele
  the measure of linkage disequilibrium between the real locus and the marker you typed
  sample size
One program: GPC: http://pngu.mgh.harvard.edu/~purcell/gpc/
There are many others.
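
To make "power = 1 - beta" concrete, here is a minimal sketch of a power calculation for a case-control allele frequency comparison. It is a simplification under stated assumptions: it takes the marker allele frequencies in cases and controls as given (rather than deriving them from inheritance model, prevalence, relative risk and LD), it uses a normal approximation, and it is not how GPC does the calculation. The function name `allelic_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(p_case, p_control, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided test for a difference in allele frequency.

    Assumption: normal approximation to the difference of two proportions,
    with each individual contributing two chromosomes.
    """
    chrom_cases = 2 * n_cases
    chrom_controls = 2 * n_controls
    # Standard error of the difference in allele frequencies.
    se = sqrt(p_case * (1 - p_case) / chrom_cases
              + p_control * (1 - p_control) / chrom_controls)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_effect = abs(p_case - p_control) / se        # standardized true effect
    # Probability that the test statistic lands in either rejection region.
    upper = 1 - NormalDist().cdf(z_alpha - z_effect)
    lower = NormalDist().cdf(-z_alpha - z_effect)
    return upper + lower


if __name__ == "__main__":
    for n in (200, 2000, 20000):
        print(n, round(allelic_power(0.32, 0.30, n, n), 3))
```

When the two frequencies are equal the function returns alpha, which is the Type I error rate by construction; power grows as the sample size or the frequency difference grows.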

Power (The Basics)
Sample sizes, IMHO:
FAMILY BASED STUDIES:
1. Linkage screens: the bigger the better, restricted by finding large families with multiple generations affected
  - first early onset Alzheimer's gene: pedigree of ~40 people
  - second early onset Alzheimer's gene: pedigrees of ~100 people
2. Sibling-pair screens: ~200-600 sibling pairs
3. Transmission disequilibrium screens: ~200-600 sibling pairs
QUANTITATIVE TRAIT STUDIES:
  N ~ 100-1000, depending upon your effect size
CASE CONTROL STUDIES:
  N ~ 100-40,000
  Case in point: APOE versus CLU

The Basics
Allele: variants of a particular gene/DNA position
Haplotype: collection of alleles on a single chromosome that are transmitted together
Phase: allele transmission relative to parental inheritance
  If allele 1 from Marker 1 is inherited from the same parent as allele 1 from Marker 2, they are in phase
  Homozygotes are phase known

Penetrance (The Basics)
Important for both linkage and association screens
Does the allele I just mapped cause disease?
Fully penetrant alleles = 1:1 relationship between that allele and the phenotype
Everything else is a risk factor
Typically a balance between:
  Common allele, low penetrance
  Rare allele, high penetrance
Determination of penetrance can be tricky because of sampling issues and ascertainment bias

The Basics
Two main flavors of screens:
1. Linkage
2. Association
Two main flavors of samples:
1. Family based: can get linkage and/or association
2. Unrelated/case control: can only get association

FAMILY BASED SCREENS

Linkage
(Figure: Marker A, disease locus?, marker b along a chromosome)
Linkage is determining the position of a disease locus in relation to marker loci along a chromosome
Is my disease locus in the same block of inherited chromosome as Marker A or marker b (or both)?
Null: there is free recombination between trait and marker, i.e. they are unlinked
Typically big distances (depends upon the number of generations)
Not allele specific

The Basics
Distance:
Physical: number of base pairs between 2 points on a chromosome. Thanks to the sequencing of the human genome, this can now be mapped precisely.
Genetic: either recombination fraction (θ) or map length
θ: the probability that 2 alleles at 2 different loci are derived from different parental chromosomes
  θ = 0: no recombination
  θ = 0.5: loci are unlinked
  Recombination is only observed if there is an odd number of cross-overs
Centimorgan (map length): average number of cross-overs in a particular interval of a single chromatid
On average 1 megabase = 1% recombination = 1 cM, but genetic distances are location specific
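
Because only an odd number of cross-overs is observed as a recombinant, θ and map length diverge as distance grows. One way to sketch that relationship is Haldane's map function, which is an assumption brought in here (it is not on the slide) and which itself assumes cross-overs occur independently, with no interference. The function names are hypothetical.

```python
from math import exp, log


def haldane_theta(map_cm):
    """Recombination fraction expected for a map length given in centimorgans.

    Assumption: Haldane's map function (no cross-over interference).
    """
    morgans = map_cm / 100.0
    return 0.5 * (1.0 - exp(-2.0 * morgans))


def haldane_cm(theta):
    """Map length in centimorgans for a recombination fraction 0 <= theta < 0.5."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must be in [0, 0.5)")
    return -50.0 * log(1.0 - 2.0 * theta)


if __name__ == "__main__":
    for cm in (1, 10, 50, 200):
        print(cm, "cM ->", round(haldane_theta(cm), 4))
```

For short intervals the two measures nearly coincide (1 cM is close to 1% recombination); for long intervals θ levels off just below 0.5 while map length keeps growing.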

Markers: Microsatellites
Also called:
  SSR: simple sequence repeat
  STR: short tandem repeat
  STRP: short tandem repeat polymorphism
  VNTR: variable number tandem repeat
Simple runs of repetitive DNA sequence
  Dinucleotide, e.g. (CA)n
  Trinucleotide, e.g. (ATA)n
  Tetranucleotide, e.g. (GATA)n
Preference: tetras > tris > dis, because they are run out on PAGE gels and tetras separate better, but tetras are less common
Scoring is by size range, which is equivalent to number of repeats except where there is slippage; sizes are usually binned

Markers: Microsatellites
Naming by HUGO nomenclature
  D/S number is used to indicate anonymous DNA sequences
  D = DNA, followed by the chromosome number; S = segment; E = expressed
  D10S1211 (chromosome 10) is not expressed, D3S2250E (chromosome 3) is expressed
Advantages:
  Highly informative
  Dispersed throughout the genome
  Easy to type (used to be true)
Disadvantages:
  Easy to type is relative: no chip technology
  Distance is based on recombination, which differs between males and females; maps are typically averaged

Markers: Microsatellites
Where to find the maps?
Originally done by Marshfield: 8,325 markers
  CEPH families, predominantly from Utah
  8 families, ~130 individuals
Now deCODE maps
  HUGE Icelandic pedigrees
  146 families, 869 individuals, 1257 meiotic events

Method
PAGE: polyacrylamide gel electrophoresis
(Figure: gel lanes for Marker 1, Marker 2, Marker 3 and a size standard; sizing of repeats)

Test Statistic
Likelihood Ratio
LR: P(data | test hypothesis) vs. P(data | null hypothesis)
  Ha: P(observed marker data | 0 < θ < 0.5), i.e. the recombination fraction between marker and trait is small
  H0: P(observed marker data | θ = 0.5), i.e. the marker is unlinked to the trait
LOD scores
  Logarithm of odds = log10 of the likelihood ratio
  LRs are log transformed so that they can be summed across families or studies
  LOD = 3 means the likelihood ratio is 100 times larger than at LOD = 1 (odds of 1000:1 versus 10:1)
Multipoint LOD score = MLS
  Look at the information at the marker loci as well as between markers
  Disease locus could be anywhere along the chromosome
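
The LOD definition above can be sketched for the simplest possible case. The assumptions are strong and are mine, not the slide's: every meiosis is fully informative and phase known, so recombinants can simply be counted, and families are independent. Real linkage programs handle unknown phase, missing data and penetrance models. The function name `lod_score` is hypothetical.

```python
from math import log10


def lod_score(recombinants, meioses, theta):
    """Two-point LOD score at a given recombination fraction.

    Assumption: phase-known, fully informative meioses, so the likelihood is
    theta^R * (1 - theta)^(N - R), compared with 0.5^N under no linkage.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_l_test = (recombinants * log10(theta)
                  + non_recombinants * log10(1.0 - theta))
    log_l_null = meioses * log10(0.5)
    return log_l_test - log_l_null


if __name__ == "__main__":
    # (recombinants, meioses) per family; hypothetical counts.
    families = [(1, 10), (0, 6), (2, 12)]
    for theta in (0.05, 0.1, 0.2, 0.3, 0.5):
        total = sum(lod_score(r, n, theta) for r, n in families)
        print(theta, round(total, 2))
```

Summing the per-family values is exactly why the likelihood ratio is log transformed. As a check on scale, 1 recombinant in 10 meioses at θ = 0.1 gives a LOD of about 1.6, and θ = 0.5 always gives 0.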

FAMILY BASED SCREENS STANDARD LINKAGE

Linkage: Standard Analysis
Need:
  Multiple generation pedigrees
  Model of disease inheritance
Procedure: follow segregation of markers throughout a family; trace which regions of the genome cosegregate with the disease phenotype
Issues:
  Late onset diseases: no parents
  Diseases with genetic heterogeneity: can't properly specify the model of inheritance

FAMILY BASED SCREENS LINKAGE, SIBLING PAIRS

Linkage: Allele Sharing Methods
Compare genotypes of affected sibling pairs (ASPs)
  Do not need parental genotypes
  Test whether the inheritance pattern of a particular region is inconsistent with random segregation
Know the chance probability of siblings sharing DNA:
  50% of the time sibs will share 1 allele
  25% of the time they will share no alleles
  25% of the time they will share both alleles
Non-parametric: no model of inheritance is specified, because the goal is not to follow a particular allele through a family but to look at the pattern of allele sharing within a population
Issues: power, distance
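
A minimal sketch of the sharing comparison: count how many affected sib pairs share 0, 1 or 2 alleles at a marker and compare with the 25/50/25 chance expectation using a goodness-of-fit chi-square. It assumes sharing (identical by descent) is known unambiguously for every pair, which real data rarely allow; dedicated programs estimate sharing instead. The function name `asp_sharing_test` is hypothetical.

```python
from math import exp


def asp_sharing_test(share0, share1, share2):
    """Goodness-of-fit test of sib-pair sharing counts against 25/50/25.

    Returns (chi-square statistic with 2 degrees of freedom, p-value).
    """
    total = share0 + share1 + share2
    observed = (share0, share1, share2)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 2 degrees of freedom the chi-square tail probability is exp(-x/2).
    p_value = exp(-chi2 / 2.0)
    return chi2, p_value


if __name__ == "__main__":
    print(asp_sharing_test(10, 50, 40))  # excess sharing among affected pairs
```

Excess sharing of both alleles among affected pairs is the signal of interest; counts that match 25/50/25 give a statistic of 0.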

FAMILY BASED SCREENS TRIOS

Transmission Disequilibrium
(Figure: trio pedigree. Parents M1M2 and M2M2, affected child M1M2. Transmitted = M1M2, non-transmitted = M2M2)
Need: parents and one affected child
Procedure: compare the proportion of transmitted alleles and untransmitted alleles
Null: there is no preferential transmission, i.e. any given allele is transmitted to children 50% of the time
Issues: power; need heterozygous parents
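
The comparison of transmitted and untransmitted alleles is commonly computed as a McNemar-type chi-square on transmissions from heterozygous parents. The slide does not give the formula, so treat this as a sketch of the standard TDT statistic rather than the lecture's own method; the function name `tdt` and the counts are hypothetical.

```python
from math import erfc, sqrt


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one allele.

    transmitted:   times heterozygous parents passed the test allele
                   to an affected child
    untransmitted: times heterozygous parents passed the other allele
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous parents: the test is uninformative")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # For 1 degree of freedom the chi-square tail is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(tdt(60, 40))
```

Homozygous parents add nothing to either count, which is why the test needs heterozygotes.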

Linkage Screen QC
Mendelian inheritance errors
  Genotypes have to be physically possible based on the allelic properties mapped by a monk spending a lot of time with pea plants
Gender checks
Pedigree errors, i.e. input issues, data reporting
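
A Mendelian inheritance check can be sketched in a few lines. It assumes an autosomal marker, complete genotypes for both parents and the child, and genotypes stored as pairs of allele labels; the function name `mendel_consistent` is hypothetical.

```python
def mendel_consistent(father, mother, child):
    """Return True if the child's genotype is possible given both parents.

    Assumption: autosomal marker, genotypes given as 2-tuples of allele labels.
    The child must take one allele from the father and the other from the mother.
    """
    first, second = child
    return ((first in father and second in mother)
            or (second in father and first in mother))


if __name__ == "__main__":
    print(mendel_consistent(("M1", "M2"), ("M2", "M2"), ("M1", "M2")))  # possible
    print(mendel_consistent(("M2", "M2"), ("M2", "M2"), ("M1", "M2")))  # error
```

A failed check points to a genotyping error or a pedigree error; it does not say which.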

UNRELATEDS SCREENS

Association
(Figure: CASES vs. CONTROLS, Allele A and allele b. Distribution difference?)
Mapping the relationship between an individual's genotypes and their phenotypes
No distance in the statistics: not positional like linkage
Null: there is no difference in the allele distribution between affecteds and unaffecteds
Allele specific (this is not the case for linkage)
Issues: stratification can give a result even if there is no true relationship
  ETHNICITY

The Basics
Distance
No distance in the statistics: not positional like linkage
BUT: need to be aware of the concept of linkage disequilibrium (LD)
LD: UNEQUAL SORTING OF ALLELES SUCH THAT SOME ALLELES APPEAR TOGETHER MORE OFTEN THAN WOULD BE EXPECTED BY CHANCE
Don't measure this by transmission; measure it by looking at the probabilities of a given allele relationship
Test statistic: D′
  Test haplotype frequencies against the population frequencies of each allele
  Low frequencies of recombinant haplotypes = high LD
  High LD is typically D′ > 0.8
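
Testing a haplotype frequency against the allele frequencies can be sketched as follows. The code uses the usual definitions D = P(AB) - P(A)P(B) and D′ = |D| / Dmax; these formulas are not spelled out on the slide, so read them as the standard definitions rather than the lecture's notation. The function name `ld_d_prime` is hypothetical.

```python
def ld_d_prime(p_ab, p_a, p_b):
    """Linkage disequilibrium between allele A at locus 1 and allele B at locus 2.

    p_ab: frequency of the A-B haplotype
    p_a, p_b: population frequencies of allele A and allele B
    Returns (D, D') where D' is |D| scaled by its maximum possible value.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime


if __name__ == "__main__":
    print(ld_d_prime(0.28, 0.30, 0.30))  # alleles nearly always travel together
    print(ld_d_prime(0.09, 0.30, 0.30))  # haplotype at its chance frequency
```

D′ near 1 means recombinant haplotypes are rare or absent; D′ near 0 means the alleles sort independently.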

The Basics
LD take home point 1: in an association study, the allele you find may merely be in LD with the causative allele
LD take home point 2: LD can arise because the alleles are very close on the chromosome, but also because the mutation is recent, i.e. a rare recent mutation will be in LD with a lot of things

Markers: Single Nucleotide Polymorphisms (SNPs)
Change of a single nucleotide at a particular location in the genome
Naming by dbSNP (www.ncbi.gov)
  Many different submitters to this DB, all of whom submitted their variations
  These were then mapped against each other and each variation was assigned a unique number = rs number
Advantages:
  Common
  Could be the actual risk effect (e.g. a coding change)
  Easy to type (insanely easy)
Disadvantages:
  Bi-allelic, less variability = not as informative as a microsatellite
  Too much of a good thing: how to tell what is really risk

Markers: SNPs
Where to find the maps?
Physical: human genome sequenced!! Yeah.
  www.ncbi.nlm.gov, www.ensembl.org, genome.ucsc.edu
Frequencies: HapMap, www.hapmap.org
  African, Asian, European (CEPH) families
  Measure transmission of mapped SNPs
  Gets at LD in these families
1000 Genomes Project, www.1000genomes.org
  NextGen sequencing on many more people: better info
  Pilot data: low coverage, n = 180
  Targeted: 1,000 genes in 1,000 individuals

Method: old school RFLP: restriction fragment length polymorphism

Method: new school Microarray: Allelic Specific Hybridization Probes designed to hybridize one allele or the other

Method: newest school
NextGen: Next Generation Sequencing
  Sequencing-by-synthesis
  Sequencing-by-ligation
  Single molecule sequencing
Unlike current microarrays, no predesign of probes + hybridization is necessary: get the actual SNPs!

Test Statistic
Chi-square test: compare observed allele counts to expected counts. Is there a difference?
Null: the distribution of alleles is the same in affecteds and unaffecteds

OBSERVED:
               ALLELE 1    allele 2    TOTAL
AFFECTEDS      A1          a2          A1 + a2
UNAFFECTEDS    U1          u2          U1 + u2
TOTAL          A1 + U1     a2 + u2     total number of chromosomes

EXPECTED:
               ALLELE 1                               allele 2                               TOTAL
AFFECTEDS      (A1 + a2) * (A1 + U1) / total #chr     (A1 + a2) * (a2 + u2) / total #chr     A1 + a2
UNAFFECTEDS    (U1 + u2) * (A1 + U1) / total #chr     (U1 + u2) * (a2 + u2) / total #chr     U1 + u2
TOTAL          A1 + U1                                a2 + u2                                total number of chromosomes

The χ² statistic is calculated by subtracting expected table counts from observed table counts, squaring the differences, dividing by the expected numbers, and summing, to yield a measure of the difference between observed data and expected data.
What it isn't: a direct comparison of cases and controls
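
The observed/expected recipe above can be sketched directly. The argument names follow the table cells (A1, a2, U1, u2); the function name `allelic_chi_square` is hypothetical, no continuity correction is applied, and the p-value uses the 1 degree of freedom that applies to a 2 x 2 table.

```python
from math import erfc, sqrt


def allelic_chi_square(a1, a2, u1, u2):
    """Chi-square test on a 2 x 2 table of allele counts.

    Rows: affecteds (a1, a2) and unaffecteds (u1, u2).
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    observed = [[a1, a2], [u1, u2]]
    row_totals = [a1 + a2, u1 + u2]
    col_totals = [a1 + u1, a2 + u2]
    total = row_totals[0] + row_totals[1]  # total number of chromosomes
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square tail is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(allelic_chi_square(60, 40, 40, 60))
```

Note that the expected counts come from the margins of the whole table, which is why the statistic is not a direct case-versus-control comparison.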

Test Statistic
Odds Ratio: compare observed allele frequencies to each other. Are there more cases for a particular allele?
Null: frequencies are the same

               ALLELE 1    allele 2
AFFECTEDS      A1          a2
UNAFFECTEDS    U1          u2

OR statistic is calculated by the cross product (A1 * u2) / (U1 * a2)
Why?
  A1/U1 = odds of being a case given that an individual has ALLELE 1
  a2/u2 = odds of being a case given that an individual has allele 2
  ODDS RATIO = (A1/U1) divided by (a2/u2), which through simple algebra = (A1 * u2) / (U1 * a2)
What it isn't: a measure of the risk of a particular allele, i.e. big ORs don't mean alleles are common and everyone is going to get the disease.

Test Statistic
Confidence Interval: the equivalent of a p-value for an odds ratio
Measure of the robustness of the OR result
Remember that: OR = (A1 * u2) / (U1 * a2)
So if the CI spans 1, the OR is not significant, i.e. there is no difference in the probability of being a case or a control given a certain allele
CIs entirely below 1 (negative on the log OR scale) = PROTECTIVE EFFECT
CIs entirely above 1 (positive on the log OR scale) = RISK EFFECT
SINCE IT IS A RATIO OF ALLELES, THE DIRECTION DEPENDS UPON THE ALLELE YOU ARE USING AS YOUR TEST ALLELE. TYPICALLY ONE TESTS THE MINOR ALLELE AS THE DISEASE ALLELE.
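
A minimal sketch of the odds ratio with a confidence interval. The interval uses the log (Woolf) method, which is one common choice and an assumption here since the slide does not name a method; it also assumes all four counts are non-zero. The function name `odds_ratio_ci` is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a1, a2, u1, u2, level=0.95):
    """Odds ratio for ALLELE 1 with a confidence interval.

    Assumption: Woolf (log) method, valid only when no cell count is zero.
    Returns (odds ratio, lower bound, upper bound).
    """
    odds_ratio = (a1 * u2) / (u1 * a2)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = sqrt(1 / a1 + 1 / a2 + 1 / u1 + 1 / u2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    print(odds_ratio_ci(60, 40, 40, 60))  # interval above 1: risk effect
    print(odds_ratio_ci(40, 60, 60, 40))  # same data, other allele tested
```

Swapping which allele is treated as the test allele inverts the OR and flips the interval to the other side of 1, which is the point made in capitals above.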

Test Statistic Confidence Interval: Forest Plots

Association Screen QC
Hardy-Weinberg Equilibrium (HWE)
  When there is:
  1. Random mating within a large population
  2. No selection or recent mutations
  Then: allelic/genotypic frequencies should remain stable from generation to generation
  Mostly test controls and cases separately; some argue that cases should be out of HWE for markers of interest
  BEST TEST: MARKERS KNOWN TO NOT BE LINKED TO DISEASE
Gender checks
Marker frequencies in controls against public DB
Population checks
  Make sure no one in your population is related
  Check also for outliers
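
A Hardy-Weinberg check for a bi-allelic marker can be sketched as a chi-square comparison of observed genotype counts with the counts expected from the allele frequencies (p², 2pq, q²). The slide does not specify a test, so this goodness-of-fit version is an assumption; exact tests are often preferred when counts are small. The function name `hwe_chi_square` is hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for a bi-allelic marker.

    n_aa, n_ab, n_bb: observed genotype counts.
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("marker is monomorphic: HWE cannot be tested")
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-square tail is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # matches HWE expectations
    print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit
```

In QC a marked heterozygote deficit or excess in controls usually points to a genotyping problem rather than biology.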
