Sequence Variations Baxevanis and Ouellette, Chapter 7 - Sequence Polymorphisms NCBI SNP Primer: http://www.ncbi.nlm.nih.gov/about/primer/snps.html

Overview
- Mutation and alleles
- Linkage
- Genetic variation in populations
- SNPs as genetic markers
- Classical genetic diseases
- Multi-factorial diseases and risk factors
- Genome scans (genotyping)

A review of some basic genetics

Alleles
- An allele is a particular DNA sequence for a gene.
- Some gene alleles are responsible for ordinary phenotypes like blue/brown eyes.
- Others lead to classic genetic diseases like cystic fibrosis or Huntington's disease.

Changes occur in DNA sequences = mutations

Many Causes of Mutations
- Somatic vs. reproductive cells
- Radiation and/or chemical damage to DNA
- Random errors of the replication machinery
- Normal biological processes - methylation

Mutations Create Alleles
- Mutations occur randomly throughout DNA.
- Most have no phenotypic effect: they fall in non-coding regions, produce equivalent (synonymous) codons, or substitute similar amino acids.
- Some damage the function of a protein or regulatory element.
- A very few provide an evolutionary advantage.
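The possible outcomes of a single-base change inside a coding region (an equivalent codon, an amino acid substitution, or a damaged protein) can be illustrated with a short sketch. It assumes the standard genetic code; the helper name classify_codon_change and the example codons are hypothetical and do not come from the slides.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon change as synonymous, missense, or nonsense.

    Simplification: a change away from a stop codon is reported as
    'missense' here rather than as a separate stop-loss class.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # equivalent codon, same amino acid
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    return "missense"         # different amino acid


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("TGG", "TGA")]:
        print(ref, "->", alt, classify_codon_change(ref, alt))
```

Whether a missense change involves "similar" amino acids would need an additional substitution matrix, which this sketch leaves out.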

Population Genetics
- Chromosome pairs segregate and recombine in every generation.
- Every allele of every gene has its own independent evolutionary history (and future).
- Frequencies of various alleles differ in different subpopulations of people.
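To make the point about allele frequencies concrete, the following minimal sketch counts alleles in two made-up subpopulations. The genotype lists and the function name allele_frequencies are illustrative assumptions, not data from the slides.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return allele -> frequency for diploid genotypes such as 'AG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotype samples at one biallelic site (10 people each).
subpopulation_1 = ["AA"] * 6 + ["AG"] * 3 + ["GG"] * 1
subpopulation_2 = ["AA"] * 1 + ["AG"] * 4 + ["GG"] * 5

if __name__ == "__main__":
    print(allele_frequencies(subpopulation_1))
    print(allele_frequencies(subpopulation_2))
```

Each person contributes two alleles, so ten genotypes give twenty allele observations per subpopulation.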

Human Alleles
- The OMIM (Online Mendelian Inheritance in Man) database at the NCBI tracks all human mutations with known phenotypes.
- It contains a total of about 2,000 genetic diseases [and another ~11,000 genetic loci with known phenotypes - but not necessarily known gene sequences].
- It is designed for use by physicians:
  - can search by disease name
  - contains summaries from clinical studies

OMIM Morbid Map: Cytogenetic map location of disease genes.

Variation Makes Life Interesting
- The Human Genome has been sequenced; what's next?
- Much of what makes us unique individuals is represented by the differences in our DNA sequence from other people.
- There are rare and common forms (alleles) of every gene.
- Probably only 3-4 alleles are present in 95% of the population for most genes, but there are also many rare mutations.

SNPs are Mutations

SNPs
- A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP).
- Other kinds of mutations include insertions and deletions.
- Large breaks and rearrangements of chromosomes also occur (translocations).

    GATTTAGATCGCGATAGAG
    GATTTAGATCTCGATAGAG
              ^
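The single-base difference between the two example sequences on this slide can be located with a simple position-by-position comparison. This is only a sketch for equal-length, already-aligned sequences; the function name single_base_differences is hypothetical.

```python
def single_base_differences(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes the two sequences are already aligned and have equal length,
    so insertions and deletions are not handled.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


if __name__ == "__main__":
    print(single_base_differences("GATTTAGATCGCGATAGAG",
                                  "GATTTAGATCTCGATAGAG"))
    # -> [(11, 'G', 'T')]
```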

SNPs are Very Common
- SNPs are very common in the human population.
- Between any two people, there is an average of one SNP every ~1250 bases.
- Most of these have no phenotypic effect.
- Fewer than 1% of all human SNPs affect protein function, since most fall in non-coding regions.
- Selection acts against missense mutations (think about what would happen to dominant lethal mutations).
- Some are alleles of genes.

Genome Sequencing Finds SNPs
- The Human Genome Project involves sequencing DNA cloned from a number of different people. [The Celera sequence comes from 5 people.]
- Even within one person's DNA, the homologous chromosomes have SNPs.
- This inevitably leads to the discovery of SNPs - any single-base sequence difference.
- These SNPs can be valuable as the basis for diagnostic tests.

We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

http://www.ncbi.nlm.nih.gov/snp

SNP Discovery: dbSNP database

Search dbSNP with BLAST
- As of June 2008, dbSNP has 12.8 million SNPs in the human genome.
- It is possible to search dbSNP by BLAST comparisons to a target sequence.

>gnl dbsnp rs1042574_allelepos=51 total len = 101 taxid = 9606 snpclass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%)
Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
Sbjct: 61   cactaaggacttaaaataaaa 81

If a matching SNP is found, then it can be directly located on the genome map.
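In the BLAST hit above, the subject sequence carries the ambiguity code y at position 51, which matches allelepos=51 in the header. The sketch below walks the aligned query and subject and separates an ordinary mismatch from a position where the query simply carries one allele of the SNP. It assumes the standard IUPAC two-base ambiguity codes; the function name compare_to_snp_flank and the labels it returns are hypothetical.

```python
# Standard IUPAC two-base ambiguity codes (lower case, as in the BLAST output).
IUPAC_TWO_BASE = {
    "r": "ag", "y": "ct", "s": "cg",
    "w": "at", "k": "gt", "m": "ac",
}


def compare_to_snp_flank(query, subject, query_start=1):
    """List the differences between an aligned query and a dbSNP flank.

    Returns tuples of (query coordinate, subject position, query base,
    subject symbol, kind). Assumes an ungapped alignment.
    """
    report = []
    for offset, (q, s) in enumerate(zip(query.lower(), subject.lower())):
        if q == s:
            continue
        if s in IUPAC_TWO_BASE:
            kind = "snp_allele" if q in IUPAC_TWO_BASE[s] else "mismatch_at_snp"
        else:
            kind = "mismatch"
        report.append((query_start + offset, offset + 1, q, s, kind))
    return report


# Query and subject copied from the alignment shown on the slide.
QUERY = ("ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt"
         "cactaaggacttaaaataaaa")
SUBJECT = ("ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt"
           "cactaaggacttaaaataaaa")

if __name__ == "__main__":
    for row in compare_to_snp_flank(QUERY, SUBJECT, query_start=1489):
        print(row)
```

For this hit the comparison reports two differences, consistent with Identities = 79/81: an ordinary mismatch at subject position 48 and the SNP itself at subject position 51, where the query carries the c allele of the y (c/t) site.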

Uses for SNPs
- Diagnostic tests for disease alleles
- Markers to aid in cloning of interesting genes (disease genes)
- Pharmacogenomics - genetics of response to drugs (effectiveness and side effects)

DNA Diagnostic Testing
- Hereditary diseases - potential parents, prenatal, late-onset diseases.
- Genes that predispose to disease (risk factors).
- Genotyping of infectious agents (bacterial & viral).
- Forensics - using DNA testing to establish identity.

Clinical Manifestations of Genetic Variation
(All disease has a genetic component)
- Susceptibility vs. resistance
- Variations in disease severity or symptoms
- Reaction to drugs (pharmacogenetics)
- Variable disease course and prognosis
SNPs can be found that are linked to all of these traits.

Finding Disease Genes
- Virtually all diseases have a genetic component.
- Start with DNA samples from families that show inheritance of the disease.
- Use STS markers to map the gene or genes involved (linkage analysis).
- Find SNPs in the genetic region(s) that are likely candidates for involvement in that disease.
- Get the gene from a genomic sub-clone.

Some Diseases Involve Many Genes
- There are a number of classic genetic diseases caused by mutations of a single gene: Huntington's, cystic fibrosis, Tay-Sachs, PKU, etc.
- There are also many diseases that are the result of the interactions of many genes: asthma, heart disease, cancer.
- Each of these genes may be considered to be a risk factor for the disease.
- Groups of genetic markers (SNPs) may be associated with a disease without determining a mechanism.
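One common way to ask whether a single SNP marker is associated with a disease, without knowing any mechanism, is to compare allele counts in cases and controls. The sketch below uses a plain Pearson chi-square test on a 2x2 table; that choice of test, the function name allelic_chi_square, and the counts are all assumptions for illustration and are not taken from the slides.

```python
def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele table.

    case_counts and control_counts are (allele_1_count, allele_2_count).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Critical value of chi-square with 1 degree of freedom at p = 0.05.
CHI2_CRITICAL_1DF_P05 = 3.841

if __name__ == "__main__":
    # Hypothetical allele counts: 50 cases and 50 controls, two alleles each.
    statistic = allelic_chi_square(case_counts=(60, 40), control_counts=(40, 60))
    print(statistic, statistic > CHI2_CRITICAL_1DF_P05)
```

A genome scan repeats a test like this at many markers, so a real analysis would also need a correction for multiple testing.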

Multiple Causes
- Some diseases may actually be caused by any of a group of different genes (multiple causes), but all show the same symptoms.
- SNP linkage analysis can identify these sub-populations more efficiently than classical molecular genetic approaches.
- Machine learning, genetic algorithms, SVMs

"The study of the distribution of genetic variants, including SNPs, lies within the domain of population genetics, and the study of the relationship between SNPs and phenotypic variation lies in the domain of quantitative genetics." - Gibson & Muse

Quantitative Trait Locus Mapping
[Figure: crosses of Parent 1 x Parent 2 and Parent 3 x Parent 4 (marker genotypes A B C and a b c) give F1 individuals, which are crossed to produce progeny with recombinant marker genotypes; an inset plots HEIGHT against GENOTYPE (BB, Bb, bb).]
Knott et al. (1997) TAG 94:810-820

Association Mapping
[Figure: ancestral chromosomes, carrying marker alleles and a mutation (*), undergo recombination through evolutionary history to give the present-day chromosomes in a natural population.]

SNP Discovery Methods
- Pairwise sequence comparison from databases, eSNP
- Deep resequencing

SNP Analysis Agenda
- Sequence-based SNP identification
- Common bioinformatic solutions: Phred, Phrap, Consed, Polyphred, and Polybayes
- High-throughput SNP identification solution

Overlapping PCR Amplicons Across Entire Gene
- Make no assumptions about sequence function.
- Sequence diversity and genetic structure for each gene is different.
- Proper association studies can only be designed in this context.
- Complete resequencing facilitates population genetics methods.

Sequence-based SNP Identification
- Amplify DNA (5' to 3')
- Sequence: sequence each end of the fragment
- Phred: base-calling, quality determination
- Phrap: contig assembly, final quality determination
- PolyPhred/Polybayes: polymorphism detection (example reads ATAGACG and ATACACG, seen in homozygotes and in a heterozygote)
- Consed: sequence viewing, polymorphism tagging
- Analysis: polymorphism reporting, individual genotyping, phylogenetic analysis
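The polymorphism detection and individual genotyping steps can be sketched with the two example reads from this slide, ATAGACG and ATACACG. This is a deliberately simplified model: each individual is represented by two already-aligned allele sequences, whereas real resequencing reads a heterozygote as mixed peaks in a single trace. The function name call_genotypes and the sample names are hypothetical.

```python
def call_genotypes(individual_reads):
    """Find polymorphic positions and report each individual's genotype.

    individual_reads maps a sample name to a list of aligned,
    equal-length reads. Returns {1-based position: {name: alleles}},
    where alleles is one letter for a homozygote and two for a heterozygote.
    """
    names = list(individual_reads)
    length = len(individual_reads[names[0]][0])
    polymorphic = {}
    for pos in range(length):
        per_person = {
            name: "".join(sorted({read[pos] for read in reads}))
            for name, reads in individual_reads.items()
        }
        # A site is polymorphic if more than one base is seen overall.
        if len(set("".join(per_person.values()))) > 1:
            polymorphic[pos + 1] = per_person
    return polymorphic


if __name__ == "__main__":
    samples = {
        "ind1": ["ATAGACG", "ATAGACG"],  # homozygote G
        "ind2": ["ATACACG", "ATACACG"],  # homozygote C
        "ind3": ["ATAGACG", "ATACACG"],  # heterozygote
    }
    print(call_genotypes(samples))
    # -> {4: {'ind1': 'G', 'ind2': 'C', 'ind3': 'CG'}}
```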

Phred, Phrap, Consed, Polyphred, Polybayes
- phred: Base calling and quality assignments.
- phrap: Contig formation and new quality assignments.
- consed: Visual X-Windows graphic interface to view and edit alignments and contigs, and to view the original traces.
- polyphred: Finds polymorphisms in phrap contigs, makes quality calls, and adds data to phrap files so that consed can find and display the polymorphisms.
- polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.
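The quality assignments made by phred are conventionally expressed on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slides do not state this formula, so it is given here as background, and the two function names below are hypothetical.

```python
import math


def phred_to_error_prob(quality):
    """Convert a phred quality value Q to an error probability P."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(error_prob):
    """Convert an error probability P to a phred quality value Q."""
    return -10 * math.log10(error_prob)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000.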

A general approach to single-nucleotide polymorphism discovery
Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Zhijie Gu, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui-Yan Kwok & Warren R. Gish
Nature Genetics 23, 452-456 (1999)

Figure 1. Application of the POLYBAYES procedure to EST data.
a, Regions of known human repeats in a genomic sequence are masked.
b, Matching human ESTs are retrieved from dbEST and traces are re-called.
c, Paralogous ESTs are identified and discarded.
d, Alignments of native EST reads are screened for candidate variable sites.
e, An STS is designed for the verification of a candidate SNP.
f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA).
g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing
Deborah A. Nickerson, Vincent O. Tobe and Scott L. Taylor
Nucleic Acids Research 1997; 25:2745
[Figure: SNP calling examples, showing a correct call and false positives.]

Trace File: high-quality region, no ambiguities

Trace File: medium-quality region, some ambiguities

Trace File: poor-quality region, low confidence

Using PolyPhred to Visualize SNPs
- Compares sequences across traces obtained from different individuals to identify sites for SNPs.
- Will occasionally miscall genotypes - the frequency of such mistakes depends on the sequencing chemistry used to generate the trace.
- To reduce the number of miscalled sites, it ignores regions of poor quality and the ends of reads.
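The idea of ignoring read ends and poor-quality regions can be sketched as a simple filter on per-base quality values. This is not PolyPhred's actual procedure; the threshold of 20, the use of N for masking, and the function name trim_and_mask are assumptions chosen for illustration.

```python
def trim_and_mask(bases, quals, min_q=20):
    """Trim low-quality ends, then mask low-quality interior bases with 'N'.

    Returns (start, sequence), where start is the 0-based offset of the
    first retained base in the original read.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1          # drop poor bases from the 5' end
    while end > start and quals[end - 1] < min_q:
        end -= 1            # drop poor bases from the 3' end
    kept = "".join(
        base if q >= min_q else "N"
        for base, q in zip(bases[start:end], quals[start:end])
    )
    return start, kept


if __name__ == "__main__":
    read = "ACGTACGTAC"
    qualities = [5, 10, 30, 40, 40, 12, 40, 35, 8, 4]  # hypothetical values
    print(trim_and_mask(read, qualities))
    # -> (2, 'GTANGT')
```

Keeping the start offset lets any site found in the trimmed read be mapped back to its position in the original trace.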

Polyphred
- Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.
- Reads the PHD files associated with each trace.
- During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence.
- The score indicates how well the trace at the site matches the expected pattern for a SNP.
- Updates the ACE and PHD files by adding tags that mark the positions of the sites.
- The tagged sites can then be examined using Consed.

Polybayes
Bayesian statistical model takes into account:
- depth of coverage
- base quality values of the sequences
Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation.
**Can also integrate into the poly files for viewing with Consed
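The way depth of coverage, base quality, and a prior polymorphism rate combine in a Bayesian calculation can be shown with a toy model. This is not the Polybayes algorithm: it compares only two hypotheses for one alignment column (one true base versus two equally frequent alleles), treats every base pair as equally likely, and spreads sequencing errors evenly over the three other bases. The default prior of 1/1250 borrows the pairwise SNP rate quoted earlier in the slides, which is itself an assumption here, and the function name snp_posterior is hypothetical.

```python
from itertools import combinations


def snp_posterior(bases, quals, prior=1 / 1250):
    """Toy posterior probability that an alignment column is a biallelic SNP.

    bases: observed base calls in the column, e.g. "AAACCC".
    quals: phred quality value for each call.
    prior: assumed prior probability that a site is polymorphic.
    """
    errors = [10 ** (-q / 10) for q in quals]

    def p_observed(observed, true_base, err):
        # Errors are assumed to hit the three other bases equally often.
        return 1 - err if observed == true_base else err / 3

    def column_likelihood(alleles):
        # Each read is assumed to come from any listed allele equally often.
        likelihood = 1.0
        for base, err in zip(bases, errors):
            likelihood *= sum(p_observed(base, a, err) for a in alleles) / len(alleles)
        return likelihood

    # Monomorphic site: one true base, uniform over A, C, G, T.
    mono = sum(column_likelihood((a,)) for a in "ACGT") / 4
    # Polymorphic site: two alleles, uniform over the six unordered pairs.
    pairs = list(combinations("ACGT", 2))
    poly = sum(column_likelihood(pair) for pair in pairs) / len(pairs)
    return prior * poly / (prior * poly + (1 - prior) * mono)


if __name__ == "__main__":
    print(snp_posterior("AAAA", [40] * 4))             # no discrepancy
    print(snp_posterior("AAAAAC", [40] * 5 + [5]))      # one low-quality discrepancy
    print(snp_posterior("AAACCC", [40] * 6))            # well-supported second allele
```

In this toy model a single discrepant base with a low quality value barely moves the posterior, while several high-quality reads supporting a second allele push it close to 1, which is the qualitative behaviour the slide describes.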

Alignment and SNP Calling Pipeline: Challenges in High-Throughput SNP Identification

Alignment
- Critical in the automation of base calls
- The commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
- Many commonly used aligners work best with protein sequences or with a reference sequence
- Preservation of quality scores for input into SNP identification programs
- Speed for high-throughput programs

Automated SNP Calls
- Reference sequence required
- Traditional approaches without a reference sequence include eSNPs (human, maize, and pine)
  - Very little redundancy outside of abundant genes
  - Overall high number of false positives (single-pass reads)
- Not specific to frequencies observed in different organisms
- High number of false positives in currently accepted methods (Polybayes & Polyphred)

[Figure legend: 5' UTR, exon, intron, 3' UTR]

4-Coumarate CoA Ligase (4CL)
[Figure: gene map on a 0-2500 bp scale showing the overlapping amplicons and their primers (F2, F3, F4, F5, F6 and R1A, R3, R4, R6), an annotated region at 743-781 (bound_moiety="amp"), the proposed active site at 2396-2417, and a table of the alleles observed at each polymorphic site across the sampled sequences.]