
Sequence Variations
Baxevanis and Ouellette, Chapter 7 - Sequence Polymorphisms
NCBI SNP Primer: http://www.ncbi.nlm.nih.gov/about/primer/snps.html

Overview
- Mutation and alleles
- Linkage
- Genetic variation in populations
- SNPs as genetic markers
- Classical genetic diseases
- Multi-factorial diseases and risk factors
- Genome scans (genotyping)

A review of some basic genetics

Alleles
- An allele is a particular DNA sequence variant of a gene.
- Some alleles are responsible for ordinary phenotypes such as blue or brown eyes.
- Others lead to classic genetic diseases such as cystic fibrosis or Huntington's disease.

Changes occur in DNA sequences = mutations

Many Causes of Mutations
- Somatic vs. reproductive cells
- Radiation and/or chemical damage to DNA
- Random errors of the replication machinery
- Normal biological processes (methylation)

Mutations Create Alleles
- Mutations occur randomly throughout DNA.
- Most have no phenotypic effect (non-coding regions, equivalent codons, similar amino acids).
- Some damage the function of a protein or regulatory element.
- A very few provide an evolutionary advantage.

Population Genetics
- Chromosome pairs segregate and recombine in every generation.
- Every allele of every gene has its own independent evolutionary history (and future).
- Frequencies of various alleles differ in different subpopulations of people.

Human Alleles
- The OMIM (Online Mendelian Inheritance in Man) database at the NCBI tracks all human mutations with known phenotypes.
- It contains a total of about 2,000 genetic diseases [and another ~11,000 genetic loci with known phenotypes, but not necessarily known gene sequences].
- It is designed for use by physicians:
  - can search by disease name
  - contains summaries from clinical studies

OMIM Morbid Map: Cytogenetic map location of disease genes.

Variation Makes Life Interesting
- The Human Genome has been sequenced; what's next?
- Much of what makes us unique individuals is represented by the differences between our DNA sequence and other people's.
- There are rare and common forms (alleles) of every gene.
- For most genes, probably only 3-4 alleles are present in 95% of the population, alongside many rare mutations.

SNPs are Mutations

SNPs
- A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP).
- Other kinds of mutations include insertions and deletions.
- Large breaks and rearrangements of chromosomes (translocations) also occur.

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
          ^
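The single-base difference marked on this slide can be located programmatically. The following is a minimal sketch, assuming two already-aligned sequences of equal length; the function name `find_single_base_differences` is illustrative and not part of any tool mentioned in these slides.

```python
def find_single_base_differences(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes the two sequences are already aligned and have the same length,
    so insertions and deletions are out of scope for this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# The two sequences shown on the slide.
reference = "GATTTAGATCGCGATAGAG"
variant = "GATTTAGATCTCGATAGAG"

differences = find_single_base_differences(reference, variant)
print(differences)  # [(11, 'G', 'T')]
```

The single reported site (position 11, G versus T) is the base the caret points to.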

SNPs are Very Common
- SNPs are very common in the human population.
- Between any two people, there is an average of one SNP every ~1250 bases.
- Most of these have no phenotypic effect.
- Fewer than 1% of all human SNPs affect protein function; most fall in non-coding regions.
- There is selection against mis-sense mutations (think about what would happen to dominant lethal mutations).
- Some are alleles of genes.

Genome Sequencing Finds SNPs
- The Human Genome Project involves sequencing DNA cloned from a number of different people. [The Celera sequence comes from 5 people.]
- Even within one person's DNA, the homologous chromosomes have SNPs.
- This inevitably leads to the discovery of SNPs: any single-base sequence difference.
- These SNPs can be valuable as the basis for diagnostic tests.

We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
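The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above (1.42 million SNPs, one SNP every 1.9 kilobases, 60,000 exonic SNPs); the variable names are illustrative, and the "implied sequence length" is a back-of-the-envelope estimate, not a value reported in the abstract.

```python
# Numbers taken directly from the abstract.
snp_count = 1_420_000          # SNPs in the map
average_spacing_bp = 1_900     # one SNP every 1.9 kb of available sequence
exonic_snp_count = 60_000      # estimated SNPs within exons

# Amount of available sequence implied by count x spacing (an estimate).
implied_sequence_bp = snp_count * average_spacing_bp

# Share of mapped SNPs that fall within exons.
exonic_fraction = exonic_snp_count / snp_count

print(f"Implied available sequence: {implied_sequence_bp / 1e9:.2f} Gb")
print(f"Exonic share of SNPs: {exonic_fraction:.1%}")
```

Under these numbers the map covers roughly 2.7 Gb of sequence, and about 4% of the mapped SNPs lie within exons.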

http://www.ncbi.nlm.nih.gov/snp

SNP Discovery: dbSNP database

Search dbSNP with BLAST
- As of June 2008, dbSNP has 12.8 million SNPs in the human genome.
- It is possible to search dbSNP by BLAST comparisons to a target sequence.

If a matching SNP is found, then it can be directly located on the Genome map.

>gnl dbsnp rs1042574_allelepos=51 total len = 101 taxid = 9606 snpclass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%)
Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
Sbjct: 61   cactaaggacttaaaataaaa 81
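In this BLAST hit the subject sequence contains `y`, the IUPAC ambiguity code for C or T, at the SNP position. The sketch below re-compares the two aligned sequences from the hit, first letter by letter and then treating ambiguity codes as sets of allowed bases. It is an illustration only, not how BLAST itself scores the alignment; the `IUPAC` table is the standard nucleotide code table, and the function name is hypothetical.

```python
# Standard IUPAC nucleotide codes, lower case to match the BLAST output.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

# Query and subject rows copied from the alignment above (81 columns).
query = (
    "ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt"
    "cactaaggacttaaaataaaa"
)
sbjct = (
    "ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt"
    "cactaaggacttaaaataaaa"
)


def compare_alignment(query_seq, sbjct_seq):
    """Return (literal mismatches, mismatches not explained by IUPAC codes).

    Positions are 1-based columns of the alignment.
    """
    literal, incompatible = [], []
    for pos, (q, s) in enumerate(zip(query_seq, sbjct_seq), start=1):
        if q != s:
            literal.append(pos)
            if q not in IUPAC[s]:
                incompatible.append(pos)
    return literal, incompatible


literal, incompatible = compare_alignment(query, sbjct)
print("identities:", len(query) - len(literal), "/", len(query))
print("literal mismatches at columns:", literal)
print("still mismatched after IUPAC expansion:", incompatible)
```

The letter-by-letter comparison gives 79 identical columns out of 81, consistent with the reported identities. Column 51 of the subject is the `y` site named by `allelepos=51`, and the query base there (`c`) is one of its two alleles.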

Uses for SNPs
- Diagnostic tests for disease alleles
- Markers to aid in cloning of interesting genes (disease genes)
- Pharmacogenomics: genetics of response to drugs (effectiveness and side effects)

DNA Diagnostic Testing
- Hereditary diseases: potential parents, prenatal, late-onset diseases
- Genes that predispose to disease (risk factors)
- Genotyping of infectious agents (bacterial and viral)
- Forensics: using DNA testing to establish identity

Clinical Manifestations of Genetic Variation
(All disease has a genetic component)
- Susceptibility vs. resistance
- Variations in disease severity or symptoms
- Reaction to drugs (pharmacogenetics)
- Variable disease course and prognosis
SNPs can be found that are linked to all of these traits.

Finding Disease Genes
- Virtually all diseases have a genetic component.
- Start with DNA samples from families that show inheritance of the disease.
- Use STS markers to map the gene or genes involved (linkage analysis).
- Find SNPs in the genetic region(s) that are likely candidates for involvement in that disease.
- Get the gene from a genomic sub-clone.

Some Diseases Involve Many Genes
- There are a number of classic genetic diseases caused by mutations of a single gene: Huntington's, cystic fibrosis, Tay-Sachs, PKU, etc.
- There are also many diseases that result from the interactions of many genes: asthma, heart disease, cancer.
- Each of these genes may be considered a risk factor for the disease.
- Groups of genetic markers (SNPs) may be associated with a disease without determining a mechanism.

Multiple Causes
- Some diseases may be caused by any of a group of different genes (multiple causes), yet all show the same symptoms.
- SNP linkage analysis can identify these sub-populations more efficiently than classical molecular genetic approaches.
- Methods: machine learning, genetic algorithms, SVMs

"The study of the distribution of genetic variants, including SNPs, lies within the domain of population genetics, and the study of the relationship between SNPs and phenotypic variation lies in the domain of quantitative genetics." (Gibson & Muse)

Quantitative Trait Locus Mapping
[Figure: crosses between parents (Parent 1 x Parent 2, Parent 3 x Parent 4) carrying marker haplotypes A B C and a b c, their F1 offspring, the recombinant progeny, and a plot of HEIGHT against GENOTYPE (BB, Bb, bb).]
Knott et al. (1997) TAG 84:810-820

Association Mapping
[Figure: ancestral chromosomes; recombination through evolutionary history; present-day chromosomes in a natural population. An asterisk (*) marks a site on some chromosomes.]

SNP Discovery Methods
- Pairwise sequence comparison from databases (eSNP)
- Deep resequencing

SNP Analysis Agenda
- Sequence-based SNP identification
- Common bioinformatic solutions: Phred, Phrap, Consed, PolyPhred, and Polybayes
- High-throughput SNP identification solution

Overlapping PCR Amplicons Across the Entire Gene
- Make no assumptions about sequence function.
- Sequence diversity and genetic structure are different for each gene.
- Proper association studies can only be designed in this context.
- Complete resequencing facilitates population genetics methods.

Sequence-based SNP Identification
- Amplify DNA (5' to 3')
- Sequence: sequence each end of the fragment
- Phred: base-calling, quality determination
- Phrap: contig assembly, final quality determination
- PolyPhred/Polybayes: polymorphism detection
- Consed: sequence viewing, polymorphism tagging
- Analysis: polymorphism reporting, individual genotyping, phylogenetic analysis

Example reads (homozygotes and a heterozygote): ATAGACG ATAGACG ATACACG ATACACG ATAGACG ATACACG
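The polymorphism-detection step can be illustrated with the six example reads on this slide. The sketch below scans aligned reads column by column and reports columns with more than one base. It is a toy illustration, not the PolyPhred or Polybayes algorithm, and it ignores base quality. Pairing consecutive reads as the two alleles of one individual is an assumption made here to reproduce the homozygote/heterozygote labels.

```python
from collections import Counter

# The six reads shown on the slide, assumed to be already aligned.
reads = ["ATAGACG", "ATAGACG", "ATACACG", "ATACACG", "ATAGACG", "ATACACG"]


def variable_sites(aligned_reads):
    """Map each 1-based column that shows more than one base to its base counts."""
    sites = {}
    for column, bases in enumerate(zip(*aligned_reads), start=1):
        counts = Counter(bases)
        if len(counts) > 1:
            sites[column] = dict(counts)
    return sites


def genotype(allele_1, allele_2, column):
    """Classify one individual at a column from its two allele sequences."""
    a, b = allele_1[column - 1], allele_2[column - 1]
    label = "homozygote" if a == b else "heterozygote"
    return f"{a}/{b} {label}"


sites = variable_sites(reads)
print(sites)  # {4: {'G': 3, 'C': 3}}

# Assumption: reads 1-2, 3-4 and 5-6 are the allele pairs of three individuals.
for first in range(0, len(reads), 2):
    print(genotype(reads[first], reads[first + 1], column=4))
```

With this pairing, the three individuals come out as G/G homozygote, C/C homozygote, and G/C heterozygote at column 4.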

Phred, Phrap, Consed, PolyPhred, Polybayes
- phred: base calling and quality assignments
- phrap: contig formation and new quality assignments
- consed: visual X-Windows graphic interface to view and edit alignments and contigs, and to view the original traces
- polyphred: finds polymorphisms in phrap contigs, makes quality calls, and adds data to phrap files so that consed can find and visualize polymorphisms
- polybayes: fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors
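The quality values that phred assigns follow the standard Phred scale, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_prob(quality):
    """Convert a Phred quality value Q to the base-call error probability P."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(error_prob):
    """Convert a base-call error probability P to a Phred quality value Q."""
    return -10 * math.log10(error_prob)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):g}")
```

On this scale Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1000, which is why downstream SNP callers can weigh a discrepancy at a high-quality base more heavily than one at a low-quality base.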

A general approach to single-nucleotide polymorphism discovery
Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Zhijie Gu, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui-Yan Kwok & Warren R. Gish. Nature Genetics 23, 452-456 (1999)

Figure 1. Application of the POLYBAYES procedure to EST data.
a, Regions of known human repeats in a genomic sequence are masked.
b, Matching human ESTs are retrieved from dbEST and traces are re-called.
c, Paralogous ESTs are identified and discarded.
d, Alignments of native EST reads are screened for candidate variable sites.
e, An STS is designed for the verification of a candidate SNP.
f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA).
g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing
Deborah A. Nickerson, Vincent O. Tobe and Scott L. Taylor. Nucleic Acids Research 1997; 25:2745
[Figure: SNP calling examples, showing a correct call and false positives.]

Trace File High quality region no ambiguities

Trace File Medium quality region some ambiguities

Trace File Poor quality region low confidence

Using PolyPhred to Visualize SNPs
- Compares sequences across traces obtained from different individuals to identify sites for SNPs.
- Will occasionally miscall genotypes; the frequency of such mistakes depends on the sequencing chemistry used to generate the trace.
- To reduce the number of miscalled sites, it ignores regions of poor quality and the ends of reads.
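The idea of ignoring poor-quality regions and read ends can be sketched as a simple masking step. This is a toy illustration and not PolyPhred's actual procedure: the quality threshold, the number of end bases dropped, and the example quality values are all assumptions chosen for the example.

```python
def mask_low_quality(bases, qualities, min_quality=20, trim_ends=2):
    """Replace unreliable base calls with 'N'.

    A base is masked if it lies within `trim_ends` positions of either end of
    the read or if its quality value is below `min_quality`. Both parameters
    are hypothetical defaults for this sketch.
    """
    if len(bases) != len(qualities):
        raise ValueError("need one quality value per base")
    read_length = len(bases)
    masked = []
    for index, (base, quality) in enumerate(zip(bases, qualities)):
        at_end = index < trim_ends or index >= read_length - trim_ends
        masked.append("N" if at_end or quality < min_quality else base)
    return "".join(masked)


# Hypothetical quality values for a 7-base read.
read = "ATAGACG"
quals = [8, 35, 40, 12, 38, 40, 9]

print(mask_low_quality(read, quals, min_quality=20, trim_ends=1))  # NTANACN
```

Masked positions would then be skipped when comparing reads across individuals, so a low-confidence call cannot be reported as a SNP.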

PolyPhred
- Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.
- Reads the PHD files associated with each trace.
- During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence.
- The score indicates how well the trace at the site matches the expected pattern for a SNP.
- Updates the ACE and PHD files by adding tags that mark the positions of the sites.
- The tagged sites can then be examined using Consed.

Polybayes
- The Bayesian statistical model takes into account:
  - depth of coverage
  - base quality values of the sequences
- Polybayes calculations are aided by information on major/minor allele frequencies as well as polymorphism rates within the species under investigation.
- Can also integrate into the poly files for viewing with Consed.

Alignment and SNP Calling Pipeline: Challenges in High-Throughput SNP Identification

Alignment
- Critical in the automation of base calls
- The commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
- Many commonly used aligners work best with protein sequences or with a reference sequence
- Preservation of quality scores for input into SNP identification programs
- Speed for high-throughput programs

Automated SNP Calls
- Reference sequence required
- Traditional approaches without a reference sequence include eSNPs (human, maize, and pine)
  - Very little redundancy outside of abundant genes
  - Overall high number of false positives (single-pass reads)
- Not specific to frequencies observed in different organisms
- High number of false positives in currently accepted methods (Polybayes and PolyPhred)

[Figure legend: 5' UTR, exon, intron, 3' UTR]

4-Coumarate CoA Ligase (4CL)
[Figure: gene map on a 0-2500 bp scale showing amplicon primers (F2-F6, R1A, R3, R4, R6), the annotated feature bound_moiety="amp" at 743-781, the proposed active site at 2396-2417, and an alignment of the variable sites across the sequenced haplotypes.]