Genomics I. Organization of the Genome


Outline
- Organization of the genome: genomes, chromosomes, genes, exons, introns, promoters, enhancers, etc.
- Databases: Why do we need them? How do we access them? What can they do for us?
- Basic principles of bioinformatics

What is a genome? Definition: the complete set of genetic material present in the cells of an organism. The genetic material is composed of DNA. Base pairing + base stacking give the double helix.

Genome Sizes and Phylogeny 0.5 to 7 Mbp 10 Mbp to 1000 Gbp

The Human Genome (February 2001). Considered a crowning achievement: the "blueprint of life". Yet many questions remain regarding fidelity and organization (e.g., how many genes?).

The Human Genome Project

What is the Human Genome Project? Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others.
Goals:
- identify the approximately 20,000-25,000 genes in human DNA
- determine the sequences of the 3 billion bases that make up human DNA
- store this information in databases
- develop tools for data analysis
- transfer related technologies to the private sector
- address the ethical, legal, and social issues that arise from genome research

Why is the Department of Energy involved?
- After atomic bombs were dropped during World War II, Congress told DOE to conduct studies to understand the biological and health effects of radiation and chemical by-products of all energy production.
- The best way to study these effects is at the DNA level.

Whose genome is being sequenced? The first reference genome is a composite genome from several different people, generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups.

Benefits of HGP Research
- improvements in medicine
- microbial genome research for fuel and environmental cleanup
- DNA forensics
- improved agriculture and livestock
- better understanding of evolution and human migration
- more accurate risk assessment

Ethical, Legal, and Social Implications of HGP Research
- fairness in the use of genetic information
- privacy and confidentiality
- psychological impact and stigmatization
- genetic testing
- reproductive issues
- education, standards, and quality control
- commercialization
- conceptual and philosophical implications

For More Information about HGP Human Genome Project Information Website http://www.ornl.gov

Basic numbers in the Human Genome
- 3 x 10^9 bp
- ~30,000 genes (an early estimate; the HGP goals quoted above give 20,000-25,000)
- 23 x 2 = 46 chromosomes
- All from 4 bases (A, C, G, T)

Chromosomes: a chromosome is a single large macromolecule of DNA and is the basic "unit" of DNA in a cell. It is a very long, continuous piece of DNA (a single DNA molecule), which contains many genes, regulatory elements, and other intervening nucleotide sequences.

Supercontig: Rat Chromosome 13

#    Contig      Contig start  Contig end   Chr start     Chr end
1    NW_047390   1             19,234,043   1             19,234,043
     Gap         1             50,000       19,234,044    19,284,043
2    NW_047391   1             11,093,222   19,284,044    30,377,265
     Gap         1             50,000       30,377,266    30,427,265
3    NW_047392   1             2,305,237    30,427,266    32,732,502
     Gap         1             50,000       32,732,503    32,782,502
4    NW_047393   1             7,069,318    32,782,503    39,851,820
     Gap         1             50,000       39,851,821    39,901,820
5    NW_047394   1             4,889,800    39,901,821    44,791,620
     Gap         1             50,000       44,791,621    44,841,620
6    NW_047395   1             4,278,911    44,841,621    49,120,531
     Gap         1             50,000       49,120,532    49,170,531
7    NW_047396   1             2,820,895    49,170,532    51,991,426
     Gap         1             50,000       51,991,427    52,041,426
8    NW_047397   1             16,884,033   52,041,427    68,925,459
     Gap         1             50,000       68,925,460    68,975,459
9    NW_047398   1             13,699,042   68,975,460    82,674,501
     Gap         1             50,000       82,674,502    82,724,501
10   NW_047399   1             12,573,714   82,724,502    95,298,215
     Gap         1             50,000       95,298,216    95,348,215
11   NW_047400   1             11,599,125   95,348,216    106,947,340
     Gap         1             50,000       106,947,341   106,997,340
12   NW_047401   1             242,424      106,997,341   107,239,764
     Gap         1             50,000       107,239,765   107,289,764
13   NW_047402   1             954,180      107,289,765   108,243,944
     Gap         1             50,000       108,243,945   108,293,944
14   NW_047403   1             671,604      108,293,945   108,965,548
     Gap         1             50,000       108,965,549   109,015,548
15   NW_047404   1             2,333,410    109,015,549   111,348,958

Plus strand (+), 5' to 3', ending at 111,348,958 (only an issue if you are building a database).
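The chromosome coordinates in the supercontig table follow directly from the contig lengths and the fixed 50,000 bp gap. A minimal sketch of that arithmetic, assuming 1-based inclusive coordinates; the function name `place_contigs` is hypothetical and only the first three contigs are used:

```python
GAP = 50_000  # placeholder gap length between consecutive contigs, as in the table


def place_contigs(contig_lengths, gap=GAP):
    """Return 1-based, inclusive (start, end) chromosome coordinates per contig."""
    placements = []
    next_start = 1
    for length in contig_lengths:
        end = next_start + length - 1
        placements.append((next_start, end))
        next_start = end + gap + 1  # the next contig begins right after the gap
    return placements


# Lengths of NW_047390, NW_047391, NW_047392 from the table
lengths = [19_234_043, 11_093_222, 2_305_237]
coords = place_contigs(lengths)
for start, end in coords:
    print(f"{start:,} {end:,}")
```

Running the same function over all 15 contig lengths should end at the final coordinate listed in the table.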

Contig assembly: physical map. Software (Image or Bandleader) is used to identify overlapping clones with common restriction fragments and assembles them into a contig (FPC). [Figure: overlapping clones A-G with shared restriction fragments marked *] http://www.gensips.gatech.edu/slides/mardis.ppt

Sequence data assembly: Supercontig creation and gap filling (A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, three contigs are joined into one supercontig. (B) ARACHNE attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs. Genome Research 12: 177-189 (2002)
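The linking rule in (A) can be sketched as a grouping problem: count the forward-reverse links between each pair of contigs and join pairs that share at least two. This is an illustrative simplification, not ARACHNE's implementation; it ignores contig order, orientation, and link distances, and the names `build_supercontigs`, `contigs`, and `links` are hypothetical.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """Group contigs that are connected by at least `min_links` read-pair links."""
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair
    pair_counts = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in pair_counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
supercontigs = build_supercontigs(contigs, links)
print(supercontigs)
```

Here three contigs are joined into one supercontig, while `c4` stays separate because it shares only a single link with `c3`.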

Whole genome map assembly Genome map Edit contigs and align to map. Gaps between clones can be filled with other clones, such as fosmids, or by generating PCR products from BAC clones or genomic DNA.

Genes: The Central Dogma. DNA -> RNA -> Protein. [Figure: a more realistic picture, adding metabolites, interactions, expression, and growth rate]

The Genetic Code In reality, there is more information in the genome than just amino acid sequences.

The classic molecular human disease: Sickle cell, HbS. Normal RBC: 6-8 µm; 4e12 per L. Sickle cells: HbS. 1949: Castle & Pauling. Single nucleotide polymorphism (SNP): GAG to GUG (E6V). Treatments: antibiotics, hydroxyurea, or bone-marrow transplant. (From an old version of George Church's Biophysics 101 class; see Further Reading.)
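The E6V label follows from translating the two codons named on the slide. A small sketch, using a deliberately partial codon table (only the two codons needed, taken from the standard genetic code); the helper `describe_substitution` is a hypothetical name:

```python
# Partial codon table: only the codons used in this example.
CODON_TABLE = {
    "GAG": "E",  # glutamic acid
    "GUG": "V",  # valine
}


def describe_substitution(ref_codon, alt_codon, position):
    """Return a label such as 'E6V' for a codon change at a residue position."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return f"{ref_aa}{position}{alt_aa}"


label = describe_substitution("GAG", "GUG", 6)
print(label)
```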

Routine screening for intelligence alleles. Phenylketonuria (PKU) is one of the commonest inherited disorders, occurring in approximately 1 in 10,000 babies born in the U.S. The PKU gene is required for F (phenylalanine) to Y (tyrosine) conversion. Phenylalanine build-up prevents the brain from developing properly. Progressive intellectual disability results if PKU is not treated from early infancy. Discovered by Følling in 1934. Nature/Nurture: ~100% genetic, with a normal diet leading to intellectual disability; ~100% environmental, varying with knowledge of prevention by reduced F in the diet. All states and U.S. territories screen newborns for PKU (some since the 1960s).

So where do I find the genome? NCBI: http://www.ncbi.nlm.nih.gov/genomes/ UCSC genome browser: http://hgdownload.cse.ucsc.edu/downloads.html Ensembl: http://www.ensembl.org

Organization of the Gene
- Exons: regions of DNA that code for protein
- Introns: intervening regions that are spliced out
- Transcriptional start site (TSS): where transcription begins
- Promoter: sequences upstream of the TSS that are bound by transcription factor proteins to regulate gene expression
[Figure: regulatory region (promoter) upstream of the TSS, followed by the coding region with introns and exons]

BLAST (Basic Local Alignment Search Tool)
- Compares sequences of DNA for sequence similarity
- Can be two sequences of yours, or one of yours against a known human, rat, ... genome
- Will give you back similarities, not just identical matches
- Can give you disjoint or continuous hits

What BLAST tells you
- BLAST reports surprising alignments: different than chance
- Assumptions: random sequences; constant composition
- Conclusions: surprising similarities imply evolutionary homology
Evolutionary homology: descent from a common ancestor. Does not always imply similar function.

Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/protein) of query and database:
  - DNA vs. DNA
  - DNA translation vs. protein
  - Protein vs. protein
  - Protein vs. DNA translation
  - DNA translation vs. DNA translation
- www, standalone, and network clients

BLAST and BLAST-like programs
- Traditional BLAST (blastall): nucleotide, protein, translations
  - blastn: nucleotide query vs. nucleotide database
  - blastp: protein query vs. protein database
  - blastx: translated nucleotide query vs. protein database
  - tblastn: protein query vs. translated nucleotide database
  - tblastx: translated query vs. translated database
- Megablast: nucleotide only
  - Contiguous megablast: nearly identical sequences
  - Discontiguous megablast: cross-species comparison
- Position Specific BLAST programs: protein only
  - Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
  - Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

Nucleotide Words: make a lookup table of words from the query
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
11-mers (blastn): GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ...
28-mers (megablast): GTACTGGACATGGACCCTACAGGAACGT, TGGACATGGACCCTACAGGAACGTATAC, CATGGACCCTACAGGAACGTATACGTAA, ...

WORD SIZE   blastn   megablast
Default     11       28
Minimum     7        12
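The lookup table of words can be sketched as a dictionary from each word to the query positions where it starts. This is a simplified illustration rather than BLAST's actual data structure; `build_word_table` is a hypothetical name and positions are 0-based.

```python
from collections import defaultdict


def build_word_table(query, word_size):
    """Map every overlapping word of length `word_size` to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return dict(table)


query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
blastn_words = build_word_table(query, 11)     # default blastn word size
megablast_words = build_word_table(query, 28)  # default megablast word size

print(len(blastn_words), len(megablast_words))
```

A larger word size yields fewer words and therefore fewer chance hits, which is part of why megablast is faster but only suited to very similar sequences.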

Protein Words: make a lookup table of words from the query
Query: GTQITVEDLFYNIATRRKALKN
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
Word size = 3 (default). Word size can only be 2 or 3.
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
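Neighborhood words are the words whose substitution-matrix score against a query word meets some threshold. The sketch below scores candidates for the query word ITV using BLOSUM62 values restricted to six residues (I, L, M, V, T, S) to stay short; the threshold of 11 and the function names are assumptions for illustration, and a real search enumerates all 20 amino acids.

```python
from itertools import product

# BLOSUM62 scores for a six-residue subset only (an illustrative restriction).
_SCORES = {
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1, ("I", "V"): 3, ("I", "T"): -1, ("I", "S"): -2,
    ("L", "L"): 4, ("L", "M"): 2, ("L", "V"): 1, ("L", "T"): -1, ("L", "S"): -2,
    ("M", "M"): 5, ("M", "V"): 1, ("M", "T"): -1, ("M", "S"): -1,
    ("V", "V"): 4, ("V", "T"): 0, ("V", "S"): -2,
    ("T", "T"): 5, ("T", "S"): 1,
    ("S", "S"): 4,
}
ALPHABET = "ILMVTS"


def pair_score(a, b):
    """Symmetric lookup into the partial matrix."""
    return _SCORES[(a, b)] if (a, b) in _SCORES else _SCORES[(b, a)]


def word_score(w1, w2):
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighborhood(word, threshold):
    """All words over ALPHABET scoring at least `threshold` against `word`."""
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(word)))
    return sorted(w for w in candidates if word_score(word, w) >= threshold)


for w in ("LTV", "MTV", "ISV", "LSV"):
    print(w, word_score("ITV", w))
neighbors = neighborhood("ITV", threshold=11)
print(neighbors)
```

Lowering the threshold admits more neighborhood words, which increases sensitivity at the cost of more word hits to extend.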

Minimum Requirements for a Hit
- Nucleotide BLAST requires one exact match.
  Example: ATCGCCATGCTTAATTGGGCTT contains the exact word match CATGCTTAATT (one match).
- Protein BLAST requires two neighboring matches within 40 aa.
  Example: GTQITVEDLFYNI is hit by the neighborhood words SEI and YYN (two matches).
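The two-hit requirement for protein searches can be sketched by grouping word hits by diagonal (subject position minus query position) and looking for two hits on the same diagonal within the 40-residue window. This is a simplified sketch, assuming hits are given as 0-based `(query_pos, subject_pos)` pairs and that the two words must not overlap; `find_two_hit_seeds` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def find_two_hit_seeds(hits, word_size=3, window=40):
    """Return (diagonal, first_query_pos, second_query_pos) for qualifying hit pairs."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, q_positions in sorted(by_diagonal.items()):
        for first, second in combinations(sorted(q_positions), 2):
            distance = second - first
            # Non-overlapping words that lie within the window
            if word_size <= distance <= window:
                seeds.append((diagonal, first, second))
    return seeds


hits = [(0, 10), (20, 30), (5, 50), (70, 80)]
seeds = find_two_hit_seeds(hits)
print(seeds)
```

Only the first two hits qualify: they share diagonal 10 and are 20 residues apart, while the hit at query position 70 is too far away and the hit on diagonal 45 has no partner.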

An alignment that BLAST can't find

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Megablast: NCBI's Genome Annotator
- Long alignments for similar DNA sequences
- Concatenation of query sequences
- Faster than blastn
- Contiguous Megablast: exact word match; word size 28
- Discontiguous Megablast: initial word hit with mismatches; cross-species comparison

Templates for Discontiguous Words

W    t    Type         Template
11   16   coding       1101101101101101
11   16   non-coding   1110010110110111
12   16   coding       1111101101101101
12   16   non-coding   1110110110110111
11   18   coding       101101100101101101
11   18   non-coding   111010010110010111
12   18   coding       101101101101101101
12   18   non-coding   111010110010110111
11   21   coding       100101100101100101101
11   21   non-coding   111010010100010010111
12   21   coding       100101101101100101101
12   21   non-coding   111010010110010010111

W = word size (number of matches in the template); t = template length (window size within which the word match is evaluated).
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
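A template is applied by keeping only the bases under the 1 positions of a window of length t; two windows produce a word hit when those kept bases agree. A minimal sketch using the W = 11, t = 16 coding template, with made-up example sequences and a hypothetical helper name `spaced_word`:

```python
CODING_11_16 = "1101101101101101"  # W = 11, t = 16, coding template from the table


def spaced_word(seq, start, template):
    """Return the bases of seq[start:start+t] that fall on '1' positions."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


# Two made-up sequences that differ only at positions the template ignores
a = "ACGTACGTACGTACGT"
b = "ACATATGTGCGTACGT"

word_a = spaced_word(a, 0, CODING_11_16)
word_b = spaced_word(b, 0, CODING_11_16)
print(word_a, word_b, word_a == word_b)
```

The coding template skips every third position, so differences confined to those positions do not break the match, whereas a contiguous 11-mer comparison of the same two sequences fails.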

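Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect value E = the number of database hits you expect to find by chance, given the size of the database, with a score at least as high as your score.
$E = K m n e^{-\lambda s}$ or $E = m n 2^{-S}$
- $s$ = raw alignment score; $m$, $n$ = lengths of the query and the database
- $K$ = scale for search space
- $\lambda$ = scale for scoring system
- $S$ = bit score $= (\lambda s - \ln K)/\ln 2$
(Applies to ungapped alignments.)
[Figure: number of alignments vs. score, marking your score and the expected number of random hits]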
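The two forms of the expect value are equivalent once the raw score is converted to a bit score. A short numerical check, with illustrative parameter values (λ = 0.318 and K = 0.134 are assumed here for demonstration, as are the query and database lengths):

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * s)"""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S)"""
    return m * n * 2.0 ** (-bits)


lam, k = 0.318, 0.134      # assumed example parameters
m, n = 250, 1_000_000_000  # assumed query and database lengths
s = 100                    # assumed raw score

e_raw = expect_from_raw(s, m, n, lam, k)
e_bits = expect_from_bits(bit_score(s, lam, k), m, n)
print(e_raw, e_bits)
```

Because bit scores absorb λ and K, they can be compared across searches that use different scoring systems.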

Scoring Systems
- Position Independent Matrices
  - Nucleic acids: identity matrix
  - Proteins:
    - PAM Matrices (Percent Accepted Mutation): implicit model of evolution; higher PAM numbers are all calculated from PAM1; PAM250 widely used
    - BLOSUM Matrices (BLOck SUbstitution Matrices): empirically determined from alignment of conserved blocks; each includes information up to a certain level of identity; BLOSUM62 widely used
- Position Specific Score Matrices (PSSMs): PSI-BLAST and RPS-BLAST

BLOSUM62

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1

Notes on the slide:
- Common amino acids have low weights.
- Rare amino acids have high weights.
- Positive for more likely substitutions.
- Negative for less likely substitutions.

Position Specific Substitution Rates [Figure: a typical serine compared with an active site serine]

Position Specific Score Matrix (PSSM)

         A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D    0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G   -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V   -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I   -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S   -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S    4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C   -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N   -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G   -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D   -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S   -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P   -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L   -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N   -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C    0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q    0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A   -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Annotations on the slide: "Serine scored differently in these two positions"; "Active site nucleophile".

Gapped Alignments
- Gapping provides more biologically realistic alignments.
- Gapped BLAST parameters must be simulated.
- Affine gap costs = -(a + bk), where a = gap open penalty, b = gap extend penalty, and k = gap length.
- A gap of length 1 receives the score -(a + b).

Scores

            V    D    S    -    C    Y
            V    E    T    L    C    F    Total
BLOSUM62   +4   +2   +1  -12   +9   +3    7
PAM30      +7   +2    0  -10  +10   +2    11
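The totals in the scores table can be reproduced by summing the per-column substitution scores and charging each gap the affine cost -(a + bk). In this sketch the pair scores are copied from the slide, while the split of the gap cost into open and extend penalties (11 + 1 for BLOSUM62, 9 + 1 for PAM30) is an assumption chosen to match the -12 and -10 shown; `score_alignment` is a hypothetical helper.

```python
def score_alignment(top, bottom, pair_scores, gap_open, gap_extend):
    """Score a pairwise alignment with affine gap costs -(a + b*k)."""
    total = 0
    gap_len = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1  # extend the current run of gap columns
            continue
        if gap_len:
            total -= gap_open + gap_extend * gap_len
            gap_len = 0
        key = (a, b) if (a, b) in pair_scores else (b, a)
        total += pair_scores[key]
    if gap_len:  # alignment ends inside a gap
        total -= gap_open + gap_extend * gap_len
    return total


# Pair scores taken from the slide; only the pairs needed for this alignment
BLOSUM62_PAIRS = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}
PAM30_PAIRS = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2}

top, bottom = "VDS-CY", "VETLCF"
blosum_total = score_alignment(top, bottom, BLOSUM62_PAIRS, gap_open=11, gap_extend=1)
pam_total = score_alignment(top, bottom, PAM30_PAIRS, gap_open=9, gap_extend=1)
print(blosum_total, pam_total)
```

The sketch treats any run of consecutive gap columns as one gap, which is sufficient for this example.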

Becker et al., Nature, 1998 Position Weight Matrices

PWMs (continued)

Formulae used in searching DNA sequences

Inter-species Comparison
- Albumin gene promoters obtained from rat, human, and mouse genomes using Promoser
- Aligned using BLAST: conserved regions (human vs. mouse/rat) span from -250 to +50 relative to the TSS
- Regulatory elements obtained using Possum
- Retained 200 bp upstream from the TSS
[Figure: rat, mouse, and human promoter regions from -1000 to +50]

Phylogenetic footprinting

Further Reading
- George Church's Computational Biology (Biophysics 101) course: http://www.courses.fas.harvard.edu/~bphys101/
- Your Molecular Cell Biol. text!