Genomics I. Organization of the Genome

Outline
- Organization of the genome: genomes, chromosomes, genes, exons, introns, promoters, enhancers, etc.
- Databases: Why do we need them? How do we access them? What can they do for us?
- Basic principles of bioinformatics

What is a genome?
- Definition: the complete set of genetic material present in the cells of an organism
- The genetic material is composed of DNA
- Base pairing + base stacking -> double helix

Genome Sizes and Phylogeny 0.5 to 7 Mbp 10 Mbp to 1000 Gbp

The Human Genome
- February 2001
- Considered a crowning achievement: the "blueprint of life"
- Yet many questions remain regarding fidelity and organization (e.g., how many genes?)

The Human Genome Project

What is the Human Genome Project?
Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others.
Goals:
- identify the approximately 20,000-25,000 genes in human DNA
- determine the sequences of the 3 billion bases that make up human DNA
- store this information in databases
- develop tools for data analysis
- transfer related technologies to the private sector
- address the ethical, legal, and social issues that arise from genome research

Why is the Department of Energy involved?
- After atomic bombs were dropped during World War II, Congress told DOE to conduct studies to understand the biological and health effects of radiation and chemical by-products of all energy production
- The best way to study these effects is at the DNA level

Whose genome is being sequenced?
- The first reference genome is a composite genome from several different people
- Generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups

Benefits of HGP Research
- improvements in medicine
- microbial genome research for fuel and environmental cleanup
- DNA forensics
- improved agriculture and livestock
- better understanding of evolution and human migration
- more accurate risk assessment

Ethical, Legal, and Social Implications of HGP Research
- fairness in the use of genetic information
- privacy and confidentiality
- psychological impact and stigmatization
- genetic testing
- reproductive issues
- education, standards, and quality control
- commercialization
- conceptual and philosophical implications

For More Information about HGP Human Genome Project Information Website http://www.ornl.gov

Basic numbers in the Human Genome
- 3 x 10^9 bp
- ~30,000 genes
- 23 x 2 = 46 chromosomes
- All from 4 bases (A, C, G, T)

Chromosomes
A chromosome is a single large macromolecule of DNA and is the basic 'unit' of DNA in a cell. It is a very long, continuous piece of DNA (a single DNA molecule), which contains many genes, regulatory elements and other intervening nucleotide sequences.

Supercontig: Rat Chromosome 13

 #   Contig      Contig start   Contig end    Chr start      Chr end
 1   NW_047390   1              19,234,043    1              19,234,043
     Gap         1              50,000        19,234,044     19,284,043
 2   NW_047391   1              11,093,222    19,284,044     30,377,265
     Gap         1              50,000        30,377,266     30,427,265
 3   NW_047392   1              2,305,237     30,427,266     32,732,502
     Gap         1              50,000        32,732,503     32,782,502
 4   NW_047393   1              7,069,318     32,782,503     39,851,820
     Gap         1              50,000        39,851,821     39,901,820
 5   NW_047394   1              4,889,800     39,901,821     44,791,620
     Gap         1              50,000        44,791,621     44,841,620
 6   NW_047395   1              4,278,911     44,841,621     49,120,531
     Gap         1              50,000        49,120,532     49,170,531
 7   NW_047396   1              2,820,895     49,170,532     51,991,426
     Gap         1              50,000        51,991,427     52,041,426
 8   NW_047397   1              16,884,033    52,041,427     68,925,459
     Gap         1              50,000        68,925,460     68,975,459
 9   NW_047398   1              13,699,042    68,975,460     82,674,501
     Gap         1              50,000        82,674,502     82,724,501
10   NW_047399   1              12,573,714    82,724,502     95,298,215
     Gap         1              50,000        95,298,216     95,348,215
11   NW_047400   1              11,599,125    95,348,216     106,947,340
     Gap         1              50,000        106,947,341    106,997,340
12   NW_047401   1              242,424       106,997,341    107,239,764
     Gap         1              50,000        107,239,765    107,289,764
13   NW_047402   1              954,180       107,289,765    108,243,944
     Gap         1              50,000        108,243,945    108,293,944
14   NW_047403   1              671,604       108,293,945    108,965,548
     Gap         1              50,000        108,965,549    109,015,548
15   NW_047404   1              2,333,410     109,015,549    111,348,958

Plus strand (+), 5' -> 3'; chromosome end = 111,348,958 (only an issue if you are building a database)
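The chromosome coordinates in the supercontig table follow mechanically from the contig lengths and the fixed 50,000 bp gaps. A minimal sketch, assuming 1-based inclusive coordinates; the function name `layout` and the choice to show only the first three contigs are illustrative, not part of the original slides:

```python
# Lay contigs end to end on a chromosome, separated by fixed-size gaps.
# Coordinates are 1-based and inclusive, as in the table.
GAP = 50_000

# (accession, length) for the first three contigs in the table.
CONTIGS = [
    ("NW_047390", 19_234_043),
    ("NW_047391", 11_093_222),
    ("NW_047392", 2_305_237),
]

def layout(contigs, gap=GAP):
    """Return (name, chr_start, chr_end) rows, with a 'Gap' row between contigs."""
    rows = []
    pos = 1  # next free chromosome position
    for i, (name, length) in enumerate(contigs):
        if i > 0:
            rows.append(("Gap", pos, pos + gap - 1))
            pos += gap
        rows.append((name, pos, pos + length - 1))
        pos += length
    return rows

rows = layout(CONTIGS)
for name, start, end in rows:
    print(f"{name:<10} {start:>12,} {end:>12,}")
```

Running the same function over all 15 contig lengths should reproduce the remaining rows of the table.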

Contig assembly: physical map
Software (Image or Bandleader) is used to identify overlapping clones with common restriction fragments and assemble them into a contig (FPC).
[Figure: overlapping clones A through G]
http://www.gensips.gatech.edu/slides/mardis.ppt

Sequence data assembly: Supercontig creation and gap filling (A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, three contigs are joined into one supercontig. (B) ARACHNE attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs. Genome Research 12: 177-189 (2002)
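The linking rule in (A) can be sketched as a grouping problem: two contigs are joined when at least two forward-reverse read pairs connect them. This is a deliberately simplified illustration, not ARACHNE's actual procedure; it ignores contig order, orientation and gap filling, and the names `build_supercontigs`, `links` and the contig labels are hypothetical:

```python
from collections import Counter

def build_supercontigs(contigs, links, min_links=2):
    """Group contigs that are connected by at least `min_links` read-pair links.

    contigs: list of contig names.
    links:   list of (contig_a, contig_b) tuples, one per forward-reverse
             read pair whose two ends fall in different contigs.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    support = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
supercontigs = build_supercontigs(["c1", "c2", "c3", "c4"], links)
print(supercontigs)
```

Here c1, c2 and c3 end up in one supercontig, while c4 stays separate because only a single link supports it.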

Whole genome map assembly Genome map Edit contigs and align to map. Gaps between clones can be filled with other clones, such as fosmids, or by generating PCR products from BAC clones or genomic DNA.

Genes
The Central Dogma: DNA -> RNA -> Protein
A more realistic picture [figure labels: DNA, RNA, Protein, Metabolites, Interactions, Expression, Growth rate]

The Genetic Code In reality, there is more information in the genome than just amino acid sequences.

The classic molecular human disease: Sickle cell, HbS
- Normal RBC: 6-8 µm; 4e12 per L
- Sickle cells: HbS (1949, Castle & Pauling)
- Single nucleotide polymorphism (SNP): GAG to GUG, E6V
- Treatments: antibiotics, hydroxyurea, or bone-marrow transplant
(From an old version of George Church's Biophysics 101 class; see further reading)
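The GAG to GUG change can be checked directly against the standard genetic code. A minimal sketch: the codon table is built from the standard code in UCAG order, and the nine-base mRNA fragment is a short illustrative stretch around the affected codon rather than a full transcript:

```python
# Standard genetic code, codons enumerated with bases in U, C, A, G order.
BASES = "UCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"   # first base U
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna):
    """Translate an mRNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))

normal = "CCUGAGGAG"   # illustrative fragment: Pro-Glu-Glu
sickle = "CCUGUGGAG"   # same fragment with the single A -> U change

print(translate(normal), "->", translate(sickle))
```

A single base substitution swaps glutamate (E) for valine (V), which is the E6V change named on the slide.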

Routine screening for intelligence alleles
- Phenylketonuria (PKU) is one of the most common inherited disorders, occurring in approximately 1 in 10,000 babies born in the U.S.
- The PKU gene is required for the conversion of phenylalanine (F) to tyrosine (Y).
- Phenylalanine build-up prevents the brain from developing properly. Progressive intellectual disability results if PKU is not treated from early infancy.
- Discovered by Følling in 1934.
- Nature/Nurture: ~100% genetic (a normal diet leads to intellectual disability); ~100% environmental (varying with knowledge of prevention by reduced F in the diet).
- All states and U.S. territories screen newborns for PKU (some since the 1960s).

So where do I find the genome? NCBI: http://www.ncbi.nlm.nih.gov/genomes/ UCSC genome browser: http://hgdownload.cse.ucsc.edu/downloads.html Ensembl: http://www.ensembl.org

Organization of the Gene
- Exons: regions of DNA that code for protein
- Introns: intervening regions that are spliced out
- Transcriptional start site (TSS): where transcription begins
- Promoter: sequences upstream of the TSS that are bound by transcription factor proteins to regulate gene expression
[Figure: regulatory region (promoter) upstream of the TSS, followed by the coding region with alternating exons and introns]
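These definitions translate into simple coordinate arithmetic. A minimal sketch, assuming a plus-strand gene with 1-based inclusive coordinates; the exon positions, the 200 bp promoter window and the function names are hypothetical values chosen for illustration:

```python
def introns_from_exons(exons):
    """Introns are the intervals between consecutive exons."""
    exons = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

def promoter_window(tss, upstream=200):
    """Region immediately upstream of the TSS on the plus strand."""
    return (tss - upstream, tss - 1)

# Hypothetical three-exon gene; transcription starts at the first exon.
exons = [(1000, 1200), (1500, 1650), (2000, 2300)]
tss = exons[0][0]

introns = introns_from_exons(exons)
promoter = promoter_window(tss)
print("introns:", introns)
print("promoter:", promoter)
```

For a minus-strand gene the promoter window would lie at higher coordinates than the TSS, so the arithmetic would need to be mirrored.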

BLAST (Basic Local Alignment Search Tool)
- Compares sequences of DNA for sequence similarity
- Can be two sequences of yours, or one of yours against a known genome (human, rat, ...)
- Will give you back similarities, not just identical matches
- Can give you disjoint or continuous hits

What BLAST tells you
- BLAST reports surprising alignments: different than chance
- Assumptions: random sequences, constant composition
- Conclusions: surprising similarities imply evolutionary homology
- Evolutionary homology: descent from a common ancestor; does not always imply similar function

Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/protein) of query and database:
    DNA vs. DNA
    DNA translation vs. protein
    Protein vs. protein
    Protein vs. DNA translation
    DNA translation vs. DNA translation
- WWW, standalone, and network clients

BLAST and BLAST-like programs
- Traditional BLAST (blastall): nucleotide, protein, translations
    blastn: nucleotide query vs. nucleotide database
    blastp: protein query vs. protein database
    blastx: nucleotide query vs. protein database
    tblastn: protein query vs. translated nucleotide database
    tblastx: translated query vs. translated database
- Megablast: nucleotide only
    Contiguous megablast: nearly identical sequences
    Discontiguous megablast: cross-species comparison
- Position Specific BLAST programs: protein only
    Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
    Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
Make a lookup table of words.

11-mers (blastn):
  GTACTGGACAT
  TACTGGACATG
  ACTGGACATGG
  CTGGACATGGA
  TGGACATGGAC
  GGACATGGACC
  GACATGGACCC
  ACATGGACCCT
  ...

28-mers (megablast):
  GTACTGGACATGGACCCTACAGGAACGT
  TGGACATGGACCCTACAGGAACGTATAC
  CATGGACCCTACAGGAACGTATACGTAA
  ...

WORD SIZE   blastn   megablast
Default     11       28
Minimum     7        12
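The word lookup table can be sketched as a dictionary from each word to the query offsets where it occurs, which a subject sequence is then scanned against. A minimal illustration using the query from the slide; the function names and the subject sequence are hypothetical, and real BLAST uses a more compact table and extends each seed afterwards:

```python
from collections import defaultdict

def word_table(query, word_size=11):
    """Map every word of length `word_size` to its 0-based offsets in the query."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table

def find_seeds(table, subject, word_size=11):
    """Return (query_offset, subject_offset) for every exact word match."""
    seeds = []
    for s in range(len(subject) - word_size + 1):
        for q in table.get(subject[s:s + word_size], []):
            seeds.append((q, s))
    return seeds

query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
table = word_table(query)

# Hypothetical subject: a 16-base stretch of the query with unrelated flanks.
subject = "TTT" + query[4:20] + "CCC"
seeds = find_seeds(table, subject)
print(list(table)[:3])
print(seeds)
```

Each seed marks a position where an alignment could be started and then extended in both directions.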

Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words:
  GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
- Word size = 3 (default); word size can only be 2 or 3
- Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.

Minimum Requirements for a Hit
- Nucleotide BLAST requires one exact word match
    ATCGCCATGCTTAATTGGGCTT
         CATGCTTAATT          (exact word match: one match)
- Protein BLAST requires two neighboring matches within 40 aa
    GTQITVEDLFYNI
    neighborhood words: SEI, YYN (two matches)

An alignment that BLAST can't find

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Megablast: NCBI's Genome Annotator
- Long alignments for similar DNA sequences
- Concatenation of query sequences
- Faster than blastn
- Contiguous Megablast: exact word match; word size 28
- Discontiguous Megablast: initial word hit with mismatches; cross-species comparison

Templates for Discontiguous Words

W    t    Type         Template
11   16   coding       1101101101101101
11   16   non-coding   1110010110110111
12   16   coding       1111101101101101
12   16   non-coding   1110110110110111
11   18   coding       101101100101101101
11   18   non-coding   111010010110010111
12   18   coding       101101101101101101
12   18   non-coding   111010110010110111
11   21   coding       100101100101100101101
11   21   non-coding   111010010100010010111
12   21   coding       100101101101100101101
12   21   non-coding   111010010110010010111

W = word size (number of matches in the template)
t = template length (window size within which the word match is evaluated)
Reference: Ma, B., Tromp, J., Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics 2002; 18(3):440-5
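A template is applied by keeping only the bases under the 1 positions of the window. A minimal sketch using the four t = 16 templates from the table; the function name and the two 16-base sequences are hypothetical, with the second sequence differing from the first only at the positions the coding template ignores (every third base):

```python
# The four t = 16 templates, keyed by (type, W).
TEMPLATES = {
    ("coding", 11): "1101101101101101",
    ("non-coding", 11): "1110010110110111",
    ("coding", 12): "1111101101101101",
    ("non-coding", 12): "1110110110110111",
}

def discontiguous_word(seq, start, template):
    """Return the bases of seq under the '1' positions of the template."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, t in zip(window, template) if t == "1")

template = TEMPLATES[("coding", 11)]

# Hypothetical sequences that differ only at 0-based positions 2, 5, 8, 11, 14.
seq_a = "ATGGCCAAGCTGACCG"
seq_b = "ATAGCTAAACTCACTG"

word_a = discontiguous_word(seq_a, 0, template)
word_b = discontiguous_word(seq_b, 0, template)
print(word_a, word_b, word_a == word_b)
```

The two sequences share no contiguous 11-mer, yet their discontiguous words are identical, which is why this kind of seed helps in cross-species comparison of coding regions.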

Local Alignment Statistics
- High scores of local alignments between two random sequences follow the Extreme Value Distribution
- Expect value E = number of database hits you expect to find by chance (depends on the size of the database)
[Figure: number of alignments vs. score, marking your score and the expected number of random hits]

$E = K m n e^{-\lambda s}$  or  $E = m n 2^{-S}$

  $K$ = scale for search space
  $\lambda$ = scale for scoring system
  $S$ = bit score = $(\lambda s - \ln K)/\ln 2$

(applies to ungapped alignments)
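The two forms of the E-value formula are equivalent once the raw score s is converted to a bit score S. A minimal numerical sketch; the values of lambda, K, the raw score and the lengths m and n (taken here as query and database lengths) are illustrative assumptions, not parameters reported in the slides:

```python
import math

def bit_score(raw_score, lam, K):
    """S = (lambda * s - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)

def evalue_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * s)"""
    return K * m * n * math.exp(-lam * raw_score)

def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S)"""
    return m * n * 2.0 ** (-bits)

# Illustrative parameters only.
lam, K = 0.318, 0.134
m, n = 250, 100_000_000
raw = 80

S = bit_score(raw, lam, K)
e_raw = evalue_from_raw(raw, m, n, lam, K)
e_bits = evalue_from_bits(S, m, n)
print(f"bit score = {S:.1f}, E (raw) = {e_raw:.3g}, E (bits) = {e_bits:.3g}")
```

Because the bit score absorbs lambda and K, E-values computed from bit scores can be compared across scoring systems for the same search space.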

Scoring Systems
- Position independent matrices
    Nucleic acids: identity matrix
    Proteins:
      PAM matrices (Percent Accepted Mutation): implicit model of evolution; higher PAM numbers are all calculated from PAM1; PAM250 widely used
      BLOSUM matrices (BLOck SUbstitution Matrices): empirically determined from alignment of conserved blocks; each includes information up to a certain level of identity; BLOSUM62 widely used
- Position Specific Score Matrices (PSSMs): PSI-BLAST and RPS-BLAST

BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

- Common amino acids have low weights
- Rare amino acids have high weights
- Positive for more likely substitutions
- Negative for less likely substitutions

Position Specific Substitution Rates
[Figure: typical serine vs. active site serine]

Position Specific Score Matrix (PSSM)

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Slide annotations: serine is scored differently in two of the serine positions (marked on the slide); active site nucleophile.

Gapped Alignments
- Gapping provides more biologically realistic alignments
- Gapped BLAST parameters must be simulated
- Affine gap costs: a gap of length k receives the score -(a + bk)
    a = gap open penalty
    b = gap extend penalty
- A gap of length 1 receives the score -(a + b)

Scores

             V    D    S    -    C    Y
             V    E    T    L    C    F    Total
BLOSUM62    +4   +2   +1  -12   +9   +3      7
PAM30       +7   +2    0  -10  +10   +2     11
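The BLOSUM62 total above can be reproduced by summing pair scores and charging an affine penalty for the gap. A minimal sketch: the pair scores are the five BLOSUM62 entries this alignment needs, and the gap parameters a = 11 and b = 1 are an assumption chosen because they give the -12 shown for a gap of length 1; the function names are illustrative:

```python
# Only the BLOSUM62 entries needed for this alignment.
PAIR_SCORES = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "T"): 1,
    ("C", "C"): 9,
    ("Y", "F"): 3,
}

def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return PAIR_SCORES[(x, y)] if (x, y) in PAIR_SCORES else PAIR_SCORES[(y, x)]

def score_alignment(top, bottom, gap_open=11, gap_extend=1):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total, gap_len, gap_side = 0, 0, None
    for x, y in zip(top, bottom):
        side = "top" if x == "-" else "bottom" if y == "-" else None
        # Close the running gap when it ends or switches sequence.
        if side != gap_side and gap_len:
            total -= gap_open + gap_extend * gap_len
            gap_len = 0
        gap_side = side
        if side is None:
            total += pair_score(x, y)
        else:
            gap_len += 1
    if gap_len:
        total -= gap_open + gap_extend * gap_len
    return total

blosum62_total = score_alignment("VDS-CY", "VETLCF")
print(blosum62_total)
```

The PAM30 row would be scored the same way with PAM30 pair scores and a gap penalty that yields -10.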

Becker et al., Nature, 1998 Position Weight Matrices

PWMs (continued)

Formulae used in searching DNA sequences

Inter-species Comparison
- Albumin gene promoters obtained from rat, human and mouse genomes using Promoser
- Aligned using BLAST: conserved regions (human vs. mouse/rat) span from -250 to +50 relative to TSS
- Regulatory elements obtained using Possum
- Retained 200 bp upstream from TSS
[Figure: promoter regions from -1000 to +50 for rat, mouse and human]

Phylogenetic footprinting

Further Reading
- George Church's Computational Biology (Biophysics 101) course: http://www.courses.fas.harvard.edu/~bphys101/
- Your Molecular Cell Biol. text!