MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24
Announcements
- Error on syllabus: class meets 5:30-6:45, not 7:00-8:15.
- Office hours for Christiaan Van Woudenberg: TR after class (6:45-7:45).
Key Ideas from Last Lecture
- DNA and RNA are strings over a 4-letter alphabet.
- Orientation
- Complementarity
- Central Dogma
Tonight
- More Molecular Biology
  - Proteins
  - Translation & the genetic code
  - Genes: reading frames, introns/exons
  - Protein Structure and Function
  - Hydrophobicity/Hydrophilicity
- Sequence Alignment
Proteins
- Do most of the work in a cell.
  - Enzymes: catalysts for chemical reactions.
  - Structural proteins: form cellular structure.
  - Regulatory proteins: control expression of genes or activities of other proteins.
  - Transport proteins: carry molecules across membranes or around the body.
- Composed of strings of amino acids.
- Translated from RNA by ribosome complexes.
- Fold up into a 3-dimensional structure, which largely determines protein function.
Translation
- mRNA is translated to form proteins.
- Genetic code: 3 nucleotides translate to 1 amino acid.
- 20 amino acids + stop codon = 21 codes needed.
- Question: How many possible sequences of 3 nucleotides are there?
- Answer: 4^3 = 64.
- Duplication: different codons code for the same amino acid.
- Often an error in the 3rd position of a codon results in the same amino acid.
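The counting argument and the redundancy of the code can be checked with a short sketch. It assumes the standard genetic code, written as a 64-letter string in UCAG order with `*` for stop; the names `CODON_TABLE` and `translate` are illustrative and not from the lecture.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# UCAG order at each position; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every 3-letter codon (UUU, UUC, ..., GGG) to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(len(CODON_TABLE))                    # number of distinct codons: 4^3
print(len(set(AMINO_ACIDS) - {"*"}))       # number of distinct amino acids
print(CODON_TABLE["GCU"], CODON_TABLE["GCC"], CODON_TABLE["GCA"], CODON_TABLE["GCG"])
print(translate("AUGGCUUAA"))              # codons AUG GCU UAA
```

The third print illustrates the duplication: all four GC* codons map to alanine, so a change in the 3rd position leaves the amino acid unchanged.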
Genes
- A gene is a sequence of nucleotides in the DNA molecule coding for one unit of genetic information.
- A gene is expressed when that segment of DNA is transcribed into mRNA.
- RNA polymerase binds to the DNA molecule, and then moves along the DNA building the complementary RNA molecule.
- For binding to occur, there must be a promoter sequence on the DNA, which is a set of specific nucleotide sequences in just the right positions relative to the gene.
- Gene expression is regulated by proteins (or RNA molecules) that bind to the promoter regions. Sometimes such a protein makes it easier for RNA polymerase to bind to the DNA and begin transcription (positive regulation). Other times, the protein makes it harder (negative regulation).
Reading Frames
- Recall: nucleotides in the mRNA molecule are translated into protein in triplets.
- If you shifted by one nucleotide, you would get an entirely different amino acid sequence:

    tyr   leu   arg   leu
    ----- ----- ----- -----
    U A C C U U A G A C U C G
      ----- ----- ----- -----
      thr   leu   asp   ser

- There are 3 possible reading frames, which correspond to the 3 possible ways of dividing the sequence up into triplets.
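The three ways of dividing a sequence into triplets can be enumerated directly. This is a minimal sketch using the sequence from the slide; the function name `reading_frames` is illustrative.

```python
def reading_frames(mrna):
    """Return the three ways of dividing mrna into complete triplets."""
    frames = []
    for offset in range(3):
        # Step through the sequence three letters at a time, starting at
        # the given offset and dropping any incomplete triplet at the end.
        codons = [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        frames.append(codons)
    return frames

for offset, codons in enumerate(reading_frames("UACCUUAGACUCG")):
    print(offset, " ".join(codons))
```

Offsets 0 and 1 reproduce the two rows of triplets drawn on the slide; offset 2 is the third frame.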
Open Reading Frames
- The choice of reading frame depends on where translation starts on the mRNA. This always occurs at a start codon (AUG).
- Translation begins at a start codon (AUG) and continues until a stop codon is reached (UAA, UAG, or UGA).
- An open reading frame (ORF) is a sequence of mRNA beginning with a start codon and ending at the first stop codon encountered.
- All proteins are translated from ORFs, but not all ORFs are translated.
- Long ORFs are rare, unless the ORF corresponds to an actual protein.
  - In a random nucleotide sequence, stop codons make up 3/64 of the codons, so the average ORF length is about 64/3, or roughly 21 codons.
  - Most proteins are hundreds of amino acids long.
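The ORF definition above translates into a simple scan. The sketch below assumes an ORF runs from any AUG to the first in-frame stop codon and skips AUGs with no downstream stop; the name `find_orfs` and the test sequence are illustrative.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna):
    """Return (start_index, codons) for every AUG followed by an in-frame stop."""
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "AUG":
            continue
        codons = []
        # Read triplets in the frame set by this AUG until a stop codon.
        for i in range(start, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            codons.append(codon)
            if codon in STOP_CODONS:
                orfs.append((start, codons))
                break
    return orfs

# Each random codon is a stop with probability 3/64, so the expected number
# of codons read up to and including the first stop is 64/3.
print(64 / 3)
print(find_orfs("CCAUGGCUUGAUAA"))
```

Because every AUG is treated as a possible start, an AUG inside a longer ORF in the same frame is reported as a second, shorter ORF.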
Introns and Exons
- In prokaryotes, the mRNA is transcribed directly from the DNA.
- In eukaryotes, the mRNA can be modified by splicing before it is translated.
- Splicing involves removing internal sequences called introns from the mRNA. Introns do not code for proteins.
- The parts of the mRNA that are not removed are called exons. Exons contain the sequence information for the protein.
- Alternative splicing: many RNA sequences can be spliced in multiple ways (called alternative splicings).
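As a string operation, splicing amounts to deleting intervals and joining what remains. The sketch below is only an illustration: the transcript, the intron coordinates, and the function name `splice` are all hypothetical, and real splice sites are recognized by sequence signals rather than supplied as indices.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as half-open (start, end) index pairs, and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)

# Hypothetical transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCU" + "guaagu" + "CCAGAU" + "guccag" + "UAA"

print(splice(pre_mrna, [(6, 12), (18, 24)]))  # remove both introns
print(splice(pre_mrna, [(6, 24)]))            # alternative splicing: middle exon skipped
```

The second call shows how the same transcript can yield a different mature mRNA when the middle exon is removed along with its flanking introns.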
Protein Structure and Function
- The 3-D structure of a protein largely determines its function.
- Hierarchy of structure:
  - Primary structure: linear order of the amino acids.
  - Secondary structure: location and direction of common structures called α-helices and β-sheets.
  - Tertiary structure: the 3-dimensional shape of the protein.
  - Quaternary structure: the overall 3-D structure of a complex of multiple proteins.
(Image from www.biology.bnl.gov/structure/images/swami_p18.jpg)
Hydrophobicity/Hydrophilicity
- Some amino acids have polar side chains (the charge of the molecule is not symmetric). These residues are attracted to water. Such amino acids are called hydrophilic.
- Other amino acids have nonpolar side chains. These are called hydrophobic.
- Because proteins reside in water, they tend to fold up so that the hydrophilic residues are on the outside and the hydrophobic residues are on the inside.
Topics not covered from Ch. 1
- Read about these on your own:
  - Chemical details.
  - Molecular Biology Tools.
  - Genomic Information Content (Optional).
Sequence Comparison/Alignment
- Dot Plots
- Sequence Alignment
- Scoring methods
- Derivation of scoring matrices
- Dynamic Programming
Dot Plot
[Figure: dot plot of two nucleotide sequences; surviving axis labels: A C T C G A G C A C A G T A G C]
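A dot plot places one sequence along each axis and marks every cell where the two letters match, so shared stretches show up as diagonal runs. The sketch below assumes the 16 axis letters above are two sequences of 8, ACTCGAGC and ACAGTAGC; that split is a guess, and the function name `dot_plot` is illustrative.

```python
def dot_plot(seq_a, seq_b):
    """Return text rows with '*' wherever a letter of seq_a equals a letter of seq_b."""
    rows = ["  " + " ".join(seq_b)]  # header row: seq_b along the top
    for a in seq_a:
        cells = ["*" if a == b else "." for b in seq_b]
        rows.append(a + " " + " ".join(cells))
    return rows

# Assumed split of the slide's axis labels into two sequences.
print("\n".join(dot_plot("ACTCGAGC", "ACAGTAGC")))
```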
Sequence Alignment
- Definition: Alignment = pairwise matching between the characters of each sequence.
- Often requires inserting gaps into the sequences.
- Ex: 2 alignments of the same sequences. Which is better?

    AATCTATA
    AAG-AT-A

    AATCTATA
    AA--GATA
Types of Alignments
- Global: finds the best alignment of two fixed sequences.
- Semiglobal: finds the best overlap of two sequences. Doesn't penalize gaps at the beginning or end.
- Local: finds the best-scoring alignment of subsequences.
- Multiple Sequence Alignment: aligns multiple sequences.
Scoring Alignments
- Goal: Devise a scoring function for an alignment such that the best alignment gets the highest score.
- Once the scoring function is defined, we will be able to devise algorithms to search for the highest-scoring alignments.
- Simple example: Matches = +1, Mismatches = 0, Gaps = -1

    AATCTATA
    AAG-AT-A
    ++0-00-+  = +1

    AATCTATA
    AA--GATA
    ++--0+++  = +3
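The simple scoring scheme can be written as a column-by-column sum. This is a minimal sketch with the slide's values as default parameters; the function name `score_alignment` is illustrative.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # a gap in either row
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total

print(score_alignment("AATCTATA", "AAG-AT-A"))  # first alignment on the slide
print(score_alignment("AATCTATA", "AA--GATA"))  # second alignment on the slide
```

Changing the three parameters gives a quick way to see how the ranking of the two alignments depends on the scoring function.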
Discussion Question
- What makes a good scoring function?
Ideas for Scoring Functions
- Edit distance: how many edits (substitutions, insertions, deletions) are needed to transform one sequence into another?
- Homology: assume both sequences evolved from a common (but unknown) ancestor. Which alignment best reflects this evolutionary relationship?
- Avoid unintended mathematical biases.
- Computational efficiency: complex scoring functions may be harder to compute with.
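The edit-distance idea can be made concrete with the standard dynamic-programming recurrence (Levenshtein distance). The sketch below assumes every substitution, insertion, and deletion costs 1, which is one possible choice and not necessarily the one used later in the course; the function name `edit_distance` is illustrative.

```python
def edit_distance(s, t):
    """Minimum number of unit-cost substitutions, insertions, and deletions turning s into t."""
    # previous[j] holds the distance between the current prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        previous = current
    return previous[-1]

# The two ungapped sequences from the alignment example.
print(edit_distance("AATCTATA", "AAGATA"))
```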
Substitution Score Matrix
- Evolutionarily, some substitutions are more probable than others.
  - Physical/chemical properties. Ex: in DNA, transitional substitutions (purine ↔ purine or pyrimidine ↔ pyrimidine) are more probable than transversional substitutions (purine ↔ pyrimidine).
  - Selective pressure during evolution. Ex: in proteins, substitutions that change structure are selected against.
Nucleotide Score Matrices
- Usually quite simple.
- BLAST matrix (match = 5, mismatch = -4):

         A   T   C   G
     A   5  -4  -4  -4
     T  -4   5  -4  -4
     C  -4  -4   5  -4
     G  -4  -4  -4   5

- Transition/Transversion matrix:

         A   T   C   G
     A   1  -5  -5  -1
     T  -5   1  -1  -5
     C  -5  -1   1  -5
     G  -1  -5  -5   1
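Both matrices follow from short rules, so they can be generated instead of typed in. The sketch below uses the values from the slide; the function names and the dictionary representation are illustrative choices.

```python
PURINES = {"A", "G"}  # the pyrimidines are C and T

def blast_score(a, b):
    """BLAST-style nucleotide score: match = 5, mismatch = -4."""
    return 5 if a == b else -4

def transition_transversion_score(a, b):
    """Match = 1, transition = -1, transversion = -5."""
    if a == b:
        return 1
    # A transition keeps the chemical class (purine or pyrimidine) unchanged.
    same_class = (a in PURINES) == (b in PURINES)
    return -1 if same_class else -5

def score_matrix(score, alphabet="ATCG"):
    """Tabulate a pairwise score function as a dict keyed by (row, column)."""
    return {(a, b): score(a, b) for a in alphabet for b in alphabet}

def format_matrix(matrix, alphabet="ATCG"):
    """Lay the matrix out as text rows in the order given by alphabet."""
    rows = ["   " + "".join(f"{b:>4}" for b in alphabet)]
    for a in alphabet:
        rows.append(f"{a:>3}" + "".join(f"{matrix[(a, b)]:>4}" for b in alphabet))
    return "\n".join(rows)

print(format_matrix(score_matrix(blast_score)))
print(format_matrix(score_matrix(transition_transversion_score)))
```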
Amino Acid Substitution Score Matrix
- Based on a statistical model of accepted mutations (i.e., mutations that survive evolution).
- Example (BLOSUM62 amino acid substitution matrix):

         C   S   T   P   A   G
     C   9
     S  -1   4
     T  -1   1   5
     P  -3  -1  -1   7
     A   0   1   0  -1   4
     G  -3   0  -2  -2   0   6
     ...
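Because the matrix is symmetric, only the lower triangle needs to be stored. The sketch below holds just the six-residue excerpt shown above, not the full 20 x 20 BLOSUM62 matrix, and the function names are illustrative.

```python
RESIDUES = "CSTPAG"
# Lower triangle of the BLOSUM62 excerpt, one row per residue in RESIDUES.
LOWER_TRIANGLE = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
]

def blosum_score(a, b):
    """Look up the score for residues a and b, using the symmetry of the matrix."""
    i, j = RESIDUES.index(a), RESIDUES.index(b)
    if i < j:
        i, j = j, i  # always read from the stored lower triangle
    return LOWER_TRIANGLE[i][j]

def score_ungapped(seq_a, seq_b):
    """Sum the substitution scores over an ungapped alignment of equal-length sequences."""
    return sum(blosum_score(a, b) for a, b in zip(seq_a, seq_b))

print(blosum_score("S", "T"), blosum_score("T", "S"))
print(score_ungapped("CAST", "CGTT"))  # columns C/C, A/G, S/T, T/T
```

Residues outside the excerpt raise a `ValueError` from `str.index`, which is a reminder that this table is incomplete.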