Introduction to Molecular Biology

Bioinformatics: Issues and Algorithms, CSE 308-408, Fall 2007, Lecture 2

Important points to remember
We will study:
  Problems from bioinformatics.
  Algorithms used to solve them.
  Perl programming (language of choice for bioinformatics).
Requirements:
  Homework and programming assignments.
  Final project or paper due at end of semester.
  CSE 408 students must also scribe one lecture.
CSE 308-408 is NOT a programming course:
  Previous programming experience is not required to do well.
  Being a great programmer does not imply a high grade.
  Best grades go to students who work to learn bioinformatics.

Course grading
                          CSE 308      CSE 408
Homework assignments      25% grade    20% grade
Programming assignments   25% grade    20% grade
Final project or paper    50% grade    50% grade
Scribe duty (CSE 408)*    n/a          10% grade
* Note that CSE 308 and CSE 408 point totals will be different and each will be curved separately.

The Central Dogma of Molecular Biology
1. DNA copies its information in a process involving many enzymes (replication).
2. DNA codes for production of mRNA during transcription.
3. mRNA migrates from nucleus to cytoplasm.
4. mRNA carries coded information to ribosomes, which "read" it and use it for protein synthesis (translation).
http://allserv.rug.ac.be/~avierstr/principles/centraldogma.html
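As a rough illustration of transcription and translation, here is a minimal Python sketch. It is not part of the original slides. It assumes the input is the coding strand of DNA read 5' to 3', uses the standard genetic code, and ignores details such as start-codon search and introns. The names transcribe, translate, and CODON_TABLE are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Model transcription: the mRNA matches the coding strand with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Model translation: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGA")
print(mrna)             # AUGGCCUGA
print(translate(mrna))  # MA
```

One-letter amino acid codes are used, so "MA" stands for methionine followed by alanine.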

The Central Paradigm of Bioinformatics
By developing techniques for analyzing sequence data and the structures that result, we can attempt to understand the genetic nature of diseases.
http://cmgm.stanford.edu/biochem218/

Junk DNA
Recall that genes are contiguous stretches along a chromosome. At this point in time, <10% of the DNA in the human genome can be associated with genes. The remainder is known as junk DNA because it has no apparent function.
However, recent studies are showing that non-coding DNA may play an important role in regulating gene expression (enhancing or suppressing expression of proximal genes). It's also used in forensic analysis, as mutations are more likely in non-coding DNA regions than within genes (why?).
http://www.accessexcellence.org/ab/gg/genes.html
http://www.psrast.org/junkdna.htm

Genetic inheritance
1. Cells in mother and father both contain paired sets of chromosomes (diploid).
2. Through meiosis, gametes (sex cells) contain only one chromosome from each pair (haploid).
3. Fertilized egg cell (zygote) receives one chromosome of each pair from the mother and one from the father.
4. Zygote splits and reproduces through mitosis to yield a multicellular diploid organism.
http://www.accessexcellence.org/ab/gg/hapdip.html

Crossing over (recombination)
The two chromosomes that form a pair are called homologous. During meiosis, homologous chromosomes may cross over (recombine), forming chromosomes that mix genes from each parent.
Note that the likelihood of recombination is a function of the distance between two genes. This observation is used in creating genetic linkage maps.
Here we see recombination of gene C/c, which appears in two forms (alleles). Genes ab (AB) are unlikely to recombine.
http://www.accessexcellence.org/ab/gg/comeiosis.html

Genomes
The complete set of chromosomes that determines an organism is known as a genome.
[Figure: sizes of some genomes, including Mus musculus.]
Note that each cell in an organism contains its entire genome!
http://www.cbs.dtu.dk/databases/dogs/
http://www.nsrl.ttu.edu/tmot1/mus_musc.htm
http://www.oardc.ohio-state.edu/seedid/single.asp?strid=324

We're more similar than you might think
(The DNA of chimpanzees and humans is ~99% similar.)
http://www.ornl.gov/sci/techresources/human_genome/graphics/slides/ttmousehuman.shtml
http://www.news.cornell.edu/releases/dec03/chimp.life.hrs.html

Studying a genome
Most genomes are enormous (e.g., 10^10 base pairs in the case of human). Current sequencing technology, on the other hand, only allows biologists to determine ~10^3 base pairs at a time. This disparity leads to some of the most interesting problems in computational biology.
Genetic linkage map (10^7 to 10^8 base pairs)
Physical map (10^5 to 10^6 base pairs)
Sequencing (10^3 to 10^4 base pairs)
ACTAGCTGATCGATTTAGCAGCAG...
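To get a feel for the gap between genome size and read length, here is a small back-of-the-envelope sketch in Python. It is not part of the original slides. The function name min_reads and the coverage parameter (how many times each base is read on average) are assumptions introduced for illustration. The genome and read sizes are the round figures quoted above.

```python
def min_reads(genome_length, read_length, coverage=1):
    """Lower bound on the number of reads needed to span the genome.

    Assumes perfectly placed, error-free reads; 'coverage' is a
    hypothetical redundancy factor (average reads per base).
    """
    total_bases = coverage * genome_length
    return (total_bases + read_length - 1) // read_length  # ceiling division


# Round numbers from the slide: 10^10 bp genome, ~10^3 bp per read.
print(min_reads(10**10, 10**3))              # 10000000
print(min_reads(10**10, 10**3, coverage=8))  # 80000000
```

Even in this idealized setting, about ten million reads are needed, and they still have to be put back in the right order.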

Studying a genome
Cloned DNA molecules are made progressively smaller and fragments subcloned to obtain pieces small enough to sequence directly. These results are compiled to provide sequence across a chromosome.
A yeast artificial chromosome (YAC) is designed to fool the yeast replication mechanism.
Cosmids and plasmids are vectors that can be cloned in bacteria.
http://www.ornl.gov/sci/techresources/human_genome/publicat/primer/primer.pdf

Cutting DNA using restriction enzymes
A restriction enzyme surrounds the DNA molecule at a specific point, called the restriction site (sequence GAATTC in this case). It cuts one strand of the DNA helix at one point and the second strand at a different, complementary point (between G and A). The separated pieces have single-stranded sticky ends, which allow complementary pieces to combine.
Note that the complement of GAATTC is CTTAAG, which read in reverse is GAATTC again (i.e., the site is a palindrome).
http://www.accessexcellence.org/ab/gg/restriction.html
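The palindrome property and the cut between G and A can be checked with a short Python sketch. It is not part of the original slides. It models only one strand as a string, and the names reverse_complement and digest and the cut_offset parameter are invented for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def digest(seq, site="GAATTC", cut_offset=1):
    """Cut one strand at every occurrence of 'site'.

    cut_offset=1 places the cut after the first base of the site,
    i.e. between G and A for GAATTC. Only a single strand is modeled,
    so sticky ends are not represented.
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


print(reverse_complement("GAATTC"))  # GAATTC
print(digest("AAGAATTCGGGAATTCTT"))  # ['AAG', 'AATTCGGG', 'AATTCTT']
```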

Breaking DNA
DNA can also be broken in random places through mechanical means (e.g., vibration). This is typically the first step in shotgun sequencing.
http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/shotgun.html
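Random breakage can be simulated in a few lines of Python. This is only a sketch and not part of the original slides. The function name shear and its parameters are assumptions. In practice many copies of the molecule are broken, so fragments from different copies overlap. This example breaks a single copy.

```python
import random


def shear(seq, n_breaks, rng):
    """Break 'seq' at n_breaks distinct random positions.

    'rng' is a random.Random instance so that runs can be reproduced.
    Requires n_breaks <= len(seq) - 1.
    """
    break_points = sorted(rng.sample(range(1, len(seq)), n_breaks))
    bounds = [0] + break_points + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


fragments = shear("ACTAGCTGATCGATTTAGCAGCAG", 4, random.Random(0))
print(fragments)                                         # five fragments
print("".join(fragments) == "ACTAGCTGATCGATTTAGCAGCAG")  # True
```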

Copying DNA
Most analytic procedures in the lab require a quantity of the DNA under study. The process of copying DNA is known as amplification.
As we have seen, one possible approach is to use nature: insert the DNA of interest into the genome of a host (or vector) and let the organism multiply itself. This is called recombinant DNA.
http://www.accessexcellence.org/ab/gg/plasmid.html

Polymerase Chain Reaction
Another way to amplify DNA is polymerase chain reaction (PCR). PCR alternates two phases:
  separate DNA into single strands using heat;
  convert into double strands using primer and polymerase reaction.
PCR rapidly amplifies a single DNA molecule into billions of molecules.
http://www.accessexcellence.org/ab/gg/polymerase.html
http://www.iupui.edu/~wellsctr/mmia/htm/animations.htm
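The "billions of molecules" figure follows from doubling. Here is a minimal Python sketch of the arithmetic, not part of the original slides. It assumes an idealized reaction in which every molecule is copied in every cycle. The function name pcr_copies and the efficiency parameter are assumptions for illustration.

```python
def pcr_copies(cycles, initial=1, efficiency=1.0):
    """Idealized PCR yield: each cycle multiplies the count by (1 + efficiency).

    efficiency=1.0 means perfect doubling; real reactions fall short of this.
    """
    return initial * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, pcr_copies(n))
# 10 1024.0
# 20 1048576.0
# 30 1073741824.0
```

Under perfect doubling, about 30 cycles are enough to turn one molecule into more than a billion.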

Polymerase Chain Reaction
http://www.dnalc.org/ddnalc/resources/pcr.html

Reading DNA
Gel electrophoresis is the process of separating a mixture of molecules in a gel medium by application of an electric field. In general, DNA molecules with similar lengths will migrate the same distance.
First "cut" DNA at each base: A, C, G, T.
Then run the gel and read off the sequence: TCGCGA...
This is known as sequencing.
http://www.apelex.fr/anglais/applications/sommaire2/sanger.htm
http://www.iupui.edu/~wellsctr/mmia/htm/animations.htm

Gel electrophoresis
http://www.dnalc.org/ddnalc/resources/electrophoresis.html

Reading DNA
Original sequence: ATCGTGTCGATAGCGCT
G: ATCG, ATCGTG, ATCGTGTCG, ATCGTGTCGATAG, ATCGTGTCGATAGCG
A: A, ATCGTGTCGA, ATCGTGTCGATA
T: AT, ATCGT, ATCGTGT, ATCGTGTCGAT, ATCGTGTCGATAGCGCT
C: ATC, ATCGTGTC, ATCGTGTCGATAGC, ATCGTGTCGATAGCGC
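The fragment lists for this example can be reproduced, and the sequence read back, with a short Python sketch. It is not part of the original slides. It idealizes the gel as sorting fragments exactly by length, and the names sanger_lanes and read_gel are invented here.

```python
def sanger_lanes(seq):
    """Group every prefix of 'seq' by the base it ends in (one lane per base)."""
    lanes = {base: [] for base in "ACGT"}
    for end in range(1, len(seq) + 1):
        lanes[seq[end - 1]].append(seq[:end])
    return lanes


def read_gel(lanes):
    """Read the sequence off the gel: shortest fragment first.

    Each band is (fragment length, lane); sorting by length recovers
    the order of the bases.
    """
    bands = [(len(frag), base) for base, frags in lanes.items() for frag in frags]
    return "".join(base for _, base in sorted(bands))


lanes = sanger_lanes("ATCGTGTCGATAGCGCT")
print(lanes["A"])       # ['A', 'ATCGTGTCGA', 'ATCGTGTCGATA']
print(read_gel(lanes))  # ATCGTGTCGATAGCGCT
```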

Sanger sequencing
http://www.dnalc.org/ddnalc/resources/sangerseq.html

DNA sequencing
http://www.bii.a-star.edu.sg/docs/sbg/notes/n1/hu-genome%20lecture2.pdf

Sequence assembly
[Figure: fragments of the original target are put together by fragment assembly into contigs, which may be separated by a gap.]

Sequence assembly
A simple model of DNA assembly is the Shortest Superstring Problem: given a set of sequences, find the shortest sequence S such that each of the original sequences appears as a (contiguous) substring of S.
Look for overlap between the suffix of one sequence and the prefix of another. Example with ACCGT, CGTGC, and TTAC (overlaps of length 3, 2, and 1):
TTACCGTGC
--ACCGT--
----CGTGC
TTAC-----
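One simple way to act on the overlap idea is a greedy merge: repeatedly join the two fragments with the largest suffix/prefix overlap. The Python sketch below is not part of the original slides. Greedy merging is a heuristic and is not guaranteed to find the shortest result in general. The names overlap and greedy_superstring are invented for this example, and reads are assumed to be error-free.

```python
def overlap(a, b):
    """Length of the longest suffix of 'a' that is also a prefix of 'b'."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: merge the pair with the largest overlap until one is left."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Fragments contained inside another fragment add nothing.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0] if frags else ""


print(overlap("ACCGT", "CGTGC"))                       # 3
print(greedy_superstring(["ACCGT", "CGTGC", "TTAC"]))  # TTACCGTGC
```

On the three fragments from the slide, the greedy merge first joins ACCGT and CGTGC (overlap 3) and then prepends TTAC (overlap 2).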

Inferring gene functionality
Researchers want to know the functions of new genes. Simply comparing new gene sequences to known DNA often does not reveal the actual function of a gene. For 40% of sequenced genes, functionality cannot be ascertained by such techniques.
DNA microarrays allow biologists to infer gene function when there is insufficient evidence based on similarity alone.
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt

DNA microarray analysis
DNA microarrays measure the activity (expression level) of genes under varying conditions/time points.
Expression level is estimated by measuring the amount of mRNA for that particular gene.
A gene is active if it is being transcribed. More mRNA usually indicates more gene activity.
Measurements are relative, not absolute!
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt

DNA microarray experiments
Analyze mRNA produced from cells in tissue with the environmental conditions you are testing.
Produce cDNA from mRNA (DNA is more stable).
Attach phosphor to cDNA to see when a particular gene is expressed. Different color phosphors are available to compare many samples at once.
Hybridize cDNA over the microarray.
Scan the microarray with a phosphor-illuminating laser. Scan the microarray multiple times for different color phosphors.
Illumination reveals transcribed genes.
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt

DNA microarray experiments
[Figure annotations: phosphors can be added here instead; then, instead of staining, laser illumination is used.]
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt

DNA microarrays
http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Using DNA microarrays
Track a sample over a period of time to see gene expression over time.
Track two different samples under the same conditions to see the difference in gene expression.
Each box represents one gene's expression over time.
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt

Using DNA microarrays
Green: expressed only in the control cell.
Red: expressed only in the experimental cell.
Yellow: equally expressed in both samples.
Black: NOT expressed in either control or experimental cells.
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt
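The color code can be written as a small decision rule. The Python sketch below is not part of the original slides. It assumes each spot yields one intensity per sample and treats a gene as "expressed" when its intensity reaches a detection threshold. The function name spot_color, the threshold value, and the intensity units are all hypothetical. Real spots show a continuum of colors, and this sketch reports any gene expressed in both samples as yellow.

```python
def spot_color(control, experimental, detection_threshold=100.0):
    """Map two channel intensities to the four idealized spot colors.

    detection_threshold is a made-up cutoff; real analyses work with
    intensity ratios rather than a hard on/off call.
    """
    in_control = control >= detection_threshold
    in_experimental = experimental >= detection_threshold
    if in_control and in_experimental:
        return "yellow"
    if in_control:
        return "green"
    if in_experimental:
        return "red"
    return "black"


print(spot_color(500.0, 20.0))   # green
print(spot_color(20.0, 500.0))   # red
print(spot_color(400.0, 400.0))  # yellow
print(spot_color(10.0, 15.0))    # black
```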

DNA microarray data
Microarray data are usually transformed into an intensity matrix (see below). The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how genes' functions might be related. Clustering comes into play (more on this later).
Intensity (expression level) of gene at measured time:
Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3
http://www.bioalgorithms.info/presentations/ch10_clustering.ppt
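As a first step toward clustering, one can ask which pair of genes has the most similar expression profile in the intensity matrix. The Python sketch below is not part of the original slides. It uses Euclidean distance between rows, which is one common choice and an assumption here (correlation is another). The names INTENSITY, euclidean, and closest_pair are invented for this example.

```python
import math
from itertools import combinations

# Rows of the intensity matrix: (Time X, Time Y, Time Z).
INTENSITY = {
    "Gene 1": (10, 8, 10),
    "Gene 2": (10, 0, 9),
    "Gene 3": (4, 8.6, 3),
    "Gene 4": (7, 8, 3),
    "Gene 5": (1, 2, 3),
}


def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_pair(matrix):
    """Return the two gene names whose profiles are nearest to each other."""
    return min(
        combinations(matrix, 2),
        key=lambda pair: euclidean(matrix[pair[0]], matrix[pair[1]]),
    )


print(closest_pair(INTENSITY))  # ('Gene 3', 'Gene 4')
print(euclidean(INTENSITY["Gene 2"], INTENSITY["Gene 5"]))  # 11.0
```

Under this distance, Gene 3 and Gene 4 would be the first pair grouped together by a simple bottom-up clustering.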

Visualizing microarray data
From "Cluster analysis and display of genome-wide expression patterns" by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.

Building the Tree of Life
Scientists build phylogenetic trees in an attempt to understand evolutionary relationships. (These trees are best guesses and certainly contain errors.)
http://en.wikipedia.org/wiki/phylogenetic_tree
http://users.rcn.com/jkimball.ma.ultranet/biologypages/t/taxonomy.html

Wrap-up
Readings for next time:
  IBA Chapter 2 on algorithms (skim if already familiar).
  BB&P Chapter 2 on software (skim if already familiar).
Remember:
  Come to class having done the readings.
  Check Blackboard regularly for updates.