Introduction to Bioinformatics Changhui (Charles) Yan Old Main 401 F http://www.cs.usu.edu/~cyan 1
How Old Is The Discipline? "The term bioinformatics is a relatively recent invention, not appearing in the literature until 1991. However, [researchers] had been building databases, developing algorithms and making biological discoveries by sequence analysis since the 1960s -- long before anyone thought to label this activity with a special term. So bioinformatics has, in fact, been in existence for more than 30 years." (Mark S. Boguski, Trends Guide to Bioinformatics, Elsevier, Trends Supplement 1998, p. 1) 2
What Is Bioinformatics? Any use of computers to handle biological information. The use of computers to characterize biological molecules or to simulate the dynamics of molecules. The use of computers to store, compare, retrieve, or analyze biological information. Computational Biology, Proteomics, Genomics, Medical Informatics 3
Bioinformatic Problems 4
Central Dogma 5
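The central dogma (DNA is transcribed to RNA, which is translated to protein) can be illustrated with a minimal sketch. The function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand read 5' to 3', and the standard genetic code is assumed.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")
    print(rna, translate(rna))
```

Real genes involve more than this (introns, alternative start sites, non-standard codes), so the sketch covers only the basic information flow.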
Genome 6
Bioinformatic Problems Genome Sequencing 7
Human Genome Project (HGP) To determine the sequences of the 3 billion bases that make up human DNA. To identify the approximately 100,000 genes in human DNA (the estimate was revised to 20,000-25,000 by Oct 2004). To store this information in databases. To develop tools for data analysis 8
Human Genome Project (HGP) HGP began in October 1990 and was completed in 2003. 99% of the human DNA sequence finished to 99.99% accuracy (April 2003). 15,000 full-length human genes identified (March 2003). Finished genome sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster (April 2003). Post-genome era 9
Completely Sequenced Genomes 10
Genome Projects More than 60 eukaryotic genome sequencing projects are underway 11
Genome Sequencing 12
Genome Sequencing 13
Difficulties due to Repeats Uncertainty Missing data Huge size!!!! 14
Gene finding Genome Sequencing Gene Finding 15,000 human genes identified. Estimates of the total number of genes: 100,000 (1990), 20,000-25,000 (Oct 2004). 3 billion bases make up human DNA 15
Gene-finders 16
Sequence Alignment Genome Gene Finding Sequence alignment 17
Longest Common Subsequences 18
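The longest common subsequence (LCS) of two sequences is commonly computed with dynamic programming. The following is a minimal sketch of that textbook approach; the function name `lcs` is hypothetical and is not taken from the slides.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("AGGTAB", "GXTXAYB"))
```

The table takes O(mn) time and space, which is one reason alignment of very long sequences relies on heuristics such as BLAST rather than exact dynamic programming.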
Sequence Alignment Pair-wise Alignment Multiple Sequence Alignment Searching Databases http://www.ncbi.nlm.nih.gov/blast/ 19
Sequence Alignment Global vs. Local 20
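The difference between global and local alignment can be sketched with two score-only dynamic programs: a Needleman-Wunsch style recurrence for global alignment and a Smith-Waterman style recurrence for local alignment. The scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration, not values from the slides.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Best score aligning all of a against all of b (global)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Best score over all pairs of substrings of a and b (local)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            v = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(v)
            best = max(best, v)
        prev = cur
    return best


if __name__ == "__main__":
    x, y = "GGGACGTCCC", "TTACGTAA"
    print(global_score(x, y), local_score(x, y))
```

The two sequences in the usage example share only the short region `ACGT`, so the local score rewards that region while the global score is pulled down by the unrelated flanks.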
Gene Expression Genome Sequencing Gene Finding Sequence Alignment Gene Expression 21
Gene Expression 22
Protein Folding Genome Sequencing Gene Finding Sequence Alignment Gene Expression Protein Structure 23
Protein Structure Visualization of protein structure Protein structure alignment Protein structure prediction 24
Protein Structure Prediction Comparative modeling: used if the sequence is similar to another one whose structure is known. Fold recognition: in the absence of a significantly similar sequence with known structure, these methods try to determine how well a known structure fits the sequence to model. Ab initio prediction: can predict structures that have not been discovered; Monte Carlo search for lowest energy. 25
Protein Function Prediction Genome Sequencing Gene Finding Sequence Alignment Gene Expression Protein Structure Protein Function 26
Protein Function Prediction Similar sequence - similar structure - similar function paradigm. Identification of homologous sequences (BLAST, PSI-BLAST) (>30% identity). Identification of conserved functional sites (<=30% identity) 27
Conserved Functional Sites -- Motifs [AG]-G-x(0,1)-[GAP]-x-N-x-[STA]-x(6)-[GS]-x(9)-G 28
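A PROSITE-style pattern can be searched for by translating it into a regular expression. The sketch below handles only a subset of the syntax (`x`, `[...]`, `{...}`, and repeat counts); the function name `prosite_to_regex` is hypothetical, and the example pattern is this slide's motif as reconstructed from the extracted text, so it is an assumption rather than a verified database entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        core, _, repeat = element.partition("(")
        if core == "x":                 # x = any residue
            core = "."
        elif core.startswith("{"):      # {ABC} = any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if repeat:                      # (n) or (n,m) repeat counts
            core += "{" + repeat.rstrip(")") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    # Assumed pattern, reconstructed from the slide text.
    motif = "[AG]-G-x(0,1)-[GAP]-x-N-x-[STA]-x(6)-[GS]-x(9)-G"
    regex = prosite_to_regex(motif)
    print(regex)
    # Made-up test sequence containing one occurrence of the motif.
    seq = "MKAGGQNQSQQQQQQGQQQQQQQQQGLL"
    print(re.search(regex, seq))
```

Regular expressions match only exact pattern instances, which is why profile and hidden Markov model methods are used for more divergent sites.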
Motifs 29
Conserved Functional Sites -- Motifs Single motif PROSITE: a database of biologically significant sites 30
Conserved Functional Sites -- Motifs Multiple motifs PRINTS: a database of protein fingerprints. A fingerprint is a group of conserved motifs characterizing a protein function 31
PRINTS >ATHA_PIG 32
PRINTS 33
Conserved Functional Sites -- Motifs Hidden Markov Model Pfam: 34
Protein Interaction Network Genome Gene Finding Sequence Alignment Gene Expression Protein Structure Protein Function Protein Interaction Network 35
Protein Interaction Network 36
37
Protein Interaction Network 38
Bioinformatic Problems Genome Gene Finding Sequence Alignment Gene Expression Protein Structure Protein Function Protein Interaction Network 39
Bioinformatic Problems There are more. Phylogeny analysis: Tree of life Databases and tools development 40
Bioinformatic Databases GenBank (DNA sequences), Protein Data Bank (protein structures), PIR (protein sequences). Nucleic Acids Research (2005): 719 databases 41
Bioinformatic Programs Sequence analysis: BLAST, ClustalX, EMBOSS, GCG. Molecular imaging/modeling: PyMol, MOLMOL, RasMol 42