INTRODUCTION TO BIOINFORMATICS SAINTS GENETICS 12-120522 - Ian Bosdet (ibosdet@bccancer.bc.ca)
Bioinformatics
  Bioinformatics is the application of computational techniques to the fields of biology and medicine.
  It is generally associated with the analysis of DNA/RNA/protein sequences and other data related to biomolecules and cell biology.
  The roots of bioinformatics are in Linux, Perl and C.
Bioinformaticians
  Background
    1) Life scientists with computational skills: biology, genetics, microbiology, molecular biology, medicine
    2) Computer scientists with knowledge of biology: computer science, mathematics, physics, engineering, statistics
    3) Graduates of bioinformatics training programs (http://bioinformatics.bcgsc.ca/)
  Common software
    Scripting languages - Python, Perl
    Statistical software - R
    Programming languages - C, C++, Java
    Microsoft Excel
  Vancouver Bioinformatics Users Group: http://www.vanbug.org
Types of data and example databases
  DNA/RNA/Protein sequence - NCBI, Ensembl
  Protein domains - Pfam
  Gene expression - NCBI GEO
  Epigenetics - ENCODE
  Variation
    Sequence - dbSNP
    Copy-number - DGV
  Mutations
    Cancer - COSMIC
    Health - ClinVar
  Published literature - PubMed
  Expert analysis and interpretation - PubMed, GeneReviews
Common Online Sites
  NCBI - http://www.ncbi.nlm.nih.gov/
  Ensembl - http://ensembl.org/
  UCSC - http://genome.ucsc.edu/
  Each provides:
    search tools
    links to outside databases
    custom genome browsers
Genome Browsers http://genome.ucsc.edu/
BLAST/BLAT
  NCBI BLAST (Basic Local Alignment Search Tool)
    Finds similar sequences within a large database of known sequences
    Provides a statistical estimate of the likelihood that a match is due to chance alone
    Search modes: DNA, Protein, DNA -> Protein, Protein -> DNA
    http://blast.ncbi.nlm.nih.gov/blast.cgi
  UCSC BLAT (BLAST-Like Alignment Tool)
    Similar to BLAST but quicker
    Less versatile
    http://genome.ucsc.edu/cgi-bin/hgblat
  Try searching this sequence with each tool:
    caggcccaactgtgagcaaggagcacaagccacaagtcttccagaggatg
    cttgattccagtggttctgcttcaaggcttccactgcaaaacactaaaga
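To give a feel for how a tool can find similar sequences quickly in a large database, here is a minimal Python sketch of word ("seed") lookup, the general idea that BLAST-style tools start from before extending and scoring matches. It is a toy illustration, not the BLAST or BLAT algorithm: the database entries `toy_hit` and `toy_miss`, the function names and the word size `k=11` are all assumptions made for this example, and no extension or statistics are computed.

```python
from collections import defaultdict


def build_kmer_index(db, k):
    """Map every k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        seq = seq.lower()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_seeds(query, index, k):
    """Return (query offset, sequence name, database offset) for each shared k-mer."""
    query = query.lower()
    seeds = []
    for q in range(len(query) - k + 1):
        for name, d in index.get(query[q:q + k], []):
            seeds.append((q, name, d))
    return seeds


# Hypothetical two-entry database; "toy_hit" embeds the first 24 bases of the
# slide's query sequence between four t's and four g's.
database = {
    "toy_hit": "ttttcaggcccaactgtgagcaaggagcgggg",
    "toy_miss": "acacacacacacacacacacacacacacacac",
}
query = "caggcccaactgtgagcaaggagc"

index = build_kmer_index(database, k=11)
seeds = find_seeds(query, index, k=11)

# Names of database sequences that share at least one 11-mer with the query.
print(sorted({name for _, name, _ in seeds}))
```

Seeds that fall on the same diagonal (constant database offset minus query offset) point to one contiguous match, which a real tool would then extend and score.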
Sequence file formats
  GenBank
    Sequence
    Features
    Literature links
    http://www.ncbi.nlm.nih.gov/nucleotide/41327737?report=genbank
  FASTA
    Sequence only
    Header line with name
    http://www.ncbi.nlm.nih.gov/nuccore/41327737?report=fasta
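Because FASTA is just a header line starting with ">" followed by sequence lines, it is easy to read with a few lines of Python. The sketch below is a minimal parser written for illustration; the function name `parse_fasta` and the header `query_fragment` are assumptions, and the sequence is the BLAST/BLAT practice sequence from this deck. For real work a maintained library is usually a better choice.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical header; the two sequence lines come from the BLAST/BLAT slide.
example = """>query_fragment
caggcccaactgtgagcaaggagcacaagccacaagtcttccagaggatg
cttgattccagtggttctgcttcaaggcttccactgcaaaacactaaaga
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```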
Multiple alignments
  Find similar regions in different proteins
  These regions may highlight evolutionary conservation and give clues to protein function
  http://www.ebi.ac.uk/tools/msa/clustalw2/

>hs_tp53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDI
EQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQ
KTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDST
PPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGN
LRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRP
ILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQDQTSFQKENC
>mm_trp53
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDV
EEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTY
QGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPA
GSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYP
EYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGS
AKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDA
HATEESGDSRAHSSLQPRAFQALIKEESPNC
>Xenopus_trp53
MEPSSETGMEPPLSQETFEDLWSLLPDPLQTGTGQMENFAEFSEYPLAPDMTVLQEGLMGNTVPTVTSSA
VPSTEDYAGSYGLKLEFQQNGTAKSVTCTYSTDLNKLFCQLAKTCPLLVRVERPPPLGSILRATAVYKKS
EHVAEVVKRCPHHERSVEPGDDPAPPSHLMRVEGNSKAYYMEDVGTGRHSVCVPYEGPQVGTECTTVLYN
YMCNSSCMGGMNRRPILTIITLESPEGLLLGRRCFEVRVCACPGRDRRTEEDNCTKKRGLKPNGKRELSH
PPSSDPPLPKKRLVEEDDEETFTLLIKGRSRYEMIKKLNDALELQESLDQQKLSIKCRKCRDEIKPKKGK
KLLVKDELQDSE
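Once an alignment tool has placed the sequences in rows of equal length, conserved regions can be read off column by column. The following Python sketch shows one simple way to do that. It assumes the alignment has already been computed elsewhere; the three short rows in `toy_alignment` are hand-made for illustration (they are not ClustalW output for the sequences above), and `conserved_columns` is a hypothetical helper name.

```python
def conserved_columns(aligned):
    """Return indices of columns where every sequence has the same non-gap residue."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i, column in enumerate(zip(*aligned.values())):
        if "-" not in column and len(set(column)) == 1:
            conserved.append(i)
    return conserved


# Hypothetical, hand-made alignment; '-' marks a gap.
toy_alignment = {
    "seq_a": "MEEPQSD-PLSQETF",
    "seq_b": "MEESQSDLPLSQETF",
    "seq_c": "ME-PSSEMPLSQETF",
}

columns = conserved_columns(toy_alignment)
# Fraction of alignment columns that are identical in all sequences.
fraction = len(columns) / len(next(iter(toy_alignment.values())))
print(columns, round(fraction, 2))
```

Runs of consecutive conserved columns are the candidates for functionally important regions; real analyses usually also score similar (not just identical) residues.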
Exercise: Is there a mouse model for Li-Fraumeni Syndrome?
  1. Go to OMIM in your browser: http://omim.org/
  2. Enter the search term lfs1 and click the top search result (#151623).
  3. Looking at the Phenotype Gene Relationships table, what two genes are associated with this disease?
       1.
       2.
  4. In this table, click on the link to the gene on chromosome 17 (MIM number: 191170).
  5. What is another of the diseases associated with mutations in this gene?
       1.
  6. Click the Genomic coordinates link to see this gene in the UCSC Genome Browser.
  7. Below the genome display, click the gray Default tracks button.
  8. Find the RefSeq Genes track, right-click, and select Pack to display all splice forms.
  9. Find the bottom (longest) splice form and left-click to see the gene details.
  10. Click the PubMed link. Approximately how many publications are related to this gene?
       1.
  11. Click Back in your browser to return to the RefSeq gene details. Click the RefSeq link to go to NCBI.
Exercise - Li-Fraumeni Syndrome
  13. Run BLAST on this sequence. Select Run BLAST from the toolbar links on the right.
  14. Under Choose Search Set, select Mouse genomic + transcript and click BLAST.
  15. What is the Accession number of the top transcript hit? What is the E value of the alignment?
       1.
       2.
  16. Is there a mouse with this gene knocked out? Search the gene common name (Trp53) at http://www.findmice.org.
Exercises
  1. Is the exact peptide sequence Serine-Alanine-Isoleucine-Asparagine-Threonine-Serine found in the human genome? If so, in what protein(s)? If not, what is the closest match?
  2. You are sequencing DNA isolated from a sample of Vancouver drinking water. One DNA fragment contains a small open reading frame that codes for the following peptide:
       mgydwlgrmpykgsvengaykaqgvqltak
     What organism does this come from (and should it be in the water)? Are there any conserved domains in this peptide?
  3. [UCSC Browser] Find the name of a SNP found in an exon or intron of the human gene KRAS. Click on the name to see a summary report. Click on the dbSNP link to see a detailed report. What is the frequency of this variant in the human population?
  4. [UCSC Browser] Find the gene Notch1. Click the DNA link at the top of the page and then click the extended case/color options button. Select underline for ESTs, blue color for SNPs (135) and bold for RepeatMasker.
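For exercise 1, protein search tools expect one-letter amino acid codes rather than full residue names, so the peptide has to be converted first. A minimal Python sketch of that conversion follows; the table holds the standard one-letter codes for the 20 common amino acids, while the names `ONE_LETTER` and `to_one_letter` are assumptions made for this example.

```python
# Standard one-letter codes for the 20 common amino acids, keyed by name.
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartate": "D",
    "cysteine": "C", "glutamine": "Q", "glutamate": "E", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}


def to_one_letter(peptide):
    """Convert a hyphen-separated list of residue names to one-letter codes."""
    names = peptide.lower().split("-")
    return "".join(ONE_LETTER[name.strip()] for name in names)


query = to_one_letter("Serine-Alanine-Isoleucine-Asparagine-Threonine-Serine")
print(query)
```

The resulting string can be pasted into a protein search form in place of the spelled-out residue names.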