Genomics and Database Mining (HCS 604.3) April 2005


David M. Francis
OARDC
1680 Madison Ave
Wooster, OH

Introduction:
Computers have changed the way biologists go about the business of their science. In genetic research this has been accomplished by merging new biological techniques with advances in hardware and software. During this class we will explore internet resources for sequence analysis. The purpose of Genomics and Database Mining is to introduce students to some of the distributed resources (i.e., resources available on the World Wide Web) for the analysis of genetic data. With the maturation of plant genome projects, the sciences of genomics and proteomics will become more relevant to what we do.

Definitions:
1) Genomics - the application of information science concepts and methodologies to genome mapping and sequence data. Genomics requires the integration of sequence data with function and is based on the premise that structure/function relationships are conserved. Genomics is "large scale" due to automation and informatics (genome + informatics = genomics).
2) Database Mining - the semi-automated use of information technology to facilitate research, teaching, and extension.

Pitfalls:
When using distributed resources, it is not uncommon to encounter messages like the following:

    Error code 400 Can't open cache file [chkcache]

As users of distributed resources, you must always be aware that you are using someone else's computer for your analysis. It is advisable either to become familiar with multiple URLs for the same analysis or to download the software for essential tasks when multiple resources do not exist.

Where to Start:
An Index to Links was developed as an introduction and guide to Genomics and Database Mining (1998, D. Francis, S. Kamoun, D. Lohnes, T. Meulia and K. Simcox). This index has been maintained and updated and now may be accessed through

For students who want to pursue these topics in greater depth, the Genomics and Database Mining index contains two links for tutorials.
1) Guide to sequence searching:
2) Sequence Analysis With Distributed Resources (SADR):

Retrieving Sequence Data from the MCIC
Sequence data can be retrieved by following links from the Molecular and Cellular Imaging Center's web page: Follow the sequence "genomics", "sequencing and genotyping", "download your data", or simply type ftp://ftp.oardc.ohio-state.edu/ in your browser's address bar. Note: you may need to use the refresh button if you get a "The page cannot be displayed" message.

Username: dnaseq
Password: mcic

ALTERNATIVE: Use an FTP program and type ftp.oardc.ohio-state.edu/ ; then enter the username and password as described above.

Open the sequence folder, and open the folder with your data. Note that each sequence is represented by four files. Descriptions follow.

Files with the extensions .ab1 and .scf are trace files that can be opened with different viewers. Several freeware chromatogram viewers can be downloaded from the MCIC through the "Electropherogram viewing software" link. These viewers and files may be useful for visual inspection and for correcting text files.

Files with the extension .phd are phrap quality score files. The quality value is a log-transformed error probability, specifically $Q = -10 \log_{10}(P_e)$, where $Q$ and $P_e$ are respectively the quality value and the error probability of a particular base call.

Files with the extension .seq are text sequence files in FASTA format.

FASTA format
Text files containing DNA or protein sequences are formatted as FASTA files. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data (either nucleotide or protein). The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

An example sequence in FASTA format is:

    >cdna4
    ctacaaatacaaatacattgaatttgttaattaacgaacatggcaaacattgtcaacttc
    cctatcattgacatggagaagctcaataattataatggtgttgagaggagtcttgttttg
    etc.
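The quality formula and the FASTA layout described above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names (`error_probability`, `quality_value`, `read_fasta`) are hypothetical and not part of any MCIC or NCBI tool, and the parser handles only the simple layout shown in the example (a ">" description line followed by sequence lines).

```python
import math


def error_probability(q):
    """Convert a quality value Q into the base-call error probability P_e."""
    return 10 ** (-q / 10)


def quality_value(p_error):
    """Convert an error probability P_e into a quality value Q = -10 log10(P_e)."""
    return -10 * math.log10(p_error)


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {description: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # description line, without the ">" marker
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence data for the current record
    return {name: "".join(parts) for name, parts in records.items()}


# The first two sequence lines of the cdna4 example from the text.
example = """>cdna4
ctacaaatacaaatacattgaatttgttaattaacgaacatggcaaacattgtcaacttc
cctatcattgacatggagaagctcaataattataatggtgttgagaggagtcttgttttg
"""

sequences = read_fasta(example)
print(len(sequences["cdna4"]))  # 120 bases in the two lines shown
print(error_probability(20))    # a quality value of 20 is about a 1 in 100 error rate
```

Read this way, a base with a quality value of 30 has an error probability of about 1 in 1000, which is a convenient rule of thumb when deciding which parts of a read to trust.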

Sequences are represented in standard amino acid and nucleic acid codes. Lowercase letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

The nucleic acid codes supported are:

    A --> adenosine
    C --> cytidine
    G --> guanine
    T --> thymidine
    U --> uridine
    R --> G A (purine)
    Y --> T C (pyrimidine)
    K --> G T (keto)
    M --> A C (amino)
    S --> G C (strong)
    W --> A T (weak)
    B --> G T C
    D --> G A T
    H --> A C T
    V --> G C A
    N --> A G C T (any)
    -     gap of indeterminate length

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

    A  alanine
    B  aspartate or asparagine
    C  cysteine
    D  aspartate
    E  glutamate
    F  phenylalanine
    G  glycine
    H  histidine
    I  isoleucine
    K  lysine
    L  leucine
    M  methionine
    N  asparagine
    P  proline
    Q  glutamine
    R  arginine
    S  serine
    T  threonine
    U  selenocysteine
    V  valine
    W  tryptophan
    Y  tyrosine
    Z  glutamate or glutamine
    X  any
    *  translation stop
    -  gap of indeterminate length

Note: This information can be retrieved by following the search link next to the BLAST dialog box where FASTA files are pasted.

Resources for Analysis of DNA and Protein Sequences
Using cdna4 as an example, we will work through some basic analysis of data. The National Center for Biotechnology Information (NCBI) maintains comprehensive databases and tools for sequence analysis. In order to analyze the cDNA sequence, we use the Basic Local Alignment Search Tool (BLAST) program. Since we are BLASTing a nucleotide sequence against a nucleotide database, we use BLASTn. Pasting in the sequence (cdna4 in FASTA format), the following results are returned (note I have abridged the data):

    Sequences producing significant alignments:           Score    E
                                                          (bits)   Value

    gi gb M DINCARSR Carnation senescence related
    gi gb L DINACCA Dianthus caryophyllus amino-c
    gi dbj AB Dianthus caryophyllus DC-ACO1 g
    gi dbj AB Dianthus caryophyllus DNA, simi                      e-82
    gi gb AF Rosa hybrid cultivar 1-aminocycl                      e-70
    gi dbj AB Pyrus pyrifolia PPAOX4 mRNA for                      e-62
    gi gb AY Antirrhinum majus ACC oxidase AC                      e-61

From this we learn that the cDNA is likely 1-aminocyclopropane-1-carboxylic acid oxidase (ACC oxidase).

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to 0, the more "significant" the match is. However, keep in mind that short query sequences can produce virtually identical matches that still have relatively high E values. This is because the calculation of the E value also takes into account the length of the query sequence.

Scrolling down gives the alignments (e.g., from the first hit):

    >gi gb M DINCARSR Carnation senescence related protein RNA, complete cds
              Length = 1250

     Score = 2450 bits (1236), Expect = 0.0
     Identities = 1236/1236 (100%)
     Strand = Plus / Plus

    Query: 1 ctacaaatacaaatacattgaatttgttaattaacgaacatggcaaacattgtcaacttc 60
    Sbjct: 1 ctacaaatacaaatacattgaatttgttaattaacgaacatggcaaacattgtcaacttc 60

From this output we learn that the cDNA is 100% identical to the carnation senescence-related ACC oxidase sequence across the entire length of the cDNA (again, I have truncated the results to save space).

Next we can translate the cDNA sequence into a protein sequence. The cDNA can be translated in 6 frames, 3 from the plus (+) strand and 3 from the minus (-) strand. In eukaryotes, usually only one frame is translated. In order to find the open reading frames (ORFs) in the sequence we can use ORF finder:

The ORF finder results show that there are six ORFs, but only one that spans the length of the sequence (966 bases long, coding for a peptide of 321 amino acids). There are two small ORFs in the -3 frame, and small ORFs in the +2, -2, and -1 frames. The likely candidate is the +1 frame.

      40 atggcaaacattgtcaacttccctatcattgacatggagaagctc
         M  A  N  I  V  N  F  P  I  I  D  M  E  K  L
      85 aataattataatggtgttgagaggagtcttgttttggaccaaatt
         N  N  Y  N  G  V  E  R  S  L  V  L  D  Q  I
     130 aaggatgcttgtcacaactggggattcttccaggtggtgaaccat
         K  D  A  C  H  N  W  G  F  F  Q  V  V  N  H
     175 agtttgtcacatgaactgatggacaaagtggagaggatgacaaaa
         S  L  S  H  E  L  M  D  K  V  E  R  M  T  K
     220 gagcattacaagaaattcagggagcaaaagttcaaagacatggtt
         E  H  Y  K  K  F  R  E  Q  K  F  K  D  M  V
     265 cagaccaaaggtttagtgtctgctgagtctcaagtcaatgacatt
         Q  T  K  G  L  V  S  A  E  S  Q  V  N  D  I
     310 gattgggagagcaccttctaccttcgtcatcgtcccacctccaac
         D  W  E  S  T  F  Y  L  R  H  R  P  T  S  N
     355 atctccgaggtccctgatctcgacgaccaatacaggaagttgatg
         I  S  E  V  P  D  L  D  D  Q  Y  R  K  L  M
     400 aaggagtttgcagcccagattgagaggttatccgagcaactgttg
         K  E  F  A  A  Q  I  E  R  L  S  E  Q  L  L
     445 gacttgttatgtgagaaccttggccttgagaaagcgtaccttaag
         D  L  L  C  E  N  L  G  L  E  K  A  Y  L  K
     490 aatgccttctatggtgccaatggccccacttttggtaccaaggtc
         N  A  F  Y  G  A  N  G  P  T  F  G  T  K  V
     535 agcaactacccgccttgccccaaacccgaccttatcaaaggactt
         S  N  Y  P  P  C  P  K  P  D  L  I  K  G  L
     580 agggcccacaccgacgctggtggcatcattctcttgttccaggac
         R  A  H  T  D  A  G  G  I  I  L  L  F  Q  D
     625 gacaaggtcagcggcctccagctcctcaaggatggtcattgggtt
         D  K  V  S  G  L  Q  L  L  K  D  G  H  W  V
     670 gatgttcctcccatgaaacactccattgttgttaacttgggggac
         D  V  P  P  M  K  H  S  I  V  V  N  L  G  D
     715 caacttgaggttattacaaatggcaagtacaagagtgtgatgcac
         Q  L  E  V  I  T  N  G  K  Y  K  S  V  M  H
     760 cgcgtgatagcgcagacagatggtaacaggatgtcgatagcatca
         R  V  I  A  Q  T  D  G  N  R  M  S  I  A  S
     805 ttctacaacccgggaagtgatgccgtgatttacccggcgccaaca
         F  Y  N  P  G  S  D  A  V  I  Y  P  A  P  T
     850 ttggtggaaaaagaagaggagaaatgcagagcatacccaaaattt
         L  V  E  K  E  E  E  K  C  R  A  Y  P  K  F
     895 gtgttcgaggattacatgaatctctacttaaagctcaagttccaa
         V  F  E  D  Y  M  N  L  Y  L  K  L  K  F  Q
     940 gagaaggagcccaggtttgaagcaatgaaggccatggaaaccacg
         E  K  E  P  R  F  E  A  M  K  A  M  E  T  T
     985 ggtcccattccaactgcttga 1005
         G  P  I  P  T  A  *

From the ORF finder output we learn that an open reading frame starting with methionine and ending with the stop codon tga spans the clone. The first amino acid, methionine, is 40 bases from the 5' end in the +1 frame. You can use BLASTp to confirm that this protein is homologous to ACC oxidase.
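The kind of scan that ORF finder performs can be approximated offline. The sketch below is a simplified illustration, not the ORF finder algorithm itself: it assumes the standard genetic code, treats an ORF as a stretch from the first ATG to the next in-frame stop codon, and uses hypothetical names (`find_orfs`, `min_codons`, `toy`). The test sequence is an abridged toy construct made from the first 39 bases of cdna4, the first coding line (bases 40 to 84), and the last coding line (bases 985 to 1005), so its coordinates differ from those of the full clone.

```python
from itertools import product

# Standard genetic code, listed in the conventional t, c, a, g codon order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase a/c/g/t sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(seq, min_codons=10):
    """Return (frame, start, end, peptide) for each ATG-to-stop ORF in six frames.

    Positions are 1-based on the strand being read and include the stop codon.
    """
    seq = seq.lower()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(s[offset:])
            pos = 0
            while pos < len(protein):
                start = protein.find("M", pos)
                if start == -1:
                    break
                stop = protein.find("*", start)
                if stop == -1:
                    break  # no in-frame stop: not counted in this sketch
                if stop - start >= min_codons:
                    orfs.append((f"{strand}{offset + 1}",
                                 offset + 3 * start + 1,
                                 offset + 3 * stop + 3,
                                 protein[start:stop]))
                pos = stop + 1
    return orfs


# Abridged toy construct (see the lead-in); not the full cdna4 clone.
toy = ("ctacaaatacaaatacattgaatttgttaattaacgaac"
       "atggcaaacattgtcaacttccctatcattgacatggagaagctc"
       "ggtcccattccaactgcttga")

for orf in find_orfs(toy):
    print(orf)
```

On the toy construct, the +1 frame ORF starts at base 40, as in the ORF finder output for the full clone. The sketch skips ORFs that run off the end of the sequence without a stop codon, which is a deliberate simplification.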

Additional resources
In addition to the NCBI, there are several other resources that can be used for sequence manipulation and analysis. For example, the Baylor College of Medicine Human Genome Sequencing Center offers utilities for multiple types of analysis.

Assignment 1, WWW resources to analyze sequence data:

    Identify putative function of cDNA:  BLAST vs GenBank and GenBank EST
    Identify open reading frame:         ORF finder

Assignment 2, WWW resources to analyze sequence data:

    Retrieve sequences from the MCIC:    Sequencing Run_ Yang .seq files for 1347CDF_C CDR_C CDF_E CDR_E CDF_G CDR_G06
    Evaluate quality:                    BLAST forward and reverse sequences
    Compare multiple sequences:          CLUSTAL or Baylor. Note: you will have to find the reverse complement for all F or R.
    Identify genetic differences:        BLAST or CLUSTAL
    Identify putative function of cDNA:  BLAST vs GenBank and GenBank EST
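Assignment 2 requires the reverse complement of either the forward or the reverse reads before they can be compared. A minimal Python sketch of that step follows. The names `reverse_complement` and `differences` are hypothetical, the complement table follows the nucleic acid codes listed earlier (R pairs with Y, K with M, B with V, D with H; S, W, and N are their own complements), and `differences` assumes the two sequences have already been aligned to the same length, for example with CLUSTAL.

```python
# Nucleotide codes and their complements; U is complemented to A.
_CODES = "ACGTURYKMSWBDHVN"
_COMPS = "TGCAAYRMKSWVHDBN"
COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(seq):
    """Return the reverse complement, preserving case and leaving gaps (-) as they are."""
    return seq.translate(COMPLEMENT)[::-1]


def differences(seq_a, seq_b):
    """List (position, base_a, base_b) where two aligned sequences differ (1-based)."""
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [(i, a, b) for i, (a, b) in enumerate(pairs, start=1) if a != b]


forward = "ctacaaatac"                 # first ten bases of cdna4
reverse = reverse_complement(forward)  # what a reverse read of the same region would show
print(reverse)
print(differences(forward, reverse_complement(reverse)))  # empty list: the reads agree
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check before submitting the reoriented reads to BLAST or CLUSTAL.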