NCBI Molecular Biology Resources
- Irma Blake
1 NCBI Molecular Biology Resources Part 2: Using NCBI BLAST December 2009
2 Using BLAST: Basics of using NCBI BLAST. Using the new interface: improved organism and filter options. New services: Primer BLAST, Align 2 Sequences integration, COBALT protein multiple alignment, BLAST URL API, C++ BLAST binaries.
3 Basic Local Alignment Search Tool: Widely used similarity search tool. Heuristic approach based on the Smith-Waterman algorithm. Finds best local alignments. Provides statistical significance. All combinations (DNA/protein) of query and database: DNA vs DNA; DNA translation vs protein; protein vs protein; protein vs DNA translation; DNA translation vs DNA translation. Available as www, standalone, and network client.
4 BLAST Activity: Weekdays often exceed 400 K searches. Interactive searches: 140 K per weekday.
5 BLAST and BLAST-like programs
Traditional BLAST (formerly blastall): nucleotide, protein, translations
  blastn: nucleotide query vs. nucleotide database
  blastp: protein query vs. protein database
  blastx: nucleotide query vs. protein database
  tblastn: protein query vs. translated nucleotide database
  tblastx: translated query vs. translated database
Megablast: nucleotide only
  Contiguous megablast: nearly identical sequences
  Discontiguous megablast: cross-species comparison
Position Specific BLAST Programs: protein only
  Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
  Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs
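The query/database pairings on this slide can be written down as a small lookup table. The sketch below is only an illustration of that mapping; the names `PROGRAMS` and `candidate_programs` are hypothetical and are not part of any NCBI tool.

```python
# Query and database molecule types for the traditional BLAST programs,
# as listed on the slide. "translated" names the side(s) that are
# translated before comparison.
PROGRAMS = {
    "blastn":  {"query": "nucleotide", "database": "nucleotide", "translated": ()},
    "blastp":  {"query": "protein",    "database": "protein",    "translated": ()},
    "blastx":  {"query": "nucleotide", "database": "protein",    "translated": ("query",)},
    "tblastn": {"query": "protein",    "database": "nucleotide", "translated": ("database",)},
    "tblastx": {"query": "nucleotide", "database": "nucleotide",
                "translated": ("query", "database")},
}


def candidate_programs(query_type, database_type):
    """Return the program names that accept this query/database pairing."""
    return sorted(
        name
        for name, spec in PROGRAMS.items()
        if spec["query"] == query_type and spec["database"] == database_type
    )


if __name__ == "__main__":
    print(candidate_programs("nucleotide", "nucleotide"))
    print(candidate_programs("protein", "nucleotide"))
```

A nucleotide query against a nucleotide database has two candidates here: blastn compares the sequences directly, while tblastx compares their translations.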
6 BLAST Searches by Program November 9-13, 2009; n=886,407
7 Nucleotide and Protein BLAST Programs
8 Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect Value: E = number of database hits you expect to find by chance.
[Figure: distribution of alignment scores. Axes: Score vs. Alignments. Annotations: size of database, your score, expected number of random hits.]
$$E = K m n \, e^{-\lambda s} \quad \text{or} \quad E = m n \, 2^{-S}$$
K = scale for search space
λ = scale for scoring system
$$S = \text{bit score} = \frac{\lambda s - \ln K}{\ln 2}$$
(applies to ungapped alignments)
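The two forms of the Expect Value formula on this slide are equivalent, which a few lines of arithmetic can confirm. This is a minimal sketch: the function names are hypothetical, m and n are taken to be the query and database lengths, and the numeric values of λ, K, m, n and s in the usage section are illustrative placeholders, not figures from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2, as on the slide."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S)."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    # Placeholder parameters chosen only to exercise the formulas.
    lam, k = 0.3, 0.1
    m, n = 300, 1_000_000
    s = 80
    bits = bit_score(s, lam, k)
    print(bits, expect_from_raw(s, m, n, lam, k), expect_from_bits(bits, m, n))
```

Substituting the bit score into the second form cancels the ln 2 terms and recovers the first form, which is why bit scores from searches with different scoring systems can be compared directly.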
9 The BLAST homepage
10 Basic BLAST: Databases
11 Non-redundant protein
nr (non-redundant protein sequences):
  GenBank CDS translations
  NP_, XP_ (refseq_protein)
  Outside Protein: PIR, Swiss-Prot, PRF
  PDB (sequences from structures)
pat: protein patents
env_nr: environmental samples
Services: blastp, blastx
12 Protein Database Sizes (12/04/2009)
Database         Sequences     Residues
nr               10,133,783    3,456,922,644
refseq_protein   7,413,069     2,589,005,568
swissprot        430,          ,291,105
pat              817,          ,184,433
pdb              44,202        10,171,945
13 Protein Database Selection November 9-13, 2009; n=222,791
14 Nucleotide Databases: Human and Mouse. Megablast, blastn service. Human and mouse genomic and transcript now default. Separate sections in output for mRNA and genomic. Direct links to Map Viewer for genomic sequences.
15 Nucleotide Databases: Traditional. Services: blastn, tblastn, tblastx.
16 Nucleotide Databases: Traditional
nr (nt): traditional GenBank; NM_ and XM_ RefSeqs (refseq_rna)
NCBI Genomes: NC_ RefSeqs; GenBank chromosomes
dbEST: EST division; non-human, non-mouse ESTs
htgs: HTG division
gss: GSS division
wgs: whole genome shotgun contigs
env_nt: environmental samples
Databases are mostly non-overlapping.
17 Nucleotide Database Sizes (12/04/2009)
Database       Sequences     Residues
nr/nt          10,362,162    29,617,088,643
refseq_rna     2,042,538     3,240,301,155
NCBI genomes   10,047        49,094,451,709
est            63,832,451    35,136,825,005
htgs           143,742       24,082,224,044
gss            27,198,629    17,658,377,015
wgs            31,377,       ,309,157,200
env_nt         17,708,548    7,218,208,433
18 Nucleotide Database Selection November 9-13, 2009; n=535,836
19 Using Basic BLAST
20 Universal Form: Protein
21 Universal Form: Nucleotide. [Figure: program choices ordered along two scales, sensitivity (less to more) and speed (more to less).]
22 Limiting Database: Organism. Organism autocomplete.
23 Combining Organisms: Primates and Rodents without human or mouse.
24 More Limits: Eliminate models and environmental samples. Entrez query limit: any valid Entrez query.
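The "primates and rodents without human or mouse" limit from the preceding slides can also be expressed as a single Entrez query string for the Entrez query box. The sketch below assumes the standard Entrez `[Organism]` field tag and the Boolean operators OR and NOT; the helper name `organism_limit` is hypothetical, and the web form's own organism boxes may build the limit differently.

```python
def organism_limit(include, exclude=()):
    """Build an Entrez query limiting results to some taxa and excluding others."""
    if not include:
        raise ValueError("at least one organism or taxon to include is required")

    def tag(name):
        # Quote the name and attach the Entrez Organism field tag.
        return f'"{name}"[Organism]'

    query = " OR ".join(tag(name) for name in include)
    if len(include) > 1:
        query = f"({query})"
    if exclude:
        excluded = " OR ".join(tag(name) for name in exclude)
        if len(exclude) > 1:
            excluded = f"({excluded})"
        query = f"{query} NOT {excluded}"
    return query


if __name__ == "__main__":
    print(organism_limit(["Primates", "Rodentia"], ["Homo sapiens", "Mus musculus"]))
```

The parentheses matter: Entrez evaluates Boolean operators left to right, so grouping the included and excluded taxa keeps the NOT applied to both excluded species.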
25 Algorithm parameters: Protein. Callouts: expand; may limit results; adjust to set stringency; default statistics adjustment for compositional bias; off now by default, conflicts with comp-based stats.
26 Automatic Short Sequence Adjustment
Protein: e-value; word size 2; matrix PAM30; comp stats off; low comp filter off
Nucleotide: e-value 1000; word size 7; matrix 1,-3; low comp filter off
27 Algorithm parameters: Nucleotide (blastn). Masks species-specific interspersed repeats; essential for genomic query sequences. Masks low-complexity (LC) sequence (simple repeats). Prevents starting alignment in masked region; allows extensions through masked regions.
28 Basic BLAST: Protein
29 The hard way to run a BLAST Search
1. Search protein with Human Muscle Creatine Kinase
2. Click on summary for NP_
3. Change format to FASTA
4. Select sequence
5. Copy sequence
6. Google search BLAST
7. Link to NCBI BLAST Homepage
8. Link to Protein BLAST form
9. Paste FASTA sequence into form
10. Click BLAST button
30 An easier way: Entrez protein record. Callouts: Analysis Tools, PubMed Citations, Identical Proteins, Discovery Column, Reference Sequences, Gene Record, HomoloGene Cluster.
31 BLAST Ad to BLAST form
32 Database and limits: NCBI Reference Sequences; mammals without primates; exclude predicted proteins.
33 Run Search
34 BLAST Formatting Page: Conserved Domain Results.
35 BLAST Output: Graphical Overview (mouse over).
36 BLAST Output: Descriptions. Link to Entrez. Sorted by e-values. 7 X. Default e-value cutoff 10.
37 BLAST Output: Alignments. Midline categories: identical match; positive score (conservative); negative or zero; gap.
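The four alignment categories on this slide (identical, positive score, negative or zero, gap) determine what appears on the middle line of a protein alignment. The sketch below reproduces that convention under stated assumptions: `POSITIVE_PAIRS` is a small hand-picked set of residue pairs that score above zero in BLOSUM62, not the full matrix, and the function name `midline` is hypothetical.

```python
# A few non-identical residue pairs with positive BLOSUM62 scores.
# This is an illustrative subset, not the complete matrix.
POSITIVE_PAIRS = {
    frozenset(pair)
    for pair in ("KR", "DE", "IL", "IV", "LM", "ST", "FY", "ND", "QE")
}


def midline(query, subject, positive_pairs=POSITIVE_PAIRS):
    """Return the middle line for two aligned sequences of equal length.

    Identical residues are echoed, positive-scoring substitutions become '+',
    and negative-or-zero substitutions and gap columns become a space.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # gap column
        elif q == s:
            symbols.append(q)     # identical match
        elif frozenset((q, s)) in positive_pairs:
            symbols.append("+")   # conservative substitution
        else:
            symbols.append(" ")   # negative or zero score
    return "".join(symbols)


if __name__ == "__main__":
    q, s = "MKDL-A", "MRELGA"
    print(q)
    print(midline(q, s))
    print(s)
```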
38 What happens without XP_ filter? Results filtered for domestic dog proteins. 26 additional gene predictions from dog alone. Many are extra splice variants predicted by Gnomon.
39 Other Reports: TreeView, TaxBLAST, COBALT extension.
40 TaxBLAST: Taxonomy Reports Four genes in each mammal.
41 TreeView: Distance Tree. Four genes: mitochondrial creatine kinases (muscle, ubiquitous) and cytoplasmic creatine kinases (brain-specific, muscle-specific).
42 Basic BLAST: Nucleotide
43 Universal Form: Nucleotide Less More Sensitivity Speed More Less
44 Nucleotide Results: ALB mRNA with megablast, discontiguous megablast, and blastn.
45 Macaque CDC20 Search
46 Separate Sections for Transcript and Genome. Sortable results. Pseudogene on Chromosome 9; functional gene on Chromosome 1.
47 Total Score: All Segments. Functional gene now first.
48 Alignments: Sorting in Exon Order. Query start position: in exon order. Default sorting order: e-value; longest exon usually first.
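The effect of the sort options on the last few slides can be sketched with a toy data set. Everything here is hypothetical: the `Hsp` class, the hit names and every score, position and e-value are invented to mimic an mRNA query that matches an intronless pseudogene in one long segment and a functional gene in three exon-sized segments.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    query_start: int
    score: float
    evalue: float


# Invented numbers for illustration only.
hits = {
    "chr9 pseudogene": [Hsp(1, 900.0, 0.0)],
    "chr1 functional gene": [
        Hsp(700, 400.0, 1e-110),
        Hsp(1, 350.0, 1e-95),
        Hsp(350, 300.0, 1e-80),
    ],
}


def max_score(hsps):
    """Score of the best single segment."""
    return max(h.score for h in hsps)


def total_score(hsps):
    """Sum over all segments for the same subject sequence."""
    return sum(h.score for h in hsps)


# Ranking hits by best single segment puts the pseudogene first;
# ranking by total score puts the multi-exon functional gene first.
by_max = sorted(hits, key=lambda name: max_score(hits[name]), reverse=True)
by_total = sorted(hits, key=lambda name: total_score(hits[name]), reverse=True)

# Within one hit: default order is by e-value, exon order is by query start.
gene = hits["chr1 functional gene"]
default_order = [h.query_start for h in sorted(gene, key=lambda h: h.evalue)]
exon_order = [h.query_start for h in sorted(gene, key=lambda h: h.query_start)]

if __name__ == "__main__":
    print(by_max, by_total, default_order, exon_order)
```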
49 Links to Map Viewer Chromosome 1 Chromosome 9
50 BLAST Formatting Options
51 Formatting Page (Now on Results)
52 Reformatted Results: Gap (error) introduces frame shift.
53 Download Options (Now on Results): Structured Formats.
54 The Hit Table (also available in comma separated format for Excel)
# BLASTP (Aug )
# Query: gi ref NP_ MutL protein homolog 1 [Homo sapiens]
# Database: swissprot
# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 80 hits found
ref NP_ gi gi sp P38920 MLH1_YEAST e
ref NP_ gi gi sp Q9P7W6 MLH1_SCHPO e
ref NP_ gi gi sp Q8RA70 MUTL_THETN e
ref NP_ gi gi sp Q8KAX3 MUTL_CHLTE 5e-55 215
ref NP_ gi gi sp P MUTL_ECOLI e
ref NP_ gi gi sp Q8FAK9 MUTL_ECOL e
ref NP_ gi gi sp Q8XDN4 MUTL_ECO e
ref NP_ gi gi sp Q72PF7 MUTL_LEPIC e
ref NP_ gi gi sp P57886 MUTL_PASMU e
ref NP_ gi gi sp P44494 MUTL_HAEIN e
ref NP_ gi gi sp Q8ZIW4 MUTL_YERPE e
ref NP_ gi gi sp Q9JYT2 MUTL_NEIMB e
ref NP_ gi gi sp Q9KAC1 MUTL_BACHD e
ref NP_ gi gi sp Q87L05 MUTL_VIBPA e
ref NP_ gi gi sp Q9JTS2 MUTL_NEIMA e
ref NP_ gi gi sp Q6GHD9 MUTL_STAAR e
ref NP_ gi gi sp Q8NWX9 MUTL_STAAW e
ref NP_ gi gi sp Q5HGD5 MUTL_STAAC e
ref NP_ gi gi sp P65492 MUTL_STAAN e
ref NP_ gi gi sp Q9KV13 MUTL_VIBCH e
ref NP_ gi gi sp P14161 MUTL_SALTY e
ref NP_ gi gi sp Q9CDL1 MUTL_LACLA e
ref NP_ gi gi sp Q7MH01 MUTL_VIBVY e
ref NP_ gi gi sp Q8Z187 MUTL_SALTI e
ref NP_ gi gi sp Q8DCV0 MUTL_VIBVU e
ref NP_ gi gi sp Q5E2C6 MUTL_VIBF e
ref NP_ gi gi sp Q88DD1 MUTL_PSEPK e
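A hit table with the thirteen fields named in the "# Fields:" line is straightforward to load into a script. The sketch below assumes the table is tab-separated with "#" comment lines; the function name `parse_hit_table` and the Python field names are hypothetical, and the sample row uses invented identifiers and numbers because the values in the transcript did not survive extraction.

```python
# Field order follows the "# Fields:" comment line of the hit table.
FIELDS = [
    "query_id", "subject_ids", "pct_identity", "pct_positives",
    "alignment_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]
INT_FIELDS = {"alignment_length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end"}
FLOAT_FIELDS = {"pct_identity", "pct_positives", "evalue", "bit_score"}


def parse_hit_table(text):
    """Parse tab-separated hit table text into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
        row = dict(zip(FIELDS, values))
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Invented sample data, for illustration only.
SAMPLE = "\n".join([
    "# Fields: query id, subject ids, % identity, ...",
    "\t".join(["query1", "subjA", "35.5", "52.1", "330", "190", "8",
               "5", "330", "3", "325", "5e-55", "215"]),
])

if __name__ == "__main__":
    for row in parse_hit_table(SAMPLE):
        print(row["subject_ids"], row["evalue"], row["bit_score"])
```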
55 Structured formats: XML and ASN.1
XML:
<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_id>gi sp P40692 MLH1_HUMAN</Hit_id>
    <Hit_def>DNA mismatch repair protein Mlh1 (MutL protein homolog 1)</Hit_def>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_score>4061</Hsp_score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from>
        <Hsp_query-to>756</Hsp_query-to>
        <Hsp_hit-from>1</Hsp_hit-from>
        <Hsp_hit-to>756</Hsp_hit-to>
        <Hsp_query-frame>0</Hsp_query-frame>
        <Hsp_hit-frame>0</Hsp_hit-frame>
        <Hsp_identity>0</Hsp_identity>
        <Hsp_positive>0</Hsp_positive>
        <Hsp_gaps>0</Hsp_gaps>
        <Hsp_align-len>756</Hsp_align-len>
ASN.1:
Seq-annot ::= {
  desc {
    user {
      type str "Hist Seqalign",
      data { { label str "Hist Seqalign", data bool TRUE } } },
    user {
      type str "Blast Type",
      data { { label id 0, data int 0 } } },
    user {
      type str "BLAST database title",
      data { { label str "Non-redundant SwissProt
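The XML format is the easier of the two structured formats to read from a script. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the slide's excerpt; the closing tags were added here so the fragment is well formed, and the function name `summarize_hits` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's XML excerpt, with closing tags added.
SAMPLE = """<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from>
        <Hsp_query-to>756</Hsp_query-to>
      </Hsp>
    </Hit_hsps>
  </Hit>
</Iteration_hits>"""


def summarize_hits(xml_text):
    """Return one summary dict per HSP found in the XML text."""
    root = ET.fromstring(xml_text)
    summary = []
    for hit in root.iter("Hit"):
        for hsp in hit.iter("Hsp"):
            summary.append({
                "accession": hit.findtext("Hit_accession"),
                "hit_length": int(hit.findtext("Hit_len")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "query_range": (int(hsp.findtext("Hsp_query-from")),
                                int(hsp.findtext("Hsp_query-to"))),
            })
    return summary


if __name__ == "__main__":
    for entry in summarize_hits(SAMPLE):
        print(entry)
```

Because `iter` walks the whole tree, the same function should also work when the `Iteration_hits` element is nested inside a larger report.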
56 PSSMs: Restart PSI-BLAST ASN.1 ScoreMat, Portable ASCII encoded, Web only
57 Managing Searches: Recent Results, Saved Strategies.
58 Recent Results: Results available for 36 hours. Log in to My NCBI to save search strategies.
59 Saved Strategies Re-run searches to keep up to date
60 Genome and Specialized BLAST
61 Nucleotide Databases: Human and Mouse. Megablast, blastn service. Human and mouse genomic and transcript now default. Separate sections in output for mRNA and genomic. Direct links to Map Viewer for genomic sequences.
62 Genome BLAST pages
63 Map Viewer Homepage
64 Poplar Genome BLAST
65 tblastn Genome BLAST Results: Protein-nucleotide alignments. Exons and genes mixed.
66 Genomic Context of BLAST Hits
67 Hits in Map Viewer
68 Specialized BLAST Pages
69 BLAST extensions and improvements: Primer BLAST (primer designer / specificity checker); COBALT (protein multiple alignment tool); integration / expansion of BLAST 2 Sequences.
70 Primer BLAST from Sequence Record
71 Primer BLAST: Template and Primers
72 Primer BLAST: specificity parameters. Organism-specific search.
73 Primer Results: Conserved region. Exon boundary. Specific for this family member.
74 BLAST 2 Sequences: Region between UGT2B15 and TMPRSS11E in primary reference and alternate locus.
75 Alignment of reference and null allele. [Figure labels: Primary Reference; Null Allele (Alt. Locus); deleted region; relative insertion; similar regions on either side.]
76 COBALT Extension of BLAST Lower vertebrate creatine kinases
77 COBALT (Constraint Based Alignment Tool) True multiple sequence alignment that uses conserved domain information
78 COBALT Tree. Extra brain-type genes in tetraploid fishes. [Figure labels: Family 2; Family 2B; zebrafish; Salmonidae.]
79 COBALT Interface
80 Keeping up with what's new: NCBI News on Bookshelf
81 Getting Help
82 Service Addresses General Help BLAST Telephone support: