NCBI Molecular Biology Resources

1 NCBI Molecular Biology Resources Part 2: Using NCBI BLAST December 2009

2 Using BLAST
- Basics of using NCBI BLAST
- Using the new interface
  - Improved organism and filter options
- New services
  - Primer BLAST
  - Align 2 Sequences integration
  - COBALT protein multiple alignment
- BLAST URL API
- C++ BLAST binaries

3 Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on the Smith-Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/protein) of query and database:
  - DNA vs. DNA
  - DNA translation vs. protein
  - Protein vs. protein
  - Protein vs. DNA translation
  - DNA translation vs. DNA translation
- Available as WWW, standalone, and network client
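To make "best local alignment" concrete, here is a minimal sketch of the exact Smith-Waterman score that BLAST approximates heuristically. It is not BLAST itself. The function name, the linear gap penalty, and its value of -2 are assumptions for illustration; the match/mismatch scores of 1 and -3 echo the nucleotide scoring mentioned later in the deck.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty (an assumption made to keep the sketch short).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The full dynamic-programming table costs time proportional to the product of the two sequence lengths, which is why a heuristic is needed for database-scale searches.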

4 BLAST Activity
- Weekdays often exceed 400 K searches
- Interactive searches: 140 K per weekday

5 BLAST and BLAST-like programs
- Traditional BLAST (formerly blastall): nucleotide, protein, translations
  - blastn: nucleotide query vs. nucleotide database
  - blastp: protein query vs. protein database
  - blastx: nucleotide query vs. protein database
  - tblastn: protein query vs. translated nucleotide database
  - tblastx: translated query vs. translated database
- Megablast: nucleotide only
  - Contiguous megablast: nearly identical sequences
  - Discontiguous megablast: cross-species comparison
- Position Specific BLAST programs: protein only
  - Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
  - Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs
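The traditional program names follow directly from the query and database types, which a small lookup can illustrate. This is only a sketch: `PROGRAMS` and `choose_program` are hypothetical names, not part of any NCBI tool.

```python
# Maps (query type, database type) to the traditional BLAST program name.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a program name; translate_both selects tblastx (both sides translated)."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]
```

For example, `choose_program("protein", "nucleotide")` returns `"tblastn"`.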

6 BLAST Searches by Program November 9-13, 2009; n=886,407

7 Nucleotide and Protein BLAST Programs

8 Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect value: E = number of database hits you expect to find by chance.
[Figure: distribution of alignment scores (axes: Score vs. Alignments), annotated with the size of the database, your score, and the expected number of random hits]

$$E = K m n \, e^{-\lambda S} \quad \text{or} \quad E = m n \, 2^{-S'}$$

K = scale for search space
λ = scale for scoring system
S = raw score; S' = bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

(applies to ungapped alignments)
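The two forms of the Expect value are equivalent, and a few lines of arithmetic show it. The sketch below assumes m is the query length and n the database length (the search space), and the values of λ and K in the usage note are purely illustrative, not taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), where S' is the bit score."""
    return m * n * 2.0 ** (-bits)
```

With illustrative values such as `lam=0.3176` and `k=0.134`, `expect_from_raw(100, 250, 1e9, lam, k)` and `expect_from_bits(bit_score(100, lam, k), 250, 1e9)` agree up to floating-point rounding, because the bit score already absorbs λ and K.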

9 The BLAST homepage

10 Basic BLAST: Databases

11 Non-redundant protein
- nr (non-redundant protein sequences)
  - GenBank CDS translations
  - NP_, XP_ (refseq_protein)
  - Outside protein: PIR, Swiss-Prot, PRF
  - PDB (sequences from structures)
- pat: protein patents
- env_nr: environmental samples
Services: blastp, blastx

12 Protein Database Sizes (12/04/2009)

Database        Sequences    Residues
nr              10,133,783   3,456,922,644
refseq_protein  7,413,069    2,589,005,568
swissprot       430,…        …,291,105
pat             817,…        …,184,433
pdb             44,202       10,171,945

13 Protein Database Selection November 9-13, 2009; n=222,791

14 Nucleotide Databases: Human and Mouse
- Megablast, blastn service
- Human and mouse genomic and transcript now default
- Separate sections in output for mRNA and genomic
- Direct links to Map Viewer for genomic sequences

15 Nucleotide Databases: Traditional Services blastn tblastn tblastx

16 Nucleotide Databases: Traditional
- nr (nt): traditional GenBank; NM_ and XM_ RefSeqs
- refseq_rna
- NCBI Genomes: NC_ RefSeqs; GenBank chromosomes
- dbest: EST division
- non-human, non-mouse ESTs
- htgs: HTG division
- gss: GSS division
- wgs: whole genome shotgun contigs
- env_nt: environmental samples
Databases are mostly non-overlapping.

17 Nucleotide Database Sizes (12/04/2009)

Database      Sequences    Residues
nr/nt         10,362,162   29,617,088,643
refseq_rna    2,042,538    3,240,301,155
NCBI genomes  10,047       49,094,451,709
est           63,832,451   35,136,825,005
htgs          143,742      24,082,224,044
gss           27,198,629   17,658,377,015
wgs           31,377,…     …,309,157,200
env_nt        17,708,548   7,218,208,433

18 Nucleotide Database Selection November 9-13, 2009; n=535,836

19 Using Basic BLAST

20 Universal Form: Protein

21 Universal Form: Nucleotide
Program choices trade sensitivity against speed, ranging from less sensitivity and more speed to more sensitivity and less speed.

22 Limiting Database: Organism Organism autocomplete

23 Combining Organisms Primates and Rodents without human or mouse

24 More Limits Eliminate models and environmental samples Entrez query limit, any valid Entrez query.
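An organism combination such as "primates and rodents without human or mouse" can also be expressed as an Entrez query string. The helper below is a hypothetical sketch of how such a string might be assembled; `organism_limit` is not an NCBI function, and the BLAST form's own organism boxes achieve the same limit without any query text.

```python
def organism_limit(include, exclude=()):
    """Build an Entrez query that keeps `include` organisms and drops `exclude` ones."""
    included = " OR ".join(f'"{name}"[Organism]' for name in include)
    query = f"({included})"
    if exclude:
        excluded = " OR ".join(f'"{name}"[Organism]' for name in exclude)
        query += f" NOT ({excluded})"
    return query
```

Calling `organism_limit(["Primates", "Rodentia"], ["Homo sapiens", "Mus musculus"])` yields a string that could be pasted into the Entrez query limit field.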

25 Algorithm parameters: Protein
Expand the section to see the parameters. Callouts on the form:
- May limit results
- Adjust to set stringency
- Default statistics adjustment for compositional bias
- Off now by default; conflicts with comp-based stats

26 Automatic Short Sequence Adjustment

Parameter        Protein   Nucleotide
e-value          …         1000
Word size        2         7
Matrix           PAM30     1,-3
Comp stats       Off
Low comp filter  Off       Off

27 Algorithm parameters: Nucleotide blastn
- Masks species-specific interspersed repeats; essential for genomic query sequences
- Masks LC sequence (simple repeats)
- Prevents starting alignment in masked region
- Allows extensions through masked regions
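The last two points describe what is often called soft masking: masked letters cannot seed an alignment but can be crossed once an alignment has started. The toy sketch below assumes masked positions are written in lowercase and uses exact-match extension; real BLAST extension is score-based, so this only illustrates the seeding rule. Both function names are hypothetical.

```python
def seed_positions(query, word_size):
    """Return start positions of query words that contain no masked (lowercase) letters."""
    positions = []
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        if word.isupper():  # any lowercase letter disqualifies the word as a seed
            positions.append(i)
    return positions


def extend_right(query, subject, q_start, s_start):
    """Count matching positions to the right, ignoring case so masked letters still extend."""
    length = 0
    while (q_start + length < len(query)
           and s_start + length < len(subject)
           and query[q_start + length].upper() == subject[s_start + length].upper()):
        length += 1
    return length
```

For the query `"ACGTacacacGGTT"` with a word size of 4, only the words at positions 0 and 10 can seed, yet an extension starting at position 0 runs straight through the masked repeat.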

28 Basic BLAST: Protein

29 The hard way to run a BLAST search
1. Search protein with Human Muscle Creatine Kinase
2. Click on summary for NP_
3. Change format to FASTA
4. Select sequence
5. Copy sequence
6. Google search BLAST
7. Link to NCBI BLAST Homepage
8. Link to Protein BLAST form
9. Paste FASTA sequence into form
10. Click BLAST button

30 An easier way: Entrez protein record Analysis Tools PubMed Citations Identical Proteins Discovery Column Reference Sequences Gene Record HomoloGene Cluster

31 BLAST Ad to BLAST form

32 Database and limits NCBI Reference Sequences Mammals without primates Exclude predicted proteins

33 Run Search

34 BLAST Formatting Page Conserved Domain Results

35 BLAST Output: Graphical Overview mouse over

36 BLAST Output: Descriptions Link to Entrez Sorted by e values 7 X Default e value cutoff 10

37 BLAST Output: Alignments
Callouts on the alignment midline:
- Identical match
- Positive score (conservative)
- Negative or zero
- Gap
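The four callouts correspond to the characters on the middle line of a protein alignment: the residue letter for an identity, `+` for a positive-scoring substitution, and a blank for a non-positive score or a gap. The sketch below illustrates that rule. `SCORES` is a small excerpt of BLOSUM62 values typed in for the demo (an assumption; a real tool would load the full matrix), and `midline` is a hypothetical helper.

```python
# Excerpt of BLOSUM62 substitution scores, only the pairs used in the demo.
SCORES = {
    ("K", "R"): 2,
    ("L", "I"): 2,
    ("S", "T"): 1,
    ("A", "C"): 0,
    ("L", "D"): -4,
}


def pair_score(a, b):
    """Look up a substitution score in either order; unknown pairs raise KeyError."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def midline(query_row, subject_row):
    """Build the alignment midline for two equally long aligned rows ('-' marks a gap)."""
    out = []
    for a, b in zip(query_row, subject_row):
        if a == "-" or b == "-":
            out.append(" ")          # gap
        elif a == b:
            out.append(a)            # identical match
        elif pair_score(a, b) > 0:
            out.append("+")          # positive score (conservative substitution)
        else:
            out.append(" ")          # negative or zero score
    return "".join(out)
```

For the rows `"MKLSD-A"` and `"MRIT-EC"` the midline is `"M+++"` followed by three blanks.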

38 What happens without XP_ filter? Results filtered for domestic dog proteins. 26 additional gene predictions from Dog alone. Many are extra splice variants predicted by Gnomon.

39 Other Reports TreeView Tax BLAST COBALT extension

40 TaxBLAST: Taxonomy Reports Four genes in each mammal.

41 TreeView: Distance Tree
Four genes:
- Mitochondrial creatine kinases: muscle, ubiquitous
- Cytoplasmic creatine kinases: brain-specific, muscle-specific

42 Basic BLAST: Nucleotide

43 Universal Form: Nucleotide
Program choices range from less sensitivity and more speed to more sensitivity and less speed.

44 Nucleotide Results: ALB mRNA with megablast, discontiguous megablast, and blastn

45 Macaque CDC20 Search

46 Separate Sections for Transcript and Genome Sortable Results Pseudogene on Chromosome 9 Functional Gene on Chromosome 1

47 Total Score: All Segments Functional Gene Now First
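Sorting by total score sums every aligned segment for a subject, so a gene split across several exons can outrank a pseudogene that aligns as one long segment. The sketch below shows the idea with made-up bit scores; `total_scores` and the subject labels are hypothetical.

```python
from collections import defaultdict


def total_scores(segments):
    """Sum bit scores per subject and return (subject, total) pairs, highest total first."""
    totals = defaultdict(float)
    for subject, bits in segments:
        totals[subject] += bits
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers: one long pseudogene segment vs. three shorter exon segments.
EXAMPLE_SEGMENTS = [
    ("chr9 pseudogene", 900.0),
    ("chr1 gene", 300.0),
    ("chr1 gene", 350.0),
    ("chr1 gene", 400.0),
]
```

Ranked by the single best segment, the pseudogene comes first; ranked by `total_scores(EXAMPLE_SEGMENTS)`, the chromosome 1 gene does.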

48 Alignments: Sorting in Exon Order
- Default sorting order: e-value (longest exon usually first)
- Query start position: in exon order
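The difference between the two orderings is just the sort key. In this sketch the `Hsp` record and its values are invented for illustration; only the two sort criteria come from the slide.

```python
from collections import namedtuple

# Hypothetical record for one aligned segment (HSP).
Hsp = namedtuple("Hsp", ["label", "q_start", "evalue"])


def by_evalue(hsps):
    """Default order: most significant (smallest e-value) first."""
    return sorted(hsps, key=lambda h: h.evalue)


def by_query_start(hsps):
    """Exon order for an mRNA query: sort by where each segment starts on the query."""
    return sorted(hsps, key=lambda h: h.q_start)


EXAMPLE_HSPS = [
    Hsp("exon 2", 310, 1e-80),
    Hsp("exon 1", 1, 1e-40),
    Hsp("exon 3", 620, 1e-55),
]
```

Here `by_evalue` puts the longest, most significant exon first, while `by_query_start` restores the order in which the exons appear in the transcript.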

49 Links to Map Viewer Chromosome 1 Chromosome 9

50 BLAST Formatting Options

51 Formatting Page (Now on Results)

52 Reformatted Results Gap (error) introduces frame shift

53 Download Options (Now on Results) Structured Formats

54 The Hit Table
Also available in comma-separated format for Excel.

# BLASTP (Aug )
# Query: gi ref NP_ MutL protein homolog 1 [Homo sapiens]
# Database: swissprot
# Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 80 hits found

Subject entries of the rows shown (the numeric columns did not survive transcription, apart from an e-value of 5e-55 and a bit score of 215 next to the MUTL_CHLTE row):
sp P38920 MLH1_YEAST
sp Q9P7W6 MLH1_SCHPO
sp Q8RA70 MUTL_THETN
sp Q8KAX3 MUTL_CHLTE
sp P MUTL_ECOLI
sp Q8FAK9 MUTL_ECOL
sp Q8XDN4 MUTL_ECO
sp Q72PF7 MUTL_LEPIC
sp P57886 MUTL_PASMU
sp P44494 MUTL_HAEIN
sp Q8ZIW4 MUTL_YERPE
sp Q9JYT2 MUTL_NEIMB
sp Q9KAC1 MUTL_BACHD
sp Q87L05 MUTL_VIBPA
sp Q9JTS2 MUTL_NEIMA
sp Q6GHD9 MUTL_STAAR
sp Q8NWX9 MUTL_STAAW
sp Q5HGD5 MUTL_STAAC
sp P65492 MUTL_STAAN
sp Q9KV13 MUTL_VIBCH
sp P14161 MUTL_SALTY
sp Q9CDL1 MUTL_LACLA
sp Q7MH01 MUTL_VIBVY
sp Q8Z187 MUTL_SALTI
sp Q8DCV0 MUTL_VIBVU
sp Q5E2C6 MUTL_VIBF
sp Q88DD1 MUTL_PSEPK
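A hit table with the field list above is easy to read programmatically. The sketch below assumes tab-separated columns and `#` comment lines; the column names in `FIELDS` are my own renderings of the slide's field list, and the sample row is invented for the demo (only the e-value 5e-55 and bit score 215 echo numbers visible on the slide).

```python
FIELDS = [
    "query_id", "subject_ids", "pct_identity", "pct_positives",
    "alignment_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]
INT_FIELDS = {"alignment_length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end"}
FLOAT_FIELDS = {"pct_identity", "pct_positives", "evalue", "bit_score"}


def parse_hit_table(text):
    """Parse tab-separated hit-table rows into dicts, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
        row = dict(zip(FIELDS, values))
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        hits.append(row)
    return hits


# Invented sample: identifiers and most numbers are placeholders.
SAMPLE_TABLE = (
    "# Database: swissprot\n"
    "# 1 hits found\n"
    "query1\tsubjectA\t35.5\t52.1\t200\t120\t5\t1\t198\t3\t201\t5e-55\t215\n"
)
```

Each parsed row can then be filtered or sorted, for example by `row["evalue"]`.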

55 Structured formats: XML and ASN.1

XML:
<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_id>gi sp P40692 MLH1_HUMAN</Hit_id>
    <Hit_def>DNA mismatch repair protein Mlh1 (MutL protein homolog 1)</Hit_def>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_score>4061</Hsp_score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from>
        <Hsp_query-to>756</Hsp_query-to>
        <Hsp_hit-from>1</Hsp_hit-from>
        <Hsp_hit-to>756</Hsp_hit-to>
        <Hsp_query-frame>0</Hsp_query-frame>
        <Hsp_hit-frame>0</Hsp_hit-frame>
        <Hsp_identity>0</Hsp_identity>
        <Hsp_positive>0</Hsp_positive>
        <Hsp_gaps>0</Hsp_gaps>
        <Hsp_align-len>756</Hsp_align-len>

ASN.1:
Seq-annot ::= {
  desc {
    user {
      type str "Hist Seqalign",
      data {
        { label str "Hist Seqalign", data bool TRUE }
      }
    },
    user {
      type str "Blast Type",
      data {
        { label id 0, data int 0 }
      }
    },
    user {
      type str "BLAST database title",
      data {
        { label str "Non-redundant SwissProt
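The XML report can be read with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened fragment built from the element names and values shown on the slide; the closing tags are added here so the fragment is well-formed, and `summarize_hits` is a hypothetical helper rather than part of any BLAST toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment using element names from the slide.
SAMPLE_XML = """<Iteration_hits>
  <Hit>
    <Hit_num>1</Hit_num>
    <Hit_accession>P40692</Hit_accession>
    <Hit_len>756</Hit_len>
    <Hit_hsps>
      <Hsp>
        <Hsp_num>1</Hsp_num>
        <Hsp_bit-score>1568.9</Hsp_bit-score>
        <Hsp_evalue>0</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from>
        <Hsp_query-to>756</Hsp_query-to>
      </Hsp>
    </Hit_hsps>
  </Hit>
</Iteration_hits>"""


def summarize_hits(xml_text):
    """Return one summary dict per HSP found under each Hit element."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        for hsp in hit.iter("Hsp"):
            rows.append({
                "accession": hit.findtext("Hit_accession"),
                "length": int(hit.findtext("Hit_len")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "query_range": (int(hsp.findtext("Hsp_query-from")),
                                int(hsp.findtext("Hsp_query-to"))),
            })
    return rows
```

A full report wraps this fragment in further elements, so production code would start from the document root and handle multiple iterations.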

56 PSSMs: Restart PSI-BLAST ASN.1 ScoreMat, Portable ASCII encoded, Web only

57 Managing Searches Recent Results Saved Strategies

58 Recent Results
- Log in to My NCBI to save search strategies
- Results available for 36 hours

59 Saved Strategies Re-run searches to keep up to date

60 Genome and Specialized BLAST

61 Nucleotide Databases: Human and Mouse
- Megablast, blastn service
- Human and mouse genomic and transcript now default
- Separate sections in output for mRNA and genomic
- Direct links to Map Viewer for genomic sequences

62 Genome BLAST pages

63 Map Viewer Homepage

64 Poplar Genome BLAST

65 tblastn Genome BLAST Results Protein-nucleotide alignments Exons and genes mixed

66 Genomic Context of BLAST Hits

67 Hits in Map Viewer

68 Specialized BLAST Pages

69 BLAST extensions and improvements
- Primer-BLAST: primer designer / specificity checker
- COBALT: protein multiple alignment tool
- Integration / expansion of BLAST 2 Sequences

70 Primer BLAST from Sequence Record

71 Primer BLAST: Template and Primers

72 Primer BLAST: specificity params Organism-specific search

73 Primer Results Conserved region Exon boundary Specific for this family member

74 BLAST 2 Sequences: region between UGT2B15 and TMPRSS11E in the primary reference and the alternate locus

75 Alignment of reference and null-allele Deleted region Similar regions on either side Null Allele (Alt. Locus) Relative insertion Primary Reference

76 COBALT Extension of BLAST Lower vertebrate creatine kinases

77 COBALT (Constraint Based Alignment Tool) True multiple sequence alignment that uses conserved domain information

78 COBALT Tree
[Figure: tree with clades labeled Family 2 and Family 2B, each containing zebrafish and Salmonidae sequences]
Extra brain-type genes in tetraploid fishes.

79 COBALT Interface

80 Keeping up with what's new: NCBI News on Bookshelf

81 Getting Help

82 Service Addresses General Help BLAST Telephone support: