Sequence Based Function Annotation


1 Sequence Based Function Annotation Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

2 Sequence Based Function Annotation 1. Given a sequence, how to predict its biological function? 2. How to describe the function of a gene? 3. How to work with 50,000 genes?

3 Usage scenarios for sequence based function annotation Genomic scale function annotation for non-model organisms RNA-seq data Genomic sequencing Assembly (Trinity) Assembly (SOAP de novo) ORF prediction (Trinity) Gene prediction (Maker) Function prediction

4 Alternative Usage Scenario You are interested in this gene: Ig Ig Ig TM Kinase You want to know: Is this gene present in the new genome?

5 Given a protein sequence, how to predict its function? >unknow_protein_1 MVHLTDAEKAAVSCLWGKVNSDEVGGEALGRLLVVYPWTQR YFDSFGDLSSASAIMGNAKVKAHGKKVITAFNDGLNHLDSL KGTFASLSELHCDKLHVDPENFRLLGNMIVIVLGHHLGKDF TPAAQAAFQKVVAGVATALAHKYH Common approaches Sequence alignment: BLAST Domains: PFAM, InterProScan

6 NCBI BLAST How does BLAST work? BLAST and Psi-BLAST: Position independent and position specific scoring matrix.

7 Two scenarios of BLAST 1. You have a gene sequence, you want to know the function of this gene. Query: your gene sequence Database: Genbank/Genpept, Refseq, Uniprot, Swissprot

8 Two scenarios of BLAST 2. You have a genome/transcriptome, you want to know whether your gene is present. Query: your gene sequence Database: BLAST formatted genome database

9 BLAST programs blastn: nucleotide query vs. nucleotide database; blastp: protein query vs. protein database; blastx: translated nucleotide query vs. protein database; tblastn: protein query vs. translated nucleotide database; tblastx: translated nucleotide query vs. translated nucleotide database
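The program list above amounts to a lookup on the query type and the database type. A minimal sketch of that lookup follows; the helper name `choose_blast_program` and the `translate_both` flag are hypothetical, introduced only for illustration, and are not part of BLAST itself.

```python
# Hypothetical helper: maps (query type, database type) to a BLAST program name.
# The table mirrors the program list on the slide; nothing here calls BLAST.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # database is translated
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for the given sequence types.

    translate_both=True selects tblastx, which compares a translated
    nucleotide query against a translated nucleotide database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}, {db_type}")


if __name__ == "__main__":
    # A protein of interest searched against a newly assembled genome.
    print(choose_blast_program("protein", "nucleotide"))
```

For the alternative usage scenario (a known protein searched against a new genome), this lookup returns tblastn.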

10 BLAST Databases Services megablast blastn tblastn tblastx

11 Nucleic Databases NCBI: Genbank (raw) Refseq (curated) UCSC: Refseq EBI: Ensembl UCSC Refseq is derived from the reference genome

12 Sequence types Genome Gene (exon + intron) cDNA (exon)

13 Ensembl Biomart Provides an easy-to-use interface to retrieve sequences

14 Protein Databases NCBI: Genpept (raw) Refseq (curated) UCSC: Refseq EBI: SwissProt Uniref

15 NCBI Refseq Proteins: 81M Organisms: 68 K

16 84,827,567 EBI UniProtKB/Swiss-Prot Release 2017_04 UniProtKB: 85 M; Swiss-Prot: 0.6 M; UniRef100: 106 M; UniRef90: 55 M; UniRef50: 22 M

17 How does BLAST work Step 1. Find HSPs (High-scoring Segment Pairs) and create alignments from them

18 How does BLAST work Step 2. Score each alignment, and report the top alignments Number of Chance Alignments = 2 X Match = +2 Mismatch = -3 Gap -(5 + 4(2)) = -13 - NCBI Discovery Workshops
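The nucleotide scoring on this slide can be sketched in a few lines. The sketch assumes that the gap term -(5 + 4(2)) means an opening cost of 5 plus an extension cost of 2 for each of 4 gapped positions; the function name and the example strings are hypothetical, and the code scores an alignment that is already given rather than searching for one.

```python
def score_nucleotide_alignment(query, subject, match=2, mismatch=-3,
                               gap_open=5, gap_extend=2):
    """Score a gapped pairwise alignment given as two equal-length strings.

    '-' marks a gap. Each run of gaps costs gap_open once plus gap_extend
    per gapped position (an assumed reading of the slide's gap formula).
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for q, s in zip(query, subject):
        if q == "-" and s == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:      # a new gap run starts here
                score -= gap_open
                gap_side = side
            score -= gap_extend
        else:
            gap_side = None
            score += match if q == s else mismatch
    return score


if __name__ == "__main__":
    # Six matches (+12) and one gap of length 4: -(5 + 4*2) = -13
    print(score_nucleotide_alignment("ACGTACGTAA", "ACGT----AA"))
```

With these parameters a gap of length 4 costs 13, so it takes seven matching bases to pay for it.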

19 BLOSUM62, a position independent matrix

20 How does BLAST work Step 2. Score each alignment protein alignment Number of Chance Alignments = 4 X K K +5 K E +1 Q F -3 Gap -(11 + 6(1)) = -17 Scores from BLOSUM62, a position independent matrix - NCBI Discovery Workshops
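For protein alignments the match/mismatch values are replaced by a substitution-matrix lookup. The sketch below uses only the three BLOSUM62 values quoted on the slide (K/K +5, K/E +1, Q/F -3), not the full matrix, and assumes the gap term means an opening cost of 11 plus 1 per gapped position, so that -(11 + 6(1)) evaluates to -17. All function names and the example alignment are hypothetical.

```python
# Only the BLOSUM62 entries quoted on the slide; the real matrix is 20 x 20.
SLIDE_SCORES = {("K", "K"): 5, ("K", "E"): 1, ("Q", "F"): -3}


def substitution_score(a, b, table=SLIDE_SCORES):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in table:
        return table[(a, b)]
    if (b, a) in table:
        return table[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this partial table")


def gap_penalty(length, gap_open=11, gap_extend=1):
    """Affine gap cost under the assumed reading: open + length * extend."""
    return -(gap_open + length * gap_extend)


def score_protein_alignment(query, subject):
    """Score a gapped alignment given as two equal-length strings ('-' = gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_side, gap_len = None, 0
    for q, s in zip(query, subject):
        side = "query" if q == "-" else "subject" if s == "-" else None
        if side != gap_side and gap_len:   # the previous gap run has ended
            score += gap_penalty(gap_len)
            gap_len = 0
        gap_side = side
        if side is None:
            score += substitution_score(q, s)
        else:
            gap_len += 1
    if gap_len:                            # gap run reaching the end
        score += gap_penalty(gap_len)
    return score


if __name__ == "__main__":
    # K/K +5, K/E +1, Q/F -3, gap of length 6 (-17), K/K +5
    print(score_protein_alignment("KKQ------K", "KEFAAAAAAK"))
```

The same residue pair always gets the same score wherever it occurs in the alignment, which is what "position independent" means here.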

21 BLOSUM62 substitution score is position independent Scores from BLOSUM62, a position independent matrix - NCBI Discovery Workshops

22 PSSM Alignment: Globins Conserved Histidine - NCBI Discovery Workshops

23 PSSM Viewer Histidine scored differently at two positions - NCBI Discovery Workshops

24 PSI-BLAST Build PSSM with PSI-BLAST 1. Iteration 1: Regular BLASTP (BLOSUM62) to identify a list of closely related proteins. Build PSSM from these proteins. 2. Iteration 2: Use the PSSM built from Iteration 1 to score alignments in this iteration. 3. Repeat for multiple iterations. - NCBI Discovery Workshops
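The "build PSSM from these proteins" step can be illustrated with a toy log-odds calculation. This is a simplified sketch, not PSI-BLAST's actual procedure: it assumes ungapped, equal-length aligned sequences, a uniform background frequency, and a flat pseudocount, whereas the real program also weights sequences and derives its pseudocounts differently. The function names and the four example sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a toy position-specific scoring matrix.

    Returns one dict per alignment column mapping each amino acid to a
    log2-odds score against a uniform background (a simplifying assumption).
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, seq):
    """Sum the column-specific scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    # Made-up alignment: H is fully conserved in column 0, rare in column 2.
    pssm = build_pssm(["HAK", "HGH", "HSE", "HTD"])
    print(round(pssm[0]["H"], 2), round(pssm[2]["H"], 2))
```

In this toy matrix histidine scores higher in the conserved first column than in the variable third column, which a single BLOSUM62 value cannot express.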

25 Build PSSM with DELTA-BLAST DELTA-BLAST employs a subset of NCBI's Conserved Domain Database (CDD) to construct PSSM

26 Heme Binding Site Conserved Histidine blastp DELTA-BLAST - NCBI Discovery Workshops

27 Heme Binding Site Conserved Histidine blastp DELTA-BLAST BLAST is not reliable for alignment of homologous genes between distantly related species. - NCBI Discovery Workshops

28 BLAST does Local Alignment (Basic Local Alignment Search Tool) Local Alignment vs Global Alignment HSP-1 HSP-2 (Bowtie, BWA, ClustalW, et al)
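The difference between local and global alignment can be shown with textbook dynamic programming. The sketch below uses Smith-Waterman (local) and Needleman-Wunsch (global) scoring with a linear gap penalty; BLAST itself is a heuristic and does not fill the whole matrix, so this illustrates the scoring idea only. The function name, the parameter values and the example sequences are assumptions chosen for illustration.

```python
def align_score(a, b, local, match=2, mismatch=-3, gap=-4):
    """Return the best alignment score between strings a and b.

    local=True  : Smith-Waterman, best-scoring pair of substrings.
    local=False : Needleman-Wunsch, both sequences aligned end to end.
    Uses a linear gap penalty (gap per gapped position) for brevity.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


if __name__ == "__main__":
    # The two sequences share only the core "ACGTACGT".
    a, b = "TTTTACGTACGTCCCC", "GGACGTACGTAA"
    print(align_score(a, b, local=True), align_score(a, b, local=False))
```

The local score reflects only the shared core, while the global score is dragged down by the unrelated flanking bases.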

29 Database usage examples 1. Identify species for a list of sequences. BioHPC: fastq_species_detector (using NCBI Genbank) e&i=149#c 2. Function annotation BLAST to a closely related species (Ensembl, Flybase, etc.). BLAST to Swissprot, Uniref (EBI) or Refseq (NCBI)