Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Different user needs, different approaches

Size: px

Start display at page:

Download "Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Different user needs, different approaches"

Meryl Matthews
5 years ago
Views:

Introduction to Bioinformatics Who is taking this course? Monday, November 19, 2012 Jonathan Pevsner pevsner@kennedykrieger.org Bioinformatics M.E:800.

1 Introduction to Bioinformatics Who is taking this course? Monday, November 19, 2012 Jonathan Pevsner Bioinformatics M.E: People with very diverse backgrounds in biology Some people with backgrounds in computer science and biostatistics Most people (will) have a favorite gene, protein, or disease, or a high throughput dataset Different user needs, different approaches web-based or graphical command line user interface (GUI) What are the goals of the course? UCSC, Ensembl genome browsers NCBI EBI, central resources Partek MEGA5 RStudio Galaxy: web implementation of browser data, NGS tools Perl, R: manipulate data files Software for data analysis, large databases Linux: nextgeneration sequencing, other tools To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI), UCSC, and EBI To focus on the analysis of DNA, RNA and proteins To introduce you to the analysis of genomes To combine theory and practice to help you solve research problems Workflow #1: How do we analyze a protein? Workflow #2: How do we analyze a human genome? Suppose we are interested in learning about beta globin (HBB), a subunit of hemoglobin NCBI offers information at the levels of DNA (e.g. variation in disease), RNA (e.g. gene expression data from microarrays), protein (e.g. 3D structure), pathways EBI offers comparable information Other portals such as ExPASy offer hundreds of analysis tools Choose an individual (e.g. a patient), obtain informed consent, get blood, purify genomic DNA Obtain raw DNA reads by next-generation sequencing Align the reads to a reference human genome Call variants (single nucleotide variants, indels) Determine the functional significance of variants (deleterious or neutral) 1

2 Textbook The course textbook has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2 nd edition 2009). The lectures in this course correspond closely to chapters. I will make pdfs of the chapters available to everyone. You can also purchase a copy at the bookstore, at amazon.com, or at Wiley with a 20% discount through the book s website Web sites The course website is reached via moodle: (or Google moodle bioinformatics ) --This site contains the powerpoints for each lecture, including black & white versions for printing --The weekly quizzes are here --You can ask questions via the forum --Audiovisual files of each lecture will be posted here The textbook website is: This has powerpoints, URLs, etc. organized by chapter. This is most useful to find web documents corresponding to each chapter. Themes throughout the course: the beta globin gene/protein family Computer labs We will use beta globin as a model gene/protein throughout the course. Globins including hemoglobin and myoglobin carry oxygen. We will study globins in a variety of contexts including --sequence alignment --gene expression --protein structure --phylogeny --homologs in various species There is no in-class computer lab, but the seven weekly quizzes function as a take-home computer lab. To solve the questions, you will need to go to websites, use databases, and use software. Most quizzes are due in 7 days. Because of Thanksgiving, the first quiz will be due in 9 days (November 28 th at noon). Grading Google moodle bioinformatics to get here; Click Bioinformatics to sign in; The enrollment key you need is 60% moodle quizzes (your top 6 out of 7 quizzes). Quizzes are taken at the moodle website, and are due one week after the relevant lecture. Special extended due date for quizzes due immediately after Thanksgiving and the New Year. 40% final exam Thursday, January 17 (2pm, in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time. 2

Outline for the course (all on Mondays) 1. Accessing information about DNA and proteins Nov. 19 2. Pairwise alignment and BLAST Nov. 26 3. Advanced BLAST to next-generation sequencing Dec. 3 4.

3 Outline for the course (all on Mondays) 1. Accessing information about DNA and proteins Nov Pairwise alignment and BLAST Nov Advanced BLAST to next-generation sequencing Dec Multiple sequence alignment Dec Molecular phylogeny and evolution Dec Microarrays Jan Next-generation sequencing Jan. 14 Final exam (Thursday 2:00 pm) Jan. 17 Outline for today Learning objectives for today Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy [1] You should be able to explain what accession numbers are and why the RefSeq project is significant [2] You should be able to find accession numbers for any gene (or protein) from any organism via Entrez Gene [3] You should be able to locate any human gene using the UCSC Genome Browser [4] You should be able to locate information about genes (and proteins) using the UCSC Table Browser and the related resource Galaxy What is bioinformatics? Interface of biology and computers Analysis of proteins, genes and genomes using computer algorithms and computer databases Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects. Three perspectives on bioinformatics The cell The organism The tree of life Page 4 3

First perspective: the cell Central dogma of molecular biology DNA RNA protein genome transcriptome proteome Central dogma of bioinformatics and genomics Growth of GenBank DNA genomic DNA databases

2 Page 18 Sequences (millions) Base pairs of DNA (millions) 1982 1986 1990 1994 1998 20

4 First perspective: the cell Central dogma of molecular biology DNA RNA protein genome transcriptome proteome Central dogma of bioinformatics and genomics Growth of GenBank DNA genomic DNA databases RNA cdna ESTs UniGene protein protein sequence databases phenotype Fig. 2.2 Page 18 Sequences (millions) Base pairs of DNA (millions) Fig. 2.1 Year Page 17 Sequence Read Archive (SRA) at NCBI: over 900 terabases of sequence (~1 petabase) Arrival of next-generation sequencing: We have sequenced hundreds of terabases (November 2012) Size, terabases 6 years ago GenBank celebrated reaching 100 billion base pairs of DNA Year A project to sequence 10,000 human genomes requires about 3 petabytes of storage (3,000 terabytes). To buy a server with 100 Tb now costs $10,000. Now when my lab sequences the genome of one patient, we obtain ~150 billion base pairs of sequence and ~1 terabyte of data for $3,000. This course will reflect the major impact of increased DNA sequencing on all areas of biology. 4

There are three major public DNA databases There storage of next-generation sequence data is an emerging challenge EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National

http://www.ncbi.nlm.nih.gov/traces/sra/ The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), http://www.ebi.ac.uk/ena/ For individual labs, tens of terabytes of storage are routinely needed.

Taxonomy at NCBI: >250,000 species are represented in GenBank The most sequenced organisms in GenBank Homo sapiens 16.3 billion bases Mus musculus 10.0b Rattus norvegicus 6.5b Bos taurus 5.

5 There are three major public DNA databases There storage of next-generation sequence data is an emerging challenge EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National Center for Biotechnology Information DDBJ Housed in Japan The underlying raw DNA sequences are identical NCBI offers a sequence read archive (SRA), but the best storage strategies are uncertain. The European Bioinformatics Institute (EBI) offers the European Nucleotide Archive (ENA), For individual labs, tens of terabytes of storage are routinely needed. Page 14 Second perspective: the organism Third perspective: the tree of life Time of development Body region, physiology, pharmacology, pathology Page 5 After Pace NR (1997) Science 276:734 Page 6 Taxonomy at NCBI: >250,000 species are represented in GenBank The most sequenced organisms in GenBank Homo sapiens 16.3 billion bases Mus musculus 10.0b Rattus norvegicus 6.5b Bos taurus 5.4b Zea mays 5.1b Sus scrofa 4.9b Danio rerio 3.1b Strongylocentrotus purpurata 1.4b Macaca mulatta 1.3b Oryza sativa (japonica) 1.3b 11/12 Page 16 Updated Nov GenBank release Excluding WGS, organelles, metagenomics Table 2-2 Page 17 5

Outline for today National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.

6 Outline for today National Center for Biotechnology Information (NCBI) Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy Fig. 2.4 Page 24 NCBI key features: PubMed National Library of Medicine's search service 22 million citations in MEDLINE (as of 2012) links to participating online journals PubMed tutorial on the site or visit NLM: Be sure to access PubMed via Welch Library! Also get to know your informationists Peggy Gross, Rob Wright et al. Page 23 NCBI key features: Entrez search and retrieval system Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24 Outline for today Accession numbers are labels for sequences Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26 6

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

corresponds to the most stable, agreed-upon reference version of a sequence. X02775 NG_000007.3 rs192792910 GenBank genomic DNA sequence RefSeqGene dbsnp (single nucleotide polymorphism) AA970968.

7 What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for beta globin, HBB): NCBI s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. X02775 NG_ rs GenBank genomic DNA sequence RefSeqGene dbsnp (single nucleotide polymorphism) AA An expressed sequence tag (1 of 2,345) NM_ RefSeq DNA sequence (from a transcript) NP_ CAA Q YE0 B RefSeq protein GenBank protein SwissProt protein Protein Data Bank structure record DNA RNA protein Page 27 RefSeq identifiers include the following formats: Complete genome NC_###### Complete chromosome NC_###### Genomic contig NT_###### mrna (DNA format) NM_###### e.g. NM_ Protein NP_###### e.g. NP_ Page 27 NCBI s RefSeq project: many accession number formats for genomic, mrna, protein sequences Access to sequences: Entrez Gene at NCBI Accession Molecule Method Note AC_ Genomic Mixed Alternate complete genomic AP_ Protein Mixed Protein products; alternate NC_ Genomic Mixed Complete genomic molecules NG_ Genomic Mixed Incomplete genomic regions NM_ mrna Mixed Transcript products; mrna NM_ mrna Mixed Transcript products; 9-digit NP_ Protein Mixed Protein products; NP_ Protein Curation Protein products; 9-digit NR_ RNA Mixed Non-coding transcripts NT_ Genomic Automated Genomic assemblies NW_ Genomic Automated Genomic assemblies NZ_ABCD Genomic Automated Whole genome shotgun data XM_ mrna Automated Transcript products XP_ Protein Automated Protein products XR_ RNA Automated Transcript products YP_ Protein Auto. & Curated Protein products ZP_ Protein Automated Protein products Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_ for beta globin DNA corresponding to mrna) or protein (NP_ ) Page 29 From the NCBI home page, type beta globin and hit Search Fig. 2.5 Page 28 Follow the link to Gene Fig. 2.5 Page 28 7

Entrez Gene is in the header Note the Official Symbol HBB for beta globin Note the limits option

NCBI offers a wealth of information Entrez Gene (bottom of page): non-refseq accessions (it s unclear

Ontology (organizing principles of biological process, molecular function, cellular component)

HomoloGene) Many, many links to external resources Page 29 Entrez Protein: accession, organism,

8 Entrez Gene is in the header Note the Official Symbol HBB for beta globin Note the limits option Entrez Gene (top of page): Note a useful summary, and links to other databases Page 30 Gene page at NCBI offers a wealth of information Entrez Gene (bottom of page): non-refseq accessions (it s unclear what these are, highlighting usefulness of RefSeq) Genomic context Bibliography Phenotypes Gene Ontology (organizing principles of biological process, molecular function, cellular component) Reference sequences Additional (non-refseq sequences) Many, many links to NCBI resources (e.g. HomoloGene) Many, many links to external resources Page 29 Entrez Protein: accession, organism, literature Entrez Protein: features of a protein, and its sequence in the one-letter amino acid code Fig. 2.8 Page 31 Fig. 2.8 Page 31 8

You should learn the one-letter amino acid code!

Glycine Gly G Histidine His H Isoleucine Ile I Name 3 Letter 1 Letter Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W

there are many others FASTA Sequences in one letter DNA or protein code FASTQ DNA sequences with quality scores for each base BAM compressed binary version of SAM SAM Sequence Alignment/Map file

9 You should learn the one-letter amino acid code! Entrez Protein: You can change the display (as shown) Name 3 Letter 1 Letter Alanine Ala A Arginine Arg R Asparagine Asn N Aspartic acid Asp D Cysteine Cys C Glutamic Acid Glu E Glutamine Gln Q Glycine Gly G Histidine His H Isoleucine Ile I Name 3 Letter 1 Letter Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W Tyrosine Tyr Y Valine Val V Page 31 FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code While FASTA is one file format, there are many others FASTA Sequences in one letter DNA or protein code FASTQ DNA sequences with quality scores for each base BAM compressed binary version of SAM SAM Sequence Alignment/Map file (tab-delimited) VCF variant call format (genomic variants; indels) (See genome.ucsc.edu/faq/faqformat.html for the following:) BED WIG GFF a table including chromosome, start, end wiggle format (displays dense, continuous data) General Feature Format (tab separated) Fig. 2.9 Page 32 Also, besides Excel (.xls,.xlsx) spreadsheets can also be:.txt tab-delimited text file (or space delimited).csv comma separated text file FASTQ format specification The FASTQ format has four lines per read (and typically has millions of reads) Sequencing run information Sequence read (like FASTA) Quality scores (per base) Outline for today Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy 9

Genome Browsers: increasingly important resources Genomic DNA is organized in chromosomes.

org) note BioMart The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used.

A quiz question about Ensembl/BioMart For this week s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the micrornas on chromosome 11, as well as their GC content.

10 Genome Browsers: increasingly important resources Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected annotation tracks that display many kinds of information. Ensembl genome browser ( note BioMart The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is less commonly used. click human enter beta globin Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom). There are various horizontal annotation tracks. A quiz question about Ensembl/BioMart For this week s quiz/computer lab, one question asks you to use BioMart at Ensembl to find all the micrornas on chromosome 11, as well as their GC content. There are step-by-step instructions and it should take no more than a couple minutes. The UCSC Genome Browser: an increasingly important resource This browser s focus is on humans and other eukaryotes you can select which tracks to display (and how much information for each track) tracks are based on data generated by the UCSC team and by the broad research community you can create custom tracks of your own data! Just format a spreadsheet properly and upload it The Table Browser is equally important as the more visual Genome Browser, and you can move between the two 10

globin), hit submit Page 36 An assembly (or build ) is a fixed version of a genome. Builds are released every several years.

They are annotated with different types of information such as experimental data sets. To learn more visit http://www.ncbi.

display --add custom tracks --the Table Browser is complementary Exploring the UCSC Genome Browser The human genome can be

11 [1] Visit click Genome Browser Note that there are choices of assemblies such as hg19 [2] Choose organisms, enter query (beta globin), hit submit Page 36 An assembly (or build ) is a fixed version of a genome. Builds are released every several years. In practice, you should always be aware whether you are using hg18 or hg19 for the human genome. They are annotated with different types of information such as experimental data sets. To learn more visit [3] Choose the RefSeq beta globin gene (HBB) [4] On the UCSC Genome Browser: --choose which tracks to display --add custom tracks --the Table Browser is complementary Exploring the UCSC Genome Browser The human genome can be viewed with different assemblies (hg18, hg19). These contain different data sets. You can get information about a track by clicking its header (e.g. RefSeq Genes ). You can choose the density to display each track (e.g. hide, dense, squish, pack). Outline for today Definition of bioinformatics Overview of the NCBI website Accession numbers, RefSeq, and Entrez Gene Two genome browsers: UCSC and Ensembl From UCSC Table Browser to Galaxy 11

You can reach the Table Browser from the Genome Browser Use the

Genome Browser tracks. It is quantitative, not visual.

check box to send a BED file to Galaxy Get output Get a summary

panel; click edit attributes to change name to snps Tools panel

Step 3 (continued): Thus, there are ~890,000 SNPs on human

group to Genes and the track to RefSeq Genes, then get output

12 You can reach the Table Browser from the Genome Browser Use the UCSC Table Browser to get tabular data related to any of the Genome Browser tracks. It is quantitative, not visual. species assembly position (where you were on the Genome Browser) check box to send a BED file to Galaxy Get output Get a summary of the output Step 3: In history panel, click eye to view main panel; click edit attributes to change name to snps Tools panel Display panel History panel Step 2: click Send query to Galaxy Step 3 (continued): Thus, there are ~890,000 SNPs on human chromosome 11 Step 4 (continued): To get coding exons, set the group to Genes and the track to RefSeq Genes, then get output Step 4: To get coding exons, click Get Data (under Tools) then UCSC Main table browser 12

Step 4 (continued): Set the output to Coding Exons then send the query to Galaxy Step 5: In Galaxy, a

Under Tools click Operate on Genomic Intervals. Step 6: Click Intersect the intervals of two datasets.

There are 11,652 SNPs in coding exons. Finally, click display at UCSC main. Step 7.

(entitled User Supplied Track ). Summary: Galaxy Galaxy s URL is http://usegalaxy.org.

It is intuitive and does not require knowledge of computer programming, but it offers access to

13 Step 4 (continued): Set the output to Coding Exons then send the query to Galaxy Step 5: In Galaxy, a new data set appears. Rename it coding exons by clicking the pencil. Under Tools click Operate on Genomic Intervals. Step 6: Click Intersect the intervals of two datasets. In the central panel, choose snps and coding exons. Execute. Step 7. Here s the answer. There are 11,652 SNPs in coding exons. Finally, click display at UCSC main. Step 7. This returns us to the UCSC Genome Browser, where our Galaxy results are displayed as a custom track (entitled User Supplied Track ). Summary: Galaxy Galaxy s URL is Its features include: Integration with UCSC and many other major genomics resources such as BioMart It is intuitive and does not require knowledge of computer programming, but it offers access to sophisticated software There are excellent videocasts and tutorials It fosters reproducibility. Your history and workflows are saved. See the User tab and sign up for a free account! 13

14 Summary: today s lecture [1] NCBI is a central resource for bioinformatics. Reminder: Please enroll! Google moodle bioinformatics to get here; click Bioinformatics to sign in; The enrollment key is [2] We described accession numbers and the RefSeq project, which offers a trusted version of a sequence. Use Entrez Gene as a key source of gene information. [3] We used the UCSC Genome Browser to visualize information along chromosomes. [4] We then used the UCSC Table Browser to obtain tabular outputs of queries. The underlying data are the same in the Genome and Table Browsers. [5] Next we used Galaxy to solve a problem, and sent the final result back to the UCSC Genome Browser. 14

Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Textbook. Web sites. Literature references

Introduction to Bioinformatics Who is taking this course? People with very diverse backgrounds in biology Some people with backgrounds in computer science and biostatistics Most people (will) have a favorite