Redundancy at GenBank => RefSeq. RefSeq vs GenBank. Databases, cont. Genome sequencing using a shotgun approach. Sequenced eukaryotic genomes


1 Databases, cont.

Redundancy at GenBank => RefSeq (i?rid=handbook)
Many sequences are represented more than once in GenBank.
2003 RefSeq collection: a curated, non-redundant secondary database of selected organisms, covering:
* Genome DNA (assemblies)
* Transcripts (RNA)
* Protein

RefSeq vs GenBank
GenBank (not curated):
* Author submits
* Only the author can revise
* Multiple records from the same locus are common
* Records can contradict each other
* No limit to species included
* Akin to primary literature
RefSeq (curated):
* NCBI creates from existing data
* NCBI revises as new data emerge
* Single record for each molecule of major organisms
* Limited to model organisms
* Akin to review articles

Genome sequencing using a shotgun approach
Sequenced eukaryotic genomes

2 Sequencing going wild...

BGI: "capacity to sequence the equivalent of 1,600 complete human genomes each day"
"BGI and BGI Americas aim to build a library of digital life, which includes 1,000 plant and animal reference genomes, 10,000 microorganism genomes".

NCBI Trace Archive
* 2012: 2 x 10^9 single-pass reads, ~ nt
* Initiated 2001
* Purpose: collect raw data at sequencing centers worldwide
* Permanent repository of single-pass reads
* Each trace is between 300 and 1,000 nucleotides
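The two Trace Archive figures above (about 2 x 10^9 reads, each 300 to 1,000 nucleotides long) are enough to bound the total size of the archive. The sketch below only does that arithmetic; the function name is made up for illustration, and the result is a rough range derived from the slide's numbers, not the archive's actual total.

```python
def trace_archive_bounds(n_reads, min_len, max_len):
    """Return (lowest, highest) total nucleotide count for n_reads traces."""
    return n_reads * min_len, n_reads * max_len


# Figures taken from the slide: 2 x 10^9 reads, 300 to 1,000 nt per trace.
low, high = trace_archive_bounds(2 * 10**9, 300, 1_000)
print(f"between {low:.1e} and {high:.1e} nucleotides")
```

So the archive would hold somewhere between 6 x 10^11 and 2 x 10^12 nucleotides if every read fell inside the stated length range.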

3
Date             Cost per Mb of DNA Sequence   Cost per Genome
September 2001   $                             $
March 2002       $                             $
September 2002   $                             $
March 2003       $                             $
October 2003     $                             $
January 2004     $                             $
April 2004       $                             $
July 2004        $                             $
October 2004     $                             $
January 2005     $                             $
April 2005       $                             $
July 2005        $                             $
October 2005     $                             $
January 2006     $                             $
April 2006       $                             $
July 2006        $                             $
October 2006     $                             $
January 2007     $                             $
April 2007       $                             $
July 2007        $                             $
October 2007     $                             $
January 2008     $                             $
April 2008       $                             $
July 2008        $ 8.36                        $
October 2008     $ 3.81                        $
January 2009     $ 2.59                        $
April 2009       $ 1.72                        $
July 2009        $ 1.20                        $
October 2009     $ 0.78                        $
January 2010     $ 0.52                        $
April 2010       $ 0.35                        $
July 2010        $ 0.35                        $
October 2010     $ 0.32                        $
January 2011     $ 0.23                        $
April 2011       $ 0.19                        $
July 2011        $ 0.12                        $
October 2011     $ 0.09                        $

The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

The Cancer Genome Atlas (TCGA) is a landmark research program supported by the National Cancer Institute and National Human Genome Research Institute at the National Institutes of Health. TCGA researchers will identify the genomic changes in more than 20 different types of human cancer. By comparing the DNA in samples of normal tissue and cancer tissue taken from the same patient, researchers can identify changes specific to that particular cancer. TCGA is analyzing hundreds of samples for each type of cancer. By looking at many samples from many different patients, researchers will gain a better understanding of what makes one cancer different from another cancer. This is important because even two patients with the same type of cancer may experience very different outcomes or respond very differently to treatments. By connecting specific genomic changes with specific outcomes, researchers will be able to develop more effective, individualized ways of helping each cancer patient.
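The tumor versus normal comparison described above can be pictured as a set difference: variants present in the cancer sample but absent from the same patient's normal tissue are candidates for cancer-specific changes. This is a toy sketch only. The variant tuples (chromosome, position, reference base, alternate base) and their values are invented for illustration, and a real pipeline would also need to handle sequencing error, coverage, and sample purity.

```python
def tumor_specific_variants(normal, tumor):
    """Return variants seen in the tumor sample but not in the matched normal."""
    return sorted(set(tumor) - set(normal))


# Invented example data: (chromosome, position, reference base, alternate base).
normal_sample = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
tumor_sample = [
    ("chr1", 1000, "A", "G"),   # inherited variant, also present in normal tissue
    ("chr2", 500, "C", "T"),    # inherited variant, also present in normal tissue
    ("chr17", 7500, "G", "A"),  # present only in the tumor
]

candidates = tumor_specific_variants(normal_sample, tumor_sample)
print(candidates)
```

Variants shared by both samples drop out, which is why the normal tissue has to come from the same patient.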

4 The identification of genes that are mutated and hence drive oncogenesis has been a central aim of cancer research since the advent of recombinant DNA technology. The Cancer Genome Project is using the human genome sequence and high-throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical in the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.

NCBI Entrez
* Nucleotide
* Protein
* Structure
* PubMed
* OMIM (genetic diseases)
* dbSNP
* Taxonomy browser

EMBL and GenBank formats
EMBL format

ID   LISOD standard; DNA; PRO; 756 BP.
AC   X64011; S78972;
SV   X
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
DE   L.ivanovii sod gene for superoxide dismutase
KW   sod gene; superoxide dismutase.
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
RN   [1]
RX   MEDLINE;
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231: (1992).
RN   [2]
RP
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
DR   SWISS-PROT; P28763; SODM_LISIV.
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS
FT   terminator
FT   CDS
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /EC_number=" "
FT                   /product="superoxide dismutase"
FT                   /protein_id="caa "
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
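Every line of the EMBL record above starts with a two-letter code (ID, AC, DE, OS, FT, SQ, ...), which makes the flat file easy to take apart by hand. The sketch below groups lines by that code and checks that the base counts on the SQ line add up to the stated length. The function names are made up for illustration, and the parsing is deliberately minimal: it is not a full EMBL parser and it ignores the feature-table layout.

```python
from collections import defaultdict


def parse_embl_lines(text):
    """Group the content of each EMBL line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2].strip(), line[2:].strip()
        if code and value and code != "//":
            fields[code].append(value)
    return dict(fields)


def composition_is_consistent(sq_value):
    """Check that the base counts on an SQ line sum to the stated length."""
    parts = [p.strip() for p in sq_value.split(";") if p.strip()]
    total = int(parts[0].split()[1])            # "Sequence 756 BP" -> 756
    counts = [int(p.split()[0]) for p in parts[1:]]
    return sum(counts) == total


record = """ID   LISOD standard; DNA; PRO; 756 BP.
AC   X64011; S78972;
DE   L.ivanovii sod gene for superoxide dismutase
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
"""

fields = parse_embl_lines(record)
print(fields["OS"], composition_is_consistent(fields["SQ"][0]))
```

For the LISOD record the counts agree: 247 + 136 + 151 + 222 + 0 = 756.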

5 EMBL and GenBank formats

Examples of feature table elements
* To represent a coding sequence that is constructed from a range of exons: CDS join( , , , )
* To represent a coding sequence on the complementary strand of DNA: CDS complement( )

EMBL format: the header of the LISOD record (ID through DR lines) shown on slide 4.

Common sequence formats
1. EMBL release format
2. GenBank (ASN.1)
3. FASTA format:
>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT

FASTQ format
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*

A selection of search fields using NCBI Entrez. Each entry gives the search field, its qualifier, and its definition.

Accession [ACCN]: Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure, genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB IDs but not the MMDB IDs.

All Fields [ALL]: Contains all terms from all searchable database fields in the database.

Author Name [AUTH]: Contains all authors from all references in the database records. The format is last name, space, first initial(s), without punctuation (e.g., marley jf).

Feature Key [FKEY]: Contains the biological features assigned or annotated to the nucleotide sequences and defined in the DDBJ/EMBL/GenBank Feature Table. Not available for the Protein or Structure databases.

Journal Name [JOUR]: Contains the name of the journal in which the data were published. Journal names are indexed in the database in abbreviated form (e.g., J Biol Chem). Journals are also indexed by their ISSNs. Browse the index if you do not know the ISSN or are not sure how a particular journal name is abbreviated.

Modification Date [MDAT]: Contains the date that the most recent modification to that record is indexed in Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). A year alone (e.g., 1999) will retrieve all records modified in that year; a year and month (e.g., 1999/03) retrieves all records modified in that month that are indexed in Entrez.

Organism [ORGN]: Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.

Properties [PROP]: Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not available in the Structure database.

Publication Date [PDAT]: Contains the date that records are released into Entrez, in the format YYYY/MM/DD (e.g., 1999/08/05). It is the date the ...
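The FASTQ example on this slide pairs each base with one quality character. Assuming the common Phred+33 (Sanger) encoding, which the slide does not state, each character's ASCII code minus 33 gives a quality score Q, and the estimated error probability for that base is 10^(-Q/10). A minimal sketch with made-up function names:

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Convert a Phred score Q into the estimated base-call error probability."""
    return 10 ** (-score / 10)


sequence = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT"
quality = "!''*((((***+))%%%++)(%%%%).1***-+*"

scores = phred_scores(quality)
# One quality character per base, so both strings must have the same length.
assert len(scores) == len(sequence)
for base, score in list(zip(sequence, scores))[:4]:
    print(base, score, round(error_probability(score), 3))
```

The first character, `!`, decodes to Q = 0, meaning no confidence in that base call. Older Illumina files used an offset of 64, which is why the offset is a parameter here.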
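The qualifiers in the Entrez table are appended to a search term in square brackets, and terms can be combined with Boolean operators such as AND. The sketch below only builds such a query string from the fields listed on the slide. The helper names are made up, quoting multi-word values is a choice made here for illustration, and nothing is sent to NCBI.

```python
# Search field names and qualifiers as listed in the table on this slide.
QUALIFIERS = {
    "Accession": "ACCN",
    "All Fields": "ALL",
    "Author Name": "AUTH",
    "Feature Key": "FKEY",
    "Journal Name": "JOUR",
    "Modification Date": "MDAT",
    "Organism": "ORGN",
    "Properties": "PROP",
    "Publication Date": "PDAT",
}


def term(field, value):
    """Format one search term, e.g. term("Accession", "X64011") -> X64011[ACCN]."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{QUALIFIERS[field]}]"


def combine(*terms, operator="AND"):
    """Join several formatted terms with a Boolean operator."""
    return f" {operator} ".join(terms)


query = combine(
    term("Organism", "Listeria ivanovii"),
    term("Modification Date", "1999/03"),
)
print(query)
```

The resulting string can be pasted into the Entrez search box to restrict each term to its own index.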