Course Overview. Objectives


1 Current Topics in

2 Course Overview Objectives
Introduce the frontiers of genomic and epigenomic research, including new concepts in the related fields and new computational and experimental techniques.
Cultivate critical thinking skills.
Learn to present scientific ideas.

3 Course Overview Lecturer: Bing Ren, Ph.D. Co-Lecturer: Jin Gu, Ph.D. Weilong Guo (TA)

4 Course Overview Format
Lecture (1 hour)
Student presentation (1 hour)
Assignment: paper reading; term paper writing (due May 1st)

5 Course Overview Grades
Total 100 points: class participation (33 pts), paper presentation (33 pts), term paper (33 pts; discuss a recent publication, 600 words)
Grading criteria
Class participation: every student must speak at least once in each class.
Paper presentation: logical, clear and accurate.
Term paper: accurate, in-depth, critical and innovative.

6 Bing Ren University of California, San Diego Ludwig Institute for Cancer Research

7 Outline Human genome project Genome sequencing technologies Personal genomics

8 Why do we sequence genomes? Understand the biology Gene -> Phenotypes Diseases Many human diseases are due to genetic changes Industrial applications (domestic animals + agriculture) Microbes?

9 Whose genome has been sequenced?

10 Human Genome Project Completed in 2003 GOALS? identify all the approximately 20,000-25,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project.

11 Human Genome The normal human karyotype contains 22 pairs of autosomal chromosomes + 1 pair of sex chromosomes; 3,200 Mbp in 46 chromosomes. Giemsa staining following digestion of chromosomes with trypsin: dark regions = heterochromatic, AT-rich; light regions = euchromatic, GC-rich, gene-rich
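
The AT-rich versus GC-rich distinction above can be made concrete by computing GC content in fixed windows. This is a minimal sketch; the function names, the window size and the toy sequence are illustrative assumptions and do not come from the lecture.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int = 10) -> list:
    """GC fraction for consecutive, non-overlapping windows of the given size."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


# Hypothetical toy sequence: an AT-rich block followed by a GC-rich block.
toy = "ATATATTTAA" + "GCGCGGCCGC"
print(windowed_gc(toy))
```

On real chromosomes the windows would be far larger (tens to hundreds of kilobases), and the window size strongly affects how sharp the AT-rich and GC-rich domains appear.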

12 GCATCCATCTTGGGGCGTCCCAATTGCTGAGTAACAAATGAGACGC TGTGGCCAAACTCAGTCATAACTAATGACATTTCTAGACAAAGTGAC TTCAGATTTTCAAAGCGTACCCTGTTTACATCATTTTGCCAATTTCGC GTACTGCAACCGGCGGGCCACGCCCCCGTGAAAAGAAGGTTGTTT TCTCCACATTTCGGGGTTCTGGACGTTTCCCGGCTGCGGGGCGGG GGGAGTCTCCGGCGCACGCGGCCCCTTGGCCCCGCCCCCAGTCA TTCCCGGCCACTCGCGACCCGAGGCTGCCGCAGGGGGCGGGCTG AGCGCGTGCGAGGCGATTGGTTTGGGGCCAGAGTGGGCGAGGCG CGGAGGTCTGGCCTATAAAGTAGTCGCGGAGACGGGGTGCTGGTT TGCGTCGTAGTCTCCTGCAGCGTCTGGGGTTTCCGTTGCAGTCCTC GGAACCAGGACCTCGGCGTGGCCTAGCGAGTTATGGCGACGAAGG CCGTGTGCGTGCTGAAGGGCGACGGCCCAGTGCAGGGCATCATCA ATTTCGAGCAGAAGGCAAGGGCTGGGACGGAGGCTTGTTTGCGAG GCCGCTCCCACCCGCTCGTCCCCCCGCGCACCTTTGCTAGGAGCG GGTCGCCCGCCAGGCCTCGGGGCCGCCCTGGTCCAGCGCCCGGT CCCGGCCCGTGCCGCCCGGTCGGTGCCTTCGCCCCCAGCGGTGC GGTGCCCAAGTGCTGAGTCACCGGGCGGGCCCGGGCGCGGGGC GTGGGACCGAGGCCGCCGCGGGGCTGGGCCTGCGCGTGGCGGG AGCGCGGGGAGGGATTGCCGCGGGCCGGGGAGGGGCGGGGGCG GGCGTGCTGCCCTCTGTGGTCCTTGGGCCGCCGCCGCGGGTCTG TCGTGGTGCCTGGAGCGGCTGTGCTCGTCCCTTGCTTGGCCGTGT TCTC

13 Over 50% of the genome is repetitive sequences
(1) Transposon-derived repeats (~45% of genome)
(2) Probable retrotransposed mRNAs (pseudogenes)
(3) Simple sequences: (A)n, (CA)n, (CGG)n
(4) Segmental duplications (1-300 kbp copies from elsewhere in the genome)
(5) Tandemly repeated regions
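
Simple sequence repeats such as (A)n, (CA)n and (CGG)n can be located with a regular expression. The sketch below assumes an exact, uninterrupted tandem run and a hypothetical `min_copies` threshold; real repeat annotation tolerates mismatches and is considerably more involved.

```python
import re


def find_simple_repeats(seq, unit, min_copies=3):
    """Return (start, copies) for each run of at least min_copies tandem copies of unit."""
    # Builds a pattern such as (?:CA){3,} for unit="CA", min_copies=3.
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (m.start(), len(m.group()) // len(unit))
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical toy sequence containing a (CA)4 run starting at index 2.
print(find_simple_repeats("TTCACACACAGG", "CA"))
```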

14 Protein-coding genes
Genes: 1.06% of the genome codes for protein
At least four times that amount accounts for non-coding RNAs
Roughly 24,000 genes with heavy use of alternative splicing
100s of genes from bacteria (probable horizontal gene transfer)
Dozens of genes from transposable elements

15 Structure of a gene

16 Some estimates:
Median number of exons: 7
Median intron length: 1,023 bp
Median 3' UTR length: 400 bp
Median 5' UTR length: 240 bp
Median coding sequence length: 1,100 bp
Median genomic extent: 14 kb
Some extremes:
Dystrophin gene genomic extent: 2,400 kb
Titin gene coding sequence: 80,780 bp
Titin gene number of exons: 178
Titin gene longest exon: 17,106 bp

17 Number of protein-coding genes and cellular complexity [Figure: gene counts per organism: 19,000; 14,000; ~20,000-25,000; ~20,000-25,000; 6,000]

18 Alternative splicing generates multiple mRNAs and proteins from one protein-encoding gene
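
To see how one gene can yield several mRNAs, the sketch below enumerates the isoforms produced when some exons are optional (cassette exons). Exon skipping is only one mode of alternative splicing, and the exon names and the `cassette_isoforms` helper are hypothetical.

```python
from itertools import product


def cassette_isoforms(exons, optional):
    """List every exon combination when the exons named in `optional` may be skipped.

    exons: exon names in genomic order; optional: set of skippable exon names.
    """
    # Each optional exon contributes two choices: included, or skipped (None).
    choices = [(e, None) if e in optional else (e,) for e in exons]
    return [
        tuple(e for e in combo if e is not None)
        for combo in product(*choices)
    ]


# Hypothetical four-exon gene with two cassette exons: 2**2 = 4 isoforms.
for isoform in cassette_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
    print("-".join(isoform))
```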

19 Non-protein-coding genes
tRNA genes: copies (and more than 300 tRNA-related pseudogenes)
Small nuclear RNAs (snRNAs, splicing machinery):
- many dispersed copies (at least 44 for U6, 16 for U1)
Small nucleolar RNAs (snoRNAs, ribosomal RNA maturation):
- 97 known, dispersed in genome
5S ribosomal RNA genes:
- encoded from tandemly copied loci, largest is ~300 copies on chr 1
- thousands of 5S RNA pseudogenes throughout genome
45S ribosomal RNA precursor (5.8S, 18S, 28S RNAs):
- encoded from loci with tandem copies on chr 13, 14, 15, 21, 22
- tandemly repeated DNAs not completely represented in the database
Other non-coding RNAs (ncRNAs):
- Xist, involved in X chromosome inactivation in females
- RNAs in the vault complex
- many microRNAs
- many long non-coding RNAs (conserved)


21 Sequencing, and more sequencing Adapted from genome.gov

22 Sanger sequencing The Sanger sequencing reaction. Single-stranded DNA is amplified in the presence of fluorescently labeled ddNTPs, which terminate the reaction and label all the DNA fragments produced. The fragments are then separated by polyacrylamide gel electrophoresis, and the sequence is read using a laser beam and a computer.

23 An electropherogram of a finished sequencing reaction. As the fragments from the sequencing reaction are resolved via electrophoresis, a laser reads the fluorescence of each fragment (blue, green, red or yellow) and compiles the data into an image. Each colour, or fluorescence intensity, represents a different nucleotide (e.g. blue for C) and reveals where that nucleotide is in the sequence.
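
The chain-termination and size-separation logic described on the two slides above can be sketched in a few lines. This is an idealized model under stated assumptions: every possible terminated fragment is present exactly once, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_fragments(template):
    """Return (length, labelled terminal base) for every chain-terminated product.

    The template is written 5'->3'; the newly synthesized strand is its
    reverse complement, and a ddNTP can stop extension after any position.
    """
    new_strand = "".join(COMPLEMENT[b] for b in reversed(template.upper()))
    return [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]


def read_by_size(fragments):
    """Electrophoresis orders fragments by length; read the labels short to long."""
    return "".join(base for _, base in sorted(fragments))


frags = terminated_fragments("ATGC")
print(read_by_size(frags))  # the synthesized strand, read 5'->3'
```

Sorting by fragment length stands in for electrophoresis, and the terminal base stands in for the colour detected by the laser.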

24 Two Strategies for genome sequencing

25 Next Generation Sequencing technologies
Technology | Cost | Throughput | Relative Efficiency
Capillary (Sanger Sequencing) | $1000/1 Mbp | ~200 bp/hr | 1
Next Gen Sequencing (Illumina HiSeq2000) | $0.1/1 Mbp | ~1x10^9 bp/hr | 10^11
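
The slide does not say how relative efficiency is defined. One plausible reading, assumed here, is the fold reduction in cost multiplied by the fold gain in throughput; with the figures above this gives a value of the same order of magnitude as the one quoted. The dictionary keys and function name are illustrative.

```python
# Figures taken from the comparison above.
sanger = {"cost_per_mbp": 1000.0, "bp_per_hour": 200.0}
hiseq = {"cost_per_mbp": 0.1, "bp_per_hour": 1e9}


def fold_improvement(old, new):
    """Return (cost fold, throughput fold, combined fold) of `new` over `old`.

    Assumption: combined efficiency = cost fold x throughput fold.
    """
    cost_fold = old["cost_per_mbp"] / new["cost_per_mbp"]
    throughput_fold = new["bp_per_hour"] / old["bp_per_hour"]
    return cost_fold, throughput_fold, cost_fold * throughput_fold


cost_fold, throughput_fold, combined = fold_improvement(sanger, hiseq)
print(f"cost: {cost_fold:.0e}x, throughput: {throughput_fold:.0e}x, combined: {combined:.0e}x")
```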

26 Time and costs to sequence a human genome
Genome Sequencing Center: five centers, hundreds of sequencers, >1000 people, >1 year, billions of $$
1 HiSeq2000: <10 days, 1 instrument, one technician, <$10,000

27 Current Platforms of High Throughput Sequencers all Require Single Molecule Amplification
DNA to be sequenced: 5'-P1 ... P2-3'
Sequencers currently in operation:
1. Roche-454: long reads (>0.4 kb) but modest throughput
2. Illumina-GA: fast, ultra high throughput (200 Gb/run)
3. AB-SOLiD: ultra high throughput (1 billion reads/run)


29 Sequencing by synthesis (SBS): 454 pyrosequencing
Metzker, Nat. Rev Genetics; Margulies et al. 2005, Nature
bp reads, paired-end possible
~1 million reads per run
~10 hours for each run
1 billion bp output each run
Used to sequence Jim Watson's genome in 2007

30 Pyrosequencing
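
In pyrosequencing, nucleotides are flowed over the template one type at a time in a fixed order, and the light signal from each flow scales with the number of bases incorporated, so a homopolymer run is read in a single flow. The sketch below simulates ideal, noise-free signals; the flow order `TACG`, the `max_flows` cap and the function names are assumptions made for illustration.

```python
def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate ideal signals for the strand being synthesized.

    Returns a list of (nucleotide, bases incorporated) pairs, one per flow.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        nuc = flow_order[flow % len(flow_order)]
        run = 0
        # A whole homopolymer run is incorporated during a single flow.
        while pos < len(read) and read[pos] == nuc:
            run += 1
            pos += 1
        signals.append((nuc, run))
        flow += 1
    return signals


def decode(signals):
    """Convert flow signals back into a base sequence."""
    return "".join(nuc * count for nuc, count in signals)


signals = flowgram("TTAG")
print(signals)
print(decode(signals))
```

Because signal strength must be converted into a run length, long homopolymers are the characteristic error source of this chemistry.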

31 Sequencing by ligation (SBL): SOLiD 50bp paired-end reads ~1 billion reads per run 5-10 days run time 100 billion bp output each run

32 Sequencing by synthesis: Genome Analyzer or HiSeq 2000 (Illumina) Shendure & Lee, Nat. Biotech bp paired-end reads ~1 billion reads per run 4-8 days run time 200 billion bp output each run
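
The per-run outputs quoted for each platform translate into fold coverage by dividing the bases sequenced by the genome size. The sketch below uses the 3,200 Mbp haploid figure from the Human Genome slide; the function names are hypothetical, and the calculation ignores duplicate, low-quality and unmappable reads.

```python
import math

HAPLOID_GENOME_BP = 3.2e9  # 3,200 Mbp, as quoted earlier in the lecture


def fold_coverage(bases_sequenced, genome_size_bp=HAPLOID_GENOME_BP):
    """Average number of times each genome position is covered."""
    return bases_sequenced / genome_size_bp


def runs_needed(target_coverage, bases_per_run, genome_size_bp=HAPLOID_GENOME_BP):
    """Whole instrument runs required to reach the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / bases_per_run)


# One run at 200 billion bp output.
print(fold_coverage(200e9))
# Runs at 1 billion bp each needed for 7.4x coverage.
print(runs_needed(7.4, 1e9))
```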

33 Cluster Formation on Solexa Flowcell

34 Sequencing by Synthesis
[Figure: template and growing strand sequences]
Cycle 1: add sequencing reagents; first base incorporated; remove unincorporated bases; detect signal
Cycle 2-n: add sequencing reagents and repeat

35 Reversible Terminator Chemistry
[Figure: nucleotide triphosphate carrying a fluor on a cleavable linker and a 3' block]
Incorporation -> Detection -> Deblock; fluor removal (free 3' OH end) -> Next cycle
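
The incorporation, detection and deblock cycle can be sketched as a loop in which the 3' block limits extension to exactly one base per cycle, so read length equals the number of cycles. This is an idealized model with no phasing or signal decay; the function name and the convention that the template is given in the order it is read are assumptions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, n_cycles):
    """Return the bases called over n_cycles of reversible-terminator chemistry.

    template: template bases in the order the polymerase encounters them.
    """
    calls = []
    for cycle in range(min(n_cycles, len(template))):
        # Incorporation: the 3' block allows only one labelled base per cycle.
        base = COMPLEMENT[template[cycle].upper()]
        # Detection: the fluor identifies the incorporated base.
        calls.append(base)
        # Deblock: fluor and 3' block are cleaved, freeing the 3' OH for the next cycle.
    return "".join(calls)


# A homopolymer is read one base per cycle, unlike in pyrosequencing.
print(sequence_by_synthesis("TTTTAC", 4))
```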

36 Sequential Base Calling T G C T A C G A T T T T T T T T G T

37 Third Generation of High Throughput Sequencing Technologies
System | Key features
1. Helicos | Single molecule sequencing; ultra high density, short reads; sequencing by synthesis
2. PacificBio | Single molecule sequencing; intermediate density, long reads; sequencing by immobilized polymerase
3. Nanopore | Single molecule sequencing; intermediate density, long reads; sequencing by immobilized nanopore
4. Ion Torrent | Single molecule sequencing with pH; medium reads

38 Sequencing using pH meters (Ion Torrent) Low cost, desktop instrument 100bp or longer single-end reads 2-5 million reads per run 1 hour run time million base each run

39 Single-molecule real time sequencing Pacific Bio SMRT technology Long reads: up to 10kb fast: 20 minutes run time million reads per run 20 Gigabase per run


41 Era of personal genomics

42 A wealth of opportunities for biologists and the medical community

43 Sept 2007: HuRef

44 Vital statistics Sanger sequencing Cost $70 million De novo assembly 7.5x coverage 2.81 billion bp 4.1 million variants 3.2 million SNPs 851,000 insertion/deletions 99.5% similarity between two diploid chromosomes C. Venter

45 Establishing genotype-phenotype relations
Alleles absent: Huntington's, Cystic Fibrosis
Alleles present in various risk combinations: heart disease, Alzheimer's
Traits and behaviors: lactose intolerance, novelty seeking, tobacco addiction

46 Variants detected in Venter genome

47 March 2008: 2nd individual human genome sequenced The association of genetic variation with disease and drug response, and improvements in nucleic acid technologies, have given great optimism for the impact of 'genomic medicine'. However, the formidable size of the diploid human genome, approximately 6 gigabases, has prevented the routine application of sequencing methods to deciphering complete individual human genomes. To realize the full potential of genomics for human health, this limitation must be overcome. Here we report the DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy in two months using massively parallel sequencing in picolitre-size reaction vessels. This sequence was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods. Comparison of the sequence to the reference genome led to the identification of 3.3 million single nucleotide polymorphisms, of which 10,654 cause amino-acid substitution within the coding sequence. As a result, we further demonstrate the acquisition of novel human sequence, including novel genes not previously identified by traditional genomic sequencing. This is the first genome sequenced by next-generation technologies.

48 Vital statistics J. Watson 454 pyrosequencing Cost $1 million Align to reference genome 7.4x coverage 3.32 million SNPs 220,000 insertions/deletions of various lengths 10,569 non-synonymous SNPs affecting thousands of genes, some potentially damaging

49 How different are the two genomes? 1.7 million SNPs 1.68 million SNPs 1.7 million SNPs 7,648 different proteins
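
Comparing two personal genomes at the SNP level amounts to set operations on their variant calls. The sketch below represents each variant as a hypothetical `(chromosome, position, alternate base)` tuple and uses made-up toy data; it does not reproduce the counts on the slide.

```python
def compare_variant_sets(a, b):
    """Split two collections of variants into private and shared subsets."""
    a, b = set(a), set(b)
    return {"only_a": a - b, "shared": a & b, "only_b": b - a}


# Hypothetical toy calls; each variant is (chromosome, position, alternate base).
genome_a = [("chr1", 1005, "T"), ("chr1", 2400, "G"), ("chr2", 310, "A")]
genome_b = [("chr1", 2400, "G"), ("chr2", 310, "C"), ("chr3", 77, "T")]

result = compare_variant_sets(genome_a, genome_b)
for name, variants in result.items():
    print(name, len(variants))
```

Two calls at the same position with different alternate bases count as different variants here; whether that is the right choice depends on the question being asked.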

50 October 2008 Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

51 Aug Genome = 4 runs on the Stanford HeliScope at $48,000

52 Sept 2009
