Chapter 8 Genomics

1 Chapter 8 Genomics

2 The study of genomes in their entirety

3 Structural genomics: obtaining the genome sequence and locating genes of interest within it.
Comparative genomics: comparing the genomes of closely and distantly related species for evolutionary insight.
Functional genomics: delineating the networks of interacting genes active during a developmental process.

5 Section 1 Organization of the Eukaryotic Genome
1. The C-value paradox and evolutionary complexity
C value: the amount of DNA contained in the haploid genome of a species.

6 [Figure: salamander]

7 C-value paradox
- The increases in total DNA content among organisms are dramatic.
- DNA content in closely related organisms can vary 10-fold or more.
- Eukaryotic genomes contain large amounts of non-coding sequence.

8 Renaturation kinetics reveals repetitive sequences

9 Highly repetitive sequences: 5–300 bp, 10^5 copies
Middle-repetitive sequences: 10–1000 copies
Unique (single-copy) sequences
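
Renaturation kinetics separates these classes because a sequence present in many copies finds a complementary strand sooner than a single-copy sequence. A minimal sketch, assuming ideal second-order kinetics, C/C0 = 1 / (1 + k * C0t); the rate constants below are hypothetical values chosen only to illustrate the ordering, not measurements from the slides.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single-stranded at a given C0t value.

    Assumes ideal second-order renaturation: C/C0 = 1 / (1 + k * C0t).
    """
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """C0t value at which half of the DNA has renatured (C/C0 = 0.5)."""
    return 1.0 / k


# Hypothetical rate constants: more copies -> faster renaturation -> larger k.
components = {
    "highly repetitive": 1e3,
    "middle repetitive": 1.0,
    "unique": 1e-3,
}

for name, k in components.items():
    print(f"{name}: C0t(1/2) = {cot_half(k):g}")
```

On a C0t curve, the highly repetitive fraction therefore renatures at the lowest C0t values and the unique fraction at the highest.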

10 2. Organization of the human genome

11 Gene family: a set of genes in one genome, all descended from the same ancestral gene.
- The members of a family may lie at the same locus (a gene cluster) or at different loci.
- The genes in a family may or may not be identical.
[Figure labels: Chr. 16, Chr. 11]

12 Pseudogene: an inactive gene derived from an ancestral active gene.

13 Dispersed repetitive sequences result from transposition
- Short interspersed nuclear elements (SINEs): < 500 bp, > 500,000 copies; e.g. the Alu family
- Long interspersed nuclear elements (LINEs): e.g. L1, 6400 bp, 100,000 copies

14 Clustered (tandem) repetitive sequences
- Satellite DNA: repeat unit 100–500 bp, arrays of 100–5000 kb
- Minisatellite DNA: repeat unit 11–60 bp, arrays of 100 bp–20 kb
- Microsatellite DNA: repeat unit 1–5 bp, e.g. (CA)n with n = 5–50; accounts for 0.5% of the genome
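
To make the (CA)n microsatellite concrete, the sketch below scans a sequence for runs of at least five CA units with a regular expression. The function name, the threshold parameter, and the example sequence are illustrative assumptions, not part of the slides.

```python
import re


def find_ca_repeats(seq, min_units=5):
    """Return (start_index, unit_count) for each run of >= min_units CA repeats.

    start_index is 0-based; only perfect, uninterrupted CA runs are reported.
    """
    pattern = re.compile(r"(?:CA){%d,}" % min_units)
    return [(m.start(), len(m.group()) // 2) for m in pattern.finditer(seq.upper())]


# Hypothetical sequence: one run of 7 CA units and one run of 3 (too short).
example = "GGT" + "CA" * 7 + "TTG" + "CA" * 3 + "G"
print(find_ca_repeats(example))
```

Real microsatellite finders also tolerate imperfect repeats and other motifs; this version handles only the perfect CA case.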

15 Section 2 Structural Genomics
The ultimate goal of structural genomics is to determine the ordered nucleotide sequences of the entire genomes of organisms.

16 1985: bacteriophage λ
1995: the first genome of a living organism (Haemophilus influenzae)
1996: the first genome of a eukaryotic organism (yeast)
2000: the first draft of the human genome

17 I. Strategies for Genome Sequencing
- Whole-genome shotgun
- Clone-by-clone approach (map-based sequencing)

18 [Figure: shotgun strategy and clone-by-clone approach. Labels: ~400 kb YAC or BAC; genomic DNA; cosmid ~40 kb; 200 kb, 10 kb, 2 kb; paired-end reads; ~4 kb plasmid; assembly]

19 How to determine the minimum tiling path: molecular markers
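
One way to picture a minimum tiling path is as an interval-covering problem: once markers have placed each clone on the map, pick the fewest overlapping clones that span the region. The sketch below is a standard greedy interval cover under that assumption; the clone names and coordinates are hypothetical, and in practice overlaps are inferred from shared molecular markers rather than known coordinates.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick a minimal set of clones covering [region_start, region_end].

    clones: list of (name, start, end) tuples with positions on the map.
    Greedy rule: among clones starting at or before the covered point,
    take the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical clones (name, start, end) on a 300-unit region.
clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
          ("D", 190, 300), ("E", 150, 260)]
print(minimum_tiling_path(clones, 0, 300))
```

Clones B and E overlap the chosen clones but add no new coverage, so they are left out of the path.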

20 How to assemble DNA fragments into a whole-genome sequence
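
The core idea of assembly, merging reads that overlap at their ends, can be sketched with a toy greedy assembler. This is a simplified illustration rather than how production assemblers work: it assumes error-free reads from one strand and no repeats, and the reads and function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest suffix-prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from the sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Repeated sequences break this greedy rule because two unrelated reads can share a long overlap, which is one reason paired-end reads and physical maps are used to order contigs.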

21 II. Mapping the human genome
The Human Genome Project used the map-based approach to sequence a complex genome.
Often, one of the first steps in characterizing a genome is to prepare genetic and physical maps of its chromosomes.

22 1. Genetic map
A map in which mutant alleles or DNA markers are assigned relative positions along a chromosome on the basis of the recombination frequencies between them (that is, a map built by linkage analysis that shows the positions of genes and other DNA markers on the chromosomes).
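
Since genetic map positions come from recombination frequencies, a short calculation shows the conversion. A minimal sketch, assuming the usual convention that 1% recombination equals 1 centimorgan (cM) for closely linked markers, with Haldane's mapping function as one common correction for multiple crossovers; the progeny counts are hypothetical.

```python
import math


def recombination_frequency(recombinants, total):
    """Fraction of progeny that are recombinant for two markers."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return recombinants / total


def map_distance_cm(rf):
    """Naive map distance: 1% recombination = 1 cM (reasonable for small rf)."""
    return 100.0 * rf


def haldane_distance_cm(rf):
    """Haldane's mapping function: d = -50 * ln(1 - 2 * rf), in cM."""
    if not 0 <= rf < 0.5:
        raise ValueError("rf must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * rf)


# Hypothetical cross: 18 recombinant progeny out of 200 scored.
rf = recombination_frequency(18, 200)
print(rf, map_distance_cm(rf), haldane_distance_cm(rf))
```

The two estimates agree closely for tightly linked markers and diverge as the recombination frequency approaches 0.5.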

23 Tomato genetic map (1952)

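24 Genetic markers
Any heritable trait that can be used to trace the transmission of a chromosome, a chromosome segment, or a locus through a pedigree.
A marker has two basic features, heritability and identifiability, so any gene mutation that produces a distinguishable phenotype can serve as a genetic marker.
Alleles were the first landmarks used; DNA markers are the more recent landmarks.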

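25 DNA markers
- RFLP: restriction fragment length polymorphism
- STRP: simple tandem repeat polymorphism
  - VNTR or minisatellite, 11–60 bp
  - SSR or microsatellite, 2–9 bp
- SNP: single-nucleotide polymorphism
What is the underlying nature of these markers?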

26 VNTR DNA fingerprint
Run DNA samples on a gel. Perform Southern blotting. Hybridize with a probe containing a minisatellite sequence.
[Figure label: additional minisatellite loci]

27 SSR

29 M is probably linked in the cis configuration to the disease allele P

30 SNP (single-nucleotide polymorphism)
A single base pair in the DNA that varies in a population.
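
The definition above can be illustrated by scanning a set of aligned sequences for positions where more than one base occurs. A minimal sketch, assuming the haplotypes are already aligned and of equal length; the sequences and the function name are hypothetical.

```python
from collections import Counter


def snp_sites(haplotypes):
    """Return {position: {base: count}} for every variable site.

    Positions are 1-based. All sequences must be aligned and equally long.
    """
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")
    sites = {}
    for pos in range(length):
        counts = Counter(h[pos] for h in haplotypes)
        if len(counts) > 1:  # more than one allele observed at this site
            sites[pos + 1] = dict(counts)
    return sites


# Hypothetical sample of four haplotypes from a population.
haplotypes = ["ATGCCGTA", "ATGTCGTA", "ATGCCGTA", "ATGCCGAA"]
print(snp_sites(haplotypes))
```

In this toy sample, positions 4 and 7 are polymorphic; real SNP calling also has to separate true variants from sequencing errors.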

31 A genetic map of human chromosome 1

32 2. Physical map
A diagram of a chromosome or DNA molecule with distances given in base pairs, kilobases, or megabases.
- Restriction map
- Contig map
- Sequence-tagged site (STS) map

34 STS (sequence-tagged site)
A DNA sequence, present once per haploid genome, that can be amplified by PCR.

35 A complete physical map of the human Y chromosome

39 Anchor markers
Genetic and physical maps may differ in relative distances and even in the positions of genes on a chromosome.

40 3. Sequencing and Assembly
When is a genome sequence complete?
- Draft quality: the general outline is there, but there are typographical errors, grammatical errors, gaps, sections that need rearranging, and so forth.
- Finished quality: a very low rate of typographical errors; some sections are missing, but everything that is currently possible has been done to fill them in.
- Truly complete: no typographical errors; every base pair absolutely correct from telomere to telomere.

41 III. Bioinformatics: Meaning from Genomic Sequence
Annotation: identification of all of the functional elements of the genome.

42 Compare the newly sequenced genomic DNA with the known sequences already stored in various databases: BLAST (Basic Local Alignment Search Tool).
For unknown sequences:
- ORF searches to detect genes
- Direct evidence from cDNA sequences
- Predictions of binding sites
- Predictions based on codon bias

43 BLAST

44 Searching for ORFs in genomic sequences
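
An ORF search can be sketched as a scan of all six reading frames for an ATG followed in frame by a stop codon. This is a simplified illustration: it ignores introns, so it suits prokaryotic-style or cDNA sequence better than raw eukaryotic genomic DNA, and the function names, the minimum-length parameter, and the test sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Return (start, end) of ORFs in the three frames of one strand.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon;
    end is exclusive and includes the stop codon. min_codons counts codons
    before the stop.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq, min_codons=3):
    """Scan both strands; coordinates on '-' refer to the reverse complement."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    results = [("+", s, e, seq[s:e]) for s, e in orfs_in_strand(seq, min_codons)]
    results += [("-", s, e, rc[s:e]) for s, e in orfs_in_strand(rc, min_codons)]
    return results


# Hypothetical sequence containing one short ORF on the forward strand.
print(find_orfs("CCATGGCTGAACGTTAAGG"))
```

A realistic search would use a much larger `min_codons` (often around 100) so that short chance ORFs are not reported as genes.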

45 Predictions of mRNA and polypeptide structure from genomic DNA sequence depend on integrating information from cDNA sequences, docking-site predictions, polypeptide similarities, and codon bias.

46 Major Features of the Human Genome
- The human genome contains 3.1 billion nucleotides, but protein-coding sequences make up only about 2 percent of the genome.
- The genome sequence is ~99.9 percent similar in individuals of all nationalities. SNPs and copy number variations (CNVs) account for genome diversity from person to person.
- The genome is dynamic. At least 50 percent of the genome is derived from transposable elements, such as LINE and Alu sequences, and other repetitive DNA sequences.
- The human genome contains approximately 20,000 protein-coding genes, far fewer than the originally predicted number of 80,000–100,000 genes.
- The average size of a human gene is ~25 kb, including gene regulatory regions, introns, and exons. On average, mRNAs produced by human genes are ~3000 nt long.
- Many human genes produce more than one protein through alternative splicing, thus enabling human cells to produce a much larger number of proteins (perhaps as many as 200,000) from only ~20,000 genes.
- More than 50 percent of human genes show a high degree of sequence similarity to genes in other organisms; however, more than 40 percent of the genes identified have no known molecular function.
- Genes are not uniformly distributed on the 24 human chromosomes. Gene-rich clusters are separated by gene-poor deserts that account for 20 percent of the genome. These deserts correlate with G bands seen in stained chromosomes. Chromosome 19 has the highest gene density, and chromosome 13 and the Y chromosome have the lowest gene densities.
- Chromosome 1 contains the largest number of genes, and the Y chromosome contains the smallest number.
- Human genes are larger and contain more and larger introns than genes in the genomes of invertebrates, such as Drosophila. The largest known human gene encodes dystrophin, a muscle protein. This gene, associated in mutant form with muscular dystrophy, is 2.5 Mb in length, larger than many bacterial chromosomes. Most of this gene is composed of introns.
- The number of introns in human genes ranges from 0 (in histone genes) to 234 (in the gene for titin, which encodes a muscle protein).
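
A few of the figures above can be combined in a back-of-the-envelope calculation. The sketch below uses only the rounded numbers quoted on this slide, so the results are rough orders of magnitude rather than precise genome statistics; the variable names are illustrative.

```python
# Rounded figures quoted on the slide.
GENOME_BP = 3_100_000_000      # total nucleotides
CODING_PERCENT = 2             # protein-coding share of the genome
GENE_COUNT = 20_000            # protein-coding genes
MEAN_GENE_BP = 25_000          # average gene, incl. regulatory regions and introns
MAX_PROTEINS = 200_000         # upper estimate of distinct proteins

coding_bp = GENOME_BP * CODING_PERCENT // 100      # bases that encode protein
genic_bp = GENE_COUNT * MEAN_GENE_BP               # bases spanned by genes
genic_percent = 100 * genic_bp / GENOME_BP         # share of genome within genes
proteins_per_gene = MAX_PROTEINS / GENE_COUNT      # average via alternative splicing

print(f"coding sequence: {coding_bp / 1e6:.0f} Mb")
print(f"sequence spanned by genes: {genic_bp / 1e6:.0f} Mb ({genic_percent:.1f}% of genome)")
print(f"proteins per gene (upper estimate): {proteins_per_gene:.0f}")
```

The gap between the share of the genome spanned by genes and the share that actually encodes protein reflects how much of each gene is intron and regulatory sequence.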

47 After the HGP: what is next?
- Personal Genome Project (PGP)
- Exome sequencing
- Encyclopedia of DNA Elements (ENCODE) Project
- Stone-age genomics
  - 2005: 13 million bp from a 27,000-year-old woolly mammoth, showing 98.5% identity with African elephants
  - 2010: a draft sequence of the Neandertal genome
- Human Microbiome Project (2008)
- Genome 10K plan: sequence 10,000 vertebrate genomes
- Comparative genomics

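48 Comparative Genomics
Example 1: Human ancestors interbred with Neandertals in Europe, which provides strong evidence for the hypothesis that the ancestors of modern humans interbred with Neandertals long after leaving Africa.
Example 2: Natural "transgenes" are more common than you might think. Researchers at the University of Cambridge analyzed the genomes and transcriptomes of 26 animal species and found that these genomes contain up to hundreds of "foreign" genes, most of them from bacteria and protists, with others from archaea, fungi, and plants. In the human genome, as many as 145 genes are the result of horizontal gene transfer, including the hyaluronan synthase genes, the obesity gene FTO, and the ABO blood group gene.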

49 Section 3 Functional Genomics
The global approach to studying the function, expression, and interaction of gene products is termed functional genomics.
Omics:
- Transcriptome (microarray, RNA-seq)
- Proteome (2-DE/MS, LC-MS/MS)
- Interactome
- Metabolome

50 1. Predicting Function from Sequence
Homology searches
- Homologous genes found in different species that evolved from the same gene in a common ancestor are called orthologs.
- Homologous genes in the same organism (arising by duplication of a single gene in the evolutionary past) are called paralogs.
- Homologous genes (both orthologs and paralogs) often have the same or related functions.

52 [Figure: a phylogenetic tree of human hemoglobin genes; label: myoglobin]

53 Protein domains
Complex proteins often contain regions that have specific shapes or functions, called protein domains. Many protein domains have been characterized.
If the gene sequence encodes one or more domains whose functions have been previously determined, the function of the domain can provide important information about a possible function of the new gene.

54 2. Gene Expression Analysis
Many important clues about gene function come from knowing when and where the genes are expressed.
Microarrays or RNA-seq allow the expression of thousands of genes to be monitored simultaneously.
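
As a rough illustration of how expression is compared between two conditions, the sketch below normalizes RNA-seq read counts to counts per million (CPM) and reports a log2 fold change per gene. It is a deliberately naive calculation, not a substitute for a statistical differential-expression method, and the gene names, counts, and pseudocount are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) on CPM values; both dicts must share the same genes.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical read counts; the treated library is sequenced twice as deeply.
control = {"geneA": 100, "geneB": 400, "geneC": 500}
treated = {"geneA": 400, "geneB": 400, "geneC": 1200}

for gene, lfc in log2_fold_change(control, treated).items():
    print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

Normalizing first matters here: geneB has the same raw count in both samples, yet its share of the library halves once sequencing depth is taken into account.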

56 3. Analysis of the Interactome
- The two-hybrid test is used to study the protein–protein interactome.
- ChIP is used to study the protein–DNA interactome.

57 4. Mutagenesis Analysis
- Targeted mutagenesis: gene knockout, RNAi, CRISPR-Cas9
- Genome-wide random mutagenesis

58 Genome-wide Mutagenesis
The mutations are induced by exposing the organisms to radiation, a chemical mutagen, or transposable elements.
Insertional mutagenesis:
- Arabidopsis transformation by T-DNA
- Transposon tagging

59 A PCR screening protocol enables rapid screening of a large collection of transformed plant genomes for a rare one containing a disruption of the gene of interest.

60 The figures and tables are cited from:
- Genetics: From Genes to Genomes, Leland Hartwell, McGraw-Hill Companies, Inc.
- Concepts of Genetics, William S. Klug, Prentice Hall, Inc.
- Introduction to Genetic Analysis, Anthony J.F. Griffiths, W. H. Freeman, Inc.
- Principles of Genetics, D. Peter Snustad, John Wiley & Sons, Inc.
- Genetics: A Conceptual Approach, Benjamin A. Pierce, W. H. Freeman

61 Major findings of the ENCODE project
- The majority, ~80 percent, of the human genome is considered functional. This is partly because large segments of the genome are transcribed into RNA. Most of these RNAs do not encode proteins. These various RNAs include tRNAs, rRNAs, and miRNAs. For example, at least 13,000 sequences specify long noncoding RNAs (lncRNAs). Other reports suggest there may be over 17,000 lncRNAs. It may turn out that noncoding RNA sequences outnumber protein-coding genes.
- The functional sequences also include gene-regulatory regions: ~70,000 promoter regions and nearly 400,000 enhancer regions.
- There are 20,687 protein-coding genes in the human genome.
- A total of 11,224 sequences are characterized as pseudogenes, previously thought to be inactive in all individuals. Some of these are inactive in most individuals but occasionally active in certain cell types of some individuals, which may eventually warrant their reclassification as active, transcribed genes and not pseudogenes.
- SNPs associated with disease are enriched within noncoding functional elements of the genome, often residing near protein-coding genes.