Computational Biology I LSM5191 (2003/4)


1 Computational Biology I LSM5191 (2003/4) Aylwin Ng, D.Phil. Lecture Notes: Features of the Human Genome

2 Reading List
International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409:860.
Venter et al. (2001). The sequence of the human genome. Science 291:1304.
Aravind et al. (2001). Apoptotic molecular machinery: vastly increased complexity in vertebrates revealed by genome comparisons. Science 291:
Sachidanandam et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:
Li et al. (2001). Evolutionary analyses of the human genome. Nature 409:
Cheung et al. (2001). Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409:953-8.

3 The Human Genome Project
Begun in 1990, funded by governments and charities across the world.
In 1998, a second human genome project was set up by a private company, Celera Genomics.
Both projects completed a draft of the human genome sequence in 2001, and the results were published in:
  International Human Genome Sequencing Consortium (IHGSC), 2001, Initial sequencing and analysis of the human genome. Nature 409:860.
  Venter et al., 2001, The sequence of the human genome. Science 291:1304.

4 Approaches to Genome Sequencing
The genome is broken into manageable segments, a few hundred kb or Mb in size. These are then sequenced by the shotgun method.
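
To make the shotgun idea concrete, here is a minimal sketch of greedy overlap assembly on toy reads. It is an illustration only: the function names (`overlap`, `greedy_assemble`), the `min_len` threshold and the example reads are assumptions made for this sketch, and it is not the assembly method used by either genome project.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the rest stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

In a sketch like this, reads that come from repeated regions overlap equally well with several partners, so the greedy choice can join the wrong pieces.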

5 Problems with Shotgun approach

6 The Human Genome Project
84% of the total genome sequenced.
98% of the euchromatic (gene-rich) portion of the genome sequenced (as at April 2003), with 99.99% accuracy, i.e. less than 1 error in 10,000 bases.
Each base has been sequenced 8-10 times on average, and each gap is smaller than 150 kb.
Total size of the euchromatic portion estimated at 3.2 Gb.

7 Features of the Human Genome
It is important to remember that there are in fact many human genome sequences (not just THE human genome sequence), because every individual has their own version.
Differences or variations between individual genomes are due to Single Nucleotide Polymorphisms (SNPs).
SNPs are positions in the genome where some individuals have one nucleotide and others have a different nucleotide.
Over 1.4 million SNPs have been identified (i.e. about 1 SNP every 2 kb of sequence).
Many SNPs have no effect on the function of the genome, but 60,000 SNPs are found within genes. These SNPs can affect genes or their activities, and such variations confer our individual biological characteristics, e.g. responses to pharmaceutical drugs.
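
The "1 SNP every 2 kb" figure can be checked with a short calculation. This is a rough sketch that assumes a genome size of 3200 Mb (the figure used elsewhere in these notes) and the 1.42 million SNPs quoted in the Sachidanandam et al. title from the reading list; the variable and function names are made up for the example.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumed: 3200 Mb, as used elsewhere in these notes
SNP_COUNT = 1_420_000           # assumed: 1.42 million, from the reading list title
GENIC_SNPS = 60_000             # SNPs found within genes, from the slide


def average_spacing_kb(genome_bp, n_sites):
    """Average distance between sites, in kb, if they were spread evenly."""
    return genome_bp / n_sites / 1000


spacing_kb = average_spacing_kb(GENOME_SIZE_BP, SNP_COUNT)
genic_fraction = GENIC_SNPS / SNP_COUNT

print(f"Average spacing: {spacing_kb:.2f} kb per SNP")
print(f"Share of SNPs within genes: {genic_fraction:.1%}")
```

Under these assumptions the spacing comes out slightly above 2 kb, and only a few percent of the catalogued SNPs fall within genes.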

8 Other features of the Human Genome
Human genome: 3200 Mb
  Genes & gene-related sequences: 1200 Mb
    Genes: 48 Mb
    Gene-related sequences: 1152 Mb
      Pseudogenes
      Gene fragments
      Introns & UTRs
  Intergenic DNA: 2000 Mb
    Interspersed repeats: 1400 Mb (44%)
      LINEs: 640 Mb
      SINEs: 420 Mb
      LTR elements: 250 Mb
      DNA transposons: 90 Mb
    Other intergenic regions: 600 Mb
      Microsatellites: 90 Mb
      Various: 510 Mb
Adapted from Brown T.A., Genomes 2, 2002, Wiley-Liss; IHGSC 2001, Nature 409:860; Venter J.C. et al., 2001, Science 291:1304.
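
As a consistency check on the breakdown, the sketch below stores the figures as a nested dictionary and confirms that each group adds up to its stated total. The dictionary layout and the helper names (`total_mb`, `find_mismatches`) are assumptions for illustration; the numbers are taken from the slide.

```python
# Sizes in Mb, copied from the slide. Leaves are sizes, dicts are groups.
composition_mb = {
    "Genes & gene-related sequences": {
        "Genes": 48,
        "Gene-related sequences": 1152,
    },
    "Intergenic DNA": {
        "Interspersed repeats": {
            "LINEs": 640,
            "SINEs": 420,
            "LTR elements": 250,
            "DNA transposons": 90,
        },
        "Other intergenic regions": {
            "Microsatellites": 90,
            "Various": 510,
        },
    },
}

# Group totals as stated on the slide.
stated_totals_mb = {
    "Genes & gene-related sequences": 1200,
    "Intergenic DNA": 2000,
    "Interspersed repeats": 1400,
    "Other intergenic regions": 600,
}


def total_mb(node):
    """Sum all leaf sizes below a node."""
    if isinstance(node, dict):
        return sum(total_mb(child) for child in node.values())
    return node


def find_mismatches(node, stated):
    """Return names of groups whose parts do not add up to the stated total."""
    mismatches = []
    for name, child in node.items():
        if isinstance(child, dict):
            if name in stated and total_mb(child) != stated[name]:
                mismatches.append(name)
            mismatches.extend(find_mismatches(child, stated))
    return mismatches


genome_total = total_mb(composition_mb)
repeat_share = stated_totals_mb["Interspersed repeats"] / genome_total
print(genome_total, find_mismatches(composition_mb, stated_totals_mb))
print(f"Interspersed repeats: {repeat_share:.1%} of the genome")
```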

9 Repeat Sequences
Repeat sequences account for close to 50% (and likely much more) of the genome.
Are repeats junk? They were previously dismissed as uninteresting, but:
  They may hold clues about evolutionary events (mutations and selection), acting as a palaeontological record.
  They may have important roles in reshaping the genome: duplicating introns, creating entirely new genes, and modifying and reshuffling existing genes.
Five classes of repeats:
  Transposon-derived repeats (most abundant)
  Processed pseudogenes
  Simple sequence repeats
  Segmental duplications
  Tandemly repeated sequences (e.g. at centromeres, telomeres, etc.)

10 TRANSPOSABLE ELEMENTS
Most abundant human repeat sequences (~44% of genome).
4 classes of transposable elements:
  LINEs or long interspersed elements (20% of genome)
  SINEs or short interspersed elements (13%)
  LTR retrotransposons (8%)
  DNA transposons (3%)

11 (1) LINEs: Long interspersed nuclear elements
Autonomous mobile elements that transpose via an RNA intermediate.
Known as non-viral retrotransposons (they lack LTRs).
~6 kb long.
Very abundant: 868,000 copies occur in the human genome.
Harbour an internal polymerase II promoter.
Encode 2 open reading frames (ORFs).
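
A quick calculation shows what these figures imply together. The sketch below assumes the 640 Mb total for LINEs from the genome composition breakdown in these notes, and divides it by the copy number. The constant names are illustrative, and reading the result as a sign that many copies are incomplete is an inference rather than something stated on the slide.

```python
LINE_TOTAL_MB = 640      # total LINE content, from the composition breakdown
LINE_COPIES = 868_000    # copy number, from the slide
FULL_LENGTH_BP = 6_000   # approximate full-length element (~6 kb), from the slide

mean_copy_bp = LINE_TOTAL_MB * 1_000_000 / LINE_COPIES
fraction_of_full_length = mean_copy_bp / FULL_LENGTH_BP

print(f"Mean LINE copy length: {mean_copy_bp:.0f} bp")
print(f"That is {fraction_of_full_length:.0%} of a full-length element")
```

The mean copy is well under 1 kb, far shorter than the ~6 kb full-length element, so on these figures most copies cannot be complete.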

12 (2) SINEs: Short interspersed nuclear elements
Short (~ bp) mobile (but non-autonomous) elements.
Harbour an internal polymerase III promoter.
Encode no proteins: they are freeloaders on the backs of LINE elements, and are thought to use the LINE machinery for transposition.
3 distinct SINE families in the human genome: Alu, MIR and Ther2/MIR3.
Alu sequences:
  Most common SINEs in the human genome.
  Concentrate in gene-rich areas over time, suggesting that Alu elements located near genes might be useful and thus retained by the genome.
  Contain recognition sites for the restriction enzyme AluI, hence the name.
  Very similar to 7SL RNA, a component of the signal-recognition particle.
Gene DISRUPTION: transposition of Alu sequences into the neurofibromatosis type 1 (NF1) gene.
  Individuals with such a transposition in one NF1 allele + a somatic mutation in the other NF1 allele → neurofibromatosis.

13 (3) LTR-containing RETROTRANSPOSONS
Autonomous mobile elements; transpose through an RNA intermediate using reverse transcriptase (RT).
Flanked by short direct repeats (typical of all mobile elements) and also by long terminal repeats or LTRs (~ bp) containing the required transcriptional regulatory elements.
Contain gag and pol genes, which encode a protease, RT, RNase H and integrase.
Retroviruses are thought to have arisen from endogenous retrotransposons by acquisition of a cellular envelope (env) gene.
Transposition begins with transcription in the nucleus, followed by reverse transcription (primed by tRNA) in the cytoplasm (cf. transposition of LINEs).

14 (4) DNA TRANSPOSONS
Resemble bacterial transposons.
Contain terminal inverted repeats.
Encode a transposase that binds near the inverted repeats and mediates mobility through a cut-and-paste mechanism.
(Chpt 15, Lewin, Genes VII, Oxford U. Press)

15 TANDEMLY REPEATED DNA
Satellite DNA
  Long series of tandem repeats (hundreds of kb).
  Several types, each with a different repeat unit (5-200 bp), e.g. alphoid DNA repeats in centromeres.
Minisatellites
  Form clusters (up to 20 kb in length).
  Repeat units up to 25 bp, e.g. 5′-TTAGGG-3′ repeats in telomeres.
Microsatellites (short tandem repeats, STRs)
  Short clusters (<150 bp); occurrence: every 2 kb (on average).
  Repeat units usually 4 bp or less (e.g. CA), repeated times.
  Many microsatellites are highly polymorphic, i.e. the number of repeats is highly variable between individuals.
  No 2 persons have exactly the same combination of microsatellite alleles.
  Application: genetic profiling or DNA fingerprints, e.g. microsatellites in the human β T-cell receptor locus (Chrom. 7).
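
The idea behind microsatellite-based profiling can be sketched in a few lines: find runs of a short repeat unit and count the copies, then compare the counts between individuals. The function name `find_microsatellites`, the `min_repeats` threshold and the two example alleles below are hypothetical and chosen only for illustration.

```python
import re


def find_microsatellites(seq, unit="CA", min_repeats=3):
    """Return (start position, repeat count) for each tandem run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_repeats},}}")
    return [
        (match.start(), len(match.group()) // len(unit))
        for match in pattern.finditer(seq.upper())
    ]


# Two made-up alleles of the same locus, differing only in CA repeat number.
allele_a = "GGT" + "CA" * 5 + "TTG"
allele_b = "GGT" + "CA" * 8 + "TTG"

runs_a = find_microsatellites(allele_a)
runs_b = find_microsatellites(allele_b)
print(runs_a, runs_b)
```

The two alleles share the same flanking sequence but carry different repeat counts, which is the kind of length difference a DNA profile records at each locus.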

16 Where are the genes (in the sea of sequences)?
Same genome composition breakdown as slide 8: genes occupy only 48 Mb of the 3200 Mb human genome, within 1200 Mb of genes and gene-related sequences; the remaining 2000 Mb is intergenic DNA.
Adapted from Brown T.A., Genomes 2, 2002, Wiley-Liss; IHGSC 2001, Nature 409:860; Venter J.C. et al., 2001, Science 291:1304.

17 Where are the genes?
Unevenly distributed. Density range: 0 to 64 genes per 100 kb.
Cytogenetic analysis: chromosome banding pattern by Giemsa stain.
  Dark G-bands contain fewer genes.
  Giemsa stains regions rich in A & T (A+T content substantially greater than 60%).
  Genes generally have A+T contents of 45-50%.
  Base composition of the genome (as a whole) is 59.7% A+T.
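
A+T content is simple to compute, which is why it is a convenient first look at where gene-rich regions might lie. The following is a minimal sketch; the function names, the window and step sizes, and the toy sequence are assumptions for the example, and real analyses use much larger windows.

```python
def at_content(seq):
    """Fraction of A and T among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    acgt = at + seq.count("C") + seq.count("G")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / acgt


def windowed_at(seq, window, step):
    """A+T content in consecutive windows, as (start, fraction) pairs."""
    return [
        (start, at_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATATATAT" + "GCGCGCATGC"
profile = windowed_at(toy, window=10, step=10)
print(profile)
```

On real data, windows well above 60% A+T would be candidates for gene-poor, Giemsa-dark regions, while windows near 45-50% look more like genes.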

18 Where are the genes?
Density of CpG islands roughly correlates with gene density.
The CpG dinucleotide is greatly under-represented in human DNA (occurring at just 20% of the frequency expected from the genome's C and G content). Why?
  Most CpG dinucleotides are methylated on C.
  Spontaneous deamination of methyl-C residues → T residues.
  Spontaneous deamination of ordinary C residues → uracil (but readily repaired).
  i.e. CpG dinucleotides steadily mutate to TpG dinucleotides.
CpG islands:
  are stretches of DNA (typically less than 1800 bp long) that have an average G+C content of ~60%, compared with 40% in bulk DNA.
  are of interest because many are associated with the 5′ ends of genes.
Number of CpG islands:
  In full sequence: 50,267 CpG islands.
  In repeat-masked sequence: 28,890 CpG islands (Alu repeats are GC-rich).
CpG island density:
  islands/Mb (most chromosomes)
  2.9 islands/Mb (Chromosome Y)
  islands/Mb (Chromosomes 16, 17 & 18)
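
The under-representation of CpG can be measured as an observed/expected ratio: the number of CpG dinucleotides seen, divided by the number expected if C and G were placed independently. The sketch below uses one commonly quoted form of that ratio, (CpG count × length) / (C count × G count). The formula choice, the function name `cpg_stats` and the toy sequences are assumptions for illustration and are not taken from the slides.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides read 5' to 3'
    gc_fraction = (c_count + g_count) / n
    expected = c_count * g_count / n
    ratio = cpg_count / expected if expected else 0.0
    return gc_fraction, ratio


# Toy sequences: one CpG-rich, one with C and G present but no CpG.
island_like = "CGCGCGATAT"
depleted = "CCTGGATGCA"

island_gc, island_ratio = cpg_stats(island_like)
depleted_gc, depleted_ratio = cpg_stats(depleted)
print(island_gc, round(island_ratio, 2))
print(depleted_gc, depleted_ratio)
```

A ratio near 0.2 would correspond to the genome-wide depletion described above, while CpG islands stand out with much higher ratios and higher G+C content.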

19 Where are the genes?
Classification of gene products into broad functional categories.
Some of the findings (IHGSC 2001, Nature 409:860):
  Eukaryotes possess the same basic set of genes.
  Interestingly, more complex species (?) have a greater number of genes. What do you think?
  Humans have the greatest number of genes in all but one of the categories: metabolism.
  C. elegans (with only 959 cells) has a very high number of genes for cell-cell communication!
  Humans have cells, but have only 250 more genes for cell-cell communication!

20 What do the genes really do?
26,383 genes.
Molecular function unknown (41.7%): 12,809 genes.
Venter et al. (2001). Science 291:

21 Gene Families in clusters
Arising as a result of gene duplication.
(1) Simple or classical multigene families: members have identical or nearly identical sequences.
  ~2000 genes for 5S rRNA located in a single cluster on chromosome
  copies of a repeat unit containing genes for 28S, 5.8S and 18S rRNA, grouped into 5 clusters of repeats, one cluster on each of chromosomes 13, 14, 15, 21 and 22.
(2) Complex multigene families: members have similar sequences, but are sufficiently different for their gene products to have distinctive functions or properties.
  α-globin gene cluster on chromosome 16.
  β-globin gene cluster on chromosome 11.

22 Globin family of genes exhibiting different properties
β-globin gene cluster:
  Haemoglobins containing Aγ- or Gγ-encoded polypeptides have a higher affinity for oxygen than adult haemoglobins consisting of either δ- or β-encoded polypeptides.
  Aγ- or Gγ-encoded polypeptides are expressed only during fetal life.

23 GENOME EVOLUTION
Non-coding DNA, Repeats and Genome Evolution
Transposable elements may enhance the potential for recombination events, leading to genome rearrangements.
Elements with similar sequences can initiate recombination between 2 regions of the same chromosome or between different chromosomes.
This often leads to the disruption of important genes, which is harmful.
The result may also be beneficial, e.g. duplication of the β-globin gene (resulting in Gγ and Aγ) is thought to have been due to recombination between a pair of LINE-1 elements (~35 million years ago).

24 GENOME EVOLUTION
Duplication of β-globin genes
The different β-globin genes probably arose by duplication of an ancestral gene, most likely as a result of an unequal crossover during recombination in a germ cell.
Homologous recombination between L1 sequences will generate:
  one chromosome with 2 copies of the globin gene, and
  the other chromosome with a deletion of the globin gene.
Subsequent independent mutations in the duplicated genes could lead to:
  slightly different functional properties of the encoded proteins, or
  formation of non-functional pseudogenes.
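
The outcome of an unequal crossover can be worked through with a toy model in which a chromosome is a list of elements and the exchange happens inside two misaligned copies of the same repeat. The representation and the function name `unequal_crossover` are assumptions made for this sketch.

```python
def unequal_crossover(chrom_a, chrom_b, i, j):
    """Recombine chrom_a and chrom_b within element i of a and element j of b.

    Both elements must be copies of the same repeat. Each product keeps the
    left part of one chromosome and the right part of the other.
    """
    if chrom_a[i] != chrom_b[j]:
        raise ValueError("crossover requires matching (homologous) elements")
    recombinant_1 = chrom_a[: i + 1] + chrom_b[j + 1 :]
    recombinant_2 = chrom_b[: j + 1] + chrom_a[i + 1 :]
    return recombinant_1, recombinant_2


# A globin gene flanked by two L1 elements, present on both homologues.
chromosome = ["L1", "globin", "L1"]

# Misalignment: the downstream L1 of one homologue pairs with the
# upstream L1 of the other.
duplicated, deleted = unequal_crossover(chromosome, chromosome, i=2, j=0)
print(duplicated)
print(deleted)
```

One product carries two copies of the globin gene and the other carries none, matching the two outcomes listed above. With correctly aligned repeats (`i=0, j=0`), both products are unchanged.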

25 Homologous recombination

26 GENOME EVOLUTION
Homologous, Orthologous, Paralogous genes
Homologous genes
  Genes that share a common evolutionary ancestor.
  Homologs have extensive sequence similarity. (Caution: sequence similarity does not necessarily imply homology.)
Two categories of homologous genes:
(1) Orthologous genes
  Homologous genes that are present in different organisms and whose common ancestor pre-dates the speciation event.
(2) Paralogous genes
  Homologous genes present in the same organism, often members of a multigene family.
  Their common ancestor pre-dates the gene duplication event, but may or may not pre-date the species in which the genes are now found.

27 Adapted from NCBI, W.M, Mark Boguski

28 GENOME EVOLUTION
The role of non-coding DNA
It is a MYSTERY why 97% of the genome is non-coding DNA. Why is this excess tolerated?
  Non-coding DNA might have a vital function that is yet to be identified.
  Non-coding DNA might have a vital control or regulatory function.
  Non-coding DNA might be tolerated because there is no selective pressure to get rid of it, allowing the propagation of junk, selfish DNA.

29 Genomic Complexity
C-value = total amount of DNA in the haploid genome.

Group       Species           DNA content / genome size (bp)
Bacterium   E. coli           4 x 10^6
Yeast       S. cerevisiae     1 x 10^7
Nematode    C. elegans        8 x 10^7
Insect      D. melanogaster   1 x 10^8
Amphibian   X. laevis         3 x 10^9
Mammal      H. sapiens        3 x 10^9

Species                              Genome size
Drosophila melanogaster (fruitfly)   0.18 Gb
Chicken                              1.2 Gb
Carp                                 1.7 Gb
Boa constrictor (snake)              2.1 Gb
Rat                                  2.9 Gb
Human                                3.2 Gb
Tobacco                              3.8 Gb
Onion                                18.0 Gb
Amphiuma (salamander)                84.0 Gb
Lungfish                             Gb
Ophioglossum (fern)                  160 Gb
Amoeba dubia                         670 Gb

Humans have genomes 300x larger than yeast and 200x smaller than Amoeba dubia.
Discrepancy between genome size and genetic complexity = C-value paradox.
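
The "300x" and "200x" comparisons follow directly from the genome sizes listed on this slide. A small sketch of the arithmetic (the dictionary and variable names are illustrative):

```python
# Genome sizes in base pairs, taken from the slide.
genome_size_bp = {
    "S. cerevisiae": 10_000_000,        # 1 x 10^7 bp
    "H. sapiens": 3_200_000_000,        # 3.2 Gb
    "Amoeba dubia": 670_000_000_000,    # 670 Gb
}

human_vs_yeast = genome_size_bp["H. sapiens"] / genome_size_bp["S. cerevisiae"]
amoeba_vs_human = genome_size_bp["Amoeba dubia"] / genome_size_bp["H. sapiens"]

print(f"Human genome is about {human_vs_yeast:.0f}x the yeast genome")
print(f"Amoeba dubia genome is about {amoeba_vs_human:.0f}x the human genome")
```

Both ratios round to the figures quoted on the slide, which is the C-value paradox in numbers: a single-celled amoeba carries roughly two hundred times more DNA than a human.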

30 Genomic Complexity

Species           Genome size (bp)   No. of genes predicted
E. coli           4 x                ,200
S. cerevisiae     1 x                ,800
D. melanogaster   1 x                ,601 (117 genes/million bases)
C. elegans        8 x                ,099 (197 genes/million bases)
A. thaliana       1 x                ,498 (221 genes/million bases)
H. sapiens        3 x                ,780 (12 genes/million bases)

Proteins often feature discrete structural units, known as domains. These are conserved in different species.
>90% of identified domains in human proteins are also present in fruitfly and worm proteins.
40% of predicted human proteins are similar to fruitfly and worm proteins.
61% of fruitfly proteins, 43% of worm proteins and 46% of yeast proteins have sequence similarities to human proteins.

31 Why is the complexity of an organism not related in a simple way to the amount of DNA it has?
Not all sequences code for proteins, e.g. some serve as regulatory signals for gene expression.
97% of the genome has no known function (some call this junk DNA!); only 3% consists of coding sequences or genes.
Genomes contain a large quantity of repetitive sequence (50% of the human genome).
Exons comprise only a small part of a gene (and only 1% of the genome), i.e. total coding potential is much reduced.
  e.g. Human dystrophin gene (largest human gene Mb): 79 exons (~14 kb), i.e. only 0.6% of the gene codes for dystrophin.
Alternative splicing (the many ways in which exons can be joined to give mRNA), i.e. more proteins are encoded per gene in humans than in other species.
Complexity is achieved beyond the DNA level, through complex networks of gene expression control and interactions between gene products.
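
The dystrophin figures can be turned around to see how large the gene must be for 14 kb of exons to be only 0.6% of it. The gene length itself is missing from this transcript, so the value below is a back-calculation from the two numbers on the slide, not a quoted figure, and it is approximate because 0.6% is a rounded percentage. The constant names are illustrative.

```python
EXON_TOTAL_BP = 14_000    # ~14 kb of exon sequence, from the slide
EXON_COUNT = 79           # number of exons, from the slide
CODING_FRACTION = 0.006   # 0.6% of the gene is coding, from the slide

implied_gene_bp = EXON_TOTAL_BP / CODING_FRACTION
mean_exon_bp = EXON_TOTAL_BP / EXON_COUNT

print(f"Implied gene length: {implied_gene_bp / 1_000_000:.1f} Mb")
print(f"Mean exon length: {mean_exon_bp:.0f} bp")
```

On these numbers the gene spans a little over 2 Mb, and the average exon is under 200 bp, so the coding exons are a tiny fraction of the gene.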