HUMAN GENOME BIOINFORMATICS. Tore Samuelsson, Dec 2009

1 HUMAN GENOME BIOINFORMATICS Tore Samuelsson, Dec 2009

2 The sequenced (gray filled) and unsequenced (white) portions of the human genome. Peter F.R. Little Genome Res. 2005; 15:

3 Human genome organisation

4 Composition of human genome

Human genome: 3200 Mb
  Genes and related sequences: 1200 Mb
    Exons: 50 Mb
    Other: 1150 Mb (pseudogenes, gene fragments, introns)
  Intergenic DNA: 2000 Mb
    Repeats: 1400 Mb (LINEs 640 Mb, SINEs 420 Mb, LTRs, DNA transposons)
    Other intergenic regions: 600 Mb (microsatellites)
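The sizes on this slide are easier to compare as fractions of the whole genome. The following is a minimal sketch of that arithmetic; the dictionary only holds the figures that carry a number on the slide, and the function name `percent_of_genome` is made up for illustration.

```python
# Sizes in megabases (Mb), copied from the composition slide.
GENOME_MB = 3200
components_mb = {
    "Genes and related sequences": 1200,
    "Intergenic DNA": 2000,
    "Exons": 50,
    "Repeats": 1400,
    "LINEs": 640,
    "SINEs": 420,
}


def percent_of_genome(size_mb, genome_mb=GENOME_MB):
    """Return size_mb as a percentage of the whole genome."""
    return 100.0 * size_mb / genome_mb


for name, size in components_mb.items():
    print(f"{name:30s} {size:5d} Mb  {percent_of_genome(size):5.1f} %")
```

With these numbers, exons come out at under 2 % of the genome, while repeats make up a little under half of it.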

5 The human genome contains
* long introns and a low density of genes compared to eubacteria
* repetitive elements
Simple sequence repeats (SSRs) = microsatellites: repeat unit of 1-13 nt, length polymorphism, 3 % of the genome
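A simple way to see what a microsatellite looks like in sequence data is to search for a short unit repeated in tandem. The sketch below uses a regular expression with a back-reference; the unit length limit of 13 nt follows the slide, while the minimum of four copies and the function name are assumptions chosen only for illustration.

```python
import re


def find_microsatellites(seq, min_unit=1, max_unit=13, min_repeats=4):
    """Find perfect tandem repeats of a short unit.

    Returns a list of (start, end, unit, copies) with 0-based start and
    exclusive end. min_repeats is an illustrative threshold, not a standard.
    """
    # ([ACGT]{a,b}?) captures the shortest possible unit; \1{n,} requires
    # at least n further copies of exactly that unit directly after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits


print(find_microsatellites("GGTT" + "CA" * 6 + "GGTT"))
```

Only perfect repeats are found this way; real microsatellites often contain interruptions, which dedicated tools handle with alignment-based scoring.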

6 Transposons

Class                        Fraction of genome
DNA transposons              3 %
LTR (retrotransposons)       9 %
LINEs (retrotransposons)     21 %
SINEs (retrotransposons)     14 %
Total                        47 %

7 LINE1 element

8 Computational detection of repetitive DNA: dotplot analysis
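The idea behind a dotplot can be sketched in a few lines: compare every word of one sequence with every word of the other and mark the matches. When a sequence is compared with itself, repeats show up as diagonals off the main diagonal. The word length of 3 and the function name `dotplot` are assumptions for this small example; real tools use longer words and graphical output.

```python
def dotplot(seq_a, seq_b, word=3):
    """Return a text dotplot as a list of strings.

    Row i, column j holds '*' when the word starting at seq_a[i]
    equals the word starting at seq_b[j], otherwise '.'.
    """
    rows = []
    for i in range(len(seq_a) - word + 1):
        row = []
        for j in range(len(seq_b) - word + 1):
            same = seq_a[i:i + word] == seq_b[j:j + word]
            row.append("*" if same else ".")
        rows.append("".join(row))
    return rows


seq = "ACGTTTACGT"  # toy sequence in which ACGT occurs twice
for line in dotplot(seq, seq):
    print(line)
```

This brute-force comparison is quadratic in sequence length, which is fine for a demonstration but not for whole chromosomes.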

9 Detection of repetitive DNA: RepeatMasker, a local alignment method to search a database of known repeats

RepeatMasker is a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes as well as for low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (replaced by Ns). On average, over 40% of a human genomic DNA sequence is masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.

HSU (22462) + MER7A DNA/MER2_type (109)
HSU (21523) C TIGGER1 DNA/MER2_type (0)
HSU (21222) C AluSx SINE/Alu (4)
HSU (20544) C TIGGER1 DNA/MER2_type (943)
HSU (20243) C AluSg SINE/Alu (0)
HSU (19548) C TIGGER1 DNA/MER2_type (1608)
HSU (19427) + MER7A DNA/MER2_type (1)
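The masking step described above (annotated repeats replaced by Ns) can be illustrated with a short sketch. This is not RepeatMasker code; the function name `mask_repeats` and the choice of 1-based inclusive coordinates for the intervals are assumptions made for the example.

```python
def mask_repeats(seq, intervals):
    """Replace annotated repeat intervals with N.

    intervals: iterable of (start, end), assumed 1-based and inclusive.
    """
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range annotations are harmless.
        for pos in range(max(start - 1, 0), min(end, len(seq))):
            masked[pos] = "N"
    return "".join(masked)


def masked_fraction(masked_seq):
    """Fraction of positions that are N after masking."""
    return masked_seq.count("N") / len(masked_seq) if masked_seq else 0.0


example = mask_repeats("ACGTACGTAC", [(3, 5), (9, 10)])
print(example, masked_fraction(example))
```

Masking before a database search keeps repeats from producing large numbers of uninformative hits.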

10 Repetitive DNA in Santa Cruz browser

11 The human genome contains a substantial number of pseudogenes (non-functional gene variants)
Non-processed pseudogenes
* Gene duplication has resulted in a new copy of the gene
* The copy has mutated to become non-functional
Processed pseudogenes
* Non-functional genomic copies of mRNAs
* Often contain multiple mutations

12 (Protein) gene prediction methods
- Matches to known mRNA/EST/protein sequences
- Ab initio methods: recognition of statistical signals in genomic DNA consistent with a gene

13 Gene prediction methods - Matches to known mRNA/EST/protein sequences
Exons: E1 E2 E3 E4
Genome sequence                       GGGAGCTACTATCTAGCGGGGATCTATCTAGCGAGCGAGTCATCTTAGCG
Experimentally derived mRNA sequence  GGAAGCTATCATCTGGCGGGAATCTATCT--CGAGCGAGTCATCTTGGCG
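How well an mRNA matches the genome can be summarized by counting matching, mismatching and gapped columns in the alignment. The sketch below does this for the two aligned rows shown on the slide; the function name `alignment_stats` is made up for illustration.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count (matches, mismatches, gap columns) in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


genome = "GGGAGCTACTATCTAGCGGGGATCTATCTAGCGAGCGAGTCATCTTAGCG"
mrna = "GGAAGCTATCATCTGGCGGGAATCTATCT--CGAGCGAGTCATCTTGGCG"

matches, mismatches, gaps = alignment_stats(genome, mrna)
identity = matches / (matches + mismatches)  # identity over ungapped columns
print(matches, mismatches, gaps, round(identity, 3))
```

Whether gap columns count towards the denominator is a convention; here they are left out, which is one of several common choices.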

14 Ab initio methods - do not make use of sequence similarity; instead recognize properties of genes
* ORFs
* Discriminating between coding and noncoding regions
* Promoters
* Start and stop codons
* Splice sites
* Length distribution of exons and introns
* Polyadenylation sites
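The simplest of the signals listed above, an open reading frame bounded by a start and a stop codon, can be found with a short scan. The following is a minimal sketch for the forward strand only; the function name `find_orfs` and the `min_codons` threshold are assumptions for illustration. In eukaryotic genomic DNA, introns interrupt the reading frame, so a scan like this is only one ingredient of gene prediction.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in the three forward reading frames.

    Returns (start, end, frame) tuples; start is 0-based, end is
    exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Scanning the reverse strand would need the same loop applied to the reverse complement of the sequence.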

15 Elements of a eukaryotic gene

16 Promoter regions in eukaryotic DNA
1. TATA box
2. Initiators: 5' Y Y A(+1) N [T,A] Y Y Y 3'
3. Downstream promoter element: G-A/T-C-G
4. CpG islands (~56 % of human genes)

17 EMBOSS CpGPlot predicts CpG islands
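CpG island prediction is usually based on two window statistics: the G+C percentage and the ratio of observed to expected CpG dinucleotides, where the expected count is (number of C × number of G) / window length. The sketch below computes both and reports windows that pass a pair of thresholds. It is not the CpGPlot implementation; the function names are made up, and the defaults (window of 100, G+C of at least 50 %, observed/expected of at least 0.6) are assumptions modelled on commonly quoted CpG island criteria.

```python
def cpg_stats(seq):
    """Return (GC percent, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def cpg_windows(seq, window=100, step=1, min_gc=50.0, min_oe=0.6):
    """Return 0-based starts of windows passing both thresholds."""
    starts = []
    for i in range(0, len(seq) - window + 1, step):
        gc_percent, obs_exp = cpg_stats(seq[i:i + window])
        if gc_percent >= min_gc and obs_exp >= min_oe:
            starts.append(i)
    return starts


toy = "AT" * 50 + "CG" * 50  # AT-rich half followed by a CpG-rich half
print(cpg_windows(toy, window=100, step=50))
```

A full predictor would also merge adjacent passing windows and apply a minimum island length.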

18 Comparison of a large number of exon-intron border regions reveals a number of conserved nucleotide positions

19 Eukaryotic genes contain sequence elements involved in a mechanism for polyadenylation of mRNA

20 One of the most widely used ab initio gene prediction programs is Genscan (C. Burge)
Biological input to Genscan - I
* TATA promoter, transcription start site (weight matrices)
* Translation start site, stop site (weight matrices)
* Splice signals (weight matrices)
* PolyA signals (weight matrices)
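A weight matrix of the kind mentioned above scores each position of a candidate site by how often each base is seen there in known sites, relative to a background frequency. The following is a minimal sketch of that general technique, not Genscan's model or its matrices. The four training sites are invented toy data, and the uniform background of 0.25, the pseudocount and all function names are assumptions.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the log-odds scores of a window (A, C, G, T only)."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]  # invented examples
pwm = build_pwm(toy_sites)
print(best_hit(pwm, "CCCGTAAGTCCC"))
```

In practice a score threshold decides which windows are reported as candidate signals; choosing it trades sensitivity against false positives.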

21 One of the most widely used ab initio gene prediction programs is Genscan
Biological input to Genscan - II
* Hexamer composition to model coding versus non-coding sequence
* Makes use of typical exon and intron lengths: exons typically nt, introns > 70 nt
* Takes into account that G+C-rich regions of the human genome have higher gene density and shorter introns

22 Human genome GC-rich regions have:
* higher gene density and shorter intergenic regions
* shorter introns
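To relate gene density to base composition, one first needs the G+C content along the sequence. The sketch below computes it in sliding windows; the window and step sizes and the function names are arbitrary assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_profile(seq, window=1000, step=500):
    """Return (start, GC fraction) for windows along the sequence."""
    last_start = max(len(seq) - window, 0)
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, last_start + 1, step)]


print(gc_profile("GGGGAAAA", window=4, step=4))
```

Plotting such a profile next to gene annotations is one way to examine the relationship stated on the slide.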

23

24 Santa Cruz browser with GenScan track

25 Output from Genscan - example output

Gn.Ex Type S.Begin...End.Len Fr Ph I/Ac Do/T CodRg P... Tscr
Type column: PlyA Term Init Prom Prom Init Intr Intr Intr Intr Intr Intr Intr Term PlyA PlyA Term Init

>gi GENSCAN_predicted_peptide_2 241_aa
MTRGMSWSTYLKMFATSLLAMCTGAEVVPQISDDEPGYDLDLFCIPNHYAEDLERVFIPH
GLIMDRTERLARDVMKEMGGHHIVALCVLKGGYKFFADLLDYIKALNRNSDRSIPMTVDF
IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQYNPKMVK
VASNSDVIGQAVARVVVGFEIPDKFVVGYALDYNEYFRDLNMRKLNPREHKKLVQSDISD
A

26 PRACTICAL SESSION
Genscan
* with a sequence of known exon/intron structure
* with a sequence of less well characterized structure
Finding repeats with RepeatMasker

27 PRACTICAL SESSION BLAST for examining exon/intron structure
BLAST program = blastx
Query = genomic DNA sequence
Database = protein databases
Ideally each exon is reported as a separate HSP (alignment) in the BLAST output.
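blastx compares a nucleotide query with a protein database by translating the query in all six reading frames, which is why exons in different frames can each show up as their own HSP. The sketch below illustrates only the six-frame translation step with the standard genetic code; it does not run BLAST, and the function names are made up for the example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Stop codons are shown as `*`; in genomic DNA each translated frame is typically broken up by many of them outside the exons.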

28 PRACTICAL SESSION BLAST alignments

29 PRACTICAL SESSION Genewise (= Wise2): alignment of a protein to a genomic DNA sequence, or alignment of a profile HMM (Pfam model) to genomic DNA.

30 PRACTICAL SESSION Genewise output

test       9 ISDDEPGYDLDLFCIPNHYAEDLERVFIPHGLIMD
             ISDDEPGYDLDLFCIPNHYAEDLERVFIPHGLIMD
             ISDDEPGYDLDLFCIPNHYAEDLERVFIPHGLIMD
HUMHPRTB     aagggcgtgcgtttacactgggtgagtaccgcaag
             tgaaacgaatattgtcaaacaatagtttcagttta
             ttttaattcttatcatttttgtgaggttttaatgc

test      44                              TERLARDVMKEMGGHHIVALCVL
                                          TERLARDVMKEMGGHHIVALCVL
             R:R[agg]                     TERLARDVMKEMGGHHIVALCVL
HUMHPRTB     AGGTAAGTA  Intron 1      TAGGagccgcggaagaggccaggctgc
             <2-----[14887:16602]-2>      cagtcgattaatggaattctgtt
                                          tatttatgggggactctacctgc

31 PRACTICAL SESSION Spidey: alignment of mRNA to genomic DNA

32 PRACTICAL SESSION Spidey

33 PRACTICAL SESSION Aligning an mRNA or protein to genomic DNA using BLAT at the UCSC browser