Genome evolution on the allotetraploid Xenopus laevis

1 Genome evolution on the allotetraploid Xenopus laevis Taejoon Kwon Department of Biomedical Engineering, School of Life Sciences Ulsan National Institute of Science & Technology (UNIST) Xenopus Bioinformatics Workshop,

2 XenBase is the central place for everything!

3 Genome vs. Annotation

4 Genome vs. Annotation

5 X. tropicalis genome

6 Differences between XenBase & other genome browsers (be careful)

7 The International Xenopus laevis genome project consortium
Illumina paired-end libraries (127x)
  225 bp: 153 Gbp
  450 bp: 118 Gbp
  900 bp: 124 Gbp
Illumina mate-pair libraries (36x)
  1,500 bp: 47 Gbp
  4,000 bp: 47 Gbp
  10,000 bp: 18 Gbp
SOLiD mate-pair libraries (10x)
  1,500 bp: 30 Gbp
BAC-end sequences 720,384 SceI BACs 38,400 HindIII BACs
Fosmid shotgun sequences
HiC & Chicago HiRise
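The fold-coverage figures on this slide (127x, 36x, 10x) follow from dividing the total sequence yield of each library group by the genome size. The sketch below reproduces that arithmetic. The genome size of 3.1 Gbp is an assumption (it is not stated on the slide, but it is the value the listed yields and coverages imply), and the names `GENOME_SIZE_GBP`, `LIBRARY_YIELDS_GBP`, and `fold_coverage` are illustrative only.

```python
# Assumed haploid genome size in Gbp; not given on the slide.
GENOME_SIZE_GBP = 3.1

# Sequence yield (Gbp) per insert size, as listed on the slide.
LIBRARY_YIELDS_GBP = {
    "Illumina paired-end": [153, 118, 124],  # 225, 450, 900 bp inserts
    "Illumina mate-pair": [47, 47, 18],      # 1,500, 4,000, 10,000 bp inserts
    "SOLiD mate-pair": [30],                 # 1,500 bp inserts
}


def fold_coverage(yields_gbp, genome_size_gbp=GENOME_SIZE_GBP):
    """Return fold coverage: total sequenced bases divided by genome size."""
    return sum(yields_gbp) / genome_size_gbp


coverage = {name: fold_coverage(y) for name, y in LIBRARY_YIELDS_GBP.items()}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}x")
```

Under the assumed genome size, the rounded values agree with the coverages quoted on the slide.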

8 Genome Assembly 101

9 Mate-pair libraries

10 HiC-seq for chromosome conformation Lieberman-Aiden, et al., Science (2009) Dekker, et al., Nat. Rev. Genet. (2013) Sexton and Cavalli, Cell (2015)

11 Burton, et al., Nat. Biotech. (2013) HiC-seq data for scaffolding

12 Applying a recently developed algorithm to build chromosome-level pseudomolecules

13 Validating assembly with BAC/fosmid-end, long range mate-pairs & HiC data
[Figure: CrossOver Frequency along a scaffold, with a candidate for break point marked. Figure labels as extracted: 4 UW Fosmid-end data; 3 NIG BAC/Fosmid data; 3 Quigley HiC data; 2 10 kbp mate-pair XGC data]

14 Example of a misassembled scaffold detected by CrossOver Frequency (confirmed by BAC-FISH)

15 Both HiC and Chicago HiRise data confirm no strong signal of mis-assembly Session, Uno, Kwon, et al., submitted

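16 N50: measure of genome assembly
Sort the scaffolds by length, longest first (A, B, C), and walk down the list until 50% of the concatenated sequence length is reached. Here scaffold A alone reaches the 50% mark, so N50 is the length of scaffold A.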

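17 N50: measure of genome assembly
Example 1 (scaffolds A, B, C, sorted by length): scaffold A alone reaches 50% of the concatenated sequence length, so N50 is the length of scaffold A.
Example 2 (scaffolds A, B, C, D, E, F, sorted by length): 50% of the concatenated sequence length is reached within scaffold C, so N50 is the length of scaffold C.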
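The N50 definition from these two slides can be written as a short function. This is a minimal sketch of the standard calculation; the scaffold lengths in the usage lines are made-up numbers chosen to mirror the two slide examples, not values from any assembly.

```python
def n50(scaffold_lengths):
    """Return the N50 of an assembly.

    Scaffolds are sorted longest first; N50 is the length of the scaffold
    at which the running total first reaches 50% of the concatenated length.
    """
    lengths = sorted(scaffold_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one scaffold length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths: scaffold A alone passes the 50% mark.
example_abc = n50([60, 25, 15])
# Hypothetical lengths: the 50% mark falls within the third scaffold (C).
example_abcdef = n50([25, 20, 15, 15, 15, 10])
print(example_abc, example_abcdef)
```

Because N50 depends only on the length distribution, it says nothing about whether the scaffolds are correctly joined, which is why the deck validates the assembly separately with BAC/fosmid-end, mate-pair, and HiC data.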

18 X. tropicalis genome (2014 version)
Versions: 4.1 (xentor2), 4.2 (xentor3), 7.1 (JGIv7a), 8.0 (JGIv80)
N50 (bases): 1,567,461 1,570, ,444, ,782,168
Concat. Length (billions bases):
# Scaffolds: 19,759 19,550 7,730 8,128
# Scaffolds > 10 kbp: 6,308 6,260 1,691 1,918
# Scaffolds > 50 kbp: 1,683 1,

19 X. laevis genome (2014 version)
Versions: Ver 1 (Dec. 2010), Ver 5 (Oct. 2011), Ver 6 (Sep. 2012), Ver 7.1 (May 2013)
# Scaffolds: 562, , , ,604
Concatenated (bp): 2,296,834,358 2,837,751,070 2,756,816,359 2,781,979,436
Concatenated > 1 kbp (bp): 2,216,160,057 2,755,097,095 2,638,385,220 2,698,814,836
N50 (bp): 8, , ,136 3,490,425
# Scaffolds longer than 10 kbp: 60,121 16,625 8,426 2,598
Scaffold N50 = 7.61 Mbp
If you want to say it is still a draft and therefore not officially available yet, what would you say about cat (N50 = 4.7 Mb), cow (N50 = 104 kb), or pig (N50 = 576 kb)?

20 X. tropicalis genome (2016 version)
Versions: X. laevis JGIv71, X. laevis JGIv91, X. tropicalis JGIv80, X. tropicalis JGIv90
N50 (bases): 3,490, ,109, ,782, ,289,692
Concat. Length (billions bases):
# Scaffolds: 410, ,501 8,128 6,823
# Scaffolds > 10 kbp:
# Scaffolds > 50 kbp: 2, ,918 1,462 1,

21 Gene annotation: X. laevis
Versions: Ver 1.0 (JGIv10), Ver 1.4 (JGIv14), Ver 1.5 (JGIv15), Ver 1.6 (JGIv16)
Mainly using publicly available X. laevis sequences (i.e. GenBank, EST, etc.) & X. tropicalis sequences. Documentation in progress.
Releases: 2012oct (Oktoberfest), 2013may (MayBall), 2014apr (EggHunt), 2014may (FinalExam), 2014july (WorldCup)

22 Amount of RNA-seq raw data (Jun. ~ Sep. 2013)
[Figure: RNA-seq raw data, y-axis in billion bases]

23 Increase of assembled transcripts: 21.8% of total GenBank sequences (171 million, Feb. 2014)
[Figure: number of assembled transcripts; series: Coding, Non-coding, Sum]

24 Oktoberfest & MayBall
Oktoberfest (Oct. 2012)
  Assembled Transcripts per Experiment + references
  Mapping to JGI v6
  Select Tx w/ 90 ~ 95% align_ratio + multiple obs.
  Assign name by GeneTree (ens66)
MayBall (May 2013)
  Assembled Transcripts per Library + references
  Mapping to JGI v7.0 JGI v7.1 & NIG v2
  Select Tx w/ 90 ~ 95% align_ratio + multiple obs.
  Assign name by GeneTree (ens69) + TreeFam (v8)

30 Rules of translation frame: Evaluation
[Diagram: EnsEMBL mRNA evaluated against EnsEMBL proteins, except same species]
Sensitivity = Total number of correct frames / Total number of coding sequences
Specificity = Total number of dubious frames / Total number of noncoding sequences
              Human    Mouse    Chicken  Zebrafish  X. tropicalis
Sensitivity   91.3 %   93.5 %   99.0 %   95.9 %     98.0 %
Specificity   70.2 %   68.9 %   91.3 %   86.6 %     87.0 %

31 Gene name assignment - rules
Example (columns: HUMAN query, Chicken, Zebrafish, Mouse, X. tropicalis):
  ENSG coding MADD madd Madd madd
  ENSG coding MYL myl2b Myl myl
Rules:
  Remove species-specific names (RIKxxxx, SI_xxxx, XB-xxxx, ZGC_xxxx, etc.).
  Survey orthologous gene names.
  If <= 1 orthologous gene has a name, give up.
  If >= 2 orthologous genes have an identical name, take that as the putative gene name.
  If >= 2 orthologous genes have heterogeneous names, but more than 2 orthologous genes have the same name, take that as the putative gene name.
  Otherwise, give up.
Result: MADD (O), MYL2 (X), MYL10 (O)
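The naming rules on this slide amount to a small voting procedure over ortholog names. The sketch below is one possible reading of those rules, not the pipeline actually used: it assumes names are compared case-insensitively (suggested by MADD/madd/Madd being accepted), treats the listed prefixes as the only species-specific patterns, and gives up on ties. The function name `assign_gene_name` and the placeholder names in the usage lines (`foo1`, `ZGC_123`, and so on) are hypothetical.

```python
import re
from collections import Counter

# Assumed patterns for species-specific placeholder names, taken literally
# from the slide (RIKxxxx, SI_xxxx, XB-xxxx, ZGC_xxxx).
SPECIES_SPECIFIC = re.compile(r"^(RIK|SI_|XB-|ZGC_)", re.IGNORECASE)


def assign_gene_name(ortholog_names):
    """Return a putative gene name from ortholog names, or None to give up."""
    # Drop missing and species-specific names; compare case-insensitively.
    named = [
        name.upper()
        for name in ortholog_names
        if name and not SPECIES_SPECIFIC.match(name)
    ]
    # Rule: if <= 1 ortholog has a usable name, give up.
    if len(named) <= 1:
        return None
    counts = Counter(named)
    # Rule: >= 2 orthologs with an identical name.
    if len(counts) == 1:
        return named[0]
    # Rule: heterogeneous names, but more than 2 orthologs share one name.
    ranked = counts.most_common(2)
    (top_name, top_count), (_, second_count) = ranked
    if top_count > 2 and top_count > second_count:
        return top_name
    # Otherwise, give up.
    return None


print(assign_gene_name(["MADD", "madd", "Madd", "madd"]))
print(assign_gene_name(["foo1", "Foo1", "FOO1", "foo2"]))
print(assign_gene_name(["foo1", "foo1", "foo2", "foo3"]))
```

In the third call two orthologs share a name while the others disagree, so the function gives up rather than guessing, which matches the conservative "Otherwise, give up" rule.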

32 Why not just take the human name? (or the best hit of the closest species, say X. tropicalis)
Case 1: Gene X is present in X. laevis, X. tropicalis, Chicken, and Zebrafish; Gene Y is present in Human, Mouse, X. tropicalis, and Zebrafish. Wrong name for Gene X as Gene Y, because of lack of Gene X in human.
Case 2: Gene X is present in X. laevis, Chicken, and Zebrafish; Gene Y is present in Human, Mouse, X. tropicalis, and Zebrafish. Wrong name for Gene X as Gene Y, because of lack of Gene X in X. tropicalis.

33 BAC-FISH confirmed highly conserved synteny between X. laevis (tetraploid) and X. tropicalis (diploid)
Left: Allotetraploid Xenopus laevis (2N=36)
Right: Diploid Xenopus (Silurana) tropicalis (2N=20)
Uno, et al., Heredity (2013)

34 Chromosome fusion: XTR9+XTR10 = XLA9_10 Session, Uno, Kwon, et al., submitted

35 The genome allows us to time the ancestral hybridization event and assign every chromosome to one or the other hybrid parent
[Figure: density of transposon (ppm) per chromosome, L and S; series: TpA_Harb, TpB_Harb, TpB_Mar]
Session, Uno, Kwon, et al., submitted

36 About 1/3 of genes are singletons, not homeologs, implying gene gain or loss
[Figure: chromosomal distribution of duplicated genes]

37 Gene loss at S chromosome is selective Session, Uno, Kwon, et al., submitted

38 Gene loss at S chromosome is selective

39 Session, Uno, Kwon, et al., submitted Selective gene expansion

40 Large-scale J-strain RNA-seq data
11 dev. stages ( b), 14 tissues ( b), ~340 billion bp
Stages: St01, St08, St09, St10.5, St12, St15, St20, St25, St30, St35, St40

41 Garber, et al., Nat. Methods (2011) Two approaches in gene annotation: genome-guided vs genome-free

42 Testing several RNA-seq mapping methods to measure the expression of homeologs Kwon, Cytogenet. Genome Res. (2015)

43 Testing several RNA-seq mapping methods to measure the expression of homeologs Kwon, Cytogenet. Genome Res. (2015)

44 Preferentially expressed S genes are enriched in early development Session, Uno, Kwon, et al., submitted

45 Selective gene regulation of duplicated genes Session, Uno, Kwon, et al., submitted

46 Frog will become a useful model organism to study human disease in systems biology
[Figure: Number of Orthologous Proteins vs. Align ratio (%) to Human Protein Reference (EnsEMBL v80)]

47 Acknowledgements
Other Xenopus collaborators
The Xenopus laevis genome sequencing project
  Adam Session, Jarrod Chapman, Richard Harland, Daniel Rokhsar (UC Berkeley/JGI)
  Masanori Taira, Mariko Kondo (U of Tokyo)
  Atsushi Suzuki, Shuji Takahashi (U of Hiroshima)
  Akimasa Fukui (Hokkaido U)
  Ian Quigley (Salk Institute)
  Gert Jan Veenstra, Simon van Heeringen (Radboud U)
Funded by