Supplementary Information Draft Genome Sequence of the Mulberry Tree Morus notabilis

Ningjia He 1, Chi Zhang 2, Xiwu Qi 1, Shancen Zhao 2, Yong Tao 2, Guojun Yang 3, Tae-Ho Lee 4, Xiyin Wang 4,9, Qingle Cai 2, Dong Li 1,2, Mengzhu Lu 5, Sentai Liao 6, Guoqing Luo 7, Rongjun He 2, Xu Tan 4, Yunmin Xu 1, Tian Li 1, Aichun Zhao 1, Ling Jia 1, Qiang Fu 1, Qiwei Zeng 1, Chuan Gao 2, Bi Ma 1, Jiubo Liang 1, Xiling Wang 1, Jingzhe Shang 1, Penghua Song 1, Haiyang Wu 2, Li Fan 1, Qing Wang 1, Qin Shuai 1, Juanjuan Zhu 1, Congjin Wei 1, Keyan Zhu-Salzman 8, Dianchuan Jin 9, Jinpeng Wang 9, Tao Liu 9, Maode Yu 1, Cuiming Tang 7, Zhenjiang Wang 7, Fanwei Dai 7, Jiafei Chen 5, Yan Liu 10, Shutang Zhao 5, Tianbao Lin 10, Shougong Zhang 5, Junyi Wang 2, Jian Wang 2, Huanming Yang 2, Guangwei Yang 1, Jun Wang 2, Andrew H. Paterson 4, Qingyou Xia 1, Dongfeng Ji 10 *, Zhonghuai Xiang 1 *

Supplementary Figure S1. Distribution of 17-mer depth of the HiSeq reads. The analyzed reads are from libraries with insert sizes of 500 bp, after sequence error correction. According to the K-mer frequency information, the peak depth is 24 and the genome size of M. notabilis is estimated to be 357 Mb.
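The estimate in this caption follows the empirical formula given in the Supplementary Methods, G = K_num / K_depth. The sketch below shows one way that calculation could be carried out from a k-mer depth histogram. The function name, the histogram layout, and the toy numbers are illustrative assumptions and are not the authors' pipeline or data.

```python
def estimate_genome_size(kmer_histogram):
    """Estimate genome size as G = K_num / K_depth.

    kmer_histogram maps a depth to the number of distinct k-mers
    observed at that depth (an assumed input layout).
    """
    # K_num: total number of k-mer occurrences across all depths.
    k_num = sum(depth * count for depth, count in kmer_histogram.items())
    # K_depth: the depth at which the histogram peaks.
    peak_depth = max(kmer_histogram, key=kmer_histogram.get)
    return k_num / peak_depth, peak_depth


# Toy histogram (hypothetical values) with its peak at depth 24.
toy_histogram = {22: 10, 23: 20, 24: 40, 25: 20, 26: 10}
genome_size, peak = estimate_genome_size(toy_histogram)
print(peak, genome_size)
```

On real data the low-depth end of the histogram is dominated by error k-mers, which is why the reads are error-corrected before the estimate is made.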

Supplementary Figure S2. Read-depth distribution in the M. notabilis genome assembly. The reads were aligned onto the assembled scaffolds using SOAPaligner, allowing up to 5 mismatches. The number of aligned reads was then calculated for each position.

Supplementary Figure S3. GC-content distributions of the M. notabilis genome and four other eudicot species. The sequence data used in the analysis were from the genomes of M. notabilis, A. thaliana, F. vesca, C. sativus and G. max. GC content was calculated in 500 bp non-overlapping sliding windows along the genomes. The x-axis represents the GC content (%) and the y-axis represents the proportion of the genome.
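A minimal sketch of the windowed GC-content calculation described in this caption is given below. The function name is hypothetical. Dropping a trailing partial window and ignoring ambiguous bases such as N are assumptions, since the source does not say how those cases were handled.

```python
def gc_content_windows(sequence, window=500):
    """Return the GC percentage of each non-overlapping window."""
    seq = sequence.upper()
    percentages = []
    # Step by the window size so that windows never overlap.
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        called = gc + chunk.count("A") + chunk.count("T")
        if called == 0:
            continue  # window consists only of ambiguous bases
        percentages.append(100 * gc / called)
    return percentages


# Tiny example with a window of 4 bp instead of 500 bp.
print(gc_content_windows("GGCCAATT", window=4))
```

Binning the returned percentages and dividing each bin count by the number of windows gives the proportion plotted on the y-axis.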

Supplementary Figure S4. Divergence distribution of classified transposable element (TE) families in the M. notabilis genome. The different classified TE families in the M. notabilis genome were aligned to the consensus sequences in the Repbase (v15.02) library. The sequence divergence rates of TEs were counted.

Supplementary Figure S5. The number of tissue-specifically expressed and housekeeping genes in the M. notabilis genome. We calculated RPKM (reads per kilobase per million mapped reads) values to measure gene expression levels in five tissues and calculated the tissue specificity index tau to identify the genes specifically expressed in each tissue. MeV (v4.8.1) was used to cluster these genes based on their RPKM values. The five tissues used were lateral root bark, one-year-old branch bark, male flower from winter bud, male flower, and semi-mature leaf.
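The two quantities named in this caption can be sketched as follows. The source gives no formula for tau, so the code assumes the commonly used definition, tau = sum(1 - x_i / x_max) / (n - 1). The function names are hypothetical, and no cutoff for calling a gene tissue-specific or housekeeping is implied.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def tau(expression):
    """Tissue specificity index over a list of per-tissue expression values.

    Assumed definition: 0 means equal expression in all tissues,
    1 means expression confined to a single tissue.
    """
    x_max = max(expression)
    if x_max == 0:
        return 0.0  # gene not expressed anywhere; treated as non-specific
    n = len(expression)
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Hypothetical RPKM profiles across five tissues.
print(tau([10, 0, 0, 0, 0]))  # expressed in one tissue only
print(tau([5, 5, 5, 5, 5]))   # uniformly expressed
```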

Supplementary Figure S6. Scatter plot of Ks vs. Ka of orthologs between M. notabilis and C. sativa. The dashed line represents the 95% prediction interval about the linear regression. Red and blue dots represent high and low ω (Ka/Ks) gene pairs, respectively. High ω gene pairs, i.e. those above the prediction interval, are ortholog pairs that are more likely to be under positive selection than other pairs. Likewise, low ω gene pairs, i.e. those below the prediction interval, are ortholog pairs with extremely low non-synonymous substitution rates.

8 Supplementary Table S1. Summary of sequencing data for the M. notabilis genome. Sequencing library (bp) Libraries GA lanes Avg reads length (bp) Raw data (G) Usable data (G) Effective depth (/0.4G) k k k k Total

9 Supplementary Table S2. Statistics for the assembly of the M. notabilis genome. Scaffold Contig Size (bp) Number # Size (bp) Number # N90 11,563 1,393 2,231 13,016 N80 114, ,991 7,263 N70 202, ,710 5,093 N60 299, ,412 3,684 N50 390, ,476 2,638 Longest 3,477, ,236 Total size 330,791, ,509,985 Total number (>100bp) 110, ,741 Total number (>1 kb) 7,150 19,097 Total number (>2 kb) 2,914 13,515

Supplementary Table S3. Statistics of the k-mer distribution. Kmer Genome Kmer Kmer num Bases used Reads used X depth size 17 8,577,674, ,403,096 10,456,746, ,442, X indicates the sequence depth of bases used in the genome sequences.

11 Supplementary Table S4. Assessment of sequence coverage of the M. notabilis genome assembly using ESTs. Data set (bp) Total number >0% of sequence covered by one scaffold Number Ratio >50% of sequence covered by one scaffold Number Ratio >90% of sequence covered by one scaffold Number # # (%) # (%) Number # Ratio (%) >0 5,833 5, , , >200 5,796 5, , , >500 3,081 3, , , > The ESTs are aligned to the assembled scaffolds by BLAST. The EST N50 is 515 bp.

12 Supplementary Table S5. Repeat content in the assembled M. notabilis genome. Classification Copy Number DNA Content (bp) DNA Content (%) Class I: Retrotransposon 108,303 43,561, LTR-Retrotransposon 100,853 41,649, Gypsy 44,464 20,404, Copia 55,782 21,183, Other , Non-LTR Retrotransposon 7,450 2,047, SINE 1, , LINE 6,349 1,605, Class II: DNA Transposon 39,090 11,372, SubclassI 35,829 10,564, CACTA 7,109 2,433, Tc1/Mariner , hat 11,225 4,354, Harbinger 4,438 1,381, Other 12,790 2,349, SubclassII 3,261 1,050, Helitron 3,192 1,034, Maverick 69 15, Satellite 97 29, Low complexity 89,705 6,046, Simple repeat 123,328 6,170, Tandem repeat 177,772 18,860, Unknown 185,015 63,782, Total 723, ,983,

Supplementary Table S6. Support for high-confidence protein-coding loci. Support number # Percent (%) trans &/or EST support protein support de novo support (trans &/or EST) & (de novo) & protein de novo support only Support criteria: A predicted gene is counted as supported when >=60% of its CDS length overlaps the initial gene predicted by the homology-based, RNA-seq, EST, or de novo methods.

Supplementary Table S7. Comparison of genes in high-confidence and low-confidence sets.
Type | Low-confidence gene | High-confidence gene
Gene size (bp) a | 4,464,435 | 29,437,365
Maximal gene length (bp) | 31,657 | 15,294
Gene number # | 2,253 | 27,085
Gene length >=100 bp | 2,253 | 27,085
Gene length >=1000 bp | 1,302 | 11,761
Gene length >=2000 bp | 705 | 3,444
N50 (bp) | 3,217 | 1,536
N80 (bp) | 1,
mRNA length (median/avg, bp) | 1,190/2,629 | 1,939/2,867
Gene length (median/avg, bp) | 1,272/1, | /1,087
Exon length (median/avg, bp) | 234/ | /236
Exon number (median/avg, #) | 3/5.0 | 3/4.6
Intron length (median/avg, bp) | 291/1, | /495
Intron number (median/avg, #) | 2/4.0 | 2/3.6
mRNA GC (%)
Exon GC (%)
Intron GC (%)
a Gene size refers to the total length of all coding sequences.

Supplementary Table S8. Statistics of gene information in M. notabilis and nine other eudicots.
Type | A. thaliana | G. max | P. trichocarpa | V. vinifera | C. papaya | T. cacao | F. vesca | M. domestica | C. sativus | M. notabilis
Gene size (Mb)
Maximal gene length (bp) | 16,011 | 58,035 | 15,954 | 40,713 | 10,605 | 17,220 | 15,804 | 14,367 | 16,131 | 31,657
Genes # | 27,348 | 46,290 | 41,377 | 26,346 | 27,725 | 46,140 | 34,809 | 55,386 | 26,682 | 29,338
Gene density
Gene length >=100 bp # | 27,262 | 46,288 | 41,365 | 26,018 | 27,036 | 46,140 | 34,749 | 55,362 | 26,682 | 29,338
Gene length >=1000 bp # | 14,374 | 24,783 | 18,652 | 11,556 | 9,449 | 20,517 | 15,589 | 25,513 | 10,956 | 13,063
Gene length >=2000 bp # | 3,845 | 6,638 | 5,189 | 3,476 | 2,152 | 7,238 | 4,998 | 7,786 | 3,015 | 4,149
N50 (bp) | 1,545 | 1,539 | 1,491 | 1,596 | 1,314 | 1,677 | 1,587 | 1,554 | 1,464 | 1,635
N80 (bp)
mRNA length (med/avg, bp) | 1,557/1,866 | 2,470/3,265 | 1,733/2,317 | 3,019/5,936 | 1,320/2,396 | 2,437/3,229 | 1,824/2,792 | 1,947/2,729 | 1,700/2,685 | 1,864/2,849
Gene length (med/avg, bp) | 1,044/1,215 | 1,056/1, | /1, | /1, | / | /1, | /1, | /1, | /1, | /1,
Exon length (med/avg, bp) | / | / | / | / | / | / | / | / | / | /250
Exon number (med/avg, #) | 3/5.1 | 4/5.8 | 3/4.7 | 4/6.0 | 2/4.0 | 3/4.7 | 3/5.0 | 3/4.8 | 3/4.4 | 3/4.6
Intron length (med/avg, bp) | 98/ | / | / | / | / | / | / | / | / | /542
Intron number (med/avg, #) | 2/4.1 | 3/4.8 | 2/3.7 | 3/5.0 | 1/3.0 | 2/3.7 | 2/4.0 | 2/3.8 | 2/3.4 | 2/3.6
mRNA GC (%)
Exon GC (%)
Intron GC (%)
med represents median length and avg represents average length. Gene size refers to the total length of all coding sequences. N50 refers to the gene length above which half of the total length of the gene set can be found. N80 refers to the gene length above which 80% of the total length of the gene set can be found.
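The N50 and N80 definitions in the table footnote can be illustrated with a short sketch. The function name and the toy gene lengths are hypothetical. The code assumes the usual convention of accumulating lengths from the longest gene downwards until the requested share of the total length is reached.

```python
def nxx(lengths, fraction=0.5):
    """Return the length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return None  # only reached for an empty input


# Hypothetical gene lengths in bp.
gene_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nxx(gene_lengths, 0.5))  # N50
print(nxx(gene_lengths, 0.8))  # N80
```

The same function applies to the scaffold and contig N50 to N90 values in Supplementary Table S2, with assembly sequence lengths in place of gene lengths.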

16 Supplementary Table S9. Number of annotated genes based on the different methods. Number Percent (%) Total Annotated Swissprot InterPro KEGG GO COG Nr Unannotated

Supplementary Table S10. Fisher exact test results of the 222 diversifying selected orthologous pairs between M. notabilis and C. sativa.
GO ID | Term | GO Category | P-Value
GO: aldehyde oxidase activity F
GO: oxidoreductase activity, acting on the aldehyde or oxo group of donors, oxygen as acceptor F
GO: defense response to fungus, incompatible interaction P
GO: autophagic vacuole C
GO: organ senescence P
GO: leaf senescence P
GO: ADP binding F
GO: aging P
GO: defense response, incompatible interaction P

18 Supplementary Table S11. Comparison of NBS-containing R genes from different plant genomes. Predicted Domains M. notabilis P. trichocarpa O. sativa A. thaliana M. domestica F. vesca # % # % # % # % # % # % TIR-NBS-LRR CC-NBS-LRR NBS-LRR BED-NBS-LRR NBS CC-NBS TIR-NBS Total NBS-LRR genes Total NBS genes Total genes % on total genes 0.53% 0.86% 0.71% 0.52% 1.49% 0.58%

19 Supplementary Table S12. Statistics of the numbers of aspartic protease and cysteine protease in seven plant species. Type Family A. thaliana M. truncatula O. sativa V. vinifera M. domestica F. vesca M. notabilis Aspartic protease Cysteine protease A A A A A A A Total (% of total genes) 230 (0.68%) 144 (0.32%) 485 (1.58%) 690 (2.66%) 233 (0.37%) 184 (0.53%) 129 (0.48%) C C C C C C C C C C C C C C C C C C C C C Total (% of total genes) 132 (0.38%) 103 (0.23%) 190 (0.62%) 96 (0.37%) 378 (0.59%) 172 (0.49%) 127 (0.47%)

Supplementary Table S13. Alignment of the aspartic proteases and cysteine proteases present in M. notabilis latex. Protease Aspartic protease Cysteine protease Gene ID Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Morus Accession number of latex transcriptome sequences in GenBank FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX FX

21 Morus Morus FX FX FX FX FX

Supplementary Table S14. Genes encoding protease inhibitors in the M. notabilis genome.
Family | Number
serine endopeptidase inhibitor | 2
family A1 and C1 serine peptidases inhibitor | 19
serine and cysteine endopeptidase inhibitor | 10
serine and metallo endopeptidase inhibitor | 2
serine peptidase inhibitor | 6
family C1 papain-like cysteine peptidase inhibitor | 9
family C1 cysteine peptidases inhibitor | 22
serine carboxypeptidase Y inhibitor | 8
metallopeptidase pappalysin-1 inhibitor | 1
Total | 79

23 Supplementary Table S15. Non-coding RNA genes in the M. notabilis genome. Type Copy (w) Average length (bp) Total length (bp) % of genome mirna trna rrna S S S S snrna CD-box HACA-box splicing

Supplementary Table S16. Predicted mulberry miRNAs present in the hemolymph and silk gland of silkworm.
Tissue | 1st sequencing (Serial number, Reads) | 2nd sequencing (Serial number, Reads) | Name | Sequence | Ref miRNA
hemolymph t t MIR166f UCUCGGACCAGGCUUCAUUCC bdi-miR166f
hemolymph t t MIR166i UCGGACCAGGCUUCAUUCCCC ptc-miR166i
anterior-middle silk gland t MIR156a UGACAGAAGAGAGUGAGCAC ath-miR156a
anterior-middle silk gland t MIR157a UUGACAGAAGAUAGAGAGCAC ath-miR157a
posterior silk gland t MIR157a UUGACAGAAGAUAGAGAGCAC ath-miR157a
Serial number: serial number of the small RNA sequence raw data in silkworm hemolymph and silk gland. "-" indicates that no sequencing data are available in the database.

Supplementary Methods

Genome sequencing and assembly

We used a whole-genome shotgun (WGS) sequencing strategy and constructed 12 sequencing libraries, which were sequenced on the Illumina HiSeq 2000 platform. The high-quality bases generated correspond to 236-fold coverage of the mulberry genome. Read lengths ranged from 49 to 100 bp. Sequences from a library with a 170 bp insert size were employed as linked reads; 9.68 Gb of linked reads were generated and used to fill the gaps in scaffolds. Low-quality sequencing data, adapter contamination, duplicated reads caused by the Solexa pipeline, and short insert-size reads were discarded. Sequence errors were corrected based on the K-mer frequency data prior to the genome size estimation. The genome size of M. notabilis was determined using a peak depth of 24, according to the empirical formula $G = K_{num} / K_{depth}$, where $G$ is the genome size, $K_{num}$ is the number of K-mers, and $K_{depth}$ is the peak depth.

Using 500 bp non-overlapping sliding windows, we calculated the GC content of five eudicot species: M. notabilis, Arabidopsis thaliana, Fragaria vesca, Cucumis sativus and Glycine max. The GC content of the mulberry genome was 35.02%, which was similar to that of A. thaliana (36%), F. vesca (38.3%), C. sativus (33.8%) and G. max (34.6%).

Gene prediction and annotation

Protein sequences from nine eudicot species (A. thaliana, G. max, P. trichocarpa, V. vinifera, C. papaya, T. cacao, F. vesca, M. domestica, and C. sativus) were aligned to the mulberry genome using TBLASTN with an e-value cutoff of 1e-5. Solar was then used to connect the high-scoring segment pairs (HSPs). HSPs with scores lower than 25 and redundancies with coverage less than 50% were excluded. The alignment was performed using GeneWise (v2.2.0) with sequences extended 500 bp at both ends of the alignment regions. The genes were filtered out if their sequences contained gaps; if more than 50% of the length of an overlapping region contained a TE; if the length of the coding sequence (CDS) was shorter than 150 bp; if they contained "N"s; or if the identity of the genes was lower than 50%.

For the de novo method, Augustus, Genscan, GlimmerHMM and SNAPdenovo were used to predict genes in the mulberry genome. The genes were filtered out if the CDS contained gaps, if more than 50% of the sequences were TEs, or if the CDS was shorter than 150 bp. Gene models from the homology-based and de novo methods were then integrated with Glean.

Low-confidence set genes were predicted by an EST/transcript-based method. The mulberry ESTs/transcripts that did not match the high-confidence set genes were used to identify a low-confidence set of genes. Genes predicted by this method were filtered out if they contained "N"s or were shorter than 150 bp. The remaining genes were further annotated by searching against the NCBI and KEGG databases.

We used tRNAscan-SE to find the tRNA genes and identified 560 tRNA genes with an average length of 74 bp. By aligning the Morus genome to various plant small RNA databases, we predicted 311 snRNAs and 223 miRNAs of mulberry.
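The filtering rules for homology-based gene models can be written as a small predicate, sketched below. The `GeneModel` record, its field names, and the way gaps and TE overlap are represented are assumptions made for illustration. Only the thresholds (150 bp, 50% TE content, 50% identity) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical record for one homology-based gene prediction."""
    gene_id: str
    cds: str            # coding sequence
    has_gap: bool       # True if the model spans an assembly gap
    te_fraction: float  # share of the overlapping region covered by TEs (0 to 1)
    identity: float     # identity to the homologous protein (0 to 1)


MIN_CDS_LENGTH = 150   # bp
MAX_TE_FRACTION = 0.5
MIN_IDENTITY = 0.5


def passes_homology_filter(gene):
    """Return True if the gene model survives all five filters."""
    if gene.has_gap:
        return False
    if gene.te_fraction > MAX_TE_FRACTION:
        return False
    if len(gene.cds) < MIN_CDS_LENGTH:
        return False
    if "N" in gene.cds.upper():
        return False
    if gene.identity < MIN_IDENTITY:
        return False
    return True


# Two made-up gene models: one clean, one too short.
kept = GeneModel("g1", "ATG" * 50, False, 0.1, 0.8)
dropped = GeneModel("g2", "ATG" * 49, False, 0.1, 0.8)
print(passes_homology_filter(kept), passes_homology_filter(dropped))
```

The de novo and EST/transcript-based filters described above use subsets of these rules, so the same predicate could be reused with the unused checks switched off.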