Genome. Journal: Genome. Manuscript ID gen r1. Manuscript Type: Note


Use of microsatellite markers for the assessment of bambara groundnut [Vigna subterranea (L.) Verdc.] breeding system and varietal purity for genome sequencing

Journal: Genome
Manuscript ID: gen r1
Manuscript Type: Note
Date Submitted by the Author: 01-Mar-2016

Complete List of Authors:
Ho, Wai; Crops For the Future, Biotechnology and Crop Genetics
Muchugi, Alice; World Agroforestry Centre
Muthemba, Samuel; World Agroforestry Centre
Kariba, Robert; World Agroforestry Centre
Mavankeni, Busiso; Harare Agricultural Research Centre, Crop Breeding Institute
Hendre, Prasad; World Agroforestry Centre
Song, Bo; BGI
Van Deynze, Allen E.; University of California, Davis, Plant Sciences
Massawe, Festo; University of Nottingham Malaysia, Faculty of Science
Mayes, Sean; University of Nottingham, School of Biosciences

Keyword: plant genome assembly, varietal purity, heterozygosity, microsatellite markers

Use of microsatellite markers for the assessment of bambara groundnut [Vigna subterranea (L.) Verdc.] breeding system and varietal purity for genome sequencing

Wai Kuan Ho, Alice Muchugi, Samuel Muthemba, Robert Kariba, Busiso Olga Mavenkeni, Prasad Hendre, Bo Song, Allen Van Deynze, Festo Massawe, and Sean Mayes

Wai Kuan Ho and Sean Mayes. Biotechnology and Crop Genetics Theme, Crops For the Future, Jalan Broga, Semenyih, Selangor, Malaysia.
Wai Kuan Ho and Festo Massawe. School of Biosciences, Faculty of Science, University of Nottingham Malaysia Campus, Jalan Broga, Semenyih, Selangor, Malaysia.
Alice Muchugi, Samuel Muthemba, Robert Kariba and Prasad Hendre. World Agroforestry Centre, United Nations Avenue, Gigiri, Nairobi, Kenya.
Busiso Olga Mavenkeni. Crop Breeding Institute, Harare Agricultural Research Centre, Fifth Street Extension, P. O. Box CY0, Causeway, Harare, Zimbabwe.
Allen Van Deynze. Seed Biotechnology Center, University of California, 1 Shields Ave, Davis, California, USA.
Bo Song. BGI-Shenzhen, Shenzhen, China.
Sean Mayes. School of Biosciences, Faculty of Science, University of Nottingham Sutton Bonington Campus, Sutton Bonington, Leicestershire, LE12 5RD, UK.

Corresponding author: Wai Kuan Ho (waikuan@cffresearch.org).

Abstract

Maximising the research output from a limited investment is often the major challenge for minor and underutilised crops. However, such crops may be tolerant to biotic and abiotic stresses and adapted to local, marginal and low-input environments. Their development through breeding will provide an important resource for future agricultural system resilience and diversification in the context of changing climates and the need to achieve food security. The African Orphan Crops Consortium recognises the value of genomic resources in facilitating the improvement of such crops. Before genome sequencing begins, the varietal purity of the chosen line needs to be assessed and any residual heterozygosity estimated. Here we present an example from bambara groundnut, an underutilised, drought-tolerant African legume. Two released varieties from Zimbabwe, identified as potential genotypes for whole genome sequencing (WGS), were genotyped with 20 species-specific SSR markers. The results indicate that the cultivars are actually a mix of related inbred genotypes, and the analysis allowed a strategy of single plant selection to be used to generate non-heterogeneous DNA for WGS. The markers also confirmed very low levels of heterozygosity within individual plants. The application of a pre-screen using co-dominant microsatellite markers is expected to substantially improve the genome assembly compared with the cultivar bulking approach that could otherwise have been adopted.

Keywords: plant genome assembly; varietal purity; heterozygosity; microsatellite markers

Introduction

The advance of next generation sequencing has benefited a wide range of research fields. However, large genome size, higher ploidy levels, complex gene content (including duplications and pseudogenes) and high levels of chloroplast and mitochondrial genomes in plants have complicated the de novo assembly of whole genomes (Schatz et al. 2012). Even where very extensive resources have been applied to finished genomes, such as the human genome, millions of unresolved nucleotides can still be found (Schatz et al. 2012). For genome sequencing, true biological variants must be differentiated from errors or biases introduced by the sequencing technique and assembly (Wu et al. 2015). As an example, Zook and colleagues (2014) developed a pipeline for high-confidence genotype calls, comparing 11 whole genomes and 3 exomes of human data, in order to address substantial discordance among different sequencing platforms and algorithms. Given the complexity of plant genomes, we propose that a standard test on any material used for genome sequencing is crucial to minimise the heterogeneity and assess the level of heterozygosity of the plant material to be sequenced, which will improve both the ease and the quality of the plant genome assembly.

Bambara groundnut is among the 100 African crop species to be sequenced by the African Orphan Crops Consortium (AOCC). In this report, we demonstrate the use of microsatellite markers in assessing the homogeneity and homozygosity of bambara groundnut varietal lines before whole genome sequencing, to maximise the return on investment in generating a crop genome sequence.

The definition of a variety can differ slightly between communities.
Most of the time, for seed producers and/or companies working largely in major crops, it refers to the near genetic uniformity of the individuals for sowing, achieved through inbreeding or hybrid production, with the common requirement for each variety to be Distinct, Uniform and genetically Stable (DUS; although the latter would not apply for hybrid seed after the current planting). An inbreeding species would normally be declared a variety between F8 and F12 of inbreeding. With fully inbreeding species, this would give an expected residual heterozygosity of less than 1%, even at F8. However, many inbreeding species have low levels of out-crossing, so it is not safe to assume that full and uniform inbreeding has been achieved by F8 for applications sensitive to the presence of heterozygosity, such as genome sequencing. Moreover, there is an underlying assumption that no mistakes are made in the cultivar development process; such mistakes could lead to heterogeneity in the cultivar, even before considering the genetic mechanisms that may prevent complete inbreeding or uniformity from being achieved.

For major crops these issues are sometimes important, but molecular markers are generally available to complement a phenotypic assessment of uniformity. For underutilised and minor crops, this assessment is even more important, but the tools may be lacking. However, development of co-dominant microsatellite markers (SSRs) has been facilitated by the presence of repeat arrays in the expressed transcriptome.

Much thought is often put into which line of a crop is to be used for genome sequencing, particularly in the context of traits and the relevance of the accession to agriculture. For minor crops such outstanding lines of interest may be less obvious. In such cases, the level of heterogeneity and heterozygosity of the accession should be the deciding factor. Heterogeneity and heterozygosity can result in a poor reference genome assembly, as the quality of the assembly is directly related to these factors. Where possible, homozygous and homogeneous material should be selected before sequencing begins; this is arguably far more important than the traits associated with the accession.
A homozygous genome anchored to a high density genetic map often has the most utility in crop research and development as it provides a clear genome template, which reduces the cost of subsequent sequence-based analysis, through a reduction in the depth required to achieve the desired results.
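
The figure of less than 1% residual heterozygosity at F8 quoted above follows from the halving of heterozygosity with each generation of selfing. The short sketch below reproduces that arithmetic. It assumes strict self-pollination with no out-crossing and no selection, and an F1 that is heterozygous at every locus that differs between the parents; the function name and the default value are illustrative and are not taken from the study.

```python
def expected_heterozygosity(generation: int, h_f1: float = 1.0) -> float:
    """Expected fraction of segregating loci still heterozygous at F_n.

    Assumes strict selfing: each generation halves the heterozygosity,
    so H(F_n) = h_f1 * (1/2) ** (n - 1), with H(F_1) = h_f1.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or greater")
    return h_f1 * 0.5 ** (generation - 1)


if __name__ == "__main__":
    # Print the expectation for the generations discussed in the text.
    for n in (1, 2, 7, 8, 12):
        print(f"F{n}: {expected_heterozygosity(n):.4%}")
```

Under these assumptions F8 gives 0.5^7, or about 0.78%, which is the first generation to fall below 1%. Any out-crossing would raise the value above this idealised expectation, which is why an empirical marker screen remains worthwhile.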

Materials and methods

Two lines of material were used in this study, Kazuma (KS) and Mana (MS), which have been released in Zimbabwe since 2003 as commercial varieties and would therefore be expected to be homogeneous and homozygous. DNA was extracted from silica-dried leaf material of 23 individual plants of the KS variety and 22 individual plants of the MS variety using the DNeasy Plant Mini Kit according to the manufacturer's instructions (Qiagen). PCR amplification using a dye-labelled M13 primer in a three-primer reaction was carried out following Schuelke's protocol (2000) (Table A1). The SSR products were then run on an ABI3730XL capillary electrophoresis system, with a positive control in every run, and analysed using Peak Scanner Software v2 (Applied Biosystems), before manual confirmation of the scoring.

Results and Discussion

Using 20 SSR markers, the two sets of bambara groundnut varietal lines were found to have very low levels of heterozygosity, as would be expected from a highly inbreeding species (Tables 1 and 2). However, both varieties contained lines which were not identical, showing heterogeneity. At least six out of 23 lines of Kazuma (26.1%) and 13 out of 22 lines of Mana (59.1%) showed between-line genetic variation. Given that these two varieties were assessed with a limited, but reasonable, number of SSR markers, more variation might be expected with the application of further markers. The residual heterozygosity observed confirms a low level of out-crossing in this species. Hence, the strategy advocated for this species is to collect DNA from a single plant for genome sequencing, followed by collection of seed from the same plant, to ensure that the sequenced genotype is propagated and available for research and breeding. As a result, a single plant, KS8 from the Kazuma variety, was selected and is undergoing whole genome sequencing by AOCC.
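
The screening logic described above (score each plant against the majority genotype at every marker, count minor calls, flag heterozygotes, then choose a single plant) can be sketched as follows. This is a minimal illustration, not the scoring procedure used in the study, which relied on Peak Scanner output and manual confirmation. The plant names, marker names and fragment sizes are hypothetical, as are the function names.

```python
from collections import Counter

# Hypothetical allele calls (fragment sizes in bp), NOT data from the study:
# plant -> marker -> (allele_a, allele_b) for a diploid, co-dominant marker.
calls = {
    "plant1": {"m1": (200, 200), "m2": (150, 150), "m3": (310, 310)},
    "plant2": {"m1": (200, 200), "m2": (150, 150), "m3": (310, 310)},
    "plant3": {"m1": (204, 204), "m2": (150, 150), "m3": (310, 310)},
    "plant4": {"m1": (200, 200), "m2": (150, 156), "m3": (310, 310)},
}


def majority_genotypes(calls):
    """Return the most frequent genotype at each marker across all plants."""
    markers = sorted({m for genos in calls.values() for m in genos})
    majority = {}
    for marker in markers:
        counts = Counter(
            tuple(sorted(genos[marker]))
            for genos in calls.values()
            if marker in genos
        )
        majority[marker] = counts.most_common(1)[0][0]
    return majority


def screen(calls):
    """List, per plant, markers that differ from the majority and markers that are heterozygous."""
    majority = majority_genotypes(calls)
    report = {}
    for plant, genos in calls.items():
        minor = [m for m, g in genos.items() if tuple(sorted(g)) != majority[m]]
        het = [m for m, (a, b) in genos.items() if a != b]
        report[plant] = {"minor_calls": minor, "heterozygous": het}
    return report


report = screen(calls)
# Plants carrying at least one minor call indicate heterogeneity within the variety.
variant_plants = [p for p, r in report.items() if r["minor_calls"]]
# Candidates for sequencing: match the majority everywhere and show no heterozygous marker.
candidates = [
    p for p, r in report.items() if not r["minor_calls"] and not r["heterozygous"]
]
percent_variant = 100 * len(variant_plants) / len(calls)
```

Applied to the counts reported above, the same percentage calculation gives 100 × 6/23 ≈ 26.1% for Kazuma and 100 × 13/22 ≈ 59.1% for Mana.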

Claros et al. (2012) have summarised that, although the assembly of euchromatic regions is probably not affected by heterozygosity, heterozygosity led to a total of 15% of reads in poplar (Populus trichocarpa) that could not be assembled, presumably containing heterochromatic sequences. Heterozygosity can also be misinterpreted as false segmental duplication in the assembly, with the reads from heterozygous segments assembled into different contigs which are subsequently scaffolded adjacent to one another (Claros et al. 2012). Jaillon's group (2007) used a near-homozygous grape line derived from nine successive generations of self-pollination. The line was estimated to be 93% homozygous as assessed by 36 SSR markers, although only 2.6% (or 12 Mb) heterozygosity was observed or made it through into their final assembly. They observed that the depth of coverage of this type of region was reduced by half compared with homozygous supercontigs.

Given that a number of factors hinder good-quality plant genome assembly, selecting the right material to be sequenced is important, particularly for minor and underutilised crops. SSR markers are relatively easy to develop and are often present in transcriptomes, which are becoming cheaper to generate and are available for many crop species. SSR markers are co-dominant, multi-allelic and highly robust, making them an ideal tool for quality control in research. With the increasing knowledge about the diversity and dynamic nature of genomes, perhaps a better term for a newly assembled genome would be a crop 'Sample' genome rather than a 'Reference' genome. Having a high-quality assembled genome is critical for its application in advanced breeding and research programmes.

References

Claros, M.G., Bautista, R., Guerrero-Fernández, D., Benzerki, H., Seoane, P., and Fernández-Pozo, N. 2012. Why assembling plant genome sequences is so challenging. Biology 1:
Jaillon, O., Aury, J-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:
Schatz, M.C., Witkowski, J., and McCombie, W.R. 2012. Current challenges in de novo plant genome sequencing and assembly. Genome Biology 13: 243. doi: /gb
Schuelke, M. 2000. An economic method for the fluorescent labelling of PCR fragments. Nat. Biotechnol. 18:
Wu, Z., Tembrock, L.R., and Ge, S. 2015. Are differences in genomic data sets due to true biological variants or errors in genome assembly: An example from two chloroplast genomes. PLoS ONE 10(2): e
Zook, J.M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., and Salit, M. 2014. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32:

Table 1. Allelic scoring of 20 SSR markers from 23 plants of bambara groundnut Kazuma line (KS1-23). Variants that differ from the majority allele call are shaded in grey.

Columns: SSR marker; KS1 to KS23. Final row: No. of minor allelic calls (out of 20).

* Suspected to be a heterozygote.

Table 2. Allelic scoring of 20 SSR markers from 22 plants of bambara groundnut Mana line (MS1-22). Variants that differ from the majority allele call are shaded in grey.

Columns: SSR marker; MS1 to MS22. Final row: No. of minor allelic calls (out of 20).

* Suspected to be a heterozygote.

Appendix

Table A1. List of SSR primers used.

SSR primer   Sequence (5' -> 3')          Annealing temperature (°C)
P1 F         AACTTGCCATACGTGGAAGG
P1 R         ACACGCTGCATAATTCACCA
P7 F         GTAGGCCCAACACCACAGTT
P7 R         GGAGGTTGATCGATGGAAAA
P10 F        TCAGTGCTTCAACCATCAGC
P10 R        GACCAAACCATTGCCAAACT
P15 F        AGGAGCAGAAGCTGAAGCAG
P15 R        CCAATGCTTTTGAACCAACA
P16 F        CCGGAACAGAAAACAACAAC
P16 R        CGTCGATGACAAAGAGCTTG         57
P19 F        AGGCAAAAACGTTTCAGTTC
P19 R        TTCATGAAGGTTGAGTTTGTCA
P21 F        CAAACTCCACTCCACAAGCA
P21 R        CCAACGACTTGTAAGCCTCA         57
P23 F        CAGTAGCCATAATTTGCTATGAACA
P23 R        CGAATCACCATTCAATACGC
P30 F        AATGCAAGATTTTGGCTTGG
P30 R        CCCACTCAAACCATACACCA
P31 F        GCTAAGGTGGAGTGGTGGAA
P31 R        CAATCATCTTTTGCGCTTCA         57
P32 F        TTCACCTGAACCCCTTAACC
P32 R        AGGCTTCACTCACGGGTATG         57
P33 F        ACGCTTCTTCCCTCATCAGA
P33 R        TATGAATCCAGTGCGTGTGA         57
P37 F        CCGATGGACGGGTAGATATG
P37 R        GCAACCCTCTTTTTCTGCAC
P44 F        TGTGGGCGAAAATACACAAA
P44 R        TCGTCGAATACCTGACTCATTG
Pco F        GAGTCCAATAACTGCTCCCGTTTG
Pco R        ACGGCAAGCCCTAACTCTTCATTT
D8 F         GCATCTTTACAGCAAGAGTTTCAA
D8 R         TGGATCTTCCTCATTGCAGTATAA
D11 F        GAGGAAATAACCAAACAAACC
D11 R        CTTACGCTCATTTTAACCAGACCT
D14 F        GAACGAAGCCAGGATAATGATAGT
D14 R        CGAAAGCGACAACTCACTACTAAA
D15 F        TGACGGAGGCTTAATAGATTTTTC
D15 R        GACTAGACACTTCAACAGCCAATG
E7 F         CATGATTTGTTGTGATGATGAT
E7 R         AACAACAAATGTACCAAAGAATCG     51
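
In the three-primer reaction described in the Materials and methods, the forward primer carries a 5' tail matching the dye-labelled universal primer. The sketch below shows how tailed forward primers could be generated from the sequences in Table A1. It assumes that the tail is the commonly used 18-nt M13(-21) sequence; the manuscript does not state the tail sequence, so this value, like the function name, is an assumption made for illustration only.

```python
# ASSUMPTION: M13(-21) universal tail; the manuscript does not give the tail sequence.
M13_TAIL = "TGTAAAACGACGGCCAGT"

# Two forward primers copied from Table A1 (names as in the table).
forward_primers = {
    "P1": "AACTTGCCATACGTGGAAGG",
    "E7": "CATGATTTGTTGTGATGATGAT",
}


def add_m13_tail(forward: str, tail: str = M13_TAIL) -> str:
    """Return the forward primer with the universal tail added at its 5' end."""
    seq = forward.upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters in primer: {sorted(invalid)}")
    return tail + seq


tailed = {name: add_m13_tail(seq) for name, seq in forward_primers.items()}

if __name__ == "__main__":
    for name, seq in tailed.items():
        print(f"{name} F (tailed, {len(seq)} nt): {seq}")
```

The reverse primers are used unmodified; only the forward primer is tailed so that the labelled universal primer is incorporated in later PCR cycles.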