Whole-genome strategies for marker-assisted plant breeding

Mol Breeding (2012) 29:833-854
DOI 10.1007/s11032-012-9699-6

Yunbi Xu, Yanli Lu, Chuanxiao Xie, Shibin Gao, Jianmin Wan, Boddupalli M. Prasanna

Received: 12 September 2011 / Accepted: 2 January 2012 / Published online: 3 February 2012
© Springer Science+Business Media B.V. 2012

Y. Xu (corresponding author): Institute of Crop Sciences/International Maize and Wheat Improvement Center (CIMMYT), The National Key Facility for Crop Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, Beijing 100081, China. e-mail: y.xu@cgiar.org
Y. Lu, S. Gao: Maize Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan 611130, China
C. Xie, J. Wan: Institute of Crop Sciences, The National Key Facility for Crop Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, Beijing 100081, China
B. M. Prasanna: International Maize and Wheat Improvement Center (CIMMYT), ICRAF House, United Nations Avenue, Gigiri, Nairobi, Kenya

Abstract

Molecular breeding for complex traits in crop plants requires understanding and manipulation of many factors influencing plant growth, development and responses to an array of biotic and abiotic stresses. Molecular marker-assisted breeding procedures can be facilitated and revolutionized through whole-genome strategies, which utilize full genome sequencing and genome-wide molecular markers to effectively address various genomic and environmental factors through a representative or complete set of genetic resources and breeding materials. These strategies are now increasingly based on understanding of specific genomic regions, genes/alleles, haplotypes, linkage disequilibrium (LD) block(s), gene networks and their contribution to specific phenotypes. Large-scale and high-density genotyping and genome-wide selection are two important components of these strategies. As components of whole-genome strategies, molecular breeding platforms and methodologies should be backed up by high-throughput and precision phenotyping and e-typing (environmental assay) with strong support systems such as breeding informatics and decision support tools. Some basic strategies are discussed in this article, including (1) seed DNA-based genotyping for simplifying marker-assisted selection (MAS), reducing breeding cost and increasing scale and efficiency, (2) selective genotyping and phenotyping, combined with pooled DNA analysis, for capturing the most important contributing factors, (3) flexible genotyping systems, such as genotyping by sequencing and arraying, refined for different selection methods including MAS, marker-assisted recurrent selection and genomic selection (GS), (4) marker-trait association analysis using joint linkage and LD mapping, and (5) sequence-based strategies for marker development, allele mining, gene discovery and molecular breeding.

Keywords: Molecular breeding; Whole-genome strategies; Marker-assisted selection; Marker-assisted recurrent selection; Genomic selection; Genotyping platform; Precision phenotyping; Environmental assay (e-typing); Breeding informatics; Decision support tools

Abbreviations

CGIAR: Consultative Group on International Agricultural Research
CIMMYT: International Maize and Wheat Improvement Center
DH: Doubled haploid
eQTL: Expression quantitative trait locus/loci
GBS: Genotyping-by-sequencing
GEBV: Genomic estimated breeding value
GEI: Genotype-by-environment interaction
GIS: Geographic information system
GS: Genomic selection
GWA: Genome-wide association
HapMap: Haplotype map
IPPN: International Plant Phenomics Network
LYCE: Lycopene epsilon cyclase
LD: Linkage disequilibrium
MABC: Marker-assisted backcrossing
MAGIC: Multiparent advanced generation intercross
MARS: Marker-assisted recurrent selection
MAS: Marker-assisted selection
mQTL: Metabolite quantitative trait locus/loci
NAM: Nested association mapping
NGS: Next-generation sequencing
QTL: Quantitative trait locus/loci
pQTL: Protein quantitative trait locus/loci
phQTL: Phenotypic quantitative trait locus/loci
PoDA: Pathways of distinction analysis
RE: Restriction enzyme
RIL: Recombinant inbred line
SLB: Southern corn leaf blight
SNP: Single nucleotide polymorphism
TILLING: Targeting induced local lesions IN genomes
TP: Training population

Introduction

During the past five decades, intensive efforts in crop improvement worldwide have led to significant improvements in yield potential, biotic and abiotic stress tolerance and nutritional quality of many major crops, including the most important staple food crops like rice, wheat and maize. However, to feed the rapidly increasing populations in the developing world as well as to counter the challenges imposed by global climate changes and the fragile natural resource base, crop yields need to be at least doubled again during the next three to four decades. At the same time, significant gaps still exist between the yields that can be achieved in optimized experimental conditions and the yields that are realized in farmers' fields.
Taking rice as an example, the yield that can be harvested in farmland with normal agronomic practices in Asia is about 5 t/ha, which is less than 30% of the yield potential that has been achieved by breeders in their experimental stations, leaving a big yield gap to be filled (Chaudhary 2000). While several factors, both socio-economic and biological, contribute to such gaps, progress through cost- and time-effective breeding is becoming more relevant than ever before. Breeding for complex traits needs to take into account various factors, such as understanding of the genetic, physiological and molecular bases of the traits, including interactions among the component traits and with the environments. New technologies must be developed to accelerate breeding through improved genotyping and phenotyping methods and increased availability of genetic diversity in breeding germplasm (Tester and Langridge 2011). Reduction of genotyping cost, either through genotyping-by-sequencing or through chips, has made it possible to obtain the genetic information required to cover the whole genome. Increasing importance is being given to high-throughput phenotyping. Advances in phenomics approaches, including instrumentation, robotics and computational software, have made high-throughput phenotyping feasible. Equally important is the possibility of collecting and analyzing various environmental factors that affect field trials. Whole-genome strategies now facilitate effective design and implementation of molecular marker-assisted plant breeding by bringing together all the relevant information about genotypes, phenotypes and the environments.
Although molecular breeding through gene transfer and marker-assisted selection (MAS) has been successful in the private sector (especially in the multinational corporations), its wider use, particularly in public-sector institutions in the developing world, is still limited by several bottlenecks (Xu and Crouch 2008; Ribaut et al. 2010; Delannay et al. 2012; Tester and Langridge 2011). The constraints include limited availability of cost-effective and high-throughput genotyping systems, limited understanding of the genetic architecture of complex traits, unsuitable molecular techniques, complicated genotype-by-environment interactions (GEIs), and the lack of powerful informatics and decision support tools. The road from basic genomics research to impacts on routine breeding programs has indeed been long, winding and bumpy, not to mention some wrong turns and unexpected blockades. As a result, genomics can be effectively used in plant breeding programs only when an integrated package is implemented that includes high-throughput techniques, cost-effective protocols, global integration of genetic and environmental factors and precise knowledge of quantitative trait inheritance (Xu 2010; Tester and Langridge 2011). The challenge is to translate and integrate the new knowledge from genomics and molecular biology, including the whole-genome strategies to be discussed in this article, into appropriate tools and methodologies for public-sector plant breeding programs.

Concept of whole-genome strategies

The whole-genome strategies can be defined as a full package of functional tools and methodologies required for molecular plant breeding at the level of the whole genome. The development of these strategies includes complete genomic sequences for all germplasm accessions, molecular markers covering important genomic regions, genes and functional alleles, a high-precision phenotyping system for various target traits (measured under multiple environments), and integration of information on relevant environmental factors influencing genes, genotypes and the whole-plant performance (Fig. 1). The ultimate goal of whole-genome strategies is to help bring the best combinations of genotypes/genes, alleles or haplotypes, linkage disequilibrium (LD) blocks, optimized gene networks and specific genomic regions into breeding products with desirable phenotypes.
One of the important concepts in molecular breeding was proposed years ago for quantitative trait locus (QTL) pyramiding, separating and cloning, including how multiple QTL could be manipulated for development of a desirable genotype (Xu 1997). Many thoughts that seemed naïve when proposed became reality during the 2000s. With the identification of numerous QTL for model crops, such as rice, a global view of QTL has become relevant, also taking into account various genetic background effects and GEI (Xu 2002). In terms of selection strategies, molecular breeding has experienced two major developmental stages, MAS and genomic selection (GS). The first stage is based on significant associations between markers and target traits, where only the markers of significance are used for selection. The representative methods for the first stage include marker-assisted backcrossing (MABC) or introgression for major genes or QTL with relatively large effect (Hospital et al. 1992; Hospital and Charcosset 1997; Hospital 2001; Stam 2003; Frisch 2004), and marker-assisted recurrent selection (MARS) for complex traits (Edwards and Johnson 1994; Lee 1995; Stam 1995). The second stage is based on all the markers that cover the whole genome and is represented by GS, where all the markers are included for model development and progeny prediction. Therefore, GS represents a big step towards the whole-genome strategies.

Fig. 1 A flowchart for whole-genome strategies in marker-assisted plant breeding. The system starts with natural and artificial crop populations to develop novel germplasm through four key platforms, genotyping, phenotyping, e-typing (environmental assay), and breeding informatics, which need a decision support system in various steps towards product development

The most frequently used approach for marker-trait association analysis has been linkage analysis with biparental or multi-parental populations, followed by a more recent method, linkage disequilibrium (LD) or association mapping using natural populations. Linkage-based mapping can now be upgraded to the level of the whole genome by high-density markers that cover all genes and alleles through genotyping-by-sequencing or chip-based genotyping, which has been demonstrated in rice (Huang et al. 2009; Xie et al. 2010) and in maize and barley (Elshire et al. 2011). Because of the availability of multiple markers from each gene, LD mapping in plants has also shifted from the candidate gene-based strategy to the whole-genome scan (Atwell et al. 2010; Huang et al. 2010, 2012). Linkage tests can now be performed through comparative and selective analyses for all the genes simultaneously, through developing mutation and nearly isogenic line libraries and using genome-wide selective sweeps, respectively. Precision phenotyping and e-typing (environmental assay), two other important components of the whole-genome strategies, will be discussed later in this paper.

Population size

Population size matters very much in many important aspects of marker-assisted breeding.
To introgress or transfer major gene-controlled traits, the population size required will depend on the distance between markers and target gene, recombination frequency around the region, genetic properties of the target trait such as the degree of dominance, etc. For minor gene-controlled or complex traits, the population size required should consider some additional factors associated with the target genes, such as gene number, effect, interactions and their relative positions on chromosomes. High-density markers must be matched with large population sizes. For example, identification of the recombinants between two tightly linked genes depends on a population size that allows the target genes to segregate and recombine. The population sizes recommended for effectively identifying marker-trait associations and for further use in MAS schemes have been justified largely based on the number of markers available years ago, while also taking into account the genotyping cost. With high-throughput genotyping costs going down dramatically, cost-effective analysis of large populations is now feasible. Analysis of large populations in molecular breeding has also been facilitated by two approaches. First, seed DNA-based genotyping can effectively replace leaf DNA-based genotyping (Gao et al. 2008). Because sampling and DNA extraction can be easily automated with seed samples, as practiced by several multinational corporations, managing large populations has become much more practical. In some special cases, as many samples as possible can be tested until a reasonable number of individuals with desirable genotypes are identified. As selections can be done before planting, the costs associated with planting and sampling can be significantly reduced.
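How many seeds must be screened before "a reasonable number" of desirable genotypes turns up can be approximated with a binomial model. The sketch below is illustrative only and not taken from the article: the function names, the 95% confidence default and the example genotype frequencies are assumptions, and it treats each seed as an independent draw with a fixed probability of carrying the target genotype.

```python
from math import comb


def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that n seeds yield at least k targets."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def seeds_to_screen(k, p, confidence=0.95):
    """Smallest number of seeds giving at least k target genotypes with the given confidence.

    k          -- number of individuals with the desirable genotype wanted (hypothetical input)
    p          -- expected frequency of the desirable genotype in the population
    confidence -- assumed probability threshold (0.95 is an arbitrary default)
    """
    n = k
    while prob_at_least(n, k, p) < confidence:
        n += 1
    return n


# Assumed example: homozygosity for the favorable allele at three unlinked loci
# in an F2 population has an expected frequency of (1/4)**3 = 1/64.
print(seeds_to_screen(k=1, p=(1 / 4) ** 3))
# Single-locus case, expected frequency 1/4.
print(seeds_to_screen(k=1, p=1 / 4))
```

Linkage between target loci, segregation distortion or genotyping error would change the required numbers, so the output is best read as a planning figure.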
However, caution should be taken when this method is used for monocots, where the genotype determined using the seed endosperm may be different from that of the plant developed from the embryo due to heterofertilization (Gao et al. 2010). In maize, the genotyping error caused by heterofertilization could vary from 0.14 to 3.12%, depending on populations. Selective genotyping and pooled DNA analysis (Stuber et al. 1980; Lebowitz et al. 1987; Lander and Botstein 1989; Giovannoni et al. 1991; Michelmore et al. 1991) is the second approach by which the population size can be significantly increased while reducing the cost (Xu and Crouch 2008; Xu 2010). Compared to entire population analysis, this approach has been shown to have significant advantages in terms of cost savings, with negligible practical disadvantages in terms of power of detection in medical genomics research (Knight and Sham 2006; MacGregor et al. 2008). However, previous applications of this approach in plants have been confounded by the small size of entire and tail populations, and insufficient marker density, which result in a high probability of false positives in QTL detection (Xu and Crouch 2008). Using population sizes of up to 3,000, tail population sizes of 30-100, and marker densities up to one marker per centimorgan, selective genotyping can be used to replace entire population genotyping for mapping QTL with relatively small effects, as well as linked and interacting QTL (Sun et al. 2010). Other simulations also indicate that it should be capable of detecting numerous small-effect loci with high resolution when more than 10^5 cross progeny are used, as in the case of yeast (Ehrenreich et al. 2010). Finally, selective genotyping and pooled DNA analysis can be used for populations of any type, including natural populations and mutation/introgression libraries. With all available genetics and breeding materials, it is theoretically possible to develop an all-in-one plate approach where one 384-well plate could be designed to map almost all agronomic traits of importance for a crop species. To pilot test this proposition, the CIMMYT maize molecular breeding group collected over 3,000 maize lines representing phenotypic extremes for important agronomic traits, from maize breeding and genetics programs across the world (Xu et al. 2009). By identifying extreme phenotypes from segregating populations involving multiple parental lines that are being used in breeding programs, one can assume that the selected extremes host diverse favorable alleles from different sources that have been brought together into a single population by intermating and selection. The selected extremes are then used for rapid discovery of individual genes/alleles and their combined effects (Sun et al. 2010). This approach would be particularly powerful (in terms of speed and cost) when combined with selection under appropriate target biotic or abiotic stresses where a large number of plants can be selected for extreme phenotypes.
Compared with the normal genetics-to-breeding approach, this reversed breeding-to-genetics approach can save 3-4 crop seasons in each cycle and can be fully integrated with ongoing breeding programs. It might be argued that the significant cost reduction in genotyping has made selective genotyping less attractive and that genotyping cost should no longer be a limiting factor for molecular breeding. However, combining selective genotyping with pooled DNA analysis still provides several advantages compared to entire population analysis. First, in analysis of a large population with 2,000 individuals where 30 individuals from each tail are selected, selective genotyping will cost only 3% of genotyping the entire population (60 of 2,000 individuals), while pooled DNA analysis would provide a substantial further saving, bringing the cost to 0.1% of genotyping the entire population (two pooled samples instead of 2,000 individual samples). Secondly, sampling and data analysis become much easier as pooled DNA analysis brings the overall experiment scale down to 0.1%. This makes it easy to increase population sizes a hundred-fold. Thirdly, when selecting for rare alleles from large populations, multiple subpools can be created and genotyped, followed by individual genotyping after appropriate subpools are identified. These advantages can be fully utilized for all kinds of molecular markers and genotyping platforms, including chip- or sequencing-based genotyping. In plants, cost-effective protocols for using next-generation sequencing (NGS) in association mapping studies were described based on pooled (a few individuals in each pool) and un-pooled samples, and optimal designs were identified with respect to total number of individuals, number of individuals per pool, and the sequencing coverage (Kim et al. 2010b).
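The 3% and 0.1% figures quoted above for a population of 2,000 with 30 individuals per tail follow from simple counting. The sketch below reproduces them under two stated assumptions that are not spelled out in the article: every genotyping assay costs the same, and each tail is genotyped as exactly one pooled sample. The function name is hypothetical.

```python
def genotyping_cost_fractions(population_size, tail_size, n_tails=2):
    """Return (selective, pooled) genotyping cost as fractions of whole-population cost.

    Assumes equal cost per assay and one pooled assay per tail (illustrative assumptions).
    """
    # Selective genotyping: only the individuals in the tails are genotyped.
    selective = n_tails * tail_size / population_size
    # Pooled DNA analysis: one bulked sample per tail is genotyped.
    pooled = n_tails / population_size
    return selective, pooled


selective, pooled = genotyping_cost_fractions(population_size=2000, tail_size=30)
print(f"selective: {selective:.1%}, pooled: {pooled:.1%}")
# prints "selective: 3.0%, pooled: 0.1%"
```

In practice a pooled assay may need deeper sequencing or replicate pools, so the pooled fraction is a lower bound under these assumptions.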
Overall, with a fixed cost, sequencing many individuals at a shallower depth with larger pool size achieved higher power than sequencing a small number of individuals in greater depth with smaller pool size, even in the presence of high error rates. For plant species where two parents are sequenced de novo in great depth (20-60×), their progeny can be grouped into case and control groups or selected for phenotypic extremes, with each group containing 100 individuals and sequenced in great depth (e.g. 50×). Genes and markers for the target trait can be identified based on the allele frequency difference between the two pools or groups.

Genome coverage

Genome coverage is another important criterion for whole-genome strategies. In general, the efficiency of marker-assisted breeding depends on the level of genome coverage. Table 1 compares the regional-genome strategies that have been used for the past decades with the whole-genome strategies that are becoming increasingly practical. Several measures can be taken to increase the genome coverage, including high-resolution genotyping, complete genome resequencing, genome-wide association (GWA) analysis, and MARS or GS.

Table 1 Comparison of platforms and tools between regional and whole genome strategies in marker-assisted plant breeding

Aspect | Regional genome analysis | Whole genome analysis
DNA sampling | Leaf | Seed and leaf
Population management | Biparental populations and independent association panels | Combining use of all kinds of populations including NAM, MAGIC and natural populations
Genotyping | Genotyping by markers or chips | Genotyping by sequencing and high-density marker chips
Phenotyping | Phenotyping individual target traits | High-throughput precision phenotyping for all traits
Marker-trait association | Association with markers or selected candidate genes | GWAS using high density markers or GBS
Selection | Based on significantly associated markers | Based on all markers with estimated effects
Environmental effect | Evaluated based on phenotypic data without use of environmental data | Evaluated by both phenotypic and environmental data collected through years and locations
Information management | Two-dimensional: G-P, through Excel and databases | Three-dimensional: G-P-E, through Web-based tools or networking
Decision support tools | Individual decisions supported by separate tools | Collective decision supported by integrated tools with global thinking

Number of markers

The number of markers required for whole-genome coverage depends on the genome size. To measure the genetic variation for each allelic variation within a gene and its neighboring regions, the marker density should be high enough to cover all allelic variation, which would need tens to hundreds of markers for each gene. Taking association mapping as an example, a whole-genome scan will need at least thousands of markers for species with slow LD decay (such as rice) but millions of markers for the species with rapid LD decay (such as maize).
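The contrast between "thousands" and "millions" of markers can be illustrated with a back-of-the-envelope rule: at least one marker per stretch of genome over which LD persists. The sketch below is a rough approximation, not a method from the article. The function name is hypothetical, and the genome sizes and LD decay distances are order-of-magnitude placeholders chosen for illustration (they are not given in the text and vary widely among populations).

```python
from math import ceil


def markers_for_genome_scan(genome_size_bp, ld_decay_bp, markers_per_block=1):
    """Rough minimum marker count: one or more markers per LD-decay-sized segment."""
    return ceil(genome_size_bp / ld_decay_bp) * markers_per_block


# Illustrative placeholder values (assumptions, not figures from the article):
rice = markers_for_genome_scan(genome_size_bp=400_000_000, ld_decay_bp=100_000)
maize = markers_for_genome_scan(genome_size_bp=2_300_000_000, ld_decay_bp=1_000)

print(rice)   # 4000
print(maize)  # 2300000
```

Real designs would also account for uneven LD along chromosomes, marker informativeness and missing data, all of which push the required number upward.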
Haplotypes and tag single nucleotide polymorphisms (SNPs)

Single SNP-based association analysis neglects the fact that SNPs are not independent of one another. The determination of critical values of sequence variation must, however, be made in the context of LD among SNPs. As DNA sequence variation in a population is the result of the past transmission of that variation, this history can be of considerable value in trying to achieve the primary goal of finding associated genes. There are three reasons why haplotype-based analysis should be an improvement (Clark 2004). First, the protein products of the candidate genes occur in polypeptide chains whose folding and other properties may depend on particular combinations of amino acids. Second, population genetic principles show us that variation in populations is inherently structured into haplotypes. Third, regardless of the population genetic reasons, haplotypes serve to reduce the dimensionality of the problem of testing association, and so they may increase the power of those tests. Maize was the first plant species for which a haplotype map (HapMap) was constructed. Several million sequence polymorphisms were identified among 27 diverse maize inbred lines and it was discovered that the genome was characterized by highly divergent haplotypes (Gore et al. 2009). Haplotype-based mapping can be used to replace individual marker-based mapping to improve the mapping power and identify specific alleles within a gene or allele combinations at different loci that contribute to the same target trait, depending on how a haplotype is constructed. Tag SNPs can be developed, each representing one haplotype fragment (Johnson et al. 2001). All tag SNPs together cover the whole genome. In maize, the use of haplotypes constructed with all SNPs within 10-kb windows improved QTL mapping efficiency (Lu et al. 2010).
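As a minimal sketch of window-based haplotype construction, the code below groups SNPs into fixed, non-overlapping 10-kb windows and concatenates each line's alleles within a window into a haplotype string. The windowing scheme, the function name and the toy data are assumptions for illustration; the cited study may have defined windows differently. Inbred lines are assumed, so each SNP is represented by a single allele character.

```python
from collections import defaultdict


def window_haplotypes(positions, genotypes, window_bp=10_000):
    """Build per-window haplotype strings for each line.

    positions -- SNP positions in bp along one chromosome
    genotypes -- dict mapping line name to a string with one allele per SNP
    window_bp -- window width (10 kb here, fixed and non-overlapping by assumption)
    """
    # Assign each SNP index to the window that contains its position.
    snps_in_window = defaultdict(list)
    for idx, pos in enumerate(positions):
        snps_in_window[pos // window_bp].append(idx)

    # Concatenate the alleles of each line within each window.
    haplotypes = {}
    for window, idxs in sorted(snps_in_window.items()):
        haplotypes[window] = {
            line: "".join(alleles[i] for i in idxs)
            for line, alleles in genotypes.items()
        }
    return haplotypes


# Toy data (hypothetical): five SNPs, three inbred lines.
positions = [1_200, 5_400, 9_800, 15_000, 18_000]
genotypes = {"line_A": "ACGTT", "line_B": "ACGCA", "line_C": "GTGTT"}

haps = window_haplotypes(positions, genotypes)
for window, by_line in haps.items():
    print(window, by_line)
```

Each distinct string within a window can then be treated as one allele of a multi-allelic marker in the association test, which is how haplotypes reduce the number of tests relative to single-SNP analysis.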
Sequencing quality and quantity

Establishment of a complete genome sequence should consider both quality and quantity. Feuillet et al. (2011) reviewed the current crop genome sequencing activities, discussed how variability in sequence quality impacts utility for different studies, and provided a perspective for a paradigm shift in selecting crops for sequencing in the future. For large and complex genomes, this will require strategies that profit from the new sequencing technologies as soon as they become reasonably affordable while maintaining the possibility of adding quality until the high-quality reference sequence is achieved. Our experience in maize resequencing indicates that achieving a high-quality reference genome sequence needs de novo sequencing of multiple genomes. Even for crop species such as rice, with one high-quality reference genome available already, more reference genomes are expected in order to provide better coverage of genetic diversity. Rice scientists hope to have a high-quality reference genome for indica rice, as they have had for japonica rice. The same is true for maize. As the first reference maize genome is based on a temperate maize line, B73, and the second genome (Mo17) will be for temperate maize too, the maize community is now looking forward to having at least one tropical maize inbred fully sequenced with high quality. As tropical maize has received less attention in genetics and genomics studies, de novo sequencing of tropical elites and landraces will be very useful for future maize improvement. When a high-quality reference genome becomes available, resequencing of a panel of diverse germplasm can be used to reveal a full profile of genetic diversity and the process of domestication and improvement, as shown in rice, soybean and maize (Huang et al. 2010, 2012; Lai et al. 2010; Lam et al. 2010). On the other hand, segregating populations can also be resequenced. A high-resolution genetic map can be constructed with only 0.02× resequencing, as shown in rice (Huang et al. 2009), although a much higher sequence depth, 0.1× or higher, would be needed for highly diverse crop species like maize.
Ongoing maize and rice resequencing projects also reveal that at least one segregating population should be resequenced for each crop species, because a resequenced population can be used not only for filling gaps in the reference genome but also for obtaining important information for genetics and breeding, such as genome-wide variation in recombination frequency, segregation distortion and a complete profile of structural variation.

In the few years since its initial application, massively parallel cDNA sequencing, or RNA-seq, has allowed many advances in the characterization and quantification of transcriptomes. Several recent developments in RNA-seq methods have provided an even more complete characterization of RNA transcripts. These developments include improvements in transcription start site mapping, strand-specific measurements, gene fusion detection, small RNA characterization and detection of alternative splicing events (Ozsolak and Milos 2011). Ongoing developments promise further advances in the application of RNA-seq, particularly direct RNA sequencing and approaches that allow RNA quantification from very small amounts of cellular material.

Molecular networks

Plant epigenetics has recently gained unprecedented interest, not only as a subject of basic research but also as a possible new source of beneficial traits for plant breeding. Recent studies show that epigenetic pathways (e.g. DNA methylation, histone variants and modifications, positioning of nucleosomes, and small RNA) are important components in the regulation of plant growth and reproduction. Multiple aspects of plant development, including flowering time, gametogenesis, stress response, light signaling, and morphological change, are modulated directly or indirectly by epigenetic marks (Feng and Jacobsen 2011).
Since the mechanisms for epigenetic regulation are responsible for the formation of heritable epigenetic gene variants (epialleles) and also regulate transposon mobility, both aspects could be exploited to broaden plant phenotypic and genetic variation, which could improve plant adaptation to environmental challenges and thus increase productivity (Mirouze and Paszkowski 2011). During the past decade, bottom-up and top-down approaches to network reconstruction have greatly facilitated the integration and analysis of biological networks, including transcriptional, protein interaction, and metabolic networks. As increasing amounts of multidimensional high-throughput data become available, biological networks have also been upgraded, allowing a more accurate understanding of whole-cell characteristics. Ultimately, the integration of diverse and massive datasets into coherent models will improve our understanding of the molecular networks that underlie biological processes (Moreno-Risueno et al. 2010). Results from epigenomic studies are broadening our understanding of plant genomes and are also providing important clues regarding the mechanisms and functions of the pathways that can be further tested using genetic and biochemical approaches (Schmitz and Zhang 2011). The integration of diverse Arabidopsis genome-wide datasets in probabilistic functional networks has been demonstrated as a feasible strategy for associating novel genes with traits of interest, and novel genomic methods continue to be developed (Ferrier et al. 2011). The combination of genome-wide location studies, using ChIP-Seq, with gene expression profiling data is affording a genome-wide view of regulatory networks previously delineated through genetic and molecular analyses, leading to the identification of novel components and of new connections within these networks. This trend is expected to continue, and its outcomes will allow the development of more sophisticated networks integrating diverse omics data and enhance our understanding of biological systems (Kim et al. 2010a).

Marker-trait association analysis

Different approaches have been developed for marker-trait association analysis. To date, thousands of studies have been published on mapping phenotypic QTL (phQTL) in crop plants. Current momentum in QTL analysis is toward understanding the genetic regulation of gene expression by quantifying the transcript levels of genes, or expression QTL (eQTL) (Holloway and Li 2010). New technology can be used for the parallel measurement of the abundance of thousands of proteins and metabolites to map protein QTL (pQTL) and metabolite QTL (mQTL). A system-wide analysis can reveal the impact of DNA sequence variation across multiple levels; that is, eQTL at the gene expression level, pQTL for protein abundance or activity traits, mQTL for metabolite abundances and/or phQTL for morphological traits (Jansen et al. 2009b). Several comprehensive reviews (Collins et al. 2008; Roy et al. 2011) cover QTL and crop performance under various abiotic stresses.
Marker-trait association analysis has been undertaken largely using biparental populations such as F2/F2:3, BC1, doubled haploid (DH), and recombinant inbred line (RIL) populations, where only a few target traits can be mapped with each population. For fine mapping and gene discovery, several rounds of MAS and population development are needed to narrow down the genomic regions. This method is time-consuming but very powerful for genes with large effects and alleles with low frequency. Here, we discuss only multiparental and natural populations, which are more relevant to the whole-genome strategies (Table 1).

Multiparental population-based strategies

The first multiparental population developed in crop plants was the nested association mapping (NAM) population in maize (Yu et al. 2008). By using 25 diverse inbred maize lines as founder lines to cross with the B73 reference line, a population consisting of 5,000 RILs (about 200 from each cross) was developed, which captured a total of 136,000 recombination events. Bergelson and Roux (2010) systematically compared NAM with traditional linkage mapping and association mapping, indicating that the joint analysis of data sets from natural accessions and NAM populations should greatly increase the power to fine-map genomic regions associated with phenotypic variation. Analysis of the NAM population found evidence for numerous minor single-locus effects but little two-locus LD or segregation distortion, indicating a limited role for genes with large effects and for epistatic interactions on fitness (McMullen et al. 2009). This population has also been used for dissection of variation in flowering time (Buckler et al. 2009) and resistance to southern corn leaf blight (SLB) disease (Kump et al. 2011).
As most studies employing either simple synthetic populations with restricted allelic variation or association mapping on a sample of naturally occurring haplotypes have some limitations, alternative resources for the genetic dissection of complex traits should continue to be sought. Another important multiparental population-based approach is the multiparent advanced generation inter-cross (MAGIC). The MAGIC approach is expected to improve the precision with which QTL can be mapped, improving the outlook for QTL cloning. The first panel of MAGIC lines developed consists of a set of 527 RILs descended from a heterogeneous stock of 19 intermated accessions of Arabidopsis thaliana (Kover et al. 2009). These lines and the 19 founders were genotyped with 1,260 SNPs and phenotyped for development-related traits. Analytical methods were developed to fine-map QTL in the MAGIC lines by reconstructing the genome of each line as a mosaic of the founders. Simulation showed that QTL explaining 10% of the phenotypic variance can be detected in most situations with an average mapping error of about 300 kb, and that if the number of lines were doubled the mapping error would be under 200 kb.

Natural population-based strategies

Increased availability of high-throughput genotyping technology, together with advances in DNA sequencing and the development of statistical methodology appropriate for GWA mapping in the presence of considerable population structure, has contributed to the increased interest in association mapping in crop plants. While most published studies in crop species are candidate gene-based, GWA studies are on the increase (Rafalski 2010). It is becoming clear why GWA works so well for traits that are simply inherited. Atwell et al. (2010) provided several examples in their study, including responses to disease-resistance genes. In all cases, GWA yielded unambiguous results regardless of whether they corrected for population structure. The reason is not that there is no confounding in these cases. The problem that has received so much attention in human genetics, namely inflated significance among unlinked, non-causal loci, is present. With truly genome-wide coverage, however, this is not very important because the true positive is expected to show the strongest association.

The increasing availability of high-throughput sequencing technologies has enabled studies of rare variants, but these technologies alone will not be sufficient, as appropriate analytical methods are also needed. Bansal et al. (2010) considered data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts. The typical GWA analysis techniques treat markers individually. However, complex traits are unlikely to have a single causative gene.
Thus, there is a pressing need for multi-SNP analysis methods that can reveal system-level differences between cases and controls. Braun and Buetow (2011) presented a novel multi-SNP GWA analysis method called pathways of distinction analysis (PoDA). The method uses GWA data and known pathway-gene and gene-SNP associations to identify pathways that permit the distinction of cases from controls in human disease genetics. This can be done in plants using the trait-based analysis approach described by Xu (2010).

Joint linkage and LD (association) mapping

It has been emphasized that linkage and LD (association) mapping are complementary approaches and are more similar than is often assumed (Myles et al. 2009). Unlike in vertebrates, where controlled crosses can be expensive or impossible, the plant scientific community can exploit the advantages of both controlled crosses and LD mapping to increase statistical power and mapping resolution. To effectively combine their advantages, a joint linkage-LD mapping strategy has been proposed (Myles et al. 2009; Lu et al. 2010). Joint mapping can be done through parallel mapping, which runs linkage and LD mapping on biparental and natural populations separately but in parallel, or through a single integrated mapping that combines the information from both biparental and natural populations. The first joint mapping was reported in maize using both parallel and integrated mapping approaches (Lu et al. 2010), and involved three RIL populations and one natural population of 305 inbred lines, genotyped with 2,053 SNP markers. Joint mapping for anthesis-silking interval (a trait related to drought tolerance) identified 18 additional QTL that could not be identified by linkage or LD mapping alone.
Of the 277 SNPs that were excluded from LD analysis because their minor allele frequency was <5%, 93 were polymorphic in one of the RIL populations, where normal allele frequencies were recovered, and three of these markers were associated with the target trait.

Three strategies can be developed by using biparental, multiparental and natural populations together: (1) several biparental/multiparental populations plus one natural population, e.g. in maize (Lu et al. 2010) and rice (Famoso et al. 2011); (2) combined use of multiple populations such as NAM (Kump et al. 2011; Tian et al. 2011) and MAGIC (Kover et al. 2009); and (3) more biparental crosses than those contained in the NAM plus one natural population with 500 or more lines. The last option may be considered the best and is achievable as high-density genotyping becomes cheaper and precision phenotyping becomes practical for a large number of samples. In addition, joint mapping can be extended to multiparental populations such as MAGIC.
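The observation that SNPs dropped from LD analysis for low minor allele frequency can still be informative in a biparental cross can be illustrated with a small sketch. It is not the procedure of Lu et al. (2010); the function names, the allele coding (one call per inbred line, `None` for missing data), the biallelic assumption and the toy data are all hypothetical, and only the 5% threshold comes from the text.

```python
from collections import Counter


def minor_allele_frequency(calls):
    """MAF of one biallelic SNP scored on inbred lines (None = missing call)."""
    counts = Counter(c for c in calls if c is not None)
    if len(counts) < 2:
        return 0.0  # monomorphic, or no usable calls
    return min(counts.values()) / sum(counts.values())


def split_by_maf(panel, threshold=0.05):
    """Split SNP names into those kept for LD analysis and those set aside as rare."""
    kept, rare = [], []
    for name, calls in panel.items():
        if minor_allele_frequency(calls) >= threshold:
            kept.append(name)
        else:
            rare.append(name)
    return kept, rare


def segregating_in_cross(snps, parent_a, parent_b):
    """SNPs at which the two parents of a cross carry different alleles.

    Such SNPs segregate at roughly 1:1 in the derived RILs, whatever their
    frequency in the natural panel.
    """
    return [
        s for s in snps
        if parent_a.get(s) is not None
        and parent_b.get(s) is not None
        and parent_a[s] != parent_b[s]
    ]


# Toy natural panel of 40 inbred lines (hypothetical data).
panel = {
    "snp1": ["A"] * 20 + ["G"] * 20,   # common variant
    "snp2": ["A"] * 39 + ["G"],        # rare variant, MAF = 0.025
    "snp3": ["A"] * 38 + [None] * 2,   # monomorphic in the panel
}
kept, rare = split_by_maf(panel)
rescued = segregating_in_cross(
    rare,
    parent_a={"snp2": "A", "snp3": "A"},
    parent_b={"snp2": "G", "snp3": "A"},
)
print("kept for LD mapping:", kept)
print("rare but usable in the cross:", rescued)
```

In this toy panel the rare SNP is filtered out of the association scan but is recovered for linkage mapping because the two parents differ at it, which is the complementarity that joint mapping exploits.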

Functional markers and alleles

As the power of MAS declines with increasing recombination between markers and the genes for the target trait, the development of genic and functional markers becomes increasingly important. Although QTL cloning remains a cumbersome procedure for marker development, several reasons justify such a daunting undertaking. The sequence of a cloned QTL gene offers the perfect marker for MAS and provides the template for identification of potentially superior allelic variants in crop species or wild progenitors via EcoTILLING (Till et al. 2007). Modifying gene expression by genetic engineering could also provide superior tolerance. Because accurate phenotyping is the most critical factor for fully dissecting QTL, positional cloning is largely limited to traits with high heritability and to QTL with large effects that can be easily Mendelized (Collins et al. 2008). Completion of de novo sequencing has facilitated map-based gene cloning, with many genes cloned, particularly in rice (Qiu et al. 2011; Miura et al. 2011). Cloned rice genes include those for grain number, grain size and weight, heading date, disease resistance, and abiotic stress tolerance, as well as yield-related domestication genes. Functional markers have been developed in wheat for plant height, vernalization, photoperiod, kernel weight, diseases, and grain quality (as reviewed by Xianchun Xia, personal communication). Allele-specific PCR markers were developed for discrimination of seven Glu-A3, ten Glu-B3, and three Glu-D3 protein alleles. In maize, sequence-tagged, PCR-based markers were developed and demonstrated for use in selecting favorable alleles of LYCE (lycopene epsilon cyclase), a crucial gene in the carotenoid pathway (Harjes et al. 2008), and of another critical gene in the pathway, CrtR-B1 (carotene beta-hydroxylase 1) (Yan et al. 2010).
Large-scale resequencing of germplasm accessions has resulted in the discovery of many alleles for specific genic loci in cereals, followed by functional analysis. These validated alleles can be used to develop functional or breeder-ready markers.

Selection methods

Several MAS schemes have been designed and used in plant breeding. Each is suitable for specific types of traits and breeding programs. They may be used individually or combined in one breeding program for improvement of a specific trait or multiple traits. As many publications have been devoted to reviewing and evaluating the MAS schemes for crop plants (e.g., Lee 1995; Eathington et al. 2007; Jena and Mackill 2008; Gupta et al. 2010; Prasanna et al. 2010; Xu 2010), only those relevant to the whole-genome strategies will be discussed (Fig. 2).

Marker-assisted backcrossing (MABC)

Marker-assisted foreground selection and background selection have proved very useful for breeding major gene-controlled traits, as discussed in detail elsewhere (Xu 2003; Crosbie et al. 2006; Dwivedi et al. 2007; Ragot and Lee 2007; Xu 2010). There are two major MABC schemes suitable for major gene-controlled traits (Fig. 2). The first is major gene introgression (target genes only), which needs 2 to 10 markers for each target trait and can be used for introgression of both single and multiple traits with a population size of several hundred. The second is MABC for target genes plus background with the same scale of population size, which needs 2 to 10 markers for foreground selection of each trait, plus at least 200 markers for background selection. However, marker-assisted foreground selection is not effective for QTL with small effect, because genetic mapping fails to detect rare or small-effect QTL.
Capturing only a portion of the genetic variance (Goddard and Hayes 2007) can lead to overestimated marker effects (Lande and Thompson 1990; Beavis 1998), and the estimated effects may not be relevant across breeding populations, in different environments, or after several cycles of selection (Podlich et al. 2004). For complex traits that are generally controlled by QTL with small effect, two major MAS schemes, MARS and GS, have been proposed to be more effective.

Marker-assisted recurrent selection (MARS)

MARS was proposed in the 1990s (Edwards and Johnson 1994; Lee 1995; Stam 1995), and uses markers at each generation to target all traits of importance for which genetic information can be obtained. When QTL mapping is conducted on a biparental population, both parents can contribute favorable alleles. As a result, the ideal genotype is a mosaic of chromosomal segments from the two parents. A breeding scheme to produce or approach this ideal genotype based on individuals of the experimental population could involve several successive generations of crossing individuals (Stam 1995; Peleman and van der Voort 2003). MARS refers to the improvement of an F2 population by one cycle of marker-assisted selection (i.e., based on phenotypic data and marker scores) followed commonly by two or three cycles of marker-based selection (i.e., based on marker scores only) (Fig. 2). This idea can be extended to situations where favorable alleles come from more than two parents. MARS can also start without any QTL information, with selection based on significant marker-trait associations established during the MARS process. Simulation studies revealed that MARS was generally superior to phenotypic selection in accumulating favorable alleles in one individual (van Berloo and Stam 1998, 2001; Charmet et al. 1999) and that it was between 3% and almost 20% more efficient than phenotypic selection (van Berloo and Stam 2001). The usefulness of prior knowledge of QTL has been examined under genetic models with different numbers of QTL, different levels of heritability, unequal gene effects, linkage, and epistasis; the conclusion was that, with known QTL, MARS is most beneficial for traits controlled by a moderately large number of QTL (e.g., 40) (Bernardo and Charcosset 2006).

Fig. 2 Marker-assisted selection schemes for several important breeding selection procedures. Multiple steps starting from initiating populations to improved products are provided for three types of important marker-assisted selection schemes, including backcrossing, pyramiding, and recurrent and genomic selection, with consideration of both inbred (self-pollinated) and hybrid crops
Genomic selection (GS)

GS, or genome-wide selection, contrary to what the phrase implies, has been defined in a very narrow sense to refer to marker-based selection without identifying a subset of markers significantly associated with the trait (Meuwissen et al. 2001; Table 1; Fig. 2). GS consists of three steps: (1) prediction model training and validation, (2) breeding value prediction of single crosses, and (3) selection based on these predictions. In GS model training, a training population (TP) consisting of germplasm having both phenotypic and genome-wide marker data is used to estimate marker effects. The combination of these marker effect estimates and the marker data of the single crosses is used to calculate genomic estimated breeding values (GEBVs), where a GEBV is the sum of all marker effects included in the model for an individual. Selection is then imposed on the single crosses using GEBVs as the selection criterion. Thus, GS attempts to capture the total additive genetic variance with genome-wide marker coverage and effect estimates, in contrast to MARS strategies that use a small number of significant markers for prediction and selection. Markers with effects below the level of statistical significance are not used in conventional MARS, but are used in GS to predict breeding value. This is especially important for quantitative traits conferred by a large number of genes each with a small effect (Rutkoski et al. 2011). GS is poised to revolutionize plant breeding, because it uses marker data to predict breeding line performance in one analysis, analyzes the breeding populations directly, and includes all markers in the model so that effect estimates are unbiased and small-effect QTL can be accounted for.
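The three GS steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited study: marker effects are estimated jointly by plain ridge regression with a fixed, arbitrarily chosen shrinkage parameter `lam` (ridge regression BLUP would instead derive the shrinkage from estimated variance components), genotypes of inbred lines are coded -1/1, and all function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_marker_effects(genotypes, phenotypes, lam=1.0):
    """Step 1: estimate all marker effects at once from the training population.

    Solves (X'X + lam * I) beta = X'y on centered markers and phenotypes.
    No marker is dropped for lack of statistical significance.
    """
    n_ind, n_mrk = len(genotypes), len(genotypes[0])
    mean_y = sum(phenotypes) / n_ind
    col_means = [sum(g[j] for g in genotypes) / n_ind for j in range(n_mrk)]
    x = [[g[j] - col_means[j] for j in range(n_mrk)] for g in genotypes]
    y = [p - mean_y for p in phenotypes]
    xtx = [
        [sum(x[i][j] * x[i][k] for i in range(n_ind)) + (lam if j == k else 0.0)
         for k in range(n_mrk)]
        for j in range(n_mrk)
    ]
    xty = [sum(x[i][j] * y[i] for i in range(n_ind)) for j in range(n_mrk)]
    return mean_y, col_means, solve(xtx, xty)


def gebv(genotype, col_means, effects):
    """Step 2: GEBV of one candidate = sum of its marker effects."""
    return sum((g - m) * e for g, m, e in zip(genotype, col_means, effects))


# Toy training population: 6 inbred lines x 3 markers (hypothetical data).
training_genotypes = [
    [1, 1, -1], [1, -1, 1], [-1, 1, 1],
    [-1, -1, -1], [1, 1, 1], [-1, -1, 1],
]
training_phenotypes = [13.5, 10.5, 7.5, 8.5, 11.5, 6.5]
_, means, effects = train_marker_effects(training_genotypes, training_phenotypes)

# Step 3: rank unphenotyped candidates on GEBV alone.
candidates = {"line_a": [1, 1, -1], "line_b": [-1, -1, 1], "line_c": [1, -1, -1]}
ranking = sorted(candidates, key=lambda k: gebv(candidates[k], means, effects),
                 reverse=True)
print(ranking)
```

A real application would involve thousands of markers, cross-validation of the prediction model, and a statistically justified shrinkage; the sketch only shows that every marker contributes to the GEBV and that candidates are ranked without phenotyping them.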
The prospects of GS for improving quantitative traits in maize were analyzed, with the conclusion that this approach, although more expensive, is superior to MARS for improving complex traits, as GS effectively avoids issues pertaining to the number of QTL controlling a trait, the distribution of effects of QTL alleles, and epistatic effects due to genetic background (Bernardo and Yu 2007). GS can reduce the frequency of phenotyping because selection is based on genotypic data rather than phenotypic data. GS can also reduce cycle time, thereby increasing annual gains from selection. To date, the only publicly available results on large-scale GS performance are from dairy cattle breeding programs. Emerging studies in crop plants indicate that GS could also be an extremely useful tool for plant breeding (reviewed by Heffner et al. 2009). Both simulation (Wong and Bernardo 2008; Zhong et al. 2009) and empirical studies (Lorenzana and Bernardo 2009) found that in plant populations GS would lead to greater gains per unit time than phenotypic selection. A study comparing GS to MARS in a simulated maize breeding program found that GS gave a greater response to selection, especially for traits of low heritability (Bernardo and Yu 2007). GS with GEBV accuracies of only 0.5 could lead to a twofold higher gain per year compared to simple MAS in a low-investment wheat breeding program and a threefold increase in a high-investment maize breeding program (Heffner et al. 2010). Although GS has been strongly supported by simulation results and is widely used in several multinational corporations, its efficiency in plant breeding still needs to be supported by more concrete data. The multinational corporations use GS along with other generation advancement procedures such as DH technology to run more cycles of selection outside the target environment and to accelerate breeding through more generations per year, which justifies the wide use of GS.
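The argument that a moderately accurate GS scheme can outperform a more accurate but slower scheme follows from the standard expression for expected gain per unit time, R = i · r · σ_A / L, where i is selection intensity, r is selection accuracy, σ_A is the additive genetic standard deviation and L is cycle length. The sketch below applies that expression to purely hypothetical numbers; it does not reproduce the calculations of Heffner et al. (2010), and the accuracies, cycle lengths and function name are assumptions chosen for illustration.

```python
def gain_per_year(intensity, accuracy, sigma_a, years_per_cycle):
    """Expected genetic gain per year: R = i * r * sigma_A / L."""
    if years_per_cycle <= 0:
        raise ValueError("years_per_cycle must be positive")
    return intensity * accuracy * sigma_a / years_per_cycle


# Hypothetical comparison with equal selection intensity and genetic variance.
INTENSITY = 1.755   # assumed: standardized intensity for selecting the top 10%
SIGMA_A = 1.0       # assumed: additive genetic standard deviation, in trait units

slow_accurate = gain_per_year(INTENSITY, accuracy=0.7, sigma_a=SIGMA_A,
                              years_per_cycle=3.0)   # field-based selection
fast_gs = gain_per_year(INTENSITY, accuracy=0.5, sigma_a=SIGMA_A,
                        years_per_cycle=1.0)         # marker-only selection

print(round(fast_gs / slow_accurate, 2))  # 2.14
```

With these assumed inputs the shorter cycle more than compensates for the lower accuracy; the actual ratio in a breeding program depends on the achievable accuracy, cycle time and cost of each scheme.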
CIMMYT now has several proof-of-concept projects ongoing, and more concrete evidence should emerge from large-scale breeding assisted by GS.

Some theoretical issues associated with GS have been addressed by taking durable stem rust resistance in wheat as an example (Rutkoski et al. 2011).
(a) Marker density: markers need to cover the entire genome effectively so that at least one marker is in LD with each gene region; the minimum number of markers needed for genome-wide coverage depends on LD decay rates, which vary widely across species, populations, and genomes due to the forces of mutation, recombination, population size, population mating patterns, and admixture (Flint-Garcia et al. 2003).
(b) Training population composition: the highest GS accuracies are achieved when the training population (TP) is large, consists of the parents or very recent ancestors of the population under selection, and includes multiple generations of training.
(c) Estimating marker effects: for the three statistical methods available to train the GS model, ridge regression best linear unbiased prediction, Bayes-A, and Bayes-B (Meuwissen et al. 2001), relative accuracy depends on the strength of marker effects.
(d) Trait heritability: studies in cattle have indicated that traits with lower heritability require larger TPs to maintain high accuracies (Hayes et al. 2009), as a decrease in heritability leads to lower GEBV accuracies.
(e) GS enables breeding value to be calculated directly based