Crop design with genomics and natural diversity. Edward Buckler USDA-ARS Cornell University


1 Crop design with genomics and natural diversity Edward Buckler USDA-ARS Cornell University

2 Goal: Create the global model to decrease cycle time. [Diagram comparing a Standard Breeding cycle with a Genomic Selection (GS) cycle. Box labels: Make Crosses; Inbreed; Doubled Haploid; Genotype; The Model; Data From Other Efforts; Predict Value; Small Scale Hybrid; Large Area Hybrid Trials; Sell or Release Winner Hybrids. Duration labels: 4 years, 5 years, 4 months.] With perfect knowledge it could run 15X faster; current reality is ~3X.

3 GS versus GWAS. Same data (genome-wide markers and phenotypes), different statistics. GWAS: Genome-Wide Association Studies are aimed at identifying causative genes and variants. GS: Genomic Selection aims to predict phenotype using the complete genotype.
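The contrast on this slide can be sketched on simulated data. This is a minimal illustration, not the statistics used in the talk. The population sizes, effect sizes, and ridge penalty are all assumptions, and ridge regression is used only as a simple stand-in for a genomic selection model. The same genotype matrix and phenotype vector feed a marker-by-marker scan (GWAS-style) and a whole-genome predictor (GS-style).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated inbred lines with 0/1 genotypes (all sizes are illustrative assumptions).
n_lines, n_markers, n_causal = 500, 200, 20
X = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)
beta = np.zeros(n_markers)
causal = rng.choice(n_markers, size=n_causal, replace=False)
beta[causal] = rng.normal(0.0, 1.0, size=n_causal)
g = X @ beta                                      # true genetic values
y = g + rng.normal(0.0, g.std(), size=n_lines)    # heritability of about 0.5

train, test = np.arange(0, 400), np.arange(400, 500)


def gwas_scan(X, y):
    """GWAS-style scan: squared correlation of each single marker with the phenotype."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return r ** 2


def ridge_fit(X, y, lam):
    """GS-style model: ridge regression fitted on all markers simultaneously."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ (y - mu_y))
    return coef, mu_y - mu_x @ coef


# GWAS question: which markers are associated with the trait?
r2 = gwas_scan(X[train], y[train])
top_hits = np.argsort(r2)[::-1][:10]
n_true_hits = len(set(top_hits.tolist()) & set(causal.tolist()))

# GS question: how well can the phenotype of unseen lines be predicted?
coef, intercept = ridge_fit(X[train], y[train], lam=50.0)
y_hat = X[test] @ coef + intercept
accuracy = np.corrcoef(y_hat, y[test])[0, 1]

print(f"GWAS: {n_true_hits} of the top 10 markers are simulated causal variants")
print(f"GS:   prediction accuracy (r) on held-out lines = {accuracy:.2f}")
```

The scan ranks individual markers, while the ridge model never asks which marker is causal and is judged only on held-out prediction accuracy.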

4 Challenges: Genotype to unite the world's germplasm resources. Resolve complex traits so perfect LD remains for more than 15 meioses. Collect and mathematically model relevant trait and environmental interactions. Deploy.

5 The Maize Diversity Project McMullen & Flint-Garcia, at University of Missouri Holland, at North Carolina State Univ. Ware, at Cold Spring Harbor Lab. Sun & Kresovich, Cornell University Doebley, University of Wisconsin USDA-ARS & NSF Plant Genome

6 Unite the world's germplasm diversity

7 Maize has more molecular diversity than humans and apes combined. Maize likely has functional variation at every gene. In total, there could be hundreds of thousands of functional SNPs (Single Nucleotide Polymorphisms). [Figure: silent diversity; value labels as transcribed: 1.34%, 0.09%, 1.42%.] (Zhao PNAS 2000; Tenaillon et al., PNAS 2001)

8 Only 50% of the maize genome is shared between two varieties. [Figure: genome sharing among Plant 1, Plant 2, Plant 3 (maize, 50%) compared with Person 1, Person 2, Person 3 (humans, 99%).] Fu & Dooner 2002; Morgante et al. 2005; Brunner et al. 2005. Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010.

9 Maize genetic variation has been evolving for 5 million years. [Figure: parallel timelines from 5 mya to the present (5 mya, 4 mya, 3 mya, 2 mya, 1 mya), spanning the warm Pliocene and the cold Pleistocene. Maize events: modern variation begins evolving; sister genus diverges; Zea species begin diverging; maize domesticated. Human events: divergence from chimps; Ardipithecus; Australopithecus; Homo erectus; modern variation begins; modern humans.]

10 The Maize HapMapV2 Project Ware, at Cold Spring Harbor Lab. Ross-Ibarra, Univ. California, Davis X. Xun & S. Chi, Beijing Genome Inst. Y. Xu, CIMMYT J. Lai, Chinese Agri. Univ. Q. Sun, Cornell Univ. N. Springer, Univ. of Minnesota McMullen, at University of Missouri Doebley & Kaeppler, Univ. of Wisconsin USDA-ARS, NSF, BGI, JGI

11 Maize HapMap2. Increase the breadth of samples (teosinte, landraces, improved lines). All inbred lines. Whole Genome Shotgun, Illumina Paired-End, bp. 103 lines, 13 billion reads, 1 Tbp of sequence. Median 5X coverage. Samples: Tripsacum dactyloides, 1 sample; Teosinte (Zea mays ssp. mexicana), 2 inbred lines; Teosinte (Zea mays ssp. parviglumis), 17 inbred lines; Maize landraces, 23 inbred lines; Maize improved lines (including NAM), 60 inbred lines. [Figure axis: Sequence Reads (Gbp)]

12 The Warning & It Applies To Many Other Studies. CSHL & BGI alignment pipelines only agree 50% of the time with the same data. ~160M SNPs identified; most probably really exist somewhere, but MOST DO NOT EXIST WHERE ALIGNED. GENETIC AND EVOLUTIONARY CONTROLS: >50% errors if you accept standard pipelines; 55M pass various population & genetic filters.

13 HapMapV2 Results: 55M SNPs identified. Domestication & improvement loci found. Copy number and PAV identified. 80-90% of the genome in flux. Explain many QTL.

14 Genotyping By Sequencing (GBS): reduced representation sequencing for rapidly genotyping highly diverse species. RJ Elshire, JC Glaubitz, Q Sun, JA Poland, K Kawamoto, ES Buckler, and SE Mitchell, PLoS ONE 2011. Institute for Genomic Diversity

15 What is GBS? Use next generation sequencing to genotype a reduced representation portion of a genome. Related methods: RAD, RRL, CROPS, GBS. Molecularly, the most effective approaches use restriction enzymes. The first maize HapMap was RRL (Gore et al. 2009 Science). Recent efforts are driving the price down.

16 Expectation of marker distribution. [Two pie charts.] Biparental population: Presence/Absence, 50%; Nonpolymorphic, 18%; Biallelic, 17%; Too Repetitive, 15%. Across the species: Presence/Absence, 50%; Multiallelic, 34%; Too Repetitive, 15%; Nonpolymorphic, 1%.

17 GBS 96-plex Protocol. 1. Plate DNA & adapter pair (a Barcode Adapter and a Common Adapter, each with sticky ends; diagram labels: primer 1, Barcode, primer 2). 2. Digest DNA with a methylation-sensitive Restriction Enzyme (4-8 bp): ApeKI (5 base-cutter) or PstI (6 base-cutter). 3. Ligate adapters (Steps 2 & 3 may be done simultaneously).
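The digestion step can be illustrated with a small in-silico digest. This is a sketch under stated assumptions, not part of the published protocol. It models only the top strand, ignores methylation sensitivity and adapters, and runs on a made-up toy sequence. The recognition sites used are the standard ones for these enzymes (ApeKI: G^CWGC with W = A or T; PstI: CTGCA^G). The function name and toy sequence are hypothetical.

```python
import re

# Recognition site (as a lookahead, so overlapping sites are found) and the
# offset within the site at which the top strand is cut.
ENZYMES = {
    "ApeKI": (re.compile(r"(?=GC[AT]GC)"), 1),  # G^CWGC
    "PstI": (re.compile(r"(?=CTGCAG)"), 5),     # CTGCA^G
}


def digest(seq, enzyme):
    """Return the top-strand fragments produced by cutting seq with enzyme."""
    pattern, offset = ENZYMES[enzyme]
    cuts = [m.start() + offset for m in pattern.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy sequence containing two ApeKI sites and one PstI site.
toy = "AAAGCAGCTTTTCTGCAGAAAGCTGCCC"

apeki_fragments = digest(toy, "ApeKI")
psti_fragments = digest(toy, "PstI")
print(apeki_fragments)
print(psti_fragments)
```

In this sketch every fragment downstream of an ApeKI cut begins with CAGC or CTGC, the remnant of the cut site.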

18 GBS 96-plex Protocol. Plate DNA & adapter pair. Digest DNA with RE and ligate adapters (may be done simultaneously). Pool DNAs. PCR with primers. Clean-up. Evaluate fragment sizes. Sequence (8 x 96 samples per flowcell): 1.3 million reads per sample, 110 Mbp (today). Example reads:
CTGCAATCTTGGACAATGTATGTAGGGACTAGGGACAGTGATGTAATTAC
CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT
CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT
CAGCACTAATTCATACAATTTTGTTGGTTGATGTTACTGCAGTGGATCTT
CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT
CTGCGATCGCCGCGCCGATGAACGGGCCTACCCAGAAGATCCACTGCAGT
CTGCCGTTGCTGGCAGTGCTACAACTCTTCACCTGACTGAAAGCTACTAA
CAGCTAGCGCAAGTGTTTGTGTTGCGCGCGCGCTGTGGAAAAGTGTGCCG
CAGCTAATTTTTTGGTATTTATTTGAAATAAGTTCCCACTACTCGCGGTT
CAGCTAATTTTTTGGTATTTGTTTGAAATAAGTTCCCACTACTCGCGGTT
CAGCCACTTCCCTCATTTGAAACTTTTTGGATCTTTGAAGACCAATAGAT
CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC
CAGCTAAGAAGATAGAGCCAAACAAGGTGGGCCTGCCAACGTCTCCTTCC
CTGCGACTCGTGCTTCGCCGCGGCCTGAAGAACCCGGTCTTTCACCGCCG
CTGCTCGGTAGTAAACGGGTACAGAATTTAATCCCGCATCATTTGGAAGC
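A first bioinformatic step on reads like these is to collapse identical reads into tags and look for tags that differ at only a few bases. The sketch below uses five of the reads shown on this slide. It is an illustration only, not the actual GBS pipeline. The mismatch threshold, the variable names, and the idea of flagging near-identical tags as candidate alleles of one locus are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Five of the 50 bp reads transcribed on the slide.
reads = [
    "CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCACTAATTCACACAATTTTGTCGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCACTAATTCATACAATTTTGTTGGTTGATGTTACTGCAGTGGATCTT",
    "CAGCTAATTTTTTGGTATTTATTTGAAATAAGTTCCCACTACTCGCGGTT",
    "CAGCTAATTTTTTGGTATTTGTTTGAAATAAGTTCCCACTACTCGCGGTT",
]

MAX_MISMATCHES = 2  # assumed threshold for calling two tags candidate alleles


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Collapse identical reads into tags with counts.
tag_counts = Counter(reads)

# Pair up distinct tags that differ at only a few positions.
candidate_pairs = [
    (a, b, hamming(a, b))
    for a, b in combinations(sorted(tag_counts), 2)
    if len(a) == len(b) and hamming(a, b) <= MAX_MISMATCHES
]

for tag, count in tag_counts.items():
    print(count, tag)
for a, b, d in candidate_pairs:
    print(f"{d} mismatch(es): {a[:20]}... vs {b[:20]}...")
```

A real pipeline would also have to handle barcodes, sequencing errors, and repetitive tags, which this sketch ignores.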

19 Costs per DNA sample at various multiplex levels. [Stacked bar chart: Cost in US Dollars (axis 0 to 35) for 48-plex, 96-plex, and 384-plex, split into Sequencing, Labor, and Reagents & Consumables; value labels as transcribed: $ $19.00 $9.00.]

20 GBS has been used in 11 plant species

21 Molecular Biology Basically Solved. Over 30,000 samples run in the last months. Samples by species: Maize, 11115; Sorghum, 3325; Reed Canary Grass, 1045; Switchgrass, 950; Grape, 570; Cacao, 190.

22 The main GBS challenges currently are bioinformatics

23 Bioinformatics Problems: Massive amounts of data. Complex genomes with many unstable parts. No reference genome. Missing data. Phasing and imputation.

24 GBS Bioinformatic Pipelines. [Flowchart with two tracks, Discovery and Production. Both start from QSeq and Tag Counts by Taxa. Other box labels: Reference Genome; Reference Genetic Map; Tags by Taxa; Map Tags by Homology; Map Tags Genetically; Assign Tags to Alleles; Genetic Logic; Alleles to SNPs and locations; Alleles to SNPs; Alleles and synonyms. Output: Genotypes (HapMap format).]

25 Only 50% of the maize genome is shared between two varieties. [Figure: genome sharing among Plant 1, Plant 2, Plant 3 (maize, 50%) compared with Person 1, Person 2, Person 3 (humans, 99%).] Fu & Dooner 2002; Morgante et al. 2005; Brunner et al. 2005. Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010.

26 Physical and genetic mapping of 8.7 million GBS alleles. [Figure: Reads and Alleles by category. Allele categories: Genetic and Physical Agree; Genetic and Physical Disagree; Not in Physical, Genetically mapped; Complex mapping or modest power currently; Consistent Error or Evenly repetitive. Read categories: reads with strong genetic and/or BLAST position; reads with weaker position hypothesis; reads with no hypothesis (error or evenly repetitive).] Only 29% of alleles are simple (physical and genetic agree). 55% of alleles are easily genetically mappable. Many complex alleles are rarer, so 71% of alleles are genetically and/or physically interpretable. With more samples and better error models, perhaps 90% will be usable.

27 12 Trillion Data Point Opportunity/Problem. By end of 2011: GBS on ~30,000 public samples worldwide; 200M variants known from whole genome sequencing. Combine and impute missing data: 2 alleles x 30,000 lines x 200,000,000 variants = 12 trillion data points. Doing the statistics and math will be a challenge.
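The arithmetic on this slide, plus a rough sense of what it means for storage, can be checked in a few lines. The storage encodings (one byte or two bits per data point) are assumptions added for illustration and are not figures from the talk.

```python
# Figures from the slide.
alleles_per_line = 2
n_lines = 30_000
n_variants = 200_000_000

data_points = alleles_per_line * n_lines * n_variants

# Assumed encodings, for a back-of-the-envelope storage estimate only.
bytes_at_one_byte_each = data_points            # 1 byte per data point
bytes_at_two_bits_each = data_points * 2 // 8   # 2 bits per data point

print(f"{data_points:,} data points")
print(f"{bytes_at_one_byte_each / 1e12:.0f} TB at 1 byte per data point")
print(f"{bytes_at_two_bits_each / 1e12:.0f} TB at 2 bits per data point")
```

Even with a compact encoding the full imputed matrix runs to terabytes, which is one reason the statistics are a challenge.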

28 Resolve complex traits

29 The Hammer: Maize Nested Association Mapping (NAM). McMullen et al. 2009 Science. Crossed and sequenced 25 diverse maize lines to capture a substantial portion of the world's breeding diversity. Derived 5000 inbred lines from the crosses. Grew millions of plants. Largest genetic dissection system ever. [Figure: crossing scheme. B73 is crossed to each of 25 diverse lines (B97, CML103, CML228, CML247, CML277, CML322, CML333, CML52, CML69, Hp301, IL14H, Ki11, Ki3, Ky21, M162W, M37W, Mo18W, MS71, NC350, NC358, Oh43, Oh7B, P39, Tx303, Tzi8); each F1 gives rise to RIL 1 through RIL 200.]

30 Genotyping parents by sequencing to exploit both recent and ancient recombination. [Figure: NAM design, B73 crossed to parents P1, P2, ... P25, giving populations Pop1, Pop2, ... Pop25.] HapMapV1 provides 1.6M SNPs (Gore, Chia et al. 2009 Science)

31 GWAS for Plant Density: the leaf architecture portion of the story

32 There has been an 8-fold jump in US maize yield in the last 80 years. [Figure: average corn yield (bu/ac) and plant density (corn plants per acre, axis 0 to 35,000) by year, across the open pollinated, double cross, single cross, and modern eras; callout as transcribed: fold increase in plant density.] USDA-NASS; Troyer 2006 Crop Sci. 46; Duvick 2005 Maydica 50

33 Leaf angle, blade length, blade width: Determine canopy morphology and light harvest. Important for high density and yield. Newer hybrids have upright leaves (Duvick 2005).

34 At least genes control each aspect of the leaf. Each gene has a small effect. [Figure: three histograms of Frequency of Allele against Allelic Effect, showing alleles with positive effects and alleles with negative effects.] Upper leaf angle: 96% of significant alleles have <2.5° effect. Leaf length: 93% of significant alleles have <18 mm effect. Leaf width: 95% of significant alleles have <3 mm effect. Tian, Bradbury et al. 2011 Nature Genetics

35 liguleless1 and liguleless2 explained the two biggest leaf angle QTL. The biggest effect was less than 2°. [Figure: upper leaf angle across chromosomes; axes: cM/Mb, BPP, log(p), Effect; series: associations with positive effect, associations with negative effect, linkage QTL peak, QTL effect, SNP effect; locus labels as transcribed: lg1, lg2, lg3, lg.] Tian, Bradbury et al. 2011 Nature Genetics

36 Low genetic overlap among leaf architecture traits. Genetic architectures are finely tuned to each exact environment, with evolution favoring low pleiotropy. [Figure: number of shared QTLs and phenotypic correlation (r²) among Leaf Length (36 QTL), Upper Leaf Angle (30 QTL), Days To Silk (39 QTL), and Leaf Width (34 QTL); value labels as transcribed: 0.30, 0.20.]

37 What genes have natural variation to control Carbon & Nitrogen metabolism in the field? With the Stitt & Gibon groups, sampled plants in the field for basic carbon & nitrogen metabolites across all of NAM. Nengyi Zhang

38 Direct GWAS hit in the Carbonic anhydrase (CA) gene. [Figure: GWAS BPP and Linkage -log(p), with CA marked.] CA is the single most important gene controlling Chlorophyll, Malate, Nitrate, Glutamine, and overall protein content.

39 Carbonic anhydrase (CA) is a critical enzyme in C fixation in C4 plants. [Diagram: CO2 → HCO3- via CA; label: Mala.] CA SNP associations: Chla, Mala, Nitr, Glut, Prot, Prin1. CAs are upstream regulators of CO2-controlled stomatal movements in guard cells (water use efficiency, heat stress). Ludwig M. et al., Plant Physiol.; Hu H. et al., Nature Cell Biology 2010

40 Significant SNPs either within or very near (<2kb) candidate genes
Trait | SNP | BPP (%) | Gene | AGP
Glut | 3: 213,890,769 | 6 | carbonic anhydrase | 213,888, ,896,251
Star | 2: 22,808, | | invertase | 22,804,880-22,809,451
Star | 5: 168,868, | | invertase | 168,865, ,868,879
Chla | 3: 213,848, | | carbonic anhydrase | 213,847, ,859,958
Chla | 3: 213,848, | | carbonic anhydrase | 213,847, ,859,958
Chla | 3: 213,894, | | carbonic anhydrase | 213,888, ,896,251
Chla | 9: 23,215, | | starch synthase | 23,213,761-23,217,689
Gluc | 5: 167,871, | | 1,4-alpha-glucan branching enzyme | 167,869, ,892,914
Fruc | 5: 204,526, | | endoglucanase 1 (Cellulase) | 204,527, ,531,175
Mala | 3: 213,856, | | carbonic anhydrase | 213,847, ,859,958
Mala | 3: 214,330, | | malate transporter | 214,325, ,328,710
Prot | 8: 117,977,083 | 8 | ribosome protein | 117,979, ,983,191
Prot | 3: 213,854, | | carbonic anhydrase | 213,847, ,859,958
Nitr | 1: 202,621,762 | 5 | malate dehydrogenase (NADP+) | 202,617, ,621,864
Nitr | 2: 181,079, | | chla,b binding protein | 181,076, ,079,397
Nitr | 3: | | carbonic anhydrase | 213,847, ,859,958
Nitr | 4: 166,175,217 | 5 | glutamine synthetase | 166,172, ,175,518
Fuma | 1: 195,285, | | pyruvate dehydrogenase E1 | 195,281, ,283,531

41 Can we make useful predictions?

42 Can we predict? [Figure: NAM QTL Prediction of Days to Flowering; x-axis: predicted flowering from marker models; y-axis: observed days to flowering of parental lines; fit as transcribed: y = 1.32x 21.2, R² = ] With a $20 test, we can predict when many varieties will flower to within a couple of days.

43 NAM QTLs accurately predict many traits. Hence we can breed with them. [Figure: three scatter plots of Observed against Predicted values for Leaf length, Leaf width, and Upper leaf angle, each labelled with an R² (values as transcribed: R 2 = ).]

44 Taming of NAM: NAM and Ames Yield Trials. 1800 NAM lines test crossed on PVP and trialed in 4 locations in 2010 and 6 locations in 2011. Every inbred in Ames has been evaluated for basic traits in 2010. Yield trials for 1200 Ames inbreds on PVPs in 6 environments in 2011. Collaborating with breeders to combine GEBV models. S. Larsson, C. Romay

45 What can genomics do to accelerate the breeding of simple and complex traits? Evaluate Natural Variation; Mathematically Model Genotype to Phenotype; Predict Phenotype; Facilitates Rapid Breeding Progress.

46 What should and can we do in the next decades? Double yield with the same fertilizer and water (better drought and N utilization), perhaps even more in the developing world. Perennialize our crops. Biofortify crops to improve nutrition in the developing world. Do this in 100 species.

47 Who do I contact to learn more? NAM: Jim Holland, Mike McMullen, and Sherry Flint-Garcia. HapMapV2: Doreen Ware, Jer-Ming Chia, Jeff Ross-Ibarra. QTL Mapping on NAM: Peter Bradbury, Zhiwu Zhang, Feng Tian. Leaf Architecture: Feng Tian. C & N Metabolites: Nengyi Zhang & Yves Gibon. GBS Methods & Bioinformatics: Rob Elshire & Sharon Mitchell, Qi Sun, Jeff Glaubitz, James Harriman. Web: & Supported by USDA-ARS & NSF