Nature Biotechnology: doi: /nbt.3943

Similar documents
3I03 - Eukaryotic Genetics Repetitive DNA

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Genomics and Transcriptomics of Spirodela polyrhiza

Structural variation. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

Phenotypic response conferred by the Lr22a leaf rust resistance gene against ten Swiss P. triticina isolates.

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Runs of Homozygosity Analysis Tutorial

Simultaneous profiling of transcriptome and DNA methylome from a single cell

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Genome Sequencing-- Strategies

Genomic resources and gene/qtl discovery in cereals

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

LECTURE 12: INSIGHTS FROM GENOME SEQUENCING

Genomes summary. Bacterial genome sizes


Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Association mapping of Sclerotinia stalk rot resistance in domesticated sunflower plant introductions

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

De novo Genome Assembly

Supplementary Note: Detecting population structure in rare variant data

Toward a better understanding of plant genomes structure: combining NGS and optical mapping technology to improve the sunflower assembly

A tutorial introduction into the MIPS PlantsDB barley&wheat databases. Manuel Spannagl&Kai Bader transplant user training Poznan June 2013

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

Organization and evolution of transposable elements along the bread wheat chromosome 3B

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

Themes. Homo erectus. Jin and Su, Nature Reviews Genetics (2000)

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Conifer Translational Genomics Network Coordinated Agricultural Project

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Einführung in die Genetik

Next-Generation Sequencing. Technologies

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

RNA-Sequencing analysis

% Viability. isw2 ino isw2 ino isw2 ino isw2 ino mM HU 4-NQO CPT

7 Gene Isolation and Analysis of Multiple

Detecting ancient admixture using DNA sequence data

Linking Genetic Variation to Important Phenotypes

Advanced Plant Technology Program Vocabulary

UC Davis UC Davis Previously Published Works

4.1. Genetics as a Tool in Anthropology

Comparative genomics with maize and other grasses: from genes to genomes!

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Section 10.3 Outline 10.3 How Is the Base Sequence of a Messenger RNA Molecule Translated into Protein?

Biotechnology Explorer

SNP calling and VCF format

Supplementary Table 1: Oligo designs. A list of ATAC-seq oligos used for PCR.

Unit 1: DNA and the Genome Sub-topic 6: Mutation

Supplementary Figures Montero et al._supplementary Figure 1

Complementary Technologies for Precision Genetic Analysis

Axiom mydesign Custom Array design guide for human genotyping applications

Quality Measures for CytoChip Microarrays

TRANSPOSON INSERTION SITE VERIFICATION

CHAPTER 21 LECTURE SLIDES

F. graminearum F. verticillioides

Linkage Disequilibrium. Adele Crane & Angela Taravella

CSE182-L16. LW statistics/assembly

Supplementary Figure 1. The tree of Chinese jujube and its growing environment. The jujube has a very long lifecycle, even more than 1000 productive

Genome-wide association studies (GWAS) Part 1

De Novo Assembly of High-throughput Short Read Sequences

Genome Sequencing and Structural Variation

Integrating and Leveraging Cereal and Grass Genome Resources. David Marshall Information and Computational Sciences Group

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Contrasting Evolutionary Patterns of the Rp1 Resistance Gene Family in Different Species of Poaceae. Research article. Abstract.

Mutations, Genetic Testing and Engineering

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

The Polymerase Chain Reaction. Chapter 6: Background

Measurement of Molecular Genetic Variation. Forces Creating Genetic Variation. Mutation: Nucleotide Substitutions

Expressed sequence tag-pcr markers for identification of alien barley chromosome 2H in wheat

de novo sequencing of the sunflower genome

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

SNPs - GWAS - eqtls. Sebastian Schmeier

Introducing new DNA into the genome requires cloning the donor sequence, delivery of the cloned DNA into the cell, and integration into the genome.

Chapter 15 Gene Technologies and Human Applications

Development of Genomic Tools for RKN Resistance Breeding in Cotton

Mutations during meiosis and germ line division lead to genetic variation between individuals

mrna for protein translation

Genomics Based Approaches to Genetic Improvement in Sugarcane. Robert Henry

Transcriptome sequencing of two wild barley (Hordeum spontaneum L.) ecotypes differentially adapted to drought stress reveals ecotypespecific

Understanding Genes & Mutations. John A Phillips III May 16, 2005

SUPPLEMENTAL MATERIALS

Re-sequencing and hybrid assembly strategy of two nematode resistant Beta vulgaris translocation lines. Sarah Christina Jäger

Next Generation Sequencing Technologies. Rob Mitra 1/30/17

Overview of Human Genetics

4/26/2015. Cut DNA either: Cut DNA either:

Recent technology allow production of microarrays composed of 70-mers (essentially a hybrid of the two techniques)

Introduction to Plant Genome and Gene Structure. Dr. Jonaliza L. Siangliw Rice Gene Discovery Unit, BIOTEC

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

Answer: Sequence overlap is required to align the sequenced segments relative to each other.

Champions of the Poor of the Semi-Arid Tropics

A clone-free, single molecule map of the domestic cow (Bos taurus) genome

Percent survival. Supplementary fig. S3 A.

Petar Pajic 1 *, Yen Lung Lin 1 *, Duo Xu 1, Omer Gokcumen 1 Department of Biological Sciences, University at Buffalo, Buffalo, NY.

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Transcription:

Supplementary Figure 1. Distribution of sequence depth across the bacterial artificial chromosomes (BACs). The x-axis denotes the sequencing depth (X) of each BAC and y-axis denotes the number of BACs corresponding to depth.

Supplementary Figure 2. Estimation of the pearl millet genome size based on K-mer statistics. The frequency distribution of 17-mers within the raw genomic reads displays 2 major peaks (A and B). Peak A resembles a Gaussian distribution and represents k-mers of ~0-10X coverage which arise by chance due to sequencing errors. Peak B, corresponding to k-mers of ~20-50X coverage, represents the majority of the genome and resembles a Poisson distribution with minor differences due to sequencing errors, heterozygosity and repetitive DNA. The total genome size of pearl millet was estimated by obtaining the multiplication product of 17 bp and the k-mer frequency (value at y-axis) corresponding to the coverage (value at x-axis) at Peak B (i.e. 17X Peak B frequency).

Supplementary Figure 3. Percent similarity and align rate for long PacBio reads vs scaffold sequences.

Supplementary Figure 4. GC content distributions in selected grass genomes. The mean GC content of pearl millet is higher than barley (Hordeum vulgare), foxtail millet (Setaria italica), rice (Oryza sativa), and sorghum (Sorghum bicolor). GC content is represented on the x-axis and the proportion of the bin number divided by the total windows on the y-axis.

Supplementary Figure 5. Scatterplot showing GC content versus sequencing depth. The graph indicates that no GC bias is present in the sequence data generated.

Supplementary Figure 6. GC content distribution of whole genome CDS (left) and 384 pearl millet expanded families (right) in grasses. The GC content in whole genome CDS and 384 expanded gene families were found similar.

Percentage of genome Sequence divergence rate (%) Supplementary Figure 7. Distribution of divergence rates of different transposable element (TE) types in the pearl millet genome. DNA- DNA elements; LINE- long interspersed nuclear elements; LTR- long terminal repeat transposable elements; SINE- short interspersed nuclear elements. Divergence rates were high among LTR elements.

a.

b. Supplementary Figure 8. Length distribution of mrna, CDS, exons and introns in sequenced cereal genomes. (a) The x-axis indicates the length, and the y-axis indicates the gene percentage in the corresponding length window. For example, in the figure Distribution of mrna length, the coordinate indicated by the red arrow is (300, 5.995), meaning that 5.995% of all gene models have lengths in the range 300-400 bp. (b) Boxplots indicate that the average lengths of different gene features in pearl millet are similar to those in the other four cereal species

Supplementary Figure 9. An overview of orthologous and paralogous genes in pearl millet. The number of orthologs and paralogs in the pearl millet genome is shown in comparison to those gene classes in the genomes of ten select plant species [Arabidopsis (Arabidopsis thaliana), Brachypodium (Brachypodium distachyon), banana (Musa acuminata), barley, foxtail millet, maize, rice, sorghum, soybean (Glycine max) and bread wheat]. Reciprocal pair-wise comparisons of the 38,579 pearl millet gene models with 385,891 gene models from ten select plant species identified 17,949 orthologous groups (Supplementary Table 12), among which 5,232 contained only a single pearl millet gene, suggestive of simple orthology (Supplementary Table 13).

a. b. Supplementary Figure 10. Length distribution for each expanded family. (a) 20 expanded gene families randomly plotted. (b) 384 expanded gene families. Red dots represent pearl millet, and blue represent rice.

Supplementary Figure 11. Phylogeny of NBS-LRR genes. Heavily tandemized gene groups can be observed at one telomere of Pg1 and Pg4 followed by Pg7.

Supplementary Figure 12. (a) Indels and (b) gene numbers in PMiGAP lines, parental lines of mapping population and B- and R- lines. Large number of indels were observed in the genome compared to CDS regions in PMiGAP lines.

Frequency with SV Supplementary Figure 13. Distribution of structural variations in PMiGAP lines. Structural variations such as deletions (DEL), insertions (INS), inversions (INV), intra-chromosomal translocations (ITX) and others were determined using BreakDancer software. Among different structural variations DEL were found in large number across the genome in the PMiGAP lines.

Frequency with SV Supplementary Figure 14. Distribution of structural variations across 38 parents of mapping populations. Deletions were found in large numbers across the genome in the parental lines.

Frequency with SV Supplementary Figure 15. Distribution of structural variations in B and R lines. The number of structural variations identified in the B- and R- lines is significantly lower than that identified in case of PMiGAP lines and parental lines of mapping populations. This can be attributed to the RAD-seq approach adopted for sequencing these lines.

Supplementary Figure 16. Population structure of 345 PMiGAP lines and 31 wild samples. A set of 29.54 million SNPs were used to determine the number of sub-populations in PMiGAP lines using STRUCTURE analysis. Population structure, when K=2, mainly separate all the

cultivated from the wild relatives. At K=3, cultivated are separated but the separation does not correspond to clear geographical pattern, a large fraction of the cultivated share a blue and red component. At K=4 a further division inside the cultivated show ancestry almost admixed for all cultivated. At K=5, wild populations are split in two well defined clusters and a cluster presenting strong similarity to the cultivated. This last cluster is the closest to the cultivated in the PCA, and correspond to the central area of the wild populations, Niger and Mali. The two other well defined cluster correspond to wild from East Africa and Senegal/Mauritania/West Mali. A new cultivated cluster among widely across cultivated appear at K=6 and 7. Structure can be influenced by number of samples as well as imbalance between cultivated and wild populations. Among the East group, two individuals show a closer proximity to the cultivated in the PCA analysis (PE08743 from Soudan and PE08721 from Chad). These two individuals show admixed ancestry 48% cultivated and 24% cultivated respectively. These two sample were sampled at latitude 12.78 and 12.53 respectively. At this latitude wild populations are in sympatry with cultivated varieties and hybridization could occur.

a. b.

c. d. Supplementary Figure 17. Cross entropy analysis of population structure. (a) For all wild and cultivated samples, Cross entropy decreased slowly from K=2 to K=7. But, when we considered wild samples alone, (a) We found a strong signal of structure inside the wild populations separating wild individuals from the East, Center and West of the Sahel. The different genetic group are represented (c). The wild pearl millet accessions sequenced in this study clustered in to three main groups West African (green), Central Africa (violet) and East African (blue). Note that we used a similar scale here for these two graphics. To better assess if there is any geographical structuration inside the cultivated (d), we performed a geographically explicit model using TESS3. We made the alpha parameter varied from 5% to 60% allowing to weight the geographical position of the samples in the structure inference. The cross entropy do not vary much and is flat or slightly increase from K=1 (no structure) to K=3. Altogether, these analysis suggest a marked signal of structure inside the wild populations, but not inside the cultivated.

a.

b. Supplementary Figure 18. Diversity in PMiGAP lines and wild species accessions of pearl millet. a. Diversity metrics, presented as average pair-wise nucleotide diversity (current [θπ] and historical [θω]), are shown across all seven pseudomolecules. b. FST and -log ratio of the diversity loss across the 7 pseudomolecules.

Supplementary Figure 19. Genome wide linkage disequilibrium decay in the PMiGAP lines, parents of mapping populations, and B- and R- lines. Linkage disequilibrium decays more rapidly across the PMiGAP lines compared to the B- and R- lines and parental lines.

a.

b.

c.

d. Supplementary Figure 20. Association mapping for the grain number per panicle. The figure represent the -log(p-value) of the association in function of the position of the SNP on the pseudomolecule for the control as well as early and late stress field trials. We represented here the result for the grains per panicle (GNP). (a) results for Pg1 in 2011, (b) results for Pg1 in 2012, (c) results for Pg5 in 2011, (d) results for Pg5 in 2012.

Supplementary Figure 21. Hierarchical clustering of predicted hybrid performance. The lower bottom of clusters shown are the best hybrid combinations for pearl millet breeding

Supplementary Figure 22. Length and N50 distribution of the BAC clones. 29

Supplementary Figure 23. QQ plot of expected and observed P-value. The expected and observed p-value for the number of grains per panicle (GNP) and fresh stover yield (FSWTHA) are reported for the two years (2011, 2012) and the three field trials (control, early stress, late stress). P-value are reported on a -log scale. A value of 4 corresponds to a P-value of 10-4. Expected 30

and observed p-value fit the x=y relationship. The marker trait association with are significant when P-value above 10-4. 31