Workshop on Genome analysis tools applied to forest tree breeding


Vantaa (Finland), 18th October 2012
BOOK OF ABSTRACTS

Introduction
Giusi Zaina
University of Udine, Udine, Italy
Contact:

Understanding the intraspecific variation in the distribution of nucleotide polymorphisms and structural variation in commercially and environmentally relevant forest tree species has important implications for plant improvement and breeding. The pace at which we can analyze natural sequence variation has recently been greatly accelerated by the advent of new DNA sequencing technologies. Next generation sequencing (NGS) has revolutionized the approach to genome sequencing, enabling SNP discovery, detection of structural variants, measurement of transcript levels and a number of other applications at a genome scale. The use of such genomic tools has allowed us to start to unravel the genetic make-up of traits that are relevant to adaptation. This genomic approach has opened new horizons and challenging opportunities in plant breeding. We will discuss this issue focusing on model tree species, where knowledge at the genomic level is already available, as well as on non-model species with large genomes, where few insights at the genomic level are available.

New sequencing technologies and their impact on genetic diversity analysis: the experience in grape
Michele Morgante
University of Udine and Applied Genomics Institute (IGA), Udine, Italy
Contact:

The genomics revolution of the last 15 years has improved our understanding of the genetic make-up of living organisms. Together with complete genomic sequences for an increasing number of species, high-throughput and parallel approaches are available for the analysis of DNA sequence variation, transcripts and proteins. The use of genomic tools has allowed us to start to unravel the genetic make-up of relevant traits and to achieve a deeper understanding of what natural variation is at the sequence level. The comparative sequencing of several plant genomes revealed that, in addition to single nucleotide polymorphisms (SNPs), transposable elements are largely responsible for extensive variation in both intergenic and local genic content, not only between closely related species but also among individuals within a species. In addition, larger structural variants can be detected, similar to the copy number variants identified in the human genome and involving hundreds of thousands of base pairs of DNA and tens of genes. A single genome sequence may therefore not reflect the entire genomic complement of a species. This realization prompted us to introduce in plants the concept of the pan-genome, which includes a core genome of features common to all individuals and a dispensable genome composed of non-shared DNA elements that can be individual- or population-specific. Here, we describe the variation that can be detected among tree genotypes using next generation sequencing methodologies, not only as SNPs but also, and especially, as structural variants due either to simple transposable element insertions or to insertions or deletions of large genomic regions. We focus on how to create either horizontal catalogues of genetic variation, i.e. looking at the variation in a few very interesting genes in a very large number of individual trees, or vertical catalogues, i.e. looking at the variation in the entire genome of a few very interesting individuals, and provide examples of how to use such catalogues to address specific biological questions and related breeding issues.
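The pan-genome definition above can be illustrated with plain set operations. The following is a minimal sketch only: the function name `pan_genome`, the tree labels and the feature names are hypothetical, and real analyses work from alignments and presence/absence calls rather than ready-made sets.

```python
def pan_genome(presence):
    """Split genomic features into core and dispensable sets.

    `presence` maps each individual to the set of features (genes,
    transposable element insertions, ...) detected in its genome.
    It must contain at least one individual.
    """
    feature_sets = list(presence.values())
    pan = set().union(*feature_sets)          # seen in at least one individual
    core = set.intersection(*feature_sets)    # shared by every individual
    dispensable = pan - core                  # individual- or population-specific
    return core, dispensable


# Hypothetical toy data: three trees and four features.
trees = {
    "tree_A": {"gene1", "gene2", "te_ins_7"},
    "tree_B": {"gene1", "gene2", "gene3"},
    "tree_C": {"gene1", "gene2"},
}
core, dispensable = pan_genome(trees)
```

In this toy case the core genome is `gene1` and `gene2`, while `gene3` and `te_ins_7` form the dispensable genome.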

PineRefSeq: Experience and challenges in constructing high quality reference genomes for conifers
Kristian Stevens
University of California, Davis, CA, USA
Contact:

Conifer genomes present challenges for successful sequencing, mainly due to their large size and complexity. I will describe the PineRefSeq project's continuing development of high quality reference genome sequences for loblolly pine, Douglas-fir and sugar pine using methodologies that will serve as a model for sequencing other large, complex genomes. A pillar of our reference genome strategy is the whole genome shotgun sequencing and assembly of a haploid megagametophyte. With new assemblers and specialized reagents, de novo genome assembly with second-generation sequencing technologies has made dramatic advances in quality. However, the best-in-class assemblers simply do not scale to genomes the size of conifers. To circumvent this issue, members of our project have developed a de Bruijn graph preprocessing algorithm (MSR-CA) to achieve a dramatic reduction in the number of input reads to the assembly, without a corresponding loss of information. The reduced dataset can then be assembled using the standard overlap-layout-consensus method. We complement our computational approach to data reduction with a molecular approach that can physically reduce the complexity of the genome being assembled. A parallel pipeline involves the high-throughput creation and sequencing of transient fosmid pools. This approach allows us to tune the complexity of the haploid assembly and apply standard best-in-class short read assemblers to the problem. Deep sequencing of multiple complex long insert libraries, such as fosmid DiTags and Illumina jumping libraries, is also essential for obtaining a good result. For conifer genomes, even traditional post-genome analyses such as SNP discovery and genotyping by sequencing present a challenge. For instance, the popular high-throughput short read aligners based on the Burrows-Wheeler Transform will not index a genome larger than 4 Gb. Our group has also developed methods to circumvent these issues and obtain rigorous, high quality results.
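For readers unfamiliar with the Burrows-Wheeler Transform mentioned above, here is a minimal sketch of the transform and its inverse on a toy string. It is a naive illustration, not how production aligners build their indexes. The abstract does not say why the 4 Gb limit exists; the last line assumes it comes from storing genome positions as 32-bit unsigned integers, which is an assumption on our part.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


# Assumption: positions stored as 32-bit unsigned integers can address
# at most 2**32 bases, i.e. about 4.3 Gb.
MAX_32BIT_POSITIONS = 2**32
```

Under that assumption, the 21.7 Gbp loblolly pine genome cited elsewhere in this book would need roughly five times more positions than a 32-bit index can address.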

Next Generation Sequencing: from samples to data analysis
Federica Cattonaro
Applied Genomics Institute (IGA) and IGA-Technology Services, Udine, Italy
Contact:

Recent technological developments based on pyrosequencing and on sequencing using cyclic reversible terminators represent a major breakthrough, enabling the sequencing of billions of bases in massively parallel reactions. The sequencing revolution was driven by three commercially available platforms: 454 (Roche), Illumina and SOLiD (Applied Biosystems). A new generation of single-molecule sequencing technologies (third-generation sequencing) will emerge in the near future, with the potential for dramatically longer read lengths, shorter time to result and lower overall cost. At present, Illumina sequencing technology is the most widespread around the world. Sequencing templates are immobilized on a proprietary glass surface (flow cell) and solid-phase amplification creates up to 1,000 identical copies of each single template molecule in close proximity (clusters one micron or less in diameter). Sequencing by synthesis (SBS) technology, which uses four reversible fluorescently labeled nucleotides, is used to sequence the tens of millions of clusters on the flow cell surface in parallel. Different Illumina instruments produce variable amounts of data per run (from hundreds of megabase pairs to hundreds of gigabase pairs) at various costs per raw base. The Illumina technology supports different protocols and applications, such as genome-wide sequencing (DNA-seq), transcriptome sequencing (RNA-seq), small RNA-seq, ChIP-seq, BS-seq and target resequencing. In particular, DNA-seq, transcriptome sequencing and target resequencing are the most commonly used applications for marker discovery in plants and will be discussed in more detail during this presentation.

Horizontal catalogue of SNPs and rare SNPs: the bioenergy issue in P. nigra
Fabio Marroni
Applied Genomics Institute (IGA), Udine, Italy
Contact:

Common variants, such as those identified by genome wide association scans (GWAs), have been found to explain only a small proportion of trait variation. Growing evidence suggests that rare functional variants, usually missed by GWAs, play an important role in determining the phenotype. Next generation sequencing (NGS) instruments produce an unprecedented amount of sequence data at limited cost. This gives researchers the possibility of designing studies with adequate power to identify rare variants at a fraction of the economic and labor resources required by individual Sanger sequencing. We used pooled multiplexed next generation sequencing and a custom analysis workflow to detect mutations in five candidate genes for lignin biosynthesis in 768 pooled Populus nigra accessions. We identified a total of 36 non-synonymous SNPs, one of which causes a premature stop codon. The most common variant was estimated to be present in 672 of the 1536 tested chromosomes, while the rarest was estimated to occur only once in 1536 chromosomes. Comparison with individual Sanger sequencing in a selected subsample confirmed that variants are identified with high sensitivity and specificity and that the variant frequency was estimated accurately. The proposed method for the identification of rare polymorphisms allows accurate detection of variation in many individuals and is cost effective compared to individual sequencing. The aim of this talk is to convey to the audience the general ideas underlying the use of pooled NGS for the identification of rare variants. To facilitate a thorough understanding of the possibilities of the method, I will explain in detail the possible experimental and analytical approaches and discuss their advantages and disadvantages. I will show that information on allele frequency obtained by pooled NGS can be used to accurately compute basic population genetics indexes such as allele frequency, nucleotide diversity, and Tajima's D. Finally, I will discuss applications and future perspectives of the multiplexed NGS approach.
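The population genetics indexes named above can be computed from estimated allele counts with the standard textbook formulas. The sketch below is an illustration under simplifying assumptions: it treats the pooled estimates as exact counts among `n` chromosomes and applies no correction for sequencing noise or unequal pool contributions, which a real pooled workflow would need. Function names are hypothetical and are not taken from the author's custom workflow.

```python
import math


def allele_frequency(allele_count, n_chromosomes):
    """Frequency of an allele seen `allele_count` times in `n_chromosomes`."""
    return allele_count / n_chromosomes


def nucleotide_diversity(allele_counts, n):
    """Pi: expected pairwise differences, summed over sites (unbiased form)."""
    return sum(2.0 * c * (n - c) / (n * (n - 1)) for c in allele_counts)


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts among n chromosomes."""
    counts = [c for c in allele_counts if 0 < c < n]   # segregating sites only
    s = len(counts)
    if s == 0 or n < 4:          # the variance term is zero for n < 4
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    pi = nucleotide_diversity(counts, n)
    theta_w = s / a1                                   # Watterson's estimator
    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Figures from the abstract: 768 diploid accessions give 1536 chromosomes.
most_common = allele_frequency(672, 1536)   # 0.4375
rarest = allele_frequency(1, 1536)          # about 0.00065
```

An excess of rare variants (for example, a sample where every segregating site is a singleton) yields a negative D, while an excess of intermediate-frequency variants yields a positive D.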

Vertical catalogue of SNPs: the HT-resequencing effort in P. nigra
Stefania Giacomello
Applied Genomics Institute (IGA) and University of Udine, Udine, Italy
Contact:

SNP detection is a complex issue in genomics. Depending on the aim of the SNP calling, different parameters must be applied. In this workshop we present some tricky aspects encountered in the analysis of genome-wide single nucleotide variability in Populus nigra, the native and most widespread poplar species in Europe. In order to collect a valuable SNP resource to be applied in novel breeding programs and population genetic studies, fifty-two natural genotypes were selected to represent the European latitude range and resequenced using Illumina technology. Four of these, belonging to different latitudinal settings, were resequenced at high coverage (about 20X each) in order to obtain a dataset of informative SNPs. SNP detection was based on a reference-guided assembly using the P. trichocarpa genome sequence. Test experiments showed the feasibility of using the latter genome sequence as a reference for P. nigra, given that 75% of P. nigra reads were uniquely mapped on the P. trichocarpa sequence. Suggestions on some problematic aspects of library preparation in poplar will be provided, as well as a detailed explanation of the parameters to be considered in the reference assembly. Moreover, the SNP calling procedure and the parameters to be used in different analyses (i.e. identification of informative SNPs and of rare variants) will be discussed in detail. The remaining forty-eight clones were resequenced in pools at low coverage (2 to 10X). Theoretical aspects of library pooling will be described. Critical aspects of the low coverage clone analysis will be explained, taking into consideration the two scenarios reported above. A part of the presentation will then be dedicated to the data validation needed to conduct population genetic analyses and to provide SNPs for association studies in order to improve breeding programs.
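To make concrete the point that calling parameters depend on the aim of the analysis, the sketch below applies two different parameter sets to the same read counts. Everything in it is hypothetical: the class and preset names, the choice of filters and all threshold values are invented for illustration and are not the parameters used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallingParams:
    min_depth: int           # discard poorly covered sites
    max_depth: int           # discard likely repeats or collapsed paralogs
    min_alt_reads: int       # minimum reads supporting the variant allele
    min_alt_fraction: float  # minimum fraction of reads supporting it


# Hypothetical presets (assumed values, not taken from the abstract).
INFORMATIVE_SNPS = CallingParams(min_depth=10, max_depth=60,
                                 min_alt_reads=4, min_alt_fraction=0.25)
RARE_VARIANTS_POOLED = CallingParams(min_depth=20, max_depth=400,
                                     min_alt_reads=3, min_alt_fraction=0.01)


def passes(ref_reads, alt_reads, params):
    """Return True if a site passes the depth and allele-support filters."""
    depth = ref_reads + alt_reads
    if not params.min_depth <= depth <= params.max_depth:
        return False
    if alt_reads < params.min_alt_reads:
        return False
    return alt_reads / depth >= params.min_alt_fraction


# The same site can be rejected under one aim and accepted under another.
site = (195, 5)  # (reference reads, variant reads)
strict_call = passes(*site, INFORMATIVE_SNPS)
pooled_call = passes(*site, RARE_VARIANTS_POOLED)
```

A site with 5 variant reads out of 200 fails the first preset but passes the second, which is the kind of trade-off between specificity and sensitivity to rare alleles that the choice of parameters controls.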

Structural Variant detection: the experience in Populus spp.
Sara Pinosio
Applied Genomics Institute (IGA), Udine, Italy
Contact:

Recent studies showed that DNA structural variation (SV) comprises a major portion of genetic diversity in several genomes. Traditionally, the detection of SVs has used whole-genome array comparative genomic hybridization (CGH) or single nucleotide polymorphism arrays. The advent of next-generation sequencing (NGS) technologies promises to revolutionize structural variation studies and to replace microarrays as the platform for SV discovery and genotyping. However, NGS approaches present substantial computational and bioinformatics challenges. Two main signatures can be exploited for the detection of SVs from NGS data: the paired-end mapping (PEM) signature and the depth of coverage (DOC) signature. We used these two signatures to study the genetic variation present in different poplar species, focusing on the detection of two different classes of structural variants: 1) insertion/deletion polymorphisms related to transposable element activity and 2) larger copy number variants (CNVs). In this presentation I will show the experimental approach and discuss the bioinformatics methods adopted. For the detection of insertions and deletions we exploited the paired-end mapping information generated from next-generation sequencing data by comparing Populus nigra and Populus deltoides sequences with the Populus trichocarpa reference sequence. Overall, we identified thousands of deletions and insertions, accounting in total for 10% of the whole reference genome. Class I LTR retroelement insertions were identified as the major contributors to the overall variation. We observed limited levels of variation in transcribed regions, while intergenic regions harbored much more variation. CNVs were detected by comparing the depth of coverage obtained in resequenced P. nigra and P. deltoides individuals. Regions of copy number variation between the two species were rich in repetitive sequences and had a lower-than-average gene content. However, some classes of genes, such as disease resistance genes, were over-represented in CNVs with respect to the rest of the genome, suggesting a relationship between the evolution of these gene families and this kind of variation.
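The two signatures described above can be sketched in a few lines. This is a simplified illustration, not the pipeline used in the study: the function names, the three-standard-deviation cutoff, the pseudocount and the normalization by total coverage are all assumptions made for the example.

```python
import math


def classify_pair(observed_span, mean_insert, sd_insert, n_sd=3.0):
    """PEM signature: compare the mapped span of a read pair on the
    reference with the insert size expected from the library.

    A pair mapping much farther apart than expected suggests a deletion
    in the sample relative to the reference; a pair mapping much closer
    suggests an insertion in the sample.
    """
    if observed_span > mean_insert + n_sd * sd_insert:
        return "deletion"
    if observed_span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


def doc_log2_ratios(cov_a, cov_b, pseudocount=0.5):
    """DOC signature: per-window log2 ratio of coverage between two
    samples, each scaled by its own total coverage."""
    total_a, total_b = sum(cov_a), sum(cov_b)
    return [
        math.log2(((a + pseudocount) / total_a) / ((b + pseudocount) / total_b))
        for a, b in zip(cov_a, cov_b)
    ]


# Hypothetical window coverages: sample A has twice the depth in window 3.
ratios = doc_log2_ratios([20, 20, 40, 20], [20, 20, 20, 20])
candidate_window = ratios.index(max(ratios))
```

Windows with a markedly high or low log2 ratio are candidate copy number variants; in practice, calls are usually supported by several adjacent windows or several discordant pairs rather than a single observation.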

Exome sequencing for GWAS and Genomic Selection in Pines
Matias Kirst
School of Forest Resources & Conservation, University of Florida, FL, USA
Contact:

Sequencing the mega-genome of conifers such as loblolly pine (Pinus taeda) has remained a significant challenge, even in the age of next-generation, high-throughput sequencing. An alternative is to characterize reduced representations of the genome, using sequence capture to retrieve targeted regions based on genomic DNA hybridization to complementary probes. However, the challenge of reducing the complexity of the loblolly pine genome is unprecedented compared to previous studies because of its 21.7 Gbp size. We recently developed a set of 55, mer sequence-capture oligonucleotides designed to target 14,729 unigenes derived from pine EST assemblies, and showed that the capture efficiency is high (~70%) and comparable among haploid DNA, diploid DNA and cDNA. However, probe sequence-capture performance is clearly dependent on the ability to adequately predict the boundaries between exons. To validate the single-nucleotide polymorphisms detected by resequencing captured regions, we analyzed 72 individuals of a segregating family and detected 4,563 segregating SNPs that map according to expectations. We have expanded this analysis to characterize the genic space in 24 unrelated genotypes that represent the natural range of both loblolly pine and slash pine (Pinus elliottii). The ~50,000 SNPs detected in each species establish a detailed assessment of genetic variation, linkage disequilibrium and natural selection in a broad range of genic regions in loblolly and slash pine. In addition to individual nucleotide variants, depth of sequence coverage is also being analyzed for detection of structural polymorphisms. Preliminary analysis suggests that gene presence/absence variation is abundant in these genomes. These data have provided the foundation for genome-wide association studies and the development of genomic selection predictive models.

Final discussion
Michele Morgante

Different genomic analysis tools and approaches are required according to the tree species concerned. How do the new genomic analysis tools compare with the traditional genetic approaches (QTL mapping and association studies)? Could the two approaches be integrated? How can the new genomic analysis tools add value and information to breeding strategies?