A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice

Similar documents
A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice

Document Tracking Technology to Support Indonesian Local E-Governments

Concern Based SaaS Application Architectural Design

On If-Then Multi Soft Sets-Based Decision Making

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2017

Optimal Storage Assignment for an Automated Warehouse System with Mixed Loading

Development of Colorimetric Analysis for Determination the Concentration of Oil in Produce Water

FSuite: exploiting inbreeding in dense SNP chip and exome data

Lab 1: A review of linear models

Composite Simulation as Example of Industry Experience

Comparison of lead concentration in surface soil by induced coupled plasma/optical emission spectrometry and X-ray fluorescence

Heat line formation during roll-casting of aluminium alloys at thin gauges

Facility Layout Planning of Central Kitchen in Food Service Industry: Application to the Real-Scale Problem

Genome-wide association studies (GWAS) Part 1

Simulation for Sustainable Manufacturing System Considering Productivity and Energy Consumption

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Supporting the Design of a Management Accounting System of a Company Operating in the Gas Industry with Business Process Modeling

Progress of China Agricultural Information Technology Research and Applications Based on Registered Agricultural Software Packages

Genotype quality control with plinkqc Hannah Meyer

Towards a Modeling Framework for Service-Oriented Digital Ecosystems

How Resilient Is Your Organisation? An Introduction to the Resilience Analysis Grid (RAG)

PLINK gplink Haploview

Size distribution and number concentration of the 10nm-20um aerosol at an urban background site, Gennevilliers, Paris area

H3A - Genome-Wide Association testing SOP

Collusion through price ceilings? In search of a focal-point effect

Value-Based Design for Gamifying Daily Activities

Introduction to Genome Wide Association Studies 2015 Sydney Brenner Institute for Molecular Bioscience Shaun Aron

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were

A Design Method for Product Upgradability with Different Customer Demands

Lecture 6: GWAS in Samples with Structure. Summer Institute in Statistical Genetics 2015

Electronic Agriculture Resources and Agriculture Industrialization Support Information Service Platform Structure and Implementation

High Purity Chromium Metal Oxygen Distribution (Determined by XPS and EPMA)

Performance Analysis of Reverse Supply Chain Systems by Using Simulation

A framework to improve performance measurement in engineering projects

ATOM PROBE ANALYSIS OF β PRECIPITATION IN A MODEL IRON-BASED Fe-Ni-Al-Mo SUPERALLOY

Economic analysis of maize/soyabean intercrop systems by partial budget in the Guinea savannah of Nigeria

A genome wide association study of metabolic traits in human urine

Experiences of Online Co-creation with End Users of Cloud Services

Power control of a photovoltaic system connected to a distribution frid in Vietnam

Experimental Study on Forced-Air Precooling of Dutch Cucumbers

Performance evaluation of centralized maintenance workshop by using Queuing Networks

Supplementary Note: Detecting population structure in rare variant data

SNP calling and Genome Wide Association Study (GWAS) Trushar Shah

Linkage Disequilibrium

Quality Control Assessment in Genotyping Console

Recycling Technology of Fiber-Reinforced Plastics Using Sodium Hydroxide

Energy savings potential using the thermal inertia of a low temperature storage

DNA METHYLATION RESEARCH TOOLS


Interaction between mechanosorptive and viscoelastic response of wood at high humidity level

NANOINDENTATION-INDUCED PHASE TRANSFORMATION IN SILICON

A Performance Measurement System to Manage CEN Operations, Evolution and Innovation

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

K Vinodhini, K Srinivasan

Exploring Different Faces of Mass Customization in Manufacturing

Agricultural biodiversity, knowledge systems and policy decisions

Estimation problems in high throughput SNP platforms

Supplemental Data. Who's Who? Detecting and Resolving. Sample Anomalies in Human DNA. Sequencing Studies with Peddy

Office Hours. We will try to find a time

Managing Systems Engineering Processes: a Multi- Standard Approach

Using the Association Workflow in Partek Genomics Suite

Overall Layout Design of Iron and Steel Plants Based on SLP Theory

SNPTracker: A swift tool for comprehensive tracking and unifying dbsnp rsids and genomic coordinates of massive sequence variants

Impact of cutting fluids on surface topography and integrity in flat grinding

Conceptual Design of an Intelligent Welding Cell Using SysML and Holonic Paradigm

Pressure effects on the solubility and crystal growth of α-quartz

A Digital Management System of Cow Diseases on Dairy Farm

The Reverse Logistics Technology and Development Trend of Retired Home Appliances

Understanding eparticipation Services in Indonesian Local Government

Introduction to Quantitative Genomics / Genetics

Exploring the Impact of ICT in CPFR: A Case Study of an APS System in a Norwegian Pharmacy Supply Chain

Runs of Homozygosity Analysis Tutorial

A simple gas-liquid mass transfer jet system,

One-of-a-Kind Production (OKP) Planning and Control: An Empirical Framework for the Special Purpose Machines Industry

Virtual Integration on the Basis of a Structured System Modelling Approach

Genome Wide Association Study for Binomially Distributed Traits: A Case Study for Stalk Lodging in Maize

Transferability of fish habitat models: the new 5m7 approach applied to the mediterranean barbel (Barbus Meridionalis)

Understanding genetic association studies. Peter Kamerman

New experimental method for measuring the energy efficiency of tyres in real condition on tractors

The Bigger Picture: The Use of Mobile Photos in Shopping

Dynamic price competition in air transport market, An analysis on long-haul routes

MAPPING GENES TO TRAITS IN DOGS USING SNPs

Finite Element Model of Gear Induction Hardening

Change Management and PLM Implementation

Estimating traffic flows and environmental effects of urban commercial supply in global city logistics decision support

Monitoring of Collaborative Assembly Operations: An OEE Based Approach

Design and Implementation of Parent Fish Breeding Management System Based on RFID Technology

Product Lifecycle Management Adoption versus Lifecycle Orientation: Evidences from Italian Companies

Introgression of a functional epigenetic OsSPL14 WFP allele into elite indica rice genomes greatly improved panicle traits and grain yield

Integrated Assembly Line Balancing with Skilled and Unskilled Workers

KPY 12 - A PRESSURE TRANSDUCER SUITABLE FOR LOW TEMPERATURE USE

Non-Platinum metal oxide nano particles and nano clusters as oxygen reduction catalysts in fuel cells.

Simulation of Dislocation Dynamics in FCC Metals

A Stochastic Formulation of the Disassembly Line Balancing Problem

The new LSM 700 from Carl Zeiss

Distribution Grid Planning Enhancement Using Profiling Estimation Technic

On the dynamic technician routing and scheduling problem

Transcription:

A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice James Baurley, Bens Pardamean, Anzaludin Perbangsa, Dwinita Utami, Habib Rijzaani, Dani Satyawan To cite this version: James Baurley, Bens Pardamean, Anzaludin Perbangsa, Dwinita Utami, Habib Rijzaani, et al.. A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice. David Hutchison; Takeo Kanade; Bernhard Steffen; Demetri Terzopoulos; Doug Tygar; Gerhard Weikum; Linawati; Made Sudiana Mahendra; Erich J. Neuhold; A Min Tjoa; Ilsun You; Josef Kittler; Jon M. Kleinberg; Alfred Kobsa; Friedemann Mattern; John C. Mitchell; Moni Naor; Oscar Nierstrasz; C. Pandu Rangan. 2nd Information and Communication Technology - EurAsia Conference (ICT-EurAsia), Apr 2014, Bali, Indonesia. Springer, Lecture Notes in Computer Science, LNCS-8407, pp.356-364, 2014, Information and Communication Technology. <10.1007/978-3-642-55032-4 35>. <hal-01397232> HAL Id: hal-01397232 https://hal.inria.fr/hal-01397232 Submitted on 15 Nov 2016 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution 4.0 International License

A Bioinformatics Workflow for Genetic Association Studies of Traits in Indonesian Rice James W. Baurley 1, Bens Pardamean 1 Anzaludin S. Perbangsa 1, Dwinita Utami 2, Habib Rijzaani 2, and Dani Satyawan 2 1 Bioinformatics Research Group, Bina Nusantara University, Jakarta, Indonesia bpardamean@binus.edu, WWW home page: http://www.binus.edu 2 Indonesian Center for Agricultural Biotechnology and Genetic Resources Research and Development, Bogor, Indonesia Abstract. Asian rice is a staple food in Indonesia and worldwide, and its production is essential to food security. Cataloging and linking genetic variation in Asian rice to important traits, such as quality and yield, is needed in developing superior varieties of rice. We develop a bioinformatics workflow for quality control and data analysis of genetic and trait data for a diversity panel of 467 rice varieties found in Indonesia. The bioinformatics workflow operates using a back-end relational database for data storage and retrieval. Quality control and data analysis procedures are implemented and automated using the whole genome data analysis toolset, PLINK, and the [R] statistical computing language. The 467 rice varieties were genotyped using a custom array (717,312 genotypes total) and phenotyped for 12 traits in four locations in Indonesia across multiple seasons. We applied our bioinformatics workflow to these data and present prototype genome-wide association results for a continuous trait - days to flowering. Two genetic variants, located on chromosome 4 and 12 of the rice genome, showed evidence for association in these data. We conclude by outlining extensions to the workflow and plans for more sophisticated statistical analyses. Keywords: data analysis, workflow, agriculture genetics, genome-wide association study, bioinformatics, statistical genetics 1 Introduction Indonesia is located in one of the most biodiverse regions in the world. Studying the biodiversity unique to this region for agriculturally important species can lead to crop and animal improvements. Oryza saliva or Asian rice is a staple food in Indonesia and worldwide, and its production is essential to food security. Cataloging and linking genetic variation in Asian rice to important traits, such as quality and yield, is needed to develop new varieties of rice with superior properties. The 389 Megabase (Mb) Asian rice genome consist of 12 chromosomes [1]. Throughout the genome, sequence variations called single-nucleotide polymorphism (SNP) are common. At these locations (or loci), the alternative nucleotides

2 are called alleles, and the two alleles from the paired chromosomes are called SNP genotypes. High-throughput genotyping and sequencing technologies have revolutionized agriculture genetics, allowing for genome-wide interrogation of thousands of SNPs. Recent research using these technologies, have focused on genome-wide genotyping of a rice diversity panel consisting of 413 varieties from 82 countries [2]. While this research has identified genetic regions associated with many complex traits, there is still much to learn about the genetics of rice varieties specific to Indonesia. The Indonesian Center for Agricultural Biotechnology and Genetic Research and Development (ICABIOGRAD) has developed an unique rice diversity panel of 467 rice varieties found in Indonesia. The panel was planted in a greenhouse (BG) with controlled environment and three fields at different elevations, located in the cities of Citayam, Subang, and Kuningan. The rice was planted in multiple seasons. The diversity panel was genotyped with two panels of 384 and 1,536 SNPs on the GoldenGate platform (Illumina, Inc). The rice was also extensively phenotyped at each location (see Table 1). Given the complexity and volume of the data collected, center researchers needed an efficient and easy system to manage these data and perform numerous genetic association analyses by traits and location. We designed and implemented a custom bioinformatics workflow for the genetic association study of these traits in Indonesian rice. In the next sections, we present the workflow design, the specifics of the implementation, and prototype results. Trait Units Description Days to flowering days after planting when 50% of the plants have flowers Days to harvest days after planting until physiological maturity Total tiller number of tillers per hill Productive tiller number of tillers that produce panicles Plant height cm measured from the ground to the base of the panicle, at the time of flowering Total panicle panicles in a square meter Panicle length cm main stem panicle length, measured from the base to the tip of the panicle, 7 days after anthesis. Filled grain average number of filled grain clumps per panicle Unfilled grain average number of empty grain clumps per panicle Grain per panicle total number of grain per panicle 1000 grain weight gr weight of 1000 full grain Yield t/ha tons of rice per hectare Table 1. Rice complex traits measured on 467 rice varieties in 4 locations

3 2 Methods 2.1 Bioinformatics Workflow We constructed a workflow that captures the bioinformatics needed for data quality control and analysis for both genotypes and phenotypes (trait) (Figure 2.1). Quality control procedures were designed to process the panel of 467 rice varieties captured across the four locations and the genetic data generated from the 384 and 1,536 genotyping arrays. The output of the quality control steps were cleaned data ready for downstream statistical modeling. These datasets were inputs to multi-step analyses pipelines. The output were summary tables and figures of statistical association (Figure 2.1). The workflow was implemented using a combination of software tools and custom programming that included a relational database, the whole genome association analysis toolset PLINK [3], and the [R] statistical language [4]. Fig. 1. Bioinformatics workflow of rice genetic and phenotypic data 2.2 Rice Relational Database The bioinformatics workflow operates on a backend relational database for data storage and retrieval. We selected PostgreSQL as a database management system (DBMS) because it is open source and well known for its security, scalability, and active developer community. The entity-relationship diagram for the rice database is presented in Figure 2.2. The database consists of three schemas containing the plant genotypes and trait characteristics. This included genotypes from the 1,536 and 384 SNP arrays

4 from ICABIOGRAD and comparative data from the International Rice Research Institute (IRRI). The snp map table describe the SNPs contained on the array, such as where (chromosome and position); the polymorphic nucleotides - adenine (A), cytosine (C), thymine (T), and guanine (G); and attributes of the array design. The sample map table contains data on the DNA samples and links to the trait data. The final report contains the genotypes for all the samples as well as information on the quality of the genotype calling. Trait data is stored in the phenotype table and linked to the sample map by a one-to-many relationship. The primary key for final report is the combination of sample index and snp index and dramatically improves the speed of sample based data retrieval (i.e., queries by sample). A second index for final report with the order of the columns reversed allows for quick retrieval of SNP-based queries (e.g., genotypes for particular SNPs). Once the genotype data is imported into the database, the data can be extracted (in whole or by subsets) using Structured Query Language (SQL). This allows for sophisticated filtering and quality control. Fig. 2. Rice genetic and phenotypic database

5 2.3 Quality control For phenotypes, the database is queried using the RPostgreSQL package in [R]. Various [R] scripts and functions are then called to summarize the distribution of each trait by location. Histograms and box plots are used to visually compare the distributions and assess normality. Summaries of each variable are created (i.e., minimum, quartiles, maximum) and outliers are reported for verification. For the array data, the genotypes are exported from the database into PLINK and converted to binary format for improved performance. The genotype call rate (i.e., the rate of non-missing genotypes) for SNPs and samples are computed and removed if less than 75%. The minor allele frequency (MAF), the frequency of the least common allele, are computed and SNPs with a MAF < 0.05 are flagged as rare variants. When samples were duplicated, genotype concordance is computed to identify SNPs that were not consistently called by clustering algorithms. Plots are created to evaluate if the clustering algorithm was correctly assigning genotypes. A database query retrieves the r, theta, allele1 ab, and allele2 ab columns from the final report table for particular SNPs. The polar intensities (r, theta) are plotted and compared to the called genotypes AA, AB, BB. When there are genotype misclassifications or missingness, further investigation into the clustering algorithm and assumptions are needed. The quality control steps yield cleaned datasets ready for statistical analyses. 2.4 Data Analysis Descriptive statistics for each trait are generated in [R] and stratified by location and season/year. Descriptive statistics for continuous variables are expressed as mean, median, standard deviation, and ranges. Analyses of continuous variables are performed using t-test or an analysis of variance (ANOVA), as appropriate. Discrete variables are expressed as frequencies and percentages. The analyses of discrete variables are performed using the appropriate chi-squared test. Fishers exact test are used for small cell sizes (< 5). For the phenotypes, all tests of significance are two-tailed, with statistical significance set at p < 0.05. Principal components analysis (PCA) was implemented in [R] to correct for stratification in the rice diversity panel [5]. The top principal components are used as covariates in regression modeling. The workflow uses generalized linear models (GLM) to model the relationship between traits and each of the 1,536 genotypes. This analysis is stratified by location and season/year. The data D contain the trait variable Y and a matrix of P explanatory variables X (which included each SNP and the top principal components as covariates). The expected value of Y i, the trait variable for rice variant i, depends on the linear predictors through the link function g such that, g(µ i ) = β 0 + P β p X ip I p (1) where µ i = E(Y i ), β p is the regression coefficient of variable p, and I p is a variable indicating if X p is included in the model M. The genotypes are coded additively p

6 0, 1, or 2 for the number of minor alleles. For continuous traits, the identity link function (i.e., linear( regression) ) is used. For binary traits, the logit link function π is used, g(π i ) = log i 1 π i (i.e., logistic regression). The workflow concludes with summarizes of the results obtained from previous steps. Quantile-quantile (QQ) plots are generated for each trait to compare the observed p-value distribution to the expected uniform distribution. Additionally, Manhatten plots are generated to show the p-value results of each association scan by rice chromosome, with p-values less than the user-defined genome-wide threshold for statistical significance highlighted. 3 Prototype Results The rice database consists of 17 tables in three schemas, the 1,536 and 384 SNP arrays from ICABIOGRAD and the IRRI 384 SNP array as an standard for assessment. The size of database is 280 Megabytes (Mb). The bioinformatics workflow (Figure 2.1) was run on the largest genotyping array (1,536 SNPs). With the entire diversity panel genotyped, there were 717,312 genotypes in the final report. A PLINK file was generated and quality control was performed. 16 rice samples and 139 SNPs were removed with poor call rates (< 75%). 451 samples and 1397 SNPs were available for statistical analysis. Principle components (PC) analysis was performed using all SNPs, and the first four PCs were included in model 1 as covariates. The days to flowering trait was used for prototyping. A summary of the distribution for this trait by location is presented in Figure 3. The box plot shows that there is variation in this trait by location. Fig. 3. Boxplot for days to flowering by location (n=467) A genome-wide association analysis was performed on days to flowering for the rice planted in the greenhouse (BG). The resulting Manhattan plot for the

7 association scan is presented in Figure 3. The log 10 p are presented along the y-axis and the position of the SNP along the 12 rice chromosomes are presented on the x-axis. Two SNPs showed evidence for association with days to flowering in the greenhouse rice (colored red). These SNPs were located on chromosome 4 and 12 of the rice genome (Figure 3). Fig. 4. Genome-wide association results for days to flowering, greenhouse For the top associated SNP, the polar intensities were plotted versus the called genotypes. Intensities with the homozygous genotypes AA and BB were colored in red and green respectively. Heterozygous genotypes AB were plotted in blue. Uncalled genotypes were plotted in black. The graphic illustrates that for this SNP there was a distinct cluster for homozygous genotypes (AA and BB) and a large cluster for heterozygous genotypes (AB). There were 12 samples with no genotype for this SNP. 4 Conclusions We developed a custom bioinformatics workflow for a genome-wide association study of traits in Indonesian rice. The workflow consisted of a relational database and multi-step quality control and data analysis procedures automated in PLINK and [R]. The prototype results demonstrated that the workflow is useful for summarizing and visualizing the complex data from this study.

8 Fig. 5. Clustering for top associated SNP Future work includes configuring the workflow to run for all traits, locations, and seasons. Additionally, improvements to the clustering algorithm and statistical modeling framework may be made. Given the small sample sizes in this study and that rice is highly homozygous, an alternative algorithm such as ALCHEMY may improve genotype calling [6]. Model 1 may be extended to account for the relatedness among the rice varieties using a mixed model [7]. Environmental factors include the habitat, location, season, and year are possible modifiers of the relationship between genetics and traits. The statistical framework can be modified to consider gene-environment interactions (GxE). This bioinformatics workflow gives researchers the tools needed to easily and consistently quality control and analyze complex data. This research will help locate genes that are important for developing new rice varieties that ensure future food security in Indonesia. 5 Acknowledgements This research is funded by the Indonesian Center for Agricultural Biotechnology and Genetic Resources Research and Development, Indonesian Agency for Agricultural Research and Development, Ministry of Agriculture, Indonesia. Computing support by AWS in Education Grant award.

9 References 1. International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature. 2005 Aug 11;436(7052):793-800. 2. Zhao K, Tung CW, Eizenga GC, Wright MH, Ali ML, Price AH, Norton GJ, Islam MR, Reynolds A, Mezey J, McClung AM, Bustamante CD, McCouch SR.: Genomewide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat Commun. 2011 Sep 13;2:467. 3. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC.: PLINK: a tool set for wholegenome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559-75. 4. R Development Core Team.: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051- 07-0, URL http://www.r-project.org/. 5. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D.: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006 Aug;38(8):904-9. 6. Wright MH, Tung CW, Zhao K, Reynolds A, McCouch SR, Bustamante CD. ALCHEMY: a reliable method for automated SNP genotype calling for small batch sizes and highly homozygous populations. Bioinformatics. 2010 Dec 1;26(23):2952-60. 7. Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES.: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006 Feb;38(2):203-8.