Different Influence on RNAs and Proteins by Variants with different MAFs


Gong BS 1, Chen X 1, Wang XY 1, Zhao Z 1, Du L 1, Li CM 1, Zhang XY 1, Li X 1, Rao SQ 1

1 College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, P. R. China

Corresponding author: Li X, Ph.D., Dean, College of Bioinformatics Science and Technology, Harbin Medical University, lixia@hrbmu.edu.cn
Email of Rao SQ: raos@ccf.org

Abstract

Exonic variants may influence transcription, RNA stability and protein level; nonsynonymous variants in particular can directly affect the stability and/or function of proteins. In this study, we assessed the influence of variants with different minor allele frequencies (MAFs) on RNAs and proteins. Common variants affect RNA secondary structure to a similar degree as rare ones, but nonsynonymous variants have a greater influence than synonymous ones. The influence on RNA stability differs both between common and rare variants and between nonsynonymous and synonymous variants. Our findings also show that, among nonsynonymous variants, common ones have an even larger influence on RNA stability than rare ones. In contrast, rare variants have a greater impact on protein function than common ones.

Keywords

Exonic variants; RNA stability; RNA secondary structure; Deleterious to protein function; Rare variants

Background

Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in the human genome, and hundreds of genome-wide association (GWA) studies have been performed to identify common variants that are statistically linked with particular diseases. Most SNPs are neutral and can explain only a small fraction of the heritability of any disease, but some are functional and influence gene expression, RNA structure and protein function, which makes them valuable for identifying potential or causal disease genes.
It is estimated that about 10 million SNPs exist in the human population, of which at least 1% are functional (see International HapMap Project). With the development of next-generation sequencing technology, exonic variants can now be detected comprehensively, including low-frequency variants that earlier genome scans found only sporadically. Previous findings have demonstrated that exonic variants may influence RNA structure, RNA stability and protein level, and that nonsynonymous variants in particular may directly influence protein structure and function [1-3]. In this study, we assessed the impact of exonic variants on RNA structure, RNA stability and protein function using data from GAW17, and analyzed how this impact differs between two SNP types (synonymous and nonsynonymous) and among several MAF groups (private, rare, less common and common).

Results

The Impact of Variants on RNA Structure and Stability

For the variants located exactly in RNAs, we analyzed the influence of variants in different MAF groups using the RNA structure and stability measures (see Methods). First, chi-square tests show no significant differences among groups in the proportion of variants that alter RNA structure. In each group, the variants that affect RNA structure are about three times as numerous as those that do not, suggesting that most variants have an impact on RNA structure regardless of their MAFs (some cases are illustrated in Fig 1). Furthermore, two-sample Wilcoxon tests were performed on every pair of MAF groups to compare their degree of influence (Table 1). The results suggest that private variants tend to change RNA structure more than common ones, but there is no significant difference in the influence on RNA secondary structure among the other MAF groups. Second, our analysis of RNA stability (see Methods) shows a similar result for the proportion of affecting variants, with the ratio rising to about six times.
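The pairwise two-sample Wilcoxon comparisons reported in this section, together with the Bonferroni threshold applied to them, can be illustrated with a minimal sketch. It is not the authors' code: the function names, the normal approximation without continuity correction, and the default alpha of 0.05 are assumptions made for illustration, and the results will differ slightly from those of a statistics package that uses exact or continuity-corrected p values.

```python
import math
from itertools import combinations


def rank_sum_test(x, y):
    """Two-sample Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (two_sided_p, one_sided_p), where the one-sided alternative is
    "values in x tend to be greater than values in y".
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign midranks so that tied values share the average of their ranks.
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # all observations tied: no evidence either way
        return 1.0, 0.5
    z = (u - mean_u) / math.sqrt(var_u)
    p_greater = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    p_two_sided = min(1.0, 2.0 * min(p_greater, 1.0 - p_greater))
    return p_two_sided, p_greater


def pairwise_bonferroni(groups, alpha=0.05):
    """Compare every pair of groups; flag pairs below alpha / number_of_pairs."""
    pairs = list(combinations(groups, 2))
    threshold = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        p_two, p_greater = rank_sum_test(groups[a], groups[b])
        results[(a, b)] = {
            "two_sided": p_two,
            "greater": p_greater,
            "significant": p_two < threshold,
        }
    return results
```

Here `groups` would map a MAF group label (for example "P", "R", "LC", "C") to the list of per-variant scores in that group; the one-sided value answers whether the first group of a pair tends to have the larger scores.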
Wilcoxon tests on every pair of groups with Bonferroni correction demonstrate that variants in different MAF groups have different influence on RNA stability (Table 1). One-sided Wilcoxon tests show that variants with larger MAFs tend to have greater influence on RNA stability under Bonferroni correction; in particular, the difference between variants with MAFs below 1% (private and rare variants) and those with larger MAFs (less common and common variants) is extremely significant (p=3.72e-06).

[Fig 1 panel labels: reference transcript NM_ with its MFE (kcal/mol); C1S variants with their MFE (kcal/mol), labeled Private Sy, LessCommon Sy, Common Sy, Rare NS, Private NS, Rare Sy and LessCommon NS]

Fig 1 Influence of Variants on RNA Structures and stability of NM_

Table 1 Different influence on RNAs among MAF groups
Columns: Comparison#; total (Structure, Stability); Synonymous (Structure, Stability); Nonsynonymous (Structure, Stability)
Comparisons: P vs. R; P vs. LC; P vs. C; R vs. LC; R vs. C; LC vs. C; P&R vs. LC; LC vs. C; P&R vs. LC&C; P&R vs. C
Two-sided p values of two-sample Wilcoxon tests are listed in the table. One-sided p values are calculated if the two-sided p values are statistically significant, and are listed below them.
# Listed are all comparisons we carried out on influence differences between every two variant groups divided according to MAF. P, R, LC and C are short for private variant group, rare variant group, less common variant group and common variant group, respectively.
The p value is statistically significant under Bonferroni correction.

Usually, nonsynonymous variants are considered to affect the stability and/or function of both RNAs and proteins, and synonymous ones may affect proteins through RNA. To investigate whether synonymous and nonsynonymous variants differ in their influence on RNA structure and stability, two-sample Wilcoxon tests were performed. The results demonstrate that the effects do differ (two-sided p values are 3.07E-02 for the structure measure and 1.14E-02 for the stability measure), and that nonsynonymous variants have greater influence than synonymous ones on both RNA structure and RNA stability (one-sided p values are 1.53E-02 and 5.70E-03 for the structure measure and the stability measure, respectively).

We further analyzed the differences between every pair of MAF groups within synonymous variants and within nonsynonymous variants separately. For synonymous variants, there is no significant difference among MAF groups in the effect on either RNA secondary structure or RNA stability. For nonsynonymous variants, the differences among MAF groups in the effect on RNA secondary structure are likewise not significant; however, the differences in the effect on RNA stability are statistically significant under Bonferroni correction. The two-sided p values of two-sample Wilcoxon tests on every pair of groups are listed in Table 1, and for each comparison with a two-sided p value < 0.05, a one-sided p value was also calculated and listed. The one-sided Wilcoxon tests suggest that nonsynonymous variants with larger MAFs have greater influence on RNA stability than those with smaller MAFs.

The Impact of Missense Variants on Proteins

In this study, the influence of missense variants on proteins was analyzed.
Comparisons of the damaging effects of variants among MAF groups, based on the binary and ternary qualitative indicators, were performed with chi-square tests, and all test p values are statistically significant (Table 2), indicating that the influence does differ among MAF groups. To explore the relationship between MAF and the damaging effect of a variant, the differences among MAF groups were further analyzed with two-sample Wilcoxon tests on every pair of groups; both the two-sided and the one-sided p values are listed in Table 2. All results show that variants with smaller MAFs have greater influence on proteins and are more likely to be deleterious than those with larger MAFs. In addition, a test for correlation between MAFs and the damaging probabilities of variants was performed; the Spearman rho, with an extremely significant p value of 8.96e-134, suggests that the lower the MAF of a variant, the greater its impact on proteins, consistent with the Wilcoxon test results.

Table 2 Different influence on proteins between every two MAF groups

Category     chisq_p    chisq_p     Comparison      Wilcoxon_equal_p   Wilcoxon_greater_p
             (binary)   (ternary)                   (probability)      (probability)
P, R, L, C   1.69E      E-94        P vs. R         0 #                0
                                    P vs. LC        0                  0
                                    P vs. C         0                  0
                                    R vs. LC        9.04E              E-07
                                    R vs. C         0                  0
                                    LC vs. C        0                  0
P&R, L, C    7.26E      E-80        P&R vs. C       0                  0
                                    P&R vs. LC      0                  0
                                    LC vs. C        0                  0
P&R, L&C     6.11E      E-63        P&R vs. LC&C    0                  0

The p value is statistically significant under Bonferroni correction.
# The p value 0 of the Wilcoxon test results from the limited calculation precision of the R software.
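The correlation test between MAF and damaging probability mentioned above relies on the Spearman rank correlation. The following is a minimal sketch of how the coefficient itself is computed (the Pearson correlation of midranks); the function names and the example values are hypothetical, and the p value reported in the text is not reproduced here.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = midrank
        i = j
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the midranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: damaging probability falls as MAF rises.
example_maf = [0.001, 0.01, 0.05, 0.2]
example_probability = [0.9, 0.7, 0.4, 0.1]
rho = spearman_rho(example_maf, example_probability)
```

A negative rho corresponds to the pattern described in the text, in which variants with lower MAFs have higher damaging probabilities.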

Methods

Data Preparation

The genes and variants used in this study were obtained from GAW17, and the corresponding RNA sequences and amino acid sequences were obtained from RefSeq. Of the 3205 genes covered by GAW17, 15 genes, containing 72 SNPs, have no transcripts that could be found, and 4570 RNA sequences containing SNPs were obtained. All variants can be divided into four categories based on their MAFs: private variants (MAF= , appearing only once in the 697 individuals), rare variants (MAF<1% but still polymorphic in the population), less common variants (1%≤MAF<5%) and common variants (MAF≥5%). The numbers of private, rare, less common and common variants in GAW17 are 9433, 8698, 3224 and 3132, respectively. The four categories are denoted by P, R, L (LC in Table 1 and Table 2) and C; P&R means that private and rare variants are grouped into a single category, and so forth. A variant that falls within a coding region can be further classified as synonymous or nonsynonymous. Of the variants GAW17 provided, are nonsynonymous and are synonymous according to the file snp_info.

Locating Variants in RNAs and Proteins

Using the chromosome and base-pair locations of variants and the genes in which they are located, both provided by the file snp_info of GAW17, together with the chromosomal regions of RNAs taken from RefSeq 36.3, we obtained the relative positions of variants in the corresponding RNAs of their host genes. To avoid the influence of post-transcriptional modifications such as RNA editing on the following analysis, the location of each variant was corrected by aligning the surrounding DNA sequence with the corresponding cDNA fragment. Eventually, positions of variants in 4180 RNAs were obtained, involving 2990 genes. Similarly, according to the positions of variants in RNAs and the coding regions of those RNAs, we located variants falling within coding regions on the amino acid sequences of their host genes.
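The step of converting a chromosomal position into a position within a transcript can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used: the exon list format (1-based, inclusive genomic start and end per exon), the strand symbols and the function name are hypothetical, and the alignment-based correction described above is not included.

```python
def genomic_to_transcript(position, exons, strand):
    """Map a 1-based genomic position to a 1-based transcript position.

    exons  -- list of (start, end) genomic intervals, 1-based and inclusive
    strand -- "+" or "-"
    Returns None when the position lies outside every exon.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Exons are walked in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    bases_before = 0  # transcript bases contributed by exons already passed
    for start, end in ordered:
        if start <= position <= end:
            within = position - start if strand == "+" else end - position
            return bases_before + within + 1
        bases_before += end - start + 1
    return None


# Hypothetical two-exon transcript.
example_exons = [(100, 199), (300, 349)]
plus_first = genomic_to_transcript(100, example_exons, "+")
plus_second_exon = genomic_to_transcript(300, example_exons, "+")
minus_first = genomic_to_transcript(349, example_exons, "-")
intronic = genomic_to_transcript(250, example_exons, "+")
```

On the minus strand the transcript base is the complement of the genomic base, so a consistency check between the reference allele and the transcript sequence would need to take the strand into account.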
Among the variants successfully located in their host RNAs, 9451 are synonymous and are nonsynonymous in GAW17. of the nonsynonymous variants cannot be located in proteins because RefSeq 36.3 contains no corresponding sequences, and positions in 2880 proteins of the remaining variants were obtained. Because of alternative splicing, a variant can be synonymous for some proteins while it is nonsynonymous for others. Among the positions we located, 22 amino acid substitutions produced in 21 proteins by 18 variants were synonymous, and 5 of the 18 variants are both synonymous and nonsynonymous. Of the variants that are assumed to be nonsynonymous in GAW17 and have definite positions in the RefSeq 36.3 protein sequences, 13 were classified as synonymous in this study according to the sequences and their positions, and were missense in different positions.

The Impact of Variants on RNA Structure and Stability

The wild-type RNA sequences were taken from the reference RNA sequences, and the mutant-type RNA sequences were generated from the wild-type sequences by introducing the mutation at each SNP position. Because of alternative splicing, a single gene may have more than one transcript; for each transcript, we obtained a pair of RNA sequences for a variant by substituting the base in the reference RNA with the other allele. Using RNAfold, a program that predicts the minimum-energy secondary structure of an RNA from its sequence and calculates its minimum free energy (MFE), we obtained the secondary structures and MFEs of all SNP-RNA pairs. Since previous findings suggest that the stability and the specific secondary structure of RNAs may affect their expression [4, 5], two aspects of SNP effects on RNA were considered in this research: the alteration in RNA structure and the change in RNA stability.
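The construction of a wild-type/mutant sequence pair by single-base substitution can be sketched as below. The function name and the 1-based position convention are assumptions for illustration; in the workflow described above, each sequence of the pair would then be folded with RNAfold, which is not invoked here.

```python
def allelic_pair(reference_rna, position, alt_allele):
    """Return (wild_type, mutant) sequences for a single-base substitution.

    position is 1-based within the transcript; alt_allele is the other allele.
    """
    if not 1 <= position <= len(reference_rna):
        raise ValueError("position lies outside the transcript")
    if len(alt_allele) != 1:
        raise ValueError("alt_allele must be a single base")
    if reference_rna[position - 1] == alt_allele:
        raise ValueError("alt_allele equals the reference base")
    mutant = reference_rna[:position - 1] + alt_allele + reference_rna[position:]
    return reference_rna, mutant


# Hypothetical six-base fragment with a G>A substitution at position 3.
wild_type, mutant = allelic_pair("AUGGCC", 3, "A")
```

Because a gene may have several transcripts, the same variant would yield one such pair per transcript that contains it.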
For each SNP-RNA pair, the dissimilarity between the RNA secondary structures corresponding to the two alleles was calculated using RNAdistance, and the dissimilarity score was used to evaluate the alteration in RNA structure. MFE can be used as an indicator of RNA stability. In this study, the change in RNA stability was measured by the MFE change rate, defined as the ratio of the MFE change between the two allelic RNAs to the MFE of the wild-type RNA. For each SNP, its influence on RNA structure was assessed by the maximal dissimilarity score over all corresponding SNP-RNA pairs. Similarly, its influence on stability was assessed by the maximal MFE change rate over all corresponding SNP-RNA pairs. Limited by the restriction of RNAdistance on the length of RNA sequences, of the SNP-RNA pairs, only dissimilarity scores between the two allelic RNAs of a SNP were calculated, involving SNPs and 3667 RNAs. Consequently, the influence of SNPs on RNA structure and of SNPs on RNA stability was analyzed.

The Impact of Variants on Protein Functions

The influence of nonsynonymous variants on proteins was evaluated according to structural and functional differences using PolyPhen [6], a software tool for predicting the damaging effects of missense mutations from both sequence alignments and structural information. Bayes posterior probabilities of amino acid substitutions were obtained to represent the likelihood that a mutation is damaging. A binary classification (deleterious or neutral) and a ternary qualitative classification (benign, possibly damaging or probably damaging) are also provided. Nonsense variants, which produce stop codons that lead to premature termination of the polypeptide sequence, were not considered in our study. Substitutions for which PolyPhen cannot make a prediction are reported as unknown in the results. As in the evaluation of the influence of variants on RNAs, the maximum probability that a variant is damaging was used as the measure of its influence on proteins, and the corresponding qualitative indicators were also used. There were 442 variants whose effects could not be predicted for any protein of their host genes, and ultimately the influence of variants on proteins was analyzed in this study.
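The per-variant summaries used in Methods (the maximum score over all transcripts or proteins of a variant, and the MFE change rate) can be sketched as follows. Two points are assumptions rather than statements from the text: the change rate is taken as an absolute value, and unpredicted scores are represented by None and skipped. All identifiers in the example are hypothetical.

```python
def mfe_change_rate(mfe_wild_type, mfe_mutant):
    """MFE change between the two allelic RNAs relative to the wild-type MFE.

    Assumption: the absolute value is used, so the rate is non-negative.
    """
    if mfe_wild_type == 0:
        raise ValueError("wild-type MFE must be non-zero")
    return abs(mfe_mutant - mfe_wild_type) / abs(mfe_wild_type)


def per_variant_max(records):
    """Keep, for each variant, the largest score over its transcripts or proteins.

    records -- iterable of (variant_id, molecule_id, score); a score of None
               marks a pair without a prediction and is ignored.
    """
    best = {}
    for variant_id, _molecule_id, score in records:
        if score is None:
            continue
        if variant_id not in best or score > best[variant_id]:
            best[variant_id] = score
    return best


# Hypothetical MFE values (kcal/mol) for one variant on two transcripts.
example_records = [
    ("snp_a", "transcript_1", mfe_change_rate(-100.0, -98.0)),
    ("snp_a", "transcript_2", mfe_change_rate(-100.0, -95.0)),
    ("snp_b", "transcript_3", None),
]
summary = per_variant_max(example_records)
```

A variant whose scores are all missing does not appear in the result, which mirrors the exclusion of variants that could not be predicted for any protein of their host genes.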
Discussion

Synonymous variants are generally thought to have little or no direct effect at the protein level, while they often influence gene expression and RNA structure. Nonsynonymous variants, by contrast, are usually considered first for their effect on the stability and/or function of proteins. Recently, however, some findings suggest that some nonsynonymous variants have no effect on the structure and function of proteins but influence protein expression by altering mRNA secondary structure and mRNA stability [7, 8]. Our findings suggest that, in exons, nonsynonymous variants are more likely than synonymous ones to affect RNA secondary structure and stability, and that nonsynonymous variants with larger MAFs have a greater impact on RNAs but a smaller influence on proteins than those with lower MAFs.

Because experimental detection is limited, the secondary structures and MFEs of RNAs in our study were predicted by RNAfold, and the damaging effects of missense mutations on protein function were predicted by PolyPhen. Although these tools have been widely used to predict structure and function in RNA and protein studies [9-11], their limited prediction accuracy may perturb our results. The exomic data provided by GAW17 consist of a total of SNPs located in 3205 genes, which cover about 10% of all genes in the genome. Our analyses are all based on GAW17 and therefore reveal the impact of single nucleotide variants on RNAs and proteins only from a partial perspective.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos and ), the National High Tech Development Project of China, the 863 Program (Grant No. 2007AA02Z329), the National Basic Research Program of China, the 973 Program (Grant No. 2008CB517302) and the National Science Foundation of Heilongjiang Province (Grant Nos. HCXB , 1055HG009, GB03C602-4, JC200711, ZD and BMFH060044).

References

1. Del Guerra S, D'Aleo V, Gualtierotti G, Filipponi F, Boggi U, De Simone P, Vistoli F, Del Prato S, Marchetti P, Lupi R: A common polymorphism in the monocyte chemoattractant protein-1 (MCP-1) gene regulatory region influences MCP-1 expression and function of isolated human pancreatic islets. Transplant Proc, 42(6):
2. Wang D, Johnson AD, Papp AC, Kroetz DL, Sadee W: Multidrug resistance polypeptide 1 (MDR1, ABCB1) variant 3435C>T affects mRNA stability. Pharmacogenet Genomics 2005, 15(10):
3. Shen LX, Basilion JP, Stanton VP, Jr.: Single-nucleotide polymorphisms can cause different structural folds of mRNA. Proc Natl Acad Sci U S A 1999, 96(14):
4. Boudikova B, Szumlanski C, Maidak B, Weinshilboum R: Human liver catechol-O-methyltransferase pharmacogenetics. Clin Pharmacol Ther 1990, 48(4):
5. Carlini DB, Chen Y, Stephan W: The relationship between third-codon position nucleotide content, codon bias, mRNA secondary structure and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics 2001, 159(2):
6. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods, 7(4):
7. Nackley AG, Shabalina SA, Tchivileva IE, Satterfield K, Korchynskyi O, Makarov SS, Maixner W, Diatchenko L: Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science 2006, 314(5807):
8. Capasso M, Ayala F, Russo R, Avvisati RA, Asci R, Iolascon A: A predicted functional single-nucleotide polymorphism of bone morphogenetic protein-4 gene affects mRNA expression and shows a significant association with cutaneous melanoma in Southern Italian population. J Cancer Res Clin Oncol 2009, 135(12):
9. Livingston RJ, von Niederhausern A, Jegga AG, Crawford DC, Carlson CS, Rieder MJ, Gowrisankar S, Aronow BJ, Weiss RB, Nickerson DA: Pattern of sequence variation across 213 environmental response genes. Genome Res 2004, 14(10A):
10. Chun S, Fay JC: Identification of deleterious mutations within three human genomes. Genome Res 2009, 19(9):
11. Rajasekaran R, Sudandiradoss C, Doss CG, Sethumadhavan R: Identification and in silico analysis of functional SNPs of the BRCA1 gene. Genomics 2007, 90(4):