SNP finding in pig mitochondrial ESTs

Karsten Scheibye-Alsing(1), Susanna Cirera(1), Michael J. Gilchrist(2), Merete Fredholm(1), Jan Gorodkin(1,*)

(1) Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg, Denmark
(2) The Wellcome Trust/Cancer Research UK Gurdon Institute, Cambridge, CB2 1QN, UK
(*) To whom correspondence should be addressed

SUMMARY: The Sino-Danish pig genome project produced 685,851 ESTs (Expressed Sequence Tags) (Gorodkin et al. 2007), of which 41,499 originate from the mitochondrial genome. In this study the mitochondrial ESTs were assembled and 374 putative SNPs were found. Chromatograms for the ESTs containing SNPs were manually inspected, and 112 SNPs (52 non-synonymous) were found to be of high confidence; five of them are close to disease-causing SNPs in humans. Nine of the high-confidence SNPs were tested experimentally and eight were confirmed. The SNPs can be accessed online at

KEYWORDS: SNPs, ESTs, pig, mitochondrial

MAIN TEXT: The mitochondrial genome is studied because of its importance to basic cellular function and to disease mechanisms, and it has been studied across many different mammalian species. Since the pig is increasingly being used as a model animal (Lunney, 2007), studies of the mitochondrion in this species are of particular interest. It has been shown that the pig is closer to human than the mouse is in sequence space (Wernersson et al. 2005), and a similar observation can readily be made for the mitochondrial genome (data not shown). Hence any SNP in pig could be useful for studying phenotypic effects, and such effects can readily be related to humans.

The Sino-Danish pig genome project has generated 685,851 ESTs, which together with the public resources constitute a substantial resource for genetic and functional studies (Gorodkin et al. 2007). Since the ESTs represent many different animals, SNP discovery has been a natural aim of the project (Panitz et al. 2007). Here, we present a SNP study of the mitochondrially encoded ESTs from the Sino-Danish pig genome project.

The EST sequences from the Sino-Danish pig genome project were merged with public sequences for the analysis (see Gorodkin et al. 2007). ESTs that matched the pig mitochondrial sequence (GenBank: AF486866) in a BLAST search were selected for further analysis. This identified 41,499 porcine mitochondrial EST sequences (2744 from public sources), which were assembled (Scheibye-Knudsen et al., in preparation) using the Distiller pipeline (Gilchrist et al. 2004). The resulting contigs covered the majority (94%) of the mitochondrial genome, including every protein-coding gene and ncRNA structure.

The assembly generated 35 contigs. Based on the pragmatic criteria of presence in at least two ESTs and a minor allele frequency of at least 2%, we predicted 374 putative SNPs. By mapping the contigs against the known pig mitochondrial genome, it was furthermore possible to find the exact location of each SNP on the mitochondrial genome, and thereby to determine which gene each SNP was located in and whether it was synonymous or non-synonymous. An overview of the mapping is shown in Figure 1.

Comparative mapping of the pig and human mitochondrial genomes also made it possible to compare the SNPs to loci in the human mitochondrion that are known to be disease causing (from Mitomap). There are presently at least 2500 known human mitochondrial SNPs, of which 305 have been reported to be associated with disease.
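The two SNP-calling criteria above can be illustrated with a minimal sketch. It assumes that "presence in at least two ESTs" refers to the minor allele, and that the minor allele frequency is the count of the second most common base divided by the number of ESTs covering the position; the function name `call_putative_snp` and its parameters are hypothetical and are not part of the Distiller pipeline.

```python
from collections import Counter


def call_putative_snp(column, min_ests=2, min_maf=0.02):
    """Apply the two pragmatic criteria to one alignment column.

    column: iterable of base calls, one per EST covering the position.
    Returns (major_allele, minor_allele, minor_allele_frequency) when the
    column passes both thresholds, otherwise None.
    Assumption: the thresholds apply to the second most common base.
    """
    # Count only unambiguous bases; gaps and N calls are ignored.
    counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
    if len(counts) < 2:
        return None  # no variation at this position
    (major, _), (minor, minor_n) = counts.most_common(2)
    depth = sum(counts.values())
    maf = minor_n / depth
    if minor_n >= min_ests and maf >= min_maf:
        return major, minor, maf
    return None
```

With these thresholds, a column of 98 A calls and 2 G calls passes, whereas a single G among 99 A calls fails the two-EST criterion.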

However, mitochondrial diseases have an added layer of complexity compared to diseases caused by polymorphisms in chromosomes. This is because mitochondria are often heteroplasmic, i.e. they contain many subtly different DNA molecules. When trying to determine whether a disease is caused by a mitochondrial SNP, it is therefore a challenge to detect the SNP, which might only be present in a minority of the mitochondrial genomes, and the SNP also needs to be present in enough mitochondrial genomes to actually affect the health of the individual. For this reason we report all putative SNPs we found, as a resource for researchers.

The SNPs were manually inspected to assess their reliability. Depending on the features of the chromatogram traces associated with the polymorphism, different confidence levels were assigned: 1 for the most confident; 2 for problematic; 3 for probably wrong; and U for unknown. A polymorphism was classified as highly confident (level 1) if it was supported by chromatograms with a clear peak at the position and no underlying peaks (or noise). Level 2 (problematic) was assigned to polymorphisms that had a clear primary peak but also a lower secondary peak, which lowers confidence in the base call. Level 3 (probably wrong) was assigned to SNPs where the primary trace peak was indistinct and often accompanied by a secondary peak of comparable signal strength. The unknown (U) classification was used for polymorphisms that were primarily reported in sequences not originating from the Sino-Danish project, which made it unfeasible to inspect them consistently.

In total 112 polymorphisms were classified as highly confident (category 1), 105 as problematic (category 2), 71 as probably wrong (category 3), and 86 as unknown. An overview of these 374 SNPs in total and their proximity to disease-causing loci (mapped to within 20 base pairs) is given in Table 1.
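The 20 base pair proximity test can be sketched as follows. This is only an illustration: it assumes the pig SNP positions have already been translated into the same coordinate system as the disease-associated loci, and the function name `near_disease_locus` and the positions in the usage note are made up for the example.

```python
from bisect import bisect_left


def near_disease_locus(snp_pos, disease_positions, window=20):
    """Return the nearest disease-associated position within `window` bp.

    disease_positions: list of integer positions, sorted in ascending order.
    Returns None when no locus lies within the window.
    Assumption: both inputs share one coordinate system.
    """
    i = bisect_left(disease_positions, snp_pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = disease_positions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda p: abs(p - snp_pos), default=None)
    if nearest is not None and abs(nearest - snp_pos) <= window:
        return nearest
    return None
```

For example, with hypothetical loci at positions 100, 500 and 900, a SNP at 515 is reported as lying near 500, while a SNP at 521 is not near any locus.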
The manual assessment of SNP confidence relies on inspection of each SNP and depends on the judgment of the person doing the curation, as opposed to a computational evaluation based on nucleotide quality values. However, the quality values assigned by the base caller (PHRED [ref]) did not always correlate well with the confidence placed in a SNP by our manual curation, so the in-depth manual inspection was carried out. Some bases had quality values above 30 while still being of dubious confidence when manually checked. It might be argued that another base caller, or different base-calling parameters, could have been used instead; however, PHRED is generally accepted as a good base caller, and quality values above 30 are normally regarded as highly confident.

From the 374 putative polymorphisms, we selected two batches of eight SNPs for experimental validation: one batch of non-synonymous SNPs with a high minor allele frequency (cutoff of 14%), and one batch of non-synonymous SNPs mapped within 20 base pairs of known disease-causing SNPs in human (from Mitomap). We selected these SNP candidates regardless of confidence level, in part to evaluate the reliability of the confidence assignments.

The experimental validation was performed by sequencing the relevant gene fragments in genomic DNA isolated from the animals from which tissues for the respective libraries were sampled. Many of the libraries were constructed from RNA pools of several animals (Gorodkin et al., 2007). The genomic DNA was therefore sequenced from five different animals per SNP on average.

The high-frequency SNPs were located in the genes NADH5, NADH2, COIII, ATP6 (2 SNPs), ATP8, COI and NADH4. Primers flanking the SNPs were designed and used to PCR-amplify genomic DNA (containing mitochondrial DNA) from the animals used to make the cDNA libraries from which the EST sequences had been generated. The resulting product was subsequently sequenced. Six of the SNPs were confirmed in the sequences generated from the DNA. We checked the two SNPs that were not confirmed (COI and NADH4) in the cDNA clones from which the original sequences had been generated, to see whether they were sequencing errors. Since the SNPs were not detected in the clones, the original findings must be the result of sequencing errors.
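As background for the quality threshold of 30 mentioned above, the Phred scale defines a quality value Q from the estimated base-call error probability P as Q = -10 log10(P), so Q = 30 corresponds to an estimated error rate of 1 in 1000. The conversion can be written as a short sketch; the function names are illustrative and are not part of the PHRED program.

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Note that such a value summarises the base caller's own error estimate for a single call; as described above, it did not always agree with manual inspection of the traces.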

The disease-related SNPs were located in the genes NADH1 (3 SNPs), ATP6 and NADH6 (4 SNPs). They were tested using the same procedure as described above. Seven of the eight SNPs were not found in the genomic DNA analyzed. We checked two of these SNPs further (ATP6 and NADH6) in the cDNA clones to see whether they were sequencing errors: the ATP6 SNP was confirmed in the clone, while the SNP in NADH6 was established to be a sequencing error. The results of the experimental validation are summarised in Table 2. Of the nine experimentally tested high-confidence SNPs, eight were confirmed.

In conclusion, we have presented a novel SNP resource containing 374 SNPs, of which 112 are of high confidence.

REFERENCES:

[Remember reference for PHRED]

Gilchrist M, Zorn A, Voigt J, Smith J, Papalopulu N, Amaya E: Defining a large set of full-length clones from a Xenopus tropicalis EST project. Dev Biol 2004, 271(2):

Gorodkin J, Cirera S, Hedegaard J, Gilchrist M, Panitz F, Jorgensen C, Scheibye-Knudsen K, Arvin T, Lumholdt S, Sawera M, Green T, Nielsen B, Havgaard J, Rosenkilde C, Wang J, Li H, Li R, Liu B, Hu S, Dong W, Li W, Yu J, Wang J, Staerfeldt H, Wernersson R, Madsen L, Thomsen B, Hornshoj H, Bujie Z, Wang X, Wang X, Bolund L, Brunak S, Yang H, Bendixen C, Fredholm M: Porcine transcriptome analysis based on 97 non-normalized cDNA libraries and assembly of 1,021,891 ESTs. Genome Biology 2007, 8:R45.

Lunney J: Advances in Swine Biomedical Model Genomics. Biol Sci 2007, 3(3):179-184.

Panitz F, Stengaard H, Hornshøj H, Gorodkin J, Hedegaard J, Cirera S, Thomsen B, Madsen LB, Høj A, Vingborg RK, Zahn B, Wang X, Wang X, Wernersson R, Jørgensen CB, Scheibye-Knudsen K, Arvin T, Lumholdt S, Sawera M, Green T, Nielsen BJ, Havgaard JH, Brunak S, Fredholm M, Bendixen C: SNP mining porcine ESTs with MAVIANT, a novel tool for SNP evaluation and annotation. Bioinformatics 2007, 23:i387-i391.

Scheibye-Knudsen K, Cirera S, Gilchrist M J, Fredholm M, Gorodkin J: Pig Mitochondria EST analysis reveal novel expression patterns and SNPs. In preparation.

Mitomap - A human mitochondrial genome database.

Wernersson R, Schierup M, Jørgensen F, Gorodkin J, Panitz F, Staerfeldt H, Christensen O, Mailund T, Hornshøj H, Klein A, Wang J, Liu B, Hu S, Dong W, Li W, Wong G, Yu J, Wang J, Bendixen C, Fredholm M, Brunak S, Yang H, Bolund L: Pigs in sequence space: a 0.66X coverage pig genome survey based on shotgun sequencing. BMC Genomics 2005, 106(1) (70).

Table 1, SNP number overview: The number of SNPs in each category. Near-overlapping SNPs are those mapped to within 20 base pairs of a known human disease-causing locus; the numbers in parentheses give the synonymous and non-synonymous SNPs respectively. The columns sSNP and nsSNP list the total number of synonymous and non-synonymous SNPs. The confidence levels are as explained in the text.
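The synonymous versus non-synonymous split reported in Table 1 depends on translating codons with the vertebrate mitochondrial genetic code (NCBI translation table 2), which differs from the standard code: TGA encodes Trp, ATA encodes Met, and AGA/AGG are stop codons. The sketch below shows one way such a classification could be made. It assumes the codon is given on the coding strand of the gene and that the SNP's offset within the codon is already known; the function name `classify_snp` is hypothetical and this is not the procedure used by the authors.

```python
BASES = "TCAG"
# NCBI translation table 2 (vertebrate mitochondrial), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (first, second, third base each cycling T, C, A, G).
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(codon, offset, alt_base):
    """Classify a single-base change within one codon.

    codon: reference codon on the coding strand (three of A, C, G, T).
    offset: 0-based position of the SNP within the codon.
    alt_base: the alternative allele at that position.
    """
    codon = codon.upper()
    mutated = codon[:offset] + alt_base.upper() + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"
```

Under this code, ATA to ATG is synonymous (both Met), whereas ATT to ATG changes Ile to Met and is non-synonymous. For a gene encoded on the opposite strand, the codon and allele would first have to be reverse-complemented.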

Table 2, Experimentally tested SNPs: The data for the experimental validation. The column Mito. Pos. is the position on the pig mitochondrial genome (GenBank acc. AF486866). Conf. is the confidence level. The comment column notes the status of the experimental validation: Found means the SNP was found in the animal material. Not found means the SNP was not found in the animal material and was not checked in specific clones. In clone means the SNP was not found in the animal material, but was subsequently found in specific clones. Seq. Error means the SNP was found neither in the animal material nor by subsequent checking of specific clones. The SNP count and Con. count (consensus count) are the numbers of sequences found with the minor and major allele respectively, and provide a rough estimate of the allelic frequencies.

Figure 1: Schematic overview of the mitochondrion, with individual SNPs marked. The SNPs are spread evenly on the mitochondrion with no apparent preference for any gene or position.