Application of the Scan Statistic in DNA Sequence Analysis


1 Application of the Scan Statistic in DNA Sequence Analysis
Ming-Ying Leung, Division of Mathematics and Statistics, University of Texas at San Antonio, San Antonio, TX
Traci E. Yamashita, Johns Hopkins School of Hygiene and Public Health, Baltimore, MD
Outline:
DNA Sequence Analysis
Scan Statistic
Poisson-Type Approximations
Herpes Genomes

2 In 1944, Avery announced his finding that DNA, a macromolecule inside the chromosome, is responsible for transmitting hereditary material from parents to offspring.

3 In 1953, Watson and Crick proposed the double-helical structure of the DNA molecule.

4 The Age of Molecular Genetics Begins

5

6 Challenges in Bioinformatics (Computational Molecular Biology)
Sequence Assembly
Database Technology
Gene Finding
Structure Prediction
Molecular Evolution
Search for Extragenic Functional Sites

7 NETFETCH of: query May 5, :57 from server:
1 Sequences Requested
1 Sequences Returned
LOCUS BHV1CGEN bp DNA VRL 07-APR-2000
DEFINITION Bovine herpesvirus type 1.1 complete genome.
ACCESSION AJ
VERSION AJ GI:
KEYWORDS complete genome.
SOURCE Bovine herpesvirus type 1.1.
ORGANISM Bovine herpesvirus type 1.1
Viruses; dsDNA viruses, no RNA stage; Herpesviridae; Alphaherpesvirinae; Varicellovirus.
References
FEATURES
sequence
ggcccagcccccgcgcggggggcgcggagaaaaaaaaaattttttccgcgcggcgcgtgc
attgcggcgggcgggggcggggtgggggatgggcgcggagcgcgagggtagggttggcac
actgccaagatcaccaagcatgtgcgcggccatcttgcttccaaactcattagcataccc
cgcccattattccattctcatttgcatacccaccgttgcacatgccgccatattgctcct
cctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctc
cctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctc
gctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctc
ctcctccctcgctcctcttcaaaacactaccgcgggcgtccgctctcactagcttcggcg
ccgtcatgggtgcccgcgcctccgcgcctgctgccggcccgcccccagcccacgctgttc
tactagatgcgctctccgggggcacgattgacctgcctggcggcgacgaggccgtctttg
tgtcctgcccgacgacgcgccccgtgtaccaccacatgcgccgcggccgcacggcccaca
ctacacccgtgcacttcgttggccgcgcctatgccatcttgccctgccgcaagtttatgc
tgtatctgatgcgcggtggtgccgtttacggctacgagcccaccactggcctgcaccgcc
tcgccgattcactgcacgactttcttactactgccggactacagcagcgagacctacact
gcctcgatgtcacggtgcttgacgcgcagatggacccggtgacgttcaccacccccgaga
tcctcatcgagctcgaggcggacccggccttcccaccgccgccctcggcccgcgcgcgcc
gctccacgctgcgccgggcgtctatgcgccggcccgcacgcaccttctgcccccaccagc
tgctagcagagggctccattctggacctctgctcgccagagcaagcggcggcgccgggct
gttcgctgctccccgcctgtgactctggagacgccgcgtgcccctgcgacgctggcgaga

8

9 Palindrome: A stretch of DNA that reads the same in the 5'-to-3' direction on both the direct and the complementary strands. E.g.,
5'-GCAATATTGC-3'
3'-CGTTATAACG-5'
Short palindromes occur frequently by chance. To screen out this random noise, focus only on palindromes of length ≥ 10. Significant clusters of palindromes are found around origins of replication and regulatory regions in viral genomes (Masse et al. 1992, Leung and Yamashita 1999). Modeling the occurrences of palindromes along the DNA sequence as points on the unit interval, one can use the scan statistic to detect nonrandom palindrome clusters.
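The palindrome definition above can be illustrated with a minimal Python sketch. It assumes a palindrome is a stretch equal to its own reverse complement and reports maximal palindromes found by expanding outward from each position between two bases. The function names and the choice to report only maximal palindromes are assumptions made for illustration, not necessarily the counting rule used in the talk.

```python
# Watson-Crick base pairing for lowercase DNA letters.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def is_palindrome(seq):
    """True if seq equals its reverse complement (a DNA palindrome)."""
    seq = seq.lower()
    rev_comp = "".join(COMPLEMENT.get(b, "n") for b in reversed(seq))
    return seq == rev_comp


def find_palindromes(seq, min_len=10):
    """Return (start, length) of maximal palindromes with length >= min_len.

    Positions are 0-based. Each candidate is grown symmetrically around
    the gap between two adjacent bases while the outer bases pair up.
    """
    seq = seq.lower()
    hits = []
    for center in range(1, len(seq)):
        left, right = center - 1, center
        while (left >= 0 and right < len(seq)
               and COMPLEMENT.get(seq[left]) == seq[right]):
            left -= 1
            right += 1
        length = right - left - 1
        if length >= min_len:
            hits.append((left + 1, length))
    return hits


# The slide's example palindrome, padded with a few flanking bases.
print(find_palindromes("tttgcaatattgcccc"))
```

The start positions returned here, divided by the genome length, give the points on the unit interval to which the scan statistic is applied.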

10 Sliding Window Plot
Figure 1: A sliding window plot is generated by choosing a window of fixed length and sliding it along the genome, beginning at the first base and continuing until the window reaches the last base of the genome. The window moves forward in steps of a pre-specified size. At each position of the window, the number of palindromes contained in it is counted and plotted against the window position. This is the sliding window plot for the human cytomegalovirus genome with a 1000-base window and a step size of 500 bases. The peaks observed in the plot suggest that there may be nonrandom palindrome clusters at those window positions.
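The counting step behind a sliding window plot can be sketched as follows. This is an illustrative sketch only: the function name, the 0-based half-open windows, and the rule of stopping once the window end passes the genome end are assumptions, and the positions in the usage line are made-up values rather than data from the talk.

```python
from bisect import bisect_left


def sliding_window_counts(positions, genome_length, window=1000, step=500):
    """Return (window_start, count) pairs for windows [start, start + window)."""
    ordered = sorted(positions)
    counts = []
    start = 0
    while start + window <= genome_length:
        # Binary search gives the number of positions inside the window.
        lo = bisect_left(ordered, start)
        hi = bisect_left(ordered, start + window)
        counts.append((start, hi - lo))
        start += step
    return counts


# Hypothetical palindrome start positions on a 2000-base sequence.
print(sliding_window_counts([100, 600, 700, 1600], genome_length=2000))
```

Plotting the counts against the window starts gives the sliding window plot. Unusually tall peaks are only suggestive, which is why a formal test such as the scan statistic is needed.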

11 The Scan Statistic: Notation
$U_1, U_2, \ldots, U_n$: i.i.d. Uniform(0,1)
$U_{(1)}, U_{(2)}, \ldots, U_{(n)}$: their order statistics
$S_i = U_{(i+1)} - U_{(i)}$: the $i$th spacing
$N_w(i)$: number of points contained in $[U_{(i)}, U_{(i)} + w]$
$A_r(i) = S_i + \cdots + S_{i+r-1}$: sum of $r$ adjoining spacings

12 Duality Relationship
$$\{ N_w(i) \ge r + 1 \} = \{ A_r(i) \le w \}$$
w-scan statistic: $N_w = \max_i N_w(i)$
r-scan statistic: $A_r = \min_i A_r(i)$
If $N_w$ is too big, or equivalently, $A_r$ is too small, an unusual cluster is present.
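The two scan statistics and their duality can be checked numerically with a short sketch. The function names are hypothetical, and the points in the usage lines are arbitrary values chosen to be exactly representable in floating point.

```python
def r_scan_statistic(points, r):
    """A_r: the smallest sum of r adjoining spacings, min_i (U_(i+r) - U_(i))."""
    u = sorted(points)
    return min(u[i + r] - u[i] for i in range(len(u) - r))


def w_scan_statistic(points, w):
    """N_w: the largest number of points in any interval [U_(i), U_(i) + w]."""
    u = sorted(points)
    best = 0
    j = 0
    for i in range(len(u)):
        # Advance j past every point that lies within w of u[i].
        while j < len(u) and u[j] <= u[i] + w:
            j += 1
        best = max(best, j - i)
    return best


pts = [0.5, 0.0625, 0.125, 0.875, 0.1875]
r, w = 2, 0.125
# Duality: N_w >= r + 1 holds exactly when A_r <= w.
print(w_scan_statistic(pts, w) >= r + 1, r_scan_statistic(pts, r) <= w)
```

Because of this equivalence, a test based on A_r being too small is the same as a test based on N_w being too large.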

13 Poisson Approximation
Dembo and Karlin (1992) derive the limiting distribution
$$\lim_{n \to \infty} P\left( A_r > \frac{x}{n^{1+1/r}} \right) = e^{-x^r / r!}.$$
This follows from a Poisson limiting distribution for the count $C_r$ of those $A_r(i)$'s not exceeding $x / n^{1+1/r}$. If the above limiting distribution is used as an approximation for large $n$, one can easily obtain a critical value
$$c = \left[ \frac{-r!\,\ln(1-\alpha)}{n^{r+1}} \right]^{1/r}$$
for $A_r$, below which a significant cluster is considered present.
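The critical value follows from setting the approximate probability P(A_r <= c) = 1 - exp(-x^r / r!) equal to α, with x = c n^(1+1/r), and solving for c. A minimal sketch of that arithmetic is given below. The function names are hypothetical, and n = 300 in the usage line is an arbitrary illustrative count, not a figure from the talk.

```python
import math


def r_scan_critical_value(n, r, alpha=0.05):
    """Critical value c with approximate P(A_r <= c) = alpha for large n."""
    x = (-math.factorial(r) * math.log(1.0 - alpha)) ** (1.0 / r)
    return x / n ** (1.0 + 1.0 / r)


def approx_cluster_probability(n, r, a):
    """Approximate P(A_r <= a) from the limiting distribution."""
    x = a * n ** (1.0 + 1.0 / r)
    return 1.0 - math.exp(-(x ** r) / math.factorial(r))


c = r_scan_critical_value(n=300, r=2, alpha=0.05)
# Plugging c back in should recover alpha up to rounding error.
print(c, approx_cluster_probability(300, 2, c))
```

An observed A_r below c, with positions scaled to the unit interval, would be declared a significant cluster at level α under this approximation.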

14 Better approximate probabilities for $A_r$ can be derived from better approximate distributions of $C_r$:
Finite Poisson approximation (Dembo & Karlin 1992).
Local declumping approximation (Glaz 1994), based on a declumping idea put forth by Arratia et al. (1990).
Compound Poisson approximation (Glaz 1994), based on a coupling method proposed by Roos (1993).
Recursive algorithm for computing scan statistic probabilities to any desired degree of accuracy, developed by Huffer and Lin (1998).

15 Herpesvirus Genomes
Genome Palindromes Genome Length
HCMV ,354
EBV ,282
HSE ,223
HSI ,226
HSS ,930
HSV ,260
VZV ,885

16 Significant Palindrome Clusters on
r    Positions of significant (α = 0.05) clusters
1    None
2    None
3    None

17 Regions of the herpes genomes with statistically significant clusters
Genome    Cluster Location    Biological Feature
HCMV    Origin of replication (oriLyt)
        Transcriptional regulator
HSV     Transcriptional regulator
        Origin of replication (oriS)
EBV     Origin of replication (OriLyt)
HSE
HSS
VZV     1542

18 The Q-Q Plot
Figure 2: Q-Q plot for the palindrome positions of the human cytomegalovirus against quantiles of the uniform distribution. Here, we focus only on those palindromes with length ≥ 10 bases in order to screen out the random noise generated by frequent fortuitous occurrences of very short palindromes (see Leung et al. for a full explanation). The overall straight-line appearance of the Q-Q plot suggests that it is reasonable to model the occurrences of palindromes above a prescribed length along the genome sequence as i.i.d. points uniformly distributed over (0,1) and to evaluate the significance of palindrome clusters with the scan statistic distribution.
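The coordinates for such a Q-Q plot can be computed with a few lines of Python. This sketch assumes the plotting positions i / (n + 1) for the uniform quantiles, which is one common convention and may differ from the one used for Figure 2. The function name and the positions in the usage line are hypothetical.

```python
def uniform_qq_points(positions, genome_length):
    """Pair each Uniform(0,1) quantile i/(n+1) with the i-th scaled position."""
    scaled = sorted(p / genome_length for p in positions)
    n = len(scaled)
    return [((i + 1) / (n + 1), u) for i, u in enumerate(scaled)]


# Hypothetical palindrome positions on a 40-base sequence.
print(uniform_qq_points([10, 30, 20], genome_length=40))
```

Points lying close to the diagonal support the uniform model, while systematic departures would indicate that the i.i.d. uniform assumption behind the scan statistic is questionable.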