Permutation Clustering of the DNA Sequence Facilitates Understanding of the Nonlinearly Organized Genome


RESEARCH PROPOSAL

Qiao JIN
School of Medicine, Tsinghua University
Advisor: Prof. Xuegong ZHANG (Department of Automation)
Submitted to Tsinghua Xuetang Program of Life Sciences

1. Background

1.1 The 3D Genome and Hi-C Technology

Higher-level chromatin structures have increasingly been recognized as important over the past few decades. It is now widely accepted that the three-dimensional positioning of genes is significant in essential biological functions such as transcription, replication, DNA repair and chromosome translocation (Bonev & Cavalli, 2016). Recent years have also witnessed rapid development of the chromosome conformation capture (3C) method and its derived 3C-based technologies (termed C-technologies), which are commonly used to study chromatin interactions in eukaryotic cells (Schmitt, Hu, & Ren, 2016). Among these C-technologies, the Hi-C method has been used extensively to analyze genome-wide chromosome interactions by detecting spatially neighboring chromosome sequences (Fig. 1) (Lieberman-Aiden et al., 2009).

Fig. 1. Hi-C procedure.

Analyses of high-resolution Hi-C data have uncovered some general features of hierarchical chromatin structure, such as compartments (Lieberman-Aiden et al., 2009) and topologically associating domains (TADs) (Dixon et al., 2012). However, we still know little about 3D genome organization despite the accumulating data.

1.2 Sorting Points into Neighborhoods (SPIN) Method

SPIN is an unsupervised approach for the organization and visualization of multidimensional data. By sorting points into neighborhoods, the method generates a permuted distance matrix that bears characteristic patterns separated by clear boundaries (Tsafrir et al., 2005). SPIN can be applied to any dataset for which a dissimilarity matrix between points can be defined. It starts from a random (or the original) ordering of the data points, for which the unsorted distance matrix is difficult to interpret. The permuted image, obtained after reordering the data with SPIN, is far more informative.

In the first example, points are uniformly distributed within a cylinder (Fig. 2a), and no useful information about the data pattern can be drawn from the unsorted distance matrix. After permutation, the distance matrix is much more informative: the elements near the main diagonal stand for short distances (colored blue), with a clear gradient of increasing distances as one moves away from the main diagonal. The point sequence in the reordered distance matrix runs from one end of the cylinder to the other.

In another example, where points are distributed on a ring, the reordered distance matrix is periodic, like the shape of the data. The point sequence in the reordered distance matrix starts at one point on the ring, goes around it, and eventually returns to the starting point.

Fig. 2. Shapes of simple objects, each consisting of 500 points: 1. Points are distributed in a cylinder (a) or a cycle (b); 2. The corresponding randomly ordered distance matrix; 3. The final permuted distance matrix.
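The reordering idea can be illustrated with a small sketch. The code below is not the published SPIN implementation; it is a simplified iteration in the spirit of the Side-to-Side variant described in Section 3.2, under several assumptions: the position weights are linear, the function names (`distance_matrix`, `sts_energy`, `sts_reorder`) are hypothetical, and the loop simply keeps the lowest-energy ordering it has seen rather than following the original stopping rule.

```python
import math
import random


def distance_matrix(points):
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def sts_energy(dist, order):
    """Sum of D[a][b] * w[a] * w[b] over positions a, b of the ordering.

    With zero-centered linear weights w, low energy means that large
    distances sit far from the main diagonal (assumed weight choice).
    """
    n = len(order)
    w = [k - (n - 1) / 2 for k in range(n)]
    return sum(dist[order[a]][order[b]] * w[a] * w[b]
               for a in range(n) for b in range(n))


def sts_reorder(dist, order=None, max_iter=100):
    """Side-to-Side style reordering; returns the best ordering seen."""
    n = len(dist)
    order = list(range(n)) if order is None else list(order)
    w = [k - (n - 1) / 2 for k in range(n)]
    best, best_energy = list(order), sts_energy(dist, order)
    seen = {tuple(order)}
    for _ in range(max_iter):
        # Weight currently attached to each point through its position.
        weight_of = [0.0] * n
        for position, point in enumerate(order):
            weight_of[point] = w[position]
        # Score of a point: weighted sum of its distances to all points.
        score = [sum(dist[i][j] * weight_of[j] for j in range(n))
                 for i in range(n)]
        # Points with the highest score move to the first positions.
        order = sorted(range(n), key=lambda i: -score[i])
        energy = sts_energy(dist, order)
        if energy < best_energy:
            best, best_energy = list(order), energy
        if tuple(order) in seen:  # ordering repeats: stop iterating
            break
        seen.add(tuple(order))
    return best


# Toy data: 30 points on a straight segment, presented in random order.
random.seed(0)
points = [(t / 29, 2 * t / 29) for t in range(30)]
random.shuffle(points)
dist = distance_matrix(points)
order = sts_reorder(dist)
```

The returned `order` is a permutation of the point indices whose energy is never higher than that of the input ordering. Whether it recovers the exact geometric order depends on the data and on the starting ordering, which is one reason the proposal considers more than one SPIN variant.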

In a more complicated example, where points are arranged in the shape of a smiling face, we can still easily infer the shapes of the four clusters and their relative placement. The four clusters represent two eyes, one mouth and one ring (the edge of the face), respectively.

Fig. 3. Relations and shapes of multiple clusters. SPIN results for a toy dataset of 800 points in 10 dimensions. (a) The projection of the data points onto the plane of the first and second principal components. (b) The SPIN-sorted distance matrix. For comparison, the results of three popular clustering methods were translated to permutations of the distance matrix: (c) k-means (with k = 4), (d) average linkage and (e) single linkage.

2. Purpose of Research

2.1 Novelty for Nonlinear Clustering and Analysis of the DNA Sequence

Current research on the genome depends heavily on a linear understanding of its sequence, which is about 3 billion base pairs long in humans. Beyond its sheer size, the genome is packaged into chromatin that adopts sophisticated higher-order structures such as nucleosomes, 30 nm fibers and loops. However, these important higher-level structural features receive far less attention, since most researchers still base their models only on the linear arrangement of the sequence and ignore its spatial arrangement.

Here we propose to use the SPIN method to sort and cluster the matrix derived from genome-wide Hi-C data. The permutation will break up the original linear arrangement of the binned genome sequence and generate clusters of bins that interact closely with one another. Analyzing the genome nonlinearly to gain novel knowledge is the key inspiration and innovation of this study.

2.2 Detecting Structural and Functional Nonlinear Genomic Clusters

SPIN can generate nonlinear genomic clusters with presumably distinct structural and functional characteristics. Enrichment analysis of genetic elements (such as genes, promoters, enhancers and gene deserts), epigenetic modifications (such as methylation of DNA and histones, and acetylation of histones) and previously determined structures (such as topological domains (Dixon et al., 2012) and A/B compartments (Lieberman-Aiden et al., 2009)) within the clusters will give insight into the structural and functional roles they play in the genome.

2.3 Facilitating Understanding of the 3D Genome Organizations

Apart from the genomic clusters, SPIN also generates an informative distance matrix. The patterns of the clustered blocks in the heat map of the distance matrix, and the extent to which the DNA sequence is reordered, can facilitate the understanding of 3D genome organization.

3. Experimental Design

3.1 Organization of Interaction Matrix Dataset

Because Hi-C generates pairwise mapping data and its resolution is restricted by the method itself, the genome must be divided into bins to obtain a proper distance matrix. Bin sizes can range from 1 kb to 1 Mb. In this study, distance matrices of lower resolution will be analyzed first to get a first impression of the reordered patterns and the extent of nonlinearity. Similarly, the distance matrix covering all chromosomes will be analyzed before the intrachromosomal distance matrices of single chromosomes.

Raw Hi-C data are pairwise mapping hits on the genome, representing pairs of spatially proximal sequences. The SPIN algorithm, however, requires a distance matrix, which describes the dissimilarity of the data. Therefore, either the interaction data will be transformed into distance data or the algorithm will be revised.

Fig. 4. Interaction matrix of one arm of Chr14 generated from Hi-C data.
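The binning step of Section 3.1 can be sketched as follows. This is a minimal illustration, not the pipeline used for the proposal: it assumes a single chromosome, 0-based coordinates, read pairs given as `(position_a, position_b)` tuples, and a hypothetical helper name `bin_contacts`.

```python
def bin_contacts(read_pairs, chrom_length, bin_size):
    """Count Hi-C read pairs per pair of bins (symmetric contact matrix).

    read_pairs   -- iterable of (pos_a, pos_b), 0-based coordinates (assumed)
    chrom_length -- chromosome length in base pairs
    bin_size     -- bin width in base pairs, e.g. 1_000_000 for 1 Mb
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Toy input: four read pairs on a 3 kb "chromosome", 1 kb bins.
pairs = [(100, 2500), (900, 1200), (2100, 2900), (150, 2700)]
contacts = bin_contacts(pairs, chrom_length=3000, bin_size=1000)
```

A larger `bin_size` yields a smaller, lower-resolution matrix, which matches the plan to examine low-resolution matrices first.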

3.2 SPIN Clustering of the Matrices

There are two basic SPIN algorithms for re-sorting the distance matrix. The first aims to assign large distances to the corners, far from the diagonal. The second aims to ensure that the elements near the main diagonal tend to have smaller dissimilarity values. They are termed Side-to-Side (STS) and Neighborhood, respectively. Both algorithms will be tried and, if necessary, a permutation algorithm specific to the genome sequence will be developed.

3.3 Clusters Recognition and Analysis

Clustering Assessment

The clustering results will be assessed by checking whether there are blocks along the main diagonal separated by clear boundaries. The bin sizes, data transformation methods and reordering algorithms will be revised until apparent clusters appear in the reordered distance matrix; the cluster analyses described in the rest of this section can then be performed.

Sorted Distance Matrix Pattern Analysis

The pattern of each block in the reordered distance matrix indicates how the nonlinear genomic clusters are organized. In addition, the interaction pattern between clusters, found in the corresponding off-diagonal rectangle, can also be analyzed.

Linear Sequence Tag Pattern Analysis

To study the extent of nonlinearity of the genome sequence in each cluster, every point will be tagged with its location in the original genome. It will be interesting to see how the original genome sequence is shuffled into the reordered one, and important insights into the assembly of the nonlinear genome may be gained.

Genetic Elements Enrichment Analysis

Genetic element enrichment analysis will be performed to study which kinds of genetic elements are enriched in each cluster, which indicates the functional role of the clusters.

Epigenetic Modifications Enrichment Analysis

I will also investigate the epigenetic modifications in each cluster and look for enrichment patterns, which will facilitate the understanding of the structural features of the clusters.

Previously Determined Linear Structures Enrichment Analysis

Previous studies have already found some genomic clusters by interpreting Hi-C data, such as A/B compartments (Lieberman-Aiden et al., 2009) and topological domains (Dixon et al., 2012), but these patterns are all linear and continuous along the genome. In this study, the nonlinear and discontinuous genomic clusters will be tagged with the previously determined linear structures to study how those structures are redistributed under the new clustering method.

3.4 Horizontal Comparison of the Reordered Distance Matrix of TAD

It may be hard to interpret the reordered distance matrix from a single result or from hierarchical results alone, so it is vital to generate many results by reordering a group of previously determined TADs. Studying the similarities and differences between these patterns may provide new insights into the structural and functional properties of TADs.

4. Expected Results

4.1 Development of a New Method to Analyze the Genome

SPIN reordering and clustering of the genome from Hi-C data is expected to be informative, given a proper transformation from the raw Hi-C data to a distance matrix. This method will facilitate the understanding of the structural and functional assembly of the genome.

4.2 Gaining Structural Information of the Genome

From the pattern of the reordered distance matrix within each cluster and between clusters, we can infer how the genome is organized spatially and how nonlinear it is.

4.3 Gaining Functional Information of the Genome

From the biological properties of the clusters, such as GC content, genetic element enrichment and epigenetic modifications, we can understand the functional role of the clusters.
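The enrichment analyses planned in Section 3.3 could be scored in several ways; the proposal does not name a statistical test. The sketch below assumes a one-sided hypergeometric test on bin counts, with a hypothetical function name and hypothetical example numbers, to show what "enriched in a cluster" could mean quantitatively.

```python
from math import comb


def hypergeom_enrichment_p(total_bins, annotated_bins,
                           cluster_size, annotated_in_cluster):
    """One-sided hypergeometric p-value for enrichment (assumed test).

    Probability of finding at least `annotated_in_cluster` annotated bins
    when `cluster_size` bins are drawn without replacement from
    `total_bins` bins, of which `annotated_bins` carry the annotation
    (for example, bins overlapping an enhancer).
    """
    upper = min(annotated_bins, cluster_size)
    favourable = sum(
        comb(annotated_bins, k)
        * comb(total_bins - annotated_bins, cluster_size - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return favourable / comb(total_bins, cluster_size)


# Hypothetical numbers: 10 bins, 4 annotated; a cluster of 3 bins
# in which all 3 bins are annotated.
p_value = hypergeom_enrichment_p(10, 4, 3, 3)
```

When many clusters and many annotations are tested, the resulting p-values would need a multiple-testing correction, which this sketch leaves out.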

5. Preliminary Data

Using the first Hi-C data, available at GSE18199 (Lieberman-Aiden et al., 2009), I analyzed only chromosome 14 in preliminary experiments. Several methods of transforming the contact matrix into a distance matrix were tried. Using $1 - r_{ij}$ turned out to generate informative results, where $r_{ij}$ is the Pearson correlation between loci $i$ and $j$.

As Fig. 5 demonstrates, I used low-resolution Hi-C data of chr14 and reordered the distance matrix given by $1 - r$. The reordered distance matrix shows that at least two major clusters are present. Fig. 6 shows more detailed data; Fig. 6I shows the permuted pr, which indicates the A/B compartments defined by Lieberman-Aiden et al. (2009). The sigmoid curve of the permuted pr suggests that the ordering produced by SPIN is meaningful.

Fig. 5. Permutation results of low-resolution chr14. (A) The distance matrix before permutation; (B) the distance matrix after permutation.
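The transformation used for the preliminary data, $d_{ij} = 1 - r_{ij}$, can be sketched as below. The sketch assumes that the correlation is taken between rows of the binned contact matrix and that a constant row is treated as uncorrelated with everything; both choices, the function names and the toy matrix are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant rows count as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(contacts):
    """Distance matrix d[i][j] = 1 - r_ij between contact-matrix rows."""
    n = len(contacts)
    return [[1.0 - pearson(contacts[i], contacts[j]) for j in range(n)]
            for i in range(n)]


# Toy contact matrix with two blocks of mutually interacting bins.
toy_contacts = [
    [10, 8, 1, 0],
    [8, 10, 0, 1],
    [1, 0, 10, 8],
    [0, 1, 8, 10],
]
toy_distance = correlation_distance(toy_contacts)
```

The resulting distances lie between 0 (identical contact profiles) and 2 (perfectly anti-correlated profiles), so bins in the same block of the toy matrix end up closer to each other than bins in different blocks.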

Fig. 6. Epigenetic modifications and other structural data before and after permutation.

Nonetheless, the data used to generate Figs. 5 and 6 were of low resolution. Only millions of read pairs were generated in that study, whereas datasets built from billions of reads are now available. Data from GSE35156 (Dixon et al., 2012), which have a higher resolution than the data shown here, will be used in the next step.

Reference

Bonev, B., & Cavalli, G. (2016). Organization and function of the 3D genome. Nature Reviews Genetics, 17(11).

Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., ... Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(7398).

Lieberman-Aiden, E., Van Berkum, N. L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., ... Dorschner, M. O. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950).

Schmitt, A. D., Hu, M., & Ren, B. (2016). Genome-wide mapping and analysis of chromosome architecture. Nature Reviews Molecular Cell Biology.

Tsafrir, D., Tsafrir, I., Ein-Dor, L., Zuk, O., Notterman, D. A., & Domany, E. (2005). Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics, 21(10).