Multi-level functional principal components analysis models for replicated genomics time series data sets

Size: px
Start display at page:

Download "Multi-level functional principal components analysis models for replicated genomics time series data sets"

Transcription

1 Multi-level functional principal components analysis models for replicated genomics time series data sets Maurice Berk Imperial College London September 11th, 2012

2 Talk outline 1 Introduction Background & case study Single-level models 2 Multi-level Models Reduced Rank FPCA Models Simulation Study Issues 3 Conclusions

3 A motivating genomics experiment Want to understand the immune response to infection with BCG

4 A motivating genomics experiment Want to understand the immune response to infection with BCG Which genes are switched on (or off) by BCG infection?

5 A motivating genomics experiment Want to understand the immune response to infection with BCG Which genes are switched on (or off) by BCG infection? Collect blood samples from healthy human volunteers (replicates)

6 A motivating genomics experiment Want to understand the immune response to infection with BCG Which genes are switched on (or off) by BCG infection? Collect blood samples from healthy human volunteers (replicates) Infect the blood samples with BCG

7 A motivating genomics experiment Want to understand the immune response to infection with BCG Which genes are switched on (or off) by BCG infection? Collect blood samples from healthy human volunteers (replicates) Infect the blood samples with BCG Perform microarray hybridizations at several time points after infection

8 A motivating genomics experiment Want to understand the immune response to infection with BCG Which genes are switched on (or off) by BCG infection? Collect blood samples from healthy human volunteers (replicates) Infect the blood samples with BCG Perform microarray hybridizations at several time points after infection For every gene, end up with one time series per volunteer

9 Raw Data IFNG Interferon, gamma TNF Tumor necrosis factor (TNF superfamily, member 2) Patient 7 Patient 8 Patient 9 Patient 7 Patient 8 Patient Patient 4 Patient 5 Patient Patient 4 Patient 5 Patient 6 M Patient 1 Patient 2 Patient M Patient 1 Patient 2 Patient Time Time

10 Current State of Play FDA is a popular modelling choice for genomics time series data sets (see Coffey and Hinde, 2011)

11 Current State of Play FDA is a popular modelling choice for genomics time series data sets (see Coffey and Hinde, 2011) Replicated data sets have received much less attention but see Storey et al (2005), Liu & Yang (2009) and Berk et al (2011)

12 Current State of Play FDA is a popular modelling choice for genomics time series data sets (see Coffey and Hinde, 2011) Replicated data sets have received much less attention but see Storey et al (2005), Liu & Yang (2009) and Berk et al (2011) But these methods model each gene independently!

13 Current State of Play FDA is a popular modelling choice for genomics time series data sets (see Coffey and Hinde, 2011) Replicated data sets have received much less attention but see Storey et al (2005), Liu & Yang (2009) and Berk et al (2011) But these methods model each gene independently! Multi-level approaches exist in other domains Di et al (2009) and Zhou et al (2010)

14 The functional mixed-effects model y i (t) = µ(t) v i (t) ɛ(t)

15 The functional mixed-effects model We can represent functions µ(t) and v i (t) using splines: LMM Representation y i = X i µ X i v i ɛ i and have a linear mixed-effects model Impose distributions on v i and ɛ i Treat random-effects v i as missing data in the EM algorithm

16 A multi-level functional mixed-effects model Definition y ij (t) = µ(t) f i (t) g ij (t) ɛ ij (t) µ(t) is the grand mean across all genes f i (t) is the gene specific deviation from that grand mean g ij (t) is the replicate specific deviation from the gene mean Could represent the functions µ(t), f i (t) and g ij (t) using splines as before...

17 Functional Principal Components Analysis... but as all gene means are now being estimated simultaneously, it would be better to do a functional PCA

18 Functional Principal Components Analysis... but as all gene means are now being estimated simultaneously, it would be better to do a functional PCA Karhunen-Loève Decomposition y ij (t) = µ(t) ξ k (t)α ik ζ il (t)β ijl ɛ ij (t) k=1 l=1 ξ k (t) is the k-th PC function at the gene level α ik is the loading on PC function k for gene i ζ il (t) is the l-th PC function at the replicate level, for gene i β ijl is the loading on PC function l for replicate j for gene i

19 Reduced Rank Functional Principal Components Analysis Truncated Decomposition L K i y ij (t) = µ(t) ξ k (t)α ik ζ il (t)β ijl ɛ i (t) k=1 l=1

20 Reduced Rank Functional Principal Components Analysis Truncated Decomposition L K i y ij (t) = µ(t) ξ k (t)α ik ζ il (t)β ijl ɛ i (t) k=1 l=1 LMM Representation L K i y ij = B ij θ µ B ij θ αk α ik B ij θ βil β ijl ɛ ij k=1 l=1

21 Reduced Rank Functional Principal Components Analysis Truncated Decomposition L K i y ij (t) = µ(t) ξ k (t)α ik ζ il (t)β ijl ɛ i (t) k=1 l=1 LMM Representation L K i y ij = B ij θ µ B ij θ αk α ik B ij θ βil β ijl ɛ ij k=1 l=1 Distributional Assumptions and Constraints α i N (0, D α) β ij N (0, D βi ) ɛ ij N (0, σ 2 i I Nij ) α i β ij ɛ ij B T B = I Θ T αθ α = I Θ T β i Θ βi = I

22 Comparison with Zhou et al (2010) Considered spatial correlations between the variables They assume second-level variance is the same for all variables They assume the error variance is not variable dependent Illustrated on a data set with three variables

23 Simulation setting Varied number of replicates (5, 10, 20) Varied number of genes (100, 1000, 10000) Fixed number of time points (5) Fixed basis (B-splines, 1 knot) Fixed grand mean Fixed 2 PCs at the gene level Fixed 1 PC at the replicate level Fixed variance components D α and D βi Generated 1000 data sets for each condition

24 Simulation setting Genelevel PC1 66.6% of variance explained Genelevel PC2 33.3% of variance explained Intensity Intensity Time Time

25 Simulation setting Individuallevel PC 100% of variance explained Intensity Time

26 Simulation setting results Mean-squared error of gene-level curves Multi-level model Number of Genes Number of Replicates Gene-at-a-time model Number of Genes Number of Replicates

27 Real data fits CCL20 Hours M

28 Initialised gene-level PC loadings Principal Component 1 Principal Component 4 Frequency α 1 Principal Component α 4 Frequency Frequency Principal Component 5 Frequency α α 5 Frequency Principal Component α 3

29 The solution LMM Representation L K i y ij = B ij θ µ B ij θ αk α ik B ij θ βil β ijl ɛ ij k=1 l=1 Distributional Assumptions and Constraints α ik StN (ξ k, σ 2 α k, λ k, ν k ) β ij N (0, D βi ) ɛ ij N (0, σ 2 i I Nij ) α i β ij ɛ ij B T B = I Θ T αθ α = I Θ T β i Θ βi = I The skew-t-normal distribtuion ( ) f (α ik ξ k, σα 2 k, λ k, ν k ) = 2t νk (α ik ξ k, σα 2 k )Φ αik ξ k σ αk

30 Implications y ij now has an unknown distribution

31 Implications y ij now has an unknown distribution If all α ik, the β ikl and ɛ ij were skew-t-normally distributed with common degrees of freedom then the distribution would be known

32 Implications y ij now has an unknown distribution If all α ik, the β ikl and ɛ ij were skew-t-normally distributed with common degrees of freedom then the distribution would be known Can use the Monte-Carlo EM algorithm to approximate the required conditional expectations at the E-step

33 It works CCL20 Hours M

34 Hour Hour Hour Hour Hour It works Principal Component 1 Variance Explained 95.2% Principal Component 3 Variance Explained 1.38% Principal Component 5 Variance Explained % Principal Component 2 Variance Explained 2.97% Principal Component 4 Variance Explained 0.473%

35 Conclusions Due to the technology involved, ethical constraints and the nature of the genome, replicated genomics time series data sets really do call for specific, sophisticated models The approach presented here is currently completely impractical A multi-level model is worth pursuing Functional PCA is hard for non-fda specialists to understand Have not considered correlations amongst genes

36 Acknowledgements Acknowledgements: PhD supervisors Giovanni Montana and Michael Levin Cheryl Hemingway for providing access data presented here Wellcome Trust Additional Resources: My website: sme R package availabe on CRAN