A Bayesian binomial regression model with latent Gaussian processes for modelling DNA methylation

Epigenetic observations are represented by the total number of reads from a given pool of cells and the number of methylated reads, making it reasonable to model this data by a binomial distribution. There are numerous factors that can influence the probability of success in a particular region. Moreover, there is a strong spatial (alongside the genome) dependence of these probabilities. We incorporate dependence on the covariates and the spatial dependence of the methylation probability for observations from a pool of cells by means of a binomial regression model with a latent Gaussian field and a logit link function. We apply a Bayesian approach including prior specifications on model configurations. We run a mode jumping Markov chain Monte Carlo algorithm (MJMCMC) across different choices of covariates in order to obtain the joint posterior distribution of parameters and models. This also allows finding the best set of covariates to model methylation probability within the genomic region of interest and individual marginal inclusion probabilities of the covariates.


Introduction
Epigenetic modifications contribute to the generation of phenotypic plasticity, but the understanding of its contribution to phenotypic alterations and how the genome influences epigenetic variants requires further investigation (Schmitz, Schultz, Urich, Nery, Pelizzola, Libiger, Alix, McCosh, Chen, Schork et al. 2013). Epigenetic changes are crucial for the development and differentiation of various cell types in an organism, as well as for normal cellular processes. Epigenetic modifications modulate gene expression and modifications found in the promoter or regulatory elements play a prominent role in activating or suppressing transcript levels. This creates interesting research possibilities, which are often challenging from the statistical point of view. For example, Li, Cassese, Guindani, and Vannucci (2019) suggested a Bayesian negative binomial regression model to study the influence of methylation (used as covariates) on RNA-Seq gene expression counts (used as observations). Also, Tang, Zhou, Wang, Huang, and Jin (2017) developed a Gaussian Bayesian regression model to link the differential gene expression (measured as log 2 fold change) to various exogenous variables including tumour suppressor genes categories, mean methylation values and genomic segment distributions. In turn, Ma, Liu, Zhang, Huang, and Tang (2017) suggested a multiple network for epigenetic studies and implemented the Cox proportional hazard model to analyze the association of methylation profile of each epigenetic module with the patient survival. Recently, highthroughput epigenetics experiments have enabled researchers to measure genome-wide epigenetic profiles. This allows performing Epigenome-wide association studies (EWAS), which also hold promise for the detection of new regulatory mechanisms that may be susceptible to modification by environmental and lifestyle factors (Michels, Binder, Dedeurwaerder, Epstein, Greally, Gut, Houseman, Izzi, Kelsey, Meissner et al. 2013).
A major task today is the development of models and statistical methods for linking epigenetic patterns to genomic and/or environmental variables and interpreting them. Unlike the papers mentioned above (Ma et al. 2017;Tang et al. 2017;Li et al. 2019), we use methylation data as responses and link them to genomic and phenotypic variables (used as covariates). Moreover, by means of performing careful statistical modelling, our model takes into account that epigenetic data are spatially correlated (along locations in the genome) with high noise levels. Due to the availability of the data, our focus will be on the model plant Arabidopsis thaliana. For instance, Becker, Hagmann, Müller, Koenig, Stegle, Borgwardt, and Weigel (2011) previously analysed Arabidopsis data consisting of epigenetic observations on a set of 10 lines, which were separately propagated in a common environment for 30 generations. These were compared with two independent lines propagated for only three generations. Their analysis aimed at global summaries of structures but was based on individual and (sitewise) hypothesis testing methods combined with false discovery rate control methodology. In this paper, however, we limit ourselves to finding a pattern of signals appearing along the single genome that significantly influences the methylation probability of the corresponding organism. This is done by means of applying a binomial regression model with latent Gaussian variables, which take into account both spatial dependence and variability that can not be explained by the exogenous variables alone. The chosen latent Gaussian variable is a sum of a random walk RW (1) component and an independent IG component. Model selection and parameter estimation within models are performed simultaneously in a Bayesian framework, applying the mode jumping Markov chain Monte Carlo (MJMCMC) algorithm developed by  to perform the computations involved. MJMCMC outputs posterior model probabilities allowing to find the best combination of explanatory genomic variables and compute marginal inclusion probabilities for the importance of individual variables. Our approach also allows to generate a model-averaged classification of the methylation status at different locations and make imputations for those locations that do not have enough observations, whilst the currently used approach is to simply ignore these locations.

Mathematical model
We model the number of methylated reads Y t ∈ {1, ..., n t } per position in the genome (nucleotide base position) to be binomially distributed with the number of trials equal to the number of reads, n t ∈ N, for position t ∈ {t 1 , ..., t T } (where T is the total number of genomic positions in the addressed genomic region) and corresponding probability of success p t ∈ R [0,1] . The probability p t is modeled via the logit link to the covariates X t = {X t1 , ..., X td }. These covariates might be a position within a gene (e.g promoter or coding region), an indicator of the underlying genetic structure, or other types of features (our choice of the covariates is given in Section 3). A latent Gaussian RW (1) process {δ t } ∈ R and a latent independent Gaussian (IG) process {ζ t } are included into the model in order to take into account spatial dependence of methylation probabilities along the genome and the variance which is not explained by the covariates. Other explored latent Gaussian variables were also tested prior to variable selection on a full model before the selection of this structure was done (see the detail in the Appendix A of the paper). This gives the following model formulation: where β j ∈ R, j ∈ {0, ..., d} are regression coefficients of the model showing whether and in which way the corresponding covariate influences the probability of methylation on average, which are normally distributed with zero mean and precision τ −1 . Finally, τ −1 ζ is the precision term of the IG process {ζ t }. We then put the following priors for the parameters of the model: , τ β , τ and τ ζ ∼ Gamma(1, 5 · 10 −5 ). (5) Here, q = 0.5 is the prior Bernoulli probability of including a covariate into the model, and I(·) is the indicator function.
We perform analysis for the model defined by Equations (1)-(5) by means of the MJMCMC algorithm . The algorithm is capable of efficiently moving in the defined model space by means of both accurately exploring the modes of the probability mass and switching between these modes using large jumps combined with local optimization and randomization.
By Bayes formula, p(γ, θ|Y , X) = p(θ|γ, Y , X)p(γ|Y , X). In Section 3.1 we describe how to compute p(θ|γ, Y , X) and p(Y |X, γ). For the time being, we assume that marginal likelihoods p(Y |X, γ) are available for a given γ. Then by Bayes formula: In order to calculate p(γ|Y , X) we have to iterate through the whole model space Ω, which becomes computationally infeasible for large d. The ordinary MCMC estimate is based on a number of MCMC samples γ (i) , i = 1, ..., W : An alternative, named the renormalized model (RM) estimates by Clyde, Ghosh, and Littman (2011), is where now V is the set of visited models during the MCMC (or any other model space exploration algorithm) run. Although both (8) and (7) are asymptotically consistent, (8) will often be a preferable estimator since convergence of the MCMC based approximation (7) is much slower, see Clyde et al. (2011); .
We aim at approximating p(γ|Y , X) by means of searching for some subspace V of Ω making the approximation (8) as precise as possible. Models with high values of p(Y |X, γ)p(γ) and regions of relatively high posterior mass are important to be included into V. Missing them in V can introduce significant biases in our estimates. Note that these aspects are just as important for the standard MCMC estimate (7). The difference is that while the number of times a specific model is visited is important when using (7), for (8) it is enough that the model is visited at least once. In this context, the denominator of (8), which we would like to be as high as possible, becomes an extremely relevant measure for the quality of the search in terms of being able to capture whether the algorithm visits all of the modes, whilst the size of V should be low in order to save computational time.
The marginal inclusion probability p(γ j = 1|Y , X) is defined as: It can be approximated either by the MCMC estimator: or using the renormalized approach: giving a measure of importance for the covariates of the model.

Integrated nested Laplace approximations
Within hierarchical models with latent Gaussian structures, integrated nested Laplace approximations (INLA) for efficient inference on the posterior distribution (Rue, Martino, and Chopin 2009) can be used. Following the INLA terminology, we define z to be the set of latent Gaussian variables and the regression parameters β while η contains the remaining parameters (a low-dimensional vector). The INLA approach is based on two steps. First the marginal posterior of the hyperparameters is approximated by Here,p G (z|η, Y , X, γ) is the Gaussian approximation of p(z|η, Y , X, γ), and z * (η) is the mode of the distribution p(z|η, Y , X, γ). The posterior mode of the hyperparameters is found by maximizing the corresponding Laplace approximation using some gradient descent method (like for example the Newton-Raphson routine). Then an area with a relatively high posterior density of the hyperparameters is explored with either some grid-based procedure or variational Bayes.
The second step involves the approximation of the latent variables for every set of the explored hyperparameters. Here, the computational complexity of the approximation depends on the likelihood type for the data Y |X. If it is Gaussian, the posterior of the latent variables is Gaussian, and the approximation is exact and fully tractable. In the case the likelihood is skewed or heavy tails are present, a Gaussian approximation of the latent variables tends to become inaccurate and another Laplace approximation should be used: Here,p GG is the Gaussian approximation to p(z −i |z i , η, Y , X, γ) and z * −i (z i , η) is the corresponding posterior mode. The error rate of (13) is O(T −3/2 ). The full Laplace approximation of the latent fields defined in equation (13) is rather time-consuming, hence more crude lowerorder Laplace approximations are often used instead (typically increasing the error rate to O(T −1 ), Tierney and Kadane 1986). Once the posterior distribution of the latent variables given the hyperparameters is approximated, the uncertainty in the hyperparameters can be marginalized out using the law of total probability (Rue et al. 2009): where ∆ k is the area weight corresponding to the grid exploration of the posterior distribution of the hyperparameters.

Computing the marginal likelihood
The marginal likelihood is defined as follows: For data {Y , X} and model γ, which includes some unknown parameters θ, the marginal likelihood is given by where p(θ|γ) is the prior for θ under model γ while p(Y |X, γ, θ) is the likelihood function conditional on θ. Again, consider θ = (η, z).

INLA approximates marginal likelihoods by
where η * (z|γ) is some chosen value of η, typically the posterior mode, whileπ G (η|Y , X, z, γ) is a Gaussian approximation to π(η|Y , X, z, γ). The integration of z over the support Z can be performed by an empirical Bayes (EB) approximation or using numerical integration based on a central composite design (CCD) or a grid (see Rue et al. 2009, for details).
A toy example on computing the marginal likelihood. Consider an example from Neal (2008), in which we assume the following model γ: Then obviously the marginal likelihood is available analytically as and we have a benchmark to compare approximations to. The harmonic mean estimator (Raftery, Newton, Satagopan, and Krivitsky 2006) is given bỹ where z i ∼ p(z|Y, γ). This estimator is consistent, however, often requires unreasonably many iterations to converge. We performed the experiments with τ 1 = 1 and τ 0 being either 0.001, 0.1 or 10. The harmonic mean is obtained based on W = 10 7 simulations. 5 runs of the harmonic mean procedure are performed for each scenario. For INLA we used the default tuning parameters from the package (Rue et al. 2009
Here, a large jump corresponds to changing a large number of γ j 's while the local optimization will be some iterative procedure based on, at each iteration, changing a small number of components until a local mode is reached. For this algorithm, three proposals need to be specified; q l (·|·) specifying the first large jump, q o (·|·) specifying the local optimizer, and q r (·|·) specifying the last randomization. All of them are described in detail in . The convergence of the MJMCMC procedures is shown in Theorem 1 in .

Data description
The addressed data set consists of 1502 observations from chromosome one of the Arabidopsis plant belonging to five predefined groups of genes. This data set was divided into 950 observations (with more than 2 reads, see Figure 1) for inference and 552 observations (with less than 3 reads) for model-based identification of methylation probabilities for the positions with the lack of data.
Apart from the observations represented by the methylated versus the total number of reads we have data on various exogenous variables (covariates). Among these covariates, we address a factor with 3 levels corresponding to whether the location belongs to a CGH, CHH or CHG genetic region, where H is either A, C or T and thus generating two covariates X CGH and X CHH . The second group of factors indicates whether the distance to the previous cytosine nucleobase (C) in DNA is 1, 2, 3, 4, 5, from 6 to 20 or greater than 20 inducing six binary covariates X DT 1 , X DT 2 , X DT 3 , X DT 4 , X DT 5 , and X DT 6:20 . We also include such 1D distance as a continuous covariate X DIST . The third addressed group of factors corresponds to whether the location belongs to a gene from a particular group of genes of biological interest. These groups are indicated as M a , M g and M d , yielding two additional covariates X Ma , X Mg . Additionally, we have a covariate X CODE indicating if the corresponding nucleobase is in the coding region of a gene and a covariate X ST RD indicating if the nucleobase is on a " + " or a " − " strand. Finally, we have a continuous covariate X EXP R ∈ R + representing expression level for the corresponding gene and interactions between expression levels and gene groups X EXP R,a , X EXP R,g , X EXP R,d ∈ R + . Thus multiple predictors with respect to a strict choice of the reference levels for categorical variables in our example induced d = 17 potentially important covariates. The correlation structure for the addressed variables is depicted in Figure 2, where one clearly sees multiple correlations, which in turn are likely to induce multiple modes of the marginal likelihood.

Results
The MJMCMC algorithm was run until around 10 000 unique models (7.6% of the model space) were explored. We parallelized the search on 10 CPUs. Default tuning parameterrs from  were used.
According to the marginal inclusion probabilities reported in Figure 3, only three factors X CHG , X CGH and X CODE are clearly significant for inference on the methylation patterns for the addressed epigenetic region, factors X Ma and X Mg also have some significance. Table 2 gives the marginal posterior model probability and posterior means of the parameters for the best model in the explored subset of models from the model space. This model is both the posterior mode model in the set of explored models and the median probability model (Barbieri and Berger 2004). We have also compared the selected model with alternative models based on the optimal sets of covariates but with other latent Gaussian structures and found our model to be the best in terms of the marginal log likelihood (see the Appendix A of the paper). PMP β 0 β CHG β CGH β CODE τ τ ζ 0.4276 -8.8255 2.4717 5.2122 6.4240 7.5075 1.2109 Table 2: Posterior means for the best model in terms of marginal posterior probability (PMP).
Based on the best model, we carried out computations of methylation probabilities for the locations in both the inference set and the identification set. Highly methylated regions are located between observations 7000-7050, 7250-7400, and 10150-10500, see Figure 1. Note that the model is quite sceptical to the methylation status of locations 7051 -7249, despite a number of observations with a high proportion of methylated reads in this region. Furthermore, we compared the results with the naïve approach based on computing the proportion of methylated reads (light blue line in Figure 1), which is currently addressed in the biological literature as a standard way to evaluate methylation probability of a given nucleobase. The results show that the naïve approach should not be trusted in the presence of spatially correlated data and the probabilities corresponding to it can be strongly biased.

Discussion
During cancer development, the changes in DNA methylation patterns occur within the gene promoter, CpG islands and their shores . Hence in future, it would be of interest to obtain additional covariates such as whether the corresponding nucleotide base position belongs to a particular part of the non-coding gene region like a promoter, an intron or remnants of transposable elements, and whether the nucleobase is within a CpG island, and see how these covariates are influencing the underlying methylation patterns. At the same time, in this paper we looked only at a subset of the genomic locations associated with the groups of genes of biological interest, however, in the future, it would be of interest to address the whole genome. That would induce working with extremely large data, which in turn creates new methodological challenges. In particular, in order to make the efficient inference, it will be important to allow sub-sampling for INLA within a given model and MJMCMC in the discrete marginal space of models. It will also be of interest to allow logical expressions for the binary covariates to be included in the model following Hubin, Storvik, and Frommlet (2018). Finally, joint inference on the covariates and various latent Gaussian variables by means of MJMCMC can be of interest.