Statistical Analysis of Discrete-valued Time Series by Parsimonious High-order Markov Chains

Problems of statistical analysis of discrete-valued time series are considered. Two approaches to the construction of parsimonious (small-parametric) models for observed discrete data are proposed, based on high-order Markov chains. Consistent statistical estimators for the parameters of the developed models and of some known models are constructed, along with statistical tests on the values of these parameters. Probabilistic properties of the constructed statistical inferences are given. The developed theory is also applied to statistical analysis of spatio-temporal data. Theoretical results are illustrated by computer experiments on real statistical data.


Introduction
Time series analysis is deeply developed (Anderson 1971) for "continuous" data, when the observation space A is some Euclidean space or a subspace of it with nonzero Lebesgue measure: A ⊆ R^m, mes(A) > 0. In practice, however (because of the "digitalization" of our real world), statisticians need discrete-valued models of time series, in which the observation space A is some discrete set of cardinality N = |A|, mes(A) = 0. Here are some applied areas where discrete-valued time series models are extremely helpful (Weiss 2018): bioinformatics, for analysis of genetic sequences (N = 4); information systems, for information protection (N = 2); meteorology, for weather prediction; social science, for modelling the dynamics of social behavior; public health and personalized medicine; prediction of environmental processes; financial engineering; telecommunications; alarm systems. A discrete-valued time series is a random process x_t ∈ A on some probability space (Ω, F, P) with discrete time t ∈ N_0 = {0, 1, 2, . . .} and some discrete state space A of cardinality N = |A|, 2 ≤ N ≤ +∞. Much attention to discrete-valued time series, considered as categorical time series, was paid by B. Kedem and K. Fokianos (Kedem and Fokianos 2002).
If A is a countable set (N = +∞), we have a countable-valued time series x_t. If A is a finite set (N < +∞), we have a finitely valued time series, also called a categorical time series (Kedem and Fokianos 2002). We consider both cases here, and, without loss of generality, we assume that A = {0, 1, . . . , N − 1}. A universal base model for a discrete-valued time series x_t ∈ A is the homogeneous Markov chain MC(s) of some order s ∈ N_0, determined by the generalized Markov property (t > s):

P{x_t = i_t | x_{t−1} = i_{t−1}, . . . , x_1 = i_1} = P{x_t = i_t | x_{t−1} = i_{t−1}, . . . , x_{t−s} = i_{t−s}} = p_{i_{t−s}, . . . , i_{t−1}, i_t},   (1)

where s is the memory depth; i_1, i_2, . . . , i_t ∈ A are the values of the process at the time moments 1, 2, . . . , t respectively; P = (p_{i_{t−s}, . . . , i_{t−1}, i_t}) is an (s + 1)-dimensional matrix of one-step transition probabilities. The number of independent parameters of the MC(s) model increases exponentially w.r.t. the memory depth s: D_MC(s) = N^s (N − 1). To identify the model (1) we need a huge data set and computational work of size O(N^{s+1}). To avoid this "curse of dimensionality" we propose to use parsimonious ("small-parametric") models of high-order Markov chains that are determined by a small number of parameters d ≪ D_MC(s) (Kharin 2013).
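The exponential growth of the parameter count D_MC(s) = N^s (N − 1) is easy to check numerically; a minimal sketch (the function name is ours, for illustration only):

```python
def d_mc(N: int, s: int) -> int:
    """Number of independent parameters of a full Markov chain MC(s)
    on N states: each of the N**s prehistories indexes a row of N
    transition probabilities summing to 1, hence N - 1 free entries."""
    return N ** s * (N - 1)

# Binary alphabet (N = 2): the parameter count doubles with each
# extra unit of memory depth s.
growth = [d_mc(2, s) for s in range(1, 6)]
```

Even for the genetic alphabet (N = 4) a modest depth s = 3 already requires d_mc(4, 3) = 192 parameters, which motivates the parsimonious models below.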

Construction of parsimonious high-order Markov chains
According to what has been said in the Introduction, a parsimonious high-order Markov chain is determined by a parsimonious representation of the one-step transition probability matrix P defined by (1). The number of independent parameters d of the parsimonious matrix is much smaller than the total number of independent parameters, which is characterized by the parsimony coefficient

κ = d / D_MC(s) ≪ 1.   (2)

In our opinion, there are two main approaches to the construction of a parsimonious matrix P. Approach I: squeezing the set of different values of the elements of the matrix P. Approach II: using some generation equation for the conditional probability distribution (1) of the future state x_t given its prehistory.
Here is a list of known parsimonious high-order Markov chains corresponding to Approach I: the Markov chain of order s with r partial connections MC(s, r) (Kharin and Piatlitski 2007), the Markov chain of conditional order MCCO(s, r) (Kharin and Maltsew 2017), and the variable length Markov chain (Bühlmann and Wyner 1999).
Approach II to the construction of parsimonious high-order Markov chains is based on some generation equation for the conditional probability distribution of the future state x_t ∈ A given its prehistory X_{t−s}^{t−1} = (x_{t−1}, . . . , x_{t−s}):

P{x_t = j | X_{t−s}^{t−1} = (i_1, . . . , i_s)} = q_j(θ(i_1, . . . , i_s; a)), j ∈ A,   (5)

where {q_j(θ) : j ∈ A} is some discrete probability distribution on A that depends on the parameter θ = (θ_j) ∈ Θ ⊆ R^L, and θ(i_1, . . . , i_s; a) is some parametric function that is a priori known up to an unknown vector parameter a = (a_k) ∈ R^m.
The list of known parsimonious high-order Markov chains corresponding to Approach II is longer: the Jacobs - Lewis model (Jacobs and Lewis 1978), the MTD-model (Raftery 1985), and others.

Statistical analysis for models constructed by Approach I

Markov chain MC(s, r) of order s with r partial connections
The MC(s, r), proposed by Yu. Kharin in 2004, is determined by the following parsimonious representation of the (s + 1)-dimensional transition probability matrix:

p_{i_1, . . . , i_s, i_{s+1}} = q_{i_{m_1}, . . . , i_{m_r}, i_{s+1}}, J_1^{s+1} = (i_1, . . . , i_{s+1}) ∈ A^{s+1},   (7)

where M_r = (m_1, . . . , m_r), 1 ≤ m_1 < . . . < m_r ≤ s, is the template of connections and Q = (q_{j_1, . . . , j_r, j}) is an (r + 1)-dimensional stochastic matrix. In (Kharin 2013; Kharin and Piatlitski 2007) the following probabilistic properties of the MC(s, r)-model are found.
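To illustrate the parsimony mechanism of the MC(s, r)-model — the next state's distribution is read off from only r of the s most recent states, picked out by a template of connections — here is a hedged sketch; the array layout and argument conventions are our illustrative assumptions:

```python
import numpy as np

def mcsr_transition(prehistory, template, Q):
    """Transition distribution of an MC(s, r): of the s most recent
    states (prehistory[0] is the oldest) only the r positions listed
    in `template` (1-based) are consulted; Q is an (r+1)-dimensional
    stochastic array of shape (N,) * (r + 1)."""
    selected = tuple(prehistory[m - 1] for m in template)
    return Q[selected]

# Toy case: N = 2, s = 3, r = 1 -- only the oldest of the three states
# matters, so the N**3 * (N - 1) = 8 free parameters of MC(3) shrink to 2.
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
dist = mcsr_transition(prehistory=(1, 0, 1), template=(1,), Q=Q)
```

Here `dist` is the row of Q indexed by the single selected state, regardless of the other two lagged values.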
Theorem 1. The MC(s, r) defined by (7) is an ergodic Markov chain iff there exists i ∈ N such that the minimal i-step transition probability is positive.

Corollary 1. For a stationary Markov chain the stationary probability distribution has the multiplicative form.

Introduce the notation: ν_{J_1^{r+1}} is the frequency statistic for the template M_r ∈ M; µ_{J_1^{r+1}} is the probability distribution of the (r + 1)-tuple formed by the states selected by the template together with x_{t+s} = j_{r+1}; a dot used instead of an index means summation over all values of that index. We have the following results (Kharin and Piatlitski 2007).
Theorem 2. If the template of connections M_r^0 is known, then the maximum likelihood estimator (MLE) for the matrix Q is Q̂ = (q̂_{J_1^{r+1}}), the normalized frequency statistic.

Corollary 2. Under a stationary MC(s, r) and a contiguous family of alternatives, the power of the test is expressed through the cumulative distribution function of the noncentral χ²-distribution with U degrees of freedom and the corresponding noncentrality parameter.

Introduce the notation: M is the set of all admissible templates M_r; H(M_r) is the conditional entropy of the future symbol x_{t+s} ∈ A relative to the past selected by the template M_r ∈ M; Ĥ(M_r) is the "plug-in" estimator of this conditional entropy, generated by substitution of the estimators µ̂_{J_1^{r+1}}. To estimate the parameters r, s we use the Bayesian Information Criterion (BIC): BIC(s, r) = −2 ln L(s, r) + d ln T, where L(s, r) is the maximized likelihood and d is the number of independent parameters. Consistent estimators ŝ, r̂ are determined by minimization: BIC(s, r) → min_{s,r}.
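The BIC-based choice of the memory depth can be sketched in a few lines. This toy version scores only the full MC(s) on a binary sequence (the paper's criterion minimizes BIC(s, r) jointly over s and the template order r); all names are illustrative:

```python
import numpy as np
from collections import Counter

def bic_mc(x, s, N):
    """BIC of a full Markov chain MC(s) fitted by maximum likelihood:
    -2 * max log-likelihood + D_MC(s) * log(T), D_MC(s) = N**s * (N - 1)."""
    T = len(x)
    trans = Counter((tuple(x[t - s:t]), x[t]) for t in range(s, T))
    ctx = Counter(tuple(x[t - s:t]) for t in range(s, T))
    # MLE transition probability for context c is n / ctx[c].
    loglik = sum(n * np.log(n / ctx[c]) for (c, j), n in trans.items())
    return -2.0 * loglik + N ** s * (N - 1) * np.log(T)

# Simulate a binary order-1 chain; BIC should recover s = 1, since the
# likelihood gain of larger s does not justify the extra parameters.
rng = np.random.default_rng(1)
x = [0]
for _ in range(2000):
    x.append(int(rng.random() < (0.8 if x[-1] == 0 else 0.3)))
s_hat = min(range(1, 4), key=lambda s: bic_mc(x, s, N=2))
```

The consistency of such minimization is exactly what makes the estimators ŝ, r̂ above trustworthy for long sequences.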

Markov chain of conditional order MCCO(s, L)
The Markov chain x_t is called a Markov chain of conditional order MCCO(s, L) if its one-step transition probabilities have a parsimonious form in which the transition matrix to be used is selected by the sequence of elements J_{s−L+1}^s, called the base memory fragment (BMF) of the random sequence; L is the length of the BMF; the value s_k = s − b_k + 1 is called the conditional order of the Markov chain. Thus the conditional probability distribution of the state x_{t+1} at time point t + 1 depends not on all s previous states, but only on L + 1 selected states j_{b_k}, J_{s−L+1}^s. Note that if L = s − 1, s_0 = s_1 = . . . = s_K = s, we have the fully connected Markov chain of order s: MC(s). If M = K + 1, then each transition matrix corresponds to only one value of the BMF; otherwise there exists a common matrix which corresponds to several values of the BMF.
Hence the transition matrix P of the Markov chain of conditional order is determined by d = 2(N^L + 1) + M N(N − 1) independent elements, and the parsimony coefficient (2) is correspondingly small. Methods and algorithms of statistical analysis for MCCO(s, L) are presented in (Kharin and Maltsew 2017).
Jacobs - Lewis model
In (Jacobs and Lewis 1978) only moments and stationary distributions were analyzed. We established (Kharin 2013) probabilistic and statistical properties of the model (11), (12) by relating it to the MC(s)-model.

Raftery model
The Mixture Transition Distribution model (MTD-model) was proposed by A. Raftery (Raftery 1985) as a special parsimonious representation of the matrix P:

p_{i_1, . . . , i_s, j} = Σ_{k=1}^s λ_k q_{i_k, j}, λ_k ≥ 0, Σ_{k=1}^s λ_k = 1,

where Q = (q_{i,j}) is a stochastic matrix. The MTDg (generalized MTD) model uses a separate stochastic matrix for each lag:

p_{i_1, . . . , i_s, j} = Σ_{k=1}^s λ_k q_{i_k, j}^{(k)},

where Q^{(k)} = (q_{i,j}^{(k)}) is a stochastic matrix for the k-th lag. The number of parameters for the MTDg is D_MTDg = s(N(N − 1)/2 + 1) − 1; κ = O(s · N^{2−s}).
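A minimal numerical sketch of the MTD mixture (naming conventions are ours): the transition probability is a λ-weighted mixture of rows of one shared stochastic matrix Q, the rows being selected by the lagged states.

```python
import numpy as np

def mtd_transition(prehistory, lam, Q):
    """MTD transition probabilities (Raftery 1985):
    p(j | i_1, ..., i_s) = sum_k lam[k] * Q[i_k, j],
    i.e. a lam-mixture of the rows of a single N x N stochastic
    matrix Q indexed by the s lagged states in `prehistory`."""
    return sum(l * Q[i] for l, i in zip(lam, prehistory))

lam = np.array([0.5, 0.3, 0.2])          # lag weights, nonnegative, sum to 1
Q = np.array([[0.7, 0.3],                # one shared stochastic matrix
              [0.4, 0.6]])
p = mtd_transition(prehistory=(0, 1, 1), lam=lam, Q=Q)
```

Because every mixed row is a probability distribution and the weights sum to one, the resulting vector p is automatically a probability distribution, whatever the prehistory.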
We have constructed a simple criterion for the ergodicity of the MTD-model and found a useful property of the stationary probability distribution (Kharin 2013).
The MLEs Q̂, λ̂ are solutions of a nonlinear maximization problem (16). The simpler estimators Q̃, λ̃ are used as initial values in the iterative computation of the MLEs Q̂, λ̂ in (16). A generalized probability ratio test of asymptotic size ε ∈ (0, 1) for H_0 = {Q = Q_0, λ = λ_0}, H_1 = H̄_0 is constructed as in the subsection on the MC(s, r)-model.
Corollary 5. For the weight matrix H = J the FBE (18) is asymptotically efficient (it attains the Cramér - Rao bound as T → +∞).
Proofs of Theorem 7 and its corollary are given in (Kharin and Voloshko 2019).

Binomial conditionally nonlinear model for spatio-temporal data
We extend the BiCNAR model defined in the previous subsection to the case of spatio-temporal data (Dauhaliova and Kharin 2018). Introduce the notation: g ∈ G = {1, 2, . . . , n} is the index variable that determines the geographic region (site); n is the number of geographic regions to be analyzed; x_{g,t} ∈ A is the discrete random observation at time moment t ∈ N_0 at the site g ∈ G; X_t = (x_{1,t}, . . . , x_{n,t}) ∈ A^n is the column vector of all n observations at time moment t; Z_{g,t} = (z_{1,g,t}, . . . , z_{m,g,t}) ∈ R^m is the column vector of m exogenous variables at time moment t for the site g ∈ G. It is assumed that under fixed prehistory {X_1, . . . , X_{t−1}}: a) the random variables x_{1,t}, . . . , x_{n,t} are conditionally independent; b) the conditional probability distribution of x_{g,t} is binomial; c) the parameter p_{g,t} is determined by a special case of the generation equation (5), where a_g = {a_{gki}} ⊂ R^m, b_g = (b_{g,1}, . . . , b_{g,m}) ∈ R^m are the parameters of this model.
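The conditional scheme a)-c) for one site can be mimicked in a few lines. The logistic link below is only an illustrative stand-in, not the model's actual generation equation (20), which has its own parametric form in (Dauhaliova and Kharin 2018); all names are ours:

```python
import numpy as np

def simulate_binomial_site(T, n_trials, beta, x0, rng):
    """Simulate one site g of a binomial conditionally autoregressive
    scheme: given the prehistory, x_{g,t} is Binomial(n_trials, p_{g,t}),
    with p_{g,t} driven by the previous observation.  The logistic link
    used here is an illustrative assumption, not equation (20)."""
    x = [x0]
    for _ in range(T - 1):
        p_t = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x[-1] / n_trials)))
        x.append(int(rng.binomial(n_trials, p_t)))
    return x

rng = np.random.default_rng(2)
traj = simulate_binomial_site(T=200, n_trials=10, beta=(-0.5, 1.0), x0=5, rng=rng)
```

For the full spatio-temporal model one would run n such conditionally independent recursions, each with its own parameters a_g, b_g and exogenous inputs Z_{g,t}.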
The proposed model (19), (20) determines a parsimonious nonhomogeneous n-dimensional Markov chain of order s with a small parsimony coefficient. We constructed consistent and asymptotically normal MLEs for the parameters {a_g}, {b_g} of the model (19), (20) and a numerical procedure for their evaluation in (Kharin and Zhurak 2015; Dauhaliova and Kharin 2018).
Note in conclusion that in (Kharin and Zhurak 2015) we developed the Poisson conditionally nonlinear model for spatio-temporal count data.

Application of the developed algorithms in data analysis
All algorithms developed in Sections 2-4 were tested on simulated data, and their accordance with the proved theoretical properties was confirmed. Here we give some results of computer experiments with the developed algorithms on real data from applications; we present three examples. In the first example, the fitted model MC(3, 2) detects significant stochastic dependencies in the data.

Analysis of CG-patterns in genome
We took the complete Panthera tigris mitochondrion genome of length T = 16990 (available from the NCBI Nucleotide database, ID EF551003.1) and extracted the binary sequence x_t of its CG-indicators: x_t = 1 iff the t-th nucleotide is Guanine or Cytosine, t = 1, . . . , T. The portion of "1" in X_1^T is known as the CG-content and plays an important role in bioinformatics.
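Extracting the CG-indicator sequence from a nucleotide string is a one-liner; a sketch (the function name is ours):

```python
def cg_indicators(seq: str) -> list:
    """Binary CG-indicator sequence: 1 for Guanine/Cytosine (G, C),
    0 for Adenine/Thymine (A, T)."""
    return [1 if c in "GCgc" else 0 for c in seq]

x = cg_indicators("ATGCCGTA")
cg_content = sum(x) / len(x)   # the portion of "1", i.e. the CG-content
```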
In order to evaluate the individual and pairwise impact of the lagged variables X_{t−s}^{t−1} on x_t, we fitted the BCNAR(s)-model (up to s = 15) for N = 2 with the bilinear bases {ψ_i(·)} and the Gaussian c.d.f. Φ(·).
Two fitted BCNAR-models, for s = 10 and s = 15 respectively (with ζ_t = (−1)^{x_t}), are adequate according to BIC. It is seen that in the two fitted models (21) of high orders s = 10 and s = 15 the future state x_t is significantly influenced by the past states x_{t−i} lagged by multiples of three (i = 3j), while even the closest predecessor x_{t−1} has a weaker impact on x_t. Note also that the pairs of lagged variables in these two models are more influential than the single lagged variables, i.e. the bilinear basis captured some essential high-order statistical dependencies invisible to the linear one. These identified properties of the bilinear models confirm the effect, well known in genetics, of statistical 3-periodicity of protein-coding fragments; this fact indicates the usefulness of the developed theory for statistical analysis of genetic sequences.

Dynamics analysis for the incidence rate of children leukemia
The real medical data describe the incidence rate of children's leukemia in n = 3 cities of Belarus for T = 25 years after Chernobyl (the data were normalized by the number of residents in the risk group). We used the Poisson conditionally nonlinear autoregressive spatio-temporal model (Kharin and Zhurak 2015).
Forecasting results by the fitted models are presented in Figures 3-5.

Conclusion
Methods and software for time series analysis are well developed for "continuous" data, but these techniques are not applicable (or are risky to apply) to discrete-valued time series x_t ∈ A with a discrete state space A.
A universal model for discrete-valued time series with bounded memory length s ∈ N is the Markov chain of sufficiently large order s, but the computational complexity of fitting it to data increases exponentially w.r.t. s: O(|A|^{s+1}).
To avoid this "curse of dimensionality" we need to construct parsimonious (small-parametric) models. We propose two approaches to developing parsimonious models of high-order Markov chains: 1) squeezing the set of different values of the transition probabilities; 2) using a generation equation for the transition probabilities.
The proposed approaches are considered for six parsimonious models.
Developed algorithms of computer data analysis are illustrated in three applications on real data.
Finally, we plan to develop the results of this paper in three main directions: 1) robust statistical analysis of discrete-valued time series based on asymptotic risk expansion methods published in (Kharin 1997, 2011; Kharin and Zhuk 1998); 2) robust sequential testing of hypotheses for discrete-valued time series using approaches from (Kharin 2017; Kharin and Kishylau 2015; Kharin and Tu 2017); 3) applications in steganography (Kharin and Vecherko 2013) and in marketing (Pashkevich and Kharin 2004).