Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets

Small area estimation is a topic of increasing importance in official statistics. Although the classical EBLUP method is useful for estimating the small area means efficiently under the normality assumptions, it can be highly influenced by the presence of outliers. Therefore, Sinha and Rao (2009; The Canadian Journal of Statistics) proposed robust estimators/predictors for a large class of unit- and area-level models. We confine attention to the basic unit-level model and discuss a related, but slightly different, robustification. In particular, we develop a fast algorithm that avoids inversion and multiplication of large matrices, and thus permits the user to apply the method to large datasets. In addition, we derive much simpler expressions of the bounded-influence predicting equations for robustly predicting the small-area means than those given by Sinha and Rao (2009).

Zusammenfassung: The increasing use of small area estimation methods in official statistics reflects a paradigm shift that underscores the importance of model-assisted estimation methods. These methods often rest on the strict parametric distributional assumptions of mixed linear models and are therefore not robust under outlier contamination. Sinha and Rao (2009; The Canadian Journal of Statistics) proposed a widely noted robustification of unit- and area-level models which, however, is unsuitable in terms of numerical stability and applicability for the sample sizes common in official statistics. In this article, a robust method equivalent to the Sinha-Rao method is developed, and an algorithm for it is described. The performance of the method is demonstrated in a small simulation study.


Introduction
Small area estimation (SAE) has become of great importance in official statistics due to the growing demand for reliable small-area statistics (e.g., estimates at the level of Bundesländer/Kantone or communities). Sample surveys provide a cost-effective way of obtaining estimates for characteristics of interest at both the population and the subpopulation (or domain) level. An estimator of a domain characteristic is called direct if it is based only on data from sample units in the domain. In most practical applications, however, domain sample sizes are not large enough (or are even zero for unplanned domains) to allow direct estimation. In this context, small area refers to a subpopulation for which reliable statistics of interest cannot be produced due to certain limitations of the available data or because the estimates have extremely large sampling errors.

Austrian Journal of Statistics, Vol. 41 (2012), No. 4, 243-265
When direct estimation is not possible (or very unreliable), one has to rely on alternative methods that depend on the availability of population-level auxiliary information (e.g., census or administrative data). These methods are commonly referred to as indirect or model-based estimation methods. Among the model-based SAE methods, the class of (generalized) mixed linear models (MLM) has received considerable attention because these models explain between-area variation in the target variable using auxiliary information and include area-specific random effects to account for between-area variation beyond that explained by the auxiliary information (see e.g., Jiang and Lahiri, 2006). In this context, the small-area means (and totals) can be expressed as linear combinations of fixed and random effects, which are obtained by (empirical) best linear unbiased prediction (BLUP; EBLUP) estimators. In general, model-based estimation procedures enlarge the effective area-specific sample size and yield a smaller mean square (prediction) error of the statistic under consideration compared to direct estimation methods (Rao, 2003, chap. 5). However, such models depend on parametric assumptions and require specification of the random part of the model.
Although the EBLUP method is useful for estimating the small-area means efficiently under normality assumptions, it can be highly influenced by the presence of outliers in the data or departures from the assumed normal distribution of the random effects. In the presence of contamination, the bias of non-robust methods (e.g., maximum likelihood estimators) can be arbitrarily large, which renders these estimators extremely inefficient. Sinha and Rao (2009) therefore introduced the robust EBLUP method (REBLUP), based on M-estimators for MLMs (Richardson and Welsh, 1995; Welsh and Richardson, 1997), into the field of small area estimation.
Although M-estimators for MLMs are theoretically convincing, no reliable algorithms have been available so far (Chambers and Tzavidis, 2006). Sinha and Rao (2009) proposed to obtain robust parameter estimates by a Newton-Raphson-type (NR) numerical optimization method, but they did not indicate how to initialize the method. Moreover, the NR method is well known, e.g., from robust regression analysis (Maronna, Martin, and Yohai, 2006, chap. 9.1; Huber, 1981, chapters 6.7 and 7.8), to be rather unreliable. Notably, Richardson (1995) reports significant convergence problems of NR and similar numerical optimization methods (i.e., Anderson's method (T.W. Anderson, 1973; Miller, 1977) and the EM algorithm) in the context of M-estimators for MLMs. Moreover, Chaubey and Venkateswarlu (2002) report convergence failure in computing robust ML estimators in 5.8 %-25.8 % of all Monte Carlo trials in their simulation study.
Computational problems arise, though, already in the exercise of computing ML estimates for unbalanced data when no contamination is present at all. Searle, Casella, and McCulloch (1992) phrase it as follows: "[t]here are myriad difficulties involved in actually implementing these methods [i.e., ML and REML estimators in mixed linear models; ts] including, but not limited to, stability of numerical methods applied to the matrices involved, methods of avoiding the inversion of large matrices and the details of diagnosing convergence [...]" (Searle et al., 1992, p. 312). The exercise of robustly fitting MLMs introduces further difficulties, such as (among others) the choice of robust starting values. The lack of suitable software has led to extensive efforts to study M-quantile-type methods in the field of SAE; see e.g., Chambers and Tzavidis (2006).
We focus on the basic unit-level SAE model (see Rao, 2003, chap. 7.2). The main contribution of this article is a numerically stable and fast algorithm (even for very large datasets) for computing (robust) M-estimates of the model parameters. In addition, we develop a method for robustly predicting the small-area means which is much simpler, and therefore considerably faster, than the proposal of Sinha and Rao (2009).
This article is organized as follows. Section 2 introduces the model of interest, the ML estimator, and the EBLUP method. In Section 3, we discuss the robustification of the EBLUP and derive a numerically stable and fast algorithm. Subsequently, we derive robust predicting equations, based on the proposal of Copt and Victoria-Feser (2009), in order to robustly predict the small-area means. A bootstrap method for estimating the mean square prediction error (MSPE) of the estimated/predicted small-area means, paralleling the ideas of Sinha and Rao (2009), is discussed in Section 4. In Section 5, we present the results of a simulation study. Finally, Section 6 draws together the main findings.

Preliminaries, Definitions and the Model
Let us first introduce some terminology and notation. Consider a finite survey population U whose units are labeled 1, ..., N. The population U is assumed to be partitioned into g subsets U_1, ..., U_g, called small areas. Let N_i be the size of U_i; since U = ∪_{i=1}^{g} U_i, we have N = ∑_{i=1}^{g} N_i. We are interested in estimating the domain-specific means of the variable of interest, y, from the data of the domains U_i, i = 1, ..., g. Let Ȳ_i be the population mean of y based on the N_i units in area i (i = 1, ..., g).
Suppose that Ȳ_i is related to the p-vector x_ij = (x_ij1, ..., x_ijp)^T of auxiliary data on the unit level. In what follows, we assume that this relationship is given at the population level by the so-called basic unit-level model (Rao, 2003, chap. 7.2), through a nested-error linear regression (Battese, Harter, and Fuller, 1988) of the form

y_ij = x_ij^T β + u_i + e_ij ,   j = 1, ..., N_i ,   i = 1, ..., g ,   (1)

where the area-specific random effects u_i are assumed to be independent N(0, σ_a^2), and independent of the unit-level errors e_ij, which are assumed to be independent N(0, σ_e^2). The variance components σ_e^2 and σ_a^2 are stacked in θ = (σ_e^2, σ_a^2)^T. A sample of n_i ≥ 1 units is assumed to be drawn (e.g., in a survey) from each area, i = 1, ..., g. The sampling is assumed to be ignorable, such that the population model (1) also holds in the sample. The latter assumption is satisfied under simple random sampling from each area, or more generally for sampling designs that use the auxiliary information x_ij in the selection of the samples; see Rao (2003, pp. 78-80) for more details. The following definition specifies the model based on the sample data.

Definition 1. Marginally, the basic unit-level model is defined as

y_i ~ N(X_i β, Ω_i(θ)) ,   independently for i = 1, ..., g ,   (2)

with θ = (σ_e^2, σ_a^2)^T and Ω_i(θ) = σ_e^2 I_i + σ_a^2 1_i 1_i^T, where I_i is the (n_i × n_i) identity matrix, 1_i the n_i-vector of ones, and y_i an n_i-vector.
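To fix ideas, the data-generating mechanism of the basic unit-level model can be sketched in a few lines of Python/NumPy; the toy dimensions, seed, and parameter values below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
g, ni, p = 5, 10, 2            # areas, units per area, predictors (toy sizes)
beta = np.array([1.0, 0.5])    # fixed effects (intercept first, cf. Assumption 2)
sigma_e, sigma_a = 1.0, 0.5    # unit-level and area-level standard deviations

X = np.column_stack([np.ones(g * ni), rng.normal(size=g * ni)])
area = np.repeat(np.arange(g), ni)            # area label of each unit
u = rng.normal(0.0, sigma_a, size=g)          # u_i ~ N(0, sigma_a^2)
e = rng.normal(0.0, sigma_e, size=g * ni)     # e_ij ~ N(0, sigma_e^2)
y = X @ beta + u[area] + e                    # nested-error regression (1)

# marginal covariance of y_i within one area: Omega_i = sigma_e^2*I + sigma_a^2*11^T
Omega_i = sigma_e**2 * np.eye(ni) + sigma_a**2 * np.ones((ni, ni))
```

The block-diagonal structure of the marginal covariance (one Ω_i per area) is what the fast algorithm later exploits.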
Assumption 1. The parameter space of model (2) is such that the variance components are nonnegative. Note that we allow for the possibility that the random-effect variance, σ_a^2, can be zero.

Assumption 2. The first column of the (n_i × p) matrices X_i consists of n_i ones, for all i = 1, ..., g.
As a consequence, the first element of β refers to the (regression) intercept.
This assumption is imposed to simplify the discussion of the proposed algorithm. All results in this paper remain valid if Assumption 3 does not hold; however, the computational details are more involved.
The estimators considered in this article have to be computed by a numerical method in an iterative manner. On the sth iteration (s = 1, 2, ...) of an algorithm for producing an estimate of, say, β, the current value of the estimate is converted into a new value. By way of example, we denote by {β}^(s) the parameter estimate of β on the sth iteration, and, for any quantity f which is a function of β, we use {f}^(s) to represent the value of f at {β}^(s).

EBLUP Method
Suppose for the time being that β and u_i (i = 1, ..., g) are known. For N_i large and the sampling fraction f_i = n_i/N_i small (for all i), it follows that the area-specific means Ȳ_i can be estimated/predicted by

μ_i = x̄_i•^T β + u_i ,   (3)

where x̄_i• is the p-vector of known population means for area i. Note that x̄_i• is the mean of x_ij based on all j = 1, ..., N_i units of area i in population U. In the case of non-negligible sampling fractions (i.e., if N_i ≫ n_i does not hold), we cannot take the small area mean Ȳ_i as x̄_i•^T β + u_i. However, we can write Ȳ_i as

Ȳ_i = f_i ȳ_i + (1 − f_i) ȳ_i^ns ,   (4)

where ȳ_i is the sample mean and ȳ_i^ns is the mean of the non-sampled values of y in area i. Under the population model, we replace the unobserved y_ij^ns by its estimator (x_ij^ns)^T β + u_i, where x_ij^ns is the p-vector of auxiliary data associated with y_ij^ns, and compute the means ȳ_i^ns (Rao, 2003, p. 142). Sinha and Rao (2009) indicate that this approximation may not be adequate in the presence of representative outliers (Chambers, 1986), although it is adequate under the Gaussian mixed linear model. Now, since the fixed effects, β, and the variance components, θ (and the realizations of the random effects u_i), are unknown in a real-world application, the predicting equation (3) is of limited value. For known θ, however, the best linear unbiased estimator (BLUE), β̂, is given by

β̂(θ) = ( ∑_{i=1}^{g} X_i^T Ω_i^{-1}(θ) X_i )^{-1} ∑_{i=1}^{g} X_i^T Ω_i^{-1}(θ) y_i .   (5)

Appealing to well-known results of BLUE estimation, we obtain the best linear unbiased prediction (BLUP) estimator (Rao, 2003, pp. 96-98)

μ̂_i(θ) = x̄_i•^T β̂(θ) + û_i(θ) ,   û_i(θ) = σ_a^2 1_i^T Ω_i^{-1}(θ) (y_i − X_i β̂(θ)) .   (6)

In practice, θ has to be estimated as well. This can be accomplished by several methods, each of which has advantages and more or less severe disadvantages (Harville, 1977; Miller, 1977); see below for a discussion of the ML estimates. Once we have computed θ̂, the empirical BLUP (EBLUP), μ̂_i(θ̂), is obtained by replacing θ in (6) by θ̂.
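For the nested-error model, the BLUP of u_i reduces to the familiar shrinkage form û_i = γ_i (ȳ_i − x̄_i^T β̂) with γ_i = σ_a^2/(σ_a^2 + σ_e^2/n_i), and Ω_i^{-1} has a Sherman-Morrison closed form. The following Python/NumPy sketch (function name, toy data, and interfaces are ours, not the paper's) illustrates the computation for known variance components:

```python
import numpy as np

def blup_means(y, X, area, xbar_pop, sigma_e2, sigma_a2):
    """BLUP of the small-area means for known variance components,
    using the shrinkage form u_hat_i = gamma_i * (ybar_i - xbar_i^T beta_hat),
    gamma_i = sigma_a2 / (sigma_a2 + sigma_e2 / n_i)."""
    areas = np.unique(area)
    p = X.shape[1]
    A, b = np.zeros((p, p)), np.zeros(p)
    for i in areas:                       # accumulate GLS normal equations
        Xi, yi = X[area == i], y[area == i]
        ni = len(yi)
        gi = sigma_a2 / (sigma_a2 + sigma_e2 / ni)
        # Sherman-Morrison: Omega_i^{-1} = (I - (gi/ni) * 11^T) / sigma_e2
        Oinv = (np.eye(ni) - (gi / ni) * np.ones((ni, ni))) / sigma_e2
        A += Xi.T @ Oinv @ Xi
        b += Xi.T @ Oinv @ yi
    beta_hat = np.linalg.solve(A, b)      # GLS estimate of beta
    mu = np.empty(len(areas))
    for k, i in enumerate(areas):
        Xi, yi = X[area == i], y[area == i]
        ni = len(yi)
        gi = sigma_a2 / (sigma_a2 + sigma_e2 / ni)
        u_hat = gi * (yi.mean() - Xi.mean(axis=0) @ beta_hat)
        mu[k] = xbar_pop[k] @ beta_hat + u_hat   # prediction of the area mean
    return beta_hat, mu
```

With sigma_a2 = 0, the shrinkage factor γ_i vanishes and the predictor collapses to the regression-synthetic estimate x̄_i•^T β̂.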
While the EBLUP is fairly easy to obtain, estimation of a reasonable measure of uncertainty for the area-level predicted means is a challenging problem. In the context of finite population sampling, a variance estimate of a direct domain estimator of the mean (e.g., the domain-specific Hajek estimator) is readily obtained, appealing to well-known results of randomization inference and the fact that the estimator is design unbiased (see e.g., Särndal, Swensson, and Wretman, 1992, chap. 10.3). These results do not, however, carry over to (E)BLUP estimation, and we have to resort to mean square (prediction) error estimation. In their seminal paper, Prasad and Rao (1990) studied a second-order approximation to the mean square prediction error (MSPE) of the EBLUP. Datta and Lahiri (2000) extended the Prasad-Rao setting to a wider range of variance estimators, including the ML estimator. We focus on the promising (also with regard to robustness properties) parametric bootstrap discussed in Lahiri (2003); see Section 3.4 for further details.

Maximum Likelihood Estimator
To start with, it will prove useful to study the maximum likelihood (ML) estimator of the model parameters β and θ. Subsequently, we shall derive a robustification of the Fisher-score functions. For model (2), the non-constant part of the log-likelihood, l(τ, X, y), is given by

l(τ, X, y) = −(1/2) ∑_{i=1}^{g} [ log det Ω_i(θ) + (y_i − X_i β)^T Ω_i^{-1}(θ) (y_i − X_i β) ] ,   (7)

where τ = (β^T, θ^T)^T. Whenever no contamination is supposed to be present, estimates of the parameter vector τ shall be obtained by means of ML. The ML estimator τ̂ is defined by l(τ̂, X, y) = sup_{τ ∈ Θ} l(τ, X, y), provided τ̂ is an interior point of Θ. It will prove useful to express the covariance matrix Ω_i (i = 1, ..., g) as follows:

Ω_i = v (I_i + d J_i) ,   J_i = 1_i 1_i^T ,   (8)

where v = σ_e^2, say, and d = σ_a^2/σ_e^2 ≡ a/v (the notation has been chosen for simplicity; notably, it spares us from writing squared terms); cf. Hartley and Rao (1967) for the parametrization of the covariance matrix in terms of variance-component ratios. The primary advantage of the Hartley-Rao parametrization is that we obtain a separate equation for v. Thus, on rewriting the log-likelihood, we obtain

l(τ, X, y) = −(1/2) ∑_{i=1}^{g} [ n_i log v + log det(I_i + d J_i) + v^{-1} r_i^T (I_i + d J_i)^{-1} r_i ] ,   r_i = y_i − X_i β .   (9)

Provided the maximum is not attained on the boundary, the maximum likelihood estimates β̂, v̂, and d̂ are a solution to the system of Fisher-score equations,

∑_{i=1}^{g} X_i^T (I_i + d J_i)^{-1} (y_i − X_i β) = 0 ,   (10)

−n/(2v) + (1/(2v^2)) ∑_{i=1}^{g} r_i^T (I_i + d J_i)^{-1} r_i = 0 ,   (11)

−(1/2) ∑_{i=1}^{g} tr[ (I_i + d J_i)^{-1} J_i ] + (1/(2v)) ∑_{i=1}^{g} r_i^T (I_i + d J_i)^{-1} J_i (I_i + d J_i)^{-1} r_i = 0 ,   (12)

respectively. It is easy to see that the MLE of v is given by

v̂ = (1/n) ∑_{i=1}^{g} r_i^T (I_i + d J_i)^{-1} r_i ,   (13)

where n = ∑_{i=1}^{g} n_i. Lindstrom and Bates (1988) (among others) take advantage of (13) and propose the variance-profile log-likelihood, substituting (13) back into (9). This leads to an equivalent maximization problem with v eliminated. That is, the variance-profile log-likelihood function, l_p(τ*, X, y), with τ* = (β^T, d)^T, is an economical, dimension-reduced parametrization. At first sight, the effect of parametrizing the variance components in terms of ratios, and the implied simplification of the maximization problem, may seem rather limited, since only one parameter can be eliminated. From the perspective of computation, however, even such a small reduction simplifies the numerical optimization problem considerably. This parametrization brings along another good property in terms of numerical optimization: observe from (13) that estimates of v are non-negative.
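The closed-form estimate of v can be computed area by area without ever forming an (n_i × n_i) matrix, because (I_i + d J_i)^{-1} r = r − d/(1 + d n_i) (1^T r) 1 by the Sherman-Morrison formula. A Python/NumPy sketch (function name ours):

```python
import numpy as np

def v_hat(y, X, area, beta, d):
    """Closed-form estimate of v given beta and d = sigma_a^2/sigma_e^2:
    v_hat = (1/n) * sum_i r_i^T (I + d*11^T)^{-1} r_i, evaluated via
    (I + d*J)^{-1} r = r - d/(1 + d*n_i) * (1^T r) * 1 (Sherman-Morrison),
    so no (n_i x n_i) matrix is ever formed."""
    n, total = len(y), 0.0
    for i in np.unique(area):
        r = y[area == i] - X[area == i] @ beta
        ni = len(r)
        total += r @ r - d / (1.0 + d * ni) * r.sum() ** 2
    return total / n
```

For d = 0 the expression reduces to the mean squared residual, and it is non-negative for any d ≥ 0 (since d/(1 + d n_i) ≤ 1/n_i and (1^T r)^2 ≤ n_i r^T r).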
Equations (10) and (12), on the other hand, do not have closed-form solutions (for unbalanced data) and have to be solved by some iterative numerical optimization method.
A criticism of ML estimators of variance is that they are biased downward because they do not take into account the loss in degrees of freedom due to the estimation of β. The REML estimators correct for this deficiency; see e.g., Harville (1977) for the details. Nonetheless, we focus on ML and robust M-estimators; see Richardson and Welsh (1995) for robust REML estimation.

Robust EBLUP
Although the classical EBLUP method is useful for estimating the small area means efficiently under normality assumptions, it can be highly influenced by the presence of outliers in the data. Furthermore, mixed linear models have, unlike location-scale or regression models, no nice invariance structure. Notably, this means that the parameters cannot be estimated consistently in the presence of contamination: there is an unavoidable asymptotic bias. In the presence of contamination, any method estimates the parameter at the core model plus an unknown bias. In the case of ML estimators, the bias can be arbitrarily large, which renders these estimators extremely inefficient (Welsh and Richardson, 1997).
A large number of authors have proposed methods for robust analysis of mixed linear models, ranging from rather algorithmic approaches (Rocke, 1983, 1991; Stahel and Welsh, 1997) over robustifications of Henderson's mixed-model equations (Fellner, 1986) to replacing the Fisher scores by robust Fréchet-differentiable statistical functionals (Bednarski and Zontek, 1996). Copt and Victoria-Feser (2006) have proposed an S-estimator and provide software for balanced data (cf. the supporting website of Heritier, Cantoni, Copt, and Victoria-Feser, 2009). The M-estimator-type methods, based on either a robustified likelihood (RML 1; Richardson and Welsh, 1995; Stahel and Welsh, 1997) or bounded-influence estimating equations (RML 2; Richardson and Welsh, 1995; Welsh and Richardson, 1997), have received considerable attention in the literature. Notably, we focus on the RML 2 method (Richardson and Welsh, 1995) because it embodies a natural way of restricting the influence of outliers in the response variable and is very closely related to the ML approach. Further, the RML 2 method is equivalent to the proposal of Sinha and Rao (2009). For these estimators, the potential bias is bounded, the efficiency is reasonable if the model holds, and the estimators are much more efficient than, e.g., ML estimators if it does not (Welsh and Richardson, 1997).
The following assumptions are crucial to all robust estimators.
Assumption 4. Outliers occur only in the response variable. No attempt is made to limit the effect of outliers in the model/design space of the model (i.e., influential/leverage observations).
In order to limit the influence of outliers in both the response variable (y) and the design matrix (X), one has to resort to generalized regression M-estimators (GM-estimators) in the context of linear models (e.g., Mallows- or Schweppe-type estimators). Richardson (1997) extended the notion of GM-estimators to include MLMs. Although theoretically convincing, GM-estimators for MLMs lack numerical stability (Richardson, 1995, chap. 6.5).

Robust M -Estimator EBLUP
In the presence of contamination, the ML estimates can be severely biased. It is therefore reasonable to replace the system of Fisher-score functions (10)-(12) by estimating equations (EE) whose influence functions are bounded, i.e., so-called bounded-influence estimating equations (BIEE).

BIEE for β
For the fixed effects, β, (10) shall be replaced by the BIEE

∑_{i=1}^{g} X_i^T V_i^{-1/2} ψ_k(r_i) = 0 ,   r_i = v^{-1/2} V_i^{-1/2} (y_i − X_i β) ,   (14)

where ψ_k denotes the Huber ψ-function with tuning constant k, applied elementwise. Hence, the n_i-vector of residuals in (14) is scaled by v^{-1/2} V_i^{-1/2}. The choice of V_i will be discussed later; for the time being, it is sufficient to assume that the square root of V_i^{-1} exists.
The solution of (14) shall be obtained by an iteratively re-weighted least squares (IRWLS) algorithm, which is the workhorse for computing M-estimates of regression (Maronna et al., 2006, pp. 104-105). Notably, the IRWLS approach is numerically much more stable than the Newton-Raphson approach (Schoch, 2011a). Denote by {β}^(s) the estimate of β produced by the algorithm on the sth iteration (s = 1, 2, ...), and let {r_i}^(s) denote the standardized residuals evaluated at {β}^(s). Define the weights

{W_i}^(s) = diag_j( w({r_ij}^(s)) ) ,   w(r) = ψ_k(r)/r   (with w(0) = 1) .   (15)

An updated estimate is obtained from the weighted least squares problem

{β}^(s+1) = argmin_β ∑_{i=1}^{g} ∥ ({W_i}^(s))^{1/2} V_i^{-1/2} (y_i − X_i β) ∥^2 ,   (16)

where we put X̃_i = ({W_i}^(s))^{1/2} V_i^{-1/2} X_i and ỹ_i = ({W_i}^(s))^{1/2} V_i^{-1/2} y_i. Now, since (16) is a standard least squares problem, we obtain (iteratively) updated estimates of β by standard regression techniques. First we note that, by Assumption 3, X̃_i has full rank given that w_i is not the null vector. Put X̃ = (X̃_1^T, ..., X̃_g^T)^T and ỹ = (ỹ_1^T, ..., ỹ_g^T)^T, which are of size (n × p) and (n × 1), respectively, where n = ∑_{i=1}^{g} n_i. Hence, we shall decompose X̃ by means of the "skinny" QR-factorization (see e.g., Gentle, 2007, pp. 188-189 and p. 226). Write X̃ = QR, where R = (R_1^T, 0^T)^T, with R_1 a (p × p) upper triangular matrix. Q is an (n × n) orthogonal matrix which can be partitioned likewise, Q = [Q_1 | Q_2], where Q_1 is an (n × p) matrix whose columns are orthonormal. This enables us to write X̃ = Q_1 R_1. Consequently, the overdetermined linear system X̃β = ỹ can be expressed as R_1 β = Q_1^T ỹ. Since R_1 is a (p × p) triangular matrix, the system is easy to solve:

{β}^(s+1) = R_1^{-1} Q_1^T ỹ ,   (17)

in an iterative manner. The final value is regarded as the estimate β̂_R.
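As an illustration, the IRWLS iteration with a QR solve can be sketched as follows. The sketch assumes the data have already been scaled by V_i^{-1/2} and that the residual scale is one; function names and tolerances are ours:

```python
import numpy as np

def huber_psi(r, k=1.345):
    """Huber psi-function, applied elementwise."""
    return np.clip(r, -k, k)

def irwls_beta(Xt, yt, beta0, k=1.345, maxit=50, tol=1e-8):
    """IRWLS for a Huber M-estimate of beta on pre-whitened data (unit scale).
    Each step solves a weighted LS problem via the skinny QR factorization,
    i.e., R_1 beta = Q_1^T y~."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(maxit):
        r = yt - Xt @ beta
        r_safe = np.where(r == 0.0, 1.0, r)
        w = np.where(r == 0.0, 1.0, huber_psi(r, k) / r_safe)  # w(r) = psi(r)/r
        sw = np.sqrt(w)
        Q1, R1 = np.linalg.qr(sw[:, None] * Xt)                # skinny QR
        beta_new = np.linalg.solve(R1, Q1.T @ (sw * yt))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Because the weights ψ_k(r)/r lie in (0, 1], each step is a well-posed least squares problem; the QR route avoids forming X̃^T X̃, which is the main source of numerical instability in a naive normal-equations implementation.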

BIEE for v
A bounded-influence EE for v that replaces the non-robust Fisher score (11) is obtained, in the spirit of Huber's Proposal 2 (Huber, 1964), as the solution v̂_R to the bounded-influence estimating equation

∑_{i=1}^{g} [ ψ_k(r_i)^T ψ_k(r_i) − n_i κ(k) ] = 0 ,   r_i = v^{-1/2} V_i^{-1/2} (y_i − X_i β) ,   (18)

where κ(k) = E[ψ_k^2(u)] is a consistency correction term that ensures consistency of the estimate at the Gaussian core model with u ∼ N(0, 1) (where the expectation is taken w.r.t. the model). From the perspective of computation, it is worth considering another representation of the BIEE. The solution of (18) can be expressed as a weighted estimator, paralleling the concept of computing M-estimates of scale (cf. Maronna et al., 2006, pp. 40-41). Define the weight function w(r) = [ψ_k(r)/r]^2 (with w(0) = 1) and put W_i = diag_j( w(r_ij) ). An updated estimate, {v}^(s+1), is given by

{v}^(s+1) = (n κ(k))^{-1} ∑_{i=1}^{g} (r_i^*)^T {W_i}^(s) r_i^* ,   r_i^* = V_i^{-1/2} (y_i − X_i {β}^(s)) ,   (19)

where n = ∑_{i=1}^{g} n_i and the weights in {W_i}^(s) are evaluated at the standardized residuals ({v}^(s))^{-1/2} r_i^*. The fact that all elements of the diagonal matrix W_i are nonnegative implies that the quadratic form in (19), and therefore the estimates of v, are nonnegative as well. This is in sharp contrast to the estimates obtained by the (inherently unconstrained) NR approach of Sinha and Rao (2009).
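In the pooled (i.i.d.) analogue, one weighted fixed-point step of the Proposal-2 type can be sketched as follows; the consistency constant κ(k) = E[ψ_k^2(u)] is evaluated in closed form for the Huber ψ (function names are ours):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def huber_psi(r, k=1.345):
    return np.clip(r, -k, k)

def kappa(k=1.345):
    """Consistency constant kappa(k) = E[psi_k(u)^2] for u ~ N(0,1):
    (2*Phi(k) - 1) - 2*k*phi(k) + 2*k^2*(1 - Phi(k))."""
    Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    phi = lambda x: exp(-0.5 * x * x) / sqrt(2.0 * pi)
    return (2.0 * Phi(k) - 1.0) - 2.0 * k * phi(k) + 2.0 * k * k * (1.0 - Phi(k))

def v_update(r_star, v, k=1.345):
    """One fixed-point step for the scale: standardize the residuals by
    sqrt(v), apply psi, and rescale by the consistency constant, so that
    sum psi_k(r/sqrt(v))^2 = n*kappa(k) holds at the fixed point."""
    z = r_star / np.sqrt(v)
    return v * np.sum(huber_psi(z, k) ** 2) / (len(r_star) * kappa(k))
```

For large k, ψ_k is the identity and κ(k) → 1, so the update collapses to the mean of squared residuals; the update is nonnegative by construction.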

BIEE for d
For the estimator of d, we also have to replace the non-robust Fisher-score function (12) by a bounded-influence estimating equation. In contrast to v, the BIEE of d has no closed-form solution. If we put

φ(d) = ∑_{i=1}^{g} { ψ_k(r_i(d))^T V_i(d)^{-1/2} J_i V_i(d)^{-1/2} ψ_k(r_i(d)) − κ(k) tr[ V_i(d)^{-1} J_i ] } ,   r_i(d) = v^{-1/2} V_i(d)^{-1/2} (y_i − X_i β) ,

then a robust estimate of d, say d̂_R, is obtained as the solution to

φ(d̂_R) = 0 .   (20)

Note that, for simplicity, we highlight only the functional dependence on d, i.e., V_i(d)^{-1/2}, instead of reporting all parameters, V_i(β, v, d)^{-1/2}. Among all available methods for finding the root of (20) (i.e., a root of a real-valued, continuous function in d), bisection is arguably the most reliable approach, but it is quite slow. The method of regula falsi has been found to converge at a rate faster than linear. However, it can go quite wrong in case the function is not approximately linear over the interval (Small and Wang, 2003, pp. 43-45).¹ The Newton-Raphson method is well known to converge at a quadratic rate within some neighborhood of the root, but it has the severe drawback of being very unreliable. In particular, the neighborhood of the root can be very small and (still more important) is not known beforehand. Divergence of the NR algorithm is a severe drawback and happens more frequently than many of the references admit (see also Jiang, Luan, and Wang, 2007). Chaubey and Venkateswarlu (2002), for instance, report convergence failure in up to 25.8 % of their Monte Carlo trials. Moreover, NR does not explicitly take into account that the problem at hand is constrained (i.e., d must be ≥ 0). Obviously, one may deploy a watchdog function which prevents d (through modifying the search direction and/or step length) from becoming negative. However, this intervention may impede superlinear convergence.
We propose to solve (20) by means of Brent's root-finding algorithm (Brent, 1973, chap. 4). The search for a root is constrained to the interval (0, a] (where a > 0 has to be chosen), which ensures the non-negativity of d̂ and σ̂_a^2 = σ̂_e^2 d̂. Brent's method combines the sureness of bisection with the speed of a higher-order method. It keeps track of whether a supposedly superlinear method is actually converging the way it is supposed to, and, if it is not, it intersperses bisection steps so as to guarantee at least linear convergence. This kind of super-strategy requires attention to bookkeeping detail, and also careful consideration of how roundoff errors can affect the guiding strategy (Press, Teukolsky, Vetterling, and Flannery, 1986, pp. 352-354). Hence, we use (a modification of) Brent's "zeroin" Fortran 77 code.
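We do not reproduce the zeroin code here, but the guiding idea (try a fast secant step, and fall back to bisection whenever the fast step leaves the bracket) can be sketched in a few lines of Python. This is a simplified stand-in of ours, not Brent's algorithm itself, which additionally uses inverse quadratic interpolation and careful roundoff safeguards:

```python
def safeguarded_root(f, lo, hi, tol=1e-10, maxit=200):
    """Bisection/secant hybrid in the spirit of Brent's zeroin: try a secant
    step; whenever it falls outside the current bracket, bisect instead.
    Requires f(lo) and f(hi) to have opposite signs (a bracketed root)."""
    flo, fhi = f(lo), f(hi)
    assert flo * fhi <= 0, "root must be bracketed"
    for _ in range(maxit):
        # secant candidate; guard against a zero denominator
        x = hi - fhi * (hi - lo) / (fhi - flo) if fhi != flo else 0.5 * (lo + hi)
        if not (lo < x < hi):
            x = 0.5 * (lo + hi)          # fall back to bisection
        fx = f(x)
        if abs(fx) < tol or hi - lo < tol:
            return x
        if flo * fx <= 0:                # keep the sign change inside [lo, hi]
            hi, fhi = x, fx
        else:
            lo, flo = x, fx
    return 0.5 * (lo + hi)
```

Constraining the search to an interval (ε, a] with ε near zero enforces d ≥ 0 by construction, without any watchdog intervention.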

Estimation Bounds
Given some initial values, β_0, v_0, and d_0, we may consider updating these estimates by solving equations (17), (19), and (20) in some sequential order right away. From a theoretical point of view, there is no objection to doing so. From the perspective of computation, however, it will prove useful to introduce two (pre-)estimation bounds for the variance components. As far as d is concerned, we have to consider two limiting situations: d = 0 and d → ∞. Accordingly, we obtain v_zero and v_∞, respectively. These two cases yield an upper and a lower bound for estimates of v. It is easy to prove that the following relations hold:

v̂_∞ ≤ v̂_ML ≤ v̂_zero ,   (21)

where v̂_ML denotes the ML estimator (see e.g., Demidenko, 2004, pp. 78-79). From the perspective of numerical optimization, these bounds are extremely useful since they determine the range of plausible values, which may guide the choice of initial values and indicate potential run-away values. Subsequently, we shall study robust estimators of v_zero and v_∞.

¹ Speed of convergence: Suppose the sequence {ϑ}^(s) converges to ϑ_0 (s = 1, 2, ...). In numerical analysis, the speed at which a convergent sequence approaches the limit is determined by the values c and p in ∥ϑ^(s+1) − ϑ_0∥ ≤ c ∥ϑ^(s) − ϑ_0∥^p. For 0 < c ≤ 1 and p = 1, we say that the algorithm converges linearly. Likewise, we call the convergence superlinear if p > 1 (given that such a c > 0 exists). Note that convergence of order p means that the number of correct decimal places is roughly multiplied by p at each iteration (see e.g., Small and Wang, 2003, chap. 3.1).
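The bracketing property is easy to verify numerically: the d → ∞ case corresponds to the fixed-effects (within-area-centered) regression, which nests the d = 0 (OLS) model, so its residual variance can never exceed the OLS residual variance. A small simulation sketch (toy data, seed, and names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
g, ni = 20, 15
area = np.repeat(np.arange(g), ni)
x = rng.normal(size=g * ni)
y = 1.0 + 0.5 * x + rng.normal(0, 0.7, g)[area] + rng.normal(0, 1.0, g * ni)
X = np.column_stack([np.ones(g * ni), x])

# d = 0: OLS residual variance (between-area variation ends up in the residuals)
b0 = np.linalg.lstsq(X, y, rcond=None)[0]
v_zero = np.mean((y - X @ b0) ** 2)

# d -> infinity: regress on area-mean-centered data (area effects treated as fixed)
def center(a):
    out = a.astype(float).copy()
    for i in range(g):
        out[area == i] -= a[area == i].mean(axis=0)
    return out

yc, Xc = center(y), center(x)[:, None]     # intercept drops out after centering
binf = np.linalg.lstsq(Xc, yc, rcond=None)[0]
v_inf = np.mean((yc - Xc @ binf) ** 2)
```

On any dataset, v_inf ≤ v_zero holds by construction, which is what makes the pair usable as a plausibility range for v.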

Case I: Robust Estimate of v_zero
In the first case, we have d = 0, which implies that Ω_i(v, d) = v(I_i + d J_i) reduces to v_zero I_i (i = 1, ..., g). As a consequence, the estimator of β collapses to the ordinary least squares (OLS) estimator, β̂_0, and the corresponding estimator v̂_zero of v_zero is an estimator of the residual variance. A robust estimate of v_zero, say v̂_R^zero, comes along with, e.g., least trimmed squares (LTS) regression (Rousseeuw, 1984) or an M- or S-estimator of regression (see e.g., Maronna et al., 2006, chap. 5). This robust regression exercise not only yields an estimate of v_zero but also provides us with a starting value for β in order to initialize the iterative algorithm (see below). From the perspective of computation, the fast LTS method of Rousseeuw and Van Driessen (2006) offers a good trade-off between robustness and computation time for sample sizes up to about 20,000 (this limit depends heavily on the number of auxiliary variables). For larger data, a regression S-estimator is considerably faster.

Case II: Robust Estimate of v_∞
In the second case, we let d → ∞ in V_i^{-1}(v, d) and obtain

lim_{d→∞} (I_i + d J_i)^{-1} = I_i − n_i^{-1} J_i ,   i = 1, ..., g.

Note that letting the variance ratio, d, approach infinity corresponds to treating the u_i, i = 1, ..., g, in (1) as if they were fixed effects. Indeed, we shall consider the fixed-effects model as an alternative to the mixed linear model. Now, let the model matrix be partitioned as [1_n, X], with the corresponding partition of the parameter vector, β = (α, γ^T)^T (cf. Assumption 2), so that x_ij now denotes the vector of non-constant auxiliary variables. The fixed-effects model writes

y_ij = α + x_ij^T γ + u_i + ε_ij ,   j = 1, ..., n_i ,   i = 1, ..., g ,   (22)

where the {u_i; i = 1, ..., g} are, in contrast to (2), unknown but fixed parameters (not realizations of the area-level random effects). Model (22) is traditionally called the 1-way classification model for the analysis of covariance (Searle, 1987, chap. 11.2).

If we put K = [1_n | Z], where Z = diag(1_1, ..., 1_g), then model (22) can be written as a general linear model, y = Xγ + Kδ + ε, where δ = (α, u^T)^T with u = (u_1, ..., u_g)^T, and ε = (ε_1^T, ..., ε_g^T)^T. The normal equations of the general linear model are

K^T K δ + K^T X γ = K^T y ,   X^T K δ + X^T X γ = X^T y .   (23)

On rewriting the first equation of (23), we obtain (denoting a generalized inverse by the superscript "−")

δ = (K^T K)^− K^T (y − X γ) ,   (24)

which can be substituted into (23) to yield

X^T P* X γ = X^T P* y ,   P* = I_n − K (K^T K)^− K^T ,   (25)

where P* is both symmetric and idempotent (Searle, 1971, pp. 341-42). Note that, although (K^T K)^− is not unique (since K^T K does not have full rank), it enters only in the form K (K^T K)^− K^T, which is invariant to whatever generalized inverse, G, is used. Thus, the non-full-rank property does not itself lead to manifold solutions of γ. However, for reasons of numerical stability, we will avoid computing a brute-force generalized inverse. Instead, we derive a very simple variant of G as follows. First, note that (24) is not invariant to the particular choice of G := (K^T K)^−. However, since any linear combination of δ, say λ^T δ, is estimable when λ^T = t^T K for some t, we may deliberately put one element of δ equal to zero and cross out the corresponding element in the normal equations (Searle, 1971, pp. 232-33). The obvious element to equate to zero in (24) is α. Thus, our generalized inverse shall be given by

G = [ 0   0^T ; 0   (Z^T Z)^{-1} ] ,   (26)

where the first row and column correspond to the element α that has been equated to zero. Substituting G into (24) yields δ̂ = (α̂, û^T)^T with α̂ = 0 (by assumption) and û = (Z^T Z)^{-1} Z^T (y − X γ̂). In addition, we replace P* in (25) by

P = I_n − Z (Z^T Z)^{-1} Z^T ,   (27)

because it is computationally much simpler than P*. Note that pre-multiplying a matrix by P corresponds to centering the particular matrix by its column-wise arithmetic means; in the present context, column-wise centering corresponds to centering by the area-specific means. By the symmetry and idempotency of P, we shall use

γ̂ = ( (PX)^T PX )^{-1} (PX)^T Py ,   (28)

instead of (25), where γ̂ is based on the centered data, PX and Py.
It is evident from (28) that the influence of outliers in y on the estimates is unbounded. Thus, we obtain robust estimates of γ, say γ̂_R, by means of M-estimation of regression, defined by the estimating equations

(PX)^T ψ_k( (Py − PX γ)/S ) = 0 ,   (29)

where S is the (normalized) median absolute deviation (MAD) of the non-null residuals of (29) about zero, i.e., S = 1.4826 · med_{ij}{ |r̃_ij| : r̃_ij ≠ 0 } with r̃ = Py − PX γ. The condition of taking only non-null residuals prevents underestimation of the scale, which becomes an issue if the number of auxiliary variables is relatively large (cf. Maronna et al., 2006, p. 100). By ψ_k(•) we denote the Huber ψ-function indexed by the tuning constant k, applied elementwise. Another approach could be to estimate regression and scale simultaneously.
It is fruitful to note that Py can be expressed as y − Zµ, where µ is the g-vector of area-specific means, (ȳ_1, ..., ȳ_i, ..., ȳ_g)^T with ȳ_i = (1/n_i) ∑_{j=1}^{n_i} y_ij, i = 1, ..., g. This representation indicates that the breakdown point of (29) may be much lower than that of a regular M-estimator of regression. This is a consequence of the centering procedure: centering the within-area units in a particular area i by the mean may turn (n_i − 1) ordinary observations into outliers (typically with a reversed sign) if the area contains one single heavy outlier. From the perspective of the breakdown point, a simple remedy is to center the data by the area-specific median instead of the mean. This corresponds to replacing Py in (29) by ȳ_med = y − Zη, where η is the g-vector of area-specific medians, (med_j(y_1j), ..., med_j(y_gj))^T. This approach resembles the "median polish" strategy, which has been proposed for the 2-way analysis of variance (Tukey, 1977). Now, in order to obtain the robust variance pre-estimation bound, v̂_R^∞, we have to solve (29) with Py replaced by ȳ_med and obtain v̂_R^∞ from

∑_{i=1}^{g} ∑_{j=1}^{n_i} ψ_k^2( r̃_ij / (v̂_R^∞)^{1/2} ) = n κ(k) ,   r̃ = ȳ_med − PX γ̂_R .   (30)

An alternative approach has been proposed by Birch and Myers (1982). They obtain M-estimates for γ and {u_i, i = 1, ..., g} by solving the system of EE

∑_{i=1}^{g} ∑_{j=1}^{n_i} x_ij ψ_k(r_ij/S) = 0 ,   ∑_{j=1}^{n_i} ψ_k(r_ij/S) = 0   (i = 1, ..., g) ,   (31, 32)

respectively, where r_ij = y_ij − u_i − x_ij^T γ and S is the normalized MAD of the residuals. In essence, this strategy consists of computing a relatively large number of M-estimates, which is rather time consuming and therefore not the optimal strategy for computing pre-estimation bounds.
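Median centering is a one-liner; the toy example below (data and function name are ours) shows how a single heavy outlier leaves the median-centered ordinary observations essentially untouched, whereas mean centering would shift every unit in the contaminated area:

```python
import numpy as np

def median_center(y, area):
    """Center y by the area-specific medians: ymed = y - Z @ eta, where eta
    holds the per-area medians. Unlike mean centering, one heavy outlier in
    an area does not drag the remaining (ordinary) observations with it."""
    eta = np.array([np.median(y[area == i]) for i in np.unique(area)])
    return y - eta[area], eta
```

In the example below, area 0 contains the outlier 100; its median is 2, so the two ordinary observations become −1 and 0, while mean centering (mean ≈ 34.3) would have turned both into large negative values.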

Algorithmic Details
Up to this point, we studied estimating equations and pre-estimation bounds. In this subsection we focus on implementation and algorithmic issues.

Computational Issues
Out of concern for computing speed, memory allocation, and floating-point arithmetic, we arranged all vector and matrix operations to make them rich in level-1 procedures, i.e., procedures which operate on vectors of size n and involve O(n) floating-point operations (cf. Golub and Van Loan, 1996). With respect to elementary operations, we rely on the procedures in BLAS (Blackford et al., 2002) and LAPACK (Anderson et al., 2000). Further, we avoid computing any brute-force matrix inverse and use closed-form expressions for the determinant and inverse of the matrix V_i, i = 1, ..., g (see, e.g., Searle et al., 1992, p. 79). In addition, we obtain a closed-form expression of V_i^{−1/2} as follows. Denote by L_i D_i L_i^T the eigenvalue decomposition of the (n_i × n_i) matrix V_i, where L_i is the (n_i × n_i) matrix whose columns correspond to the eigenvectors of V_i, and D_i is the (n_i × n_i) diagonal matrix of the eigenvalues λ_j (j = 1, ..., n_i). It is not difficult to see that V_i has only two distinct eigenvalues: the first eigenvalue is λ_1 = 1 + dn_i (with multiplicity one) and the remaining (n_i − 1) eigenvalues are equal to one. Now, we define a real-valued function f of the matrix V_i that corresponds to a function of a scalar as f(V_i) = L_i f(D_i) L_i^T (33), and obtain

V_i^{−1/2} = (1 + dn_i)^{−1/2} (1/n_i) 1 1^T + (I_{n_i} − (1/n_i) 1 1^T).  (34)

The formulas (33) and (34) simplify computation considerably. Alternatively, one may define V_i^{−1/2} := A_i, where the upper triangular matrix A_i, with positive diagonal elements, is obtained from the Cholesky decomposition. Sinha and Rao (2009), on the other hand, use U_i^{−1/2}, where U_i is a diagonal matrix with elements u_jj, j = 1, ..., n_i, equal to the diagonal elements of the covariance matrix of y_i (for all i = 1, ..., g).
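A closed-form square root of this kind is easy to check numerically. The following sketch (our own code, not the rsae implementation) assumes the intraclass structure V = I + d·1 1^T and verifies that the closed form squares and inverts correctly.

```python
import numpy as np

def v_inv_sqrt(n, d):
    """Closed-form V^{-1/2} for V = I_n + d * 1 1^T.

    V has eigenvalue 1 + d*n on the span of 1 (multiplicity 1)
    and eigenvalue 1 on its orthogonal complement, so
    V^{-1/2} = (1 + d*n)^{-1/2} * J/n + (I - J/n), with J = 1 1^T.
    """
    J = np.ones((n, n))
    return (1.0 + d * n) ** -0.5 * J / n + (np.eye(n) - J / n)

n, d = 6, 0.7
V = np.eye(n) + d * np.ones((n, n))
M = v_inv_sqrt(n, d)
# M @ M equals V^{-1}, so M @ M @ V recovers the identity
print(np.allclose(M @ M @ V, np.eye(n)))  # True
```

No eigendecomposition or Cholesky factorization is needed at run time, which is the point of formulas (33) and (34).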

Initialization
The choice of starting values is crucial in terms of speed and numerical stability of the algorithm. Extensive simulation showed that the method is best initialized with starting values based on γ̂_R and v̂_R∞, which are obtained from the pre-estimation-bound exercise in (29) and (30), respectively. For d_0, it is sufficient to choose a relatively large number (200 works for most applications).
The sketch of the algorithm comprises only the most important elements; the numerical tests of whether an updated value behaves well (e.g., lies within the pre-estimation bounds) have been omitted from the display. The final estimates are given by β̂_R and d̂_R ← d_{i+1}. By means of identity (8), we obtain [σ̂²_a]_R = [σ̂²_e]_R d̂_R.

Robust Prediction
Up to this point we have dealt with (robustly) estimating the model parameters β and θ. Now, we consider robustly predicting the random effects (and subsequently the small-area means). We assume that N_i ≫ n_i holds for all i = 1, ..., g; if this assumption does not hold, we may proceed as in Section 2.1. On rewriting (6) using (8), we obtain the area-level predicting equations. From this display it is apparent that replacing β, v, and â by robust estimates is not sufficient to robustly predict µ_i. As Sinha and Rao (2009) indicate, û_i has to be replaced by a robustly predicted random effect, û_R_i, as well. They therefore propose to solve Fellner's robust mixed-model equation (Fellner, 1986) for u_i. In order to solve (38), Sinha and Rao (2009) use a Newton-Raphson algorithm based on another first-order Taylor series expansion of (38) w.r.t. u_i. Consequently, computation is very involved. In particular, we may encounter here all the numerical difficulties associated with the NR method (as discussed above). However, one can obtain robust predictions far more easily (cf. Copt and Victoria-Feser, 2009). Let ψ_c(·) denote the Huber ψ-function indexed by the robustness tuning constant c, applied element-wise; then we may write the robust prediction as in (39), where κ = [−2cϕ(c) + 2Φ(c) − 1 + 2c²(1 − Φ(c))]^{−1/2}, with ϕ and Φ the pdf and cdf of the standard normal distribution, respectively. Note that κ is a kind of consistency-correction term, chosen such that û_R_i behaves similarly to ũ_i at the core model. In essence, we follow Heritier et al. (2009) and impose the (implicit) moment conditions discussed there (cf. Heritier et al., 2009, pp. 113-114). Thus we obtain the robust predictor of the area mean, μ̂_R_i (i = 1, ..., g), referred to as the robust EBLUP (REBLUP) of µ_i.
Some issues of the robust prediction method are worth commenting on. First, note that pre-multiplying the area-specific vector of residuals, r_i = y_i − X_i β, in (39) by V_i^{−1/2} from (34) will transmit the effect of even one single outlying residual, r_ij = y_ij − x_ij^T β, say, to all other within-area residuals. That is, on pre-multiplying r_i, the first term of (34) yields the mean of r_i (times a constant), which is non-robust per se. Thus, from the perspective of robustness (and with regard to breakdown-point considerations), the term V_i^{−1/2} r_i with r_i = y_i − X_i β in (39) should be replaced by a version centered at r̄med_i, the median of r_i; alternatively, by Assumption 4, it is sufficient to use the median-centered residuals. Second, the Huber ψ-function in (39) can be replaced by any other non-redescending, odd, bounded function. The restriction to non-redescending functions is crucial, since redescending ψ-functions lead, for sufficiently large residuals in (39), to realizations of û_i equal to zero (i.e., mimicking a synthetic estimator), which is not meaningful under model (2). Third, the choice of c can, in principle, be different from the choice of k in the BIEEs.
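The bracketed term in the definition of κ coincides with E[ψ_c(Z)²] for standard normal Z, which explains its role as a consistency correction: for very large c (no truncation) it reduces to one. A small numerical check (our own code, using scipy):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kappa(c):
    """kappa = [-2c*phi(c) + 2*Phi(c) - 1 + 2c^2 (1 - Phi(c))]^(-1/2)."""
    b = -2 * c * norm.pdf(c) + 2 * norm.cdf(c) - 1 + 2 * c**2 * (1 - norm.cdf(c))
    return b ** -0.5

def e_psi2(c):
    """E[psi_c(Z)^2] for Z ~ N(0, 1), by numerical quadrature."""
    f = lambda z: np.clip(z, -c, c) ** 2 * norm.pdf(z)
    return quad(f, -np.inf, np.inf)[0]

c = 1.345
print(abs(kappa(c) - e_psi2(c) ** -0.5) < 1e-6)  # True
print(abs(kappa(20.0) - 1.0) < 1e-6)             # True: no truncation, no correction
```

For the common choice c = 1.345 the correction κ is slightly above one, inflating the clipped residuals back to the scale they would have at the Gaussian core model.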

Mean Squared Error Estimation
Estimation of the mean squared prediction error (MSPE) is a very challenging problem. Given the complex nature of the REBLUP estimators and the lack of knowledge of the underlying distribution of the u_i and e_ij, Sinha and Rao (2009) noted that it is not possible to adopt the existing methods of MSPE estimation. We follow the proposal of Sinha and Rao (2009) and adopt a parametric bootstrap (see Lahiri, 2003, and Hall and Maiti, 2006, for more details on bootstrap estimates in SAE) based on the robust quantities β̂_R and θ̂_R. The method works as follows.
1. For the given β̂_R and θ̂_R = (v̂_R, â_R)^T, generate area-specific random effects u*_i from N(0, â_R) and random errors e*_ij from N(0, v̂_R). Then create a bootstrap sample from the model (42).
This method works remarkably well with respect to computing time and provides a reasonably precise estimate of the uncertainty associated with predicting the area-level means. However, the method tends to slightly underestimate the true MSPE. The underestimation results mainly from the fact that the uncertainty of estimating β has not been taken into account.
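The bootstrap MSPE machinery can be sketched as follows. This is a deliberately simplified toy in Python (not the rsae code): the refit uses a naive, non-robust plug-in predictor, all parameter values and names are our own, and β is held fixed, which is exactly the source of underestimation discussed above.

```python
import numpy as np

rng = np.random.default_rng(42)
g, n_i = 20, 5
beta_R, v_R, a_R = np.array([1.0, 1.0]), 1.0, 1.0  # "robust" plug-ins (toy values)
X = np.column_stack([np.ones(g * n_i), rng.normal(size=g * n_i)])
area = np.repeat(np.arange(g), n_i)

B = 200
sq_err = np.zeros((B, g))
for b in range(B):
    # Step 1: generate u*_i ~ N(0, a_R), e*_ij ~ N(0, v_R), build y*
    u_star = rng.normal(0.0, np.sqrt(a_R), size=g)
    e_star = rng.normal(0.0, np.sqrt(v_R), size=g * n_i)
    y_star = X @ beta_R + u_star[area] + e_star
    # Step 2: re-predict on y* (naive shrinkage plug-in, beta held fixed)
    shrink = a_R / (a_R + v_R / n_i)
    resid_bar = np.array([np.mean((y_star - X @ beta_R)[area == i]) for i in range(g)])
    u_hat = shrink * resid_bar
    mu_true = X @ beta_R + u_star[area]   # bootstrap "true" unit means
    mu_hat = X @ beta_R + u_hat[area]
    sq_err[b] = [np.mean((mu_hat - mu_true)[area == i] ** 2) for i in range(g)]

mspe_hat = sq_err.mean(axis=0)  # area-level bootstrap MSPE estimates
# for this toy, theory gives shrink * v_R / n_i (about 0.167) per area
```

In the actual method, step 2 would refit β̂_R[b], θ̂_R[b] and the robust predictions on each bootstrap sample, which is where the computational cost lies.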

Simulation
We implemented a small model-based simulation study to investigate the performance of the proposed method. The proposed algorithm is implemented in the R-package rsae (Schoch, 2011c); see R Development Core Team (2011) for more details on the R language and environment for statistical computing.
The data were generated from the model y_ij = β_0 + β_1 x_ij + u_i + e_ij, where x_ij ∼ N(0, 1). Each Monte Carlo sample comprises g = 20 areas and n = 5 within-area units (overall, N = 80 observations; balanced data). In line with Stahel and Welsh (1997), we allow for contamination (by means of a normal mixture, (1 − ε)·N(0, 1) + ε·N(0, γ), where γ can be chosen) in either or both of the random-effect distributions, producing four combinations: (0,0) no contamination, e_ij ∼ N(0, 1) and u_i ∼ N(0, 1); (e,0) e_ij ∼ (1 − ε)·N(0, 1) + ε·N(0, 41) and u_i ∼ N(0, 1); (0,u) e_ij ∼ N(0, 1) and u_i ∼ (1 − ε)·N(0, 1) + ε·N(0, 41); (e,u) both components contaminated. The simulation study in this paper serves primarily as a proof of concept; we therefore provide only a small number of contamination scenarios. The relative amount of contamination is ε = 0.05 for all simulations. Each scenario is evaluated by 1000 Monte Carlo (MC) simulation trials. The simulated MC distribution of an estimator θ̂ of ϑ is summarized by the average bias, B(θ̂), the mean squared error, MSE(θ̂), and the robust analogues based on medians, rB(θ̂) = med(θ̂ − ϑ*) and rMSE(θ̂) = med((θ̂ − ϑ*)²), where ϑ* denotes the true value. For simplicity (and by equivariance considerations), it is sufficient to set the true parameters equal to one, i.e., β* = (1, 1)^T, [σ²_e]* = 1, and [σ²_a]* = 1. We report only the simulation results for the variance components σ²_e and σ²_a (Tables 1 and 2). For the scenario of uncontaminated data, (0, 0), we also report the results of the maximum-likelihood method of the "lme" function (i.e., the gold-standard function) in the R-package "nlme" (Pinheiro, Bates, DebRoy, Sarkar, and R Core Team, 2009; denoted by "lme(ml)" in Tables 1 and 2). Note that the M-estimator (denoted by "huberm" in the tables) with robustness tuning constant k = 2000 mimics the ML estimator.
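The contamination mechanism and the median-based summary criteria can be sketched as follows (Python, with helper names of our own choosing; rB and rMSE follow the median analogues of bias and MSE described above):

```python
import numpy as np

def contaminated(rng, size, eps=0.05, var_out=41.0):
    """Draw from the mixture (1 - eps) * N(0, 1) + eps * N(0, var_out)."""
    out = rng.random(size) < eps
    z = rng.normal(size=size)
    return np.where(out, np.sqrt(var_out) * z, z)

def summaries(est, truth):
    """Average bias B, MSE, and the robust analogues rB, rMSE (medians)."""
    d = est - truth
    return d.mean(), np.mean(d ** 2), np.median(d), np.median(d ** 2)

rng = np.random.default_rng(7)
e = contaminated(rng, size=100_000)
# mixture variance should be close to (1 - 0.05)*1 + 0.05*41 = 3.0
print(np.var(e))
```

The mixture inflates the variance of the error distribution threefold while leaving the bulk of the observations standard normal, which is what makes the scenario challenging for ML but manageable for the M-estimator.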
The findings of the small simulation exercise can be summarized as follows.
• The "huberm" method converged in all 1000 Monte Carlo trials for each simulation configuration (i.e., contamination scenario and choice of k).This is in sharp contrast to the results reported by Richardson (1995) and Chaubey and Venkateswarlu (2002) (among others).The proposed algorithm may, on the other hand, fail to converge when the amount of contamination, ε, is larger than the breakdown point (which can be rather low in the case of unbalanced data; see Schoch (2011b)).
• The results of the "huberm" method mimicking the ML estimator are equal (up to the 6th or 7th decimal place) with those of the gold-standard method "lme".
• In the presence of contamination, the M-estimator has a smaller bias than the corresponding ML estimator. Even more importantly, the MSE is considerably smaller.
In contrast, the loss of efficiency of the M-estimator in the absence of contamination is almost negligible. These findings remain valid if one considers the robust criteria (rB and rMSE in Table 2). In the presence of contamination, rMSE tends to be smaller than MSE, which indicates that the MC distribution is skewed.
• Contamination of the model error, e_ij, affects the robust estimates of σ²_a very little, since the contamination affects the diagonal elements of the variance of y_ij but not the off-diagonal elements. Contamination of the area-specific random effects, u_i, affects both diagonal and off-diagonal elements of the variance (see also Welsh and Richardson, 1997, p. 348). When both components are contaminated, the effects on the estimates are the combination of the effects of contaminating the components one at a time (see also Stahel and Welsh, 1997, p. 315).
• In the simulation exercise, we focused on two choices of the tuning constant k. It goes without saying that one may obtain better estimates (i.e., better in terms of a reasonable risk/loss function) by trying different choices. Our experience supports the finding of Stahel and Welsh (1997, p. 315) that fine-tuning pays more in estimating these models than it does with simpler models. Nonetheless, the gains in efficiency are large.
• Computing the robust estimates based on data consisting of g = 500 areas, each of which has n = 20 units (i.e., 500 × 20 = 10000 observations), takes on average 1.3 seconds on an ordinary desktop computer (computing on a single core of an x86_64 processor, 2.83 GHz, 4 GB RAM; R v. 2.13.1; openSUSE Linux 11.4).

Conclusion
In this paper, we developed a robust method for the basic unit-level model which is based on M-estimators in mixed linear models (Welsh and Richardson, 1997) and is therefore conceptually equivalent to, but slightly different from, the proposal of Sinha and Rao (2009).
In contrast to Newton-Raphson- or Fisher-scoring-type algorithms, the proposed algorithm is numerically stable and fast (also for very large datasets). Notably, the estimates of the variance components are non-negative, in contrast to those obtained from, e.g., the NR method.

Moreover, the Monte Carlo simulation study showed that the method always converges (provided that the amount of contamination is lower than the breakdown point). Further, we derived a much simpler (and thus considerably faster) method for robustly predicting the small-area means than Sinha and Rao (2009) did. All the methods of this paper are implemented in R (R Development Core Team, 2011) in the R-package rsae; see Schoch (2011c).

2. Generate b = 1, ..., B bootstrap samples {y*[1], y*[2], ..., y*[B]} from the bootstrap population model (42). For each bootstrap sample y*[b] (b = 1, ..., B), compute the robust bootstrap estimates β̂_R[b] and θ̂_R[b], and robustly predict/estimate μ̂_R[b]_i.

Table 2: Robust bias and robust MSE estimates of the variance components for the four contamination scenarios