Estimating a Distribution Function at the Boundary

Estimation of distribution functions has many real-world applications. We study kernel estimation of a distribution function when the density function has compact support. We show that, for densities taking the value zero at the endpoints of the support, the kernel distribution estimator does not need boundary correction; otherwise, boundary correction is necessary. In this paper, we propose a boundary distribution kernel estimator which is free of the boundary problem and provides non-negative, non-decreasing distribution estimates between zero and one. Extensive simulation results show that the boundary distribution kernel estimator provides better distribution estimates than the existing boundary correction methods. For practical application of the proposed methods, a data-dependent method for choosing the bandwidth is also proposed.


Introduction
As an effect of global warming, the insurance industry is increasingly exposed to extreme events such as hurricanes, hail storms and tornadoes. Such events cause catastrophic losses. It is necessary to estimate the probability of such events, and the probability of the payout exceeding a certain amount (such as $1,000,000), in order for insurance companies to determine appropriate premiums. Denoting by X the amount of the payout from an accident, the quantity of interest is P(X > x), where x is a pre-specified amount of payout.
In this paper, we assume that X is a random variable from a population with density f and cumulative distribution function (CDF) F(x) = P(X ≤ x) (hence P(X > x) = 1 − F(x)). Further assume that the past data X_1, ..., X_n are available and that X_1, ..., X_n are independent and identically distributed (i.i.d.) random variables. A commonly used method to estimate F is the kernel method

F̂_n(x) = (1/n) Σ_{i=1}^{n} K((x − X_i)/h),  (1)

where h > 0 is the bandwidth, K is defined by

K(t) = ∫_{−∞}^{t} k(u) du,  (2)

and k is a kernel function satisfying

∫ k(u) du = 1,  ∫ u k(u) du = 0,  0 < µ_2(k) = ∫ u² k(u) du < ∞.  (3)

F̂_n(x) is called a kernel distribution estimator. In the literature of density estimation, a kernel satisfying (3) is called an order (0, 2) kernel, where "0" means that the purpose is to estimate the density function and "2" means that such a kernel yields bias of order O(h²); see Gasser and Müller (1979), Gasser, Müller, and Mammitzsch (1985), Müller (1991) and Zhang and Karunamuni (1998, 2000) for more references on this topic. Since the purpose of this paper is to discuss the estimation of the distribution function, to distinguish K from k we will call K defined by (2) a distribution kernel and k a density kernel.
Under standard smoothness conditions, the bias and variance of F̂_n(x) are

Bias(F̂_n(x)) = (1/2) h² µ_2(k) f′(x) + o(h²)  (4)

and

Var(F̂_n(x)) = F(x)(1 − F(x))/n − (h/n) ρ(k) f(x) + o(h/n),  (5)

where ρ(k) = 2 ∫ u k(u) K(u) du > 0. Combining (4) and (5), we have

MSE(F̂_n(x)) = F(x)(1 − F(x))/n − (h/n) ρ(k) f(x) + (h⁴/4) [µ_2(k)]² [f′(x)]² + o(h/n + h⁴)  (6)

and

IMSE(F̂_n) = ∫ MSE(F̂_n(x)) dx,  (7)

where MSE and IMSE are the abbreviations of mean squared error and integrated mean squared error.
It can be seen that the optimal bandwidths minimizing (6) and (7) are both of order O(n^{−1/3}) and have the form

h_L^opt(x) = [ρ(k) f(x) / ([µ_2(k)]² [f′(x)]²)]^{1/3} n^{−1/3}  (8)

and

h_G^opt = [ρ(k) / ([µ_2(k)]² ∫ [f′(x)]² dx)]^{1/3} n^{−1/3},  (9)

respectively.
The bandwidths h_L^opt(x) and h_G^opt are called the optimal local and optimal global bandwidths. With their respective optimal bandwidths, the MSE and IMSE of F̂_n(x) are

MSE(F̂_n(x)) = F(x)(1 − F(x))/n − (3/4) [ρ(k) f(x)]^{4/3} / ([µ_2(k)]² [f′(x)]²)^{1/3} n^{−4/3} + o(n^{−4/3})  (10)

and

IMSE(F̂_n) = (1/n) ∫ F(x)(1 − F(x)) dx − (3/4) [ρ(k)]^{4/3} / ([µ_2(k)]² ∫ [f′(x)]² dx)^{1/3} n^{−4/3} + o(n^{−4/3}).  (11)

Hence, the optimal rate of convergence of the MSE and IMSE is of order O(n^{−1}), the same as that of the empirical cumulative distribution function (CDF), since the last terms in (10) and (11) converge to zero at the rate O(n^{−4/3}). Note that the first terms on the right-hand side (RHS) of (10) and (11) are the MSE and IMSE of the empirical CDF. Further, since the second terms on the RHS of (10) and (11) are both negative, the kernel distribution estimator has smaller MSE and IMSE than the empirical CDF, which is the motivation for developing kernel distribution function estimators.
It is obvious that the minimization of (6) and (7) also concerns the choice of the kernel function. This problem was solved in Swanepoel (1988) and Jones (1990), where it is shown that the optimal density kernel k minimizing the IMSE (or equivalently, maximizing ρ(k)) is a uniform density. Without loss of generality, this can be taken to be the uniform density on [−1, 1], i.e.,

k(y) = 1/2 for −1 ≤ y ≤ 1, and 0 otherwise,  (12)

due to the invariance of the minimized IMSE to a change of scale. Consequently, the resulting optimal distribution kernel is

K(t) = 0 for t < −1,  (t + 1)/2 for −1 ≤ t ≤ 1,  and 1 for t > 1.  (13)
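To make the definitions above concrete, the following is a minimal numerical sketch (ours, not the paper's code) of the kernel distribution estimator (1) with the optimal uniform distribution kernel (13); the function names `kernel_cdf` and `K_uniform` are our own.

```python
import numpy as np

def K_uniform(t):
    """Distribution kernel (13): 0 for t < -1, (t + 1)/2 on [-1, 1], 1 for t > 1."""
    return np.clip((t + 1.0) / 2.0, 0.0, 1.0)

def kernel_cdf(x, data, h):
    """Kernel distribution estimator (1): F_n(x) = (1/n) * sum_i K((x - X_i)/h)."""
    x = np.asarray(x, dtype=float)
    # Broadcast over evaluation points (rows) and observations (last axis).
    t = (x[..., None] - data) / h
    return K_uniform(t).mean(axis=-1)

rng = np.random.default_rng(0)
data = rng.normal(size=500)          # toy data; any sample works
grid = np.linspace(-3.0, 3.0, 61)
est = kernel_cdf(grid, data, h=0.4)  # non-decreasing, between 0 and 1
```

Because K in (13) is itself non-decreasing and bounded by [0, 1], the resulting estimate is automatically a valid distribution function on the grid.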

The boundary problem of the kernel distribution estimator
The asymptotic results discussed in the previous section are obtained under the assumption that the support of the density is (−∞, ∞). In practice, the data are often obtained from a population whose probability density function has finite support. Examples of such data include truncated, survival or censored data, which often appear in financial and clinical studies. In such cases, the asymptotic results (4)-(7) no longer hold for points near the endpoints of the support. Hence, the kernel distribution estimator (1) may not provide appropriate estimates of the distribution function at such points. The purpose of this section is to provide a detailed study of the performance of the kernel distribution estimator at points near or at the endpoints of the support and to propose a method to correct the boundary problem.
For a density with support [a, b], one can show that, for x = a + ch with 0 ≤ c ≤ 1,

Bias(F̂_n(x)) = h f(a+) [∫_{−1}^{c} (c − y) k(y) dy − c] + O(h²),  (14)

and, for x = b − ch with 0 ≤ c ≤ 1,

Bias(F̂_n(x)) = −h f(b−) [∫_{−c}^{1} (c + y) k(y) dy − c] + O(h²).  (15)

The proofs of (14) and (15) are provided in the Appendix. When f(a+) ≠ 0 or f(b−) ≠ 0, (14) and (15) show that, for points in the intervals [a, a + h) and (b − h, b], the bias of the kernel distribution estimator F̂_n(x) converges to zero at the rate O(h), which is slower than the rate O(h²) observed in (4). This is the boundary problem of the kernel distribution estimator. Note that the boundary problem in kernel density estimation involves a non-consistency problem, in addition to the slower convergence of the bias (Gasser and Müller 1979; Gasser et al. 1985; Müller 1991; Zhang and Karunamuni 1998). The boundary problem in kernel distribution estimation is less severe than in kernel density estimation, due to the extra information F(a) = 0 and F(b) = 1. As in density estimation, the intervals [a, a + h) and (b − h, b] will be called the boundary regions, and the interval a + h ≤ x ≤ b − h will be called the interior region. However, if we know that f(a+) = 0 and f(b−) = 0, the first-order terms in (14) and (15) disappear and the bias converges to zero at the usual rate O(h²). Hence, the kernel distribution estimator F̂_n(x) is free of the boundary problem in such a case.
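As a quick numerical illustration (ours, not part of the paper's simulations) of the O(h) boundary bias: for Uniform(0, 1) data we have f(0+) = 1 > 0, and with the uniform kernel the first-order bias term of (14) at c = 0 works out to h f(0+)/4, so the plain estimator at x = 0 is biased upward by about h/4 even though F(0) = 0.

```python
import numpy as np

def K_uniform(t):
    # Distribution kernel (13): 0 below -1, (t + 1)/2 on [-1, 1], 1 above 1.
    return np.clip((t + 1.0) / 2.0, 0.0, 1.0)

def kernel_cdf(x, data, h):
    # Kernel distribution estimator (1) at a single point x.
    return K_uniform((x - data) / h).mean()

rng = np.random.default_rng(1)
h, n, reps = 0.2, 1000, 200
# Average the estimate at the left endpoint x = 0 over many samples;
# the true value is F(0) = 0, so this average is the bias itself.
bias_at_0 = np.mean([kernel_cdf(0.0, rng.uniform(size=n), h) for _ in range(reps)])
# First-order theory for the uniform kernel predicts bias ~ h * f(0+)/4 = 0.05.
```

The simulated bias sits near h/4 = 0.05 rather than shrinking at the interior O(h²) rate, which is exactly the boundary problem described above.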
Koláček and Karunamuni (2009) considered the boundary problem in distribution function estimation when estimating ROC curves, using the transformation method discussed in Zhang et al. (1999). Tenreiro (2013) proposed a boundary kernel method for correcting the boundary problem. However, Tenreiro (2013) did not point out that there is no boundary problem in distribution function estimation if the density takes the value zero at the endpoints of the support. In his method, the boundary kernel k_c is constructed by truncating a density kernel to [−c, c] and then normalizing it so that it integrates to 1 on [−c, c]. Since such a boundary kernel corrects the boundary problem by shrinking the bandwidth to zero as x → a+ or x → b−, the resulting distribution estimates may have high variability at such points.
The purpose of this paper is to develop a boundary distribution kernel method for correcting the boundary problem of F̂_n(x) which is continuous and non-decreasing and does not have the aforementioned high-variability problem of the estimator proposed in Tenreiro (2013).

The boundary distribution kernel estimator
It can be seen from (14) that the boundary problem can be removed if, at the boundary point x = a + ch (0 ≤ c ≤ 1), the density kernel function k satisfies

∫_{−1}^{c} (c − y) k(y) dy = c.  (16)

Since such a k depends on c, to distinguish it from the interior kernel k we will denote it by k_c. It can easily be seen that (16) is satisfied if k_c is a left-hand-side (LHS) boundary kernel in density estimation (we will call it a boundary density kernel), since a LHS boundary density kernel satisfies

∫_{−1}^{c} k_c(y) dy = 1  and  ∫_{−1}^{c} y k_c(y) dy = 0.

However, as will be seen in Section 3, the drawback of using the boundary density kernel is that it may produce distribution estimates which are negative or larger than 1, due to the negative part on its support. In the distribution estimation case, the derivation of (14) shows that the kernel is only required to satisfy (16) in order to solve the boundary problem of F̂_n(x) at the left boundary. Re-write (16) as

∫_{−1}^{c} ((c − y)/c) k_c(y) dy = 1.  (17)
Given an interior symmetric density kernel k satisfying (3), a kernel k_c satisfying (17) can easily be constructed as in (18). Similarly, a RHS boundary distribution kernel K*_c, or equivalently a RHS boundary density kernel k*_c satisfying (20), can be constructed as in (21). It is obvious that k*_c → k as c → 1. Hence, k*_c defined in (21) provides a natural continuation of the interior kernel to the right boundary region. The following result shows that the above-defined k_c(y) and k*_c(y) are always non-negative.
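Since the display (18) is not reproduced above, the following sketch checks numerically only the part of the construction that the proof of the Proposition makes explicit: on [−1, c] the boundary kernel is proportional to the interior kernel, k_c(y) = c k(y) / ∫_{−1}^{c} (c − s) k(s) ds, which by design satisfies the moment condition (17). The helper names are ours.

```python
import numpy as np

def trapezoid(fvals, x):
    # Simple trapezoidal rule (kept local to avoid NumPy-version differences).
    return float(np.sum((fvals[1:] + fvals[:-1]) / 2.0 * np.diff(x)))

def k_interior(y):
    # Uniform density kernel (12): 1/2 on [-1, 1], 0 elsewhere.
    return np.where(np.abs(y) <= 1.0, 0.5, 0.0)

def k_c(y, c):
    # Renormalized left-boundary kernel on [-1, c] (the part used in the proof).
    s = np.linspace(-1.0, c, 2001)
    norm = trapezoid((c - s) * k_interior(s), s)  # equals (c + 1)^2 / 4 here
    return c * k_interior(y) / norm

c = 0.3
y = np.linspace(-1.0, c, 2001)
# Condition (17): integral of ((c - y)/c) * k_c(y) over [-1, c] should be 1.
lhs = trapezoid(((c - y) / c) * k_c(y, c), y)
```

Dividing out the normalizing integral makes (17) hold identically in c, which is why no separate moment correction is needed at the left boundary.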
Proposition. With a symmetric density kernel k satisfying (3), the kernels k_c(y) and k*_c(y) defined in (18) and (21) are non-negative.
Proof. The proof of the proposition is straightforward. Taking (18) as an example, non-negativity is obvious for −1 ≤ y ≤ c. For the part defined on c < y ≤ 1, the non-negativity follows from the fact that

Examples of boundary distribution kernel
As mentioned at the end of Section 1, the optimal interior distribution kernel is the uniform kernel (12). Hence, it is natural to construct a boundary distribution kernel based on the uniform kernel. Plugging (12) into (18) and (21) gives the corresponding boundary density kernels in (22) and (23). Hence, the resulting boundary distribution kernels are K_c(t) in (24) for the left boundary and K*_c(t) in (25) for the right boundary, respectively.
With K_c(t) and K*_c(t) defined above, the boundary distribution kernel estimator F̂_B(x) is defined as in (26).

Theorem 1. Assume that the second derivative of F(x) (or the first derivative of f) exists on [a, b] and is continuous in a neighborhood of x. Then (27)-(30) hold.

The proof of Theorem 1 is deferred to the Appendix. Theorem 1 shows that the bias and variance of the boundary distribution kernel estimator F̂_B(x) converge to zero at the boundary points at the interior rates O(h²) and O(h/n), respectively. Hence, the boundary distribution kernel estimator is free of the boundary problem.
Combining (27)-(30) with the fact that the boundary distribution kernel estimator is the same as F̂_n(x) of (1) for interior points in the interval [a + h, b − h], we obtain the IMSE of F̂_B(x) given in (31). Unlike (7), minimization of (31) with respect to (w.r.t.) h does not yield an explicit solution and hence has to be carried out numerically.

Numerical results
In this section, we numerically compare the performances of the empirical CDF, the kernel distribution estimator F̂_n(x) with distribution kernel (13), the boundary distribution kernel estimator F̂_B(x) with boundary distribution kernels (24)-(25), and the Tenreiro estimator with distribution kernel (13).
As mentioned in the discussion around (16), boundary correction can also be achieved by directly using the density boundary kernels as k_c and k*_c, and their corresponding K_c and K*_c, in the boundary regions, with the risk that such estimators may produce distribution estimates that are negative or larger than 1. To see the extent of the negativity problem, in the comparison we also included the estimator which is basically (26) with k_c and k*_c given by the density boundary kernels (32) and (33), respectively. The resulting estimator is denoted by F̂*_B(x) and will be called the density boundary kernel distribution estimator. It is easy to see that the k_c and k*_c defined in (32) and (33) become the uniform kernel when c = 1.
It is obvious that the negativity problem of the density boundary kernel distribution estimator can easily be corrected by truncating the estimates at zero when they take a negative value, or at 1 when they take a value larger than 1. For completeness of the comparison, we also included the truncated version of F̂*_B(x) in the simulations.
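The truncation step described above is literally a clamp of the raw estimates to [0, 1]; a one-line sketch (our naming):

```python
import numpy as np

def truncate_cdf_estimates(est):
    """Truncated version of the density boundary kernel distribution estimator:
    negative estimates are set to 0 and estimates above 1 are set to 1."""
    return np.clip(est, 0.0, 1.0)

raw = np.array([-0.03, 0.10, 0.55, 1.02])  # hypothetical raw estimates
fixed = truncate_cdf_estimates(raw)
```

Note that clamping preserves monotonicity of a non-decreasing sequence of estimates, so the truncated version remains a valid distribution-type estimate.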

The performances of the kernel distribution estimators based on their respective optimal bandwidths
To account for different situations, we use the following four distributions in the simulations:
1. The truncated exponential distribution on [0, 1].
2. The truncated normal distribution on [0, 1].
3. The Beta(2, 2) distribution, with density f(x) = x(1 − x)/B(2, 2) = 6x(1 − x) on [0, 1], where B is the beta function. Note that this density function satisfies f(0) = f(1) = 0.
4. The Beta mixture distribution Σ_{i=1}^{k} w_i Beta(α_i, β_i) with k = 2, w_1 = 1/4, α_1 = 1, β_1 = 6; w_2 = 3/4, α_2 = 6, β_2 = 1.

In the simulations the sample size was chosen as n = 100. Tables 1-4 report the IMSE values of the six estimators. The IMSE values are calculated by integrating the MSE value of each estimator over [0, 1] using their optimal global bandwidths. The use of the optimal bandwidth is necessary in order to have a fair comparison, since the performance of a kernel estimator is greatly affected by the bandwidth used. The optimal global bandwidth for each estimator is obtained by numerically minimizing the IMSE of each estimator from 1,000 samples on a grid of possible h values in [0, 1/2]. The obtained optimal global bandwidth h_G^opt for each estimator is shown in Column 2 of the tables. The IMSE values (multiplied by 1,000 for easy comparison) corresponding to h_G^opt (calculated as the average of the integrated squared errors (ISE) from the 1,000 samples) are reported in Column 3 of the tables.
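For reproducibility, here is one way to sample from Distribution 1 above by inverting the CDF of an exponential distribution truncated to [0, 1]; this is our sketch, and since the paper does not state the rate parameter it used, rate = 1 is an assumption.

```python
import numpy as np

def rtrunc_exp(n, rng, rate=1.0):
    # Inverse-CDF sampling for Exp(rate) truncated to [0, 1]:
    # F(x) = (1 - exp(-rate * x)) / (1 - exp(-rate)),  0 <= x <= 1.
    u = rng.uniform(size=n)
    return -np.log(1.0 - u * (1.0 - np.exp(-rate))) / rate

rng = np.random.default_rng(3)
x = rtrunc_exp(5000, rng)  # with rate 1, E[X] = (1 - 2/e) / (1 - 1/e)
```

The same inverse-CDF device works for the truncated normal (Distribution 2) with the normal quantile function in place of the logarithm.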
It can be seen from Tables 1-4 that all the kernel distribution estimators have smaller IMSE than the empirical CDF. Comparing among the kernel distribution estimators, we see that the boundary distribution kernel estimator F̂_B(x) has the smallest IMSE values for all distributions except Distribution 3, followed by the Tenreiro estimator, the truncated density boundary kernel distribution estimator, the density boundary kernel distribution estimator, and the kernel distribution estimator. Note that for Distribution 3 (the Beta(2, 2)), the kernel distribution estimator F̂_n(x) is free of the boundary problem since the density satisfies f(a) = f(b) = 0. In this case, F̂_n(x) and the boundary distribution kernel estimator F̂_B(x) have similar IMSE values, while the density boundary kernel distribution estimator F̂*_B(x), its truncated version and the Tenreiro estimator have similar IMSE values which are smaller than those of F̂_n(x) and F̂_B(x).

The biggest advantage of the boundary distribution kernel estimator F̂_B(x) over the other methods is the quality of estimation at the boundary. To see this, for each method we also plotted 10 estimates of the CDF based on their respective optimal bandwidths. The results are shown in Figures 2-5. For clarity, we plotted the estimated CDFs of each method at the left and right boundaries separately. These figures clearly show the boundary problem of the kernel distribution estimator: the estimates are systematically biased, except in Figure 4 for Distribution 3. They also show that it is quite common for the density boundary kernel distribution estimator to produce estimates which are either negative or larger than 1. The effect of truncation of the density boundary kernel distribution estimator can be clearly seen in the figures, especially in Figure 4.
It can also be seen that, although the Tenreiro estimator does not have the boundary problem, its estimates at the boundary are rough (as mentioned in Section 2), not as smooth as those from the kernel distribution estimator and the boundary distribution kernel estimator. As a matter of fact, the Tenreiro estimator behaves very similarly to the empirical CDF in the boundary region, due to the fact that the bandwidth used in the boundary region (h = x − a) decreases to zero as x → a.
On the other hand, throughout all the plots we see that the boundary distribution kernel estimator provides smooth estimates of the distribution function in the boundary region while remaining free of the boundary problem.
Recall that we observed for Distribution 3 that the IMSE of the boundary distribution kernel estimator is higher than that of the truncated density boundary kernel distribution estimator and the Tenreiro estimator. Figure 4 reveals that this is caused by the fact that the latter estimators take the value zero at points near zero. The zero estimates are the result of truncation for the truncated density boundary kernel distribution estimator, and the result of the shrinking bandwidth (h = x − a) for the Tenreiro estimator. It is obvious that an estimator which takes the value zero in a neighborhood of x = a may have an advantage in terms of IMSE, since F(a) = 0. However, such an estimator does not provide useful information for points near the end of the support, since the true distribution function is not zero at these points.

Estimation of the optimal global bandwidth
In order to apply the proposed kernel distribution estimators in practical situations, we need to develop a method to estimate the optimal global bandwidth. A number of methods have been proposed in the literature to choose the bandwidth for the kernel distribution estimator in the infinite-support case. The simplest one is the reference distribution method, in which the bandwidth is estimated by replacing the true distribution with a reference distribution such as the normal distribution. Other methods for choosing the bandwidth include the "leave-one-out" method of Sarda (1993), the "plug-in" method of Altman and Leger (1994), and the cross-validation (CV) method of Bowman, Hall, and Prvan (1998).
In the following, we discuss how to use the CV method to select the bandwidth in kernel distribution estimation for densities with finite support. Although we have discussed both local and global bandwidths for kernel distribution estimation, we will only focus on the estimation of the optimal global bandwidth, due to the fact that a bandwidth selector developed from the global optimal bandwidth is more stable than one based on the local optimal bandwidth. We follow the idea of Bowman et al. (1998) in using the CV method to estimate the bandwidth for the kernel distribution estimators when the density has compact support [a, b]. Define

CV(h) = (1/n) Σ_{i=1}^{n} ∫_a^b [I(X_i ≤ x) − F̂_{−i}(x)]² dx,  (34)

where I(·) is the indicator function and F̂_{−i}(x) is a kernel distribution estimator (boundary corrected or not) computed without observation X_i. Then, following the same lines as those on pages 801-803 of Bowman et al. (1998), one can show that, ignoring an unknown constant term, CV(h) is an unbiased estimator of the true IMSE for sample size n − 1.
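A brute-force sketch of this selector (ours, using the uniform distribution kernel (13) and the plain estimator for the leave-one-out fit; any of the boundary-corrected estimators could be substituted): evaluate the CV criterion on a grid of h values and take the minimizer.

```python
import numpy as np

def K_uniform(t):
    # Distribution kernel (13).
    return np.clip((t + 1.0) / 2.0, 0.0, 1.0)

def cv_score(h, data, grid):
    """CV(h) = (1/n) * sum_i integral of [1{X_i <= x} - F_{-i}(x)]^2 dx,
    with F_{-i} the kernel distribution estimator computed without X_i."""
    n = len(data)
    dx = grid[1] - grid[0]
    Kmat = K_uniform((grid[:, None] - data[None, :]) / h)  # shape (m, n)
    tot = Kmat.sum(axis=1)                                 # sum over all obs
    score = 0.0
    for i in range(n):
        F_loo = (tot - Kmat[:, i]) / (n - 1)               # leave-one-out fit
        ind = (grid >= data[i]).astype(float)              # 1{X_i <= x}
        score += np.sum((ind - F_loo) ** 2) * dx           # Riemann sum
    return score / n

rng = np.random.default_rng(2)
data = rng.uniform(size=200)
grid = np.linspace(0.0, 1.0, 201)
hs = np.linspace(0.02, 0.5, 25)
h_cv = float(hs[np.argmin([cv_score(h, data, grid) for h in hs])])
```

The grid search over [0.02, 0.5] mirrors the paper's search interval [0, 1/2]; a very small lower endpoint is excluded only to keep the division by h well behaved.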
To evaluate the performance of the bandwidth selector CV(h), for each method and each sample we obtained ĥ_CV, the estimated optimal global bandwidth, by minimizing (34) on a grid of possible h values in [0, 1/2]. The mean and the 95% confidence interval of these bandwidths are reported in Column 4 of Tables 1-4. These results show that the CV method based on (34) provides satisfactory estimates of the optimal global bandwidth for the kernel distribution estimator. For the boundary distribution kernel estimator F̂_B(x) and the Tenreiro estimator, the bottom two rows of Table 2 show that the estimated optimal global bandwidths ĥ_CV are smaller than the optimal global bandwidths. This is caused by the fact that the true optimal bandwidths for these two estimators are too close to 0.50 (0.49 and 0.50, respectively), the right endpoint of the interval [0, 1/2] in which we search for the optimal bandwidth. We also conducted simulations (not reported here) for truncated normal densities with support [0, 2], and found that this problem disappeared for both methods. For the density boundary kernel distribution estimator, its truncated version and the Tenreiro estimator, Table 3 shows that the estimated optimal global bandwidths are significantly larger than the true optimal global bandwidths. This indicates the instability of (34) as a bandwidth selector for these three estimators.
In order to assess the performance of the estimators w.r.t. their respective estimated optimal bandwidths, we also calculated their average ISE values. The results (multiplied by 1,000) are reported in Column 5 of Tables 1-4. As expected, the values reported in Column 5 are greater than those in Column 3, due to the use of estimated optimal bandwidths. Column 5 also confirms the finding from Column 3 that the boundary distribution kernel estimator in general has the best performance among all the estimators discussed in this paper.

A real data example
To examine the performance of the different methods in real applications, we applied the methods discussed in this paper to the Massachusetts auto bodily injury liability data provided in Rempala and Derrig (2005). The data consist of outpatient medical providers' total billings on a sample of 348 auto bodily injury liability claims closed in Massachusetts during 2001. The data range from 0.045 to 50.00 in thousands of dollars. We randomly divided the data into a training sample (50% of the data), used to fit the kernel estimators, and a test sample (the remaining 50% of the data). The histogram in Figure 6 shows that the majority of claims are small claims (less than 5 thousand dollars), indicating that the value of the density function may be larger than zero at zero, the left endpoint of the support. Hence, boundary correction around zero is necessary. In our analysis, we first obtained the bandwidths from the training sample using the CV method (34); they are 0.40 for the kernel distribution estimator, 0.48 for the density boundary kernel distribution estimator, the truncated density boundary kernel distribution estimator and the Tenreiro estimator, and 0.49 for the boundary distribution kernel estimator, respectively.

Figure 6: Histogram of claim amounts.

In Figure 7, we plotted the estimated exceedance probabilities of each method for claim amounts 0 ≤ x ≤ 0.5 in the test sample; these probabilities were compared with the proportions of claims exceeding x in the test sample (the solid line). The exceedance probability can be useful in premium calculation in the insurance industry. Figure 7 clearly shows the boundary problem of the kernel distribution estimator (the short dashed line): it underestimates the exceedance probability for claim amounts less than $200. The density boundary kernel distribution estimator and its truncated version (the dot-dash line) coincide (i.e., no truncation was needed). Since the value of these two estimators at x = 0 is less than 1, they both underestimate the exceedance probability for small claims.
On the other hand, both the Tenreiro estimator (the long dashed line) and the boundary distribution kernel estimator (the dotted line) take the value 1 at x = 0. It can also be seen that the Tenreiro estimator is less stable in the boundary region than the boundary distribution kernel estimator. The flat region of the Tenreiro estimator (where its estimated exceedance probability equals 1) for points near x = 0 does not provide useful information for insurance claims falling into this region.

Conclusion and discussion
Estimation of the distribution function has found numerous applications in econometrics, climatology and hydraulics, among other fields. One of the major methods for estimating the distribution function is the kernel method. In this paper, we have provided an in-depth study of the boundary problem of the kernel distribution estimator when the density function has compact support. In order to eliminate the boundary problem of the kernel distribution estimator, we have proposed a boundary distribution kernel estimator. We have shown that the boundary distribution kernel estimator uses the available information more efficiently for estimating the CDF at the boundary point x = a + ch (or x = b − ch) by using the data in the interval [a, a + h) for the left boundary region (or (b − h, b] for the right boundary region), compared with [a, x + ch] (or [b − ch, b]) for the Tenreiro estimator. Numerical comparisons also show that the boundary distribution kernel estimator, together with the optimal global bandwidth based on it, in general has the best performance in the boundary region among all the estimators considered in this paper.
Plugging these into (A3), we obtain (15). This completes the proof of (15).
Proof of Theorem 1. We will only prove (i), (ii) and (28); (27) and (29) are direct results of (14) and (15), and (30) can be proved along the same lines as those used in proving (28). Note that for x = a, (x − X_i)/h < 0. The construction of K_c(t) (see the discussion around (18)) then yields the claim. To show that F̂_B(x) is non-decreasing, it is enough to show that it is non-decreasing in the boundary regions. We will only prove this for the left boundary region.
For any a ≤ x_1 ≤ x_2 ≤ a + h, note that (A4) holds, where c_j = (x_j − X_i)/h, j = 1, 2. (A4) shows that it is enough to establish (A5). Without loss of generality, it is enough to show the last inequality in (A5), since k_{(x_2 − X_i)/h}(y) ≥ 0. The last inequality in (A5) shows that the proof is complete if we can show that k_{(x_2 − X_i)/h}(y) − k_{(x_1 − X_i)/h}(y) ≥ 0 for −1 ≤ y ≤ (x_1 − X_i)/h, i.e., if we can show that k_c(y) is non-decreasing as a function of c, for each −1 ≤ y ≤ c. From (18), we see that we only need to show that c k(y)/∫_{−1}^{c} (c − s) k(s) ds is non-decreasing as a function of c, for −1 ≤ y ≤ c, which directly follows from the fact that

To show (28), first note that

Similar to the proof of the bias term, we can show that