Fast Approximate Complete-data k-nearest-neighbor Estimation

We introduce a fast method to estimate the complete-data set of k-nearest-neighbors. This is equivalent to finding an estimate of the k-nearest-neighbor graph of the data. The method relies on random normal projections: the k-nearest-neighbors are estimated by sorting the projections of the points onto a number of random lines. For very large datasets, the method is quasi-linear in the data size. As an application, we show that the intrinsic dimension of a manifold can be reliably estimated from the estimated set of k-nearest-neighbors in about two orders of magnitude less time than when using the exact set of k-nearest-neighbors.


Introduction
Let D = {x_1, . . . , x_n} ⊂ R^p be a dataset, and let X ∈ R^p be a point of interest. In this work we are concerned with the computation of the k-nearest-neighbors of X in D. Let ‖·‖ denote the Euclidean norm in R^p. Suppose that instead of computing all distances ‖X − x_i‖, i = 1, . . . , n, to obtain the exact k-nearest-neighbors of X, say k-NN(X), we are willing to sacrifice some precision and obtain an estimate of the set k-NN(X). The goal is to reduce the number of computations needed to find k-NN(X), provided that our estimate \widehat{k-NN}(X) is sufficiently close to k-NN(X). Our method consists of projecting the data onto a number, say m, of random directions, so that we can recover k-NN(X) by searching for the k-nearest-neighbors in m one-dimensional subspaces. To measure closeness, we adopt the notion of ε-k-nearest-neighbors.

Definition 1.1 (ε-k-nearest-neighbor). We say that a point y is an ε-k-nearest-neighbor of X if ‖y − X‖ ≤ (1 + ε) ‖x_(k) − X‖, where x_(k) is the k-th nearest-neighbor of X.
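As a small illustration (our own sketch, not part of the original method; the function name is ours), Definition 1.1 can be checked directly in Python once the exact distances are available:

import numpy as np

def is_eps_knn(y, X, D, k, eps):
    """Check Definition 1.1: y is an eps-k-nearest-neighbor of X if
    ||y - X|| <= (1 + eps) * ||x_(k) - X||, where x_(k) is the exact
    k-th nearest neighbor of X in the dataset D (an n x p array)."""
    dists = np.linalg.norm(D - X, axis=1)  # distances from X to all points
    r_k = np.sort(dists)[k - 1]            # distance to the k-th nearest neighbor
    return np.linalg.norm(y - X) <= (1.0 + eps) * r_k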
Different methods to estimate the ε-k-nearest-neighbors are available. Locality-sensitive hashing (Indyk and Motwani 1998) is one of the earliest, and still highly popular, methods suggested in the literature. In this method, points are preprocessed and referenced by a hash table, so that a query point finds its approximate nearest-neighbor by querying the hash table. A query takes order O(pn^{1/ε}) operations. The method was refined by Gionis, Indyk, and Motwani (1999), who improved the query time to O(pn^{1/(1+ε)}) operations. Shakhnarovich, Darrell, and Indyk (2006) introduced randomness in the method by estimating the hash function through random projections. Recently, Paulevé, Jégou, and Amsaleg (2010) used k-means to develop a highly competitive locality-sensitive hashing method.
All these methods are designed for the purpose of performing a query with a new, external point. In this work, however, we are concerned with the task of finding the k-nearest-neighbors of each point in D. For example, we are interested in applying our methodology to construct the k-nearest-neighbor graph, a common technique to build sparse graphs. This graph is relevant when performing nonparametric regression (Altman 1992), intrinsic dimension estimation (Farahmand, Szepesvari, and Audibert 2007), dimension reduction (Bingham and Heikki 2001), outlier detection (Hautamäki, Kärkkäinen, and Fränti 2004), nonparametric clustering (Fränti, Virmajoki, and Hautamäki 2003) such as hierarchical clustering or superparamagnetic (Potts model) clustering (Murua, Stanberry, and Stuetzle 2008; Murua and Wicker 2014), minimal spanning tree computation and density estimation (Stuetzle 2003), semi-parametric Bayesian regression with autologistic distributions (Murua and Quintana 2017), and Bayesian inference on complex data lying on a graph (Chekouo, Murua, and Raffelsberger 2015). It is clear that in these applications nearest neighbors are sought for every point of a dataset of size n, so applying the aforementioned query-based methods would lead to a procedure whose running time is nearly quadratic in n.
The paper is organized as follows. Section 2 gives the basic definition and principles to search for k-NN using univariate projections. Section 2.3 introduces the main ideas and justifications for the method implemented in this work. An application to intrinsic dimension estimation is shown in Section 3. Section 4 contains some final thoughts about the procedure introduced in this work.

Method
Note that finding the nearest-neighbors in one dimension is fairly trivial and fast if the data are already sorted, which can be done in O(n log n) operations. The basic idea is to search for the k-nearest-neighbors of the projection of X in each of the m subspaces. Let us denote by {x_{j(1)}, . . . , x_{j(k)}} the k-nearest-neighbor points of X in the j-th one-dimensional subspace, j = 1, . . . , m. The search will produce at most k × m different points that are possible nearest-neighbors of X. The key idea is to search for the k-nearest-neighbors of X among these points instead of among the entire dataset D. This yields our estimated set \widehat{k-NN}(X). Note that this task requires the computation of the true distances between X and these candidate points in R^p, which requires at most O(kmp) operations. The method will be effective only if (1) the estimated set \widehat{k-NN}(X) is reasonably close to the true set k-NN(X), and (2) km is much smaller than n. In Section 2.1 we show that it is possible to accurately estimate the set k-NN(x) for every x in a given dataset by just looking at neighbors in one-dimensional projections; a code sketch of this basic scheme follows below. However, in order to make these ideas efficient, that is, to achieve a low computational cost, we show in Section 2.3 that it is more effective to restrict the search for k-NN(x) to a sufficiently small ball centered at x and of constant radius (that is, a radius independent of x). The method we have implemented, sketched in Algorithm 1, is based on this latter strategy.
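The following minimal sketch (our own illustration in Python/NumPy; the function name is ours) implements this candidate-union idea for a single query point: project onto m random normal directions, collect the k nearest points along each line, and re-rank the at most km candidates by their true distances.

import numpy as np

def approx_knn_query(X, D, k=10, m=20, rng=None):
    """Approximate k-NN of the query X in the dataset D (n x p) using
    m random one-dimensional projections, as described above."""
    rng = np.random.default_rng(rng)
    n, p = D.shape
    A = rng.standard_normal((p, m))   # m random normal directions
    W = D @ A                         # projected data, n x m
    w = X @ A                         # projected query, length m
    candidates = set()
    for j in range(m):
        # k nearest points to the query along the j-th line
        idx = np.argpartition(np.abs(W[:, j] - w[j]), k)[:k]
        candidates.update(idx.tolist())
    cand = np.fromiter(candidates, dtype=int)
    # re-rank the at most k*m candidates by true p-dimensional distances
    d = np.linalg.norm(D[cand] - X, axis=1)
    return cand[np.argsort(d)[:k]]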

Preliminary results
Suppose that m projections a 1 , . . . , a m are chosen at random from a Normal distribution N(0, I p ).
Suppose, relabeling the points if necessary, that x_1 is the first nearest-neighbor of X, that is, ‖X − x_1‖ ≤ ‖X − x_i‖ for all i ≥ 2. For i ≥ 2 and j = 1, . . . , m, let Y_ij = 1 if |a_j^T(X − x_1)| < |a_j^T(X − x_i)|, and let Y_ij = 0 otherwise. In order to correctly find the first-NN of X using only one projection, we would need to have all Y_ij = 1, i ≥ 2. Let θ_ij = Prob(Y_ij = 1), and let S_im = Σ_{j=1}^m Y_ij. To correctly find the first-NN of X using m projections (a majority vote over the projections), we need S_im > m/2 for all i ≥ 2. That is, we would like to have, with very large probability, say 1 − ε for some small value of ε,

Prob(S_im > m/2, for all i ≥ 2) ≥ 1 − ε.

In general, we also would like to correctly find k-NN(X). Let the event E_{m(1)} = {the first-NN of X is correctly identified using m random projections}, and define the similar events E_{m(ℓ)} = {the ℓ-th nearest-neighbor of X is correctly identified using m random projections}, ℓ = 1, 2, . . . , k.
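As a quick empirical check (our own illustration; the setup is hypothetical), θ_ij = Prob(Y_ij = 1) can be estimated by Monte Carlo for a fixed triple of points: it is the probability that a random normal projection preserves the ordering of the two distances.

import numpy as np

def estimate_theta(X, x1, xi, n_proj=100_000, rng=None):
    """Monte Carlo estimate of theta_ij = Prob(Y_ij = 1), the probability
    that a random normal projection a satisfies
    |a^T (X - x1)| < |a^T (X - xi)|."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((n_proj, X.shape[0]))
    return np.mean(np.abs(A @ (X - x1)) < np.abs(A @ (X - xi)))

# Example: theta grows with the ratio ||X - xi|| / ||X - x1||.
rng = np.random.default_rng(0)
X = rng.standard_normal(50)
u = rng.standard_normal(50); u /= np.linalg.norm(u)
v = rng.standard_normal(50); v /= np.linalg.norm(v)
print(estimate_theta(X, X + 1.0 * u, X + 2.0 * v))  # clearly above 1/2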
The following result says that we can recover k-NN(X) with high accuracy provided that the number of projections is large enough and that the points are well separated.

Theorem 2.1. Given ε > 0, we have Prob(∩_{ℓ=1}^k E_{m(ℓ)}) ≥ 1 − ε, provided that the number of random projections m is sufficiently large, of order δ^{−2}(log k + log(1/ε)) (see Theorem 2.3 below for the precise bound in the n-point case). This result is a special case of a more general result shown in Section 2.2. Its proof follows steps very similar to those of the proof of Theorem 2.3 below.
Algorithm 1: Method for finding the estimated k-nearest neighbors

Require: m, D = {x_1, . . . , x_n}
(Preprocessing)
compute x̄, the center of x_1, . . . , x_n
for all j ∈ 1, . . . , m do
    select a Gaussian random direction a_j ∼ N(0, I_p)
    for all i ∈ 1, . . . , n do
        project each point x_i − x̄ on the space generated by a_j, producing coordinates w_ij
    end for
end for

Require: the point of interest X
(Finding the k-nearest-neighbors of X)
for all j ∈ 1, . . . , m do
    project the point X − x̄ on the space generated by a_j, producing coordinate W_j
    compute the k-nearest-neighbors of W_j in the set {w_ij}_{i=1}^n
    map the k-NN of W_j to the original points, say {x_{j(1)}, . . . , x_{j(k)}} ⊂ D
end for
the estimated k-nearest-neighbors of X are computed from the subset {x_{j(1)}, . . . , x_{j(k)} : j = 1, . . . , m}

The proofs of Theorems 2.1 and 2.3 use the following results.

Lemma 2.2. Let δ > 0, and let x, y be two arbitrary but different vectors in the sphere S^{p−1} of R^p, with p large. Let a be a p-variate standard normal random vector, i.e., a ∼ N_p(0, I_p). Then |Cov(a^T x, a^T y)| = |x^T y| < δ, with probability larger than 1 − 4 exp(−pδ²/2). That is, in very high dimensions, the variables a^T x and a^T y are nearly independent.
Proof. Let x^⊥ be the orthogonal complement of the vector x. Consider the equator E = x^⊥ ∩ S^{p−1} and the set near the equator, A_δ = {w ∈ S^{p−1} : min_{z∈E} ‖w − z‖ < δ}. Let H^+ and H^− be the northern and southern hemispheres of the sphere with equator E. Let λ(·) be the probability measure induced by the Lebesgue measure on the sphere S^{p−1}. According to the measure-concentration theorem for the sphere (Matoušek 2002, Section 14.1, pp. 330-332), λ(A_δ) ≥ 1 − 4 exp(−pδ²/2). In words, this lemma says that for very large p, most of the points lie near the equator. Therefore, y most likely lies in A_δ, and consequently the angle α between the vectors x and y is nearly a right angle. This implies cos(α) = x^T y ≈ 0.

Proposition 1. Let δ > 0, and let x, y, z ∈ R^p, with p large. Suppose that ‖x − y‖ < ‖x − z‖. Let a be a p-variate standard normal random vector, i.e., a is a random projection. Given ε > 0, the probability θ that the projection preserves this ordering (up to the factor √(1+ε) tolerated by the ε-k-nearest-neighbor relaxation) satisfies θ > 1/2, with probability larger than 1 − 4 exp(−pδ²/2). In fact, θ increases very fast with the ratio ‖x − z‖²/‖x − y‖². See Figure 1 below.
Proof. Let u = a^T(x − y) and v = a^T(x − z), normalized to unit variances, and let ρ = Corr(u, v). In the Appendix, we show that w = u/v is distributed as a Cauchy random variable, so that the probability of interest can be written in terms of the arctangent of the distance ratio r(x, y, z) = ‖x − z‖/‖x − y‖. It is straightforward to show that, for positive values of w, the relevant function g(w) is bounded; hence, for small values of ρ, g(w) ≤ 1/√3. Therefore, for small values of ρ, the probability differs from its value at ρ = 0 by at most a term of order 4ρ²√(1 − ρ²)/√3. By Lemma 2.2, on the event A_δ we have ρ ≤ δ with probability larger than 1 − 4 exp(−pδ²/2), and on this event 4ρ²√(1 − ρ²)/√3 is of order O(δ²). Now, using the facts that r(x, y, z)√(1 + ε) > 1, that the function arctan(w) is strictly monotone, and that 2 arctan(1) = π/2, we obtain the desired result for small enough values of δ > 0.
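For reference, the distribution used in the proof is the classical one for the ratio of two correlated standard normal variables (a standard fact, restated here in LaTeX with ρ = Corr(u, v); the Appendix derives the version used in the paper):

% If (u, v) is bivariate normal with zero means, unit variances and
% correlation rho, then w = u/v is Cauchy with location rho and
% scale sqrt(1 - rho^2):
\[
  f(w) = \frac{\sqrt{1-\rho^{2}}}{\pi\left[(w-\rho)^{2} + (1-\rho^{2})\right]},
  \qquad w \in \mathbb{R},
\]
% so that, for rho = 0, Prob(|w| <= 1) = 2 arctan(1)/pi = 1/2.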

A bound for the number of projections
Suppose that we have n points, x_1, . . . , x_n. Let a_1, . . . , a_m ∈ R^p be m random normal projections. We would like to correctly find k-NN(x_i) for all i = 1, . . . , n. For simplicity, let us assume that θ_{ij(ℓ)} = θ for all i, j, ℓ. Let the event E_{im(1)} = {1-NN(x_i) is correctly identified using m random projections}, and define the similar events E_{im(ℓ)} = {the ℓ-th nearest-neighbor of x_i is correctly identified using m random projections}, ℓ = 1, 2, . . . , k, and i = 1, . . . , n. By Proposition 1, we may write θ = 1/2 + δ for some δ ∈ (0, 1/2]. The following result guarantees that, for a sufficiently large number of projections, our algorithm solves the ε-k-nearest-neighbor problem.

Theorem 2.3. Given ε > 0, we have Prob(∩_{i=1}^n ∩_{ℓ=1}^k E_{im(ℓ)}) ≥ 1 − ε, provided that the number of random projections m satisfies

m ≥ (1/2) δ^{−2} O( 2 log(n/ε) + log(k/4) ).    (3)
Proof. First consider the case of the first-nearest-neighbor. Let b(θ) = (θ − 1/2)/√(θ(1 − θ)). Using the normal approximation to the Binomial distribution (note that S_m may be approximated by a Binomial(θ, m)), we need

Prob(S_im ≤ m/2 for some i ≥ 2) ≤ Σ_{i=2}^n Prob(Z_i ≤ −√m b(θ)) ≈ n Φ(−√m b(θ)) ≤ ε,

where Z_i denotes a standardized Binomial(θ, m) random variable, and Φ is the distribution function of the standard normal. Now consider the general case. Using the Bonferroni inequality as above, we have

Prob(∪_{i=1}^n ∪_{ℓ=1}^k Ē_{im(ℓ)}) ≤ Σ_{i=1}^n Σ_{ℓ=1}^k Prob(Ē_{im(ℓ)}),

where Ē_{im(ℓ)} denotes the complementary event of E_{im(ℓ)}, i = 1, . . . , n and ℓ = 1, . . . , k. Assuming that k << n, each one of the terms in this sum may be approximated by n × Φ(−√m b(θ)). Therefore, we may ask that kn² × Φ(−√m b(θ)) ≤ ε, which implies √m b(θ) ≥ −√2 erf^{−1}(2ε/(kn²) − 1), where erf(x) = (2/√π) ∫_0^x e^{−t²} dt. In particular, one may use the approximation of Winitzki (2003, 2008) with constant a = 0.147. For very small z ∈ (−1, 0), one can easily show that this approximation gives erf^{−1}(z) ≈ −√|log(1 − z²)|. This is the case of interest for us, since we need to evaluate this function at z = 2ε/(kn²) − 1. Note as well that log(1 − z²) = log( (4ε/(kn²))(1 − ε/(kn²)) ) ≈ −log(kn²/(4ε)). Using these approximations, we obtain m b(θ)² = O( log(kn²/(4ε)) ), and hence (3).
Note that even though we just need m of order O(log(n/ε)), the constant depending on θ may dominate. For example, for θ = 1/2 + δ, the term multiplying O(log(n/ε)) is δ^{−2}, so in practice we may need a large m = O(δ^{−2} log(n/ε)). Figure 2 displays the number of necessary projections as a function of the data size and the probability θ.
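The bound can be evaluated numerically. The sketch below (our own illustration; it uses the exact normal quantile rather than the Winitzki approximation) computes the smallest m with kn²Φ(−√m b(θ)) ≤ ε, following the proof of Theorem 2.3.

from scipy.stats import norm

def required_projections(n, k, theta, eps):
    """Smallest m such that k * n**2 * Phi(-sqrt(m) * b(theta)) <= eps,
    with b(theta) = (theta - 1/2) / sqrt(theta * (1 - theta))."""
    b = (theta - 0.5) / (theta * (1.0 - theta)) ** 0.5
    # Phi(-sqrt(m) b) <= eps/(k n^2)  <=>  sqrt(m) b >= -Phi^{-1}(eps/(k n^2))
    z = -norm.ppf(eps / (k * n**2))
    return int((z / b) ** 2) + 1

# Example: n = 1e5 points, k = 10 neighbors, theta = 0.8, eps = 0.05.
print(required_projections(100_000, 10, 0.8, 0.05))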

k-NN enclosed in a ball
Note that the computations above, based on quantities such as S_m, involve O(n) one-dimensional comparisons for each x_i. Although this is much faster than O(n) p-dimensional comparisons for each x_i when p is very large, it would be much faster, and use far less memory, if the computations needed to find the k-NN of the points were reduced to a few comparisons. This is the goal of this section. For example, for each point x_i and on each projection axis a_j, it may be enough to just look at the 2k univariate symmetric nearest-neighbors (sNN, for short) of a_j^T x_i. The idea is to compute the estimated k-NN(x_i) from its m sets of 2k-sNN. With this method, the estimated k-NN of the whole set of n points can be computed in order O(nm[log(n) + k log(km) + kp]) operations, as follows (a code sketch is given after the list): (i) compute the m projections in O(nm) operations; (ii) sort the projected data on each line (projection) in O(mn log n) operations; (iii) take the 2k-sNN of each x_i on each projection; this is done very fast since the projected data are already sorted, and takes O(mnk) operations; (iv) compute all real distances between x_i and its m sets of 2k-sNN, which takes O(nmkp) operations; (v) for each x_i, find its estimated k-NN among its m sets of 2k-sNN, which is done in O(nmk log(mk)) operations.
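A compact sketch of steps (i)-(v) (our own illustration, not reference code) is given below; the 2k symmetric nearest-neighbors of a point on a sorted line are simply the 2k entries adjacent to it in sorted order.

import numpy as np

def approx_knn_all(D, k=10, m=20, rng=None):
    """Approximate k-NN for every point of D (n x p), using the 2k
    symmetric nearest-neighbors (sNN) on each of m sorted projections."""
    rng = np.random.default_rng(rng)
    n, p = D.shape
    W = D @ rng.standard_normal((p, m))  # (i) m projections, O(nm)
    order = np.argsort(W, axis=0)        # (ii) sort each line, O(mn log n)
    rank = np.argsort(order, axis=0)     # position of each point on each line
    neighbors = np.empty((n, k), dtype=int)
    for i in range(n):
        cand = set()
        for j in range(m):               # (iii) 2k-sNN on each sorted line
            r = rank[i, j]
            lo, hi = max(0, r - k), min(n, r + k + 1)
            cand.update(order[lo:hi, j].tolist())
        cand.discard(i)
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(D[cand] - D[i], axis=1)  # (iv) true distances
        neighbors[i] = cand[np.argsort(d)[:k]]      # (v) best k candidates
    return neighbors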
Let a_1, . . . , a_m be m random projections drawn from a Normal distribution N(0, I_p). An alternative way of realizing the same idea is to replace the 2k-sNN by the subset of points that fall close to the target point. Let S_i(z) = {x ∈ D : |a^T(x_i − x)| ≤ z} denote that subset for a given projection a and a threshold z > 0. For the method to work well, it suffices that all the S_i(z) contain about 2k points, so the parameter z must be chosen to ensure this with high probability. For a given projection a, consider the event A_i(z) = {x : |a^T(x_i − x)| ≤ z}. The distribution of the number of points x_h ∈ D whose projections fall in the interval [−z, z] is Binomial with parameters n and probability of success Prob(A_i(z)). The expected number of points whose projections fall in this interval is therefore n Prob(A_i(z)).
Setting this expectation equal to the desired number 2k, we get that z must satisfy Prob(A_i(z)) = 2k/n. The next result calculates this probability.

Theorem 2.4. Suppose that the vector a ∼ N(0, I_p) and that x is distributed with density f_p(x). Then the threshold z satisfies Prob(|a^T x| ≤ z) = 2k/n.
In particular, when x is distributed uniformly in a ball of radius R, the theorem yields an explicit value of z. The next result establishes that our algorithm finds the k-NN of the entire dataset with high probability. Indeed, let r = max_{1≤i≤n} ‖x_i − x_{i(k)}‖, the largest distance between a point and its k-th nearest neighbor, so that Prob(A_{iℓ}^j(z)) = 2Φ(z/‖x_i − x_{i(ℓ)}‖) − 1 ≥ 2Φ(z/r) − 1, for all ℓ = 1, . . . , k, and i = 1, . . . , n. Therefore, to show the result, it is sufficient to ask nk 2^m (1 − Φ(z/r))^m ≤ ε, which implies m ≥ log(ε/(nk)) / log(2(1 − Φ(z/r))).
How large is r? To answer this question, we need to know the distribution of the data (at least locally). We consider the same setup introduced in the work of von Luxburg, Radl, and Hein (2014). For any ε > 0 and x ∈ R^p, let B_ε(x) be the ball of radius ε centered at x. Suppose that the data D ⊂ R^p are such that for all x in the boundary ∂support(D) of support(D), vol(B_ε(x) ∩ support(D)) ≥ α vol(B_ε(x)), for some 0 < α ≤ 1. Proposition 30 in (von Luxburg et al. 2014), on the maximum distance between a point x_i and its k-th nearest neighbor x_{i(k)}, states that, with high probability, max_i ‖x_i − x_{i(k)}‖ ≤ s (k/n)^{1/p}, where f_min > 0 is the minimum density on D and s = 2 (f_min vol(B_1(0)) α)^{−1/p}. Assuming that the data were generated from a uniform distribution, we have f_min = [vol(support(D))]^{−1}. Concerning α, we can easily bound it if we assume that support(D) is a volume without spikes. For instance, if support(D) is the unit ball, α can be made arbitrarily close to 0.5 (Li 2011). Therefore, if support(D) is a ball of radius R, we may take r = 2R (2k/n)^{1/p}, provided that k ≥ log n and n is large.
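As a numerical illustration (our own sketch; the choice z = 2r below is purely illustrative, not a recommendation made in this paper), the bound on m can be evaluated once r and z are fixed.

import math
from scipy.stats import norm

def ball_r(R, k, n, p):
    """Bound r = 2 R (2k/n)**(1/p) on the k-NN radius when support(D)
    is a ball of radius R (uniform data, alpha ~ 1/2, k >= log n)."""
    return 2.0 * R * (2.0 * k / n) ** (1.0 / p)

def required_projections_ball(z, r, n, k, eps):
    """Smallest m with n k 2^m (1 - Phi(z/r))^m <= eps, that is,
    m >= log(eps/(n k)) / log(2 (1 - Phi(z/r)))."""
    q = 2.0 * (1.0 - norm.cdf(z / r))  # must be < 1, i.e. z > 0
    return math.ceil(math.log(eps / (n * k)) / math.log(q))

# Example with hypothetical values: unit ball, n = 1e6, k = 10, p = 10.
r = ball_r(R=1.0, k=10, n=1_000_000, p=10)
print(r, required_projections_ball(z=2.0 * r, r=r, n=1_000_000, k=10, eps=0.05))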

Numerical application
As the algorithm only retrieves the k-nearest neighbors of a set of points approximately, it is important to show that the points obtained are nevertheless useful for practical applications. The application we have chosen is intrinsic dimension estimation. Different definitions of the intrinsic dimension can be given (see Kegl 2002); informally, it refers to the useful number of parameters needed to describe the data. Various estimation methods have been proposed, such as the correlation dimension (Grassberger and Procaccia 1983) and the capacity dimension (Kegl 2002). Here we use the nearest-neighbor-based estimator

d̂(x) = log 2 / log( r̂_k(x) / r̂_{k/2}(x) ),

where r̂_ℓ(x) represents the distance of x to its ℓ-th nearest neighbor. To estimate the dimension of the whole manifold, we simply compute its empirical mean, d̂ = (1/n) Σ_{i=1}^n d̂(x_i). Using this estimator, we compare the estimation of the intrinsic dimension obtained by (a) using the exact k-nearest neighbors, and (b) the approximate k-nearest neighbors yielded by our procedure. We compared the estimators on several artificially generated datasets. These consisted of data drawn from a uniform distribution over B_1(0), the unit ball in R^p, with p ∈ {5, 10, 20, 50}. The dataset sizes were set to 10^5 or 10^6. The results, for a number of nearest neighbors equal to k = 10 or 30, and a number of random projections equal to 10, 20 or 100, are displayed in Table 1. The table shows the dimension estimates as well as the execution times.
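A direct implementation of this estimator (a minimal sketch of our own, assuming the sorted k-NN distances have already been produced by either the exact or the approximate method) is:

import numpy as np

def intrinsic_dimension(knn_dists, k):
    """Estimate d as the mean over points of
    log(2) / log(r_k(x_i) / r_{k/2}(x_i)), where knn_dists[i, l-1] is
    the distance from x_i to its l-th nearest neighbor (n x k array,
    neighbors sorted by increasing distance)."""
    r_k = knn_dists[:, k - 1]
    r_half = knn_dists[:, k // 2 - 1]
    return np.mean(np.log(2.0) / np.log(r_k / r_half))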
Although the results are comparable in terms of quality, the execution times are largely in favor of our approximate method, particularly when the datasets contain a million points. As the number of projections increases, the results yielded by the approximate k-nearest-neighbors method get closer to the estimates given by the exact k-nearest-neighbors method. Also, when the number of neighbors increases from 10 to 30, the results of the approximate method improve with the number of projections. The best results for our approximate k-nearest-neighbors algorithm are observed for the one-million-point datasets with 10 neighbors and 100 projections.
The approximate k-nearest-neighbors method tends to underestimate the true dimension. A possible explanation for this behaviour is that the farther from a given point x we go, the less accurate the estimation of its true ℓ-th nearest neighbor, for large ℓ ≤ k. This, in turn, has an effect on the formula for the estimation of the dimension. More specifically, there is a trailing error in the consecutive estimation of the k-nearest-neighbors, so the error in the estimation of, say, r_k in the formula for the dimension tends to be much larger than the error in the estimation of r_{k/2}. This would produce estimated ratios r̂_k/r̂_{k/2} > r_k/r_{k/2}. Consequently, the approximate k-nearest-neighbors estimate d̂ of the dimension tends to be smaller than the estimate of d made with the exact k-nearest-neighbors. More specifically, let r̂_k = r_k + e_k and r̂_{k/2} = r_{k/2} + e_{k/2}, and suppose that e_k = e_{k/2} + ν_k, with ν_k > 0. Then the dimension will be underestimated if r̂_k/r̂_{k/2} > r_k/r_{k/2}, which holds if and only if e_k/e_{k/2} > r_k/r_{k/2}; equivalently, one would need to have ν_k/e_{k/2} > (r_k/r_{k/2}) − 1 for the dimension to be underestimated. This simple computation shows that underestimation may be absent for well-separated (spread) data.
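In compact form, the chain of equivalences just described reads:

\[
\frac{\hat r_k}{\hat r_{k/2}} > \frac{r_k}{r_{k/2}}
\iff r_{k/2}\, e_k > r_k\, e_{k/2}
\iff \frac{e_k}{e_{k/2}} > \frac{r_k}{r_{k/2}}
\iff \frac{\nu_k}{e_{k/2}} > \frac{r_k}{r_{k/2}} - 1,
\]
% and since \hat d = \log 2 / \log(\hat r_k / \hat r_{k/2}), a larger
% estimated ratio yields a smaller estimated dimension.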
We have also tested our method on two different datasets for which we have a fairly good estimate of the "true" intrinsic dimension. The first dataset is MNIST (LeCun, Bottou, Bengio, and Haffner 1998), which consists of 60,000 black-and-white images of size 28×28. The intrinsic dimension is usually deemed to be about 14 (Hein and Audibert 2014; Costa and Hero III 2006; Facco, d'Errico, Rodriguez, and Laio 2017). Here, with k = 30 neighbors and m = 100 projections, we estimate the intrinsic dimension to be 13.8. Using the exact k-neighborhood, the intrinsic dimension is estimated to be 16.2. The computation of the approximate procedure took only 298 seconds, while the one using the exact k-neighborhood took 3,020 seconds, that is, more than ten times longer than the approximate procedure.
The second dataset is the Dota 2 dataset, collected from the UCI repository website (Dheeru and Karra Taniskidou 2017). It contains 102,944 Dota 2 computer game results. The games are played by two teams of 5 players each. Each player can choose a hero among 113 possible heroes. Given that a hero can only be chosen by one player, the 113-dimensional rows contain only 10 values different from 0: −1 for one team and 1 for the other; the value 0 is given to heroes that have not been chosen by either team. A reasonable guess for the intrinsic dimension is therefore 10. In practice, for k = 10 and m = 100, we obtained the estimate 7.5 with the exact k-nearest-neighbors algorithm in 1,221 seconds, and 9.3 with the approximate k-neighborhood in 83 seconds.

Table 1: Intrinsic dimension estimation results obtained using exact, random and approximate k-nearest neighbors; timeE and timeA stand, respectively, for the exact and the approximate k-nearest-neighbors execution times.

Conclusion
An efficient algorithm for computing approximate k-nearest neighbors has been introduced in this work. The method is easy to implement. It essentially comes down to performing random projections, sorting points along these projection lines, and computing the distances to the nearest neighbors found along these lines.
We have shown its efficiency with an application to dimension estimation based on k-nearest neighbors. Dimensions were estimated on artificial datasets of up to one million observations, as well as on real examples such as MNIST and the Dota 2 games dataset. Our experiments show that we still obtain very good estimates of the intrinsic dimension while decreasing the computing time by up to about two orders of magnitude. A potential improvement in performance could be achieved by considering low-discrepancy sequences instead of random projections. Low-discrepancy sequences, also known as quasi-random sequences, are such that any subsequence has a distribution close to the uniform distribution. In our case, we would be interested in using low-discrepancy sequences on the hypersphere (Wong, Luk, and Heng 1997). The idea is to explore the space of directions in a quasi-random way, aiming to cover it more systematically than purely random directions do. This is left for future work.