Data Fit (Stress) vs. Model Fit (Recovery) in Multidimensional Scaling

The paper studies the relation of data fit (Stress) and model fit (recovery of a true latent configuration superimposed with random error) depending on the number of points and the error level in the data when running ordinal and interval MDS. Using 2-dimensional MDS with 6 to 100 points and ten levels of error, it is found that adding more points (given that n > 10) is driving up the Stress of the MDS solutions but also leads to better recovery, in general. The recovery exceeds any reasonable statistical benchmarks, i.e. more points enable MDS to smooth out the error in the data so that the true configuration can surface more clearly in the MDS solution.


Introduction
A crucial issue of multidimensional scaling (MDS) is evaluating how well a particular solution represents the data. The data fit is measured by various stress or alienation coefficients -all loss functions which quantify on a scale from 0 to 1 the extent to which the dissimilarity data, optimally re-scaled at the assumed scale level, correspond to the distances among the points of the MDS solution. In the early days of MDS, certain absolute benchmarks were proposed for such loss functions. Kruskal (1964), for example, suggested five grades for his Stress coefficient, ranging from "0%" showing a "perfect" fit to ">20%" indicating "poor" fit. Guttman (1968) even proposed a single benchmark for his alienation coefficient, i.e. K < .15. Solutions that satisfy this criterion are considered acceptably good, others are taken as "borderline" or poor. Similarly, Davison (1983, p. 116) recommends that one should "seldom accept a solution with a Stress of more than .10". Borgatti (1996, p. 29) suggests that "the rule of thumb we use is that anything under 0.10 is excellent and anything over 0.15 is unacceptable".
Such absolute criteria are still used today, in particular the K < .15 criterion. Yet, Borg and Groenen (2005), Mair, Borg, and Rusch (2016), and Borg, Groenen, and Mair (2017), suggest that the MDS user should pay attention to a range of criteria rather than to only a single one. Such criteria are, in particular, the number of points; the dimensionality of the MDS solution; the type of MDS model (ordinal, interval, ratio); the error level of the data; possible outliers in terms of Stress-per-Point; the replicability of the MDS solution; and its interpretability vis-à-vis a substantive theory explaining the structure of the data.
An additional aspect of assessing the quality of MDS solutions that has not received much attention in the past is the question to what extent high Stress values can be misleading in the sense that they indicate a poor data fit while the model fit is actually quite good, i.e. the MDS distances correspond only poorly to the data but correspond well to the distances of the underlying true model configuration. Is MDS able to actually re-or uncover an assumed true configuration from error-affected data, even though the Stress seems high, and, if so, to what extent does that depend on the number of points and the error level of the data?
By "true configuration" we mean the assumed or known underlying configuration of the MDS model whose inter-point distances are the systematic component of the observed dissimilarities. If such a true configuration exists, MDS should detect it in the data. Consider two examples. The first one is a study by Borg and Leutner (1983). They studied how persons generate their judgments of similarity when looking at well-constructed geometric stimuli, namely at rectangles of different shape and size. Sixteen rectangles were constructed using four different width values and four different height values ("design configuration"), and then each possible pair of rectangles of this design was presented to the respondents to be rated on a scale from "0=equal, identical" to "9=very different". The resulting scores were averaged over all respondents and replications, and then converted to dissimilarities 1 . These dissimilarities were scaled with MDS, finding a configuration that closely matched the design configuration, except for logarithmic re-scalings of the X-and the Y -axis, as predicted. Hence, MDS succeeded in reconstructing the latent true model configuration (i.e., the design configuration with a psycho-physical Weber-Fechner re-scaling of the physical width and height dimensions).
Another example is research on personal values. When individuals are asked to assess the importance of such basic values as power, tradition, or benevolence, they, according to Schwartz and Bilsky (1987), generate their judgments on the basis of an underlying circle of basic values, where these values have a universal order and certain oppositions that indicate psychological neighborhoods and perceived incompatibilities. This value circle is the theoretically assumed latent true configuration. Hundreds of studies conducted all over the world have shown that MDS uncovers this configuration reliably 2 . The circle is not perfect in MDS, and the Stress is never zero, but the theory seems to hold. It also holds when using vastly different questionnaires to collect the data (direct ratings, ratings of portraits; values based on theory or values derived by lexicographic methods), and surveys assessing from 10 to 19 different values (Schwartz, Cieciuch, Vecchione, Davidov, Fischer, Beierlein, Ramos, Verkasalo, Lönquist, Dirilen-Gumus, and Konty 2012;Borg, Dobewall, and Aavik 2016).
What most MDS users know is that Stress depends on the number of variables or points: "For the same underlying data structure, a larger data set will necessarily result in a higher stress value" (Holland 2008). Simulation studies show that random Stress norms grow in a negatively accelerated function, first steeply increasing up to about 30 points, and then slowly growing with ever more points (Spence and Ogilvie 1973;. At the same time, the power of MDS to uncover or recover an underlying true configuration grows too by adding more points, as Young (1970) has shown in another simulation study. Users of MDS occasionally use this finding in arguing that one can accept MDS solutions with high Stress if the number of points is only "large". The idea is that more points make it easier for MDS to act as a data smoother, ironing out the error in the data and uncovering the true structure in the data. The extent to which this is true is, however, not so clear.
Unfortunately, it is difficult to decide the issue of admitting large Stress values if only the number of points is large on the basis of the simulations reported by Young (1970). The study used a 3 × 5 × 5 factorial design with five replications. The three independent variables were the number of dimensions 3 of the true configuration (1, 2, and 3), the number n of points in the configuration (6, 8, 20, 25, and 30) and the proportion of error e introduced (.0, .1, .2, .35, .5). The true configuration X was defined by taking n point coordinates from two small tables of random numbers published by Coombs and Kao (1960) 4 . A error term sampled from a normal distribution was then added to each coordinate of X, independently drawn each time a new distance was computed (the so-called Hefner model 5 ). The error-affected distances, D e (X), were then mildly monotonically distorted and used as dissimilarities in ordinal MDS. Finally, the distances of any MDS solution Y, i.e. the "recovered" distances, D(Y), were compared with the true distances, D t (X), by correlating across the corresponding coordinates using the product-moment correlation coefficient.
One problem with this study is that the PM correlation of corresponding distances is not a proper way to measure the similarity of two geometric configurations. This is easy to see. Let A be a configuration of three points with distances d 12 = 1, d 23 = 2, and d 13 = 3. For comparison, use B with d 12 = 2, d 23 = 3, and d 13 = 4. The correlation of the two sets of distances is r = 1, incorrectly indicating perfect similarity of what is actually a triangle and three collinear points, respectively. A correct measure of similarity of two configurations is the congruence coefficient c of corresponding distances, i.e. the correlation about the fixed origin zero: where d ij (X) denotes the distance of points i and j of configuration X. Another often used measure is the PM correlation across the corresponding coordinates of the configurations after bringing them to an optimal fit using Procrustean similarity transformations (Borg and Groenen 2005): with s a global scaling constant, T a rotation/reflection matrix, t a translation vector, and 1 a vector of 1's; s, T, and t are chosen to maximize r(A, B * ) (Borg and Groenen 2005).
These two measures usually show similar but not identical results. However, even for random configurations, they can be misleading to the naive user, since they are always positive and often "large", even for random configurations (in particular those with only few points), so they should be evaluated for their statistical significance.
Another problem with Young's study is that it is simply quite old. Simulations were expensive to run on mainframe computers in the 1960's, requiring substantial amounts of research money. Therefore, the number of replications and the simulation design in general is small and the results difficult to generalize. Moreover, the early MDS algorithms were slow and did not necessarily converge to local Stress minima. Finding global Stress minima was even more uncertain. And only ordinal MDS was considered worthwhile in those days. Then, with only 5 replications and only one true configuration, it becomes risky to generalize the findings and to describe the relationship of Stress and recovery for different numbers of points and error levels. Finally, the error distributions and the error levels in Young's study are difficult to assess. A truly normal distribution admits, in principle, extremely large values. This is implausible in any MDS error model. Some type of truncation is needed. Also, one should be able to assess the amount of error at the most extreme level: Is that level realistic from the point-of-view of the applied user?
in the social sciences. 4 There are two tables, one 15 × 3 and one 30 × 3. The numbers are roughly rectangularly distributed, but Coombs and Kao (1960) do not report how the data were generated.
5 This model has been used "most frequently" in simulation studies on MDS (Zinnes and MacKay 1981). Originally, the model was an attempt to explain how an individual generates a distance judgment on the basis of a psychological map. The basic idea is that the configuration is not fixed but rather "oscillating", expressing the individual's uncertainty about the stimuli. Hence, the value of the distance dij depends on the positions of points i and j at time t, and each distance is computed at a different time t.
Today, many of the above questions can be addressed more easily and free of costs within the R environment (R Core Team 2018), using up-to-date MDS algorithms such as, in particular, the smacof R package (De Leeuw and Mair 2009).

Method
We here study how the relation of Stress and recovery depends on the number of points for both ordinal and interval MDS. We use n = 6, ..., 30 points and, in addition, 50 and 100 points. For each n, many configurations X with X-and Y -coordinates drawn from a rectangular random distribution over [−1, 1] are studied. For each X, the (Euclidean) distances, D t (X) are computed ("true distances"). Based on these distances, error-affected distances D e (X) are generated by adding to D t (X) randomly picked elements from a truncated normal distribution (using the truncnorm R-package of Mersmann, Trautmann, Steuer, and Bornkamp (2018)) with a mean of zero, a standard deviation of sd(X), upper and lower bounds of .5 * max(D t (X)), and multiplied by the error weight e = 0.1, 0.2, ..., 1. That is, D e (X) = D t (X)+ N trunc (0, sd(X)) * e. Truncating the normal error distribution to the interval [−.5 * max(D t (X)), +.5 * max(D t (X))] is introduced to avoid unrealistically large error values. An error that is as large as half of the largest true distance is taken as the maximum that we assume for meaningful observations.
The elements of each D e (X) are taken as the dissimilarities for MDS. All MDS analyses are computed by the mds() function of the R package smacof. For each combination of n points and error level e, 100 random configurations X are drawn. For each X, 30 different D e (X)'s are constructed and used as data in MDS. For each MDS solution, Y, the recovery of D t (X) is assessed by computing r(X, Y * ) as in (2) and c(X, Y) as in (1), using the function Procrustes() of the smacof R-package for Procrustean fitting of Y to X; the function also delivers the value of c. Thus, for each of the 27 · 10 combinations of n and e we have two recovery measures, each based on 100 · 30 MDS solutions.
The above error calculation based on distances rather than on point coordinates simulates the Hefner model, but does not require to independently draw error components for each element of D e (X). This makes the simulations easier. In order to be able to evaluate the amount of error in more detail, the error distribution used in the simulations is shown in Figure 2 for the n = 30 and e = 1 case (simulated 10,000 times).
The recovery coefficients can be benchmarked against coefficients one gets when comparing the similarity of random configurations. The r measure involves Procrustean fittings of the configurations, and hence it cannot be expected to be zero for random data, and certainly clearly greater than zero if the configurations contain only few points. The c measure compares the true distances with the MDS distances, and since distances are all non-negative, their inner product and their norms are all positive. Thus, we compute for both coefficients the mean values, standard deviations, and 95% quantiles for 10,000 pairs of random configurations of n = 3, 5, ..., 100 points to have some benchmarks of apparent but not real similarity. The largest possible distance in D t (X) is √ 8 and, hence, the largest possible error under e = 1 is ±1.41, i.e. half the diagonal of the square that X can fill. The right panel of Figure  2 shows the distribution of the errors generated in the simulations. The values cover almost the entire range of ±1.41, with a standard deviation of .47. Hence, the true distances and the errors have about the same standard deviation, so that the error on the true distances can be considered "substantial" in case of e = 1. Figure 3 exhibits what similarity values, using r and c, can be expected when comparing two 2-dimensional random configurations with 3, 4, ..., 100 points, respectively. The solid points that are connected by a solid line show the mean similarity coefficients observed for 10,000 pairs of random configurations. The solid line above the mean line represents the standard deviations of the observed coefficients. The open circles above these lines show the 99% quantiles.

Results
The random benchmarks values of r and c for a selection of n points are shown numerically in Table 1. The table also exhibits the Stress values obtained when scaling random dissimilarities for the same number of points with ordinal and interval MDS, respectively. The Stress values are generated for data drawn from a uniform random distribution using the randomstress() function of smacof (Mair et al. 2016), with 1,000 replications for each n and type of MDS. The results in Figure 1 can be evaluated against these benchmarks. For example, for n = 20 points and e = 1, the simulations lead to a mean recovery of r = .921, c = .970, and Stress = .182 when using interval MDS. These values are hugely unlikely for random data: The r and c recovery values are clearly greater than the benchmark values for n = 20 in Table 1, and the Stress values are also much smaller than the expected 5% quantiles when scaling random dissimilarities. Hence, one can conclude that the dissimilarities definitely contain a systematic 2-dimensional distance component.

Discussion
The simulations demonstrate that MDS can be expected to generally recover an underlying true configuration of points the better the more points one has. The effect of more points Legend: n is number of points; m(c), sd(c), and c99 denote mean, standard deviation, and 99% quantile of c; m(r), sd(r), and r99 denote mean, standard deviation, and 99% quantile of r; m(S), sd(S), and 5% are mean Stress, standard deviation of Stress, and 5% quantile of Stress for ordinal and interval MDS, respectively.
on recovery is larger for larger numbers of points (n), and also larger for higher error levels. However, the improvement effects can be found only if there are at least about 10 points in case of ordinal MDS, and at least that many points in case of interval MDS. This is not a real limitation, though, since fewer than 10 points should generally not be used in real applications of (2-dimensional) MDS anyway. Kruskal and Wish (1978, p. 52) suggest as "a rule of thumb" that n > 4m should hold, where m is the dimensionality of the MDS space.
The quality of recovery -as measured by two different coefficients of geometrical similarity of the true and the MDS configuration -is clearly better than can be statistically expected for random configurations, even in case of few points and high levels of error. This follows from comparing the r and the c coefficients in Figure 1 with their benchmark values in Figure 3.
The simulations also show that the recovery is not just improved but greatly improved with more points, in particular if the error level of the data is substantial. MDS succeeds in smoothing the data so that the true underlying configuration surfaces from the noisy observations: The MDS distances, therefore, become better information than the observed and noisy data.
The relation of Stress increments and recovery improvements (in terms of r or c) is essentially linear for low to medium error levels (for n > 9). For high error levels, the recovery improvements as compared to the Stress increments are relatively small for few points and then accelerate for many points. Hence, for noisy data and few points, merely adding one or two additional points cannot be expected to improve the model fit very much. However, it should also be added that not all points are equal, i.e. points associated with large distances in the true configuration impact the MDS solution more than points close to the centroid of the configuration. Hence, if the MDS user has a choice he or she should first consider such high-impact points (provided they are reliable).
When comparing the results of ordinal and interval MDS, one notes that interval MDS leads to clearly higher Stress values in general, but the differences taper off considerably when the number of points becomes large. For 50 or 100 points, the Stress values are very similar. For n > 9 points, both models also lead to similar levels of recovery. Only for very few points, ordinal MDS is more successful in recovering the true configuration.
It has been asked whether the similarity coefficients c and r used in this study are equally good. In particular, it may appear that c is technically superior because it does not require Procrustean transformations but simply compares the distances of two configurations (Vera 2017)). However, in this study, both c and r are computed on the same configurations, X (true configuration) and Y * (fitted MDS solution). The Procrustean fitting of Y to X is solved analytically. There are no local minima problems. If technical MDS problems arise when computing Y, they affect both c and r, and not just one of these coefficients. On the other hand, one should keep in mind that each coefficient answers a rather complex geometric question -the degree of similarity of two configurations -by a single number. It is therefore not surprising that c and r do not always arrive at exactly the same conclusions (for an applied example see Borg and Groenen 2005).
A technical problem that can arise when computing MDS solutions is that the solutions are not global but only local minima. This is particularly likely when using random initial configurations in MDS . However, to generate the above results we always used the Torgerson configuration to start the MDS algorithm. This is the default in practice, making local minima much less likely than using random starting configurations . Other starting configurations (including X itself) were also tested, leading to the same results in the simulations. Note also that each point in Figure 1 is based on 3.000 MDS analyses. In summary, one one can therefore safely assume that the general conclusions are not affected by local minima problems.
Finally, a number of caveats should be mentioned. First, we only simulated the 2-dimensional MDS case, assuming that the underlying true configuration is also 2-dimensional. This represents the case of the vast majority of MDS applications published in the literature. Second, our true underlying configurations were constructed by sampling their coordinates randomly from a rectangular distribution. This led to configurations whose points are fairly evenly distributed in the unit square. With real data, such distributions are not always guaranteed. One example is the case of personal values from the Introduction. Here, the points are distributed along a circle in space, leaving the center of the configuration empty. Moreover, when adding certain values such as "religion" to the standard set of ten personal values that give rise to the Schwartz value circle, it means that one adds a point that only partly belongs to the usual basic values (Borg, Hermann, Bilsky, and Pöge 2019): Religion requires an additional third dimension to be fully fitted to the ten-value value circle and, if fitted into two dimensions, it would therefore lead to higher Stress values but not to better recovery. Even if all variables can be represented in the same dimensionality, an additional point may lead to problems of fit and recovery. The user can detect such cases, however, by keeping an eye on the Stress-per-Point indexes when adding variables to the data set.