Domain-Based Benchmark Experiments: Exploratory and Inferential Analysis


  • Manuel J. A. Eugster, Institut für Statistik, LMU München, Germany
  • Torsten Hothorn, Institut für Statistik, LMU München, Germany
  • Friedrich Leisch, Institut für Angewandte Statistik und EDV, BOKU Wien, Austria



Benchmark experiments are the method of choice for comparing learning algorithms empirically. For collections of data sets, the empirical performance distributions of a set of learning algorithms are estimated, compared, and ordered. Usually this is done for each data set separately. The present manuscript extends this single data set-based approach to a joint analysis of the complete collection, the so-called problem domain. This makes it possible to decide which algorithms to deploy in a specific application, or to compare newly developed algorithms with well-known algorithms on established problem domains.

Specialized visualization methods allow for easy exploration of huge amounts of benchmark data. Furthermore, we take the benchmark experiment design into account and use mixed-effects models to provide a formal statistical analysis. Two domain-based benchmark experiments demonstrate our methods: the UCI domain, a well-known domain for evaluating a newly developed algorithm, and the Grasshopper domain, where we want to find the best learning algorithm for a prediction component in an enterprise application software system.
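
The step from a per-data-set ordering to a joint ordering for the whole problem domain can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data set names, the algorithm labels, and the misclassification values are hypothetical, and the Borda count is a simple stand-in chosen for brevity, not the aggregation or the mixed-effects analysis used in the paper.

```python
from statistics import mean

# Hypothetical misclassification rates: data set -> algorithm -> values over
# resampling replications. All names and numbers are illustrative only.
performance = {
    "dataset_a": {"lda": [0.20, 0.22, 0.21], "rf": [0.15, 0.17, 0.16], "svm": [0.18, 0.19, 0.17]},
    "dataset_b": {"lda": [0.33, 0.34, 0.32], "rf": [0.30, 0.31, 0.29], "svm": [0.25, 0.27, 0.26]},
    "dataset_c": {"lda": [0.12, 0.11, 0.13], "rf": [0.08, 0.09, 0.10], "svm": [0.10, 0.11, 0.12]},
}


def order_per_dataset(results):
    """Order the algorithms on one data set by mean misclassification, best first."""
    return sorted(results, key=lambda algo: mean(results[algo]))


def borda_consensus(orders):
    """Aggregate per-data-set orders with a Borda count (lower total is better)."""
    scores = {}
    for order in orders:
        for position, algo in enumerate(order):
            scores[algo] = scores.get(algo, 0) + position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda algo: (scores[algo], algo))


# Single data set-based view: one order per data set.
per_dataset = {name: order_per_dataset(res) for name, res in performance.items()}

# Domain-based view: one order for the complete collection.
domain_order = borda_consensus(per_dataset.values())

for name, order in per_dataset.items():
    print(name, order)
print("domain", domain_order)
```

In this toy example the individual data sets disagree about the best algorithm, while the aggregated order gives a single answer for the domain. A mean-based ordering ignores the variability of the performance distributions, which is what a formal statistical analysis would take into account.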






How to Cite

Eugster, M. J. A., Hothorn, T., & Leisch, F. (2016). Domain-Based Benchmark Experiments: Exploratory and Inferential Analysis. Austrian Journal of Statistics, 41(1), 5–26.