Extracting Information from Interval Data Using Symbolic Principal Component Analysis

  • M. R. Oliveira CEMAT and Instituto Superior Técnico, Universidade de Lisboa, Portugal
  • M. Vilela CEMAT and Instituto Superior Técnico, Universidade de Lisboa, Portugal
  • A. Pacheco CEMAT and Instituto Superior Técnico, Universidade de Lisboa, Portugal
  • Rui Valadas IT and Instituto Superior Técnico, Universidade de Lisboa, Portugal
  • Paulo Salvador IT and Universidade de Aveiro, Portugal

Abstract

We introduce generic definitions of symbolic variance and covariance for random interval-valued variables that lead to a unified and insightful interpretation of four known symbolic principal component estimation methods: CPCA, VPCA, CIPCA, and SymCovPCA. Moreover, we propose truncated versions of symbolic principal components, which use a strict subset of the original symbolic variables, as a way to improve the interpretation of symbolic principal components. Furthermore, the analysis of a real dataset leads to a meaningful characterization of Internet traffic applications, while highlighting similarities between the symbolic principal component estimation methods considered in the paper.
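As a rough illustration of the simplest of the four methods, the sketch below applies the centers approach (CPCA), in which each interval is replaced by its midpoint and classical PCA is run on the resulting matrix. It is a minimal sketch, not the authors' implementation: the function name `centers_pca`, the toy intervals, and the 1/n normalization of the covariance matrix are all assumptions made for the example.

```python
import numpy as np

def centers_pca(lower, upper):
    """Hypothetical helper: classical PCA on interval midpoints (CPCA sketch)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Replace each interval [a, b] by its center (a + b) / 2.
    centers = (lower + upper) / 2.0
    # Center the columns and form the covariance matrix (1/n is an assumption).
    centered = centers - centers.mean(axis=0)
    cov = centered.T @ centered / centers.shape[0]
    # eigh returns eigenvalues in ascending order; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Toy data (made up): 4 objects described by 2 interval-valued variables.
lower = np.array([[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [6.0, 5.5]])
upper = np.array([[3.0, 4.0], [4.0, 2.0], [5.0, 8.0], [9.0, 7.5]])

eigvals, loadings = centers_pca(lower, upper)
```

Because only the midpoints enter the computation, this approach ignores the interval ranges; the other methods named in the abstract differ in how they account for that within-interval variability.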

References

Bertrand P, Goupil F (2000). Descriptive Statistics for Symbolic Data. In HH Bock, E Diday (eds.), Analysis of Symbolic Data, Studies in Classification, Data Analysis, and Knowledge Organization, pp. 106-124. Springer Berlin Heidelberg.

Billard L (2008). Sample Covariance Functions for Complex Quantitative Data. In Proceedings of World IASC Conference, Yokohama, Japan, pp. 157-163.

Billard L, Diday E (2003). From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis. Journal of the American Statistical Association, 98, 470-487.

Billard L, Diday E (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons.

Cadima JFCL, Jolliffe IT (2001). Variable Selection and the Interpretation of Principal Subspaces. Journal of Agricultural, Biological, and Environmental Statistics, 6(1), 62-79.

Cazes P, Chouakria A, Diday E, Schektman Y (1997). Extension de l'Analyse en Composantes Principales à des Données de Type Intervalle. Revue de Statistique Appliquée, 45(3), 5-24.

Chouakria A (1998). Extension des Méthodes d'Analyse Factorielle à des Données de Type Intervalle. Ph.D. thesis, Université Paris-Dauphine.

De Carvalho FdA, Brito P, Bock HH (2006). Dynamic Clustering for Interval Data Based on L2 Distance. Computational Statistics, 21(2), 231-250.

Diday E (1987). The Symbolic Approach in Clustering and Related Methods of Data Analysis. In Proceedings of First conference IFCS, Aachen, Germany. H. Bock ed. North-Holland.

Le-Rademacher J, Billard L (2012). Symbolic Covariance Principal Component Analysis and Visualization for Interval-Valued Data. Journal of Computational and Graphical Statistics, 21(2), 413-432.

Pascoal C (2014). Contributions to Variable Selection and Robust Anomaly Detection in Telecommunications. Ph.D. thesis, Instituto Superior Técnico, Universidade de Lisboa, Portugal.

Pascoal C, Oliveira M, Valadas R, Filzmoser P, Salvador P, Pacheco A (2012). Robust Feature Selection and Robust PCA for Internet Traffic Anomaly Detection. In INFOCOM, 2012 Proceedings IEEE, pp. 1755-1763. ISSN 0743-166X.

Vilela M (2015). Classical and Robust Symbolic Principal Component Analysis for Interval Data. Master's thesis, Instituto Superior Técnico, Universidade de Lisboa, Portugal.

Wang H, Guan R, Wu J (2012). CIPCA: Complete-Information-based Principal Component Analysis for Interval-valued Data. Neurocomputing, 86, 158-169.

Published
2017-04-12
How to Cite
Oliveira, M. R., Vilela, M., Pacheco, A., Valadas, R., & Salvador, P. (2017). Extracting Information from Interval Data Using Symbolic Principal Component Analysis. Austrian Journal of Statistics, 46(3-4), 79-87. https://doi.org/10.17713/ajs.v46i3-4.673
Section
Special Issue CDAM conference