Extracting Information from Interval Data Using Symbolic Principal Component Analysis

M. R. Oliveira, M. Vilela, A. Pacheco, Rui Valadas, Paulo Salvador


We introduce generic definitions of symbolic variance and covariance for random interval-valued variables that lead to a unified and insightful interpretation of four known symbolic principal component estimation methods: CPCA, VPCA, CIPCA, and SymCovPCA. Moreover, we propose truncated versions of symbolic principal components, which use a strict subset of the original symbolic variables, as a way to improve the interpretation of symbolic principal components. Furthermore, the analysis of a real dataset leads to a meaningful characterization of Internet traffic applications, while highlighting similarities between the symbolic principal component estimation methods considered in the paper.


Bertrand P, Goupil F (2000). Descriptive Statistics for Symbolic Data. In HH Bock, E Diday (eds.), Analysis of Symbolic Data, Studies in Classification, Data Analysis, and Knowledge Organization, pp. 106-124. Springer Berlin Heidelberg.

Billard L (2008). Sample Covariance Functions for Complex Quantitative Data. In Proceedings of World IASC Conference, Yokohama, Japan, pp. 157-163.

Billard L, Diday E (2003). From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis. Journal of the American Statistical Association, 98, 470-487.

Billard L, Diday E (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons.

Cadima JFCL, Jolliffe IT (2001). Variable Selection and the Interpretation of Principal Subspaces. Journal of Agricultural, Biological, and Environmental Statistics, 6(1), 62-79.

Cazes P, Chouakria A, Diday E, Schektman Y (1997). Extension de l'Analyse en Composantes Principales à des Données de Type Intervalle. Revue de Statistique Appliquée, 45(3), 5-24.

Chouakria A (1998). Extension des Méthodes d'Analyse Factorielle à des Données de Type Intervalle. Ph.D. thesis, Université Paris-Dauphine.

De Carvalho FdA, Brito P, Bock HH (2006). Dynamic Clustering for Interval Data Based on L2 Distance. Computational Statistics, 21(2), 231-250.

Diday E (1987). The Symbolic Approach in Clustering and Related Methods of Data Analysis. In Proceedings of First Conference IFCS, Aachen, Germany. H. Bock ed. North-Holland.

Le-Rademacher J, Billard L (2012). Symbolic Covariance Principal Component Analysis and Visualization for Interval-Valued Data. Journal of Computational and Graphical Statistics, 21(2), 413-432.

Pascoal C (2014). Contributions to Variable Selection and Robust Anomaly Detection in Telecommunications. Ph.D. thesis, Instituto Superior Técnico, Universidade de Lisboa, Portugal.

Pascoal C, Oliveira M, Valadas R, Filzmoser P, Salvador P, Pacheco A (2012). Robust Feature Selection and Robust PCA for Internet Traffic Anomaly Detection. In INFOCOM, 2012 Proceedings IEEE, pp. 1755-1763. ISSN 0743-166X.

Vilela M (2015). Classical and Robust Symbolic Principal Component Analysis for Interval Data. Master's thesis, Instituto Superior Técnico, Universidade de Lisboa, Portugal.

Wang H, Guan R, Wu J (2012). CIPCA: Complete-Information-based Principal Component Analysis for Interval-valued Data. Neurocomputing, 86, 158-169.

DOI: http://dx.doi.org/10.17713/ajs.v46i3-4.673


@Matthias Templ (using Open Journal Systems) -- see previous editions at http://www.stat.tugraz.at/AJS/Editions.html