Internet as Data Source in the Istat Survey on ICT in Enterprises

Authors

  • Giulio Barcaroli, ISTAT
  • Alessandra Nurra
  • Sergio Salamone
  • Monica Scannapieco
  • Marco Scarnò
  • Donato Summa

DOI:

https://doi.org/10.17713/ajs.v44i2.53

Abstract

The Istat sample survey on ICT in enterprises aims to produce information on the use of ICT, and in particular on the use of the Internet, by Italian enterprises for various purposes (e-commerce, e-recruitment, advertisement, e-tendering, e-procurement, e-government). To this end, data are collected by means of the traditional instrument of the questionnaire. Istat began to explore the possibility of using web scraping techniques, combined in the estimation phase with text and data mining algorithms, with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach. The 8,600 websites indicated by the 19,000 enterprises responding to the 2013 ICT survey have been scraped, and the acquired texts have been processed in an attempt to reproduce the same information collected via questionnaire. Preliminary results are encouraging, showing in some cases a satisfactory predictive capability of the fitted models (mainly those obtained with the Naive Bayes algorithm). The method known as Content Analysis has also been applied, and its results compared with those obtained with classical learners. In order to improve the overall performance, an advanced system for scraping and mining is being adopted, based on the open source Apache suite Nutch-Solr-Lucene. On the basis of the final results of this test, an integrated system harnessing both survey data and data collected from the Internet to produce the required estimates will be implemented, based on systematic scraping of the nearly 100,000 websites related to the whole population of Italian enterprises with 10 or more persons employed, operating in industry and services. This new approach, based on Internet as Data source (IaD), is characterized by advantages and drawbacks that need to be carefully analysed.
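
As a rough illustration of the kind of model the abstract refers to, the following is a minimal sketch of a multinomial Naive Bayes text classifier that predicts a yes/no questionnaire answer (for example, whether an enterprise sells online) from the text scraped from its website. It is not the authors' implementation: the toy texts, the labels, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all hypothetical, and the abstract does not say which features or preprocessing were actually used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: scraped website text paired with the survey answer
# to an assumed question "does the enterprise sell via its website?".
docs = [
    "add to cart checkout secure payment shipping",
    "online shop buy now cart payment",
    "company history mission contact us",
    "our team contact address opening hours",
]
labels = ["yes", "yes", "no", "no"]

model = train_naive_bayes(docs, labels)
print(predict(model, "buy online with secure checkout"))
```

In a setting like the one described, the survey responses would play the role of the training labels, and the fitted model would then be applied to websites of enterprises outside the sample.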

Published

2015-04-30

How to Cite

Barcaroli, G., Nurra, A., Salamone, S., Scannapieco, M., Scarnò, M., & Summa, D. (2015). Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, 44(2), 31-43. https://doi.org/10.17713/ajs.v44i2.53

Section

Q2014