On Representativeness of Internet Data Sources for Real Estate Market in Poland


Maciej Eryk Beręsewicz, Poznan University of Economics




Shifting paradigms in official statistics have led to the widespread use of administrative records to support, or to provide an alternative to, censuses and surveys. At the same time, demand for diversified and detailed information is increasing. To meet this demand, official statistics needs to seek new data sources. Internet data sources or, more generally, Big Data could be one of them. The potential usefulness of these new sources of statistical information should not be neglected.

The aim of the paper is to assess the representativeness of Internet data sources (IDS) for the real estate market in Poland. These sources could be used to describe demand and supply on the secondary real estate market in more detail than is possible with existing methodology. To assess representativeness, information from official surveys and other data sources is used. Because the literature on this issue is scarce, the author's own research is conducted to supplement the information from official statistics. Internet data sources are defined for the purpose of the paper. The TERYT register, which contains information on street names, was used to correct information taken from Internet data sources. A dedicated program for automated data collection (a web spider) was developed. All calculations were done in the R statistical software with additional packages (XML, RCurl, httr).

Author Biography

Maciej Eryk Beręsewicz, Poznan University of Economics

Department of Statistics

PhD Student


Bapna R, Goes P, Gopal R, Marsden JR (2006). “Moving from Data-Constrained to Data-Enabled Research: Experiences and Challenges in Collecting, Validating and Analyzing Large-Scale e-Commerce Data.” Statistical Science, 21(2), 116–130. ISSN 0883-4237. doi:10.1214/088342306000000231. 0609136v1, URL http://projecteuclid.org/Dienst/


Bayer M (2011). “Gartner Says Solving ’Big Data’ Challenge Involves More Than Just Managing Volumes of Data.” URL http://www.gartner.com/newsroom/id/1731916.

Bayer M, Laney D (2012). “The Importance of ’Big Data’: A Definition.” URL https:


Bethlehem J (2008). “Representativity of web surveys–an illusion?” Access panels and online research, panacea or pitfall, pp. 19–44.

Bethlehem J (2009). Applied survey methods: A statistical perspective. John Wiley & Sons.

Bethlehem J, Biffignandi S (2011). Handbook of web surveys. John Wiley & Sons.

Buelens B, Daas P, Burger J, Puts M, van den Brakel J (2014). “Selectivity of Big data.” URL http://www.pietdaas.nl/beta/pubs/pubs/Selectivity_Buelens.pdf.

Cavallo A (2012). “Scraped data and sticky prices.” MIT Sloan Research Paper. URL http://www.mit.edu/%7Eafc/papers/Cavallo-Scraped.pdf.

Cavallo A (2013). “Online and Official Price Indexes: Measuring Argentina’s Inflation.” Journal of Monetary Economics, 60(2), 152–165.

Central Statistical Office (2014). Information Society in Poland statistical results from the years 2009-2013 (in Polish). Statistical Office in Szczecin, Warsaw, Poland. URL http://stat.gov.pl/download/gfx/portalinformacyjny/pl/defaultaktualnosci/5497/1/7/4/spolecz_inform_w_polsce_2009-2013.pdf.

Choi H, Varian H (2012). “Predicting the present with Google Trends.” Economic Record, 88(s1), 2–9.

Daas P, Puts M (2014a). “Big Data as a Source of Statistical Information.” The Survey Statistician, 69, 22–31. URL http://pietdaas.nl/beta/pubs/pubs/Big_data_survey_stat.pdf.

Daas P, Puts M (2014b). “Social Media Sentiment and Consumer Confidence.” URL http://www.ecb.europa.eu/pub/pdf/scpsps/ecbsp5.pdf.

Daas P, Roos M, de Blois C, Hoekstra R, ten Bosch O, Ma Y (2011). “New data sources for statistics: Experiences at Statistics Netherlands.” In Paper for the 2011 European New Technique and Technologies for Statistics conference, February, pp. 22–24.

Daas P, Roos M, van de Ven M, Neroni J (2012). “Twitter as a potential data source for statistics.” URL http://pietdaas.nl/beta/pubs/pubs/DiscPaper_Twitter.pdf.

Fondeur Y, Karamé F (2013). “Can Google data help predict French youth unemployment?” Economic Modelling, 30, 117–125. ISSN 02649993. doi:10.1016/j.econmod.2012.07.017. URL http://linkinghub.elsevier.com/retrieve/pii/S0264999312002490.

Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008). “Detecting influenza epidemics using search engine query data.” Nature, 457(7232), 1012–1014.

Gollata E (2014). “New paradigm in statistics and population census quality.” European conference on quality in official statistics. URL http://www.q2014.at/fileadmin/user_upload/GOLATA_NEW.pdf.

Hoekstra R, ten Bosch O, Harteveld F (2012). “Automated data collection from web sources for official statistics: First experiences.” Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, 28(3), 99–111.

Kruskal W, Mosteller F (1979a). “Representative sampling I: Non-scientific literature.” International Statistical Review, 47, 13–24. URL http://www.jstor.org/stable/1402564.

Kruskal W, Mosteller F (1979b). “Representative sampling II: Scientific literature excluding statistics.” International Statistical Review, 47, 111–123. URL http://www.jstor.org/stable/1402564.

Kruskal W, Mosteller F (1979c). “Representative sampling III: The current statistical literature.” International Statistical Review, 47, 245–265. URL http://www.jstor.org/stable/1402647.

Lang DT (2013). XML: Tools for parsing and generating XML within R and S-Plus. R package version 3.98-1.1, URL http://CRAN.R-project.org/package=XML.

Lang DT (2014). RCurl: General network (HTTP/FTP/...) client interface for R. R package version 1.95-4.3, URL http://CRAN.R-project.org/package=RCurl.

Miller G (2011). “Social scientists wade into the tweet stream.” Science, 333(6051), 1814–1815.

Mohorko A, de Leeuw E, Hox J (2013). “Internet coverage and coverage bias in Europe: developments across countries and over time.” Journal of Official Statistics, 29(4), 609–622.

National Bank Of Poland (2014a). The real estate market - Information Quarterly (in Polish). Finance stability department, Warsaw, Poland. URL http://nbp.pl/home.aspx?f=/publikacje/rynek_nieruchomosci/index2.html.

National Bank Of Poland (2014b). Report on the situation on the markets of residential and commercial property in Poland in 2013 (in Polish). Finance stability department, Warsaw, Poland. URL http://nbp.pl/publikacje/rynek_nieruchomosci/raport_2013.pdf.

Porter AT, Holan SH, Wikle CK, Cressie N (2013). “Spatial Fay-Herriot models for small area estimation with functional covariates.” arXiv preprint arXiv:1303.6668.

Pratesi M, Giannotti F, Giusti C, Marchetti S, Pedreschi D, Salvati N (2014). “Area level sae models with measurement errors in covariates: an application to sample surveys and big data sources.” Small Area Estimation. URL http://sae2014.ue.poznan.pl/SAE2014_book.pdf.

Pratesi M, Pedreschi D, Giannotti F, Marchetti S, Salvati N, Maggino F (2013). “Small area model-based estimators using big data sources.” NTTS. URL http://www.cros-portal.eu/sites/default/files/NTTS2013fullPaper_208.pdf.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Schouten B, Cobben F, Bethlehem J (2009). “Indicators for the representativeness of survey response.” Survey Methodology, 35(1), 101–113.

Shmueli G, Jank W, Bapna R (2005). “Sampling eCommerce data from the web: Methodological and practical issues.” In ASA Proc. Joint Statistical Meetings, volume 941, p. 948. URL https://archive.nyu.edu/bitstream/2451/14953/2/USEDBOOK11.pdf.

Vosen S, Schmidt T (2011). “Forecasting private consumption: survey-based indicators vs. Google trends.” Journal of Forecasting, 30(6), 565–578.

Wallgren A, Wallgren B (2014). Register-based Statistics. Wiley Series in Survey Methodology, second edition. John Wiley & Sons, Inc. ISBN 9781119942139.

Wang W, Rothschild D, Goel S, Gelman A (2014). “Forecasting Elections with Non-Representative Polls.” International Journal of Forecasting. Forthcoming.

Wickham H (2009). ggplot2: elegant graphics for data analysis. Springer New York. ISBN 978-0-387-98140-6. URL http://had.co.nz/ggplot2/book.

Wickham H (2014). httr: Tools for working with URLs and HTTP. R package version 0.5, URL http://CRAN.R-project.org/package=httr.

Xu W, Li Z, Cheng C, Zheng T (2012). “Data mining for unemployment rate prediction using search engine query data.” Service Oriented Computing and Applications, 7(1), 33–42. ISSN 1863-2386. doi:10.1007/s11761-012-0122-2. URL http://link.springer.com/10.1007/s11761-012-0122-2.

Zhang LC (2011). “A Unit-Error Theory for Register-Based Household Statistics.” Journal of Official Statistics, 27(3), 415–432.

Zhang LC (2012). “Topics of statistical theory for register-based statistics and data integration.” Statistica Neerlandica, 66(1), 41–63. ISSN 00390402. doi:10.1111/j.1467-9574.2011.00508.x.

Zhang LC (2014). “On modelling register coverage errors.” Journal of Official Statistics. Forthcoming.



How to Cite

Beręsewicz, M. E. (2015). On Representativeness of Internet Data Sources for Real Estate Market in Poland. Austrian Journal of Statistics, 44(2), 45-57. https://doi.org/10.17713/ajs.v44i2.79