Peter Grlica (2013) Web scraping techniques. EngD thesis.
Abstract
In this thesis we analyse different approaches to accessing unstructured data on websites. Our main focus is on techniques for gathering information from the presentation layer (HTML parsing) using tools available in the open source community, and we also examine the downsides of commercial data scrapers and scraping services. Because of our experience with the PHP programming language and the plethora of tools, libraries and products implemented in it, we concentrate on web scraping with the cURL library in combination with XPath. We further cover the use of "headless" browsers for advanced scraping of websites that rely heavily on AJAX requests, and Mink, a tool for automating website functionality testing. With the rise of web crawlers, many content providers try to block them by tracking bot access. We analyse the anonymization tools and user identification techniques employed on websites, and we also address the legislation concerning web scraping and the most widely known legal cases in this field. Lastly, we discuss the strengths and weaknesses of the implemented scraper, as well as possibilities for upgrading and extending the implementation through request parallelization and distributed control across multiple servers.
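The core technique described above, fetching a page with cURL and extracting data with XPath, can be sketched in PHP roughly as follows. This is a minimal illustrative example, not code from the thesis; the URL, user agent string and XPath expression are placeholder assumptions.

<?php
// Fetch the raw HTML of a page with cURL (placeholder URL).
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleBot)');
$html = curl_exec($ch);
curl_close($ch);

// Parse the presentation layer and query it with XPath.
$doc = new DOMDocument();
libxml_use_internal_errors(true);                 // tolerate malformed real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1') as $node) {        // extract every <h1> heading as an example
    echo trim($node->textContent), PHP_EOL;
}

This approach works for static pages; for content rendered via AJAX, the thesis turns to "headless" browsers and Mink, which drive a real browser session instead of parsing the raw response.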