Peter Grlica (2013) Web scraping techniques. EngD thesis.
Abstract
In this thesis we analyse different approaches to accessing unstructured data on websites. Our main focus is on techniques for gathering information from the presentation layer (HTML parsing) using tools available in the open source community, and we also examine the downsides of commercial data scrapers and scraping services. Because of our experience with the PHP programming language and the plethora of tools, libraries and products implemented in it, we concentrate on web scraping with the cURL library in combination with XPath. We further cover the use of "headless" browsers for advanced scraping of websites that rely heavily on AJAX requests, and Mink, a tool for automating website functionality testing. With the rise of web crawlers, many content providers try to block them by tracking bot access. We analyse the anonymization tools and user identification techniques employed on websites, and we also address the legislation concerning web scraping and the most widely known legal cases in this field. Lastly, we discuss the strengths and weaknesses of the implemented scraper, as well as possibilities for upgrading and extending the implementation through request parallelization and distributed control across multiple servers.
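The core technique described above, fetching a page with cURL and extracting data with XPath, can be sketched in PHP roughly as follows. This is a minimal illustrative example, not code from the thesis; the URL, user agent string and XPath expression are placeholder assumptions.

<?php
// Fetch the raw HTML of a page with cURL (placeholder URL).
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleBot)');
$html = curl_exec($ch);
curl_close($ch);

// Parse the presentation layer and query it with XPath.
$doc = new DOMDocument();
libxml_use_internal_errors(true);                 // tolerate malformed real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//h1') as $node) {        // extract every <h1> heading as an example
    echo trim($node->textContent), PHP_EOL;
}

This approach works for static pages; for content rendered via AJAX, the thesis turns to "headless" browsers and Mink, which drive a real browser session instead of parsing the raw response.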