ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Web scraping techniques

Peter Grlica (2013) Web scraping techniques. EngD thesis.

    Abstract

    In this thesis we analysed different methods of accessing unstructured data on websites. Our main focus was on techniques for gathering information from the presentation layer (HTML parsing) using tools available in the open source community, as well as on the downsides of commercial data scrapers and scraping services. Because of our experience with the PHP programming language and the plethora of tools, libraries and products implemented in it, we focused on web scraping with the cURL library in combination with XPath. We also examined "headless" browsers, used for advanced scraping of websites that rely extensively on AJAX requests, and Mink, a tool for automated testing of website functionality. With the rise of and demand for web crawlers, many content providers try to block them by tracking bot access. We analysed the use of anonymization tools and the user identification techniques employed on websites, and we reviewed the legislation concerning web scraping and the most widely known legal cases in this industry. Finally, we discussed the positive and negative aspects of the implemented scraper, as well as ways to upgrade and extend the implementation through request parallelization and distributed control across multiple servers.
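
The XPath-based extraction step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, which was written in PHP with cURL. It is a Python approximation using only the standard library, whose `xml.etree.ElementTree` module supports a limited subset of XPath. The sample markup, the class names and the `extract_links` helper are hypothetical, and the page is an inline string rather than a downloaded document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; a real scraper would first download it over HTTP.
# The markup is deliberately well-formed, because ElementTree is an XML parser
# and does not tolerate the broken HTML often found on real websites.
SAMPLE_PAGE = """
<html>
  <body>
    <ul class="results">
      <li class="item"><a href="/id/eprint/1">First title</a></li>
      <li class="item"><a href="/id/eprint/2">Second title</a></li>
      <li class="ad"><a href="/promo">Sponsored</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(markup):
    """Return (text, href) pairs for links inside list items of class 'item'."""
    root = ET.fromstring(markup.strip())
    # The attribute predicate keeps only <li class="item"> elements,
    # so the sponsored entry is skipped.
    anchors = root.findall(".//li[@class='item']/a")
    return [(a.text, a.get("href")) for a in anchors]


links = extract_links(SAMPLE_PAGE)
for text, href in links:
    print(text, href)
```

For real pages, a tolerant HTML parser with full XPath support would be needed in place of ElementTree; the selection logic would stay the same.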

    Item Type: Thesis (EngD thesis)
    Keywords: PHP, Mink, webscraping, AJAX, anonymization, legality, automatization, open source
    Number of Pages: 67
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname         ID     Function
    doc. dr. Dejan Lavbič    302    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=9990484)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 2081
    Date Deposited: 02 Jul 2013 12:41
    Last Modified: 23 Jul 2013 09:32
    URI: http://eprints.fri.uni-lj.si/id/eprint/2081
