ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Semi-automatic web site wrapper construction

Rok Burgar (2012) Semi-automatic web site wrapper construction. EngD thesis.



    The thesis describes the development of a program for scraping data from partially structured web pages. The web is a document-based system. Every document has metadata that describes its structure. Documents on the web are written in HTML. The problem with HTML is that its primary purpose is to describe the visual properties of a document, not its content. In addition, web documents are written against different HTML standards, and almost no web page is fully compliant with them. The thesis describes how the needed data can be extracted from a web page simply and effectively. The implementation consists of two main programs. The first is a web browser plugin that is used to mark the locations of the data. The second runs on a web server and extracts the data from a web page based on the locations marked with the first program.
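
    The two-program split described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the template format (a field name mapped to a tag plus one attribute), the names `TemplateExtractor` and `extract`, and the sample page are all assumptions made for illustration. The template stands in for what the browser plugin would record, and the extractor stands in for the server-side program. Python's standard `html.parser` is used because it tolerates markup that is not standards-compliant.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects the text of the elements named by a scraping template.

    The template format is hypothetical: field -> (tag, attr_name, attr_value).
    """

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.result = {}
        self._field = None   # field currently being captured, if any
        self._tag = None     # tag name that opened the capture
        self._depth = 0      # nesting depth of that same tag name
        self._chunks = []    # text pieces seen inside the element

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        attr_map = dict(attrs)
        for field, (want_tag, name, value) in self.template.items():
            if (field not in self.result and tag == want_tag
                    and attr_map.get(name) == value):
                self._field, self._tag = field, tag
                self._depth, self._chunks = 1, []
                break

    def handle_endtag(self, tag):
        if self._field is None or tag != self._tag:
            return  # stray or unrelated closing tags are ignored
        self._depth -= 1
        if self._depth == 0:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.result[self._field] = text
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)


def extract(html, template):
    """Apply a template to a page and return {field: text}."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    return parser.result


# Hypothetical template, as the marking step might record it.
template = {
    "title": ("h1", "class", "title"),
    "price": ("span", "id", "price"),
}

# Deliberately sloppy page: <b> and <p> are never closed.
page = ('<html><body><h1 class="title">Blue <b>Widget</h1>'
        '<p>junk<span id="price"> 12.50 EUR </span></body>')

print(extract(page, template))
```

    Keeping the marked locations in a separate template means the extraction step can be rerun on the server without the browser. A locator this simple would break whenever the page layout changes, so a real template would likely need more robust locators.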

    Item Type: Thesis (EngD thesis)
    Keywords: scraping template, web scraping, semantic web, structure, browser plugin, web server
    Number of Pages: 59
    Language of Content: Slovenian
    Mentor / Co-mentors: doc. dr. Dejan Lavbič (Mentor)
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00009454420)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1830
    Date Deposited: 22 Sep 2012 11:52
    Last Modified: 19 Oct 2012 12:41
    URI: http://eprints.fri.uni-lj.si/id/eprint/1830
