Rok Burgar (2012) Semi-automatic web site wrapper construction. EngD thesis.
Abstract
This thesis describes the development of a program for scraping data from partially structured web pages. The web is a document-based system. Every document has metadata that describes its structure. Documents on the web are written in HTML. The problem with HTML is that its primary purpose is to describe the visual properties of a document rather than its content. In addition, web documents are written against different HTML standards, and almost no web page is fully compliant with those standards. The thesis describes how the required data can be extracted from a web page simply and effectively. The implementation consists of two main programs. The first is a web browser plugin that is used to mark the locations of the data. The second runs on a web server and extracts the data from a web page based on the locations marked with the first program.
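The server-side half of this two-program design can be illustrated with a minimal sketch. It is not the implementation from the thesis: the mark format (a field name mapped to a tag, attribute and value), the class `MarkedFieldExtractor` and the function `extract` are all hypothetical names chosen for illustration. The sketch assumes the browser plugin has already recorded the marks, and it uses Python's standard `html.parser`, which tolerates the non-compliant HTML mentioned above.

```python
from html.parser import HTMLParser


class MarkedFieldExtractor(HTMLParser):
    """Collects the text of the first element matching each marked location."""

    def __init__(self, marks):
        super().__init__()
        # Hypothetical mark format: field name -> (tag, attribute, value).
        self.marks = marks
        self.results = {}
        self._field = None   # field currently being captured, if any
        self._tag = None     # tag name of the element being captured
        self._depth = 0      # nesting depth of that same tag name
        self._chunks = []    # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Inside a capture: only count nested elements of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        attr_map = dict(attrs)
        for field, (m_tag, m_attr, m_value) in self.marks.items():
            if field in self.results:
                continue  # keep only the first match per field
            if tag == m_tag and attr_map.get(m_attr) == m_value:
                self._field, self._tag = field, tag
                self._depth, self._chunks = 1, []
                return

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and normalise whitespace.
                text = " ".join("".join(self._chunks).split())
                self.results[self._field] = text
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)


def extract(html, marks):
    """Return a dict of field name -> extracted text for the given marks."""
    parser = MarkedFieldExtractor(marks)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    # Deliberately sloppy HTML: unclosed <p>, missing </html>.
    page = (
        '<html><body><div class="item">'
        '<h1 id="title">Blue <b>Widget</b></h1>'
        '<p>skip<br><span class="price"> 12.50 EUR</span>'
        '</div></body>'
    )
    marks = {
        "title": ("h1", "id", "title"),
        "price": ("span", "class", "price"),
    }
    print(extract(page, marks))
    # prints {'title': 'Blue Widget', 'price': '12.50 EUR'}
```

Matching on a single attribute keeps the sketch short. A real wrapper would probably need a more robust location description, such as a path from the document root, so that marks survive small changes in page layout.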