ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Web crawlers

Dejan Petrović (2013) Web crawlers. EngD thesis.

    Abstract

    A web spider, also called a web crawler, is an automated program or script that independently crawls websites on the internet, locating and extracting the desired data along the way. The data is saved in a database and later used for various purposes. Some spiders download entire websites into large repositories, while others search for more specific data, such as email addresses or phone numbers. The best-known and most important application of web crawlers is crawling websites for search engines such as Google. The aim of the thesis is to examine the performance of existing web spiders and to implement our own spider. We describe the different types of spiders and their goals, the course of web crawling, where the crawling process usually starts, and how the pages to crawl are chosen. We then explain how the spider determines the content of a page, and where and in what form the data is stored. Next, we describe the differences between various web spiders and their uses. Finally, we present an example implementation of a working web crawler that starts crawling from selected web pages and stores the information it finds in a database.
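
The crawl loop outlined in the abstract (start from selected pages, follow links, extract specific data, store it in a database) might look roughly like the sketch below. It is not the thesis implementation: the names `crawl`, `LinkParser`, `EMAIL_RE`, and `PAGES`, the choice of SQLite, the breadth-first order, and the `example.test` URLs are all assumptions made for illustration. The `fetch` argument is a caller-supplied function, so the example runs against in-memory pages instead of the network.

```python
import re
import sqlite3
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Deliberately simple email pattern (an assumption; real addresses vary more).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, conn, max_pages=100):
    """Breadth-first crawl from the seed URLs, storing found emails in SQLite.

    fetch(url) is a hypothetical caller-supplied function that returns the
    page's HTML as a string, or None if the page cannot be retrieved.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails (url TEXT, email TEXT, UNIQUE(url, email))"
    )
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        # Extract the desired data and store it.
        for email in sorted(set(EMAIL_RE.findall(html))):
            conn.execute("INSERT OR IGNORE INTO emails VALUES (?, ?)", (url, email))
        # Extract links, resolve them against the current URL, and queue new ones.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    conn.commit()
    return visited


# Usage: in-memory pages stand in for real HTTP responses.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a> info@example.test',
    "http://example.test/a": '<a href="/">home</a> alice@example.test',
    "http://example.test/b": "no links here",
}
conn = sqlite3.connect(":memory:")
visited = crawl(["http://example.test/"], PAGES.get, conn)
rows = conn.execute("SELECT url, email FROM emails ORDER BY email").fetchall()
```

A crawler run against real sites would also need an HTTP client with timeouts, politeness delays, and robots.txt handling, all of which this sketch leaves out.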

    Item Type: Thesis (EngD thesis)
    Keywords: web crawler, crawler, website, search engine, google, web crawler implementation
    Number of Pages: 56
    Language of Content: Slovenian
    Mentor / Comentors: prof. dr. Matjaž Branko Jurič (Mentor)
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10203732)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 2229
    Date Deposited: 27 Sep 2013 15:15
    Last Modified: 18 Oct 2013 14:28
    URI: http://eprints.fri.uni-lj.si/id/eprint/2229
