Dejan Petrović (2013) Web crawlers. EngD thesis.
Abstract
A web spider is an automated program or script that independently crawls websites on the internet, locating and extracting the desired data from the pages it visits. The extracted data is saved in a database and later used for various purposes. Some spiders download entire websites into large repositories, while others search for more specific data, such as email addresses or phone numbers. The best-known and most important application of web crawlers is crawling websites on behalf of search engines such as Google. The aim of this thesis is to examine the performance of existing web spiders and to implement our own version of a spider. We describe the different types of spiders and their goals, the course of web crawling, where the crawling process usually starts, and how the pages the spider will crawl are chosen. We then explain how the spider determines the content of a page, and where and in what form the data is stored. Next, we describe the differences between various web spiders and their uses. Finally, we present an example implementation of a working web crawler that starts crawling from selected web pages and stores the information it finds in a database.
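To make the workflow described above concrete, here is a minimal sketch of such a crawler in Python: it starts from seed pages, fetches each page, extracts data (here, the page title), stores it in a database, and queues outgoing links for further crawling. This is not the thesis's actual implementation; the function name `crawl`, the database file `crawl.db`, the table schema, and the seed URL are all illustrative assumptions, and the example depends on the third-party `requests` and `beautifulsoup4` packages.

```python
import sqlite3
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=50, db_path="crawl.db"):
    """Breadth-first crawl from seed_urls, storing page titles in SQLite.

    Illustrative sketch only: schema and limits are assumptions,
    not the implementation from the thesis.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")

    frontier = deque(seed_urls)  # URLs waiting to be fetched
    visited = set()              # URLs already processed

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages

        # Determine the content of the page and extract the desired data.
        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
        con.commit()

        # Choose the pages to crawl next: enqueue outgoing HTTP(S) links.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).scheme in ("http", "https"):
                frontier.append(next_url)

    con.close()


if __name__ == "__main__":
    crawl(["https://example.com"])  # hypothetical seed page
```

A real crawler would additionally respect robots.txt, rate-limit requests per host, and normalize URLs before deduplication; these concerns are omitted here to keep the sketch short.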