Dejan Petrović (2013) Web crawlers. EngD thesis.
Abstract
A web spider is an automated program or script that independently crawls websites on the internet, locating and extracting the desired data from the pages it visits. The extracted data is saved in a database and later used for various purposes. Some spiders download entire websites into large repositories, while others search for more specific data, such as email addresses or phone numbers. The best-known and most important application of web crawlers is crawling websites on behalf of search engines such as Google. The aim of this thesis is to examine the performance of existing web spiders and to implement our own version of a spider. We describe the different types of spiders and their goals, the course of web crawling, where the crawling process usually starts, and how the pages the spider will crawl are chosen. We then explain how the spider determines the content of a page, and where and in what form the data is stored. Next, we describe the differences between various web spiders and their uses. Finally, we present an example implementation of a working web crawler that starts crawling from selected web pages and stores the information it finds in a database.
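To make the workflow described above concrete, here is a minimal sketch of such a crawler in Python: it starts from seed pages, fetches each page, extracts data (here, the page title), stores it in a database, and queues outgoing links for further crawling. This is not the thesis's actual implementation; the function name `crawl`, the database file `crawl.db`, the table schema, and the seed URL are all illustrative assumptions, and the example depends on the third-party `requests` and `beautifulsoup4` packages.

```python
import sqlite3
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=50, db_path="crawl.db"):
    """Breadth-first crawl from seed_urls, storing page titles in SQLite.

    Illustrative sketch only: schema and limits are assumptions,
    not the implementation from the thesis.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")

    frontier = deque(seed_urls)  # URLs waiting to be fetched
    visited = set()              # URLs already processed

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages

        # Determine the content of the page and extract the desired data.
        soup = BeautifulSoup(resp.text, "html.parser")
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
        con.commit()

        # Choose the pages to crawl next: enqueue outgoing HTTP(S) links.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).scheme in ("http", "https"):
                frontier.append(next_url)

    con.close()


if __name__ == "__main__":
    crawl(["https://example.com"])  # hypothetical seed page
```

A real crawler would additionally respect robots.txt, rate-limit requests per host, and normalize URLs before deduplication; these concerns are omitted here to keep the sketch short.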