Žan Anderle (2017) Automatic prediction of company's characteristics based on their website. MSc thesis.
Abstract
Our main objective is to predict a company's characteristics (industry, age, number of employees) from the company's website. We present several prediction models, each of which extracts information from the website in a different way. We show which features to extract from a website for a specific prediction task. We find that the text of the website's content and the text of its meta tags are often the most relevant. Each of these two text sources yields a separate prediction model, and the two models can also be combined in an ensemble model. The ensemble was used to predict the company's industry and achieved satisfactory results. We also tested an alternative way to describe a website, using metadata extracted from it instead of its text. This is useful when the computational cost of text analysis must be avoided. We used a model built on these metadata features to predict the age and number of employees, but it was not particularly successful. We also discuss the problem of obtaining an appropriate dataset for developing the aforementioned prediction models, and find that solving this problem is crucial for achieving better results.