Intelligent separation of interleaved sessions associated with building a data webhouse

Marko Poženel (2010) Intelligent separation of interleaved sessions associated with building a data webhouse. PhD thesis.

Abstract

In the past decades, the World Wide Web (WWW) has become one of the main sources of information. The number of web sites and web pages is increasing, and new applications and ways of using the web keep emerging. Companies need web sites to reach customers and sell their products, institutions provide information about their services, and individuals can effectively access various services over the Internet. However, companies with a web presence need to make better use of the opportunities offered by the web. Under the new circumstances, customers can easily compare offers and choose the most favourable products and services, so it is difficult to attract new customers and retain existing ones. Only those web sites that know their customers and understand their needs will prevail. Analysing users' behavior has therefore become an important part of web page data analysis. User behavior analysis includes more or less straightforward statistics, such as page access reports, and more sophisticated forms of analysis, such as finding the most common path through a web site. The main source of data for user behavior analysis is clickstream data, which can be obtained from web server log files. To analyse clickstream data we often build data webhouses, where we encounter the problem of clickstream data quality. Clickstream data are often inadequately structured and give an incomplete picture of users' activity. For this reason, several preprocessing tasks have to be performed before the clickstream data are loaded into a data webhouse. The quality of the patterns discovered in user behavior analysis largely depends on the quality of the data in the data webhouse. Clickstream preprocessing is often based on heuristic rules and assumptions about the site's usage and is therefore prone to errors. Many methods for clickstream preprocessing have been proposed, but reliable session reconstruction remains a challenge.
With the advent of new browser generations that offer tabbed browsing, the number of interleaved sessions has increased. An interleaved session is generated by a user who is concurrently browsing a web site in two or more web sessions (browser windows). In an interleaved session, the temporal order of user clicks does not necessarily correspond to the sequence of browsed pages in a web session. Interleaved sessions have a negative effect on the quality of data analysis, so they must be separated. They also tend to be generated more often by the users who are most important to us. In the thesis we deal with the problem of interleaved HTTP sessions. The goal of the thesis is to examine interleaved sessions and develop efficient methods for their separation, so that we can provide quality data for loading into a data webhouse. To this end, we first consider the problem of separating interleaved sessions from a theoretical point of view. Then we describe the characteristics of interleaved sessions and define the theoretical background. In the central part of the thesis, we present two new separation methods: a method based on a trained first-order Markov model and a method that uses recursive best-first heuristic search (RBFS). The second method returns the session separation with the highest probability of page sequences in the separated sessions. We evaluated both methods on experimental, artificially generated data and on two real-world clickstream data sources: the university student records information system e-Študent and the web shop EnaA. For the evaluation we used several different criteria based on symbol sequence similarity. The results clearly show that both methods adequately separate interleaved sessions, and that the method using best-first search gives better results. Both methods are general, so they can be used on any clickstream data source.
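
The idea of separating an interleaved click sequence with a first-order Markov model can be illustrated with a minimal sketch. It is not the method from the thesis: the function names (`train_markov`, `separate_greedy`), the additive smoothing, the page names, and the greedy click-by-click assignment are all assumptions made for illustration. The thesis describes a trained Markov model and an RBFS-based search for the most probable separation; the greedy rule below is a much simpler stand-in.

```python
from collections import defaultdict

START = "<start>"  # hypothetical marker for the beginning of a session


def train_markov(sessions, smoothing=1.0):
    """Estimate first-order transition probabilities P(next | prev)
    from clean (non-interleaved) sessions, with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    pages = set()
    for session in sessions:
        prev = START
        for page in session:
            counts[prev][page] += 1
            pages.add(page)
            prev = page
    vocab = len(pages)

    def prob(prev, nxt):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + smoothing) / (total + smoothing * vocab)

    return prob


def separate_greedy(clicks, prob, n_sessions=2):
    """Assign each click, in temporal order, to the session whose last
    page makes that click most probable (ties go to the lower index).
    This is a greedy simplification, not a search for the globally
    most probable separation."""
    sessions = [[] for _ in range(n_sessions)]
    for page in clicks:
        def score(i):
            prev = sessions[i][-1] if sessions[i] else START
            return prob(prev, page)

        best = max(range(n_sessions), key=score)
        sessions[best].append(page)
    return sessions


# Hypothetical training sessions and an interleaved click sequence.
training = [
    ["home", "catalog", "item", "cart"],
    ["login", "grades", "exams", "logout"],
]
prob = train_markov(training)
interleaved = ["home", "login", "catalog", "grades",
               "item", "exams", "cart", "logout"]
print(separate_greedy(interleaved, prob))
```

On this toy input the greedy rule recovers the two training paths as separate sessions. A greedy assignment can commit to a locally likely choice that turns out to be wrong later, which is one reason a search over whole separations, as in the RBFS-based method, may give better results.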

Item Type:

Thesis (PhD thesis)

Keywords:

HTTP session, data quality, clickstream, Markov model, stochastic model, user behavior, sessionization, RBFS, interleaved session, session separation process, data warehouse

Number of Pages:

186

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
izr. prof. dr. Viljan Mahnič	241	Mentor
doc. dr. Matjaž Kukar	267	Comentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00253025280)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

1149

Date Deposited:

19 Aug 2010 13:33

Last Modified:

13 Aug 2011 00:37

URI:

http://eprints.fri.uni-lj.si/id/eprint/1149
