Lado Langof (2015) Databases for Data Mining. MSc thesis.
Abstract
This work looks for synergies between data mining tools and database management systems (DBMS). Imagine a situation where we need to solve an analytical problem using data that are too large to be processed solely in main memory, yet too small to justify putting a data warehouse or a distributed analytical system in place. The target environment is therefore a single personal computer used to solve data mining problems. We are looking for tools that allow us to process and prepare such quantities of data effectively for further analysis. The main focus of this work is not on data mining itself but on the second and third steps of the CRISP-DM process standard for data mining, namely data understanding and data preparation. The question is how to use the functionality of various DBMS and ETL tools to prepare data for data mining as effectively as possible. Unneeded data should be ignored and the remainder transformed into an appropriate form. Data mining execution time and accuracy should improve when the data are optimized, that is, free of unneeded attributes, duplicate records, typos and other unwanted properties. The objective of this work is thus to find appropriate practical methods (tools or combinations of tools, methodologies) for collecting relatively large amounts of data from different sources and in different forms, joining them, and, using DBMS and ETL tools, transforming them into a format that can be used directly by data mining algorithms.