Roman Orač (2014) Machine learning algorithms in distributed environment with MapReduce paradigm. MSc thesis.
Abstract
Implementing machine learning algorithms in a distributed environment offers several advantages, such as the ability to process large datasets and linear speedup with additional processing units. We describe the MapReduce paradigm, which enables distributed computing, and the Disco framework, which implements it. We present the summation form, a condition for efficient implementation of algorithms in the MapReduce paradigm, and describe the implementations of selected algorithms. We propose novel distributed random forest algorithms that build models on subsets of the dataset. We compare the running time and accuracy of the algorithms with well-recognized data analytics tools. We conclude the thesis by describing the integration of the implemented algorithms into the ClowdFlows platform, a web platform for the construction, execution and sharing of interactive data mining workflows. This integration enables the processing of big batch data with visual programming.
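To illustrate the summation form mentioned in the abstract, here is a minimal sketch in plain Python. It is not taken from the thesis and does not use the Disco API. It assumes a one-dimensional least-squares fit as the example algorithm, and the function names (`map_chunk`, `combine`, `fit_line`) are hypothetical. The point is that the model depends on the data only through sums, so each chunk can be summarized independently (map) and the summaries added together (reduce).

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: partial sums (n, Sx, Sy, Sxx, Sxy) for one data chunk."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: element-wise addition of two partial-sum tuples."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(chunks):
    """Fit y = slope * x + intercept from the combined sums only."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: points on y = 2x + 1, split across two "workers".
chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(chunks)
```

Because addition is associative and commutative, the result does not depend on how the data is partitioned. This is the property that lets the map step run on separate processing units.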
Item Type: | Thesis (MSc thesis) |
Keywords: | MapReduce, distributed computing, Disco, machine learning, summation form, DiscoMLL, distributed random forest, ClowdFlows. |
Number of Pages: | 123 |
Language of Content: | Slovenian |
Mentor / Comentors: |
Name and Surname | ID | Function |
---|---|---|
izr. prof. dr. Marko Robnik Šikonja | 276 | Mentor |
prof. dr. Nada Lavrač | | Comentor |
Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1536017347) |
Institution: | University of Ljubljana |
Department: | Faculty of Computer and Information Science |
Item ID: | 2829 |
Date Deposited: | 15 Oct 2014 19:56 |
Last Modified: | 06 Nov 2014 11:19 |
URI: | http://eprints.fri.uni-lj.si/id/eprint/2829 |