Sandi Holub (2014) Processing large amounts of data in the cloud. EngD thesis.
Abstract
Big Data is gaining recognition in the world of information technology. The term covers tools that allow the storage and retrieval of very large amounts of data. Hadoop is an open-source project of the Apache Software Foundation, which combines tools for storing, processing and retrieving structured and unstructured data. The data need to be managed with appropriate infrastructure, which in most cases means a cluster of computers. If we do not wish to maintain such infrastructure ourselves, we can use the cloud. YARN, MapReduce, Pig and the Hadoop Distributed File System (HDFS) are the basic components of the Hadoop project, and they make it straightforward to implement a first version of the software. The reader can use this thesis as a guide to setting up a basic Hadoop cluster and developing a Java application or a Pig script. The time and cost comparison between running in the cloud and on a local cluster can also support the decision-making process when purchasing infrastructure.
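As an illustration of the kind of processing the thesis describes, the following is a minimal Pig Latin sketch of a word-count job; the input and output paths ('input.txt', 'wordcounts') are placeholders and are not taken from the thesis itself.

    -- load raw text lines from HDFS (placeholder path)
    lines = LOAD 'input.txt' AS (line:chararray);
    -- split each line into words and flatten the resulting bag
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words and count the occurrences in each group
    grouped = GROUP words BY word;
    counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
    -- write the results back to HDFS (placeholder path)
    STORE counts INTO 'wordcounts';

A script like this would typically be run with the pig command against the cluster, while the equivalent Java application would implement the same logic as a MapReduce job.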