11/4/09

Map Reduce with Hadoop


Hadoop was introduced in late 2005 as part of the Apache Software Foundation's Nutch (web search engine) and Lucene (text search library) projects. In March 2006, Hadoop became a separate project. Hadoop is an original implementation, inspired by the MapReduce and Google File System techniques used at Google.
Hadoop is a software framework that enables distributed computing on large data sets. Hadoop is designed to be reliable - overcoming failures of compute and storage elements, efficient - capable of processing data in parallel, and scalable - able to grow to thousands of machines and petabytes of data.


MapReduce
MapReduce is a framework and programming model for processing large data sets in parallel. MapReduce consists of two main functions: Map and Reduce.

• The Map function takes the input data and turns it into a list of key/value pairs. In essence, Map maps the input into a set of key/value pairs.

• The Reduce function combines the key/value pairs that have been grouped by key. In essence, Reduce reduces each key and its list of values into a smaller set of values.
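As an illustration, the classic word-count example can be sketched in plain Java, without the Hadoop API: the map step emits a (word, 1) pair for every word, and the reduce step sums the values for each key. The class and method names here are illustrative, not part of any Hadoop library:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: turn one line of input into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: given one key and all of its values, produce the combined value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] input = { "hadoop maps and reduces", "hadoop scales" };
        // Shuffle: group the mapped values by key (a TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Running this prints each distinct word with its count, e.g. "hadoop" appears twice across the two input lines. In real Hadoop the grouping step is done by the framework between the map and reduce phases.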

In a typical library implementation, a MapReduce operation can be divided into seven stages:

1. The input files are split into M pieces (for example, 64 MB each). The MapReduce library then starts copies of the program on the machines in the cluster.

2. One copy of the program acts as the master; the rest are workers that receive tasks from the master. There are M map tasks and R reduce tasks to assign to the workers.

3. A worker assigned a map task reads its piece of the input file, parses out the key/value pairs, and passes each pair to the Map function; the resulting intermediate pairs are buffered in memory.

4. Periodically, the buffered results are written to local disk, partitioned into R regions, and their locations are reported to the master. The master then forwards these locations to the workers in charge of the reduce tasks.

5. A reduce worker uses remote procedure calls to read the map output. It then sorts the key/value pairs by key so that all occurrences of the same key are grouped together; this grouping makes the work of the Reduce function, which generally operates per key, straightforward.

6. The reduce worker then iterates over the sorted key/value pairs and, for each unique key, passes the key and its set of values to the Reduce function. The results of the Reduce function are stored in the final output file.

7. After all map and reduce tasks have completed, the master returns control to the user program. At this point the MapReduce call returns its result.
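The partitioning in stage 4 is usually done by hashing, so that every occurrence of a key, no matter which map worker produced it, lands at the same reduce task. A minimal sketch of that idea (hash of the key modulo R is the common default; the class name is illustrative):

```java
public class PartitionSketch {
    // Assign an intermediate key to one of R reduce partitions.
    // Math.floorMod guards against negative hash codes.
    static int partition(String key, int numReduceTasks) {
        return Math.floorMod(key.hashCode(), numReduceTasks);
    }

    public static void main(String[] args) {
        int r = 4; // R reduce tasks
        String[] keys = { "hadoop", "hdfs", "hadoop", "mapreduce" };
        for (String key : keys) {
            int p = partition(key, r);
            System.out.println(key + " -> partition " + p);
            // Every partition index must fall in the range [0, R).
            assert p >= 0 && p < r;
        }
        // The same key always maps to the same partition.
        assert partition("hadoop", r) == partition("hadoop", r);
        System.out.println("ok");
    }
}
```

Because the assignment depends only on the key, the reduce worker for a given partition is guaranteed to see all values for each of its keys.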

Hadoop uses one JobTracker and many TaskTrackers spread across the cluster. Applications submit MapReduce jobs to the JobTracker, which then coordinates the work and hands tasks to the TaskTrackers. Hadoop optimizes network usage by moving computation closer to the data.

For distributed storage, Hadoop provides the Hadoop Distributed File System (HDFS). HDFS splits each file into blocks of data and distributes them across the machines in the cluster.
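As a rough illustration of the splitting, the number of blocks for a file follows from the file size and the block size (64 MB was a common default in early Hadoop versions). This is a back-of-the-envelope sketch, not actual HDFS code:

```java
public class BlockSketch {
    // Number of fixed-size blocks needed to hold a file (ceiling division).
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, an early HDFS default
        long fileSize = 200L * 1024 * 1024; // a hypothetical 200 MB file
        // 200 MB / 64 MB rounds up to 4 blocks; the last block is only partly full.
        System.out.println(numBlocks(fileSize, blockSize));
    }
}
```

Each block is then stored (and, in HDFS, replicated) on the machines of the cluster, which is what lets map tasks read their input piece from a nearby copy.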

The Hadoop framework is written for and runs on the Java platform. Most of its programs and libraries are built in Java, apart from a few non-Java components. Programs for Hadoop are written using the Application Programming Interface (API) it provides.

Real-world deployments of Hadoop can be found at major online service providers such as Yahoo!, Facebook, Baidu, and America Online. Yahoo! uses Hadoop for its WebMap application to index web sites, reaching 1 trillion links. Facebook likewise uses Hadoop for data processing and storage.

More info:
http://hadoop.apache.org
http://en.wikipedia.org/wiki/Hadoop
