MapReduce (Part 2) Hadoop

  • Created by: smrc
  • Created on: 18-05-19 15:33
Explain the generic MapReduce in stages.
Map: each input file is processed by a different Map function, which emits <key,value> pairs. These pairs are passed to the Master Controller, which forms the list of values associated with each key and sorts by key. The resulting <key,list-of-values> pairs are passed to the Reduce functions, which combine the values for each key to produce the final output.
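The stages above can be sketched as a small in-memory word count. This is an illustration only, not Hadoop's actual implementation: in a real cluster the Map, shuffle/sort, and Reduce steps run distributed across nodes.

```java
import java.util.*;

// Minimal in-memory sketch of the generic MapReduce stages:
// Map emits <key,value> pairs, the shuffle groups values by key
// (sorted by key), and Reduce combines each key's value list.
public class MapReduceSketch {

    // Map stage: emit a <word, 1> pair for every word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle stage (the Master Controller's role here): group values
    // by key; a TreeMap keeps the keys sorted.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce stage: combine each key's list of values into a single count.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"hadoop map reduce", "map reduce map"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(shuffle(pairs))); // {hadoop=1, map=3, reduce=2}
    }
}
```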
What is Hadoop?
Hadoop is an Apache open source framework for distributed processing across clusters of computers.
What does Hadoop provide?
Hadoop provides a Java-based implementation of MapReduce. It provides distributed computation and distributed storage.
What platforms support Hadoop?
Hadoop is supported on Linux platforms. Recent versions also have native support for Windows.
What is the Hadoop Architecture?
1. MapReduce (Distributed Computation) 2. HDFS = Hadoop Distributed File System (Distributed Storage) 3a. Hadoop YARN (job scheduling and cluster resource management) 3b. Hadoop Common (Common utilities).
Give the name of the architecture used in HDFS and explain its structure.
Master/Worker architecture. The master is a single NameNode that handles management tasks; one or more DataNodes store the actual data. A file in HDFS is split into several blocks, which are stored on the DataNodes. The default block size is 64 MB.
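The block splitting described above is simple ceiling division. A sketch, using the 64 MB default stated in the card (newer Hadoop releases default to 128 MB):

```java
// Sketch of how HDFS splits a file into fixed-size blocks.
// The number of blocks needed is the file size divided by the
// block size, rounded up; the last block may be partially filled.
public class HdfsBlocks {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default

    // Number of blocks needed to store a file of the given size in bytes.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // e.g. a 200 MB file
        // 4 blocks: three full 64 MB blocks plus one 8 MB block
        System.out.println(blockCount(fileSize)); // 4
    }
}
```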
What is the purpose of HDFS?
HDFS provides a shell and set of commands for interacting with the file system.
To submit a job to a Hadoop job client, what must you specify?
1. The location of the input and output files in HDFS. 2. The java classes, in the form of a jar file containing the implementation of the map and reduce functions 3. The job configuration by setting different parameters specific to the job.
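The three items above correspond to the pieces of a typical Hadoop job driver. A hedged sketch along the lines of the standard WordCount example: it assumes Hadoop's MapReduce libraries on the classpath and hypothetical `WordCountMapper` / `WordCountReducer` classes, so it is a job-configuration fragment, not a standalone program.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Job driver sketch; WordCountMapper and WordCountReducer are hypothetical
// classes standing in for your own map and reduce implementations.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // 3. job configuration parameters
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);               // 2. jar containing the classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // 1. input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); //    output location in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```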
After a job submission, what is the JobTracker responsible for?
1. Distributing the software to the nodes. 2. Scheduling and monitoring tasks. 3. Providing status and diagnostic information.
What do TaskTrackers do?
TaskTrackers on nodes execute Map and Reduce tasks.
Name the Hadoop Operation Modes.
1. Local/Standalone Mode 2. Pseudo-Distributed Mode 3. Fully Distributed Mode.
Describe the Hadoop Operation Mode: Local/Standalone
This is the default, and Hadoop runs as a single Java process.
Describe the Hadoop Operation Mode: Pseudo-Distributed. What is it used for?
This is a distributed simulation on a single machine, with multiple Hadoop daemons. Often used as a development environment.
Describe the Hadoop Operation Mode: Fully Distributed
Involves 2 or more machines.
What does MapReduce's data parallel programming model hide?
The complexity of distribution and fault tolerance.
What are the principal philosophies of MapReduce?
1. Make it scale, so you can throw hardware at problems. 2. Make it cheap, saving hardware, programmer, and admin costs (but requiring fault tolerance).
Despite MapReduce not being suitable for all problems, what is an advantage of MapReduce when it works?
It saves a lot of time.
